bokeh

Author

anthropic claude-3-5-sonnet-latest

Published

January 6, 2025

Question

How can we create a working scatter plot matrix (SPLOM) of the iris dataset using Bokeh?

Overview

We’ll create an interactive scatter plot matrix visualization of the iris dataset using Bokeh, with correct color mapping for different species.

Note: The dark theme helper for Bokeh is incomplete, possibly because Bokeh’s theming options are sparsely documented.
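As a rough illustration of what such theme helpers might look like, the sketch below switches between two of Bokeh’s built-in themes through curdoc(). The function bodies are hypothetical stand-ins, not the helpers used on this page, and the choice of the "light_minimal" and "dark_minimal" themes is an assumption.

```python
from bokeh.io import curdoc


def light_theme():
    """Hypothetical helper: apply Bokeh's built-in light theme to the current document."""
    curdoc().theme = "light_minimal"


def dark_theme():
    """Hypothetical helper: apply Bokeh's built-in dark theme to the current document."""
    curdoc().theme = "dark_minimal"
```

Built-in themes set backgrounds, grid lines, and text colors, but they do not recolor glyphs that were given explicit colors, which may be one reason a dark theme needs extra work.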

Code

Bokeh tends to emit extra outputs when a plot is rendered. Quarto partly cleans this up, but the second plot below currently does not render correctly:

light_theme()
show(grid)

dark_theme()
show(grid)
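For reference, here is a minimal sketch of how a grid like the one passed to show above could be assembled. It assumes the data comes from scikit-learn’s load_iris(as_frame=True); the species column, the palette, the plot size, and the tool list are illustrative choices, not taken from this page.

```python
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure
from bokeh.transform import factor_cmap
from sklearn.datasets import load_iris

# Load iris as a DataFrame and add a readable species column (assumed name).
iris = load_iris(as_frame=True)
df = iris.frame.copy()
species = [str(name) for name in iris.target_names]
df["species"] = df["target"].map(dict(enumerate(species)))
features = list(iris.feature_names)

# One shared data source, so selections are linked across all panels.
source = ColumnDataSource(df)
colors = factor_cmap(
    "species",
    palette=["#1f77b4", "#ff7f0e", "#2ca02c"],  # assumed palette
    factors=species,
)

# Build one scatter panel per (row feature, column feature) pair.
rows = []
for y in features:
    row = []
    for x in features:
        p = figure(width=200, height=200, tools="pan,wheel_zoom,box_select,reset")
        p.scatter(x, y, source=source, color=colors, size=4, alpha=0.7)
        p.xaxis.axis_label = x
        p.yaxis.axis_label = y
        row.append(p)
    rows.append(row)

grid = gridplot(rows)
```

Because every panel reads from the same ColumnDataSource, a box selection in one panel highlights the same flowers in all the others.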

Explanation

This code creates a violin plot of the sepal length distribution for each species in the Iris dataset using Bokeh. Here’s a breakdown of what the code does:

  1. We start by importing the necessary libraries, including Pandas for data manipulation, NumPy for numerical operations, and various Bokeh modules for plotting.

  2. We load the Iris dataset using scikit-learn’s load_iris() function and convert it to a Pandas DataFrame for easy manipulation.

  3. We prepare the data for the violin plot by defining the categories (iris species) and choosing a color palette.

  4. We create a Bokeh figure with appropriate titles and labels.

  5. For each iris species, we:

    • Subset the data for that species.
    • Estimate the density with NumPy’s histogram function. This is a binned approximation rather than a true kernel density estimate (KDE).
    • Scale the density estimate to create the violin shape.
    • Add the violin shape to the plot using Bokeh’s patch method, creating a symmetrical violin by mirroring the shape.

  6. We customize the plot by removing the x-axis grid, setting the y-axis range, and adding axis labels.

  7. Finally, we display the plot using Bokeh’s show function.
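The shape construction in step 5 can be sketched with NumPy alone. The helper name violin_outline, its parameters, and the numeric center position are assumptions made for illustration, and the sample values are synthetic rather than the iris measurements.

```python
import numpy as np


def violin_outline(values, center, bins=20, half_width=0.4):
    """Return (xs, ys) for a closed, mirrored outline built from a normalized histogram."""
    density, edges = np.histogram(values, bins=bins, density=True)
    mids = (edges[:-1] + edges[1:]) / 2
    # Scale so the widest point of the violin reaches half_width on each side.
    scaled = density / density.max() * half_width
    # Left side bottom to top, then right side top to bottom, to close the polygon.
    xs = np.concatenate([center - scaled, (center + scaled)[::-1]])
    ys = np.concatenate([mids, mids[::-1]])
    return xs, ys


# Synthetic values for illustration only, not the iris sepal lengths.
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.8, scale=0.8, size=150)
xs, ys = violin_outline(sample, center=1.0)
```

The resulting xs and ys can be passed to a Bokeh figure’s patch method, once per species, with a different center for each.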

The resulting violin plot shows the distribution of sepal lengths for each iris species. The width of each “violin” is proportional to the estimated density of data points at that y-value, giving a clear picture of the data distribution. This lets us compare not just the central tendency of each species’ sepal length, but also the spread and shape of the distributions.

This visualization can help us identify differences between the species. For example, we might see that one species has a broader distribution of sepal lengths, while another has a more concentrated distribution. We might also observe multimodal distributions or other interesting patterns that wouldn’t be apparent from simple summary statistics.