Altair

Altair is a Python library for creating basic data visualizations. It provides an easy-to-understand interface for some of the most common graph types.

See also

The Altair user guide has lots of detailed information about all the things you can do with the library.

To begin, you’ll need to import both pandas and altair. For consistency, you can load the same mpg dataset that we used in the previous chapter.

import pandas as pd
import numpy as np
import altair as alt

mpg = pd.read_csv("../data/mpg.csv")
mpg
     manufacturer   model  displ  year  cyl       trans  drv  cty  hwy   fl    class
  0          audi      a4    1.8  1999    4    auto(l5)    f   18   29    p  compact
  1          audi      a4    1.8  1999    4  manual(m5)    f   21   29    p  compact
  2          audi      a4    2.0  2008    4  manual(m6)    f   20   31    p  compact
  3          audi      a4    2.0  2008    4    auto(av)    f   21   30    p  compact
  4          audi      a4    2.8  1999    6    auto(l5)    f   16   26    p  compact
...           ...     ...    ...   ...  ...         ...  ...  ...  ...  ...      ...
229    volkswagen  passat    2.0  2008    4    auto(s6)    f   19   28    p  midsize
230    volkswagen  passat    2.0  2008    4  manual(m6)    f   21   29    p  midsize
231    volkswagen  passat    2.8  1999    6    auto(l5)    f   16   26    p  midsize
232    volkswagen  passat    2.8  1999    6  manual(m5)    f   18   26    p  midsize
233    volkswagen  passat    3.6  2008    6    auto(s6)    f   17   26    p  midsize

234 rows × 11 columns

Altair code follows the model of the Grammar of Graphics. You pass the DataFrame you’re working with to the Chart object, then choose column names (surrounded by quotes) to map to the x- and y-axes of your graph. You can also map columns to other encoding channels, such as Color and Column.

To get different kinds of visualizations, you choose from different Marks, which determine how your data will be displayed visually. In this tutorial, you’ll work through some basic examples.

Category Plots

Categorical plots let you compare groups according to categorical variables. A standard category plot is the bar plot, which usually compares means of different groups. In Altair, we can assign our variables to the X- and Y-axes with one categorical (nominal) and one numerical (quantitative) variable, take the mean (average) of our quantitative variable, and draw with mark_bar().

alt.Chart(mpg, title="Fuel Efficiency of Drive Trains").mark_bar().encode(
    x=alt.X('drv:N').title("Drive train"),
    y=alt.Y('average(hwy):Q').title("Miles per gallon highway"),
)

Above is the code for our bar plot. We could customize it much further, but this is what it looks like by default. Note that we use the average() aggregate function to get the mean of our hwy variable, and we label everything using title.
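To see what the average() aggregate computes, it can help to reproduce the same numbers in pandas. This is a sketch under stated assumptions: `sample` holds only the ten rows visible in the mpg preview, and it groups by manufacturer rather than drv because every previewed row has the same drive train.

```python
import pandas as pd

# Ten rows copied from the mpg preview shown earlier (a subset, not the full data).
sample = pd.DataFrame({
    "manufacturer": ["audi"] * 5 + ["volkswagen"] * 5,
    "hwy": [29, 29, 31, 30, 26, 28, 29, 26, 26, 26],
})

# One mean per group: the value that sets the height of each bar.
means = sample.groupby("manufacturer")["hwy"].mean()
print(means)
# audi          29.0
# volkswagen    27.0
```

Altair performs this aggregation for you when the chart is rendered, so the DataFrame itself does not need to be summarized first.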

You can similarly create a box plot to compare medians and distributions among groups instead. Use the mark_boxplot() method; this time you don’t need to aggregate the variable yourself.

alt.Chart(mpg, title="Fuel Efficiency of Drive Trains").mark_boxplot().encode(
    x=alt.X('drv:N').title("Drive train"),
    y=alt.Y('hwy:Q').title("Miles per gallon highway"),
)
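A box plot summarizes each group by its quartiles. As a rough illustration of those summary statistics, the sketch below computes them with pandas on an assumed ten-row `sample` copied from the mpg preview. The exact quartile method used by the rendered chart may differ slightly from pandas' default interpolation.

```python
import pandas as pd

# Ten rows copied from the mpg preview shown earlier (a subset, not the full data).
sample = pd.DataFrame({
    "manufacturer": ["audi"] * 5 + ["volkswagen"] * 5,
    "hwy": [29, 29, 31, 30, 26, 28, 29, 26, 26, 26],
})

# Lower quartile, median, and upper quartile of hwy for each group.
summary = (
    sample.groupby("manufacturer")["hwy"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()  # one row per group, one column per quantile
)
print(summary)
```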

Distribution Plots

Distribution plots show how often the values of a variable occur. A distribution plot of a single variable is a histogram, which requires “binning” the numeric variable into intervals. The y-axis of the histogram is then a count of the records in each bin.

alt.Chart(mpg, title="Distribution of City Fuel Efficiency").mark_bar().encode(
    x=alt.X('cty:Q').bin().title('Miles per gallon city'),
    y='count()',
)

Notice that you used the bin() method on the X variable above. You can turn the same histogram into a density plot using the transform_density() method, which estimates a smooth density curve instead of counting records per bin.

alt.Chart(mpg, title="Distribution of City Fuel Efficiency").transform_density(
    'cty',
    as_=['cty', 'density'],
).mark_area().encode(
    x=alt.X('cty:Q').title('Miles per gallon city'),
    y=alt.Y('density:Q').title('Density'),
)

Distribution plots with two variables create heatmaps. For this one you’ll need mark_rect() to create the heatmap’s boxes. You’ll also use a Color encoding to add a color scale to the boxes. Both variables need to be binned.

alt.Chart(mpg, title="City Fuel Efficiency Related to Engine Displacement").mark_rect().encode(
    x=alt.X('displ:Q').bin().title('Engine displacement (liters)'),
    y=alt.Y('cty:Q').bin().title('Miles per gallon city'),
    color=alt.Color('count():Q').scale(scheme='greenblue')
)

Relationship Plots

To show a correlation or regression between two variables, use a simple scatterplot. In Altair, you draw a scatterplot’s points with mark_point(). Scatterplots take two numerical (quantitative) variables.

alt.Chart(mpg, title="Engine Displacement and Fuel Efficiency").mark_point().encode(
    x=alt.X('displ:Q').title("Engine displacement (liters)"),
    y=alt.Y('cty:Q').title("Miles per gallon city"),
)

You can distinguish groups of points by color with a Color encoding.

alt.Chart(mpg, title="Engine Displacement and Fuel Efficiency").mark_point().encode(
    x=alt.X('displ:Q').title("Engine displacement (liters)"),
    y=alt.Y('cty:Q').title("Miles per gallon city"),
    color=alt.Color('drv:N').title("Drive train"),
)

Line plots are also a kind of relationship plot. They are often used with time variables, but the mpg dataset only includes two years. To make the trend easier to see, we’ll use the similar cars dataset from the vega_datasets package. Note that you must use an aggregate function to average the fuel efficiency by year, as you did for the bar plot.

from vega_datasets import data
cars = data.cars()

alt.Chart(cars, title="Model Year and Fuel Efficiency").mark_line().encode(
    x=alt.X('Year:T').title("Model Year"),
    y=alt.Y('average(Miles_per_Gallon):Q').title("Fuel Efficiency (miles per gallon)"),
    color=alt.Color('Origin:N').title('Place of origin')
)

You can add a regression line, also called a line of best fit, to a scatter plot with the transform_regression() method. First save the chart as a variable, then layer the regression line on top of it with the + operator.

chart = alt.Chart(mpg, title="Engine Displacement and Fuel Efficiency").mark_point().encode(
    x=alt.X('displ:Q').title("Engine displacement (liters)"),
    y=alt.Y('cty:Q').title("Miles per gallon city"),
)

chart + chart.transform_regression('displ', 'cty').mark_line()

Faceting

It sometimes makes sense to split data into separate graphs by category. The easiest way to do this is with the Column encoding.

alt.Chart(mpg, title="Engine Displacement and Fuel Efficiency").mark_point().encode(
    x=alt.X('displ:Q').title("Engine displacement (liters)"),
    y=alt.Y('cty:Q').title("Miles per gallon city"),
    color=alt.Color('drv:N').title("Drive train"),
    column=alt.Column('drv:N').title("Drive train"),
)

Interactivity

Sometimes it is useful to create plots that your reader can interact with directly, and Altair provides some simple tools for this. In scatterplots, you can add a Tooltip encoding to show the manufacturer of a car when you hover over its point. You can also add the interactive() method to the end of the scatterplot code to enable scroll-to-zoom and click-and-drag panning.

See also

This is only scratching the surface of what’s possible with Altair interactivity. There’s much, much more in the Altair documentation.

alt.Chart(mpg, title="Engine Displacement and Fuel Efficiency").mark_point().encode(
    x=alt.X('displ:Q').title("Engine displacement (liters)"),
    y=alt.Y('cty:Q').title("Miles per gallon city"),
    color=alt.Color('drv:N').title("Drive train"),
    tooltip=alt.Tooltip('manufacturer:N')
).interactive()