Altair

Altair is a Python library for creating basic data visualizations. It provides an easy-to-understand interface for some of the most common graph types.

Category Plots

Categorical plots let you compare groups defined by a categorical variable. A standard categorical plot is the bar plot, which usually compares the means of different groups. In Altair, we assign one categorical (nominal) variable and one numerical (quantitative) variable to the X- and Y-axes, take the mean (average) of the quantitative variable, and draw the bars with mark_bar().

alt.Chart(mpg, title="Fuel Efficiency of Drive Trains").mark_bar().encode(
    x=alt.X('drv:N').title("Drive train"),
    y=alt.Y('average(hwy):Q').title("Miles per gallon highway"),
)

Above is the code for our bar plot. We could customize it much further, but this is what it looks like by default. Note that we use the average() aggregate function to get the mean of our hwy variable, and we label the chart and each axis using title.

You can similarly create a box plot to compare medians and distributions among groups instead. You can use the mark_boxplot() function, and this time you don’t need to transform any of the variables.

alt.Chart(mpg, title="Fuel Efficiency of Drive Trains").mark_boxplot().encode(
    x=alt.X('drv:N').title("Drive train"),
    y=alt.Y('hwy:Q').title("Miles per gallon highway"),
)
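To make the difference between the two plots concrete, here is a small pandas sketch of the summaries they draw: the bar plot shows a mean per group, while the box plot is built around the median and quartiles. It groups by model rather than drv because the ten rows available in the table preview on this page all have drv = f. This is only an illustration; the chart's own aggregation is carried out by Vega-Lite when the chart renders, not by pandas.

```python
import pandas as pd

# Ten rows copied from the mpg table preview (model and hwy columns only).
sample = pd.DataFrame({
    "model": ["a4"] * 5 + ["passat"] * 5,
    "hwy": [29, 29, 31, 30, 26, 28, 29, 26, 26, 26],
})

# Mean (bar plot height) and median (box plot center line) per group.
summary = sample.groupby("model")["hwy"].agg(["mean", "median"])
print(summary)
```

For the passat rows the mean (27.0) sits above the median (26.0), the kind of difference a box plot makes visible and a bar of means hides.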

Distribution Plots

Distribution plots show how often particular values of a variable occur. A distribution plot of a single numeric variable is a histogram, which requires “binning” the variable into intervals. The Y-axis of a histogram usually shows the count of records in each bin.

alt.Chart(mpg, title="Distribution of City Fuel Efficiency").mark_bar().encode(
    x=alt.X('cty:Q').bin().title('Miles per gallon city'),
    y='count()',
)

Notice that you used the bin() function on the X variable above. You can make the same histogram into a density plot using the transform_density() function.

alt.Chart(mpg, title="Distribution of City Fuel Efficiency").transform_density(
    'cty',
    as_=['cty', 'density'],
).mark_area().encode(
    x=alt.X('cty:Q').title('Miles per gallon city'),
    y=alt.Y('density:Q').title('Density'),
)

A distribution plot of two variables is a heatmap. For this one you’ll need mark_rect() to create the heatmap’s boxes. You’ll also use a Color encoding to map the count of records in each box to a color scale. Both variables need to be binned.

alt.Chart(mpg, title="City Fuel Efficiency Related to Engine Displacement").mark_rect().encode(
    x=alt.X('displ:Q').bin().title('Engine displacement (liters)'),
    y=alt.Y('cty:Q').bin().title('Miles per gallon city'),
    color=alt.Color('count():Q').scale(scheme='greenblue')
)

Relationship Plots

To show a correlation or regression between two variables, use a simple scatterplot. In Altair, you draw a scatterplot’s points with mark_point(). Scatterplots take two numerical (quantitative) variables.

alt.Chart(mpg, title="Engine Displacement and Fuel Efficiency").mark_point().encode(
    x=alt.X('displ:Q').title("Engine displacement (liters)"),
    y=alt.Y('cty:Q').title("Miles per gallon city"),
)

You can distinguish groups within the scatterplot by adding a Color encoding.

alt.Chart(mpg, title="Engine Displacement and Fuel Efficiency").mark_point().encode(
    x=alt.X('displ:Q').title("Engine displacement (liters)"),
    y=alt.Y('cty:Q').title("Miles per gallon city"),
    color=alt.Color('drv:N').title("Drive train"),
)

Line plots are also a kind of relationship plot. Line plots are often used with time variables, but the mpg dataset only includes two years. To make the trend easier to see, we’ll use the similar cars dataset from the vega_datasets package. Note that you must use an aggregate function to average the fuel efficiency by year, like you did for the bar plot.

from vega_datasets import data
cars = data.cars()

alt.Chart(cars, title="Model Year and Fuel Efficiency").mark_line().encode(
    x=alt.X('Year:T').title("Model Year"),
    y=alt.Y('average(Miles_per_Gallon):Q').title("Fuel Efficiency (miles per gallon)"),
    color=alt.Color('Origin:N').title('Place of origin')
)

You can add a regression line, also called a line of best fit, to a scatterplot with the transform_regression() function. One way to do this is to save the chart as a variable and then “add” the regression line to it with the + operator.

chart = alt.Chart(mpg, title="Engine Displacement and Fuel Efficiency").mark_point().encode(
    x=alt.X('displ:Q').title("Engine displacement (liters)"),
    y=alt.Y('cty:Q').title("Miles per gallon city"),
)

chart + chart.transform_regression('displ', 'cty').mark_line()

Faceting

It sometimes makes sense to split data into separate graphs by category. The easiest way to do this is with the Column encoding.

alt.Chart(mpg, title="Engine Displacement and Fuel Efficiency").mark_point().encode(
    x=alt.X('displ:Q').title("Engine displacement (liters)"),
    y=alt.Y('cty:Q').title("Miles per gallon city"),
    color=alt.Color('drv:N').title("Drive train"),
    column=alt.Column('drv:N').title("Drive train"),
)

Interactivity

Sometimes it is useful to create plots that your reader can interact with directly, and Altair provides some simple methods for this. In scatterplots, you can add a Tooltip encoding to show the manufacturer of a car when you hover over a point. You can also add the interactive() method to the end of the scatterplot code to enable scroll-to-zoom and click-and-drag panning.

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
3	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
4	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
...	...	...	...	...	...	...	...	...	...	...	...
229	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
230	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize
231	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize
232	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize
233	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize