{ "cells": [ { "cell_type": "markdown", "id": "486628c5-332f-4ccf-a564-ac9c2cdefc8c", "metadata": {}, "source": [ "# Altair\n", "\n", "Altair is a library for creating basic data visualizations. It provides an easy-to-understand interface for some of the most common graph types.\n", "\n", "```{seealso}\n", "The [Altair user guide](https://altair-viz.github.io/user_guide/data.html) has lots of detailed information about all the things you can do with the library.\n", "```\n", "\n", "To begin, you'll need to import both `pandas` and `altair`. For consistency, you can import the same `mpg` dataset that we used in the previous chapter." ] }, { "cell_type": "code", "execution_count": 1, "id": "5084d7d1-a8bf-43fd-9840-81d1d828e0da", "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " manufacturer model displ year cyl trans drv cty hwy fl \n", "0 audi a4 1.8 1999 4 auto(l5) f 18 29 p \\\n", "1 audi a4 1.8 1999 4 manual(m5) f 21 29 p \n", "2 audi a4 2.0 2008 4 manual(m6) f 20 31 p \n", "3 audi a4 2.0 2008 4 auto(av) f 21 30 p \n", "4 audi a4 2.8 1999 6 auto(l5) f 16 26 p \n", ".. ... ... ... ... ... ... .. ... ... .. \n", "229 volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p \n", "230 volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p \n", "231 volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p \n", "232 volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p \n", "233 volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p \n", "\n", " class \n", "0 compact \n", "1 compact \n", "2 compact \n", "3 compact \n", "4 compact \n", ".. ... \n", "229 midsize \n", "230 midsize \n", "231 midsize \n", "232 midsize \n", "233 midsize \n", "\n", "[234 rows x 11 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import altair as alt\n", "\n", "mpg = pd.read_csv(\"../data/mpg.csv\")\n", "mpg" ] }, { "cell_type": "markdown", "id": "4da83df9-b388-48c8-9ccd-d4a0c0e89262", "metadata": {}, "source": [ "Altair code follows the model of the [Grammar of Graphics](https://data.europa.eu/apps/data-visualisation-guide/foundation-of-the-grammar-of-graphics). You choose variable names (surrounded by quotes) to map to the x- and y-axis of your graph, and you can also map variables to things like `Color` and `Column`. You also set the `Chart` object to refer to the DataFrame you're working with.\n", "\n", "To get different kinds of visualizations, you choose from different `Marks`, which determine how your data will be displayed visually. In this tutorial you'll learn some basic examples.\n", "\n", "## Category Plots\n", "\n", "Categorical plots let you compare groups according to categorical variables. 
A standard category plot is the bar plot, which usually compares means of different groups. In Altair, we can assign our variables to the X- and Y-axes with one categorical (nominal) and one numerical (quantitative) variable, take the mean (average) of our quantitative variable, and draw with `mark_bar()`." ] }, { "cell_type": "code", "execution_count": 2, "id": "e24c49ec-cc27-4338-91f7-980c7c4a430d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(mpg, title=\"Fuel Efficiency of Drive Trains\").mark_bar().encode(\n", " x=alt.X('drv:N').title(\"Drive train\"),\n", " y=alt.Y('average(hwy):Q').title(\"Miles per gallon highway\"),\n", ")" ] }, { "cell_type": "markdown", "id": "25eef8a5-2317-444c-af3c-485f73308355", "metadata": {}, "source": [ "Above is the code for our bar plot. We could do lots of customization from here, but this is what it will look like by default. Note that we use the `average()` aggregate function to get the mean of our `hwy` variable, and we assign everything a label using `title`.\n", "\n", "You can similarly create a box plot to compare medians and distributions among groups instead. You can use the `mark_boxplot()` function, and this time you don't need to transform any of the variables." ] }, { "cell_type": "code", "execution_count": 3, "id": "8e9ecf9f-60ab-4a91-9329-c4e25dba63d5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(mpg, title=\"Fuel Efficiency of Drive Trains\").mark_boxplot().encode(\n", " x=alt.X('drv:N').title(\"Drive train\"),\n", " y=alt.Y('hwy:Q').title(\"Miles per gallon highway\"),\n", ")" ] }, { "cell_type": "markdown", "id": "ba9289e4-2659-41f1-a7a9-2770b7e1b777", "metadata": {}, "source": [ "## Distribution Plots\n", "\n", "Distribution plots show frequencies of particular variables. Distribution plots with just one variable are histograms, which require \"binning\" numeric variables. The Y-axis in a histogram is always a count." ] }, { "cell_type": "code", "execution_count": 4, "id": "f0315229-53d7-4a21-b2b6-bd9b6b7ca34c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(mpg, title=\"Distribution of City Fuel Efficiency\").mark_bar().encode(\n", " x=alt.X('cty:Q').bin().title('Miles per gallon city'),\n", " y='count()',\n", ")" ] }, { "cell_type": "markdown", "id": "ae1d07d0-104e-4e7e-8b01-99f1a206dbac", "metadata": {}, "source": [ "Notice that you used the `bin()` function on the X variable above. You can make the same histogram into a density plot using the `transform_density()` function." ] }, { "cell_type": "code", "execution_count": 5, "id": "e8b8c298-06b0-4f57-b233-003949d40455", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(mpg, title=\"Distribution of City Fuel Efficiency\").transform_density(\n", " 'cty',\n", " as_=['cty', 'density'],\n", ").mark_area().encode(\n", " x=alt.X('cty:Q').title('Miles per gallon city'),\n", " y=alt.Y('density:Q').title('Density'),\n", ")" ] }, { "cell_type": "markdown", "id": "76398b54-54eb-4792-a6f5-fb866185ae27", "metadata": {}, "source": [ "Distribution plots with two variables create heatmaps. For this one you'll need `mark_rect()` to create the heatmap's boxes. You'll also use a `Color` encoding to add a color scale to the boxes. Both variables need to be binned." ] }, { "cell_type": "code", "execution_count": 6, "id": "bad0b366-0519-49a8-8e87-9e2e4b078e9d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(mpg, title=\"City Fuel Efficiency Related to Engine Displacement\").mark_rect().encode(\n", " x=alt.X('displ:Q').bin().title('Engine displacement (liters)'),\n", " y=alt.Y('cty:Q').bin().title('Miles per gallon city'),\n", " color=alt.Color('count():Q').scale(scheme='greenblue')\n", ")" ] }, { "cell_type": "markdown", "id": "e7146319-bf6d-4ccf-a592-44efb2226246", "metadata": {}, "source": [ "## Relationship Plots\n", "\n", "To show a correlation or regression between two variables, use a simple scatterplot. In Altair, you draw a scatterplot's points with `mark_point()`. Scatterplots take two numerical (quantitative) variables." ] }, { "cell_type": "code", "execution_count": 7, "id": "fb3d2787-ddea-4506-b8b7-1b3fb7dc21fd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(mpg, title=\"Engine Displacement and Fuel Efficiency\").mark_point().encode(\n", " x=alt.X('displ:Q').title(\"Engine displacement (liters)\"),\n", " y=alt.Y('cty:Q').title(\"Miles per gallon city\"),\n", ")" ] }, { "cell_type": "markdown", "id": "4ba6ac25-2445-4403-a548-e2ed2e0ff187", "metadata": {}, "source": [ "You can distinguish groups of points by color with a `Color` encoding." ] }, { "cell_type": "code", "execution_count": 8, "id": "a69e1832-a38c-48d2-baab-881f7d38e774", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(mpg, title=\"Engine Displacement and Fuel Efficiency\").mark_point().encode(\n", " x=alt.X('displ:Q').title(\"Engine displacement (liters)\"),\n", " y=alt.Y('cty:Q').title(\"Miles per gallon city\"),\n", " color=alt.Color('drv:N').title(\"Drive train\"),\n", ")" ] }, { "cell_type": "markdown", "id": "d2313847-a02c-47e6-8d10-b381a4c05dbe", "metadata": {}, "source": [ "Line plots are also a kind of relationship plot. Line plots are often used with time variables, but the mpg dataset only includes two years. To make the trend easier to see, we'll use Vega's similar `cars` dataset. Note that you must use an aggregate function to average the fuel efficiency by year, like you did for the bar plot." ] }, { "cell_type": "code", "execution_count": 27, "id": "da2b483d-5b09-4323-85c5-cce61e74fc37", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from vega_datasets import data\n", "cars = data.cars()\n", "\n", "alt.Chart(cars, title=\"Model Year and Fuel Efficiency\").mark_line().encode(\n", " x=alt.X('Year:T').title(\"Model Year\"),\n", " y=alt.Y('average(Miles_per_Gallon):Q').title(\"Fuel Efficiency (miles per gallon)\"),\n", " color=alt.Color('Origin:N').title('Place of origin')\n", ")" ] }, { "cell_type": "markdown", "id": "3af2aeda-647d-4f74-a1b7-70875954d2a9", "metadata": {}, "source": [ "You can add a regression line to a scatter plot with the `transform_regression()` function. This is also called a line of best fit. You must first save the chart as a variable and then \"add\" the regression line to it." ] }, { "cell_type": "code", "execution_count": 28, "id": "19aa2aec-b07c-4afa-a97b-86661cb1fcc7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.LayerChart(...)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chart = alt.Chart(mpg, title=\"Engine Displacement and Fuel Efficiency\").mark_point().encode(\n", " x=alt.X('displ:Q').title(\"Engine displacement (liters)\"),\n", " y=alt.Y('cty:Q').title(\"Miles per gallon city\"),\n", ")\n", "\n", "chart + chart.transform_regression('displ', 'cty').mark_line()" ] }, { "cell_type": "markdown", "id": "2021e80e-b8f8-4b9b-a07c-db3b94ae74ab", "metadata": {}, "source": [ "## Faceting\n", "\n", "It sometimes makes sense to split data into separate graphs by category. The easiest way to do this is with the `Column` encoding." ] }, { "cell_type": "code", "execution_count": 29, "id": "1fd73df9-2774-49b2-9d98-69214f011866", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(mpg, title=\"Engine Displacement and Fuel Efficiency\").mark_point().encode(\n", " x=alt.X('displ:Q').title(\"Engine displacement (liters)\"),\n", " y=alt.Y('cty:Q').title(\"Miles per gallon city\"),\n", " color=alt.Color('drv:N').title(\"Drive train\"),\n", " column=alt.Column('drv:N').title(\"Drive train\"),\n", ")" ] }, { "cell_type": "markdown", "id": "95509ce6-9899-43c8-bd52-f2e7e07119d0", "metadata": {}, "source": [ "## Interactivity\n", "\n", "Sometimes it is useful to create plots that your reader can interact with directly, and Altair provides some simple functions for this. In scatterplots, you can add a `Tooltip` encoding to show the car's manufacturer when you hover over a point. You can also add the `interactive()` function to the end of the scatterplot code to enable scroll-to-zoom and click-and-drag features.\n", "\n", "```{seealso}\n", "This is only scratching the surface of what's possible with Altair interactivity. There's much, much more in the [Altair documentation](https://altair-viz.github.io/user_guide/interactions.html#).\n", "```" ] }, { "cell_type": "code", "execution_count": 33, "id": "6d9f2eb6-090b-4cb7-a486-b84f8e7c46a9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(mpg, title=\"Engine Displacement and Fuel Efficiency\").mark_point().encode(\n", " x=alt.X('displ:Q').title(\"Engine displacement (liters)\"),\n", " y=alt.Y('cty:Q').title(\"Miles per gallon city\"),\n", " color=alt.Color('drv:N').title(\"Drive train\"),\n", " tooltip=alt.Tooltip('manufacturer:N')\n", ").interactive()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 5 }