11 Pandas

Pandas is a library for storing and manipulating tabular data, or data stored in rows and columns like a spreadsheet. Pandas is a huge library with many different functions and methods, so what follows is a brief introduction to the most important functions for data management.

Note

If you encounter any part of Pandas out in the wild that you don’t see here, you can always refer to the Pandas documentation.

11.1 DataFrames and Series

Instead of normal Python lists and dictionaries, Pandas stores data in its own specialized objects. The main one is a DataFrame, which is a lot like a spreadsheet with rows and columns.

You can create a DataFrame directly with the DataFrame() class in Pandas, but it’s more likely that you’ll read in a DataFrame from a CSV or spreadsheet file. First you must import the library, and it’s a good idea to import the numpy library as well.

Tip

Numpy is a Python library for efficiently handling arrays and matrices of numbers. Pandas uses it under the hood to run quickly. You usually won’t need to use it directly, but it’s good to have it installed to avoid any mysterious errors.

import pandas as pd
import numpy as np

Now you can use the read_csv() function to read in a comma-separated value (CSV) spreadsheet file. You must put the name of this file in quotes, and the file should be in the same directory as your Jupyter notebook (or else you should include a full path). The read_csv() function will also accept a URL that points to a CSV file online.

For this example, we’ll use the file mpg.csv which comes from R’s ggplot2 library.

mpg = pd.read_csv("../data/mpg.csv")
mpg

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
3	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
4	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
...	...	...	...	...	...	...	...	...	...	...	...
229	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
230	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize
231	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize
232	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize
233	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize

234 rows × 11 columns

Warning

Jupyter nicely formats DataFrames as tables when you type the name of a variable containing a DataFrame. But if you use the print() function, it won’t display as well.

You can get basic information about your DataFrames columns using the .info() method.

mpg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 234 entries, 0 to 233
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 20.2+ KB

A Series is a lot like a Python list, and each column of a DataFrame is a Series. You can access the columns of a Dataframe with dot notation.

mpg.model

0          a4
1          a4
2          a4
3          a4
4          a4
        ...  
229    passat
230    passat
231    passat
232    passat
233    passat
Name: model, Length: 234, dtype: object

You can also turn a list into a Series with the Series() class.

myseries = pd.Series([5, 6, 7, 8])
myseries

0    5
1    6
2    7
3    8
dtype: int64

11.2 Selecting Rows and Columns

Once you have a DataFrame, you’ll typically want to filter and select different rows or columns.

To filter specific rows, Pandas uses a bracket notation. It takes conditional statements that are similar to Python conditions.

# Get cars with fewer than 6 cylinders
four_cylinders = mpg[mpg.cyl < 6]
four_cylinders

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
3	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
7	audi	a4 quattro	1.8	1999	4	manual(m5)	4	18	26	p	compact
...	...	...	...	...	...	...	...	...	...	...	...
226	volkswagen	new beetle	2.5	2008	5	auto(s6)	f	20	29	r	subcompact
227	volkswagen	passat	1.8	1999	4	manual(m5)	f	21	29	p	midsize
228	volkswagen	passat	1.8	1999	4	auto(l5)	f	18	29	p	midsize
229	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
230	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize

85 rows × 11 columns

You can also use the operators & (and), | (or), and ! (not) to combine conditional filters.

# Get Volkswagens and Fords
vw_ford = mpg[(mpg.manufacturer == 'volkswagen') | (mpg.manufacturer == 'ford')]
vw_ford

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
74	ford	expedition 2wd	4.6	1999	8	auto(l4)	r	11	17	r	suv
75	ford	expedition 2wd	5.4	1999	8	auto(l4)	r	11	17	r	suv
76	ford	expedition 2wd	5.4	2008	8	auto(l6)	r	12	18	r	suv
77	ford	explorer 4wd	4.0	1999	6	auto(l5)	4	14	17	r	suv
78	ford	explorer 4wd	4.0	1999	6	manual(m5)	4	15	19	r	suv
79	ford	explorer 4wd	4.0	1999	6	auto(l5)	4	14	17	r	suv
80	ford	explorer 4wd	4.0	2008	6	auto(l5)	4	13	19	r	suv
81	ford	explorer 4wd	4.6	2008	8	auto(l6)	4	13	19	r	suv
82	ford	explorer 4wd	5.0	1999	8	auto(l4)	4	13	17	r	suv
83	ford	f150 pickup 4wd	4.2	1999	6	auto(l4)	4	14	17	r	pickup
84	ford	f150 pickup 4wd	4.2	1999	6	manual(m5)	4	14	17	r	pickup
85	ford	f150 pickup 4wd	4.6	1999	8	manual(m5)	4	13	16	r	pickup
86	ford	f150 pickup 4wd	4.6	1999	8	auto(l4)	4	13	16	r	pickup
87	ford	f150 pickup 4wd	4.6	2008	8	auto(l4)	4	13	17	r	pickup
88	ford	f150 pickup 4wd	5.4	1999	8	auto(l4)	4	11	15	r	pickup
89	ford	f150 pickup 4wd	5.4	2008	8	auto(l4)	4	13	17	r	pickup
90	ford	mustang	3.8	1999	6	manual(m5)	r	18	26	r	subcompact
91	ford	mustang	3.8	1999	6	auto(l4)	r	18	25	r	subcompact
92	ford	mustang	4.0	2008	6	manual(m5)	r	17	26	r	subcompact
93	ford	mustang	4.0	2008	6	auto(l5)	r	16	24	r	subcompact
94	ford	mustang	4.6	1999	8	auto(l4)	r	15	21	r	subcompact
95	ford	mustang	4.6	1999	8	manual(m5)	r	15	22	r	subcompact
96	ford	mustang	4.6	2008	8	manual(m5)	r	15	23	r	subcompact
97	ford	mustang	4.6	2008	8	auto(l5)	r	15	22	r	subcompact
98	ford	mustang	5.4	2008	8	manual(m6)	r	14	20	p	subcompact
207	volkswagen	gti	2.0	1999	4	manual(m5)	f	21	29	r	compact
208	volkswagen	gti	2.0	1999	4	auto(l4)	f	19	26	r	compact
209	volkswagen	gti	2.0	2008	4	manual(m6)	f	21	29	p	compact
210	volkswagen	gti	2.0	2008	4	auto(s6)	f	22	29	p	compact
211	volkswagen	gti	2.8	1999	6	manual(m5)	f	17	24	r	compact
212	volkswagen	jetta	1.9	1999	4	manual(m5)	f	33	44	d	compact
213	volkswagen	jetta	2.0	1999	4	manual(m5)	f	21	29	r	compact
214	volkswagen	jetta	2.0	1999	4	auto(l4)	f	19	26	r	compact
215	volkswagen	jetta	2.0	2008	4	auto(s6)	f	22	29	p	compact
216	volkswagen	jetta	2.0	2008	4	manual(m6)	f	21	29	p	compact
217	volkswagen	jetta	2.5	2008	5	auto(s6)	f	21	29	r	compact
218	volkswagen	jetta	2.5	2008	5	manual(m5)	f	21	29	r	compact
219	volkswagen	jetta	2.8	1999	6	auto(l4)	f	16	23	r	compact
220	volkswagen	jetta	2.8	1999	6	manual(m5)	f	17	24	r	compact
221	volkswagen	new beetle	1.9	1999	4	manual(m5)	f	35	44	d	subcompact
222	volkswagen	new beetle	1.9	1999	4	auto(l4)	f	29	41	d	subcompact
223	volkswagen	new beetle	2.0	1999	4	manual(m5)	f	21	29	r	subcompact
224	volkswagen	new beetle	2.0	1999	4	auto(l4)	f	19	26	r	subcompact
225	volkswagen	new beetle	2.5	2008	5	manual(m5)	f	20	28	r	subcompact
226	volkswagen	new beetle	2.5	2008	5	auto(s6)	f	20	29	r	subcompact
227	volkswagen	passat	1.8	1999	4	manual(m5)	f	21	29	p	midsize
228	volkswagen	passat	1.8	1999	4	auto(l5)	f	18	29	p	midsize
229	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
230	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize
231	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize
232	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize
233	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize

You can use a double bracket notation to select a subset of columns.

Tip

Using single brackets or dot notation will get you a single column as a Series.

class_cty_hwy = mpg[["class", "cty", "hwy"]]
class_cty_hwy

	class	cty	hwy
0	compact	18	29
1	compact	21	29
2	compact	20	31
3	compact	21	30
4	compact	16	26
...	...	...	...
229	midsize	19	28
230	midsize	21	29
231	midsize	16	26
232	midsize	18	26
233	midsize	17	26

234 rows × 3 columns

11.3 Data Wrangling

In addtion to selecting rows and columns from DataFrames, you can also use Pandas to do a wide variety of data transformations.

11.3.1 Sorting

mpg.sort_values("year", ascending=False)

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
117	hyundai	tiburon	2.0	2008	4	manual(m5)	f	20	28	r	subcompact
120	hyundai	tiburon	2.7	2008	6	manual(m6)	f	16	24	r	subcompact
122	jeep	grand cherokee 4wd	3.0	2008	6	auto(l5)	4	17	22	d	suv
123	jeep	grand cherokee 4wd	3.7	2008	6	auto(l5)	4	15	19	r	suv
126	jeep	grand cherokee 4wd	4.7	2008	8	auto(l5)	4	9	12	e	suv
...	...	...	...	...	...	...	...	...	...	...	...
130	land rover	range rover	4.0	1999	8	auto(l4)	4	11	15	p	suv
50	dodge	dakota pickup 4wd	3.9	1999	6	auto(l4)	4	13	17	r	pickup
51	dodge	dakota pickup 4wd	3.9	1999	6	manual(m5)	4	14	17	r	pickup
125	jeep	grand cherokee 4wd	4.7	1999	8	auto(l4)	4	14	17	r	suv
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact

234 rows × 11 columns

11.3.2 Counting

mpg.value_counts("manufacturer")

manufacturer
dodge         37
toyota        34
volkswagen    27
ford          25
chevrolet     19
audi          18
hyundai       14
subaru        14
nissan        13
honda          9
jeep           8
pontiac        5
land rover     4
mercury        4
lincoln        3
Name: count, dtype: int64

11.3.3 Renaming Columns

# Note the use of a Python dictionary as this method's argument
mpg = mpg.rename({"cty":"city", "hwy": "highway"})
mpg

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
3	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
4	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
...	...	...	...	...	...	...	...	...	...	...	...
229	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
230	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize
231	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize
232	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize
233	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize

234 rows × 11 columns

11.3.4 Create new columns

You can use assign() to create new columns based on existing ones.

mpg = mpg.assign(displ_per_cyl = mpg.displ/mpg.cyl)
mpg

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class	displ_per_cyl
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact	0.450000
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact	0.450000
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact	0.500000
3	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact	0.500000
4	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact	0.466667
...	...	...	...	...	...	...	...	...	...	...	...	...
229	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize	0.500000
230	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize	0.500000
231	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize	0.466667
232	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize	0.466667
233	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize	0.600000

234 rows × 12 columns

11.3.5 Grouping and Summarizing

This combines a couple functions that exist within Pandas to create summary tables.

Pandas has a wide range of summary statistics that you can apply to individual columns.

# Average city fuel efficiency
mpg.cty.mean()

np.float64(16.858974358974358)

# Standard deviation of highway fuel efficiency
mpg.hwy.std()

np.float64(5.9546434411664455)

Pandas also has a .groupby() method (which returns a generator) that groups categorical variables.

mpg.groupby("manufacturer")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10a116630>

By itself, .groupby() doesn’t show anything. It needs to be combined with a summary statistic to create a summary table.

# Averages by manufacturer
# set `numeric_only=True` to avoid a warning
mpg.groupby("manufacturer").mean(numeric_only=True)

	displ	year	cyl	cty	hwy	displ_per_cyl
manufacturer
audi	2.544444	2003.500000	5.222222	17.611111	26.444444	0.484722
chevrolet	5.063158	2004.684211	7.263158	15.000000	21.894737	0.686842
dodge	4.378378	2004.108108	7.081081	13.135135	17.945946	0.616216
ford	4.536000	2002.600000	7.200000	14.000000	19.360000	0.633667
honda	1.711111	2003.000000	4.000000	24.444444	32.555556	0.427778
hyundai	2.428571	2004.142857	4.857143	18.642857	26.857143	0.509524
jeep	4.575000	2005.750000	7.250000	13.500000	17.625000	0.627604
land rover	4.300000	2003.500000	8.000000	11.500000	16.500000	0.537500
lincoln	5.400000	2002.000000	8.000000	11.333333	17.000000	0.675000
mercury	4.400000	2003.500000	7.000000	13.250000	18.000000	0.633333
nissan	3.269231	2003.846154	5.538462	18.076923	24.615385	0.589744
pontiac	3.960000	2002.600000	6.400000	17.000000	26.400000	0.615833
subaru	2.457143	2004.142857	4.000000	19.285714	25.571429	0.614286
toyota	2.952941	2002.705882	5.117647	18.529412	24.911765	0.573897
volkswagen	2.255556	2002.666667	4.592593	20.925926	29.222222	0.491049

11.3.6 Dropping Null Values

For many statistical modeling tasks, you need to drop rows that contain null values. Pandas lets you do this easily with .dropna().

Note

Pandas typically stores null values as NaN, which stands for “not a number.”

# Drop any row that contains a null value in any column
mpg = mpg.dropna()
mpg

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class	displ_per_cyl
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact	0.450000
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact	0.450000
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact	0.500000
3	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact	0.500000
4	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact	0.466667
...	...	...	...	...	...	...	...	...	...	...	...	...
229	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize	0.500000
230	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize	0.500000
231	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize	0.466667
232	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize	0.466667
233	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize	0.600000

234 rows × 12 columns

Warning

The mpg dataset doesn’t contain any null values, so right now this code isn’t accomplishing anything. But it will work when null values are present!

You can also drop null values from only a subset of columns.

# Drop any rows that contain null values in a subset of columns
mpg = mpg.dropna(subset=["model", "displ"])
mpg

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class	displ_per_cyl
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact	0.450000
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact	0.450000
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact	0.500000
3	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact	0.500000
4	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact	0.466667
...	...	...	...	...	...	...	...	...	...	...	...	...
229	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize	0.500000
230	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize	0.500000
231	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize	0.466667
232	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize	0.466667
233	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize	0.600000

234 rows × 12 columns

11.4 Sampling

Many statistical methods, especially hypothesis tests, require you to take a random sample of your overall data. Again, Pandas provides an easy way to do this with the .sample() method.

You can take a sample of rows from an entire dataframe.

# Get 5 random rows from mpg
mpg.sample(5)

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class	displ_per_cyl
179	toyota	camry	2.2	1999	4	manual(m5)	f	21	29	r	midsize	0.550000
95	ford	mustang	4.6	1999	8	manual(m5)	r	15	22	r	subcompact	0.575000
118	hyundai	tiburon	2.0	2008	4	auto(l4)	f	20	27	r	subcompact	0.500000
86	ford	f150 pickup 4wd	4.6	1999	8	auto(l4)	4	13	16	r	pickup	0.575000
47	dodge	caravan 2wd	4.0	2008	6	auto(l6)	f	16	23	r	minivan	0.666667

You can also get a sample of a specific column.

# Get 5 sample engine displacement values, as a series
mpg.displ.sample(5)

60     4.7
158    5.3
220    2.8
146    3.5
0      1.8
Name: displ, dtype: float64

You can also sample with replacement. (This is also called “bootstrap sampling.”) This makes it possible to have the same value in your sample twice.

mpg.displ.sample(5, replace=True)

25     6.2
124    4.0
13     3.1
68     4.7
81     4.6
Name: displ, dtype: float64

Pandas will also let you get a fraction of values instead of a set number in your sample.

# Get a random sample of one twentieth the size of the dataset
mpg.sample(frac=.05)

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class	displ_per_cyl
64	dodge	ram 1500 pickup 4wd	4.7	2008	8	manual(m6)	4	12	16	r	pickup	0.587500
136	lincoln	navigator 2wd	5.4	2008	8	auto(l6)	r	12	18	r	suv	0.675000
138	mercury	mountaineer 4wd	4.0	2008	6	auto(l5)	4	13	19	r	suv	0.666667
223	volkswagen	new beetle	2.0	1999	4	manual(m5)	f	21	29	r	subcompact	0.500000
16	audi	a6 quattro	3.1	2008	6	auto(s6)	4	17	25	p	midsize	0.516667
186	toyota	camry solara	2.2	1999	4	auto(l4)	f	21	27	r	compact	0.550000
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact	0.500000
133	land rover	range rover	4.6	1999	8	auto(l4)	4	11	15	p	suv	0.575000
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact	0.450000
182	toyota	camry	2.4	2008	4	auto(l5)	f	21	31	r	midsize	0.600000
169	subaru	impreza awd	2.5	2008	4	auto(s4)	4	20	25	p	compact	0.625000
119	hyundai	tiburon	2.7	2008	6	auto(l4)	f	17	24	r	subcompact	0.450000

There’s one more trick you can do with sampling. Sometimes you don’t need to get a smaller random sample: instead, you just want to reshuffle every row of the dataset. You can do this by setting frac to 1. In a way, you’re taking a random sample that is 100% of the size of the dataset! (But make sure you do this without replacement.)

mpg.displ.sample(frac=1)

123    3.7
210    2.0
12     2.8
212    1.9
77     4.0
      ... 
88     5.4
169    2.5
232    2.8
222    1.9
72     5.7
Name: displ, Length: 234, dtype: float64

Pandas will remember the indices in your new Series, which means if you use this reordered sample it might put things back in order for you! To avoid this, you can reset the index and drop the old labels.

mpg.displ.sample(frac=1).reset_index(drop=True)

0      2.5
1      3.6
2      3.0
3      5.0
4      3.8
      ... 
229    2.8
230    3.0
231    4.6
232    4.0
233    3.5
Name: displ, Length: 234, dtype: float64

11.5 Combining Datasets

11.5.1 Stack Data with Concatenation

If you have two datasets with identical columns but different rows, you can combine or “stack” the two datasets into one.

# Let's imagine we have two datasets.
# One for manual transmission cars:

manual = pd.read_csv("../data/manual.csv")
manual

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
0	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
1	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
2	audi	a4	2.8	1999	6	manual(m5)	f	18	26	p	compact
3	audi	a4 quattro	1.8	1999	4	manual(m5)	4	18	26	p	compact
4	audi	a4 quattro	2.0	2008	4	manual(m6)	4	20	28	p	compact
...	...	...	...	...	...	...	...	...	...	...	...
72	volkswagen	new beetle	2.0	1999	4	manual(m5)	f	21	29	r	subcompact
73	volkswagen	new beetle	2.5	2008	5	manual(m5)	f	20	28	r	subcompact
74	volkswagen	passat	1.8	1999	4	manual(m5)	f	21	29	p	midsize
75	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize
76	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize

77 rows × 11 columns

# And one for automatic transmissions:
automatic = pd.read_csv("../data/automatic.csv")
automatic

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
1	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
2	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
3	audi	a4	3.1	2008	6	auto(av)	f	18	27	p	compact
4	audi	a4 quattro	1.8	1999	4	auto(l5)	4	16	25	p	compact
...	...	...	...	...	...	...	...	...	...	...	...
152	volkswagen	new beetle	2.5	2008	5	auto(s6)	f	20	29	r	subcompact
153	volkswagen	passat	1.8	1999	4	auto(l5)	f	18	29	p	midsize
154	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
155	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize
156	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize

157 rows × 11 columns

# You can combine these into one with pd.concat()

mpg_concat = pd.concat([manual, automatic])
mpg_concat

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
0	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
1	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
2	audi	a4	2.8	1999	6	manual(m5)	f	18	26	p	compact
3	audi	a4 quattro	1.8	1999	4	manual(m5)	4	18	26	p	compact
4	audi	a4 quattro	2.0	2008	4	manual(m6)	4	20	28	p	compact
...	...	...	...	...	...	...	...	...	...	...	...
152	volkswagen	new beetle	2.5	2008	5	auto(s6)	f	20	29	r	subcompact
153	volkswagen	passat	1.8	1999	4	auto(l5)	f	18	29	p	midsize
154	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
155	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize
156	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize

234 rows × 11 columns

11.5.2 Switch Rows and Columns with Pivot and Melt

Sometimes your data isn’t tidy, and rows should be columns while columns should be rows. You can resolve these problems using Pandas’ “pivot” and “melt” concepts—this is similar to Pivot Tables in Excel.

# Let's imagine you have a dataset with column headings in the rows

long_mpg = pd.read_csv("../data/melted_mpg.csv")
long_mpg

	manufacturer	model	year	trans	variable	value
0	audi	a4	1999	auto(l5)	displ	1.8
1	audi	a4	1999	manual(m5)	displ	1.8
2	audi	a4	2008	manual(m6)	displ	2.0
3	audi	a4	2008	auto(av)	displ	2.0
4	audi	a4 quattro	1999	manual(m5)	displ	1.8
...	...	...	...	...	...	...
968	volkswagen	new beetle	2008	auto(s6)	class	subcompact
969	volkswagen	passat	1999	manual(m5)	class	midsize
970	volkswagen	passat	1999	auto(l5)	class	midsize
971	volkswagen	passat	2008	auto(s6)	class	midsize
972	volkswagen	passat	2008	manual(m6)	class	midsize

973 rows × 6 columns

# Use pivot to turn the "variable" column into the column names
# You must specify which columns you want to retain as indices
# Make sure you add .reset_index()

pivoted_mpg = long_mpg.pivot(index=["manufacturer", "model", "year", "trans"], columns="variable", values="value").reset_index()
pivoted_mpg

variable	manufacturer	model	year	trans	class	cty	cyl	displ	drv	fl	hwy
0	audi	a4	1999	auto(l5)	compact	18	4	1.8	f	p	29
1	audi	a4	1999	manual(m5)	compact	21	4	1.8	f	p	29
2	audi	a4	2008	auto(av)	compact	21	4	2.0	f	p	30
3	audi	a4	2008	manual(m6)	compact	20	4	2.0	f	p	31
4	audi	a4 quattro	1999	auto(l5)	compact	16	4	1.8	4	p	25
...	...	...	...	...	...	...	...	...	...	...	...
134	volkswagen	new beetle	2008	manual(m5)	subcompact	20	5	2.5	f	r	28
135	volkswagen	passat	1999	auto(l5)	midsize	18	4	1.8	f	p	29
136	volkswagen	passat	1999	manual(m5)	midsize	21	4	1.8	f	p	29
137	volkswagen	passat	2008	auto(s6)	midsize	19	4	2.0	f	p	28
138	volkswagen	passat	2008	manual(m6)	midsize	21	4	2.0	f	p	29

139 rows × 11 columns

# You can use .melt() to go in the opposite direction
# Again you must specify the columns you want to keep as indices

melted_mpg = pivoted_mpg.melt(id_vars = ["manufacturer", "model", "year", "trans"])
melted_mpg

	manufacturer	model	year	trans	variable	value
0	audi	a4	1999	auto(l5)	class	compact
1	audi	a4	1999	manual(m5)	class	compact
2	audi	a4	2008	auto(av)	class	compact
3	audi	a4	2008	manual(m6)	class	compact
4	audi	a4 quattro	1999	auto(l5)	class	compact
...	...	...	...	...	...	...
968	volkswagen	new beetle	2008	manual(m5)	hwy	28
969	volkswagen	passat	1999	auto(l5)	hwy	29
970	volkswagen	passat	1999	manual(m5)	hwy	29
971	volkswagen	passat	2008	auto(s6)	hwy	28
972	volkswagen	passat	2008	manual(m6)	hwy	29

973 rows × 6 columns

11.5.3 Join Datasets with Merge

If your columns or rows don’t match exactly, you can’t use pd.concat(). But you can still combine datasets with a database-style JOIN (just like in SQL!) using the pd.merge() function.

Warning

This database-style combining of tables is really powerful, and it can get more complicated than what we have time to cover in this class. I recommend checking out McKinney Ch. 8.2 for more details, and this all is covered in much more detail in CIS 112!

# Let's assume you have a dataset without the vehicle class column

missing_mpg = pd.read_csv("../data/mpg_no_class.csv")
missing_mpg

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p
3	audi	a4	2.0	2008	4	auto(av)	f	21	30	p
4	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p
...	...	...	...	...	...	...	...	...	...	...
229	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p
230	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p
231	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p
232	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p
233	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p

234 rows × 10 columns

# And you also have a dataset (with fewer rows) that has
# class info for each make and year of vehicle

class_mpg = pd.read_csv("../data/mpg_class.csv")
class_mpg

	manufacturer	model	year	class
0	audi	a4	1999	compact
1	audi	a4	2008	compact
2	audi	a4 quattro	1999	compact
3	audi	a4 quattro	2008	compact
4	audi	a6 quattro	1999	midsize
...	...	...	...	...
71	volkswagen	jetta	2008	compact
72	volkswagen	new beetle	1999	subcompact
73	volkswagen	new beetle	2008	subcompact
74	volkswagen	passat	1999	midsize
75	volkswagen	passat	2008	midsize

76 rows × 4 columns

# You can add the class column to your dataframe using merge

merged_mpg = pd.merge(missing_mpg, class_mpg, how="left", on=["manufacturer", "model", "year"])
merged_mpg

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
0	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact
1	audi	a4	1.8	1999	4	manual(m5)	f	21	29	p	compact
2	audi	a4	2.0	2008	4	manual(m6)	f	20	31	p	compact
3	audi	a4	2.0	2008	4	auto(av)	f	21	30	p	compact
4	audi	a4	2.8	1999	6	auto(l5)	f	16	26	p	compact
...	...	...	...	...	...	...	...	...	...	...	...
229	volkswagen	passat	2.0	2008	4	auto(s6)	f	19	28	p	midsize
230	volkswagen	passat	2.0	2008	4	manual(m6)	f	21	29	p	midsize
231	volkswagen	passat	2.8	1999	6	auto(l5)	f	16	26	p	midsize
232	volkswagen	passat	2.8	1999	6	manual(m5)	f	18	26	p	midsize
233	volkswagen	passat	3.6	2008	6	auto(s6)	f	17	26	p	midsize

234 rows × 11 columns

Note

Wes McKinney’s excellent book Python for Data Analysis has lots more examples and many additional function. In particular, you can check out Chapter 5 and Chapter 8.