Pandas#

Pandas is a library for storing and manipulating tabular data, or data stored in rows and columns like a spreadsheet. Pandas is a huge library with many different functions and methods, so what follows is a brief introduction to the most important functions for data management.

See also

If you encounter any part of Pandas out in the wild that you don’t see here, you can always refer to the Pandas documentation.

DataFrames and Series#

Instead of normal Python lists and dictionaries, Pandas stores data in its own specialized objects. The main one is a DataFrame, which is a lot like a spreadsheet with rows and columns.

You can create a DataFrame directly with the DataFrame() class in Pandas, but it’s more likely that you’ll read in a DataFrame from a CSV or spreadsheet file. First you must import the library, and it’s a good idea to import the numpy library as well.

Note

Numpy is a Python library for efficiently handling arrays and matrices of numbers. Pandas uses it under the hood to run quickly. You usually won’t need to use it directly, but it’s good to have it installed to avoid any mysterious errors.

import pandas as pd
import numpy as np

Now you can use the read_csv() function to read in a comma-separated value (CSV) spreadsheet file. You must put the name of this file in quotes, and the file should be in the same directory as your Jupyter notebook (or else you should include a full path). The read_csv() function will also accept a URL that points to a CSV file online.

For this example, we’ll use the file mpg.csv which comes from R’s ggplot2 library.

mpg = pd.read_csv("../data/mpg.csv")
mpg
manufacturer model displ year cyl trans drv cty hwy fl class
0 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
1 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
2 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
3 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
4 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
... ... ... ... ... ... ... ... ... ... ... ...
229 volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p midsize
230 volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p midsize
231 volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize
232 volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize
233 volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p midsize

234 rows × 11 columns

Note

Jupyter nicely formats DataFrames as tables when you type the name of a variable containing a DataFrame. But if you use the print() function, it won’t display as well.

You can get basic information about your DataFrames columns using the .info() method.

mpg.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 234 entries, 0 to 233
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   manufacturer  234 non-null    object 
 1   model         234 non-null    object 
 2   displ         234 non-null    float64
 3   year          234 non-null    int64  
 4   cyl           234 non-null    int64  
 5   trans         234 non-null    object 
 6   drv           234 non-null    object 
 7   cty           234 non-null    int64  
 8   hwy           234 non-null    int64  
 9   fl            234 non-null    object 
 10  class         234 non-null    object 
dtypes: float64(1), int64(4), object(6)
memory usage: 20.2+ KB

A Series is a lot like a Python list, and each column of a DataFrame is a Series. You can access the columns of a Dataframe with dot notation.

mpg.model
0          a4
1          a4
2          a4
3          a4
4          a4
        ...  
229    passat
230    passat
231    passat
232    passat
233    passat
Name: model, Length: 234, dtype: object

You can also turn a list into a Series with the Series() class.

myseries = pd.Series([5, 6, 7, 8])
myseries
0    5
1    6
2    7
3    8
dtype: int64

Selecting Rows and Columns#

Once you have a DataFrame, you’ll typically want to filter and select different rows or columns.

To filter specific rows, Pandas uses a bracket notation. It takes conditional statements that are similar to Python conditions.

# Get cars with fewer than 6 cylinders
four_cylinders = mpg[mpg.cyl < 6]
four_cylinders
manufacturer model displ year cyl trans drv cty hwy fl class
0 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
1 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
2 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
3 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
7 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
... ... ... ... ... ... ... ... ... ... ... ...
226 volkswagen new beetle 2.5 2008 5 auto(s6) f 20 29 r subcompact
227 volkswagen passat 1.8 1999 4 manual(m5) f 21 29 p midsize
228 volkswagen passat 1.8 1999 4 auto(l5) f 18 29 p midsize
229 volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p midsize
230 volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p midsize

85 rows × 11 columns

You can also use the operators & (and), | (or), and ! (not) to combine conditional filters.

# Get Volkswagens and Fords
vw_ford = mpg[(mpg.manufacturer == 'volkswagen') | (mpg.manufacturer == 'ford')]
vw_ford
manufacturer model displ year cyl trans drv cty hwy fl class
74 ford expedition 2wd 4.6 1999 8 auto(l4) r 11 17 r suv
75 ford expedition 2wd 5.4 1999 8 auto(l4) r 11 17 r suv
76 ford expedition 2wd 5.4 2008 8 auto(l6) r 12 18 r suv
77 ford explorer 4wd 4.0 1999 6 auto(l5) 4 14 17 r suv
78 ford explorer 4wd 4.0 1999 6 manual(m5) 4 15 19 r suv
79 ford explorer 4wd 4.0 1999 6 auto(l5) 4 14 17 r suv
80 ford explorer 4wd 4.0 2008 6 auto(l5) 4 13 19 r suv
81 ford explorer 4wd 4.6 2008 8 auto(l6) 4 13 19 r suv
82 ford explorer 4wd 5.0 1999 8 auto(l4) 4 13 17 r suv
83 ford f150 pickup 4wd 4.2 1999 6 auto(l4) 4 14 17 r pickup
84 ford f150 pickup 4wd 4.2 1999 6 manual(m5) 4 14 17 r pickup
85 ford f150 pickup 4wd 4.6 1999 8 manual(m5) 4 13 16 r pickup
86 ford f150 pickup 4wd 4.6 1999 8 auto(l4) 4 13 16 r pickup
87 ford f150 pickup 4wd 4.6 2008 8 auto(l4) 4 13 17 r pickup
88 ford f150 pickup 4wd 5.4 1999 8 auto(l4) 4 11 15 r pickup
89 ford f150 pickup 4wd 5.4 2008 8 auto(l4) 4 13 17 r pickup
90 ford mustang 3.8 1999 6 manual(m5) r 18 26 r subcompact
91 ford mustang 3.8 1999 6 auto(l4) r 18 25 r subcompact
92 ford mustang 4.0 2008 6 manual(m5) r 17 26 r subcompact
93 ford mustang 4.0 2008 6 auto(l5) r 16 24 r subcompact
94 ford mustang 4.6 1999 8 auto(l4) r 15 21 r subcompact
95 ford mustang 4.6 1999 8 manual(m5) r 15 22 r subcompact
96 ford mustang 4.6 2008 8 manual(m5) r 15 23 r subcompact
97 ford mustang 4.6 2008 8 auto(l5) r 15 22 r subcompact
98 ford mustang 5.4 2008 8 manual(m6) r 14 20 p subcompact
207 volkswagen gti 2.0 1999 4 manual(m5) f 21 29 r compact
208 volkswagen gti 2.0 1999 4 auto(l4) f 19 26 r compact
209 volkswagen gti 2.0 2008 4 manual(m6) f 21 29 p compact
210 volkswagen gti 2.0 2008 4 auto(s6) f 22 29 p compact
211 volkswagen gti 2.8 1999 6 manual(m5) f 17 24 r compact
212 volkswagen jetta 1.9 1999 4 manual(m5) f 33 44 d compact
213 volkswagen jetta 2.0 1999 4 manual(m5) f 21 29 r compact
214 volkswagen jetta 2.0 1999 4 auto(l4) f 19 26 r compact
215 volkswagen jetta 2.0 2008 4 auto(s6) f 22 29 p compact
216 volkswagen jetta 2.0 2008 4 manual(m6) f 21 29 p compact
217 volkswagen jetta 2.5 2008 5 auto(s6) f 21 29 r compact
218 volkswagen jetta 2.5 2008 5 manual(m5) f 21 29 r compact
219 volkswagen jetta 2.8 1999 6 auto(l4) f 16 23 r compact
220 volkswagen jetta 2.8 1999 6 manual(m5) f 17 24 r compact
221 volkswagen new beetle 1.9 1999 4 manual(m5) f 35 44 d subcompact
222 volkswagen new beetle 1.9 1999 4 auto(l4) f 29 41 d subcompact
223 volkswagen new beetle 2.0 1999 4 manual(m5) f 21 29 r subcompact
224 volkswagen new beetle 2.0 1999 4 auto(l4) f 19 26 r subcompact
225 volkswagen new beetle 2.5 2008 5 manual(m5) f 20 28 r subcompact
226 volkswagen new beetle 2.5 2008 5 auto(s6) f 20 29 r subcompact
227 volkswagen passat 1.8 1999 4 manual(m5) f 21 29 p midsize
228 volkswagen passat 1.8 1999 4 auto(l5) f 18 29 p midsize
229 volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p midsize
230 volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p midsize
231 volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize
232 volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize
233 volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p midsize

You can use a double bracket notation to select a subset of columns.

Note

Using single brackets or dot notation will get you a single column as a Series.

class_cty_hwy = mpg[["class", "cty", "hwy"]]
class_cty_hwy
class cty hwy
0 compact 18 29
1 compact 21 29
2 compact 20 31
3 compact 21 30
4 compact 16 26
... ... ... ...
229 midsize 19 28
230 midsize 21 29
231 midsize 16 26
232 midsize 18 26
233 midsize 17 26

234 rows × 3 columns

Data Wrangling#

In addtion to selecting rows and columns from DataFrames, you can also use Pandas to do a wide variety of data transformations.

Sorting#

mpg.sort_values("year", ascending=False)
manufacturer model displ year cyl trans drv cty hwy fl class
117 hyundai tiburon 2.0 2008 4 manual(m5) f 20 28 r subcompact
120 hyundai tiburon 2.7 2008 6 manual(m6) f 16 24 r subcompact
122 jeep grand cherokee 4wd 3.0 2008 6 auto(l5) 4 17 22 d suv
123 jeep grand cherokee 4wd 3.7 2008 6 auto(l5) 4 15 19 r suv
126 jeep grand cherokee 4wd 4.7 2008 8 auto(l5) 4 9 12 e suv
... ... ... ... ... ... ... ... ... ... ... ...
130 land rover range rover 4.0 1999 8 auto(l4) 4 11 15 p suv
50 dodge dakota pickup 4wd 3.9 1999 6 auto(l4) 4 13 17 r pickup
51 dodge dakota pickup 4wd 3.9 1999 6 manual(m5) 4 14 17 r pickup
125 jeep grand cherokee 4wd 4.7 1999 8 auto(l4) 4 14 17 r suv
0 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact

234 rows × 11 columns

Counting#

mpg.value_counts("manufacturer")
manufacturer
dodge         37
toyota        34
volkswagen    27
ford          25
chevrolet     19
audi          18
hyundai       14
subaru        14
nissan        13
honda          9
jeep           8
pontiac        5
land rover     4
mercury        4
lincoln        3
Name: count, dtype: int64

Renaming Columns#

# Note the use of a Python dictionary as this method's argument
mpg = mpg.rename({"cty":"city", "hwy": "highway"})
mpg
manufacturer model displ year cyl trans drv cty hwy fl class
0 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
1 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
2 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
3 audi a4 2.0 2008 4 auto(av) f 21 30 p compact
4 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
... ... ... ... ... ... ... ... ... ... ... ...
229 volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p midsize
230 volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p midsize
231 volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize
232 volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize
233 volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p midsize

234 rows × 11 columns

Create new columns#

You can use assign() to create new columns based on existing ones.

mpg = mpg.assign(displ_per_cyl = mpg.displ/mpg.cyl)
mpg
manufacturer model displ year cyl trans drv cty hwy fl class displ_per_cyl
0 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact 0.450000
1 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact 0.450000
2 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact 0.500000
3 audi a4 2.0 2008 4 auto(av) f 21 30 p compact 0.500000
4 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact 0.466667
... ... ... ... ... ... ... ... ... ... ... ... ...
229 volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p midsize 0.500000
230 volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p midsize 0.500000
231 volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize 0.466667
232 volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize 0.466667
233 volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p midsize 0.600000

234 rows × 12 columns

Grouping and Summarizing#

This combines a couple functions that exist within Pandas to create summary tables.

Pandas has a wide range of summary statistics that you can apply to individual columns.

# Average city fuel efficiency
mpg.cty.mean()
16.858974358974358
# Standard deviation of highway fuel efficiency
mpg.hwy.std()
5.9546434411664455

Pandas also has a .groupby() method (which returns a generator) that groups categorical variables.

mpg.groupby("manufacturer")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11849e490>

By itself, .groupby() doesn’t show anything. It needs to be combined with a summary statistic to create a summary table.

# Averages by manufacturer
# set `numeric_only=True` to avoid a warning
mpg.groupby("manufacturer").mean(numeric_only=True)
displ year cyl cty hwy displ_per_cyl
manufacturer
audi 2.544444 2003.500000 5.222222 17.611111 26.444444 0.484722
chevrolet 5.063158 2004.684211 7.263158 15.000000 21.894737 0.686842
dodge 4.378378 2004.108108 7.081081 13.135135 17.945946 0.616216
ford 4.536000 2002.600000 7.200000 14.000000 19.360000 0.633667
honda 1.711111 2003.000000 4.000000 24.444444 32.555556 0.427778
hyundai 2.428571 2004.142857 4.857143 18.642857 26.857143 0.509524
jeep 4.575000 2005.750000 7.250000 13.500000 17.625000 0.627604
land rover 4.300000 2003.500000 8.000000 11.500000 16.500000 0.537500
lincoln 5.400000 2002.000000 8.000000 11.333333 17.000000 0.675000
mercury 4.400000 2003.500000 7.000000 13.250000 18.000000 0.633333
nissan 3.269231 2003.846154 5.538462 18.076923 24.615385 0.589744
pontiac 3.960000 2002.600000 6.400000 17.000000 26.400000 0.615833
subaru 2.457143 2004.142857 4.000000 19.285714 25.571429 0.614286
toyota 2.952941 2002.705882 5.117647 18.529412 24.911765 0.573897
volkswagen 2.255556 2002.666667 4.592593 20.925926 29.222222 0.491049

Dropping Null Values#

For many statistical modeling tasks, you need to drop rows that contain null values. Pandas lets you do this easily with .dropna().

Note

Pandas typically stores null values as NaN, which stands for “not a number.”

# Drop any row that contains a null value in any column
mpg = mpg.dropna()
mpg
manufacturer model displ year cyl trans drv cty hwy fl class displ_per_cyl
0 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact 0.450000
1 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact 0.450000
2 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact 0.500000
3 audi a4 2.0 2008 4 auto(av) f 21 30 p compact 0.500000
4 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact 0.466667
... ... ... ... ... ... ... ... ... ... ... ... ...
229 volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p midsize 0.500000
230 volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p midsize 0.500000
231 volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize 0.466667
232 volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize 0.466667
233 volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p midsize 0.600000

234 rows × 12 columns

Warning

The mpg dataset doesn’t contain any null values, so right now this code isn’t accomplishing anything. But it will work when null values are present!

You can also drop null values from only a subset of columns.

# Drop any rows that contain null values in a subset of columns
mpg = mpg.dropna(subset=["model", "displ"])
mpg
manufacturer model displ year cyl trans drv cty hwy fl class displ_per_cyl
0 audi a4 1.8 1999 4 auto(l5) f 18 29 p compact 0.450000
1 audi a4 1.8 1999 4 manual(m5) f 21 29 p compact 0.450000
2 audi a4 2.0 2008 4 manual(m6) f 20 31 p compact 0.500000
3 audi a4 2.0 2008 4 auto(av) f 21 30 p compact 0.500000
4 audi a4 2.8 1999 6 auto(l5) f 16 26 p compact 0.466667
... ... ... ... ... ... ... ... ... ... ... ... ...
229 volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p midsize 0.500000
230 volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p midsize 0.500000
231 volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize 0.466667
232 volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize 0.466667
233 volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p midsize 0.600000

234 rows × 12 columns

Sampling#

Many statistical methods, especially hypothesis tests, require to take a random sample of your overall data. Again, Pandas provides an easy way to do this with the .sample() method.

You can take a sample of rows from an entire dataframe.

# Get 5 random rows from mpg
mpg.sample(5)
manufacturer model displ year cyl trans drv cty hwy fl class displ_per_cyl
90 ford mustang 3.8 1999 6 manual(m5) r 18 26 r subcompact 0.633333
214 volkswagen jetta 2.0 1999 4 auto(l4) f 19 26 r compact 0.500000
85 ford f150 pickup 4wd 4.6 1999 8 manual(m5) 4 13 16 r pickup 0.575000
22 chevrolet c1500 suburban 2wd 6.0 2008 8 auto(l4) r 12 17 r suv 0.750000
6 audi a4 3.1 2008 6 auto(av) f 18 27 p compact 0.516667

You can also get a sample of a specific column.

# Get 5 sample engine displacement values, as a series
mpg.displ.sample(5)
149    3.5
233    3.6
92     4.0
216    2.0
217    2.5
Name: displ, dtype: float64

You can also sample with replacement. (This is also called “bootstrap sampling.”) This makes it possible to have the same value in your sample twice.

mpg.displ.sample(5, replace=True)
65     4.7
217    2.5
224    2.0
187    2.2
107    2.0
Name: displ, dtype: float64

Pandas will also let you get a fraction of values instead of a set number in your sample.

# Get a random sample of one twentieth the size of the dataset
mpg.sample(frac=.05)
manufacturer model displ year cyl trans drv cty hwy fl class displ_per_cyl
63 dodge durango 4wd 5.9 1999 8 auto(l4) 4 11 15 r suv 0.737500
149 nissan maxima 3.5 2008 6 auto(av) f 19 25 p midsize 0.583333
90 ford mustang 3.8 1999 6 manual(m5) r 18 26 r subcompact 0.633333
56 dodge dakota pickup 4wd 5.2 1999 8 auto(l4) 4 11 15 r pickup 0.650000
224 volkswagen new beetle 2.0 1999 4 auto(l4) f 19 26 r subcompact 0.500000
30 chevrolet k1500 tahoe 4wd 5.7 1999 8 auto(l4) 4 11 15 r suv 0.712500
204 toyota toyota tacoma 4wd 3.4 1999 6 auto(l4) 4 15 19 r pickup 0.566667
206 toyota toyota tacoma 4wd 4.0 2008 6 auto(l5) 4 16 20 r pickup 0.666667
80 ford explorer 4wd 4.0 2008 6 auto(l5) 4 13 19 r suv 0.666667
7 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact 0.450000
18 chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv 0.662500
71 dodge ram 1500 pickup 4wd 5.2 1999 8 manual(m5) 4 11 16 r pickup 0.650000

There’s one more trick you can do with sampling. Sometimes you don’t need to get a smaller random sample: instead, you just want to reshuffle every row of the dataset. You can do this by setting frac to 1. In a way, you’re taking a random sample that is 100% of the size of the dataset! (But make sure you do this without replacement.)

mpg.displ.sample(frac=1)
7      1.8
81     4.6
72     5.7
183    3.0
29     5.3
      ... 
226    2.5
208    2.0
187    2.2
90     3.8
140    5.0
Name: displ, Length: 234, dtype: float64

Pandas will remember the indices in your new Series, which means if you use this reordered sample it might put things back in order for you! To avoid this, you can reset the index and drop the old labels.

mpg.displ.sample(frac=1).reset_index(drop=True)
0      4.6
1      3.3
2      2.0
3      3.5
4      2.2
      ... 
229    4.2
230    2.0
231    4.6
232    3.9
233    2.7
Name: displ, Length: 234, dtype: float64

See also

Wes McKinney’s excellent book Python for Data Analysis has lots more examples and many additional function. In particular, you can check out Chapter 5 and Chapter 8.