Pandas#
Pandas is a library for storing and manipulating tabular data, or data stored in rows and columns like a spreadsheet. Pandas is a huge library with many different functions and methods, so what follows is a brief introduction to the most important functions for data management.
See also
If you encounter any part of Pandas out in the wild that you don’t see here, you can always refer to the Pandas documentation.
DataFrames and Series#
Instead of normal Python lists and dictionaries, Pandas stores data in its own specialized objects. The main one is a DataFrame, which is a lot like a spreadsheet with rows and columns.
You can create a DataFrame directly with the DataFrame()
class in Pandas, but it’s more likely that you’ll read in a DataFrame from a CSV or spreadsheet file. First you must import the library, and it’s a good idea to import the numpy
library as well.
Note
Numpy is a Python library for efficiently handling arrays and matrices of numbers. Pandas uses it under the hood to run quickly. You usually won’t need to use it directly, but it’s good to have it installed to avoid any mysterious errors.
import pandas as pd
import numpy as np
Now you can use the read_csv()
function to read in a comma-separated value (CSV) spreadsheet file. You must put the name of this file in quotes, and the file should be in the same directory as your Jupyter notebook (or else you should include a full path). The read_csv()
function will also accept a URL that points to a CSV file online.
For this example, we’ll use the file mpg.csv
which comes from R’s ggplot2 library.
mpg = pd.read_csv("../data/mpg.csv")
mpg
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
1 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
2 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
3 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
4 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
229 | volkswagen | passat | 2.0 | 2008 | 4 | auto(s6) | f | 19 | 28 | p | midsize |
230 | volkswagen | passat | 2.0 | 2008 | 4 | manual(m6) | f | 21 | 29 | p | midsize |
231 | volkswagen | passat | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | midsize |
232 | volkswagen | passat | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | midsize |
233 | volkswagen | passat | 3.6 | 2008 | 6 | auto(s6) | f | 17 | 26 | p | midsize |
234 rows × 11 columns
Note
Jupyter nicely formats DataFrames as tables when you type the name of a variable containing a DataFrame. But if you use the print()
function, it won’t display as well.
You can get basic information about your DataFrames columns using the .info()
method.
mpg.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 234 entries, 0 to 233
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 manufacturer 234 non-null object
1 model 234 non-null object
2 displ 234 non-null float64
3 year 234 non-null int64
4 cyl 234 non-null int64
5 trans 234 non-null object
6 drv 234 non-null object
7 cty 234 non-null int64
8 hwy 234 non-null int64
9 fl 234 non-null object
10 class 234 non-null object
dtypes: float64(1), int64(4), object(6)
memory usage: 20.2+ KB
A Series is a lot like a Python list, and each column of a DataFrame is a Series. You can access the columns of a Dataframe with dot notation.
mpg.model
0 a4
1 a4
2 a4
3 a4
4 a4
...
229 passat
230 passat
231 passat
232 passat
233 passat
Name: model, Length: 234, dtype: object
You can also turn a list into a Series with the Series()
class.
myseries = pd.Series([5, 6, 7, 8])
myseries
0 5
1 6
2 7
3 8
dtype: int64
Selecting Rows and Columns#
Once you have a DataFrame, you’ll typically want to filter and select different rows or columns.
To filter specific rows, Pandas uses a bracket notation. It takes conditional statements that are similar to Python conditions.
# Get cars with fewer than 6 cylinders
four_cylinders = mpg[mpg.cyl < 6]
four_cylinders
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
1 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
2 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
3 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
7 | audi | a4 quattro | 1.8 | 1999 | 4 | manual(m5) | 4 | 18 | 26 | p | compact |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
226 | volkswagen | new beetle | 2.5 | 2008 | 5 | auto(s6) | f | 20 | 29 | r | subcompact |
227 | volkswagen | passat | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | midsize |
228 | volkswagen | passat | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | midsize |
229 | volkswagen | passat | 2.0 | 2008 | 4 | auto(s6) | f | 19 | 28 | p | midsize |
230 | volkswagen | passat | 2.0 | 2008 | 4 | manual(m6) | f | 21 | 29 | p | midsize |
85 rows × 11 columns
You can also use the operators &
(and), |
(or), and !
(not) to combine conditional filters.
# Get Volkswagens and Fords
vw_ford = mpg[(mpg.manufacturer == 'volkswagen') | (mpg.manufacturer == 'ford')]
vw_ford
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | |
---|---|---|---|---|---|---|---|---|---|---|---|
74 | ford | expedition 2wd | 4.6 | 1999 | 8 | auto(l4) | r | 11 | 17 | r | suv |
75 | ford | expedition 2wd | 5.4 | 1999 | 8 | auto(l4) | r | 11 | 17 | r | suv |
76 | ford | expedition 2wd | 5.4 | 2008 | 8 | auto(l6) | r | 12 | 18 | r | suv |
77 | ford | explorer 4wd | 4.0 | 1999 | 6 | auto(l5) | 4 | 14 | 17 | r | suv |
78 | ford | explorer 4wd | 4.0 | 1999 | 6 | manual(m5) | 4 | 15 | 19 | r | suv |
79 | ford | explorer 4wd | 4.0 | 1999 | 6 | auto(l5) | 4 | 14 | 17 | r | suv |
80 | ford | explorer 4wd | 4.0 | 2008 | 6 | auto(l5) | 4 | 13 | 19 | r | suv |
81 | ford | explorer 4wd | 4.6 | 2008 | 8 | auto(l6) | 4 | 13 | 19 | r | suv |
82 | ford | explorer 4wd | 5.0 | 1999 | 8 | auto(l4) | 4 | 13 | 17 | r | suv |
83 | ford | f150 pickup 4wd | 4.2 | 1999 | 6 | auto(l4) | 4 | 14 | 17 | r | pickup |
84 | ford | f150 pickup 4wd | 4.2 | 1999 | 6 | manual(m5) | 4 | 14 | 17 | r | pickup |
85 | ford | f150 pickup 4wd | 4.6 | 1999 | 8 | manual(m5) | 4 | 13 | 16 | r | pickup |
86 | ford | f150 pickup 4wd | 4.6 | 1999 | 8 | auto(l4) | 4 | 13 | 16 | r | pickup |
87 | ford | f150 pickup 4wd | 4.6 | 2008 | 8 | auto(l4) | 4 | 13 | 17 | r | pickup |
88 | ford | f150 pickup 4wd | 5.4 | 1999 | 8 | auto(l4) | 4 | 11 | 15 | r | pickup |
89 | ford | f150 pickup 4wd | 5.4 | 2008 | 8 | auto(l4) | 4 | 13 | 17 | r | pickup |
90 | ford | mustang | 3.8 | 1999 | 6 | manual(m5) | r | 18 | 26 | r | subcompact |
91 | ford | mustang | 3.8 | 1999 | 6 | auto(l4) | r | 18 | 25 | r | subcompact |
92 | ford | mustang | 4.0 | 2008 | 6 | manual(m5) | r | 17 | 26 | r | subcompact |
93 | ford | mustang | 4.0 | 2008 | 6 | auto(l5) | r | 16 | 24 | r | subcompact |
94 | ford | mustang | 4.6 | 1999 | 8 | auto(l4) | r | 15 | 21 | r | subcompact |
95 | ford | mustang | 4.6 | 1999 | 8 | manual(m5) | r | 15 | 22 | r | subcompact |
96 | ford | mustang | 4.6 | 2008 | 8 | manual(m5) | r | 15 | 23 | r | subcompact |
97 | ford | mustang | 4.6 | 2008 | 8 | auto(l5) | r | 15 | 22 | r | subcompact |
98 | ford | mustang | 5.4 | 2008 | 8 | manual(m6) | r | 14 | 20 | p | subcompact |
207 | volkswagen | gti | 2.0 | 1999 | 4 | manual(m5) | f | 21 | 29 | r | compact |
208 | volkswagen | gti | 2.0 | 1999 | 4 | auto(l4) | f | 19 | 26 | r | compact |
209 | volkswagen | gti | 2.0 | 2008 | 4 | manual(m6) | f | 21 | 29 | p | compact |
210 | volkswagen | gti | 2.0 | 2008 | 4 | auto(s6) | f | 22 | 29 | p | compact |
211 | volkswagen | gti | 2.8 | 1999 | 6 | manual(m5) | f | 17 | 24 | r | compact |
212 | volkswagen | jetta | 1.9 | 1999 | 4 | manual(m5) | f | 33 | 44 | d | compact |
213 | volkswagen | jetta | 2.0 | 1999 | 4 | manual(m5) | f | 21 | 29 | r | compact |
214 | volkswagen | jetta | 2.0 | 1999 | 4 | auto(l4) | f | 19 | 26 | r | compact |
215 | volkswagen | jetta | 2.0 | 2008 | 4 | auto(s6) | f | 22 | 29 | p | compact |
216 | volkswagen | jetta | 2.0 | 2008 | 4 | manual(m6) | f | 21 | 29 | p | compact |
217 | volkswagen | jetta | 2.5 | 2008 | 5 | auto(s6) | f | 21 | 29 | r | compact |
218 | volkswagen | jetta | 2.5 | 2008 | 5 | manual(m5) | f | 21 | 29 | r | compact |
219 | volkswagen | jetta | 2.8 | 1999 | 6 | auto(l4) | f | 16 | 23 | r | compact |
220 | volkswagen | jetta | 2.8 | 1999 | 6 | manual(m5) | f | 17 | 24 | r | compact |
221 | volkswagen | new beetle | 1.9 | 1999 | 4 | manual(m5) | f | 35 | 44 | d | subcompact |
222 | volkswagen | new beetle | 1.9 | 1999 | 4 | auto(l4) | f | 29 | 41 | d | subcompact |
223 | volkswagen | new beetle | 2.0 | 1999 | 4 | manual(m5) | f | 21 | 29 | r | subcompact |
224 | volkswagen | new beetle | 2.0 | 1999 | 4 | auto(l4) | f | 19 | 26 | r | subcompact |
225 | volkswagen | new beetle | 2.5 | 2008 | 5 | manual(m5) | f | 20 | 28 | r | subcompact |
226 | volkswagen | new beetle | 2.5 | 2008 | 5 | auto(s6) | f | 20 | 29 | r | subcompact |
227 | volkswagen | passat | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | midsize |
228 | volkswagen | passat | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | midsize |
229 | volkswagen | passat | 2.0 | 2008 | 4 | auto(s6) | f | 19 | 28 | p | midsize |
230 | volkswagen | passat | 2.0 | 2008 | 4 | manual(m6) | f | 21 | 29 | p | midsize |
231 | volkswagen | passat | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | midsize |
232 | volkswagen | passat | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | midsize |
233 | volkswagen | passat | 3.6 | 2008 | 6 | auto(s6) | f | 17 | 26 | p | midsize |
You can use a double bracket notation to select a subset of columns.
Note
Using single brackets or dot notation will get you a single column as a Series.
class_cty_hwy = mpg[["class", "cty", "hwy"]]
class_cty_hwy
class | cty | hwy | |
---|---|---|---|
0 | compact | 18 | 29 |
1 | compact | 21 | 29 |
2 | compact | 20 | 31 |
3 | compact | 21 | 30 |
4 | compact | 16 | 26 |
... | ... | ... | ... |
229 | midsize | 19 | 28 |
230 | midsize | 21 | 29 |
231 | midsize | 16 | 26 |
232 | midsize | 18 | 26 |
233 | midsize | 17 | 26 |
234 rows × 3 columns
Data Wrangling#
In addtion to selecting rows and columns from DataFrames, you can also use Pandas to do a wide variety of data transformations.
Sorting#
mpg.sort_values("year", ascending=False)
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | |
---|---|---|---|---|---|---|---|---|---|---|---|
117 | hyundai | tiburon | 2.0 | 2008 | 4 | manual(m5) | f | 20 | 28 | r | subcompact |
120 | hyundai | tiburon | 2.7 | 2008 | 6 | manual(m6) | f | 16 | 24 | r | subcompact |
122 | jeep | grand cherokee 4wd | 3.0 | 2008 | 6 | auto(l5) | 4 | 17 | 22 | d | suv |
123 | jeep | grand cherokee 4wd | 3.7 | 2008 | 6 | auto(l5) | 4 | 15 | 19 | r | suv |
126 | jeep | grand cherokee 4wd | 4.7 | 2008 | 8 | auto(l5) | 4 | 9 | 12 | e | suv |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
130 | land rover | range rover | 4.0 | 1999 | 8 | auto(l4) | 4 | 11 | 15 | p | suv |
50 | dodge | dakota pickup 4wd | 3.9 | 1999 | 6 | auto(l4) | 4 | 13 | 17 | r | pickup |
51 | dodge | dakota pickup 4wd | 3.9 | 1999 | 6 | manual(m5) | 4 | 14 | 17 | r | pickup |
125 | jeep | grand cherokee 4wd | 4.7 | 1999 | 8 | auto(l4) | 4 | 14 | 17 | r | suv |
0 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
234 rows × 11 columns
Counting#
mpg.value_counts("manufacturer")
manufacturer
dodge 37
toyota 34
volkswagen 27
ford 25
chevrolet 19
audi 18
hyundai 14
subaru 14
nissan 13
honda 9
jeep 8
pontiac 5
land rover 4
mercury 4
lincoln 3
Name: count, dtype: int64
Renaming Columns#
# Note the use of a Python dictionary as this method's argument
mpg = mpg.rename({"cty":"city", "hwy": "highway"})
mpg
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact |
1 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact |
2 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact |
3 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact |
4 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
229 | volkswagen | passat | 2.0 | 2008 | 4 | auto(s6) | f | 19 | 28 | p | midsize |
230 | volkswagen | passat | 2.0 | 2008 | 4 | manual(m6) | f | 21 | 29 | p | midsize |
231 | volkswagen | passat | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | midsize |
232 | volkswagen | passat | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | midsize |
233 | volkswagen | passat | 3.6 | 2008 | 6 | auto(s6) | f | 17 | 26 | p | midsize |
234 rows × 11 columns
Create new columns#
You can use assign()
to create new columns based on existing ones.
mpg = mpg.assign(displ_per_cyl = mpg.displ/mpg.cyl)
mpg
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | displ_per_cyl | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact | 0.450000 |
1 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact | 0.450000 |
2 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact | 0.500000 |
3 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact | 0.500000 |
4 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact | 0.466667 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
229 | volkswagen | passat | 2.0 | 2008 | 4 | auto(s6) | f | 19 | 28 | p | midsize | 0.500000 |
230 | volkswagen | passat | 2.0 | 2008 | 4 | manual(m6) | f | 21 | 29 | p | midsize | 0.500000 |
231 | volkswagen | passat | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | midsize | 0.466667 |
232 | volkswagen | passat | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | midsize | 0.466667 |
233 | volkswagen | passat | 3.6 | 2008 | 6 | auto(s6) | f | 17 | 26 | p | midsize | 0.600000 |
234 rows × 12 columns
Grouping and Summarizing#
This combines a couple functions that exist within Pandas to create summary tables.
Pandas has a wide range of summary statistics that you can apply to individual columns.
# Average city fuel efficiency
mpg.cty.mean()
16.858974358974358
# Standard deviation of highway fuel efficiency
mpg.hwy.std()
5.9546434411664455
Pandas also has a .groupby()
method (which returns a generator) that groups categorical variables.
mpg.groupby("manufacturer")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11849e490>
By itself, .groupby()
doesn’t show anything. It needs to be combined with a summary statistic to create a summary table.
# Averages by manufacturer
# set `numeric_only=True` to avoid a warning
mpg.groupby("manufacturer").mean(numeric_only=True)
displ | year | cyl | cty | hwy | displ_per_cyl | |
---|---|---|---|---|---|---|
manufacturer | ||||||
audi | 2.544444 | 2003.500000 | 5.222222 | 17.611111 | 26.444444 | 0.484722 |
chevrolet | 5.063158 | 2004.684211 | 7.263158 | 15.000000 | 21.894737 | 0.686842 |
dodge | 4.378378 | 2004.108108 | 7.081081 | 13.135135 | 17.945946 | 0.616216 |
ford | 4.536000 | 2002.600000 | 7.200000 | 14.000000 | 19.360000 | 0.633667 |
honda | 1.711111 | 2003.000000 | 4.000000 | 24.444444 | 32.555556 | 0.427778 |
hyundai | 2.428571 | 2004.142857 | 4.857143 | 18.642857 | 26.857143 | 0.509524 |
jeep | 4.575000 | 2005.750000 | 7.250000 | 13.500000 | 17.625000 | 0.627604 |
land rover | 4.300000 | 2003.500000 | 8.000000 | 11.500000 | 16.500000 | 0.537500 |
lincoln | 5.400000 | 2002.000000 | 8.000000 | 11.333333 | 17.000000 | 0.675000 |
mercury | 4.400000 | 2003.500000 | 7.000000 | 13.250000 | 18.000000 | 0.633333 |
nissan | 3.269231 | 2003.846154 | 5.538462 | 18.076923 | 24.615385 | 0.589744 |
pontiac | 3.960000 | 2002.600000 | 6.400000 | 17.000000 | 26.400000 | 0.615833 |
subaru | 2.457143 | 2004.142857 | 4.000000 | 19.285714 | 25.571429 | 0.614286 |
toyota | 2.952941 | 2002.705882 | 5.117647 | 18.529412 | 24.911765 | 0.573897 |
volkswagen | 2.255556 | 2002.666667 | 4.592593 | 20.925926 | 29.222222 | 0.491049 |
Dropping Null Values#
For many statistical modeling tasks, you need to drop rows that contain null values. Pandas lets you do this easily with .dropna()
.
Note
Pandas typically stores null values as NaN, which stands for “not a number.”
# Drop any row that contains a null value in any column
mpg = mpg.dropna()
mpg
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | displ_per_cyl | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact | 0.450000 |
1 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact | 0.450000 |
2 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact | 0.500000 |
3 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact | 0.500000 |
4 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact | 0.466667 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
229 | volkswagen | passat | 2.0 | 2008 | 4 | auto(s6) | f | 19 | 28 | p | midsize | 0.500000 |
230 | volkswagen | passat | 2.0 | 2008 | 4 | manual(m6) | f | 21 | 29 | p | midsize | 0.500000 |
231 | volkswagen | passat | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | midsize | 0.466667 |
232 | volkswagen | passat | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | midsize | 0.466667 |
233 | volkswagen | passat | 3.6 | 2008 | 6 | auto(s6) | f | 17 | 26 | p | midsize | 0.600000 |
234 rows × 12 columns
Warning
The mpg
dataset doesn’t contain any null values, so right now this code isn’t accomplishing anything. But it will work when null values are present!
You can also drop null values from only a subset of columns.
# Drop any rows that contain null values in a subset of columns
mpg = mpg.dropna(subset=["model", "displ"])
mpg
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | displ_per_cyl | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | audi | a4 | 1.8 | 1999 | 4 | auto(l5) | f | 18 | 29 | p | compact | 0.450000 |
1 | audi | a4 | 1.8 | 1999 | 4 | manual(m5) | f | 21 | 29 | p | compact | 0.450000 |
2 | audi | a4 | 2.0 | 2008 | 4 | manual(m6) | f | 20 | 31 | p | compact | 0.500000 |
3 | audi | a4 | 2.0 | 2008 | 4 | auto(av) | f | 21 | 30 | p | compact | 0.500000 |
4 | audi | a4 | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | compact | 0.466667 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
229 | volkswagen | passat | 2.0 | 2008 | 4 | auto(s6) | f | 19 | 28 | p | midsize | 0.500000 |
230 | volkswagen | passat | 2.0 | 2008 | 4 | manual(m6) | f | 21 | 29 | p | midsize | 0.500000 |
231 | volkswagen | passat | 2.8 | 1999 | 6 | auto(l5) | f | 16 | 26 | p | midsize | 0.466667 |
232 | volkswagen | passat | 2.8 | 1999 | 6 | manual(m5) | f | 18 | 26 | p | midsize | 0.466667 |
233 | volkswagen | passat | 3.6 | 2008 | 6 | auto(s6) | f | 17 | 26 | p | midsize | 0.600000 |
234 rows × 12 columns
Sampling#
Many statistical methods, especially hypothesis tests, require to take a random sample of your overall data. Again, Pandas provides an easy way to do this with the .sample()
method.
You can take a sample of rows from an entire dataframe.
# Get 5 random rows from mpg
mpg.sample(5)
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | displ_per_cyl | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
90 | ford | mustang | 3.8 | 1999 | 6 | manual(m5) | r | 18 | 26 | r | subcompact | 0.633333 |
214 | volkswagen | jetta | 2.0 | 1999 | 4 | auto(l4) | f | 19 | 26 | r | compact | 0.500000 |
85 | ford | f150 pickup 4wd | 4.6 | 1999 | 8 | manual(m5) | 4 | 13 | 16 | r | pickup | 0.575000 |
22 | chevrolet | c1500 suburban 2wd | 6.0 | 2008 | 8 | auto(l4) | r | 12 | 17 | r | suv | 0.750000 |
6 | audi | a4 | 3.1 | 2008 | 6 | auto(av) | f | 18 | 27 | p | compact | 0.516667 |
You can also get a sample of a specific column.
# Get 5 sample engine displacement values, as a series
mpg.displ.sample(5)
149 3.5
233 3.6
92 4.0
216 2.0
217 2.5
Name: displ, dtype: float64
You can also sample with replacement. (This is also called “bootstrap sampling.”) This makes it possible to have the same value in your sample twice.
mpg.displ.sample(5, replace=True)
65 4.7
217 2.5
224 2.0
187 2.2
107 2.0
Name: displ, dtype: float64
Pandas will also let you get a fraction of values instead of a set number in your sample.
# Get a random sample of one twentieth the size of the dataset
mpg.sample(frac=.05)
manufacturer | model | displ | year | cyl | trans | drv | cty | hwy | fl | class | displ_per_cyl | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
63 | dodge | durango 4wd | 5.9 | 1999 | 8 | auto(l4) | 4 | 11 | 15 | r | suv | 0.737500 |
149 | nissan | maxima | 3.5 | 2008 | 6 | auto(av) | f | 19 | 25 | p | midsize | 0.583333 |
90 | ford | mustang | 3.8 | 1999 | 6 | manual(m5) | r | 18 | 26 | r | subcompact | 0.633333 |
56 | dodge | dakota pickup 4wd | 5.2 | 1999 | 8 | auto(l4) | 4 | 11 | 15 | r | pickup | 0.650000 |
224 | volkswagen | new beetle | 2.0 | 1999 | 4 | auto(l4) | f | 19 | 26 | r | subcompact | 0.500000 |
30 | chevrolet | k1500 tahoe 4wd | 5.7 | 1999 | 8 | auto(l4) | 4 | 11 | 15 | r | suv | 0.712500 |
204 | toyota | toyota tacoma 4wd | 3.4 | 1999 | 6 | auto(l4) | 4 | 15 | 19 | r | pickup | 0.566667 |
206 | toyota | toyota tacoma 4wd | 4.0 | 2008 | 6 | auto(l5) | 4 | 16 | 20 | r | pickup | 0.666667 |
80 | ford | explorer 4wd | 4.0 | 2008 | 6 | auto(l5) | 4 | 13 | 19 | r | suv | 0.666667 |
7 | audi | a4 quattro | 1.8 | 1999 | 4 | manual(m5) | 4 | 18 | 26 | p | compact | 0.450000 |
18 | chevrolet | c1500 suburban 2wd | 5.3 | 2008 | 8 | auto(l4) | r | 14 | 20 | r | suv | 0.662500 |
71 | dodge | ram 1500 pickup 4wd | 5.2 | 1999 | 8 | manual(m5) | 4 | 11 | 16 | r | pickup | 0.650000 |
There’s one more trick you can do with sampling. Sometimes you don’t need to get a smaller random sample: instead, you just want to reshuffle every row of the dataset. You can do this by setting frac
to 1. In a way, you’re taking a random sample that is 100% of the size of the dataset! (But make sure you do this without replacement.)
mpg.displ.sample(frac=1)
7 1.8
81 4.6
72 5.7
183 3.0
29 5.3
...
226 2.5
208 2.0
187 2.2
90 3.8
140 5.0
Name: displ, Length: 234, dtype: float64
Pandas will remember the indices in your new Series, which means if you use this reordered sample it might put things back in order for you! To avoid this, you can reset the index and drop the old labels.
mpg.displ.sample(frac=1).reset_index(drop=True)
0 4.6
1 3.3
2 2.0
3 3.5
4 2.2
...
229 4.2
230 2.0
231 4.6
232 3.9
233 2.7
Name: displ, Length: 234, dtype: float64