Data Wrangling with pandas

CIS 241, Dr. Ladd

Wrangling Data

Why not just edit the data in the spreadsheet?

DataFrames are like spreadsheets for Python.

# These are our standard imports
import pandas as pd # Data analysis
import numpy as np # Numerical Calculation
import altair as alt # Visualization

Normally, get data from CSV files.

mydata = pd.read_csv("name_of_file.csv")
# Use the name of a file or a URL
taxis = pd.read_csv('https://raw.githubusercontent.com/mwaskom/
        seaborn-data/master/taxis.csv')
taxis

Every variable (column) has a data type.

  • int: integers
  • float: floats, doubles, or real numbers
  • str: string, character, or object
  • Other data types: boolean, date-times, factors
taxis.info() # Pandas works by adding methods to dataframes

Selecting rows and columns

Brackets let you get a subset of observations (rows).

cheap_fares = taxis[taxis.total < 10]
cheap_fares

Remember to save everything in variables

Row selection uses standard comparisons.

  • > greater than
  • >= greater than or equal to
  • < less than
  • <= less than or equal to
  • != not equal
  • == equal (note the double equals sign!)

You can also use logical operators to combine comparisons.

& “and”, | “or”, and ! “not”

Logical operators can also be combined.

extreme_fares = taxis[(taxis.total < 7) | (taxis.total > 50)]
extreme_fares

You try it!

Get only the rows for the green taxis.

The .sort_values() method lets you sort rows by value.

taxis.sort_values("distance", ascending=False)

You can also select a set of variables (columns).

prices = taxis[['fare','tip','tolls','total']]
prices

.rename() lets you rename columns.

taxis = taxis.rename(columns={"fare": "base_fare"})
taxis

Notice we kept the same variable name here!

.assign() lets you add new columns based on existing ones.

taxis_new_column = taxis.assign(total_per_person = taxis.total/taxis.passengers)
taxis_new_column

Grouping and Summarizing

We use .groupby() with summary statistics to make summary tables.

Summary tables are new dataframes that summarize our original data.

This paradigm is known as split-apply-combine, and it’s key to data analysis.

You can use summary statistic methods to get values for a whole column.

taxis.tip.mean()

Stat functions to use: mean(), median(), min(), max(), std().

Groupby lets you group data, to get summaries for each group.

taxis.groupby(['dropoff_borough'])

It doesn’t look like anything on its own!

Now we can put it all together!

# Use multiple methods to "chain" operations
taxis.groupby(["dropoff_borough"]).mean(numeric_only=True)

Chapter 5 and Chapter 8 have lots more guidance, and many examples for you to try!