Hypothesis Testing 2: Correlation

CIS 241, Dr. Ladd

Our first form of bivariate analysis.

Correlation always involves two variables (columns) at a time.

Pearson’s correlation, r, is the default in Pandas.

# Import Altair and the vega_datasets sample data
import altair as alt
from vega_datasets import data

# Load the cars sample dataset
cars = data.cars()

# Calculate correlations between all numeric columns in a dataframe
cars.corr(numeric_only=True)

# Calculate the correlation between just two variables
cars.Miles_per_Gallon.corr(cars.Displacement)

Visualizing Correlation

Scatterplots show potential correlation between two variables.

The y-axis shows the dependent variable, while the x-axis shows the independent variable.

Make a scatterplot with Altair

alt.Chart(cars, title="Fuel Efficiency and Engine Displacement").mark_point().encode(
    x=alt.X("Displacement:Q", title="Engine Displacement (liters)"),
    y=alt.Y("Miles_per_Gallon:Q", title="Fuel Efficiency (mpg)")
).interactive()

Add a line of best fit to make a regression plot

scatter = alt.Chart(cars, title="Fuel Efficiency and Engine Displacement").mark_point().encode(
    x=alt.X("Displacement:Q", title="Engine Displacement (liters)"),
    y=alt.Y("Miles_per_Gallon:Q", title="Fuel Efficiency (mpg)")
).interactive()

scatter + scatter.transform_regression('Displacement','Miles_per_Gallon').mark_line()

Avoid overplotting with heatmaps or kernel density estimation.

# Heatmap example
alt.Chart(cars, title="Fuel Efficiency and Engine Displacement").mark_rect().encode(
    x=alt.X("Displacement:Q", title="Engine Displacement (liters)").bin(),
    y=alt.Y("Miles_per_Gallon:Q", title="Fuel Efficiency (mpg)").bin(),
    color=alt.Color("count():Q")
)
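
The slides mention kernel density estimation but only show a binned heatmap, so here is a minimal sketch of the KDE option. It assumes SciPy is available; scipy.stats.gaussian_kde and the 40-point evaluation grid are illustration choices, not part of the course code.

# A sketch of a KDE alternative (assumes SciPy is installed)
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

# Drop rows with missing values in either column
subset = cars[["Displacement", "Miles_per_Gallon"]].dropna()

# Fit a 2D kernel density estimate (gaussian_kde expects shape: dimensions x points)
kde = gaussian_kde(subset.values.T)

# Evaluate the density on a regular grid of points
xs = np.linspace(subset.Displacement.min(), subset.Displacement.max(), 40)
ys = np.linspace(subset.Miles_per_Gallon.min(), subset.Miles_per_Gallon.max(), 40)
xx, yy = np.meshgrid(xs, ys)
grid = pd.DataFrame({
    "Displacement": xx.ravel(),
    "Miles_per_Gallon": yy.ravel(),
    "density": kde(np.vstack([xx.ravel(), yy.ravel()]))
})

# Plot the estimated density as a binned heatmap
alt.Chart(grid, title="Fuel Efficiency and Engine Displacement (KDE)").mark_rect().encode(
    x=alt.X("Displacement:Q", title="Engine Displacement (cubic inches)").bin(maxbins=40),
    y=alt.Y("Miles_per_Gallon:Q", title="Fuel Efficiency (mpg)").bin(maxbins=40),
    color=alt.Color("mean(density):Q", title="Estimated density")
)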

A correlation matrix shows all possible pairwise correlations.

# Re-arrange correlation matrix data
cars_corr = (cars.corr(numeric_only=True)
             .stack()
             .reset_index()
             .rename(columns={0:'corr','level_0':'var1','level_1':'var2'})
            )
# Create correlation heatmap
base = alt.Chart(cars_corr, title="Cars Correlation Matrix").mark_rect().encode(
    x=alt.X("var1:N",title=None),
    y=alt.Y("var2:N",title=None),
    color=alt.Color("corr",title="Correlation coefficient").scale(scheme='blueorange')
).properties(width=300,height=300)
# Add text labels for coefficients
text = base.mark_text(baseline='middle').encode(
    alt.Text('corr:Q', format=".2f"),
    color=alt.condition(
        (alt.datum.corr < -0.5) | (alt.datum.corr > 0.5),
        alt.value('white'),
        alt.value('black')
    )
)
base+text # Display visualization

Hypothesis Tests for Correlation

How do we know if a correlation coefficient is statistically significant?

There are standard parametric approaches to this, but we can use permutation!
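
For comparison, here is a minimal sketch of one standard parametric approach, assuming SciPy is available: scipy.stats.pearsonr returns the correlation coefficient along with a parametric p-value.

# Parametric test for comparison (assumes SciPy is installed)
from scipy.stats import pearsonr

# pearsonr can't handle missing values, so drop them first
subset = cars[["Displacement", "Miles_per_Gallon"]].dropna()
r, p_value = pearsonr(subset.Displacement, subset.Miles_per_Gallon)
print(r, p_value)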

We need a new simulation function for correlation.

def simulate_correlation(df, var1, var2):
    # Shuffle one column to break any real relationship between the two variables
    shuffled = df[var1].sample(frac=1).reset_index(drop=True)
    # Correlate the shuffled column with the original, unshuffled column
    corr = shuffled.corr(df[var2])
    return corr

Let’s Try It!

Using the function from the previous slide, run 5000 permutations of the correlation between engine displacement and miles per gallon.

Graph the results as a histogram and calculate a p-value. Is this a statistically significant correlation?
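
Try it yourself before peeking: below is a minimal sketch of one possible solution. It assumes the simulate_correlation function above, and the two-sided p-value (counting permutations at least as extreme as the observed correlation) is one common choice.

# A sketch of the permutation test (one possible approach)
import pandas as pd

# Observed correlation in the real data
observed = cars.Miles_per_Gallon.corr(cars.Displacement)

# Run 5000 permutations using the function from the previous slide
permuted = pd.Series(
    [simulate_correlation(cars, "Displacement", "Miles_per_Gallon") for _ in range(5000)]
)

# Two-sided p-value: proportion of permutations at least as extreme as observed
p_value = (permuted.abs() >= abs(observed)).mean()
print(observed, p_value)

# Histogram of the permuted correlations
alt.Chart(permuted.to_frame("corr"), title="Permutation Distribution").mark_bar().encode(
    x=alt.X("corr:Q", title="Permuted correlation").bin(maxbins=50),
    y=alt.Y("count():Q", title="Count")
)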

Don’t be fooled!

Always use summary statistics and visualization together.

If we have the same mean, standard deviation, and correlation, we might expect the data sets to be similar…

But they could still be clearly and visually distinct!

Data Challenge

Use pandas to find the summary statistics for each dataset in the datasaurus_dozen.

  • Find the mean and standard deviation of both x and y, and the correlation between x and y, for each dataset. (You may need to group things by the “dataset” column.)
  • When you’re done, try making scatter plots! (You may need to use the column encoding; see the sketch below.)
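
A minimal sketch of one possible approach is below. It assumes datasaurus_dozen is already loaded as a DataFrame with “dataset”, “x”, and “y” columns; how you load it depends on the file you were given.

# One possible approach (assumes datasaurus_dozen is already a DataFrame)

# Mean and standard deviation of x and y for each dataset
datasaurus_dozen.groupby("dataset").agg(
    x_mean=("x", "mean"),
    x_std=("x", "std"),
    y_mean=("y", "mean"),
    y_std=("y", "std"),
)

# Correlation between x and y within each dataset
datasaurus_dozen.groupby("dataset").apply(lambda g: g.x.corr(g.y))

# One scatter plot per dataset, using the column encoding for small multiples
alt.Chart(datasaurus_dozen).mark_point().encode(
    x="x:Q",
    y="y:Q",
    column="dataset:N"
).properties(width=100, height=100)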