CIS 241, Dr. Ladd
Not how they are related.
Correlation always involves two or more variables (columns).
Pearson’s correlation coefficient multiplies the deviations from the mean for two variables, averages those products, and divides by the product of the two standard deviations.
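To make the calculation concrete, here is a minimal sketch of it in NumPy. The function name `pearson_r` and the small data lists are made up for illustration (they are not from the course data), and the result is checked against NumPy's built-in `np.corrcoef`.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: averaged product of deviations over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dev_x = x - x.mean()  # deviations from the mean of x
    dev_y = y - y.mean()  # deviations from the mean of y
    # Sum of the products of deviations, divided by (n - 1): the sample covariance
    covariance = (dev_x * dev_y).sum() / (len(x) - 1)
    # ddof=1 gives the sample standard deviation, matching the (n - 1) above
    return covariance / (x.std(ddof=1) * y.std(ddof=1))

# Made-up illustrative values, NOT taken from the cars dataset
displacement = [1.6, 2.0, 2.5, 3.0, 3.5, 5.0]
mpg = [34, 30, 27, 24, 21, 15]

r = pearson_r(displacement, mpg)
builtin_r = np.corrcoef(displacement, mpg)[0, 1]
print(r, builtin_r)  # the two values should agree
```

In practice, pandas does this for us with `.corr()`; the hand-written version is only here to show what the coefficient is made of.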
Tells us the strength of a correlation.
The y-axis shows the dependent variable, while the x-axis shows the independent variable.
scatter = alt.Chart(cars, title="Fuel Efficiency and Engine Displacement").mark_point().encode(
    x=alt.X("Displacement:Q", title="Engine Displacement (liters)"),
    y=alt.Y("Miles_per_Gallon:Q", title="Fuel Efficiency (mpg)")
).interactive()
# Overlay a fitted regression line on the scatter plot
scatter + scatter.transform_regression('Displacement','Miles_per_Gallon').mark_line()
# Re-arrange correlation matrix data
cars_corr = (cars.corr(numeric_only=True)
    .stack()
    .reset_index()
    .rename(columns={0:'corr','level_0':'var1','level_1':'var2'})
)
# Create correlation heatmap
base = alt.Chart(cars_corr, title="Cars Correlation Matrix").mark_rect().encode(
    x=alt.X("var1:N",title=None),
    y=alt.Y("var2:N",title=None),
    color=alt.Color("corr",title="Correlation coefficient").scale(scheme='blueorange')
).properties(width=300,height=300)
# Add text labels for coefficients
text = base.mark_text(baseline='middle').encode(
    alt.Text('corr:Q', format=".2f"),
    color=alt.condition(
        (alt.datum.corr < -0.5) | (alt.datum.corr > 0.5),
        alt.value('white'),
        alt.value('black')
    )
)
base + text  # Display visualization
There are standard parametric approaches to this, but we can use permutation!
Using the function from the previous slide, run 5000 permutations of the correlation between engine displacement and miles per gallon.
Graph the results as a histogram and calculate a p-value. Is this a statistically significant correlation?
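One possible shape for this exercise is sketched below. The helper `permute_corr` is a hypothetical stand-in for the function from the previous slide (which may look different), and the data is synthetic so the block runs on its own; with the real data you would pass in the two columns from `cars` instead.

```python
import numpy as np

rng = np.random.default_rng(0)

def permute_corr(x, y):
    """Hypothetical helper: shuffle y, then correlate x with the shuffled y."""
    return np.corrcoef(x, rng.permutation(y))[0, 1]

# Synthetic stand-in data (an assumption, NOT the cars dataset):
# larger displacement goes with lower mpg, plus random noise
n = 200
displacement = rng.uniform(1.0, 7.0, n)
mpg = 40 - 4 * displacement + rng.normal(0, 4, n)

# The correlation we actually observed
observed = np.corrcoef(displacement, mpg)[0, 1]

# 5000 correlations from data where any real relationship has been shuffled away
perm_corrs = np.array([permute_corr(displacement, mpg) for _ in range(5000)])

# Two-sided p-value: share of permuted correlations at least as extreme as the observed one
p_value = np.mean(np.abs(perm_corrs) >= abs(observed))
print(observed, p_value)
```

For the histogram, the permuted values could be placed in a one-column DataFrame and charted with `mark_bar()`, a binned x encoding, and `count()` on the y-axis. A p-value below the usual 0.05 threshold would suggest the correlation is unlikely to be due to chance alone.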
Always use summary statistics and visualization together.
But they could be very clearly and visually distinct!
Use pandas to find the summary statistics for each dataset in the datasaurus_dozen.
column encoding.)
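A rough sketch of the summary-statistics step is below. It assumes the data has a `dataset` column naming each dataset plus numeric `x` and `y` columns; those column names are an assumption about `datasaurus_dozen`, and the small frame here is synthetic stand-in data so the block runs on its own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in with the ASSUMED column layout: dataset, x, y
x = rng.normal(50, 15, 100)
frame = pd.concat([
    pd.DataFrame({"dataset": "line", "x": x, "y": 0.5 * x + 10}),
    pd.DataFrame({"dataset": "cloud", "x": x, "y": rng.normal(35, 8, 100)}),
], ignore_index=True)

# Mean and standard deviation of x and y, one row per dataset
summary = frame.groupby("dataset")[["x", "y"]].agg(["mean", "std"])

# Correlation between x and y within each dataset
corr = frame.groupby("dataset")[["x", "y"]].apply(lambda g: g["x"].corr(g["y"]))

print(summary)
print(corr)
```

With the real data, swapping `frame` for `datasaurus_dozen` should give one row of statistics per dataset, which can then be compared side by side with the plots.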