CIS 241, Dr. Ladd
Hypothesis tests protect researchers from being fooled by random chance!
The mean depth of Chinstrap penguin bills is 18.42mm, and the mean depth of Adelie penguin bills is 18.35mm. The observed difference in means is 0.074mm. Does that mean Chinstrap penguins really have taller bills?
If we assume there’s no difference between the two species’ bills in our random model, then we can see that our observed difference is not far from the assumed mean of 0. In this case, more than 30% of our data could be “more extreme” than this result!
In a comparison of means test, the null hypothesis would assume that the means of Adelie and Chinstrap bill depth are equal, that there is no difference between them, and that any observed difference we see is the result of randomness.
We attempt to discredit the null hypothesis by showing that the observed data would be very unlikely if it were only the result of randomness. This is a statistical version of reductio ad absurdum.
If there’s a null hypothesis, there has to be an alternative hypothesis.
If the null hypothesis is that A and B are equal, then the alternative hypothesis would be that A and B are not equal (either smaller or bigger).
One-tailed: We only care about a non-equal result in one direction, i.e. if A > B but not if A < B.
Two-tailed: We care about differences in both directions, i.e. A != B but could be larger or smaller.
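A quick numeric sketch of the difference between the two alternatives. The null distribution and observed value here are made up purely for illustration:

```python
import numpy as np

# Hypothetical null distribution of differences and an
# observed difference, invented just to show the calculation.
rng = np.random.default_rng(0)
null_diffs = rng.normal(loc=0, scale=1, size=10_000)
observed = 1.5

# One-tailed: how often is a random difference at least this LARGE?
p_one_tailed = np.mean(null_diffs >= observed)

# Two-tailed: how often is a random difference at least this EXTREME,
# in either direction?
p_two_tailed = np.mean(np.abs(null_diffs) >= abs(observed))
```

The two-tailed p-value counts both tails, so it is roughly twice the one-tailed value for a symmetric null distribution.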
Different research questions lead to different alternative hypotheses.
Is the median house price in Pittsburgh larger than the median price in Washington?
Is the mean number of mountain lions per 100 km^2 equal in North and South America?
NHANES reports the average starting age of smoking is 19. Is this correct, or is the true mean lower than this?
Is the bill depth of Chinstrap penguins greater than the bill depth of Adelie penguins?
In a permutation test, you rearrange groups randomly to determine a permutation distribution.
It shows you what the distribution would look like if the difference between the groups was the result of random variation.
Let's try this out on the penguins dataset.
import pandas as pd
import numpy as np
import altair as alt
penguins = pd.read_csv('https://jrladd.com/CIS241/data/penguins.csv')
penguins
alt.Chart(penguins, title="Comparing Penguins' Bill Depth by Species").mark_boxplot().encode(
    x=alt.X('species:N').title("Species of Penguin"),
    y=alt.Y('bill_depth_mm:Q').title("Bill Depth (mm)").scale(zero=False),
    color=alt.Color('species:N').legend(None)
).properties(width=200)
# Get two groups
chinstrap_bill_depth = penguins[penguins.species == "Chinstrap"].bill_depth_mm
adelie_bill_depth = penguins[penguins.species == "Adelie"].bill_depth_mm
# Calculate the difference in means
observed_difference = chinstrap_bill_depth.mean() - adelie_bill_depth.mean()
observed_difference
def simulate_two_groups(data1, data2):
    n = len(data1)  # Get length of first group
    data = pd.concat([data1, data2])  # Combine all the data
    data = data.sample(frac=1)  # Reshuffle all the data
    group1 = data.iloc[:n]  # Get random first group
    group2 = data.iloc[n:]  # Get random second group
    return group1.mean() - group2.mean()  # Calculate mean difference
You can reuse this code!
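The plotting code below uses a `permutations` DataFrame with a `permutations` column, which these slides don't construct explicitly. One way to build it with `simulate_two_groups` (repeated here, with small synthetic Series standing in for the penguin data so the snippet runs on its own; with the real data you'd pass `chinstrap_bill_depth` and `adelie_bill_depth`):

```python
import pandas as pd
import numpy as np

def simulate_two_groups(data1, data2):
    n = len(data1)  # Get length of first group
    data = pd.concat([data1, data2])  # Combine all the data
    data = data.sample(frac=1)  # Reshuffle all the data
    return data.iloc[:n].mean() - data.iloc[n:].mean()

# Stand-in data; replace with chinstrap_bill_depth and adelie_bill_depth
rng = np.random.default_rng(1)
group_a = pd.Series(rng.normal(18.4, 1.1, 68))
group_b = pd.Series(rng.normal(18.3, 1.2, 151))

# Run many shuffles and collect the simulated mean differences
# in a DataFrame with a "permutations" column.
results = [simulate_two_groups(group_a, group_b) for _ in range(1000)]
permutations = pd.DataFrame({"permutations": results})
```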
alt.data_transformers.disable_max_rows()  # Don't limit the data
# Add the observed difference to the dataframe
permutations = permutations.assign(observed_difference=observed_difference)
# Create a histogram
histogram = alt.Chart(permutations).mark_bar().encode(
    x=alt.X("permutations:Q").bin(maxbins=20),
    y=alt.Y("count():Q")
)
# Add a vertical line at the observed difference
observed_line = alt.Chart(permutations).mark_rule(color="red", strokeDash=(8,4)).encode(
    x=alt.X("observed_difference")
)
# Combine the two plots
histogram + observed_line
How is this different from previous Altair plots?
Why does this code work?
Is our result statistically significant? Is it practically significant?
A plot alone can't tell us for certain. Instead, we can measure the probability of obtaining results as unusual as the observed result.
This probability is called the p-value!
Given a random model that embodies the null hypothesis, the p-value is the probability of obtaining results as unusual or extreme as the observed result.
In our penguin example, the roughly 32% of shuffled differences that were "more extreme" than our observed difference corresponds to a p-value of 0.32!
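Computing a p-value from a permutation distribution is a one-liner. The distribution below is a made-up stand-in for the real one computed from the penguin data, so the numbers are only illustrative:

```python
import numpy as np

# Illustrative permutation distribution (stand-in for the real
# shuffled differences) and the observed difference in means.
rng = np.random.default_rng(2)
permutation_diffs = rng.normal(0, 0.16, 1000)
observed_difference = 0.074

# One-tailed p-value: the share of shuffled differences at least
# as large as the one we actually observed.
p_value = np.mean(permutation_diffs >= observed_difference)
```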
If the p-value is lower than .05 (5%), we can reject the null hypothesis and our result is statistically significant.
If the p-value is higher than .05 (5%), we fail to reject the null hypothesis and our result is not statistically significant.
This is just a rule of thumb!
Different permutation tests calculate p-values for other kinds of differences.
There are two ways a hypothesis test can go wrong:
Type I Error (alpha-error) is rejecting the null hypothesis when it is true.
Type II Error (beta-error) is failing to reject the null hypothesis when it is false.
Misreading or overemphasizing the p-value can lead us to error!
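The 0.05 threshold is exactly the Type I error rate we accept: when the null hypothesis is true, about 5% of tests will still reject it. A simulation sketch (sample sizes and shuffle counts are arbitrary choices here):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = 0.05
n_experiments = 1000

rejections = 0
for _ in range(n_experiments):
    # Two groups drawn from the SAME distribution: the null is true.
    a = rng.normal(0, 1, 50)
    b = rng.normal(0, 1, 50)
    observed = a.mean() - b.mean()
    # Quick two-tailed permutation test with a modest number of shuffles
    pooled = np.concatenate([a, b])
    diffs = []
    for _ in range(200):
        rng.shuffle(pooled)
        diffs.append(pooled[:50].mean() - pooled[50:].mean())
    p = np.mean(np.abs(np.array(diffs)) >= abs(observed))
    if p < alpha:
        rejections += 1

# Should land near alpha: these are all false positives (Type I errors)
type_i_rate = rejections / n_experiments
```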
Determine if users spend significantly more time on Page B than they do on Page A.
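One way this exercise could be set up as a one-tailed permutation test. The session times here are made up for illustration; the real exercise would supply its own data:

```python
import numpy as np
import pandas as pd

# Made-up session times in seconds; replace with the real data.
rng = np.random.default_rng(4)
page_a = pd.Series(rng.exponential(100, 40))
page_b = pd.Series(rng.exponential(120, 45))

observed = page_b.mean() - page_a.mean()

# One-tailed permutation test: is Page B's mean time LARGER?
pooled = pd.concat([page_b, page_a]).to_numpy()
n_b = len(page_b)
diffs = []
for _ in range(5000):
    rng.shuffle(pooled)
    diffs.append(pooled[:n_b].mean() - pooled[n_b:].mean())
p_value = np.mean(np.array(diffs) >= observed)
```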
For a good summary of what we discussed about hypothesis testing, you can watch this short YouTube video: https://youtu.be/bf3egy7TQ2Q.