Hypothesis Tests for Comparison of Means
CIS 241: Data Mining, Dr. Ladd
Consider two sample groups, A and B (such as the male and female groups in your alcohol use data).
In a t-test, the null hypothesis would assume that the means of A and B are equal, that there is no difference between them, and that any observed difference we see is the result of randomness.
We attempt to disprove the null hypothesis by showing that the observed data isn’t the result of randomness.
If there’s a null hypothesis, there has to be an alternative hypothesis.
If the null hypothesis is that A and B are equal, then the alternative hypothesis would be that A and B are not equal (A could be either smaller or larger than B).
One-tailed: We only care about a non-equal result in one direction, i.e. if A > B but not if A < B.
Two-tailed: We care about differences in both directions, i.e. A != B but could be larger or smaller.
Different research questions lead to different alternative hypotheses.
Is the median house price in Pittsburgh larger than the median price in Washington?
Is the mean number of mountain lions per 100 km^2 equal in North and South America?
NHANES reports the average starting age of smoking is 19. Is this correct, or is the true mean lower than this?
Say you have two web pages, Page A and Page B, and you’ve measured the amount of time internet users spend on each page. You’re trying to decide whether to replace Page A with Page B.
The Null Hypothesis is that:
mean(A) = mean(B)
The Alternative Hypothesis is that:
mean(B) > mean(A)
(one-tailed)
We have two clear groups: the people who saw Page A and the people who saw Page B. But we could reshuffle this data a thousand times, each time separating the session times into equally sized but random groups.
In the end we'd have a distribution of how much the means differ across a thousand random groupings.
In this case, we care about how often the random differences were greater than the observed difference.
I.e., how often the values fell to the right of the dotted line marking the observed difference.
In this case, that was about 12% of the time. That's a lot! And that means this observed difference isn't all that unusual.
To make that intuition precise, we can measure the probability of obtaining results as unusual as the observed result.
This probability is called the p-value!
Given a chance model that embodies the null hypothesis, the p-value is the probability of obtaining results as unusual or extreme as the observed result.
In our example, our 12% was a p-value of .12!
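As a quick sketch of how a number like that could be computed (the session times below are made up purely for illustration):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical session times in seconds -- invented data, not real measurements
page_a = np.array([21, 25, 30, 19, 35, 28, 32, 27, 24, 29])
page_b = np.array([26, 31, 33, 24, 38, 30, 36, 29, 28, 34])

observed = page_b.mean() - page_a.mean()  # Observed difference in means

combined = np.concatenate([page_a, page_b])
diffs = []
for _ in range(1000):
    rng.shuffle(combined)  # Reshuffle all session times
    # New random "B" mean minus new random "A" mean
    diffs.append(combined[len(page_a):].mean() - combined[:len(page_a)].mean())

# One-tailed p-value: how often the random differences met or exceeded the observed one
p_value = np.mean(np.array(diffs) >= observed)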
If the p-value is lower than .05 (5%), we can reject the null hypothesis.
If the p-value is higher than .05 (5%), we fail to reject the null hypothesis and our result could be random.
This is just a rule of thumb!
In our example of two groups in our data, we could test whether their difference in means is significant using a t-test. It calculates a p-value based on a “t-distribution.”
Different statistical tests calculate p-values for other kinds of differences.
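For instance, a two-sample t-test is a one-liner with scipy (using scipy here is a suggestion on my part, and the two arrays are placeholder data):

from scipy import stats

# Placeholder measurements for two groups
group_a = [21, 25, 30, 19, 35, 28]
group_b = [26, 31, 33, 24, 38, 30]

# Two-tailed Welch's t-test (doesn't assume equal variances);
# recent scipy versions also accept alternative='greater' for a one-tailed test
result = stats.ttest_ind(group_a, group_b, equal_var=False)
print(result.statistic, result.pvalue)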
Interpreting a p-value can go wrong in two ways:
Type I Error (alpha-error) is rejecting the null hypothesis when it is true.
Type II Error (beta-error) is failing to reject the null hypothesis when it is false.
Misreading or overemphasizing the p-value can lead us to error!
A permutation test gets around distributional assumptions by using resampling (without replacement).
It doesn't matter whether the samples are normally distributed or whether their variances are equal: a permutation test makes no such assumptions.
In a permutation test, you rearrange groups randomly to determine a permutation distribution.
It shows you what the distribution would look like if the difference between the groups were the result of random variation.
Let’s look at the steps of a permutation test that would replace a two-sample t-test…
We'll use the penguins dataset.
import pandas as pd
import numpy as np
import altair as alt
penguins = pd.read_csv('https://jrladd.com/CIS241/data/penguins.csv')
penguins
alt.Chart(penguins, title="Comparing Penguins' Bill Depth by Species").mark_boxplot().encode(
    x=alt.X('species:N').title("Species of Penguin"),
    y=alt.Y('bill_depth_mm:Q').title("Bill Depth (mm)").scale(zero=False),
    color=alt.Color('species:N').legend(None)
).properties(width=200)
First name the function and define its inputs. Then do something and return a result!
def simulate_two_groups(data1, data2):
    n = len(data1)  # Get length of first group
    data = pd.concat([data1, data2])  # Combine all the data
    data = data.sample(frac=1)  # Reshuffle all data
    group1 = data.iloc[:n]  # Get random first group
    group2 = data.iloc[n:]  # Get random second group
    return group1.mean() - group2.mean()  # Calculate mean difference
You can reuse this code!
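For instance, here is one way to build the permutation distribution (mean_perms) and the observed difference (mean_diff) used in the plot below. Comparing Adelie and Chinstrap bill depths is my assumption; any two groups would work:

# Choose two groups to compare (the species choice is an assumption for this sketch)
adelie = penguins[penguins['species'] == 'Adelie']['bill_depth_mm'].dropna()
chinstrap = penguins[penguins['species'] == 'Chinstrap']['bill_depth_mm'].dropna()

mean_diff = adelie.mean() - chinstrap.mean()  # Observed difference in means

# Run the simulation 1000 times and collect the results in a dataframe
mean_perms = pd.DataFrame({'mean_perms': [simulate_two_groups(adelie, chinstrap) for _ in range(1000)]})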
alt.data_transformers.disable_max_rows()  # Don't limit the data

# Create a histogram of the permuted differences
histogram = alt.Chart(mean_perms).mark_bar().encode(
    x=alt.X("mean_perms:Q").bin(maxbins=20),
    y=alt.Y("count():Q")
)

mean_perms = mean_perms.assign(mean_diff=mean_diff)  # Add the observed difference to the dataframe

# Add a vertical line at the observed difference
observed_difference = alt.Chart(mean_perms).mark_rule(color="red", strokeDash=(8,4)).encode(
    x=alt.X("mean_diff")
)

# Combine the two plots
histogram + observed_difference
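To turn the plot into a p-value, count how often the permuted differences were at least as extreme as the observed one. The two-tailed count sketched below mirrors the two-sample t-test this replaces:

# Two-tailed p-value from the permutation distribution
p_value = np.mean(np.abs(mean_perms['mean_perms']) >= abs(mean_diff))
p_value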
How is this different from previous Altair plots?
Why does this code work?
Is our result statistically significant? Is it practically significant?
Determine if users spend significantly more time on Page B than they do on Page A.