Resampling and Permutation

DA 101, Dr. Ladd

How do you solve a problem like a T-Test?

T-Tests rely on assumptions.

  • Normality (Are both samples normally distributed?)
  • Equality of Variance (Are the variables spread out roughly the same amount?)

If either of these assumptions isn’t met, our t-test could be misleading!

“Parametric” hypothesis tests were designed to solve a problem before computers existed, but now there are other approaches.

Resampling

Resampling is simply drawing multiple random samples from observed data.

I can grab a sample of 5 observations. Then I can “resample” 5 more. Then 5 more, and so on and so on.
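Here is a minimal sketch of that idea in R. The values in observed and the seed are made up for illustration; they are not course data.

```r
# Hypothetical observed data (made-up values for illustration only)
observed <- c(12, 15, 9, 21, 18, 14, 11, 17, 20, 13)

set.seed(101)  # arbitrary seed so the draws are reproducible

# Draw 5 observations at random, then "resample" 5 more, and 5 more
first_draw  <- sample(observed, 5)
second_draw <- sample(observed, 5)
third_draw  <- sample(observed, 5)

# Each draw has its own mean, so the statistic varies from sample to sample
draw_means <- c(mean(first_draw), mean(second_draw), mean(third_draw))
print(draw_means)
```

Comparing the three means gives a first feel for how much a statistic can vary from one sample to the next.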

First proposed in the 1960s, resampling procedures weren’t practical until computing took off in the 1980s.

Resampling is an umbrella term. It can include:

  • The Bootstrap, used to assess the reliability of an estimated statistic
  • Permutation Tests, used as an alternative to parametric hypothesis tests

You can resample with or without replacement.

Sampling with replacement means each item is returned to the pool before the next draw (i.e., you could wind up with the same observation multiple times).
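A small sketch of the difference, using base R's sample() and its replace argument. The labelled values and the seed are hypothetical. The last lines show a bootstrap-style use of resampling with replacement, as a rough illustration rather than a full bootstrap procedure.

```r
# Hypothetical data: five labelled observations (illustrative only)
observed <- c(a = 3, b = 7, c = 1, d = 9, e = 4)

set.seed(101)  # arbitrary seed for reproducibility

# Without replacement: each observation appears at most once
no_repl <- sample(observed, 5, replace = FALSE)

# With replacement: the same observation can be drawn more than once
with_repl <- sample(observed, 5, replace = TRUE)

# Bootstrap-style sketch: resample with replacement many times and
# see how much the mean varies across resamples
boot_means <- replicate(1000, mean(sample(observed, replace = TRUE)))
sd(boot_means)  # rough estimate of the standard error of the mean
```

Printing names(no_repl) and names(with_repl) shows which observations were drawn; only the second can contain repeats.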

Permutation Tests

A permutation test solves the t-test problem!

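It doesn’t matter whether the samples are normally distributed or whether their variance is equal. A permutation test makes no assumptions about the shape of the distribution; it assumes only that the observations are exchangeable between groups under the null hypothesis.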

Permute means to change the order of a set of values.

In a permutation test, you randomly reshuffle the observations between groups to build a permutation distribution.

A permutation distribution embodies the null hypothesis.

It shows you what the difference between the groups would look like if it were only the result of random variation.

You can create a permutation procedure to replace different kinds of statistical tests.

Let’s look at the steps of a permutation test that would replace a two-sample t-test…

  1. From the combined data of both groups, randomly resample (without replacement) a group the same size as the first group.
  2. From the remaining data, randomly resample (without replacement) a group the same size as the second group.
  3. Calculate the difference in means between the two resamples. This is one permutation.
  4. Repeat these steps as many times as you want to create a permutation distribution.
  5. Compare the observed difference in the real groups to the permutation distribution.

You can use the permutation distribution to calculate a p-value.
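To make "one permutation" concrete, here is a sketch of steps 1 to 3 carried out a single time on two tiny made-up groups. The values, group names, and seed are assumptions for illustration only.

```r
# Hypothetical toy groups (made-up values, not course data)
group_a <- c(21, 19, 24, 22)
group_b <- c(18, 20, 17)

# The observed difference in means between the real groups
observed_diff <- mean(group_a) - mean(group_b)

set.seed(101)  # arbitrary seed for reproducibility

# Pool the data: under the null hypothesis the group labels are arbitrary
pooled <- c(group_a, group_b)

# Steps 1 and 2: shuffle the pool (no replacement), then split it
# into groups of the original sizes
shuffled <- sample(pooled)
perm_a <- shuffled[seq_along(group_a)]   # first 4 shuffled values
perm_b <- shuffled[-seq_along(group_a)]  # remaining 3 shuffled values

# Step 3: the difference in means for this single permutation
one_perm_diff <- mean(perm_a) - mean(perm_b)
```

Repeating the shuffle many times and collecting each one_perm_diff gives the permutation distribution that observed_diff is compared against.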

Permutation Tests in R

Let’s look again at the mpg dataset.

library(tidyverse)

mpg <- mpg

ggplot(mpg, aes(x=class,y=cty)) +
    geom_boxplot()

What’s the difference in means?

compact_sub <- mpg %>%
    filter(class=="compact"|class=="subcompact")

cs_means <- compact_sub %>%
    group_by(class) %>%
    summarize(avg_cty = mean(cty), count = n())

Functions create reusable code

adding_func <- function(x,n) {
    total <- x + n # Named total so it doesn't mask base R's sum()
    return(total)
}

adding_func(4,2)

First name the function and define input. Then do something and return a result!

Repeat functions with replicate

repeated_adding <- replicate(100, adding_func(4,2))
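Since adding_func(4,2) returns the same value every time, replicate becomes more interesting when the function involves randomness. A small sketch, where values and sample_mean_func are hypothetical names made up for this illustration:

```r
set.seed(101)  # arbitrary seed for reproducibility

# Hypothetical values to sample from (illustrative only)
values <- 1:20

# A function with a random result: the mean of a random sample of size n
sample_mean_func <- function(x, n) {
    mean(sample(x, n))
}

# replicate() re-evaluates the expression on each repetition,
# so each result comes from a fresh random sample
repeated_means <- replicate(100, sample_mean_func(values, 5))

length(repeated_means)  # 100 results, one per repetition
```

This is the same pattern a permutation test relies on: one function that does a single random shuffle, repeated many times.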

Create a random sample

sample(mpg$cty, 10)

Let’s create a function for permutation!

permutation_func <- function(x, nA) {
    idx_a <- sample(1:length(x), nA) # Get a sample the size of group 1
    idx_b <- setdiff(1:length(x), idx_a) # Get the rest of the data
    mean_diff <- mean(x[idx_a]) - mean(x[idx_b]) # Subtract the means
    return(mean_diff)
}

You can reuse this!

Now we can make 1000 permutations.

perm_means <- replicate(1000, permutation_func(compact_sub$cty, 47))

Where does the 47 come from?

Let’s look at the results in a histogram

ggplot(data.frame(perm_means), aes(x=perm_means)) + # Wrap the vector in a data frame for ggplot
    geom_histogram(bins = 30) +
    geom_vline(xintercept = .2, color="red")

Where did .2 come from? Can we calculate it more accurately?
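One possible way to compute the observed difference directly from the data, rather than reading it off a summary table. The names group_sizes and obs_diff are made up for this sketch, and the direction of subtraction (subcompact minus compact) is an assumption chosen so the result is positive like the red line on the histogram.

```r
library(dplyr)    # for filter() and the %>% pipe
library(ggplot2)  # provides the mpg dataset

compact_sub <- mpg %>%
    filter(class %in% c("compact", "subcompact"))

# Group sizes, so the permutation group size need not be hard-coded
group_sizes <- table(compact_sub$class)

# Observed difference in mean city mpg.
# Assumption: subcompact minus compact, so the sign is positive.
obs_diff <- mean(compact_sub$cty[compact_sub$class == "subcompact"]) -
    mean(compact_sub$cty[compact_sub$class == "compact"])

group_sizes
obs_diff
```

Note that permutation_func subtracts the mean of the remaining data from the mean of the group of size nA, so for a one-directional comparison the observed difference should be computed in the matching order.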

Finally, we can calculate a p-value:

mean(perm_means > .2)

Why does this code work?
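A short sketch of the mechanics, using a tiny made-up vector in place of the real permutation results. The name toy_perm and its values are hypothetical. The two-sided variant at the end is an alternative to the one-sided calculation, not something the code above does.

```r
# Hypothetical permutation results (made-up values for illustration)
toy_perm <- c(-0.5, 0.1, 0.3, -0.2, 0.4)

# A comparison on a numeric vector yields one TRUE/FALSE per element
toy_perm > 0.2  # FALSE FALSE TRUE FALSE TRUE

# mean() treats TRUE as 1 and FALSE as 0, so the mean of a logical
# vector is the proportion of TRUE values
p_one_sided <- mean(toy_perm > 0.2)  # 2 of 5 values, i.e. 0.4

# Two-sided variant: count results at least as extreme in either direction
p_two_sided <- mean(abs(toy_perm) >= 0.2)  # 4 of 5 values, i.e. 0.8
```

So the p-value is just the share of permutations that produced a difference at least as large as the one observed.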

Is our result statistically significant? Is it practically significant?

You Try It!

Permutation Exercise

Determine if users spend significantly more time on Page B than they do on Page A.

  1. Download web_page_data.csv.
  2. Make a boxplot of session times for Pages A and B.
  3. Calculate the observed difference in means.
  4. Run 2000 permutations of randomly resampled groups.
  5. Make a histogram of permutation results and show the observed difference as a vertical line.
  6. Calculate the p-value for your permutation test.