CIS 241, Dr. Ladd
John Tukey pioneered Exploratory Data Analysis, beginning with his work in 1962 and continuing with his 1977 book on the subject.
The dogs’ heights (in mm) are: 600, 470, 170, 430, and 300.
Mean, AKA “average”
\(\dfrac{600+470+170+430+300}{5} = 394\)
Median, AKA “50th percentile”
600, 470, 170, 430, 300
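As a quick check on the two measures above, here is a minimal sketch that computes the mean and median of the five dog heights with pandas. The variable names (heights, mean_height, median_height) are illustrative and not from the slides.

```python
import pandas as pd

# Dog heights in mm, copied from the slide
heights = pd.Series([600, 470, 170, 430, 300])

# Mean: add up the values and divide by how many there are
mean_height = heights.mean()

# Median: the middle value once the data is sorted (170, 300, 430, 470, 600)
median_height = heights.median()

print(f"Mean: {mean_height}, median: {median_height}")
```

The mean matches the hand calculation of 394. The median is 430, the third of the five sorted values.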
Percentile, AKA “quantile”
600, 470, 170, 430, 300
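Percentiles can be looked up the same way. This sketch assumes the pandas quantile() method with its default linear interpolation, and the choice of the 25th and 75th percentiles is only an example.

```python
import pandas as pd

# Dog heights in mm, copied from the slide
heights = pd.Series([600, 470, 170, 430, 300])

# quantile() takes a fraction between 0 and 1, so 0.25 is the 25th percentile
q25 = heights.quantile(0.25)
q50 = heights.quantile(0.50)  # the 50th percentile is the median
q75 = heights.quantile(0.75)

print(f"25th: {q25}, 50th: {q50}, 75th: {q75}")
```

With five sorted values (170, 300, 430, 470, 600), the 25th, 50th, and 75th percentiles fall exactly on the second, third, and fourth values.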
600, 470, 170, 430, 300
Outlier, AKA “extreme value”
Population variance: \(\dfrac{206^2+76^2+(-224)^2+36^2+(-94)^2}{5} = 21{,}704\)
The standard deviation gives us a “standard” way of knowing what is normal, and what is extra large or extra small.
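The variance calculation above can be retraced step by step. This is a sketch of the arithmetic only, with illustrative variable names: subtract the mean from each height, square the differences, and average them by dividing by N.

```python
import pandas as pd

# Dog heights in mm, copied from the slide
heights = pd.Series([600, 470, 170, 430, 300])

# Distance of each height from the mean (394): 206, 76, -224, 36, -94
deviations = heights - heights.mean()

# Square each deviation so negatives and positives do not cancel out
squared_deviations = deviations ** 2

# Population variance: divide the sum of squares by N
population_variance = squared_deviations.sum() / len(heights)

# Standard deviation is the square root of the variance
population_std = population_variance ** 0.5

print(f"Variance: {population_variance:.0f}, standard deviation: {population_std:.1f}")
```

The standard deviation comes out to roughly 147 mm, so a dog more than about 147 mm taller or shorter than the mean stands out from the group.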
Were these the results you expected?
When you have “N” data values:
The Entire Population: divide by N when calculating variance (like we did)
A Sample: divide by N-1 when calculating variance
Sample variance: \(\dfrac{108{,}520}{4}=27{,}130\)
Sample standard deviation: \(\sqrt{27{,}130}\approx 164.7\)
Think of it as a “correction” when your data is only a sample. Pandas does this by default!
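To see the default in action, this sketch compares the pandas var() and std() methods with and without the ddof argument. ddof=1 (the pandas default) divides by N-1, and ddof=0 divides by N. The variable names are illustrative.

```python
import pandas as pd

# Dog heights in mm, copied from the slide
heights = pd.Series([600, 470, 170, 430, 300])

# Default behavior: treat the data as a sample (divide by N-1)
sample_variance = heights.var()
sample_std = heights.std()

# ddof=0: treat the data as the entire population (divide by N)
population_variance = heights.var(ddof=0)
population_std = heights.std(ddof=0)

print(f"Sample: variance {sample_variance:.0f}, std {sample_std:.1f}")
print(f"Population: variance {population_variance:.0f}, std {population_std:.1f}")
```

The sample figures are always a little larger than the population figures, which is the "correction" at work.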
The histogram for miles per gallon highway in the mpg dataset.
In a normal distribution, about 95% of the values lie within 2 standard deviations of the mean.
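One way to convince yourself of this rule is to simulate it. The sketch below uses NumPy rather than the course datasets, and the seed, sample size, and variable names are arbitrary choices for illustration.

```python
import numpy as np

# Draw 100,000 values from a normal distribution (mean 0, standard deviation 1)
rng = np.random.default_rng(seed=241)
values = rng.normal(loc=0, scale=1, size=100_000)

mean = values.mean()
std = values.std()

# True for every value that sits within 2 standard deviations of the mean
within_two_std = np.abs(values - mean) <= 2 * std

# The mean of a True/False array is the share of True values
share_within = within_two_std.mean()

print(f"Share within 2 standard deviations: {share_within:.1%}")
```

The printed share should land close to 95%. Running the same check on data that is not normally distributed can give a noticeably different answer.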
Be careful: normal distributions are assumed for many statistical analyses!
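The box plot and the histogram show the same distribution in different ways: the box plot summarizes the median, quartiles, and outliers, while the histogram shows the overall shape.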
We’ll talk more about correlation in a couple weeks!
I can grab a sample of 5 observations. Then I can “resample” 5 more. Then 5 more, and so on and so on.
Replacement means an item is returned to the sample before the next draw (i.e. you could wind up with the same observation multiple times).
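Here is a small sketch of the difference replacement makes, using the pandas sample() method. The dogs dataframe below is a stand-in built from the five heights on the earlier slide, and random_state is set only so the draw is repeatable.

```python
import pandas as pd

# Stand-in dataframe (assumption): one height column with the five dog heights
dogs = pd.DataFrame({"height": [600, 470, 170, 430, 300]})

# With replacement: each draw picks from all 5 rows, so repeats are possible
with_replacement = dogs.sample(n=len(dogs), replace=True, random_state=1)

# Without replacement: every row can be picked only once, so this is a shuffle
without_replacement = dogs.sample(n=len(dogs), replace=False, random_state=1)

print("With replacement:", list(with_replacement.height))
print("Without replacement:", list(without_replacement.height))
```

Sampling all five rows without replacement always returns the same five heights in a new order, so its mean never changes. Resampling with replacement is what lets the mean vary from one sample to the next.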
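Work with a partner to calculate the confidence interval for the mean height in the dogs dataframe (remember to use your Pandas cheatsheet and past slideshows):
1. Find the number of rows in the dataframe and assign it to n_rows.
2. Write a for loop through a list of 5000 numbers using range(5000).
3. Inside the loop, take a sample with the sample() function, where n= your number of rows and replace=True.
4. Calculate the .mean() of the height column of each sample, and assign it to a variable.
5. Use .append() to add your sample means to an empty list.
6. Convert the list to a Series with pd.Series(your_list).
7. Use .quantile() to find the top and bottom percentiles of the Series.
8. print() your results using f-strings.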
import pandas as pd # The dogs dataframe is assumed to be loaded already.
n_rows = dogs.shape[0] # First find the number of rows in the dataframe.
# Then take a sample with replacement from your dataset, and
# Calculate the mean each time. Do this 5000 times.
bootstrap_samples = []
for i in range(5000):
    sample_mean = dogs.sample(n=n_rows, replace=True).height.mean()
    bootstrap_samples.append(sample_mean)
# Put the results into a pandas Series object
bootstrap_samples = pd.Series(bootstrap_samples)
# Calculate the 95th percentile and the 5th percentile.
top_percentile = bootstrap_samples.quantile(.95)
bottom_percentile = bootstrap_samples.quantile(.05)
# Print the results using nice f-strings.
print(f"The mean dog height in our data is {dogs.height.mean():.3f}, with a 90% confidence interval of {bottom_percentile:.3f} to {top_percentile:.3f}.")