CIS 241, Dr. Ladd
AKA “average”
\(\dfrac{600+470+170+430+300}{5} = 394\)
AKA “50th percentile”
AKA “quantile”
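As a quick check on the terms above, here is a minimal pandas sketch that computes the mean, the median, and a percentile for the five values from the mean calculation. The variable names are illustrative assumptions, not names used elsewhere in these slides.

```python
import pandas as pd

# The five values from the mean calculation above.
values = pd.Series([600, 470, 170, 430, 300])

mean_value = values.mean()        # sum of the values divided by how many there are
median_value = values.median()    # the middle value once the data is sorted
p50 = values.quantile(0.5)        # the 50th percentile, which is the median
p25 = values.quantile(0.25)       # any other percentile works the same way

print(f"mean={mean_value}, median={median_value}, 50th percentile={p50}, 25th percentile={p25}")
```

Note that `quantile()` takes a proportion between 0 and 1, so the 50th percentile is written as `0.5`.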
AKA “extreme value”
\(\dfrac{206^2+76^2+(-224)^2+36^2+(-94)^2}{5} = 21,704\)
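The variance calculation above can be retraced step by step in plain Python. This is only a sketch of the arithmetic on the slide, using the same five values; the variable names are assumptions chosen for readability.

```python
values = [600, 470, 170, 430, 300]
n = len(values)

# Step 1: the mean.
mean = sum(values) / n

# Step 2: how far each value sits from the mean.
deviations = [v - mean for v in values]

# Step 3: square each deviation so negatives do not cancel positives.
squared_deviations = [d ** 2 for d in deviations]

# Step 4: average the squared deviations (dividing by N, as on the slide).
population_variance = sum(squared_deviations) / n

# Step 5: the standard deviation is the square root of the variance.
population_std = population_variance ** 0.5

print(f"deviations: {deviations}")
print(f"variance: {population_variance}, standard deviation: {population_std:.2f}")
```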
Were these the results you expected?
When you have “N” data values:
The Entire Population: divide by N when calculating variance (like we did)
A Sample: divide by N-1 when calculating variance
Sample variance: \(\dfrac{108,520}{4}=27,130\)
Sample standard deviation: \(\sqrt{27,130}\approx 164.7\)
Think of it as a “correction” when your data is only a sample. Pandas does this by default!
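To see the N versus N-1 choice in practice, the sketch below runs the same five values through pandas and NumPy. The `ddof` argument ("delta degrees of freedom") is subtracted from N in the denominator: pandas defaults to `ddof=1`, while NumPy's `np.var` defaults to `ddof=0`. Variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = pd.Series([600, 470, 170, 430, 300])

# pandas default is ddof=1, so this divides by N-1 (sample variance).
sample_variance = values.var()

# Ask for ddof=0 to divide by N instead (population variance).
population_variance = values.var(ddof=0)

# std() follows the same default, so this is the sample standard deviation.
sample_std = values.std()

# NumPy's default is ddof=0, so it gives the population variance.
numpy_variance = np.var(values.to_numpy())

print(f"sample variance: {sample_variance:.0f}")
print(f"population variance: {population_variance:.0f}")
print(f"sample standard deviation: {sample_std:.1f}")
print(f"NumPy default variance: {numpy_variance:.0f}")
```

Because the two libraries have different defaults, it is worth stating `ddof` explicitly whenever results from both are compared.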
Be careful: normal distributions are assumed for many statistical analyses!
We’ll talk more about correlation in a couple weeks!
I can grab a sample of 5 observations. Then I can “resample” 5 more. Then 5 more, and so on and so on.
Sampling with replacement means each item is returned to the pool before the next draw (i.e. you could wind up with the same observation multiple times).
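A small sketch of resampling with and without replacement, using the five values from earlier in these slides. The seeds passed to `random_state` are arbitrary and only make the draws repeatable; the variable names are assumptions.

```python
import pandas as pd

values = pd.Series([600, 470, 170, 430, 300])

# With replacement: each draw comes from the full set of values,
# so the same observation may show up more than once.
resample_1 = values.sample(n=5, replace=True, random_state=1)
resample_2 = values.sample(n=5, replace=True, random_state=2)

# Without replacement: drawing all 5 just shuffles the original values.
shuffled = values.sample(n=5, replace=False, random_state=1)

print("resample 1:", resample_1.tolist())
print("resample 2:", resample_2.tolist())
print("shuffled:  ", shuffled.tolist())
```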
Work with a partner to calculate the confidence interval for the mean height in the dogs dataframe (remember to use your Pandas cheatsheet and past slideshows):
1. Store the number of rows in n_rows.
2. Use a for loop to loop through a list of 5000 numbers using range(5000).
3. Use the sample() function where n= your number of rows and replace=True.
4. Use .append() to add your samples to an empty list.
5. Convert the list to a Series with pd.Series(your_list).
6. print() your results using f-strings.
import pandas as pd  # assumes the dogs dataframe has already been loaded

n_rows = dogs.shape[0]  # First find the number of rows in the dataframe.
# Then take a sample with replacement from your dataset and
# calculate the mean each time. Do this 5000 times.
bootstrap_samples = []
for i in range(5000):
    sample_mean = dogs.sample(n=n_rows, replace=True).height.mean()
    bootstrap_samples.append(sample_mean)
# Put the results into a pandas Series object
bootstrap_samples = pd.Series(bootstrap_samples)
# Calculate the 95th percentile and the 5th percentile.
top_percentile = bootstrap_samples.quantile(.95)
bottom_percentile = bootstrap_samples.quantile(.05)
# Print the results using nice f-strings.
print(f"The mean dog height in our data is {dogs.height.mean():.3f}, with a 90% confidence interval of {bottom_percentile:.3f} to {top_percentile:.3f}.")
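The solution above depends on a `dogs` dataframe loaded elsewhere. For trying the technique on its own, here is a self-contained sketch that builds a hypothetical stand-in `dogs` dataframe from the five values used earlier in these slides. The stand-in data, the seed, and the variable names are all assumptions for illustration, and five rows is far too few for a trustworthy interval.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dogs dataframe (assumption: not the real class data).
dogs = pd.DataFrame({"height": [600, 470, 170, 430, 300]})
n_rows = dogs.shape[0]

# A seeded random state makes the run repeatable; the seed itself is arbitrary.
rng = np.random.RandomState(0)

# Resample with replacement 5000 times and record the mean each time.
bootstrap_means = pd.Series([
    dogs.sample(n=n_rows, replace=True, random_state=rng).height.mean()
    for _ in range(5000)
])

# The 5th and 95th percentiles of the bootstrap means bound a 90% interval.
bottom, top = bootstrap_means.quantile([0.05, 0.95])

print(f"Mean height: {dogs.height.mean():.1f}, "
      f"90% confidence interval: {bottom:.1f} to {top:.1f}")
```

Passing a list to `quantile()` returns both percentiles in one call, which is a small convenience over the two separate calls in the solution above.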