CIS 241, Dr. Ladd
AKA “average”
\(\dfrac{600+470+170+430+300}{5} = 394\)
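The same arithmetic can be checked in a few lines of pandas. This is a minimal sketch: the variable name `heights` is an assumption for illustration, holding the five values from the formula above.

```python
import pandas as pd

# The five values from the worked example (the name `heights` is illustrative).
heights = pd.Series([600, 470, 170, 430, 300])

# Mean: add up the values, then divide by how many there are.
mean_by_hand = heights.sum() / len(heights)

# pandas has this built in.
mean_pandas = heights.mean()

print(mean_by_hand, mean_pandas)
```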
AKA “50th percentile”
AKA “quantile”
AKA “extreme value”
\(\dfrac{206^2+76^2+(-224)^2+36^2+(-94)^2}{5} = 21,704\)
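Each number being squared above is one value's distance from the mean of 394. A rough sketch of that calculation, again assuming the five values are stored in an illustrative Series called `heights`:

```python
import pandas as pd

# The five values from the worked example (the name `heights` is illustrative).
heights = pd.Series([600, 470, 170, 430, 300])

# Deviation: how far each value sits from the mean.
deviations = heights - heights.mean()

# Variance when dividing by N: the average of the squared deviations.
population_variance = (deviations ** 2).sum() / len(heights)

# Standard deviation is the square root of the variance.
population_std = population_variance ** 0.5

print(deviations.tolist())
print(population_variance, round(population_std, 1))
```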
Rottweilers are tall, and dachshunds are short, judged by how many standard deviations they sit from the mean.
Were these the results you expected?
When you have “N” data values:
Sample variance: \(\dfrac{108,520}{4}=27,130\)
Sample standard deviation: \(\sqrt{27,130} \approx 164.7\)
Think of dividing by N − 1 instead of N as a “correction” when your data is only a sample. Pandas does this by default!
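To see the default in action, here is a small sketch comparing the two versions. The `ddof` argument of `Series.var()` and `Series.std()` controls the divisor (N minus `ddof`); the variable names are assumptions for illustration.

```python
import pandas as pd

# The five values from the worked example (the name `heights` is illustrative).
heights = pd.Series([600, 470, 170, 430, 300])

# pandas default is ddof=1, so this divides by N - 1 (sample variance).
sample_variance = heights.var()

# Passing ddof=0 divides by N instead.
population_variance = heights.var(ddof=0)

# std() uses the same default, so this is the sample standard deviation.
sample_std = heights.std()

print(sample_variance, population_variance, round(sample_std, 1))
```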
Be careful: normal distributions are assumed for many statistical analyses!
We’ll talk more about correlation in a couple weeks!
I can grab a sample of 5 observations. Then I can “resample” 5 more. Then 5 more, and so on and so on.
Replacement means an item is returned to the sample before the next draw (i.e. you could wind up with the same observation multiple times).
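A single draw with and without replacement might look like the sketch below. The Series `heights` and the fixed `random_state` are assumptions added for illustration so the example can be rerun with the same result.

```python
import pandas as pd

# Five illustrative values (the name `heights` is an assumption).
heights = pd.Series([600, 470, 170, 430, 300])

# With replacement: each draw comes from the full set of values,
# so the same observation can show up more than once.
resample = heights.sample(n=len(heights), replace=True, random_state=0)

# Without replacement: every observation appears exactly once,
# so this is only a reshuffle of the original values.
shuffled = heights.sample(n=len(heights), replace=False, random_state=0)

print(resample.tolist())
print(shuffled.tolist())
```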
import pandas as pd  # Needed for pd.Series below.

# `dogs` is assumed to be a DataFrame with a numeric `height` column, loaded earlier.
n_rows = dogs.shape[0]  # First find the number of rows in the dataframe.
# Then take a sample with replacement from your dataset, and
# calculate the mean each time. Do this 5000 times.
bootstrap_samples = [dogs.sample(n=n_rows, replace=True).height.mean()
                     for i in range(5000)]
# Put the results into a pandas Series object
bootstrap_samples = pd.Series(bootstrap_samples)
# Calculate the 95th percentile and the 5th percentile.
top_percentile = bootstrap_samples.quantile(.95)
bottom_percentile = bootstrap_samples.quantile(.05)
# Print the results using nice f-strings.
print(f"""The mean dog height in our data is {dogs.height.mean():.3f},
with a 90% confidence interval
of {bottom_percentile:.3f} to {top_percentile:.3f}.""")