```python
import pandas as pd
import altair as alt
import numpy as np

mpg = pd.read_csv("workshops/sample_data/mpg.csv")
mpg.drop("Unnamed: 0", axis=1, inplace=True)
```
This file includes code and examples for explaining graphs and statistical output in DA101. Communicating results is a crucial part of good data analysis, and we try to communicate all results completely and accurately and in terms of the data.
These short examples are designed to give you general guidance. I cannot provide a comprehensive example or answer that you could “copy” every time to have an A+ explanation, but I can provide an example and some pointers to help you get started.
(This is adapted from “How to explain in DA101”.)
The way you explain your graphs will change throughout the semester as you learn more about what each kind of graph shows and gain more technical vocabulary for describing it, including how to visually interpret summary statistics and how to spot potentially significant differences or outliers.
At the beginning of class (let’s say weeks 1-2) I won’t assume you have prior technical knowledge of data analysis, and it is okay to stick to general, observational descriptions of what you’re seeing in a graph that you make. What are you noticing? What stands out to you? Do you see anything that looks like a pattern in the points or that indicates similarity among groups?
Later on (let’s say weeks 3+) you will increasingly gain technical language to be able to talk about your graphs and describe your observations. As you gain these skills you can still describe what you are noticing and seeing in your graphs, but you will increasingly describe summary statistics…
```python
alt.Chart(mpg, title="Fuel Efficiency Among Different Vehicle Types").mark_boxplot(size=40).encode(
    x=alt.X("class:N", title="Type or Class of Vehicle"),
    y=alt.Y("hwy:Q", title="Highway miles per gallon"),
    color=alt.Color("class:N", legend=None)
).properties(width=500)
```
This boxplot of the `mpg` data set shows the distributions of highway fuel efficiency across the seven different kinds of cars in the data. Pickups and SUVs seem to have lower fuel efficiency than the other cars, which makes sense because they are bigger, heavier vehicles. Smaller vehicles like compact and midsize cars have greater fuel efficiency, and subcompacts have similarly high fuel efficiency but the data seems to be more spread out because the box is longer. Overall it seems like vehicle class is related to fuel efficiency, with smaller cars tending to have greater efficiency.
This boxplot of the `mpg` data set shows the distributions of highway fuel efficiency across the seven different kinds of cars in the data. Pickups and SUVs have medians and interquartile ranges well below the other vehicles, suggesting a statistically significant difference. Compact cars have a nearly identical IQR and median to midsize cars, as evidenced by the size of the two boxes, though there are a handful of outliers in the compact group. Subcompacts have a larger interquartile range than any of the other groups, which suggests greater variability in their fuel efficiency distribution. Overall the graph suggests that larger vehicle classes tend to have lower fuel efficiency distributions, while smaller vehicles seem to have greater fuel efficiency.
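If you want to back up statements about medians and IQRs with actual numbers, you can compute them directly with pandas. The sketch below uses a tiny made-up stand-in for the data so it runs on its own; in your notebook you would run the same `groupby` on the real `mpg` dataframe.

```python
import pandas as pd

# Tiny made-up stand-in for the mpg data (hypothetical values);
# in the notebook, group the real `mpg` dataframe the same way.
mini_mpg = pd.DataFrame({
    "class": ["compact", "compact", "compact", "suv", "suv", "suv"],
    "hwy":   [29, 27, 31, 17, 19, 18],
})

# Median and interquartile range of highway mpg for each vehicle class
summary = mini_mpg.groupby("class")["hwy"].agg(
    median="median",
    q1=lambda s: s.quantile(0.25),
    q3=lambda s: s.quantile(0.75),
)
summary["iqr"] = summary["q3"] - summary["q1"]
print(summary)
```

Quoting a median or IQR alongside the boxplot description is an easy way to make a visual claim concrete.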
```python
alt.Chart(mpg, title="Relationship Between Engine Size and City Fuel Efficiency").mark_point().encode(
    x=alt.X("displ:Q", title="Engine Displacement, in liters").scale(zero=False),
    y=alt.Y("cty:Q", title="City miles per gallon").scale(zero=False)
).properties(width=500).interactive()
```
This scatter plot of the `mpg` data set shows the relationship between the size of a car’s engine (using the engine displacement variable) and a car’s city fuel efficiency. Because of the downward slope of the dots as the graph goes from left to right, it appears that as engines get bigger the city fuel efficiency gets smaller. After about 4.5 liters the slope levels off, suggesting there isn’t as strong a relationship past this point. Overall, we could conclude that a car’s city fuel efficiency may partially depend on the size of the engine.
This scatter plot of the `mpg` data set shows the relationship between the size of a car’s engine (using the engine displacement variable) and a car’s city fuel efficiency. There looks to be a negative correlation between the two variables: as engine displacement goes up, city miles per gallon goes down. Adding a line of best fit to this graph or calculating a correlation coefficient would give us a better indication of the possible correlation. After about 4.5 liters, the points no longer slope downward, which may indicate that after a certain threshold, engine displacement has no direct correlation to fuel efficiency. Overall we could conclude that our dependent variable, city miles per gallon, negatively correlates with our independent variable, engine displacement, and therefore that as engine size gets larger fuel efficiency drops.
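The explanation above mentions that calculating a correlation coefficient would strengthen the claim. Here is a quick sketch of how you might do that with pandas; the numbers below are made up for illustration, and in your notebook you would call `.corr()` on the real `displ` and `cty` columns.

```python
import pandas as pd

# Made-up displacement and city-mpg values, standing in for the real columns
demo = pd.DataFrame({
    "displ": [1.8, 2.0, 2.8, 3.1, 4.6, 5.7],
    "cty":   [21, 20, 16, 15, 11, 10],
})

# Pearson's r: near -1 means a strong negative linear relationship
r = demo["displ"].corr(demo["cty"])
print(f"correlation between displacement and city mpg: {r:.2f}")
```

A single number like this pairs well with the visual description: "the scatter plot slopes downward, and the correlation coefficient of r confirms a strong negative relationship."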
We will learn several statistical tests and models throughout the semester. In data analysis, there is much more to do than to simply write the code for the model and generate “correct” output or report a p-value. In most cases, explaining the output and validation from the models will require several sentences that help to translate the quantitative results in terms of the data. In general, when running these models and interpreting output, there are a few key things to keep in mind.
I’ll provide a few examples below to walk through a t-test, a correlation test, and a linear regression. These are not “perfect” or “set in stone” formats for explaining, but rather think of them as an aid to thought to help guide you in your journey of learning how to explain and translate like a data analyst.
```python
def simulate_two_groups(data1, data2):
    n, m = len(data1), len(data2)
    data = np.append(data1, data2)
    np.random.shuffle(data)
    group1 = data[:n]
    group2 = data[n:]
    return group1.mean() - group2.mean()
```
```python
compact_hwy = mpg[mpg["class"] == "compact"].hwy
midsize_hwy = mpg[mpg["class"] == "midsize"].hwy

print(f"Mean mpg of compact: {compact_hwy.mean():.2f}")
print(f"Mean mpg of midsize: {midsize_hwy.mean():.2f}")

observed_diff = compact_hwy.mean() - midsize_hwy.mean()
print(f"Difference in means of compact and midsize cars: {observed_diff:.3f} miles per gallon")

mean_perms = pd.DataFrame({"mean_perms": [simulate_two_groups(compact_hwy, midsize_hwy) for i in range(10000)]})
```
```
Mean mpg of compact: 28.30
Mean mpg of midsize: 27.29
Difference in means of compact and midsize cars: 1.005 miles per gallon
```
```python
# Don't limit the data
alt.data_transformers.disable_max_rows()

# Create a histogram of the permuted differences
histogram = alt.Chart(mean_perms).mark_bar().encode(
    x=alt.X("mean_perms:Q").bin(maxbins=50),
    y=alt.Y("count():Q")
).properties(width=500)

# Add the observed difference to the dataframe
mean_perms = mean_perms.assign(mean_diff=observed_diff)

# Add a vertical line at the observed difference
observed_difference = alt.Chart(mean_perms).mark_rule(color="red", strokeDash=(8, 4)).encode(
    x=alt.X("mean_diff")
)

# Combine the two plots
histogram + observed_difference
```
```python
p_value = np.mean(mean_perms.mean_perms > observed_diff)
print(f"p-value = {p_value}")
```

```
p-value = 0.0641
```
The permutation test suggests there is no significant difference in the mean highway miles per gallon of the midsize and compact vehicle classes (p=0.06), though the p-value is very close to 0.05. I expected this because these vehicles are very similar in size and because their ranges seem to overlap on the boxplot. The true difference in the means is 1.005 miles per gallon, which doesn’t seem like very much. The mean highway fuel efficiency for compact cars was 28.30 mpg and the mean for midsize cars was 27.29 mpg. The result is not statistically significant, and it’s not practically significant either (1 more mile per gallon doesn’t seem like that much greater fuel efficiency).
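One thing to keep in mind: the p-value computed above is one-sided (it only counts permuted differences larger than the observed one). If you wanted a two-sided p-value, which is what a standard t-test reports, you could count permuted differences that are at least as extreme in absolute value. The sketch below uses made-up groups so it runs on its own; in your notebook you would reuse `simulate_two_groups` with the real `compact_hwy` and `midsize_hwy` series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up samples standing in for compact_hwy and midsize_hwy
group_a = np.array([28.0, 29.0, 27.0, 30.0, 28.5])
group_b = np.array([27.0, 28.0, 26.5, 27.5, 27.0])
obs_diff = group_a.mean() - group_b.mean()

# Permute the pooled data many times and record the difference in means
pooled = np.append(group_a, group_b)
perm_diffs = []
for _ in range(10000):
    rng.shuffle(pooled)
    perm_diffs.append(pooled[:len(group_a)].mean() - pooled[len(group_a):].mean())
perm_diffs = np.array(perm_diffs)

# Two-sided: count permuted differences at least as extreme (in absolute
# value) as the observed one
p_two_sided = np.mean(np.abs(perm_diffs) >= abs(obs_diff))
print(f"two-sided p-value = {p_two_sided:.3f}")
```

Whichever version you use, say which one it is when you report the result.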
```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
```
Target variable: miles per gallon city (cty)
Predictor variables: engine displacement (displ), model year (year), number of cylinders (cyl), vehicle class (class)
```python
target = "cty"
predictors = ["displ", "year", "cyl", "class"]
```
```python
alt.Chart(mpg).mark_point().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative')
).properties(
    width=150,
    height=150
).repeat(
    row=["displ", "year", "cyl"],
    column=["displ", "year", "cyl"]
)
```
```python
mpg[predictors].corr(numeric_only=True)
```

|       | displ    | year     | cyl      |
|-------|----------|----------|----------|
| displ | 1.000000 | 0.147843 | 0.930227 |
| year  | 0.147843 | 1.000000 | 0.122245 |
| cyl   | 0.930227 | 0.122245 | 1.000000 |
The pairplot and correlation matrix above show correlations for the three numerical predictor variables I chose (engine displacement, model year, and number of cylinders). The last predictor variable, vehicle class, was excluded because it is categorical. As you can see from the steep regression line in the pairplot and the high correlation coefficient of 0.93, engine displacement and number of cylinders are highly correlated. Including two such collinear predictors would make the regression’s coefficient estimates unreliable, so I will exclude number of cylinders going forward and use only engine displacement, model year, and vehicle class.
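A common way to quantify this kind of collinearity beyond eyeballing the correlation matrix is the variance inflation factor (VIF): for each predictor, regress it on the remaining predictors and compute 1 / (1 − R²). The sketch below is an illustration of the idea, not part of the original analysis: it uses made-up data with two deliberately collinear columns standing in for `displ` and `cyl`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up predictors: `fake_cyl` is (almost) a linear function of
# `fake_displ`, mimicking the displ/cyl collinearity in the real data
rng = np.random.default_rng(0)
fake_displ = rng.uniform(1.5, 6.0, size=100)
fake_cyl = 1.3 * fake_displ + rng.normal(0, 0.2, size=100)
fake_year = rng.integers(1999, 2009, size=100).astype(float)
X_demo = pd.DataFrame({"displ": fake_displ, "cyl": fake_cyl, "year": fake_year})

# VIF for each predictor: 1 / (1 - R^2) from regressing it on the others.
# A VIF well above ~5 is a common warning sign of problematic collinearity.
vifs = {}
for col in X_demo.columns:
    others = X_demo.drop(columns=col)
    r2 = LinearRegression().fit(others, X_demo[col]).score(others, X_demo[col])
    vifs[col] = 1 / (1 - r2)
    print(f"VIF for {col}: {vifs[col]:.1f}")
```

Here `displ` and `cyl` get very large VIFs while `year` stays near 1, matching the decision to drop one of the collinear pair.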
```python
predictors = ["displ", "year", "class"]

X = pd.get_dummies(mpg[predictors], drop_first=True)
Y = mpg[target]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    Y,
    test_size=0.4,
    random_state=0)

our_model = LinearRegression()
our_model.fit(X_train, y_train)

print(f"Intercept: {our_model.intercept_:.3f}")
for c, p in zip(our_model.coef_, X.columns):
    print(f"Coefficient for {p}: {c:.4f}")
```
```
Intercept: -50.633
Coefficient for displ: -2.1743
Coefficient for year: 0.0399
Coefficient for class_compact: -4.4195
Coefficient for class_midsize: -4.3175
Coefficient for class_minivan: -6.2129
Coefficient for class_pickup: -6.6474
Coefficient for class_subcompact: -3.6546
Coefficient for class_suv: -5.7987
```
```python
predictions = our_model.predict(X_test)
residuals = y_test - predictions

print(f"Root mean squared error: {np.sqrt(mean_squared_error(y_test, predictions)):.2f}")
print(f"Coefficient of determination (R-squared): {r2_score(y_test, predictions):.2f}")
```

```
Root mean squared error: 2.59
Coefficient of determination (R-squared): 0.69
```
This linear regression model looks at the effects that engine displacement in liters, the year the vehicle was made, and the type or class of vehicle have on city miles per gallon. The coefficient for engine displacement suggests a negative relationship: for each additional liter of engine displacement, city miles per gallon decreases by 2.17. The coefficient for model year suggests that fuel efficiency increases very slightly (0.04 mpg) with each later model year. Compared to our “baseline” category of a two-seater car, all other vehicle classes have lower fuel efficiency. All of these coefficients make sense: we would expect newer, smaller cars (like a two-seater with low engine displacement) to have greater fuel efficiency.
\(R^2\) is .69, suggesting that 69% of the variation in city fuel efficiency is accounted for by the model’s predictors (engine displacement, model year, and vehicle class). I am unsure whether this result is practically significant: for a mechanical process like the fuel efficiency of an engine, we might expect an \(R^2\) higher than this. The root mean squared error is 2.59, meaning the model’s predictions are off by more than 2 miles per gallon on average. This seems like a fair amount and raises some questions about the accuracy of the model.
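One sanity check for whether an RMSE of 2.59 is “a fair amount” is to compare it against a naive baseline that always predicts the mean of the test set (whose RMSE equals the standard deviation of the test values). The numbers below are made up for illustration; in your notebook you would use the real `y_test` and `predictions`.

```python
import numpy as np

# Made-up test-set values and predictions, standing in for the real
# y_test and predictions from the model above
y_demo = np.array([18.0, 21.0, 16.0, 24.0, 19.0, 15.0])
pred_demo = np.array([17.5, 20.0, 17.0, 23.0, 20.0, 16.0])

model_rmse = np.sqrt(np.mean((y_demo - pred_demo) ** 2))
baseline_rmse = np.sqrt(np.mean((y_demo - y_demo.mean()) ** 2))  # = std of y_demo

print(f"model RMSE:    {model_rmse:.2f}")
print(f"baseline RMSE: {baseline_rmse:.2f}")
```

A model is only useful if its RMSE is clearly below the baseline; how far below is one way to translate “accuracy” into terms of the data.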
```python
results = pd.DataFrame({'Predictions': predictions, 'Residuals': residuals})

alt.Chart(results, title="Histogram of Residuals").mark_bar().encode(
    x=alt.X('Residuals:Q', title="Residuals").bin(maxbins=20),
    y=alt.Y('count():Q', title="Value Counts")
).properties(width=500)
```
The histogram above shows the distribution of residuals for the model. While it appears that the residuals are centered near 0, there are some outliers to the right of the graph that prevent the residuals from having a normal distribution. This suggests that our model may not be totally reliable.
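If you want a number to go with the visual claim about right-tail outliers, the skewness of the residuals is a quick check: a clearly positive value means the distribution has a longer right tail. The residuals below are made up for illustration; in your notebook you would call `.skew()` on `results["Residuals"]`.

```python
import pandas as pd

# Made-up residuals with a long right tail, standing in for results["Residuals"]
demo_residuals = pd.Series([-1.2, -0.8, -0.5, -0.1, 0.0, 0.3, 0.6, 0.9, 4.5, 6.0])

# Positive skewness => longer right tail; roughly symmetric residuals
# would have skewness near 0
print(f"skewness: {demo_residuals.skew():.2f}")
```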
```python
# Plot the absolute value of residuals against the predicted values
chart = alt.Chart(results, title="Testing for Heteroskedasticity").mark_point().encode(
    x=alt.X('Predictions:Q').title("Predicted Values").scale(zero=False),
    y=alt.Y('y:Q').title("Absolute value of Residuals")
).transform_calculate(y='abs(datum.Residuals)').properties(width=500)

chart + chart.transform_loess('Predictions', 'y').mark_line()
```
The above plot shows the predicted values plotted against the absolute value of the residuals. As in the histogram of residuals, we see a few outliers that slightly skew the results. But overall the line across the plot is mostly horizontal, suggesting that we do not see much heteroskedasticity in our model. While our model may not be ideal, it is probably valid.
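If you wanted to go beyond the visual check, a rough numeric version of the same idea is to regress the squared residuals on the predicted values (this is the intuition behind formal tests like Breusch-Pagan): an auxiliary R² near zero means the residual spread does not grow with the prediction. The sketch below uses made-up homoskedastic data so it runs on its own; in your notebook you would use the columns of the real `results` dataframe.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up predictions with residual spread that does NOT depend on the
# prediction (homoskedastic), standing in for the real results dataframe
rng = np.random.default_rng(0)
demo_pred = rng.uniform(10, 30, size=200)
demo_resid = rng.normal(0, 2, size=200)

# Auxiliary regression: squared residuals on predictions.
# An R^2 near 0 means the residual spread is flat across predictions.
aux = LinearRegression().fit(demo_pred.reshape(-1, 1), demo_resid ** 2)
aux_r2 = aux.score(demo_pred.reshape(-1, 1), demo_resid ** 2)
print(f"auxiliary R^2: {aux_r2:.3f}")
```

This complements, rather than replaces, the loess line: the plot shows you where any spread appears, and the auxiliary R² summarizes how strong it is.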