This file includes code and examples for explaining graphs and statistical output in DA101. Communicating results is a crucial part of good data analysis, and we try to communicate all results completely and accurately and in terms of the data.
These short examples are designed to give you general guidance. I cannot provide a comprehensive example or answer that you could “copy” every time to have an A+ explanation, but I can provide an example and some pointers to help you get started.
The way you explain your graphs will change throughout the semester as you learn more details about what the graph shows and also learn more technical lingo for how to identify different aspects of the graph, including visual interpretation of summary statistics, and how to identify potentially significant differences or outliers.
In the beginning of class (let’s say weeks 1-2) I won’t assume you have prior technical knowledge of data analytics, and it is OK to stick to general, observational descriptions of what you’re seeing in a graph that you make. What are you noticing? What stands out to you? Do you see anything that looks like a pattern in the points or that indicates similarity among groups?
Later on (let’s say weeks 3+) you will increasingly gain technical language to be able to talk about your graphs and describe your observations. As you gain these skills you can still describe what you are noticing and seeing in your graphs, but you will increasingly describe summary statistics…
library(tidyverse) # loads ggplot2 (which includes the mpg data set) and dplyr
ggplot(mpg, aes(class,hwy,color=class)) +
geom_boxplot(outlier.color="gray") +
labs(title="Fuel Efficiency Among Different Vehicle Types", x="Type or Class of Vehicle", y="Highway miles per gallon") +
theme(legend.position = "none")
This boxplot of the mpg data set shows the distributions of highway fuel efficiency across the seven different kinds of cars in the data. Pickups and SUVs seem to have lower fuel efficiency than the other cars, which makes sense because they are bigger, heavier vehicles. Smaller vehicles like compact and midsize cars have greater fuel efficiency, and subcompacts have similarly high fuel efficiency but the data seems to be more spread out because the box is longer. Overall it seems like vehicle class is related to fuel efficiency, with smaller cars tending to have greater efficiency.
This boxplot of the mpg data set shows the distributions of highway fuel efficiency across the seven different kinds of cars in the data. Pickups and SUVs have medians and interquartile ranges well below the other vehicles, suggesting a potentially significant difference that a statistical test could confirm. Compact cars have a nearly identical IQR and median to midsize cars, as shown by the similar size and position of the two boxes, though there are a handful of outliers in the compact group. Subcompacts have a larger interquartile range than any of the other groups, which suggests greater variability in their fuel efficiency distribution. Overall the graph suggests that larger vehicle classes tend to have lower fuel efficiency distributions, while smaller vehicles seem to have greater fuel efficiency.
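If you want numbers to back up what you read off the boxes, one option is to compute the medians and interquartile ranges directly. This is only a sketch: the object name class_summary and the column names are my own choices for illustration, not something you are required to use.

```r
library(dplyr)
library(ggplot2) # provides the mpg data set

# One row per vehicle class: the same statistics the boxplot draws
class_summary <- mpg %>%
  group_by(class) %>%
  summarise(
    n          = n(),                                   # number of cars in the class
    median_hwy = median(hwy),                           # middle line of the box
    q1_hwy     = quantile(hwy, 0.25, names = FALSE),    # bottom edge of the box
    q3_hwy     = quantile(hwy, 0.75, names = FALSE),    # top edge of the box
    iqr_hwy    = IQR(hwy)                               # height of the box
  ) %>%
  arrange(median_hwy)                                   # lowest median first

class_summary
```

Sorting by the median makes it easy to check whether the ordering you described from the graph matches the actual values.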
ggplot(mpg, aes(displ,cty)) +
geom_jitter() + # jittered points only; also adding geom_point() would draw every car twice
labs(title="Relationship Between Engine Size and City Fuel Efficiency", x="Engine Displacement, in liters", y="City miles per gallon")
This scatter plot of the mpg data set shows the relationship between the size of a car’s engine (using the engine displacement variable) and a car’s city fuel efficiency. Because of the downward slope of the dots as the graph goes from left to right, it appears that as engines get bigger the city fuel efficiency gets smaller. After about 4.5 liters the slope levels off, suggesting there isn’t as strong a relationship past this point. Overall, we could conclude that a car’s city fuel efficiency may partially depend on the size of the engine.
This scatter plot of the mpg data set shows the relationship between the size of a car’s engine (using the engine displacement variable) and a car’s city fuel efficiency. There looks to be a negative correlation between the two variables: as engine displacement goes up, city miles per gallon goes down. Adding a line of best fit to this graph or calculating a correlation coefficient would give us a better indication of the possible correlation. After about 4.5 liters, the points no longer slope downward, which may indicate that past a certain threshold, engine displacement is no longer clearly correlated with fuel efficiency. Overall we could conclude that our dependent variable, city miles per gallon, negatively correlates with our independent variable, engine displacement, so fuel efficiency tends to drop as engine size gets larger.
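A quick way to follow up on that explanation is to calculate the correlation coefficient, both overall and on either side of the point where the pattern seems to change. This is a rough sketch: the 4.5 liter cutoff is just the value read off the graph by eye, and the object names are made up for illustration.

```r
library(ggplot2) # provides the mpg data set

# Pearson correlation across all cars
r_all <- cor(mpg$displ, mpg$cty)

# Split the cars at the 4.5 liter threshold estimated from the scatter plot
smaller_engines <- subset(mpg, displ <= 4.5)
larger_engines  <- subset(mpg, displ > 4.5)

# Correlation within each group
r_smaller <- cor(smaller_engines$displ, smaller_engines$cty)
r_larger  <- cor(larger_engines$displ, larger_engines$cty)

round(c(all = r_all, smaller = r_smaller, larger = r_larger), 2)
```

If the coefficient for the larger engines is much closer to 0 than the overall value, that would support the observation that the relationship weakens past the threshold.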
We will learn several statistical tests throughout the semester. In data analytics, there is much more to do than to simply write the code for the test and generate “correct” output or report a p-value. In most cases, explaining the output from the tests will require several sentences that help to translate the quantitative results in terms of the data. In general, when running these tests and interpreting output, there are a few key things to keep in mind.
Do you have a logical reason for running the test?
After you’ve run the test, were you able to identify and report the key values from the statistical output?
After you have identified and reported the key values, can you connect them back to the data and the question at hand?
Finally, can you describe the results in terms of statistical and practical significance?
I’ll provide a few examples below to walk through a t-test, a correlation test, and a linear regression. These are not “perfect” or “set in stone” formats for explaining; rather, think of them as aids to guide you in your journey of learning how to explain and translate like a data analyst.
compact_minivan <- filter(mpg, class=="compact"|class=="minivan")
t.test(hwy~class,compact_minivan)
##
## Welch Two Sample t-test
##
## data: hwy by class
## t = 7.1386, df = 28.137, p-value = 8.836e-08
## alternative hypothesis: true difference in means between group compact and group minivan is not equal to 0
## 95 percent confidence interval:
## 4.231785 7.636687
## sample estimates:
## mean in group compact mean in group minivan
## 28.29787 22.36364
The t-test suggests a significant difference in the mean highway miles per gallon of the minivan and compact vehicle classes (p=8.836e-08, t=7.14, df=28.14). I was expecting this because these vehicles are usually very different in size and because their ranges seem quite different on the boxplot. The 95% confidence interval suggests the true difference in the means is between 4.23 and 7.64 miles per gallon, which doesn’t seem like very much. The mean highway fuel efficiency for compact cars was 28.30 mpg and the mean for minivans was 22.36 mpg. While the result is statistically significant, I’m not sure if it’s practically significant (about 6 more miles per gallon doesn’t seem like that much greater fuel efficiency).
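Rather than retyping numbers from the printed output, you can pull the key values out of the test object. The sketch below uses base R’s subset() in place of filter() so that it runs on its own, and the object names (tt, diff_means, rel_diff) are just illustrative. The relative difference at the end is one possible way to think about practical significance, not a required step.

```r
library(ggplot2) # provides the mpg data set

# Same two-class comparison as above, stored so we can inspect its parts
compact_minivan <- subset(mpg, class %in% c("compact", "minivan"))
tt <- t.test(hwy ~ class, data = compact_minivan)

group_means <- unname(tt$estimate) # compact first, then minivan
conf_int    <- tt$conf.int         # 95% confidence interval for the difference
p_value     <- tt$p.value

# Difference in means, in miles per gallon (about 5.93)
diff_means <- group_means[1] - group_means[2]

# The same difference as a proportion of the minivan mean
rel_diff <- diff_means / group_means[2]

round(c(difference = diff_means, lower = conf_int[1], upper = conf_int[2]), 2)
round(rel_diff, 3)
```

Expressing the difference relative to the minivan mean can help you decide for yourself whether roughly 6 miles per gallon is a small or a large gap.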
cor.test(mpg$displ,mpg$cty)
##
## Pearson's product-moment correlation
##
## data: mpg$displ and mpg$cty
## t = -20.205, df = 232, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8406782 -0.7467508
## sample estimates:
## cor
## -0.798524
The correlation test suggests a significant negative correlation between engine displacement in liters and city miles per gallon (p<2.2e-16, t=-20.21, df=232). This makes sense because it is likely that as engine size increases, fuel efficiency would decrease, and because the points on the scatter plot appear to have a downward slope. The 95% confidence interval suggests the true correlation falls between -0.84 and -0.75, a relatively narrow range that’s pretty close to -1. The Pearson’s correlation coefficient estimated from the sample was -0.80. The result is statistically significant, and it’s likely to be practically significant as well: -0.8 is a fairly strong negative correlation, indicating that city miles per gallon tends to be lower for cars with larger engine displacement.
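The same idea works for the correlation test: store the result, then extract the estimate and confidence interval when you write up your explanation. This is a minimal sketch with illustrative object names. Squaring the coefficient is an extra step I have added here because it gives the share of variation that a one-predictor linear model would account for.

```r
library(ggplot2) # provides the mpg data set

# Store the test result so its pieces can be reported individually
ct <- cor.test(mpg$displ, mpg$cty)

r        <- unname(ct$estimate) # sample Pearson correlation coefficient
conf_int <- ct$conf.int         # 95% confidence interval for the true correlation

round(c(r = r, lower = conf_int[1], upper = conf_int[2]), 2)

# Squared correlation: about 0.64 for these two variables
r_squared <- r^2
round(r_squared, 4)
```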
library(tidymodels) # provides linear_reg(), set_engine(), fit(), tidy(), and glance()
regression <- linear_reg() %>%
set_engine("lm") %>%
fit(cty~displ, data = mpg)
summary(regression$fit)
##
## Call:
## stats::lm(formula = cty ~ displ, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3109 -1.4695 -0.2566 1.1087 14.0064
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.9915 0.4821 53.91 <2e-16 ***
## displ -2.6305 0.1302 -20.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.567 on 232 degrees of freedom
## Multiple R-squared: 0.6376, Adjusted R-squared: 0.6361
## F-statistic: 408.2 on 1 and 232 DF, p-value: < 2.2e-16
# You could also use these to get the same results
# tidy(regression)
# glance(regression)
This linear regression model looks at the effect engine displacement in liters has on city miles per gallon fuel efficiency. Like the correlation test, it suggests a negative relationship: for each additional liter of engine displacement, city miles per gallon decreases by about 2.6 on average. \(R^2\) is 0.64, suggesting that about 64% of the variation in city fuel efficiency is accounted for by engine displacement. While this result is statistically significant (p<2.2e-16), I am unsure that it’s practically significant. For a mechanical process like the fuel efficiency of an engine, we might expect to see an \(R^2\) higher than 64%. The variance of the data as cty goes up might be part of the reason for this (see the scatter plot with regression line, below).
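To connect the slope back to the data, it can help to predict city fuel efficiency for a couple of example engine sizes. The sketch below fits the same formula with base R’s lm() instead of the tidymodels wrapper so that it runs on its own (this is the function the Call line in the output above reports). The 2 and 3 liter engines are hypothetical values chosen only for illustration.

```r
library(ggplot2) # provides the mpg data set

# Same model as above, fitted directly with lm()
fit <- lm(cty ~ displ, data = mpg)

coef(fit) # intercept of about 25.99 and slope of about -2.63

# Two hypothetical engines that differ by exactly one liter
example_engines <- data.frame(displ = c(2, 3))
predicted_cty   <- predict(fit, newdata = example_engines)

round(predicted_cty, 1)     # about 20.7 and 18.1 city miles per gallon
unname(diff(predicted_cty)) # the gap between them equals the slope

# Share of variation in cty accounted for by displ
round(summary(fit)$r.squared, 4)
```

Seeing that a one liter increase moves the prediction by exactly the slope is a good check that you are reading the coefficient table correctly.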
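ggplot(mpg, aes(displ,cty)) +
geom_jitter() + # jittered points only; also adding geom_point() would draw every car twice
stat_smooth(method="lm") + # adds the fitted regression line with its confidence band
labs(title="Relationship Between Engine Size and City Fuel Efficiency", x="Engine Displacement, in liters", y="City miles per gallon")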
## `geom_smooth()` using formula 'y ~ x'