# The Best vs. The Rest
## What *really* separates elite baseball teams?

**Complete by: Tuesday 24 Sept. at 10:55am**

## Introduction

Sports analytics is one of the most popular and discussed forms of data analysis. No matter what sport interests you (gymnastics, football, skiing, etc., etc.), there is surely data out there, as well as lots of people looking for ways to analyze and better understand the sport. Sometimes, the purpose of the data is to win more games! This, after all, is the plot of the movie *Moneyball*, in which the Oakland A's use data analysis to recruit players and increase their chances of winning the pennant.

For this week's sports data workshop, we'll work with a data set from the 2023 MLB season. **Sabermetrics**, the term for sports analytics with baseball data, is one of the oldest forms of sports data science. The box score was invented in 1858! Since the 1970s, analysts have examined the game of baseball from almost every conceivable angle, and this week we'll join the fray.

In 2023 a few teams performed spectacularly well: the Los Angeles Dodgers, the Atlanta Braves, and the controversial World Series-winning Texas Rangers. The Arizona Diamondbacks also had a surprisingly good run: they were National League Champions (i.e., second place in MLB) despite not scoring as many runs as other elite teams. Our question this week will be: is Arizona really that different from these other teams in terms of scoring potential? Was their success the result of good play, random chance, or both?

## Contextual and Ethical Considerations

It's sometimes easy to imagine sports data as "neutral," but just like any kind of data there are potential ethics concerns. As always we should consider the stakeholders of any data set. Who stands to gain by a data set being handled well? Who could be hurt if that same data is handled poorly? In the case of the baseball data, we should consider that professional sports are a big business, which includes the professional livelihoods of not only players but lots of support staff. Representing a sport accurately can help the people who make their livings in that sport. Likewise, sports fans are deeply committed to their favorite players and teams, and accurate data collection and management can help those fans to better interact with the sport. Misrepresenting sports data could lead to certain players or teams being underfunded, or it could even lead to rule changes that might endanger athletes' health or overall ability to perform. Good sports analysis should always take the stakeholders into account and consider the ethical implications of any analysis.

The data we will use comes from [baseball-reference.com](https://baseball-reference.com), a popular baseball statistics website. The site isn't *quite* equipped with the data we want, which is runs per player per team. The data for players who played for multiple teams in a single season is automatically aggregated. This is great for some purposes, but bad for ours! We'll need to do some data wrangling to remedy this.

**To begin, import the libraries you will need below:**

**Now let's get some data. Go to the Baseball Reference page for [2023 Standard Batting](https://www.baseball-reference.com/leagues/majors/2023-standard-batting.shtml) and export the Player Standard Batting table. Save the data as a CSV, load it into JupyterHub, and read it in this notebook.** A lot of data science tasks require you to search for and download data yourself. We will go over how to do this together in class—it will be different from linking to a CSV like we've done in the past.
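As a rough sketch of the read-in step, the cell below loads CSV text with pandas and checks its shape. Everything about the data here is an assumption for illustration: the three column names (`Player`, `Team`, `R`), the made-up rows, and the hypothetical file name in the comment. Your real export will have many more columns and rows.

```python
import io

import pandas as pd

# Stand-in for the exported file. The columns and values are made up;
# the real Baseball Reference export is much wider and longer.
csv_text = """Player,Team,R
Player A,ATL,90
Player B,ARI,75
Player C,2TM,40
Player D,ATL,0
"""

# With a real file you would pass its path instead, for example
# pd.read_csv("batting_2023.csv") (a hypothetical file name).
batting = pd.read_csv(io.StringIO(csv_text))

print(batting.shape)   # (number of rows, number of columns)
print(batting.head())  # the first few rows
```

Checking `shape` and `head()` right after loading is a quick way to confirm that the file was read the way you expected.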

What do you notice about this data set? How many rows does it have? What do the rows represent? What do some of the columns represent (you don't need to write them all, but give us a sense)? How might this affect our ability to compare one player to another? **Write your answers below:**

[Your answers here.]

## Data Wrangling

Before we can begin, we need to wrangle our data in a couple of ways, as mentioned above. This data includes a lot of players who never scored a single run, most of them pitchers, who traditionally aren't strong batters. First, **you will need to filter out all of the players who scored 0 runs**.

You also need to account for those players who played on more than one team in 2023. Baseball Reference marks these players as "2TM" and "3TM" in the `Team` column. **You will also need to filter out any of these players.** Perform these two wrangling steps in the cell below:
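One possible shape for these two filters, sketched on a tiny made-up table. It assumes the runs column is named `R` and that the team column is named `Team`, so check both against your own export before borrowing anything.

```python
import pandas as pd

# Made-up rows standing in for the real batting table.
batting = pd.DataFrame({
    "Player": ["A", "B", "C", "D", "E", "F"],
    "Team": ["ATL", "ARI", "2TM", "ATL", "3TM", "ARI"],
    "R": [90, 75, 40, 0, 12, 0],
})

# Step 1: keep only players who scored at least one run.
scored = batting[batting["R"] > 0]

# Step 2: drop the aggregated multi-team rows. The ~ negates the
# boolean mask, so we keep rows whose Team is NOT "2TM" or "3TM".
wrangled = scored[~scored["Team"].isin(["2TM", "3TM"])]

print(wrangled)
```

Doing the filters as two named steps makes it easy to compare row counts before and after each one.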

Before we move on, it would be useful to see the total runs scored by each team in 2023. **Create a summary table showing the sum of runs for every team. Sort this new table from most to least runs.**
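A summary table like this is usually a group-then-aggregate operation. Here is a minimal sketch on made-up numbers, again assuming `Team` and `R` as the column names:

```python
import pandas as pd

# Made-up, already-wrangled rows.
wrangled = pd.DataFrame({
    "Team": ["ATL", "ATL", "ARI", "ARI", "TEX", "TEX"],
    "R": [90, 60, 75, 50, 100, 80],
})

# Sum runs within each team, then sort from most to least.
team_runs = (
    wrangled.groupby("Team", as_index=False)["R"]
    .sum()
    .sort_values("R", ascending=False)
)

print(team_runs)
```

Passing `as_index=False` keeps `Team` as a regular column, which tends to be more convenient if you want to plot the summary later.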

We mentioned we were interested in the best teams from 2023. Look at the four teams that were mentioned in this notebook's introduction. Where do they fall in the list? What's the difference between Arizona and the rest of these teams? **Write your answers below:**

[Your answers here.]

## Exploring the Data

Now that we've wrangled our data, we can start exploring it. In this workshop, rather than looking at total runs, we'll consider the average number of runs scored by players on each team. Begin by finding out how the number of runs scored varies by team. **Create a boxplot that expresses this:** *(n.b. This graph may be impossible to read in the usual orientation. To make it more readable, try putting the categorical variable on the y-axis and the numerical variable on the x-axis. You may need to do this with later plots as well.)*

Interpret this graph below in the usual way. What is the difference between Arizona and Atlanta specifically? Is there a way you could make the difference between these two easier to spot? **Write your answer below:**

[Your answers here.]

In the hypothesis tests that we've been working on, we typically care more about the *mean* than we do about the *median*. The boxplot above shows us the median number of runs scored for each team. **Now make a bar plot showing the mean for each team instead. To that plot, add error bars showing the confidence interval for each calculated mean. (To do this, you'll need to look at the Altair documentation for [error bars](https://altair-viz.github.io/user_guide/marks/errorbar.html#using-error-bars-to-visualize-aggregated-data) and use the *layering* technique we learned about in class.) Below, interpret the plot fully and be sure to explain what the error bars represent.**

[Your interpretation here.]

The plot above still makes it hard to understand the difference between Arizona and Atlanta. **Filter the data set to include only these two teams. Give this dataframe a new name so you can keep track. Display the dataframe when you're done.**
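A sketch of one way to keep only two teams, using `isin()` on made-up data. The abbreviations `"ARI"` and `"ATL"` are assumptions here, so confirm how the two teams are labeled in your own `Team` column.

```python
import pandas as pd

# Made-up rows with a third team that should be filtered away.
wrangled = pd.DataFrame({
    "Team": ["ATL", "ATL", "ARI", "ARI", "TEX", "TEX"],
    "R": [90, 60, 75, 50, 100, 80],
})

# Keep only rows whose Team is one of the two we care about.
# .copy() gives the new dataframe its own data, separate from wrangled.
two_teams = wrangled[wrangled["Team"].isin(["ARI", "ATL"])].copy()

two_teams  # in a notebook, this displays the dataframe
```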

**Now make the bar plot (with error bars) again with the new dataframe, and interpret it again. Which team appears to score more runs per player, on average? Are you confident in your interpretation, based on what the graph tells you?**

[Your interpretation here.]

Finally, if we were doing a traditional t-test, we would want to make sure that both the Atlanta sample and the Arizona sample had a roughly equal number of values. **Using `count()` for the Y-encoding, create a bar plot showing the number of players for Atlanta and Arizona. Which team do we have more data for? Would we be able to use a t-test on this data? Why or why not?**

[Your interpretation here.]

## Running a Hypothesis Test

Let's create a permutation-based hypothesis test to see whether the difference between Atlanta and Arizona in our data is statistically significant. This is another way of asking: are the Braves really that much better than the Diamondbacks in terms of runs, or would they be roughly equal if they could have played many more games? **What would be our null hypothesis? What would be our alternative hypothesis? Write your answers below:**

[Your answers here.]

**Begin by calculating the observed difference in means between Atlanta and Arizona. Call this variable `observed_difference` and display it.**
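The observed difference is just one group's mean minus the other's. Here is a minimal sketch on made-up numbers. Subtracting Arizona from Atlanta is an arbitrary choice of direction (an assumption in this example), but whichever direction you pick, keep it consistent through the rest of the test.

```python
import pandas as pd

# Made-up rows for the two teams.
two_teams = pd.DataFrame({
    "Team": ["ATL", "ATL", "ATL", "ARI", "ARI", "ARI", "ARI"],
    "R": [90, 60, 30, 75, 50, 25, 10],
})

# Pull out each team's runs as its own Series.
atlanta_runs = two_teams[two_teams["Team"] == "ATL"]["R"]
arizona_runs = two_teams[two_teams["Team"] == "ARI"]["R"]

# Difference in means: Atlanta minus Arizona.
observed_difference = atlanta_runs.mean() - arizona_runs.mean()
observed_difference
```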

**Now let's get our permutation function.** (It's okay to copy this directly from the slide.) **Below, go through the function line-by-line, explaining what each line does:**

[Your answer here.]

Using the `simulate_two_groups()` function, **write a loop that runs 10,000 permutations of the difference between Atlanta and Arizona. Call this variable `permutations`.**
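The general shape of the loop is sketched below. Since the slide's function isn't reproduced in this notebook, the sketch uses a hypothetical stand-in called `simulate_two_groups_sketch()`. Its name, arguments, and body are assumptions and may differ from the version you copied, and the run values are made up.

```python
import numpy as np
import pandas as pd


def simulate_two_groups_sketch(data1, data2, rng):
    """Hypothetical stand-in: shuffle the pooled values, re-split, and
    return the difference in means between the two shuffled groups."""
    n = len(data1)
    pooled = np.concatenate([data1, data2])  # ignore the team labels
    shuffled = rng.permutation(pooled)       # random reordering
    return shuffled[:n].mean() - shuffled[n:].mean()


# Made-up samples; in the workshop these come from your filtered dataframe.
atlanta_runs = pd.Series([90, 60, 30, 45, 12])
arizona_runs = pd.Series([75, 50, 25, 10])

rng = np.random.default_rng(241)  # seeded so the sketch is repeatable

# Run the simulation 10,000 times, collecting one difference per run.
permutations = []
for _ in range(10_000):
    permutations.append(
        simulate_two_groups_sketch(atlanta_runs, arizona_runs, rng)
    )

print(len(permutations))
```

Each pass through the loop answers the same question: if team labels were assigned at random, how big a difference in means would we see?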

Now you're ready to view some results! **Make a histogram showing the permutation distribution, and plot the observed difference in means as a red dotted line. Below, interpret the plot fully. Does it seem like the observed difference in means is statistically significant based on this plot?** (Remember: you can refer to the [How to Explain document](https://jrladd.com/CIS241/resources/how-to-explain) on our course website and Sakai for guidance on how to interpret output *fully and accurately*.)

[Your interpretation here.]

Last but not least, **calculate the p-value for your permutation test. Interpret the p-value fully, making note of *both* statistical and practical significance and relating your findings to what you saw in the previous plot.**
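A permutation p-value is the share of permuted differences that are at least as extreme as the observed one. The sketch below shows two common conventions on stand-in values (again drawn from a normal distribution, not real results). Which convention applies depends on how you worded your alternative hypothesis, so treat the choice here as an assumption.

```python
import numpy as np

# Stand-in values so this cell is self-contained (NOT real results).
rng = np.random.default_rng(0)
permutations = rng.normal(loc=0, scale=15, size=10_000)
observed_difference = 20.0

# One-sided: how often is a permuted difference at least as large
# as the observed difference? (Taking the mean of a boolean array
# gives the proportion of True values.)
p_one_sided = np.mean(permutations >= observed_difference)

# Two-sided: how often is a permuted difference at least as far
# from zero, in either direction?
p_two_sided = np.mean(np.abs(permutations) >= abs(observed_difference))

print(p_one_sided, p_two_sided)
```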

[Your interpretation here.]

## Conclusion

**Write a brief conclusion summarizing what you found out from the permutation test. Based on the p-value and the typical alpha of 0.05, do you believe there is a statistically significant difference in mean runs between Arizona and Atlanta? Do you think the difference in means is *practically significant*? How does this change your impressions of how a team might succeed in baseball? Based on other features of the data, especially sample size, do you believe you can trust the results of the permutation test? Are there any next steps you might recommend for this analysis?**

[Your conclusion here.]