# The Best vs. The Rest
## More analysis of baseball teams

**Complete by: Tuesday 1 Oct. at 10:55am**

## Ethical Considerations

It's sometimes easy to imagine sports data as "neutral," but just like any other kind of data, it raises potential ethical concerns. As always, we should consider the stakeholders of any data set. Who stands to gain when a data set is handled well? Who could be hurt if that same data is handled poorly? In the case of the baseball data, we should consider that professional sports are a big business, which includes the professional livelihoods of not only players but also many support staff. Representing a sport accurately can help the people who make their livings in that sport. Likewise, sports fans are deeply committed to their favorite players and teams, and accurate data collection and management can help those fans to better interact with the sport. Misrepresenting sports data could lead to certain players or teams being underfunded, or it could even lead to rule changes that might endanger athletes' health or overall ability to perform. Good sports analysis should always take the stakeholders into account and consider the ethical implications of any analysis.

The data we will use comes from [baseball-reference.com](https://baseball-reference.com), a popular baseball statistics website. The site isn't *quite* equipped with the data we want, which is statistics on both batting and pitching at the same time. Instead, it keeps batting and pitching in separate tables. This is great for some purposes, but bad for ours! We'll need to do some data wrangling to remedy this.

**Begin by importing the usual libraries:**

## Data Import and Wrangling

For this different set of questions, we'll need different data. Go to the [2023 Team Stats page at Baseball Reference](https://www.baseball-reference.com/leagues/majors/2023.shtml) and download **both** the "Team Standard Batting" and "Team Standard Pitching" tables into separate CSVs.

To work with these datasets, we will need to ***combine*** them by [*merging*](https://pandas.pydata.org/docs/reference/api/pandas.merge.html) the DataFrames together. This step is on your Pandas Cheatsheet under "Combine Data Sets." Read in both CSVs and combine them, using *suffixes* for any columns with the same name:
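
As a rough sketch of how a merge with suffixes behaves, here is a toy example. The two small DataFrames use invented numbers and stand in for the CSVs you would read with `pd.read_csv()`. The shared team column name `Tm` and the suffixes `_bat` and `_pitch` are assumptions for illustration, so check the column names in your own files.

```python
import pandas as pd

# Tiny stand-in tables with invented numbers; in the assignment these
# would come from pd.read_csv() on the two files you downloaded.
batting = pd.DataFrame({
    "Tm": ["Team A", "Team B", "Team C"],
    "R": [700, 650, 800],
    "H": [1400, 1350, 1450],
})
pitching = pd.DataFrame({
    "Tm": ["Team A", "Team B", "Team C"],
    "R": [680, 720, 640],
    "H": [1380, 1420, 1300],
})

# Merge on the shared team column. Columns that appear in both tables
# (here R and H) get the suffixes so you can tell them apart.
teams = pd.merge(batting, pitching, on="Tm", suffixes=("_bat", "_pitch"))
print(teams.columns.tolist())
```

After a merge like this, the batting `R` column would appear under a suffixed name such as `R_bat`, which is why the runs column "may have a new name" later on.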

What does each row in this dataset represent? There are many columns here (more than fifty!), and you don't need to describe each one. But in general, what kinds of information are available in the columns? (Some of the abbreviations here will be unfamiliar to you, so look at the Glossary at Baseball Reference and don't be shy about Googling things.) How is the data set different from the one we worked with last week? **Write your answers below:**

[Your answers here.]

## Finding Correlations, Exploring the Data

Now that we've gained some familiarity with the data, we want to see what factors might correlate with the number of runs each team scored in 2023. This is the `R` column from the original `batting` table, but it may have a new name now. This will be our *dependent variable* for all of our subsequent analysis.

It's up to you to choose some *independent variables* that might correlate with the number of runs scored. To get a sense of the possibilities, **calculate the correlation matrix for all numerical variables in the dataset**. Alongside that, **visualize the correlation matrix in Altair**. [*In class we added labels to this graph, but there are too many variables here for that. Don't include the text labels this time, and make the graph big enough that it's readable. Focus on creating a clear, readable plot.*] **Accurately interpret both the plot and the statistical output.**

Remember that you can create new cells by clicking the `+` button above, and you can switch the cell from Code to Markdown using the dropdown. 

**Next, choose 3 independent variables from the data set, list them, and explain why you chose each one. Use your common sense, the exploratory information above, and knowledge of the data to pick variables you think could plausibly have an effect on the number of runs scored. Remember that independent and dependent variables for correlation *must be numerical*.** 

**Then, make 3 regression plots (these are scatterplots with a regression transform) to compare each independent variable with runs scored. Interpret each plot, explaining whether or not there appears to be a trend or correlation. (As always, make sure all plots have titles and labels, and you may need to make the regression line a different color to ensure it's visible.) Use tooltips so you can easily see which team is which on your plots!**

**Finally, calculate Pearson's correlation coefficient between each of your independent variables and runs scored. (Keep track of these with distinct variable names.) Explain whether the coefficient says the correlation is positive or negative, strong or weak. Is this also reflected in the plots?** Take your time with this, and make sure you've treated all three independent variables thoroughly.
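
As a small sketch of the calculation itself, pandas can compute Pearson's coefficient between two columns directly, and NumPy gives the same number. The data and column names here are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented numbers for illustration only.
teams = pd.DataFrame({
    "H_bat": [1300, 1350, 1400, 1450, 1500],
    "R_bat": [640, 700, 690, 760, 800],
})

# Series.corr uses Pearson's method by default.
r_hits = teams["H_bat"].corr(teams["R_bat"])

# Cross-check with NumPy: the off-diagonal entry of the 2x2 matrix.
r_hits_np = np.corrcoef(teams["H_bat"], teams["R_bat"])[0, 1]

print(round(r_hits, 3))
```

Giving each coefficient its own name (for example `r_hits`, then a different name for the next variable) keeps earlier results from being overwritten.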

## Testing Correlations

Now you've done some exploratory data analysis (EDA) and located some possibilities for correlation. You've tested these against your intuition about the data, but let's do some real hypothesis testing. Would the correlation coefficients we found in this sample be likely to hold in the full population?

**Choose *two* of the independent variables from above. Perhaps they're the ones you think are most plausible, or the ones you're most curious about. Run two separate permutation tests, testing the correlation between each independent variable and number of runs scored.**

**For each permutation test you will:**

- Write out the null and alternative hypotheses.
- Run 10,000 permutations using the function we created in class.
- Graph the permutation distribution with a red line showing the observed correlation.
- Calculate a p-value.
- Interpret the plot and the p-value, explaining whether the correlation is statistically significant and/or practically significant.
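
To make the logic of these steps concrete, here is one way a permutation test for correlation could be sketched with NumPy. This is not the function from class: `permutation_correlations` is a hypothetical helper, the data are invented, and the two-sided p-value is one common choice among several.

```python
import numpy as np


def permutation_correlations(x, y, n_permutations=10_000, seed=0):
    """Shuffle y repeatedly and record its correlation with x each time."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    permuted = np.empty(n_permutations)
    for i in range(n_permutations):
        # Shuffling y breaks any real link between x and y,
        # which is what the null hypothesis assumes.
        permuted[i] = np.corrcoef(x, rng.permutation(y))[0, 1]
    return permuted


# Invented numbers for illustration only.
hits = [1300, 1320, 1350, 1380, 1400, 1430, 1450, 1500]
runs = [640, 670, 700, 660, 690, 740, 760, 800]

observed = np.corrcoef(hits, runs)[0, 1]
null_corrs = permutation_correlations(hits, runs)

# Two-sided p-value: the share of shuffled correlations at least as
# extreme (in absolute value) as the observed one.
p_value = np.mean(np.abs(null_corrs) >= abs(observed))
print(p_value)
```

The array `null_corrs` is the permutation distribution you would graph, with `observed` marking where the red line goes.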

## Conclusion

**Write a few sentences summarizing your results. Did you find any variables that seem to correlate with a higher number of runs scored? What did you learn about the causes of high-scoring teams by doing this analysis? What limitations did you find in this data, and what are your thoughts about what to try next?**

[Your conclusion here.]