# Analyzing More Tennis Matches, with Correlation

**Complete by: Tuesday 27 Feb. at 9am**  
Data: <https://jrladd.com/CIS241/data/atp_matches_2021.csv>

In the last workshop, we tried to figure out what makes certain tennis matches take so long. We hypothesized that perhaps certain players take longer than others, and we tested this by looking at matches played by just two of tennis's big stars: Rafael Nadal and Andy Murray.

But what if the length of matches has less to do with individual players and more to do with other factors? Today, we'll explore this question.

## Ethical Considerations

It's sometimes easy to imagine sports data as "neutral," but like any kind of data it raises potential ethical concerns. As always, we should consider the stakeholders of any data set. Who stands to gain when a data set is handled well? Who could be hurt if that same data is handled poorly? In the case of the tennis data, we should consider that professional sports are a big business, which includes the professional livelihoods of not only players but lots of support staff. Representing a sport accurately can help the people who make their livings in that sport. Likewise, sports fans are deeply committed to their favorite players and teams, and accurate data collection and management can help those fans to better interact with the sport. Misrepresenting sports data could lead to certain players or teams being underfunded, or it could even lead to rule changes that might endanger athletes' health or overall ability to perform. Good sports analysis should always take the stakeholders into account and consider the ethical implications of any analysis.

**Begin by importing the usual libraries:**

In your conclusions to the last workshop, many of you pointed out that we could be more sure of our findings if we had a larger sample of data. This week we'll use a much larger data set.

Sports analytics expert Jeff Sackmann has created [a series of tidy datasets](https://github.com/JeffSackmann/tennis_atp) on matches from the ATP Tour, the official set of tournaments sponsored by the Association of Tennis Professionals. We're working with the data from the 2021 ATP Tour. **Read in the data from the URL above now:**
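As a rough sketch of the read-and-inspect step: `pd.read_csv` accepts either a URL or a file-like object, so the example below parses a tiny inline sample to stay runnable offline. The sample rows are made up for illustration, the column names are a small subset assumed from Sackmann's data dictionary, and the variable name `matches` is just a placeholder.

```python
import io

import pandas as pd

DATA_URL = "https://jrladd.com/CIS241/data/atp_matches_2021.csv"

# Made-up rows standing in for the real file (assumption: illustration only).
sample_csv = io.StringIO(
    "surface,best_of,minutes,w_ace\n"
    "Hard,3,95,7\n"
    "Clay,3,142,3\n"
    "Grass,5,,12\n"
)

# In the notebook this would be pd.read_csv(DATA_URL) instead.
matches = pd.read_csv(sample_csv)

print(matches.shape)         # (rows, columns)
print(matches.dtypes)        # which columns are numerical?
print(matches.isna().sum())  # missing values per column
```

Checking the column types and missing values up front is useful here, because correlation only works on numerical columns and some matches may lack a recorded duration.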

What does each row in this dataset represent? There are many columns here (almost fifty!), and you don't need to describe each one. But in general, what kinds of information are available in the columns? (Some of the abbreviations here will be unfamiliar to you, and we'll cover that in the next step.) How is the data set different from the one we worked with last week? **Write your answers below:**

[Your answers here.]

## Finding Correlations, Exploring the Data

Now that we've gained some familiarity with the data, we want to see what factors might correlate with the length of tennis matches. Thankfully, the data provides us with a `minutes` variable that tells us exactly how long each match lasted. This will be our *dependent variable* for all of our subsequent analysis.

It's up to you to choose some *independent variables* that might correlate with the length of the match. Look over the data and choose a few possible variables. You'll need to investigate the data's documentation in order to understand all the abbreviations! Thankfully, Sackmann provides [a file that explains what all the column abbreviations mean](https://github.com/JeffSackmann/tennis_atp/blob/master/matches_data_dictionary.txt).

**In this section, choose 4 independent variables from the data set, list them, and explain why you chose each one. Use your common sense and knowledge of the data to pick variables you think could plausibly have an effect on the length of the match. Remember that independent and dependent variables for correlation *must be numerical*.**

**Then, make 4 regression plots (these are scatterplots with a regression transform) to compare each independent variable with `minutes`. Interpret each plot, explaining whether or not there appears to be a trend or correlation. (As always, make sure all plots have titles and labels, and you may need to make the regression line a different color to ensure it's visible.)**

**Finally, calculate Pearson's correlation coefficient between each of your independent variables and `minutes`. (Keep track of these with distinct variable names.) Explain whether the coefficient says the correlation is positive or negative, strong or weak. Is this also reflected in the plots?**
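One way to compute the coefficient is sketched below on synthetic data. `total_points` is again a hypothetical column, and `r_points` is just an example of a distinct variable name. The pandas `Series.corr` method uses Pearson's method by default, and the NumPy call is included only as a cross-check.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; "total_points" is a hypothetical variable.
rng = np.random.default_rng(0)
total_points = rng.integers(80, 300, size=200)
toy = pd.DataFrame({
    "total_points": total_points,
    "minutes": 0.6 * total_points + rng.normal(0, 15, size=200),
})

# Pearson's r between one independent variable and minutes.
r_points = toy["total_points"].corr(toy["minutes"])

# Cross-check: the off-diagonal entry of the 2x2 correlation matrix.
r_check = np.corrcoef(toy["total_points"], toy["minutes"])[0, 1]

print(round(r_points, 3), round(r_check, 3))
```

Note that `Series.corr` skips rows where either value is missing, while `np.corrcoef` returns `nan` if any value is missing, so the pandas version is usually the safer choice on real data.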

Remember that you can create new cells by clicking the `+` button above, and you can switch the cell from Code to Markdown using the dropdown. Take your time with this, and make sure you've treated all four independent variables thoroughly.

## Testing Correlations

Now you've done some EDA and located some possibilities for correlation. You've tested these against your intuition about the data, but let's do some real hypothesis testing. Are the correlation coefficients we found in this sample likely to hold in the full population?

**Choose *two* of the independent variables from above. Perhaps they're the ones you think are most plausible, or the ones you're most curious about. Run two separate permutation tests, testing the correlation between each independent variable and `minutes` (the length of the match).**

**For each permutation test you will:**

- Write out the null and alternative hypotheses.
- Run 10,000 permutations using the function we created in class.
- Graph the permutation distribution with a red line showing the observed correlation.
- Calculate a p-value.
- Interpret the plot and the p-value, explaining whether the correlation is statistically significant and/or practically significant.
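
The permutation and p-value steps can be sketched roughly as follows. The function `permute_correlation` is a hypothetical stand-in written for this example and is not necessarily the function we created in class. The data is synthetic, and the two-sided p-value shown is one reasonable choice that should match however you state your alternative hypothesis.

```python
import numpy as np


def permute_correlation(x, y, n_permutations=10_000, seed=0):
    """Hypothetical helper: shuffle y repeatedly and record Pearson's r each time."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.corrcoef(x, y)[0, 1]
    perm_r = np.empty(n_permutations)
    for i in range(n_permutations):
        # Shuffling y breaks any real link between x and y (the null hypothesis).
        perm_r[i] = np.corrcoef(x, rng.permutation(y))[0, 1]
    return observed, perm_r


# Synthetic data with a built-in positive relationship.
rng = np.random.default_rng(42)
x = rng.normal(size=150)
y = x + rng.normal(size=150)

observed_r, perm_r = permute_correlation(x, y)

# Two-sided p-value: share of shuffled correlations at least as extreme as observed.
p_value = np.mean(np.abs(perm_r) >= abs(observed_r))
print(round(observed_r, 3), p_value)
```

With real data, drop rows with missing values in either column before permuting, since `np.corrcoef` returns `nan` otherwise.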

## Conclusion

**Write a few sentences summarizing your results. Did you find any variables that seem to correlate with longer tennis matches? What did you learn about the causes of long tennis matches by doing this analysis? What limitations did you find in this data, and what are your thoughts about what to try next?**

[Your conclusion here.]