# Analyzing Tennis Matches

**Complete by: Tuesday 20 Feb. at 10:55am**  
Data: <https://jrladd.com/CIS241/data/serve_times.csv>

## Introduction

Sports analytics is one of the most popular and widely discussed forms of data analysis. No matter what sport interests you (gymnastics, football, skiing, and so on), there is surely data out there and lots of people looking for ways to analyze and better understand the sport. Sometimes, the purpose of the data is to win more games! This, after all, is the plot of the movie *Moneyball*, in which the Oakland A's use data analysis to recruit players and increase their chances of winning the pennant.

For this week's sports data workshop, we'll work with a data set from the 2015 French Open, which attempts to answer the question of why some tennis matches take a very long time to complete. Is it because of specific players, or is there some other factor? 

## Ethical Considerations

It's sometimes easy to imagine sports data as "neutral," but as with any kind of data, there are potential ethical concerns. As always, we should consider the stakeholders of any data set. Who stands to gain by a data set being handled well? Who could be hurt if that same data is handled poorly? In the case of the tennis data, we should consider that professional sports are a big business, which includes the professional livelihoods of not only players but lots of support staff. Representing a sport accurately can help the people who make their livings in that sport. Likewise, sports fans are deeply committed to their favorite players and teams, and accurate data collection and management can help those fans to better interact with the sport. Misrepresenting sports data could lead to certain players or teams being underfunded, or it could even lead to rule changes that might endanger athletes' health or overall ability to perform. Good sports analysis should always take the stakeholders into account and consider the ethical implications of any analysis.

**To begin, import the libraries you will need below:**

**Now read the `serve_times.csv` URL (above), and show the dataframe:**
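
As a rough sketch of the loading step, `pd.read_csv()` accepts either a URL or a file-like object. The rows below are made up so the example runs offline, and the column names `server` and `seconds_before_next_point` are assumptions that should be checked against the real file.

```python
import io

import pandas as pd

# Made-up stand-in rows so the sketch runs without a network connection.
# The column names `server` and `seconds_before_next_point` are
# assumptions; compare them with the columns in the real CSV.
sample_csv = io.StringIO(
    "server,seconds_before_next_point\n"
    "Rafael Nadal,28.0\n"
    "Rafael Nadal,31.5\n"
    "Andy Murray,24.0\n"
    "Andy Murray,26.5\n"
    "Novak Djokovic,22.0\n"
)

# In the workshop, pass the data URL from the top of this page
# instead of sample_csv.
tennis = pd.read_csv(sample_csv)

print(tennis.shape)  # (number of rows, number of columns)
tennis.head()        # in a notebook, this displays the first rows
```

Checking `tennis.shape` and the first few rows is a quick way to start answering the questions about what each row represents.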

What do you notice about this data set? How many rows does it have? What do the rows represent? How might this affect our ability to compare one player to another? **Write your answers below:**

[Your answers here.]

## Exploring the Data

Start by finding out how much the time (in seconds) before the next point (after a serve) varies based on who is serving. **Create a boxplot that expresses this:** *(n.b. This graph may be impossible to read in the usual orientation. To make it more readable, try putting the categorical variable on the y-axis and the numerical variable on the x-axis. You may need to do this with later plots as well.)*

What does this graph tell you about the difference between the serves of two tennis legends: Rafael Nadal and Andy Murray? Is there a way you could make the difference between these two easier to spot? **Write your answer below:**

[Your answers here.]

In the hypothesis tests that we've been working on, we typically care more about the *mean* than we do about the *median*. The boxplot above shows us the median seconds before the next point. **Now make a bar plot showing the mean for each server instead. To that plot, add error bars showing the confidence interval for each calculated mean. Below, interpret the plot fully and be sure to explain what the error bars represent.**
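
To see the numbers behind a plot like this, one option is to compute each server's mean and an approximate 95% confidence interval directly. This is only a sketch: the values and player labels are made up, the column names are assumptions, and the interval uses a normal approximation (mean plus or minus 1.96 standard errors). Plotting libraries may compute the interval differently, for example by bootstrapping, so the error bars on your plot may not match these numbers exactly.

```python
import pandas as pd

# Made-up values for illustration only; column names are assumptions.
tennis = pd.DataFrame({
    "server": ["Player A"] * 3 + ["Player B"] * 3,
    "seconds_before_next_point": [20.0, 22.0, 24.0, 30.0, 31.0, 32.0],
})

# Mean, standard error of the mean, and sample size for each server.
summary = (
    tennis.groupby("server")["seconds_before_next_point"]
    .agg(["mean", "sem", "count"])
)

# Normal-approximation 95% interval: mean +/- 1.96 standard errors.
summary["ci_low"] = summary["mean"] - 1.96 * summary["sem"]
summary["ci_high"] = summary["mean"] + 1.96 * summary["sem"]

print(summary)
```

Servers with fewer rows tend to have larger standard errors, which is worth keeping in mind when comparing the widths of the error bars.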

[Your interpretation here.]

The plot above still makes it hard to understand the difference between Murray and Nadal. **Filter the data set to include only these two players. Give this dataframe a new name so you can keep track. Display the dataframe when you're done.**
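
One common way to keep only certain rows is a boolean mask built with `isin()`. The sketch below uses made-up rows, and both the column name `server` and the exact spelling of the player names are assumptions to verify against the real data. The name `nadal_murray` is just a hypothetical choice for the new dataframe.

```python
import pandas as pd

# Made-up rows for illustration; check the real column and player names.
tennis = pd.DataFrame({
    "server": ["Rafael Nadal", "Andy Murray", "Novak Djokovic", "Rafael Nadal"],
    "seconds_before_next_point": [30.0, 25.0, 22.0, 28.0],
})

# Keep only the rows whose server is one of the two players.
players = ["Rafael Nadal", "Andy Murray"]
nadal_murray = tennis[tennis["server"].isin(players)]

nadal_murray  # in a notebook, this displays the filtered dataframe
```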

**Now make the bar plot (with error bars) again with the new dataframe, and interpret it again. Which server appears to have longer times before a point? Are you confident in your interpretation, based on what the graph tells you?**

[Your interpretation here.]

Finally, if we were doing a traditional t-test, we would want to make sure that both the Murray sample and the Nadal sample had a roughly equal number of values. **Using `count()` for the Y-encoding, create a bar plot showing the number of entries for Nadal and Murray. Which player do we have more data for? Would we be able to use a t-test on this data? Why or why not?**
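
If you want to double-check the bar heights, `value_counts()` gives the same counts as a table. This is a sketch with made-up rows; the dataframe name `nadal_murray` and the column name `server` are assumptions.

```python
import pandas as pd

# Made-up rows for illustration; names are assumptions.
nadal_murray = pd.DataFrame({
    "server": ["Rafael Nadal", "Rafael Nadal", "Rafael Nadal", "Andy Murray"],
    "seconds_before_next_point": [30.0, 28.0, 32.0, 25.0],
})

# Number of rows recorded for each server.
counts = nadal_murray["server"].value_counts()
print(counts)
```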

[Your interpretation here.]

## Running a Hypothesis Test

Let's create a permutation-based hypothesis test to see whether the difference between Nadal and Murray in our data is statistically significant. **What would be our null hypothesis? What would be our alternative hypothesis? Write your answers below:**

[Your answers here.]

**Begin by calculating the observed difference in means between Nadal and Murray. Call this variable `mean_diff` and display it.**
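
A minimal sketch of this calculation, assuming the filtered dataframe is called `nadal_murray` and uses the columns `server` and `seconds_before_next_point` (all assumptions, with made-up values). The direction of the subtraction, Nadal minus Murray, is also just a choice; whichever direction you pick, use the same one in the permutation step.

```python
import pandas as pd

# Made-up values for illustration only.
nadal_murray = pd.DataFrame({
    "server": ["Rafael Nadal"] * 3 + ["Andy Murray"] * 2,
    "seconds_before_next_point": [30.0, 28.0, 32.0, 25.0, 27.0],
})

# Mean seconds before the next point for each server.
means = nadal_murray.groupby("server")["seconds_before_next_point"].mean()

# Observed difference in means (Nadal minus Murray).
mean_diff = means["Rafael Nadal"] - means["Andy Murray"]
print(mean_diff)
```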

**Now let's get our permutation function.** (It's okay to copy this directly from the slide.) **Below, go through the function line-by-line, explaining what each line does:**

[Your answer here.]

Using the `mean_perm()` function, **write a list comprehension that runs 10,000 permutations of the difference between Nadal and Murray. Call this variable `mean_perms`.** Will you want to use the full `tennis` dataframe or your filtered dataframe?
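
The general shape of this step can be sketched as follows. Note that `mean_perm_sketch()` is a hypothetical stand-in written for this example, not the function from the slide: the course's `mean_perm()` may take different arguments and work differently. The data values and column names are also made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(241)  # seeded so the sketch is repeatable


def mean_perm_sketch(df, group_col, value_col, first, second):
    """Shuffle the group labels once and return the difference in means.

    Hypothetical stand-in for the slide's mean_perm(); the real
    function's signature may differ.
    """
    # Shuffling the labels breaks any real link between server and time.
    shuffled = rng.permutation(df[group_col].to_numpy())
    values = df[value_col].to_numpy()
    return values[shuffled == first].mean() - values[shuffled == second].mean()


# Made-up values for illustration only.
nadal_murray = pd.DataFrame({
    "server": ["Rafael Nadal"] * 3 + ["Andy Murray"] * 2,
    "seconds_before_next_point": [30.0, 28.0, 32.0, 25.0, 27.0],
})

# One permuted difference per loop iteration, collected into a list.
mean_perms = [
    mean_perm_sketch(
        nadal_murray, "server", "seconds_before_next_point",
        "Rafael Nadal", "Andy Murray",
    )
    for _ in range(10_000)
]
print(len(mean_perms))
```

Because each call shuffles the labels independently, the list approximates the distribution of differences we would expect if the server made no difference.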

Now you're ready to view some results! **Make a histogram showing the permutation distribution, and plot the observed difference in means as a red dotted line. Below, interpret the plot fully. Does it seem like the observed difference in means is statistically significant based on this plot?** (Remember: you can refer to the [How to Explain document](https://jrladd.com/CIS241/resources/how-to-explain) on our course website and Sakai for guidance on how to interpret output *fully and accurately*.)

[Your interpretation here.]

Last but not least, **calculate the p-value for your permutation test. Interpret the p-value fully, making note of *both* statistical and practical significance and relating your findings to what you saw in the previous plot.**
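
One way to think about the p-value is as the share of permuted differences that are at least as extreme as the observed one. The sketch below uses a handful of made-up numbers in place of your real `mean_perms` and `mean_diff`, and it shows both a one-sided and a two-sided version. Which one is appropriate depends on your alternative hypothesis, so treat the choice here as an assumption.

```python
import numpy as np

# Hypothetical stand-ins for the real permutation results.
mean_perms = np.array([-4.2, -1.5, -0.5, 0.0, 0.5, 1.0, 2.0, 4.5])
mean_diff = 4.0

# One-sided: share of permuted differences at least as large as observed.
p_one_sided = np.mean(mean_perms >= mean_diff)

# Two-sided: share at least as far from zero in either direction.
p_two_sided = np.mean(np.abs(mean_perms) >= abs(mean_diff))

print(p_one_sided, p_two_sided)
```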

[Your interpretation here.]

## Conclusion

**Write a brief conclusion summarizing what you found out from the permutation test. Based on the p-value and the typical alpha of 0.05, do you believe there is a statistically significant difference in means between Nadal and Murray? Do you think the difference in means is *practically significant*? Based on other features of the data, especially sample size, do you believe you can trust the results of the permutation test? Are there any next steps you might recommend for this analysis?**

[Your conclusion here.]