# Movie Dialogue 2

**Complete by: Tuesday 4 Feb. at class time**  
Data: <https://jrladd.com/CIS241/data/Pudding-Film-Dialogue.csv>

# Introduction

*These are the same as last time, but I'm including them here for the necessary context.*

Over the past 10 years, there's been a lot of public conversation about gender and racial imbalance in Hollywood. Movements like [#OscarsSoWhite](https://www.nytimes.com/2020/02/06/movies/oscarssowhite-history.html) and trends like [the Bechdel test](https://en.wikipedia.org/wiki/Bechdel_test) have explicitly called out the extent to which certain groups are underrepresented in Hollywood films. But exactly what are the proportions of representation in movies? Before we can address this apparent imbalance, we have to understand its scope and extent.

In 2016, a group of researchers working for the data visualization website [The Pudding](https://pudding.cool/) set out to do just that in order to better understand gender imbalance in film. They compiled a data set of approximately 2000 scripts from throughout movie history and, using text analysis, recorded how much each character speaks. The data was then refined and enhanced first by [Dr. Melanie Walsh](https://melaniewalsh.github.io/Intro-Cultural-Analytics/00-Datasets/00-Datasets.html#hollywood-film-dialogue-by-character-gender-and-age) and then by Dr. John Ladd. I (Dr. Ladd speaking) added genre categories and runtime information from the [larger IMDB dataset](https://www.imdb.com/interfaces/).

Using this census of film scripts, it's possible to better understand what gender imbalance looks like in movies across history and to ask questions about what may affect that imbalance. For example, we might ask: what factors affect the proportion of dialogue in a movie spoken by female characters?

# Ethical Considerations

Discussion of gender bias and imbalance can be deeply sensitive. The issue of whether women are underrepresented in the film industry affects thousands of women actors, directors, and crew members. Representation is also essential for audiences: as long as movies remain a major cultural force, everyone wants to see themselves accurately and fairly portrayed on screen. It's important that data analysts neither underplay nor exaggerate the imbalances that exist in this industry.

But gender, as a concept, is historically fraught, and using a binary definition of gender for a data analysis like this is useful but not fully accurate. I'll let [Dr. Walsh](https://melaniewalsh.github.io/Intro-Cultural-Analytics/03-Data-Analysis/03-Pandas-Basics-Part3.html#the-puddings-film-dialogue-data) explain:

>Yet transforming complex social constructs like gender into quantifiable data is tricky and historically fraught. They [*The Pudding* researchers] claim, in fact, that one of the most [frequently asked questions](https://medium.com/@matthew_daniels/faq-for-the-film-dialogue-by-gender-project-40078209f751) about the piece is about gender: “Wait, but let’s talk about gender. How do you know the monster in Monsters Inc. is a boy!” The short answer is that they don’t. To determine character gender, they used actors’ IMDB information, which they acknowledge is an imperfect approach: “Sometimes, women voice male characters. Bart Simpson, for example, is voiced by a woman. We’re aware that this means some of the data is wrong, AND we’re still fine with the methodology and approach.”

>As we work with this data, we want to be critical and cognizant of this approach to gender. How does such a binary understanding of gender, gleaned from the IMDB pages of actors, influence our later results and conclusions? What do we gain by using such an approach, and what do we lose? How else might we have encoded or determined gender for the same data?

# Data Wrangling

Let's start by getting our data in shape. **In the cell below, import the necessary libraries (`pandas`, `numpy`, `altair`), and don't forget any extra steps for Altair:**

**Now read in the same data URL from last week. Name the dataframe `dialogue` again, and display it:**

Take a look at the data wrangling assignment from last week. You had to do a few essential things to the data to get it in shape for analysis: you ran two filters on two separate columns of data, and you changed the type of another column. **Do all three of those again in the same cell below, making sure to add comments so it's clear what you're doing:**

Now we're ready to do more advanced work with this data set.

For this project, we care about individual characters, but we also care about aggregates within specific movies. We want to know how often female characters speak overall in each movie. To answer questions about movies, we need to create a **summary table** that groups and summarizes our characters by films. We can do that with the code below.

In [None]:
movies = (dialogue.groupby(['title','gender','gross', 'runtimeMinutes', 'release_year', 'genres'])
                  .agg({'proportion_of_dialogue': 'sum', 'words': 'sum', 'age': 'mean'})
                  .reset_index())
movies

The above code is somewhat new to you! We made a summary table using a `.groupby()`, but the other functions are new. See if you can figure out what's going on in the code above. Remember to refer to the Pandas documentation if you're curious about a specific function. (I also wrapped everything in extra parentheses, so that I could put each function call on its own line. This is a useful Pandas trick!)
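If it helps to experiment on something smaller first, here is a minimal sketch of the same chaining pattern on a made-up table. The `toy` dataframe and its column names (`film`, `group`, `lines`, `age`) are hypothetical and are not part of the film data.

```python
import pandas as pd

# Hypothetical toy data: two films with a few characters each
toy = pd.DataFrame({
    "film": ["A", "A", "A", "B", "B"],
    "group": ["x", "y", "x", "x", "y"],
    "lines": [10, 5, 20, 7, 3],
    "age": [30, 40, 50, 25, 35],
})

# The outer parentheses let each method call sit on its own line
summary = (toy.groupby(["film", "group"])
              .agg({"lines": "sum", "age": "mean"})
              .reset_index())

# Compare the shape and columns of the result with the original table
print(toy.shape, summary.shape)
print(summary)
```

Try running each step on its own (first just the `.groupby()`, then adding `.agg()`, then `.reset_index()`) to see how the output changes.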

**Explain the code above line-by-line, noting what each line of code accomplishes. Write your answer below**:

Now you're ready to begin visualizing this data in interesting ways!

# Exploratory Data Analysis

## Characters

First, we'll use the `dialogue` table to understand more about the original dataset. Let's find out how the proportion of dialogue is distributed across characters. For example: how many characters speak more than 50% of the dialogue in their films?

### ***IMPORTANT: EVERY SINGLE VISUALIZATION IN THIS NOTEBOOK SHOULD HAVE A TITLE AND AXIS LABELS ADDED BY YOU!*** 

**To visualize this, create a histogram of the `proportion_of_dialogue` variable. Use the `column=` encoding to see side-by-side histograms, one for each gender. Change the number of bins to make the graph more readable. You may need to look at the [Altair documentation](https://altair-viz.github.io/user_guide/data.html).**

**Below the histogram, write a few sentences of interpretation. What does this graph tell you about the distribution and breakdown of how much men and women speak in movies?**


[Interpretation here.]

Next, it would be nice to see if any of these trends of gender difference are affected by the genre of the movie. To do this, we can look just at how many words women speak across the different genres.

**First, create a new dataframe of only the female characters using a bracket selection. Call this dataset `dialogue_women`.**

**Using this dataset, calculate the average number of words that female characters speak in these films. Once you've calculated the mean for our sample, also calculate the 90% confidence interval for the mean (using bootstrap sampling). Does the confidence interval seem large or small to you? What does this tell you about our sample mean?**
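If you need a refresher on bootstrap sampling, here is a minimal sketch of the percentile method on a small made-up sample. The numbers, the `sample` name, the seed, and the choice of 1000 resamples are all assumptions for illustration; none of them come from the film data.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the sketch is reproducible

# Hypothetical sample standing in for a numeric column
sample = np.array([120, 340, 95, 880, 410, 230, 150, 670, 300, 510])

n_resamples = 1000
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    # Draw a resample of the same size, with replacement, and record its mean
    resample = rng.choice(sample, size=len(sample), replace=True)
    boot_means[i] = resample.mean()

# A 90% interval leaves 5% of the bootstrap means in each tail
lower, upper = np.percentile(boot_means, [5, 95])
print(f"sample mean: {sample.mean():.1f}")
print(f"90% CI: ({lower:.1f}, {upper:.1f})")
```

With a real column you would replace `sample` with the column's values. A wider interval suggests more uncertainty about where the true mean lies.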

**Using the same `dialogue_women` dataset, create a boxplot showing the distribution of how many `words` women speak in a movie across the different `genres`. Be sure to add better axis labels to your plot with `.title()`. Below the boxplot, write a short interpretation of the distributions you observe. In which genres do women speak more?**

*There are a few things we can do to make this graph more readable. First, we can invert the usual x- and y-axis arrangement: with this many categories, it makes more sense to have the categorical variable along the y-axis. We can add a `color=` encoding to differentiate the boxes (though you should remove the legend with `legend=None`). And we can limit the x-axis, to only show the most important part of the graph, by chaining the `scale()` function to our X encoding: `.scale(domain=(0,5000), clamp=True)`. Try making the graph with and without these features to see how it changes, but make sure the final version has them all.*

[Interpretation here.]

## Movies

Now that we've learned a little more about individual characters, we're ready to learn about the proportions of dialogue in entire movies. We can use the `movies` summary table that we created in the data wrangling step to accomplish this.

First, let's figure out how much men speak in movies vs. how much women speak. Will we find out, as we did with individual characters, that male characters tend to speak more than female ones?

**Create a boxplot looking at the distribution of total words (`words`) spoken by the two different `gender` categories in the `movies` dataset. Write a brief interpretation of this graph.**


[Interpretation here.]

We found a clear gender imbalance in this graph! Now we would like to know: are there other variables that help us to explain this rather stark difference?

To answer this question, we'll focus just on the proportion that women characters speak in a given movie. (Since this data assumes a gender binary, the proportion that men speak will always be the complement: one minus the proportion that women speak.) **To do this, filter the `movies` dataset to create a new dataset showing only the statistics on `women`. Call this new dataset `movies_women`.**

Now we have a dataframe containing information on the total proportion of dialogue spoken by women in each movie. This is contained in the `proportion_of_dialogue` variable, and we created it back in the data wrangling step. This is the variable we'd like to learn more about! We would like to know how this variable changes *depending on* other variables: so we call this variable our **dependent variable**.

What things might affect the proportion of dialogue spoken by women? Does the length of the movie make a difference? What about the average age of the female characters? What about the year the movie was released? There are lots of different options in this data set! To find out the answers to these kinds of questions, we need to choose some **independent variables**, variables that might affect or change the dependent variable.

Look through the data and choose 3 independent variables. **List them below and briefly note why you chose them and what question you're trying to answer.** *In this case, all of your independent variables should be **numerical/quantitative**.*

Okay, one last big task to finish off this lab! **To assess the relationships between your dependent variable (`proportion_of_dialogue`) and your chosen independent variables, create THREE scatterplots, one for each relationship. Write a brief interpretation of each one. Did you find the relationship/trend you imagined?**

*Remember, in a scatterplot the dependent variable always goes on the y-axis.* 

And here are two Altair tips: (1) it might help to add the `.interactive()` function to your plots, and (2) to get plots to begin with the lowest value instead of starting all the way at zero, you can add the scale function to either axis with `zero` set to `False`: `.scale(zero=False)`.

[Write your selections for variables here, and explain why you chose them.]

In [1]:
# Scatter plot 1


[Interpretation here.]

In [2]:
# Scatter plot 2


[Interpretation here.]

In [3]:
# Scatter plot 3


[Interpretation here.]

# Conclusion

*Write a short paragraph summarizing what you learned about gender in movies from this lab. Which findings were expected, and which were surprising? How did the different visualizations help you toward a general conclusion? What caveats/next steps do you want to offer?*