# Movie Dialogue

**Complete by: Tuesday 28 Jan. at class time**  
Data: <https://jrladd.com/CIS241/data/Pudding-Film-Dialogue.csv>

## Introduction

Over the past 10 years, there's been a lot of public conversation about gender and racial imbalance in Hollywood. Movements like [#OscarsSoWhite](https://www.nytimes.com/2020/02/06/movies/oscarssowhite-history.html) and trends like [the Bechdel test](https://en.wikipedia.org/wiki/Bechdel_test) have explicitly called out the extent to which certain groups are underrepresented in Hollywood films. But exactly what are the proportions of representation in movies? Before we can address this apparent imbalance, we have to understand its scope and extent.

In 2016, a group of researchers working for the data visualization website [The Pudding](https://pudding.cool/) set out to do just that in order to better understand gender imbalance in film. They compiled a data set of approximately 2000 scripts from throughout movie history and, using text analysis, recorded how much each character speaks. The data was then refined and enhanced first by [Dr. Melanie Walsh](https://melaniewalsh.github.io/Intro-Cultural-Analytics/00-Datasets/00-Datasets.html#hollywood-film-dialogue-by-character-gender-and-age) and then by Dr. John Ladd. I (Dr. Ladd speaking) added genre categories and runtime information from the [larger IMDB dataset](https://www.imdb.com/interfaces/).

Using this census of film scripts, it's possible to better understand what gender imbalance looks like in movies across history and to ask questions about what may affect that imbalance. For example, we might ask: what factors affect the proportion of dialogue in a movie spoken by female characters?

## Ethical Considerations

Discussion of gender bias and imbalance can be deeply sensitive. The issue of whether women are underrepresented in the film industry affects thousands of women actors, directors, and crew members. Representation is also essential for audiences: as long as movies remain a major cultural force, everyone wants to see themselves accurately and fairly portrayed on screen. It's important that data analysts neither underplay nor exaggerate the imbalances that exist in this industry.

But gender, as a concept, is historically fraught, and using a binary definition of gender for a data analysis like this is useful but not fully accurate. I'll let [Dr. Walsh](https://melaniewalsh.github.io/Intro-Cultural-Analytics/03-Data-Analysis/03-Pandas-Basics-Part3.html#the-puddings-film-dialogue-data) explain:

>Yet transforming complex social constructs like gender into quantifiable data is tricky and historically fraught. They [*The Pudding* researchers] claim, in fact, that one of the most [frequently asked questions](https://medium.com/@matthew_daniels/faq-for-the-film-dialogue-by-gender-project-40078209f751) about the piece is about gender: “Wait, but let’s talk about gender. How do you know the monster in Monsters Inc. is a boy!” The short answer is that they don’t. To determine character gender, they used actors’ IMDB information, which they acknowledge is an imperfect approach: “Sometimes, women voice male characters. Bart Simpson, for example, is voiced by a woman. We’re aware that this means some of the data is wrong, AND we’re still fine with the methodology and approach.”

>As we work with this data, we want to be critical and cognizant of this approach to gender. How does such a binary understanding of gender, gleaned from the IMDB pages of actors, influence our later results and conclusions? What do we gain by using such an approach, and what do we lose? How else might we have encoded or determined gender for the same data?


## Let's Begin

First you'll need to import two of the main libraries that we've been discussing in class: `pandas` and `numpy`. Abbreviate pandas as pd and numpy as np. Type your code in the cell below:

In [1]:
# Type your code here

Next, you'll need some data. Use the function `pd.read_csv()` to read the data URL correctly (you'll find the data URL at the very top of this page), and assign it to a variable called `dialogue`. Then display the new dialogue DataFrame you created.

*n.b. You could always download the CSV file, upload it to Jupyter hub, and read the file directly without the URL. But using the URL is faster and easier!*
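If you haven't used `pd.read_csv()` before, here is a minimal sketch of how it behaves. It reads from a small made-up CSV string rather than the real data URL, so the column values are invented purely for illustration; the same function also accepts a URL or a file path as its first argument.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a real file or URL (not the assignment data).
csv_text = """title,character,words
Example Film,Alex,120
Example Film,Sam,80
"""

# pd.read_csv accepts a file path, a URL string, or a file-like object.
toy = pd.read_csv(io.StringIO(csv_text))
print(toy)
```

In a notebook, ending a cell with just the variable name (here, `toy`) displays the DataFrame as a formatted table.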

Above you should see a table of the data. Take a moment to make sure you understand what you're seeing. 

- What does each row represent? A movie? A character? A line of dialogue?
- What does each column represent? What are the different variables and their types (categorical or numerical)?
- Are there any potential problems you can see in this data set?

Write your answers below using Jupyter's "Markdown" feature. Select the cell below, and then change the dropdown at the top of the screen from "Code" to "Markdown." This will allow you to type regular text into the cell and have it render correctly.

## Wrangling

Now that you're a little more familiar with the data and have it loaded into Pandas, we're ready to begin some wrangling. Let's start by sorting the DataFrame according to the `gender` column. What do you notice? Write your code to sort the DataFrame below, and then write what you notice in a Markdown cell below that.
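As a reminder of how sorting works, here is a small sketch using `.sort_values()` on a toy DataFrame. The names and labels in it are invented and aren't meant to match what you'll find in the real data.

```python
import pandas as pd

# Toy data with invented labels, only to show the mechanics of sorting.
toy = pd.DataFrame({
    "character": ["Alex", "Sam", "Robin", "Jo"],
    "gender": ["man", "woman", "unknown", "woman"],
})

# sort_values returns a new, sorted DataFrame; `toy` itself is unchanged.
by_gender = toy.sort_values("gender")
print(by_gender)
```

Text columns sort alphabetically by default; passing `ascending=False` reverses the order.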

[Double-click here to enter your answer.]

You'll have noticed that there's something you need to filter *out* of the data above. Let's remove all rows where the gender of the character is unknown. Write your filter below, and store it in the same variable name, `dialogue`. Display the sorted DataFrame again to make sure you did it correctly! (Don't forget to add some comments so you remember *why* you created this filter.)
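Filtering in Pandas usually means building a True/False condition and using it to pick rows. The sketch below shows the pattern on toy data; the placeholder value `"unknown"` is an assumption for illustration, so check how missing genders actually appear in the real DataFrame before writing your own filter.

```python
import pandas as pd

# Toy data; "unknown" is an invented placeholder, not necessarily the real one.
toy = pd.DataFrame({
    "character": ["Alex", "Sam", "Robin"],
    "gender": ["man", "woman", "unknown"],
})

# Build a boolean Series that is True for the rows we want to keep...
keep = toy["gender"] != "unknown"

# ...and use it to select those rows, storing the result under the same name.
toy = toy[keep]
print(toy)
```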

Now our DataFrame is less likely to generate errors when we group and sort by gender. Next, let's check on our data types. Use the `.info()` method below to see the data types for each column.

One of these is wrong! It should be a number, but instead it's an "object", i.e. a string. Which one is it? Write your answer below, and also explain why it's important to change this from a string to a number:

[Double-click here to enter your answer.]

Now that you've located the trouble, write some code to change this to numeric data. Hint: use the `pd.to_numeric()` function on the column you need, and remember that you can always look things up in the [Pandas documentation](https://pandas.pydata.org/docs/user_guide/index.html#user-guide) if you need to.
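Here is a rough sketch of `pd.to_numeric()` on a toy column named `amount` (a made-up name). The `errors="coerce"` argument turns anything that can't be parsed into a missing value; whether the real column needs it is something you'll have to judge for yourself.

```python
import pandas as pd

# A toy column of numbers stored as text, with one unparseable entry.
toy = pd.DataFrame({"amount": ["10", "25.5", "n/a"]})

# Convert to numbers; errors="coerce" replaces unparseable text with NaN.
toy["amount"] = pd.to_numeric(toy["amount"], errors="coerce")

print(toy["amount"].dtype)
print(toy)
```

Because `NaN` is a floating-point value, a column containing it ends up with a float type even if the other entries are whole numbers.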

When you're done, run `dialogue.info()` again to make sure it worked.

The column should now say "float" instead of "object." That will also help us prevent errors in the future.

Finally, we know the number of words each character speaks, and we know the overall proportion of how much they speak in the film compared to others. It would also be nice to know how much they speak relative to the *length* of the film.

In the code cell below, create a new column called `words_min` that looks at the number of words people speak compared to the runtime of the film. What columns will you need to use to create this one, and what mathematical operation will you have to perform? Make sure to save your work in the same `dialogue` variable, and display the DataFrame when you're done to make sure the new column is there.
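Creating a column from other columns looks like ordinary arithmetic. The sketch below uses hypothetical column names (`total_words`, `minutes`, `words_per_minute`) on toy data to show the pattern; the real DataFrame's columns are named differently, so you'll need to find the right ones yourself.

```python
import pandas as pd

# Toy data with hypothetical column names, not the ones in the assignment data.
toy = pd.DataFrame({
    "speaker": ["A", "B"],
    "total_words": [300, 90],
    "minutes": [60, 45],
})

# Arithmetic between two columns is applied row by row.
toy["words_per_minute"] = toy["total_words"] / toy["minutes"]
print(toy)
```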

## Exploring

There's a fine line between "data wrangling" and exploring our data set more deeply. We can use the same tools to do both.

Let's start by finding some averages. Find the average age of all characters in this data set:

Does this number seem low or high to you? To verify, sort the DataFrame in descending order by age:

Something strange is going on! If we were the data collectors, we could probably find the correct information and fix it. But since we're not, let's simply get rid of the nonsensical ages. Choose a reasonable cut-off and filter the data set accordingly. Display the sorted data set again to make sure it's right.
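To see why a few bad values matter, here is a small sketch on invented ages. The cut-off of 120 is an arbitrary choice for this toy example only; pick your own threshold for the real data and be ready to justify it.

```python
import pandas as pd

# Invented ages, including one value that is clearly not an age.
toy = pd.DataFrame({"age": [34, 52, 2009, 41]})
print(toy["age"].mean())  # pulled far upward by the one bad value

# Keep only rows at or below an (arbitrary) cut-off.
cleaned = toy[toy["age"] <= 120]
print(cleaned["age"].mean())
```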

Now calculate the average age again. Notice a difference?

So far we've used filtering and selecting to clean up our data set. We can also use them to see some things more clearly. Let's look at the data set with just the `title`, `release_year`, `character`, `gender`, and `words` columns.
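Selecting several columns means passing a *list* of column names inside the square brackets, which is why you'll see double brackets. A minimal sketch on toy data (values invented):

```python
import pandas as pd

# Toy data with invented values.
toy = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "character": ["Alex", "Sam"],
    "words": [100, 50],
    "age": [30, 45],
})

# The inner brackets make a list of names; the outer brackets do the selecting.
subset = toy[["title", "words"]]
print(subset)
```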

You can see how easy it is to zero in on the variables you care most about!

Let's also do this with rows. Display just the rows for a single movie: one of my favorites, 1993's *Jurassic Park*.
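Picking out the rows for one value works just like the filters you wrote earlier, using `==` to test for an exact match. A sketch on toy data, with made-up titles, that also chains a sort onto the result:

```python
import pandas as pd

# Toy data with made-up titles and word counts.
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film A"],
    "character": ["Alex", "Sam", "Jo"],
    "words": [30, 50, 100],
})

# Keep rows whose title matches exactly, then sort from most to fewest words.
one_film = toy[toy["title"] == "Film A"].sort_values("words", ascending=False)
print(one_film)
```

The match is exact, so capitalization and spacing in the title have to agree with what's stored in the data.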

What do you notice about the dialogue breakdown by gender in this movie? Write a few observations below. If you would find it easier to sort this filtered DataFrame by words or proportion of dialogue, you can amend the code above to do that.

Do this one more time with whatever movie you choose! Remember that the movies in this data set go up until 2015.

We've been doing a lot of things individually so far, but there's also a method, `.describe()`, that will give us averages and other summary statistics for every column. Let's try that now:
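The result of `.describe()` is itself a DataFrame, so you can pull single values out of it with `.loc`. A quick sketch on toy numbers (invented values):

```python
import pandas as pd

# Toy numeric data with invented values.
toy = pd.DataFrame({"words": [10, 20, 30], "age": [25, 35, 60]})

# describe() summarizes every numeric column: count, mean, std, min, quartiles, max.
summary = toy.describe()
print(summary)

# Rows are labeled by statistic, columns by the original column names.
print(summary.loc["mean", "words"])
```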

This shows mean, min, and max values (and more) for every numerical column in our data set. What's the average length of a film? Now we know! We'll dig into this more as part of our lesson for next week.

And lastly, what we care about the most in this data set—the reason the data was created—is to understand the differences between characters of different genders. Let's create a summary table *grouping* the data set by gender, to see what the *average* proportion of dialogue is for men and women:

This seemed like a sensible thing to do, but the numbers don't make much sense. What do you think went wrong? Write your thoughts below:

[Double-click here to enter your answer.]

Let's try again! This time, group by the movie's title *and* gender. This will let us see the breakdown of dialogue proportion in every film.

But also, for each movie, it makes more sense to *add up* the proportions of dialogue than it does to average them. Let's use `.sum()` instead of `.mean()` this time, and limit our results to just the `proportion_of_dialogue` column.

This will generate a lot of data, but just scan it quickly and get your impressions.
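If grouping by two columns at once is new to you, this sketch shows the general shape on toy data. The column name `proportion` and all of the values are invented; adapt the pattern to the real column names.

```python
import pandas as pd

# Toy data: each row is one character's (invented) share of a film's dialogue.
toy = pd.DataFrame({
    "title": ["Film A", "Film A", "Film A", "Film B", "Film B"],
    "gender": ["man", "man", "woman", "man", "woman"],
    "proportion": [0.5, 0.2, 0.3, 0.4, 0.6],
})

# Group by both columns, keep one column, and add up the values in each group.
totals = toy.groupby(["title", "gender"])["proportion"].sum()
print(totals)
```

The result is indexed by (title, gender) pairs, so `totals.loc[("Film A", "man")]` looks up a single group.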

In [2]:
# pd.options.display.max_rows = None # Remove the hashtag at the beginning of this line & use this code to display every possible row

# Type your code here


What overall patterns did you notice? What questions would you want to ask next? Write some thoughts below:

[Double-click here to enter your answer.]

## Conclusion

Great work! You've successfully wrangled your first data set and got a sense of what Pandas can do. Next week, we'll start to visualize our data set, combining our data wrangling skills with charts and graphs.

To submit this notebook to the Sakai assignment, you'll want to download it as an HTML file. In JupyterHub, go to File -> Save and Export Notebook As... -> HTML. This will download the HTML file to your computer. Open the HTML file and make sure all your work appears in it as expected. If it does, go ahead and submit this file to Sakai.