# Modeling Shakespeare with Neural Networks

**Complete by: Tuesday 15 Apr. at class time**  
Data: (See below.)

At the start of the semester, we looked at what data analysis could show us about the history of film. Since then we've explored many different subjects where we might expect to find lots of data: sports, ecology, business, health. Now we need to ask: can we use data analysis to understand a subject when we don't have any numbers at all?

Shakespeare might seem like the farthest possible thing from data science, but the reality is that people have been analyzing Shakespeare with data just as long as they've been writing books and essays about him. In this workshop, we'll explore all 38 plays in our Shakespeare dataset using data.

We can use classification with neural networks to help us understand a question that readers of Shakespeare's plays have argued over for generations: what genre categories do the plays belong to? In the First Folio (the first collected edition of Shakespeare's plays, published in 1623), the publishers attempted to categorize the plays in the table of contents:

<a title="William Shakespeare, Public domain, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:First_Folio,_Shakespeare_-_0017.jpg"><img width="512" alt="First Folio, Shakespeare - 0017" src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/First_Folio%2C_Shakespeare_-_0017.jpg/512px-First_Folio%2C_Shakespeare_-_0017.jpg"></a>

This is a reasonable first attempt! We've got a nice even set of 3 categories: Comedy, Tragedy, and History. Scholars have since added a fourth category, Romance or Tragicomedy, that includes plays like *The Tempest*, *The Winter's Tale*, *Cymbeline*, and *Pericles*. Last week, you clustered Shakespeare's plays to determine what potential groupings of plays may exist. **In this week's workshop, you'll use an artificial neural network to classify Shakespeare's plays by genre.** Here are the steps you should take:

## Data Wrangling

1. Import all the necessary libraries.
2. Using the same files and the same code as last time, import the Shakespeare data and turn it into a DataFrame of TF-IDF scores. Remember to remove the `.ipynb_checkpoints` row.
3. Because we used *unsupervised* clustering last time, we didn't need a target variable. This time we're using a *supervised* approach, so we do. Let's use the First Folio table of contents, shown above, together with the later Romance category, as our guide to the target variable of genre. Instead of making everyone type that out individually, I've created a Python dictionary with that information and put it into a pandas Series in the cell below. You will use this `genres` variable to make your `y` target.

## Modeling

1. Prepare your data for modeling. You'll use your TF-IDF dataframe (probably called `shakespeare`) as your X, and the `genres` series from the previous step will be your y. You can use `shakespeare` directly as the full dataframe, since you're including *every* feature as a predictor.
2. Split the data into training and test sets. Remember, because TF-IDF is already a type of scaling, you don't need to use `StandardScaler()` here like you normally would.
3. Fit an `MLPClassifier()` to the training data. **Be thoughtful about the hyperparameters you're using** since these will *greatly* affect your model's accuracy.
4. Get predictions, categories, and probabilities from your model. Make a pandas DataFrame that shows the probabilities of each play in your test data for each genre (plays will be the rows and genres will be the columns). What does this table tell you about how the model worked on certain plays?
5. Explain each step with comments and/or markdown cells.
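
The modeling steps above could look roughly like the sketch below. It runs on synthetic data, not the Shakespeare files: the play names, the `word0`...`word19` columns, and the hyperparameter values are all placeholders chosen for illustration, and stratifying the split is one reasonable choice rather than a requirement.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data: 40 made-up "plays", 20 made-up "word" features.
rng = np.random.default_rng(0)
labels = ["comedy", "history", "romance", "tragedy"]
names = [f"play-{i}" for i in range(40)]
y = pd.Series(np.repeat(labels, 10), index=names)
X = pd.DataFrame(rng.random((40, 20)), index=names,
                 columns=[f"word{j}" for j in range(20)])
# Give each genre one boosted feature so there is something to learn.
for k, label in enumerate(labels):
    X.loc[y == label, f"word{k}"] += 1.0

# Split the data; stratify keeps every genre in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Fit the network to the training data (placeholder hyperparameters).
model = MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=0)
model.fit(X_train, y_train)

# Predictions, categories, and probabilities.
predictions = model.predict(X_test)
categories = model.classes_
probabilities = model.predict_proba(X_test)

# Plays as rows, genres as columns.
prob_table = pd.DataFrame(probabilities, index=X_test.index, columns=categories)
print(prob_table.round(2))
```

Rows where one genre is close to 1 are plays the model is confident about; rows where the probability is spread across several genres are the interesting ones to discuss.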

## Validation

1. Run the usual validation steps: create a confusion matrix and use `classification_report` to get precision, recall, F1, and accuracy scores. (Remember that the ROC curve applies to binary classifiers, so you can't make one directly for this multiclass model.)
2. Use cross-validation to see how your model performed. How does the cross-validation score compare to the score of your specific split of the data?
3. Explain each step with comments and/or markdown cells. How did your model perform overall? Would you trust this neural network to classify plays correctly?
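
A minimal sketch of these validation steps follows. It again uses synthetic placeholder data and placeholder hyperparameters, so the scores it prints say nothing about Shakespeare. `cv=5` is an assumption: with a small genre such as romance, the number of folds cannot exceed the number of plays in that genre.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (not the real plays).
rng = np.random.default_rng(0)
labels = ["comedy", "history", "romance", "tragedy"]
names = [f"play-{i}" for i in range(40)]
y = pd.Series(np.repeat(labels, 10), index=names)
X = pd.DataFrame(rng.random((40, 20)), index=names,
                 columns=[f"word{j}" for j in range(20)])
for k, label in enumerate(labels):
    X.loc[y == label, f"word{k}"] += 1.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Confusion matrix with labeled rows (true genre) and columns (predicted genre).
cm = pd.DataFrame(
    confusion_matrix(y_test, predictions, labels=model.classes_),
    index=model.classes_, columns=model.classes_,
)
print(cm)

# Precision, recall, F1, and accuracy for each genre.
print(classification_report(y_test, predictions, zero_division=0))

# Score on this one split versus the mean score across 5 folds.
split_score = model.score(X_test, y_test)
cv_scores = cross_val_score(
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=0),
    X, y, cv=5,
)
print("single split:", split_score)
print("cross-validation mean:", cv_scores.mean())
```

A large gap between the single-split score and the cross-validation mean suggests that the result depends heavily on which plays landed in the test set.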

```python
import pandas as pd

# Genre label for each play, keyed by the play's file name
genres = {'much-ado-about-nothing': 'comedy',
 'richard-iii': 'history',
 'the-winters-tale': 'romance',
 'richard-ii': 'history',
 'henry-vi-part-3': 'history',
 'the-two-noble-kinsmen': 'romance',
 'timon-of-athens': 'tragedy',
 'the-merchant-of-venice': 'comedy',
 'loves-labors-lost': 'comedy',
 'troilus-and-cressida': 'tragedy',
 'a-midsummer-nights-dream': 'comedy',
 'henry-iv-part-1': 'history',
 'henry-vi-part-1': 'history',
 'henry-v': 'history',
 'pericles': 'romance',
 'the-merry-wives-of-windsor': 'comedy',
 'as-you-like-it': 'comedy',
 'king-john': 'history',
 'cymbeline': 'romance',
 'alls-well-that-ends-well': 'comedy',
 'henry-viii': 'history',
 'julius-caesar': 'tragedy',
 'the-tempest': 'romance',
 'macbeth': 'tragedy',
 'hamlet': 'tragedy',
 'the-taming-of-the-shrew': 'comedy',
 'coriolanus': 'tragedy',
 'othello': 'tragedy',
 'romeo-and-juliet': 'tragedy',
 'measure-for-measure': 'comedy',
 'antony-and-cleopatra': 'tragedy',
 'henry-vi-part-2': 'history',
 'titus-andronicus': 'tragedy',
 'twelfth-night': 'comedy',
 'henry-iv-part-2': 'history',
 'king-lear': 'tragedy',
 'the-comedy-of-errors': 'comedy',
 'the-two-gentlemen-of-verona': 'comedy'}

# Convert the dictionary to a pandas Series indexed by play name
genres = pd.Series(genres)
```
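
Because `genres` is indexed by play name, one way to build `y` is to reorder it to match the rows of your TF-IDF dataframe. The sketch below uses a tiny hypothetical `shakespeare` dataframe and a four-play `genres` series, and it assumes your dataframe's index uses the same hyphenated play names as the dictionary keys.

```python
import pandas as pd

# Tiny hypothetical stand-ins for the real `genres` and `shakespeare` objects.
genres = pd.Series({
    "hamlet": "tragedy",
    "twelfth-night": "comedy",
    "henry-v": "history",
    "the-tempest": "romance",
})
shakespeare = pd.DataFrame(
    {"love": [0.1, 0.0, 0.3, 0.2], "crown": [0.0, 0.4, 0.1, 0.0]},
    index=["twelfth-night", "henry-v", "hamlet", "the-tempest"],
)

# Reorder the labels so that row i of X and element i of y are the same play.
y = genres.reindex(shakespeare.index)

# Any play without a label shows up as NaN; this list should be empty.
missing = y[y.isna()].index.tolist()
print("plays without a genre:", missing)

# Check how many plays fall into each genre before splitting.
print(y.value_counts())
```

If `missing` is not empty, the index names and the dictionary keys disagree somewhere, and that is worth fixing before you split the data.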