# Modeling Shakespeare with K-Means Clustering

**Complete by: Tuesday 8 Apr. at class time**  
Data: (See below.)

At the start of the semester, we looked at what data analysis could show us about the history of film. Since then we've explored many different subjects where we might expect to find lots of data: sports, ecology, business, health. Now we need to ask: can we use data analysis to understand a subject when we don't have any numbers at all?

Shakespeare might seem like the farthest possible thing from data science, but the reality is that people have been analyzing Shakespeare with data just as long as they've been writing books and essays about him. In this workshop, we'll explore all 37 of Shakespeare's plays using data.

We can use K-Means clustering to help us understand a question that readers of Shakespeare's plays have argued over for generations: what genre categories do the plays belong to? In the First Folio (the first collected edition of Shakespeare's plays, published in 1623), the publishers attempted to categorize the plays in the table of contents:

<a title="William Shakespeare, Public domain, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:First_Folio,_Shakespeare_-_0017.jpg"><img width="512" alt="First Folio, Shakespeare - 0017" src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8a/First_Folio%2C_Shakespeare_-_0017.jpg/512px-First_Folio%2C_Shakespeare_-_0017.jpg"></a>

This is a reasonable first attempt! We've got a nice even set of 3 categories: Comedy, Tragedy, and History. Scholars have since added a fourth category, Romance or Tragicomedy, that includes plays like *The Tempest*, *The Winter's Tale*, *Cymbeline*, and *Pericles*. **In this workshop, you'll attempt to cluster Shakespeare's plays to determine what potential groupings of plays may exist.**

## What's In a Number?

But wait! You can download a folder full of Shakespeare's plays—the text is in lots of different forms—but none of this is a tidy dataset or CSV. How do we turn Shakespeare's plays into data? One way to model the similarity between texts is to *count their words*. In fact, studies have shown that just the 100 or so most common words in a text can be enough for good classification.

Using Python, we can easily read a bunch of text files and count their words, but we'd run into a scaling problem very quickly. Some words are used hundreds of times, but others appear only once or twice. Instead of using z-scores as we have in the past, we'll borrow a technique from the field of Information Retrieval called [TF-IDF (term frequency–inverse document frequency)](https://programminghistorian.org/en/lessons/analyzing-documents-with-tfidf), which weights each word's count by how rare that word is across the whole collection of documents. scikit-learn has a single class, `TfidfVectorizer()`, that lets us do all of this in one step. You'll need that along with some new Python libraries to read in the zipped files and convert them into a dataframe of TF-IDF values.
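To get a feel for what TF-IDF does before reaching for the library, here is a minimal sketch that computes the weights by hand using only the standard library. The three tiny "documents" are invented for illustration (they are not Shakespeare's text), the helper name `tfidf` is hypothetical, and the whitespace tokenizer is a simplification. The formula is the smoothed version that scikit-learn's documentation describes as its default (`idf = ln((1 + n) / (1 + df)) + 1`, followed by scaling each row to length 1), so treat it as an approximation of what the library does rather than a replacement for it.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only
docs = {
    "doc_a": "the king and the queen",
    "doc_b": "the fool and the king",
    "doc_c": "the storm and the sea",
}

def tfidf(docs):
    """Return {doc_name: {word: weight}} using smoothed IDF and L2 normalization."""
    # Term frequency: raw word counts per document (simple whitespace split)
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    vocab = sorted({word for c in counts.values() for word in c})
    n = len(docs)

    # Document frequency: in how many documents does each word appear?
    df = {word: sum(1 for c in counts.values() if word in c) for word in vocab}

    # Smoothed inverse document frequency: rarer words get larger values
    idf = {word: math.log((1 + n) / (1 + df[word])) + 1 for word in vocab}

    weights = {}
    for name, c in counts.items():
        raw = {word: c[word] * idf[word] for word in vocab}
        # Scale each document's row to unit length so long and short texts are comparable
        norm = math.sqrt(sum(v * v for v in raw.values()))
        weights[name] = {word: v / norm for word, v in raw.items()}
    return weights

weights = tfidf(docs)
for name, row in weights.items():
    print(name, {word: round(v, 2) for word, v in row.items() if v > 0})
```

In `doc_a`, "queen" (found in one document) ends up weighted above "king" (found in two), which in turn sits above "and" (found in all three), even though each appears exactly once in that document. That is the rescaling effect we want in place of z-scores.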

Look up the documentation for `TfidfVectorizer()` and see if you can figure out what to do. Don't spend more than 5 or 10 minutes puzzling it over, though. When you're ready, you can reveal the solution.

<details>
<summary>Click here for the TF-IDF solution</summary>
<pre><code># The new libraries
import requests, re, zipfile, io
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

#Empty lists for titles and texts
titles = [] #use as row labels
texts = [] #the data we will analyze

shakeszip = requests.get("https://jrladd.com/CIS241/data/shakespeare.zip")

#Unzip the folder, get all the files out, and save the play titles
with zipfile.ZipFile(io.BytesIO(shakeszip.content)) as myzip: # Look inside our zipfile
    for i in myzip.infolist(): # Loop through each file
        if not i.is_dir() and not i.filename.startswith('__MACOSX'): # Skip directories and the pointless macOS duplicates
            titles.append(re.split(r"/|_TXT",i.filename)[1]) # Add titles to list
            texts.append(myzip.read(i.filename)) # Add the text to list

#Create a vectorizer instance, save only 100 words
vectorizer = TfidfVectorizer(max_features=100)

#Transform files into TF-IDF
shakespeare = vectorizer.fit_transform(texts)
#Turn vectorizer results into readable dataframe
shakespeare = pd.DataFrame(shakespeare.toarray(), index=titles, columns=vectorizer.get_feature_names_out())
shakespeare</code></pre>
</details>

**In the cells below, import the necessary libraries for TF-IDF and KMeans clustering, read in the Shakespeare files, and turn them into a dataframe of TF-IDF values**:

## K-Means Clustering

Now you have some data: normalized counts of the 100 most frequent words in Shakespeare's plays.

In this next section, do the following:

1) Determine the number of clusters you will need. Consider the explanation written above, but also try the **elbow method**. Explain *why* you chose the value for K that you did. *Because you're using TF-IDF for scaling, you do **not** need z-scores*.
2) Run K-Means clustering on the Shakespeare data. Look at the results to see if they match what literary scholars think about how different plays go together (i.e. do they match the table of contents above?). Write down some of your thoughts.
3) Assess your clustering model by looking at the size of the clusters and a visualization of the cluster means. Did your model do a good job? What features seem to be separating the clusters? Write down some ideas, and suggest one or two next steps.
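As a reminder of how the elbow method in step 1 works, here is a rough sketch on synthetic data. The three 2D "blobs" are made up for illustration and stand in for the real features; the range of K values, `n_init=10`, and `random_state=0` are arbitrary choices, not requirements. The idea is to fit K-Means for several values of K and compare each model's `inertia_` (the within-cluster sum of squared distances).

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: three well-separated blobs (NOT the Shakespeare data)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 5], [0, 5]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Fit one model per candidate K and record its inertia
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

# Look for the K after which the inertia stops dropping sharply
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On this toy data the inertia falls steeply up to K=3 and then levels off, which is the "elbow." For the workshop, `X` would be replaced by your TF-IDF dataframe, and plotting the values as a line chart usually makes the bend easier to spot. Real text data rarely produces an elbow this clean, so be ready to justify your choice in words.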