Often, a dataset will include a column or set of columns that contain unstructured text: not just a category, but descriptions or other information in full sentences. In those cases, you’ll sometimes want to extract information from the text and turn it into numerical or categorical data.

This short guide will show you how to explore and analyze text data with some simple methods. This is by no means a complete guide to text analysis! Instead, I’ll show you how to use a few basic text analysis methods to work better with your tabular data.

Libraries and Data

First you’ll need to load some libraries (make sure you’ve installed them first). The tidyverse includes a library called stringr, which has a robust set of functions for working with text data. (Remember: “strings” or “character data” are synonyms for text data.)

I’ve included two more libraries that will help you work with text in R: tidytext, which cleans up and “tokenizes” text data, and wordcloud, which creates “wordcloud” visualizations of text.
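If you haven’t installed these packages before, a one-time installation from CRAN should work (skip this if they’re already installed):

```r
# One-time installation from CRAN; only needed if the
# packages aren't already on your machine
install.packages(c("tidyverse", "tidytext", "wordcloud"))
```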

Begin by loading these three libraries:

library(tidyverse)
library(tidytext)
library(wordcloud)

Now let’s read in some data. For this demonstration, I’ve chosen to look at some data on chess matches from Lichess (lichess.org), which I found on Kaggle.

chess <- read_csv("chess.csv")
glimpse(chess)
## Rows: 20,058
## Columns: 16
## $ id             <chr> "TZJHLljE", "l1NXvwaE", "mIICvQHh", "kWKvrqYL", "9tXo1A…
## $ rated          <lgl> FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE…
## $ created_at     <dbl> 1.50421e+12, 1.50413e+12, 1.50413e+12, 1.50411e+12, 1.5…
## $ last_move_at   <dbl> 1.50421e+12, 1.50413e+12, 1.50413e+12, 1.50411e+12, 1.5…
## $ turns          <dbl> 13, 16, 61, 61, 95, 5, 33, 9, 66, 119, 39, 38, 60, 31, …
## $ victory_status <chr> "outoftime", "resign", "mate", "mate", "mate", "draw", …
## $ winner         <chr> "white", "black", "white", "white", "white", "draw", "w…
## $ increment_code <chr> "15+2", "5+10", "5+10", "20+0", "30+3", "10+0", "10+0",…
## $ white_id       <chr> "bourgris", "a-00", "ischia", "daniamurashov", "nik2211…
## $ white_rating   <dbl> 1500, 1322, 1496, 1439, 1523, 1250, 1520, 1413, 1439, 1…
## $ black_id       <chr> "a-00", "skinnerua", "a-00", "adivanov2009", "adivanov2…
## $ black_rating   <dbl> 1191, 1261, 1500, 1454, 1469, 1002, 1423, 2108, 1392, 1…
## $ moves          <chr> "d4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5 Bf4", "…
## $ opening_eco    <chr> "D10", "B00", "C20", "D02", "C41", "B27", "D00", "B00",…
## $ opening_name   <chr> "Slav Defense: Exchange Variation", "Nimzowitsch Defens…
## $ opening_ply    <dbl> 5, 4, 3, 3, 5, 4, 10, 5, 6, 4, 1, 9, 3, 2, 8, 7, 8, 8, …

Using glimpse, you can see that this data has lots of information about each chess match. In particular, it has two text columns that will interest us. One is the “moves” column, which includes chess notation for every move in a game. The other is the “opening_name” column, which includes the many, sometimes odd, names for chess openings. They include things like “Italian Game: Anti-Fried Liver Defense” and “Sicilian Defense: Smith-Morra Gambit #2.”

Names of chess openings developed over hundreds of years, so the names aren’t consistent categories. Let’s try to make a bar chart to see which openings are the most common:

ggplot(chess, aes(x=opening_name)) +
  geom_bar()

This isn’t helpful at all! There are way too many bars, and we can’t even read the text labels at the bottom.

Our usual methods for working with categorical data won’t work because there’s too much variation in the names. There are many different variations of the popular Sicilian Defense, for instance. But we might simply want to know how many times the Sicilian Defense was played, regardless of the variation.

To work with data this way, we need text analysis methods!

Cleaning Texts and Counting Words

As a first step, let’s try to see what the most common words in these opening names are. Maybe the word “sicilian” appears a lot, or the phrase “queen’s gambit.”

To find out, we first need to count the words, and the first step toward that is to tokenize the text. A “token” in text analysis usually just means an individual word, so “tokenizing” a text means splitting it up into individual words.

First we need to create a tibble (i.e. a dataframe) that contains just the opening_name column. Then we can use the unnest_tokens() function from tidytext to create a table where each word (token) is in its own row.

openings <- tibble(text = chess$opening_name)
tidy_openings <- openings %>%
  unnest_tokens(word, text)
head(tidy_openings)
## # A tibble: 6 × 1
##   word       
##   <chr>      
## 1 slav       
## 2 defense    
## 3 exchange   
## 4 variation  
## 5 nimzowitsch
## 6 defense

If we look at the first few rows of tidy_openings, we can see that now each word is in its own row!

Now we can simply use dplyr to count up the word frequencies:

tidy_openings %>%
  count(word, sort=TRUE) %>%
  head() # Only show the first few rows
## # A tibble: 6 × 2
##   word          n
##   <chr>     <int>
## 1 defense   11701
## 2 variation  8024
## 3 game       4931
## 4 opening    3198
## 5 sicilian   2951
## 6 gambit     2641

We can see that the most frequent words used to describe openings are “defense,” “variation,” and “game”. This makes sense!

Visualizing Word Frequency with Wordclouds

We don’t have to rely on a list alone to see word frequency. We can use a “wordcloud,” which visualizes word frequencies through the size and color of words. Let’s try it:

wordcloud(tidy_openings$word, max.words = 100, random.order = FALSE, colors=brewer.pal(8, "Dark2"))

Looking at it this way, we can immediately see a problem with our approach. The words “defense” and “variation” (and to some extent, “game”) are disproportionately frequent compared to the rest of the words (which are mostly small and green). But we don’t care as much about these words!

We know that words like “defense” and “variation” appear in lots of different kinds of openings. Instead of seeing these words, we’d like to see words that signal unique openings: words like “sicilian,” “lopez,” and “giuoco.”

It’s very common in text analysis to remove stopwords: words that are very frequent but don’t carry a lot of information the researchers care about. Often a list of stopwords includes small, very frequent words like articles (a, an, the), prepositions (of, from, to), or pronouns (she, her, they). In our case, we want our stopwords to be common chess terms that don’t carry a lot of meaning, like “defense.”
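As an aside, for general English prose, tidytext ships a built-in stop_words table (a collection of standard English stopword lexicons, one word per row) that you can remove with anti_join() in exactly the same way. Here’s a minimal sketch on a toy tokenized table:

```r
library(tibble)
library(dplyr)
library(tidytext)

# tidytext's built-in stop_words table collects several standard
# English stopword lexicons (one word per row)
toy <- tibble(word = c("the", "sicilian", "defense"))

# anti_join() keeps only the rows whose word is NOT in stop_words
toy %>%
  anti_join(stop_words, by = "word")
```

Our opening names contain few ordinary English stopwords, though, which is why a custom chess-specific list is the better fit here.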

I’ve assembled a short list of stopwords below. You can remove them with anti_join() and create the wordcloud again:

stopwords_chess <- tibble(word = c("defense", "variation", "opening", "game", "gambit", "attack"))
tidy_openings_stop <- tidy_openings %>%
  anti_join(stopwords_chess, by = "word")
wordcloud(tidy_openings_stop$word, max.words = 100, random.order = FALSE, colors=brewer.pal(8, "Dark2"))

This wordcloud is much better! Instead of seeing a few very frequent words that don’t carry a lot of meaning, we’re seeing a lot of more moderately frequent words that do indicate specific openings. We can see that the Sicilian is a pretty common opening, and so is the Ruy Lopez.

This isn’t a perfect approach: “queens” is very common here, but we know that multiple openings include the word “queen.” The Queen’s Gambit and the Queen’s Pawn Opening are two common examples. So if we want to be really thorough, we may not want to stop at simply tokenizing the words into individual terms. But it’s a good enough start, and it gives us a much better idea of the common openings we might be dealing with.
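One way to push past single tokens (a sketch, not something this guide requires) is to tokenize into bigrams, i.e. two-word phrases, so that phrases like “queens gambit” and “queens pawn” are counted as distinct units. unnest_tokens() supports this via its token argument; this assumes the chess data loaded above:

```r
library(tibble)
library(tidytext)
library(dplyr)

# Tokenize opening names into two-word phrases (bigrams) instead
# of single words, then count the most frequent phrases
openings <- tibble(text = chess$opening_name)
openings %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE) %>%
  head()
```

Note that unnest_tokens() lowercases the text as it tokenizes, so the bigrams come out in lowercase.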

Making New Columns from Text Data

Now we know that the Sicilian Defense and the Queen’s Gambit may be two of the most common openings. But how often are they used in relation to each other? We can use stringr functions to do pattern matching in text columns: look for a specific word or phrase and create new columns based on that info.

Let’s create two new columns of binary data, where the value is TRUE or FALSE based on whether “Sicilian Defense” or “Queen’s Gambit” appears in the opening name:

chess <- chess %>%
  mutate(sicilian = str_detect(opening_name, "Sicilian Defense"),
         queensgambit = str_detect(opening_name, "Queen's Gambit"))

Now we can create a bar chart based on one of those columns:

ggplot(chess, aes(x=sicilian, fill=sicilian)) +
  geom_bar()

This is much better than our first bar chart, but not ideal. It shows us how many times the Sicilian Defense was used (a little more than 2500), but it compares that to “FALSE”, the number of times it wasn’t.

We’d rather compare the number of times the Sicilian Defense was used to the number of times the Queen’s Gambit was used. We need slightly more advanced logic to do this, but if we apply what we know about mutate, str_detect, and ifelse, we can manage it. Pay attention to the nested logic we use to create the new column:

chess <- chess %>%
  mutate(sicilian_queensgambit = ifelse(
    str_detect(opening_name, "Sicilian Defense"), 
    "Sicilian", 
    ifelse(str_detect(opening_name, "Queen's Gambit"),
           "Queen's Gambit", 
           NA)
    ))

If the opening name contains “Sicilian Defense”, the new column is marked as Sicilian; if it contains “Queen’s Gambit,” it’s marked as Queen’s Gambit; and if it contains neither, it’s marked as NA. Now that’s a categorical variable we can use!
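As an aside, nested ifelse() calls get hard to read quickly. dplyr’s case_when() expresses the same logic more flatly; this sketch produces the same column as the code above:

```r
library(dplyr)
library(stringr)

# Same logic as the nested ifelse(): conditions are checked in
# order, the first match wins, and unmatched rows become NA
chess <- chess %>%
  mutate(sicilian_queensgambit = case_when(
    str_detect(opening_name, "Sicilian Defense") ~ "Sicilian",
    str_detect(opening_name, "Queen's Gambit") ~ "Queen's Gambit",
    TRUE ~ NA_character_
  ))
```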

Let’s drop the null values and create a new bar chart:

chess %>%
  drop_na(sicilian_queensgambit) %>%
  ggplot(aes(x=sicilian_queensgambit, fill=sicilian_queensgambit)) +
    geom_bar()

This is a much more informative and interesting bar chart. (Remember our first, unreadable one?) We can see that the Sicilian Defense is between two and three times as popular as the Queen’s Gambit. Don’t tell Netflix!
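If you want the exact numbers behind the bars (assuming the sicilian_queensgambit column created above), count() gives them directly:

```r
library(dplyr)
library(tidyr)

# Exact counts behind the bar chart: one row per opening category
chess %>%
  drop_na(sicilian_queensgambit) %>%
  count(sicilian_queensgambit)
```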

Counting Checks

Let’s try one more stringr technique. Sometimes we want to turn a text column into numerical data by counting things.

In the “moves” column, there’s chess notation for every single move in a game. In chess notation, a plus sign means someone has checked their opponent’s king. What if we wanted to know the number of checks in each game? We could count the plus signs in each row of the moves column using str_count:

chess <- chess %>%
  mutate(checks = str_count(moves, "\\+"))

# We use \\ before the plus sign so that R treats it
# as a literal character to match, rather than as the
# regex "one or more" quantifier.
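If regex escaping feels error-prone, stringr’s fixed() helper is an equivalent alternative: it matches the pattern literally, with no regex interpretation, so no escaping is needed. A quick sketch on a single move string:

```r
library(stringr)

# fixed() matches the pattern literally, so the plus
# sign needs no escaping
str_count("e4 e5 Qh5 Nc6 Qxf7+", fixed("+"))
# → 1
```

The same idea works on the whole column: str_count(moves, fixed("+")).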

With this information we could form new research questions. Are there more checks when White wins the match or when Black wins? Let’s find out with a box plot:

ggplot(chess, aes(x=winner,y=checks,color=winner)) +
  geom_boxplot()

A somewhat surprising result! There doesn’t appear to be much difference in the number of checks whether Black or White wins, but there do seem to be more checks when the game is a draw. This makes sense if you think about it: games that end in a draw often last longer. The players are evenly matched, with both sides getting many chances to check the other’s king but neither winning the game outright.

Without these basic text searching and counting skills, we never would have had the data to make these visualizations! Are there columns in your own datasets where these methods might be applied?