Tutorial 5 Preprocessing

This is our fifth tutorial for running R. In this tutorial, we’ll learn about text pre-processing. Text data is often messy and noisy, and careful preprocessing can remove much of that noise before downstream analyses.

By the end of this notebook, you should be familiar with the following:

  1. Stopword lists

  2. Stemming

  3. preText

  4. N-grams and phrasemachine

  5. Readability

Recap: Tokenization

This week, we’ll return to looking at the Harry Potter series. We’ll first install and load the packages and data for today’s notebook.

library(devtools)
Loading required package: usethis
library(tidytext)
library(plyr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::arrange()   masks plyr::arrange()
✖ purrr::compact()   masks plyr::compact()
✖ dplyr::count()     masks plyr::count()
✖ dplyr::desc()      masks plyr::desc()
✖ dplyr::failwith()  masks plyr::failwith()
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::id()        masks plyr::id()
✖ dplyr::lag()       masks stats::lag()
✖ dplyr::mutate()    masks plyr::mutate()
✖ dplyr::rename()    masks plyr::rename()
✖ dplyr::summarise() masks plyr::summarise()
✖ dplyr::summarize() masks plyr::summarize()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(quanteda)
Package version: 4.1.0
Unicode version: 14.0
ICU version: 71.1
Parallel computing: disabled
See https://quanteda.io for tutorials and examples.
# Define the folder containing the .rda files (Change to your path). 
folder <- "/Users/mpang/Dropbox/Teaching Resources/DACSS_TAD/HarryPotter"

# Get the list of all .rda files in the folder
rda_files <- list.files(folder, pattern = "\\.rda$", full.names = TRUE)

# Load all .rda files into the environment
lapply(rda_files, load, .GlobalEnv)
[[1]]
[1] "chamber_of_secrets"

[[2]]
[1] "deathly_hallows"

[[3]]
[1] "goblet_of_fire"

[[4]]
[1] "half_blood_prince"

[[5]]
[1] "order_of_the_phoenix"

[[6]]
[1] "philosophers_stone"

[[7]]
[1] "prisoner_of_azkaban"

As a reminder, we have seven books — each stored as a character vector where each chapter is an element in that vector — now available in our workspace. These are:

  1. philosophers_stone: Harry Potter and the Philosopher’s Stone (1997)

  2. chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)

  3. prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)

  4. goblet_of_fire: Harry Potter and the Goblet of Fire (2000)

  5. order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)

  6. half_blood_prince: Harry Potter and the Half-Blood Prince (2005)

  7. deathly_hallows: Harry Potter and the Deathly Hallows (2007)
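
As a quick sanity check of that structure, the short sketch below should confirm that each book object is a plain character vector with one element per chapter (shown here for philosophers_stone).

# each book should be a character vector, one element per chapter
class(philosophers_stone)
length(philosophers_stone)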

As you’ll recall, we want to convert these to corpus objects that are easier to work with.

philosophers_stone_corpus <- corpus(philosophers_stone)
philosophers_stone_summary <- summary(philosophers_stone_corpus) 
philosophers_stone_summary
Corpus consisting of 17 documents, showing 17 documents:

   Text Types Tokens Sentences
  text1  1271   5693       350
  text2  1066   4154       237
  text3  1226   4656       297
  text4  1193   4832       322
  text5  1818   8446       563
  text6  1563   8016       566
  text7  1374   5487       351
  text8  1095   3608       198
  text9  1422   6195       411
 text10  1293   5237       334
 text11  1110   4215       277
 text12  1509   6790       447
 text13  1076   3953       262
 text14  1110   4394       308
 text15  1385   6486       459
 text16  1581   8357       591
 text17  1489   7172       506

We then add indicators for the book and chapter, and attach them as document-level metadata (docvars). This will be useful later when we combine all the books into a single corpus.

philosophers_stone_summary$book <- "Philosopher's Stone"

# create a chapter indicator
philosophers_stone_summary$chapter <- as.numeric(str_extract(philosophers_stone_summary$Text, "[0-9]+"))

# add the metadata
docvars(philosophers_stone_corpus) <- philosophers_stone_summary

philosophers_stone_summary
Corpus consisting of 17 documents, showing 17 documents:

   Text Types Tokens Sentences                book chapter
  text1  1271   5693       350 Philosopher's Stone       1
  text2  1066   4154       237 Philosopher's Stone       2
  text3  1226   4656       297 Philosopher's Stone       3
  text4  1193   4832       322 Philosopher's Stone       4
  text5  1818   8446       563 Philosopher's Stone       5
  text6  1563   8016       566 Philosopher's Stone       6
  text7  1374   5487       351 Philosopher's Stone       7
  text8  1095   3608       198 Philosopher's Stone       8
  text9  1422   6195       411 Philosopher's Stone       9
 text10  1293   5237       334 Philosopher's Stone      10
 text11  1110   4215       277 Philosopher's Stone      11
 text12  1509   6790       447 Philosopher's Stone      12
 text13  1076   3953       262 Philosopher's Stone      13
 text14  1110   4394       308 Philosopher's Stone      14
 text15  1385   6486       459 Philosopher's Stone      15
 text16  1581   8357       591 Philosopher's Stone      16
 text17  1489   7172       506 Philosopher's Stone      17
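
As a hedged sketch of that later step, here is how a second book could be prepared the same way and then combined with the first; quanteda corpora can be combined with +. The docnames() renaming is just to keep document names unique across books.

# prepare a second book the same way, then combine the two corpora
chamber_of_secrets_corpus <- corpus(chamber_of_secrets)
chamber_of_secrets_summary <- summary(chamber_of_secrets_corpus)
chamber_of_secrets_summary$book <- "Chamber of Secrets"
chamber_of_secrets_summary$chapter <- as.numeric(str_extract(chamber_of_secrets_summary$Text, "[0-9]+"))
docvars(chamber_of_secrets_corpus) <- chamber_of_secrets_summary

# keep document names unique before combining
docnames(chamber_of_secrets_corpus) <- paste0("cos_", docnames(chamber_of_secrets_corpus))
two_books_corpus <- philosophers_stone_corpus + chamber_of_secrets_corpus
ndoc(two_books_corpus)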

Next, we move to tokenization. The default word tokenizer in quanteda splits on word boundaries, which is why punctuation marks show up as separate tokens below.

philosophers_stone_tokens <- tokens(philosophers_stone_corpus)
print(philosophers_stone_tokens)
Tokens consisting of 17 documents and 6 docvars.
text1 :
 [1] "THE"     "BOY"     "WHO"     "LIVED"   "Mr"      "."       "and"    
 [8] "Mrs"     "."       "Dursley" ","       "of"     
[ ... and 5,681 more ]

text2 :
 [1] "THE"       "VANISHING" "GLASS"     "Nearly"    "ten"       "years"    
 [7] "had"       "passed"    "since"     "the"       "Dursleys"  "had"      
[ ... and 4,142 more ]

text3 :
 [1] "THE"         "LETTERS"     "FROM"        "NO"          "ONE"        
 [6] "The"         "escape"      "of"          "the"         "Brazilian"  
[11] "boa"         "constrictor"
[ ... and 4,644 more ]

text4 :
 [1] "THE"     "KEEPER"  "OF"      "THE"     "KEYS"    "BOOM"    "."      
 [8] "They"    "knocked" "again"   "."       "Dudley" 
[ ... and 4,820 more ]

text5 :
 [1] "DIAGON"   "ALLEY"    "Harry"    "woke"     "early"    "the"     
 [7] "next"     "morning"  "."        "Although" "he"       "could"   
[ ... and 8,434 more ]

text6 :
 [1] "THE"            "JOURNEY"        "FROM"           "PLATFORM"      
 [5] "NINE"           "AND"            "THREE-QUARTERS" "Harry's"       
 [9] "last"           "month"          "with"           "the"           
[ ... and 8,004 more ]

[ reached max_ndoc ... 11 more documents ]

You’ll recall that in Week 2 we covered a couple of options for what we can do next with our texts. A few things are immediately evident from the output above: punctuation and numbers are still present. Both are straightforward to remove when you tokenize.

# you can also drop punctuation
philosophers_stone_tokens <- tokens(philosophers_stone_corpus, 
    remove_punct = T)
print(philosophers_stone_tokens)
Tokens consisting of 17 documents and 6 docvars.
text1 :
 [1] "THE"     "BOY"     "WHO"     "LIVED"   "Mr"      "and"     "Mrs"    
 [8] "Dursley" "of"      "number"  "four"    "Privet" 
[ ... and 4,784 more ]

text2 :
 [1] "THE"       "VANISHING" "GLASS"     "Nearly"    "ten"       "years"    
 [7] "had"       "passed"    "since"     "the"       "Dursleys"  "had"      
[ ... and 3,556 more ]

text3 :
 [1] "THE"         "LETTERS"     "FROM"        "NO"          "ONE"        
 [6] "The"         "escape"      "of"          "the"         "Brazilian"  
[11] "boa"         "constrictor"
[ ... and 3,984 more ]

text4 :
 [1] "THE"     "KEEPER"  "OF"      "THE"     "KEYS"    "BOOM"    "They"   
 [8] "knocked" "again"   "Dudley"  "jerked"  "awake"  
[ ... and 3,910 more ]

text5 :
 [1] "DIAGON"   "ALLEY"    "Harry"    "woke"     "early"    "the"     
 [7] "next"     "morning"  "Although" "he"       "could"    "tell"    
[ ... and 6,998 more ]

text6 :
 [1] "THE"            "JOURNEY"        "FROM"           "PLATFORM"      
 [5] "NINE"           "AND"            "THREE-QUARTERS" "Harry's"       
 [9] "last"           "month"          "with"           "the"           
[ ... and 6,750 more ]

[ reached max_ndoc ... 11 more documents ]
# as well as numbers
philosophers_stone_tokens <- tokens(philosophers_stone_corpus, 
    remove_punct = T,
    remove_numbers = T)
print(philosophers_stone_tokens)
Tokens consisting of 17 documents and 6 docvars.
text1 :
 [1] "THE"     "BOY"     "WHO"     "LIVED"   "Mr"      "and"     "Mrs"    
 [8] "Dursley" "of"      "number"  "four"    "Privet" 
[ ... and 4,784 more ]

text2 :
 [1] "THE"       "VANISHING" "GLASS"     "Nearly"    "ten"       "years"    
 [7] "had"       "passed"    "since"     "the"       "Dursleys"  "had"      
[ ... and 3,556 more ]

text3 :
 [1] "THE"         "LETTERS"     "FROM"        "NO"          "ONE"        
 [6] "The"         "escape"      "of"          "the"         "Brazilian"  
[11] "boa"         "constrictor"
[ ... and 3,981 more ]

text4 :
 [1] "THE"     "KEEPER"  "OF"      "THE"     "KEYS"    "BOOM"    "They"   
 [8] "knocked" "again"   "Dudley"  "jerked"  "awake"  
[ ... and 3,907 more ]

text5 :
 [1] "DIAGON"   "ALLEY"    "Harry"    "woke"     "early"    "the"     
 [7] "next"     "morning"  "Although" "he"       "could"    "tell"    
[ ... and 6,989 more ]

text6 :
 [1] "THE"            "JOURNEY"        "FROM"           "PLATFORM"      
 [5] "NINE"           "AND"            "THREE-QUARTERS" "Harry's"       
 [9] "last"           "month"          "with"           "the"           
[ ... and 6,749 more ]

[ reached max_ndoc ... 11 more documents ]

Often we do not want to retain capitalization. Here, we’ll change everything to lower case. Note that you could also convert everything to upper case with tokens_toupper(), though I have not run into any instance of someone actually doing this.

philosophers_stone_tokens <- tokens_tolower(philosophers_stone_tokens)
print(philosophers_stone_tokens)
Tokens consisting of 17 documents and 6 docvars.
text1 :
 [1] "the"     "boy"     "who"     "lived"   "mr"      "and"     "mrs"    
 [8] "dursley" "of"      "number"  "four"    "privet" 
[ ... and 4,784 more ]

text2 :
 [1] "the"       "vanishing" "glass"     "nearly"    "ten"       "years"    
 [7] "had"       "passed"    "since"     "the"       "dursleys"  "had"      
[ ... and 3,556 more ]

text3 :
 [1] "the"         "letters"     "from"        "no"          "one"        
 [6] "the"         "escape"      "of"          "the"         "brazilian"  
[11] "boa"         "constrictor"
[ ... and 3,981 more ]

text4 :
 [1] "the"     "keeper"  "of"      "the"     "keys"    "boom"    "they"   
 [8] "knocked" "again"   "dudley"  "jerked"  "awake"  
[ ... and 3,907 more ]

text5 :
 [1] "diagon"   "alley"    "harry"    "woke"     "early"    "the"     
 [7] "next"     "morning"  "although" "he"       "could"    "tell"    
[ ... and 6,989 more ]

text6 :
 [1] "the"            "journey"        "from"           "platform"      
 [5] "nine"           "and"            "three-quarters" "harry's"       
 [9] "last"           "month"          "with"           "the"           
[ ... and 6,749 more ]

[ reached max_ndoc ... 11 more documents ]
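
For completeness, here is a quick sketch of the upper-case variant mentioned above, applied to the first chapter’s (already lowercased) tokens just to show the function.

# the rarely used upper-case counterpart
head(as.character(tokens_toupper(philosophers_stone_tokens[1])), 12)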

Stopword lists

Depending on your research project, many of the words above may be of limited interest. Consider “the”: it’s unlikely to ever be of much use in understanding the thematic content of a set of texts. Often we just want to remove these sorts of function words, which contribute little substantive meaning. These are most commonly referred to as stopwords.

Quanteda package

With quanteda, we can remove stopwords using any of a few pre-defined lists that ship with the package. Here, we print that list first, then remove those tokens:

length(print(stopwords("en")))
  [1] "i"          "me"         "my"         "myself"     "we"        
  [6] "our"        "ours"       "ourselves"  "you"        "your"      
 [11] "yours"      "yourself"   "yourselves" "he"         "him"       
 [16] "his"        "himself"    "she"        "her"        "hers"      
 [21] "herself"    "it"         "its"        "itself"     "they"      
 [26] "them"       "their"      "theirs"     "themselves" "what"      
 [31] "which"      "who"        "whom"       "this"       "that"      
 [36] "these"      "those"      "am"         "is"         "are"       
 [41] "was"        "were"       "be"         "been"       "being"     
 [46] "have"       "has"        "had"        "having"     "do"        
 [51] "does"       "did"        "doing"      "would"      "should"    
 [56] "could"      "ought"      "i'm"        "you're"     "he's"      
 [61] "she's"      "it's"       "we're"      "they're"    "i've"      
 [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
 [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
 [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
 [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
 [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
 [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
 [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"    
[101] "who's"      "what's"     "here's"     "there's"    "when's"    
[106] "where's"    "why's"      "how's"      "a"          "an"        
[111] "the"        "and"        "but"        "if"         "or"        
[116] "because"    "as"         "until"      "while"      "of"        
[121] "at"         "by"         "for"        "with"       "about"     
[126] "against"    "between"    "into"       "through"    "during"    
[131] "before"     "after"      "above"      "below"      "to"        
[136] "from"       "up"         "down"       "in"         "out"       
[141] "on"         "off"        "over"       "under"      "again"     
[146] "further"    "then"       "once"       "here"       "there"     
[151] "when"       "where"      "why"        "how"        "all"       
[156] "any"        "both"       "each"       "few"        "more"      
[161] "most"       "other"      "some"       "such"       "no"        
[166] "nor"        "not"        "only"       "own"        "same"      
[171] "so"         "than"       "too"        "very"       "will"      
[1] 175
# Available for other languages as well
head(stopwords("zh", source="misc"),50)
 [1] "按"     "按照"   "俺"     "们"     "阿"     "别"     "别人"   "别处"  
 [9] "是"     "别的"   "别管"   "说"     "不"     "不仅"   "不但"   "不光"  
[17] "不单"   "不只"   "不外乎" "不如"   "不妨"   "不尽"   "然"     "不得"  
[25] "不怕"   "不惟"   "不成"   "不拘"   "料"     "不是"   "不比"   "不然"  
[33] "特"     "不独"   "不管"   "不至于" "若"     "不论"   "不过"   "不问"  
[41] "比"     "方"     "比如"   "及"     "本身"   "本着"   "本地"   "本人"  
[49] "本"     "巴巴"  
# remove stopwords from our tokens object
philosophers_stone_tokens1 <- tokens_select(philosophers_stone_tokens, 
                     pattern = stopwords("en"),
                     selection = "remove")

length(philosophers_stone_tokens1)
[1] 17
print(philosophers_stone_tokens1)
Tokens consisting of 17 documents and 6 docvars.
text1 :
 [1] "boy"       "lived"     "mr"        "mrs"       "dursley"   "number"   
 [7] "four"      "privet"    "drive"     "proud"     "say"       "perfectly"
[ ... and 2,507 more ]

text2 :
 [1] "vanishing" "glass"     "nearly"    "ten"       "years"     "passed"   
 [7] "since"     "dursleys"  "woken"     "find"      "nephew"    "front"    
[ ... and 1,906 more ]

text3 :
 [1] "letters"      "one"          "escape"       "brazilian"    "boa"         
 [6] "constrictor"  "earned"       "harry"        "longest-ever" "punishment"  
[11] "time"         "allowed"     
[ ... and 2,213 more ]

text4 :
 [1] "keeper"   "keys"     "boom"     "knocked"  "dudley"   "jerked"  
 [7] "awake"    "`"        "cannon"   "`"        "said"     "stupidly"
[ ... and 2,201 more ]

text5 :
 [1] "diagon"   "alley"    "harry"    "woke"     "early"    "next"    
 [7] "morning"  "although" "tell"     "daylight" "kept"     "eyes"    
[ ... and 4,141 more ]

text6 :
 [1] "journey"        "platform"       "nine"           "three-quarters"
 [5] "harry's"        "last"           "month"          "dursleys"      
 [9] "fun"            "true"           "dudley"         "now"           
[ ... and 3,759 more ]

[ reached max_ndoc ... 11 more documents ]

Tidytext package

In addition to quanteda, other packages supply their own stopword lists; one example is tidytext. Tidytext provides several lexicons of English stop words, drawn from three sources: snowball, SMART, and onix.

tidytextstopwords <- tidytext::stop_words
View(tidytextstopwords)
table(tidytextstopwords$lexicon)

    onix    SMART snowball 
     404      571      174 
#OR
tidytext::get_stopwords(source = "smart")
# A tibble: 571 × 2
   word        lexicon
   <chr>       <chr>  
 1 a           smart  
 2 a's         smart  
 3 able        smart  
 4 about       smart  
 5 above       smart  
 6 according   smart  
 7 accordingly smart  
 8 across      smart  
 9 actually    smart  
10 after       smart  
# ℹ 561 more rows
tidytext::get_stopwords(language = "de") # change language
# A tibble: 231 × 2
   word  lexicon 
   <chr> <chr>   
 1 aber  snowball
 2 alle  snowball
 3 allem snowball
 4 allen snowball
 5 aller snowball
 6 alles snowball
 7 als   snowball
 8 also  snowball
 9 am    snowball
10 an    snowball
# ℹ 221 more rows

Now we remove stop words using the tidytext stopword list. Notice that the number of tokens left after stopword removal differs from before, because the two packages use different lists.

tidystop <- get_stopwords(source = "smart")

philosophers_stone_tokens2 <-     tokens_select(philosophers_stone_tokens, 
                     pattern = tidystop$word,
                     selection = "remove")

length(philosophers_stone_tokens2)
[1] 17
print(philosophers_stone_tokens2)
Tokens consisting of 17 documents and 6 docvars.
text1 :
 [1] "boy"       "lived"     "mr"        "mrs"       "dursley"   "number"   
 [7] "privet"    "drive"     "proud"     "perfectly" "normal"    "people"   
[ ... and 1,989 more ]

text2 :
 [1] "vanishing" "glass"     "ten"       "years"     "passed"    "dursleys" 
 [7] "woken"     "find"      "nephew"    "front"     "step"      "privet"   
[ ... and 1,533 more ]

text3 :
 [1] "letters"      "escape"       "brazilian"    "boa"          "constrictor" 
 [6] "earned"       "harry"        "longest-ever" "punishment"   "time"        
[11] "allowed"      "cupboard"    
[ ... and 1,814 more ]

text4 :
 [1] "keeper"   "keys"     "boom"     "knocked"  "dudley"   "jerked"  
 [7] "awake"    "`"        "cannon"   "`"        "stupidly" "crash"   
[ ... and 1,760 more ]

text5 :
 [1] "diagon"   "alley"    "harry"    "woke"     "early"    "morning" 
 [7] "daylight" "eyes"     "shut"     "tight"    "`"        "dream"   
[ ... and 3,315 more ]

text6 :
 [1] "journey"        "platform"       "three-quarters" "harry's"       
 [5] "month"          "dursleys"       "fun"            "true"          
 [9] "dudley"         "scared"         "harry"          "stay"          
[ ... and 2,898 more ]

[ reached max_ndoc ... 11 more documents ]
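
Note that length() above only counts documents (17 chapters in both cases). To see that the two stopword lists really do remove different amounts of material, we can compare total token counts; a quick sketch using quanteda’s ntoken():

# total tokens remaining under each stopword list
sum(ntoken(philosophers_stone_tokens1))  # quanteda "en" stopwords
sum(ntoken(philosophers_stone_tokens2))  # tidytext "smart" stopwords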

Stemming

Now, we turn to stemming. Of the pre-processing options, stemming is perhaps the most controversial. The underlying idea is to collapse multiple tokens that are of the same general form into a single token (the word stem). For example, consider “tackle”, which could take a multitude of different forms: “tackle”, “tackling”, “tackled”, or “tackles”. Without stemming, each of these is treated as a unique token. That’s potentially undesirable, particularly when the central concept (or stem) is likely to be the real point of substantive interest. Therefore, we can stem the tokens in our vocabulary, yielding a new set of tokens. In the above example, this would yield “tackl”.
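
Before stemming the whole book, here is a quick illustration of the “tackle” example using quanteda’s char_wordstem() on raw words:

# all four forms collapse to the stem "tackl"
char_wordstem(c("tackle", "tackling", "tackled", "tackles"))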

# we use the tokens after removing stopwords using quanteda
philosophers_stone_tokens <- tokens_wordstem(philosophers_stone_tokens1)
philosophers_stone_tokens
Tokens consisting of 17 documents and 6 docvars.
text1 :
 [1] "boy"     "live"    "mr"      "mrs"     "dursley" "number"  "four"   
 [8] "privet"  "drive"   "proud"   "say"     "perfect"
[ ... and 2,507 more ]

text2 :
 [1] "vanish"  "glass"   "near"    "ten"     "year"    "pass"    "sinc"   
 [8] "dursley" "woken"   "find"    "nephew"  "front"  
[ ... and 1,906 more ]

text3 :
 [1] "letter"      "one"         "escap"       "brazilian"   "boa"        
 [6] "constrictor" "earn"        "harri"       "longest-ev"  "punish"     
[11] "time"        "allow"      
[ ... and 2,213 more ]

text4 :
 [1] "keeper" "key"    "boom"   "knock"  "dudley" "jerk"   "awak"   "`"     
 [9] "cannon" "`"      "said"   "stupid"
[ ... and 2,201 more ]

text5 :
 [1] "diagon"   "alley"    "harri"    "woke"     "earli"    "next"    
 [7] "morn"     "although" "tell"     "daylight" "kept"     "eye"     
[ ... and 4,141 more ]

text6 :
 [1] "journey"     "platform"    "nine"        "three-quart" "harri"      
 [6] "last"        "month"       "dursley"     "fun"         "true"       
[11] "dudley"      "now"        
[ ... and 3,759 more ]

[ reached max_ndoc ... 11 more documents ]

Stemming vs. Lemmatization

Stemming chops the last few characters off a word, which often produces tokens with distorted spellings or meanings (such as the “tackl” we saw earlier). In contrast, lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma.

Now let’s see the difference in practice:

We start with an example passage and preprocess it: tokenize, then remove punctuation and stop words.

example <- "This location does not have good service. When through drive-through and they forgot our drinks and our sides. While they were preparing what they forgot, we could see another girl who had her back to us and it was obvious that she was on the phone. ANy other KFC would be better."

example_ready <- tokens(example,
                       remove_punct=T)

example_ready <- tokens_select(example_ready,
                       pattern=stopwords("en"),
                       selection = "remove")

Now we compare stemming and lemmatization. Note that we can also lemmatize tokens using the cleanNLP package; see Tutorial 4 for more details.

# stemming
stemming <- tokens_wordstem(example_ready)
as.character(stemming)
 [1] "locat"         "good"          "servic"        "drive-through"
 [5] "forgot"        "drink"         "side"          "prepar"       
 [9] "forgot"        "see"           "anoth"         "girl"         
[13] "back"          "us"            "obvious"       "phone"        
[17] "KFC"           "better"       
# lemmatization
lemmitized <- tokens_replace(example_ready,
                             pattern = lexicon:: hash_lemmas$token,
                             replacement = lexicon:: hash_lemmas$lemma)

as.character(lemmitized)
 [1] "location"      "good"          "service"       "drive-through"
 [5] "forget"        "drink"         "side"          "prepare"      
 [9] "forget"        "see"           "another"       "girl"         
[13] "back"          "us"            "obvious"       "phone"        
[17] "KFC"           "good"         
# What's lexicon package's hash_lemmas?
head(lexicon::hash_lemmas,100)
Key: <token>
              token         lemma
             <char>        <char>
  1:         'cause       because
  2:             'd         would
  3:            'em          them
  4:            'll          will
  5:             'm            be
  6:            'n'           and
  7:           'til         until
  8:            've          have
  9:        a-bombs        a-bomb
 10:      aardvarks      aardvark
 11:          abaci        abacus
 12:       abacuses        abacus
 13:       abalones       abalone
 14:      abandoned       abandon
 15:     abandoning       abandon
 16:       abandons       abandon
 17:         abased         abase
 18:         abases         abase
 19:        abashed         abash
 20:        abashes         abash
 21:       abashing         abash
 22:        abasing         abase
 23:         abated         abate
 24:     abatements     abatement
 25:         abates         abate
 26:        abating         abate
 27:      abattoirs      abattoir
 28:       abbesses        abbess
 29:         abbeys         abbey
 30:         abbots         abbot
 31:    abbreviated    abbreviate
 32:    abbreviates    abbreviate
 33:   abbreviating    abbreviate
 34:  abbreviations  abbreviation
 35:       abcesses        abcess
 36:      abdicated      abdicate
 37:      abdicates      abdicate
 38:     abdicating      abdicate
 39:    abdications    abdication
 40:       abdomens       abdomen
 41:       abdomina       abdomen
 42:     abdominals     abdominal
 43:       abducted        abduct
 44:      abducting        abduct
 45:     abductions     abduction
 46:      abductors      abductor
 47:        abducts        abduct
 48:    aberrations    aberration
 49:          abets          abet
 50:        abetted          abet
 51:       abetting          abet
 52:       abhorred         abhor
 53:      abhorring         abhor
 54:         abhors         abhor
 55:         abided         abide
 56:         abides         abide
 57:        abiding         abide
 58:      abilities       ability
 59:       abjected        abject
 60:      abjecters      abjecter
 61:      abjecting        abject
 62:        abjects        abject
 63:        abjured        abjure
 64:        abjures        abjure
 65:       abjuring        abjure
 66:        ablated        ablate
 67:        ablates        ablate
 68:       ablating        ablate
 69:      ablations      ablation
 70:          abler          able
 71:         ablest          able
 72:      ablutions      ablution
 73:  abnormalities   abnormality
 74:          abode         abide
 75:         abodes         abode
 76:      abolished       abolish
 77:      abolishes       abolish
 78:     abolishing       abolish
 79:  abolitionists  abolitionist
 80:     abolitions     abolition
 81:     abominated     abominate
 82:     abominates     abominate
 83:    abominating     abominate
 84:   abominations   abomination
 85:    aboriginals    aboriginal
 86:     aborigines     aborigine
 87:        aborted         abort
 88: abortifacients abortifacient
 89:       aborting         abort
 90:   abortionists   abortionist
 91:      abortions      abortion
 92:         aborts         abort
 93:       abounded        abound
 94:      abounding        abound
 95:        abounds        abound
 96:    about-faces    about-face
 97:    about-turns    about-turn
 98:        abraded        abrade
 99:        abrades        abrade
100:       abrading        abrade
              token         lemma
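
Notice in the lemmatized output above that “better” was mapped to “good”: lemmatization can handle irregular forms that stemming cannot. A quick lookup sketch against the same table (assuming these example words appear in it):

# look up a few irregular forms in the lemma table
subset(lexicon::hash_lemmas, token %in% c("mice", "geese", "ran", "better"))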

What to do?

Clearly, you have a multitude of options for pre-processing texts, and the choice of what to do becomes complicated. Should you remove stop words? Numbers? Punctuation? Capitalization? Should you stem? What combination, and why? Anyone who has done work in this area has faced scrutiny, warranted or not, from reviewers questioning their pre-processing choices.

At the most basic level, the answer should be theoretically driven. Consider your research question and whether each of the varied pre-processing steps makes sense for it. Are you missing something if you remove capitalization or punctuation? Are there different versions of the same term that might be meaningful for your analysis? It’s most important to be able to defend your choices on the merits.

Of course, you can also include in an appendix versions of the analysis that employ the pre-processing steps highlighted by the reviewer. Especially if your code is well written and the computational time required for the analysis is low, this can be another useful response.

preText

A final, more systematic way of approaching this is proposed by Matt Denny (a former UMass and Penn State University student!) and Art Spirling in their paper. They’ve written an R package, preText, that implements the recommendations from the paper, which you can install directly from GitHub. Note that there are a ton of dependencies associated with the package. You can see a vignette describing the use of preText here.

Install the package first

#devtools::install_github("matthewjdenny/preText")
library(preText)
preText: Diagnostics to Assess the Effects of Text Preprocessing Decisions
Version 0.7.2 created on 2021-07-25.
copyright (c) 2021, Matthew J. Denny, Georgetown University
                    Arthur Spirling, NYU
Type vignette('getting_started_with_preText') to get started.
Development website: https://github.com/matthewjdenny/preText

We use the U.S. presidential inaugural speeches from quanteda’s example data.

corp <- data_corpus_inaugural
# use first 10 documents for example
documents <- corp[1:10,]
# take a look at the document names
print(names(documents))
 [1] "1789-Washington" "1793-Washington" "1797-Adams"      "1801-Jefferson" 
 [5] "1805-Jefferson"  "1809-Madison"    "1813-Madison"    "1817-Monroe"    
 [9] "1821-Monroe"     "1825-Adams"     

Having loaded in some data, we can now make use of the factorial_preprocessing() function, which will preprocess the data 64 or 128 different ways (depending on whether n-grams are included). In this example, we are going to preprocess the documents all 128 different ways. This should take between 5 and 10 minutes on most modern laptops. Longer documents and larger numbers of documents will significantly increase run time and memory usage.

preprocessed_documents <- factorial_preprocessing(
    documents,
    use_ngrams = TRUE,
    infrequent_term_threshold = 0.2,
    verbose = FALSE)
Preprocessing 10 documents 128 different ways...

This function will output a list object with three fields. The first of these is $choices, a data.frame containing indicators for each of the preprocessing steps used. The second is $dfm_list, which is a list with 64 or 128 entries, each of which contains a quanteda::dfm object preprocessed according to the specification in the corresponding row in choices. Each DFM in this list will be labeled to match the row names in choices, but you can also access these labels from the $labels field. We can look at the first few rows of choices below:

names(preprocessed_documents)
[1] "choices"  "dfm_list" "labels"  
head(preprocessed_documents$choices)
              removePunctuation removeNumbers lowercase stem removeStopwords
P-N-L-S-W-I-3              TRUE          TRUE      TRUE TRUE            TRUE
N-L-S-W-I-3               FALSE          TRUE      TRUE TRUE            TRUE
P-L-S-W-I-3                TRUE         FALSE      TRUE TRUE            TRUE
L-S-W-I-3                 FALSE         FALSE      TRUE TRUE            TRUE
P-N-S-W-I-3                TRUE          TRUE     FALSE TRUE            TRUE
N-S-W-I-3                 FALSE          TRUE     FALSE TRUE            TRUE
              infrequent_terms use_ngrams
P-N-L-S-W-I-3             TRUE       TRUE
N-L-S-W-I-3               TRUE       TRUE
P-L-S-W-I-3               TRUE       TRUE
L-S-W-I-3                 TRUE       TRUE
P-N-S-W-I-3               TRUE       TRUE
N-S-W-I-3                 TRUE       TRUE
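
Each entry of $dfm_list is a quanteda dfm whose label matches a row name in $choices; a quick sketch of inspecting the first one:

# the label and DFM for the first preprocessing specification
preprocessed_documents$labels[1]
preprocessed_documents$dfm_list[[1]]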

Now that we have our preprocessed documents, we can perform the preText procedure on the factorial preprocessed corpus using the preText() function. It will be useful now to give a name to our data set using the dataset_name argument, as this will show up in some of the plots we generate with the output.

preText_results <- preText(
    preprocessed_documents,
    dataset_name = "Inaugural Speeches",
    distance_method = "cosine",
    num_comparisons = 20,
    verbose = FALSE)
Generating document distances...
Generating preText Scores...
Generating regression results..
The R^2 for this model is: 0.6000916 
Regression results (negative coefficients imply less risk):
                 Variable Coefficient    SE
1               Intercept       0.112 0.005
2      Remove Punctuation       0.018 0.004
3          Remove Numbers       0.002 0.004
4               Lowercase       0.000 0.004
5                Stemming      -0.002 0.004
6        Remove Stopwords      -0.036 0.004
7 Remove Infrequent Terms      -0.008 0.004
8              Use NGrams      -0.027 0.004
Complete in: 8.181 seconds...

The preText() function returns a list of results with four fields:

  1. $preText_scores: A data.frame containing the preText score and specification label for each preprocessing specification. Note that there is no preText score for the case of no preprocessing steps.

  2. $ranked_preText_scores: A data.frame that is identical to $preText_scores except that it is ordered by the magnitude of the preText score.

  3. $choices: A data.frame containing binary indicators of which preprocessing steps were applied to each factorially preprocessed DFM.

  4. $regression_results: A data.frame containing regression results in which the preText score for each specification is regressed on indicators for each preprocessing decision.
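
For instance, a quick peek at the ranked scores (a sketch using the field described above):

# specifications with the lowest preText scores are the least "risky"
head(preText_results$ranked_preText_scores)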

We can now feed these results to two functions that will help us make better sense of them. preText_score_plot() creates a dot plot of scores for each preprocessing specification:

preText_score_plot(preText_results)
Warning in ggplot2::geom_point(ggplot2::aes(x = Variable, y = Coefficient), :
Ignoring unknown parameters: `linewidth`

Here, the least risky specifications have the lowest preText score and are displayed at the top of the plot. We can also see the conditional effects of each preprocessing step on the mean preText score for each specification that included that step. Here again, a negative coefficient indicates that a step tends to reduce the unusualness of the results, while a positive coefficient indicates that applying the step is likely to produce more unusual results for that corpus.

regression_coefficient_plot(preText_results,
                            remove_intercept = TRUE)
Warning in ggplot2::geom_point(ggplot2::aes(y = Variable, x = Coefficient), :
Ignoring unknown parameters: `linewidth`

In this particular toy example, we see that including n-grams and removing stop words tends to produce more “normal” results, while removing punctuation tends to produce more unusual results.

N-grams and phrasemachine

As mentioned in the lecture, the curse of dimensionality means we get lots of meaningless n-grams (e.g., “is”, “is of”, “is of the”). What if we want to find substantively meaningful n-grams? One solution is phrasemachine.

#devtools::install_github("slanglab/phrasemachine/R/phrasemachine")
  
library(phrasemachine)                      
phrasemachine: Simple Phrase Extraction
Version 1.2.0 created on 2017-05-29.
copyright (c) 2016, Matthew J. Denny, Abram Handler, Brendan O'Connor.
Type help('phrasemachine') or
vignette('getting_started_with_phrasemachine') to get started.
Development website: https://github.com/slanglab/phrasemachine

We load the U.S. presidential inaugural speeches from quanteda’s example data, and use the first five documents as an example.

corp <- quanteda::corpus(quanteda::data_corpus_inaugural)

documents <- as.character(corp)[1:5]
print(names(documents))
[1] "1789-Washington" "1793-Washington" "1797-Adams"      "1801-Jefferson" 
[5] "1805-Jefferson" 

Phrasemachine provides one main function, phrasemachine(), which takes as input a vector of strings (one string per document) or a quanteda corpus object. The function returns the phrases extracted from the input documents in one of two forms. Find more information here.

phrases <- phrasemachine(documents,
                         minimum_ngram_length = 2,
                         maximum_ngram_length = 8,
                         return_phrase_vectors = TRUE,
                         return_tag_sequences = TRUE)
Currently tagging document 1 of 5 
Currently tagging document 2 of 5 
Currently tagging document 3 of 5 
Currently tagging document 4 of 5 
Currently tagging document 5 of 5 
Extracting phrases from document 1 of 5 
Extracting phrases from document 2 of 5 
Extracting phrases from document 3 of 5 
Extracting phrases from document 4 of 5 
Extracting phrases from document 5 of 5 
# look at some example phrases
print(phrases[[1]]$phrases[1:10])
 [1] "Fellow-Citizens_of_the_Senate" "House_of_Representatives"     
 [3] "vicissitudes_incident"         "vicissitudes_incident_to_life"
 [5] "incident_to_life"              "greater_anxieties"            
 [7] "14th_day"                      "14th_day_of_the_present_month"
 [9] "day_of_the_present_month"      "present_month"                

Readability

While we’ve covered some pre-processing above, we’re going to take a detour before we head into representing texts next week. One form of analysis that can occasionally be interesting, though it has limitations, is the readability of a text. The general idea is that some forms of composition are easier to understand, while others take more effort. Interestingly, most readability measures are simple transformations of a few basic features of a text: the length of its words and sentences, for instance.
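
For example, the Flesch-Kincaid grade level is just a weighted combination of average sentence length and average syllables per word. The sketch below uses the standard published coefficients with made-up counts, purely for illustration:

# Flesch-Kincaid grade level by hand (counts below are hypothetical)
words     <- 5000
sentences <- 350
syllables <- 6500
0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59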

There are loads of readability measures; fortunately, nearly all of them are included in quanteda (via the quanteda.textstats package). For the full list, check out the reference page here. Here, we’ll play around and check out the readability of Philosopher’s Stone by chapter, to see how the book progresses.

library(quanteda)
library(quanteda.textstats)
# calculate readability
readability <- textstat_readability(philosophers_stone_corpus, 
                                    measure = c("Flesch.Kincaid")) 

# add in a chapter number
readability$chapter <- c(1:nrow(readability))

# look at the dataset
head(readability)
  document Flesch.Kincaid chapter
1    text1       5.727817       1
2    text2       6.184489       2
3    text3       5.583587       3
4    text4       4.526660       4
5    text5       5.105107       5
6    text6       4.225491       6
# plot results
ggplot(readability, aes(x = chapter, y = Flesch.Kincaid)) +
  geom_line() + 
  geom_smooth() + 
  theme_bw()
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

That’s interesting, but as I noted, we have lots of options. Let’s take a look at a couple of others.

readability <- textstat_readability(philosophers_stone_corpus, 
                                    measure = c("Flesch.Kincaid", "FOG", "Coleman.Liau.grade")) 

# add in a chapter number
readability$chapter <- c(1:nrow(readability))

# look at the dataset
head(readability)
  document Flesch.Kincaid      FOG Coleman.Liau.grade chapter
1    text1       5.727817 7.951456           6.729091       1
2    text2       6.184489 7.974007           6.899500       2
3    text3       5.583587 7.275985           6.783842       3
4    text4       4.526660 6.632315           5.094981       4
5    text5       5.105107 6.970226           6.234268       5
6    text6       4.225491 6.424974           5.230121       6
# plot results
ggplot(readability, aes(x = chapter)) +
  geom_line(aes(y = Flesch.Kincaid), color = "black") + 
  geom_line(aes(y = FOG), color = "red") + 
  geom_line(aes(y = Coleman.Liau.grade), color = "blue") + 
  theme_bw()

The shapes are all pretty close. Let’s look at some correlations.

cor(readability$Flesch.Kincaid, readability$FOG, use = "complete.obs")
[1] 0.9338218
cor(readability$Coleman.Liau.grade, readability$FOG, use = "complete.obs")
[1] 0.9130903

That’s only part of the story, but it’s pretty straightforward. Remember how all of these are just “magic number” transformations of the same underlying measures (syllables, word counts, sentence length, etc.)? Because of that, they are generally very highly correlated. That’s a good thing.

So what’s the downside? Well, I can’t explain the problem any better than Mark Liberman’s Language Log post here.