Tutorial 6: Text Representation I

In this tutorial, we’ll learn about representing texts. This week, we’ll continue looking at the Harry Potter series. We’ll first install and load the packages for today’s notebook.

library(devtools)
Loading required package: usethis
#devtools::install_github("bradleyboehmke/harrypotter")
library(harrypotter)
library(tidytext)
library(plyr)
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2
──
✔ ggplot2 3.4.2     ✔ purrr   1.0.1
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::arrange()   masks plyr::arrange()
✖ purrr::compact()   masks plyr::compact()
✖ dplyr::count()     masks plyr::count()
✖ dplyr::desc()      masks plyr::desc()
✖ dplyr::failwith()  masks plyr::failwith()
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::id()        masks plyr::id()
✖ dplyr::lag()       masks stats::lag()
✖ dplyr::mutate()    masks plyr::mutate()
✖ dplyr::rename()    masks plyr::rename()
✖ dplyr::summarise() masks plyr::summarise()
✖ dplyr::summarize() masks plyr::summarize()
library(quanteda)
Package version: 3.3.1
Unicode version: 14.0
ICU version: 70.1
Parallel computing: 4 of 4 threads used.
See https://quanteda.io for tutorials and examples.
library(quanteda.textplots)

As a reminder, we have seven books — each stored as a character vector where each chapter is an element in that vector — now available in our workspace. These are: philosophers_stone, chamber_of_secrets, prisoner_of_azkaban, goblet_of_fire, order_of_the_phoenix, half_blood_prince, and deathly_hallows.

As you’ll recall, we want to convert these to corpus objects that are easier to work with.

philosophers_stone_corpus <- corpus(philosophers_stone)
philosophers_stone_summary <- summary(philosophers_stone_corpus) 
philosophers_stone_summary
Corpus consisting of 17 documents, showing 17 documents:

   Text Types Tokens Sentences
  text1  1274   5643       349
  text2  1067   4128       237
  text3  1226   4630       297
  text4  1199   4759       321
  text5  1822   8371       563
  text6  1568   7949       566
  text7  1379   5445       351
  text8  1096   3594       198
  text9  1428   6130       410
 text10  1294   5207       334
 text11  1114   4152       276
 text12  1511   6729       447
 text13  1079   3929       261
 text14  1113   4354       308
 text15  1386   6437       459
 text16  1582   8277       591
 text17  1491   7101       506
# add an indicator for the book; this will be useful later when we add all the books together into a single corpus
philosophers_stone_summary$book <- "Philosopher's Stone"

# create a chapter indicator
philosophers_stone_summary$chapter <- as.numeric(str_extract(philosophers_stone_summary$Text, "[0-9]+"))

# add the metadata
docvars(philosophers_stone_corpus) <- philosophers_stone_summary
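
Since we will eventually combine all seven books, here is a minimal sketch of how the same steps could be repeated for a second book and the two corpora concatenated (quanteda corpus objects can be combined with the + operator). The object names here (both_books_corpus, the cos_ prefix) are just illustrative; renaming the document names first avoids collisions between the two books' text1, text2, ... labels.

# repeat the same steps for a second book
chamber_of_secrets_corpus <- corpus(chamber_of_secrets)
chamber_of_secrets_summary <- summary(chamber_of_secrets_corpus)
chamber_of_secrets_summary$book <- "Chamber of Secrets"
chamber_of_secrets_summary$chapter <- as.numeric(str_extract(chamber_of_secrets_summary$Text, "[0-9]+"))
docvars(chamber_of_secrets_corpus) <- chamber_of_secrets_summary

# make document names unique before combining, then concatenate with `+`
docnames(chamber_of_secrets_corpus) <- paste0("cos_", docnames(chamber_of_secrets_corpus))
both_books_corpus <- philosophers_stone_corpus + chamber_of_secrets_corpus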

Document-Feature Matrix

A common first step in text analysis is converting texts from their written format (“The dog runs down the hall.”) to a numerical representation of that language. The basic approach for representing a sentence is the document-feature matrix, sometimes also called the document-term matrix. Here, we create a matrix where each row is a document, each column is a word, and each cell holds the count of that word (column) in that document (row).
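
To make this concrete, here is a toy illustration using two invented sentences; each row of the resulting matrix is one of the two “documents” and each column is a word.

# a toy document-feature matrix from two made-up sentences
toy_dfm <- dfm(tokens(c(d1 = "The dog runs down the hall.",
                        d2 = "The dog sleeps.")))
toy_dfm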

We can use quanteda’s dfm command to generate the document-feature matrix directly from the corpus object.

# create the dfm
philosophers_stone_dfm <- dfm(tokens(philosophers_stone_corpus))

# find out a quick summary of the dfm
philosophers_stone_dfm
Document-feature matrix of: 17 documents, 6,161 features (80.18% sparse) and 6 docvars.
       features
docs    the boy who lived mr   . and mrs dursley   ,
  text1 204   9   9     2 30 417 102  21      45 290
  text2 181   6   7     3  3 253  84   5       3 235
  text3 222   1   6     0  4 306 112   3       0 230
  text4 126   5  11     2  2 289  74   0       6 289
  text5 274  14  10     0 26 569 154   0       0 526
  text6 282  26  18     0  0 489 185   0       0 500
[ reached max_ndoc ... 11 more documents, reached max_nfeat ... 6,151 more features ]

The summary of the document-feature matrix provides a few interesting notes for us. We have the number of documents (17 chapters) and the number of features (6,161). We also get a note about sparsity. This refers to the number of 0 entries in our matrix; here, 80.18% of our matrix is 0. The high sparsity of text data is a well-recognized trait and something we will regularly return to.
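
If you want that sparsity figure directly as a number, quanteda provides a helper for it.

# proportion of cells in the dfm that are zero
sparsity(philosophers_stone_dfm)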

Below the summary statement, we can see the first few rows and columns of our document-feature matrix. The first entry in the matrix, for instance, indicates that “the” appears 204 times in the first chapter (text1) of “The Philosopher’s Stone”. This reminds us that we did not preprocess our corpus. Fortunately, the dfm() function explicitly includes the ability to preprocess when you are creating your matrix. Indeed, that’s why the text is lower-cased above; the function defaults to removing capitalization. We can be a bit more heavy-handed with our preprocessing as follows.

# create the dfm
philosophers_stone_dfm <- tokens(philosophers_stone_corpus,
                                 remove_punct = TRUE,
                                 remove_numbers = TRUE) %>%
  dfm(tolower = TRUE) %>%
  dfm_remove(stopwords("english"))
# find out a quick summary of the dfm
philosophers_stone_dfm
Document-feature matrix of: 17 documents, 5,962 features (82.01% sparse) and 6 docvars.
       features
docs    boy lived mr mrs dursley number four privet drive proud
  text1   9     2 30  21      45      7    5      8     9     1
  text2   6     3  3   5       3      1    2      1     1     0
  text3   1     0  4   3       0      1    4      5     6     0
  text4   5     2  2   0       6      0    0      0     0     2
  text5  14     0 26   0       0      0    2      0     0     2
  text6  26     0  0   0       0      4    4      0     1     0
[ reached max_ndoc ... 11 more documents, reached max_nfeat ... 5,952 more features ]

Working with DFMs

Once we have our document-feature matrix, and have made some preprocessing decisions, we can turn to thinking about what we can learn with this new representation. Let’s start with some basics. It’s really easy to see the most frequent terms (features) now.

topfeatures(philosophers_stone_dfm, 20)
     harry       said         --        ron     hagrid       back   hermione 
      1212        794        783        410        336        261        257 
       one        got       like        get       know       just        see 
       254        198        194        194        188        180        180 
 professor     looked        now      snape     around dumbledore 
       180        169        166        145        142        142 

Perhaps you’d also like to know which words were used only within a particular text. We can look, for instance, at the final chapter to see which words appear there and nowhere else in the book.

# a word is unique to the final chapter if its count there equals its total count across all chapters
final_chapter_words <- as.vector(colSums(philosophers_stone_dfm) == philosophers_stone_dfm["text17", ])
colnames(philosophers_stone_dfm)[final_chapter_words]
  [1] "treble"          "type"            "overgrown"       "suspect"        
  [5] "p-p-poor"        "st-stuttering"   "p-professor"     "unpopular"      
  [9] "nosy"            "scurrying"       "trolls"          "concentrating"  
 [13] "idly"            "frighten"        "presenting"      "binding"        
 [17] "loathed"         "spasm"           "master's"        "instructions"   
 [21] "traveled"        "served"          "faithfully"      "mistakes"       
 [25] "displeased"      "trailed"         "-how"            "myseff"         
 [29] "blood-red"       "incredibly"      "rooting"         "muscle"         
 [33] "slits"           "mere"            "vapor"           "another's"      
 [37] "willing"         "strengthened"    "faithful"        "create"         
 [41] "surged"          "begging"         "liar"            "value"          
 [45] "courageous"      "vain"            "flame"           "seize"          
 [49] "needle-sharp"    "seared"          "lessened"        "blistering"     
 [53] "lunged"          "agony"           "pinning"         "palms"          
 [57] "raw"             "perform"         "deadly"          "instinct"       
 [61] "aaaargh"         "yells"           "grasp"           "blackness"      
 [65] "swam"            "linen"           "tokens"          "admirers"       
 [69] "naturally"       "misters"         "responsible"     "hygienic"       
 [73] "confiscated"     "distracted"      "prevent"         "blankly"        
 [77] "stored"          "affairs"         "well-organized"  "beings"         
 [81] "precisely"       "increases"       "truly"           "nevertheless"   
 [85] "delayed"         "merely"          "therefore"       "treated"        
 [89] "beg"             "alas"            "loved"           "hatred"         
 [93] "marked"          "bird"            "sheet"           "twinkled"       
 [97] "detest"          "unlike"          "dreamily"        "debt"           
[101] "hating"          "otherwise"       "unfortunate"     "youth"          
[105] "vomitflavored"   "toffee"          "golden-brown"    "nurse"          
[109] "pleaded"         "resting"         "fling"           "audience"       
[113] "rocker"          "stopping"        "accident"        "end-of-year"    
[117] "steamrollered"   "food'll"         "bustled"         "stiffily"       
[121] "risky"           "indoors"         "chucked"         "grief"          
[125] "remorse"         "leaking"         "sandwich"        "chuckle"        
[129] "fix"             "shoulda"         "leather-covered" "pomfrey's"      
[133] "fussing"         "insisting"       "checkup"         "decked"         
[137] "slytherin's"     "serpent"         "hush"            "fortunately"    
[141] "waffle"          "fuller"          "awarding"        "thus"           
[145] "fifty-two"       "twenty-six"      "seventy-"        "stamping"       
[149] "account"         "ahem"            "dish"            "radish"         
[153] "sunburn"         "best-played"     "award"           "din"            
[157] "seventy-two"     "gradually"       "kinds"           "takes"          
[161] "explosion"       "nudged"          "downfall"        "jot"            
[165] "grades"          "scraped"         "abysmal"         "wardrobes"      
[169] "toilets"         "boarding"        "greener"         "tidier"         
[173] "towns"           "wizened"         "gate"            "twos"           
[177] "threes"          "alarming"        "gateway"         "squealed"       
[181] "purple-faced"    "mustached"       "manner"          "holiday"        
[185] "spreading"      

Word clouds

We started out earlier this semester by making those fancy little word clouds. We haven’t done much of that since, as we’ve been busy getting our hands on data, getting it into R, and thinking about some of the more NLP-centric types of approaches one might take. Now that we’re moving to representing texts, though, we can quickly return to word clouds.

The general idea here is that the size of the word corresponds to the frequency of the term in the corpus. That is, we are characterizing the most frequent terms in a corpus. Importantly, that means the axes don’t really mean anything in these clouds, nor does the orientation of the term. For that reason, though these are pretty, they aren’t terribly useful.

# the wordcloud layout involves random initialization, so repeated runs can differ;
# setting a seed fixes the starting point and ensures the same output every time
set.seed(1234)

# draw the wordcloud
textplot_wordcloud(philosophers_stone_dfm, min_count = 50, random_order = FALSE)

One way to get a bit more utility is to use the comparison option within the function to plot a comparison of wordclouds across two different documents. Here’s an example.

# narrow to first and last chapters
smallDfm <- philosophers_stone_dfm[c(1,17),]

# draw the wordcloud
textplot_wordcloud(smallDfm, comparison = TRUE, min_count = 10, random_order = FALSE)

Zipf’s Law

Now that our data are nicely formatted, we can also look at one of the statistical regularities that characterizes language: Zipf’s law. Zipf’s law holds that a word’s frequency is roughly inversely proportional to its frequency rank, so the second most common word appears about half as often as the most common one, the third about a third as often, and so on. Let’s take a look at the distribution of word frequencies.

# first, we need to create a word frequency variable and the rankings
word_counts <- as.data.frame(sort(colSums(philosophers_stone_dfm), decreasing = TRUE))
colnames(word_counts) <- c("Frequency")
word_counts$Rank <- c(1:ncol(philosophers_stone_dfm))
head(word_counts)
       Frequency Rank
harry       1212    1
said         794    2
--           783    3
ron          410    4
hagrid       336    5
back         261    6
# now we can plot this
ggplot(word_counts, mapping = aes(x = Rank, y = Frequency)) + 
  geom_point() +
  labs(title = "Zipf's Law", x = "Rank", y = "Frequency") + 
  theme_bw()
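
Because Zipf’s law implies frequency is roughly proportional to 1 / rank, plotting the same data on log-log axes should yield an approximately straight line with slope near -1. Here is a quick sketch of that check; the lm() fit of log frequency on log rank is just a rough estimate of the slope, not a formal test.

# re-plot on log-log axes; Zipf's law predicts a roughly straight line
ggplot(word_counts, mapping = aes(x = Rank, y = Frequency)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Zipf's Law (log-log)", x = "Rank (log scale)", y = "Frequency (log scale)") +
  theme_bw()

# rough estimate of the slope; a value near -1 is the classic Zipf pattern
lm(log10(Frequency) ~ log10(Rank), data = word_counts)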

Updating our DFMs

Having seen what we are working with here, we might start to think that our matrix still contains too many uninformative or very rare terms. We can trim our DFM in two different ways related to feature frequencies using dfm_trim().

# trim based on the overall frequency (i.e., the word counts)
smaller_dfm <- dfm_trim(philosophers_stone_dfm, min_termfreq = 10)

# trim based on the proportion of documents that the feature appears in; here, the feature needs to appear in more than 10% of documents (chapters)
smaller_dfm <- dfm_trim(smaller_dfm, min_docfreq = 0.1, docfreq_type = "prop")

smaller_dfm
Document-feature matrix of: 17 documents, 884 features (42.22% sparse) and 6 docvars.
       features
docs    boy mr mrs dursley number four privet drive say normal
  text1   9 30  21      45      7    5      8     9   7      5
  text2   6  3   5       3      1    2      1     1   3      0
  text3   1  4   3       0      1    4      5     6   0      0
  text4   5  2   0       6      0    0      0     0   7      1
  text5  14 26   0       0      0    2      0     0  12      0
  text6  26  0   0       0      4    4      0     1   5      0
[ reached max_ndoc ... 11 more documents, reached max_nfeat ... 874 more features ]
textplot_wordcloud(smaller_dfm, min_count = 50,
                   random_order = FALSE)

Note that our sparsity is now significantly decreased. We can also trim in the opposite direction, dropping features that appear so frequently in our corpus that they are probably uninformative in this particular setting, yet would not be caught by a standard stop-word list. As an example, we may want to drop the feature “harry” from an analysis of the Harry Potter books, since every single reference to Harry increases that count.

smaller_dfm2 <- dfm_trim(philosophers_stone_dfm, max_termfreq = 250)
smaller_dfm2 <- dfm_trim(smaller_dfm2, max_docfreq = .5, docfreq_type = "prop")

smaller_dfm2
Document-feature matrix of: 17 documents, 5,447 features (87.23% sparse) and 6 docvars.
       features
docs    lived mrs dursley number privet drive proud perfectly normal thank
  text1     2  21      45      7      8     9     1         2      5     2
  text2     3   5       3      1      1     1     0         0      0     0
  text3     0   3       0      1      5     6     0         0      0     0
  text4     2   0       6      0      0     0     2         0      1     1
  text5     0   0       0      0      0     0     2         1      0     0
  text6     0   0       0      4      0     1     0         0      0     1
[ reached max_ndoc ... 11 more documents, reached max_nfeat ... 5,437 more features ]
# when you are doing the quiz, you might want to leverage this chunk of code 
as.vector(smaller_dfm2[,which(colnames(smaller_dfm2) == "voldemort")])
 [1]  4  0  0  1  0  2  0  0  0  0  0  0  0  0  4  6 14
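
Since dfm columns can also be indexed by feature name, an equivalent and arguably more readable version of that lookup is sketched below.

# equivalent lookup: index the dfm column by feature name
as.vector(smaller_dfm2[, "voldemort"])
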
textplot_wordcloud(smaller_dfm2, min_count = 20,
                   random_order = FALSE)

Feature Co-occurrence matrix

Representing text-as-data as a document-feature matrix allows us to learn both about document-level characteristics and about corpus-level characteristics. However, it tells us less about how words within the corpus relate to one another. For this, we can turn to the feature co-occurrence matrix. The idea here is to construct a matrix that, instead of presenting the times a word appears within a document, presents the number of times word a appears in the same document as word b. As before, creating the feature co-occurrence matrix is straightforward.

# let's create a nicer dfm by limiting to words that appear frequently and are in more than 30% of chapters
smaller_dfm3 <- dfm_trim(philosophers_stone_dfm, min_termfreq = 10)
smaller_dfm3 <- dfm_trim(smaller_dfm3, min_docfreq = .3, docfreq_type = "prop")

# create fcm from dfm
smaller_fcm <- fcm(smaller_dfm3)

# check the dimensions (i.e., the number of rows and the number of columns) of the matrix we created
dim(smaller_fcm)
[1] 774 774

Notice that the number of rows and the number of columns are the same; that’s because both are indexed by the vocabulary, with each entry giving the number of times the row word and the column word co-occur (and the diagonal elements undefined). Later on this semester, we’ll leverage these word co-occurrence matrices to estimate word embedding models.
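
You can also query individual cells of the co-occurrence matrix just as with any matrix; for instance, the sketch below looks up how often two specific words appear in the same chapter (both “harry” and “ron” survive the trimming above, so they are valid feature names here).

# how often do "harry" and "ron" occur in the same chapter?
smaller_fcm["harry", "ron"]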

For now, let’s use what we’ve got to try to learn a bit more about which features co-occur, and how, within our book. To do so, we’ll visualize a semantic network using textplot_network().

# pull the top features
myFeatures <- names(topfeatures(smaller_fcm, 30))

# retain only those top features as part of our matrix
even_smaller_fcm <- fcm_select(smaller_fcm, pattern = myFeatures, selection = "keep")

# check dimensions
dim(even_smaller_fcm)
[1] 30 30
# compute size weight for vertices in network
size <- log(colSums(even_smaller_fcm))

# create plot
textplot_network(even_smaller_fcm, vertex_size = size / max(size) * 3)

The graph above is built from the document-level fcm, so it does not capture which words appear near one another within a sentence. If we instead build the fcm from the tokens directly and set a context window, we get more information about which words are likely to appear together.

book1_token <- tokens(philosophers_stone_corpus,
                    remove_punct = TRUE,
                    remove_numbers = TRUE)

book1_token <- tokens_select(book1_token,
                     pattern = stopwords("en"),
                     selection = "remove")

try_fcm <- fcm(book1_token, context = "window", window = 2)

try_fcm
Feature co-occurrence matrix of: 6,667 by 6,667 features.
         features
features  BOY LIVED Mr Mrs Dursley number four Privet Drive proud
  BOY       0     1  1   0       0      0    0      0     0     0
  LIVED     0     0  1   1       0      0    0      0     0     0
  Mr        0     0  0   3      32      0    0      0     0     1
  Mrs       0     0  0   4      17      1    0      0     0     0
  Dursley   0     0  0   0       0      1    1      0     0     0
  number    0     0  0   0       0      0    7      1     0     0
  four      0     0  0   0       0      0    0      1     1     0
  Privet    0     0  0   0       0      0    0      0    16     1
  Drive     0     0  0   0       0      0    0      0     0     1
  proud     0     0  0   0       0      0    0      0     0     0
[ reached max_feat ... 6,657 more features, reached max_nfeat ... 6,657 more features ]
book1_Features <- names(topfeatures(try_fcm, 30))

book1_small_fcm <- fcm_select(try_fcm, pattern = book1_Features, selection = "keep")

textplot_network(book1_small_fcm, vertex_size = 2)

We observe in the graph that “Uncle” and “Vernon” frequently appear together, as do “Professor” and “McGonagall”. Of course, Harry, Ron, and Hermione also appear together.