In this tutorial, we’ll learn about representing texts. This week, we’ll continue looking at the Harry Potter series. We’ll first install and load the packages for today’s notebook.
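Judging from the masking messages below, plyr and the tidyverse are among the packages loaded at this point; a minimal sketch of that setup might look like this:

# load the packages for today's notebook (install first with install.packages() if needed)
library(plyr)       # loading order inferred from the masking messages below
library(tidyverse)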
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::arrange() masks plyr::arrange()
✖ purrr::compact() masks plyr::compact()
✖ dplyr::count() masks plyr::count()
✖ dplyr::desc() masks plyr::desc()
✖ dplyr::failwith() masks plyr::failwith()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::id() masks plyr::id()
✖ dplyr::lag() masks stats::lag()
✖ dplyr::mutate() masks plyr::mutate()
✖ dplyr::rename() masks plyr::rename()
✖ dplyr::summarise() masks plyr::summarise()
✖ dplyr::summarize() masks plyr::summarize()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(quanteda)
Package version: 4.1.0
Unicode version: 14.0
ICU version: 71.1
Parallel computing: disabled
See https://quanteda.io for tutorials and examples.
library(quanteda.textplots)
First load all the Harry Potter books.
# Define the folder containing the .rda files (change to your path)
folder <- "/Users/mpang/Dropbox/Teaching Resources/DACSS_TAD/HarryPotter"

# Get the list of all .rda files in the folder
rda_files <- list.files(folder, pattern = "\\.rda$", full.names = TRUE)

# Load all .rda files into the environment
lapply(rda_files, load, .GlobalEnv)
As a reminder, we have seven books — each stored as a character vector where each chapter is an element in that vector — now available in our workspace. These are:
philosophers_stone: Harry Potter and the Philosopher's Stone (1997)
chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)
prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)
goblet_of_fire: Harry Potter and the Goblet of Fire (2000)
order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)
half_blood_prince: Harry Potter and the Half-Blood Prince (2005)
deathly_hallows: Harry Potter and the Deathly Hallows (2007)
As you’ll recall, we want to convert these to corpus objects that are easier to work with.
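As a reminder of how those objects come about, the corpus and its summary data frame used in the next chunk can be created along these lines (a sketch using quanteda's corpus() and summary(), with object names chosen to match the code below):

# create a corpus from the character vector; each chapter becomes a document
philosophers_stone_corpus <- corpus(philosophers_stone)

# summarize the corpus; this data frame will hold our document-level metadata
philosophers_stone_summary <- summary(philosophers_stone_corpus)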
# add an indicator for the book; this will be useful later when we add all the books together into a single corpus
philosophers_stone_summary$book <- "Philosopher's Stone"

# create a chapter indicator
philosophers_stone_summary$chapter <- as.numeric(str_extract(philosophers_stone_summary$Text, "[0-9]+"))

# add the metadata
docvars(philosophers_stone_corpus) <- philosophers_stone_summary
Document-Feature Matrix
A common first step in text analysis is converting texts from their written format ("The dog runs down the hall.") to a numerical representation of that language. The basic approach for representing a text is the document-feature matrix, sometimes also called the document-term matrix. Here, we create a matrix where the rows represent documents, the columns represent words (features), and the value of each cell is the count of that word (column) in that document (row).
We can use quanteda’s dfm command to generate the document-feature matrix directly from the corpus object.
# create the dfm
philosophers_stone_dfm <- dfm(tokens(philosophers_stone_corpus))

# find out a quick summary of the dfm
philosophers_stone_dfm
The summary of the document-feature matrix provides a few interesting notes for us. We have the number of documents (17 chapters) and the number of features (6,116). We also get a note about sparsity. This refers to the proportion of entries in our matrix that are 0; here, 80.08% of the cells are 0. The high sparsity of text data is a well-recognized trait and something we will return to regularly.
Below the summary statement, we can see the first few rows and columns of our document-feature matrix. The first entry in the matrix, for instance, indicates that “the” appears 204 times in the first chapter (text1) of “The Philosopher’s Stone”. This reminds us that we did not preprocess our corpus. Fortunately, the dfm() function explicitly includes the ability to preprocess when you are creating your matrix. Indeed, that’s why the text is lower-cased above; the function defaults to removing capitalization. We can be a bit more heavy-handed with our preprocessing as follows.
# create the dfm
philosophers_stone_dfm <- tokens(philosophers_stone_corpus,
                                 remove_punct = TRUE,
                                 remove_numbers = TRUE) %>%
  dfm(tolower = TRUE) %>%
  dfm_remove(stopwords('english'))

# find out a quick summary of the dfm
philosophers_stone_dfm
Once we have our document-feature matrix, and have made some preprocessing decisions, we can turn to thinking about what we can learn with this new representation. Let’s start with some basics. It’s really easy to see the most frequent terms (features) now.
topfeatures(philosophers_stone_dfm, 20)
         `      harry       said        ron     hagrid       back   hermione 
      4757       1213        794        410        336        261        257 
       one        got       like        get       know       just        see 
       254        198        194        194        188        180        180 
 professor     looked        now      snape dumbledore     around 
       180        169        166        145        143        142 
We see the symbol "`" as the top feature (which is annoying). Even though we removed punctuation, some special characters may not be treated as punctuation by default. Let's remove this symbol before moving on.
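One way to do this is a quick sketch with quanteda's dfm_remove(), passing the backtick as a literal feature name to drop:

# remove the stray backtick feature from the dfm
philosophers_stone_dfm <- dfm_remove(philosophers_stone_dfm, pattern = "`")

# check the top features again
topfeatures(philosophers_stone_dfm, 20)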
     harry       said        ron     hagrid       back   hermione        one 
      1213        794        410        336        261        257        254 
       got       like        get       know       just        see  professor 
       198        194        194        188        180        180        180 
    looked        now      snape dumbledore     around      going 
       169        166        145        143        142        135 
Perhaps you'd also like to know which words were used only within a particular text. We can look, for instance, at the final chapter to see which words were used uniquely there; one way to do so is sketched below.
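This sketch keeps features with a nonzero count in the final chapter (text17) and a zero count everywhere else:

# features that appear in the final chapter (text17) but in no other chapter
in_final <- colSums(philosophers_stone_dfm[17, ]) > 0
in_others <- colSums(philosophers_stone_dfm[-17, ]) > 0
unique_to_final <- featnames(philosophers_stone_dfm)[in_final & !in_others]

# peek at a few of them
head(unique_to_final, 20)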
We started out earlier this semester by making those fancy little word clouds. We haven’t done much of that since, as we’ve been busy getting our hands on data, getting it into R, and thinking about some of the more NLP-centric types of approaches one might take. Now that we’re moving to representing texts, though, we can quickly return to word clouds.
The general idea here is that the size of the word corresponds to the frequency of the term in the corpus. That is, we are characterizing the most frequent terms in a corpus. Importantly, that means the axes don’t really mean anything in these clouds, nor does the orientation of the term. For that reason, though these are pretty, they aren’t terribly useful.
# programs often work with random initialization, yielding different outcomes;
# we can set a standard starting point though to ensure the same output
set.seed(1234)

# draw the wordcloud
textplot_wordcloud(philosophers_stone_dfm, min_count = 50, random_order = FALSE)
One way to get a bit more utility is to use the comparison option within the function to plot a comparison of wordclouds across two different documents. Here’s an example.
# narrow to first and last chapters
smallDfm <- philosophers_stone_dfm[c(1, 17), ]

# draw the wordcloud
textplot_wordcloud(smallDfm, comparison = TRUE, min_count = 10, random_order = FALSE)
Zipf’s Law
Now that our data are nicely formatted, we can also look at one of the statistical regularities that characterizes language: Zipf's Law. Zipf's law holds that a word's frequency is roughly inversely proportional to its frequency rank, so the second most common word appears about half as often as the most common, the third about a third as often, and so on. Let's take a look at the distribution of word frequencies in our book.
# first, we need to create a word frequency variable and the rankings
word_counts <- as.data.frame(sort(colSums(philosophers_stone_dfm), dec = TRUE))
colnames(word_counts) <- c("Frequency")
word_counts$Rank <- c(1:ncol(philosophers_stone_dfm))
word_counts$Word <- rownames(word_counts)
head(word_counts)
Frequency Rank Word
harry 1213 1 harry
said 794 2 said
ron 410 3 ron
hagrid 336 4 hagrid
back 261 5 back
hermione 257 6 hermione
# We only want to label top 10 words
word_counts$Label <- ifelse(word_counts$Rank <= 10, word_counts$Word, NA)

# now we can plot this
ggplot(word_counts, mapping = aes(x = Rank, y = Frequency)) +
  geom_point() +
  geom_text(aes(label = Label), vjust = -0.5, hjust = 0.5, size = 3) +
  labs(title = "Zipf's Law", x = "Rank", y = "Frequency") +
  theme_bw()
Warning: Removed 5907 rows containing missing values or values outside the scale range
(`geom_text()`).
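To see the Zipf pattern more clearly, one option (an illustrative extra step) is to plot frequency against rank on log-log axes, where the relationship should come out roughly linear:

# same data, but on log-log axes; Zipf's law implies an approximately linear trend
ggplot(word_counts, aes(x = Rank, y = Frequency)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Zipf's Law (log-log)", x = "Rank (log scale)", y = "Frequency (log scale)") +
  theme_bw()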
Updating our DFMs
Having seen what we are working with here, we might start to think that our matrix still contains too many uninformative or very rare terms. We can trim our DFM in two different ways related to feature frequencies using dfm_trim().
# trim based on the overall frequency (i.e., the word counts)
smaller_dfm <- dfm_trim(philosophers_stone_dfm, min_termfreq = 10)

# trim based on the proportion of documents that the feature appears in;
# here, the feature needs to appear in more than 10% of documents (chapters)
smaller_dfm <- dfm_trim(smaller_dfm, min_docfreq = 0.1, docfreq_type = "prop")

smaller_dfm
Document-feature matrix of: 17 documents, 885 features (42.28% sparse) and 6 docvars.
features
docs boy mr mrs dursley number four privet drive say normal
text1 9 30 21 45 7 5 8 9 7 5
text2 6 3 5 3 1 2 1 1 3 0
text3 1 4 3 0 1 4 5 6 0 0
text4 5 2 0 6 0 0 0 0 7 1
text5 14 26 0 0 0 2 0 0 12 0
text6 26 0 0 0 4 4 0 1 5 0
[ reached max_ndoc ... 11 more documents, reached max_nfeat ... 875 more features ]
Note that our sparsity is now significantly decreased. We can also trim in the opposite direction, removing features that appear very frequently in our corpus. Such features are often uninformative in a particular setting but would not be caught by a standard stop-word list. As an example, we may want to drop the feature "harry" from an analysis of the Harry Potter books, since every single reference to Harry inflates that count; a sketch of this is shown below.
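Two ways one might do this (the 400-count cutoff here is an arbitrary choice for illustration):

# drop features that appear more than 400 times in the corpus (cutoff chosen for illustration)
less_frequent_dfm <- dfm_trim(philosophers_stone_dfm, max_termfreq = 400)

# or simply remove the feature "harry" by name
no_harry_dfm <- dfm_remove(philosophers_stone_dfm, pattern = "harry")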
Feature Co-occurrence Matrix
Representing text-as-data as a document-feature matrix allows us to learn about both document-level and corpus-level characteristics. However, it tells us less about how words within the corpus relate to one another. For this, we can turn to the feature co-occurrence matrix. The idea here is to construct a matrix that, instead of recording the number of times a word appears within a document, records the number of times word a appears in the same document as word b. As before, creating the feature co-occurrence matrix is straightforward.
# let's create a nicer dfm by limiting to words that appear frequently and are in more than 30% of chapters
smaller_dfm3 <- dfm_trim(philosophers_stone_dfm, min_termfreq = 10)
smaller_dfm3 <- dfm_trim(smaller_dfm3, min_docfreq = 0.3, docfreq_type = "prop")

# create fcm from dfm
smaller_fcm <- fcm(smaller_dfm3)

# check the dimensions (i.e., the number of rows and the number of columns) of the matrix we created
dim(smaller_fcm)
[1] 775 775
Notice that the number of rows and columns are the same; that’s because they are each the vocabulary, with the entry being the number of times the row word and column word co-occur (with the diagonal elements undefined). Later on this semester, we’ll leverage these word co-occurrence matrices to estimate word embedding models.
For now, let's use what we've got to learn a bit more about which features co-occur within our book, and how. To do so, we'll visualize a semantic network using textplot_network().
# pull the top features
myFeatures <- names(sort(colSums(smaller_fcm), decreasing = TRUE)[1:30])

# retain only those top features as part of our matrix
even_smaller_fcm <- fcm_select(smaller_fcm, pattern = myFeatures, selection = "keep")

# check dimensions
dim(even_smaller_fcm)
[1] 30 30
# compute size weight for vertices in network
size <- log(colSums(even_smaller_fcm))

# create plot
textplot_network(even_smaller_fcm, vertex_size = size / max(size) * 3)
The graph above is built from a dfm, so co-occurrence is counted at the level of whole chapters; it does not tell us which words appear close to one another within a sentence. If we instead build the fcm from the tokens of the original documents and set a co-occurrence window, we get more information about which words are likely to appear together. A sketch of this window-based approach is given below.
We observe in the graph that word pairs such as "uncle" and "vernon", or "professor" and "mcgonagall", often appear together. Of course, Harry, Ron, and Hermione also appear together.
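Here is a sketch of that window-based approach (the 5-word window size and the object names are illustrative choices):

# build an fcm directly from tokens, counting co-occurrences within a 5-word window
window_fcm <- tokens(philosophers_stone_corpus,
                     remove_punct = TRUE,
                     remove_numbers = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("english")) %>%
  fcm(context = "window", window = 5)

# keep the same top features as before and plot the network
window_fcm_small <- fcm_select(window_fcm, pattern = myFeatures, selection = "keep")
textplot_network(window_fcm_small, vertex_size = 1.5)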