Tutorial 7: Word Embeddings / Text Representation II

Word Embeddings

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. They are a tool for identifying similarities between words in a corpus by using some form of model to predict the co-occurrence of words within a small chunk of text.

We’ll be using the text2vec package. text2vec was one of the first implementations of word embedding functionality in R, and it is designed to run relatively fast. Still, it’s important to remember that our computational complexity is ramping up here, so don’t expect immediate results.

GloVe

Stanford University’s Global Vectors for Word Representation (GloVe) is an approach to estimating a distributional representation of a word. GloVe is based, essentially, on factorizing a huge term co-occurrence matrix.

The distributional representation of words means that each term is represented as a distribution over some number of dimensions (say, 3 dimensions, with values 0.6, 0.3, and 0.1). This stands in stark contrast to the work we’ve done to this point, which encoded each word as effectively just present (1) or absent (0).

Perhaps unsurprisingly, the distributional representation better captures semantic meaning than the one-hot encoding. This opens up a world of possibilities for us as researchers. Indeed, this has been a major leap forward for research in Text-as-Data.

As an example, we can see how similar one word is to other words by measuring the distance between their distributions. Even more interestingly, we can capture really specific phenomena from text with some simple arithmetic based on word distributions. Consider the following canonical example:

Note

king - man + woman = queen

Ponder this equation for a moment. From the vector representation of king, we subtract the vector representation of man. Then, we add the vector representation of woman. The end result of that should be a vector that is very similar to the vector representation of queen.
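
To make this concrete, here is a toy sketch in base R with made-up 3-dimensional vectors (the numbers are invented purely for illustration; they are not real embeddings):

# Made-up 3-dimensional "embeddings", purely for illustration
king  <- c(0.9, 0.8, 0.1)
man   <- c(0.1, 0.8, 0.1)
woman <- c(0.1, 0.1, 0.9)
queen <- c(0.9, 0.1, 0.9)

# king - man + woman should land near queen
target <- king - man + woman

# Cosine similarity between the result and queen (1 = same direction)
sum(target * queen) / (sqrt(sum(target^2)) * sqrt(sum(queen^2)))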

In what follows, we’ll work through some examples to see how well this works. I want to caution, though, that the corpus we are training on here is probably too small for us to have much confidence in the resulting models. Nevertheless, you’ll see that even with this small set we’ll recover some really interesting dynamics.

Front-end Matters

First, let’s install and load the text2vec package:

#Installing text2vec package (might take a while)
#install.packages('text2vec')
library(text2vec)

PoKi Dataset

We’ll be using PoKi, a corpus of poems written by children and teenagers from grades 1 to 12.

One thing to flag right off the bat is the really interesting dynamic related to who is writing these poems. We need to keep in mind that the children writing these texts will use less formal language and more imaginative stories. Given this, we’ll focus on analogies that are more appropriate for this context; here, we’ll aim to create word embeddings that can recreate these two equations:

cat - meow + bark = dog

mom - girl + boy = dad

By the end, we should hopefully be able to recreate these by creating and fitting our GloVe models. But first, let’s perform the necessary pre-processing steps before creating our embedding models.

Let’s download and read in the data:

# Create temporary file
temp <- tempfile()

# Download file
download.file("https://raw.githubusercontent.com/whipson/PoKi-Poems-by-Kids/master/poki.csv", temp)
# Read in downloaded file
poem <- read.csv(temp)

# First ten rows
head(poem, 10)
       id                            title       author grade
1  104987                   I Love The Zoo                  1
2   67185                The scary forest.                  1
3  103555                 A Hike At School 1st grade-wh     1
4  112483                         Computer            a     1
5   74516                            Angel          aab     1
6  114693         Nature Nature and Nature       aadhya     1
7   46453                             Jack      aaliyah     1
8   57397         When I awoke one morning        aanna     1
9   77201 My Blue Berries and  My Cherries      aarathi     1
10  40520                      A snowy day          ab.     1
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            text
1                                                                                                                                                                                                                                                                                                                                                                                                                                                                  roses are red,  violets are blue.   i love the zoo.   do you?
2                                                                                                                                                                                                                                                                                                                                                                                                                                         the forest is really haunted.  i believe it to be so.  but then we are going camping. 
3                                                                                                                                                                                            i took a hike at school today  and this is what i saw     bouncing balls      girls chatting against the walls     kids climbing on monkey bars     i even saw some teachers' cars     the wind was blowing my hair in my face     i saw a mud puddle,  but just a trace all of these things i noticed just now on my little hike. 
4                                                                                                                                                                                                                                                                                                                                                                                                                       you  can  do  what  you  want  you  can play a  game    you can do many things,   you can read and write
5                                                                                                                                                                                                                                                                                                                                                              angel oh angle you spin like a top angel oh angel you will never stop can't you feel the air  as it blows through your hair angel oh angel itisto bad your a mop!
6  look at the sun, what a beautiful day.  under the trees, we can run and play.  beauty of nature, we love to see,  from tiny insect to exotic tree.  it is a place to sit and think,  nature and human share the deepest link.  nature has ocean, which is in motion.  nature has tree, nature has river.  if we destroy the nature we would never be free.  our nature keeps us alive,  we must protect it, for society to thrive.  we spoil the nature, we spoil the future.  go along with nature, for your better future. 
7                                                                                                                                                                                                                                                                                                                                                                                                                                                     dog  playful,  energetic running,  jumping,  tackling my is my friend jack
8                                                                                                                                                                                       when i awoke one morning,  a dog was on my  head.  i asked ,  ''what are you doing there?' it looked at me and said  ''woof!'' ''wouldn't you like to be outside playing?''said the man ''i'm staying here and playing here. '' said the dog he played all night and day. he came inside his new house and played inside a wet wet day. 
9                                                                                                                                                                                                                                                                                                                                                                                  i went to my blue berry tree they were no blue berries found i went to another tree to get some more free but found none but cherries round. 
10                                                                                                                                                                                                                                                                                                                      one snowy day the children went outside to play in the snow.  they threw snowballs,  went sledding and made a snowman.  afterwards they went inside to drink warm hot chocolate.  it was a fun snowy day
   char
1    62
2    87
3   324
4   106
5   164
6   491
7    74
8   325
9   143
10  199
# Checks dimensions
dim(poem)
[1] 61508     6

We want the poems themselves, so we’ll use the column text for tokenization.

Tokenization and Vectorization

The process for text2vec is different from the standard process we’ve been following. Here, we’ll follow the same process we’ll use for LDA later: creating a tokenized iterator and a vectorized vocabulary first. This time there’s no need to lowercase our words, since the downloaded dataset is already lowercased.

Let’s tokenize the data:

# Tokenization
tokens <- word_tokenizer(poem$text)

# First five rows tokenized
head(tokens, 5)
[[1]]
 [1] "roses"   "are"     "red"     "violets" "are"     "blue"    "i"      
 [8] "love"    "the"     "zoo"     "do"      "you"    

[[2]]
 [1] "the"     "forest"  "is"      "really"  "haunted" "i"       "believe"
 [8] "it"      "to"      "be"      "so"      "but"     "then"    "we"     
[15] "are"     "going"   "camping"

[[3]]
 [1] "i"        "took"     "a"        "hike"     "at"       "school"  
 [7] "today"    "and"      "this"     "is"       "what"     "i"       
[13] "saw"      "bouncing" "balls"    "girls"    "chatting" "against" 
[19] "the"      "walls"    "kids"     "climbing" "on"       "monkey"  
[25] "bars"     "i"        "even"     "saw"      "some"     "teachers"
[31] "cars"     "the"      "wind"     "was"      "blowing"  "my"      
[37] "hair"     "in"       "my"       "face"     "i"        "saw"     
[43] "a"        "mud"      "puddle"   "but"      "just"     "a"       
[49] "trace"    "all"      "of"       "these"    "things"   "i"       
[55] "noticed"  "just"     "now"      "on"       "my"       "little"  
[61] "hike"    

[[4]]
 [1] "you"    "can"    "do"     "what"   "you"    "want"   "you"    "can"   
 [9] "play"   "a"      "game"   "you"    "can"    "do"     "many"   "things"
[17] "you"    "can"    "read"   "and"    "write" 

[[5]]
 [1] "angel"   "oh"      "angle"   "you"     "spin"    "like"    "a"      
 [8] "top"     "angel"   "oh"      "angel"   "you"     "will"    "never"  
[15] "stop"    "can't"   "you"     "feel"    "the"     "air"     "as"     
[22] "it"      "blows"   "through" "your"    "hair"    "angel"   "oh"     
[29] "angel"   "itisto"  "bad"     "your"    "a"       "mop"    

Create an iterator object:

# Create iterator object
it <- itoken(tokens, progressbar = FALSE)

Build the vocabulary:

# Build vocabulary
vocab <- create_vocabulary(it)

# Vocabulary
vocab
Number of docs: 61508 
0 stopwords:  ... 
ngram_min = 1; ngram_max = 1 
Vocabulary: 
          term term_count doc_count
        <char>      <int>     <int>
    1:    0000          1         1
    2: 0000000          1         1
    3: 0000001          1         1
    4:   00a:m          1         1
    5:    00he          1         1
   ---                             
56470:      to      69175     30347
56471:     and      80863     34798
56472:       a      92765     37607
56473:     the     120677     37676
56474:       i     124832     32777
# Check dimensions
dim(vocab)
[1] 56474     3

And prune and vectorize it. We’ll keep the terms that occur at least 5 times.

# Prune vocabulary
vocab <- prune_vocabulary(vocab, term_count_min = 5)

# Check dimensions
dim(vocab)
[1] 14267     3
# Vectorize
vectorizer <- vocab_vectorizer(vocab)

As we can see, pruning our vocabulary deleted over 40 thousand terms. I want to reiterate that this is a very small corpus by the standards of traditional word embedding models. When working with word representations trained on smaller corpora like this, we should be especially cautious in our interpretations.

Moving on, we can create our term co-occurrence matrix (TCM). We can achieve different results by experimenting with skip_grams_window and other parameters. The definition of whether two words occur together is somewhat arbitrary, so we definitely want to play around with the parameters and see how the results change.

# use window of 5 for context words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
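
For example, a wider window counts words as co-occurring even when they sit farther apart, which tends to push the embeddings toward topical rather than syntactic similarity. A quick sketch (illustrative, not run here):

# A wider context window for comparison
tcm_wide <- create_tcm(it, vectorizer, skip_grams_window = 10L)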

Creating and fitting the GloVe model

Now that we have a TCM, we can factorize it via the GloVe algorithm. We will use the $new method of GlobalVectors to create our GloVe model. Here is the documentation for related functions and methods.

# Creating new GloVe model
glove <- GlobalVectors$new(rank = 50, x_max = 10)

# Checking GloVe methods
glove
<GloVe>
  Public:
    bias_i: NULL
    bias_j: NULL
    clone: function (deep = FALSE) 
    components: NULL
    fit_transform: function (x, n_iter = 10L, convergence_tol = -1, n_threads = getOption("rsparse_omp_threads", 
    get_history: function () 
    initialize: function (rank, x_max, learning_rate = 0.15, alpha = 0.75, lambda = 0, 
    shuffle: FALSE
  Private:
    alpha: 0.75
    b_i: NULL
    b_j: NULL
    cost_history: 
    fitted: FALSE
    glove_fitter: NULL
    initial: NULL
    lambda: 0
    learning_rate: 0.15
    rank: 50
    w_i: NULL
    w_j: NULL
    x_max: 10
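
Here, rank is the dimensionality of the word vectors and x_max caps the co-occurrence counts in GloVe’s weighting function. The printed object above shows the other tunable arguments; as a sketch, an alternative configuration might look like this (illustrative values, not a recommendation):

# Illustrative only: higher-dimensional vectors and a lower learning rate
glove_alt <- GlobalVectors$new(rank = 100, x_max = 100, learning_rate = 0.1)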

You’ll be able to access the public methods. We can fit our model by calling $fit_transform on our glove object. This may take several minutes.

# Fitting model
wv_main <- glove$fit_transform(tcm, n_iter= 10, 
                               convergence_tol = 0.01,
                               n_threads = 8)
INFO  [01:41:04.460] epoch 1, loss 0.1926
INFO  [01:41:08.345] epoch 2, loss 0.1286
INFO  [01:41:12.114] epoch 3, loss 0.1115
INFO  [01:41:15.906] epoch 4, loss 0.1010
INFO  [01:41:19.629] epoch 5, loss 0.0937
INFO  [01:41:23.312] epoch 6, loss 0.0883
INFO  [01:41:26.946] epoch 7, loss 0.0843
INFO  [01:41:30.659] epoch 8, loss 0.0811
INFO  [01:41:34.250] epoch 9, loss 0.0786
INFO  [01:41:37.928] epoch 10, loss 0.0765
# Checking dimensions
dim(wv_main)
[1] 14267    50

Note that the model learns two sets of word vectors: target and context. We can think of our word of interest as the target, and all the other words inside the window as the context. Word vectors are learned for both.

wv_context <- glove$components
dim(wv_context)
[1]    50 14267

While both word-vector matrices can be used as the result, the creators recommend averaging or summing the main and context vectors:

word_vectors <- wv_main + t(wv_context)
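
If you would rather average than sum, the equivalent one-liner (a small sketch) is:

# Average, rather than sum, the main and context vectors
word_vectors_avg <- (wv_main + t(wv_context)) / 2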

Here’s a preview of the word vector matrix:

dim(word_vectors)
[1] 14267    50
word_vectors[1:6,1:6]
            [,1]       [,2]       [,3]         [,4]        [,5]        [,6]
1837  0.43907892 -0.1872493 -0.5214302  0.281306138  0.54493457  0.07442340
1841  0.78713256  0.1654886  0.5445842 -0.115891594  0.48654030 -0.40464814
1881 -0.11485630  0.0799278  0.2563614 -0.408114871  0.04040807 -0.44825113
2005 -0.00553796  0.4716572  0.1298125 -0.001526793 -0.66258253  0.32161776
36    0.19366583 -0.2620511 -0.3671042  0.587645557  0.22950087 -0.03482799
38    0.46699611  0.1000720 -0.8211754  0.485239607 -0.60822053 -0.65862959

Cosine Similarity

School example

Now we can begin to play. Much like a standard correlation, cosine similarity lets us compare two vectors. Let’s see what is similar to ‘school’:

# Word vector for school
school <- word_vectors["school", , drop = FALSE]

# Cosine similarity
school_cos_sim <- sim2(x = word_vectors, y = school, 
                       method = "cosine", norm = "l2")

head(sort(school_cos_sim[,1], decreasing = TRUE), 10)
   school      home      time      work     after       fun     today      late 
1.0000000 0.7197087 0.7155737 0.6953649 0.6785728 0.6769731 0.6765467 0.6705443 
     when       get 
0.6428421 0.6428196 

Obviously, ‘school’ is most similar to itself. Based on the poems the children wrote, we also see words like ‘home’, ‘work’, and ‘fun’ among the terms most similar to ‘school’.

We can also calculate the cosine similarity between ‘school’ and ‘study’.

# Word vector for study
study <- word_vectors["study", , drop = FALSE]

# Cosine similarity
sim2(x = school, y = study, 
      method = "cosine", norm = "l2")
           study
school 0.2301332
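
As a sanity check, we can compute the same cosine similarity by hand from the two vectors; the result should match what sim2 returns:

# Manual cosine similarity: dot product divided by the product of the norms
sum(school * study) / (sqrt(sum(school^2)) * sqrt(sum(study^2)))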

In practice, you can write a loop to find the similarity between a keyword and a list of words: for example, calculating the similarity between ‘school’ and each of ‘study’, ‘learn’, ‘homework’, ‘lunch’, ‘fun’, and ‘friends’.

# List of words to compare
words <- c("study", "learn","homework", "lunch", "fun", "friends")

# Loop through each word and calculate cosine similarity
for (i in words) {
  list_vector <- word_vectors[i, , drop = FALSE]
  similarity <- sim2(x = school, y = list_vector, 
                     method = "cosine", norm = "l2")
  print(paste("Cosine similarity between 'school' and", i, ":", round(similarity, 3)))
}
[1] "Cosine similarity between 'school' and study : 0.23"
[1] "Cosine similarity between 'school' and learn : 0.594"
[1] "Cosine similarity between 'school' and homework : 0.533"
[1] "Cosine similarity between 'school' and lunch : 0.573"
[1] "Cosine similarity between 'school' and fun : 0.677"
[1] "Cosine similarity between 'school' and friends : 0.514"

Pet example

Let’s try our pet example:

# cat - meow + bark should equal dog
dog <- word_vectors["cat", , drop = FALSE] - 
  word_vectors["meow", , drop = FALSE] +
  word_vectors["bark", , drop = FALSE]

# Calculates pairwise similarities between the rows of two matrices
dog_cos_sim <- sim2(x = word_vectors, y = dog,
                    method = "cosine", norm = "l2")

# Top five predictions
head(sort(dog_cos_sim[,1], decreasing = TRUE), 5)
      cat       dog       fat       big        he 
0.8117645 0.7786891 0.7102236 0.7061164 0.6906470 

Success! We get ‘dog’ as the highest-ranked result after the word we started with (‘cat’). We can think of this as: cats say meow and dogs say bark.
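
A common convention when scoring analogies is to drop the query words before ranking, since the inputs themselves tend to dominate the similarities. A small sketch:

# Exclude the query words ('cat', 'meow', 'bark') before ranking
top <- sort(dog_cos_sim[, 1], decreasing = TRUE)
head(top[!names(top) %in% c("cat", "meow", "bark")], 5)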

Parent example

Let’s move on to the parent example:

# mom - girl + boy should equal dad
dad <- word_vectors["mom", , drop = FALSE] -
  word_vectors["girl", , drop = FALSE] +
  word_vectors["boy", , drop = FALSE]

# Calculates pairwise similarities between the rows of two matrices
dad_cos_sim <- sim2(x = word_vectors, y = dad,
                    method = "cosine", norm = "l2")

# Top five predictions
head(sort(dad_cos_sim[,1], decreasing = TRUE), 5)
      mom       dad   brother    sister      says 
0.8817681 0.8105545 0.6831259 0.6511331 0.6464033 

Once again the analogy works: ‘dad’ is the highest-ranked result after ‘mom’, the word we started with. Finally, let’s try the famous king and queen example.

King and queen example

# king - man + woman should equal queen
queen <- word_vectors["king", , drop = FALSE] -
  word_vectors["man", , drop = FALSE] +
  word_vectors["woman", , drop = FALSE]

# Calculate pairwise similarities
queen_cos_sim = sim2(x = word_vectors, y = queen, method = "cosine", norm = "l2")

# Top five predictions
head(sort(queen_cos_sim[,1], decreasing = TRUE), 5)
     king      kong      maid    martin     queen 
0.8026007 0.6838580 0.6251205 0.5827649 0.5764399 

Unfortunately, ‘queen’ only barely makes the top five here. Let’s try changing man and woman to boy and girl to better match the children’s writing.

# king - boy + girl should equal queen
queen <- word_vectors["king", , drop = FALSE] -
  word_vectors["boy", , drop = FALSE] +
  word_vectors["girl", , drop = FALSE]

# Calculate pairwise similarities
queen_cos_sim = sim2(x = word_vectors, y = queen, method = "cosine", norm = "l2")

# Top five predictions
head(sort(queen_cos_sim[,1], decreasing = TRUE), 5)
     king     queen    castle      kong      girl 
0.8678330 0.6258649 0.5963708 0.5585626 0.5140303 

It worked! ‘Queen’ is now the highest-ranked result after ‘king’.

As we can see, the outcomes are highly dependent on the data and settings you select, so bear the context in mind when trying this out.

Word2Vec

Word2Vec and GloVe are both popular word embedding models, but they differ in how they learn word relationships: Word2Vec is a predictive model that uses local context (through Skip-Gram or CBOW) to generate embeddings. It works well for smaller datasets and captures relationships from local word co-occurrences. GloVe is a count-based model that relies on global word co-occurrence statistics, which typically requires larger datasets to perform well.

In this section, I’ve applied Word2Vec to a 20% sample of the poem dataset, as it tends to perform better on smaller datasets.

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(word2vec)

set.seed(1234)
poem_small <- slice_sample(poem, prop = 0.2)

tokens_small <- word_tokenizer(poem_small$text)

word2vec_model <- word2vec(tokens_small, type = "skip-gram", dim = 50, window = 5, iter = 10, min_count = 5)

word2vec_vectors <- as.matrix(word2vec_model)

# Word vector for school
school_word2vec <- word2vec_vectors["school", , drop = FALSE]

# Cosine similarity
school_cos_sim2 <- sim2(x = word2vec_vectors, y = school_word2vec,
                        method = "cosine", norm = "l2")

head(sort(school_cos_sim2[,1], decreasing = TRUE), 10)
   school detention      pool      camp   studies     class   anyways    monday 
1.0000000 0.7648640 0.7295435 0.7286105 0.6899808 0.6821124 0.6772132 0.6745307 
   office  vacation 
0.6632186 0.6611257
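
The word2vec package also provides a predict() method for nearest-neighbor lookups directly from the fitted model; assuming the package’s current interface (see ?predict.word2vec), a quick sketch:

# Nearest neighbors of 'school' straight from the word2vec model
predict(word2vec_model, newdata = "school", type = "nearest", top_n = 10)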