Tutorial7_WordEmbeddings/Text Representation II

Word Embeddings

Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. They are a tool for identifying similarities between words in a corpus by using some form of model to predict the co-occurrence of words within a small chunk of text.

We’ll be using the text2vec package. text2vec was one of the first implementations of word embeddings functionality in R, and is designed to run fast, relatively speaking. Still, it’s important to remember that our computational complexity is amping up here, so don’t expect immediate results.


Stanford University’s Global Vectors for Word Representation (GloVe) is an approach to estimating a distributional representation of a word. GloVe is based, essentially, on factorizing a huge term co-occurrence matrix.

The distributional representation of words means that each term is represented as a vector of values over some number of dimensions (say, 3 dimensions, where the values are 0.6, 0.3, and 0.1). This stands in stark contrast to the work we’ve done to this point, which has effectively encoded each word as simply present (1) or not (0).
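To make the contrast concrete, here is a minimal base-R sketch. The three words and all of the numbers are made up for illustration (they do not come from any trained model); the point is only that one-hot vectors are all equally far apart, while dense vectors can place related words closer together.

```r
# Toy illustration with made-up numbers (not taken from any trained model).
words <- c("cat", "dog", "car")

# One-hot encoding: each word gets its own dimension.
one_hot <- diag(length(words))
dimnames(one_hot) <- list(words, words)

# Dense encoding: three hypothetical dimensions shared by all words.
dense <- rbind(
  cat = c(0.6, 0.3, 0.1),
  dog = c(0.5, 0.4, 0.1),
  car = c(0.1, 0.1, 0.8)
)

# Pairwise Euclidean distances between the rows of each matrix.
one_hot_dist <- as.matrix(dist(one_hot))
dense_dist <- as.matrix(dist(dense))

# Every pair of distinct one-hot vectors is equally far apart ...
round(one_hot_dist, 3)
# ... whereas the dense vectors place cat nearer to dog than to car.
round(dense_dist, 3)
```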

Perhaps unsurprisingly, the distributional representation better captures semantic meaning than the one-hot encoding. This opens up a world of possibilities for us as researchers. Indeed, this has been a major leap forward for research in Text-as-Data.

As an example, we can see how similar one word is to other words by measuring the distance between their distributions. Even more interestingly, we can capture really specific phenomena from text with some simple arithmetic based on word distributions. Consider the following canonical example:


king - man + woman = queen

Ponder this equation for a moment. From the vector representation of king, we subtract the vector representation of man. Then, we add the vector representation of woman. The end result of that should be a vector that is very similar to the vector representation of queen.
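The arithmetic can be sketched with hand-made vectors. The two dimensions below (loosely "royalty" and "gender") and every value are assumptions chosen so that the analogy holds exactly; real embeddings are learned from data and are far noisier.

```r
# Hand-made 2-D vectors for illustration only; the dimensions are
# hypothetical (roughly "royalty" and "gender"), not learned from data.
toy <- rbind(
  king  = c(0.9, -0.8),
  queen = c(0.9,  0.8),
  man   = c(0.1, -0.8),
  woman = c(0.1,  0.8)
)

# king - man + woman
target <- toy["king", ] - toy["man", ] + toy["woman", ]

# Cosine similarity between two vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Compare the result with every word in the toy vocabulary
sims <- apply(toy, 1, cosine, b = target)
sort(sims, decreasing = TRUE)
```

In this toy setup the word with the highest similarity to the result is queen.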

In what follows, we’ll work through some examples to see how well this works. I want to caution, though, that the models we are training here are probably too small for us to have too much confidence in the trained models. Nevertheless, you’ll see that even with this small set we’ll recover really interesting dynamics.

Front-end Matters

First, let’s install and load the text2vec package:

install.packages("text2vec")  # installing text2vec (might take a while)
library(text2vec)

PoKi Dataset

We’ll be using PoKi, a corpus of poems written by children and teenagers from grades 1 to 12.

One thing to flag right off the bat is the really interesting dynamics related to who is writing these poems. We need to keep in mind that the children writing these texts are going to use less formal language and tell more imaginative stories. Given this, we’ll focus on analogies that are more appropriate for this context; here, we’ll aim to create word embeddings that can recreate these two equations:

cat - meow + bark = dog

mom - girl + boy = dad

By the end, we should hopefully be able to recreate these by creating and fitting our GloVe models. But first, let’s perform the necessary pre-processing steps before creating our embedding models.

Let’s download and read in the data:

# Create a temporary file
temp <- tempfile()

# Download the CSV file
download.file("https://raw.githubusercontent.com/whipson/PoKi-Poems-by-Kids/master/poki.csv", temp)
# Read in the downloaded file
poem <- read.csv(temp)

# First ten rows
head(poem, 10)
       id                            title       author grade
1  104987                   I Love The Zoo                  1
2   67185                The scary forest.                  1
3  103555                 A Hike At School 1st grade-wh     1
4  112483                         Computer            a     1
5   74516                            Angel          aab     1
6  114693         Nature Nature and Nature       aadhya     1
7   46453                             Jack      aaliyah     1
8   57397         When I awoke one morning        aanna     1
9   77201 My Blue Berries and  My Cherries      aarathi     1
10  40520                      A snowy day          ab.     1
   text
1  roses are red,  violets are blue.   i love the zoo.   do you?
2  the forest is really haunted.  i believe it to be so.  but then we are going camping.
3  i took a hike at school today  and this is what i saw     bouncing balls      girls chatting against the walls     kids climbing on monkey bars     i even saw some teachers' cars     the wind was blowing my hair in my face     i saw a mud puddle,  but just a trace all of these things i noticed just now on my little hike.
4  you  can  do  what  you  want  you  can play a  game    you can do many things,   you can read and write
5  angel oh angle you spin like a top angel oh angel you will never stop can't you feel the air  as it blows through your hair angel oh angel itisto bad your a mop!
6  look at the sun, what a beautiful day.  under the trees, we can run and play.  beauty of nature, we love to see,  from tiny insect to exotic tree.  it is a place to sit and think,  nature and human share the deepest link.  nature has ocean, which is in motion.  nature has tree, nature has river.  if we destroy the nature we would never be free.  our nature keeps us alive,  we must protect it, for society to thrive.  we spoil the nature, we spoil the future.  go along with nature, for your better future.
7  dog  playful,  energetic running,  jumping,  tackling my is my friend jack
8  when i awoke one morning,  a dog was on my  head.  i asked ,  ''what are you doing there?' it looked at me and said  ''woof!'' ''wouldn't you like to be outside playing?''said the man ''i'm staying here and playing here. '' said the dog he played all night and day. he came inside his new house and played inside a wet wet day.
9  i went to my blue berry tree they were no blue berries found i went to another tree to get some more free but found none but cherries round.
10 one snowy day the children went outside to play in the snow.  they threw snowballs,  went sledding and made a snowman.  afterwards they went inside to drink warm hot chocolate.  it was a fun snowy day
   char
1    62
2    87
3   324
4   106
5   164
6   491
7    74
8   325
9   143
10  199
# Checks dimensions
dim(poem)
[1] 61508     6

We want the poems themselves, so we’ll use the column text for tokenization.

Tokenization and Vectorization

The process for text2vec differs from the standard process we’ve been following. Here, we’ll follow the same process that we will use for LDA later: first create a tokenized iterator, then a vectorized vocabulary. This time, there’s no need to lowercase our words since the downloaded dataset is already lowercased.

Let’s tokenize the data:

# Tokenization
tokens <- word_tokenizer(poem$text)

# First five rows tokenized
head(tokens, 5)
 [1] "roses"   "are"     "red"     "violets" "are"     "blue"    "i"      
 [8] "love"    "the"     "zoo"     "do"      "you"    

 [1] "the"     "forest"  "is"      "really"  "haunted" "i"       "believe"
 [8] "it"      "to"      "be"      "so"      "but"     "then"    "we"     
[15] "are"     "going"   "camping"

 [1] "i"        "took"     "a"        "hike"     "at"       "school"  
 [7] "today"    "and"      "this"     "is"       "what"     "i"       
[13] "saw"      "bouncing" "balls"    "girls"    "chatting" "against" 
[19] "the"      "walls"    "kids"     "climbing" "on"       "monkey"  
[25] "bars"     "i"        "even"     "saw"      "some"     "teachers"
[31] "cars"     "the"      "wind"     "was"      "blowing"  "my"      
[37] "hair"     "in"       "my"       "face"     "i"        "saw"     
[43] "a"        "mud"      "puddle"   "but"      "just"     "a"       
[49] "trace"    "all"      "of"       "these"    "things"   "i"       
[55] "noticed"  "just"     "now"      "on"       "my"       "little"  
[61] "hike"    

 [1] "you"    "can"    "do"     "what"   "you"    "want"   "you"    "can"   
 [9] "play"   "a"      "game"   "you"    "can"    "do"     "many"   "things"
[17] "you"    "can"    "read"   "and"    "write" 

 [1] "angel"   "oh"      "angle"   "you"     "spin"    "like"    "a"      
 [8] "top"     "angel"   "oh"      "angel"   "you"     "will"    "never"  
[15] "stop"    "can't"   "you"     "feel"    "the"     "air"     "as"     
[22] "it"      "blows"   "through" "your"    "hair"    "angel"   "oh"     
[29] "angel"   "itisto"  "bad"     "your"    "a"       "mop"    

Create an iterator object:

# Create iterator object
it <- itoken(tokens, progressbar = FALSE)

Build the vocabulary:

# Build vocabulary
vocab <- create_vocabulary(it)

# Vocabulary
vocab
Number of docs: 61508 
0 stopwords:  ... 
ngram_min = 1; ngram_max = 1 
          term term_count doc_count
    1:    0000          1         1
    2: 0000000          1         1
    3: 0000001          1         1
    4:   00a:m          1         1
    5:    00he          1         1
56470:      to      69175     30347
56471:     and      80863     34798
56472:       a      92765     37607
56473:     the     120677     37676
56474:       i     124832     32777
# Check dimensions
dim(vocab)
[1] 56474     3

Next, we prune the vocabulary and create a vectorizer from it. We’ll keep the terms that occur at least 5 times.

# Prune vocabulary
vocab <- prune_vocabulary(vocab, term_count_min = 5)

# Check dimensions
dim(vocab)
[1] 14267     3
# Vectorize
vectorizer <- vocab_vectorizer(vocab)

As we can see, pruning our vocabulary deleted over 40 thousand words. I want to reiterate that this is a very small corpus from the perspective of traditional word embedding models. When we are working with word representations trained with these smaller corpora, we should be really cautious in our approach.

Moving on, we can create our term co-occurrence matrix (TCM). We can achieve different results by experimenting with skip_grams_window and other parameters. The definition of whether two words occur together is arbitrary, so we definitely want to play around with the parameters to see the different results.

# use window of 5 for context words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
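To make the window idea concrete, here is a minimal base-R sketch that counts co-occurrences by hand for one toy sentence. The sentence, the window size of 2, and the variable names are illustrative assumptions. It also uses plain unweighted counts, whereas create_tcm can down-weight pairs that sit further apart, so treat it as a simplification rather than a reimplementation.

```r
# Toy sentence and window size; both are illustrative choices.
toy_tokens <- c("the", "dog", "chased", "the", "cat")
window <- 2L

terms <- unique(toy_tokens)
toy_tcm <- matrix(0, length(terms), length(terms),
                  dimnames = list(terms, terms))

# For each word, look at the words up to `window` positions to its right.
# Each pair is recorded in both directions so the matrix stays symmetric.
for (i in seq_along(toy_tokens)) {
  last <- min(i + window, length(toy_tokens))
  for (j in seq_len(last - i) + i) {
    a <- toy_tokens[i]
    b <- toy_tokens[j]
    toy_tcm[a, b] <- toy_tcm[a, b] + 1
    if (a != b) toy_tcm[b, a] <- toy_tcm[b, a] + 1
  }
}

toy_tcm
```

A larger window would also link "dog" and "cat", which never co-occur at this window size; that is the kind of difference worth exploring when tuning the real model.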

Creating and fitting the GloVe model

Now we have a TCM and can factorize it via the GloVe algorithm. We will use the $new method of GlobalVectors to create our GloVe model. See the text2vec documentation for related functions and methods.

# Creating new GloVe model
glove <- GlobalVectors$new(rank = 50, x_max = 10)

# Checking GloVe methods
glove
    bias_i: NULL
    bias_j: NULL
    clone: function (deep = FALSE) 
    components: NULL
    fit_transform: function (x, n_iter = 10L, convergence_tol = -1, n_threads = getOption("rsparse_omp_threads", 
    get_history: function () 
    initialize: function (rank, x_max, learning_rate = 0.15, alpha = 0.75, lambda = 0, 
    shuffle: FALSE
    alpha: 0.75
    b_i: NULL
    b_j: NULL
    fitted: FALSE
    glove_fitter: NULL
    initial: NULL
    lambda: 0
    learning_rate: 0.15
    rank: 50
    w_i: NULL
    w_j: NULL
    x_max: 10
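
The x_max and alpha values in the printout above feed GloVe's weighting function, which limits the influence of very frequent co-occurrences. The sketch below writes that function out by hand as it is defined in the original GloVe paper; glove_weight is a hypothetical helper name, not a text2vec function, and the counts are made up.

```r
# GloVe weighting function from the original paper, written out by hand.
# This is an illustration, not text2vec's internal code.
glove_weight <- function(x, x_max = 10, alpha = 0.75) {
  ifelse(x < x_max, (x / x_max)^alpha, 1)
}

# Weights for a few hypothetical co-occurrence counts
counts <- c(1, 5, 10, 100)
round(glove_weight(counts), 3)
```

Any pair that co-occurs at least x_max times receives the full weight of 1, so with x_max = 10 a pair seen 100 times counts no more in the loss than a pair seen 10 times.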

As the printout shows, we can access the model’s public methods. We can fit our model by calling $fit_transform on our glove object. This may take several minutes.

# Fitting model
wv_main <- glove$fit_transform(tcm, n_iter= 10, 
                               convergence_tol = 0.01,
                               n_threads = 8)
INFO  [13:25:28.736] epoch 1, loss 0.1916
INFO  [13:25:32.383] epoch 2, loss 0.1290
INFO  [13:25:36.013] epoch 3, loss 0.1118
INFO  [13:25:39.640] epoch 4, loss 0.1011
INFO  [13:25:43.302] epoch 5, loss 0.0937
INFO  [13:25:46.900] epoch 6, loss 0.0883
INFO  [13:25:50.459] epoch 7, loss 0.0843
INFO  [13:25:54.229] epoch 8, loss 0.0811
INFO  [13:25:57.833] epoch 9, loss 0.0785
INFO  [13:26:01.516] epoch 10, loss 0.0764
# Checking dimensions
dim(wv_main)
[1] 14267    50

Note that the model learns two sets of word vectors: target and context. We can think of our word of interest as the target, and all the other words inside the window as the context. Word vectors are learned for both.

wv_context <- glove$components
dim(wv_context)
[1]    50 14267

While either word-vector matrix can be used as the result, the creators recommend averaging or summing the main and context vectors:

word_vectors <- wv_main + t(wv_context)

Here’s a preview of the word vector matrix:

dim(word_vectors)
[1] 14267    50
head(word_vectors[, 1:6])
            [,1]        [,2]        [,3]        [,4]       [,5]        [,6]
1837  0.04517645 -0.28666180  0.40401689  0.30614881  0.5127633 -0.56553552
1841 -0.28710643 -0.48091927  0.74153012  0.17633699  0.4550122 -0.29851067
1881 -0.13368742  0.36169375 -0.04082482  0.04540700 -0.3577053 -0.19733959
2005 -0.77480473 -0.09591058 -0.31780528  0.49908921 -0.7781627  0.08876551
36   -0.33893810  0.23180553  0.64480213  0.03419768 -0.2916895 -0.07991617
38   -0.26858585 -0.48943996  0.63325512 -0.31352754  0.1633754 -0.12409298

Cosine Similarity

School example

Now we can begin to play. Much like a standard correlation, cosine similarity lets us compare two vectors. Let’s see which words are most similar to ‘school’:

# Word vector for school
school <- word_vectors["school", , drop = FALSE]

# Cosine similarity
school_cos_sim <- sim2(x = word_vectors, y = school, 
                       method = "cosine", norm = "l2")

head(sort(school_cos_sim[,1], decreasing = TRUE), 10)
   school       fun      work      time       day      cool     today        go 
1.0000000 0.6931877 0.6679507 0.6618867 0.6495852 0.6456139 0.6174673 0.6165279 
       to       all 
0.6117772 0.6101692 

Obviously, school is the most similar to school. Based on the poems that the children wrote, we can also see words like ‘fun’, ‘work’, and ‘time’ among the most similar to ‘school’.

Pet example

Let’s try our pet example:

# cat - meow + bark should equal dog
dog <- word_vectors["cat", , drop = FALSE] - 
  word_vectors["meow", , drop = FALSE] +
  word_vectors["bark", , drop = FALSE]

# Calculates pairwise similarities between the rows of two matrices
dog_cos_sim <- sim2(x = word_vectors, y = dog,
                    method = "cosine", norm = "l2")

# Top five predictions
head(sort(dog_cos_sim[,1], decreasing = TRUE), 5)
      dog       cat       big     small       his 
0.8087565 0.8026895 0.6945863 0.6699371 0.6476813 

Success: our predicted result was correct! ‘dog’ is the top result, narrowly ahead of ‘cat’, one of the words we used to build the query. We can think of this scenario as cats say meow and dogs say bark.

Parent example

Let’s move on to the parent example:

# mom - girl + boy should equal dad
dad <- word_vectors["mom", , drop = FALSE] -
  word_vectors["girl", , drop = FALSE] +
  word_vectors["boy", , drop = FALSE]

# Calculates pairwise similarities between the rows of two matrices
dad_cos_sim <- sim2(x = word_vectors, y = dad,
                    method = "cosine", norm = "l2")

# Top five predictions
head(sort(dad_cos_sim[,1], decreasing = TRUE), 5)
      mom       dad      says   brother      said 
0.8479007 0.8065054 0.7150675 0.7076835 0.6855733 

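‘dad’ comes in second, right behind ‘mom’, which is one of the words we used to build the query, so this analogy holds up as well. Finally, let’s try the famous king and queen example.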

King and queen example

# king - man + woman should equal queen
queen <- word_vectors["king", , drop = FALSE] -
  word_vectors["man", , drop = FALSE] +
  word_vectors["woman", , drop = FALSE]

# Calculate pairwise similarities
queen_cos_sim <- sim2(x = word_vectors, y = queen,
                      method = "cosine", norm = "l2")

# Top five predictions
head(sort(queen_cos_sim[,1], decreasing = TRUE), 5)
     king      kong   natural  dinasaur    martin 
0.7251078 0.6366864 0.6133684 0.5437246 0.5224198 

Unfortunately, ‘queen’ is not among the top five results. Let’s try changing man and woman to boy and girl to account for the kids’ writing.

# king - boy + girl should equal queen
queen <- word_vectors["king", , drop = FALSE] -
  word_vectors["boy", , drop = FALSE] +
  word_vectors["girl", , drop = FALSE]

# Calculate pairwise similarities
queen_cos_sim <- sim2(x = word_vectors, y = queen,
                      method = "cosine", norm = "l2")

# Top five predictions
head(sort(queen_cos_sim[,1], decreasing = TRUE), 5)
     king     queen     child      girl  princess 
0.8119764 0.5533746 0.5464525 0.5095002 0.5058019 

It worked! ‘queen’ is now the highest-ranked result after ‘king’, the word we started from.

As we can see, outcomes are highly dependent on the data and settings you select, so bear in mind the context when trying this out.
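
In several of the examples, a word used to build the query (such as ‘cat’, ‘mom’, or ‘king’) ranks at or near the top of the results. One way to make the rankings easier to read is to drop the query words before sorting. The helper below is a hypothetical sketch in base R, not part of text2vec; it assumes a numeric matrix with one row per word and the words as row names, which is the shape of word_vectors.

```r
# Hypothetical helper (not part of text2vec): rank words by cosine similarity
# to start - minus + plus, leaving out the three query words themselves.
analogy <- function(vectors, start, minus, plus, n = 5) {
  target <- vectors[start, ] - vectors[minus, ] + vectors[plus, ]
  # Cosine similarity between every row of `vectors` and the target vector
  sims <- as.vector(vectors %*% target) /
    (sqrt(rowSums(vectors^2)) * sqrt(sum(target^2)))
  names(sims) <- rownames(vectors)
  # Remove the query words so they cannot crowd out the answer
  sims <- sims[!(names(sims) %in% c(start, minus, plus))]
  head(sort(sims, decreasing = TRUE), n)
}

# Usage with the matrix built in this tutorial
# (assumes all three words survived pruning):
# analogy(word_vectors, "king", "boy", "girl")
```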