Tutorial 7: Word Embeddings / Text Representation II
Word Embeddings
Word embeddings are a type of word representation that allows words with similar meaning to have a similar representation. They are a tool for identifying similarities between words in a corpus by using some form of model to predict the co-occurrence of words within a small chunk of text.
We’ll be using the text2vec package. text2vec was one of the first implementations of word embedding functionality in R, and is designed to run fast, relatively speaking. Still, it’s important to remember that the computational complexity is ramping up here, so don’t expect immediate results.
GloVe
Stanford University’s Global Vectors for Word Representation (GloVe) is an approach to estimating a distributional representation of a word. GloVe is based, essentially, on factorizing a huge term co-occurrence matrix.
The distributional representation of words means that each term is represented as a distribution over some number of dimensions (say, 3 dimensions, where the values are 0.6, 0.3, and 0.1). This stands in stark contrast to the work we’ve done to this point, which has effectively encoded each word as just present (1) or absent (0).
Perhaps unsurprisingly, the distributional representation better captures semantic meaning than the one-hot encoding. This opens up a world of possibilities for us as researchers. Indeed, this has been a major leap forward for research in Text-as-Data.
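To make the one-hot versus distributional contrast concrete, here is a toy sketch; the vocabulary and all of the numbers below are invented purely for illustration:
# One-hot encoding: a word is just an indicator over the whole vocabulary
vocab_toy <- c("king", "queen", "man", "woman")
one_hot_king <- as.numeric(vocab_toy == "king")   # 1 0 0 0

# Distributional representation: a dense vector over a few dimensions
# (these values are invented purely for illustration)
dense_king <- c(0.6, 0.3, 0.1)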
As an example, we can see how similar one word is to other words by measuring the distance between their distributions. Even more interestingly, we can capture really specific phenomena from text with some simple arithmetic based on word distributions. Consider the following canonical example:
king - man + woman = queen
Ponder this equation for a moment. From the vector representation of king, we subtract the vector representation of man. Then, we add the vector representation of woman. The end result of that should be a vector that is very similar to the vector representation of queen.
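To see the arithmetic in action before we train anything, here is a minimal sketch with made-up 3-dimensional vectors; the numbers are invented purely for illustration, and cosine() is just an illustrative helper, not a text2vec function:
# Toy 3-dimensional word vectors (numbers invented for illustration only)
king_toy  <- c(0.9, 0.8, 0.1)
man_toy   <- c(0.1, 0.9, 0.0)
woman_toy <- c(0.1, 0.1, 0.9)
queen_toy <- c(0.9, 0.1, 0.9)

# king - man + woman should land close to queen
guess <- king_toy - man_toy + woman_toy

# Cosine similarity: dot product divided by the product of vector lengths
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
cosine(guess, queen_toy)   # close to 1
cosine(guess, man_toy)     # much lower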
In what follows, we’ll work through some examples to see how well this works. I want to caution, though, that the corpus we are training on is probably too small for us to place much confidence in the resulting models. Nevertheless, you’ll see that even with this small dataset we’ll recover some really interesting dynamics.
Front-end Matters
First, let’s install and load the text2vec package:
# Installing text2vec package (might take a while)
# install.packages('text2vec')
library(text2vec)
PoKi Dataset
We’ll be using PoKi, a corpus of poems written by children and teenagers from grades 1 to 12.
One thing to flag right off the bat is the really interesting dynamics related to who is writing these poems. We need to keep in mind that the children writing these texts will use less formal language and tell more imaginative stories. Given this, we’ll focus on analogies that are more appropriate for this context; here, we’ll aim to create word embeddings that can recreate these two equations:
cat - meow + bark = dog
mom - girl + boy = dad
By the end, we should hopefully be able to recreate these by creating and fitting our GloVe models. But first, let’s perform the necessary pre-processing steps before creating our embedding models.
Let’s download and read in the data:
# Create temporary file
temp <- tempfile()

# Download file
download.file("https://raw.githubusercontent.com/whipson/PoKi-Poems-by-Kids/master/poki.csv", temp)

# Read in downloaded file
poem <- read.csv(temp)

# First ten rows
head(poem, 10)
id title author grade
1 104987 I Love The Zoo 1
2 67185 The scary forest. 1
3 103555 A Hike At School 1st grade-wh 1
4 112483 Computer a 1
5 74516 Angel aab 1
6 114693 Nature Nature and Nature aadhya 1
7 46453 Jack aaliyah 1
8 57397 When I awoke one morning aanna 1
9 77201 My Blue Berries and My Cherries aarathi 1
10 40520 A snowy day ab. 1
text
1 roses are red, violets are blue. i love the zoo. do you?
2 the forest is really haunted. i believe it to be so. but then we are going camping.
3 i took a hike at school today and this is what i saw bouncing balls girls chatting against the walls kids climbing on monkey bars i even saw some teachers' cars the wind was blowing my hair in my face i saw a mud puddle, but just a trace all of these things i noticed just now on my little hike.
4 you can do what you want you can play a game you can do many things, you can read and write
5 angel oh angle you spin like a top angel oh angel you will never stop can't you feel the air as it blows through your hair angel oh angel itisto bad your a mop!
6 look at the sun, what a beautiful day. under the trees, we can run and play. beauty of nature, we love to see, from tiny insect to exotic tree. it is a place to sit and think, nature and human share the deepest link. nature has ocean, which is in motion. nature has tree, nature has river. if we destroy the nature we would never be free. our nature keeps us alive, we must protect it, for society to thrive. we spoil the nature, we spoil the future. go along with nature, for your better future.
7 dog playful, energetic running, jumping, tackling my is my friend jack
8 when i awoke one morning, a dog was on my head. i asked , ''what are you doing there?' it looked at me and said ''woof!'' ''wouldn't you like to be outside playing?''said the man ''i'm staying here and playing here. '' said the dog he played all night and day. he came inside his new house and played inside a wet wet day.
9 i went to my blue berry tree they were no blue berries found i went to another tree to get some more free but found none but cherries round.
10 one snowy day the children went outside to play in the snow. they threw snowballs, went sledding and made a snowman. afterwards they went inside to drink warm hot chocolate. it was a fun snowy day
char
1 62
2 87
3 324
4 106
5 164
6 491
7 74
8 325
9 143
10 199
# Checks dimensions
dim(poem)
[1] 61508 6
We want the poems themselves, so we’ll use the column text
for tokenization.
Tokenization and Vectorization
The process for text2vec is different from the standard workflow we’ve been following. To that end, we’ll follow the same process we will use for LDA later, creating a tokenized iterator and a vectorized vocabulary first. This time, there’s no need to lowercase our words, since the downloaded dataset is already lowercased.
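If you were working with a corpus that had not already been lowercased, you could fold that step into tokenization; a minimal sketch (the tokens_lower name is just illustrative, and PoKi does not need this step):
# Lowercase before tokenizing (not needed for PoKi, which is already lowercased)
tokens_lower <- word_tokenizer(tolower(poem$text))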
Let’s tokenize the data:
# Tokenization
tokens <- word_tokenizer(poem$text)
# First five rows tokenized
head(tokens, 5)
[[1]]
[1] "roses" "are" "red" "violets" "are" "blue" "i"
[8] "love" "the" "zoo" "do" "you"
[[2]]
[1] "the" "forest" "is" "really" "haunted" "i" "believe"
[8] "it" "to" "be" "so" "but" "then" "we"
[15] "are" "going" "camping"
[[3]]
[1] "i" "took" "a" "hike" "at" "school"
[7] "today" "and" "this" "is" "what" "i"
[13] "saw" "bouncing" "balls" "girls" "chatting" "against"
[19] "the" "walls" "kids" "climbing" "on" "monkey"
[25] "bars" "i" "even" "saw" "some" "teachers"
[31] "cars" "the" "wind" "was" "blowing" "my"
[37] "hair" "in" "my" "face" "i" "saw"
[43] "a" "mud" "puddle" "but" "just" "a"
[49] "trace" "all" "of" "these" "things" "i"
[55] "noticed" "just" "now" "on" "my" "little"
[61] "hike"
[[4]]
[1] "you" "can" "do" "what" "you" "want" "you" "can"
[9] "play" "a" "game" "you" "can" "do" "many" "things"
[17] "you" "can" "read" "and" "write"
[[5]]
[1] "angel" "oh" "angle" "you" "spin" "like" "a"
[8] "top" "angel" "oh" "angel" "you" "will" "never"
[15] "stop" "can't" "you" "feel" "the" "air" "as"
[22] "it" "blows" "through" "your" "hair" "angel" "oh"
[29] "angel" "itisto" "bad" "your" "a" "mop"
Create an iterator object:
# Create iterator object
it <- itoken(tokens, progressbar = FALSE)
Build the vocabulary:
# Build vocabulary
vocab <- create_vocabulary(it)
# Vocabulary
vocab
Number of docs: 61508
0 stopwords: ...
ngram_min = 1; ngram_max = 1
Vocabulary:
term term_count doc_count
<char> <int> <int>
1: 0000 1 1
2: 0000000 1 1
3: 0000001 1 1
4: 00a:m 1 1
5: 00he 1 1
---
56470: to 69175 30347
56471: and 80863 34798
56472: a 92765 37607
56473: the 120677 37676
56474: i 124832 32777
# Check dimensions
dim(vocab)
[1] 56474 3
And prune and vectorize it. We’ll keep the terms that occur at least 5 times.
# Prune vocabulary
vocab <- prune_vocabulary(vocab, term_count_min = 5)
# Check dimensions
dim(vocab)
[1] 14267 3
# Vectorize
vectorizer <- vocab_vectorizer(vocab)
As we can see, pruning our vocabulary removed over 40 thousand terms. I want to reiterate that this is a very small corpus by the standards of traditional word embedding models. When we are working with word representations trained on such small corpora, we should be especially cautious in our approach.
Moving on, we can create our term co-occurrence matrix (TCM). We can achieve different results by experimenting with skip_grams_window and other parameters. The definition of whether two words occur together is somewhat arbitrary, so we definitely want to play around with the parameters and compare the results; see the sketch after the code below.
# Use a window of 5 for context words
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)
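As a quick illustration of that kind of experimentation, the sketch below builds a second TCM with a tighter window of 2; the tcm_narrow name is just for illustration, and you could feed it through the same GloVe steps that follow and compare the resulting vectors:
# A tighter context window: only words within 2 tokens count as co-occurring
tcm_narrow <- create_tcm(it, vectorizer, skip_grams_window = 2L)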
Creating and fitting the GloVe model
Now that we have a TCM, we can factorize it via the GloVe algorithm. We will use the $new method of GlobalVectors to create our GloVe model. The text2vec documentation covers the related functions and methods.
# Creating new GloVe model
glove <- GlobalVectors$new(rank = 50, x_max = 10)
# Checking GloVe methods
glove
<GloVe>
Public:
bias_i: NULL
bias_j: NULL
clone: function (deep = FALSE)
components: NULL
fit_transform: function (x, n_iter = 10L, convergence_tol = -1, n_threads = getOption("rsparse_omp_threads",
get_history: function ()
initialize: function (rank, x_max, learning_rate = 0.15, alpha = 0.75, lambda = 0,
shuffle: FALSE
Private:
alpha: 0.75
b_i: NULL
b_j: NULL
cost_history:
fitted: FALSE
glove_fitter: NULL
initial: NULL
lambda: 0
learning_rate: 0.15
rank: 50
w_i: NULL
w_j: NULL
x_max: 10
You’ll be able to access the public methods. We can fit our model by calling $fit_transform on our glove object. This may take several minutes.
# Fitting model
wv_main <- glove$fit_transform(tcm, n_iter = 10,
                               convergence_tol = 0.01,
                               n_threads = 8)
INFO [01:41:04.460] epoch 1, loss 0.1926
INFO [01:41:08.345] epoch 2, loss 0.1286
INFO [01:41:12.114] epoch 3, loss 0.1115
INFO [01:41:15.906] epoch 4, loss 0.1010
INFO [01:41:19.629] epoch 5, loss 0.0937
INFO [01:41:23.312] epoch 6, loss 0.0883
INFO [01:41:26.946] epoch 7, loss 0.0843
INFO [01:41:30.659] epoch 8, loss 0.0811
INFO [01:41:34.250] epoch 9, loss 0.0786
INFO [01:41:37.928] epoch 10, loss 0.0765
# Checking dimensions
dim(wv_main)
[1] 14267 50
Note that the model learns two sets of word vectors: target and context. We can think of our word of interest as the target, and all the other words inside the window as the context. Word vectors are learned for both.
wv_context <- glove$components
dim(wv_context)
[1] 50 14267
While either word-vector matrix can be used as the result, the creators recommend averaging or summing the main and context vectors:
word_vectors <- wv_main + t(wv_context)
Here’s a preview of the word vector matrix:
dim(word_vectors)
[1] 14267 50
word_vectors[1:6, 1:6]
[,1] [,2] [,3] [,4] [,5] [,6]
1837 0.43907892 -0.1872493 -0.5214302 0.281306138 0.54493457 0.07442340
1841 0.78713256 0.1654886 0.5445842 -0.115891594 0.48654030 -0.40464814
1881 -0.11485630 0.0799278 0.2563614 -0.408114871 0.04040807 -0.44825113
2005 -0.00553796 0.4716572 0.1298125 -0.001526793 -0.66258253 0.32161776
36 0.19366583 -0.2620511 -0.3671042 0.587645557 0.22950087 -0.03482799
38 0.46699611 0.1000720 -0.8211754 0.485239607 -0.60822053 -0.65862959
Cosine Similarity
School example
Now we can begin to play. Much as we would with a standard correlation, we can compare two vectors using cosine similarity. Let’s see what is similar to ‘school’:
# Word vector for school
<- word_vectors["school", , drop = FALSE]
school
# Cosine similarity
school_cos_sim <- sim2(x = word_vectors, y = school,
                       method = "cosine", norm = "l2")
head(sort(school_cos_sim[,1], decreasing = TRUE), 10)
school home time work after fun today late
1.0000000 0.7197087 0.7155737 0.6953649 0.6785728 0.6769731 0.6765467 0.6705443
when get
0.6428421 0.6428196
Obviously, school is most similar to itself. Based on the poems that the children wrote, we also see words like ‘work’, ‘fun’, and ‘home’ among the terms most similar to ‘school’.
We can also calculate the cosine similarity between ‘school’ and ‘study’.
# Word vector for study
<- word_vectors["study", , drop = FALSE]
study
# Cosine similarity
sim2(x = school, y = study,
method = "cosine", norm = "l2")
study
school 0.2301332
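Under the hood, cosine similarity is just the dot product of the two vectors divided by the product of their lengths. As a sanity check, we can reproduce the sim2() value by hand using the school and study vectors from above; manual_cos is just an illustrative name:
# Manual cosine similarity: dot product over the product of the vector norms
manual_cos <- sum(school * study) /
  (sqrt(sum(school^2)) * sqrt(sum(study^2)))
manual_cos   # should match the sim2() result above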
In practice, you can write a loop to find the similarity between a keyword and a list of words, for example calculating the similarity between ‘school’ and each of ‘study’, ‘learn’, ‘homework’, ‘lunch’, ‘fun’, and ‘friends’.
# List of words to compare
words <- c("study", "learn", "homework", "lunch", "fun", "friends")

# Loop through each word and calculate cosine similarity
for (i in words) {
  list_vector <- word_vectors[i, , drop = FALSE]
  similarity <- sim2(x = school, y = list_vector,
                     method = "cosine", norm = "l2")
  print(paste("Cosine similarity between 'school' and", i, ":", round(similarity, 3)))
}
[1] "Cosine similarity between 'school' and study : 0.23"
[1] "Cosine similarity between 'school' and learn : 0.594"
[1] "Cosine similarity between 'school' and homework : 0.533"
[1] "Cosine similarity between 'school' and lunch : 0.573"
[1] "Cosine similarity between 'school' and fun : 0.677"
[1] "Cosine similarity between 'school' and friends : 0.514"
Pet example
Let’s try our pet example:
# cat - meow + bark should equal dog
dog <- word_vectors["cat", , drop = FALSE] -
  word_vectors["meow", , drop = FALSE] +
  word_vectors["bark", , drop = FALSE]

# Calculates pairwise similarities between the rows of two matrices
dog_cos_sim <- sim2(x = word_vectors, y = dog,
                    method = "cosine", norm = "l2")
# Top five predictions
head(sort(dog_cos_sim[,1], decreasing = TRUE), 5)
cat dog fat big he
0.8117645 0.7786891 0.7102236 0.7061164 0.6906470
Success! Our prediction was correct: after ‘cat’ itself (one of the words we used), ‘dog’ is the most similar term. We can read this as cats say meow and dogs say bark.
Parent example
Let’s move on to the parent example:
# mom - girl + boy should equal dad
dad <- word_vectors["mom", , drop = FALSE] -
  word_vectors["girl", , drop = FALSE] +
  word_vectors["boy", , drop = FALSE]

# Calculates pairwise similarities between the rows of two matrices
dad_cos_sim <- sim2(x = word_vectors, y = dad,
                    method = "cosine", norm = "l2")
# Top five predictions
head(sort(dad_cos_sim[,1], decreasing = TRUE), 5)
mom dad brother sister says
0.8817681 0.8105545 0.6831259 0.6511331 0.6464033
This time ‘dad’ is the top result after ‘mom’ itself, so the analogy holds here as well. Finally, let’s try the infamous king and queen example.
King and queen example
# king - man + woman should equal queen
queen <- word_vectors["king", , drop = FALSE] -
  word_vectors["man", , drop = FALSE] +
  word_vectors["woman", , drop = FALSE]

# Calculate pairwise similarities
queen_cos_sim <- sim2(x = word_vectors, y = queen, method = "cosine", norm = "l2")
# Top five predictions
head(sort(queen_cos_sim[,1], decreasing = TRUE), 5)
king kong maid martin queen
0.8026007 0.6838580 0.6251205 0.5827649 0.5764399
Unfortunately, we did not get ‘queen’ as a top result. Let’s try changing ‘man’ and ‘woman’ to ‘boy’ and ‘girl’ to better match the kids’ writing.
# king - boy + girl should equal queen
queen <- word_vectors["king", , drop = FALSE] -
  word_vectors["boy", , drop = FALSE] +
  word_vectors["girl", , drop = FALSE]

# Calculate pairwise similarities
queen_cos_sim <- sim2(x = word_vectors, y = queen, method = "cosine", norm = "l2")
# Top five predictions
head(sort(queen_cos_sim[,1], decreasing = TRUE), 5)
king queen castle kong girl
0.8678330 0.6258649 0.5963708 0.5585626 0.5140303
It worked! After ‘king’ itself, ‘queen’ is now the top result.
As we can see, though, outcomes are highly dependent on the data and the settings you select, so bear the context in mind when trying this out.
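If you want to keep experimenting with other analogies, it can help to wrap the arithmetic in a small helper function. The sketch below is not part of text2vec; analogy() is just an illustrative name that bundles the subtraction, addition, and cosine ranking we did by hand above:
# Illustrative helper: nearest terms for the analogy a - b + c
analogy <- function(a, b, c, vectors, n = 5) {
  target <- vectors[a, , drop = FALSE] -
    vectors[b, , drop = FALSE] +
    vectors[c, , drop = FALSE]
  sims <- sim2(x = vectors, y = target, method = "cosine", norm = "l2")
  head(sort(sims[, 1], decreasing = TRUE), n)
}

# The same queries as above
analogy("king", "boy", "girl", word_vectors)
analogy("cat", "meow", "bark", word_vectors)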
Word2Vec
Word2Vec and GloVe are both popular word embedding models, but they differ in how they learn word relationships: Word2Vec is a predictive model that uses local context (through Skip-Gram or CBOW) to generate embeddings. It works well for smaller datasets and captures relationships from local word co-occurrences. GloVe is a count-based model that relies on global word co-occurrence statistics, which typically requires larger datasets to perform well.
In this section, I’ve applied Word2Vec to a 20% sample of the poem dataset, as it tends to perform better on smaller datasets.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(word2vec)
set.seed(1234)
# Take a 20% sample of the poems
poem_small <- slice_sample(poem, prop = 0.2)

# Tokenize the sample
tokens_small <- word_tokenizer(poem_small$text)

# Fit a skip-gram Word2Vec model
word2vec_model <- word2vec(tokens_small, type = "skip-gram", dim = 50, window = 5, iter = 10, min_count = 5)

# Convert to a matrix of word vectors
word2vec_vectors <- as.matrix(word2vec_model)

# Word vector for school
school_word2vec <- word2vec_vectors["school", , drop = FALSE]

# Cosine similarity
school_cos_sim2 <- sim2(x = word2vec_vectors, y = school_word2vec,
                        method = "cosine", norm = "l2")
head(sort(school_cos_sim2[,1], decreasing = TRUE), 10)
school detention pool camp studies class anyways monday
1.0000000 0.7648640 0.7295435 0.7286105 0.6899808 0.6821124 0.6772132 0.6745307
office vacation
0.6632186 0.6611257
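Since word2vec_vectors is an ordinary matrix with one row per term, the same cosine-similarity machinery applies to these embeddings too. As a sketch, you could rerun the pet analogy on the Word2Vec vectors, provided ‘cat’, ‘meow’, and ‘bark’ all survive the min_count filter in this 20% sample; results will vary with the sample and seed:
# cat - meow + bark on the Word2Vec embeddings (results will vary)
dog_w2v <- word2vec_vectors["cat", , drop = FALSE] -
  word2vec_vectors["meow", , drop = FALSE] +
  word2vec_vectors["bark", , drop = FALSE]

dog_cos_sim_w2v <- sim2(x = word2vec_vectors, y = dog_w2v,
                        method = "cosine", norm = "l2")
head(sort(dog_cos_sim_w2v[, 1], decreasing = TRUE), 5)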