Package version: 4.1.0
Unicode version: 14.0
ICU version: 71.1
Parallel computing: disabled
See https://quanteda.io for tutorials and examples.
library(quanteda.textmodels)
library(ggplot2)
This time, we’ll be making use of a package that’s available on GitHub. To install it, we need to load the devtools package. The package itself contains a host of publicly available dictionaries.
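A sketch of that installation step (the repository path kbenoit/quanteda.dictionaries is an assumption based on the package name):

```r
# quanteda.dictionaries is not on CRAN, so install it from GitHub via devtools
# (repository path is an assumption)
library(devtools)
devtools::install_github("kbenoit/quanteda.dictionaries")
library(quanteda.dictionaries)
```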
The following object is masked from 'package:quanteda':
data_dictionary_LSD2015
# large movie review database of 50,000 movie reviews
load(url("https://www.dropbox.com/s/sjdfmx8ggwfda5o/data_orpus_LMRD.rda?dl=1"))

# this process is unnecessary, given 'data_corpus_LMRD' is a corpus object; but to make sure
data_corpus_LMRD <- corpus(data_corpus_LMRD)

# check the text of document 1
convert(data_corpus_LMRD[1], to = "data.frame")
doc_id
1 test/neg/0_2.txt
text
1 Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costner's character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks he's better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutcher's ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.
docnumber rating set polarity
1 0 2 test neg
Dictionary Analysis
The basic idea with a dictionary analysis is to identify a set of words that connect to a certain concept, and to count the frequency of that set of words within a document. The set of words is the dictionary; as you might quickly realize, a more appropriate name is probably thesaurus.
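As a minimal sketch of that idea (the dictionary below is invented for illustration), a quanteda dictionary is just a named list mapping a concept to its word set, and dfm_lookup() does the counting:

```r
library(quanteda)

# a toy dictionary: one concept ("happiness") mapped to its word set
happy_dict <- dictionary(list(happiness = c("happy", "joy", "delighted")))

# count how often the dictionary's words appear in a document
toks <- tokens("I was happy and full of joy, though hardly delighted by the ending.")
dfm_lookup(dfm(toks), dictionary = happy_dict)
```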
liwcalike()
There are a couple of ways to do this. First, the quanteda.dictionaries package contains the liwcalike() function, which takes a corpus or character vector and carries out an analysis, based on a provided dictionary, that mimics the pay-to-play software LIWC (Linguistic Inquiry and Word Count). The LIWC software calculates the percentage of each document that reflects a host of different characteristics. We are going to focus on positive and negative language, but keep in mind that there are lots of other dimensions that could be of interest.
# use liwcalike() to estimate sentiment using the NRC dictionary
reviewSentiment_nrc <- liwcalike(data_corpus_LMRD, data_dictionary_NRC)

names(reviewSentiment_nrc)
Corpus consisting of 26 documents and 4 docvars.
test/neg/11213_3.txt :
"I couldn't stop laughing, I caught this again on late night ..."
test/neg/5147_1.txt :
"Primary plot!Primary direction!Poor interpretation."
test/pos/10115_8.txt :
"Radio will have you laughing, crying, feeling. This story ba..."
test/pos/1049_9.txt :
"Add this little gem to your list of holiday regulars. It is ..."
test/pos/11123_8.txt :
"We enjoy a film like "Fame" because we imagine we are there ..."
test/pos/1592_10.txt :
"Morte a Venezia is one of my favorite movies. More than beau..."
[ reached max_ndoc ... 20 more documents ]
Corpus consisting of 14 documents and 4 docvars.
test/neg/3718_1.txt :
"This was truly horrible. Bad acting, bad writing, bad effect..."
test/neg/3847_1.txt :
"John Leguizamo must have been insane if he thinks this was a..."
test/neg/819_1.txt :
"it really is terrible, from start to finish you'll sit and w..."
train/neg/11201_1.txt :
"Horrible waste of time - bad acting, plot, directing. This i..."
train/neg/11699_1.txt :
"The plot of The Thinner is decidedly thin. And gross. An obe..."
train/neg/11931_2.txt :
"Olivier Gruner stars as Jacques a foreign exchange college s..."
[ reached max_ndoc ... 8 more documents ]
Of course, you might be realizing that the proportions of positive and negative words used in isolation might be a poor indicator of overall sentiment. Instead, we want a score that incorporates both. The example directly above alludes to this problem, as the description of a horror movie makes it look negative, but in reality there are also a lot of positive words in there. So let’s correct for that and see what we’ve got.
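One common fix is a polarity score that combines both counts. A toy sketch of the arithmetic in base R:

```r
# toy counts of positive and negative dictionary hits for three documents
pos <- c(10, 2, 5)
neg <- c(2, 10, 5)

# polarity in [-1, 1]: an all-positive document scores 1,
# an all-negative document scores -1, and a tie scores 0
polarity <- (pos - neg) / (pos + neg)
polarity
```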
Corpus consisting of 18 documents and 4 docvars.
test/neg/1792_3.txt :
"Aside from the horrendous acting and the ridiculous and ludi..."
test/neg/3718_1.txt :
"This was truly horrible. Bad acting, bad writing, bad effect..."
test/neg/3850_1.txt :
"This movie is pathetic in every way possible. Bad acting, ho..."
test/neg/6263_2.txt :
"Cheerleader Massacre was supposed to be the fourth installme..."
test/neg/6850_2.txt :
"Read the book, forget the movie!"
test/neg/7458_1.txt :
"Everything about this movie is bad. everything. Ridiculous 8..."
[ reached max_ndoc ... 12 more documents ]
Ah, unfortunately the same problem persists. We’ll come back to this later.
Using Dictionaries with DFMs
# create a full dfm for comparison
movieReviewDfm <- tokens(data_corpus_LMRD,
                         remove_punct = TRUE,
                         remove_symbols = TRUE,
                         remove_numbers = TRUE,
                         remove_url = TRUE,
                         split_hyphens = FALSE,
                         include_docvars = TRUE) %>%
  tokens_tolower() %>%
  dfm()

head(movieReviewDfm, 10)
Document-feature matrix of: 10 documents, 145,748 features (99.91% sparse) and 4 docvars.
features
docs once again mr costner has dragged out a movie for
test/neg/0_2.txt 1 1 1 3 1 1 1 3 1 1
test/neg/10000_4.txt 0 0 0 0 2 0 0 4 0 3
test/neg/10001_1.txt 0 0 0 0 0 0 0 6 3 2
test/neg/10002_3.txt 0 0 0 0 0 0 1 8 5 2
test/neg/10003_3.txt 0 0 0 0 0 0 0 4 0 4
test/neg/10004_2.txt 0 0 0 0 1 0 0 3 0 0
[ reached max_ndoc ... 4 more documents, reached max_nfeat ... 145,738 more features ]
dim(movieReviewDfm)
[1] 50000 145748
# convert corpus to dfm using the dictionary
movieReviewDfm_nrc <- tokens(data_corpus_LMRD,
                             remove_punct = TRUE,
                             remove_symbols = TRUE,
                             remove_numbers = TRUE,
                             remove_url = TRUE,
                             split_hyphens = FALSE,
                             include_docvars = TRUE) %>%
  tokens_tolower() %>%
  dfm() %>%
  dfm_lookup(data_dictionary_NRC)

dim(movieReviewDfm_nrc)
Note that these are counts now, rather than the percentages we got from liwcalike(). Let’s convert them to a data frame that’s useful for downstream analysis, then create a polarity measure.
df_nrc <- convert(movieReviewDfm_nrc, to = "data.frame")
names(df_nrc)
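A sketch of that polarity measure for the NRC counts, mirroring the computation used below for the General Inquirer dictionary (the final line, pulling out the reviews our score rates most positive, is illustrative):

```r
# create polarity measure for nrc: (positive - negative) / (positive + negative),
# setting polarity to 0 when a document has no hits in either category
df_nrc$polarity <- (df_nrc$positive - df_nrc$negative) /
  (df_nrc$positive + df_nrc$negative)
df_nrc$polarity[which((df_nrc$positive + df_nrc$negative) == 0)] <- 0

# inspect the reviews the score rates as most positive
head(df_nrc[order(df_nrc$polarity, decreasing = TRUE), c("doc_id", "polarity")])
```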
New York family is the last in their neighborhood to get a television set, which nearly ruins David Niven's marriage to Mitzi Gaynor. Bedroom comedy that rarely ventures into the bedroom(and nothing sexy happens there anyway). Gaynor as an actress has about as much range as an oven--she turns on, she turns off. Film's sole compensation is a supporting performance by perky Patty Duke, pre-"Miracle Worker", as Niven's daughter. She's delightful; "Happy Anniversary" is not. * from ****
Aya! If you are looking for special effects that are 10-20 years before its time, this is it. The glowing lightning bolts, fireballs, etc. look like they came from a cheesy 70's sci-fi flick. And yes, Hercules really grows; he's not being pushed on a cart closer to the camera!
First of all, I really can't understand how some people "enjoyed" this movie. It's the worst thing I have ever seen. Even the actors seem to be bored...and I think that says it all!
However, I have to give my applause to the opening credits creators - that team seems to have a really good future. That's why I recommend the big studios to watch ONLY the opening credits, and one or two special effects sequences (if they're watched outside this movie, it almost looks like a good movie).
Better luck (or judgment) next time for the producers of this, this... this "thing!".
The movie is a happy lullaby, was made to make us sleep. And that´s what we do, as we dream about the top beautiful Natasha Henstridge. No screenplay, no deep characters, nothing special. So, let´s sleep.
The first season was great - good mix of the job and the brother and friends at home. it was actually a pretty funny show.
Now it shows up again and the brother and the two hot chicks are gone -- and the whole thing revolves around the airline company. Even the old man who runs the company has gone downhill - way too over the top, where before it was perfect.
That and no more Sarah Mason - one of the best looking girls in Hollywood.
This is what happens when you let some execs get their hands on a show. You can almost see the meeting "the old man is funny, lets focus on him, make him way over the top and make it all about the airline.. it'll be a nutty version of the office!" Anyhow, no hot chicks, no watch.
I laughed at the movie! The script, the acting please don't we deserve better? But now the filming, some of the camera angles were interesting. I did enjoy the film, but it's not to be taken seriously though. I liked it. If it had a new cast and scriptwritter it would be better than all right. It's worth a look!
Well, that’s not good. Those are all pretty negative reviews but they are ranked as positive by our sentiment score. Let’s add some other dictionaries and compare.
Dictionary Comparison
The quanteda.dictionaries package provides access to a host of different dictionaries, capturing a real diversity of dimensions. For now, we’ll focus on just two dictionaries that include positive and negative categories.

# convert corpus to DFM using the General Inquirer dictionary
movieReviewDfm_geninq <- movieReviewDfm %>%
  dfm_lookup(data_dictionary_geninqposneg)

head(movieReviewDfm_geninq, 6)
# create polarity measure for geninq
df_geninq <- convert(movieReviewDfm_geninq, to = "data.frame")
df_geninq$polarity <- (df_geninq$positive - df_geninq$negative) /
  (df_geninq$positive + df_geninq$negative)
df_geninq$polarity[which((df_geninq$positive + df_geninq$negative) == 0)] <- 0

# look at first few rows
head(df_geninq)
Now that we have them all in a single data frame, it’s straightforward to see how well our different measures of polarity agree across the approaches.
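A sketch of how that combined data frame might be assembled (this assumes df_nrc and df_geninq have rows in the same document order, and that an NRC polarity score was built the same way as the General Inquirer one above):

```r
# collect each method's polarity score, keyed by document id
sent_df <- data.frame(doc_id          = df_nrc$doc_id,
                      nrc_polarity    = df_nrc$polarity,
                      geninq_polarity = df_geninq$polarity)

# a quick numeric check on agreement between the two measures
cor(sent_df$nrc_polarity, sent_df$geninq_polarity)
```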
# Plot this out. You can update this to check the look of other combinations.
ggplot(sent_df, mapping = aes(x = nrc_polarity, y = geninq_polarity)) +
  geom_point(alpha = 0.1) +
  geom_smooth() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  theme_bw()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
As the plots make clear, while the measures are strongly correlated, they are decidedly not identical to one another. We can observe substantial variance across their estimates of polarity. So which is best? Well, we could lean on the actual classifications to try to answer that. Of course, in many of the settings where we’d really like to know sentiment (tweets, news articles, speeches, and so on), we won’t know the true sentiment. In those cases, we could hand-code a random subset. If we’re doing that, though, why not just code some more and use it as a training set for a supervised learning approach?
In all, for many of our research settings, we are limited in what we can learn from these sorts of dictionary approaches unless we do substantial validation of the estimates.
Apply Dictionary within Contexts
The approach we’ve taken so far largely leverages working with DFMs. However, we might care about contextual usage. So, for instance, how is New York City treated across the corpus of movie reviews when it is discussed? To see this, we’ll first limit our corpus to just New York City related tokens (ny_words) and the window they appear within.
# tokenize corpus
tokens_LMRD <- tokens(data_corpus_LMRD, remove_punct = TRUE)

# define the context (target) words or phrases
ny_words <- c("big apple", "new york", "nyc", "ny", "new york city",
              "brooklyn", "bronx", "manhattan", "queens", "staten island")

# retain only our target tokens and their context
tokens_ny <- tokens_keep(tokens_LMRD, pattern = phrase(ny_words), window = 40)
Next, we’ll pull out the positive and negative dictionaries and look for those within our token sets.
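A sketch of that step (the choice of the NRC dictionary here is an assumption; any dictionary with positive and negative keys would work):

```r
# score only the NY-context windows kept above, using just the
# positive and negative keys of the NRC dictionary
dfm_ny <- tokens_ny %>%
  dfm() %>%
  dfm_lookup(data_dictionary_NRC[c("positive", "negative")])

head(dfm_ny, 10)
```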
Ok. As you can see above, we have some positive and negative words in one of our movie reviews, but many others did not feature any emotionally valenced words. We’ll drop those from our analysis, then take a look at the distribution.
# convert to data frame
mat_ny <- convert(dfm_ny, to = "data.frame")

# drop if both features are 0
mat_ny <- mat_ny[-which((mat_ny$negative + mat_ny$positive) == 0), ]

# print a little summary info
paste("We have ", nrow(mat_ny),
      " reviews that mention positive or negative words in the context of New York City terms.",
      sep = "")
[1] "We have 1442 reviews that mention positive or negative words in the context of New York City terms."
In addition to counting frequencies, we can also assign values or weights to words, indicating how strongly positive or negative each word is.
For example, we first construct a toy corpus:
df <- c('I am happy and kind of sad',
        'sad is sad, happy is good')
df_corpus <- corpus(df)
df_dfm <- tokens(df_corpus,
                 remove_punct = TRUE,
                 remove_symbols = TRUE) %>%
  tokens_tolower() %>%
  dfm()

head(df_dfm)
Document-feature matrix of: 2 documents, 9 features (38.89% sparse) and 0 docvars.
features
docs i am happy and kind of sad is good
text1 1 1 1 1 1 1 1 0 0
text2 0 0 1 0 0 0 2 2 1
# convert dfm to data frame for further steps
df2 <- convert(df_dfm, to = "data.frame")

# reshape to long format: one row per (document, word) pair
library(tidyr)
df3 <- pivot_longer(df2, !doc_id, names_to = "word")
Now construct your lexicon dictionary with values/weights
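The AFINN lexicon is one option: it assigns each word an integer value from -5 (very negative) to +5 (very positive). A sketch, assuming the tidytext and textdata packages (AFINN prompts a one-time download on first use):

```r
library(tidytext)
library(dplyr)

afinn_dict <- get_sentiments("afinn")   # columns: word, value

# one way to use the weights: join them onto the long-format counts
# from above and sum value * count per document
df3 %>%
  inner_join(afinn_dict, by = "word", suffix = c("_count", "_weight")) %>%
  group_by(doc_id) %>%
  summarise(weighted_sentiment = sum(value_count * value_weight))
```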
ggplot(afinn_dict) +
  geom_bar(aes(x = value)) +
  theme_bw()
loughran
The Loughran-McDonald sentiment lexicon is designed for use with financial documents. It labels words with six sentiments important in financial contexts: constraining, litigious, negative, positive, superfluous, and uncertainty. A word may belong to multiple sentiments.
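To inspect those labels directly, one route (an assumption; requires the tidytext and textdata packages, with a one-time download prompt) is:

```r
library(tidytext)

loughran_dict <- get_sentiments("loughran")  # columns: word, sentiment
table(loughran_dict$sentiment)               # how many words carry each label
```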
ggplot(bing_dict) +
  geom_bar(aes(x = sentiment)) +
  theme_bw()
nrc
The NRC lexicon categorizes words into eight emotions: anger, fear, anticipation, trust, surprise, sadness, joy, and disgust; and two sentiments: negative and positive. One word can belong to multiple categories.
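As with the other lexicons, one way to load and inspect it (an assumption; requires tidytext and textdata, with a one-time download prompt):

```r
library(tidytext)

nrc_dict <- get_sentiments("nrc")  # columns: word, sentiment
table(nrc_dict$sentiment)          # word counts per emotion/sentiment category
```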