Tutorial 2 Text as Data

This is our second tutorial for running R. In this tutorial, we’ll start working with texts. As we discussed in class, there’s wide variation in what counts as a “text”, from single sentences or tweets to entire novels like War & Peace.

In this tutorial, you’ll learn how to start working with text-as-data in R. That includes storage formats, manipulation, counting, subsetting, and editing of the texts. There’s a lot we can do once we have the texts in R!

By the end of this tutorial, you should be familiar with the following:

1. Character vectors: c()

2. Corpus: corpus(), summary()

3. Metadata: docvars()

4. Subsetting corpora: corpus_subset()

5. Number of documents: ndoc()

6. Tokenization: tokens()

7. Contextual analysis: kwic()

8. N-grams: tokens_ngrams()

Front-end Matters

This week, we’ll start by looking at the Harry Potter series. First things first, we need to load the packages for today’s notebook. If you have not installed them yet, do that first with install.packages().

library(tidytext)
library(plyr)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::arrange()   masks plyr::arrange()
✖ purrr::compact()   masks plyr::compact()
✖ dplyr::count()     masks plyr::count()
✖ dplyr::desc()      masks plyr::desc()
✖ dplyr::failwith()  masks plyr::failwith()
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::id()        masks plyr::id()
✖ dplyr::lag()       masks stats::lag()
✖ dplyr::mutate()    masks plyr::mutate()
✖ dplyr::rename()    masks plyr::rename()
✖ dplyr::summarise() masks plyr::summarise()
✖ dplyr::summarize() masks plyr::summarize()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(quanteda)
Package version: 4.0.2
Unicode version: 14.0
ICU version: 71.1
Parallel computing: disabled
See https://quanteda.io for tutorials and examples.

Reading all files in a directory

Unfortunately, the harrypotter package that we used in previous years, which contains the text of all seven Harry Potter books, no longer works (it seems R packages can disappear faster than a disarming spell!). So before we move on to text analysis, let’s read in our data. Please visit my GitHub HarryPotter repository, download all the .rda files, and save them in the same folder.

Make sure to replace the folder path with the correct path on your own device.

# Define the folder containing the .rda files. 
folder <- "/Users/mpang/Dropbox/Teaching Resources/DACSS_TAD/HarryPotter"

# Get the list of all .rda files in the folder
rda_files <- list.files(folder, pattern = "\\.rda$", full.names = TRUE)

# Load all .rda files into the environment
lapply(rda_files, load, .GlobalEnv)
[[1]]
[1] "chamber_of_secrets"

[[2]]
[1] "deathly_hallows"

[[3]]
[1] "goblet_of_fire"

[[4]]
[1] "half_blood_prince"

[[5]]
[1] "order_of_the_phoenix"

[[6]]
[1] "philosophers_stone"

[[7]]
[1] "prisoner_of_azkaban"
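
If you would rather not load everything straight into the global environment, the same pattern works with a separate environment. The following is a minimal, self-contained sketch using toy data: the folder name, the demo_ objects, and books_env are made up for illustration and are not part of the Harry Potter files.

```r
# Create a temporary folder with two toy .rda files (hypothetical data)
demo_folder <- file.path(tempdir(), "rda_demo")
dir.create(demo_folder, showWarnings = FALSE)

demo_book_a <- c("Chapter one text.", "Chapter two text.")
demo_book_b <- c("A single chapter.")
save(demo_book_a, file = file.path(demo_folder, "demo_book_a.rda"))
save(demo_book_b, file = file.path(demo_folder, "demo_book_b.rda"))

# List the .rda files, exactly as in the tutorial code
demo_files <- list.files(demo_folder, pattern = "\\.rda$", full.names = TRUE)

# Load into a dedicated environment instead of .GlobalEnv;
# load() returns the names of the objects it created
books_env <- new.env()
loaded_names <- unlist(lapply(demo_files, load, envir = books_env))

# Gather the loaded objects into one named list
books <- mget(loaded_names, envir = books_env)
names(books)     # object names, one per file
lengths(books)   # number of elements (chapters) per object
```

Keeping the books in a named list makes it easier to loop over them later, at the cost of having to write books$demo_book_a instead of just the object name.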

Character Vectors

Now you should see the seven books in your workspace/environment. These are:

1. philosophers_stone: Harry Potter and the Philosopher’s Stone (1997)

2. chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)

3. prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)

4. goblet_of_fire: Harry Potter and the Goblet of Fire (2000)

5. order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)

6. half_blood_prince: Harry Potter and the Half-Blood Prince (2005)

7. deathly_hallows: Harry Potter and the Deathly Hallows (2007)

Each is stored as a character vector. A character vector is a collection of elements, where each element is a string. You could create your own character vector as something like:

my_char_vec <- c("Rosemary's", "favorite","horror movie","director", "is", "James Wan.")
print(my_char_vec)
[1] "Rosemary's"   "favorite"     "horror movie" "director"     "is"          
[6] "James Wan."  

Each element of a vector has an index, starting at 1 and counting from left to right. You can access particular elements of the character vector using that index. So, if I wanted the third element from the above, I could type:

my_char_vec[3]
[1] "horror movie"
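
Indexing is not limited to one element at a time. As a short sketch using the same vector (recreated here so the block runs on its own), you can pass several positions, a range, or a negative index:

```r
my_char_vec <- c("Rosemary's", "favorite", "horror movie", "director", "is", "James Wan.")

length(my_char_vec)     # how many elements the vector has
my_char_vec[c(1, 3)]    # the first and third elements
my_char_vec[2:4]        # elements two through four
my_char_vec[-1]         # everything except the first element
```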

As you can see in the example, an element doesn’t have to be a single word. Each element can be as long as you like. For instance, a character vector could also be:

my_char_vec2 <- c("Rosemary's favorite horror movie director is James Wan.", "She has a James Wan figure in her office.", "She likes the Conjuring series the most.")
print(my_char_vec2)
[1] "Rosemary's favorite horror movie director is James Wan."
[2] "She has a James Wan figure in her office."              
[3] "She likes the Conjuring series the most."               
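
It helps to keep two kinds of “length” apart: the number of elements in the vector and the size of each element. A minimal sketch with the same vector follows; splitting on single spaces is only a rough stand-in for proper tokenization, which comes later in the tutorial.

```r
my_char_vec2 <- c("Rosemary's favorite horror movie director is James Wan.",
                  "She has a James Wan figure in her office.",
                  "She likes the Conjuring series the most.")

length(my_char_vec2)   # number of elements (here, sentences)
nchar(my_char_vec2)    # number of characters in each element

# Rough word counts: split each element on spaces and count the pieces
word_counts <- lengths(strsplit(my_char_vec2, " "))
word_counts
```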

The storage of the Harry Potter books follows this intuition. Each book is a character vector, and each chapter is an element in that book’s character vector. So, for philosophers_stone, we can see the first chapter via:

philosophers_stone[1]
[1] "THE BOY WHO LIVED  Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much. They were the last people you'd expect to be involved in anything strange or mysterious, because they just didn't hold with such nonsense.  Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, beefy man with hardly any neck, although he did have a very large mustache. Mrs. Dursley was thin and blonde and had nearly twice the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbors. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere.  The Dursleys had everything they wanted, but they also had a secret, and their greatest fear was that somebody would discover it. They didn't think they could bear it if anyone found out about the Potters. Mrs. Potter was Mrs. Dursley's sister, but they hadn't met for several years; in fact, Mrs. Dursley pretended she didn't have a sister, because her sister and her good-for-nothing husband were as unDursleyish as it was possible to be. The Dursleys shuddered to think what the neighbors would say if the Potters arrived in the street. The Dursleys knew that the Potters had a small son, too, but they had never even seen him. This boy was another good reason for keeping the Potters away; they didn't want Dudley mixing with a child like that.  When Mr. and Mrs. Dursley woke up on the dull, gray Tuesday our story starts, there was nothing about the cloudy sky outside to suggest that strange and mysterious things would soon be happening all over the country. Mr. Dursley hummed as he picked out his most boring tie for work, and Mrs. Dursley gossiped away happily as she wrestled a screaming Dudley into his high chair.  None of them noticed a large, tawny owl flutter past the window.  At half past eight, Mr. 
Dursley picked up his briefcase, pecked Mrs. Dursley on the cheek, and tried to kiss Dudley good-bye but missed, because Dudley was now having a tantrum and throwing his cereal at the walls. `Little tyke,` chortled Mr. Dursley as he left the house. He got into his car and backed out of number four's drive.  It was on the corner of the street that he noticed the first sign of something peculiar -- a cat reading a map. For a second, Mr. Dursley didn't realize what he had seen -- then he jerked his head around to look again. There was a tabby cat standing on the corner of Privet Drive, but there wasn't a map in sight. What could he have been thinking of? It must have been a trick of the light. Mr. Dursley blinked and stared at the cat. It stared back. As Mr. Dursley drove around the corner and up the road, he watched the cat in his mirror. It was now reading the sign that said Privet Drive -- no, looking at the sign; cats couldn't read maps or signs. Mr. Dursley gave himself a little shake and put the cat out of his mind. As he drove toward town he thought of nothing except a large order of drills he was hoping to get that day.  But on the edge of town, drills were driven out of his mind by something else. As he sat in the usual morning traffic jam, he couldn't help noticing that there seemed to be a lot of strangely dressed people about. People in cloaks. Mr. Dursley couldn't bear people who dressed in funny clothes -- the getups you saw on young people! He supposed this was some stupid new fashion. He drummed his fingers on the steering wheel and his eyes fell on a huddle of these weirdos standing quite close by. They were whispering excitedly together. Mr. Dursley was enraged to see that a couple of them weren't young at all; why, that man had to be older than he was, and wearing an emerald-green cloak! The nerve of him! But then it struck Mr. Dursley that this was probably some silly stunt -- these people were obviously collecting for something... 
yes, that would be it. The traffic moved on and a few minutes later, Mr. Dursley arrived in the Grunnings parking lot, his mind back on drills.  Mr. Dursley always sat with his back to the window in his office on the ninth floor. If he hadn't, he might have found it harder to concentrate on drills that morning. He didn't see the owls swoop ing past in broad daylight, though people down in the street did; they pointed and gazed open- mouthed as owl after owl sped overhead. Most of them had never seen an owl even at nighttime. Mr. Dursley, however, had a perfectly normal, owl-free morning. He yelled at five different people. He made several important telephone calls and shouted a bit more. He was in a very good mood until lunchtime, when he thought he'd stretch his legs and walk across the road to buy himself a bun from the bakery. He'd forgotten all about the people in cloaks until he passed a group of them next to the baker's. He eyed them angrily as he passed. He didn't know why, but they made him uneasy. This bunch were whispering excitedly, too, and he couldn't see a single collecting tin. It was on his way back past them, clutching a large doughnut in a bag, that he caught a few words of what they were saying.   `The Potters, that's right, that's what I heard yes, their son, Harry`  Mr. Dursley stopped dead. Fear flooded him. He looked back at the whisperers as if he wanted to say something to them, but thought better of it.  He dashed back across the road, hurried up to his office, snapped at his secretary not to disturb him, seized his telephone, and had almost finished dialing his home number when he changed his mind. He put the receiver back down and stroked his mustache, thinking... no, he was being stupid. Potter wasn't such an unusual name. He was sure there were lots of people called Potter who had a son called Harry. Come to think of it, he wasn't even sure his nephew was called Harry. He'd never even seen the boy. It might have been Harvey. Or Harold. 
There was no point in worrying Mrs. Dursley; she always got so upset at any mention of her sister. He didn't blame her -- if he'd had a sister like that... but all the same, those people in cloaks...  He found it a lot harder to concentrate on drills that afternoon and when he left the building at five o'clock, he was still so worried that he walked straight into someone just outside the door.  `Sorry,` he grunted, as the tiny old man stumbled and almost fell. It was a few seconds before Mr. Dursley realized that the man was wearing a violet cloak. He didn't seem at all upset at being almost knocked to the ground. On the contrary, his face split into a wide smile and he said in a squeaky voice that made passersby stare, `Don't be sorry, my dear sir, for nothing could upset me today! Rejoice, for You-Know-Who has gone at last! Even Muggles like yourself should be celebrating, this happy, happy day!`  And the old man hugged Mr. Dursley around the middle and walked off.  Mr. Dursley stood rooted to the spot. He had been hugged by a complete stranger. He also thought he had been called a Muggle, whatever that was. He was rattled. He hurried to his car and set off for home, hoping he was imagining things, which he had never hoped before, because he didn't approve of imagination.  As he pulled into the driveway of number four, the first thing he saw -- and it didn't improve his mood -- was the tabby cat he'd spotted that morning. It was now sitting on his garden wall. He was sure it was the same one; it had the same markings around its eyes. `Shoo!` said Mr. Dursley loudly. The cat didn't move. It just gave him a stern look. Was this normal cat behavior? Mr. Dursley wondered. Trying to pull himself together, he let himself into the house. He was still determined not to mention anything to his wife. Mrs. Dursley had had a nice, normal day. She told him over dinner all about Mrs. Next Door's problems with her daughter and how Dudley had learned a new word (`Won't!`). Mr. 
Dursley tried to act normally. When Dudley had been put to bed, he went into the living room in time to catch the last report on the evening news:  `And finally, bird-watchers everywhere have reported that the nation's owls have been behaving very unusually today. Although owls normally hunt at night and are hardly ever seen in daylight, there have been hundreds of sightings of these birds flying in every direction since sunrise. Experts are unable to explain why the owls have suddenly changed their sleeping pattern.` The newscaster allowed himself a grin. `Most mysterious. And now, over to Jim McGuffin with the weather. Going to be any more showers of owls tonight, Jim?`  `Well, Ted,` said the weatherman, `I don't know about that, but it's not only the owls that have been acting oddly today. Viewers as far apart as Kent, Yorkshire, and Dundee have been phoning in to tell me that instead of the rain I promised yesterday, they've had a downpour of shooting stars! Perhaps people have been celebrating Bonfire Night early -- it's not until next week, folks! But I can promise a wet night tonight.`  Mr. Dursley sat frozen in his armchair. Shooting stars all over Britain? Owls flying by daylight? Mysterious people in cloaks all over the place? And a whisper, a whisper about the Potters...  Mrs. Dursley came into the living room carrying two cups of tea. It was no good. He'd have to say something to her. He cleared his throat nervously. `Er -- Petunia, dear -- you haven't heard from your sister lately, have you?` As he had expected, Mrs. Dursley looked shocked and angry. After all, they normally pretended she didn't have a sister.  `No,` she said sharply. `Why?`  `Funny stuff on the news,` Mr. Dursley mumbled. `Owls... shooting stars... and there were a lot of funny-looking people in town today...`  `So?` snapped Mrs. Dursley.  `Well, I just thought... maybe... it was something to do with... you know... her crowd.` Mrs. Dursley sipped her tea through pursed lips. Mr. 
Dursley wondered whether he dared tell her he'd heard the name `Potter.` He decided he didn't dare. Instead he said, as casually as he could, `Their son -- he'd be about Dudley's age now, wouldn't he?`  `I suppose so,` said Mrs. Dursley stiffly.  `What's his name again? Howard, isn't it?`  `Harry. Nasty, common name, if you ask me.`  `Oh, yes,` said Mr. Dursley, his heart sinking horribly. `Yes, I quite agree.` He didn't say another word on the subject as they went upstairs to bed. While Mrs. Dursley was in the bathroom, Mr. Dursley crept to the bedroom window and peered down into the front garden. The cat was still there. It was staring down Privet Drive as though it were waiting for something.  Was he imagining things? Could all this have anything to do with the Potters? If it did... if it got out that they were related to a pair of -- well, he didn't think he could bear it.  The Dursleys got into bed. Mrs. Dursley fell asleep quickly but Mr. Dursley lay awake, turning it all over in his mind. His last, comforting thought before he fell asleep was that even if the Potters were involved, there was no reason for them to come near him and Mrs. Dursley. The Potters knew very well what he and Petunia thought about them and their kind.... He couldn't see how he and Petunia could get mixed up in anything that might be going on -- he yawned and turned over -- it couldn't affect them....  How very wrong he was.  Mr. Dursley might have been drifting into an uneasy sleep, but the cat on the wall outside was showing no sign of sleepiness. It was sitting as still as a statue, its eyes fixed unblinkingly on the far corner of Privet Drive. It didn't so much as quiver when a car door slammed on the next street, nor when two owls swooped overhead. In fact, it was nearly midnight before the cat moved at all.  A man appeared on the corner the cat had been watching, appeared so suddenly and silently you'd have thought he'd just popped out of the ground. 
The cat's tail twitched and its eyes narrowed.  Nothing like this man had ever been seen on Privet Drive. He was tall, thin, and very old, judging by the silver of his hair and beard, which were both long enough to tuck into his belt. He was wearing long robes, a purple cloak that swept the ground, and high-heeled, buckled boots. His blue eyes were light, bright, and sparkling behind half-moon spectacles and his nose was very long and crooked, as though it had been broken at least twice. This man's name was Albus Dumbledore.  Albus Dumbledore didn't seem to realize that he had just arrived in a street where everything from his name to his boots was unwelcome. He was busy rummaging in his cloak, looking for something. But he did seem to realize he was being watched, because he looked up suddenly at the cat, which was still staring at him from the other end of the street. For some reason, the sight of the cat seemed to amuse him. He chuckled and muttered, `I should have known.` He found what he was looking for in his inside pocket. It seemed to be a silver cigarette lighter. He flicked it open, held it up in the air, and clicked it. The nearest street lamp went out with a little pop. He clicked it again -- the next lamp flickered into darkness. Twelve times he clicked the Put-Outer, until the only lights left on the whole street were two tiny pinpricks in the distance, which were the eyes of the cat watching him. If anyone looked out of their window now, even beady-eyed Mrs. Dursley, they wouldn't be able to see anything that was happening down on the pavement. Dumbledore slipped the Put-Outer back inside his cloak and set off down the street toward number four, where he sat down on the wall next to the cat. He didn't look at it, but after a moment he spoke to it.  `Fancy seeing you here, Professor McGonagall.`  He turned to smile at the tabby, but it had gone. 
Instead he was smiling at a rather severe-looking woman who was wearing square glasses exactly the shape of the markings the cat had had around its eyes. She, too, was wearing a cloak, an emerald one. Her black hair was drawn into a tight bun. She looked distinctly ruffled.  `How did you know it was me?` she asked.  `My dear Professor, I 've never seen a cat sit so stiffly.`  `You'd be stiff if you'd been sitting on a brick wall all day,` said Professor McGonagall.  `All day? When you could have been celebrating? I must have passed a dozen feasts and parties on my way here.`  Professor McGonagall sniffed angrily. `Oh yes, everyone's celebrating, all right,` she said impatiently. `You'd think they'd be a bit more careful, but no -- even the Muggles have noticed something's going on. It was on their news.` She jerked her head back at the Dursleys' dark living-room window. `I heard it. Flocks of owls... shooting stars.... Well, they're not completely stupid. They were bound to notice something. Shooting stars down in Kent -- I'll bet that was Dedalus Diggle. He never had much sense.`  `You can't blame them,` said Dumbledore gently. `We've had precious little to celebrate for eleven years.`  `I know that,` said Professor McGonagall irritably. `But that's no reason to lose our heads. People are being downright careless, out on the streets in broad daylight, not even dressed in Muggle clothes, swapping rumors.`  She threw a sharp, sideways glance at Dumbledore here, as though hoping he was going to tell her something, but he didn't, so she went on. `A fine thing it would be if, on the very day YouKnow-Who seems to have disappeared at last, the Muggles found out about us all. I suppose he really has gone, Dumbledore?` `It certainly seems so,` said Dumbledore. `We have much to be thankful for. Would you care for a lemon drop?`  `A what?` `A lemon drop. 
They're a kind of Muggle sweet I'm rather fond of`  `No, thank you,` said Professor McGonagall coldly, as though she didn't think this was the moment for lemon drops. `As I say, even if You-Know-Who has gone -`  `My dear Professor, surely a sensible person like yourself can call him by his name? All this 'You- Know-Who' nonsense -- for eleven years I have been trying to persuade people to call him by his proper name: Voldemort.` Professor McGonagall flinched, but Dumbledore, who was unsticking two lemon drops, seemed not to notice. `It all gets so confusing if we keep saying 'You-Know-Who.' I have never seen any reason to be frightened of saying Voldemort's name.  `I know you haven 't, said Professor McGonagall, sounding half exasperated, half admiring. `But you're different. Everyone knows you're the only one You-Know- oh, all right, Voldemort, was frightened of.`  `You flatter me,` said Dumbledore calmly. `Voldemort had powers I will never have.`  `Only because you're too -- well -- noble to use them.`  `It's lucky it's dark. I haven't blushed so much since Madam Pomfrey told me she liked my new earmuffs.`  Professor McGonagall shot a sharp look at Dumbledore and said, `The owls are nothing next to the rumors that are flying around. You know what everyone's saying? About why he's disappeared? About what finally stopped him?`  It seemed that Professor McGonagall had reached the point she was most anxious to discuss, the real reason she had been waiting on a cold, hard wall all day, for neither as a cat nor as a woman had she fixed Dumbledore with such a piercing stare as she did now. It was plain that whatever `everyone` was saying, she was not going to believe it until Dumbledore told her it was true. Dumbledore, however, was choosing another lemon drop and did not answer.  `What they're saying,` she pressed on, `is that last night Voldemort turned up in Godric's Hollow. He went to find the Potters. 
The rumor is that Lily and James Potter are -- are -- that they're -- dead. ` Dumbledore bowed his head. Professor McGonagall gasped.  `Lily and James... I can't believe it... I didn't want to believe it... Oh, Albus...` Dumbledore reached out and patted her on the shoulder. `I know... I know...` he said heavily.  Professor McGonagall's voice trembled as she went on. `That's not all. They're saying he tried to kill the Potter's son, Harry. But -- he couldn't. He couldn't kill that little boy. No one knows why, or how, but they're saying that when he couldn't kill Harry Potter, Voldemort's power somehow broke -- and that's why he's gone.  Dumbledore nodded glumly.  `It's -- it's true?` faltered Professor McGonagall. `After all he's done... all the people he's killed... he couldn't kill a little boy? It's just astounding... of all the things to stop him... but how in the name of heaven did Harry survive?`  `We can only guess,` said Dumbledore. `We may never know.`  Professor McGonagall pulled out a lace handkerchief and dabbed at her eyes beneath her spectacles. Dumbledore gave a great sniff as he took a golden watch from his pocket and examined it. It was a very odd watch. It had twelve hands but no numbers; instead, little planets were moving around the edge. It must have made sense to Dumbledore, though, because he put it back in his pocket and said, `Hagrid's late. I suppose it was he who told you I'd be here, by the way?`  `Yes,` said Professor McGonagall. `And I don't suppose you're going to tell me why you're here, of all places?`  `I've come to bring Harry to his aunt and uncle. They're the only family he has left now.`  `You don't mean -- you can't mean the people who live here?` cried Professor McGonagall, jumping to her feet and pointing at number four. `Dumbledore -- you can't. I've been watching them all day. You couldn't find two people who are less like us. 
And they've got this son -- I saw him kicking his mother all the way up the street, screaming for sweets. Harry Potter come and live here!`  `It's the best place for him,` said Dumbledore firmly. `His aunt and uncle will be able to explain everything to him when he's older. I've written them a letter.`  `A letter?` repeated Professor McGonagall faintly, sitting back down on the wall. `Really, Dumbledore, you think you can explain all this in a letter? These people will never understand him! He'll be famous -- a legend -- I wouldn't be surprised if today was known as Harry Potter day in the future -- there will be books written about Harry -- every child in our world will know his name!`  `Exactly,` said Dumbledore, looking very seriously over the top of his half-moon glasses. `It would be enough to turn any boy's head. Famous before he can walk and talk! Famous for something he won't even remember! CarA you see how much better off he'll be, growing up away from all that until he's ready to take it?` Professor McGonagall opened her mouth, changed her mind, swallowed, and then said, `Yes -- yes, you're right, of course. But how is the boy getting here, Dumbledore?` She eyed his cloak suddenly as though she thought he might be hiding Harry underneath it.  `Hagrid's bringing him.` `You think it -- wise -- to trust Hagrid with something as important as this?`  I would trust Hagrid with my life,` said Dumbledore.  `I'm not saying his heart isn't in the right place,` said Professor McGonagall grudgingly, `but you can't pretend he's not careless. He does tend to -- what was that?`  A low rumbling sound had broken the silence around them. It grew steadily louder as they looked up and down the street for some sign of a headlight; it swelled to a roar as they both looked up at the sky -- and a huge motorcycle fell out of the air and landed on the road in front of them.  If the motorcycle was huge, it was nothing to the man sitting astride it. 
He was almost twice as tall as a normal man and at least five times as wide. He looked simply too big to be allowed, and so wild - long tangles of bushy black hair and beard hid most of his face, he had hands the size of trash can lids, and his feet in their leather boots were like baby dolphins. In his vast, muscular arms he was holding a bundle of blankets.  `Hagrid,` said Dumbledore, sounding relieved. `At last. And where did you get that motorcycle?`  `Borrowed it, Professor Dumbledore, sit,` said the giant, climbing carefully off the motorcycle as he spoke. `Young Sirius Black lent it to me. I've got him, sir.`  `No problems, were there?`  `No, sir -- house was almost destroyed, but I got him out all right before the Muggles started swarmin' around. He fell asleep as we was flyin' over Bristol.`  Dumbledore and Professor McGonagall bent forward over the bundle of blankets. Inside, just visible, was a baby boy, fast asleep. Under a tuft of jet-black hair over his forehead they could see a curiously shaped cut, like a bolt of lightning.  `Is that where -?` whispered Professor McGonagall.  `Yes,` said Dumbledore. `He'll have that scar forever.`  `Couldn't you do something about it, Dumbledore?`  `Even if I could, I wouldn't. Scars can come in handy. I have one myself above my left knee that is a perfect map of the London Underground. Well -- give him here, Hagrid -- we'd better get this over with.`  Dumbledore took Harry in his arms and turned toward the Dursleys' house.  `Could I -- could I say good-bye to him, sir?` asked Hagrid. He bent his great, shaggy head over Harry and gave him what must have been a very scratchy, whiskery kiss. Then, suddenly, Hagrid let out a howl like a wounded dog.  `Shhh!` hissed Professor McGonagall, `you'll wake the Muggles!`  `S-s-sorry,` sobbed Hagrid, taking out a large, spotted handkerchief and burying his face in it. 
`But I c-c-can't stand it -- Lily an' James dead -- an' poor little Harry off ter live with Muggles -`  `Yes, yes, it's all very sad, but get a grip on yourself, Hagrid, or we'll be found,` Professor McGonagall whispered, patting Hagrid gingerly on the arm as Dumbledore stepped over the low garden wall and walked to the front door. He laid Harry gently on the doorstep, took a letter out of his cloak, tucked it inside Harry's blankets, and then came back to the other two. For a full minute the three of them stood and looked at the little bundle; Hagrid's shoulders shook, Professor McGonagall blinked furiously, and the twinkling light that usually shone from Dumbledore's eyes seemed to have gone out. `Well,` said Dumbledore finally, `that's that. We've no business staying here. We may as well go and join the celebrations.`  `Yeah,` said Hagrid in a very muffled voice, `I'll be takin' Sirius his bike back. G'night, Professor McGonagall -- Professor Dumbledore, sir.`  Wiping his streaming eyes on his jacket sleeve, Hagrid swung himself onto the motorcycle and kicked the engine into life; with a roar it rose into the air and off into the night.  `I shall see you soon, I expect, Professor McGonagall,` said Dumbledore, nodding to her. Professor McGonagall blew her nose in reply.  Dumbledore turned and walked back down the street. On the corner he stopped and took out the silver Put-Outer. He clicked it once, and twelve balls of light sped back to their street lamps so that Privet Drive glowed suddenly orange and he could make out a tabby cat slinking around the corner at the other end of the street. He could just see the bundle of blankets on the step of number four.  `Good luck, Harry,` he murmured. He turned on his heel and with a swish of his cloak, he was gone.  A breeze ruffled the neat hedges of Privet Drive, which lay silent and tidy under the inky sky, the very last place you would expect astonishing things to happen. 
Harry Potter rolled over inside his blankets without waking up. One small hand closed on the letter beside him and he slept on, not knowing he was special, not knowing he was famous, not knowing he would be woken in a few hours' time by Mrs. Dursley's scream as she opened the front door to put out the milk bottles, nor that he would spend the next few weeks being prodded and pinched by his cousin Dudley... He couldn't know that at this very moment, people meeting in secret all over the country were holding up their glasses and saying in hushed voices: `To Harry Potter -- the boy who lived!"
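
Printing a whole chapter floods the console, so it is often handier to look at a preview. The sketch below uses a toy two-element vector (toy_book is made up for illustration); the same calls should work on philosophers_stone once it is loaded.

```r
# Toy "book": two long elements built by repeating a sentence
toy_book <- c(
  paste(rep("Mr. and Mrs. Dursley were proud to say that they were perfectly normal.", 50),
        collapse = " "),
  paste(rep("The Dursleys had everything they wanted, but they also had a secret.", 50),
        collapse = " ")
)

length(toy_book)                    # number of elements ("chapters")
nchar(toy_book)                     # characters per element
preview <- substr(toy_book, 1, 60)  # first 60 characters of each element
preview
```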

Corpus

While this is interesting, we want something that’s more straightforward to work with. Therefore, we are going to convert the character vectors to a corpus. A corpus is a stored collection of texts; we can also store corpus metadata as a data frame associated with the corpus. This is particularly helpful when we have document-level covariates that we might want to use in analysis of the texts.

philosophers_stone_corpus <- corpus(philosophers_stone)
philosophers_stone_summary <- summary(philosophers_stone_corpus)
philosophers_stone_summary
Corpus consisting of 17 documents, showing 17 documents:

   Text Types Tokens Sentences
  text1  1271   5693       350
  text2  1066   4154       237
  text3  1226   4656       297
  text4  1193   4832       322
  text5  1818   8446       563
  text6  1563   8016       566
  text7  1374   5487       351
  text8  1095   3608       198
  text9  1422   6195       411
 text10  1293   5237       334
 text11  1110   4215       277
 text12  1509   6790       447
 text13  1076   3953       262
 text14  1110   4394       308
 text15  1385   6486       459
 text16  1581   8357       591
 text17  1489   7172       506
# how do we access "Tokens" in the summary?
philosophers_stone_summary$Tokens
 [1] 5693 4154 4656 4832 8446 8016 5487 3608 6195 5237 4215 6790 3953 4394 6486
[16] 8357 7172
# Which chapters have fewer than 5,000 tokens?
which(philosophers_stone_summary$Tokens < 5000)
[1]  2  3  4  8 11 13 14

Notice that each element of the character vector has been treated as a separate text; that is, each chapter is its own document. The summary() function provides some basic statistics on each chapter. Text is an automatically created unique identifier for each text, Types is the number of unique words/tokens in the text, Tokens is the total number of words/tokens in the text (i.e., the length of the chapter), and Sentences is the number of sentences in the chapter. Because summary() tokenizes with the default settings, punctuation marks are counted as tokens here.
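To make the difference between Types and Tokens concrete, here is a minimal sketch on two made-up "chapters" (the object names `toy_texts`, `toy_corpus`, and `toy_tokens` are illustrative and not part of the tutorial's data). It counts tokens and types directly with ntoken() and ntype() on a tokens object.

```r
library(quanteda)

# two toy documents standing in for chapters (hypothetical data)
toy_texts <- c(ch1 = "the cat saw the dog",
               ch2 = "a bird sang")

# each element of the character vector becomes one document
toy_corpus <- corpus(toy_texts)
toy_tokens <- tokens(toy_corpus)

# total number of tokens per document
ntoken(toy_tokens)

# number of unique tokens (types) per document
ntype(toy_tokens)
```

In ch1 the word "the" appears twice, so that document has five tokens but only four types.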

Metadata

For each book, we don’t have much in the way of metadata. However, this summary gives us a start and is something we can use to add metadata to the corpus we’ve created.

# check for metadata; shouldn't see any
docvars(philosophers_stone_corpus)
data frame with 0 columns and 17 rows
# add an indicator for the book; this will be useful later when we add all the books together into a single corpus
philosophers_stone_summary$book <- "Philosopher's Stone"
philosophers_stone_summary
Corpus consisting of 17 documents, showing 17 documents:

   Text Types Tokens Sentences                book
  text1  1271   5693       350 Philosopher's Stone
  text2  1066   4154       237 Philosopher's Stone
  text3  1226   4656       297 Philosopher's Stone
  text4  1193   4832       322 Philosopher's Stone
  text5  1818   8446       563 Philosopher's Stone
  text6  1563   8016       566 Philosopher's Stone
  text7  1374   5487       351 Philosopher's Stone
  text8  1095   3608       198 Philosopher's Stone
  text9  1422   6195       411 Philosopher's Stone
 text10  1293   5237       334 Philosopher's Stone
 text11  1110   4215       277 Philosopher's Stone
 text12  1509   6790       447 Philosopher's Stone
 text13  1076   3953       262 Philosopher's Stone
 text14  1110   4394       308 Philosopher's Stone
 text15  1385   6486       459 Philosopher's Stone
 text16  1581   8357       591 Philosopher's Stone
 text17  1489   7172       506 Philosopher's Stone
# create a chapter indicator
philosophers_stone_summary$chapter <- as.numeric(str_extract(philosophers_stone_summary$Text, "[0-9]+"))
philosophers_stone_summary
Corpus consisting of 17 documents, showing 17 documents:

   Text Types Tokens Sentences                book chapter
  text1  1271   5693       350 Philosopher's Stone       1
  text2  1066   4154       237 Philosopher's Stone       2
  text3  1226   4656       297 Philosopher's Stone       3
  text4  1193   4832       322 Philosopher's Stone       4
  text5  1818   8446       563 Philosopher's Stone       5
  text6  1563   8016       566 Philosopher's Stone       6
  text7  1374   5487       351 Philosopher's Stone       7
  text8  1095   3608       198 Philosopher's Stone       8
  text9  1422   6195       411 Philosopher's Stone       9
 text10  1293   5237       334 Philosopher's Stone      10
 text11  1110   4215       277 Philosopher's Stone      11
 text12  1509   6790       447 Philosopher's Stone      12
 text13  1076   3953       262 Philosopher's Stone      13
 text14  1110   4394       308 Philosopher's Stone      14
 text15  1385   6486       459 Philosopher's Stone      15
 text16  1581   8357       591 Philosopher's Stone      16
 text17  1489   7172       506 Philosopher's Stone      17

Now we can assign these to the corpus as document-level metadata as follows:

docvars(philosophers_stone_corpus) <- philosophers_stone_summary
docvars(philosophers_stone_corpus)
     Text Types Tokens Sentences                book chapter
1   text1  1271   5693       350 Philosopher's Stone       1
2   text2  1066   4154       237 Philosopher's Stone       2
3   text3  1226   4656       297 Philosopher's Stone       3
4   text4  1193   4832       322 Philosopher's Stone       4
5   text5  1818   8446       563 Philosopher's Stone       5
6   text6  1563   8016       566 Philosopher's Stone       6
7   text7  1374   5487       351 Philosopher's Stone       7
8   text8  1095   3608       198 Philosopher's Stone       8
9   text9  1422   6195       411 Philosopher's Stone       9
10 text10  1293   5237       334 Philosopher's Stone      10
11 text11  1110   4215       277 Philosopher's Stone      11
12 text12  1509   6790       447 Philosopher's Stone      12
13 text13  1076   3953       262 Philosopher's Stone      13
14 text14  1110   4394       308 Philosopher's Stone      14
15 text15  1385   6486       459 Philosopher's Stone      15
16 text16  1581   8357       591 Philosopher's Stone      16
17 text17  1489   7172       506 Philosopher's Stone      17
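Assigning the whole summary, as above, also copies the Text, Types, Tokens, and Sentences columns into the document variables. If you would rather attach only selected fields, one option is to set them one at a time with docvars(x, field). The sketch below uses a made-up two-document corpus; `toy_corpus` and the "Toy Book" label are assumptions for illustration.

```r
library(quanteda)

# a toy corpus with two short documents (hypothetical data)
toy_corpus <- corpus(c("Harry woke early. He could not sleep.",
                       "The owl arrived at dawn."))
toy_summary <- summary(toy_corpus)

# attach selected document variables, one field at a time
docvars(toy_corpus, "Tokens") <- toy_summary$Tokens
docvars(toy_corpus, "book") <- rep("Toy Book", ndoc(toy_corpus))
docvars(toy_corpus, "chapter") <- seq_len(ndoc(toy_corpus))

# only the fields we chose are stored
names(docvars(toy_corpus))
```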

These document variables can be really useful when we want to subset the corpus to some specific level. With just one book, it doesn’t make a lot of sense to subset right now. But the intuition works later, so let’s look at what we’d do if we wanted to, say, look at only chapters with fewer than 5,000 tokens.

small_corpus <- corpus_subset(philosophers_stone_corpus, Tokens < 5000)
summary(small_corpus)
Corpus consisting of 7 documents, showing 7 documents:

   Text Types Tokens Sentences   Text Types Tokens Sentences
  text2  1066   4154       237  text2  1066   4154       237
  text3  1226   4656       297  text3  1226   4656       297
  text4  1193   4832       322  text4  1193   4832       322
  text8  1095   3608       198  text8  1095   3608       198
 text11  1110   4215       277 text11  1110   4215       277
 text13  1076   3953       262 text13  1076   3953       262
 text14  1110   4394       308 text14  1110   4394       308
                book chapter
 Philosopher's Stone       2
 Philosopher's Stone       3
 Philosopher's Stone       4
 Philosopher's Stone       8
 Philosopher's Stone      11
 Philosopher's Stone      13
 Philosopher's Stone      14
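The same subsetting logic can be tried on a small scale. This is a sketch with made-up documents; `toy_corpus` and the `n_tokens` variable are hypothetical names. The condition passed to corpus_subset() is evaluated against the document variables, so any stored field can be used as a filter.

```r
library(quanteda)

# three toy documents of different lengths (hypothetical data)
toy_corpus <- corpus(c(d1 = "Short text.",
                       d2 = "A much longer text with many more tokens in it.",
                       d3 = "Tiny."))

# store the token count of each document as a document variable
docvars(toy_corpus, "n_tokens") <- as.integer(ntoken(tokens(toy_corpus)))

# keep only documents with fewer than 5 tokens
short_docs <- corpus_subset(toy_corpus, n_tokens < 5)
docnames(short_docs)
```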

Chapters offer a natural unit for analysis here. However, we may want to reshape the level of analysis that we are conducting, perhaps moving from the chapter level to the paragraph or sentence level.

# the number of documents (chapters) in our small corpus
ndoc(small_corpus)
[1] 7
# the command to reshape our corpus to the sentence level
small_corpus_sentences <- corpus_reshape(small_corpus, to = "sentences")
# can also be to = "sentences", "paragraphs", or "documents"

# the number of documents (sentences) in our reshaped corpus
ndoc(small_corpus_sentences)
[1] 1901
# a summary of the first 5 texts in the sentence-level corpus
summary(small_corpus_sentences, n = 5)
Corpus consisting of 1901 documents, showing 5 documents:

    Text Types Tokens Sentences  Text Types Tokens Sentences
 text2.1    29     32         1 text2  1066   4154       237
 text2.2    45     57         1 text2  1066   4154       237
 text2.3    13     14         1 text2  1066   4154       237
 text2.4    55     70         1 text2  1066   4154       237
 text2.5    17     17         1 text2  1066   4154       237
                book chapter
 Philosopher's Stone       2
 Philosopher's Stone       2
 Philosopher's Stone       2
 Philosopher's Stone       2
 Philosopher's Stone       2

So we have gone from 7 documents (chapters) to 1,901 documents (sentences). The summary now provides some new information. The first four columns are summary statistics at the sentence level (which is why Sentences is always equal to 1), and the next four columns are the chapter-level values we stored as document variables. Note that the first Text column now includes an additional suffix (e.g., .1), which indexes each sentence within the chapter.
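Reshaping is reversible, which a small sketch can show. The two documents below are made up, and `toy_corpus`, `toy_sentences`, and `toy_restored` are illustrative names; the example assumes the default sentence segmenter splits on the sentence-final periods.

```r
library(quanteda)

# two toy documents: the first has two sentences, the second has one
toy_corpus <- corpus(c(d1 = "Harry woke early. He could not sleep.",
                       d2 = "The owl arrived at dawn."))

# move from the document level to the sentence level
toy_sentences <- corpus_reshape(toy_corpus, to = "sentences")
ndoc(toy_sentences)
docnames(toy_sentences)

# and back to the original documents
toy_restored <- corpus_reshape(toy_sentences, to = "documents")
ndoc(toy_restored)
```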

The “right” level of analysis is going to be really contingent on your specific research question. There is no single correct level of analysis across all analyses. Think – a lot – about what you are interested in studying, and what level is best for that study.

Tokens

We’ve used the term “tokens” in a few places now, and it’s time to dive into what we mean by it. Tokens are the individual component pieces of the text, and tokenizing is the process of breaking up the text into those component pieces. Consider the following tweet from President Trump:

Unless Republicans have a death wish, and it is also the right thing to do, they must approve the $2,000 payments ASAP. $600 IS NOT ENOUGH! Also, get rid of Section 230 - Don't let Big Tech steal our Country, and don't let the Democrats steal the Presidential Election. Get tough!

We can start to break that into constituent words (“Unless”, “Republicans”, etc.) but notice that we pretty quickly have decisions to make. Should we include the punctuation marks (“,”, “.”, “!”) with a word? As unique tokens themselves? What about numbers like 2,000? Likewise, we have to decide whether to include the “$” with 2,000; you can see why that’d be important when you get a bit further and run into “Section 230”. Finally, what should we do with contractions like “Don’t”? Is that one word or two? And should we treat “Don’t” and “don’t” as the same token, or two different tokens?

All of these are choices that get nested into the tokenization process. The most basic versions of a tokenizer will split on white space; others are trained to split on a host of other characteristics. The big thing to know is to always look at the data you are creating.
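One way to see these choices in action is to tokenize a fragment of the tweet above under different settings and compare the results. This is a sketch only; the object names (`tweet`, `toks_default`, `toks_clean`, `toks_lower`) are illustrative, and the exact handling of items like "$600" depends on the tokenizer in your installed version of quanteda, so print the results and inspect them.

```r
library(quanteda)

# a fragment of the tweet quoted above
tweet <- "$600 IS NOT ENOUGH! Also, get rid of Section 230 - Don't let Big Tech steal our Country."

# default settings
toks_default <- tokens(tweet)

# drop punctuation and numbers
toks_clean <- tokens(tweet, remove_punct = TRUE, remove_numbers = TRUE)

# treat "Don't" and "don't" as the same token by lowercasing
toks_lower <- tokens_tolower(toks_clean)

# always look at what you created
print(toks_default)
print(toks_lower)

# compare how many tokens survive each choice
ntoken(toks_default)
ntoken(toks_lower)
```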

# the default tokenizer splits on word boundaries, so punctuation marks become their own tokens
philosophers_stone_tokens <- tokens(philosophers_stone_corpus)
print(philosophers_stone_tokens)
Tokens consisting of 17 documents and 6 docvars.
text1 :
 [1] "THE"     "BOY"     "WHO"     "LIVED"   "Mr"      "."       "and"    
 [8] "Mrs"     "."       "Dursley" ","       "of"     
[ ... and 5,681 more ]

text2 :
 [1] "THE"       "VANISHING" "GLASS"     "Nearly"    "ten"       "years"    
 [7] "had"       "passed"    "since"     "the"       "Dursleys"  "had"      
[ ... and 4,142 more ]

text3 :
 [1] "THE"         "LETTERS"     "FROM"        "NO"          "ONE"        
 [6] "The"         "escape"      "of"          "the"         "Brazilian"  
[11] "boa"         "constrictor"
[ ... and 4,644 more ]

text4 :
 [1] "THE"     "KEEPER"  "OF"      "THE"     "KEYS"    "BOOM"    "."      
 [8] "They"    "knocked" "again"   "."       "Dudley" 
[ ... and 4,820 more ]

text5 :
 [1] "DIAGON"   "ALLEY"    "Harry"    "woke"     "early"    "the"     
 [7] "next"     "morning"  "."        "Although" "he"       "could"   
[ ... and 8,434 more ]

text6 :
 [1] "THE"            "JOURNEY"        "FROM"           "PLATFORM"      
 [5] "NINE"           "AND"            "THREE-QUARTERS" "Harry's"       
 [9] "last"           "month"          "with"           "the"           
[ ... and 8,004 more ]

[ reached max_ndoc ... 11 more documents ]
# you can also drop punctuation
philosophers_stone_tokens <- tokens(philosophers_stone_corpus,
       remove_punct = TRUE)
print(philosophers_stone_tokens)
Tokens consisting of 17 documents and 6 docvars.
text1 :
 [1] "THE"     "BOY"     "WHO"     "LIVED"   "Mr"      "and"     "Mrs"    
 [8] "Dursley" "of"      "number"  "four"    "Privet" 
[ ... and 4,784 more ]

text2 :
 [1] "THE"       "VANISHING" "GLASS"     "Nearly"    "ten"       "years"    
 [7] "had"       "passed"    "since"     "the"       "Dursleys"  "had"      
[ ... and 3,556 more ]

text3 :
 [1] "THE"         "LETTERS"     "FROM"        "NO"          "ONE"        
 [6] "The"         "escape"      "of"          "the"         "Brazilian"  
[11] "boa"         "constrictor"
[ ... and 3,984 more ]

text4 :
 [1] "THE"     "KEEPER"  "OF"      "THE"     "KEYS"    "BOOM"    "They"   
 [8] "knocked" "again"   "Dudley"  "jerked"  "awake"  
[ ... and 3,910 more ]

text5 :
 [1] "DIAGON"   "ALLEY"    "Harry"    "woke"     "early"    "the"     
 [7] "next"     "morning"  "Although" "he"       "could"    "tell"    
[ ... and 6,998 more ]

text6 :
 [1] "THE"            "JOURNEY"        "FROM"           "PLATFORM"      
 [5] "NINE"           "AND"            "THREE-QUARTERS" "Harry's"       
 [9] "last"           "month"          "with"           "the"           
[ ... and 6,750 more ]

[ reached max_ndoc ... 11 more documents ]
# as well as numbers
philosophers_stone_tokens <- tokens(philosophers_stone_corpus,
       remove_punct = TRUE,
       remove_numbers = TRUE)
print(philosophers_stone_tokens)
Tokens consisting of 17 documents and 6 docvars.
text1 :
 [1] "THE"     "BOY"     "WHO"     "LIVED"   "Mr"      "and"     "Mrs"    
 [8] "Dursley" "of"      "number"  "four"    "Privet" 
[ ... and 4,784 more ]

text2 :
 [1] "THE"       "VANISHING" "GLASS"     "Nearly"    "ten"       "years"    
 [7] "had"       "passed"    "since"     "the"       "Dursleys"  "had"      
[ ... and 3,556 more ]

text3 :
 [1] "THE"         "LETTERS"     "FROM"        "NO"          "ONE"        
 [6] "The"         "escape"      "of"          "the"         "Brazilian"  
[11] "boa"         "constrictor"
[ ... and 3,981 more ]

text4 :
 [1] "THE"     "KEEPER"  "OF"      "THE"     "KEYS"    "BOOM"    "They"   
 [8] "knocked" "again"   "Dudley"  "jerked"  "awake"  
[ ... and 3,907 more ]

text5 :
 [1] "DIAGON"   "ALLEY"    "Harry"    "woke"     "early"    "the"     
 [7] "next"     "morning"  "Although" "he"       "could"    "tell"    
[ ... and 6,989 more ]

text6 :
 [1] "THE"            "JOURNEY"        "FROM"           "PLATFORM"      
 [5] "NINE"           "AND"            "THREE-QUARTERS" "Harry's"       
 [9] "last"           "month"          "with"           "the"           
[ ... and 6,749 more ]

[ reached max_ndoc ... 11 more documents ]
# you can also convert the tokens to lowercase (printed here, not saved back to the object)
tokens_tolower(philosophers_stone_tokens)
Tokens consisting of 17 documents and 6 docvars.
text1 :
 [1] "the"     "boy"     "who"     "lived"   "mr"      "and"     "mrs"    
 [8] "dursley" "of"      "number"  "four"    "privet" 
[ ... and 4,784 more ]

text2 :
 [1] "the"       "vanishing" "glass"     "nearly"    "ten"       "years"    
 [7] "had"       "passed"    "since"     "the"       "dursleys"  "had"      
[ ... and 3,556 more ]

text3 :
 [1] "the"         "letters"     "from"        "no"          "one"        
 [6] "the"         "escape"      "of"          "the"         "brazilian"  
[11] "boa"         "constrictor"
[ ... and 3,981 more ]

text4 :
 [1] "the"     "keeper"  "of"      "the"     "keys"    "boom"    "they"   
 [8] "knocked" "again"   "dudley"  "jerked"  "awake"  
[ ... and 3,907 more ]

text5 :
 [1] "diagon"   "alley"    "harry"    "woke"     "early"    "the"     
 [7] "next"     "morning"  "although" "he"       "could"    "tell"    
[ ... and 6,989 more ]

text6 :
 [1] "the"            "journey"        "from"           "platform"      
 [5] "nine"           "and"            "three-quarters" "harry's"       
 [9] "last"           "month"          "with"           "the"           
[ ... and 6,749 more ]

[ reached max_ndoc ... 11 more documents ]

When the data are tokenized, we can start to look at a more granular level at the usage of particular terms. For instance, maybe we want to know about the usage of particular terms within the corpus. We can look at that using keyword-in-context (kwic).

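# check the use of "dumbledore"; kwic() matching is case-insensitive by default,
# so this finds "Dumbledore" even though the lowercased tokens were not saved
kwic_dumbledore <- kwic(philosophers_stone_tokens,
     pattern = c("dumbledore"))
# the default is window = 5 tokens on either side of the keyword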

# look at the first few uses
head(kwic_dumbledore)
Keyword-in-context with 6 matches.                                                             
 [text1, 2292]       This man's name was Albus | Dumbledore |
 [text1, 2294] name was Albus Dumbledore Albus | Dumbledore |
 [text1, 2489]  happening down on the pavement | Dumbledore |
 [text1, 2759]         can't blame them ` said | Dumbledore |
 [text1, 2818]      a sharp sideways glance at | Dumbledore |
 [text1, 2869]      suppose he really has gone | Dumbledore |
                                  
 Albus Dumbledore didn't seem to  
 didn't seem to realize that      
 slipped the Put-Outer back inside
 gently ` We've had precious      
 here as though hoping he         
 ` ` It certainly seems           
# now look at a broader window of terms around "dumbledore"
kwic_dumbledore <- kwic(philosophers_stone_tokens,
     pattern = c("dumbledore"),
     window = 10)

# look at the first few uses
head(kwic_dumbledore)
Keyword-in-context with 6 matches.                                                                          
 [text1, 2292]      been broken at least twice This man's name was Albus |
 [text1, 2294] at least twice This man's name was Albus Dumbledore Albus |
 [text1, 2489]   to see anything that was happening down on the pavement |
 [text1, 2759]                much sense ` ` You can't blame them ` said |
 [text1, 2818]    swapping rumors ` She threw a sharp sideways glance at |
 [text1, 2869]             out about us all I suppose he really has gone |
                                                                        
 Dumbledore | Albus Dumbledore didn't seem to realize that he had just  
 Dumbledore | didn't seem to realize that he had just arrived in        
 Dumbledore | slipped the Put-Outer back inside his cloak and set off   
 Dumbledore | gently ` We've had precious little to celebrate for eleven
 Dumbledore | here as though hoping he was going to tell her            
 Dumbledore | ` ` It certainly seems so ` said Dumbledore `             
# if you are more interested in phrases, then you can do that too using phrase()
kwic_phrase <- kwic(philosophers_stone_tokens,
                    pattern = phrase("daily prophet"))


head(kwic_phrase)
Keyword-in-context with 3 matches.                                                                   
   [text5, 901:902] Hagrid read his newspaper the | Daily Prophet |
 [text6, 5196:5197]        It's been all over the | Daily Prophet |
 [text8, 2909:2910]        was a cutting from the | Daily Prophet |
                                                  
 Harry had learned from Uncle                     
 but I don't suppose you                          
 GRINGOTTS BREAK-IN LATEST Investigations continue
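The behavior of kwic() can be checked on a tiny example. The sketch below uses made-up sentences (`toy_tokens` and the hit objects are illustrative names) to show three things: the `window` argument, the default case-insensitive matching, and multi-word patterns wrapped in phrase().

```r
library(quanteda)

# two toy documents (hypothetical data)
toy_tokens <- tokens(c(d1 = "Albus Dumbledore smiled. Then Dumbledore left.",
                       d2 = "No wizard here."),
                     remove_punct = TRUE)

# lowercase pattern still matches "Dumbledore" (case-insensitive by default)
hits <- kwic(toy_tokens, pattern = "dumbledore", window = 2)
hits

# with case-sensitive matching, the lowercase pattern finds nothing
hits_cs <- kwic(toy_tokens, pattern = "dumbledore", case_insensitive = FALSE)
nrow(hits_cs)

# a multi-word pattern needs phrase()
hits_phrase <- kwic(toy_tokens, pattern = phrase("albus dumbledore"))
nrow(hits_phrase)
```

Since a kwic object behaves like a data frame, nrow() gives the number of matches, which is handy for a quick count before reading the contexts.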

The Daily Prophet offers a great example of a problem we might encounter with tokenizers; the standard approach is going to treat this as two different words when really it is the phrase itself that is likely of interest. Therefore, we can compound the tokens into a phrase using tokens_compound(). This creates a bigram; you can similarly create three-token phrases (trigram), four-token phrases (four-gram), and so on. Note that the newly created token will be the phrase separated by “_” (i.e., Daily_Prophet).

philosophers_stone_compound <- tokens_compound(philosophers_stone_tokens,
                pattern = phrase("Daily Prophet"))

head(kwic(philosophers_stone_compound,
          pattern = "Daily_Prophet"))
Keyword-in-context with 3 matches.                                                              
  [text5, 901] Hagrid read his newspaper the | Daily_Prophet |
 [text6, 5196]        It's been all over the | Daily_Prophet |
 [text8, 2909]        was a cutting from the | Daily_Prophet |
                                                  
 Harry had learned from Uncle                     
 but I don't suppose you                          
 GRINGOTTS BREAK-IN LATEST Investigations continue
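To see what compounding does to the token sequence itself, here is a short sketch on a made-up sentence (`toy_tokens` and `toy_compound` are illustrative names). Only the adjacent pair that matches the phrase is joined; the standalone word "daily" at the end is left alone.

```r
library(quanteda)

# a toy sentence in which "daily" appears both inside and outside the phrase
toy_tokens <- tokens("He read the Daily Prophet daily")

# join the two-token phrase into a single token
toy_compound <- tokens_compound(toy_tokens, pattern = phrase("Daily Prophet"))

as.character(toy_tokens)
as.character(toy_compound)

# one fewer token after compounding
ntoken(toy_tokens)
ntoken(toy_compound)
```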

Of course, you may also believe that there are lots and lots of potentially meaningful n-grams (i.e., uni-, bi-, tri-, four-, etc.) that you do not want to specify individually. In those cases, you can have tokens_ngrams() generate every n-gram of the lengths you choose.

# create a tokens object with unigrams and bigrams
philosophers_stone_ngrams <- tokens_ngrams(philosophers_stone_tokens, n=1:2)

# look at the first few observations. Note the indexing here to look at only the first few words *within the first chapter*
head(philosophers_stone_ngrams[[1]], 50)
 [1] "THE"        "BOY"        "WHO"        "LIVED"      "Mr"        
 [6] "and"        "Mrs"        "Dursley"    "of"         "number"    
[11] "four"       "Privet"     "Drive"      "were"       "proud"     
[16] "to"         "say"        "that"       "they"       "were"      
[21] "perfectly"  "normal"     "thank"      "you"        "very"      
[26] "much"       "They"       "were"       "the"        "last"      
[31] "people"     "you'd"      "expect"     "to"         "be"        
[36] "involved"   "in"         "anything"   "strange"    "or"        
[41] "mysterious" "because"    "they"       "just"       "didn't"    
[46] "hold"       "with"       "such"       "nonsense"   "Mr"        
tail(philosophers_stone_ngrams[[1]], 50)
 [1] "nor_that"       "that_he"        "he_would"       "would_spend"   
 [5] "spend_the"      "the_next"       "next_few"       "few_weeks"     
 [9] "weeks_being"    "being_prodded"  "prodded_and"    "and_pinched"   
[13] "pinched_by"     "by_his"         "his_cousin"     "cousin_Dudley" 
[17] "Dudley_He"      "He_couldn't"    "couldn't_know"  "know_that"     
[21] "that_at"        "at_this"        "this_very"      "very_moment"   
[25] "moment_people"  "people_meeting" "meeting_in"     "in_secret"     
[29] "secret_all"     "all_over"       "over_the"       "the_country"   
[33] "country_were"   "were_holding"   "holding_up"     "up_their"      
[37] "their_glasses"  "glasses_and"    "and_saying"     "saying_in"     
[41] "in_hushed"      "hushed_voices"  "voices_`"       "`_To"          
[45] "To_Harry"       "Harry_Potter"   "Potter_the"     "the_boy"       
[49] "boy_who"        "who_lived"     

As you can see, there is a pretty severe curse of dimensionality problem as you expand to longer and longer n-grams. Nevertheless, computational time and space are cheap, and the added information from the phrases could be useful in different research settings.
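The growth in the number of tokens is easy to see on a four-word example. This sketch uses a made-up tokens object (`toy_tokens`, `uni_bi`, and `uni_bi_tri` are illustrative names) and compares the token counts as longer n-grams are added.

```r
library(quanteda)

# four tokens to start with
toy_tokens <- tokens("the boy who lived")

# unigrams and bigrams
uni_bi <- tokens_ngrams(toy_tokens, n = 1:2)
as.character(uni_bi)

# unigrams, bigrams, and trigrams
uni_bi_tri <- tokens_ngrams(toy_tokens, n = 1:3)

# token counts at each step
ntoken(toy_tokens)
ntoken(uni_bi)
ntoken(uni_bi_tri)
```

A document with k tokens yields k unigrams, k - 1 bigrams, and k - 2 trigrams, so the four tokens here become 7 and then 9. The number of distinct types in a real corpus grows much faster than that, which is where the dimensionality problem comes from.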

Combining Corpora

In the above, we’ve been working with just one text, broken into chapters. But occasionally we have two corpora that we need to combine. Here, for instance, there are 6 more Harry Potter books that we have not, to this point, added to any of our analysis.

Doing so with quanteda is easy, but getting there is hard because we have to repeat a lot of steps for seven corpora. Instead, let’s do this with loops.

# list out the object (book) names that we need
myBooks <- c("philosophers_stone",
             "chamber_of_secrets",
             "prisoner_of_azkaban",
             "goblet_of_fire",
             "order_of_the_phoenix",
             "half_blood_prince",
             "deathly_hallows")

# loop over the books; seq_along() is safer than 1:length() if the vector is ever empty
for (i in seq_along(myBooks)){
  
  # create corpora
  corpusCall <- paste(myBooks[i],"_corpus <- corpus(",myBooks[i],")", sep = "")
  eval(parse(text=corpusCall))

  # change document names for each chapter to include the book title. If you don't do this, the document names will be duplicated and you'll get an error.
  namesCall <- paste("tmpNames <- docnames(",myBooks[i],"_corpus)", sep = "")
  eval(parse(text=namesCall))
  bindCall <- paste("docnames(",myBooks[i],"_corpus) <- paste(\"",myBooks[i],"\", tmpNames, sep = \"-\")", sep = "")
  eval(parse(text=bindCall))

  # create summary data
  summaryCall <- paste(myBooks[i],"_summary <- summary(",myBooks[i],"_corpus)", sep = "")
  eval(parse(text=summaryCall))

  # add indicator
  bookCall <- paste(myBooks[i],"_summary$book <- \"",myBooks[i],"\"", sep = "")
  eval(parse(text=bookCall))

  # add chapter indicator
  chapterCall <- paste(myBooks[i],"_summary$chapter <- as.numeric(str_extract(",myBooks[i],"_summary$Text, \"[0-9]+\"))", sep = "")
  eval(parse(text=chapterCall))

  # add meta data to each corpus
  metaCall <- paste("docvars(",myBooks[i],"_corpus) <- ",myBooks[i],"_summary", sep = "")
  eval(parse(text=metaCall))

}

# once the loop finishes up, check to make sure you've created what you want
docvars(deathly_hallows_corpus)
                     Text Types Tokens Sentences            book chapter
1   deathly_hallows-text1  1108   3876       223 deathly_hallows       1
2   deathly_hallows-text2  1466   4616       163 deathly_hallows       2
3   deathly_hallows-text3  1110   4171       256 deathly_hallows       3
4   deathly_hallows-text4  1549   6586       307 deathly_hallows       4
5   deathly_hallows-text5  1577   7604       489 deathly_hallows       5
6   deathly_hallows-text6  1772   8389       481 deathly_hallows       6
7   deathly_hallows-text7  1810   8457       534 deathly_hallows       7
8   deathly_hallows-text8  1860   7762       438 deathly_hallows       8
9   deathly_hallows-text9  1288   5198       279 deathly_hallows       9
10 deathly_hallows-text10  1724   8071       461 deathly_hallows      10
11 deathly_hallows-text11  1677   6931       384 deathly_hallows      11
12 deathly_hallows-text12  1848   7542       405 deathly_hallows      12
13 deathly_hallows-text13  1725   7094       357 deathly_hallows      13
14 deathly_hallows-text14  1295   5175       267 deathly_hallows      14
15 deathly_hallows-text15  1908   8860       537 deathly_hallows      15
16 deathly_hallows-text16  1527   6150       289 deathly_hallows      16
17 deathly_hallows-text17  1510   6741       391 deathly_hallows      17
18 deathly_hallows-text18  1183   4125       206 deathly_hallows      18
19 deathly_hallows-text19  1708   8228       459 deathly_hallows      19
20 deathly_hallows-text20  1451   5561       319 deathly_hallows      20
21 deathly_hallows-text21  1426   6156       395 deathly_hallows      21
22 deathly_hallows-text22  1620   7430       453 deathly_hallows      22
23 deathly_hallows-text23  1881  10152       700 deathly_hallows      23
24 deathly_hallows-text24  1655   8449       554 deathly_hallows      24
25 deathly_hallows-text25  1344   5543       326 deathly_hallows      25
26 deathly_hallows-text26  1878   8206       378 deathly_hallows      26
27 deathly_hallows-text27   921   3207       148 deathly_hallows      27
28 deathly_hallows-text28  1328   5622       338 deathly_hallows      28
29 deathly_hallows-text29  1359   5806       373 deathly_hallows      29
30 deathly_hallows-text30  1475   6266       372 deathly_hallows      30
31 deathly_hallows-text31  2069   9913       560 deathly_hallows      31
32 deathly_hallows-text32  1535   6801       326 deathly_hallows      32
33 deathly_hallows-text33  2072  10456       721 deathly_hallows      33
34 deathly_hallows-text34  1170   4599       241 deathly_hallows      34
35 deathly_hallows-text35  1329   6321       426 deathly_hallows      35
36 deathly_hallows-text36  1893   8635       424 deathly_hallows      36
37 deathly_hallows-text37   710   2072       141 deathly_hallows      37
# You can change the book name to any of the seven Harry Potter books
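The loop above builds R commands as strings and evaluates them. An alternative, sketched here on made-up texts rather than the tutorial's book objects, is to keep the books in a named list and apply one helper function to each. The names `toy_books` and `build_book_corpus()` are hypothetical, and the helper stores only the book and chapter fields rather than the full summary.

```r
library(quanteda)

# a named list of character vectors, one element per chapter (hypothetical data)
toy_books <- list(
  book_one = c("Chapter one text.", "Chapter two text."),
  book_two = c("Another first chapter.")
)

# hypothetical helper: build a corpus for one book and attach metadata
build_book_corpus <- function(texts, book_name) {
  corp <- corpus(texts)
  # prefix document names with the book title so they stay unique when combined
  docnames(corp) <- paste(book_name, docnames(corp), sep = "-")
  docvars(corp, "book") <- rep(book_name, ndoc(corp))
  docvars(corp, "chapter") <- seq_len(ndoc(corp))
  corp
}

# apply the helper to every book, then combine the corpora with c()
toy_corpora <- Map(build_book_corpus, toy_books, names(toy_books))
toy_combined <- do.call(c, unname(toy_corpora))

ndoc(toy_combined)
docvars(toy_combined)
```

With the tutorial's objects, the list could presumably be built from the names in myBooks (for example with base R's mget()), but that step is an assumption and is not shown in the original.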

Now that we have all of the corpora in order, we can combine them using c().

# create a combined corpus of all seven Harry Potter books
harry_potter_corpus <-
  c(philosophers_stone_corpus,
    chamber_of_secrets_corpus,
    prisoner_of_azkaban_corpus,
    goblet_of_fire_corpus,
    order_of_the_phoenix_corpus,
    half_blood_prince_corpus,
    deathly_hallows_corpus)
summary(harry_potter_corpus)
Corpus consisting of 200 documents, showing 100 documents:

                       Text Types Tokens Sentences                       Text
   philosophers_stone-text1  1271   5693       350   philosophers_stone-text1
   philosophers_stone-text2  1066   4154       237   philosophers_stone-text2
   philosophers_stone-text3  1226   4656       297   philosophers_stone-text3
   philosophers_stone-text4  1193   4832       322   philosophers_stone-text4
   philosophers_stone-text5  1818   8446       563   philosophers_stone-text5
   philosophers_stone-text6  1563   8016       566   philosophers_stone-text6
   philosophers_stone-text7  1374   5487       351   philosophers_stone-text7
   philosophers_stone-text8  1095   3608       198   philosophers_stone-text8
   philosophers_stone-text9  1422   6195       411   philosophers_stone-text9
  philosophers_stone-text10  1293   5237       334  philosophers_stone-text10
  philosophers_stone-text11  1110   4215       277  philosophers_stone-text11
  philosophers_stone-text12  1509   6790       447  philosophers_stone-text12
  philosophers_stone-text13  1076   3953       262  philosophers_stone-text13
  philosophers_stone-text14  1110   4394       308  philosophers_stone-text14
  philosophers_stone-text15  1385   6486       459  philosophers_stone-text15
  philosophers_stone-text16  1581   8357       591  philosophers_stone-text16
  philosophers_stone-text17  1489   7172       506  philosophers_stone-text17
   chamber_of_secrets-text1   998   3206       183   chamber_of_secrets-text1
   chamber_of_secrets-text2  1040   3715       252   chamber_of_secrets-text2
   chamber_of_secrets-text3  1410   5743       342   chamber_of_secrets-text3
   chamber_of_secrets-text4  1751   7350       393   chamber_of_secrets-text4
   chamber_of_secrets-text5  1630   6390       354   chamber_of_secrets-text5
   chamber_of_secrets-text6  1648   6145       358   chamber_of_secrets-text6
   chamber_of_secrets-text7  1335   5019       363   chamber_of_secrets-text7
   chamber_of_secrets-text8  1625   5896       353   chamber_of_secrets-text8
   chamber_of_secrets-text9  1554   6701       431   chamber_of_secrets-text9
  chamber_of_secrets-text10  1563   6652       407  chamber_of_secrets-text10
  chamber_of_secrets-text11  1763   7380       444  chamber_of_secrets-text11
  chamber_of_secrets-text12  1472   5582       349  chamber_of_secrets-text12
  chamber_of_secrets-text13  1092   4050       263  chamber_of_secrets-text13
  chamber_of_secrets-text14   898   2959       195  chamber_of_secrets-text14
  chamber_of_secrets-text15  1377   5328       335  chamber_of_secrets-text15
  chamber_of_secrets-text16   986   3150       168  chamber_of_secrets-text16
  chamber_of_secrets-text17  1485   6613       464  chamber_of_secrets-text17
  chamber_of_secrets-text18  1063   4316       307  chamber_of_secrets-text18
  chamber_of_secrets-text19  2063  10949       729  chamber_of_secrets-text19
  prisoner_of_azkaban-text1  1267   4299       217  prisoner_of_azkaban-text1
  prisoner_of_azkaban-text2  1254   4795       302  prisoner_of_azkaban-text2
  prisoner_of_azkaban-text3  1383   5660       364  prisoner_of_azkaban-text3
  prisoner_of_azkaban-text4  1619   6414       382  prisoner_of_azkaban-text4
  prisoner_of_azkaban-text5  1621   7059       437  prisoner_of_azkaban-text5
  prisoner_of_azkaban-text6  1762   7978       504  prisoner_of_azkaban-text6
  prisoner_of_azkaban-text7  1318   5501       388  prisoner_of_azkaban-text7
  prisoner_of_azkaban-text8  1583   6469       417  prisoner_of_azkaban-text8
  prisoner_of_azkaban-text9  1542   6737       450  prisoner_of_azkaban-text9
 prisoner_of_azkaban-text10  1982   9021       570 prisoner_of_azkaban-text10
 prisoner_of_azkaban-text11  1575   6517       416 prisoner_of_azkaban-text11
 prisoner_of_azkaban-text12  1366   6283       433 prisoner_of_azkaban-text12
 prisoner_of_azkaban-text13  1374   5388       326 prisoner_of_azkaban-text13
 prisoner_of_azkaban-text14  1609   7043       496 prisoner_of_azkaban-text14
 prisoner_of_azkaban-text15  1664   6908       459 prisoner_of_azkaban-text15
 prisoner_of_azkaban-text16  1325   5007       344 prisoner_of_azkaban-text16
 prisoner_of_azkaban-text17  1206   5464       360 prisoner_of_azkaban-text17
 prisoner_of_azkaban-text18   784   2955       186 prisoner_of_azkaban-text18
 prisoner_of_azkaban-text19  1382   7013       486 prisoner_of_azkaban-text19
 prisoner_of_azkaban-text20   736   2571       184 prisoner_of_azkaban-text20
 prisoner_of_azkaban-text21  1622   9752       748 prisoner_of_azkaban-text21
 prisoner_of_azkaban-text22  1436   6165       475 prisoner_of_azkaban-text22
       goblet_of_fire-text1  1225   5150       278       goblet_of_fire-text1
       goblet_of_fire-text2   975   3307       151       goblet_of_fire-text2
       goblet_of_fire-text3  1064   3777       214       goblet_of_fire-text3
       goblet_of_fire-text4  1053   3753       236       goblet_of_fire-text4
       goblet_of_fire-text5  1244   4773       274       goblet_of_fire-text5
       goblet_of_fire-text6   878   3080       199       goblet_of_fire-text6
       goblet_of_fire-text7  1647   6731       416       goblet_of_fire-text7
       goblet_of_fire-text8  1655   7219       421       goblet_of_fire-text8
       goblet_of_fire-text9  1734   9396       647       goblet_of_fire-text9
      goblet_of_fire-text10  1108   4194       267      goblet_of_fire-text10
      goblet_of_fire-text11  1162   4246       258      goblet_of_fire-text11
      goblet_of_fire-text12  1700   6770       394      goblet_of_fire-text12
      goblet_of_fire-text13  1408   5021       312      goblet_of_fire-text13
      goblet_of_fire-text14  1453   6107       393      goblet_of_fire-text14
      goblet_of_fire-text15  1733   6429       343      goblet_of_fire-text15
      goblet_of_fire-text16  1635   7579       458      goblet_of_fire-text16
      goblet_of_fire-text17  1255   5269       345      goblet_of_fire-text17
      goblet_of_fire-text18  1758   8330       467      goblet_of_fire-text18
      goblet_of_fire-text19  1695   7876       441      goblet_of_fire-text19
      goblet_of_fire-text20  1747   9024       516      goblet_of_fire-text20
      goblet_of_fire-text21  1616   7134       443      goblet_of_fire-text21
      goblet_of_fire-text22  1310   5735       379      goblet_of_fire-text22
      goblet_of_fire-text23  2131  10268       619      goblet_of_fire-text23
      goblet_of_fire-text24  1742   8010       501      goblet_of_fire-text24
      goblet_of_fire-text25  1505   7175       409      goblet_of_fire-text25
      goblet_of_fire-text26  2021   9944       529      goblet_of_fire-text26
      goblet_of_fire-text27  1815   8780       534      goblet_of_fire-text27
      goblet_of_fire-text28  1898   9578       670      goblet_of_fire-text28
      goblet_of_fire-text29  1278   5674       406      goblet_of_fire-text29
      goblet_of_fire-text30  1492   8054       515      goblet_of_fire-text30
      goblet_of_fire-text31  1932   9930       661      goblet_of_fire-text31
      goblet_of_fire-text32   689   2410       131      goblet_of_fire-text32
      goblet_of_fire-text33  1134   5217       275      goblet_of_fire-text33
      goblet_of_fire-text34   818   3713       145      goblet_of_fire-text34
      goblet_of_fire-text35  1442   7531       607      goblet_of_fire-text35
      goblet_of_fire-text36  1518   7797       535      goblet_of_fire-text36
      goblet_of_fire-text37  1408   6273       443      goblet_of_fire-text37
 order_of_the_phoenix-text1  1761   7026       405 order_of_the_phoenix-text1
 order_of_the_phoenix-text2  1654   7724       545 order_of_the_phoenix-text2
 order_of_the_phoenix-text3  1622   6435       398 order_of_the_phoenix-text3
 order_of_the_phoenix-text4  1641   7158       447 order_of_the_phoenix-text4
 order_of_the_phoenix-text5  1544   6877       431 order_of_the_phoenix-text5
 Types Tokens Sentences                 book chapter
  1271   5693       350   philosophers_stone       1
  1066   4154       237   philosophers_stone       2
  1226   4656       297   philosophers_stone       3
  1193   4832       322   philosophers_stone       4
  1818   8446       563   philosophers_stone       5
  1563   8016       566   philosophers_stone       6
  1374   5487       351   philosophers_stone       7
  1095   3608       198   philosophers_stone       8
  1422   6195       411   philosophers_stone       9
  1293   5237       334   philosophers_stone      10
  1110   4215       277   philosophers_stone      11
  1509   6790       447   philosophers_stone      12
  1076   3953       262   philosophers_stone      13
  1110   4394       308   philosophers_stone      14
  1385   6486       459   philosophers_stone      15
  1581   8357       591   philosophers_stone      16
  1489   7172       506   philosophers_stone      17
   998   3206       183   chamber_of_secrets       1
  1040   3715       252   chamber_of_secrets       2
  1410   5743       342   chamber_of_secrets       3
  1751   7350       393   chamber_of_secrets       4
  1630   6390       354   chamber_of_secrets       5
  1648   6145       358   chamber_of_secrets       6
  1335   5019       363   chamber_of_secrets       7
  1625   5896       353   chamber_of_secrets       8
  1554   6701       431   chamber_of_secrets       9
  1563   6652       407   chamber_of_secrets      10
  1763   7380       444   chamber_of_secrets      11
  1472   5582       349   chamber_of_secrets      12
  1092   4050       263   chamber_of_secrets      13
   898   2959       195   chamber_of_secrets      14
  1377   5328       335   chamber_of_secrets      15
   986   3150       168   chamber_of_secrets      16
  1485   6613       464   chamber_of_secrets      17
  1063   4316       307   chamber_of_secrets      18
  2063  10949       729   chamber_of_secrets      19
  1267   4299       217  prisoner_of_azkaban       1
  1254   4795       302  prisoner_of_azkaban       2
  1383   5660       364  prisoner_of_azkaban       3
  1619   6414       382  prisoner_of_azkaban       4
  1621   7059       437  prisoner_of_azkaban       5
  1762   7978       504  prisoner_of_azkaban       6
  1318   5501       388  prisoner_of_azkaban       7
  1583   6469       417  prisoner_of_azkaban       8
  1542   6737       450  prisoner_of_azkaban       9
  1982   9021       570  prisoner_of_azkaban      10
  1575   6517       416  prisoner_of_azkaban      11
  1366   6283       433  prisoner_of_azkaban      12
  1374   5388       326  prisoner_of_azkaban      13
  1609   7043       496  prisoner_of_azkaban      14
  1664   6908       459  prisoner_of_azkaban      15
  1325   5007       344  prisoner_of_azkaban      16
  1206   5464       360  prisoner_of_azkaban      17
   784   2955       186  prisoner_of_azkaban      18
  1382   7013       486  prisoner_of_azkaban      19
   736   2571       184  prisoner_of_azkaban      20
  1622   9752       748  prisoner_of_azkaban      21
  1436   6165       475  prisoner_of_azkaban      22
  1225   5150       278       goblet_of_fire       1
   975   3307       151       goblet_of_fire       2
  1064   3777       214       goblet_of_fire       3
  1053   3753       236       goblet_of_fire       4
  1244   4773       274       goblet_of_fire       5
   878   3080       199       goblet_of_fire       6
  1647   6731       416       goblet_of_fire       7
  1655   7219       421       goblet_of_fire       8
  1734   9396       647       goblet_of_fire       9
  1108   4194       267       goblet_of_fire      10
  1162   4246       258       goblet_of_fire      11
  1700   6770       394       goblet_of_fire      12
  1408   5021       312       goblet_of_fire      13
  1453   6107       393       goblet_of_fire      14
  1733   6429       343       goblet_of_fire      15
  1635   7579       458       goblet_of_fire      16
  1255   5269       345       goblet_of_fire      17
  1758   8330       467       goblet_of_fire      18
  1695   7876       441       goblet_of_fire      19
  1747   9024       516       goblet_of_fire      20
  1616   7134       443       goblet_of_fire      21
  1310   5735       379       goblet_of_fire      22
  2131  10268       619       goblet_of_fire      23
  1742   8010       501       goblet_of_fire      24
  1505   7175       409       goblet_of_fire      25
  2021   9944       529       goblet_of_fire      26
  1815   8780       534       goblet_of_fire      27
  1898   9578       670       goblet_of_fire      28
  1278   5674       406       goblet_of_fire      29
  1492   8054       515       goblet_of_fire      30
  1932   9930       661       goblet_of_fire      31
   689   2410       131       goblet_of_fire      32
  1134   5217       275       goblet_of_fire      33
   818   3713       145       goblet_of_fire      34
  1442   7531       607       goblet_of_fire      35
  1518   7797       535       goblet_of_fire      36
  1408   6273       443       goblet_of_fire      37
  1761   7026       405 order_of_the_phoenix       1
  1654   7724       545 order_of_the_phoenix       2
  1622   6435       398 order_of_the_phoenix       3
  1641   7158       447 order_of_the_phoenix       4
  1544   6877       431 order_of_the_phoenix       5

Now we’re cooking. The corpus is now too large to take in at a glance, even in a summary, so here are some handy functions that help us get a handle on its size and scope.

# check the number of documents (here, total chapters in the 7 books)
ndoc(harry_potter_corpus)
[1] 200
# check the total length of the corpus in tokens (roughly the word count; punctuation marks are counted as tokens too)
sum(ntoken(harry_potter_corpus))
[1] 1365970
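
By default, quanteda counts punctuation marks as tokens, so the total above is somewhat larger than a strict word count. Here is a minimal sketch on a made-up two-document corpus that compares the two counts. The object toy_corpus and its texts are invented for illustration and are not part of the Harry Potter data.

```r
library(quanteda)

# A made-up corpus for illustration only
toy_corpus <- corpus(c(doc1 = "Harry waved his wand.",
                       doc2 = "Hermione read, and read, and read!"))

# Tokenize twice: once keeping punctuation (the default), once dropping it
toks_all   <- tokens(toy_corpus)
toks_words <- tokens(toy_corpus, remove_punct = TRUE)

ndoc(toy_corpus)          # number of documents
sum(ntoken(toks_all))     # total tokens, punctuation included
sum(ntoken(toks_words))   # total tokens, words only
```

Tokenizing first and then calling ntoken() makes it explicit which kinds of tokens are being counted.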

We’ll learn other ways to characterize and explore the texts later this semester, when we turn to the different ways in which we can represent them. For now, you should have all of the tools you need to set up your own corpus in R and to identify a number of its important characteristics (the size of the corpus in terms of documents and vocabulary, for instance).

Need more practice?

#install.packages("janeaustenr")
library(janeaustenr)

# Jane Austen's novel "Sense and Sensibility"
sensesensibility[1:30]
 [1] "SENSE AND SENSIBILITY"                                                  
 [2] ""                                                                       
 [3] "by Jane Austen"                                                         
 [4] ""                                                                       
 [5] "(1811)"                                                                 
 [6] ""                                                                       
 [7] ""                                                                       
 [8] ""                                                                       
 [9] ""                                                                       
[10] "CHAPTER 1"                                                              
[11] ""                                                                       
[12] ""                                                                       
[13] "The family of Dashwood had long been settled in Sussex.  Their estate"  
[14] "was large, and their residence was at Norland Park, in the centre of"   
[15] "their property, where, for many generations, they had lived in so"      
[16] "respectable a manner as to engage the general good opinion of their"    
[17] "surrounding acquaintance.  The late owner of this estate was a single"  
[18] "man, who lived to a very advanced age, and who for many years of his"   
[19] "life, had a constant companion and housekeeper in his sister.  But her" 
[20] "death, which happened ten years before his own, produced a great"       
[21] "alteration in his home; for to supply her loss, he invited and received"
[22] "into his house the family of his nephew Mr. Henry Dashwood, the legal"  
[23] "inheritor of the Norland estate, and the person to whom he intended to" 
[24] "bequeath it.  In the society of his nephew and niece, and their"        
[25] "children, the old Gentleman's days were comfortably spent.  His"        
[26] "attachment to them all increased.  The constant attention of Mr. and"   
[27] "Mrs. Henry Dashwood to his wishes, which proceeded not merely from"     
[28] "interest, but from goodness of heart, gave him every degree of solid"   
[29] "comfort which his age could receive; and the cheerfulness of the"       
[30] "children added a relish to his existence."                              
class(sensesensibility)
[1] "character"
sensesensibility[110]
[1] "father's decease; but the indelicacy of her conduct was so much the"

We see that this is a character vector containing many strings, one per line of text. We want to combine these strings into one.

sensesensibility_test <- paste(sensesensibility, collapse = " ")
class(sensesensibility_test)
[1] "character"
#print(sensesensibility_test)
#Output is too long, so I won't render it
#We'll show this in class

WOW, this is really long. The character vector now contains only one element, which is the whole book. Next, we want to split it into chapters.

sensesensibility2 <- unlist(strsplit(sensesensibility_test, "CHAPTER [0-9]+   "))

print(sensesensibility2[1])
[1] "SENSE AND SENSIBILITY  by Jane Austen  (1811)     "
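
To see what paste() and strsplit() are doing here, it can help to try them on a vector small enough to read. The sketch below uses an invented toy_lines vector, not the Austen text. As a stylistic assumption rather than a correction, it matches any run of whitespace after the chapter number with \\s+ instead of exactly three spaces, then trims the pieces with trimws().

```r
# An invented miniature "book", one element per line (illustration only)
toy_lines <- c("A TOY TITLE", "", "CHAPTER 1", "", "", "First chapter text.",
               "CHAPTER 2", "", "", "Second chapter text.")
length(toy_lines)   # many elements, one per line

# Collapse all lines into a single string
toy_book <- paste(toy_lines, collapse = " ")
length(toy_book)    # a single element holding the whole "book"

# Split on the chapter headings and strip leftover whitespace
toy_pieces <- trimws(unlist(strsplit(toy_book, "CHAPTER [0-9]+\\s+")))
length(toy_pieces)  # front matter plus one piece per chapter
toy_pieces
```

Note that length(), unlike class(), shows directly how many elements a character vector holds at each step.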

Now we convert the character vector to a corpus.

snse2corpus <- corpus(sensesensibility2)
sensesummary <- summary(snse2corpus)
print(sensesummary)
Corpus consisting of 51 documents, showing 51 documents:

   Text Types Tokens Sentences
  text1     9      9         1
  text2   560   1810        49
  text3   611   2339        99
  text4   606   1785        79
  text5   682   2255        88
  text6   463   1171        33
  text7   557   1499        44
  text8   523   1424        41
  text9   511   1452        62
 text10   729   2191        79
 text11   774   2412        82
 text12   590   1636        58
 text13   628   1965        68
 text14   653   2637       130
 text15   577   1752        70
 text16   794   3003       138
 text17   745   2386       103
 text18   612   2063       114
 text19   611   1823        77
 text20   898   3502       139
 text21   650   3032       154
 text22   914   3455       107
 text23   819   3437       120
 text24   814   2710        69
 text25   719   2481        82
 text26   660   2248        65
 text27   817   2929        95
 text28   843   2973       115
 text29   581   1689        59
 text30  1071   4636       172
 text31   900   3729       166
 text32  1114   4556       156
 text33   853   3011        99
 text34   916   3601       138
 text35   879   3117        69
 text36   731   2850        95
 text37   965   3556       100
 text38  1126   5484       167
 text39   838   3755       119
 text40   707   2396        54
 text41   759   3217       114
 text42   818   3246        84
 text43   709   2081        34
 text44  1013   4030        99
 text45  1363   6959       228
 text46   745   2556        67
 text47  1016   3419        87
 text48   785   2869        90
 text49   523   1601        60
 text50  1155   4955       110
 text51   871   2884        57

Why are there NINE tokens in text1 (the book title)? Hint: find the answer by running the tokens() function on that document.
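
Once you have worked that out, a natural next step is to set the title page aside and record a chapter number for each remaining document. The following is only a sketch on invented text. The names toy_pieces, chapter_corpus, and the chapter variable are hypothetical, and the sketch assumes, as in the split above, that the first piece is front matter rather than a chapter.

```r
library(quanteda)

# Invented pieces standing in for the output of strsplit() (illustration only)
toy_pieces <- c("A TOY TITLE", "First chapter text.", "Second chapter text.")

# Drop the first piece, assumed here to be the title page
chapters <- toy_pieces[-1]

# Build the corpus with readable document names and a chapter docvar
chapter_corpus <- corpus(
  chapters,
  docnames = paste0("chapter", seq_along(chapters)),
  docvars  = data.frame(chapter = seq_along(chapters))
)

ndoc(chapter_corpus)      # number of chapters
docvars(chapter_corpus)   # metadata: one chapter number per document
```

With a chapter docvar in place, corpus_subset() can select chapters by number, just as we did with the book variable earlier.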