A corpus is a structured collection of text documents.
Using packages like quanteda and tm, you can
create a corpus from raw text data (like a vector of strings or a column
in a dataframe). Once your text is in a corpus, you can easily perform
text analysis such as tokenization, creating document-feature matrices,
and calculating word frequencies.
Below are two ways to create a corpus in R – I am not
partial to either, though the latter will be a bit more intuitive (I
think) for adding metadata. We’ll use the same data for both.
library(tm) # Load tm
texts <- c(
"The quick brown fox jumps over the lazy dog.",
"Data science is revolutionizing the way we analyze information.",
"Text analysis in R is fun and informative!"
) # Sample Texts (as vector)
tm_corpus <- tm::VCorpus(tm::VectorSource(texts)) # Create Corpus from texts
tm::inspect(tm_corpus) # Inspect
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 44
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 63
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 42
tm_corpus_clean <- tm::tm_map(tm_corpus, tm::content_transformer(tolower)) # Convert All to Lowercase
tm_corpus_clean <- tm::tm_map(tm_corpus_clean, tm::removePunctuation) # Remove Punctuation
tm_corpus_clean <- tm::tm_map(tm_corpus_clean, tm::removeNumbers) # Remove Numbers
tm_corpus_clean <- tm::tm_map(tm_corpus_clean, tm::removeWords, tm::stopwords("english")) # Remove English Stopwords
tm::inspect(tm_corpus_clean) # Notice How A Ton of Characters Are Now Gone?
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 33
##
## [[2]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 55
##
## [[3]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 34
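As a quick sketch of where this cleaned corpus can go next, tm can turn it into a document-term matrix — one of the analyses mentioned at the top. The object names below are my own:

```r
dtm <- tm::DocumentTermMatrix(tm_corpus_clean) # Rows = Documents, Columns = Terms
tm::inspect(dtm) # Inspect the Matrix
sort(colSums(as.matrix(dtm)), decreasing = TRUE) # Word Frequencies Across All Documents
```

Summing down the columns gives corpus-wide word frequencies, since each column of the matrix is one term.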
library(quanteda) # Load quanteda
library(tibble) # For tibble()
texts_with_meta <- tibble(
doc_id = c("sentence_1", "sentence_2", "sentence_3"),
text = texts,
author = c('Josh', 'Leo', 'Toby'),
date = as.Date(c("2025-01-01", "2025-01-02", "2025-01-03"))
) # Create Metadata for Texts (Same as tm example!)
quanteda_corpus <- quanteda::corpus(texts_with_meta, text_field = "text") # Create Corpus (Text = 'text')
summary(quanteda_corpus) # Inspect the Corpus
## Corpus consisting of 3 documents, showing 3 documents:
##
## Text Types Tokens Sentences author date
## sentence_1 10 10 1 Josh 2025-01-01
## sentence_2 10 10 1 Leo 2025-01-02
## sentence_3 9 9 1 Toby 2025-01-03
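Because the metadata travels with the corpus, you can pull it back out — or add to it — at any time with `docvars()`. The `genre` field below is an invented example:

```r
quanteda::docvars(quanteda_corpus) # Retrieve the Metadata (author and date) as a Dataframe
quanteda::docvars(quanteda_corpus, "genre") <- "drama" # Add a New (Invented) Metadata Field
summary(quanteda_corpus) # The New Column Now Appears in the Summary
```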
Let’s use the same example from the West Wing episode Lies, Damn Lies and Statistics (2000). Recall that our data is organized as a single dataframe in which each row pairs a character we care about with one line of their dialogue, along with that line’s word count and its order within the episode.
head(damn_lies, 10)
## # A tibble: 10 × 4
## character dialogue id word_count
## <chr> <chr> <int> <int>
## 1 DONNA They got to start the poll, Josh. It's 7:05. 1 9
## 2 JOSH It's ten to seven. 2 4
## 3 DONNA No, it's really not. 3 4
## 4 JOSH It's 7:05? 4 2
## 5 DONNA Yeah. 5 1
## 6 JOSH That's ridiculous. 6 2
## 7 DONNA I'm not making it up. 7 5
## 8 JOSH My watch says ten to seven. 8 6
## 9 DONNA That's 'cause your watch sucks. 9 5
## 10 JOSH My watch is fine. 10 4
Let’s use quanteda to construct a corpus from Lies,
Damn Lies and Statistics:
damn_lies_corpus <- quanteda::corpus(damn_lies, text_field = "dialogue") # Create Corpus (Text = 'dialogue')
summary(damn_lies_corpus[1:10]) # Inspect (Just the First Ten Rows)
## Corpus consisting of 10 documents, showing 10 documents:
##
## Text Types Tokens Sentences character id word_count
## text1 13 14 2 DONNA 1 9
## text2 5 5 1 JOSH 2 4
## text3 6 6 1 DONNA 3 4
## text4 5 5 1 JOSH 4 2
## text5 2 2 1 DONNA 5 1
## text6 3 3 1 JOSH 6 2
## text7 6 6 1 DONNA 7 5
## text8 7 7 1 JOSH 8 6
## text9 7 7 1 DONNA 9 5
## text10 5 5 1 JOSH 10 4
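From here the corpus is ready for the analyses mentioned at the top: tokenization, a document-feature matrix, and word frequencies. A minimal sketch of that pipeline (the object names are my own):

```r
damn_lies_tokens <- quanteda::tokens(damn_lies_corpus, remove_punct = TRUE) # Tokenize the Dialogue
damn_lies_dfm <- quanteda::dfm(damn_lies_tokens) # Create Document-Feature Matrix (Lowercased by Default)
damn_lies_dfm <- quanteda::dfm_remove(damn_lies_dfm, quanteda::stopwords("english")) # Remove English Stopwords
quanteda::topfeatures(damn_lies_dfm, 10) # Ten Most Frequent Words
```

Note that the `character`, `id`, and `word_count` docvars carry through to the tokens and dfm objects, so you can still group or subset by speaker later on.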