A corpus is a structured collection of text documents. Using packages like quanteda and tm, you can create a corpus from raw text data (like a character vector or a column in a dataframe). Once your text is in a corpus, you can easily perform common text-analysis tasks such as tokenization, creating document-feature matrices, and calculating word frequencies.

Below are two ways to create a corpus in R – I am not partial to either, though the latter will be a bit more intuitive (I think) for adding metadata. We’ll use the same data for both.

Using tm

library(tm) # Load tm

texts <- c(
  "The quick brown fox jumps over the lazy dog.",
  "Data science is revolutionizing the way we analyze information.",
  "Text analysis in R is fun and informative!"
) # Sample Texts (as vector)

tm_corpus <- tm::VCorpus(VectorSource(texts)) # Create Corpus from texts

tm::inspect(tm_corpus) # Inspect
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 44
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 63
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 42
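Note that inspect() reports metadata and character counts rather than the text itself. To see the underlying strings, you can coerce individual documents with as.character() – a small sketch below (the corpus is re-created so the chunk runs on its own):

```r
library(tm)

texts <- c(
  "The quick brown fox jumps over the lazy dog.",
  "Data science is revolutionizing the way we analyze information.",
  "Text analysis in R is fun and informative!"
)
corp <- VCorpus(VectorSource(texts)) # Same Corpus as Above

as.character(corp[[1]])    # The Underlying Text of the First Document
lapply(corp, as.character) # All Documents as Plain Character Vectors
```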
tm_corpus_clean <- tm::tm_map(tm_corpus, content_transformer(tolower)) # Convert All to Lowercase
tm_corpus_clean <- tm::tm_map(tm_corpus_clean, removePunctuation) # Remove Punctuation
tm_corpus_clean <- tm::tm_map(tm_corpus_clean, removeNumbers) # Remove Numbers
tm_corpus_clean <- tm::tm_map(tm_corpus_clean, removeWords, stopwords("english")) # Remove English Stopwords

tm::inspect(tm_corpus_clean) # Notice the Character Counts Have Dropped
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 3
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 33
## 
## [[2]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 55
## 
## [[3]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 34
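With a cleaned corpus in hand, a natural next step in tm is a document-term matrix, from which you can compute word frequencies. A sketch below, repeating the cleaning pipeline from above so the chunk runs on its own:

```r
library(tm)

texts <- c(
  "The quick brown fox jumps over the lazy dog.",
  "Data science is revolutionizing the way we analyze information.",
  "Text analysis in R is fun and informative!"
)

corp <- VCorpus(VectorSource(texts)) # Same Cleaning Steps as Above
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(corp) # Rows = Documents, Columns = Terms
inspect(dtm)

sort(colSums(as.matrix(dtm)), decreasing = TRUE) # Word Frequencies Across the Corpus
```

Since the stopwords were removed before building the matrix, words like "the" and "is" never make it into the columns.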


Using quanteda

library(quanteda) # Load quanteda
library(tibble)   # For tibble()

texts_with_meta <- tibble(
  doc_id = c("sentence_1", "sentence_2", "sentence_3"),
  text = texts, 
  author = c('Josh', 'Leo', 'Toby'),
  date = as.Date(c("2025-01-01", "2025-01-02", "2025-01-03"))
) # Create Metadata for Texts (Same as tm example!)

quanteda_corpus <- corpus(texts_with_meta, text_field = "text")

summary(quanteda_corpus) # Inspect the Corpus
## Corpus consisting of 3 documents, showing 3 documents:
## 
##        Text Types Tokens Sentences author       date
##  sentence_1    10     10         1   Josh 2025-01-01
##  sentence_2    10     10         1    Leo 2025-01-02
##  sentence_3     9      9         1   Toby 2025-01-03
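From a quanteda corpus, the usual pipeline is tokenize, then build a document-feature matrix (dfm) – quanteda's analogue of tm's document-term matrix. A sketch using the same three sample sentences (re-defined so the chunk runs on its own):

```r
library(quanteda)

texts <- c(
  "The quick brown fox jumps over the lazy dog.",
  "Data science is revolutionizing the way we analyze information.",
  "Text analysis in R is fun and informative!"
)

toks <- tokens(texts, remove_punct = TRUE)        # Tokenize, Dropping Punctuation
toks <- tokens_tolower(toks)                      # Lowercase (So Stopwords Match)
toks <- tokens_remove(toks, stopwords("english")) # Remove English Stopwords
dfm_mat <- dfm(toks)                              # Document-Feature Matrix
topfeatures(dfm_mat)                              # Most Frequent Features
```

Notice that quanteda handles cleaning through arguments and dedicated tokens_* functions rather than tm_map() calls – the result is the same kind of documents-by-words matrix.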

Example: Lies, Damn Lies and Statistics (2000)

Let’s use the same example from the West Wing episode Lies, Damn Lies and Statistics (2000). Recall that our data is organized as a single dataframe with one row per line of dialogue – for every character we care about – listing the speaking character and what they said, along with columns for word count and dialogue order throughout the episode.

head(damn_lies, 10)
## # A tibble: 10 × 4
##    character dialogue                                        id word_count
##    <chr>     <chr>                                        <int>      <int>
##  1 DONNA     They got to start the poll, Josh. It's 7:05.     1          9
##  2 JOSH      It's ten to seven.                               2          4
##  3 DONNA     No, it's really not.                             3          4
##  4 JOSH      It's 7:05?                                       4          2
##  5 DONNA     Yeah.                                            5          1
##  6 JOSH      That's ridiculous.                               6          2
##  7 DONNA     I'm not making it up.                            7          5
##  8 JOSH      My watch says ten to seven.                      8          6
##  9 DONNA     That's 'cause your watch sucks.                  9          5
## 10 JOSH      My watch is fine.                               10          4

Let’s use quanteda to construct a corpus from Lies, Damn Lies and Statistics:

damn_lies_corpus <- quanteda::corpus(damn_lies, text_field = "dialogue") # Create Corpus (Text = 'dialogue')

summary(damn_lies_corpus[1:10]) # Inspect (Just the First Ten Documents)
## Corpus consisting of 10 documents, showing 10 documents:
## 
##    Text Types Tokens Sentences character id word_count
##   text1    13     14         2     DONNA  1          9
##   text2     5      5         1      JOSH  2          4
##   text3     6      6         1     DONNA  3          4
##   text4     5      5         1      JOSH  4          2
##   text5     2      2         1     DONNA  5          1
##   text6     3      3         1      JOSH  6          2
##   text7     6      6         1     DONNA  7          5
##   text8     7      7         1      JOSH  8          6
##   text9     7      7         1     DONNA  9          5
##  text10     5      5         1      JOSH 10          4
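Because the other columns (character, id, word_count) came along as document variables, you can aggregate by speaker with dfm_group(). The sketch below uses a hypothetical damn_lies_sample stand-in built from the first four lines shown above, since the full damn_lies dataframe isn’t re-created here:

```r
library(quanteda)
library(tibble)

# Hypothetical stand-in for damn_lies: the first four lines shown above
damn_lies_sample <- tibble(
  character = c("DONNA", "JOSH", "DONNA", "JOSH"),
  dialogue = c(
    "They got to start the poll, Josh. It's 7:05.",
    "It's ten to seven.",
    "No, it's really not.",
    "It's 7:05?"
  )
)

corp <- corpus(damn_lies_sample, text_field = "dialogue")
toks <- tokens(corp, remove_punct = TRUE)
dfmat <- dfm(toks)

# Collapse rows by the 'character' docvar: one row per speaker
dfm_by_char <- dfm_group(dfmat, groups = docvars(dfmat, "character"))
ntoken(dfm_by_char) # Total Token Counts per Character
```

Grouping like this is often the point of carrying metadata into the corpus in the first place – per-speaker (or per-author, per-date) frequencies fall out of one function call.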