Recall that one of our first exercises using text as data was to analyze State of the Union addresses by former presidents – particularly as they relate to concepts like the military and the economy. To do this, we made at least two important assumptions:
Presidents may touch on various policy areas and themes throughout these speeches.
Even though they may draw from different perspectives and contexts, we can assert a general vocabulary to identify those thematic elements – e.g., I can use terms like navy, army, marines, etc. to reasonably identify periods where the president is discussing military matters.
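That keyword-based approach can be sketched in a few lines of base R. The snippets and the term list below are hypothetical, just to illustrate the mechanics of counting dictionary hits:

```r
# Hypothetical speech snippets (not real SOTU text)
speeches <- c(
  "the navy and the army need modern ships",
  "taxes and tariffs burden trade and the economy"
)
military_terms <- c("navy", "army", "marines")  # assumed dictionary

# Count how many military keywords appear in each snippet
counts <- vapply(speeches, function(s) {
  tokens <- strsplit(tolower(s), "\\s+")[[1]]
  sum(tokens %in% military_terms)
}, integer(1))
unname(counts)  # first snippet: 2 hits; second: 0
```

A real analysis would of course use a much larger dictionary and handle stemming, but the logic is the same.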
If we were to apply a clustering algorithm to these speeches, we may incidentally obscure our ability to correctly identify these various discussion points. Even clustering algorithms that use soft partitioning are likely to draw documents towards a single cluster center. But when we have documents where themes and other distinct literary elements are in flux throughout, we may want to instead rely on topic models.
Topic Models are similar to clustering methods with an important caveat: rather than assign each document to only one cluster, topic models assign each document proportional membership in all categories (topics). That is, topic models suppose that each document is a mixture across categories – a mixed membership model (GSR Ch. 13). The key benefit here is that we don’t need a single topic to be representative of an entire document on its own. For instance, in trying to identify how presidents discuss the economy, topic models may help us understand this through the lens of more discrete items like the welfare state, taxation, foreign trade, or other topics within the broader scope of discussion related to economic policy – we don’t need to say that only one of these should be used to describe a speech (like a SOTU address) that touches on all of them!
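To make the contrast concrete: a hard clustering gives each document a one-hot assignment, while a mixed-membership model spreads the document across all topics. The proportions below are made up for illustration:

```r
# Hard clustering: the document belongs to exactly one category (one-hot)
hard_assignment  <- c(welfare = 0, taxation = 1, trade = 0)

# Mixed membership: the document is a mixture over all categories
# (hypothetical proportions -- they must sum to 1)
mixed_membership <- c(welfare = 0.25, taxation = 0.45, trade = 0.30)

sum(hard_assignment)   # both vectors sum to 1...
sum(mixed_membership)  # ...but only the mixture can cover several themes at once
```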
Much like how k-means is the canonical clustering algorithm, Latent Dirichlet Allocation (LDA) is the canonical topic model. First introduced in 2003, LDA is a generative probabilistic model – meaning that it explicitly posits a data generating process for documents where documents are mixtures of latent topics and words are drawn conditional on those topics. In particular, it tries to recover:
Document-Topic Distributions – where each document is represented a distribution over (perhaps multiple) different topics, each with a certain probability.
Topic-Word Distributions – where each topic is represented as a distribution over words, meaning that each topic is defined by a set of words (again all with an associated probability of appearing in that topic).
As you might be able to infer, LDA assumes that each word position \(m\) in document \(i\) gets a topic indicator \(Z_{im}\) and an observed word token \(W_{im}\), both drawn from multinomial distributions, with Dirichlet priors on the document-topic proportions \(\pi_i\) and the topic-word distributions \(\mu_k\), where:
\[ Z_{im} \sim \text{Multinomial}(1,\pi_i) \quad \text{Topic Indicator for Each Word Position} \\ W_{im} \sim \text{Multinomial}(1, \mu_{Z_{im}}) \quad \text{Observed Word Token for Each Word Position} \] So the probability of observing word token \(j\) at position \(m\) of document \(i\) (\(W_{im} = j\)), conditional on both the document-topic weights (\(\pi_{ik}\)) and the topic-word distributions (\(\mu_k\)), is a weighted average of topic-specific word probabilities, where the weights are given by the document’s topic proportions:
\[ p(W_{im} = j \mid \pi_i,\mu) = \sum_{k=1}^{K}\pi_{ik}\mu_{jk} \]
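We can sanity-check this identity by simulating LDA’s generative step for a single document in base R. The vocabulary size, topic-word rows, and Dirichlet parameters below are all made up for illustration:

```r
set.seed(42)
rdirichlet1 <- function(alpha) { g <- rgamma(length(alpha), alpha); g / sum(g) } # one Dirichlet draw

K <- 2; V <- 4                                  # two topics, four-word vocabulary (toy sizes)
mu <- rbind(c(0.70, 0.20, 0.05, 0.05),          # topic 1's word distribution (hypothetical)
            c(0.05, 0.05, 0.20, 0.70))          # topic 2's word distribution
pi_i <- rdirichlet1(c(1, 1))                    # document i's topic proportions

M <- 100000                                     # word positions to simulate
Z <- sample(K, M, replace = TRUE, prob = pi_i)  # topic indicator for each position
W <- integer(M)
for (k in 1:K) {                                # word token for each position, given its topic
  idx <- Z == k
  W[idx] <- sample(V, sum(idx), replace = TRUE, prob = mu[k, ])
}

expected <- as.vector(pi_i %*% mu)              # sum_k pi_ik * mu_jk for each word j
observed <- tabulate(W, V) / M                  # empirical token frequencies
round(rbind(expected, observed), 3)             # the two rows should match closely
```

With 100,000 simulated positions, the empirical word frequencies land within sampling error of the weighted average the formula predicts.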
library(topicmodels)
data("AssociatedPress") # AP News Articles as a DocumentTermMatrix
dtm <- AssociatedPress
lda_model <- LDA(dtm, k = 5, method = "Gibbs", control = list(seed = 1234)) # LDA w/ 5 Topics & Gibbs Sampling
topicmodels::terms(lda_model, 10) # Top 10 Words Per Topic
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
## [1,] "i" "percent" "court" "i" "government"
## [2,] "people" "million" "police" "president" "soviet"
## [3,] "two" "year" "two" "bush" "united"
## [4,] "air" "billion" "state" "house" "states"
## [5,] "years" "new" "case" "new" "two"
## [6,] "new" "company" "years" "committee" "military"
## [7,] "just" "last" "federal" "congress" "people"
## [8,] "city" "market" "department" "dukakis" "police"
## [9,] "like" "prices" "attorney" "national" "union"
## [10,] "first" "stock" "drug" "campaign" "officials"
topic_probs <- posterior(lda_model)$topics # Prob Document i Belongs to Topic K
print(round(topic_probs[1:10, 1:5], 2)) # First 10 Documents Across the 5 Topics
## 1 2 3 4 5
## [1,] 0.20 0.05 0.61 0.06 0.08
## [2,] 0.07 0.31 0.12 0.20 0.30
## [3,] 0.29 0.10 0.44 0.08 0.08
## [4,] 0.30 0.17 0.13 0.14 0.25
## [5,] 0.28 0.14 0.18 0.26 0.15
## [6,] 0.06 0.08 0.08 0.38 0.40
## [7,] 0.22 0.30 0.21 0.11 0.16
## [8,] 0.13 0.13 0.10 0.39 0.25
## [9,] 0.25 0.19 0.18 0.18 0.21
## [10,] 0.18 0.13 0.11 0.27 0.30
word_probs <- posterior(lda_model)$terms # Prob Word J in Topic K
head(round(word_probs[, 1:5], 5))
## aaron abandon abandoned abandoning abbott
## 1 0.00000 0.00000 0.00039 0e+00 0e+00
## 2 0.00000 0.00000 0.00000 0e+00 0e+00
## 3 0.00000 0.00000 0.00000 0e+00 1e-04
## 4 0.00011 0.00016 0.00001 7e-05 0e+00
## 5 0.00000 0.00000 0.00006 0e+00 0e+00
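To see the mixed-membership point in the output above: each row of `topic_probs` is a full distribution over the five topics. Re-entering the printed (rounded) values, we can check that rows sum to roughly 1 and that no document is anywhere near a one-hot assignment:

```r
# First ten rows of topic_probs as printed above (rounded to 2 d.p.)
gamma_head <- matrix(c(
  0.20, 0.05, 0.61, 0.06, 0.08,
  0.07, 0.31, 0.12, 0.20, 0.30,
  0.29, 0.10, 0.44, 0.08, 0.08,
  0.30, 0.17, 0.13, 0.14, 0.25,
  0.28, 0.14, 0.18, 0.26, 0.15,
  0.06, 0.08, 0.08, 0.38, 0.40,
  0.22, 0.30, 0.21, 0.11, 0.16,
  0.13, 0.13, 0.10, 0.39, 0.25,
  0.25, 0.19, 0.18, 0.18, 0.21,
  0.18, 0.13, 0.11, 0.27, 0.30
), nrow = 10, byrow = TRUE)

rowSums(gamma_head)        # each document's proportions sum to ~1 (up to rounding)
apply(gamma_head, 1, max)  # the largest single share -- well below 1 for every row
```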
GSR spend some time talking about upstream and downstream structural methods (though we won’t get too bogged down in it…), but there is a lot of merit to the Structural Topic Model (STM) – which provides a mechanism to add covariates that help explain both topic prevalence (\(X^p_i\)) and topic content (\(X^c_i\)). It works like a traditional topic model but also lets you introduce information about the documents themselves. For each document, the model first decides how much of each topic is likely to appear (topic prevalence) – this can be driven by features like authorship, publication date, etc.
Then, for each word in the document, the model picks a topic according to those proportions and generates the word from that topic. What makes STM especially powerful is that the way topics are expressed in words (topic content) can also shift depending on covariates (\(X^c_i\)). Not only can you see which topics are more common in certain types of documents, but you can also see how the language used to discuss a topic changes with different document characteristics. It’s a way to combine topic discovery with regression-style analysis, giving you richer insights into both what topics are present and how they’re talked about.
texts <- apply(AssociatedPress, 1, function(x) {
paste(rep(colnames(AssociatedPress), x), collapse = " ")
}) # Rebuild Pseudo-Text From the DTM by Repeating Each Term by Its Count (Densifies the Matrix)
reduce_complexity <- function(text){
text <- tolower(text) # Lower Case
text <- tm::removePunctuation(text) # Punctuation
text <- tm::removeNumbers(text) # Numbers
text <- tm::removeWords(text, tm::stopwords("english")) # Stop Words
text <- unlist(stringr::str_split(text, '\\s+')) # Tokenize
text <- textstem::lemmatize_words(text) # Lemmatize
text <- paste(text, collapse = ' ') # Re-Append
text <- gsub("\\s{2,}", ' ', text) # 2 or More Spaces --> One Space
text <- trimws(text) # White Space
return(text)
} # Complexity Reduction (See Class 5!)
texts_reduced <- vapply(texts, reduce_complexity, character(1))
corpus_reduced <- tm::VCorpus(tm::VectorSource(texts_reduced))
dtm_reduced <- tm::DocumentTermMatrix(
corpus_reduced,
control = list(wordLengths = c(1, Inf))
)
dfm_ap <- quanteda::as.dfm(dtm_reduced)
docs_stm <- quanteda::convert(dfm_ap, to = "stm")
meta <- data.frame(dummy = rep(1, length(docs_stm$documents))) # No Real Metadata --> Intercept-Only
stm_model <- stm::stm(
documents = docs_stm$documents,
vocab = docs_stm$vocab,
K = 5,
prevalence = ~1,
data = meta,
init.type = "Spectral",
seed = 1234,
max.em.its = 500,
verbose = FALSE
)
stm::labelTopics(stm_model, n = 10) # Top Words/Topic (Ranked by Probability, FREX, Lift, and Score)
## Topic 1 Top Words:
## Highest Prob: court, year, case, charge, say, two, attorney, go, police, school
## FREX: court, attorney, judge, trial, sentence, justice, lawyer, book, mother, convict
## Lift: male, moore, souter, abrams, abyss, acquit, actress, adjourn, adkins, admission
## Score: court, attorney, sentence, trial, police, lawyer, judge, prison, jury, convict
## Topic 2 Top Words:
## Highest Prob: official, report, air, department, state, two, service, test, area, fire
## FREX: water, plane, flight, ship, accident, space, pilot, navy, pentagon, aircraft
## Lift: virus, airplane, aspirin, bay, bomber, cargo, carrier, creek, crow, defect
## Score: plane, flight, pilot, nasa, aircraft, shuttle, airline, accident, ship, pentagon
## Topic 3 Top Words:
## Highest Prob: percent, million, year, market, price, company, billion, trade, new, stock
## FREX: market, price, stock, dollar, sale, cent, index, yen, gold, investor
## Lift: chrysler, trader, troy, abboud, acquisition, acustar, aftertax, ag, alkaline, alltime
## Score: stock, cent, percent, price, market, yen, index, rate, dollar, billion
## Topic 4 Top Words:
## Highest Prob: soviet, government, official, people, police, state, unite, party, force, military
## FREX: soviet, army, troop, gorbachev, communist, soldier, rebel, israel, africa, israeli
## Lift: abductor, abu, afghanistan, africas, ahmed, alfredo, algeria, algerian, algiers, amal
## Score: soviet, communist, police, gorbachev, troop, minister, guerrilla, palestinian, party, soldier
## Topic 5 Top Words:
## Highest Prob: bush, president, state, house, year, vote, new, go, say, make
## FREX: bush, dukakis, republican, senate, budget, sen, democrat, rep, jackson, legislation
## Lift: cavazos, dmass, sens, ad, legislation, marlin, nbc, abcs, accomplishment, adhere
## Score: dukakis, bush, republican, tax, percent, campaign, poll, democratic, budget, vote
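The fit above used `prevalence = ~1` because the AssociatedPress data ships with no document metadata. With real metadata, covariates enter via standard R formulas; the variables below (`source`, `day`) are hypothetical, and the fitting call is shown but not run:

```r
# Hypothetical metadata frame: one row per document
meta_cov <- data.frame(
  source = factor(sample(c("wire", "print"), 100, replace = TRUE)),
  day    = 1:100
)

prevalence_formula <- ~ source + s(day)  # prevalence varies by outlet and smoothly over time
content_formula    <- ~ source           # word choice within a topic varies by outlet

# Not run -- would require the documents/vocab objects from above:
# stm_cov <- stm::stm(docs_stm$documents, docs_stm$vocab, K = 5,
#                     prevalence = prevalence_formula,
#                     content = content_formula,
#                     data = meta_cov, seed = 1234)
```

Note that stm allows spline terms like `s(day)` in the prevalence formula, while the content covariate must be a single factor variable.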
proportions <- stm_model$theta # Document-Topic Proportions
round(proportions[1:10, ], 2)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.86 0.05 0.00 0.08 0.01
## [2,] 0.27 0.04 0.18 0.48 0.02
## [3,] 0.83 0.02 0.04 0.09 0.01
## [4,] 0.35 0.04 0.11 0.34 0.15
## [5,] 0.57 0.04 0.02 0.02 0.35
## [6,] 0.01 0.24 0.01 0.25 0.50
## [7,] 0.11 0.44 0.27 0.13 0.05
## [8,] 0.01 0.01 0.02 0.17 0.79
## [9,] 0.06 0.81 0.04 0.05 0.04
## [10,] 0.03 0.04 0.04 0.54 0.35
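A common follow-up is to ask which topic dominates each document. Re-entering the printed (rounded) theta rows above and taking the column-wise maximum per row:

```r
# First ten rows of stm_model$theta as printed above (rounded to 2 d.p.)
theta_head <- matrix(c(
  0.86, 0.05, 0.00, 0.08, 0.01,
  0.27, 0.04, 0.18, 0.48, 0.02,
  0.83, 0.02, 0.04, 0.09, 0.01,
  0.35, 0.04, 0.11, 0.34, 0.15,
  0.57, 0.04, 0.02, 0.02, 0.35,
  0.01, 0.24, 0.01, 0.25, 0.50,
  0.11, 0.44, 0.27, 0.13, 0.05,
  0.01, 0.01, 0.02, 0.17, 0.79,
  0.06, 0.81, 0.04, 0.05, 0.04,
  0.03, 0.04, 0.04, 0.54, 0.35
), nrow = 10, byrow = TRUE)

# Dominant topic per document (max.col breaks exact ties at random;
# none occur in these rounded rows)
max.col(theta_head)  # 1 4 1 1 1 5 2 5 2 4
```

Even so, the proportions remind us that "dominant" rarely means "exclusive" – document 4, for example, splits almost evenly between topics 1 and 4.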
topic_word_probs <- exp(stm_model$beta$logbeta[[1]]) # Topic-Word Probabilities (Rows = Topics)
head(round(topic_word_probs[,1:5], 5))
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.00079 0.00044 0.00032 0.00026 0.00009
## [2,] 0.00086 0.00000 0.00000 0.00014 0.00019
## [3,] 0.00078 0.00079 0.00029 0.00011 0.00004
## [4,] 0.00043 0.00041 0.00012 0.00000 0.00002
## [5,] 0.00071 0.00055 0.00021 0.00016 0.00001