Topic Models

Recall that one of our first exercises using text as data was to analyze State of the Union addresses by former presidents – particularly as it relates to concepts like the military and the economy. To do this, we made at least two important assumptions:

  1. Presidents may touch on various policy areas and themes throughout these speeches.

  2. Even though they may draw from different perspectives and contexts, we can define a word or assert a general vocabulary to identify those thematic elements – e.g., I can use terms like navy, army, marines, etc. to reasonably identify periods where the president is discussing military matters.

If we were to apply a clustering algorithm to these speeches, we may be incidentally obscuring our ability to correctly identify these various discussion points. Even clustering algorithms that use soft partitioning are likely to draw documents towards a single cluster center. But when we have documents where themes and other distinct literary elements are in flux throughout, we may want to instead rely on topic models.

Topic Models are similar to clustering methods with an important caveat: rather than assign each document to only one cluster, topic models assign each document with proportional membership to all categories (topics). That is, topic models suppose that each document is a mixture across categories – a mixed membership model (GSR Ch. 13). The key benefit here is that we don’t need a single topic to be representative an entire document on its own. For instance, in trying to identify how presidents may discuss the economy, topic models may help us understand this through the lens of more discrete items like the welfare state, taxation, foreign trade or other topics within the broader scope of discussion related to economic policy – we don’t need to say that only one of these should be used to discuss a speech (like a SOTU address) that touches on all of them!

Latent Dirichlet Allocation

Much like how k-means is the canonical clustering algorithm, Latent Dirichlet Allocation (LDA) is the canonical topic model. First introduced in 2003, LDA is a generative probabilistic model – meaning that it explicitly posits a data generating process for documents where documents are mixtures of latent topics and words are drawn conditional on those topics. In particular, it tries to recover:

  1. Document-Topic Distributions – where each document is represented a distribution over (perhaps multiple) different topics, each with a certain probability.

  2. Topic-Word Distributions – where each topic is represented as a distribution over words, meaning that each topic is defined by a set of words (again all with an associated probability of appearing in that topic).

As you might be able to infer, LDA assumes the document-topic \(Z_{im}\) and topic-word \(W_{im}\) distributions has a Dirichlet prior and draws from a multinomial distribution, where:

\[ Z_{im} \sim \text{Multinomial}(1,\pi_i) \quad \text{Topic Indicator for Each Word} \\ W_{im} \sim \text{Multinomial}(1, \mu{Z_{im}}) \quad \text{Observed Word Token For Each Word Position} \] So the probability of observing a word token in a document (\(W_{im} = j\)) conditional on both the document-topic weights (\(\pi_{iK}\)) and the topic-word distributions (\(\mu_k\)) is a weighted average of topic-specific word probabilities, where the weights are given by the document’s topic proportions (\(\pi_{iK}\)):

\[ p(W_{im} = j \mid \pi_i,\mu) = \sum_K\pi_{iK}\mu_{jK} \]

LDA Example:

library(topicmodels)
data("AssociatedPress")
dtm <- AssociatedPress

lda_model <- LDA(dtm, k = 5, method = "Gibbs", control = list(seed = 1234)) # LDA w/ 5 Topics & Gibbs Sampling 
topicmodels::terms(lda_model, 10) 
##       Topic 1  Topic 2   Topic 3      Topic 4     Topic 5     
##  [1,] "i"      "percent" "court"      "i"         "government"
##  [2,] "people" "million" "police"     "president" "soviet"    
##  [3,] "two"    "year"    "two"        "bush"      "united"    
##  [4,] "air"    "billion" "state"      "house"     "states"    
##  [5,] "years"  "new"     "case"       "new"       "two"       
##  [6,] "new"    "company" "years"      "committee" "military"  
##  [7,] "just"   "last"    "federal"    "congress"  "people"    
##  [8,] "city"   "market"  "department" "dukakis"   "police"    
##  [9,] "like"   "prices"  "attorney"   "national"  "union"     
## [10,] "first"  "stock"   "drug"       "campaign"  "officials"
topic_probs <- posterior(lda_model)$topics # Prob Document i Belongs to Topic K
print(round(topic_probs[c(1:10), c(1:5)], 2))
##          1    2    3    4    5
##  [1,] 0.20 0.05 0.61 0.06 0.08
##  [2,] 0.07 0.31 0.12 0.20 0.30
##  [3,] 0.29 0.10 0.44 0.08 0.08
##  [4,] 0.30 0.17 0.13 0.14 0.25
##  [5,] 0.28 0.14 0.18 0.26 0.15
##  [6,] 0.06 0.08 0.08 0.38 0.40
##  [7,] 0.22 0.30 0.21 0.11 0.16
##  [8,] 0.13 0.13 0.10 0.39 0.25
##  [9,] 0.25 0.19 0.18 0.18 0.21
## [10,] 0.18 0.13 0.11 0.27 0.30
word_probs <- posterior(lda_model)$terms # Prob Word J in Topic K
head(round(word_probs[, 1:5], 5))   
##     aaron abandon abandoned abandoning abbott
## 1 0.00000 0.00000   0.00039      0e+00  0e+00
## 2 0.00000 0.00000   0.00000      0e+00  0e+00
## 3 0.00000 0.00000   0.00000      0e+00  1e-04
## 4 0.00011 0.00016   0.00001      7e-05  0e+00
## 5 0.00000 0.00000   0.00006      0e+00  0e+00

Structural Topic Models

GRS spend some time talking about upstream and downstream structural methods (though we won’t get too bogged down in it…), but there is a lot of merit to the Structural Topic Model (STM) – which provides a mechanism to add covariates to help explain both topic prevalence \(X^P_i\) and topic content \(X^c_i\). It works like a traditional topic model but also lets you introduce information about the documents themselves \(X^p_i\). For each document, the model first decides how much of each topic is likely to appear (topic prevalence) – this can be driven by features like authorship, publication date, etc.

Then, for each word in the document, the model picks a topic according to those proportions and generates the word from that topic. What makes STM especially powerful is that the way topics are expressed in words (topic content) can also shift depending on covariates (\(X^c_i\)). Not only can you see which topics are more common in certain types of documents, but you can also see how the language used to discuss a topic changes with different document characteristics. It’s a way to combine topic discovery with regression-style analysis, giving you richer insights into both what topics are present and how they’re talked about.

STM Example:

texts <- apply(AssociatedPress, 1, function(x) {
  paste(rep(colnames(AssociatedPress), x), collapse = " ")
})

reduce_complexity <- function(text){
  text <- tolower(text) # Lower Case
  text <- tm::removePunctuation(text) # Punctuation
  text <- tm::removeNumbers(text) # Numbers
  text <- removeWords(text, tm::stopwords("english")) # Stop Words
  text <- unlist(stringr::str_split(text, '\\s+')) # Tokenize 
  text <- textstem::lemmatize_words(text) # Lemmatize
  text <- paste(text, collapse = ' ') # Re-Append
  text <- gsub("\\s{2,}", ' ', text) # 2 or More Spaces --> One Space
  text <- trimws(text) # White Space
  return(text)
} # Complexity Reduction (See Class 5!)

texts_reduced <- vapply(texts, reduce_complexity, character(1))

corpus_reduced <- VCorpus(VectorSource(texts_reduced))

dtm_reduced <- DocumentTermMatrix(
  corpus_reduced,
  control = list(wordLengths = c(1, Inf))
)

dfm_ap <- as.dfm(dtm_reduced)

docs_stm <- convert(dfm_ap, to = "stm")

meta <- data.frame(dummy = rep(1, length(docs_stm$documents)))

stm_model <- stm::stm(
  documents  = docs_stm$documents,
  vocab      = docs_stm$vocab,
  K          = 5,
  prevalence = ~1,
  data       = meta,
  init.type  = "Spectral",
  seed       = 1234,
  max.em.its = 500,
  verbose    = FALSE
)

stm::labelTopics(stm_model, n = 10) # Top Words/Topic
## Topic 1 Top Words:
##       Highest Prob: court, year, case, charge, say, two, attorney, go, police, school 
##       FREX: court, attorney, judge, trial, sentence, justice, lawyer, book, mother, convict 
##       Lift: male, moore, souter, abrams, abyss, acquit, actress, adjourn, adkins, admission 
##       Score: court, attorney, sentence, trial, police, lawyer, judge, prison, jury, convict 
## Topic 2 Top Words:
##       Highest Prob: official, report, air, department, state, two, service, test, area, fire 
##       FREX: water, plane, flight, ship, accident, space, pilot, navy, pentagon, aircraft 
##       Lift: virus, airplane, aspirin, bay, bomber, cargo, carrier, creek, crow, defect 
##       Score: plane, flight, pilot, nasa, aircraft, shuttle, airline, accident, ship, pentagon 
## Topic 3 Top Words:
##       Highest Prob: percent, million, year, market, price, company, billion, trade, new, stock 
##       FREX: market, price, stock, dollar, sale, cent, index, yen, gold, investor 
##       Lift: chrysler, trader, troy, abboud, acquisition, acustar, aftertax, ag, alkaline, alltime 
##       Score: stock, cent, percent, price, market, yen, index, rate, dollar, billion 
## Topic 4 Top Words:
##       Highest Prob: soviet, government, official, people, police, state, unite, party, force, military 
##       FREX: soviet, army, troop, gorbachev, communist, soldier, rebel, israel, africa, israeli 
##       Lift: abductor, abu, afghanistan, africas, ahmed, alfredo, algeria, algerian, algiers, amal 
##       Score: soviet, communist, police, gorbachev, troop, minister, guerrilla, palestinian, party, soldier 
## Topic 5 Top Words:
##       Highest Prob: bush, president, state, house, year, vote, new, go, say, make 
##       FREX: bush, dukakis, republican, senate, budget, sen, democrat, rep, jackson, legislation 
##       Lift: cavazos, dmass, sens, ad, legislation, marlin, nbc, abcs, accomplishment, adhere 
##       Score: dukakis, bush, republican, tax, percent, campaign, poll, democratic, budget, vote
proportions <- stm_model$theta # Document-Topic Proportions
round(proportions[1:10, ], 2)
##       [,1] [,2] [,3] [,4] [,5]
##  [1,] 0.86 0.05 0.00 0.08 0.01
##  [2,] 0.27 0.04 0.18 0.48 0.02
##  [3,] 0.83 0.02 0.04 0.09 0.01
##  [4,] 0.35 0.04 0.11 0.34 0.15
##  [5,] 0.57 0.04 0.02 0.02 0.35
##  [6,] 0.01 0.24 0.01 0.25 0.50
##  [7,] 0.11 0.44 0.27 0.13 0.05
##  [8,] 0.01 0.01 0.02 0.17 0.79
##  [9,] 0.06 0.81 0.04 0.05 0.04
## [10,] 0.03 0.04 0.04 0.54 0.35
topic_word_probs <- exp(stm_model$beta$logbeta[[1]]) #Topic Word Probabilities
head(round(topic_word_probs[,1:5], 5))
##         [,1]    [,2]    [,3]    [,4]    [,5]
## [1,] 0.00079 0.00044 0.00032 0.00026 0.00009
## [2,] 0.00086 0.00000 0.00000 0.00014 0.00019
## [3,] 0.00078 0.00079 0.00029 0.00011 0.00004
## [4,] 0.00043 0.00041 0.00012 0.00000 0.00002
## [5,] 0.00071 0.00055 0.00021 0.00016 0.00001