Multinomial Language Model

The multinomial language model is a probabilistic framework used to model the distribution of words in a document or collection of documents. It assumes that each word in a document is drawn independently from a fixed vocabulary according to a categorical distribution – i.e., in accordance with the Bag of Words approach, where each word has a certain probability of occurring. The model is multinomial in that it considers the counts of each word in a document, rather than just the presence or absence of words. Formally, for a document represented as a sequence of word counts, the likelihood of observing the document is given by the multinomial probability mass function (PMF), which combines the factorial of the total word count with the product of the probabilities of each word raised to the power of its observed count. This framework forms the basis for many text modeling techniques, including Naive Bayes classifiers and topic models (which we will explore more this week and next…), and provides a straightforward way to estimate the probability of unseen documents given observed word frequencies.


GRS (Ch. 6) use a simplified three-word vocabulary (cat, dog, fish), where each document contains only a single token (i.e., one instance of a type). We are going to retain a similar structure, but add another word to our vocabulary:

hamburger = (1, 0, 0, 0)

salad = (0, 1, 0, 0)

taco = (0, 0, 1, 0)

nuggets = (0, 0, 0, 1)

Recall that this approach accords with the Bag of Words, where words are drawn individually and independently from a categorical distribution, \(W_i \sim \text{Categorical}(\boldsymbol{\mu})\) – where \(\boldsymbol{\mu}\) is a vector containing the probability of each individual type. For this example, let’s say \(\mu = (0.3, 0.25, 0.15, 0.3)\) – meaning that on any given trial, hamburger is drawn with probability 0.3, salad with 0.25, taco with 0.15, and nuggets with 0.3.

In other words, each word in the document is generated by independently sampling from these four categories according to their respective probabilities. Let’s assume we were interested in the probability of drawing the document (hamburger, hamburger, taco, nuggets). The resulting count vector would be (2, 0, 1, 1) – representing 2 instances of hamburger, 0 instances of salad, and 1 instance of both taco and nuggets.

Recall that the probability mass function for a categorical distribution is:

\[ p(\mathbf{W}_i \mid \boldsymbol{\mu}) = \prod_{j=1}^J \mu_j^{w_{ij}} \]

which we can generalize for documents longer than one word using the multinomial distribution, where \(M\) is the number of tokens (i.e., the length of the document):

\[ p(\mathbf{W}_i \mid \boldsymbol{\mu}) = \frac{M!}{\prod_{j=1}^J W_{ij}!} \prod_{j=1}^J \mu_{j}^{W_{ij}} \]

Substituting our values for the hypothetical document (hamburger, hamburger, taco, nuggets), we get:

\[ p(\texttt{H,H,T,N} \mid \mu) = \frac{4!}{(2_{H}!)(0_{S}!)(1_{T}!)(1_{N}!)} (0.3_H)^2 (0.25_S)^0 (0.15_T)^1 (0.3_N)^1 \]

\[ = \frac{4!}{2!\cdot0!\cdot1!\cdot1!} \cdot 0.09 \cdot 1 \cdot 0.15 \cdot 0.3 \]

\[ = 12 \cdot 0.00405 \] \[ p(\texttt{H,H,T,N} \mid \mu) \approx 0.0486 \]
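This hand calculation can be verified directly in R with dmultinom(), which evaluates the multinomial PMF:

```r
mu <- c(hamburger = 0.3, salad = 0.25, taco = 0.15, nuggets = 0.3)

# Count vector for the document (hamburger, hamburger, taco, nuggets)
dmultinom(x = c(2, 0, 1, 1), prob = mu)
## [1] 0.0486
```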


As GRS (Ch. 6) also note, an advantage of specifying a probability model is that it comes with a set of known results that follow from the modeling assumptions – e.g., the expectation, variance, and covariance of the word counts.

\[ \text{Expected number of times word } j \text{ appears in document } i: \] \[ E[\mathbf{W}_{ij}] = M_i\mu_j \]

\[ \text{Variance of the count of word } j \text{ in document } i: \] \[ \text{Var}(\mathbf{W}_{ij}) = M_i\mu_j(1-\mu_j) \]

\[ \text{Covariance of the counts of words } j \text{ and } k \text{ in document } i;\ j \neq k: \] \[ \text{Cov}(\mathbf{W}_{ij}, \mathbf{W}_{ik}) = -M_i\mu_j\mu_k \]

Putting it all together…

mu <- c(hamburger = 0.3,
        salad     = 0.25,
        taco      = 0.15,
        nuggets   = 0.3)  # Mu (Probs)

M <- 4  # Document length (number of tokens)


expectation_wij <- M * mu # i.e., If I repeatedly generated documents of length M from mu, this is the average number of times each word would appear.

expectation_wij
## hamburger     salad      taco   nuggets 
##       1.2       1.0       0.6       1.2
variance_wij <- M * mu * (1 - mu) # i.e., How much the count of each word bounces around from document to document -- largest when mu_j is near 0.5, smallest when mu_j is near 0 or 1

variance_wij
## hamburger     salad      taco   nuggets 
##      0.84      0.75      0.51      0.84
covariance_wij <- -M * (mu %o% mu)  # i.e., How do counts of *different* words move together? We use the off-diagonal elements -- always negative, because the counts must sum to M


diag(covariance_wij) <- variance_wij # Note: outer-product for off-diagonal gives us wrong values for diagonal (M * mu_j * mu_k) -- so we replace with true variance for W_ij 

covariance_wij
##           hamburger salad  taco nuggets
## hamburger      0.84 -0.30 -0.18   -0.36
## salad         -0.30  0.75 -0.15   -0.30
## taco          -0.18 -0.15  0.51   -0.18
## nuggets       -0.36 -0.30 -0.18    0.84
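A quick sanity check on the covariance matrix above: because the word counts in a document must sum to M, each row of the full covariance matrix sums to zero – the negative covariances exactly offset the variance.

```r
mu <- c(hamburger = 0.3, salad = 0.25, taco = 0.15, nuggets = 0.3)
M <- 4

cov_mat <- -M * (mu %o% mu)         # Off-diagonal covariances
diag(cov_mat) <- M * mu * (1 - mu)  # Replace diagonal with the true variances

rowSums(cov_mat)  # Each row sums to (numerically) zero
```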

Example: Unaccredited Federalist Papers

GRS (Ch. 6) provide an illustrative example from Mosteller and Wallace (1963) concerning the disputed authorship of 12 Federalist Papers – a collection of 85 essays authored by Alexander Hamilton, James Madison, and John Jay between October 1787 and August 1788 advocating for the ratification of the U.S. Constitution. Since each Paper was published under the same collective pseudonym – Publius – authorship was disputed; at minimum, there was debate over which of the three – Hamilton, Madison, or Jay – wrote a particular Paper. Some were later discovered to have been authored jointly. However, by the mid-20th century, it was believed that Jay authored 5, Hamilton authored at least 43, and Madison authored 14.

Mosteller and Wallace (1963) used basic notions from the Bag of Words and multinomial language models to infer authorship of the disputed papers. Given variance in the writing styles of Hamilton, Madison, and Jay – e.g., use of specific terminology, phrasing, etc. – we can infer authorship of the Papers lacking attribution. Most importantly, we assume each author’s behavior represents a distinct data generating process – and, by extension, a unique multinomial distribution and generative process \(\mu_{H,M,J}\) (see GRS Table 6.1). However, we’re again going to add a bit to GRS’s example and use a five-word vocabulary.

Let’s start by recovering the Federalist Papers, reducing complexity, and isolating the instances where any of the authors uses the terms by, man, upon, heretofore, or whilst. Doing so, we recover the unique counts below:

tidy_federalist %>%
  count(author, word) %>%          
  tidyr::pivot_wider(names_from = author, values_from = n, values_fill = 0) %>%
  { 
    wide <- .                      
    bind_rows(
      wide,
      wide %>%
        select(-word) %>%
        summarise(across(everything(), sum)) %>%
        mutate(word = "TOTAL") %>%
        select(word, everything()))
  } # Print Counts of Interesting Words
## # A tibble: 6 × 4
##   word       Hamilton   Jay Madison
##   <chr>         <int> <int>   <int>
## 1 by              861    82     477
## 2 heretofore       13     1       1
## 3 man             102     0      17
## 4 upon            374     1       7
## 5 whilst            1     0      12
## 6 TOTAL          1351    84     514

This produces the following multinomial models, where the first value in Multinomial(\(\cdot\), \(\mu\)) is the total count of the observed vocabulary for each prospective author: \[ \mathbf{W}_{Hamilton} \sim \text{Multinomial}(1351, \mu_H) \] \[ \mathbf{W}_{Madison} \sim \text{Multinomial}(514, \mu_M) \] \[ \mathbf{W}_{Jay} \sim \text{Multinomial}(84, \mu_J) \]

We then turn toward recovering the unique \(\mu_{H,M,J}\) (the ML estimator) for each author – this example recovers Hamilton’s:

\[ \mu_{Hamilton} = (\frac{861}{861+13+102+374+1},\frac{13}{1351},\frac{102}{1351}, \frac{374}{1351}, \frac{1}{1351}) \]

\[ \mu_{Hamilton} = (0.63, 0.009, 0.07, 0.27, 0.0007) \]

The others are:

\[ \mu_{Madison} = (0.92, 0.001, 0.033, 0.013, 0.023) \] \[ \mu_{Jay} = (0.97, 0.01, 0, 0.01, 0) \]
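Since the ML estimator is just each author’s counts divided by their total, we can recover all three \(\mu\) vectors at once (a sketch, with the counts entered by hand from the table above):

```r
# Word counts (by, heretofore, man, upon, whilst) per author, from the table above
counts <- rbind(Hamilton = c(861, 13, 102, 374,  1),
                Madison  = c(477,  1,  17,   7, 12),
                Jay      = c( 82,  1,   0,   1,  0))
colnames(counts) <- c("by", "heretofore", "man", "upon", "whilst")

mu_hat <- counts / rowSums(counts)  # ML estimator: counts divided by row totals
round(mu_hat, 3)
```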

We’re going to apply these author-specific \(\mu\) to two examples – Federalist 51, where authorship was disputed between Madison and Hamilton prior to Mosteller and Wallace (1963), and another where we know authorship (as a robustness check).

For Federalist 51, we know the following counts:

  • by (23)

  • man (1)

  • upon (0)

  • heretofore (0)

  • whilst (2)

Applying our known \(\mu\) for each author:

(Note that the count vector must be arranged in the same order as \(\mu\) – (by, heretofore, man, upon, whilst) – giving counts (23, 0, 1, 0, 2).)

\[ p(\mathbf{W_{Fed51}}\mid\mu_{Hamilton}) = \frac{26!}{(23!)(0!)(1!)(0!)(2!)}(0.63)^{23}(0.009)^{0}(0.07)^{1}(0.27)^{0}(0.0007)^{2} \] \[ p(\mathbf{W_{Fed51}}\mid\mu_{Hamilton}) \approx 0.0000000065 \] \[ p(\mathbf{W_{Fed51}}\mid\mu_{Madison}) = \frac{26!}{(23!)(0!)(1!)(0!)(2!)}(0.92)^{23}(0.001)^{0}(0.033)^{1}(0.013)^{0}(0.023)^{2} \] \[ p(\mathbf{W_{Fed51}}\mid\mu_{Madison}) \approx 0.02 \] \[ p(\mathbf{W_{Fed51}}\mid\mu_{Jay}) = \frac{26!}{(23!)(0!)(1!)(0!)(2!)}(0.97)^{23}(0.01)^{0}(0)^{1}(0.01)^{0}(0)^{2} \]

\[ p(\mathbf{W_{Fed51}}\mid\mu_{Jay}) = 0 \]
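To double-check the hand calculations, a sketch evaluating the same likelihoods with dmultinom() and the exact (un-rounded) ML estimates – the counts for Federalist 51 are arranged in the same (by, heretofore, man, upon, whilst) order as \(\mu\):

```r
fed51 <- c(by = 23, heretofore = 0, man = 1, upon = 0, whilst = 2)

# Exact ML estimates: each author's counts divided by their total
mu_hamilton <- c(861, 13, 102, 374, 1) / 1351
mu_madison  <- c(477, 1, 17, 7, 12) / 514
mu_jay      <- c(82, 1, 0, 1, 0) / 84

dmultinom(fed51, prob = mu_hamilton)
dmultinom(fed51, prob = mu_madison)
dmultinom(fed51, prob = mu_jay)  # Exactly zero: Jay never uses 'man' or 'whilst'
```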

Voilà – matching the academic consensus, our results tell us that James Madison was the most likely author of Federalist 51. However, one concern that may emerge is that, while the probability of John Jay’s authorship is surely very low, our results effectively say it is impossible, because he did not use whilst or man in any of the Federalist Papers. We can do better by regularizing our estimates – i.e., adding a small positive number to each element of the count vector to encode the possibility that (let’s say) Jay might someday use those words.

For the sake of ease, I am going to replicate our results with Laplace smoothing and dmultinom(), where federalist_51_vector is the vector of word-specific counts for Federalist 51 and author_vectors holds each author’s observed counts. Adding 1 to each count before passing it to dmultinom() (which normalizes prob internally) treats every word as if it were used at least once.

hamilton_likelihood <- dmultinom(x = federalist_51_vector,
                                 prob = author_vectors[['Hamilton']] + 1)

madison_likelihood <- dmultinom(x = federalist_51_vector,
                                prob = author_vectors[['Madison']] + 1)

jay_likelihood <- dmultinom(x = federalist_51_vector,
                            prob = author_vectors[['Jay']] + 1)


data.frame(Author = c('Hamilton', 'Madison', 'Jay'), 
           Likelihood = c(hamilton_likelihood, madison_likelihood, jay_likelihood))
##     Author   Likelihood
## 1 Hamilton 3.845094e-08
## 2  Madison 2.557079e-02
## 3      Jay 2.222033e-03
madison_likelihood/jay_likelihood # Likelihood Ratio of Madison v. Jay
## [1] 11.50783
madison_likelihood/hamilton_likelihood # Likelihood Ratio of Madison v. Hamilton 
## [1] 665023.7

So – yes, we can be quite confident from our niche analysis that James Madison was the likely author! But let’s run a robustness check using Federalist 10 – which we know James Madison wrote.

hamilton_likelihood <- dmultinom(x = federalist_10_vector,
                                 prob = author_vectors[['Hamilton']] + 1)

madison_likelihood <- dmultinom(x = federalist_10_vector,
                                prob = author_vectors[['Madison']] + 1)

jay_likelihood <- dmultinom(x = federalist_10_vector,
                            prob = author_vectors[['Jay']] + 1)


madison_likelihood/jay_likelihood # Madison vs. Jay
## [1] 18.06388
madison_likelihood/hamilton_likelihood # Madison v. Hamilton
## [1] 181170.9

Same confidence!


Dirichlet Distribution (Briefly)

We will return to the Dirichlet distribution in much greater detail when we discuss topic models, but for now it is useful to introduce it as a principled alternative to Laplace smoothing. Recall that Laplace smoothing adds pseudo-counts (a small positive constant) to the observed word counts before normalizing, thereby preventing zero probabilities when an author never uses a particular term. The Dirichlet distribution formalizes this idea within a fully probabilistic framework.

In short – instead of just adding a small number to every category, as Laplace smoothing does, the Dirichlet lets us treat the probabilities themselves as random (and we can choose the value of \(\alpha_k\) for each element of the vector). It gives us a distribution over entire probability vectors at once. The hyperparameter \(\alpha\) controls whether the probabilities are mostly focused on one category or spread more evenly across all categories.

\[ \alpha < 1 \rightarrow \mu \text{ vectors lie near the corners of the simplex}\] \[ \alpha = 1 \rightarrow \mu \text{ vectors are uniformly distributed (equally likely)}\] \[ \alpha > 1 \rightarrow \mu \text{ vectors are balanced across categories and cluster toward the center of the simplex}\]
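Base R has no built-in Dirichlet sampler, but we can sketch one using a standard construction – normalizing independent Gamma(\(\alpha_k\), 1) draws. (The helper name rdirichlet_one is ours, not from a package; packages like gtools and MCMCpack provide rdirichlet.)

```r
# Draw one mu vector from Dirichlet(alpha) by normalizing independent Gammas
rdirichlet_one <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}

set.seed(1)
round(rdirichlet_one(rep(0.1, 4)), 3)  # alpha < 1: mass tends to pile onto few categories
round(rdirichlet_one(rep(10, 4)), 3)   # alpha > 1: draws tend to cluster near (0.25, ...)
```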

Returning to the Federalist Papers example from GRS, we will again address the disputed paper, this time using the Dirichlet distribution rather than Laplace smoothing to regularize. Using our intuition, let’s assume we expect by and upon to feature most prominently – so we assign \(\alpha = (2, 1, 1, 2, 1)\), such that:

\[\mu \sim \text{ Dirichlet}(\alpha), \quad \alpha = (2, 1, 1, 2, 1)\] \[ W_i\mid\mu \sim \text{ Multinomial}(M_i, \mu) \]

\[ \mu\mid\alpha,W_i \sim \text{ Dirichlet}(W_i + \alpha) \]

\[ E[\mu_j\mid\alpha,W_i] = \frac{W_{ij}+\alpha_j}{\sum_k(W_{ik} + \alpha_k)} \]

Looking at Hamilton…

\[ \hat{\mu}_{Hamilton} = (\frac{861 + 2}{861+13+102+374+1 + (2 + 1 + 1 + 2 + 1)}, \frac{13+1}{1358}, \frac{102+1}{1358}, \frac{374+2}{1358}, \frac{1+1}{1358}) \] \[ \hat{\mu}_{Hamilton} = (0.63, 0.01, 0.07, 0.27, 0.001)\]

Notice how these probabilities in \(\hat{\mu}_{Hamilton}\) are not entirely dissimilar from our original probabilities – the ones with the most discernible change are those with very low probabilities to begin with!
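The posterior mean above is a one-liner in R – add \(\alpha\) to the counts and renormalize (a sketch using Hamilton’s counts from the table):

```r
counts_hamilton <- c(by = 861, heretofore = 13, man = 102, upon = 374, whilst = 1)
alpha <- c(2, 1, 1, 2, 1)

# Posterior mean of mu under a Dirichlet(alpha) prior: (counts + alpha) / total
posterior_mu <- (counts_hamilton + alpha) / sum(counts_hamilton + alpha)
round(posterior_mu, 4)  # c(0.6355, 0.0103, 0.0758, 0.2769, 0.0015)
```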

\[ p(\mathbf{W_{Fed51}\mid\hat{\mu}_{Hamilton}}) = \frac{26!}{(23!)(0!)(1!)(0!)(2!)}(0.63)^{23}(0.01)^{0}(0.07)^{1}(0.27)^{0}(0.001)^{2} \] Which results in a probability of authorship greater than the original (though still discernibly smaller than what we’d get for Madison!)

\[ p(\mathbf{W_{Fed51}\mid\hat{\mu}_{Hamilton}}) \approx 0.0000000132 \]
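Plugging the rounded posterior-mean probabilities into the PMF in R (a sketch; the exact value shifts slightly if the un-rounded posterior means are used):

```r
# 26! / (23! * 0! * 1! * 0! * 2!) times the rounded probabilities raised to the counts
p_fed51_hamilton <- factorial(26) / (factorial(23) * factorial(1) * factorial(2)) *
  0.63^23 * 0.01^0 * 0.07^1 * 0.27^0 * 0.001^2
p_fed51_hamilton  # ≈ 1.32e-08
```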