Discriminating Words

I want to briefly address the concept of discriminating words, which offers a (more straightforward and) complementary intuition to clustering algorithms and topic discovery methods. As we emphasized in last week’s discussion of the disputed Federalist Papers, the core goal of text analysis is often to recover and interpret the discursive properties of distinct documents – i.e., to compare and contrast different documents. In our example, we represented the writing styles of Hamilton, Madison, and Jay as unique identifiers that could be used to infer authorship for disputed essays. The central idea is that each author exhibits distinctive “quirks” in their writing style, which can be used to distinguish them – e.g., a disputed document that exhibits more features of one style than the others provides a signal of its likely author. That exercise touched on a broader concept in content analysis – namely, the ability both to attribute a text given variance in behaviors and to assess its association with alternative groups or clusters.

In short, we could assert:

Federalist 51 was most likely written by James Madison because his writing style most closely matched the disputed essay.
James Madison’s writing style is more similar to Alexander Hamilton’s than to John Jay’s.
Alexander Hamilton is more likely to have authored Federalist 51 than John Jay.

\[ \text{Federalist 51} \ge \text{Madison} > \text{Hamilton} > \text{Jay} \]

We were able to use individual words to address not only how variance in word choice informs us about individual authors and the disputed essay(s), but also how it can inform us about these items as members of groups (or clusters) that need not be mutually exclusive – e.g., political parties, authors, etc. The key idea is that the frequency and distribution of words carry meaningful information about latent categories – i.e., words that appear more frequently in some categories than in others (discriminating words) are informative signals about those categories.

For example, consider this passage from one of my articles, Measuring Judicial Ideology as Text (2025):

Terminology is an expression of language and language is an expression of ideology (Thompson, 1987; Woolard and Schieffelin, 1994). The choices made by judges regarding how they express decisions through written opinions is thus an expression of preferences, which are shaped by both personal inclinations (Segal and Spaeth, 2002), strategic machinations (Bailey and Maltzman, 2011; Bonneau et al., 2007), and considerations of the perceived audience (Baum, 2009; Romano and Curry, 2019) (…) Put more simply: the words judges choose matter and are a reflection of their own ideological beliefs, which facilitates a considerable impact on how we know and speak about the law. An illustrative example is the contrasting use of terms such as “healthcare provider” – a neutral descriptor for medical professionals – and “abortionist” – a perceptively derogatory term endowed with legal significance through rulings like Dobbs v. Jackson Women’s Health Organization (2022). Despite serving the same lexical purpose, these terms carry distinct ideological meanings.

While ostensibly concerning the same topic, the terms healthcare provider and abortionist draw from distinct discursive frames and connotations, signaling different ideological perspectives and shaping how audiences interpret the underlying issue. Moreover, rather than just focusing on how groups may use divergent terminology to discuss the same concept, we can go even broader and view discriminating words as those simply more likely to appear in one group’s discussions than in another’s. For instance, if we had to list the five words most likely to be used to describe the policy focuses of Democrats vs. Republicans, we might get something like this:

Democrats	Republicans
Climate	Freedom
Healthcare	Tax
Immigration	Border
Justice	Security
Welfare	Economy

This is not to say that these words are exclusively representative of Democrats or Republicans, but rather that if we were to compile all the speeches, policies, and other documents released by the parties, we’d find that these words appear disproportionately often in one party’s documents – and thus discriminate between the parties. For instance, both parties may spend time discussing immigration – but Republicans would be more likely than Democrats to discuss it in the context of border security. Alternatively, both may discuss the economy – although Democrats would be more likely to discuss it in the context of Medicaid or other welfare programs.

Calculating Discriminating Words

Mutual Information

In essence, mutual information (MI) concerns the amount of information one variable provides about another. In our context, assume we have two classification labels, Red and Blue, as well as a tokenized list of words sourced from documents published by authors associated with one of the two classifications. Here, mutual information helps us disentangle which terms best discriminate across the labels, as well as which labels best discriminate across the text. If a word and a label are completely independent, knowing one tells us nothing about the other – so MI will be 0. Alternatively, if knowing one perfectly informs the other, then MI reaches its maximum, the entropy of the labels themselves.

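\[ H(k) = -\pi_k\log_2\pi_k-\pi_{-k}\log_2\pi_{-k} \\ H(k|j) = -\sum_{s\in\{k,-k\}}\sum_{t\in\{j,-j\}}\pi_{s,t}\log_2\pi_{s|t} \\ \text{MI}_{kj} = H(k) - H(k|j) \]

Where:

\[ H(k) = \text{Unconditional Uncertainty (Entropy)} \\ H(k|j) = \text{Conditional Uncertainty (Entropy)} \\ \text{MI}_{kj} = \text{Unconditional Uncertainty - Conditional Uncertainty} \]

In the conditional entropy, $\pi_{s,t}$ is the joint proportion of documents with category status $s$ (in $k$ or not) and word status $t$ (containing $j$ or not), and $\pi_{s|t}$ is the proportion with category status $s$ among the documents with word status $t$.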

Here, $k$ indexes the discrete categories (Red and Blue), and $j$ is a particular word in the lexicon of words found among the various documents. $\pi_k$ represents the proportion of documents that fall into category $k$ (Red or Blue), and $\pi_{-k}$ (the same as $1-\pi_k$) is the probability that a document does not belong to category $k$. $H(k)$ just quantifies how uncertain we are about the category of a randomly selected document before seeing any word – i.e., how confidently could we guess that a random document belongs to a particular category without even seeing a word? As we could imagine, if $\pi_k$ is high, one category dominates the observable data and our entropy (uncertainty) would be very low. For instance, if there are 100 documents from Blue and 5 from Red, $H(k)$ would be low (about 0.28 bits) – because we could reasonably assume that virtually any document belongs to Blue. If the split were closer to 50/50, the entropy would be much higher, reaching its maximum of 1 bit at exactly 50/50.
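The contrast between a lopsided and a balanced corpus can be checked numerically. What follows is a minimal Python sketch, assuming base-2 logarithms as in the formula for $H(k)$; the function name `entropy` and the variable names are illustrative and do not come from the text.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Lopsided corpus: 100 Blue documents and 5 Red documents.
h_lopsided = entropy([100 / 105, 5 / 105])

# Balanced corpus: half Blue, half Red.
h_balanced = entropy([0.5, 0.5])

print(round(h_lopsided, 3))  # 0.276
print(round(h_balanced, 3))  # 1.0
```

With two categories the entropy can never exceed 1 bit, which is why the balanced split marks the point of greatest uncertainty.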

Moreover, $H(k|j)$ builds on this intuition by assessing how much uncertainty remains after we condition on the presence (or absence) of a particular word found among the documents. If a word $j$ perfectly predicts a category $k$, then the remaining uncertainty would be 0. Combining this intuition with $H(k)$, we can compute the mutual information as the unconditional uncertainty minus the conditional uncertainty – which we can replicate for each word and category available to us.

Let’s suppose we have 10 documents:

Document ID	Category ($k$)	Words ($j$)
1	Red	Apple, Tree
2	Red	Apple, Juice
3	Red	Tree, Green
4	Red	Apple, Green
5	Blue	Blue, Ocean
6	Blue	Ocean, Water
7	Blue	Blue, Water
8	Blue	Ocean, Blue
9	Red	Apple, Tree
10	Blue	Ocean, Tree

Step 1: Compute Unconditional Uncertainty ($H(K)$)

\[ \pi_{Red} = 5/10 = 0.5 \\ \pi_{Blue} = 5/10 = 0.5 \\ H(K) = -\pi_{Red}\log_2\pi_{Red}-\pi_{Blue}\log_2\pi_{Blue} \\ H(K) = -0.5\log_2 0.5-0.5\log_2 0.5 \\ H(K) = 1 \]

Step 2: Compute Conditional Uncertainty for Each Word

Tree is the only word that appears in both categories, so we will calculate the mutual information for that word – though frequent words such as Apple or Ocean will have much higher MI because their presence perfectly predicts a category. For example, apple appears in Documents 1, 2, 4, and 9, and every one of those documents is a Red document. So if we get apple as $j$, we know it perfectly predicts Red – there is no uncertainty!

\[ P(Red|Tree) = 3/4 = 0.75 \\ P(Blue|Tree) = 1/4 = 0.25 \\ H(K|Tree) = -(0.75\log_20.75+0.25\log_20.25) = 0.811 \\ \]

This is the entropy that remains when we know ONLY that Tree appears – meaning it is the conditional entropy for that single event, not yet an information gain. Now we need to weight it by how often the word appears (and does not appear) across the data.

\[ p(\text{Tree}) = \frac{4}{10} \text{ (Appears in Docs 1, 3, 9, and 10)} \\ p(\text{Tree})\cdot H(K\mid\text{Tree}) = 0.4\cdot 0.811 = 0.3244 \text{ (Contribution from Tree)} \\ p(\neg\text{Tree}) = 0.6 \text{ (Missing from 6 of 10 Documents)} \\ p(\text{Red}\mid\neg\text{Tree}) = \frac{2}{6} \text{ (Missing from Two Red Documents)} \\ p(\text{Blue}\mid\neg\text{Tree}) = \frac{4}{6} \text{ (Missing from Four Blue Documents)} \\ H(K\mid\neg\text{Tree})= -\left(\frac{2}{6}\log_{2}\frac{2}{6} + \frac{4}{6}\log_{2}\frac{4}{6}\right) \approx 0.918 \\ p(\neg\text{Tree})\cdot H(K\mid\neg\text{Tree}) = 0.6 \cdot 0.918 = 0.5508 \text{ (Contribution from } \neg\text{Tree)} \\ H(K\mid J) = 0.3244 + 0.5508 = 0.8752 \text{ (Full Conditional Entropy for the Word Tree)} \\ \text{MI}(K, \text{Tree}) = 1 - 0.8752 = 0.1248 \]

Meaning that knowing whether the word Tree appears (or not) reduces the uncertainty about a document’s category by $\approx 0.125$ bits, on average. Because $H(K) = 1$ bit here, MI ranges from 0 (the word tells us nothing) to 1 (the word perfectly discriminates), so Tree explains about 12.5% of the uncertainty: it is informative, but far from decisive.
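The same calculation can be repeated for every word in the ten-document example. The sketch below is a minimal Python illustration, assuming (as in the worked example) that only the presence or absence of a word in a document matters; the names `docs`, `entropy`, and `mutual_information` are illustrative and not from the text.

```python
import math

# Toy corpus from the table above: (category, words in the document).
docs = [
    ("Red", {"Apple", "Tree"}),
    ("Red", {"Apple", "Juice"}),
    ("Red", {"Tree", "Green"}),
    ("Red", {"Apple", "Green"}),
    ("Blue", {"Blue", "Ocean"}),
    ("Blue", {"Ocean", "Water"}),
    ("Blue", {"Blue", "Water"}),
    ("Blue", {"Ocean", "Blue"}),
    ("Red", {"Apple", "Tree"}),
    ("Blue", {"Ocean", "Tree"}),
]

def entropy(labels):
    """Entropy (in bits) of the category labels in a list of documents."""
    if not labels:
        return 0.0
    shares = [labels.count(k) / len(labels) for k in set(labels)]
    return sum(-p * math.log2(p) for p in shares)

def mutual_information(word):
    """H(K) minus the conditional entropy, weighted by presence/absence."""
    all_labels = [k for k, _ in docs]
    present = [k for k, words in docs if word in words]
    absent = [k for k, words in docs if word not in words]
    weighted = len(present) * entropy(present) + len(absent) * entropy(absent)
    return entropy(all_labels) - weighted / len(docs)

vocab = sorted(set().union(*(words for _, words in docs)))
mi = {w: mutual_information(w) for w in vocab}

# Print words from most to least informative.
for w in sorted(mi, key=mi.get, reverse=True):
    print(f"{w:6s} {mi[w]:.3f}")
```

Without intermediate rounding, Tree comes out at about 0.125, in line with the hand calculation. Apple and Ocean rank highest (about 0.61), while a rare word such as Juice scores slightly below Tree even though its presence perfectly predicts Red, because its absence is nearly uninformative.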

Pretty neat – huh? That being said, there are certainly downsides to this approach (as GRS note in Chapter 11): it doesn’t take into account uncertainty in the estimation of probabilities, only considers whether a word appears (or not) in a document (rather than how often it is repeated), is sensitive to rare words, is difficult to interpret with more than two categories, and ignores co-occurrence and context (e.g., “Not Good” is treated as “Not” and “Good” independently). We can account for most of these concerns when we move to a fully probabilistic model.

Fightin’ Words

Fightin’ Words (Monroe, Colaresi, and Quinn 2008) is a method to identify words that discriminate between two categories (e.g., party affiliation, authorship, etc.) using regularized log-odds ratios. In essence, our goal is to identify words that are strongly associated with one category versus another, while accounting for the rarity of words and guarding against overestimating their discriminating power. The solution is essentially to regularize the probability estimate.

\[ \hat{\mu}_{jk} = \frac{W^*_{jk} + \alpha_j}{n_k + \sum_{j=1}^{J}\alpha_j} \]

This may look daunting, but we’ve already learned most of the tools necessary to do this: $W^*_{jk}$ is the number of times word $j$ appears in category $k$, $n_k$ is the total number of words in category $k$, and $\alpha_j$ is just a small value to smooth the estimates, similar to the Dirichlet prior that we discussed in a previous class. As GRS note, if we suppose that $\mu_k \sim \text{Dirichlet}(\alpha)$ and that $W^*_k \sim \text{Multinomial}(n_k,\mu_k)$, the estimate corresponds to the expected value of the posterior distribution $p(\mu|W)$ after observing the words. In short, this means that $\hat{\mu}_{jk}$ gives a smoothed probability of word $j$ in category $k$, combining the observed counts with a small prior so that even words we haven’t seen get a small chance.
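The conjugacy step behind this estimate can be written out explicitly. The lines below are a sketch under the stated Dirichlet and multinomial assumptions, writing $J$ for the number of words in the lexicon and $j'$ for a summation index (both introduced here only for illustration):

\[ \mu_k \sim \text{Dirichlet}(\alpha), \quad W^*_k \mid \mu_k \sim \text{Multinomial}(n_k, \mu_k) \\ \mu_k \mid W^*_k \sim \text{Dirichlet}(\alpha + W^*_k) \\ E[\mu_{jk} \mid W^*_k] = \frac{W^*_{jk} + \alpha_j}{\sum_{j'=1}^{J}\left(W^*_{j'k} + \alpha_{j'}\right)} = \frac{W^*_{jk} + \alpha_j}{n_k + \sum_{j'=1}^{J}\alpha_{j'}} \]

The last equality uses $n_k = \sum_{j'} W^*_{j'k}$, so the prior simply acts like $\alpha_j$ extra observations of each word added to the observed counts.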

We can then compare usage across groups using a log odds ratio, which takes the log odds that a particular word is used in group $k$ and subtracts the log odds that it is used in the other groups:

\[ \text{Log Odds Ratio}_{kj} = \log\left(\frac{\mu_{kj}}{1-\mu_{kj}}\right) - \log\left(\frac{\mu_{-kj}}{1-\mu_{-kj}}\right) \]

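Going back to our Red and Blue categories $k$, suppose we observe the following word counts (note that Apple now also appears three times in Blue, so these counts differ slightly from the ten-document example above):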

Word	Red	Blue	Total
Apple	4	3	7
Tree	3	1	4
Green	2	0	2
Juice	1	0	1
Blue	0	3	3
Ocean	0	4	4
Water	0	2	2

Assuming $\alpha_j = 0.5$ for each of the seven words, the posterior mean for Apple is:

\[ \hat{\mu}(Apple|Red) = \frac{n_{(Apple,Red)} + \alpha_{Apple}}{n_{Red} + \sum_j\alpha_j} \\ \hat{\mu}(Apple|Red) = \frac{4 + 0.5}{10 + 7 \times 0.5} = 0.33 \\ \hat{\mu}(Apple|Blue) = \frac{3 + 0.5}{13 + 7 \times 0.5} = 0.212 \]

From here, the log odds ratio for Apple in Red relative to Blue (using natural logs) is:

\[ \delta_{\text{Red}} = \log\left(\frac{0.33}{1-0.33}\right) \approx \log(0.492) \approx -0.71 \\ \delta_{\text{Blue}} = \log\left(\frac{0.212}{1-0.212}\right) \approx \log(0.269) \approx -1.31 \\ \delta_{kj} = -0.71 - (-1.31) \approx 0.60 \]

Viewed conceptually, the log odds that Apple appears in a Red document exceed the log odds that it appears in a Blue document by approximately 0.60 – which, after exponentiation, $\exp(0.60) \approx 1.8$, means the odds of Apple in Red are about 1.8 times the odds in Blue. Because the estimate is regularized, we have a lot more power to assert that our belief about Apple is not simply driven by the relative lack of observations in Blue, and that a larger value means the word’s association is robust, not just accidental.
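The Apple calculation extends directly to every word in the count table. The following is a minimal Python sketch, assuming the same symmetric prior of 0.5 per word and natural logarithms; the names `counts`, `smoothed`, `log_odds`, and `delta` are illustrative, and the sketch computes only the regularized log odds ratio described above.

```python
import math

# Word counts from the table above: word -> (count in Red, count in Blue).
counts = {
    "Apple": (4, 3),
    "Tree": (3, 1),
    "Green": (2, 0),
    "Juice": (1, 0),
    "Blue": (0, 3),
    "Ocean": (0, 4),
    "Water": (0, 2),
}
ALPHA = 0.5  # symmetric prior weight per word, as assumed in the text

n_red = sum(red for red, _ in counts.values())     # total Red word count
n_blue = sum(blue for _, blue in counts.values())  # total Blue word count
alpha_total = ALPHA * len(counts)                  # sum of the priors

def smoothed(count, n_k):
    """Posterior mean of a word's probability under the Dirichlet prior."""
    return (count + ALPHA) / (n_k + alpha_total)

def log_odds(p):
    """Natural-log odds of a probability."""
    return math.log(p / (1 - p))

delta = {}
for word, (red, blue) in counts.items():
    mu_red = smoothed(red, n_red)
    mu_blue = smoothed(blue, n_blue)
    delta[word] = log_odds(mu_red) - log_odds(mu_blue)

# Positive values lean Red, negative values lean Blue.
for word in sorted(delta, key=delta.get, reverse=True):
    print(f"{word:6s} {delta[word]:+.2f}  odds ratio {math.exp(delta[word]):.2f}")
```

Because the code does not round the intermediate probabilities, Apple comes out near 0.62 rather than the hand-computed 0.60. Words with zero counts in one category still receive finite values, which is the practical payoff of the smoothing.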

POS6933: Computational Social Science

Truscott (Spring 2026)