Note: Students should always aim to produce publication-worthy tables and figures. Unless otherwise stated, tables should be rendered with the stargazer package, while figures can be rendered with ggplot2 or base plot(). Regardless, tables and figures should always be presented with the necessary formatting – e.g., a (sub)title, axis (variable) labels and titles, a clearly identifiable legend and key, etc. Problem sets must always be compiled using LaTeX or RMarkdown and include the full coding routine (with notes explaining your implementation) used to complete each problem (10pts).


  1. The sample data below represent terms recovered from American party platforms, where each row represents one platform document, its party label, and three terms drawn from similar policy areas (4pt).
Document_ID   Category     Terms
1             Democrat     healthcare, taxes, security
2             Democrat     climate, energy, security
3             Democrat     healthcare, equity, border
4             Democrat     immigration, pathway, energy
5             Democrat     taxes, diplomacy, defense
6             Republican   taxes, energy, security
7             Republican   border, enforcement, energy
8             Republican   healthcare, market, defense
9             Republican   taxes, domestic, security
10            Republican   immigration, enforcement, defense
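One way to load the sample data for analysis is to store it in a tidy data frame; this is a minimal sketch (the object name `platforms` and the comma-separated storage of terms are assumptions, not requirements):

```r
# Hypothetical construction of the sample data as an R data frame
platforms <- data.frame(
  doc_id   = 1:10,
  category = rep(c("Democrat", "Republican"), each = 5),
  terms    = c("healthcare, taxes, security",
               "climate, energy, security",
               "healthcare, equity, border",
               "immigration, pathway, energy",
               "taxes, diplomacy, defense",
               "taxes, energy, security",
               "border, enforcement, energy",
               "healthcare, market, defense",
               "taxes, domestic, security",
               "immigration, enforcement, defense"),
  stringsAsFactors = FALSE
)
```

The `terms` column can then be split with strsplit() (or tokenized with a package such as quanteda) for the calculations below.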

A. For two words found in both classes (Democratic and Republican), calculate the mutual information between the word and the class label. Several words appear in both classes; you only need to choose two.
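As a starting point, mutual information for a binary word-presence variable can be computed directly from the 2x2 table of document counts. This is a hedged sketch, not required code; the function name and the illustrative counts (taken from the table for "healthcare": present in 2 Democratic and 1 Republican document, with 10 documents total) are my own:

```r
# Mutual information (in bits) between word presence and party class,
# from the four cells of a 2x2 document-count table
mi_word <- function(n11, n10, n01, n00) {
  # n11: word present & Democrat;  n10: word present & Republican
  # n01: word absent  & Democrat;  n00: word absent  & Republican
  N  <- n11 + n10 + n01 + n00
  p  <- c(n11, n10, n01, n00) / N                               # joint P(w, c)
  pw <- c(p[1] + p[2], p[1] + p[2], p[3] + p[4], p[3] + p[4])   # marginal P(w)
  pc <- c(p[1] + p[3], p[2] + p[4], p[1] + p[3], p[2] + p[4])   # marginal P(c)
  sum(ifelse(p > 0, p * log2(p / (pw * pc)), 0))
}

mi_word(2, 1, 3, 4)  # illustrative counts for "healthcare"
```

The ifelse() guard skips empty cells, since 0 * log(0) is defined as 0 in the mutual-information sum.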

B. For the words healthcare, energy, and security, use the Fightin' Words approach to recover the log-odds ratio indicating how strongly each word predicts one class versus the other.
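A minimal sketch of the Fightin' Words calculation (Monroe, Colaresi & Quinn 2008), assuming word counts tallied from the table above; the function name, the prior values, and the illustrative call are assumptions:

```r
# Log-odds ratio with an informative Dirichlet prior, standardized by its
# approximate variance, as in the Fightin' Words approach
fw_logodds <- function(y_dem, n_dem, y_rep, n_rep, alpha = 0.01, alpha0 = 0.3) {
  # y_*: count of the word in each class; n_*: total tokens in each class
  # alpha: prior count for this word;     alpha0: total prior mass
  l_dem <- log((y_dem + alpha) / (n_dem + alpha0 - y_dem - alpha))
  l_rep <- log((y_rep + alpha) / (n_rep + alpha0 - y_rep - alpha))
  delta <- l_dem - l_rep                       # log-odds ratio
  v     <- 1 / (y_dem + alpha) + 1 / (y_rep + alpha)  # approximate variance
  delta / sqrt(v)                              # z-score; positive -> Democratic
}

# e.g., "healthcare": 2 of 15 Democratic tokens, 1 of 15 Republican tokens
fw_logodds(2, 15, 1, 15)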


  2. Using the Federalist Papers authored by James Madison (no co-authors or disputed authorship), apply TF-IDF weighting and run k-means clustering with 5, 10, 15, 20, and 25 cluster centers. Report the within-cluster sum of squares for each. What is the average reduction in the sum of squares per additional cluster center? Is the answer the same if this question is asked for the range 5–15 versus 15–25? (3pt)
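One possible workflow uses quanteda for weighting and base R kmeans(); this is a sketch under the assumption that the Madison-only texts are already subset into a character vector `madison_texts` (how you obtain and filter the Federalist Papers is up to you):

```r
# TF-IDF weighting and k-means over several values of k (sketch)
library(quanteda)

dfm_mat <- tokens(madison_texts, remove_punct = TRUE) |>
  dfm() |>
  dfm_remove(stopwords("en")) |>
  dfm_tfidf()

set.seed(1234)  # k-means depends on random starting values
ks  <- c(5, 10, 15, 20, 25)
wss <- sapply(ks, function(k) {
  kmeans(as.matrix(dfm_mat), centers = k, nstart = 25)$tot.withinss
})

# average reduction in within-cluster SS per added center, e.g. over 5 -> 25
(wss[1] - wss[5]) / (25 - 5)
```

The same ratio computed over wss[1]–wss[3] (k = 5 to 15) and wss[3]–wss[5] (k = 15 to 25) answers the second part of the question.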

  3. Using the same Federalist Papers data (Madison authorship only), construct an LDA topic model with 10 topics. Print the top-10 words associated with each of those topics.

A. For each topic, report the posterior probabilities of the top-10 words in that topic.

B. For at least 5 of the topics, write 2-3 sentences explaining what you think the "topic" identified by the model is – i.e., what do the grouped words tell you about the substantive content of that topic? (3pt)
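A sketch of the LDA steps using the topicmodels package, assuming a raw-count document-feature matrix `dfm_counts` of the Madison papers has already been built (LDA requires unweighted counts, not the TF-IDF matrix from the previous problem):

```r
# LDA with 10 topics, top words, and their posterior probabilities (sketch)
library(quanteda)
library(topicmodels)

dtm     <- convert(dfm_counts, to = "topicmodels")   # quanteda dfm -> DTM
lda_fit <- LDA(dtm, k = 10, control = list(seed = 1234))

terms(lda_fit, 10)                 # top-10 words for each topic

beta <- posterior(lda_fit)$terms   # topic-by-word posterior probabilities
top_probs <- apply(beta, 1, function(p) sort(p, decreasing = TRUE)[1:10])
top_probs                          # posterior probabilities of each topic's top-10 words
```

Setting the seed in control matters here because LDA's variational/Gibbs initialization is stochastic, so topic labels and orderings will differ across runs otherwise.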