Note: Students should always aim to produce
publication-worthy tables and figures. Unless otherwise stated,
tables should be rendered using stargazer::stargazer(), while
figures can be rendered using ggplot2::ggplot() or
plot(). Regardless, tables and figures should always be
presented with the necessary formatting – e.g., a (sub)title, axis (variable)
labels and titles, a clearly identifiable legend and key, etc. Problem
sets must always be compiled using LaTeX or
RMarkdown and include the full coding routine (with notes
explaining your implementation) used to complete each problem
(10pts).
| Document_ID | Category | Terms |
|---|---|---|
| 1 | Democrat | healthcare, taxes, security |
| 2 | Democrat | climate, energy, security |
| 3 | Democrat | healthcare, equity, border |
| 4 | Democrat | immigration, pathway, energy |
| 5 | Democrat | taxes, diplomacy, defense |
| 6 | Republican | taxes, energy, security |
| 7 | Republican | border, enforcement, energy |
| 8 | Republican | healthcare, market, defense |
| 9 | Republican | taxes, domestic, security |
| 10 | Republican | immigration, enforcement, defense |
A. For two words found in both classes (Democratic and
Republican), calculate the mutual information between the word and the class. There are
seven such words in total, but you only need to choose two.
B. For the words healthcare, energy, and
security, use the Fightin’ Words approach to estimate the
log-odds ratio for each, indicating which class the word predicts and how strongly.
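
For reference, one common way to set up both calculations is sketched below. The notation is an assumption introduced here and may differ from the convention used in lecture. For mutual information, let $N_{e_t e_c}$ be the number of documents with term indicator $e_t$ (1 if the word appears, 0 otherwise) and class indicator $e_c$ (1 for one party, 0 for the other), and let $N$ be the total number of documents:

$$
I(U; C) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{N_{e_t e_c}}{N} \log_2 \frac{N \, N_{e_t e_c}}{N_{e_t \cdot} \, N_{\cdot e_c}}
$$

Here $N_{e_t \cdot}$ and $N_{\cdot e_c}$ are the row and column totals of the 2×2 table, and any cell with a count of zero contributes zero to the sum.

For Fightin’ Words, assuming the log-odds ratio with a Dirichlet prior, let $y_w^{(i)}$ be the count of word $w$ in class $i$, $n^{(i)}$ the total word count in class $i$, $\alpha_w$ the prior pseudo-count for word $w$, and $\alpha_0 = \sum_w \alpha_w$:

$$
\hat{\delta}_w^{(D-R)} = \log \frac{y_w^{(D)} + \alpha_w}{n^{(D)} + \alpha_0 - y_w^{(D)} - \alpha_w} - \log \frac{y_w^{(R)} + \alpha_w}{n^{(R)} + \alpha_0 - y_w^{(R)} - \alpha_w}, \qquad
\sigma^2\left(\hat{\delta}_w^{(D-R)}\right) \approx \frac{1}{y_w^{(D)} + \alpha_w} + \frac{1}{y_w^{(R)} + \alpha_w}
$$

Under this sign convention, positive values point toward the Democratic class and negative values toward the Republican class; the standardized score is $\hat{\delta}_w^{(D-R)} / \sigma$. State the prior you choose when reporting results.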
Using the Federalist Papers authored solely by James Madison (excluding co-authored and disputed papers), apply TF-IDF weighting and run k-means clustering with 5, 10, 15, 20, and 25 cluster centers. Report the within-cluster sum of squares for each number of centers. What is the average reduction in the sum of squares per additional cluster center? Is the answer the same over the range 5-15 as over the range 15-25? (3pt)
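
One reading of the "average reduction" question, offered as an assumption rather than a required definition: write $W(k)$ for the total within-cluster sum of squares with $k$ centers (the quantity that R's kmeans() reports as tot.withinss), where $C_j$ is the set of documents assigned to cluster $j$ and $\mu_j$ is that cluster's centroid. The average reduction per added center between $k_a$ and $k_b > k_a$ is then a simple slope:

$$
W(k) = \sum_{j=1}^{k} \sum_{i \in C_j} \lVert x_i - \mu_j \rVert^2, \qquad
\bar{\Delta}(k_a, k_b) = \frac{W(k_a) - W(k_b)}{k_b - k_a}
$$

Under this reading, the final part of the question amounts to comparing $\bar{\Delta}(5, 15)$ with $\bar{\Delta}(15, 25)$. Because k-means depends on its random starting centers, fixing a seed and using several restarts makes the reported values reproducible.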
Using the same Federalist Papers data (Madison authorship only), fit an LDA topic model with 10 topics. Print the top 10 words associated with each topic.
A. For each topic, report the posterior probabilities for the top 10 documents (words) in each.
B. For at least 5 of the topics, write 2-3 sentences explaining what you think the “topic” identified by the model is – i.e., what do these words tell you about the substantive content of the pooled words? (3pt)
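
As a notational aid (the symbols are an assumption introduced here, not part of the assignment), an LDA fit yields two sets of posterior probabilities, and it helps to say which one you are reporting in part A:

$$
\beta_{k,w} = P(w \mid z = k), \qquad \theta_{d,k} = P(z = k \mid d)
$$

Here $\beta_{k,w}$ is the probability of word $w$ under topic $k$, and $\theta_{d,k}$ is the share of document $d$ attributed to topic $k$. Ranking words within a topic uses $\beta_{k,w}$; ranking documents within a topic uses $\theta_{d,k}$. If you happen to use the topicmodels package (the problem does not name one), posterior() returns these as the terms and topics matrices, respectively.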