There are countless sources of text data – from a single haiku to bound volumes anthologizing human knowledge. Text analysis tools let us bridge qualitative and quantitative inquiry across all of it.

Aaron Sorkin’s The West Wing (1999-2006) is not only the best political drama ever created; I’d go so far as to say it is the best drama ever, period. Most public polling on the show’s continued popularity, almost 20 years after its end, highlights its staying power. For instance, a survey conducted by Data for Progress found that 33 percent of Americans had a favorable opinion of The West Wing, while only 10 percent had an unfavorable opinion. The same poll also found that the show is more popular among older, more educated, and higher-income Americans.
One of my favorite episodes is Season 1’s Lies, Damn Lies, and Statistics. The title comes from a phrase popularized by Mark Twain (who attributed it to British Prime Minister Benjamin Disraeli) to describe the persuasive power of statistics to bolster often weak or subjective arguments. In this episode, the White House staff obsess over poll results following a made-for-television fight to regain control of the American political narrative.
Let’s imagine I wanted to know which character in the episode spoke the most lines, or maybe even which had the longest dialogue. To do that, I first need to read the script into R:
west_wing <- readLines(west_wing_script_location, warn = FALSE) # Read Txt from GitHub Repo
head(west_wing) # Print Head
## [1] "THE WEST WING" "'LIES, DAMN LIES, AND STATISTICS'"
## [3] "WRITTEN BY: AARON SORKIN" "DIRECTED BY: DON SCARDINO"
## [5] "" "TEASER"
Clearly, the text is not as organized as I’d like it to be. My next step is to consciously build a 4-column data frame that identifies the character speaking, their dialogue, an id for each spoken line, and the line’s word count.
There are several easier ways to do this, but I want to try to be as precise as I can. I am only interested in Josh, Toby, C.J., Donna, Sam, Leo, and President Bartlet.
library(dplyr) # provides %>%, mutate(), filter(), summarise(), etc.

west_wing <- data.frame(unlist(west_wing)) %>%
  setNames('text')

# Speaker names appear in capitals on a line of their own in the script
characters <- c('Josh', 'Toby', 'C.J.', 'Donna', 'Sam', 'Leo', 'Bartlet')
character_regex <- paste0("^(", paste0(toupper(characters), collapse = "|"), ")$")

damn_lies <- west_wing %>%
  # Flag speaker lines and blank rows, then drop everything before the first speaker
  mutate(character_line = ifelse(stringr::str_detect(text, character_regex), 1, 0),
         empty_row = ifelse(text == '', 1, 0),
         first_entry = ifelse(character_line == 1, 1, NA)) %>%
  tidyr::fill(first_entry, .direction = 'down') %>%
  filter(!is.na(first_entry)) %>%
  select(-c(first_entry)) %>%
  # Within each speech, keep only the rows before the first blank row
  mutate(group = cumsum(character_line == 1)) %>%
  group_by(group) %>%
  mutate(to_keep = row_number() < which(empty_row == 1)[1] | is.na(which(empty_row == 1)[1])) %>%
  ungroup() %>%
  filter(to_keep) %>%
  select(text, character_line) %>%
  # Collapse each speech into a single row: speaker, dialogue, id
  mutate(group = cumsum(character_line == 1)) %>%
  group_by(group) %>%
  summarise(
    character = text[character_line == 1][1],
    dialogue = paste(text[-1], collapse = " "),
    .groups = "drop") %>%
  rename(id = group) %>%
  select(character, dialogue, id) %>%
  # Count whitespace-separated words; str_count() is vectorized, so rowwise() is not needed
  mutate(word_count = stringr::str_count(dialogue, "\\S+")) %>%
  filter(word_count != 0)
## # A tibble: 10 × 4
## character dialogue id word_count
## <chr> <chr> <int> <int>
## 1 DONNA They got to start the poll, Josh. It's 7:05. 1 9
## 2 JOSH It's ten to seven. 2 4
## 3 DONNA No, it's really not. 3 4
## 4 JOSH It's 7:05? 4 2
## 5 DONNA Yeah. 5 1
## 6 JOSH That's ridiculous. 6 2
## 7 DONNA I'm not making it up. 7 5
## 8 JOSH My watch says ten to seven. 8 6
## 9 DONNA That's 'cause your watch sucks. 9 5
## 10 JOSH My watch is fine. 10 4
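The core trick in the pipeline above is using cumsum() on the speaker flag to give every speech its own id. Here is a minimal sketch of that idea on a made-up, nine-line script. The toy text, the speaker_regex, and the toy_dialogue name are all my own illustrations, and the sketch simplifies things by dropping every blank row rather than stopping at the first one.

```r
library(dplyr)
library(stringr)

# Toy script (made up for illustration): speaker names in capitals on their
# own line, dialogue on the lines that follow, blank lines between speeches.
toy <- tibble(text = c("TEASER", "", "DONNA", "It's 7:05.", "",
                       "JOSH", "It's ten", "to seven.", ""))

# Hypothetical two-speaker version of character_regex
speaker_regex <- "^(DONNA|JOSH)$"

toy_dialogue <- toy %>%
  mutate(is_speaker = str_detect(text, speaker_regex),
         id = cumsum(is_speaker)) %>%   # id ticks up at every speaker line
  filter(id > 0, text != "") %>%        # drop the preamble and blank rows
  group_by(id) %>%
  summarise(character = first(text),                      # first row is the speaker
            dialogue = paste(text[-1], collapse = " "),   # remaining rows are the speech
            .groups = "drop")

print(toy_dialogue)
```

Because the id only increases when a speaker line is detected, every row of dialogue inherits the id of the speaker line above it, which is what makes the later group_by() and summarise() possible.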
## # A tibble: 7 × 4
## Character `Total Words` `Average Words` `Total Lines`
## <chr> <int> <dbl> <int>
## 1 BARTLET 1359 9 145
## 2 C.J. 985 11 91
## 3 JOSH 757 11 67
## 4 LEO 679 8 90
## 5 TOBY 617 8 78
## 6 SAM 454 6 75
## 7 DONNA 162 8 20
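A summary like the one above can be built by grouping the per-line data by character. The sketch below is one way to do it, not necessarily the exact code behind the table: it runs on a small made-up stand-in (toy_lines) that borrows only the character and word_count column names from damn_lies, and the rounding of the average and the sort by total words are assumptions I inferred from how the table looks.

```r
library(dplyr)

# Made-up stand-in for damn_lies; only the column names come from the post.
toy_lines <- tibble(
  character  = c("DONNA", "JOSH", "DONNA", "JOSH"),
  word_count = c(9L, 4L, 5L, 2L)
)

character_summary <- toy_lines %>%
  group_by(Character = character) %>%
  summarise(`Total Words`   = sum(word_count),          # words spoken overall
            `Average Words` = round(mean(word_count)),  # assumed: rounded to whole words
            `Total Lines`   = n(),                      # one row per spoken line
            .groups = "drop") %>%
  arrange(desc(`Total Words`))                          # assumed: most talkative first

print(character_summary)
```

Swapping toy_lines for damn_lies should give one row per character. To find the single longest piece of dialogue instead, something like slice_max(damn_lies, word_count, n = 1) would do it.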