Text Retrieval and Pre-Processing

Text Retrieval

There are countless sources of text data, from a single haiku to bound volumes providing an expansive anthology of human knowledge. Text analysis tools let us bridge an entire domain of qualitative and quantitative inquiry across those sources.

Lies, Damn Lies, and Statistics (2000)

West Wing

Aaron Sorkin’s The West Wing (1999-2006) is not only the best political drama ever created, but I’d go so far as to say it is the best drama, ever – period. Public polling on the show’s continued popularity, almost 20 years after it ended, highlights its staying power. For instance, a survey conducted by Data for Progress found that 33 percent of Americans had a favorable opinion of The West Wing, while only 10 percent had an unfavorable opinion. The same poll also found that the show is more popular among older, more educated, and higher-income Americans.

One of my favorite episodes is Season 1’s Lies, Damn Lies, and Statistics – the title originating from a phrase popularized by Mark Twain (though attributed to British Prime Minister Benjamin Disraeli) to describe the persuasive power of statistics to bolster (often weak or subjective) arguments. In this episode, the White House staff obsess over poll results following a made-for-television fight to regain control of the American political narrative.

Let’s imagine I wanted to know which character in the episode spoke the most lines, or maybe even which had the longest dialogue. To do that, I would need to:

Locate the script for the episode (Here)
Convert it to a plain text format (.txt) – which I’ve conveniently done here
Load that plain text file into R
Process and organize the strings, as well as identify those strings with their associated characters
Perform analyses on the comparative volume of spoken lines, as well as the breadth of individual monologues.

west_wing <- readLines(west_wing_script_location, warn = FALSE) # Read Txt from GitHub Repo
head(west_wing) # Print Head

## [1] "THE WEST WING"                     "'LIES, DAMN LIES, AND STATISTICS'"
## [3] "WRITTEN BY: AARON SORKIN"          "DIRECTED BY: DON SCARDINO"        
## [5] ""                                  "TEASER"
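The `west_wing_script_location` object is not defined in the code shown, so the following is only a minimal sketch of the loading step. It assumes the location is a file path or URL string; here a temporary file with a few placeholder lines (the name `demo_location` and its contents are hypothetical) stands in for the real script.

```r
# Hypothetical stand-in for the script: a temporary file with a few lines.
# In practice the location would be a path or URL pointing at the .txt file.
demo_location <- tempfile(fileext = ".txt")
writeLines(
  c("THE WEST WING", "", "TEASER", "", "DONNA", "It's 7:05."),
  demo_location
)

# readLines() returns a character vector with one element per line,
# including empty strings for blank lines.
demo_lines <- readLines(demo_location, warn = FALSE)

length(demo_lines)   # number of lines read
head(demo_lines, 3)  # first three lines
```

Because blank lines come back as empty strings, they can later be used as separators between dialogue entries.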

Clearly, the text is not as organized as I’d like it to be. My next step will be to deliberately build a 4-column dataframe that identifies:

Character: Which character is currently speaking.
Dialogue: The text (string)
Word Count: How many words are found in the text
Line Number: The current number of dialogue entries to that point
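To make the Word Count column concrete, here is a small base-R sketch that counts words as runs of non-whitespace characters. The first two strings are dialogue entries that appear in the episode output; the empty string is an added edge case, and the object names are illustrative only.

```r
# Two dialogue strings plus an empty string as an edge case.
dialogue <- c("They got to start the poll, Josh. It's 7:05.", "It's 7:05?", "")

# gregexpr() locates every run of non-whitespace characters ("\\S+"),
# regmatches() extracts those runs, and lengths() counts them per string.
word_count <- lengths(regmatches(dialogue, gregexpr("\\S+", dialogue)))

word_count
```

Under this definition punctuation attached to a word counts as part of that word, and a token such as "7:05" counts as a single word.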

There are several easier ways to do this, but I want to try to be as precise as I can. I am only interested in Josh, Toby, C.J., Donna, Sam, Leo, and President Bartlet.
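One way to restrict attention to these characters is to match speaker cues with an anchored regular expression. The sketch below assumes each cue is the character's name in upper case on a line by itself; the `demo` lines are hypothetical and not taken from the script.

```r
# Names of interest, upper-cased to match how cues are assumed to appear.
cues <- toupper(c("Josh", "Toby", "C.J.", "Donna", "Sam", "Leo", "Bartlet"))

# Wrap literal dots in [.] so "C.J." does not match arbitrary characters.
cues_escaped <- gsub(".", "[.]", cues, fixed = TRUE)

# ^ and $ anchor the pattern so only lines consisting solely of a name match.
cue_regex <- paste0("^(", paste(cues_escaped, collapse = "|"), ")$")

# Hypothetical lines for illustration.
demo <- c("JOSH", "Josh is late.", "C.J.", "CAJB", "", "SAM")
is_cue <- grepl(cue_regex, demo)
is_cue
```

The anchors matter: without them, any line of dialogue that merely mentions a name in capitals would be flagged as a cue.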

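library(dplyr) # Provides %>%, mutate(), filter(), select(), group_by(), summarise()

west_wing <- data.frame(unlist(west_wing)) %>%
  setNames('text')
characters <- c('Josh', 'Toby', 'C.J.', 'Donna', 'Sam', 'Leo', 'Bartlet')
# Wrap literal dots (as in C.J.) in [.] so they are not treated as regex wildcards
character_regex <- paste0("^(", paste0(gsub(".", "[.]", toupper(characters), fixed = TRUE), collapse = "|"), ")$")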

damn_lies <-  west_wing %>%
    mutate(character_line = ifelse(stringr::str_detect(text, character_regex), 1, 0), 
           empty_row = ifelse(text == '', 1, 0), 
           first_entry = ifelse(character_line == 1, 1, NA)) %>%
    tidyr::fill(first_entry, .direction = 'down') %>%
    filter(!is.na(first_entry)) %>%
    select(-c(first_entry)) %>%
    mutate(group = cumsum(character_line == 1)) %>%
    group_by(group) %>%
    mutate(to_keep = row_number() < which(empty_row == 1)[1] | is.na(which(empty_row == 1)[1])) %>%
    ungroup() %>%
    filter(to_keep) %>%
    select(text, character_line) %>%
    mutate(group = cumsum(character_line == 1)) %>%
    group_by(group) %>%
    summarise(
    character = text[character_line == 1][1],
    dialogue  = paste(text[-1], collapse = " "),
    .groups = "drop") %>%
    rename(id = group) %>%
    select(character, dialogue, id) %>%
    # str_count() is vectorized, so no rowwise() step is needed
    mutate(word_count = stringr::str_count(dialogue, "\\S+")) %>%
    filter(word_count != 0)

## # A tibble: 10 × 4
##    character dialogue                                        id word_count
##    <chr>     <chr>                                        <int>      <int>
##  1 DONNA     They got to start the poll, Josh. It's 7:05.     1          9
##  2 JOSH      It's ten to seven.                               2          4
##  3 DONNA     No, it's really not.                             3          4
##  4 JOSH      It's 7:05?                                       4          2
##  5 DONNA     Yeah.                                            5          1
##  6 JOSH      That's ridiculous.                               6          2
##  7 DONNA     I'm not making it up.                            7          5
##  8 JOSH      My watch says ten to seven.                      8          6
##  9 DONNA     That's 'cause your watch sucks.                  9          5
## 10 JOSH      My watch is fine.                               10          4
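The grouping step in the pipeline relies on a running count of speaker cues: `cumsum()` over a cue indicator gives every line the id of the most recent cue. The following base-R sketch shows that idea on a hypothetical fragment (the `script` lines are made up), keeping only the lines between a cue and the first blank line after it.

```r
# Hypothetical fragment: a cue, its dialogue lines, then a blank separator.
script <- c("JOSH", "It's ten", "to seven.", "",
            "DONNA", "No, it's really not.", "")

# TRUE where a line is a speaker cue.
is_cue <- script %in% c("JOSH", "DONNA")

# Running total of cues: each line inherits the id of the latest cue.
group <- cumsum(is_cue)

# Per group: drop the cue line, keep text up to the first blank line, join it.
dialogue <- tapply(script, group, function(x) {
  body <- x[-1]
  first_blank <- which(body == "")[1]
  if (!is.na(first_blank)) body <- body[seq_len(first_blank - 1)]
  paste(body, collapse = " ")
})

group
dialogue
```

Stopping at the first blank line is what keeps stage directions that follow a speech from being attributed to the speaker, assuming the script separates them with an empty line.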

## # A tibble: 7 × 4
##   Character `Total Words` `Average Words` `Total Lines`
##   <chr>             <int>           <dbl>         <int>
## 1 BARTLET            1359               9           145
## 2 C.J.                985              11            91
## 3 JOSH                757              11            67
## 4 LEO                 679               8            90
## 5 TOBY                617               8            78
## 6 SAM                 454               6            75
## 7 DONNA               162               8            20
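The code that produced this summary is not shown. As a rough sketch of how such a table could be computed, the base-R example below aggregates a toy data frame built from the ten rows printed earlier. The column names, the rounding of the average, and the sort order are assumptions inferred from the printed table, not the original code.

```r
# Toy stand-in for `damn_lies`, using the ten rows printed earlier.
toy <- data.frame(
  character  = rep(c("DONNA", "JOSH"), times = 5),
  word_count = c(9, 4, 4, 2, 1, 2, 5, 6, 5, 4)
)

# Total words, rounded mean words per line, and line count per character.
summary_tbl <- do.call(rbind, lapply(split(toy, toy$character), function(d) {
  data.frame(
    Character     = d$character[1],
    Total_Words   = sum(d$word_count),
    Average_Words = round(mean(d$word_count)),
    Total_Lines   = nrow(d)
  )
}))

# Sort by total words, largest first, as the table above appears to be.
summary_tbl <- summary_tbl[order(-summary_tbl$Total_Words), ]
summary_tbl
```

With dplyr loaded, the same aggregation could be written with `group_by()`, `summarise()`, and `arrange()`; base R is used here only so the sketch runs without extra packages.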

POS6933: Computational Social Science

Truscott (Spring 2026)
