Supervised Classification

Imagine you are coding party platforms or manifestos and are interested in whether these groups are discussing social, economic, or other policy issues. Perhaps the most tried and true method is also the most obvious: read the documents. Completing a task like this with only a few dozen documents is certainly feasible, if a bit tedious. Yet, recall our payroll example from Class 3: completing the same task at scale, where the observations span thousands (or millions) of pages or documents, is considerably more time-consuming. Moreover, adding observations only increases the risk that manual (hand) coding introduces error or bias as the language to be coded becomes more diverse.

However, supervised classification methods can theoretically alleviate both obstacles (time and coding error) by leveraging manually-coded training data to learn systematic patterns in language and subsequently apply those patterns consistently at scale. Rather than requiring researchers to read and classify every document, supervised models are trained on a subset of texts that have been labeled according to theoretically meaningful categories (e.g., social policy, economic policy, foreign affairs, etc.). Once trained, these models can rapidly classify thousands or even millions of documents with a level of consistency that is difficult to achieve through human coding alone.

In essence, supervised classification concerns our ability to leverage a (small) sample of manually classified terms, n-grams, or documents to inform a model about how to approach the remaining (unseen and uncoded) observations. As you might infer, the flexibility here is quite attractive. I could just as easily develop a training set (a sample of hand-coded observations used to train the model) for classifying platforms (manifestos) as discussing certain policy dimensions as one for classifying whether the language they use about particular policies is positive, negative, liberal, conservative, etc. The key, as will be discussed below, is satisfying a few assumptions regarding both our training and testing (unseen) data.

Creating a Training Set

The first step in conducting supervised classification is to develop a training set from the larger set of observations. The first question I’d imagine you may ask is how much (or what percentage) of your observations needs to be used for training. There isn’t a definitive answer, though I’d advise you to consider the trade-offs of setting observations aside for training: use too few and you risk biasing your results with a non-representative training set; use too many and you shrink the testing set you use for inferences. Luckily, a great way to gauge whether you’ve found the Goldilocks region is with validation, which is discussed more below.

Some other (and extended) considerations (GRS):

  1. To make sure your training set is representative, it is best to draw a random sample whenever possible, and perhaps even a stratified random sample if you expect your data to vary over time or along other dimensions.

  2. The performance of supervised learning is conditioned on the accuracy of the training set. GRS recommend at least two human coders when developing a training set so that observations are consistently (and correctly) annotated. In this article, my coauthors and I each coded the training set for the comparative classification section, compared our annotations, and collectively resolved any discrepancies. All of this boils down to a simple premise: if you say a term (n-gram, document) represents \(x\) rather than \(x'\), it needs to actually represent \(x\).

  3. Generally speaking, GRS emphasize that a good training set should have the following characteristics: objective-intersubjectivity, an a priori design, reliability, validity, and replicability.
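The first two considerations above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed workflow: the corpus, the stratifying variable (election year), the 25% sampling fraction, the labels, and the helper names `stratified_sample` and `coder_agreement` are all hypothetical. The agreement score here is a simple share of matching labels; chance-corrected reliability statistics exist but are not shown.

```python
import random
from collections import defaultdict


def stratified_sample(docs, stratum_of, fraction, seed=0):
    """Draw roughly `fraction` of the documents from each stratum."""
    rng = random.Random(seed)  # fixed seed so the draw is replicable
    by_stratum = defaultdict(list)
    for doc in docs:
        by_stratum[stratum_of(doc)].append(doc)
    sample = []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample


def coder_agreement(labels_a, labels_b):
    """Return the share of matching labels and the indices to resolve."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both coders must label the same observations")
    disagreements = [
        i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b
    ]
    agreement = 1 - len(disagreements) / len(labels_a)
    return agreement, disagreements


# Hypothetical corpus: 20 manifestos per election year, stored as (year, id).
corpus = [(year, i) for year in (2010, 2014, 2018) for i in range(20)]
training_ids = stratified_sample(corpus, stratum_of=lambda d: d[0], fraction=0.25)

# Hypothetical annotations from two coders on five sampled documents.
coder_1 = ["economic", "social", "social", "other", "economic"]
coder_2 = ["economic", "social", "economic", "other", "economic"]
agreement, to_resolve = coder_agreement(coder_1, coder_2)
```

The indices in `to_resolve` are the observations the coders would revisit together before the training set is finalized.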

Classifying Documents and Checking Performance

The next step is selecting a classifier to learn the mapping between the features and labels in the training set. There is a lot of flexibility and discretion that you can exercise here, including how features are represented and whether to remove irrelevant features (e.g., to reduce complexity). Once you have settled on a feature representation, you must choose the model itself, which again is very flexible and open to your discretion. We will discuss Naive Bayes in more detail later.
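To make the idea of learning a mapping from features to labels concrete, here is a minimal sketch of a multinomial Naive Bayes classifier written from scratch. It assumes a simple bag-of-words representation (lowercased, whitespace-split tokens) and add-one smoothing; the toy texts, labels, and function names are hypothetical, and a real project would more likely rely on an established library.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Bag-of-words features: lowercase tokens split on whitespace."""
    return text.lower().split()


def train_naive_bayes(texts, labels):
    """Count label frequencies and per-label word frequencies."""
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log posterior (add-one smoothing)."""
    label_counts, word_counts, vocabulary = model
    n_docs = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(count / n_docs)  # log prior for this label
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # skip words never seen in training
            smoothed = (word_counts[label][token] + 1) / (
                total_words + len(vocabulary)
            )
            score += math.log(smoothed)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical hand-coded training set.
train_texts = [
    "raise taxes and cut the budget deficit",
    "lower taxes to grow the economy",
    "expand access to public schools",
    "protect marriage equality and civil rights",
]
train_labels = ["economic", "economic", "social", "social"]

model = train_naive_bayes(train_texts, train_labels)
predicted = predict(model, "a plan to cut taxes")
```

Working in log probabilities avoids numerical underflow when documents are long, and the smoothing keeps a single unseen word-label pair from zeroing out a label's score.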

Regardless of your model selection, it is a good rule of thumb to validate the fitted model. Preferably using cross-validation, hold out a set of hand-coded observations from training, then compare the fitted model’s classifications of those observations against your hand-coded labels using a confusion matrix. Accuracy and other metrics are derived from this matrix and serve as good evidence of the model’s general performance. Some of these other metrics include precision (the proportion of predicted \(k\) classifications that are truly \(k\)) and recall (the proportion of true \(k\) observations that are correctly classified by the model).
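The metrics defined above can be computed directly from the confusion matrix. The sketch below is illustrative only: the held-out labels and model predictions are made up, and the helper names are hypothetical rather than taken from any particular package.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(true_labels, predicted_labels))


def accuracy(matrix):
    """Share of all observations where the prediction matches the true label."""
    correct = sum(n for (true, pred), n in matrix.items() if true == pred)
    return correct / sum(matrix.values())


def precision(matrix, k):
    """Share of observations predicted as k that are truly k."""
    predicted_k = sum(n for (true, pred), n in matrix.items() if pred == k)
    return matrix[(k, k)] / predicted_k if predicted_k else float("nan")


def recall(matrix, k):
    """Share of observations that are truly k and were predicted as k."""
    true_k = sum(n for (true, pred), n in matrix.items() if true == k)
    return matrix[(k, k)] / true_k if true_k else float("nan")


# Hypothetical held-out observations: hand-coded labels vs. model output.
hand_coded = ["economic", "economic", "economic", "social",
              "social", "other", "other", "other"]
model_out = ["economic", "economic", "social", "social",
             "social", "other", "economic", "other"]

matrix = confusion_matrix(hand_coded, model_out)
overall_accuracy = accuracy(matrix)              # 6 of 8 correct
economic_precision = precision(matrix, "economic")
social_recall = recall(matrix, "social")
```

Reporting precision and recall per category alongside accuracy shows which categories the model confuses, which a single accuracy figure hides.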

A question you may ask at this point is what score is a good indicator for validation. Again, the answer isn’t definitive. Just understand that when reporting these metrics (as you should when aiming to publish a study that uses these methods), reviewers tend to approach this as a spot test. Some reviewers will look at an accuracy above 50% as good because you have essentially beaten a coin flip, at least for binary classification with roughly balanced classes; you have improved on the odds of random selection. Others are harder to convince and expect a classification accuracy in excess of what we’d expect from trained coders, usually 70-80% accuracy on a validation set.