Naive Bayes

\[ P(A\mid B) = \frac{P(B\mid A) P(A)}{P(B)} \quad \text{ Bayes' Rule} \]

Naive Bayes is a canonical supervised classification model that leverages Bayes’ rule to recover the probability that an unseen document (\(D_i\)) containing words (\(w_{i1},...,w_{iJ}\)) belongs to a certain class \(k\), where \(\pi_{ik}\) is an indicator equal to 1 if \(D_i\) is a member of class \(k\) and 0 otherwise. Using Bayes’ rule, the posterior probability that document \(i\) belongs to class \(k\) is:

\[ p(\pi_{ik} = 1\mid D_i) = \frac{p(\pi_{ik} = 1)p(D_i\mid \pi_{ik} =1)}{p(D_i)} \quad \text{ Bayes' Rule -- Conditional Probability} \\ p(\pi_{ik} = 1) \quad \text{ Baseline Probability that } D_i \text{ Belongs to Class } k \text{ Before Reading It} \\ p(D_i\mid \pi_{ik} = 1) \quad \text{ Probability of Observing } w_{i1}...w_{iJ} \text{ if } D_i \text{ is in Class } k \\ p(D_i) \quad \text{ Probability of Observing } D_i \text{ in Any Class -- Used to Normalize at the End} \] Naive assumption: conditional on class, words are independent of each other – i.e., once you know the class of a document, learning that one word appears in it tells you nothing additional about whether another word appears. This is almost certainly wrong (more soon), but (as GRS note) baking in this assumption makes estimation much easier while remaining effective.

\[ W_i\mid \pi_{ik} = 1 \sim \text{ Multinomial}\Big(\sum_jW_{ij},\mu_k\Big) \\ p(W_i\mid \pi_{ik} = 1) \propto \prod^J_{j=1}\mu_{kj}^{W_{ij}}, \\ \hat{\mu}_{kj} = \frac{c + \sum^N_{i=1} \pi_{ik}W_{ij}}{Jc + \sum_i \sum_j \pi_{ik}W_{ij}} \]

There’s a lot going on here, but the intuition is that the probability of observing word counts \(W_{i1},...,W_{iJ}\) in class \(k\) is governed by \(\mu_k\), the vector of probabilities for drawing each word in class \(k\). What this means practically is that we collect all of the documents assigned to class \(k\) and calculate the frequency with which each word \(j\) is used (and we can address problems with sparse word usage via Laplace smoothing, adding a constant \(c\) to every count). The final Naive Bayes algorithm is:
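To make the \(\hat{\mu}_{kj}\) estimator concrete, here is a minimal sketch in base R, assuming an invented three-document count matrix and class labels:

```r
W <- rbind(c(3, 1, 0, 0),   # word counts for document 1
           c(0, 0, 2, 2),   # document 2
           c(2, 2, 0, 1))   # document 3
y <- c("A", "B", "A")       # class labels (the pi_ik indicators, in label form)
c_smooth <- 1               # smoothing constant c
J <- ncol(W)                # vocabulary size

# One row of smoothed word probabilities per class
mu_hat <- t(sapply(unique(y), function(k) {
  counts <- colSums(W[y == k, , drop = FALSE])  # sum_i pi_ik * W_ij
  (c_smooth + counts) / (J * c_smooth + sum(counts))
}))
rowSums(mu_hat)  # each row sums to 1, as a probability vector should
```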

\[ p(\pi_{ik}=1\mid W_i) \propto \frac{\sum^N_{i=1} I(y_i=k)}{N} \prod^J_{j=1} \mu_{kj}^{W_{ij}} \] Which means that the probability of document \(i\) belonging to class \(k\) is proportional to the baseline frequency of class \(k\) times the probability of observing all the words in the document if it came from that class, where \(\frac{\sum^N_{i=1} I(y_i=k)}{N}\) is the baseline prior and \(\prod^J_{j=1} \mu_{kj}^{W_{ij}}\) is the likelihood term.

Finally, since the probabilities for single words are often very small, the standard implementation of Naive Bayes works on the log scale: taking logs converts the product of probabilities into a sum, which is computationally safer and easier, and the predicted class is then the \(\text{arg max}\) over classes.

\[ \hat{y}_i = \text{arg max}_k \Big[\log p(\pi_{ik} = 1) + \sum^J_{j=1}W_{ij}\log \mu_{kj}\Big] \] Which essentially just means to pick the class (\(k\)) with the largest value. In other words, the model adds up the contributions from the prior and all the words, then chooses the class that looks most likely – a process that is much simpler and more stable when we work with logs.
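A minimal sketch of that decision rule in R, assuming we already have a prior vector and a smoothed \(\mu\) matrix (both invented here):

```r
predict_nb <- function(w, prior, mu) {
  # log prior + sum_j W_ij * log(mu_kj), computed for every class k at once
  scores <- log(prior) + as.vector(log(mu) %*% w)
  names(prior)[which.max(scores)]
}

prior <- c(pos = 0.5, neg = 0.5)        # invented class priors
mu <- rbind(pos = c(0.7, 0.2, 0.1),     # invented word probabilities per class
            neg = c(0.1, 0.2, 0.7))
predict_nb(c(2, 1, 0), prior, mu)       # counts favor the first word -> "pos"
```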

Basic Naive Bayes Example

Let’s suppose we are classifying stories from a popular newspaper and want to know what types of content it is discussing. Say we have two classes – Sports and Politics – and just three words to consider – Game, Team, and Vote.

We first construct a training set using a random sample of the observations, which produces:

Document Class Words (Counts)
1 Sports Game (2), Team (1), Vote (0)
2 Sports Game (1), Team (2), Vote (0)
3 Politics Game (0), Team (0), Vote (3)

The class (\(k\)) priors (\(\frac{\sum^N_i I(y_i=k)}{N}\)) would be \(p(\text{Sports}) = \frac{2}{3}\) and \(p(\text{Politics}) = \frac{1}{3}\), because 2 of the 3 documents in the training set are labeled as Sports.

When we compute the word probabilities per class (\(\mu_{kj}\)), with 6 total words observed in Sports and 3 in Politics, some terms will not appear in certain classes (e.g., Vote does not appear in any Sports documents), so we add Laplace smoothing with a constant \(c\). With \(c = 1\) and a total vocabulary of \(J = 3\) words, \(\mu_{kj}\) =

  • \(\mu_{Sports, Game} = \frac{3 + 1}{6 + 3} = \frac{4}{9} \approx 0.444\)
  • \(\mu_{Sports, Team} = \frac{3 + 1}{6 + 3} = \frac{4}{9} \approx 0.444\)
  • \(\mu_{Sports, Vote} = \frac{0 + 1}{6 + 3} = \frac{1}{9} \approx 0.111\)
  • \(\mu_{Politics, Game} = \frac{0 + 1}{3 + 3} = \frac{1}{6} \approx 0.167\)
  • \(\mu_{Politics, Team} = \frac{0 + 1}{3 + 3} = \frac{1}{6} \approx 0.167\)
  • \(\mu_{Politics, Vote} = \frac{3 + 1}{3 + 3} = \frac{4}{6} \approx 0.667\)

Now suppose we introduce a new document (\(D_{New}\)) = (Game, Team, Vote):

\[ p(Sports\mid D_{New}) \propto p(Sports) \times \mu_{Sports,Game} \times \mu_{Sports,Team} \times \mu_{Sports,Vote} \\ \frac{2}{3} \times 0.444 \times 0.444 \times 0.111 \approx 0.0146 \] \[ p(Politics\mid D_{New}) \propto p(Politics) \times \mu_{Politics,Game} \times \mu_{Politics,Team} \times \mu_{Politics,Vote} \\ \frac{1}{3} \times 0.167 \times 0.167 \times 0.667 \approx 0.0062 \]

Normalizing the probabilities, we get:

\[ p(Sports\mid D_{New}) = \frac{0.0146}{0.0146+0.0062} \approx 0.70 \\ p(Politics\mid D_{New}) = \frac{0.0062}{0.0146+0.0062} \approx 0.30 \]
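The arithmetic above can be reproduced in a few lines of base R (hard-coding the smoothed \(\mu\) values from the bullets):

```r
prior <- c(Sports = 2/3, Politics = 1/3)
mu <- rbind(Sports   = c(Game = 4/9, Team = 4/9, Vote = 1/9),
            Politics = c(Game = 1/6, Team = 1/6, Vote = 4/6))

unnorm <- prior * apply(mu, 1, prod)  # prior times product of word probabilities
round(unnorm / sum(unnorm), 2)        # Sports ~ 0.70, Politics ~ 0.30
```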

Though the unnormalized scores are both rather small, the document is nevertheless more likely to be Sports – adding more documents (and words) to the training set should improve our inferential power.

While I haven’t done it here, we could also implement the argmax approach by taking the log of both the priors and likelihoods, which would preserve the preference for Sports while avoiding the computational issues caused by very small probabilities (even with Laplace smoothing).
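For completeness, that log-space check looks like this in R (same hard-coded values as the worked example):

```r
prior <- c(Sports = 2/3, Politics = 1/3)
mu <- rbind(Sports   = c(Game = 4/9, Team = 4/9, Vote = 1/9),
            Politics = c(Game = 1/6, Team = 1/6, Vote = 4/6))
w_new <- c(Game = 1, Team = 1, Vote = 1)  # one occurrence of each word

scores <- log(prior) + as.vector(log(mu) %*% w_new)  # log prior + log likelihood
names(which.max(scores))  # "Sports", matching the probability-scale answer
```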

Larger Example Using R

Now let’s do an extended example in R using a dataset of hand-coded IMDB movie reviews.

library(tm)    # VCorpus, DocumentTermMatrix, etc.
library(dplyr) # bind_rows
library(e1071) # naiveBayes

for (i in 1:length(imdb_files)){
  load(imdb_files[i])
} # Load Each IMDB File from Week 7 Data Directory

set.seed(1234) # Set Random Seed

train_data <- bind_rows(imdb_train_pos, imdb_train_neg) # Combine Train Rows 
test_data  <- bind_rows(imdb_test_pos, imdb_test_neg) # Combine Test Rows

train_data <- train_data[sample(nrow(train_data), size = 5000, replace = FALSE), ] # Sample Out
test_data <- test_data[sample(nrow(test_data), size = 1000, replace = FALSE), ] # Sample Out

train_corpus <- tm::VCorpus(VectorSource(train_data$text)) # Convert Train to Corpus
test_corpus  <- tm::VCorpus(VectorSource(test_data$text)) # Convert Test to Corpus

reduce_complexity <- function(corpus){
  corpus <- tm_map(corpus, content_transformer(tolower))       
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
} # Pre-Process (Reduce Complexity)

train_corpus <- reduce_complexity(train_corpus) # Apply Complexity Reduction
test_corpus  <- reduce_complexity(test_corpus) # Apply Complexity Reduction

train_dtm <- removeSparseTerms(DocumentTermMatrix(train_corpus), 0.995) # Convert to DTM - Only Keep Terms in At least 0.5% of Docs
test_dtm  <- DocumentTermMatrix(test_corpus, control = list(dictionary = Terms(train_dtm))) # Control Ensures Same Terms in Test as Train

train_matrix <- as.data.frame(as.matrix(train_dtm)) # Convert Back to DF for NB
train_matrix$sentiment <- train_data$sentiment # Ensure Sentiment 

test_matrix <- as.data.frame(as.matrix(test_dtm)) # Convert to DF
test_matrix$sentiment <- test_data$sentiment # Ensure Sentiment (for comparison)

head(train_matrix[,c(1:5)])
##   abandoned ability able absence absolute
## 1         0       0    1       0        0
## 2         0       0    0       0        0
## 3         0       0    0       0        0
## 4         0       0    0       0        0
## 5         0       0    0       0        0
## 6         0       1    0       0        0
nb_model <- e1071::naiveBayes(sentiment ~ ., data = train_matrix) # Run NB Model

head(nb_model$tables) # Mean [,1] and SD [,2] of Each Word's Count Given Class (Numeric Predictors Are Treated as Gaussian)
## $abandoned
##           abandoned
## Y                 [,1]       [,2]
##   negative 0.007278609 0.08965275
##   positive 0.007518797 0.09512492
## 
## $ability
##           ability
## Y                [,1]      [,2]
##   negative 0.01860089 0.1572729
##   positive 0.01662050 0.1278700
## 
## $able
##           able
## Y                [,1]      [,2]
##   negative 0.04407602 0.2223329
##   positive 0.06489909 0.2649762
## 
## $absence
##           absence
## Y                 [,1]       [,2]
##   negative 0.004043672 0.06347401
##   positive 0.007123071 0.08411381
## 
## $absolute
##           absolute
## Y                [,1]      [,2]
##   negative 0.01698342 0.1353509
##   positive 0.01266324 0.1153239
## 
## $absolutely
##           absolutely
## Y                [,1]      [,2]
##   negative 0.07157299 0.2818194
##   positive 0.04788287 0.2279085
predictions <- predict(nb_model, newdata = test_matrix) # Make Predictions on Test Data

mean(predictions == test_matrix$sentiment) # Performance Accuracy
## [1] 0.764
table(Predicted = predictions, Actual = test_matrix$sentiment) # Confusion Matrix
##           Actual
## Predicted  negative positive
##   negative      418      143
##   positive       93      346
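One caveat on the model above: because the DTM columns are numeric counts, `e1071::naiveBayes()` models each word as Gaussian within class, which is why `$tables` holds means and standard deviations. A common alternative is to binarize the matrix into presence/absence factors so the likelihoods become genuine conditional probabilities. A minimal sketch of the conversion, using an invented stand-in for the DTM:

```r
# Toy count matrix standing in for train_matrix (invented values)
counts <- data.frame(game = c(2, 0, 1), vote = c(0, 3, 0))

# Did each word appear at all? Convert every column to a No/Yes factor
binarize <- function(x) factor(ifelse(x > 0, "Yes", "No"), levels = c("No", "Yes"))
binary <- as.data.frame(lapply(counts, binarize))
binary$game  # Yes No Yes, Levels: No Yes
```

Passing factors like these to `naiveBayes()` makes each entry of `$tables` a class-conditional probability of a word's presence or absence, which matches the multinomial story told earlier more closely than the Gaussian default.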