\[ P(A\mid B) = \frac{P(B\mid A) P(A)}{P(B)} \quad \text{ Bayes' Rule} \]
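As a quick sanity check, the rule can be verified numerically in R (the probabilities here are made up purely for illustration):

```r
# Hypothetical numbers: P(A) = 0.3, P(B|A) = 0.8, P(B|not A) = 0.4
p_a <- 0.3
p_b_given_a <- 0.8
p_b_given_not_a <- 0.4
p_b <- p_b_given_a * p_a + p_b_given_not_a * (1 - p_a) # Law of Total Probability
p_a_given_b <- (p_b_given_a * p_a) / p_b               # Bayes' Rule
p_a_given_b # = 0.24 / 0.52, about 0.46
```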
Naive Bayes is a canonical supervised classification model that leverages Bayes’ rule to recover the probability that an unseen document (\(D_i\)) containing words (\(w_{i1},...,w_{iJ}\)) belongs to a given class (\(k\)), where \(\pi_{ik}\) is an indicator equal to 1 if \(D_i\) belongs to class \(k\) and 0 otherwise. Using Bayes’ rule, the posterior probability that document \(i\) belongs to class \(k\) is:
\[ p(\pi_{ik}\mid D_i) = \frac{p(\pi_{ik} = 1)p(D_i\mid \pi_{ik} =1)}{p(D_i)} \quad \text{ Bayes' Rule -- Conditional Probability} \\ p(\pi_{ik} = 1) \quad \text{ Baseline Probability that } D_i \text{ Belongs to Class } k \text{ Before Reading It} \\ p(D_i\mid \pi_{ik} = 1) \quad \text{ Probability of Observing } w_{i1}...w_{iJ} \text{ if } D_i \text{ in Class } k \\ p(D_i) \quad \text{ Probability of Observing } D_i \text{ in Any Class -- Used to Normalize at the End} \] Naive Assumption: Conditional on class, words are independent of each other – i.e., once you know the class of a document, learning that one word appears in it tells you nothing additional about whether another word appears. This is almost certainly wrong (more soon), but (as GRS note) baking in this assumption makes estimation much easier while remaining surprisingly effective.
\[ W_i\mid \pi_{ik} = 1 \sim \text{ Multinomial}(\textstyle\sum_j W_{ij}, \mu_k) \\ p(W_i\mid \pi_{ik} = 1) \propto \prod^J_{j=1}\mu_{kj}^{W_{ij}}, \\ \hat{\mu}_{kj} = \frac{c + \sum^N_{i=1} \pi_{ik}W_{ij}}{Jc + \sum_i \sum_j \pi_{ik}W_{ij}} \]
There’s a lot going on here, but the intuition is that the probability of observing word counts \(W_{i1},...,W_{iJ}\) in class \(k\) is governed by the vector of probabilities \(\mu_k\) for drawing each word in class \(k\). What this means practically is that we collect all of the documents assigned to class \(k\) and calculate the frequency with which each word \(j\) is used (and we can mitigate problems with sparse word usage, in the spirit of Laplace smoothing, by adding a constant \(c\) to every count). The final Naive Bayes algorithm is:
\[ p(\pi_{ik}=1\mid W_i) \propto \frac{\sum^N_{i=1} I(y_i=k)}{N} \prod^J_{j=1} \mu_{kj}^{W_{ij}} \] Which means that the probability of document \(i\) belonging to class \(k\) is proportional to the baseline frequency of class \(k\) times the probability of observing all the words in the document if it came from that class, where \(\frac{\sum^N_{i=1} I(y_i=k)}{N}\) is the baseline prior and \(\prod^J_{j=1} \mu_{kj}^{W_{ij}}\) is the likelihood term.
Finally, since the probabilities for single words are often very small, the standard way to implement Naive Bayes is in log space with an \(\arg\max\) over classes: taking the log converts products into sums, which makes computation numerically safer and easier.
\[ \hat{y}_i = \text{arg max}_k \left[\log p(\pi_{ik} = 1) + \sum^J_{j=1} W_{ij}\log\mu_{kj}\right] \] Which essentially just means picking the class (\(k\)) with the largest value. In other words, the model adds up the contributions from the prior and all the words, then chooses the class that looks most likely — a process that becomes much simpler and more stable when we work with logs.
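As a sketch of how this decision rule works end to end, here is a minimal from-scratch version in R; the count matrix, labels, and function name are all made up for illustration:

```r
# Toy data (made-up counts): rows = documents, columns = word counts
X <- matrix(c(3, 0,
              2, 1,
              0, 4),
            nrow = 3, byrow = TRUE)
y <- c("a", "a", "b") # Class labels for the three documents

# Classify a new count vector with the log arg-max rule above
nb_classify <- function(X, y, x_new, c_smooth = 1) {
  classes <- unique(y)
  scores <- sapply(classes, function(k) {
    rows <- X[y == k, , drop = FALSE]
    mu_k <- (c_smooth + colSums(rows)) / (ncol(X) * c_smooth + sum(rows)) # Smoothed word probs
    log(mean(y == k)) + sum(x_new * log(mu_k))                           # Log prior + log likelihood
  })
  classes[which.max(scores)] # Pick the highest-scoring class
}

nb_classify(X, y, x_new = c(1, 1)) # A new document with one of each word
```

Note that `which.max` breaks ties by taking the first maximum, so with exactly tied scores the ordering of `unique(y)` determines the prediction.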
Let’s suppose we are classifying stories from a popular newspaper and want to know what types of content it is discussing. Let’s say we have two classes – Sports and Politics – and just three words to consider – Game, Team, and Vote.
We first construct a training set using a random sample of the observations, which produces:
| Document | Class | Words (Counts) |
|---|---|---|
| 1 | Sports | Game (2), Team (1), Vote (0) |
| 2 | Sports | Game (1), Team (2), Vote (0) |
| 3 | Politics | Game (0), Team (0), Vote (3) |
The class (\(k\)) priors (\(\frac{\sum^N_{i=1} I(y_i=k)}{N}\)) would be \(p(\text{Sports}) = \frac{2}{3}\) and \(p(\text{Politics}) = \frac{1}{3}\), because 2 of the 3 documents in the training set are labeled as Sports.
Before we compute the per-class word probabilities (\(\mu_{kj}\)), note that with 6 total words observed in Sports and 3 in Politics, there are going to be terms that don’t appear in certain classes (e.g., Vote does not appear in any Sports document), so we are going to add Laplace smoothing (\(c\)). With \(c = 1\) and the total vocabulary \(J = 3\), the smoothed estimates are:

| Class | Game | Team | Vote |
|---|---|---|---|
| Sports | \(\frac{3+1}{6+3} \approx 0.444\) | \(\frac{3+1}{6+3} \approx 0.444\) | \(\frac{0+1}{6+3} \approx 0.111\) |
| Politics | \(\frac{0+1}{3+3} \approx 0.167\) | \(\frac{0+1}{3+3} \approx 0.167\) | \(\frac{3+1}{3+3} \approx 0.667\) |
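These smoothed estimates can be reproduced in a few lines of R (the count vectors below are simply the word totals per class from the training table):

```r
counts_sports   <- c(game = 3, team = 3, vote = 0) # Summed counts across Sports docs
counts_politics <- c(game = 0, team = 0, vote = 3) # Summed counts across Politics docs
c_smooth <- 1 # Laplace constant
J <- 3        # Vocabulary size
mu_sports   <- (c_smooth + counts_sports)   / (J * c_smooth + sum(counts_sports))
mu_politics <- (c_smooth + counts_politics) / (J * c_smooth + sum(counts_politics))
round(mu_sports, 3)   # game 0.444, team 0.444, vote 0.111
round(mu_politics, 3) # game 0.167, team 0.167, vote 0.667
```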
Now suppose we introduce a new document (\(D_{New}\)) = (Game, Team, Vote):
\[ p(Sports\mid D_{New}) \propto p(Sports) \times \mu_{Sports,Game} \times \mu_{Sports,Team} \times \mu_{Sports,Vote} \\ \frac{2}{3} \times 0.444 \times 0.444 \times 0.111 \approx 0.0145 \] \[ p(Politics\mid D_{New}) \propto p(Politics) \times \mu_{Politics,Game} \times \mu_{Politics,Team} \times \mu_{Politics,Vote} \\ \frac{1}{3} \times 0.167 \times 0.167 \times 0.667 \approx 0.0062 \]
Normalizing the probabilities, we get:
\[ p(Sports\mid D_{New}) = \frac{0.0145}{0.0145+0.0062} \approx 0.70 \\ p(Politics\mid D_{New}) = \frac{0.0062}{0.0145+0.0062} \approx 0.30 \]
Though the unnormalized quantities are both rather small, the document is nevertheless more likely to be Sports – adding more documents (and words) to the training set would sharpen these estimates.
While I haven’t done it here, we could also implement the argmax approach by taking the log of both the priors and likelihoods, which would preserve the preference for Sports while avoiding the computational issues caused by very small probabilities (even with Laplace smoothing).
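For completeness, here is what that log-space comparison looks like for the toy example, plugging in the prior and smoothed word probabilities used above:

```r
# Log prior + log word probabilities for each class
log_sports   <- log(2/3) + log(4/9) + log(4/9) + log(1/9)
log_politics <- log(1/3) + log(1/6) + log(1/6) + log(4/6)
round(c(sports = log_sports, politics = log_politics), 3)
# Arg max picks Sports, matching the normalized probabilities
names(which.max(c(sports = log_sports, politics = log_politics)))
```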
Now let’s do an extended example in R using a dataset of hand-coded IMDB movie reviews.
```r
library(tm)    # Text Mining Utilities (Corpus, DTM, Pre-Processing)
library(dplyr) # bind_rows()

for (i in 1:length(imdb_files)){
  load(imdb_files[i])
} # Load Each IMDB File from Week 7 Data Directory
set.seed(1234) # Set Random Seed
train_data <- bind_rows(imdb_train_pos, imdb_train_neg) # Combine Train Rows
test_data <- bind_rows(imdb_test_pos, imdb_test_neg) # Combine Test Rows
train_data <- train_data[sample(nrow(train_data), size = 5000, replace = FALSE), ] # Sample Out
test_data <- test_data[sample(nrow(test_data), size = 1000, replace = FALSE), ] # Sample Out
train_corpus <- tm::VCorpus(VectorSource(train_data$text)) # Convert Train to Corpus
test_corpus <- tm::VCorpus(VectorSource(test_data$text)) # Convert Test to Corpus
reduce_complexity <- function(corpus){
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
} # Pre-Process (Reduce Complexity)
train_corpus <- reduce_complexity(train_corpus) # Apply Complexity Reduction
test_corpus <- reduce_complexity(test_corpus) # Apply Complexity Reduction
train_dtm <- removeSparseTerms(DocumentTermMatrix(train_corpus), 0.995) # Convert to DTM - Only Keep Terms in At Least 0.5% of Docs
test_dtm <- DocumentTermMatrix(test_corpus, control = list(dictionary = Terms(train_dtm))) # Control Ensures Same Terms in Test as Train
train_matrix <- as.data.frame(as.matrix(train_dtm)) # Convert Back to DF for NB
train_matrix$sentiment <- train_data$sentiment # Attach Sentiment
test_matrix <- as.data.frame(as.matrix(test_dtm)) # Convert to DF
test_matrix$sentiment <- test_data$sentiment # Attach Sentiment (for Comparison)
head(train_matrix[,c(1:5)])
```

```
## abandoned ability able absence absolute
## 1 0 0 1 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 1 0 0 0
```
```r
nb_model <- e1071::naiveBayes(sentiment ~ ., data = train_matrix) # Run NB Model
head(nb_model$tables) # Class-Conditional Mean [,1] and SD [,2] of Each Word Count (e1071 Treats Numeric Features as Gaussian)
```

```
## $abandoned
## abandoned
## Y [,1] [,2]
## negative 0.007278609 0.08965275
## positive 0.007518797 0.09512492
##
## $ability
## ability
## Y [,1] [,2]
## negative 0.01860089 0.1572729
## positive 0.01662050 0.1278700
##
## $able
## able
## Y [,1] [,2]
## negative 0.04407602 0.2223329
## positive 0.06489909 0.2649762
##
## $absence
## absence
## Y [,1] [,2]
## negative 0.004043672 0.06347401
## positive 0.007123071 0.08411381
##
## $absolute
## absolute
## Y [,1] [,2]
## negative 0.01698342 0.1353509
## positive 0.01266324 0.1153239
##
## $absolutely
## absolutely
## Y [,1] [,2]
## negative 0.07157299 0.2818194
## positive 0.04788287 0.2279085
```
```r
predictions <- predict(nb_model, newdata = test_matrix) # Make Predictions on Test Data
mean(predictions == test_matrix$sentiment) # Performance Accuracy
```

```
## [1] 0.764
```

```r
table(Predicted = predictions, Actual = test_matrix$sentiment) # Confusion Matrix
```

```
## Actual
## Predicted negative positive
## negative 418 143
## positive 93 346
```
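Beyond raw accuracy, the confusion matrix supports additional diagnostics; for example, precision and recall for the positive class can be computed by hand from the counts above:

```r
# Counts taken from the confusion matrix above (positive class)
tp <- 346; fp <- 93  # Predicted positive: correct / incorrect
fn <- 143; tn <- 418 # Predicted negative: missed positives / correct negatives
precision <- tp / (tp + fp) # Of predicted positives, share actually positive
recall    <- tp / (tp + fn) # Of actual positives, share recovered
round(c(precision = precision, recall = recall), 3) # About 0.788 and 0.708
```

So the model is somewhat more conservative about labeling reviews positive than it is thorough about finding them all.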