It turns out Nigam et al. weren’t the only ones fiddling with naive Bayes. I just finished:
- Rennie, Shih, Teevan, Karger (2003) Tackling the Poor Assumptions of Naive Bayes Text Classifiers. In ICML.
Rennie et al. introduce four distinct modifications to Naive Bayes:
- log term frequency normalization, to cope with overdispersion,
- length normalization, to dampen inter-term correlation,
- inverse doc frequency normalization, to cope with noise from common features, and
- complementation, to cope with low-count bias.
Log Term Frequency (TF)
Word counts tend to be overdispersed relative to what is predicted by a multinomial. Simply put, two or three occurrences of a word in a document are much more common than predicted by multinomials. Ken Church called this “burstiness”. The two principled ways to address this are (1) mixture models, including the gamma/Poisson [negative binomial], the Dirichlet/multinomial, and zero-inflated models, and (2) latent topic models, where the latent topics model correlations among terms.
Rennie et al. motivate the log word frequency transform on the grounds that raw counts are overdispersed relative to a multinomial: a multinomial fit to log-scaled counts matches the heavy-tailed empirical distribution much better than one fit to raw (linear) counts.
Rather than training on raw document counts, as in standard naive Bayes, the count vectors count(w,d) of words in documents are first transformed based on term frequency (TF) and inverse document frequency (IDF). The log term frequency transform is:

tf(w,d) = log(1 + count(w,d))

where count(w,d) is the count of word w in document d.
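As a quick sketch in Python (the function and variable names here are mine, not the paper's):

```python
import math

def log_tf(counts):
    """Log term frequency transform: tf(w,d) = log(1 + count(w,d)).

    `counts` maps each word to its raw count in one document.
    """
    return {w: math.log(1.0 + c) for w, c in counts.items()}
```

Note that a count of zero maps to zero, a count of one maps to log 2, and further repeats of a word add less and less, which is the whole point of the dampening.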
Inverse Document Frequency (IDF)
To help eliminate noise caused by common words (defined as those that show up in many documents), word counts are next scaled by inverse document frequency:

tfIdf(w,d) = tf(w,d) × log(numDocs / numDocs(w))

where numDocs is the total number of documents and numDocs(w) is the total number of documents containing word w.
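Continuing the sketch (again with my own illustrative names), the IDF weights come straight from document frequencies over the corpus:

```python
import math

def idf_weights(docs):
    """Inverse document frequency: idf(w) = log(numDocs / numDocs(w)).

    `docs` is a list of word-count dicts, one per document.
    """
    num_docs = len(docs)
    doc_freq = {}
    for d in docs:
        for w in d:  # each word counted once per document it appears in
            doc_freq[w] = doc_freq.get(w, 0) + 1
    return {w: math.log(num_docs / n) for w, n in doc_freq.items()}

def tf_idf(tf_counts, idf):
    """Scale (log) term frequencies by their IDF weights."""
    return {w: tf * idf.get(w, 0.0) for w, tf in tf_counts.items()}
```

A word appearing in every document gets IDF zero, so it drops out of the representation entirely.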
Length Normalization (L2)
Length normalization dampens the correlations among words in documents. It also puts documents of different lengths on a comparable probability scale, though that does not matter for Rennie et al.’s first-best classification.
To L2 (Euclidean) length normalize, they divide the vector of TF/IDF counts by its length:

norm(w,d) = tfIdf(w,d) / length(d)

where the L2 (Euclidean) length of the term vector for document d is defined by:

length(d) = sqrt( Σw tfIdf(w,d)² )
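In code, this is a one-liner over the transformed vector (a sketch with my own names):

```python
import math

def l2_normalize(vec):
    """Divide each component of a word->weight vector by the
    vector's Euclidean length, yielding a unit-length vector."""
    length = math.sqrt(sum(v * v for v in vec.values()))
    if length == 0.0:
        return dict(vec)  # all-zero vector: nothing to normalize
    return {w: v / length for w, v in vec.items()}
```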
What I’d never seen done is complementation. Complementation is introduced to deal with the bias seen with imbalanced training data. What’s really nifty is their comparison to one-versus-all formulations of linear classifiers. I’ll talk about both of these issues in a future blog post.
Instead of training a category based on its own counts and then taking the most likely category, Rennie et al. train a category based on all other category counts, then take the lowest scoring category.
The complement count for word w in category c sums the word’s (transformed) counts over all other categories:

count′(w,c) = Σc′≠c count(w,c′)

That is, we count all the instances (after all the TF/IDF and length norms) of the word in categories other than c. Finally, the discrete distribution underlying category c is defined as:

p′(w|c) = count′(w,c) / Σw′ count′(w′,c)
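Here is a minimal sketch of complementation, assuming per-category word-count maps and simple add-α smoothing of the resulting distribution (the names and the smoothing constant are my choices, not the paper's exact setup):

```python
from collections import defaultdict

def complement_counts(cat_counts):
    """For each category c, sum each word's (transformed) counts
    over all categories OTHER than c.

    `cat_counts` maps category -> {word: count}.
    """
    comp = {c: defaultdict(float) for c in cat_counts}
    for c in comp:
        for other, counts in cat_counts.items():
            if other == c:
                continue
            for w, n in counts.items():
                comp[c][w] += n
    return comp

def complement_distribution(comp_counts_c, alpha=1.0):
    """Smoothed discrete distribution p'(w|c) from one category's
    complement counts."""
    vocab = set(comp_counts_c)
    total = sum(comp_counts_c.values()) + alpha * len(vocab)
    return {w: (comp_counts_c[w] + alpha) / total for w in vocab}
```

Classification then scores a document against each category’s complement distribution and picks the lowest-scoring category, since high complement probability is evidence against the category.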
The output is a set of weights that can be used as the basis of a regular old linear classifier (with the category log probabilities as an intercept feature). As such, this looks like search-engine TF/IDF computations for information retrieval (see my post TF/IDF, Scaling and Symmetry for gory details), for which there are typically no normalizations applied to the queries. It’s possible to roll most of the norms into the training, so this is no big deal, and the final length normalization doesn’t affect first-best answers.
First-Best, Conditional Probabilities and EM
Recall that Nigam, McCallum and Mitchell (2003) L1 length-normalized the documents being classified, in order to fit better semi-supervised models with expectation maximization (EM). Although this has no effect on first-best classification, it does change the category probability estimates, and thus affects expectations, and hence EM. Ironically, Rennie et al. list not being able to run EM as a drawback to their ad hoc normalizations. As Nigam et al. showed, you really only need the conditional estimates for computing the E step, and these may be computed with Rennie et al.’s modifications as well as with Nigam et al.’s. Then the M step can be any old model fitting, including Rennie et al.’s.
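The point about the E step only needing conditional estimates can be sketched as follows: given log priors and per-category conditional log probabilities from any model (hypothetical names, and fit however you like), the E step is just a log-sum-exp normalization per document.

```python
import math

def e_step(doc_counts, log_prior, log_cond):
    """Conditional category probabilities p(c|d), computed from
    whatever model produced `log_prior` and `log_cond`.

    `log_cond[c]` maps word -> log p(w|c); unseen words get a
    small floor probability here for illustration.
    """
    floor = math.log(1e-9)
    scores = {
        c: log_prior[c]
        + sum(n * log_cond[c].get(w, floor) for w, n in doc_counts.items())
        for c in log_prior
    }
    # normalize in log space for numerical stability
    m = max(scores.values())
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: v / z for c, v in exps.items()}
```

The M step then refits whatever model you like from these expected category memberships.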