It turns out Nigam et al. weren’t the only ones fiddling with naive Bayes. I just finished:

- Rennie, Shih, Teevan, Karger (2003) Tackling the Poor Assumptions of Naive Bayes Text Classifiers. In
*ICML*.

Rennie et al. introduce four distinct modifications to naive Bayes:

- log term frequency normalization, to cope with overdispersion,
- length normalization, to dampen inter-term correlation,
- inverse doc frequency normalization, to cope with noise from common features, and
- complementation, to cope with low-count bias.

### Log Term Frequency (TF)

Word counts tend to be overdispersed relative to what is predicted by a multinomial. Simply put, two or three occurrences of a word in a document are much more common than predicted by multinomials. Ken Church called this “burstiness”. The two principled ways to address this are (1) mixture models, including the gamma/Poisson [negative binomial], the Dirichlet/multinomial, and zero-inflated models, and (2) latent topic models, where the latent topics model correlations among terms.

Rennie et al. motivate the log word frequency transform on the grounds that a multinomial fit to log-scaled counts matches the heavy-tailed (bursty) count distributions seen in real text much better than a multinomial fit to raw counts.

Rather than training on raw document counts, as in standard naive Bayes, the count vectors `count(w,d)` of words in documents are first transformed based on term frequency (TF) and inverse document frequency (IDF):

$tfIdf(w,d) = tf(w,d) \times idf(w)$

where

$tf(w,d) = \log(1 + count(w,d))$

and where `count(w,d)` is the count of word `w` in document `d`.
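As a minimal sketch of the log damping (the function name is mine, not the paper's):

```python
import math

def tf(count):
    # Log damping: a second or third occurrence of a word adds much
    # less weight than the first, which matches bursty word counts
    # better than a multinomial over raw counts.
    return math.log(1 + count)
```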

### Inverse Document Frequency (IDF)

To help eliminate noise caused by common words (defined as those that show up in many documents), word counts are next scaled by inverse document frequency:

$idf(w) = \log \frac{numDocs}{numDocs(w)}$

where `numDocs` is the total number of documents and `numDocs(w)` is the total number of documents containing word `w`.
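A sketch of the IDF weight (names mine):

```python
import math

def idf(num_docs, num_docs_with_w):
    # A word appearing in every document gets weight log(1) = 0;
    # rarer words get larger weights.
    return math.log(num_docs / num_docs_with_w)
```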

### Length Normalization (L2)

Length normalization dampens the correlations among words in documents. It also scales probabilities for different lengths, which is not important for Rennie et al.’s first-best classifiers.

To L2 (Euclidean) length normalize, they divide the vector of TF/IDF counts by its length:

$l2TfIdf(w,d) = \frac{tfIdf(w,d)}{length(d)}$

where the L2 (Euclidean) length of the term vector for document `d` is defined by:

$length(d) = \sqrt{\sum_w tfIdf(w,d)^2}$
I’ve been doing classification like this ever since I started doing classification, which is why the result so far looks like LingPipe’s `classify.TfIdfClassifierTrainer`.
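A sketch of the L2 step over a sparse term vector (the dict representation and names are mine, not LingPipe's):

```python
import math

def l2_normalize(vec):
    # Divide each TF/IDF weight by the Euclidean length of the whole
    # vector, so long and short documents contribute unit-length vectors.
    length = math.sqrt(sum(x * x for x in vec.values()))
    if length == 0.0:
        return dict(vec)  # all-zero vector: nothing to normalize
    return {w: x / length for w, x in vec.items()}
```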

### Complementation

What I’d never seen done is complementation. Complementation is introduced to deal with the bias seen with imbalanced training data. What’s really nifty is their comparison to one-versus-all formulations of linear classifiers. I’ll talk about both of these issues in a future blog post.

Instead of training a category based on its own counts and then taking the most likely category, Rennie et al. train a category based on all other categories’ counts, then take the lowest-scoring category.

That is, we sum all the instances (after all the TF/IDF and length norms) of the word over documents in categories other than `c`:

$compL2TfIdf(w,c) = \sum_{d \,:\, cat(d) \neq c} l2TfIdf(w,d)$

where $cat(d)$ is the category of document $d$. Finally, the discrete distribution underlying category `c` is defined with additive smoothing:

$p(w|c) = \frac{compL2TfIdf(w,c) + \alpha}{\sum_{w'} \left( compL2TfIdf(w',c) + \alpha \right)}$
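A sketch of both steps (the data layout and names are mine; Rennie et al. use a smoothing term of 1 per word):

```python
from collections import defaultdict

def complement_counts(docs):
    # docs: list of (category, {word: normalized TF/IDF weight}) pairs.
    # comp[c][w] sums w's weight over every document NOT in category c.
    categories = {c for c, _ in docs}
    comp = {c: defaultdict(float) for c in categories}
    for doc_cat, vec in docs:
        for c in categories:
            if c != doc_cat:
                for w, x in vec.items():
                    comp[c][w] += x
    return comp

def smoothed_dist(comp_c, vocab, alpha=1.0):
    # Additively smoothed distribution over the whole vocabulary.
    total = sum(comp_c.get(w, 0.0) + alpha for w in vocab)
    return {w: (comp_c.get(w, 0.0) + alpha) / total for w in vocab}
```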

### Classification

The output is a set of weights that can be used as the basis of a regular old linear classifier (with the category log probabilities as an intercept feature). As such, this looks like search-engine TF/IDF computations for information retrieval (see my post TF/IDF, Scaling and Symmetry for the gory details), for which there are typically no normalizations applied to the queries. It’s possible to roll most of the norms into the training, so this is no big deal, and the final length normalization doesn’t affect first-best answers.
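As a sketch of the decision rule (names mine): each category's complement distribution supplies per-word log weights, and the winner is the category whose complement fits the document worst, i.e. the lowest score:

```python
import math

def classify(doc_counts, log_probs):
    # log_probs[c][w] = log p(w | complement of c).  A document looks
    # UNlike the complement of its true category, so we return the
    # category with the LOWEST complement log probability.
    def score(c):
        return sum(n * log_probs[c][w] for w, n in doc_counts.items())
    return min(log_probs, key=score)
```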

### First-Best, Conditional Probabilities and EM

Recall that (Nigam, McCallum and Mitchell 2003) L1 length-normalized the documents being classified, to better fit semi-supervised models with expectation maximization (EM). Although this has no effect on first-best classification, it does change the category probability estimates, and thus the expectations, and hence EM. Ironically, Rennie et al. list not being able to run EM as a drawback of their ad hoc normalizations. As Nigam et al. showed, you really only need the conditional estimates to compute the E step, and these may be computed with Rennie et al.’s modifications just as with Nigam et al.’s. Then the M step can be any old model fitting, including Rennie et al.’s.
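For the E step, all that's needed is to renormalize the per-category joint log scores into conditional category probabilities; a sketch using the standard log-sum-exp trick (names mine):

```python
import math

def conditional_probs(log_scores):
    # Convert per-category joint log scores into p(c | d) by
    # exponentiating and renormalizing; subtracting the max first
    # keeps the exponentials from underflowing.
    m = max(log_scores.values())
    exps = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}
```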

October 5, 2010 at 5:23 am |

The equation for compL2TfIdf seems to be wrong. I think it should be:

compL2TfIdf(w, c) = \sum_{d: c’ \neq c} l2TfIdf(w,d)

October 5, 2010 at 11:43 am |

I don’t understand where your $c'$ came from — it’s not bound in the condition $\{d : c' \neq c\}$.

October 6, 2010 at 5:48 am |

So, to elaborate why I think there’s something wrong: in compL2TfIdf, you’re using l2TfIdf(w,c’) but according to the definition of l2TfIdf you wrote above, the second variable should be a document, not a class.

The equation I wrote reads: sum of all documents d whose class c’ is different from c. To make it more explicit, you may want to use y_d (the class of d) instead of c’. See Table 4 in Rennie et al.’s paper.

October 7, 2010 at 1:56 pm |

Thanks! Table 4 (page 7) of their paper has the full definition, and the one in the blog text above is wrong. If we let $cat(d)$ be the category of document $d$, that’d be

$compL2TfIdf(w,c) = \sum_{d \,:\, cat(d) \neq c} l2TfIdf(w,d)$

I think there’s a remaining inconsistency with their general prior (smoothing) term and normalization; just read their defs!

November 27, 2012 at 12:32 pm |

None of the equations are rendered here.

(And it might be useful to me, as a practitioner, in arguing with a coworker about normalizing for document length in naive Bayes models.)

November 27, 2012 at 2:43 pm |

OK — I fixed them. Luckily I put in alt text for these. I used to use a LaTeX rendering service, but it’s no longer in business, so the early LaTeX links are dead. Now WordPress supports LaTeX directly.