I just finished re-reading the Nigam, McCallum and Mitchell paper Semi-Supervised Text Classification Using EM before writing an EM tutorial and a new naive Bayes implementation. I was struck by:
For pre-processing, stopwords are removed and word counts of each document are scaled such that each document has **constant length**, with potentially fractional word counts. (Nigam et al., p. 9; my bold)
I figured they meant an L1 norm, which amounts to dividing by the sum of (the absolute values of) the counts. But they could’ve meant the L2 norm used in vector-based information retrieval, which amounts to dividing by vector length, the square root of the sum of the squared counts. In fact, Rennie et al. (2003) used L2 norms (after log-scale TF/IDF weighting!).
I also didn’t know what length they normed to, but figured it’d be something small so that the conditional probability estimates of naive Bayes would have less variance (more on this below).
So I wrote off to Kamal Nigam, who kindly responded with answers. Yes, they used an L1 norm, but with a surprisingly long target length, on the order of hundreds of words, which will cause most naive Bayes estimates to tend toward 0 or 1 because of the faulty independence assumption underlying naive Bayes. They used the stoplist from Andrew McCallum’s BOW toolkit, and Kamal was pretty sure they case normalized. (20 Newsgroups also works a smidgen better for classification if you strip out the headers, though I’m not sure whether they did.)
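In code, the L1 scheme is just a rescaling of raw counts to a fixed target length. Here’s a minimal sketch in Python (the function name and the target of 100 are mine, purely for illustration):

```python
from collections import Counter

def l1_length_normalize(counts, target_length):
    # Scale counts so they sum to target_length (L1 normalization),
    # leaving potentially fractional word counts.
    total = sum(counts.values())
    return {w: c * target_length / total for w, c in counts.items()}

counts = Counter("a a b".split())
print(l1_length_normalize(counts, 100.0))  # {'a': 66.66..., 'b': 33.33...}
```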
It’s pretty clear how to generalize naive Bayes to handle fractional counts: just use floating point accumulators (better make that a double) and carry out all the other calculations in exactly the same way.
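Here’s a rough sketch of what I mean, with floats everywhere a count would normally be an integer (the class name, the add-0.5 smoothing, and the per-document training weight are my choices for illustration, not Nigam et al.’s setup):

```python
import math
from collections import defaultdict

class FractionalNB:
    # Multinomial naive Bayes whose accumulators are floats, so fractional
    # word counts (and fractional category weights from EM's E-step) just work.
    def __init__(self, alpha=0.5):
        self.alpha = alpha                    # additive smoothing
        self.cat_counts = defaultdict(float)  # category -> (fractional) doc count
        self.tok_counts = defaultdict(float)  # (category, word) -> fractional count
        self.cat_totals = defaultdict(float)  # category -> total fractional tokens
        self.vocab = set()

    def train(self, cat, counts, weight=1.0):
        self.cat_counts[cat] += weight
        for word, n in counts.items():        # n may be fractional
            self.tok_counts[(cat, word)] += weight * n
            self.cat_totals[cat] += weight * n
            self.vocab.add(word)

    def log_joint(self, cat, counts):
        lp = math.log(self.cat_counts[cat] / sum(self.cat_counts.values()))
        denom = self.cat_totals[cat] + self.alpha * len(self.vocab)
        for word, n in counts.items():
            p = (self.tok_counts[(cat, word)] + self.alpha) / denom
            lp += n * math.log(p)             # fractional count = fractional exponent
        return lp
```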
Let’s say I have two categories with the following estimates after training:
p(cat1) = 0.9      p(cat2) = 0.1
p(a|cat1) = 0.7    p(a|cat2) = 0.4
p(b|cat1) = 0.3    p(b|cat2) = 0.6
Now what happens when I get the document “a a b”?
p(cat1|a a b) ∝ p(cat1) * p(a|cat1) * p(a|cat1) * p(b|cat1) = .9 * .147
p(cat2|a a b) ∝ p(cat2) * p(a|cat2) * p(a|cat2) * p(b|cat2) = .1 * .096
p(cat1|a a b) = .9*.147 / (.9*.147 + .1*.096) = 0.93
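You can redo that arithmetic in a couple of lines:

```python
joint1 = 0.9 * 0.7 * 0.7 * 0.3     # p(cat1) * p(a|cat1)^2 * p(b|cat1) = 0.9 * 0.147
joint2 = 0.1 * 0.4 * 0.4 * 0.6     # p(cat2) * p(a|cat2)^2 * p(b|cat2) = 0.1 * 0.096
print(joint1 / (joint1 + joint2))  # 0.9323...
```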
That’s without length normalization. What if we make the length norm 1? Then we have to raise all of our token probabilities (not the category probabilities) to the 1/3 power, yielding:
p(cat1|a a b) ∝ p(cat1) * (p(a|cat1) * p(a|cat1) * p(b|cat1))**(1/3) = .9 * .147**(1/3)
p(cat2|a a b) ∝ p(cat2) * (p(a|cat2) * p(a|cat2) * p(b|cat2))**(1/3) = .1 * .096**(1/3)
This works out to an estimate of p(cat1|a a b) = 0.91. If we instead take the norm to be 100, we raise the conditional probs to the power of 100/3, yielding p(cat1|a a b) = 1.0 (well, actually 0.999999925).
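The whole length-normalized computation fits in a few lines; this sketch (the helper name is mine) reproduces both figures:

```python
def normed_posterior(norm_len, doc_len=3):
    # Raise each category's token likelihood to norm_len/doc_len, then renormalize.
    j1 = 0.9 * 0.147 ** (norm_len / doc_len)
    j2 = 0.1 * 0.096 ** (norm_len / doc_len)
    return j1 / (j1 + j2)

print(normed_posterior(1))    # 0.9121...
print(normed_posterior(3))    # 0.9323... (same as no normalization)
print(normed_posterior(100))  # 0.99999992...
```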
In general, length norms lower than the length of the document drive the estimates p(cat|text) closer to the category distribution p(cat), in this case 0.9, and length norms greater than the document length drive the estimates closer to 0 or 1. Thus with a balanced category distribution, short length norms drive the posterior category estimates toward the uniform distribution.
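Both limits are easy to watch in the running example, just by plugging different norms into the same formula:

```python
for L in (0.01, 0.1, 1, 3, 10, 100):
    j1 = 0.9 * 0.147 ** (L / 3)  # cat1: prior times length-normalized likelihood
    j2 = 0.1 * 0.096 ** (L / 3)
    print(L, j1 / (j1 + j2))
# posteriors: 0.9001, 0.9013, 0.9121, 0.9323, 0.9738, 0.99999993
# -> the prior 0.9 as L -> 0, and 1.0 as L -> infinity
```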
One side effect of this normalization is that the joint probabilities are no longer coherent. It’s easy to see that the infinite chain of inputs “a”, “a a”, “a a a”, …, each of which has the same “joint probability” after length normalization, is going to drive our total probability above 1.0. Nigam et al. just went ahead and carried out EM in the usual way in an attempt to maximize the joint probability estimates of the data and the models (with a simple Dirichlet prior and MAP estimates).
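To see the incoherence concretely in the running example with a length norm of 1: every document “a”, “a a”, “a a a”, … normalizes to the same fractional counts {a: 1.0}, so each gets the same marginal “probability”:

```python
# Sum over categories of p(cat) * p(a|cat)**1.0 -- identical for every
# document in the chain "a", "a a", "a a a", ...
joint = 0.9 * 0.7 + 0.1 * 0.4         # 0.67
print(sum(joint for _ in range(10)))  # ~6.7 -- total "probability" blows past 1.0
```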