I just finished re-reading the Nigam, McCallum and Mitchell paper Semi-Supervised Text Classification Using EM before implementing an EM tutorial and new naive Bayes implementation. I was struck by:

For pre-processing, stopwords are removed and word counts of each document are scaled such that each document has constant length, with potentially fractional word counts. (Nigam et al. p. 9; my bold)

I figured they meant an L1 norm, which amounts to dividing by the sum of (the absolute value of the) counts. But they could’ve meant the L2 norm that’s used in vector-based information retrieval, which amounts to dividing by vector length, which is just the square root of the sum of the squared counts. In fact, (Rennie et al. 2003) used L2 norms (after log-scale TF/IDF weighting!).

I also didn’t know what length they normed to, but figured it’d be something small so that the conditional probability estimates of naive Bayes would have less variance (more on this below).

So I wrote off to Kamal Nigam, who kindly responded with answers. Yes, they used an L1 norm, but they used a surprisingly long length, on the order of hundreds of words, which will cause most naive Bayes estimates to tend toward 0 or 1 due to the faulty independence assumption underlying naive Bayes. They used the stoplist from Andrew McCallum’s BOW toolkit, and Kamal was pretty sure they case normalized. (20 Newsgroups also works a smidgen better for classification if you strip out the headers, which I’m not sure if they did or not.)

It’s pretty clear how to generalize naive Bayes to handle fractional counts: just use floating point accumulators (better make that a `double`) and carry out all the other calculations in exactly the same way.
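As a sketch of that generalization (the function name `train_nb` and data layout are my own, not from the paper or LingPipe), a multinomial naive Bayes trainer that accepts fractional counts only needs floating point accumulators; the smoothing and normalization are unchanged from the integer-count case:

```python
from collections import defaultdict

def train_nb(docs, alpha=1.0):
    """Train multinomial naive Bayes from (category, {word: count}) pairs.

    Counts may be fractional (e.g. after length normalization), so all
    accumulators are plain floats (doubles).
    """
    cat_counts = defaultdict(float)
    word_counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for cat, counts in docs:
        cat_counts[cat] += 1.0
        for w, c in counts.items():
            word_counts[cat][w] += c  # c may be fractional
            vocab.add(w)
    n_docs = sum(cat_counts.values())
    p_cat = {c: n / n_docs for c, n in cat_counts.items()}
    p_word = {}
    for cat in cat_counts:
        total = sum(word_counts[cat].values())
        # add-alpha (Dirichlet MAP) smoothing, exactly as with integer counts
        p_word[cat] = {w: (word_counts[cat][w] + alpha) / (total + alpha * len(vocab))
                       for w in vocab}
    return p_cat, p_word
```

Nothing downstream cares whether the counts were ever integers; the estimates still normalize to proper distributions.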

Let’s say I have two categories with the following estimates after training:

p(cat1) = 0.9      p(cat2) = 0.1
p(a|cat1) = 0.7    p(a|cat2) = 0.4
p(b|cat1) = 0.3    p(b|cat2) = 0.6

Now what happens when I get the document “a a b”?

p(cat1|a a b) ∝ p(cat1) * p(a|cat1) * p(a|cat1) * p(b|cat1) = .9 * .147
p(cat2|a a b) ∝ p(cat2) * p(a|cat2) * p(a|cat2) * p(b|cat2) = .1 * .096

so that:

p(cat1|a a b) = .9*.147 / (.9*.147 + .1*.096) = 0.93

That’s without length normalization. What if we make the length norm 1? Then we have to raise all of our token probabilities (not the category probabilities) to the 1/3 power, yielding:

p(cat1|a a b) ∝ p(cat1) * p(a|cat1) * p(a|cat1) * p(b|cat1) = .9 * .147**(1/3)
p(cat2|a a b) ∝ p(cat2) * p(a|cat2) * p(a|cat2) * p(b|cat2) = .1 * .096**(1/3)

This works out to an estimate of `p(cat1|a a b) = 0.91`. If instead we take the norm to be 100, we raise the conditional probabilities to the power of (100/3), yielding `p(cat1|a a b) = 1.0` (well, actually 0.999999925).
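The whole worked example can be checked numerically. The helper `posterior_cat1` below is just an illustration of the length-norm arithmetic, using the estimates from the table above:

```python
def posterior_cat1(norm_len, doc_len=3):
    """Length-normalized posterior p(cat1 | "a a b")."""
    # token likelihoods for "a a b" under each category
    like1 = 0.7 * 0.7 * 0.3   # = 0.147
    like2 = 0.4 * 0.4 * 0.6   # = 0.096
    # length normalization: raise token likelihoods to norm_len / doc_len
    s1 = 0.9 * like1 ** (norm_len / doc_len)
    s2 = 0.1 * like2 ** (norm_len / doc_len)
    return s1 / (s1 + s2)
```

With `norm_len=3` (no effective normalization) this reproduces 0.93, with `norm_len=1` it gives 0.91, and with `norm_len=100` it saturates to essentially 1.0.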

In general, length norms lower than the length of the document drive the estimates `p(cat|text)` closer to the category distribution `p(cat)`, in this case 0.9, and length norms greater than the document length drive the estimates closer to 0 or 1. Thus with a balanced category distribution, shrinking the length norm drives the posterior category estimates toward the uniform distribution.

One side effect of this normalization is that the joint probabilities are no longer coherent. It’s easy to see that the infinite chain of inputs “a”, “a a”, “a a a”, …, each of which has the same “joint probability” after length normalization, is going to drive our total probability above 1.0. Nigam et al. just went ahead and carried out EM in the usual way in an attempt to maximize the joint probability of the data and the models (with a simple Dirichlet prior and MAP estimates).
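The incoherence is easy to see in a few lines. The helper `norm_joint` is hypothetical, using the p(a|cat1) = 0.7 estimate from the example: every string of a's gets the same normalized score, so the scores over all strings cannot sum to at most 1.

```python
def norm_joint(doc, norm_len=1):
    """Length-normalized "joint probability" p(cat1, doc) for a token string."""
    like = 1.0
    for tok in doc:
        like *= {"a": 0.7, "b": 0.3}[tok]
    # raise the token likelihood to norm_len / len(doc)
    return 0.9 * like ** (norm_len / len(doc))

# "a", "aa", "aaa", ... all score 0.9 * 0.7 = 0.63
scores = [norm_joint("a" * n) for n in (1, 2, 3, 10)]
```

Four inputs already push the total over 1.0, and infinitely many more are waiting in line.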

February 25, 2009 at 2:33 pm |

I liked your post. It came in useful with some work I’m doing.

I came up with a procedure for diminishing the effect of the highly polarized probabilities (tending toward 0 or 1) you may find useful. It works like this:

p = probability a document belongs to category

po = probability a document belongs to some other category

pn = p / (p + po)

pn = 1 / (1 + po / p)

pn = 1 / (1 + exp(log(po) – log(p)))

pn = 1 / (1 + exp(1 * (0 – (log(p) – log(po)))))
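The chain of identities can be checked mechanically; both forms below compute the same normalized probability (function names are mine, for illustration):

```python
import math

def pn_direct(p, po):
    # pn = p / (p + po)
    return p / (p + po)

def pn_logistic(p, po):
    # identical value, rewritten as a logistic of the log-odds
    return 1.0 / (1.0 + math.exp(math.log(po) - math.log(p)))
```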

This expression produces a curve reminiscent of this all-too-beautiful ascii art:

            _________________
___________/

It also happens to fit the form of the solution to a class of autonomous differential equations of the form:

y'(t) = g*y(1 – y/K)

… where g controls the growth rate (how quickly it jumps from 0 to 1), and K affects where the inflection point is (the center of the curve).

A slightly generalized solution to the differential equation I gave above looks like this:

y(t) = 1 / (1 + exp(A(B – t)))

t = log(p) – log(po)

… where A influences the growth rate, and B slides the curve left/right.

In short, by decreasing the magnitude of A, the effect you get is a smoother gradient for documents where the probability of a document being in a given class is roughly equal to the probability the document belongs to some other class. Additionally, altering B causes your classifier to require stronger (or weaker) evidence in order for documents to achieve a higher probability.
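As a sketch of the procedure (the name `calibrated` is illustrative):

```python
import math

def calibrated(p, po, A=1.0, B=0.0):
    """y(t) = 1 / (1 + exp(A * (B - t))), with t = log(p) - log(po)."""
    t = math.log(p) - math.log(po)
    return 1.0 / (1.0 + math.exp(A * (B - t)))
```

With A = 1 and B = 0 this recovers the plain normalized probability p / (p + po); lowering A pulls polarized estimates back toward 0.5, and raising B demands stronger evidence before the output crosses 0.5.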

February 25, 2009 at 3:04 pm |

The more conventional approach to after-the-fact calibration is to use a one-dimensional logistic fit on scores. It’s also related to annealing and length normalization, given the way naive Bayes scores are calculated.
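One way to sketch such a one-dimensional fit on scores (the gradient-descent fitter `fit_logistic` below is illustrative, not LingPipe's implementation) is to minimize squared error between the logistic of the score and the 0/1 labels:

```python
import math

def fit_logistic(scores, labels, steps=2000, lr=0.1):
    """Fit y = 1 / (1 + exp(-(a*t + b))) to binary labels by least squares."""
    a, b = 1.0, 0.0
    for _ in range(steps):
        ga = gb = 0.0
        for t, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * t + b)))
            err = p - y
            # d(err^2)/da and /db via the chain rule; p*(1-p) is the sigmoid slope
            ga += 2 * err * p * (1 - p) * t
            gb += 2 * err * p * (1 - p)
        a -= lr * ga / len(scores)
        b -= lr * gb / len(scores)
    return a, b
```

Here t would be the naive Bayes log-odds score; a and b play exactly the roles of the A and B in the comment above.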

There’ve been many other attempts at “calibrating” highly biased estimators like naive Bayes:

Bennett. 2000. Assessing the calibration of naive Bayes’ posterior estimates.

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.5608

The non-parametric (histogram-based) approach is explored in:

Zadrozny and Elkan. 2001. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers.

http://www-cse.ucsd.edu/users/elkan/calibrated.pdf

February 26, 2009 at 3:07 pm |

That’s exactly what I’ve done there. A logistic function is defined as:

f(t) = 1 / (1 + e^(-t))

… where in this case, t = log(p) – log(p_other)

What I described earlier is a more general logistic function that allows you to fine-tune the gradient, or slide the center of the logistic curve one way or the other. For example, if a document has just enough feature evidence such that t=10, the probability will be very near 1.0. But by lowering the magnitude of A in my earlier post, the assigned probability decreases (still above 50% for B=0).

I found it useful for teasing out documents that had just barely enough evidence to get above the threshold but tended not to belong to the categories they were assigned to (for example, sites with word lists and the like). It also solved problems with documents that contained a significant number of tokens that had not been encountered during training, so the dampened logistic function assigned more reasonable (under-threshold) probabilities in a lot of these cases.

February 26, 2009 at 5:30 pm |

Why didn’t you say so? I’ve never heard of “autonomous differential equations”, but then I was terrible at analysis despite being a math major.

I got lost the first time on the pn definitions — was the last one the only one you care about?

Of course, I should’ve recognized the y(t) formula in your first comment: the difference is just the log odds and the A and B the usual scaling and offset. I even implemented a minimum square error fitter in `com.aliasi.stats.Statistics.logisticRegression()` for just this kind of application.

February 26, 2009 at 6:22 pm |

Yes. Sorry, being a math major myself I’m very used to writing out proofs (or outlining one anyway), so I was leaving a trail of breadcrumbs to follow as opposed to just stating the result.

p = probability of category i

po = probability of something other than i

pn = the normalized probability

If you use the total number of known features that occurred in the document as the scaling factor (A = 1/count), most of the probabilities cluster around 50%, so that seems to be a reasonable lower limit for how low the scaling factor should go.
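A sketch of that scaling (the name `scaled_posterior` is mine): with A = 1/count, a 200-feature document whose raw log-odds total is 10 lands just above 50% instead of essentially 1.0.

```python
import math

def scaled_posterior(t, count):
    """Logistic of the log-odds t, scaled by A = 1 / (number of known features)."""
    A = 1.0 / count
    return 1.0 / (1.0 + math.exp(-A * t))

# raw sigmoid(10) is ~0.99995; scaled by A = 1/200 it is sigmoid(0.05) ~ 0.5125
```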