I was just reading a really nifty Bayesian approach to clickthrough rate estimation from Microsoft Research Cambridge (UK), which used as a baseline the following paper’s approach to calibrating naive Bayes classifiers:
- Zadrozny, Bianca and Charles Elkan. 2001. Obtaining calibrated probability estimates from decision trees and naive Bayes classifiers. In ICML.
I’ve blogged about Rennie et al.’s approach to tackling the poor assumptions of Naive Bayes for text classifiers; oddly, the Rennie et al. 2003 ICML paper I blogged about before doesn’t cite the 2001 Zadrozny and Elkan ICML paper on the same topic.
Failed Independence Assumptions in Naive Bayes
The well-known problem with naive Bayes (i.e. a discrete mixture of multinomials) derives from using features that are correlated in a model that assumes they’re not. This is, in fact, the naivete from which the name is derived, though it’s a misnomer, because the model’s right for some situations. For instance, it’s common to see naive Bayes applied to model bags of words in human language, which are highly correlated both locally (by syntax and collocation) and globally (by topic). It’s what you’ll get from LingPipe’s
In practice, what happens, especially for longer documents, is that the probabilities tend toward 0 or 1 because of redundant tokens with high correlations being treated as independent. This is especially easy to see for repeated words. If we’re classifying sports documents and it’s 90% likely to be an article about basketball if it mentions “lebron” (actually it’s probably higher, due to the celebrity of LeBron James). So what if we see “lebron” twice? In the model, the probability of being about basketball goes up to . The problem is that two “lebron”s don’t tell you much more about the topic than one. An article about the filmmaker Eddie Lebron or the Salsa band The Lebrón Brothers is also likely to mention “lebron” more than once. Once you’ve seen one, the effect of the next one should go down. (The fact that all three uses of “lebron” are spelled differently is an orthogonal issue I shouldn’t have confused here; the same holds for a fully capitalized name like “Smith”.)
Measuring Term Correlation
One thing to do is try to measure the correlation. Along these lines, I really liked Ken Church’s analysis of correlation in a paper titled One term or two? (but only available pay-per-view) and Martin Jansche’s careful follow up.
It’s traditional to hack calibrations by converting every document to the same virtual document length , using . This is what we do in our traditional naive Bayes class in LingPipe (it doesn’t change first-best results, but helps enormously for EM and presumably for n-best results ordering).
What Zadrozny and Elkan did for naive Bayes is to train a classifer and and then calibrate its predictions. In particular, they sort the inputs by predicted probability for the most likely category. They then sort these into ten bins of equal size. For each bin, they calculate the proportion of true answers in the bin. For new items, they are assigned probabilities based on which bin their estimate for the probability of the best category falls into.
They talk about using held-out data, but don’t. I’m not sure why — naive Bayes is pretty easy to overfit. Ideally they’d have just done cross-validation on the whole data set to get predictions, or use a representative held-out data set of appropriate size.
The Microsoft paper referred to Zadrozny and Elkan’s approach as isotonic regression, but on my reading, Zadrozny and Elkan didn’t enforce isotonicity (i.e. upward monotonicity) on the bin probabilities. They probably didn’t need to with lots of items per bin and a reasonably ordered (if not calibrated) set of naive Bayes results. Structural constraints like isotonicity (or simiarly intentioned smoothness priors for something like loess) are especially helpful if some of the bins don’t contain much data or you’re fitting a more complex model.
Binning versus Piecewise Linear Normalization
Binning may be useful for absolute probability prediction, but it’s not very useful for sorting results because it’s not a properly isotonic function (that is, we can have and ).
Instead of binning, a more reasonable solution seems to be a piecewise linear approximation. Just fit linear functions through each bin that remain isotonic. That’s what you typically see in non-parametric statistics approaches (e.g. for normalizing intensities in micro-array data or empirically adjusting for false discovery rates).
In addition to predicted probability on new items, I’d think document length would be a huge factor here. In particular, a two-way binning by length and by prediction probability makes sense to me if there’s enough data. As documents get longer, the overconfidence due to failed independence assumptions gets worse.
I’d also think you could adjust for category effects. There’s often one or more small foreground categories and a big background category. I’d expect them to calibrate differently.
It’s common in estimating confidence for speech recognizer output to look at not only the first-best result, but the difference between the first-best and second-best result’s probability. That may also help here.