I was just reading a really nifty Bayesian approach to clickthrough rate estimation from Microsoft Research Cambridge (UK), which used as a baseline the following paper’s approach to calibrating naive Bayes classifiers:

- Zadrozny, Bianca and Charles Elkan. 2001. Obtaining calibrated probability estimates from decision trees and naive Bayes classifiers. In *ICML*.

I’ve blogged about Rennie et al.’s approach to tackling the poor assumptions of Naive Bayes for text classifiers; oddly, the Rennie et al. 2003 ICML paper I blogged about before doesn’t cite the 2001 Zadrozny and Elkan ICML paper on the same topic.

### Failed Independence Assumptions in Naive Bayes

The well-known problem with naive Bayes (i.e. a discrete mixture of multinomials) derives from using features that are correlated in a model that assumes they're not. This is, in fact, the naivete from which the name is derived, though it's a misnomer, because the model's right for some situations. For instance, it's common to see naive Bayes applied to model bags of words in human language, which are highly correlated both locally (by syntax and collocation) and globally (by topic). It's what you'll get from LingPipe's `NaiveBayesClassifier` and `TradNaiveBayesClassifier` classes.

In practice, what happens, especially for longer documents, is that the probabilities tend toward 0 or 1 because redundant tokens with high correlations are treated as independent. This is especially easy to see for repeated words. Suppose we're classifying sports documents, and a document is 90% likely to be about basketball if it mentions "lebron" (actually it's probably higher, due to the celebrity of LeBron James). So what if we see "lebron" twice? In the model, the probability of being about basketball goes up to 0.9²/(0.9² + 0.1²) ≈ 0.99. The problem is that two "lebron"s don't tell you much more about the topic than one. An article about the filmmaker Eddie Lebron or the Salsa band The Lebrón Brothers is also likely to mention "lebron" more than once. Once you've seen one, the effect of the next one should go down. (The fact that all three uses of "lebron" are spelled differently is an orthogonal issue I shouldn't have confused here; the same holds for a uniformly spelled name like "Smith".)
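The compounding effect is easy to see numerically. Here's a minimal sketch (the per-token likelihoods 0.009 and 0.001 and the even prior are made-up numbers, chosen so that a single "lebron" yields 90%):

```python
# Sketch of how naive Bayes compounds evidence from repeated tokens.
# All probabilities here are illustrative, not from any real model.

def posterior_basketball(n_lebron, p_tok_bb=0.009, p_tok_other=0.001,
                         prior_bb=0.5):
    """Posterior P(basketball | n occurrences of 'lebron') under the
    naive independence assumption: per-token likelihoods just multiply."""
    like_bb = prior_bb * p_tok_bb ** n_lebron
    like_other = (1 - prior_bb) * p_tok_other ** n_lebron
    return like_bb / (like_bb + like_other)

print(posterior_basketball(1))  # ≈ 0.9
print(posterior_basketball(2))  # ≈ 0.988 -- the second mention counts in full
```

Each extra "lebron" multiplies the odds by another factor of 9, so the posterior races toward 1 even though the repeated token carries little new information.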

### Measuring Term Correlation

One thing to do is try to measure the correlation. Along these lines, I really liked Ken Church's analysis of correlation in a paper titled "One term or two?" (but only available pay-per-view) and Martin Jansche's careful follow-up.

It's traditional to hack calibrations by converting every document to the same virtual document length λ, rescaling each document's token log likelihoods by λ/len(x) (equivalently, raising the document likelihood to the power λ/len(x)). This is what we do in our traditional naive Bayes class in LingPipe (it doesn't change first-best results, but helps enormously for EM and presumably for n-best result ordering).
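A sketch of that hack (the function name and the choice λ = 10 are mine for illustration, not LingPipe's API):

```python
def length_normalized_log_likes(token_log_probs, lam=10.0):
    """Rescale a document's token log likelihoods so every document
    behaves as if it had virtual length lam; lam is a tuning parameter."""
    n = len(token_log_probs)
    return [lp * lam / n for lp in token_log_probs]

# A 20-token document's summed log likelihood is halved to virtual length 10:
doc = [-2.0] * 20
print(sum(length_normalized_log_likes(doc)))  # -20.0 rather than -40.0
```

Because every document contributes the same total exponent, longer documents can no longer drive the posterior arbitrarily close to 0 or 1; and since the rescaling is by a positive constant per document, the first-best category is unchanged.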

### Binned Calibration

What Zadrozny and Elkan did for naive Bayes is to train a classifier and then calibrate its predictions. In particular, they sort the inputs by predicted probability for the most likely category, then split them into ten bins of equal size. For each bin, they calculate the proportion of correct answers in the bin. New items are then assigned probabilities based on which bin their estimate for the probability of the best category falls into.
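Their procedure can be sketched as follows (function names and the handling of the last bin are mine; this is a sketch of the method as described, not their code):

```python
import bisect

def fit_binned_calibrator(scores, correct, n_bins=10):
    """Zadrozny & Elkan-style binning: sort items by predicted probability,
    split into equal-size bins, and use each bin's fraction of correct
    predictions as its calibrated probability.
    Returns (bin upper boundaries, per-bin calibrated probabilities)."""
    pairs = sorted(zip(scores, correct))
    size = len(pairs) // n_bins
    bounds, probs = [], []
    for b in range(n_bins):
        chunk = pairs[b * size:(b + 1) * size] if b < n_bins - 1 else pairs[b * size:]
        bounds.append(chunk[-1][0])                      # bin's max raw score
        probs.append(sum(c for _, c in chunk) / len(chunk))
    return bounds, probs

def calibrate(score, bounds, probs):
    """Return the calibrated probability of the bin containing this score."""
    i = min(bisect.bisect_left(bounds, score), len(probs) - 1)
    return probs[i]
```

At prediction time, a new item's raw best-category probability is simply looked up against the bin boundaries and replaced by that bin's empirical accuracy.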

They talk about using held-out data, but don't actually use any. I'm not sure why, since naive Bayes is pretty easy to overfit. Ideally they'd have just done cross-validation on the whole data set to get predictions, or used a representative held-out data set of appropriate size.

### Isotonic Regression

The Microsoft paper referred to Zadrozny and Elkan's approach as isotonic regression, but on my reading, Zadrozny and Elkan didn't enforce isotonicity (i.e. upward monotonicity) on the bin probabilities. They probably didn't need to with lots of items per bin and a reasonably ordered (if not calibrated) set of naive Bayes results. Structural constraints like isotonicity (or similarly intentioned smoothness priors for something like loess) are especially helpful if some of the bins don't contain much data or you're fitting a more complex model.
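For comparison, actually enforcing isotonicity on the bin estimates is only a few lines via pool adjacent violators; this is a generic sketch, not code from either paper:

```python
def pool_adjacent_violators(bin_probs, bin_sizes=None):
    """Enforce a non-decreasing sequence of bin probabilities by pooling
    adjacent bins that violate the ordering, replacing each offending run
    with its (weighted) average."""
    sizes = bin_sizes or [1] * len(bin_probs)
    blocks = []  # each block: [pooled value, total weight, number of bins]
    for p, w in zip(bin_probs, sizes):
        blocks.append([p, w, 1])
        # Merge backwards while the previous block exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, n2 = blocks.pop()
            v1, w1, n1 = blocks.pop()
            w = w1 + w2
            blocks.append([(v1 * w1 + v2 * w2) / w, w, n1 + n2])
    out = []
    for v, _, n in blocks:
        out.extend([v] * n)
    return out

print(pool_adjacent_violators([0.1, 0.4, 0.3, 0.8]))
# the 0.4/0.3 violation is pooled to 0.35 for both bins
```

Weighting by bin size matters when, as suggested above, some bins contain little data.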

### Binning versus Piecewise Linear Normalization

Binning may be useful for absolute probability prediction, but it's not very useful for sorting results because it's not a properly isotonic function (that is, we can have raw scores p(x) < p(y) but calibrated values cal(x) = cal(y) when both fall in the same bin, or even cal(x) > cal(y) across bins, since the empirical bin proportions aren't constrained to be monotone).

Instead of binning, a more reasonable solution seems to be a piecewise linear approximation. Just fit linear functions through each bin that remain isotonic. That’s what you typically see in non-parametric statistics approaches (e.g. for normalizing intensities in micro-array data or empirically adjusting for false discovery rates).
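A sketch of such a piecewise linear calibrator, assuming the knots (say, each bin's mean score paired with its accuracy) have already been made isotonic:

```python
def piecewise_linear_calibrator(knot_scores, knot_probs):
    """Return a calibration function interpolating linearly between
    (raw score, calibrated probability) knots, so distinct raw scores
    keep distinct calibrated values within each segment."""
    def cal(score):
        if score <= knot_scores[0]:
            return knot_probs[0]
        if score >= knot_scores[-1]:
            return knot_probs[-1]
        for (x0, y0), (x1, y1) in zip(zip(knot_scores, knot_probs),
                                      zip(knot_scores[1:], knot_probs[1:])):
            if x0 <= score <= x1:
                t = (score - x0) / (x1 - x0)
                return y0 + t * (y1 - y0)
    return cal

cal = piecewise_linear_calibrator([0.0, 0.5, 1.0], [0.1, 0.6, 0.9])
print(cal(0.25))  # 0.35, halfway between the first two knots
```

Unlike the step function induced by binning, this preserves the ordering of the raw scores, so it calibrates and sorts at the same time.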

### Length, Anyone?

In addition to predicted probability on new items, I’d think document length would be a huge factor here. In particular, a two-way binning by length and by prediction probability makes sense to me if there’s enough data. As documents get longer, the overconfidence due to failed independence assumptions gets worse.

### Category Effects?

I’d also think you could adjust for category effects. There’s often one or more small foreground categories and a big background category. I’d expect them to calibrate differently.

### Comparative Effects

It’s common in estimating confidence for speech recognizer output to look at not only the first-best result, but the difference between the first-best and second-best result’s probability. That may also help here.
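As a sketch, the two features would just be (names are mine):

```python
def confidence_features(cat_probs):
    """Features for confidence estimation: the best category's probability
    and the margin between first- and second-best, as used in speech
    recognizer confidence scoring."""
    ranked = sorted(cat_probs.values(), reverse=True)
    return {"best": ranked[0], "margin": ranked[0] - ranked[1]}

print(confidence_features({"basketball": 0.7, "football": 0.2, "other": 0.1}))
```

A calibrator could then bin (or regress) on the margin as well as the raw best-category probability.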

June 4, 2010 at 5:00 am

One term or two – http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.50.3857

June 8, 2010 at 6:05 am

I read Zadrozny and Elkan’s paper, in particular their Naive Bayes calibration method. It wasn’t clear what they meant by the score n(.) but, given that the corrected probability estimate is the “fraction of training examples in the bin that actually belong to class j”, putting a test example x in a bin according to its score n(x) seems to only make sense in the two-class case. In the experiment section, indeed they’re working with only two classes (j=1 or j=0).

June 8, 2010 at 12:28 pm

I don't think what they're doing is that complicated. They're taking the basic scores output from naive Bayes, sorting them, and binning them into equal-sized bins. Then they output a score for an example based on which bin it falls into, calculating the score as the percentage of examples in that bin that were correctly classified. I'm just surprised they didn't massively overfit by doing this on the training data rather than cross-validated training data (where the examples being evaluated are held out from training).

June 8, 2010 at 7:08 pm

So what is n(x)? It is the non-calibrated score output by Naive Bayes. In the two-class case, n(x) can be set to the score of class j=1. In the multi-class case, what is n(x)? The score of the most likely class for x? Then I don’t see why, in that case, choosing what bin x falls into solely according to n(x) makes any sense.

As I see it, the approach of Rennie's paper is quite different from Zadrozny's. In the latter, given the predicted P(c|x), the goal was to find a more reliable estimate P̂(c|x), especially for decision making. In the former, a way to find a better estimate of P(w|c) was proposed but the classification rule didn't change.

June 9, 2010 at 3:21 pm

Thanks for being persistent — I hadn’t thought through the multi-class case enough. The problem is that you want the adjusted category probabilities to sum to 1. This is actually a problem for the two-class case, too. You might get raw probabilities of 0.95 and 0.05, where the binned probabilities would be 0.70 and 0.01.

One approach would be to renormalize the adjusted scores. This is what everyone does after taking roots to account for some of the correlation.
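In code, the renormalization step is trivial (using the 0.70 and 0.01 binned values from the two-class example above):

```python
def renormalize(adjusted):
    """Rescale per-category calibrated probabilities so they sum to 1."""
    total = sum(adjusted.values())
    return {c: p / total for c, p in adjusted.items()}

print(renormalize({"c1": 0.70, "c0": 0.01}))  # c1 gets 0.70/0.71 ≈ 0.986
```

The same one-liner applies in the multi-class case, where each category's calibrated score would come from its own bin lookup before renormalizing.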

I agree that Zadrozny and Elkan’s proposal is very different from what Rennie et al. proposed.