Thanks for being persistent — I hadn’t thought through the multi-class case enough. The problem is that you want the adjusted category probabilities to sum to 1. This is actually a problem for the two-class case, too: you might get raw probabilities of 0.95 and 0.05, where the binned probabilities come out as 0.70 and 0.01, which sum to 0.71 rather than 1.

One approach would be to renormalize the adjusted scores so they sum to 1. This is the usual fix after, say, taking roots of the probabilities to account for some of the correlation between features.
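To make the renormalization idea concrete, here is a minimal sketch (the function name and the 0.70 / 0.01 numbers are just the hypothetical example from above, not anything from either paper):

```python
def renormalize(scores):
    """Scale a list of non-negative scores so they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]

# Binned two-class probabilities that no longer sum to 1:
adjusted = [0.70, 0.01]
print(renormalize(adjusted))  # roughly [0.986, 0.014]
```

Note this only restores the sum-to-1 property; it doesn't address whether the binned estimates themselves are well calibrated.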

I agree that Zadrozny and Elkan’s proposal is very different from what Rennie et al. proposed.

So what is n(x)? It is the non-calibrated score output by naive Bayes. In the two-class case, n(x) can be set to the score of class j=1. In the multi-class case, what is n(x)? The score of the most likely class for x? Then I don’t see why, in that case, choosing what bin x falls into solely according to n(x) makes any sense.

As I see it, the approach of Rennie’s paper is quite different from Zadrozny’s. In the latter, given the predicted P(c|x), the goal was to find a more reliable estimate P̂(c|x), especially for decision making. In the former, a way to find a better estimate of P(w|c) was proposed, but the classification rule didn’t change.

I don’t think what they’re doing is that complicated. They’re taking the raw scores output by naive Bayes, sorting them, and splitting them into equal-sized bins. Then the score they output for an example depends on which bin it falls into: the score is the fraction of examples in that bin that were correctly classified. I’m just surprised they didn’t massively overfit by doing this on the training data rather than on cross-validated training data (where the examples being evaluated are held out from training).
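The procedure as I read it can be sketched in a few lines. This is my own reconstruction, not code from the paper, and the function names (`fit_bins`, `calibrated_score`) are mine:

```python
from bisect import bisect_right

def fit_bins(scores, correct, n_bins=5):
    """Sort (score, correct?) pairs, split into equal-sized bins, and
    return the bin boundaries plus each bin's observed accuracy."""
    pairs = sorted(zip(scores, correct))
    size = len(pairs) // n_bins
    edges, accs = [], []
    for i in range(n_bins):
        lo = i * size
        hi = len(pairs) if i == n_bins - 1 else (i + 1) * size
        chunk = pairs[lo:hi]
        # The calibrated score for this bin is the fraction classified correctly.
        accs.append(sum(c for _, c in chunk) / len(chunk))
        if i < n_bins - 1:
            edges.append(pairs[hi][0])  # boundary between bin i and bin i+1
    return edges, accs

def calibrated_score(score, edges, accs):
    """Map a raw naive Bayes score to the accuracy of the bin it falls into."""
    return accs[bisect_right(edges, score)]
```

As discussed above, fitting the bins on held-out (cross-validated) predictions rather than on the training scores is what keeps the bin accuracies from being overfit.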
