p = probability of category i

po = probability of something other than i

pn = the normalized probability

If you use the total number of known features that occurred in the document as the scaling factor (A = 1/count), most of the probabilities cluster around 50%, so 1/count seems like a reasonable lower bound on the scaling factor.
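A minimal sketch of that scaling, assuming the dampening is applied to the log odds as in the derivation below (the function name and the inputs are illustrative, not from any library):

```python
import math

def dampened(p, po, count):
    """Dampened logistic over the log odds, with scaling A = 1/count."""
    t = math.log(p) - math.log(po)   # log odds for the category
    a = 1.0 / count                  # scaling factor discussed above
    return 1.0 / (1.0 + math.exp(-a * t))

# A heavily polarized naive Bayes estimate over a 200-feature document
# lands just above 0.5 instead of essentially 1.0:
print(dampened(0.999, 0.001, 200))
```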

I got lost the first time on the pn definitions; was the last one the only one you care about?

Of course, I should’ve recognized the y(t) formula in your first comment: the difference is just the log odds, and A and B are the usual scaling and offset. I even implemented a minimum squared error fitter, `com.aliasi.stats.Statistics.logisticRegression()`, for just this kind of application.

f(t) = 1 / (1 + e^(-t))

… where in this case, t = log(p) – log(p_other)
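For concreteness, applying the logistic function to the log odds gives back the normalized probability p / (p + po); the two values below are made up just to check the identity:

```python
import math

def f(t):
    """Standard logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-t))

p, po = 0.03, 0.01                 # made-up unnormalized estimates
t = math.log(p) - math.log(po)     # log odds
print(f(t), p / (p + po))          # both values are approximately 0.75
```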

What I described earlier is a more general logistic function that allows you to fine-tune the gradient, or slide the center of the logistic curve one way or the other. For example, if a document has just enough feature evidence such that t=10, the probability will be very near 1.0. But by lowering the magnitude of A in my earlier post, the assigned probability decreases (still above 50% for B=0).

I found it useful for teasing out documents that had just barely enough evidence to get above the threshold but tended not to belong to the category it was assigned to (for example, sites with wordlists & the like). It also solved problems with documents that contained a significant number of tokens that had not been encountered during training, so the dampened logistic function assigned more reasonable (under-threshold) probabilities in a lot of these cases.

There’ve been many other attempts at “calibrating” highly biased estimators like naive Bayes:

Bennett. 2000. Assessing the calibration of naive Bayes’ posterior estimates.

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.32.5608

The non-parametric (histogram-based) approach is explored in:

Zadrozny and Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers.

http://www-cse.ucsd.edu/users/elkan/calibrated.pdf

I came up with a procedure, which you may find useful, for diminishing the effect of highly polarized probabilities (those tending toward 0 or 1). It works like this:

p = probability a document belongs to category

po = probability a document belongs to some other category

pn = p / (p + po)

pn = 1 / (1 + po / p)

pn = 1 / (1 + exp(log(po) – log(p)))

pn = 1 / (1 + exp(1 * (0 – (log(p) – log(po)))))

This expression produces a curve reminiscent of this all-too-beautiful ascii art:

                 _________________
                /
_______________/
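The chain of identities above is straightforward to verify numerically (the two probabilities below are arbitrary):

```python
import math

p, po = 0.2, 0.05                  # arbitrary unnormalized estimates
forms = [
    p / (p + po),
    1.0 / (1.0 + po / p),
    1.0 / (1.0 + math.exp(math.log(po) - math.log(p))),
    1.0 / (1.0 + math.exp(1.0 * (0.0 - (math.log(p) - math.log(po))))),
]
print(forms)                       # four copies of the same value, ~0.8
```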

It also happens to be the solution to a class of autonomous differential equations of the form:

y'(t) = g*y(1 – y/K)

… where g controls the growth rate (how quickly the curve jumps from 0 to 1) and K is the carrying capacity, the upper asymptote the curve saturates at (with the inflection point at y = K/2).
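A quick forward-Euler sketch of that equation (g = 1 and K = 1 chosen arbitrarily, step size small enough for stability) reproduces the S-curve:

```python
# Integrate y' = g*y*(1 - y/K) forward in time: starting near 0, the
# solution rises along an S-curve and saturates at the carrying capacity K.
g, K = 1.0, 1.0        # growth rate and carrying capacity (arbitrary)
y, dt = 0.01, 0.01     # start near 0; small Euler step
trace = [y]
for _ in range(2000):  # integrate t from 0 to 20
    y += dt * g * y * (1.0 - y / K)
    trace.append(y)
print(trace[0], trace[-1])   # climbs from 0.01 toward the asymptote K = 1
```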

A slightly generalized solution to the differential equation I gave above looks like this:

y(t) = 1 / (1 + exp(A(B – t)))

t = log(p) – log(po)

… where A influences the growth rate, and B slides the curve left/right.

In short, decreasing the magnitude of A gives you a smoother gradient for documents where the probability of belonging to a given class is roughly equal to the probability of belonging to some other class. Additionally, altering B makes your classifier require stronger (or weaker) evidence before a document achieves a high probability.
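A small sketch of those two knobs in action; the evidence value t = 10 follows the earlier example, and everything else is illustrative:

```python
import math

def y(t, a=1.0, b=0.0):
    """Generalized logistic: A sets the gradient, B shifts the center."""
    return 1.0 / (1.0 + math.exp(a * (b - t)))

t = 10.0                  # strong feature evidence: log odds of 10
print(y(t))               # essentially 1.0 with A = 1, B = 0
print(y(t, a=0.1))        # smaller A dampens it, but it stays above 0.5
print(y(t, b=12.0))       # larger B demands more evidence; now below 0.5
```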
