## Predicting Category Prevalence by Adjusting for Biased Classifiers

Suppose you are interested in text-based epidemiology, as our customer Health Monitoring Services is. Among the information they get are real-time inputs in the form of chief complaints (a few words about what's wrong with the patient) from hospital emergency rooms. These are then classified by syndrome (e.g., botulinic, gastrointestinal, neurological) to carry out so-called syndromic surveillance. (HMS is doing lots of other cool stuff; this is just the very basics.)

Public health agencies like the U.S. Centers for Disease Control and Prevention are more interested in the overall prevalence of conditions than in individual cases. That is, how many people have the flu or are infected with HIV? And where are they geographically? (Epidemiologists, in general, also care about the reliability of diagnostic tests, such as the sensitivity and specificity of a particular blood test.)

The naive way to do this computation is to run everything through a first-best classifier and then count the results in each category. You can smooth the counts a bit and then look for significant divergences from expected norms (tricky given seasonality, etc., but doable).

A slightly more sophisticated way to do this is to count expectations. That is, run a probabilistic classifier to compute a posterior distribution over categories for each input, then sum the per-category probabilities; the sum is the expected number of cases in each category according to the model. This should work better.
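The difference between the two approaches can be sketched in a few lines of Python; the posterior probabilities here are made up for illustration:

```python
# Hypothetical posteriors Pr(flu | chief complaint) for five inputs.
posteriors = [0.9, 0.6, 0.4, 0.2, 0.1]

# Naive approach: count items whose first-best label is "flu"
# (i.e., posterior at least 0.5).
first_best_count = sum(1 for p in posteriors if p >= 0.5)

# Expectation approach: sum the posterior probabilities to get the
# model's expected number of flu cases.
expected_count = sum(posteriors)

print(first_best_count)  # 2
print(expected_count)    # approximately 2.2
```

Note how the two estimates disagree: the first-best count throws away the mass sitting below the 0.5 threshold, while the expectation keeps it.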

Well, I just learned an even better way to do this last weekend.
Gary King (of the Social Science Statistics Blog) presented this paper, co-authored with Daniel Hopkins:

They point out that social scientists are much more interested in population trends than in the opinions of single members of the public. So they propose to estimate the prevalence of positive sentiment toward political candidates. But they're worried about bias, and don't care about individual-level classification accuracy, which is what all the standard models optimize for in one way or another (e.g., SVMs, perceptrons, naive Bayes, logistic regression). They also point out that the distribution of categories may be very different in the test set than in the training set, because things like opinions may drift over time.

So maybe we can get away from assuming our test set is a random sample from the same pool as the training set (in which case the problem's solved, because the training distribution of categories will match the test distribution by definition!).

Dan and Gary pulled out the wayback machine (that’s what the image above is of) and found this (sorry for the closed source; I can’t get at anything beyond the abstract, either):

It uses what is, in hindsight, a blindingly obvious adjustment for bias. Suppose we take a classifier and measure its sensitivity,

sens = TP / (TP + FN),

which is just its accuracy on truly positive items (those that have the disease), and specificity,

spec = TN / (TN + FP),

which is its accuracy on truly negative items (those that don't have the disease).
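The two definitions translate directly into code; a minimal sketch, with hypothetical confusion-matrix counts:

```python
def sensitivity(tp, fn):
    """Accuracy on truly positive items: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Accuracy on truly negative items: TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical counts: 80 true positives, 20 false negatives,
# 90 true negatives, 10 false positives.
print(sensitivity(80, 20))  # 0.8
print(specificity(90, 10))  # 0.9
```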

We can estimate sensitivity and specificity by cross-validation on our training set easily enough; the functionality's built into LingPipe (combine a cross-validating corpus with a classifier evaluator). Suppose C is a category taking on values 0 or 1; if C’ is the classifier's estimate of C for a new data point, we have:

Pr(C’=1) = sens * Pr(C=1) + (1 – spec) * Pr(C=0)

That is, there are two ways the system can estimate the category C’ to be 1: by being right about an item truly of category 1, or by being wrong about an item truly of category 0. An item is of category 1 with probability Pr(C=1) and is then classified correctly with probability sens; it is of category 0 with probability Pr(C=0) = 1 – Pr(C=1) and is then misclassified with probability 1 – spec.
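The identity is easy to check numerically; a sketch with hypothetical values (true prevalence 0.3, sensitivity 0.8, specificity 0.9):

```python
prev, sens, spec = 0.3, 0.8, 0.9

# Two routes to a positive classification: a true positive
# (0.8 * 0.3 = 0.24) or a false positive (0.1 * 0.7 = 0.07).
p_pred = sens * prev + (1 - spec) * (1 - prev)

print(p_pred)  # approximately 0.31
```

So the raw classified-positive rate (0.31) overshoots the true prevalence (0.3) here, because the false positives outnumber the missed positives.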

Now we just apply some algebra to solve for Pr(C=1):

Pr(C=1) = (Pr(C’=1) + spec – 1) / (sens + spec – 1)

Et voilà. An unbiased estimate for Pr(C=1). I love algebra. Of course, this assumes that your sensitivity and specificity are the same on the test set as on the training set, but you have to assume something to get off the ground here.
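The inverted formula is a one-line corrector; a sketch in Python (the numbers below are hypothetical, chosen so the observed rate 0.31 comes from true prevalence 0.3 with sens 0.8 and spec 0.9):

```python
def adjusted_prevalence(p_pred, sens, spec):
    """Correct the raw classified-positive rate for classifier bias:
    Pr(C=1) = (Pr(C'=1) + spec - 1) / (sens + spec - 1)."""
    return (p_pred + spec - 1.0) / (sens + spec - 1.0)

# Observed rate 0.8 * 0.3 + 0.1 * 0.7 = 0.31; the adjustment
# recovers the true prevalence of 0.3.
print(adjusted_prevalence(0.31, 0.8, 0.9))
```

Note the denominator sens + spec – 1: if the classifier is no better than chance (sens + spec = 1), the correction blows up, which makes sense, since a chance classifier carries no information about prevalence.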