## Evaluating with Unbalanced Training Data

For simplicity, we’ll only consider binary classifiers, so the population consists of positive and negative instances, such as relevant and irrelevant documents for a search query, or patients who have or do not have a disease.

Quite often, the available training data for a task does not have the same ratio of positive to negative instances as the population of interest. For instance, in many information retrieval evaluations, the training data is positively biased because it is annotated from the top results of many search engines. In many epidemiological applications, there is also positive bias because a study selects from a high-risk subpopulation of the actual population.

Even with unbalanced training data, we might still want to be able to calculate precision, recall, and similar statistics for the actual population. It’s easy if we know (or can estimate) the true percentage of positive instances in the population.

### Specificity and Sensitivity vs. Precision and Recall

Using the usual notation for true and false positives and negatives,

$\mbox{sens} = \mbox{TP}/(\mbox{TP} + \mbox{FN})$, and

$\mbox{spec} = \mbox{TN}/(\mbox{TN} + \mbox{FP})$.

Sensitivity and specificity are the accuracies on positive cases ($\mbox{TP} + \mbox{FN}$) and negative cases ($\mbox{TN} + \mbox{FP}$), respectively.

Prevalence for a sample may be calculated from the true and false positive and negative counts, by

$\pi = (\mbox{TP} + \mbox{FN})/(\mbox{TP} + \mbox{FP} + \mbox{TN} + \mbox{FN})$.

Recall is just sensitivity, but precision is the proportion of positive responses that are correct, namely

$\mbox{prec} = \mbox{TP}/(\mbox{TP} + \mbox{FP})$.
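As a minimal sketch of the definitions above, the following Python computes all four statistics from a contingency table. The function name `confusion_stats` and the example counts are hypothetical, chosen only for illustration.

```python
def confusion_stats(tp, fp, tn, fn):
    """Compute sensitivity, specificity, prevalence, and precision from counts.

    Assumes every denominator is non-zero, i.e. the sample contains at least
    one positive case, one negative case, and one positive response.
    """
    sens = tp / (tp + fn)                   # accuracy on positive cases (recall)
    spec = tn / (tn + fp)                   # accuracy on negative cases
    prev = (tp + fn) / (tp + fp + tn + fn)  # fraction of positive cases
    prec = tp / (tp + fp)                   # fraction of positive responses that are correct
    return {"sens": sens, "spec": spec, "prev": prev, "prec": prec}


# Hypothetical balanced sample: 100 positive and 100 negative cases.
stats = confusion_stats(tp=80, fp=10, tn=90, fn=20)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, sensitivity is 0.8, specificity is 0.9, prevalence is 0.5, and precision is 80/90, or roughly 0.89.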

Suppose we know the test population prevalence $\pi'$, the probability of a random population member being a positive instance. Then we can adjust evaluation statistics over a training population with any ratio of positive to negative examples by recomputing expected true and false positive and negative values.

The expected true and false positive and negative counts in test data with prevalence $\pi'$ over $N$ samples, given test sensitivity and specificity $\mbox{sens}$ and $\mbox{spec}$, are

$\mbox{TP}' = N \cdot \mbox{sens} \cdot \pi'$,

$\mbox{TN}' = N \cdot \mbox{spec} \cdot (1 - \pi')$,

$\mbox{FP}' = N \cdot (1 - \mbox{spec}) \cdot (1 - \pi')$, and

$\mbox{FN}' = N \cdot (1-\mbox{sens}) \cdot \pi'$.
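The four expected counts can be sketched directly in Python. The function name `expected_counts` and the example values (sensitivity 0.8, specificity 0.9, prevalence 0.1, 1000 samples) are assumptions made for illustration, not values from any real evaluation.

```python
def expected_counts(sens, spec, prevalence, n):
    """Expected contingency counts for n samples at the given prevalence.

    sens, spec, and prevalence are all probabilities in [0, 1].
    Returns (tp, fp, tn, fn) as floats, since these are expectations.
    """
    tp = n * sens * prevalence
    tn = n * spec * (1 - prevalence)
    fp = n * (1 - spec) * (1 - prevalence)
    fn = n * (1 - sens) * prevalence
    return tp, fp, tn, fn


# Hypothetical classifier evaluated against a population with 10% positives.
tp, fp, tn, fn = expected_counts(sens=0.8, spec=0.9, prevalence=0.1, n=1000)
print(f"TP'={tp:.1f} FP'={fp:.1f} TN'={tn:.1f} FN'={fn:.1f}")
```

For these inputs the expected counts are approximately 80 true positives, 90 false positives, 810 true negatives, and 20 false negatives, which sum to the 1000 samples.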

### Sensitivity and Specificity are Invariant to Prevalence

Although the adjusted contingency matrix counts (true and false positives and negatives) vary based on prevalence, sensitivity and specificity derived from them do not. That’s because specificity is the accuracy on negative cases and sensitivity is the accuracy on positive cases, and neither depends on how many cases of the other class there are.
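To spell this out, substitute the expected counts into the definition of sensitivity, writing $\mbox{sens}'$ for the value computed from the adjusted counts (this notation is introduced here only for the derivation) and assuming $\pi' > 0$,

$\mbox{sens}' = \mbox{TP}'/(\mbox{TP}' + \mbox{FN}') = (N \cdot \mbox{sens} \cdot \pi')/(N \cdot \mbox{sens} \cdot \pi' + N \cdot (1 - \mbox{sens}) \cdot \pi') = \mbox{sens}$.

The common factor $N \cdot \pi'$ cancels, leaving $\mbox{sens}/(\mbox{sens} + 1 - \mbox{sens}) = \mbox{sens}$. The same argument with the common factor $N \cdot (1 - \pi')$, assuming $\pi' < 1$, gives

$\mbox{spec}' = \mbox{TN}'/(\mbox{TN}' + \mbox{FP}') = \mbox{spec}$.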

### Plug-In Population Statistic Estimates

Precision, on the other hand, is not invariant to prevalence. But we can compute its expected value in a population with known prevalence using the adjusted contingency matrix counts,

$\mbox{prec}' = \mbox{TP}'/(\mbox{TP}' + \mbox{FP}')$.

The factors of $N$ all cancel, so we can work with true and false positive and negative rates instead of counts.
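Dropping $N$ from the expected counts gives $\mbox{prec}' = \mbox{sens} \cdot \pi' / (\mbox{sens} \cdot \pi' + (1 - \mbox{spec}) \cdot (1 - \pi'))$. Here is a minimal Python sketch of that plug-in estimate; the function name `adjusted_precision` and the example sensitivity, specificity, and prevalence values are hypothetical.

```python
def adjusted_precision(sens, spec, prevalence):
    """Expected precision at a given prevalence, using rates rather than counts.

    Assumes the expected rate of positive responses is non-zero; otherwise
    precision is undefined and this raises ZeroDivisionError.
    """
    tp_rate = sens * prevalence              # TP' / N
    fp_rate = (1 - spec) * (1 - prevalence)  # FP' / N
    return tp_rate / (tp_rate + fp_rate)


# Same hypothetical classifier, evaluated at two different prevalences.
prec_balanced = adjusted_precision(sens=0.8, spec=0.9, prevalence=0.5)
prec_rare = adjusted_precision(sens=0.8, spec=0.9, prevalence=0.1)
print(f"prevalence 0.5: precision {prec_balanced:.3f}")
print(f"prevalence 0.1: precision {prec_rare:.3f}")
```

With these illustrative numbers, precision falls from about 0.89 on a balanced sample to about 0.47 when only 10% of the population is positive, even though sensitivity and specificity are unchanged.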

### Related Work

This is related to my earlier blog post on estimating population prevalence using a classifier with known bias and an unbiased population sample:

The approach described in the linked blog post may be used to estimate the population prevalence $\pi'$ given only an arbitrarily biased labeled training set and an unlabeled and unbiased sample from the population.