I’ve been advocating an approach to classifier evaluation in which the “truth” is itself a probability distribution over possible labels and system responses take the same form.

### Is the Truth Out There?

The main motivation is that we don’t actually know what the truth is for some items being classified. Not knowing the truth can come about because our labelers are biased or inconsistent, because our coding standard is underspecified, or because the items being classified are truly marginal. That is, the truth may not actually be out there, at least in any practically useful sense.

For example, we can squint all we want at the use of a pronoun in text and still not be able to come to any consensus about its antecedent, no matter how many experts in anaphora we assemble. Similarly, we might look at a phrase like *fragile X* in a paper on autism and truly not know if it refers to the gene or the disease. We may not know a searcher’s intent when we try to do spelling correction on a query like *Montona* — maybe they want Montana the state, maybe they want the pop star, or maybe they really want the west central region of the island of São Vicente.

The truth’s not always out there. At least for language data collected from the wild.

### Partial Credit for Uncertainty

The secondary motivation is that many classifiers, like most of LingPipe’s, are able to return probabilistic answers. The basic motivation of log loss is that we want to evaluate the probability assigned by the system to the correct answers. The less uncertain the system is about the true answer, the better.

If we want to be able to trade precision (or specificity) for recall (i.e., sensitivity), it’s nice to have probabilistically normalized outputs over which to threshold.

### Probabilistic Reference Truth

Let’s just consider binary (two-category) classification problems, where a positive outcome is coded as $1$ and a negative one as $0$. Suppose there are $N$ items being classified, $x_1, \ldots, x_N$. A probabilistic corpus then consists of a sequence of probabilistic categorizations $y_1, \ldots, y_N$, where $y_n \in [0,1]$ for $1 \leq n \leq N$.

A non-probabilistic corpus is just an edge case, with $y_n \in \{0, 1\}$ for $1 \leq n \leq N$.

### Probabilistic System Responses

Next, suppose that the systems we’re evaluating are themselves allowed to return conditional probabilities of positive responses, which we will write as $\hat{y}_n \in [0,1]$ for $1 \leq n \leq N$.

A non-probabilistic response is again an edge case, with $\hat{y}_n \in \{0, 1\}$.

### Log Loss

The log loss for response $\hat{y} = \hat{y}_1, \ldots, \hat{y}_N$ for reference $y = y_1, \ldots, y_N$ is the negative expected log probability of the response given the reference, namely

$\mathcal{L}(y, \hat{y}) = - \sum_{n=1}^N \left( y_n \log \hat{y}_n + (1 - y_n) \log (1 - \hat{y}_n) \right)$.

If we think of $y_n$ and $\hat{y}_n$ as parameters to Bernoulli distributions, this is just the cross-entropy, which differs from the KL divergence only by the entropy of the reference; since the reference is fixed, minimizing one minimizes the other.

If the reference is restricted to being $0$ or $1$, this reduces to the usual definition of log loss for a probabilistic classifier (that is, $\hat{y}_n$ may be any value in $[0,1]$).
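For concreteness, here’s a minimal sketch of the definition in Python (the function name is mine; the arithmetic matches the formula above):

```python
import math

def log_loss(y, y_hat):
    """Negative expected log probability of the responses y_hat
    under the probabilistic reference y, summed over items."""
    total = 0.0
    for yn, qn in zip(y, y_hat):
        total -= yn * math.log(qn) + (1.0 - yn) * math.log(1.0 - qn)
    return total

# Probabilistic reference with one uncertain and two hard items:
y     = [0.7, 0.0, 1.0]
y_hat = [0.7, 0.1, 0.9]
print(log_loss(y, y_hat))
```

When the reference is hard ($y_n \in \{0,1\}$), one of the two terms per item vanishes and this is the usual log loss.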

### Plotting a Single Response

Because we just sum over all the responses, we can see what the result is for a single item $n$. Suppose $y_n = 0.7$, which means the reference says there’s a 70% chance of item $n$ being positive, and a 30% chance of it being negative. The plot of the loss for response $\hat{y}_n \in [0,1]$ given reference $y_n = 0.7$ is

[plot: log loss versus $\hat{y}_n$ for $y_n = 0.7$]

It’s easy to see that the calibrated response $\hat{y}_n = 0.7$ is better than a hard response like $\hat{y}_n = 1$ for the case where the reference is $y_n = 0.7$.

The red vertical dashed line is at the point $\hat{y}_n = 0.7$, which is where $\hat{y}_n = y_n$. The blue horizontal line is at the loss value $\approx 0.61$, which is the minimum loss for any $\hat{y}_n$ when $y_n = 0.7$.

### Classifier Uncertainty Matches Reference Uncertainty

It’s easy to see from the graph (and prove via differentiation) that the log loss function is convex and the minimum loss occurs when $\hat{y}_n = y_n$. That is, best performance occurs when the system response has the same uncertainty as the gold standard. In other words, we want to train the system to be uncertain in the same places where the reference (training) data is uncertain.
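Spelling out the calculus for a single item, with reference $y$ and response $q$:

$\ell(q) = -\big( y \log q + (1 - y) \log (1 - q) \big), \qquad \ell'(q) = -\frac{y}{q} + \frac{1 - y}{1 - q}.$

Setting $\ell'(q) = 0$ gives $y(1 - q) = (1 - y)q$, i.e. $q = y$; and $\ell''(q) = y/q^2 + (1 - y)/(1 - q)^2 > 0$, so the loss is convex with its unique minimum at $q = y$.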

### Log Loss Degenerates at the Edges

If the reference is $y_n = 1$ and the response is $\hat{y}_n = 0$, the log loss is infinite. Similarly for $y_n = 0$ and $\hat{y}_n = 1$. As long as the response is neither 0 nor 1, the results will be finite.

If the response is 0 or 1, the loss will only be finite if the reference category is the same value, $y_n = \hat{y}_n$. In fact, having the reference equal the response, $y_n = \hat{y}_n$, with both being 0 or 1, is the only way to get zero loss. In all other situations, loss will be non-zero. Loss is never negative because we’re negating logs of values between 0 and 1.

This is why you don’t see log loss used for bakeoffs. If systems return non-probabilistic 0/1 outputs and the truth is binary 0/1, then any system that makes even a single mistake has infinite error.
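The edge behavior is easy to demonstrate numerically (a sketch in Python; the per-item helper is mine, and it handles the zero-probability edges explicitly so the infinities show up as `inf` rather than errors):

```python
import math

def item_loss(y, q):
    """Per-item log loss; infinite when the response puts zero
    probability on an outcome the reference gives nonzero weight."""
    loss = 0.0
    if y > 0.0:
        loss += float("inf") if q == 0.0 else -y * math.log(q)
    if y < 1.0:
        loss += float("inf") if q == 1.0 else -(1.0 - y) * math.log(1.0 - q)
    return loss

print(item_loss(1.0, 0.0))  # inf: hard response contradicts hard reference
print(item_loss(1.0, 1.0))  # 0.0: matching hard response, the only zero-loss case
for q in (0.1, 0.01, 0.001):
    print(round(item_loss(1.0, q), 3))  # 2.303, 4.605, 6.908: blows up near the edge
```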

### Relation to Regularization via Prior Data

This matching of uncertainty can be thought of as a kind of regularization. One common way to implement priors if you only have maximum likelihood estimators is to add some artificial data. If I add a new data point which has all features turned on, and give it one positive instance and one negative instance, and I’m doing logistic regression, it acts as a regularizer pushing all the coefficients closer to zero (the further they are away from zero, the worse they do on the combination of the one negative and one positive example). If I add more of this kind of data, the regularization effect is stronger. Here, we can think of adding 3 negative examples and 7 positive examples, or 0.3 and 0.7 if we allow fractional training data.
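In symbols: training on a single item with probabilistic label $y_n = 0.7$ contributes the log likelihood

$0.7 \log \hat{y}_n + 0.3 \log (1 - \hat{y}_n) = \tfrac{1}{10} \left( 7 \log \hat{y}_n + 3 \log (1 - \hat{y}_n) \right),$

which is proportional to the log likelihood of 7 positive and 3 negative hard-labeled copies of the same item, so fractional labels and replicated artificial data induce the same maximum likelihood estimate.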

### Comparison to 0/1 Loss

Consider what happens if instead of log loss, we consider the standard 0/1 loss as expressed in a confusion matrix (true positive, true negative, false positive, and false negative counts).

With a probabilistic reference category $y_n$ and probabilistic system response $\hat{y}_n$, the expected confusion matrix counts are

$\mathrm{TP} = \sum_{n=1}^N y_n \, \hat{y}_n$,

$\mathrm{TN} = \sum_{n=1}^N (1 - y_n)(1 - \hat{y}_n)$,

$\mathrm{FP} = \sum_{n=1}^N (1 - y_n) \, \hat{y}_n$, and

$\mathrm{FN} = \sum_{n=1}^N y_n \, (1 - \hat{y}_n)$.

With this, we can compute expected accuracy, precision, recall (= sensitivity), specificity, and F-measure as

$\mathrm{acc} = (\mathrm{TP} + \mathrm{TN}) / N$,

$\mathrm{prec} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$,

$\mathrm{recall} = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$,

$\mathrm{spec} = \mathrm{TN} / (\mathrm{TN} + \mathrm{FP})$, and

$\mathrm{F} = 2 \cdot \mathrm{prec} \cdot \mathrm{recall} / (\mathrm{prec} + \mathrm{recall})$.

All of these statistical measures are degenerate for probabilistic evaluation. For instance, if $y_n > 0.5$, then accuracy is maximized with $\hat{y}_n = 1$; if $y_n < 0.5$, accuracy is maximized with $\hat{y}_n = 0$; and if $y_n = 0.5$, accuracy is the same for all $\hat{y}_n$! The rest of the measures are no better behaved.
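The degeneracy is easy to check directly from the expected-count formulas (a sketch in Python; function names are mine):

```python
def expected_confusion(y, y_hat):
    """Expected TP/TN/FP/FN counts under probabilistic reference y
    and probabilistic response y_hat."""
    tp = sum(yn * qn for yn, qn in zip(y, y_hat))
    tn = sum((1 - yn) * (1 - qn) for yn, qn in zip(y, y_hat))
    fp = sum((1 - yn) * qn for yn, qn in zip(y, y_hat))
    fn = sum(yn * (1 - qn) for yn, qn in zip(y, y_hat))
    return tp, tn, fp, fn

def expected_accuracy(y, y_hat):
    tp, tn, _, _ = expected_confusion(y, y_hat)
    return (tp + tn) / len(y)

# With reference 0.7, the degenerate hard response 1 beats the
# calibrated response 0.7 on expected accuracy:
print(expected_accuracy([0.7], [1.0]))  # 0.7
print(expected_accuracy([0.7], [0.7]))  # ≈ 0.58
```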

### R Source

Plotting functions is easy in R with the `curve()` function. Here’s how I generated the plot above.

```r
p <- 0.7
f <- function(q) -(p * log(q) + (1 - p) * log(1 - q))
curve(f, xlim=c(0, 1), ylim=c(0, 4),
      xlab="yHat", ylab="LogLoss(y, yHat)",
      main="Log Loss ( y=0.7 )",
      lwd=2, cex.lab=1.5, cex.axis=1.5, cex.main=1.5,
      axes=FALSE, frame=FALSE)
axis(1, at=c(0, 0.5, 0.7, 1))
axis(2, at=c(0, f(0.7), 2, 4), labels=c("0", ".61", "2", "4"))
abline(h=f(0.7), col="blue", lty=3)
abline(v=0.7, col="red", lty=2)
```

Just be careful with R’s scoping if you want to redefine the reference probability `p` ($y_n$ in the definitions above): `f` looks `p` up in its enclosing environment each time it’s called, so reassigning `p` changes what `f` computes.

November 8, 2010 at 8:51 pm |

You may find this paper relevant

Beta Regression for Modelling Rates and Proportions

Silvia Ferrari and Francisco Cribari-Neto

http://www.ime.usp.br/~sferrari/beta.pdf

They even have an R package called `betareg`.

November 11, 2010 at 1:07 pm |

Ferrari and Cribari-Neto use a generalized linear model with a beta-distributed response. Translating what they do into notation I can more easily follow, they’re assuming responses $y_n \in (0,1)$ with predictors $x_n$. They then model the responses as beta distributed,

$y_n \sim \mathrm{Beta}(\mu_n \phi, \, (1 - \mu_n) \phi)$,

where the mean parameter $\mu_n$ is modeled using logistic regression,

$\mu_n = \mathrm{logit}^{-1}(\beta^{\top} x_n)$.

This’d be straightforward to program in BUGS or JAGS — you’d just need a prior on the regression coefficients $\beta$ and on the precision parameter $\phi$.

Now that I’ve translated it into terms I understand, I’ll have to get my head around the implications. My first reaction is that the deterministic part of the prediction will be the same as if you just took $y_n = \mathrm{logit}^{-1}(\beta^{\top} x_n)$, because the mean of $\mathrm{Beta}(\mu \phi, (1 - \mu) \phi)$ is $\mu$.
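Spelling that out, since a $\mathrm{Beta}(a, b)$ distribution has mean $a / (a + b)$:

$\mathbb{E}[y_n] = \frac{\mu_n \phi}{\mu_n \phi + (1 - \mu_n) \phi} = \mu_n,$

so the precision parameter $\phi$ affects only the spread around the regression prediction, not the prediction itself.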

November 11, 2010 at 3:11 pm |

Phil Resnik pointed me to this paper, section 5 of which discusses training logistic regression using probabilistic “truth”:

Limin Yao, David Mimno and Andrew McCallum. 2009. Efficient Methods for Topic Model Inference on Streaming Document Collections. KDD.

This made me realize that, of course, EM uses exactly the same kind of “probabilistic training” in its E step. It’s not much more complicated to plug in logistic regression in place of the more standard naive Bayes.
