## Evaluating with Probabilistic Truth: Log Loss vs. 0/1 Loss

I’ve been advocating an approach to classifier evaluation in which the “truth” is itself a probability distribution over possible labels and system responses take the same form.

### Is the Truth Out There?

The main motivation is that we don’t actually know what the truth is for some items being classified. Not knowing the truth can come about because our labelers are biased or inconsistent, because our coding standard is underspecified, or because the items being classified are truly marginal. That is, the truth may not actually be out there, at least in any practically useful sense.

For example, we can squint all we want at the use of a pronoun in text and still not be able to come to any consensus about its antecedent, no matter how many experts in anaphora we assemble. Similarly, we might look at a phrase like fragile X in a paper on autism and truly not know if it refers to the gene or the disease. We may not know a searcher’s intent when we try to do spelling correction on a query like Montona — maybe they want Montana the state, maybe they want the pop star, or maybe they really want the west central region of the island of São Vicente.

The truth’s not always out there. At least for language data collected from the wild.

### Partial Credit for Uncertainty

The secondary motivation is that many classifiers, like most of LingPipe’s, are able to return probabilistic answers. The basic motivation of log loss is that we want to evaluate the probability assigned by the system to the correct answers. The less uncertain the system is about the true answer, the better.

If we want to be able to trade precision (or specificity) for recall (i.e., sensitivity), it’s nice to have probabilistically normalized outputs over which to threshold.

### Probabilistic Reference Truth

Let’s just consider binary (two-category) classification problems, where a positive outcome is coded as $1$ and a negative one as $0$. Suppose there are $N$ items being classified, $x_1, \ldots, x_N$. A probabilistic corpus then consists of a sequence of probabilistic categorizations $y_1, \ldots, y_N$, where $y_n \in [0,1]$ for $1 \leq n \leq N$.

A non-probabilistic corpus is just an edge case, with $y_n \in \{ 0, 1 \}$ for $1 \leq n \leq N$.

### Probabilistic System Responses

Next, suppose that the systems we’re evaluating are themselves allowed to return conditional probabilities of positive responses, which we will write as $\hat{y}_n \in [0,1]$ for $1 \leq n \leq N$.

A non-probabilistic response is again an edge case, with $\hat{y}_n \in \{ 0, 1 \}$.

### Log Loss

The log loss of response $\hat{y}$ against reference $y$ is the negative expected log probability of the response given the reference, namely ${\cal L}(y,\hat{y}) = - \sum_{n=1}^N \left[ y_n \times \log \hat{y}_n + (1-y_n) \times \log (1 - \hat{y}_n) \right]$.

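If we think of $y_n$ and $\hat{y}_n$ as parameters to Bernoulli distributions, each term is the cross-entropy of the response distribution relative to the reference distribution, which is the KL divergence from reference to response plus the entropy of the reference. Because that entropy doesn’t depend on $\hat{y}$, minimizing log loss is the same as minimizing KL divergence.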

If the reference $y$ is restricted to being $0$ or $1$, this reduces to the usual definition of log loss for a probabilistic classifier (that is, $\hat{y}$ may be any value in $[0,1]$).
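As a minimal sketch of the definition above, here is the corpus-level log loss in R. The function name `log_loss` and the toy vectors `y` and `y_hat` are my own illustrations, not part of any library, and natural logs are assumed.

```r
# Log loss of responses y_hat against (possibly fractional) references y.
# Both are numeric vectors of the same length with entries in [0, 1].
log_loss <- function(y, y_hat) {
  stopifnot(length(y) == length(y_hat))
  -sum(y * log(y_hat) + (1 - y) * log(1 - y_hat))
}

# Hypothetical three-item corpus: one marginal item and two clear-cut ones.
y     <- c(0.7, 1.0, 0.0)
y_hat <- c(0.6, 0.9, 0.2)

log_loss(y, y_hat)
```

Because the loss is a plain sum, the corpus-level value is just the total of the per-item losses, which is what makes it reasonable to study one item at a time.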

### Plotting a Single Response

Because we just sum over all the responses, we can see what the result is for a single item $n$. Suppose $y_n = 0.7$, which means the reference says there’s a 70% chance of item $n$ being positive, and a 30% chance of it being negative. In the plot of the loss for response $\hat{y}_n$ given reference $y_n = 0.7$, it’s easy to see that a response near $\hat{y} = 1$ is better than one near $\hat{y} = 0$ when the reference is $y = 0.7$.

The red vertical dashed line is at the point $\hat{y} = 0.7$, which is where $y = \hat{y}$. The blue horizontal line is at ${\cal L}(y,\hat{y}) = 0.61$, which is the minimum loss for any $\hat{y}$ when $y = 0.7$.
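To put numbers on the curve, this sketch evaluates the single-item loss at a few responses for the reference $y_n = 0.7$. The helper name `item_loss` and the particular grid of responses are assumptions chosen for illustration.

```r
# Single-item log loss for reference y and response y_hat (natural logs).
item_loss <- function(y, y_hat) {
  -(y * log(y_hat) + (1 - y) * log(1 - y_hat))
}

y <- 0.7
responses <- c(0.01, 0.3, 0.5, 0.7, 0.9, 0.99)
losses <- item_loss(y, responses)

# One row per response, with the loss rounded for display.
print(data.frame(y_hat = responses, loss = round(losses, 2)))

# The response with the lowest loss on this grid.
responses[which.min(losses)]
```

The loss at $\hat{y} = 0.7$ rounds to the 0.61 marked by the blue line, and a response of 0.01 costs considerably more than a response of 0.99.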

### Classifier Uncertainty Matches Reference Uncertainty

It’s easy to see from the graph (and prove via differentiation) that the log loss function is convex and the minimum loss occurs when $y = \hat{y} = 0.7$. That is, best performance occurs when the system response has the same uncertainty as the gold standard. In other words, we want to train the system to be uncertain in the same places as the reference (training) data is uncertain.
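For the record, the differentiation goes as follows for a single item, with the response restricted to $\hat{y} \in (0,1)$. This is a standard calculation, just written out in the notation used here.

```latex
\begin{align*}
{\cal L}(y,\hat{y}) &= -\left[\, y \log \hat{y} + (1-y) \log (1-\hat{y}) \,\right] \\
\frac{\partial {\cal L}}{\partial \hat{y}} &= -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}
  = \frac{\hat{y} - y}{\hat{y}\,(1-\hat{y})} \\
\frac{\partial^2 {\cal L}}{\partial \hat{y}^2} &= \frac{y}{\hat{y}^2} + \frac{1-y}{(1-\hat{y})^2} > 0
\end{align*}
```

The first derivative vanishes only at $\hat{y} = y$, and the positive second derivative makes that stationary point the unique minimum.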

### Log Loss Degenerates at the Edges

If the reference is $y = 0$ and the response $\hat{y} = 1$, the log loss is infinite. Similarly for $y = 1, \hat{y} = 0$. As long as the response is not 0 or 1, the results will be finite.

If the response is 0 or 1, the loss will only be finite if the reference category is the same value, $y = \hat{y}$. In fact, having the reference equal to the response, $y = \hat{y}$, with both being 0 or 1, is the only way to get zero loss. In all other situations, loss will be non-zero. Loss is never negative because we’re negating logs of values between 0 and 1.

This is why you don’t see log loss used for bakeoffs. If a system doesn’t have probabilistic output and the truth is binary 0/1, then a single wrong answer gives it infinite error, so every system that isn’t perfect ends up tied at infinity.
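A practical wrinkle when coding this up: a literal transcription in R returns `NaN` rather than 0 when the reference and response agree at 0 or 1, because `0 * log(0)` evaluates to `0 * -Inf`. The sketch below adopts the usual convention that $0 \log 0 = 0$. The helper names `xlogy` and `item_loss` are mine, not from any package.

```r
# x * log(p) with the convention 0 * log(0) = 0.
xlogy <- function(x, p) ifelse(x == 0, 0, x * log(p))

# Single-item log loss that is well defined on the closed interval [0, 1].
item_loss <- function(y, y_hat) {
  -(xlogy(y, y_hat) + xlogy(1 - y, 1 - y_hat))
}

item_loss(1, 1)       # agreement at the edge: zero loss
item_loss(0, 1)       # confident and wrong: Inf
item_loss(0.7, 1)     # uncertain reference with a 0/1 response: Inf
item_loss(0.7, 0.99)  # backing off slightly makes the loss finite
```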

### Relation to Regularization via Prior Data

This matching of uncertainty can be thought of as a kind of regularization. One common way to implement priors if you only have maximum likelihood estimators is to add some artificial data. If I add a new data point $x$ which has all features turned on, and give it one positive instance and one negative instance, and I’m doing logistic regression, it acts as a regularizer pushing all the coefficients closer to zero (the further they are away from zero, the worse they do on the combination of the one negative and one positive example). If I add more of this kind of data, the regularization effect is stronger. Here, we can think of adding 3 negative examples and 7 positive examples, or 0.3 and 0.7 if we allow fractional training data.
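Here is a toy illustration of that effect, under a strong simplifying assumption: a logistic regression with a single always-on feature and no intercept, for which the maximum likelihood coefficient is just the logit of the observed proportion of positives. The counts (9 positive, 1 negative) and the function name `mle_coef` are made up for the example.

```r
# MLE of the single coefficient when the only feature is always 1:
# the fitted probability equals the proportion of positives, so
# beta = logit(pos / (pos + neg)).
mle_coef <- function(pos, neg) qlogis(pos / (pos + neg))

pos <- 9  # hypothetical observed positives
neg <- 1  # hypothetical observed negatives

# Add k artificial positive/negative pairs with the feature turned on.
k <- c(0, 1, 5, 25)
beta <- mle_coef(pos + k, neg + k)

# The coefficient shrinks toward zero as more prior data is added.
print(data.frame(pairs_added = k, beta = round(beta, 3)))
```

Fractional pseudo-counts such as 0.7 and 0.3 plug into the same formula unchanged.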

### Comparison to 0/1 Loss

Consider what happens if instead of log loss, we consider the standard 0/1 loss as expressed in a confusion matrix (true positive, true negative, false positive, and false negative counts).

With a probabilistic reference category $y$ and probabilistic system response $\hat{y}$, the expected confusion matrix counts are $\mathbb{E}[\mbox{TP}] = y \times \hat{y}$, $\mathbb{E}[\mbox{TN}] = (1-y) \times (1-\hat{y})$, $\mathbb{E}[\mbox{FP}] = (1-y) \times \hat{y}$, and $\mathbb{E}[\mbox{FN}] = y \times (1 - \hat{y})$.

With this, we can compute expected accuracy, precision, recall (= sensitivity), specificity, and F-measure as

$\mathbb{E}[\mbox{acc}] = \mathbb{E}[(\mbox{TP}+\mbox{TN})/(\mbox{TP}+\mbox{TN}+\mbox{FP}+\mbox{FN})] = y \hat{y} + (1 - y)(1-\hat{y})$,

$\mathbb{E}[\mbox{recall}] = \mathbb{E}[\mbox{TP}/(\mbox{TP}+\mbox{FN})] = y \hat{y}/(y\hat{y} + y(1 - \hat{y})) = \hat{y}$,

$\mathbb{E}[\mbox{precision}] = \mathbb{E}[\mbox{TP}/(\mbox{TP}+\mbox{FP})] = y \hat{y}/(y\hat{y} + (1-y)\hat{y}) = y$,

$\mathbb{E}[\mbox{specificity}] = \mathbb{E}[\mbox{TN}/(\mbox{TN}+\mbox{FP})] = (1-y)(1-\hat{y})/((1-y)(1-\hat{y}) + (1-y)\hat{y}) = 1-\hat{y}$, and

$\mathbb{E}[\mbox{F-measure}] = \mathbb{E}[2 \times \mbox{prec} \times \mbox{recall} / (\mbox{prec} + \mbox{recall})] = 2 y \hat{y} / (y + \hat{y})$.

All of these statistical measures are degenerate for probabilistic evaluation. For instance, if $y > 0.5$, then accuracy is maximized with $\hat{y} = 1$; if $y < 0.5$, accuracy is maximized with $\hat{y} = 0$; and if $y = 0.5$, accuracy is the same for all $\hat{y} \in [0,1]$! The rest of the measures are no better behaved.
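The degeneracy is easy to check numerically. This sketch evaluates the expected accuracy formula above on a grid of responses; `expected_accuracy` is a name I’ve made up for it.

```r
# Expected accuracy for one item: E[TP] + E[TN].
expected_accuracy <- function(y, y_hat) {
  y * y_hat + (1 - y) * (1 - y_hat)
}

grid <- seq(0, 1, by = 0.1)

# Reference y = 0.7: accuracy rises linearly, so the best response is 1.
grid[which.max(expected_accuracy(0.7, grid))]

# Reference y = 0.3: accuracy falls linearly, so the best response is 0.
grid[which.max(expected_accuracy(0.3, grid))]

# Reference y = 0.5: accuracy is 0.5 (up to rounding) for every response.
range(expected_accuracy(0.5, grid))
```

In other words, expected accuracy rewards an all-or-nothing response even when the reference itself is uncertain, which is exactly what log loss avoids.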

### R Source

Plotting functions is easy in R with the curve() function. Here’s how I generated the one above.

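```r
# Reference probability y for the single item being plotted.
p <- 0.7

# Log loss of response q against reference p (p is looked up when f is called).
f <- function(q) {
  -(p * log(q) + (1 - p) * log(1 - q))
}

curve(f, xlim = c(0, 1), ylim = c(0, 4),
      xlab = "yHat", ylab = "LogLoss(y, yHat)",
      main = "Log Loss  ( y=0.7 )",
      lwd = 2, cex.lab = 1.5, cex.axis = 1.5, cex.main = 1.5,
      axes = FALSE, frame = FALSE)

# Custom tick marks, then the minimum-loss and y = yHat guide lines.
axis(1, at = c(0, 0.5, 0.7, 1))
axis(2, at = c(0, f(0.7), 2, 4), labels = c("0", ".61", "2", "4"))
abline(h = f(0.7), col = "blue", lty = 3)
abline(v = 0.7, col = "red", lty = 2)
```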


Just be careful with R’s scoping if you want to redefine the reference result p ( $y$ in the definitions above): f looks up the global p each time it’s called, so changing p later changes the curve.

### 4 Responses to “Evaluating with Probabilistic Truth: Log Loss vs. 0/1 Loss”

1. Vikas Raykar Says:

You may find this paper relevant

Beta Regression for Modelling Rates and Proportions
Silvia Ferrari and Francisco Cribari-Neto

They even have an R package called betareg.

2. lingpipe Says:

Ferrari and Cribari-Neto use a generalized linear model with a Beta-type link. Translating what they do into notation I can more easily follow, they’re assuming responses $y_n \in [0,1]$ with predictors $X_n \in \mathbb{R}^K$. They then model the responses as beta distributed, $y_n \sim \mbox{\sf Beta}(\mu_n \tau, (1-\mu_n) \tau)$,

where the mean parameter is modeled using logistic regression $\mu_n = \mbox{logit}^{-1}(\beta X_n^T)$.

This’d be straightforward to program in BUGS or JAGS — you’d just need a prior on the regression coefficients $\beta_k$ and on the precision parameter $\tau$.

Now that I’ve translated it into terms I understand, I’ll have to get my head around the implications. My first reaction is that the deterministic part of the prediction will be the same as if you just took $y_n = \mu_n$, because the mean of $\mbox{\sf Beta}(\mu_n \tau, (1-\mu_n) \tau)$ is $\mu_n$.

3. lingpipe Says:

Phil Resnik pointed me to this paper, section 5 of which discusses training logistic regression using probabilistic “truth”:

Limin Yao, David Mimno and Andrew McCallum. 2009. Efficient Methods for Topic Model Inference on Streaming Document Collections. KDD.

This made me realize that, of course, EM uses exactly the same kind of “probabilistic training” in its E step. It’s not much more complicated to plug in logistic regression in place of the more standard naive Bayes.

4. Competitive Predictive Modeling – How Useful is it? at So much to do, so little time Says:

[…] The contest was open to anybody and who ever got the best classification model (as measured by log loss) was selected as the winner. You can read more about the details of the competition and also on […]