The issue came up in the last blog post, Discontinuities in ROC calculations, about smoothing out Area-under-the-curve (AUC) calculations for ROC and PR curves given observed data points that are very close together.

If you have a bunch of point measurements that you assume were taken with some error, what you can do is model the error itself. Suppose we don’t want to think too hard and just make the error normal, so that a measurement of is replaced with a distribution , where is the standard deviation of the error.

Then, rather than computing a function based on the empirical measurement , instead compute the posterior of given a random variable .

To get back to something on the same scale, it’s typical to use a Bayesian point estimate that minimizes expected square error, namely the expectation. Now, instead of a point , such as an AUC statistic, defined precisely by the observation , we have a point defined by the expectation of characterized by our normal measurement error model.

.

We can also compute even probabilities, such as comparing two systems with noisy measurements and , with noise , by

.

Of course, this leaves the problem of how to estimate . Ideally, we’ll have lots of measurements of the same thing and be able to estimate directly, amounting in what’s called an “empirical Bayes” estimate (note that “empirical Bayes” is not a full Bayesian method and is no more empirical than full Bayes, which is also estimated from data). Of course, we could go the full Bayes route integrate over our posterior for , giving us

.

As Mike Ross pointed out, the result is going to be sensitive to . In particular, as Mike pointed out, as measurement error goes to zero, we get our original statistic back,

.

Similarly, as gets big, the limit of and converge if and .

The bottom line is that you’d like a sensible model of measurement error.

Real AUC calculations depend on more than one point. If you have a vector points with discrete true values , you just chain the integrals together, as in

.

With a normal noise model (or really any model you can sample from), it’s easy to compute this integral using Monte Carlo methods. Just take a sequence of samples , where is drawn independently from , and approximate the integral required to compute the expectation by

.

As the number of samples grows, this summation converges to the expectation,

.

October 27, 2011 at 3:21 pm |

You making an assumption that the variance of your estimator is the same for the whole range of output which is not true in many cases. If you predict a probability it is, obviously, not true because it has the limited range 0..1. I think that if you have enough data bootstrapping methods will give you a more reliable result, in the simplest form, averaging over sub-samples.

October 27, 2011 at 4:34 pm |

@Max: You’re right. I was trying to keep it simple. If the scores are in the limited range of 0 to 1, then we need an error model that stays in range.

For instance, if we have a 0.99 estimate and a noise with deviation 0.01, we’ll get probabilties of results being greater than 1.

If instead, the prediction is converted to (-infinity,infinity) using logit, then normal noise is added, then the result converted back to [0,1] using inverse logit, you both stay in range, get skewed error, and get different ranges of error back on the [0,1] scale depending on your prediction because the slope of the inverse logit isn’t constant.

In general, what you really want to do is model the noise based on the process that you have. For instance, if I were doing a logistic regression, even without measurement error I’d have noise in my posteriors for the coefficients which could be integrated over to get predictions.

October 27, 2011 at 6:24 pm

@Bob, I agree 100% percent. But my point was that a simpler approach that solves the problem of “smoothing” is not trying to model the predictor’s noise but use a bootstrap method, for example sub-sample test data and average ROC AUC over sub-samples. In my experience it works pretty well and provides also intervals that can be useful for analysis.

October 28, 2011 at 1:22 pm

@Max. Right. Isn’t the bootstrap here just a way of measuring the noise in estimating the predictor (and its predictions)? The bootstrap always seems like the simplest non-Bayesian approach to me because implementing it is so much like sampling from a Bayesian posterior (though the interpretation is very different).

October 27, 2011 at 11:21 pm |

I agree with Max that the bootstrap looks attractive in situations like this.

But the bootstrap is not without its own problems. For example, because the bootstrap involves sampling with replacement, typically the bootstrap samples only include a subset of the original training items, which can bias some bootstrap estimates (e.g., of the variance). And AFAIK this bias is itself typically hard to estimate or bound.

I haven’t looked at the AUC case, so I don’t know if these problems arise here. But the possibility of introducing an unquantified bias into my evaluation code makes bootstrap methods less attractive, I think.

October 28, 2011 at 11:12 am |

I see what’s going on now (thanks Wikipedia, for confirming my guess, which I just replaced with this note). The bootstrap is consistent (asymptotically approaches the correct estimate), but may be biased in finite samples.

Thanks for pointing this out, Mark. I hadn’t thought about the bootstrap this way before (probably because I didn’t know enough stats the first time I looked at it).

October 28, 2011 at 3:00 pm |

@Bob. My idea is the following. When we estimate metrics we usually are not so interested in their expected values but in a statistically sound evidence that one estimator has better metrics than another on test data. Because I don’t want to make strong assumptions about metrics distribution, a non-parametric method (bootstrapping) is a good candidate. Yes, from a Bayesian point of view this is sampling from posterior and you can apply a 100% Bayesian treatment to it analysis.

October 28, 2011 at 9:56 pm |

But unfortunately it’s usually in these applications that the bias in the bootstrap can be problematic.

I came across this when I tried to use a bootstrap estimate of the significance of the f-score difference between two different parsing algorithms. It seems easy enough — parse a heldout corpus with each algorithm (I use section 24 of the PTB these days for things like this), and then repeatedly draw bootstrap samples and evaluate the f-scores of those samples.

When you first see this, it looks ideal, since you can evaluate the fraction of bootstrap samples on which parser 1 has a better f-score than parser 2 — this seems to give you a way of estimating the significance of parser 1 being better than parser 2.

But this significance estimate is itself biased, as I explained above. One way to see this is to imagine that the heldout corpus being bootstrap resampled consists of a single sentence. Then your significance estimate is clearly wrong.

November 1, 2011 at 7:44 am

I’ve been thinking about almost the same problem — comparing the area under the ROC curve of two classifiers on a held-out test — for a while. The problem is to come up with p-values that correctly captures the null hypothesis H0.

If we bootstrap the instances and compute the statistic separately for both classifiers on each bootstrap sample then the resulting bootstrap distributions of the statistic — be it AUC or f-score — will hopefully center around the corresponding, observed values in the original held-out sample (which are most likely different from each other in the first place). This is not consistent with H0 which states there is no difference in the mean of the two sample distributions of the statistic.

If we were computing a simple statistic like the average score on each sample, a simple permutation test would be enough: for each instance, randomly flip or leave intact the scores coming from each other classifier. Compute mean difference in the scores for the two conditions. Repeat the permutation N times, create the permutation distribution of mean difference. The distribution would correspond to the distribution we should expect to see if H0 is true. We could use p-value for the actual difference we observe in the original held-out sample.

However, for AUC — I assume for f-score as well — the case is not that simple. We need to aggregate over the samples to come up with a statistic. Simply iterating over the matched samples isn’t enough. In AUC, the rank of a sample is crucial — not the actual score assigned by the classifier. I can’t simply shuffle the scores coming from the two classifiers because they don’t have to be compatible. Maybe a permutation test shuffling the ranks of the instances would give me proper p-values, but I haven’t been able to convince myself that this is really the case.

November 1, 2011 at 10:18 am

IMHO:

@Mark. You parser is a deterministic algorithm that makes all “points” generated from one sentence correlated therefore you can’t “shuffle” them for bootstrapping that assumes IID, in this settings the sentences need to be shuffled.

@ammacinho. Yes, the problem is that this is an aggregated algorithm but I didn’t get your idea about “shuffling the ranks”. From a classical statistical frequentist point of view you have two deterministic algorithms (classifier1+ ROC AUC estimator, classifier2+ROC AUC estimator) and H0 is that for a distribution of input data and labels output of these algorithms are the same. Both deterministic algorithms are “fixed black boxes”, the only one thing that is random here is set of random values that we use to calculate the values and we can use bootstrapping to generate them from our hold-out set.

November 1, 2011 at 8:23 pm

@max I agree with you on H0. Let’s say classifier1 has an AUC of 0.7 and classifier2 has an AUC of 0.8 on the hold-out set. Correct me if I’m wrong, but assuming there is no bias, the bootstrap distribution of classifier1 should be centered around 0.7 and the bootstrap distribution of classifier2 should be centered around 0.8 (i.e., the observed statistics on the original sample). However, this is a contradiction with H0. We can’t simply look at the delta-AUC for each bootstrap and use the distribution of differences to calculate a p-value. The bootstrap, in this case, doesn’t represent the null hypothesis.

What I mean by shuffling the rank is below.

0. Transform the output of classifiers into rankings (which won’t change the AUC values). Say r(i, 1) and r(i, 2) are the ranks given to instance i by classifier1 and classifier2 respectively.

1. Go over each instance and either leave r(i, :) as it is or flip the two ranks of i (i.e., shuffle ranks assigned to the same instance by the two classifiers).

2. What you get is the permutation sample. Compute AUC on the first and second columns of r, record the difference as a permutation sample of delta-AUC.

Repeat steps 1 and 2 N times to get a permutation sample distribution of the delta-AUC. Under the null hypothesis, the ranks of the two classifiers should be exchangable. Therefore, the distribution should be centered around 0. Use it to compute p-value.

Once I thought about the procedure, I knew what to search. Turns out this and other related techniques are already used by medical community:

http://onlinelibrary.wiley.com/doi/10.1002/sim.2149/abstract

http://biostatistics.oxfordjournals.org/content/9/2/364.short