Here is what I mean by shuffling the ranks.

0. Transform the output of classifiers into rankings (which won’t change the AUC values). Say r(i, 1) and r(i, 2) are the ranks given to instance i by classifier1 and classifier2 respectively.

1. Go over each instance and either leave r(i, :) as it is or flip the two ranks of i (i.e., shuffle ranks assigned to the same instance by the two classifiers).

2. What you get is a permutation sample. Compute the AUC on the first and second columns of r, and record the difference as a permutation sample of delta-AUC.

Repeat steps 1 and 2 N times to get a permutation distribution of the delta-AUC. Under the null hypothesis, the ranks given by the two classifiers should be exchangeable, so the distribution should be centered around 0. Use it to compute a p-value: the fraction of permutation samples whose delta-AUC is at least as extreme as the one observed.
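To make steps 0 to 2 concrete, here is a minimal sketch in plain Python. The function names (`auc`, `to_ranks`, `rank_swap_test`) are my own, the two-sided p-value with a +1 correction is one common convention rather than something prescribed above, and tied ranks (which can appear within a column after swapping) are given half credit in the AUC. Treat it as an illustration of the idea, not a validated implementation.

```python
import random

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ordered correctly.
    Ties get half credit. Labels are assumed to be 0 or 1."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def to_ranks(scores):
    """Step 0: replace scores by ranks (1 = lowest); tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def rank_swap_test(scores1, scores2, labels, n_perm=2000, seed=0):
    """Steps 1 and 2, repeated n_perm times. Returns (observed delta-AUC, p-value)."""
    rng = random.Random(seed)
    r1, r2 = to_ranks(scores1), to_ranks(scores2)
    observed = auc(r1, labels) - auc(r2, labels)
    extreme = 0
    for _ in range(n_perm):
        col1, col2 = [], []
        for a, b in zip(r1, r2):
            if rng.random() < 0.5:  # step 1: flip this instance's two ranks
                a, b = b, a
            col1.append(a)
            col2.append(b)
        delta = auc(col1, labels) - auc(col2, labels)  # step 2
        if abs(delta) >= abs(observed):
            extreme += 1
    # +1 in numerator and denominator counts the observed sample itself
    return observed, (extreme + 1) / (n_perm + 1)
```

Calling `rank_swap_test(scores1, scores2, labels)` on the held-out set returns the observed delta-AUC and its permutation p-value. The pairwise AUC is quadratic in the number of instances, which is fine for a sketch but slow for large held-out sets.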

Once I had thought the procedure through, I knew what to search for. It turns out this and other related techniques are already used by the medical community:

http://onlinelibrary.wiley.com/doi/10.1002/sim.2149/abstract

http://biostatistics.oxfordjournals.org/content/9/2/364.short

@Mark. Your parser is a deterministic algorithm, so all the “points” generated from one sentence are correlated. You therefore can’t “shuffle” them for a bootstrap that assumes IID samples; in this setting it is the sentences that need to be shuffled.

@ammacinho. Yes, the problem is that this is an aggregated algorithm, but I didn’t get your idea about “shuffling the ranks”. From a classical frequentist point of view, you have two deterministic algorithms (classifier1 + ROC AUC estimator, classifier2 + ROC AUC estimator), and H0 is that, for a given distribution of input data and labels, the outputs of these algorithms are the same. Both deterministic algorithms are “fixed black boxes”; the only thing that is random here is the set of data we use to calculate the values, and we can use bootstrapping to generate such sets from our hold-out set.

If we bootstrap the instances and compute the statistic separately for both classifiers on each bootstrap sample then the resulting bootstrap distributions of the statistic — be it AUC or f-score — will hopefully center around the corresponding, observed values in the original held-out sample (which are most likely different from each other in the first place). This is not consistent with H0 which states there is no difference in the mean of the two sample distributions of the statistic.

If we were computing a simple statistic like the average score on each sample, a simple permutation test would be enough: for each instance, randomly swap or leave intact the scores coming from the two classifiers. Compute the mean difference in the scores between the two conditions. Repeat the permutation N times to build the permutation distribution of the mean difference. This distribution corresponds to what we should expect to see if H0 is true, so we could use it to compute a p-value for the actual difference we observe in the original held-out sample.
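For the simple mean-difference case, the test might look like the sketch below. The name `paired_permutation_test` and the two-sided p-value with a +1 correction are my own choices. Swapping the two scores of an instance just flips the sign of that instance's difference, which is what the loop exploits.

```python
import random

def paired_permutation_test(x, y, n_perm=5000, seed=0):
    """Paired permutation test on the mean of x[i] - y[i].
    Returns (observed mean difference, two-sided p-value)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(x, y)]
    observed = sum(diffs) / len(diffs)
    extreme = 0
    for _ in range(n_perm):
        # Swapping an instance's two scores negates its difference.
        perm_mean = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(perm_mean) >= abs(observed):
            extreme += 1
    # +1 in numerator and denominator counts the observed sample itself
    return observed, (extreme + 1) / (n_perm + 1)
```

This works here because the statistic is a plain average of per-instance values; that is exactly the property that AUC and f-score lack.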

However, for AUC — I assume for f-score as well — the case is not that simple. We need to aggregate over the samples to come up with a statistic. Simply iterating over the matched samples isn’t enough. In AUC, the rank of a sample is crucial — not the actual score assigned by the classifier. I can’t simply shuffle the scores coming from the two classifiers because they don’t have to be compatible. Maybe a permutation test shuffling the ranks of the instances would give me proper p-values, but I haven’t been able to convince myself that this is really the case.

I came across this when I tried to use a bootstrap estimate of the significance of the f-score difference between two different parsing algorithms. It seems easy enough: parse a heldout corpus with each algorithm (I use section 24 of the PTB these days for things like this), and then repeatedly draw bootstrap samples and evaluate the f-scores of those samples.

When you first see this, it looks ideal, since you can evaluate the fraction of bootstrap samples on which parser 1 has a better f-score than parser 2 — this seems to give you a way of estimating the significance of parser 1 being better than parser 2.

But this significance estimate is itself biased, as I explained above. One way to see this is to imagine that the heldout corpus being bootstrap resampled consists of a single sentence. Then your significance estimate is clearly wrong.
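The bootstrap comparison described here, and its single-sentence failure case, can be sketched as follows. I am assuming each parser's output is summarised per sentence as a hypothetical (matched, predicted, gold) bracket-count tuple and that the f-score is computed at the corpus level; the function names are mine.

```python
import random

def f_score(counts):
    """Corpus-level F1 from per-sentence (matched, predicted, gold) bracket counts."""
    matched = sum(c[0] for c in counts)
    predicted = sum(c[1] for c in counts)
    gold = sum(c[2] for c in counts)
    if matched == 0:
        return 0.0
    precision, recall = matched / predicted, matched / gold
    return 2 * precision * recall / (precision + recall)

def bootstrap_win_fraction(counts1, counts2, n_boot=1000, seed=0):
    """Fraction of bootstrap resamples (over sentences) where parser 1 beats parser 2."""
    rng = random.Random(seed)
    n = len(counts1)
    wins = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample sentences with replacement
        if f_score([counts1[i] for i in idx]) > f_score([counts2[i] for i in idx]):
            wins += 1
    return wins / n_boot

# Degenerate case: a heldout "corpus" of one sentence (made-up counts).
parser1 = [(9, 10, 10)]
parser2 = [(8, 10, 10)]
print(bootstrap_win_fraction(parser1, parser2))  # 1.0
```

With one sentence, every bootstrap sample is that same sentence, so the fraction is 1.0 regardless of how small or accidental the difference is. The procedure reports certainty from a single observation.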

Thanks for pointing this out, Mark. I hadn’t thought about the bootstrap this way before (probably because I didn’t know enough stats the first time I looked at it).

But the bootstrap is not without its own problems. For example, because the bootstrap involves sampling with replacement, typically the bootstrap samples only include a subset of the original training items, which can bias some bootstrap estimates (e.g., of the variance). And AFAIK this bias is itself typically hard to estimate or bound.

I haven’t looked at the AUC case, so I don’t know if these problems arise here. But the possibility of introducing an unquantified bias into my evaluation code makes bootstrap methods less attractive, I think.

For instance, if we have a 0.99 estimate and noise with standard deviation 0.01, we’ll assign nonzero probability to results greater than 1.

If instead the prediction is converted to (-infinity, infinity) using logit, normal noise is added, and the result is converted back to [0,1] using inverse logit, you stay in range, get skewed error, and get different ranges of error back on the [0,1] scale depending on your prediction, because the slope of the inverse logit isn’t constant.
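A small sketch of the logit-scale noise, using only the standard library. The helper names and the noise level `sd=0.5` (on the logit scale) are arbitrary choices for illustration.

```python
import math
import random

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(z):
    return 1 / (1 + math.exp(-z))

def noisy_predictions(p, sd, n, seed=0):
    """Add normal noise on the logit scale, then map back to the probability scale."""
    rng = random.Random(seed)
    return [inv_logit(logit(p) + rng.gauss(0, sd)) for _ in range(n)]

def spread(xs):
    """Sample standard deviation."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

near_one = noisy_predictions(0.99, sd=0.5, n=10000)
middle = noisy_predictions(0.50, sd=0.5, n=10000)

print(max(near_one) < 1.0)                # noisy values never leave the unit interval
print(spread(near_one) < spread(middle))  # same logit noise, smaller spread near 1
```

The same noise on the logit scale gives a much tighter spread around 0.99 than around 0.5, and the values near 0.99 are skewed toward lower probabilities because there is little room above.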

In general, what you really want to do is model the noise based on the process that you have. For instance, if I were doing a logistic regression, even without measurement error I’d have noise in my posteriors for the coefficients which could be integrated over to get predictions.
