Fokkens et al. on Replication Failure


After ACL, Becky Passonneau said I should read this:

You should, too. Or if you’d rather watch, there’s a video of their ACL presentation.

The Gist

The gist of the paper is that the authors tried to reproduce some results found in the literature for word sense identification and named-entity detection, and had a rather rough go of it. In particular, they found that every little decision they made impacted the evaluation error they got, including how to evaluate the error. As the narrative of their paper unfolded, all I could do was nod along. I loved that their paper was in narrative form — it made it very easy to follow.

The authors stress that we need more information for reproducibility and that reproducing known results should get more respect. No quibble there. A secondary motivation (after the pedagogical one) of the LingPipe tutorials was to show that we could reproduce existing results in the literature.

The Minor Fall

Although they’re riding one of my favorite hobby horses, they don’t quite lead it in the direction I would have. In fact, they stray into serious discussion of the point estimates of performance that their experiments yielded, despite the variation they see across cross-validation folds. So while they note that the rankings of approaches change based on what seem like minor details, they stop short of calling into question the whole idea of trying to assign a simple ranking. Instead, they stress getting to the bottom of why the rankings differ.

A Slightly Different Take on the Issue

What I would’ve liked to have seen in this paper is more emphasis on two key statistical concepts:

  1. the underlying sample variation problem, and
  2. the overfitting problem with evaluations.

Sample Variation

The authors talk about sample variation a bit when they consider different folds, etc., in cross-validation. For reasons I don’t understand they call it “experimental setup.” But they don’t put it in terms of estimation variance. That is, independently of getting the features to all line up, the performance and hence rankings of various approaches vary to a large extent because of sample variation. The easiest way to see this is to run cross-validation.

I would have stressed that the bootstrap method should be used to get at sample variation. The bootstrap differs from cross-validation in that it samples training and test subsets with replacement. The bootstrap effectively evaluates within-sample sampling variation, even in the face of non-i.i.d. samples (usually there is correlation among the items in a corpus due to choosing multiple sentences from a single document or even multiple words from the same sentence). The picture is even worse in that the data set at hand is rarely truly representative of the intended out-of-sample application, which is typically to new text. The authors touch on these issues with discussions of “robustness.”
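
As a rough illustration of the idea, here is a minimal sketch of a document-level bootstrap for accuracy. It is a simplification, not the authors' procedure: it assumes a fixed, already-trained system and resamples only the evaluation documents, whereas a fuller treatment would also resample the training data. The function name `bootstrap_accuracy` and the toy data are hypothetical.

```python
import random
import statistics

def bootstrap_accuracy(docs, n_resamples=1000, seed=0):
    """Estimate the sampling variation of accuracy with a bootstrap.

    docs: list of documents, each a non-empty list of 0/1 flags saying
          whether the system got that item right (hypothetical format).
    Returns (mean, standard deviation) of accuracy over the replicates.
    """
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_resamples):
        # Resample whole documents with replacement, so correlated items
        # from the same document stay together in each replicate.
        sample = rng.choices(docs, k=len(docs))
        correct = sum(sum(doc) for doc in sample)
        total = sum(len(doc) for doc in sample)
        accuracies.append(correct / total)
    return statistics.mean(accuracies), statistics.stdev(accuracies)

# Toy data, invented purely for illustration.
toy_docs = [[1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1], [0, 1, 1, 0]]
mean_acc, sd_acc = bootstrap_accuracy(toy_docs)
print(f"accuracy ~ {mean_acc:.3f} +/- {sd_acc:.3f}")
```

Resampling at the document level rather than the item level is a design choice that follows from the correlation noted above; resampling individual items would treat them as independent and tend to understate the spread.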


Overfitting

The authors don’t really address the overfitting problem, though it is tangential to what they call the “versioning” problem (when mismatched versions of corpora such as WordNet produce different results).

Overfitting is a huge problem in the current approach to evaluation in NLP and machine learning research. I don’t know why anyone cares about the umpteenth evaluation on the same fold of the Penn Treebank that produces a fraction of a percentage gain. When the variation across folds is higher than your estimated gain on a single fold, it’s time to pack the papers in the file drawer, not in the proceedings of ACL. At least until we also get a journal of negative results to go with our journal of reproducibility.
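
The comparison in the previous paragraph can be sketched in a few lines. This is only an illustration under stated assumptions: both systems are scored on the same folds, the function name `gain_vs_fold_variation` is hypothetical, and the scores are made up rather than taken from any paper.

```python
import statistics

def gain_vs_fold_variation(baseline_scores, new_scores):
    """Compare a claimed gain with the spread of scores across folds.

    Both arguments are per-fold scores (e.g. accuracy) for two systems
    evaluated on the same folds, in the same order.
    """
    if len(baseline_scores) != len(new_scores):
        raise ValueError("both systems must be scored on the same folds")
    # Per-fold differences (new minus baseline).
    diffs = [new - base for base, new in zip(baseline_scores, new_scores)]
    mean_gain = statistics.mean(diffs)
    # Spread of the baseline across folds: a crude proxy for sample variation.
    fold_sd = statistics.stdev(baseline_scores)
    return {
        "mean_gain": mean_gain,
        "fold_sd": fold_sd,
        "gain_sd": statistics.stdev(diffs),
        "gain_exceeds_fold_sd": mean_gain > fold_sd,
    }

# Invented per-fold accuracies, for illustration only.
baseline = [0.90, 0.88, 0.92, 0.86, 0.89]
improved = [0.905, 0.882, 0.921, 0.866, 0.893]
print(gain_vs_fold_variation(baseline, improved))
```

Comparing the mean gain with the baseline's standard deviation across folds is a blunt heuristic, not a significance test; it is meant only to make the "gain smaller than fold-to-fold variation" situation easy to spot.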

The Major Lift

Perhaps hill-climbing error on existing data sets is not the best way forward methodologically. Maybe, just maybe, we can make the data better. It’s certainly what Breck and I tell customers who want to build an application.

If making the data better sounds appealing, check out

I find the two ideas so compelling that it’s the only area of NLP I’m actively researching. More on that later (or sooner if you want to read ahead).


(My apologies to Leonard Cohen for stealing his lyrics.)

7 Responses to “Fokkens et al. on Replication Failure”

  1. Bob Carpenter Says:

    This blog post’s very relevant to the big issue of variance, and also ties into the Brill and Banko paper cited under “data quantity” above:

  2. Mark Johnson Says:

    All good comments and suggestions, I think. One implication of what you’ve just said is that the paired t-tests and the like which are what our field uses for significance testing (when we actually do any statistical tests at all) are actually measuring the wrong thing. We don’t want to know how unlikely it is that two systems produce identical-scoring results on the test set (if you think about it, you’ll realise this is a bizarre question if the systems are deterministic). Instead, what I want to know is: if I draw a new test set, what’s the chance that system 1 will outperform system 2? You can use bootstrap resampling to estimate this, of course, but does anyone do it?

    • Bob Carpenter Says:

      All of stats is a bizarre question if the data’s been observed. I was reading Bulmer’s intro book (a cheap Dover book that’s an awesome intro to frequentist stats), and he quoted J.S. Mill’s original take on stats as being counterfactual. That is, it’s all about the “what if we did it again, would it have the same outcome?”. Which brings up all the usual issues with counterfactuals, mainly that if everything’s exactly the same, we get the same output by definition!

      But the t-test does try to answer this question. The problem I have with its use in most NLP applications is that the i.i.d. normality assumptions don’t hold. So lots of people have moved to non-parametric tests, but they have the same i.i.d. assumption which causes them to underestimate uncertainty.

      I don’t think the bootstrap’s very widely used. I find I lose people at the point where I try to explain that it’s sampling with replacement to deal with variance properly.

  3. Mark Johnson Says:

    I guess what I meant is: we know exactly how our deterministic parsers (say) perform on some specific test set (section 22 of the WSJ), and if we run them again, they’ll do exactly the same. What we really want to know is: how well will they do when tested on a new test set (and possibly also trained on a new training set)? The t-test doesn’t answer this question.

    • Bob Carpenter Says:

      Isn’t that what the t-test is trying to do? It treats section 22 as a sample drawn from a larger population.

      In a simpler case, if I toss a coin and see 40 heads and 10 tails, I ask the question of whether I can reject the null hypothesis that this sample is an i.i.d. sample from a population with say an even number of heads and tails.

  4. brendan o'connor (@brendan642) Says:

    One thing that helps is if the test set is from a different domain. In the case of parsing, that was my interpretation of what the English Web Treebank is for:

    As far as the purely statistical issue of resampling variability goes, I’ve been wondering at what level the resampling assumption should be applied. For example, when evaluating token-level tagging accuracy, it’s straightforward to use a token-level resampling model (like a binomial or McNemar’s test where each token is an event … or could do bootstrap or whatever too). I have done this many times. However, accuracy might cluster/overdisperse at the sentence or document level; for example, some documents are much harder than others. (Imagine an additive logit model for the Bernoulli event of “is the token correct or not” with variables at the sentence and document levels.) Do we need to resample at the sentence or document level when evaluating token-level accuracy? This apparently would greatly diminish statistical power, but maybe it would be a more accurate reflection of the variability of accuracy rates when you get new data (and thus it should be taken into account when doing accuracy mean estimation, which is implicitly what everyone thinks is important to calculate for NLP evaluation).

    Beyond the question of how to correctly assess the standard error for the accuracy mean, I wonder if directly analyzing variance might be a useful direction for NLP evaluation. In finance, they evaluate algorithms by explicitly looking at *both* the mean and the variance of returns, since a small improvement to the mean that comes with a big variance isn’t that useful (you might go broke before you get rich). Cross-document or cross-domain variance reduction seems useful, to me at least, for NLP. An algorithm that gives on-average 87% accuracy, but on some documents is only 20% accurate, is less useful to me than one that averages 85% but is always at least 80% accurate on all documents. The former could destroy the accuracy of a question-answering system, for example.

    • Bob Carpenter Says:

      The quick answer is “yes,” we want to resample at the document level to account for the intra-document correlation of tokens for exactly the reasons you say — diminished power is the right estimate of what’ll happen on a new document or collection of documents.

      Absolutely I want to analyze the variance, again for exactly the reasons you say — it’s what you need to predict variability of the algorithm’s behavior in the future. And that matters in an application!

      The other major reason to analyze variance in finance is that if you’re a market maker, you need to set a spread.

      The reason funds report beta (sensitivity to the S&P 500 or some other market representative, which is the correlation scaled by the ratio of volatilities) is that you need it for first-order adjustments to the effects of diversification (i.e., very rudimentary portfolio theory). Of course, what you really need is the full covariance matrix of your assets, because if you buy two instruments with very little correlation to the S&P 500 but very high correlation to each other, that’s not as good for diversification. Covariance is difficult to estimate, though, for the usual reason that variance is hard to estimate, and now you have a quadratic number of (co)variances to estimate.
