After ACL, Becky Passonneau said I should read this:
- Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen, Piek Vossen, Nuno Freire. 2013. Offspring from Reproduction Problems: What Replication Failure Teaches Us. ACL Proceedings.
You should, too. Or if you’d rather watch, there’s a video of their ACL presentation.
The gist of the paper is that the authors tried to reproduce some results found in the literature for word sense identification and named-entity detection, and had a rather rough go of it. In particular, they found that every little decision they made impacted the evaluation error they got, including how to evaluate the error. As the narrative of their paper unfolded, all I could do was nod along. I loved that their paper was in narrative form — it made it very easy to follow.
The authors stress that we need more information for reproducibility and that reproducing known results should get more respect. No quibble there. A secondary motivation (after the pedagogical one) of the LingPipe tutorials was to show that we could reproduce existing results in the literature.
The Minor Fall
Although they’re riding one of my favorite hobby horses, they don’t quite lead it in the direction I would have. In fact, they stray into serious discussion of the point estimates of performance that their experiments yielded, despite the variation they see under cross-validation folds. So while they note that rankings of approaches changes based on what seem like minor details, they stop short of calling into question the whole idea of trying to assign a simple ranking. Instead, they stress getting to the bottom of why the rankings differ.
A Slightly Different Take on the Issue
What I would’ve liked to have seen in this paper is more emphasis on two key statistical concepts:
- the underlying sample variation problem, and
- the overfitting problem with evaluations.
The authors talk about sample variation a bit when they consider different folds, etc., in cross-validation. For reasons I don’t understand they call it “experimental setup.” But they don’t put it in terms of estimation variance. That is, independently of getting the features to all line up, the performance and hence rankings of various approaches vary to a large extent because of sample variation. The easiest way to see this is to run cross-validation.
I would have stressed that the bootstrap method should be used to get at sample variation. The bootstrap differs from cross-validation in that it samples training and test subsets with replacement. The bootstrap effectively evaluates within-sample sampling variation, even in the face of non-i.i.d. samples (usually there is correlation among the items in a corpus due to choosing multiple sentences from a single document or even multiple words from the same sentence). The picture is even worse in that the data set at hand is rarely truly representative of the intended out-of-sample application, which is typically to new text. The authors touch on these issues with discussions of “robustness.”
The authors don’t really address the overfitting problem, though it is tangential to what they call the “versioning” problem (when mismatched versions of corpora such as WordNet produce different results).
Overfitting is a huge problem in the current approach to evaluation in NLP and machine learning research. I don’t know why anyone cares about the umpteenth evaluation on the same fold of the Penn Treebank that produces a fraction of a percentage gain. When the variation across folds is higher than your estimated gain on a single fold, it’s time to pack the papers in the file drawer, not in the proceedings of ACL. At least until we also get a journal of negative results to go with our journal of reproducibility.
The Major Lift
Perhaps hill-climbing error on existing data sets is not the best way forward methdologically. Maybe, just maybe, we can make the data better. It’s certainly what Breck and I tell customers who want to build an application.
If making the data better sounds appealing, check out
- my favorite NLP paper in recent memory, which is about improving data quality, and
- one of my all-time favorite NLP papers, which is about improving data quantity.
I find the two ideas so compelling that it’s the only area of NLP I’m actively researching. More on that later (or sooner if you want to read ahead).
(My apologies to Leonard Cohen for stealing his lyrics.)