Finally something published after all these years working on the problem:
Rebecca J. Passonneau and Bob Carpenter. 2014. The Benefits of a Model of Annotation. Transactions of the Association for Computational Linguistics (TACL) 2(Oct):311–326. [pdf] (Presented at EMNLP 2014.)
Becky just presented it at EMNLP this week. I love the TACL concept and lobbied Michael Collins for short reviewer deadlines on two grounds: (a) I saw that the bioinformatics journals could do it, and (b) I either do reviews right away or wait until I’m nagged and then do them right away; they never take even three weeks of real work. The TACL editors and reviewers really were quick, getting responses to us on both the initial paper and the revisions in under a month.
What You Get
There’s nothing new in the paper technically that I haven’t discussed before on this blog. In fact, the model’s a step back from what I consider best practices, mainly to avoid having to justify Bayesian hierarchical models to reviewers and hence avoid reviews like the last one I got. I’d highly recommend using a proper hierarchical model (that’s “proper” in the “proper English breakfast” sense) and using full Bayesian inference.
One of the main points of the paper is that high kappa values are not necessary for high-quality corpora and, even more surprisingly, are not even sufficient.
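To make the "not necessary" half concrete, here's a back-of-the-envelope sketch in Python. It is an illustration under assumptions that are not taken from the paper: a binary task with balanced categories, annotators who are independent given the true category, and a common accuracy of 0.7. The function names are made up for the example.

```python
from math import comb


def expected_pairwise_kappa(accuracy: float) -> float:
    """Expected Cohen's kappa between two annotators of equal accuracy.

    Assumes a binary task, balanced categories, symmetric errors, and
    annotators that are independent given the true category.
    """
    # Two annotators agree when both are right or both are wrong.
    observed = accuracy ** 2 + (1 - accuracy) ** 2
    # With balanced categories and symmetric errors, each annotator's
    # marginal label distribution is 50/50, so chance agreement is 0.5.
    chance = 0.5
    return (observed - chance) / (1 - chance)


def majority_vote_accuracy(accuracy: float, n_annotators: int) -> float:
    """Probability that a strict majority of annotators is correct (odd n)."""
    needed = n_annotators // 2 + 1
    return sum(
        comb(n_annotators, k) * accuracy ** k * (1 - accuracy) ** (n_annotators - k)
        for k in range(needed, n_annotators + 1)
    )


kappa = expected_pairwise_kappa(0.7)             # about 0.16
vote_accuracy = majority_vote_accuracy(0.7, 25)  # above 0.95
```

A kappa of 0.16 would usually be read as poor agreement, yet under these assumptions 25 such annotators recover the true category for the large majority of items even with plain voting. One way the sufficiency half can fail is the reverse situation: annotators who share the same bias can agree with each other and still be wrong.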
Another focus is on why the Dawid and Skene type models that estimate a full categorical response matrix per annotator are preferable to anything that relies on weighted voting (hint: voting can’t adjust for bias).
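A toy example of the bias point, sketched in Python. The confusion matrices, prevalence, and labels below are invented for illustration, and the parameters are treated as known rather than estimated; the function name is hypothetical.

```python
from collections import Counter


def category_posterior(labels, confusion, prevalence):
    """Posterior over true categories for one item via Bayes's rule.

    labels:     list of (annotator, label) pairs for the item
    confusion:  confusion[j][k][l] = P(annotator j says l | true category k)
    prevalence: prevalence[k] = P(true category is k)
    """
    unnormalized = []
    for k, prior in enumerate(prevalence):
        likelihood = prior
        for annotator, label in labels:
            likelihood *= confusion[annotator][k][label]
        unnormalized.append(likelihood)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical annotators, all biased toward label 1: they say 1 for 95% of
# true-1 items but also for 40% of true-0 items.
biased = [[0.6, 0.4], [0.05, 0.95]]
confusion = [biased, biased, biased]
labels = [(0, 1), (1, 1), (2, 0)]  # two votes for 1, one vote for 0

majority = Counter(label for _, label in labels).most_common(1)[0][0]
posterior = category_posterior(labels, confusion, prevalence=[0.5, 0.5])
```

Voting picks category 1, while the posterior puts about 0.68 on category 0. Because these annotators rarely say 0 when the truth is 1, a single 0 vote is strong evidence, and no fixed per-annotator vote weight can capture that asymmetry.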
I also made a pitch for using expected information gain (aka mutual information) to evaluate the quality of annotators.
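One way to read this is as the mutual information between an item's true category and an annotator's label, computed from the annotator's confusion matrix and the category prevalences. The Python sketch below takes that reading; the paper's exact formulation may differ, and the function name and example matrices are assumptions made for illustration.

```python
from math import log2


def mutual_information_bits(confusion, prevalence):
    """Mutual information (in bits) between true category and annotator label.

    confusion[k][l] = P(annotator says l | true category k)
    prevalence[k]   = P(true category is k)
    """
    n_labels = len(confusion[0])
    # Marginal probability of each label, averaging over true categories.
    marginal = [
        sum(prevalence[k] * confusion[k][l] for k in range(len(prevalence)))
        for l in range(n_labels)
    ]
    mi = 0.0
    for k, pi_k in enumerate(prevalence):
        for l in range(n_labels):
            joint = pi_k * confusion[k][l]
            if joint > 0:  # zero-probability cells contribute nothing
                mi += joint * log2(joint / (pi_k * marginal[l]))
    return mi


balanced = [0.5, 0.5]
perfect = mutual_information_bits([[1.0, 0.0], [0.0, 1.0]], balanced)  # 1 bit
coin_flip = mutual_information_bits([[0.5, 0.5], [0.5, 0.5]], balanced)  # 0 bits
always_wrong = mutual_information_bits([[0.0, 1.0], [1.0, 0.0]], balanced)  # 1 bit
```

Note that the annotator who always gives the wrong label scores as well as the perfect one. Their labels are fully informative once the confusion matrix is known, which is exactly the kind of annotator that accuracy or voting would write off.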
The paper’s very focused on the Mechanical Turk data we collected for word sense. That data’s freely available as part of the Manually Annotated Sub-Corpus (MASC) of the American National Corpus (ANC). It covers 1000 instances for each of roughly 50 words (nouns, adjectives, and verbs), with roughly 25 annotations per instance, for a total of around 1M labels. You can get it in easily digestible form from the repo linked below.
I also wrote a simple EM-based optimizer in R to compute penalized maximum likelihood (equivalently, maximum a posteriori [MAP]) estimates. It’s not super fast, but it can deal with 25K labels over 200 annotators and 1000 items in a minute or so. The code is available from a GitHub repo: bob-carpenter/anno.
It includes a CSV-formatted version of all the data we collected in case you want to do your own analysis.
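For readers who want the shape of the EM algorithm without opening the repo, here is a minimal Python sketch for a Dawid and Skene style model. It is not the R code from the repo. The penalty is plain additive smoothing, which corresponds to MAP estimation under a symmetric Dirichlet(1 + smoothing) prior; that choice, the vote-proportion initialization, the fixed iteration count, and every name and default are assumptions made for illustration.

```python
from math import exp, log


def dawid_skene_map(labels, n_items, n_annotators, n_categories,
                    smoothing=0.1, n_iters=50):
    """EM for a Dawid-Skene style model with additive smoothing.

    labels: list of (item, annotator, label) triples, all zero-indexed.
    Requires n_iters >= 1.
    Returns (posteriors, prevalence, confusion), where
    posteriors[i][k] = P(item i has category k | labels) and
    confusion[j][k][l] = P(annotator j says l | true category k).
    """
    K = n_categories
    # Initialize each item's posterior from its vote proportions.
    post = [[0.0] * K for _ in range(n_items)]
    for i, _, l in labels:
        post[i][l] += 1.0
    for i in range(n_items):
        total = sum(post[i])
        post[i] = [c / total for c in post[i]] if total > 0 else [1.0 / K] * K

    for _ in range(n_iters):
        # M-step: smoothed expected counts give prevalence and confusions.
        prev_counts = [smoothing] * K
        conf_counts = [[[smoothing] * K for _ in range(K)]
                       for _ in range(n_annotators)]
        for i in range(n_items):
            for k in range(K):
                prev_counts[k] += post[i][k]
        for i, j, l in labels:
            for k in range(K):
                conf_counts[j][k][l] += post[i][k]
        prev_total = sum(prev_counts)
        prevalence = [c / prev_total for c in prev_counts]
        confusion = [[[c / sum(row) for c in row] for row in annotator]
                     for annotator in conf_counts]

        # E-step: recompute item posteriors on the log scale for stability.
        log_post = [[log(p) for p in prevalence] for _ in range(n_items)]
        for i, j, l in labels:
            for k in range(K):
                log_post[i][k] += log(confusion[j][k][l])
        for i in range(n_items):
            peak = max(log_post[i])
            weights = [exp(v - peak) for v in log_post[i]]
            norm = sum(weights)
            post[i] = [w / norm for w in weights]

    return post, prevalence, confusion
```

A call looks like `dawid_skene_map(triples, n_items=1000, n_annotators=200, n_categories=K)`, with `triples` built from whatever columns of the CSV hold the item, annotator, and label identifiers. A real implementation would also monitor the penalized log likelihood for convergence rather than running a fixed number of iterations.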
We collected so many labels per item so that we could try to get a handle on item difficulty. We just haven’t had time to do the analysis. So if you’re interested or want to collaborate, let me know (firstname.lastname@example.org).