Dolores Labs has been doing some really nifty work on collecting and aggregating data using Amazon’s Mechanical Turk. I’ve been working on inferring gold standards and annotator accuracies. When they posted their forthcoming EMNLP paper, I immediately began to lust after the data, which the authors kindly made available:

- Dolores Labs’ Blog Post. AMT is fast, cheap, and good for machine learning data
- Rion Snow, Brendan O’Connor, Dan Jurafsky and Andrew Ng. 2008. Cheap and Fast — But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. EMNLP.
- Annotation Data from the paper
- Recognizing Textual Entailment (RTE) Task as seen by Mechanical Turk workers

I homed in on the textual entailment data, as it looked easiest to deal with in a binary framework (other instances were multinomial or scalar). So I wrote a little Java program (I probably could’ve done this in R if I were fluent) to munge their data so that I could fit it with my gold-standard inference models. I then fixed the beta priors for sensitivity and specificity to be uniform (alpha=beta=1). BUGS is really nifty for this: you just move the relevant variables from the list of parameters to the list of data, give them values, then use the exact same BUGS model. Here’s the result:

The axes represent sensitivity and specificity, which may range from 0.0 to 1.0. Each circle represents one of the 164 annotators, with an area proportional to the number of examples annotated (which ranged from the minimum allowed of 20 to all 800 items). The center of the circle is placed at the maximum likelihood estimate of sensitivity and specificity given the “gold standard” data provided in the distribution. The grey lines represent estimate residuals, starting at the center of the circles and terminating at the model’s estimate of sensitivity and specificity. Keep in mind that with only 20 examples, posterior intervals are around 10%, and many of the max likelihood estimates are 1.0 (all those on the top or right-hand boundaries).
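For readers who want the circle centers spelled out, here is a minimal sketch of the maximum likelihood estimates of sensitivity and specificity against a gold standard. The function name and the toy labels are made up for illustration; this is not the code used to produce the plot.

```python
def sens_spec_mle(labels, gold):
    """Maximum likelihood sensitivity and specificity for one annotator.

    labels, gold: equal-length sequences of 0/1 for the items this
    annotator labeled. A value is None when it is undefined (no gold
    positives or no gold negatives among the annotated items).
    """
    true_pos = sum(1 for y, g in zip(labels, gold) if g == 1 and y == 1)
    true_neg = sum(1 for y, g in zip(labels, gold) if g == 0 and y == 0)
    n_pos = sum(1 for g in gold if g == 1)
    n_neg = len(gold) - n_pos
    sens = true_pos / n_pos if n_pos else None
    spec = true_neg / n_neg if n_neg else None
    return sens, spec


# Toy data (hypothetical): 4 gold positives, then 4 gold negatives.
gold = [1, 1, 1, 1, 0, 0, 0, 0]
labels = [1, 1, 1, 0, 0, 0, 1, 1]
print(sens_spec_mle(labels, gold))  # (0.75, 0.5)
```

With only 20 items split between the two categories, it takes just ten or so correct answers in a row to land an estimate on the 1.0 boundary, which is why so many circles sit there.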

The green diagonal running from (0,1) to (1,0) represents the expected performance of an annotator flipping a biased coin. Look at all those annotators near that line. Four of them annotated a lot of data, and it’s all noise.
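To see why coin flippers land on that diagonal: an annotator who answers “yes” with probability p regardless of the item has sensitivity p and specificity 1 - p, so the two always sum to 1. A quick simulation sketch (the function name, seed, and item counts are arbitrary choices of mine, not anything from the data):

```python
import random


def simulate_coin_annotator(p_yes, gold, seed=0):
    """Label every item 1 with probability p_yes, ignoring the truth."""
    rng = random.Random(seed)
    labels = [1 if rng.random() < p_yes else 0 for _ in gold]
    on_pos = [y for y, g in zip(labels, gold) if g == 1]
    on_neg = [y for y, g in zip(labels, gold) if g == 0]
    sens = sum(on_pos) / len(on_pos)        # fraction of positives labeled 1
    spec = 1 - sum(on_neg) / len(on_neg)    # fraction of negatives labeled 0
    return sens, spec


# Hypothetical balanced gold standard, large enough to average out noise.
gold = [1, 0] * 50000
for p in (0.1, 0.5, 0.9):
    sens, spec = simulate_coin_annotator(p, gold)
    print(p, round(sens, 2), round(spec, 2), round(sens + spec, 2))
```

Varying p slides the annotator along the line from (0,1) to (1,0); a fair coin sits in the middle at (0.5, 0.5).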

I’m sold on hierarchical models, so let’s compare the model with hierarchical beta priors estimated along with the sensitivities and specificities (this is the model from my paper):

I’ve added blue lines at the means of the estimated beta priors, which had 95% intervals for specificity and sensitivity of (.80, .87) and (.82, .87), with intervals for the scale of (1.9, 3.8) and (7.0, 19.9) respectively. You’ll see that the hierarchical priors are pulling the estimates in their direction, which is not surprising given the counts of 20 for most of the annotators. As expected from the prior scales, sensitivity exerts a strong influence. Of course, we know we shouldn’t trust this too much given the sensitivity biases of the random annotators.
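The pull toward the prior can be sketched with the conjugate beta-binomial update. I’m assuming here that “scale” means alpha + beta for a Beta(alpha, beta) prior, and that the prior is held fixed and the true labels are known; the hierarchical model infers both, so this is only a rough illustration. The prior values below are picked from inside the intervals above.

```python
def shrunk_estimate(successes, trials, prior_mean, prior_scale):
    """Posterior mean of an accuracy under a Beta prior.

    The prior is parameterized by its mean and its scale, taken here to be
    alpha + beta (an assumption about what "scale" means).
    """
    alpha = prior_mean * prior_scale
    beta = (1.0 - prior_mean) * prior_scale
    return (successes + alpha) / (trials + alpha + beta)


# An annotator who got all 20 of 20 items right has a raw estimate of 1.0.
# A larger prior scale pulls the estimate harder toward the prior mean.
print(shrunk_estimate(20, 20, prior_mean=0.85, prior_scale=12.0))
print(shrunk_estimate(20, 20, prior_mean=0.85, prior_scale=3.0))
```

The prior scale acts like a count of pseudo-observations, so a scale of 12 against only 20 real observations moves the estimate a lot, while a scale of 3 moves it much less.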

So what about Panos Ipeirotis’s question about correlation between sensitivity and specificity? I don’t see a strong positive or negative correlation, at least in this case. Positive correlation would indicate good annotators are just good, and negative would indicate bias (trading specificity for sensitivity). In general, it might be interesting to get annotators to use a scale of how much a sentence is entailed; then if they had general biases, you could fit a threshold to quantize their results into the most consistent form.

Here are the more traditional Bayesian posterior plots with 80% intervals around the gold standards and the prior mean indicated as a horizontal line:

Finally, here’s a look at the category residuals, which I measure as the absolute difference between the true label and the sample mean:

Sorry for the dense plot; I haven’t figured out how to put histogram verticals on a log scale yet. 700 of the values are within epsilon of no residual error. 57 of the 800 examples have a residual error of 0.5 or more, meaning that the model would assign the wrong category. 23 of these had a residual error of 0.975 or more, which means the model was way off; 697 were within 0.025, meaning the model was spot on. It’d sure be nice to see just how “correct” the experts were. A gold standard with 95% accuracy would have 40 errors over 800 items.
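Concretely, the residual for an item can be computed as in the sketch below, where `sampled_labels` stands in for the 0/1 category draws for that item across posterior samples (the names and toy draws are hypothetical):

```python
def category_residual(gold_label, sampled_labels):
    """Absolute difference between the gold label and the sample mean."""
    sample_mean = sum(sampled_labels) / len(sampled_labels)
    return abs(gold_label - sample_mean)


def model_disagrees(residual):
    """Residual of 0.5 or more: the model favors the other category.

    Exactly 0.5 is a tie; it is counted as an error here to match the
    "0.5 or more" threshold used in the text.
    """
    return residual >= 0.5


# Toy draws: the model puts 75% of its samples on category 1.
draws = [1, 1, 1, 0]
print(category_residual(1, draws), model_disagrees(category_residual(1, draws)))
print(category_residual(0, draws), model_disagrees(category_residual(0, draws)))
```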

Next up, I’d like to see what’ll happen when I remove the clear junk annotations. Playing with R and BUGS is addictive!

**Update:** 16 September 2008. Taking a majority vote of annotators leads to 50 errors and 65 deadlocked cases. If we flip a coin for the deadlocks, we get an expected number of errors of 50+65/2=82.5, for an accuracy of 89.7%, as reported by Snow et al. Taking maximum a posteriori estimates from our models produces 59 errors/92.6% accuracy (hierarchically estimated prior) and 57 errors/92.9% accuracy (uniform prior). Snow et al. report roughly the same gain (they don’t report the exact number) by weighting annotators’ votes according to their accuracies as estimated with Laplace smoothing (aka add 1 smoothing, which is equivalent to a Beta(2,2) prior under a MAP estimate) against the gold standard. Although Snow et al. mention (Dawid and Skene 1979), they don’t actually apply that model’s EM estimator to infer sensitivities and specificities without a gold standard. Our results show that you don’t need to have the gold standard to get a reliable estimate of annotator accuracies that will improve collective voting.
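The arithmetic in the update can be checked with a few lines. This sketch also shows why add-1 smoothing matches the MAP estimate under a Beta(2,2) prior: the mode of the Beta(correct + 2, wrong + 2) posterior is (correct + 1)/(total + 2). Function names are mine, and the 18-of-20 example is made up.

```python
def majority_vote(labels):
    """Return the majority label (0 or 1), or None for a deadlock."""
    ones = sum(labels)
    zeros = len(labels) - ones
    if ones == zeros:
        return None
    return 1 if ones > zeros else 0


def expected_accuracy(errors, deadlocks, n_items):
    """Expected errors and accuracy when deadlocks are broken by a fair coin."""
    expected_errors = errors + deadlocks / 2
    return expected_errors, 1 - expected_errors / n_items


def laplace_estimate(correct, total):
    """Add-1 (Laplace) smoothed accuracy."""
    return (correct + 1) / (total + 2)


def beta_map_estimate(correct, total, a, b):
    """Mode of the Beta(correct + a, total - correct + b) posterior."""
    return (correct + a - 1) / (total + a + b - 2)


errs, acc = expected_accuracy(50, 65, 800)
print(errs, round(acc, 3))                            # 82.5 0.897
print(round(1 - 59 / 800, 3), round(1 - 57 / 800, 3))  # 0.926 0.929
print(laplace_estimate(18, 20) == beta_map_estimate(18, 20, 2, 2))  # True
```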

Of course, with the BUGS implementation of the full Bayesian model, you can add as much information as you have. If you know all or some gold standard labels, these can be fixed as data, which will then be used to help estimate the unknown parameters. Effectively, it just gets added to the Gibbs sampler as something that’s always sampled the same way.

“Expert” inter-annotator agreement on this ranged from 91-96%.

**Postscript:** If anyone’s read Ted Pedersen’s “last words”, “Empiricism is Not a Matter of Faith”, in the latest issue of *Computational Linguistics*, this is how science is supposed to work.

September 18, 2008 at 12:09 pm |

[…] Pedersen for finally saying out loud (in the latest issue of Computational Linguistics, thanks to Bob Carpenter for the pointer) what I’ve long thought about academic publications on topics like […]

September 19, 2008 at 3:52 am |

This is great. I’m struck by how small the worker model residuals are.

I have the exact numbers on my end to round out your comparisons of the techniques. I put it back on the other post.

On removing junk annotations: the model is figuring out that junk annotators are actually junk, so they should already be exerting very weak influence on the posterior of the labels, and therefore a weak influence on the sens/spec estimates of other workers. So if I had to bet, I’d say taking out junk annotators won’t change the model’s inferences by very much? Though junk annotators do matter a lot; in my other comment on the other post I found that throwing out the junk makes naive voting perform as well as anything we’ve seen so far…

Hooray for open data/software/etc. This all sounds like a good kind of science to me.

September 19, 2008 at 11:30 am |

Yes, that’s exactly what happens. I just ran the data removing any annotator with an estimated sensitivity or specificity less than 50%. It makes almost no difference for the model-based approach, but makes a huge difference for majority voting, which winds up halfway between the model-based results and the unfiltered voting results once you remove the outliers. I was surprised the results were this robust. I’ll have to add another blog entry. I’m on the road now, so I might not get this stuff posted for a few days.

There’s a minor identifiability problem with the model when there are bad annotators: you can get the same results with 20% accurate annotators being wrong a lot as with 80% accurate annotators being right a lot. So when you open up the possibility of less than 50% sensitivity or specificity, you have to run until you don’t get these degenerate solutions (that is, until the chains mix the way the Bayesians like to see them during Gibbs sampling).
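A small sketch of that degeneracy, assuming a two-category model with a prevalence parameter and per-annotator sensitivity and specificity, in the spirit of Dawid and Skene (not necessarily the exact BUGS model from the post). Swapping the category names, so that prevalence becomes 1 - prevalence, each sensitivity becomes 1 - specificity, and each specificity becomes 1 - sensitivity, leaves the likelihood of the annotations unchanged. All names and numbers below are made up.

```python
import math


def log_likelihood(items, prevalence, sens, spec):
    """Log likelihood of annotations with the true categories summed out.

    items: list of items, each a list of (annotator_index, label) pairs.
    sens, spec: per-annotator sensitivity and specificity lists.
    """
    total = 0.0
    for item in items:
        p_if_pos = prevalence        # joint prob. of labels and true category 1
        p_if_neg = 1.0 - prevalence  # joint prob. of labels and true category 0
        for j, y in item:
            p_if_pos *= sens[j] if y == 1 else 1.0 - sens[j]
            p_if_neg *= 1.0 - spec[j] if y == 1 else spec[j]
        total += math.log(p_if_pos + p_if_neg)
    return total


# Toy annotations from three hypothetical annotators.
items = [[(0, 1), (1, 1), (2, 0)], [(0, 0), (1, 0), (2, 0)], [(0, 1), (2, 1)]]
prevalence, sens, spec = 0.6, [0.8, 0.9, 0.7], [0.85, 0.75, 0.6]

# The "flipped" solution: swap the roles of the two categories.
flipped_prevalence = 1.0 - prevalence
flipped_sens = [1.0 - s for s in spec]
flipped_spec = [1.0 - s for s in sens]

a = log_likelihood(items, prevalence, sens, spec)
b = log_likelihood(items, flipped_prevalence, flipped_sens, flipped_spec)
print(math.isclose(a, b))  # True
```

Since the data can’t tell the two solutions apart, only the priors (or constraining accuracies to be above 50%) can break the tie.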

October 2, 2012 at 5:40 pm |

[…] is not necessary. The whole point of building annotation models a la Dawid and Skene (as applied by Snow et al. in their EMNLP paper on gather NLP data with Mechanical Turk) is that you can create a high-reliability corpus without […]