As I mentioned a couple posts ago, the amount of noise in Dolores Labs’ results from their Amazon Mechanical Turk experiments was pretty amazing. Here’s the graph from that post:
I wanted to see how much this noise was affecting my models’ estimates, so I removed every annotator whose sensitivity or specificity the original model estimated below 50%. A better approach might be to remove annotators whose performance was near the biased-coin-flip line (the green line in the graph). Here’s what it looks like without the spamnotations:
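The filtering step can be sketched in a few lines. This is a minimal illustration, not the actual pipeline: the function name, the data layout (item, annotator, label triples), and the toy estimates are all made up for the example.

```python
# Hypothetical filtering step: drop all annotations from annotators whose
# model-estimated sensitivity or specificity falls below 0.5.
# `estimates` maps annotator id -> (sensitivity, specificity); the ids and
# numbers below are invented for illustration.

def filter_spammers(annotations, estimates, threshold=0.5):
    """Keep only annotations by annotators with sens and spec >= threshold."""
    keep = {a for a, (sens, spec) in estimates.items()
            if sens >= threshold and spec >= threshold}
    return [(item, annotator, label)
            for (item, annotator, label) in annotations
            if annotator in keep]

estimates = {"a1": (0.85, 0.90), "a2": (0.45, 0.95), "a3": (0.30, 0.40)}
annotations = [(1, "a1", 1), (1, "a2", 0), (1, "a3", 1), (2, "a1", 0)]
clean = filter_spammers(annotations, estimates)  # only a1's annotations survive
```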
This eliminated 3120/8000 annotations, or about 39% of the data. The graph above shows the re-estimated sensitivities, specificities, and prior means. The priors have better means with the noisy annotators removed. Note how strongly the prior pulls annotators with low sensitivity (relative to the gold standard) and very few annotations toward the prior mean. Without the hierarchical beta prior on sensitivity and specificity, more annotators would have been eliminated. For instance, the MAP estimate of the sensitivity prior is Beta(9.8, 1.8), which effectively adds 9.8 correct annotations and 1.8 errors to each annotator’s empirical counts when computing the posterior mean. For annotators with only 20 annotations, only half of which were positive, the prior dwarfs the actual annotation performance. That’s why the prior mean, indicated by where the blue lines cross, draws the estimates toward it like a magnet.
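The pseudo-count arithmetic is easy to check directly. A sketch, using the fitted Beta(9.8, 1.8) prior from above but a made-up annotator who got 7 of 10 positive items right:

```python
# Posterior-mean arithmetic for the Beta(9.8, 1.8) sensitivity prior:
# the prior acts like 9.8 extra correct annotations and 1.8 extra errors.
# The counts (7 correct of 10 positive items) are hypothetical.

alpha, beta = 9.8, 1.8    # fitted prior, as reported in the post
correct, total = 7, 10    # made-up annotator counts

prior_mean = alpha / (alpha + beta)                          # ~0.845
empirical = correct / total                                  # 0.70
posterior_mean = (alpha + correct) / (alpha + beta + total)  # ~0.778
```

With only 10 observations against 11.6 pseudo-counts, the estimate lands much closer to the prior mean than to the raw empirical rate, which is exactly the magnet effect described above.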
Now the fun part. With the noisy annotators eliminated, majority vote is 92.5% accurate (60/800 errors), versus 92.7% accurate (58/800 errors) for the model-based estimate. So, as Brendan O’Connor speculated in a previous blog comment, once you eliminate the noisy annotators, majority voting with coin-tossing on ties works just fine.
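The majority-vote baseline with a coin toss on ties is a one-liner plus a tiebreak. A minimal sketch (function name and seeding are my own choices, not from the original experiment):

```python
# Majority vote over per-item labels, breaking ties with a (seeded) coin flip.
import random
from collections import Counter

def majority_vote(labels, rng=None):
    """Return the most common label; break ties with a coin flip."""
    rng = rng or random.Random(0)
    counts = Counter(labels).most_common()
    top = [lab for lab, c in counts if c == counts[0][1]]
    return top[0] if len(top) == 1 else rng.choice(top)

majority_vote([1, 1, 0])  # -> 1
majority_vote([1, 0])     # tie: resolved by the coin flip
```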
Keep in mind that there’s a chicken-and-egg estimation problem here. Without gold-standard data, you don’t know the annotators’ accuracies, so you can’t eliminate the noisy ones so easily. The noisy annotators were so bad in this experiment that I’m guessing they could be eliminated by their low average inter-annotator agreement. Because many of the prolific noise generators had a specificity bias, they would actually show decent raw inter-annotator agreement (68% agreement by chance alone). This is exactly the kind of noise that kappa statistics should be able to detect, because the annotators’ chance-adjusted agreement won’t be above random given their base label distributions.
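The chance-agreement correction is worth seeing in numbers. Below is a sketch using Cohen’s kappa for binary labels; the 20%/80% positive/negative split is a hypothetical stand-in for a specificity-biased annotator, chosen so the chance agreement comes out to the 68% figure mentioned above:

```python
# Chance-agreement correction behind kappa statistics, for binary labels.
# The 0.2 positive rate is a made-up example of a specificity-biased annotator.

def cohens_kappa(p_observed, p1_pos, p2_pos):
    """Cohen's kappa from raw agreement and each annotator's positive rate."""
    p_chance = p1_pos * p2_pos + (1 - p1_pos) * (1 - p2_pos)
    return (p_observed - p_chance) / (1 - p_chance)

# Two annotators who each answer "positive" 20% of the time agree by chance
# 0.2*0.2 + 0.8*0.8 = 68% of the time.
p_chance = 0.2 * 0.2 + 0.8 * 0.8   # 0.68

# Raw agreement at exactly the chance level yields kappa = 0:
# high raw agreement can still mean the annotators are no better than random.
kappa = cohens_kappa(0.68, 0.2, 0.2)  # -> 0.0
```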
The model fits the training data better without the priors; it eliminates a couple more errors, for instance (still well within the noise of the annotations themselves). The point of employing the hierarchical priors isn’t to measure the annotators’ performance on the training data, but to improve estimates on future (i.e., held-out) data. This is the same effect you get when estimating baseball batting averages (as in Jim Albert‘s highly readable Bayesian stats books). You wouldn’t estimate someone’s batting average in future at-bats at .700 if they went 7 for 10 in their first ten at-bats, would you? If you’re a diehard non-Bayesian, you’d simply abstain from predicting anything based on 10 at-bats. But if you’re a Bayesian, you try to squeeze out all you can from the data. By the same reasoning, you wouldn’t want to estimate an annotator’s sensitivity at 100% after only 10 positive examples.
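The batting-average analogy is the same posterior-mean arithmetic as before. A sketch, where the Beta(78, 222) prior (mean .260) is a made-up stand-in for a league-wide prior, not a number from Albert’s books:

```python
# Shrinking a 7-for-10 batting line toward a hypothetical league-wide prior.
# Beta(78, 222) has mean 78/300 = .260; the pseudo-counts are invented.

alpha, beta = 78.0, 222.0
hits, at_bats = 7, 10

raw = hits / at_bats                                   # .700
posterior = (alpha + hits) / (alpha + beta + at_bats)  # ~.274
```

Ten at-bats barely budge the 300 pseudo-counts, so the estimate stays near .260 rather than jumping to .700; the same shrinkage keeps a 10-for-10 annotator from being credited with 100% sensitivity.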