- Raykar, Vikas C., Shipeng Yu, Linda H. Zhao, Anna Jerebko, Charles Florin, Gerardo Hermosillo Valadez, Luca Bogoni, and Linda Moy. Supervised Learning from Multiple Experts: Whom to trust when everyone lies a bit. In ICML.
The scientific zeitgest says to assess inter-annotator agreement, infer gold standards, and use them to train classifiers.
Raykar et al. use EM for a binomial model of annotator sensitivity and specificity (like Dawid and Skene’s original multinomial approach from the 1970s paper and the Snow et al. EMNLP paper). My experiments showed full Bayesian models slightly outperform EM, which slightly outperforms naive voiting (the effects are stronger with fewer annotators).
The obvious thing to do is to take the output of the gold standard inference and use that to train a classifier. With EM, you can use the MAP estimate of category likelihoods (a fuzzy gold standard); with Bayesian models, you can sample from the posterior, which provides more dispersion. Smyth et al.’s 1995 NIPS paper showed EM-style training was effective for simulations.
I was just in San Francisco presenting this work to the Mechanical Turk Meetup, and Jenny Finkel opined that fuzzy training wouldn’t work well in practice. Even taking the discussion offline, I’m still not sure why she thinks that [update: see her comments below]. In some ways, if we use the fuzzy truth as the gold standard, then using it to train should perform better than quantizing the gold standard to 0/1. There’s not a problem with convexity; we just impute a big data set with Gibbs sampling and train on that. We could even train up an SVM or naive Bayes system that way.
The interesting twist in the Raykar et al. paper is to jointly estimate a logistic regression classifier along with the gold standard. That is, throw the regression coefficients into the model and estimate them along with everything else. That’s the same linkage as I suggested above. But Raykar et al. go further — they let the trained model vote on the gold standard just like another annotator.
Even though the annotation model corrects for individual annotator bias (or in this case, the logistic regression classifier’s bias as estimated), each annotator still affects the overall model through its bias-adjusted vote (if it didn’t, you couldn’t get off the ground at all). If you evaluate the classifier on a “gold standard” which was voted upon by a committee including the classifier itself, the classifier should perform better because it’s getting a vote on the truth!
The right question is whether Raykar et al.’s jointly estimated classifiers are “better” in some sense than ones trained on the imputed gold standard. For that, I’d think we’d need some kind of held-out eval, but that begs the question on inferring the gold standard. The gold standards behind Snow et al.’s work weren’t that pure after all (I have some commentary on discrepancies in the paper cited below).
I have considered using the trained classifier as another annotator when doing active learning of the kind proposed in Sheng et al.’s 2008 KDD paper on getting another label for an existing item vs. annotating a new item. In fact, there’s no reason in principle why you can’t have more than one classifier being trained along with annotator sensitivities and specificities.
Another nice idea in the Raykar et al. paper is the use of simulation from a known gold standard to create a fuzzy gold standard. That’s still questionable, in that it’s generating fake data that are known to follow the model. But everyone should do this in every way possible for all parts of their models, so you can bet I’ll be saving this one for my bag of tricks.
I’m a little unclear on why the numbers in the lefthand plots in figures 1 and 2 don’t have the same AUC value for the proposed algorithm. Figure 2 actually does evaluate the gold-standard estimation followed by classifier estimation. If I’m reading that figure right, then training on the imputed gold standard didn’t do measurably better than the majority voted baseline.
[Update with comment: The right hand side plot in figure 2 is of the inferred gold standard versus the "golden gold standard". It's possible to plot this because the inferred gold standard is actually a point probability estimate of the item being in category 1.]
If we’re lucky, Raykar et al. will share their data. [Update 2: no luck -- the data belongs to Siemens.]
P.S. All of these models assume the annotators don’t actually lie. Specifically, in order for the models to be identifiable, we need to assume the annotators are not adversarial (that is, they don’t know the right answer and intentionally lie, and thus perform worse than chance). There was, to reinforce the zeitgeist, also a paper about mixing in adversarial coders at ICML, Dekel and Shamir’s Good learners for evil teachers.