[Update: 4:51 PM, 5 October 2009 after corrections from Jacob Whitehill (thanks!); they did use a prevalence estimate and did allow mixed panel fits, as the post now reflects.]

Thanks to Panos for pointing out this upcoming NIPS poster, which makes a nice addition to our running data annotation thread:

The authors’ knowledge of the epidemiology literature was limited when they stated:

> To our knowledge BLOG [bilinear log odds generative] is the first model in the literature to simultaneously estimate the true label, item difficulty, and coder expertise in an unsupervised manner.

Just check out the literature survey portion of my technical report for a range of different approaches to this problem, some of which have even been applied to binary image classification data such as Handelman’s X-ray data for dental caries diagnoses (see the tech report for the Handelman data).

### Model Overview

In this case, the authors use a logistic scale (that’s the log-odds part) consisting of the product of an annotator accuracy term and an item (inverse) difficulty term (that’s the “bilinear” part). Although the authors mention item-response and Rasch models (see below), they do not exploit their full expressive power.

In particular, the authors model annotator accuracy, but do not break out sensitivity and specificity separately, and thus do not model annotator bias (a tendency to overlabel cases in one category or another). I and others have found huge degrees of annotator bias for real data (e.g. Handelman’s dentistry data and the Snow et al. NLP data).
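To make the bias point concrete, here is a minimal Python sketch with made-up labels (not the Handelman or Snow et al. data). The function name and the toy annotators are my own illustrative assumptions. It shows two annotators with identical accuracy but opposite biases, which a single accuracy parameter cannot tell apart.

```python
def sensitivity_specificity(true_labels, annotator_labels):
    """Return (sensitivity, specificity, accuracy) for binary labels in {0, 1}."""
    pairs = list(zip(true_labels, annotator_labels))
    true_pos = sum(1 for t, a in pairs if t == 1 and a == 1)
    true_neg = sum(1 for t, a in pairs if t == 0 and a == 0)
    num_pos = sum(1 for t in true_labels if t == 1)
    num_neg = len(true_labels) - num_pos
    sensitivity = true_pos / num_pos   # accuracy on positive items
    specificity = true_neg / num_neg   # accuracy on negative items
    accuracy = (true_pos + true_neg) / len(true_labels)
    return sensitivity, specificity, accuracy


# Hypothetical toy data: four positive items followed by four negative items.
truth = [1, 1, 1, 1, 0, 0, 0, 0]
over_labeler = [1, 1, 1, 1, 1, 1, 0, 0]    # tends to say "positive"
under_labeler = [1, 1, 0, 0, 0, 0, 0, 0]   # tends to say "negative"

print(sensitivity_specificity(truth, over_labeler))    # (1.0, 0.5, 0.75)
print(sensitivity_specificity(truth, under_labeler))   # (0.5, 1.0, 0.75)
```

Both annotators are 75% accurate, but one overlabels and the other underlabels, so their votes should be weighted differently depending on which label they supply.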

The authors’ model also clips difficulty at a random coin flip: the hardest possible item is labeled correctly 50% of the time. In reality, some positive items may be so hard to find as to have less than a 50% chance of being labeled correctly.

They impose unit normal priors over annotator accuracy and normal priors over the log of item difficulty (ensuring item difficulties are non-negative). They fit the models with EM using conjugate gradient to solve the logistic regression in the M-step. Epidemiologists have fitted empirical Bayes priors by using other expert opinion, and I went further and actually fitted the full hierarchical Bayesian model using Gibbs sampling (in BUGS; the code is in the sandbox project).

Point estimates (like the EM-derived maximum a posteriori (MAP) estimates the authors use) always underestimate posterior uncertainty compared to full Bayesian inference. Posterior uncertainty in item difficulty is especially difficult to estimate with only 10 annotators. In fact, we found the Bayesian posteriors for item difficulty to be so diffuse with only 10 annotators that using the full posterior effectively eliminated the item difficulty effect.

### Synthetic Data

They run experiments on synthetic data and show fits. Not surprisingly, given the results I’ve reported elsewhere about fitting item difficulty, they only report fits for difficulty with 50 annotators! (I found reasonable fits for a linear, non-multiplicative model with 20 annotators, though recall that the reviewers for my rejected ACL paper thought even 10 annotators was unrealistic for real data!)

They also simulate very low numbers of noisy annotators compared to the actual numbers found on Mechanical Turk (even with pre-testing, we had 1/10 noisy annotations and without testing, Snow et al. found 1/2 noisy labels). I was surprised they had such a hard time adjusting for the noisy labelers. I think this may be due to trying to model item difficulty. Without item difficulty, as in the Dawid and Skene-type models, there’s no problem filtering out bad annotators.
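As a rough illustration of why noisy annotators are easy to spot when item difficulty is left out, here is a small simulation sketch in Python. It is not a Dawid and Skene fit; it just measures each annotator's agreement with the majority vote, which is a much cruder heuristic. The 0.9 and 0.5 accuracies, the half-noisy panel, and all function names are assumptions chosen for the example.

```python
import random


def simulate_labels(num_items, accuracies, prevalence=0.5, seed=0):
    """Simulate binary labels; annotator j is correct with probability accuracies[j]."""
    rng = random.Random(seed)
    truth = [1 if rng.random() < prevalence else 0 for _ in range(num_items)]
    labels = [
        [t if rng.random() < acc else 1 - t for acc in accuracies]
        for t in truth
    ]
    return truth, labels


def majority_vote(labels):
    """Per-item majority label; ties are broken toward 1."""
    return [1 if 2 * sum(row) >= len(row) else 0 for row in labels]


def agreement_with_vote(labels):
    """Fraction of items on which each annotator matches the majority vote."""
    votes = majority_vote(labels)
    num_annotators = len(labels[0])
    return [
        sum(1 for row, v in zip(labels, votes) if row[j] == v) / len(labels)
        for j in range(num_annotators)
    ]


# Hypothetical panel: five decent annotators and five coin-flippers.
accuracies = [0.9] * 5 + [0.5] * 5
truth, labels = simulate_labels(2000, accuracies)
agreement = agreement_with_vote(labels)
for acc, agree in zip(accuracies, agreement):
    print(f"true accuracy {acc:.1f}  agreement with vote {agree:.3f}")
```

Even with half the panel flipping coins, the coin-flippers agree with the vote far less often than the decent annotators do, so they separate cleanly.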

### Pinning Values

The authors note that you can fix some values to known gold-standard values and thus improve accuracy. I noted this in my papers and in my comment on Dolores Labs’ CrowdControl, which only uses gold-standard values to evaluate labelers.

### Real Data

They show some nice evaluations for image data consisting of synthetic images and classification of Duchenne smiles. As with other data sets of this kind (such as my results and Raykar et al.’s results), they show decreasing advantage of the model-based methods over pure voting as the number of annotators approaches 10. This is as we’d expect — the Bayesian priors and proper weighting are most important for sparse data.

### Mathematical Presentation

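The authors suppose $I$ items (images in this case) and $J$ annotators. The correct label for item $i$ is $c_i$ and the label provided by annotator $j$ is $x_{i,j}$. They consider fitting for the case where not every annotator labels every item.

The authors model correctness of an annotation by:

$$\mathrm{Pr}(x_{i,j} = c_i) = \mathrm{logit}^{-1}(\alpha_j \, \beta_i)$$

where $\alpha_j$ is a measure of an annotator’s ability and $\beta_i > 0$ a measure of inverse item difficulty. The authors observe some limits to help understand the parameterization. First, as inverse item difficulties approach 0, items become more difficult to label, and accuracy approaches chance (recall $\mathrm{logit}^{-1}(0) = 0.5$):

$$\lim_{\beta_i \to 0} \mathrm{Pr}(x_{i,j} = c_i) = 0.5.$$

As inverse item difficulties approach infinity, the item becomes easier to label (for an annotator with $\alpha_j > 0$):

$$\lim_{\beta_i \to \infty} \mathrm{Pr}(x_{i,j} = c_i) = 1.$$

As annotator accuracy approaches infinity, accuracy approaches perfection:

$$\lim_{\alpha_j \to \infty} \mathrm{Pr}(x_{i,j} = c_i) = 1.$$

As annotator accuracy approaches zero, accuracy approaches chance:

$$\lim_{\alpha_j \to 0} \mathrm{Pr}(x_{i,j} = c_i) = 0.5.$$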

If accuracy is less than zero, the annotator is adversarial. We didn’t find any adversarial annotators in any of our Mechanical Turk data, but there were lots of noisy ones, so some of the models I fit just constrained prior accuracies to be non-adversarial. Others have fixed priors to be non-adversarial. In some settings, I found initialization to non-adversarial accuracies in EM or Gibbs sampling led to the desired solution. Of course, when lots of annotators are adversarial and priors are uniform, the solution space is bimodal with a flip of every annotator’s adversarial status and every item’s label.
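The limits and the adversarial case are easy to check numerically. The sketch below assumes the parameterization described above, with the probability of a correct label equal to the inverse logit of the product of annotator ability and inverse item difficulty. The names `alpha`, `beta`, and `prob_correct` are mine, not the authors'.

```python
import math


def inv_logit(x):
    """Inverse logit (logistic sigmoid)."""
    return 1.0 / (1.0 + math.exp(-x))


def prob_correct(alpha, beta):
    """Probability that a label is correct.

    alpha: annotator ability (negative means adversarial).
    beta:  inverse item difficulty, assumed strictly positive.
    """
    if beta <= 0.0:
        raise ValueError("inverse item difficulty must be positive")
    return inv_logit(alpha * beta)


print(prob_correct(2.0, 1e-9))   # very hard item: close to chance
print(prob_correct(2.0, 50.0))   # very easy item: close to 1
print(prob_correct(0.0, 1.0))    # zero ability: exactly 0.5
print(prob_correct(-2.0, 1.0))   # adversarial: below chance
```

Because `prob_correct(-alpha, beta)` equals `1 - prob_correct(alpha, beta)`, negating every annotator's ability while flipping every item's label leaves the likelihood unchanged, which is the bimodality noted above.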

The authors also model prevalence with a prior term $\mathrm{Pr}(c_i = 1)$ on each item’s true label. If prevalence is 0.5, it drops out, but their Duchenne smile example was unbalanced, so the prevalence term is important.
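Here is a hedged sketch of how prevalence enters the posterior for a single item's label, assuming the correctness model above and treating annotator abilities and inverse item difficulty as known. In the actual EM fit these would be estimated. The function and argument names are illustrative assumptions.

```python
import math


def inv_logit(x):
    """Inverse logit (logistic sigmoid)."""
    return 1.0 / (1.0 + math.exp(-x))


def posterior_positive(labels, alphas, beta, prevalence):
    """Posterior probability that an item's true label is 1.

    labels:     observed binary labels, one per annotator who labeled the item.
    alphas:     matching annotator abilities.
    beta:       the item's inverse difficulty (positive).
    prevalence: prior probability that an item is positive.
    """
    weight_pos = prevalence          # prior times likelihood if the truth is 1
    weight_neg = 1.0 - prevalence    # prior times likelihood if the truth is 0
    for x, alpha in zip(labels, alphas):
        p = inv_logit(alpha * beta)  # probability this annotator is correct
        weight_pos *= p if x == 1 else 1.0 - p
        weight_neg *= p if x == 0 else 1.0 - p
    return weight_pos / (weight_pos + weight_neg)


# Two equally able annotators who disagree: the prior decides.
print(posterior_positive([1, 0], [1.0, 1.0], 1.0, prevalence=0.5))
print(posterior_positive([1, 0], [1.0, 1.0], 1.0, prevalence=0.2))
```

When equally able annotators split evenly, the likelihood terms cancel and the posterior equals the prevalence, so an unbalanced data set pulls ambiguous items toward the majority class.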

### Comparison with Item-Response and Ideal Point Models

The authors mention item-response theory (IRT), which is also where I got the idea to use these models in the first place (too bad the epidemiologists beat us all to the punch).

A basic IRT-like model for annotation would use the difference $\alpha_j - \beta_i$ between annotator accuracy $\alpha_j$ and item difficulty $\beta_i$. By allowing $\beta_i$ to be positive or negative, you can model positive and negative items on the same scale. Or you can fit separate $\alpha_j$ and $\beta_i$ for positive and negative items, thus independently modeling sensitivity and specificity.

Discriminativeness can be modeled by a multiplicative factor $\delta$, producing a predictor $\delta \, (\alpha_j - \beta_i)$. In this way, the $\alpha_j$ term models a positive/negative bias and the $\delta$ term the sharpness of the decision boundary.
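A small sketch of the additive predictor with a discrimination factor, passed through the inverse logit to get a response probability. The names `alpha`, `beta`, `delta`, and `irt_response` are illustrative assumptions, and this is only the predictor described above, not the Uebersax and Grove model.

```python
import math


def inv_logit(x):
    """Inverse logit (logistic sigmoid)."""
    return 1.0 / (1.0 + math.exp(-x))


def irt_response(alpha, beta, delta=1.0):
    """Response probability for the predictor delta * (alpha - beta).

    alpha: annotator term.
    beta:  item term (may be positive or negative).
    delta: discrimination; larger values sharpen the decision boundary.
    """
    return inv_logit(delta * (alpha - beta))


# At alpha == beta the response sits on the boundary whatever delta is.
print(irt_response(1.0, 1.0, delta=5.0))   # 0.5

# Away from the boundary, larger delta pushes the response toward 0 or 1.
for delta in (0.5, 1.0, 3.0):
    print(delta, irt_response(2.0, 1.0, delta), irt_response(0.0, 1.0, delta))
```

Unlike the multiplicative form, the difference can change sign with the item term, so the response can drop below 0.5 for a sufficiently hard item rather than being clipped at chance.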

I’m a big fan of the approach in (Uebersax and Grove 1993), which handles ordinal responses as well as discrete ones. Too bad I can’t find an open-access version of the paper.