Here’s a two-page write-up of one of the models I’ve been looking at for evaluating data annotation, specifically coding standards and annotator sensitivity and specificity:

- Carpenter, Bob. 2008. Hierarchical Bayesian Models of Categorical Data Annotation.

I’ve submitted it as a poster to the New York Academy of Sciences 3rd Annual Machine Learning Symposium, which will be October 10, 2008.

Please let me know what you think (carp@alias-i.com). I didn’t have room to squeeze in the more complex model that accounts for “easy” items. This model and the easy-items variant derive from the epidemiology literature (cited in the paper), where the goal is to estimate disease prevalence from a heterogeneous set of tests. I’ve added some more general Bayesian reasoning and suggested applications to annotation (though Bruce and Wiebe were mostly there in their 1999 paper, which I cite) and to training with probabilistic supervision (I don’t think anyone’s done that yet).
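For intuition, the generative story behind the model (a latent prevalence, plus per-annotator sensitivity and specificity drawn from beta priors) can be simulated in a few lines. This is an illustrative Python sketch, not the BUGS model from the paper, and all the names and hyperparameter values here are my own:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n_items=500, n_annotators=5, pi=0.3,
             sens_prior=(20, 5), spec_prior=(20, 5)):
    """Simulate binary annotations under the prevalence +
    sensitivity/specificity model."""
    # True binary category for each item, with prevalence pi.
    z = rng.random(n_items) < pi
    # Per-annotator accuracy rates drawn from Beta priors.
    sens = rng.beta(*sens_prior, size=n_annotators)
    spec = rng.beta(*spec_prior, size=n_annotators)
    # Annotator j labels item i positive with prob sens[j] if z[i] is
    # true, and with prob 1 - spec[j] otherwise.
    p = np.where(z[:, None], sens[None, :], 1.0 - spec[None, :])
    y = rng.random((n_items, n_annotators)) < p
    return z, sens, spec, y

z, sens, spec, y = simulate()
print(y.shape)  # (500, 5) matrix of 0/1 annotations
```

Fitting the model is then the inverse problem: recover z, the rates, and their beta hyperparameters from the annotation matrix y alone.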

I’m happy to share the R scripts and BUGS models I used to generate the data, fit the models, and display the results. I’d also love to know how to get rid of those useless vertical axes in the posterior histogram plots.

September 5, 2008 at 6:41 pm

On removing vertical axes: call hist() with axes=FALSE to suppress both axes, then add the horizontal axis back with a call to axis(side=1).

Here’s an overview … http://www.statmethods.net/advgraphs/axes.html (The Quick-R site is really nice in general.)

September 6, 2008 at 10:08 pm

Bob,

This is a very nice model. How fast was it?

One technicality: The current model draws “true positive rate” and “true negative rate” independently of each other. In many cases, we will have a natural tradeoff between the two and they will not be independent.

Would it be possible to alter the model so that it first draws the “quality” of the annotator (e.g., to be the AUC), and then picks the tradeoff between “true positive rate” and “true negative rate”?

September 8, 2008 at 1:39 pm

BUGS and R are ridiculously slow compared to direct implementations. The model takes only 50 or so iterations to converge, but three chains of 1000 iterations each take a couple of minutes on my notebook.

I’m also thinking a collapsed sampler that samples only the category assignments (computing sensitivity and specificity from those, and then the alpha and beta priors by moment matching) would make this whole thing run almost instantly in a direct implementation.
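The moment-matching step mentioned above can be sketched as follows: pick alpha and beta so that a Beta distribution’s mean and variance match those of the sampled rates. An illustrative Python version (the helper name is made up):

```python
import numpy as np

def beta_moment_match(rates):
    """Fit Beta(alpha, beta) to sampled rates by matching the
    sample mean and variance."""
    m = rates.mean()
    v = rates.var()
    # For Beta(a, b): mean = a / (a + b) and
    # variance = m * (1 - m) / (a + b + 1),
    # so a + b = m * (1 - m) / v - 1.
    total = m * (1.0 - m) / v - 1.0
    return m * total, (1.0 - m) * total

rng = np.random.default_rng(1)
a, b = beta_moment_match(rng.beta(8.0, 2.0, size=10_000))
print(a, b)  # should roughly recover (8, 2)
```

Note this requires the sample variance to be smaller than m(1 - m), which always holds for rates that genuinely come from a Beta distribution.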

One way the epidemiologists have suggested modeling the dependence between sensitivity and specificity is to use a 4-parameter Dirichlet prior over the rates FPR, TPR, FNR, TNR, which must sum to 1.0. But that only weakly relates TPR and TNR.
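To see how weak that coupling is, one can draw the four confusion-matrix cell probabilities from a Dirichlet and compute the implied sensitivity and specificity. By the Dirichlet’s aggregation/neutrality property, tp/(tp+fn) and tn/(tn+fp) are actually independent Beta draws, so their sample correlation comes out near zero. A quick empirical check (the concentration values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Dirichlet over the four confusion-matrix cells, in the order
# (TP, FN, FP, TN); the cells sum to 1 for each draw.
alpha = np.array([8.0, 2.0, 1.0, 9.0])
tp, fn, fp, tn = rng.dirichlet(alpha, size=1000).T

sens = tp / (tp + fn)  # P(label 1 | true category 1)
spec = tn / (tn + fp)  # P(label 0 | true category 0)
corr = np.corrcoef(sens, spec)[0, 1]
print(round(corr, 2))  # near zero
```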

I’m guessing you’re thinking about a kind of bias dimension, in which some annotators are more prone to classify things as category 1 than others. You see some of this in the dentistry data. You also see a case where one annotator has both a higher sensitivity and a higher specificity than another annotator.

The ideal-point model of voting (like the item-response model for tests) is a natural way to capture the kind of systematic correlation between TPR and TNR you’re talking about. It models outcomes on the logit (or probit) scale and can include a per-annotator bias term as a varying intercept.
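A minimal sketch of how such a logit-scale parameterization ties the two rates together (the parameter names a and b are my own): a discrimination term moves sensitivity and specificity up together, while a bias intercept trades one off against the other.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rates(a, b):
    """Sensitivity and specificity implied by a logit-scale model:
    discrimination a > 0 separates the categories, bias b shifts the
    annotator toward labeling everything category 1."""
    sens = sigmoid(b + a)      # P(label 1 | true category +1)
    spec = 1 - sigmoid(b - a)  # P(label 0 | true category -1)
    return sens, spec

print(rates(2.0, 0.0))  # unbiased: sensitivity == specificity
print(rates(2.0, 1.0))  # biased toward 1: higher sens, lower spec
```

So annotator-level variation in a induces a positive correlation between TPR and TNR (overall quality), while variation in b induces a negative one (the bias tradeoff).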

September 9, 2008 at 2:51 pm

Brendan’s blog entry “AMT is fast, cheap, and good for machine learning data” is another must-read in this area.

There’s a pointer to a very important EMNLP paper by Dan Jurafsky, Andrew Ng, and Rion Snow evaluating five NLP tasks with annotations done by non-experts through Amazon’s Mechanical Turk (aka AMT). What I can’t believe is how cheap it is: they’re getting 3000 multiple-choice annotations for US$2 (yes, that’s two bucks).

September 10, 2008 at 5:56 pm

[…] given these annotations. We described a general latent-category approach in our previous post, Hierarchical Bayesian Models of Categorical Data Annotation. But the paper didn’t have room to display the derived category densities, which is one of […]