I (Bob) have been looking at the inter-annotator agreement and gold-standard adjudication problems. I’ve also been hanging out with Andrew Gelman and Jennifer Hill thinking about multiple imputation, which has me studying their regression book, which covers the item-response (Rasch) model, and Gelman et al.’s book, which describes the beta-binomial model. After thinking about how these tools could be put together, I believed I might have a novel approach to modeling inter-annotator agreement. I got so excited I went and wrote a 30-page paper based on a series of simulations in R and BUGS, which may still turn out to be useful as an introduction to these approaches. It’s already been useful in giving me practice in building hierarchical models, running simulations, and generating graphics from BUGS and R. I even analyzed one of our customer’s real annotated data sets (more on that to follow, I hope, as they had 10 annotators look at 1000 examples each over half a dozen categories).
Alas, the epidemiologists have beaten me to the punch. They generalized the item-response model in exactly the way I wanted to and have even implemented it in WinBUGS. In fact, they even used pretty much the same variable names! It just goes to show how much people with the same tools (hierarchical Bayesian modeling, logistic regression, beta-binomial distributions, 0/1 classification problems, sensitivity vs. specificity distinctions) tend to build the same things.
If you can only read one paper, make it this one (if you can find it; I had to schlep up to Columbia where I can download papers on campus):
- Albert, Paul S. and Lori E. Dodd. 2004. A Cautionary Note on the Robustness of Latent Class Models for Estimating Diagnostic Error without a Gold Standard. Biometrics 60:427-435.
It cites most of the relevant literature other than Dawid and Skene (1979), who, as far as I can tell, first introduced latent class models into this domain, using EM for estimation.
To make a long story short, the model is quite simple in modern sampling/graphical notation, and just about as easy to code up in BUGS. Here’s the model without accounting for missing data. First the variable key:
I        # of items to classify
J        # of annotators
pi       overall prevalence of category 1
c[i]     true category of item i
d[i]     difficulty of item i
a[0,j]   specificity of annotator j
a[1,j]   sensitivity of annotator j
x[i,j]   annotation of item i by annotator j
Recall that sensitivity is accuracy on the positive cases, which is just recall, TP/(TP+FN), whereas specificity is accuracy on the negative cases, TN/(TN+FP). Precision is TP/(TP+FP), but that doesn’t account for the TN cases, and is thus incomplete as a full probability specification when paired with recall. That’s why ROC curves, which plot sensitivity against 1 − specificity, are more popular than precision-recall curves in the rest of the civilized world.
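To make those definitions concrete, here’s a minimal Python sketch; the confusion-matrix counts are made up purely for illustration:

```python
def diagnostic_rates(tp, fn, tn, fp):
    """Sensitivity (= recall), specificity, and precision from the
    four cells of a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)   # accuracy on positive cases
    specificity = tn / (tn + fp)   # accuracy on negative cases
    precision = tp / (tp + fp)     # ignores the TN count entirely
    return sensitivity, specificity, precision

# Made-up counts for illustration: 80 TP, 20 FN, 90 TN, 10 FP.
sens, spec, prec = diagnostic_rates(80, 20, 90, 10)
```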
Other than the annotations x[i,j], all the other variables are unknown. That includes the prevalence pi, the true categories c, and the annotator specificities a[0] and sensitivities a[1].
The sampling model without the noninformative priors:
c[i]   ~ Bernoulli(pi)
d[i]   ~ Normal(0,1)
a[0,j] ~ Beta(alpha[0], beta[0])
a[1,j] ~ Beta(alpha[1], beta[1])
x[i,j] ~ Bernoulli(inv-logit(logit(a[1,j]) - d[i]))      if c[i] = 1
         Bernoulli(1 - inv-logit(logit(a[0,j]) - d[i]))  if c[i] = 0
That’s it. Not that complex as these things go. Scaling the difficulties to have 0 mean and variance 1 identifies the scale of the model; as Gelman and Hill describe in their book, there are lots of ways this can be done, including only scaling the mean of the difficulties to be 0.
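To make the generative story concrete, here’s a sketch that simulates data from this model in Python/NumPy (the post’s own code was in R and BUGS; the sizes, prevalence, and Beta hyperparameters below are illustrative values, not from the post):

```python
import numpy as np

rng = np.random.default_rng(42)

def logit(p):
    return np.log(p / (1.0 - p))

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes and hyperparameters (not from the post).
I, J = 1000, 5       # items, annotators
pi = 0.3             # prevalence of category 1

c = rng.binomial(1, pi, size=I)       # true categories
d = rng.normal(0.0, 1.0, size=I)      # item difficulties, scaled to N(0,1)
a0 = rng.beta(40, 8, size=J)          # annotator specificities
a1 = rng.beta(40, 8, size=J)          # annotator sensitivities

# Response probabilities depend on the true category:
#   c[i] = 1: inv_logit(logit(a1[j]) - d[i])
#   c[i] = 0: 1 - inv_logit(logit(a0[j]) - d[i])
p_pos = inv_logit(logit(a1)[None, :] - d[:, None])
p_neg = 1.0 - inv_logit(logit(a0)[None, :] - d[:, None])
p = np.where(c[:, None] == 1, p_pos, p_neg)

x = rng.binomial(1, p)                # observed annotations, shape (I, J)
```

Note how difficulty d[i] pulls each annotator’s accuracy toward chance on the logit scale, which is exactly the item-response structure.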
There are lots of different priors that could be put on what are here the logit(a[m,j]) terms. The more traditional thing to do in this kind of model is to use a normal prior. In any case, you’re not going to be able to estimate the priors for specificity and sensitivity with only a handful of annotators.
There’s a simplified version of this model mentioned in Dodd and Albert where items are divided into easy and regular cases, with the easy cases having all annotators agree and the regular cases having annotators respond independently according to their own sensitivity and specificity.
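A sketch of simulating from that simplified easy/regular mixture (all parameter values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (not from the paper).
I, J = 500, 5
pi = 0.3              # prevalence of category 1
p_easy = 0.4          # fraction of easy items
sens = np.full(J, 0.85)
spec = np.full(J, 0.90)

c = rng.binomial(1, pi, size=I)                      # true categories
easy = rng.binomial(1, p_easy, size=I).astype(bool)  # easy-item indicator

x = np.empty((I, J), dtype=int)

# Easy items: every annotator reports the true category.
x[easy] = c[easy][:, None]

# Regular items: annotators respond independently with their own
# sensitivity (on positives) or specificity (on negatives).
reg = ~easy
p = np.where(c[reg][:, None] == 1, sens[None, :], 1.0 - spec[None, :])
x[reg] = rng.binomial(1, p)
```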
The point of the Albert and Dodd paper cited above wasn’t to introduce these models, but to evaluate a range of them against one another by simulating data under one model and fitting it with another. They also analyzed real data and saw how it fit under the different models.
I should also point out that the following paper mentions the Dawid and Skene model in the computational linguistics literature:
- Bruce, Rebecca F. and Janyce M. Wiebe. 1999. Recognizing subjectivity: a case study in manual tagging. Natural Language Engineering 1:1-16.
Bruce and Wiebe even talk about using the posterior distribution over true categories as a proxy for a gold standard, which seems to be the right way to go with this work. But they use a single latent variable (roughly problem difficulty) and don’t separately model annotator specificity and sensitivity, which is critical in both the simulations and the real-world data analyses I’ve done recently.
The depressing conclusion for NLP and other applications of classifiers is that it’s clear that with only 3 annotators, it’s going to be impossible to get a gold standard of very high purity. Even with 5 annotators, there are going to be lots of difficult cases.
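To see why, here’s a back-of-the-envelope posterior calculation, assuming conditionally independent annotators with known sensitivity and specificity (the 0.9 values and 0.5 prevalence are illustrative):

```python
def posterior_positive(labels, pi, sens, spec):
    """Posterior probability that the true category is 1 given binary
    annotations, assuming a known prevalence and known, conditionally
    independent annotator sensitivities and specificities."""
    like1, like0 = pi, 1.0 - pi
    for y, a1, a0 in zip(labels, sens, spec):
        like1 *= a1 if y == 1 else (1.0 - a1)
        like0 *= (1.0 - a0) if y == 1 else a0
    return like1 / (like1 + like0)

# Three annotators, each 90% sensitive and 90% specific, prevalence 0.5:
p_unanimous = posterior_positive([1, 1, 1], 0.5, [0.9] * 3, [0.9] * 3)
p_split = posterior_positive([1, 1, 0], 0.5, [0.9] * 3, [0.9] * 3)
```

Unanimous agreement gives a posterior of about 0.999, but a single 2-1 split drops it to exactly 0.9, which is nowhere near gold-standard purity.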
The other applications besides inter-annotator agreement that I’ve run across in the past couple of days are educational testing, epidemiology of infections and evaluating multiple tests (e.g. stool inspection and serology), evaluations of health care facilities in multiple categories, evaluations of dentists and their agreement on caries (pre-cavities), adjustments to genome-wide prevalence assertions, and many more.
August 14, 2008 at 2:36 pm
Great post, thanks for all the pointers and the model demonstration — I’ve been working exactly on this recently with Mechanical Turk annotators. (Found your post via Panos Ipeirotis, http://behind-the-enemy-lines.blogspot.com/2008/08/mechanical-turk-worker-quality-and-hit.html , via some posts of mine, http://blog.doloreslabs.com/topics/wisdom/ )
It’s really interesting that so many fields have reinvented aspects of these techniques.
Brendan
October 24, 2008 at 6:51 am
I do wish Albert and Dodd had cited a paper by Uebersax & Grove (Biometrics, 1993) which introduced the random effects IRT model to the analysis of rater agreement. Of course, the original credit is due Robert Mislevy for introducing the general model to psychological testing.
You’re right, though, this is basically a logical idea which has been re-discovered independently in several disciplines (the problem this shows, however, is that few researchers these days know how to do a decent literature search).
In any case, in a series of articles, Gelfand and Solomon (JASA 1973/74/75) show that latent class models originated with Poisson who used them to estimate the accuracy of jury decisions.
October 24, 2008 at 10:46 am
If you want more, check out John’s useful online bibliography.
link: Latent Trait and Item-Response Model Bibliography
and also a top-level overview of inter-coder agreement:
link: Statistical Methods for Evaluating Interannotator Agreement
I’ve caught up on more of the literature since writing this post. I started out thinking in terms of item-response models. Part of the problem is the variety of terminology different fields use for the same concepts.
I actually just finished my second pass through this:
Uebersax JS, Grove WM. 1993. A latent trait finite mixture model for the analysis of rating agreement. Biometrics.
The Uebersax and Grove (1993) paper not only introduces the latent trait model (coder traits are rating threshold and noisiness), but has a really nice description of the Gaussian mixture underlying the model and the resulting ordinal logistic/probit regression model (ordinal models allow ratings on a scale, such as 1-5 movie ratings).
The model from the Qu, Tan and Kutner (1996) Biometrics paper splits the predictors in two based on the latent class (inferred true category), using one set for positive cases (sensitivity) and one for negative cases (specificity). These are derived properties in the Uebersax and Grove approach.
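To connect the two parameterizations, here’s a sketch of the Uebersax-and-Grove-style generative story (all numbers below are made up for illustration): each rater has a threshold and a noise scale, and sensitivity and specificity fall out as derived quantities rather than parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters (not from the papers).
I, J = 1000, 4
pi = 0.5
mu0, mu1 = -1.0, 1.0                     # latent trait means: negatives/positives

tau = np.array([0.0, 0.2, -0.3, 0.5])   # rater thresholds
sigma = np.array([0.5, 1.0, 0.8, 1.5])  # rater noisiness

c = rng.binomial(1, pi, size=I)                   # true categories
t = rng.normal(np.where(c == 1, mu1, mu0), 1.0)   # latent trait per item

# Rater j calls item i positive when a noisy perception of t[i]
# exceeds the rater's threshold tau[j].
noise = rng.normal(0.0, 1.0, size=(I, J)) * sigma[None, :]
x = (t[:, None] + noise > tau[None, :]).astype(int)

# Sensitivity and specificity per rater are derived quantities here,
# not parameters of the model:
sens = x[c == 1].mean(axis=0)
spec = 1.0 - x[c == 0].mean(axis=0)
```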