My last blog post, Good Kappa’s Not Enough, summarized Reidsma and Carletta’s arguments that a good kappa score is not sufficient for agreement. In this post, I’d like to point out why it’s not necessary, either. My real goal’s to refocus the problem on discovering when a gold standard can be trusted.

Suppose we have five annotators who are each 80% accurate for a binary classification problem whose true category distribution is 50-50. Now let’s say they annotate an example item (1,1,1,1,0), meaning annotator 1 assigns category 1, annotator 2 assigns category 1, up through annotator 5 who assigns category 0. What can we conclude about the example? Assuming the errors are independent (as kappa does), what’s the likelihood that the example really is of category 1 versus category 0? Bayes’ rule lets us calculate:

p(1 | (1,1,1,1,0)) proportional to p(1) * p((1,1,1,1,0) | 1) = 0.5 * 0.8^4 * 0.2^1

p(0 | (1,1,1,1,0)) proportional to p(0) * p((1,1,1,1,0) | 0) = 0.5 * 0.8^1 * 0.2^4

p(1 | (1,1,1,1,0)) = (0.8^4 * 0.2^1) / (0.8^4 * 0.2^1 + 0.8^1 * 0.2^4) = 98.5%
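The same calculation as a quick Python sketch (the variable names are mine; the numbers, not the code, are the point):

```python
# Posterior probability of the true category given the votes (1,1,1,1,0),
# assuming five independent annotators, each 80% accurate, and a
# uniform 50-50 prior over the two categories.
acc = 0.8
prior = 0.5
votes = [1, 1, 1, 1, 0]

def likelihood(true_cat, votes, acc):
    """Probability of the vote vector given the true category."""
    p = 1.0
    for v in votes:
        p *= acc if v == true_cat else (1 - acc)
    return p

unnorm = {cat: prior * likelihood(cat, votes, acc) for cat in (0, 1)}
posterior_1 = unnorm[1] / (unnorm[0] + unnorm[1])
print(round(posterior_1, 4))  # 0.9846, i.e. about 98.5%
```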

Recall the definition of kappa:

kappa = (agree - chanceAgree) / (1 - chanceAgree)

If errors are distributed randomly, agreement will be around 0.8^2 + 0.2^2 = 0.68 in a large sample, and the chance agreement will be 0.5^2 + 0.5^2 = 0.5, for a kappa value of (0.68-0.5)/(1-0.5)=0.36. That’s a level of kappa that leads those who follow kappa to say “go back and rethink your task, your agreement’s not high enough”.
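Here's that arithmetic spelled out as a sketch:

```python
# Expected kappa for two independent annotators, each 80% accurate,
# on a balanced binary task.
acc = 0.8
# Observed agreement: both right or both wrong.
agree = acc**2 + (1 - acc)**2          # 0.68
# Chance agreement for a 50-50 category distribution.
chance_agree = 0.5**2 + 0.5**2         # 0.5
kappa = (agree - chance_agree) / (1 - chance_agree)
print(round(kappa, 2))  # 0.36
```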

Unfortunately, with 80% per-annotator accuracy, we only expect around 74% of the examples to get a 4:1 or 5:0 vote for the true category (74% = 5 * 0.8^4 * 0.2^1 + 0.8^5).
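Again as a sketch, using the binomial coefficient for the 4-of-5 case:

```python
from math import comb

# Probability that five independent 80%-accurate annotators produce
# a 4:1 or 5:0 vote for the true category.
acc = 0.8
p_4_of_5 = comb(5, 4) * acc**4 * (1 - acc)   # 5 * 0.8^4 * 0.2^1
p_5_of_5 = acc**5
print(round(p_4_of_5 + p_5_of_5, 3))  # 0.737, roughly 74%
```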

I believe the question we really care about is when we can trust an annotation enough to put it in the gold standard. So let’s say we have two 80% accurate annotators and the true category is 1. The likelihood of various annotation outcomes are:

p(1,1) = 0.64
p(1,0) = 0.16
p(0,1) = 0.16
p(0,0) = 0.04

So clearly two annotators aren’t enough to be confident at the 99% level; even when both agree, the chance the agreed label is right is only 0.64 / (0.64 + 0.04), roughly 94%. We’d need 90% accurate annotators for that. But what about three annotators? The chance of three 80% accurate annotators all agreeing on the wrong label is only 0.2^3 = 0.8%, while in 0.8^3 = 51.2% of cases they will all agree and be right. So we use a minimum of three annotators, and if they agree, go on.

If they disagree, we need to bring in more annotators until we’re confident of the outcome. When we get to a four-out-of-five vote, as in our first example, we’re confident again. But even 3-out-of-4 agreement is still pretty weak, yielding only a 94% chance that the agreed-upon value is correct.
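All of these vote posteriors fall out of one formula. A sketch, assuming independent errors, equally accurate annotators, and a uniform prior:

```python
def posterior_majority(k, n, acc=0.8):
    """Posterior probability that the majority label is correct when
    k of n independent annotators, each `acc` accurate, agree on it,
    with a uniform prior over the two categories."""
    right = acc**k * (1 - acc)**(n - k)   # majority label is correct
    wrong = (1 - acc)**k * acc**(n - k)   # majority label is wrong
    return right / (right + wrong)

print(round(posterior_majority(2, 2), 3))  # 0.941: two agreeing annotators
print(round(posterior_majority(3, 3), 3))  # 0.985: three agreeing annotators
print(round(posterior_majority(3, 4), 3))  # 0.941: the weak 3-of-4 case
print(round(posterior_majority(4, 5), 3))  # 0.985: the 4-of-5 case
```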

Of course, this line of reasoning supposes we know the annotator’s accuracy. In practice, we can’t evaluate an annotator’s accuracy because we don’t know the true labels for items.

July 31, 2008 at 10:57 am |

An alternative, when the annotator’s accuracy is unknown, is to use an EM-like approach.

We first assume known annotator accuracy (initialized, say, to perfect accuracy), and based on that we compute the most likely labels.

Then, based on the estimated labels, we can estimate the labelers’ accuracies.

By iterating, we can converge to some good estimates of the labels, and generate a confusion matrix for each annotator.
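A minimal sketch of that loop for a binary task, collapsing the full per-annotator confusion matrices of Dawid and Skene down to a single accuracy per annotator (all names here are illustrative):

```python
import numpy as np

def em_labels(votes, n_iter=50):
    """votes: (n_items, n_annotators) array of 0/1 labels.
    Returns (p, acc): the posterior probability that each item's true
    label is 1, and an accuracy estimate per annotator."""
    n_items, n_annot = votes.shape
    acc = np.full(n_annot, 0.99)   # initialize to near-perfect accuracy
    for _ in range(n_iter):
        # E-step: posterior over true labels given current accuracies,
        # with a uniform prior over the two categories.
        like1 = np.prod(np.where(votes == 1, acc, 1 - acc), axis=1)
        like0 = np.prod(np.where(votes == 0, acc, 1 - acc), axis=1)
        p = like1 / (like1 + like0)
        # M-step: re-estimate each annotator's accuracy as the expected
        # agreement rate with the soft labels.
        agree = votes * p[:, None] + (1 - votes) * (1 - p[:, None])
        acc = agree.mean(axis=0).clip(0.01, 0.99)
    return p, acc
```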

Take a look at

Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm

A. P. Dawid and A. M. Skene

Applied Statistics, Vol. 28, No. 1 (1979), pp. 20-28

http://www.jstor.org/pss/2346806

July 31, 2008 at 2:59 pm |

Panos:

Thanks for the reference. This is just the kind of thing I was looking for. And looking at the papers that cite it has opened up a vein of this kind of literature.

I’m working on a very similar approach using a Gibbs sampler, which has the nice property of giving me posterior uncertainty estimates of things like confusion matrices.

In searching for your reference, I found this, which is where I was planning to go with the posterior category samples from the fitted models:

Learning with Multiple Labels. Rong Jin and Zoubin Ghahramani.

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.8894

I’ll post the details next week when I’m done writing up the paper.

July 31, 2008 at 3:49 pm |

Okay, Panos was holding out. He should’ve cited his own March 2008 blog post on Mechanical Turk: The Foundations. It contains a great overview of the true label induction problem and even more references. I just found it searching for Dawid’s paper.

Figuring out true paper “quality” given referee scores was one of the applications I’ve been thinking about. I originally thought a simple bootstrap analysis would be interesting and easy to do. But now I’ve been thinking more along the lines of linear modeling, perhaps with a logistic link to get the boundedness of the scale.

August 14, 2008 at 2:47 pm |

If you have access to massively scaling up the number of annotators (which is easy on Mechanical Turk), I’ve had some success using a high number of annotators (e.g., 50 per example) on a small number of examples to derive “true” labels. Then it’s possible to estimate what accuracy rates will be with 3 or 5 or however many annotators will be used for the rest of the data.
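That recipe is easy to simulate. A sketch, where the 80% accuracy is of course unknown in practice and is used here only to generate the simulated annotators:

```python
import random

random.seed(1)
SIM_ACC = 0.8  # unknown in practice; used here only to simulate annotators

def annotate(true_label, n):
    """Simulate n independent annotators, each SIM_ACC accurate."""
    return [true_label if random.random() < SIM_ACC else 1 - true_label
            for _ in range(n)]

# Step 1: 50 annotators on a small pilot set give near-certain "true" labels.
pilot = [random.randint(0, 1) for _ in range(100)]
derived = [int(sum(annotate(t, 50)) > 25) for t in pilot]

# Step 2: measure how often a 3-annotator majority vote matches the
# derived labels, to pick how many annotators the rest of the data needs.
hits = sum(int(sum(annotate(t, 3)) > 1) == d for t, d in zip(pilot, derived))
print(hits / len(pilot))  # roughly 0.896 = 0.8^3 + 3 * 0.8^2 * 0.2 in expectation
```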

In my experience, scaling up the number of annotators tends to be more useful than running EM. I could be doing something wrong though.

August 14, 2008 at 4:26 pm |

What I’m trying to figure out is just how many annotators and items we’ll need to get tight posterior intervals on annotator accuracies and on true categories. In the simulations, I haven’t needed more than 5 annotators on 500 items to get a pretty good read on true categories and annotator accuracy (posterior intervals of +/- 1%). The kind of item-response or beta-binomial models I’ve been looking at are also amenable to having 50 annotators each do random subsets of the data.

EM’s awfully prone to local maxima in complex problems. Once there’s a reasonable amount of training data, I haven’t seen much in the way of improvement from EM. I’m thinking I’ll try Gibbs sampling instead of EM next; it’s even easier for classifiers than EM, and less prone to get stuck in local optima.
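For the binary case, one turn of such a Gibbs sampler is just two conditional draws. A sketch of a simple beta-binomial annotation model (illustrative names and priors, not the model from the paper in progress):

```python
import numpy as np

def gibbs_accuracy(votes, n_samples=200, burn=100, a=2.0, b=1.0, seed=0):
    """votes: (n_items, n_annotators) array of 0/1 labels.
    Alternates sampling true labels given annotator accuracies and
    accuracies given labels, with a Beta(a, b) prior on accuracy and a
    uniform prior on labels. Returns posterior draws of the accuracies."""
    rng = np.random.default_rng(seed)
    n_items, n_annot = votes.shape
    acc = np.full(n_annot, 0.8)                  # initial accuracy guess
    draws = []
    for it in range(burn + n_samples):
        # Draw true labels from their posterior given the accuracies.
        like1 = np.prod(np.where(votes == 1, acc, 1 - acc), axis=1)
        like0 = np.prod(np.where(votes == 0, acc, 1 - acc), axis=1)
        z = (rng.random(n_items) < like1 / (like1 + like0)).astype(int)
        # Draw accuracies from their Beta posterior given the labels
        # (conjugacy: Beta prior + binomial agreement counts).
        agree = (votes == z[:, None]).sum(axis=0)
        acc = rng.beta(a + agree, b + n_items - agree)
        if it >= burn:
            draws.append(acc)
    return np.array(draws)  # (n_samples, n_annot); summarize by quantiles
```

Posterior intervals on the accuracies then come straight from quantiles of the draws, which is exactly what EM doesn't give you.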