My last blog post, Good Kappa’s Not Enough, summarized Reidsma and Carletta’s arguments that a good kappa score is not sufficient for agreement. In this post, I’d like to point out why it’s not necessary, either. My real goal’s to refocus the problem on discovering when a gold standard can be trusted.
Suppose we have five annotators who are each 80% accurate for a binary classification problem whose true category distribution is 50-50. Now let’s say they annotate an example item (1,1,1,1,0), meaning annotator 1 assigns category 1, annotator 2 assigns category 1, up through annotator 5 who assigns category 0. What can we conclude about the example? Assuming the errors are independent (as kappa does), what’s the likelihood that the example really is of category 1 versus category 0? Bayes’ rule lets us calculate:
p(1 | (1,1,1,1,0)) proportional to p(1) * p((1,1,1,1,0) | 1) = 0.5 * 0.8^4 * 0.2^1

p(0 | (1,1,1,1,0)) proportional to p(0) * p((1,1,1,1,0) | 0) = 0.5 * 0.8^1 * 0.2^4

p(1 | (1,1,1,1,0)) = (0.8^4 * 0.2^1) / (0.8^4 * 0.2^1 + 0.8^1 * 0.2^4) ≈ 98.5%
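For anyone who wants to check the arithmetic, here's a minimal Python sketch of that posterior calculation. It assumes the 50-50 prior, the 80% per-annotator accuracy, and the independent-error assumption from the setup above; the variable names are just mine.

```python
# Posterior for the vote (1,1,1,1,0): 50-50 prior, five independent
# annotators, each 80% accurate (illustrative numbers from the text).
accuracy = 0.8
prior_1 = 0.5

votes = [1, 1, 1, 1, 0]
n_for_1 = sum(votes)             # annotators voting for category 1
n_for_0 = len(votes) - n_for_1   # annotators voting for category 0

# Likelihood of this vote pattern under each possible true category.
like_if_1 = accuracy ** n_for_1 * (1 - accuracy) ** n_for_0
like_if_0 = accuracy ** n_for_0 * (1 - accuracy) ** n_for_1

posterior_1 = (prior_1 * like_if_1) / (prior_1 * like_if_1 + (1 - prior_1) * like_if_0)
print(posterior_1)  # 0.9846..., i.e. roughly 98.5%
```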
Recall the definition of kappa:
kappa = (agree - chanceAgree) / (1 - chanceAgree)
If errors are distributed randomly, agreement will be around 0.8^2 + 0.2^2 = 0.68 in a large sample, and the chance agreement will be 0.5^2 + 0.5^2 = 0.5, for a kappa value of (0.68-0.5)/(1-0.5)=0.36. That’s a level of kappa that leads those who follow kappa to say “go back and rethink your task, your agreement’s not high enough”.
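The expected-kappa arithmetic is easy to reproduce; here's the same calculation as a Python sketch, again assuming two annotators of equal accuracy and a 50-50 true category distribution.

```python
# Expected kappa for two independent annotators of equal accuracy on a
# binary task with a 50-50 true category distribution.
accuracy = 0.8
label_dist = (0.5, 0.5)

# Observed agreement: both right, or both wrong (in a binary task,
# two wrong annotators necessarily pick the same wrong label).
agree = accuracy ** 2 + (1 - accuracy) ** 2        # 0.68

# Chance agreement: independent guesses from the marginal label distribution.
chance_agree = sum(p ** 2 for p in label_dist)     # 0.5

kappa = (agree - chance_agree) / (1 - chance_agree)
print(kappa)  # 0.36 (up to floating-point noise)
```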
Unfortunately, with 80% per-annotator accuracy, we only expect about 74% of the examples to get a 4:1 or 5:0 vote in favor of the correct category (5 * 0.8^4 * 0.2^1 + 0.8^5 ≈ 0.74).
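That 74% is just a binomial tail; here's a quick sketch of the calculation under the same assumptions.

```python
from math import comb

# Probability that at least 4 of 5 independent 80%-accurate annotators
# pick the correct category, i.e. a 4:1 or 5:0 vote for the right answer.
accuracy = 0.8
n = 5

p_strong_majority = sum(
    comb(n, k) * accuracy ** k * (1 - accuracy) ** (n - k)
    for k in (4, 5)
)
print(p_strong_majority)  # 0.73728, roughly 74%
```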
I believe the question we really care about is when we can trust an annotation enough to put it in the gold standard. So let's say we have two 80% accurate annotators and the true category is 1. The likelihoods of the various annotation outcomes are:
p(1,1) = 0.64
p(1,0) = 0.16
p(0,1) = 0.16
p(0,0) = 0.04
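Those four probabilities are just products of per-annotator accuracies; enumerating them programmatically makes the bookkeeping explicit (same assumptions as above).

```python
from itertools import product

# Probability of each labeling from two independent 80%-accurate
# annotators when the true category is 1.
accuracy = 0.8
true_label = 1

for labels in product((1, 0), repeat=2):
    p = 1.0
    for label in labels:
        p *= accuracy if label == true_label else 1 - accuracy
    print(labels, p)
# (1, 1) 0.64   (1, 0) 0.16   (0, 1) 0.16   (0, 0) 0.04
```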
So clearly two annotators aren't enough to be confident at the 99% level: even when both agree, the chance they're right is only 0.64 / (0.64 + 0.04) ≈ 94%. We'd need roughly 90% accurate annotators (or better) for that. But what about three annotators? The chance that three 80% accurate annotators all make an error, and hence unanimously agree on the wrong label, is only 0.2^3 = 0.8%, while in 0.8^3 = 51.2% of cases they unanimously agree and are right. So we use a minimum of three annotators, and if they agree, we move on.
If they disagree, we need to bring in more annotators until we're confident of the outcome. When we get to a four-out-of-five vote, as in our first example, we're back to 98.5% confidence. But even 3/4 agreement is still pretty weak, yielding only a 94% chance that the agreed-upon value is correct.
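The confidence numbers in the last two paragraphs all come from the same Bayes-rule calculation, so it's worth packaging it up. Here's a small helper, a sketch under the same 50-50-prior and independent-error assumptions (the function name is mine, not from any library):

```python
# Posterior probability that the majority label is correct, given an
# n_agree : n_disagree vote split, equally accurate independent
# annotators, and a 50-50 prior over the two categories (which cancels).
def majority_confidence(n_agree, n_disagree, accuracy=0.8):
    like_right = accuracy ** n_agree * (1 - accuracy) ** n_disagree
    like_wrong = accuracy ** n_disagree * (1 - accuracy) ** n_agree
    return like_right / (like_right + like_wrong)

print(majority_confidence(3, 0))                # 0.985: three-way agreement
print(majority_confidence(4, 1))                # 0.985: the 4:1 vote from the first example
print(majority_confidence(3, 1))                # 0.941: 3/4 agreement, still weak
print(majority_confidence(2, 0, accuracy=0.9))  # 0.988: two agreeing 90% annotators, just shy of 99%
```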
Of course, this line of reasoning supposes we know each annotator's accuracy. In practice, we can't evaluate an annotator's accuracy directly, because we don't know the true labels for the items.