I stumbled across Reidsma and Carletta’s "Reliability measurement without limits", which is pending publication as a Computational Linguistics journal squib (no, not a non-magical squib, but a linguistics squib).
The issue they bring up is that if we’re annotating data, a high value for the kappa statistic isn’t enough to guarantee what they call "reliability". The problem is that the disagreements may not be random. They focus on simulating the case where an annotator over-uses a label, which results in kappa greatly overestimating reliability when compared to performance versus the truth. The reason is that a statistical model trained on the annotated data will be able to pick up on the pattern of mistakes and reproduce them, making the task look more reliable than it is.
This discussion is similar to the case we’ve been worrying about here in trying to figure out how we can annotate a named-entity corpus with high recall. If there are hard cases that annotators miss (over-using the no-entity label), random agreement assuming equally hard problems will over-estimate the "true" recall.
Reidsma and Carletta’s simulation shows that there’s a strong effect from the relationship between true labels and features of the instances (as measured by Cramer’s phi).
Review of Cohen’s Kappa
Recall that kappa is a "chance-adjusted measure of agreement", which has been widely used in computational linguistics since Carletta’s 1996 squib on kappa, defined by:
kappa = (agreement - chanceAgreement) / (1 - chanceAgreement)
Here, agreement is just the empirical proportion of cases on which the annotators agree, and
chanceAgreement is the proportion of cases on which they’d agree by chance. For Cohen’s kappa, chance agreement is measured by assuming annotators pick categories at random according to their own empirical category distribution (but there are lots of variants, as pointed out in this Artstein and Poesio tech report, a version of which is also in press at The Journal of Kappa Studies, aka Computational Linguistics). Kappa values will range between -1 and 1, with 1 only occurring if the annotators are in perfect agreement.
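To make the definition concrete, here is a minimal Python sketch of Cohen’s kappa for two annotators. The function name cohen_kappa and the toy labels are made up for illustration; chance agreement is computed from each annotator’s own empirical label distribution, as described above.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' labels over the same items.

    Hypothetical helper for illustration; assumes the two sequences are
    aligned item by item.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)

    # Empirical proportion of items on which the annotators agree.
    agreement = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: each annotator picks labels at random according
    # to his or her own empirical label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    chance_agreement = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)

    if chance_agreement == 1.0:
        # Both annotators used one and the same label throughout.
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (agreement - chance_agreement) / (1 - chance_agreement)


# Toy example: agreement = 0.75, chanceAgreement = 0.5, so kappa = 0.5.
print(cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```

In the toy example the annotators agree on three of four items, but half of that agreement is expected by chance, which is why kappa comes out at 0.5 rather than 0.75.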
I (Bob) don’t like kappa, because it’s not estimating a probability (despite being an arithmetic combination of [maximum likelihood] probability estimates). The only reason to adjust for chance is that it allows one, in theory, to compare different tasks. The way this plays out in practice is that an arbitrary cross-task threshold is defined above which a labeling task is considered "reliable".
A final suggestion for those using kappa: confidence intervals from bootstrap resampling would be useful to see how reliable the estimate of kappa itself is.
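Here is one way the bootstrap suggestion could be sketched in Python. It resamples items (pairs of labels) with replacement and reports a simple percentile interval; the function names, the default of 1000 resamples, and the choice of the percentile method are all assumptions made for illustration, not anything prescribed by the papers discussed above.

```python
import random
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa; raises ValueError when it is undefined."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label sequences of equal length")
    agreement = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    chance = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    if chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (agreement - chance) / (1 - chance)


def bootstrap_kappa_interval(labels_a, labels_b, num_resamples=1000,
                             level=0.95, seed=0):
    """Percentile bootstrap interval for kappa (hypothetical helper)."""
    rng = random.Random(seed)
    pairs = list(zip(labels_a, labels_b))
    n = len(pairs)
    kappas = []
    for _ in range(num_resamples):
        # Resample whole items so each annotator pair stays aligned.
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        sample_a, sample_b = zip(*sample)
        try:
            kappas.append(cohen_kappa(sample_a, sample_b))
        except ValueError:
            # Skip degenerate resamples where kappa is undefined.
            continue
    if not kappas:
        raise ValueError("no resample produced a defined kappa")
    kappas.sort()
    lower = max(0, int((1 - level) / 2 * len(kappas)))
    upper = min(len(kappas) - 1, int((1 + level) / 2 * len(kappas)) - 1)
    return kappas[lower], kappas[max(lower, upper)]


# Made-up labels for 40 items from two annotators.
annotator_a = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0] * 4
annotator_b = [1, 0, 0, 0, 1, 0, 1, 0, 1, 0] * 4
print(cohen_kappa(annotator_a, annotator_b))
print(bootstrap_kappa_interval(annotator_a, annotator_b))
```

Resampling items rather than individual labels keeps each pair of annotations together, which is what preserves the agreement structure in every resample. A wide interval is a sign that the corpus is too small for the point estimate of kappa to mean much.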