Lam and Stork (2003) The Effect of Annotator Error on Classifier Evaluation


Anyone who’s looked at a corpus realizes that the “gold standards” are hardly 24 karat; they are full of impurities in the form of mislabeled items. Just how many, no one really knows. Finding out would require a true gold standard to evaluate against! We’ve been wondering how we can conclude we have a 99.9% recall entity annotator when the gold standard data itself is highly unlikely to be 100% accurate.

Check out this 2003 publication from the OpenMind Initiative’s list of publications:

Lam, Chuck P. and David G. Stork. 2003. Evaluating classifiers by means of test data with noisy labels. In Proceedings of IJCAI.

See in particular their Table 1, which lists the observed error rates for different true error rates and corpus mislabeling rates. Here’s a small slice:

                         Corpus Mislabeling Rate
True Classifier Error    1%       3%       5%
2%                       3.0%     4.9%     6.8%
6%                       6.9%     8.6%     10.4%
10%                      10.8%    12.4%    14.0%

Observed Classifier Error vs. Mislabeled Corpus (each cell is the observed classifier error)

For simplicity, Lam and Stork assumed the classifier being evaluated and the data annotators make independent errors. This is not particularly realistic, as problems that are hard for humans tend to be hard for classification algorithms, too. Even the authors point out that it’s common for the same errors to be in training data and test data, thus making it very likely errors will be correlated.
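To make the independence assumption concrete, here is a minimal sketch that assumes a binary task in which the classifier errs with probability e and the gold label is wrong with probability m, independently. Under that assumption the classifier disagrees with the gold label exactly when one of the two is wrong but not both, so the observed error is e(1 - m) + (1 - e)m. The function names are ours, not the paper's. The formula reproduces the slice of Table 1 quoted above, but we have not checked it against the paper's full derivation.

```python
def observed_error(true_error: float, mislabel_rate: float) -> float:
    """Observed error against noisy gold labels.

    Assumes a binary task and independent classifier and annotator errors:
    the classifier disagrees with the gold label when exactly one is wrong.
    """
    return true_error * (1.0 - mislabel_rate) + (1.0 - true_error) * mislabel_rate


def true_error_from_observed(observed: float, mislabel_rate: float) -> float:
    """Invert observed_error() to recover the true error (same assumptions)."""
    if mislabel_rate == 0.5:
        raise ValueError("labels are pure noise at a 50% mislabeling rate")
    return (observed - mislabel_rate) / (1.0 - 2.0 * mislabel_rate)


def table_slice(true_errors, mislabel_rates):
    """Observed error in percent, rounded to one decimal, one row per true error."""
    return [
        [round(100.0 * observed_error(e, m), 1) for m in mislabel_rates]
        for e in true_errors
    ]


if __name__ == "__main__":
    rates = [0.01, 0.03, 0.05]
    for e, row in zip([0.02, 0.06, 0.10], table_slice([0.02, 0.06, 0.10], rates)):
        print(f"true error {e:.0%}: {row}")
```

The inverse only makes sense if the mislabeling rate is known and the independence assumption holds. Since errors tend to be correlated in practice, treat the recovered value as a rough correction, not an estimate to rely on.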

Lam and Stork’s paper is also the first I know of to raise the problem addressed by Sheng et al.’s 2008 KDD paper, Get Another Label?, namely:

… it is no longer obvious how one should spend each additional labeling effort. Should one spend it labeling the unlabeled data, or should one spend it increasing the accuracy of already labeled data?   (Lam and Stork 2003)

Lam and Stork also discuss a problem related to the one in Snow et al.’s 2008 EMNLP paper, Cheap and Fast – But is it Good?, namely how many noisy annotators are required to estimate the true error rate (their figure 1). The answer, if they’re noisy, is “a lot”. Snow et al. considered how many really noisy annotators were required to recreate a gold standard approximated by presumably less noisy annotators, which is a rather different estimation problem.

Of course, this is all very Platonic in assuming that the truth is out there. Here at Alias-i, we are willing to accept that there is no truth, or at least that some cases are so borderline as to not be encompassed by a coding standard, and that any attempt to extend the standard will still leave a fuzzy boundary.

The question we have now is whether our models of annotation will distinguish the borderline cases from hard cases. With hard cases, enough good annotators should converge to a single truth. With borderline cases, there should be no convergence.
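The convergence intuition can be illustrated with a toy calculation. This is a hypothetical model, not our actual model of annotation: suppose each of n annotators independently chooses label A with probability p. A hard case has p well above 0.5 (say 0.7), so the chance that the majority picks A approaches 1 as n grows. A borderline case has p at 0.5, so the majority stays a coin flip no matter how many annotators are added. The function name and the values of p are assumptions made for illustration.

```python
from math import comb


def majority_prob(p: float, n: int) -> float:
    """Probability that a strict majority of n annotators chooses label A.

    Hypothetical model: each annotator independently picks A with probability p.
    n must be odd so that ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of annotators to avoid ties")
    # Sum the binomial probabilities of k votes for A, for every k above n/2.
    return sum(
        comb(n, k) * p**k * (1.0 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    for n in (1, 11, 101):
        hard = majority_prob(0.7, n)        # hard case: annotators lean toward A
        borderline = majority_prob(0.5, n)  # borderline case: no lean at all
        print(f"n={n:3d}  hard={hard:.4f}  borderline={borderline:.4f}")
```

In this toy setting the two kinds of case only separate as the number of annotators grows, which fits the earlier point that noisy annotation requires a lot of annotators.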
