Aleks just sent me a pointer to this paper from last year’s COLT:
- Dekel, Ofer and Ohad Shamir. 2009. Vox Populi: Collecting High-Quality Labels from a Crowd. In COLT.
The paper presents, proves theorems about, and evaluates a very simple idea in the context of crowdsourced labeling of binary classification data. Unlike other approaches I’ve blogged about, Dekel and Shamir use only one coder per item, and often coders only label a handful of items.
They train an SVM on all of the labels, then use the SVM as a reference to evaluate the coders, pruning out those whose labels disagree too often with the SVM’s predictions.
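The pipeline is simple enough to sketch in a few lines. Here’s a rough Python version of the idea using scikit-learn’s linear SVM; the flat data format and the fixed agreement threshold are my assumptions for illustration, not the paper’s (they derive their pruning criteria from the theory):

```python
# Sketch of the prune-low-quality-coders idea: train an SVM on all the
# labels, score each annotator by agreement with the SVM's predictions,
# and drop annotators below a threshold.  The threshold is hypothetical.
from sklearn.svm import LinearSVC

def prune_annotators(X, y, annotator_ids, threshold=0.7):
    # Fit one SVM on everything, noisy labels and all.
    svm = LinearSVC().fit(X, y)
    preds = svm.predict(X)
    # Per-annotator agreement rate with the SVM's predictions.
    agree, total = {}, {}
    for a, yi, pi in zip(annotator_ids, y, preds):
        total[a] = total.get(a, 0) + 1
        agree[a] = agree.get(a, 0) + int(yi == pi)
    keep = {a for a in total if agree[a] / total[a] >= threshold}
    # Return indices of items whose annotator survived the pruning.
    kept_idx = [i for i, a in enumerate(annotator_ids) if a in keep]
    return kept_idx, keep
```

The pruned item set would then be used to retrain (or to evaluate, if that’s the goal).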
This paper reminded me of (Brodley and Friedl 1999), which was also about using multiple trained classifiers to remove bad labels. Brodley and Friedl remove items on an item-by-item basis rather than an annotator-by-annotator basis.
This paper is also somewhat reminiscent of (Raykar et al. 2009), who train a classifier and use it as another annotator.
Lots of Theory
There’s lots of theory saying why this’ll work. It is a COLT paper after all.
Evaluation: Web Page Query Relevance
They ran an eval on Amazon’s Mechanical Turk, asking people to judge pairs consisting of a web query and a web page as to whether the page is relevant to the query.
They used 15 coders per query/page pair to define a gold standard by majority voting. They then evaluated their one-coder-per-item approach by subsampling that larger data set.
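With binary labels and an odd number of coders per item, the voted gold standard is just an untied majority vote; a trivial sketch:

```python
# Majority vote over one item's labels.  With 15 binary labels per item
# there can be no ties; in general, ties would break arbitrarily here.
from collections import Counter

def majority_label(labels):
    return Counter(labels).most_common(1)[0][0]
```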
One thing I noticed is that as they lowered the maximum number of items labeled by an annotator, their average annotator accuracy went up. This is consistent with our findings and those in Snow et al. that the spammy Mechanical Turkers complete a disproportionate number of tasks.
With only one coder per item, Dekel and Shamir can’t evaluate the way everyone else evaluates crowdsourcing: they know their resulting data will remain noisy, because it’s only singly annotated.
Estimates of coder sensitivity and specificity could be made based on their performance relative to the SVM’s best guesses. That’d provide some handle on final corpus quality in terms of false positives and negatives relative to ground truth.
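A sketch of what such an estimate might look like, treating the SVM’s predictions as a stand-in reference rather than true gold labels (my formulation, not something from the paper):

```python
# Sensitivity and specificity of one coder, measured against a
# reference labeling (here, the SVM's best guesses, so these are only
# estimates relative to the SVM, not relative to ground truth).
def sens_spec(coder_labels, ref_labels, positive=1):
    tp = fn = tn = fp = 0
    for y, r in zip(coder_labels, ref_labels):
        if r == positive:
            tp += int(y == positive)
            fn += int(y != positive)
        else:
            tn += int(y != positive)
            fp += int(y == positive)
    sens = tp / (tp + fn) if tp + fn else float('nan')
    spec = tn / (tn + fp) if tn + fp else float('nan')
    return sens, spec
```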
Rather than evaluating trained classifier performance, Dekel and Shamir measure the “noise level” of the resulting data set after pruning. What we really care about in practice is how much pruning bad annotators helps train a more reliable classifier (or helps evaluate one, if that’s the goal). They discuss this kind of end-to-end evaluation under “future work”! It’s really striking how different the evaluation versus theory requirements are for different conferences.