I came across this paper, which, among other things, describes the data collection being used for the 2011 TREC Crowdsourcing Track:
- Tang, Wei and Matthew Lease. 2011. Semi-supervised consensus labeling for crowdsourcing. SIGIR Workshop on Crowdsourcing for Information Retrieval.
But that’s not why we’re here today. I want to talk about their modeling decisions.
Tang and Lease apply a Dawid-and-Skene-style model to crowdsourced binary relevance judgments for highly-ranked system responses from a previous TREC information retrieval evaluation. The workers judge document/query pairs as highly relevant, relevant, or irrelevant (though highly relevant and relevant are collapsed in the paper).
The Dawid and Skene model was relatively unsupervised, imputing all of the categories for items being classified as well as the response distribution for each annotator for each category of input (thus characterizing both bias and accuracy of each annotator).
Semi Supervision
Tang and Lease exploit the fact that in a directed graphical model, EM can be used to impute arbitrary patterns of missing data. They use this to simply add some known values for categories (here true relevance values). Usually, EM is being used to remove data, and that’s just how they pitch what they’re doing. They contrast the approach of Crowdflower (nee Dolores Labs) and Snow et al. as fully supervised. They thus provide a natural halfway point between Snow et al. and Dawid and Skene.
Good Results vs. NIST Gold
The key results are in the plots in figures 4 through 7,which plot performance versus amount of supervision (as well as fully unsupervised and majority vote approaches). They show the supervision helping relative to the fully unsupervised approach and the approach of training on just the labeled data.
Another benefit of adding supervised data (or adding unsupervised data if viewed the other way) is that you’ll get better estimates of annotator responses (accuracies and biases) and of topic prevalences.
Really Gold?
They get their gold-standard values from NIST, and the notion of relevance is itself rather vague and subjective, so the extra labels are only as golden as the NIST annotators. See below for more on this issue.
Voting: Quantity vs. Quality
Tang and Lease say that voting can produce good results with high quality annotators. It’ll also produce good results with a high quantity of annotators of low quality. As long as their results are independent enough, at least. This is what everyone else has seen (me with the Snow et al. data and Vikas Raykar et al. very convincingly in their JMLR paper).
Regularized Estimates (vs. MLE vs. Bayesian)
I think it’d help if they regularized rather than took maximum likelihood estimates. Adding a bit of bias from regularization often reduces variance and thus expected error even more. It helps with fitting EM, too.
For my TREC entry, I went whole hog and sampled from the posterior of a Bayesian hierarchical model which simultaneously estimates the regularization parameters (now cast as priors) along with the other parameters.
I also use Bayesian estimates, specifically posterior means, which minimize expected squared error. MLE for the unregularized case and maximum a posterior (MAP) estimates for the regularized case can both be viewed as taking posterior maximums (or modes) rather than means. These can be pretty different for the kinds of small count beta-binomial distributions used in Dawid and Skene-type models.
Really Adversarial Turkers?
How in the world did they get a Mechanical turker to have an accuracy of 0 with nearly 100 responses? That’s very very adversarial. I get higher accuracy estimates using their data for TREC and don’t get very good agreement with the NIST gold standard, so I’m really wondering about this figure and the quality of the NIST judgments.
Active Learning
Choosing labels for items on the margin of a classifier is not necessarily the best thing to do for active learning. You need to balance uncertainty with representativeness, or you’ll do nothing but label a sequence of outliers. There’s been ongoing work by John Langford and crew on choosing the right balance here.
Adding a Model
Vikas Raykar et al. in their really nice JMLR paper add a regression-based classifier to the annotators. I think this is the kind of thing Tang and Lease are suggesting in their future work section. They cite the Raykar et al. paper, but oddly not in this context, which for me, was its major innovation.
Not Quite Naive Bayes
Tang and Lease refer to the Dawid and Skene approach as “naive Bayes”, which is not accurate. I believe they’re thinking of generating the labels as analogous to generating tokens. But the normalization for that is wrong, being over annotators rather than over annotator/label pairs. If they had a term estimating the probability of an annotator doing an annotation, then it would reduce to naive Bayes if they allow multiple annotations by the same annotator independently (which they actually consider, but then rule out).
So it’s odd to see the Nigam et al. paper on semi-supervised naive Bayes text classification used as an example, as it’s not particularly relevant, so to speak. (I really like Nigam et al.’s paper, by the way — our semi-supervised naive Bayes tutorial replicates their results with some more analysis and some improved results.)
Two-Way vs. K-Way Independence
Another nitpick is that it’s not enough to assume every pair of workers is independent. The whole set needs to be independent, and these conditions aren’t the same. (I was going to link to the Wikipedia article on independent random variables, but it only considers the pairwise case. So you’ll have to go to a decent probability theory textbook like Degroot and Schervish or Larsen and Marx, where you’ll get examples of three variables that are not independent though each pair is pairwise independent.
One Last Nitpick
A further nitpick is equation (6), the second line of which has an unbound i in the p[i] term. Instead, i needs to be bound to the true category for instance m.
Synthetic Data Generation?
I also didn’t understand their synthetic data generation in 3.1. If they generate accuracies, do they take the sensitivities and specificities to be the same (in their notation, pi[k,0,0] = pi[k,1,1]). In my (and others’) experience, there’s usually a significant bias so that sensitivity is not equal to specificity for most annotators.
September 25, 2011 at 7:49 pm |
[…] Tang and Lease (2011) Semi-Supervised Consensus Labeling for Crowdsourcing […]