CrowdControl: Gold Standard Crowdsource Evals

CrowdControl is a new technology from Dolores Labs. It's a component of their new offering, CrowdFlower, a federated crowdsourcing application server/API that fronts for providers like Amazon Mechanical Turk and others. The interface looks a bit simpler than Mechanical Turk's, and there are other nice add-ons, including help with application building from the experts at Dolores. So far, I've only looked over Luke's shoulder for a demo of the beta when I was out visiting Dolores Labs.

CrowdControl: Supervised Annotator Evaluation

So what's CrowdControl? For now, at least, it's described on the CrowdFlower Examples page. In a nutshell, it's Dolores Labs' strategy of inserting gold-standard items into crowdsourced tasks and using responses to them to estimate annotator accuracy in an ongoing fashion. You may have seen this technique described and evaluated for natural language data in:

Dolores Labs uses the evaluations to give annotators feedback, filter out bad/spam annotators, and derive posterior confidence estimates for labels.
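To make the idea concrete, here's a minimal sketch of scoring annotators against the seeded gold items and flagging low scorers. This is just an illustration in Python, not CrowdFlower's actual code; the function names and the 0.7 accuracy cutoff are made up.

```python
# Sketch: estimate each annotator's accuracy from responses to embedded
# gold-standard items, then flag likely spammers. Names and threshold are
# hypothetical, not CrowdFlower's implementation.
from collections import defaultdict

def annotator_accuracy(responses, gold):
    """responses: iterable of (annotator, item, label); gold: dict item -> true label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for annotator, item, label in responses:
        if item in gold:                      # only score the seeded gold items
            total[annotator] += 1
            correct[annotator] += (label == gold[item])
    return {a: correct[a] / total[a] for a in total}

def flag_spammers(accuracy, threshold=0.7):
    """Annotators whose gold accuracy falls below an (illustrative) cutoff."""
    return {a for a, acc in accuracy.items() if acc < threshold}
```

The same per-annotator accuracies can be reused to weight votes on non-gold items when computing posterior label confidence.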

Unsupervised Approach

I took their data and showed that you could do just as well as gold-standard seeding using completely unsupervised, model-based gold-standard inference (see my blog post on Dolores Labs' data and my analysis, and the related thread of posts).
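For readers who haven't seen the model, here's a compact sketch of the kind of unsupervised inference involved, in the spirit of Dawid and Skene's EM estimator: alternate between estimating each item's true label and each annotator's confusion matrix, with no gold labels at all. This is my own toy re-implementation for illustration, not the code behind the analysis.

```python
# Sketch: Dawid & Skene-style EM for unsupervised gold-standard inference.
# annotations: list of (item, annotator, label) triples, labels in 0..K-1.
# Assumes every item has at least one annotation.
import numpy as np

def dawid_skene_em(annotations, n_items, n_annotators, n_classes, n_iter=50):
    # Initialize item-class posteriors from raw label proportions (soft majority vote).
    q = np.zeros((n_items, n_classes))
    for i, j, l in annotations:
        q[i, l] += 1.0
    q /= q.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class prevalence and per-annotator confusion matrices.
        prevalence = q.mean(axis=0) + 1e-6
        prevalence /= prevalence.sum()
        confusion = np.full((n_annotators, n_classes, n_classes), 1e-6)  # smoothing
        for i, j, l in annotations:
            confusion[j, :, l] += q[i]
        confusion /= confusion.sum(axis=2, keepdims=True)

        # E-step: posterior over each item's true class given all its annotations.
        log_q = np.tile(np.log(prevalence), (n_items, 1))
        for i, j, l in annotations:
            log_q[i] += np.log(confusion[j, :, l])
        log_q -= log_q.max(axis=1, keepdims=True)
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)

    return q, prevalence, confusion
```

The returned posteriors `q` play the role of the inferred gold standard, and the confusion matrices give you annotator accuracies as a by-product.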

The main drawback to using inference is that it requires more data to get off the ground (which isn't usually a problem in this setting). It also requires a bit more computational power to run the model, plus all the complexity of scheduling, which is admittedly a pain (we haven't implemented it). It'd also be harder for customers to understand, and it runs the risk of giving workers bad feedback if used as a feedback mechanism (but then, so does any gold standard that isn't highly redundantly rechecked).

Overestimating Confidence

I was mostly interested in posterior estimates of gold-standard labels and their accuracy. The problem is that the model used in the paper above (Dawid and Skene's) typically overestimates confidence due to false independence assumptions among annotations for an item.
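A quick made-up example of how the independence assumption inflates confidence: five annotators, each 80% accurate, all vote the same way on an item, and the posterior comes out nearly certain, even though on a genuinely hard item their errors would likely be correlated.

```python
# Toy illustration (made-up numbers): posterior under conditional independence
# with a uniform prior, five 80%-accurate annotators, all agreeing.
acc, votes = 0.8, 5
p_pos = acc ** votes
p_neg = (1 - acc) ** votes
print(p_pos / (p_pos + p_neg))   # ~0.999, almost certainly too confident
```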

The problem I have in my random effects models is that I can’t estimate the item difficulties accurately enough to draw measurably better predictive inferences. That is, I can get slightly more diffuse priors in the fully Bayesian item difficulty approach, but not enough of an estimate on item difficulty to really change the gold-standard inference.

Semi-Supervised Approach

A mixture of gold-standard inference and gold-standard seeding (a semi-supervised approach) should work even better. The model structure doesn't even change: you just get more certainty for some votes. In fact, the Bayesian Gibbs sampling and point-estimated EM computations generalize very neatly, too. I just haven't gotten around to evaluating it.
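In code, the semi-supervised mix is a one-liner on top of the EM sketch above: after each E-step, clamp the posteriors for the gold-seeded items to their known labels and let everything else be inferred as before. Again, this is an illustrative sketch; `gold` is a hypothetical dict mapping item index to known class.

```python
# Sketch: semi-supervised variant, clamping gold-seeded items to certainty.
# Call this on q after each E-step of the dawid_skene_em sketch above.
import numpy as np

def clamp_gold_labels(q, gold):
    """Overwrite inferred posteriors with certainty for gold-standard items."""
    for item, true_class in gold.items():
        q[item, :] = 0.0
        q[item, true_class] = 1.0
    return q
```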
