Here’s the home page:
Clean and Small
The data’s super clean. I love it when task organizers take the time to release clean, easy-to-use data.
I was a bit surprised that there was so little of it, but I suppose this is just the dev data. There are around 10K judgments from around 200 Mechanical Turkers for a single TREC topic (think query) with relevance labels. There appears to be the usual Turker quantity of noisy annotations and a hugely skewed distribution of the number of judgments per Turker (with the most active Turkers looking more spammy, again as usual).
Mixed in Gold-Standard Data
One nice twist to the data is that there are also NIST gold-standard judgments for a subset of the data. In fact, each HIT (task assigned) for the Turkers had 5 relevance judgments, 1 of which was annotated previously by NIST.
As many of you may know, this is one of the strategies that CrowdFlower advocates for its customers.
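To make the embedded-gold idea concrete, here is a minimal sketch of how one might score each Turker against the NIST-labeled items. The `(worker, item, label)` triple format, the function name `accuracy_on_gold`, and the toy data are all assumptions for illustration; they are not the track's actual file format or anyone's production method.

```python
from collections import defaultdict

def accuracy_on_gold(judgments, gold):
    """Per-worker accuracy on the items that have a gold label.

    judgments: iterable of (worker, item, label) triples (assumed format).
    gold: dict mapping item -> gold label for the gold subset only.
    """
    hits = defaultdict(int)
    total = defaultdict(int)
    for worker, item, label in judgments:
        if item in gold:  # only the embedded gold items can be scored
            total[worker] += 1
            hits[worker] += int(label == gold[item])
    # workers who never saw a gold item are simply absent from the result
    return {w: hits[w] / total[w] for w in total}

# Toy data, made up for illustration.
judgments = [
    ("w1", "d1", 1), ("w1", "d2", 0),
    ("w2", "d1", 0), ("w2", "d2", 0), ("w2", "d3", 1),
]
gold = {"d1": 1, "d2": 0}
gold_accuracy = accuracy_on_gold(judgments, gold)
```

With one gold item per HIT, these estimates are noisy for Turkers who did only a few HITs, so they're better treated as a rough spam screen than as a precise accuracy.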
What We Plan to Do
I don’t know how many submissions each team gets, but I’ll at least run the full Bayesian hierarchical model for a single topic (that is, the model Massimo and I described in our LREC tutorial on data annotation models). That’s a fully automatic system. I don’t know if we can remove the spam annotators by hand, but I might run that one, too. And probably something simpler with point estimates and simple voting, but that’d be four submissions. Hopefully they’ll evaluate voting themselves.
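For reference, the simple voting baseline can be sketched in a few lines. This assumes the same hypothetical `(worker, item, label)` triples and breaks ties toward the smallest label, which is an arbitrary choice of mine rather than anything the track specifies.

```python
from collections import Counter, defaultdict

def majority_vote(judgments):
    """Pick the most frequent label per item.

    judgments: iterable of (worker, item, label) triples (assumed format).
    Ties are broken toward the smallest label (an arbitrary convention).
    """
    votes = defaultdict(Counter)
    for _worker, item, label in judgments:
        votes[item][label] += 1
    result = {}
    for item, counts in votes.items():
        top = max(counts.values())
        result[item] = min(l for l, c in counts.items() if c == top)
    return result

# Toy data, made up for illustration.
votes_in = [
    ("w1", "d1", 1), ("w2", "d1", 1), ("w3", "d1", 0),
    ("w1", "d2", 0), ("w2", "d2", 1),
]
voted = majority_vote(votes_in)
```

Voting treats every Turker as equally reliable, which is exactly what the hierarchical model is meant to improve on.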
I posted to their mailing list, but had no luck getting anyone interested in log loss-based evaluations. Recall that in log loss, your score is the negation of the sum of the log probabilities your system assigned to the correct answers (it’s thus the negated total log probability, which is a loss function; I really wish optimization had settled on maximizing gain rather than minimizing loss).
Log loss is the metric that’s being minimized by my (and any other statistical) model over the training data (that’s the M in maximum a posteriori estimation, only we’re dealing with loss, not gain). This evaluation gets at how well calibrated a system’s probabilistic predictions are.
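Here's a minimal sketch of that log loss computation for binary relevance. The dictionary inputs, the function name `log_loss`, and the clipping constant `eps` are assumptions for illustration; clipping just keeps a prediction of exactly 0 or 1 from producing an infinite loss.

```python
import math

def log_loss(probs_relevant, gold, eps=1e-15):
    """Negated sum of log probabilities assigned to the correct labels.

    probs_relevant: dict item -> predicted probability that the item is relevant.
    gold: dict item -> gold label (1 = relevant, 0 = not relevant).
    eps: clipping constant (an assumption) to avoid log(0).
    """
    total = 0.0
    for item, y in gold.items():
        p = min(max(probs_relevant[item], eps), 1.0 - eps)
        # probability the system assigned to the correct answer
        p_correct = p if y == 1 else 1.0 - p
        total -= math.log(p_correct)
    return total

# Toy data, made up for illustration.
gold_labels = {"d1": 1, "d2": 0}
calibrated = log_loss({"d1": 0.9, "d2": 0.2}, gold_labels)
overconfident = log_loss({"d1": 0.9, "d2": 0.99}, gold_labels)
```

The second system gets the same item right but is confidently wrong on the other, so it's penalized much more heavily than a system that hedged. That's the sense in which log loss rewards calibration.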
I’ll at least do that for our submissions once we get the “official gold standard” evaluation labels.
Hopefully, they’ll do the usual TREC bang-up job of presenting things like precision-recall curves.