Posts Tagged ‘log loss’

Ethics, logistic regression, and 0-1 loss

December 27, 2016

Andrew Gelman and David Madigan wrote a paper on why 0-1 loss is so problematic:

This is related to the issue of whether one should be training on an artificial gold standard. Suppose we have a bunch of annotators and we don’t have perfect agreement on items. What do we do? Well, in practice, machine learning evals tend to either (1) throw away the examples without agreement (e.g., the RTE evals, some BioCreative named entity evals, etc.), or (2) go with the majority label (everything else I know of). Either way, we are throwing away a huge amount of information by reducing the labels to artificial certainty. You can see this pretty easily with simulations, and Raykar et al. showed it with real data.
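To make the information loss concrete, here’s a minimal sketch (with made-up annotations and label names, nothing from any real eval) contrasting the majority-vote label with the empirical label distribution it collapses:

```python
from collections import Counter

# Hypothetical annotations for two items, five crowd labels each.
# Item "A" has strong agreement; item "B" is genuinely contentious.
annotations = {
    "A": ["relevant", "relevant", "relevant", "relevant", "irrelevant"],
    "B": ["relevant", "irrelevant", "relevant", "irrelevant", "irrelevant"],
}

for item, labels in annotations.items():
    counts = Counter(labels)
    majority_label, _ = counts.most_common(1)[0]
    # The empirical label distribution is what majority voting throws away.
    total = sum(counts.values())
    distribution = {label: count / total for label, count in counts.items()}
    print(item, majority_label, distribution)
```

Both items come out with a single hard label, even though the second one carries far less certainty than the first; that’s the certainty we’re manufacturing.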

Yet 0/1 corpora and evaluation remain the gold standard (pun intended) in most machine learning evals. Kaggle has gone largely to log loss, but even that’s very flat around the middle of the range, as Andrew discusses in this blog post:
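To see what that flatness looks like numerically, here’s a quick sketch of the per-item log loss, which is just the negative log of the probability the system assigned to the correct label:

```python
import math

# Per-item log loss for the correct label as a function of the
# predicted probability p: loss(p) = -log(p).
for p in [0.01, 0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9, 0.99]:
    print(f"p = {p:.2f}  log loss = {-math.log(p):.3f}")
```

Moving the probability on the correct label from 0.4 to 0.6 changes the loss by only about 0.4 nats, whereas the penalty blows up as that probability heads toward zero; most of the discrimination happens at the extremes, not in the middle of the range.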

The problem is that it’s very hard to train a model that’s well calibrated if you reduce the data to an artificial gold standard. If you don’t know what I mean by calibration, check out this paper:

It’s one of my favorites. Once you’ve understood it, it’s hard not to think of evaluating models in terms of calibration and sharpness.
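For a rough operational picture of calibration and sharpness, here’s a sketch on simulated predictions (not anything from the paper itself): bin the predicted probabilities, compare each bin’s average prediction to the observed frequency of positive outcomes, and look at how spread out the predictions are.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated binary task: true probabilities, noisy predictions, outcomes.
true_probs = rng.uniform(0.0, 1.0, size=10_000)
outcomes = rng.binomial(1, true_probs)
predictions = np.clip(true_probs + rng.normal(0.0, 0.1, size=true_probs.shape),
                      0.01, 0.99)

# Calibration: within each bin, the mean prediction should match the
# observed frequency of positive outcomes.
bins = np.linspace(0.0, 1.0, 11)
which_bin = np.digitize(predictions, bins) - 1
for b in range(10):
    mask = which_bin == b
    if mask.any():
        print(f"bin {b}: mean prediction {predictions[mask].mean():.2f}, "
              f"observed frequency {outcomes[mask].mean():.2f}")

# Sharpness, crudely summarized: how far predictions spread from the base rate.
print("sharpness (std of predictions):", predictions.std().round(3))
```

A well calibrated system has bin means matching observed frequencies; a sharp one pushes its predictions away from the base rate toward 0 and 1.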

TREC 2011 Crowdsourcing Track Data Released

June 16, 2011

Here’s the home page:

Clean and Small

The data’s super clean. I love it when task organizers take the time to release clean, easy-to-use data.

I was a bit surprised that there was so little of it, but I suppose this is just the dev data. There are around 10K judgments from around 200 Mechanical Turkers for a single TREC topic (think query), with relevance labels. There appears to be the usual amount of Turker annotation noise and a hugely skewed distribution of judgments per Turker (with the most active Turkers looking spammier, again as usual).
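For a quick look at that skew, something like the following would do; it assumes a tab-separated file with one judgment per line and the worker ID in the second column, which is purely a guess at the format and will need adjusting to the actual release.

```python
from collections import Counter
import csv

# Assumed (hypothetical) format: tab-separated lines of
# topic, worker_id, doc_id, label. Adjust the path and column index.
judgments_per_worker = Counter()
with open("trec-crowdsourcing-judgments.tsv", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        judgments_per_worker[row[1]] += 1

counts = sorted(judgments_per_worker.values(), reverse=True)
print("workers:", len(counts), "judgments:", sum(counts))
print("top 10 workers by volume:", counts[:10])
print("median judgments per worker:", counts[len(counts) // 2])
```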

Mixed in Gold-Standard Data

One nice twist to the data is that there are also NIST gold-standard judgments for a subset of the items. In fact, each HIT (the task assigned to a Turker) contained 5 relevance judgments, 1 of which had previously been annotated by NIST.

As many of you may know, this is one of the strategies CrowdFlower advocates for its customers.
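As a sketch of why the embedded gold is handy: per-Turker accuracy on the previously judged items gives a cheap spam signal. The records and labels below are made up purely for illustration.

```python
from collections import defaultdict

# Hypothetical records: (worker_id, doc_id, worker_label), plus the NIST
# gold labels for the subset of documents that were previously judged.
records = [
    ("w1", "d1", "relevant"), ("w1", "d2", "irrelevant"),
    ("w2", "d1", "irrelevant"), ("w2", "d2", "irrelevant"),
]
gold = {"d1": "relevant", "d2": "irrelevant"}

hits = defaultdict(lambda: [0, 0])  # worker -> [correct, attempted]
for worker, doc, label in records:
    if doc in gold:
        hits[worker][1] += 1
        hits[worker][0] += int(label == gold[doc])

for worker, (correct, attempted) in hits.items():
    print(worker, f"accuracy on gold: {correct / attempted:.2f} ({attempted} items)")
```

Workers with low accuracy on the gold subset are the obvious candidates to flag or down-weight.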

What We Plan to Do

I don’t know how many submissions each team gets, but I’ll at least run the full Bayesian hierarchical model for a single topic (that is, the model Massimo and I described in our LREC tutorial on data annotation models). That’s a fully automatic system. I don’t know if we can remove the spam annotators by hand, but I might run that one, too. And probably something simpler with point estimates and simple voting, but that’d be four submissions. Hopefully they’ll evaluate voting themselves.
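For a flavor of the point-estimate end of that spectrum, here’s a toy Dawid-Skene-style EM on made-up binary annotations. To be clear, this is not the hierarchical Bayesian model from the tutorial, just the simplest model-based step up from voting.

```python
import numpy as np

# Toy Dawid-Skene-style EM for binary relevance (a point-estimate sketch).
# annotations[i] is a list of (worker_index, label) pairs for item i.
annotations = [
    [(0, 1), (1, 1), (2, 0)],
    [(0, 0), (1, 0), (2, 0)],
    [(0, 1), (1, 0), (2, 1)],
]
n_items, n_workers = len(annotations), 3

# Initialize item probabilities with per-item vote proportions.
p_item = np.array([np.mean([lab for _, lab in anns]) for anns in annotations])
prevalence = p_item.mean()
sens = np.full(n_workers, 0.7)   # P(worker says 1 | true label 1)
spec = np.full(n_workers, 0.7)   # P(worker says 0 | true label 0)

for _ in range(50):
    # E-step: posterior probability that each item is truly relevant.
    for i, anns in enumerate(annotations):
        like1, like0 = prevalence, 1.0 - prevalence
        for w, lab in anns:
            like1 *= sens[w] if lab == 1 else 1.0 - sens[w]
            like0 *= 1.0 - spec[w] if lab == 1 else spec[w]
        p_item[i] = like1 / (like1 + like0)
    # M-step: update prevalence and per-worker accuracies (with smoothing).
    prevalence = p_item.mean()
    num1 = np.zeros(n_workers); den1 = np.zeros(n_workers)
    num0 = np.zeros(n_workers); den0 = np.zeros(n_workers)
    for i, anns in enumerate(annotations):
        for w, lab in anns:
            den1[w] += p_item[i]
            den0[w] += 1.0 - p_item[i]
            if lab == 1:
                num1[w] += p_item[i]
            else:
                num0[w] += 1.0 - p_item[i]
    sens = (num1 + 1.0) / (den1 + 2.0)
    spec = (num0 + 1.0) / (den0 + 2.0)

print("posterior P(relevant):", p_item.round(3))
print("worker sensitivities:", sens.round(3))
print("worker specificities:", spec.round(3))
```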

Probabilistic Evaluation

I posted to their mailing list, but had no luck getting anyone interested in log loss-based evaluation. Recall that in log loss, your score is the negation of the sum of the log probabilities your system assigned to the correct answers (it’s thus the negated total log probability, which makes it a loss function; I really wish optimization had settled on maximizing gain rather than minimizing loss).

Log loss is the metric that’s being minimized by my (and any other statistical) model over the training data (that’s the M in maximum a posteriori estimation, only we’re dealing with loss, not gain). This evaluation gets at how well calibrated a system’s probabilistic predictions are.
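Concretely, the evaluation I have in mind would look something like this; the document IDs, probabilities, and gold labels below are placeholders, and the official label format will differ.

```python
import math

# System output: probability assigned to "relevant" for each document.
predicted_prob_relevant = {"d1": 0.92, "d2": 0.15, "d3": 0.55}

# Official gold-standard labels, once released (hypothetical values here).
gold = {"d1": "relevant", "d2": "irrelevant", "d3": "relevant"}

EPS = 1e-12  # guard against log(0) for overconfident systems

log_loss = 0.0
for doc, label in gold.items():
    p = predicted_prob_relevant[doc]
    p_correct = p if label == "relevant" else 1.0 - p
    log_loss -= math.log(max(p_correct, EPS))

print(f"total log loss: {log_loss:.3f}")
print(f"mean log loss per document: {log_loss / len(gold):.3f}")
```

The epsilon clip is only there so a system that assigns zero probability to a correct label gets a huge but finite penalty rather than an infinite one.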

I’ll at least do that for our submissions once we get the “official gold standard” evaluation labels.

Hopefully, they’ll do the usual TREC bang-up job of presenting things like precision-recall curves.