Andrew Gelman and David Madigan wrote a paper on why 0-1 loss is so problematic:
- Gelman and Madigan. 2015. How is Ethics Like Logistic Regression? Chance 28(2).
This is related to the question of whether one should be training on an artificial gold standard. Suppose we have a bunch of annotators and they don’t agree perfectly on every item. What do we do? In practice, machine learning evals tend to either (1) throw away the examples without agreement (e.g., the RTE evals, some BioCreative named entity evals, etc.), or (2) go with the majority label (everything else I know of). Either way, we are throwing away a huge amount of information by reducing the label to artificial certainty. You can see this pretty easily with simulations, and Raykar et al. showed it with real data.
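Here is a minimal sketch of the kind of simulation I have in mind. Everything in it is an assumption made for illustration: each item gets a hypothetical underlying probability of being labeled positive, drawn uniformly, and five independent annotators each vote according to that probability. It only measures how well each labeling scheme recovers the underlying probability, not what happens to a model trained on the labels downstream.

```python
import random

def simulate(n_items=2000, n_annotators=5, seed=0):
    """Compare majority-vote labels with vote fractions as estimates
    of each item's underlying (hypothetical) positive-label probability."""
    rng = random.Random(seed)
    hard_sq_err = 0.0  # squared error of the 0/1 majority label
    soft_sq_err = 0.0  # squared error of the fraction of positive votes
    for _ in range(n_items):
        p = rng.random()  # assumed true probability for this item
        votes = [1 if rng.random() < p else 0 for _ in range(n_annotators)]
        frac = sum(votes) / n_annotators
        majority = 1.0 if frac > 0.5 else 0.0  # odd annotator count, so no ties
        hard_sq_err += (majority - p) ** 2
        soft_sq_err += (frac - p) ** 2
    return hard_sq_err / n_items, soft_sq_err / n_items

hard_mse, soft_mse = simulate()
print(f"majority label MSE: {hard_mse:.3f}")
print(f"vote fraction MSE:  {soft_mse:.3f}")
```

Under these assumptions the vote fraction tracks the underlying probability much more closely than the majority label does, which is the information being discarded when the label is collapsed to 0/1.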
Yet 0/1 corpora and evaluation remain the gold standard (pun intended) in most machine learning evals. Kaggle has gone largely to log loss, but even that’s very flat around the middle of the range, as Andrew discusses in this blog post:
- Gelman. 2016. TOP SECRET: Newly declassified documents on evaluating models based on predictive accuracy. Blog post.
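One way to read “flat around the middle” (my reading, not a quote from the post) is to look at expected log loss as a function of the predicted probability when the true probability is 0.5. The sketch below just evaluates that formula; the function name and the grid of predictions are made up for illustration.

```python
import math

def expected_log_loss(p_true, q_pred):
    """Expected negative log likelihood (in nats) of predicting q_pred
    for a binary event whose true probability is p_true."""
    return -(p_true * math.log(q_pred) + (1 - p_true) * math.log(1 - q_pred))

p_true = 0.5
losses = {q: expected_log_loss(p_true, q) for q in (0.3, 0.4, 0.5, 0.6, 0.7)}
for q, loss in losses.items():
    print(f"predict {q:.1f}: expected log loss {loss:.4f}")
```

Predicting 0.4 instead of 0.5 costs only about 0.02 nats per item in expectation, so it takes a lot of test items before log loss can reliably tell those two predictions apart.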
The problem is that it’s very hard to train a model that’s well calibrated if you reduce the data to an artificial gold standard. If you don’t know what I mean by calibration, check out this paper:
- Gneiting, Balabdaoui, and Raftery. 2007. Probabilistic forecasts, calibration and sharpness. J. R. Statist. Soc. B 69.
It’s one of my favorites. Once you’ve understood it, it’s hard not to think of evaluating models in terms of calibration and sharpness.
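To make the two terms concrete for binary predictions, here is a rough sketch. It is a crude binned check, not the machinery in the paper, which treats general predictive distributions. The data are simulated, and the helper names (`calibration_error`, `sharpness`) and the choice of ten bins are my own assumptions.

```python
import random

def calibration_error(probs, outcomes, n_bins=10):
    """Bin predictions, then average |mean prediction - observed frequency|
    over bins, weighted by the number of predictions in each bin."""
    bins = [[0, 0.0, 0.0] for _ in range(n_bins)]  # count, sum of probs, sum of outcomes
    for p, y in zip(probs, outcomes):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[b][0] += 1
        bins[b][1] += p
        bins[b][2] += y
    total = len(probs)
    return sum(abs(sp - sy) / total for n, sp, sy in bins if n > 0)

def sharpness(probs):
    """Variance of the predictions: higher means more decisive forecasts."""
    mean = sum(probs) / len(probs)
    return sum((p - mean) ** 2 for p in probs) / len(probs)

rng = random.Random(1)
true_p = [rng.random() for _ in range(20000)]            # assumed true probabilities
outcomes = [1 if rng.random() < p else 0 for p in true_p]

calibrated = true_p                                      # predicts the true probability
hard = [1.0 if p > 0.5 else 0.0 for p in true_p]         # predicts 0/1 only

cal_err, hard_err = calibration_error(calibrated, outcomes), calibration_error(hard, outcomes)
print(f"calibrated: error {cal_err:.3f}, sharpness {sharpness(calibrated):.3f}")
print(f"hard 0/1:   error {hard_err:.3f}, sharpness {sharpness(hard):.3f}")
```

In this toy setup the hard 0/1 predictions are sharper but badly miscalibrated, while the probabilistic predictions are less sharp and close to perfectly calibrated. The goal is to be as sharp as possible while staying calibrated.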