I couldn’t agree more with the first conclusion of this paper:

- Foster Provost, Tom Fawcett, and Ron Kohavi. 1998. The Case Against Accuracy Estimation for Comparing Induction Algorithms. In ICML.

namely:

> First, the justifications for using accuracy to compare classifiers are questionable at best.

In fact, I’d extend that conclusion to micro-averaged and macro-averaged F-measures, AUC, BEP, and other single-number summaries of classifier performance.
Foster and crew’s argument is simple. They evaluate naive Bayes, decision trees, boosted decision trees, and k-nearest-neighbor classifiers on a handful of problems from the UCI machine learning repository. They show that no classifier has what they call a dominating ROC curve on any of the problems; that is, no single curve is at least as good as the others at every operating point. For example, here’s their figure 1 (they later apply smoothing to better estimate the ROC curves):
The upshot is that depending on whether you need high recall or high precision, the “best” classifier is different. As I’ve said before, it’s horses for courses.
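The notion of one curve dominating another can be made concrete with a small sketch (the array representation and all names here are mine, not the paper’s): curve A dominates curve B if A’s true-positive rate is at least B’s at every false-positive rate, here sampled at a common grid of FPR points.

```java
// Hypothetical sketch of ROC-curve dominance (names are mine, not the
// paper's). Curves are represented by their true-positive rates sampled
// at the same false-positive-rate grid points.
public class Dominance {
    /** True if tprA[i] >= tprB[i] at every shared FPR grid point. */
    static boolean dominates(double[] tprA, double[] tprB) {
        for (int i = 0; i < tprA.length; i++)
            if (tprA[i] < tprB[i]) return false;
        return true;
    }

    public static void main(String[] args) {
        double[] a = {0.0, 0.6, 0.8, 1.0};
        double[] b = {0.0, 0.4, 0.7, 1.0};
        System.out.println(dominates(a, b)); // a is everywhere at least b
        System.out.println(dominates(b, a)); // b falls below a
    }
}
```

When neither `dominates(a, b)` nor `dominates(b, a)` holds, the curves cross, and which classifier is “best” depends on the operating point you need.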
To be a little more specific, they plot receiver operating characteristic (ROC) curves for the classifiers, which plot sensitivity (on the vertical axis) against 1 − specificity (on the horizontal axis), where:
- sensitivity = truePos / (truePos + falseNeg)
- specificity = trueNeg / (trueNeg + falsePos)
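Plugging confusion-matrix counts into these definitions is straightforward; here is a minimal illustration (the class and variable names are mine, not LingPipe’s):

```java
// Illustrative only; class and variable names are mine, not LingPipe's.
public class Rates {
    /** sensitivity = truePos / (truePos + falseNeg), the true-positive rate. */
    static double sensitivity(int truePos, int falseNeg) {
        return truePos / (double) (truePos + falseNeg);
    }

    /** specificity = trueNeg / (trueNeg + falsePos), the true-negative rate. */
    static double specificity(int trueNeg, int falsePos) {
        return trueNeg / (double) (trueNeg + falsePos);
    }

    public static void main(String[] args) {
        // 80 true positives and 20 false negatives -> sensitivity 0.8;
        // 90 true negatives and 10 false positives -> specificity 0.9.
        System.out.println(sensitivity(80, 20));
        System.out.println(specificity(90, 10));
    }
}
```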
In LingPipe, any ranked classifier can be evaluated for its ROC curve using the classifier evaluation classes.
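Independently of LingPipe’s own evaluation classes, the underlying computation can be sketched by sorting examples by classifier score and sweeping a decision threshold from the highest score down (a generic illustration, not LingPipe’s API; all names are mine; ties in scores are broken arbitrarily here):

```java
import java.util.Arrays;
import java.util.Comparator;

// Generic sketch of tracing an ROC curve from a ranked classifier's
// scores (not LingPipe's API). Each threshold yields one curve point
// (1 - specificity, sensitivity).
public class RocSketch {
    /**
     * scores[i] is the classifier's score for example i; labels[i] is true
     * for gold-standard positives. Returns points (fpRate, tpRate),
     * starting at (0, 0) and ending at (1, 1).
     */
    static double[][] rocPoints(double[] scores, boolean[] labels) {
        Integer[] idx = new Integer[scores.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort example indices by descending score.
        Arrays.sort(idx, Comparator.comparingDouble(i -> -scores[i]));

        int pos = 0, neg = 0;
        for (boolean label : labels) { if (label) pos++; else neg++; }

        double[][] points = new double[scores.length + 1][2]; // points[0] = (0,0)
        int tp = 0, fp = 0;
        for (int k = 0; k < idx.length; k++) {
            if (labels[idx[k]]) tp++; else fp++;
            points[k + 1][0] = fp / (double) neg; // 1 - specificity
            points[k + 1][1] = tp / (double) pos; // sensitivity
        }
        return points;
    }
}
```

Connecting the points traces the empirical ROC curve; the paper’s smoothing step then replaces this staircase with a better estimate of the underlying curve.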
It’d be nice to see this work extended to today’s most popular classifiers: SVMs and logistic regression.