Provost, Fawcett & Kohavi (1998) The Case Against Accuracy Estimation for Comparing Induction Algorithms


I couldn’t agree more with the first conclusion of this paper:

First, the justifications for using accuracy to compare classifiers are questionable at best.

In fact, I’d extend it to micro-averaged and macro-averaged F-measures, AUC, BEP, and other single-number summaries.

Foster and crew’s argument is simple. They evaluate naive Bayes, decision trees, boosted decision trees, and k-nearest neighbor algorithms on a handful of UCI machine learning repository problems. They show that there aren’t what they call dominating ROC curves for any of the classifiers on any of the problems. For example, here’s their figure 1 (they later apply smoothing to better estimate ROC curves):

ROC Curves from Provost et al. (1998)

The upshot is that depending on whether you need high recall or high precision, the “best” classifier is different. As I’ve said before, it’s horses for courses.

To be a little more specific, they plot receiver operating characteristic (ROC) curves for the classifiers, which show (1 − specificity) versus sensitivity.

  • sensitivity = truePos / (truePos + falseNeg)
  • specificity = trueNeg / (trueNeg + falsePos)
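The two definitions above can be computed directly from confusion-matrix counts. Here's a minimal sketch (the counts are made-up illustrative numbers, not from the paper):

```python
def sensitivity(true_pos, false_neg):
    """True positive rate: fraction of actual positives the classifier recovers."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """True negative rate: fraction of actual negatives the classifier rejects."""
    return true_neg / (true_neg + false_pos)

# Hypothetical confusion-matrix counts for illustration.
tp, fn, tn, fp = 80, 20, 90, 10
print(sensitivity(tp, fn))  # 0.8
print(specificity(tn, fp))  # 0.9
```

Each point on an ROC curve is just such a (1 − specificity, sensitivity) pair at one decision threshold.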

In LingPipe, any ranked classifier can be evaluated for its ROC curve.
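As a language-agnostic sketch of what such an evaluation computes (this is not LingPipe's API, and the data are hypothetical), an ROC curve for a ranked classifier can be traced by sweeping a decision threshold down the scores, counting true and false positives as each item is admitted:

```python
def roc_points(scored):
    """scored: list of (score, is_positive) pairs.
    Returns (false_positive_rate, true_positive_rate) points traced by
    lowering the decision threshold over the ranked scores.
    Simplification: assumes no tied scores."""
    pos = sum(1 for _, is_pos in scored if is_pos)
    neg = len(scored) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above all scores: reject everything
    for _, is_pos in sorted(scored, key=lambda pair: -pair[0]):
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

# Hypothetical ranked output: (score, true label).
curve = roc_points([(0.9, True), (0.8, False), (0.7, True), (0.3, False)])
print(curve)  # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```

Plotting such curves for several classifiers on one problem is exactly how the paper exposes the absence of a dominating curve: one classifier's curve sits higher in the low-false-positive region, another's in the high-sensitivity region.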

It’d be nice to see this work extended to today’s most popular classifiers: SVMs and logistic regression.
