Amaç Herdagdelen brought up some interesting examples on our mailing list that caused LingPipe’s ROC curve (and PR curve) implementation to produce some unexpected (and arguably wrong) results. (We then took the detailed discussion offline, and are now reporting back.)

#### Computing ROC Curves by Sorting

The basic problem is that the computation of an ROC curve involves first sorting the responses by the system’s estimated probability that a response is positive, then incrementally computing sensitivity versus 1 – specificity operating points by working down the ranked list. You can see a worked example in the class documentation for
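
The sort-and-sweep computation can be sketched as follows (a minimal illustration, not LingPipe’s actual API; the function name and representation are my own):

```python
# Sketch: compute ROC operating points by sorting items by descending
# score and lowering the positive-guess threshold one item at a time.
def roc_points(scored):
    """scored: list of (score, is_positive) pairs.
    Returns (1 - specificity, sensitivity) operating points."""
    ranked = sorted(scored, key=lambda p: p[0], reverse=True)
    num_pos = sum(1 for _, pos in ranked if pos)
    num_neg = len(ranked) - num_pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # guessing nothing positive
    for _, pos in ranked:
        if pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / num_neg, tp / num_pos))
    return points

print(roc_points([(0.98, True), (0.97, True), (0.96, False)]))
# [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (1.0, 1.0)]
```

Each prefix of the ranking yields one operating point; the perfect ranking above walks straight up the sensitivity axis before moving right.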

#### What if there are Ties?

Mike Ross realized there was a problem when you had two examples with the same score, but different reference values. For instance, a system might be looking at two Tweets and estimate a 0.98 probability that both are positive sentiment. If one was classified as positive in the reference (i.e., the gold standard) and the other negative, then there’s a problem. With one order, you get one curve, and with the other order, you get a different curve. Mike figured out we might as well just leave off the intermediate point, because we get to the same operating point after handling both of them.
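
Mike’s fix can be sketched by emitting an operating point only after the last item in a group of tied scores (again, an illustrative sketch rather than LingPipe’s implementation):

```python
# Sketch: skip the order-dependent intermediate point when consecutive
# ranked items share a score; tied items contribute one combined step.
def roc_points_with_ties(scored):
    ranked = sorted(scored, key=lambda p: p[0], reverse=True)
    num_pos = sum(1 for _, pos in ranked if pos)
    num_neg = len(ranked) - num_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i, (score, pos) in enumerate(ranked):
        if pos:
            tp += 1
        else:
            fp += 1
        # emit a point only when the score changes (or at the end),
        # so the within-tie ordering can't affect the curve
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / num_neg, tp / num_pos))
    return points

# Two items tied at 0.98, one POS and one NEG: no intermediate point,
# just the diagonal step both orderings end up at.
print(roc_points_with_ties([(0.98, True), (0.98, False)]))
# [(0.0, 0.0), (1.0, 1.0)]
```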

#### What if there are Near Ties?

Then Amaç wrote back after noticing he had some cases that were very close to, but not quite, the same. This comes up with arithmetic precision all the time. In his case, it was different one-versus-all results for a multi-category classifier. He suggested treating two results as the same if they were within some small absolute value of each other. This is the typical thing to do when checking that results are equal in unit tests (or code) — it’s built into JUnit for Java and GoogleTest for C++.
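
The comparison Amaç suggested looks like the standard absolute-tolerance check (the epsilon value here is an arbitrary assumption for illustration):

```python
# Treat two scores as "the same" if they differ by at most epsilon,
# in the style of JUnit's assertEquals(expected, actual, delta).
def roughly_equal(x, y, epsilon=1e-9):
    return abs(x - y) <= epsilon

print(roughly_equal(0.97, 0.97 + 1e-12))  # True
print(roughly_equal(0.97, 0.96))          # False
```

One caveat worth noting: approximate equality is not transitive (a ≈ b and b ≈ c don’t imply a ≈ c), so grouping ranked scores this way can depend on where the chain of near-ties is broken.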

#### Discretized ROC Curve Estimates are Discontinuous

That led me (Bob) to realize that the real problem is that these measures are discontinuous in the underlying scores. Suppose I have three items, A, B and C, with true values POS, NEG, and POS. Now if I have assigned scores to them of 0.97, 0.96, and 0.98, I get a curve based on the sequence TP, TP, FP, or a perfect score. I get the same result as the score for B moves from 0.96 up to any value less than 0.97. But at the point it crosses 0.97, I wind up with a sequence TP, FP, TP, which is a whole different story.

The upshot is that area under the ROC curve (or PR curve) is not continuous in the underlying scores.
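
A tiny numeric check of the jump (a sketch, not LingPipe code), using the fact that AUC equals the Mann–Whitney statistic — the probability that a randomly chosen positive item outranks a randomly chosen negative one, counting ties as half:

```python
# AUC via the rank-sum / Mann-Whitney formulation: fraction of
# (positive, negative) pairs where the positive scores higher.
def auc(scored):
    pos = [s for s, y in scored if y]
    neg = [s for s, y in scored if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A=POS, B=NEG, C=POS as in the example above.
print(auc([(0.97, True), (0.96, False), (0.98, True)]))    # 1.0
# Nudge B's score just past A's: the AUC jumps, not slides.
print(auc([(0.97, True), (0.9701, False), (0.98, True)]))  # 0.5
```

An infinitesimal move in B’s score produces a finite jump in the area, which is exactly the discontinuity at issue.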

#### Help? Should we Probabilistically Smooth?

Does anyone have any idea what we should do here?

I’m thinking moving to some kind of probabilistic version might be able to smooth things out in a well-defined way. For instance, if I add normally-distributed noise around each point and then integrate it out, things are smooth again.
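
A Monte Carlo sketch of that idea (the function name, the noise scale sigma, and the number of draws are all my own assumptions, not an implemented API):

```python
import random

# Smooth the AUC by jittering each score with normal noise,
# recomputing, and averaging over draws. As a score approaches a
# crossing point, the average moves continuously rather than jumping.
def smoothed_auc(scored, sigma=0.01, draws=2000, seed=0):
    rng = random.Random(seed)

    def auc(pairs):  # rank-sum AUC, ties counted half
        pos = [s for s, y in pairs if y]
        neg = [s for s, y in pairs if not y]
        wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
                   for p in pos for n in neg)
        return wins / (len(pos) * len(neg))

    total = 0.0
    for _ in range(draws):
        jittered = [(s + rng.gauss(0.0, sigma), y) for s, y in scored]
        total += auc(jittered)
    return total / draws

# The near-tie case that flips between AUC 1.0 and 0.5 without
# smoothing now lands somewhere in between.
print(smoothed_auc([(0.97, True), (0.9701, False), (0.98, True)]))
```

In the limit of infinitely many draws this is the integral over the noise, so the smoothed area is a continuous function of the underlying scores, with sigma controlling how aggressively near-ties are blurred together.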

October 27, 2011 at 12:39 pm |

Bob – this seems like a nice idea. This method of smoothing allows you to account for both floating point precision errors, and overfitting of the curve to the data sample (if your data is sampled from a larger set). And you can do both in a single shot, by choosing your distribution’s standard deviation. I like it much better than my solution!

Small correction to your description: I think the “perfect score” should be TP, TP, TN.

Mike

October 27, 2011 at 2:15 pm |

The perfect score is actually TP, TP, FP, and NOT TP, TP, TN. I agree that it seems confusing, but the reason you only get TP and FP on the list is that you move down the ranking of documents, each time making the next item a positive guess. You get a perfect score when all the positive examples are ranked ahead of all the negative examples. As you go down the list you hit TPs until you exhaust all the positives, then hit a string of FPs. When you interpolate, those all go away.

This is ROC, which is sensitivity (TP rate) versus 1 – specificity (FP rate).

October 27, 2011 at 4:46 pm |

I did a little digging (I’m forgetting how frustrating it is to be back outside the academic paywall), and found that Tom Fawcett implemented the same fix as Mike Ross suggested in his 2004 Machine Learning paper, “ROC Graphs: Notes and Practical Considerations for Researchers” (the first “further reading” hit on the Wikipedia page, receiver operating characteristic). If you search Google for [measurement error roc curve], you’ll also find a lot of articles, including some that use normal errors like I was suggesting.