Amaç Herdagdelen brought up some interesting examples on our mailing list that caused LingPipe’s ROC curve (and PR curve) implementation to provide some unexpected (and arguably wrong) results. (We then took the detailed discussion off line, and are now reporting back.)
Computing ROC Curves by Sorting
The basic problem is that the computation of an ROC curve involves first sorting the responses by the system’s estimated probability a response should be positive, then incrementally computing sensitivity versus 1 – specificty operating points by working down the ranked list. You can see a worked example in the class documentation for
What if there are Ties?
Mike Ross realized there was a problem when you had two examples with the same score, but different reference values. For instance, a system might be looking at two Tweets and estimate a 0.98 probability that both are positive sentiment. If one was classified as positive in the reference (i.e., the gold standard) and the other negative, then there’s a problem. With one order, you get one curve, and with the other order, you get a different curve. Mike figured out we might as well just leave off the intermediate point, because we get to the same operating point after handling both of them.
What if there are Near Ties?
Then Amaç wrote back after noticing he had some cases that were very close, but not quite, the same. This comes up with arithmetic precision all the time. In his case, it was different one-versus-all results for a multi-category classifier. He suggested treating two results as the same if they were within some small absolute value of each other. This is the typical thing to do when checking results are the same in unit tests (or code) — it’s built into junit for Java and googletest for C++.
Discretized ROC Curve Estimates are Discontinuous
That led me (Bob) to realize that the real problem is that these measures are discontinuous in the underlying scores. Suppose I have three items, A, B and C, with true values POS, NEG, and POS. Now if I have assigned scores to them of 0.97, 0.96, and 0.98, I get a curve based on the sequence TP, TP, FP, or a perfect score. I get the same reslt as the score for B moves from 0.96 to any value less than 0.97. But at the point it crosses 0.97, I wind up with a sequence TP, FP, TP, which is a whole different story.
The upshot is that area under the ROC curve (or PR curve) is not continuous in the underlying scores.
Help? Should we Probabilistically Smooth?
Does anyone have any idea what we should do here?
I’m thinking moving to some kind of probabilistic version might be able to smooth things out in a well-defined way. For instance, if I add normally-distributed noise around each point and then integrate it out, things are smooth again.