I’m revising LingPipe’s `ScoredPrecisionRecallEvaluation`

class. My goal is to replace what’s currently there with more industry-standard definitions, as well as fixing some bugs (see the last section for a list).

### Uninterpolated and Interpolated Curves?

I’ve seen multiple different definitions for uninterpolated and interpolated curves.

For the uninterpolated case, I’ve seen some authors include all the points on the curve and others include only the points corresponding to relevant results (i.e., true positives). LingPipe took the latter approach, but Mannig et al.’s IR book took the former.

For the interpolated case, I’ve seen some authors (e.g., Manning et al.) set precision for a given recall point r to be the maximum precision at an operating point with recall r or above. I’ve seen others just drop points that were dominated (LingPipe’s current behavior). I seem to recall reading that others used the convex hull (which is what LingPipe’s doc says it does, but the doc’s wrong).

### Limit Points?

Should we always add the limit points corresponding to precision/recall values of (0,1) and (1,0)?

### Break-Even Point?

If we use all the points on the precision-recall curve, the break-even point (BEP) will be at the rank equal to the number of relevant results (i.e., BEP equals R-precision). If we remove points that don’t correspond to true positives, there might not be an operating point that is break even (e.g., consider results: irrel, irrel, rel, rel in the case where there are only two relevant results — BEP is 0/0 in the full case, but undefined if we only consider true positive operating points).

### Area Under the Curves?

For area under the curve (AUC) computations, I’ve also seen multiple definitions. In some cases, I’ve seen people connect all the operating points (however defined), and then calculate the actual area. Others define a step function more in keeping with interpolating to maximum precision at equal or higher recall, and then take the area under that.

### Bugs in LingPipe’s Current Definitions

For the next version of LingPipe, I need to fix three bugs:

1. The ROC curves should have 1-specificity on the X axis and sensitivity on the Y axis (LingPipe currently returns sensitivity on the X axis and specificity on the Y axis, which doesn’t correspond to anyone’s definition of ROC curves.)

2. The break-even point calculation can return too low a value.

3. Area under curve calculations don’t match anything I can find in the literature, nor do they match the definitions in the javadoc.

October 1, 2010 at 6:11 pm |

I found the following paper to be an excellent review of ROC curves:

T.A. Lasko et al., “The use of receiver operating characteristic curves in biomedical informatics”, Journal of Biomedical Informatics 38 (2005), 404–415.

October 2, 2010 at 2:54 am |

Regarding limit points, these are different for ROC curves and P/R curves. P/R curves generally don’t go from (0,1) to (1,0).

(Note: When dealing with precision, recall, etc., it helps if you define 0/0 to be 1. This makes a certain amount of sense: e.g. a classifier that never predicts a positive label when there are indeed no positive items should intuitively have 100% precision. It also saves you from having to exclude otherwise undefined corner cases. Although letting 0/0 = 1 will introduce inconsistencies if not used sparingly.)

The two simple extreme points are these:

(A) You never predict the positive class, so your precision is trivially 1 (0/0, see note above) and your recall is necessarily 0 (except in the degenerate case where there are no actual positive items).

(B) You always predict the positive class, so your recall is necessarily 1 and your precision is the maximum achievable precision on your test data, which can be any rational between 0 and 1.

There are two other extreme cases, but you can only achieve them with a perfect classifier:

(C) Precision is 1, recall is 1. This generally only happens when your classifier makes no mistakes (or trivially if your test set is empty).

(D) Precision is 0, recall is 0. This happens when the number of true positives is 0, there is at least one positive item (otherwise R=0/0=1), and your classifier emits at least one positive label (otherwise P=0/0=1). For this, you need to take a perfect classifier and invert its predictions.

The last corner only happens in a degenerate case:

(E) Precision is 0, recall is 1 (actually, recall is 0/0 per the Note above). This can only happen on a nonempty test set with no positive items. If you have such a test set, this situation happens whenever your classifier emits at least one positive label.

You can visualize this on a square D,A,C,E where B is somewhere along CE. In practice, if you only vary the decision threshold of a binary classifier, your P/R curve goes from A to B.

October 2, 2010 at 10:34 am |

Thanks! That was exactly what I was looking for on the edge cases. This is the kind of stuff that drives me crazy when I’m coding something, because most presentations in textbooks or papers don’t go over degenerate instances.

And the Lasko et al. paper already won me over by saying you can only estimate the true ROC curve. Spot on. This always bugs me when people talk about sensitivity as true positive rate — it’s the sample true positive rate, which is also the maximum likelihood estimate, but we just don’t know the real true positive rate.

October 4, 2010 at 2:05 am |

Here’s an example (using synthetic data) of how I prefer to draw precision/recall plots:

http://twitpic.com/2u95vs

The red line is the P/R curve. The diagonal quickly shows you the point of equal precision and recall (about 0.77 in this contrived example). The black contour lines show F-score for equally weighted precision and recall: you can immediately find the operating point with the highest F-score (~0.85) and see its precision (~0.80) and recall (~0.92).

The script that generated this allows you to adjust the weighting of precision and recall, which changes the shape of the contour lines and the slope of the diagonal line.

October 4, 2010 at 3:01 pm |

I like the dotted lines, too. I’d just drawn the F-measure contour lines in a previous plot but hadn’t thought about putting them together.

As Dave says, it’s always recall on the x axis and precision on the y. Except for ROC curves, where it’s 1-specificity on the x axis and sensitivity (= recall) onthe y axis.

October 6, 2010 at 2:13 am

I prefer to have recall on the y-axis so that I can have ROC and P/R plots side by side, ROC on the left. Then there’s a single y-axis, and you can read off precision, F-score, and specificity all at once for a given level of recall.

October 6, 2010 at 10:27 am

In the current LingPipe, my ROC calcs have sensitivity on the x axis for just this reason. I just rewrote the code to follow the conventions in the literature (x=1-specificity, y = sensitivity/recall for ROC; x=sensitivity/recall, y = precision/positivie-predictive-accuracy for precision-recall).

October 4, 2010 at 9:01 am |

Martin – I like displaying the F1 contour lines. But is there a reason you put precision on the x-axis? It’s almost always on the y-axis in the IR and NLP literature.

October 6, 2010 at 3:04 am |

Regarding Area Under the Curve, in the ML literature it’s usually calculated like the Wilcoxon-Mann-Whitney U statistic. This is equivalent to taking the definite integral of a step function. There are two algorithms:

http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U#Calculations

The first one is quadratic in the worst case, the second Theta(n log n) because you need to sort the test data by predicted score, then do a linear pass over them.

October 6, 2010 at 10:36 am |

The Lasko et al. paper you cited is great. It goes through alternative estimates for area under the curve and its connection to Wilcoxon/U stats. Unlike most of the presentations I see, Lasko et al. frame the problem properly as that of estimating the true ROC curve (which is not only unknown, but not directly observable).

Like a lot of these things (e.g. collapsed Gibbs for LDA, hybrid Monte Carlo, SGD for logistic regression), the proof and formal definitions are way more complex than the straightforward implementation.

The answer to the question as I framed it is that the convention is to use the parallelograms over uninterpolated sample ROC to estimate area under ROC. For precision-recall, the convention is to use the interpolated step functions, which is why the latter is equivalent to average precision at true positive points (as stated but not explained in the Manning et al.

IRbook). If you add a 1.0 recall and 0.0 precision point, it’ll get interpolated away. And the 0.0 recall and 1.0 precision point is not part of the curve because of the way interpolation is done.