Like a lot of these things (e.g. collapsed Gibbs for LDA, hybrid Monte Carlo, SGD for logistic regression), the proof and formal definitions are way more complex than the straightforward implementation.

The answer to the question as I framed it is that the convention is to use trapezoids over the uninterpolated sample ROC points to estimate area under ROC. For precision-recall, the convention is to use the interpolated step functions, which is why the latter is equivalent to average precision at true positive points (as stated but not explained in the Manning et al. *IR* book). If you add a 1.0 recall and 0.0 precision point, it’ll get interpolated away. And the 0.0 recall and 1.0 precision point is not part of the curve because of the way interpolation is done.
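To make the two conventions concrete, here is a minimal Python sketch. The function names and the toy data are mine, not from any library, and it assumes binary 0/1 labels with at least one positive and one negative item. It computes the trapezoid area under the sample ROC points and the average precision taken at the ranks of the true positives.

```python
def roc_points(scores, labels):
    """Sample ROC points (false positive rate, true positive rate).

    One point per distinct score, so tied scores move diagonally.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        j = i
        # Consume every item that shares the current score.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve through the given (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def average_precision(scores, labels):
    """Mean of the precision values measured at each true positive's rank."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: -p[0])]
    tp = 0
    precisions = []
    for k, y in enumerate(ranked, start=1):
        if y:
            tp += 1
            precisions.append(tp / k)
    return sum(precisions) / len(precisions)


scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 1, 0]
auc = trapezoid_area(roc_points(scores, labels))
ap = average_precision(scores, labels)
```

On this toy ranking the positives sit at ranks 1, 3 and 4, so the average precision is the mean of 1/1, 2/3 and 3/4.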

http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U#Calculations

The first one is quadratic in the worst case, the second Theta(n log n) because you need to sort the test data by predicted score, then do a linear pass over them.
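A rough sketch of the two calculations, assuming they are the direct pair-counting method and the rank-sum method from the Wikipedia section linked above. The function names are hypothetical, labels are assumed to be 0/1, and ties get half credit (midranks in the sorted version).

```python
def auc_pairwise(scores, labels):
    """Quadratic version: compare every positive with every negative."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_rank_sum(scores, labels):
    """Sort once (n log n), then a linear pass to assign ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            j += 1
        # Tied items occupy 1-based ranks i+1 .. j; give each the average.
        midrank = (i + 1 + j) / 2.0
        for k in range(i, j):
            ranks[order[k]] = midrank
        i = j
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y)
    u = rank_sum - n_pos * (n_pos + 1) / 2.0  # Mann-Whitney U for the positives
    return u / (n_pos * n_neg)
```

Both functions should agree on any input with at least one positive and one negative item; only the running time differs.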

]]>As Dave says, it’s always recall on the x axis and precision on the y. Except for ROC curves, where it’s 1-specificity on the x axis and sensitivity (= recall) on the y axis.

]]>The red line is the P/R curve. The diagonal quickly shows you the point of equal precision and recall (about 0.77 in this contrived example). The black contour lines show F-score for equally weighted precision and recall: you can immediately find the operating point with the highest F-score (~0.85) and see its precision (~0.80) and recall (~0.92).

The script that generated this allows you to adjust the weighting of precision and recall, which changes the shape of the contour lines and the slope of the diagonal line.
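The plotting script itself isn't shown here, so as a stand-in, this is a small sketch of the usual weighted F-measure that such contour lines would trace. The function name and the `beta` parameter are my assumptions, with `beta = 1` giving the equally weighted case.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision more.
    Returns 0.0 when both inputs are 0 to avoid dividing by zero.
    """
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1.0 + b2) * precision * recall / denom


best_f = f_beta(0.80, 0.92)  # the operating point read off the plot
```

With the precision and recall quoted above for the best operating point, this comes out a little under 0.86, consistent with the ~0.85 contour.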

]]>And the Lasko et al. paper already won me over by saying you can only estimate the true ROC curve. Spot on. This always bugs me when people talk about sensitivity as true positive rate — it’s the sample true positive rate, which is also the maximum likelihood estimate, but we just don’t know the real true positive rate.

]]>(Note: When dealing with precision, recall, etc., it helps if you define 0/0 to be 1. This makes a certain amount of sense: e.g. a classifier that never predicts a positive label when there are indeed no positive items should intuitively have 100% precision. It also saves you from having to exclude otherwise undefined corner cases. That said, letting 0/0 = 1 will introduce inconsistencies if not used sparingly.)
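A minimal sketch of that convention in Python, with hypothetical helper names and assuming predictions and gold labels come as parallel lists of 0/1 values:

```python
def safe_ratio(numerator, denominator):
    """Ordinary division, except that 0/0 is defined to be 1."""
    return 1.0 if denominator == 0 else numerator / denominator


def precision_recall(predicted, actual):
    """Return (precision, recall) using the 0/0 = 1 convention."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    return safe_ratio(tp, tp + fp), safe_ratio(tp, tp + fn)


# Never predicting positive: precision is 0/0 = 1, recall is 0.
never = precision_recall([0, 0, 0], [1, 0, 0])
# No positive items in the test set, one positive prediction:
# precision is 0, recall is 0/0 = 1.
no_positives = precision_recall([1, 0], [0, 0])
```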

The two simple extreme points are these:

(A) You never predict the positive class, so your precision is trivially 1 (0/0, see note above) and your recall is necessarily 0 (except in the degenerate case where there are no actual positive items).

(B) You always predict the positive class, so your recall is necessarily 1 and your precision is simply the proportion of positive items in your test data, which can be any rational between 0 and 1.

There are two other extreme cases, but you can only achieve them with a perfect classifier:

(C) Precision is 1, recall is 1. This generally only happens when your classifier makes no mistakes (or trivially if your test set is empty).

(D) Precision is 0, recall is 0. This happens when the number of true positives is 0, there is at least one positive item (otherwise R=0/0=1), and your classifier emits at least one positive label (otherwise P=0/0=1). For this, you need to take a perfect classifier and invert its predictions.

The last corner only happens in a degenerate case:

(E) Precision is 0, recall is 1 (actually, recall is 0/0 per the Note above). This can only happen on a nonempty test set with no positive items. If you have such a test set, this situation happens whenever your classifier emits at least one positive label.

You can visualize this on a square D,A,C,E where B is somewhere along CE. In practice, if you only vary the decision threshold of a binary classifier, your P/R curve goes from A to B.
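To see the A-to-B path, here is a small sketch that sweeps the decision threshold over a ranked test set. The function name and toy data are made up for illustration, and it assumes 0/1 labels, distinct scores, and the 0/0 = 1 convention from the note above.

```python
def pr_curve(scores, labels):
    """(recall, precision) after labeling the top k items positive, k = 0..n."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: -p[0])]
    n_pos = sum(ranked)
    points = []
    tp = 0
    for k in range(len(ranked) + 1):
        if k > 0:
            tp += ranked[k - 1]
        precision = 1.0 if k == 0 else tp / k         # 0/0 = 1 at k = 0
        recall = 1.0 if n_pos == 0 else tp / n_pos    # 0/0 = 1 with no positives
        points.append((recall, precision))
    return points


curve = pr_curve([0.9, 0.8, 0.7, 0.6, 0.5], [1, 0, 1, 1, 0])
```

The first point is A (recall 0, precision 1), where nothing is labeled positive. The last point is B, where everything is labeled positive, so recall is 1 and precision is the share of positives in the test set (3 of 5 here).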

]]>T.A. Lasko et al., “The use of receiver operating characteristic curves in biomedical informatics”, Journal of Biomedical Informatics 38 (2005), 404–415.

]]>