I’m revising LingPipe’s
ScoredPrecisionRecallEvaluation class. My goal is to replace what’s currently there with more industry-standard definitions, as well as fixing some bugs (see the last section for a list).
Uninterpolated and Interpolated Curves?
I’ve seen multiple different definitions for uninterpolated and interpolated curves.
For the uninterpolated case, I’ve seen some authors include all the points on the curve and others include only the points corresponding to relevant results (i.e., true positives). LingPipe took the latter approach, but Mannig et al.’s IR book took the former.
For the interpolated case, I’ve seen some authors (e.g., Manning et al.) set precision for a given recall point r to be the maximum precision at an operating point with recall r or above. I’ve seen others just drop points that were dominated (LingPipe’s current behavior). I seem to recall reading that others used the convex hull (which is what LingPipe’s doc says it does, but the doc’s wrong).
Should we always add the limit points corresponding to precision/recall values of (0,1) and (1,0)?
If we use all the points on the precision-recall curve, the break-even point (BEP) will be at the rank equal to the number of relevant results (i.e., BEP equals R-precision). If we remove points that don’t correspond to true positives, there might not be an operating point that is break even (e.g., consider results: irrel, irrel, rel, rel in the case where there are only two relevant results — BEP is 0/0 in the full case, but undefined if we only consider true positive operating points).
Area Under the Curves?
For area under the curve (AUC) computations, I’ve also seen multiple definitions. In some cases, I’ve seen people connect all the operating points (however defined), and then calculate the actual area. Others define a step function more in keeping with interpolating to maximum precision at equal or higher recall, and then take the area under that.
Bugs in LingPipe’s Current Definitions
For the next version of LingPipe, I need to fix three bugs:
1. The ROC curves should have 1-specificity on the X axis and sensitivity on the Y axis (LingPipe currently returns sensitivity on the X axis and specificity on the Y axis, which doesn’t correspond to anyone’s definition of ROC curves.)
2. The break-even point calculation can return too low a value.
3. Area under curve calculations don’t match anything I can find in the literature, nor do they match the definitions in the javadoc.