Due to the bug we fixed in our precision-recall curve evaluations in the latest version (3.8.2) of LingPipe, I have to retract the results reported in:
- Carpenter, Bob. 2007. LingPipe for 99.99% Recall of Gene Mentions. In Proceedings of the 2nd Biocreative Workshop. Valencia, Spain.
and unfortunately, the paragraph I contributed to:
- Larry Smith et al. 2008. Overview of BioCreative II gene mention recognition. Genome Biology 9(Suppl 2):S2.
My very first MEDLINE entry, and it’s buggy.
Original (Erroneous) Results
In those papers, I reported 7% precision at 99.99% recall, 8% precision at 99.9% recall, and 11% precision at 99% recall.
It turns out that, using default LingPipe settings (n-gram = 5, interpolation = 5) with a max of 1024 chunks/sentence in our chunk.CharLmHmmChunker, the actual results are:
Reducing n-gram length to 4 raises precision at 99.9% recall to 1%, and 3-grams raise it to 1.3%. As I speculated in the paper, less tightly fit models do a bit better in high-recall settings, even though they do worse in high-F-measure evaluations.
Luckily it still only takes 2 minutes to do a complete confidence-based 20-fold cross-validation in a single thread.
The code for the evaluation’s all checked into the
- LingPipe Sandbox, project name
The bug causing the problem is described in LingPipe's release notes for 3.8.2:
3.8.2: Bug Fix: Scored Precision-Recall and Chunker Evaluations
We made major bug fixes for the precision-recall evaluations. There were two bugs. First, a tree set was being used where a list should've been used, causing some items to be ignored. Second, there was no way to add counts for missed items.
This bug affected the confidence-based chunker evaluator by overreporting recall in cases where the chunker did not return every reference chunk with at least some score. A new method addMisses(int) was added to the scored precision recall evaluation and called from the chunker evaluator.
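To make the two bugs concrete, here is a minimal Java sketch. It is not LingPipe's evaluator code: the class ScoredPrSketch, its nested Scored class, and both methods are hypothetical names made up for illustration, and the idea that the tree set ordered items by score alone is my assumption about how items could end up ignored.

```java
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration only; not LingPipe's actual implementation.
class ScoredPrSketch {

    // One scored response from a chunker: was it correct, and how confident?
    static class Scored {
        final boolean correct;
        final double score;

        Scored(boolean correct, double score) {
            this.correct = correct;
            this.score = score;
        }
    }

    // Bug 1 (assumed mechanism): a TreeSet ordered only by score treats
    // equal-scored items as duplicates and keeps just the first of them.
    static int countKeptInTreeSet(List<Scored> items) {
        Comparator<Scored> byScore = Comparator.comparingDouble(s -> s.score);
        TreeSet<Scored> set = new TreeSet<>(byScore);
        set.addAll(items);
        return set.size();
    }

    // Bug 2: recall must count reference items the system never returned.
    // Passing numMisses = 0 reproduces the overreported recall.
    static double recall(List<Scored> items, int numMisses) {
        int truePositives = 0;
        for (Scored item : items) {
            if (item.correct) {
                truePositives++;
            }
        }
        int totalReference = truePositives + numMisses;
        return totalReference == 0 ? 1.0 : (double) truePositives / totalReference;
    }
}
```

With three responses scored 0.9, 0.9, and 0.5, the tree set in this sketch keeps only two of them, whereas a list would keep all three. If two of the responses are correct and two reference chunks were never returned at all, recall is 2/4 = 0.5 once the misses are counted, rather than the 1.0 you get by ignoring them.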
As I've said before, unit tests are great, but they only catch the bugs you think to check for. Lack of imagination in testing is still a killer. I'm sure it'd help to have an independent tester.
Thanks again to Mike Ross for finding the bug. He was suspicious of the results he was getting using LingPipe's chunker evaluator, and subsequently wrote an independent P/R evaluator that isolated the bug.