Retraction: Only 1% Precision at 99.9% Recall for BioCreative Gene Chunks


Due to the bug we fixed in our precision-recall curve evaluations in the latest version (3.8.2) of LingPipe, I have to retract the results reported in:

  • Carpenter, Bob. 2007. LingPipe for 99.99% Recall of Gene Mentions. In Proceedings of the 2nd Biocreative Workshop. Valencia, Spain.

and unfortunately, the paragraph I contributed to:

My very first MEDLINE entry, and it’s buggy.

Original (Erroneous) Results

In those papers, I reported 7% precision at 99.99% recall, 8% precision at 99.9% recall, and 11% precision at 99% recall.

Corrected Results

It turns out that with default LingPipe settings (n-gram = 5, interpolation = 5) and a maximum of 1024 chunks per sentence in our chunk.CharLmHmmChunker, the actual results are:

Recall Precision
99% 3.6%
99.9% 0.9%
99.99% 0.6%
100% 0.5%

Reducing the n-gram length to 4 raises precision at 99.9% recall to 1%, and reducing it to 3 raises it to 1.3%. As I speculated in the paper, less tightly fit models do a bit better in high-recall settings, even though they do worse in high-F-measure evaluations.

Luckily it still only takes 2 minutes to do a complete confidence-based 20-fold cross-validation in a single thread.

The code for the evaluation’s all checked into the

The Bug

The bug causing the problem is described in LingPipe's release notes for 3.8.2:

3.8.2: Bug Fix: Scored Precision-Recall and Chunker Evaluations
We made major bug fixes for the precision-recall evaluations. There were two bugs. First, a tree set was being used where a list should've been used, causing some items to be ignored. Second, there was no way to add counts for missed items.

This bug affected the confidence-based chunker evaluator by overreporting recall in cases where the chunker did not return every reference chunk with at least some score. A new method addMisses(int) was added to the scored precision recall evaluation and called from the chunker evaluator.
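To make the two failure modes concrete, here is a minimal sketch of a scored precision-recall evaluator. It is not LingPipe's code: the class ScoredPrSketch and all of its methods are hypothetical, and only the name addMisses echoes the method mentioned in the release notes. The tree-set part shows one plausible way items can be ignored (a set ordered only by score silently drops ties); I'm assuming that mechanism purely for illustration. The misses part shows why recall is overreported when reference chunks the chunker never returned are left out of the denominator.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch of a scored precision-recall evaluator; not LingPipe's implementation. */
class ScoredPrSketch {

    /** One scored response: its confidence and whether it matches a reference chunk. */
    static final class Case {
        final boolean correct;
        final double score;
        Case(boolean correct, double score) {
            this.correct = correct;
            this.score = score;
        }
    }

    /** Orders cases by descending score only, so equal scores compare as "equal". */
    static final Comparator<Case> BY_SCORE_DESC =
        (a, b) -> Double.compare(b.score, a.score);

    private final List<Case> cases = new ArrayList<>();
    private int misses = 0;

    public void addCase(boolean correct, double score) {
        cases.add(new Case(correct, score));
    }

    /** Counts reference chunks that were never returned at any score. */
    public void addMisses(int count) {
        misses += count;
    }

    public int size() {
        return cases.size();
    }

    /** Assumed illustration of the first bug: a TreeSet keyed on score keeps one case per score. */
    public int sizeIfStoredInTreeSet() {
        TreeSet<Case> set = new TreeSet<>(BY_SCORE_DESC);
        set.addAll(cases);
        return set.size();
    }

    /** Recall when every scored response is accepted; misses stay in the denominator. */
    public double maxRecall() {
        int truePositives = 0;
        for (Case c : cases) {
            if (c.correct) truePositives++;
        }
        int totalRef = truePositives + misses;
        return totalRef == 0 ? 0.0 : (double) truePositives / totalRef;
    }

    /** Best precision over all rank cutoffs whose recall is at least minRecall; NaN if none. */
    public double precisionAtRecall(double minRecall) {
        List<Case> sorted = new ArrayList<>(cases);
        sorted.sort(BY_SCORE_DESC);
        int totalRef = misses;
        for (Case c : sorted) {
            if (c.correct) totalRef++;
        }
        double best = Double.NaN;
        int truePositives = 0;
        for (int rank = 1; rank <= sorted.size(); rank++) {
            if (sorted.get(rank - 1).correct) truePositives++;
            double recall = totalRef == 0 ? 0.0 : (double) truePositives / totalRef;
            double precision = (double) truePositives / rank;
            if (recall >= minRecall && (Double.isNaN(best) || precision > best)) {
                best = precision;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        ScoredPrSketch eval = new ScoredPrSketch();
        eval.addCase(true, 0.9);
        eval.addCase(false, 0.8);
        eval.addCase(true, 0.8);   // ties with the previous case on score
        System.out.println("list size = " + eval.size()
            + ", tree set size = " + eval.sizeIfStoredInTreeSet());
        System.out.println("max recall without misses = " + eval.maxRecall());
        eval.addMisses(2);         // two reference chunks were never returned
        System.out.println("max recall with misses = " + eval.maxRecall());
    }
}
```

In this sketch, leaving out the addMisses call makes the evaluator believe every reference chunk was found, so any recall level looks reachable. Once the misses are counted, recall levels the chunker cannot actually reach come back as NaN rather than as an optimistic precision figure.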

As I've said before, unit tests are great, but they only catch the bugs you think to check for. Lack of imagination on testing is still a killer. I'm sure it'd help to have an independent tester.

Thanks again to Mike Ross for finding the bug. He was suspicious of the results he was getting using LingPipe's chunker evaluator, and subsequently wrote an independent P/R evaluator that isolated the bug.
