Biocreative Encore: High Precision and High Recall Entity Extraction


I verified that the Biocreative scoring script could handle inputs with overlaps. It can! This is great. All the scoring scripts for named entity detection should work like this one. Even better, Biocreative allows 3 submissions, so we could get in our first-best rescoring model for the big F measure, but still have one submission aiming for high precision and one for high recall. We’re really excited about being able to tune these values, and we finally have a public evaluation that will allow us to submit relevant results.

So with a couple of hours still to go before a party tonight, I decided it’d be worth writing a confidence-based entry. It simply uses the CharLmHmmChunker as a ConfidenceChunker and sets a confidence threshold. I put in one entry with threshold 0.90 and one with 0.0001. The high-precision setting (probability estimate of 0.90 or better) returned only 1/5 as many entity mentions as the first-best entry. The high-recall setting (probability estimate of 0.0001 or better) returned about 5 times as many mentions as the first-best entry. Given our previous experiments on NCBI’s GeneTag corpus, as outlined in our named entity tutorial, these should yield relatively high precision and high recall respectively, compared to the first-best entry. All in, it took about another hour and a half, including digging up the final-result email addresses, submitting our results, and writing this blog entry.
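The idea is simple enough to sketch in a few lines. The snippet below is a minimal, self-contained illustration of the thresholding step, not LingPipe's actual API: the `Mention` record and its made-up scores are hypothetical stand-ins for the confidence-scored chunks a ConfidenceChunker would return, and the two thresholds mirror the 0.90 and 0.0001 settings described above.

```java
import java.util.ArrayList;
import java.util.List;

public class ConfidenceThreshold {

    // Hypothetical stand-in for a confidence-scored chunk:
    // character span plus a probability estimate for the mention.
    record Mention(int start, int end, double prob) {}

    // Keep only mentions whose probability estimate meets the threshold.
    static List<Mention> filter(List<Mention> mentions, double threshold) {
        List<Mention> kept = new ArrayList<>();
        for (Mention m : mentions) {
            if (m.prob() >= threshold) {
                kept.add(m);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up confidence scores for four candidate gene mentions.
        List<Mention> candidates = List.of(
            new Mention(0, 4, 0.97),
            new Mention(10, 15, 0.62),
            new Mention(20, 27, 0.03),
            new Mention(30, 33, 0.0004));

        // High-precision run: only near-certain mentions survive.
        System.out.println(filter(candidates, 0.90).size());   // prints 1
        // High-recall run: nearly every candidate survives.
        System.out.println(filter(candidates, 0.0001).size()); // prints 4
    }
}
```

The same chunker produces both submissions; only the cutoff changes, which is what makes tuning the precision/recall trade-off so cheap once the model is trained.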

Everything’s in the CVS sandbox module biocreative2006. See the last blog entry for anonymous checkout details.
