Confidence-Based Gene Mentions for all of MEDLINE

I ran LingPipe’s new confidence-based named-entity extractor over every title and abstract body in MEDLINE. The model is the one distributed on our site built from the NLM GeneTag corpus (a refined version of the first BioCreative corpus) — that’s a compiled com.aliasi.chunk.CharLmHmmChunker. There’s just a single category, GENE.
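
Here is a minimal sketch of what a minimum-confidence cutoff does to the extractor's output. It assumes each candidate mention carries a log (base 2) conditional probability estimate as its score, so that 2^score is the confidence. The Mention class and the filter method are hypothetical stand-ins for illustration, not LingPipe classes.

```java
import java.util.ArrayList;
import java.util.List;

final class ConfidenceFilter {

    /** Hypothetical stand-in for a scored chunk; not a LingPipe class. */
    static final class Mention {
        final int start;        // character offset of first char
        final int end;          // character offset one past last char
        final double log2Prob;  // ASSUMPTION: score is a log (base 2) probability

        Mention(int start, int end, double log2Prob) {
            this.start = start;
            this.end = end;
            this.log2Prob = log2Prob;
        }

        /** Converts the log2 score back to a probability in [0,1]. */
        double confidence() {
            return Math.pow(2.0, log2Prob);
        }
    }

    /** Keeps only the mentions whose confidence is at least minConfidence. */
    static List<Mention> filter(List<Mention> candidates, double minConfidence) {
        List<Mention> kept = new ArrayList<>();
        for (Mention m : candidates) {
            if (m.confidence() >= minConfidence) {
                kept.add(m);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Mention> candidates = new ArrayList<>();
        candidates.add(new Mention(0, 3, -0.1));    // high confidence
        candidates.add(new Mention(10, 14, -5.0));  // 2^-5 = 0.03125
        candidates.add(new Mention(20, 25, -12.0)); // 2^-12, below 0.001
        for (Mention m : filter(candidates, 0.001)) {
            System.out.printf("%d-%d GENE %.4f%n", m.start, m.end, m.confidence());
        }
    }
}
```

A low cutoff keeps a lot of long-shot candidates, which favors recall; downstream consumers can always re-threshold at something stricter.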

The 2006 MEDLINE baseline contains 10.2 billion characters in titles and abstracts (with brackets for translations cut out of titles and truncation messages removed from abstracts). I extracted the text using LingPipe’s MEDLINE parser and wrote the output in a gzipped form almost identical to that used in NLM’s GeneTag corpus (also used for BioCreative).
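
For a rough idea of the gzipped, line-oriented output, here is a sketch using only java.util.zip. The record layout (a pipe-delimited id, offsets, text and confidence) is an illustrative assumption, not the exact format I wrote, and the GzipMentions class is hypothetical. It round-trips through a byte array to stay self-contained; for a real run you would swap in file streams.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipMentions {

    /** Writes one record per line, UTF-8 encoded, gzip compressed. */
    static byte[] gzip(List<String> records) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(
                new GZIPOutputStream(bytes), StandardCharsets.UTF_8)) {
            for (String record : records) {
                out.write(record);
                out.write('\n');
            }
        } // closing the writer finishes the gzip stream
        return bytes.toByteArray();
    }

    /** Reads the records back, one per line. */
    static List<String> gunzip(byte[] data) throws IOException {
        List<String> records = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(data)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                records.add(line);
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // ASSUMPTION: illustrative layout only: id|start end|text|confidence
        List<String> records = Arrays.asList(
            "doc1|14 17|p53|0.9812",
            "doc1|40 45|BRCA1|0.0031");
        byte[] compressed = gzip(records);
        System.out.println(gunzip(compressed));
    }
}
```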

I set the minimum confidence to be 0.001. I set the caches to be 10M entries each, but then capped the JVM memory at 2GB, so the soft references in the cache are getting collected when necessary. I should try it with a smaller cache that won’t get GC-ed and see if the cache is better at managing itself than the GC is.
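
To make the cache-versus-GC tradeoff concrete, here is a generic sketch of a size-capped cache that holds its values through soft references. It is not LingPipe's cache implementation; the SoftCache class and its LRU eviction policy are assumptions made for illustration. The size cap is the cache managing itself, while the soft references let the GC throw entries away under memory pressure.

```java
import java.lang.ref.SoftReference;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical size-capped cache with softly referenced values. */
final class SoftCache<K, V> {

    private final Map<K, SoftReference<V>> map;

    SoftCache(final int maxEntries) {
        // Access-ordered LinkedHashMap evicts the least recently used
        // entry once the cap is exceeded (the cache managing itself).
        this.map = new LinkedHashMap<K, SoftReference<V>>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, SoftReference<V>> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached value, or null if absent or collected by the GC. */
    V get(K key) {
        SoftReference<V> ref = map.get(key);
        if (ref == null) {
            return null;
        }
        V value = ref.get();
        if (value == null) {
            map.remove(key); // the GC cleared the referent; drop the stale slot
        }
        return value;
    }

    void put(K key, V value) {
        map.put(key, new SoftReference<>(value));
    }

    int size() {
        return map.size();
    }

    public static void main(String[] args) {
        SoftCache<String, String> cache = new SoftCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.put("c", "3"); // exceeds the cap, so "a" is evicted
        System.out.println(cache.get("a") + " " + cache.get("c"));
    }
}
```

With a cap far above what the heap can hold, as in my 10M-entry setup, the size limit never kicks in and the GC is effectively the eviction policy. A smaller cap flips that around.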

Including the I/O, XML parsing, gzipping and unzipping, and MEDLINE DOM construction, it all ran over all of MEDLINE in just under 9 hours. That’s 330,000 characters/second!!! That’s on a fairly modest machine (1.8GHz dual Opteron, 8GB PC2700 ECC memory, Windows64, 64-bit 1.5 JDK in server mode). That’s in a single analysis thread (of course, the 1.5 server JVM uses a separate thread for GC).
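
As a sanity check on the arithmetic, 10.2 billion characters at 330,000 characters/second works out to about 8.6 hours, which squares with "just under 9 hours" (a full 9 hours would be roughly 315,000 characters/second). The little Throughput class below is just a hypothetical helper for redoing that calculation.

```java
final class Throughput {

    /** Hours needed to process totalChars at the given rate. */
    static double hours(double totalChars, double charsPerSecond) {
        return totalChars / charsPerSecond / 3600.0;
    }

    /** Rate in characters/second for totalChars processed in the given hours. */
    static double charsPerSecond(double totalChars, double hours) {
        return totalChars / (hours * 3600.0);
    }

    public static void main(String[] args) {
        double totalChars = 10.2e9; // titles plus abstracts, from the text above
        System.out.printf("hours at 330,000 chars/sec: %.2f%n",
                          hours(totalChars, 330000.0));
        System.out.printf("chars/sec over a full 9 hours: %.0f%n",
                          charsPerSecond(totalChars, 9.0));
    }
}
```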

All I can say is woot!
