CiteXplore: MEDLINE with Citation Details and Entity Highlighting

[Update, 10 August 2009: check out the comment for some responses to feedback from EBI’s development team.]

I found this site through the syllabus for EBI’s upcoming text mining course (October 2009):

It supports searches over MEDLINE with gene name, protein name, MeSH or protein-protein interaction highlighting. They’re sensibly exploiting external resources, such as iHOP (“info hyperlinked over proteins”), Entrez-PubMed and Entrez-Gene. I greatly prefer the link-out approach to trying to encapsulate everything on their own page.

What’s really neat is that CiteXplore mined the full-text articles for citation details, which they link back to MEDLINE. It looks very much like the ACM’s proprietary “Portal”, only over MEDLINE. One limitation is that CiteXplore only links back into MEDLINE (thus leaving some refs as “not found”), whereas ACM pulls out arbitrary references. This is something we couldn’t do, because we don’t have subscriptions to all the full-text document feeds. (I have no idea how complete CiteXplore is in this regard, but they do have proprietary articles linked for citations.)

CiteXplore supports named-entity tagging for genes and/or proteins, though it looks like gene search also includes proteins. For instance, pull up their page and do a search for [serpina3], which is a gene that has the same name in six different mammals and birds. (Sorry I can’t link to the search — like Entrez itself, they used a web engine that doesn’t allow query bookmarking; they also don’t support “open in new window” or “open in new tab”, which is really frustrating.) Then pull down the “gene name highlighting” menu item and click on the “highlight” button. The gene names it finds are rendered in green.

CiteXplore finds some of the SERPINA3 mentions, but misses others. For instance, in the context “Alpha 1-antichymotrypsin/SerpinA3” it seems to have tokenization problems, because both are names of the same gene, but neither is highlighted. The mention of the Bos taurus (cow) gene SERPINA3 (PMID 18384666) isn’t highlighted, but the Mus musculus (mouse) mention is (PMID 15638460). Unfortunately, the mouse mention is linked to the human SerpinA3 in iHOP. (Argh! iHOP’s all Javascripted up to intercept in-page search, and their own search doesn’t jump to the matches. Why, oh why, do people try to innovate on things that already work?)

CiteXplore misses mentions with too many parens — they match “alpha-1-antichymotrypsin”, but miss “alpha(1)-antichymotrypsin”. And they have trouble with mixed case, matching “SERPINA3” and “serpina3”, but missing “SerpinA3”. Other misses seem inexplicable, as in “SERPINA3 (aka alpha-1-antichymotrypsin)”.
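
To make the misses above concrete, here is a minimal Python sketch of the kind of normalization that would put these variants on the same key. The `normalize` function is hypothetical and is not anything CiteXplore or Whatizit exposes; it simply case-folds the mention and drops parentheses, hyphens and whitespace.

```python
import re

def normalize(mention: str) -> str:
    """Reduce a gene mention to a case- and punctuation-insensitive key.

    Hypothetical helper for illustration only.
    """
    lowered = mention.casefold()
    # Treat runs of whitespace, parentheses and hyphens as separators.
    parts = re.split(r"[\s()\-]+", lowered)
    # Drop empty pieces and glue the rest back together.
    return "".join(part for part in parts if part)

variants = [
    "alpha-1-antichymotrypsin",
    "alpha(1)-antichymotrypsin",
    "Alpha 1-antichymotrypsin",
    "Alpha1-antichymotrypsin",
]
keys = {normalize(v) for v in variants}       # all four collapse to one key
case_keys = {normalize(v) for v in ["SERPINA3", "serpina3", "SerpinA3"]}
```

Collapsing this aggressively buys recall at the cost of precision, so a real system would want to be pickier about which separators it ignores.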

They also miss the alias “ACT” for SERPINA3 (it overlaps with a common word, so is probably just hard filtered out). That one’s tricky, as it requires context, and is very hard to recognize with high precision.

There are also other tokenization problems in that they seem to have inserted extra whitespace after open parens. I really wish people would take more care with text data. This is why you need tokenizers that give you info on whitespace (either directly or through start/end offsets for tokens).
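
Here is a rough sketch of what I mean by a tokenizer that reports offsets, written in Python rather than anything CiteXplore actually uses. The `Token`, `tokenize` and `render` names and the bracket markup are all made up for illustration. Because every token carries its start and end position, the highlighter can copy the original whitespace verbatim instead of guessing at it.

```python
import re
from typing import List, NamedTuple, Set

class Token(NamedTuple):
    text: str
    start: int  # offset of the first character in the original text
    end: int    # offset one past the last character

# Words/numbers, or any single non-space punctuation character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text: str) -> List[Token]:
    """Tokenize while remembering where each token came from."""
    return [Token(m.group(), m.start(), m.end())
            for m in TOKEN_PATTERN.finditer(text)]

def render(text: str, tokens: List[Token], highlight: Set[int]) -> str:
    """Rebuild the text, bracketing the tokens whose indexes are in
    `highlight` and copying everything between tokens unchanged."""
    out = []
    pos = 0
    for i, tok in enumerate(tokens):
        out.append(text[pos:tok.start])  # original whitespace, untouched
        out.append(f"[{tok.text}]" if i in highlight else tok.text)
        pos = tok.end
    out.append(text[pos:])               # trailing text after the last token
    return "".join(out)

text = "SERPINA3 (aka alpha-1-antichymotrypsin)"
tokens = tokenize(text)
marked = render(text, tokens, {0})
```

With no tokens highlighted, `render` returns the input unchanged, so no stray spaces show up after parens or hyphens.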

I also tried CiteXplore’s “Proteins / Gene ontology” highlighting, but that adds precision problems to their recall problems. For instance, it matches the “Alpha 1” in “Alpha1-antichymotrypsin” to the UniProt entry for mating type protein alpha 1. They also mess up the whitespace in this application, inserting extra spaces after hyphens.

There are also bigger recall problems. The search is by exact string match, so overall, they only find 38 citations for the search [serpina3]. Entrez-PubMed itself only finds 35 citations for the same query.

CiteXplore’s advanced search allows expansion by synonym, raising the number of hits for query [serpina3] to 133. Judging from their display, they only add “aact” as a synonym for “serpina3”. iHOP, which now has greatly improved recall over its original version, finds what looks to be many more articles about SERPINA3 when searching by Entrez-Gene ID (12 for human), but it doesn’t report hit counts, and I’m not going to count them. iHOP also deals with the precision problems involved in expanding with synonyms (and also skips the synonym “ACT”).
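
For what it’s worth, synonym expansion with a filter for ambiguous aliases can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `SYNONYMS` table, the `AMBIGUOUS` stop list and the `OR` query syntax are hypothetical, not CiteXplore’s or iHOP’s actual data or query language.

```python
from typing import Dict, List, Set

# Hypothetical synonym table; a real one would come from a gene database.
SYNONYMS: Dict[str, List[str]] = {"serpina3": ["aact", "act"]}

# Aliases that collide with ordinary English words and are dropped outright.
AMBIGUOUS: Set[str] = {"act"}

def expand_query(term: str) -> str:
    """Expand a term with its unambiguous synonyms into a boolean OR query."""
    key = term.casefold()
    safe = [s for s in SYNONYMS.get(key, []) if s not in AMBIGUOUS]
    return " OR ".join([key] + safe)

query = expand_query("SERPINA3")
```

Hard filtering like this protects precision, but it gives up every genuine “ACT” mention, which is the recall trade-off described above.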

CiteXplore also has a frustratingly short fuse on session timeouts; I had to keep redoing all these searches as I wrote this post.

CiteXplore seems to be based on EBI’s text matching system Whatizit, which supports tagging arbitrary text, not just MEDLINE citations found by search. There’s also a freely downloadable paper explaining the system:

An application like this could be built with LingPipe’s exact dictionary matcher coupled with a normalizing tokenizer.
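
To give a flavor of the approach, here is a minimal Python sketch of dictionary matching over normalized tokens. It is not LingPipe’s API; all the names (`norm_tokens`, `build_dictionary`, `find_mentions`) and the two-entry dictionary are hypothetical. The tokenizer lowercases, splits letters from digits and drops punctuation, and the matcher greedily takes the longest dictionary entry at each position, reporting character offsets into the original text.

```python
import re
from typing import Dict, List, Tuple

# Runs of letters or runs of digits; punctuation and whitespace are dropped.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|[0-9]+")

def norm_tokens(text: str) -> List[Tuple[str, int, int]]:
    """Lowercased tokens paired with their (start, end) character offsets."""
    return [(m.group().lower(), m.start(), m.end())
            for m in TOKEN_PATTERN.finditer(text)]

def build_dictionary(entries: Dict[str, str]) -> Dict[Tuple[str, ...], str]:
    """Key each name by its normalized token sequence."""
    return {tuple(tok for tok, _, _ in norm_tokens(name)): gene
            for name, gene in entries.items()}

def find_mentions(text: str,
                  dictionary: Dict[Tuple[str, ...], str]
                  ) -> List[Tuple[int, int, str]]:
    """Greedy left-to-right, longest-match lookup; returns (start, end, gene)."""
    tokens = norm_tokens(text)
    max_len = max((len(key) for key in dictionary), default=0)
    mentions = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tok for tok, _, _ in tokens[i:i + n])
            if key in dictionary:
                mentions.append((tokens[i][1], tokens[i + n - 1][2],
                                 dictionary[key]))
                i += n
                break
        else:
            i += 1  # nothing starts here; move on one token
    return mentions

dictionary = build_dictionary({
    "SERPINA3": "SERPINA3",
    "alpha-1-antichymotrypsin": "SERPINA3",
})
text = "Alpha 1-antichymotrypsin/SerpinA3 and alpha(1)-antichymotrypsin"
mentions = find_mentions(text, dictionary)
spans = [text[start:end] for start, end, _ in mentions]
```

Because the matcher works on normalized tokens but reports original offsets, the mixed-case and extra-paren variants are all found without touching the whitespace in the displayed text.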

One Response to “CiteXplore: MEDLINE with Citation Details and Entity Highlighting”

  1. lingpipe Says:

    I cc-ed a copy of this blog post to the CiteXplore support page, and they got back to me instantly, which is good news for an application like this. The problem with many apps is that people build demos and then never support them.

    They’re going to look into timeouts and bookmarking. The latter requires exposing the query format, so there are issues with backward compatibility with the bookmarks.

    The 38 results versus PubMed’s 35 was due to the fact that they also return patents (I should’ve looked at the search results more closely).

    I should’ve pointed out that the open-in-new-window/tab issue is just for search buttons, which was a bit unfair as a criticism, because I don’t know how to fix the issue, which is a general one with web forms. For instance, the submit button for this form works exactly the same way as CiteXplore’s and our demos. I’m guessing you could do something with AJAX, but then it’d be non-standard, which brings its own issues.

    I think the really interesting issue with this kind of system is how the underlying text mining system (e.g. Whatizit) gets integrated into a usable and useful application (e.g. CiteXplore). The challenge is that the state of the art in text mining, tuned for maximum balanced F, has precision and recall of 0.8 at best; the maximum recall at very high precision is much lower, as is the maximum precision at very high recall.
