[Update, 10 August 2009: check out the comment for some responses to feedback from EBI’s development team.]
I found this site through the syllabus for EBI’s upcoming text mining course (October 2009):
- EBI: CiteXplore
It supports searches over MEDLINE with gene name, protein name, MeSH or protein-protein interaction highlighting. They’re sensibly exploiting external resources, such as iHOP (“info hyperlinked over proteins”), Entrez-PubMed and Entrez-Gene. I greatly prefer the link-out approach to trying to encapsulate everything on their own page.
What’s really neat is that CiteXplore mined the full-text articles for citation details, which they link back to MEDLINE. It looks very much like the ACM’s proprietary “Portal”, only over MEDLINE. One limitation is that CiteXplore only links back into MEDLINE (thus leaving some refs as “not found”), whereas ACM pulls out arbitrary references. This is something we couldn’t do, because we don’t have subscriptions to all the full-text document feeds. (I have no idea how complete CiteXplore is in this regard, but they do have proprietary articles linked for citations.)
CiteXplore supports named-entity tagging for genes and/or proteins, though it looks like gene search also includes proteins. For instance, pull up their page and do a search for [serpina3], which is a gene that has the same name in six different mammals and birds. (Sorry I can’t link to the search — like Entrez itself, they used a web engine that doesn’t allow query bookmarking; they also don’t support “open in new window” or “open in new tab”, which is really frustrating.) Then pull down the “gene name highlighting” menu item and click on the “highlight” button. The gene names it finds are rendered in green.
CiteXplore misses mentions with too many parens — they match “alpha-1-antichymotrypsin”, but miss “alpha(1)-antichymotrypsin”. And they have trouble with mixed case, matching “SERPINA3” and “serpina3”, but missing “SerpinA3”. Other misses seem inexplicable, as in “SERPINA3 (aka alpha-1-antichymotrypsin)”.
They also miss the alias “ACT” for SERPINA3 (it overlaps with a common word, so is probably just hard filtered out). That one’s tricky, as it requires context, and is very hard to recognize with high precision.
There are also other tokenization problems in that they seem to have inserted extra whitespace after open parens. I really wish people would take more care with text data. This is why you need tokenizers that give you info on whitespace (either directly or through start/end offsets for tokens).
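To make the point about offsets concrete, here is a minimal Python sketch of a tokenizer that keeps start/end positions for every token. The token pattern and the names `Token`, `tokenize` and `whitespace_before` are my own illustrative choices, not anything CiteXplore or Whatizit is known to use.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    start: int  # inclusive character offset into the original string
    end: int    # exclusive character offset into the original string


# Assumed pattern: a run of ASCII letters/digits is one token, and any
# other non-whitespace character (paren, hyphen, ...) is a token by itself.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")


def tokenize(text: str) -> List[Token]:
    """Return tokens with offsets, leaving the input string untouched."""
    return [Token(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


def whitespace_before(text: str, tokens: List[Token], i: int) -> str:
    """Recover the exact whitespace that preceded token i in the original."""
    prev_end = tokens[i - 1].end if i > 0 else 0
    return text[prev_end:tokens[i].start]


text = "SERPINA3 (aka alpha-1-antichymotrypsin)"
tokens = tokenize(text)
# Re-rendering from offsets reproduces the original spacing exactly,
# so nothing extra appears after the open paren.
rendered = "".join(
    whitespace_before(text, tokens, i) + tok.text for i, tok in enumerate(tokens)
)
```

Because highlighting can be applied by slicing the original string at `start` and `end`, there is never a need to glue tokens back together with guessed spaces.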
I also tried CiteXplore’s “Proteins / Gene ontology” highlighting, but that adds precision problems to their recall problems. For instance, it matches the “Alpha 1” in “Alpha1-antichymotrypsin” to the UniProt entry for mating type protein alpha 1. They also mess up the whitespace in this application, inserting extra spaces after hyphens.
There are also bigger recall problems. The search is by exact string match, so overall, they only find 38 citations for the search [serpina3]. Entrez-PubMed itself only finds 35 citations for the same query.
CiteXplore’s advanced search allows expansion by synonym, raising the number of hits for query [serpina3] to 133. Judging from their display, they only add “aact” as a synonym for “serpina3”. iHOP, which now has greatly improved recall over its original version, finds what looks to be many more articles about SERPINA3 when searching by Entrez-Gene ID (12 for human), but it doesn’t report hit numbers, and I’m not going to count them. iHOP also deals with the precision problems involved in expanding with synonyms (and also skips the synonym “ACT”).
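As a rough illustration of synonym expansion with a hard filter for ambiguous aliases, here is a small Python sketch. The synonym table, the common-word list and the quoted `OR` query syntax are all assumptions for the example; I don’t know how CiteXplore or iHOP actually implement this.

```python
from typing import Dict, List, Set


def expand_query(term: str, synonyms: Dict[str, List[str]], common_words: Set[str]) -> str:
    """Build an OR query from a term and its synonyms.

    Synonyms that collide with a common word are dropped outright,
    trading recall for precision. The original term is always kept.
    """
    variants = [term] + synonyms.get(term.lower(), [])
    kept: List[str] = []
    seen: Set[str] = set()
    for variant in variants:
        key = variant.lower()
        if key in seen:
            continue  # skip case-insensitive duplicates
        if variant != term and key in common_words:
            continue  # hard-filter ambiguous aliases such as "ACT"
        seen.add(key)
        kept.append(variant)
    return " OR ".join('"' + v + '"' for v in kept)


# Hypothetical synonym table and stoplist, for illustration only.
SYNONYMS = {"serpina3": ["aact", "ACT", "alpha-1-antichymotrypsin"]}
COMMON_WORDS = {"act"}

query = expand_query("serpina3", SYNONYMS, COMMON_WORDS)
```

A hard filter like this is the blunt option; recovering “ACT” mentions with decent precision would need a context-sensitive model rather than a stoplist.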
CiteXplore also has a frustratingly short fuse on session timeouts; I had to keep redoing all these searches as I wrote this post.
CiteXplore seems to be based on EBI’s text matching system Whatizit, which supports tagging arbitrary text, not just MEDLINE citations found by search. There’s also a freely downloadable paper explaining the system:
- Rebholz-Schuhmann D, Arregui M, Gaudan S, Kirsch H, Jimeno A. 2008. Text processing through Web services: calling Whatizit. Bioinformatics 24(2):296–298.
An application like this could be built with LingPipe’s exact dictionary matcher coupled with a normalizing tokenizer.
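To show the idea without depending on any particular library, here is a minimal Python sketch of exact dictionary matching over normalized tokens. It is not LingPipe’s API; the class name `DictionaryMatcher`, the normalization rules (lowercase, split letters from digits, drop punctuation) and the label string are all assumptions chosen to handle the misses described above.

```python
import re
from typing import Dict, List, Tuple

# Assumed normalization: letter runs and digit runs are separate tokens,
# punctuation and whitespace are dropped, and case is folded.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+")


def norm_tokens(text: str) -> List[Tuple[str, int, int]]:
    """Return (normalized token, start, end) triples with original offsets."""
    return [(m.group().lower(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]


class DictionaryMatcher:
    def __init__(self, entries: Dict[str, str]):
        # Key each dictionary phrase by its normalized token sequence.
        self.phrases: Dict[Tuple[str, ...], str] = {}
        for phrase, label in entries.items():
            key = tuple(tok for tok, _, _ in norm_tokens(phrase))
            if key:
                self.phrases[key] = label
        self.max_len = max((len(k) for k in self.phrases), default=0)

    def find(self, text: str) -> List[Tuple[str, int, int, str]]:
        """Greedy left-to-right, longest-match scan over the token sequence."""
        toks = norm_tokens(text)
        hits = []
        i = 0
        while i < len(toks):
            for n in range(min(self.max_len, len(toks) - i), 0, -1):
                key = tuple(tok for tok, _, _ in toks[i:i + n])
                if key in self.phrases:
                    start, end = toks[i][1], toks[i + n - 1][2]
                    hits.append((text[start:end], start, end, self.phrases[key]))
                    i += n
                    break
            else:
                i += 1  # no phrase starts here; move on one token
        return hits


matcher = DictionaryMatcher({
    "SERPINA3": "EntrezGene:12",
    "alpha-1-antichymotrypsin": "EntrezGene:12",
})
hits = matcher.find("SerpinA3 (aka alpha(1)-antichymotrypsin) and Alpha1-antichymotrypsin")
```

With this normalization, “SerpinA3”, “alpha(1)-antichymotrypsin” and “Alpha1-antichymotrypsin” all match, and the reported spans point into the untouched original text. The cost is precision: dropping punctuation means a phrase can match across a sentence boundary, so a real system would want to be more selective about which separators it ignores.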