I just finished up the doc on a complete parser and object model for the U.S. National Library of Medicine’s XML-based Medical Subject Headings (MeSH) distribution. MeSH contains over 25,000 controlled vocabulary items in the biomedical domain, organized into highly structured and cross-linked records. Here are links that explain what’s in MeSH:
- MeSH Element Descriptions
- MeSH example files and download links
- Online MeSH Vocabulary Searching and Browsing
What’s really neat is that MEDLINE is distributed with MeSH terms added by a dedicated staff of research librarians.
The parser and object representations follow the XML pretty closely, which may be familiar from LingPipe’s longstanding MEDLINE parsing and object package (which has its own tutorial). The parsers for MeSH and MEDLINE can take the compressed distribution files and parse out object-oriented representations of the associated MeSH or MEDLINE records.
The parsing pattern employed in LingPipe’s corpus package is very much like SAX’s parser/handler pattern. At one point, I wrote a whole blog entry on LingPipe’s parser/handler framework. A handler is attached to a parser, and as the parser parses the raw data, it sends object-oriented representations of records to the handler. There’s a simple demo command in the package that just prints them out to show how the code can be used.
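To make the parser/handler idea concrete, here is a minimal sketch built only on the JDK's SAX classes. It is not the LingMed API: the names `MeshHandlerSketch`, `Descriptor`, `DescriptorHandler`, `parse`, and `parseAll` are all hypothetical, and the inline sample only imitates a couple of elements from the MeSH descriptor XML (`DescriptorRecord`, `DescriptorUI`, `DescriptorName`, `String`). The real records carry far more structure than the two fields shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class MeshHandlerSketch {

    /** Hypothetical record object holding just an identifier and a name. */
    public static final class Descriptor {
        public final String ui;
        public final String name;
        Descriptor(String ui, String name) { this.ui = ui; this.name = name; }
        @Override public String toString() { return ui + " " + name; }
    }

    /** Hypothetical callback interface: receives one object per parsed record. */
    public interface DescriptorHandler {
        void handle(Descriptor descriptor);
    }

    /** Illustrative sample in the style of the MeSH descriptor XML (an assumption, heavily trimmed). */
    public static final String SAMPLE =
        "<DescriptorRecordSet>"
        + "<DescriptorRecord><DescriptorUI>D000001</DescriptorUI>"
        + "<DescriptorName><String>Calcimycin</String></DescriptorName></DescriptorRecord>"
        + "<DescriptorRecord><DescriptorUI>D000002</DescriptorUI>"
        + "<DescriptorName><String>Temefos</String></DescriptorName></DescriptorRecord>"
        + "</DescriptorRecordSet>";

    /** Parses the stream with SAX and pushes each completed record to the handler. */
    public static void parse(InputStream in, DescriptorHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser().parse(in, new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private boolean inName = false;
            private String ui;
            private String name;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                text.setLength(0); // start collecting text for the new element
                if (qName.equals("DescriptorRecord")) { ui = null; name = null; }
                if (qName.equals("DescriptorName")) inName = true;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("DescriptorUI")) {
                    // keep only the first identifier seen in the record
                    if (ui == null) ui = text.toString().trim();
                } else if (qName.equals("String")) {
                    // keep only the first name seen in the record
                    if (inName && name == null) name = text.toString().trim();
                } else if (qName.equals("DescriptorName")) {
                    inName = false;
                } else if (qName.equals("DescriptorRecord")) {
                    handler.handle(new Descriptor(ui, name)); // record complete
                }
            }
        });
    }

    /** Convenience wrapper: a handler that simply collects records into a list. */
    public static List<Descriptor> parseAll(String xml) throws Exception {
        List<Descriptor> out = new ArrayList<>();
        parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), out::add);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // A handler that just prints each record as it arrives.
        parse(new ByteArrayInputStream(SAMPLE.getBytes(StandardCharsets.UTF_8)),
              d -> System.out.println(d));
    }
}
```

The point of the pattern is that the handler never sees raw XML, only finished record objects, so swapping a printing handler for a collecting or indexing one doesn't touch the parsing code. For a gzipped file, wrapping a `FileInputStream` in a `java.util.zip.GZIPInputStream` before passing it to `parse` should be enough in a sketch like this.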
As part of the preparations for LingPipe 4.0, I’m going to be moving the MEDLINE package (com.aliasi.medline) out of LingPipe proper and into the LingMed sandbox project. LingMed will probably graduate from the sandbox and start getting distributed on its own.
LingMed also includes parsers for distributions of NLM’s Entrez-Gene, NLM et al.’s Online Mendelian Inheritance in Man (OMIM) and the GO Consortium’s Gene Ontology (GO). These all work using the same basic SAX-like pattern.
LingMed also has data access layer implementations of search for many of these data sets (like Entrez-Gene and MEDLINE) integrated with Lucene. In particular, you can do fielded searches, yet retrieve object-oriented results (rather than Lucene documents) out the other side.
You can find instructions on the LingPipe site for anonymous subversion (SVN) access to our sandbox, which includes specific information about checking out the LingMed project.
Let us know if you find it useful or want us to add other features. We’d love to get some other folks using this. As is, the documentation’s a bit sketchy in places. Please feel free to send questions about it directly to the LingPipe mailing list or to me (email@example.com).