- Trieschnigg, Pezik, Lee, de Jong, Kraaij, and Rebholz-Schuhmann (2009) MeSH Up: Effective MeSH Text Classification for Improved Document Retrieval. Bioinformatics. [open access home, including link to the Methodology Supplement]
The same technique could be applied to assign Gene Ontology (GO) terms to texts, tags to tweets or blog posts or consumer products, or keywords to scientific articles.
k-NN via Search
To assign MeSH terms to a text chunk, they:
- convert the text chunk to a search query,
- run the query against a relevance-based MEDLINE index, then
- rank MeSH terms by frequency in the top k (k=10) hits.
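The voting scheme above is only a few lines of code. In this sketch, the `search` callable and the `mesh_terms` attribute on hits are hypothetical stand-ins for whatever search toolkit and citation representation you actually use:

```python
from collections import Counter

def assign_mesh_terms(search, text, k=10, n_terms=25):
    """k-NN via search: the top-k retrieved citations vote for
    their MeSH terms, and terms are ranked by vote count.

    `search` maps a query string to a ranked list of hits, each
    with a .mesh_terms list (a hypothetical API, not the paper's).
    """
    hits = search(text)[:k]  # the text chunk doubles as the query
    votes = Counter(term for hit in hits for term in hit.mesh_terms)
    return [term for term, _ in votes.most_common(n_terms)]
```

The same function works for any label type, so swapping MeSH terms for GO terms or product tags only changes what `mesh_terms` holds.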
In other words, k-nearest-neighbors (k-NN) where “distance” is implemented by a relevance-based search.
Just the Text, Ma’am
Trieschnigg et al. concatenated the title and abstract of MEDLINE citations into a single field for both document indexing and query creation.
k-NN implemented as search could be faceted to include journals, years, authors, etc. For products, this could include all the facets seen on sites like Amazon or Newegg.
They use language-model-based search, though I doubt that’s critical to the approach’s success. Specifically, they estimate a maximum-likelihood unigram language model for the query and an interpolated model (smoothed with a model trained on all documents) for each document, then rank documents against the query by the cross-entropy of the query model given the document model. (With the MLE estimate for the query, this is just the log probability of the query under the document’s LM.) Other LM-based search systems measure similarity by KL divergence.
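A rough sketch of that ranking function, not the authors’ exact implementation: the Jelinek-Mercer smoothing weight `lam` and the tiny fallback probability for terms unseen in the collection are my assumptions, since the paper’s settings aren’t given here.

```python
import math
from collections import Counter

def lm_score(query_tokens, doc_tokens, collection_model, lam=0.5):
    """Log probability of the query under the document's unigram LM,
    with the document model interpolated (Jelinek-Mercer style) against
    a collection-wide model. With an MLE query model, ranking by this
    score is equivalent to ranking by query/document cross-entropy.
    """
    doc_counts = Counter(doc_tokens)
    n = len(doc_tokens)
    score = 0.0
    for t in query_tokens:
        p_doc = doc_counts[t] / n if n else 0.0
        # lam and the 1e-9 floor for unseen terms are assumed values
        p = (1 - lam) * p_doc + lam * collection_model.get(t, 1e-9)
        score += math.log(p)
    return score
```

Ranking the index is then just sorting documents by `lm_score` against the query, highest first.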
There weren’t any details on stemming, stoplisting, case normalization or tokenization in the paper or supplement; just a pointer to author Wessel Kraaij’s Ph.D. thesis on LM-based IR.
Application to MEDLINE
The text being assigned MeSH terms was another MEDLINE title-plus-abstract. This may seem redundant given that MEDLINE citations are already MeSH annotated, but it’s useful if you’re the one at NLM who has to assign the MeSH terms, or if you want a deeper set of terms (NLM only assigns a few per document).
It’s easy to apply the authors’ approach to arbitrary texts, such as paragraphs out of textbooks or full text articles or long queries of the form favored by TREC.
I did a double-take when I saw k-nearest-neighbors and efficiency together. As we all know, naive k-NN scales linearly with training-set size, and MEDLINE is huge. But in this case the search toolkit does the heavy lifting, and its inverted index retrieves the top k hits far faster than a brute-force scan over all citations. A further advantage of k-NN here is that it reproduces the same kind of sparse assignment of MeSH terms as is found in MEDLINE itself.
The authors did a nice job in the little space they devoted to error analysis, with more info in the supplements (PR curves, some more parameter evaluations, and the top few hits for one example). They reported that k-NN was better than some other systems (e.g., thesaurus/dictionary-based tagging and direct search with MeSH descriptors as pseudo-documents) at assigning the sparse set of MeSH terms found in actual MEDLINE citations.
Errors tended to be more general MeSH terms that just happened to show up in related documents. I’d also like to see how sensitive performance is to the setting k=10, which was chosen to optimize F measure against the sparse terms in MEDLINE. (All results here are reported at optimal operating points, i.e., oracle results, which means they are almost certainly over-optimistic.)
What You’ll Need for an Implementation
It should be particularly easy to reproduce for anyone with a relevance-based search toolkit and an indexed copy of MEDLINE titles and abstracts.
Look for a demo soon.
Discriminative Semi-Supervised Classifiers?
It’d be nice to see an evaluation with training text generated from MeSH and the articles referencing those terms, using any of the semi-supervised or positive-only training algorithms (even just sampling negative training instances randomly) with some kind of discriminative classifier like logistic regression or SVMs.
Improving IR with MeSH (?)
I didn’t quite follow this part of the paper, as I wasn’t sure exactly what they indexed and what they used as queries. I think they’re assigning MeSH terms to the query and then adding those terms to the query. Presumably they also index the documents’ MeSH terms so the expansion has something to match.
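My reading of the setup, sketched below: `assign_mesh` is a hypothetical function from text to a ranked MeSH term list, and plain string concatenation stands in for whatever query-combination scheme the paper actually uses.

```python
def expand_query(query_text, assign_mesh, n_terms=5):
    """Query expansion sketch (my guess at the paper's IR experiment):
    classify the query's text into MeSH terms, then append the top
    terms to the query before retrieval. n_terms=5 is an assumption.
    """
    terms = assign_mesh(query_text)[:n_terms]
    return query_text + " " + " ".join(terms)
```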
I did get that only the best MeSH-term assigner improved search performance (summarized as a single number using TREC’s mean average precision metric).
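For reference, mean average precision is just the per-query average precision, averaged over queries; average precision in turn averages the precision at each rank where a relevant document appears:

```python
def average_precision(ranked, relevant):
    """AP for one query: mean of precision@r at each rank r holding
    a relevant document, normalized by the number of relevant docs."""
    hits = 0
    total = 0.0
    for r, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / r
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```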
Alternatives like building a 24,000-category multinomial classifier are possible, but won’t run very fast (though if it’s our tight naive Bayes implementation, it might be as fast as the authors’ k-NN-via-search approach).