Trieschnigg, Pezik, Lee, de Jong, Kraaij, and Rebholz-Schuhmann (2009) MeSH Up: Effective MeSH Text Classification for Improved Document Retrieval


I’m about to implement the technique for assigning Medical Subject Heading (MeSH) terms to text described in this paper (cited above) from the TLAs EBI, HMI, and TNO ICT.

The same technique could be applied to assign Gene Ontology (GO) terms to texts, tags to tweets or blog posts or consumer products, or keywords to scientific articles.

k-NN via Search

To assign MeSH terms to a text chunk, they:

  1. convert the text chunk to a search query,
  2. run the query against a relevance-based MEDLINE index, then
  3. rank MeSH terms by frequency in the top k (k=10) hits.

In other words, k-nearest-neighbors (k-NN) where “distance” is implemented by a relevance-based search.
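The whole pipeline is only a few lines once the search engine does the heavy lifting. A minimal sketch, assuming a hypothetical search(query, k) function backed by any relevance-based MEDLINE index, where each hit is a dict carrying the citation’s MeSH terms (these names are mine, not the authors’):

```python
from collections import Counter

def assign_mesh(text, search, k=10, n_terms=25):
    """k-NN via search: the text chunk itself is the query; candidate
    MeSH terms are ranked by how many of the top-k hits carry them."""
    counts = Counter()
    for hit in search(text, k):           # the k nearest neighbors by relevance
        counts.update(hit["mesh_terms"])  # each neighbor votes with its terms
    return counts.most_common(n_terms)    # (term, frequency) pairs, best first
```

Any index could be swapped in behind search; nothing here depends on the retrieval model.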

Just the Text, Ma’am

Trieschnigg et al. concatenated the title and abstract of MEDLINE citations into a single field for both document indexing and query creation.

k-NN implemented as search could be faceted to include journals, years, authors, etc. For products, this could include all the facets seen on sites like Amazon or Newegg.

Language-Model-Based Search

They use language-model-based search, though I doubt that’s critical to its success. Specifically, they estimate a maximum-likelihood unigram language model for the query and an interpolated model (smoothed with a model trained on all documents) for each document, and then rank documents against the query by the cross-entropy of the query model given the document model. (Given the MLE estimate for the query, this is just the log probability of the query under the document’s LM.) Other LM-based search systems measure similarity by KL divergence.
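That ranking function is short to sketch. The following assumes bag-of-words token lists and a precomputed collection model mapping tokens to probabilities; the names and the Jelinek-Mercer mixing weight are my choices, not the paper’s:

```python
import math
from collections import Counter

def query_log_prob(query_tokens, doc_tokens, collection_prob, lam=0.5):
    """Score a document for a query: log probability of the query under
    the document's unigram LM, linearly interpolated with the collection
    model.  With an MLE query model, a higher log probability here means
    a lower cross-entropy of the query model given the document model,
    so ranking by either gives the same order."""
    doc = Counter(doc_tokens)
    n = len(doc_tokens)
    logp = 0.0
    for t in query_tokens:
        p_doc = doc[t] / n if n else 0.0                     # MLE document prob
        p = lam * p_doc + (1.0 - lam) * collection_prob.get(t, 1e-9)
        logp += math.log(p)                                  # sum over query tokens
    return logp
```

The collection model keeps unseen query terms from zeroing out the whole score.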

There weren’t any details on stemming, stoplisting, case normalization or tokenization in the paper or supplement; just a pointer to author Wessel Kraaij’s Ph.D. thesis on LM-based IR.

Application to MEDLINE

The text being assigned MeSH terms was another MEDLINE title-plus-abstract. This may seem redundant given that MEDLINE citations are already MeSH annotated, but it’s useful if you’re the one at NLM who has to assign the MeSH terms, or if you want a deeper set of terms (NLM only assigns a few per document).

It’s easy to apply the authors’ approach to arbitrary texts, such as paragraphs from textbooks, full-text articles, or long queries of the form favored by TREC.

Efficient k-NN!

I did a double-take when I saw k-nearest-neighbors and efficiency together. As we all know, k-NN scales linearly with training set size and MEDLINE is huge. But, in this case, the search toolkit can do the heavy lifting. The advantage of doing k-NN here is that it reproduces the same kind of sparse assignment of MeSH terms as are found in MEDLINE itself.

Error Analysis

The authors did a nice job in the little space they devoted to error analysis, with more detail in the supplements (PR curves, further parameter evaluations, and the top few hits for one example). They reported that k-NN was better than some other systems (e.g., thesaurus/dictionary-based tagging and direct search with MeSH descriptors as pseudo-documents) at assigning the sparse set of MeSH terms found in actual MEDLINE citations.

Errors tended to be more general MeSH terms that just happened to show up in related documents. I’d also like to see how sensitive performance is to the parameter setting of k=10, as it was chosen to optimize F measure against the sparse terms in MEDLINE. (All results here are for optimal operating points (aka oracle results), which means the results are almost certainly over-optimistic.)
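For reference, the per-citation metric being optimized when choosing k is set-based F measure of the predicted terms against the sparse gold-standard terms; a sketch (my formulation, with the usual zero-guard conventions):

```python
def precision_recall_f(predicted, gold):
    """Set-based precision, recall, and balanced F measure of a predicted
    MeSH term set against the gold-standard terms on a citation."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)                      # true positives
    p = tp / len(predicted) if predicted else 0.0   # precision
    r = tp / len(gold) if gold else 0.0             # recall
    f = 2 * p * r / (p + r) if p + r else 0.0       # harmonic mean
    return p, r, f
```

Sweeping k and plotting F on held-out citations would show how sensitive the method is to the k=10 setting.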

What You’ll Need for an Implementation

It should be particularly easy to reproduce for anyone with:

  1. a local copy of MEDLINE citations (titles, abstracts, and MeSH terms),
  2. a search engine to index the title-plus-abstract fields, and
  3. a little glue code to count MeSH terms in the top k hits.

Look for a demo soon.

Discriminative Semi-Supervised Classifiers?

It’d be nice to see an evaluation with text generated from MeSH and articles referencing those terms that used any of the semi-supervised or positive-only training algorithms (even just sampling negative training instances randomly) with some kind of discriminative classifier, like logistic regression or SVMs.
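A sketch of just the training-set construction for the simplest variant, random negative sampling per MeSH term (the citation representation and field names are my assumptions; the classifier on top could be any discriminative learner):

```python
import random

def positives_and_sampled_negatives(term, citations, neg_per_pos=1, seed=42):
    """Positive-only training via random negatives for one MeSH term:
    citations carrying the term are positives; negatives are sampled at
    random from the rest.  Returns (text, label) pairs."""
    rng = random.Random(seed)
    pos = [c for c in citations if term in c["mesh_terms"]]
    rest = [c for c in citations if term not in c["mesh_terms"]]
    neg = rng.sample(rest, min(len(rest), neg_per_pos * len(pos)))
    return [(c["text"], 1) for c in pos] + [(c["text"], 0) for c in neg]
```

One such binary problem per MeSH term would make 24,000-odd classifiers, which is why sampling rather than using all negatives matters for speed.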

Improving IR with MeSH (?)

I didn’t quite follow this part of the paper as I wasn’t sure what exactly they indexed and what exactly they used as queries. I think they’re assigning MeSH terms to the query and then adding them to the query. Presumably they also index the MeSH terms for this.

I did get that only the best MeSH-term assigner improved search performance (quantized to a single number using TREC’s mean average precision metric).

Alternatives like generating a 24,000-category multinomial classifier are possible, but won’t run very fast (though if it’s our tight naive Bayes, it might be as fast as the authors’ search-based approach).

4 Responses to “Trieschnigg, Pezik, Lee, de Jong, Kraaij, and Rebholz-Schuhmann (2009) MeSH Up: Effective MeSH Text Classification for Improved Document Retrieval”

  1. Dolf Trieschnigg Says:

    Thanks for the review of our paper.
    Just to clarify things a bit further about improving IR with MeSH:
    we used two indices, one with MeSH descriptors, one with full text. Two query representations were used: the original textual query, and the automatically obtained MeSH representation (determined using the KNN approach). The MeSH query is matched against the MeSH index, the text query against the text index, and the relevance scores are mixed through linear interpolation.

    Your point about the retrieval model is probably right: a vector space retrieval model will probably give similar results. In fact, Névéol et al. recently wrote a comment on our paper (see the Bioinformatics website) which shows that the PubMed Related Citations (PRC) algorithm also gives very good results. In the paper describing it, PRC outperformed BM25 (if I am not mistaken). Similarly, LM-based IR has also been shown to outperform BM25. In any case, KNN seems a pretty steady baseline for automatic indexing.
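The score mixing described in the comment above is just a per-document linear interpolation; a sketch, with score dictionaries keyed by document id and a mixing weight I’ve made up for illustration:

```python
def interpolate_scores(text_scores, mesh_scores, lam=0.5):
    """Mix relevance scores from the full-text index and the MeSH index
    by linear interpolation; a document missing from one index simply
    scores 0.0 there."""
    docs = set(text_scores) | set(mesh_scores)
    return {d: lam * text_scores.get(d, 0.0)
               + (1.0 - lam) * mesh_scores.get(d, 0.0)
            for d in docs}
```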

  2. lingpipe Says:

    Thanks — that description of the IR component really helped.

    Just because it’s easy and relatively efficient, we often work with character n-gram language model rescoring of Lucene’s n-best TF/IDF results with a fairly simple tokenizer. Looking at all the TF/IDF variants and parameterizations crossed with stemming, stoplisting and other normalization and crossed with realistic queries makes my head hurt.

    The Névéol, Mork, and Aronson comment on your paper isn’t available without a subscription. Ironic given that the authors work for NLM.

  3. Dolf Trieschnigg Says:

    Well, it’s the same for our published response to that comment: you have to pay the open-access publication charges to make it available to everyone. Fortunately, you are allowed to make the pre-print version available on your own website. In this case no editors or reviewers made or proposed any changes, so the pre-print on the author’s website is the same as the actual publication except for the final layout.

    We also investigated tokenization for the biomedical domain; have a look at our SIGIR ’07 poster (“The Influence of Basic Tokenization on Biomedical Document Retrieval”, SIGIR, 2007) and the article by Jiang and Zhai (“An Empirical Study of Tokenization Strategies for Biomedical Information Retrieval”, Information Retrieval, 2007). In our experience, generating surplus tokens really worked well, so e.g. tokenizing “P53-activator” as [p], [53], [p53], [activator] and [p53activator] increased both precision and recall.
    Interestingly, many machine learning methods use character n-gram techniques, which in my own experience don’t really work well for biomedical IR.
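The surplus-token scheme in the comment above can be sketched with a regex tokenizer; this is my own approximation, not the authors’ actual tokenizer:

```python
import re

def surplus_tokens(text):
    """'Surplus' tokenization: emit fine-grained pieces (letter runs and
    digit runs) plus joined forms (each punctuation-free part and the
    whole punctuation-stripped word), so both parts and wholes are
    indexed."""
    out = []
    for word in text.lower().split():
        pieces = re.findall(r"[a-z]+|[0-9]+", word)
        out.extend(pieces)                                   # e.g. p, 53, activator
        # joined forms: each punctuation-free part, then the whole word
        candidates = [p for p in re.split(r"[^a-z0-9]+", word) if p]
        candidates.append("".join(pieces))
        for c in candidates:
            if len(c) > 1 and c not in out:                  # skip single chars and dups
                out.append(c)
    return out
```

On “P53-activator” this yields the five tokens from the example: the pieces [p], [53], [activator] plus the joined forms [p53] and [p53activator].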

    • lingpipe Says:

      The U.S. National Institutes of Health require pre-prints to be submitted to them by any grantholders, which they then put up on PubMed. We’ll see how long that lasts.

      We are huge fans of fine-grained tokenizations. Our basic MEDLINE index has an all-info tokenization (no case norm, no punctuation removal, etc.), but a pretty fine-grained tokenization. We have a second standard reduced index. And a third that uses character n-grams.

      We’ve found the n-grams to be really robust across languages and to work well for MEDLINE. For instance, we used them in the one TREC genomics evaluation, and we use them for classifiers. We use a range of token lengths, with 3-5 grams as well as 2-grams and longer than 5-grams if we can afford it. And we use the cross-word character n-grams, which can really help with precision (boosting phrases).
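A cross-word character n-gram extractor in the spirit of the comment above (a sketch; the default length range is illustrative, not LingPipe’s actual configuration):

```python
def char_ngrams(text, n_min=2, n_max=5):
    """Character n-grams taken over the raw string, spaces included, so
    grams spanning word boundaries act as a kind of phrase boost."""
    s = text.lower()
    return [s[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(s) - n + 1)]
```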

      The best Chinese indexers I know use unigrams, bigrams, and any substrings matching dictionary entries.

      I’ve been recommending that same strategy with stemmers, which is to include “unfortunately”, “unfortunate”, “fortunate” and “fortune” when you see the first. It maintains some precision while increasing recall. The only problem is space and time for the searches.

      You can do the same thing with thesauri.

      One way of looking at all these maneuvers is as finding a good feature-based representation for classification. There have been lots of interesting papers recently on doing this with everything from multi-view learning to clustering with SVD and other techniques to deep belief nets.
