<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Trieschnigg, Pezik, Lee, de Jong, Kraaij, and Rebholz-Schuhmann (2009) MeSH Up: Effective MeSH Text Classification for Improved Document Retrieval</title>
	<atom:link href="http://lingpipe-blog.com/2009/08/28/trieschnigg-2009-mesh-up-effective-mesh-textclassification-for-improved-document-retrieval/feed/" rel="self" type="application/rss+xml" />
	<link>http://lingpipe-blog.com/2009/08/28/trieschnigg-2009-mesh-up-effective-mesh-textclassification-for-improved-document-retrieval/</link>
	<description>Natural Language Processing and Text Analytics</description>
	<lastBuildDate>Wed, 08 Feb 2012 17:47:08 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>By: lingpipe</title>
		<link>http://lingpipe-blog.com/2009/08/28/trieschnigg-2009-mesh-up-effective-mesh-textclassification-for-improved-document-retrieval/#comment-5421</link>
		<dc:creator><![CDATA[lingpipe]]></dc:creator>
		<pubDate>Tue, 01 Sep 2009 21:50:39 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=2079#comment-5421</guid>
		<description><![CDATA[The U.S. National Institutes of Health require pre-prints to be submitted to them by any grantholders, which they then put up on PubMed. We’ll see how long that lasts.

We are huge fans of fine-grained tokenizations. Our basic MEDLINE index uses a tokenization that keeps all information (no case normalization, no punctuation removal, etc.) but is still pretty fine-grained. We have a second, standard reduced index, and a third that uses character n-grams.

We’ve found the n-grams to be really robust across languages and to work well for MEDLINE. For instance, we used them in the one TREC genomics evaluation, and we use them for classifiers. We use a range of n-gram lengths: 3- to 5-grams, plus 2-grams and n-grams longer than 5 if we can afford it. And we use the cross-word character n-grams, which can really help with precision (boosting phrases).

The best Chinese indexers I know use unigrams, bigrams, and any substrings matching dictionary entries. 

I’ve been recommending that same strategy with stemmers, which is to include “unfortunately”, “unfortunate”, “fortunate” and “fortune” when you see the first. It maintains some precision while increasing recall. The only problem is space and time for the searches. 

You can do the same thing with thesauri.

One way of looking at all these maneuvers is as finding a good feature-based representation for classification.  There have been lots of interesting papers recently on doing this with everything from multi-view learning to clustering with SVD and other techniques to deep belief nets. ]]></description>
		<content:encoded><![CDATA[<p>The U.S. National Institutes of Health require pre-prints to be submitted to them by any grantholders, which they then put up on PubMed. We’ll see how long that lasts.</p>
<p>We are huge fans of fine-grained tokenizations. Our basic MEDLINE index uses a tokenization that keeps all information (no case normalization, no punctuation removal, etc.) but is still pretty fine-grained. We have a second, standard reduced index, and a third that uses character n-grams.</p>
<p>We’ve found the n-grams to be really robust across languages and to work well for MEDLINE. For instance, we used them in the one TREC genomics evaluation, and we use them for classifiers. We use a range of n-gram lengths: 3- to 5-grams, plus 2-grams and n-grams longer than 5 if we can afford it. And we use the cross-word character n-grams, which can really help with precision (boosting phrases).</p>
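<p>As a rough illustration of cross-word character n-grams, here is a minimal Python sketch. The function name char_ngrams and its parameters are hypothetical, chosen only for this example; it is not the LingPipe implementation. Spaces are kept in the text, so some n-grams span word boundaries.</p>

```python
def char_ngrams(text, min_n=2, max_n=5):
    """Return every character n-gram of text for n in [min_n, max_n].

    Whitespace is kept, so grams such as "53 a" cross word boundaries.
    """
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the whole string.
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams


# Hypothetical usage: 3- to 5-grams over a two-word phrase.
grams = char_ngrams("p53 activator", min_n=3, max_n=5)
print("53 a" in grams)  # True: this 4-gram spans the space between the words
```

<p>Because the cross-word grams only match when the two words are adjacent in the same order, they act as a cheap phrase signal on top of the within-word grams.</p>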
<p>The best Chinese indexers I know use unigrams, bigrams, and any substrings matching dictionary entries. </p>
<p>I’ve been recommending that same strategy with stemmers, which is to include “unfortunately”, “unfortunate”, “fortunate” and “fortune” when you see the first. It maintains some precision while increasing recall. The only problem is space and time for the searches. </p>
<p>You can do the same thing with thesauri.</p>
<p>One way of looking at all these maneuvers is as finding a good feature-based representation for classification.  There have been lots of interesting papers recently on doing this with everything from multi-view learning to clustering with SVD and other techniques to deep belief nets.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dolf Trieschnigg</title>
		<link>http://lingpipe-blog.com/2009/08/28/trieschnigg-2009-mesh-up-effective-mesh-textclassification-for-improved-document-retrieval/#comment-5417</link>
		<dc:creator><![CDATA[Dolf Trieschnigg]]></dc:creator>
		<pubDate>Tue, 01 Sep 2009 07:55:39 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=2079#comment-5417</guid>
		<description><![CDATA[Well, it&#039;s the same for our published response to that comment: you have to pay the open-access publication charges to make it available to everyone. Fortunately, you are allowed to make the pre-print version available on your own website. In this case, no editors or reviewers made or proposed any changes, so the pre-print on the author&#039;s website is the same as the actual publication except for the final layout.

We also investigated tokenization for the biomedical domain; have a look at our SIGIR&#039;07 poster (&quot;The Influence of Basic Tokenization on Biomedical Document Retrieval&quot;, SIGIR, 2007) and the article by Jiang and Zhai (&quot;An Empirical Study of Tokenization Strategies for Biomedical Information Retrieval&quot;, IR, 2007). In our experience, generating surplus tokens worked really well: for example, tokenizing &quot;P53-activator&quot; as [p], [53], [p53], [activator] and [p53activator] increased both precision and recall.
Interestingly, many machine learning methods use character n-gramming techniques, whereas in my own experience these don&#039;t really work well for biomedical IR.]]></description>
		<content:encoded><![CDATA[<p>Well, it&#8217;s the same for our published response to that comment: you have to pay the open-access publication charges to make it available to everyone. Fortunately, you are allowed to make the pre-print version available on your own website. In this case, no editors or reviewers made or proposed any changes, so the pre-print on the author&#8217;s website is the same as the actual publication except for the final layout.</p>
<p>We also investigated tokenization for the biomedical domain; have a look at our SIGIR&#8217;07 poster (&#8220;The Influence of Basic Tokenization on Biomedical Document Retrieval&#8221;, SIGIR, 2007) and the article by Jiang and Zhai (&#8220;An Empirical Study of Tokenization Strategies for Biomedical Information Retrieval&#8221;, IR, 2007). In our experience, generating surplus tokens worked really well: for example, tokenizing &#8220;P53-activator&#8221; as [p], [53], [p53], [activator] and [p53activator] increased both precision and recall.<br />
Interestingly, many machine learning methods use character n-gramming techniques, whereas in my own experience these don&#8217;t really work well for biomedical IR.</p>
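<p>The surplus-token idea can be sketched in a few lines of Python. This is only an illustration that reproduces the P53-activator example; the function name surplus_tokens and the exact splitting rules (lowercasing, splitting on non-alphanumerics, then on letter/digit boundaries, ASCII only) are assumptions, not the tokenizer used in the poster.</p>

```python
import re


def surplus_tokens(term):
    """Emit overlapping tokens for one term (hypothetical rules, ASCII only)."""
    term = term.lower()
    # Split on anything that is not a lowercase letter or digit, e.g. the hyphen.
    parts = [p for p in re.split(r"[^a-z0-9]+", term) if p]
    tokens = []
    for part in parts:
        # Break each part into runs of letters and runs of digits.
        runs = re.findall(r"[a-z]+|[0-9]+", part)
        if len(runs) != 1:
            tokens.extend(runs)
        tokens.append(part)
    # Also emit the concatenation of all parts when there are several.
    if len(parts) not in (0, 1):
        tokens.append("".join(parts))
    return tokens


print(surplus_tokens("P53-activator"))
# ['p', '53', 'p53', 'activator', 'p53activator']
```

<p>Indexing all of these tokens lets a query written as p53, p 53 or p53activator hit the same document, at the cost of a larger index.</p>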
]]></content:encoded>
	</item>
	<item>
		<title>By: lingpipe</title>
		<link>http://lingpipe-blog.com/2009/08/28/trieschnigg-2009-mesh-up-effective-mesh-textclassification-for-improved-document-retrieval/#comment-5409</link>
		<dc:creator><![CDATA[lingpipe]]></dc:creator>
		<pubDate>Mon, 31 Aug 2009 19:32:30 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=2079#comment-5409</guid>
		<description><![CDATA[Thanks -- that description of the IR component really helped.  

Just because it&#039;s easy and relatively efficient, we often work with character n-gram language model rescoring of Lucene&#039;s n-best TF/IDF results with a fairly simple tokenizer.  Looking at all the TF/IDF variants and parameterizations crossed with stemming, stoplisting and other normalization and crossed with realistic queries makes my head hurt.  

The &lt;a href=&quot;http://www.ncbi.nlm.nih.gov/pubmed/19671694&quot; rel=&quot;nofollow&quot;&gt;Névéol, Mork, and Aronson  comment&lt;/a&gt; on your paper isn&#039;t available without a subscription.  Ironic given that the authors work for NLM.]]></description>
		<content:encoded><![CDATA[<p>Thanks &#8212; that description of the IR component really helped.  </p>
<p>Just because it&#8217;s easy and relatively efficient, we often work with character n-gram language model rescoring of Lucene&#8217;s n-best TF/IDF results with a fairly simple tokenizer.  Looking at all the TF/IDF variants and parameterizations crossed with stemming, stoplisting and other normalization and crossed with realistic queries makes my head hurt.  </p>
<p>The <a href="http://www.ncbi.nlm.nih.gov/pubmed/19671694" rel="nofollow">Névéol, Mork, and Aronson  comment</a> on your paper isn&#8217;t available without a subscription.  Ironic given that the authors work for NLM.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dolf Trieschnigg</title>
		<link>http://lingpipe-blog.com/2009/08/28/trieschnigg-2009-mesh-up-effective-mesh-textclassification-for-improved-document-retrieval/#comment-5396</link>
		<dc:creator><![CDATA[Dolf Trieschnigg]]></dc:creator>
		<pubDate>Sun, 30 Aug 2009 12:51:03 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=2079#comment-5396</guid>
		<description><![CDATA[Thanks for the review of our paper.
Just to clarify things a bit further about improving IR with MeSH:
we used two indices, one with MeSH descriptors and one with full text. Two query representations were used: the original textual query and the automatically obtained MeSH representation (determined using the KNN approach). The MeSH query is matched against the MeSH index and the text query against the text index, and the relevance scores are mixed through linear interpolation. 

Your point about the retrieval model is probably right: a vector space retrieval model would likely give similar results. In fact, Névéol et al. recently wrote a comment on our paper (see the Bioinformatics website) which shows that the PubMed Related Citations (PRC) algorithm also gives very good results. In the paper describing it, PRC outperformed BM25 (if I am not mistaken). Similarly, LM IR has also been shown to outperform BM25. In any case, KNN seems a pretty steady baseline for automatic indexing.]]></description>
		<content:encoded><![CDATA[<p>Thanks for the review of our paper.<br />
Just to clarify things a bit further about improving IR with MeSH:<br />
we used two indices, one with MeSH descriptors and one with full text. Two query representations were used: the original textual query and the automatically obtained MeSH representation (determined using the KNN approach). The MeSH query is matched against the MeSH index and the text query against the text index, and the relevance scores are mixed through linear interpolation. </p>
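<p>For readers who want to see the mixing step spelled out, here is a minimal Python sketch of linear interpolation over two score lists. The function name interpolate_scores, the weight lam and the treatment of missing documents as score 0.0 are assumptions made for this example; the comment does not state the actual weight or any score normalization.</p>

```python
def interpolate_scores(text_scores, mesh_scores, lam=0.5):
    """Mix two {doc_id: score} maps as lam * text + (1 - lam) * mesh.

    A document missing from one map is treated as scoring 0.0 there
    (an assumption of this sketch).
    """
    doc_ids = set(text_scores) | set(mesh_scores)
    mixed = {
        doc_id: lam * text_scores.get(doc_id, 0.0)
        + (1.0 - lam) * mesh_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Highest combined score first.
    return sorted(mixed.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores from the text index and the MeSH index.
ranked = interpolate_scores({"a": 1.0, "b": 0.5}, {"b": 1.0, "c": 0.5}, lam=0.5)
print(ranked)  # [('b', 0.75), ('a', 0.5), ('c', 0.25)]
```

<p>In practice the two score distributions usually need to be put on a comparable scale before mixing, and lam would be tuned on held-out queries.</p>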
<p>Your point about the retrieval model is probably right: a vector space retrieval model would likely give similar results. In fact, Névéol et al. recently wrote a comment on our paper (see the Bioinformatics website) which shows that the PubMed Related Citations (PRC) algorithm also gives very good results. In the paper describing it, PRC outperformed BM25 (if I am not mistaken). Similarly, LM IR has also been shown to outperform BM25. In any case, KNN seems a pretty steady baseline for automatic indexing.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

