<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Flexible Dictionary-Based Chunking for Extracting Gene and Protein Names</title>
	<atom:link href="http://lingpipe-blog.com/2009/05/21/flexible-dictionary-based-chunking-gene-protein-names/feed/" rel="self" type="application/rss+xml" />
	<link>http://lingpipe-blog.com/2009/05/21/flexible-dictionary-based-chunking-gene-protein-names/</link>
	<description>Natural Language Processing and Text Analytics</description>
	<lastBuildDate>Wed, 08 Feb 2012 17:47:08 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>By: lingpipe</title>
		<link>http://lingpipe-blog.com/2009/05/21/flexible-dictionary-based-chunking-gene-protein-names/#comment-4772</link>
		<dc:creator><![CDATA[lingpipe]]></dc:creator>
		<pubDate>Mon, 25 May 2009 20:49:33 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=1479#comment-4772</guid>
		<description><![CDATA[Right.  LingPipe already implements the linear-time version (see Dan Gusfield&#039;s strings book, as usual) of inexact dictionary matching with respect to an arbitrary character-by-character weighted edit distance with an upper bound on distance (the upper bound&#039;s what allows it to be linear): &lt;a href=&quot;http://alias-i.com/lingpipe/docs/api/com/aliasi/dict/ApproxDictionaryChunker.html&quot; rel=&quot;nofollow&quot;&gt;&lt;code&gt;com.aliasi.dict.ApproxDictionaryChunker&lt;/code&gt;&lt;/a&gt;.

This strategy works OK for finding transliterations and typos, but doesn&#039;t work so well for things like gene names or company names, where you tend to get whole-token order variation and whole dropped tokens.

You can do something similar in the Apache Lucene search API using their fuzzy term matching (which I helped recode to only use linear space in matching), or with character n-gram matching (the latter&#039;s covered in our case study in the first edition of the &lt;i&gt;Lucene in Action&lt;/i&gt; book).]]></description>
		<content:encoded><![CDATA[<p>Right.  LingPipe already implements the linear-time version (see Dan Gusfield&#8217;s strings book, as usual) of inexact dictionary matching with respect to an arbitrary character-by-character weighted edit distance with an upper bound on distance (the upper bound&#8217;s what allows it to be linear): <a href="http://alias-i.com/lingpipe/docs/api/com/aliasi/dict/ApproxDictionaryChunker.html" rel="nofollow"><code>com.aliasi.dict.ApproxDictionaryChunker</code></a>.</p>
<p>This strategy works OK for finding transliterations and typos, but doesn&#8217;t work so well for things like gene names or company names, where you tend to get whole-token order variation and whole dropped tokens.</p>
<p>You can do something similar in the Apache Lucene search API using their fuzzy term matching (which I helped recode to only use linear space in matching), or with character n-gram matching (the latter&#8217;s covered in our case study in the first edition of the <i>Lucene in Action</i> book).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Wei</title>
		<link>http://lingpipe-blog.com/2009/05/21/flexible-dictionary-based-chunking-gene-protein-names/#comment-4769</link>
		<dc:creator><![CDATA[Wei]]></dc:creator>
		<pubDate>Mon, 25 May 2009 11:03:09 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=1479#comment-4769</guid>
		<description><![CDATA[One possibility is to allow certain errors when performing the dictionary matching. One recent algorithm is in &quot;Wei Wang, Chuan Xiao, Xuemin Lin, Chengqi Zhang. Approximate Entity Extraction with Edit Distance Constraints. SIGMOD 2009&quot;

To handle the last two cases, one can, in principle, modify the neighborhood generation methods to allow transformations such as &#039;a&#039; to &#039;\alpha&#039;.]]></description>
		<content:encoded><![CDATA[<p>One possibility is to allow certain errors when performing the dictionary matching. One recent algorithm is in &#8220;Wei Wang, Chuan Xiao, Xuemin Lin, Chengqi Zhang. Approximate Entity Extraction with Edit Distance Constraints. SIGMOD 2009&#8221;</p>
<p>To handle the last two cases, one can, in principle, modify the neighborhood generation methods to allow transformations such as &#8216;a&#8217; to &#8216;\alpha&#8217;.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

