<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Ratinov and Roth (2009) Design Challenges and Misconceptions in Named Entity Recognition</title>
	<atom:link href="http://lingpipe-blog.com/2009/07/09/ratinov-and-roth-2009-design-challenges-and-misconceptions-in-named-entity-recognition/feed/" rel="self" type="application/rss+xml" />
	<link>http://lingpipe-blog.com/2009/07/09/ratinov-and-roth-2009-design-challenges-and-misconceptions-in-named-entity-recognition/</link>
	<description>Natural Language Processing and Text Analytics</description>
	<lastBuildDate>Wed, 08 Feb 2012 17:47:08 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>By: lingpipe</title>
		<link>http://lingpipe-blog.com/2009/07/09/ratinov-and-roth-2009-design-challenges-and-misconceptions-in-named-entity-recognition/#comment-5656</link>
		<dc:creator><![CDATA[lingpipe]]></dc:creator>
		<pubDate>Wed, 14 Oct 2009 23:10:04 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=1830#comment-5656</guid>
		<description><![CDATA[There&#039;s definitely more written on tokenization in IR than in NLP, and even then the emphasis is on non-whitespace-separated languages like Chinese.   

There&#039;s been a fair bit written on tokenization in the bio-NLP world, because of the complex token structures of biochemistry nomenclature (e.g. &quot;p53wt&quot; might become three tokens and &quot;mRNA&quot; two tokens).  

Another place you see it is in learning to segment unsegmented input (e.g. phonemes into words).

There&#039;s also a fair bit of attention in IR to issues of compound splitting, because of the Spider-man/Batman problem (&quot;Spider-man&quot; is two words, &quot;Batman&quot; is one, but many searchers get this wrong).

One problem is that tokenization isn&#039;t &quot;natural&quot;.  Speakers of a language don&#039;t have intuitions about tokenization, so no one ever runs bakeoffs on tokenization.  So the &quot;right&quot; tokenization is often system dependent.  Having said that, you could say the same about POS tagging.

A big issue is that much of the test data is pre-tokenized.  Consider CoNLL, the Penn Treebank, the Google N-gram corpus, etc. etc.  Bakeoff organizers often consider this a service (e.g. Senseval and many KDD-like efforts), and only provide stemmed, tokenized and stoplisted input.

What&#039;s really frustrating is getting a resource without a tokenizer (e.g. CoNLL, Google N-grams).  Then you can&#039;t convert your system to run on wild text.  At least the Penn Treebank distributes its crazy script (which first does sentence detection so it can treat end-of-sentence periods differently than sentence-internal ones).

Then there are the probabilistic tokenizers, like in BioIE.  I have too much carryover from my programming language days, but I like my tokenizers to be standalone and deterministic preprocessors, not part of some big joint probabilistic inference system.  That&#039;s why we tend to run finer-grained tokenizers than most.  

For our spelling correction and for our rescoring named entity detector, we rely heavily on tokenization.  For spelling, we can determine the set of tokens which may be suggested as tokens and make correction sensitive to that.  For rescoring named entities, we use a simple HMM, which determines hard, but fine-grained token boundaries, then rescore it in a non-tokenized longer-distance model.]]></description>
		<content:encoded><![CDATA[<p>There&#8217;s definitely more written on tokenization in IR than in NLP, and even then the emphasis is on non-whitespace-separated languages like Chinese.   </p>
<p>There&#8217;s been a fair bit written on tokenization in the bio-NLP world, because of the complex token structures of biochemistry nomenclature (e.g. &#8220;p53wt&#8221; might become three tokens and &#8220;mRNA&#8221; two tokens).  </p>
<p>Another place you see it is in learning to segment unsegmented input (e.g. phonemes into words).</p>
<p>There&#8217;s also a fair bit of attention in IR to issues of compound splitting, because of the Spider-man/Batman problem (&#8220;Spider-man&#8221; is two words, &#8220;Batman&#8221; is one, but many searchers get this wrong).</p>
<p>One problem is that tokenization isn&#8217;t &#8220;natural&#8221;.  Speakers of a language don&#8217;t have intuitions about tokenization, so no one ever runs bakeoffs on tokenization.  So the &#8220;right&#8221; tokenization is often system dependent.  Having said that, you could say the same about POS tagging.</p>
<p>A big issue is that much of the test data is pre-tokenized.  Consider CoNLL, the Penn Treebank, the Google N-gram corpus, etc. etc.  Bakeoff organizers often consider this a service (e.g. Senseval and many KDD-like efforts), and only provide stemmed, tokenized and stoplisted input.</p>
<p>What&#8217;s really frustrating is getting a resource without a tokenizer (e.g. CoNLL, Google N-grams).  Then you can&#8217;t convert your system to run on wild text.  At least the Penn Treebank distributes its crazy script (which first does sentence detection so it can treat end-of-sentence periods differently than sentence-internal ones).</p>
<p>Then there are the probabilistic tokenizers, like in BioIE.  I have too much carryover from my programming language days, but I like my tokenizers to be standalone and deterministic preprocessors, not part of some big joint probabilistic inference system.  That&#8217;s why we tend to run finer-grained tokenizers than most.  </p>
<p>For our spelling correction and for our rescoring named entity detector, we rely heavily on tokenization.  For spelling, we can determine the set of tokens which may be suggested as tokens and make correction sensitive to that.  For rescoring named entities, we use a simple HMM, which determines hard, but fine-grained token boundaries, then rescore it in a non-tokenized longer-distance model.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dr. Jochen L. Leidner</title>
		<link>http://lingpipe-blog.com/2009/07/09/ratinov-and-roth-2009-design-challenges-and-misconceptions-in-named-entity-recognition/#comment-5654</link>
		<dc:creator><![CDATA[Dr. Jochen L. Leidner]]></dc:creator>
		<pubDate>Wed, 14 Oct 2009 22:34:28 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=1830#comment-5654</guid>
		<description><![CDATA[Tokenization is one of the key factors for high quality in NLP, yet Mikheev&#039;s article in Computational Linguistics is about the only serious reference that I can think of that studies it. Tokenization is language-dependent and interacts with the textual format of the input text, which makes it a complex problem.

I have seen people running around asking &quot;Um... do you have a tokenizer?&quot; at various institutions in more than one country, and many end up using some idiosyncratic Perl script that &quot;somebody hacked up one night in the lab&quot;.

Even for English, to date there is no standard component that people can share; there is not even interest in the problem. Yes, people will tell you their experiences, but they would never consider publishing on the topic, because they consider it &quot;solved&quot;.]]></description>
		<content:encoded><![CDATA[<p>Tokenization is one of the key factors for high quality in NLP, yet Mikheev&#8217;s article in Computational Linguistics is about the only serious reference that I can think of that studies it. Tokenization is language-dependent and interacts with the textual format of the input text, which makes it a complex problem.</p>
<p>I have seen people running around asking &#8220;Um&#8230; do you have a tokenizer?&#8221; at various institutions in more than one country, and many end up using some idiosyncratic Perl script that &#8220;somebody hacked up one night in the lab&#8221;.</p>
<p>Even for English, to date there is no standard component that people can share; there is not even interest in the problem. Yes, people will tell you their experiences, but they would never consider publishing on the topic, because they consider it &#8220;solved&#8221;.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Lev Ratinov</title>
		<link>http://lingpipe-blog.com/2009/07/09/ratinov-and-roth-2009-design-challenges-and-misconceptions-in-named-entity-recognition/#comment-5130</link>
		<dc:creator><![CDATA[Lev Ratinov]]></dc:creator>
		<pubDate>Mon, 13 Jul 2009 00:38:09 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=1830#comment-5130</guid>
		<description><![CDATA[Yoav, I emailed you and wanted to discuss some things with you; did you get my email? Please email me directly at my UIUC email account.]]></description>
		<content:encoded><![CDATA[<p>Yoav, I emailed you and wanted to discuss some things with you; did you get my email? Please email me directly at my UIUC email account.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Lev Ratinov</title>
		<link>http://lingpipe-blog.com/2009/07/09/ratinov-and-roth-2009-design-challenges-and-misconceptions-in-named-entity-recognition/#comment-5129</link>
		<dc:creator><![CDATA[Lev Ratinov]]></dc:creator>
		<pubDate>Mon, 13 Jul 2009 00:36:12 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=1830#comment-5129</guid>
		<description><![CDATA[Hi, Lingpipe.

Interesting comments. The beam size of the demo is 1, which is the beam size of the system we reported in the paper. We didn&#039;t get much improvement when increasing the beam size to 100, and it was painfully slow. So everything we did was plain greedy.

Unfortunately, I don&#039;t know anything about PHP; I&#039;m using a standard stub in our group for creating demos, and it has many problems.

But giving it a second thought, I do agree that character encoding, dealing with punctuation, etc. are important. One thing that I didn&#039;t talk about in the paper, due to lack of space, but which had an impact on the system performance for real-world text, was tokenization and text normalization. We&#039;re using two tokenization schemes in our system, which play with how we parse hyphens and punctuation marks... None of this is trivial, and it is all under-researched.]]></description>
		<content:encoded><![CDATA[<p>Hi, Lingpipe.</p>
<p>Interesting comments. The beam size of the demo is 1, which is the beam size of the system we reported in the paper. We didn&#8217;t get much improvement when increasing the beam size to 100, and it was painfully slow. So everything we did was plain greedy.</p>
<p>Unfortunately, I don&#8217;t know anything about PHP; I&#8217;m using a standard stub in our group for creating demos, and it has many problems.</p>
<p>But giving it a second thought, I do agree that character encoding, dealing with punctuation, etc. are important. One thing that I didn&#8217;t talk about in the paper, due to lack of space, but which had an impact on the system performance for real-world text, was tokenization and text normalization. We&#8217;re using two tokenization schemes in our system, which play with how we parse hyphens and punctuation marks&#8230; None of this is trivial, and it is all under-researched.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Yoav</title>
		<link>http://lingpipe-blog.com/2009/07/09/ratinov-and-roth-2009-design-challenges-and-misconceptions-in-named-entity-recognition/#comment-5122</link>
		<dc:creator><![CDATA[Yoav]]></dc:creator>
		<pubDate>Sun, 12 Jul 2009 10:17:06 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=1830#comment-5122</guid>
		<description><![CDATA[fwiw, the length 4 and 6 prefixes of the Brown clustering algorithm were also shown to be useful for dependency parsing by Koo et al. 2008 (see paper for ref).]]></description>
		<content:encoded><![CDATA[<p>fwiw, the length 4 and 6 prefixes of the Brown clustering algorithm were also shown to be useful for dependency parsing by Koo et al. 2008 (see paper for ref).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: lingpipe</title>
		<link>http://lingpipe-blog.com/2009/07/09/ratinov-and-roth-2009-design-challenges-and-misconceptions-in-named-entity-recognition/#comment-5108</link>
		<dc:creator><![CDATA[lingpipe]]></dc:creator>
		<pubDate>Fri, 10 Jul 2009 22:31:46 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=1830#comment-5108</guid>
		<description><![CDATA[@fredrik Thanks for the reference.  I&#039;ve been trying to find survey papers like this to do a meta-survey!  Satoshi also has a &lt;a href=&quot;http://www.lrec-conf.org/proceedings/lrec2008/summaries/21.html&quot; rel=&quot;nofollow&quot;&gt;fairly extensive NE ontology&lt;/a&gt; that&#039;s been in play for various evaluations.

@Lev Thanks for the link correction; I fixed it in the post.

And if I didn&#039;t make it clear enough in the post, I really like both your approach and how well it seems to work on the examples I tried (many more than I put in the post; those were just the first two I tried).  I&#039;d urge everyone else to try it.  The demo&#039;s very speedy, too, which is great.  How big is the demo beam?  

I&#039;m assuming something&#039;s going wrong on your program&#039;s decoding side, because when I cut and paste from the NY Times, in Latin1, to your web form, also Latin1 (both with inferred encodings from firefox), the characters get mangled on output.  Is there a way I could&#039;ve entered the curly apostrophe with the form the way it is now?

I have no idea how to properly deal with different encodings in PHP.  It&#039;s a pain in Java servlets, because (a) you have to transcode through Latin1 to recover bytes, because the servlet interface is specified in Unicode characters [or you have to do everything in bytes, losing the advantage of letting the implementation parse forms for you], and (b) you then have to either know the underlying charset (how I&#039;ve built our demos) or determine it using something like a Unicode library (how we deal with untrusted char data).

Then there&#039;s the issue of what to do with these things in the models.  You wind up with names and things like that with very low frequency characters in them that aren&#039;t in any of the character-based training data.  We&#039;ve had to deal with issues like this in training models from the French Treebank, because live French uses all sorts of different punctuation and the training corpus was fairly uniform.]]></description>
		<content:encoded><![CDATA[<p>@fredrik Thanks for the reference.  I&#8217;ve been trying to find survey papers like this to do a meta-survey!  Satoshi also has a <a href="http://www.lrec-conf.org/proceedings/lrec2008/summaries/21.html" rel="nofollow">fairly extensive NE ontology</a> that&#8217;s been in play for various evaluations.</p>
<p>@Lev Thanks for the link correction; I fixed it in the post.</p>
<p>And if I didn&#8217;t make it clear enough in the post, I really like both your approach and how well it seems to work on the examples I tried (many more than I put in the post; those were just the first two I tried).  I&#8217;d urge everyone else to try it.  The demo&#8217;s very speedy, too, which is great.  How big is the demo beam?  </p>
<p>I&#8217;m assuming something&#8217;s going wrong on your program&#8217;s decoding side, because when I cut and paste from the NY Times, in Latin1, to your web form, also Latin1 (both with inferred encodings from firefox), the characters get mangled on output.  Is there a way I could&#8217;ve entered the curly apostrophe with the form the way it is now?</p>
<p>I have no idea how to properly deal with different encodings in PHP.  It&#8217;s a pain in Java servlets, because (a) you have to transcode through Latin1 to recover bytes, because the servlet interface is specified in Unicode characters [or you have to do everything in bytes, losing the advantage of letting the implementation parse forms for you], and (b) you then have to either know the underlying charset (how I&#8217;ve built our demos) or determine it using something like a Unicode library (how we deal with untrusted char data).</p>
<p>Then there&#8217;s the issue of what to do with these things in the models.  You wind up with names and things like that with very low frequency characters in them that aren&#8217;t in any of the character-based training data.  We&#8217;ve had to deal with issues like this in training models from the French Treebank, because live French uses all sorts of different punctuation and the training corpus was fairly uniform.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Lev Ratinov</title>
		<link>http://lingpipe-blog.com/2009/07/09/ratinov-and-roth-2009-design-challenges-and-misconceptions-in-named-entity-recognition/#comment-5103</link>
		<dc:creator><![CDATA[Lev Ratinov]]></dc:creator>
		<pubDate>Fri, 10 Jul 2009 15:01:27 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=1830#comment-5103</guid>
		<description><![CDATA[Hi Bob.

Thanks for evaluating our system. I have two comments.
1) The Link you provided to our system is a bit incorrect. 
This is the link to use: http://l2r.cs.uiuc.edu/~cogcomp/LbjNer.php
2) I hope you&#039;ll agree with me that as a non-commercial demo we have a hard time adjusting to different encodings within our web demo. It&#039;s really easy to fix the encoding problem, and I&#039;m a bit disappointed you didn&#039;t try our system AFTER fixing the encoding, which is a much fairer comparison. I&#039;ve done it, and the output is below. As you can see, our system does much better on the blog entry you mention, which, I agree with you, is hard, since it&#039;s not newswire text.

Scope out these new exclusive pics of [PER Michael Jackson ] with two of his three kids - [LOC Paris ] and [PER Michael Joseph Jr ] . ( also known as Prince [PER Michael ] ) - from the new issue of [LOC OK ] ! . In this picture , [PER Prince ] , 4 , and [PER Paris Jackson ] , 3 , play dress-up in 2001 at the [LOC Neverland Ranch ] in [LOC Santa Barbara County ] , [LOC Calif ] . Also pictured is [LOC Michaelat Neverland ] , celebrating [PER Prince ] &#039;s sixth birthday in 2003 with a [MISC Spider-Man-themed ] party . Cute !]]></description>
		<content:encoded><![CDATA[<p>Hi Bob.</p>
<p>Thanks for evaluating our system. I have two comments.<br />
1) The Link you provided to our system is a bit incorrect.<br />
This is the link to use: <a href="http://l2r.cs.uiuc.edu/~cogcomp/LbjNer.php" rel="nofollow">http://l2r.cs.uiuc.edu/~cogcomp/LbjNer.php</a><br />
2) I hope you&#8217;ll agree with me that as a non-commercial demo we have a hard time adjusting to different encodings within our web demo. It&#8217;s really easy to fix the encoding problem, and I&#8217;m a bit disappointed you didn&#8217;t try our system AFTER fixing the encoding, which is a much fairer comparison. I&#8217;ve done it, and the output is below. As you can see, our system does much better on the blog entry you mention, which, I agree with you, is hard, since it&#8217;s not newswire text.</p>
<p>Scope out these new exclusive pics of [PER Michael Jackson ] with two of his three kids &#8211; [LOC Paris ] and [PER Michael Joseph Jr ] . ( also known as Prince [PER Michael ] ) &#8211; from the new issue of [LOC OK ] ! . In this picture , [PER Prince ] , 4 , and [PER Paris Jackson ] , 3 , play dress-up in 2001 at the [LOC Neverland Ranch ] in [LOC Santa Barbara County ] , [LOC Calif ] . Also pictured is [LOC Michaelat Neverland ] , celebrating [PER Prince ] &#8216;s sixth birthday in 2003 with a [MISC Spider-Man-themed ] party . Cute !</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: fredrik</title>
		<link>http://lingpipe-blog.com/2009/07/09/ratinov-and-roth-2009-design-challenges-and-misconceptions-in-named-entity-recognition/#comment-5102</link>
		<dc:creator><![CDATA[fredrik]]></dc:creator>
		<pubDate>Fri, 10 Jul 2009 12:27:19 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=1830#comment-5102</guid>
		<description><![CDATA[As regards features, the paper by Nadeau and Sekine, &quot;A survey of named entity recognition and classification&quot; (2007), lists a fair number of different features used in the literature. The paper is available here: http://nlp.cs.nyu.edu/sekine/papers/li07.pdf]]></description>
		<content:encoded><![CDATA[<p>As regards features, the paper by Nadeau and Sekine, &#8220;A survey of named entity recognition and classification&#8221; (2007), lists a fair number of different features used in the literature. The paper is available here: <a href="http://nlp.cs.nyu.edu/sekine/papers/li07.pdf" rel="nofollow">http://nlp.cs.nyu.edu/sekine/papers/li07.pdf</a></p>
]]></content:encoded>
	</item>
</channel>
</rss>

