<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>LingPipe Blog</title>
	<atom:link href="http://lingpipe-blog.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://lingpipe-blog.com</link>
	<description>Alias-i’s Natural Language Processing and Text Analytics API</description>
	<pubDate>Sun, 10 Aug 2008 19:37:35 +0000</pubDate>
	<generator>http://wordpress.org/?v=MU</generator>
	<language>en</language>
			<item>
		<title>Where&#8217;s Georgia?  DB Linkage isn&#8217;t Easy</title>
		<link>http://lingpipe-blog.com/2008/08/10/wheres-georgia-db-linkage-isnt-easy/</link>
		<comments>http://lingpipe-blog.com/2008/08/10/wheres-georgia-db-linkage-isnt-easy/#comments</comments>
		<pubDate>Sun, 10 Aug 2008 19:36:13 +0000</pubDate>
		<dc:creator>lingpipe</dc:creator>
		
		<category><![CDATA[Carp's Blog]]></category>

		<category><![CDATA[LingPipe in Use]]></category>

		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=140</guid>
		<description><![CDATA[In case you didn&#8217;t see the link on Slashdot, Google was supplying maps of the American state Georgia when they should&#8217;ve been linking to the Caucasian country Georgia.  As Homer (the Simpson, not the classical Greek poet, the Alaskan city, the Illinois city, the American painter, or the tunnel in New Zealand) would say, [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>In case you didn&#8217;t see the <a href="http://news.slashdot.org/news/08/08/09/213236.shtml">link on Slashdot</a>, Google was supplying maps of the <a href="http://en.wikipedia.org/wiki/Georgia_(U.S._state)">American state Georgia</a> when they should&#8217;ve been linking to the Caucasian country <a href="http://en.wikipedia.org/wiki/Georgia_(country)">Georgia</a>.  As <a href="http://en.wikipedia.org/wiki/Homer_Simpson">Homer</a> (the Simpson, not the classical Greek poet, the Alaskan city, the Illinois city, the American painter, or the tunnel in New Zealand) would say, &#8220;<a href="http://en.wikipedia.org/wiki/D%27oh%21">D&#8217;oh!</a>.&#8221;</p>
<p>Slashdot was just picking up the <a href="http://valleywag.com/5034988/google-news-informs-us-that-the-russians-are-invading-the-south">story from Valleywag</a>.  Google didn&#8217;t take this lying down; the <a href="http://afp.google.com/article/ALeqM5hpNRP9ysixHH3P9izLJRjYT1ATkA">current page</a> shows a map centered on Vienna, which if you click on it, takes you to New York, saying &#8220;U.N.&#8221;.  </p>
<p>The problem is that linking textual mentions to database entries is non-trivial, even for the relatively simple problem of geo-location.  This is the business <a href="http://www.metacarta.com/">Metacarta</a> is in, and users we&#8217;ve talked to say they do a very good job of it.  </p>
<p>You see similar problems for products, as in <a href="http://www.google.com/products">the app formerly known as Froogle</a> or <a href="http://www.shopwiki.com/">ShopWiki</a>, as the Slashdot story pointed out in the case of yet another search engine <a href="http://www.eweek.com/c/a/Enterprise-Applications/Cuil-Search-Engine-Triggers-Image-Concerns/">mismatching product photos</a>.  </p>
<p>We ran into this problem trying to find rap artists in text, who have names like &#8220;<a href="http://en.wikipedia.org/wiki/The_Game_(rapper)">The Game</a>.&#8221;  And we&#8217;re battling the problem in our ongoing NIH project on linking genes and protein mentions in text to Entrez-Gene.</p>
<p>I&#8217;ve noted before that the <i>NY Times</i> site runs into problems in cases such as distinguishing the Pittsburgh suburb named Mount Lebanon from the Middle Eastern country Lebanon.  In general, text matching doesn&#8217;t work very well by itself.  You run into false positives by linking the Pittsburgh suburb to the Middle East, but you get false negatives if you just don&#8217;t link &#8220;Mount Lebanon&#8221;.  But the <i>Times</i> has full editorial control, so they can catch these problems and fix them manually during copy editing.</p>
<p>To solve this problem automatically with better results than plain text matching, we need context, which is available on the web in the form of both texts and links.  We discuss this in our <a href="http://alias-i.com/lingpipe/demos/tutorial/cluster/read-me.html">clustering tutorial</a>, which uses context to disambiguate multiple John Smiths in the news, and in our <a href="http://lingpipe.files.wordpress.com/2008/04/alias-i-gene-linkage-06.pdf">white paper on gene linkage</a>.  This basically reduces the linkage problem to that of <a href="http://alias-i.com/lingpipe/demos/tutorial/wordSense/read-me.html">word sense disambiguation</a>. The only real problem (besides the remaining difficult to disambiguate cases) is that it&#8217;s relatively storage and compute intensive to use context compared to a simple dictionary matcher.</p>
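<p>As a toy illustration of using context, here&#8217;s a minimal Python sketch that picks a sense for &#8220;Georgia&#8221; by counting overlapping context words.  The cue words, the <code>SENSE_CUES</code> table, and the <code>disambiguate</code> function are all made up for illustration; a real system would learn its context model from text and links rather than hard-code it.</p>

```python
# Hypothetical context cues for two senses of "Georgia"; these word lists are
# invented for illustration and are not taken from any real gazetteer.
SENSE_CUES = {
    "Georgia (U.S. state)": {"atlanta", "savannah", "governor", "peach"},
    "Georgia (country)": {"tbilisi", "caucasus", "russia", "ossetia"},
}


def disambiguate(context, sense_cues=SENSE_CUES):
    """Pick the sense whose cue words overlap most with the context words."""
    words = set(context.lower().split())
    scores = {
        sense: len(words.intersection(cues))
        for sense, cues in sense_cues.items()
    }
    # Ties go to whichever sense was listed first.
    return max(scores, key=scores.get)


print(disambiguate("Russia sent troops toward Tbilisi as fighting spread in Georgia"))
```

<p>Even this crude version shows where the storage and compute cost comes from: every candidate sense needs its own context model, and every mention has to be scored against all of them.</p>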
</div>]]></content:encoded>
			<wfw:commentRss>http://lingpipe-blog.com/2008/08/10/wheres-georgia-db-linkage-isnt-easy/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Hunter/Gatherer vs Farming with Respect to Information Access</title>
		<link>http://lingpipe-blog.com/2008/08/07/huntergatherer-vs-farming-with-respect-to-information-access/</link>
		<comments>http://lingpipe-blog.com/2008/08/07/huntergatherer-vs-farming-with-respect-to-information-access/#comments</comments>
		<pubDate>Thu, 07 Aug 2008 21:59:37 +0000</pubDate>
		<dc:creator>lingpipe</dc:creator>
		
		<category><![CDATA[Breck's Blog]]></category>

		<category><![CDATA[Business]]></category>

		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=134</guid>
		<description><![CDATA[I kept having images of Google's web spider out digging roots and chasing animals with a pointed stick wearing a grubby little loin cloth.]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Antonio Valderrabanos of Bitext.com gave a talk at NYU today and he compared the current search/NLP strategies by information providers to  humanity&#8217;s hunter/gatherer stage and offered a vision of information farming. I kept having images of Google&#8217;s web spider out digging roots and chasing animals with a pointed stick wearing a grubby little loin cloth. Then I would switch to images of a farm stocked supermarket with well organized shelves, helpful clerks and lots of choice.</p>
<p>The analogy brought up a strong bias that I have in applying natural language processing (NLP) to real-world problems: I generally assume that the software must encounter text as it occurs in the &#8220;wild&#8221;.  After all, that is what humans do so well, and we are in the business of emulating human language processing, right?</p>
<p>Nope, not on the farm we&#8217;re not. On the farm, we use NLP to help add information that was never a part of standard written form. We use NLP to suggest and assign meta tags, connect entities to databases of concepts, and create new database entries for new entities. These are things that humans are horrible at, but humans are excellent at choosing from NLP-driven suggestions, and NLP is pretty good at making suggestions. So NLP is helping create the tools to index and cross-reference, at the concept level, all the information in the supermarket. Humans function as filters of what is correct. At least initially.</p>
<p>As the information supermarket gets bigger, the quality of the NLP (machine learning based) will get better, perhaps good enough to start automatically bringing in &#8220;wild&#8221; information with decent concept indexing and meta tagging. A keyword index is a crude yet effective tool, but an inventory system it is not, and that is what we need to advance to the next level of information access.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://lingpipe-blog.com/2008/08/07/huntergatherer-vs-farming-with-respect-to-information-access/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Epidemiologists&#8217; Bayesian Latent Class Models of Inter-Annotator Agreement</title>
		<link>http://lingpipe-blog.com/2008/08/07/epidemiologists-bayesian-latent-class-models-of-inter-annotator-agreement/</link>
		<comments>http://lingpipe-blog.com/2008/08/07/epidemiologists-bayesian-latent-class-models-of-inter-annotator-agreement/#comments</comments>
		<pubDate>Thu, 07 Aug 2008 19:48:01 +0000</pubDate>
		<dc:creator>lingpipe</dc:creator>
		
		<category><![CDATA[Carp's Blog]]></category>

		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=127</guid>
		<description><![CDATA[I (Bob) have been looking at the inter-annotator agreement and gold-standard adjudication problems.  I&#8217;ve also been hanging out with Andrew Gelman and Jennifer Hill thinking about multiple imputation, which has me studying their regression book, which covers the item-response (Rasch) model, and Gelman et al.&#8217;s book, which describes the beta-binomial model.  After [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I (Bob) have been looking at the inter-annotator agreement and gold-standard adjudication problems.  I&#8217;ve also been hanging out with Andrew Gelman and Jennifer Hill thinking about <a href="http://www.stat.psu.edu/~jls/mifaq.html">multiple imputation</a>, which has me studying their <a href="http://www.stat.columbia.edu/~gelman/arm/">regression book</a>, which covers the <a href="http://en.wikipedia.org/wiki/Item_response_theory">item-response (Rasch) model</a>, and <a href="http://www.stat.columbia.edu/~gelman/book/">Gelman et al.&#8217;s book</a>, which describes the beta-binomial model.  After thinking about how these tools could be put together, I believed I might have a novel approach to modeling inter-annotator agreement.  I got so excited I went and wrote a 30 page paper based on a series of simulations in R and <a href="http://www.mrc-bsu.cam.ac.uk/bugs/welcome.shtml">BUGS</a>, which may still turn out to be useful as an introduction to these approaches.  It&#8217;s already been useful in giving me practice in building hierarchical models, simulations, and generating graphics from BUGS and R.  I even analyzed one of our customer&#8217;s real annotated data sets (more on that to follow, I hope, as they had 10 annotators look at 1000 examples each over half a dozen categories).</p>
<p>Alas, the epidemiologists have beaten me to the punch.  They generalized the item-response model in exactly the way I wanted to and have even implemented it in WinBUGS.  In fact, they even used pretty much the same variable names!  It just goes to show how much people with the same tools (hierarchical Bayesian modeling, logistic regression, beta-binomial distributions, 0/1 classification problems, sensitivity vs. specificity distinctions) tend to build the same things.</p>
<p>If you can only read one paper, make it this one (if you can find it; I had to schlep up to Columbia where I can download papers on campus):</p>
<ul>
<li> <a href="http://linus.nci.nih.gov/~brb/palbert.htm">Albert, Paul S.</a> and Lori E. Dodd. 2004.  <a href="http://www.ncbi.nlm.nih.gov/pubmed/15180668">A Cautionary Note on the Robustness of Latent Class Models for Estimating Diagnostic Error without a Gold Standard.</a> <i>Biometrics</i> 60:427-435.
</li>
</ul>
<p>It cites most of the relevant literature other than (Dawid and Skene 1979), who, as far as I can tell, first introduced latent class models into this domain using EM for estimation.</p>
<p>To make a long story short, the model is quite simple in modern sampling/graphical notation, and just about as easy to code up in BUGS.  Here&#8217;s the model without accounting for missing data.  First the variable key:</p>
<pre style="font-size:110%;">
     I       # of items to classify
     J       # of annotators
     pi      overall prevalence of category 1
     c[i]    true category of item i
     d[i]    difficulty of item i
     a[0,j]  specificity of annotator j
     a[1,j]  sensitivity of annotator j
     x[i,j]  annotation of item i by annotator j
</pre>
<p>Recall that <a href="http://en.wikipedia.org/wiki/Sensitivity_%28tests%29">sensitivity</a> is accuracy on positive cases, which is just recall, TP/(TP+FN), whereas <a href="http://en.wikipedia.org/wiki/Sensitivity_%28tests%29">specificity</a> is just accuracy on negative cases, TN/(TN+FP).  Precision is TP/(TP+FP), but that doesn&#8217;t account for TN cases, and is thus incomplete as a full probability specification when paired with recall.  That&#8217;s why <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic">ROC curves</a>, which plot sensitivity against 1 - specificity, are more popular than precision-recall curves in the rest of the civilized world.</p>
<p>Other than the annotations <code>x[i,j]</code>, all other variables are unknown.  That includes the prevalence <code>pi</code>, the true categories <code>c</code>, the item difficulties <code>d</code>, and the annotator specificities and sensitivities <code>a[0]</code> and <code>a[1]</code>. </p>
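<p>To make the point about true negatives concrete, here&#8217;s a small Python sketch computing all three rates from confusion counts.  The counts are hypothetical and the function name <code>rates</code> is just for illustration.</p>

```python
def rates(tp, fp, tn, fn):
    """Return (sensitivity, specificity, precision) from confusion counts."""
    sensitivity = tp / (tp + fn)  # accuracy on positive cases (recall)
    specificity = tn / (tn + fp)  # accuracy on negative cases
    precision = tp / (tp + fp)    # never looks at true negatives
    return sensitivity, specificity, precision


# Hypothetical counts: identical TP, FP, and FN, very different numbers of TN.
few_tn = rates(tp=80, fp=20, tn=100, fn=20)
many_tn = rates(tp=80, fp=20, tn=9900, fn=20)
print(few_tn)
print(many_tn)
```

<p>Sensitivity and precision come out the same for both sets of counts; only specificity notices the extra true negatives.</p>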
<p>The sampling model without the noninformative priors:</p>
<pre style="font-size:110%;">
  c[i]  ~ Bernoulli(pi)
  d[i]  ~ Normal(0,1)
a[0,j]  ~ Beta(alpha[0],beta[0])
a[1,j]  ~ Beta(alpha[1],beta[1])
x[i,j]  ~ Bernoulli(inv-logit(logit(a[1,j]) - d[i]))     if c[i] = 1
          Bernoulli(1 - inv-logit(logit(a[0,j]) - d[i])) if c[i] = 0
</pre>
<p>That&#8217;s it.  Not that complex as these things go.  Scaling the difficulties to have 0 mean and variance 1 identifies the scale of the model; as Gelman and Hill describe in their book, there are lots of ways this can be done, including only scaling the mean of the difficulties to be 0. </p>
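<p>For readers who&#8217;d rather see the sampling model as a forward simulation, here&#8217;s a minimal Python sketch using only the standard library.  It is not the BUGS code from the paper; the function names, the seed, and the Beta(8,2) prior parameters in the usage line are assumptions chosen purely for illustration.</p>

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(u):
    return 1.0 / (1.0 + math.exp(-u))


def simulate(I, J, pi, alpha, beta, seed=0):
    """Draw categories c, difficulties d, accuracies a, and annotations x."""
    rng = random.Random(seed)
    # c[i] ~ Bernoulli(pi)
    c = [1 if pi > rng.random() else 0 for _ in range(I)]
    # d[i] ~ Normal(0, 1)
    d = [rng.gauss(0.0, 1.0) for _ in range(I)]
    # a[0][j] is specificity and a[1][j] is sensitivity of annotator j
    a = [[rng.betavariate(alpha[m], beta[m]) for _ in range(J)] for m in (0, 1)]
    x = []
    for i in range(I):
        row = []
        for j in range(J):
            if c[i] == 1:
                p_one = inv_logit(logit(a[1][j]) - d[i])
            else:
                p_one = 1.0 - inv_logit(logit(a[0][j]) - d[i])
            row.append(1 if p_one > rng.random() else 0)
        x.append(row)
    return c, d, a, x


# Hypothetical settings: 1000 items, 10 annotators, 30% prevalence.
c, d, a, x = simulate(I=1000, J=10, pi=0.3, alpha=(8.0, 8.0), beta=(2.0, 2.0))
print(sum(c) / len(c))
```

<p>Fitting goes the other way around, of course: only <code>x</code> is observed and everything else has to be inferred.</p>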
<p>There are lots of different priors that could be put on the <code>logit(a[m,j])</code> terms used here.  The more traditional thing to do in this kind of model is to use a normal prior.  In any case, you&#8217;re not going to be able to estimate the priors for specificity and sensitivity with only a handful of annotators.</p>
<p>There&#8217;s a simplified version of this model mentioned in Albert and Dodd where items are divided into easy and regular cases, with the easy cases having all annotators agree and regular cases having annotators respond independently according to their own specificity and sensitivity.  </p>
<p>The point of the Albert and Dodd paper cited above wasn&#8217;t to introduce these models, but to evaluate a range of them one against the other by simulating data in one model and evaluating it in the other.  They also evaluated real data and saw how it fit differently in the different models.</p>
<p>I should also point out that the following paper mentions the Dawid and Skene model in the computational linguistics literature:</p>
<ul>
<li>
Bruce, Rebecca F. and Janyce M. Wiebe. 1999. <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.8785">Recognizing subjectivity: a case study of manual tagging</a>.  <i>Natural Language Engineering</i> 1:1-16.
</li>
</ul>
<p>Bruce and Wiebe even talk about using the posterior distribution over true categories as a proxy for a gold standard, which seems to be the right way to go with this work.  But they use a single latent variable (roughly problem difficulty) and don&#8217;t separately model annotator specificity from sensitivity, which is critical in both the simulations and real world data analyses I&#8217;ve done recently.</p>
<p>The depressing conclusion for NLP and other applications of classifiers is that it&#8217;s clear that with only 3 annotators, it&#8217;s going to be impossible to get a gold standard of very high purity. Even with 5 annotators, there are going to be lots of difficult cases.  </p>
<p>The other applications besides inter-annotator agreement that I&#8217;ve run across in the past couple of days are educational testing, epidemiology of infections and evaluating multiple tests (e.g. stool inspection and serology), evaluations of health care facilities in multiple categories, evaluations of dentists and their agreement on caries (pre-cavities), adjustments to genome-wide prevalence assertions, and many more.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://lingpipe-blog.com/2008/08/07/epidemiologists-bayesian-latent-class-models-of-inter-annotator-agreement/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Searchme.com is Useful</title>
		<link>http://lingpipe-blog.com/2008/08/05/searchmecom-is-useful/</link>
		<comments>http://lingpipe-blog.com/2008/08/05/searchmecom-is-useful/#comments</comments>
		<pubDate>Tue, 05 Aug 2008 17:57:13 +0000</pubDate>
		<dc:creator>lingpipe</dc:creator>
		
		<category><![CDATA[Carp's Blog]]></category>

		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=123</guid>
		<description><![CDATA[I&#8217;m repeating pretty much verbatim a comment I left on Páraic Sheridan&#8217;s blog, Returned Emigrant in a response to his post Search is not a solved problem.  
Discussing whether search is a solved problem reminds me of a talk Hynek Hermansky gave at ICSLP in Sydney in &#8216;98 on why speech recognition isn&#8217;t a [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I&#8217;m repeating pretty much verbatim a comment I left on <a href="http://www.linkedin.com/pub/3/329/54A">Páraic Sheridan</a>&#8217;s blog, <a href="http://returnedemigrant.wordpress.com/">Returned Emigrant</a> in a response to his post <a href="http://returnedemigrant.wordpress.com/2008/08/01/search-is-not-a-solved-problem/">Search is not a solved problem</a>.  </p>
<p>Discussing whether search is a solved problem reminds me of a talk <a href="http://www.bme.ogi.edu/~hynek/">Hynek Hermansky</a> gave at ICSLP in Sydney in &#8216;98 on why speech recognition isn&#8217;t a solved problem.  Hynek made an analogy to flight, where at any point between balloons and jet airplanes, flight might&#8217;ve been considered solved.</p>
<p><a href="http://searchme.com/">Searchme.com</a> is great.  They provide reduced resolution page views with overlaid snippets, and it&#8217;s fast.  This just feels right for navigation searches (e.g. typing &#8220;lingpipe&#8221; to try to find our home page or blog).  And the page views add more value than I could&#8217;ve imagined to the snippets.  Like Páraic, I&#8217;m so enamored of it that it&#8217;s replacing Google as my default search engine on Firefox (just click on the link in the upper right of searchme.com&#8217;s home page).  I sure hope they can scale as more people find out about them.</p>
<p><a href="http://cuil.com/">Cuil.com</a>, by focusing on recall (and marketing), seems less useful, even if they get the bugs ironed out. Despite the fact that we’re focusing on recall for genomics information extraction tasks, I’ve never felt recall was an issue for most web searches. I could use more approximate and contextual matching, perhaps, but the index size has never seemed an issue.</p>
<p>I miss <a href="http://en.wikipedia.org/wiki/Excite">Excite.com</a>, which used to run TF/IDF rather than social-network-based search ranking algorithms.  I missed Excite even as I was starting to use Google for many searches.  But then again, if Excite had been successful, we wouldn’t have the <a href="http://lucene.apache.org">Apache Lucene</a> search engine.  </p>
<p><a href="http://www.powerset.com/">PowerSet.com</a> was focusing on some kind of precision and question answering (and marketing), which I also felt was of questionable value (for me as a searcher; it clearly worked for their VCs and founders) compared to using Google. Plus, they never showed (at least to the public) that their tech scaled in either complexity (different page types, multiple pages for entities) or size (number of web pages).</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://lingpipe-blog.com/2008/08/05/searchmecom-is-useful/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Good Kappa&#8217;s Not Necessary, Either</title>
		<link>http://lingpipe-blog.com/2008/07/28/good-kappas-not-necessary-either/</link>
		<comments>http://lingpipe-blog.com/2008/07/28/good-kappas-not-necessary-either/#comments</comments>
		<pubDate>Mon, 28 Jul 2008 21:52:49 +0000</pubDate>
		<dc:creator>lingpipe</dc:creator>
		
		<category><![CDATA[Carp's Blog]]></category>

		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=118</guid>
		<description><![CDATA[My last blog post, Good Kappa&#8217;s Not Enough, summarized Reidsma and Carletta&#8217;s arguments that a good kappa score is not sufficient for agreement.  In this post, I&#8217;d like to point out why it&#8217;s not necessary, either.  My real goal&#8217;s to refocus the problem on discovering when a gold standard can be trusted.
Suppose we [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>My last blog post, <a href="http://lingpipe-blog.com/2008/07/22/good-kappas-not-enough/">Good Kappa&#8217;s Not Enough</a>, summarized Reidsma and Carletta&#8217;s arguments that a good kappa score is not sufficient for agreement.  In this post, I&#8217;d like to point out why it&#8217;s not necessary, either.  My real goal&#8217;s to refocus the problem on discovering when a gold standard can be trusted.</p>
<p>Suppose we have five annotators who are each 80% accurate for a binary classification problem whose true category distribution is 50-50.  Now let&#8217;s say they annotate an example item (1,1,1,1,0), meaning annotator 1 assigns category 1, annotator 2 assigns category 1, up through annotator 5 who assigns category 0.  What can we conclude about the example?  Assuming the errors are independent (as kappa does), what&#8217;s the likelihood that the example really is of category 1 versus category 0?   Bayes&#8217; rule lets us calculate:</p>
<pre style="font-size:125%;">
     p(1|(1,1,1,1,0)) proportional to p(1) * p((1,1,1,1,0)|1)
                            = 0.5 * 0.8^4 * 0.2^1

     p(0|(1,1,1,1,0)) proportional to 0.5 * 0.8^1 * 0.2^4

     p(1|(1,1,1,1,0)) = (0.8^4 * 0.2^1)
                            / (0.8^4 * 0.2^1 + 0.8^1 * 0.2^4)
                       = 98.5%
</pre>
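<p>The same Bayes&#8217; rule calculation as a minimal Python sketch, assuming (as above) independent errors and a single accuracy shared by all annotators and both categories.  The function name and its defaults are just for illustration.</p>

```python
def posterior_category_1(votes, accuracy=0.8, prior_1=0.5):
    """Posterior probability of category 1 given independent annotator votes."""
    like_1 = 1.0  # p(votes | true category is 1)
    like_0 = 1.0  # p(votes | true category is 0)
    for v in votes:
        like_1 *= accuracy if v == 1 else 1.0 - accuracy
        like_0 *= accuracy if v == 0 else 1.0 - accuracy
    joint_1 = prior_1 * like_1
    joint_0 = (1.0 - prior_1) * like_0
    # Normalize so the two posteriors sum to 1.
    return joint_1 / (joint_1 + joint_0)


print(round(posterior_category_1([1, 1, 1, 1, 0]), 3))  # 0.985
```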
<p>Recall the definition of kappa:</p>
<pre style="font-size:125%;">
     kappa = (agree - chanceAgree) / (1 - chanceAgree)
</pre>
<p>If errors are distributed randomly, agreement will be around 0.8^2 + 0.2^2 = 0.68 in a large sample, and the chance agreement will be 0.5^2 + 0.5^2 = 0.5, for a kappa value of (0.68-0.5)/(1-0.5)=0.36.  That&#8217;s a level of kappa that leads those who follow kappa to say &#8220;go back and rethink your task, your agreement&#8217;s not high enough&#8221;.   </p>
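<p>Here&#8217;s that kappa arithmetic as a small Python sketch, so it&#8217;s easy to try other accuracies.  It assumes two annotators with the same accuracy on both categories and independent errors; the function name <code>expected_kappa</code> is made up for illustration.</p>

```python
def expected_kappa(accuracy, prevalence=0.5):
    """Large-sample kappa for two equally accurate, independent annotators."""
    # Two annotators agree when both are right or both are wrong.
    agree = accuracy ** 2 + (1.0 - accuracy) ** 2
    # Each annotator's marginal rate of assigning category 1.
    p1 = prevalence * accuracy + (1.0 - prevalence) * (1.0 - accuracy)
    chance_agree = p1 ** 2 + (1.0 - p1) ** 2
    return (agree - chance_agree) / (1.0 - chance_agree)


print(round(expected_kappa(0.8), 2))  # 0.36
print(round(expected_kappa(0.9), 2))  # 0.64
```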
<p>Unfortunately, with 80% per-annotator accuracy, we only expect 74% or so of the examples to have a 4:1 or 5:0 vote for the correct category (74% = 5 * 0.8^4 * 0.2^1 + 0.8^5).  </p>
<p>I believe the question we really care about is when we can trust an annotation enough to put it in the gold standard.  So let&#8217;s say we have two 80% accurate annotators and the true category is 1.  The likelihoods of the various annotation outcomes are: </p>
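<p>The 74% figure is a binomial calculation, sketched here in Python under the same assumption of independent 80% accurate annotators.  The helper name <code>prob_correct_votes</code> is just for illustration.</p>

```python
from math import comb


def prob_correct_votes(k, n, accuracy):
    """Probability that exactly k of n independent annotators are correct."""
    return comb(n, k) * accuracy ** k * (1.0 - accuracy) ** (n - k)


# 4:1 or 5:0 in favor of the correct category.
right_side = prob_correct_votes(4, 5, 0.8) + prob_correct_votes(5, 5, 0.8)
# 4:1 or 5:0 in favor of the wrong category.
wrong_side = prob_correct_votes(1, 5, 0.8) + prob_correct_votes(0, 5, 0.8)
print(round(right_side, 4))  # 0.7373
print(round(wrong_side, 4))  # 0.0067
```

<p>So lopsided votes for the wrong category are rare, but about a quarter of the items end up with a 3:2 split.</p>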
<pre style="font-size:125%;">
     p(1,1) = 0.64     p(0,1) = 0.16
     p(0,0) = 0.04     p(1,0) = 0.16
</pre>
<p>So clearly two annotators aren&#8217;t enough to be confident to the 99% level: even when both agree on category 1, the posterior probability that they&#8217;re right is only 0.64/(0.64 + 0.04) = 94% under the 50-50 prior.  We&#8217;d need roughly 90% accurate annotators to get there with two.  But what about three annotators?  The chance of three 80% annotators all agreeing on the wrong category is only 0.8%.  And in 51.2% of the cases, they will agree and be right.  So we use a minimum of three annotators, and if they agree, go on. </p>
<p>If they disagree, we need to bring in more annotators until we&#8217;re confident of the outcome.  When we get to a four out of five vote, as in our first example, we&#8217;re confident again.  But even 3/4 agreement is still pretty weak, yielding only a 94% chance that the agreed upon value is correct.    </p>
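<p>One way to see why 4/5 is as good as 3/3 while 3/4 is weaker: with a 50-50 prior and equally accurate, independent annotators, the posterior depends only on the vote margin, because each vote multiplies the odds by accuracy/(1 - accuracy) in one direction or the other.  A minimal Python sketch under those assumptions (the function name is made up for illustration):</p>

```python
def posterior_majority_correct(margin, accuracy=0.8):
    """Posterior that the majority label is right, given the vote margin."""
    # Each net vote multiplies the odds by accuracy / (1 - accuracy).
    odds = (accuracy / (1.0 - accuracy)) ** margin
    return odds / (1.0 + odds)


for votes_for, votes_against in [(2, 0), (3, 0), (3, 1), (4, 1)]:
    margin = votes_for - votes_against
    print(votes_for, votes_against, round(posterior_majority_correct(margin), 3))
```

<p>A margin of 2 (2:0 or 3:1) gives about 94%, and a margin of 3 (3:0 or 4:1) gives about 98.5%, matching the numbers above.</p>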
<p>Of course, this line of reasoning supposes we know the annotator&#8217;s accuracy.  In practice, we can&#8217;t evaluate an annotator&#8217;s accuracy because we don&#8217;t know the true labels for items.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://lingpipe-blog.com/2008/07/28/good-kappas-not-necessary-either/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Good Kappa&#8217;s not Enough</title>
		<link>http://lingpipe-blog.com/2008/07/22/good-kappas-not-enough/</link>
		<comments>http://lingpipe-blog.com/2008/07/22/good-kappas-not-enough/#comments</comments>
		<pubDate>Tue, 22 Jul 2008 21:49:58 +0000</pubDate>
		<dc:creator>lingpipe</dc:creator>
		
		<category><![CDATA[Carp's Blog]]></category>

		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=112</guid>
		<description><![CDATA[I stumbled across Reidsma and Carletta&#8217;s Reliability measurement without limits, which is pending publication as a Computational Linguistics journal squib (no, not  a non-magical squib, but a linguistics squib).  
The issue they bring up is that if we&#8217;re annotating data, a high value for the kappa statistic isn&#8217;t enough to guarantee what they [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I stumbled across Reidsma and Carletta&#8217;s <a href="http://homepages.inf.ed.ac.uk/jeanc/reidsma-and-carletta.CL2008.pdf">Reliability measurement without limits</a>, which is pending publication as a <i>Computational Linguistics</i> journal squib (no, not  a non-magical <a href="http://en.wikipedia.org/wiki/Blood_purity_%28Harry_Potter%29#Squibs">squib</a>, but a linguistics <a href="http://en.wikipedia.org/wiki/Squib_%28linguistics%29">squib</a>).  </p>
<p>The issue they bring up is that if we&#8217;re annotating data, a high value for the kappa statistic isn&#8217;t enough to guarantee what they call &quot;reliability&quot;.  The problem is that the disagreements may not be random.  They focus on simulating the case where an annotator over-uses a label, which results in kappa way overestimating reliability when compared to performance versus the truth.  The reason is that the statistical model will be able to pick up on the pattern of mistakes and reproduce them, making the task look more reliable than it is.</p>
<p>This discussion is similar to the case we&#8217;ve been worrying about here in trying to figure out how we can annotate a named-entity corpus with high recall.  If there are hard cases that annotators miss (over-using the no-entity label), random agreement assuming equally hard problems will over-estimate the &quot;true&quot; recall.</p>
<p>Reidsma and Carletta&#8217;s simulation shows that there&#8217;s a strong effect from the relationship between true labels and features of the instances (as measured by <a href="http://en.wikipedia.org/wiki/Effect_size">Cramer&#8217;s phi</a>).</p>
<h3>Review of Cohen&#8217;s Kappa</h3>
<p>Recall that kappa is a &quot;chance-adjusted measure of agreement&quot;, which has been widely used in computational linguistics since Carletta&#8217;s 1996 <a href="http://www.aclweb.org/anthology-new/J/J96/J96-2004.pdf">squib on kappa</a>, defined by:</p>
<pre style="font-size:150%;">
kappa = (agreement - chanceAgreement)
           / (1 - chanceAgreement)
</pre>
<p>where <code>agreement</code> is just the empirical percentage of cases on which annotators agree, and <code>chanceAgreement</code> is the percentage of cases on which they&#8217;d agree by chance.  For Cohen&#8217;s kappa, chance agreement is measured by assuming annotators pick categories at random according to their own empirical category distribution (but there are lots of variants, as pointed out in this <a href="http://cswww.essex.ac.uk/technical-reports/2005/csm-437.pdf">Artstein and Poesio tech report</a>, a version of which is also in press at <i>The Journal of Kappa Studies</i>, aka <i>Computational Linguistics</i>).  Kappa values will range between -1 and 1, with 1 only occurring if they have perfect agreement.  </p>
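<p>For concreteness, here is a minimal Python sketch of Cohen&#8217;s kappa as defined above. The function name and the toy annotations are made up for illustration; this is not LingPipe code, just the formula spelled out for two annotators.</p>

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two equal-length, non-empty label lists")
    # Empirical agreement: fraction of items given the same label.
    agreement = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: each annotator draws labels at random from
    # their own empirical category distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (agreement - chance) / (1.0 - chance)

# Toy data (hypothetical): two annotators, ten items, two categories.
ann1 = ["PER", "PER", "O", "O", "O", "O", "PER", "O", "O", "O"]
ann2 = ["PER", "O", "O", "O", "O", "O", "PER", "O", "O", "PER"]
print(round(cohens_kappa(ann1, ann2), 3))
```

<p>In the toy data the annotators agree on 8 of 10 items, but because both use the <code>O</code> label 70% of the time, chance agreement is already 0.58, so kappa comes out well below the raw agreement.</p>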
<p>I (Bob) don&#8217;t like kappa, because it&#8217;s not estimating a probability (despite being an arithmetic combination of [maximum likelihood] probability estimates).  The only reason to adjust for chance is that it allows one, in theory, to compare different tasks.  The way this plays out in practice is that an arbitrary cross-task threshold is defined above which a labeling task is considered &quot;reliable&quot;.   </p>
<p>A final suggestion for those using kappa: confidence intervals from <a href="http://en.wikipedia.org/wiki/Bootstrapping_(statistics)">bootstrap resampling</a> would be useful to see how reliable the estimate of kappa itself is.</p>
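<p>A rough sketch of what such a bootstrap might look like follows. It resamples items (pairs of labels) with replacement and reads off a percentile interval; the percentile method, the number of resamples, and the toy counts are all my assumptions here, and other bootstrap intervals would be just as reasonable.</p>

```python
import random
from collections import Counter

def kappa(pairs):
    """Cohen's kappa from a list of (label_a, label_b) pairs."""
    n = len(pairs)
    agreement = sum(a == b for a, b in pairs) / n
    freq_a = Counter(a for a, _ in pairs)
    freq_b = Counter(b for _, b in pairs)
    chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (agreement - chance) / (1.0 - chance)

def bootstrap_kappa_interval(pairs, num_samples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for kappa, resampling items."""
    rng = random.Random(seed)
    n = len(pairs)
    estimates = []
    for _ in range(num_samples):
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        try:
            estimates.append(kappa(sample))
        except ZeroDivisionError:
            continue  # degenerate resample: both annotators used one label
    estimates.sort()
    tail = (1.0 - level) / 2.0
    low = estimates[int(tail * len(estimates))]
    high = estimates[min(len(estimates) - 1, int((1.0 - tail) * len(estimates)))]
    return low, high

# Toy confusion counts (hypothetical): 200 doubly annotated items.
pairs = ([("PER", "PER")] * 40 + [("O", "O")] * 140
         + [("PER", "O")] * 12 + [("O", "PER")] * 8)
low, high = bootstrap_kappa_interval(pairs)
print(round(kappa(pairs), 3), round(low, 3), round(high, 3))
```

<p>The width of the interval gives a feel for how much of an observed kappa is down to the particular sample of items that happened to be annotated.</p>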
</div>]]></content:encoded>
			<wfw:commentRss>http://lingpipe-blog.com/2008/07/22/good-kappas-not-enough/feed/</wfw:commentRss>
		</item>
		<item>
		<title>How Can I Build a Classifier with no Negative Data?</title>
		<link>http://lingpipe-blog.com/2008/07/17/how-can-i-build-a-classifier-with-no-negative-data/</link>
		<comments>http://lingpipe-blog.com/2008/07/17/how-can-i-build-a-classifier-with-no-negative-data/#comments</comments>
		<pubDate>Fri, 18 Jul 2008 00:30:50 +0000</pubDate>
		<dc:creator>lingpipe</dc:creator>
		
		<category><![CDATA[Carp's Blog]]></category>

		<category><![CDATA[LingPipe in Use]]></category>

		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=110</guid>
		<description><![CDATA[As part of our NIH grant, we&#8217;re working on the database linkage problem from gene/protein mentions in MEDLINE to database entries in EntrezGene.  Basically, it&#8217;s what the biologists call &#34;gene normalization&#34;, and was the basis of Biocreative Task 2.
I can summarize the problem we&#8217;re having with a simple example.  We&#8217;d like to classify [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>As part of our NIH grant, we&#8217;re working on the database linkage problem from gene/protein mentions in <a href="http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed">MEDLINE</a> to database entries in <a href="http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene">EntrezGene</a>.  Basically, it&#8217;s what the biologists call &quot;gene normalization&quot;, and was the basis of <a href="http://biocreative.sourceforge.net/biocreative_2_gn.html">Biocreative Task 2</a>.</p>
<p>I can summarize the problem we&#8217;re having with a simple example.  We&#8217;d like to classify all 17M or so entries in MEDLINE as to whether they&#8217;re about genomics or not.  EntrezGene provides links to 200K citations that are about particular genes, so we have a pile of articles about genomics (making up about 300 million characters).  What we don&#8217;t have is any negative training data.  </p>
<p>So my question is:  how do I build a classifier for articles about genomics versus those that are not about genomics?</p>
<p>The job running in the background, which is giving me time to write this post, is computing cross-validated cross-entropy rates for all of these 200K citations.  That is, I train a character-level language model on 180K citations and use it to score the other 20K, repeating so that every citation is held out once.  This gives me a set of scores we can expect for positive examples (assuming there&#8217;s no bias in that 200K set, which there is in terms of recency and particular gene focus, not to mention focus on human genomics).  I&#8217;m going to plot this and see what the curve looks like.  Empirically, we can then set a threshold that would accept 99% of the articles we have.  Unfortunately, I have no idea how well this&#8217;ll work in practice at rejecting the articles that aren&#8217;t about genomics.</p>
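<p>To make the thresholding idea concrete, here is a toy Python sketch. The character model is a plain add-one-smoothed n-gram model standing in for a real language model (it is not LingPipe&#8217;s), the class and function names are hypothetical, and the ten &#8220;citations&#8221; are invented; only the shape of the procedure is meant to match the description above.</p>

```python
import math
from collections import Counter

class CharNGramModel:
    """Toy character n-gram model with add-one smoothing (a stand-in)."""
    def __init__(self, order=3, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size
        self.context_counts = Counter()
        self.ngram_counts = Counter()

    def train(self, text):
        padded = " " * (self.order - 1) + text
        for i in range(self.order - 1, len(padded)):
            context = padded[i - self.order + 1:i]
            self.context_counts[context] += 1
            self.ngram_counts[context + padded[i]] += 1

    def cross_entropy_rate(self, text):
        """Average negative log2 probability per character (lower fits better)."""
        padded = " " * (self.order - 1) + text
        total = 0.0
        for i in range(self.order - 1, len(padded)):
            context = padded[i - self.order + 1:i]
            numerator = self.ngram_counts[context + padded[i]] + 1.0
            denominator = self.context_counts[context] + self.alphabet_size
            total -= math.log2(numerator / denominator)
        return total / max(1, len(text))

def held_out_scores(docs, num_folds=10, order=3):
    """Score every document with a model trained on the other folds."""
    scores = []
    for fold in range(num_folds):
        model = CharNGramModel(order)
        for i, doc in enumerate(docs):
            if i % num_folds != fold:
                model.train(doc)
        scores.extend(model.cross_entropy_rate(doc)
                      for i, doc in enumerate(docs) if i % num_folds == fold)
    return scores

def acceptance_threshold(scores, accept_fraction=0.99):
    """Smallest score cutoff that accepts at least accept_fraction of scores."""
    ranked = sorted(scores)
    index = max(0, math.ceil(accept_fraction * len(ranked)) - 1)
    return ranked[index]

# Invented stand-ins for positive (genomics) citations.
docs = [
    "BRCA1 mutations and breast cancer risk",
    "p53 gene expression in tumor cells",
    "Regulation of the lac operon in E. coli",
    "Mutations in the CFTR gene cause cystic fibrosis",
    "Expression of the HOX gene cluster in embryos",
    "The APOE gene and risk of Alzheimer disease",
    "Transcription of the insulin gene in beta cells",
    "Cloning of the human dystrophin gene",
    "Polymorphisms in the TNF gene promoter",
    "Deletion of the RB1 gene in retinoblastoma",
]
scores = held_out_scores(docs, num_folds=5)
threshold = acceptance_threshold(scores, accept_fraction=0.99)
print(round(threshold, 3))
```

<p>A new article would then be accepted when its cross-entropy rate under a model trained on all the positives is no higher than <code>threshold</code>. As the paragraph above says, nothing in this procedure tells us how many non-genomics articles would also slip under the cutoff.</p>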
<p>For the genomics/non-genomics problem, we can just annotate a few thousand examples; it&#8217;ll only take a day or two.</p>
<p>The real problem is that we want to build models to classify contexts for the 30K or so human gene entries in Entrez.  Some of them have a handful of example docs, some have hundreds.  We&#8217;re going to pull out articles with potential mentions, then filter with the classifier.  It&#8217;s related to what we did in Phase I of our grant and reported in our <a href="http://lingpipe.files.wordpress.com/2008/04/alias-i-gene-linkage-06.pdf">gene linkage white paper</a>.  In that setting, we can generate candidate docs using approximate matching of aliases, then use the scores to rank the possible docs according to their language model scores against the known articles for the gene in question.  This is great in a search context, but doesn&#8217;t give us a go/no-go decision point, which we need for some of our downstream applications.</p>
<p>If anyone knows how to tackle this problem, I&#8217;d love to hear about it.  I might even implement it as part of LingPipe if the idea&#8217;s simple and general enough.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://lingpipe-blog.com/2008/07/17/how-can-i-build-a-classifier-with-no-negative-data/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Hyphenation as a Noisy Channel</title>
		<link>http://lingpipe-blog.com/2008/07/11/hyphenation-a-a-noisy-channel/</link>
		<comments>http://lingpipe-blog.com/2008/07/11/hyphenation-a-a-noisy-channel/#comments</comments>
		<pubDate>Fri, 11 Jul 2008 21:29:52 +0000</pubDate>
		<dc:creator>lingpipe</dc:creator>
		
		<category><![CDATA[Carp's Blog]]></category>

		<category><![CDATA[LingPipe in Use]]></category>

		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=108</guid>
		<description><![CDATA[Noisy channel models with very simple deterministic channels can be surprisingly effective at simple linguistic tasks like word splitting.  We&#8217;ve used them for Chinese word segmentation (aka word boundary detection, aka tokenization), not to mention spelling correction. 
In this blog entry, I&#8217;ll provide a first look at our forthcoming tutorial on hyphenation.  The [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Noisy channel models with very simple deterministic channels can be surprisingly effective at simple linguistic tasks like word splitting.  We&#8217;ve used them for <a href="http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html">Chinese word segmentation</a> (aka word boundary detection, aka tokenization), not to mention <a href="http://alias-i.com/lingpipe/demos/tutorial/querySpellChecker/read-me.html">spelling correction</a>. </p>
<p>In this blog entry, I&#8217;ll provide a first look at our forthcoming tutorial on hyphenation.  The hyphenation problem is that of determining whether a hyphen can be inserted at a given position between two characters of a word.  Hyphenation is an orthographic process, which means it operates over spellings.  In contrast, syllabification is a phonological (or phonetic) process, which means it operates over sounds.  Hyphenation roughly follows syllabification, but is also prone to follow morphological split points.</p>
<p>The hyphenation problem isn&#8217;t even well-defined on a per-word level.  There are pairs such as <code>num-ber</code> (something you count with) and <code>numb-er</code> (when you get more numb) that have the same spelling, but different pronunciations and corresponding hyphenations depending on how they are used.  I&#8217;ll just ignore this problem here; in our evaluation, ambiguities always produce at least one error.</p>
<h3>The Noisy Channel Model</h3>
<p>The noisy channel model consists of a source that generates messages and a noisy channel along which they are passed.  The receiver&#8217;s job is to recover (i.e. decode) the underlying message.  For hyphenation, the source is a model of what English hyphenation looks like.  The channel model deterministically removes hyphens.  The receiver thus needs to figure out where the hyphens should be reinserted to recover the original message at the source.</p>
<p>The training corpus for the source model consists of words with hyphenations represented by hyphens.  The model is just a character language model (I used 8-grams, but it&#8217;s not very length sensitive).  This gives us estimates of probabilities like <code>p("che-ru-bic")</code> and <code>p("cher-u-bic")</code>.  Our channel model just deterministically removes hyphens, so that <code>p("cherubic"|"che-ru-bic") = 1.0</code>.  </p>
<p>To use the noisy channel model to find a hyphenation given an unhyphenated word, we just find the hyphenation <code>h</code> that is most likely given the word <code>w</code>, using Bayes&#8217;s rule in a maximization setting:  <code>ARGMAX<sub>h</sub>&nbsp;p(h|w) = ARGMAX<sub>h</sub>&nbsp;p(w|h)*p(h)</code>.  Because the channel probabilities <code>p(w|h)</code> are always 1.0 if the characters in <code>w</code> match those in <code>h</code>, this reduces to finding the hyphenation <code>h</code> yielding character sequence <code>w</code> which maximizes <code>p(h)</code>.  For example, the model will estimate <code>"che-ru-bic"</code> to be a more likely hyphenation than <code>"cher-u-bic"</code> if <code>p("che-ru-bic") &gt; p("cher-u-bic")</code>.  </p>
<p> Decoding is fairly efficient for this task, despite using a high-order n-gram language model, because the channel bounds the combinatorics by only allowing a single hyphen to be inserted at each point; dynamic programming into language model states can reduce them even further.</p>
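<p>Here is a small Python sketch of the decoder described above. It is deliberately naive: the source model is an add-one-smoothed character 4-gram model rather than the 8-gram model I actually used, the candidate hyphenations are enumerated by brute force rather than with dynamic programming, and the class name, function names and training words are all made up for the example.</p>

```python
import math
from collections import Counter
from itertools import product

class HyphenationSource:
    """Character n-gram source model over hyphenated words (add-one smoothed)."""
    def __init__(self, order=4):
        self.order = order
        self.ngrams = Counter()
        self.contexts = Counter()
        self.alphabet = set("-$")  # hyphen and end marker are always possible

    def _padded(self, hyphenated):
        return "^" * (self.order - 1) + hyphenated + "$"

    def train(self, hyphenated_words):
        for word in hyphenated_words:
            self.alphabet.update(word)
            padded = self._padded(word)
            for i in range(self.order - 1, len(padded)):
                context = padded[i - self.order + 1:i]
                self.contexts[context] += 1
                self.ngrams[context + padded[i]] += 1

    def log_prob(self, hyphenated):
        """Estimate of log p(h) for a hyphenated string h."""
        padded = self._padded(hyphenated)
        total = 0.0
        for i in range(self.order - 1, len(padded)):
            context = padded[i - self.order + 1:i]
            numerator = self.ngrams[context + padded[i]] + 1.0
            denominator = self.contexts[context] + len(self.alphabet)
            total += math.log(numerator / denominator)
        return total

def candidates(word):
    """Every h with p(w|h) = 1: at most one hyphen in each internal gap."""
    gaps = max(0, len(word) - 1)
    for choice in product([False, True], repeat=gaps):
        pieces = [word[0]] if word else []
        for ch, hyphen in zip(word[1:], choice):
            pieces.append("-" + ch if hyphen else ch)
        yield "".join(pieces)

def hyphenate(model, word):
    """ARGMAX over h of p(w|h) * p(h), which here is just ARGMAX of p(h)."""
    return max(candidates(word), key=model.log_prob)

# Tiny invented training set; a real one would be tens of thousands of words.
model = HyphenationSource(order=4)
model.train(["che-ru-bic", "num-ber", "lum-ber", "cher-ry", "cu-bic", "ru-by"])
print(hyphenate(model, "cherubic"))
```

<p>Brute force is exponential in word length, which is fine for a demo but is exactly what the dynamic programming over language model states avoids.</p>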
<h3>English Evaluation</h3>
<p>So how do we test how well the model works?  We just became members of the Linguistic Data Consortium, who distribute Baayen, Piepenbrock and Gulikers&#8217; <a href="http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96L14">CELEX 2 corpus</a>.  CELEX is a great corpus.  It contains pronunciations, syllabifications, and hyphenations for a modest-sized dictionary of Dutch, English and German.  For instance, there are 66,332 distinct English words in the corpus (there are many more entries, but these constitute duplicates and compounds whose constituents have their normal hyphenations).  These 66,332 words have 66,418 unique hyphenations, meaning 86 of the words lead to ambiguities.  This is about 1/10th of a percent, so I won&#8217;t worry about it here.</p>
<p>I evaluated with randomly partitioned 10-fold <a href="http://en.wikipedia.org/wiki/Cross-validation">cross-validation</a>.  With the noisy channel model above, LingPipe had a 95.4% whole word accuracy, with a standard deviation of 0.2%.  That means we got 95.4% of the words completely correctly hyphenated.  We can also look at the 111,521 hyphens in the corpus, over which we had a 97.3% precision and a 97.4% recall.  That is, we missed 2.6% of the true hyphenation points (false negatives), and 2.7% of the hyphenations we returned were spurious (false positives).  Finally, we can look at per-decision accuracy, for which there were 482,045 positions between characters, over which we were 98.8% accurate.</p>
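<p>The four numbers reported above are related but measure different things, so here is a short Python sketch of how they can be computed from gold and predicted hyphenations. The function names and the three-word example are invented; this is my reading of the metrics, not the evaluation code itself.</p>

```python
def hyphen_positions(hyphenated):
    """Offsets into the unhyphenated word that have a hyphen just before them."""
    positions = set()
    offset = 0
    for ch in hyphenated:
        if ch == "-":
            positions.add(offset)
        else:
            offset += 1
    return positions

def evaluate(gold, predicted):
    """gold and predicted are parallel lists of hyphenated words."""
    tp = fp = fn = gaps = words_correct = 0
    for g, p in zip(gold, predicted):
        g_pos, p_pos = hyphen_positions(g), hyphen_positions(p)
        tp += len(g_pos.intersection(p_pos))  # hyphens found correctly
        fp += len(p_pos - g_pos)              # spurious hyphens
        fn += len(g_pos - p_pos)              # missed hyphens
        gaps += len(g.replace("-", "")) - 1   # positions between characters
        words_correct += int(g_pos == p_pos)
    return {
        "word_accuracy": words_correct / len(gold),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "decision_accuracy": (gaps - fp - fn) / gaps,
    }

gold = ["che-ru-bic", "num-ber", "hy-phen"]
predicted = ["cher-u-bic", "num-ber", "hy-phen"]
result = evaluate(gold, predicted)
print(result)
```

<p>In the example, one misplaced hyphen in <code>cherubic</code> costs a whole word, one false positive and one false negative, but only two of the seventeen per-position decisions, which is why per-decision accuracy is always the most flattering of the four.</p>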
<h3>Forward and Backward Models</h3>
<p>But that&#8217;s not all.  Like HMMs, there&#8217;s a degree of <a href="http://lingpipe-blog.com/2008/06/09/per-tag-error-function-for-conditional-random-fields-for-removing-label-bias/">label bias</a> in a left-to-right language model.  So I reversed all the data and built a right-to-left (or back-to-front) model.  Using n-best extraction, I ran this two ways.  First, I just added the score to the forward model to get an interpolated score.  Somewhat surprisingly, it behaved almost identically to the forward-only model, with slightly lower per-hyphen and per-decision scores.  More interestingly, I then ran them in intersect mode, which means only returning a hyphen if it was in the first-best analysis of both the left-to-right and right-to-left models.  This lowered per-word accuracy to 94.6% (from 95.4%), but raised precision to 98.0% (from 97.3%) while only lowering recall to 96.7% (from 97.4%).  Overall, hyphenation is considered to be a precision business in application, as it&#8217;s usually used to split words across lines in documents, and many split points might work.</p>
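<p>Intersect mode is easy to sketch in Python. The function names below are hypothetical and the two first-best outputs are invented; the only real point is that the right-to-left model sees reversed strings, so its output has to be reversed back before the hyphen positions are compared.</p>

```python
def hyphen_positions(hyphenated):
    """Offsets into the unhyphenated word that have a hyphen just before them."""
    positions = set()
    offset = 0
    for ch in hyphenated:
        if ch == "-":
            positions.add(offset)
        else:
            offset += 1
    return positions

def insert_hyphens(word, positions):
    """Rebuild a hyphenated string from a word and a set of offsets."""
    return "".join("-" + ch if i in positions else ch
                   for i, ch in enumerate(word))

def intersect(forward, backward):
    """Keep only the hyphens that both analyses agree on."""
    word = forward.replace("-", "")
    if backward.replace("-", "") != word:
        raise ValueError("analyses are for different words")
    shared = hyphen_positions(forward).intersection(hyphen_positions(backward))
    return insert_hyphens(word, shared)

# Invented first-best outputs for illustration.
forward_best = "cher-u-bic"             # from the left-to-right model
backward_raw = "cib-ur-ehc"             # right-to-left model output, still reversed
backward_best = backward_raw[::-1]      # back to reading order: "che-ru-bic"
print(intersect(forward_best, backward_best))  # prints "cheru-bic"
```

<p>Because a hyphen survives only when both directions propose it, intersecting can never add a hyphen, which is where the precision gain and the recall loss come from.</p>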
<h3>Is it State of the Art?</h3>
<p>The best results I&#8217;ve seen for this task were reported in Bartlett, Kondrak and Cherry&#8217;s 2008 ACL paper <a href="http://aclweb.org/anthology-new/P/P08/P08-1065.pdf">Automatic Syllabification with Structured SVMs for Letter-To-Phoneme Conversion</a>, which also received an outstanding paper award.  They treated this problem as a tagging problem and applied structured support vector machines (SVMs).  On a fixed 5K testing set, they report a 95.65% word accuracy, which is slightly higher than our 95.4%.  The 95% binomial confidence intervals for 5000 test cases are +/- 0.58% and our measured standard deviation was 0.2%, with results ranging from 95.1 to 95.7% on different folds.  Their paper also tackled pronunciation, for which their hyphenation was only one feature.</p>
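<p>The interval half-width quoted above can be checked with the usual normal approximation to the binomial. The Python below is just that arithmetic; the function name is made up, and using 95.4% as the accuracy is my assumption about which value to plug in.</p>

```python
import math

def binomial_half_width(accuracy, num_cases, z=1.96):
    """Half-width of the normal-approximation interval for a proportion."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / num_cases)

half_width = binomial_half_width(0.954, 5000)
print(f"+/- {100 * half_width:.2f}%")  # prints "+/- 0.58%"
```

<p>With a half-width of that size on 5000 test cases, a difference of a quarter of a percentage point between 95.65% and 95.4% is well inside the noise.</p>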
<p>German and Dutch are easier to hyphenate than English.  Syllabification in all of these languages is also easier than hyphenation.  </p>
<h3>But is it better than 1990?</h3>
<p>Before someone in the audience gets mad at me, I want to point out that we could follow <a href="http://research.microsoft.com/users/church/wwwfiles/published_1990_l2s.final.ps">Coker, Church and Liberman (1990)</a>&#8217;s lead in reporting results:</p>
<blockquote><p>
The Bell Laboratories Text-to-Speech system, <a href="http://www.bell-labs.com/project/tts/"><i>TTS</i></a>, takes a radical dictionary-based approach; dictionary methods (with morphological and analogical extensions) are used for the vast majority of words.  Only a fraction of a percent (0.5% of words overall; 0.1% of lowercase words) are left for letter-to-sound rules.
</p></blockquote>
<p>Although this insight won&#8217;t get us into <a href="http://nips.cc/">NIPS</a>, it&#8217;s how we&#8217;d field an application.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://lingpipe-blog.com/2008/07/11/hyphenation-a-a-noisy-channel/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Lucene&#8217;s Missing Token Stream Factory</title>
		<link>http://lingpipe-blog.com/2008/07/06/lucenes-missing-token-stream-factory/</link>
		<comments>http://lingpipe-blog.com/2008/07/06/lucenes-missing-token-stream-factory/#comments</comments>
		<pubDate>Mon, 07 Jul 2008 04:48:16 +0000</pubDate>
		<dc:creator>lingpipe</dc:creator>
		
		<category><![CDATA[Carp's Blog]]></category>

		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=107</guid>
		<description><![CDATA[While we&#8217;re on the subject of tokenization, every time I (Bob) use Lucene, I&#8217;m struck by its lack of a tokenizer factory abstraction.
Lucene&#8217;s abstract class TokenStream defines an iterator-style method next() that returns the next token or null if there are no more tokens.
Lucene uses the class Token for tokens.  A Lucene token [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>While we&#8217;re on the subject of <a href="http://lingpipe-blog.com/2008/06/26/the-curse-of-intelligent-tokenization/">tokenization</a>, every time I (Bob) use <a href="http://lucene.apache.org">Lucene</a>, I&#8217;m struck by its lack of a tokenizer factory abstraction.</p>
<p>Lucene&#8217;s abstract class <a href="http://lucene.apache.org/java/2_3_2/api/core/org/apache/lucene/analysis/TokenStream.html"><code>TokenStream</code></a> defines an iterator-style method <code>next()</code> that returns the next token or <code>null</code> if there are no more tokens.</p>
<p>Lucene uses the class <a href="http://lucene.apache.org/java/2_3_2/api/core/org/apache/lucene/analysis/Token.html"><code>Token</code></a> for tokens.  A Lucene token contains a string representing its text, a start and end position, and a lexical type which I&#8217;ve never seen used. Because LingPipe has to handle many tokens quickly without taxing garbage collection, it doesn&#8217;t create objects for them beyond their string texts.  But that&#8217;s the subject of another blog entry.</p>
<p>A Lucene <a href="http://lucene.apache.org/java/2_3_2/api/core/org/apache/lucene/document/Document.html"><code>Document</code></a> is essentially a mapping from field names, represented as strings, to values, also represented as strings. Each field in a document may be treated differently with respect to tokenization.  For instance, some might be dates, others ordinary text, and others keyword identifiers.</p>
<p>To handle this fielded document structure, the Lucene class <a href="http://lucene.apache.org/java/2_3_2/api/core/org/apache/lucene/analysis/Analyzer.html"><code>analysis.Analyzer</code></a> maps field names, represented as strings, and values, represented as instances of <code>java.io.Reader</code>, to token streams.  The choice of <code>Reader</code> for values is itself puzzling because it introduces I/O exceptions and the question of who&#8217;s responsible for closing the reader.</p>
<p>Lucene overloads the analyzer class itself to provide the functionality of LingPipe&#8217;s tokenizer factory.  Lucene classes such as <a href="http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/SimpleAnalyzer.html"><code>SimpleAnalyzer</code></a> and <a href="http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/analysis/cjk/CJKAnalyzer.html"><code>CJKAnalyzer</code></a> return the same token stream no matter which field is specified. In other words, the field name is ignored.</p>
<p>What would be useful would be a Lucene interface <code>analysis.TokenStreamFactory</code> with a simple method <code>TokenStream tokenizer(CharSequence input)</code> (note how we&#8217;ve replaced the analyzer&#8217;s reader input with a generic string).  Then analyzers could be built by associating token stream factories with fields.  This would be the natural place to implement Lucene&#8217;s simple analyzer, CJK analyzer, and so on.  The current analyzer behavior is then easily derived with an analyzer which sets a default token stream factory for fields.</p>
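<p>To show the shape of the design without pretending it is Lucene&#8217;s API, here is a language-neutral sketch in Python. Everything in it is hypothetical: <code>FieldAnalyzer</code>, <code>simple_factory</code> and <code>keyword_factory</code> are names I made up, and a &#8220;token stream&#8221; is reduced to an iterator over strings.</p>

```python
import re
from typing import Callable, Iterator

# A "token stream factory" here is just a callable from text to tokens.
TokenStreamFactory = Callable[[str], Iterator[str]]

def simple_factory(text):
    """Lowercased runs of letters, loosely in the spirit of a simple analyzer."""
    return (m.group().lower() for m in re.finditer(r"[^\W\d_]+", text))

def keyword_factory(text):
    """Treat the whole field value as a single token."""
    return iter([text])

class FieldAnalyzer:
    """Hypothetical analyzer: per-field factories plus a default factory."""
    def __init__(self, default_factory, field_factories=None):
        self.default_factory = default_factory
        self.field_factories = dict(field_factories or {})

    def token_stream(self, field_name, text):
        # Fields without their own factory fall back to the default, which
        # reproduces the behavior of analyzers that ignore the field name.
        factory = self.field_factories.get(field_name, self.default_factory)
        return factory(text)

analyzer = FieldAnalyzer(simple_factory, {"id": keyword_factory})
body_tokens = list(analyzer.token_stream("body", "Lucene's Missing Factory"))
id_tokens = list(analyzer.token_stream("id", "PMID-12345"))
print(body_tokens)
print(id_tokens)
```

<p>The factories take plain strings rather than readers, so there are no I/O exceptions to declare and nothing for the caller to close.</p>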
</div>]]></content:encoded>
			<wfw:commentRss>http://lingpipe-blog.com/2008/07/06/lucenes-missing-token-stream-factory/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The Curse of &#8220;Intelligent&#8221; Tokenization</title>
		<link>http://lingpipe-blog.com/2008/06/26/the-curse-of-intelligent-tokenization/</link>
		<comments>http://lingpipe-blog.com/2008/06/26/the-curse-of-intelligent-tokenization/#comments</comments>
		<pubDate>Thu, 26 Jun 2008 22:06:21 +0000</pubDate>
		<dc:creator>lingpipe</dc:creator>
		
		<category><![CDATA[Carp's Blog]]></category>

		<category><![CDATA[LingPipe in Use]]></category>

		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=106</guid>
		<description><![CDATA[We&#8217;re now running into a problem we&#8217;ve run into before: so-called &#8220;intelligent&#8221; tokenization.  The earliest version of this of which I&#8217;m aware is the Penn Treebank tokenization, which assumes sentence splitting has already been done.  That way, the end-of-sentence punctuation can be treated differently than other punctuation.  Specifically, &#8220;Mr. Smith ran.  [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>We&#8217;re now running into a problem we&#8217;ve run into before: so-called &#8220;intelligent&#8221; tokenization.  The earliest version of this of which I&#8217;m aware is the <a href="http://www.cis.upenn.edu/~treebank/tokenization.html">Penn Treebank tokenization</a>, which assumes sentence splitting has already been done.  That way, the end-of-sentence punctuation can be treated differently than other punctuation.  Specifically, &#8220;Mr. Smith ran.  Then he jumped.&#8221; gets split into two sentences, &#8220;Mr. Smith ran.&#8221; and &#8220;Then he jumped.&#8221;.  Now the fun starts.  The periods at the end of a sentence are split off into their own token.  The period after &#8220;Mr&#8221; remains, so the tokens are &#8220;Mr.&#8221;, &#8220;Smith&#8221;, &#8220;ran&#8221; and &#8220;.&#8221;.   Note that the Treebank tokenizer also replaces double quotes with either left or right LaTeX-style quotes, so there&#8217;s no way to reconstruct the underlying text from the tokens.  Like many other projects, they also throw away the whitespace information in the data, so there&#8217;s no way to train something to do tokenization that&#8217;s whitespace sensitive because we just don&#8217;t have the whitespace.  That&#8217;s what you get for releasing data like &#8220;John/PN ran/V ./PUNCT&#8221;: you just don&#8217;t know if there was space between that final verb &#8220;ran&#8221; and the full stop &#8220;.&#8221;.   You&#8217;ll also see their <a href="http://www.cis.upenn.edu/~treebank/tokenizer.sed">script</a> builds in all sorts of knowledge about English, such as a handful of contractions, so that &#8220;cannot&#8221; is split out into two tokens, &#8220;can&#8221; and &#8220;not&#8221;.  </p>
<p>The most recent form of &#8220;intelligent&#8221; tokenization I&#8217;ve seen is perpetrated by UPenn, this time as part of their <a href="http://bioie.ldc.upenn.edu/">BioIE</a> project.  There, the data&#8217;s not even consistently tokenized, because they left it to annotators to decide on token boundaries.  They then use statistical chunkers to uncover the tokenizations probabilistically.</p>
<p><a href="http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html">Google&#8217;s n-gram data</a> is also distributed without a tokenizer.  It looks very simple, but there are lots of heuristic boundary conditions that make it very hard to run on new text.  Practically speaking, the data&#8217;s out of bounds anyway because of its <a href="http://www.ldc.upenn.edu/Catalog/mem_agree/Web_1T_5gram_V1_User_Agreement.html">research-only license</a>.  Unlike the Penn Treebank or French Treebank, there&#8217;s no option to buy commercial licenses.</p>
<p>I&#8217;ve just been struggling with the <a href="">French Treebank</a>, which follows the Penn Treebank&#8217;s lead in using &#8220;intelligent&#8221; tokenization.  The problem for us is that we don&#8217;t know French, so we can&#8217;t quite read the technical docs (in French), nor induce from the corpus how tokenization was done.  </p>
<p>This is all terribly problematic for the &#8220;traditional&#8221; parsing model of first running tokenization, then running analysis.  The problem is that the tokenization depends on the analysis and vice-versa.  At least with the Penn approach, there&#8217;s code to do their ad hoc sentence splitting and then their ad hoc heuristic tokenization.  It may not be coherent from an English point of view (a handful of contractions will be split; others won&#8217;t), but at least it&#8217;s reproducible.</p>
<p>Our own approach (in practice &#8212; in theory we can plug and play any tokenizer that can be coded) has been to take very fine-grained tokenizations so that the tokenization would be compatible with any old kind of tagger.  Our HMM chunker pays attention to tokenization, but the rescoring chunker uses whitespace and longer-distance token information.</p>
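<p>As an illustration of what I mean by fine-grained, here is a toy Python tokenizer that splits on every change between letters, digits and punctuation, and keeps the whitespace so the original text can always be rebuilt. The regular expression and the function names are mine for this sketch and are not LingPipe&#8217;s actual tokenizer.</p>

```python
import re

# Runs of letters, runs of digits, or any other single non-space character.
TOKEN_PATTERN = re.compile(r"[^\W\d_]+|\d+|\S")

def tokens_and_whitespaces(text):
    """Return parallel lists: tokens, and the whitespace around them.

    whitespaces[i] precedes tokens[i]; the final entry follows the last token.
    """
    tokens, whitespaces = [], []
    last_end = 0
    for m in TOKEN_PATTERN.finditer(text):
        whitespaces.append(text[last_end:m.start()])
        tokens.append(m.group())
        last_end = m.end()
    whitespaces.append(text[last_end:])
    return tokens, whitespaces

def reconstruct(tokens, whitespaces):
    """Rebuild the original text exactly from tokens and whitespaces."""
    pieces = [whitespaces[0]]
    for token, space in zip(tokens, whitespaces[1:]):
        pieces.append(token + space)
    return "".join(pieces)

text = "Mr. Smith ran.  Then p-53 jumped."
tokens, whitespaces = tokens_and_whitespaces(text)
print(tokens)
print(reconstruct(tokens, whitespaces) == text)
```

<p>Nothing is ever thrown away here: any entity, however oddly punctuated, is some contiguous run of these tokens and their whitespace, so a downstream chunker can still recover it.</p>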
<p>At the BioNLP workshop at ACL 2008, <a href="http://wwmm.ch.cam.ac.uk/blogs/corbett/">Peter Corbett</a> presented a paper (with Anne Copestake) on  <a href="http://aclweb.org/anthology-new/W/W08/W08-0608.pdf">Cascaded Classifiers for Confidence-Based Chemical Named Entity Recognition</a>.  It&#8217;s a really neat paper that addresses issues of confidence estimation, and particularly trading precision for recall (or vice-versa).  But they weren&#8217;t able to reproduce our <a href="http://lingpipe.files.wordpress.com/2008/04/alias-i-biocreativeii.pdf">99.99% recall gene/protein name extraction</a>.  During the question period, we got to the bottom of what was going on, which turned out to be intelligent tokenization making mistakes so that entities weren&#8217;t extractable because they were only parts of tokens.  I&#8217;m hoping Peter does the analysis to see how many entities are permanently lost due to tokenization errors.</p>
<p>So why do people do &#8220;intelligent&#8221; tokenization?  The hope is that by making the token decisions more intelligently, downstream processing like part-of-speech tagging is easier.  For instance, it&#8217;s difficult to even make sense of assigning part-of-speech tags to three tokens derived from &#8220;p-53&#8221;, namely &#8220;p&#8221;, &#8220;-&#8221; and &#8220;53&#8221;.  Especially if you throw away whitespace information.</p>
<p>Tokenization is particularly vexing in the bio-medical text domain, where there are tons of words (or at least phrasal lexical entries) that contain parentheses, hyphens, and so on.  This turned out to be <a href="http://aclweb.org/anthology-new/W/W08/W08-0507.pdf">a problem for WordNet</a>).</p>
<p>In some sense, tokenization is even more vexing in Chinese, which isn&#8217;t written with spaces.  To get around that problem, our named-entity detector just treats each character as a token; that worked pretty well for <a href="http://lingpipe.files.wordpress.com/2008/04/alias-i-sighan06.pdf">our entry in the SIGHAN 3 bakeoff</a>.  There were even two papers on jointly modeling segmentation and tagging for Chinese at this year&#8217;s ACL (<a href="http://aclweb.org/anthology-new/P/P08/P08-1102.pdf">Jiang et al.</a> and <a href="http://aclweb.org/anthology-new/P/P08/P08-1101.pdf">Zhang et al.</a>).  Joint modeling of this kind seems like a promising approach to allowing &#8220;intelligent&#8221; tokenization; by extending the tokenization model far enough, we could even maintain high end-to-end recall, which is not possible with a state-of-the-art first-best probabilistic tokenizer.</p>
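<p>Treating each character as a token is about as simple as tokenization gets. A minimal sketch, with the function name <code>char_tokens</code> and the choice to skip whitespace both being assumptions for illustration rather than our actual implementation:</p>

```python
def char_tokens(text):
    """Return (character, start, end) triples, one per non-whitespace character."""
    return [(ch, i, i + 1) for i, ch in enumerate(text) if not ch.isspace()]


tokens = char_tokens("北京 大学")
```

<p>Because every character boundary is a token boundary, no entity can be lost to a segmentation error; the cost is that the tagger has to do all the work of figuring out where words begin and end.</p>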
</div>]]></content:encoded>
			<wfw:commentRss>http://lingpipe-blog.com/2008/06/26/the-curse-of-intelligent-tokenization/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>