<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments for LingPipe Blog</title>
	<atom:link href="http://lingpipe-blog.com/comments/feed/" rel="self" type="application/rss+xml" />
	<link>http://lingpipe-blog.com</link>
	<description>Alias-i’s Natural Language Processing and Text Analytics API</description>
	<pubDate>Wed, 20 Aug 2008 23:59:55 +0000</pubDate>
	<generator>http://wordpress.org/?v=MU</generator>
		<item>
		<title>Comment on Good Kappa&#8217;s Not Necessary, Either by lingpipe</title>
		<link>http://lingpipe-blog.com/2008/07/28/good-kappas-not-necessary-either/#comment-2715</link>
		<dc:creator>lingpipe</dc:creator>
		<pubDate>Thu, 14 Aug 2008 21:26:57 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=118#comment-2715</guid>
		<description>What I'm trying to figure out is just how many annotators by items we'll need to get tight posterior intervals on annotator accuracies and on true categories.  In the simulations, I haven't needed more than 5 annotators on 500 items to get a pretty good read on true categories and item accuracy (posterior intervals of +/- 1%).   The kind of item-response or beta-binomial models I've been looking at are also amenable to having 50 annotators all do random subsets of the data.

EM's awfully prone to local maxima in complex problems.  Once there's a reasonable amount of training data, I haven't seen much in the way of improvement from EM.   I'm thinking I'll try Gibbs sampling instead of EM next; it's even easier for classifiers than EM, and less prone to get stuck in local optima.</description>
		<content:encoded><![CDATA[<p>What I&#8217;m trying to figure out is just how many annotators by items we&#8217;ll need to get tight posterior intervals on annotator accuracies and on true categories.  In the simulations, I haven&#8217;t needed more than 5 annotators on 500 items to get a pretty good read on true categories and item accuracy (posterior intervals of +/- 1%).   The kind of item-response or beta-binomial models I&#8217;ve been looking at are also amenable to having 50 annotators all do random subsets of the data.</p>
<p>EM&#8217;s awfully prone to local maxima in complex problems.  Once there&#8217;s a reasonable amount of training data, I haven&#8217;t seen much in the way of improvement from EM.   I&#8217;m thinking I&#8217;ll try Gibbs sampling instead of EM next; it&#8217;s even easier for classifiers than EM, and less prone to get stuck in local optima.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Good Kappa&#8217;s Not Necessary, Either by Brendan O'Connor</title>
		<link>http://lingpipe-blog.com/2008/07/28/good-kappas-not-necessary-either/#comment-2714</link>
		<dc:creator>Brendan O'Connor</dc:creator>
		<pubDate>Thu, 14 Aug 2008 19:47:15 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=118#comment-2714</guid>
		<description>If you have access to massively scaling up the number of annotators (which is easy on Mechanical Turk), I've had some success by using a high number of annotators -- e.g. 50 per example -- for a small number of examples to derive "true" labels.  Then it's possible to derive what accuracy rates will be for 3 or 5 or however many annotators will be used for the rest of the data.

In my experience, scaling up the number of annotators tends to be more useful than running EM.  I could be doing something wrong though.</description>
		<content:encoded><![CDATA[<p>If you have access to massively scaling up the number of annotators (which is easy on Mechanical Turk), I&#8217;ve had some success by using a high number of annotators &#8212; e.g. 50 per example &#8212; for a small number of examples to derive &#8220;true&#8221; labels.  Then it&#8217;s possible to derive what accuracy rates will be for 3 or 5 or however many annotators will be used for the rest of the data.</p>
<p>In my experience, scaling up the number of annotators tends to be more useful than running EM.  I could be doing something wrong though.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Epidemiologists&#8217; Bayesian Latent Class Models of Inter-Annotator Agreement by Brendan O'Connor</title>
		<link>http://lingpipe-blog.com/2008/08/07/epidemiologists-bayesian-latent-class-models-of-inter-annotator-agreement/#comment-2713</link>
		<dc:creator>Brendan O'Connor</dc:creator>
		<pubDate>Thu, 14 Aug 2008 19:36:34 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=127#comment-2713</guid>
		<description>Great post, thanks for all the pointers and the model demonstration -- I've been working exactly on this recently with Mechanical Turk annotators.  (Found your post via Panos Ipeirotis, http://behind-the-enemy-lines.blogspot.com/2008/08/mechanical-turk-worker-quality-and-hit.html , via some posts of mine, http://blog.doloreslabs.com/topics/wisdom/ )

It's really interesting that so many fields have reinvented aspects of these techniques.

Brendan</description>
		<content:encoded><![CDATA[<p>Great post, thanks for all the pointers and the model demonstration &#8212; I&#8217;ve been working exactly on this recently with Mechanical Turk annotators.  (Found your post via Panos Ipeirotis, <a href="http://behind-the-enemy-lines.blogspot.com/2008/08/mechanical-turk-worker-quality-and-hit.html" rel="nofollow">http://behind-the-enemy-lines.blogspot.com/2008/08/mechanical-turk-worker-quality-and-hit.html</a> , via some posts of mine, <a href="http://blog.doloreslabs.com/topics/wisdom/" rel="nofollow">http://blog.doloreslabs.com/topics/wisdom/</a> )</p>
<p>It&#8217;s really interesting that so many fields have reinvented aspects of these techniques.</p>
<p>Brendan</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Searchme.com is Useful by lingpipe</title>
		<link>http://lingpipe-blog.com/2008/08/05/searchmecom-is-useful/#comment-2692</link>
		<dc:creator>lingpipe</dc:creator>
		<pubDate>Thu, 07 Aug 2008 20:06:02 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=123#comment-2692</guid>
		<description>Yikes -- there's not even titles on those back pages.  

They're also not doing spell checking (or at least doing it with such low recall I haven't seen it), and they have lots of non-existent pages referenced (as does Google these days).

After two days, I conclude that I can't use Searchme as a general Google replacement, but it's still very useful for searching for things like recipes where visual inspection is so much better than snippets.  So it's at least staying on my Firefox pulldown.</description>
		<content:encoded><![CDATA[<p>Yikes &#8212; there&#8217;s not even titles on those back pages.  </p>
<p>They&#8217;re also not doing spell checking (or at least doing it with such low recall I haven&#8217;t seen it), and they have lots of non-existent pages referenced (as does Google these days).</p>
<p>After two days, I conclude that I can&#8217;t use Searchme as a general Google replacement, but it&#8217;s still very useful for searching for things like recipes where visual inspection is so much better than snippets.  So it&#8217;s at least staying on my Firefox pulldown.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Searchme.com is Useful by Otis Gospodnetic</title>
		<link>http://lingpipe-blog.com/2008/08/05/searchmecom-is-useful/#comment-2691</link>
		<dc:creator>Otis Gospodnetic</dc:creator>
		<pubDate>Thu, 07 Aug 2008 14:44:04 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=123#comment-2691</guid>
		<description>Just tried them again and liked what I saw.  But look at your browser's history (back button) after you flip through 10-20+ pages.  That sucks for those of us who use that back button all the time.</description>
		<content:encoded><![CDATA[<p>Just tried them again and liked what I saw.  But look at your browser&#8217;s history (back button) after you flip through 10-20+ pages.  That sucks for those of us who use that back button all the time.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Good Kappa&#8217;s Not Necessary, Either by Bob Carpenter</title>
		<link>http://lingpipe-blog.com/2008/07/28/good-kappas-not-necessary-either/#comment-2656</link>
		<dc:creator>Bob Carpenter</dc:creator>
		<pubDate>Thu, 31 Jul 2008 20:49:41 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=118#comment-2656</guid>
		<description>Okay, Panos was holding out.  He should've cited his own March 2008 blog post on &lt;a href="http://behind-the-enemy-lines.blogspot.com/2008/03/mechanical-turk-foundations.html" rel="nofollow"&gt;Mechanical Turk: The Foundations&lt;/a&gt;.  It contains a great overview of the true label induction problem and even more references.  I just found it searching for Dawid's paper.

Figuring out true paper "quality" given referee scores was one of the applications I've been thinking about.  I originally thought a simple bootstrap analysis would be interesting, and it would be simple.  But now I've been thinking more along the lines of linear modeling, perhaps with logistic linking to get the boundedness of the scale.</description>
		<content:encoded><![CDATA[<p>Okay, Panos was holding out.  He should&#8217;ve cited his own March 2008 blog post on <a href="http://behind-the-enemy-lines.blogspot.com/2008/03/mechanical-turk-foundations.html" rel="nofollow">Mechanical Turk: The Foundations</a>.  It contains a great overview of the true label induction problem and even more references.  I just found it searching for Dawid&#8217;s paper.</p>
<p>Figuring out true paper &#8220;quality&#8221; given referee scores was one of the applications I&#8217;ve been thinking about.  I originally thought a simple bootstrap analysis would be interesting, and it would be simple.  But now I&#8217;ve been thinking more along the lines of linear modeling, perhaps with logistic linking to get the boundedness of the scale.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Good Kappa&#8217;s Not Necessary, Either by Bob Carpenter</title>
		<link>http://lingpipe-blog.com/2008/07/28/good-kappas-not-necessary-either/#comment-2655</link>
		<dc:creator>Bob Carpenter</dc:creator>
		<pubDate>Thu, 31 Jul 2008 19:59:32 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=118#comment-2655</guid>
		<description>Panos:

Thanks for the reference.  This is just the kind of thing I was looking for.  And looking at the papers that cite it has opened up a vein of this kind of literature.  

I'm working on a very similar approach using a Gibbs sampler, which has the nice property of giving me posterior uncertainty estimates of things like confusion matrices.

In searching for your reference, I found this, which is where I was planning to go with the posterior category samples from the fitted models:

Learning with Multiple Labels. Rong Jin and Zoubin Ghahramani.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.8894

I'll post the details next week when I'm done writing up the paper.</description>
		<content:encoded><![CDATA[<p>Panos:</p>
<p>Thanks for the reference.  This is just the kind of thing I was looking for.  And looking at the papers that cite it has opened up a vein of this kind of literature.  </p>
<p>I&#8217;m working on a very similar approach using a Gibbs sampler, which has the nice property of giving me posterior uncertainty estimates of things like confusion matrices.</p>
<p>In searching for your reference, I found this, which is where I was planning to go with the posterior category samples from the fitted models:</p>
<p>Learning with Multiple Labels. Rong Jin and Zoubin Ghahramani.<br />
<a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.8894" rel="nofollow">http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.8894</a></p>
<p>I&#8217;ll post the details next week when I&#8217;m done writing up the paper.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Good Kappa&#8217;s Not Necessary, Either by Panos Ipeirotis</title>
		<link>http://lingpipe-blog.com/2008/07/28/good-kappas-not-necessary-either/#comment-2654</link>
		<dc:creator>Panos Ipeirotis</dc:creator>
		<pubDate>Thu, 31 Jul 2008 15:57:02 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=118#comment-2654</guid>
		<description>An alternative, when the annotator's accuracy is unknown, is to use an EM-like approach.

We first assume known annotator accuracy (initialized, say, to perfect accuracy), and based on that we compute the most likely labels.

Then, based on the estimated labels, we can estimate the labelers' accuracy.

By iterating, we can converge to some good estimates of the labels, and generate a confusion matrix for each annotator.

Take a look at
Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm
A. P. Dawid and A. M. Skene
Applied Statistics, Vol. 28, No. 1 (1979), pp. 20-28  
http://www.jstor.org/pss/2346806</description>
		<content:encoded><![CDATA[<p>An alternative, when the annotator&#8217;s accuracy is unknown, is to use an EM-like approach.</p>
<p>We first assume known annotator accuracy (initialized, say, to perfect accuracy), and based on that we compute the most likely labels.</p>
<p>Then, based on the estimated labels, we can estimate the labelers&#8217; accuracy.</p>
<p>By iterating, we can converge to some good estimates of the labels, and generate a confusion matrix for each annotator.</p>
<p>Take a look at<br />
Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm<br />
A. P. Dawid and A. M. Skene<br />
Applied Statistics, Vol. 28, No. 1 (1979), pp. 20-28<br />
<a href="http://www.jstor.org/pss/2346806" rel="nofollow">http://www.jstor.org/pss/2346806</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Good Kappa&#8217;s not Enough by Good Kappa&#8217;s Not Necessary, Either &#171; LingPipe Blog</title>
		<link>http://lingpipe-blog.com/2008/07/22/good-kappas-not-enough/#comment-2645</link>
		<dc:creator>Good Kappa&#8217;s Not Necessary, Either &#171; LingPipe Blog</dc:creator>
		<pubDate>Tue, 29 Jul 2008 22:29:44 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=112#comment-2645</guid>
		<description>[...] LingPipe Blog Alias-i’s Natural Language Processing and Text Analytics API      &#171; Good Kappa&#8217;s not&#160;Enough [...]</description>
		<content:encoded><![CDATA[<p>[...] LingPipe Blog Alias-i’s Natural Language Processing and Text Analytics API      &laquo; Good Kappa&#8217;s not&nbsp;Enough [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Hyphenation as a Noisy Channel by lingpipe</title>
		<link>http://lingpipe-blog.com/2008/07/11/hyphenation-a-a-noisy-channel/#comment-2619</link>
		<dc:creator>lingpipe</dc:creator>
		<pubDate>Tue, 22 Jul 2008 16:49:26 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=108#comment-2619</guid>
		<description>I ran on Bartlett et al.'s data splits and the results are interesting.

For their comparison with SbA (syllabification by analogy), they used about 15K training instances:


    #train=14696    #test=24921
    Whole word accuracy = 0.8789 (Struct SVM = 0.8999)
    Per hyph precision = 0.9316
    Per hyph recall  = 0.9305
    Per decision error = 0.0316 (0.0242)



Their structured SVM approach is significantly better than the simple character LM noisy channel approach.

With their 60K train, 5K test, and our default English params (8-grams with default 8.0 interpolation), we get these results:


   #train=60000   #test=5000
   accuracy=0.9538  (struct SVM = 0.9565)
   Per hyphenation prec = 0.9732
   per hyphenation recall = 0.9739
   per tagging decision error = 0.0124
   accuracy=0.9544



With our best cross-validating result on the training data, running on their train/test split, we get this:


    whole word accuracy=0.9544 (struct SVM = 0.9565)
    per hyphenation precision=0.9745
    per hyphenation recall=0.9742

Here the results are much closer.  We've seen &lt;a href="http://acl.ldc.upenn.edu/P/P01/P01-1005.pdf" rel="nofollow"&gt;this pattern&lt;/a&gt; before.

The errors have an interesting pattern where stress is a determining factor.  Lots of words have different hyphenation with different forms, such as CHER-ub vs. CHER-u-bim vs.  che-RU-bic.  It sure seems that building a joint model of pronunciation (including lexical stress) and syllabification/hyphenation would be a big win.  There's also ambiguity in syllable reduction, such as the word &#34;mayor&#34; being one syllable or two, and words like &#34;aqualung&#34; having variant pronunciations ak-wuh-luhng vs. ah-kwuh-lung (note which syllable the "q" sound shows up in).</description>
		<content:encoded><![CDATA[<p>I ran on Bartlett et al.&#8217;s data splits and the results are interesting.</p>
<p>For their comparison with SbA (syllabification by analogy), they used about 15K training instances:</p>
<p>    #train=14696    #test=24921<br />
    Whole word accuracy = 0.8789 (Struct SVM = 0.8999)<br />
    Per hyph precision = 0.9316<br />
    Per hyph recall  = 0.9305<br />
    Per decision error = 0.0316 (0.0242)</p>
<p>Their structured SVM approach is significantly better than the simple character LM noisy channel approach.</p>
<p>With their 60K train, 5K test, and our default English params (8-grams with default 8.0 interpolation), we get these results:</p>
<p>   #train=60000   #test=5000<br />
   accuracy=0.9538  (struct SVM = 0.9565)<br />
   Per hyphenation prec = 0.9732<br />
   per hyphenation recall = 0.9739<br />
   per tagging decision error = 0.0124<br />
   accuracy=0.9544</p>
<p>With our best cross-validating result on the training data, running on their train/test split, we get this:</p>
<p>    whole word accuracy=0.9544 (struct SVM = 0.9565)<br />
    per hyphenation precision=0.9745<br />
    per hyphenation recall=0.9742</p>
<p>Here the results are much closer.  We&#8217;ve seen <a href="http://acl.ldc.upenn.edu/P/P01/P01-1005.pdf" rel="nofollow">this pattern</a> before.</p>
<p>The errors have an interesting pattern where stress is a determining factor.  Lots of words have different hyphenation with different forms, such as CHER-ub vs. CHER-u-bim vs.  che-RU-bic.  It sure seems that building a joint model of pronunciation (including lexical stress) and syllabification/hyphenation would be a big win.  There&#8217;s also ambiguity in syllable reduction, such as the word &quot;mayor&quot; being one syllable or two, and words like &quot;aqualung&quot; having variant pronunciations ak-wuh-luhng vs. ah-kwuh-lung (note which syllable the &#8220;q&#8221; sound shows up in).</p>
]]></content:encoded>
	</item>
</channel>
</rss>