<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Hyphenation as a Noisy Channel</title>
	<atom:link href="http://lingpipe-blog.com/2008/07/11/hyphenation-a-a-noisy-channel/feed/" rel="self" type="application/rss+xml" />
	<link>http://lingpipe-blog.com/2008/07/11/hyphenation-a-a-noisy-channel/</link>
	<description>Natural Language Processing and Text Analytics</description>
	<lastBuildDate>Sat, 04 Feb 2012 20:56:48 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>By: lingpipe</title>
		<link>http://lingpipe-blog.com/2008/07/11/hyphenation-a-a-noisy-channel/#comment-2619</link>
		<dc:creator><![CDATA[lingpipe]]></dc:creator>
		<pubDate>Tue, 22 Jul 2008 16:49:26 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=108#comment-2619</guid>
		<description><![CDATA[I ran on Bartlett et al.&#039;s data splits and the results are interesting.

For their comparison with SbA (syllabification by analogy), they used about 15K training instances:


    #train=14696    #test=24921
    Whole word accuracy = 0.8789 (Struct SVM = 0.8999)
    Per hyph precision = 0.9316
    Per hyph recall  = 0.9305
    Per decision error = 0.0316 (0.0242)



Their structured SVM approach is significantly better than the simple character LM noisy channel approach.

With their 60K training and 5K test split and our default English parameters (8-grams with the default 8.0 interpolation), we get these results:


   #train=60000   #test=5000
   accuracy=0.9538  (struct SVM = 0.9565)
   Per hyphenation prec = 0.9732
   per hyphenation recall = 0.9739
   per tagging decision error = 0.0124
   accuracy=0.9544



With our best cross-validating result on the training data, running on their train/test split, we get this:


    whole word accuracy=0.9544 (struct SVM = 0.9565)
    per hyphenation precision=0.9745
    per hyphenation recall=0.9742

Here the results are much closer.  We&#039;ve seen &lt;a href=&quot;http://acl.ldc.upenn.edu/P/P01/P01-1005.pdf&quot; rel=&quot;nofollow&quot;&gt;this pattern&lt;/a&gt; before.

The errors have an interesting pattern where stress is a determining factor.  Lots of words are hyphenated differently in different forms, such as CHER-ub vs. CHER-u-bim vs. che-RU-bic.  It sure seems that building a joint model of pronunciation (including lexical stress) and syllabification/hyphenation would be a big win.  There&#039;s also ambiguity in syllable reduction, such as the word &quot;mayor&quot; being one syllable or two, and words like &quot;aqualung&quot; having variant pronunciations ak-wuh-luhng vs. ah-kwuh-lung (note which syllable the &quot;q&quot; sound shows up in).]]></description>
		<content:encoded><![CDATA[<p>I ran on Bartlett et al.&#8217;s data splits and the results are interesting.</p>
<p>For their comparison with SbA (syllabification by analogy), they used about 15K training instances:</p>
<p>    #train=14696    #test=24921<br />
    Whole word accuracy = 0.8789 (Struct SVM = 0.8999)<br />
    Per hyph precision = 0.9316<br />
    Per hyph recall  = 0.9305<br />
    Per decision error = 0.0316 (0.0242)</p>
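<p>For readers who want to reproduce figures like these on their own data, here is a minimal Python sketch of the four scores. It is not the LingPipe evaluator. The function names <code>boundaries</code> and <code>evaluate</code> are hypothetical, and it assumes that every gap between two adjacent letters counts as one tagging decision, which is one plausible reading of "per decision error".</p>

```python
def boundaries(hyphenated):
    """Return the set of letter offsets at which a hyphen occurs."""
    positions = set()
    offset = 0
    for ch in hyphenated:
        if ch == "-":
            positions.add(offset)
        else:
            offset += 1
    return positions


def evaluate(gold, predicted):
    """Score predicted hyphenations against gold ones (parallel lists)."""
    words_correct = 0
    tp = fp = fn = 0
    decisions = 0
    for g, p in zip(gold, predicted):
        gb, pb = boundaries(g), boundaries(p)
        words_correct += int(gb == pb)
        tp += len(gb & pb)   # hyphens found in both
        fp += len(pb - gb)   # predicted but not in gold
        fn += len(gb - pb)   # in gold but missed
        # Assumption: one decision per gap between adjacent letters.
        decisions += max(len(g.replace("-", "")) - 1, 0)
    return {
        "whole_word_accuracy": words_correct / len(gold) if gold else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "per_decision_error": (fp + fn) / decisions if decisions else 0.0,
    }


if __name__ == "__main__":
    gold = ["cher-ub", "cher-u-bim", "che-ru-bic"]
    predicted = ["cher-ub", "che-ru-bim", "che-ru-bic"]
    for name, value in evaluate(gold, predicted).items():
        print(f"{name} = {value:.4f}")
```

<p>Comparing hyphen positions as sets means a single misplaced hyphen costs both a false positive and a false negative, so one bad word can lower precision and recall together.</p>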
<p>Their structured SVM approach is significantly better than the simple character LM noisy channel approach.</p>
<p>With their 60K training and 5K test split and our default English parameters (8-grams with the default 8.0 interpolation), we get these results:</p>
<p>   #train=60000   #test=5000<br />
   accuracy=0.9538  (struct SVM = 0.9565)<br />
   Per hyphenation prec = 0.9732<br />
   per hyphenation recall = 0.9739<br />
   per tagging decision error = 0.0124<br />
   accuracy=0.9544</p>
<p>With our best cross-validating result on the training data, running on their train/test split, we get this:</p>
<p>    whole word accuracy=0.9544 (struct SVM = 0.9565)<br />
    per hyphenation precision=0.9745<br />
    per hyphenation recall=0.9742</p>
<p>Here the results are much closer.  We&#8217;ve seen <a href="http://acl.ldc.upenn.edu/P/P01/P01-1005.pdf" rel="nofollow">this pattern</a> before.</p>
<p>The errors have an interesting pattern where stress is a determining factor.  Lots of words are hyphenated differently in different forms, such as CHER-ub vs. CHER-u-bim vs. che-RU-bic.  It sure seems that building a joint model of pronunciation (including lexical stress) and syllabification/hyphenation would be a big win.  There&#8217;s also ambiguity in syllable reduction, such as the word &quot;mayor&quot; being one syllable or two, and words like &quot;aqualung&quot; having variant pronunciations ak-wuh-luhng vs. ah-kwuh-lung (note which syllable the &#8220;q&#8221; sound shows up in).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: lingpipe</title>
		<link>http://lingpipe-blog.com/2008/07/11/hyphenation-a-a-noisy-channel/#comment-2592</link>
		<dc:creator><![CDATA[lingpipe]]></dc:creator>
		<pubDate>Tue, 15 Jul 2008 23:30:50 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=108#comment-2592</guid>
		<description><![CDATA[I (Bob) just finished the tutorial.  I cleaned up many of the issues in converting the CELEX2 encodings to Unicode, which improved the baseline system (English 8-grams with 8.0 interpolation) from 95.4% to 95.5% accuracy.  There are some operating points where the performance is a bit better than our defaults (e.g., 9-grams with 4.5 interpolation score 95.8%).  But the basic result still stands: 95.8% is no more significantly better than the 95.65% state-of-the-art result reported by Bartlett et al. than 95.4% was significantly worse.

You also have to be careful in that these are post-hoc results.  Only one parameter setting yielded 95.8% whole-word accuracy, though there were lots of 95.7% scores.  

I also ran German hyphenation (99.7% whole-word accuracy) and Dutch hyphenation (99.4% whole-word accuracy).  These are default scores, not optimal post-hoc scores.

Finally, I ran English syllabification (98.8% whole-word accuracy), but that could probably still use some work cleaning up the phonetic alphabet.  I just didn&#039;t have the energy to encode dozens of replace-alls after finding the right IPA encoding, so I just left the multi-character symbols in place.

The tutorial will be out in LingPipe 3.6, but if you&#039;re dying to see it, drop me an e-mail and I can send you a tarball of the demo tutorial.]]></description>
		<content:encoded><![CDATA[<p>I (Bob) just finished the tutorial.  I cleaned up many of the issues in converting the CELEX2 encodings to Unicode, which improved the baseline system (English 8-grams with 8.0 interpolation) from 95.4% to 95.5% accuracy.  There are some operating points where the performance is a bit better than our defaults (e.g., 9-grams with 4.5 interpolation score 95.8%).  But the basic result still stands: 95.8% is no more significantly better than the 95.65% state-of-the-art result reported by Bartlett et al. than 95.4% was significantly worse.</p>
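<p>A back-of-the-envelope way to see why differences this small are hard to call is to put a binomial confidence interval around the reported accuracy. The Python sketch below is only a rough check under stated assumptions: it takes the test set to be 5,000 words (the 5K test split), treats each word as an independent trial, and uses the normal approximation. The function name <code>accuracy_interval</code> is hypothetical, and a paired test on the actual system outputs would be the more careful comparison.</p>

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy on n items.

    z=1.96 corresponds to a roughly 95% interval.
    """
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return accuracy - z * standard_error, accuracy + z * standard_error


if __name__ == "__main__":
    # Assumption: 5,000 test words, as in the 5K test split.
    low, high = accuracy_interval(0.9565, 5000)
    print(f"interval around 0.9565: [{low:.4f}, {high:.4f}]")
    for score in (0.954, 0.958):
        print(score, "inside interval:", low < score < high)
```

<p>Under these assumptions the interval is roughly half a percentage point on either side of 95.65%, so both 95.4% and 95.8% fall inside it.</p>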
<p>You also have to be careful in that these are post-hoc results.  Only one parameter setting yielded 95.8% whole-word accuracy, though there were lots of 95.7% scores.  </p>
<p>I also ran German hyphenation (99.7% whole-word accuracy) and Dutch hyphenation (99.4% whole-word accuracy).  These are default scores, not optimal post-hoc scores.</p>
<p>Finally, I ran English syllabification (98.8% whole-word accuracy), but that could probably still use some work cleaning up the phonetic alphabet.  I just didn&#8217;t have the energy to encode dozens of replace-alls after finding the right IPA encoding, so I just left the multi-character symbols in place.</p>
<p>The tutorial will be out in LingPipe 3.6, but if you&#8217;re dying to see it, drop me an e-mail and I can send you a tarball of the demo tutorial.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

