Hyphenation as a Noisy Channel


Update November 2008: Our tutorial on hyphenation is now available as part of LingPipe and on the web at:

Noisy channel models with very simple deterministic channels can be surprisingly effective at simple linguistic tasks like word splitting. We’ve used them for Chinese word segmentation (aka word boundary detection, aka tokenization), not to mention spelling correction.

In this blog entry, I’ll provide a first look at our forthcoming tutorial on hyphenation. The hyphenation problem is that of determining if a hyphen can be inserted between two positions in a word. Hyphenation is an orthographic process, which means it operates over spellings. In contrast, syllabification is a phonological (or phonetic) process, which means it operates over sounds. Hyphenation roughly follows syllabification, but is also prone to follow morphological split points.

The hyphenation problem isn’t even well-defined on a per-word level. There are pairs such as num-ber (something you count with) and numb-er (when you get more numb) that have the same spelling, but different pronunciations and corresponding hyphenations depending on how they are used. I’ll just ignore this problem here; in our evaluation, ambiguities always produce at least one error.

The Noisy Channel Model

The noisy channel model consists of a source that generates messages and a noisy channel along which they are passed. The receiver’s job is to recover (i.e. decode) the underlying message. For hyphenation, the source is a model of what English hyphenation looks like. The channel model deterministically removes hyphens. The receiver thus needs to figure out where the hyphens should be reinserted to recover the original message from the source.

The training corpus for the source model consists of words with hyphenations represented by hyphens. The model is just a character language model (I used 8-grams, but it’s not very length sensitive). This gives us estimates of probabilities like p("che-ru-bic") and p("cher-u-bic"). Our channel model just deterministically removes hyphens, so that p("cherubic"|"che-ru-bic") = 1.0.
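To make the source model concrete, here's a toy sketch in Python: an add-alpha-smoothed character trigram counter, not LingPipe's interpolated 8-gram model. The function names and the smoothing scheme are illustrative assumptions, but the idea is the same: hyphens are just characters, so the model assigns probabilities to hyphenated spellings directly.

```python
import math
from collections import defaultdict

def train_char_lm(words, n=3):
    """Count character n-grams over hyphenated training words.
    A toy stand-in for LingPipe's interpolated 8-gram character LM."""
    grams, contexts = defaultdict(int), defaultdict(int)
    for w in words:
        padded = "^" * (n - 1) + w + "$"  # boundary padding
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
            contexts[padded[i:i + n - 1]] += 1
    return grams, contexts

def log_prob(word, grams, contexts, n=3, alpha=1.0, vocab=30):
    """Add-alpha smoothed log p(word), hyphens included as characters."""
    padded = "^" * (n - 1) + word + "$"
    lp = 0.0
    for i in range(len(padded) - n + 1):
        g = padded[i:i + n]
        lp += math.log((grams[g] + alpha) / (contexts[g[:-1]] + alpha * vocab))
    return lp
```

Trained hyphenated spellings score higher than character sequences the model has never seen, which is all the decoder needs.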

To use the noisy channel model to find a hyphenation given an unhyphenated word, we just find the hyphenation h that is most likely given the word w, using Bayes’s rule in a maximization setting: argmax_h p(h|w) = argmax_h p(w|h) * p(h). Because the channel probability p(w|h) is always 1.0 when the characters in w match those in h, this reduces to finding the hyphenation h yielding character sequence w that maximizes p(h). For example, the model will estimate "che-ru-bic" to be a more likely hyphenation than "cher-u-bic" if p("che-ru-bic") > p("cher-u-bic").
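A brute-force version of that argmax just enumerates every candidate hyphenation and picks the best-scoring one. This is a hypothetical sketch (the names are mine, not LingPipe's), with score standing in for the source model's log probability:

```python
from itertools import product

def candidate_hyphenations(word):
    """Enumerate every way of inserting at most one hyphen at each of the
    len(word)-1 internal positions (exponential, fine for short words)."""
    gaps = len(word) - 1
    for mask in product([False, True], repeat=gaps):
        out = word[0]
        for ch, hyph in zip(word[1:], mask):
            out += "-" + ch if hyph else ch
        yield out

def decode(word, score):
    """argmax_h p(h): every candidate maps back to word under the channel
    with probability 1.0, so we just maximize the source score."""
    return max(candidate_hyphenations(word), key=score)
```

Any scorer works here; plugging in a character LM's log probability gives exactly the noisy channel decoder described above.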

Decoding is fairly efficient for this task, despite the high-order n-gram language model, because the channel bounds the combinatorics by allowing at most a single hyphen to be inserted at each position; dynamic programming over language model states reduces the search even further.
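Here's a sketch of that dynamic program, under the assumption that the hypothesis state is just the language model context, i.e. the last n-1 emitted symbols (letters or hyphens). The toy_logp scorer below is an assumption for illustration: it returns scores, not true log probabilities, and simply rewards one split point.

```python
def viterbi_hyphenate(word, logp, n=3):
    """Noisy-channel decode by dynamic programming: the state is the last
    n-1 emitted symbols, so the number of live hypotheses is bounded by
    the LM order rather than 2^(len(word)-1)."""
    table = {"^" * (n - 1): (0.0, "")}  # LM context -> (score, output so far)
    for i, ch in enumerate(word):
        new_table = {}
        for ctx, (score, hyph) in table.items():
            # channel-allowed extensions: the letter alone, or hyphen then
            # letter (no hyphen before the first letter)
            exts = [ch] if i == 0 else [ch, "-" + ch]
            for ext in exts:
                s, c = score, ctx
                for sym in ext:
                    s += logp(c, sym)
                    c = (c + sym)[-(n - 1):]
                if c not in new_table or s > new_table[c][0]:
                    new_table[c] = (s, hyph + ext)
        table = new_table
    return max(table.values())[1]

# Toy scorer (scores, not real log probabilities): rewards a hyphen
# only after the context "ab", penalizes it heavily anywhere else.
def toy_logp(ctx, sym):
    if sym == "-":
        return 0.5 if ctx.endswith("ab") else -5.0
    return -1.0
```

With this scorer, viterbi_hyphenate("abcd", toy_logp) returns "ab-cd": the only hyphen worth its cost is the one after "ab".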

English Evaluation

So how do we test how well the model works? We just became members of the Linguistic Data Consortium, who distribute Baayen, Piepenbrock and Gulikers’ CELEX 2 corpus. CELEX is a great corpus. It contains pronunciations, syllabifications, and hyphenations for a modest-sized dictionary of Dutch, English and German. For instance, there are 66,332 distinct English words in the corpus (there are many more entries, but these constitute duplicates and compounds whose constituents have their normal hyphenations). These 66,332 words have 66,418 unique hyphenations, meaning 86 of the words lead to ambiguities. That’s about a tenth of a percent, so I won’t worry about it here.

I evaluated with randomly partitioned 10-fold cross-validation. With the noisy channel model above, LingPipe had a 95.4% whole-word accuracy, with a standard deviation of 0.2%. That means we got 95.4% of the words completely correctly hyphenated. We can also look at the 111,521 hyphens in the corpus, over which we had 97.3% precision and 97.4% recall. That is, we missed 2.6% of the true hyphenation points (false negatives), and 2.7% of the hyphens we returned were spurious (false positives). Finally, we can look at per-decision accuracy, for which there were 482,045 positions between characters, over which we were 98.8% accurate.
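The per-hyphen precision and recall come from comparing hyphen positions in the gold and predicted spellings of each word. A minimal sketch (function names are mine):

```python
def hyphen_positions(hyph):
    """Indices in the unhyphenated word before which a hyphen occurs."""
    pos, i = set(), 0
    for ch in hyph:
        if ch == "-":
            pos.add(i)
        else:
            i += 1
    return pos

def hyphen_prf(gold, pred):
    """Per-hyphen precision, recall, and F1 between two hyphenated
    spellings of the same word."""
    g, p = hyphen_positions(gold), hyphen_positions(pred)
    tp = len(g & p)  # true positives: hyphens both analyses agree on
    prec = tp / len(p) if p else 1.0
    rec = tp / len(g) if g else 1.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

For example, comparing "che-ru-bic" against "cher-u-bic" gives one matching hyphen out of two on each side, i.e. 50% precision and 50% recall.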

Forward and Backward Models

But that’s not all. As with HMMs, a left-to-right language model suffers from a degree of label bias. So I reversed all the data and built a right-to-left (or back-to-front) model. Using n-best extraction, I ran this two ways. First, I just added its score to the forward model’s to get an interpolated score. Somewhat surprisingly, it behaved almost identically to the forward-only model, with slightly lower per-hyphen and per-decision scores. More interestingly, I then ran the models in intersect mode, returning a hyphen only if it was in the first-best analysis of both the left-to-right and right-to-left models. This lowered per-word accuracy to 94.6% (from 95.4%), but raised precision to 98.0% (from 97.3%) while only lowering recall to 96.7% (from 97.4%). In applications, hyphenation is a precision business: it’s usually used to split words across lines in documents, and many split points might work.
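Intersect mode amounts to set intersection on hyphen positions: a hyphen survives only if both first-best analyses propose it. A small sketch (names are mine, not LingPipe API):

```python
def hyphen_positions(hyph):
    """Indices in the unhyphenated word before which a hyphen occurs."""
    pos, i = set(), 0
    for ch in hyph:
        if ch == "-":
            pos.add(i)
        else:
            i += 1
    return pos

def intersect_hyphenations(fwd, bwd):
    """Keep a hyphen only where BOTH the forward and backward models'
    first-best analyses agree (trades recall for precision)."""
    word = fwd.replace("-", "")
    assert bwd.replace("-", "") == word, "analyses must be of the same word"
    keep = hyphen_positions(fwd) & hyphen_positions(bwd)
    out = ""
    for i, ch in enumerate(word):
        out += ("-" if i in keep else "") + ch
    return out
```

Every kept hyphen was proposed twice, so precision can only go up; any hyphen the two models disagree on is dropped, which is where the recall loss comes from.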

Is it State of the Art?

The best results I’ve seen for this task were reported in Bartlett, Kondrak and Cherry’s 2008 ACL paper Automatic Syllabification with Structured SVMs for Letter-To-Phoneme Conversion, which also received an outstanding paper award. They treated this problem as a tagging problem and applied structured support vector machines (SVMs). On a fixed 5K testing set, they report a 95.65% word accuracy, which is slightly higher than our 95.4%. The 95% binomial confidence interval for 5000 test cases is +/- 0.58%, and our measured standard deviation was 0.2%, with results ranging from 95.1% to 95.7% on different folds. Their paper also tackled pronunciation, for which their hyphenation was only one feature.

German and Dutch are easier to hyphenate than English. Syllabification in all of these languages is also easier than hyphenation.

But is it better than 1990?

Before someone in the audience gets mad at me, I want to point out that we could follow Coker, Church and Liberman’s (1990) lead in reporting results:

The Bell Laboratories Text-to-Speech system, TTS, takes a radical dictionary-based approach; dictionary methods (with morphological and analogical extensions) are used for the vast majority of words. Only a fraction of a percent (0.5% of words overall; 0.1% of lowercase words) are left for letter-to-sound rules.

Although this insight won’t get us into NIPS, it’s how we’d field an application.

2 Responses to “Hyphenation as a Noisy Channel”

  1. lingpipe Says:

    I (Bob) just finished the tutorial. I cleaned up many of the encoding issues involving converting the CELEX2 encodings to Unicode, and it improved the baseline system (English 8-grams with 8.0 interpolation) from 95.4% to 95.5% accuracy. There are some operating points where the performance is a bit better than our defaults (e.g. 9-grams with a 4.5 interpolation score 95.8%). But the basic results still stand in that 95.8% is no more significantly better than the 95.65% state-of-the-art report from Bartlett et al. than 95.4% was significantly worse.

    You also have to be careful in that these are post-hoc results. Only one parameter setting yielded 95.8% whole-word accuracy, though there were lots of 95.7% scores.

    I also ran German hyphenation (99.7% whole-word accuracy) and Dutch hyphenation (99.4% whole word accuracy). These are default scores, not optimal post-hoc scores.

    Finally, I ran English syllabification (98.8% whole-word accuracy), but that could probably still use some work cleaning up the phonetic alphabet. I just didn’t have the energy to encode dozens of replace-alls after finding the right IPA encoding, so I just left the multi-character symbols in place.

    The tutorial will be out in LingPipe 3.6, but if you’re dying to see it, drop me an e-mail and I can send you a tarball of the demo tutorial.

  2. lingpipe Says:

    I ran on Bartlett et al.’s data splits and the results are interesting.

    For their comparison with SbA (syllabification by analogy), they used about 15K training instances:

    #train=14696 #test=24921
    Whole word accuracy = 0.8789 (struct SVM = 0.8999)
    Per-hyphen precision = 0.9316
    Per-hyphen recall = 0.9305
    Per-decision error = 0.0316 (struct SVM = 0.0242)

    Their structured SVM approach is significantly better than the simple character LM noisy channel approach.

    With their 60K train, 5K test, and our default English params (8-grams with the default 8.0 interpolation), we get these results:

    #train=60000 #test=5000
    Whole word accuracy = 0.9538 (struct SVM = 0.9565)
    Per-hyphen precision = 0.9732
    Per-hyphen recall = 0.9739
    Per-decision error = 0.0124

    With our best cross-validating result on the training data, running on their train/test split, we get this:

    Whole word accuracy = 0.9544 (struct SVM = 0.9565)
    Per-hyphen precision = 0.9745
    Per-hyphen recall = 0.9742

    Here the results are much closer. We’ve seen this pattern before.

    The errors have an interesting pattern where stress is a determining factor. Lots of words have different hyphenations in different forms, such as CHER-ub vs. CHER-u-bim vs. che-RU-bic. It sure seems that building a joint model of pronunciation (including lexical stress) and syllabification/hyphenation would be a big win. There’s also ambiguity in syllable reduction, such as the word "mayor" being one syllable or two, and words like "aqualung" having variant pronunciations ak-wuh-luhng vs. ah-kwuh-lung (note which syllable the “q” sound shows up in).
