Update November 2008: Our tutorial on hyphenation is now available as part of LingPipe and on the web at:
Noisy channel models with very simple deterministic channels can be surprisingly effective at simple linguistic tasks like word splitting. We’ve used them for Chinese word segmentation (aka word boundary detection, aka tokenization), not to mention spelling correction.
In this blog entry, I’ll provide a first look at our forthcoming tutorial on hyphenation. The hyphenation problem is that of determining whether a hyphen can be inserted between two characters in a word. Hyphenation is an orthographic process, which means it operates over spellings. In contrast, syllabification is a phonological (or phonetic) process, which means it operates over sounds. Hyphenation roughly follows syllabification, but also tends to follow morphological split points.
The hyphenation problem isn’t even well-defined on a per-word level. There are pairs such as num-ber (something you count with) and numb-er (when you get more numb) that have the same spelling, but different pronunciations and corresponding hyphenations depending on how they are used. I’ll just ignore this problem here; in our evaluation, ambiguities always produce at least one error.
The Noisy Channel Model
The noisy channel model consists of a source that generates messages and a noisy channel along which they are passed. The receiver’s job is to recover (i.e. decode) the underlying message. For hyphenation, the source is a model of what hyphenated English looks like. The channel deterministically removes the hyphens. The receiver thus needs to figure out where the hyphens should be reinserted to recover the original message produced by the source.
The training corpus for the source model consists of words with hyphenations represented by hyphens. The source model is just a character language model (I used 8-grams, but the results aren't very sensitive to the n-gram length). This gives us estimates of probabilities like p("cher-u-bic"). Our channel model just deterministically removes hyphens, so that p("cherubic"|"che-ru-bic") = 1.0.
To use the noisy channel model to find a hyphenation given an unhyphenated word, we just find the hyphenation h that is most likely given the word w, using Bayes's rule in a maximization setting: ARGMAX_h p(h|w) = ARGMAX_h p(w|h) * p(h). Because the channel probabilities p(w|h) are always 1.0 if the characters in w match those in h (and 0.0 otherwise), this reduces to finding the hyphenation h yielding character sequence w which maximizes p(h). For example, the model will estimate "che-ru-bic" to be a more likely hyphenation of "cherubic" than "cher-u-bic" just in case p("che-ru-bic") > p("cher-u-bic").
Decoding is fairly efficient for this task, despite the high-order n-gram language model, because the channel bounds the combinatorics by allowing at most one hyphen to be inserted between each pair of characters; dynamic programming over language model states can reduce the search even further.
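To see what the decoder has to do, here is a hedged brute-force sketch; LingPipe's actual decoder does the dynamic programming, whereas this illustration simply enumerates all 2^(n-1) hyphen patterns and scores each candidate with a caller-supplied log p(h). The logP argument stands in for a trained character language model:

    import java.util.function.ToDoubleFunction;

    // Brute-force decoder sketch (exponential in word length; fine only for
    // short words). Because the channel probability is 1.0 for every candidate
    // built from the word's own characters, the argmax depends only on log p(h).
    public class HyphenDecoder {
        static String bestHyphenation(String word, ToDoubleFunction<String> logP) {
            int gaps = word.length() - 1;
            String best = word;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (int mask = 0; mask < (1 << gaps); ++mask) {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < word.length(); ++i) {
                    sb.append(word.charAt(i));
                    if (i < gaps && (mask & (1 << i)) != 0)
                        sb.append('-');
                }
                String h = sb.toString();
                double score = logP.applyAsDouble(h);
                if (score > bestScore) {
                    bestScore = score;
                    best = h;
                }
            }
            return best;
        }

        public static void main(String[] args) {
            // Toy stand-in for log p(h), not a real language model: it just
            // prefers candidates containing "u-b", to exercise the search.
            System.out.println(bestHyphenation("cherubic",
                h -> h.contains("u-b") ? -1.0 : -2.0 - h.length()));
            // prints "cheru-bic", the first candidate found containing "u-b"
        }
    }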
So how do we test how well the model works? We just became members of the Linguistic Data Consortium, which distributes Baayen, Piepenbrock and Gulikers' CELEX 2 corpus. CELEX is a great corpus. It contains pronunciations, syllabifications, and hyphenations for a modest-sized dictionary of Dutch, English and German. For instance, there are 66,332 distinct English words in the corpus (there are many more entries, but those are duplicates and compounds whose constituents have their normal hyphenations). These 66,332 words have 66,418 unique hyphenations, meaning 86 of the words lead to ambiguities. This is about 1/10th of a percent, so I won't worry about it here.
I evaluated with randomly partitioned 10-fold cross-validation. With the noisy channel model above, LingPipe had a 95.4% whole-word accuracy, with a standard deviation of 0.2%. That means we hyphenated 95.4% of the words completely correctly. We can also look at the 111,521 hyphens in the corpus, over which we had 97.3% precision and 97.4% recall. That is, we missed 2.6% of the true hyphenation points (false negatives), and 2.7% of the hyphenation points we returned were spurious (false positives). Finally, we can look at per-decision accuracy: there were 482,045 positions between characters, over which we were 98.8% accurate.
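To make the per-hyphen bookkeeping concrete, a hyphenation can be reduced to the set of character positions followed by a hyphen, after which precision and recall are just set arithmetic. A sketch under that representation (the names are illustrative, not our actual evaluation harness):

    import java.util.HashSet;
    import java.util.Set;

    // Evaluation sketch: a hyphenation maps to the set of character offsets
    // after which a hyphen occurs; precision and recall come from the overlap
    // between the system's position set and the reference set.
    public class HyphenEval {
        static Set<Integer> hyphenPositions(String hyphenation) {
            Set<Integer> positions = new HashSet<>();
            int chars = 0;
            for (char c : hyphenation.toCharArray()) {
                if (c == '-') positions.add(chars);
                else ++chars;
            }
            return positions;
        }

        public static void main(String[] args) {
            Set<Integer> reference = hyphenPositions("che-ru-bic"); // {3, 5}
            Set<Integer> response = hyphenPositions("cher-u-bic");  // {4, 5}
            Set<Integer> truePositives = new HashSet<>(reference);
            truePositives.retainAll(response);
            double precision = (double) truePositives.size() / response.size();
            double recall = (double) truePositives.size() / reference.size();
            System.out.printf("precision=%.2f recall=%.2f%n", precision, recall);
        }
    }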
Forward and Backward Models
But that's not all. As with HMMs, there's a degree of label bias in a left-to-right language model. So I reversed all the data and built a right-to-left (or back-to-front) model. Using n-best extraction, I ran this two ways. First, I just added the backward score to the forward model's score to get an interpolated score. Somewhat surprisingly, it behaved almost identically to the forward-only model, with slightly lower per-hyphen and per-decision scores. More interestingly, I then ran them in intersect mode, which means only returning a hyphen if it appears in the first-best analysis of both the left-to-right and right-to-left models. This lowered per-word accuracy to 94.6% (from 95.4%), but raised precision to 98.0% (from 97.3%) while only lowering recall to 96.7% (from 97.4%). In applications, hyphenation is a precision business: it's usually used to split words across lines in documents, and many split points might work, so false positives hurt more than false negatives.
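Expressed over the same hyphen-position sets as in the evaluation sketch above, intersect mode is just a set intersection of the two models' first-best analyses. Again an illustrative sketch, not LingPipe code:

    import java.util.HashSet;
    import java.util.Set;

    // Intersect-mode sketch: a hyphen survives only if it occurs in the
    // first-best output of both the forward and the reversed model.
    public class IntersectMode {
        static Set<Integer> hyphenPositions(String hyphenation) {
            Set<Integer> positions = new HashSet<>();
            int chars = 0;
            for (char c : hyphenation.toCharArray()) {
                if (c == '-') positions.add(chars);
                else ++chars;
            }
            return positions;
        }

        static String intersect(String word, String forwardBest, String backwardBest) {
            Set<Integer> keep = hyphenPositions(forwardBest);
            keep.retainAll(hyphenPositions(backwardBest)); // agreement only
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < word.length(); ++i) {
                sb.append(word.charAt(i));
                if (keep.contains(i + 1))
                    sb.append('-');
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            // Where the two models disagree, only the shared hyphen survives.
            System.out.println(intersect("cherubic", "che-ru-bic", "cher-u-bic")); // cheru-bic
        }
    }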
Is it State of the Art?
The best results I've seen for this task were reported in Bartlett, Kondrak and Cherry's 2008 ACL paper, Automatic Syllabification with Structured SVMs for Letter-to-Phoneme Conversion, which also received an outstanding paper award. They treated the problem as a tagging problem and applied structured support vector machines (SVMs). On a fixed 5K-word test set, they report 95.65% word accuracy, slightly higher than our 95.4%. The 95% binomial confidence interval for 5000 test cases is +/- 0.58%, and our measured standard deviation was 0.2%, with results ranging from 95.1% to 95.7% across folds. Their paper also tackled pronunciation, for which hyphenation was only one feature.
German and Dutch are easier to hyphenate than English. Syllabification in all of these languages is also easier than hyphenation.
But is it better than 1990?
Before someone in the audience gets mad at me, I want to point out that we could follow the lead of Coker, Church and Liberman (1990) in reporting results:
The Bell Laboratories Text-to-Speech system, TTS, takes a radical dictionary-based approach; dictionary methods (with morphological and analogical extensions) are used for the vast majority of words. Only a fraction of a percent (0.5% of words overall; 0.1% of lowercase words) are left for letter-to-sound rules.
Although this insight won't get us into NIPS, it's how we'd field an application.