Ignorance is Bliss: Tenfold Speedup through Pruning

HMM Decoding

I added two beams to the first-best HMM Viterbi decoder. Both beams are synchronous, operating on a single slice of the lattice corresponding to an input token. One beam prunes on emission probabilities, removing tag hypotheses for which p(token|tag) is much lower than p(token|tag’) for the best tag’. The other beam prunes on the Viterbi score going forward, removing any tag whose likelihood estimate is much lower than the best tag’s. Normal Viterbi decoding has complexity O(k²n), where n is the input length and k is the number of tags. The square comes from considering all pairs of tags for the previous token and the current token. Our pruning attacks both sides of that product, reducing the k² factor to c1·c2, where c1 is the number of forward hypotheses for the previous token surviving the Viterbi-score pruning and c2 is the number of tags surviving the emission pruning.
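
To make the mechanics concrete, here is a minimal sketch of a two-beam Viterbi decoder in plain Java. This is not LingPipe's implementation; the class name, method names, and log-probability representation are invented for illustration. The point is just where the two beams cut into the O(k²n) loop.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of first-best Viterbi decoding with two synchronous beams (illustrative names only). */
public class BeamViterbi {

    private final double[][] logTransition; // logTransition[prev][cur] = log p(cur tag | prev tag)
    private final double emissionBeam;      // log-prob width for the emission beam
    private final double viterbiBeam;       // log-prob width for the forward-score beam

    public BeamViterbi(double[][] logTransition, double emissionBeam, double viterbiBeam) {
        this.logTransition = logTransition;
        this.emissionBeam = emissionBeam;
        this.viterbiBeam = viterbiBeam;
    }

    /** Decodes the best tag sequence from logEmission[i][t] = log p(token i | tag t). */
    public int[] decode(double[][] logEmission) {
        int n = logEmission.length;
        int k = logTransition.length;

        double[] prevScore = new double[k];
        int[][] backPtr = new int[n][k];
        List<Integer> prevActive = new ArrayList<>();

        // Position 0: emission beam only (uniform start probabilities assumed for simplicity).
        double best0 = max(logEmission[0]);
        for (int t = 0; t < k; ++t) {
            if (logEmission[0][t] >= best0 - emissionBeam) {
                prevScore[t] = logEmission[0][t];
                prevActive.add(t);
            }
        }

        for (int i = 1; i < n; ++i) {
            // Emission beam: drop tags whose p(token|tag) is far below the best tag's.
            double bestEmit = max(logEmission[i]);
            List<Integer> emitSurvivors = new ArrayList<>();
            for (int t = 0; t < k; ++t)
                if (logEmission[i][t] >= bestEmit - emissionBeam)
                    emitSurvivors.add(t);

            // Forward step over surviving (previous, current) pairs only: c1 * c2 instead of k * k.
            double[] curScore = new double[k];
            double bestCur = Double.NEGATIVE_INFINITY;
            for (int cur : emitSurvivors) {
                double best = Double.NEGATIVE_INFINITY;
                int arg = -1;
                for (int prev : prevActive) {
                    double s = prevScore[prev] + logTransition[prev][cur];
                    if (s > best) { best = s; arg = prev; }
                }
                curScore[cur] = best + logEmission[i][cur];
                backPtr[i][cur] = arg;
                bestCur = Math.max(bestCur, curScore[cur]);
            }

            // Viterbi beam: keep only hypotheses close to the best forward score.
            List<Integer> curActive = new ArrayList<>();
            for (int cur : emitSurvivors)
                if (curScore[cur] >= bestCur - viterbiBeam)
                    curActive.add(cur);

            prevScore = curScore;
            prevActive = curActive;
        }

        // Backtrace from the best surviving final tag.
        int[] tags = new int[n];
        int best = prevActive.get(0);
        for (int t : prevActive)
            if (prevScore[t] > prevScore[best]) best = t;
        tags[n - 1] = best;
        for (int i = n - 1; i > 0; --i)
            tags[i - 1] = backPtr[i][tags[i]];
        return tags;
    }

    private static double max(double[] xs) {
        double m = Double.NEGATIVE_INFINITY;
        for (double x : xs) m = Math.max(m, x);
        return m;
    }
}
```

Both beams only ever discard hypotheses that are already far behind the local best, which is why first-best accuracy survives the pruning in practice.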

In practice, for our Brown corpus-based part-of-speech tagger, there are 90 tags, and therefore 90*90=8100 tag pairs. Pruning reduces this number roughly ten-fold on average without loss of first-best accuracy on held-out data.

The beam-configurable version should be out soon in LingPipe 2.4.1.

Spelling

We’ve been working on developing our second large-scale spelling corrector, this time trained on 600MB of movie- and TV-related data, versus our last round of training on 50GB of legal case law data (all English, with some light Spanish in titles in the TV case).

In both cases, the out-of-the-box speed wasn’t good enough: around 10 queries/second. All of the time is spent exploring insertions and substitutions, but in the token-sensitive case these are constrained to continue within a known token. This means we never correct to a token we didn’t see in the training data, which is generally good for accuracy in languages with light morphology like English. But it also gives us a huge lever to pull for increasing speed: as the number of known tokens is reduced, speed goes way up.
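
As a sketch of what that constraint buys (not LingPipe's actual data structure; the class and method names here are invented), think of the known tokens as defining a set of legal prefixes, and only extend an edit if the partial token it produces is still a legal prefix:

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative token-sensitivity filter; a real implementation would use a trie. */
public class KnownTokenFilter {

    private final Set<String> prefixes = new HashSet<>();

    public KnownTokenFilter(Set<String> knownTokens) {
        // Precompute every non-empty prefix of every token observed in training.
        for (String token : knownTokens)
            for (int i = 1; i <= token.length(); ++i)
                prefixes.add(token.substring(0, i));
    }

    /** True if an edit producing this partial token can still lead to a known token. */
    public boolean canContinue(String partialToken) {
        return prefixes.contains(partialToken);
    }
}
```

Shrinking the known-token list shrinks the prefix set, which directly shrinks the number of insertions and substitutions worth exploring.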

The second parameter to tune is the set of do-not-edit tokens. We generally don’t edit user tokens of one or two characters, nor do we edit longer tokens with high corpus counts. Combining the do-not-edit list with a pruned known-token list, we increased speed ten-fold. Unfortunately, hitting the uncached rate of 100 queries/second cost us about 5% in accuracy overall, mostly as false negatives (failures to suggest corrections for errors).
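
Here is roughly how those two lists come together, with entirely made-up thresholds standing in for the ones we actually tuned:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch of the do-not-edit list and the pruned known-token list (hypothetical thresholds). */
public class TokenPruning {

    static final int DO_NOT_EDIT_COUNT = 10000; // hypothetical corpus-count threshold
    static final int MIN_KNOWN_COUNT = 5;       // hypothetical minimum count for correction targets

    /** Tokens we never edit: very short tokens and tokens with high corpus counts. */
    public static Set<String> doNotEditTokens(Map<String, Integer> corpusCounts) {
        Set<String> doNotEdit = new HashSet<>();
        for (Map.Entry<String, Integer> e : corpusCounts.entrySet())
            if (e.getKey().length() <= 2 || e.getValue() >= DO_NOT_EDIT_COUNT)
                doNotEdit.add(e.getKey());
        return doNotEdit;
    }

    /** Tokens kept as possible correction targets: only those seen often enough in training. */
    public static Set<String> prunedKnownTokens(Map<String, Integer> corpusCounts) {
        Set<String> known = new HashSet<>();
        for (Map.Entry<String, Integer> e : corpusCounts.entrySet())
            if (e.getValue() >= MIN_KNOWN_COUNT)
                known.add(e.getKey());
        return known;
    }
}
```

Anything in the do-not-edit set passes through untouched, and only tokens in the pruned known-token set are available as correction targets.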

You may not realize this from the demo on trivial amounts of data, but our spelling corrector is surprisingly accurate when you train it on a gigabyte of domain-specific data. We’re seeing correction rates of 60-70% (the percentage of user errors for which our first-best suggestion is correct) with false positive rates around 1%. Of course, if 10% of the queries contain errors, the final breakdown is: 89% user correct, no system suggestion; 7% user error, system suggestion correct; 3% user error, system suggestion wrong or missing; 1% user correct, system suggestion spurious. Many of the remaining errors are simply unrecoverable by local edits.
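
For concreteness, here is the arithmetic behind that breakdown, taking the top of the correction-rate range (70%) and rounding to whole percentages:

user error, suggestion correct: 10% × 70% = 7%
user error, suggestion wrong or missing: 10% × 30% = 3%
user correct, spurious suggestion: 90% × 1% ≈ 1%
user correct, no suggestion: 90% × 99% ≈ 89%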

Keep in mind that this is a devilishly difficult domain for which to annotate truth, because user intentions are implicit. For instance, is “toomestone” the name of a band on YouTube (152 hits on Google) or a misspelling of “tombstone” (6.5 million hits)? Google suggests “tombstone”, and that’s a true positive in the second case but a false positive in the first. Of course, if a user with the same IP address then types “tombstone”, you can at least guess it was the misspelling case. Google will suggest that “tomb stone” might mean “tombstone”, but clearly “tomb” and “stone” could make up a query that’s not related to tombstones (for instance, a survey of tomb construction). At best, we can hope the statistics play out to let us guess right most of the time. After all, if the band Toomestone had more fans and more people searching for it, there’d be more hits for it online. This is the whole point of building the source side of the noisy channel model: modeling what to expect.
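
To see why the source side matters, here is the noisy-channel decision in miniature, with invented scores; the channel model alone can’t distinguish the band from the typo, but the source model can:

```java
/** Toy noisy-channel comparison with made-up probabilities, just to show where the source model bites. */
public class NoisyChannelToy {
    public static void main(String[] args) {
        String observed = "toomestone";

        // log p(intended): hypothetical source-model estimates from query logs.
        double logPToomestone = Math.log(1e-9);   // rare band name
        double logPTombstone  = Math.log(1e-5);   // common word

        // log p(observed | intended): hypothetical channel (edit) costs.
        double logChannelExact = Math.log(0.9);   // user typed exactly what was meant
        double logChannelEdits = Math.log(1e-3);  // one insertion ("o") and one substitution ("b" -> "e")

        double scoreKeep    = logPToomestone + logChannelExact;
        double scoreCorrect = logPTombstone  + logChannelEdits;

        System.out.println(scoreKeep > scoreCorrect
            ? observed + ": keep as-is"
            : observed + ": suggest tombstone");
    }
}
```

If the band’s popularity (and hence its source-model probability) grew, the same comparison would tip the other way.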

This time around we also built some spelling-specific weighted edit distances. First, we found that users simply don’t know when something is one word or two, especially for names of products (e.g., they’d type “ling pipe” for “lingpipe” or “postagger” for “pos tagger”), so space insertion/deletion needed to be relatively cheap. Second, they made all the usual reduced-vowel confusions (“memoribilia” vs. “memorabilia”). Unfortunately, our distance model is single-character and context-free, so we can’t easily model the confusion around when consonants are duplicated (“parallel” vs. “parralell” vs. “parallell”).
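
A sketch of the kind of weighted distance we mean, with made-up weights (and, as noted, no way to model context-dependent consonant doubling):

```java
/** Spelling-tuned weighted edit distance: cheap space edits, cheap vowel confusions (illustrative weights). */
public class SpellingEditDistance {

    private static final String VOWELS = "aeiou";

    private static double insertOrDeleteCost(char c) {
        return c == ' ' ? 0.2 : 1.0;   // cheap to split or join words
    }

    private static double substituteCost(char a, char b) {
        if (a == b) return 0.0;
        if (VOWELS.indexOf(a) >= 0 && VOWELS.indexOf(b) >= 0) return 0.3; // reduced-vowel confusion
        return 1.0;
    }

    /** Standard dynamic-programming edit distance with the weights above. */
    public static double distance(String source, String target) {
        int m = source.length(), n = target.length();
        double[][] d = new double[m + 1][n + 1];
        for (int i = 1; i <= m; ++i)
            d[i][0] = d[i - 1][0] + insertOrDeleteCost(source.charAt(i - 1));
        for (int j = 1; j <= n; ++j)
            d[0][j] = d[0][j - 1] + insertOrDeleteCost(target.charAt(j - 1));
        for (int i = 1; i <= m; ++i) {
            for (int j = 1; j <= n; ++j) {
                double del = d[i - 1][j] + insertOrDeleteCost(source.charAt(i - 1));
                double ins = d[i][j - 1] + insertOrDeleteCost(target.charAt(j - 1));
                double sub = d[i - 1][j - 1] + substituteCost(source.charAt(i - 1), target.charAt(j - 1));
                d[i][j] = Math.min(sub, Math.min(del, ins));
            }
        }
        return d[m][n];
    }

    public static void main(String[] args) {
        System.out.println(distance("ling pipe", "lingpipe"));       // 0.2: one cheap space deletion
        System.out.println(distance("memoribilia", "memorabilia"));  // 0.3: one cheap vowel substitution
    }
}
```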

Of course, you’re going to add your own cache, aren’t you? In both cases, query logs show a huge number of repetitions (cache-hit rates of over 2/3 in the TV-related queries after 20K queries). That’s one of the advantages Google has: virtually infinite cache space for anything they can cast as an indexing/search/lookup problem.
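
The cache itself can be as simple as an access-ordered LinkedHashMap with a size bound; the capacity here is a made-up number you’d size to your own query logs:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simple LRU cache from query to best correction, built on LinkedHashMap's access-order mode. */
public class CorrectionCache extends LinkedHashMap<String, String> {

    private final int maxEntries;

    public CorrectionCache(int maxEntries) {
        super(16, 0.75f, true);   // access-order so the least recently used entry is evicted
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > maxEntries;
    }
}
```

Check the cache before running the corrector and store each new suggestion afterward; with hit rates over 2/3, more than two thirds of queries never touch the corrector at all.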

Caching and Pruning

Combined, these two general techniques have taken our very slow HMM tagger and spelling corrector up to usable speeds.
