Finite-State Queries in Lucene


Otis just posted Robert Muir's presentation to the NY Lucene Meetup.

What’s it Do?

Read the slide show. It’s really nicely presented.

The idea’s simple for us automata hackers. The set of terms making up the keys of the core term-to-document (inverted) index is used as a suffix array (i.e., a compact representation of a trie). A query is then converted into a finite-state automaton, which is intersected with the trie implicitly represented by the suffix array.

This is much, much faster than running an edit-distance comparison against every term, because the prefix computations are shared. If you put a bound on the maximum distance, you can reject most terms high up in the trie as you implicitly run the intersection.
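The pruning idea can be sketched in a few lines. This is a toy model, not Lucene's implementation: it uses a plain dict-based trie in place of the term dictionary, and carries one row of the Levenshtein dynamic-programming table per trie node. A whole subtree is abandoned as soon as every entry in its row exceeds the bound, so the shared prefix work is done exactly once.

```python
# Toy sketch of bounded fuzzy matching against a trie (illustrative, not
# Lucene's actual data structures). Each trie node extends one row of the
# edit-distance DP table; a branch is pruned when no entry in its row can
# stay within the bound, rejecting whole subtrees of terms at once.

def build_trie(terms):
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node["$"] = True  # end-of-term marker
    return root

def fuzzy_search(trie, query, max_dist):
    results = []
    first_row = list(range(len(query) + 1))

    def walk(node, prefix, prev_row):
        # prev_row[-1] is the edit distance between `prefix` and `query`
        if node.get("$") and prev_row[-1] <= max_dist:
            results.append((prefix, prev_row[-1]))
        for ch, child in node.items():
            if ch == "$":
                continue
            row = [prev_row[0] + 1]
            for i, qch in enumerate(query):
                row.append(min(row[i] + 1,                   # insertion
                               prev_row[i + 1] + 1,          # deletion
                               prev_row[i] + (ch != qch)))   # substitution
            if min(row) <= max_dist:  # prune: subtree can't get back in bound
                walk(child, prefix + ch, row)

    walk(trie, "", first_row)
    return results

trie = build_trie(["the", "then", "hat", "cat", "category"])
print(sorted(fuzzy_search(trie, "the", 1)))  # → [('the', 0), ('then', 1)]
```

Note that "hat", "cat", and "category" are all rejected within two characters of the root; "category" never costs more than its shared "cat" prefix.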

It warms my heart to see this kind of nice algorithm so neatly implemented. The really groovy part is that the suffix array was already there!

Bonus Material

As an added bonus, there’s an extended section with hints on how to do term expansion at query time instead of indexing time (though this affects TF/IDF scoring), and on how to integrate morphological processing (e.g., stemming or lemmatization).

Similar Applications

LingPipe uses the same kind of algorithm for spell checking, only with an explicit trie augmented with source probabilities and a customizable, position-specific edit-distance model. The spell checker only handles single input strings rather than automata, and it provides ranked results.

I’m guessing Google’s using this kind of thing all over the place for speech recognition, because they hired the guys who wrote the AT&T FSM Library and went on to build Google’s OpenFst library.

Do it Yourself

If you want to learn how to do this kind of thing yourself, you could do much worse than reading my perennial recommendation, Dan Gusfield’s Algorithms on Strings, Trees, and Sequences.

BWT Anyone?

Luckily, memory will probably keep up with the number of terms in natural-language indexing.

In biological sequencing applications, where indexes store billions of distinct 50-grams, a compressed index based on the Burrows-Wheeler transform lets you do the same thing. The only problem for search is that it doesn’t support dynamic updates (that I know of).
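The core trick, backward search over the BWT, fits in a short sketch. This is a naive version for illustration: `occ` rescans the transformed string, whereas a real FM-index precomputes rank tables to answer it in constant time.

```python
# Minimal Burrows-Wheeler backward search (illustrative sketch; real
# FM-indexes use precomputed rank/occurrence tables instead of rescanning).
from collections import Counter

def bwt(text):
    text += "$"  # unique terminator, lexicographically smallest
    table = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in table)

def backward_search(bwt_str, pattern):
    """Count occurrences of `pattern` in the original text."""
    # C[c]: number of characters in the text strictly smaller than c
    counts = Counter(bwt_str)
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]

    def occ(c, i):  # occurrences of c in bwt_str[:i], naively
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)
    for c in reversed(pattern):  # consume the pattern right to left
        if c not in C:
            return 0
        lo = C[c] + occ(c, lo)
        hi = C[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

print(backward_search(bwt("banana"), "ana"))  # → 2 (overlapping matches)
```

Each step narrows a range of sorted suffixes, which is why the same structure supports the prefix-sharing term walks described above.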

2 Responses to “Finite-State Queries in Lucene”

  1. rcmuir Says:

    Thanks Bob, by the way: do you have any ideas on best-practices for handling the TF/IDF case with this sort of expansion?

    It’s really similar to the problem that Lucene’s FuzzyLikeThis attempts to address with fuzzy scoring (so that the high IDF values of rare typo terms don’t boost their scores above those of exact matches).

    For my simple test, I only disabled the coordination factor and boosted the exact match by a factor of 2, but I’m interested in whether there are smarter ways to compensate.

    • lingpipe Says:

      Interesting question — I’d never thought about the TF/IDF case, but it should work like spelling.

      You should pay a penalty for correcting to a term based on that term’s frequency, with frequent words paying less of a penalty. Each corrected character should also incur a penalty. It really helps to increase the correction penalties for the first one or two characters of a word, which are typo-ed much less frequently in search queries.

      Consider the queries [hte] and [the]. If the query is [the], there should be a high penalty for correcting to [hte], because the term “hte” is low frequency. It would then contribute a high IDF factor, so the penalty can be designed to scale with that. If the query is [hte], it should be cheap to correct to [the], but the exact term “hte” will still be weighted much more highly than the correction (“the” has a very low IDF, and the correction pays a per-character penalty).

      The fundamental problem is that you can’t tell if [hte] is a typo or a query for a rare term.

      Our spelling corrector, like Google’s, uses cross-word contexts. For instance, Google doesn’t correct [hte] to [the], but it does correct [hte Dude] to [the dude].

      We label data from query logs to train the edit parameters. You’ll find the stats vary considerably as the index grows by orders of magnitude, and you’ll also find typos accumulating in the data, especially common ones.
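The penalty scheme sketched in this reply might look something like the following toy model. All document counts, constants, and function names here are illustrative assumptions, not LingPipe's actual edit model: each edited character pays a base penalty, edits in the first two positions pay extra, and the total is scaled by the IDF of the term being corrected to, so corrections toward frequent terms stay cheap.

```python
import math

# Toy model of a frequency-scaled correction penalty (illustrative only).
NUM_DOCS = 1_000_000
DOC_FREQ = {"the": 900_000, "hte": 3}  # hypothetical document frequencies

def idf(term):
    return math.log(NUM_DOCS / (1 + DOC_FREQ.get(term, 0)))

def correction_penalty(candidate, edit_positions,
                       per_char=1.0, early_multiplier=2.0):
    """Penalty for correcting a query term to `candidate`, given the
    0-based positions of the edited characters in the candidate."""
    # Per-character penalty, doubled for the first two positions, which
    # are typo-ed far less often in search queries.
    base = sum(per_char * (early_multiplier if pos < 2 else 1.0)
               for pos in edit_positions)
    # Scale by the target's IDF: corrections toward common terms are cheap,
    # corrections toward rare terms pay a steep penalty.
    return base * idf(candidate) / math.log(NUM_DOCS)

# Correcting [hte] -> "the" (edits at positions 0 and 1) is nearly free;
# correcting [the] -> "hte" pays a large rare-term penalty.
cheap = correction_penalty("the", [0, 1])
steep = correction_penalty("hte", [0, 1])
```

This reproduces the asymmetry described above: the low IDF of "the" makes it a cheap correction target, while the rarity of "hte" makes correcting toward it expensive.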
