- Robert Muir. Finite State Queries in Lucene.
What’s it Do?
Read the slide show. It’s really nicely presented.
The idea’s simple for us automata hackers. The set of terms making up the keys of the core term-document (reverse) index is used as suffix array (i.e. a compact representation of a trie). A query is then converted into a finite state automaton which is intersected with the trie implicitly represented by the suffix array.
This is much much faster than doing an edit distance comparison with every term, because the prefix computations are shared. If you put a bound on the maximum distance, you can reject most terms high up in the trie as you implicitly run the intersection.
It warms my heart to see this kind of nice algorithm so neatly implemented. The really groovy part is that the suffix array was already there!
As an added bonus, there’s an extended bonus section including hints on how to do term expansion at query time instead of indexing time (though this does affect TF/IDF), and some hints about how to integrate morphological processing (e.g. stemming or lemmatization).
LingPipe uses the same kind of algorithm for spell checking, only with an explicit trie with source probabilities and a customizable position-specific edit-distance model. The spell checker only handles single input strings rather than automata and provides ranked results.
I’m guessing Google’s using this kind of thing all over the place for speech rec, because they hired the guys who wrote the AT & T Finite State Machine Library and went on to build Google’s Open FST Library.
Do it Yourself
If you want to learn how to do this kind of thing yourself, you could do much worse than reading my perennial recommendation, Dan Gusfield’s Algorithms on Strings, Trees, and Sequences.
Luckily, memory will probably keep up with our terms in language indexing.
In biological sequencing applications, where they store billions of different 50-grams, the Burrows-Wheeler transformed compressed index lets you do the same thing. The only problem for search is that it doesn’t let you update dynamically (that i know of).