I edited my first substantial Wikipedia page today:

There’s been discussion of the Wikipedia’s accuracy ever since:

There was even a *New Yorker* article this week. Of course,

*The Onion* said it best, in their article Wikipedia Celebrates 750 Years Of American Independence.

I wanted to link in some doc for the exact dictionary matching I just built for LingPipe 2.4 (`com.aliasi.dict.ExactDictionaryChunker`

), but I couldn’t find a good intro to the Aho-Corasick algorithm online.

The former Wikipedia article was confusing in its description of the data structure and wrong about the complexity bound (read in terms of number of entries versus size of dictionary strings). I restated it in the usual manner (e.g., that used by Dan Gusfield in *Algorithms on Strings, Trees and Sequences*) . And I provided an example like the one I’ve been using for unit testing.

The usual statement of Aho-Corasick is that it’s linear in dictionary size plus input plus number of outputs, as there may be quadratically many ouptuts. Thinking a bit more deeply, it can’t really be quadratic without a quadratic sized dictionary. For instance, with a dictionary {a, aa, aaa, aaaa, aaaaa}, there are quadratically many outputs for aaaaa,

but the dictionary is quadratic in aaaaa (sum of first n integers: 5+4+3+2+1). With a fixed dictionary, runtime is always linear in the input, though there may be outputs proportional to the number of dictionary entries for each input symbol.

I also restated it in terms of suffix trees rather than finite-state automata, as that matches the usual presentation.

## Leave a Reply