Lucene’s Missing Token Stream Factory

While we’re on the subject of tokenization, every time I (Bob) use Lucene, I’m struck by its lack of a tokenizer factory abstraction.

Lucene’s abstract class TokenStream defines an iterator-style method next() that returns the next token or null if there are no more tokens.
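To make the iterator-style contract concrete, here is a minimal sketch in plain Java. It does not use Lucene itself: SimpleToken, SimpleTokenStream, WhitespaceTokenStream, and TokenStreamDemo are hypothetical stand-ins I've made up for illustration, and the whitespace splitting is just an assumption to have something to iterate over.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a token: its text plus start and end character offsets.
final class SimpleToken {
    final String text;
    final int start;
    final int end;

    SimpleToken(String text, int start, int end) {
        this.text = text;
        this.start = start;
        this.end = end;
    }
}

// Hypothetical stand-in for the iterator-style contract described above.
abstract class SimpleTokenStream {
    // Returns the next token, or null if there are no more tokens.
    abstract SimpleToken next();
}

// Illustrative implementation: tokens are maximal runs of non-whitespace characters.
final class WhitespaceTokenStream extends SimpleTokenStream {
    private final CharSequence input;
    private int pos = 0;

    WhitespaceTokenStream(CharSequence input) {
        this.input = input;
    }

    @Override
    SimpleToken next() {
        // Skip any whitespace before the next token.
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) {
            pos++;
        }
        if (pos >= input.length()) {
            return null; // stream exhausted
        }
        int start = pos;
        while (pos < input.length() && !Character.isWhitespace(input.charAt(pos))) {
            pos++;
        }
        return new SimpleToken(input.subSequence(start, pos).toString(), start, pos);
    }
}

final class TokenStreamDemo {
    // The usual consumption loop: call next() until it returns null.
    static List<String> drain(SimpleTokenStream stream) {
        List<String> texts = new ArrayList<>();
        for (SimpleToken t = stream.next(); t != null; t = stream.next()) {
            texts.add(t.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(drain(new WhitespaceTokenStream("Lucene token streams")));
    }
}
```

The null sentinel keeps the interface to a single method, at the cost of every caller having to remember the null check.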

Lucene uses the class Token for tokens. A Lucene token contains a string representing its text, a start and end position, and a lexical type, which I’ve never seen used. Because LingPipe has to handle many tokens quickly without taxing garbage collection, it doesn’t create objects for them beyond their string texts. But that’s the subject of another blog entry.

A Lucene Document is essentially a mapping from field names, represented as strings, to values, also represented as strings. Each field in a document may be treated differently with respect to tokenization. For instance, some might be dates, others ordinary text, and others keyword identifiers.

To handle this fielded document structure, the Lucene class analysis.Analyzer maps field names, represented as strings, and values, represented as instances of java.io.Reader, to token streams. The choice of Reader for values is itself puzzling because it introduces I/O exceptions and the question of who’s responsible for closing the reader.

Lucene overloads the analyzer class itself to provide the functionality of LingPipe’s tokenizer factory. Lucene classes such as SimpleAnalyzer and CJKAnalyzer return the same token stream no matter which field is specified. In other words, the field name is ignored.

What would be useful is a Lucene interface analysis.TokenStreamFactory with a single method TokenStream tokenizer(CharSequence input) (note how we’ve replaced the analyzer’s reader input with a generic character sequence, which also sidesteps the I/O exceptions and the question of who closes the reader). Analyzers could then be built by associating token stream factories with fields. The factory would be the natural place to implement the tokenization behind Lucene’s simple analyzer, CJK analyzer, and so on. The current behavior of those analyzers is then easily recovered with an analyzer that sets a single default token stream factory for all fields.
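Here is a rough sketch of how that proposal might look, again in plain Java rather than against Lucene's real classes. Everything in it is hypothetical: TokenStreamFactory is the interface proposed above (it does not exist in Lucene), FieldedAnalyzer and FactoryDemo are names I've invented, and an Iterator<String> of token texts stands in for TokenStream to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// The proposed factory (hypothetical): character sequence in, token stream out.
// Iterator<String> is an assumed stand-in for Lucene's TokenStream.
interface TokenStreamFactory {
    Iterator<String> tokenizer(CharSequence input);
}

// Hypothetical analyzer built by associating factories with field names.
final class FieldedAnalyzer {
    private final TokenStreamFactory defaultFactory;
    private final Map<String, TokenStreamFactory> factoryByField = new HashMap<>();

    FieldedAnalyzer(TokenStreamFactory defaultFactory) {
        this.defaultFactory = defaultFactory;
    }

    // Registers a factory for one field; returns this so calls can be chained.
    FieldedAnalyzer setFactory(String field, TokenStreamFactory factory) {
        factoryByField.put(field, factory);
        return this;
    }

    // Fields without a registered factory fall back to the default.
    Iterator<String> tokenStream(String field, CharSequence value) {
        return factoryByField.getOrDefault(field, defaultFactory).tokenizer(value);
    }
}

final class FactoryDemo {
    // Ordinary text: split on runs of whitespace.
    static final TokenStreamFactory WHITESPACE = input -> {
        String s = input.toString().trim();
        if (s.isEmpty()) {
            return Collections.<String>emptyIterator();
        }
        return Arrays.asList(s.split("\\s+")).iterator();
    };

    // Keyword identifiers: the whole value is a single token.
    static final TokenStreamFactory KEYWORD =
        input -> Collections.singletonList(input.toString()).iterator();

    // Collects a stream into a list for inspection.
    static List<String> toList(Iterator<String> stream) {
        List<String> tokens = new ArrayList<>();
        while (stream.hasNext()) {
            tokens.add(stream.next());
        }
        return tokens;
    }

    public static void main(String[] args) {
        FieldedAnalyzer analyzer =
            new FieldedAnalyzer(WHITESPACE).setFactory("id", KEYWORD);
        System.out.println(toList(analyzer.tokenStream("body", "hello token world")));
        System.out.println(toList(analyzer.tokenStream("id", "doc 42")));
    }
}
```

An analyzer constructed with only a default factory ignores the field name entirely, which is the behavior described above for SimpleAnalyzer and CJKAnalyzer.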
