Lucene’s Missing Token Stream Factory


While we’re on the subject of tokenization, every time I (Bob) use Lucene, I’m struck by its lack of a tokenizer factory abstraction.

Lucene’s abstract class TokenStream defines an iterator-style method next() that returns the next token or null if there are no more tokens.
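That contract looks roughly like the following. This is a minimal self-contained sketch using stand-in classes of our own (not Lucene's actual classes, which carry positions, offsets, and more) just to illustrate the iterator-style next()-until-null idiom:

```java
import java.util.Arrays;
import java.util.Iterator;

// Stand-ins for Lucene's Token and TokenStream (illustration only).
class Token {
    private final String text;
    Token(String text) { this.text = text; }
    String text() { return text; }
}

abstract class TokenStream {
    // Iterator-style contract: return the next token, or null when exhausted.
    abstract Token next();
}

public class TokenStreamSketch {
    // A trivial stream over a fixed list of token texts.
    static TokenStream streamOf(String... tokens) {
        final Iterator<String> it = Arrays.asList(tokens).iterator();
        return new TokenStream() {
            Token next() { return it.hasNext() ? new Token(it.next()) : null; }
        };
    }

    // The standard consumption idiom: call next() until it returns null.
    static String joinTokens(TokenStream stream) {
        StringBuilder sb = new StringBuilder();
        for (Token tok = stream.next(); tok != null; tok = stream.next()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(tok.text());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(joinTokens(streamOf("the", "quick", "fox")));
    }
}
```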

Lucene uses the abstract class Token for tokens. A Lucene token contains a string representing its text, a start and end position, and a lexical type which I’ve never seen used. Because LingPipe has to handle many tokens quickly without taxing garbage collection, it doesn’t create objects for them beyond their string texts. But that’s the subject of another blog entry.

A Lucene Document is essentially a mapping from field names, represented as strings, to values, also represented as strings. Each field in a document may be treated differently with respect to tokenization. For instance, some might be dates, others ordinary text, and others keyword identifiers.

To handle this fielded document structure, the Lucene class analysis.Analyzer maps field names, represented as strings, and values, represented as instances of java.io.Reader, to token streams. The choice of Reader for values is itself puzzling because it introduces I/O exceptions and the question of who’s responsible for closing the reader.

Lucene overloads the analyzer class itself to provide the functionality of LingPipe’s tokenizer factory. Lucene classes such as SimpleAnalyzer and CJKAnalyzer return the same token stream no matter which field is specified. In other words, the field name is ignored.

What would be useful is a Lucene interface analysis.TokenStreamFactory with a single method TokenStream tokenizer(CharSequence input) (note that we’ve replaced the analyzer’s reader input with a generic character sequence). Analyzers could then be built by associating token stream factories with fields. This would be the natural place to implement Lucene’s simple analyzer, CJK analyzer, and so on. The current analyzer behavior is easily recovered with an analyzer that sets a default token stream factory for all fields.
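A sketch of that proposal, with stand-in classes of our own (the interface name, the whitespace factory, and the FieldAnalyzer class are all hypothetical, not Lucene API):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Stand-in for Lucene's TokenStream (here just yielding token strings).
abstract class TokenStream {
    abstract String next(); // null when exhausted
}

// The proposed factory: tokenizes a CharSequence directly,
// with no Reader and hence no I/O exceptions to manage.
interface TokenStreamFactory {
    TokenStream tokenizer(CharSequence input);
}

public class FieldAnalyzer {
    // A whitespace-splitting factory standing in for, say, SimpleAnalyzer.
    static final TokenStreamFactory WHITESPACE = new TokenStreamFactory() {
        public TokenStream tokenizer(CharSequence input) {
            final Iterator<String> it
                = Arrays.asList(input.toString().split("\\s+")).iterator();
            return new TokenStream() {
                String next() { return it.hasNext() ? it.next() : null; }
            };
        }
    };

    private final Map<String,TokenStreamFactory> factories
        = new HashMap<String,TokenStreamFactory>();
    private final TokenStreamFactory defaultFactory;

    public FieldAnalyzer(TokenStreamFactory defaultFactory) {
        this.defaultFactory = defaultFactory;
    }

    // Associate a factory with a field; unlisted fields use the default.
    public void setFactory(String field, TokenStreamFactory factory) {
        factories.put(field, factory);
    }

    public TokenStream tokenStream(String field, CharSequence text) {
        TokenStreamFactory f = factories.get(field);
        return (f != null ? f : defaultFactory).tokenizer(text);
    }

    public static void main(String[] args) {
        FieldAnalyzer analyzer = new FieldAnalyzer(WHITESPACE);
        TokenStream ts = analyzer.tokenStream("title", "hello token world");
        for (String tok = ts.next(); tok != null; tok = ts.next())
            System.out.println(tok);
    }
}
```

The default factory in the constructor is what gives back the current field-blind analyzer behavior for free.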

4 Responses to “Lucene’s Missing Token Stream Factory”

  1. Thomas Jung Says:

    Please take this with a grain of salt, as I have not looked at Lucene in a while: I thought the choice of using a Reader as the input for the Analyzer/TokenStream was to provide a more flexible interface to any character stream. Especially for very large documents, it may be useful to have some sort of stream-based representation. Loading documents into memory in their entirety (e.g. as a String or CharSequence) may not always be an option; or they already exist as stream (such as from a network resource).

    Otherwise, I like your idea for the TokenStreamFactory.

  2. lingpipe Says:

    That’s a good point. Using a Reader makes it possible to read a very large text field without having it all in memory. Then you’d need a sparse tokenizer or you’d need to up the number of terms to save per doc.

    In any case, there should be some indication on the method as to the longevity of the handle to the reader and how long it’s kept open and whether Lucene closes it. What’s nice about Lucene’s being open source is that you can just look at the code and figure out what’s going on.

    Of course, having a TokenStreamFactory is independent of whether you tokenize strings or readers.

  3. Otis Gospodnetic Says:

    Bob, have you seen Solr’s TokenizerFactory interface? I think it matches what you are missing in Lucene. It contains the following method:

    /** Creates a TokenStream of the specified input */
    public TokenStream create(Reader input);

    It’s still not clear who closes the Reader, but a little javadoc would fix that (or is there a better way to set/force the Reader closing rules?)

  4. Bob Carpenter Says:

    I hadn’t seen it, but that’s exactly what I was talking about.

    The nice thing is, it’s compatible with the rest of Lucene. For instance, it’d be the right thing to plug in on a per-field basis to a org.apache.lucene.analysis.PerFieldAnalyzerWrapper, which now uses addAnalyzer(String field, Analyzer analyzer).

    Where do you get a benefit from using a Reader?

    If the token stream really streams the tokens from the reader, the reader has to be closed by whoever’s consuming the token stream, which in most use cases is the indexer. Does the indexer guarantee that each field’s reader will be closed even if one of them throws an I/O exception at some point?

    I haven’t used Lucene in many different contexts, but won’t the need to parse the documents you get into fields defeat the benefits of using a Reader anyway? I don’t see how I could usefully parse a doc and then reconstitute readers for it that streamed based on the original stream.

    Using readers is also the reason the token stream’s next() method throws an I/O exception (which is why, for instance, it can’t implement Java’s Iterator<Token> interface).
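    The incompatibility is that java.util.Iterator.next() declares no checked exceptions, so a next() that throws IOException can't directly implement it. The usual workaround is an adapter that buffers one token ahead and wraps the checked exception in an unchecked one; here's a sketch with a hypothetical IOTokenStream interface standing in for the reader-backed stream:

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical reader-backed stream: next() throws a checked IOException,
// so it cannot directly implement java.util.Iterator<String>.
interface IOTokenStream {
    String next() throws IOException; // null when exhausted
}

// Adapter: buffer one token of lookahead and rethrow the checked
// IOException as an unchecked exception inside Iterator's methods.
public class TokenIterator implements Iterator<String> {
    private final IOTokenStream stream;
    private String lookahead;

    public TokenIterator(IOTokenStream stream) {
        this.stream = stream;
        advance();
    }

    private void advance() {
        try {
            lookahead = stream.next();
        } catch (IOException e) {
            throw new RuntimeException(e); // checked -> unchecked
        }
    }

    public boolean hasNext() { return lookahead != null; }

    public String next() {
        if (lookahead == null) throw new NoSuchElementException();
        String tok = lookahead;
        advance();
        return tok;
    }

    public void remove() { throw new UnsupportedOperationException(); }
}
```

    A CharSequence-based tokenizer sidesteps all of this, since in-memory tokenization has no I/O to fail.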

    And a quick grep of the source reveals an awful lot of strings being wrapped as readers for compatibility, including internal wrapper classes like index.ReusableStringReader.
