What Should Tokenizers Do?


I’ve written before on the curse of intelligent tokenization, where I bemoaned the context-sensitivity of tokenizers like the Penn Treebank’s, which is sensitive to ends of sentences, and those of Penn’s BioIE and OpenNLP, whose tokens can both be extracted by statistical decoders.

Question of the Day

Should tokenizers modify tokens or should they just return slices of their input?

I think there’s a lot to be said for both views, both in terms of conceptual clarity and efficiency. We chose to allow the former in LingPipe, but I’m not sure I’d make the same choice if I had a do-over.

I’m confident in saying that tokens should link to a span of text, if not be identified with a span of text. Otherwise, you can’t easily do things like highlighting in applications. LingPipe only gets halfway there with a start position; if there’s stemming or other normalization, you can lose the end position. Maybe I’ll go and add that; luckily I made Tokenizer an abstract base class, so I can add (implemented) methods without compromising backward compatibility.
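To make the highlighting point concrete, here’s a minimal sketch of a token that carries both a start and an end offset into the original input. The class name OffsetTokenSketch and its highlight() method are made up for illustration; this is not LingPipe’s Tokenizer API.

```java
// Hypothetical illustration only; not part of LingPipe.
class OffsetTokenSketch {
    final String text; // token text, possibly normalized (stemmed, lower-cased, ...)
    final int start;   // offset of the first character in the original input
    final int end;     // offset one past the last character in the original input

    OffsetTokenSketch(String text, int start, int end) {
        this.text = text;
        this.start = start;
        this.end = end;
    }

    // Brackets the original span; the (possibly modified) token text is not consulted.
    static String highlight(String input, OffsetTokenSketch token) {
        return input.substring(0, token.start)
            + "[" + input.substring(token.start, token.end) + "]"
            + input.substring(token.end);
    }
}
```

With both offsets on hand, a stemmed token like “run” can still highlight “running” in the source text; with only a start position, there’s no way to know how far the original span extended.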

Java’s Tokenizers

Java’s Stream Tokenizer

At one extreme are the familiar programming language tokenizers like Lex, which is the model followed by Java’s own java.io.StreamTokenizer, as originally written by the master himself, James Gosling. First note that the input is completely streaming in the form of a java.io.Reader; this means it can really scale to lots of text with little memory. The method nextToken() returns an integer type of the next token (i.e. number, word, end-of-line, and end-of-stream in the stream tokenizer) and toString() gives you a string representation. More modern Java would convert that integer type to an enum, and no one would dream of overloading toString() that way. The to-string method actually ends with the statement return "Token[" + ret + "], line " + LINENO;; as you can see, no one expected you to actually use these as tokens in an NLP sense.

This way of doing things is the most efficient way possible if you’re not going to modify any underlying tokens and just return slices. You can easily generate Java strings, UIMA offset representations, or OpenPipeline-type tokens.
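For anyone who hasn’t used it, here’s a rough sketch of driving java.io.StreamTokenizer in its default configuration. The wrapper class StreamTokenizerSketch and the "WORD:"/"NUMBER:" labels are my own invention; only the StreamTokenizer calls and fields (nextToken(), ttype, sval, nval, and the TT_ constants) come from the JDK.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamTokenizerSketch {

    // Collects a readable description of each token found in the stream.
    static List<String> describeTokens(Reader reader) {
        StreamTokenizer tokenizer = new StreamTokenizer(reader);
        List<String> out = new ArrayList<String>();
        try {
            // nextToken() returns an int type code; the value is left in sval or nval.
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                switch (tokenizer.ttype) {
                    case StreamTokenizer.TT_WORD:
                        out.add("WORD:" + tokenizer.sval);
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        out.add("NUMBER:" + tokenizer.nval);
                        break;
                    default:
                        // Anything else is reported by its type code, shown as a char.
                        out.add("CHAR:" + (char) tokenizer.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

Note that the caller only ever sees one token at a time, and numbers arrive already parsed as doubles, which is a reminder that this class was built for programming-language-style input rather than natural language.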

Java’s String Tokenizer

If you then consider Java’s StringTokenizer (which no one took credit for in the code!), you’ll see that they follow a pattern like Chris Cleveland used for OpenPipeline (and commented about in our last blog entry about tokenization). That is, they take an entire representation of a string rather than streaming, then return substrings of it as tokens, which means no more char array allocation. They don’t need to keep all of the strings — you can free up tokens for garbage collection after using them.
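The string tokenizer pattern looks roughly like this. The wrapper class StringTokenizerSketch is hypothetical; the java.util.StringTokenizer calls are the standard ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class StringTokenizerSketch {

    // Splits on StringTokenizer's default delimiters (space, tab, newline, return, form feed).
    static List<String> tokens(String text) {
        StringTokenizer st = new StringTokenizer(text);
        List<String> out = new ArrayList<String>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken()); // each token is a substring of the input
        }
        return out;
    }
}
```

The tokens are unmodified slices of the input, but StringTokenizer only hands back the strings themselves, not their offsets, so you’d still have to track positions yourself if you wanted to link tokens back to the text.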

Tokenization for NLP

LingPipe and Lucene

Back in version 1.0, LingPipe only needed tokenizers to convert strings into tokens for named-entity detectors and, soon after, general HMMs for part-of-speech taggers. These models are defined as mappings from sequences of symbols (e.g. tokens in natural language or base pairs in genomics) to sequences of labels (or tags). The models are neither position-sensitive nor whitespace-sensitive.

LingPipe, like Apache Lucene, set up a base tokenizer that essentially iterates tokens, along with filters. Lucene’s Tokenizer iterates tokens, and its TokenFilter modifies these streams.

One difference is that Lucene takes Reader instances as input for tokenization, like Java’s stream tokenizers, whereas LingPipe operates over character array slices, like Java’s string tokenizer. With a do-over, I’d have tokenizers accept CharSequence instances rather than character array slices. But it’s an interface, so I can’t do it without sacrificing backward compatibility.

This pattern makes it particularly easy to define operations like case normalization, stemming, stoplisting, synonym expansion, etc. etc. But should we?
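To make the worry concrete, here’s a minimal sketch of the iterate-and-filter pattern using plain java.util.Iterator. Everything in it (the class FilterPipelineSketch and its three methods) is made up for illustration and is neither LingPipe’s nor Lucene’s actual API; the stop filter is also eager rather than streaming, just to keep it short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class FilterPipelineSketch {

    // Base tokenizer: iterates the whitespace-separated tokens of the input.
    static Iterator<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<String>().iterator();
        }
        return Arrays.asList(trimmed.split("\\s+")).iterator();
    }

    // Modifying filter: rewrites each token to lower case as it passes through.
    static Iterator<String> lowerCase(final Iterator<String> in) {
        return new Iterator<String>() {
            public boolean hasNext() { return in.hasNext(); }
            public String next() { return in.next().toLowerCase(Locale.ENGLISH); }
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }

    // Removing filter: drops any token found in the stop set.
    static Iterator<String> removeStops(Iterator<String> in, Set<String> stops) {
        List<String> kept = new ArrayList<String>();
        while (in.hasNext()) {
            String token = in.next();
            if (!stops.contains(token)) {
                kept.add(token);
            }
        }
        return kept.iterator();
    }
}
```

Chaining removeStops(lowerCase(whitespaceTokens(text)), stops) is pleasantly compact, but what comes out the far end is a stream of strings that are no longer slices of the input and carry no record of where they came from.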


Mallet

Mallet, on the other hand, reifies complete tokenizations in a cc.mallet.extract.Tokenization object, which is a document plus set of spans. This is what I was talking about in the comments as representing an entire tokenization; they even named it properly.

LingPipe Chunking

A Mallet tokenization shares some degree of similarity with LingPipe’s chunkings. A chunking is a character sequence plus a set of chunks, where a chunk has a start position, end position, string-based tag, and double-based score. They can overlap, there can be multiple chunks with the same spans (but not the same span and type), etc. It’d be easy to implement an adapter that would implement a chunker (function taking char sequences to chunkings) based on a tokenizer. There are, in fact, reg-ex based chunkers in the chunk package and reg-ex based tokenizer factories in the tokenizer package.
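Here’s roughly what such an adapter might look like, assuming the tokenizer is nothing more than a regular expression whose matches are the tokens. The names TokenChunkerSketch and SimpleChunk, the "TOKEN" type, and the constant score of 1.0 are all placeholders of mine; this is not the code in LingPipe’s chunk package.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenChunkerSketch {

    // Hypothetical stand-in for a chunk: a span plus a string type and a double score.
    static final class SimpleChunk {
        final int start;
        final int end;
        final String type;
        final double score;

        SimpleChunk(int start, int end, String type, double score) {
            this.start = start;
            this.end = end;
            this.type = type;
            this.score = score;
        }

        @Override
        public String toString() {
            return start + "-" + end + ":" + type;
        }
    }

    // Treats every match of the pattern as a token and reports it as a chunk.
    static Set<SimpleChunk> chunk(CharSequence cs, Pattern tokenPattern) {
        Set<SimpleChunk> chunks = new LinkedHashSet<SimpleChunk>(); // keeps text order
        Matcher m = tokenPattern.matcher(cs);
        while (m.find()) {
            chunks.add(new SimpleChunk(m.start(), m.end(), "TOKEN", 1.0));
        }
        return chunks;
    }
}
```

The result pairs with the original character sequence to give something that plays the same role as a reified tokenization: the tokens are just spans, and the strings can be recovered on demand.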


UIMA

UIMA’s Common Analysis Structure (CAS) encodings tend to do the same thing, but they’re very flexible, so you could build just about anything you want.

So to filter the output of a tokenizer, you get the whole tokenization to work with. This gives you the utmost in flexibility, though it doesn’t necessarily track the history of derivations. Of course, you can represent just about anything in CAS, so you could track this kind of thing if you want, and do it very flexibly, as in pointing to the three base tokens that produced a token trigram, or pointing to dictionary resources online used for synonymy, or pointing to a whole analysis tree for stemming.
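As a sketch of what tracking derivations could look like outside of any particular framework, here’s a token type that points back to the tokens it was built from, using the token-trigram case. The class DerivedTokenSketch and the underscore-joined trigram text are assumptions of mine; this is plain Java, not a UIMA type system.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class DerivedTokenSketch {
    final String text;
    final int start;                        // span covered in the original input
    final int end;
    final List<DerivedTokenSketch> sources; // tokens this one was derived from; empty for base tokens

    DerivedTokenSketch(String text, int start, int end, List<DerivedTokenSketch> sources) {
        this.text = text;
        this.start = start;
        this.end = end;
        this.sources = sources;
    }

    // A base token is just a slice of the input with no derivation history.
    static DerivedTokenSketch base(String input, int start, int end) {
        return new DerivedTokenSketch(input.substring(start, end), start, end,
                Collections.<DerivedTokenSketch>emptyList());
    }

    // Builds token trigrams, each pointing to the three base tokens that produced it.
    static List<DerivedTokenSketch> trigrams(List<DerivedTokenSketch> base) {
        List<DerivedTokenSketch> out = new ArrayList<DerivedTokenSketch>();
        for (int i = 0; i + 3 <= base.size(); i++) {
            List<DerivedTokenSketch> parts = new ArrayList<DerivedTokenSketch>(base.subList(i, i + 3));
            String text = parts.get(0).text + "_" + parts.get(1).text + "_" + parts.get(2).text;
            out.add(new DerivedTokenSketch(text, parts.get(0).start, parts.get(2).end, parts));
        }
        return out;
    }
}
```

The same back-pointer idea would extend to stems or synonyms, with the sources list replaced or supplemented by whatever resource justified the derivation.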

Once you start thinking like this, tokenization becomes a very rich process.


MinorThird

MinorThird defines its edu.cmu.minorthird.text.Tokenizer interface to map documents (or strings) to arrays of TextToken objects. These token objects hold a pointer to the underlying string or document data and represent token positions as offsets.


OpenNLP

OpenNLP defines its opennlp.tools.tokenize.Tokenizer interface to map an input string to an array of tokens represented as strings or as span objects.

An OpenNLP span contains a start and end offset, as well as a method, getCoveredText(), to return the string yield. But then I didn’t see any typing information other than the covered text or any (public) extensions. There were lots of useful utility methods which we’ve packed (some of) into static methods to consider overlaps, containment, begins and ends. It also implements Comparable (generic would be Comparable<Span>), which our chunks don’t (but there are static utility comparators).
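For comparison, here’s a minimal sketch of a span type with the kind of utility methods described above. SpanSketch and all of its method names are hypothetical; I’m not reproducing OpenNLP’s Span class, just illustrating overlap, containment, covered text, and a natural ordering.

```java
// Hypothetical illustration only; not OpenNLP's Span.
class SpanSketch implements Comparable<SpanSketch> {
    final int start; // inclusive
    final int end;   // exclusive

    SpanSketch(int start, int end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("bad span: " + start + "-" + end);
        }
        this.start = start;
        this.end = end;
    }

    // True if the other span lies entirely within this one.
    boolean contains(SpanSketch other) {
        return start <= other.start && other.end <= end;
    }

    // True if the two spans share at least one character position.
    boolean overlaps(SpanSketch other) {
        return start < other.end && other.start < end;
    }

    // The string yield of this span over the given text.
    String coveredText(CharSequence text) {
        return text.subSequence(start, end).toString();
    }

    // Orders spans by start offset, breaking ties by end offset.
    public int compareTo(SpanSketch other) {
        if (start != other.start) {
            return Integer.compare(start, other.start);
        }
        return Integer.compare(end, other.end);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof SpanSketch
            && ((SpanSketch) o).start == start
            && ((SpanSketch) o).end == end;
    }

    @Override
    public int hashCode() {
        return 31 * start + end;
    }
}
```

Because the span doesn’t hold a reference to its text, the caller has to supply the character sequence to get the yield, which is the same trade-off as LingPipe’s chunks.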


OpenPipeline

Dieselpoint, Inc.’s OpenPipeline follows a hybrid model that uses Java string reader style representations of tokens, but then allows them to pipeline through filters like Mallet/UIMA. I had to download the API to get at the javadoc; I couldn’t find it online. [Update: Chris has since put the Javadocs up on the OpenPipeline site, so there’s no need to download anything.]

I believe what Chris Cleveland, the architect of OpenPipeline, was suggesting was to do a simple tokenization as spans, then everything on top of that is a different kind of analysis. I think this makes sense as an answer to the question of the day.

And More…

Please feel free to add other models of tokenization if I’ve missed them.

4 Responses to “What Should Tokenizers Do?”

  1. Chris Cleveland Says:

    Just a couple of comments —

    Our intent with OpenPipeline was to have a tokenizer/analyzer that was as flexible as UIMA’s. You can get that by subclassing the Token class and adding your own fields, the same way that UIMA does it. So you could have a new Token type that points to the three subtokens that make it up.

    That’s in fact what we do with our Chunk class. Suppose you want to recognize acronyms like “I.B.M.”. Our chunk analyzer will recognize the entire construct and retain pointers to the underlying tokens. You can use them for further analysis.

    We’ve put the javadoc up on the OpenPipeline site. No need to download the product. Sorry about that.

  2. lingpipe Says:

    This whole discussion’s been very useful to me, and has led to a flurry of discussion in the office, because we’re doing lots of token-level string matching these days to link gene mentions to databases. I really had no idea people were leaning so heavily on tokenizations.

    We have the same kind of Chunk interface in LingPipe. It’s basically a start and end position in a char sequence, though the interface requires a string-based type (these can, of course, be shared across chunks), and a double-based score.

    interface Chunk extends Scored {
        int start();
        int end();
        String type();
        double score();
    }

    We have different implementations produced by factory that store only start/end (and provide default types and scores by class-level method implementation, not by pointer/storage), start/end+type, and start/end+type+score.

    The interface doesn’t specify any way to get from a chunk back to its char sequence. Instead, a Chunking then combines a char sequence and a set of chunks that are in range of the char sequence. Subclassing Chunk lets you add payloads to the individual chunks, but requires an unchecked cast to get back. What would be neater in Java’s type system is:

    interface GenericChunking<C extends Chunk> {
        Set<C> chunkSet();
        CharSequence charSequence();
    }

    Then our current Chunking, which returns sets of plain-old chunks, could extend GenericChunking<Chunk>.

    The alternative to extending Chunk would’ve been to generify its type, as in:

    interface GenericChunk<E> {
        int start();
        int end();
        E type();
    }

    And then our current Chunk interface could extend GenericChunk<String> and add an implementation of the Scored interface [that is, double score();].

  3. Tom Morton Says:

    OpenNLP actually supports strings and spans. The string implementation just calls the span implementation and uses the spans to construct the strings. tokenizePos is probably poorly named but means tokenize positions.


    java.lang.String[] tokenize(java.lang.String s)
    Span[] tokenizePos(java.lang.String s)

  4. lingpipe Says:

    Thanks, Tom. I amended the original post to reflect this. It shows the problem with reading JavaDoc too quickly: I didn’t look at the inherited methods on the English tokenizer.


    I really think they should pull down and copy all the doc from superclasses as the default behavior in JavaDoc.

    It would be less confusing for casual browsers like me if the four classes named Tokenizer implementing the interface named Tokenizer were given more specific names, like EnglishTokenizer or ThaiTokenizer.
