Tokenizer Pattern Smackdown: Factories, Constructors, Singletons, Filters, and Adapters

We’re rolling out a suite of feature extraction utilities and tokenizer factories for the upcoming 3.8 release. This has caused me (and those in my questioning range, namely Mike, Mitzi, and Breck) considerable anguish. I’m trying to be a good boy and follow Josh Bloch’s advice in Effective Java. But like Paul Grice’s conversational maxims, which urge you to be both concise and unambiguous, Bloch’s individually reasonable-sounding advice is rather contradictory when considered collectively.

The Tokenization Interfaces

We’re pretty much locked into our basic interfaces and base classes, which for the purposes of this discussion may be thought of as:

interface Tokenizer {
    String nextToken();
}

interface TokenizerFactory {
    Tokenizer tokenizer(char[] cs, int start, int len);
}

Chaining Tokenizers with Filters

Tokenizers like to chain. Typically, you start with a base tokenizer, like our Indo-European tokenizer, then filter it by lowercasing tokens, converting to stems, removing stopwords, and so on. This is a classic instance of the filter pattern.

Way back in version 1.0, tokenizers were often external to models. This is still the case, for instance, with HMMs, which take in an array of strings (representing tokens) and return an array of strings (representing tags, or hidden states). It’s then the client’s job to make sure the tokens supplied at run time match those used for training.
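To make that concrete, here’s a minimal sketch of the external-tokenization arrangement; the Tagger interface is just a hypothetical stand-in for an HMM decoder, not part of the LingPipe API:

// Hypothetical stand-in for an HMM-style tagger; not a LingPipe interface.
interface Tagger {
    String[] tag(String[] tokens);
}

// The client tokenizes on its own and hands the model plain strings.
// It's the client's job to tokenize the same way the model was trained.
static String[] tagText(String text, Tagger tagger) {
    char[] cs = text.toCharArray();
    Tokenizer tokenizer = new IndoEuropeanTokenizer(cs, 0, cs.length);
    String[] tokens = tokenizer.tokenize();
    return tagger.tag(tokens);
}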

For the tokenizers themselves, the natural thing to do was to implement the chaining with a filter pattern. With simple tokenizers and filtered tokenizers, we’d do things like:

Tokenizer ieTok = new IndoEuropeanTokenizer(cs,start,len);
Tokenizer lowTok = new LowerCaseFilterTokenizer(ieTok);
Tokenizer stopTok = new EnglishStopFilterTokenizer(lowTok);
String[] tokens = stopTok.tokenize();

Looks just like java.io streams, readers, and writers in action. That is, we construct a tokenizer instance using a constructor (in this case, IndoEuropeanTokenizer(cs,start,len)). Then we apply a bunch of filters through their constructors (e.g. LowerCaseFilterTokenizer(ieTok)). Each filter holds onto its contained tokenizer and modifies its output in some way.
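Under the hood, each filter is just a thin wrapper around the tokenizer it contains. Here’s a minimal sketch of what a lowercasing filter looks like against the simplified interfaces above (not the actual LingPipe source, which also has to pass through whitespaces and positions):

class LowerCaseFilterTokenizer implements Tokenizer {

    private final Tokenizer mTokenizer;

    LowerCaseFilterTokenizer(Tokenizer tokenizer) {
        mTokenizer = tokenizer;
    }

    public String nextToken() {
        String token = mTokenizer.nextToken();
        return token == null ? null : token.toLowerCase();
    }
}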

Type Raising: Factories

In order to supply tokenization capabilities to a model, we need a way to create tokenizers from strings, which is where the tokenizer factory comes in. The interface is easy, but the implementations are harder. To apply the tokenizer-level filters, we need something like this:

TokenizerFactory tf = new TokenizerFactory() {
    public Tokenizer tokenizer(char[] cs, int start, int len) {
        Tokenizer ieTok
            = new IndoEuropeanTokenizer(cs,start,len);
        Tokenizer lowTok
            = new LowerCaseFilterTokenizer(ieTok);
        Tokenizer stopTok
            = new EnglishStopFilterTokenizer(lowTok);
        return stopTok;
    }
};

Oh, and you probably want to make that implement java.io.Serializable, which means you need a static nested class or new class file on which to hang the declaration. Writing robust serialization is a pain (LingPipe’s util.AbstractExternalizable helps immensely in following Josh Bloch’s recommended practices).
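If you roll it by hand rather than using the helper, the minimal version is just a named class on which to hang the Serializable declaration (the class name here is illustrative):

// Illustrative named class; being stateless, there's nothing to
// serialize beyond the class identity itself.
class FilteredTokenizerFactory
    implements TokenizerFactory, java.io.Serializable {

    static final long serialVersionUID = 1L;

    public Tokenizer tokenizer(char[] cs, int start, int len) {
        Tokenizer tok = new IndoEuropeanTokenizer(cs,start,len);
        tok = new LowerCaseFilterTokenizer(tok);
        tok = new EnglishStopFilterTokenizer(tok);
        return tok;
    }
}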

Filtering Tokenizer Factories

This was all getting to be too much. What we really need is something to filter tokenizer factories. What we eventually settled on was the same-old pattern, reasoning that it’d be much more familiar to programmers because of its ubiquity in the Java libraries (like java.io and org.xml.sax). What we now have is something like this:

TokenizerFactory ieTf = new IndoEuropeanTokenizerFactory();
TokenizerFactory lowTf 
    = new LowerCaseTokenizerFactory(ieTf);
TokenizerFactory stopTf 
    = new EnglishStopTokenizerFactory(lowTf);

Singleton Base Factories

Typically, a tokenizer factory is stateless and thread safe, and hence can be modeled using the singleton pattern, meaning a single instance that is used wherever an instance of that tokenizer factory is needed.

So I’ve gone through and deprecated the constructors (e.g. new IndoEuropeanTokenizerFactory) in favor of static final instances (e.g. IndoEuropeanTokenizerFactory.INSTANCE), which is by far the safest way to implement singletons in Java. Java’s class loader even provides lazy instantiation (the singleton’s created when the class is loaded, and being static and final, is always safely published).
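The idiom itself is the standard eagerly initialized singleton; roughly like this (a sketch, not the actual LingPipe source, which also keeps the deprecated constructor around):

public class IndoEuropeanTokenizerFactory
    implements TokenizerFactory, java.io.Serializable {

    // Created when the class loads; static and final, so safely published.
    public static final IndoEuropeanTokenizerFactory INSTANCE
        = new IndoEuropeanTokenizerFactory();

    public Tokenizer tokenizer(char[] cs, int start, int len) {
        return new IndoEuropeanTokenizer(cs,start,len);
    }

    // Makes deserialization return the canonical instance rather
    // than minting a second one.
    private Object readResolve() {
        return INSTANCE;
    }
}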

It’s an easy change for clients: the nullary constructor call in the first line is simply replaced with the instance:

TokenizerFactory ieTf = IndoEuropeanTokenizerFactory.INSTANCE;
...

Singleton Tokenizer Factory Filters?

So why not do the same thing with the filters? That is, define an interface such as this one:

interface TokenizerFactoryFilter {
    TokenizerFactory filter(TokenizerFactory factory);
}

They’re easy to implement, too, given that we’ve already defined the constructor-based filter LowerCaseTokenizerFactory. Here we’ve just gone ahead and assigned each filter to a singleton, as we never need more than one instance:

static final TokenizerFactoryFilter LOWER_CASE_FILTER
    = new TokenizerFactoryFilter() {
        public TokenizerFactory filter(TokenizerFactory f) {
            return new LowerCaseTokenizerFactory(f);
        }
    };

We can even collect the singleton tokenizer factories and filters into a big utility class and privatize the classes that implement them. Then what we’re left with is something like this:

TokenizerFactory f = 
    ENGLISH_STOP_FILTER
    .filter(LOWER_CASE_FILTER
            .filter(INDO_EUROPEAN_FACTORY));
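The holder class behind those constants would look something like this (hypothetical class name; a sketch of the design we were considering, not shipping LingPipe code):

public final class TokenizerFactories {

    private TokenizerFactories() { }  // constants only; no instances

    public static final TokenizerFactory INDO_EUROPEAN_FACTORY
        = IndoEuropeanTokenizerFactory.INSTANCE;

    public static final TokenizerFactoryFilter LOWER_CASE_FILTER
        = new TokenizerFactoryFilter() {
            public TokenizerFactory filter(TokenizerFactory f) {
                return new LowerCaseTokenizerFactory(f);
            }
        };

    public static final TokenizerFactoryFilter ENGLISH_STOP_FILTER
        = new TokenizerFactoryFilter() {
            public TokenizerFactory filter(TokenizerFactory f) {
                return new EnglishStopTokenizerFactory(f);
            }
        };
}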

As cool as that is, and as much as it follows all of Josh Bloch’s advice to a tee, it comes with tradeoffs. By hiding all the classes, you cut down on cognitive load. Unfortunately, you also cut down on findability and wind up with one big class with tons of doc and singletons in it. You do get the advantage of being able to return whatever types you want and retain the freedom to change implementations (modulo serialization backward compatibility) at will.

Let’s Not Innovate on Style

But in the end, we decided it was so non-standard that it’d just be confusing to clients. So we went with the more familiar constructor-based filter pattern. Although I did make the base factories singletons.

Adapters to the Rescue

What I did need to do was add adapter implementations. Most of the tokenizer factory filter implementations really only modify the tokenizer, and even then, they typically only modify the tokens and whitespaces individually. So I broke out ye olde adapter pattern, first for tokenizers:

abstract class ModifiedTokenizerFactory
    implements TokenizerFactory {

    private final TokenizerFactory mBaseFactory;

    public ModifiedTokenizerFactory(TokenizerFactory factory) {
        mBaseFactory = factory;
    }

    public Tokenizer tokenizer(char[] cs, int start, int len) {
        Tokenizer tokenizer
            = mBaseFactory.tokenizer(cs,start,len);
        return filter(tokenizer);
    }

    public abstract Tokenizer filter(Tokenizer tok);

}

and then one more level to filter the tokenizer itself (in simplified form; the real implementation has to deal with deletions of tokens, too, in order to generalize to stop lists):

abstract class ModifyTokenFactory
    extends ModifiedTokenizerFactory {

    public ModifyTokenFactory(TokenizerFactory factory) {
        super(factory);
    }

    public Tokenizer filter(final Tokenizer in) {
        return new Tokenizer() {
            public String nextToken() {
                String token = in.nextToken();
                return token == null  // null signals end of stream
                    ? null
                    : modify(token);
            }
        };
    }

    abstract String modify(String token);

}

It makes implementing the filters a piece of cake. The basic idea is to adapt a token modifier up to a tokenizer modifier and then a tokenizer factory modifier.
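For instance, a lowercasing factory filter boils down to a single method under these adapters (a sketch against the simplified classes above, not the LingPipe source):

class LowerCaseModifyTokenFactory extends ModifyTokenFactory {

    public LowerCaseModifyTokenFactory(TokenizerFactory factory) {
        super(factory);
    }

    String modify(String token) {
        return token.toLowerCase();  // lowercase each token as it streams by
    }
}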

Feature Extractors: More of the Same

Most of this discussion applies to feature extractors, too, which we’ve modified in the same way as tokenizers for 3.8. In fact, when you create feature extractors on top of tokenizers, it looks even more like Java I/O, because you move from tokenizer factories to feature extractors the same way Java I/O moves from streams to readers/writers.
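Here’s a minimal sketch of a bag-of-tokens feature extractor layered over a tokenizer factory. The FeatureExtractor interface shown is simplified for the example, not the exact LingPipe signature, and it assumes nextToken() returns null when the tokens run out:

import java.util.HashMap;
import java.util.Map;

interface FeatureExtractor {
    Map<String,Integer> features(CharSequence in);
}

class TokenCountFeatureExtractor implements FeatureExtractor {

    private final TokenizerFactory mFactory;

    TokenCountFeatureExtractor(TokenizerFactory factory) {
        mFactory = factory;
    }

    public Map<String,Integer> features(CharSequence in) {
        char[] cs = in.toString().toCharArray();
        Tokenizer tokenizer = mFactory.tokenizer(cs,0,cs.length);
        Map<String,Integer> counts = new HashMap<String,Integer>();
        String token;
        while ((token = tokenizer.nextToken()) != null) {
            Integer n = counts.get(token);
            counts.put(token, n == null ? 1 : n + 1);
        }
        return counts;
    }
}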

7 Responses to “Tokenizer Pattern Smackdown: Factories, Constructors, Singletons, Filters, and Adapters”

  1. Chris Cleveland Says:

    I’m not convinced. A few impressions:

First, in enterprise search pipeline architectures, you can mix and match processing stages, including stages in the tokenization process. You can, for example, modify the character normalization routines, strip or not strip stopwords, etc. Each stage needs to run independently. It’s hard to do that with filter chains or any kind of streaming of tokens. We’ve elected to do it in OpenPipeline just by passing lists of tokens from stage to stage.

    Second, using Strings as tokens is inefficient. Yes, garbage collectors are better these days, but they’re not perfect. The alternative is to have a Token object that points to a span of characters in an underlying char buffer, the way OpenPipeline and UIMA do it. Way more memory efficient.

    Third, if the Token points to a char buffer, then you can examine the surrounding chars, including whitespace, and you don’t have to store whitespace tokens. You can also attach the character offset in the original file to the Token which can be really helpful when displaying search engine results.

    Third, it’s helpful in some circumstances to see a chain of modified tokens going back into history. Example: we use a tokenizer to analyze queries in our search engine. The words in the query need to be normalized (lowercased), but the reserved words are case-sensitive. You don’t know whether a token is a search term or a reserved word until the query has been parsed, which happens *after* the tokenization process is complete. So for the reserved words, we need to be able to follow a chain of modified tokens back to the original token.

    Take a look at OpenPipeline to see how we do it. I’m planning to make a mod to the Token class to store a linked list of previous tokens (instead of the current getOriginalToken() method), but the rest of the architecture might be useful to you.

  2. lingpipe Says:

    Thanks for the detailed comments. What I’m thinking we all need is something like writers workshops for programmers where we can “workshop” API and other ideas.

    I have to admit I have heard of but haven’t followed OpenPipeline, which is billed on its home page as “open source software for crawling, parsing, analyzing and routing documents.” To get the javadoc, you need to download the package — it comes with precompiled javadoc in the obvious location. (And there’s even a LingPipe wrapper — cool.)

    The javadoc’s thin on doc comments, but it’s definitely worth looking at for tokenization. What I really like is that their Token class implements CharSequence. I don’t like the linked list aspect of it; it seems like it’s requiring tokens to know too much about their context and history, but this is partly what Chris is arguing for in points one and three. It looks like it’d be hairy for clients to implement, but then I don’t see any public constructors that do anything. One reason I stuck to strings is their simplicity and modularity/independence.

    Re First) Running filters independently from an underlying tokenizer seems reasonable. We haven’t felt the need to do it, but I see how it could come up. Of course, nothing stops you from just passing lists around. What we’ve found is that we often create objects that represent entire tokenizations.

    Re Second) I toyed with the idea of doing everything offset based five years or so ago when I laid out the tokenization interfaces. Keep in mind that Java’s string implementation is a pointer to an underlying array of characters. You get reuse if you take a big string and do substring(int,int) on it — it’s much easier because it’s immutable.

In retrospect, I wish I’d gone the other way and had TokenizerFactory declare its interface method as Tokenizer tokenizer(CharSequence) instead of tokenizer(char[],int,int).

    It’s the classic struggle of familiarity and ease of use versus efficiency.

OpenPipeline’s Tokens actually look much less efficient than just using strings, not that tokenization is ever a serious bottleneck (though it is the bottleneck in running, if not training, logistic regression and naive Bayes classifiers). There’s still a Token object being allocated, which stores four ints, two token pointers (linked list and history) and a fast string buffer pointer. Presumably the buffer’s shared, at least until things start getting modified with stemming or other normalizations. When there are modifications, this looks like it’ll require something just like string copies.

    How do Token objects get created? OK, now I see — there’s a nullary default constructor and a bunch of setters. I find this a very error-prone pattern, and greatly prefer initializing immutables (that’s what String does, for instance).

    Shouldn’t the linking of tokens be double? It seems I’d want to move backward and forward in the list.

    A generic declaration of Comparable<CharSequence> on Token would eliminate compareTo(Object), which just throws a class-cast exception if it gets something other than a CharSequence anyway.

    You might also want to think about caching the hash codes like String does. But it’d be unsafe given that you construct a Token with a char buffer (FastStringBuffer) and store it rather than copying it defensively. That means something on the outside could change the buffer, changing the equality and hash conditions on tokens!

    The definition of equals(Object o) is broken because it violates the basic contract by not being symmetric. That is, you can have a token x and string y such that x.equals(y) but not y.equals(x). String is only ever equals() to other strings (which you could argue is a shortcoming of the retrofitted CharSequence interface in Java).

    The getOriginalToken() method assumes each output token comes from a single base token. We have tokenizer filters that do things like create token n-grams that have overlapping spans and are derived from multiple tokens generated by a base tokenizer (that one’s not in the public API yet, but we just built it for a customer project). For instance, it’d turn “John”, “ran”, “home” into “John_ran”, “ran_home”. (Of course, one might argue this is abusing the notion of tokenization.)

    The other thing I wished we did a better job of keeping track of is original position. We only include the start offset, but it should really include an end offset too. Start plus token length is often not the original position.

    I didn’t understand the SimpleTokenizer class at all — it’s not implementing any kind of tokenizer interface, nor does it seem to produce anything that looks like a tokenization. What am I missing here? What I’d really like to see is an implementation in this framework of what I was talking about: simple reg-ex breaking, lowercasing, stemming, and stoplisting.

    Re First Third) I’m so much in agreement that this is useful that I actually implemented Chunking the way you’re suggesting. It stores an underlying char array, then chunks are offsets, types and scores on top of that.

Re Second Third) That makes sense, and you’re right, chaining the way we’re doing things makes this completely opaque. The best you can do is create tf, filt1(tf), filt2(filt1(tf)), filt3(filt2(filt1(tf))), and so on, running them all and collecting up the results, which is crazily inefficient (quadratic).

What we should all do now is go check out how Mahout’s going to do all of this. I sent them a bunch of comments on this kind of thing before they started. They have another whole set of issues, as they’re trying to scale out using Hadoop, so everything has to be somewhat network friendly.

  3. Chris Cleveland Says:

    Strings, or more generally immutable objects, have some big advantages in reducing errors and making multithreading easier, and I’d love to use them. The trouble is the memory use and performance.

    The general rule these days is that small objects are cheap to allocate, but larger buffers are not. So our method for getting good performance in tokenizers is to allocate one large buffer, the FastStringBuffer, and then just have small tokens point to it. This buffer gets reused from document to document. The tokens get garbage collected, but the buffer never has to be reallocated. When we create new tokens, we just append the text to the end of the buffer and point to it. We don’t have to do much copying.

    I suppose we could use substrings, but then I don’t get to attach other information to the token.

    Yes, allowing Tokens to be constructed without initializing all the internal variables, like the buffer, is error-prone. I’ll puzzle over a way around the problem.

    >>A generic declaration of Comparable on Token
    >>would eliminate compareTo(Object), which just throws a
    >>class-cast exception if it gets something other than
    >>a CharSequence anyway.

    Yup. Fixed.

    >>I didn’t understand the SimpleTokenizer class at all — it’s not
    >>implementing any kind of tokenizer interface, nor does it seem
    >>to produce anything that looks like a tokenization. What am I
    >>missing here? What I’d really like to see is an implementation
    >>in this framework of what I was talking about: simple reg-ex
    >>breaking, lowercasing, stemming, and stoplisting.

    Yes, it was intended as something very primitive. We’ve got all the more sophisticated stuff in Dieselpoint Search. At a minimum I suppose we should publish the Tokenizer interface and a few implementations. SimpleTokenizer really does produce a tokenization, though; it produces a list of tokens that attach to the field in the document and get passed down the pipeline. Subsequent stages just operate on that list.

  4. Lance Norskog Says:

    Does Tokenizer really need to call something?
    Can’t it just accept a structure that it fills in?

    Isn’t it just a producer-consumer relationship? If so, does it need to do any processing after producing a result and then consuming the next one?

    A pipeline would then be an active object: it contains an array of tokenizer objects and calls one after another with “fill this in” objects. This is a Spring-type (Inversion of Control) pattern: it puts the pipeline object in charge. It can do debugging reports, change data at interesting places, etc.

    In this design, a Tokenizer can be stateful or stateless. Stateless ones can be singletons, but stateful ones are created for the pipeline.

    A general opinion: in the filter pattern the filter has two jobs: first to do its real work, and second to launch more execution. A class should do one job, not two.

  5. lingpipe Says:

    @Lance: The LingPipe Tokenizer objects stream tokens and whitespaces. They’re stateful objects. They implement next-token and next-whitespace methods. You can pass them two lists to fill in with the remaining tokens and whitespaces. So filters just call those methods on the contained tokenizers.

    So I’m afraid it’s not really a pipeline in any reified sense. That is, there isn’t a producer-consumer object implemented. That’s what keeps confusing me about all of this — I want to make the filter an object itself. Then you could implement a generic pipeline that itself takes control as you suggest. That’s how the pipes in Java’s util.concurrent work.

    The LingPipe Tokenizer objects can’t be singletons. But many of the tokenizer factories, and filters over tokenizer factories can be singletons. It’s the ones like general stoplisters that can’t be — they need to get configured with the stop list by the client.

    I think what you’re suggesting is what I’d call a Tokenization object, which could hold, say, the underlying char sequence and tokens. That’s essentially what we call a Chunking in LingPipe. A chunking is a char sequence with a set of chunks, where a chunk has a start and end position in range for the char sequence, a type, and a score.

    As Chris pointed out above, if I’d done this, I wouldn’t have needed all that fiddling with whitespaces in our tokenizers.

    UIMA uses a general form of something like LingPipe’s Chunking interface to handle tokenizations and other offset annotations.

    @Chris: I see the tradeoff here.

Java’s String class’s substring method is implemented to share the underlying char array with its source. If you create a string, then create a bunch of substrings of it, the substrings share the original string’s array. It’s all possible because strings are immutable: on creation, a string always makes a new array that it doesn’t share (actually, it’s a little trickier with string buffers and builders, which share things behind the scenes in a package-protected way we can’t get at).

    How would you do something like run a tokenizer over a corpus and calculate token counts? You’ll need something permanent to act as keys in a map containing the counts. I suppose those keys could be allocated as necessary and somehow compared to Token objects before that.

    I’m still not seeing how this setup’s going to deal with things like stemming or case normalization.

  6. Chris Cleveland Says:

    In OpenPipeline you generally don’t modify the underlying buffer once you start generating tokens. So you get the benefits of substrings without the hassles. Of course, it’s possible to modify the buffer, so there’s a potential source of error. You just adopt a convention that you don’t.

    The benefit of having a reusable buffer for holding documents is enormously improved performance. It’s worth the tradeoff.

Handling stemming and case normalization is easy. (Actually, you should *never* do stemming in a doc processing pipeline, but that’s another discussion.) If you want to create a token which is a modification of an earlier token, allocate space for it at the end of the buffer and have your new token object point to it there.

    The thing to understand is that tokenization isn’t special, it’s just a specific case of a more general process of recognizing things in text and annotating them. You should be treating tokens, chunks, entity extraction, field recognition, even classification and sentiment analysis identically, and they should have similar data structures.

    And Lance is right — separate your tokenizers/annotators from the pipeline processing logic. Two different functions, two different sets of classes.

    And not to harp, but really, think about using OpenPipeline. It’ll open your code to a much broader world of connectors, producers, and consumers of content.

  7. lingpipe Says:

    The big performance advantage I see of pipelining everything is scaling, not efficiency. Pipelines are very easy to scale out across machines or across threads, and that’s where we tend to set them up.

    If efficiency’s that much of an issue, how about not generating object representations at all? An old-school tokenizer in C would be very much like what you have (that is, everything based on a mutable array of chars), but would just implement a nextToken() method along with start(), end() and perhaps type() methods. The cool thing is that’s enough to do matching and lookup and to calculate features. And you can go even further at runtime by not even creating the features, a la Vowpal Wabbit’s strategy.

Another way to save space (though not relevant to OpenPipeline’s tokenizer design) is by reusing strings, which is safe because they’re immutable. You can let Java do this with String.intern(), or manage your own string pool. This can be a huge savings with NLP data, which tends to be very Zipfian. For instance, the most frequent 100K tokens cover well over 95% of newswire tokens in English (obviously time dependent).

I’m way out of my depth here, but I’m told serialization frameworks like Apache Thrift or Google’s protocol buffers are the efficient way to pass around objects in pipelines these days, and they’re even language independent. But that’d just be the infrastructure on top of which something like OpenPipeline or UIMA could be implemented. That’s another blog thread, though.

    There’s unfortunately a host of APIs out there to choose from and not nearly enough time: GATE, UIMA, OpenPipeline for NLP, Java RMI, J2EE EJBs, protocol buffers, SOAP, RDF, etc. I’d also love to be able to run easily on top of EC2 using Hadoop and/or protocol buffers.
