Batch vs. Online Learning: Handler, Parser, and Corpus APIs


Online learning assumes that training cases are presented one at a time in a stream and the model may be output at any point. Batch learning assumes the training cases are all available at once.

Note that this distinction is independent of the generative/discriminative distinction. Simple frequency-based discriminative models could easily be online whereas latent mixture generative models may require batch processing for estimation.

Most of LingPipe’s online trainable models are also online executable (e.g. language models, LM-based classifiers and taggers, etc.). This means you can interleave training and executing without compiling in between. Some models may require compilation after being given online training (e.g. the spelling trainer), but that’s only a matter of efficient execution — the decoder is written against the compiled LM implementation, not the generic LM interface.

For online learning, we use the Handler marker interface. Sub-interfaces define actual handlers, such as the ChunkHandler, which defines a single method handle(Chunking).

The idea is that training data is incrementally provided through the handle() methods of the various handlers.
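To make the handler idea concrete, here's a minimal sketch of an online-trainable, online-executable model. The names TokenHandler and UnigramModel are made up for illustration and the interfaces are simplified stand-ins, not LingPipe's actual classes; the point is only that training arrives through handle() and the model can be queried between calls.

```java
import java.util.HashMap;
import java.util.Map;

// Marker interface: declares no methods, it only tags handler types.
interface Handler { }

// Hypothetical sub-interface standing in for handlers such as ChunkHandler.
interface TokenHandler extends Handler {
    void handle(String token);
}

// A frequency-based model that is trained one case at a time
// and can be queried at any point without a compilation step.
class UnigramModel implements TokenHandler {
    private final Map<String, Integer> counts = new HashMap<>();
    private int total = 0;

    @Override
    public void handle(String token) {
        counts.merge(token, 1, Integer::sum);
        total++;
    }

    // Maximum-likelihood estimate; 0.0 before any training data is seen.
    double probability(String token) {
        if (total == 0) return 0.0;
        return counts.getOrDefault(token, 0) / (double) total;
    }
}

class OnlineDemo {
    public static void main(String[] args) {
        UnigramModel model = new UnigramModel();
        model.handle("a");
        model.handle("b");
        System.out.println(model.probability("a")); // estimate after two cases
        model.handle("a");                          // more training, interleaved
        System.out.println(model.probability("a")); // estimate after three cases
    }
}
```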

Although the handle methods of handlers may be called directly, they are typically called by parsers. The Parser<H> (where H extends Handler) interface is generic, requiring the type of its handler to be specified. The setHandler(H) and getHandler() methods manage the handlers. A wide range of parse() methods accept arguments for the input data to be parsed, which may be presented as a file, input source, character array slice or character sequence.
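The parser side of the arrangement might look roughly like the following. This is a sketch under assumptions: the Parser interface here has a single parse(CharSequence) method rather than the full range described above, and WhitespaceParser and TokenHandler are hypothetical names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

interface Handler { }

// Hypothetical handler type for the example.
interface TokenHandler extends Handler {
    void handle(String token);
}

// Simplified stand-in for a generic parser; only one parse method is shown.
interface Parser<H extends Handler> {
    void setHandler(H handler);
    H getHandler();
    void parse(CharSequence input);
}

// Hypothetical parser: splits on whitespace and hands each token to the handler.
class WhitespaceParser implements Parser<TokenHandler> {
    private TokenHandler handler;

    @Override
    public void setHandler(TokenHandler handler) { this.handler = handler; }

    @Override
    public TokenHandler getHandler() { return handler; }

    @Override
    public void parse(CharSequence input) {
        for (String token : input.toString().trim().split("\\s+")) {
            if (!token.isEmpty()) handler.handle(token);
        }
    }
}

class ParserDemo {
    // The handler knows nothing about looping; the parser drives it.
    static List<String> collect(CharSequence input) {
        List<String> seen = new ArrayList<>();
        WhitespaceParser parser = new WhitespaceParser();
        parser.setHandler(seen::add);
        parser.parse(input);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(collect("  the cat  sat "));
    }
}
```

Because the handler type is a type parameter, the compiler rejects an attempt to plug a handler of the wrong kind into a parser, which is the type checking a bare C function pointer doesn't give you.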

This setup is a classic instance of the visitor pattern. The individual models with handler methods don’t need to know anything about looping over data. Without the pattern baggage, this is simply what’s known as a callback. In Java, the executable code is encapsulated as an object, which usually implements some interface for abstraction from implementation. In C, you just have a function pointer used as a callback, and there’s no type checking.

The setup should be familiar from XML parsing using SAX. A SAX parser must implement the XMLReader interface, whereas a SAX (content) handler implements the ContentHandler interface. This is sometimes called an event-based model, with the callbacks from the reader to the handler being called "events".
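For comparison, here's a small SAX example using only the JDK's standard XML classes. It counts start-element events; the class names SaxDemo and ElementCounter are just for illustration. DefaultHandler is the JDK's convenience base class that implements ContentHandler with no-op methods.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class SaxDemo {
    // The handler: receives "events" from the reader as callbacks.
    static class ElementCounter extends DefaultHandler {
        int count = 0;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            count++;
        }
    }

    static int countElements(String xml) {
        try {
            // The reader plays the role of the parser.
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            ElementCounter counter = new ElementCounter();
            reader.setContentHandler(counter);
            reader.parse(new InputSource(new StringReader(xml)));
            return counter.count;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Wrapped so callers of this sketch need no checked exceptions.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints 3: one doc element and two s elements
        System.out.println(countElements("<doc><s>one</s><s>two</s></doc>"));
    }
}
```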

The nice thing about LingPipe’s implementation is that many of the gory details are implemented in two abstract classes, InputSourceParser and StringParser. The input source parser implements the string parsing methods in terms of input source parsing, and the string parser does the reverse. These are instances of the adapter pattern, which is used elsewhere in LingPipe. There is also an adapter for the case where the input’s XML, namely the XMLParser subclass of InputSourceParser.
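The two-way adaptation can be sketched as a pair of abstract classes, each defining one parse method in terms of the other. To keep the example short, it assumes java.io.Reader in place of an input source and drops the handler entirely; ReaderBasedParser, StringBasedParser and the two subclasses are hypothetical names, not the LingPipe classes.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Subclasses implement reader parsing; string parsing comes for free.
abstract class ReaderBasedParser {
    abstract void parse(Reader in) throws IOException;

    void parseString(String s) throws IOException {
        parse(new StringReader(s));
    }
}

// Subclasses implement string parsing; reader parsing comes for free.
abstract class StringBasedParser {
    abstract void parseString(String s);

    void parse(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = in.read(buf)) != -1) sb.append(buf, 0, n);
        parseString(sb.toString());
    }
}

// Hypothetical parser that is most naturally written against a stream.
class CharCounter extends ReaderBasedParser {
    int count = 0;

    @Override
    void parse(Reader in) throws IOException {
        while (in.read() != -1) count++;
    }
}

// Hypothetical parser that is most naturally written against a string.
class UpperCaser extends StringBasedParser {
    String result = "";

    @Override
    void parseString(String s) { result = s.toUpperCase(); }
}

class AdapterDemo {
    static int viaString(String text) {
        try {
            CharCounter p = new CharCounter();
            p.parseString(text);   // adapted method, delegates to parse(Reader)
            return p.count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String viaReader(String text) {
        try {
            UpperCaser p = new UpperCaser();
            p.parse(new StringReader(text));   // adapted method, delegates to parseString
            return p.result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(viaString("hello"));
        System.out.println(viaReader("abc"));
    }
}
```

Whoever writes a new parser picks whichever base class matches the form of input that's easiest to handle and implements just the one method.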

By allowing plug-and-play with parsers, we do not require our data to be in any standard on-disk representation. We need merely write parsers for representations as they arise. We prefer not to munge underlying data on disk if we can help it, as it almost invariably becomes a version management nightmare.

So everything’s simple and standard for online parsers. But how do we handle batch processing? If you’re used to thinking of files, the idea’s that you’d just run your loops multiple times over the files. But our parser/handler setup isn’t quite so simple, as there’s no guarantee the data’s ever in files in a simple way. Instead, we introduce a higher-order abstraction, the Corpus<H> interface, where H extends Handler. This interface specifies a method visitCorpus(H), where the argument is a handler. The idea is that the handler is passed the entire corpus, one training case at a time. There’s also a convenience abstract implementation for on-disk corpora, DiskCorpus, which is constructed with a parser and disk directory.

Our batch learning methods don’t define handlers; instead, they require corpora to be given to them wholesale. For instance, the PerceptronClassifier takes a corpus argument in its constructor. It is then able to call the corpus’s visitCorpus() method as many times as is necessary to achieve convergence or max out the number of allowed visits.
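Here's a rough sketch of how a batch learner can sit on top of a corpus. It is a plain textbook perceptron for binary labels, not LingPipe's PerceptronClassifier, and ExampleHandler, ListCorpus and BatchPerceptron are names assumed for the example. The learner supplies its own handler internally and revisits the corpus until an epoch passes with no mistakes or the visit limit is reached.

```java
import java.util.ArrayList;
import java.util.List;

interface Handler { }

// Hypothetical handler for labeled feature vectors.
interface ExampleHandler extends Handler {
    void handle(double[] features, boolean label);
}

// Simplified stand-in for the corpus abstraction.
interface Corpus<H extends Handler> {
    void visitCorpus(H handler);
}

// In-memory corpus; an on-disk corpus would run a parser over files instead.
class ListCorpus implements Corpus<ExampleHandler> {
    private final List<double[]> features = new ArrayList<>();
    private final List<Boolean> labels = new ArrayList<>();

    void add(double[] x, boolean y) {
        features.add(x);
        labels.add(y);
    }

    @Override
    public void visitCorpus(ExampleHandler handler) {
        for (int i = 0; i < features.size(); i++) {
            handler.handle(features.get(i), labels.get(i));
        }
    }
}

class BatchPerceptron {
    private final double[] weights;
    private double bias = 0.0;
    private int epochsUsed = 0;

    // The corpus is handed over wholesale; training happens here.
    BatchPerceptron(Corpus<ExampleHandler> corpus, int numFeatures, int maxEpochs) {
        weights = new double[numFeatures];
        train(corpus, maxEpochs);
    }

    private void train(Corpus<ExampleHandler> corpus, int maxEpochs) {
        for (int epoch = 0; epoch < maxEpochs; epoch++) {
            int[] mistakes = {0};
            // One full visit of the corpus per epoch.
            corpus.visitCorpus((x, y) -> {
                if (classify(x) != y) {
                    double sign = y ? 1.0 : -1.0;
                    for (int i = 0; i < weights.length; i++) {
                        weights[i] += sign * x[i];
                    }
                    bias += sign;
                    mistakes[0]++;
                }
            });
            epochsUsed++;
            if (mistakes[0] == 0) break;   // converged on the training data
        }
    }

    boolean classify(double[] x) {
        double score = bias;
        for (int i = 0; i < weights.length; i++) score += weights[i] * x[i];
        return score > 0.0;
    }

    int epochsUsed() { return epochsUsed; }
}

class BatchDemo {
    public static void main(String[] args) {
        ListCorpus corpus = new ListCorpus();
        corpus.add(new double[] {0, 0}, false);
        corpus.add(new double[] {0, 1}, false);
        corpus.add(new double[] {1, 0}, false);
        corpus.add(new double[] {1, 1}, true);
        BatchPerceptron model = new BatchPerceptron(corpus, 2, 100);
        System.out.println("visits: " + model.epochsUsed());
        System.out.println("classify(1,1): " + model.classify(new double[] {1, 1}));
    }
}
```

Note that the learner never sees files, parsers, or lists. All it needs is something it can ask to replay the training cases.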

This looks more complicated than it is in practice. If you’re not convinced, you’ll find further grist for your mill by reading Peter Norvig’s take on "patterns" (pitting Lisp against Java).