Parsers, Handlers and Corpora: Patterns for Data Sets


Now that we’ve thoroughly discussed tokenization, I’d like to move on to the other big design problem facing any machine learning or NLP toolkit: how to represent data sets.

Why not File Formats?

The usual solution, adopted by almost all of the popular machine learning toolkits, is to define a structured file format and read training and evaluation data from such files. It’s a very unix-y way of working on problems.

The problem is that it sweeps under the rug all the problems of raw data parsing and feature extraction. Feature extraction happens outside of the tools themselves, making the process difficult to instrument for debugging in conjunction with the models (e.g., you have to keep track of which input line corresponded to which classification instance to put the system’s answers back together with the data you care about). It’s also rather inefficient to convert to a line- and string-based representation when what you really want is feature vectors. But that’s not our major motivation, or we would’ve gone even further toward raw vector internals.

Programmatic Access or Bust

The unix-y style of working tends to encourage the application of a succession of scripts to on-disk representations. The problem is that the sequence of steps used to create the final representations is often lost during production due to the dynamic nature of scripting languages and the problems that tend to crop up during munging.

You can solve this problem for file-based systems by insisting the file-based representation is generated by a single script. I’d encourage you to do so if you’re working in this world.

The LingPipe Way

We prefer to have everything built into Java. Partly because I’m too lazy to learn other languages beyond a reading knowledge, and partly because it allows us to bring a unified toolkit to bear on the problem. Sure, Java’s a pain to wheel out for small problems (you are scripting compilation and run, I hope), but it’s just right for larger, more complex problems with lots of moving parts, such as when comparing different approaches to feature extraction.

To support code-based development of data sets, we’ve introduced object-oriented representations of data handlers, data parsers, and corpora. The model we followed for handlers and parsers is that of SAX parsing. Corpora were introduced later to represent whole data sets.

Handler and Handlers

There’s a marker interface corpus.Handler, which all data handlers extend. It’s modeled on SAX’s ContentHandler interface.

There are specific handler interfaces that extend it, such as:

TextHandler extends Handler
    void handle(char[] cs, int start, int len);

TagHandler extends Handler
    void handle(String[] tags,
                String[] tokens,
                String[] whitespaces);

ChunkHandler extends Handler
    void handle(Chunking chunking);

MedlineCitationHandler extends Handler
    void handle(MedlineCitation citation);
    void delete(String pubmedId);

ClassificationHandler<E,C extends Classification>
    extends Handler

    void handle(E e, C c);

Basically, they all accept streaming content as specified by their methods. Our online trainable models implement these methods. For instance, trainable HMMs implement the TagHandler interface, trainable chunkers implement the ChunkHandler interface, and trainable classifiers like naive Bayes and k-nearest neighbors implement the ClassificationHandler interface. This allows different models to accept the same form of input. It’s like defining a file format, only in code.
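
Here’s a minimal sketch of the handler pattern. It assumes cut-down stand-ins for Handler and TextHandler rather than LingPipe’s own sources, and CharCounter is a made-up “model” that only counts characters. The point is just that anything implementing the handler interface can be fed by the same stream of events.

```java
import java.util.HashMap;
import java.util.Map;

// Everything is nested in one class only to keep the sketch in a single file.
class HandlerSketch {

    // Stand-ins for the interfaces listed above (not LingPipe's source).
    interface Handler { }

    interface TextHandler extends Handler {
        void handle(char[] cs, int start, int len);
    }

    // Hypothetical "trainable model": counts the characters in each text event.
    static class CharCounter implements TextHandler {
        private final Map<Character,Integer> counts = new HashMap<>();

        @Override
        public void handle(char[] cs, int start, int len) {
            for (int i = start; i < start + len; ++i)
                counts.merge(cs[i], 1, Integer::sum);
        }

        int count(char c) {
            return counts.getOrDefault(c, 0);
        }
    }

    public static void main(String[] args) {
        CharCounter counter = new CharCounter();
        char[] cs = "abracadabra".toCharArray();
        counter.handle(cs, 0, cs.length);  // the whole array
        counter.handle(cs, 0, 4);          // just the slice "abra"
        System.out.println("count of 'a' = " + counter.count('a'));
    }
}
```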


Parsers

Parsers are modeled on SAX’s XMLReader interface. Specifically, the abstract base class is defined as:

Parser<H extends Handler>
    void setHandler(H handler);    
    H getHandler();
    void parse(File);     
    abstract void parse(InputSource);  
    void parse(String sysId);
    void parseString(CharSequence);  
    abstract void parseString(char[],int,int);

The generic parameter specifies the type of handler that receives events from the parser. Just as with an XML reader, you can set and get the handler from a parser. Then, when one of the parse methods is called, events extracted from the specified input (file, URL, InputSource or string) are sent to the handler.
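
To make the set-handler-then-parse flow concrete, here’s a sketch under some assumptions: the Parser class is a cut-down stand-in with only the string methods, and LineParser reads a made-up format with one text item per line. Neither is LingPipe code.

```java
import java.util.ArrayList;
import java.util.List;

// Everything is nested in one class only to keep the sketch in a single file.
class ParserSketch {

    interface Handler { }

    interface TextHandler extends Handler {
        void handle(char[] cs, int start, int len);
    }

    // Hypothetical cut-down parser: holds a handler and pushes events to it.
    abstract static class Parser<H extends Handler> {
        private H handler;

        void setHandler(H handler) { this.handler = handler; }

        H getHandler() { return handler; }

        void parseString(CharSequence in) {
            char[] cs = in.toString().toCharArray();
            parseString(cs, 0, cs.length);
        }

        abstract void parseString(char[] cs, int start, int end);
    }

    // Made-up format: one text item per line; blank lines are skipped.
    static class LineParser extends Parser<TextHandler> {
        @Override
        void parseString(char[] cs, int start, int end) {
            int lineStart = start;
            for (int i = start; i <= end; ++i) {
                if (i == end || cs[i] == '\n') {
                    if (i > lineStart)
                        getHandler().handle(cs, lineStart, i - lineStart);
                    lineStart = i + 1;
                }
            }
        }
    }

    // Parses the input and returns the text of each event the handler received.
    static List<String> collectLines(String input) {
        List<String> lines = new ArrayList<>();
        LineParser parser = new LineParser();
        parser.setHandler((cs, start, len) -> lines.add(new String(cs, start, len)));
        parser.parseString(input);
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(collectLines("first doc\n\nsecond doc\n"));
    }
}
```

Swapping in a different handler (a model instead of a list collector) requires no change to the parser, which is the whole appeal.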

Parsers are data-source specific. So we have parsers for the Brown corpus, Penn Treebank, Reuters classification corpus, CoNLL tagging corpora, MEDLINE, and so on. In fact, you need one for each data format.

Here’s an example of training our most accurate chunker using MUC-format named-entity data in two files:

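CharLmRescoringChunker chunker
    = new CharLmRescoringChunker(tokenizerFactory,numRescored,
                                 nGram,numChars,interpolationRatio);
Muc6ChunkParser parser = new Muc6ChunkParser();
parser.setHandler(chunker);
// file1 and file2 stand in for the two MUC-format files
parser.parse(file1);
parser.parse(file2);
// optionally compile for improved performance
Chunker compiledChunker
    = (Chunker) AbstractExternalizable.compile(chunker);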

One Parser (per app) to Rule them All

An alternative would be to provide a single format for each type of data, then require external data sources to be converted to this format externally. You can still do that with LingPipe, and in fact, many of our users find it easier to munge data than implement parsers.


Corpora

We ran into a problem with parsers and handlers when we started building models such as latent Dirichlet allocation clustering, EM semi-supervised classifier training, and logistic regression model fitting. In these cases, we needed the entire corpus to be provided to the estimator at once in order to do batch processing (this isn’t strictly required for these models, but it’s how they’re typically implemented in practice).

So we introduced another abstract base class representing an entire corpus of data:

abstract class Corpus<H extends Handler>
    public void visitTest(H handler);
    public void visitTrain(H handler);

The pattern here is that a corpus sends all of its data to a single handler. Note that the data is divided into testing and training sections. An alternative would’ve been to define a single method visit(H), then implement two separate corpora for testing and training data.
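
As a rough illustration, here’s a sketch of a corpus that replays an in-memory list. ListCorpus and its hold-out-the-last-few rule are assumptions made up for this example, and the Corpus class is a stand-in, not LingPipe’s.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Everything is nested in one class only to keep the sketch in a single file.
class CorpusSketch {

    interface Handler { }

    interface TextHandler extends Handler {
        void handle(char[] cs, int start, int len);
    }

    abstract static class Corpus<H extends Handler> {
        abstract void visitTrain(H handler);
        abstract void visitTest(H handler);
    }

    // Hypothetical corpus: the last numTest texts are held out for testing.
    static class ListCorpus extends Corpus<TextHandler> {
        private final List<String> texts;
        private final int numTest;

        ListCorpus(List<String> texts, int numTest) {
            this.texts = texts;
            this.numTest = numTest;
        }

        @Override
        void visitTrain(TextHandler handler) {
            visit(handler, 0, texts.size() - numTest);
        }

        @Override
        void visitTest(TextHandler handler) {
            visit(handler, texts.size() - numTest, texts.size());
        }

        private void visit(TextHandler handler, int from, int to) {
            for (String text : texts.subList(from, to)) {
                char[] cs = text.toCharArray();
                handler.handle(cs, 0, cs.length);
            }
        }
    }

    public static void main(String[] args) {
        ListCorpus corpus = new ListCorpus(Arrays.asList("a", "b", "c", "d"), 1);
        List<String> train = new ArrayList<>();
        List<String> test = new ArrayList<>();
        corpus.visitTrain((cs, start, len) -> train.add(new String(cs, start, len)));
        corpus.visitTest((cs, start, len) -> test.add(new String(cs, start, len)));
        System.out.println("train=" + train + " test=" + test);
    }
}
```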

We lean very heavily on our XValidatingClassificationCorpus implementation, which implements Handler to collect a bunch of data, which it then divides into training and test splits using the usual cross-validation logic with a configurable number of folds. It doesn’t even require a parser, because it accumulates data programmatically.

Here’s an example of how it works, supposing we want to evaluate naive Bayes on a bunch of data encoded in multiple files in SVMlight format. You can use:

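XValidatingClassificationCorpus<...> corpus
    = new XValidatingClassificationCorpus<...>(numFolds);
SvmLightClassificationParser parser 
    = new SvmLightClassificationParser();
// the corpus is itself a handler, so it can collect the parsed data
parser.setHandler(corpus);
for (File file : dataFiles)
    parser.parse(file);

for (int fold = 0; fold < corpus.numFolds(); ++fold) {
    corpus.setFold(fold);
    NaiveBayesClassifier classifier = ...;
    // train on this fold's training split, then evaluate on its test split
    corpus.visitTrain(classifier);
    ClassifierEvaluator<CharSequence,JointClassification> eval
        = new ClassifierEvaluator<...>(classifier,categories);
    corpus.visitTest(eval);
    System.out.println("fold=" + fold + "\n" + eval);
}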
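
The fold logic behind a corpus like this can be sketched as follows. This is a toy, not the LingPipe class: XValCorpus and its single-argument ObjectHandler are stand-ins, and assigning item i to fold i mod numFolds is an assumption made for simplicity (the real implementation may split the data differently).

```java
import java.util.ArrayList;
import java.util.List;

// Everything is nested in one class only to keep the sketch in a single file.
class XValSketch {

    interface ObjectHandler<E> {
        void handle(E e);
    }

    // Accumulates items as a handler, then replays them split by fold.
    static class XValCorpus<E> implements ObjectHandler<E> {
        private final List<E> items = new ArrayList<>();
        private final int numFolds;
        private int fold = 0;

        XValCorpus(int numFolds) {
            this.numFolds = numFolds;
        }

        @Override
        public void handle(E e) { items.add(e); }

        int numFolds() { return numFolds; }

        void setFold(int fold) { this.fold = fold; }

        void visitTrain(ObjectHandler<E> handler) { visit(handler, false); }

        void visitTest(ObjectHandler<E> handler) { visit(handler, true); }

        // Assumption: item i is a test item exactly when i mod numFolds == fold.
        private void visit(ObjectHandler<E> handler, boolean wantTest) {
            for (int i = 0; i < items.size(); ++i) {
                boolean isTest = (i % numFolds) == fold;
                if (isTest == wantTest)
                    handler.handle(items.get(i));
            }
        }
    }

    public static void main(String[] args) {
        XValCorpus<Integer> corpus = new XValCorpus<>(3);
        for (int i = 0; i < 6; ++i)
            corpus.handle(i);  // in real use a parser would send these
        for (int fold = 0; fold < corpus.numFolds(); ++fold) {
            corpus.setFold(fold);
            List<Integer> train = new ArrayList<>();
            List<Integer> test = new ArrayList<>();
            corpus.visitTrain(train::add);
            corpus.visitTest(test::add);
            System.out.println("fold=" + fold + " train=" + train + " test=" + test);
        }
    }
}
```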

An alternative would be to require data to be munged into a single file. That way, a parser could handle it. Then something like cross-validation would happen by creating separate training and testing corpora to send to a parser.

Simplifying those Handlers?

I made a conscious decision to follow the SAX reader/handler paradigm. That meant it’s possible for handlers to define all sorts of content-handling methods. SAX’s content handler has eleven. The only time I ever used this flexibility was for MEDLINE; there, events can be either add-citation or delete-citation events coming from update files.

I also used multiple arguments. Tag handlers have a handle method with three arguments, text handlers with three arguments, classification handlers with two objects, etc.

But what if we replaced handler with:

    void handle(E);

That is, we require a single object to be handled. That means we need to convert the objects of the handlers into first-class objects. This is easy with text — just replace (char[],int,int) with CharSequence. Similarly we could replace (E,C) in the classifier handler with a new interface Classified<E,C> representing an object of type E classified as a C.

The big problem is MEDLINE. Mike suggested the workaround of a single command-type object, with two subtypes, delete and citation, but some other kind of compound container could be used.
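
Here’s one way the command-object workaround might look. All of the names (MedlineCommand, AddCitation, DeleteCitation, CitationIndex) are hypothetical, a plain title string stands in for a full citation, and the PubMed IDs are made up.

```java
import java.util.Map;
import java.util.TreeMap;

// Everything is nested in one class only to keep the sketch in a single file.
class MedlineCommandSketch {

    interface ObjectHandler<E> {
        void handle(E e);
    }

    // Hypothetical command type with two subtypes, one per kind of event.
    abstract static class MedlineCommand {
        final String pubmedId;

        MedlineCommand(String pubmedId) { this.pubmedId = pubmedId; }

        abstract void applyTo(Map<String,String> index);
    }

    static class AddCitation extends MedlineCommand {
        final String title;  // stand-in for a full citation object

        AddCitation(String pubmedId, String title) {
            super(pubmedId);
            this.title = title;
        }

        @Override
        void applyTo(Map<String,String> index) { index.put(pubmedId, title); }
    }

    static class DeleteCitation extends MedlineCommand {
        DeleteCitation(String pubmedId) { super(pubmedId); }

        @Override
        void applyTo(Map<String,String> index) { index.remove(pubmedId); }
    }

    // One single-argument handler now covers both add and delete events.
    static class CitationIndex implements ObjectHandler<MedlineCommand> {
        final Map<String,String> index = new TreeMap<>();

        @Override
        public void handle(MedlineCommand command) { command.applyTo(index); }
    }

    public static void main(String[] args) {
        CitationIndex handler = new CitationIndex();
        handler.handle(new AddCitation("101", "first title"));   // made-up IDs
        handler.handle(new AddCitation("102", "second title"));
        handler.handle(new DeleteCitation("101"));
        System.out.println(handler.index);
    }
}
```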

We actually introduced this interface as ObjectHandler<E>, which extends Handler, and extended all the single-argument handlers to implement it.
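
A sketch of what the single-argument style buys you, assuming a simple Classified value class along the lines described above. The field and accessor names and the CategoryCounter toy model are my own inventions for illustration, not the LingPipe API.

```java
import java.util.HashMap;
import java.util.Map;

// Everything is nested in one class only to keep the sketch in a single file.
class ClassifiedSketch {

    interface Handler { }

    interface ObjectHandler<E> extends Handler {
        void handle(E e);
    }

    // Hypothetical value class: an object of type E classified as a C.
    static class Classified<E,C> {
        private final E object;
        private final C classification;

        Classified(E object, C classification) {
            this.object = object;
            this.classification = classification;
        }

        E getObject() { return object; }

        C getClassification() { return classification; }
    }

    // Toy "model": tallies how often each category shows up in training.
    static class CategoryCounter
            implements ObjectHandler<Classified<CharSequence,String>> {

        private final Map<String,Integer> counts = new HashMap<>();

        @Override
        public void handle(Classified<CharSequence,String> classified) {
            counts.merge(classified.getClassification(), 1, Integer::sum);
        }

        int count(String category) {
            return counts.getOrDefault(category, 0);
        }
    }

    public static void main(String[] args) {
        CategoryCounter counter = new CategoryCounter();
        counter.handle(new Classified<CharSequence,String>("buy now", "spam"));
        counter.handle(new Classified<CharSequence,String>("lunch?", "ham"));
        counter.handle(new Classified<CharSequence,String>("free prize", "spam"));
        System.out.println("spam=" + counter.count("spam")
                           + " ham=" + counter.count("ham"));
    }
}
```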

With the data handled being represented as a single object, we could change the parser specification and rename to:

    Handler<E> getHandler();
    void setHandler(Handler<E>);
    void parse(File);

Corpora undergo the same change (and also a change of name):

    void visitTest(ObjectHandler<E>);
    void visitTrain(ObjectHandler<E>);

It’d be pretty easy; just deprecate all the current handlers other than ObjectHandler, create object representations of all handler arguments, then replace parsers and corpora with readers and data sets in all the classes that use them. I could even do it in two steps by deprecating the old interfaces while introducing the new ones.

It doesn’t seem clean enough to be worth the effort, but I’m curious what everyone else thinks. Of course, only the API nuts like me are still following this discussion, so the answers may be biased.

Corpora and Parsers as Iterators

If we move to a pure object handling setup, it’d be easy (API-wise; implementation’s a pain) to extend corpora to implement iterators over data objects (one each for testing and training). Parsers could return iterators given inputs. This is a much more familiar pattern for handling data, and would work for all of our online trainable models.
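
Here’s roughly what the iterator style could look like for a parser. It’s a sketch only: the parse method and the one-item-per-line format are assumptions for the example, not a proposed LingPipe signature. Note how control flips, with the caller pulling items rather than the parser pushing them at a handler.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class IteratorSketch {

    // Hypothetical parser method: returns an iterator over the non-blank
    // lines of the input instead of pushing them to a handler.
    static Iterator<String> parse(final String input) {
        return new Iterator<String>() {
            private int pos = 0;

            @Override
            public boolean hasNext() {
                // skip newlines so blank lines never produce an item
                while (pos < input.length() && input.charAt(pos) == '\n')
                    ++pos;
                return pos < input.length();
            }

            @Override
            public String next() {
                if (!hasNext())
                    throw new NoSuchElementException();
                int end = input.indexOf('\n', pos);
                if (end < 0)
                    end = input.length();
                String item = input.substring(pos, end);
                pos = end;
                return item;
            }
        };
    }

    public static void main(String[] args) {
        Iterator<String> it = parse("first doc\n\nsecond doc\n");
        while (it.hasNext()) {
            String item = it.next();  // the caller decides when to pull
            System.out.println("training on: " + item);
        }
    }
}
```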

I’m about to roll out a truly online logistic regression estimator in the next LingPipe; at that point, running batch-style will require the user to send in the same data multiple times. It’s all just a matter of control.
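
To show what “send in the same data multiple times” means in practice, here’s a toy sketch. It is not the forthcoming LingPipe estimator: OnlineLogistic is a made-up one-feature model that takes a single stochastic gradient step per handled example, and the data and learning rate are invented for the example.

```java
import java.util.Arrays;
import java.util.List;

// Everything is nested in one class only to keep the sketch in a single file.
class EpochSketch {

    interface ObjectHandler<E> {
        void handle(E e);
    }

    // One training example: a single feature x and a 0/1 outcome y.
    static class Example {
        final double x;
        final int y;

        Example(double x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    // Toy online estimator: one gradient step on the log loss per example.
    static class OnlineLogistic implements ObjectHandler<Example> {
        double weight = 0.0;
        double bias = 0.0;
        private final double learningRate;

        OnlineLogistic(double learningRate) {
            this.learningRate = learningRate;
        }

        double prob(double x) {
            return 1.0 / (1.0 + Math.exp(-(weight * x + bias)));
        }

        @Override
        public void handle(Example e) {
            double error = e.y - prob(e.x);
            weight += learningRate * error * e.x;
            bias += learningRate * error;
        }
    }

    // Made-up, linearly separable data.
    static final List<Example> DATA = Arrays.asList(
        new Example(-2.0, 0), new Example(-1.0, 0),
        new Example(1.0, 1), new Example(2.0, 1));

    // "Batch style" with an online model: replay the same data every epoch.
    static OnlineLogistic train(List<Example> data, int numEpochs) {
        OnlineLogistic model = new OnlineLogistic(0.5);
        for (int epoch = 0; epoch < numEpochs; ++epoch)
            for (Example e : data)  // stands in for corpus.visitTrain(model)
                model.handle(e);
        return model;
    }

    public static void main(String[] args) {
        for (int epochs : new int[] { 1, 5, 25 }) {
            OnlineLogistic model = train(DATA, epochs);
            System.out.println("epochs=" + epochs + " weight=" + model.weight);
        }
    }
}
```

The loop over epochs lives in the caller, not the model, which is the matter of control mentioned above.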
