Backward-Compatible Java (De)Serialization

Time to ‘fess up. Sometimes when you look at code you wrote in the past, it looks very ugly. I took a very ill-advised shortcut in serializing tokenized language models in the original implementation.

Current TokenizerFactory Serialization

I simply wrote out the tokenizer factory’s fully qualified class name:

void writeExternal(ObjectOutput out) throws IOException {
    out.writeUTF(mLM.mTokenizerFactory.getClass().getName());
    ...
}

Then I read it back in the obvious way:

Object read(ObjectInput in) 
    throws IOException, ClassNotFoundException {

    String className = in.readUTF();
    TokenizerFactory factory 
        = (TokenizerFactory) Reflection.newInstance(className);
    ...
}

where Reflection.newInstance(String) is our (about to be deprecated) utility to create an object from the name of its class using the no-arg constructor.
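For concreteness, here’s a hypothetical sketch of what such a utility might look like. The class name `Reflection` matches the post, but the body and exception handling are assumptions; the point is that wrapping all the checked reflection exceptions in an unchecked one is exactly the hack being lamented below:

```java
import java.lang.reflect.Constructor;

public class Reflection {

    // Creates an instance of the named class via its no-arg
    // constructor, swallowing the many checked reflection
    // exceptions into a single unchecked one.
    public static Object newInstance(String className) {
        try {
            Class<?> clazz = Class.forName(className);
            Constructor<?> ctor = clazz.getDeclaredConstructor();
            return ctor.newInstance();
        } catch (Exception e) {
            throw new RuntimeException("could not instantiate " + className, e);
        }
    }
}
```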

Pure Unadulterated Laziness

What was I thinking? The sad truth is that I was too lazy to write all those serializers. Always think twice before implementing something just because it’s easy. The whole util.Reflection class was just such a hack; the original reflection methods threw all those checked exceptions for a reason!

The Real Problem

The problem here is that folks ran into run-time issues with their factories (and ours) not having no-arg constructors. For instance, suppose we have a factory that requires a string and integer to construct:

class SomeFactory implements TokenizerFactory {
    SomeFactory(String a, int b) { ... }
}
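The run-time failure is easy to reproduce. In this hypothetical sketch (the class names are made up for illustration), asking reflection for a no-arg constructor that doesn’t exist throws NoSuchMethodException, which code relying on Reflection.newInstance would only discover when a serialized model is read back in:

```java
import java.lang.reflect.Constructor;

public class NoArgCheck {

    // A factory-like class with only a two-arg constructor,
    // standing in for SomeFactory above.
    static class SomeFactory {
        SomeFactory(String a, int b) { }
    }

    public static void main(String[] args) {
        try {
            // Deserialization-by-class-name needs this call to succeed.
            Constructor<?> ctor = SomeFactory.class.getDeclaredConstructor();
            ctor.newInstance();
        } catch (NoSuchMethodException e) {
            // This is the run-time surprise: no no-arg constructor.
            System.out.println("no no-arg constructor for SomeFactory");
        } catch (Exception e) {
            System.out.println("other reflection failure: " + e);
        }
    }
}
```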

A Hacked Solution

In practice, when you’re building a specific model, the constructor args have fixed values. So you can define a quick adapter class:

class MyFactory extends SomeFactory {
    public MyFactory() { super("foo", 42); }
}

Refactoring the Right Way

So now I want to refactor to serialize the factory itself rather than writing out the name of its class. But how to handle backward compatibility? At least in this case, it wasn’t too hard. Since the old format always wrote out a string, I can write out a “magic value” as a flag. Here that’s the empty string, because it’s short and can never be a valid class name.

void writeExternal(ObjectOutput out) throws IOException {
    if (mLM.mTokenizerFactory instanceof Serializable) {
        out.writeUTF("");
        out.writeObject(mLM.mTokenizerFactory);
    } else {
        out.writeUTF(mLM.mTokenizerFactory.getClass().getName());
    }
    ...
}

To read, I just first read the string, and if it’s empty, read the object:

Object read(ObjectInput in) 
    throws IOException, ClassNotFoundException {

    TokenizerFactory factory = null;
    String className = in.readUTF();
    if ("".equals(className)) {
        factory = (TokenizerFactory) in.readObject();
    } else {
        factory 
            = (TokenizerFactory)
                Reflection.newInstance(className);
    }
    ...
}

There you have it: backward compatibility.
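To show the whole scheme end to end, here’s a self-contained sketch. The class names (MagicValueDemo, SerializableFactory) and the write/read helpers are hypothetical stand-ins for the snippets above, not LingPipe’s actual code; the point is that a Serializable factory round-trips as an object behind the empty-string flag, while anything else falls back to the old class-name-plus-reflection path, so old serialized models still read:

```java
import java.io.*;

public class MagicValueDemo {

    // Stand-in for a serializable tokenizer factory.
    public static class SerializableFactory implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        SerializableFactory(String name) { this.name = name; }
    }

    static void write(Object factory, ObjectOutput out) throws IOException {
        if (factory instanceof Serializable) {
            out.writeUTF("");                       // magic value flags new format
            out.writeObject(factory);
        } else {
            out.writeUTF(factory.getClass().getName());
        }
    }

    static Object read(ObjectInput in)
        throws IOException, ClassNotFoundException {

        String className = in.readUTF();
        if ("".equals(className))
            return in.readObject();                 // new format: serialized object
        return newInstance(className);              // old format: class name
    }

    // No-arg-constructor reflection fallback for the old format.
    static Object newInstance(String className) throws IOException {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            throw new IOException("cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            write(new SerializableFactory("foo"), out);
        }
        try (ObjectInputStream in
                 = new ObjectInputStream(
                       new ByteArrayInputStream(bytes.toByteArray()))) {
            SerializableFactory f = (SerializableFactory) read(in);
            System.out.println(f.name);             // prints "foo"
        }
    }
}
```

Note that the reader never needs to know in advance which format it’s looking at; the first UTF string disambiguates.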

3 Responses to “Backward-Compatible Java (De)Serialization”

  1. Lance Norskog Says:

    I have not used or studied Lingpipe, thus my ignorance. Why does any of this processing state need to be Serializable?

    It seems an odd feature to require in this context. Do these processing chains take some big blob of half-processed text and ship them to another program? I can see that this would enhance parallelization, but perhaps packaging for transmission could be a more coarse-grained thing?

  2. Lance Norskog Says:

    Off-topic: you mention Java bugs and correct code in various posts. The PMD project and a few other projects automate checking large lists of dubious coding practices. PMD has an Eclipse plug-in, most have command-line executors.

    These tools have found some bugs for me and explained some code structure problems that had previously bothered me. You might like them.

  3. lingpipe Says:

    My use of “serialization” was confusing. I just meant Java’s serialization interface java.io.Serializable, which is what LingPipe implements to store models. For models like our named-entity chunkers and token-based language models, we store tokenizer factories as part of the model. The factory converts character sequences into token sequences. It’s the factory that’s getting serialized, not any particular tokenization.

    Feature extraction works the same way with logistic regression, k-nearest neighbors, and other models which require feature extractors.

    Thanks for the tip about PMD — I’ve blogged about it before. I love these kinds of tools, though I don’t always agree with their recommendations, like making me put at least a comment in a no-op implementation.
