Spring Cleaning: Generics for LingPipe 3.0

by

We’ve decided to refactor LingPipe for generics for the 3.0 release. After three solid days of work (and untold nights and mornings reading about generics), I’ve finished the refactor of com.aliasi.util; the Iterators class and SmallSet were rather challenging for a rookie like me.

My major complaint about Java when I first saw it was the lack of parametric typing. Many, if not most, of my programming errors are from assigning something to the wrong argument. I’m hopeless in Perl, for instance. As always, be careful what you wish for. Little prepared me for the complexity of Java 1.5 generics.

My hat’s off to the architects and developers of this amazing retro-fit. The addition of generics to a non-generic language is a heroic feat of backward compatibility engineering. What I should’ve realized before I started, but only dawned on me about 10 classes in is that just adding types doesn’t break backward compatibility anywhere. That’s why it’s possible to run 1.4-style programs in a 1.5 or 1.6 compiler or JVM. Very very cool.

Luckily, using generic classes is a whole lot easier than writing them. Even writing them is easy if you’re only worried about type safety and saving some work, and not on making something general and extensible.

My main gripe has to do with arrays. As in pre-generics Java, they’re the poor cousins of types. There’s no way to create arrays of generic types without having a handle on the java.lang.Class object for the type being created. For instance, there are two toArray() methods in java.util.Collection. Suppose we have a Collection<E>. The first array creation method has the signature Object[] toArray(), not E[] toArray() as one might expect, because it’s simply not possible to define the latter. The second array creation method has the signature <T> T[] toArray(T[] a), which is problematic in two ways. First, you need to pass in an array to fill. Second, you can still generate type errors if you do something like this:

import java.util.*;
public class Test {
    public static void main(String[] args) {
	List<String> xs = new ArrayList<String>();
	xs.add("foo");
	Integer[] result = xs.<Integer>toArray(new Integer[10]);
    }
}

If you omit the second line (in bold), the program compiles and runs with no complaint. If you add the second line, the program compiles, but throws a runtime exception during the underlying call to System.arrayCopy().

In LingPipe, I often used arrays exactly because they were type safe compared to returning collections. They also tend to be smaller than collections in memory, as there’s no overhead for wrappers, buckets, tree balancing counts, etc. Now I’ve painted myself into a corner with generics.

I’ve had to replace some return types that would’ve been E[] for a type variable E, with List<E>, because it’s the only way I could ensure type safety without requiring a class object (or array) be passed in. Luckily, none of that stuff’s in deep loops. That’s all arrays of primitive types, which nobody ever expected to play nicely with generics or anything else.

As always, my advice to someone wanting to learn Java types is to study the java.util collections framework. That, and the Kernighan and Ritchie of Java:

Ken Arnold, James Gosling and David Holmes. 2006. The Java Programming Language, Fourth Edition. Addison-Wesley.

I had to read Chapter 11 (Generic Types) many many many many times. Check out Section 16.10.1 (Genericity and Dynamic Arrays) for a deep explanation of what’s going on in the above example.

Here are some links to complement the book:

*Note that Arnold actually got his types wrong and his example should be T[] toArray(Class<T>); makes me think he didn’t read his own book!

Update: April 11, 2007

It occurred to me yesterday that java.util.ArrayList must allocate generic arrays, because it implements List<Egt; based on arrays. Checking out Java’s source shows you how to “cheat” the type system:

public class ArrayList<E> ...

    private transient E[] elementData;

    public ArrayList(int initialCapacity) {
	...
	this.elementData = (E[])new Object[initialCapacity];
    }

It’s easy to forget that types are simply erased in the implementation. This kind of operation is safe if it’s under your control, as it is in java.util.ArrayList and many of the LingPipe internals. It’ll still cause the compiler to gripe. I think I can live with that.

Update April 26, 2007: I’m done. We’re about to release 3.0. But there was one final bug that Breck caught compiling on 1.6. Turns out that it’s a known issue with Java 1.6 throwing an assertion error from the compiler. It’s already in Sun’s DB of bugs, as bug 6500207. It arises with conjunctions of types in generic restriction. The bug report has a simple example; it arose for us in a class that wasn’t going out with 3.0 anyway, an EM/soft clusterer with Dirichlet priors, so hopefully it’ll be fixed soon.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s