Archive for the ‘Java’ Category

Big Bit-Packed Array Abstraction (for Java, C, etc.)

July 28, 2010

One limitation of Java is that arrays can only be up to Integer.MAX_VALUE entries long. That means you can only get 2 billion entries into an array. That’s only 2GB if the arrays contain bytes, and we have way more memory than that these days. That’s not enough.

A second limitation, shared by C, is that each entry is a round number of bits, usually a power of two bytes (8, 16, 32, or 64 bits). Often 32 is too little and 64 too much.

I’ve been spending some time lately working on bioinformatics applications. The size and shape of the data’s a bit unusual for me, and very awkwardly sized for Java or even C-based arrays.

Packing the Human Genome

For instance, if you want to represent the bases in both strands of the human genome, it’s a sequence 6 billion bases long with each base being 1 of 4 values (A, C, G, or T). OK, no big, just pack four values into each byte and you can squeeze under the limit with a 1.5 billion long array of bytes. It’s even pretty easy to write the bit fiddling because everything aligns on byte boundaries.

Indexing the Human Genome

Now let’s suppose we want to index the genome so we can find occurrences of sequences of arbitrary length in time proportional to their length. Easy. Just implement a suffix array over the positions. (A suffix array is a packed form of a suffix tree, which is a trie structure that represents every suffix in a string. Technically, it’s an array consisting of all the positions, sorted based on the suffix defined from the position to the end of the string.)

Because there are 6G positions, the positions themselves won’t fit into an int, even using an unsigned representation. There are only 32 bits in an int, and thus a max of about 4 billion values. But a long is overkill. We need 33 bits, not 64. Using 8 bytes per entry, the suffix array would fill 48GB. We really only need a little over half that amount of space if we use 33 bits per pointer. But now things aren’t so easy. First, the bit fiddling’s more of a pain because 33-bit values can’t be both packed and byte aligned. Second, you can’t create an array that’ll hold 24GB of values, even if it’s a maximally sized array of longs (that’ll get you to 16GB).

The Abstraction

What I’ve defined in the package I’m working on with Mitzi is a BigArray interface. The interface is super simple.

interface BigArray {
    long length();
    long get(long n);
    void set(long n, long val);

That’s the basics. We might want to make the set operations optional so we could have immutable big arrays. As is, there’s no way to define immutable arrays in Java or C (though Java provides immutable wrappers for lists in the collections framework and C probably has something similar in its standard template lib).

I followed the idiom of Java’s Number interface and define extra getters for bytes, shorts and integers for convenience, but don’t show them here.

On top of a big array, we could implement a variety of the new I/O buffer classes, using something like the random access file repositioning pattern, but it’d be clunky. It’s really a shame the buffers are also sized to integers.


What’s groovy about interfaces is that we can implement them any way we want. The simplest implementation would just be backed by an array of long values. An implementation specifically for two-bit values might be used for the genome, with an efficient byte-aligned implementation.

What I did was create an implementation that allows you to specify the array length and bits per value. I then pack the bits into an array of longs and fiddle them out for getting and setting. This is fine for the genome sequence, because it’ll fit, although it’s not as efficient as the byte-aligned implementation, so I’ll probably add extra special-case implementations.

But it’s still not enough for the suffix array. I can’t fit all the bits into an array of longs. Although I haven’t implemented it yet, the obvious thing to do here is to build up the big array out of an array of arrays. I’d only need two arrays of longs for the human genome suffix array. This is pretty easy to implement hierarchically with an array of smaller big array implementations. First figure out which subarray, then do the same thing as before.

Yes, I Know about Burrows-Wheeler Indexing

If you’re just interested in the suffix array packing problem, the Burrows-Wheeler compression algorithm may be combined with indexing to get the entire genome suffix array into 4GB of memory in a way that’s still usable at run time. This was first explored by Ferragini and Manzini in their 2000 paper, Opportunistic data structures with applications. Very very cool stuff. And it works even better in the genomics setting than for text documents because of the small alphabet size (just four bases for DNA/RNA or 20 amino acids for proteins). Several projects have taken this approach (e.g. BWA and Bowtie).

The Latin1 Transcoding Trick (for Java Servlets, etc.)

July 14, 2010

This post uses Java servlets as an example, but applies more broadly to a situation where a program automatically converts streams of bytes to streams of Unicode characters.

The Trick

Suppose you have an API that insists on converting an as-yet-unseen stream of bytes to characters for you (e.g. servlets), but lets you set the character encoding if you want.

Because Latin1 (officially, ISO-8859-1) maps bytes one-to-one to Unicode code points, setting Latin1 as the character encoding means that you get a single Java char for each byte. These char values may be translated back into bytes with a simple cast:

char[] cs = Converter.getChars();
byte[] bs = new byte[cs.length];
for (int i = 0; i < cs.length; ++i)
    bs[i] = (byte) cs[i];

Now you have the original bytes back again in the array bs. In Java, char values act as unsigned 16-bit values, whereas byte values are signed 8-bit values. The casting preserves values through the magic of integer arithmetic overflow and twos-complement notation. (I attach a program that’ll verify this works at the end.)

You can now use your own character encoding detection or pull out a general solution like the International Components for Unicode (which I highly recommend — it tracks the Unicode standard very closely, performing character encoding detection, fully general and configurable Unicode normalization, and even transliteration).

Use in Servlets for Forms

I learned this trick from Jason Hunter’s excellent book, Java Servlet Programming, (2nd Edition, O’Reilly). Hunter uses the trick for decoding form data. The problem is that there’s no way in HTML to declare what character encoding is used on a form. What Hunter does is add a hidden field for the value of the char encoding followed by the Latin1 transcoding trick to recover the actual string.

Here’s an illustrative code snippet, copied more or less directly from Hunter’s book:

public void doGet(HttpServletRequest req, ...) {
    String encoding = req.getParameter("charset");
    String text = req.getParameter("text");
    text = new String(text.getBytes("ISO-8859-1"), encoding);

Of course, this assumes that the getParameter() will use Latin1 to do the decoding so that the getBytes("ISO-8859-1") returns the original bytes. According to Hunter, this is typically what happens because browsers insist on submitting forms using ISO-8859-1, no matter what the user has chosen as an encoding.

We borrowed this solution for our demos (all the servlet code is in the distribution under $LINGPIPE/demos/generic), though we let the user choose the encoding (which is itself problematic because of the way browsers work, but that’s another story).

Testing Transcoding

public class Test {
    public static void main(String[] args) throws Exception {
        byte[] bs = new byte[256];
        for (int i = -128; i < 128; ++i)
            bs[i+128] = (byte) i;
        String s = new String(bs,"ISO-8859-1");
        char[] cs = s.toCharArray();
        byte[] bs2 = s.getBytes("ISO-8859-1");
        for (int i = 0; i < 256; ++i)
            System.out.printf("%d %d %d\n",

which prints out


c:\carp\temp>java Test
-128 128 -128
-127 129 -127
-2 254 -2
-1 255 -1
0 0 0
1 1 1
126 126 126
127 127 127

Unicode Shell for Java: Windows PowerShell ISE

July 6, 2010

I finally figured out how to have my Java programs write Unicode to a Windows shell and show up looking the right way. Here’s the final result (click to enlarge):

DOS vs Powershell for Unicode

And here’s how I did it.

1. Fire Up PowerShell ISE

My new Windows 7 Professional notebook comes with an application called PowerShell ISE (make sure to run the ISE version; the unmarked one is more like DOS and has the same problems noted below). The “ISE” is for “integrated scripting environment”.

It defaults to Consolas font, which looks a lot like Lucida console.

2. Set Character Encoding to UTF-8

This’ll set the DOS character encoding to be UTF-8.

> chcp 65001
Active code page: 65001

3. Set Java System.out to UTF-8

Before writing to System.out, Java’s constant for the standard output, you’ll need to set it up to use UTF-8. It’s a one-liner.

System.setOut(new PrintStream(System.out,true,"UTF-8"));

The true value enables auto-flushing, which is a good idea for standard output.

Demo Program

Here’s a simple test program (the backslash escapes are Java literals for Unicode code points).


public class Test {

    public static void main(String[] args) 
        throws UnsupportedEncodingException {

        PrintStream utf8Out
            = new PrintStream(System.out,false,"UTF-8");

        System.out.println("English: Hello");
        System.out.println("French: D\u00E9j\u00E0 vu");
        System.out.println("Tamil: \u0B92");
        System.out.println("Han: \u4E52");


You can see the output in action in the screen dumps at the top of this post.

Problems with DOS

The DOS shell will let you set the character encoding and change the font to Lucida console. But it oddly duplicates characters at the end of lines if the lines contain non-ASCII code points. You can see this on the example. And it can’t handle the Tamil or Han characters.

The Unbearable Heaviness of Java Strings

June 22, 2010

Instantaneous Update: No sooner had I posted this than I received a really helpful clarifying comment from Ismael Juma with a link to exactly the references I was looking for. It turns out the 64-bit JVMs do compress references in some cases. And people ask me why I waste my time blogging!

Update: 23 June 2010. Super cool. As of Java SE 6 build 14, all you need to do is provide the java command the argument -XX:+UseCompressedOops and all references will consume 32 bits (“OOP” is for “ordinary object pointer”). That does limit you to 4 billion objects on the heap (232), which if they’re just plain objects will consume 32GB. You’d need at least 64GB of RAM to make it worth turning the option off, because 4 billion objects consume 64GB without OOP compression.

Does anyone know how fast the JVM is with compression? I’ll have to run some tests.

Let’s start with the punchline. In Sun/Oracle’s 64-bit JVM, a string consumes 60 bytes plus the size of its UTF-16 encoding, which is at least 2 times the number of Unicode characters. In a 32-bit JVM, if anyone’s still using one, there’s a 36-byte fixed cost plus the cost of the UTF-16 encoding.

My number one peeve with Java is that there is no equivalent of C’s struct. I can do without pointer arithmetic, but objects in Java are just too heavy for serious lifting (somehow that metaphor worked out backward). It’s the reason there are parallel arrays all over LingPipe — arrays of structured objects just take up too much space and cause too much garbage collection.


Yes, Java made the questionable choice of using UTF-16 to encode strings internally. Each char is a UTF-16 byte pair. Any unicode character with a code point at or below U+FFFF uses a single UTF-16 byte pair to encode, and anything higher requires two UTF-16 byte pairs. (In the designer’s defense, there weren’t any Unicode characters with code points above U+FFFF when Java was released and 64-bit architectures in laptops seemed a long way off.)

The bummer is that for all of us with piles of ASCII or near-ASCII (e.g. Latin1 or Windows-1252) text, UTF-16 takes nearly twice as much space as UTF-8.

Java Object Memory Layout

The layout of objects in memory is not part of the Java language specification, so implementations may do this differently. We’ll only consider Sun/Oracle’s 32-bit and 64-bit JVMs.

After a lot of guesswork and empirical profiling, I finally found some references that look solid on the topic of how Java objects are laid out in memory:

What’s worrisome is that they look awfully similar and there are no external references. How did they figure this out? Looking at the byte code? By looking at the implementation of Java itself? By reverse engineering?

Principles of Object Layout

Assuming they’re correct, and the argument looks convincing, the basic idea of sizing a Java object (not including the size of any object it references), is based on a few principles motivated by byte alignment. The idea is that it’s more convenient for the underlying architecutre if items start on even number multiple of their number of bytes. So a double or long would want to start at position 0, 8, 16, …, an int or float on positions 0, 4, 8, 12, …, a short or char on positions 0, 2, 4, 6, … and bytes on any position at all.

References (aka pointers), will be 8 bytes on 64-bit architectures and 4 bytes on 32-bit architectures.

Basic Object Layout

An instance of Object requires two references, which consume 8 bytes (32 bit) or 16 bytes (64 bit).

One reference is to the class of the object, and one is the ID of the object. Objects get IDs because they may be relocated in memory yet need a unique identifier to resolve reference equality (==). The ID is also used for synchronization and for the hash code method implemented in Object.

In order to support byte alignment, the member variables are laid out from largest to smallest, that is doubles and longs, then ints and floats, then shorts and chars, then bytes.

Objects are reference aligned, because they start with references, so object sizes are rounded up to the nearest even multiple of 4 bytes (32 bit) or 8 bytes (64 bit).

Subclass Layout

Subclasses add their own member length to their superclass’s.

Subclasses point to their own class definition with one of their references and lay out their member values below the superclass’s. There are some subtleties here surrounding ordering member variables in the subclass layout to reuse space that would’ve been padding in the superclass.

Array Layout

Arrays take up 12 bytes (32 bit) or 20 bytes (64 bit) plus their member size, plus an additional 4 bytes if the array holds longs, doubles or object references.

Arrays behave like objects in java, and contain the same first two references (to the array class and to the ID). They additionally contain an integer length (a choice that, sadly, limits the size of Java arrays). Then they contain the equivalent of their length in member variables. These start after the length, which means 4 more bytes of padding for double or long or object reference arrays.


The Sun JDK implements strings as holding an array of char values, an integer start and an integer length definining a slice of the array, and an integer hash code.

REF -> String
REF -> [ REF -> Array of char
            REF -> ID
            int -> length
            char -> charAt(0)
            char -> charAt(length-1) ]
int -> start
int -> length
int -> hashCode
(REF-int) padding

That’s 5 REFs, 4 ints, and a number of char equal to the length (in UTF-16 byte pairs) of the string, plus padding. So with 8-byte refs in the 64-bit JVM, that’s 60 bytes and with 4-byte refs in the 32-bit JVM, it’s 36 bytes. Plus the length of the string times 2.


That’s why efficient coding in Java looks like C. It’s just too expensive to allocate objects in tight loops, no matter what you may have heard to the contrary about the compiler being smarter than you and generational garbage collection curing all ills.

Should We Migrate to Java 1.6?

June 1, 2010

The impending release of LingPipe 4 brings up the question of Java version. Specifically:

Should we start releasing LingPipe compiled with J2SE 1.6 rather than 1.5? And what about the models?

Compiling for 1.5 Still Possible

LingPipe would still be compilable with the 1.5 JDK, so users could build their own 1.5-compatible LingPipe jar. But the official release would be compiled with Java 1.6 and that’s what we’d target for support.

There are no immediate plans to add any dependencies on features introduced in Java 1.6.

No More Public Java 1.5 Support

The reason we’re considering migrating is that the 1.5 version of the JDK (officially Java SE 5.0) reached its end of service life (EOSL) in November 2009. In practice, this means no more public bug fixes for 1.5.

Oracle will support 1.5 for those willing to pay for it through “Java for Business“.

Real and Perceived Non-Determinism in java.util.HashSet Iteration (and Print) Order

May 25, 2010

We had a recent inquiry on the mailing list about LingPipe’s complete-link clustering implementation. For the same inputs, it was producing what looked like different results. Well, it actually was producing different results. Different results that were structurally equivalent, but not Java equal.

It turns out that the culprit was our old friends Object.equals(Object), Object.hashCode() and collection implementations inside java.util.

Perceived Non-Determinism Sample

Here’s the simplest code I could think of that illustrates what appears to be a problem with non-deterministic behavior:

import java.util.HashSet;

public class HashOrderDemo {

    public static void main(String[] args) {
        for (int i = 0; i < 10; ++i) {
            HashSet<Foo> s = new HashSet<Foo>();
            for (int k = 0; k < 10; ++k)
                s.add(new Foo(k));

    static class Foo extends Object {
        final int mK;
        Foo(int k) {
            mK = k;
        public String toString() {
            return Integer.toString(mK);


First note that the static nested class Foo extends java.lang.Object and hence inherits Object‘s implementation of equals(Object) and hashCode() [links are to the JDK 1.5 Javadoc]. As a result, two Foo objects holding the same integer are not equal unless they are the same object.

The main() runs a loop adding 10 instances of Foo to a java.util.HashSet then printing them out. The result is different orderings:


c:\carp\temp>java HashOrderDemo
[5, 4, 9, 6, 8, 7, 0, 2, 1, 3]
[9, 7, 6, 5, 3, 0, 8, 2, 1, 4]
[4, 6, 1, 0, 9, 2, 5, 8, 7, 3]
[0, 2, 1, 7, 8, 3, 6, 5, 9, 4]
[0, 2, 1, 4, 7, 8, 6, 9, 5, 3]
[2, 1, 3, 9, 4, 6, 7, 0, 8, 5]
[8, 7, 4, 0, 2, 9, 5, 6, 1, 3]
[6, 1, 9, 7, 2, 4, 3, 0, 8, 5]
[5, 2, 0, 9, 6, 4, 7, 3, 1, 8]
[9, 2, 0, 3, 8, 7, 5, 4, 6, 1]

Of course, given our definition of Foo, the sets aren’t actually equal, because the members aren’t actually equal. The members just appear to be equal because they’re structurally equivalent in terms of their content.

If you want instances of Java classes things to behave as if their equal, you have to override the default implementations equals(Object) and hashCode() accordingly.

But what if the Objects are Equal?

So what if we look at a case where the objects and hence sets really are equal? Alas, we can still get what looks to be non-equal outputs.

Consider the following program:

import java.util.HashSet;
public class HashOrderDemo2 {
    public static void main(String[] args) {
        for (int i = 1; i < 200; i *= 3) {
            HashSet<String> s = new HashSet<String>(i);
            for (int k = 0; k < 10; ++k)

which if we compile and run, gives us


c:\carp\temp>java HashOrderDemo2
[3, 5, 7, 2, 0, 9, 4, 8, 1, 6]
[3, 5, 7, 2, 0, 9, 4, 8, 1, 6]
[3, 5, 7, 2, 0, 9, 4, 8, 6, 1]
[3, 7, 2, 0, 6, 1, 5, 9, 4, 8]
[0, 9, 3, 6, 1, 5, 4, 7, 2, 8]

Here the print order is different because the iteration order is different in different hash sets. With different sizes, the underlying array of buckets is a different size, hence after modular arithmetic, elements wind up ordered differently.

But the sets produced here are actually equal in the Java sense that set1.equals(set2) would return true. They just don’t look equal because the print and iteration order is different.

Bug or Feature?

Let’s just call it a quirk. The doc for hash sets and other collections is pretty explicit. That doesn’t mean it’s not confusing in many applications.

Upgrading Java Classes with Backward-Compatible Serialization

May 4, 2010

When using Java’s default serialization, any changes to the member variables or method signatures of the class breaks backward compatibility in the sense that anything serialized from the old class won’t deserialize in the new class. So do similar changes to superclasses or changing the superclass type.

After a bit of overview, I’ll show you the trick I’ve been using to maintain backward compatibility when upgrading classes with new meaningful parameters.

Serial Version ID

The problem stems from the fact that Java uses reflection to compute a serial version ID for each class. (You can calculate it yourself using the serialver command that ships with Sun/Oracle’s JDK.) When methods or member variables change, the serial version changes. To deserialize, the serialized class’s ID must match the current class’s ID.

As long as the (non-transient) member variables don’t change, this problem can be tackled by explicitly declaring the serial version ID. (See the Serializable interface javadoc for how to declare). With an explicit serial version ID declared, the serialization mechanism uses the declared value rather than computing it from the class signature via reflection.

If there are serializable objects out there you need to deserialize, use serialver to calculate what the version ID used to be, then declare it in the class. If you’re starting from scratch, the ID can be anything.

So always use an explicit serial version when you first create a class. It can’t hurt, and it’s likely to help with backward compatibility.

Taking Control with the Externalizer Interface

But what if (non-transient) member variables (non-static class variables) change? Version IDs won’t help, because the default read and write implemented through reflection will try to deserialize the new variables and find they’re not there and throw an exception.

You need to take control using the Externalizable interface, which extends Serializable. Now you control what gets read and written.

Serialization and deserialization will still work, as long as the new member variable is either not final or is defined in the nullary (no-arg) constructor.

The Simplest Case of The Problem

Suppose we have the class

public class Foo implements Externalizable {
    static final long serialVersionUID = 42L;
    public String mArg1;
    public Foo() { }
    public void setArg1(String arg1) { mArg1 = arg1; }
    public void readExternal(ObjectInput in) 
        throws IOException, ClassNotFoundException {
        mArg1 = (String) in.readObject();
    public void writeExternal(ObjectOutput out) 
        throws IOException {

So far, so good. If I add the following main(),

    public static void main(String[] args) 
        throws IOException, ClassNotFoundException {
        Foo foo = new Foo();
        foo.mArg1 = "abc";
        System.out.println("1: " + foo.mArg1);

        FileOutputStream out = new FileOutputStream("temp.Foo");
        ObjectOutput objOut = new ObjectOutputStream(out);

        FileInputStream in = new FileInputStream("temp.Foo");
        ObjectInput objIn = new ObjectInputStream(in);
        Foo foo2 = (Foo) objIn.readObject();
        System.out.println("2: " + foo2.mArg1);

and compile and run, I get

c:\carp>java Foo
1: abc
2: abc
c:\carp>ls -l temp.Foo
----------+ 1 Bob Carpenter None 31 2010-05-04 15:14 temp.Foo

So far, so good.

But now suppose I add a second variable to Foo,

public class Foo ...
    public String mArg2;
    public void setArg2(String arg2) { mArg2 = arg2; }

And try to run just the deserialization with the existing file,

    public static void main(String[] args) 
        throws IOException, ClassNotFoundException {

        FileInputStream in = new FileInputStream("temp.Foo");
        ObjectInput objIn = new ObjectInputStream(in);
        Foo foo2 = (Foo) objIn.readObject();
        System.out.println("2: " + foo2.mArg1);

Because I haven’t changed readExternal(), we still get the same answer.

But what if I want to serialize the second argument? And give it a default value of null for classes serialized before it was added? The obvious change to read and write don’t work,

    public void readExternal(ObjectInput in) 
        throws IOException, ClassNotFoundException {
	mArg1 = (String) in.readObject();
	mArg2 = (String) in.readObject();
    public void writeExternal(ObjectOutput out) 
        throws IOException {

It’ll throw an exception, as in

c:\carp>java Foo
Exception in thread "main"
        at Foo.readExternal(
        at Foo.main(

Now what?

The Trick: Marker Objects

Here’s what I’ve been doing (well, actually, I’m doing this through a serialization proxy, but the idea’s the same),

    public void readExternal(ObjectInput in) 
        throws IOException, ClassNotFoundException {
        Object obj = in.readObject();
        if (Boolean.TRUE.equals(obj)) {
            mArg1 = (String) in.readObject();
            mArg2 = (String) in.readObject();
        } else {
            mArg1 = (String) obj;
    public void writeExternal(ObjectOutput out) 
        throws IOException {

The new implementation always writes an instance of Boolean.TRUE (any small object that’s easily identifiable would work), followed by values for the first and second argumet. (In practice, these might be null in this class, so we’d have to be a bit more careful.) The trick here is the external read method reads an object, then checks if it’s Boolean.TRUE or not. If it’s not, we have an instance serialized by the previous version, so we cast that object to a string, assign it to the first argument and return. If it is, we read two args and assign them.

Now the result of deserializing our previous instance works again:

c:\carp>java Foo
2: abc

And there you have it, backward compatibility with new non-transient member variables.

The Fine Print

The trick only works if there’s an object that’s being serialized (or a primitive with a known restricted range of possible values). It doesn’t have to be the first variable, but there has to be something you can test to tell which version of the class you have.

Function Input/Output Polarity and Java Wildcard Generics: extends vs. super

March 2, 2010

One of the things people seem to find confusing is the distinction in wildcard or captured generics between extends and super.

Wild Card extends

One issue with Java generics is that even though String extends CharSequence, List<String> does not extend List<CharSequence>.

The most specific variable to which we can assign either list type is List<? extends CharSequence>, which is extended by both List<String> and List<CharSequence>. We can assign an object of type List<E> to a variable with static type List<? extends CharSequence> as long as E extends CharSequence.

Wild Cards super

The super keyword moves in the inheritance hierarchy in the opposite direction from extends. A variable of type List<? super String> may be assgined an object List<E> if E is a supertype of String (equivalent, String extends E). Thus a variable of type List<? super String> may be assigned an object of type List<String> or of type List<CharSequence>.

What about List<StringBuilder> (hint: StringBuilder implements CharSequence)? A variable of type List<StringBuilder> may be assigned to List<? extends CharSequence>, but not List<? super String>, because StringBuilder extends CharSequence (by implementing it), but StringBuilder is not a supertype of String.

Function Inputs and Outputs

Suppose we have a simple interface, Foo<A,B>, with two generic arguments, A for input, and B for output:

interface Foo<A,B> {
    B apply(A x);

Functions present a further problem in that their inputs and outputs generalize in different directions. In our example Foo<A,B>, A is a function input. A more general function takes a more general input than A. That means a more general type is Foo<? super A,B>. It can accept any input to apply() that Foo<A,B> could and more.

But what about outputs? In order for an output type to be used wherever a B could be used, it must extend B so that it implements everything required for a B. Therefore, a more general result type is Foo<A,? extends B>. An object assigned to this type is known to provide outputs that at least implement everything a B would.

Function Argument Polarity

This is related to the polarity of a type variable in a functional type. Basic types are positive, and arguments to functions reverse polarity. For example, if A and B are primitive types, for a function from A to B (of type A->B), A is negative polarity and B positive polarity.

Multiple arguments are all negative, so A->(B->C) has C as positive and A and B as negative.

Compound arguments further reverse polarity, so in (A->B)->C, C is positive, and (A->B) is negative, so B is negative, and A is positive again.

In general, you want to extend positive polarity arguments and super negative polarity arguments in order to be able to use the compound object wherever the original object could be used.

What if an Generic Parameter is Input and Output?

That’s actually the case with List<E>, because the get() method returns an E result, whereas the add() method consumes an E argument. Thus there’s really no way to generalize the type of a collection. If you generalize List<E> by subclassing E, you restrict the set of inputs; if you generalize it by superclassing E, you wind up with more general outputs that aren’t guaranteed to implement everything E does.

Data Structure for Parse Trees?

February 23, 2010

Writing parsers used to be my bread and butter, but I’m a bit stumped on how to create a reasonable object-oriented data structure for parse trees. Here’s a simple parse tree in the Penn Treebank‘s notation (borrowed from Wikipedia: Treebank):

(S (NP (NNP John))
   (VP (VPZ loves)
       (NP (NNP Mary)))
   (. .))

Conceptually trees come in two varieties. Lexical trees are made up of a root category and a word. For example, the lexical tree (NNP John) has a root category NNP and a word John. Phrasal trees are made up of a root category and a sequence of subtrees. For example, the phrasal tree (VP (VPZ loves) (NP (NNP Mary))) consists has root category VP and two subtrees, the lexical subtree (VPZ loves) and the phrasal subtree (NP (NNP Mary)). The latter tree, rooted at NP, only has a single subtree, (NNP Mary). On the other hand, the top level tree, rooted at S, has three subtrees, rooted at NP, VP, and . (period).

Java Data Structures?

So my question is, how do I represent parse trees in Java? For simplicity assume categories are strings (rather than, say generic objects or integer symbol table representations) and that words are also strings.

A First Attempt

An obvious first step would be to define an interface for trees,

interface Tree {
    String root();

and then extend it for lexical trees,

interface LexicalTree extends Tree {
    String word();

and phrasal trees,

interface PhrasalTree extends Tree {
    List<Tree> subtrees();

This pretty much matches our informal description of trees directly.

The Problem

So what’s the problem?

Waiter, Taste My Soup

Coming to America Movie Poster

Eddie Murphy‘s memorable schtick from Coming to America provides a hint:

Eddie Murphy – as the old white man in the barber shop says, “Wait a minute. Wait a minute. Wait. Stop right there! Listen. Stop right there, man.

A man goes into a restaurant. You listenin’? A man goes into a restaurant, and he sits down, he’s having a bowl of soup and he says to the waiter, “waiter come taste the soup.”

Waiter says, “Is something wrong with the soup?”

He says, “Taste the soup.”

He [waiter] says, “Is there something wrong with the soup? Is the soup too hot?”

He says, “Will you taste the soup?”

[Waiter says ,] “What’s wrong, is the soup too cold?”

[He says, ] “Will you just taste the soup?”

[Waiter says, ] “Alright, I’ll taste the soup – where’s the spoon? Aha. Aha!”

Not to mention, a widely applicable moral that may be applied to software. Maybe “eating your customer’s soup” should be as well known an idiom as “eating your own dogfood.”

The Problem is Unions

So now that you sussed out the moral of Mr. Murphy’s story, and diligently applied it by trying to use the tree interface, hopefully, like Mr. Murphy’s waiter, you see the problem.

To do anything other than get the root of a tree, you need to use instanceof and then apply an unchecked cast. Double Yucky.

The root of the problem (pun intended) is that the tree data structure is essentially a union data structure (aka a “variant record”). You see these in straight-up C all the time.

An Alternative

What I actually have in place is a C-like approach that doesn’t require any unchecked casts. But it’s giving off a really bad smell. Darn near boufin.

interface Tree {
    String rootCategory();
    boolean isLexical();
    String word();
    List<Tree> subtrees();

The idea is to use the isLexical() test and if it’s true, use word() to get the word and if it’s false, use subtrees() to get the list of subtrees. In C, there’s often a key indicating which type a union data structure contains.

I can’t decide which is better, returning null from word() if it’s not a lexical tree, or throwing an unsupported operation exception (it’s not quite an illegal state, as my tree implementations would almost certainly be immutable).

Any Better Ideas?

I’m very open to suggestions for a better way to represent trees.

Java Inheritance vs. Equality: Chunking and Span Case Study

February 16, 2010

There’s a tension between allowing inheritance and definining equality properly in Java. I’ve been working on refactoring many of the LingPipe interfaces and abstract base classes, and this problem comes up again and again.

Spans and Chunks

In dealing with text annotations, the simplest meaningful data structure is the span, which picks out a start and end (or start and length) in a given character sequence. The term “chunk” typically applies to spans with additional type information, such as person/organization/location for named entity mention chunking or noun phrase/verb phrase for phrase chunking.


Using inheritance, we might define a pair interfaces, such as:

interface Span {
    int start();
    int end();
interface Chunk extends Span {
    int start();
    int end();
    String type();

Equality interface requirements can’t be specified in Java itself, so they are typically specified informally in the Javadoc.

Span and Chunk Equality?

It makes sense to consider a span equal to another object if that object is a span that has the same start and end points.

It also makes sense to consider a chunk equal to another object if that object is a chunk with the same start and end points and the same type.

This is all well and good if chunk and span are independent interfaces. But what if we include the italicized declaration making Chunk a subinterface of Span?

If I have a span sp and a chunk ch that have the same start and end points, we’d wind up having sp.equals(ch) evaluating to true, but ch.equals(sp) evaluating to false. Unfortunately, this breaks Java’s implicit contract on the equals method, which requires symmetry (a.equals(b) if and only if b.equals(a)). I’ve actually written special-purpose JUnit assert methods (in com.aliasi.test.unit.Asserts) to test object equality both ways, and to ensure that if two objects are equal, they have the same hash code.

So even though chunks include all the information in spans, they don’t satisfy a span’s notion of identity.

Encapsulation vs. Inheritance

It’s much cleaner conceptually to instead use encapsulation, defining:

interface Span {
    int start();
    int end();
interface Chunk {
    Span span();
    String type();

Now it’s clear that a chunk has a span, but it isn’t itself a span per se. Now we can say that two chunks are equal if their spans and types are equal. We could even add convenience methods for start() and end() to the Chunk interface.

Adding Scores via Overspecification

The current state of LingPipe interfaces does not have a Span-like interface, and Chunk includes not only start, end and type, but also score information. This is a clumsy interface style because not all chunks necessarily have score information. If we include score equality in the definition of equality, the degenerate scoreless chunks need to all take the same default score.

It wouldn’t help to add scores via inheritance. We actually do that, defining Chunk to extend the Scored interface.

Adding Scores via Encapsulation

With encapsulation, it’d make sense to define a scored chunk relative to a chunk the same way we defined a chunk relative to a span:

interface ScoredChunk implements Scored {
    Chunk chunk();
    double score();

Or we could have a scored object interface:

interface ScoredObject<E> {
    E getObject();
    double score();

and use ScoredObject<Chunk> for scored chunks.


The main problem with encapsulation is that it involves more objects, and despite the discounts provided by recent JVMs, object creation and destruction is still crazy expensive compared to arithmetic.

If we implemented a chunk as a holder for a span and a string type, the object representation for the span itself is unnecessary memory overhead.

If we only store the start, end and type for a chunk, we’d need to create the span to return it to a client. Just the kind of object creation that efficient Java programs try to avoid.

Example 2: Java Set Equality

Java’s java.util.Set.equals(Object) method specifies:

Returns true if the specified object is also a set, the two sets have the same size, and every member of the specified set is contained in this set (or equivalently, every member of this set is contained in the specified set).

That’s a sensible definition following the mathematical notion of set.

But what about the superinterface java.util.Collection? It’s not a coincidence that set’s super-interface, java.util.Collection does not define equality. Although it might make sense to define collections as equal if they had the same iterated objects in the same order, this would conflict with set’s definition of equality in the same way that span’s definition conflicted with chunk’s.

Similarly, java.util.SortedSet can’t extend set’s definition of equality to require sort equality because it would mess up the use of Set itself and make it impossible for a class to implement both Set and SortedSet.