Java Inheritance vs. Equality: Chunking and Span Case Study

by

There’s a tension between allowing inheritance and definining equality properly in Java. I’ve been working on refactoring many of the LingPipe interfaces and abstract base classes, and this problem comes up again and again.

Spans and Chunks

In dealing with text annotations, the simplest meaningful data structure is the span, which picks out a start and end (or start and length) in a given character sequence. The term “chunk” typically applies to spans with additional type information, such as person/organization/location for named entity mention chunking or noun phrase/verb phrase for phrase chunking.

Inheritance

Using inheritance, we might define a pair interfaces, such as:

interface Span {
    int start();
    int end();
}
interface Chunk extends Span {
    int start();
    int end();
    String type();
}

Equality interface requirements can’t be specified in Java itself, so they are typically specified informally in the Javadoc.

Span and Chunk Equality?

It makes sense to consider a span equal to another object if that object is a span that has the same start and end points.

It also makes sense to consider a chunk equal to another object if that object is a chunk with the same start and end points and the same type.

This is all well and good if chunk and span are independent interfaces. But what if we include the italicized declaration making Chunk a subinterface of Span?

If I have a span sp and a chunk ch that have the same start and end points, we’d wind up having sp.equals(ch) evaluating to true, but ch.equals(sp) evaluating to false. Unfortunately, this breaks Java’s implicit contract on the equals method, which requires symmetry (a.equals(b) if and only if b.equals(a)). I’ve actually written special-purpose JUnit assert methods (in com.aliasi.test.unit.Asserts) to test object equality both ways, and to ensure that if two objects are equal, they have the same hash code.

So even though chunks include all the information in spans, they don’t satisfy a span’s notion of identity.

Encapsulation vs. Inheritance

It’s much cleaner conceptually to instead use encapsulation, defining:

interface Span {
    int start();
    int end();
}
interface Chunk {
    Span span();
    String type();
}

Now it’s clear that a chunk has a span, but it isn’t itself a span per se. Now we can say that two chunks are equal if their spans and types are equal. We could even add convenience methods for start() and end() to the Chunk interface.

Adding Scores via Overspecification

The current state of LingPipe interfaces does not have a Span-like interface, and Chunk includes not only start, end and type, but also score information. This is a clumsy interface style because not all chunks necessarily have score information. If we include score equality in the definition of equality, the degenerate scoreless chunks need to all take the same default score.

It wouldn’t help to add scores via inheritance. We actually do that, defining Chunk to extend the Scored interface.

Adding Scores via Encapsulation

With encapsulation, it’d make sense to define a scored chunk relative to a chunk the same way we defined a chunk relative to a span:

interface ScoredChunk implements Scored {
    Chunk chunk();
    double score();
}

Or we could have a scored object interface:

interface ScoredObject<E> {
    E getObject();
    double score();
}

and use ScoredObject<Chunk> for scored chunks.

Implementations

The main problem with encapsulation is that it involves more objects, and despite the discounts provided by recent JVMs, object creation and destruction is still crazy expensive compared to arithmetic.

If we implemented a chunk as a holder for a span and a string type, the object representation for the span itself is unnecessary memory overhead.

If we only store the start, end and type for a chunk, we’d need to create the span to return it to a client. Just the kind of object creation that efficient Java programs try to avoid.

Example 2: Java Set Equality

Java’s java.util.Set.equals(Object) method specifies:

Returns true if the specified object is also a set, the two sets have the same size, and every member of the specified set is contained in this set (or equivalently, every member of this set is contained in the specified set).

That’s a sensible definition following the mathematical notion of set.

But what about the superinterface java.util.Collection? It’s not a coincidence that set’s super-interface, java.util.Collection does not define equality. Although it might make sense to define collections as equal if they had the same iterated objects in the same order, this would conflict with set’s definition of equality in the same way that span’s definition conflicted with chunk’s.

Similarly, java.util.SortedSet can’t extend set’s definition of equality to require sort equality because it would mess up the use of Set itself and make it impossible for a class to implement both Set and SortedSet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s