## Real and Perceived Non-Determinism in java.util.HashSet Iteration (and Print) Order

We had a recent inquiry on the mailing list about LingPipe’s complete-link clustering implementation. For the same inputs, it was producing what looked like different results. Well, it actually was producing different results. Different results that were structurally equivalent, but not Java equal.

It turns out that the culprit was our old friends Object.equals(Object), Object.hashCode() and collection implementations inside java.util.

### Perceived Non-Determinism Sample

Here’s the simplest code I could think of that illustrates what appears to be a problem with non-deterministic behavior:

import java.util.HashSet;

public class HashOrderDemo {

public static void main(String[] args) {
for (int i = 0; i < 10; ++i) {
HashSet<Foo> s = new HashSet<Foo>();
for (int k = 0; k < 10; ++k)
s.add(new Foo(k));
System.out.println(s);
}
}

static class Foo extends Object {
final int mK;
Foo(int k) {
mK = k;
}
public String toString() {
return Integer.toString(mK);
}
}

}

First note that the static nested class Foo extends java.lang.Object and hence inherits Object‘s implementation of equals(Object) and hashCode() [links are to the JDK 1.5 Javadoc]. As a result, two Foo objects holding the same integer are not equal unless they are the same object.

The main() runs a loop adding 10 instances of Foo to a java.util.HashSet then printing them out. The result is different orderings:

c:\carp\temp>javac HashOrderDemo.java

c:\carp\temp>java HashOrderDemo
[5, 4, 9, 6, 8, 7, 0, 2, 1, 3]
[9, 7, 6, 5, 3, 0, 8, 2, 1, 4]
[4, 6, 1, 0, 9, 2, 5, 8, 7, 3]
[0, 2, 1, 7, 8, 3, 6, 5, 9, 4]
[0, 2, 1, 4, 7, 8, 6, 9, 5, 3]
[2, 1, 3, 9, 4, 6, 7, 0, 8, 5]
[8, 7, 4, 0, 2, 9, 5, 6, 1, 3]
[6, 1, 9, 7, 2, 4, 3, 0, 8, 5]
[5, 2, 0, 9, 6, 4, 7, 3, 1, 8]
[9, 2, 0, 3, 8, 7, 5, 4, 6, 1]

Of course, given our definition of Foo, the sets aren’t actually equal, because the members aren’t actually equal. The members just appear to be equal because they’re structurally equivalent in terms of their content.

If you want instances of Java classes things to behave as if their equal, you have to override the default implementations equals(Object) and hashCode() accordingly.

### But what if the Objects are Equal?

So what if we look at a case where the objects and hence sets really are equal? Alas, we can still get what looks to be non-equal outputs.

Consider the following program:

import java.util.HashSet;
public class HashOrderDemo2 {
public static void main(String[] args) {
for (int i = 1; i < 200; i *= 3) {
HashSet<String> s = new HashSet<String>(i);
for (int k = 0; k < 10; ++k)
s.add(Integer.toString(k));
System.out.println(s);
}
}
}

which if we compile and run, gives us

c:\carp\temp>javac HashOrderDemo2.java

c:\carp\temp>java HashOrderDemo2
[3, 5, 7, 2, 0, 9, 4, 8, 1, 6]
[3, 5, 7, 2, 0, 9, 4, 8, 1, 6]
[3, 5, 7, 2, 0, 9, 4, 8, 6, 1]
[3, 7, 2, 0, 6, 1, 5, 9, 4, 8]
[0, 9, 3, 6, 1, 5, 4, 7, 2, 8]

Here the print order is different because the iteration order is different in different hash sets. With different sizes, the underlying array of buckets is a different size, hence after modular arithmetic, elements wind up ordered differently.

But the sets produced here are actually equal in the Java sense that set1.equals(set2) would return true. They just don’t look equal because the print and iteration order is different.

### Bug or Feature?

Let’s just call it a quirk. The doc for hash sets and other collections is pretty explicit. That doesn’t mean it’s not confusing in many applications.

### 7 Responses to “Real and Perceived Non-Determinism in java.util.HashSet Iteration (and Print) Order”

1. Arun Iyer Says:

Any implementation of set is not expected to return elements in any particular order [1]. Its neither a bug or a feature, just the contract of the interface.

Depending upon implementation specific feature would lead to unnecessary bugs. It’s best to avoid such things.

• lingpipe Says:

That’s what I meant by it not being a bug. It’s well documented what it’s supposed to do. It’s just that it continues to trip people up.

2. Arun Iyer Says:

Forgot to add,
OTOH, you have LinkedHashSet [1], which is more in the line of what you might be looking for.

• lingpipe Says:

That’ll use a linked list to return the results in the order they were added. The problem here for clustering as I point out in the next post is that there’s an underlying sorted priority queue that needs to pay attention to more than just the scores to ensure determinism.

3. tr8dr Says:

Actually, rather than use HashSet, use TreeSet. This data-structure implicitly sorts your elements, provided that you implement the Comparable interface.

• lingpipe Says:

Right! That’s what I’ve been suggesting. As is, my priority queues are based on tree sets. As is, they’re sorting pairs of clusters pending merging based on their distances. Two clusters can have the same distance, so something needs to be done to impose a sort order on pairs of clusters at the same distance (like recency of creation). I could roll that into the ordering and create a deterministic version of the algorithm.

Given that the algorithm’s not deterministic, I’m just not that worried about doing this for our hierarchical clusterers.

4. Andrew Clegg Says:

I hit the same problem with the Stanford parser libraries one time. Took me about a day to confirm I wasn’t going mad (and rule out things like file system corruption), another day to confirm my own code was fine, and then another day to hazard a guess about where the problem was (which was right).

I’ve used it as a case study in teaching ever since…