Sorting Primitive Types in Java

by

One of the big pains of Java is the lack of built-in collection-like utilities for primitive types like int. As a result, you need to box all the integers, which leads to huge bloat in terms of size, and because of memory locality and dereferencing issues, a huge slowdown to boot.

Oddly, more than one person at ICML told me they were impressed with IBM’s tuning on Watson (their champion-level Jeopardy! playing system Breck blogged about recently) because they reimplemented search and sorting!

C’mon people, this isn’t that hard. In fact, as I was telling our intern today, it’s basic CS 101. At least it was back in the 1980s when I was an undergrad. I have no idea what the kids are up to these days.

Primitive int Sorts for LingPipe Suffix Arrays

As I was implementing suffix arrays for LingPipe 4.1, I realized I’d need to implement sorting over primitive int types for both efficiency and scalability. So I designed everything to let me plug in a better sort implementation when I got some time. Well, I got some time today, so here goes.

It’s also not that rare a problem. There are open-source primitive collection libs available from GNU Trove (GPL, of course). And while not geared toward sorting, there’s the Apache Primitives lib which gives you collection-like behavior with primitives (also something LingPipe could benefit from in several places).

Gift Horses and Strings

Ah, the GPL. It giveth and it taketh away.

In general, one shouldn’t look a gift horse in the mouth. But what if there are strings attached? (Sorry, but I just love mixed metaphors, even if they are rather fixed idioms; I laugh out loud at the New Yorker “block that metaphor” quotes from news sources.)

We can hardly drop GPL-ed code into LingPipe (in fact, as of many years ago, at the behest of our customers, we removed all code dependencies from LingPipe, except for Java itself). So that means it’s time to saddle up (hard to turn off that metaphor machine) and write some really low-level code.

I actually find this kind of thing more fun than a slap in the face with a wet fish. (Last one — I just had to drop in a bit of the old litotes).

Programming Pearls is Your Friend

The best book I’ve ever read on designing algorithms for invariants and then profiling and tuning is:

  • Bentley, Jon. 2000. Programming Pearls, 2nd Edition. Addison-Wesley.

It’s so good that Josh Bloch used as the basis for the Java collections, as he outlines in this awesome blog post on a related topic, also related to Bentley’s search algorithms and his own experience as a CS student at CMU where Bentley was teaching:

I just went for Bentley’s qsort4, which is dead easy. It took only half the afternoon to write and unit test the whole thing with random tests, leaving plenty of time for this blog entry (enough rope to hang myself, one might say).

Show, Don’t Tell

How easy is it? Easy peasey isn’t the half of it. Here’s the code for com.aliasi.util.Sort other than the package declaration and imports and doc.

    public static void isort(int[] x, CompareInt compare) {
        for (int i = 0; i < x.length; ++i) {
            int t = x[i];
            int j = i;
            for ( ; j > 0 && compare.lessThan(t,x[j-1]); --j)
                x[j] = x[j-1];
            x[j] = t;
        }
    }
public static void qsort(int[] x, CompareInt compare) {
    Random random = new Random();
    qsortPartial(x,0,x.length-1,compare,random);
    isort(x,compare);
}
   static void qsortPartial(int[] x, int lower, int upper,
                             CompareInt compare,
                             Random random) {
        if (upper - lower < MIN_QSORT_SIZE)
            return;
        swap(x, lower, lower + random.nextInt(upper-lower+1));
        int t = x[lower];
        int i = lower;
        int j = upper + 1;
        while (true) {
            do {
                ++i;
            } while (i <= upper && compare.lessThan(x[i],t));
            do {
                --j;
            } while (compare.lessThan(t,x[j]));
            if (i > j)
                break;
            swap(x,i,j);
        }
    }
public static void swap(int[] xs, int i, int j) {
    int temp = xs[i];
    xs[i] = xs[j];
    xs[j] = temp;
}
public interface CompareInt {
    public boolean lessThan(int a, int b);
}
static int MIN_QSORT_SIZE = 7;

Thanks, Jon!

I don’t think this needs much explanation. There’s a comparator-like interface CompareInt to implement comparisons. Bentley’s quicksort uses Robert Sedgewick‘s speedups for worst case via randomization of the pivot. It also uses insertion sort, also shown, for small cases, as it’s faster with memory locality. This is more or less a symbol-by-symbol translation of Bentley’s code, with a general plug-in comparator.

Look for it wherever high quality software is sold (or given away).

Randomized Unit Tests

I’ve also started moving to more large-scale randomized unit tests. Here’s what I did for testing sorts.

@Test
public void testSort() {
    Random random = new Random();
    for (int i = 0; i < 129; ++i) {
        int[] xs = new int[i];
        int[] ys = new int[i];
        int[] zs = new int[i];
        for (int numTests = 1; numTests < 10; ++numTests) {
            randomFill(xs,random);
            copy(xs,ys);
            copy(xs,zs);
            Arrays.sort(xs);
            Sort.qsort(ys, Sort.NATURAL_INT_COMPARE);
            Sort.isort(zs, Sort.NATURAL_INT_COMPARE);
            assertArrayEquals(xs,ys);
            assertArrayEquals(xs,zs);
        }
    }
}

static void randomFill(int[] xs, Random random) {
    for (int i = 0; i < xs.length; ++i)
        xs[i] = random.nextInt();
}

static void copy(int[] xs, int[] ys) {
    for (int i = 0; i < xs.length; ++i)
        xs[i] = ys[i];
}

The built-in Sort.NATURAL_INT_COMPARE just implements Sort.CompareInt with the natural sort.

 public static final CompareInt NATURAL_INT_COMPARE
        = new CompareInt() {
                public boolean lessThan(int a, int b) {
                    return a < b;
                }
            };

That’s my afternoon in a nutshell (sorry, just one more (idiom) for the road).

12 Responses to “Sorting Primitive Types in Java”

  1. Stuart Moore Says:

    You probably want to look at your < (less than) tags – they're getting detected as html and not appearing, which makes the code a little harder to follow!

    • Bob Carpenter Says:

      Thanks for the heads-up. I pasted in a fresh copy and manually escaped everything. I should never trust WordPress to do this right. For reasons I don’t understand, it escaped the unit test code comparison signs properly and some of the ones in the main code, but then mysteriously dropped pieces, like the second do-while loop (not exactly a loop construct I use regularly).

  2. jessicabunneh Says:

    Interesting – I’ll be picking up Programming Pearls now.

  3. Ted Says:

    Does Arrays.sort(int[] a) give you what you need as well?

    http://download.oracle.com/javase/6/docs/api/java/util/Arrays.html

    • Bob Carpenter Says:

      I’m afraid not. It doesn’t let you plug in your own comparison function, which I need when the int values are indexes into a string that need to be sorted into a suffix array.

      I built the natural sorting comparison object in our framework and used java.util.Arrays.sort(int[]) for testing, as shown in the JUnit test above.

      The closest you can come is to box the integers and use

      Arrays.<T>sort(T[],Comparator<T>)

      with

      T = java.lang.Integer.

      But that’s very costly in terms of both space and time.

  4. drmaciver Says:

    Two comments on your testing.

    1) randomized testing is good, but the way you’re doing it means you’re only testing on a very specific class of arrays – those tending towards maximum disorder. You should probably have versions of the test which run on more structured input

    2) tests for sorting should always include tests with a non natural comparator so as to catch bugs where you’ve accidentally used the natural ordering

    • Bob Carpenter Says:

      Thanks! Those are both excellent points.

      (1) is about performance tuning, and it’s definitely an issue that needs testing with quicksort. We know insertion sort’s going to be bad in some easily quantifiable cases, but I’m not recommending using that at the top level, just as a cleanup path for qsort.

      (2) is what I’m normally after in unit testing, and I’ll definitely follow your advice. I was lazy because I had Arrays.sort() built in as a comparison. This does get tested in the suffix array code, but it should have its own independent test, not be tested through a complex fuction.

      I’ll implement Java’s Comparator and our CompareInt with the same comparison, then box up int as Integer for testing.

      Or I could just use some fixed cases. I always worry about those not covering enough of the simple low-count boundary conditions. The random tests can be relied upon to do that.

      To your list, I’ll add

      (3) Randomized testing is usually not sufficient because there are often boundary conditions that are rare under randomized inputs. That’s related to your (1) on the soundness front.

      • David R. MacIver Says:

        Yes, sorry, your point 3 was actually more what I meant with my point 1 – I wasn’t even really thinking about the performance issues.

      • David R. MacIver Says:

        incidentally, one important boundary condition it occurs to me that you currently hit with very low probability: When there are duplicate values in the array. Because you’re drawing your random integers from a very large set (the whole range), the chance of duplicates is pretty low, and this is definitely a case worth testing.

      • Bob Carpenter Says:

        That’s a very good point about duplicated entries, which won’t happen in short arrays. I’ll run a second test with a low range.

        Also, even with a modest 8 element minimum for qsort, that’s 8! possibilities for an exhaustive test w/o duplicates. This is beginning to feel like combinatorics homework!

        In the usual case for suffix arrays, there are never duplicate entries. But I’m allowing the length of prefix comparisons to be bounded (particularly across concatenated documents), so it will even come up with my intended use case here. Not that a limitied use case would be a good reason not to test a boundary condition for a general algorithm. For instance, WordPress missed the case for this reply box where I delete the final newline. It doesn’t show the text until I’m halfway done with a line.

  5. josh Says:

    I guess the primary difficulty here is that there isn’t to may knowledge a good way to apply such a sorting scheme to other primitive types. Should i want to make a similar sort for doubles, i’m left in a situation where i would need to duplicate a bunch of the above code, violating DRY. i can use the integers as indexes into a double array, but this is unappealing; i’d need to make the index array as a preprocessing step, adding extra cost and reducing locality.

    I suppose these and other reasons are motivating factors for using collections and reference types in java. Has profiling revealed that sorting reference types was a substantial time drain? have you compared your code before and after, and benchmarked a reference type sort vs a primitive sort?

    • Bob Carpenter Says:

      You’re absolutely right about that. There isn’t a good way to deal with primitive types in Java w/o duplicating code. Even principles like don’t-repeat-yourself (DRY) need to be broken from time to time. Sort of like database normalization.

      With C++, you use templates, but then there’s generated code bloat because you get a separate compilation and code generation for each type you plug into a template. Of course, you get the bloat and repetitive, hard-to-maintain code by copying things for primitives in Java.

      Up until this particular application, I’ve just boxed all of the primitives and used the Java collections (or arrays, but when I refactored to generics, I had to convert many of these operations to collections), or sorted things in arrays using natural comparisons, which Java supports.

      For the suffix array application, the array of int values to be sorted is the integers 0 to N-1, where N is the length of an underlying char array. The ordering on the int indices n is determined by treating them as character sequences starting at position n to the end in the underlying char array.

      The speed bottleneck is the comparison operation, because it’s over long suffixes of the char array, many of which have long matches due to the properties of natural language data we’re dealing with. To make that faster, I’d have to write a custom comparison and integrate it with search so that it keeps track when it’s on a subblock of which prefixes already match. I may yet have to do that, especially if we need to start looking for really large matches and get text with different hashing properties than we have now.

      For this refactor, it was mainly a size issue. Boxing the integers between 0 and N-1 is a huge waste of space, no matter how you cut it. If you have a lot of repetition in the values in some cases (not this one, where there’s zero repetition due to the suffix array data structure), you can share the objects (either by building your own pool or leaning on the run time by using the static factory method Integer.valueOf(int) rather than the constructor (the constructor always generates a fresh boxed integer on the heap; the valueOf() method uses an underlying pool of values.

      We needed the app to be able to scale better than using boxed integers would’ve allowed.

      The next space saving will be in the token-based suffix arrays, where instead of char values, there are strings. Given the (power law) distribution of tokens, it’s a pretty big win to pool them rather than allocating each token on the heap.

      There’s been a ton of work on algorithms for consructing suffix arrays.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

Join 817 other followers