Big Bit-Packed Array Abstraction (for Java, C, etc.)

by

One limitation of Java is that arrays can only be up to Integer.MAX_VALUE entries long. That means you can only get 2 billion entries into an array. That’s only 2GB if the arrays contain bytes, and we have way more memory than that these days. That’s not enough.

A second limitation, shared by C, is that each entry is a round number of bits, usually a power of two bytes (8, 16, 32, or 64 bits). Often 32 is too little and 64 too much.

I’ve been spending some time lately working on bioinformatics applications. The size and shape of the data’s a bit unusual for me, and very awkwardly sized for Java or even C-based arrays.

Packing the Human Genome

For instance, if you want to represent the bases in both strands of the human genome, it’s a sequence 6 billion bases long with each base being 1 of 4 values (A, C, G, or T). OK, no big, just pack four values into each byte and you can squeeze under the limit with a 1.5 billion long array of bytes. It’s even pretty easy to write the bit fiddling because everything aligns on byte boundaries.

Indexing the Human Genome

Now let’s suppose we want to index the genome so we can find occurrences of sequences of arbitrary length in time proportional to their length. Easy. Just implement a suffix array over the positions. (A suffix array is a packed form of a suffix tree, which is a trie structure that represents every suffix in a string. Technically, it’s an array consisting of all the positions, sorted based on the suffix defined from the position to the end of the string.)

Because there are 6G positions, the positions themselves won’t fit into an int, even using an unsigned representation. There are only 32 bits in an int, and thus a max of about 4 billion values. But a long is overkill. We need 33 bits, not 64. Using 8 bytes per entry, the suffix array would fill 48GB. We really only need a little over half that amount of space if we use 33 bits per pointer. But now things aren’t so easy. First, the bit fiddling’s more of a pain because 33-bit values can’t be both packed and byte aligned. Second, you can’t create an array that’ll hold 24GB of values, even if it’s a maximally sized array of longs (that’ll get you to 16GB).

The Abstraction

What I’ve defined in the package I’m working on with Mitzi is a BigArray interface. The interface is super simple.

interface BigArray {
    long length();
    long get(long n);
    void set(long n, long val);
}

That’s the basics. We might want to make the set operations optional so we could have immutable big arrays. As is, there’s no way to define immutable arrays in Java or C (though Java provides immutable wrappers for lists in the collections framework and C probably has something similar in its standard template lib).

I followed the idiom of Java’s Number interface and define extra getters for bytes, shorts and integers for convenience, but don’t show them here.

On top of a big array, we could implement a variety of the new I/O buffer classes, using something like the random access file repositioning pattern, but it’d be clunky. It’s really a shame the buffers are also sized to integers.

Implementations

What’s groovy about interfaces is that we can implement them any way we want. The simplest implementation would just be backed by an array of long values. An implementation specifically for two-bit values might be used for the genome, with an efficient byte-aligned implementation.

What I did was create an implementation that allows you to specify the array length and bits per value. I then pack the bits into an array of longs and fiddle them out for getting and setting. This is fine for the genome sequence, because it’ll fit, although it’s not as efficient as the byte-aligned implementation, so I’ll probably add extra special-case implementations.

But it’s still not enough for the suffix array. I can’t fit all the bits into an array of longs. Although I haven’t implemented it yet, the obvious thing to do here is to build up the big array out of an array of arrays. I’d only need two arrays of longs for the human genome suffix array. This is pretty easy to implement hierarchically with an array of smaller big array implementations. First figure out which subarray, then do the same thing as before.

Yes, I Know about Burrows-Wheeler Indexing

If you’re just interested in the suffix array packing problem, the Burrows-Wheeler compression algorithm may be combined with indexing to get the entire genome suffix array into 4GB of memory in a way that’s still usable at run time. This was first explored by Ferragini and Manzini in their 2000 paper, Opportunistic data structures with applications. Very very cool stuff. And it works even better in the genomics setting than for text documents because of the small alphabet size (just four bases for DNA/RNA or 20 amino acids for proteins). Several projects have taken this approach (e.g. BWA and Bowtie).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

Join 823 other followers