The Unbearable Heaviness of Java Strings


Instantaneous Update: No sooner had I posted this than I received a really helpful clarifying comment from Ismael Juma with a link to exactly the references I was looking for. It turns out the 64-bit JVMs do compress references in some cases. And people ask me why I waste my time blogging!

Update: 23 June 2010. Super cool. As of Java SE 6 build 14, all you need to do is pass the java command the argument -XX:+UseCompressedOops and all references will consume 32 bits (“OOP” is for “ordinary object pointer”). That does limit you to 4 billion objects on the heap (2^32), which if they’re just plain objects will consume 32GB. You’d need at least 64GB of RAM to make it worth turning the option off, because 4 billion objects consume 64GB without OOP compression.

Does anyone know how fast the JVM is with compression? I’ll have to run some tests.

Let’s start with the punchline. In Sun/Oracle’s 64-bit JVM, a string consumes 60 bytes plus the size of its UTF-16 encoding, which is at least 2 times the number of Unicode characters. In a 32-bit JVM, if anyone’s still using one, there’s a 36-byte fixed cost plus the cost of the UTF-16 encoding.

My number one peeve with Java is that there is no equivalent of C’s struct. I can do without pointer arithmetic, but objects in Java are just too heavy for serious lifting (somehow that metaphor worked out backward). It’s the reason there are parallel arrays all over LingPipe — arrays of structured objects just take up too much space and cause too much garbage collection.

UTF-16?

Yes, Java made the questionable choice of using UTF-16 to encode strings internally. Each char is a UTF-16 code unit (a byte pair). Any Unicode character with a code point at or below U+FFFF is encoded with a single code unit, and anything higher requires two code units (a surrogate pair). (In the designers’ defense, there weren’t any Unicode characters with code points above U+FFFF when Java was released, and 64-bit architectures in laptops seemed a long way off.)
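The two-code-unit case is easy to see with a character above U+FFFF, such as the musical G clef (U+1D11E). A minimal sketch (the class name is just for illustration):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D11E (musical G clef) lies above U+FFFF, so UTF-16
        // encodes it as a surrogate pair of two char values.
        String clef = "\uD834\uDD1E";
        System.out.println(clef.length());                         // 2 chars
        System.out.println(clef.codePointCount(0, clef.length())); // 1 code point
    }
}
```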

The bummer is that for all of us with piles of ASCII or near-ASCII (e.g. Latin1 or Windows-1252) text, UTF-16 takes nearly twice as much space as UTF-8.
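The near-doubling is easy to check by encoding the same ASCII text both ways (a sketch; the class name is made up, and UTF-16BE is used to avoid counting a byte-order mark):

```java
import java.nio.charset.StandardCharsets;

public class EncodingCost {
    public static void main(String[] args) {
        String ascii = "The quick brown fox";
        // UTF-8 uses one byte per ASCII character; UTF-16 uses two.
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 19
        System.out.println(ascii.getBytes(StandardCharsets.UTF_16BE).length); // 38
    }
}
```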

Java Object Memory Layout

The layout of objects in memory is not part of the Java language specification, so implementations may do this differently. We’ll only consider Sun/Oracle’s 32-bit and 64-bit JVMs.

After a lot of guesswork and empirical profiling, I finally found some references that look solid on the topic of how Java objects are laid out in memory:

What’s worrisome is that they look awfully similar and there are no external references. How did they figure this out? Looking at the byte code? By looking at the implementation of Java itself? By reverse engineering?

Principles of Object Layout

Assuming they’re correct, and the argument looks convincing, the basic idea of sizing a Java object (not including the size of any object it references) is based on a few principles motivated by byte alignment. The idea is that it’s more convenient for the underlying architecture if items start at a multiple of their size in bytes. So a double or long wants to start at position 0, 8, 16, …, an int or float at positions 0, 4, 8, 12, …, a short or char at positions 0, 2, 4, 6, …, and a byte at any position at all.
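The rounding rule can be spelled out as a little helper (a sketch, not JVM code; the names are made up):

```java
public class Align {
    // Round offset up to the next multiple of size.
    static int align(int offset, int size) {
        return ((offset + size - 1) / size) * size;
    }

    public static void main(String[] args) {
        System.out.println(align(12, 8)); // a long at offset 12 must wait until 16
        System.out.println(align(13, 4)); // an int at offset 13 starts at 16
        System.out.println(align(13, 2)); // a char at offset 13 starts at 14
    }
}
```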

References (aka pointers), will be 8 bytes on 64-bit architectures and 4 bytes on 32-bit architectures.

Basic Object Layout

An instance of Object requires two references, which consume 8 bytes (32 bit) or 16 bytes (64 bit).

One reference is to the class of the object, and one is the ID of the object. Objects get IDs because they may be relocated in memory yet need a unique identifier to resolve reference equality (==). The ID is also used for synchronization and for the hash code method implemented in Object.
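That the hash code in Object is just this identity value can be seen directly (a sketch; this holds for a plain Object, since subclasses may override hashCode):

```java
public class IdentityHash {
    public static void main(String[] args) {
        Object o = new Object();
        // For a plain Object, hashCode() returns the identity hash,
        // which stays stable even if the collector relocates the object.
        System.out.println(o.hashCode() == System.identityHashCode(o)); // true
    }
}
```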

In order to support byte alignment, the member variables are laid out from largest to smallest, that is doubles and longs, then ints and floats, then shorts and chars, then bytes.

Objects are reference aligned, because they start with references, so object sizes are rounded up to the nearest multiple of 4 bytes (32 bit) or 8 bytes (64 bit).

Subclass Layout

Subclasses add the size of their own members to their superclass’s.

Subclasses point to their own class definition with one of their references and lay out their member values below the superclass’s. There are some subtleties here surrounding ordering member variables in the subclass layout to reuse space that would’ve been padding in the superclass.

Array Layout

Arrays take up 12 bytes (32 bit) or 20 bytes (64 bit) plus their member size, plus an additional 4 bytes if the array holds longs, doubles or object references.

Arrays behave like objects in Java, and contain the same first two references (to the array class and to the ID). They additionally contain an integer length (a choice that, sadly, limits the size of Java arrays). Then they contain the equivalent of their length in member variables. These start after the length, which means 4 more bytes of padding for double, long, or object reference arrays.
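Plugging in the numbers, here is one reading of that accounting as a sketch (the method names are made up, and the extra padding is applied only before 8-byte elements):

```java
public class ArrayCost {
    // Bytes for an array per the layout described above: two references,
    // an int length, the elements, then rounding to reference alignment.
    static long arrayBytes(int refSize, int elemSize, long length) {
        long header = 2L * refSize + 4;               // class ref + ID ref + int length
        if (elemSize == 8) {
            header = align(header, 8);                // pad before long/double/ref elements
        }
        return align(header + elemSize * length, refSize);
    }

    static long align(long offset, long size) {
        return ((offset + size - 1) / size) * size;
    }

    public static void main(String[] args) {
        System.out.println(arrayBytes(8, 2, 10)); // char[10] on 64-bit: 40
        System.out.println(arrayBytes(4, 2, 10)); // char[10] on 32-bit: 32
    }
}
```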

Strings

The Sun JDK implements strings as holding an array of char values, an integer start and an integer length defining a slice of the array, and an integer hash code.

REF -> String
REF -> ID
REF -> [ REF -> Array of char
            REF -> ID
            int -> length
            char -> charAt(0)
            ...
            char -> charAt(length-1) ]
int -> start
int -> length
int -> hashCode
(REF-int) padding

That’s 5 REFs, 4 ints, and a number of char equal to the length (in UTF-16 byte pairs) of the string, plus padding. So with 8-byte refs in the 64-bit JVM, that’s 60 bytes and with 4-byte refs in the 32-bit JVM, it’s 36 bytes. Plus the length of the string times 2.
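Following that accounting — 5 references, 4 ints, and padding of one reference minus one int — the fixed overhead works out as (a sketch; the method name is made up):

```java
public class StringCost {
    // Fixed per-string overhead per the accounting above:
    // 5 references, 4 ints, plus (refSize - 4) bytes of padding.
    static int stringOverhead(int refSize) {
        return 5 * refSize + 4 * 4 + (refSize - 4);
    }

    public static void main(String[] args) {
        System.out.println(stringOverhead(8)); // 60 bytes on the 64-bit JVM
        System.out.println(stringOverhead(4)); // 36 bytes on the 32-bit JVM
    }
}
```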

Ouch!

That’s why efficient coding in Java looks like C. It’s just too expensive to allocate objects in tight loops, no matter what you may have heard to the contrary about the compiler being smarter than you and generational garbage collection curing all ills.

6 Responses to “The Unbearable Heaviness of Java Strings”

  1. Ismael Juma Says:

    Hi,

    I agree that it would be nice to have struct support in the JVM and that UTF-16 can be very inefficient for some common workloads. Maybe one day:

    http://blogs.sun.com/jrose/entry/tuples_in_the_vm
    http://blogs.sun.com/jrose/entry/fixnums_in_the_vm

    Having said that, your analysis of 64-bits is not complete without mentioning compressed references:

    http://wikis.sun.com/display/HotSpotInternals/CompressedOops

    Best,
    Ismael

  2. Chris Cleveland Says:

    Nice analysis. This is the reason we did a lot of backflips in OpenPipeline to put all document data in reusable character arrays, and have tokens just be pointers into those arrays. This scheme is really fast and relatively memory efficient, but the API isn’t as clean as it would be if all tokens and documents were strings.

We’ve been considering doing a big refactoring to allow for a simpler but less efficient API, while also allowing you to get at the low-level arrays for speed. It isn’t clear how to do that.

    If only String had been an interface… then we’d be able to do it right.

    • lingpipe Says:

      We represent chunkings that way — there’s an underlying character sequence (probably should’ve been a string so it’d be immutable) and then the chunks are objects that point into the underlying array. But even having the chunks be objects is expensive.

      We didn’t want to do that for tokenization because we think of tokenization more like Lucene does, where there might be filters that change the underlying string, such as case normalization and stemming. So we wound up using strings for tokens. As a result, tokenization is one of the most expensive operations in some of our packages. It dominates naive Bayes compute time, and can easily dominate logistic regression classification if the feature set’s simple.

      While I agree that it’d be useful, I can’t even imagine the bugs that would’ve been caused if they’d made String an interface.

  3. Ismael Juma Says:

    Thanks to a spam comment, I noticed the update at the top of the blog that includes a question about performance of compressed references. I wrote a blog entry about the initial version of it a while ago:

    http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/

    I then wrote a brief follow-up about improvements to it:

    http://blog.juma.me.uk/2009/04/03/load-unsigned-and-better-compressed-oops/
