JDK 7 Twice as Fast* as JDK 6 for Arrays and Arithmetic

by

The 7th version of the Java Developer’s Kit (aka JDK 7) delivers quite a speed boost over JDK 6 array accesses. For us, this is huge. It’s like another year and a half of Moore’s law for free. Only in software. And you don’t even have to write multi-threaded code.

I’ve been profiling my new K-Means++ implementation for the next LingPipe release on some randomly generated data. It’s basically a stress test for array gets, array sets, and simple multiply-add arithmetic. Many LingPipe modules are like this at run-time: named entity, part-of-speech tagging, language modeling, LM-based classifiers, and much more.

While I was waiting for a run using JDK 1.6 to finish, I installed the following beta release of JDK 7:

> java -version
java version "1.7.0-ea"
Java(TM) SE Runtime Environment (build 1.7.0-ea-b52)
Java HotSpot(TM) 64-Bit Server VM (build 15.0-b03, mixed mode)

You can get it, too:

I believe much of the reason it’s faster is the work of these fellows:

Java’s always suffered relative to C in straight matrix multiplication because Java does range checks on every array access (set or get). With some clever static and run-time analysis, Würthinger et al. are able to eliminate most of the array bounds checks. They show on matrix benchmarks that this one improvement doubles the speed of the LU matrix factorization benchmark in the U.S. National Institute of Standards (NIST) benchmark suite SciMark 2, which like our clustering algorithm, is basically just a stress test for array access and arithmetic.

So far, my tests have only been on a Thinkpad Z61P notebook running Windows Vista (64 bit) with an Intel Core 2 CPU (T2700; 2.0GHz), and 4GB of reasonably zippy memory. I don’t know if the speedups will be as great for other OSes or for 32-bit JDKs.

I’m pretty excited about the new fork-join concurrency, too, as it’s just what we’ll need to parallelize the inner loops without too much work for us or the operating system.

*Update: 2:30 PM, 30 March 2009 JDK 7 is only about 15% faster than Sun’s JDK 6 on my quad Xeons (E5410, 2.33GHz) at work running the same code. I’ll have to check the exact specs on both of my memory buses. The notebook has surprisingly fast memory and the Xeon’s running ECC registered memory that I don’t think is quite as fast.

Update: 11:00 AM, 31 March 2009 Like other matrix algorithms, k-means clustering is extremely front-side-bus sensitive (connection between memory and the CPU), because the bottleneck is often between memory and the CPU’s L2 cache. Memory’s significantly slower than CPU these days.

The Intel dual quad-core Xeon E5410 have 12MB of L2 cache at 2.3GHz, whereas the Thinkpad Z61P has Intel Core 2 Mobile T7200 has only 4MB of L2 cache at 2GHz. The Core 2 has a 667 MHz front-side bus whereas the Xeon reports a 1333 MHz front-side bus (is that just the confusion between spec reporting). I actually don’t know what kind of memory’s in the workstation — I’ll have to crack it open and look. I’ve got 4GB of RAM in the notebook, but the motherboard can only see 3GB (ithat is, it’s not Windows — the same thing happened when I installed Ubuntu on the notebook and it’s a known design limitation in many motherboards); I have 16GB of RAM in the workstation and the motherboard can see all of it. But it has two physical chips, each of which share the memory, so the motherboard’s architecture’s very different. There are so many confounding factors that I can’t tease apart what’s speeding up in JDK 7 so much on the notebook.

Anway, go forth and test. If you’re using a machine like my notebook to do number crunching, JDK 7 really is twice as fast as JDK 6 for matrix algorithms.

24 Responses to “JDK 7 Twice as Fast* as JDK 6 for Arrays and Arithmetic”

  1. Vitaly Mikheev Says:

    I believe much of the reason it’s faster is the work of these fellows:

    * Wurthinger, Thomas, Christian Wimmer, and Hanspeter Mossenbock. 2007.
    Array Bounds Check Elimination for the Java HotSpot Client Compiler. PPPJ.

    Basically, this paper is about the same optimization that was described in

    V. Mikheev, S. Fedoseev, V. Sukharev, N. Lipsky. 2002
    Effective Enhancement of Loop Versioning in Java. Compiler Construction

    and implemented in Excelsior JET JVM back in 2001.

    Nevertheless, certain opportunities for index check elimination are still
    not exploited by the HotSpot compiler as shown in this head-to-head comparison

  2. lingpipe Says:

    Thanks for the pointer — the Mikheev paper is cited in the Würthinger et al. paper. Is there an abbreviation for “I am not a compiler expert”?

    I did find Stefan Krause.blog()’s head-to-head comparison, which you mention, but it stops at JDK 1.6. It’s JDK 1.7 beta that has the Würthinger et al. optimizations in it (or so they say — I haven’t looked through the code, just benchmarked). It indeed shows Excelsior’s JET compiler (referenced in the last comment) significantly outperforming Sun’s JDK 6.

    I hadn’t heard of JET before, but it looks interesting. It’s an ahead-of-time compiled Java that conforms to Java SE 6. And the good folks at Excelsior were nice enough to publish a benchmark they didn’t win — that distinction went to the Java Native Compiler, which I’d also never heard of, and which appears to be moribund (last release Feb 11, 2007; last news March 2008 saying they were still alive).

  3. Vitaly Mikheev Says:

    >>I did find Stefan Krause.blog()’s head-to-head comparison,
    >>which you mention, but it stops at JDK 1.6. It’s JDK 1.7

    Oh, you are right. I thought they included it in Java 6 Update 6. Probabaly, I meant that I know a few index check optimization opportunities not covered by the Wurthinger/Wimmer/Mossenbock approach. ;)

    We will place JDK 7 on our benchmark servers soon and I will let you know the results for SciMark2, the tests used by Stefan Krause and other benchmarks.

    >>And the good folks at Excelsior were nice enough to publish a benchmark they didn’t win

    Hah, there is no the world fastest JVM despite some vendors (not Sun and not Excelsior) used to claim. ;)

    Even in that performance study you mentioned (http://www.excelsior-usa.com/jetcs00007.html) Excelsior JET beats Sun HotSpot on the 32-bit platforms but HotSpot (64-bit JRE) strikes back if run on x64.

    Actually, we are now working on 64-bit port of Excelsior JET so one day the benchmark will need an update.

  4. lingpipe Says:

    I completely agree about benchmarks. It’s why I’m so reluctant to publish any of our own, despite Breck’s constant pleas that we need it for marketing.

    And thanks for qualifying my “didn’t win” comment — I should be more careful in these reports; I was indeed focusing only on the 64-bit JDKs.

  5. RawThinkTank Says:

    If an array is using a for loop variable and no other variable to access it inside the for block then its possible.

  6. Akhilesh Says:

    Yeps, JDK7 is indeed fast and my observation consurs with yours. I’m a bit puzzled by the results on the Xeon machine. Is it possible that by chance the tests were run using 32 bit JVM instead of 64 bit ? I presume that you already know that on hybrid installations, 32 and 64 bit JVMs can be selected with -d32 and -d64.

  7. lingpipe Says:

    I didn’t know there were hybrid installations. Trying “-d32″ gives me the error: Error: This Java instance does not support a 32-bit JVM. Please install the desired version.

    Being bitten by environment vars and other global config in the past, I run all of my projects with explicit classpaths. For tests, I call the java.exe directly so I’m really sure about what I’m calling.

  8. Akhilesh Says:

    Ok, then it’s really strange. BTW for a numerical simulation test I ran, 64 bit JVM was twice as fast as 32 bit JVM (both run with server VM and with warm-up loops before timing run). So I thought may be that’s the case with your runs.

    JVM (at least Sun’s JVM) can be installed in hybrid (both 32 and 64 bit versions present) and can be chosen at run time with -d32/-d64 as needed for a particular application (some applications, especially those dependent on native DLLs are sensitive to it).

    • Bernd Says:

      SUN VM isn’t hybrid on Linux or Windows. That particular mode was only available on solaris x86. On those architectures the 64bit install was just a small upgrade to a larger 32bit package.

  9. Stefan Krause Says:

    When benchmarking Sun’s JVM please keep in mind that the 64-bit version supports currently only the server hotspot compiler, whereas the 32-bit version supports client and server hotspot compiler. Depending on your machine (num processors, memory) there might be chances that the server compiler is chosen automatically (might happen on your xeon box, but not on your laptop) and the server compiler had array bounds checks removal since quite a time, (so no speedup there).
    To make sure you get the right hotspot compiler always use the -client or -server switch on 32 bit machines.

  10. Stefan Krause Says:

    Sorry, should have read more carefully. Your laptop also runs on 64 Bit, so there’s no client compiler (and your output confirms that too “Java HotSpot(TM) 64-Bit Server VM”) .
    But then the paper you’ve cited can’t be the reason for the speedup (this is about getting the bounds check removal into the client hotspot compiler)

  11. Thomas Wuerthinger Says:

    Our work is targeted on the HotSpot Client Compiler and not yet part of the official JDK. It is a paper about an array bounds check elimination optimization just as is the work of Vitaly Mikheev et al.; it is however not at all the “same” optimization. For a complete comparison the papers can be reviewed.
    I just want to mention two main differences:
    – No loop versioning or code duplication
    – Focus on fast compilation time (as this is crucial for the HotSpot Client Compiler); we also benchmark compile-time impact

    For getting there, we use a less sophisticated algorithm than other approaches, but get good results for the common case.

    There is now also an extended journal version of our paper that contains a more detailed description as well as additional benchmarks. We also use less array-intensive benchmarks like SpecJVM98, DaCapo or Java Grande for the evaluation. The article is published at http://dx.doi.org/10.1016/j.scico.2009.01.002

  12. lingpipe Says:

    Thanks so much for all the feedback. This is great — I don’t even need to look up authors’ current e-mails! Sorry for being so lazy on this one.

    @Thomas: I misinterpreted the sentence “It is currently integrated into the early access build b04 of the JDK 7 [16].” to mean that it was actually in the main trunk of development, whereas I take Thomas to be saying that they branched.

    @Stefan: I didn’t even notice that the title of the paper had Hotspot Client in it! I don’t think much about client or 32-bit JVMs these days.

    @Thomas, @Vitaly: As to who did what first, check out my previous blog entry on the scientific zeitgeist.

    @Everyone So does anyone have any idea why JDK 7 64-bit server for Windows running on my Core 2 Thinkpad is so much faster than the JDK 6 version? Better use of CPU cache? Specialized instructions for Core 2? There really shouldn’t be any GC in what I’m dong — everything’s preallocated as arrays, and all 2D arrays get assigned to 1D arrays before inner-inner loops.

    I’m pretty ignorant at the lower levels of compilers — is there something to read about this to give me some feeling for what’s going on?

  13. Foo Says:

    Could be the compressed pointers that have entered jdk7 recently.

  14. Ismael Juma Says:

    Hi,

    These days, one cannot compare “JDK 6″ to “JDK 7″ because different HotSpot versions are available in different JDK 6 builds. For example, JDK 6 Update 14 (still in early access) includes HS14 which was the version in the JDK 7 snapshots until a few months ago.

    What kind of results do you get when you run your benchmark with JDK 6 Update 14[1]?

    Ismael

    [1] https://jdk6.dev.java.net/

  15. Ismael Juma Says:

    Foo,

    Compressed pointers are not enabled by default. They are available for JDK 7 and JDK 6 Update 14 (the implementation in JDK 7 has a few improvements at the moment[1]). I would also try the benchmark with compressed pointers enabled as they can improve performance by quite a bit in cases where memory bandwidth is important[2].

    Ismael

    [1] http://blog.juma.me.uk/2009/04/03/load-unsigned-and-better-compressed-oops/
    [2] http://blog.juma.me.uk/2008/10/14/32-bit-or-64-bit-jvm-how-about-a-hybrid/

  16. Alex Miller - Weekly twitter links (April 6th) Says:

    [...] JDK 7 Twice as Fast as JDK 6 for Arrays and Arithmetic (via @andreisavu) [...]

  17. Java 7 é mais rápido que a versão 6 « TekiMania Says:

    [...] A notícia original pode ser vista em: http://lingpipe-blog.com/2009/03/30/jdk-7-twice-as-fast-as-jdk-6-for-arrays-and-arithmetic/ [...]

  18. Alex Miller - What Java 7 will mean to you Says:

    [...] optimizations. We’ve already seen some very encouraging results in String performance, array performance, and a new concurrent garbage collector (G1). I suspect many people will find that their existing [...]

  19. Java 7 What’s New, Performance Benchmark – 1.5, 1.6, 1.7 | Taranfx: Technology Blog Says:

    [...] optimizations. We’ve already seen some very encouraging results in String performance, array performance, and a new concurrent garbage collector (G1). I suspect many people will find that their existing [...]

  20. Bernd Says:

    I am quite sure, that the array bounds check supression for cases like for(int i=0;i<a.length;i++) was added in 1.5 and earlier. At least most of the early JIT/C comparisions mention this "trick".

    And I really doubt much runtime speed improvements in 1.7 over 1.6, since the hotspot releases are nearly in sync.

    • Bob Carpenter Says:

      Earlier 1.6 releases weren’t in synch with 1.7. So when I wrote this, the performance difference was easily reproducible. And both were faster than 1.5. Again, at the time I wrote this.

      Later 1.6 versions incorporated the relevant 1.7 optimizations, which is great for those of us with conservatively version updating customers.

  21. Java 7: Evolution of Java – Revolution for the JVM « dblackherod Says:

    [...] optimizations. We’ve already seen some very encouraging results in String performance, array performance, and a new concurrent garbage collector (G1). I suspect many people will find that their existing [...]

  22. Java VM tuning « errantlinguist.com blog Says:

    [...] found in the newest JVM(s); At the moment, Sun’s JDK 7 is starting to take off, and it seems to be  significantly faster than JDK 6 in e.g. arithmetic operations and operations on arrays — two things critical to performance in machine-learning [...]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

Join 820 other followers