Java Object Allocations Costly in Inner Loops

by

I’d pretty much stopped worrying about short-lived object allocations in Java. GC’s gotten so fast it’s usually not the bottleneck in anything I’ve profiled lately, especially for short-lived objects in programs where throughput rather than low maximum latency is critical.

That is, until I got a phone call from a customer asking if I’d added multi-threading to the logistic regression estimator. The short answer was “no”.

Hmm. Logistic regression version 3.8 uses two threads. What changed? Just a really tiny, really short-lived object allocation inside the next-to-innermost loop. It was a wrapper for the non-zero dimensions. Mike added it in order to clean up my original code which had cut-and-paste the prior/regularization/shrinkage steps in multiple places to deal with middle/end of loop differences and sparse/dense array differences.

Here’s what version 3.8.0 looks like (slightly simplified), with this loop being executed once per training epoch:

for (int j = 0; j < numTrainingInstances; ++j) {
  Vector xsJ = xs[j];
  int csJ = cs[j];
  // prior gradient
  for (int k = 0; k < numOutcomesMinus1; ++k) {
    PrimitiveInt dimensions
      = hasSparseInputs
      ? Iterators.intArray(xsJ.nonZeroDimensions())
      : Iterators.intRange(numDimensions);
    adjustWeightsWithPrior(weightVectors[k], j, dimensions,
         prior, learningRate, numTrainingInstances, 
         lastRegularizations);
  }
  // likelihood gradient
  double[] conditionalProbs = regression.classify(xsJ);
  for (int k = 0; k < numOutcomesMinus1; ++k) {
    adjustWeightsWithConditionalProbs(weightVectors[k], 
          conditionalProbs[k], learningRate, xsJ, k, csJ);
  }
  ...
}

and then here’s the code in the inner loop:

void adjustWeightsWithPrior(DenseVector weightVectorsK, 
                            int curInstance, 
                            PrimitiveInt dimensions,
                            RegressionPrior prior, 
                            double learningRate, 
                            int numTrainingInstances, 
                            int[] lastRegularizations) {

  while (dimensions.hasNext()) {
    int dim = dimensions.nextPrimitive();
    if (curInstance == lastRegularizations[dim]) 
      continue;
    int skippedDimMultiplier 
      = curInstance - lastRegularizations[dim];
    double weightVectorsKDim = weightVectorsK.value(dim);
    if (weightVectorsKDim == 0.0) 
      continue;
    double priorGradient 
      = prior.gradient(weightVectorsKDim,dim);
    double delta 
      = (skippedDimMultiplier * learningRate * priorGradient) 
        / numTrainingInstances;
    double newVal = weightVectorsKDim > 0.0
                  ? Math.max(0.0,weightVectorsKDim-delta)
                  : Math.min(0.0,weightVectorsKDim-delta);
    weightVectorsK.setValue(dim, newVal);
    lastRegularizations[dim] = curInstance;
  }
}

With the primitive int iterator (in util.Iterators along with factory creation methods as used above), there’s no further boxing/unboxing required for the numbers.

Mike’s changes seemed like a good idea to me when I code reviewed it, as they made the code much cleaner and more modular. And after all, the innermost loop went over the non-zero dimensions themselves, doing multiply/add steps after method calls. And usually hundreds of these. And all the code to actually follow the likelihood gradient is even more expensive than the prior gradients. So how expensive could the object allocation be, relatively speaking?

Very. It was enough to cause a second thread to be pinned on the computer running the job. That’s exactly what you’d expect if you maxed out the garbage collector with short-lived object deallocations. In the server versions of the JVMs (pretty much all we use as all our machines are 64 bit), garbage collection runs concurrently by default.

At work, on our dual quad-core Xeons with big caches and fast memory, you couldn’t even notice it in terms of either wall clock time or its effect on the system. Debugging on my Core Duo notebook was a different story. Firing up logistic regression version 3.8.0 so hosed the machine I couldn’t even open Firefox without a 2 minute wait for the screen to draw.

To make a long story short, I ripped the lid off again and we now have the best of both worlds. I refactored Mike’s refactoring to preserve the modularity but remove the object allocation. Profiling showed it worked in that logistic regression estimation is only using a single thread, which is what you’d expect, because there’s no object allocation anywhere. It’s also faster on my notebook.

The moral of the story is that fast Java programs need to look like C (without the malloc, of course). In most cases, we’re not deep enough in inner loops in CPU-intensive enough jobs for this to matter. Alas, sometimes, we are.

Stay tuned — I’m about to roll out LingPipe 3.8.1 which has this patch in place. For comparison, here’s the new code:

int[] allDimensions = new int[numDimensions];
...
for (int k = 0; k < numOutcomesMinus1; ++k) {
  ...
  int[] dimensions
    = hasSparseInputs
    ? xsJ.nonZeroDimensions()
    : allDimensions;
  adjustWeightsWithPrior(weightVectors[k], j, 
                         dimensions, ...

where nonZeroDimensions() returns the array of non-zero dimensions for a sparse vector. The update code is pretty much the same:

void adjustWeightsWithPrior(...
                            int[] dimensions,
                            ...) {
  for (int i = 0; i < dimensions.length; ++i) {
    int dim = dimensions[i];
    ...

(Warning: There’s another bug in the above code relative to 3.7.0, namely that the lastRegularizaton gets reset too soon, preventing the 2nd to K-1st coefficient vectors from getting regularized. On the plus side, we’ve also fixed a bug and a separate arithmetic overflow problem from 3.7.0.)

One Response to “Java Object Allocations Costly in Inner Loops”

  1. Ismael Juma Says:

    Hi,

    Yes, allocation should still be avoided in inner loops. Still, there is hope:

    http://blog.juma.me.uk/2008/12/17/objects-with-no-allocation-overhead/

    (also check the comments with the assembly generated)

    Best,
    Ismael

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s