Shame and Joy of Open Source: Perf Bug in FastCache

Hats off to Brian Wilson of Endeca for profiling and then debugging a huge problem in my (Bob's) implementation of com.aliasi.util.FastCache. The joy is that Brian could find and diagnose the bug; the shame is mine for having made this colossal blunder in the first place and then not catching it with size-sensitive tests.

The immediate upshot is that I’m going to roll out a 3.5.1 version with the bug patched this weekend.

The fast cache is designed to be a thread-safe, frequency-of-use-based cache, with pruning triggered when a new entry pushes the size past the load-factor threshold. Pruning is supposed to scale the counts by dividing them by 2 and then discard records whose counts fall to zero.
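The intended policy can be sketched as follows. This is a minimal illustration only, not the actual FastCache code: the class and method names are made up, and it omits the thread safety and soft references the real implementation has.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch of the intended pruning policy: when the cache
// grows past its threshold, halve every entry's use count and discard
// entries whose count drops to zero.
class CountingCache<K, V> {
    private static class Record<V> {
        V value;
        int count = 1;          // frequency of use
        Record(V value) { this.value = value; }
    }

    private final Map<K, Record<V>> map = new HashMap<>();
    private final int maxSize;

    CountingCache(int maxSize) { this.maxSize = maxSize; }

    V get(K key) {
        Record<V> r = map.get(key);
        if (r == null) return null;
        r.count++;              // bump frequency on each hit
        return r.value;
    }

    void put(K key, V value) {
        if (map.size() >= maxSize) prune();
        map.put(key, new Record<>(value));
    }

    // Halve all counts; drop records whose counts fall to zero.
    private void prune() {
        Iterator<Map.Entry<K, Record<V>>> it = map.entrySet().iterator();
        while (it.hasNext()) {
            Record<V> r = it.next().getValue();
            r.count >>>= 1;     // divide count by 2
            if (r.count == 0) it.remove();
        }
    }

    int size() { return map.size(); }
}
```

With this policy, frequently used entries survive several prunes while one-off entries are discarded on the next prune.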

You might be able to spot the problem yourself by looking at the implementation, which is online at: 3.5.0 version of FastCache.java.

I’ll give you a hint. It’s in the prune() method.

Specifically, in this line:

while (record != null 
         && (record.mCount >>> 2) == 0)

Here’s what it should be:

while (record != null 
         && (record.mCount = (record.mCount >>> 1)) == 0)

Here’s the 3.5.1 version of FastCache.java.
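To see the difference concretely, here is a small self-contained demonstration, not the library source; the one-element array stands in for the record's mCount field so the methods can mutate it:

```java
// Demonstration of the two prune conditions.
class PruneCondition {
    // Buggy: inspects count >>> 2 but never stores the result,
    // so the count never decreases.
    static boolean buggyDiscard(int[] count) {
        return (count[0] >>> 2) == 0;   // count[0] is unchanged
    }

    // Fixed: halve the count in place, then test the new value.
    static boolean fixedDiscard(int[] count) {
        return (count[0] = (count[0] >>> 1)) == 0;
    }

    public static void main(String[] args) {
        int[] buggy = {7};
        for (int i = 0; i < 100; i++) buggyDiscard(buggy);
        // buggy[0] is still 7: an entry with count >= 4 is never pruned

        int[] fixed = {7};
        while (!fixedDiscard(fixed)) { }
        // 7 -> 3 -> 1 -> 0: the entry is discarded on the third prune
    }
}
```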

The problem was that I didn't decrement the counts. So why didn't we notice this before? Because the buggy implementation is not only sound (that is, it produces the right answers), it's also relatively well-behaved in the face of power-law distributed natural language data. Every time the size threshold is exceeded, all records with counts of 3 or less get discarded. Since the tail of language data is dominated by counts of 1 and 2, this isn't such a problem for processes that don't run forever over really diverse data with lots of locally duplicated terms.

The other problem is that I only meant to divide by 2, not by 4, which is what a right shift (zero-filled) by 2 achieves. I'll blame the psycholinguistic lexical priming effect for the 2 bug (I was thinking "divide by 2," so I inserted a 2).
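For the record, an unsigned right shift by n divides a non-negative int by 2 to the n, discarding the remainder; a tiny illustration:

```java
// Unsigned right shift by n divides a non-negative int by 2^n.
class ShiftDemo {
    public static void main(String[] args) {
        int count = 12;
        System.out.println(count >>> 2);  // 3: divide by 4 (the buggy shift)
        System.out.println(count >>> 1);  // 6: divide by 2 (the intended shift)
    }
}
```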

This is a very serious bug: not only does it allow the cache to grow without bound, but once the cache exceeds capacity with entries whose counts are 4 and above, every new entry triggers a visit to the entire cache. In the original implementation, pruning also recreated a whole bunch of new map entry records and a whole bunch of soft-reference wrappers. I also tweaked the code so that it doesn't recreate records, but just modifies the ones that exist. This is the clue that Brian found in profiling the code. And even then I didn't spot the bug.

Ironically, I’m in Columbus for the 2008 Assoc for Comp Ling Conference, where I’m co-organizing (with Kevin Bretonnel Cohen, who did most of the work) a workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing.
