Price-is-Right Binary Search (for Suffix Arrays of Documents)

by

Suffix Arrays

Although this isn’t really the topic of the post, I’m adding suffix arrays to LingPipe at both the character and token level. Suffix arrays are one of those magic data structures that have too-good-to-be-true-but-it-really-is-true peformance. Here, we care about their ability to find arbitrarily long substring matches of substrings in text efficiently.

For instance, suffix arrays are useful if you’re looking for anything from plagiarized passages in a pile of writing assignments, cut-and-paste code blocks in a large project, or just commonly repeated phrases on Twitter.

I’m representing a collection of documents as a single long text with distinguished break tokens (or characters) to separate the documents in the text. In building the text, I collect have an array holding all the start positions of each document in the big text (in ascending order). Then I’m going to retrieve positions in the large text from the suffix array, leaving me with the problem of finding the document that spans a given position.

Price-is-Right Search: Largest without Going Over

I need what I think of as a Price-is-Right search — you want the index in the array of positions that is largest without going over.

For example, suppose I have an array of starting positions starts = { 0, 12, 19, 113, 492 }. And I want to know which document spans character 22. The answer is the document spanning characters 19 (inclusive) to 113 (exclusive), or index 2 in the start array.

The Algorithm

It’s surprisingly hard to get these trivial little algorithms right. It’s one of the things I learned to do much better when practicing on TopCoder, which is full of these kinds of algorithms. And these make great interview questions. The ability to code this kind of algorithm quickly is really the limiting factor in most coding projects. I’m still amazed thinking of how fast the TopCoder winners are at this kind of thing.

It took me about an hour to get it working and tested. The trick, as with all these algorithms, is to stay on top of your edge cases and handle the general case with a loop maintaining careful loop invariants. That actually makes it sound harder than it is.

/**
 * Given an increasing array of values and a specified value,
 * return the largest index into the array such that the
 * array's value at the index is smaller than or equal to the 
 * specified value.  Returns -1 if there are no entries in the 
 * array less than the specified value.
 *
 * Warning: No test is made that the values are in
 * increasing order.  If they are not, the behavior of this
 * method is not specified.
 * 
 * @param vals Array of values, sorted in ascending order.
 * @param val Specified value to search.
 */
public static int largestWithoutGoingOver(int[] vals, 
                                          int val) {
    int start = 0;
    int end = vals.length;
    if (vals.length == 0)
        return -1;
    if (val >= vals[end-1]) 
        return end - 1; 
    // invariant: vals[start] <= val <= vals[end-1] 
    while (start + 1 < end) {
        int mid = (start + end) / 2;
        // invariant: start < mid < end
        if (val < vals[mid]) 
            end = mid;
        else if (val > vals[mid])
            start = mid;
        else
            return mid;
    }
    return start;
}

There are even two comments in the code, explaining the invariants.

Where would we be without (J)unit tests?

@Test
public void testPriceIsRightEmpty() {
    int[] vs = { };
    assertEquals(-1,largestWithoutGoingOver(vs,3));
}

@Test
public void testPriceIsRight2() {
    int[] vs = { 0, 17, 23, 152, 153, 190 };
    assertEquals(-1, largestWithoutGoingOver(vs,-10));
    for (int k = 0; k + 1 < vs.length; ++k)
        for (int i = vs[k]; i < vs[k+1]; ++i)
            assertEquals(k, largestWithoutGoingOver(vs,i));
   assertEquals(vs.length-1, 
	         largestWithoutGoingOver(vs,190));
    assertEquals(vs.length-1, 
		 largestWithoutGoingOver(vs,195));
    assertEquals(vs.length-1, 
		 largestWithoutGoingOver(vs,Integer.MAX_VALUE));
    }

One Response to “Price-is-Right Binary Search (for Suffix Arrays of Documents)”

  1. Xinfan Meng Says:

    You are too generous to use two verbs in “I collect have an array …” :-)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

Join 807 other followers