Following on from Breck’s straightforward LingPipe-based application of Jaccard distance over sets (one minus the size of their intersection divided by the size of their union) in his last post on deduplication, I’d like to point out a really nice textbook presentation of how to scale the process of finding similar documents using Jaccard distance.
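For concreteness, here’s the distance in question sketched in a few lines of plain Python (not LingPipe’s implementation):

```python
def jaccard_distance(a, b):
    """One minus the size of the intersection over the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention: two empty sets count as identical
    return 1.0 - len(a & b) / len(a | b)

print(jaccard_distance({"a", "b", "c"}, {"b", "c", "d"}))  # 1 - 2/4 = 0.5
```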

### The Book

Check out Chapter 3, Finding Similar Items, from:

- Rajaraman, Anand and Jeff Ullman. 2010 (Draft).
*Mining of Massive Data Sets*.

It was developed for a Stanford undergrad class, and we know Ullman writes a mean text, so it’s not surprising it’s at the perfect level for serious practitioners. Other presentations I’ve seen have been very theory heavy (though feel free to point out other refs in the comments).

### The Drill

Here’s an overview of the scaling process, as currently understood. It may seem like a lot of work until you realize it reduces a quadratic all-versus-all doc comparison problem, each instance of which is expensive on its own, to a linear problem with a manageable constant factor.

Step 1. Build a tokenizer to create snippets of text, typically overlapping “shingles” consisting of sequences of tokens. LingPipe’s TokenNGramTokenizerFactory class is essentially a flexible shingler filter for a basic tokenizer. Of course, if all you need to do is the next steps, you don’t need to build string-based tokens — you only need their hash codes, and that’s done more efficiently using something like a rolling hash (the name “rolling hash” is new to me [at least in the intended sense], but the algorithm should be familiar from the Karp-Rabin string search algorithm, which is well described in Gusfield’s most awesome string algorithms book).
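Since this step only needs hash codes, here’s a sketch of the Karp-Rabin-style rolling idea in plain Python; BASE and MOD are illustrative choices, not anything from LingPipe:

```python
BASE = 257            # illustrative constants for the sketch
MOD = (1 << 61) - 1

def rolling_shingle_hashes(token_hashes, k):
    """Hash every window of k consecutive token hashes, Karp-Rabin style.

    Each new window hash is derived from the previous one in O(1),
    instead of rehashing all k tokens from scratch.
    """
    if len(token_hashes) < k:
        return []
    top = pow(BASE, k - 1, MOD)      # weight of the token leaving the window
    h = 0
    for t in token_hashes[:k]:       # hash the first window directly
        h = (h * BASE + t) % MOD
    hashes = [h]
    for i in range(k, len(token_hashes)):
        # drop the outgoing token's contribution, shift, add the new token
        h = ((h - token_hashes[i - k] * top) * BASE + token_hashes[i]) % MOD
        hashes.append(h)
    return hashes
```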

Step 2. From each document, extract multiple shingles. Typically these are just the overlapping n-grams or some stemmed or stoplisted forms thereof; the way they overlap like roof shingles is where the name comes from. Rajaraman and Ullman suggest that a single stop word followed by the next two tokens works well, which isn’t exactly a shingle, though you could use it as a component and make sequences of these items.
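A minimal word-level shingler in plain Python (not LingPipe’s TokenNGramTokenizerFactory), with n = 3; the book’s stop-word variant would replace this fixed windowing with (stopword, next token, next token) triples:

```python
def shingles(text, n=3):
    """Return the set of overlapping n-token shingles of a text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

print(shingles("the quick brown fox jumps", 3))
# {'the quick brown', 'quick brown fox', 'brown fox jumps'}
```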

Step 3. Calculate minhash values for each of the shingles in a doc. This provides a compressed representation of sets with the nifty property that the chance that two minhashes are equal is exactly the Jaccard similarity of the underlying sets, i.e. one minus the distance (explained in the book cited above). There’s no Wikipedia page (that I could find), but here’s a nice blog entry on MinHash, which comes with (C#?) code.
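Here’s a minimal MinHash sketch in plain Python; the hash family (the built-in hash XORed with random 64-bit masks) is an illustrative stand-in for the random permutations in the book:

```python
import random

def make_minhasher(k, seed=42):
    """Return a function mapping a set to a k-slot MinHash signature."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(64) for _ in range(k)]
    mod = (1 << 64) - 1
    def signature(items):
        # keep the minimum hash value per mask; the chance two signatures
        # agree in a given slot equals the sets' Jaccard similarity
        return [min((hash(x) ^ m) & mod for x in items) for m in masks]
    return signature

sig = make_minhasher(128)
a, b = sig({"a", "b", "c"}), sig({"b", "c", "d"})
est = sum(x == y for x, y in zip(a, b)) / 128
# est estimates the Jaccard similarity of the two sets, here 2/4 = 0.5
```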

Step 4. Calculate locality-sensitive hashes of the minhash values. The point of locality-sensitive hashing is to map similar items to the same buckets with high probability. There’s some really cool math here on expected recall and precision, but I wouldn’t trust the naive numbers for natural language text, because of the huge degree of correlation.
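A sketch of the banding trick over MinHash signatures, with illustrative parameters (16 bands of 8 rows over a 128-slot signature); docs that agree on any whole band become candidate pairs:

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, bands=16, rows=8):
    """Bucket docs by each band of their MinHash signature and return
    the pairs of doc ids that collide in at least one band."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            chunk = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, chunk)].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for x, y in combinations(sorted(ids), 2):
            pairs.add((x, y))
    return pairs
```

Only pairs that collide in some band ever get compared, which is where the quadratic-to-linear win in Step 5 comes from.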

Step 5. Test for equality using the locality-sensitive hashes (LSH). This reduces the quadratic problem of comparing every pair of docs to a roughly linear one: each doc need only be compared against the few candidates that land in the same bucket. You can get an idea of what the usual presentation looks like for LSH by considering the LSH Wikipedia page, the first line of which assumes you know what a metric space is!

Step 6. You can then check the actual docs if you want to prevent false positive matches.

The book draft has nice examples of all of these things. It also goes over the theoretical justifications of why these approaches work, but doesn’t get bogged down in lots of math — it just sticks to what you need to know to understand the problem and build an implementation yourself. In fact, I may do just that.

### Tunable Tradeoffs in Accuracy versus Speed

One very cool feature of this approach is that it’s probabilistic in the sense that you can trade efficiency for accuracy. By using longer signatures (more minhashes) and more LSH bands, you can push accuracy pretty much arbitrarily close to 1.0. Given that the problem’s not so well defined to begin with, we can usually tolerate a bit of error on both the false positive and false negative sides.
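The knob is the banding scheme from steps 3–5: with b bands of r rows, a pair with Jaccard similarity s becomes a candidate with probability 1 − (1 − s^r)^b. A tiny sketch with illustrative parameters:

```python
def candidate_probability(s, bands, rows):
    """Chance a pair with Jaccard similarity s collides in some LSH band."""
    return 1.0 - (1.0 - s ** rows) ** bands

# With 20 bands of 5 rows: near-duplicates are almost always caught,
# while dissimilar pairs rarely become candidates.
print(candidate_probability(0.8, bands=20, rows=5))  # > 0.999
print(candidate_probability(0.2, bands=20, rows=5))  # < 0.007
```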

### What Does Nutch Do?

The Apache Nutch project, based on the Apache Lucene search engine, is intended to be a web-scale crawler and indexer. It has an implementation of a similar algorithm, though I can’t find any details other than a comment that they’re using MD5 hashes to approximate step 6 (that’s a pretty good approximation). Does anyone have a pointer to how it works?


January 21, 2011 at 9:56 am

Re Nutch: see the package org.apache.nutch.crawl (http://svn.apache.org/viewvc/nutch/trunk/src/java/org/apache/nutch/crawl/)

and more specifically MD5Signature and TextProfileSignature, which are the two implementations currently available.

The former is extremely straightforward (read brutal) and uses an MD5 hash of the raw binary content as a signature for a page.

The TextProfileSignature implementation is a bit more refined and does (I quote its Javadoc):

The algorithm to calculate a page “profile” takes the plain text version of a page and performs the following steps:

* remove all characters except letters and digits, and bring all characters to lower case,
* split the text into tokens (all consecutive non-whitespace characters),
* discard tokens equal or shorter than MIN_TOKEN_LEN (default 2 characters),
* sort the list of tokens by decreasing frequency,
* round down the counts of tokens to the nearest multiple of QUANT (QUANT = QUANT_RATE * maxFreq, where QUANT_RATE is 0.01f by default, and maxFreq is the maximum token frequency). If maxFreq is higher than 1, then QUANT is always higher than 2 (which means that tokens with frequency 1 are always discarded),
* discard tokens whose frequency after quantization falls below QUANT,
* create a list of tokens and their quantized frequencies, separated by spaces, in the order of decreasing frequency.

This list is then submitted to an MD5 hash calculation.

This allows for some slight variations in the content.
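The quoted steps translate into a short sketch; the constants follow the Javadoc defaults, but this is a plain-Python transcription for illustration, not Nutch’s actual Java code:

```python
import hashlib
import re
from collections import Counter

MIN_TOKEN_LEN = 2   # Javadoc default
QUANT_RATE = 0.01   # Javadoc default

def text_profile_signature(text):
    """MD5 of a quantized token-frequency profile of the plain text."""
    # keep only letters and digits, lower-cased
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    # tokenize and drop tokens of length <= MIN_TOKEN_LEN
    tokens = [t for t in text.split() if len(t) > MIN_TOKEN_LEN]
    counts = Counter(tokens)
    if not counts:
        return hashlib.md5(b"").hexdigest()
    max_freq = max(counts.values())
    # QUANT is at least 2 whenever maxFreq > 1, so singletons are dropped
    quant = max(int(QUANT_RATE * max_freq), 2) if max_freq > 1 else 1
    profile = []
    for token, freq in counts.most_common():       # decreasing frequency
        q = (freq // quant) * quant                # round down to multiple
        if q >= quant:                             # drop below-QUANT tokens
            profile.append(f"{token} {q}")
    return hashlib.md5(" ".join(profile).encode("utf-8")).hexdigest()
```

Punctuation and small wording changes then wash out before the MD5 is taken.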

Julien

January 24, 2011 at 1:45 pm

Thanks for the pointer to Nutch’s actual approach(es). I’ve seen the use of normalized middle-frequency terms before, but can’t remember where — maybe Nutch itself!

Rajaraman and Ullman seem to be going in the opposite direction when they combine a stop word and two following tokens to get a more diverse set of tokens, but then they use a Jaccard-like comparison to account for noise.