Following on from Breck’s straightforward LingPipe-based application of Jaccard distance over sets (defined as size of their intersection divided by size of their union) in his last post on deduplication, I’d like to point out a really nice textbook presentation of how to scale the process of finding similar document using Jaccard distance.
Check out Chapter 3, Finding Similar Items, from:
- Rajaraman, Anand and Jeff Ullman. 2010 (Draft). Mining of Massive Data Sets.
It was developed for a Stanford undergrad class, and we know Ullman writes a mean text, so it’s not surprising it’s at the perfect level for serious practitioners. Other presentations I’ve seen have been very theory heavy (though feel free to point out other refs in the comments).
Here’s an overview of the scaling process, as currently understood, which may seem like a lot of work until you realize it reduces a quadratic all-versus-all doc comparison problem, each instance of which is hairy, to a linear problem, the constant factor for which is manageable.
Step 1. Build a tokenizer to create snippets of text, typically overlapping “shingles” consisting of sequences of tokens. LingPipe’s TokenNGramTokenizerFactory class is essentially a flexible shingler filter for a basic tokenizer. Of course, if all you need to do is the next steps, you don’t need to build string-based tokens — you only need their hash codes, and that’s done more efficiently using something like a rolling hash (the name “rolling hash” is new to me [at least in the intended sense], but the algorithm should be familiar from the Karp-Rabin string search algorithm, which is well described in Gusfield’s most awesome string algorithms book).
Step 2. From each document, extract multiple shingles. Typically these are just the overlapping n-grams or some stemmed or stoplisted forms thereof, which is where the name “shingle” comes from . Rajaram and Ullman suggest that a single stop word followed by the next two tokens works well, which isn’t exactly a shingle, though you could use it as a component and make sequences of these items.
Step 3. Calculate minhash values for each of the shingles in a doc. This provides a compressed representation of sets with the nifty property that the chance that minhashes are equal is the same as the Jaccard distance itself (explained in the book cited above). There’s no Wikipedia page (that I could find), but here’s a nice blog entry on MinHash, which comes with (C#?) code.
Step 4. Calculate locality-sensitive hashes of the minhash values. The point of locality-sensitivity hashing is to map similar items to similar buckets. There’s some really cool math here on expected recall and precision, but I wouldn’t trust the naive numbers for natural language text, because of the huge degree of correlation.
Step 5. Test for equality using the locality-sensitive hashes (LSH). This reduces the quadratic problem of comparing all docs to one with roughly the same performance that only takes constant time. You can get an idea of what the usual presentation looks like for LSH by considering the LSH Wikipedia page, the first line of which assumes you know what a metric space is!
Step 6. You can then check the actual docs if you want to prevent false positive matches.
The book draft has nice examples of all of these things. It also goes over the theoretical justifications of why these approaches work, but doesn’t get bogged down in lots of math — it just sticks to what you need to know to understand the problem and build an implementation yourself. In fact, I may do just that.
Tunable Tradeoffs in Accuracy versus Speed
One very cool feature of this approach is that it’s probabilistic in the sense that you can trade efficiency for accuracy. By using more and more shingles and more and more LSH, you can get pretty much arbitrarily close to 1.0 in accuracy. Given that the problem’s not so well defined already, we can usually tolerate a bit of error on both the false positive and false negative side.
What Does Nutch Do?
The Apache Nutch project, based on the Apache Lucene search engine, is intended to be a web-scale crawler and indexer. It has an implementation of a similar algorithm, though I can’t find any details other than a comment that they’re using MD5 hashes to approximate step 6 (that’s a pretty good approximation). Does anyone have a pointer to how it works?