Archive for January, 2011

Which Automatic Differentiation Tool for C/C++?

January 19, 2011

I’ve been playing with all sorts of fun new toys at the new job at Columbia and learning lots of new algorithms. In particular, I’m coming to grips with Hamiltonian (or hybrid) Monte Carlo, which isn’t as complicated as the physics-based motivations may suggest (see the discussion in David MacKay’s book and then move to the more detailed explanation in Christopher Bishop’s book).

Why Hamiltonian Monte Carlo?

Hamiltonian MC methods use gradient (derivative) information to guide Metropolis-Hastings proposals. Matt is finding that it works well for the multilevel deep interaction models we’re working on with Andrew. As the theory suggests, it works much better than Gibbs sampling when the variables are highly correlated, as they tend to be with multilevel regression coefficients. The basic idea goes back to the paper:

  • Duane, Simon, A. D. Kennedy, Brian J. Pendleton, and Duncan Roweth. 1987. Hybrid Monte Carlo. Physics Letters B 195(2):216–222. doi:10.1016/0370-2693(87)91197-X

Why AD?

Because we want to model general directed graphical models a la BUGS, we need to compute gradients.

If you need general gradients, it sure seems like automatic differentiation (AD) deserves at least some consideration. AD is a technique that operates on arbitrary functions defined by source code, generating new source code that computes the derivative (thus it’s a kind of automatic code generation). At a high level, it does just what you might expect: computes the derivatives of built-in functions and constants then uses the chain rule to put them together. Because it generates code, you can even have a go at tuning the resulting derivative code further.

Here are some basic refs:

So Which One?

So what I want to know is, if I’m going to be coding in C, which implementation of AD should I be using? (There are implementations in everything from Fortran to Python to Haskell.)

Update 21 January 2011: So, two problems rear their ugly heads. One: almost all of the tools other than Tapenade require you to seriously modify the code with different data types and flags for the code generator. Two: most of these tools are for non-commercial use only and can’t be redistributed. Not exactly how we’d like to roll together our own academic software. This is especially annoying because they’re almost all government-research funded (Tapenade in France, many of the others in the US, with copyright held tightly by institutions like the University of Chicago and INRIA). Many of the packages require extensive secondary packages to be installed. I miss Java and Apache.

Tapenade’s Online AD for C, C++ and Fortran

Update 21 January 2011: maybe people coming from the blog killed it, but the Tapenade Online app is no longer up and hasn’t been since an hour or two after this blog post.

You can try it on your own C, C++, or Fortran program using the

For instance, consider the simple (intentionally non-optimal) C code to compute f(x) = x^2 + 3x:

double f(double x) {
  double y;
  y = x;
  y *= x;
  y += 3*x;
  return y;
}

Running Tapenade in forward mode, we get

double f_d(double x, double xd, double *f) {
    double y;
    double yd;
    yd = xd;
    y = x;
    yd = yd*x + y*xd;
    y *= x;
    yd = yd + 3*xd;
    y += 3*x;
    *f = y;
    return yd;
}

If you plug in 1.0 for xd (in general, this can be used for chaining functions), you can see that the function returns the derivative f'(x) and also sets the value of argument f (given as a pointer) to f(x).
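The same forward sweep can be reproduced by hand with dual numbers, which is the operator-overloading flavor of AD rather than the source transformation Tapenade does. Here is a minimal sketch in Java; the Dual class and its method names are made up for illustration and don't come from any of the tools discussed here.

```java
// Hypothetical illustration: forward-mode AD with dual numbers (value, derivative).
// This is operator-overloading style AD, not Tapenade's generated code.
final class Dual {
    final double val; // value of the expression
    final double der; // derivative of the expression times the seed xd

    Dual(double val, double der) {
        this.val = val;
        this.der = der;
    }

    // sum rule
    Dual plus(Dual o) {
        return new Dual(val + o.val, der + o.der);
    }

    // product rule
    Dual times(Dual o) {
        return new Dual(val * o.val, der * o.val + val * o.der);
    }

    // multiplication by a constant
    Dual scale(double c) {
        return new Dual(c * val, c * der);
    }

    // f(x) = x^2 + 3x, written in the same order as the C code
    static Dual f(Dual x) {
        Dual y = x;
        y = y.times(x);
        y = y.plus(x.scale(3.0));
        return y;
    }

    public static void main(String[] args) {
        Dual x = new Dual(2.0, 1.0); // seed xd = 1.0
        Dual y = f(x);
        System.out.println("f(2) = " + y.val + ", f'(2) = " + y.der);
    }
}
```

Each arithmetic step carries its derivative along with its value, which is exactly what the yd updates in f_d do line by line.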

In backward mode, Tapenade generates the following code:

void f_b(double x, double *xb, double fb) {
    double y;
    double yb;
    double f;
    y = x;
    yb = fb;
    *xb = *xb + (y+3)*yb;
    yb = x*yb;
    *xb = *xb + yb;
}

If you squint hard enough, you’ll realize this method also computes the derivative f'(x) of f(x). It’s set up to run the chain rule in reverse.
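For anyone who'd rather not squint, here's my own hand trace of the adjoints. The notation is an assumption on my part: \bar{v} is the derivative of the output with respect to v (the variables Tapenade suffixes with b), y_1 = x is the value of y before the multiplication, and *xb is assumed to start at 0.

```latex
\begin{align*}
\bar{y} &= \bar{f}
  && \text{(yb = fb)} \\
\bar{x} &= (y_1 + 3)\,\bar{y} = (x + 3)\,\bar{f}
  && \text{(from } y \mathrel{+}= 3x \text{ and the factor } x \text{ in } y \mathrel{*}= x\text{)} \\
\bar{y}_1 &= x\,\bar{y} = x\,\bar{f}
  && \text{(from the factor } y_1 \text{ in } y \mathrel{*}= x\text{)} \\
\bar{x} &= (x + 3)\,\bar{f} + \bar{y}_1 = (2x + 3)\,\bar{f}
  && \text{(from } y = x\text{)}
\end{align*}
```

With fb = 1 this leaves 2x + 3 in *xb, which is f'(x).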

Citation Stats?

Yes, I know this is totally crass, but nevertheless, I went to and checked out the many systems for C/C++ they cite, and then did a Google Scholar query [SystemName "automatic differentiation"], restricted to results after (including?) 2005. Here’s what I found:

ADMB      73
ADC       16
ADIC     274
ADOL-C   265
CppAD     24
FAD       46
fadbad    65
ffadlib    3
OpenAD    91
Rapsodia   9
Sacado     8
Tapenade 244
Treeverse  7
YAO       88

This doesn’t seem to be expressing a particularly clear community preference. So I was hoping some of you may have suggestions. I should add that many of the papers are themselves surveys, so don’t actually correspond to someone using the package, which is what I’d really like to know.

Scaling Jaccard Distance for Document Deduplication: Shingling, MinHash and Locality-Sensitive Hashing

January 12, 2011

Following on from Breck’s straightforward LingPipe-based application of Jaccard distance over sets (defined as one minus the size of their intersection divided by the size of their union) in his last post on deduplication, I’d like to point out a really nice textbook presentation of how to scale the process of finding similar documents using Jaccard distance.

The Book

Check out Chapter 3, Finding Similar Items, from:

It was developed for a Stanford undergrad class, and we know Ullman writes a mean text, so it’s not surprising it’s at the perfect level for serious practitioners. Other presentations I’ve seen have been very theory heavy (though feel free to point out other refs in the comments).

The Drill

Here’s an overview of the scaling process, as currently understood, which may seem like a lot of work until you realize it reduces a quadratic all-versus-all doc comparison problem, each instance of which is hairy, to a linear problem, the constant factor for which is manageable.

Step 1. Build a tokenizer to create snippets of text, typically overlapping “shingles” consisting of sequences of tokens. LingPipe’s TokenNGramTokenizerFactory class is essentially a flexible shingler filter for a basic tokenizer. Of course, if all you need to do is the next steps, you don’t need to build string-based tokens — you only need their hash codes, and that’s done more efficiently using something like a rolling hash (the name “rolling hash” is new to me [at least in the intended sense], but the algorithm should be familiar from the Karp-Rabin string search algorithm, which is well described in Gusfield’s most awesome string algorithms book).

Step 2. From each document, extract multiple shingles. Typically these are just the overlapping n-grams or some stemmed or stoplisted forms thereof, which is where the name “shingle” comes from. Rajaraman and Ullman suggest that a single stop word followed by the next two tokens works well, which isn’t exactly a shingle, though you could use it as a component and make sequences of these items.

Step 3. Calculate minhash values for each of the shingles in a doc. This provides a compressed representation of sets with the nifty property that the chance that two sets’ minhashes are equal is the same as their Jaccard similarity, that is, one minus the Jaccard distance (explained in the book cited above). There’s no Wikipedia page (that I could find), but here’s a nice blog entry on MinHash, which comes with (C#?) code.

Step 4. Calculate locality-sensitive hashes of the minhash values. The point of locality-sensitive hashing is to map similar items to similar buckets. There’s some really cool math here on expected recall and precision, but I wouldn’t trust the naive numbers for natural language text, because of the huge degree of correlation.

Step 5. Test for equality using the locality-sensitive hashes (LSH). This replaces the quadratic problem of comparing all pairs of docs with one of roughly the same accuracy in which each doc takes only constant time to check, so the whole job is linear. You can get an idea of what the usual presentation looks like for LSH by considering the LSH Wikipedia page, the first line of which assumes you know what a metric space is!

Step 6. You can then check the actual docs if you want to prevent false positive matches.
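To make Steps 2 and 3 concrete, here is a minimal sketch in plain Java. Everything in it is my own assumption rather than the book's or LingPipe's code: whitespace tokens, shingles kept only as String hash codes, and random linear hash functions modulo a prime standing in for random permutations.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical sketch of shingling plus MinHash signatures.
final class MinHashSketch {
    private static final long PRIME = 2147483647L; // 2^31 - 1
    private final long[] a; // multipliers, never 0
    private final long[] b; // offsets

    MinHashSketch(int numHashes, long seed) {
        Random rng = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; ++i) {
            a[i] = 1 + rng.nextInt((int) PRIME - 1); // 1..PRIME-1
            b[i] = rng.nextInt((int) PRIME);         // 0..PRIME-1
        }
    }

    // Step 2: overlapping shingles of k whitespace-separated tokens,
    // stored as hash codes instead of strings.
    static Set<Integer> shingles(String text, int k) {
        String[] toks = text.trim().split("\\s+");
        Set<Integer> out = new HashSet<>();
        for (int i = 0; i + k <= toks.length; ++i) {
            out.add(String.join(" ", Arrays.copyOfRange(toks, i, i + k)).hashCode());
        }
        return out;
    }

    // Step 3: the signature keeps one minimum per hash function.
    long[] signature(Set<Integer> shingles) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE);
        for (int s : shingles) {
            long x = s & 0x7fffffffL; // non-negative 31-bit value
            for (int i = 0; i < a.length; ++i) {
                long h = (a[i] * x + b[i]) % PRIME;
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // The fraction of agreeing positions estimates the Jaccard similarity.
    static double estimate(long[] s1, long[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; ++i) {
            if (s1[i] == s2[i]) {
                ++same;
            }
        }
        return (double) same / s1.length;
    }

    // Exact value for comparison: |intersection| / |union|.
    static double exactJaccard(Set<Integer> x, Set<Integer> y) {
        Set<Integer> inter = new HashSet<>(x);
        inter.retainAll(y);
        Set<Integer> union = new HashSet<>(x);
        union.addAll(y);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(256, 42L);
        Set<Integer> d1 = shingles("the quick brown fox jumps over the lazy dog", 2);
        Set<Integer> d2 = shingles("the quick brown fox leaps over the lazy dog", 2);
        System.out.println("exact     = " + exactJaccard(d1, d2));
        System.out.println("estimated = " + estimate(mh.signature(d1), mh.signature(d2)));
    }
}
```

The signature length is the accuracy knob: more hash functions give a tighter estimate at the cost of more work per doc.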

The book draft has nice examples of all of these things. It also goes over the theoretical justifications of why these approaches work, but doesn’t get bogged down in lots of math — it just sticks to what you need to know to understand the problem and build an implementation yourself. In fact, I may do just that.

Tunable Tradeoffs in Accuracy versus Speed

One very cool feature of this approach is that it’s probabilistic in the sense that you can trade efficiency for accuracy. By using more and more shingles and more and more LSH, you can get pretty much arbitrarily close to 1.0 in accuracy. Given that the problem’s not so well defined already, we can usually tolerate a bit of error on both the false positive and false negative side.
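The tradeoff can be made concrete with the banding construction as I understand it from the book: split each minhash signature into b bands of r values, and call two docs a candidate pair if any whole band matches. Taking the naive independence assumptions at face value, a pair with Jaccard similarity s then becomes a candidate with probability 1 - (1 - s^r)^b. The class and method names below are my own, just for illustration.

```java
// Hypothetical helper for exploring the LSH banding tradeoff.
final class LshBands {
    // Probability that two docs with Jaccard similarity s agree on at least
    // one complete band, given `bands` bands of `rows` minhash values each.
    static double candidateProbability(double s, int bands, int rows) {
        return 1.0 - Math.pow(1.0 - Math.pow(s, rows), bands);
    }

    public static void main(String[] args) {
        int rows = 5;
        for (int bands : new int[] {5, 20, 80}) {
            for (double s : new double[] {0.2, 0.5, 0.8}) {
                System.out.printf("b=%d r=%d s=%.1f -> %.4f%n",
                                  bands, rows, s, candidateProbability(s, bands, rows));
            }
        }
    }
}
```

Adding bands pushes the probability of catching a similar pair toward 1.0 (fewer false negatives), while longer bands cut down on dissimilar pairs sneaking in (fewer false positives).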

What Does Nutch Do?

The Apache Nutch project, based on the Apache Lucene search engine, is intended to be a web-scale crawler and indexer. It has an implementation of a similar algorithm, though I can’t find any details other than a comment that they’re using MD5 hashes to approximate step 6 (that’s a pretty good approximation). Does anyone have a pointer to how it works?

Chris Harrison’s Awesome Word Association Visualizations

January 11, 2011

Wow! These visualizations, which I just saw linked from Slashdot, blew me away:

I particularly like the word associations visualization, which compares pairs of words, such as good/evil and then investigates the words that follow them in bigrams based on conditional probability bands, then sorts the words in each band by unigram frequency. The word spectrum visualization is also nice. By the use of space and scale, Harrison was able to show much more information than I’ve ever seen in a graph like this. Usually they look like giant hairballs.

The natural language processing part of this exercise is pretty much trivial. It’d be easy to do with the LingPipe language modeling package, for instance.

I’d like to see some part-of-speech type things done this way, but that’d be of more interest to linguistic geeks than the general public. Translation would also be interesting if you knew two languages. The Netflix data or other collaborative filtering data would be fun to visualize, too. As would phrasal data with a binary feature, such as Ryan McDonald et al.’s phrase-sentiment graph.

pyhi: Python Package/Module Hello World with all the Trimmings

January 7, 2011

I’ve been teaching myself Python, and being the compulsive neat freak that I am, I first had to figure out their namespaces and how to package everything properly.

pyhi: a Demo Python Package with all the Trimmings

If you want a skeletal application that does everything the right way (as far as I can tell from their online style recommendations), including package/module namespaces, unit tests, installer, documentation, and packages scripts, check out:

Of course, I’m happy to get advice if there are better ways to do this.

Why Python?

I’m working with Andrey Rzhetsky and James Evans at University of Chicago on a version of the Bayesian (and EM) annotation models in Python. I’m also working with Matt Hoffman, Andrew Gelman and Michael Malecki on Python at Columbia for Bayesian inference. Watch this space (and Andrew’s blog) for news.

Does Python Rock?

The short story is that I learned enough in a week to already use it for scripting and munging instead of Java. Compared to Perl, it’s a positively genius work of minimalism and consistency. Everything works pretty much the way you’d expect. When you need C/Fortran back ends (all the optimization, distribution, and matrix libs), Python’s a relatively clean front end. Numpy and PyMC are nice; the design of PyMC is particularly well thought out.

I love the generators and named arguments/defaults. I hate the whitespace syntax (no moving blocks of code with emacs to auto-indent). I wish I had a bit more control over types and pre-allocation, but that’s easily solved with utility functions.

At least as of version 2.6, character strings are the usual mess (Java’s became a mess when Unicode surpassed 16-bit code points), with one type for bytes and a different one for unicode strings (sort of like Java, only there are no built-in Java types for byte-sequence literals).

The lack of backward compatibility among versions of Python itself reminds me how fantastic the Java releases have been in that regard. Particularly the heroic effort of retro-fitting generics.

I find the lack of proper for(;;) loops or ++ operators rather perplexing; I get that they want everything to be a first class object but loop ranges seem to be taking this a bit far. And the “friendly” syntax for ternary operators is an oddly verbose and syntactically contorted choice for Python (“a if cond else b”). At least they left in break/continue.

The idea to execute a file on import probably makes sense for an interpreted language, but boy is it ever slow (seconds to import numpy and pymc). It does let you wrap the imports in try/except blocks, which strikes me as odd, but then I’m used to Java’s more declarative, configurable, and just-in-time import mechanism.

Why doesn’t assignment return its value? I can’t write the usual C-style idiomatic I/O loops. There are so many opportunities for function chaining that aren’t used. It must be some kind of stylistic battle where the Pythonistas love long skinny programs more than short fat ones.

Having to install C and Fortran-based packages takes me straight back to 1980s Unix build hell and makes me appreciate the lovely distribution mechanism that is the Java jar file. I found the Enthought distribution helpful (it’s free for academics but pay-per-year for industrialists), because it includes numpy and then the PyMC installer worked (on Windows 32-bit; couldn’t get 64-bit anything working due to GCC conflicts I didn’t have the patience to sort out).

Of course, Python’s a dog in terms of speed and memory usage compared to Java, much less to C, but at least it’s an order of magnitude faster than R.

Monitoring Convergence of EM for MAP Estimates with Priors

January 4, 2011

I found it remarkably hard to figure out how to monitor convergence for the expectation maximization (EM) estimation algorithm. Elementary textbook presentations often just say “until convergence”, which left me scratching my head. More advanced presentations often leave you in a sea of generalized maximization routines and abstract functionals.

Typically, EM is phrased for maximum likelihood estimation (MLE) problems where there are no priors. Given data y and parameters \theta, the goal is to find the parameters \theta^* that maximize the likelihood function p(y|\theta).

Likelihood and Missing Data

Usually EM is used for latent parameter problems, where there are latent variables z which are treated like missing data, so that the full likelihood function is actually p(y,z|\theta). For instance, z might be mixture component indicators, as in soft (EM) clustering. Typically the full likelihood is factored as p(y,z|\theta) = p(z|\theta) \times p(y|z,\theta).

Even though the expectation (E) step of EM computes “expectations” for z given current estimates of \theta and the data y, these “expectations” aren’t used in the likelihood calculation for convergence. Instead, the form of likelihood we care about for convergence marginalizes z away. Specifically, the maximum likelihood estimate \theta^* is the one that maximizes the likelihood with z marginalized out,

p(y|\theta) = \int p(y,z|\theta) \ dz = \int p(y|z,\theta) \times p(z|\theta) \ dz.

Monitoring Likelihood or Parameters

There’s more than one way to monitor convergence. You can monitor either the differences in log likelihoods (after marginalizing out the latent data) or the differences in parameters (e.g. by Euclidean distance, though you might want to rescale). Log likelihood is more task-oriented, and thus more common in the machine learning world. But if you care about your parameters, you may want to measure them for convergence, because …

Linearly Separable Data for Logistic Regression

In data that’s linearly separable on a single predictor, the maximum likelihood coefficient for that predictor is infinite. Thus the parameters will never converge. But as the parameter approaches infinity, the difference its (absolute) growth makes to log likelihood diminishes (we’re way out on the extremes of the logistic sigmoid at this point, where the slope’s nearly 0).
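Here’s a tiny numerical illustration of that in Java. The two-point data set, the no-intercept model, and all the names are assumptions I made up for the example.

```java
// Hypothetical toy example: log likelihood of a logistic regression slope
// on perfectly separable data.
final class SeparableLogistic {
    // log of the logistic sigmoid, log(1 / (1 + exp(-t))), computed stably
    static double logSigmoid(double t) {
        return t >= 0 ? -Math.log1p(Math.exp(-t)) : t - Math.log1p(Math.exp(t));
    }

    // Data: x = -1 has outcome 0 and x = +1 has outcome 1; no intercept.
    static double logLikelihood(double beta) {
        double[] xs = {-1.0, 1.0};
        int[] ys = {0, 1};
        double ll = 0.0;
        for (int i = 0; i < xs.length; ++i) {
            double t = beta * xs[i];
            // log p(y=1) = logSigmoid(t); log p(y=0) = logSigmoid(-t)
            ll += ys[i] == 1 ? logSigmoid(t) : logSigmoid(-t);
        }
        return ll;
    }

    public static void main(String[] args) {
        double prev = logLikelihood(0.0);
        for (double beta = 1.0; beta <= 16.0; beta *= 2.0) {
            double ll = logLikelihood(beta);
            System.out.println("beta=" + beta + " logLik=" + ll + " gain=" + (ll - prev));
            prev = ll;
        }
    }
}
```

The log likelihood keeps creeping up toward 0 as beta doubles, so the parameter never settles, but each doubling buys less, so a log likelihood tolerance will eventually call it converged.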

Convergence with MAP?

Textbooks often don’t mention, either for philosophical or pedagogical reasons, that it’s possible to use EM for general maximum a posteriori (MAP) estimation when there are priors. Pure non-Bayesians talk about “regularization” or “shrinkage” (specifically the ridge or lasso for regression problems) rather than priors and MAP estimates, but the resulting estimate’s the same either way.

Adding priors for the coefficients, even relatively weak ones, can prevent estimates from diverging, even in the case of separable data. In practice, maximum a posteriori (MAP) estimates will balance the prior and the likelihood. Thus it is almost always a good idea to add priors (or “regularize” if that goes down better philosophically), if nothing else to add stability to the estimates in cases of separability.

Maximization Step with Priors

In EM with priors, the maximization step needs to set \theta^{(n)}, the parameter estimate in the n-th epoch, to the value that maximizes the total probability, \log p(y|\theta) + \log p(\theta), given the current “expectation” for the latent parameters z based on the data and previous epoch’s estimate of \theta. That is, you can’t just set \theta^{(n)} to maximize the likelihood, \log p(y|\theta). There are analytic solutions for the maximizer in many conjugate settings like Dirichlet-Multinomial or Normal-Normal, so this isn’t as hard as it may sound. And often you can get away with increasing it rather than maximizing it (leading to the so-called generalized EM algorithm, GEM).

Convergence with Priors

Well, you could just monitor the parameters. But if you want to monitor the equivalent of likelihood, you need to monitor the log likelihood plus log prior, \log p(y|\theta) + \log p(\theta), not just the log likelihood \log p(y|\theta). What EM guarantees is that no iteration decreases this sum. If you just monitor the likelihood term \log p(y|\theta), you may see it bouncing around rather than monotonically increasing. That’s because the prior’s having its effect, and you need to take that into account.
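Here’s a minimal sketch of that kind of monitoring in Java, on a toy problem I made up for the purpose: estimating the mixing weight pi of a two-component mixture whose component densities are known, with a Beta(a, b) prior (a, b >= 1) on pi. The class, method, and variable names are all hypothetical, and the stopping rule (absolute change below a tolerance) is just one reasonable choice.

```java
import java.util.Arrays;

// Hypothetical toy problem: MAP estimation of a mixing weight by EM,
// monitoring log likelihood plus log prior for convergence.
final class EmMapMonitor {

    // Monitored quantity: log p(y|pi) + log p(pi), up to an additive constant,
    // with the latent component indicators z marginalized out.
    static double logPosterior(double pi, double[] p1, double[] p0, double a, double b) {
        double lp = (a - 1.0) * Math.log(pi) + (b - 1.0) * Math.log(1.0 - pi);
        for (int i = 0; i < p1.length; ++i) {
            lp += Math.log(pi * p1[i] + (1.0 - pi) * p0[i]);
        }
        return lp;
    }

    // One epoch: the E step computes P(z_i = 1 | y_i, pi) and the M step
    // takes the posterior mode given those "expectations".
    static double emStep(double pi, double[] p1, double[] p0, double a, double b) {
        double sumResp = 0.0;
        for (int i = 0; i < p1.length; ++i) {
            double num = pi * p1[i];
            sumResp += num / (num + (1.0 - pi) * p0[i]);
        }
        return (sumResp + a - 1.0) / (p1.length + a + b - 2.0);
    }

    // Runs EM until the monitored value changes by less than tol, returning
    // the whole trace so it can be checked for monotonicity.
    static double[] run(double pi, double[] p1, double[] p0,
                        double a, double b, double tol, int maxEpochs) {
        double[] trace = new double[maxEpochs + 1];
        trace[0] = logPosterior(pi, p1, p0, a, b);
        int n = 1;
        while (n <= maxEpochs) {
            pi = emStep(pi, p1, p0, a, b);
            trace[n] = logPosterior(pi, p1, p0, a, b);
            ++n;
            if (Math.abs(trace[n - 1] - trace[n - 2]) < tol) {
                break;
            }
        }
        return Arrays.copyOf(trace, n);
    }

    public static void main(String[] args) {
        double[] p1 = {0.9, 0.8, 0.7, 0.2, 0.1}; // p(y_i | z_i = 1)
        double[] p0 = {0.1, 0.2, 0.3, 0.8, 0.9}; // p(y_i | z_i = 0)
        double[] trace = run(0.2, p1, p0, 2.0, 2.0, 1e-8, 100);
        for (int i = 0; i < trace.length; ++i) {
            System.out.println("epoch " + i + ": " + trace[i]);
        }
    }
}
```

If the trace ever goes down by more than rounding error, that’s a bug in the E or M step, which makes this a handy sanity check as well as a stopping rule.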

Processing Tweets with LingPipe #3: Near duplicate detection and evaluation

January 2, 2011


In Post #1 we covered how to search Twitter and get a useful on disk data representation of the search results. In Post #2 we covered the first level of deduplication using HashSets as the test of sameness. We extended sameness by doing some normalization of the tweets which included removing urls and retweets.

You can download the code with this tarball or get it from subversion with the command

svn co

Despite the solutions above we still have annoyingly similar tweets that are not useful for my goals for this set of posts. In particular the near duplicates really foul up any attempts at cross validation where one part of the data is used as training and the other as test data. If there are duplicates then we are effectively training on the test data for some examples, and results end up looking too good.

Tokenization and Jaccard Distance

The duplicate detection problem in Twitter is really about word overlap with a slight game of telephone quality to it. The elaborations of previous tweets tend to be leading or following comments with the core of the source tweet preserved. Not much rephrasing is going on so word overlap between tweets is the obvious place to go. That entails a few elaborations to our approach:

  1. Find words in the tweets: Tokenization
  2. Measure similarity of tweets without sensitivity to tweet length: Jaccard Distance


A very simple tokenizer can just break on what characters separate words as follows:

public class Tokenize1 {
    //Approach #1, find token separators
    public static void main (String[] args) {
	String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?";
	System.out.println("Tweet: " + tweet);
	String[] tokens = tweet.split(" ");
	for (String token : tokens) {
	    System.out.println("Token: " + token);
	}
    }
}

This code is in src/ It produces output:

<ant tokenize1
     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?
     [java] Token: RT
     [java] Token: @mparent77772:
     [java] Token: Should
     [java] Token: Obama's
     [java] Token: 'internet

Another approach would be to try to define what characters belong in a token and match that, but that is a little more complex to program because String does not have a simple method call to capture an array of regex matches; see the regex chapter in the LingPipe book for more discussion:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Tokenize2 {
    //Approach 2, define what is a token
    public static void main (String[] args) {
	String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?";
	Pattern tokenPattern = Pattern.compile("\\w+"); //match one or more letter, digit or underscore chars
	System.out.println("Tweet: " + tweet);
	Matcher matcher
	    = tokenPattern.matcher(tweet);
	while (matcher.find()) { //keep matching until no more matches
	    System.out.println("Token: " + matcher.group()); //print what matched
	}
    }
}

This approach produces the following output:

<ant tokenize2  
   [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?
     [java] Token: RT
     [java] Token: mparent77772
     [java] Token: Should
     [java] Token: Obama
     [java] Token: s
     [java] Token: internet

Note that the separators between the tokens are now very different. Between ‘RT’ and ‘mparent77772’ is the separator ‘ @’. Depending on the needs of the application it can be easier to define tokens by what they look like rather than what the spaces around them look like. Often it is both.

While these approaches might work for the case at hand we will introduce the LingPipe tokenizers instead. They offer much richer tokenization options ready to go–see chapter 8 of the LingPipe book draft for more details or look at the Java doc.

An equivalent tokenization in the LingPipe API is created as follows:

import com.aliasi.tokenizer.RegExTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Tokenize3 {
    public static void main (String[] args) {
	//approach #3, A LingPipe tokenizer
	String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?";
	TokenizerFactory tokFactory
	    = new RegExTokenizerFactory("\\w+");//match one or more letter or digit chars
	System.out.println("Tweet: " + tweet);
	char[] chars = tweet.toCharArray();
	Tokenizer tokenizer 
	    = tokFactory.tokenizer(chars,0,chars.length);
	String token;
	System.out.println("White Space :'" +  tokenizer.nextWhitespace() + "'");
	while ((token = tokenizer.nextToken()) != null) {
	    System.out.println("Token: " + token);
	    System.out.println("White Space :'" + tokenizer.nextWhitespace()+"'");
	}
    }
}

Its output reports both the white spaces and the tokens.

     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?
     [java] White Space :''
     [java] Token: RT
     [java] White Space :' @'
     [java] Token: mparent77772
     [java] White Space :': '
     [java] Token: Should

We add the normalization features of post #2 by extending the ModifyTokenTokenizerFactory to filter retweets and urls. A good way to develop a tokenizer is to put it into its own class and create a stand-alone task that runs the tokenizer over example data. In ‘src/’ there is a main method:

public static void main(String[] args) {
	TokenizerFactory tokFactory = new NormalizedTokenizerFactory();
	String text = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?";
	System.out.println("Tweet: " + text);
	char[] chars = text.toCharArray(); //convert to charArray
	Tokenizer tokenizer
	    = tokFactory.tokenizer(chars,0,chars.length);
	String token;
	System.out.println("White Space :'" +  tokenizer.nextWhitespace() + "'");
	while ((token = tokenizer.nextToken()) != null) {
	    System.out.println("Token: " + token);
	    System.out.println("White Space :'" + tokenizer.nextWhitespace()+"'");
	}
}

Running the above with ant normalizedTokenizerFactorty produces nicely normalized output.

     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?
     [java] White Space :''
     [java] Token: Should
     [java] White Space :' '
     [java] Token: Obama
     [java] White Space :'''
     [java] Token: s
     [java] White Space :' ''
     [java] Token: internet

How that tokenizer functions is left as an exercise for the reader. Time to take on Mr. Jaccard.

Jaccard Distance as a Measure of Similarity

Jaccard Distance is a good way to impress folks with a fancy-sounding term that is really just percent word overlap in our usage. If you think it helps you can also call it by the term ‘coefficient de communauté’ but then you might have to reveal that it was popularized by a Swiss botanist, which kind of undermines its jargon impact score. From the Jaccard Javadoc, the proximity of two character sequences is:

  proximity(cs1,cs2)
   = size(termSet(cs1) INTERSECT termSet(cs2))
     / size(termSet(cs1) UNION termSet(cs2))

We get the term sets from the tokenizer, and proximity is the percentage of tokens shared by both character sequences. The code to compute this for all pairs of tweets in our search results is in src/ and the relevant bit of code is:

	JaccardDistance jaccardD = new JaccardDistance(tokFactory);
	int filteredCount = 0;
	List<String> candidateTweets
	    = filterNormalizedDuplicates(texts,
					 tokFactory); //throw out easy cases
	System.out.println("Normalized duplicate filter leaves " + candidateTweets.size() + " tweets");
	row = new ArrayList();
	for (int i = 0; i < candidateTweets.size(); ++i) {
	    String closestTweet = "default value";
	    double closestProximity = -1d;
	    String targetTweet = candidateTweets.get(i);
	    for (int j = 0; j < candidateTweets.size(); ++j ) {//cross product, ouchy, ow ow.
		String comparisionTweet = candidateTweets.get(j);
		double thisProximity
		    = jaccardD.proximity(targetTweet,comparisionTweet);
		if (i != j) { // can't match self
		    if (closestProximity < thisProximity) {
			closestTweet = comparisionTweet;
			closestProximity = thisProximity;
		    }
		}
	    }
	}

The goal of the above wildly inefficient program is to explore what the closest tweet is as determined by token overlap. The call to filterNormalizedDuplicates(texts,tokFactory) filters out duplicates as discussed in #2. We can then decide on a threshold to throw out tweets with too much overlap. Running the program on the Obama.csv example:

ant exploreJaccard -Ddata=searches/Obama.csv -Doutfile=runs/Obama.jaccard.csv

and then viewing the output .csv file with a sort on column B we get:

[Figure: Jaccard Distance sorted for similarity]

Note that we get some values of 1 even though there are differences between Tweet1 and Tweet2. In row 2 the normalization filters out the url, leaving a single difference: where Tweet1 has “replacing trains by high speed buses”, Tweet2 has an additional “high speed” in “replacing high speed trains by high speed busses”. Because Jaccard works on the set of words and ignores how many times each one occurs, the additional phrase adds nothing to the set; its words already occur in the tweet.

Scrolling down reveals just how bad redundancy can be with tweets. At the 50% overlap point the tweets are still looking quite similar:

	Vet, 77, Busted For Obama Death Threat | The Smoking Gun via @
	Vet, 77, Busted For Obama Death Threat #tcot #tlot #sgp

There are 14 unique tokens with 7 overlapping, yielding 50% overlap. 36% of the Obama query tweets have an overlap of 50% or more with another tweet. Trying a different query, "the", finds less overlap between tweets: only 3% of the tweets have another tweet with 50% overlap. For the query "today", 14% of tweets overlap another tweet at that level. The lesson here is that some queries yield more uniform tweets, which impacts subsequent language processing. A more technical way of expressing this observation is that the entropy of the resulting search results varies depending on the query. The "obama" search result entropy is lower (less random) than that of the results for "the", which are more random.
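The 7-out-of-14 arithmetic can be checked by hand with a few lines of plain Java. This is only a sketch: it uses a simple \w+ regex tokenizer as a stand-in for the normalized tokenizer, and the class and method names are made up here rather than taken from LingPipe.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone version of Jaccard proximity over token sets.
final class JaccardByHand {
    private static final Pattern TOKEN = Pattern.compile("\\w+");

    // The set of distinct tokens; repeated tokens collapse to one entry.
    static Set<String> termSet(CharSequence cs) {
        Set<String> terms = new HashSet<>();
        Matcher m = TOKEN.matcher(cs);
        while (m.find()) {
            terms.add(m.group());
        }
        return terms;
    }

    // size(intersection) / size(union); returning 1.0 for two empty
    // term sets is a convention chosen for this sketch.
    static double proximity(CharSequence cs1, CharSequence cs2) {
        Set<String> a = termSet(cs1);
        Set<String> b = termSet(cs2);
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 1.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        String t1 = "Vet, 77, Busted For Obama Death Threat | The Smoking Gun via @";
        String t2 = "Vet, 77, Busted For Obama Death Threat #tcot #tlot #sgp";
        System.out.println("proximity = " + proximity(t1, t2)); // 7 shared of 14 distinct
    }
}
```

The first tweet contributes 11 distinct tokens and the second 10, with 7 in common, so the union has 14 and the proximity is 0.5.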

The ‘today’ query had usefully different tweets at the 50% overlap level:

Playing a show in Chicago, IL at 9:00 PM today at LE PASSAGE	
Playing a show in Cape Girardeau, MO at 9:00 PM today at The Venue

Despite sharing half the words they are clearly not retweets and contain different information. It might be quite difficult to automatically reject ‘obama’ near duplicates at 50% token overlap but retain the ‘today’ near duplicates with 50% overlap–I encourage people to suggest approaches in the comments.

Finally we add a class that will take a set of queries and filter them for near duplicates at a given Jaccard proximity in src/ The interesting bit is:

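    public static List<String> filterTweetsJaccard(List<String> texts,
				       TokenizerFactory tokFactory,
				       double cutoff) {
	JaccardDistance jaccardD = new JaccardDistance(tokFactory);
	List<String> filteredTweets = new ArrayList<String>();
	for (int i = 0; i < texts.size(); ++i) {
	    String targetTweet = texts.get(i);
	    boolean addTweet = true;
	    //big research literature on making the below loop more efficient
	    for (int j = 0; j < filteredTweets.size(); ++j) {
		String comparisonTweet = filteredTweets.get(j);
		double thisProximity
		    = jaccardD.proximity(targetTweet,comparisonTweet);
		if (thisProximity >= cutoff) {
		    addTweet = false;
		    break; //one nod to efficiency
		}
	    }
	    if (addTweet) {
		filteredTweets.add(targetTweet);
	    }
	}
	return filteredTweets;
    }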

Deduplication along these lines is a big deal for web search as well as cleaning up Twitter feeds. The painful bit is the comparison of each new tweet to all the tweets that have passed the filter before which does not scale well. Much work has gone into hashing schemes that test for a hash match based on a lexical fingerprint of the existing corpus of tweets as well as more sophisticated similarity metrics. A recent paper with a decent overview of past work is

Running the code on the Obama search results removes half the tweets with a .5 cutoff.

<ant dedupeJaccard -Ddata=searches/Obama.csv -Doutfile=runs/ObamaJaccardFiltered.csv
Buildfile: build.xml
    [javac] Compiling 1 source file to /home/breck/devguard/teaching/lingpipe4twitter/build/classes

      [jar] Building jar: /home/breck/devguard/teaching/lingpipe4twitter/demo.jar

     [java] Writing to: runs/ObamaJaccardFiltered.csv
     [java] Filtered size: 752 originally 1500

Looking at the tweets another issue immediately presents itself:

Coupe du monde - Obama: "Mauvaise décision": Le président Barack Obama a… #Maroc #sport #foot
RT @Luminary212: LOL “@BreakingNews: Obama knocks FIFA's decision to award Qatar, not the U.S., the 2022 World Cup: 'Wrong decision'”
I want to meet President Obama.
RT @laaficion Barack Obama, presidente de EU, consideró que la FIFA se equivocó al darle a Qatar el Mundial de 2022 y no a su país// celos

All sorts of languages are in the data. It looks like there is French and Spanish mixed in with the English. The next post will cover the creation of a language identification classifier and will return to the notion of entropy discussed above.


In the quest to clean up the search results I have introduced some key concepts in text processing that will resurface in various forms throughout the posts. Summarizing:

  • Tokenization: Breaking strings into words/word separators is key to getting at the unit of meaning for word based processing. Oddly enough in the next post we won’t be using words for language id, but we will still be tokenizing.
  • Normalization: By eliminating trivial differences we can make text look more uniform, but beware of this tendency.
  • Entropy: Some collections of data are more uniform than others.
  • Similarity: There are measures like Jaccard Distance to estimate the similarity of text. We will see many more examples that implement topic assignment, sentiment and more.
  • Evaluation: We have picked an operating point for near duplicate detection by examining data. While the evaluation metric remains quite loose, in the next post we will both create gold-standard (or truth) data and evaluate the performance of our language id system against it.