Archive for the ‘LingPipe in Use’ Category

LingPipe Book Draft 0.4

March 31, 2011

I just put up the latest draft of the LingPipe book. Since the last edition, I’ve added a chapter on character language models and added details of the various classifier interfaces and classification types. The location’s the same:

I’m about to roll out LingPipe 4.0.2, which patches a few minor inconsistencies that have turned up in writing the book (like process character LMs not being serializable and some remaining generic arrays that I’ve recoded with lists [and deprecated the generic array versions]).

What is a (Mathematical) Model?

March 11, 2011

I was shocked and dismayed when I heard from a reader that I’d used the term “model” over 200 times in the LingPipe book without ever saying what it meant. This kind of thing is why it’s so hard to write introductory material.

Perhaps I shouldn’t have been surprised at this comment, because other people had expressed some uncertainty about the term “model” to me in the past.

What is a (Mathematical) Model?

In short, when I say “model”, I mean it in the bog-standard scientific sense, as explained on:

Quite simply, it’s just a bunch of math used to describe a phenomenon. Nothing interesting here either philosophically or conceptually, just the usual scientific method.

For instance, Newton’s equation f = m \times a (force equals mass times acceleration) is a mathematical model that may be used to describe the motions of the planets, among other things. Newton derived his model from Kepler’s observation that the planets swept out equal areas in equal times in their orbits. Newton realized that by introducing the notion of “gravity”, he could model the orbits of the planets. Of course, he had to invent calculus to do so, but that’s another story.

Prediction vs. Postdiction

Typically models are used for predicting future events, but sometimes they’re used to retroactively try to understand past events (“backcasting” [“aftcasting” if you’re nautical] is the time opposite of “forecasting”, and “postdiction” the opposite of “prediction”). For instance, climate scientists attempt to postdict/backcast earth temperatures from data such as tree rings; we’re working on fitting models of such data with Matt Schofield as part of our Bayesian inference project at Columbia.

All Models are Wrong, but …

As the statistician George E. P. Box said, “Essentially, all models are wrong, but some are useful.” For instance, Newton’s model is wrong in that it doesn’t correct for relativistic effects at very high velocities. But it proved useful at predicting everything from eclipses to the tides.

The models we’ve used in LingPipe, such as the HMM model of part-of-speech tagging, are also clearly wrong. Language just isn’t Markovian (meaning that the n-th word only depends on a fixed window of previous few words). But we can still do pretty well at predicting part of speech tags with the simplified model.

LingPipe Baseline for MITRE Name Matching Challenge

February 17, 2011

We’ve released a baseline system for MITRE’s name matching challenge. Unfortunately, there’s no training data, so I used a simple heuristic matcher that we’ve used before for this problem and related problems.

In First Place (for now)

As of right this second, we’re in first place on their leaderboard (team name “Agent Smith”). There aren’t many participants yet, so I don’t expect to enjoy this position for long.

Check out the LingPipe Baseline

You can anonymously check out our source for our entry from the LingPipe Sandbox with Apache Subversion version control system via the command:

svn co

There’s an Ant build file that runs the eval and a README.txt to help you get started. I’ll repeat the contents of the README (as it stands now) below.

Mean Average Precision Scoring

The contest is being scored by mean average precision. This measure does not penalize you for returning long lists of irrelevant results. Our baseline returns lots of results, and its results for the variously balanced F measures that are reported could probably be improved by tuning thresholds and the maximum number of results returned.

What I’d like to see, as usual, is a precision-recall curve or ROC curve derived from setting a threshold on the scores. But it’s hard to rank bakeoff entries by ROC curve, so IR bakeoffs typically resort to area under one of these curves or a measure like mean average precision (sort of like area under the curve; see the LingPipe book’s section on classifier evaluation for more discussion).
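To make the point about long result lists concrete, here’s a minimal sketch of average precision and its mean over queries. It uses the standard textbook definition, which I’m assuming (not confirming) is what the contest scorer computes; the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: textbook average precision, not MITRE's official scorer.
final class MeanAveragePrecisionSketch {

    // Precision is sampled at each rank holding a relevant item, and the
    // sum is divided by the number of relevant items, so misses score zero.
    // Assumes the ranked list contains no duplicates.
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        int hits = 0;
        double sum = 0.0;
        for (int k = 0; k < ranked.size(); ++k) {
            if (relevant.contains(ranked.get(k))) {
                ++hits;
                sum += hits / (double) (k + 1);
            }
        }
        return sum / relevant.size();
    }

    // Mean of the per-query average precisions.
    static double meanAveragePrecision(List<List<String>> rankings,
                                       List<Set<String>> relevants) {
        if (rankings.isEmpty()) return 0.0;
        double total = 0.0;
        for (int i = 0; i < rankings.size(); ++i)
            total += averagePrecision(rankings.get(i), relevants.get(i));
        return total / rankings.size();
    }

    public static void main(String[] args) {
        Set<String> relevant = new HashSet<>(Arrays.asList("a", "b"));
        List<String> shortList = Arrays.asList("a", "x", "b");
        List<String> paddedList = Arrays.asList("a", "x", "b", "y", "z", "w");
        // Both calls print the same value: trailing junk costs nothing.
        System.out.println(averagePrecision(shortList, relevant));
        System.out.println(averagePrecision(paddedList, relevant));
    }
}
```

Irrelevant results only hurt when they push relevant ones further down the list, which is why padding the tail of a ranking is free under this measure.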


The Contest

MITRE is hosting an unsupervised name-matching challenge. The home page is:

From there, you can register for the challenge and download the data, which consists of two lists of names, one long “index” file (about 825K names), and one short “query” file (around 9K names). The task is to rank potential matches in the index file for each name in the query file. The names are broken into forenames and surnames, but there’s no guarantee this assignment is accurate.

There is no gold standard as to which names actually match, so MITRE’s using a liberal notion of matching corresponding to “requires further human review to reject”. Furthermore, this notion of matching may change as the task evolves.

Example Matches

The only way (I know of) to see some example matches is to download the distribution and look at the documentation (.docx format — I had to install the latest OpenOffice to read it). They list examples of what they take to be potential matches for further review.

The LingPipe Baseline

This directory contains baseline code for an entry based on character n-grams. The entry is set up so that the character n-gram scores are used as a filter, which should have high sensitivity (low false negative rate) for matches, though low specificity (high false positive rate). Parameters controlling its aggressiveness may be tuned.

If you download the data and unpack it into directory $DATADIR, you can run the task as follows:

% ant -Ddata.dir=$DATADIR ngrams

This’ll write the output match rankings into a fresh file in the output subdirectory /runs. If you don’t change the output routines, the output will be in the appropriate format for submission to the contest.

You can turn verbose debugging on via the flag DEBUG in the code itself.

Three Pass Refinement

Our approach is based on three passes.

1. Indexing

Create a character n-gram index and select potential pairs based on having at least one n-gram in common. The n-gram length is parameterizable through the constant INDEX_NGRAM. Setting it lower increases run time but may increase sensitivity for obscure matches.
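Here’s a rough sketch of what first-pass candidate selection by shared n-grams might look like. It is not the baseline’s actual code; the class name and methods are hypothetical, and the constructor argument just plays the role of INDEX_NGRAM.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch of first-pass candidate selection; not the baseline's code.
final class NGramIndexSketch {
    private final int n;  // plays the role of INDEX_NGRAM
    private final List<String> names = new ArrayList<>();
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    NGramIndexSketch(int n) { this.n = n; }

    // The set of distinct character n-grams in s.
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); ++i)
            grams.add(s.substring(i, i + n));
        return grams;
    }

    // Record the name under each of its n-grams.
    void add(String name) {
        int id = names.size();
        names.add(name);
        for (String gram : ngrams(name, n))
            postings.computeIfAbsent(gram, g -> new HashSet<>()).add(id);
    }

    // Every indexed name sharing at least one n-gram with the query.
    Set<String> candidates(String query) {
        Set<String> result = new TreeSet<>();
        for (String gram : ngrams(query, n)) {
            Set<Integer> ids = postings.get(gram);
            if (ids == null) continue;
            for (int id : ids) result.add(names.get(id));
        }
        return result;
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 2}) {
            NGramIndexSketch index = new NGramIndexSketch(n);
            for (String name : new String[] {"smith", "smyth", "jones"})
                index.add(name);
            System.out.println("n=" + n + " candidates: " + index.candidates("smithe"));
        }
    }
}
```

With 3-grams the query "smithe" only pulls in "smith"; dropping to 2-grams also pulls in "smyth" (via "sm" and "th"), at the cost of longer posting lists to walk.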

2. TF/IDF Scoring and Filtering

Rank the first-pass possible matches by TF/IDF distance over their character n-grams. The range of n-grams is parameterizable with MATCH_NGRAM_MIN and MATCH_NGRAM_MAX; setting these to 2 and 4 respectively produces matches based on 2-, 3-, and 4-grams. TF/IDF weighting will weight the less frequent n-grams more heavily. The maximum number of candidates surviving this stage may be bounded by setting the MAX_RESULTS_INDEX variable in the code.
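As a rough illustration of the second pass, here’s a plain-Java sketch of TF/IDF-weighted cosine proximity over character n-grams. The weighting (raw count times log inverse document frequency) is one textbook choice that I’m assuming for the example; it is not necessarily the exact weighting the baseline uses, and all names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: TF/IDF cosine over character n-grams.
final class CharNGramTfIdfSketch {
    private final int minN;  // plays the role of MATCH_NGRAM_MIN
    private final int maxN;  // plays the role of MATCH_NGRAM_MAX
    private final Map<String, Integer> docFreq = new HashMap<>();
    private int numDocs = 0;

    CharNGramTfIdfSketch(int minN, int maxN) {
        this.minN = minN;
        this.maxN = maxN;
    }

    // Counts of every n-gram of length minN..maxN in s.
    Map<String, Integer> counts(String s) {
        Map<String, Integer> tf = new HashMap<>();
        for (int n = minN; n <= maxN; ++n)
            for (int i = 0; i + n <= s.length(); ++i)
                tf.merge(s.substring(i, i + n), 1, Integer::sum);
        return tf;
    }

    // Update document frequencies with one name from the collection.
    void train(String s) {
        ++numDocs;
        for (String gram : counts(s).keySet())
            docFreq.merge(gram, 1, Integer::sum);
    }

    // Rare n-grams get large weights; unseen or universal ones get zero.
    double idf(String gram) {
        Integer df = docFreq.get(gram);
        return df == null ? 0.0 : Math.log((double) numDocs / df);
    }

    // Cosine of the two TF/IDF vectors, between 0 and 1.
    double proximity(String a, String b) {
        Map<String, Integer> ta = counts(a);
        Map<String, Integer> tb = counts(b);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : ta.entrySet()) {
            double weight = idf(e.getKey());
            double wa = e.getValue() * weight;
            normA += wa * wa;
            Integer countB = tb.get(e.getKey());
            if (countB != null) dot += wa * countB * weight;
        }
        for (Map.Entry<String, Integer> e : tb.entrySet()) {
            double wb = e.getValue() * idf(e.getKey());
            normB += wb * wb;
        }
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        CharNGramTfIdfSketch tfIdf = new CharNGramTfIdfSketch(2, 4);
        String[] names = {"john smith", "jon smith", "john smyth",
                          "mary jones", "maria jonas"};
        for (String name : names) tfIdf.train(name);
        System.out.println(tfIdf.proximity("john smith", "jon smith"));
        System.out.println(tfIdf.proximity("john smith", "mary jones"));
    }
}
```

A distance can be had as one minus the proximity. Note how an n-gram that shows up in every name (like "jo" in this toy collection) gets zero weight and so contributes nothing to a match.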

3. Rescoring

The method

     double rescore(String[],String[],double);

takes the original fields (first name/last name) as string arrays and a double consisting of the n-gram match score, and allows an arbitrary score to be returned. As distributed, this method just returns the score passed in. The maximum number of results surviving the final ranking is determined by the variable MAX_RESULTS.
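Purely as an illustration of the kind of thing the rescoring hook allows, here’s a hypothetical implementation that adds a fixed bonus when surnames match exactly. The field ordering (forename at index 0, surname at index 1) and the size of the bonus are assumptions made for the example, not something taken from the distribution.

```java
// Illustrative only: one hypothetical way to fill in the rescoring hook.
final class RescoreSketch {

    // ASSUMPTION: index 0 holds the forename and index 1 the surname;
    // the README does not say how the fields are ordered.
    static double rescore(String[] queryFields, String[] indexFields,
                          double ngramScore) {
        boolean sameSurname = queryFields.length > 1
            && indexFields.length > 1
            && queryFields[1].equalsIgnoreCase(indexFields[1]);
        // A fixed bonus rewards exact (case-insensitive) surname matches.
        return sameSurname ? ngramScore + 1.0 : ngramScore;
    }

    public static void main(String[] args) {
        String[] query = {"Jon", "Smith"};
        System.out.println(rescore(query, new String[] {"John", "SMITH"}, 0.6));
        System.out.println(rescore(query, new String[] {"John", "Smyth"}, 0.7));
    }
}
```

Since only the final ranking matters, any monotone adjustment like this is fair game; whether it actually helps would have to be checked against the leaderboard.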

This will write a system response ready for submission to the challenge in the default output directory /runs.

Questions and Comments

Let us know if you have questions or comments about our distribution. Feel free to post your uses here or give the rest of us suggestions on how to improve the baseline.

Apache Lucene 3.0 Tutorial

February 11, 2011

[Update: 10 Feb 2014. Much has changed in Lucene since 3.0. An extensive tutorial for Lucene 4 is now available as a chapter in the book

Text Processing in Java

This chapter covers search, indexing, and how to use Lucene for simple text classification tasks. A bonus feature is a quick reference guide to Lucene’s search query syntax.]

Update (24 July 2012) The tutorial has been updated for Lucene 3.6. See:

With this release of the LingPipe Book, I created a standalone version of the tutorial for version 3 of the Apache Lucene search library.

It contains about 20 pages covering the basics of analysis, indexing and search. It’s distributed with sample code and an Ant build file with targets to run the demos.

Building the Source

The ant build file is in the file src/applucene/build.xml and should be run from that directory. The book’s distribution is organized this way so that each chapter’s demo code is roughly standalone, but they are able to share libs. There are some minor dependencies on LingPipe in the example (jar included), but those are just for I/O and could be easily removed or replicated.

More In-Depth Info on Lucene

The standard reference for Lucene is not its own site or javadoc, which are fairly limited tutorial-wise, but rather the recently released (as of February 2011) book by three Lucene committers:

Looking at the Manning Press page for the book (linked above), I just realized they blurbed one of my previous blog posts, a review of Lucene in Action!

But wait, there’s more

If you’re interested in natural language, or just need a tutorial on character encodings and Java strings and I/O, you can find the rest of the LingPipe book at its home page:

Enjoy. And as always, let me know if you have any comments, here, or directly to

LingPipe Book Draft, Version 0.3

February 10, 2011

I’ve finished the section on naive Bayes and made a bunch of fairly minor updates to earlier sections based on comments from Breck.

As before, you can find the latest version at:

Scaling Jaccard Distance for Document Deduplication: Shingling, MinHash and Locality-Sensitive Hashing

January 12, 2011

Following on from Breck’s straightforward LingPipe-based application of Jaccard distance over sets (one minus the size of their intersection divided by the size of their union) in his last post on deduplication, I’d like to point out a really nice textbook presentation of how to scale the process of finding similar documents using Jaccard distance.

The Book

Check out Chapter 3, Finding Similar Items, from:

It was developed for a Stanford undergrad class, and we know Ullman writes a mean text, so it’s not surprising it’s at the perfect level for serious practitioners. Other presentations I’ve seen have been very theory heavy (though feel free to point out other refs in the comments).

The Drill

Here’s an overview of the scaling process, as currently understood, which may seem like a lot of work until you realize it reduces a quadratic all-versus-all doc comparison problem, each instance of which is hairy, to a linear problem, the constant factor for which is manageable.

Step 1. Build a tokenizer to create snippets of text, typically overlapping “shingles” consisting of sequences of tokens. LingPipe’s TokenNGramTokenizerFactory class is essentially a flexible shingler filter for a basic tokenizer. Of course, if all you need to do is the next steps, you don’t need to build string-based tokens — you only need their hash codes, and that’s done more efficiently using something like a rolling hash (the name “rolling hash” is new to me [at least in the intended sense], but the algorithm should be familiar from the Karp-Rabin string search algorithm, which is well described in Gusfield’s most awesome string algorithms book).

Step 2. From each document, extract multiple shingles. Typically these are just the overlapping n-grams or some stemmed or stoplisted forms thereof, which is where the name “shingle” comes from. Rajaraman and Ullman suggest that a single stop word followed by the next two tokens works well, which isn’t exactly a shingle, though you could use it as a component and make sequences of these items.

Step 3. Calculate minhash values for each of the shingles in a doc. This provides a compressed representation of sets with the nifty property that the chance that minhashes are equal is the same as the Jaccard similarity itself (explained in the book cited above). There’s no Wikipedia page (that I could find), but here’s a nice blog entry on MinHash, which comes with (C#?) code.

Step 4. Calculate locality-sensitive hashes of the minhash values. The point of locality-sensitive hashing is to map similar items to similar buckets. There’s some really cool math here on expected recall and precision, but I wouldn’t trust the naive numbers for natural language text, because of the huge degree of correlation.

Step 5. Test for equality using the locality-sensitive hashes (LSH). This reduces the quadratic problem of comparing all pairs of docs to one with roughly the same accuracy in which each doc is handled in roughly constant time. You can get an idea of what the usual presentation looks like for LSH by considering the LSH Wikipedia page, the first line of which assumes you know what a metric space is!

Step 6. You can then check the actual docs if you want to prevent false positive matches.
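To give a feel for how steps 2 through 5 fit together, here’s a minimal sketch under some loudly stated assumptions: word-level shingles, hash functions built by salting String.hashCode() and running it through the MurmurHash3 finalizer, and the banding flavor of LSH described in the book. The class name, method names, and parameter values are all made up for illustration, and salted hashes like these are only an approximation to truly random permutations.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Illustrative sketch of steps 2-5; parameter choices are arbitrary.
final class MinHashLshSketch {

    // Step 2: overlapping shingles of k whitespace-separated tokens.
    static Set<String> shingles(String text, int k) {
        String[] toks = text.toLowerCase().trim().split("\\s+");
        Set<String> out = new HashSet<>();
        for (int i = 0; i + k <= toks.length; ++i)
            out.add(String.join(" ", Arrays.copyOfRange(toks, i, i + k)));
        return out;
    }

    // A bijective 32-bit mixer (the finalizer from MurmurHash3).
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Step 3: one minimum per salted hash function. The same seed must be
    // used for every document so the signatures are comparable.
    static int[] signature(Set<String> shingles, int numHashes, long seed) {
        Random rnd = new Random(seed);
        int[] sig = new int[numHashes];
        for (int i = 0; i < numHashes; ++i) {
            int salt = rnd.nextInt();
            int min = Integer.MAX_VALUE;
            for (String s : shingles)
                min = Math.min(min, mix(s.hashCode() ^ salt));
            sig[i] = min;
        }
        return sig;
    }

    // The fraction of agreeing positions estimates Jaccard similarity.
    static double estimate(int[] a, int[] b) {
        int same = 0;
        for (int i = 0; i < a.length; ++i)
            if (a[i] == b[i]) ++same;
        return same / (double) a.length;
    }

    // Step 4: hash each band of `rows` signature values to a bucket key.
    static Set<String> bandKeys(int[] sig, int rows) {
        Set<String> keys = new HashSet<>();
        for (int start = 0; start + rows <= sig.length; start += rows) {
            int[] band = Arrays.copyOfRange(sig, start, start + rows);
            keys.add((start / rows) + ":" + Arrays.hashCode(band));
        }
        return keys;
    }

    // Step 5: docs sharing any bucket key are candidate duplicates.
    static boolean isCandidatePair(int[] a, int[] b, int rows) {
        Set<String> shared = bandKeys(a, rows);
        shared.retainAll(bandKeys(b, rows));
        return !shared.isEmpty();
    }

    public static void main(String[] args) {
        int[] a = signature(shingles("obama knocks fifa decision to award qatar the world cup", 2), 120, 7L);
        int[] b = signature(shingles("rt obama knocks fifa decision to award qatar the world cup lol", 2), 120, 7L);
        int[] c = signature(shingles("playing a show in chicago at nine tonight", 2), 120, 7L);
        System.out.println("a~b estimate: " + estimate(a, b) + ", candidate: " + isCandidatePair(a, b, 4));
        System.out.println("a~c estimate: " + estimate(a, c) + ", candidate: " + isCandidatePair(a, c, 4));
    }
}
```

In a real system the bucket keys would go into a hash table from key to document ids, so each new doc only gets compared against whatever is already sitting in its buckets; that’s where the quadratic comparison goes away.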

The book draft has nice examples of all of these things. It also goes over the theoretical justifications of why these approaches work, but doesn’t get bogged down in lots of math — it just sticks to what you need to know to understand the problem and build an implementation yourself. In fact, I may do just that.
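As a small first installment, here’s a sketch of the rolling hash idea from Step 1, in the Karp-Rabin style. To keep it short it hashes character windows rather than token windows, and it uses base 31 with ordinary int overflow, a choice I’ve made so that each window’s hash coincides with String.hashCode() of the same substring, which makes the result easy to check. The class name is made up.

```java
// Illustrative Karp-Rabin style rolling hash over character k-shingles.
final class RollingShingleHashSketch {

    // Returns the hash of every length-k window of s, in order, updating
    // each hash from the previous one in constant time.
    static int[] shingleHashes(String s, int k) {
        if (k <= 0 || s.length() < k) return new int[0];
        final int base = 31;
        int highPow = 1;  // base^(k-1), with int wrap-around
        for (int i = 1; i < k; ++i) highPow *= base;

        int[] hashes = new int[s.length() - k + 1];
        int h = 0;
        for (int i = 0; i < k; ++i)  // hash of the first window
            h = h * base + s.charAt(i);
        hashes[0] = h;

        for (int i = 1; i < hashes.length; ++i) {
            // Drop the char leaving the window, shift, add the new char.
            h = (h - s.charAt(i - 1) * highPow) * base + s.charAt(i + k - 1);
            hashes[i] = h;
        }
        return hashes;
    }

    public static void main(String[] args) {
        String text = "locality sensitive hashing";
        int k = 5;
        int[] hashes = shingleHashes(text, k);
        for (int i = 0; i < hashes.length; ++i) {
            boolean agrees = hashes[i] == text.substring(i, i + k).hashCode();
            System.out.println(text.substring(i, i + k) + " " + hashes[i] + " " + agrees);
        }
    }
}
```

The payoff is that no substrings ever need to be built: each shingle’s hash costs a couple of multiplications regardless of k.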

Tunable Tradeoffs in Accuracy versus Speed

One very cool feature of this approach is that it’s probabilistic in the sense that you can trade efficiency for accuracy. By using more and more shingles and more and more LSH, you can get pretty much arbitrarily close to 1.0 in accuracy. Given that the problem’s not so well defined already, we can usually tolerate a bit of error on both the false positive and false negative side.
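The trade-off can be put in numbers for the banding scheme the book describes: if signatures are cut into b bands of r minhashes each, two docs with Jaccard similarity s land in a common bucket with probability 1 - (1 - s^r)^b, assuming the minhashes behave independently. Here’s a tiny sketch that tabulates it; the class name and the particular values of r and b are arbitrary picks for illustration, and given the correlations in real text, the numbers should be taken as a guide rather than a guarantee.

```java
// Illustrative: the banding trade-off under an independence assumption.
final class BandingCurveSketch {

    // Chance that two docs with Jaccard similarity s agree on all `rows`
    // minhashes in at least one of `bands` bands.
    static double candidateProbability(double s, int rows, int bands) {
        return 1.0 - Math.pow(1.0 - Math.pow(s, rows), bands);
    }

    public static void main(String[] args) {
        for (int bands : new int[] {5, 20, 80}) {
            for (double s = 0.2; s < 0.95; s += 0.2) {
                System.out.printf("bands=%d s=%.1f p=%.4f%n", bands, s,
                                  candidateProbability(s, 5, bands));
            }
        }
    }
}
```

Adding bands pushes the probability of catching true near duplicates toward 1, but it also lets in more dissimilar pairs, which is what the final check against the actual docs is for.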

What Does Nutch Do?

The Apache Nutch project, based on the Apache Lucene search engine, is intended to be a web-scale crawler and indexer. It has an implementation of a similar algorithm, though I can’t find any details other than a comment that they’re using MD5 hashes to approximate step 6 (that’s a pretty good approximation). Does anyone have a pointer to how it works?

Monitoring Convergence of EM for MAP Estimates with Priors

January 4, 2011

I found it remarkably hard to figure out how to monitor convergence for the expectation maximization (EM) estimation algorithm. Elementary textbook presentations often just say “until convergence”, which left me scratching my head. More advanced presentations often leave you in a sea of generalized maximization routines and abstract functionals.

Typically, EM is phrased for maximum likelihood estimation (MLE) problems where there are no priors. Given data y and parameters \theta, the goal is to find the parameters \theta^* that maximize the likelihood function p(y|\theta).

Likelihood and Missing Data

Usually EM is used for latent parameter problems, where there are latent variables z which are treated like missing data, so that the full likelihood function is actually p(y,z|\theta). For instance, z might be mixture component indicators, as in soft (EM) clustering. Typically the full likelihood is factored as p(y,z|\theta) = p(z|\theta) \times p(y|z,\theta).

Even though the expectation (E) step of EM computes “expectations” for z given current estimates of \theta and the data y, these “expectations” aren’t used in the likelihood calculation for convergence. Instead, the form of likelihood we care about for convergence marginalizes z away. Specifically, the maximum likelihood estimate \theta^* is the one that maximizes the likelihood with z marginalized out,

p(y|\theta) = \int p(y,z|\theta) \ dz = \int p(z|\theta) \times p(y|z,\theta) \ dz.

Monitoring Likelihood or Parameters

There’s more than one way to monitor convergence. You can monitor either the differences in log likelihoods (after marginalizing out the latent data) or the differences in parameters (e.g. by Euclidean distance, though you might want to rescale). Log likelihood is more task-oriented, and thus more common in the machine learning world. But if you care about your parameters, you may want to measure them for convergence, because …

Linearly Separable Data for Logistic Regression

In data that’s linearly separable on a single predictor, the maximum likelihood coefficient for that predictor is infinite. Thus the parameters will never converge. But as the parameter approaches infinity, the difference its (absolute) growth makes to log likelihood diminishes (we’re way out on the extremes of the logistic sigmoid at this point, where the slope’s nearly 0).
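Here’s a toy illustration of that behavior, with a made-up four-point data set that is perfectly separated at zero and a logistic regression with a single slope and no intercept. The data and names are invented for the example.

```java
// Illustrative: log likelihood on separable data as the coefficient grows.
final class SeparableLogisticSketch {
    static final double[] X = {-2.0, -1.0, 1.0, 2.0};
    static final int[] Y = {0, 0, 1, 1};  // perfectly separated at x = 0

    // Log likelihood of a no-intercept logistic regression with slope beta.
    static double logLikelihood(double beta) {
        double ll = 0.0;
        for (int i = 0; i < X.length; ++i) {
            double margin = (Y[i] == 1 ? 1.0 : -1.0) * beta * X[i];
            ll -= Math.log1p(Math.exp(-margin));  // adds log sigmoid(margin)
        }
        return ll;
    }

    public static void main(String[] args) {
        for (double beta = 1.0; beta <= 64.0; beta *= 2.0)
            System.out.printf("beta=%5.1f  logLik=%.10f%n", beta, logLikelihood(beta));
    }
}
```

The log likelihood keeps creeping up toward zero as the slope doubles, but by smaller and smaller amounts, so a likelihood-based stopping rule will fire even though the coefficient itself is still running off to infinity.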

Convergence with MAP?

Textbooks often don’t mention, either for philosophical or pedagogical reasons, that it’s possible to use EM for general maximum a posteriori (MAP) estimation when there are priors. Pure non-Bayesians talk about “regularization” or “shrinkage” (specifically the ridge or lasso for regression problems) rather than priors and MAP estimates, but the resulting estimate’s the same either way.

Adding priors for the coefficients, even relatively weak ones, can prevent estimates from diverging, even in the case of separable data. In practice, maximum a posteriori (MAP) estimates will balance the prior and the likelihood. Thus it is almost always a good idea to add priors (or “regularize” if that goes down better philosophically), if nothing else to add stability to the estimates in cases of separability.

Maximization Step with Priors

In EM with priors, the maximization step needs to set \theta^{(n)}, the parameter estimate in the n-th epoch, to the value that maximizes the total probability, \log p(y|\theta) + \log p(\theta), given the current “expectation” for the latent parameters z based on the data and previous epoch’s estimate of \theta. That is, you can’t just set \theta^{(n)} to maximize the likelihood, \log p(y|\theta). There are analytic solutions for the maximizer in many conjugate settings like Dirichlet-Multinomial or Normal-Normal, so this isn’t as hard as it may sound. And often you can get away with increasing it rather than maximizing it (leading to the so-called generalized EM algorithm, GEM).

Convergence with Priors

Well, you could just monitor the parameters. But if you want to monitor the equivalent of likelihood, you need to monitor the log likelihood plus log prior, \log p(y|\theta) + \log p(\theta), not just the log likelihood \log p(y|\theta). What EM guarantees is that no iteration decreases this sum. If you just monitor the likelihood term \log p(y|\theta), you’ll see it bouncing around rather than monotonically increasing. That’s because the prior’s having its effect, and you need to take that into account.
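Here’s a minimal sketch of that kind of monitoring on about the simplest example I could think of: a two-component mixture with known unit-variance normal components at -1 and +1, an unknown mixing weight \theta, and a Beta(a, b) prior on \theta. The model, the data, and all the names are assumptions chosen to keep the example short; the M step uses the closed-form posterior mode for this conjugate setup.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative: EM for the MAP estimate of one mixing weight.
final class MapEmMonitorSketch {

    static double normal(double y, double mean) {  // unit-variance density
        double d = y - mean;
        return Math.exp(-0.5 * d * d) / Math.sqrt(2.0 * Math.PI);
    }

    // log p(y|theta) + log p(theta), up to the Beta(a, b) normalizer,
    // which is constant in theta. The latent z is marginalized out by
    // summing over the two components inside the log.
    static double objective(double[] y, double theta, double a, double b) {
        double lp = (a - 1.0) * Math.log(theta)
                  + (b - 1.0) * Math.log(1.0 - theta);
        for (double yi : y)
            lp += Math.log(theta * normal(yi, -1.0)
                           + (1.0 - theta) * normal(yi, 1.0));
        return lp;
    }

    // One epoch: the E step computes responsibilities, the M step takes
    // the posterior mode given them (assumes a > 1 and b > 1).
    static double emStep(double[] y, double theta, double a, double b) {
        double sumResp = 0.0;
        for (double yi : y) {
            double p1 = theta * normal(yi, -1.0);
            double p2 = (1.0 - theta) * normal(yi, 1.0);
            sumResp += p1 / (p1 + p2);
        }
        return (sumResp + a - 1.0) / (y.length + a + b - 2.0);
    }

    // Returns {theta, objective} per epoch; stops when the objective
    // improves by less than tol.
    static List<double[]> fit(double[] y, double a, double b,
                              double tol, int maxEpochs) {
        List<double[]> trace = new ArrayList<>();
        double theta = 0.5;
        double prev = objective(y, theta, a, b);
        trace.add(new double[] {theta, prev});
        for (int epoch = 0; epoch < maxEpochs; ++epoch) {
            theta = emStep(y, theta, a, b);
            double cur = objective(y, theta, a, b);
            trace.add(new double[] {theta, cur});
            if (cur - prev < tol) break;
            prev = cur;
        }
        return trace;
    }

    public static void main(String[] args) {
        double[] y = {-1.2, -0.8, -1.5, 0.9, 1.1, -0.3, -2.0, 1.4};
        for (double[] row : fit(y, 2.0, 2.0, 1e-8, 100))
            System.out.printf("theta=%.6f  logLik+logPrior=%.8f%n", row[0], row[1]);
    }
}
```

The second column of the printed trace is the quantity to watch; monitoring the differences in \theta in the first column would be the parameter-based alternative.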

Processing Tweets with LingPipe #3: Near duplicate detection and evaluation

January 2, 2011


In Post #1 we covered how to search Twitter and get a useful on disk data representation of the search results. In Post #2 we covered the first level of deduplication using HashSets as the test of sameness. We extended sameness by doing some normalization of the tweets which included removing urls and retweets.

You can download the code with this tarball or get it from subversion with the command

svn co

Despite the solutions above we still have annoyingly similar tweets that are not useful for my goals for this set of posts. In particular the near duplicates really foul up any attempts at cross validation where one part of the data is used as training and the other as test data. If there are duplicates then we are training on test data for some examples and results end up looking too good.

Tokenization and Jaccard Distance

The duplicate detection problem in Twitter is really about word overlap with a slight game of telephone quality to it. The elaborations of previous tweets tend to be leading or following comments with the core of the source tweet preserved. Not much rephrasing is going on so word overlap between tweets is the obvious place to go. That entails a few elaborations to our approach:

  1. Find words in the tweets: Tokenization
  2. Measure similarity of tweets without sensitivity to tweet length: Jaccard Distance


A very simple tokenizer can just break on what characters separate words as follows:

public class Tokenize1 {
    //Approach #1, find token separators
    public static void main (String[] args) {
        String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?";
        System.out.println("Tweet: " + tweet);
        String[] tokens = tweet.split(" "); //break on single spaces
        for (String token : tokens) {
            System.out.println("Token: " + token);
        }
    }
}

This code is in src/ It produces output:

<ant tokenize1
     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?
     [java] Token: RT
     [java] Token: @mparent77772:
     [java] Token: Should
     [java] Token: Obama's
     [java] Token: 'internet

Another approach would be to try and define what characters belong in a token and match that, but that is a little more complex to program because String does not have a simple method call to capture an array of regex matches–see the regex chapter in the LingPipe book for more discussion:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Tokenize2 {
    //Approach 2, define what is a token
    public static void main (String[] args) {
        String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?";
        Pattern tokenPattern = Pattern.compile("\\w+"); //match one or more word chars (letters, digits, underscore)
        System.out.println("Tweet: " + tweet);
        Matcher matcher
            = tokenPattern.matcher(tweet);
        while (matcher.find()) { //keep matching until no more matches
            System.out.println("Token: " + matcher.group());//print what matched
        }
    }
}

This approach produces the following output:

<ant tokenize2  
     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?
     [java] Token: RT
     [java] Token: mparent77772
     [java] Token: Should
     [java] Token: Obama
     [java] Token: s
     [java] Token: internet

Note that the text separating the tokens is very different. Between ‘RT’ and ‘mparent77772’ is the separator ‘ @’. Depending on the needs of the application it can be easier to define tokens by what they look like rather than what the spaces around them look like. Often it is both.

While these approaches might work for the case at hand we will introduce the LingPipe tokenizers instead. They offer much richer tokenization options ready to go–see chapter 8 of the LingPipe book draft for more details or look at the Javadoc.

An equivalent tokenization in the LingPipe API is created as follows:

import com.aliasi.tokenizer.RegExTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;

public class Tokenize3 {
    public static void main (String[] args) {
        //approach #3, A LingPipe tokenizer
        String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?";
        TokenizerFactory tokFactory
            = new RegExTokenizerFactory("\\w+");//match one or more word chars (letters, digits, underscore)
        System.out.println("Tweet: " + tweet);
        char[] chars = tweet.toCharArray();
        Tokenizer tokenizer
            = tokFactory.tokenizer(chars,0,chars.length);
        String token;
        //report the whitespace before the first token, then alternate
        System.out.println("White Space :'" +  tokenizer.nextWhitespace() + "'");
        while ((token = tokenizer.nextToken()) != null) {
            System.out.println("Token: " + token);
            System.out.println("White Space :'" + tokenizer.nextWhitespace()+"'");
        }
    }
}

Its output reports on both the white spaces as well as the tokens.

     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?
     [java] White Space :''
     [java] Token: RT
     [java] White Space :' @'
     [java] Token: mparent77772
     [java] White Space :': '
     [java] Token: Should

We add the normalization features of post #2 by extending the ModifyTokenTokenizerFactory to filter retweets and urls. A good way to develop a tokenizer is to put it into its own class and create a stand-alone task that runs the tokenizer over example data. In ‘src/’ there is a main method:

public static void main(String[] args) {
        TokenizerFactory tokFactory = new NormalizedTokenizerFactory();
        String text = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?";
        System.out.println("Tweet: " + text);
        char[] chars = text.toCharArray(); //convert to charArray
        Tokenizer tokenizer
            = tokFactory.tokenizer(chars,0,chars.length);
        String token;
        System.out.println("White Space :'" +  tokenizer.nextWhitespace() + "'");
        while ((token = tokenizer.nextToken()) != null) {
            System.out.println("Token: " + token);
            System.out.println("White Space :'" + tokenizer.nextWhitespace()+"'");
        }
}

Running the above with ant normalizedTokenizerFactorty produces nicely normalized output.

     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?
     [java] White Space :''
     [java] Token: Should
     [java] White Space :' '
     [java] Token: Obama
     [java] White Space :'''
     [java] Token: s
     [java] White Space :' ''
     [java] Token: internet

How that tokenizer functions is left as an exercise for the reader. Time to take on Mr. Jaccard.

Jaccard Distance as a Measure of Similarity

Jaccard Distance is a good way to impress folks with a fancy sounding term that is really just percent word overlap in our usage. If you think it helps you can also call it by the term ‘coefficient de communauté’ but then you might have to reveal that it was popularized by a Swiss botanist–kind of undermines its jargon impact score. From the Jaccard Javadoc the proximity of two character sequences:

   proximity(cs1,cs2)
   = size(termSet(cs1) INTERSECT termSet(cs2))
     / size(termSet(cs1) UNION termSet(cs2))

We get the term sets from the tokenizer, and proximity is the percentage of tokens shared by both character sequences. The code to compute this for all pairs of tweets in our search results is in src/ and the relevant bit of code is:

        JaccardDistance jaccardD = new JaccardDistance(tokFactory);
        int filteredCount = 0;
        List<String> candidateTweets
            = filterNormalizedDuplicates(texts,
                                         tokFactory); //throw out easy cases
        System.out.println("Normalized duplicate filter leaves " + candidateTweets.size() + " tweets");
        row = new ArrayList();
        for (int i = 0; i < candidateTweets.size(); ++i) {
            String closestTweet = "default value";
            double closestProximity = -1d;
            String targetTweet = candidateTweets.get(i);
            for (int j = 0; j < candidateTweets.size(); ++j ) {//cross product, ouchy, ow ow.
                String comparisionTweet = candidateTweets.get(j);
                double thisProximity
                    = jaccardD.proximity(targetTweet,comparisionTweet);
                if (i != j) { // can't match self
                    if (closestProximity < thisProximity) { //keep the best match so far
                        closestTweet = comparisionTweet;
                        closestProximity = thisProximity;
                    }
                }
            }
            //... rest of the outer loop body is not shown here
        }

The goal of the above wildly inefficient program is to explore what the closest tweet is as determined by token overlap. The call filterNormalizedDuplicates(texts,tokFactory) filters out duplicates as discussed in #2. We can then decide on a threshold to throw out tweets with too much overlap. Running the program on the Obama.csv example:

ant exploreJaccard -Ddata=searches/Obama.csv -Doutfile=runs/Obama.jaccard.csv

and then viewing the output .csv file with a sort on column B we get:

[Figure: Jaccard Distance sorted for similarity]

Note that we get some values of 1 even though there are differences in Tweet1 and Tweet2. In row 2 the normalization filters out the url, leaving as the only difference that the phrase “replacing trains by high speed buses” in Tweet1 gains an additional “high speed” in “replacing high speed trains by high speed busses” in Tweet 2. Since Jaccard works on sets of words rather than word counts, the additional phrase adds nothing to the set because its words already occur in the tweet.
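The set-versus-count point is easy to check with a few lines of plain Java. This is a stand-in for illustration, not LingPipe’s JaccardDistance: it tokenizes with the simple \w+ pattern from earlier, the class name is made up, treating two empty token sets as proximity 1 is my own convention, and the two example strings are spelled identically apart from the repeated phrase.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in for set-based Jaccard proximity over tokens.
final class SetJaccardSketch {
    private static final Pattern TOKEN = Pattern.compile("\\w+");

    // Distinct tokens only: repeats collapse into a single set element.
    static Set<String> tokenSet(String text) {
        Set<String> tokens = new HashSet<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) tokens.add(matcher.group());
        return tokens;
    }

    // Size of the intersection divided by size of the union.
    static double proximity(String a, String b) {
        Set<String> setA = tokenSet(a);
        Set<String> setB = tokenSet(b);
        Set<String> union = new HashSet<>(setA);
        union.addAll(setB);
        if (union.isEmpty()) return 1.0;  // convention chosen for this sketch
        Set<String> intersection = new HashSet<>(setA);
        intersection.retainAll(setB);
        return intersection.size() / (double) union.size();
    }

    public static void main(String[] args) {
        System.out.println(proximity("replacing trains by high speed buses",
                                     "replacing high speed trains by high speed buses"));
        System.out.println(proximity("a b c d", "c d e f"));
    }
}
```

The first pair comes out at exactly 1 because both strings reduce to the same six-word set, while the second pair shares two of six distinct tokens.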

Scrolling down reveals just how bad redundancy can be with tweets. At the 50% overlap point the tweets are still looking quite similar:

	Vet, 77, Busted For Obama Death Threat | The Smoking Gun via @
	Vet, 77, Busted For Obama Death Threat #tcot #tlot #sgp

There are 14 unique tokens with 7 overlapping, yielding 50% overlap. 36% of the Obama query tweets have an overlap of 50% or more with another tweet. Trying a different query, "the", finds less overlap between tweets: only 3% of the tweets have another tweet with 50% overlap. For the query "today", 14% of tweets have such an overlap. The lesson here is that some queries yield more uniform tweets, which impacts subsequent language processing. A more technical way of expressing this observation is that the entropy of the resulting search results varies depending on the query. The "obama" search result entropy is lower (less random) than that of the results for "the", which are more random.

The ‘today’ query had usefully different tweets at the 50% overlap level:

Playing a show in Chicago, IL at 9:00 PM today at LE PASSAGE	
Playing a show in Cape Girardeau, MO at 9:00 PM today at The Venue

Despite sharing half the words they are clearly not retweets and contain different information. It might be quite difficult to automatically reject ‘obama’ near duplicates at 50% token overlap but retain the ‘today’ near duplicates with 50% overlap–I encourage people to suggest approaches in the comments.

Finally we add a class that will take a set of queries and filter them for near duplicates at a given Jaccard proximity in src/ The interesting bit is:

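    public static List<String> filterTweetsJaccard(List<String> texts,
                                       TokenizerFactory tokFactory,
                                       double cutoff) {
        JaccardDistance jaccardD = new JaccardDistance(tokFactory);
        List<String> filteredTweets = new ArrayList<String>();
        for (int i = 0; i < texts.size(); ++i) {
            String targetTweet = texts.get(i);
            boolean addTweet = true;
            //big research literature on making the below loop more efficient
            for (int j = 0; j < filteredTweets.size(); ++j) { //compare to tweets that already passed
                String comparisionTweet = filteredTweets.get(j);
                double thisProximity
                    = jaccardD.proximity(targetTweet,comparisionTweet);
                if (thisProximity >= cutoff) {
                    addTweet = false;
                    break; //one nod to efficiency
                }
            }
            if (addTweet) {
                filteredTweets.add(targetTweet);
            }
        }
        return filteredTweets;
    }
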
Deduplication along these lines is a big deal for web search as well as cleaning up Twitter feeds. The painful bit is the comparison of each new tweet to all the tweets that have passed the filter before which does not scale well. Much work has gone into hashing schemes that test for a hash match based on a lexical fingerprint of the existing corpus of tweets as well as more sophisticated similarity metrics. A recent paper with a decent overview of past work is

Running the code on the Obama search results removes half the tweets with a .5 cutoff.

<ant dedupeJaccard -Ddata=searches/Obama.csv -Doutfile=runs/ObamaJaccardFiltered.csv
Buildfile: build.xml
    [javac] Compiling 1 source file to /home/breck/devguard/teaching/lingpipe4twitter/build/classes

      [jar] Building jar: /home/breck/devguard/teaching/lingpipe4twitter/demo.jar

     [java] Writing to: runs/ObamaJaccardFiltered.csv
     [java] Filtered size: 752 originally 1500

Looking at the tweets another issue immediately presents itself:

Coupe du monde - Obama: "Mauvaise décision": Le président Barack Obama a… #Maroc #sport #foot
RT @Luminary212: LOL “@BreakingNews: Obama knocks FIFA's decision to award Qatar, not the U.S., the 2022 World Cup: 'Wrong decision'”
I want to meet President Obama.
RT @laaficion Barack Obama, presidente de EU, consideró que la FIFA se equivocó al darle a Qatar el Mundial de 2022 y no a su país// celos

All sorts of languages are in the data. It looks like there is French and Spanish mixed in with the English. The next post will cover the creation of a language identification classifier, which will return to the notion of entropy discussed above.


In the quest to clean up the search results I have introduced some key concepts in text processing that will resurface in various forms throughout the posts. Summarizing:

  • Tokenization: Breaking strings into words/word separators is key to getting at the unit of meaning for word based processing. Oddly enough in the next post we won’t be using words for language id, but we will be still tokenizing.
  • Normalization: By eliminating trivial differences we can make text look more uniform, but beware of this tendency.
  • Entropy: Some collections of data are more uniform than others.
  • Similarity: There are measures like Jaccard Distance to estimate the similarity of text. We will see many more examples that implement topic assignment, sentiment and more.
  • Evaluation: We have picked an operating point for near duplicate detection by examining data. While the evaluation metric remains quite loose, in the next post we will both create gold-standard (or truth) data and evaluate the performance of our language id system against it.


A Call to Code: Microsoft Research/Bing Query Spell Check Challenge

December 21, 2010

And I quote: "Microsoft Research in partnership with Bing is happy to launch the Speller Challenge." First place is $10,000, starts Jan 17, 2011, contest is on May 27, 2011.

Speller Challenge

We have the baseline implementation of the required system in our tutorials, see Spelling Tutorial. The system is not wrapped in a web service and you will need to dig up your own training data—Wikipedia anyone?

There may be licensing issues if you want to use LingPipe tools to build the system; here is our Royalty Free License. For the purposes of this contest we will create a free license for contest use that is non-restrictive. If you’d like to request a free license for the contest, send me an email at


Processing Tweets with LingPipe #2: Finding Duplicates with Hashing and Normalization

December 2, 2010

This post is about taking the csv file format used in post #1 and eliminating the huge number of annoying duplicates or near duplicates that exist in Twitter search results. The next entry in the series will cover more open ended near duplicate detection strategies such as Jaccard Distance and will introduce the very important area of evaluation.

As before you can download the code with this tarball or get it from subversion with the command

svn co

We will start with a simple ingest and representation of the csv data and consider how to find near or exact duplicates without concern for computational efficiency. The research literature in this area is very concerned with scalable performance. I will instead focus on more ideal solutions that favor simplicity and quality.

Hashing Driven Deduplication

A simple way to find exact duplicates is to use a HashSet data structure in Java for comparison of tweets. Once we get the tweets into a list, we iterate over the tweets to see whether the exact string is in the HashSet. Below is the recognition phase from src/

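	List<String> texts = parseFile(dataDir);
	HashSet<String> seenBefore = new HashSet<String>();
	int seenCounter = 0; //number of exact duplicates found
	for (int i=0; i < texts.size(); ++i) {
	    String tweet = texts.get(i);
	    if (seenBefore.contains(tweet)) {
		++seenCounter;
		System.out.println("Seen Before at postition:" + i +": " + tweet);
	    }
	    else {
		seenBefore.add(tweet);
	    }
	}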

The code iterates over the CSV format search results, tests whether the HashSet already contains the string, and adds the string if it does not. I could have just added all the strings and the HashSet would have behaved the same, but that is slightly automagical and might prove confusing. Below is the code for writing out the CSV format for presumably unique tweets:

	File output = new File(outDir,outFile);
	System.out.println("Writing to: " + output.toString());
	FileOutputStream stream =  new FileOutputStream(output);
	OutputStreamWriter streamWriter
	    = new OutputStreamWriter(stream,Strings.UTF8);
	CsvListWriter writer
	    = new CsvListWriter(streamWriter,CsvPreference.EXCEL_PREFERENCE);
	ArrayList<String> row = new ArrayList<String>();
	row.add("Unique Tweet");
	writer.write(row); //header for csv file
	for (String tweet : seenBefore) {
	    row.clear(); //reuse the same list for each row
	    row.add(tweet);
	    writer.write(row);
	}
	writer.close(); //flushes and closes the underlying stream
	System.out.println(seenCounter + " seen of " + texts.size());

Note that I have kept the tweet in the same row as the search results. This convention will allow interaction with other components.

Running the program on searches/Obama.csv yields the following:

>ant dedupeSimpleHash -Ddata=searches/Obama.csv -Doutdir=runs -Doutfile=Obama.simpleDedupeHash.csv
Buildfile: build.xml



     [java] 2
     [java] Seen Before at postition:48: RT @rolandsmartin: President Obama: 'I Don't Think About Sarah Palin' -
     [java] Seen Before at postition:58: RT @New_federalists: Obama lets Islamic terrorists pour across the Mexican border, but treats US citizens as criminals at airports!!! #RESIST #REBEL !
     [java] Seen Before at postition:1490: RT @Cubachi: RT @PalinTV: Obama can't even pardon a Turkey without notes
     [java] Seen Before at postition:1499: RT @BreakingNews: President Barack Obama pardons turkeys ?Apple? and ?Cider?; says feels good to prevent one ?shellacking? this year
     [java] Writing to: runs/Obama.simpleDedupeHash.csv
     [java] 297 seen of 1500
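
The remark above, that simply adding every string would behave the same, can be made concrete with a small sketch. It relies on the fact that Set.add returns false when the element is already present. The class name AddOnlyDedupe and the toy input are made up for illustration, and LinkedHashSet is swapped in for HashSet only so that the printed order is predictable.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the add-only variant; not code from the tarball.
class AddOnlyDedupe {

    // Adds every text to seenBefore and returns how many were exact repeats.
    static int countDuplicates(List<String> texts, Set<String> seenBefore) {
        int seenCounter = 0;
        for (String tweet : texts) {
            // Set.add returns false when the element was already in the set.
            if (!seenBefore.add(tweet)) {
                ++seenCounter;
            }
        }
        return seenCounter;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("a", "b", "a", "c", "b");
        Set<String> seenBefore = new LinkedHashSet<String>();
        int duplicates = countDuplicates(texts, seenBefore);
        System.out.println(duplicates + " seen of " + texts.size()); // prints "2 seen of 5"
        System.out.println(seenBefore); // prints "[a, b, c]"
    }
}
```

The explicit contains test in the post does the same work in two steps, which is easier to read at the cost of a second hash lookup per tweet.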

Duplicate Detection via Normalization

The simple deduplication approach gets rid of 297 exact duplicates, or 20% of the tweets. Pretty good, but looking at the ‘unique’ tweets and sorting on the texts in a spreadsheet shows some other issues:

Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...

Note that your results will vary since we are not running the searches on the same day. These tweets differ only in the URL, which makes them pretty much the same tweet as far as I am concerned. Looking further, I see that tweets get re-tweeted with a prefix of the form ‘RT @user:’ pretty often as well.

RT @mattyglesias: Repealing the whole Affordable Care Act would be a mistake, but Obama should compromise and agree to scrap the death...
Repealing the whole Affordable Care Act would be a mistake, but Obama should compromise and agree to scrap the death...

It appears that removing urls and the retweet prefixes will find more duplicates. That suggests normalization to help make the tweets more uniform by eliminating "trivial" differences. A look at src/ shows regular expressions that replace the url and retweets with the empty string.

	List<String> texts = parseFile(dataDir);
	HashSet<String> seenBefore = new HashSet<String>();
	int seenCounter = 0; //number of duplicates found after normalization
	for (int i=0; i < texts.size(); ++i) {
	    String tweet = texts.get(i);
	    System.out.println("Before:" + tweet);
	    tweet = tweet.replaceAll("\\s*RT\\s*@\\w+:\\s*","");//removes "RT @foobar:"
	    tweet = tweet.replaceAll("https?:[^\\s]*",""); //removes "http://foo" "https://bar"
	    System.out.println("After :" + tweet);
	    if (seenBefore.contains(tweet)) {
		++seenCounter;
		System.out.println("Seen Before at postition:" + i +": " + tweet);
	    }
	    else {
		seenBefore.add(tweet);
	    }
	}

Running the normalized approach, the modifications are evident:

> ant dedupeNormalizedHash -Ddata=searches/Obama.csv -Doutdir=runs -Doutfile=Obama.dedupeNormalizedHash.csv
     [java] Before:RT @rolandsmartin: President Obama: 'I Don't Think About Sarah Palin' -
     [java] After :President Obama: 'I Don't Think About Sarah Palin' -  
     [java] Writing to: runs/Obama.dedupeNormalizedHash.csv
     [java] 454 seen of 1500

Both patterns apply to the tweet, removing ‘RT @rolandsmartin: ’ and the trailing URL. The normalization allows for identification of another 157 duplicates (454 versus 297). Normalization is a standard technique used when working with text data. You see it in search engines where ‘walks’, ‘walking’ and ‘walked’ are normalized to the root word ‘walk’ with a stemmer. Both documents and queries are normalized the same way, allowing a query ‘Books about walking’ to match documents that don’t mention ‘walking’ but do mention ‘walk’, ‘walks’ or ‘walked’. Normalization can create more problems than it solves and should be used carefully; perhaps a future blog post is called for on our preference to use character n-grams in situations where stemming is typically used.
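
The two replacements can be tried in isolation with a short stand-alone sketch. The regular expressions are the ones quoted from the post; the class name TweetNormalizer, the final trim, and the example.com and example.org URLs are assumptions added for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the normalization step; not code from the tarball.
class TweetNormalizer {

    // Strips retweet prefixes and URLs using the post's two patterns.
    // The trailing trim is an added assumption, not part of the original code.
    static String normalize(String tweet) {
        String result = tweet.replaceAll("\\s*RT\\s*@\\w+:\\s*", "");
        result = result.replaceAll("https?:[^\\s]*", "");
        return result.trim();
    }

    // Collects the distinct normalized forms, in first-seen order.
    static Set<String> uniqueNormalized(List<String> tweets) {
        Set<String> unique = new LinkedHashSet<String>();
        for (String tweet : tweets) {
            unique.add(normalize(tweet));
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
            "RT @foobar: Hello world http://example.com/x",
            "Hello world https://example.org/y",
            "Something else entirely");
        System.out.println(uniqueNormalized(tweets));
    }
}
```

The first two inputs collapse to the same string, so only two distinct tweets remain. The retweet pattern also eats the whitespace on both sides of a match, so a retweet marker in the middle of a tweet will glue its neighboring words together, a small instance of normalization causing trouble of its own.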

Looking at the output of the normalized hash approach reveals more opportunity to eliminate near duplicates, and there may be more patterns to exploit, but eventually diminishing returns will bring the pattern-driven normalization effort to a close. Let's move to more open-ended approaches that better address the long-tail side of the issue:

Has Obama written a stern letter to North Korea yet? They will respect that.
Has Obama written a stern letter to North Korea yet? They will respect that. <--- indeed!

Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party 
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner W...  #tcot #p2 #tlot
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner W...  #tcot #tlot #p2
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... 

The first pair differ only in an additional ‘<--- indeed!’, which is not a pattern likely to recur in the data, so there is little to exploit beyond this case. The second example shows that the basic tweet has many small variations. In these situations it is useful to work out what algorithm you yourself use to decide that the tweets are (mostly) the same.

In my next post in the series I will cover Jaccard Distance as the method of identifying near duplicates and bring in the importance of evaluation metrics to drive development.