Author Archive

Canceled: Help Build a Watson Clone–Participants Sought for LingPipe Code Camp

January 10, 2014

Code Camp is canceled. We are late delivering a LingPipe recipes book to our publisher, and that will have to be our project for March. But we could have testers/reviewers come and hang out. Less fun, I think.




Dates: To be absolutely clear, the dates are 3/2/2014 to 3/31/2014 in Driggs, Idaho. You can work remotely; we will also be doing setup work beforehand.

New: We have set up a GitHub repository. URL is


Every year we go out west for a month of coding and skiing. Last year it was Salt Lake City, Utah; this year it is Driggs, Idaho, for access to Grand Targhee and Jackson Hole. The house is rented for the month of March and the project selected. This year we will create a Watson clone.

I have blogged about Watson before and I totally respect what they have done. But the coverage is getting a bit breathless with the latest billion-dollar effort. So how about assembling a scrappy bunch of developers and seeing how close we can come to recreating the Jeopardy! beast?

How this works:

  • We have a month of focused development. We tend to code mornings, ski afternoons. House has 3 bedrooms if you want to come stay. Prior arrangements must be made. House is paid for. Nothing else is.
  • The code base will be open source. We are moving LingPipe to the AGPL so maybe that license or we could just apache license it. We want folks to be comfortable contributing.
  • You don’t have to be present to contribute.
  • We had a very fun time last year. We worked very hard on an email app that didn’t quite get launched, but the lesson learned was to start with the project defined.

If you are interested in participating let us know at Let us know your background, what you want to do, whether you expect to stay with us, etc. No visa situations please, and we don’t have any funds to support folks. Obviously we have limited space, physically and mentally, so we may say no, but we will do our best to be inclusive. Step 1: transcribe some Jeopardy! shows.

Ask questions in the comments so all can benefit. Check comments before asking questions pls. I’ll answer the first one that is on everyone’s mind:

Q: Are you frigging crazy???

A: Why, yes, yes we are. But we are also really good computational linguists….


LingPipe Incubator Welcomes Seer

September 9, 2013

We have started an incubator program for start-ups with natural language processing (NLP) needs. Seer is our first egg. The company creates productivity apps based on unstructured data sources–that is where LingPipe fits in. The guys (Conall and Joe) fit a pattern that we see quite often: smart people with a good idea that presupposes NLP, but without much experience in it.

The GetSeer guys were on the ball right away because they had a bunch of data annotated and had cobbled together a regular expression based classifier that served as an excellent starting point. We had machine learning based classifiers up and running within two hours of starting. That included an evaluation harness. Nice.


The great thing about our working setup is that I get to teach/advise, which keeps my time commitment manageable, and they get to do most of the heavy lifting, which is how they learn to build and maintain NLP systems. I have had to learn some Scala, since I co-code most of the time when we work together and that is their implementation language. Scala is a hip extension of Java with a less verbose syntax and stronger type inference.

I’ll keep the blog updated with developments. Current status:

  • There is a small amount of training data.
  • Current results are awful (no surprise).
  • We have been roughing in the solution with Naive Bayes. Next steps will be opening a can of logistic regression for more interesting feature extraction.
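For readers new to the approach, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Java. This is an illustration of the technique only, not Seer's code and not the LingPipe implementation; all class and method names are invented, and the smoothing is the simplest possible (add-one).

```java
import java.util.*;

//Toy multinomial Naive Bayes with add-one smoothing (illustration only).
public class ToyNaiveBayes {
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();
    private int totalDocs = 0;

    public void train(String category, String text) {
        totalDocs++;
        docCounts.merge(category, 1, Integer::sum);
        Map<String, Integer> counts
            = tokenCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
            vocab.add(token);
        }
    }

    public String classify(String text) {
        String best = null;
        double bestLogProb = Double.NEGATIVE_INFINITY;
        for (String cat : docCounts.keySet()) {
            //log prior
            double logProb = Math.log(docCounts.get(cat) / (double) totalDocs);
            Map<String, Integer> counts = tokenCounts.get(cat);
            int catTotal = 0;
            for (int c : counts.values()) catTotal += c;
            //log likelihood with add-one smoothing
            for (String token : text.toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue;
                int c = counts.getOrDefault(token, 0);
                logProb += Math.log((c + 1.0) / (catTotal + vocab.size()));
            }
            if (logProb > bestLogProb) {
                bestLogProb = logProb;
                best = cat;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        ToyNaiveBayes nb = new ToyNaiveBayes();
        nb.train("sports", "goal match team");
        nb.train("tech", "server code deploy");
        System.out.println(nb.classify("new code deploy tonight")); //prints tech
    }
}
```

The real thing in LingPipe handles feature extraction, smoothing choices and evaluation far better; this sketch just shows why "roughing in" a classifier can take an afternoon rather than a month.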

YapMap: Breck’s Fun New Project to Improve Search

January 27, 2012

I have signed on as chief scientist at YapMap. It is a part-time position that grew out of my being on their advisory board for the past three years. Try the search interface for the forums below:

Automotive Forums

YapMap search for Low Carb Breakfast on Diabetes Daily

A screen shot of the interface:

UI for YapMap's search results

What I like about the user interface is that threads can be browsed easily–I have spent hours on remote-controlled airplane forums reading every post, because it is quite difficult to find relevant information within a thread. The color coding and summary views are quite helpful in eliminating irrelevant posts.

My first job is to get query spell checking rolling. Next is search optimized for the challenges of thread based postings. The fact that relevance of a post to a query is a function of a thread is very interesting. I will hopefully get to do some discourse analysis as well.

I will continue to run Alias-i/LingPipe. The YapMap involvement is just too fun a project to pass up given that I get to build a fancy search and discovery tool.


Peter Jackson has Passed Away

August 4, 2011

I am very sorry to convey that Peter Jackson has passed away. Below is an email sent to Thomson Reuters from their CTO.

Peter was a great friend, early customer and advisor to LingPipe. I will miss him dearly.



From James Powell, CTO Thomson Reuters

Dear Colleagues,

It is with great sadness that I must inform you that Peter Jackson, our Chief Scientist and head of R&D, died suddenly this morning.

Peter was a colleague of immense talents, both professionally and personally. He started his career in academia, and after a post as a Lecturer in Artificial Intelligence at Edinburgh University in Scotland he moved to the U.S. where he went on to become Associate Professor in the Department of Electrical and Computer Engineering at Clarkson University, NY.

A true global citizen who was born in Barbados and educated in England, Peter also served as Visiting Professor at the Artificial Intelligence Laboratory for the School of Electrical & Electronic Engineering at Singapore Polytechnic.

Peter switched from the corridors of academia to the meeting rooms of Rochester when he joined Lawyers Cooperative Publishing (LCP) back in 1995, as Lead Software Engineer for the Natural Language Group. LCP was acquired by West Group the following year.

His penetrating intelligence, allied to his outgoing and immediately likeable personality, ensured that Peter moved quickly through the ranks at Thomson; next in the West Group and then at Thomson Legal & Regulatory, where he rose to become Chief Research Scientist & Vice-President, Technology in 2005.

A lasting legacy

Peter was the obvious choice for the position of Chief Scientist when we became Thomson Reuters in 2008, and it was a role he fulfilled with exceptional results, forming a 40-strong R&D group as a corporate-wide resource, and aligning our Research & Development capabilities with the major platform and strategy initiatives of recent years.

Peter’s legacy at Thomson Reuters is significant and lasting. He oversaw the advanced technologies, such as CaRE, Concord and Results Plus, that enabled us to launch radical and successful new product platforms such as WestlawNext and PeopleMap. He was also integral to the development of our innovative Reuters Insider channel.

On top of that he represented Thomson Reuters with great professionalism — and considerable personality and panache — in the wider world of conferences, academia and professional associations. He was a great ambassador for Thomson Reuters in his role overseeing university liaison with regards to joint research projects with institutions of the caliber of MIT, NYU and CMU.

These achievements give some sense of Peter’s abilities and drive. Peter, though, had many more strings to his bow. He published three books and about 40 papers in his fields of expertise (artificial intelligence and natural language processing), and held four US patents.

Above all, when it came to his work, Peter was proudest of the group that he built and worked with side-by-side every day.

If this wasn’t enough for one man, he was also a talented musician and a serious music fan, serving for a number of years on the Board of Directors for the Saint Paul Chamber Orchestra. In recent times, he traveled to Chicago, New York and Ojai to support the SPCO on tour. Many people in Eagan will remember Peter’s “Jazzkickers” band playing at employee events around campus.

He even demonstrated his guitar playing prowess in this internal technology video filmed at Joe’s Pub in NYC earlier this summer.

The many of us who worked with Peter will miss greatly his enthusiasm, the clarity of his thinking, his wisdom and his wit.

He leaves behind his wife Sandy, also a former employee here. Peter was 62.

Sometimes I Wish NLP Systems Occasionally Blew Up

June 28, 2011

The consequences of a badly performing NLP (Natural Language Processing) system tend to be pretty low-key, low-drama events. Bad things may happen, but not in a way that will get noticed in the same way as: Rocket Go Boom

The failure of a rocket launch highly motivates the involved engineers to not have that happen again. Miss that named entity, blam! If only NLP programmers had such explicit negative feedback. I think the field would be better for it.

NLP Systems are Easier to Sell than Build

Customers get the potential value of advanced NLP/Text Analytics/etc in the same way that people get the potential value of space flight.

Space ships in artist's conception

The Dreaded Artist's Conception–picture credit NASA

It would be so cool to do sentiment analysis in low earth orbit! Sadly, the tremendous promise of the field is held back by a combination of overselling, under-delivering and lack of awareness of how to build good performing systems. What contributes to poor performance the most?

Be aware that you are selling to the best NLP systems out there: Humans

One of the greatest frustrations I face is severely underfunded projects. For the most part rockets get a much healthier dose of funding because people see the failures clearly and do not have a grasp of how rockets work. Not so much for NLP. Language processing is so easy for humans that it is like trying to sell cargo airplanes to eagles. They just don’t get what is hard, what is easy and the necessity of infrastructure. “Mr. Eagle, um, well we really need a runway to get the 20 tons of product into the air”. Mr. Eagle responds with “What are you talking about? I can take off, land and raise a family on a tree branch. Cargo planes are easy because flying is easy for me. So I will give you a fish to do the job.”

Don’t ask a banker from 1994 to understand your tweets

Another source of poor performance is reliance on general-purpose solutions that are not well suited or tuned to the domain. It is unrealistic to expect a named entity model to perform well on Twitter if its training data is the 1994 Wall Street Journal. Modules customized for the domain can make a huge difference in performance. But customization costs money, takes time and requires a different mindset.

Understand the problem clearly with hand-coded examples

The #1 bit of advice I give customers is to emulate what you expect the system to do by hand. Then show that to stakeholders to make sure the problem being solved addresses something your business cares about. Also, Mr. Eagle will much better appreciate the need for another solution after ferrying 100 pounds of product 1 pound at a time. By doing this you will have reduced the risk of failure by half, in my opinion.

NLP is hard not only because it is technically difficult, but because it seems easy for humans to do, so people under-appreciate the challenges. If systems blew up spectacularly we might have a better appreciation of that.


Programmers Expect Dog Like Consistency in Software, but NLP is more like Cats

June 21, 2011
Typical Startup Staffing Choices

Photo Credit hoangnam_nguyen at

From the perspective of a traditionally trained programmer, one of the hardest things to appreciate when working with Natural Language Processing (NLP) systems is the approximate, unpredictable quality provided by NLP modules. I liken it to getting dog people to learn to work with cats.

For example, a standard NLP module could decide whether “Bill Jones” is mentioned in the sentence “An announcement was made by Sir Bill Jones on the eve of his retirement…” In database terms the task is to satisfy the query “SELECT person FROM text” or a string function “detectPerson(sentence)”.

A range of answers is possible. They include:

  • Getting it right. “Bill Jones” is definitely a person in the sentence.
  • Getting it right, but hedging. “Bill Jones” has a 75% likelihood of being a person in the sentence. A person could also be “Sir Bill” with 15% likelihood or “Jones on” with likelihood 1% with some other low confidence estimates.
  • Getting it dead wrong. No “Bill Jones” person found at all. It turns out that the system had never seen “Sir” as an honorific in training and had other issues. Sorry, maybe next time — really the system gets a 95% F-measure so it won’t happen often.
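In code, the hedged case means handling a ranked list of candidates rather than a single string. Here is a toy sketch in plain Java; the Mention type and the thresholding policy are invented for illustration and are not any real NER API:

```java
import java.util.*;

public class HedgedMentions {
    //A hypothetical hedged result: a candidate span plus a confidence estimate.
    static class Mention {
        final String span;
        final double probability;
        Mention(String span, double probability) {
            this.span = span;
            this.probability = probability;
        }
    }

    //Keep only candidates at or above a confidence threshold, best first.
    static List<Mention> threshold(List<Mention> candidates, double minProb) {
        List<Mention> kept = new ArrayList<>();
        for (Mention m : candidates) {
            if (m.probability >= minProb) {
                kept.add(m);
            }
        }
        kept.sort((x, y) -> Double.compare(y.probability, x.probability));
        return kept;
    }

    public static void main(String[] args) {
        List<Mention> candidates = Arrays.asList(
            new Mention("Bill Jones", 0.75),
            new Mention("Sir Bill", 0.15),
            new Mention("Jones on", 0.01));
        for (Mention m : threshold(candidates, 0.10)) {
            System.out.println(m.span + " " + m.probability); //drops the 1% reading
        }
    }
}
```

Even this tiny sketch forces the calling code to pick a threshold and an ordering, which is exactly the extra decision making that dog-expecting programmers find uncomfortable.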

Obviously getting it right is what we want. The programmer throws the stick and the program dutifully fetches the information like a dog. Provide a pat on the head and some who’s-a-good-program positive reinforcement and all is well.

Getting it right but hedging is more of a problem. The existence of a probability complicates things a bit. What do you do with probabilities that are low or tied? What about close cases where the spans are overlapping? It’s like throwing a ball and getting a chew toy, a bit of a stick and some regurgitated food.

Getting it dead wrong is really a wrench in the works because it is just so friggin’ obvious that the right answer is sitting right there, pristine. How can the program ignore it? Just go get it, pick it up, and bring it back you stoooopid program. What is this 95% pedigree worth if it can’t handle that? Try playing catch with a cat and you will have exactly the right experience. But it is not the cat’s fault; you can’t expect a cat to act like a dog, but cats are a useful tool if you want to catch mice or amusingly pursue a laser pointer. Like NLP, cats have to be influenced to do what you want, not commanded.

Working with Cats

The challenges of NLP require that the developer work with the strengths of the technology and factor in the shortcomings. It is a different mindset that tends to require more complicated data structures. Also, multiple possible answers slow things down with extra decision-making loops. Systems often need extensive tuning to work with the capabilities of the cat^h^h^h NLP technology. Once that is mastered, the additional flexibility and tolerance of uncertainty of signal in data will result in a very different capability. Systems can degrade more gracefully, apply to more situations and provide a more human flexibility to text processing. But you’ve got to scratch the cat just the right way.

Breck (who loves dogs and cats)

P.S. If you want to get a start on NLP programming check out our tutorials.

IBM’s Watson and the State of NLP

June 14, 2011

Aditya Kalyanpur presented an overview of the Jeopardy!-winning Watson computer system June 6 at in New York for the New York Semantic Web Meetup. I was asked to present a three-minute overview of the state of Natural Language Processing (NLP). In this post I want to couch the Watson system in the context of the state of the art, since it didn’t make sense to do so at the meetup because I presented first.

The State of NLP According to Breck

Conveying the state of the art in three minutes is quite a challenge, so let’s run with an analogy to aviation for ease of comprehension. So where is NLP?

It Flies!

We have achieved the analog of basic powered flight. No doubt.

Yikes! and away

But in no sense have we gotten to this level of performance.

Amelia Earhart's Lockheed Vega

My best guess is that we are at the point of a reasonable commercial foundation as an industry, with some changes to come that we don’t know about yet, not unlike aviation in the mid-1920s. Perhaps the beginning of the Golden Age of NLP.

And in no sense are we in the reliable, high technology commercial space that modern air transport provides.

Boeing 777

Where does Watson Fit in the Analogy?

Watson fits perfectly in the example of the red 1928 Lockheed Vega above for the following reasons:

  • The Vega is actually Amelia Earhart’s plane, which was used to break records (crossing the Atlantic solo), generate publicity, and was a stunning success for a nascent industry.
  • While inspirational, the Vega’s success had little to do with advancing the underlying technology. What would I consider an advancement of technology? Frank Whittle patented the turbojet in 1930.
  • Watson shows how a 20-person team working for 4 years can win a very challenging game with skill, effort and daring, much in the same way that aviation records were broken with the same. Don’t think some careers were not on the line with the Watson effort–I think IBM ceased termination by firing squad in the 70’s, so Earhart had more on the line. But what are the prospects of a mid-level ex-IBM exec in today’s economy? Perhaps the firing squad would be a kindness.

But Watson is Playing a Game

There is one issue that seriously concerns me: Watson won a question answering game with the trivial twist that the answers must be phrased as questions. So the clue “First President of the US” is answered with “Who is George Washington?” But Watson is not a general-purpose question answering system. What is the difference?

Another analogy: The game of chess is based on medieval battles, but even though Deep Blue beats the best human players, one would never consider using Deep Blue to manage an actual battle. Real war is messy, approximate and without clear rules, which makes chess algorithms totally inappropriate.

Real world question answering has similar qualities to real war: messy, approximate and without clear rules. The game of Jeopardy! is based on the existence of a unique, easily understood and verified answer given the clue. Taking one of the examples from the talk:

In 1698, this comet discoverer took a ship called the Paramour Pink on the first purely scientific sea voyage

The correct “question” is “Who is Edmond Halley?”, of Halley’s Comet fame. The example is used to work through an impressive system diagram that resembles a well-developed model train set (thanks to Prof. Mark Steedman for the simile). Much is done to generate the correct answer while avoiding distractors like Peter Sellers from the Pink Panther movies. But run the same clue past Google with “-watson -jeopardy” appended to eliminate pages that discuss this publicized example, and the first result is Halley’s Comet Stamps, with the first sentence mentioning the correct answer.

There is still an impressive amount of work in extracting the correct name, but the answer was ready to be found exactly because it is a game: the answer is unambiguous, well known and well selected given the clue.

What does Real World Question Answering Look Like?

What kinds of questions have I approached a search engine with?

What is the current 30 year FHA mortgage rate?

This question is a disaster from the uniqueness-of-answer perspective. My initial search results were pretty low quality and did not provide accurate rate information for what I knew the answer to be.

When is it best to ski in Chile?

This went better. There was a FAQ on the first page of results but the answer just went on and on. “The season runs from mid-June to mid-October. Although every year is different, and it comes down to Mother Nature, the best time for dry powder is mid-June, July, August, and up to the 2nd week in September. After that,….” Again we have a non-unique answer because my question was not that specific in the first place.

What is the Reputation of LingPipe?

This is a question that a group of Columbia MBA students took on for us in their small business program which I recommend btw.

This question was hopeless in search because there is not a page out there that needs to be found with our reputation nicely summarized. Answering the question requires distillation across many resources even if information was restricted to web only.

Welcome to the real world, question answering is hell.

Where Might Watson Flourish Outside of Jeopardy! Tournaments?

Jeopardy! is a game of finding the uniquely obvious given indirect clues. Otherwise it is not a game that can be judged and played. What else in the world has this quality? The Watson team is now approaching medical diagnosis, which is a real-world use case that might match the Jeopardy! game format, with symptoms as clues and diagnoses as answers. Uniqueness is not guaranteed in diagnosis, but Watson can handle multiple answers. This is an area where computer systems from the 1970s, e.g. Mycin, outperformed experts, but they didn’t have an NLP component. Medical diagnosis, once symptoms are recognized, is a game-like problem.

In the end Watson is an engineering achievement, but in no way have the skills of a good reference librarian been replicated.

I came across an interesting article by Michael Lind on information technology and its role in productivity while writing this blog post. Interestingly he puts information technology in the same time bracket as I do.


Pro Bono Projects and Internships with LingPipe

April 4, 2011

We have initiated a pro bono program that teams a LingPipe employee with a programmer wanting to learn about LingPipe, to serve a non-profit or researcher needing help with a text analytics/NLP/computational linguistics project.

So far it is going pretty well, with one classification project underway. The way it went was that I worked with an intern, a solid NLP programmer, to help a Georgia Tech researcher get a classification system up and running. Now the intern is running the project with some feedback from me, but not much is required at this point.

We are taking applications for interns wanting to learn how to code with LingPipe. We also will take applications for projects. Non-profits or academic research only please. Just send an email outlining your skills/needs and we will see what we can do.


Processing Tweets with LingPipe #3: Near duplicate detection and evaluation

January 2, 2011


In Post #1 we covered how to search Twitter and get a useful on disk data representation of the search results. In Post #2 we covered the first level of deduplication using HashSets as the test of sameness. We extended sameness by doing some normalization of the tweets which included removing urls and retweets.
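The normalization from Post #2 amounts to something like the following regex-based sketch. This is a plain-Java approximation for illustration; the actual code in Post #2 differs, and the class name here is made up:

```java
public class NormalizeSketch {
    //Strip urls and retweet prefixes, then collapse whitespace,
    //so near-identical tweets reduce to the same string for HashSet dedup.
    public static String normalize(String tweet) {
        return tweet
            .replaceAll("https?://\\S+", "")       //remove urls
            .replaceAll("\\bRT\\b\\s*@\\w+:?", "") //remove prefixes like "RT @user:"
            .replaceAll("\\s+", " ")               //collapse runs of whitespace
            .trim();
    }

    public static void main(String[] args) {
        String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?";
        System.out.println(normalize(tweet));
        //prints: Should Obama's 'internet kill switch' power be curbed?
    }
}
```

Two tweets that differ only by a retweet prefix or a shortened url then collide in the HashSet and the later one is discarded.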

You can download the code with this tarball or get it from subversion with the command

svn co

Despite the solutions above we still have annoyingly similar tweets that are not useful for my goals for this set of posts. In particular, the near duplicates really foul up any attempts at cross validation, where one part of the data is used as training and the other as test data. If there are duplicates then we are training on test data for some examples, and results end up looking too good.

Tokenization and Jaccard Distance

The duplicate detection problem in Twitter is really about word overlap with a slight game of telephone quality to it. The elaborations of previous tweets tend to be leading or following comments with the core of the source tweet preserved. Not much rephrasing is going on so word overlap between tweets is the obvious place to go. That entails a few elaborations to our approach:

  1. Find words in the tweets: Tokenization
  2. Measure similarity of tweets without sensitivity to tweet length: Jaccard Distance


A very simple tokenizer can just break on the characters that separate words, as follows:

public class Tokenize1 {
    //Approach #1: find the token separators
    public static void main(String[] args) {
        String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?";
        System.out.println("Tweet: " + tweet);
        String[] tokens = tweet.split(" ");
        for (String token : tokens) {
            System.out.println("Token: " + token);
        }
    }
}

This code is in src/ It produces output:

> ant tokenize1
     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?
     [java] Token: RT
     [java] Token: @mparent77772:
     [java] Token: Should
     [java] Token: Obama's
     [java] Token: 'internet

Another approach would be to try and define what characters belong in a token and match those, but that is a little more complex to program because String does not have a simple method call to capture an array of regex matches–see the regex chapter in the LingPipe book for more discussion:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Tokenize2 {
    //Approach #2: define what is a token
    public static void main(String[] args) {
        String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?";
        Pattern tokenPattern = Pattern.compile("\\w+"); //match one or more letter or digit chars
        System.out.println("Tweet: " + tweet);
        Matcher matcher = tokenPattern.matcher(tweet);
        while (matcher.find()) { //keep matching until no more matches
            System.out.println("Token: " +; //print what matched
        }
    }
}

This approach produces the following output:

> ant tokenize2
     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?
     [java] Token: RT
     [java] Token: mparent77772
     [java] Token: Should
     [java] Token: Obama
     [java] Token: s
     [java] Token: internet

Note that the text separating the tokens varies quite a bit. Between ‘RT’ and ‘mparent77772’ the separator is ‘ @’. Depending on the needs of the application, it can be easier to define tokens by what they look like rather than by what the spaces around them look like. Often it is both.

While these approaches might work for the case at hand we will introduce the LingPipe tokenizers instead. They offer much richer tokenization options ready to go–see chapter 8 of the LingPipe book draft for more details or look at the Java doc.

An equivalent tokenization in the LingPipe API is created as follows:

import com.aliasi.tokenizer.RegExTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Tokenize3 {
    public static void main(String[] args) {
        //Approach #3: a LingPipe tokenizer
        String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?";
        TokenizerFactory tokFactory
            = new RegExTokenizerFactory("\\w+"); //match one or more letter or digit chars
        System.out.println("Tweet: " + tweet);
        char[] chars = tweet.toCharArray();
        Tokenizer tokenizer
            = tokFactory.tokenizer(chars, 0, chars.length);
        String token;
        System.out.println("White Space :'" + tokenizer.nextWhitespace() + "'");
        while ((token = tokenizer.nextToken()) != null) {
            System.out.println("Token: " + token);
            System.out.println("White Space :'" + tokenizer.nextWhitespace() + "'");
        }
    }
}

Its output reports both the white spaces and the tokens.

     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?
     [java] White Space :''
     [java] Token: RT
     [java] White Space :' @'
     [java] Token: mparent77772
     [java] White Space :': '
     [java] Token: Should

We add the normalization features of Post #2 by extending ModifyTokenTokenizerFactory to filter retweets and urls. A good way to develop a tokenizer is to put it into its own class and create a stand-alone task that runs the tokenizer over example data. In ‘src/’ there is a main method:

public static void main(String[] args) {
    TokenizerFactory tokFactory = new NormalizedTokenizerFactory();
    String text = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?";
    System.out.println("Tweet: " + text);
    char[] chars = text.toCharArray(); //convert to a char array
    Tokenizer tokenizer
        = tokFactory.tokenizer(chars, 0, chars.length);
    String token;
    System.out.println("White Space :'" + tokenizer.nextWhitespace() + "'");
    while ((token = tokenizer.nextToken()) != null) {
        System.out.println("Token: " + token);
        System.out.println("White Space :'" + tokenizer.nextWhitespace() + "'");
    }
}

Running the above with ant normalizedTokenizerFactory produces nicely normalized output.

     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed?
     [java] White Space :''
     [java] Token: Should
     [java] White Space :' '
     [java] Token: Obama
     [java] White Space :'''
     [java] Token: s
     [java] White Space :' ''
     [java] Token: internet

How that tokenizer functions is left as an exercise for the reader. Time to take on Mr. Jaccard.

Jaccard Distance as a Measure of Similarity

Jaccard Distance is a good way to impress folks with a fancy-sounding term that, in our usage, is really just percent word overlap. If you think it helps you can also call it by the term ‘coefficient de communauté’, but then you might have to reveal that it was popularized by a French botanist–kind of undermines its jargon impact score. From the Jaccard Javadoc, the proximity of two character sequences is:

   = size(termSet(cs1) INTERSECT termSet(cs2))
     / size(termSet(cs1) UNION termSet(cs2))
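For intuition, the same proximity can be computed with plain Java collections. This is a sketch with a simple \w+ tokenization, independent of the LingPipe classes; the class name is made up:

```java
import java.util.*;

public class JaccardSketch {
    //Proximity = |intersection| / |union| of the two token sets.
    public static double proximity(String cs1, String cs2) {
        Set<String> terms1 = termSet(cs1);
        Set<String> terms2 = termSet(cs2);
        Set<String> intersection = new HashSet<>(terms1);
        intersection.retainAll(terms2);
        Set<String> union = new HashSet<>(terms1);
        union.addAll(terms2);
        return union.isEmpty() ? 0.0 : intersection.size() / (double) union.size();
    }

    private static Set<String> termSet(String text) {
        Set<String> terms = new HashSet<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(proximity("a b c", "b c d")); //prints 0.5
    }
}
```

Because the measure is over sets, repeating a word changes nothing: proximity("high speed buses", "high speed high speed buses") is 1.0, which matters for near-duplicate tweets.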

We get the term sets from the tokenizer, and proximity is the percentage of tokens shared by both character sequences. The code to compute this for all pairs of tweets in our search results is in src/ and the relevant bit of code is:

	JaccardDistance jaccardD = new JaccardDistance(tokFactory);
	List<String> candidateTweets
	    = filterNormalizedDuplicates(texts,
					 tokFactory); //throw out easy cases
	System.out.println("Normalized duplicate filter leaves " + candidateTweets.size() + " tweets");
	for (int i = 0; i < candidateTweets.size(); ++i) {
	    String closestTweet = "default value";
	    double closestProximity = -1d;
	    String targetTweet = candidateTweets.get(i);
	    for (int j = 0; j < candidateTweets.size(); ++j) { //cross product, ouchy, ow ow
		String comparisonTweet = candidateTweets.get(j);
		double thisProximity
		    = jaccardD.proximity(targetTweet, comparisonTweet);
		if (i != j) { //can't match self
		    if (closestProximity < thisProximity) {
			closestTweet = comparisonTweet;
			closestProximity = thisProximity;
		    }
		}
	    }
	}
The goal of the above wildly inefficient program is to explore what the closest tweet is, as determined by token overlap. The filterNormalizedDuplicates(texts, tokFactory) call filters out duplicates as discussed in Post #2. We can then decide on a threshold to throw out tweets with too much overlap. Running the program on the Obama.csv example:

ant exploreJaccard -Ddata=searches/Obama.csv -Doutfile=runs/Obama.jaccard.csv

and then viewing the output .csv file with a sort on column B we get (click to see larger image):

Jaccard Distance sorted for similarity, click on for larger image

Note that we get some proximity values of 1 even though there are differences between Tweet1 and Tweet2. In row 2 the normalization filters out the url, leaving as the only difference that the phrase “replacing trains by high speed buses” in Tweet1 has an additional “high speed” in “replacing high speed trains by high speed busses” in Tweet2. Since Jaccard compares sets of words rather than word counts, the repeated phrase adds no new words to the set.

Scrolling down reveals just how bad redundancy can be with tweets. At the 50% overlap point the tweets are still looking quite similar:

	Vet, 77, Busted For Obama Death Threat | The Smoking Gun via @
	Vet, 77, Busted For Obama Death Threat #tcot #tlot #sgp

There are 14 unique tokens with 7 overlapping, yielding 50% overlap. 36% of the Obama query tweets have an overlap of 50% or more with another tweet. A different query, "the", finds less overlap: only 3% of its tweets have another tweet with 50% overlap. For the query "today" the figure is 14%. The lesson is that some queries yield more uniform tweets, and that uniformity impacts subsequent language processing. A more technical way of expressing this observation is that the entropy of the search results varies with the query: the "obama" results have lower entropy (they are less random) than the results for "the".
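Entropy here is just Shannon entropy over the token distribution of the result set: near-duplicate results concentrate probability mass on a few tokens and score lower. A hedged sketch with whitespace tokens and base-2 logs (the two tiny corpora are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class TokenEntropy {
    // Shannon entropy in bits of the token distribution of a corpus
    public static double entropy(String[] corpus) {
        Map<String,Integer> counts = new HashMap<String,Integer>();
        int total = 0;
        for (String text : corpus) {
            for (String token : text.split("\\s+")) {
                Integer c = counts.get(token);
                counts.put(token, c == null ? 1 : c + 1);
                ++total;
            }
        }
        double entropy = 0d;
        for (int count : counts.values()) {
            double p = count / (double) total;
            entropy -= p * Math.log(p) / Math.log(2d);
        }
        return entropy;
    }

    public static void main(String[] args) {
        // near-duplicate results: mass concentrated on repeated tokens, low entropy
        String[] repetitive = { "obama knocks fifa decision",
                                "obama knocks fifa decision wrong" };
        // varied results: every token distinct, higher entropy
        String[] varied = { "playing a show tonight",
                            "vet busted for threat" };
        System.out.println(entropy(repetitive) < entropy(varied)); // true
    }
}
```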

The ‘today’ query had usefully different tweets at the 50% overlap level:

Playing a show in Chicago, IL at 9:00 PM today at LE PASSAGE	
Playing a show in Cape Girardeau, MO at 9:00 PM today at The Venue

Despite sharing half their words, they are clearly not retweets and contain different information. It might be quite difficult to automatically reject the ‘obama’ near duplicates at 50% token overlap while retaining the ‘today’ near duplicates at 50% overlap–I encourage people to suggest approaches in the comments.
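One possible direction (my suggestion, not part of the code in this post): weight tokens by informativeness, for example inverse document frequency, so that overlap on boilerplate like "Playing a show ... at 9:00 PM today" counts for less than overlap on venue or topic words. A hedged sketch with hand-assigned weights standing in for real IDF estimates:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class WeightedJaccard {
    // weighted Jaccard proximity: shared weight / total weight over the union,
    // so rare (informative) tokens count more than boilerplate; tokens
    // missing from the weight map default to weight 1
    public static double proximity(Set<String> a, Set<String> b,
                                   Map<String,Double> weight) {
        double shared = 0d;
        double total = 0d;
        Set<String> union = new HashSet<String>(a);
        union.addAll(b);
        for (String token : union) {
            double w = weight.containsKey(token) ? weight.get(token) : 1d;
            total += w;
            if (a.contains(token) && b.contains(token))
                shared += w;
        }
        return shared / total;
    }

    public static void main(String[] args) {
        Set<String> a = new HashSet<String>(Arrays.asList("playing", "show", "today", "chicago"));
        Set<String> b = new HashSet<String>(Arrays.asList("playing", "show", "today", "cape"));
        Map<String,Double> idf = new HashMap<String,Double>();
        idf.put("playing", 0.5); idf.put("show", 0.5); idf.put("today", 0.5); // boilerplate
        idf.put("chicago", 2.0); idf.put("cape", 2.0);                        // informative
        // unweighted overlap is 3/5 = 0.6; weighting drops it to 1.5/5.5 ~ 0.27
        System.out.println(proximity(a, b, idf));
    }
}
```

The same pair that looked like a near duplicate under plain token overlap now falls well below a 0.5 cutoff, while tweets sharing rare content words would stay above it.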

Finally we add a class that takes a list of tweets and filters it for near duplicates at a given Jaccard proximity in src/ The interesting bit is:

    public static List<String> filterTweetsJaccard(List<String> texts,
					       TokenizerFactory tokFactory,
					       double cutoff) {
	JaccardDistance jaccardD = new JaccardDistance(tokFactory);
	List<String> filteredTweets = new ArrayList<String>();
	for (int i = 0; i < texts.size(); ++i) {
	    String targetTweet = texts.get(i);
	    boolean addTweet = true;
	    //big research literature on making the below loop more efficient
	    for (int j = 0; j < filteredTweets.size(); ++j) {
		if (jaccardD.proximity(targetTweet, filteredTweets.get(j)) >= cutoff) {
		    addTweet = false;
		    break; //one nod to efficiency
		}
	    }
	    if (addTweet)
		filteredTweets.add(targetTweet);
	}
	return filteredTweets;
    }
Deduplication along these lines is a big deal for web search as well as for cleaning up Twitter feeds. The painful bit is comparing each new tweet to all the tweets that have passed the filter so far, which does not scale well. Much work has gone into hashing schemes that test for a match against a lexical fingerprint of the existing corpus of tweets, as well as into more sophisticated similarity metrics. A recent paper with a decent overview of past work is
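One classic member of that family of hashing schemes, not used in the code above, is MinHash: fingerprint each tweet's token set with the minima of k random hash functions, so the fraction of agreeing minima between two fingerprints estimates their Jaccard proximity, and candidates can be compared (or bucketed) via short fingerprints instead of full token sets. A self-contained sketch with made-up hash parameters:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class MinHashSketch {
    private final int[] seedsA;
    private final int[] seedsB;

    public MinHashSketch(int numHashes, long randomSeed) {
        Random rng = new Random(randomSeed);
        seedsA = new int[numHashes];
        seedsB = new int[numHashes];
        for (int k = 0; k < numHashes; ++k) {
            seedsA[k] = rng.nextInt() | 1; // odd multiplier for the affine hash
            seedsB[k] = rng.nextInt();
        }
    }

    // fingerprint: the minimum hash value of the token set under each hash function
    public int[] fingerprint(Set<String> tokens) {
        int[] mins = new int[seedsA.length];
        Arrays.fill(mins, Integer.MAX_VALUE);
        for (String token : tokens) {
            int h = token.hashCode();
            for (int k = 0; k < mins.length; ++k) {
                int hk = seedsA[k] * h + seedsB[k]; // wraps on overflow, which is fine
                if (hk < mins[k])
                    mins[k] = hk;
            }
        }
        return mins;
    }

    // fraction of agreeing minima estimates Jaccard proximity
    public static double estimatedProximity(int[] fp1, int[] fp2) {
        int matches = 0;
        for (int k = 0; k < fp1.length; ++k)
            if (fp1[k] == fp2[k])
                ++matches;
        return matches / (double) fp1.length;
    }

    public static void main(String[] args) {
        MinHashSketch mh = new MinHashSketch(512, 42L);
        Set<String> a = new HashSet<String>(Arrays.asList("obama", "fifa", "decision"));
        Set<String> b = new HashSet<String>(Arrays.asList("obama", "fifa", "today"));
        // true Jaccard proximity is 2/4 = 0.5; the estimate should land close to it
        System.out.println(estimatedProximity(mh.fingerprint(a), mh.fingerprint(b)));
    }
}
```

The payoff is that fingerprints are fixed-size, so near-duplicate candidates can be found by hashing fingerprint bands into buckets rather than by the all-pairs comparison above.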

Running the code on the Obama search results removes half the tweets with a .5 cutoff.

ant dedupeJaccard -Ddata=searches/Obama.csv -Doutfile=runs/ObamaJaccardFiltered.csv
Buildfile: build.xml
    [javac] Compiling 1 source file to /home/breck/devguard/teaching/lingpipe4twitter/build/classes

      [jar] Building jar: /home/breck/devguard/teaching/lingpipe4twitter/demo.jar

     [java] Writing to: runs/ObamaJaccardFiltered.csv
     [java] Filtered size: 752 originally 1500

Looking at the tweets another issue immediately presents itself:

Coupe du monde - Obama: "Mauvaise décision": Le président Barack Obama a… #Maroc #sport #foot
RT @Luminary212: LOL “@BreakingNews: Obama knocks FIFA's decision to award Qatar, not the U.S., the 2022 World Cup: 'Wrong decision'”
I want to meet President Obama.
RT @laaficion Barack Obama, presidente de EU, consideró que la FIFA se equivocó al darle a Qatar el Mundial de 2022 y no a su país// celos

All sorts of languages are in the data: French and Spanish are mixed in with the English. The next post will cover building a language identification classifier, and will return to the notion of entropy discussed above.


In the quest to clean up the search results I have introduced some key concepts in text processing that will resurface in various forms throughout the posts. Summarizing:

  • Tokenization: Breaking strings into words/word separators is key to getting at the unit of meaning for word-based processing. Oddly enough, in the next post we won’t be using words for language id, but we will still be tokenizing.
  • Normalization: By eliminating trivial differences we can make text look more uniform–but beware of this tendency.
  • Entropy: Some collections of data are more uniform than others.
  • Similarity: There are measures like Jaccard distance for estimating the similarity of texts. We will see many more examples that implement topic assignment, sentiment and more.
  • Evaluation: We have picked an operating point for near duplicate detection by examining data. While the evaluation metric remains quite loose, in the next post we will both create gold-standard (or truth) data and evaluate the performance of our language id system against it.


A Call to Code: Microsoft Research/Bing Query Spell Check Challenge

December 21, 2010

And I quote: "Microsoft Research in partnership with Bing is happy to launch the Speller Challenge." First place is $10,000; the contest starts Jan 17, 2011 and runs through May 27, 2011.

Speller Challenge

We have the baseline implementation of the required system in our tutorials, see Spelling Tutorial. The system is not wrapped in a web service and you will need to dig up your own training data—Wikipedia anyone?

There may be licensing issues if you want to use LingPipe tools to build your entry; here is our Royalty Free License. For the purposes of this contest we will create a free, non-restrictive license. If you’d like to request a free license for the contest, send me an email at