Author Archive

YapMap: Breck’s Fun New Project to Improve Search

January 27, 2012

I have signed on as chief scientist at YapMap. It is a part time position that grew out of me being on their advisory board for the past 3 years. Try the search interface for the forums below:

Automotive Forums

YapMap search for Low Carb Breakfast on Diabetes Daily

A screen shot of the interface:

UI for YapMap's search results

What I like about the user interface is that threads can be browsed easily–I have spent hours on remote controlled airplane forums reading every post because it is quite difficult to find relevant information within a thread. The color coding and summary views are quite helpful in eliminating irrelevant posts.

My first job is to get query spell checking rolling. Next is search optimized for the challenges of thread based postings. The fact that relevance of a post to a query is a function of a thread is very interesting. I will hopefully get to do some discourse analysis as well.

I will continue to run Alias-i/LingPipe. The YapMap involvement is just too fun a project to pass up given that I get to build a fancy search and discovery tool.

Breck

Peter Jackson has Passed Away

August 4, 2011

I am very sorry to convey that Peter Jackson has passed away. Below is an email sent to Thomson Reuters from their CTO.

Peter was a great friend, early customer and advisor to LingPipe. I will miss him dearly.

Breck

—————-

From James Powell, CTO Thomson Reuters

Dear Colleagues,

It is with great sadness that I must inform you that Peter Jackson, our Chief Scientist and head of R&D, died suddenly this morning.

Peter JacksonPeter was a colleague of immense talents, both professionally and personally. He started his career in academia, and after a post as a Lecturer in Artificial Intelligence at Edinburgh University in Scotland he moved to the U.S. where he went on to become Associate Professor in the Department of Electrical and Computer Engineering at Clarkson University, NY.

A true global citizen who was born in Barbados and educated in England, Peter also served as Visiting Professor at the Artificial Intelligence Laboratory for the School of Electrical & Electronic Engineering at Singapore Polytechnic.

Peter switched from the corridors of academia to the meeting rooms of Rochester when he joined Lawyers Cooperative Publishing (LCP) back in 1995, as Lead Software Engineer for the Natural Language Group. LCP was acquired by West Group the following year.

His penetrating intelligence, allied to his outgoing and immediately likeable personality, ensured that Peter moved quickly through the ranks at Thomson; next in the West Group and then at Thomson Legal & Regulatory, where he rose to become Chief Research Scientist & Vice-President, Technology in 2005.

A lasting legacy

Peter was the obvious choice for the position of Chief Scientist when we became Thomson Reuters in 2008, and it was a role he fulfilled with exceptional results, forming a 40-strong R&D group as a corporate-wide resource, and aligning our Research & Development capabilities with the major platform and strategy initiatives of recent years.

Peter’s legacy at Thomson Reuters is significant and lasting. He oversaw the advanced technologies , such as CaRE, Concord and Results Plus, that enabled us to launch radical and successful new product platforms such as WestlawNext and PeopleMap. He was also integral to the development of our innovative Reuters Insider channel.

On top of that he represented Thomson Reuters with great professionalism — and considerable personality and panache — in the wider world of conferences, academia and professional associations. He was a great ambassador for Thomson Reuters in his role overseeing university liaison with regards to joint research projects with institutions of the caliber of MIT, NYU and CMU.

These achievements give some sense of Peter’s abilities and drive. Peter, though, had many more strings to his bow. He published three books and about 40 papers on his fields of expertise (artificial intelligence and natural language processing), and invented four US patents.

Above all, when it came to his work, Peter was proudest of the group that he built and worked with side-by-side every day.

If this wasn’t enough for one man, he was also a talented musician and a serious music fan, serving for a number of years on the Board of Directors for the Saint Paul Chamber Orchestra. In recent times, he traveled to Chicago, New York and Ojai to support the SPCO on tour. Many people in Eagan will remember Peter’s “Jazzkickers” band playing at employee events around campus.

He even demonstrated his guitar playing prowess in this internal technology video filmed at Joe’s Pub in NYC earlier this summer.

The many of us who worked with Peter will miss greatly his enthusiasm, the clarity of his thinking, his wisdom and his wit.

He leaves behind his wife Sandy, also a former employee here. Peter was 62.

Sometimes I Wish NLP Systems Occasionally Blew Up

June 28, 2011

The consequences of a badly performing NLP (Natural Language Processing) system tend to be a pretty low key, low drama event. Bad things may happen but not in a way that will get noticed in the same way as: Rocket Go Boom

The failure of a rocket launch highly motivates the involved engineers to not have that happen again. Miss that named entity, blam! If only NLP programmers had such explicit negative feedback. I think the field would be better for it.

NLP Systems are Easier to Sell than Build

Customers get the potential value of advanced NLP/Text Analytics/etc in the same way that people get the potential value of space flight.

Space ships in artist's conception

The Dreaded Artist's Conception--picture credit NASA

It would be so cool to do sentiment analysis in low earth orbit! Sadly, the tremendous promise of the field is held back by a combination of overselling, under-delivering and lack of awareness of how to build good performing systems. What contributes to poor performance the most?

Be aware that you are selling to the best NLP systems out there: Humans

One of the greatest frustrations I face is severely underfunded projects. For the most part rockets get a much healthier dose of funding because people see the failures clearly and do not have a grasp of how rockets work. Not so much for NLP. Language processing is so easy for humans that it is like trying to sell cargo airplanes to eagles. They just don’t get what is hard, what is easy and the necessity of infrastructure. “Mr. Eagle, um, well we really need a runway to get the 20 tons of product into the air”. Mr. Eagle responds with “What are you talking about? I can take off, land and raise a family on a tree branch. Cargo planes are easy because flying is easy for me. So I will give you a fish to do the job.”

Don’t ask a banker from 1994 to understand your tweets

Another source of poor performance is the reliance of general purpose solutions that are not well suited or tuned to the domain. It is unrealistic to expect a named entity model to perform well on Twitter if its training data is the 1994 Wall Street Journal. Modules customized for the domain can make a huge difference in performance. But customization costs money, takes time and requires a different mind set.

Understand the problem clearly with hand-coded examples

The #1 bit of advice I give customers is to emulate what you expect the system to do by hand. Then show that to stake holders to make sure a problem is being solved that addresses something your business cares about. Also, Mr. Eagle will much better appreciate the need of another solution after ferrying 100 pounds of product 1 pound at a time. By doing this you will have reduced the risk of failure by half in my opinion.

NLP is hard because in addition to being technically difficult, it is made worse because it seems easy for humans to do. They then under-appreciate the challenges. If systems blew up spectacularly we might have a better appreciation of that.

Breck

Programmers Expect Dog Like Consistency in Software, but NLP is more like Cats

June 21, 2011
Typical Startup Staffing Choices

Photo Credit hoangnam_nguyen at flickr.com

From the perspective of a traditionally trained programmer, one of the hardest things to appreciate when working with Natural Language Processing(NLP) systems is the approximate, unpredictable quality provided by NLP modules. I liken it to getting dog people to learn to work with cats.

For example, a standard NLP module could decide whether “Bill Jones” is mentioned in the sentence “An announcement was made by Sir Bill Jones on the eve of his retirement…” In database terms the task is to satisfy the query “SELECT person FROM text” or a string function “detectPerson(sentence)”.

A range of answers is possible. They include:

  • Getting it right. “Bill Jones” is definitely a person in the sentence.
  • Getting it right, but hedging. “Bill Jones” has a 75% likelihood of being a person in the sentence. A person could also be “Sir Bill” with 15% likelihood or “Jones on” with likelihood 1% with some other low confidence estimates.
  • Getting it dead wrong. No “Bill Jones” person found at all. It turns out that the system had never seen “Sir” as an honorific in training and had other issues. Sorry, maybe next time — really the system gets a 95% F-measure so it won’t happen often.

Obviously getting it right is what we want. The programmer throws the stick and the program dutifully fetches the information like a dog. Provide pat on head and a who’s-a-good-program positive reinforcement and all is well.

Getting it right but hedging is more of a problem. The existence of a probability complicates things a bit. What do you do with probabilities that are low or tied? What about close cases where the spans are overlapping? It’s like throwing a ball and getting a chew toy, a bit of a stick and some regurgitated food.

Getting it dead wrong is really a wrench in the works because it is just so friggin’ obvious that the right answer is sitting right there, pristine. How can the program ignore it? Just go get it, pick it up, and bring it back you stoooopid program. What is this 95% pedigree worth if it can’t handle that? Try playing catch with a cat and you will have exactly the right experience. But it is not the cat’s fault; you can’t expect a cat to act like a dog, but cats are a useful tool if you want to catch mice or amusingly pursue a laser pointer. Like NLP, cats have to be influenced to do what you want, not commanded.

Working with Cats

The challenges of NLP require that the developer work with the strengths of the technology and factor in the shortcomings. It is a different mind set that tends to require more complicated data structures. Also multiple possible answers slows things down with extra decision making loops. Systems often need extensive tuning to work with the capabilities of the cat^h^h^h NLP technology. Once that is mastered, then the additional flexibility and tolerance of uncertainty of signal in data will result in a very different capability. Systems can degrade more gracefully, apply to more situations and provide a more human flexibility to text processing. But you got to scratch the cat just the right way.

Breck (who loves dogs and cats)

P.S. If you want to get a start on NLP programming check out our tutorials.

IBM’s Watson and the State of NLP

June 14, 2011

Aditya Kalyanpur presented an overview of the Jeopardy! winning Watson computer system June 6 at TheLadders.com in New York for the New York Semantic Web Meetup. I was asked to present a three minute overview of the state of Natural Language Processing (NLP). In this post I want to couch the Watson system in the context of the state-of-the-art since it didn’t make sense to do it at the meetup because I presented first.

The State of NLP According to Breck

Conveying the state-of-the-art in three minutes is quite a challenge so lets run with the analogy of aviation for ease of comprehension. So where is NLP?


It Flies!

We have achieved the analog of basic powered flight. No doubt.


Yikes! and away

But in no sense have we gotten to the this level of performance.


Amelia Earhart's Lockheed Vega

My best guess is that we are at the point of a reasonable commercial foundation as an industry with some changes to come that we don’t know about yet, not unlike aviation in the mid 1920′s. Perhaps the beginning of the Golden Age of NLP.


And in no sense are we in the reliable, high technology commercial space that modern air transport provides.

Boing 777


Where does Watson Fit in the Analogy

Watson fits perfectly in the example of the red 1928 Lockheed Vega above for the following reasons:

  • The Vega is actually Amelia Earhart’s plane that was used to break records (crossing the Atlantic solo), generate publicity and was a stunning success for a nascent industry.
  • While inspirational, the Vega’s success had little to do with advancing the underlying technology. What would I consider an advancement of technology? Frank Whittle patented the turbojet in 1930.
  • Watson shows how a 20 person team working 4 years can win a very challenging game with skill, effort and daring much in the same way that aviation records were broken with the same. Don’t think some careers were not on the line with the Watson effort–I think IBM ceased termination by firing squad in the 70′s so Earhart had more on the line. But what are the prospects of a mid-level ex-IBM exec in today’s economy? Perhaps the firing squad would be a kindness.

But Watson is Playing a Game

There is one issue that seriously concerns me; Watson won a question answering game with the trivial twist that the answers must be phrased as questions. So the clue “First President of the US” is answered with “Who is George Washington”. But Watson is not a general purpose question answering system. What is the difference?

Another analogy: The game of chess is based on medieval battles but even though Big Blue beats the best human players one would never consider using Big Blue to manage an actual battle. Real war is messy, approximate and without clear rules which makes chess algorithms totally inappropriate.

Real world question answering has similar qualities to real war: messy, approximate and no clear rules. The game of Jeopardy! is based on the existence of a unique, easily understood and verified answer given the clue. Taking one of the examples from the talk;

In 1698, this comet discoverer took a ship called the Paramour Pink on the first purely scientific sea voyage

The correct “question” is “Who is Edmond Halley” of Halley’s Comet fame. The example is used to work through an impressive system diagram that resembles a well developed model train set (thanks to Prof. Mark Steedman for the simile). Much is done to generate the correct answer while avoiding distractors like Peter Sellers from the Pink Panther movies. But run the same clue past Google with “-watson -jeopardy” appended to eliminate pages that mention discussion of this publicized example and the first result is Halley’s Comet Stamps with the first sentence mentioning the correct answer.

There is still an impressive amount of work in extracting the correct name but the hunt for the answer was ready to be found exactly because it is a game, unambiguous, well known and well selected given the clue.

What does Real World Question Answering Look Like?

What kinds of questions have I approached a search engine with?

What is the current 30 year FHA mortgage rate?

This question is a disaster from the uniqueness of answer perspective. My initial search results were pretty low quality and did not provide accurate rate information for what I knew the answer to be.

When is it best to ski in Chile?

This went better. There was a FAQ on the first page of results but the answer just went on and on. “The season runs from mid-June to mid-October. Although every year is different, and it comes down to Mother Nature, the best time for dry powder is mid-June, July, August, and up to the 2nd week in September. After that,….” Again we have a non-unique answer because my question was not that specific in the first place.

What is the Reputation of LingPipe?

This is a question that a group of Columbia MBA students took on for us in their small business program which I recommend btw.

This question was hopeless in search because there is not a page out there that needs to be found with our reputation nicely summarized. Answering the question requires distillation across many resources even if information was restricted to web only.

Welcome to the real world, question answering is hell.

Where Might Watson Flourish Outside of Jeopardy! Tournaments?

Jeopardy! is a game of finding the uniquely obvious given indirect clues. Otherwise it is not a game that can be judged and played. What else in the world has this quality? The Watson team is now approaching medical diagnosis which is a real world use case that might match the Jeopardy! game format with symptoms as clues and diagnosis as the answer. Uniqueness is not guaranteed in diagnosis but Watson can handle multiple answers. This is an area where computer systems from the 1970′s, e.g. Mycin, out performed experts but they didn’t have a NLP component. Medical diagnosis, once symptoms are recognized, is a game like problem.

In the end Watson is an engineering achievement, but in no way have the skills of a good reference librarian been replicated.

I came across an interesting article by Michael Lind on information technology and its role in productivity while writing this blog post. Interestingly he puts information technology in the same time bracket as I do.

Breck

Pro Bono Projects and Internships with LingPipe

April 4, 2011

We have initiated a pro bono program that teams a lingpipe employee with a programmer wanting to learn about LingPipe to serve a non-profit or researcher needing help with a text analytics/NLP/computational linguistics project.

So far is is going pretty well with one classification project underway. The way it went was that I worked with an intern, a solid NLP programmer, to help a Ga Tech researcher get a classification system up and running. Now the intern is running the project with some feedback from me but not much is required at this point.

We are taking applications for interns wanting to learn how to code with LingPipe. We also will take applications for projects. Non-profits or academic research only please. Just send breck@lingpipe.com an email outlining your skills/needs and we will see what we can do.

Breck

Processing Tweets with LingPipe #3: Near duplicate detection and evaluation

January 2, 2011

Summary:

In Post #1 we covered how to search Twitter and get a useful on disk data representation of the search results. In Post #2 we covered the first level of deduplication using HashSets as the test of sameness. We extended sameness by doing some normalization of the tweets which included removing urls and retweets.

You can download the code with this tarball or get it from subversion with the command


svn co https://aliasi.devguard.com/svn/teaching/lingpipe4twitter

Despite the solutions above we still have annoyingly similar tweets that are not useful for my goals for this set of posts. In particular the near duplicates really foul up any attempts at cross validation where one part of the data is used as training and the other as test data. If there are duplicates then we are training on testing for some examples and results end up looking too good.

Tokenization and Jaccard Distance

The duplicate detection problem in Twitter is really about word overlap with a slight game of telephone quality to it. The elaborations of previous tweets tend to be leading or following comments with the core of the source tweet preserved. Not much rephrasing is going on so word overlap between tweets is the obvious place to go. That entails a few elaborations to our approach:

  1. Find words in the tweets: Tokenization
  2. Measure similarity of tweets without sensitivity to tweet length: Jaccard Distance

Tokenization

A very simple tokenizer can just break on what characters separate words as follows:

public class Tokenize1 {
    //Approach #1, find token seperators
    public static void main (String[] args) {
	String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz";
	System.out.println("Tweet: " + tweet);
	String[] tokens = tweet.split(" ");
	for (String token : tokens) {
	    System.out.println("Token: " + token);
	}
    }
}

This code is in src/Tokenize1.java. It produces output:

<ant tokenize1
     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz
     [java] Token: RT
     [java] Token: @mparent77772:
     [java] Token: Should
     [java] Token: Obama's
     [java] Token: 'internet

Another approach would be to try an define what characters belong in a token and match that, but that is a little more complex to program because String does not have a simple method call to capture an array of regex matches–see the regex chapter in the LingPipe book for more discussion:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Tokenize2 {
    //Approach 2, define what is a token
    public static void main (String[] args) {
	String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz";
	Pattern tokenPattern = Pattern.compile("\\w+"); //match one or more letter or digit chars
	System.out.println("Tweet: " + tweet);
	Matcher matcher 
	    = tokenPattern.matcher(tweet);
	while (matcher.find()) { //keep matching until no more matches
	    System.out.println("Token: " + matcher.group());//print what matched
	}
    }
}

This approach produces the following output:

<ant tokenize2  
   [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz
     [java] Token: RT
     [java] Token: mparent77772
     [java] Token: Should
     [java] Token: Obama
     [java] Token: s
     [java] Token: internet

Note that the text that separate the tokens are very different. Between ‘RT’ and ‘mparent77772′ is the separator ‘ @’. Depending on the needs of the application it can be easier to define tokens by what they look like rather than what the spaces around them look like. Often it is both.

While these approaches might work for the case at hand we will introduce the LingPipe tokenizers instead. They offer much richer tokenization options ready to go–see chapter 8 of the LingPipe book draft for more details or look at the Java doc.

An equivalent tokenization in the LingPipe API is created as follows:

import com.aliasi.tokenizer.RegExTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Tokenize3 {
    public static void main (String[] args) {
	//approach #3, A LingPipe tokenizer
	String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz";
	TokenizerFactory tokFactory
	    = new RegExTokenizerFactory("\\w+");//match one or more letter or digit chars
	System.out.println("Tweet: " + tweet);
	char[] chars = tweet.toCharArray();
	Tokenizer tokenizer 
	    = tokFactory.tokenizer(chars,0,chars.length);
	String token;
	System.out.println("White Space :'" +  tokenizer.nextWhitespace() + "'");
	while ((token = tokenizer.nextToken()) != null) {
	    System.out.println("Token: " + token);
	    System.out.println("White Space :'" + tokenizer.nextWhitespace()+"'");
	}
    }
}

Its output reports on both the white spaces as well as the tokens.

     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz
     [java] White Space :''
     [java] Token: RT
     [java] White Space :' @'
     [java] Token: mparent77772
     [java] White Space :': '
     [java] Token: Should

We add the normalization features of post #2 by extending the ModifyTokenTokenizerFactory to filter retweets and urls. A good way to develop a tokenizer is to put in into its own class and create a stand-alone task that runs the tokenizer over example data. In ‘src/NormalizedTokenizerFactory.java’ there is a main method:


public static void main(String[] args) {
	TokenizerFactory tokFactory = new NormalizedTokenizerFactory();
	String text = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz"
	System.out.println("Tweet: " + text);
	char[] chars = text.toCharArray(); //convert to charArray
	Tokenizer tokenizer 
	    = tokFactory.tokenizer(chars,0,chars.length);
	String token;
	System.out.println("White Space :'" +  tokenizer.nextWhitespace() + "'");
	while ((token = tokenizer.nextToken()) != null) {
	    System.out.println("Token: " + token);
	    System.out.println("White Space :'" + tokenizer.nextWhitespace()+"'");
	}
    }

Running the above with ant normalizedTokenizerFactorty produces nicely normalized output.

     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz
     [java] White Space :''
     [java] Token: Should
     [java] White Space :' '
     [java] Token: Obama
     [java] White Space :'''
     [java] Token: s
     [java] White Space :' ''
     [java] Token: internet

How that tokenizer functions is left as an exercise for the reader. Time to take on Mr. Jaccard.

Jaccard Distance as a Measure of Similarity

Jaccard Distance is a good way to impress folks with a fancy sounding term that is really just percent word overlap in our usage. If you think it helps you can also call it by the term ‘coefficient de communauté’ but then you might have to reveal that it was popularized by a French botanist–kind of undermines its jargon impact score. From the Jaccard Javadoc the proximity of two character sequences:

 proximity(cs1,cs2)
   = size(termSet(cs1) INTERSECT termSet(cs2))
     / size(termSet(cs1) UNION termSet(cs2))

We get the term sets from the tokenizer, and proximity is the percentage of tokens shared by both character sequences. The code to compute this for all pairs of tweets in our search results is in src/ExploreJaccard.java and the relevant bit of code is:

	JaccardDistance jaccardD = new JaccardDistance(tokFactory);
	int filteredCount = 0;
	List candidateTweets 
	    = filterNormalizedDuplicates(texts, 
					 tokFactory); //throw out easy cases
	System.out.println("Normalized duplicate filter leaves " + candidateTweets.size() + " tweets");
	row = new ArrayList();
	for (int i = 0; i < candidateTweets.size(); ++i) {
	    String closestTweet = "default value";
	    double closestProximity = -1d;
	    String targetTweet = candidateTweets.get(i);
	    for (int j = 0; j < candidateTweets.size(); ++j ) {//cross product, ouchy, ow ow. 
		String comparisionTweet = candidateTweets.get(j);
		double thisProximity 
		    = jaccardD.proximity(targetTweet,comparisionTweet);
		if (i != j) { // can't match self
		    if (closestProximity < thisProximity) {
			closestTweet = comparisionTweet;
			closestProximity = thisProximity;
		    }
		}
	    }

The goal of the above wildly inefficient program is to explore what the closest tweet is as determined by token overlap. The filterNormalizedDeduplicates(tweets,tokFactory) filters out duplicates as discussed in #2. We can then decide on a threshold to throw out tweets with too much overlap. Running the program on the Obama.csv example:

ant exploreJaccard -Ddata=searches/Obama.csv -Doutfile=runs/Obama.jaccard.csv

and then viewing the output .csv file with a sort on column B we get (click to see larger image):

Jaccard Distance sorted for similarity, click on for larger image

Note that we get some values of 1 even though there are differences in Tweet1 and Tweet2. In row 2 the normalization filters out the url leaving the only difference being the phrase “replacing trains by high speed buses” in Tweet1 has an additional “high speed” in “replacing high speed trains by high speed busses” in Tweet 2. Since Jaccard does not count the number of words in computing distance the additional phrase adds no words to the set of words since it already exists in the tweet.

Scrolling down reveals just how bad redundancy can be with tweets. At the 50% overlap point the tweets are still looking quite similar:

	Vet, 77, Busted For Obama Death Threat | The Smoking Gun http://t.co/MrTUwxv via @
	Vet, 77, Busted For Obama Death Threat http://tinyurl.com/25zyxgp #tcot #tlot #sgp

There are 14 unique tokens with 7 overlapping yielding 50% overlap. 36% of the Obama query tweets have an overlap of 50% or more with another tweet. Trying a different query, "the" finds less overlap between tweets. Only 3% of the tweets have another tweet with 50% overlap. The query "today" yields 14% of tweets overlap. The lesson here is that some queries yield more uniform tweets which impacts subsequent language processing. A more technical way of expressing this observation is that the entropy of the resulting search results varies depending on the query. The "obama" search result entropy is lower (less random) than results for "the" which is more random.

The ‘today’ query had usefully different tweets at the 50% overlap level:

Playing a show in Chicago, IL at 9:00 PM today at LE PASSAGE http://artistdata.com/a/32on	
Playing a show in Cape Girardeau, MO at 9:00 PM today at The Venue http://artistdata.com/a/32ow

Despite sharing half the words they are clearly not retweets and contain different information. It might be quite difficult to automatically reject ‘obama’ near duplicates at 50% token overlap but retain the ‘today’ near duplicates with 50% overlap–I encourage people to suggest approaches in the comments.

Finally we add a class that will take a set of queries and filter them for near duplicates at a given Jaccard proximity in src/DedupeJaccard.java. The interesting bit is:

    public static List filterTweetsJaccard(List texts,
				       TokenizerFactory tokFactory,
				       double cutoff) {
	JaccardDistance jaccardD = new JaccardDistance(tokFactory);
	List filteredTweets = new ArrayList();
	for (int i = 0; i < texts.size(); ++i) {
	    String targetTweet = texts.get(i);
	    boolean addTweet = true;
	    //big research literature on making the below loop more efficient
	    for (int j = 0; j = cutoff) {
		    addTweet = false;
		    break; //one nod to efficency
		}
	    }
	    if (addTweet) {
		filteredTweets.add(targetTweet);
	    }
	}
	return filteredTweets;
    }

Deduplication along these lines is a big deal for web search as well as cleaning up Twitter feeds. The painful bit is the comparison of each new tweet to all the tweets that have passed the filter before which does not scale well. Much work has gone into hashing schemes that test for a hash match based on a lexical fingerprint of the existing corpus of tweets as well as more sophisticated similarity metrics. A recent paper with a decent overview of past work is http://research.microsoft.com/pubs/130851/ANDD.pdf.

Running the code on the Obama search results removes half the tweets with a .5 cutoff.

<ant dedupeJaccard -Ddata=searches/Obama.csv -Doutfile=runs/ObamaJaccardFiltered.csv
Buildfile: build.xml
compile:
    [javac] Compiling 1 source file to /home/breck/devguard/teaching/lingpipe4twitter/build/classes

jar:
      [jar] Building jar: /home/breck/devguard/teaching/lingpipe4twitter/demo.jar

dedupeJaccard:
     [java] Writing to: runs/ObamaJaccardFiltered.csv
     [java] Filtered size: 752 originally 1500

Looking at the tweets another issue immediately presents itself:

Coupe du monde - Obama: "Mauvaise décision": Le président Barack Obama a… http://goo.gl/fb/j5OUm #Maroc #sport #foot
RT @Luminary212: LOL “@BreakingNews: Obama knocks FIFA's decision to award Qatar, not the U.S., the 2022 World Cup: 'Wrong decision'”
I want to meet President Obama.
RT @laaficion Barack Obama, presidente de EU, consideró que la FIFA se equivocó al darle a Qatar el Mundial de 2022 y no a su país// celos

All sorts of languages are in the data. It looks like there is French and Spanish mixed in with the English. Next post will be creation of a language identification classifiers that will return to the notion of entropy discussed above.

Wrapup

In the quest to clean up the search results I have introduced some key concepts in text processing that will resurface in various forms throughout the posts. Summarizing:

  • Tokenization: Breaking strings into words/word separators is key to getting at the unit of meaning for word based processing. Oddly enough in the next post we won’t be using words for language id, but we will be still tokenizing.
  • Normalization: By eliminating trivial differences we can make text look more uniform–but beware of this tendancy.
  • Entropy: Some collections of data are more uniform than others.
  • Similarity: There are measures like Jaccard Distance to estimate the similarity of text. We will see many more examples that implement topic assignment, sentiment and more.
  • Evaluation: We have picked an operating point for near duplicate detection by examining data. While the evaluation metric remains quite loose, in the next post we will both create gold-standard (or truth) data and evaluation the performance of our language id system against it.

Breck

A Call to Code: Microsoft Research/Bing Query Spell Check Challenge

December 21, 2010

And I quote: "Microsoft Research in partnership with Bing is happy to launch the Speller Challenge." First place is $10,000, starts Jan 17, 2011, contest is on May 27, 2011.

Speller Challenge

We have the baseline implementation of the required system in our tutorials, see Spelling Tutorial. The system is not wrapped in a web service and you will need to dig up your own training data—Wikipedia anyone?

There may be issues if you want to use LingPipe tools to build the system, here is our Royalty Free License. For the purposes of this contest we will create a free license for contest use that is non-restrictive. If you’d like to request a free license for the contest, send me an email at breck@lingpipe.com.

Breck

Processing Tweets with LingPipe #2: Finding Duplicates with Hashing and Normalization

December 2, 2010

This post is about taking the csv file format used in post #1 and eliminating the huge number of annoying duplicates or near duplicates that exist in Twitter search results. The next entry in the series will cover more open ended near duplicate detection strategies such as Jaccard Distance and will introduce the very important area of evaluation.

As before you can download the code with this tarball or get it from subversion with the command


svn co https://aliasi.devguard.com/svn/teaching/lingpipe4twitter

We will start with a simple ingest and representation of the csv data and consider how to find near or exact duplicates without concern for computational efficiency. The research literature in this area is very concerned with scalable performance, I will instead focus on more ideal solutions that focus on simplicity and quality.

Hashing Driven Deduplication

A simple way to find exact duplicates is to use a HashSet data structure in Java for comparison of tweets. Once we get the tweets into a list we iterate over the tweets to see whether the the exact string is in the HashSet. Below is the recognition phase from src/DedupeSimpleHash.java:

	List texts = parseFile(dataDir);
	HashSet seenBefore = new HashSet();
	for (int i=0; i < texts.size(); ++i) {
	    String tweet = texts.get(i);
	    if (seenBefore.contains(tweet)) {
		++seenCounter;
		System.out.println("Seen Before at postition:" + i +": " + tweet);
	    }
	    else {
		seenBefore.add(tweet);
	    }
	}	

The code iterates over the CSV format search results, tests to see if a HashSet already contains the string and adds it if the set does not contain the string. I could have just added all the strings and the HashSet would have behaved the same but that is slightly automagical and might prove confusing. Below the code for writing out the CSV format for presumably unique tweets:

	File output = new File(outDir,outFile); 
	System.out.println("Writing to: " + output.toString());
	FileOutputStream stream =  new FileOutputStream(output);
	OutputStreamWriter streamWriter 
	    = new OutputStreamWriter(stream,Strings.UTF8);
	CsvListWriter writer 
	    = new CsvListWriter(streamWriter,CsvPreference.EXCEL_PREFERENCE); 
	ArrayList row = new ArrayList();
	row.add("");
	row.add("");
	row.add("");
	row.add("Unique Tweet");
	writer.write(row); //header for csv file
	for (String tweet : seenBefore) {
	    row.clear();
	    row.add("");
	    row.add("");
	    row.add("");
	    row.add(tweet);
	    writer.write(row);
	}
	writer.close();
	System.out.println(seenCounter + " seen of " + texts.size());

Note that I have kept the tweet in the same row as the search results. This convention will allow interaction with other components.

Running the program on searches/Obama.csv yields the following:

>ant dedupeSimpleHash -Ddata=searches/Obama.csv -Doutdir=runs -Doutfile=Obama.simpleDedupeHash.csv
Buildfile: build.xml

compile:

jar:

dedupeSimpleHash:
     [java] 2
     [java] Seen Before at postition:48: RT @rolandsmartin: President Obama: 'I Don't Think About Sarah Palin' - http://bit.ly/gXZfqN http://fb.me/NRKRnQWy
     [java] Seen Before at postition:58: RT @New_federalists: Obama lets Islamic terrorists pour across the Mexican border, but treats US citizens as criminals at airports!!! #RESIST #REBEL !
....[snip]....
     [java] Seen Before at postition:1490: RT @Cubachi: RT @PalinTV: Obama can't even pardon a Turkey without notes
     [java] Seen Before at postition:1499: RT @BreakingNews: President Barack Obama pardons turkeys ?Apple? and ?Cider?; says feels good to prevent one ?shellacking? this year
     [java] Writing to: runs/Obama.simpleDedupeHash.csv
     [java] 297 seen of 1500

Duplicate Detection via Normalization

The simple deduplication approach gets rid of 297 exact duplicates, or 20% of the tweets. Pretty good but looking at the ‘unique’ tweets and sorting on the texts in a spread sheet shows some other issues:

Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... http://bit.ly/dHWLod
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... http://bit.ly/geCufk
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... http://bit.ly/hp7In9
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... http://bit.ly/iggM61
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... http://bit.ly/ijwRNA

Note that your results will vary since we are not running the searches on the same day. These posts only differ in the url for bit.ly which is still pretty much the same tweet as far as I am concerned. Looking further I see that tweets get re-tweeted with a prefix of ‘RT@:’ pretty often as well.

RT @mattyglesias: Repealing the whole Affordable Care Act would be a mistake, but Obama should compromise and agree to scrap the death...																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																												
Repealing the whole Affordable Care Act would be a mistake, but Obama should compromise and agree to scrap the death...

It appears that removing urls and the retweet prefixes will find more duplicates. That suggests normalization to help make the tweets more uniform by eliminating "trivial" differences. A look at src/DedupeNormalizedHash.java shows regular expressions that replace the url and retweets with the empty string.

	List texts = parseFile(dataDir);
	HashSet seenBefore = new HashSet();
	for (int i=0; i < texts.size(); ++i) {
	    String tweet = texts.get(i);
	    System.out.println("Before:" + tweet);
	    tweet = tweet.replaceAll("\\s*RT\\s*@\\w+:\\s*","");//removes "RT @foobar:"
	    tweet = tweet.replaceAll("https?:[^\\s]*",""); //removes "http://foo" "https://bar"
	    System.out.println("After :" + tweet);	    
	    if (seenBefore.contains(tweet)) {
		++seenCounter;
		System.out.println("Seen Before at postition:" + i +": " + tweet);
	    }
	    else {
		seenBefore.add(tweet);
	    }
	}	

Running the normalized approach the modifications are evident:

> ant dedupeNormalizedHash -Ddata=searches/Obama.csv -Doutdir=runs -Doutfile=Obama.dedupeNormalizedHash.csv
[snip]
     [java] Before:RT @rolandsmartin: President Obama: 'I Don't Think About Sarah Palin' - http://bit.ly/gXZfqN http://fb.me/NRKRnQWy
     [java] After :President Obama: 'I Don't Think About Sarah Palin' -  
[snip]
     [java] Writing to: runs/Obama.dedupeNormalizedqHash.csv
     [java] 454 seen of 1500

Both patterns apply to the tweet removing ‘RT @rolandsmartin: ‘ and ‘http://bit.ly/gXZfqN http://fb.me/NRKRnQWy&#8217;. The normalization allows for identification of another 157 duplicates. Normalization is a standard technique used when working with text data. You see it in search engines where ‘walks’, ‘walking’ and ‘walked’ are normalized to the root word ‘walk’ with a stemmer. Both documents and queries are normalized the same way allowing a query ‘Books about walking’ to match documents that don’t mention ‘walking’ but do mention ‘walk’, ‘walks’ or ‘walked.’ Normalization can bring more trouble than the problems it solves and should be used carefully–perhaps a future blog post is called for on our preference to use character n-grams in situations where stemming is typically used.

Looking at the output of the normalized hash approach reveals more opportunity to eliminate near duplicates and there may be more patterns to exploit but eventually diminishing returns will bring the pattern driven normalization effort to a close. Lets move to more open ended approaches that better address the long tail side of the issue:

Has Obama written a stern letter to North Korea yet? They will respect that.
Has Obama written a stern letter to North Korea yet? They will respect that. <--- indeed!

Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party 
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner W...  #tcot #p2 #tlot
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner W...  #tcot #tlot #p2
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... 

The first pair only differ in an additional ‘<—indeed!’ which is not a likely pattern in the data that can be exploited much beyond this case. The second example shows that the basic tweet has many small variations. In these situations it is useful to deduce what algorithm you use to say that the tweets are (mostly) the same.

In my next post in the series I will cover Jaccard Distance as the method of identifying near duplicates and bring in the importance of evaluation metrics to drive development.

Breck

Call for PhD Ideas for Lucene/Solr Implementation

November 29, 2010

Otis over at the Sematech blog wants to get people identifying hard problems faced by people working in search that PhD students can select from and hopefully solve with a Lucene/Solr implementation. The students get a ‘real world problem’ and the world gets a concrete open source implementation of the solution. The call is: Lucene / Solr for Academia: PhD Thesis Ideas

My suggested PhD idea is tolerable precision at high recall dictionary matching of phrases. Mike Ross spent a good deal of time trying to get 100% matches of genes in MEDLINE abstracts given a dictionary of genes (Entrez Gene) and aliases. The core of the problem is that not all the mentions of genes are on the aliases set for the gene. Huge issues around efficiency in addition to getting it working at all.


Follow

Get every new post delivered to your Inbox.

Join 312 other followers