Archive for the ‘Java’ Category

Why is C++ So Fast?

July 1, 2011

I had to respond to John Cook’s blog post.

My response was so long, I thought I’d repost it here.

I didn’t understand the answer properly six months ago, despite spending a couple of years as a professional C coder (when I did speech recognition) and having spent the past ten or so years writing Java. C++ is different enough from either of these to be a completely different sort of beast.

C++ is Faster than C!

At least, it’s easier to write fast code in C++ than in C these days. In fact, these days, C++ is the language of choice for optimization, not plain old C. The reason it’s so efficient is twofold.

Reason 1: Tight Data Structures

First, C++ is intrinsically stingy with memory (unlike Java objects, a C++ struct has no memory overhead if there are no virtual functions [modulo word alignment issues]). Smaller things run faster due to caching, and are also more scalable. Of course, this is true of C, too.

Reason 2: Template Metaprogramming

What makes C++ both harder to understand and debug, and even more efficient than C, is its use of template metaprogramming.

In particular, there are huge gains to be made from the kind of lazy evaluation you can implement with expression templates and from the kind of memory and indirection savings you get by avoiding virtual functions with the curiously recurring template pattern.

[Update: 28 July 2011: I thought an example would help, and this was the original one motivating template expressions.]

Suppose you have a scalar x and two m by n matrices a and b. Now suppose you want to evaluate d = a + (x * b). The standard non-lazy approach would be to create an intermediate matrix c representing the scalar-matrix product (x * b) and then add that to a. Basically, what’d happen is the same as if you’d coded it like this:

matrix c(m,n);
for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j)
        c(i,j) = x * b(i,j);
matrix d(m,n);
for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j)
        d(i,j) = a(i,j) + c(i,j);
return d;

With clever template expression programming, you can get around allocating an intermediate value, so that the resulting computation would amount to the same thing as coding it as follows.

matrix d(m,n);
for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j)
        d(i,j) = a(i,j) + x * b(i,j);
return d;

This saves the allocation and deallocation of matrix c. Memory contention is a huge problem with numerical code these days, as CPU speed and parallelism outstrip memory speed, so this is a very large savings. You also save intermediate index arithmetic and a number of sets and gets on arrays, which are not free.

The same thing can happen for transposition, and other operations that don’t really need to generate whole intermediate matrices. The trick’s figuring out when it’s more efficient to allocate an intermediate. The Eigen matrix package we’re using seems to do a good job of this.

[end Update: 28 July 2011]

Voting with Their Code

I just visited Google NY last week, where some of my old friends from my speech recognition days work. Upon moving to Google, they reimplemented their (publicly available) finite state transducer library in C++ using expression templates (the OpenFST project).

You’ll also see expression templates and the curiously recurring template pattern in the efficient matrix libraries from Boost (uBLAS, which stays at the BLAS level) and Eigen (which also provides linear algebra). They use templates to avoid creating intermediate objects for transposes or scalar-matrix products, just like a good Fortran compiler.

But Wait, C++ is also More Expressive

Templates are also hugely expressive. They lead to particularly neat implementations of techniques like automatic differentiation. This is what roused me from my life of Java and got me into C++. That, and the need for good probability and matrix libraries — why don’t these things exist in Java (or where are they if I missed them?).

The Future: Stochastic Optimization

What’s really neat is that there are now stochastic optimizers for C++ that can analyze your actual run-time performance. I really really really like that feature of the Java server compilers. Haven’t tried it in C++ yet.

PS: So Why is LingPipe in Java?

The ability to write fast programs isn’t everything. Java is much, and I mean much, easier to deploy. The whole jar thing is much easier than platform-specific executables.

Java’s code organization is much cleaner. None of this crazy header guard and .hpp versus .cpp stuff. No putting namespaces one place, code another place, and having them completely unrelated to includes (you can be more organized, but this stuff’s just much easier to figure out in Java because it’s conventionalized and everyone does it the same way). The interfaces are simpler — they’re explicit, not undocumented pure abstractions (where you have to look at the code of what they call an “algorithm” to see what methods and values it requires of its objects).

Java has support for threading built in. And a great concurrency library. There are web server plugins like servlets that make much of what we do so much easier.

There’s also programming time. I’m getting faster at writing C++, but there’s just so much more that you need to think about and take care of, from both a system author and client perspective (mainly memory management issues).

Many people don’t realize this, but C is pretty much the most portable language out there. As far as I know, there’s no place Java runs that C won’t run. C++ is a different story. The compilers are getting more consistent, but the texts are still fraught with “your mileage may vary due to compiler” warnings, as are the mailing lists for the different packages we’re using (Eigen, Boost Spirit, Boost Math), all of which lean heavily on templates.

Drafts 0.5 of Books on LingPipe and Java Text

June 27, 2011

We’ve released draft 0.5 of the LingPipe books. Get them while they’re hot from their home page:

Two Books for the Price of One

For version 0.5, we’ve broken out the bits about text processing in Java that aren’t LingPipe specific into their own book. And we’ve slightly changed the title of the LingPipe book to make this clearer.

Authors Piling On

Breck’s going to help out with both books.

Mitzi’s agreed to help us out with the text processing book now that she’s back in the bio(medical) text processing business after several years of working on genomics.

Comments, Please

As always, your comments are appreciated.

Pro Bono Projects and Internships with LingPipe

April 4, 2011

We have initiated a pro bono program that teams a LingPipe employee with a programmer wanting to learn about LingPipe to serve a non-profit or researcher needing help with a text analytics/NLP/computational linguistics project.

So far it is going pretty well, with one classification project underway. The way it went was that I worked with an intern, a solid NLP programmer, to help a Georgia Tech researcher get a classification system up and running. Now the intern is running the project with some feedback from me, but not much is required at this point.

We are taking applications for interns wanting to learn how to code with LingPipe. We also will take applications for projects. Non-profits or academic research only please. Just send breck@lingpipe.com an email outlining your skills/needs and we will see what we can do.

Breck

Processing Tweets with LingPipe #3: Near duplicate detection and evaluation

January 2, 2011

Summary:

In Post #1 we covered how to search Twitter and get a useful on disk data representation of the search results. In Post #2 we covered the first level of deduplication using HashSets as the test of sameness. We extended sameness by doing some normalization of the tweets which included removing urls and retweets.

You can download the code with this tarball or get it from subversion with the command


svn co https://aliasi.devguard.com/svn/teaching/lingpipe4twitter

Despite the solutions above, we still have annoyingly similar tweets that are not useful for my goals for this set of posts. In particular, the near duplicates really foul up any attempts at cross validation, where one part of the data is used as training and the other as test data. If there are duplicates, then we are effectively training on test data for some examples, and results end up looking too good.

Tokenization and Jaccard Distance

The duplicate detection problem in Twitter is really about word overlap with a slight game of telephone quality to it. The elaborations of previous tweets tend to be leading or following comments with the core of the source tweet preserved. Not much rephrasing is going on so word overlap between tweets is the obvious place to go. That entails a few elaborations to our approach:

  1. Find words in the tweets: Tokenization
  2. Measure similarity of tweets without sensitivity to tweet length: Jaccard Distance

Tokenization

A very simple tokenizer can just break on the characters that separate words, as follows:

public class Tokenize1 {
    //Approach #1, find token separators
    public static void main (String[] args) {
	String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz";
	System.out.println("Tweet: " + tweet);
	String[] tokens = tweet.split(" ");
	for (String token : tokens) {
	    System.out.println("Token: " + token);
	}
    }
}

This code is in src/Tokenize1.java. It produces output:

>ant tokenize1
     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz
     [java] Token: RT
     [java] Token: @mparent77772:
     [java] Token: Should
     [java] Token: Obama's
     [java] Token: 'internet

Another approach would be to try to define what characters belong in a token and match that, but that is a little more complex to program because String does not have a simple method call to capture an array of regex matches–see the regex chapter in the LingPipe book for more discussion:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Tokenize2 {
    //Approach 2, define what is a token
    public static void main (String[] args) {
	String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz";
	Pattern tokenPattern = Pattern.compile("\\w+"); //match one or more letter or digit chars
	System.out.println("Tweet: " + tweet);
	Matcher matcher 
	    = tokenPattern.matcher(tweet);
	while (matcher.find()) { //keep matching until no more matches
	    System.out.println("Token: " + matcher.group());//print what matched
	}
    }
}

This approach produces the following output:

>ant tokenize2
     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz
     [java] Token: RT
     [java] Token: mparent77772
     [java] Token: Should
     [java] Token: Obama
     [java] Token: s
     [java] Token: internet

Note that the text spans that separate the tokens are very different. Between ‘RT’ and ‘mparent77772’ is the separator ‘ @’. Depending on the needs of the application, it can be easier to define tokens by what they look like rather than by what the spaces around them look like. Often it is both.

While these approaches might work for the case at hand, we will introduce the LingPipe tokenizers instead. They offer much richer tokenization options ready to go–see chapter 8 of the LingPipe book draft for more details or look at the Javadoc.

An equivalent tokenization in the LingPipe API is created as follows:

import com.aliasi.tokenizer.RegExTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;

public class Tokenize3 {
    public static void main (String[] args) {
	//approach #3, A LingPipe tokenizer
	String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz";
	TokenizerFactory tokFactory
	    = new RegExTokenizerFactory("\\w+");//match one or more letter or digit chars
	System.out.println("Tweet: " + tweet);
	char[] chars = tweet.toCharArray();
	Tokenizer tokenizer 
	    = tokFactory.tokenizer(chars,0,chars.length);
	String token;
	System.out.println("White Space :'" +  tokenizer.nextWhitespace() + "'");
	while ((token = tokenizer.nextToken()) != null) {
	    System.out.println("Token: " + token);
	    System.out.println("White Space :'" + tokenizer.nextWhitespace()+"'");
	}
    }
}

Its output reports both the whitespace and the tokens.

     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz
     [java] White Space :''
     [java] Token: RT
     [java] White Space :' @'
     [java] Token: mparent77772
     [java] White Space :': '
     [java] Token: Should

We add the normalization features of post #2 by extending ModifyTokenTokenizerFactory to filter retweets and urls. A good way to develop a tokenizer is to put it into its own class and create a stand-alone task that runs the tokenizer over example data. In ‘src/NormalizedTokenizerFactory.java’ there is a main method:


public static void main(String[] args) {
	TokenizerFactory tokFactory = new NormalizedTokenizerFactory();
	String text = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz";
	System.out.println("Tweet: " + text);
	char[] chars = text.toCharArray(); //convert to charArray
	Tokenizer tokenizer 
	    = tokFactory.tokenizer(chars,0,chars.length);
	String token;
	System.out.println("White Space :'" +  tokenizer.nextWhitespace() + "'");
	while ((token = tokenizer.nextToken()) != null) {
	    System.out.println("Token: " + token);
	    System.out.println("White Space :'" + tokenizer.nextWhitespace()+"'");
	}
    }

Running the above with ant normalizedTokenizerFactory produces nicely normalized output.

     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz
     [java] White Space :''
     [java] Token: Should
     [java] White Space :' '
     [java] Token: Obama
     [java] White Space :'''
     [java] Token: s
     [java] White Space :' ''
     [java] Token: internet
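
For reference, here is one way such a factory might be put together. This is a hypothetical sketch, not the NormalizedTokenizerFactory in the repository, and it assumes that LingPipe's ModifyTokenTokenizerFactory drops a token when modifyToken() returns null; check the Javadoc for your version before relying on that.

import com.aliasi.tokenizer.ModifyTokenTokenizerFactory;
import com.aliasi.tokenizer.RegExTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;

//Hypothetical sketch only -- not the repository's NormalizedTokenizerFactory.
public class NormalizedTokenizerFactorySketch
    extends ModifyTokenTokenizerFactory {

    //base tokenizer matches urls and @mentions as single tokens so they
    //can be filtered whole; everything else is runs of word characters
    static final TokenizerFactory BASE_FACTORY
        = new RegExTokenizerFactory("https?://\\S+|@\\w+|\\w+");

    public NormalizedTokenizerFactorySketch() {
        super(BASE_FACTORY);
    }

    @Override
    public String modifyToken(String token) {
        //assumes returning null removes the token from the token stream
        if (token.equals("RT") || token.startsWith("@") || token.startsWith("http"))
            return null; //drop retweet markers, mentions and urls
        return token;
    }
}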

Working out exactly how the repository’s tokenizer does its filtering is left as an exercise for the reader. Time to take on Mr. Jaccard.

Jaccard Distance as a Measure of Similarity

Jaccard Distance is a good way to impress folks with a fancy-sounding term for what, in our usage, is really just percent word overlap. If you think it helps you can also call it by the term ‘coefficient de communauté’, but then you might have to reveal that it was popularized by a Swiss botanist–kind of undermines its jargon impact score. From the JaccardDistance Javadoc, the proximity of two character sequences is:

 proximity(cs1,cs2)
   = size(termSet(cs1) INTERSECT termSet(cs2))
     / size(termSet(cs1) UNION termSet(cs2))
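
To make the formula concrete, here is a tiny plain-Java sketch that computes the same proximity directly from two hypothetical token sets (no LingPipe classes involved):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardByHand {
    public static void main(String[] args) {
        //made-up token sets standing in for two tokenized tweets
        Set<String> terms1 = new HashSet<String>(Arrays.asList("Obama","internet","kill","switch"));
        Set<String> terms2 = new HashSet<String>(Arrays.asList("Obama","kill","switch","curbed"));

        Set<String> intersection = new HashSet<String>(terms1);
        intersection.retainAll(terms2); //terms appearing in both tweets

        Set<String> union = new HashSet<String>(terms1);
        union.addAll(terms2); //terms appearing in either tweet

        double proximity = intersection.size() / (double) union.size();
        System.out.println("proximity = " + proximity); //3 shared / 5 total = 0.6
    }
}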

We get the term sets from the tokenizer, and proximity is the fraction of the combined term set that the two character sequences share. The code to compute this for all pairs of tweets in our search results is in src/ExploreJaccard.java and the relevant bit of code is:

	JaccardDistance jaccardD = new JaccardDistance(tokFactory);
	int filteredCount = 0;
	List<String> candidateTweets 
	    = filterNormalizedDuplicates(texts, 
					 tokFactory); //throw out easy cases
	System.out.println("Normalized duplicate filter leaves " + candidateTweets.size() + " tweets");
	row = new ArrayList<String>();
	for (int i = 0; i < candidateTweets.size(); ++i) {
	    String closestTweet = "default value";
	    double closestProximity = -1d;
	    String targetTweet = candidateTweets.get(i);
	    for (int j = 0; j < candidateTweets.size(); ++j ) {//cross product, ouchy, ow ow. 
		String comparisonTweet = candidateTweets.get(j);
		double thisProximity 
		    = jaccardD.proximity(targetTweet,comparisonTweet);
		if (i != j) { // can't match self
		    if (closestProximity < thisProximity) {
			closestTweet = comparisonTweet;
			closestProximity = thisProximity;
		    }
		}
	    }

The goal of the above wildly inefficient program is to explore what the closest tweet is, as determined by token overlap. The filterNormalizedDuplicates(texts,tokFactory) method filters out duplicates as discussed in post #2. We can then decide on a threshold to throw out tweets with too much overlap. Running the program on the Obama.csv example:

ant exploreJaccard -Ddata=searches/Obama.csv -Doutfile=runs/Obama.jaccard.csv

and then viewing the output .csv file with a sort on column B we get:

Jaccard Distance sorted for similarity

Note that we get some values of 1 even though there are differences between Tweet1 and Tweet2. In row 2 the normalization filters out the url, leaving one difference: Tweet1 has the phrase "replacing trains by high speed buses" where Tweet2 has "replacing high speed trains by high speed busses", with an extra "high speed". Since Jaccard works on the set of words rather than their counts, the extra "high speed" adds no new words to the set, so the proximity is still 1.

Scrolling down reveals just how bad redundancy can be with tweets. At the 50% overlap point the tweets are still looking quite similar:

	Vet, 77, Busted For Obama Death Threat | The Smoking Gun http://t.co/MrTUwxv via @
	Vet, 77, Busted For Obama Death Threat http://tinyurl.com/25zyxgp #tcot #tlot #sgp

There are 14 unique tokens, with 7 overlapping, yielding 50% overlap. 36% of the Obama query tweets have an overlap of 50% or more with another tweet. Trying a different query, "the", finds less overlap between tweets: only 3% of the tweets have another tweet with 50% overlap. The query "today" yields 14% of tweets with such an overlap. The lesson here is that some queries yield more uniform tweets, which impacts subsequent language processing. A more technical way of expressing this observation is that the entropy of the resulting search results varies depending on the query. The "obama" search results have lower entropy (they are less random) than the results for "the", which are more random.
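
One simple proxy for that entropy, sketched below, is the Shannon entropy of the token unigram distribution over the search results; redundant result sets score lower than varied ones. The helper and the toy data are hypothetical, not code from the repository, and assume the tweets have already been tokenized.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TokenEntropy {

    //Shannon entropy (in bits) of the token unigram distribution
    static double entropy(List<String[]> tokenizedTweets) {
        Map<String,Integer> counts = new HashMap<String,Integer>();
        long total = 0L;
        for (String[] tokens : tokenizedTweets) {
            for (String token : tokens) {
                Integer c = counts.get(token);
                counts.put(token, c == null ? 1 : c + 1);
                ++total;
            }
        }
        double h = 0.0;
        for (int count : counts.values()) {
            double p = count / (double) total;
            h -= p * Math.log(p) / Math.log(2.0);
        }
        return h;
    }

    public static void main(String[] args) {
        List<String[]> redundant = Arrays.asList(
            new String[] { "obama", "pardons", "turkey" },
            new String[] { "obama", "pardons", "turkey" });
        List<String[]> varied = Arrays.asList(
            new String[] { "obama", "pardons", "turkey" },
            new String[] { "train", "to", "chicago" });
        System.out.println("redundant: " + entropy(redundant)); //log2(3), about 1.58 bits
        System.out.println("varied:    " + entropy(varied));    //log2(6), about 2.58 bits
    }
}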

The ‘today’ query had usefully different tweets at the 50% overlap level:

Playing a show in Chicago, IL at 9:00 PM today at LE PASSAGE http://artistdata.com/a/32on	
Playing a show in Cape Girardeau, MO at 9:00 PM today at The Venue http://artistdata.com/a/32ow

Despite sharing half the words they are clearly not retweets and contain different information. It might be quite difficult to automatically reject ‘obama’ near duplicates at 50% token overlap but retain the ‘today’ near duplicates with 50% overlap–I encourage people to suggest approaches in the comments.

Finally we add a class that will take a set of tweets and filter them for near duplicates at a given Jaccard proximity in src/DedupeJaccard.java. The interesting bit is:

    public static List<String> filterTweetsJaccard(List<String> texts,
				       TokenizerFactory tokFactory,
				       double cutoff) {
	JaccardDistance jaccardD = new JaccardDistance(tokFactory);
	List<String> filteredTweets = new ArrayList<String>();
	for (int i = 0; i < texts.size(); ++i) {
	    String targetTweet = texts.get(i);
	    boolean addTweet = true;
	    //big research literature on making the below loop more efficient
	    for (int j = 0; j < filteredTweets.size(); ++j) {
		if (jaccardD.proximity(targetTweet,filteredTweets.get(j)) >= cutoff) {
		    addTweet = false;
		    break; //one nod to efficiency
		}
	    }
	    if (addTweet) {
		filteredTweets.add(targetTweet);
	    }
	}
	return filteredTweets;
    }

Deduplication along these lines is a big deal for web search as well as for cleaning up Twitter feeds. The painful bit is the comparison of each new tweet to all the tweets that have passed the filter before, which does not scale well. Much work has gone into hashing schemes that test for a hash match based on a lexical fingerprint of the existing corpus of tweets, as well as more sophisticated similarity metrics. A recent paper with a decent overview of past work is http://research.microsoft.com/pubs/130851/ANDD.pdf.
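
As a toy illustration of the fingerprint idea (not simhash, minhash, or the method of any particular paper), you can hash each tweet's token set and keep the fingerprints in a set, which turns the "seen it before" test into a single lookup instead of a scan over every previously accepted tweet:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

public class FingerprintDedupe {

    //toy lexical fingerprint: the hash code of the sorted token set
    static int fingerprint(String[] tokens) {
        return new TreeSet<String>(Arrays.asList(tokens)).hashCode();
    }

    public static void main(String[] args) {
        Set<Integer> seen = new HashSet<Integer>();
        String[][] tweets = {
            { "obama", "pardons", "turkey" },
            { "pardons", "turkey", "obama" }, //same token set, different order
            { "obama", "visits", "chicago" }
        };
        for (String[] tokens : tweets) {
            if (!seen.add(fingerprint(tokens)))
                System.out.println("near duplicate: " + Arrays.toString(tokens));
        }
    }
}

This only catches tweets whose token sets are identical; real near-duplicate systems fingerprint overlapping shingles or use locality-sensitive hashing so that small edits still collide.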

Running the code on the Obama search results removes half the tweets with a .5 cutoff.

>ant dedupeJaccard -Ddata=searches/Obama.csv -Doutfile=runs/ObamaJaccardFiltered.csv
Buildfile: build.xml
compile:
    [javac] Compiling 1 source file to /home/breck/devguard/teaching/lingpipe4twitter/build/classes

jar:
      [jar] Building jar: /home/breck/devguard/teaching/lingpipe4twitter/demo.jar

dedupeJaccard:
     [java] Writing to: runs/ObamaJaccardFiltered.csv
     [java] Filtered size: 752 originally 1500

Looking at the tweets another issue immediately presents itself:

Coupe du monde - Obama: "Mauvaise décision": Le président Barack Obama a… http://goo.gl/fb/j5OUm #Maroc #sport #foot
RT @Luminary212: LOL “@BreakingNews: Obama knocks FIFA's decision to award Qatar, not the U.S., the 2022 World Cup: 'Wrong decision'”
I want to meet President Obama.
RT @laaficion Barack Obama, presidente de EU, consideró que la FIFA se equivocó al darle a Qatar el Mundial de 2022 y no a su país// celos

All sorts of languages are in the data. It looks like there is French and Spanish mixed in with the English. The next post will cover building a language identification classifier, which will return to the notion of entropy discussed above.

Wrapup

In the quest to clean up the search results I have introduced some key concepts in text processing that will resurface in various forms throughout the posts. Summarizing:

  • Tokenization: Breaking strings into words/word separators is key to getting at the unit of meaning for word-based processing. Oddly enough, in the next post we won’t be using words for language id, but we will still be tokenizing.
  • Normalization: By eliminating trivial differences we can make text look more uniform–but beware of this tendency.
  • Entropy: Some collections of data are more uniform than others.
  • Similarity: There are measures like Jaccard Distance to estimate the similarity of text. We will see many more examples that implement topic assignment, sentiment and more.
  • Evaluation: We have picked an operating point for near duplicate detection by examining data. While the evaluation metric remains quite loose, in the next post we will both create gold-standard (or truth) data and evaluate the performance of our language id system against it.

Breck

A Call to Code: Microsoft Research/Bing Query Spell Check Challenge

December 21, 2010

And I quote: "Microsoft Research in partnership with Bing is happy to launch the Speller Challenge." First place is $10,000; the contest starts January 17, 2011 and ends May 27, 2011.

Speller Challenge

We have the baseline implementation of the required system in our tutorials, see Spelling Tutorial. The system is not wrapped in a web service and you will need to dig up your own training data—Wikipedia anyone?

There may be issues if you want to use LingPipe tools to build the system; here is our Royalty Free License. For the purposes of this contest we will create a free license for contest use that is non-restrictive. If you’d like to request a free license for the contest, send me an email at breck@lingpipe.com.

Breck

Processing Tweets with LingPipe #2: Finding Duplicates with Hashing and Normalization

December 2, 2010

This post is about taking the csv file format used in post #1 and eliminating the huge number of annoying duplicates or near duplicates that exist in Twitter search results. The next entry in the series will cover more open ended near duplicate detection strategies such as Jaccard Distance and will introduce the very important area of evaluation.

As before you can download the code with this tarball or get it from subversion with the command


svn co https://aliasi.devguard.com/svn/teaching/lingpipe4twitter

We will start with a simple ingest and representation of the csv data and consider how to find near or exact duplicates without concern for computational efficiency. The research literature in this area is very concerned with scalable performance; I will instead aim for more ideal solutions that favor simplicity and quality.

Hashing Driven Deduplication

A simple way to find exact duplicates is to use a HashSet data structure in Java for comparison of tweets. Once we get the tweets into a list, we iterate over the tweets to see whether the exact string is in the HashSet. Below is the recognition phase from src/DedupeSimpleHash.java:

	List<String> texts = parseFile(dataDir);
	HashSet<String> seenBefore = new HashSet<String>();
	for (int i=0; i < texts.size(); ++i) {
	    String tweet = texts.get(i);
	    if (seenBefore.contains(tweet)) {
		++seenCounter;
		System.out.println("Seen Before at position:" + i +": " + tweet);
	    }
	    else {
		seenBefore.add(tweet);
	    }
	}	

The code iterates over the CSV-format search results, tests whether the HashSet already contains the string, and adds the string if the set does not contain it. I could have just added all the strings and the HashSet would have behaved the same, but that is slightly automagical and might prove confusing. Below is the code for writing out the CSV format for presumably unique tweets:

	File output = new File(outDir,outFile); 
	System.out.println("Writing to: " + output.toString());
	FileOutputStream stream =  new FileOutputStream(output);
	OutputStreamWriter streamWriter 
	    = new OutputStreamWriter(stream,Strings.UTF8);
	CsvListWriter writer 
	    = new CsvListWriter(streamWriter,CsvPreference.EXCEL_PREFERENCE); 
	ArrayList<String> row = new ArrayList<String>();
	row.add("");
	row.add("");
	row.add("");
	row.add("Unique Tweet");
	writer.write(row); //header for csv file
	for (String tweet : seenBefore) {
	    row.clear();
	    row.add("");
	    row.add("");
	    row.add("");
	    row.add(tweet);
	    writer.write(row);
	}
	writer.close();
	System.out.println(seenCounter + " seen of " + texts.size());

Note that I have kept the tweet in the same column as in the search results. This convention will allow interaction with other components.

Running the program on searches/Obama.csv yields the following:

>ant dedupeSimpleHash -Ddata=searches/Obama.csv -Doutdir=runs -Doutfile=Obama.simpleDedupeHash.csv
Buildfile: build.xml

compile:

jar:

dedupeSimpleHash:
     [java] 2
     [java] Seen Before at position:48: RT @rolandsmartin: President Obama: 'I Don't Think About Sarah Palin' - http://bit.ly/gXZfqN http://fb.me/NRKRnQWy
     [java] Seen Before at position:58: RT @New_federalists: Obama lets Islamic terrorists pour across the Mexican border, but treats US citizens as criminals at airports!!! #RESIST #REBEL !
....[snip]....
     [java] Seen Before at position:1490: RT @Cubachi: RT @PalinTV: Obama can't even pardon a Turkey without notes
     [java] Seen Before at position:1499: RT @BreakingNews: President Barack Obama pardons turkeys “Apple” and “Cider”; says feels good to prevent one “shellacking” this year
     [java] Writing to: runs/Obama.simpleDedupeHash.csv
     [java] 297 seen of 1500

Duplicate Detection via Normalization

The simple deduplication approach gets rid of 297 exact duplicates, or 20% of the tweets. Pretty good, but looking at the ‘unique’ tweets and sorting on the texts in a spreadsheet shows some other issues:

Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... http://bit.ly/dHWLod
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... http://bit.ly/geCufk
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... http://bit.ly/hp7In9
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... http://bit.ly/iggM61
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... http://bit.ly/ijwRNA

Note that your results will vary since we are not running the searches on the same day. These posts differ only in the bit.ly url, which still leaves them pretty much the same tweet as far as I am concerned. Looking further I see that tweets get retweeted with a prefix of the form ‘RT @user:’ pretty often as well.

RT @mattyglesias: Repealing the whole Affordable Care Act would be a mistake, but Obama should compromise and agree to scrap the death...
Repealing the whole Affordable Care Act would be a mistake, but Obama should compromise and agree to scrap the death...

It appears that removing urls and the retweet prefixes will find more duplicates. That suggests normalization to help make the tweets more uniform by eliminating "trivial" differences. A look at src/DedupeNormalizedHash.java shows regular expressions that replace the urls and retweet prefixes with the empty string.

	List<String> texts = parseFile(dataDir);
	HashSet<String> seenBefore = new HashSet<String>();
	for (int i=0; i < texts.size(); ++i) {
	    String tweet = texts.get(i);
	    System.out.println("Before:" + tweet);
	    tweet = tweet.replaceAll("\\s*RT\\s*@\\w+:\\s*","");//removes "RT @foobar:"
	    tweet = tweet.replaceAll("https?:[^\\s]*",""); //removes "http://foo" "https://bar"
	    System.out.println("After :" + tweet);	    
	    if (seenBefore.contains(tweet)) {
		++seenCounter;
		System.out.println("Seen Before at position:" + i +": " + tweet);
	    }
	    else {
		seenBefore.add(tweet);
	    }
	}	

Running the normalized approach, the modifications are evident:

> ant dedupeNormalizedHash -Ddata=searches/Obama.csv -Doutdir=runs -Doutfile=Obama.dedupeNormalizedHash.csv
[snip]
     [java] Before:RT @rolandsmartin: President Obama: 'I Don't Think About Sarah Palin' - http://bit.ly/gXZfqN http://fb.me/NRKRnQWy
     [java] After :President Obama: 'I Don't Think About Sarah Palin' -  
[snip]
     [java] Writing to: runs/Obama.dedupeNormalizedHash.csv
     [java] 454 seen of 1500

Both patterns apply to the tweet, removing ‘RT @rolandsmartin: ’ and ‘http://bit.ly/gXZfqN http://fb.me/NRKRnQWy’. The normalization allows for identification of another 157 duplicates. Normalization is a standard technique used when working with text data. You see it in search engines where ‘walks’, ‘walking’ and ‘walked’ are normalized to the root word ‘walk’ with a stemmer. Both documents and queries are normalized the same way, allowing a query ‘Books about walking’ to match documents that don’t mention ‘walking’ but do mention ‘walk’, ‘walks’ or ‘walked.’ Normalization can bring more trouble than the problems it solves and should be used carefully–perhaps a future blog post is called for on our preference to use character n-grams in situations where stemming is typically used.
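
For readers who have not run into stemming before, here is a deliberately naive suffix-stripping sketch of the idea; real stemmers such as the Porter stemmer are far more careful about exceptions.

public class NaiveStemmer {

    //purely illustrative suffix stripping, not a real stemming algorithm
    static String stem(String token) {
        if (token.endsWith("ing")) return token.substring(0, token.length() - 3);
        if (token.endsWith("ed")) return token.substring(0, token.length() - 2);
        if (token.endsWith("s")) return token.substring(0, token.length() - 1);
        return token;
    }

    public static void main(String[] args) {
        for (String word : new String[] { "walk", "walks", "walking", "walked" })
            System.out.println(word + " -> " + stem(word)); //all map to "walk"
    }
}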

Looking at the output of the normalized hash approach reveals more opportunity to eliminate near duplicates, and there may be more patterns to exploit, but eventually diminishing returns will bring the pattern-driven normalization effort to a close. Let’s move to more open-ended approaches that better address the long-tail side of the issue:

Has Obama written a stern letter to North Korea yet? They will respect that.
Has Obama written a stern letter to North Korea yet? They will respect that. <--- indeed!

Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party 
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner W...  #tcot #p2 #tlot
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner W...  #tcot #tlot #p2
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... 

The first pair only differ in an additional ‘<--- indeed!’, which is not a likely pattern in the data that can be exploited much beyond this case. The second example shows that the basic tweet has many small variations. In these situations it is useful to make explicit what algorithm you use to decide that two tweets are (mostly) the same.

In my next post in the series I will cover Jaccard Distance as the method of identifying near duplicates and bring in the importance of evaluation metrics to drive development.

Breck

R’s Scoping

September 9, 2010

[Update: 10 September 2010 I didn’t study Radford Neal’s example closely enough before making an even bigger mess of things. I’d like to blame it on HTML formatting, which garbled Radford’s formatting and destroyed everyone else’s examples, but I was actually just really confused about what was going on in R. So I’m scratching most of the blog entry and my comments, and replacing them with Radford’s example and a pointer to the manual.]

A Better Mousetrap

There’s been an ongoing discussion among computational statisticians about writing something better than R, in terms of both speed and comprehensibility:

Radford Neal’s Example

Radford’s example had us define two functions,

> f = function () { 
+     g = function () a+b
+     a = 10
+     g()
+ }

> h = function () { 
+     a = 100
+     b = 200
+     f()
+ }

> b=3

> h()
[1] 13

This illustrates what’s going on, assuming you can parse R. I see it, I believe it. The thing to figure out is why a=10 was picked up in the call to g() in f, but b=200 was not picked up in the call to f() in h. Instead, the global assignment b=3 was picked up.

RTFM

Even after I RTFM-ed, I was still confused.

It has a section 10.7 titled “Scope”, but I found their example

cube <- function(n) {
    sq <- function() n*n
    n*sq()
}

and the following explanation confusing,

The variable n in the function sq is not an argument to that function. Therefore it is a free variable and the scoping rules must be used to ascertain the value that is to be associated with it. Under static scope (S-Plus) the value is that associated with a global variable named n. Under lexical scope (R) it is the parameter to the function cube since that is the active binding for the variable n at the time the function sq was defined. The difference between evaluation in R and evaluation in S-Plus is that S-Plus looks for a global variable called n while R first looks for a variable called n in the environment created when cube was invoked.

I was particularly confused by the “environment created when cube was invoked” part, because I couldn’t reconcile it with Radford’s example.

Let’s consider a slightly simpler example without nested function calls.

> j =10
> f = function(x) j*x
> f(3)
[1] 30
> j =12
> f(3)
[1] 36

This shows it can’t be the value of j at the time f is defined, because it changes when I change j later. I think it’s actually determining how it’s going to find j when it’s defined. If there’s a value of j that’s lexically in scope (not just defined in the current environment), it’ll use that value. If not, it’ll use the environment of the caller. And things that go on in subsequent function definitions and calls, as Radford’s example illustrates, don’t count.

Am I the only one who finds this confusing? At least with all your help, I think I finally understand what R’s doing.

McCandless, Hatcher & Gospodnetić (2010) Lucene in Action: 2nd Ed. (Review)

August 16, 2010

I’m very happy to see that the 2nd Edition is out:

  • McCandless, Michael, Erik Hatcher, and Otis Gospodnetić. 2010. Lucene in Action. Second Edition. Manning Press.

Manning’s offering 40% off until September 30, 2010. Follow the link to the book and use code lingpipeluc40 when you check out.

You Need This Book

If you’re interested in version 3 or later of the Apache Lucene search library, you need this book. As far as I know, this book is the only way to come up to speed on the intricacies of Lucene.

What if I have the First Edition?

Keep it for your projects that require 2.x versions of Lucene. So much has changed in version 3 that is not backward compatible with previous versions that the first edition of this book will be of almost no use. Every example has changed. It’s pretty much a whole new book (other than the first page where it describes Lucene as simple — that used to be true, but is not any longer).

I’ve been coding in Lucene since version 1.2 (2002), and the javadoc’s been getting better over time. I still needed the code samples from this book to get a basic index up and running and to figure out how to enumerate the terms given an analyzer. (In the first edition of the book, I contributed a case study on n-gram-based analysis, so it’s not like I was unfamiliar with the whole process!)

Is it a Good Book?

I’m going to violate the “show, don’t tell” rule of writing and answer my own question with an unequivocal “yes”. Now let me show you why.

The authors clearly know their Lucene. Not surprising, given that they’re three of the core developers for Lucene. (Erik Hatcher was also involved in the Ant build system, and has another Manning … in Action book for that.) Following the Manning Press in Action methodology, they lay everything out with very clear examples that you can run. As such, the book’s better than good — it’s a really great set of code examples presented in a coherent style and order, with easy-to-follow explanations.

The book covers a wide range of introductory and advanced topics in a way that’s about as easy to follow as possible. It starts from simple indexing and query construction and then moves into more advanced topics like spell checking and term vector more-like-this implementations. It does all of this with examples, cleverly sidestepping most of the (relatively simple) mathematics underlying search and scoring, while still conveying the basic principles.

They cover extras like the Tika document framework, where they provide a very clear intro to the SAX XML parsers and a nice overview of the Jakarta Commons Digester (I dislike the config-and-reflection-based Digester so much that I built LingPipe’s delegating and delegate handlers as an in-the-code alternative).

If you follow the examples in this book and try them out you’ll be in a good position to use Lucene in a serious application. Even better, you’ll be prepared to join the project’s amazingly helpful and responsive mailing list (for years now, the first response to most newbie questions has been to read Lucene in Action!).

They cover many more advanced topics than the last book, like the range of file-level locking options and their use in a more transactional setting than basic Lucene provides itself.

Needs a Revision

Reading the book, I can’t shake the guilt arising from not sending them feedback on drafts. I had the drafts lying around in PDF form for ages and kept promising to send Otis feedback. As is, this book needs another editorial pass. (That’s not to say it’s not fantastic as is — all the glitches in the book I saw were relatively minor and didn’t get in the way of the code examples actually illustrating what’s going on with Lucene.)

The initial two lines of a paragraph are repeated on pages xxx and xxxi. There are also typos like inserted question marks, as in “book?two hundred years ago” on p. xxxii.

There are mistakes in code fragments, like a missing comma in the parse() example on p. 239.

There are also bugs in code. I only studied a few programs in the term vector section (which is great) and in the Tika section. Listing 5.18 has a bug in that it calls searcher.search() with a maximum number of hits of 10 (below pushpin 6), whereas the method is specified to take a configurable max (above pushpin 3).

There are dangling references like “uncomment the output lines toward the end of the docsLike method” (on p. 195), where there are no output lines toward the end of the said method (on p. 193).

I’d be in favor of getting rid of the content-free text like “It’s time to get our feet wet!” and so on, while you’re at it.

What’s Missing?

There’s a huge range of contributed packages surrounding Lucene these days, coupled with an ever-expanding range of applications to which Lucene is being put. As a result, the authors face a daunting curatorial challenge.

From my perspective, the biggest gap is around Unicode and how it behaves with respect to normalization and search. It’s more than just stripping accents from Western characters.

JUnit Examples

Don’t get me wrong. I love JUnit. We use it for unit testing in LingPipe. And I recommend reading our tests to help understand our classes. I just don’t think it’s a good way to present examples in an intro book. On top of that, they’re using the old style of JUnit rather than the newer style with annotations for tests and startup/teardown methods.

What about the Code?

Authors of textbooks trying to introduce an API are presented with the difficult problem of balancing simplicity with solid coding practice. I have to fight my own tendency to err on the latter side. The authors favor simplicity in most cases, as they explain up front. The authors realize that users are likely to start from cut-and-pasted versions of their code, and try to warn people about the shortcuts they took for the sake of clarity. (We do the same thing in our tutorials.)

Many (but not all) methods that throw exceptions are just declared to throw Exception. I think this does readers a disservice in not knowing what’s actually throwing what kind of exception. Others might argue it makes the code cleaner and less cluttered. The authors explain that’s their motivation and point out proper exception etiquette. We do the same thing in some of our tutorial code.

The use of generics is spotty, with allocations like new TreeMap() that should be new TreeMap<String,Integer> and subsequent use of non-generic Iterator instances. As a result, they have to cast their gets and can’t use auto-boxing. This just clutters up the code using the maps and makes their intended contents opaque.

I’d recommend autoboxing, or its explicit result, Integer.valueOf(int), over the older style new Integer(int) (the latter always creates a new instance, whereas valueOf() being a factory can share instances).

I’d like to see more foreach and fewer explicit iterators.
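
To make those style points concrete, here is the kind of rewrite being suggested, using generics, autoboxing and a foreach loop; the class and data are invented for illustration and are not code from the book.

import java.util.Map;
import java.util.TreeMap;

public class TermCounts {
    public static void main(String[] args) {
        String[] terms = { "lucene", "search", "lucene" };

        //generics plus autoboxing: no casts, no raw Iterator, no new Integer(int)
        Map<String,Integer> counts = new TreeMap<String,Integer>();
        for (String term : terms) {
            Integer count = counts.get(term);
            counts.put(term, count == null ? 1 : count + 1);
        }

        for (Map.Entry<String,Integer> entry : counts.entrySet())
            System.out.println(entry.getKey() + "=" + entry.getValue());
    }
}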

They don’t like ternary operators or min/max operators, resulting in awkward fragments like:

int size = max;
if (max > hits.scoreDocs.length) size = hits.scoreDocs.length;

rather than the simpler

int size = Math.min(max,hits.scoreDocs.length);

I don’t know if this is the different authors’ example coding style (they all know how to write real production code, like Lucene itself!).

There is a more serious issue with floating point precision that they try to deal with in listing 5.22, in the last if/else block. They try to avoid having cosines take on values outside of [-1,1], which would be illegal inputs to acos(). Their approach is insufficient. The only way to guarantee that your natural cosine computations fall in this range is to do:

double boundedX = Math.max(-1.0, Math.min(1.0, x));

That’s another problem we ran into for our k-means clustering implementation in LingPipe. Once the vectors get big, it’s easy to get results greater than 1 for close matches.

Also, if you’re worried about efficiency, replace Math.sqrt(x) * Math.sqrt(y) with Math.sqrt(x * y).

Layout, Etc.

I’m not a big fan of Manning’s approach to code formatting and presentation. Whole programs are printed out, often across multiple pages, and a numbered push-pin like graphic in very heavy bold is used to pick them out and cross-reference them in the text. I prefer the interleaved style I’ve been using, and you see in something like Bloch’s Effective Java, but I can see that it’s a matter of taste.

The code formatting also bugged me. Rather than breaking before assignment equal signs, they break after. They use two characters of indent, which is hard to scan. And the code’s too light for the rest of the text (the “color” doesn’t balance in the typographical sense).

The real typographical sin this book commits is breaking class and method names and file paths and URLs across lines. A class name like IndexWriter is apt to show up with Index- at the end of one line and Writer at the start of the next. The reason this is bad is that hyphens are legal characters in paths and may be confused with underscores in class names (hyphens aren’t legal in Java identifiers).

How Much Math?

For the most part, the authors stick to concrete examples and shy away from abstract mathematics (and related algorithm analyses). If you want to understand the math behind comparison or the probability models behind compressed indexes, I’d still recommend Witten, Moffat and Bell’s Managing Gigabytes (a Morgan Kaufmann text, so it’s prohibitively expensive). Manning, Raghavan and Schütze’s more recent Introduction to Information Retrieval (Cambridge Uni. Press let them leave an online version up for free!) has most of the math and many more recent topics related to language processing, such as edit distance algorithms and their relation to spell checking, probabilistically motivated retrieval, and a range of classifiers like K-nearest neighbors and support vector machines.

In the cosine example in the “Leveraging Term Vectors” (section 5.10), they define cosine in a formula without ever defining the length or dot product functions. If someone needs cosine defined for them, they probably need the definitions for basic vector operations. In the example code, they assume that the word vector has only a single instance of each word, which is certain to be confusing for novice geometers. I would’ve just stuck with largest cosine for the sort, rather than smallest angle and skipped the call to Math.acos(), the arc (or inverse) cosine operation (which they later describe as prohibitive in terms of computation costs, though it’ll only make a difference percentage-wise if the term vectors are short).
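
For readers who do want those definitions spelled out, here is a compact sketch of cosine over sparse term-count vectors, with the clamping discussed earlier; it is illustrative only and not the book's listing.

import java.util.HashMap;
import java.util.Map;

public class Cosine {

    //cosine of two sparse term-count vectors: dot product divided by the
    //product of lengths, clamped into [-1,1] before any call to acos()
    static double cosine(Map<String,Integer> x, Map<String,Integer> y) {
        double dot = 0.0, xLen2 = 0.0, yLen2 = 0.0;
        for (Map.Entry<String,Integer> e : x.entrySet()) {
            Integer yCount = y.get(e.getKey());
            if (yCount != null)
                dot += e.getValue() * (double) yCount;
            xLen2 += e.getValue() * (double) e.getValue();
        }
        for (int count : y.values())
            yLen2 += count * (double) count;
        double cos = dot / Math.sqrt(xLen2 * yLen2);
        return Math.max(-1.0, Math.min(1.0, cos));
    }

    public static void main(String[] args) {
        Map<String,Integer> a = new HashMap<String,Integer>();
        a.put("obama", 2);
        a.put("poll", 1);
        Map<String,Integer> b = new HashMap<String,Integer>();
        b.put("obama", 1);
        b.put("shellacking", 1);
        System.out.println(cosine(a, b)); //2 / (sqrt(5) * sqrt(2)), about 0.632
    }
}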

For spell checking, it’d have been nice to get up to something like the noisy channel model. Peter Norvig provides an elegant introduction in his chapter in the Beautiful Data book from O’Reilly. Lucene’s spell checking is still pretty simple and word based (rather than frequency based), though the authors provide some suggestions on how to work with what’s there.

What about Solr?

Now that the Solr search client-server project is merging with Lucene, it’ll be interesting to see if the third edition includes information about Solr. This book only mentions it in passing in a page-long section 10.8.

Other Books on Lucene

I haven’t seen Smiley and Pugh’s 2009 Solr 1.4 Enterprise Search Server, but it has positive feedback on Amazon.

Manu Konchady’s book Building Search Applications with Lucene, LingPipe, and Gate is more introductory (more like the O’Reilly NLTK book), and needs an update for Lucene 3 and LingPipe 4.

About those Cover Illustrations

On page xxxii, they tell the story of an editor buying the book from which the cover art is drawn. They say the cover page is missing so they couldn’t track it down, but they mention the date and artists. I just did a search for the terms they mention, [ottoman empire clothing 1802 william miller], and found two copies of the book for sale at Donald Heald Books (you may need to search again). Here’s the scoop:

[DALVIMART, Octavien (illustrator) – William ALEXANDER (1767-1816)].

The Costume of Turkey, illustrated by a series of engravings; with descriptions in English and French

London: printed for William Miller by T. Bensley, 1802 [but text watermarked 1796 and 1811; plates 1811 and 1817). Folio (14 1/8 x 10 1/2 inches). Parallel text in French and English. Letterpress title with integral hand-coloured stipple-engraved vignette, 60 hand-coloured stipple-engraved plates by J. Dadley or William Poole after Octavien Dalvimart. Contemporary green straight-grained morocco, covers bordered in gilt and blind, skilfully rebacked to style with the spine in six compartments with double raised bands, the bands highlighted with gilt tooling, lettered in gilt in the second compartment, the others with repeat decoration in gilt, oatmeal endpapers, gilt edges. Provenance: William Rees Mogg (Cholwell House, Somerset, England, armorial bookplate).

A good copy of this classic work on the costume of the Ottoman Empire.

Starting with the “Kislar Aga or first black unuch [sic.]” in the “Grand Signior’s Seraglio” the subjects covered are quite wide-ranging but centered on the inhabitants of Constantinople and those who were visiting the capital city: a “Sultana”, “Officers of the Grand Signior”, “Turkish woman of Constantinople”, “Turkish woman in provincial dress”. Dalvimart does make occasional forays out into the provincial areas of the empire: included are images of an “Albanian”, an “Egyptian Arab”, a “Bedouin Arab”, a “Dervise [sic.] of Syria”, an “Armenian”, a “Bosniac” as well as a number of fine plates of the female costume of the Greek Islands (which are much admired in the text). “The Drawings, from which … [the] plates have been engraved, were made on the spot … [in about 1798] by Monsieur Dalvimart, and may be depended upon for their correctness. They have been accurately attended to in the progress of the engraving; and each impression has been carefully coloured according to the original drawing, that the fidelity of them might not be impaired” (Preface). Abbey points out that the informative text is attributed to Willliam Alexander in the British Library catalogue.

Abbey Travel II,370; Atabey 313; cf. Colas I, 782; cf. Lipperheide I, Lb37; cf. Vinet 2337.

Sierra & Bates. Head First Java 2nd Ed. (Review)

August 9, 2010

After the comments on the last post about relying more heavily on Java textbooks, I ordered the most popular ones from Amazon. I’ll start the reviews with the number one most popular book on Java (ranked by Amazon sales; it also has 269 reviews; I’m happy to see Effective Java is number 4):

  • Sierra, Kathy and Bert Bates. 2005. Head First Java. 2nd Edition. O’Reilly.

Mashup: Elementary School Workbook + Kitschy Humor

Image from Title Page.

I grew up with texts like The Little LISPer, but this book’s even stranger. It strikes me as a mashup of an elementary school math problem workbook and one of those cutesy humor books you see near the checkout counters of bookstores around the holidays.

Here’s an example of one of their quizzes, where you fill in missing pieces of a Java program, just like school writing exercises.

The Design and Layout

I think they were striving for a sense of the enthusiastic amateur. The “fun” fonts and askew alignments make it look non-linear. Code comments are added in a handwritten post-it note style.

You really need to head over to Amazon and “look inside”.

Who’s it For?

It’s aimed at Java novices. Other than that, I’m not sure. Maybe for potential Java developers who like crossword puzzles or true/false quizzes with titles like “Be the Compiler”. I’m more of a Java Puzzlers guy myself (thank my other childhood friend, Martin Gardner).

Unless you laughed at the photo included above, don’t buy it for the humor.

How Much Does it Cover?

It covers a startling amount of Java, including advanced topics like threads, sockets, RMI, Swing, audio, …

Of course, there can’t be much depth given this range. But you do get a wide range of hello-world-type programs.

Image from Page 101.

Is it Sound Java?

The authors clearly know what they’re doing. The advice and coverage are too basic if you want to program Java seriously, but it’ll get you started on the right track.

There’s even good high-level advice like telling you to write unit tests (though not using anything as complex as JUnit).

It’s an Industry

Like other popular series (e.g., …for Dummies, … in Action, … in a Nutshell), this one’s turned into an industry, with dozens of titles covering everything from HTML to statistics.

As such, they’ve created all the trappings of such an enterprise, such as “Head First learning principles”:

  • make it visual — put the words within or near graphics
  • use a conversational and personalized style
  • get the learner to think more deeply
  • get — and keep — the reader’s attention
  • touch their emotions

And they break out all sorts of “metacognitive” tips explaining why they wrote such a whacky book and how you should go about learning with it (like read it before bed and mix right-brain/left-brain activities for learning).

Many reviewers like the approach well enough to blurb about it.

LingPipe Book Draft Available

August 5, 2010

We’ve started work on a proper LingPipe book. You can download the current partial draft of the book and sample code from:

In case you were wondering why the blog’s been quieter these days, this is it!

Our Goals

Our goal is to produce something with a little more breadth and depth and much more narrative structure than the current LingPipe tutorials. Something that a relative Java and natural language processing novice could work through from beginning to end, coming out with a fairly comprehensive knowledge of LingPipe and a good overview of some aspects of natural language processing.

We’re not trying to write an introduction to Java, but there’s a lot more detail on Java in the book than in LingPipe’s javadoc or the current tutorials.

Progress so Far

So far, there’s a getting started chapter with sections on all the tools we use, a hello world example and an introduction to Ant. The second chapter is all about character encodings and characters and strings in Java, as well as an introduction to the International Components for Unicode (ICU) library for normalization, encoding detection, and transliteration. The third chapter covers regular expressions, focusing on their interaction with Unicode. The fourth chapter is on input and output in Java, including files, byte and character streams, the various interfaces and buffering, compression, standard input and output, reading from URLs and resources on the classpath, and serialization, including the serialization proxy. The fifth chapter is on tokenization, including an overview of all of LingPipe’s built-in tokenizers and how to build your own.

The first appendix is an introduction to Java, including the primitives, objects and arrays. The second appendix contains suggested further reading in areas related to natural language processing.

I hope to keep churning out chapters at around one per week. As I complete chapters, I’ll release new versions.

Comments Most Appreciated

C’mon — you know you want to be listed in the front of the book for sending us feedback. Any comments, suggestions, etc., should go to carp@alias-i.com.

The book’s not been copy-edited yet, even by me, so I don’t need to know about misspellings, sentences that run off into space, or that kind of thing.

I would love to get feedback about the general level of description, the tone, or get suggestions for demos or how to improve the existing code or descriptions.

Eventual Publication

We’ll be publishing it ourselves, probably through CreateSpace. That’ll let us sell through Amazon.

If it turns out to be 800 pages long, as we expect, we should be able to sell it for around US$ 20 (in the US anyway).

We plan to continue distributing PDF versions for free.

It’s about Time

I’m psyched, because it’s been over ten years since my last book.

I’m also working on a book on Bayesian categorical modeling — the posts here on non-NLP-related Bayesian stats, and posts like working through the collapsed samplers for LDA and naive Bayes, are warmups for that book. Don’t hold your breath; I’m trying to finish the LingPipe book first.