Author Archive

Processing Tweets with LingPipe #2: Finding Duplicates with Hashing and Normalization

December 2, 2010

This post is about taking the csv file format used in post #1 and eliminating the huge number of annoying duplicates or near duplicates that exist in Twitter search results. The next entry in the series will cover more open ended near duplicate detection strategies such as Jaccard Distance and will introduce the very important area of evaluation.

As before you can download the code with this tarball or get it from subversion with the command

svn co

We will start with a simple ingest and representation of the csv data and consider how to find near or exact duplicates without concern for computational efficiency. The research literature in this area is very concerned with scalable performance, I will instead focus on more ideal solutions that focus on simplicity and quality.

Hashing Driven Deduplication

A simple way to find exact duplicates is to use a HashSet data structure in Java for comparison of tweets. Once we get the tweets into a list we iterate over the tweets to see whether the the exact string is in the HashSet. Below is the recognition phase from src/

	List texts = parseFile(dataDir);
	HashSet seenBefore = new HashSet();
	for (int i=0; i < texts.size(); ++i) {
	    String tweet = texts.get(i);
	    if (seenBefore.contains(tweet)) {
		System.out.println("Seen Before at postition:" + i +": " + tweet);
	    else {

The code iterates over the CSV format search results, tests to see if a HashSet already contains the string and adds it if the set does not contain the string. I could have just added all the strings and the HashSet would have behaved the same but that is slightly automagical and might prove confusing. Below the code for writing out the CSV format for presumably unique tweets:

	File output = new File(outDir,outFile); 
	System.out.println("Writing to: " + output.toString());
	FileOutputStream stream =  new FileOutputStream(output);
	OutputStreamWriter streamWriter 
	    = new OutputStreamWriter(stream,Strings.UTF8);
	CsvListWriter writer 
	    = new CsvListWriter(streamWriter,CsvPreference.EXCEL_PREFERENCE); 
	ArrayList row = new ArrayList();
	row.add("Unique Tweet");
	writer.write(row); //header for csv file
	for (String tweet : seenBefore) {
	System.out.println(seenCounter + " seen of " + texts.size());

Note that I have kept the tweet in the same row as the search results. This convention will allow interaction with other components.

Running the program on searches/Obama.csv yields the following:

>ant dedupeSimpleHash -Ddata=searches/Obama.csv -Doutdir=runs -Doutfile=Obama.simpleDedupeHash.csv
Buildfile: build.xml



     [java] 2
     [java] Seen Before at postition:48: RT @rolandsmartin: President Obama: 'I Don't Think About Sarah Palin' -
     [java] Seen Before at postition:58: RT @New_federalists: Obama lets Islamic terrorists pour across the Mexican border, but treats US citizens as criminals at airports!!! #RESIST #REBEL !
     [java] Seen Before at postition:1490: RT @Cubachi: RT @PalinTV: Obama can't even pardon a Turkey without notes
     [java] Seen Before at postition:1499: RT @BreakingNews: President Barack Obama pardons turkeys ?Apple? and ?Cider?; says feels good to prevent one ?shellacking? this year
     [java] Writing to: runs/Obama.simpleDedupeHash.csv
     [java] 297 seen of 1500

Duplicate Detection via Normalization

The simple deduplication approach gets rid of 297 exact duplicates, or 20% of the tweets. Pretty good but looking at the ‘unique’ tweets and sorting on the texts in a spread sheet shows some other issues:

Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update...

Note that your results will vary since we are not running the searches on the same day. These posts only differ in the url for which is still pretty much the same tweet as far as I am concerned. Looking further I see that tweets get re-tweeted with a prefix of ‘RT@:’ pretty often as well.

RT @mattyglesias: Repealing the whole Affordable Care Act would be a mistake, but Obama should compromise and agree to scrap the death...																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																																												
Repealing the whole Affordable Care Act would be a mistake, but Obama should compromise and agree to scrap the death...

It appears that removing urls and the retweet prefixes will find more duplicates. That suggests normalization to help make the tweets more uniform by eliminating "trivial" differences. A look at src/ shows regular expressions that replace the url and retweets with the empty string.

	List texts = parseFile(dataDir);
	HashSet seenBefore = new HashSet();
	for (int i=0; i < texts.size(); ++i) {
	    String tweet = texts.get(i);
	    System.out.println("Before:" + tweet);
	    tweet = tweet.replaceAll("\\s*RT\\s*@\\w+:\\s*","");//removes "RT @foobar:"
	    tweet = tweet.replaceAll("https?:[^\\s]*",""); //removes "http://foo" "https://bar"
	    System.out.println("After :" + tweet);	    
	    if (seenBefore.contains(tweet)) {
		System.out.println("Seen Before at postition:" + i +": " + tweet);
	    else {

Running the normalized approach the modifications are evident:

> ant dedupeNormalizedHash -Ddata=searches/Obama.csv -Doutdir=runs -Doutfile=Obama.dedupeNormalizedHash.csv
     [java] Before:RT @rolandsmartin: President Obama: 'I Don't Think About Sarah Palin' -
     [java] After :President Obama: 'I Don't Think About Sarah Palin' -  
     [java] Writing to: runs/Obama.dedupeNormalizedqHash.csv
     [java] 454 seen of 1500

Both patterns apply to the tweet removing ‘RT @rolandsmartin: ‘ and ‘;. The normalization allows for identification of another 157 duplicates. Normalization is a standard technique used when working with text data. You see it in search engines where ‘walks’, ‘walking’ and ‘walked’ are normalized to the root word ‘walk’ with a stemmer. Both documents and queries are normalized the same way allowing a query ‘Books about walking’ to match documents that don’t mention ‘walking’ but do mention ‘walk’, ‘walks’ or ‘walked.’ Normalization can bring more trouble than the problems it solves and should be used carefully–perhaps a future blog post is called for on our preference to use character n-grams in situations where stemming is typically used.

Looking at the output of the normalized hash approach reveals more opportunity to eliminate near duplicates and there may be more patterns to exploit but eventually diminishing returns will bring the pattern driven normalization effort to a close. Lets move to more open ended approaches that better address the long tail side of the issue:

Has Obama written a stern letter to North Korea yet? They will respect that.
Has Obama written a stern letter to North Korea yet? They will respect that. <--- indeed!

Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party 
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner W...  #tcot #p2 #tlot
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner W...  #tcot #tlot #p2
Latest Gallup poll: Barack Obama gets another shellacking from the Tea Party: By Nile Gardiner World Last update... 

The first pair only differ in an additional ‘<—indeed!’ which is not a likely pattern in the data that can be exploited much beyond this case. The second example shows that the basic tweet has many small variations. In these situations it is useful to deduce what algorithm you use to say that the tweets are (mostly) the same.

In my next post in the series I will cover Jaccard Distance as the method of identifying near duplicates and bring in the importance of evaluation metrics to drive development.


Processing Tweets with LingPipe #1: Search and CSV Data Structures

November 16, 2010

I have been focusing on teaching developers how to approach the challenges of computational linguistics. An interesting source to work with is Twitter since it is culturally relevant these days, fairly easy to work with from a data feed perspective and it is just so darned cute. Like many cute things, it is not long before it is eating your shoes, resisting toilet training and overall testing one’s patience. We will be using LingPipe to help tame the beast.

I will start with how to work with the search client and will move towards language ID, deduplication and a bit of sentiment classification in later blog entries. The code for this post is available in this tarball or from our subversion repository which can be accessed via the command:

svn co

Go ahead and download the software and make sure you can compile. I am developing with Ant so I just type ‘ant’ in the teaching/twitter directory:

$ ant
Buildfile: build.xml

    [mkdir] Created dir: /Users/breck/devguard/teaching/lingpipe4twitter/build/classes
    [javac] Compiling 8 source files to /Users/breck/devguard/teaching/lingpipe4twitter/build/classes

      [jar] Building jar: /Users/breck/devguard/teaching/lingpipe4twitter/demo.jar

Total time: 3 seconds

Searching Twitter

We are using the Twitter4j library for accessing the Twitter API. The folks at Twitter apparently make contributions, while not an endorsement, it certainly increases credibility.

Getting some data on disk is the first item. Run a search on Obama as follows:

ant search -Dquery="Obama"
Buildfile: build.xml



     [java] [Mon Nov 15 10:20:45 EST 2010]Will use class twitter4j.internal.logging.StdOutLoggerFactory as logging factory.
     [java] [Mon Nov 15 10:20:45 EST 2010]Will use twitter4j.internal.http.HttpClientImpl as HttpClient implementation.
     [java] writing to disk in directory: searches

Total time: 6 seconds

The file name will be the query, searchers/Obama.csv. Note that subsequent searches will not overwrite previous search results. Rerunning the above query would create a file searches/Obama_1.csv.

Looking at the csv output in spread sheet like Open Office or Excel (import as comma seperated UTF-8 text)

Search results in CSV format viewed in a spread sheet

The tweets are in column D. To get them more readable you should select the entire column D, resize width to around 5″ and modify formatting to include wrapping the text.

Pretty cute eh? Lots of entrepreneurs have gotten to this stage of perusing Twitter and see piles of customers clamoring for access to this feed that is effectively the id of the internet. Clean it up with some super ego and profit is sure to follow. Well, the cute gets ugly fast. Run a few queries in areas of interest and the messiness becomes apparent quickly. Below we go into some details and a bit of data analysis.


We will be using the CSV data structure to work with the Twitter search results. Spreadsheets provide an easy way to view, sort and manipulate the tweets and there is a simple parser that allows us to interface with LingPipe functionality.

The relevant bits of the search code in src/ are:

    static final int TWEETS_PER_PAGE = 100;
    static final int NUM_PAGES = 15;

    public static void main (String[] args) 
        throws IOException, TwitterException {
        String queryString = args[0];
        File outDir = new File(args[1]);
        Twitter twitter = new TwitterFactory().getInstance();
        Query query = new Query(queryString);
        List tweets = new ArrayList();
        for (int i = 1; i <= NUM_PAGES; ++i) {
            QueryResult result =;
            List resultTweets = result.getTweets();
            if (resultTweets.size() == 0) break;
        System.out.println("writing to disk in directory= " + outDir);
        TweetWriter.writeCsv(tweets,outDir,queryString,"csv"); //our class for writing tweets

The only tricky bit to this search interface is the two levels of indexing for search results, pages and
tweets per page. Otherwise we accumulate the tweets and send them off to our TweetWriter class for
writing to disk.

The code for writing is in src/

public class TweetWriter {    
    public static void writeCsv(List tweets, File dir, 
                                String filename, String suffix)
        throws IOException {
        File output = nextFile(dir,filename,suffix);
        OutputStream fileOut = null;
        Writer streamWriter = null;
        CsvListWriter writer = null;
        try {
            fileOut =  new FileOutputStream(output);
            streamWriter = new OutputStreamWriter(fileOut,Strings.UTF8);
            writer = new CsvListWriter(streamWriter,CsvPreference.EXCEL_PREFERENCE); 

            List headerRow 
                = Arrays.asList("Estimate", "Guessed Class", 
                                "True Class", "Text");
            List row = new ArrayList();
            for (Tweet tweet : tweets) {
                String whitespaceNormedTweet
                    = tweet.getText().replaceAll("[\\s]+"," ").trim();
        } finally {
            try {
                if (writer != null)
            } finally {
                try { 
                    Streams.closeQuietly(streamWriter); //lingpipe convenience utility to close without exceptions 
                } finally {

This code uses the Super Csv package to handle reading and writing csv (comma seperated value) formatted files. For this tutorial I am only interested in the text of the Tweet object but much more information is available–see the Twitter4j documentation on the returned Tweet object.

The basic pattern is to populate a row with string values and write the row out. I have adopted the convention that the first line of a csv file contains headers which is handled before iterating over the tweets. Writing out the tweets involves 3 empty fields followed by the text of the tweet. Later those empty fields will be populated by human and machine annotations so this is our foundational data structure. The remaining odd bit is the whitespaceNormedTweet = tweet.getText().replaceAll("\\s+"," ").trim(); which replaces new lines and carriage returns with a single white space. This is in clear violation of my “do not adulterate the data” stance but for the purposes of this tutorial the improved readability of the csv format makes it worth it.

Also note that the IO has been wrapped for maximum paranoia regarding open file handles that might take out a JVM on repeated failures of writer, streamWriter or fileOut. I just know that someone is going to copy and paste the above code into a production system so might as well make it well behaved. BTW a real implementation would be streaming tweets to disk to keep down memory use. Thanks to Bob for noting all this and re-writing the closes.

Exploring the Data

Blog entries are supposed to be short I will cut the presentation so that I may blog another day in good conscience. But a little data analysis to whet the appetite seems appropriate.

Duplicates and Near Duplicates:

We now have an on disk data structure for storing searched for tweets that we can browse with a spread sheet and modify at our will. Lets look at what our harvest has reaped. Select the entire sheet and sort by column D. Scrolling down a bit I find for my Obama search some near duplicates that share prefixes:

#2: Crimes Against Liberty: An Indictment of President Barack Obama
#2: Crimes Against Liberty: An Indictment of President Barack Obama
#5: Crimes Against Liberty: An Indictment of President Barack Obama
#5: Crimes Against Liberty: An Indictment of President Barack Obama
#5: Crimes Against Liberty: An Indictment of President Barack Obama
#5: Crimes Against Liberty: An Indictment of President Barack Obama: Crimes Against Liberty: An Indictment of Pre...

Looking more closely we see that the repeats are minor variants with both the referred to URL, the #2 vs #5 and some more variation in the last example. For the purposes of this tutorial I assume that duplicates and near duplicates need to be eliminated–there are other use cases where retweets, which are near duplicates passed on by different users, are desirable. We will be handling duplicates and near duplicates in the next blog post.


I browsed the tweets and started to markup up what the language was. I didn’t have to look past around 100 tweets to get a pretty rich smattering of languages.

Some of the languages found in Twitter searches

Diversity of Languages Displayed.

In another blog post I will approach the issue of language identification–fortunately it is pretty easy for languages that are not near neighbors like British English and American English. If you want to get a heads up on language id look at our language ID tutorial.

That is all for now. My next posts will attend to deduplication, then language ID before moving on to classification for topic and sentiment.


Bing Translate Has a Great UI, and Some Nice NLP, too

August 10, 2010

I’m really digging the user interfaces Bing has put up. Google’s now copying their image display and some of their results refinement. I’ve been working on tokenization in Arabic for the LingPipe book and was using the text from an Arabic Wikipedia page as an example.

Here are links to Bing’s and Google’s offerings:

Yahoo!’s still using Babel Fish in last year’s UI; it doesn’t do Arabic.

Language Detection

First, it uses language detection to figure out what language you’re translating from. Obvious, but oh so much nicer than fiddling with a drop-down menu.

Side-by-Side Results

Second, it pops up the results side-by-side.

Sentence-Level Alignments

Even cooler, if you mouse over a region, it does sentence detection and shows you the corresponding region in the translation. Awesome.

Back at Google

I went back and looked at Google translate and see that they’ve added an auto-detect language feature since my last visit. Google only displays the translated page, but as you mouse over it, there’s a pop-up showing the original text and asking for a correction.

I don’t know who did what first, but these are both way better interfaces than I remember from the last time I tried Google translation. And the results are pretty good NLP-wise, too.

LingPipe Book Draft Available

August 5, 2010

We’ve started work on a proper LingPipe book. You can download the current partial draft of the book and sample code from:

In case you were wondering why the blog’s been quieter these days, this is it!

Our Goals

Our goal is to produce something with a little more breadth and depth and much more narrative structure than the current LingPipe tutorials. Something that a relative Java and natural language processing novice could work through from beginning to end, coming out with a fairly comprehensive knowledge of LingPipe and a good overview of some aspects of natural language processing.

We’re not trying to write an introduction to Java, but there’s a lot more detail on Java in the book than in LingPipe’s javadoc or the current tutorials.

Progress so Far

So far, there’s a getting started chapter with sections on all the tools we use, a hello world example and an introduction to Ant. The second chapter is all about character encodings and characters and strings in Java, as well as an introduction to the International Components for Unicode (ICU) library for normalization, encoding detection, and transliteration. The third chapter covers regular expressions, focusing on their interaction with Unicode. The fourth chapter is on input and output in Java, including files, byte and character streams, the various interfaces and buffering, compression, standard input and output, reading from URLs and resources on the classpath, and serialization, including the serialization proxy. The fifth chapter is on tokenization, including an overview of all of LingPipe’s built-in tokenizers and how to build your own.

The first appendiex is an introduction to Java, including the primitives, objects and arrays. The second appendix contains suggested further reading in areas related to natural language processing.

I hope to keep churning out chapters at around one per week. As I complete chapters, I’ll release new versions.

Comments Most Appreciated

C’mon — you know you want to be listed in the front of the book for sending us feedback. Any comments, suggestions, etc., should go to

The book’s not been copy-edited yet, even by me, so I don’t need to know about misspellings, sentences that run off into space, or that kind of thing.

I would love to get feedback about the general level of description, the tone, or get suggestions for demos or how to improve the existing code or descriptions.

Eventual Publication

We’ll be publishing it ourselves, probably through CreateSpace. That’ll let us sell through Amazon.

If it turns out to be 800 pages long, as we expect, we should be able to sell it for around US$ 20 (in the US anyway).

We plan to continue distributing PDF versions for free.

It’s about Time

I’m psyched, because it’s been over ten years since my last book.

I’m also working on a book on Bayesian categorical modeling — the posts here on non-NLP-related Bayesian stats, and posts like working through the collapsed samplers for LDA and naive Bayes, are warmups for that book. Don’t hold your breath; I’m trying to finish the LingPipe book first.

Big Bit-Packed Array Abstraction (for Java, C, etc.)

July 28, 2010

One limitation of Java is that arrays can only be up to Integer.MAX_VALUE entries long. That means you can only get 2 billion entries into an array. That’s only 2GB if the arrays contain bytes, and we have way more memory than that these days. That’s not enough.

A second limitation, shared by C, is that each entry is a round number of bits, usually a power of two bytes (8, 16, 32, or 64 bits). Often 32 is too little and 64 too much.

I’ve been spending some time lately working on bioinformatics applications. The size and shape of the data’s a bit unusual for me, and very awkwardly sized for Java or even C-based arrays.

Packing the Human Genome

For instance, if you want to represent the bases in both strands of the human genome, it’s a sequence 6 billion bases long with each base being 1 of 4 values (A, C, G, or T). OK, no big, just pack four values into each byte and you can squeeze under the limit with a 1.5 billion long array of bytes. It’s even pretty easy to write the bit fiddling because everything aligns on byte boundaries.

Indexing the Human Genome

Now let’s suppose we want to index the genome so we can find occurrences of sequences of arbitrary length in time proportional to their length. Easy. Just implement a suffix array over the positions. (A suffix array is a packed form of a suffix tree, which is a trie structure that represents every suffix in a string. Technically, it’s an array consisting of all the positions, sorted based on the suffix defined from the position to the end of the string.)

Because there are 6G positions, the positions themselves won’t fit into an int, even using an unsigned representation. There are only 32 bits in an int, and thus a max of about 4 billion values. But a long is overkill. We need 33 bits, not 64. Using 8 bytes per entry, the suffix array would fill 48GB. We really only need a little over half that amount of space if we use 33 bits per pointer. But now things aren’t so easy. First, the bit fiddling’s more of a pain because 33-bit values can’t be both packed and byte aligned. Second, you can’t create an array that’ll hold 24GB of values, even if it’s a maximally sized array of longs (that’ll get you to 16GB).

The Abstraction

What I’ve defined in the package I’m working on with Mitzi is a BigArray interface. The interface is super simple.

interface BigArray {
    long length();
    long get(long n);
    void set(long n, long val);

That’s the basics. We might want to make the set operations optional so we could have immutable big arrays. As is, there’s no way to define immutable arrays in Java or C (though Java provides immutable wrappers for lists in the collections framework and C probably has something similar in its standard template lib).

I followed the idiom of Java’s Number interface and define extra getters for bytes, shorts and integers for convenience, but don’t show them here.

On top of a big array, we could implement a variety of the new I/O buffer classes, using something like the random access file repositioning pattern, but it’d be clunky. It’s really a shame the buffers are also sized to integers.


What’s groovy about interfaces is that we can implement them any way we want. The simplest implementation would just be backed by an array of long values. An implementation specifically for two-bit values might be used for the genome, with an efficient byte-aligned implementation.

What I did was create an implementation that allows you to specify the array length and bits per value. I then pack the bits into an array of longs and fiddle them out for getting and setting. This is fine for the genome sequence, because it’ll fit, although it’s not as efficient as the byte-aligned implementation, so I’ll probably add extra special-case implementations.

But it’s still not enough for the suffix array. I can’t fit all the bits into an array of longs. Although I haven’t implemented it yet, the obvious thing to do here is to build up the big array out of an array of arrays. I’d only need two arrays of longs for the human genome suffix array. This is pretty easy to implement hierarchically with an array of smaller big array implementations. First figure out which subarray, then do the same thing as before.

Yes, I Know about Burrows-Wheeler Indexing

If you’re just interested in the suffix array packing problem, the Burrows-Wheeler compression algorithm may be combined with indexing to get the entire genome suffix array into 4GB of memory in a way that’s still usable at run time. This was first explored by Ferragini and Manzini in their 2000 paper, Opportunistic data structures with applications. Very very cool stuff. And it works even better in the genomics setting than for text documents because of the small alphabet size (just four bases for DNA/RNA or 20 amino acids for proteins). Several projects have taken this approach (e.g. BWA and Bowtie).

LaTeX Problem’s Not Our Fault

July 14, 2010

Thanks to those who’ve sent me e-mail saying that the LaTeX is broken on the site. It’s not our fault. Really. It’s a problem with WordPress’s server.

Are lots of people having problems seeing LaTeX on our site?

What’s Going On

WordPress uses a web service to decode the LaTeX expressions to PNGs. When I write something like

$latex f(x)$

in a blog post, it gets rendered as f(x).

What’s really going on is that WordPress is preprocessing my doc and replacing the LaTeX escapes with links to a web service that returns PNG images. For instance, the example above translates to the following link:

If you click on the link, you should see just the formula.

(Note: Given that they run the preprocessor, WordPress might change the path and the link might go stale. Or their server might be down. If that happens, right click on the image in your browser and select “View Image Info” or whatever the variant for doing that on your browser is.)

What Happens When it Fails

When WordPress’s web server goes down, it fills in something like “latex path not specified”. Not sure whether that’s postprocessing or just the result from the server not being configured properly.

Brendan O’Connor was kind enough to send me a screen shot:

screen shot of effect of WordPress LaTeX server outage

The Latin1 Transcoding Trick (for Java Servlets, etc.)

July 14, 2010

This post uses Java servlets as an example, but applies more broadly to a situation where a program automatically converts streams of bytes to streams of Unicode characters.

The Trick

Suppose you have an API that insists on converting an as-yet-unseen stream of bytes to characters for you (e.g. servlets), but lets you set the character encoding if you want.

Because Latin1 (officially, ISO-8859-1) maps bytes one-to-one to Unicode code points, setting Latin1 as the character encoding means that you get a single Java char for each byte. These char values may be translated back into bytes with a simple cast:

char[] cs = Converter.getChars();
byte[] bs = new byte[cs.length];
for (int i = 0; i < cs.length; ++i)
    bs[i] = (byte) cs[i];

Now you have the original bytes back again in the array bs. In Java, char values act as unsigned 16-bit values, whereas byte values are signed 8-bit values. The casting preserves values through the magic of integer arithmetic overflow and twos-complement notation. (I attach a program that’ll verify this works at the end.)

You can now use your own character encoding detection or pull out a general solution like the International Components for Unicode (which I highly recommend — it tracks the Unicode standard very closely, performing character encoding detection, fully general and configurable Unicode normalization, and even transliteration).

Use in Servlets for Forms

I learned this trick from Jason Hunter’s excellent book, Java Servlet Programming, (2nd Edition, O’Reilly). Hunter uses the trick for decoding form data. The problem is that there’s no way in HTML to declare what character encoding is used on a form. What Hunter does is add a hidden field for the value of the char encoding followed by the Latin1 transcoding trick to recover the actual string.

Here’s an illustrative code snippet, copied more or less directly from Hunter’s book:

public void doGet(HttpServletRequest req, ...) {
    String encoding = req.getParameter("charset");
    String text = req.getParameter("text");
    text = new String(text.getBytes("ISO-8859-1"), encoding);

Of course, this assumes that the getParameter() will use Latin1 to do the decoding so that the getBytes("ISO-8859-1") returns the original bytes. According to Hunter, this is typically what happens because browsers insist on submitting forms using ISO-8859-1, no matter what the user has chosen as an encoding.

We borrowed this solution for our demos (all the servlet code is in the distribution under $LINGPIPE/demos/generic), though we let the user choose the encoding (which is itself problematic because of the way browsers work, but that’s another story).

Testing Transcoding

public class Test {
    public static void main(String[] args) throws Exception {
        byte[] bs = new byte[256];
        for (int i = -128; i < 128; ++i)
            bs[i+128] = (byte) i;
        String s = new String(bs,"ISO-8859-1");
        char[] cs = s.toCharArray();
        byte[] bs2 = s.getBytes("ISO-8859-1");
        for (int i = 0; i < 256; ++i)
            System.out.printf("%d %d %d\n",

which prints out


c:\carp\temp>java Test
-128 128 -128
-127 129 -127
-2 254 -2
-1 255 -1
0 0 0
1 1 1
126 126 126
127 127 127

Unicode Shell for Java: Windows PowerShell ISE

July 6, 2010

I finally figured out how to have my Java programs write Unicode to a Windows shell and show up looking the right way. Here’s the final result (click to enlarge):

DOS vs Powershell for Unicode

And here’s how I did it.

1. Fire Up PowerShell ISE

My new Windows 7 Professional notebook comes with an application called PowerShell ISE (make sure to run the ISE version; the unmarked one is more like DOS and has the same problems noted below). The “ISE” is for “integrated scripting environment”.

It defaults to Consolas font, which looks a lot like Lucida console.

2. Set Character Encoding to UTF-8

This’ll set the DOS character encoding to be UTF-8.

> chcp 65001
Active code page: 65001

3. Set Java System.out to UTF-8

Before writing to System.out, Java’s constant for the standard output, you’ll need to set it up to use UTF-8. It’s a one-liner.

System.setOut(new PrintStream(System.out,true,"UTF-8"));

The true value enables auto-flushing, which is a good idea for standard output.

Demo Program

Here’s a simple test program (the backslash escapes are Java literals for Unicode code points).


public class Test {

    public static void main(String[] args) 
        throws UnsupportedEncodingException {

        PrintStream utf8Out
            = new PrintStream(System.out,false,"UTF-8");

        System.out.println("English: Hello");
        System.out.println("French: D\u00E9j\u00E0 vu");
        System.out.println("Tamil: \u0B92");
        System.out.println("Han: \u4E52");


You can see the output in action in the screen dumps at the top of this post.

Problems with DOS

The DOS shell will let you set the character encoding and change the font to Lucida console. But it oddly duplicates characters at the end of lines if the lines contain non-ASCII code points. You can see this on the example. And it can’t handle the Tamil or Han characters.

Training Examples: A Stack Overflow for NLP and ML and …

June 29, 2010

Joseph Turian’s just created a site

which describes itself as a place “where data geeks ask and answer questions on machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization!”

It’s modeled pretty directly on the hugely popular Stack Overflow, which bills itself as a “collaboratively edited question and answer site for programmers – regardless of platform or language.”

In case you’re too lazy to look yourself, questions at the top of Training Examples right now are

  • how to discovery emerging topics in customer contacts?
  • How do I test a method (murmurhash) for generating random numbers?
  • Finding related users based on user data
  • Algorithm to discover features from raw data?
  • When is low-dimensional dense representations better than high-dimensional and sparse? And vice-versa?
  • What are the state of the art optimization algorithms for problems with numerous local minima?

Par for the course, it’s a mix of wildly general (non-convex optimization) and reasonably specific (testing a random number generator) questions.

Joseph sites the following motivators for using Training Examples:

  • Communicate with experts outside of your lab.
  • Answer a question once publicly, instead of potentially many times over email.
  • Share knowledge you have that might not be reflected in your publication record.
  • Find new collaborators.

Those are also good reasons for starting a blog or commenting on one!

The site requires registration for posting or tracking questions, but not for reading questions or answers. I have gazillions of passwords and a regular process for managing them, but it’s still enough of a pain that I don’t just register at every site I visit. Of course, with something like this, you need registration, or it’ll be nothing but spam in ten minutes flat. It’s a constant battle on the blog, but I don’t want to moderate or require registration.

The Unbearable Heaviness of Java Strings

June 22, 2010

Instantaneous Update: No sooner had I posted this than I received a really helpful clarifying comment from Ismael Juma with a link to exactly the references I was looking for. It turns out the 64-bit JVMs do compress references in some cases. And people ask me why I waste my time blogging!

Update: 23 June 2010. Super cool. As of Java SE 6 build 14, all you need to do is provide the java command the argument -XX:+UseCompressedOops and all references will consume 32 bits (“OOP” is for “ordinary object pointer”). That does limit you to 4 billion objects on the heap (232), which if they’re just plain objects will consume 32GB. You’d need at least 64GB of RAM to make it worth turning the option off, because 4 billion objects consume 64GB without OOP compression.

Does anyone know how fast the JVM is with compression? I’ll have to run some tests.

Let’s start with the punchline. In Sun/Oracle’s 64-bit JVM, a string consumes 60 bytes plus the size of its UTF-16 encoding, which is at least 2 times the number of Unicode characters. In a 32-bit JVM, if anyone’s still using one, there’s a 36-byte fixed cost plus the cost of the UTF-16 encoding.

My number one peeve with Java is that there is no equivalent of C’s struct. I can do without pointer arithmetic, but objects in Java are just too heavy for serious lifting (somehow that metaphor worked out backward). It’s the reason there are parallel arrays all over LingPipe — arrays of structured objects just take up too much space and cause too much garbage collection.


Yes, Java made the questionable choice of using UTF-16 to encode strings internally. Each char is a UTF-16 byte pair. Any unicode character with a code point at or below U+FFFF uses a single UTF-16 byte pair to encode, and anything higher requires two UTF-16 byte pairs. (In the designer’s defense, there weren’t any Unicode characters with code points above U+FFFF when Java was released and 64-bit architectures in laptops seemed a long way off.)

The bummer is that for all of us with piles of ASCII or near-ASCII (e.g. Latin1 or Windows-1252) text, UTF-16 takes nearly twice as much space as UTF-8.

Java Object Memory Layout

The layout of objects in memory is not part of the Java language specification, so implementations may do this differently. We’ll only consider Sun/Oracle’s 32-bit and 64-bit JVMs.

After a lot of guesswork and empirical profiling, I finally found some references that look solid on the topic of how Java objects are laid out in memory:

What’s worrisome is that they look awfully similar and there are no external references. How did they figure this out? Looking at the byte code? By looking at the implementation of Java itself? By reverse engineering?

Principles of Object Layout

Assuming they’re correct, and the argument looks convincing, the basic idea of sizing a Java object (not including the size of any object it references), is based on a few principles motivated by byte alignment. The idea is that it’s more convenient for the underlying architecutre if items start on even number multiple of their number of bytes. So a double or long would want to start at position 0, 8, 16, …, an int or float on positions 0, 4, 8, 12, …, a short or char on positions 0, 2, 4, 6, … and bytes on any position at all.

References (aka pointers), will be 8 bytes on 64-bit architectures and 4 bytes on 32-bit architectures.

Basic Object Layout

An instance of Object requires two references, which consume 8 bytes (32 bit) or 16 bytes (64 bit).

One reference is to the class of the object, and one is the ID of the object. Objects get IDs because they may be relocated in memory yet need a unique identifier to resolve reference equality (==). The ID is also used for synchronization and for the hash code method implemented in Object.

In order to support byte alignment, the member variables are laid out from largest to smallest, that is doubles and longs, then ints and floats, then shorts and chars, then bytes.

Objects are reference aligned, because they start with references, so object sizes are rounded up to the nearest even multiple of 4 bytes (32 bit) or 8 bytes (64 bit).

Subclass Layout

Subclasses add their own member length to their superclass’s.

Subclasses point to their own class definition with one of their references and lay out their member values below the superclass’s. There are some subtleties here surrounding ordering member variables in the subclass layout to reuse space that would’ve been padding in the superclass.

Array Layout

Arrays take up 12 bytes (32 bit) or 20 bytes (64 bit) plus their member size, plus an additional 4 bytes if the array holds longs, doubles or object references.

Arrays behave like objects in java, and contain the same first two references (to the array class and to the ID). They additionally contain an integer length (a choice that, sadly, limits the size of Java arrays). Then they contain the equivalent of their length in member variables. These start after the length, which means 4 more bytes of padding for double or long or object reference arrays.


The Sun JDK implements strings as holding an array of char values, an integer start and an integer length definining a slice of the array, and an integer hash code.

REF -> String
REF -> [ REF -> Array of char
            REF -> ID
            int -> length
            char -> charAt(0)
            char -> charAt(length-1) ]
int -> start
int -> length
int -> hashCode
(REF-int) padding

That’s 5 REFs, 4 ints, and a number of char equal to the length (in UTF-16 byte pairs) of the string, plus padding. So with 8-byte refs in the 64-bit JVM, that’s 60 bytes and with 4-byte refs in the 32-bit JVM, it’s 36 bytes. Plus the length of the string times 2.


That’s why efficient coding in Java looks like C. It’s just too expensive to allocate objects in tight loops, no matter what you may have heard to the contrary about the compiler being smarter than you and generational garbage collection curing all ills.