Archive for the ‘Java’ Category

The Latin1 Transcoding Trick for Ant

April 21, 2013

A while back Bob blogged about The Latin1 Transcoding Trick for Java Servlets, etc.

Suppose you have an API that insists on converting an as-yet-unseen stream of bytes to characters for you (e.g. servlets), but lets you set the character encoding if you want.

Because Latin1 (officially, ISO-8859-1) maps bytes one-to-one to Unicode code points, setting Latin1 as the character encoding means that you get a single Java char for each byte.

Another situation where this trick comes in real handy is dealing with the way that Ant compiles its logfiles.

If, like me, you’re fond of debug-by-printf and you use Ant to compile and run your programs, then you might have run into the problem that has given rise to many StackOverflow queries, that is, when you use an Ant task to run the program and instrument your code with print statements to standard out, Ant replaces non-ASCII characters with a question mark. When the problem you’re trying to debug is making sure that non-ASCII characters are being processed correctly, this is both misleading and maddening. The standard advice on StackOverflow is to set the shell environment variable ANT_OPTS using the following incantation (for bash shell):

export ANT_OPTS="-Dfile.encoding=UTF-8"

This works as long as you’re working with UTF-8 encoded character data and your terminal’s encoding is set to UTF-8 as well. Here a solution that works no matter what character encoding is in play:

export ANT_OPTS="-Dfile.encoding=Latin1"

It’s the ol’ Latin1 transcoding trick!

Of course you already know about character encodings . Do you know about Ant’s IO System? Here’s what Ant contributor Conor MacNeill says:

The Ant IO system is designed to associate any output sent to System.out and System.err with the task that generated the output and to log the output accordingly.

Ant’s Main class installs its own output streams into System.out and System.err. These streams are instances of DemuxOutputStreams

Using the source code for Ant 1.9.0, in class org.apache.tools.ant.Main we see that System.In, System.Out, and System.Err are all reassigned to Ant’s DemuxInputStream and DemuxOutputStream, which extend InputStream and OutputStream, respectively:

System.setIn(new DemuxInputStream(project));
System.setOut(new PrintStream(new DemuxOutputStream(project, false)));
System.setErr(new PrintStream(new DemuxOutputStream(project, true)));

The call to the PrintStream constructor is the one-arg constructor PrintStream(OutputStream out). Because no file encoding is specified, the encoding used is the default charset for the JVM that’s running Ant. This is specified by the system property file.encoding. This property varies depending on your platform and locale. To check this, try this on your machine:

public class GetDefaultEncoding {
    public static void main(String[] args) {
        System.out.println(System.getProperty("file.encoding"));
    }
}

On my Mac running OS-X the default is US-ASCII (because the default locale on this machine is en_US). On my Windows XP machine the default is Cp1252 (Windows Latin1 which differs from ISO_8859-1 just enough to be noticeable).

At the point where Ant’s DemuxInputStream reads in the bytes sent to System.out by a Java task, any character outside of the default character set is replaced by a question mark. When Latin1 is the default encoding, all bytes are valid Latin1 characters and their Unicode code point value is the same as the byte value so the bytes from the Java task pass through the Ant IO system unchanged.

As long as the next process in the chain (e.g. the terminal app) is configured to handle whatever encoding your text data is in, you’re good to go.

Anyone Want to Write an O’Reilly Book on NLP with Java?

February 21, 2013

Mitzi and I pitched O’Reilly books a revision of the Text Processing in Java book that she’s been finishing off.

The response from their editor was that they’d love to have an NLP book based on Java, but what we provided looked like everything-but-the-NLP you’d need for such a book. Insightful, these editors. That’s exactly how the book came about, when the non-proprietary content was stripped out of the LingPipe Book.

I happen to still think that part of the book is incredibly useful. It covers all of unicode, UCI for normalization and detection, all of the streaming I/O interfaces, codings in HTML, XML and JSON, as well as in-depth coverage of reg-exes, Lucene, and Solr. All of the stuff that is continually misunderstood and misconfigured so that I have to spend way too much of my time sorting it out. (Mitzi finished the HTML, XML and JSON chapter, and is working on Solr; she tuned Solr extensively on her last consulting gig, by the way, if anyone’s looking for a Lucene/Solr developer).

O’Reilly isn’t interested in LingPipe because of LingPipe’s non-OSF approved license. I don’t have time to finish the LingPipe book now that I’m at Columbia full time; I figured it’d be 1500 pages when I was done if I kept up at the same level of detail, and even that didn’t seem like enough!

A good model for such a book is Manning Press’s Taming Text, co-authored by Breck’s grad-school classmate Tom Morton. It’s based on Lucene/Mahout and Tom’s baby, OpenNLP. (Manning also passed on Text Processing in Java, which is where Breck sent it first.)

Another model, aimed more at linguists than programmers, is O’Reilly’s own NLTK Book, co-authored by my grad school classmate Steve Bird, and my Ph.D. supervisor, Ewan Klein. Very small world, NLP.

Manning also passed on TPiJ, so Colloquial Media will branch out from genre fiction into tech books. More news when we’re closer to finishing.

If you know why LDA will think this document is about American football and why frame-based parsing will only make topic classification worse, then you’re probably a good candidate to write this book. [Domain knowledge: Manning is the surname of two generations of very well known American football players all of whom play the only position that fill the agent roll in a very popular play known as a "pass".] If you know why Yarowsky‘s one discourse, one sense rule is violated by the background knowledge and also understand the principled Bayesian strategies for what Yarowsky called “bootstrapping,” even better.

Standard Output Ruins Everything!

May 9, 2012

The title is a a paraphrase from Dirk Eddelbuettel on the Rcpp mailing list (an interface tool for R and C++), but the lesson also applies to Java.

Don’t Write to Standard Output!

One of the first lessons of writing an API (as opposed to something that only runs from the command line) is that you never ever ever write to standard output in an API.

The reasons are that (1) you never know how someone might configure standard output around you (it’s resettable in Java), and (2) you never know what context your API will run in — it may be running in a servlet or in a Swing GUI where standard output where it’s invisible to your user (but does clog up the logs and the shell from which the Swing GUI was invoked).

So What do You do Instead?

1. Throw an exception if there’s some kind of error. See my previous post, “When to catch, pass, or throw exceptions?

Of course you have to be careful here about the context things are running in, too, especially if you try throwing a runtime exception instead of a checked exception. This is why the Google style guide for C++ forbids exceptions!

2. If there’s no error, the common advice is to write to a logger. We didn’t do that in LingPipe because we didn’t want any dependencies to other code built into LingPipe. We also didn’t want every user of LingPipe to have to configure a logger like log4j or Java’s built-in logger. The other issue with loggers is that they have one top-level config, so it gets confusing with multiple packages running if you use high-level config in the properties files (I know you can configure per-package, but people often don’t and get surprised).

Alternatively, you can write messages into something like a string builder. Then they can be sent to whatever output source you want.

The com.aliasi.io.Reporter class may look like a standard logger, but it’s only configurable programatically and is set by default to just accumulate results. Note how it’s passed into logistic regression fitting, not just there by default in the background.

A second alternative is to pass an OutputStream into the function that might want to write and write to that. In a command-line setting it can be set to the standard output. In an embedded context, it might be set to a byte array output stream wrapped in a PrintStream, which will just accumulate the results until they can be dealt with. For instance, they might be written into a servlet output stream for use in a web app.

Buffering Exceptions

August 16, 2011

OK, now that it’s just us API fanatics, imagine you need to call a stream read in an interface method implementation, but the interface doesn’t declare an I/O exception. For concreteness, suppose we have an interface:

interface Foo {
    public void bar();
}

and we want to implement a Foo that calls streaming I/O, as in:

public class IOFoo implements Foo {
    public void bar() {
        InputStream in = ...;
        ...
        int numBytes = in.read(buf); // throws IOException
    }
}

A common example is the Java interface Runnable, which has a run() method which does not throw exceptions. Given that this is the basis for threads, the issue comes up all the time.

This won’t compile, because the call to the read method throws an I/O exception, which is a checked exception that is not declared on the run method. In fact, it can’t be declared on the run method, because the interface doesn’t declare an I/O exception. What I too often see is this broken workaround:

        int numBytes = -1;
        try {
            numBytes = in.read();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }

What’s wrong here? There’s no way for anyone calling the bar() method to recover the exception and handle it. You could document that it throws a RuntimeException, but all kinds of problems can lead to a runtime exception being thrown, and there’s no way to tell which threw it. (Yes, you could encode it in the message, but that’s an even worse idea, as it’s very hard to maintain and code against.)

An alternative approach is to follow the pattern used by Java’s PrintWriter class and buffer the exception then make it available to the caller. Here’s a simplified example that buffers a single exception:

public class IOFoo implements Foo {
    private IOException mException;
    public void bar() {
        InputStream in = ...;
        ...
        try {
            in.read(); // throws IOException
        } catch (IOException e) {
            mException = e;
            return;
        }
    }
    public IOException getIOException() {
        return mException;
    }
}

That way, a caller can do this:

IOFoo f = new IOFoo(...);
f.run();
if (f.getIoException() != null)
    throw (f.getIOException());

Now the context around the call to IOFoo’s run can catch the IOException. For instance, if this is in a method call, you can just declare the method to throw the checked IOException. Note that we need to declare f as an IOFoo, not just a Foo, because we call its getIOException method.

Java’s PrintWriter keeps a list of exceptions that have been raised in calls to its methods. That’s because it keeps going after exceptions, assuming the client would rather they be silently ignored. I don’t think that’s such a great idea, for the obvious reason that there’s no telling what’s going to happen when I/O exceptions are just ignored.

It would be straightforward to set a flag that stops any further action once an exception has been raised.

The case I was actually using this in was for was writing a static method to serialize a tokenized language model. Visiting the n-grams requires an IntArrayHandler whose handle method is not declared to throw any checked exceptions. I wanted any embedded exception to be buffered and thrown by the top-level static method.

“Sentiment Takes a LONG Time”, Said the Logistic Regression Classifier

August 12, 2011

Hello again, fellow LingPipers!  It is I, yet again, Sam The Intern.  In my last (and first) post I promised the masses that I would return with a study similar to the last one that I conducted, but this time doing the non-trivial case of sentiment.  It turns out that this aspiration ended up being nothing more than a (Ling)Pipe-dream.  As my titular character, the logistic regression classifier so concisely said before, sentiment takes a long time!  And, with this being my last week at LingPipe, I will not have enough time to give our logistic regression classifiers the attention and whispered sweet nothings that they require to function properly.

As you may remember from my first post, I absolutely love to annotate.  However, annotating for sentiment takes much more time than language identification, because the annotator needs to internalize and judge the text, not just glance at it.  Add to this the fact that the sentiment classifier requires much more training data than the language identification classifier, and you have yourself an intern that has been product-conditioned to Coca-Cola, but no time to do anything about it (reading 1000 tweets about how much people love coke will do that to you).

I was able to start the study, with small amounts of success.  The following confusion matrix is the result of  a 10-fold cross-validation run across all of the data that I had time to annotate (about 1000 tweets each for “Clinton”, “Coke”, and “Gaga”).  The top of the chart is the reference data which I annotated, and the side is the response data that the classifier provided.  The categories are: w = neutral, e = positive, q = negative.

   w    e   q
w 1244,239,50
e 313 ,312,23
q 199 ,45 ,23

I wish that I had more time to spend on the sentiment classification problem.  Perhaps I will be back next summer, to continue the effort to teach computers good from bad.  Until then, I will be at Rochester, college-ing!  And drinking massive quantities of Coke…

“Language ID”, Said the NGram Process Dynamic Language Model Classifier…

July 25, 2011

Hi, there! My name is Sam Brown, and I’m a summer intern here at LingPipe. I’m from Eastchester, NY, and I’m a rising sophomore in the University of Rochester Biomedical Engineering Department.

I’ve been interning at LingPipe’s luxuriously airconditioned headquarters for about a month and a half now. After barely learning Java and some of the LingPipe API in the span of about a week, my slavemaster…rather…boss Breck thought it would be a great idea if I could publish a three-part Java-based computational linguistics experiment exploring the relative performances of generic and customized classifiers for language identification.

And then he told me this is the trivial case.

Since I hardly understood the objective myself, even after Breck kindly spelled it out for me about 50 times, let me try and explain the experiment. Language model classifiers can be used to sort test data (like a bunch of tweets retrieved from Twitter using a query like “Gaga”) into two categories (in my case English, and Non-English). These classifiers, of course, need to be trained on annotated data (tweets collected from Twitter that we sort for the computer) before they can be useful. This is where my experiment’s problem comes in; Is it better to make a custom classifier that is trained on data similar to the test data (a set of tweets different from the test data, but also retrieved using the query “Gaga”), or to train a generic classifier on data that comes from a variety of sources (a collection of tweets from the queries “Coke, “Clinton”, etc.)? What we found is that, based on the input and test data, either classifier may do better at language identification depending on the circumstances.

Data:

Before we could start the experiment, we first needed to collect our data from Twitter. For each of our five queries, 1500 tweets were retrieved from Twitter, and duplicates were removed from this data by requiring a Jaccard distance of .5 or less between any two tweets. The effect of this is that any tweet with more than 50% token (word) overlap with any other tweet in the accepted data collection was rejected from the corpus (our collection of tweets to be used in the experiment).

After that came my absolute favorite part; Annotation. I annotated 500 tweets for being written in English (‘e’) or non-English (‘x’), for 5 queries (for a total of 2,500 tweets). These annotated tweets comprised our first epoch of data. As if that wasn’t enough fun, we reran the 5 queries again two weeks later, and annotated these as well to form the second epoch of data (for a cumulative total of 5000 tweets). We checked if there were duplicates between the two Twitter retrievals by using the Jaccard distance between any two given tweets to decide if they were identical.

To form a baseline to compare our final results to, I needed to find an interannotator agreement rate. My lovely girlfriend very nicely (and possibly naively) agreed to annotate some of the data.  She annotated 400 tweets of each query in the second epoch of data, all of which overlapped with data that I annotated as well. A program was then used to find the precision and recall between my annotations and hers. The agreement rate was 95% precision and 97% recall with confidence .01% with one annotator serving as truth and the other the response (thereby proving that she knows English 5% ± .01% better than me). For evaluation purposes the annotations were adjudicated resulting in complete agreement.

Process:

Part 1: Evaluation of Customized Language Classifier

We created a custom language model classifier by training it on data retrieved from Twitter using one of our queries, like “Gaga”. To do this, we first loaded all of the annotated training tweets for that query into the training section of a corpus. We then loaded the annotated testing tweets into the test section of the corpus. This created a corpus of 1000 tweets, 500 each for training and testing. The testing and training tweets were both retrieved from Twitter using the same query at different times.

As the title character in today’s adventure, our friendly neighborhood nGram boundary language model classifier was used to train and test on the corpus. Because nGram boundary classifiers are sensitive to nGram size (the size of the chunk of characters in a tweet that the classifier being trained actually sees), it trained and tested on the data with nGram sizes 1 through 15.

After each test, the classifier was given to a joint classifier evaluator. This evaluator was used to record the one-versus-all data for each category (precision, recall, and f-measure), as well as calculate the micro-averaged f-measure. The micro-averaged f-measure was used for quantitative comparison between classifiers.

This procedure was repeated for each of the five Twitter queries.

Part 2: Evaluation of Generic Language Classifier

We created a generic language model classifier by training it on data retrieved from Twitter using the same queries as the customized experiment. The difference between the custom and generic classifiers is that, in the generic classifier, the testing and training tweets were retrieved using different queries. Because each set of data included 500 tweets, the total amount of data that we had was 5,000 tweets. For any given query, the training data consisted of all the other data minus the 500 tweets of data to be tested on and 500 tweets of data that was retrieved at an earlier epoch with the same query. All of this data, which contained a total of 4000 tweets for any given test data, was put into a list. Five-hundred tweets were then randomly selected from this list and entered into the training section of the corpus. We then loaded the annotated testing tweets into the testing section of the corpus, and trained and tested the corpus using the same nGram looping structure as we did in the customized training process.

After each test, we evaluated the classifier in the same way that we did for the custom classifier. This procedure was repeated for each of the five Twitter queries.

Part 3: Evaluation of the Effect of Corpus Size On a Generic Classifier

A generic classifier was trained on data from a variety of different sources, as described above. However, the size of the corpus was increased incrementally after each training and testing experiment. This was done by nesting the nGram looping structure inside of the corpus size looping structure. The evaluator returned the best micro-averaged f-measure after each corpus size increase, and this was graphed against corpus size.

Results:

Part 1

For each of the five queries that we tested, the custom classifier had an f-measure that ranged from 80% to 95%. For four out of the five queries, the English classifications performed better than the non-english. However, on the query “Mitsubishi”, the non-english classifications performed better. Out of the four queries in which English performed better, two queries (“Clinton”(Fig. 1) and “Coke” (Fig. 2)) had significantly higher f-measures for the English than the non-English.

Figure 1 - Clinton Generic

Figure 2 - Coke Generic

Figure 2 New

Figure 3 - Gaga Generic

Figure 4 NEW

Figure 4 - Mitsubishi Generic

Figure 5 - Wiener Generic

Part 2

For each of the five queries, the generic classifier had an f-measure that ranged from 70% to 95%

Part 2.5

We put the micro-averaged f-measures for the generic and custom classifiers for each query on the same graph of F-Measure versus NGram size. For two of the queries (“Clinton” (Fig. 6) and “Coke” (Fig. 7)), the custom and generic classifiers performed about equally for an nGram size of 4. For two of the remaining queries (“Gaga” (Fig. 8) and “Wiener” (Fig. 10)), the custom classifier outperformed the generic. For the final query (“Mitsubishi” (Fig. 9)), the generic classifier outperformed the custom.

Figure 6 - Clinton Generic vs. Custom

Figure 7 - Coke Generic vs. Custom

Figure 8 - Gaga Generic vs. Custom

Figure 9 - Mitsubishi Generic vs. Custom

Figure 10 - Wiener Generic vs. Custom

The experiment was run again (at nGram size 4),  50 times.  The sample mean of these experiments’ f-measures were graphed, with a 95% confidence error bar (Figures 11 and 12).

Figure 11 - Non-English Classifier Comparrison

Figure 12 - English Classifier Comparrison

Part 3

When we graphed the generic classifiers’ performance versus the size of the of the training data, we found that the English classifications’ F-Measure increased slightly over increased training data. Non-English classifications’ F-Measure increased dramatically over increased training data. (Fig. 13).

Figure 13 - Clinton Generic Performance Graph (on a logarithmic scale)

We graphed the size of the English portion of the corpus versus the English F-Measure (and did the same with the Non-English), and found that English classifications performed better at low category-corpus sizes than Non-English did (Figs. 14 & 15).

Figure 14 - Coke Category Performance

Figure 15 - Mitsubishi Category Performance

Discussion:

Parts 1+2

We came to the conclusion that neither generic nor customized training regimens are necessarily better than the other because each one performed better in different circumstances. We believe that the Non-English classifications for Mitsubishi scored higher than the English classifications because the Non-English category was larger than the English category, and also it was very coherent (with mostly asian writing, so that each category has either Roman characters or Asian characters). We believe that “Clinton” and “Coke” had much higher English F-Measures than the others because these two queries produce the most English tweets.

Parts 3

We believe that the English classifications performed better than the non-English classifications, especially at low corpus sizes, because the makeup of the English training data is more coherent. Because there are many different, seemingly unrelated attributes that define the broad Non-English category, it’s difficult for the classifier to identify what makes a tweet Non-English at low corpus-sizes. This also explains why the classifier is better at identifying English tweets, even at equal category sizes.

Our data balance for each test are as follows (for the generic classifiers, the balance of the 4000 tweet list is shown):

Custom Data Distribution

Generic Data Distribution

And that’s my first experiment in computational linguistics! Anyone who would like to throw money at me for this monumental achievement may do so liberally. Otherwise, look forward to my next experiment, the not-so-trivial case, Sentiment!! (perhaps THAT will convince you to throw money at me!) Until then, I will be annotating. A lot. My favorite!

Why is C++ So Fast?

July 1, 2011

I had to respond to John Cook’s blog post,

My response was so long, I thought I’d repost it here.

I didn’t understand the answer properly six months ago, despite spending a couple years as a professional C coder (when I did speech recogntion) and having spent the past ten or so years writing Java. C++ is enough different than either of these to be a completely different sort of beast.

C++ is Faster than C!

At least, it’s easier to write fast code in C++ than in C these days. In fact, these days, C++ is the language of choice for optimization, not plain old C. The reason it’s so efficient is twofold.

Reason 1: Tight Data Structures

First, C++ is intrinsically stingy with memory (unlike Java objects, a C++ struct has no memory overhead if there are no virtual functions [modulo word alignment issues]). Smaller things run faster due to caching, and are also more scalable. Of course, this is true of C, too.

Reason 2: Template Metaprogramming

What makes C++ both hard to debug and understand and even more efficient than C is the use of template metaprogramming.

In particular, there’s huge gains to be made with the kind of lazy evaluation you can implement with expression templates and the kind of memory and indirection savings you get by avoiding virtual functions with the curiously recurring template pattern.

[Update: 28 July 2011: I thought an example would help, and this was the original one motivating template expressions.]

Suppose you have a scalar x and two m by n matrices a and b. Now suppose you want to evaluate d = a + (x * b). The standard non-lazy approach would be to create an intermediate matrix c representing the scalar-matrix product (x * b) and then add that to a. Basically, what'd happen is the same as if you'd coded it like this:

matrix c(m,n);
for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n)
        c[m,n] = x * b[m,n];
matrix d(m,n);
for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n)
        d[m,n] = a[m,n] + c[m,n];
return d;

With clever template expression programming, you can get around allocating an intermediate value, so that the resulting computation would amount to the same thing as coding it as follows.

matrix d(m,n);
for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n)
        d[m,n] = a[m,n] + x * b[m,n];
return d;

This saves the allocation and deallocation of matrix c. Memory contention is a huge problem with numerical code these days as CPU speed and parallelism outstrips memory speed, so this is a very large savings. You also save intermediate index arithmetic and a number of sets and gets from arrays, which are not free.

The same thing can happen for transposition, and other operations that don't really need to generate whole intermediate matrices. The trick's figuring out when it's more efficient to allocate an intermediate. The Eigen matrix package we're using seems to do a good job of this.

[end Update: 28 July 2011]

Voting with Their Code

I just visited Google NY lat week, where some of my old friends from my speech recognition days work. Upon moving to Google, they reimplemented their (publicly available) finite state transducer library in C++ using expression templates (the OpenFST project).

You'll also see expression templates with the curious recurrence in the efficient matrix libraries from Boost (BLAS level only) and Eigen (also provides linear algebra). They use templates to avoid creating intermediate objects for transposes or scalar-matrix products, just like a good Fortran compiler.

But Wait, C++ is also More Expressive

Templates are also hugely expressive. They lead to particularly neat implementations of techniques like automatic differentiation. This is what roused me from my life of Java and got me into C++. That, and the need for good probability and matrix libraries -- why don't these things exist in Java (or where are they if I missed them?).

The Future: Stochastic Optimization

What's really neat is that there are now stochastic optimizers for C++ that can analyze your actual run-time performance. I really really really like that feature of the Java server compilers. Haven't tried it in C++ yet.

PS: So Why is LingPipe in Java?

Ability to write fast programs isn't everything. Java is much, and I mean much, easier to deploy. The whole jar thing is much easier than platform-specific executables.

Java's code organization is much cleaner. None of this crazy header guard and .hpp versus .cpp stuff. No putting namespaces one place, code another place, and having them completely unrelated to includes (you can be more organized, but this stuff's just much easier to figure out in Java because it's conventionalized and everyone does it the same way). The interfaces are simpler -- they're explicit, not undocumented pure abstractions (where you have to look at the code of what they call an "algorithm" to see what methods and values it requires of its objects).

Java has support for threading built in. And a great concurrency library. There are web server plugins like servlets that make much of what we do so much easier.

There's also programming time. I'm getting faster at writing C++, but there's just so much more that you need to think about and take care of, from both a system author and client perspective (mainly memory management issues).

Many people don't realize this, but C is pretty much the most portable language out there. As far as I know, there's no place Java runs that C won't run. C++ is a different story. The compilers are getting more consistent, but the texts are still fraught with "your mileage may vary due to compiler" warnings, as are the mailing lists for the different packages we're using (Eigen, Boost Spirit, Boost Math), all of which lean heavily on templates.

Drafts 0.5 of Books on LingPipe and Java Text

June 27, 2011

We’ve released draft 0.5 of the LingPipe books. Get them while they’re hot from their home page:

Two Books for the Price of One

For version 0.5, we’ve broken out the bits about text processing in Java that aren’t LingPipe specific into their own book. And we’ve slightly changed the title of the LingPipe book to make this clearer.

Authors Piling On

Breck’s going to help out with both books.

Mitzi’s agreed to help us out with the text processing book now that she’s back in the bio(medical) text processing business after several years of working on genomics.

Comments, Please

As always, your comments are appreciated.

Pro Bono Projects and Internships with LingPipe

April 4, 2011

We have initiated a pro bono program that teams a lingpipe employee with a programmer wanting to learn about LingPipe to serve a non-profit or researcher needing help with a text analytics/NLP/computational linguistics project.

So far is is going pretty well with one classification project underway. The way it went was that I worked with an intern, a solid NLP programmer, to help a Ga Tech researcher get a classification system up and running. Now the intern is running the project with some feedback from me but not much is required at this point.

We are taking applications for interns wanting to learn how to code with LingPipe. We also will take applications for projects. Non-profits or academic research only please. Just send breck@lingpipe.com an email outlining your skills/needs and we will see what we can do.

Breck

Processing Tweets with LingPipe #3: Near duplicate detection and evaluation

January 2, 2011

Summary:

In Post #1 we covered how to search Twitter and get a useful on disk data representation of the search results. In Post #2 we covered the first level of deduplication using HashSets as the test of sameness. We extended sameness by doing some normalization of the tweets which included removing urls and retweets.

You can download the code with this tarball or get it from subversion with the command


svn co https://aliasi.devguard.com/svn/teaching/lingpipe4twitter

Despite the solutions above we still have annoyingly similar tweets that are not useful for my goals for this set of posts. In particular the near duplicates really foul up any attempts at cross validation where one part of the data is used as training and the other as test data. If there are duplicates then we are training on testing for some examples and results end up looking too good.

Tokenization and Jaccard Distance

The duplicate detection problem in Twitter is really about word overlap with a slight game of telephone quality to it. The elaborations of previous tweets tend to be leading or following comments with the core of the source tweet preserved. Not much rephrasing is going on so word overlap between tweets is the obvious place to go. That entails a few elaborations to our approach:

  1. Find words in the tweets: Tokenization
  2. Measure similarity of tweets without sensitivity to tweet length: Jaccard Distance

Tokenization

A very simple tokenizer can just break on what characters separate words as follows:

public class Tokenize1 {
    //Approach #1, find token seperators
    public static void main (String[] args) {
	String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz";
	System.out.println("Tweet: " + tweet);
	String[] tokens = tweet.split(" ");
	for (String token : tokens) {
	    System.out.println("Token: " + token);
	}
    }
}

This code is in src/Tokenize1.java. It produces output:

<ant tokenize1
     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz
     [java] Token: RT
     [java] Token: @mparent77772:
     [java] Token: Should
     [java] Token: Obama's
     [java] Token: 'internet

Another approach would be to try an define what characters belong in a token and match that, but that is a little more complex to program because String does not have a simple method call to capture an array of regex matches–see the regex chapter in the LingPipe book for more discussion:

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Tokenize2 {
    //Approach 2, define what is a token
    public static void main (String[] args) {
	String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz";
	Pattern tokenPattern = Pattern.compile("\\w+"); //match one or more letter or digit chars
	System.out.println("Tweet: " + tweet);
	Matcher matcher 
	    = tokenPattern.matcher(tweet);
	while (matcher.find()) { //keep matching until no more matches
	    System.out.println("Token: " + matcher.group());//print what matched
	}
    }
}

This approach produces the following output:

<ant tokenize2  
   [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz
     [java] Token: RT
     [java] Token: mparent77772
     [java] Token: Should
     [java] Token: Obama
     [java] Token: s
     [java] Token: internet

Note that the text that separate the tokens are very different. Between ‘RT’ and ‘mparent77772′ is the separator ‘ @’. Depending on the needs of the application it can be easier to define tokens by what they look like rather than what the spaces around them look like. Often it is both.

While these approaches might work for the case at hand we will introduce the LingPipe tokenizers instead. They offer much richer tokenization options ready to go–see chapter 8 of the LingPipe book draft for more details or look at the Java doc.

An equivalent tokenization in the LingPipe API is created as follows:

import com.aliasi.tokenizer.RegExTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Tokenize3 {
    public static void main (String[] args) {
	//approach #3, A LingPipe tokenizer
	String tweet = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz";
	TokenizerFactory tokFactory
	    = new RegExTokenizerFactory("\\w+");//match one or more letter or digit chars
	System.out.println("Tweet: " + tweet);
	char[] chars = tweet.toCharArray();
	Tokenizer tokenizer 
	    = tokFactory.tokenizer(chars,0,chars.length);
	String token;
	System.out.println("White Space :'" +  tokenizer.nextWhitespace() + "'");
	while ((token = tokenizer.nextToken()) != null) {
	    System.out.println("Token: " + token);
	    System.out.println("White Space :'" + tokenizer.nextWhitespace()+"'");
	}
    }
}

Its output reports on both the white spaces as well as the tokens.

     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz
     [java] White Space :''
     [java] Token: RT
     [java] White Space :' @'
     [java] Token: mparent77772
     [java] White Space :': '
     [java] Token: Should

We add the normalization features of post #2 by extending the ModifyTokenTokenizerFactory to filter retweets and urls. A good way to develop a tokenizer is to put in into its own class and create a stand-alone task that runs the tokenizer over example data. In ‘src/NormalizedTokenizerFactory.java’ there is a main method:


public static void main(String[] args) {
	TokenizerFactory tokFactory = new NormalizedTokenizerFactory();
	String text = "RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz"
	System.out.println("Tweet: " + text);
	char[] chars = text.toCharArray(); //convert to charArray
	Tokenizer tokenizer 
	    = tokFactory.tokenizer(chars,0,chars.length);
	String token;
	System.out.println("White Space :'" +  tokenizer.nextWhitespace() + "'");
	while ((token = tokenizer.nextToken()) != null) {
	    System.out.println("Token: " + token);
	    System.out.println("White Space :'" + tokenizer.nextWhitespace()+"'");
	}
    }

Running the above with ant normalizedTokenizerFactorty produces nicely normalized output.

     [java] Tweet: RT @mparent77772: Should Obama's 'internet kill switch' power be curbed? http://bbc.in/hcVGoz
     [java] White Space :''
     [java] Token: Should
     [java] White Space :' '
     [java] Token: Obama
     [java] White Space :'''
     [java] Token: s
     [java] White Space :' ''
     [java] Token: internet

How that tokenizer functions is left as an exercise for the reader. Time to take on Mr. Jaccard.

Jaccard Distance as a Measure of Similarity

Jaccard Distance is a good way to impress folks with a fancy sounding term that is really just percent word overlap in our usage. If you think it helps you can also call it by the term ‘coefficient de communauté’ but then you might have to reveal that it was popularized by a French botanist–kind of undermines its jargon impact score. From the Jaccard Javadoc the proximity of two character sequences:

 proximity(cs1,cs2)
   = size(termSet(cs1) INTERSECT termSet(cs2))
     / size(termSet(cs1) UNION termSet(cs2))

We get the term sets from the tokenizer, and proximity is the percentage of tokens shared by both character sequences. The code to compute this for all pairs of tweets in our search results is in src/ExploreJaccard.java and the relevant bit of code is:

	JaccardDistance jaccardD = new JaccardDistance(tokFactory);
	int filteredCount = 0;
	List candidateTweets 
	    = filterNormalizedDuplicates(texts, 
					 tokFactory); //throw out easy cases
	System.out.println("Normalized duplicate filter leaves " + candidateTweets.size() + " tweets");
	row = new ArrayList();
	for (int i = 0; i < candidateTweets.size(); ++i) {
	    String closestTweet = "default value";
	    double closestProximity = -1d;
	    String targetTweet = candidateTweets.get(i);
	    for (int j = 0; j < candidateTweets.size(); ++j ) {//cross product, ouchy, ow ow. 
		String comparisionTweet = candidateTweets.get(j);
		double thisProximity 
		    = jaccardD.proximity(targetTweet,comparisionTweet);
		if (i != j) { // can't match self
		    if (closestProximity < thisProximity) {
			closestTweet = comparisionTweet;
			closestProximity = thisProximity;
		    }
		}
	    }

The goal of the above wildly inefficient program is to explore what the closest tweet is as determined by token overlap. The filterNormalizedDeduplicates(tweets,tokFactory) filters out duplicates as discussed in #2. We can then decide on a threshold to throw out tweets with too much overlap. Running the program on the Obama.csv example:

ant exploreJaccard -Ddata=searches/Obama.csv -Doutfile=runs/Obama.jaccard.csv

and then viewing the output .csv file with a sort on column B we get (click to see larger image):

Jaccard Distance sorted for similarity, click on for larger image

Note that we get some values of 1 even though there are differences in Tweet1 and Tweet2. In row 2 the normalization filters out the url leaving the only difference being the phrase “replacing trains by high speed buses” in Tweet1 has an additional “high speed” in “replacing high speed trains by high speed busses” in Tweet 2. Since Jaccard does not count the number of words in computing distance the additional phrase adds no words to the set of words since it already exists in the tweet.

Scrolling down reveals just how bad redundancy can be with tweets. At the 50% overlap point the tweets are still looking quite similar:

	Vet, 77, Busted For Obama Death Threat | The Smoking Gun http://t.co/MrTUwxv via @
	Vet, 77, Busted For Obama Death Threat http://tinyurl.com/25zyxgp #tcot #tlot #sgp

There are 14 unique tokens with 7 overlapping yielding 50% overlap. 36% of the Obama query tweets have an overlap of 50% or more with another tweet. Trying a different query, "the" finds less overlap between tweets. Only 3% of the tweets have another tweet with 50% overlap. The query "today" yields 14% of tweets overlap. The lesson here is that some queries yield more uniform tweets which impacts subsequent language processing. A more technical way of expressing this observation is that the entropy of the resulting search results varies depending on the query. The "obama" search result entropy is lower (less random) than results for "the" which is more random.

The ‘today’ query had usefully different tweets at the 50% overlap level:

Playing a show in Chicago, IL at 9:00 PM today at LE PASSAGE http://artistdata.com/a/32on	
Playing a show in Cape Girardeau, MO at 9:00 PM today at The Venue http://artistdata.com/a/32ow

Despite sharing half the words they are clearly not retweets and contain different information. It might be quite difficult to automatically reject ‘obama’ near duplicates at 50% token overlap but retain the ‘today’ near duplicates with 50% overlap–I encourage people to suggest approaches in the comments.

Finally we add a class that will take a set of queries and filter them for near duplicates at a given Jaccard proximity in src/DedupeJaccard.java. The interesting bit is:

    public static List filterTweetsJaccard(List texts,
				       TokenizerFactory tokFactory,
				       double cutoff) {
	JaccardDistance jaccardD = new JaccardDistance(tokFactory);
	List filteredTweets = new ArrayList();
	for (int i = 0; i < texts.size(); ++i) {
	    String targetTweet = texts.get(i);
	    boolean addTweet = true;
	    //big research literature on making the below loop more efficient
	    for (int j = 0; j = cutoff) {
		    addTweet = false;
		    break; //one nod to efficency
		}
	    }
	    if (addTweet) {
		filteredTweets.add(targetTweet);
	    }
	}
	return filteredTweets;
    }

Deduplication along these lines is a big deal for web search as well as cleaning up Twitter feeds. The painful bit is the comparison of each new tweet to all the tweets that have passed the filter before which does not scale well. Much work has gone into hashing schemes that test for a hash match based on a lexical fingerprint of the existing corpus of tweets as well as more sophisticated similarity metrics. A recent paper with a decent overview of past work is http://research.microsoft.com/pubs/130851/ANDD.pdf.

Running the code on the Obama search results removes half the tweets with a .5 cutoff.

<ant dedupeJaccard -Ddata=searches/Obama.csv -Doutfile=runs/ObamaJaccardFiltered.csv
Buildfile: build.xml
compile:
    [javac] Compiling 1 source file to /home/breck/devguard/teaching/lingpipe4twitter/build/classes

jar:
      [jar] Building jar: /home/breck/devguard/teaching/lingpipe4twitter/demo.jar

dedupeJaccard:
     [java] Writing to: runs/ObamaJaccardFiltered.csv
     [java] Filtered size: 752 originally 1500

Looking at the tweets another issue immediately presents itself:

Coupe du monde - Obama: "Mauvaise décision": Le président Barack Obama a… http://goo.gl/fb/j5OUm #Maroc #sport #foot
RT @Luminary212: LOL “@BreakingNews: Obama knocks FIFA's decision to award Qatar, not the U.S., the 2022 World Cup: 'Wrong decision'”
I want to meet President Obama.
RT @laaficion Barack Obama, presidente de EU, consideró que la FIFA se equivocó al darle a Qatar el Mundial de 2022 y no a su país// celos

All sorts of languages are in the data. It looks like there is French and Spanish mixed in with the English. Next post will be creation of a language identification classifiers that will return to the notion of entropy discussed above.

Wrapup

In the quest to clean up the search results I have introduced some key concepts in text processing that will resurface in various forms throughout the posts. Summarizing:

  • Tokenization: Breaking strings into words/word separators is key to getting at the unit of meaning for word based processing. Oddly enough in the next post we won’t be using words for language id, but we will be still tokenizing.
  • Normalization: By eliminating trivial differences we can make text look more uniform–but beware of this tendancy.
  • Entropy: Some collections of data are more uniform than others.
  • Similarity: There are measures like Jaccard Distance to estimate the similarity of text. We will see many more examples that implement topic assignment, sentiment and more.
  • Evaluation: We have picked an operating point for near duplicate detection by examining data. While the evaluation metric remains quite loose, in the next post we will both create gold-standard (or truth) data and evaluation the performance of our language id system against it.

Breck


Follow

Get every new post delivered to your Inbox.

Join 467 other followers