Archive for the ‘LingPipe in Use’ Category

“Academic” Licenses, GPL, and “Free” Software

November 3, 2011

[This post repeats a long comment I posted about licensing in response to Brendan O’Connor’s blog entry, End-to-End NLP Packages. Brendan’s post goes over some packages for NLP and singles out LingPipe as being only “quasi free.”]

Restrictive “Academic-Only” Licenses

Some of those other packages, like C&C Tools and Senna, are in the same “quasi free” category as LingPipe in the sense that they’re released under what their authors call “non-commercial” licenses. For instance, none of the Senna, C&C, or LingPipe licenses are compatible with GPL-ed code. Senna goes so far as to prohibit derived works altogether.

The LingPipe License

The intent for the LingPipe license was a little different from the “academic use only” licenses in that we didn’t single out academia as a special class of users. We do allow free use for research purposes for industrialists and academics alike. We also provide a “developers” license that explicitly grants this right, which makes some users’ organizations feel better.

Truly Free NLP Software

The other tools, like NLTK, Mallet, OpenNLP, and GATE are released under more flexible licenses (LGPL, Apache or BSD), which I really do think of as being truly “free”. Mahout’s also in this category, though not mentioned by Brendan, whereas packages like TreeTagger are more like Senna or C&C in their restrictive “academic only” licensing.

Stanford and the GPL

Stanford NLP’s license sounds like it was written by someone who didn’t quite understand the GPL. Their page says (the link is also theirs):

The Stanford CoreNLP code is licensed under the full GPL, which allows its use for research purposes, free software projects, software services, etc., but not in distributed proprietary software.

Technically, what they say is true. It would’ve been clearer if they’d replaced “research” with “research and non-research” and “free” with “free and for-profit”. Instead, their choice of examples suggests “free” or “research” have some special status under the GPL, which they don’t. With my linguist hat on, I’d say their text leads the reader to a false implicature. The terms “research” and “academia” don’t even show up in the GPL, and although “free” does, GNU and others clarify this usage elsewhere as “free as in free speech”, not “free as in free beer”.

Understanding the GPL

The key to understanding the GPL lies behind Stanford’s embedded link explaining what counts as proprietary software.

Here, proprietary doesn’t have to do with ownership, but rather with closed source. Basically, if you redistribute source code or an application based on GPL-ed code, you have to also release your code under the GPL, which is why it’s called a “copyleft” or “viral” license. In some cases, you can get away with using a less restrictive license like LGPL or BSD for your mods or interacting libraries, though you can’t change the underlying GPL-ed source’s license.

GPL Applies to Academics, Too

There’s no free ride for academics here — you can’t take GPL-ed code, use it to build a research project for your thesis, then give an executable away for free without also distributing your code with a compatible license. And you can’t restrict the license to something research only. Similarly, you couldn’t roll a GPL-ed library into Senna or C&C or LingPipe and redistribute them under their own licenses. Academics are often violating these terms because they somehow think “research use only” is special.

Services Based on GPL-ed Software and the AGPL

You can also set up a software service, for example on Amazon’s Elastic Compute Cloud (EC2) or on your own servers, that’s entirely driven by GPL-ed software, like say Stanford NLP or Weka, and then charge users for accessing it. Because you’re not redistributing the software itself, you can modify it any way you like and write code around it without releasing your own software. GNU introduced the Affero GPL (AGPL), a license even more restrictive than the GPL that tries to close this server loophole for the basic GPL.

Charging for GPL-ed Code

You can charge for GPL-ed code if you can find someone to pay you. That’s what RedHat’s doing with Linux, what Revolution R’s doing with R, and what Enthought’s doing with Python.

LingPipe’s Business Model is Like MySQL’s

Note that this is not what MySQL did with MySQL (before they sold it to Oracle) nor is it what we do with LingPipe. In both those cases, the company owns all the intellectual property and copyrights and thus is able to release the code under multiple licenses. This multi-licensing strategy’s explained elsewhere.

We license LingPipe under custom licenses as well as our royalty-free license. These licenses include all sorts of additional restrictions (like only using some of the modules on so many servers) and additional guarantees (like indemnification and maintenance); don’t ask me about the details — that’s Breck’s bailiwick. Suffice it to say most companies don’t like to get involved with copyleft, be it from GPL or LingPipe’s royalty-free license. So we let them pay us extra and get an unencumbered license so they can do what they want with LingPipe and not have to share their code. We’ve had more than one customer buy a commercial license for LingPipe who wouldn’t even tell us what they were going to do with our software.

Free “Academic” Software

Also, keep in mind that as an academic, your university (or lab) probably has a claim to your intellectual property developed using their resources. GNU has published some advice for academics on that front that’s worth reading.


Discontinuities in ROC Calculations

October 26, 2011

Amaç Herdagdelen brought up some interesting examples on our mailing list that caused LingPipe’s ROC curve (and PR curve) implementation to provide some unexpected (and arguably wrong) results. (We then took the detailed discussion off line, and are now reporting back.)

Computing ROC Curves by Sorting

The basic problem is that the computation of an ROC curve involves first sorting the responses by the system’s estimated probability a response should be positive, then incrementally computing sensitivity versus 1 – specificity operating points by working down the ranked list. You can see a worked example in the LingPipe class documentation.

What if there are Ties?

Mike Ross realized there was a problem when you had two examples with the same score, but different reference values. For instance, a system might be looking at two Tweets and estimate a 0.98 probability that both are positive sentiment. If one was classified as positive in the reference (i.e., the gold standard) and the other negative, then there’s a problem. With one order, you get one curve, and with the other order, you get a different curve. Mike figured out we might as well just leave off the intermediate point, because we get to the same operating point after handling both of them.

What if there are Near Ties?

Then Amaç wrote back after noticing he had some cases that were very close, but not quite the same. This comes up with floating-point precision all the time. In his case, it was different one-versus-all results for a multi-category classifier. He suggested treating two results as the same if they were within some small absolute value of each other. This is the typical thing to do when checking whether results are the same in unit tests (or code) — it’s built into junit for Java and googletest for C++.
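To make this concrete, here’s a minimal sketch of the sorting approach with both fixes rolled in: Mike’s trick of dropping the intermediate operating point between tied scores, and Amaç’s absolute-epsilon tolerance for near ties. The class and method names here are mine, not LingPipe’s API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RocSketch {
    static final double EPSILON = 1e-9;  // tolerance for near-ties

    /** Returns (1 - specificity, sensitivity) operating points, sorted by
        descending score, emitting no point between (nearly) tied scores. */
    public static List<double[]> rocPoints(double[] scores, boolean[] positive) {
        Integer[] idx = new Integer[scores.length];
        for (int i = 0; i < idx.length; ++i) idx[i] = i;
        // sort indexes by score, highest score first
        Arrays.sort(idx, (a, b) ->[b], scores[a]));

        int totalPos = 0;
        for (boolean b : positive) if (b) ++totalPos;
        int totalNeg = scores.length - totalPos;

        List<double[]> points = new ArrayList<>();
        points.add(new double[] { 0.0, 0.0 });
        int tp = 0, fp = 0;
        for (int rank = 0; rank < idx.length; ++rank) {
            if (positive[idx[rank]]) ++tp; else ++fp;
            // Mike's fix: skip the intermediate point if the next score
            // is within EPSILON of the current one (a tie or near tie)
            if (rank + 1 < idx.length
                && Math.abs(scores[idx[rank]] - scores[idx[rank + 1]]) < EPSILON)
                continue;
            points.add(new double[] { (double) fp / totalNeg,
                                      (double) tp / totalPos });
        }
        return points;
    }
}
```

With Mike’s two 0.98-scored tweets, one positive and one negative in the reference, this produces only the endpoints (0,0) and (1,1), skipping the order-dependent point in between.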

Discretized ROC Curve Estimates are Discontinuous

That led me (Bob) to realize that the real problem is that these measures are discontinuous in the underlying scores. Suppose I have three items, A, B and C, with true values POS, NEG, and POS. Now if I have assigned scores to them of 0.97, 0.96, and 0.98, I get a curve based on the sequence TP, TP, FP, or a perfect score. I get the same result as the score for B moves from 0.96 to any value less than 0.97. But at the point it crosses 0.97, I wind up with a sequence TP, FP, TP, which is a whole different story.

The upshot is that area under the ROC curve (or PR curve) is not continuous in the underlying scores.

Help? Should we Probabilistically Smooth?

Does anyone have any idea what we should do here?

I’m thinking moving to some kind of probabilistic version might be able to smooth things out in a well-defined way. For instance, if I add normally-distributed noise around each point and then integrate it out, things are smooth again.
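Here’s a crude sketch of that idea: estimate AUC as the Mann–Whitney probability that a randomly chosen positive outranks a randomly chosen negative, then average the AUC over draws of normally distributed noise added to the scores. The names and the Monte Carlo integration (rather than anything analytic) are my own choices.

```java
import java.util.Random;

public class SmoothedAuc {
    /** Mann-Whitney AUC: P(score of a positive > score of a negative),
        with ties counting one half. */
    static double auc(double[] scores, boolean[] positive) {
        double wins = 0.0;
        int pairs = 0;
        for (int i = 0; i < scores.length; ++i) {
            if (!positive[i]) continue;
            for (int j = 0; j < scores.length; ++j) {
                if (positive[j]) continue;
                ++pairs;
                if (scores[i] > scores[j]) wins += 1.0;
                else if (scores[i] == scores[j]) wins += 0.5;
            }
        }
        return wins / pairs;
    }

    /** Monte Carlo estimate of expected AUC after adding Normal(0, sd^2)
        noise to each score; the expectation is smooth in the raw scores. */
    static double smoothedAuc(double[] scores, boolean[] positive,
                              double sd, int draws, long seed) {
        Random rng = new Random(seed);
        double[] noisy = new double[scores.length];
        double sum = 0.0;
        for (int d = 0; d < draws; ++d) {
            for (int i = 0; i < scores.length; ++i)
                noisy[i] = scores[i] + sd * rng.nextGaussian();
            sum += auc(noisy, positive);
        }
        return sum / draws;
    }
}
```

On the A/B/C example above, the raw AUC jumps from 1 to 2/3 as B’s score crosses 0.97, whereas the noise-averaged version changes continuously.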

Domain Adaptation with Hierarchical Logistic Regression

September 29, 2011

Last post, I explained how to build hierarchical naive Bayes models for domain adaptation. That post covered the basic problem setup and motivation for hierarchical models.

Hierarchical Logistic Regression

Today, we’ll look at the so-called (in NLP) “discriminative” version of the domain adaptation problem. Specifically, using logistic regression. For simplicity, we’ll stick to the binary case, though this could all be generalized to K-way classifiers.

Logistic regression is more flexible than naive Bayes in allowing other features (aka predictors) to be brought in along with the words themselves. We’ll start with just the words, so the basic setup looks more like naive Bayes.

The Data

We’ll use the same data representation as in the last post, with D being the number of domains, with I_d docs in domain d. Document i in domain d will have N[d,i] tokens. We’ll assume V is the size of the vocabulary.

Raw document data provides a token x[d,i,n] \in 1{:}V for word n of document i of domain d. Assuming a bag-of-words representation like we used in naive Bayes, it’ll be more convenient to convert each document into a word frequency vector u[d,i] of dimensionality V. To match naive Bayes, define a word frequency vector u[d,i] with value at word (or intercept) w given by

u[d,i,w] = \sum_{n = 1}^{N[d,i]} \mathbb{I}(x[d,i,n] = w),

where \mathbb{I}(\phi) is the indicator function, taking on value 1 if \phi is true and 0 otherwise. Now that we have continuous data, we could transform it using something like TF/IDF weighting. For convenience, we’ll prefix an intercept predictor 1 to each vector of predictors u[d,i]. Because it’s logistic regression, other information like the source of the document may also be included in the predictor vectors.

The labeled training data consists of a classification z[d,i] \in \{ 0, 1 \} for each document i in domain d.
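As a concrete sketch (names mine, not LingPipe’s), converting one document’s tokens x[n] \in 1{:}V into the predictor vector u with the intercept prefixed looks like this:

```java
public class WordFrequencies {
    /** Converts tokens x[n] in 1..V for one document into a predictor
        vector u of dimension V + 1, with u[0] = 1 as the intercept and
        u[w] the count of word w in the document. */
    static double[] toPredictors(int[] x, int V) {
        double[] u = new double[V + 1];
        u[0] = 1.0;             // intercept predictor
        for (int token : x)
            u[token] += 1.0;    // words are numbered 1..V, so no offset needed
        return u;
    }
}
```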

The Model, Lower Level

The main parameter to estimate is a coefficient vector \beta[d] of size V + 1 for each domain d. Here, the first dimension will correspond to the intercept and the other dimensions each correspond to a word in the vocabulary.

The probability that a given document is of category 1 (rather than category 0) is given by

\mbox{Pr}[z[d,i] = 1] = \mbox{logit}^{-1}(\beta[d]^T u[d,i]).

The inner (dot) product is just the sum of the dimensionwise products,

\beta[d]^T u[d,i] = \sum_{v = 0}^V \beta[d,v] \times u[d,i,v].

This value, called the linear prediction, can take values in (-\infty,\infty). The logistic sigmoid function

\mbox{logit}^{-1}(x) = 1/(1 + \exp(-x))

maps the unbounded linear prediction to the range (0,1), where it is taken to represent the probability that the category z[d,i] of document i in domain d is 1.

In sampling notation, we define the category as being drawn from a Bernoulli distribution with parameter given by the transformed predictor; in symbols,

z[d,i] \sim \mbox{\sf Bern}(\mbox{logit}^{-1}(\beta[d]^T u[d,i])).

Unlike naive Bayes, the logistic regression model does not model the word data u, instead treating it as constant.
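Putting the last few formulas together in code, the category-1 probability for a document is just the inverse logit applied to a dot product. A sketch (again, my own names, not LingPipe’s API):

```java
public class LogisticPrediction {
    /** Logistic sigmoid: maps (-inf, inf) to (0, 1). */
    static double inverseLogit(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    /** Pr[z = 1] = logit^{-1}(beta' u) for coefficient and predictor
        vectors of dimension V + 1 (slot 0 is the intercept). */
    static double probPositive(double[] beta, double[] u) {
        double linear = 0.0;                 // the linear prediction
        for (int v = 0; v < beta.length; ++v)
            linear += beta[v] * u[v];
        return inverseLogit(linear);
    }
}
```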

The Model, Upper Levels

We are treating the coefficient vectors as estimated parameters, so we need a prior for them. We can choose just about any kind of prior we want here. We’ll only consider normal (Gaussian, L2) priors here, but the Laplace (L1) prior is also very popular. Recently, statisticians have argued for using a Cauchy distribution as a prior (very fat tails; see Gelman et al.’s paper) or combining L1 and L2 into a so-called elastic net prior. LingPipe implements all of these priors; see the documentation for regression priors.

Specifically, we are going to pool the prior across domains by sharing the priors for words v across domains. Ignoring any covariance for the time being (a bad idea in general, but it’s hard to scale covariance to NLP-sized coefficient vectors), we draw the component for word v of the coefficients for domain d from a normal distribution,

\beta[d,v] \sim \mbox{\sf Normal}(\mu[v],\sigma[v]^2).

Here \mu[v] represents the mean value of the coefficient for word (or intercept) v across domains and \sigma[v]^2 is the variance of the coefficient values across domains. A strong prior will have low variance.

All that’s left is to provide a prior for these location and scale parameters. It’s typical to use a simple centered normal distribution for the locations, specifically

\mu[v] \sim \mbox{\sf Normal}(0,\tau^2),

where \tau^2 is a constant variance parameter. Typically, the variance \tau^2 is set to a constant to make the entire prior on the means weakly informative. Alternatively, we could put a prior on \tau itself.

Next, we need priors for the \sigma[v] terms. Commonly, these are chosen to be inverse \chi^2 distributions (a specific kind of inverse gamma distribution) for convenience because they’re (conditionally) conjugate. Thus the typical model would take the variance \sigma[v]^2 to be the parameter, giving it an inverse gamma prior,

\sigma[v]^2 \sim \mbox{\sf InvGamma}(\alpha,\beta).

Gelman argues against inverse gamma priors on variance because they have pathological behavior when the hierarchical variance terms are near zero.
Gelman prefers the half-Cauchy prior, because it tends not to degenerate in hierarchical models like the inverse gamma can. The half Cauchy is just the Cauchy distribution restricted to positive values (thus doubling the usual density, but restricting support to non-negative values). In symbols, we generate the deviation \sigma[v] (not the variance) using a (non-negative) Half-Cauchy distribution:

\sigma[v] \sim \mbox{\sf HalfCauchy}().

And that’s about it. Of course, you’ll need some fancy-pants software to fit the whole thing, but it’s not that difficult to write.

Properties of the Hierarchical Logistic Regression Model

Like in the naive Bayes case, going hierarchical means we get data pooling (or “adaptation” or “transfer”) across domains.

One very nice feature of the regression model follows from the fact that we are treating the word vectors as unmodeled predictors and thus don’t have to model their probabilities across domains. Instead, the coefficient vector \beta[d] for domain d only needs to model words that discriminate positive (category 1) from negative (category 0) examples. The correct value for \beta[d,v] for a word v is thus 0 if the word does not discriminate between positive and negative documents. That’s why our hierarchical hyperprior has location 0: it pulls the prior location \mu[v] for each word v toward 0.

The intercept is just acting as a bias term toward one category or the other (depending on the value of its coefficient).

Comparison to SAGE

If one were to approximate the full Bayes solution with maximum a posteriori point estimates and use Laplace (L1) priors on the coefficients,

\beta[d,v] \sim \mbox{\sf Laplace}(\mu[v],\sigma[d]^2)

then we recreate the regression counterpart of Eisenstein et al.’s hybrid regression and naive-Bayes style sparse additive generative model of text (aka SAGE).

Eisenstein et al.’s motivation was to represent each domain as a difference from the average of all domains. If you recast the coefficient prior formula slightly and set

\beta[d,v] = \alpha[d,v] + \mu[v]

and sample

\alpha[d,v] \sim \mbox{Laplace}(0,\sigma[d]^2)

you’ll see that if the variance term \sigma[d] is low enough, many of the \alpha[d,v] values will be zero. Of course, in a full Bayesian approach, you’ll integrate over the uncertainty in \alpha, not set it to a point estimate of zero. So sparseness only helps when you’re willing to approximate and treat estimates as more certain than we have evidence for. Of course, researchers do that all the time. It’s what LingPipe’s logistic regression and CRF estimators do.

Adding Covariance

Presumably, the values of the coefficients will have non-negligible covariance. That is, I expect words like “thrilling” and “scary” to covary. For thriller movies, they’re positive sentiment terms; for appliances, rather negative. The problem with modeling covariance among lexical items is that we usually have tens of thousands of them. Not much software can handle a 10K by 10K covariance matrix.

Another way to add covariance, instead of using a covariance model for “fixed effects,” is to convert to random effects. For instance, a model like latent Dirichlet allocation (LDA) models the covariance of words by grouping them into topics: two words with high probabilities in the same topic will have positive covariance.

Further Reading

I’m a big fan of Gelman and Hill’s multilevel regression book. You definitely want to master that material before moving on to Gelman et al.’s Bayesian Data Analysis. Together, they’ll give you a much deeper understanding of the issues in hierarchical (or more generally, multilevel) modeling.

Buffering Exceptions

August 16, 2011

OK, now that it’s just us API fanatics, imagine you need to call a stream read in an interface method implementation, but the interface doesn’t declare an I/O exception. For concreteness, suppose we have an interface:

interface Foo {
    public void bar();
}

and we want to implement a Foo that calls streaming I/O, as in:

public class IOFoo implements Foo {
    public void bar() {
        InputStream in = ...;
        int numBytes =; // throws IOException
    }
}

A common example is the Java interface Runnable, which has a run() method which does not throw exceptions. Given that this is the basis for threads, the issue comes up all the time.

This won’t compile, because the call to the read method throws an I/O exception, which is a checked exception that is not declared on the bar method. In fact, it can’t be declared on the bar method, because the interface doesn’t declare an I/O exception. What I too often see is this broken workaround:

        int numBytes = -1;
        try {
            numBytes =;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
What’s wrong here? There’s no way for anyone calling the bar() method to recover the exception and handle it. You could document that it throws a RuntimeException, but all kinds of problems can lead to a runtime exception being thrown, and there’s no way to tell which threw it. (Yes, you could encode it in the message, but that’s an even worse idea, as it’s very hard to maintain and code against.)

An alternative approach is to follow the pattern used by Java’s PrintWriter class and buffer the exception then make it available to the caller. Here’s a simplified example that buffers a single exception:

public class IOFoo implements Foo {
    private IOException mException;
    public void bar() {
        InputStream in = ...;
        try {
  ; // throws IOException
        } catch (IOException e) {
            mException = e;
        }
    }
    public IOException getIOException() {
        return mException;
    }
}
That way, a caller can do this:

IOFoo f = new IOFoo(...);;
if (f.getIOException() != null)
    throw f.getIOException();

Now the context around the call to IOFoo’s bar() can catch the IOException. For instance, if this is in a method call, you can just declare the method to throw the checked IOException. Note that we need to declare f as an IOFoo, not just a Foo, because we call its getIOException method.

Java’s PrintWriter keeps a list of exceptions that have been raised in calls to its methods. That’s because it keeps going after exceptions, assuming the client would rather they be silently ignored. I don’t think that’s such a great idea, for the obvious reason that there’s no telling what’s going to happen when I/O exceptions are just ignored.

It would be straightforward to set a flag that stops any further action once an exception has been raised.
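For instance, here’s a variant of the example (my own sketch, with the stream passed in through the constructor so it’s self-contained) where the flag check is just a test of whether an exception has already been buffered:

```java

/** Like IOFoo above, but stops doing any work once an exception
    has been buffered, rather than silently ignoring it. */
public class StopOnErrorFoo {
    private final mIn;
    private mException;

    public StopOnErrorFoo( in) {
        mIn = in;
    }

    public void bar() {
        if (mException != null)
            return;                 // flag check: stop all further action
        try {
  ;             // throws IOException
        } catch ( e) {
            mException = e;         // buffer the first exception
        }
    }

    /** Returns the first buffered exception, or null if none. */
    public getIOException() {
        return mException;
    }
}
```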

The case I was actually using this for was writing a static method to serialize a tokenized language model. Visiting the n-grams requires an IntArrayHandler whose handle method is not declared to throw any checked exceptions. I wanted any embedded exception to be buffered and thrown by the top-level static method.

“Sentiment Takes a LONG Time”, Said the Logistic Regression Classifier

August 12, 2011

Hello again, fellow LingPipers!  It is I, yet again, Sam The Intern.  In my last (and first) post I promised the masses that I would return with a study similar to the last one that I conducted, but this time doing the non-trivial case of sentiment.  It turns out that this aspiration ended up being nothing more than a (Ling)Pipe-dream.  As my titular character, the logistic regression classifier so concisely said before, sentiment takes a long time!  And, with this being my last week at LingPipe, I will not have enough time to give our logistic regression classifiers the attention and whispered sweet nothings that they require to function properly.

As you may remember from my first post, I absolutely love to annotate.  However, annotating for sentiment takes much more time than language identification, because the annotator needs to internalize and judge the text, not just glance at it.  Add to this the fact that the sentiment classifier requires much more training data than the language identification classifier, and you have yourself an intern that has been product-conditioned to Coca-Cola, but no time to do anything about it (reading 1000 tweets about how much people love coke will do that to you).

I was able to start the study, with small amounts of success.  The following confusion matrix is the result of  a 10-fold cross-validation run across all of the data that I had time to annotate (about 1000 tweets each for “Clinton”, “Coke”, and “Gaga”).  The top of the chart is the reference data which I annotated, and the side is the response data that the classifier provided.  The categories are: w = neutral, e = positive, q = negative.

        w     e     q
w    1244   239    50
e     313   312    23
q     199    45    23
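From a matrix laid out this way (rows = response, columns = reference), per-category precision and recall fall right out: precision divides the diagonal entry by its row sum, recall by its column sum. A quick sketch (names mine, not LingPipe’s evaluation API):

```java
public class ConfusionStats {
    /** Precision for a category: diagonal count over the row (response) sum,
        for a matrix with rows = response and columns = reference. */
    static double precision(int[][] m, int cat) {
        int rowSum = 0;
        for (int v : m[cat]) rowSum += v;
        return (double) m[cat][cat] / rowSum;
    }

    /** Recall for a category: diagonal count over the column (reference) sum. */
    static double recall(int[][] m, int cat) {
        int colSum = 0;
        for (int[] row : m) colSum += row[cat];
        return (double) m[cat][cat] / colSum;
    }
}
```

On the matrix above, the neutral (w) category gets precision 1244/1533 and recall 1244/1756.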

I wish that I had more time to spend on the sentiment classification problem.  Perhaps I will be back next summer, to continue the effort to teach computers good from bad.  Until then, I will be at Rochester, college-ing!  And drinking massive quantities of Coke…

LingPipe and JDK 7

July 28, 2011

[Update: 29 July 2011: Thanks to Dean Jones for pointing out that our friends at Lucid Imagination have found a deal-breaking bug in the initial JDK 7’s optimization of loops.

In any case, we won’t be using any of the JDK 7 features in LingPipe until JDK 6 is retired.]

Oracle (seems odd to say that) just released version 7 of the Java developers kit, aka JDK 7.

What’s in JDK 7?

I haven’t been following what was going into JDK 7, but there’s a comprehensive list of changes available.

This includes Project Coin, which involved changes to the language itself. These are summarized nicely with examples elsewhere.

There were also changes to add module structure, some concurrency changes, some new I/O (nio) changes, support for Unicode 6 in internationalization, support for the latest XML, a new version of JDBC, a bunch of changes to Swing, and some changes to things like cryptography that I didn’t even know were in Java in the first place.

LingPipe Compatibility

The good news is that LingPipe seems to work just fine with Java 7. At least all of our unit tests pass.

There are a couple of new compiler warnings. Variable-length argument lists (i.e., varargs) and generics don’t play nicely together. This should be obvious from the fact that varargs use an array and you can’t have generic arrays. So now I get very menacing warnings like

warning: [unchecked] Possible heap pollution from parameterized vararg type E

I’m a bit surprised they didn’t add this warning earlier, as it seems like a well understood and commonly made mistake. There’s a nice discussion of the underlying problem, due to type erasure and arrays, with great examples, in Angelika Langer’s Java generics FAQ.

From Langer’s discussion, I realized everywhere I got these warnings was actually safe, as the arrays in question didn’t persist beyond the scope of the function body, where the user couldn’t get at them.
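Here’s an example of the pattern I mean: the varargs array stays local to the method body, so despite the heap-pollution warning nothing can go wrong, and as of JDK 7 you can say so with the new @SafeVarargs annotation (which is restricted to constructors and final or static methods):

```java
import java.util.ArrayList;
import java.util.List;

public class VarargsSafety {
    /** Safe despite the generic-varargs warning: the implicitly created
        array never escapes this method body, so no caller can store a
        wrongly typed element into it. */
    @SafeVarargs
    static <E> List<E> listOf(E... elts) {
        List<E> result = new ArrayList<>();
        for (E e : elts)
            result.add(e);
        return result;
    }
}
```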

“Language ID”, Said the NGram Process Dynamic Language Model Classifier…

July 25, 2011

Hi, there! My name is Sam Brown, and I’m a summer intern here at LingPipe. I’m from Eastchester, NY, and I’m a rising sophomore in the University of Rochester Biomedical Engineering Department.

I’ve been interning at LingPipe’s luxuriously air-conditioned headquarters for about a month and a half now. After barely learning Java and some of the LingPipe API in the span of about a week, my slavemaster…rather…boss Breck thought it would be a great idea if I could publish a three-part Java-based computational linguistics experiment exploring the relative performances of generic and customized classifiers for language identification.

And then he told me this is the trivial case.

Since I hardly understood the objective myself, even after Breck kindly spelled it out for me about 50 times, let me try and explain the experiment. Language model classifiers can be used to sort test data (like a bunch of tweets retrieved from Twitter using a query like “Gaga”) into two categories (in my case English, and Non-English). These classifiers, of course, need to be trained on annotated data (tweets collected from Twitter that we sort for the computer) before they can be useful. This is where my experiment’s problem comes in: is it better to make a custom classifier that is trained on data similar to the test data (a set of tweets different from the test data, but also retrieved using the query “Gaga”), or to train a generic classifier on data that comes from a variety of sources (a collection of tweets from the queries “Coke”, “Clinton”, etc.)? What we found is that, based on the input and test data, either classifier may do better at language identification depending on the circumstances.


Before we could start the experiment, we first needed to collect our data from Twitter. For each of our five queries, 1500 tweets were retrieved from Twitter, and duplicates were removed by rejecting any tweet within a Jaccard distance of .5 of a tweet already in the collection. The effect of this is that any tweet with more than 50% token (word) overlap with any other tweet in the accepted data collection was rejected from the corpus (our collection of tweets to be used in the experiment).
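A sketch of that dedup check (my own code, with a crude whitespace tokenizer standing in for a real one):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardDedup {
    /** Jaccard distance between the token sets of two tweets:
        1 - |intersection| / |union|. */
    static double jaccardDistance(String tweet1, String tweet2) {
        Set<String> a = tokens(tweet1);
        Set<String> b = tokens(tweet2);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return 1.0 - (double) intersection.size() / union.size();
    }

    /** Crude whitespace tokenizer, just for the sketch. */
    static Set<String> tokens(String tweet) {
        return new HashSet<>(Arrays.asList(tweet.toLowerCase().split("\\s+")));
    }

    /** Rejects a candidate within distance .5 (i.e., at least 50% token
        overlap) of any already-accepted tweet. */
    static boolean isDuplicate(Iterable<String> accepted, String candidate) {
        for (String tweet : accepted)
            if (jaccardDistance(tweet, candidate) <= 0.5)
                return true;
        return false;
    }
}
```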

After that came my absolute favorite part: annotation.  I annotated 500 tweets for being written in English (‘e’) or non-English (‘x’), for 5 queries (for a total of 2,500 tweets).  These annotated tweets comprised our first epoch of data.  As if that wasn’t enough fun, we reran the 5 queries again two weeks later, and annotated these as well to form the second epoch of data (for a cumulative total of 5000 tweets).  We checked if there were duplicates between the two Twitter retrievals by using the Jaccard distance between any two given tweets to decide if they were identical.

To form a baseline to compare our final results to, I needed to find an interannotator agreement rate. My lovely girlfriend very nicely (and possibly naively) agreed to annotate some of the data.  She annotated 400 tweets of each query in the second epoch of data, all of which overlapped with data that I annotated as well. A program was then used to find the precision and recall between my annotations and hers. The agreement rate was 95% precision and 97% recall with confidence .01% with one annotator serving as truth and the other the response (thereby proving that she knows English 5% ± .01% better than me). For evaluation purposes the annotations were adjudicated resulting in complete agreement.


Part 1: Evaluation of Customized Language Classifier

We created a custom language model classifier by training it on data retrieved from Twitter using one of our queries, like “Gaga”. To do this, we first loaded all of the annotated training tweets for that query into the training section of a corpus. We then loaded the annotated testing tweets into the test section of the corpus. This created a corpus of 1000 tweets, 500 each for training and testing. The testing and training tweets were both retrieved from Twitter using the same query at different times.

As the title character in today’s adventure, our friendly neighborhood nGram boundary language model classifier was used to train and test on the corpus. Because nGram boundary classifiers are sensitive to nGram size (the size of the chunk of characters in a tweet that the classifier being trained actually sees), it trained and tested on the data with nGram sizes 1 through 15.

After each test, the classifier was given to a joint classifier evaluator. This evaluator was used to record the one-versus-all data for each category (precision, recall, and f-measure), as well as calculate the micro-averaged f-measure. The micro-averaged f-measure was used for quantitative comparison between classifiers.

This procedure was repeated for each of the five Twitter queries.

Part 2: Evaluation of Generic Language Classifier

We created a generic language model classifier by training it on data retrieved from Twitter using the same queries as the customized experiment. The difference between the custom and generic classifiers is that, in the generic classifier, the testing and training tweets were retrieved using different queries. Because each set of data included 500 tweets, the total amount of data that we had was 5,000 tweets. For any given query, the training data consisted of all the other data minus the 500 tweets of data to be tested on and 500 tweets of data that was retrieved at an earlier epoch with the same query. All of this data, which contained a total of 4000 tweets for any given test data, was put into a list. Five-hundred tweets were then randomly selected from this list and entered into the training section of the corpus. We then loaded the annotated testing tweets into the testing section of the corpus, and trained and tested the corpus using the same nGram looping structure as we did in the customized training process.

After each test, we evaluated the classifier in the same way that we did for the custom classifier. This procedure was repeated for each of the five Twitter queries.

Part 3: Evaluation of the Effect of Corpus Size On a Generic Classifier

A generic classifier was trained on data from a variety of different sources, as described above. However, the size of the corpus was increased incrementally after each training and testing experiment. This was done by nesting the nGram looping structure inside of the corpus size looping structure. The evaluator returned the best micro-averaged f-measure after each corpus size increase, and this was graphed against corpus size.


Part 1

For each of the five queries that we tested, the custom classifier had an f-measure that ranged from 80% to 95%. For four out of the five queries, the English classifications performed better than the non-English. However, on the query “Mitsubishi”, the non-English classifications performed better. Out of the four queries in which English performed better, two queries (“Clinton” (Fig. 1) and “Coke” (Fig. 2)) had significantly higher f-measures for the English than the non-English.

Figure 1 - Clinton Generic

Figure 2 - Coke Generic


Figure 3 - Gaga Generic


Figure 4 - Mitsubishi Generic

Figure 5 - Wiener Generic

Part 2

For each of the five queries, the generic classifier had an f-measure that ranged from 70% to 95%.

Part 2.5

We put the micro-averaged f-measures for the generic and custom classifiers for each query on the same graph of f-measure versus nGram size. For two of the queries (“Clinton” (Fig. 6) and “Coke” (Fig. 7)), the custom and generic classifiers performed about equally at an nGram size of 4. For two of the remaining queries (“Gaga” (Fig. 8) and “Wiener” (Fig. 10)), the custom classifier outperformed the generic. For the final query (“Mitsubishi” (Fig. 9)), the generic classifier outperformed the custom.

Figure 6 - Clinton Generic vs. Custom

Figure 7 - Coke Generic vs. Custom

Figure 8 - Gaga Generic vs. Custom

Figure 9 - Mitsubishi Generic vs. Custom

Figure 10 - Wiener Generic vs. Custom

The experiment was then re-run 50 times at nGram size 4. The sample means of these experiments’ f-measures were graphed with 95% confidence error bars (Figures 11 and 12).

Figure 11 - Non-English Classifier Comparison

Figure 12 - English Classifier Comparison
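For what it’s worth, a 95% error bar over 50 runs can be computed with a standard normal approximation. This is a minimal generic sketch (ConfInterval and meanCI95 are hypothetical names), not the code behind the figures.

    public class ConfInterval {

        // Returns {mean, halfWidth} for an approximate 95% confidence
        // interval using the normal approximation (z = 1.96).
        public static double[] meanCI95(double[] xs) {
            double sum = 0.0;
            for (double x : xs) sum += x;
            double mean = sum / xs.length;
            double ss = 0.0;
            for (double x : xs) ss += (x - mean) * (x - mean);
            double se = Math.sqrt(ss / (xs.length - 1)) / Math.sqrt(xs.length);
            return new double[] { mean, 1.96 * se };
        }

        public static void main(String[] args) {
            double[] runs = { 0.81, 0.79, 0.80, 0.82 };
            double[] mc = meanCI95(runs);
            System.out.println("mean=" + mc[0] + " +/- " + mc[1]);
        }
    }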

Part 3

When we graphed the generic classifiers’ performance against the size of the training data, we found that the English classifications’ f-measure increased slightly with more training data, while the non-English classifications’ f-measure increased dramatically (Fig. 13).

Figure 13 - Clinton Generic Performance Graph (on a logarithmic scale)

We graphed the size of the English portion of the corpus against the English f-measure (and did the same for non-English), and found that English classifications performed better at low category-corpus sizes than non-English ones did (Figs. 14 & 15).

Figure 14 - Coke Category Performance

Figure 15 - Mitsubishi Category Performance


Parts 1+2

We concluded that neither the generic nor the customized training regimen is necessarily better, since each performed better in different circumstances. We believe that the non-English classifications for “Mitsubishi” scored higher than the English classifications because the non-English category was larger than the English category, and also because it was very coherent (mostly Asian scripts, so that each category contained either Roman or Asian characters). We believe that “Clinton” and “Coke” had much higher English f-measures than the others because these two queries produce the most English tweets.

Part 3

We believe that the English classifications performed better than the non-English classifications, especially at low corpus sizes, because the makeup of the English training data is more coherent. Because many different, seemingly unrelated attributes define the broad non-English category, it is difficult for the classifier to identify what makes a tweet non-English at low corpus sizes. This also explains why the classifier is better at identifying English tweets, even at equal category sizes.

The data balance for each test is as follows (for the generic classifiers, we show the balance of the 4,000-tweet list):

Custom Data Distribution

Generic Data Distribution

And that’s my first experiment in computational linguistics! Anyone who would like to throw money at me for this monumental achievement may do so liberally. Otherwise, look forward to my next experiment, the not-so-trivial case, Sentiment!! (perhaps THAT will convince you to throw money at me!) Until then, I will be annotating. A lot. My favorite!

Sorting Primitive Types in Java

July 14, 2011

One of the big pains of Java is the lack of built-in collection-like utilities for primitive types like int. As a result, you need to box all the integers, which leads to huge bloat in terms of size, and because of memory locality and dereferencing issues, a huge slowdown to boot.

Oddly, more than one person at ICML told me they were impressed with IBM’s tuning on Watson (their champion-level Jeopardy!-playing system Breck blogged about recently) because they reimplemented search and sorting!

C’mon people, this isn’t that hard. In fact, as I was telling our intern today, it’s basic CS 101. At least it was back in the 1980s when I was an undergrad. I have no idea what the kids are up to these days.

Primitive int Sorts for LingPipe Suffix Arrays

As I was implementing suffix arrays for LingPipe 4.1, I realized I’d need to implement sorting over primitive int types for both efficiency and scalability. So I designed everything to let me plug in a better sort implementation when I got some time. Well, I got some time today, so here goes.

It’s also not that rare a problem. There are open-source primitive collection libs available from GNU Trove (GPL, of course). And while not geared toward sorting, there’s the Apache Primitives lib which gives you collection-like behavior with primitives (also something LingPipe could benefit from in several places).

Gift Horses and Strings

Ah, the GPL. It giveth and it taketh away.

In general, one shouldn’t look a gift horse in the mouth. But what if there are strings attached? (Sorry, but I just love mixed metaphors, even if they are rather fixed idioms; I laugh out loud at the New Yorker “block that metaphor” quotes from news sources.)

We can hardly drop GPL-ed code into LingPipe (in fact, as of many years ago, at the behest of our customers, we removed all code dependencies from LingPipe, except for Java itself). So that means it’s time to saddle up (hard to turn off that metaphor machine) and write some really low-level code.

I actually find this kind of thing more fun than a slap in the face with a wet fish. (Last one — I just had to drop in a bit of the old litotes).

Programming Pearls is Your Friend

The best book I’ve ever read on designing algorithms for invariants and then profiling and tuning is:

  • Bentley, Jon. 2000. Programming Pearls, 2nd Edition. Addison-Wesley.

It’s so good that Josh Bloch used it as the basis for the Java collections, as he outlines in an awesome blog post on a related topic, drawing on Bentley’s search algorithms and his own experience as a CS student at CMU, where Bentley was teaching.

I just went for Bentley’s qsort4, which is dead easy. It took only half the afternoon to write and unit test the whole thing with random tests, leaving plenty of time for this blog entry (enough rope to hang myself, one might say).

Show, Don’t Tell

How easy is it? Easy peasy isn’t the half of it. Here’s the code for com.aliasi.util.Sort, minus the package declaration, imports, and doc.

    public static void isort(int[] x, CompareInt compare) {
        for (int i = 0; i < x.length; ++i) {
            int t = x[i];
            int j = i;
            for ( ; j > 0 && compare.lessThan(t,x[j-1]); --j)
                x[j] = x[j-1];
            x[j] = t;
        }
    }

    public static void qsort(int[] x, CompareInt compare) {
        Random random = new Random();
        qsortPartial(x,0,x.length-1,compare,random);
        isort(x,compare);
    }

    static void qsortPartial(int[] x, int lower, int upper,
                             CompareInt compare,
                             Random random) {
        if (upper - lower < MIN_QSORT_SIZE)
            return;
        swap(x, lower, lower + random.nextInt(upper-lower+1));
        int t = x[lower];
        int i = lower;
        int j = upper + 1;
        while (true) {
            do {
                ++i;
            } while (i <= upper && compare.lessThan(x[i],t));
            do {
                --j;
            } while (compare.lessThan(t,x[j]));
            if (i > j)
                break;
            swap(x,i,j);
        }
        swap(x,lower,j);
        qsortPartial(x,lower,j-1,compare,random);
        qsortPartial(x,j+1,upper,compare,random);
    }

    public static void swap(int[] xs, int i, int j) {
        int temp = xs[i];
        xs[i] = xs[j];
        xs[j] = temp;
    }

    public interface CompareInt {
        public boolean lessThan(int a, int b);
    }

    static int MIN_QSORT_SIZE = 7;

Thanks, Jon!

I don’t think this needs much explanation. There’s a comparator-like interface CompareInt to implement comparisons. Bentley’s quicksort uses Robert Sedgewick‘s speedups for the worst case via randomization of the pivot. It also uses insertion sort, also shown, for small subarrays, where it’s faster thanks to memory locality. (The quicksort leaves subarrays smaller than MIN_QSORT_SIZE unsorted, and a final insertion-sort pass finishes the job.) This is more or less a symbol-by-symbol translation of Bentley’s code, with a general plug-in comparator.

Look for it wherever high quality software is sold (or given away).
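To show the plug-in comparator in action, here’s a self-contained sketch that sorts in descending order just by flipping the comparison. SortExample and REVERSE_INT_COMPARE are hypothetical names, and the insertion sort is repeated locally so the example compiles on its own.

    public class SortExample {

        interface CompareInt {
            boolean lessThan(int a, int b);
        }

        // reversed comparison yields a descending sort
        static final CompareInt REVERSE_INT_COMPARE = (a, b) -> a > b;

        // same insertion sort as above, parameterized by the comparator
        static void isort(int[] x, CompareInt compare) {
            for (int i = 0; i < x.length; ++i) {
                int t = x[i];
                int j = i;
                for ( ; j > 0 && compare.lessThan(t, x[j-1]); --j)
                    x[j] = x[j-1];
                x[j] = t;
            }
        }

        public static void main(String[] args) {
            int[] xs = { 3, 1, 4, 1, 5 };
            isort(xs, REVERSE_INT_COMPARE);
            System.out.println(java.util.Arrays.toString(xs));  // [5, 4, 3, 1, 1]
        }
    }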

Randomized Unit Tests

I’ve also started moving to more large-scale randomized unit tests. Here’s what I did for testing sorts.

public void testSort() {
    Random random = new Random();
    for (int i = 0; i < 129; ++i) {
        int[] xs = new int[i];
        int[] ys = new int[i];
        int[] zs = new int[i];
        for (int numTests = 1; numTests < 10; ++numTests) {
            randomFill(xs, random);
            copy(ys, xs);
            copy(zs, xs);
            Arrays.sort(xs);  // reference result from java.util
            Sort.qsort(ys, Sort.NATURAL_INT_COMPARE);
            Sort.isort(zs, Sort.NATURAL_INT_COMPARE);
            assertArrayEquals(xs, ys);
            assertArrayEquals(xs, zs);
        }
    }
}

static void randomFill(int[] xs, Random random) {
    for (int i = 0; i < xs.length; ++i)
        xs[i] = random.nextInt();
}

static void copy(int[] xs, int[] ys) {
    for (int i = 0; i < xs.length; ++i)
        xs[i] = ys[i];
}

The built-in Sort.NATURAL_INT_COMPARE just implements Sort.CompareInt with the natural sort.

 public static final CompareInt NATURAL_INT_COMPARE
        = new CompareInt() {
                public boolean lessThan(int a, int b) {
                    return a < b;
                }
            };

That’s my afternoon in a nutshell (sorry, just one more (idiom) for the road).

TREC 2011 Crowdsourcing Track Data Released

June 16, 2011

Here’s the home page:

Clean and Small

The data’s super clean. I love it when task organizers take the time to release clean and easy to use data.

I was a bit surprised that there was so little of it, but I suppose this is just the dev data. There are around 10K judgments from around 200 Mechanical Turkers for a single TREC topic (think query) with relevance labels. There appears to be the usual Turker quantity of noise annotations and a hugely skewed distribution of the number of judgments per Turker (with the most active Turkers looking more spammy, again as usual).

Mixed in Gold-Standard Data

One nice twist to the data is that there are also NIST gold-standard judgments for a subset of the data. In fact, the HITs (the tasks assigned to the Turkers) each contained 5 relevance judgments, 1 of which was annotated previously by NIST.

As many of you may know, this is one of the strategies that CrowdFlower advocates for its customers.

What We Plan to Do

I don’t know how many submissions each team gets, but I’ll at least run the full Bayesian hierarchical model for a single topic (that is, the model Massimo and I described in our LREC tutorial on data annotation models). That’s a fully automatic system. I don’t know if we can remove the spam annotators by hand, but I might run that one, too. And probably something simpler with point estimates and simple voting, but that’d be four submissions. Hopefully they’ll evaluate voting themselves.

Probabilistic Evaluation

I have posted to their mailing list, but had no luck getting anyone interested in log loss-based evaluations. Recall that in log loss, your score is the negation of the sum of the log probabilities your system assigned to the correct answers (it’s thus the negated total log probability, which is a loss function; I really wish optimization had settled on maximizing gain rather than minimizing loss).

Log loss is the metric that’s being minimized by my (and any other statistical) model over the training data (that’s the M in maximum a posteriori estimation, only we’re dealing with loss, not gain). This evaluation gets at how well calibrated a system’s probabilistic predictions are.
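Concretely, the metric just sums the negated log probabilities assigned to the correct answers. A minimal sketch (LogLoss is a hypothetical name, not the official evaluation code):

    public class LogLoss {

        // probCorrect[i] is the probability the system assigned to the
        // correct answer for item i; the loss is the sum of -log p.
        public static double logLoss(double[] probCorrect) {
            double loss = 0.0;
            for (double p : probCorrect)
                loss -= Math.log(p);
            return loss;
        }

        public static void main(String[] args) {
            // perfect predictions incur zero loss
            System.out.println(logLoss(new double[] { 1.0, 1.0, 1.0 }));  // 0.0
        }
    }

Lower is better, and confidently wrong predictions (p near 0 on the correct answer) are punished severely, which is exactly what makes it a calibration test.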

I’ll at least do that for our submissions once we get the “official gold standard” evaluation labels.

Hopefully, they’ll do the usual TREC bangup job of presenting things like precision-recall curves.

Pro Bono Projects and Internships with LingPipe

April 4, 2011

We have initiated a pro bono program that teams a LingPipe employee with a programmer wanting to learn about LingPipe to serve a non-profit or researcher needing help with a text analytics/NLP/computational linguistics project.

So far it is going pretty well, with one classification project underway. I worked with an intern, a solid NLP programmer, to help a Georgia Tech researcher get a classification system up and running. Now the intern is running the project with some feedback from me, though not much is required at this point.

We are taking applications from interns wanting to learn how to code with LingPipe. We will also take applications for projects; non-profit or academic research only, please. Just send an email outlining your skills/needs and we will see what we can do.