Archive for the ‘Carp’s Blog’ Category

Natural Language Generation for Spam

March 31, 2012

In a recent comment on an earlier post on licensing, we got this spam comment. I know it’s spam because of the links and the URL.

It makes faculty adage what humans can do with it. We’ve approved to beacon bright of that with LingPipe’s authorization — we artlessly can’t allow the attorneys to adapt our own arbitrary royalty-free license! It was advised to accept some AGPL-like restrictions (though we’d never heard of AGPL). At atomic with the (A)GPL, there are FAQs that I can about understand.

ELIZA, all over Again

What’s cool is how they used ELIZA-like technologies to read a bit of the post and insert it into some boilerplate-type generation. There are so many crazy and disfluent legitimate comments that with a little more work, this would be hard to filter out automatically. Certainly the WordPress spam filter, Akismet, didn’t catch it, despite the embedded links.

Black Hat NLP is Going to Get Worse

It would be really easy to improve on this technology with a little topic modeling and better word spotting (though they seem to do an OK job of that) and better language modeling for generation. Plus better filtering a la modern machine translation systems.

The real nasty applications of such light processing and random regeneration will be in auto-generating reviews and even full social media, etc. It’ll sure complicate sentiment analysis at scale. You can just create blogs full of this stuff, link them all up like a good SEO practitioner, and off you go.

Cross Validation vs. Inter-Annotator Agreement

March 12, 2012

Time, Negation, and Clinical Events

Mitzi’s been annotating clinical notes for time expressions, negations, and a couple other classes of clinically relevant phrases like diagnoses and treatments (I just can’t remember exactly which!). This is part of the project she’s working on with Noemie Elhadad, a professor in the Department of Biomedical Informatics at Columbia.

LingPipe Chunk Annotation GUI

Mitzi’s doing the phrase annotation with a LingPipe tool which can be found in

She even brought it up to date with the current release of LingPipe and generalized the layout for documents with subsections.

Our annotation tool follows the tag-a-little, train-a-little paradigm, in which an automatic system based on the already-annotated data is trained as you go to pre-annotate the data for a user to correct. This approach was pioneered in MITRE’s Alembic Workbench, which was used to create the original MUC-6 named-entity corpus.

The chunker underlying LingPipe’s annotation toolkit is based on LingPipe’s character language-model rescoring chunker, which can be trained online (that is, as the data streams in) and has quite reasonable out-of-the-box performance. It’s LingPipe’s best out-of-the-box chunker. In contrast, CRFs can be engineered to outperform the rescoring chunker with good feature engineering.

A very nice project would be to build a semi-supervised version of the rescoring chunker. The underlying difficulty is that our LM-based and HMM-based models take count-based sufficient statistics.

It Works!

Mitzi’s getting reasonable system accuracy under cross validation, with over 80% precision and recall (and hence over 80% balanced F-measure).

That’s not Cricket!

According to received wisdom in natural language processing, she’s left out a very important step of the standard operating procedure. She’s supposed to get another annotator to independently label the data and then measure inter-annotator agreement.

So What?

If we can train a system to performa at 80%+ F-measure under cross-validation, who cares if we can’t get another human to match Mitzi’s annotation?

We have something better — we can train a system to match Mitzi’s annotation!

In fact, training such a system is really all that we often care about. It’s much better to be able to train a system than another human to do the annotation.

The other thing we might want a corpus for is to evaluate a range of systems. There, if the systems are highly comparable, the fringes of the corpus matter. But perhaps the small, but still p < 0.05, differences in such systems don't matter so much. What the MT people have found is that even a measure that's roughly correlated with performance can be used to guide system development.

Error Analysis and Fixing Inconsistencies

Mitzi’s been doing the sensible thing of actually looking at the errors the system’s making under cross validation. In some of these cases, she’d clearly made a braino and annotated the data wrong. So she fixes it. And system performance goes up.

What Mitzi’s reporting is what I’ve always found in these tasks. For instance, she inconsistently annotated time plus date sequences, sometimes including the times and sometimes not. So she’s going back to correct to do it all consistently to include all of the time information in a phrase (makes sense to me).

After a couple of days of annotation, you get a much stronger feeling for how the annotations should have gone all along. The annotations drifted so much over time in this fashion in the clinical notes annotated for the i2b2 Obesity Challenge that the winning team exploited time of labeling as an informative feature to predict co-morbidities of obesity!

That’s also not Cricket!

The danger with re-annotating is that the system’s response will bias the human annotations. System-label bias is also a danger with single annotation under the tag-a-little, learn-a-little setup. If you gradually change the annotation to match the system’s responses, you’ll eventually get to very good, if not perfect, performance under cross validation.

So some judgment is required in massaging the annotations into a coherent system, but one that you care about, not one driven by the learned system’s behavior.

On the other hand, you do want to choose features and chunkings the system can learn. So if you find you’re trying to make distinctions that are impossible for the system to learn, then change the coding standard to make it more learnable, that seems OK to me.

Go Forth and Multiply

Mitzi has only spent a few days annotating the data and the system’s already working well end to end. This is just the kind of use case that Breck and I had in mind when we built LingPipe in the first place. It’s so much fun seeing other people use your tools

When Breck and Linnea and I were annotating named entities with the citationEntities tool, we could crank along at 5K tokens/hour without cracking a sweat. Two eight-hour days will net you 80K tokens of annotated data and a much deeper insight into the problem. In less than a person-week of effort, you’ll have a corpus the size of the MUC 6 entity corpus.

Of course, it’d be nice to roll in some active learning here. But that’s another story. As is measuring whether it’s better to have a bigger or a better corpus. This is the label-another-instance vs. label-a-fresh-instance decision problem that (Sheng et al. 2008) addressed directly.

Settles (2011): Closing the Loop: Fast, Interactive Semi-Supervised Annotation with Queries on Features and Instances

February 23, 2012

Whew, that was a long title. Luckily, the paper’s worth it:

Settles, Burr. 2011. Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances. EMNLP.

It’s a paper that shows you how to use active learning to build reasonably high-performance classifier with only minutes of user effort. Very cool and right up our alley here at LingPipe.

The Big Picture

The easiest way to see what’s going on is with a screenshot of DUALIST, the system on which the paper is based:

It’s basically a tag-a-little, learn-a-little annotation tool for classifiers. I wrote something along these lines for chunk tagging (named entities, etc.) — you can find it in the LingPipe sandbox project citationEntities (called that because I originally used it to zone bibliogrphies in docs, citations in bibliographies and fields in citations). Mitzi just brought it up to date with the current LingPipe and generalized it for some large multi-part document settings.

In DUALIST, users provide two kinds of input:

  1. category classifications for documents
  2. words associated with categories

The left-hand-side of the interface presents a scrolling list of documents, with buttons for categories. There are then columns for categories with words listed under them. Users can highlight words in the lists that they believe are associated with the category. They may also enter new words that don’t appear on the lists.

Settles points out a difficult choice in the design. If you update the underlying model after every user choice, the GUI items are going to rearrange themselves. Microsoft tried this with Word, etc., arranging menus by frequency of use, and I don’t think anyone liked it. Constancy of where something’s located is very important. So what he did was let the user mark up a bunch of choices of categories and words, then hit the big submit button at the top, which would update the model. I did roughly the same thing with our chunk annotation interface.

There’s always a question in this kind of design whether to pre-populate the answers based on the model’s guesses (as far as I can tell, DUALIST does not pre-populate answers). Pre-populating answers makes the user’s life easier in that if the system is halfway decent, there’s less clicking. But it raises the possibility of bias, with users just going with what the system suggests without thinking too hard.

Naive-Bayes Classifier

The underlying classification model is naive Bayes with a Dirichlet prior. Approximate inference is carried out using a maximum a posteriori (MAP) estimate of parameters. It’s pretty straightforward to implement naive Bayes this in a way that’s fast enough to use in this setting. The Dirichlet is conjugate to the multinomial so the posteriors are analytically tractable and the sufficient statistics are just counts of documents in each category and the count of words in documents of each category.

The Innovations

The innovation here is twofold.

The first innovation is that Settles uses EM to create a semi-supervised MAP estimate. As we’ve said before, it’s easy to use EM or some kind of posterior sampling like Gibbs sampling over a directed graphical model with any subset of its parameters or labels being unknown. So technically, this is straightforward. But it’s still a really good idea. Although semi-supervised classifiers are very popular, I’ve never seen it used in this kind of active-learning tagging interface. I should probably add this to our chunking tagger.

The second (and in my opinion more important) innovation is in letting users single out some words as being important words in categories. The way this gets pushed through to the model is by setting the component of the Dirichlet prior corresponding to the word/category pair to a larger value. Settles fits this value using held-out data rather than rolling it into the model itself with a prior. The results seem oddly insensitive to it, which surprised me (but see below).

Comments on the Classifier

Gadzooks! Settles seems to be missing the single biggest tuning parameter typically applied to naive Bayes — the document length normalizer. Perhaps he did this because when you document-length normalize, you no longer have a properly generative model that corresponds to the naive Bayes paradigm. But it makes a huge difference.

The LingPipe EM tutorial uses naive Bayes and the same 20 Newsgroups corpus as (Nigam, McCallum and Mitchell 2000) used for evaluation, and I showed the effect of document length normalization is huge (the Nigam et al. article has been cited nearly 2000 times!). You can do way better than Nigam et al.’s reported results by setting the document length norm to a small value like 5. (What document length norm’s doing is trying to correct for the lack of covariance and overdispersion modeling in naive multinomial document model — it’s the same kind of shenanigans you see in speech recognition in weighting the acoustic and language models and the same trick I just saw Kevin Knight pull out during a talk last week about decoding encrypted documents and doing machine translation.)

I think one of the reasons that the setting of the prior for important words has so little effect (see the performance figures) is that all of the priors are too high. If Settles really is starting with the Laplace prior (aka add 1), then that’s already too big for naive Bayes in this setting. Even the uniform prior (aka add 0) is too big. We’ve found that we need very small (less than 1) prior parameters for word-in-topic models unless there’s a whole lot of data (and the quantity Settles is using hasn’t gotten there by a long shot — you need to get enough so that the low counts dominate the prior before the effect of the prior washes out, so we’re talking gigabytes of text, at least).

Also, this whole approach is not Bayesian. It uses point estimates. For a discussion of what a properly Bayesian version of naive Bayes would look like, check out my previous blog post, Bayesian Naive Bayes, aka Dirichlet Multinomial Classifiers. For a description of what it means to be Bayesian, see my post What is Bayesian Statistical Inference?.

Confusion with Dirichlet-Multinomial Parameterization and Inference

There’s a confusion in the presentation of the Dirichlet prior and consequent estimation. The problem is that the prior parameter for a Dirichlet is conventionally the prior count (amount you add to the usual frequency counts) plus one. That’s why a prior of 1 is uniform (you add nothing to the frequency counts) and why a prior parameter of 2 corresponds to Laplace’s approach (add one to all frequency counts). The parameter is constrained to be positive, so what does a prior of 0.5 mean? It’s sort of like subtracting 1/2 from all the counts (sound familiar from Kneser-Ney LM smoothing?).

Now the maximum a posteriori estimate is just the estimate you get from adding the prior counts (parameter minus one) to all the empirical counts. It doesn’t even exist if the counts are less than 1, which can happen with Dirichlet parameter components that are less than 1. But Settles says he’s looking at the posterior expectation (think conditional expectation of parameters given data — the mean of the posterior distribution). The posterior average always exist (it has to given the bounded support here), but it requires you to add another one to all the counts.

To summarize, the mean of the Dirichlet distribution \mbox{Dir}(\theta|\alpha) is

\bar{\theta} = \alpha/(\sum_k \alpha_k),

whereas the maximum (or mode) is

\theta^{*} = (\alpha - 1) / (\sum_k (\alpha_k - 1)).

where the -1 is read componentwise, so \alpha - 1 = (\alpha_1-1,\ldots,\alpha_K-1). This only exists if all \alpha_k \geq 0.

That’s why a parameter of 1 corresponds to the uniform distribution and why a parameter of 2 (aka Laplace’s “add-one” prior) is not uniform.

Settles says he’s using Laplace and taking the posterior mean (which, by the way, is the “Bayesian point estimate” (oxymoron warning) minimizing expected square loss). But that’s not right. If he truly adds 1 to the empirical frequency counts, then he’s taking the posterior average with a Laplace prior (which is not uniform). This is equivalent to the posterior mode with a prior parameter of 3. But it’s not equivalent to either the posterior mode or mean with a uniform prior (i.e., prior parameter of 1).

Active Learning

Both documents and words are sorted by entropy-based active learning measures which the paper presents very clearly.

Documents are sorted by the conditional entropy of the category given the words in the model.

Word features are sorted by information gain, which is the reduction in entropy from the prevalence category distribution to the expected category distribution entropy conditoned on knowing the feature’s value.

Rather than sorting docs by highest classification uncertainty, as conditional entropy does, we’ve found it useful to sort docs by the lowest classification uncertainty! That is, we ask humans to label the docs about which the classifier is least uncertain. The motivation for this is that we’re often building high-precision (and relatively low recall) classifiers for customers and thus have a relatively high probability threshold to return a guess. So the higher ranked items are closer to the boundary we’re trying to learn. Also, we find in real world corpora that if we go purely by uncertainty, we get a long stream of outliers.

Settles does bring up the issue of whether using what’s effectively a kind of active learning mechanism trained with one classifier will be useful for other classifiers. We need to get someone like John Langford or Tong Zhang in here to prove some useful bounds. Their other work on active learning with weights is very cool.

GUI comments

I love the big submit button. What with Fitts’s law, and all.

I see a big problem with this interface for situations with more than a handful of categories. What would the full 20 Newsgroups look like? There aren’t enough room for more columns or a big stack of buttons.

Also, buttons seems like the wrong choice for selecting categories. These should probably be radio buttons to express the exclusivity and the fact that they don’t take action themselves. Typically, buttons cause some action.

Discriminative Classifiers

Given the concluding comments, Settles doesn’t seem to know that you can do pretty much exactly the same thing in a “discriminative” classifier setting. For instance, logistic regression can be cast as just another directed graphical model with parameters for each word/category pair. So we could do full Bayes with no problem.

There are also plenty of online estimation procedures for weighted examples; you’ll need the weighting to deal with EM (see, e.g., (Karampatziakis and Langford 2010) for an online weighted training method that adjusts neatly for curvature in the objective; it’s coincidentally the paper I’m covering for the next Columbia Machine Learning Reading Group meeting).

The priors can be carried over to this setting, too, only now they’re priors on regression coefficients. See (Genkin, Lewis and Madigan 2007) for guidance. One difference is that you get a mean and a variance to play with.

Building it in LingPipe

LingPipe’s traditional naive Bayes implementation contains all that you need to build a system like DUALIST. Semi-supervised learning with EM is covered in our EM tutorial with naive Bayes as an example.

To solve some of the speed issues Settles brings up in the discussion section, you can always thread the retraining in the background. That’s what I did in the chunk tagger. With a discriminative “online” method, you can just keep cycling through epochs in the background, which gives you the effect of a hot warmup for subsequent examples. Also, you don’t need to run to convergence — the background models are just being used to select instances for labeling.

How to Prevent Overflow and Underflow in Logistic Regression

February 16, 2012

Logistic regression is a perilous undertaking from the floating-point arithmetic perspective.

Logistic Regression Model

The basic model of an binary outcome y_n \in \{ 0, 1\} with predictor or feature (row) vector x_n \in \mathbb{R}^K and coefficient (column) vector \beta \in \mathbb{R}^K is

y_n \sim \mbox{\sf Bernoulli}(\mbox{logit}^{-1}(x_n \beta))

where the logistic sigmoid (i.e., the inverse logit function) is defined by

\mbox{logit}^{-1}(\alpha) = 1 / (1 + \exp(-\alpha))

and where the Bernoulli distribution is defined over support y \in \{0, 1\} so that

\mbox{\sf Bernoulli}(y_n|\theta) = \theta \mbox{ if } y_n = 1, and

\mbox{\sf Bernoulli}(y_n|\theta) = (1 - \theta) \mbox{ if } y_n = 0.

(Lack of) Floating-Point Precision

Double-precision floating-point numbers (i.e., 64-bit IEEE) only support a domain for \exp(\alpha) of roughly \alpha \in (-750,750) before underflowing to 0 or overflowing to positive infinity.

Potential Underflow and Overflow

The linear predictor at the heart of the regression,

x_n \beta = \sum_{k = 0}^K x_{n,k} \beta_k

can be anywhere on the real number line. This isn’t usually a problem for LingPipe’s logistic regression, which always initializes the coefficient vector \beta to zero. It could be a problem if we have even a moderately sized coefficient and then see a very large (or small) predictor. Our probability estimate will overflow to 1 (or underflow to 0), and if the outcome is the opposite, we assign zero probability to the data, which is not good predictively.

Log Sum of Exponents to the Rescue

Luckily, there’s a solution. First, we’re almost always working with log probabilities to prevent underflow in the likelihood function for the whole data set y,x,

\log p(y|\beta;x) = \log \prod_{n = 1}^N p(y_n|\beta;x_n) = \sum_{n=1}^N \log p(y_n|\beta;x_n)

Working on the inner log probability term, we have

\log p(y_n|\beta;x_n)

{ } = \log \mbox{\sf Bernoulli}(y_n|\mbox{logit}^{-1}(x_n \beta))

{ } = \log \ \mbox{logit}^{-1}(x_n \beta) \mbox{ if } y_n = 1
{ } = \log (1 - \mbox{logit}^{-1}(x_n \beta)) \mbox{ if } y_n = 0

Recalling that

1 - \mbox{logit}^{-1}(\alpha) = \mbox{logit}^{-1}(-\alpha),

we further simplify to

{ } = \log \ \mbox{logit}^{-1}(x_n \beta) \mbox{ if } y_n = 1
{ } = \log \ \mbox{logit}^{-1}(-x_n \beta) \mbox{ if } y_n = 0

Now we’re in good shape if we can prevent the log of the inverse logit from overflowing or underflowing. This is manageable. If we let \alpha stand in for the linear predictor (or its negation), we have

{ } = \log \ \mbox{logit}^{-1}(\alpha)

{ } = \log (1 / (1 + \exp(-\alpha)))

{ } = - \log (1 + \exp(-\alpha))

{ } = - \mbox{logSumExp}(0,-\alpha)

Log Sum of Exponentials

Recall that the log sum of exponentials function is

\mbox{logSumExp}(a,b) = \log (\exp(a) + \exp(b))

If you’re not familiar with how it prevents underflow and overflow, check out my previous post:

In the logistic regression case, we have an even greater chance for optimization because the argument a is a constant zero.

Logit-transformed Bernoulli

Putting it all together, we have the logit-transformed Bernoulli distribution,

\mbox{\sf Bernoulli}(y_n|\mbox{logit}^{-1}(x_n\beta))

{ } = - \mbox{logSumExp}(0,-x_n\beta) \mbox{ if } y_n = 1
{ } = - \mbox{logSumExp}(0,x_n\beta) \mbox{ if } y_n = 0

We can just think of this as an alternatively parameterized Bernoulli distribution,

\mbox{\sf BernoulliLogit}(y|\alpha) = \mbox{\sf Bernoulli}(y|\mbox{logit}^{-1}(\alpha))

with which our model can be expressed as

y_n \sim \mbox{\sf BernoulliLogit}(x_n\beta).

Recoding Outcomes {0,1} as {-1,1}

The notation’s even more convenient if we recode the failure outcome as -1 and thus take the outcome y \in \{ -1, 1 \}, where we have

\mbox{\sf BernoulliLogit}(y|\alpha) = - \mbox{logSumExp}(0,-y \alpha)

Bob’s ML Meetup Talk — Stan: A Bayesian Directed Graphical Model Compiler

January 6, 2012

I (Bob) am going to give a talk at the next NYC Machine Learning Meetup, on 19 January 2012 at 7 PM:

There’s an abstract on the meetup site. The short story is that Stan’s a directed graphical model compiler (like BUGS) that uses adaptive Hamiltonian Monte Carlo sampling to estimate posterior distributions for Bayesian models.

The official version 1 release is coming up soon, but until then, you can check out our work in progress at:

  • Google Code: Stan.

How to Close a LinkedIn Account with a “Large Network of Connections”

November 28, 2011

Last week I shut my LinkedIn account down.

My first attempt resulted in a warning web page saying I couldn’t shut it down because I had over 250 contacts. With that hint, I just deleted contacts until I had fewer than 250. Then I could close my account through their web form.

Please Don’t Go!

The first close attempt resulted in an e-mail from LinkedIn to their “customer support” group, cc-ed to me, asking them to close my account because I was unable to, citing the reason:

The member has a large network of connections to close. Please close during non-peak hours.

LinkedIn seems to be suggesting their site is so fragile it can’t be trusted to delete during peak hours.

A week later (as in today), I got the predicted response from “customer support”, namely a plea to stay. Today’s e-mail started with:

I’m sorry it’s taken so long to get back to you.

The apology seems rather disengenuous given the rest of the e-mail, which continued with:

I noticed that you have put a lot of effort into growing your LinkedIn network. Because of this, I wanted to confirm that you want to close an account with such a large number of connections.

Only after this did the e-mail start outlining further steps I’d have to take to close the account I’d already closed last week.

Why did I close my LinkedIn account? On the “con” side, it was a hassle to go through invitations to connect from people I didn’t know or had met once. I felt bad if I said “no” or “yes”. On the “pro” side, I couldn’t come up with anything. It’s not like I’m going to use LinkedIn to find a job.

Roberto takes over ICSI

November 15, 2011

Roberto Pieraccini, who was my boss at SpeechWorks (among other notable accomplishments), is going to be the new director of U. C. Berkeley’s International Computer Science Institute (ICSI).

Now this is a succession event I can (and did) wholeheartedly endorse. They asked me to write a recommendation letter as a former employee (his new colleagues all report to him; he’ll report to the trustees). I can paraphrase what I said: I’d be more than happy to work for Roberto again.

Twitter POS Tagging with LingPipe and ARK Tweet Data

November 4, 2011

The Data

We will train and test on anything that’s easy to parse. Up today is a basic English part-of-speech tagging for Twitter developed by Kevin Gimpel et al. (and when I say “et al.”, there are ten co-authors!) in Noah Smith’s group at Carnegie Mellon.

The relevant resources are:

Their paper describes their tagging scheme as well as their CRF-based tagger. It uses Stanford’s CRF tagger with baseline features as a performance comparison. The code for their tagger’s also in the distribution. I’m not sure what the license is — it’s listed as “other open source” (I didn’t even know Google Code let you do that — I thought it was “free beer” or nothing with them).

Training and Evaluating a LingPipe POS Tagger

Their corpus was very easy to parse (thanks, I really appreciate it). It only took me about an hour or so to download the data, parse it, and evaluate LingPipe’s baseline POS tagger on it. (It helps to be the author of code. The patterns feel awfully comfortable.)

Our performance was 85.4% accuracy on their train/test split using the default parameters for tagging in LingPipe. In contrast, the Stanford CRF tagger with default features was 85.9% accurate, whereas Gimpel et al.’s tagger achieved 89.4% accuracy. As usual, LingPipe’s HMM tagger is competitive with out-of-the-box CRFs and a few percentage points behind tuned, feature-rich CRFs.

Their paper (on page 5) says the annotator agreement is 92.2%. They also break accuracy out per tag, which LingPipe’s output also does; you can see this yourself if you run it.

LingPipe’s Baseline POS Tagger

The baseline POS tagger in LingPipe is a bigram HMM with emissions defined by a bounded character language model. Estimation is with simple additive smoothing (i.e., MAP estimates given symmetric Dirichlet priors) for the initial state and transition probabilities and Witten-Bell smoothing for the character LMs. Our main motivation for doing things this way is that (a) it’s online, letting us train an example at a time, and (b) it’s reasonably fast when it runs. We should be able to decode this tag set at well over 500K tokens/second by turning on caching of character LM results and pruning.

We could also implement their approach using LingPipe’s CRFs. It’s just that it’d take a bit longer than an hour all in.

Run it Yourself

You can get their code from their project home page, linked above.

All of my code’s checked into the LingPipe Sandbox in a project named “twitter-pos”. You can check it out anonymously using Subversion:

svn co

The code’s in a single file, stored under the src subdirectory of the package:

package com.lingpipe.twpos;

import com.aliasi.classify.*;
import com.aliasi.corpus.*;
import com.aliasi.hmm.*;
import com.aliasi.tag.*;
import java.util.*;

public class Eval {

    public static void main(String[] args) throws IOException {
        System.out.println("Reading Corpus");
        TwitterPosCorpus corpus 
            = new TwitterPosCorpus(new File(args[0]));
        System.out.println("Training Tagger");
        HmmCharLmEstimator hmm = new HmmCharLmEstimator();
        HmmDecoder tagger = new HmmDecoder(hmm);

        boolean storeTokens = true;
        TaggerEvaluator evaluator
            = new TaggerEvaluator(tagger,storeTokens);

    static List<Tagging> parse(File f) throws IOException {
        List<Tagging> taggings 
            = new ArrayList<Tagging>();
        FileLineReader reader = new FileLineReader(f,"UTF-8");
        List tokens = new ArrayList();
        List tags = new ArrayList();
        for (String line : reader) {
            String[] tokTag = line.split("\\s+");
            if (tokTag.length != 2) {
                taggings.add(new Tagging(tokens,tags));
                // System.out.println("tokens=" + tokens);
                // System.out.println("tags=" + tags);
                tokens = new ArrayList();
                tags = new ArrayList();
            } else {
        return taggings;

    static class TwitterPosCorpus extends ListCorpus<Tagging> {
        public TwitterPosCorpus(File path) throws IOException {
            for (Tagging t : parse(new File(path,"train")))
            for (Tagging t : parse(new File(path,"dev")))
            for (Tagging t : parse(new File(path,"test")))

LingPipe’s pretty fast for this sort of thing, with the entire program above, including I/O, corpus parsing, training, and testing taking a total of 5 seconds on my now ancient workstation.

Although it wouldn’t be a fair comparison, there’s usually a percent or so to be eked out of a little tuning in this setting (it would’ve been fair had I done tuning on the dev set and evaluated exactly once). This was just a straight out of the box, default settings eval. In general, one shouldn’t trust results that report post-hoc best settings values as they’re almost always going to overestimate real performance for all the usual reasons.

Finally, here’s the confusion matrix for tags in the first-best output:


I should really figure out how to format that a bit more neatly.

“Academic” Licenses, GPL, and “Free” Software

November 3, 2011

[This post repeats a long comment I posted about licensing in response to Brendan O’Connor’s blog entry, End-to-End NLP Packages. Brendan’s post goes over some packages for NLP and singles out LingPipe as being only “quasi free.”]

Restrictive “Academic-Only” Licenses

Some of those other packages, like C&C Tools and Senna, are in the same “quasi free” category as LingPipe in the sense that they’re released under what their authors call “non-commercial” licenses. For instance, none of the Senna, C&C, or LingPipe licenses are compatible with GPL-ed code. Senna goes so far as to prohibit derived works altogether.

The LingPipe License

The intent for the

was a little different from the “academic use only” licenses in that we didn’t single out academia as a special class of users. We do allow free use for research purposes for industrialists and academics alike. We also provide a “developers” license that explicitly gives you this right, which makes some users’ organizations feel better.

Truly Free NLP Software

The other tools, like NLTK, Mallet, OpenNLP, and GATE are released under more flexible licenses (LGPL, Apache or BSD), which I really do think of as being truly “free”. Mahout’s also in this category, though not mentioned by Brendan, whereas packages like TreeTagger are more like Senna or C&C in their restrictive “academic only” licensing.

Stanford and the GPL

Stanford NLP’s license sounds like it was written by someone who didn’t quite understand the GPL. Their page says (the link is also theirs):

The Stanford CoreNLP code is licensed under the full GPL, which allows its use for research purposes, free software projects, software services, etc., but not in distributed proprietary software.

Technically, what they say is true. It would’ve been clearer if they’d replaced “research” with “research and non-research” and “free” with “free and for-profit”. Instead, their choice of examples suggests “free” or “research” have some special status under the GPL, which they don’t. With my linguist hat on, I’d say their text leads the reader to a false implicature. The terms “research” and “academia” don’t even show up in the GPL, and although “free” does, GNU and others clarify this usage elswewhere as “free as in free speech”, not “free as in free beer”.

Understanding the GPL

The key to understanding the GPL lies behind Stanford’s embedded link to

Here, proprietary doesn’t have to do with ownership, but rather with closed source. Basically, if you redistribute source code or an application based on GPL-ed code, you have to also release your code under the GPL, which is why it’s called a “copyleft” or “viral” license. In some cases, you can get away with using a less restrictive license like LGPL or BSD for your mods or interacting libraries, though you can’t change the underlying GPL-ed source’s license.

GPL Applies to Academics, Too

There’s no free ride for academics here — you can’t take GPL-ed code, use it to build a research project for your thesis, then give an executable away for free without also distributing your code with a compatible license. And you can’t restrict the license to something research only. Similarly, you couldn’t roll a GPL-ed library into Senna or C&C or LingPipe and redistribute them under their own licenses. Academics are often violating these terms because they somehow think “research use only” is special.

Services Based on GPL-ed Software and the AGPL

You can also set up a software service, for example on Amazon’s Elastic Compute Cloud (EC2) or on your own servers, that’s entirely driven by GPL-ed software, like say Stanford NLP or Weka, and then charge users for accessing it. Because you’re not redistributing the software itself, you can modify it any way you like and write code around it without releasing your own software. GNU introduced the Affero GPL (AGPL), a license even more restrictive than the GPL that tries to close this server loophole for the basic GPL.

Charging for GPL-ed Code

You can charge for GPL-ed code if you can find someone to pay you. That’s what RedHat’s doing with Linux, what Revolution R’s doing with R, and what Enthought’s doing with Python.

LingPipe’s Business Model is Like MySQL’s

Note that this is not what MySQL did with MySQL (before they sold it to Oracle) nor is it what we do with LingPipe. In both those cases, the company owns all the intellectual property and copyrights and thus is able to release the code under multiple licenses. This strategy’s explained on the

We license LingPipe under custom licenses as well as our royalty-free license. These licenses include all sorts of additional restrictions (like only using some of the modules on so many servers) and additional guarantees (like indemnification and maintenance); don’t ask me about the details — that’s Breck’s bailiwick. Suffice it to say most companies don’t like to get involved with copyleft, be it from GPL or LingPipe’s royalty-free license. So we let them pay us extra and get an unencumbered license so they can do what they want with LingPipe and not have to share their code. We’ve had more than one customer buy commercial license for LingPipe who wouldn’t even tell us what they were going to do with our software.

Free “Academic” Software

Also, keep in mind that as an academic, your university (or lab) probably has a claim to your intellectual property developed using their resources. Here’s some advice from GNU on that front:


Oracle buys Endeca, HP buys Autonomy, Microsoft buys FAST

October 28, 2011

The news that Oracle’s buying Endeca sounds awfully familiar. But this time it cuts a little closer to home, because we’re an Endeca technology partner. Endeca has been a great customer to work with — we’ve been really impressed with their engineers at every turn.

Clean Sweep

I believe this makes it almost a clean sweep of the medium-to-medium-large-sized independent search companies. Maybe Vivisimo will be next. Of course, there are still small companies delivering search via Apache Lucene and SOLR, such as Sematext and Lucid Imagination. I imagine they will be delighted that yet another small competitor was snapped up by a tech giant.

2008: Microsoft buys FAST

By combining the innovation and agility of FAST with the discipline and resources of Microsoft, our customers get the best of both worlds: market-leading products from a trusted technology partner. … Enterprise Search from Microsoft offers best-in-class technologies…

from: Microsoft fact sheet

2011: HP buys Autonomy

Autonomy brings to HP higher value business solutions that will help customers manage the explosion of information. Together with Autonomy, we plan to reinvent how both unstructured and structured data is processed, analyzed, optimized, automated and protected. … this bold action will squarely position HP in software and information to create the next-generation Information Platform, and thereby, create significant value for our shareholders.

from: HP press release

2011: Oracle buys Endeca

Combination [of Oracle and Endeca] provides best-in-class technology and applications for unstructured data management, business intelligence, and web commerce. … The convergence of structured and unstructured information is driving the need for a common data management and analytics platform.

from: Oracle Press Release

Maybe the tough economic climate has made it hard for small-to-medium-sized tech companies to survive without the deep pockets of a successful large tech company. Maybe Oracle made them an offer that was too good to refuse. Far better to sell when you can get a good price than to suffer the fate of Yahoo! (or SpeechWorks, for that matter).

Make us an Offer?

On that note, feel free to make us an offer for LingPipe.