Author Archive

Canceled: Help Build a Watson Clone–Participants Sought for LingPipe Code Camp

January 10, 2014

Code Camp is canceled. We are late on delivering a LingPipe recipes book to our publisher and that will have to be our project for March. But we could have testers/reviewers come and hang out. Less fun I think.




Dates: To be absolutely clear: Dates are 3/2/2014 to 3/31/14 in Driggs Idaho. You can work remotely, we will be doing stuff before as well for setup.

New: We have setup a github repository. URL is


Every year we go out west for a month of coding and skiing. Last year it was Salt Lake City Utah, this year it is Driggs Idaho for access to Grand Targhee and Jackson Hole. The house is rented for the month of March and the project selected. This year we will create a Watson clone.

I have blogged about Watson before and I totally respect what they have done. But the coverage is getting a bit breathless with the latest billion dollar effort. So how about assembling a scrappy bunch of developers and see how close we can come to recreating the Jeopardy beast?

How this works:

  • We have a month of focused development. We tend to code mornings, ski afternoons. House has 3 bedrooms if you want to come stay. Prior arrangements must be made. House is paid for. Nothing else is.
  • The code base will be open source. We are moving LingPipe to the AGPL so maybe that license or we could just apache license it. We want folks to be comfortable contributing.
  • You don’t have to be present to contribute.
  • We have a very fun time last year. We worked very hard on an email app that didn’t quite get launched but the lesson learned was to start with the project defined.

If you are interested in participating let us know at Let us know your background, what you want to do, do you expect to stay with us etc…. No visa situations please and we don’t have any funds to support folks. Obviously we have limited space physically and mentally so we may say no but we will do our best to be inclusive. Step 1–transcribe some Jeopardy shows.

Ask questions in the comments so all can benefit. Check comments before asking questions pls. I’ll answer the first one that is on everyone’s mind:

Q: Are you frigging crazy???

A: Why, yes, yes we are. But we are also really good computational linguists….


Limits of Floating Point and the Asymmetry of 0 and 1

December 19, 2013

If we are representing probabilities, we are interested in numbers between 0 and 1. It turns out these have very different properties in floating-point arithmetic. And it’s not as easy to solve as just working on the log scale.

Smallest Gaps between Floating Point Numbers of Different Magnitudes

The difference between 0 and the next biggest number representable in double-precision floating point (Java or C++ double) is on the order of 1e-300. In constrast, the difference between 1 and the next smallest number representable is around 1e-15. The reason is that to exploit the maximum number of significant digits, the mantissa for the number near 1 has to be scaled roughly like 0.999999999999999 or 1.0000000000001 and the exponent has to be scaled like 0.

CDFs and Complementary CDFs

This is why there are two differently translated error functions in math libs, erf() and erfc(), which are rescaled cumulative and complementary cumulative distribution functions. That is, you can use erf() to calculate the cumulative unit normal distribution function (cdf), written Phi(x); erfc() can be used to calculate the complementary cumulative unit normal distribution function (ccdf), written (1 – Phi(x)).

Log Scale no Panacea

Switching to the log scale doesn’t sove the problem. If you have a number very close to 1, you need to be careful in taking its log. If you write log(1-x), you run into the problem that x can only be so close to 1. That’s why standard math libraries provide a log1p() function defined by log1p(x) = log(1 + x). This gives you back the precision you lose by subtraction two numbers close to each other (what’s called “catastrophic cancellation” in the arithmetic processing literature).

A Little Quiz

I’ll end with a little quiz to get you thinking about floating point a little more closely. Suppose you set the bits in a floating point number randomly, say by flipping a coin for each bit.

Junior Varsity:  What’s the approximate probability that the random floating point number has an absolute value less than 1?

Varsity:  What is the event probability of drawing a number in the range (L,U). Or, equivalently, what’s the density function to which the draws are proportional?

LingPipe Incubator Welcomes Seer

September 9, 2013

We have started an incubator program for start-ups with natural language processing (NLP) needs. Seer is our first egg. The company creates productivity apps based on unstructured data sources–that is where LingPIpe fits in. The guys (Conall and Joe) fit a pattern that we see quite often–smart people with a good idea that presupposes NLP but they don’t have much experience with it.

The GetSeer guys were on the ball right away because they had a bunch of data annotated and had cobbled together a regular expression based classifier that served as an excellent starting point. We had machine learning based classifiers up and running within two hours of starting. That included an evaluation harness. Nice.


The great thing about our working setup is that I get to teach/advise which keeps my time commitment manageable and they get to do most of the heavy lifting which is how they learn to build and maintain NLP systems. I have had to learn some Scala since I co-code most of the time when we work together and that is their implementation language. Scala is a hip extension of Java with a less verbose syntax and stronger type inference.

I’ll keep the blog updated with developments. Current status:

  • There is a small amount of training data.
  • Current results are awful (no surprise).
  • We have been roughing in the solution with Naive Bayes. Next steps will be opening a can of logistic regression for more interesting feature extraction.

Token Lumpers and Splitters: Spider-Man vs. Superman

August 21, 2013

One of the perennial problems in tokenization for search, classification, or clustering of natural language data is which words and phrases are spelled as compounds and which are separate words. For instance consider “dodgeball” (right) vs. “dodge ball” (wrong) and “golfball” (wrong) vs. “golf ball” (right)?

It’s a classic lumpers vs. splitters problem. Alas, the “right” answer can change over time and some words are listed as both lumped and split in some dictionaries.

This point was first brought home to me and Breck when we were browsing Comcast’s query logs for TV search queries, where users seemed to arbitrarily insert or delete spaces in queries. Who can blame them with the inconsistencies of English in this regard?

This token lumping vs. splitting problem is one of the reasons LingPipe’s spelling correction is not token-based like Lucene’s (though Lucene has other ways to handle combining and splitting compounds — more on that from Mitzi in the near future). And it’s one of the reasons (along with morphological affixes) that character n-grams provide a better basis for classification than hard tokenization.

My favorite examples are comic book superhero names. It turns out that the editors at Marvel Comics are splitters, whereas those at DC Comics are lumpers. Hence we have on the Marvel roster,

Ant-Man, Iron Man, and Spider-Man.

They’re not consistent on hyphenation, which is another issue if you don’t throw away punctuation.

DC, on the other hand, employs men, women, and children, but doesn’t have the budget for spaces:

Aquaman, Batman, Catwoman, Hawkman, Martian Manhunter, Superboy, Supergirl, and Superman.

The exception that proves the rule (I’m open-minded enough to use a phrase that’s a pet peeve of mine), DC gives us:

Wonder Woman

VerbCorner: Another Game with Words

July 4, 2013

Games with Words continues to roll out new games, releasing VerbCorner.

VerbCorner contains a series of sub-games with titles like “Philosophical Zombie Hunter” and questions that sound like they’re straight of an SAT reading comprehension test, starting with a long story I won’t repeat and then questions like:

Michelle smashed the gauble into quaps. Which of the following is *definitely* a real human?

  1. Michelle
  2. None of the above
  3. Can’t tell because in this context ‘smash’ has more than one meaning.
  4. Can’t tell because the sentence is ungrammatical.
  5. Can’t tell because I don’t know that verb.

Josh Hartshorne, one of the developers, writes that Verbcorner is “a new project I’m doing in collaboration with Martha Palmer at CU-Boulder, which is aimed at getting human judgments to inform the semantics assigned to verbs in VerbNet. You can find a press release about the project … You’ll see we’ve incorporated a lot of gamification elements in order to make the task more engaging (and also to improve annotation quality).”

This note from the press release harkens back to my philosophical semantics days, “If we do not want to define words in terms of other words, what should they be defined in terms of? This remains an open research question…”. I’ll say! The project has a generative semantics feel; in the press release, Josh is quoted as saying “There are a few dozen core components of meaning and there are tens of thousands of verbs in English.” Now I’m left hanging as to what a “core component of meaning” is!

All Bayesian Models are Generative (in Theory)

May 23, 2013

[This post is a followup to my previous post, Generative vs. discriminative; Bayesian vs. frequentist.]

I had a brief chat with Andrew Gelman about the topic of generative vs. discriminative models. It came up when I was asking him why he didn’t like the frequentist semicolon notation for variables that are not random. He said that in Bayesian analyses, all the variables are considered random. It just turns out we can sidestep having to model the predictors (i.e., covariates or features) in a regression. Andrew explains how this works in section 14.1, subsection “Formal Bayesian justification of conditional modeling” in Bayesian Data Analysis, 2nd Edition (the 3rd Edition is now complete and will be out in the foreseeable future). I’ll repeat the argument here.

Suppose we have predictor matrix X and outcome vector y. In regression modeling, we assume a coefficient vector \beta and explicitly model the outcome likelihood p(y|\beta,X) and coefficient prior p(\beta). But what about X? If we were modeling X, we’d have a full likelihood p(X,y|\beta,\psi), where \psi are the additional parameters involved in modeling X and the prior is now joint, p(\beta,\psi).

So how do we justify conditional modeling as Bayesians? We simply assume that \psi and \beta have independent priors, so that

p(\beta,\psi) = p(\beta) \times p(\psi).

The posterior then neatly factors as

p(\beta,\psi|y,X) = p(\psi|X) \times p(\beta|X,y).

Looking at just the inference for the regression coefficients \beta, we have the familiar expression

p(\beta|y,X) \propto p(\beta) \times p(y|\beta,X).

Therefore, we can think of everything as a joint model under the hood. Regression models involve an independence assumption so we can ignore the inference for \psi. To quote BDA,

The practical advantage of using such a regression model is that it is much easier to specify a realistic conditional distribution of one variable given k others than a joint distribution on all k+1 variables.

We knew that all along.

Generative vs. Discriminative; Bayesian vs. Frequentist

April 12, 2013

[There’s now a followup post, All Bayesian models are generative (in theory).]

I was helping Boyi Xie get ready for his Ph.D. qualifying exams in computer science at Columbia and at one point I wrote the following diagram on the board to lay out the generative/discriminative and Bayesian/frequentist distinctions in what gets modeled.

To keep it in the NLP domain, let’s assume we have a simple categorization problem to predict a category z for an input consisting of a bag of words vector w and parameters β.

Frequentist Bayesian
Discriminative p(z ; w, β) p(z, β ; w) = p(z | β ; w) * p(β)
Generative p(z, w ; β) p(z, w, β) = p(z, w | β) * p(β)

If you’re not familiar with frequentist notation, items to the right of the semicolon (;) are not modeled probabilistically.

Frequentists model the probability of observations given the parameters. This involves a likelihood function.

Bayesians model the joint probability of data and parameters, which, given the chain rule, amounts to a likelihood function times a prior.

Generative models provide a probabilistic model of the predictors, here the words w, and the categories z, whereas discriminative models only provide a probabilistic model of the categories z given the words w.

Mean-Field Variational Inference Made Easy

March 25, 2013

I had the hardest time trying to understand variational inference. All of the presentations I’ve seen (MacKay, Bishop, Wikipedia, Gelman’s draft for the third edition of Bayesian Data Analysis) are deeply tied up with the details of a particular model being fit. I wanted to see the algorithm and get the big picture before being overwhelmed with multivariate exponential family gymnastics.

Bayesian Posterior Inference

In the Bayesian setting (see my earlier post, What is Bayesian Inference?), we have a joint probability model p(y,\theta) for data y and parameters \theta, usually factored as the product of a likelihood and prior term, p(y,\theta) = p(y|\theta) p(\theta). Given some observed data y, Bayesian predictive inference is based on the posterior density p(\theta|y) \propto p(\theta, y) of the the unknown parameter vector \theta given observed data vector y. Thus we need to be able to estimate the posterior density p(\theta|y) to carry out Bayesian inference. Note that the posterior is a whole density function—we’re not just after a point estimate as in maximum likelihood estimation.

Mean-Field Approximation

Variational inference approximates the Bayesian posterior density p(\theta|y) with a (simpler) density g(\theta|\phi) parameterized by some new parameters \phi. The mean-field form of variational inference factors the approximating density g by component of \theta = \theta_1,\ldots,\theta_J, as

g(\theta|\phi) = \prod_{j=1}^J g_j(\theta_j|\phi_j).

I’m going to put off actually defining the terms g_j until we see how they’re used in the variational inference algorithm.

What Variational Inference Does

The variational inference algorithm finds the value \phi^* for the parameters \phi of the approximation which minimizes the Kullback-Leibler divergence of g(\theta|\phi) from p(\theta|y),

\phi^* = \mbox{arg min}_{\phi} \ \mbox{KL}[ g(\theta|\phi) \ || \ p(\theta|y) ].

The key idea here is that variational inference reduces posterior estimation to an optimization problem. Optimization is typically much faster than approaches to posterior estimation such as Markov chain Monte Carlo (MCMC).

The main disadvantage of variational inference is that the posterior is only approximated (though as MacKay points out, just about any approximation is better than a delta function at a point estimate!). In particular, variational methods systematically underestimate posterior variance because of the direction of the KL divergence that is minimized. Expectation propagation (EP) also converts posterior fitting to optimization of KL divergence, but EP uses the opposite direction of KL divergence, which leads to overestimation of posterior variance.

Variational Inference Algorithm

Given the Bayesian model p(y,\theta), observed data y, and functional terms g_j making up the approximation of the posterior p(\theta|y), the variational inference algorithm is:

  • \phi \leftarrow \mbox{random legal initialization}
  • \mbox{repeat}
    • \phi_{\mbox{\footnotesize old}} \leftarrow \phi
    • \mbox{for } j \mbox{ in } 1:J
      • \mbox{{} \ \ set } \phi_j \mbox{ such that } g_j(\theta_j|\phi_j) = \mathbb{E}_{g_{-j}}[\log p(\theta|y)].
  • \mbox{until } ||\phi - \phi_{\mbox{\footnotesize old}}|| < \epsilon

The inner expectation is a function of \theta_j returning a single non-negative value, defined by

\mathbb{E}_{g_{-j}}[\log p(\theta|y)]

\begin{array}{l}  \mbox{ } = \int_{\theta_1} \ldots \int_{\theta_{j-1}} \int_{\theta_{j+1}} \ldots \int_{\theta_J}  \\[8pt] \mbox{ } \hspace*{0.4in}  g(\theta_1|\phi_1) \times \cdots \times g(\theta_{j-1}|\phi_{j-1}) \times  g(\theta_{j+1}|\phi_{j+1}) \times \cdots \times g(\theta_J|\phi_J)  \times  \log p(\theta|y)  \\[8pt] \mbox{ } \hspace*{0.2in}  d\theta_J \cdots d\theta_{j+1} \ d\theta_{j-1} \cdots d\theta_1  \end{array}

Despite the suggestive factorization of g and the coordinate-wise nature of the algorithm, variational inference does not simply approximate the posterior marginals p(\theta_j|y) independently.

Defining the Approximating Densities

The trick is to choose the approximating factors so that the we can compute parameter values \phi_j such that g_j(\theta_j|\phi_j) = \mathbb{E}_{g_{-j}}[\log p(\theta|y)]. Finding such approximating terms g_j(\theta_j|\phi_j) given a posterior p(\theta|y) is an art form unto itself. It’s much easier for models with conjugate priors. Bishop or MacKay’s books and the Wikipedia present calculations for a wide range of exponential-family models.

What if My Model is not Conjugate?

Unfortunately, I almost never work with conjugate priors (and even if I did, I’m not much of a hand at exponential-family algebra). Therefore, the following paper just got bumped to the top of my must understand queue:

It’s great having Dave down the hall on sabbatical this year — one couldn’t ask for a better stand in for Matt Hoffman. They are both insanely good at on-the-fly explanations at the blackboard (I love that we still have real chalk and high quality boards).

Anyone Want to Write an O’Reilly Book on NLP with Java?

February 21, 2013

Mitzi and I pitched O’Reilly books a revision of the Text Processing in Java book that she’s been finishing off.

The response from their editor was that they’d love to have an NLP book based on Java, but what we provided looked like everything-but-the-NLP you’d need for such a book. Insightful, these editors. That’s exactly how the book came about, when the non-proprietary content was stripped out of the LingPipe Book.

I happen to still think that part of the book is incredibly useful. It covers all of unicode, UCI for normalization and detection, all of the streaming I/O interfaces, codings in HTML, XML and JSON, as well as in-depth coverage of reg-exes, Lucene, and Solr. All of the stuff that is continually misunderstood and misconfigured so that I have to spend way too much of my time sorting it out. (Mitzi finished the HTML, XML and JSON chapter, and is working on Solr; she tuned Solr extensively on her last consulting gig, by the way, if anyone’s looking for a Lucene/Solr developer).

O’Reilly isn’t interested in LingPipe because of LingPipe’s non-OSF approved license. I don’t have time to finish the LingPipe book now that I’m at Columbia full time; I figured it’d be 1500 pages when I was done if I kept up at the same level of detail, and even that didn’t seem like enough!

A good model for such a book is Manning Press’s Taming Text, co-authored by Breck’s grad-school classmate Tom Morton. It’s based on Lucene/Mahout and Tom’s baby, OpenNLP. (Manning also passed on Text Processing in Java, which is where Breck sent it first.)

Another model, aimed more at linguists than programmers, is O’Reilly’s own NLTK Book, co-authored by my grad school classmate Steve Bird, and my Ph.D. supervisor, Ewan Klein. Very small world, NLP.

Manning also passed on TPiJ, so Colloquial Media will branch out from genre fiction into tech books. More news when we’re closer to finishing.

If you know why LDA will think this document is about American football and why frame-based parsing will only make topic classification worse, then you’re probably a good candidate to write this book. [Domain knowledge: Manning is the surname of two generations of very well known American football players all of whom play the only position that fill the agent roll in a very popular play known as a “pass”.] If you know why Yarowsky‘s one discourse, one sense rule is violated by the background knowledge and also understand the principled Bayesian strategies for what Yarowsky called “bootstrapping,” even better.

Another Linguistic Corpus Collection Game

November 12, 2012

Johan Bos and his crew at University of Groningen have a new suite of games aimed at linguistic data data collection. You can find them at:

Wordrobe is currently hosting four games. Twins is aimed at part-of-speech tagging, Senses is for word sense annotation, Pointers for coref data, and Names for proper name classification.

One of the neat things about Wordrobe is that they try to elicit some notion of confidence by allowing users to “bet” on their answers.

They also discuss prizes, but I didn’t see any mention of what the prizes were.

The project is aimed at imrpoving the Groningen Meaning Bank. I hope they release the raw user data as well as their best guess at a gold standard. I had some background discussion with Johan about annotation models, but they’re going to go with something relatively simple, which means there’s an opportunity to compare a richer statistical models like the other ones I’ve cited on the Data Annotation category of this blog.

Other Linguistic Games

The first linguistic game of which I was aware was Ahn’s reCAPTCHA. Although aimed at capturing OCR annotations as a side effect, it is more of a security wall aimed at filtering out bots than a game. Arguably, I’ve been played by it more than the other way around.

A more linguistically relevant game is Poesio et al.’s Phrase Detectives, which is aimed at elucidating coreference annotations. I played through several rounds of every aspect of it. The game interface itself is very nice for a web app. Phrase Detectives occassionally has cash prizes, but it looks like they ran out of prize money because the last reference to prizes was July 2011.

Are they Really Games?

Phrase Detectives is more like an Amazon Mechanical Turk task with a backstory and leaderboard. I didn’t create a login for Wordrobe to try it, but I’m going out on a limb to guess it’s going to be similar given the descriptions of the games.