Author Archive

Canceled: Help Build a Watson Clone–Participants Sought for LingPipe Code Camp

January 10, 2014

Code Camp is canceled. We are late on delivering a LingPipe recipes book to our publisher, and that will have to be our project for March. But we could have testers/reviewers come and hang out. Less fun, I think.

Apologies.
Breck

—————————

 

Dates: To be absolutely clear, the dates are 3/2/2014 to 3/31/2014 in Driggs, Idaho. You can work remotely; we will also be doing setup work beforehand.

New: We have set up a GitHub repository. The URL is https://github.com/watsonclone/jeopardy-.git

————————

Every year we go out west for a month of coding and skiing. Last year it was Salt Lake City, Utah; this year it is Driggs, Idaho, for access to Grand Targhee and Jackson Hole. The house is rented for the month of March and the project selected. This year we will create a Watson clone.

I have blogged about Watson before and I totally respect what they have done. But the coverage is getting a bit breathless with the latest billion-dollar effort. So how about assembling a scrappy bunch of developers to see how close we can come to recreating the Jeopardy beast?

How this works:

  • We have a month of focused development. We tend to code mornings, ski afternoons. The house has 3 bedrooms if you want to come stay; prior arrangements must be made. The house is paid for. Nothing else is.
  • The code base will be open source. We are moving LingPipe to the AGPL, so maybe that license, or we could just Apache-license it. We want folks to be comfortable contributing.
  • You don’t have to be present to contribute.
  • We had a very fun time last year. We worked very hard on an email app that didn’t quite get launched, but the lesson learned was to start with the project defined.

If you are interested in participating, let us know at watsonclone@lingpipe.com. Tell us your background, what you want to do, whether you expect to stay with us, etc. No visa situations please, and we don’t have any funds to support folks. Obviously we have limited space, physically and mentally, so we may say no, but we will do our best to be inclusive. Step 1: transcribe some Jeopardy shows.

Ask questions in the comments so all can benefit. Please check the comments before asking. I’ll answer the first one that is on everyone’s mind:

Q: Are you frigging crazy???

A: Why, yes, yes we are. But we are also really good computational linguists….

Breck

Limits of Floating Point and the Asymmetry of 0 and 1

December 19, 2013

If we are representing probabilities, we are interested in numbers between 0 and 1. It turns out these have very different properties in floating-point arithmetic. And it’s not as easy to solve as just working on the log scale.

Smallest Gaps between Floating Point Numbers of Different Magnitudes

The difference between 0 and the next biggest number representable in double-precision floating point (Java or C++ double) is about 4.9e-324 (a subnormal; the smallest normal double is about 2.2e-308). In contrast, the difference between 1 and the next smallest representable number is about 1.1e-16 (one part in 2^53). The reason is that a number near 1 is pinned to an exponent of roughly 0, so to exploit the maximum number of significant digits its mantissa has to represent something like 0.9999999999999999 or 1.0000000000000002; numbers near 0 get to use very small exponents, which pushes the representable values much closer together.
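
You can see both gaps directly from Java’s Math utilities; this is just a minimal illustration (the printed values are the IEEE 754 double-precision constants):

    public class FloatGaps {
        public static void main(String[] args) {
            // smallest representable double greater than 0 (a subnormal)
            System.out.println(Math.nextUp(0.0));          // prints 4.9E-324
            // largest representable double less than 1
            System.out.println(Math.nextAfter(1.0, 0.0));  // prints 0.9999999999999999
            // gap between 1.0 and the next larger double
            System.out.println(Math.ulp(1.0));             // prints 2.220446049250313E-16
        }
    }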

CDFs and Complementary CDFs

This is why there are two differently translated error functions in math libs, erf() and erfc(), which are rescaled cumulative and complementary cumulative distribution functions. That is, you can use erf() to calculate the cumulative unit normal distribution function (cdf), written Phi(x); erfc() can be used to calculate the complementary cumulative unit normal distribution function (ccdf), written (1 – Phi(x)).
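
For example, assuming Apache Commons Math 3 is on the classpath (an assumption; nothing in this post depends on it), the unit normal cdf computed via erf() rounds to 1 well before the tail is actually negligible, while the ccdf computed via erfc() keeps the tail:

    import org.apache.commons.math3.special.Erf;

    public class NormalTails {
        public static void main(String[] args) {
            double x = 10.0;
            // Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
            double cdf = 0.5 * (1.0 + Erf.erf(x / Math.sqrt(2.0)));
            // 1 - Phi(x) = 0.5 * erfc(x / sqrt(2))
            double ccdf = 0.5 * Erf.erfc(x / Math.sqrt(2.0));
            System.out.println(cdf);   // prints 1.0 -- the upper tail has been rounded away
            System.out.println(ccdf);  // roughly 7.6E-24, still meaningful
        }
    }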

Log Scale no Panacea

Switching to the log scale doesn’t solve the problem. If you have a number very close to 1, you need to be careful in taking its log. If what you have in hand is the small quantity x and you write log(1 – x), you run into the problem that 1 – x rounds to 1 once x gets small enough. That’s why standard math libraries provide a log1p() function, defined by log1p(x) = log(1 + x), so you can write log1p(-x) instead. This gives you back the precision you lose by subtracting two nearly equal numbers (what’s called “catastrophic cancellation” in the numerical analysis literature).
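
A two-line Java demonstration (minimal sketch) of the difference:

    public class Log1pDemo {
        public static void main(String[] args) {
            double x = 1e-20;                       // tiny; 1 - x is very close to 1
            System.out.println(Math.log(1.0 - x));  // 0.0 -- the subtraction rounded 1 - x to 1
            System.out.println(Math.log1p(-x));     // -1.0E-20 -- precision retained
        }
    }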

A Little Quiz

I’ll end with a little quiz to get you thinking about floating point a little more closely. Suppose you set the bits in a floating point number randomly, say by flipping a coin for each bit.

Junior Varsity:  What’s the approximate probability that the random floating point number has an absolute value less than 1?

Varsity:  What is the event probability of drawing a number in the range (L,U)? Or, equivalently, what’s the density function to which the draws are proportional?
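
If you want to check your answer empirically, here is a quick simulation sketch (it reinterprets 64 random bits as a double; the class and variable names are made up):

    import java.util.SplittableRandom;

    public class RandomBitsQuiz {
        public static void main(String[] args) {
            SplittableRandom rng = new SplittableRandom();
            long trials = 10_000_000L;
            long lessThanOne = 0;
            for (long t = 0; t < trials; ++t) {
                // 64 uniformly random bits, reinterpreted as an IEEE 754 double
                double d = Double.longBitsToDouble(rng.nextLong());
                if (Math.abs(d) < 1.0)   // NaNs fail this comparison, so they don't count
                    ++lessThanOne;
            }
            System.out.println(lessThanOne / (double) trials);
        }
    }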

Fokkens et al. on Replication Failure

September 25, 2013

After ACL, Becky Passonneau said I should read this:

You should, too. Or if you’d rather watch, there’s a video of their ACL presentation.

The Gist

The gist of the paper is that the authors tried to reproduce some results found in the literature for word sense identification and named-entity detection, and had a rather rough go of it. In particular, they found that every little decision they made impacted the evaluation error they got, including how to evaluate the error. As the narrative of their paper unfolded, all I could do was nod along. I loved that their paper was in narrative form — it made it very easy to follow.

The authors stress that we need more information for reproducibility and that reproducing known results should get more respect. No quibble there. A secondary motivation (after the pedagogical one) of the LingPipe tutorials was to show that we could reproduce existing results in the literature.

The Minor Fall

Although they’re riding one of my favorite hobby horses, they don’t quite lead it in the direction I would have. In fact, they stray into serious discussion of the point estimates of performance that their experiments yielded, despite the variation they see under cross-validation folds. So while they note that rankings of approaches change based on what seem like minor details, they stop short of calling into question the whole idea of trying to assign a simple ranking. Instead, they stress getting to the bottom of why the rankings differ.

A Slightly Different Take on the Issue

What I would’ve liked to have seen in this paper is more emphasis on two key statistical concepts:

  1. the underlying sample variation problem, and
  2. the overfitting problem with evaluations.

Sample Variation

The authors talk about sample variation a bit when they consider different folds, etc., in cross-validation. For reasons I don’t understand they call it “experimental setup.” But they don’t put it in terms of estimation variance. That is, independently of getting the features to all line up, the performance and hence rankings of various approaches vary to a large extent because of sample variation. The easiest way to see this is to run cross-validation.

I would have stressed that the bootstrap method should be used to get at sample variation. The bootstrap differs from cross-validation in that it samples training and test subsets with replacement. The bootstrap effectively evaluates within-sample sampling variation, even in the face of non-i.i.d. samples (usually there is correlation among the items in a corpus due to choosing multiple sentences from a single document or even multiple words from the same sentence). The picture is even worse in that the data set at hand is rarely truly representative of the intended out-of-sample application, which is typically to new text. The authors touch on these issues with discussions of “robustness.”
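
For concreteness, here is a minimal sketch of the evaluation-side bootstrap: resample the test items with replacement and watch how much a single accuracy figure moves. (The fuller version described above also resamples the training data and retrains; the gold/predicted arrays here are made-up stand-ins for a real evaluation set.)

    import java.util.Random;

    public class BootstrapAccuracy {
        public static void main(String[] args) {
            // hypothetical reference labels and system predictions for a test set
            int[] gold      = {0, 1, 1, 0, 1, 0, 1, 1, 0, 1};
            int[] predicted = {0, 1, 0, 0, 1, 1, 1, 1, 0, 0};
            int numReplicates = 10000;
            Random rng = new Random(42);
            double[] accs = new double[numReplicates];
            for (int b = 0; b < numReplicates; ++b) {
                int correct = 0;
                for (int n = 0; n < gold.length; ++n) {
                    int i = rng.nextInt(gold.length);  // draw a test item with replacement
                    if (gold[i] == predicted[i])
                        ++correct;
                }
                accs[b] = correct / (double) gold.length;
            }
            double mean = 0.0;
            for (double a : accs) mean += a;
            mean /= numReplicates;
            double var = 0.0;
            for (double a : accs) var += (a - mean) * (a - mean);
            var /= (numReplicates - 1);
            System.out.printf("bootstrap accuracy: mean=%.3f, sd=%.3f%n", mean, Math.sqrt(var));
        }
    }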

Overfitting

The authors don’t really address the overfitting problem, though it is tangential to what they call the “versioning” problem (when mismatched versions of corpora such as WordNet produce different results).

Overfitting is a huge problem in the current approach to evaluation in NLP and machine learning research. I don’t know why anyone cares about the umpteenth evaluation on the same fold of the Penn Treebank that produces a fraction of a percentage gain. When the variation across folds is higher than your estimated gain on a single fold, it’s time to pack the papers in the file drawer, not in the proceedings of ACL. At least until we also get a journal of negative results to go with our journal of reproducibility.

The Major Lift

Perhaps hill-climbing error on existing data sets is not the best way forward methodologically. Maybe, just maybe, we can make the data better. It’s certainly what Breck and I tell customers who want to build an application.

If making the data better sounds appealing, check out

I find the two ideas so compelling that it’s the only area of NLP I’m actively researching. More on that later (or sooner if you want to read ahead).

 

(My apologies to Leonard Cohen for stealing his lyrics.)

LingPipe Incubator Welcomes Seer

September 9, 2013

We have started an incubator program for start-ups with natural language processing (NLP) needs. Seer is our first egg. The company creates productivity apps based on unstructured data sources–that is where LingPipe fits in. The guys (Conall and Joe) fit a pattern that we see quite often–smart people with a good idea that presupposes NLP, but without much experience with it.

The GetSeer guys were on the ball right away because they had a bunch of data annotated and had cobbled together a regular-expression-based classifier that served as an excellent starting point. We had machine-learning-based classifiers up and running within two hours of starting. That included an evaluation harness. Nice.


The great thing about our working setup is that I get to teach/advise, which keeps my time commitment manageable, and they get to do most of the heavy lifting, which is how they learn to build and maintain NLP systems. I have had to learn some Scala, since I co-code most of the time when we work together and that is their implementation language. Scala is a hip extension of Java with a less verbose syntax and stronger type inference.

I’ll keep the blog updated with developments. Current status:

  • There is a small amount of training data.
  • Current results are awful (no surprise).
  • We have been roughing in the solution with Naive Bayes. Next steps will be opening a can of logistic regression for more interesting feature extraction.

Token Lumpers and Splitters: Spider-Man vs. Superman

August 21, 2013

One of the perennial problems in tokenization for search, classification, or clustering of natural language data is which words and phrases are spelled as compounds and which as separate words. For instance, consider “dodgeball” (right) vs. “dodge ball” (wrong), and “golfball” (wrong) vs. “golf ball” (right).

It’s a classic lumpers vs. splitters problem. Alas, the “right” answer can change over time and some words are listed as both lumped and split in some dictionaries.

This point was first brought home to me and Breck when we were browsing Comcast’s query logs for TV search queries, where users seemed to arbitrarily insert or delete spaces in queries. Who can blame them with the inconsistencies of English in this regard?

This token lumping vs. splitting problem is one of the reasons LingPipe’s spelling correction is not token-based like Lucene’s (though Lucene has other ways to handle combining and splitting compounds — more on that from Mitzi in the near future). And it’s one of the reasons (along with morphological affixes) that character n-grams provide a better basis for classification than hard tokenization.
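
To see why, here is a tiny character n-gram extractor (a throwaway sketch, not LingPipe’s tokenizer API): the lumped and split spellings still share most of their n-grams, whereas an exact token match sees two unrelated strings.

    import java.util.LinkedHashSet;
    import java.util.Set;

    public class CharNGrams {
        // all character n-grams of length n, spaces included
        static Set<String> ngrams(String text, int n) {
            Set<String> grams = new LinkedHashSet<String>();
            for (int i = 0; i + n <= text.length(); ++i)
                grams.add(text.substring(i, i + n));
            return grams;
        }

        public static void main(String[] args) {
            Set<String> shared = ngrams("dodgeball", 4);
            shared.retainAll(ngrams("dodge ball", 4));
            System.out.println(shared);  // [dodg, odge, ball]
        }
    }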

My favorite examples are comic book superhero names. It turns out that the editors at Marvel Comics are splitters, whereas those at DC Comics are lumpers. Hence we have on the Marvel roster,

Ant-Man, Iron Man, and Spider-Man.

They’re not consistent on hyphenation, which is another issue if you don’t throw away punctuation.

DC, on the other hand, employs men, women, and children, but doesn’t have the budget for spaces:

Aquaman, Batman, Catwoman, Hawkman, Martian Manhunter, Superboy, Supergirl, and Superman.

The exception that proves the rule (I’m open-minded enough to use a phrase that’s a pet peeve of mine), DC gives us:

Wonder Woman

VerbCorner: Another Game with Words

July 4, 2013

Games with Words continues to roll out new games, releasing VerbCorner.

VerbCorner contains a series of sub-games with titles like “Philosophical Zombie Hunter” and questions that sound like they’re straight out of an SAT reading comprehension test, starting with a long story I won’t repeat and then questions like:

Michelle smashed the gauble into quaps. Which of the following is *definitely* a real human?

  1. Michelle
  2. None of the above
  3. Can’t tell because in this context ‘smash’ has more than one meaning.
  4. Can’t tell because the sentence is ungrammatical.
  5. Can’t tell because I don’t know that verb.

Josh Hartshorne, one of the developers, writes that Verbcorner is “a new project I’m doing in collaboration with Martha Palmer at CU-Boulder, which is aimed at getting human judgments to inform the semantics assigned to verbs in VerbNet. You can find a press release about the project … You’ll see we’ve incorporated a lot of gamification elements in order to make the task more engaging (and also to improve annotation quality).”

This note from the press release harkens back to my philosophical semantics days, “If we do not want to define words in terms of other words, what should they be defined in terms of? This remains an open research question…”. I’ll say! The project has a generative semantics feel; in the press release, Josh is quoted as saying “There are a few dozen core components of meaning and there are tens of thousands of verbs in English.” Now I’m left hanging as to what a “core component of meaning” is!

All Bayesian Models are Generative (in Theory)

May 23, 2013

[This post is a followup to my previous post, Generative vs. discriminative; Bayesian vs. frequentist.]

I had a brief chat with Andrew Gelman about the topic of generative vs. discriminative models. It came up when I was asking him why he didn’t like the frequentist semicolon notation for variables that are not random. He said that in Bayesian analyses, all the variables are considered random. It just turns out we can sidestep having to model the predictors (i.e., covariates or features) in a regression. Andrew explains how this works in section 14.1, subsection “Formal Bayesian justification of conditional modeling” in Bayesian Data Analysis, 2nd Edition (the 3rd Edition is now complete and will be out in the foreseeable future). I’ll repeat the argument here.

Suppose we have predictor matrix X and outcome vector y. In regression modeling, we assume a coefficient vector \beta and explicitly model the outcome likelihood p(y|\beta,X) and coefficient prior p(\beta). But what about X? If we were modeling X, we’d have a full likelihood p(X,y|\beta,\psi), where \psi are the additional parameters involved in modeling X and the prior is now joint, p(\beta,\psi).

So how do we justify conditional modeling as Bayesians? We simply assume that \psi and \beta have independent priors, so that

p(\beta,\psi) = p(\beta) \times p(\psi).

The posterior then neatly factors as

p(\beta,\psi|y,X) = p(\psi|X) \times p(\beta|X,y).
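
Spelling out the intermediate step (BDA’s argument assumes X is modeled in terms of \psi alone and y given X in terms of \beta alone, so the full likelihood factors as p(X,y|\beta,\psi) = p(X|\psi) \, p(y|X,\beta)), the joint posterior is

p(\beta,\psi|y,X) \propto p(X|\psi) \, p(\psi) \times p(y|X,\beta) \, p(\beta),

where the first pair of factors is proportional to p(\psi|X) and the second to p(\beta|X,y).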

Looking at just the inference for the regression coefficients \beta, we have the familiar expression

p(\beta|y,X) \propto p(\beta) \times p(y|\beta,X).

Therefore, we can think of everything as a joint model under the hood. Regression models involve an independence assumption so we can ignore the inference for \psi. To quote BDA,

The practical advantage of using such a regression model is that it is much easier to specify a realistic conditional distribution of one variable given k others than a joint distribution on all k+1 variables.

We knew that all along.

Generative vs. Discriminative; Bayesian vs. Frequentist

April 12, 2013

[There's now a followup post, All Bayesian models are generative (in theory).]

I was helping Boyi Xie get ready for his Ph.D. qualifying exams in computer science at Columbia and at one point I wrote the following diagram on the board to lay out the generative/discriminative and Bayesian/frequentist distinctions in what gets modeled.

To keep it in the NLP domain, let’s assume we have a simple categorization problem to predict a category z for an input consisting of a bag of words vector w and parameters β.

                 Frequentist      Bayesian
Discriminative   p(z ; w, β)      p(z, β ; w) = p(z | β ; w) * p(β)
Generative       p(z, w ; β)      p(z, w, β) = p(z, w | β) * p(β)

If you’re not familiar with frequentist notation, items to the right of the semicolon (;) are not modeled probabilistically.

Frequentists model the probability of observations given the parameters. This involves a likelihood function.

Bayesians model the joint probability of data and parameters, which, given the chain rule, amounts to a likelihood function times a prior.

Generative models provide a probabilistic model of the predictors, here the words w, and the categories z, whereas discriminative models only provide a probabilistic model of the categories z given the words w.

Mean-Field Variational Inference Made Easy

March 25, 2013

I had the hardest time trying to understand variational inference. All of the presentations I’ve seen (MacKay, Bishop, Wikipedia, Gelman’s draft for the third edition of Bayesian Data Analysis) are deeply tied up with the details of a particular model being fit. I wanted to see the algorithm and get the big picture before being overwhelmed with multivariate exponential family gymnastics.

Bayesian Posterior Inference

In the Bayesian setting (see my earlier post, What is Bayesian Inference?), we have a joint probability model p(y,\theta) for data y and parameters \theta, usually factored as the product of a likelihood and prior term, p(y,\theta) = p(y|\theta) p(\theta). Given some observed data y, Bayesian predictive inference is based on the posterior density p(\theta|y) \propto p(\theta, y) of the unknown parameter vector \theta given the observed data vector y. Thus we need to be able to estimate the posterior density p(\theta|y) to carry out Bayesian inference. Note that the posterior is a whole density function—we’re not just after a point estimate as in maximum likelihood estimation.

Mean-Field Approximation

Variational inference approximates the Bayesian posterior density p(\theta|y) with a (simpler) density g(\theta|\phi) parameterized by some new parameters \phi. The mean-field form of variational inference factors the approximating density g by component of \theta = \theta_1,\ldots,\theta_J, as

g(\theta|\phi) = \prod_{j=1}^J g_j(\theta_j|\phi_j).

I’m going to put off actually defining the terms g_j until we see how they’re used in the variational inference algorithm.

What Variational Inference Does

The variational inference algorithm finds the value \phi^* for the parameters \phi of the approximation which minimizes the Kullback-Leibler divergence of g(\theta|\phi) from p(\theta|y),

\phi^* = \mbox{arg min}_{\phi} \ \mbox{KL}[ g(\theta|\phi) \ || \ p(\theta|y) ].

The key idea here is that variational inference reduces posterior estimation to an optimization problem. Optimization is typically much faster than approaches to posterior estimation such as Markov chain Monte Carlo (MCMC).

The main disadvantage of variational inference is that the posterior is only approximated (though as MacKay points out, just about any approximation is better than a delta function at a point estimate!). In particular, variational methods systematically underestimate posterior variance because of the direction of the KL divergence that is minimized. Expectation propagation (EP) also converts posterior fitting to optimization of KL divergence, but EP uses the opposite direction of KL divergence, which leads to overestimation of posterior variance.

Variational Inference Algorithm

Given the Bayesian model p(y,\theta), observed data y, and functional terms g_j making up the approximation of the posterior p(\theta|y), the variational inference algorithm is:

  • \phi \leftarrow \mbox{random legal initialization}
  • \mbox{repeat}
    • \phi_{\mbox{\footnotesize old}} \leftarrow \phi
    • \mbox{for } j \mbox{ in } 1:J
      • \mbox{{} \ \ set } \phi_j \mbox{ such that } g_j(\theta_j|\phi_j) \propto \exp\big(\mathbb{E}_{g_{-j}}[\log p(\theta|y)]\big).
  • \mbox{until } ||\phi - \phi_{\mbox{\footnotesize old}}|| < \epsilon

The inner expectation is a function of \theta_j, obtained by integrating out all the other components under their approximating densities:

\mathbb{E}_{g_{-j}}[\log p(\theta|y)] \ = \ \int \cdots \int \Big( \prod_{i \neq j} g_i(\theta_i|\phi_i) \Big) \, \log p(\theta|y) \ \prod_{i \neq j} d\theta_i

Despite the suggestive factorization of g and the coordinate-wise nature of the algorithm, variational inference does not simply approximate the posterior marginals p(\theta_j|y) independently.

Defining the Approximating Densities

The trick is to choose the approximating factors so that we can compute parameter values \phi_j satisfying g_j(\theta_j|\phi_j) \propto \exp\big(\mathbb{E}_{g_{-j}}[\log p(\theta|y)]\big). Finding such approximating terms g_j(\theta_j|\phi_j) given a posterior p(\theta|y) is an art form unto itself. It’s much easier for models with conjugate priors. Bishop’s and MacKay’s books and Wikipedia present calculations for a wide range of exponential-family models.
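
To make that concrete, here is a minimal self-contained sketch (not LingPipe code; all names invented for the example) of the coordinate updates for the standard conjugate toy model of a normal with unknown mean and precision: x[i] ~ Normal(mu, 1/tau), with priors mu | tau ~ Normal(mu0, 1/(lambda0*tau)) and tau ~ Gamma(a0, b0), and approximating factors g(mu) = Normal(muN, 1/lambdaN) and g(tau) = Gamma(aN, bN). The closed-form updates are the standard textbook ones for this model.

    public class MeanFieldNormal {
        public static void main(String[] args) {
            double[] x = {2.1, 1.7, 2.5, 1.9, 2.3, 2.0, 1.8, 2.2};  // toy data
            double mu0 = 0.0, lambda0 = 1.0, a0 = 1.0, b0 = 1.0;    // prior hyperparameters
            int n = x.length;
            double xBar = 0.0;
            for (double xi : x) xBar += xi;
            xBar /= n;

            // these two variational parameters are fixed by the data and prior
            double muN = (lambda0 * mu0 + n * xBar) / (lambda0 + n);
            double aN = a0 + (n + 1) / 2.0;

            // initialize the rest and iterate the coordinate updates to convergence
            double expTau = a0 / b0;   // E[tau] under the current g(tau)
            double lambdaN = 1.0;
            double bN = b0;
            for (int iter = 0; iter < 100; ++iter) {
                double oldBN = bN;
                // update g(mu): its precision scales with the current expected tau
                lambdaN = (lambda0 + n) * expTau;
                // update g(tau): expected squared residuals under the current g(mu)
                double eSq = 0.0;
                for (double xi : x) eSq += (xi - muN) * (xi - muN);
                eSq += n / lambdaN;  // Var(mu) contributes to each residual
                eSq += lambda0 * ((muN - mu0) * (muN - mu0) + 1.0 / lambdaN);
                bN = b0 + 0.5 * eSq;
                expTau = aN / bN;
                if (Math.abs(bN - oldBN) < 1e-10) break;  // converged
            }
            System.out.printf("g(mu)  = Normal(%.4f, var=%.6f)%n", muN, 1.0 / lambdaN);
            System.out.printf("g(tau) = Gamma(%.3f, %.3f), E[tau] = %.4f%n", aN, bN, aN / bN);
        }
    }

The point of the sketch is just the shape of the loop: each pass sets one factor’s parameters from expectations under the other factor, exactly as in the algorithm above.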

What if My Model is not Conjugate?

Unfortunately, I almost never work with conjugate priors (and even if I did, I’m not much of a hand at exponential-family algebra). Therefore, the following paper just got bumped to the top of my must-understand queue:

It’s great having Dave down the hall on sabbatical this year — one couldn’t ask for a better stand in for Matt Hoffman. They are both insanely good at on-the-fly explanations at the blackboard (I love that we still have real chalk and high quality boards).

Anyone Want to Write an O’Reilly Book on NLP with Java?

February 21, 2013

Mitzi and I pitched O’Reilly a revision of the Text Processing in Java book that she’s been finishing off.

The response from their editor was that they’d love to have an NLP book based on Java, but what we provided looked like everything-but-the-NLP you’d need for such a book. Insightful, these editors. That’s exactly how the book came about, when the non-proprietary content was stripped out of the LingPipe Book.

I happen to still think that part of the book is incredibly useful. It covers all of Unicode, ICU for normalization and detection, all of the streaming I/O interfaces, codings in HTML, XML and JSON, as well as in-depth coverage of regexes, Lucene, and Solr. All of the stuff that is continually misunderstood and misconfigured so that I have to spend way too much of my time sorting it out. (Mitzi finished the HTML, XML and JSON chapter, and is working on Solr; she tuned Solr extensively on her last consulting gig, by the way, if anyone’s looking for a Lucene/Solr developer.)

O’Reilly isn’t interested in LingPipe because of LingPipe’s non-OSI-approved license. I don’t have time to finish the LingPipe book now that I’m at Columbia full time; I figured it’d be 1500 pages when I was done if I kept up at the same level of detail, and even that didn’t seem like enough!

A good model for such a book is Manning Press’s Taming Text, co-authored by Breck’s grad-school classmate Tom Morton. It’s based on Lucene/Mahout and Tom’s baby, OpenNLP. (Manning also passed on Text Processing in Java, which is where Breck sent it first.)

Another model, aimed more at linguists than programmers, is O’Reilly’s own NLTK Book, co-authored by my grad school classmate Steve Bird, and my Ph.D. supervisor, Ewan Klein. Very small world, NLP.

Since Manning passed on TPiJ, Colloquial Media will branch out from genre fiction into tech books. More news when we’re closer to finishing.

If you know why LDA will think this document is about American football and why frame-based parsing will only make topic classification worse, then you’re probably a good candidate to write this book. [Domain knowledge: Manning is the surname of two generations of very well-known American football players, all of whom play the only position that fills the agent role in a very popular play known as a “pass.”] If you know why Yarowsky’s one discourse, one sense rule is violated by the background knowledge, and also understand the principled Bayesian strategies for what Yarowsky called “bootstrapping,” even better.

