Archive for the ‘Carp’s Blog’ Category

Becky’s and My Annotation Paper in TACL

October 29, 2014

Finally something published after all these years working on the problem:

Rebecca J. Passonneau and Bob Carpenter. 2014. The Benefits of a Model of Annotation. Transactions of the Association for Comptuational Linguistics (TACL) 2(Oct):311−326. [pdf(Presented at EMNLP 2014.)

Becky just presented it at EMNLP this week. I love the TACL concept and lobbied Michael Collins for short reviewer deadlines based on the fact that (a) I saw the bioinformatics journals could do it and (b) I either do reviews right away or wait until I’m nagged, then do them right away — they never take even three weeks of real work. The TACL editors/reviewers really were quick, getting responses to us on the initial paper and revisions in under a month.

What you Get

There’s nothing new in the paper technically that I haven’t discussed before on this blog. In fact, the model’s a step back from what I consider best practices, mainly to avoid having to justify Bayesian hierarchical models to reviewers and hence avoid reviews like the last one I got. I’d highly recommend using a proper hierarchical model (that’s “proper” in the “proper English breakfast” sense) and using full Bayesian inference.

One of the main points of the paper is that high kappa values are not necessary for high quality corpora and even more surprisingly, high kappa values are not even sufficient.

Another focus is on why the Dawid and Skene type models that estimate a full categorical response matrix per annotator are preferable to anything that relies on weighted voting (hint: voting can’t adjust for bias).

I also made a pitch for using expected information gain (aka mutual information) to evaluate quality of annotators.

The Data

The paper’s very focused on the Mechanical Turk data we collected for word sense. That data’s freely available as part of the manually annotated subcorpus (MASC) of the American national corpus (ANC). That was 1000 instances of roughly 50 words (nouns, adjectives and verbs) and roughly 25 annotations per instance, for a total of around 1M labels. You can get it in easily digestible form from the repo linked below.


I also wrote a simple EM-based optimizer in R to perform maximum penalized likelihood (equivalently max a posteriori [MAP]) estimates. It’s not super fast, but it can deal with 25K labels over 200 annotators and 1000 items in a minute or so. The code is available from a

GitHub Repo: bob-carpenter/anno.

It includes a CSV-formatted version of all the data we collected in case you want to do your own analysis.

We collected so many labels per item so that we could try to get a handle on item difficulty. We just haven’t had time to do the analysis. So if you’re interested or want to collaborate, let me know (

Limits of Floating Point and the Asymmetry of 0 and 1

December 19, 2013

If we are representing probabilities, we are interested in numbers between 0 and 1. It turns out these have very different properties in floating-point arithmetic. And it’s not as easy to solve as just working on the log scale.

Smallest Gaps between Floating Point Numbers of Different Magnitudes

The difference between 0 and the next biggest number representable in double-precision floating point (Java or C++ double) is on the order of 1e-300. In constrast, the difference between 1 and the next smallest number representable is around 1e-15. The reason is that to exploit the maximum number of significant digits, the mantissa for the number near 1 has to be scaled roughly like 0.999999999999999 or 1.0000000000001 and the exponent has to be scaled like 0.

CDFs and Complementary CDFs

This is why there are two differently translated error functions in math libs, erf() and erfc(), which are rescaled cumulative and complementary cumulative distribution functions. That is, you can use erf() to calculate the cumulative unit normal distribution function (cdf), written Phi(x); erfc() can be used to calculate the complementary cumulative unit normal distribution function (ccdf), written (1 – Phi(x)).

Log Scale no Panacea

Switching to the log scale doesn’t sove the problem. If you have a number very close to 1, you need to be careful in taking its log. If you write log(1-x), you run into the problem that x can only be so close to 1. That’s why standard math libraries provide a log1p() function defined by log1p(x) = log(1 + x). This gives you back the precision you lose by subtraction two numbers close to each other (what’s called “catastrophic cancellation” in the arithmetic processing literature).

A Little Quiz

I’ll end with a little quiz to get you thinking about floating point a little more closely. Suppose you set the bits in a floating point number randomly, say by flipping a coin for each bit.

Junior Varsity:  What’s the approximate probability that the random floating point number has an absolute value less than 1?

Varsity:  What is the event probability of drawing a number in the range (L,U). Or, equivalently, what’s the density function to which the draws are proportional?

Fokkens et al. on Replication Failure

September 25, 2013

After ACL, Becky Passonneau said I should read this:

You should, too. Or if you’d rather watch, there’s a video of their ACL presentation.

The Gist

The gist of the paper is that the authors tried to reproduce some results found in the literature for word sense identification and named-entity detection, and had a rather rough go of it. In particular, they found that every little decision they made impacted the evaluation error they got, including how to evaluate the error. As the narrative of their paper unfolded, all I could do was nod along. I loved that their paper was in narrative form — it made it very easy to follow.

The authors stress that we need more information for reproducibility and that reproducing known results should get more respect. No quibble there. A secondary motivation (after the pedagogical one) of the LingPipe tutorials was to show that we could reproduce existing results in the literature.

The Minor Fall

Although they’re riding one of my favorite hobby horses, they don’t quite lead it in the direction I would have. In fact, they stray into serious discussion of the point estimates of performance that their experiments yielded, despite the variation they see under cross-validation folds. So while they note that rankings of approaches changes based on what seem like minor details, they stop short of calling into question the whole idea of trying to assign a simple ranking. Instead, they stress getting to the bottom of why the rankings differ.

A Slightly Different Take on the Issue

What I would’ve liked to have seen in this paper is more emphasis on two key statistical concepts:

  1. the underlying sample variation problem, and
  2. the overfitting problem with evaluations.

Sample Variation

The authors talk about sample variation a bit when they consider different folds, etc., in cross-validation. For reasons I don’t understand they call it “experimental setup.” But they don’t put it in terms of estimation variance. That is, independently of getting the features to all line up, the performance and hence rankings of various approaches vary to a large extent because of sample variation. The easiest way to see this is to run cross-validation.

I would have stressed that the bootstrap method should be used to get at sample variation. The bootstrap differs from cross-validation in that it samples training and test subsets with replacement. The bootstrap effectively evaluates within-sample sampling variation, even in the face of non-i.i.d. samples (usually there is correlation among the items in a corpus due to choosing multiple sentences from a single document or even multiple words from the same sentence). The picture is even worse in that the data set at hand is rarely truly representative of the intended out-of-sample application, which is typically to new text. The authors touch on these issues with discussions of “robustness.”


The authors don’t really address the overfitting problem, though it is tangential to what they call the “versioning” problem (when mismatched versions of corpora such as WordNet produce different results).

Overfitting is a huge problem in the current approach to evaluation in NLP and machine learning research. I don’t know why anyone cares about the umpteenth evaluation on the same fold of the Penn Treebank that produces a fraction of a percentage gain. When the variation across folds is higher than your estimated gain on a single fold, it’s time to pack the papers in the file drawer, not in the proceedings of ACL. At least until we also get a journal of negative results to go with our journal of reproducibility.

The Major Lift

Perhaps hill-climbing error on existing data sets is not the best way forward methdologically. Maybe, just maybe, we can make the data better. It’s certainly what Breck and I tell customers who want to build an application.

If making the data better sounds appealing, check out

I find the two ideas so compelling that it’s the only area of NLP I’m actively researching. More on that later (or sooner if you want to read ahead).


(My apologies to Leonard Cohen for stealing his lyrics.)

Token Lumpers and Splitters: Spider-Man vs. Superman

August 21, 2013

One of the perennial problems in tokenization for search, classification, or clustering of natural language data is which words and phrases are spelled as compounds and which are separate words. For instance consider “dodgeball” (right) vs. “dodge ball” (wrong) and “golfball” (wrong) vs. “golf ball” (right)?

It’s a classic lumpers vs. splitters problem. Alas, the “right” answer can change over time and some words are listed as both lumped and split in some dictionaries.

This point was first brought home to me and Breck when we were browsing Comcast’s query logs for TV search queries, where users seemed to arbitrarily insert or delete spaces in queries. Who can blame them with the inconsistencies of English in this regard?

This token lumping vs. splitting problem is one of the reasons LingPipe’s spelling correction is not token-based like Lucene’s (though Lucene has other ways to handle combining and splitting compounds — more on that from Mitzi in the near future). And it’s one of the reasons (along with morphological affixes) that character n-grams provide a better basis for classification than hard tokenization.

My favorite examples are comic book superhero names. It turns out that the editors at Marvel Comics are splitters, whereas those at DC Comics are lumpers. Hence we have on the Marvel roster,

Ant-Man, Iron Man, and Spider-Man.

They’re not consistent on hyphenation, which is another issue if you don’t throw away punctuation.

DC, on the other hand, employs men, women, and children, but doesn’t have the budget for spaces:

Aquaman, Batman, Catwoman, Hawkman, Martian Manhunter, Superboy, Supergirl, and Superman.

The exception that proves the rule (I’m open-minded enough to use a phrase that’s a pet peeve of mine), DC gives us:

Wonder Woman

VerbCorner: Another Game with Words

July 4, 2013

Games with Words continues to roll out new games, releasing VerbCorner.

VerbCorner contains a series of sub-games with titles like “Philosophical Zombie Hunter” and questions that sound like they’re straight of an SAT reading comprehension test, starting with a long story I won’t repeat and then questions like:

Michelle smashed the gauble into quaps. Which of the following is *definitely* a real human?

  1. Michelle
  2. None of the above
  3. Can’t tell because in this context ‘smash’ has more than one meaning.
  4. Can’t tell because the sentence is ungrammatical.
  5. Can’t tell because I don’t know that verb.

Josh Hartshorne, one of the developers, writes that Verbcorner is “a new project I’m doing in collaboration with Martha Palmer at CU-Boulder, which is aimed at getting human judgments to inform the semantics assigned to verbs in VerbNet. You can find a press release about the project … You’ll see we’ve incorporated a lot of gamification elements in order to make the task more engaging (and also to improve annotation quality).”

This note from the press release harkens back to my philosophical semantics days, “If we do not want to define words in terms of other words, what should they be defined in terms of? This remains an open research question…”. I’ll say! The project has a generative semantics feel; in the press release, Josh is quoted as saying “There are a few dozen core components of meaning and there are tens of thousands of verbs in English.” Now I’m left hanging as to what a “core component of meaning” is!

All Bayesian Models are Generative (in Theory)

May 23, 2013

[This post is a followup to my previous post, Generative vs. discriminative; Bayesian vs. frequentist.]

I had a brief chat with Andrew Gelman about the topic of generative vs. discriminative models. It came up when I was asking him why he didn’t like the frequentist semicolon notation for variables that are not random. He said that in Bayesian analyses, all the variables are considered random. It just turns out we can sidestep having to model the predictors (i.e., covariates or features) in a regression. Andrew explains how this works in section 14.1, subsection “Formal Bayesian justification of conditional modeling” in Bayesian Data Analysis, 2nd Edition (the 3rd Edition is now complete and will be out in the foreseeable future). I’ll repeat the argument here.

Suppose we have predictor matrix X and outcome vector y. In regression modeling, we assume a coefficient vector \beta and explicitly model the outcome likelihood p(y|\beta,X) and coefficient prior p(\beta). But what about X? If we were modeling X, we’d have a full likelihood p(X,y|\beta,\psi), where \psi are the additional parameters involved in modeling X and the prior is now joint, p(\beta,\psi).

So how do we justify conditional modeling as Bayesians? We simply assume that \psi and \beta have independent priors, so that

p(\beta,\psi) = p(\beta) \times p(\psi).

The posterior then neatly factors as

p(\beta,\psi|y,X) = p(\psi|X) \times p(\beta|X,y).

Looking at just the inference for the regression coefficients \beta, we have the familiar expression

p(\beta|y,X) \propto p(\beta) \times p(y|\beta,X).

Therefore, we can think of everything as a joint model under the hood. Regression models involve an independence assumption so we can ignore the inference for \psi. To quote BDA,

The practical advantage of using such a regression model is that it is much easier to specify a realistic conditional distribution of one variable given k others than a joint distribution on all k+1 variables.

We knew that all along.

Generative vs. Discriminative; Bayesian vs. Frequentist

April 12, 2013

[There's now a followup post, All Bayesian models are generative (in theory).]

I was helping Boyi Xie get ready for his Ph.D. qualifying exams in computer science at Columbia and at one point I wrote the following diagram on the board to lay out the generative/discriminative and Bayesian/frequentist distinctions in what gets modeled.

To keep it in the NLP domain, let’s assume we have a simple categorization problem to predict a category z for an input consisting of a bag of words vector w and parameters β.

Frequentist Bayesian
Discriminative p(z ; w, β) p(z, β ; w) = p(z | β ; w) * p(β)
Generative p(z, w ; β) p(z, w, β) = p(z, w | β) * p(β)

If you’re not familiar with frequentist notation, items to the right of the semicolon (;) are not modeled probabilistically.

Frequentists model the probability of observations given the parameters. This involves a likelihood function.

Bayesians model the joint probability of data and parameters, which, given the chain rule, amounts to a likelihood function times a prior.

Generative models provide a probabilistic model of the predictors, here the words w, and the categories z, whereas discriminative models only provide a probabilistic model of the categories z given the words w.

Mean-Field Variational Inference Made Easy

March 25, 2013

I had the hardest time trying to understand variational inference. All of the presentations I’ve seen (MacKay, Bishop, Wikipedia, Gelman’s draft for the third edition of Bayesian Data Analysis) are deeply tied up with the details of a particular model being fit. I wanted to see the algorithm and get the big picture before being overwhelmed with multivariate exponential family gymnastics.

Bayesian Posterior Inference

In the Bayesian setting (see my earlier post, What is Bayesian Inference?), we have a joint probability model p(y,\theta) for data y and parameters \theta, usually factored as the product of a likelihood and prior term, p(y,\theta) = p(y|\theta) p(\theta). Given some observed data y, Bayesian predictive inference is based on the posterior density p(\theta|y) \propto p(\theta, y) of the the unknown parameter vector \theta given observed data vector y. Thus we need to be able to estimate the posterior density p(\theta|y) to carry out Bayesian inference. Note that the posterior is a whole density function—we’re not just after a point estimate as in maximum likelihood estimation.

Mean-Field Approximation

Variational inference approximates the Bayesian posterior density p(\theta|y) with a (simpler) density g(\theta|\phi) parameterized by some new parameters \phi. The mean-field form of variational inference factors the approximating density g by component of \theta = \theta_1,\ldots,\theta_J, as

g(\theta|\phi) = \prod_{j=1}^J g_j(\theta_j|\phi_j).

I’m going to put off actually defining the terms g_j until we see how they’re used in the variational inference algorithm.

What Variational Inference Does

The variational inference algorithm finds the value \phi^* for the parameters \phi of the approximation which minimizes the Kullback-Leibler divergence of g(\theta|\phi) from p(\theta|y),

\phi^* = \mbox{arg min}_{\phi} \ \mbox{KL}[ g(\theta|\phi) \ || \ p(\theta|y) ].

The key idea here is that variational inference reduces posterior estimation to an optimization problem. Optimization is typically much faster than approaches to posterior estimation such as Markov chain Monte Carlo (MCMC).

The main disadvantage of variational inference is that the posterior is only approximated (though as MacKay points out, just about any approximation is better than a delta function at a point estimate!). In particular, variational methods systematically underestimate posterior variance because of the direction of the KL divergence that is minimized. Expectation propagation (EP) also converts posterior fitting to optimization of KL divergence, but EP uses the opposite direction of KL divergence, which leads to overestimation of posterior variance.

Variational Inference Algorithm

Given the Bayesian model p(y,\theta), observed data y, and functional terms g_j making up the approximation of the posterior p(\theta|y), the variational inference algorithm is:

  • \phi \leftarrow \mbox{random legal initialization}
  • \mbox{repeat}
    • \phi_{\mbox{\footnotesize old}} \leftarrow \phi
    • \mbox{for } j \mbox{ in } 1:J
      • \mbox{{} \ \ set } \phi_j \mbox{ such that } g_j(\theta_j|\phi_j) = \mathbb{E}_{g_{-j}}[\log p(\theta|y)].
  • \mbox{until } ||\phi - \phi_{\mbox{\footnotesize old}}|| < \epsilon

The inner expectation is a function of \theta_j returning a single non-negative value, defined by

\mathbb{E}_{g_{-j}}[\log p(\theta|y)]

\begin{array}{l}  \mbox{ } = \int_{\theta_1} \ldots \int_{\theta_{j-1}} \int_{\theta_{j+1}} \ldots \int_{\theta_J}  \\[8pt] \mbox{ } \hspace*{0.4in}  g(\theta_1|\phi_1) \times \cdots \times g(\theta_{j-1}|\phi_{j-1}) \times  g(\theta_{j+1}|\phi_{j+1}) \times \cdots \times g(\theta_J|\phi_J)  \times  \log p(\theta|y)  \\[8pt] \mbox{ } \hspace*{0.2in}  d\theta_J \cdots d\theta_{j+1} \ d\theta_{j-1} \cdots d\theta_1  \end{array}

Despite the suggestive factorization of g and the coordinate-wise nature of the algorithm, variational inference does not simply approximate the posterior marginals p(\theta_j|y) independently.

Defining the Approximating Densities

The trick is to choose the approximating factors so that the we can compute parameter values \phi_j such that g_j(\theta_j|\phi_j) = \mathbb{E}_{g_{-j}}[\log p(\theta|y)]. Finding such approximating terms g_j(\theta_j|\phi_j) given a posterior p(\theta|y) is an art form unto itself. It’s much easier for models with conjugate priors. Bishop or MacKay’s books and the Wikipedia present calculations for a wide range of exponential-family models.

What if My Model is not Conjugate?

Unfortunately, I almost never work with conjugate priors (and even if I did, I’m not much of a hand at exponential-family algebra). Therefore, the following paper just got bumped to the top of my must understand queue:

It’s great having Dave down the hall on sabbatical this year — one couldn’t ask for a better stand in for Matt Hoffman. They are both insanely good at on-the-fly explanations at the blackboard (I love that we still have real chalk and high quality boards).

Anyone Want to Write an O’Reilly Book on NLP with Java?

February 21, 2013

Mitzi and I pitched O’Reilly books a revision of the Text Processing in Java book that she’s been finishing off.

The response from their editor was that they’d love to have an NLP book based on Java, but what we provided looked like everything-but-the-NLP you’d need for such a book. Insightful, these editors. That’s exactly how the book came about, when the non-proprietary content was stripped out of the LingPipe Book.

I happen to still think that part of the book is incredibly useful. It covers all of unicode, UCI for normalization and detection, all of the streaming I/O interfaces, codings in HTML, XML and JSON, as well as in-depth coverage of reg-exes, Lucene, and Solr. All of the stuff that is continually misunderstood and misconfigured so that I have to spend way too much of my time sorting it out. (Mitzi finished the HTML, XML and JSON chapter, and is working on Solr; she tuned Solr extensively on her last consulting gig, by the way, if anyone’s looking for a Lucene/Solr developer).

O’Reilly isn’t interested in LingPipe because of LingPipe’s non-OSF approved license. I don’t have time to finish the LingPipe book now that I’m at Columbia full time; I figured it’d be 1500 pages when I was done if I kept up at the same level of detail, and even that didn’t seem like enough!

A good model for such a book is Manning Press’s Taming Text, co-authored by Breck’s grad-school classmate Tom Morton. It’s based on Lucene/Mahout and Tom’s baby, OpenNLP. (Manning also passed on Text Processing in Java, which is where Breck sent it first.)

Another model, aimed more at linguists than programmers, is O’Reilly’s own NLTK Book, co-authored by my grad school classmate Steve Bird, and my Ph.D. supervisor, Ewan Klein. Very small world, NLP.

Manning also passed on TPiJ, so Colloquial Media will branch out from genre fiction into tech books. More news when we’re closer to finishing.

If you know why LDA will think this document is about American football and why frame-based parsing will only make topic classification worse, then you’re probably a good candidate to write this book. [Domain knowledge: Manning is the surname of two generations of very well known American football players all of whom play the only position that fill the agent roll in a very popular play known as a "pass".] If you know why Yarowsky‘s one discourse, one sense rule is violated by the background knowledge and also understand the principled Bayesian strategies for what Yarowsky called “bootstrapping,” even better.

Bayesian Inference for LDA is Intractable

February 18, 2013

Bayesian inference for LDA is intractable. And I mean really really deeply intractable in a way that nobody has figured or is ever likely to figure out how to solve.

Before sending me a “but, but, but, …” reply, you might want to bone up on the technical definition of Bayesian inference, which is a bit more than applying Bayes’s rule or using a prior,

and on computational intractability, which is a bit more involved than a program being slow,

The reason Bayesian inference for LDA is intractable is that we can’t effectively integrate over its posterior. And I don’t just mean there’s no analytic solution; there rarely is for interesting Bayesian models. But for some models, you can use numerical techniques like MCMC to sample from the posterior and compute a posterior integral by averaging.

Posterior sampling doesn’t work for LDA. I know, I know, you’re going to tell me that lots of smart people have applied Gibbs sampling to LDA. You might even point out that LingPipe has a Gibbs sampler for LDA.

Yes, these systems return values. But they’re not sampling from the posterior according to the posterior distribution. The number of modes in the posterior makes comprehensive posterior sampling intractable.

And I’m not just talking about what people call “label switching”, which refers to the non-identifiability of the indexes. Label switching just means that you can permute the K indices corresponding to topics and get the same model predictions. It adds another layer of intractability, but it’s not the one that’s interesting — the real problem is multimodality.

So what does everybody do? Some use Gibbs sampling to take a single sample and then go with it. This is not a Bayesian approach; see the description of Bayesian inference above. It’s not even clear you can get a single effective sample this way (or at least it’s unclear if you can convince yourself you have an effective sample).

Others use variational inference to compute an approximate posterior mean corresponding to a single posterior mode. Variational inference approximates the posterior variance as well as the posterior mean. But alas, they don’t work that way for LDA, where there is no meaningful posterior mean due to lack of identifiability.

So what to do? The output of LDA sure does look cool. And clients do love reading the tea leaves, even if you obfuscate them in a word cloud.

I only implemented Gibbs for LDA in LingPipe because it was all that I understood when I did it. But now that I understand Bayesian inference much better, I agree with the experts: just do variational inference. You’re going to get a lousy point-based representation of the posterior no matter what you do, but variational inference is fast.

Specifically, I recommend Vowpal Wabbit package’s online LDA algorithm, which is based on Matt Hoffman’s stochastic variational LDA algorithm. It’ll also be much much more scalable than LingPipe’s sampling-based approach because it works online and doesn’t need to store all the data in memory, much like stochastic gradient descent.

The algorithm’s clean and it’d be easy enough to add such an online LDA package to LingPipe so that we could compute topic models over dozens of gigabytes of text really quickly (as Matt did with Wikipedia in the paper). But I don’t know that anyone has the time or inclination to do it.


Get every new post delivered to your Inbox.

Join 822 other followers