AKA, I’m going to use discrete choice analysis (DCA) to provide a probabilistic model for the learning as search optimization (LaSO) paradigm, and I’ll fit it with stochastic gradient descent (SGD). Too bad LaSO isn’t a three-letter acronym, or I’d have hit the TLA trifecta.

It occurred to me to do this after considering DCA for coref and then reading Daumé and Marcu’s paper, “A Large-Scale Exploration of Effective Global Features for a Joint Entity Detection and Tracking Model”.

(Hint to authors: anything you call “large scale” today will look tiny tomorrow.)

The paper does a really nice job of describing their LaSO technique and demonstrating its utility for within-document coreference resolution. There are other, more general papers on LaSO, but this one is nice in that it works through a clean applied example of a problem we care about here at Alias-i: within-document coreference.

What’s really cool is that they jointly solve the entity extraction and coref problem, and get to use arbitrary history-based features.


The LaSO paradigm is simply one of scoring various search states. You extract feature vectors from a search state and input. At a given search state, you can generate a finite number of next possible search states consistent with the input. These are then scored using the feature vectors.
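In code, that scoring loop might be sketched as follows. This is a minimal greedy sketch under my own naming conventions (`score`, `best_next_state`, `successors`, `features` are all my assumptions, not anything from the LaSO papers):

```python
def score(weights, feature_vec):
    """Linear score: dot product of a sparse weight map and feature map."""
    return sum(weights.get(f, 0.0) * v for f, v in feature_vec.items())

def best_next_state(weights, state, successors, features):
    """One greedy step: expand `state` and keep the highest-scoring successor."""
    return max(successors(state), key=lambda s: score(weights, features(s)))
```

Here `successors(state)` enumerates the finite set of next search states consistent with the input, and `features(s)` extracts a sparse feature vector from a candidate state.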

The trick is in the training, where they consider a single deviation from the correct search path at every step. They train these decisions discriminatively.
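A perceptron-style version of that single-deviation update might look like the sketch below (again my own naming, not Daumé and Marcu’s code): whenever an incorrect successor scores at least as high as the successor on the gold path, nudge the weights toward the gold features and away from the offender’s.

```python
def laso_perceptron_step(weights, gold_feats, wrong_feats_list, lr=1.0):
    """Perceptron-like update at one search step: penalize any incorrect
    successor that scores at least as high as the correct one."""
    def score(fv):
        return sum(weights.get(f, 0.0) * v for f, v in fv.items())
    for wrong in wrong_feats_list:
        if score(wrong) >= score(gold_feats):
            for f, v in gold_feats.items():
                weights[f] = weights.get(f, 0.0) + lr * v
            for f, v in wrong.items():
                weights[f] = weights.get(f, 0.0) - lr * v
    return weights
```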

It’s more general than the kind of history-based parsing that I used in my very first paper on stat NLP.

In the usual history-based parsing setup, there was a fixed set of choices corresponding to parser operations. For instance, Manning and I used left-corner derivations and Black et al. used top-down derivations. In the LaSO setup, there can be an arbitrary number of choices. For instance, when considering to which antecedent a pronoun refers, the number of antecedents isn’t bounded in advance.


Daumé and Marcu consider two non-probabilistic scoring approaches, a perceptron-like approach and an approximate wide-margin approach.

Given the setup, we can model the finite set of choices from any given search state using DCA. This in turn provides a proper probabilistic measure over the entire set of possible outputs.
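Concretely, the DCA (conditional logit) model puts a softmax over whatever finite choice set the current search state offers, so the choice set can differ in size from step to step. A minimal sketch (function names are mine):

```python
import math

def choice_probs(weights, feature_vecs):
    """DCA / conditional-logit probabilities over a finite choice set:
    P(i) = exp(score_i) / sum_j exp(score_j)."""
    scores = [sum(weights.get(f, 0.0) * v for f, v in fv.items())
              for fv in feature_vecs]
    m = max(scores)  # subtract the max before exponentiating, for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

The probability of a complete output is then the product of these local probabilities along its search path, which is what makes the model normalize per decision rather than over whole outputs.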

I’m not sure it’s going to work well in practice because it’ll locally normalize all the decisions. If you look at Andrew McCallum’s work on CRF-like approaches to structured prediction problems like coref, or if you look at Daumé, Marcu and Langford’s extension of LaSO to SEARN, one might think about normalizing on a more global level. The problem, as usual, is the normalization constant.

On the other hand, considering Ratinov and Roth’s experiments with entity detection, it seems like it’s worth pursuing the simple greedy (or more generally beam-based) approach.
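A generic beam decoder over scored search states could look like this sketch (all names are illustrative, not from any of the papers; setting `beam=1` recovers the greedy decoder):

```python
def beam_search(weights, init, successors, features, is_final, beam=4):
    """Keep the `beam` highest-scoring states per step until all survivors
    are final, then return the best final state."""
    def score(fv):
        return sum(weights.get(f, 0.0) * v for f, v in fv.items())
    states = [(0.0, init)]  # (cumulative score, state) pairs
    while not all(is_final(s) for _, s in states):
        expanded = []
        for sc, s in states:
            if is_final(s):
                expanded.append((sc, s))  # carry finished states forward
            else:
                for n in successors(s):
                    expanded.append((sc + score(features(n)), n))
        states = sorted(expanded, key=lambda x: -x[0])[:beam]
    return max(states, key=lambda x: x[0])[1]
```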

Fitting with SGD

It’s easy to generate maximum a posteriori (MAP) coefficient estimates for the DCA model using stochastic gradient descent (SGD). In fact, it’s just like fitting other DCA models. Not coincidentally, I finished unit-testing DCA fitting via SGD yesterday.
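For a single observed decision, the log-likelihood gradient of the conditional-logit model is the chosen alternative’s feature vector minus the probability-weighted average over the choice set, and a Gaussian prior adds a weight-decay term. One SGD step might be sketched like this (generic multinomial-logit SGD under my own naming, not LingPipe’s actual API):

```python
import math

def sgd_update(weights, choice_feats, chosen, lr=0.1, prior_var=1.0):
    """One stochastic gradient ascent step on the DCA log-likelihood
    with a Gaussian (L2) prior, i.e., a MAP objective.
    choice_feats: list of sparse feature dicts, one per alternative;
    chosen: index of the observed choice."""
    scores = [sum(weights.get(f, 0.0) * v for f, v in fv.items())
              for fv in choice_feats]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    grad = dict(choice_feats[chosen])            # + x_chosen
    for p, fv in zip(probs, choice_feats):
        for f, v in fv.items():
            grad[f] = grad.get(f, 0.0) - p * v   # - E[x] under the model
    for f, g in grad.items():
        w = weights.get(f, 0.0)
        weights[f] = w + lr * (g - w / prior_var)  # prior as weight decay
    return weights
```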

Look for it in the next LingPipe release. I’ll either figure out how to abstract the entire search, like LaSO, or just provide the DCA fitter and a coref-based example.

3 Responses to “DCA LaSO via SGD”

  1. hal Says:

    In the “authors” defense, large-scale wasn’t meant to refer to the data size, but to the analysis :). I.e., most preceding work in EDT hadn’t really tried to figure out what was going on, but just tried to boost numbers. In retrospect, it’s a bad title because large-scale (the other version) was really taking off at that point, but we weren’t trying to claim that we were using big data sets.

    • lingpipe Says:

      I think you probably underestimate how many other groups were trying to figure out what was going on.

      I feel the same way about techniques described as “advanced” or “sophisticated”. One generation’s sophisticated is the next generation’s old hat.

      I’m still not sure what you mean by “the analysis”. You certainly report lots of feature ablation results (though not all 2**N). After the title, “large” doesn’t show up until page 2, where you mention “a large collection of features” and “large corpora”.

      The size that impressed me the most in the paper is the number of features you generated. I was actually going to ask how long it took to put all those features together. I feel like I’ve seen almost all of those features before (or maybe it was after). Just not all put together and used for joint coref and entity tagging.

      The scope of the feature set and shortness of the paper bring up the problem, which I believe you or someone else addressed in an earlier blog post, of reproducibility. I don’t feel I could reproduce the results of this paper because of the open-ended feature descriptions such as “eight hand-written patterns for identifying pleonastic ‘it’ and ‘that’” or “syntactic features based on running an in-house state of the art part of speech tagger” or “we used similar data mined from a 138GB web corpus”.

    • lingpipe Says:

      Hal — what I’d really like to know is whether a per-decision probabilistic version of LaSO has been tried or if you think it’d be too restrictive to normalize that way.
