AKA, I’m going to use discrete choice analysis (DCA) to provide a probabilistic model for the learning as search optimization (LaSO) paradigm, and I’ll fit it with stochastic gradient descent (SGD). Too bad LaSO isn’t a TLA, or I’d have hit the trifecta.
It occurred to me to do this after considering DCA for coref and then reading Daumé and Marcu’s paper:
- Daumé, Hal III, and Daniel Marcu. 2005. A Large-Scale Exploration of Effective Global Features for a Joint Entity Detection and Tracking Model. In HLT/EMNLP.
(Hint to authors: anything you call “large scale” today will look tiny tomorrow.)
The paper does a really nice job of describing their LaSO technique and demonstrating its utility for within-document coreference resolution. There are other, more general papers on LaSO, but this one’s nice in that it works through a clean applied example of a problem we care about here at Alias-i.
What’s really cool is that they solve the entity extraction and coref problems jointly, and get to use arbitrary history-based features.
The LaSO paradigm is simply one of scoring search states. At a given search state, you can generate a finite number of next possible search states consistent with the input, and each of these is scored using a feature vector extracted from the state and the input.
The trick is in the training, where they consider a single deviation from the correct search path at every step and train those decisions discriminatively.
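To make that concrete, here’s a minimal sketch of the search-state abstraction in Java. The interface and method names are mine, invented for illustration; they’re not Daumé and Marcu’s notation or any LingPipe API.

```java
import java.util.List;

// Hypothetical search-state abstraction for LaSO-style search;
// a sketch of the paradigm, not LingPipe API.
interface SearchState<I> {

    // the finite set of next search states consistent with the input;
    // empty if this state is a complete output
    List<SearchState<I>> successors(I input);

    // the feature vector extracted from this state and the input
    double[] features(I input);
}
```

Each step of search, and hence of training, is then a decision over the finite successor set of the current state.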
It’s more general than the kind of history-based parsing introduced in:
- Black, Ezra, Fred Jelinek, John Lafferty, David M. Magerman, Robert Mercer, and Salim Roukos. 1992. Towards History-based Grammars: Using Richer Models for Probabilistic Parsing. In HLT.
that I used in my very first paper on stat NLP:
- Manning, Christopher D. and Bob Carpenter. 1997. Left-corner Language Models and Parsing. In IWPT.
In the usual history-based parsing setup, there was a fixed set of choices corresponding to parser operations. For instance, Manning and I used left-corner derivations and Black et al. used top-down derivations. In the LaSO setup, there can be an arbitrary number of choices. For instance, when considering to which antecedent a pronoun refers, the number of antecedents isn’t bounded in advance.
DCA + LaSO
Daumé and Marcu consider two non-probabilistic scoring approaches: a perceptron-like approach and an approximate wide-margin approach.
Given the setup, we can model the finite set of choices from any given search state using DCA. This in turn provides a proper probabilistic measure over the entire set of possible outputs.
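Concretely, in my notation rather than theirs: if the current state $s$ has successor states $s_1, \ldots, s_n$ with feature vectors $x(s_1), \ldots, x(s_n)$, and $\beta$ is the coefficient vector, the DCA model is just a multi-logit over the choices,

$$
p(s_k \mid s) = \frac{\exp(\beta^{\top} x(s_k))}{\sum_{j=1}^{n} \exp(\beta^{\top} x(s_j))},
$$

and chaining the local decisions gives a properly normalized probability for a complete search path $s_0, s_1, \ldots, s_T$:

$$
p(s_0, s_1, \ldots, s_T) = \prod_{t=1}^{T} p(s_t \mid s_{t-1}).
$$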
I’m not sure it’s going to work well in practice, because it locally normalizes all the decisions. If you look at Andrew McCallum’s work on CRF-like approaches to structured prediction problems like coref, or at Daumé, Marcu and Langford’s extension of LaSO to SEARN, you might think about normalizing at a more global level. The problem, as usual, is computing the normalization constant.
On the other hand, considering Ratinov and Roth’s experiments with entity detection, it seems like it’s worth pursuing the simple greedy (or more generally beam-based) approach.
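Here’s what greedy decoding looks like over the hypothetical SearchState interface sketched above; a beam version would carry the top-k states by score at each step instead of only the best. Again, a sketch under my invented names, not LingPipe code.

```java
import java.util.List;

// Greedy decoding over the hypothetical SearchState interface above;
// a beam search would keep the top-k scoring states at each step
// instead of just the single best.
class GreedyDecoder {

    static <I> SearchState<I> decode(SearchState<I> start, I input,
                                     double[] beta) {
        SearchState<I> state = start;
        List<SearchState<I>> succs = state.successors(input);
        while (!succs.isEmpty()) {
            SearchState<I> best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (SearchState<I> next : succs) {
                double score = dot(beta, next.features(input));
                if (score > bestScore) {
                    bestScore = score;
                    best = next;
                }
            }
            state = best;
            succs = state.successors(input);
        }
        return state;
    }

    // inner product of the coefficient vector and a feature vector
    static double dot(double[] beta, double[] x) {
        double sum = 0.0;
        for (int d = 0; d < x.length; ++d)
            sum += beta[d] * x[d];
        return sum;
    }
}
```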
Fitting with SGD
It’s easy to generate maximum a posteriori (MAP) coefficient estimates for the DCA model using stochastic gradient descent (SGD). In fact, it’s just like fitting other DCA models. Not coincidentally, I finished unit-testing DCA fitting via SGD yesterday.
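For concreteness, here’s roughly what a single SGD step looks like for the DCA model, with a Gaussian prior supplying the MAP regularization. This is a sketch of the math above, using hypothetical names, not the LingPipe implementation.

```java
// One stochastic gradient step for the DCA model on a single
// multiple-choice event, with a Gaussian prior for MAP estimation.
// A sketch, not the LingPipe implementation.
class DcaSgd {

    // xs[k] is the feature vector for choice k; correct is the index
    // of the true choice; beta is updated in place
    static void step(double[][] xs, int correct, double[] beta,
                     double learningRate, double priorVariance) {
        int numChoices = xs.length;
        int dim = beta.length;

        // linear scores for each choice
        double[] probs = new double[numChoices];
        double max = Double.NEGATIVE_INFINITY;
        for (int k = 0; k < numChoices; ++k) {
            double score = 0.0;
            for (int d = 0; d < dim; ++d)
                score += beta[d] * xs[k][d];
            probs[k] = score;
            if (score > max)
                max = score;
        }

        // softmax, subtracting the max score for numerical stability
        double z = 0.0;
        for (int k = 0; k < numChoices; ++k) {
            probs[k] = Math.exp(probs[k] - max);
            z += probs[k];
        }
        for (int k = 0; k < numChoices; ++k)
            probs[k] /= z;

        // gradient of the log posterior: x_correct minus the expected
        // feature vector, minus the Gaussian prior term beta/sigma^2
        for (int d = 0; d < dim; ++d) {
            double grad = xs[correct][d];
            for (int k = 0; k < numChoices; ++k)
                grad -= probs[k] * xs[k][d];
            grad -= beta[d] / priorVariance;
            beta[d] += learningRate * grad;
        }
    }
}
```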
Look for it in the next LingPipe release. I’ll either figure out how to abstract the entire search, as LaSO does, or just provide the DCA fitter and a coref-based example.