DCA for Discriminative Coreference Resolution


DCA isn’t Just Tied Logistic Regression

After posting about discrete choice analysis (DCA) a few days ago, then adding a post about analyzing transportation choices, it occurred to me that, in its most general form, DCA isn't really just a form of logistic regression with parameter tying. In its most general form, any number of choices can be presented at run time, whereas in both the machine-learning max-ent style approaches and the stats-textbook approaches to logistic regression, there is always a fixed, finite set of outcome categories.

DCA for Within-Document Coreference

One of the projects on our plate right now involves within-document coreference resolution, which involves resolving which noun phrases in a document refer to the same entities. Thus it’s a bit more general than the classical anaphora resolution problem, which finds antecedents for pronouns. In our case, two mentions of “John Smith” in the same document need to be linked (if they refer to the same person; they won’t if you’re reading Amit Bagga and Breck’s John Smith resolution paper, or for that matter, the recreation of their results in LingPipe’s clustering tutorial).

LingPipe’s heuristic coreference package is based on Breck’s thesis work (called CogNIAC). If you troll through the code or read Breck’s paper, you’ll see that it visits the entity mentions in a document in order, and for each mention either links it to a previously constructed chain of mentions or starts a new chain consisting of just the current mention. Essentially, it’s a greedy online clustering algorithm.

The resolution of a mention is guided by matchers and anti-matchers that score candidate antecedent mention chains based on properties such as the closest matching alias (using a complex comparison allowing for missing tokens), known alias associations, discourse proximity (how far away the last mention in a chain is and how many mentions intervene), and entity type (person, location, and organization, and for persons, gender).

So far, there’s nothing for linking common nouns (as there is in ACE), and nothing for linking compound entities (e.g. “John and Mary … They …”).

A DCA Model

We could just throw all those features into a DCA type model! The discrete choices consist of each candidate antecedent mention chain along with the choice of creating a new mention chain. If we can extract a feature vector \phi(c,m) \in \mathbb{R}^n given a mention chain c and mention m, we’re in business. We could potentially even use the other mention chains in generating the features for mention m, as in \phi(\mathcal{C},c,m) where \mathcal{C} is the set of previous mention chains, c \in \mathcal{C} is the candidate mention chain, and m is the mention being resolved. With all existing mention chains in play, we can also generate a reasonable feature vector \psi(\mathcal{C},m) for the choice of creating a new mention chain out of m given the antecedent candidates \mathcal{C}.

The model consists of a coefficient vector \beta \in \mathbb{R}^n. Given a mention m and set of potential antecedent chains \mathcal{C}, the probability of a given antecedent is defined in the usual way for log-linear logistic-type models. For linking to an antecedent chain c \in \mathcal{C}, we have

p(c|\mathcal{C},m) \propto \exp(\beta^{\top} \phi(\mathcal{C},c,m)).

For a new chain, which we’ll write n, we have

p(n|\mathcal{C},m) \propto \exp(\beta^{\top} \psi(\mathcal{C},m)).
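The two proportionality statements above can be sketched directly in code. This is a minimal, hypothetical sketch (the names `antecedent_probs`, `phi`, and `psi` are my own, not from any package): it scores each candidate chain plus the new-chain choice with the same coefficient vector, then normalizes with a softmax.

```python
import math
from typing import Callable, List, Sequence


def antecedent_probs(
    beta: Sequence[float],
    chains: List[object],
    mention: object,
    phi: Callable,    # phi(chains, chain, mention) -> feature vector
    psi: Callable,    # psi(chains, mention) -> feature vector for a new chain
) -> List[float]:
    """Return a probability for each chain in `chains`, with the
    new-chain choice appended as the final entry."""
    def dot(v):
        return sum(b * x for b, x in zip(beta, v))

    # One score per candidate antecedent chain, plus one for "start new chain".
    scores = [dot(phi(chains, c, mention)) for c in chains]
    scores.append(dot(psi(chains, mention)))

    # Softmax with max-subtraction for numerical stability.
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Note that the number of candidates varies per decision, which is exactly what the fixed-category logistic regression implementations can’t express.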

Given annotated data and a feature extractor, the training cases are easy to create. Each instance consists of the features for all previous mention chains and for creating a fresh chain, along with the decision that was made.

Does Any NLP Package Implement DCA?

Alas, LingPipe can’t handle these models the way I wrote logistic regression. In fact, I’m not sure anyone’s NLP package can handle the full DCA type models.

I’m afraid I can never quite figure out from the Mallet javadoc what exactly it can and can’t do because everything’s so abstract and there are so many implementations of the interfaces and so little high-level doc. But classes like cc.mallet.classify.MaxEnt seem to assume a fixed set of output categories.

The Open NLP javadoc is much more straightforward; their opennlp.maxent.GISModel also requires a fixed set of categories.

It also doesn’t look like Weka supports DCA-style models from the Weka Javadoc, but again, the interfaces are so general, and the implementations so plentiful, I’m not sure.

I’d love to hear about which packages do support this kind of thing. Presumably something the econometricians are using would do it.

Implementing DCA with SGD

DCA won’t be at all hard to write. In fact, having just finished CRFs (stay tuned for the release, probably next week), most of the relevant machinery is already in place.

Stochastic gradient descent (or a quasi-Newton method) is easy to adapt. The correct antecedent candidate or choice to create a new chain is treated as a positive (1) outcome, with all other candidate antecedent mention chains treated as negative (0) outcomes.
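Here’s a hypothetical sketch of one such SGD step for a single resolution decision. The function name `sgd_update` and the flat feature-vector representation are my own simplifications: each choice (candidate chain or new chain) is reduced to its feature vector, and the gradient of the log loss is expected features minus observed features. The `l2` argument folds in the Gaussian-prior shrinkage discussed below.

```python
import math
from typing import List, Sequence


def sgd_update(
    beta: List[float],
    candidates: Sequence[Sequence[float]],  # one feature vector per choice,
                                            # new-chain choice included
    true_index: int,                        # index of the correct choice
    lr: float = 0.1,
    l2: float = 0.0,                        # Gaussian-prior (ridge) shrinkage
) -> None:
    """One in-place stochastic gradient step on the log loss for a
    single antecedent-resolution decision."""
    def dot(v):
        return sum(b * x for b, x in zip(beta, v))

    # Current model probabilities over the candidate choices.
    scores = [dot(x) for x in candidates]
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]

    # Gradient of negative log-likelihood: E[features] - observed features.
    for j in range(len(beta)):
        expected = sum(p * x[j] for p, x in zip(probs, candidates))
        grad = expected - candidates[true_index][j]
        beta[j] -= lr * (grad + l2 * beta[j])
```

The correct choice plays the role of the positive (1) outcome and all other candidates the negative (0) outcomes, just as described above.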

Coefficient Priors

Of course, we can put priors on the coefficients.

It sure would be nice to be able to create Bayesian estimators. Having just blogged about the bias of MLE and MAP estimates, I’m worried about the MAP estimates for logistic regression that all these systems (including LingPipe) use.

OK, you could use SVMs, too

Of course, SVM (hinge), perceptron (0/1), or other simple loss functions could be used instead of the logistic loss function. SGD will still work as is. And you could use non-linear kernels, which often squeeze a bit more accuracy out for very little work (contrasted with hand tuning non-linear features, such as interaction features, in a standard linear model).
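For the perceptron (0/1) case, the update is even simpler than the probabilistic one. A hypothetical sketch (the name `perceptron_update` is my own), using the same variable-length candidate setup: only the top-scoring wrong choice triggers an update.

```python
from typing import List, Sequence


def perceptron_update(
    beta: List[float],
    candidates: Sequence[Sequence[float]],  # one feature vector per choice
    true_index: int,
    lr: float = 1.0,
) -> None:
    """Perceptron-style update for one resolution decision: if the
    top-scoring candidate isn't the correct one, move the coefficients
    toward the correct choice's features and away from the prediction's."""
    def dot(v):
        return sum(b * x for b, x in zip(beta, v))

    pred = max(range(len(candidates)), key=lambda i: dot(candidates[i]))
    if pred != true_index:
        for j in range(len(beta)):
            beta[j] += lr * (candidates[true_index][j] - candidates[pred][j])
```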

Coreference Training Data?

Well, there were the MUC evaluations, which kicked things off, and then ACE more recently. MUC-6 and MUC-7 and several years of ACE datasets are available from the LDC for a fee.

A couple of months ago, I blogged about Massimo Poesio et al.’s effort to collect a coreference corpus using the Phrase Detectives annotation game. This should be out soon, and should be much larger than the MUC and ACE datasets.

2 Responses to “DCA for Discriminative Coreference Resolution”

  1. Aria Haghighi Says:

    I actually have forthcoming work that does precisely this, but the model is trained in an unsupervised way (I’m locally normalizing each decision so the partition function isn’t too hairy). The same decision is made for each mention (either an existing cluster or a new one). The features \phi(m,c,\mathcal{C}) look like: does the cluster c have the closest mention (in tree or raw distance) of any cluster in \mathcal{C}? Is m an indefinite nominal (typically such a nominal is a new entity, predicate-nominal constructions notwithstanding)? I didn’t at all know about the DCA connection, and I’ll be sure to mention it in a final version of the paper.

  2. Dave Lewis Says:

    Good point! I neglected to mention this in my talk at Columbia, since we were dealing with a fixed set of classes there. In DCA you can present any subset of classes for a test example, both ones that occur in the training data and ones that don’t, as long as you provide the corresponding attribute values for them, and those attribute values have tied parameters.

    You can sort of fake this with current MaxEnt software (for instance Hal Daumé III’s MegaM) by training a separate model for each test example, but that’s really ugly. I’ve been fooling around with various ways to handle this in BOXER (forthcoming successor to BXR). One tricky issue is how to handle the intercept parameter, where I think one probably needs a method like those used to estimate multinomials over large vocabularies (e.g. Friedman, N. and Singer, Y. Efficient Bayesian Parameter Estimation in Large Discrete Domains, NIPS, 1999).
