DCA isn’t Just Tied Logistic Regression
After posting about discrete choice analysis (DCA) a few days ago, then adding a post about analyzing transportation choices, it occurred to me that DCA, in its most general form, isn’t really just logistic regression with parameter tying. In full generality, any number of choices can be presented at run time, whereas both the machine-learning max-ent style approaches and the stats-textbook approaches to logistic regression assume a fixed, finite set of outcome categories.
DCA for Within-Document Coreference
One of the projects on our plate right now involves within-document coreference resolution, which involves resolving which noun phrases in a document refer to the same entities. Thus it’s a bit more general than the classical anaphora resolution problem, which finds antecedents for pronouns. In our case, two mentions of “John Smith” in the same document need to be linked (if they refer to the same person; they won’t if you’re reading Amit Bagga and Breck’s John Smith resolution paper, or for that matter, the recreation of their results in LingPipe’s clustering tutorial).
LingPipe’s heuristic coreference package is based on Breck’s thesis work (called CogNIAC). If you troll through the code or read Breck’s paper, you’ll see that it’s essentially a greedy online algorithm that visits the entity mentions in a document in order and, for each mention, either links it to a previously built chain of mentions or starts a new chain consisting of just the current mention. In other words, it’s a greedy online clustering algorithm.
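The greedy online clustering loop can be sketched as follows. This is a minimal sketch of the control flow only, not CogNIAC itself: the `score` function stands in for the whole matcher/anti-matcher machinery, and the names and threshold are my own choices.

```python
def greedy_coref(mentions, score, threshold=0.0):
    """Greedy online clustering: link each mention to the best-scoring
    existing chain, or start a new chain if no score beats the threshold."""
    chains = []  # each chain is a list of mentions, in document order
    for mention in mentions:
        best_chain, best_score = None, threshold
        for chain in chains:
            s = score(chain, mention)  # stand-in for matchers/anti-matchers
            if s > best_score:
                best_chain, best_score = chain, s
        if best_chain is not None:
            best_chain.append(mention)  # link to best antecedent chain
        else:
            chains.append([mention])    # start a fresh chain
    return chains
```

Because each decision is made once and never revisited, the algorithm runs in a single pass over the mentions, which is what makes it online.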
The resolution of a mention is guided by matchers and anti-matchers that score candidate antecedent mention chains based on properties such as the closest matching alias (using a complex comparison allowing for missing tokens), known alias associations, discourse proximity (how far away the last mention in a chain is and how many mentions intervene), and entity type (person, location, or organization, plus gender for persons).
So far, there’s nothing for linking common nouns (as there is in ACE), and nothing for linking compound entities (e.g. “John and Mary … They …”).
A DCA Model
We could just throw all those features into a DCA-type model! The discrete choices consist of each candidate antecedent mention chain along with the choice of creating a new mention chain. If we can extract a feature vector φ(c, m) given a mention chain c and mention m, we’re in business. We could potentially even use the other mention chains in generating the features for mention m, as in φ(C, c, m), where C is the set of previous mention chains, c ∈ C is the candidate mention chain, and m is the mention being resolved. With all existing mention chains in play, we can also generate a reasonable feature vector φ_new(C, m) for the choice of creating a new mention chain out of m given the antecedent candidates C.
The model consists of a coefficient vector β. Given a mention m and set of potential antecedent chains C, the probability of a given antecedent is defined in the usual way for log-linear logistic-type models. For linking to an antecedent chain c ∈ C, we have

p(c | C, m) ∝ exp(β · φ(C, c, m)).

For a new chain, which we’ll write new, we have

p(new | C, m) ∝ exp(β · φ_new(C, m)),

with the normalization constant summing over all chains in C plus the new-chain option.
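Concretely, normalizing over a variable-sized choice set looks like an ordinary softmax, just with a different number of choices per decision. Here is a minimal sketch in Python; the sparse dict representation of feature vectors and coefficients is my own choice, not anything from LingPipe:

```python
import math

def dca_probs(beta, feature_vecs):
    """Probability of each choice under a log-linear model with shared
    coefficients beta.  feature_vecs holds one sparse feature vector per
    choice (the candidate chains plus the new-chain option); the number
    of choices may differ from instance to instance."""
    def dot(phi):
        return sum(beta.get(f, 0.0) * v for f, v in phi.items())
    scores = [dot(phi) for phi in feature_vecs]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)    # normalizer over this instance's choice set
    return [e / z for e in exps]
```

Note that β is the only parameter; that is the parameter tying, and it is what lets the same model score two candidates for one mention and ten for the next.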
Given annotated data and a feature extractor, the training cases are easy to create. Each instance consists of the feature vectors for all previous mention chains and for creating a fresh chain, along with the decision that was made.
Does Any NLP Package Implement DCA?
Alas, LingPipe can’t handle these models the way I wrote logistic regression. In fact, I’m not sure anyone’s NLP package can handle the full DCA type models.
I’m afraid I can never quite figure out from the Mallet javadoc what exactly it can and can’t do, because everything’s so abstract, there are so many implementations of the interfaces, and there’s so little high-level doc. But classes like cc.mallet.classify.MaxEnt seem to assume a fixed set of output categories.
It also doesn’t look like Weka supports DCA-style models from the Weka Javadoc, but again, the interfaces are so general, and the implementations so plentiful, I’m not sure.
I’d love to hear about which packages do support this kind of thing. Presumably something the econometricians are using would do it.
Implementing DCA with SGD
DCA won’t be at all hard to write. In fact, it should be straightforward now that I’ve just finished CRFs (stay tuned for the release, probably next week).
Stochastic gradient descent (or a quasi-Newton method) is easy to adapt. The correct antecedent candidate or choice to create a new chain is treated as a positive (1) outcome, with all other candidate antecedent mention chains treated as negative (0) outcomes.
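Under the logistic loss, one SGD step per resolution decision works out to the familiar observed-minus-expected feature update. A minimal sketch, self-contained and with my own sparse-dict representation (not any existing package's API):

```python
import math

def sgd_step(beta, feature_vecs, true_idx, rate=0.1):
    """One SGD step on the softmax (logistic) loss for one resolution
    decision.  The choice at true_idx (an antecedent chain or the
    new-chain option) is the positive outcome; all others are negative.
    The gradient is observed minus expected feature values."""
    def dot(phi):
        return sum(beta.get(f, 0.0) * v for f, v in phi.items())
    scores = [dot(phi) for phi in feature_vecs]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    for i, phi in enumerate(feature_vecs):
        observed = 1.0 if i == true_idx else 0.0
        for f, v in phi.items():
            # move beta toward the observed choice, away from the rest
            beta[f] = beta.get(f, 0.0) + rate * (observed - probs[i]) * v
    return beta
```

Each training instance supplies its own choice set, so nothing here assumes a fixed number of outcome categories.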
Of course, we can put priors on the coefficients.
It sure would be nice to be able to create Bayesian estimators. Having just blogged about MLE and MAP’s bias, I’m worried about the MAP estimates for logistic regression all these systems (including LingPipe) use.
OK, you could use SVMs, too
Of course, SVM (hinge), perceptron (0/1), or other simple loss functions could be used instead of the logistic loss. SGD will still work as is. And you could use non-linear kernels, which often squeeze a bit more accuracy out for very little work (contrasted with hand-tuning non-linear features, such as interaction features, in a standard linear model).
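To illustrate how little changes when the loss is swapped, here is a perceptron-style (0/1 loss) update for the same setup; as before, the sparse-dict representation and names are my own assumptions, and the rest of the SGD loop stays the same:

```python
def perceptron_step(beta, feature_vecs, true_idx, rate=0.1):
    """Perceptron-style update for one resolution decision: if the
    model's argmax choice is wrong, move beta toward the true choice's
    features and away from the predicted choice's features."""
    def dot(phi):
        return sum(beta.get(f, 0.0) * v for f, v in phi.items())
    pred = max(range(len(feature_vecs)),
               key=lambda i: dot(feature_vecs[i]))
    if pred != true_idx:  # no update on a correct prediction
        for f, v in feature_vecs[true_idx].items():
            beta[f] = beta.get(f, 0.0) + rate * v
        for f, v in feature_vecs[pred].items():
            beta[f] = beta.get(f, 0.0) - rate * v
    return beta
```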
Coreference Training Data?
A couple months ago, I blogged about Massimo Poesio et al.’s effort to collect a coreference corpus using the Phrase Detectives annotation game. This should be out soon, and it should be much larger than the MUC and ACE datasets.