It’s more fun with classifier algebra today, as we look in on our friends at Thomson Reuters R&D:
- Veeramachaneni, Sriharsha and Ravi Kumar Kondadadi (2009) Surrogate Learning – From Feature Independence to Semi-Supervised Classification. In NAACL Workshop on Semi-Supervised Learning.
The Algebra
The problem is binary classification into a class y in {0,1}, and the modeling assumption is that inputs consist of pairs of vectors x = (x1,x2) where x1 in X1 and x2 in X2, and where the x1 and x2 are conditionally independent given the class y:
p(x1,x2|y) = p(x1|y) * p(x2|y)
Note that we don’t have to make the co-training assumption that both x1 and x2 are sufficient to build a classifier in and of themselves. To prevent divide-by-zero problems, they also assume p(x1|x2) > 0 and p(x1|y) > 0, and that p(x1|y=0) != p(x1|y=1). With this, they derive a formula for p(y|x) involving only p(x1|y) and p(x1|x2).
p(y=0|x1,x2)
= [ p(x1|y=0) / p(x1|x2) ]
* [ p(x1|y=1) - p(x1|x2) ] / [ p(x1|y=1) - p(x1|y=0) ]
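The identity takes only a couple of lines to get from the assumptions; here’s my own reconstruction of the derivation (the paper’s is essentially the same). First marginalize x1 over the classes given x2, then solve for p(y=0|x2), then apply Bayes’ rule:

```latex
% marginalize over y, using conditional independence p(x1 | y, x2) = p(x1 | y)
p(x_1 \mid x_2) = p(x_1 \mid y{=}0)\, p(y{=}0 \mid x_2)
                + p(x_1 \mid y{=}1)\, p(y{=}1 \mid x_2)

% substitute p(y=1 | x2) = 1 - p(y=0 | x2) and solve for p(y=0 | x2)
p(y{=}0 \mid x_2) = \frac{p(x_1 \mid y{=}1) - p(x_1 \mid x_2)}
                         {p(x_1 \mid y{=}1) - p(x_1 \mid y{=}0)}

% Bayes' rule, using conditional independence again
p(y{=}0 \mid x_1, x_2)
  = \frac{p(x_1 \mid y{=}0)\, p(y{=}0 \mid x_2)}{p(x_1 \mid x_2)}
```

Substituting the second equation into the third gives the formula above, and makes it clear why the difference p(x1|y=1) − p(x1|y=0) has to be nonzero.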
What’s neat about their analysis is that you can train p(x1|x2) on unlabeled data, with no supervision for y at all. If you can get a good handle on p(x1|y) in some way, you can bring in all the information about x2 through p(x1|x2). As the authors note, their approach is in many ways similar to Ando and Zhang’s 2007 two-view model.
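To convince myself the algebra works, here’s a tiny numeric check on a fully specified generative model with binary x1 and x2 (all the probabilities below are made up): the surrogate formula matches the posterior computed directly from the joint.

```python
def surrogate_posterior(pi1, p_x1_y, p_x2_y, x1, x2):
    """p(y=0 | x1, x2) via the surrogate identity; p_x1_y[y] = p(x1=1 | y)."""
    py = {0: 1.0 - pi1, 1: pi1}
    lik_x1 = {y: p_x1_y[y] if x1 else 1.0 - p_x1_y[y] for y in (0, 1)}
    lik_x2 = {y: p_x2_y[y] if x2 else 1.0 - p_x2_y[y] for y in (0, 1)}
    # p(y | x2) by Bayes' rule on the x2 view alone
    z = py[0] * lik_x2[0] + py[1] * lik_x2[1]
    p_y_x2 = {y: py[y] * lik_x2[y] / z for y in (0, 1)}
    # p(x1 | x2): marginalize y out, using conditional independence
    p_x1_x2 = lik_x1[0] * p_y_x2[0] + lik_x1[1] * p_y_x2[1]
    # the surrogate formula: only p(x1|y) and p(x1|x2) appear
    return (lik_x1[0] / p_x1_x2) * (lik_x1[1] - p_x1_x2) / (lik_x1[1] - lik_x1[0])

def direct_posterior(pi1, p_x1_y, p_x2_y, x1, x2):
    """p(y=0 | x1, x2) computed straight from the joint, for comparison."""
    py = {0: 1.0 - pi1, 1: pi1}
    joint = {y: py[y] * (p_x1_y[y] if x1 else 1.0 - p_x1_y[y])
                      * (p_x2_y[y] if x2 else 1.0 - p_x2_y[y])
             for y in (0, 1)}
    return joint[0] / (joint[0] + joint[1])

params = (0.3, {0: 0.2, 1: 0.9}, {0: 0.4, 1: 0.7})
for x1 in (0, 1):
    for x2 in (0, 1):
        assert abs(surrogate_posterior(*params, x1, x2)
                   - direct_posterior(*params, x1, x2)) < 1e-12
```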
Special Case Examples
They run two examples, both of which are very real problems for Thomson Reuters: one on physician record linkage and one on discovering paraphrases for merger and acquisition sentences. In both experiments they make the further assumption that x1 is a binary variable with 100% recall for y=1, so that p(y=1, x1=0) = 0, which lets them loosen the independence restriction to the case y=0, namely
p(x1,x2|y=0) = p(x1|y=0) * p(x2|y=0)
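Plugging p(x1=1|y=1) = 1 into the general formula (my own substitution; the paper relies on this case implicitly) gives:

```latex
p(y{=}0 \mid x_1{=}1, x_2)
  = \frac{p(x_1{=}1 \mid y{=}0)}{p(x_1{=}1 \mid x_2)}
  \cdot \frac{1 - p(x_1{=}1 \mid x_2)}{1 - p(x_1{=}1 \mid y{=}0)}
```

Since (1 − s)/s is decreasing in s, this is monotone decreasing in the surrogate probability p(x1=1|x2), so ranking instances by p(x1=1|x2) is the same as ranking them by p(y=1|x1=1,x2).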
Physician Record Linkage
For the record-linkage case, they are classifying pairs of physician records as to whether they refer to the same physician (y=1) or not (y=0). The 100%-recall feature x1 is defined to be “have the same year of graduation”. That is, they’re assuming that if two records refer to the same physician, they have the same year of graduation. x2 is then the remaining information, and they estimate p(x1|x2) using logistic regression. They estimate p(x1|y=0) with a simple frequency estimate over random samples of pairs of physicians (some of those pairs might actually match, but matches are sparse enough not to matter).
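Here’s a sketch of the special-case recipe in code, under my reading of the setup: p_x1_x2 would come from a logistic regression predicting “same graduation year” from the other record fields, and p_x1_y0 from the frequency of matching graduation years among random pairs (both numbers below are purely illustrative).

```python
def p_same_physician(p_x1_x2, p_x1_y0):
    """p(y=1 | x1=1, x2) in the special case where p(x1=1 | y=1) = 1.

    p_x1_x2: estimated p(x1=1 | x2), e.g. from a logistic regression
             trained on unlabeled record pairs
    p_x1_y0: estimated p(x1=1 | y=0), e.g. the rate of shared
             graduation years among random (mostly non-matching) pairs
    """
    # the general surrogate formula with p(x1=1 | y=1) = 1 plugged in
    p_y0 = (p_x1_y0 / p_x1_x2) * (1.0 - p_x1_x2) / (1.0 - p_x1_y0)
    return 1.0 - p_y0

# if the regression is confident the pair "should" share a graduation
# year, the match probability is high; if p(x1|x2) collapses to the
# background rate p(x1|y=0), the match probability collapses to zero
high = p_same_physician(0.9, 0.05)
base = p_same_physician(0.05, 0.05)
```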
They show that their approach does a pretty good job compared to training a logistic regression model on a few hundred supervised instances (though it’s not overwhelmingly better). The improvements over the supervised case all come in recall, which is not surprising. It’d be nice to see the whole precision-recall curve for the problem, as is typical for record-linkage applications (see, e.g., one of the first examples in Gelman et al.’s Bayesian Data Analysis, which covers calibrating record linkage, a closely related problem).
They did a nice job of dealing with missing data, too, but that’s really a separate issue to some extent.
Paraphrasing Merger and Acquisition Sentences
This is a neat bootstrap (in the colloquial sense, not the statistical one) approach to finding sentences about mergers and acquisitions between pairs of companies. It’s similar to Agichtein and Gravano’s Snowball system. They used a named-entity detector to tag the data, then restricted attention to sentences containing pairs of companies. They then created high-precision patterns like “ORG bought ORG” and “offer for ORG” to create seeds (it’s not my fault that bootstrapping employs such a horribly mixed metaphor). They use the seeds to find relation sentences. They then find other sentences with the same pair of organizations within a time frame of a month or two around a seed sentence and consider those positive. They consider all other instances of one of those organizations paired with another organization negative (another noisy rule, like graduation year).
The 100%-recall feature x1 is that the pair of organizations occurs together in a seed sentence (this is also noisy). They use bag-of-n-gram word features (unigrams, bigrams, and trigrams) for x2. For some reason they use an SVM for p(x1=1|x2). They don’t actually mention how they estimate p(x1|y=0).
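The seed step could look something like the following sketch (the two patterns come from the examples above; everything else, including the tag format, is my guess at the setup). The surrogate feature x1 is 1 exactly when some high-precision pattern fires; the classifier on x2 then generalizes beyond the patterns.

```python
import re

# Sentences are assumed pre-tagged so that company mentions read "ORG".
# Seed patterns fire on high-precision M&A phrasings; everything else is
# left for the surrogate classifier p(x1=1|x2) to score.
SEED_PATTERNS = [
    re.compile(r"\bORG\s+(?:bought|acquired)\s+ORG\b"),
    re.compile(r"\boffer\s+for\s+ORG\b"),
]

def is_seed(tagged_sentence):
    """x1 = 1 iff any high-precision pattern matches (noisy 100%-recall feature)."""
    return any(p.search(tagged_sentence) for p in SEED_PATTERNS)

assert is_seed("ORG bought ORG last quarter")
assert not is_seed("ORG opened a new office")
```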
They give a few operating points, one of which has 99% recall at 79% precision, which is nice.
Conclusion
I’d have given this paper the thumbs-up, but the explanation of the paraphrase task could’ve been more clearly related back to the technique of the paper. Overall, I’d like to see more of this kind of innovative work on both technique and application in the main ACL.
June 24, 2009 at 3:18 am |
This seems somewhat similar in spirit to the work by Foster et al. on “Multi-View Dimensionality Reduction via Canonical Correlation Analysis”, where they use CCA on two views x1 and x2 of the data to learn a subspace and then use the projection of one of the views to learn a predictor in that subspace. The idea is based on a conditional independence assumption, which says that conditioned on the label y, the views x1 and x2 are independent of each other, and that each view alone is “sufficient” to make predictions. The key thing is that the CCA step requires only unlabeled data. They also have results showing that it leads to better sample complexity.
June 24, 2009 at 11:19 am |
Indeed. I’m getting into reading this stuff because we’re working with Tong Zhang, the third author on the Multi-View Dimensionality Reduction via CCA paper. I’m still trying to get my head around all of the different approaches to the same basic idea. Veeramachaneni and Kondadadi’s approach is different in the sense that they’re not assuming each view is “sufficient” for predictions.
June 24, 2009 at 2:56 pm |
Thanks for the nice review. I think that in NLP, it is often possible to design features that are class-conditionally independent (or at least approximately so), which together with the difficulty of obtaining labeled training data makes multi-view learning very attractive.
Regarding your comment about p(x1|y) for the paraphrase problem: since, for surrogate learning in the special case, p(y|x1,x2) is a monotonic function of p(x1|x2), we don’t need to estimate p(x1|y). All we need is a threshold on p(x1|x2) for the desired precision-recall trade-off. (Labeled data is needed for that, of course. We ducked the issue by presenting precision-recall values at different thresholds on labeled test data.)
I haven’t studied the CCA paper, but at least for co-training, view redundancy is an unnecessary requirement because of two facts. First, the final classifier p(y|x1,x2) \propto p(y|x1) p(y|x2) under independence, which means that the classifiers on the two views can be combined easily. Second, with sufficient unlabeled data and an initial classifier on one view that is better than random guessing, it can be shown that the classifier on the other view can still be improved.