Regarding your comment about p(x1|y) for the paraphrase problem: since, in the special case of surrogate learning, p(y|x1,x2) is a monotonic function of p(x1|x2), we don’t need to estimate p(x1|y). All we need is a threshold on p(x1|x2) for the desired precision-recall (p-r) trade-off. (Labeled data is needed to choose that threshold, of course. We ducked the issue by presenting p-r values at different thresholds on labeled test data.)
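As a minimal sketch of the thresholding step: sweep candidate thresholds on the p(x1|x2) scores of a small labeled test set and report precision and recall at each, then pick the threshold giving the desired trade-off. The scores and labels below are made-up illustration values, not data from the paper.

```python
def precision_recall(scores, labels, threshold):
    """Treat score >= threshold as a positive (paraphrase) prediction."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical p(x1|x2) estimates and gold labels for a labeled test set.
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.20]
labels = [1,    1,    0,    1,    0,    0]

# Report the p-r trade-off at a few candidate thresholds.
for t in (0.3, 0.5, 0.75):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold trades recall for precision, which is the only knob the monotonicity argument requires.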

I haven’t studied the CCA paper, but at least for co-training, view redundancy is an unnecessary requirement, for two reasons. First, under class-conditional independence of the views, the final classifier satisfies p(y|x1,x2) \propto p(y|x1) p(y|x2) / p(y), which means the classifiers on the two views can be combined easily. Second, with sufficient unlabeled data and an initial classifier on one view that is better than random guessing, it can be shown that the classifier on the other view can still be improved.
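The first point can be sketched in a few lines: under class-conditional view independence, multiply the per-view posteriors, divide by the class prior, and renormalize. All numbers below are made-up illustration values.

```python
def combine_views(post1, post2, prior):
    """Combine per-view posteriors p(y|x1), p(y|x2) with class prior p(y).

    Under class-conditional independence of the two views,
    p(y|x1,x2) is proportional to p(y|x1) * p(y|x2) / p(y).
    """
    unnorm = [a * b / p for a, b, p in zip(post1, post2, prior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

p_y_given_x1 = [0.7, 0.3]   # hypothetical classifier output on view 1
p_y_given_x2 = [0.6, 0.4]   # hypothetical classifier output on view 2
p_y = [0.5, 0.5]            # uniform class prior (illustrative assumption)

print(combine_views(p_y_given_x1, p_y_given_x2, p_y))
```

With a uniform prior the prior divisor is a constant and drops out after normalization, so in that case the combined posterior is just the normalized product of the per-view posteriors.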
