## Elkan and Noto (2008): Learning Classifiers from Only Positive and Unlabeled Data

Charles Elkan sent along links to these two papers in response to my previous post, Denis et al.’s Text Classification from Positive and Unlabeled Examples. The title overlap indicates just how much we need auto-alerting using tech like Hal’s WhatToSee.

Because they’ve supplied the data, I can use it in my upcoming tutorial on semi-supervised learning (I’m also doing standard EM).

### The Setup

The basic setup is the same as for Denis et al.: a two-category (positive/negative) classification problem with a relatively tiny data set. In this case, the data comprise roughly 7500 SwissProt text records, roughly 2500 of which are positive instances that mention transport proteins. The labeled training set is a randomly selected subset of the positive examples. The goal is to estimate a classifier p(positive|text) given only the collection of 7500 texts and 300 labeled positive examples.

### Lemma 1

Elkan and Noto show that if you can estimate the probability of a positive item being labeled, `p(lab|pos)`, and you can estimate the probability of an item being labeled, `p(lab|text)`, then you can estimate:

```
p(pos|text) = p(lab,pos|text) / p(lab|pos)
            = p(lab|text) / p(lab|pos)
```

Given that `p(lab|pos)` is a constant, we also see that `p(pos|text)` is proportional to `p(lab|text)`. They lean on this ratio to train a binary classifier to estimate `p(lab|text)`, which provides the same ranking by probability as the estimand of interest, `p(pos|text)`.
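
Under the paper’s labeled-completely-at-random assumption, the lemma is easy to check numerically. Here’s a minimal sketch with a hypothetical two-valued “text” feature and made-up rates (not the paper’s data): estimate `p(lab|x)` by counting, divide by an estimate of `p(lab|pos)`, and recover `p(pos|x)`:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
c = 0.3  # p(lab|pos): probability a positive gets labeled (arbitrary choice)

# Two-valued "text" feature with known positive rates (made-up numbers).
p_pos_given_x = np.array([0.2, 0.8])
x = rng.integers(0, 2, size=n)
pos = rng.random(n) < p_pos_given_x[x]
lab = pos & (rng.random(n) < c)  # positives labeled completely at random

# Empirical p(lab|x) and a plug-in estimate of p(lab|pos).
p_lab_given_x = np.array([lab[x == 0].mean(), lab[x == 1].mean()])
c_hat = lab[pos].mean()  # uses the true positive flags, only for this check

# Lemma 1: p(pos|x) = p(lab|x) / p(lab|pos)
p_pos_hat = p_lab_given_x / c_hat
print(p_pos_hat)  # close to [0.2, 0.8]
```

In a real positive-only setting the positive flags are unobserved, of course; as described below, the paper instead estimates `p(lab|pos)` by averaging a trained classifier’s `p(lab|x)` over held-out labeled positives.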

### The Evaluation

They evaluate by precision at 300, and the results look very good. The KDD paper cuts off the ROC curve at .95 recall, saying the only range of interest is precision-at-300; the AI’08 paper shows the whole curve, where they hit 100% recall at a respectable 70% specificity, which almost seems too good to be true.

Their application of ranking articles is more like traditional search in that it does not require a properly calibrated `p(pos|text)`, only a ranking. They’re interested in supplying a list of 300 papers to a human curator to review them for inclusion in a database of transport proteins.

That’s a bit different from our concern. We’re working on the much larger scale problem of linking Entrez-Gene (hundreds of thousands of genes with between a handful and thousands of positive examples plus metadata) and MEDLINE (tens of millions of citations with titles, authors, abstracts, dates, and other meta-data like MeSH terms). For each of the genes in Entrez, we have positive-only data consisting of MEDLINE citations to papers. We also have lots of metadata and attached descriptive text. We’d like to be able to calibrate `p(gene|text)` across genes so that for each paper, we can rank genes in order of likelihood they’re mentioned in the text. This is very different from ranking texts in order of likelihood of mentioning a gene, which we also want to do, but is easier with traditional “more like this” methodology.

### Estimation

In order to derive estimates of `p(pos|text)`, the authors use a one-dimensional logistic regression to estimate probabilities `p(lab|text)` from SVM classifier margins. Of course, the sigmoid is monotonic, so this preserves the ranking induced by the SVM margins.
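
That recalibration step (Platt scaling) is just a logistic regression on the one-dimensional margin. A minimal scikit-learn sketch on synthetic labeled-vs-unlabeled data (the Gaussian features and sample sizes are invented):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic 2D features: class 1 plays the role of "labeled".
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
s = np.array([0] * 200 + [1] * 200)

svm = LinearSVC(dual=False).fit(X, s)
margin = svm.decision_function(X).reshape(-1, 1)

# One-dimensional logistic regression on the SVM margin (Platt scaling).
platt = LogisticRegression().fit(margin, s)
p_lab = platt.predict_proba(margin)[:, 1]  # probability estimates of p(lab|x)
```

Because the fitted sigmoid is increasing in the margin, sorting by `p_lab` returns exactly the same ranking as sorting by the raw SVM margin.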

The authors estimate `p(lab|pos)` as the expectation of `p(lab|x)`, averaged over a cross-validated sample of labeled positives.
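
A sketch of that estimator using scikit-learn’s `cross_val_predict`, so that each labeled positive is scored by a model that never trained on it; the features and counts here are synthetic stand-ins, not the paper’s data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
# Hypothetical features: 150 labeled positives, 600 unlabeled documents.
X = np.vstack([rng.normal(1.5, 1, (150, 2)),
               rng.normal(0.0, 1, (600, 2))])
s = np.array([1] * 150 + [0] * 600)  # s = 1 iff the document is labeled

# Out-of-fold scores g(x) ~ p(lab|x): no document is scored by a fold it trained.
g = cross_val_predict(LogisticRegression(), X, s, cv=5,
                      method="predict_proba")[:, 1]

# Estimate p(lab|pos) by averaging g over the labeled positives.
c_hat = g[s == 1].mean()
```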

### Does Negative Data Help?

In the AI’08 paper, they consider adding negative data gleaned from early review rounds in which they gave 300 papers to curators. Surprisingly, their results showed the negative data didn’t help much at all. They were also surprised by the variance across their samples, which ranged from 1.18% to 4.25% relevant articles per 300-document sample. I’m not surprised – the central 95% interval for `Binomial(n|300,0.025)/300` is (0.010, 0.043).
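
That interval is easy to reproduce with a stdlib-only binomial quantile; the 2.5% relevance rate is the midpoint figure from the text:

```python
from math import comb

n, p = 300, 0.025  # 300-document samples, 2.5% relevant

def binom_cdf(k: int) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def quantile(q: float) -> int:
    """Smallest k with P(X <= k) >= q."""
    k = 0
    while binom_cdf(k) < q:
        k += 1
    return k

lo, hi = quantile(0.025) / n, quantile(0.975) / n
print(lo, hi)  # (0.010, 0.043), matching the interval above
```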

### Underestimation Bias (I think)

Their figure shows the result of a (mis-specified, in the statistical sense) two-dimensional logistic regression on two categories generated by 2D normal distributions (from the figure, it looks like zero covariance), using an expanded basis with intercept and quadratic terms: `1, x, y, x*x, y*y`.

The blue pluses are positive cases, about 20% of which are labeled (not separated from the unlabeled positive cases); the red circles are negative cases. The smaller blue ellipse is the 50% contour for the labeled-vs-unlabeled classifier, and the larger blue ellipse is the 50% contour for a classifier trained on positive-vs-negative.

This neatly shows why the classifier trained on the labeled versus unlabeled distinction has high precision. It also neatly shows a case where Lemma 1 fails in practice, because the probabilities are not in the ratio dictated by Lemma 1.

My main concern with this paper is that training a classifier on labeled versus unlabeled data is going to systematically underestimate `p(lab|text)` because the unlabeled data contains many points drawn from the same set of underlying positive examples (blue pluses) as the labeled examples. Their diagram shows this very neatly, though we can’t prove anything with a mis-specified model anecdote.
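
The concern is easy to simulate. In the sketch below (all distributions, sizes, and the 20% labeling rate are invented, not taken from the paper’s figure), a logistic regression trained on labeled-vs-unlabeled assigns systematically lower probabilities to true positives than one trained on positive-vs-negative, because the unlabeled pool still contains 80% of the positives:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
X_pos = rng.normal(2.0, 1.0, (n, 2))   # positive cluster
X_neg = rng.normal(-2.0, 1.0, (n, 2))  # negative cluster
X = np.vstack([X_pos, X_neg])
pos = np.array([1] * n + [0] * n)
lab = (pos == 1) & (rng.random(2 * n) < 0.2)  # only 20% of positives labeled

clf_pn = LogisticRegression().fit(X, pos)              # positive vs negative
clf_pu = LogisticRegression().fit(X, lab.astype(int))  # labeled vs unlabeled

mean_pn = clf_pn.predict_proba(X_pos)[:, 1].mean()
mean_pu = clf_pu.predict_proba(X_pos)[:, 1].mean()
print(mean_pu < mean_pn)  # True: the PU classifier's probabilities are deflated
```

Under the lemma, the deflation is by the constant factor `p(lab|pos)`, so the ranking survives even though the raw probabilities do not.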