## Elkan and Noto (2008): Learning Classifiers from Only Positive and Unlabeled Data

Charles Elkan sent along links to these two papers in response to my previous post, Denis et al.’s Text Classification from Positive and Unlabeled Examples. The title overlap indicates just how much we need auto-alerting using tech like Hal’s WhatToSee:

Because they’ve supplied the data, I can use it in my upcoming tutorial on semi-supervised learning (I’m also doing standard EM).

### The Setup

The basic setup is the same as for Denis et al.: a two-category (positive/negative) classification problem with a relatively tiny data set. In this case, a total of 7500 or so SwissProt text recrods, 2500 or so of which are positive instances that mention transport proteins. Then a labeled training set consisting of a randomly selected subset of the positive examples. The goal is to estimate a classifier p(positive|text) given only the collection of 7500 texts and 300 labeled positive examples.

### Lemma 1

Elkan and Noto show that if you can estimate the probability of a positive item being labeled, `p(lab|pos)`, and you can estimate the probability of an item being labeled, `p(lab|text)`, then you can estimate:

```p(pos|text) = p(lab,pos|text) / p(lab|pos)
= p(lab|text) / p(lab|pos)
```

Given that `p(lab|pos)` is a constant, we also see that `p(pos|text)` is proportional to `p(lab|text)`. They lean on this ratio to train a binary classifier to estimate `p(lab|text)`, which provides the same ranking by probability as the estimand of interest, `p(pos|text)`.

### The Evaluation

They evaluate by precision at 300 and the results look very good. The KDD paper cuts off the ROC curve at .95 recall saying the only range of interest is precision-at-300; the AI’08 paper shows the whole thing, where they hit 100% recall at a respectable 70% sensitivity, which almost seems too good to be true.

Their application of ranking articles is more like traditional search in that it does not require a properly calibrated `p(pos|text)`, only a ranking. They’re interested in supplying a list of 300 papers to a human curator to review them for inclusion in a database of transport proteins.

That’s a bit different than our concern. We’re working on the much larger scale problem of linking Entrez-Gene (hundreds of thousands of genes with between a handful and thousands positive examples plus metadata) and MEDLINE (tens of millions of citations with titles, authors, abstracts, dates, and other meta-data like MeSH terms). For each of the genes in Entrez, we have positive-only data consisting of MEDLINE citations to papers. We also have lots of metadata and attached descriptive text. We’d like to be able to calibrate `p(gene|text)` across genes so that for each paper, we can rank genes in order of likelihood they’re mentioned in the text. This is very different than ranking texts in order of likelihood of mentioning a gene, which we also want to do, but is easier with traditional “more like this” methodology.

### Estimation

In order to derive estimates of `p(pos|text)`, the authors use a 1D logistic regression to estimate probabilities `p(lab|text)` from SVM classifier margins. Of course, the sigmoid is monotonic, so this preserves the same ranking as the SVMs.

The authors estimate `p(lab|pos)` by computing the expectation of `p(lab|pos)` by averaging `p(lab|x)` over a cross-validated sample of labeled positives.

### Does Negative Data Help?

In the AI’08 paper, they consider adding negative data gleaned from early review rounds where they gave 300 papers to curators. Surprisingly, their results showed the negative data didn’t help much at all. They were surprised about variance in their data samples, which showed from 1.18% to 4.25% relevant articles in 300 samples. I’m not surprised – the 95% posterior interval for `Binomial(n|300,0.025)/300` is (0.010,0.043).

### Underestimation Bias (I think)

Their figure shows the result of a (mis-specified, in the statistical sense) two-dimensional logistic regression on two categories generated by 2D normal distributions (from the figure, it looks like zero covariance), using an expanded basis with intercept and interaction terms: `1, x, y, x*x, y*y` (click on figure to enlarge):

The blue pluses are positive cases, about 20% of which are labeled (not separated from the unlabeled positive cases); the red circles are negative cases. The smaller blue ellipse is the 50% contour for the labeled-vs-unlabeled classifier, and the larger blue ellipse is the 50% contour for a classifier trained on positive-vs-negative.

This neatly shows why the classifier trained on the labeled versus unlabeled distinction has high precision. It also neatly shows a case where Lemma 1 fails in practice, because the probabilities are not in the ratio dicated by Lemma 1.

My main concern with this paper is that training a classifier on labeled versus unlabeled data is going to systematically underestimate `p(lab|text)` because the unlabeled data contains many points drawn from the same set of underlying positive examples (red pluses) as the labeled examples. Their diagram shows this very neatly, though we can’t prove anything with a mis-specified model anecdote.

### What about Recall?

It’s hard to evaluate recall in this setup, and the AI’08 paper provides some heuristics that require evaluating some negative data in the end from a sample. For the gene linkage problem of generating classifiers for every gene in Entrez-Gene, it’s hard to even get a sample where there’s enough positive density to allow estimation.

### Questions and Quibbles

A question I had is why they didn’t iterate. Given what they’re doing, it’d seem natural to combine this with EM-like iterations. Or Gibbs samples.

Quibble 1: After all the estimation, they don’t measure how well calibrated the probabilities are — we only see the ROC curves. Although every reviewer wants more graphs, I really want to see how well calibrated the probabilities are.

Quibble 2: The authors really shouldn’t use four decimal places (0.xxxx) for reporting results on 6000 data points in their tables, but rounding makes them look like less of an improvement on “prior art”.

Quibble 3: I didn’t see any justification for their claim in the AI’08 paper that SVMs have a “small and useful boost in accuracy compared to methods like maximum entropy [aka logistic regression]”. What’s the gain from their use of a quadratic kernel over the linear kernel? Did they regularize? I suppose SVMs are like IBM and you can’t go wrong these days buying into them. I’d have thought for estimating the probabilities (Quibble 1), logistic regression might have an advantage because training maximizes the sum of the log estimates. And even if SVMs do have a small and useful boost in accuracy, we’re talking about ranking on the one hand and probability estimate on the other, not first-best classifier accuracy.