I’m finally getting back to the problem of how to learn with only positively labeled data (and lots of unlabeled data). I discussed this last in the entry How Can I Build a Classifier with No Negative Data?.

This paper takes an interesting (and, in retrospect, obvious) approach:

- Denis, François, Rémi Gilleron, and Marc Tommasi. 2002. Text Classification from Positive and Unlabeled Examples. In *Proceedings of the 9th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems* (IPMU 2002).

The foundation is naive Bayes with a bag-of-words representation for a binary classification task with two categories `{0,1}`. They remind us in their equation (4) that naive Bayes yields a mixture of unigrams language model for words `w`:

p(w) = p(cat=0) * p(w|cat=0) + p(cat=1) * p(w|cat=1)

Denis et al. then (well, actually before) observe that we can estimate `p(w|cat=1)` from positive examples. And we can estimate `p(w)` over the entire corpus without labels. We can probably make a pretty good guess at `p(cat=0)`, which gives us `p(cat=1) = 1.0 - p(cat=0)`. If the documents are at all long and the category probabilities are not horrendously skewed, the category priors usually don't contribute much to the category estimates in naive Bayes.

Aha! Now we can solve for `p(w|cat=0)`, which they do for us in their equation (6):

p(w|cat=0) = [ p(w) - p(cat=1) * p(w|cat=1) ] / p(cat=0)

Denis et al. note that for this derivation to make sense, the document length distributions have to be the same in both categories. This not being the case may not matter much in practice. (Recall that the independence assumptions of naive Bayes almost never hold for bags of words anyway; actual word counts are almost always overdispersed relative to the expectation under a multinomial.)
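The estimation pipeline above is easy to sketch in code. Here's a minimal Python illustration of equations (4) and (6) on a toy corpus; the documents, the smoothing constant, and the guessed prior `p_cat0` are all made up for the example, and a real system would also tune smoothing to the vocabulary size:

```python
import math
from collections import Counter

def word_dist(docs, vocab, alpha=0.5):
    """Add-alpha smoothed unigram distribution over vocab from tokenized docs."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

# Toy data: positively labeled (category 1) docs plus an unlabeled corpus.
positive = [["course", "syllabus", "exam"], ["exam", "homework", "course"]]
unlabeled = positive + [["faculty", "research", "grant"], ["research", "lab", "grant"]]
vocab = sorted({w for doc in unlabeled for w in doc})

p_cat0 = 0.5                           # guessed prior; p(cat=1) = 1 - p(cat=0)
p_cat1 = 1.0 - p_cat0
p_w = word_dist(unlabeled, vocab)      # p(w) estimated from the unlabeled corpus
p_w_cat1 = word_dist(positive, vocab)  # p(w|cat=1) estimated from positive docs

# Equation (6): solve the mixture for p(w|cat=0). Clip negatives and
# renormalize, since sampling noise can push the raw estimate below zero.
raw = {w: max((p_w[w] - p_cat1 * p_w_cat1[w]) / p_cat0, 0.0) for w in vocab}
z = sum(raw.values())
p_w_cat0 = {w: v / z for w, v in raw.items()}

def classify(doc):
    """Naive Bayes decision using the derived word distributions."""
    score1 = math.log(p_cat1) + sum(math.log(p_w_cat1[w]) for w in doc if w in p_w_cat1)
    score0 = math.log(p_cat0) + sum(math.log(p_w_cat0[w]) for w in doc if w in p_w_cat0)
    return 1 if score1 > score0 else 0

print(classify(["course", "exam"]))     # 1
print(classify(["research", "grant"]))  # 0
```

Note that without clipping, the derived `p(w|cat=0)` already sums to one whenever `p(w)` and `p(w|cat=1)` are proper distributions and the guessed prior is consistent with them; the clipping only kicks in when the estimates disagree.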

The paper includes some evaluation over the WebKB Course Dataset, a set of 1000 or so academic department web pages from four universities classified into types. Denis et al. took the course pages (22% of the total) to be category 1 and all other pages to be category 0. They got quite decent results with both 20 and 50 positively labeled documents and up to 200 unlabeled docs.

They showed that classification performance was robust to errors in the category distribution specification `p(cat=0)`.

One good idea’s all a paper (or a thesis for that matter) needs.

They did more evaluation and added co-training in:

- Denis, François, Anne Laurent, Rémi Gilleron, and Marc Tommasi. 2003. Text Classification and Co-training from Positive and Unlabeled Examples. In *Proceedings of ICML*.

January 31, 2009 at 8:04 pm

Thank you for posting about this paper. Lots of people don’t bother with papers prior to 2006. I maintain that there’s a lot to learn even from the 40’s!

One good idea for a thesis? No way. I wish someone had told me before I designed my whole system.

cj