How Can I Build a Classifier with no Negative Data?

As part of our NIH grant, we're working on the database linkage problem of connecting gene/protein mentions in MEDLINE to database entries in EntrezGene. Basically, it's what biologists call "gene normalization," and it was the basis of BioCreative Task 2.

I can summarize the problem we’re having with a simple example. We’d like to classify all 17M or so entries in MEDLINE as to whether they’re about genomics or not. EntrezGene provides links to 200K citations that are about particular genes, so we have a pile of articles about genomics (making up about 300 million characters). What we don’t have is any negative training data.

So my question is: how do I build a classifier for articles about genomics versus those that are not about genomics?

The job running in the background giving me time to write this post is computing cross-validated cross-entropy rates for all of these 200K citations. That is, I train a character-level language model on 180K citations and use it to evaluate the held-out 20K, for each of the ten folds. This gives me an empirical distribution of scores for positive examples (assuming there's no bias in that 200K set, which there is, in terms of recency, particular gene focus, and an emphasis on human genomics). I'm going to plot this and see what the curve looks like. We can then empirically set a threshold that would accept 99% of the articles we have. Unfortunately, I have no idea how well this will work in practice at rejecting articles that aren't about genomics.
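
To make the thresholding idea concrete, here's a rough sketch in Python. The character bigram model, ten-fold split, and 99% acceptance cutoff are simplified stand-ins for the real setup (LingPipe's character language models are much richer), not what's actually running in the background.

```python
import math
import random
from collections import Counter

class CharBigramLM:
    """Tiny character bigram model with add-one smoothing (a stand-in for a real char LM)."""

    def __init__(self):
        self.bigrams = Counter()
        self.unigrams = Counter()
        self.vocab = set()

    def train(self, texts):
        for text in texts:
            padded = "\x02" + text  # start-of-text marker
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[(prev, cur)] += 1
                self.unigrams[prev] += 1
                self.vocab.update((prev, cur))

    def cross_entropy(self, text):
        """Average negative log2 probability per character."""
        padded = "\x02" + text
        v = len(self.vocab) + 1
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.unigrams[prev] + v)
            total -= math.log2(p)
        return total / max(len(text), 1)

def positive_only_threshold(docs, n_folds=10, accept_rate=0.99):
    """Cross-validate on positive-only docs; return the cross-entropy
    cutoff that accepts accept_rate of the held-out positives."""
    docs = list(docs)
    random.shuffle(docs)
    folds = [docs[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, held_out in enumerate(folds):
        lm = CharBigramLM()
        lm.train(doc for j, fold in enumerate(folds) if j != i for doc in fold)
        scores.extend(lm.cross_entropy(doc) for doc in held_out)
    scores.sort()
    return scores[max(int(accept_rate * len(scores)) - 1, 0)]
```

A new abstract would then be accepted as "genomics" if its cross-entropy under a model trained on all 200K citations falls below the returned cutoff; how many non-genomics articles also sneak under that cutoff is exactly the open question.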

For the genomics/non-genomics problem, we can just annotate a few thousand examples; it’ll only take a day or two.

The real problem is that we want to build models to classify contexts for the 30K or so human gene entries in Entrez. Some of them have only a handful of example docs; some have hundreds. We're going to pull out articles with potential mentions, then filter them with the classifier. It's related to what we did in Phase I of our grant and reported in our gene linkage white paper. In that setting, we generate candidate docs using approximate matching of aliases, then rank the candidates by their language model scores against the known articles for the gene in question. This is great in a search context, but it doesn't give us a go/no-go decision point, which we need for some of our downstream applications.
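
Here's a sketch of that candidate-then-rank pipeline, reusing the toy CharBigramLM above, with plain substring matching standing in for real approximate alias matching:

```python
def rank_candidates(gene_aliases, known_docs, corpus):
    """Pull out docs that mention any alias of the gene, then rank them
    by fit to a per-gene LM trained on the gene's known citations."""
    lm = CharBigramLM()
    lm.train(known_docs)
    # crude candidate generation: exact substring match on any alias
    candidates = [doc for doc in corpus
                  if any(alias.lower() in doc.lower() for alias in gene_aliases)]
    # lower cross-entropy = closer to the gene's known literature
    return sorted(candidates, key=lm.cross_entropy)
```

The sort gives a ranking that works fine for search, but it still leaves open where to draw the accept/reject line.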

If anyone knows how to tackle this problem, I’d love to hear about it. I might even implement it as part of LingPipe if the idea’s simple and general enough.

4 Responses to “How Can I Build a Classifier with no Negative Data?”

  1. mdreid Says:

    You could try using a one-class SVM.
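
    For illustration, a minimal version of that suggestion using scikit-learn's OneClassSVM on character n-gram features (the library, features, and nu parameter are placeholder choices, not anything from the comment):

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import OneClassSVM

    def fit_one_class(positive_docs, nu=0.05):
        """Fit a one-class SVM on positive-only documents."""
        vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
        X = vec.fit_transform(positive_docs)
        svm = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X)
        return vec, svm

    def looks_positive(doc, vec, svm):
        # OneClassSVM.predict returns +1 for inliers, -1 for outliers
        return svm.predict(vec.transform([doc]))[0] == 1
    ```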

  2. todd. Says:

    A friend of mine did pretty well in a UCSD data mining contest with a similar problem. There you had to classify ~11k instances based on ~70k examples, only a handful of which were labeled, and all of the labels were positive. I believe he used some form of hierarchical clustering, where clusters with some number of labeled positive instances were declared positive.

    The problem he solved is described here: http://mill.ucsd.edu/index.php?page=Datasets&subpage=Task2, and I'd be happy to get more information on the method if you're interested.
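
    A rough sketch of that cluster-then-label idea in scikit-learn (the cluster count, the positives-per-cluster cutoff, and the use of agglomerative clustering are guesses at the method, not a description of it):

    ```python
    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def label_by_clusters(X, positive_idx, n_clusters=50, min_positives=3):
        """Cluster all instances; call a cluster (and everything in it)
        positive if it holds at least min_positives labeled positives."""
        labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
        counts = np.bincount([labels[i] for i in set(positive_idx)],
                             minlength=n_clusters)
        positive_clusters = {c for c in range(n_clusters)
                             if counts[c] >= min_positives}
        return np.array([c in positive_clusters for c in labels])
    ```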

  3. lingpipe Says:

    Awesome. Thanks, Todd, and thanks Mark (I’ve added your blog to our blog roll).

    The SVM approach is interesting, though the paper's title is confusing, given that they conclude the neural-network approach has the same performance and is more robust. What I was planning was similar to Schölkopf's one-class SVM. I'm surprised their method of using outliers in the positive data (what I'm evaluating right now) is more effective. It reminds me of using entries on the n-best list as negative examples for sequence tagging. Unfortunately, the overall results are rather disappointing, in the 50% range. I should emphasize that we'd really like an approach that can balance precision and recall for different applications (for search, we want high recall in the tail; for some other apps, we need high precision).

    The problem posed for the 2008 UC San Diego Data Mining Contest's positive-only semi-supervised task is exactly what we're trying to do, though stated in terms of 20 real-valued features instead of raw text. What's amazing to me is that there are about 50 distinct entries on the leaderboard. There's even prize money for win, place, and show.

  4. todd. Says:

    Well, there was prize money; the contest ended in June. The final results are here. Finishing 5th and 6th, respectively, in the semi-supervised task were groups called "Cocorico" and "AllYourBayes." Those two teams actually merged before the end of the contest, though. I was on AllYourBayes, but I didn't work much on the semi-supervised task. I'll see if I can get the rest of the group to show up and talk about what worked and what didn't.
