As part of our NIH grant, we’re working on the database linkage problem: linking gene/protein mentions in MEDLINE to database entries in EntrezGene. Basically, it’s what the biologists call "gene normalization", and it was the basis of BioCreative Task 2.
I can summarize the problem we’re having with a simple example. We’d like to classify all 17M or so entries in MEDLINE as to whether they’re about genomics or not. EntrezGene provides links to 200K citations that are about particular genes, so we have a pile of articles about genomics (making up about 300 million characters). What we don’t have is any negative training data.
So my question is: how do I build a classifier for articles about genomics versus those that are not about genomics?
The job running in the background, giving me time to write this post, is generating cross-validated cross-entropy rates for all 200K citations. That is, I train a character-level language model on 180K citations and use it to evaluate the remaining 20K, rotating through all ten folds. This gives me an empirical distribution of scores for positive examples (assuming the 200K set is unbiased, which it isn’t: it skews toward recent articles, particular genes, and human genomics). I’m going to plot the distribution and see what the curve looks like. We can then set a threshold empirically that would accept 99% of the articles we have. Unfortunately, I have no idea how well this’ll work in practice at rejecting articles that aren’t about genomics.
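To make the thresholding idea concrete, here’s a minimal Python sketch. It stands in a toy add-one-smoothed character bigram model for a real language model (LingPipe’s LMs are far more sophisticated), uses a placeholder corpus and fold count, and assumes a fixed 256-character alphabet for smoothing; everything here is illustrative, not our actual pipeline.

```python
import math
from collections import Counter

class CharBigramLM:
    """Toy character bigram model with add-one smoothing
    (a stand-in for a real character-level language model)."""
    def __init__(self, texts):
        self.bigrams = Counter()
        self.unigrams = Counter()
        for t in texts:
            padded = "\x02" + t  # begin-of-text marker
            for a, b in zip(padded, padded[1:]):
                self.bigrams[(a, b)] += 1
                self.unigrams[a] += 1
        self.vocab = 256  # crude alphabet-size assumption for smoothing

    def cross_entropy(self, text):
        """Average negative log2 probability per character."""
        padded = "\x02" + text
        bits = 0.0
        for a, b in zip(padded, padded[1:]):
            p = (self.bigrams[(a, b)] + 1) / (self.unigrams[a] + self.vocab)
            bits += -math.log2(p)
        return bits / max(len(text), 1)

def cv_scores(docs, n_folds=10):
    """Train on n-1 folds, score the held-out fold; rotate through
    all folds so every doc gets a held-out cross-entropy score."""
    scores = []
    for k in range(n_folds):
        held_out = [d for i, d in enumerate(docs) if i % n_folds == k]
        train = [d for i, d in enumerate(docs) if i % n_folds != k]
        lm = CharBigramLM(train)
        scores.extend(lm.cross_entropy(d) for d in held_out)
    return scores

def acceptance_threshold(scores, accept_rate=0.99):
    """Cross-entropy cutoff that accepts accept_rate of the positives:
    anything scoring below the cutoff is accepted as genomics."""
    ranked = sorted(scores)
    return ranked[min(int(accept_rate * len(ranked)), len(ranked) - 1)]

docs = ["the gene encodes a protein kinase"] * 50  # placeholder corpus
thresh = acceptance_threshold(cv_scores(docs))
```

The open question from the post remains untouched by this sketch: the threshold controls the false-reject rate on positives, but says nothing about how many non-genomics articles slip under it.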
For the genomics/non-genomics problem, we can just annotate a few thousand examples; it’ll only take a day or two.
The real problem is that we want to build models to classify contexts for the 30K or so human gene entries in Entrez. Some of them have a handful of example docs; some have hundreds. We’re going to pull out articles with potential mentions, then filter with the classifier. It’s related to what we did in Phase I of our grant and reported in our gene linkage white paper. In that setting, we generate candidate docs by approximate matching of aliases, then rank the candidates by their language-model scores against the known articles for the gene in question. This is great in a search context, but it doesn’t give us the go/no-go decision point we need for some of our downstream applications.
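The candidate-then-rank step can be sketched as follows. This is a hedged illustration, not our implementation: exact lowercase substring matching stands in for real approximate alias matching, a toy add-one-smoothed bigram model stands in for a real LM, and the aliases, known docs, and corpus are made-up examples.

```python
import math
from collections import Counter

def bigram_cross_entropy(train_texts, text):
    """Cross-entropy (bits/char) of text under an add-one-smoothed
    character bigram model trained on train_texts (toy LM stand-in)."""
    bigrams, unigrams = Counter(), Counter()
    for t in train_texts:
        p = "\x02" + t
        bigrams.update(zip(p, p[1:]))
        unigrams.update(p[:-1])
    padded = "\x02" + text
    bits = sum(-math.log2((bigrams[(a, b)] + 1) / (unigrams[a] + 256))
               for a, b in zip(padded, padded[1:]))
    return bits / max(len(text), 1)

def rank_candidates(gene_aliases, known_docs, corpus):
    """Pull docs mentioning any alias (crude stand-in for approximate
    matching), then rank them best-first by LM score against the
    gene's known docs. Ranking only -- no go/no-go decision."""
    candidates = [d for d in corpus
                  if any(a.lower() in d.lower() for a in gene_aliases)]
    return sorted(candidates,
                  key=lambda d: bigram_cross_entropy(known_docs, d))

# hypothetical data for illustration
aliases = ["p53", "TP53"]
known = ["p53 tumor suppressor regulates the cell cycle"]
corpus = ["mutations in TP53 drive tumor growth",
          "p53 levels rose after treatment",
          "weather patterns over the atlantic"]
ranked = rank_candidates(aliases, known, corpus)
```

Note the gap the post is pointing at: `rank_candidates` returns an ordering, which is exactly what a search application wants, but nothing here tells you where in that ordering to stop accepting docs.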
If anyone knows how to tackle this problem, I’d love to hear about it. I might even implement it as part of LingPipe if the idea’s simple and general enough.