We have been struggling with how to evaluate whether we are finding ALL the genes in MEDLINE/PubMed abstracts. And we want to do it right. There has been a fair amount of work on how to evaluate natural language problems–search via TREC, BIOCREATIVE, MEDTAG–but nothing out there really covered what we consider to be a key problem in text-based bioinformatics: coverage, or recall, as applied to existing databases like Entrez Gene.
What is the Rumpelstiltskin tie in?
From the Wikipedia:
“In order to make himself appear more important, a miller lied to the king that his daughter could spin straw into gold. The king called for the girl, shut her in a tower room with straw and a spinning wheel, and demanded that she spin the straw into gold by morning, for three nights, or be executed.” Much drama ensues but in the end a fellow named Rumpelstiltskin saves the day.
The cast breaks down as follows:
- The king: The National Institutes of Health (NIH) who really prefer that you deliver on what you promise on grants.
- The miller: Our NIH proposal in which we say “We are committed to making all the facts, or total recall, available to scientists…” Even worse is that this is from the one paragraph summary of how we were going to spend $750,000 of the NIH’s money. They will be asking about this.
- The daughter: I (Breck), who sees lots of straw and no easy path to developing adequate gold standard data to evaluate ourselves against.
- The straw: 15 million MEDLINE/PubMed abstracts and 500,000 genes that need to be connected in order to produce the gold. Really just a subset of it.
- The gold: A scientifically valid sample of mappings between genes and abstracts against which we can test our claims of total recall. This is commonly called “Gold Standard Data.”
- Rumpelstiltskin: Bob, and lucky for me I do know his name.
Creating Gold from Straw
Creating linguistic gold standard data is difficult, detail oriented, frustrating and ultimately some of the most important work that one can do to take on natural language problems seriously. I was around when version 1 of the Penn Treebank was created and would chat with Beatrice Santorini about the difficulties they encountered for things as simple seeming as part-of-speech tagging. I annotated MUC-6 data for named entities and coreference, did the John Smith corpus of cross-document coref with Amit Bagga and have done countless customer projects. All of those efforts gave me insights that I would not have had otherwise about how language is actually used rather than the idealized version you get in standard linguistics classes.
The steps for creating a gold standard are:
- Define what you are trying to annotate: We started with a very open-ended “let’s see what looks annotatable” attitude for linking Entrez Gene to MEDLINE/PubMed. By the time we felt we had a sufficiently robust linguistic phenomenon we had a standard that mapped abstracts as a whole to gene entries in Entrez Gene. The relevant question was: “Does this abstract mention anywhere a literal instance of the gene?” Gene families were not taken to mention the member genes, so “the EXT family of genes” would not count, but “EXT1 and EXT2 are not implicated in autism” would.
- Validate that you can get multiple people to do the same annotation: Bob and I sat down and annotated 20 of the same abstracts independently and compared our results. We found that we had 36 shared mappings from gene to abstract, with Bob finding 3 mappings that I did not and me finding 4 that Bob did not. In terms of recall, I found 92% (36/39) of what Bob did. Bob found 90% (36/40) of what I found. Pretty good, eh? Not really, see below.
- Annotate enough data to be statistically meaningful: Once we are convinced we have a reliable phenomenon, then we need to be sure we have enough examples to minimize chance occurrences.
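The agreement arithmetic in the validation step can be sketched in a few lines of Python. This is only an illustration: the mapping ids below are made-up placeholders standing in for our real (abstract, gene) pairs, and the `recall` helper is a name invented for this sketch. Only the counts (36 shared, 3 Bob-only, 4 Breck-only) come from our exercise.

```python
def recall(found, reference):
    """Fraction of the reference mappings that also appear in found."""
    return len(found & reference) / len(reference)


# Placeholder ids (assumption): stand-ins for real (abstract, gene) mappings.
shared = {("shared", i) for i in range(36)}     # mappings both of us found
bob_only = {("bob", i) for i in range(3)}       # Bob found these, Breck did not
breck_only = {("breck", i) for i in range(4)}   # Breck found these, Bob did not

bob = shared | bob_only        # 39 mappings in total
breck = shared | breck_only    # 40 mappings in total

breck_vs_bob = recall(breck, bob)   # 36/39: how much of Bob's work Breck recovered
bob_vs_breck = recall(bob, breck)   # 36/40: how much of Breck's work Bob recovered

print(f"Breck against Bob: {breck_vs_bob:.0%}")
print(f"Bob against Breck: {bob_vs_breck:.0%}")
```

Note that neither number is recall against the truth. Each one only measures one fallible annotator against the other.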
The Tricky Bit
I (the daughter) need to stand in front of the king (the NIH) and say how good our recall is. Better if the number is close to 100% recall. But what is 100% recall?
Even a corpus annotation with an outrageously high 90% interannotator agreement leads to problems:
- A marketing problem: Even if we hit 99.99% recall on the corpus, we don’t know what’s up with the 5% error. We can report 99.99% recall against the corpus, but not against the truth. Picture, after being total rock stars and modeling Bob at 99.99%, a slide that says we can only claim recall of 85-95% on the data. So I can throw out the 99.99% number and introduce a salad of footnotes and diagrams. I see congressional investigations in my future.
- A scientific problem: It bugs me that I don’t have a handle on what truth looks like. We really do think recall is the key to text bioinformatics and that text bioinformatics is the key to curing lots of diseases.
Rumpelstiltskin Saves the Day
So, here we are in our hip Brooklyn office space, sun setting beautifully over the Williamsburg bridge, Bob and I are sitting around with a lot of straw. It is getting tense as I imagine the king’s reaction to the “standard approach” of working from an interannotator agreement validated data set. Phrases like “cannot be done in a scientifically robust way”, “we should just do what everyone else does” and “maybe we should focus on precision” were bandied about with increasing panic. But the next morning Rumpelstiltskin walked in with the gold. And it goes like this:
The problem is in estimating what truth is given somewhat unreliable annotators. Assuming that Bob and I make independent errors, and after adjudication (we both looked at where we differed and decided what the real errors were), we figured that each of us would miss 5% (1/20) of the abstract-to-gene mappings. If we took the union of our annotations, we would end up with 0.25% missed mentions (1/400) by multiplying our recall errors (1/20*1/20)–this assumes independence of errors, a big assumption.
Now we have a much better upper limit that is in the 99% range, and more importantly, a perspective on how to accumulate a recall gold standard. Basically we should take annotations from all remotely qualified annotators and not worry about it. We know that is going to push down our precision (accuracy) but we are not in that business anyway.
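For the curious, here is a rough sketch of that union calculation in Python. The function name `union_miss_rate` and the third annotator are made up for illustration, and the whole thing leans on the same big assumption as above: annotators miss mappings independently of one another.

```python
def union_miss_rate(miss_rates):
    """Chance that every annotator misses the same mapping.

    Assumes independent errors, so the individual miss rates multiply.
    """
    rate = 1.0
    for miss in miss_rates:
        rate *= miss
    return rate


# Bob and Breck each miss an estimated 1 in 20 mappings.
two_annotators = union_miss_rate([1 / 20, 1 / 20])   # 1/400
union_recall = 1 - two_annotators

# Hypothetical third annotator who is much sloppier (misses 1 in 5).
three_annotators = union_miss_rate([1 / 20, 1 / 20, 1 / 5])

print(f"Two annotators miss {two_annotators:.4%} of mappings")
print(f"Union recall estimate: {union_recall:.2%}")
print(f"Adding a sloppy third annotator: {three_annotators:.4%} missed")
```

Under these assumptions even a sloppy extra annotator lowers the miss rate of the union, which is why we stopped worrying about who does the annotating. If annotators tend to miss the same hard cases, the real miss rate will be higher than this product.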
At ISMB in Detroit, I stood up and criticized the BioCreative/GENETAG folks for adopting a crazy-seeming annotation standard that went something like this: “Annotate all the gene mentions in this data. Don’t get too worried about the phrase boundaries, but make sure the mention is specific.” I now see that approach as a sane way to increase recall. I see the error of my ways and feel much better since we have demonstrated 99.99% recall against gene mentions for that task–note this is a different, but related task to linking Entrez Gene ids to text abstracts. And thanks to the BioCreative folks for all the hard work pulling together those annotations and running the bakeoffs.