Biocreative II Gene Mention Evaluation Paper


My first paper that’ll be indexed in MEDLINE! Too bad I’m the 19th of 30-some authors:

The paper’s an overview of the Biocreative II gene mention (GM) eval, which was almost two years ago (October 2006). It’s a nice thing to read if you’re looking for ideas on how to build an NE recognizer, as 19 different systems are described (most of them based on CRFs).

It also describes experiments using CRFs to combine the annotations of all systems, which improved the best single-system F-measure of .87 to a committee-based system score of .90, and showed that even low-ranking systems contributed to combined accuracy. Shades of Netflix. Perhaps not surprisingly, many of the individual systems were themselves committee-based, often in a forward-backward combo of the same learner with B-I-O tagging of named entity tokens.
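
For anyone who hasn’t run into B-I-O tagging, here’s a minimal sketch in Python of how a mention span gets turned into per-token tags. The function name, the example sentence, and the choice of span are mine for illustration; none of it is taken from the paper or from any particular system.

```python
def bio_tags(tokens, mentions):
    """Tag each token B (begins a mention), I (inside one), or O (outside).

    `mentions` holds (start, end) token-index pairs, with end exclusive.
    This is a hypothetical helper, not code from any Biocreative system.
    """
    tags = ["O"] * len(tokens)
    for start, end in mentions:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


# Made-up example; whether "tumor suppressor gene" belongs inside the
# mention is exactly the kind of boundary call a guideline has to settle.
tokens = "Mutations in the p53 tumor suppressor gene are common".split()
tags = bio_tags(tokens, [(3, 7)])
for token, tag in zip(tokens, tags):
    print(token, tag)
```

A tagger then predicts one of these three labels per token, and a forward-backward combo runs the same learner left-to-right and right-to-left over the sequence.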

Since we’ve been doing annotation recently, I was struck by that section of the paper. It turns out over 10% of the annotations in Biocreative I were changed for Biocreative II (which used a superset of the original data). That’s a strong indication that their original coding standards were not well enough defined. Here’s what they (we?) say:

It can be argued that the difficulty experienced by human annotators in reaching mutual agreement directly limits the performance of automated systems, and this can be influenced by the clarity of the annotation guidelines.

So far so good.

It has been pointed out [ed: by us at Alias-i among others] that the guidelines for annotating genes are surprisingly short and simple given the complex guidelines for annotating named entities in news wires [1].

Basically, the guidelines were nonexistent in my understanding — they told biologists to mark “gene mentions”, including proteins.

However, a gene is a scientific concept, and it is only reasonable to rely on domain experts to recognize and annotate gene mentions.

Of course we’re not going to have good luck with people who don’t know what genes are annotating the data. The question is, what more do we need to tell them?

Thus, the gene annotation guidelines can be conveyed by reference to a body of knowledge shared by individuals with experience and training in molecular biology, …

I’m not sure how to read this. The “can be conveyed” indicates a sense of sufficiency for the instructions of “just annotate the genes”.

… and it is not feasible to give a complete specification for gene annotation that does not rely on this extensive background knowledge.

It’s not as if the previous clause follows from this one. The NE guidelines for newswire don’t start with a definition of what it is to be a person or a company. While it’s necessary to employ annotators who are domain experts, domain expertise doesn’t define linguistic boundary conditions, and that’s what this kind of named entity detection game is all about. What we call a gene is a matter of convention, and such conventions need to be laid out in annotation standards to remove ambiguity, scientific concept or no. Scientific concepts are no more rigid than other ones in language. Granted, we need to lean on background knowledge, but that’s also true in annotating corporations in newswire.

We annotated a few hundred abstracts from the autism literature and were overwhelmed by the slipperiness of the term “gene”. There are all kinds of tricky boundary cases where we know what the article is referring to in a scientific sense, but don’t know if a phrase should count as a gene mention or not. Does a mention of a gene family constitute a mention of a gene? What about substructures? What if a protein’s mentioned that we know is produced by a single gene? These are all linguistic decisions relating to corpora construction. The term “gene” is just too vague on its own. We’ll never get a complete definition, but hopefully we’ll get one that’s clearer as measured by better inter-annotator reliability.
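
To make “inter-annotator reliability” concrete, here’s a rough sketch of one common agreement statistic, Cohen’s kappa, computed over token-level labels from two annotators. The labels are made up, and kappa is my choice for illustration; I’m not claiming it’s the measure used for the Biocreative data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label lists.

    Assumes both lists are the same length and that chance agreement is
    below 1 (otherwise the denominator is zero).
    """
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled at random
    # according to their own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical labels: G = part of a gene mention, O = not.
annotator_1 = ["G", "G", "O", "O", "O", "O", "G", "O", "O", "O"]
annotator_2 = ["G", "O", "O", "O", "O", "O", "G", "O", "O", "G"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 8 of 10 tokens, but because most tokens are “O” for both of them, a good chunk of that agreement is expected by chance, and kappa comes out well below 0.8.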

Nevertheless, we believe that some improvement could be achieved by documenting current annotation decisions for difficult and ambiguous gene mentions.

P.S. The results sorted by best F-measure put us 11th out of 19 teams, at 0.80 F-measure vs. the top-scoring 0.87. Our first-best system had the second highest recall in the eval (though the system with the highest recall also had 10% higher precision than ours). We did no feature tuning and used no external resources, creating our submission in a matter of hours. We also had by far the highest recall submission at 99.9%, as described in the paper. We still can’t get anyone to compete on precision at 99.99% recall, but we’re pretty sure it could be done better than our 7% precision submission. And more importantly, we think it’s what should be done.
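
For reference, here’s the arithmetic behind why a very high recall run scores so low on F-measure. This assumes the balanced F-measure (the harmonic mean of precision and recall) and plugs in the rounded 7% precision and 99.9% recall figures from above, so treat the result as approximate.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded figures quoted in the post for the high-recall submission.
high_recall_f = f_measure(0.07, 0.999)
print(round(high_recall_f, 3))
```

That works out to roughly 0.13. The harmonic mean is dragged toward the smaller of the two numbers, so a ranking by F-measure says little about how well a system does when you insist on near-total recall.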

2 Responses to “Biocreative II Gene Mention Evaluation Paper”

  1. Mark Johnson Says:

    Interesting post! I think you’re right that we need more reliable training data if we want to raise the f-scores of our automatically produced labels. But I don’t know whether higher f-scores produced by standardizing the training and testing data really means that our systems are more accurately finding genes.

    I wonder if there isn’t a kind of bias-variance tradeoff in annotation schemes. As you point out, we can increase interannotator agreement by providing additional annotation criteria, perhaps based on linguistic form, as you suggested. Sure, this reduces the variance in the labelling, but doesn’t it introduce a kind of bias? After all, we don’t really care about linguistic form per se here. One can imagine annotation criteria that make it crystal clear as to whether an NP should be labelled “gene” or not (e.g., provide annotators with a list of genes, an NP is labelled “gene” iff it is in the list), but which don’t capture the notion of gene well at all.

    I wonder if it might help to move beyond a binary annotation in which every NP is either a gene or not. Back in the 1950s Chomsky suggested we could let our theory itself decide the difficult cases; here, we could train our models just on the clear cases, and let them decide how best to generalize to the unclear ones.

  2. lingpipe Says:


    I tend to view this more in a software sense, perhaps, where if you want to implement a spec, the spec better be clear. The real problems then become whether it’s even possible to define a spec with good inter-annotator agreement, and perhaps more importantly, whether the output’s going to be at all useful.

    I think your comments are in the same spirit as the folks who created the BioCreative data (John Wilbur and crew), but they wanted to let the biologists decide the boundary cases. For BioCreative, I believe they were expecting the biologists’ decisions to be useful by definition.

    We were wondering about this specifically in annotating genes. If someone mentions a gene family (which may contain a handful or dozens of genes), does it constitute a mention of the gene or not? This is a purely linguistic decision in that it’s pretty easy to tell from most phrases whether they refer to a family or its most prominent member. Or whether a variant of a gene should refer to the same gene family in the database linkage sense. The question is what’d be useful downstream. Ideally, you’d analyze these distinctions and let the user decide, but that quickly complicates the annotation standard toward something like GENIA or even more fine-grained, which then becomes very hard to learn, but is clearly a better reflection of the underlying ontology in play.

    A lot of standards implicitly do what you suggest by censoring items on which there is no consensus.

    A better way to sort out the uncertainty may be to push the annotation uncertainty through the training, as suggested in some Padhraic Smyth articles in the mid-90s. That’ll act as a kind of normalization in linear classifiers like SVMs or logistic regression, which should ensure within the limits of the model estimation that the boundary cases get scores near the boundary.
