Is 90% Entity Detection Good Enough?

Ryan McDonald made the following comment on the NLPers blog post “Tracking the State of the Art”:

Discriminative models, rich feature sets and other developments have led to named-entity taggers with accuracies above 90%. This is not only for simple categories like people names and places, but also for complex entity types like genes and chemical compounds. Entity taggers are so robust today that they can often be (and are) used out-of-the-box for many real world applications.

Could someone point me to 90%-accuracy gene taggers? Ideally, boxed ones that could be used for real-world applications. That figure is better than the best result from the last BioCreative on a single category, GENE (not that anyone would know, because they haven’t made the results public), and that evaluation used soft boundary decisions, allowing multiple “correct” boundaries. Results for the GENIA corpus, with the kind of structured categories Ryan’s talking about, have been much lower.

Most of the accuracy of high-accuracy taggers derives from tagging the same things right again and again (e.g. “the” and “be” for English POS taggers, “George Bush” and “IBM” for English entity taggers trained on WSJ). Entity mentions follow a power-law (Zipfian) distribution, so most distinct entities are mentioned only rarely. Thus 90% per-token accuracy translates into terrible per-type accuracy, especially when one moves away temporally and topically from the training data (e.g. from well-edited, conventionally structured Wall St. Journal news articles to movie or music reviews).
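
Here’s a minimal simulation of that effect. Everything in it is made up for illustration: a 1/rank Zipf distribution over 10,000 entity types, and a tagger that has simply memorized enough of the most frequent types to hit 90% per-token accuracy.

```python
import random
from collections import Counter

random.seed(0)

# Draw 100,000 entity mentions from a Zipfian (1/rank) distribution
# over 10,000 distinct entity types.
NUM_TYPES = 10_000
weights = [1.0 / rank for rank in range(1, NUM_TYPES + 1)]
mentions = random.choices(range(NUM_TYPES), weights=weights, k=100_000)
freq = Counter(mentions)

# Assume the tagger gets the most frequent types right and misses the
# rest; give it just enough types to reach 90% per-token accuracy.
known, covered = set(), 0
for entity_type, count in freq.most_common():
    if covered >= 0.90 * len(mentions):
        break
    known.add(entity_type)
    covered += count

print(f"per-token accuracy: {covered / len(mentions):.2f}")  # ~0.90
print(f"per-type accuracy:  {len(known) / len(freq):.2f}")   # far lower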

The situation gets worse if we’re looking for simple relations through collocations. If entity errors were independent, 90% recall on each mention would yield 0.9 × 0.9 = 0.81 recall on a binary relation. But errors in sentences tend to be correlated, not independent, as many folks have noticed over the years. Thus we’re likely to get higher than 0.81 recall of common binary relations, but much lower than 0.81 recall of rarer ones. And for applications, we simply don’t need the correlation between “George Bush” and “White House”, or between “p53” and “apoptosis”; everyone knows those relations.
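
The arithmetic is easy to work through with made-up numbers. Suppose 80% of mentions are easy ones with 0.98 recall and 20% are hard ones with 0.58 recall, so per-mention recall still averages 0.9, and suppose both mentions of a relation share the same difficulty class:

```python
# All numbers below are invented for illustration.
p_easy, p_hard = 0.98, 0.58       # per-mention recall by difficulty class
share_easy, share_hard = 0.8, 0.2

# Per-mention recall still averages 0.9 overall.
print(f"average mention recall: {share_easy * p_easy + share_hard * p_hard:.2f}")

# If errors were independent, relation recall would be 0.9 * 0.9.
print(f"independence baseline:  {0.9 ** 2:.2f}")     # 0.81

# With correlated errors (both mentions share a difficulty class),
# common relations beat the baseline and rare ones fall far below it.
print(f"common relations found: {p_easy ** 2:.2f}")  # 0.96
print(f"rare relations found:   {p_hard ** 2:.2f}")  # 0.34
```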

Anyway, this problem is what’s led us to focus on high-recall approaches. Check out our named entity recognition tutorial for more information and even examples that run out of the box.
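
The tutorial has the real, runnable code; what follows is just a sketch in Python of the underlying idea, with a hypothetical Chunk type and threshold: keep every candidate mention whose confidence clears a deliberately low bar, rather than only the single best analysis, and let downstream filtering clean up the extras.

```python
from typing import Iterable, List, NamedTuple

class Chunk(NamedTuple):
    """A hypothetical scored candidate mention."""
    start: int          # offset of the first character
    end: int            # offset one past the last character
    entity_type: str    # e.g. "GENE" or "PERSON"
    confidence: float   # model confidence in [0, 1]

def high_recall_chunks(candidates: Iterable[Chunk],
                       threshold: float = 0.05) -> List[Chunk]:
    """Keep every candidate above a deliberately low confidence
    threshold, trading precision for recall."""
    return sorted((c for c in candidates if c.confidence >= threshold),
                  key=lambda c: c.confidence, reverse=True)
```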

Ryan McDonald was at the talk I gave at Columbia University on high-recall entity extraction, and his point was that the errors are often ones of boundaries or of type assignment, and that it’s really no big deal for an app if a phrase is off by a token or two at the boundaries. I suppose that depends on the application. Boundary words are often discriminative in gene names or person names, and we’ve historically had a world of trouble with coreference by clustering when the names are off by one, as the toy example below shows.
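
Here the clustering is just exact string matching, the simplest form of coreference by clustering, and the mentions are made up; a single extra boundary token is enough to split what should be one cluster.

```python
# Cluster mentions by exact string match (toy coreference by clustering).
def exact_match_clusters(mentions):
    clusters = {}
    for m in mentions:
        clusters.setdefault(m, []).append(m)
    return list(clusters.values())

gold = ["p53", "p53", "George Bush"]
tagged = ["p53 protein", "p53", "George Bush"]  # one boundary off by a token

print(exact_match_clusters(gold))    # [['p53', 'p53'], ['George Bush']]
print(exact_match_clusters(tagged))  # 'p53 protein' no longer links to 'p53'
```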