Efficiency through Ignorance

We’ve been working on systems that achieve very high recall (e.g. 99.95% at 10% precision for MEDLINE gene mentions using the untuned models from our named entity tutorial).

In order to achieve this level of recall, pruning thresholds must be set fairly high, which entails a difficult balancing act between efficiency and accuracy. While diving into the literature to see how other systems achieved their efficiency, I began to notice how many systems made strong “closed-world” assumptions in their pruning strategies. For instance, Thorsten Brants’s widely used TnT Tagger is fairly typical in only tagging “known” words with tags that were found in the training data (see the TnT paper, formula 5 and section 2.3). The same strategy is available as a mode in Adwait Ratnaparkhi’s tagger (see Tag Dictionary on p. 136 of the paper).
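The tag-dictionary heuristic is easy to state in code. Here’s a minimal sketch of the closed-world restriction, not taken from TnT or Ratnaparkhi’s implementation; the function names (`build_tag_dict`, `candidate_tags`) and the toy corpus are my own illustrations:

```python
from collections import defaultdict

def build_tag_dict(tagged_corpus):
    """Map each training word to the set of tags it appeared with."""
    tag_dict = defaultdict(set)
    all_tags = set()
    for word, tag in tagged_corpus:
        tag_dict[word].add(tag)
        all_tags.add(tag)
    return tag_dict, all_tags

def candidate_tags(word, tag_dict, all_tags):
    """Closed-world pruning: a known word may only take tags seen in
    training; only unknown words get the full (open) tag set."""
    return tag_dict.get(word, all_tags)

corpus = [("the", "DT"), ("dog", "NN"), ("runs", "VBZ"),
          ("the", "DT"), ("run", "NN"), ("run", "VB")]
tag_dict, all_tags = build_tag_dict(corpus)

candidate_tags("run", tag_dict, all_tags)    # {"NN", "VB"} — never any other tag,
                                             # no matter how strong the context
candidate_tags("zebra", tag_dict, all_tags)  # unknown word: full tag set
```

The efficiency win is clear: the search over tag sequences only branches over the (usually tiny) observed tag set for each known word. The blind spot is equally clear: “run” can never be tagged with a tag it happened not to receive in training.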

Collins’s parser uses a similar heuristic, in that it only considers lexical categories that were assigned by an underlying tagger, in this case Ratnaparkhi’s (see section 3.1 and the description in table 1 of the paper).

Similar assumptions are often made (erroneously) about word senses, citing Gale, Church, and Yarowsky’s (in)famous “One sense per discourse” paper.

In all of these cases, efficiency is derived through a kind of enforced ignorance. To some extent, these decisions might also help first-best accuracy, by forcing the lexical statistics to dominate when they might otherwise be overwhelmed by contextual evidence.

All of this reminded me of a talk by William Woods from KR ’94. I never saw the talk, but I found the abstract so compelling that I’ve never forgotten it:

Beyond Ignorance-Based Systems

W. A. Woods — Sun Microsystems Laboratories, Inc., USA

The field of artificial intelligence has a long tradition of exploiting the potential of limited domains. While this is beneficial as a way to get started and has utility for applications of limited scope, these approaches will not scale to systems with more open-ended domains of knowledge. Many “knowledge-based” systems actually derive their success as much from ignorance as from the knowledge that they contain. That is, they succeed because they don’t know any better. Too great a reliance on a closed-world assumption and default reasoning in a limited domain can result in a system that is fundamentally limited and cannot be extended beyond its initial domain.

If the field of knowledge-based systems is to move beyond this stage, we need to develop knowledge representation and reasoning technology that is more robust in the face of domain extensions. Nonmonotonic reasoning becomes a liability if the fundamental abilities of a system can be destroyed by the addition of knowledge from a new domain. This talk will discuss some of the challenges that we must meet to develop systems that can handle diverse ranges of knowledge.

The problem is that there’s a very long tail in natural language data, and heuristic pruning of this kind leaves systems with a definite blind spot. The deeper problem is that even without these a priori pruning heuristics (e.g. one sense per discourse, or only tagging known words with known tags), training corpus statistics may lead systems to roughly the same decisions purely statistically.
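To see how the statistics alone reproduce the heuristic, consider unsmoothed maximum-likelihood emission estimates in an HMM tagger. This is an illustrative sketch of that general point, not code from any of the cited systems; the toy corpus and names are mine:

```python
from collections import Counter

corpus = [("the", "DT"), ("run", "NN"), ("run", "VB")]

emit_counts = Counter(corpus)                    # counts of (word, tag) pairs
tag_counts = Counter(tag for _, tag in corpus)   # counts of each tag

def p_emit(word, tag):
    """Unsmoothed MLE estimate of P(word | tag)."""
    if tag_counts[tag] == 0:
        return 0.0
    return emit_counts[(word, tag)] / tag_counts[tag]

p_emit("run", "NN")  # 1.0
p_emit("run", "DT")  # 0.0 — "run" was never tagged DT in training, so any
                     # path assigning it DT gets zero probability and is
                     # effectively pruned, with no explicit tag dictionary
```

In other words, the unsmoothed model enforces the same closed world implicitly: zero emission probability kills the hypothesis just as surely as leaving it out of a tag dictionary.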
