Building High Precision Classifiers, Taggers, Chunkers, Spelling Correctors, …


Not only is F measure hard to optimize, it’s not what customers want. Neither is area under a precision-recall curve or high average precision stats. Sure, these might be reasonable proxies, but customers typically choose an operating point for their systems and go with it. What they really want is a service level agreement (SLA) for precision and/or recall.

High Recall

We’ve been focused on high recall in our research applications for DARPA (person, location, organization and interaction extraction) and NIH (gene, protein, disease, and relationship extractors). In these cases, intelligence analysts or biologists are willing to pore through a lot of false positives in order to find a true positive.

High Precision

When we deal with corporate customers building customer-facing information extraction or classification applications, they almost always want high precision. Concretely, the goal is almost always:

Build a system with 90% (or 95% or 99%) precision with as much recall as possible.

I’d like to see some bakeoffs with this metric.

Precision is closely related to the frequentist statistical notion of a false discovery rate (FDR). [Storey and Tibshirani 2003] provides a good discussion of the issues along with some interesting inference techniques in a multiple hypothesis testing framework. Being a Bayesian, I like hierarchical models plus posterior inference for this job, but that’s not the point of this post.

How We Build High Precision Search System

Suppose we have a search-type information extraction problem. For instance, we may be looking for protein-protein interactions in text, or spelling corrections without too many false positive suggestions.

Often what we’re looking for has low prevalence (e.g. indications of terrorists interacting with each other in a situation involving money), so we can’t just grab a random sample of data and start annotating.

As usual, there’s a premium on skilled annotators (no, we can’t just use Mechanical Turk, because we’re usually dealing with proprietary corporate databases in these situations), a less than well-defined task (e.g. find complaints about our product), and a huge pile of unlabeled data (usually millions and sometimes billions of cases [or trillions if you’re at MS, Yahoo! or Google]).

In the simple case of a two-category classifier, we start with no training data and then iteratively:

  1. Find some positive cases of what we’re looking for.
  2. Build a search-like application to find some more.
  3. Label the highest-probability results from our search as positive or negative.

This is very different than most active learning strategies, which typically select instances with high uncertaity to label. Instead, what we’re doing is labeling the data which the last system was most confident was positive.

This can be dangerous in that the boundary can be hard to detect. And the coding standard defining the boundary may change. This isn’t as much of a problem as it may seem, because we always go back and check the data we already have.

Like active learning, the result is a biased data set favoring items that score toward the high end of positiveness in one of the iterations. Active learning results in a data set focused on the positive/negative boundary.

We’ve never had time to evaluate this strategy on large bodies of data, like say the Reuters corpus. (Mainly, because we haven’t been working on that particular problem.)

Setting Operating Points

Now we come to practice and need to choose an operating point. If we have a probabilistic classifier, we return items that the classifier estimates to be above a certain likelihood of being true. Clearly if we set that point at 0.90 and our system is very accurate in our estimates, we’ll wind up with higher than 90% precision if the items are distributed on the 0.9 to 1.0 range. So we often set the threshold lower even when our probabilistic classifier is fairly accurate. (Logistic regression, by the way, is much more accurate than naive Bayes because of the way it accounts for dependencies.)

Choosing Operating Point by Utility

An even better approach is to assume a utility (equivalent loss) for each true positive, false positive, true negative, and false negative. Then you can assign a utility to each operating point and maximize the utility rather than just taking a 90% precision operating point. This is true eve if we assign 1 as the utility of a true positive and -9 as the utility of a false positive, and 0 the utility of false and true negatives. The 90% precision operating point provides a utility of 0 here and we may be able to do better if we set the operating point higher (thus lowering expected recall but improving precision and hence utility).

Adding utilities to a probability model puts us in the land of statistical decision theory.

Plain Old Variance, Over-Dispersion, and Non-Stationarity

Garden variety binomial variance is an issue here. If we don’t have a lot of data, we’ll wind up with a pretty fat posterior on the 90% precision threshold (or utility). If we set a very conservative estimate here to meet our high precision service level agreement, we’ll potentially lose lots of recall.

Overdispersion is a huge problem. One of my favorite examples was from one of the MUC entity evaluations. A single document had several mentions of the planet Mars, which was labeled as a location. Because the whole document was included, there were way more instances of the phrase “Mars” in the test data than you’d expect given the frequency of the term itself. A classic case of overdispersion (or “burstiness” as it’s often called in language data; see, e.g., (Jansche 2003) for an overview).

Another major problem is that data in the wild is rarely stationary. Instead, from hour to hour, day to day, or week to week, the topics slightly or dramatically change. Systems trained on MUC data from the 1990s don’t reflect today’s distribution of persons, locations or organizations in the news. Similarly, prior data just won’t predict the frequency of the location “Eyjafjallajökull” in this week’s news.

3 Responses to “Building High Precision Classifiers, Taggers, Chunkers, Spelling Correctors, …”

  1. Breck Baldwin Says:

    Customers will almost always ask for 90% or more for whatever metric we hand them, I think it is because in the American school system 90% or above is an A. When they actually see data they tend to be happy with lower performance.

  2. Jacob Says:

    Here’s a suggestion for naming that non Active Learning process: Manualated Learning. I’m doing something like that right now:
    1) Identify uniquely specific keywords for classification
    2) Classify documents based on the frequency of those keywords
    3) Use those classified documents as training data for supervised classification

    We’re currently in the “manualation loop” of 1 & 2, but we’ll soon be feeding 2 into 3 to see how the results are.

  3. Text Classification for Sentiment Analysis – Precision and Recall « Says:

    […] F-measure to be about as useful as accuracy. Or in other words, compared to precision & recall, F-measure is mostly useless, as you'll see below.Measuring Precision and Recall of a Naive Bayes ClassifierThe NLTK metrics […]

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: