Not only is F measure hard to optimize, it’s not what customers want. Neither is area under a precision-recall curve or high average precision stats. Sure, these might be reasonable proxies, but customers typically choose an operating point for their systems and go with it. What they really want is a service level agreement (SLA) for precision and/or recall.
We’ve been focused on high recall in our research applications for DARPA (person, location, organization and interaction extraction) and NIH (gene, protein, disease, and relationship extractors). In these cases, intelligence analysts or biologists are willing to pore through a lot of false positives in order to find a true positive.
When we deal with corporate customers building customer-facing information extraction or classification applications, they almost always want high precision. Concretely, the goal is almost always:
Build a system with 90% (or 95% or 99%) precision with as much recall as possible.
I’d like to see some bakeoffs with this metric.
Precision is closely related to the frequentist statistical notion of a false discovery rate (FDR). [Storey and Tibshirani 2003] provides a good discussion of the issues along with some interesting inference techniques in a multiple hypothesis testing framework. Being a Bayesian, I like hierarchical models plus posterior inference for this job, but that’s not the point of this post.
How We Build High Precision Search System
Suppose we have a search-type information extraction problem. For instance, we may be looking for protein-protein interactions in text, or spelling corrections without too many false positive suggestions.
Often what we’re looking for has low prevalence (e.g. indications of terrorists interacting with each other in a situation involving money), so we can’t just grab a random sample of data and start annotating.
As usual, there’s a premium on skilled annotators (no, we can’t just use Mechanical Turk, because we’re usually dealing with proprietary corporate databases in these situations), a less than well-defined task (e.g. find complaints about our product), and a huge pile of unlabeled data (usually millions and sometimes billions of cases [or trillions if you're at MS, Yahoo! or Google]).
In the simple case of a two-category classifier, we start with no training data and then iteratively:
- Find some positive cases of what we’re looking for.
- Build a search-like application to find some more.
- Label the highest-probability results from our search as positive or negative.
This is very different than most active learning strategies, which typically select instances with high uncertaity to label. Instead, what we’re doing is labeling the data which the last system was most confident was positive.
This can be dangerous in that the boundary can be hard to detect. And the coding standard defining the boundary may change. This isn’t as much of a problem as it may seem, because we always go back and check the data we already have.
Like active learning, the result is a biased data set favoring items that score toward the high end of positiveness in one of the iterations. Active learning results in a data set focused on the positive/negative boundary.
We’ve never had time to evaluate this strategy on large bodies of data, like say the Reuters corpus. (Mainly, because we haven’t been working on that particular problem.)
Setting Operating Points
Now we come to practice and need to choose an operating point. If we have a probabilistic classifier, we return items that the classifier estimates to be above a certain likelihood of being true. Clearly if we set that point at 0.90 and our system is very accurate in our estimates, we’ll wind up with higher than 90% precision if the items are distributed on the 0.9 to 1.0 range. So we often set the threshold lower even when our probabilistic classifier is fairly accurate. (Logistic regression, by the way, is much more accurate than naive Bayes because of the way it accounts for dependencies.)
Choosing Operating Point by Utility
An even better approach is to assume a utility (equivalent loss) for each true positive, false positive, true negative, and false negative. Then you can assign a utility to each operating point and maximize the utility rather than just taking a 90% precision operating point. This is true eve if we assign 1 as the utility of a true positive and -9 as the utility of a false positive, and 0 the utility of false and true negatives. The 90% precision operating point provides a utility of 0 here and we may be able to do better if we set the operating point higher (thus lowering expected recall but improving precision and hence utility).
Adding utilities to a probability model puts us in the land of statistical decision theory.
Plain Old Variance, Over-Dispersion, and Non-Stationarity
Garden variety binomial variance is an issue here. If we don’t have a lot of data, we’ll wind up with a pretty fat posterior on the 90% precision threshold (or utility). If we set a very conservative estimate here to meet our high precision service level agreement, we’ll potentially lose lots of recall.
Overdispersion is a huge problem. One of my favorite examples was from one of the MUC entity evaluations. A single document had several mentions of the planet Mars, which was labeled as a location. Because the whole document was included, there were way more instances of the phrase “Mars” in the test data than you’d expect given the frequency of the term itself. A classic case of overdispersion (or “burstiness” as it’s often called in language data; see, e.g., (Jansche 2003) for an overview).
Another major problem is that data in the wild is rarely stationary. Instead, from hour to hour, day to day, or week to week, the topics slightly or dramatically change. Systems trained on MUC data from the 1990s don’t reflect today’s distribution of persons, locations or organizations in the news. Similarly, prior data just won’t predict the frequency of the location “Eyjafjallajökull” in this week’s news.