Active Learning High Confidence Items for High Precision?

A popular active learning strategy is to label new items about which the current classifier is uncertain. The usual citation is the 1994 ICML paper by Dave Lewis and Jason Catlett, Heterogeneous Uncertainty Sampling for Supervised Learning. The intuition is that this focuses the learner on the marginal cases.
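As a rough illustration (my sketch, not code from the paper), uncertainty sampling with a probabilistic binary classifier just means requesting labels for the items whose estimated P(positive) sits closest to the decision boundary:

```python
# Sketch of uncertainty sampling (illustration, not the paper's code):
# given the current classifier's estimated P(positive) for each
# unlabeled item, request labels for the items nearest the 0.5
# decision boundary.

def uncertainty_sample(probs, k):
    """Return indices of the k items the classifier is least certain about."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]

# Made-up probabilities for five unlabeled items.
probs = [0.99, 0.52, 0.10, 0.48, 0.85]
print(uncertainty_sample(probs, 2))  # prints [1, 3]: the borderline items
```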

I just realized, in responding to a comment from Ramesh Nallapati, that what we’ve been doing here at Alias-i is exactly the opposite of uncertainty sampling.

We sample from the highest-confidence positive examples of a category for new labels. This is the strategy most likely to uncover negative examples misclassified with high confidence, the bane of high-precision systems. (I say “we”, but this was mainly Breck’s idea, as he does most of our on-the-ground application development with customers.) Our hypothesis is that this is the quickest route to high-precision classifiers.
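A minimal sketch of the strategy, with the same kind of made-up probabilities as above (an illustration, not our production code):

```python
# Sketch of the opposite strategy: send the classifier's *most*
# confident positives out for labeling. Any that come back negative
# are high-confidence false positives, exactly the errors that hurt
# precision most.

def high_confidence_sample(probs, k):
    """Return indices of the k items classified positive most confidently."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]

probs = [0.99, 0.52, 0.10, 0.48, 0.85]
print(high_confidence_sample(probs, 2))  # prints [0, 4]
```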

I know I mainly talk about high-recall systems on this blog, as that’s what we’re focusing on for biology researchers as part of our NIH grant. But our commercial customers are often most concerned about not looking bad, and will put up with some missed cost savings or opportunities.

Our general focus is on high-precision or high-recall operating points; we rarely interpret customers’ requests as wanting maximum F measure. For high recall, we often focus our tuning efforts on the least confident positive examples and the most highly ranked false positives, to see if we can move up the relative rank of the positive examples.
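As a sketch of that high-recall tuning loop (with invented scores and gold labels, not LingPipe code): rank a held-out set by classifier score, then pull out the least confident true positives and the highest-ranked false positives for inspection.

```python
# Rank a labeled held-out set by classifier score, then collect the
# two groups worth inspecting when tuning for recall: true positives
# the classifier barely likes, and false positives it ranks highly.

def recall_tuning_targets(scores, labels, k):
    """Return (least confident positives, highest ranked false positives)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    least_confident_pos = [i for i in order if labels[i] == 1][-k:]
    top_false_pos = [i for i in order if labels[i] == 0][:k]
    return least_confident_pos, top_false_pos

scores = [0.9, 0.8, 0.7, 0.6, 0.3]   # classifier confidence, made up
labels = [1, 0, 1, 0, 1]             # gold labels, made up
print(recall_tuning_targets(scores, labels, 2))  # prints ([2, 4], [1, 3])
```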

In some classification problems, the marginal examples are truly marginal, and trying to tease them into absolute categories is counterproductive because the hair-splitting involved is so frustrating. It’s much, much faster, not to mention more fun, to make decisions on high-confidence items. Remember, a lot of UI design is about the perception of time, not the amount of time itself. The time saved by labeling easier examples may compound with even greater perceived speed.

Has anyone else tried this kind of “heterogeneous certainty sampling”? We don’t have any contrastive data.

6 Responses to “Active Learning High Confidence Items for High Precision?”

  1. Peter Turney Says:

    In my opinion, active learning should be done by assigning a cost to obtaining labels and a cost to misclassification errors. Questions about which cases to sample then become a matter of balancing costs. See:

    Types of Cost in Inductive Concept Learning
    http://arxiv.org/abs/cs.LG/0212034

  2. lingpipe Says:

    Thanks, Peter — that’s a really nice survey. Section 2.2.3 about dependency on other classifications generalizes the particular applications I was thinking about — it captures recall vs. precision preferences.

    Assigning costs is as problematic for classification, etc., as in general decision theory and economics. But I think we do it implicitly, if not explicitly, by defining things like 95% precision lower bounds.

    Peter also discusses assigning less value to correlated answers (on a per-answer basis), which reminds me of maximal marginal relevance in search results ordering (that is, try to get a diverse set of relevant answers to a query). The idea of multiple answers being progressively less useful is common in customer database population requests (e.g. can you find the actor for this movie?).

    The conditional test costs (false positive vs. false negative) come up in the epidemiology literature all the time, but almost never in what I’ve seen of machine learning in general or natural language processing in particular. The exception seems to be ACE’s entity evaluation metrics, which allow different error types to be weighted differently.

    P.S. This is exactly the kind of survey/taxonomy paper that’s useful to read but undervalued in conference reviewing. So double thanks for writing it!

    • Peter Turney Says:

      Section 2.2.3 covers precision versus recall and Section 4 covers the cost of labels. Section 4.2 discusses variable costs for easy labels versus hard labels. Active learning is mentioned in Section 4 and Section 11.

  3. mlstat Says:

    In my opinion, the sample on which to request a label should not be picked based on how well we think the current classifier is classifying it, but on how much information we expect the label to provide about the classifier parameters.

    In an old paper we showed that, in some Bayesian sense, this amounts to picking the sample that leads to the greatest expected change in the parameters. In that sense, your sampling of high-confidence regions may be well justified. (http://portal.acm.org/citation.cfm?id=1143844.1143965 )
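    [For what it’s worth, here’s one common way to make that concrete for logistic regression — a sketch of the “expected model change” idea, not the derivation from the paper. For label y, the log-loss gradient for one item is (y − p)·x, so the expected gradient norm under the model’s own label distribution works out to 2·p·(1 − p)·‖x‖. — ed.]

```python
import math

# Sketch of an "expected model change" score for logistic regression
# (an illustration of the principle, not the paper's derivation).
# For label y the log-loss gradient for one item is (y - p) * x, so
# the expected gradient norm under the model's own label distribution is
#   p * (1 - p) * ||x|| + (1 - p) * p * ||x|| = 2 * p * (1 - p) * ||x||.

def expected_model_change(x, p):
    """Expected L2 norm of the gradient if we were to label this item."""
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    return 2.0 * p * (1.0 - p) * norm_x
```

    Note that the score peaks at p = 0.5 but scales with the feature norm, so a confidently classified item with large features can still promise a larger expected update than a borderline one — consistent with high-confidence sampling sometimes being well justified.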

    • lingpipe Says:

      Cool. Is there a non-pay-per-view version of the paper somewhere?

      That’s what I concluded when I wrote the blog entry Bayesian Active Learning. I don’t know how to do the proper Bayesian calculation here, though.

      There’s also a problem with the model’s over-estimate of its own confidence. In practice, we find errors in the high-confidence regions at a much higher rate than our models (logistic regression, naive Bayes, etc.) predict. So while changing the class to negative for a high-positive-confidence item may change the parameters, the model won’t expect it to change polarity.
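      A quick way to see this over-confidence on held-out data (a sketch with invented numbers): among items the model calls positive above a confidence threshold, compare the error rate the model predicts for itself with the error rate actually observed.

```python
# Calibration check: an over-confident model shows an empirical error
# rate in its high-confidence region that exceeds the error rate its
# own probabilities predict.

def high_confidence_error_rate(probs, labels, threshold=0.9):
    """Return (model-predicted error rate, empirical error rate)
    over items with P(positive) >= threshold."""
    picked = [(p, y) for p, y in zip(probs, labels) if p >= threshold]
    predicted = sum(1.0 - p for p, _ in picked) / len(picked)
    empirical = sum(1 for _, y in picked if y == 0) / len(picked)
    return predicted, empirical
```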

      • mlstat Says:

        Here is a link to the PDF: http://sra.itc.it/people/olivetti/cv/80.pdf

        The paper deals with actively sampling feature values for feature relevance estimation, but the theory applies to sampling any missing values in a data matrix — in particular to the case where the data matrix has all the feature values but the class labels are missing.

        Regarding your post on Bayesian Active Learning, I agree that the problem with active sampling based on uncertainty is its preference for outliers. Another problem is that if there is a cluster in the feature space where the positive and negative classes are equally likely, the uncertainty sampler will keep picking from that cluster: no matter how well the classifier has been learned in that region, the posterior class probabilities remain equal, so the uncertainty stays high.
