A popular active learning strategy is to label new items about which the current classifier is uncertain. The usual citation is the 1994 ICML paper by Dave Lewis and Jason Catlett, Heterogeneous Uncertainty Sampling for Supervised Learning. The intuition is that this focuses the learner on the marginal cases.
I just realized, in my response to a comment from Ramesh Nallapati, that what we’ve been doing here at Alias-i is exactly the opposite of uncertainty sampling.
We sample from the highest-confidence positive examples of a category for new labels. This is the strategy most likely to uncover confidently misclassified negative examples (high-confidence false positives), the bane of high-precision systems. (I say “we”, but this was mainly Breck’s idea, as he does most of our on-the-ground application development with customers.) Our hypothesis is that this is the quickest route to high-precision classifiers.
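To make the contrast concrete, here’s a minimal sketch of the two selection rules (the names and the predict_prob helper are illustrative, not our production code):

```python
# Minimal sketch of the two selection rules (illustrative; not LingPipe code).
# predict_prob(x) is assumed to return the current classifier's P(positive | x).

def uncertainty_sample(pool, predict_prob, batch_size=10):
    """Classic uncertainty sampling: label the items closest to the decision boundary."""
    return sorted(pool, key=lambda x: abs(predict_prob(x) - 0.5))[:batch_size]

def high_confidence_positive_sample(pool, predict_prob, batch_size=10):
    """The opposite: label the items the classifier is most confident are positive,
    which is where costly high-confidence false positives hide."""
    return sorted(pool, key=predict_prob, reverse=True)[:batch_size]
```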
I know I mainly talk about high-recall systems in the blog, as that’s what we’re focusing on for biology researchers as part of our NIH grant. But our commercial customers are often most concerned about not looking bad and will put up with some missed cost savings or opportunities.
Our general focus is on high-precision or high-recall operating points, as we rarely interpret customers’ requests as wanting maximum F measure. For high recall, we often focus our tuning efforts on the least confident positive examples and the more highly ranked false positives to see if we can move up the relative rank of the positive examples.
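Roughly, the review set for high-recall tuning looks like the following sketch (assuming a held-out set with gold labels and a scoring function; the names are mine):

```python
# Rough sketch of picking items to inspect for high-recall tuning
# (gold_label and score are assumed helpers; names are illustrative).

def high_recall_review_set(dev_items, gold_label, score, k=25):
    """Return the weakest-scoring known positives and the strongest-scoring
    known negatives: the items whose relative ranks we'd most like to swap."""
    weak_positives = sorted((x for x in dev_items if gold_label(x)),
                            key=score)[:k]
    strong_false_positives = sorted((x for x in dev_items if not gold_label(x)),
                                    key=score, reverse=True)[:k]
    return weak_positives, strong_false_positives
```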
In some classification problems, the marginal examples are truly marginal, and trying to tease them into absolute categories is counterproductive because the hair-splitting involved is so frustrating. It’s much, much faster, not to mention more fun, to make decisions on high-confidence items. Remember, a lot of UI design is also about the perception of time, not the amount of time itself. The real time saved by labeling easier examples may be compounded by an even greater perceived speed.
Has anyone else tried this kind of “heterogeneous certainty sampling”? We don’t have any contrastive data.
August 31, 2009 at 3:56 pm
In my opinion, active learning should be done by assigning a cost to obtaining labels and a cost to misclassification errors. Questions about which cases to sample then become a matter of balancing costs. See:
Types of Cost in Inductive Concept Learning
http://arxiv.org/abs/cs.LG/0212034
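As a toy illustration of that balancing act (the numbers are made up, and treating the error cost saved on a single item as the label’s payoff is only a crude proxy for the real benefit of training on it):

```python
# Toy cost-balancing rule (made-up costs; a crude proxy, since the real payoff
# of a label is the improvement to the trained classifier, not just this item).

LABEL_COST = 1.0   # cost of asking an annotator for one label
FP_COST = 5.0      # cost of acting on a false positive
FN_COST = 2.0      # cost of missing a true positive

def expected_error_cost(p_positive):
    """Expected misclassification cost of the cheaper prediction for one item."""
    return min((1.0 - p_positive) * FP_COST,  # cost if we predict positive
               p_positive * FN_COST)          # cost if we predict negative

def worth_labeling(p_positive):
    """Request a label only if the expected error cost exceeds the label's price."""
    return expected_error_cost(p_positive) > LABEL_COST
```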
August 31, 2009 at 4:15 pm
Thanks, Peter — that’s a really nice survey. Section 2.2.3 about dependency on other classifications generalizes the particular applications I was thinking about — it captures recall vs. precision preferences.
Assigning costs is as problematic for classification, etc., as it is in general decision theory and economics. But I think we do it implicitly, if not explicitly, by defining things like 95% precision lower bounds.
Peter also discusses assigning less value to correlated answers (on a per-answer basis), which reminds me of maximal marginal relevance in search results ordering (that is, try to get a diverse set of relevant answers to a query). The idea of multiple answers being progressively less useful is common in customer database population requests (e.g. can you find the actor for this movie?).
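For reference, the MMR selection rule boils down to something like this sketch (my paraphrase; the two similarity functions are assumed):

```python
# Sketch of maximal marginal relevance (MMR) re-ranking.
# sim_to_query(d) and sim(d1, d2) are assumed similarity functions.

def mmr_rerank(candidates, sim_to_query, sim, lam=0.7, k=10):
    """Greedily pick items relevant to the query but dissimilar to those already chosen."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda d: lam * sim_to_query(d)
                   - (1 - lam) * max((sim(d, s) for s in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected
```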
The conditional test costs (false positive vs. false negative) come up in the epidemiology literature all the time, but almost never in what I’ve seen of machine learning in general or natural language processing in particular. The exception seems to be ACE’s entity evaluation metrics, which allow different error types to be weighted differently.
P.S. This is exactly the kind of survey/taxonomy paper that’s useful to read but undervalued in conference reviewing. So double thanks for writing it!
August 31, 2009 at 5:20 pm
Section 2.2.3 covers precision versus recall and Section 4 covers the cost of labels. Section 4.2 discusses variable costs for easy labels versus hard labels. Active learning is mentioned in Section 4 and Section 11.
August 31, 2009 at 4:56 pm
In my opinion, the sample to request a label for shouldn’t be picked based on how well we think the current classifier is classifying it, but on how much information we expect the label to provide about the classifier’s parameters.
In an old paper we showed that, in some Bayesian sense, this amounts to picking the sample that leads to the greatest expected change in the parameters. In that sense, your sampling of high-confidence regions may be well justified (http://portal.acm.org/citation.cfm?id=1143844.1143965).
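Roughly, the criterion looks like this sketch of the general “expected model change” idea (not the paper’s exact Bayesian derivation; gradient_fn and predict_prob are assumed helpers):

```python
import numpy as np

# Sketch of the "expected model change" idea (not the paper's exact criterion).
# gradient_fn(x, y) returns the parameter gradient if x were labeled y;
# predict_prob(x) is the current model's P(y = 1 | x).

def expected_model_change(x, predict_prob, gradient_fn):
    p = predict_prob(x)
    return (p * np.linalg.norm(gradient_fn(x, 1))
            + (1 - p) * np.linalg.norm(gradient_fn(x, 0)))

def pick_query(pool, predict_prob, gradient_fn):
    """Ask for the label expected to move the parameters the most."""
    return max(pool, key=lambda x: expected_model_change(x, predict_prob, gradient_fn))
```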
September 1, 2009 at 4:44 pm
Cool. Is there a non-pay-per-view version of the paper somewhere?
That’s what I concluded when I wrote the blog entry Bayesian Active Learning. I don’t know how to do the proper Bayesian calculation here, though.
There’s also a problem with the model’s overestimate of its own confidence. In practice, we find errors in the high-confidence regions at a much higher rate than our models (logistic regression, naive Bayes, etc.) predict. So while changing the class to negative for a high-positive-confidence item may change the parameters, the model won’t expect the polarity to change.
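One quick way to see this overconfidence (a hypothetical helper, not LingPipe code) is to compare average predicted confidence with empirical accuracy in the model’s top-confidence slice of a labeled set:

```python
# Hypothetical calibration check: compare average predicted confidence against
# empirical accuracy among the items the model is most confident are positive.

def top_slice_calibration(items, gold_label, predict_prob, top_frac=0.05):
    ranked = sorted(items, key=predict_prob, reverse=True)
    top = ranked[:max(1, int(len(ranked) * top_frac))]
    mean_predicted = sum(predict_prob(x) for x in top) / len(top)
    empirical_accuracy = sum(1 for x in top if gold_label(x)) / len(top)
    return mean_predicted, empirical_accuracy  # we find the second is often much lower
```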
September 2, 2009 at 8:11 am
Here is a link to the PDF: http://sra.itc.it/people/olivetti/cv/80.pdf
The paper deals with actively sampling feature values for feature relevance estimation, but the theory applies to sampling any missing values in a data matrix — in particular to the case where the data matrix has all the feature values but the class labels are missing.
Regarding your post on Bayesian Active Learning, I agree that a problem with active sampling based on uncertainty is its preference for outliers. Another problem is that if there is a cluster in the feature space where the positive and negative classes are equally likely, the uncertainty sampler will keep picking from that cluster: no matter how well the classifier has been learned in that region, the posterior class probabilities remain equal, so the uncertainty remains high.