Here is a link to the PDF: http://sra.itc.it/people/olivetti/cv/80.pdf

The paper deals with actively sampling feature values for feature relevance estimation, but the theory applies to sampling any missing values in a data matrix — in particular to the case where the data matrix has all the feature values but the class labels are missing.

Regarding your post on Bayesian Active Learning, I agree that the problem with active sampling based on uncertainty is its preference for outliers. Another problem is that if there is a cluster in feature space where the positive and negative classes are equally likely, the uncertainty sampler will keep picking from that cluster: no matter how well the classifier has been learned in that region, the posterior class probabilities stay equal, so the uncertainty remains high.
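A minimal sketch of that failure mode, assuming a toy pool where one cluster's posterior is inherently 0.5 while the rest of the space becomes learnable over time (the functions and numbers here are all hypothetical, purely for illustration):

```python
# Toy illustration: uncertainty sampling (pick the candidate whose predicted
# probability is closest to 0.5) gets stuck on a cluster whose classes are
# inherently 50/50, even though there is nothing left to learn there.

def predicted_prob(x, rounds_elapsed):
    """Hypothetical posterior: points with x < 0.5 sit in the ambiguous
    cluster and stay at 0.5 forever; elsewhere confidence sharpens as
    sampling rounds accumulate."""
    if x < 0.5:
        return 0.5
    return 0.5 + 0.5 * min(1.0, rounds_elapsed / 10.0)

def uncertainty(p):
    # Maximal when p == 0.5, zero when the model is certain.
    return 1.0 - 2.0 * abs(p - 0.5)

pool = [i / 20.0 for i in range(20)]  # 20 candidate points in [0, 1)
picks = []
for step in range(15):
    best = max(pool, key=lambda x: uncertainty(predicted_prob(x, step)))
    picks.append(best)

# Every round, the ambiguous cluster (x < 0.5) ties or beats the learnable
# region on uncertainty, so the sampler never leaves it.
print(sum(1 for x in picks if x < 0.5), "of", len(picks),
      "picks landed in the 50/50 cluster")  # 15 of 15
```

The point of the sketch is that uncertainty alone cannot distinguish "the model hasn't learned this region yet" from "this region is irreducibly ambiguous."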

Cool. Is there a non-pay-per-view version of the paper somewhere?

That’s what I concluded when I wrote the blog entry Bayesian Active Learning. I don’t know how to do the proper Bayesian calculation here, though.

There’s also a problem with the model’s overestimate of its own confidence. What we find in practice is that errors occur in high-confidence regions at a much higher rate than our models (logistic regression, naive Bayes, etc.) predict. So while changing the class to negative for a high-positive-confidence item may change the parameters, a model won’t expect it to change polarity.

Section 2.2.3 covers precision versus recall and Section 4 covers the cost of labels. Section 4.2 discusses variable costs for easy labels versus hard labels. Active learning is mentioned in Section 4 and Section 11.

In an old paper we showed that in some Bayesian sense this amounts to picking the sample that leads to the greatest expected change in the parameters. In that sense your sampling of high-confidence regions may be well justified. (http://portal.acm.org/citation.cfm?id=1143844.1143965)

Assigning costs is as problematic for classification, etc., as it is in general decision theory and economics. But I think we do it implicitly, if not explicitly, by defining things like 95% precision lower bounds.

Peter also discusses assigning less value to correlated answers (on a per-answer basis), which reminds me of maximal marginal relevance in search results ordering (that is, try to get a diverse set of relevant answers to a query). The idea of multiple answers being progressively less useful is common in customer database population requests (e.g. can you find the actor for this movie?).
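For readers who haven't seen maximal marginal relevance, here's a rough sketch of the greedy selection rule it's based on: score each candidate by relevance minus its similarity to what's already been picked. The `mmr_select` function, the 0.7 trade-off weight, and the toy items are my own hypothetical illustration, not from any paper cited here.

```python
# Sketch of maximal marginal relevance (MMR): greedily pick answers that
# are relevant but dissimilar to those already selected, so each
# additional answer contributes new information.

def mmr_select(items, relevance, similarity, k, lam=0.7):
    """relevance: item -> score; similarity: (item, item) -> [0, 1].
    lam trades off relevance against redundancy."""
    selected = []
    candidates = list(items)
    while candidates and len(selected) < k:
        def score(d):
            redundancy = max((similarity(d, s) for s in selected), default=0.0)
            return lam * relevance(d) - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy example: "a2" is nearly a duplicate of "a1", "b1" is distinct.
rel = {"a1": 1.0, "a2": 0.95, "b1": 0.8}
def sim(x, y):
    return 0.9 if {x, y} == {"a1", "a2"} else 0.1

picked = mmr_select(rel, rel.get, sim, k=2)
print(picked)  # ['a1', 'b1'] -- the near-duplicate loses to the diverse answer
```

Pure relevance ranking would return the near-duplicate pair; the redundancy penalty is what encodes "multiple similar answers are progressively less useful."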

The conditional test costs (false positive vs. false negative) come up in the epidemiology literature all the time, but almost never in what I’ve seen of machine learning in general or natural language processing in particular. The exception seems to be ACE’s entity evaluation metrics, which allow different error types to be weighted differently.
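In code, that kind of asymmetric weighting could look like the following hypothetical cost-weighted error metric (the 5:1 cost ratio is invented for illustration; it is not from ACE or any particular epidemiology source):

```python
# Hypothetical cost-weighted error: false negatives (misses) are charged
# more heavily than false positives (false alarms), in the spirit of
# evaluation metrics that weight error types differently.

def weighted_error(y_true, y_pred, c_fp=1.0, c_fn=5.0):
    """Total cost over paired gold/predicted binary labels."""
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 0:   # false alarm
            cost += c_fp
        elif p == 0 and t == 1:  # miss
            cost += c_fn
    return cost

# One miss (5.0) plus one false alarm (1.0):
print(weighted_error([1, 1, 0, 0], [0, 1, 1, 0]))  # 6.0
```

Under a symmetric 0/1 loss the two mistakes above would count equally; the cost parameters are where the domain knowledge (e.g. a missed diagnosis being worse than a false alarm) gets encoded.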

P.S. This is exactly the kind of survey/taxonomy paper that’s useful to read but undervalued in conference reviewing. So double thanks for writing it!

Types of Cost in Inductive Concept Learning

http://arxiv.org/abs/cs.LG/0212034