## Dasgupta and Hsu (2008) Hierarchical Sampling for Active Learning

Once I got onto the ICML discussion site, I followed some of the papers that had discussions. I’d been meaning to circle back around to this paper, as it addressed several of the concerns I had about active learning:

The general topic is still how to collect useful training data (e.g. for classifiers) efficiently in terms of human effort.

Update at 3:02 PM: As soon as I wrote this, I read a post on active learning on John Langford’s blog, which points to the following tutorial page, which, among other things, has a great bibliography:

What’s worried me since the first time I heard about active learning is that the data’s likely to wind up biased, because it’s not being chosen randomly. For example, by choosing examples at the margin (e.g. always choosing the least confident annotation), you can wind up with a highly non-representative data set. Dasgupta and Hsu do some analysis along with providing a nice literature survey.

The basic idea of this paper is that you can cluster the underlying data, then sample from the clusters. If the underlying data clusters well w.r.t. the classification problem, this can be very helpful. Needless to say, there’s a whole lot of math explaining exactly what it means for data to cluster well w.r.t. the classification problem.

What struck me skimming through this paper is that K-means++ initialization for K-means clustering has the same motivation (it’s now implemented in LingPipe, and we’ve found it to work well in practice). K-means++ samples each new centroid with probability proportional to its squared distance to the closest existing centroid. Outliers are by definition a long way from any existing centroid, and thus have high individual selection probabilities. The idea is that because representative items vastly outnumber outliers, sampling will mostly choose representative items by sheer numbers, even though each one is individually unlikely to be chosen.
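As a minimal sketch of that initialization scheme (the function name and the distance-function parameter are my own; any squared-distance metric would do):

```python
import random

def kmeans_pp_init(points, k, dist2):
    """Pick k initial centroids, k-means++ style.

    points: list of data items; dist2(a, b): squared distance between items.
    """
    # first centroid is chosen uniformly at random
    centroids = [random.choice(points)]
    while len(centroids) < k:
        # weight each point by its squared distance to the nearest centroid;
        # outliers get large weights, but representative points win by count
        weights = [min(dist2(p, c) for c in centroids) for p in points]
        centroids.append(random.choices(points, weights=weights, k=1)[0])
    return centroids
```

Points already chosen get weight zero, so they can’t be picked twice.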

So what about choosing items for active learning with probability proportional to their estimated probability of being erroneously classified? I’m thinking this would have the same kind of effect as K-means++.
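A sketch of that selection rule, assuming we have some probabilistic classifier whose confidence we can invert into an error estimate (the `predict_probs` interface here is hypothetical; the paper doesn’t specify one):

```python
import random

def sample_by_error_prob(pool, predict_probs, n):
    """Sample n unlabeled items with probability proportional to the
    classifier's estimated chance of misclassifying each one.

    predict_probs(x) returns a dict mapping labels to probabilities
    (a hypothetical interface standing in for any probabilistic classifier).
    """
    # estimated error probability = 1 - confidence in the top label
    weights = [1.0 - max(predict_probs(x).values()) for x in pool]
    return random.choices(pool, weights=weights, k=n)
```

Unlike margin-based selection, which always takes the single least confident item, this still lets moderately confident items through in proportion to their numbers.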

PS: I also learned from Foster’s active learning paper that my academic hero and former colleague, Herb Simon, had the first citation in this field back in the early 1970s. As usual, he was decades ahead of his time.

Update: John Langford just posted a blog entry on active search, mentioning this:

### 3 Responses to “Dasgupta and Hsu (2008) Hierarchical Sampling for Active Learning”

1. Ken Dwyer Says:

If I understand your idea correctly, the Bootstrap-LV algorithm in the Provost paper you cite does something similar to what you suggest. It samples proportional to the variance in the class probability estimate for a given example.

However, this strategy seems to degrade when the (prior) class distribution is highly skewed, since the total weight of the minority class examples, which have high variance/interestingness ‘scores’, may be dwarfed by the weight of the majority class examples due to their sheer numbers. A potential outcome is that you select very few minority class examples. An illustration of this phenomenon is given in my M.Sc. thesis, pp. 102-104.

http://www.cs.ualberta.ca/~dwyer/files/msc_thesis.pdf

The recent paper “Importance-weighted active learning” by Beygelzimer, Dasgupta, and Langford is on my to-read list, and may propose a more robust solution.

2. lingpipe Says:

I really didn’t have the Bootstrap-LV algorithm of Saar-Tsechansky and Provost in mind; that samples based on the variance among a bunch of classifiers. I did mean what you’re calling BootM (the M is for “margin”), which you attribute to Melville and Mooney.

I was concerned with the issue you bring up on p. 102 (real page, not PDF page), citing Xiao et al., about weighted sampling. That’s exactly the balance k-means++ is supposed to get right. K-means++ samples the next cluster centroid with probability proportional to squared Euclidean distance to the closest centroid. In a generative model, that could correspond to sampling an example proportional to something like the minimum joint probability of category and example. With a spherical Gaussian classifier and a uniform distribution over categories, k-means++ is doing something similar.

I’m not sure what you’d do in a discriminative model to get the right distribution. I was hoping someone would have a nice evaluation of a bunch of these ideas somewhere.

To emphasize outliers, can’t you just raise whatever the metric is to some power? That’s just the opposite of the usual annealing trick for de-emphasizing them. In the limit, you get the approach that chooses the most uncertain example each time, which overemphasizes outliers for most applications.
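The power trick above can be sketched in a few lines (the function name is my own; the exponent plays the role of an inverse temperature):

```python
def sharpen(weights, beta):
    """Raise selection weights to the power beta and renormalize.

    beta > 1 emphasizes high-weight (outlier-ish) items; beta < 1 flattens
    the distribution, the usual annealing-style de-emphasis; as beta grows
    without bound, selection approaches always picking the single
    highest-weight (most uncertain) example.
    """
    powered = [w ** beta for w in weights]
    total = sum(powered)
    return [p / total for p in powered]
```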

3. lingpipe Says:

I just found this paper, which is again hinting at the same kind of approach, but not quite getting there:

Jingbo Zhu, Huizhen Wang, Tianshun Yao, and Benjamin K Tsou. 2008. Active Learning with Sampling by Uncertainty and Density for Word Sense Disambiguation and Text Classification. In COLING.

They use KNN density estimation to try to find outliers, but rather than sampling as k-means++ does, they take the K most extreme examples.