Jingbo Zhu, Huizhen Wang, Tianshun Yao, and Benjamin K. Tsou. 2008. Active Learning with Sampling by Uncertainty and Density for Word Sense Disambiguation and Text Classification. In *COLING*.

They use KNN density estimation to try to find outliers, but rather than sampling proportionally as k-means++ does, they take the K most extreme examples.

I was concerned with the issue you bring up on p. 102 (real page, not PDF page), citing Xiao et al., about weight sampling. That’s exactly the balance k-means++ is supposed to get right. K-means++ samples the next cluster centroid with probability proportional to its squared Euclidean distance to the closest existing centroid. In a generative model, the analogue might be sampling an example with probability proportional to something like the minimum joint probability of category and example. With a spherical Gaussian classifier and a uniform distribution over categories, k-means++ is doing something similar.
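For concreteness, here’s a minimal sketch of the k-means++ seeding step described above. The function name and the `dist2` argument are my own for illustration, not from any particular library:

```python
import random

def kmeans_pp_seed(points, k, dist2):
    """k-means++ seeding sketch: each new centroid is drawn with
    probability proportional to its squared distance to the nearest
    centroid chosen so far."""
    centroids = [random.choice(points)]
    while len(centroids) < k:
        # Weight of each point = squared distance to its closest centroid.
        weights = [min(dist2(p, c) for c in centroids) for p in points]
        # Sample one point proportionally to those weights.
        r = random.uniform(0.0, sum(weights))
        acc = 0.0
        for p, w in zip(points, weights):
            acc += w
            if acc >= r:
                centroids.append(p)
                break
    return centroids
```

Points already chosen get weight zero, so the sampler naturally spreads the seeds out while still favoring, but not deterministically picking, the most distant examples.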

I’m not sure what you’d do in a discriminative model to get the right distribution. I was hoping someone would have a nice evaluation of a bunch of these ideas somewhere.

To emphasize outliers, can’t you just raise whatever the metric is to some power greater than one, the opposite of the usual annealing trick for de-emphasizing them? In the limit, you recover the approach of choosing the most uncertain example each time, which over-emphasizes outliers for most applications.

However, this strategy seems to degrade when the (prior) class distribution is highly skewed: the total weight of the minority-class examples, which have high variance/interestingness scores, may be dwarfed by the weight of the majority-class examples through sheer numbers. A likely outcome is that you select very few minority-class examples. An illustration of this phenomenon is given in my M.Sc. thesis, pp. 102-104.
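The arithmetic behind that failure mode, with made-up numbers for illustration (not the figures from the thesis):

```python
# 1000 majority-class examples, each with a low interestingness weight,
# versus 10 minority-class examples, each 9x more interesting.
majority = [0.1] * 1000   # total weight: 100.0
minority = [0.9] * 10     # total weight:   9.0

p_minority = sum(minority) / (sum(majority) + sum(minority))
# Even though each minority example is far more interesting, a
# weight-proportional sampler picks a minority example less than
# 10% of the time (9 / 109, about 0.083).
```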

The recent paper “Importance-weighted active learning” by Beygelzimer, Dasgupta, and Langford is on my to-read list, and may propose a more robust solution.
