The method written up in this paper comes independently evaluated and recommended by Bruce Carneal at UCSD:
As we all know, k-means, like EM, is a mode-finding algorithm: it's guaranteed to converge to a local optimum, but unlikely (even in practice) to find a global optimum.
k-means++ is a simple probabilistic initialization method for k-means clustering that not only has the best known theoretical guarantees on expected outcome quality, but also reportedly works very well in practice. The key is a cluster-initialization technique that preserves diversity of seeds while being robust to outliers.
Hard Clustering Error
We’ll only consider n-dimensional Euclidean space, though the approach can be generalized. We have a set X of points and are looking for a set C of K points, called centers, that minimizes the error:

\mathrm{Err}(C) = \sum_{x \in X} \lVert x - C(x) \rVert^2
where the point C(x) in C is the closest center to the point x:

C(x) = \arg\min_{c \in C} \lVert x - c \rVert
Sorry to overload C (C as a set of centers; C(x) as a function from a point to its closest center) here, but it’s standard for equivalence relations like clusterings.
The k-means algorithm is simple. It works from a given set of centers, which have traditionally been chosen randomly from among the points in the set X.
The k-means loop iteratively performs two deterministic steps until (guaranteed) convergence: (1) pass over all the data points and reassign each to its closest center, then (2) recalculate each center as the mean of the points assigned to it.
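For concreteness, here's a minimal sketch of that loop in Java. The class and method names are mine for illustration (not LingPipe's or the paper's), and it assumes dense double[] points and squared Euclidean distance:

```java
import java.util.Arrays;

class KMeansSketch {

    // Lloyd's k-means loop: given initial centers, alternate (1) reassigning
    // each point to its closest center and (2) recomputing each center as the
    // mean of its assigned points, until no assignment changes.
    static double[][] kMeans(double[][] points, double[][] centers) {
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1); // force at least one full update
        boolean changed = true;
        while (changed) {
            changed = false;
            // step (1): reassign points to their closest centers
            for (int i = 0; i < points.length; ++i) {
                int closest = closestCenter(points[i], centers);
                if (closest != assignment[i]) {
                    assignment[i] = closest;
                    changed = true;
                }
            }
            // step (2): recompute centers as means of assigned points
            int dim = points[0].length;
            double[][] sums = new double[centers.length][dim];
            int[] counts = new int[centers.length];
            for (int i = 0; i < points.length; ++i) {
                ++counts[assignment[i]];
                for (int d = 0; d < dim; ++d)
                    sums[assignment[i]][d] += points[i][d];
            }
            for (int k = 0; k < centers.length; ++k)
                if (counts[k] > 0) // an empty cluster keeps its old center
                    for (int d = 0; d < dim; ++d)
                        centers[k][d] = sums[k][d] / counts[k];
        }
        return centers;
    }

    // index of the center closest (in squared Euclidean distance) to x
    static int closestCenter(double[] x, double[][] centers) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int k = 0; k < centers.length; ++k) {
            double dist = 0.0;
            for (int d = 0; d < x.length; ++d) {
                double diff = x[d] - centers[k][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = k;
            }
        }
        return best;
    }
}
```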
Arthur and Vassilvitskii's contribution is a very simple probabilistic method for generating initial centers for k-means clustering of a set of points X:
- Sample the first center c from a uniform distribution over X.
- For k = 2 to K
Sample the k-th center c[k] from a multinomial over X where a point x has probability θ_x, defined by:

\theta_x = D(x)^2 / \sum_{x' \in X} D(x')^2
where D(x) is defined as the distance from x to the closest existing center.
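Here's a corresponding sketch of that seeding step, again in Java with illustrative (non-LingPipe) names. It picks the first center uniformly, then samples each subsequent center with probability proportional to D(x)^2:

```java
import java.util.Arrays;
import java.util.Random;

class KMeansPlusPlusSeeding {

    // k-means++ initialization: first center uniform over the points, then
    // each later center sampled with probability D(x)^2 / sum over x' of D(x')^2,
    // where D(x) is the distance from x to its closest already-chosen center.
    static double[][] init(double[][] points, int numCenters, Random random) {
        double[][] centers = new double[numCenters][];
        centers[0] = points[random.nextInt(points.length)].clone();
        double[] minSqDist = new double[points.length];
        Arrays.fill(minSqDist, Double.POSITIVE_INFINITY);
        for (int k = 1; k < numCenters; ++k) {
            // update D(x)^2 against the most recently added center
            double total = 0.0;
            for (int i = 0; i < points.length; ++i) {
                double sq = squaredDistance(points[i], centers[k - 1]);
                if (sq < minSqDist[i])
                    minSqDist[i] = sq;
                total += minSqDist[i];
            }
            // sample index i with probability minSqDist[i] / total
            double target = random.nextDouble() * total;
            int chosen = points.length - 1;
            double cumulative = 0.0;
            for (int i = 0; i < points.length; ++i) {
                cumulative += minSqDist[i];
                if (cumulative >= target) {
                    chosen = i;
                    break;
                }
            }
            centers[k] = points[chosen].clone();
        }
        return centers;
    }

    static double squaredDistance(double[] x, double[] y) {
        double sum = 0.0;
        for (int d = 0; d < x.length; ++d) {
            double diff = x[d] - y[d];
            sum += diff * diff;
        }
        return sum;
    }
}
```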
The authors (and others since then) found both order-of-magnitude speedups and order-of-magnitude error reductions over randomly initialized k-means on “real” data sets. Like many such speedups, the results improve relatively as the number of clusters increases. For instance, they report a factor of 4 speedup and a factor of 20 reduction in error for a tiny (4,600-item) spam dataset with a few dozen features over 50 clusters. Unfortunately, they don’t evaluate any high-dimensional problems, though they do evaluate some problems with many items (n = 500,000).
If the inputs are nicely separated, this initialization had already been proved to behave well (the same initialization, with a different analysis, had been independently proposed by Ostrovsky et al. in a FOCS 2006 paper; more of that scientific zeitgeist).
If the points aren’t known to be well behaved, it’s expected to produce a result that’s O(log K) competitive with the best possible clustering. Being O(f) competitive means that the expected error is bounded to be on the order of f times the optimal error. For k-means++, the theorem proved by Arthur and Vassilvitskii bounds the error for any set of data points X by:

\mathbb{E}[\mathrm{Err}(C)] \le 8 (\ln K + 2) \cdot \mathrm{Err}(C_{\mathrm{OPT}})
where the expectation is over the probability p(C) of a given clustering (determined by the initial assignment) in the k-means++ algorithm, and C_OPT is the error-minimizing choice of K centers.
Diversity with Robustness to Outliers
In the past, folks (like me) often chose points successively to maximize their distance from the existing centers. The problem with this approach is that it can be dominated by outliers. Sanjoy Dasgupta (in a COLT 2003 paper) introduced a subsampled approach that chose seeds from a subset of the initial points; it was far less prone to being dominated by outliers because, presumably, there are relatively few outliers. Searching over all of X, by contrast, finds every outlier by definition.
The k-means++ algorithm shares with Dasgupta’s approach a kind of robustness in the face of outliers. While an outlier may be far from every center (i.e. have a high D(x) value), by definition there aren’t very many outliers. So with sampling, the common points are balanced against the outliers: their strength in numbers offsets the outliers’ individually large weights.
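To make that concrete with a toy example: if 1,000 ordinary points each sit at squared distance 1 from the nearest existing center and a single outlier sits at squared distance 100, the outlier is chosen as the next center with probability 100/1100, or roughly 9%, whereas farthest-point seeding would choose it every time.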
Is the Theoretical Analysis Practical?
Very nice analysis (that’s why it’s published in a high-end theory conference). But even with a small number of clusters we’re still looking at being a factor of 20 away from the best solution (in the worst case). I don’t know if this is like other theoretical error bounds, such as those for SVM error, that tend to be beaten by the reality of typical problems (which usually have far less error than the bounds predict).
I have the rest of the day blocked out to implement, optimize and test k-means++ for LingPipe.
The paper authors themselves were also kind enough to provide: