## Arthur & Vassilvitskii (2007) k-means++: The Advantages of Careful Seeding

The method written up in this paper comes independently evaluated and recommended by Bruce Carneal at UCSD:

Arthur, David and Sergei Vassilvitskii (2007) k-means++: The Advantages of Careful Seeding. SODA 2007. (also available: presentation slides).

As we all know, k-means, like EM, is a mode-finding algorithm, meaning that it’s guaranteed to find local optima, but unlikely (even in practice) to find global optima.

k-means++ is a simple probabilistic means of initialization for k-means clustering that not only has the best known theoretical guarantees on expected outcome quality, but also reportedly works very well in practice. The key is a cluster-initialization technique that preserves diversity of seeds while being robust to outliers.

### Hard Clustering Error

We’ll only consider n-dimensional Euclidean space, though the approach can be generalized. We have a set X of points and are looking for a set C of points, of size K, called centers that minimizes the error: $Err(C,X) = \frac{1}{|X|} \sum_{x \in X} d(x,C(x))$

where the point C(x) in C is the closest center to the point x: $C(x) = \mbox{arg\,min}_{c \in C} \ d(x,c)$

Sorry to overload C (C as a set of clusters; C(x) as a function from an element to its cluster) here, but it’s standard for equivalence relations like clusterings.
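As a minimal sketch of the definitions above (assuming NumPy arrays for X and C; the function name is my own):

```python
import numpy as np

def clustering_error(X, C):
    """Err(C,X): mean distance from each point to its closest center.

    X: (n, d) array of points; C: (k, d) array of centers.
    """
    # pairwise distances: dists[i, j] = d(X[i], C[j])
    dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
    # C(x): index of the closest center for each point
    closest = dists.argmin(axis=1)
    return dists[np.arange(len(X)), closest].mean()
```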

### K-means

The k-means algorithm is simple. It works from a given set of centers, which have traditionally been chosen randomly from among the points in the set X.

The K-means loop iteratively performs two deterministic steps until (guaranteed) convergence: (1) pass over all data points and reassign them to their closest centers, then (2) recalculate the centers as the mean of the points assigned to them.
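The two steps can be sketched as follows (a minimal sketch, assuming NumPy; the function name, the empty-cluster handling, and the convergence check are my own choices):

```python
import numpy as np

def kmeans(X, C, n_iter=100):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # (1) assign each point to its closest center
        dists = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (2) recompute each center as the mean of its assigned points
        new_C = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                          else C[j] for j in range(len(C))])
        if np.allclose(new_C, C):  # fixed point reached: converged
            break
        C = new_C
    return C, labels
```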

### K-means++

Arthur and Vassilvitskii created a very simple probabilistic method for generating initial centers for k-means clustering of a set of points X:

1. Sample the first center c[1] from a uniform distribution over X.
2. For k = 2 to K, sample the k-th center c[k] from a multinomial over X in which a point x has probability $\theta_x$, defined by: $\theta_x = \frac{D(x)^2}{\sum_{x' \in X} D(x')^2} \propto D(x)^2$

where D(x) is defined as the distance to the closest existing center.
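A minimal sketch of this seeding procedure (assuming NumPy; the function name is hypothetical):

```python
import numpy as np

def kmeanspp_init(X, K, rng=None):
    """k-means++ seeding: first center uniform over X, then each
    subsequent center sampled with probability proportional to D(x)^2,
    where D(x) is the distance to the closest existing center."""
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(len(X))]]          # step 1: uniform draw
    for _ in range(1, K):
        # D(x)^2: squared distance to the closest center chosen so far
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2)
                    .sum(axis=2), axis=1)
        probs = d2 / d2.sum()                    # theta_x ∝ D(x)^2
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```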

### Empirical Results

The authors (and others since then) found both order-of-magnitude speedups and order-of-magnitude error reductions over randomly initialized k-means on “real” data sets. Like many of these speedups, the results improve relatively as the number of clusters increases. For instance, they report a factor of 4 speedup and a factor of 20 reduction in error for a tiny (4600-item) spam dataset with a few dozen features over 50 clusters. Unfortunately, they don’t evaluate any high-dimensional problems, though they do evaluate some large-item problems (n = 500,000).

### Theoretical Guarantees

If the inputs are nicely separated, it had already been proved that this initialization had good behavior (the same initialization with a different analysis had been independently proposed by Ostrovsky et al. in a FOCS 2006 paper; more of that scientific zeitgeist).

If the points aren’t known to be well behaved, it’s expected to produce a result that’s O(log k) competitive with the best possible clustering. Being O(f) competitive means that the expected error is bounded to be on the order of f times the optimal error. For k-means++, the theorem proved by Arthur and Vassilvitskii bounds k-means++ error for any set of data points X by: $E_{p(C)}[Err(C,X)] \leq 8(\ln K + 2) \ \mbox{min}_{C'} \ Err(C',X)$

where the expectation is taken over the probability p(C) of a given clustering (determined by the initial assignment) under the k-means++ algorithm.
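Plugging numbers into the bound gives a feel for how loose it is:

$8(\ln 2 + 2) \approx 21.5, \qquad 8(\ln 50 + 2) \approx 47.3$

so even at K = 2 the guarantee only promises a result within roughly a factor of 20 of optimal, and the factor grows only logarithmically with K.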

### Diversity with Robustness to Outliers

In the past, folks (like me) often chose points successively to maximize their distance from existing clusters. The problem with this approach is that it can be dominated by outliers. Sanjoy Dasgupta (in a COLT 2003 paper) introduced a subsampled approach that used a subset of the initial points, which was far less prone to outliers, because there are presumably relatively few outliers. Searching over all of X finds all of the outliers by definition.

The k-means++ algorithm shares with Dasgupta’s approach a kind of robustness in the face of outliers. While outliers may have a large distance to the nearest center (i.e. a high D(x) value), by definition there aren’t very many of them. So with sampling, the common points are balanced against the outliers, using their strength in numbers against the outliers’ individually large weights.
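A toy illustration of this balance (the numbers here are made up for the example): 100 typical points at distance 1 from the current center and a single outlier at distance 5. Furthest-first would always pick the outlier; under D(x)² sampling the typical points jointly dominate.

```python
import numpy as np

# distances to the closest existing center:
# 100 typical points at distance 1, one outlier at distance 5
d = np.array([1.0] * 100 + [5.0])
d2 = d ** 2
probs = d2 / d2.sum()          # k-means++ sampling probabilities

print(probs[-1])               # outlier: 25/125 = 0.2
print(probs[:100].sum())       # typical points together: 100/125 = 0.8
```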

### Is the Theoretical Analysis Practical?

Very nice analysis (that’s why it’s published in a high-end theory conference). But even with a small number of clusters we’re still looking at being a factor of 20 away from the best solution (in the worst case). I don’t know if this is like other theoretical error bounds, such as for SVM error, that tend to be beaten by the reality of typical problems (which usually have way less error than predicted by the bounds).

### The Code

I have the rest of the day blocked out to implement, optimize and test k-means++ for LingPipe.

The paper authors themselves were also kind enough to provide their own test code.

### 19 Responses to “Arthur & Vassilvitskii (2007) k-means++: The Advantages of Careful Seeding”

1. mark johnson Says:

Since k-means is closely related to EM-based “soft” clustering, how well do you think these initialization methods would work for EM clustering?

It’d be really great to have heuristics for initializing EM in general (not just for clustering). Do you think any of these methods generalize to problems other than clustering?

2. lingpipe Says:

Yes, they do. I originally saw this kind of technique presented by Sanjoy Dasgupta in a talk at Columbia in the context of EM. The simplest furthest-first method chooses the first item at random, then each next item to be the furthest away from all the centers chosen so far. It suffers badly from outliers. Dasgupta’s approach subsampled the items and used furthest-first on the subset, which reduced the outlier heaviness of the furthest-first metric. He then did all kinds of fancy analysis to show that it works better in theory and also lots of experiments showed it working well in practice.
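A sketch of that subsampled furthest-first idea (hypothetical names, assuming NumPy; this is my reading of the approach, not Dasgupta's actual procedure):

```python
import numpy as np

def subsampled_furthest_first(X, K, m, rng=None):
    """Furthest-first on a random subsample of m points: outliers are
    unlikely to land in the subsample, so they rarely dominate the
    max-distance choice."""
    rng = np.random.default_rng(rng)
    S = X[rng.choice(len(X), size=m, replace=False)]   # subsample
    centers = [S[rng.integers(m)]]                     # random first center
    for _ in range(1, K):
        # distance from each subsample point to its closest chosen center
        d = np.min(np.linalg.norm(S[:, None, :] - np.array(centers)[None, :, :],
                                  axis=2), axis=1)
        centers.append(S[d.argmax()])                  # furthest-first step
    return np.array(centers)
```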

I’ve found Gibbs sampling to vary in sensitivity to initializations. Some of the models I’ve run, like multinomial models of annotator accuracy, are very robust to random initializations, whereas the logistic models were highly sensitive to initial choices, not even converging with random inits.

3. Sergei Says:

First, thanks for the blog post!

I wanted to add a couple of things:

– There are other initialization methods that are guaranteed to yield smaller approximation factors, but alas, they are usually much more computationally intensive. (See for example, Optimal Time Bounds for Approximate Clustering by Mettu and Plaxton). On the few examples that I tried, kmeans++ returned better solutions (but this is far from a rigorous test :-)

– We observed that k-means++ consistently worked better if in each iteration you select several (2 or 3) centers according to the probabilistic method, and then greedily select the best one. This has puzzled me for the past few years, yet other than some hand waving, I don’t really know why that is.

4. lingpipe Says:

You mean do a bunch of k-means++ initializations and take the one with the least error (sum of square distance to centers)? That brings up a couple of questions:

1. Have you found that the least error initialization is usually the least error after the k-means iterations?

2. Are 2 or 3 runs enough?

It would seem that if there’s enough variance among runs to require replication, 2 or 3 wouldn’t be enough.

Running replicates of entire k-means runs has always been standard operating procedure for these random initialization techniques (e.g. k-means, EM, Gibbs sampling).

I’ve always wondered

3. Is there a way to combine multiple runs in some statistical manner to find better centers?

I was very surprised when I finally read through the theoretical analyses (surprisingly accessible) that the result holds for the initial allocation.

I finished the LingPipe implementation over the weekend, so I’ll be citing you in our documentation and clustering tutorial, too. Our next release is out soon.

5. Sergei Says:

“You mean do a bunch of k-means++ initializations and take the one with the least error (sum of square distance to centers)? That brings up a couple of questions:”

Actually, I mean something slightly different. Say k=3. You pick the first center at random (nothing else to do here), call it c.

Usual k-means++ now says pick the second center in proportion to the square of the distance to c. Pick three different points independently like this — call them x1, x2, x3.

Check which of x1, x2, x3 is the best point, i.e. which clustering is better: (c, x1), (c, x2), or (c, x3). Call it c’.

Now we have two points (c, c’). To select the third, km++ would have you sample proportionally from some distribution. Again, pick three points from that distribution and pick the best one.

I hope that makes sense.

The only intuition I have is the following. Since the sampling of the second center is probabilistic, there is a small (but non-zero) probability that the second center will be very close to the first, even when there are better centers around. In a sense this biases the distribution to the further away points. The double/greedy selection avoids this bad case.
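One step of Sergei's greedy variant might be sketched like this (hypothetical names, assuming NumPy; "best" here means the candidate that most reduces the sum of squared distances to closest centers):

```python
import numpy as np

def greedy_kmeanspp_step(X, centers, n_candidates=3, rng=None):
    """Sample a few candidate centers with probability proportional to
    D(x)^2, then greedily keep the one yielding the lowest potential."""
    rng = np.random.default_rng(rng)
    C = np.array(centers)
    # D(x)^2 under the centers chosen so far
    d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2), axis=1)
    idx = rng.choice(len(X), size=n_candidates, p=d2 / d2.sum())
    best, best_err = None, np.inf
    for i in idx:
        # potential if X[i] were added as the next center
        err = np.minimum(d2, ((X - X[i]) ** 2).sum(axis=1)).sum()
        if err < best_err:
            best, best_err = X[i], err
    return best
```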

6. lingpipe Says:

Yes — that makes sense.

I’ve been wondering if factoring in something like the average square distance to every other point makes sense in addition to getting ones that are far away from already chosen centers, but I figured that’d bias all the selections too close together.

What Sergei’s suggesting is somewhat similar, but has the benefit of a natural task-related metric — it looks only at the reduction in distance of other points to closest centers (which may now be the sample being considered for the next center). That overcomes the problem with how I was thinking about it at the cost of some minor extra calculations.

It also reminds me of the maximal marginal relevance (MMR) approach to search ranking that selects the next document on the basis of match to the query (pure relevance) and distance from previously returned results (marginal relevance):

Carbonell, Jaime and Jade Goldstein (1998) The use of MMR, diversity-based reranking for reordering documents and producing summaries. SIGIR.

In a way, it’s like selecting cluster centers to return.

The greedy empirical approach opens a can of worms in terms of how many points to sample, with the limit being considering every possible next point. Does that lean too heavily on the outliers again? I’m not sure it would, given that we’re looking at reduction in error. It’s also possible to relax greediness by choosing the next j centers at once using some method (say the modified version Sergei just described), then comparing them all. The clustering (and decision tree) literature are full of these kinds of heuristics.

7. Wolfgang Says:

Very interesting. In the paper the generalization for any other exponent than 2 was treated theoretically but not empirically. Was anything done exploring this direction in the meantime?

8. lingpipe Says:

Not that I know of. I didn’t study the theory too carefully, but quadratic makes sense because the error metric being optimized is squared Euclidean distance, which is quadratic.

• Wolfgang Says:

Could you please help me out whether I understand this right:

Your section “Diversity with Robustness to Outliers” tells to me that k-means++ relies on sampling to obtain robustness to outliers.

I didn’t get this point from the original paper and I still feel that step 1b in their section 2.2 doesn’t make it clear; only directly after Theorem 1.1 do we get the magic word “sampling” (well, and one more time when another paper is mentioned), so you can blame my short memory that initially this didn’t stick until I reached section 2.2.

Your section also tells me that the sampling method isn’t new, so the authors’ main achievement is “only” providing proof for the O(log k) competitiveness (no small feat though; I am still chewing on the details. So don’t get me wrong: I still think it’s a great paper.)

So the exponent isn’t really that much important as I first understood and your assessment that “quadratic makes sense” now also seems clearer to me.

9. Wolfgang Says:

OK, is my understanding correct that an exponent of 1 is the k-median problem? (And therefore also interesting in some cases?)

10. lingpipe Says:

The trick is that the probability estimates used for the samples are based on distance, and the density of the samples itself provides the robustness because there are more typical examples than outliers.

You can change the evaluation metric/(log) loss/probability model from Gaussian (L2) to Laplace (L1). I’m not sure what the relation is to medians, but I know there is one.

The reason quadratic makes sense for L2 is that its exponential form is quadratic, so log loss is quadratic, or in other words, proportional to Euclidean distance. Basically, your sampling probability is proportional to your log loss.

You could change the k-means++ algorithm to use L1 distances (absolute value log loss) in place of L2 (quadratic).
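A sketch of what changing the exponent and metric would look like in the seeding probabilities (hypothetical function; ell = 2 with Euclidean distance is standard k-means++, while ell = 1 with L1 distance heads toward a k-median-style objective):

```python
import numpy as np

def dl_sampling_probs(X, centers, ell=2, metric_ord=2):
    """Seeding probabilities theta_x ∝ D(x)^ell, where D uses an L_p
    metric (metric_ord=2 for Euclidean, 1 for Manhattan)."""
    C = np.array(centers)
    # D(x): distance to the closest existing center under the chosen metric
    D = np.min(np.linalg.norm(X[:, None, :] - C[None, :, :],
                              ord=metric_ord, axis=2), axis=1)
    w = D ** ell
    return w / w.sum()
```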

11. Occam's Machete » Blog Archive » Clustering: A Short Survey Says:

[…] K-means++ a better, faster k-means. https://lingpipe-blog.com/2009/03/23/arthur-vassilvitskii-2007-kmeans-the-advantages-of-careful-seedi… […]

• Mark Moriarty Says:

Anyone know of any kmeans++ java code out there?
I found a neat ready-to-use kmeans++ algorithm in MLPY library for python. Now I need the same for JAVA, don’t see what I need in Java-ML – any suggestions?

12. annoyed grunt Says:

Hello folks,

I’m trying to implement this in Matlab but probably got something wrong, as the results I get are not better than original k-means.
Can someone please point out what I may be doing wrong here? My objective is just to get the centroids.
X is Entity×Feature and k is the number of clusters. MANY thanks!

function [C] = mykmeanspp(X,k)

C = X(randi(size(X,1)),:);        % first center: uniform over the rows of X
L = ones(size(X,1),1);            % current cluster label for each point
for i = 2:k
    D = X - C(L,:);                       % offsets to assigned centers
    D2 = dot(D',D');                      % squared distances D(x)^2 (no sqrt)
    % sample the next center with probability proportional to D(x)^2
    C(i,:) = X(find(rand < cumsum(D2)/sum(D2),1),:);
    % reassign each point to its closest center
    [~,L] = max(bsxfun(@minus,2*real(C*X'),dot(C',C').'),[],1);
end

13. Musi Says:

A systematic evaluation of different methods for initializing the K-means clustering algorithm. Anna D. Peterson, Arka P. Ghosh and Ranjan Maitra. 2010

K-means++ isn’t that good. I implemented this for my problem domain, hoping it would help; however, it does not provide any improvement over random selection of centroids.

Musi

• Bob Carpenter Says:

Thanks for the reference.

Given that these are all randomized initializations, I’d have liked to have seen sample averages and variances. I can’t tell from a quick glance how many times they tried each algorithm. Variability is important in terms of use cases.

Also, I’d rather do my own adjustment for computer time.

14. Sonal Says:

I am doing a PhD in parallel computing of data mining algorithms. I want to do some analysis on a few variants of the k-means algorithm. But I am not able to access the code at the following link:
http://www.stanford.edu/~darthur/kMeansppTest.zip
Can you please give me access or mail me the zip file?
• Bob Carpenter Says: