The method written up in this paper comes independently evaluated and recommended by Bruce Carneal at UCSD:
Arthur, David and Sergei Vassilvitski (2007) kmeans++: The Advantages of Careful Seeding. SODA 2007. (also available: presentation slides).
As we all know, kmeans, like EM, is a modefinding algorithm, meaning that it’s guaranteed to find local optima, but unlikely (even in practice) to find global optima.
kmeans++ is a simple probabilistic means of initialization for kmeans clustering that not only has the best known theoretical guarantees on expected outcome quality, it reportedly works very well in practice. The key is a clusterinitialization technique that preservers diversity of seeds while being robust to outliers.
Hard Clustering Error
We’ll only consider ndimensional Euclidean space, though the approach can be generalized. We have a set X of points and are looking for a set C of points, of size K, called centers that minimizes the error:
where the point C(x) in C is the closest center to the point x:
Sorry to overload C (C as a set of clusters; C(x) as a function from an element to its cluster) here, but it’s standard for equivalence relations like clusterings.
Kmeans
The kmeans algorithm is simple. It works from a given set of centers, which have traditionally been chosen randomly from among the points in the set X.
The Kmeans loop iteratively performs two deterministic steps until (guaranteed) convergence: (1) pass over all data points and reassign them to their closest centers, then (2) recalculate the centers as the mean of the points assigned to them.
Kmeans++
Arthur and Vassilvitskii simply created a very simple probabilistic method for generating initial centers for Kmeans clustering a set of points X:
 Sample the first center c[1] from a uniform distribution over X.
 For k = 2 to K

Sample the kth center c[k] from a multinomial over X where a point x has has probability θ_{x}, defined by:
where D(x) is defined as the distance to the closest existing center.
Empirical Results
The authors (and others since then) found both orderofmagnitude speedups and order of magnitude error reduction over randominitialized kmeans on “real” data sets. Like many of these speedups, the results improve relatively as the number of clusters increases. For instance, they report a factor of 4 speedup and factor of 20 reduction in error for a tiny (4600 item) spam dataset with a few dozen features over 50 clusters. Unfortunately, they dont’ evaluate any largedimensional problems, though they do some large item problems (n=500,000).
Theoretical Guarantees
If the inputs are nicely separated, it had already been proved that this initialization had good behavior (the same initialization with a different analysis had been independently proposed by Ostrovsky et al. in a FOCS 2006 paper; more of that scientific zeitgeist).
If the points aren’t known to be well behaved, its expected to produce a result that’s O(log k) competitive with the best possible clustering. Being O(f) competitive means that the expected error is bounded to be on the order of f relative to the optimal order. For kmeans++, the theorem provied by Arthur and Vassilvitskii bounds kmeans++ error for any set of data points X by:
Where the expectation is over the probability p(C) of a given clustering (determined by the initial assignment) in the kmeans++ algorithm.
Diversity with Robustness to Outliers
In the past, folks (like me) often chose points successively to maximize their distance from existing clusters. The problem with this approach is that it can be dominated by outliers. Sanjoy Dasgupta (in a COLT 2003 paper) introduced a subsampled approach that used a subset of the initial points, which was far less prone to outliers, because there are presumably relatively few outliers. Searching over all of X finds all of the outliers by definition.
The kmeans++ algorithm shares with Dasgupta’s appoach a kind of robustness in the face of outliers. While outliers may have a large maximum distance (i.e. a high D(x) value), by definition, there aren’t very many of them. So with sampling, the common points are balanced against the outliers using their strength in numbers against the outlier’s overall strength.
Is the Theoretical Analysis Practical?
Very nice analysis (that’s why it’s published in a highend theory conference). But even with a small number of clusters we’re still looking at being a factor of 20 away from the best solution (in the worst case). I don’t know if this is like other theoretical error bounds, such as for SVM error, that tend to be beaten by the reality of typical problems (which usually have way less error than predicted by the bounds).
The Code
I have the rest of the day blocked out to implement, optimize and test kmeans++ for LingPipe.
The paper authors themselves were also kind enough to provide:
March 29, 2009 at 7:36 am 
Since kmeans is closely related to EMbased “soft” clustering, how well do you think these initialization methods would work for EM clustering?
It’d be really great to have heuristics for initializing EM in general (not just for clustering). Do you think any of these methods generalize to problems other than clustering?
March 29, 2009 at 10:59 am 
Yes, they do. I originally saw this kind of technique presented by Sanjoy Dasgupta in a talk at Columbia in the context of EM. The simplest furthestfirst method chooses the first item at random, then each next item to be the furthest away from all the centers chosen so far. It suffers badly from outliers. Dasgupta’s approach subsampled the items and used furthestfirst on the subset, which reduced the outlier heaviness of the furthestfirst metric. He then did all kinds of fancy analysis to show that it works better in theory and also lots of experiments showed it working well in practice.
I’ve found Gibbs sampling to vary in sensitivity to initializations. Some of the models I’ve run, like multinomial models of annotator accuracy, are very robust to random initializations, whereas the logistic models were highly sensitive to initial choices, not even converging with random inits.
March 30, 2009 at 12:43 pm 
First, thanks for the blog post!
I wanted to add a couple of things:
– There are other initialization methods that are guaranteed to yield smaller approximation factors, but alas, they are usually much more computationally intensive. (See for example, Optimal Time Bounds for Approximate Clustering by Mettu and Plaxton). On the few examples that I tried, kmeans++ returned better solutions (but this is far from a rigorous test :)
– We observed was that kmeans++ consistently worked better if in each iteration you select several (2 or 3) centers according to the probabilistic method, and then greedily select the best one. This has puzzled me for the past few years, yet other than some hand waving, I don’t know really know why that is.
March 30, 2009 at 1:05 pm 
You mean do a bunch of kmeans++ initializations and take the one with the least error (sum of square distance to centers)? That brings up a couple of questions:
1. Have you found that the least error initialization is usually the least error after the kmeans iterations?
2. Are 2 or 3 runs enough?
It would seem that if there’s enough variance among runs to require replication, 2 or 3 wouldn’t be enough.
Running replicates of entire kmeans runs has always been standard operating procedure for these random initialization techniques (e.g. kmeans, EM, Gibbs sampling).
I’ve always wondered
3. Is there a way to combine multiple runs in some statistical manner to find better centers?
I was very surprised when I finally read through the theoretical analyses (surprisingly accessible) that the result holds for the initial allocation.
I finished the LingPipe implementation over the weekend, so I’ll be citing you in our documentation and clustering tutorial, too. Our next release is out soon.
March 30, 2009 at 2:00 pm 
“You mean do a bunch of kmeans++ initializations and take the one with the least error (sum of square distance to centers)? That brings up a couple of questions:”
Actually, I mean something slightly different. Say k=3. You pick the first center at random (nothing else to do here), call it c.
Usual kmeans++ now says pick the second center in proportion to the square of the distance to c. Pick three different points independently like this — call them x1, x2, x3.
Check which of the x1, x2, x3 is the best point — i.e. which clustering is better (c, x1) or (c, x2) or (cx3). Call it c’.
Now we have two points (c, c’). To select the third, km++ would have you sample proportionally from some distribution. Again, pick three points from that distribution and pick the best one.
I hope that makes sense.
The only intuition I have is the following. Since the sampling of the second center is probabilistic, there is a small (but nonzero) probability that the second center will be very close to the first, even when there are better centers around. In a sense this biases the distribution to the further away points. The double/greedy selection avoids this bad case.
March 30, 2009 at 2:24 pm 
Yes — that makes sense.
I’ve been wondering if factoring in something like the average square distance to every other point makes sense in addition to getting ones that are far away from already chosen centers, but I figured that’d bias all the selections too close together.
What Sergei’s suggesting is somewhat similar, but has the benefit of a natural taskrelated metric — it looks only at the reduction in distance of other points to closest centers (which may now be the sample being considered for the next center). That overcomes the problem with how I was thinking about it at the cost of some minor extra calculations.
It also reminds me of the maximal marginal relevance (MMR) approach to search ranking that selects the next document on the basis of match to the query (pure relevance) and distance from previously returned results (marginal relevance):
Carbonell, Jaime and Jade Goldstein (1998) The use of MMR, diversitybased reranking for reordering documents and producing summaries. SIGIR.
In a way, it’s like selecting cluster centers to return.
The greedy empirical approach opens a can of worms in terms of how many points to sample, with the limit being considering every possible next point. Does that lean too heavily on the outliers again? I’m not sure it would, given that we’re looking at reduction in error. It’s also possible to relax greediness by choosing the next j centers at once using some method (say the modified version Sergei just described), then comparing them all. The clustering (and decision tree) literature are full of these kinds of heuristics.
August 16, 2009 at 1:52 pm 
Very interesting. In the paper the generalization for any other exponent than 2 was treated theoretically but not empirically. Was anything done exploring this direction in the meantime?
August 16, 2009 at 6:11 pm 
Not that I know of. I didn’t study the theory too carefully, but quadratic makes sense because the error metric being optimized is distance, which is quadratic.
August 23, 2009 at 12:24 pm 
Could you please help me out whether I understand this right:
Your section “Diversity with Robustness to Outliers” tells to me that kmeans++ relies on sampling to obtain robustness to outliers.
I didn’t get this point from the original paper and I still feel that step 1b in their section 2.2 doesn’t make it clear; only directly after Theorem 1.1 we get the magic word “sampling” (well, and one more time when another paper is mentioned), so you can blame my short memory that initially this didn’t stuck until I reached section 2.2
Your section also tells me that the sampling method isn’t new, so the authors’ main achievement is “only” providing proof for the O(log k) competitiveness (no small feat though; I am still chewing on the details. So don’t get me wrong: I still think it’s a great paper.)
So the exponent isn’t really that much important as I first understood and your assessment that “quadratic makes sense” now also seems clearer to me.
August 22, 2009 at 12:04 pm 
OK, is my understanding correct that an exponent of 1 is the kmedian problem? (And therefore also interesting in some cases?)
August 24, 2009 at 12:09 pm 
The trick is that the probability estimates used for the samples are based on distance, and the density of the samples itself provides the robustness because there are more typical examples than outliers.
You can change the evaluation metric/(log) loss/probability model from Gaussian (L2) to Laplace (L1). I’m not sure what the relation is to medians, but I know there is one.
The reason quadratic makes sense for L2 is that its exponential form is quadratic, so log loss is quadratic, or in other words, proportional to Euclidean distance. Basically, your sampling probability is proportional to your log loss.
You could change the kmeans++ algorithm to use L1 distances (absolute value log loss) in place of L2 (quadratic).
March 29, 2010 at 5:23 pm 
[…] Kmeans++ a better, faster kmeans.Â https://lingpipeblog.com/2009/03/23/arthurvassilvitskii2007kmeanstheadvantagesofcarefulseedi… […]
June 21, 2011 at 2:04 pm 
Anyone know of any kmeans++ java code out there?
I found a neat readytouse kmeans++ algorithm in MLPY library for python. Now I need the same for JAVA, don’t see what I need in JavaML – any suggestions?
June 21, 2011 at 3:25 pm
LingPipe has kmeans++ implemented. Here’s the javadoc:
http://aliasi.com/lingpipe/docs/api/com/aliasi/cluster/KMeansClusterer.html
It’s a reasonably efficient kmeans implementation — we regularly apply it to doc collections of 10s of thousands of docs.
May 5, 2011 at 6:01 am 
Hello folks,
I’m trying to implement this in matlab but probably got something wrong here as the results I got are not better than original kmeans.
Can someone please point me what I may be doing wrong here? My objective is just to get the centroids.
X is EntityxFeature and k is the number of clusters, MANY thanks!
function [C] = mykmeanspp(X,k)
L = [];
C = X(randi(size(X,1)),:);
L = ones(size(X,1),1);
for i = 2:k
D = XC(L,:);
D = sqrt(dot(D’,D’));
C(i,:) = X(find(rand < cumsum(D)/sum(D),1),:);
[~,L] = max(bsxfun(@minus,2*real(C*X'),dot(C',C').'));
end
August 24, 2011 at 4:49 pm 
http://www.public.iastate.edu/~apghosh/files/IEEEclust2.pdf
A systematic evaluation of different methods for initializing the Kmeans clustering algorithm. Anna D. Peterson, Arka P. Ghosh and Ranjan Maitra. 2010
KMeans++ is that good. I implemented this for my problem domain, hoping it would help, however it does not provide any improvement over random selection of centroids.
Musi
August 28, 2011 at 2:48 pm 
Thanks for the reference.
Given that these are all randomized initialization, I’d have liked to have seen sample averages and variances. I can’t tell from a quick glance how many times they tried each algorithm. Variability is important in terms of use cases.
Also, I’d rather do my own adjustment for computer time.
June 4, 2012 at 12:34 pm 
I am doing PhD in Parallel Computing of Data Mining Algorithms. I want to do some analysis on few variant of kmeans algorithm. But I am not able to access the code on the following link:
http://www.stanford.edu/~darthur/kMeansppTest.zip
Can you please give me the access or mail me the zip file?
Please reply as soon as possible.
Thanks…
June 4, 2012 at 12:38 pm 
We have nothing to do with that software. Maybe you should ask the authors of the paper.
There is LingPipe code in Java for kmeans++, which is available from our home page:
http://aliasi.com/lingpipe
It’s also a relatively simple algorithm to implement.