There is LingPipe code in Java for k-means++, which is available from our home page:

It’s also a relatively simple algorithm to implement.

http://www.stanford.edu/~darthur/kMeansppTest.zip

Can you please give me access, or mail me the zip file?

Please reply as soon as possible.

Thanks…

Given that these are all randomized initializations, I’d like to have seen sample averages and variances. I can’t tell from a quick glance how many times they tried each algorithm. Variability is important in terms of use cases.

Also, I’d rather do my own adjustment for computer time.
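To make the variability point concrete, here is a minimal sketch of the kind of measurement being asked for; the plain random seeding and the sum-of-squared-distances cost are my own assumptions for illustration, not anything from the paper being discussed:

```python
import numpy as np

def kmeans_cost(X, C):
    """Sum of squared distances from each point to its nearest centroid."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def random_init(X, k, rng):
    """Plain random seeding: pick k distinct rows of X as centroids."""
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
costs = [kmeans_cost(X, random_init(X, 3, rng)) for _ in range(50)]
print(f"mean cost {np.mean(costs):.2f}, std {np.std(costs):.2f}")
```

Reporting the mean and standard deviation over many such draws (rather than a single run) is exactly the kind of summary that would let a reader judge how much the initialization method matters for their use case.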

A systematic evaluation of different methods for initializing the K-means clustering algorithm. Anna D. Peterson, Arka P. Ghosh, and Ranjan Maitra. 2010.

Is KMeans++ really that good? I implemented it for my problem domain, hoping it would help; however, it does not provide any improvement over random selection of centroids.

Musi

http://alias-i.com/lingpipe/docs/api/com/aliasi/cluster/KMeansClusterer.html

It’s a reasonably efficient k-means implementation; we regularly apply it to document collections of tens of thousands of documents.

I found a neat, ready-to-use kmeans++ algorithm in the MLPY library for Python. Now I need the same for Java, and I don’t see what I need in Java-ML – any suggestions?

I’m trying to implement this in MATLAB, but I’ve probably got something wrong here, as the results I get are no better than those of original k-means.

Can someone please point out what I may be doing wrong? My objective is just to get the centroids.

X is Entity×Feature and k is the number of clusters. Many thanks!

function [C] = mykmeanspp(X,k)
    C = X(randi(size(X,1)),:);
    L = ones(size(X,1),1);
    for i = 2:k
        D = X - C(L,:);
        D = sqrt(dot(D',D'));
        C(i,:) = X(find(rand < cumsum(D)/sum(D),1),:);
        [~,L] = max(bsxfun(@minus,2*real(C*X'),dot(C',C').'));
    end
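For comparison, here is a minimal NumPy sketch of the seeding step of k-means++ as described in the original paper. Note that each new center is sampled with probability proportional to the squared distance D² to the nearest center chosen so far, not to the distance D itself; dropping the square is a common slip when porting the algorithm. This is an illustrative sketch, not the LingPipe or MATLAB code above:

```python
import numpy as np

def kmeanspp_seed(X, k, seed=None):
    """Pick k initial centroids from the rows of X, k-means++ style:
    the first uniformly at random, each subsequent one with probability
    proportional to squared distance to the nearest center so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        diff = X[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = (diff ** 2).sum(axis=2).min(axis=1)  # D^2 to nearest center
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centers)
```

The centroids returned here are only the initialization; the usual Lloyd iterations would then run from these starting points.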

You can change the evaluation metric/(log) loss/probability model from Gaussian (L2) to Laplace (L1). The relation to medians: the minimizer of summed absolute (L1) error is the median, which is why the L1 analogue of k-means is known as k-medians.

The reason the quadratic makes sense for L2 is that the Gaussian density is the exponential of a quadratic form, so the log loss is quadratic, or in other words, proportional to squared Euclidean distance. Basically, your sampling probability is proportional to your log loss.

You could change the k-means++ algorithm to use L1 distances (absolute value log loss) in place of L2 (quadratic).
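Concretely, that swap only changes the per-point sampling weight. A small sketch of the two weightings side by side (the helper name and interface are my own, for illustration):

```python
import numpy as np

def seed_weights(X, centers, loss="l2"):
    """Sampling weight for each point when drawing the next center:
    squared Euclidean distance for Gaussian (L2) log loss, or
    absolute (L1) distance for Laplace log loss."""
    diff = X[:, None, :] - np.asarray(centers)[None, :, :]
    if loss == "l2":
        d = (diff ** 2).sum(axis=2)   # quadratic log loss
    else:
        d = np.abs(diff).sum(axis=2)  # absolute-value log loss
    return d.min(axis=1)              # weight = loss at nearest center
```

Normalizing these weights to probabilities and sampling the next center from them reproduces the k-means++ step for whichever loss is chosen.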

Your section “Diversity with Robustness to Outliers” tells me that k-means++ relies on sampling to obtain robustness to outliers.

I didn’t get this point from the original paper, and I still feel that step 1b in their Section 2.2 doesn’t make it clear; only directly after Theorem 1.1 do we get the magic word “sampling” (well, and one more time when another paper is mentioned), so you can blame my short memory that this initially didn’t stick until I reached Section 2.2.

Your section also tells me that the sampling method isn’t new, so the authors’ main achievement is “only” providing the proof of O(log k) competitiveness (no small feat, though; I am still chewing on the details). So don’t get me wrong: I still think it’s a great paper.

So the exponent isn’t really as important as I first understood, and your assessment that “quadratic makes sense” now also seems clearer to me.
