Thanks for the link. I know that paper, and anyone doing k-means should read it. Even if you don’t use the algorithm, it provides great insight into the geometry of the problem.

The main reason I didn’t use it was the extra memory overhead.

I would have sworn I’d discussed exactly Elkan’s algorithm on this blog entry, but can’t find any mention of it here or elsewhere.


C. Elkan. Using the Triangle Inequality to Accelerate k-Means. In Proceedings of the Twentieth International Conference on Machine Learning (ICML ’03), pp. 147–153.

@dz: Thanks! I keep forgetting that Euclidean distance is a real metric. I’m so used to having to generalize to pseudo-metrics where the triangle inequality fails.
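For the record, the bound that makes Elkan’s trick work is simple: in a true metric, if d(c, c′) ≥ 2·d(x, c), then the triangle inequality gives d(x, c′) ≥ d(c, c′) − d(x, c) ≥ d(x, c), so c′ can never beat c and that distance never needs to be computed. A toy sketch in NumPy (function name is mine; this is only the basic pruning test, not the paper’s full algorithm with per-point lower and upper bounds):

```python
import numpy as np

def assign_with_pruning(X, C):
    """Assign each row of X (n x d) to its nearest row of C (k x d),
    skipping candidates ruled out by the triangle inequality."""
    # Pairwise center-center distances, computed once per iteration.
    cc = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
    labels = np.empty(len(X), dtype=int)
    skipped = 0
    for i, x in enumerate(X):
        best = 0
        best_d = np.linalg.norm(x - C[0])
        for j in range(1, len(C)):
            # If d(c_best, c_j) >= 2*d(x, c_best), then by the triangle
            # inequality d(x, c_j) >= d(x, c_best): no need to compute it.
            if cc[best, j] >= 2.0 * best_d:
                skipped += 1
                continue
            d = np.linalg.norm(x - C[j])
            if d < best_d:
                best, best_d = j, d
        labels[i] = best
    return labels, skipped
```

The pruning is exact, not approximate: the labels match a brute-force nearest-center search, only with fewer distance evaluations.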

@myself: The other thing I really need to do now is cache distances to centroids when the centroids don’t change. After a few dozen epochs with K=1000, lots of the clusters don’t change at all between epochs, so whole rows of sth4nth’s C’ * S matrix (one per unchanged centroid) are constant.
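That caching can be sketched directly. Assuming the column layout from the comment below (C is d x k, S is d x n, columns are vectors) and a made-up function name: keep the previous epoch’s k x n product and recompute only the entries belonging to centroids that actually moved.

```python
import numpy as np

def cross_products(C, S, prev_C=None, prev_G=None):
    """G = C.T @ S (k x n), reusing cached rows for centroids
    that did not move since the last epoch.
    C: d x k centers, S: d x n samples, columns are vectors."""
    if prev_G is None:
        return C.T @ S                       # first epoch: full product
    G = prev_G.copy()
    moved = [i for i in range(C.shape[1])
             if not np.array_equal(C[:, i], prev_C[:, i])]
    if moved:
        # One matmul over just the moved centroids' columns.
        G[moved] = C[:, moved].T @ S
    return G
```

Once most centroids have converged, `moved` is short and each epoch’s multiply shrinks from k x d x n to |moved| x d x n.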

In each iteration, you want to find the nearest center for each sample.

Assume you have two matrices C and S, where each column of C is a center vector and each column of S is a sample vector. Then the distance matrix is

D = sum(C.^2,1)'*ones(1,n) - 2*C'*S + ones(k,1)*sum(S.^2,1);   % (1)

D is a k x n matrix whose (i,j) entry is the squared distance between center i and sample j.

Since you only need the argmin over centers for each sample, the third term on the right-hand side of (1) (the squared lengths of the samples) is constant down each column of D and can be dropped entirely, while the first term (the squared lengths of the centers) costs only O(k*d) per iteration. Then the only real work in each iteration is computing C'*S. This is efficient even for a dense matrix: a naive implementation costs O(n*k*d) per iteration, the matrix multiplication can certainly do better than that, not to mention when the vectors are sparse.
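In NumPy terms, with the same column layout (C is d x k, S is d x n; the function name is mine), the assignment step reduces to the single product C'*S plus a cheap per-center correction, since the per-sample norm is the same for every row of a given column of D:

```python
import numpy as np

def nearest_center(C, S):
    """Label of the nearest center for every sample.
    C: d x k centers, S: d x n samples, columns are vectors."""
    c_norms = (C ** 2).sum(axis=0)                  # ||c_i||^2, O(k*d)
    # Squared distance minus the per-sample constant ||s_j||^2:
    scores = c_norms[:, None] - 2.0 * (C.T @ S)     # k x n
    return scores.argmin(axis=0)                    # one label per sample
```

The argmin over `scores` agrees with the argmin over the full squared-distance matrix D, because the dropped term does not depend on the center index.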
