<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Speeding up K-means Clustering with Algebra and Sparse Vectors</title>
	<atom:link href="http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/feed/" rel="self" type="application/rss+xml" />
	<link>http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/</link>
	<description>Natural Language Processing and Text Analytics</description>
	<lastBuildDate>Wed, 08 Feb 2012 17:47:08 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>By: lingpipe</title>
		<link>http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/#comment-5965</link>
		<dc:creator><![CDATA[lingpipe]]></dc:creator>
		<pubDate>Wed, 09 Dec 2009 19:57:53 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=904#comment-5965</guid>
		<description><![CDATA[Thanks for the link.  I know that paper and anyone doing k-means should read it.  Even if you don&#039;t use the algorithm, it provides great insight into the geometry of the problem.

The main reason I didn&#039;t use it was the extra memory overhead.  

I would have sworn I&#039;d discussed exactly Elkan&#039;s algorithm on this blog entry, but can&#039;t find any mention of it here or elsewhere.]]></description>
		<content:encoded><![CDATA[<p>Thanks for the link.  I know that paper and anyone doing k-means should read it.  Even if you don&#8217;t use the algorithm, it provides great insight into the geometry of the problem.</p>
<p>The main reason I didn&#8217;t use it was the extra memory overhead.  </p>
<p>I would have sworn I&#8217;d discussed exactly Elkan&#8217;s algorithm on this blog entry, but can&#8217;t find any mention of it here or elsewhere.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jeff</title>
		<link>http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/#comment-5964</link>
		<dc:creator><![CDATA[jeff]]></dc:creator>
		<pubDate>Wed, 09 Dec 2009 18:17:13 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=904#comment-5964</guid>
		<description><![CDATA[Check out:
http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf

C. Elkan.  Using the Triangle Inequality to Accelerate k-Means (pdf).  In Proceedings of the Twentieth International Conference on Machine Learning (ICML&#039;03), pp. 147-153.]]></description>
		<content:encoded><![CDATA[<p>Check out:<br />
<a href="http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf" rel="nofollow">http://cseweb.ucsd.edu/~elkan/kmeansicml03.pdf</a></p>
<p>C. Elkan.  Using the Triangle Inequality to Accelerate k-Means (pdf).  In Proceedings of the Twentieth International Conference on Machine Learning (ICML&#8217;03), pp. 147-153.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: lingpipe</title>
		<link>http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/#comment-4195</link>
		<dc:creator><![CDATA[lingpipe]]></dc:creator>
		<pubDate>Fri, 13 Mar 2009 18:53:29 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=904#comment-4195</guid>
		<description><![CDATA[@sth4nth: Cool!  I should&#039;ve kept going with the algebra.  It&#039;s obvious the constants don&#039;t matter, but I&#039;m always surprised when you can replace distances with products.  The key to the approach I outlined in the entry was using sparse multiplications, because the S vectors (in your notation) are sparse at around 1% or less of the C vectors in the applications we&#039;re considering.

@dz:  Thanks!  I keep forgetting that Euclidean distance is a real metric.  I&#039;m so used to having to generalize to pseudo-metrics where the triangle inequality fails.

@myself: The other thing I really need to do now is cache distances vs. centroids when the centroids don&#039;t change.  After a few dozen epochs with K=1000, lots of the clusters don&#039;t change at all between epochs, so that whole columns of sth4nth&#039;s C&#039; * S matrix are constant.]]></description>
		<content:encoded><![CDATA[<p>@sth4nth: Cool!  I should&#8217;ve kept going with the algebra.  It&#8217;s obvious the constants don&#8217;t matter, but I&#8217;m always surprised when you can replace distances with products.  The key to the approach I outlined in the entry was using sparse multiplications, because the S vectors (in your notation) are sparse at around 1% or less of the C vectors in the applications we&#8217;re considering.</p>
<p>@dz:  Thanks!  I keep forgetting that Euclidean distance is a real metric.  I&#8217;m so used to having to generalize to pseudo-metrics where the triangle inequality fails.</p>
<p>@myself: The other thing I really need to do now is cache distances vs. centroids when the centroids don&#8217;t change.  After a few dozen epochs with K=1000, lots of the clusters don&#8217;t change at all between epochs, so that whole columns of sth4nth&#8217;s C&#8217; * S matrix are constant.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: dz</title>
		<link>http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/#comment-4193</link>
		<dc:creator><![CDATA[dz]]></dc:creator>
		<pubDate>Fri, 13 Mar 2009 15:35:29 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=904#comment-4193</guid>
		<description><![CDATA[Don&#039;t forget the triangle inequality speed-up: if you pre-compute distances between centroids d(c,c&#039;) before each iteration, then you don&#039;t have to compute many distances between examples and centroids d(x,c) at all, because d(x,c) &gt;= d(x,c&#039;) - d(c,c&#039;)]]></description>
		<content:encoded><![CDATA[<p>Don&#8217;t forget the triangle inequality speed-up: if you pre-compute distances between centroids d(c,c&#8217;) before each iteration, then you don&#8217;t have to compute many distances between examples and centroids d(x,c) at all, because d(x,c) &gt;= d(x,c&#8217;) &#8211; d(c,c&#8217;)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: sth4nth</title>
		<link>http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/#comment-4188</link>
		<dc:creator><![CDATA[sth4nth]]></dc:creator>
		<pubDate>Fri, 13 Mar 2009 05:15:36 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=904#comment-4188</guid>
		<description><![CDATA[The third term of (1) can be dropped, not the first one. Sorry for that error, but the idea is the same.]]></description>
		<content:encoded><![CDATA[<p>The third term of (1) can be dropped, not the first one. Sorry for that error, but the idea is the same.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: sth4nth</title>
		<link>http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/#comment-4187</link>
		<dc:creator><![CDATA[sth4nth]]></dc:creator>
		<pubDate>Fri, 13 Mar 2009 03:07:14 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=904#comment-4187</guid>
		<description><![CDATA[Since it&#039;s hard to write formulas in a comment, I&#039;m gonna write the idea in MATLAB.

In each iteration, you want to find the samples that are closest to centers.
Assume you have two matrices C and S. Each column of C is a center vector and each column of S is a sample vector. Then the distance matrix is

D=sum(C.^2,1)&#039;*ones(1,n)-2*C&#039;*S+ones(k,1)*sum(S.^2,1); % (1)
D is a k x n matrix in which each element is the squared distance between a sample and a center.

Since you only need to find the minimum distances, the first term of the right-hand side of (1) can be dropped and the third term (the squared lengths of all samples) can be precomputed. Then you only need to compute C&#039;*S in each iteration. This will be efficient even for dense matrices. A naive implementation costs O(n*k*d) per iteration; the matrix multiplication can certainly do better than that, especially if the vectors are sparse.]]></description>
		<content:encoded><![CDATA[<p>Since it&#8217;s hard to write formulas in a comment, I&#8217;m gonna write the idea in MATLAB.</p>
<p>In each iteration, you want to find the samples that are closest to centers.<br />
Assume you have two matrices C and S. Each column of C is a center vector and each column of S is a sample vector. Then the distance matrix is</p>
<p>D=sum(C.^2,1)&#039;*ones(1,n)-2*C&#039;*S+ones(k,1)*sum(S.^2,1); % (1)<br />
D is a k x n matrix in which each element is the squared distance between a sample and a center.</p>
<p>Since you only need to find the minimum distances, the first term of the right-hand side of (1) can be dropped and the third term (the squared lengths of all samples) can be precomputed. Then you only need to compute C&#039;*S in each iteration. This will be efficient even for dense matrices. A naive implementation costs O(n*k*d) per iteration; the matrix multiplication can certainly do better than that, especially if the vectors are sparse.</p>
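<p>A minimal sketch of the same expansion in plain Python, assuming dense centers stored as lists and sparse samples stored as index-to-value dicts. The helper names are hypothetical. Per the correction posted in the follow-up comment, the term left out of the comparison here is the sample norm, which is the same for every center, while the center norms are recomputed once per iteration.</p>

```python
def dot_sparse(center, sample):
    """Dot product of a dense center (list) and a sparse sample (dict index -> value)."""
    return sum(value * center[i] for i, value in sample.items())


def nearest_center(centers, center_sq_norms, sample):
    """Index of the center nearest to a sparse sample.

    Uses ||c - s||^2 = ||c||^2 - 2 c.s + ||s||^2. The ||s||^2 term is the
    same for every center, so it is left out of the comparison.
    """
    scores = [sq - 2.0 * dot_sparse(c, sample)
              for c, sq in zip(centers, center_sq_norms)]
    return min(range(len(centers)), key=scores.__getitem__)


centers = [[0.0, 0.0, 0.0], [1.0, 0.0, 2.0]]
# Recomputed once per iteration, after the centers move.
center_sq_norms = [sum(x * x for x in c) for c in centers]
sample = {2: 1.5}  # sparse encoding of the vector (0, 0, 1.5)
print(nearest_center(centers, center_sq_norms, sample))  # prints 1
```

<p>Each dot product touches only the non-zero entries of the sample, which is where the savings come from when samples are much sparser than centers.</p>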
]]></content:encoded>
	</item>
</channel>
</rss>

