<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Convergence is Relative: SGD vs. Pegasos, LibLinear, SVM^light, and SVM^perf</title>
	<atom:link href="http://lingpipe-blog.com/2009/04/08/convergence-relative-sgd-pegasos-liblinear-svmlight-svmper/feed/" rel="self" type="application/rss+xml" />
	<link>http://lingpipe-blog.com/2009/04/08/convergence-relative-sgd-pegasos-liblinear-svmlight-svmper/</link>
	<description>Natural Language Processing and Text Analytics</description>
	<lastBuildDate>Wed, 08 Feb 2012 17:47:08 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>By: Bob Carpenter</title>
		<link>http://lingpipe-blog.com/2009/04/08/convergence-relative-sgd-pegasos-liblinear-svmlight-svmper/#comment-16937</link>
		<dc:creator><![CDATA[Bob Carpenter]]></dc:creator>
		<pubDate>Mon, 14 Nov 2011 21:00:22 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=1186#comment-16937</guid>
		<description><![CDATA[No idea -- you should ask the people who wrote the software, not us.]]></description>
		<content:encoded><![CDATA[<p>No idea &#8212; you should ask the people who wrote the software, not us.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: vumaasha</title>
		<link>http://lingpipe-blog.com/2009/04/08/convergence-relative-sgd-pegasos-liblinear-svmlight-svmper/#comment-16880</link>
		<dc:creator><![CDATA[vumaasha]]></dc:creator>
		<pubDate>Sat, 12 Nov 2011 06:12:13 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=1186#comment-16880</guid>
		<description><![CDATA[I am comparing the performance of SVM perf and Pegasos for imbalanced
datasets. I understand that the Pegasos implementation currently only
produces the model file. How can I predict the actual class for the test
set? Can I use svm_classify or svm_perf_classify? Or what
formula should I use to write a predictor of my own? Thanks a lot
in advance.

Thanks,
Venkatesh]]></description>
		<content:encoded><![CDATA[<p>I am comparing the performance of SVM perf and Pegasos for imbalanced<br />
datasets. I understand that the Pegasos implementation currently only<br />
produces the model file. How can I predict the actual class for the test<br />
set? Can I use svm_classify or svm_perf_classify? Or what<br />
formula should I use to write a predictor of my own? Thanks a lot<br />
in advance.</p>
<p>Thanks,<br />
Venkatesh</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Henry</title>
		<link>http://lingpipe-blog.com/2009/04/08/convergence-relative-sgd-pegasos-liblinear-svmlight-svmper/#comment-7174</link>
		<dc:creator><![CDATA[Henry]]></dc:creator>
		<pubDate>Fri, 25 Jun 2010 04:47:53 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=1186#comment-7174</guid>
		<description><![CDATA[One of the most ridiculous selling tricks that online learning guys use is to say:
1. “Oh, we have a huge amount of data, so online is effective”.
2. Online learning needs a random sample at each step.
3. In the experiments, we loaded the whole data set into memory, so random access becomes easy.

But hey, if the data can fit into memory, then it’s not called “large”.  If the data has to reside on the hard drive, then the CPU will have to wait for the bus to fetch the data, and we lose all the benefit of cheap updates in online learning.

Even worse, the data can be distributed across different data centers.  Do we want to move the data to some central node?

In such scenarios, a MapReduce framework for batch learning is much more effective.]]></description>
		<content:encoded><![CDATA[<p>One of the most ridiculous selling tricks that online learning guys use is to say:<br />
1. “Oh, we have a huge amount of data, so online is effective”.<br />
2. Online learning needs a random sample at each step.<br />
3. In the experiments, we loaded the whole data set into memory, so random access becomes easy.</p>
<p>But hey, if the data can fit into memory, then it’s not called “large”.  If the data has to reside on the hard drive, then the CPU will have to wait for the bus to fetch the data, and we lose all the benefit of cheap updates in online learning.</p>
<p>Even worse, the data can be distributed across different data centers.  Do we want to move the data to some central node?</p>
<p>In such scenarios, a MapReduce framework for batch learning is much more effective.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: lingpipe</title>
		<link>http://lingpipe-blog.com/2009/04/08/convergence-relative-sgd-pegasos-liblinear-svmlight-svmper/#comment-4497</link>
		<dc:creator><![CDATA[lingpipe]]></dc:creator>
		<pubDate>Mon, 13 Apr 2009 17:24:46 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=1186#comment-4497</guid>
		<description><![CDATA[I think annealing is independent -- you could drop that into a Pegasos-like approach.  Parallelism is cheating in theory, because it&#039;s using more flops. 

The real key is that it doesn&#039;t need the babysitting.  

In practice, we&#039;re always evaluating different regularization parameters and different feature sets along with everything else, so you&#039;d be able to run Pegasos in parallel to do that more efficiently.  

I think which method wins these contests almost always depends on the exact shape of the problem (sparsity, size, matrix conditioning, etc.).]]></description>
		<content:encoded><![CDATA[<p>I think annealing is independent &#8212; you could drop that into a Pegasos-like approach.  Parallelism is cheating in theory, because it&#8217;s using more flops. </p>
<p>The real key is that it doesn&#8217;t need the babysitting.  </p>
<p>In practice, we&#8217;re always evaluating different regularization parameters and different feature sets along with everything else, so you&#8217;d be able to run Pegasos in parallel to do that more efficiently.  </p>
<p>I think which method wins these contests almost always depends on the exact shape of the problem (sparsity, size, matrix conditioning, etc.).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brendan O'Connor</title>
		<link>http://lingpipe-blog.com/2009/04/08/convergence-relative-sgd-pegasos-liblinear-svmlight-svmper/#comment-4491</link>
		<dc:creator><![CDATA[Brendan O'Connor]]></dc:creator>
		<pubDate>Mon, 13 Apr 2009 08:18:17 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=1186#comment-4491</guid>
		<description><![CDATA[Very interesting.  The part that interested me most was the relationship between computation cost and accuracy.  I&#039;d love to see more work looking at this.  I get the feeling that when you&#039;re doing this as a practitioner, there&#039;s lots of black magic and undocumented, word-of-mouth tricks.  But you would think there should be solid theory behind understanding tradeoffs of computation cost and accuracy.

Empirical story #1: I found basically the same convergence issues you were talking about when I implemented SGD for L2-regularized linear regression.  My data set was bag-of-words NLP but somewhat small (document titles: 40k examples, 20k features).  Lots of tweaking changed the solution when using a convergence threshold, but held-out accuracy (proportion of variance explained) didn&#039;t always change a ton; but it&#039;s pretty computationally expensive to find out!

I would love a magical method that required less tweaking / babysitting / staring at an accuracy/time graph and intuiting whether it&#039;s converged, while at the same time you&#039;re balancing that against a qualitative gut feeling whether to wait longer.  Quantifying the time cost is a first step of course; but more broadly, I just don&#039;t have very good confidence in reproducibility with any of these algorithms.  Pegasos sounds appealing for that reason, though if your annealing strategy plus parallelism can beat it, so be it.

Story #2: I was using random forests recently and tried using out-of-bag error to figure out how many bootstraps I had to do before convergence.  (Like SGD with a slow enough learning rate, bagging &amp; RF&#039;s are guaranteed to converge if given enough time; so another computation vs accuracy tradeoff.)  OOB estimation is a neat trick, but it seems to overestimate how far you have to go relative to held-out accuracy.  I really don&#039;t understand why this is.  The original Breiman papers where RF&#039;s come from all gloss over this.]]></description>
		<content:encoded><![CDATA[<p>Very interesting.  The part that interested me most was the relationship between computation cost and accuracy.  I&#8217;d love to see more work looking at this.  I get the feeling that when you&#8217;re doing this as a practitioner, there&#8217;s lots of black magic and undocumented, word-of-mouth tricks.  But you would think there should be solid theory behind understanding tradeoffs of computation cost and accuracy.</p>
<p>Empirical story #1: I found basically the same convergence issues you were talking about when I implemented SGD for L2-regularized linear regression.  My data set was bag-of-words NLP but somewhat small (document titles: 40k examples, 20k features).  Lots of tweaking changed the solution when using a convergence threshold, but held-out accuracy (proportion of variance explained) didn&#8217;t always change a ton; but it&#8217;s pretty computationally expensive to find out!</p>
<p>I would love a magical method that required less tweaking / babysitting / staring at an accuracy/time graph and intuiting whether it&#8217;s converged, while at the same time you&#8217;re balancing that against a qualitative gut feeling whether to wait longer.  Quantifying the time cost is a first step of course; but more broadly, I just don&#8217;t have very good confidence in reproducibility with any of these algorithms.  Pegasos sounds appealing for that reason, though if your annealing strategy plus parallelism can beat it, so be it.</p>
<p>Story #2: I was using random forests recently and tried using out-of-bag error to figure out how many bootstraps I had to do before convergence.  (Like SGD with a slow enough learning rate, bagging &amp; RF&#8217;s are guaranteed to converge if given enough time; so another computation vs accuracy tradeoff.)  OOB estimation is a neat trick, but it seems to overestimate how far you have to go relative to held-out accuracy.  I really don&#8217;t understand why this is.  The original Breiman papers where RF&#8217;s come from all gloss over this.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: sth4nth</title>
		<link>http://lingpipe-blog.com/2009/04/08/convergence-relative-sgd-pegasos-liblinear-svmlight-svmper/#comment-4449</link>
		<dc:creator><![CDATA[sth4nth]]></dc:creator>
		<pubDate>Wed, 08 Apr 2009 22:40:30 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=1186#comment-4449</guid>
		<description><![CDATA[Nice review. There is another method called Core Vector Machine (http://www.cse.ust.hk/~ivor/cvm.html) which, in my experience, performs well on large-scale data sets. How does the CVM compare to these algorithms?]]></description>
		<content:encoded><![CDATA[<p>Nice review. There is another method called Core Vector Machine (<a href="http://www.cse.ust.hk/~ivor/cvm.html" rel="nofollow">http://www.cse.ust.hk/~ivor/cvm.html</a>) which, in my experience, performs well on large-scale data sets. How does the CVM compare to these algorithms?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: lingpipe</title>
		<link>http://lingpipe-blog.com/2009/04/08/convergence-relative-sgd-pegasos-liblinear-svmlight-svmper/#comment-4448</link>
		<dc:creator><![CDATA[lingpipe]]></dc:creator>
		<pubDate>Wed, 08 Apr 2009 22:05:37 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=1186#comment-4448</guid>
		<description><![CDATA[There are lots of implementations to choose from.  Not only have we released source, there&#039;s also Leon Bottou&#039;s SGD, Zhang and Langford&#039;s Vowpal Wabbit, as well as a whole lot of less scalable alternatives and probably lots of implementations I don&#039;t know about.

LingPipe has a Java implementation of sparse, regularized, truncated, stochastic gradient descent in:

&lt;a href=&quot;http://alias-i.com/lingpipe/docs/api/com/aliasi/stats/LogisticRegression.html&quot; rel=&quot;nofollow&quot;&gt;&lt;code&gt;com.aliasi.stats.LogisticRegression&lt;/code&gt;&lt;/a&gt;

It&#039;s wrapped up for classification applications with programmatic feature extraction in:

&lt;a href=&quot;http://alias-i.com/lingpipe/docs/api/com/aliasi/classify/LogisticRegressionClassifier.html&quot; rel=&quot;nofollow&quot;&gt;&lt;code&gt;com.aliasi.classify.LogisticRegressionClassifier&lt;/code&gt;&lt;/a&gt;.

The code&#039;s available in the &lt;a href=&quot;http://alias-i.com/lingpipe/web/download.html&quot; rel=&quot;nofollow&quot;&gt;LingPipe distribution&lt;/a&gt; or online.  The estimation&#039;s all done in the source file:

&lt;a href=&quot;http://alias-i.com/lingpipe/src/com/aliasi/stats/LogisticRegression.java&quot; rel=&quot;nofollow&quot;&gt;&lt;code&gt;com/aliasi/stats/LogisticRegression.java&lt;/code&gt;&lt;/a&gt;.  

I also wrote up a &lt;a href=&quot;http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf&quot; rel=&quot;nofollow&quot;&gt;SGD for logistic regression white paper&lt;/a&gt; which goes over my rediscovery of Tong Zhang et al.&#039;s truncated gradient with regularization (Normal, Laplace, or Cauchy priors).  I also work through all the derivatives, etc., from scratch (I write these things for myself as I&#039;m learning a field).]]></description>
		<content:encoded><![CDATA[<p>There are lots of implementations to choose from.  Not only have we released source, there&#8217;s also Leon Bottou&#8217;s SGD, Zhang and Langford&#8217;s Vowpal Wabbit, as well as a whole lot of less scalable alternatives and probably lots of implementations I don&#8217;t know about.</p>
<p>LingPipe has a Java implementation of sparse, regularized, truncated, stochastic gradient descent in:</p>
<p><a href="http://alias-i.com/lingpipe/docs/api/com/aliasi/stats/LogisticRegression.html" rel="nofollow"><code>com.aliasi.stats.LogisticRegression</code></a></p>
<p>It&#8217;s wrapped up for classification applications with programmatic feature extraction in:</p>
<p><a href="http://alias-i.com/lingpipe/docs/api/com/aliasi/classify/LogisticRegressionClassifier.html" rel="nofollow"><code>com.aliasi.classify.LogisticRegressionClassifier</code></a>.</p>
<p>The code&#8217;s available in the <a href="http://alias-i.com/lingpipe/web/download.html" rel="nofollow">LingPipe distribution</a> or online.  The estimation&#8217;s all done in the source file:</p>
<p><a href="http://alias-i.com/lingpipe/src/com/aliasi/stats/LogisticRegression.java" rel="nofollow"><code>com/aliasi/stats/LogisticRegression.java</code></a>.  </p>
<p>I also wrote up a <a href="http://lingpipe.files.wordpress.com/2008/04/lazysgdregression.pdf" rel="nofollow">SGD for logistic regression white paper</a> which goes over my rediscovery of Tong Zhang et al.&#8217;s truncated gradient with regularization (Normal, Laplace, or Cauchy priors).  I also work through all the derivatives, etc., from scratch (I write these things for myself as I&#8217;m learning a field).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kevin Nuckolls</title>
		<link>http://lingpipe-blog.com/2009/04/08/convergence-relative-sgd-pegasos-liblinear-svmlight-svmper/#comment-4446</link>
		<dc:creator><![CDATA[Kevin Nuckolls]]></dc:creator>
		<pubDate>Wed, 08 Apr 2009 20:40:13 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=1186#comment-4446</guid>
		<description><![CDATA[This was a great writeup. I haven&#039;t had the time lately to keep up with the research so I appreciate the overview. It&#039;s great to see an algorithm that automatically sets the learning parameter.

Thanks for linking to the papers. 

You say you&#039;ve been implementing these. C++? Java?]]></description>
		<content:encoded><![CDATA[<p>This was a great writeup. I haven&#8217;t had the time lately to keep up with the research so I appreciate the overview. It&#8217;s great to see an algorithm that automatically sets the learning parameter.</p>
<p>Thanks for linking to the papers. </p>
<p>You say you&#8217;ve been implementing these. C++? Java?</p>
]]></content:encoded>
	</item>
</channel>
</rss>

