I still can’t quite get over how well stochastic gradient descent (SGD) works for the kinds of large scale, sparse convex optimization problems we find in natural language processing — SVMs, CRFs, logistic regression, etc. I just finished reading this *ICML* paper that was recommended to me:

- Shalev-Schwartz, Shai, Yoram Singer, and Nathan Srebro. 2007. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM.
*ICML*.

### Basic Approach

It’s a nice paper that introduces a blocked gradient optimization approach for support vector machines (SVMs) with linear kernels. (The issues for linear SVMs are almost the same as for logistic regression.) Setting the blocks to a single item yields stochastic gradient, and setting them to the whole corpus gives you a traditional gradient approach. The innovation is a step that handles L2 (aka Gaussian prior) regularization by projecting the coefficients down onto an L2-ball (this technique’s since also been used for L1 norms). The benefit over standard SGD is that it automatically sets the pesky learning rate parameter (see below for an empirical discussion).

### Convergence

The paper also introduces a theoretical convergence analysis that shows it converges in O(d/(λε)) where d is a bound on non-zero features in an example, λ is the regularization parameter, and ε is the error bound on estimation. Note that like other SGD-like methods, the convergence is

**independent of the number of samples**, and**independent of total dimensionality**

though it is dependent on d, the maximum number of non-zero features per instance.

### Does it Work in Practice?

Shalev-Schwartz et al. then show that their approach crushes not only Thorsten Joachims’s SVM^{light} but also handily outperforms his linear-kernel optimization using cutting-planes in SVM^{perf}.

But what really struck me is Figure 2 on page 7, which compares the rate of convergence of Pegasos with that of a sparse stochastic gradient, as described in Tong Zhang’s earlier ICML paper:

- Zhang, Tong. 2004. Solving large scale linear prediction problems using stocahstic gradient descent algorithms.
*ICML*.

The problem is text classification of Arxiv papers as to whether they’re about astro-physics or not, still a small problem with 100K features and 60K examples with high sparsity. The lines in red are Tong’s SGD with different constant learning rates: 10, 1, 0.1, 0.01, 0.001 (presumably arranged left-to-right, with 10 being the non-converger). The blue line is Pegasos. The horizontal axis is log time, and the vertical axis represents the error function (for SVMs, that’s hinge loss on the training set; logistic regression uses log loss rather than hinge loss, but basically works the same way).

Shalev-Schwartz conclude that the main advantage of their approach over SGD (which enjoys many of the same high-level advantages, like convergence independent of number of samples) is that **you don’t have to set the learning rate**. This may seem to be nitpicking, but it’s not. Those five runs of SGD represent quite a bit more processing time than one run of Pegasos. Perhaps even more crucially, when you’re doing SGD, you don’t know what the final converged error rate is, so you can’t tell if you’ve gotten there without trying lots of rates.

### Embarssing Parallelism

The graph and our own experiences show that you don’t actually need to run SGD to convergence for the different learning rates. You can see from the early part of the SGD learning rate curves which learning rate has the steepest initial descent. That, combined with a reasonable annealing rate (lowering learning rate over time) winds up being fast in practice. But it requires babysitting by someone with a feel for the results.

Evaluating multiple learning rates can be easily parallelized — it’s what’s known as “embarassingly parallel” process. Or just interleaved on one processor. Running all 5 SGD runs up to 10^{4} seconds gets you to the result in half the time that running Pegasos for 10^{5} seconds.

### Annealing for Final Convergence

Also note that annealing lets you get closer to the optimal answer than is shown in the graph above. The reason the faster learning rate curves asymptote above the error of Pegasos is that near convergence, learning rates need to be lowered.

### How Much Accuracy is Necessary?

For a general overview of the properties of SGD and convergence, I’d highly recommend Léon Bottou’s very readable slides from his 2007 NIPS tutorial:

- Bottou, Léon. 2007. Learning with Large Scale Datasets.
*NIPS Tutorial*.

For me, the most significant piece of the presentation was this graph from page 31, comparing C.-J. Lin et al.’s LibLinear coordinate descent-based regularized SVM solver to SGD:

The horizontal axis is the accuracy of the solution relative to the optimal solution (the ε from above). The top graph shows how held-out test error (also called generalization error) drops as estimation accuracy increases. By the time we’re within 0.0001 of the minimum error, we’ve asymptoted on held-out accuracy. More simply, we don’t need more than 4 or 5 decimal places of accuracy with respect to error (what LingPipe uses as a stopping condition for regression). We’ve found exactly the same thing on the real problems (e.g. classifying clinical reports by disease and linkage between textual gene mentions and Entrez-Gene).

The bottom graph shows the time it takes the solvers on the vertical axis, with SGD being way faster for lower accuracies.

This is why I was so vexed when I wrote the first unit tests for our SGD logistic regression solver. They were taking *forever* (i.e. 100 seconds) to converge to 7 decimal places of accuracy.

### Convergence by Held-Out Performance

With training data you can use cross-validation to estimate convergence using held-out performance. We monitor cross-validation error along with training set error along with the variance of the same. Our cross-validating corpus is getting a heavy workout in conjunction with SGD for logistic regression.

### The Bottom Line

The bottom-line of Bottou’s graph (and this blog post) is that SGD estimation time grows inversely proportionally to accuracy. It’s really fast to get within a thousandth or ten thousandth of the optimal solution. And getting further doesn’t help much, if at all, on held-out data.

April 8, 2009 at 3:40 pm |

This was a great writeup. I haven’t had the time lately to keep up with the research so I appreciate the overview. It’s great to see an algorithm that automatically sets the learning parameter.

Thanks for linking to the papers.

You say you’ve been implementing these. C++? Java?

April 8, 2009 at 5:05 pm |

There are lots of implementations to choose from. Not only have we released source, there’s also Leon Bottou’s SGD, Zhang and Langford’s Vowpal Wabbit, as well as a whole lot of less scalable alternatives and probably lots of implementations I don’t know about.

LingPipe has a Java implementation of sparse, regularized, truncated, stochastic gradient descent in:

`com.aliasi.stats.LogisticRegression`

It’s wrapped up for classification applications with programmatic feature extraction in:

`com.aliasi.classify.LogisticRegressionClassifier`

.The code’s available in the LingPipe distribution or online. The estimation’s all done in the source file:

`com/aliasi/stats/LogisticRegression.java`

.I also wrote up a SGD for logistic regression white paper which goes over my rediscovery of Tong Zhang et al.’s truncated gradient with regularization (Normal, Laplace, or Cauchy priors). I also work through all the derivatives, etc., from scratch (I write these things for myself as I’m learning a field).

April 8, 2009 at 5:40 pm |

Nice review. There is another method called Core Vector Machine (http://www.cse.ust.hk/~ivor/cvm.html) which performs well when dealing with large scale data sets according to my experience. How is the CVM comparing to these algorithms?

April 13, 2009 at 3:18 am |

Very interesting. The part that interested me most was the relationship between computation cost and accuracy. I’d love to see more work looking at this. I get the feeling that when you’re doing this as a practitioner, there’s lots of black magic and undocumented, word-of-mouth tricks. But you would think there should be solid theory behind understanding tradeoffs of computation cost and accuracy.

Empirical story #1: I found basically the same convergence issues you were talking about when I implemented SGD for L2-regularized linear regression. My data set was bag-of-words NLP but somewhat small (document titles: 40k examples, 20k features). Lots of tweaking changed the solution when using a convergence threshold, but held-out accuracy (proportion of variance explained) didn’t always change a ton; but it’s pretty computationally expensive to find out!

I would love a magical method that required less tweaking / babysitting / staring at an accuracy/time graph and intuiting whether it’s converged, while at the same time you’re balancing that against a qualitative gut feeling whether to wait longer. Quantifying the time cost is a first step of course; but more broadly, I just don’t have very good confidence in reproducibility with any of these algorithms. Pegasos sounds appealing for that reason, though if your annealing strategy plus parallelism can beat it, so be it.

Story #2: I was using random forests recently and tried using out-of-bag error to figure out how many bootstraps I had to do before convergence. (Like SGD with a slow enough learning rate, bagging & RF’s are guaranteed to converge if given enough time; so another computation vs accuracy tradeoff.) OOB estimation is a neat trick, but it seems to overestimate how far you have to go relative to held-out accuracy. I really don’t understand why this is. The original Breiman papers where RF’s come from all gloss over this.

April 13, 2009 at 12:24 pm |

I think annealing is independent — you could drop that into a Pegasos-like approach. Parallelism is cheating in theory, because it’s using more flops.

The real key is that it doesn’t need the babysitting.

In practice, we’re always evaluating different regularization parameters and different feature sets along with everything else, so you’d be able to run Pegasos in parallel to do that more efficiently.

I think which method wins these contests almost always depends on the exact shape of the problem (sparsity, size, matrix conditioning, etc.).

June 24, 2010 at 11:47 pm |

One of the most ridiculous selling trick that online learning guys use is to say:

1. “Oh, we have a huge amount of data, so online is effective”.

2. Online learning needs a random sample at each step.

3. In experiment, we loaded the whole data into memory, and so random access becomes easy.

But hey, if the data can fit into memory, then it’s not called “large”. If the data has to reside on the hard drive, then the CPU will have to wait for the bus to fetch the data, and we lose all the benefit of cheap update in online learning.

Even worse, the data can be distributed on different data centers. Do we want to move the data to some central node?

In such scenarios, a map reduce framework for batch learning is much more effective.

November 12, 2011 at 1:12 am |

I am comparing the performance of SVM perf and Pegasos for imbalanced

datasets. I understand that pegasos implementation currently only

produces the model file. How can i predict the actual class for test

set? can i use svm_classify or svm_perf_classify? or what is the

formula that i should use to write a predictor of my own. Thanks a lot

in advance.

Thanks,

Venkatesh

November 14, 2011 at 4:00 pm |

No idea — you should ask the people who wrote the software, not us.

April 28, 2013 at 6:04 am |

I am using pegasos svm but do not understand properly. how can i calculate the accuracy & exucution time. Is any file or program available which calculate the accuracy or any other option??