I still can’t quite get over how well stochastic gradient descent (SGD) works for the kinds of large scale, sparse convex optimization problems we find in natural language processing — SVMs, CRFs, logistic regression, etc. I just finished reading this ICML paper that was recommended to me:
- Shalev-Schwartz, Shai, Yoram Singer, and Nathan Srebro. 2007. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. ICML.
It’s a nice paper that introduces a blocked gradient optimization approach for support vector machines (SVMs) with linear kernels. (The issues for linear SVMs are almost the same as for logistic regression.) Setting the blocks to a single item yields stochastic gradient, and setting them to the whole corpus gives you a traditional gradient approach. The innovation is a step that handles L2 (aka Gaussian prior) regularization by projecting the coefficients down onto an L2-ball (this technique’s since also been used for L1 norms). The benefit over standard SGD is that it automatically sets the pesky learning rate parameter (see below for an empirical discussion).
The paper also introduces a theoretical convergence analysis that shows it converges in O(d/(λε)) where d is a bound on non-zero features in an example, λ is the regularization parameter, and ε is the error bound on estimation. Note that like other SGD-like methods, the convergence is
- independent of the number of samples, and
- independent of total dimensionality
though it is dependent on d, the maximum number of non-zero features per instance.
Does it Work in Practice?
But what really struck me is Figure 2 on page 7, which compares the rate of convergence of Pegasos with that of a sparse stochastic gradient, as described in Tong Zhang’s earlier ICML paper:
- Zhang, Tong. 2004. Solving large scale linear prediction problems using stocahstic gradient descent algorithms. ICML.
The problem is text classification of Arxiv papers as to whether they’re about astro-physics or not, still a small problem with 100K features and 60K examples with high sparsity. The lines in red are Tong’s SGD with different constant learning rates: 10, 1, 0.1, 0.01, 0.001 (presumably arranged left-to-right, with 10 being the non-converger). The blue line is Pegasos. The horizontal axis is log time, and the vertical axis represents the error function (for SVMs, that’s hinge loss on the training set; logistic regression uses log loss rather than hinge loss, but basically works the same way).
Shalev-Schwartz conclude that the main advantage of their approach over SGD (which enjoys many of the same high-level advantages, like convergence independent of number of samples) is that you don’t have to set the learning rate. This may seem to be nitpicking, but it’s not. Those five runs of SGD represent quite a bit more processing time than one run of Pegasos. Perhaps even more crucially, when you’re doing SGD, you don’t know what the final converged error rate is, so you can’t tell if you’ve gotten there without trying lots of rates.
The graph and our own experiences show that you don’t actually need to run SGD to convergence for the different learning rates. You can see from the early part of the SGD learning rate curves which learning rate has the steepest initial descent. That, combined with a reasonable annealing rate (lowering learning rate over time) winds up being fast in practice. But it requires babysitting by someone with a feel for the results.
Evaluating multiple learning rates can be easily parallelized — it’s what’s known as “embarassingly parallel” process. Or just interleaved on one processor. Running all 5 SGD runs up to 104 seconds gets you to the result in half the time that running Pegasos for 105 seconds.
Annealing for Final Convergence
Also note that annealing lets you get closer to the optimal answer than is shown in the graph above. The reason the faster learning rate curves asymptote above the error of Pegasos is that near convergence, learning rates need to be lowered.
How Much Accuracy is Necessary?
For a general overview of the properties of SGD and convergence, I’d highly recommend Léon Bottou’s very readable slides from his 2007 NIPS tutorial:
- Bottou, Léon. 2007. Learning with Large Scale Datasets. NIPS Tutorial.
For me, the most significant piece of the presentation was this graph from page 31, comparing C.-J. Lin et al.’s LibLinear coordinate descent-based regularized SVM solver to SGD:
The horizontal axis is the accuracy of the solution relative to the optimal solution (the ε from above). The top graph shows how held-out test error (also called generalization error) drops as estimation accuracy increases. By the time we’re within 0.0001 of the minimum error, we’ve asymptoted on held-out accuracy. More simply, we don’t need more than 4 or 5 decimal places of accuracy with respect to error (what LingPipe uses as a stopping condition for regression). We’ve found exactly the same thing on the real problems (e.g. classifying clinical reports by disease and linkage between textual gene mentions and Entrez-Gene).
The bottom graph shows the time it takes the solvers on the vertical axis, with SGD being way faster for lower accuracies.
This is why I was so vexed when I wrote the first unit tests for our SGD logistic regression solver. They were taking forever (i.e. 100 seconds) to converge to 7 decimal places of accuracy.
Convergence by Held-Out Performance
With training data you can use cross-validation to estimate convergence using held-out performance. We monitor cross-validation error along with training set error along with the variance of the same. Our cross-validating corpus is getting a heavy workout in conjunction with SGD for logistic regression.
The Bottom Line
The bottom-line of Bottou’s graph (and this blog post) is that SGD estimation time grows inversely proportionally to accuracy. It’s really fast to get within a thousandth or ten thousandth of the optimal solution. And getting further doesn’t help much, if at all, on held-out data.