Langford, Li, and Zhang (2009) Sparse Online Learning via Truncated Gradient


This paper is the bomb:

It contains a very general theoretical analysis of the convergence of the kind of truncated L1 normalization (Laplace prior) we have in LingPipe’s logistic regression implementation. They establish convergence for a very general case of loss including standard least squares linear regression, SVMs and logistic regression as special cases, and a very general stochastic condition which allows updates for N examples at a time. (Is this latter condition useful for parallelizing this algorithm?)

One of the paper’s main motivations is to find sparse sets of coefficients, which is why they explore L1 norms (Cauchy priors [free preprint version] also have this property of having a density spike around the mean coupled with fat tails).

I’ve been planning to add a truly online version of logistic regression training that’d only store the coefficients in memory and a handful of examples (the batch size, which when set to 1, gives traditional stochastic gradient). Our implementation only meets the first of the other two desiderata laid out by Langford et al. (p. 779):

  • The algorithm should be computationally efficient: the number of operations per online step should be linear in the number of nonzero features, and independent of the total number of features.
  • The algorithm should be memory efficient: it needs to maintain a list of active features, and can insert (when the corresponding weight becomes nonzero) and delete (when the corresponding weight becomes zero) features dynamically.

What I haven’t done is made the coefficient vectors sparse. The problem with doing that is there’s a pretty steep penalty to pay in efficiency for doing so, at least using any kind of sparse matrices I know how to implement.

There’s also the issue of symbol tables if there’s a truly vast set of features, but that’s not really the gradient estimator’s responsibility.

There’s the usual set of experimental results, too, showing that the method works well on known data sets, such as Reuters RCV1. They also evaluate some of their own (Yahoo! advertising) data with 3 x 109 features, 26 x 106 training instances. Note that this is nearly three orders of magnitude more features than data points, which is why you need to apply such a strong prior; without it you’d have rather an overfitting problem (there’s more discussion in Tong’s further work on the the subject, Adaptive Forward-Backward Greedy Algorithm for Learning Sparse Representations).

They use held-out evals like machine learning folks rather than reporting log loss on the training data like statisticians.

All in all, L1 with truncated gradient does a pretty good job at removing features — even better than the regular Lasso (they evaluated prior variances that led to fewer and fewer features; sort of like the diagram in Hastie et al.’s machine learning book when they describe the Lasso). Maybe we should be experimenting with even lower prior variances from a zero mean.

One Response to “Langford, Li, and Zhang (2009) Sparse Online Learning via Truncated Gradient”

  1. The Ideal Large Scale Learning Class « Machine Learning (Theory) Says:

    […] l1 regularization is via truncated gradient. See Bob Carpenter’s discussion. John Duchi’s composite mirror descent generalization is also a useful general […]

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: