This paper is the bomb:
- Langford, John, Lihong Li and Tong Zhang. 2009. Sparse online learning via truncated gradient. JMLR 10:777–801.
It contains a very general theoretical analysis of the convergence of the kind of truncated L1 normalization (Laplace prior) we have in LingPipe’s logistic regression implementation. They establish convergence for a very general case of loss including standard least squares linear regression, SVMs and logistic regression as special cases, and a very general stochastic condition which allows updates for N examples at a time. (Is this latter condition useful for parallelizing this algorithm?)
One of the paper’s main motivations is to find sparse sets of coefficients, which is why they explore L1 norms (Cauchy priors [free preprint version] also have this property of having a density spike around the mean coupled with fat tails).
I’ve been planning to add a truly online version of logistic regression training that’d only store the coefficients in memory and a handful of examples (the batch size, which when set to 1, gives traditional stochastic gradient). Our implementation only meets the first of the other two desiderata laid out by Langford et al. (p. 779):
- The algorithm should be computationally efﬁcient: the number of operations per online step should be linear in the number of nonzero features, and independent of the total number of features.
- The algorithm should be memory efﬁcient: it needs to maintain a list of active features, and can insert (when the corresponding weight becomes nonzero) and delete (when the corresponding weight becomes zero) features dynamically.
What I haven’t done is made the coefficient vectors sparse. The problem with doing that is there’s a pretty steep penalty to pay in efficiency for doing so, at least using any kind of sparse matrices I know how to implement.
There’s also the issue of symbol tables if there’s a truly vast set of features, but that’s not really the gradient estimator’s responsibility.
There’s the usual set of experimental results, too, showing that the method works well on known data sets, such as Reuters RCV1. They also evaluate some of their own (Yahoo! advertising) data with 3 x 109 features, 26 x 106 training instances. Note that this is nearly three orders of magnitude more features than data points, which is why you need to apply such a strong prior; without it you’d have rather an overfitting problem (there’s more discussion in Tong’s further work on the the subject, Adaptive Forward-Backward Greedy Algorithm for Learning Sparse Representations).
They use held-out evals like machine learning folks rather than reporting log loss on the training data like statisticians.
All in all, L1 with truncated gradient does a pretty good job at removing features — even better than the regular Lasso (they evaluated prior variances that led to fewer and fewer features; sort of like the diagram in Hastie et al.’s machine learning book when they describe the Lasso). Maybe we should be experimenting with even lower prior variances from a zero mean.