## Tsuruoka, Tsujii, Ananiadou (2009) Stochastic Gradient Descent Training for L1-regularized Log-Linear Models with Cumulative Penalty

I’m a big fan of descriptive titles and clearly stated goals, both of which today’s paper has:

### SGD with Lazy Update and Clipping

The paper introduces a variant of truncated stochastic gradient, a perceptron-like online algorithm for estimating L1-regularized (what we’d call a Laplace or double-exponential prior in the Bayesian world) log-linear models (what we call logistic regression).

LingPipe employs the variant the authors evaluated as a baseline and dubbed “SGD-L1 (Clipping + Lazy-Update)” (see LingPipe’s logistic regression tutorial), which is the simplest form of truncated stochastic gradient where batch size is one (that is, it’s fully stochastic, updating stats on a per-item basis).

### The Problem: Non-Sparse Solutions

The beauty of L1 regularization is that it induces sparse solutions (many coefficients zero). The problem with the stochastic approximation to the gradient is that it can leave many features non-zero solely because of their late position in the training sequence (their figure 1 is a nice illustration).

Tsuroaka et al. evaluate this effect on three CRF-sequence tagging problems (a form of logistic regression where you use the undirected graphical model form of forward-backward to compute the gradient and log loss), and propose a novel solution.

### The Solution: Cumulative Penalty

Because the gradient of the L1 regularizer only depends on the sign of a feature and not its scale, the authors suggest buffering regularization penalties and then applying them right after a feature is estimated. Furthermore, if the regularization weight is greater than the feature (that is, applying it drives the feature past zero, so it’s clipped), the amount of clipping is buffered for future use.

LingPipe basically uses an algorithm that’s almost identical, except that (1) the regularization gradient is applied before estimation and weight update, and at least once afterwards, at the end of an epoch at the latest; and (2) if there’s clipping, the buffered gradient is reduced to zero rather than buffered.

Another slight difference in their algorithm is that they randomly permute the items before each epoch.

### Evaluation for Sparsity

They showed that their new method reduces the final number of non-zero features compared to the baseline truncated stochastic gradient from roughly 90K to roughly 30K. Exponential decay annealing reduces even more to zero. Here’s where you want to see plots for different annealing schedules and numbers of iterations.

A standard quasi-Newton implementation resulted in roughly 20K features, so there’s still some slippage in their method compared to the “correct” result.

### A Possible Bug?

What I didn’t see in their algorithm (in figure 2) was something that applied the penalties at the end of training. More penalty accumulates for variables that never get seen again, and that needs to be applied. This could actually account for the 20K vs. 30K feature discrepancy when compared to quasi-Newton methods.

I introduced this bug in the first version of logistic regression in LingPipe! Mike Ross caught this problem during code review.

### Batch Size 1?

I’d have thought using larger batch sizes as proposed by Langford et al. in their truncated stochastic gradient paper would also help to drive features down. If you’re updated in a batch of 100 features, you get a 100* boost in the regularization gradient that applies. Of course, this won’t necessarily have the same effect with the cumulative algorithm — it’ll probably help more in the simple clipping-based approach.

### Evaluation for Log Loss

The quasi-Newton method did better in all of the evaluations (though the results were very close on their own NLPBA gene tagging data).

The authors’ cumulative penalty versions had log loss in between the simple truncation algorithm and the quasi-Newton method, but they were all pretty close (1/10th to 1/20th of a bit per example).

I’m thinking this could just be an issue of not running SGD long enough with low enough learning rates. I found in my experiments, and others have found the same thing, that it takes a long time to converge to several decimal places.

Differing estimates present another issue for speed evaluation. Without pegging log loss to be the same, you haven’t really found the minimum with SGD. You can see in their figure 2 of learning rates that they could’ve stopped after 20 passes and gotten to nearly the same log loss. What we’ve found is that held out performance is sometimes even better this way — it effectively gives you an overlay of extra regularization from the early stopping.

(In an aside, their figures 3 and 4 clip the last 130 epochs of the OWL-QN baseline, misleadingly make it look like the log loss was better for SGD than quasi-Newton.)

### Oracle Evaluation for Speed

Evaluating SGD for speed is tricky, as I mentioned in my favorable review of Pegasos. The problem is that convergence speed varies dramatically w.r.t. the learning rate and annealing schedule.

If I’m reading the Tsuroaka et al. paper correctly, in section 4.1 the authors say they use 30 passes (the final number they use) to tune the learning rate for SGD, using base learning rates of 1, 0.5, 0.2, and 0.1 (?eh — I need much smaller ones as did Langford et al.) and exponential annelaing rates of 0.9, 0.85 and 0.8 (a total of 12 settings, with no indication of how problem-specific they are).

The problem I have with this approach is that the authors are effectively using an oracle to choose learning rates and then claiming their system is faster. It’s exactly this tuning of learning rates and number of epochs that’s so cumbersome for SGD approaches.

To make matters worse, they use Microsoft’s research-use-only quasi-Newton system OWL-QN to tune the regularization parameter. This kind of empirical Bayes tuning of the priors interacts with the learning rate and number of epoch tuning in an unfortunate way. Or in a fortunate way if you use path following to evaluate a range of priors.

The simple method, as used in LingPipe, was just as fast to get through 30 epochs. That’s not surprising, but it’s the wrong evaluation. You can see that the cumulative versions get to the same log loss much faster than the simple approach.

### Exponential vs. Inverse Annealing

They argue for exponential annealing over inverse annealing on practical rather than theoretical grounds; LingPipe implements both forms along with constant learning rates, as well as providing an annealing interface for custom implementations to be plugged in. Inverse annealing is required in theory, but I’ve found the same thing in practice — exponential workds better.)

### Comparing Apples and Oranges

As the authors point out, another problem with evaluating for speed is that they compared with a system they didn’t write, so who knows what unknowns are lurking in that comparison. Even if a single person writes all the code, there’s no reason to believe their experience and skill is equal or the time they spend tuning for application speed is the same. Having more comparable implementations (and I don’t even know how to measure this) might drive the ratio in either direction.

### Memory vs. CPU Speed!

Why does everyone only report CPU type and speed (e.g. “Xeon 2.13 GHz processor”)? Different Xeons have very different amounts of L2 cache and some have different numbers of cores.

Perhaps an even bigger variable is memory and front-side bus speed. We’ve found that to be as big a factor in these large models where essentially every example needs to be moved into and out of the CPU. Error correcting memory is slower, but more robust. Etc. etc.