By “blocked”, I mean the same thing as in the Langford et al. paper on truncated gradient. That is, every N items or so (say N=500), I apply the prior gradient at N times its stochastic strength to all features. But I only block the prior. They also block the likelihood gradient by evaluating all N items in parallel and summing their gradients.

By “lazy”, I mean only applying regularization/prior gradient to features that are in the item about to be handled. I keep track of how many items it’s been since the gradient was applied and scale accordingly. Or at least we used to.

If the L1 penalty clips in a stochastic update, it’ll clip in the lazy update. Before using an item, I bring the priors up to date for every feature in that item.
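A minimal sketch of why the clipping carries over, assuming a constant per-item penalty (learning rate times prior gradient over number of items) and a feature that receives no likelihood update while it is skipped. The function names and the `penalty` parameter are made up for illustration and are not LingPipe's API.

```python
def l1_step(w, penalty):
    """Move w toward zero by `penalty`, clipping at zero rather than crossing it."""
    if w > 0.0:
        return max(0.0, w - penalty)
    if w < 0.0:
        return min(0.0, w + penalty)
    return 0.0


def eager_l1(w, penalty, skipped):
    """Apply the clipped L1 penalty once per skipped item."""
    for _ in range(skipped):
        w = l1_step(w, penalty)
    return w


def lazy_l1(w, penalty, skipped):
    """Apply one clipped L1 penalty scaled by the number of skipped items."""
    return l1_step(w, skipped * penalty)
```

Once a weight hits zero it stays there until a likelihood update moves it, so the sequence of small clipped steps and the single large clipped step end at the same value. If the learning rate is annealed, the lazy step would need the sum of the skipped rates rather than a simple count.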

The reason lots of categories (say K=200) matters is that when you do the stochastic update, it applies to 200 coefficients per feature, one per category. When features are very common (as they are with character n-grams), you wind up doing lots more work even with the lazy approach than with the blocked approach.

Of course, your mileage (kilometerage?) may vary depending on the sparsity structure of the problem.

]]>Yes, you are right, when the instance is sparse, the penalty leads to a dense gradient.

I don’t think you are correct about equivalence to the stochastic update for L1, because the result can change depending on whether you cross 0 and perhaps clip the value to 0.

I am not sure if you are talking about this technique, but you can also try applying the penalty when the feature value is non-zero. This leads to more frequent applications of the penalty for more frequent features (scaled by the number of time steps since the last penalty update, so that more frequent features are not overpenalized).

I don’t understand the difference between your “lazy” approach and the “blocked” one. I also don’t understand what the number of outputs (categories) has to do with the input sparsity.

]]>The reason not to apply L2 at each instance at which the likelihood gradient is updated is that it will dominate computation if the instance vectors are at all sparse.

In prior versions of LingPipe, I used lazy updates, where you keep track of how many instances it has been since the last regularization for a feature you’re about to use, then apply the prior before you use it (multiplied by the number of instances since the last update divided by the total number of instances).

It’s an exact solution equivalent to the stochastic update for L1. You need to do something more clever to do L2 or other non-constant gradient priors because of the non-additivity of consecutive updates (you need to account for the equivalent of compound interest). So we just approximate by applying the number of instances since last update divided by the total number of instances times the gradient (instead of 1 over the number of instances times gradient).
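To make the compound-interest point concrete, here is a rough sketch assuming a Gaussian prior whose per-item update shrinks a weight by a fixed fraction `c` (standing in for learning rate times inverse variance over number of instances). The names are illustrative only.

```python
def l2_eager(w, c, skipped):
    """Apply the L2 shrinkage once per skipped item."""
    for _ in range(skipped):
        w -= c * w
    return w


def l2_compound(w, c, skipped):
    """Closed form of the eager loop: the shrinkage compounds geometrically."""
    return w * (1.0 - c) ** skipped


def l2_linear(w, c, skipped):
    """Linear approximation: one update scaled by the number of skipped items."""
    return w - skipped * c * w
```

With `w = 2.0`, `c = 0.01` and 10 skipped items, the compounded result is about 1.8088 while the linear approximation gives 1.8, so the approximation shrinks slightly too much. The gap grows with `skipped * c`.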

Recently, we’ve been scaling to problems with large numbers of categories (200 plus) and medium sized corpora (150K items), and found that the sparse updates completely dominate compute time with lots of categories. So I just swapped out the lazy update with a blocked one, where every N steps, you just apply all the priors (again with a linear approximation of block size divided by number of training instances times learning rate times gradient).
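The blocked scheme might look roughly like the sketch below. It assumes weights stored as a dense list of rows (one row per category), a caller-supplied `prior_grad` giving the gradient of the penalty for a single weight, and a caller-supplied `likelihood_step` for the sparse per-item update. All of these names are hypothetical, and the catch-up pass for a trailing partial block is my own addition.

```python
def apply_blocked_prior(weights, block_size, num_instances, learning_rate, prior_grad):
    """Apply the prior to every weight, scaled linearly by the block size."""
    scale = learning_rate * block_size / num_instances
    for row in weights:
        for j, w in enumerate(row):
            row[j] = w - scale * prior_grad(w)


def train_epoch(items, weights, block_size, learning_rate, prior_grad, likelihood_step):
    """One pass over the data: sparse likelihood updates, dense prior every block."""
    n = len(items)
    for i, item in enumerate(items, start=1):
        likelihood_step(weights, item, learning_rate)  # touches only the item's features
        if i % block_size == 0:
            apply_blocked_prior(weights, block_size, n, learning_rate, prior_grad)
    remainder = n % block_size
    if remainder:
        # Assumption: catch up the prior for the trailing partial block.
        apply_blocked_prior(weights, remainder, n, learning_rate, prior_grad)
```

The dense pass costs one sweep over all categories and features, but it happens only once per block rather than once per feature occurrence, which is where the savings come from when features are common and categories are many.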

]]>If you are using an l2 prior (Gaussian), why not apply the l2 penalty at every stochastic update? I do that all the time.

]]>