>> because the gradient is proportional to the value. Is that what

>> Shalev-Shwartz and Singer discuss?

You’re right. Effectively what happens is that “compound interest” builds up. This is exactly what Shalev-Shwartz and Singer discuss. I’m not sure about Cauchy priors. I’ve never worked with them before. There might be some extra bookkeeping there.

I haven’t done any really hardcore benchmarking, but when I’ve coded CRFs and used L2 priors, the sparse updates suggested by Pegasos work equally well there. It depends on how sparse your lattice of feature vectors is. But at least for NER and some word alignment work, it was still helpful. Of course, you’re right in noting that computing the gradient for the regularizer is trivial compared to forward-backward.

What I’m doing is a bit different. I’m updating the likelihood gradient fully stochastically (one item at a time), and updating the prior only every N items. The likelihood update is scaled by the learning rate; the prior update is scaled by learning rate * N / NumberOfItems.
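In pseudocode, that hybrid scheme looks something like the following. This is just a hedged sketch of what I described, not code from any actual system; `likelihood_grad` and all the parameter names are made up for illustration.

```python
import numpy as np

def sgd_with_blocked_prior(items, w, likelihood_grad, rate, sigma_sq, N):
    """One epoch: per-item likelihood steps, prior applied every N items."""
    num_items = len(items)
    for i, item in enumerate(items, start=1):
        # Fully stochastic likelihood update, one item at a time.
        w += rate * likelihood_grad(w, item)
        # Every N items, apply the dense L2 (Gaussian) prior gradient,
        # scaled by N / num_items so one epoch applies the prior once.
        if i % N == 0:
            w -= (rate * N / num_items) * (w / sigma_sq)
    return w
```

With N = 1 this degenerates to fully item-at-a-time updates (where the dense prior step dominates); with N = num_items it is one full-batch prior update per epoch.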

A standard blocking approach would compute the gradient for likelihood for all N items at once, then update. (Thus making it easier to parallelize than what I’m doing.) And it lets you do ordinary (conjugate) gradient by setting N to the number of items.

I only implemented the prior the way I did because I was doing fully item-at-a-time updates and found the full dense prior update was dominating computation time by an order of magnitude.

When I coded logistic regression classifiers, I applied the prior in a lazy fashion, which seems to be pretty standard. I just kept an array recording, for each dimension, how many items had been processed when I last updated its prior. Then I updated the prior on a dimension only just before seeing an item with a non-zero value for that dimension. That didn’t seem like it’d be nearly as efficient for CRFs, because the whole lattice of feature vectors would need to update the last-seen array.
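The bookkeeping I mean is roughly this (a hedged sketch; names are mine, and this is the simple lump-sum catch-up, which applies all the missed L2 steps as one big step):

```python
def lazy_catch_up(w, last_update, t, active_dims, rate, sigma_sq):
    """Before processing item t, catch up the prior on each dimension
    that item t touches, using a last-seen array."""
    for d in active_dims:
        missed = t - last_update[d]
        if missed > 0:
            # Approximate: all `missed` deferred L2 steps as one
            # lump-sum shrinkage toward zero.
            w[d] -= missed * rate * w[d] / sigma_sq
            last_update[d] = t
```

Only the dimensions active in the current item ever get touched, which is where the savings over a dense per-item prior update come from.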

The lazy approach doesn’t quite apply the same penalty as L2, because the gradient is proportional to the value. Is that what Shalev-Shwartz and Singer discuss? I always figured someone good enough at math could figure out the “compound interest”. I’ll have to go back and look that up. (And could you do the Cauchy prior, too, while you’re at it, so I’ll have the complete set?)
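For what it’s worth, the “compound interest” for L2 does work out in closed form: each deferred per-item step multiplies the weight by (1 - rate/sigma_sq), so k deferred steps compound multiplicatively rather than summing. A sketch (my own naming, not from the paper):

```python
def lazy_l2_exact(w_d, missed, rate, sigma_sq):
    # k deferred L2 steps, each w *= (1 - rate/sigma_sq), compound
    # multiplicatively, so this matches per-item updates exactly.
    return w_d * (1.0 - rate / sigma_sq) ** missed
```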

The lazy approach does apply the same penalty as item-at-a-time L1, because the gradient’s just a (truncated) constant.
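Concretely, since the L1 step size is a constant and a weight that hits zero stays there, k deferred steps collapse into a single move toward zero clipped at zero. A hedged sketch (names mine):

```python
def lazy_l1_exact(w_d, missed, rate, lam):
    # Each L1 step moves w toward zero by rate*lam and truncates at
    # zero, so k deferred steps equal one step of size k*rate*lam,
    # clipped at zero.
    if w_d >= 0:
        return max(0.0, w_d - missed * rate * lam)
    return min(0.0, w_d + missed * rate * lam)
```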

If I understand your “blocked prior” comment correctly, you can actually make this update exactly in the case of L2. In the case of L1, you cannot exactly achieve sparsity at every instance, but you can still, with bookkeeping, efficiently incorporate the gradient of the regularizer at each instance.

The L2 (Gaussian) version is discussed in the Shalev-Shwartz and Singer Pegasos paper (ICML 2007). Since the portion of the gradient corresponding to the regularizer is just a scaling, you can keep track of the updates using two extra scalars. Duchi & Singer’s JMLR paper “Online and Batch Learning using Forward Backward Splitting” gives details for L1 and L-infinity; the L1 case is essentially the truncation algorithm of Langford et al. (2008).
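The scaling idea is to represent w = scale * v, so the regularizer’s shrinkage is a single scalar multiply and the sparse likelihood update only touches nonzero features. A minimal sketch of that representation (my naming, and I’m only tracking one scalar here; the paper also tracks the squared norm as a second):

```python
class ScaledWeights:
    """Weight vector stored as w = scale * v."""

    def __init__(self, dim):
        self.v = [0.0] * dim
        self.scale = 1.0

    def shrink(self, factor):
        # w *= factor in O(1): the dense L2 step is just a rescale.
        self.scale *= factor

    def add_sparse(self, sparse_grad, step):
        # w += step * g, touching only the nonzero entries of g;
        # divide by scale so the implied w gets the right update.
        for d, g in sparse_grad.items():
            self.v[d] += step * g / self.scale

    def get(self, d):
        return self.scale * self.v[d]
```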

— John
