I’ve been meaning to blog about this topic ever since seeing David Chiang’s talk on machine translation a few weeks ago at Columbia. (I don’t have a citation for this idea, but if you know one, I’m happy to insert it.)

What David proposed was largely a typical approach to fitting a large-scale, sparse linear model: run some kind of stochastic gradient updates and stop before reaching convergence. Even without a shrinkage/prior/regularization component in the model (and hence the gradient), early stopping acts as a kind of poor-man’s regularization. The reason is that there’s only so far a coefficient can move in 10 or 20 iterations (all they could afford given the scale of their MT problem). Even a separable feature (one that co-occurs with only a single output category) will remain relatively bounded.
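A minimal sketch of the point, under assumptions of my own (binary features, logistic loss, constant learning rate — none of this is Chiang et al.’s actual setup): with learning rate eta and E epochs, a coefficient can move at most eta × E × (number of examples in which its feature fires), so early stopping alone bounds the weights, and rare features are bounded most tightly.

```python
import numpy as np

# Hypothetical sketch: a few epochs of SGD on a sparse logistic-regression
# problem with NO explicit regularizer. Each update moves a weight by at
# most eta (since |y - p| <= 1 and features are 0/1), and a weight only
# moves on examples where its feature fires, so after E epochs
#   |w_j| <= eta * E * count_j.
rng = np.random.default_rng(0)
N, D = 1000, 50
X = (rng.random((N, D)) < 0.05).astype(float)   # sparse binary features
w_true = rng.normal(0, 2, D)
y = (rng.random(N) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def sgd_epochs(epochs, eta=0.5):
    w = np.zeros(D)
    for _ in range(epochs):
        for i in range(N):
            p = 1 / (1 + np.exp(-X[i] @ w))
            w += eta * (y[i] - p) * X[i]        # plain gradient step, no prior
    return w

w_early = sgd_epochs(2)    # stopped early: weights stay bounded
w_late = sgd_epochs(20)    # run longer: weights are free to drift further

print(np.abs(w_early).max(), np.abs(w_late).max())
```

The bound is crude but real: it is exactly the “only so far a coefficient can move” observation, and it is tightest for the low-count features.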

The novel (to me) twist was that David and crew up-weighted the learning rate for low-count features. This was counterintuitive enough to me and Jenny Finkel that we discussed it during the talk. Our intuitions, I believe, were formed by thinking of a model involving priors run to convergence. Actually, if you follow the Robbins-Monro schedule for the update weights, it probably doesn’t make a difference in the limit.

Of course, we don’t live in the limit.

It occurred to me well after the talk that the reason David’s strategy of up-weighting low-count features could help is exactly because of the early stopping. Features that don’t occur very often have even fewer chances to be moved by the gradient, especially if they’re (a) correlated with features that do occur frequently, or (b) occur in training examples that are well modeled by other features. We have ample experience in NLP that low-count features help with prediction — we usually only prune for the sake of memory.

Boosting the learning rate of low-count features helps them, to some extent, overcome the limited number of training iterations. The trick is to make sure it helps more than it hurts. It’d also be nice to have a better theoretical understanding of what’s going on.
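The count-based rescaling is easy to sketch as a per-feature learning rate. To be clear, the 1/sqrt(count) scaling below is my own assumption for illustration, not the recipe from the talk:

```python
import numpy as np

# Hypothetical sketch of up-weighting low-count features: give each feature
# its own learning rate, larger for features seen fewer times, so a rare
# feature can cover more ground in the few epochs early stopping allows.
# The 1/sqrt(count) scaling is an assumption, not Chiang et al.'s recipe.
rng = np.random.default_rng(1)
N, D = 500, 20
X = (rng.random((N, D)) < rng.uniform(0.01, 0.5, D)).astype(float)
w_true = rng.normal(0, 1, D)
y = (rng.random(N) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

counts = np.maximum(X.sum(axis=0), 1.0)
eta = 0.1 / np.sqrt(counts)     # per-feature rate: rare features take bigger steps

w = np.zeros(D)
for _ in range(5):              # early stopping: just a handful of epochs
    for i in range(N):
        p = 1 / (1 + np.exp(-X[i] @ w))
        w += eta * (y[i] - p) * X[i]
```

The effect is that a feature seen 10 times gets steps roughly 3× larger than one seen 100 times, partially compensating for the 10× fewer updates it receives.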

For some reason, this basic approach reminds me of (a) normalization of feature (co-)variance, and (b) other ad hoc weighting schemes, like Lewis and Madigan’s trick of weighting the features in a document by TF/IDF for logistic regression classification.

June 7, 2011 at 12:18 am

I don’t think he covers the up-weighting of the rarer features (which is interesting!), but this is probably the paper for their general approach:

11,001 New Features for Statistical Machine Translation

David Chiang, Kevin Knight, and Wei Wang

http://www.isi.edu/~chiang/papers/11001.pdf

June 7, 2011 at 6:07 am

If you look at David Chiang et al.’s first paper on MIRA training of MT systems (http://www.isi.edu/~chiang/papers/mira.pdf), you can see that he scales down the ‘probabilistic’ features (or, equivalently, scales up the sparse features and penalties). But the scaling is just mentioned in passing, and not really discussed.

In more recent work, I thought that he was scaling features according to their variance – the scaling factor is related to the inverse of the variance. I was not aware of the upscaling for rare features.

June 7, 2011 at 4:23 pm

Inverse of the variance is the first step in conditioning. It puts all the variables on the same scale, but doesn’t adjust for covariance. It doesn’t change the answer (the coefficients just scale inversely to the predictors), but can make stochastic search much more efficient.
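The “doesn’t change the answer” claim is easy to check numerically. A small sketch (my own example, using least squares for determinism): dividing each predictor by its standard deviation leaves the fitted values unchanged while the coefficients scale inversely.

```python
import numpy as np

# Sketch of the conditioning point above: rescaling each column of X by
# 1/sd puts the variables on the same scale. The fit is identical -- the
# coefficients just scale inversely to the predictors -- but equal scales
# make stochastic search much better behaved.
rng = np.random.default_rng(2)
X = rng.normal(0, [0.1, 10.0], size=(200, 2))   # wildly different scales
y = X @ np.array([3.0, -0.02]) + rng.normal(0, 0.1, 200)

sd = X.std(axis=0)
Xs = X / sd                                     # columns now on one scale

b_raw, *_ = np.linalg.lstsq(X, y, rcond=None)
b_std, *_ = np.linalg.lstsq(Xs, y, rcond=None)

# same fitted values; coefficients scale inversely to the predictors
assert np.allclose(Xs @ b_std, X @ b_raw)
assert np.allclose(b_std, b_raw * sd)
```

Note this only equalizes the per-variable scales; as the comment says, it doesn’t adjust for covariance between predictors (that would require whitening with the full covariance matrix).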

More heavily weighting rare features is an entirely different idea. If the features are binary, then rare (and frequent) features have low variance, with maximal variance for features that show up in half the examples, because a binary feature that fires with probability p has variance p(1 − p).
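A quick numerical check of the variance claim, since the formula is small enough to verify directly:

```python
# For a binary (Bernoulli) feature that fires with probability p,
# the variance is p * (1 - p): maximized at p = 0.5 and tiny for
# rare (or near-universal) features.
def bernoulli_var(p):
    return p * (1 - p)

print(bernoulli_var(0.5))    # the maximum, 0.25
print(bernoulli_var(0.01))   # a rare feature: close to 0
```

So inverse-variance scaling would also up-weight rare binary features, though not by the same factor as a count-based scheme.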

June 7, 2011 at 4:40 pm

This is all based on Mark Dredze et al.’s work on confidence-weighted learning (ICML 2008) and related papers on Second Order Perceptron (Cesa-Bianchi et al, 2005) and AROW (Crammer et al, NIPS 2009).