I’ve been meaning to blog about this topic ever since seeing David Chiang’s talk on machine translation a few weeks ago at Columbia. (I don’t have a citation for this idea, but if you know one, I’m happy to insert it.)
What David proposed was largely a typical approach to fitting a large-scale, sparse linear model: run some kind of stochastic gradient updates and stop before you reach convergence. Even without a shrinkage/prior/regularization component to the model (and hence the gradient), early stopping results in a kind of poor man’s regularization. The reason is that there’s only so far a coefficient can move in 10 or 20 iterations (all they could afford given the scale of their MT problem). Even the coefficient for a separable feature (one that co-occurs with only a single output category), which would grow without bound if the model were run to convergence, will remain relatively bounded.
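To make the bounded-movement point concrete, here’s a minimal sketch of unregularized stochastic gradient descent for binary logistic regression over sparse binary features. Everything in it is an assumption for illustration, not David’s setup: the function name `sgd_logistic`, the toy data, and the fixed learning rate of 0.1. Feature 1 is separable (it only ever occurs with the positive category), so its coefficient keeps growing with more epochs, but each update moves it by less than the learning rate, so a handful of epochs can only take it so far.

```python
import math

def sgd_logistic(data, n_features, epochs, rate=0.1):
    """Unregularized SGD for binary logistic regression.

    data: list of (active_feature_indices, label) pairs with label in {0, 1}.
    Returns the coefficient list after the given number of epochs.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for feats, y in data:
            z = sum(w[j] for j in feats)
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(y = 1)
            for j in feats:
                # |y - p| < 1, so one update moves w[j] by less than `rate`
                w[j] += rate * (y - p)
    return w

# Toy data (hypothetical): feature 0 fires everywhere,
# feature 1 fires once and only with label 1 (separable).
data = [((0,), 0), ((0,), 1), ((0, 1), 1), ((0,), 0)]

w_early = sgd_logistic(data, n_features=2, epochs=10)
w_late = sgd_logistic(data, n_features=2, epochs=2000)

# Feature 1 is seen once per epoch, so after 10 epochs its
# coefficient is below rate * 10 = 1.0; it keeps growing after that.
print("after 10 epochs:  ", w_early[1])
print("after 2000 epochs:", w_late[1])
```

The bound here is just the learning rate times the number of updates the feature receives, which is why stopping early acts like a cap on the coefficient even with no prior in the model.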
The novel (to me) twist was that David and crew up-weighted the learning rate for low-count features. This was counterintuitive enough to me and Jenny Finkel that we discussed it during the talk. Our intuitions, I believe, were formed by thinking of models with priors that are run to convergence. In that setting, the per-feature weighting probably doesn’t make a difference in the limit, as long as the learning rates follow the Robbins-Monro conditions.
Of course, we don’t live in the limit.
It occurred to me well after the talk that the reason David’s strategy of up-weighting low-count features could help is exactly because of the early stopping. Features that don’t occur very often have even fewer chances to be moved by the gradient, especially if they (a) are correlated with features that do occur frequently, or (b) occur in training examples that are well modeled by other features. We have ample experience in NLP that low-count features help with prediction; we usually only prune for the sake of memory.
Boosting the learning rate of low-count features helps them overcome the limited number of training iterations to some extent. The trick is to make sure it helps more than it hurts. It’d also be nice to have a better theoretical understanding of what’s going on.
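Here’s a rough sketch of what per-feature up-weighting might look like. I don’t know the weighting David actually used, so the inverse-square-root scaling in the hypothetical `feature_rates` function is purely my assumption, as are the toy data and the base rate of 0.1. The rare feature (feature 1) shares its only training example with a frequent feature, which is the kind of case where it gets few chances to move.

```python
import math
from collections import Counter

def feature_rates(data, base_rate=0.1):
    """Per-feature learning rates, larger for rarer features.

    Assumed scheme (not from the talk): base_rate * sqrt(max_count / count).
    """
    counts = Counter(j for feats, _ in data for j in feats)
    max_count = max(counts.values())
    return {j: base_rate * math.sqrt(max_count / c) for j, c in counts.items()}

def sgd_logistic(data, rates, epochs):
    """Unregularized SGD for binary logistic regression, one rate per feature."""
    w = {j: 0.0 for j in rates}
    for _ in range(epochs):
        for feats, y in data:
            z = sum(w[j] for j in feats)
            p = 1.0 / (1.0 + math.exp(-z))   # predicted P(y = 1)
            for j in feats:
                w[j] += rates[j] * (y - p)
    return w

# Toy data (hypothetical): feature 0 occurs in all 9 examples,
# feature 1 occurs once, alongside feature 0, with label 1.
data = [((0,), 0)] * 4 + [((0,), 1)] * 4 + [((0, 1), 1)]

flat_rates = {0: 0.1, 1: 0.1}        # same rate for every feature
boosted_rates = feature_rates(data)  # rare feature 1 gets 3x the base rate

w_flat = sgd_logistic(data, flat_rates, epochs=10)
w_boost = sgd_logistic(data, boosted_rates, epochs=10)

print("rare-feature coefficient, flat rates:   ", w_flat[1])
print("rare-feature coefficient, boosted rates:", w_boost[1])
```

With the same 10 epochs, the boosted run moves the rare feature’s coefficient further than the flat run does. Whether that extra movement helps or hurts held-out prediction is exactly the open question, and this toy example says nothing about it.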
For some reason, this basic approach reminds me of (a) normalization of feature (co)variance, and (b) other ad hoc weighting schemes, like Lewis and Madigan’s trick for logistic regression classifiers of weighting the features in a document by TF/IDF.