I don’t know how the elastic net prior (aka regularizer) snuck by me unnoticed. It was introduced in:
- Zou, Hui and Trevor Hastie. 2005. Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society B. 67(Part 2):301–320.
I was reading about it in the technical report appearing as part of the documentation to their glmnet R package:
- Friedman, Jerome, Trevor Hastie, and Rob Tibshirani. 2009. Regularized Paths for Generalized Linear Models via Coordinate Descent. CRAN documentation for glmnet.
The idea’s basically an “interpolation” between the Laplace (L1, lasso) and Gaussian (L2, ridge) priors:

p(β | λ, α) ∝ exp(−λ [α‖β‖₁ + (1−α)‖β‖₂²])

where β is the vector of coefficients, λ a factor for prior variance, and α the blending ratio between the L1 and L2 priors, themselves represented by the (unsquared) L1 norm and squared L2 norm terms.
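As a concrete sketch of that penalty (assuming the glmnet-style convention in which α = 1 is pure lasso and α = 0 pure ridge; the function name is mine, not from either paper):

```python
import numpy as np

def elastic_net_penalty(beta, lam, alpha):
    """Elastic net penalty: lam * (alpha * ||beta||_1 + (1 - alpha) * ||beta||_2^2).

    alpha = 1 recovers the L1 (lasso) penalty, alpha = 0 the L2 (ridge) penalty.
    """
    beta = np.asarray(beta, dtype=float)
    l1 = np.sum(np.abs(beta))      # unsquared L1 norm
    l2_sq = np.sum(beta ** 2)      # squared L2 norm
    return lam * (alpha * l1 + (1.0 - alpha) * l2_sq)

beta = [1.0, -2.0, 0.0]
print(elastic_net_penalty(beta, lam=0.5, alpha=1.0))  # pure L1: 0.5 * 3 = 1.5
print(elastic_net_penalty(beta, lam=0.5, alpha=0.0))  # pure L2: 0.5 * 5 = 2.5
```

The prior itself is just exp of the negated penalty, up to normalization.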
Note that the interpolation happens inside the exponential form, so it’s not a mixture of a Gaussian and a Laplace prior, but rather a mixture of their bases. Thus on the log scale used in gradient methods, the prior takes on a pleasantly simple form (differentiable everywhere except where a coefficient is exactly zero):

log p(β | λ, α) = −λ [α‖β‖₁ + (1−α)‖β‖₂²] − log Z

where Z is the usual normalizing constant, here dependent on the variance and mixture parameters.
The later paper divides the L2 component by 2, which gives you a slightly different blend. In general, I suppose you could give the two distributions being blended arbitrary variances λ1 and λ2.
The authors are particularly interested in two situations that are problematic for L1 or L2 alone:
- the number of dimensions is larger than the number of training items
- the features/coefficients are highly correlated
Note that bag-of-words and other lexically based feature representations in natural language classification and tagging are of exactly this nature. In the former case, L1 breaks down, selecting at most as many features as there are training items. In the latter case, L2 tends to select all the features, while L1 arbitrarily selects only one feature from each correlated group. With the elastic net, with most of the weight on L1, you get the benefits of L1’s sparseness and fat tails, with better behavior in the face of correlated variables and large numbers of dimensions.
There’s a beautiful plot in the later paper with prior variance decreasing on the X-axis versus the fitted coefficient values on the Y-axis, with a line for each dimension/feature (and there are lots of non-zero coefficients overlaid on the Gaussian, which never pushes any feature to 0 through regularization).
The Zou and Hastie paper also discusses “bridge regression” (bad pun), which takes an arbitrary Ln penalty for n in the interval [1,2], and thus provides a slightly different way of blending the L1 and L2 priors. The limits at n = 1 and n = 2 are the L1 and L2 priors. Although the gradient is still easy to calculate, the resulting solutions for n > 1 are not sparse for the same reason L2 isn’t sparse: the shrinkage is proportional to a fraction of the current coefficient value.
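A minimal numeric sketch of the bridge penalty (function name and example values are illustrative, not from the paper):

```python
import numpy as np

def bridge_penalty(beta, lam, n):
    """Bridge penalty: lam * sum(|beta_i|^n) for n in [1, 2].

    n = 1 gives the L1 (lasso) penalty, n = 2 the squared-norm L2 (ridge) penalty;
    intermediate n blends the two differently than the elastic net does.
    """
    beta = np.asarray(beta, dtype=float)
    return lam * np.sum(np.abs(beta) ** n)

beta = [1.0, -2.0]
print(bridge_penalty(beta, 1.0, 1.0))  # L1: 3.0
print(bridge_penalty(beta, 1.0, 2.0))  # L2: 5.0
print(bridge_penalty(beta, 1.0, 1.5))  # strictly between the two
```

Unlike the elastic net, which keeps a genuine L1 term and so retains its sparsity-inducing kink at zero, any n > 1 smooths that kink away, which is exactly why bridge solutions aren’t sparse.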
Given the general exponential form, the gradient of the elastic net’s log probability is easy to compute; it just reduces to interpolating between the L1 and L2 gradients. I’ll probably just go ahead and add elastic net priors to LingPipe as implementations of stats.RegressionPrior. That’ll let them plug-and-play in multinomial logistic regression and CRFs.
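Here’s a minimal sketch of that interpolated gradient, assuming the glmnet-style parameterization with α weighting the L1 term, and using sign(0) = 0 as the subgradient choice at the L1 kink:

```python
import numpy as np

def elastic_net_grad(beta, lam, alpha):
    """Gradient of the elastic net penalty lam * (alpha * ||b||_1 + (1 - alpha) * ||b||_2^2):

        lam * (alpha * sign(beta) + 2 * (1 - alpha) * beta)

    i.e., a pointwise interpolation of the L1 gradient (constant-magnitude
    shrinkage) and the L2 gradient (shrinkage proportional to the coefficient).
    """
    beta = np.asarray(beta, dtype=float)
    return lam * (alpha * np.sign(beta) + 2.0 * (1.0 - alpha) * beta)

print(elastic_net_grad([1.0, -2.0, 0.0], lam=0.5, alpha=0.5))
```

At α = 1 the shrinkage has constant magnitude λ regardless of coefficient size, and at α = 0 it’s purely proportional to the coefficient, which is exactly why L2 alone never drives a coefficient all the way to zero.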