Zou and Hastie (2005) Regularization and Variable Selection via the Elastic Net [Prior]


I don’t know how the elastic net prior (aka regularizer) snuck by me unnoticed. It was introduced in:

I was reading about it in the technical report appearing as part of the documentation to their glmnet R package:

The idea’s basically an “interpolation” between the Laplace (L1, lasso) and Gaussian (L2, ridge) priors:

p_{\lambda,\alpha}(\beta) \propto \exp(-\lambda(\alpha|\beta|_2^2 + (1-\alpha)|\beta|_1)

where β is the vector coefficients, λ a factor for prior variance, and α the blending ratio between L2 and L1 priors, themselves represented by the squared L2 norm and (unsquared) L1 norm terms.

Note that the interpolation happens inside the exponential form, so it’s not a mixture of a Gaussian and Laplace prior, but rather a mixture of their bases. Thus on the log scale used in gradient methods, the prior takes on a pleasantly differentiable form:

\log p_{\lambda,\alpha}(\beta) = -\log Z_{\lambda,\alpha} -\lambda(\alpha|\beta|_2^2 + (1-\alpha)|\beta|_1)

where Z is the usual regularization term, here dependent on the variance and mixture parameters.

The later paper divides the L2 component by 2, which gives you a slightly different blend. In general, I suppose you could give the two distributions being blended arbitrary variances λ1 and λ2.

The authors are particularly interested in two situations that are problematic for L1 or L2 alone:

  • the number of dimensions is larger than the number of training items
  • the features/coefficients are highly correlated

Note that bag-of-words and other lexically-based feature representations in natural language classification and tagging are of exactly this nature. In the former case, L1 breaks down with too few features selected. In the latter case, L2 tends to select all the features and L1 only one of the features. With the elastic net, with most of the weight on L1, you get the beneifts of L1’s sparseness and fat tails, with better behavior in the face of correlated variables and large numbers of dimensions.

There’s a beautiful plot in the later paper with prior variance decreasing on the X-axis versus the fitted coefficient values on the Y-axis, with a line for each dimension/feature (and there are lots of non-zero coefficients overlaid on the Gaussian, which never pushes any feature to 0 through regularization):

The Zou and Hastie paper also discusses “bridge regression” (bad pun), which takes an arbitrary Ln penalty for n in the interval [1,2], and thus provides a slightly different way of blending L1 and L2 priors. The limit at 1 and 2 are the L1 and L2 priors. Although the gradient is still easy to calculate, the resulting solutions for n > 1 are not sparse for the same reason L2 isn’t sparse — the shrinkage is proportional to a fraction of the current value.

Given the general exponential form, the gradient of the elastic net’s log probability is easy to compute. It just reduces to interpolating between the L1 and L2 gradients. I’ll probably just go ahead and add elastic net priors to LingPipe as implementations of stats.RegressionPrior. That’ll let them plug-and-play in multinomial logistic regression and CRFs.

8 Responses to “Zou and Hastie (2005) Regularization and Variable Selection via the Elastic Net [Prior]”

  1. Brendan O'Connor Says:

    Well this is exciting. To try out elastic nets on NLP, I was first going to write the sparse-feature-matrixization scripts to load it all into R, but now I could wait for shiny new LingPipe code to do it all at once! Too many options.

  2. lingpipe Says:

    Their implementation looks pretty tight. I found the package in a discussion of performance for regression. It’s written in Fortran, and uses component-wise descent, which apparently works really well (also used in Genkin et al.’s BMR).

    I’ve been hoping to get the time to take glmnet out for a spin myself. I’ve been doing more and more of my results analyses in R these days. What I really need to do is provide nice LingPipe outputs that can be read back into R. The only problem is scaling.

  3. sth4nth Says:

    Although Elastic Net is a good work, I dont like it for the reason that it brings an extra parameter. As a result, one has to spent lots of time on cross-validations to select a good model.
    That’ why I’m becoming a fan of Bayesian methods. I suggest you try the relevance vector machine (RVM) which uses evidence approximation method (also called type-2 maximum likelihood) to select the optimal prior.
    For my problems, it works just fine, though I have not tried it on really large scale data yet.

  4. Dave Lewis Says:

    Have you been able to work out a closed-form expression for Z_lambda,alpha, so that elastic net is a proper prior? I doodled around with that when the paper came out, but wasn’t able to find one.

  5. lingpipe Says:

    I assume you mean proper in the sense of guaranteeing a proper posterior density? I had just assumed because it was a blend of L1 and L2, both of which are proper in this sense, that it’d be proper, too. To say that I’m not exactly a theoretician is an understatement.

  6. Dave Lewis Says:

    Whoops, I overloaded the meaning of “proper” there. :-) To be more precise, what I’m hoping for is a two parameter univariate prior on coefficients that is a proper (integrates to 1) distribution on the reals, symmetric with mean 0, and leading to the same penalty that elastic net gives for a specified lambda and alpha.

    Intuitively it seems like this should exist, and be baby math to figure out, but I haven’t done it. If intuitiion is wrong, here’s some of the possible failure modes:

    1. The univariate distribution might exist, but not have a closed form.

    2. The univariate distribution might exist, but be improper (doesn’t integrate to 1).

    3. The elastic net solution might not correspond to the Bayesian MAP solution under any product of univariate prior on coefficients.

    4. The elastic net solution might not correspond to the Bayesian MAP solution under any multivariate prior on coefficients.

    The lack of a Bayesian MAP interpretation of course wouldn’t change the happy fact that that the penalized likelihood with elastic net is nicely behaved: convex, single global optimum at a sparse solution (if alpha > 0), etc.

  7. lingpipe Says:

    That’s the same thing I meant by proper. I never thought about the closed form because the gradient of the log’s so easy to compute (the log cancels the exponent, and even I can compute the derivative of a polynomial without looking anything up).

    If $\beta$ is unidimensional, we’d have a closed form if we could analytically evaluate

    Z_{\lambda,\alpha} = \int_{-\infty}^{\infty} \exp(-\lambda(\alpha \ \beta^2 + (1-\alpha)|\beta|)) \ d\beta  =2 \int_{0}^{\infty} \exp(-\lambda\alpha\beta^2 - \lambda(1-\alpha)\beta) \ d\beta.

    Nope, don’t know how to integrate that.

  8. Dave Lewis Says:

    I’ve asked over at:


    Annoyingly, they wouldn’t allow the arXiv.org tags for computational linguistics and machine learning, only ones from the math set.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s