What kind of prior should you use for logistic regression (aka maximum entropy) coefficients? The common choices are the Laplace prior (aka L1 regularization, aka double exponential prior, aka the lasso), the Gaussian prior (aka L2 regularization, aka normal, aka ridge regression), and more recently, the Cauchy prior (aka Lorentz, aka Student-t with one degree of freedom).

These are all symmetric priors, and we typically use zero mean distributions as priors (or zero median in the Cauchy, which has no mean because the integral diverges). The effect is that there’s a penalty for larger coefficients, or looked at the other way, the priors shrink the parameters.

[paragraph revised June 13, 2008] The main difference between the priors is how fat their tails are. For a given variance, the Laplace prior has more mass around zero and fatter tails than the Gaussian; the Cauchy has very thick tails compared to the Gaussian.

There’s an interesting paper by Goodman, Exponential Priors for Maximum Entropy Models, for which Microsoft received a blatantly ridiculous patent. Maybe we’ll get a nastygram from Redmond for using a textbook technique. Like the stats textbooks, Goodman notes that the Laplace prior likes to shrink coefficients to zero, effectively performing a kind of Bayesian feature selection.

There’s also a more recent offering by Genkin, Lewis and Madigan, Large-scale Bayesian logistic regression for text categorization, which covers both L1 and L2 regularization, which also comes with an open-source implementation for the multinomial case. This paper does a good job of comparing regularized logistic regression to SVM baselines. Dave Lewis has been a great help in understanding the math and algorithms behind logistic regression, particularly the one-parameter vector vs. two-parameter vector case, and the implications for priors and offsets in sparse computations.

There’s an interesting paper by Gelman, Jakulin, Su and Pittau, A Default Prior Distribution for Logistic and Other Regression Models, in which they argue that after centering and variance adjusting inputs, the Cauchy prior is a good general purpose prior. They evaluate on a range of binary classification problems.

I just had coffee with Andrew Gelman after guest-lecturing in his stat computation class, and we had a chat about regularization and about discriminitive versus generative models. He wasn’t happy with Laplace priors being applied willy-nilly, suggesting instead that feature selection be done separately from feature estimation. Andrew also surprised me in that he thinks of logistic regression as a generative model (“of course it’s generative, it’s Bayesian”, I believe he said); but for me, this is odd, because the data’s “generated” trivially. He was keen on my trying out some kind of posterior variance fiddling to move from a maximum a posteriori form of reasoning to a more Bayesian one. The only problem is that I just can’t invert the million by million matrix needed for the posterior variance estimates.

The issue here is a culture conflict between “machine learning” and “statistics”. If you read Hill and Gelman’s book on regression, it’s study after study of small dimensionality problems with dense data, not much data, and carefully engineered features and interaction features. Natural language problems, like most machine learning problems, involve millions of features and very sparse vectors. Maybe different techniques work better for quantitatively different problems.

July 28, 2009 at 9:30 pm |

Hello,

Interesting discussion. I’m reading your post more than a year late, but it occurred to me to comment on your last paragraph that there are good inversion techniques for huge sparse matrices. Some are used in Bayesian methods, unsurprisingly, for variance estimation in Gaussian processes.

If this post is still alive, then I’d be happy to cite some algorithms. (Well, Krylov space ones are particularly appealing.)

All the best.

July 29, 2009 at 10:27 am |

I hope the posts are all still alive. Cite away. Quasi-Newton methods like L-BFGS are popular in computational linguistics, but they forego direct computational of the inverted matrix of Hessians (second partial derivatives of log loss w.r.t. parameters).

Just this past weekend, I was reading about Radford Neal’s Hamiltonial Monte Carlo in MacKay’s

Information Theory, Inference, and Learning Algorithms, Section 41.4 (I love this book’s frank and fun discussions of methods). Not surprisingly, given the name, it uses sampling to estimate posterior (co)variance.July 29, 2009 at 10:30 am |

And speaking of fat-tailed priors, I also just read about Friedman, Hastie and Tibshirani’s elastic net prior (mixture of Gaussian/L2 and Laplace/L1) that when weighted toward L1 supposedly gives you the sparseness benefits of L1 with the estimation stability of L2. I’m about to blog about that and implement it for LingPipe.

July 29, 2009 at 3:37 pm |

Great, this still works.

> methods like L-BFGS are popular in computational linguistics

That’s the kind of thing I was going to suggest, so I’m glad I asked.

> they forego direct computational of the inverted matrix of Hessians

Suddenly I’m worried that I misunderstood and that you need the full inverted matrix rather than its application. If so, my apologies, and I have nothing new to suggest.

In any case, since you are a fan of MacKay, you might like this Krylov space technique by Mark Gibbs, a student of his:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.8224

which handles the termination criterion nicely. I’ve used it successfully on sparse problems to optimize hyperparameters. (The variance, among other things.) You could modify it to generate (approximate) samples from the posterior. However, I love Monte Carlo too, and that’s probably the way to go.

I’ve also considered using L1 priors, but shied away from it because I wasn’t sure what the corresponding generative model or maximum entropy principle is.

Looking forward to your post on elastic net prior.

July 29, 2009 at 4:43 pm |

I don’t actually need to compute the matrix inverses.

Thanks for the ref — I hope it defines Krylov spaces!

Maybe I’m not understanding the issue, but for L1 priors, the generative model’s the same, only with a Laplace prior in place of the Gaussian. You can use any probability distribution over vectors at this point in the model.

I have no idea if you could set it up as an entropic prior.

Elastic net post coming up tomorrow.