I (Bob) have been working on logistic regression. In particular, multinomial logistic regression with a range of different priors and a sparse regularized stochastic gradient descent optimizer. It’ll be out in LingPipe 3.5 any day now.

One of the problems I had is the diverse literature around regularized logistic regression and the sheer volume of terms used to refer to it and its properties. Ditto for the stochastic gradient and other optimization literature. Here’s the “also known as” section from the forthcoming LingPipe 3.5 Javadoc (we’re our own search engine optimization providers):

Multinomial logistic regression is also known as polytomous, polychotomous, or multi-class logistic regression, or just multilogit regression. Binary logistic regression is an instance of a generalized linear model (GLM) with the logit link function.

The logit function is the inverse of the logistic function, and the logistic function is sometimes called the sigmoid function or the s-curve.

Logistic regression estimation obeys the maximum entropy principle, and thus logistic regression is sometimes called “maximum entropy modeling”, and the resulting classifier the “maximum entropy classifier”.

The generalization of binomial logistic regression to multinomial logistic regression is sometimes called a softmax or exponential model.

Maximum a priori (MAP) estimation with Gaussian priors is often referred to as “ridge regression”; with Laplace priors MAP estimation is known as the “lasso”. MAP estimation with Gaussian, Laplace or Cauchy priors is known as parameter shrinkage. Gaussian and Laplace are forms of regularized regression, with the Gaussian version being regularized with the L

_{2}norm (Euclidean distance, called the Frobenius norm for matrices of parameters) and the Laplace version being regularized with the L_{1}norm (taxicab distance or Manhattan metric); other Minkowski metrics may be used for shrinkage.Binary logistic regression is equivalent to a one-layer, single-output neural network with a logistic (sigmoid) activation function trained under log loss. This is sometimes called classification with a single neuron.

I wrote up everything I learned in a white paper, Lazy Sparse Stochastic Gradient Descent for Regularized Multinomial Logistic Regression. The result was a slew of algebra reminiscent of Dan Klein and Chris Manning’s max-ent tutorial, but with more general regularization, a different (k-1)-vector parameterization, and a different optimization scheme.

I added a new white papers section to the blog to hold it. I made a minor original contribution: a stochastic gradient descent algorithm with regularization that fully preserves both sparseness and the stocahsticity of regularization.

The amount of math I had to do to put the pieces back together took me straight back to grad school. My parents are visiting and asked me last night why there was an enormous pile of math books on the table (Strang on matrices [I couldn’t remember what positive definite means], a calc textbook [I couldn’t even remember how to differentiate logs and exponentials], Larsen and Marx (or was it DeGroot and Schervish?) [need the basic density definitions and properties], Gelman et al.’s Bayesian Data Analysis and Regression books, rounded off with Bishop’s, MacKay’s, and Hastie et al.’s trio of machine learning books, Cover and Thomas’s info theory book, and Manning and Schuetze for good measure). The answer is that they all contained pieces of the puzzle.

I work out the full error function under maximum likelihood and Gaussian, Laplace and Cauchy priors, derive the gradients and Hessians, and show how to build it all into a stochastic gradient descent solver. I show the links between logistic regression and max entropy, work out the impact of feature functions, exponential bases, binary features, input centering and variance normalization, and even kernel methods. I have to admit I skipped full duality (but did talk about kernelization), didn’t work out the Hessian positive definiteness proof for error function is concavity, and don’t cover any of the convergence proofs for SGD. But there are lots of references to books, papers, tutorials and software packages.

During this process I felt like the great Nicolai Ivanovich Lobachevsky as satirized in the Tom Lehrer song Lobachevsky.

The unit tests are now passing. I’d like to thank Dave Lewis for help with test cases and understanding some of the math. Dave and I were undergrad math majors at Michigan State together, and now find ourselves doing logistic regression for NLP; he’s contributing to the BMR: Bayesian Multinomial Regression Software package (check out their cool use of priors for IDF-like term weighting).

March 26, 2009 at 11:08 am |

Hello Bob,

Could you give a few words on how the Laplace prior works on LingPipe? You mention you use gradient descent, but Laplace isn’t differentiable at every point? How did you work it out?

Thanks!

March 26, 2009 at 5:22 pm |

I can do better than that and give you a reference with theoretical and practical evaluations! But first, the answer is simple: it’s differentiable everywhere but at 0. If the feature val’s 0, there’s no need to regularize toward 0, so that non-differentiability isn’t a problem in practice. The trick is that if the regularization gradient moves a feature past zero (that is, changes its sign), truncate the move at zero.

It turns out that I rediscovered the “truncated stochastic gradient” method of Langford, Li and Zhang, for which there’s a NIPS and Arxiv paper:

http://arxiv.org/abs/0806.4686

I thought I was just implementing the industry standard stochastic gradient for Laplace. In fact, I thought I was borrowing the idea from Genkin, Lewis and Madigan, but they said they were truncating Laplaces for other reasons, but that the truncated gradient idea surfaced even earlier in Zhang and Oles’ IR Journal paper:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.5553

I called it “clipped regularization” in the white paper (section 10.6).

May 15, 2010 at 1:47 pm |

I was wondering if you have any references for section G: Generalized Feature Functions? I’m interested in the setting where \phi(x,c)’s dimension depends on c, different classes have different numbers of examples and varying levels of difficulty, and how those factors interact with the regularization priors used for each \beta_c.

Cheers.

May 16, 2010 at 3:14 am |

Check out the Stanford NLP implementation page, which also points to a tutorial:

http://nlp.stanford.edu/software/classifier.shtml

Like most applications in natural language processing, they allow different numbers of features per category.

The priors work exactly the same way, with one per dimension.

Although they don’t allow features varying by category, you can check out the BMR package and related papers for some uses of priors for natural language that vary by dimension:

http://www.bayesianregression.org/

For much earlier refs with applications in economics, check out my blog post:

https://lingpipe-blog.com/2010/01/12/nobel-memorial-prize-for-logistic-regression-aka-discrete-choice-analysis/