Nobel (Memorial) Prize for Logistic Regression (aka Discrete Choice Analysis)

I’ve been meaning to blog on this topic ever since Dave Lewis was in town giving a talk to the Columbia stat student seminar. He was talking about one of his favorite topics, (Bayesian) logistic regression. It also fits in with (another) one of my favorite topics, the scientific zeitgeist. As Andrew Gelman once said (and I paraphrase), no matter what model you come up with, someone in psychometrics thought it up 50 years ago (or maybe it was econometrics).

Multinomial Regression

For simplicity, we’ll stick to the simple multinomial logistic regression case, where we have K possible output categories. Our training data then consists of N input/output pairs (x_n,y_n), where the input x_n is any old object and the output category y_n \in \{1,2,\ldots,K\}.

“Max Ent” Feature Extraction

In the maximum entropy style presentations, there is a feature function \phi from pairs of inputs and output categories to M-dimensional feature vectors, so \phi(x,y) \in \mathbb{R}^M for 1 \leq y \leq K.

The model consists of (the feature extractors and) a single coefficient vector \beta \in \mathbb{R}^M. The probability that an input x is categorized as category y \in \{1,\ldots,K\} is then

p(y|x) = \exp(\beta^{\top} \phi(x,y)) / Z(x) \propto \exp(\beta^{\top}\phi(x,y)),

where, as usual, the normalizing constant is computed by summation:

Z(x) = \sum_{y=1}^K \exp(\beta^{\top} \phi(x,y)).

The thing to notice here is that the same feature position can be non-zero for the same input under two different output categories, which means those categories share the corresponding coefficient.
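
To make this concrete, here’s a minimal Python sketch of the max ent formulation. The feature function phi, the coefficients, and the input are all made up for illustration; none of this corresponds to LingPipe’s API.

```python
import math

# Toy max-ent-style feature function: the same feature position can be
# non-zero for the same input under more than one output category
# (the first feature is shared by categories 1 and 2).
def phi(x, y):
    return [1.0 if y in (1, 2) else 0.0,
            x if y == 1 else 0.0,
            x if y == 3 else 0.0]

def maxent_probs(beta, x, K):
    """p(y|x) = exp(beta . phi(x, y)) / Z(x) for y in 1..K."""
    scores = [sum(b * f for b, f in zip(beta, phi(x, y)))
              for y in range(1, K + 1)]
    z = sum(math.exp(s) for s in scores)    # Z(x)
    return [math.exp(s) / z for s in scores]

beta = [0.5, 1.2, -0.3]                     # single coefficient vector, M = 3
print(maxent_probs(beta, x=2.0, K=3))       # probabilities sum to 1
```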

Stats Textbook Feature Extraction

If you pull out a statistics textbook, you’ll see feature extraction for multinomial logistic regression presented rather differently. In this setting, we have a feature extractor \psi that works only on inputs, producing a J-dimensional feature vector that does not depend on the output category, \psi(x) \in \mathbb{R}^J. The model then consists of a J-dimensional coefficient vector \beta_k for each output category k. The category probability is then defined by

p(y |x) = \exp(\beta_y^{\top}\psi(x)) / Z'(x) \propto \exp(\beta_y^{\top}\psi(x)),

with normalizer computed in the usual way:

Z'(x) = \sum_{y=1}^K \exp(\beta_y^{\top}\psi(x)).

Here, the coefficient vector for each output category is distinct, so there’s no (direct) way for the same input to produce the same coefficient-weighted feature for two different categories.
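
Here’s the same kind of sketch for the stats textbook formulation, again with a made-up psi and made-up numbers; the only structural change is that the coefficient vector is indexed by category rather than the feature vector.

```python
import math

def psi(x):
    """Input-only feature extractor: psi(x) does not depend on the category."""
    return [1.0, x]                          # intercept plus one feature, J = 2

def textbook_probs(betas, x):
    """p(y|x) = exp(beta_y . psi(x)) / Z'(x), one coefficient vector per category."""
    feats = psi(x)
    scores = [sum(b * f for b, f in zip(beta_k, feats)) for beta_k in betas]
    z = sum(math.exp(s) for s in scores)     # Z'(x)
    return [math.exp(s) / z for s in scores]

betas = [[0.3, -1.0],                        # beta_1
         [-0.5, 0.7],                        # beta_2
         [0.0, 0.0]]                         # beta_3, pegged to zero (see below)
print(textbook_probs(betas, x=2.0))          # probabilities sum to 1
```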

LingPipe’s Approach

LingPipe adopts the stats textbook approach for both logistic regression and CRFs, mainly because it’s more efficient and I couldn’t find good enough reasons to go with the max ent approach. I tried to squeeze examples out of people at conferences, via e-mail, over meals, and when I blogged about label-specific features for CRFs!

As is also the case in most stats textbook presentations, one of the output category coefficient vectors can be set to zero without loss of generality (for inference). It’s actually required to identify a unique solution, because we can always add or subtract the same vector from all of the coefficient vectors without changing the predictions.

With priors on the coefficients, pegging one category to zero changes the result, so it makes most sense to do this transformation after fitting. LingPipe does it before fitting, which isn’t the way I’d do it if I had a do-over. On the other hand, it works well for binary classification problems, where everyone uses a single coefficient vector (though again, results vary with Gaussian priors because of their non-linearity).
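
Here’s a quick numeric check of the identifiability point, with made-up coefficients: subtracting the last category’s vector from every \beta_k pegs it to zero and leaves the predicted probabilities unchanged. (This is the no-prior case; as noted above, priors break the equivalence.)

```python
import math

def softmax(scores):
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

psi_x = [1.0, 2.0]                                   # psi(x) for some input x
betas = [[0.3, -1.0], [-0.5, 0.7], [0.2, 0.1]]       # one vector per category
shift = betas[-1]                                    # subtract beta_K from every vector
pegged = [[b - s for b, s in zip(beta_k, shift)] for beta_k in betas]

p_orig = softmax([sum(b * f for b, f in zip(beta_k, psi_x)) for beta_k in betas])
p_pegd = softmax([sum(b * f for b, f in zip(beta_k, psi_x)) for beta_k in pegged])
assert pegged[-1] == [0.0, 0.0]                      # last category pegged to zero
assert all(abs(a - b) < 1e-12 for a, b in zip(p_orig, p_pegd))
```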

Stats Text to Max Ent Conversion

The conversion from the stats textbook version to the max ent version is simple. We just define \beta to be the concatenation of the \beta_k, setting

\beta = (\beta_1, \ldots, \beta_K).

For the feature function, we just define it to only set non-zero values in the appropriate range, with

\phi(x,k) = \mathbf{0}^{k-1} \ \ \psi(x) \ \ \mathbf{0}^{K-k},

where we take \mathbf{0} to be the J-dimensional zero vector and \mathbf{0}^{k-1} to be k-1 copies of the zero vector concatenated, so that \phi(x,k) has dimension M = K \cdot J.

We get the same result after the translation, because

\beta_k^{\top} \psi(x) = \beta^{\top} \phi(x,k).
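
A small sketch of the conversion, with made-up sizes and numbers, checking the identity above for each category k:

```python
psi_x = [1.0, 2.0, -1.0]                     # J = 3 features from psi(x)
betas = [[0.1, 0.2, 0.3],                    # K = 3 per-category coefficient vectors
         [0.0, -0.5, 1.0],
         [0.4, 0.0, -0.2]]
K, J = len(betas), len(psi_x)

beta = [b for beta_k in betas for b in beta_k]       # beta = (beta_1, ..., beta_K)

def phi(psi_x, k):
    """Zeros everywhere except psi(x) in the k-th block, for k in 1..K."""
    return [0.0] * (J * (k - 1)) + psi_x + [0.0] * (J * (K - k))

for k in range(1, K + 1):
    lhs = sum(b * f for b, f in zip(betas[k - 1], psi_x))   # beta_k . psi(x)
    rhs = sum(b * f for b, f in zip(beta, phi(psi_x, k)))   # beta . phi(x, k)
    assert abs(lhs - rhs) < 1e-12
```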

Max Ent to Stats Text Conversion

Uh oh. This isn’t so easy. The problem is that in the max ent presentation, the same feature can be generated for several different categories, so those categories share its coefficient. In the stats textbook presentation, each feature implicitly gets a separate coefficient for each category, because the coefficient vectors themselves are separated.

The solution is to allow tying of coefficients in the stats textbook model. That is, we need to be able to enforce \beta_{k,j} = \beta_{k',j'} for arbitrary pairs of coefficient positions, and then put the shared features in the tied positions. (It’s hard to write out as a formula, but hopefully the idea’s clear.)
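
One concrete way to express the tying is with an index map from coefficient positions (k, j) to shared parameters. The following sketch is purely illustrative (the tie map, parameters, and features are hypothetical, not LingPipe’s representation); two positions in the first feature column point at the same parameter, mimicking a max ent feature that fires identically for categories 1 and 2.

```python
import math

J, K = 2, 3
# tie_map[k][j] is the index of the shared parameter behind coefficient
# position (k, j); rows are categories, columns are features (0-based here).
tie_map = [[0, 1],
           [0, 2],
           [3, 4]]
theta = [0.5, 1.2, -0.3, 0.8, 0.1]           # shared parameter vector

def tied_probs(psi_x):
    scores = [sum(theta[tie_map[k][j]] * psi_x[j] for j in range(J))
              for k in range(K)]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

print(tied_probs([1.0, 2.0]))                # categories 1 and 2 share the first coefficient
```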

Worth a Nobel (Memorial) Prize in Econ?

Daniel McFadden was awarded (half the) 2000 Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel for work he’d done on discrete choice analysis (DCA).

As you may have guessed from the lead in, DCA’s just the term researchers in econometrics use for logistic regression. As Dave pointed out in his talk, they tend to use it in the same way as computational linguists working in the max ent tradition, with lots of feature tying.

McFadden lists his 1974 paper on logit as his first econometrics publication on his publications page (he’s still teaching intro stats at Berkeley!). The key definition is formula (12) on page 110, where he gives the probability function for multinomial logistic regression using max-ent style feature extraction.

It’s clear from McFadden’s extensive references that statisticians had been fitting binomial logistic regressions since 1950 and multinomials since 1960.

Perhaps not coincidentally, McFadden was working on DCA about the same time Nelder and Wedderburn introduced generalized linear models (GLM), which generalize logistic regression to other link functions.

4 Responses to “Nobel (Memorial) Prize for Logistic Regression (aka Discrete Choice Analysis)”

  1. Ken Says:

    One aspect of the paper is that it is unnecessarily technical, like much of econometrics. Maybe that is because most of the assumptions are usually rubbish and there aren’t a lot of applications, which doesn’t stop econometricians from applying it. Sort of “look at the numbers, aren’t they wonderful, and I’ll make sure that you don’t understand the theory”. In this case the theory seems more to hide something that is really simple.

    Contrast this to medical statistics, where it has generally been important to have models that are understandable, and papers are generally readable with the technical aspects relegated to an appendix. Some of the people writing these papers are not lacking in technical skills.

    • lingpipe Says:

      I don’t know enough about the context of this early work or econometrics as a whole to judge the appropriateness of its level of technicality.

      For my purposes at least, most of NIPS, for example, is too theoretical. I like the models but would prefer to see more emphasis on application and less on optimization and asymptotics. Then ACL is the opposite, and I’m always hankering for more theory! Maybe that’s why I like longer papers that have both.

      When writing, I find it difficult to decide how much to assume about my audience’s background. I pretty much have to lay that out in my mind at the start or I get totally lost when it’s time for the first formula.

      I find it much easier to write if I can assume my audience has a good background in what I’m writing about. But the more you assume, the more people you leave behind.

      If the audience is technically skilled enough, technical papers can actually be easier to read than applied ones, because they’re going over known ground and concepts.

      I think the technical versus applied balance is also partly a matter of taste. Some people (like me), really like reading theory and technical results. What I like about a lot of applied stats modeling work is its attention to empirical data. I find it helps if I know a bit about the theory.

  2. alex Says:

    Not only was logit used in DCA; part of the contribution is the random utility framework, which gives it all a nice theoretical interpretation.

    • lingpipe Says:

      Thanks for the extra information. I was assuming the big innovation wasn’t the technical statistical apparatus but the economics application.

      Most of the linguistic applications of models implicitly assume a utility framework for evaluation. It all comes down to decision theory in the end if you’re going to put one of these models to use!
