I’ve been meaning to blog on this topic ever since Dave Lewis was in town giving a talk to the Columbia stat student seminar. He was talking about one of his favorite topics, (Bayesian) logistic regression. It also fits in with (another) one of my favorite topics, the scientific zeitgeist. As Andrew Gelman once said (and I paraphrase), no matter what model you come up with, someone in psychometrics thought it up 50 years ago (or maybe it was econometrics).
For simplicity, we’ll stick to the simple multinomial regression case where we have $K$ possible output categories. Our training data then consists of input/output pairs $(x_n, c_n)$, where the input $x_n$ is any old object and the output category $c_n \in \{1, \ldots, K\}$.
“Max Ent” Feature Extraction
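In the maximum entropy style presentations, there is a feature function $\phi$ from pairs of inputs and output categories to $D$-dimensional feature vectors, so $\phi(x, c) \in \mathbb{R}^D$ for $c \in \{1, \ldots, K\}$.
The model consists of (the feature extractors and) a single coefficient vector $\beta \in \mathbb{R}^D$. The probability that an input $x$ is categorized as category $c$ is then
$$p(c \mid x) = \frac{\exp(\beta^{\top} \phi(x, c))}{Z(x)},$$
where, as usual, the normalizing constant is computed by summation:
$$Z(x) = \sum_{c'=1}^{K} \exp(\beta^{\top} \phi(x, c')).$$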
The thing to notice here is that a feature in the same position can be produced by the same input in two different categories.
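Here’s a minimal sketch of the max ent parameterization in plain Python. The feature function `phi`, the coefficient values in `beta`, and the example input are all made up for illustration; the only thing taken from the text is the shape of the model: one coefficient vector, and a feature function that sees both the input and the candidate category.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Exponentiate and normalize; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)  # the normalizing constant (up to the shift by m)
    return [e / z for e in exps]

def maxent_probs(phi, beta, x, num_categories):
    """Return [p(c | x) for c = 1..K] with a single coefficient vector."""
    scores = [dot(beta, phi(x, c)) for c in range(1, num_categories + 1)]
    return softmax(scores)

# Hypothetical feature function: position 0 fires for BOTH categories 1 and 2
# when the input contains "the"; position 1 is the token count, category 3 only.
def phi(x, c):
    tokens = x.split()
    has_the = 1.0 if "the" in tokens else 0.0
    return [
        has_the if c in (1, 2) else 0.0,
        float(len(tokens)) if c == 3 else 0.0,
    ]

beta = [0.5, -0.1]  # made-up coefficients
probs = maxent_probs(phi, beta, "the cat sat", 3)
```

Because position 0 of `phi` fires for categories 1 and 2 alike, those two categories share a coefficient and end up with the same probability for this input.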
Stats Textbook Feature Extraction
If you pull out a statistics textbook, you’ll see feature extraction for multinomial logistic regression presented rather differently. In this setting, we have a feature extractor $\psi$ that works only on inputs, producing an $M$-dimensional feature vector that does not depend on the output category, $\psi(x) \in \mathbb{R}^M$. The model then consists of an $M$-dimensional coefficient vector $\gamma_c$ for each output category $c$. The category probability is then defined by
$$p(c \mid x) = \frac{\exp(\gamma_c^{\top} \psi(x))}{Z(x)},$$
with normalizer computed in the usual way:
$$Z(x) = \sum_{c'=1}^{K} \exp(\gamma_{c'}^{\top} \psi(x)).$$
Here, the coefficient vector $\gamma_c$ for each output category is distinct, so there’s no (direct) way to produce the same feature (after multiplication with coefficient) from the same input for two different categories.
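For comparison, here’s the same kind of sketch for the stats textbook parameterization. Again, the feature extractor `psi` and the numbers in `gammas` are hypothetical; what matters is that features depend only on the input and each category gets its own coefficient vector.

```python
import math

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Exponentiate and normalize; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def stats_probs(psi, gammas, x):
    """Return [p(c | x) for c = 1..K] with one coefficient vector per category."""
    feats = psi(x)  # extracted once, shared by every category
    return softmax([dot(gamma, feats) for gamma in gammas])

# Hypothetical input-only features: an intercept and the token count.
def psi(x):
    return [1.0, float(len(x.split()))]

# One made-up coefficient vector per category (K = 3 here).
gammas = [[0.2, 0.1], [-0.3, 0.4], [0.0, 0.0]]
probs = stats_probs(psi, gammas, "the cat sat")
```

Note that the features are computed once per input rather than once per input/category pair, which is where the efficiency advantage of this encoding comes from.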
LingPipe adopts the stats textbook approach for both logistic regression and CRFs, mainly because it’s more efficient and I couldn’t find good enough reasons to go with the max ent approach. I tried to squeeze examples out of people at conferences, via e-mail, over meals, and when I blogged about label-specific features for CRFs!
As is also the case in most stats textbook presentations, without loss of generality (for inference), one of the output category coefficient vectors can be set to zero. It’s actually required to identify a unique solution, because we can always add the same vector to (or subtract it from) every category’s coefficient vector without changing the predictions.
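The invariance is easy to check numerically. The sketch below subtracts the last category’s coefficient vector from every category’s vector, which pegs the last one to zero, and confirms the category probabilities don’t move. The coefficient and feature values are arbitrary, and `peg_last_to_zero` is just a helper name made up for this example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def category_probs(gammas, feats):
    """Stats-textbook category probabilities for one feature vector."""
    return softmax([dot(gamma, feats) for gamma in gammas])

def peg_last_to_zero(gammas):
    """Subtract the last category's vector from every category's vector."""
    ref = gammas[-1]
    return [[g - r for g, r in zip(gamma, ref)] for gamma in gammas]

gammas = [[0.2, 0.1], [-0.3, 0.4], [0.5, -0.2]]  # arbitrary coefficients
feats = [1.0, 3.0]                               # arbitrary input features

pegged = peg_last_to_zero(gammas)
before = category_probs(gammas, feats)
after = category_probs(pegged, feats)
```

Every score shifts by the same amount, so the shift cancels between numerator and normalizer, and `before` and `after` agree up to floating-point rounding.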
With priors on the coefficients, you can get a different result pegging one category to zero, so it makes most sense to do this transformation after fitting. LingPipe does it before fitting, which isn’t the way I’d do it if I had a do-over. On the other hand, it works well for binary classification problems when everyone uses one coefficient vector (though again, results vary with Gaussian priors because of their non-linearity).
Stats Text to Max Ent Conversion
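The conversion from the stats textbook version to the max ent version is simple. We just define the max ent coefficient vector $\beta$ to be the concatenation of the per-category vectors $\gamma_c$, setting
$$\beta = (\gamma_1, \ldots, \gamma_K).$$
For the feature function, we just define it to only set non-zero values in the appropriate range, with
$$\phi(x, c) = (\mathbf{0}^{c-1}, \psi(x), \mathbf{0}^{K-c}),$$
where we take $\mathbf{0}$ to be the $M$-dimensional zero vector (the same dimension as $\psi(x)$), and $\mathbf{0}^{n}$ to be $n$ instances of the zero vector concatenated.
We get the same result after the translation, because
$$\beta^{\top} \phi(x, c) = \gamma_c^{\top} \psi(x).$$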
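The translation can be sketched in a few lines of Python. The helper names (`concat_coefficients`, `make_phi`), the feature extractor `psi`, and the coefficient values are all hypothetical; the code just pads the input features with zeros so they line up with the right block of the concatenated coefficient vector.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def concat_coefficients(gammas):
    """Build the single max ent vector by concatenating per-category vectors."""
    return [g for gamma in gammas for g in gamma]

def make_phi(psi, num_categories, dim):
    """Wrap an input-only extractor as a max ent style feature function."""
    def phi(x, c):
        # (c - 1) zero blocks, then the input features, then (K - c) zero blocks
        left = [0.0] * ((c - 1) * dim)
        right = [0.0] * ((num_categories - c) * dim)
        return left + psi(x) + right
    return phi

# Hypothetical input-only features: an intercept and the token count.
def psi(x):
    return [1.0, float(len(x.split()))]

gammas = [[0.2, 0.1], [-0.3, 0.4], [0.5, -0.2]]  # made-up coefficients
beta = concat_coefficients(gammas)
phi = make_phi(psi, num_categories=3, dim=2)

x = "the cat sat"
maxent_scores = [dot(beta, phi(x, c)) for c in (1, 2, 3)]
stats_scores = [dot(gamma, psi(x)) for gamma in gammas]
```

The two score lists match category by category, so the softmax probabilities match too. The price is a feature vector that is $K$ times longer and mostly zeros, which a real implementation would want to store sparsely.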
Max Ent to Stats Text Conversion
Uh oh. This isn’t so easy. The problem is that in the max ent presentation, we can have the same feature generated for several different categories. In the stats textbook presentation, each feature is implicitly separated because the coefficients are separated.
The solution is to allow tying of coefficients in the stats textbook model. That is, we need to be able to enforce $\gamma_{c,i} = \gamma_{c',j}$ for arbitrary pairs of coefficient positions. Then, we just put the tied features there. (It’s hard to write out as a formula, but hopefully the idea’s clear.)
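One hypothetical way to make the tying concrete (a sketch of one possible construction, not necessarily the one intended above): let the input-only extractor emit the max ent features for every category, concatenated, and let each category’s coefficient vector hold a copy of the single max ent coefficient vector in its own block. Note that this construction also needs the remaining positions fixed at zero, which goes slightly beyond tying. The names `make_psi` and `tied_gammas` and the toy `phi` are made up for the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def make_psi(phi, num_categories):
    """Input-only extractor: max ent features for all categories, concatenated."""
    def psi(x):
        return [f for c in range(1, num_categories + 1) for f in phi(x, c)]
    return psi

def tied_gammas(beta, num_categories):
    """Per-category vectors whose own block is tied to beta, zero elsewhere."""
    dim = len(beta)
    gammas = []
    for c in range(1, num_categories + 1):
        gamma = [0.0] * (num_categories * dim)  # positions fixed at zero
        start = (c - 1) * dim
        gamma[start:start + dim] = beta         # tied: same values in every block
        gammas.append(gamma)
    return gammas

# Hypothetical max ent feature function; position 0 is shared by categories 1 and 2.
def phi(x, c):
    tokens = x.split()
    has_the = 1.0 if "the" in tokens else 0.0
    return [
        has_the if c in (1, 2) else 0.0,
        float(len(tokens)) if c == 3 else 0.0,
    ]

beta = [0.5, -0.1]  # made-up coefficients
psi = make_psi(phi, 3)
gammas = tied_gammas(beta, 3)

x = "the cat sat"
stats_scores = [dot(gamma, psi(x)) for gamma in gammas]
maxent_scores = [dot(beta, phi(x, c)) for c in (1, 2, 3)]
```

Here the tying shows up as copies of `beta`; in an actual fitting routine the tied positions would have to share one underlying parameter so that they stay equal as the estimates are updated.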
Worth a Nobel (Memorial) Prize in Econ?
Daniel McFadden was awarded (half the) 2000 Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel for work he’d done on discrete choice analysis (DCA).
As you may have guessed from the lead in, DCA’s just the term researchers in econometrics use for logistic regression. As Dave pointed out in his talk, they tend to use it in the same way as computational linguists working in the max ent tradition, with lots of feature tying.
McFadden lists his 1974 paper on logit as his first econometrics publication on his list of publications (he’s still teaching intro stats at Berkeley!). The key definition is formula (12) on page 110, where he gives the probability function for multinomial logistic regression and uses max-ent style feature extraction.
It’s clear from McFadden’s extensive references that statisticians had been fitting binomial logistic regressions since 1950 and multinomials since 1960.
Perhaps not coincidentally, McFadden was working on DCA about the same time Nelder and Wedderburn introduced generalized linear models (GLM), which generalize logistic regression to other link functions.