I’m designing a conditional random field (CRF) package for LingPipe, so I’ve been reading through the literature on CRFs, most of which is expressed in terms of maximum entropy. I can’t recommend Ryan McDonald’s tutorial slides highly enough:

- McDonald, Ryan. 2007. Generalized Linear Classifiers in NLP. Course presented to the Swedish Graduate School in Language Technology.

Among other things, Ryan shows how the maximum entropy estimate subject to expectation-matching constraints yields the same unique solution as the maximum likelihood estimate (the derivative of the error shows that the maximum likelihood estimate matches expectations). But I’m not going to harp on terminology or philosophy today; instead, I’ll focus on the more immediate practical problems of feature extraction and coefficient encoding.

### K vs. K-1 Coefficient Vectors?

Consider multinomial classifiers with K outcome categories c[1],…,c[K] and n-dimensional vector inputs x. In logistic regression, as typically done in stats, you have K-1 n-dimensional coefficient vectors β[1], …, β[K-1] and the probability distribution over categories given an input vector x is defined so that p(c[k]|x) is proportional to exp(β[k] * x) for 1 <= k <= K-1, and p(c[K]|x) is proportional to exp(0 * x) = 1.
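To make the K-1 encoding concrete, here’s a minimal Python sketch (the function name and representation are mine for illustration, not LingPipe’s API):

```python
import math

def predict_k_minus_1(betas, x):
    """Class probabilities with K-1 coefficient vectors; the K-th
    coefficient vector is implicitly pinned to all zeros."""
    dot = lambda b, v: sum(bi * vi for bi, vi in zip(b, v))
    # Unnormalized scores: exp(beta[k] . x) for k < K, and exp(0) = 1 for k = K.
    scores = [math.exp(dot(b, x)) for b in betas] + [1.0]
    z = sum(scores)
    return [s / z for s in scores]

# Binary case (K = 2, n = 1) with beta[1] = 1.0:
probs = predict_k_minus_1([[1.0]], [2.0])
```

For x = 2.0 this gives p(c[1]|x) = exp(2) / (exp(2) + 1), with the last category getting the leftover mass.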

Sometimes, you see K coefficient vectors, with p(c[K]|x) proportional to exp(β[K] * x) just as for the other categories. For instance, Dave Lewis and David Madigan et al.’s package Bayesian Multinomial Regression (BMR) lets you choose K or K-1 coefficient vectors; LingPipe uses the K-1 encoding. Dave (Lewis) told me they included the K-vector approach because their users found it easier to interpret than pinning the last coefficient vector to 0.

The problem with the K coefficient vector approach is that the parameters are not identified in the statistical sense. Basically, if you subtract β[k] from all the coefficient vectors (leaving β[k] = 0), you get exactly the same predictions. Zero-centered (or other) priors can identify unique solutions that minimize coefficient values. Gaussian priors (aka L2 regularization) always lead to a unique solution; Laplace priors might not. Imagine a very simple binary (K = 2 outcomes) single regression (n = 1 dimension) with β[1] = 1.0 in the K-1 coefficient encoding, which implicitly sets β[2] = 0.0. For an input x, we get

p(c[1]|x) = exp(x * 1.0) / ( exp(x * 1.0) + exp(x * 0.0) )

p(c[2]|x) = exp(x * 0.0) / ( exp(x * 1.0) + exp(x * 0.0) ).

We get the same results with the K-vector coefficient encoding, taking β[1] = 0.5 and β[2] = -0.5, namely

p(c[1]|x) = exp(x * 0.5) / ( exp(x * 0.5) + exp(x * -0.5) )

p(c[2]|x) = exp(x * -0.5) / ( exp(x * 0.5) + exp(x * -0.5) )

(To see the equivalence, multiply the numerator and denominator of the second set of equations by exp(x*0.5). )

Coefficients of 0.5 and -0.5 are more probable under a zero-centered Gaussian prior (L2 regularization) than coefficients of 1.0 and 0.0, but the two solutions are equally probable under a Laplace prior (L1 regularization).
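Both the predictive equivalence and the prior comparison are easy to check numerically; a quick sketch (helper names are mine):

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

x = 0.7  # any input works; the equivalence holds pointwise

# K-1 encoding: coefficients (1.0, 0.0); K-vector encoding: (0.5, -0.5).
p_km1 = softmax([x * 1.0, x * 0.0])
p_k = softmax([x * 0.5, x * -0.5])
assert all(abs(a - b) < 1e-12 for a, b in zip(p_km1, p_k))

# Penalties (negative log priors, up to constants): the Gaussian (L2)
# penalty prefers the centered solution; the Laplace (L1) penalty
# can't tell them apart.
l2 = lambda bs: sum(b * b for b in bs)
l1 = lambda bs: sum(abs(b) for b in bs)
print(l2([1.0, 0.0]), l2([0.5, -0.5]))  # 1.0 0.5
print(l1([1.0, 0.0]), l1([0.5, -0.5]))  # 1.0 1.0
```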

Sampling approaches to modeling the posterior distribution p(θ|X) are more problematic, because the usual measures of Gibbs sampler convergence (e.g. R hat) won’t work. The usual solution is to monitor the K-1 dimensional transform, because that is identified.

### Using a Single Coefficient Vector

What always throws me when I read max entropy presentations is that they use neither K nor K-1 coefficient vectors, but a single vector with a different kind of feature extraction. Specifically, there’s a different input feature vector for each category, traditionally written φ(c[k],e) for input e and category c[k]. Recall that the usual setup in logistic regression is to use a single feature extractor ψ(e) that is independent of category (the value of ψ(e) is the input vector x we used above before talking about feature extraction). Now, with K vectors φ(c[1],e) through φ(c[K],e) for input e, we have a single coefficient vector β and we model the probability p(c[k]|e) as being proportional to exp(φ(c[k],e) * β).

It’s easy to code the usual logistic setup in the max entropy pattern. Just set β to the concatenation of β[1],…,β[K] and set φ(c[k],e) to the vector 0,…,0,ψ(e),0,…,0, with ψ(e) in the k-th block and zeros elsewhere. Going the other way around doesn’t work, because the more general φ may extract features that can’t be replicated in the usual logistic setup.
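The block embedding can be sketched with a hypothetical helper (not from any particular package) that places ψ(e) into the k-th block of a K*n-dimensional vector:

```python
def block_features(psi_e, k, K):
    """phi(c[k], e): psi(e) in the k-th block (0-based), zeros elsewhere."""
    n = len(psi_e)
    out = [0.0] * (K * n)
    out[k * n:(k + 1) * n] = psi_e
    return out

# With K = 3 categories and psi(e) = [2.0, 5.0], category index k = 1:
phi = block_features([2.0, 5.0], k=1, K=3)
# phi == [0.0, 0.0, 2.0, 5.0, 0.0, 0.0]
```

Dotting phi(c[k],e) with the concatenated coefficient vector β[1],…,β[K] then picks out exactly β[k] * ψ(e), recovering the usual K-vector setup.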

I asked Hal Daumé III when he was in town if people used this flexibility in the models. He said usually not, but he did mention one case he’d seen, which is in first-order CRFs, where the conditioning is on the previous category as well as the entire input string. He said he’d seen people code up the feature “my tag is the same as the previous tag”, which is not something you can code directly in the logistic setting. There, you can only get “is my tag T and is the previous tag T” duplicated for every tag T.
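Hal’s example can be sketched as two feature maps over tag pairs (the feature names here are made up):

```python
def tag_features(prev_tag, cur_tag, tags):
    # Non-block: a single shared "same as previous" feature, whose
    # weight is shared across all tags.
    non_block = {"same-as-prev": 1.0 if cur_tag == prev_tag else 0.0}
    # Block-style: one "prev is T and cur is T" feature per tag T,
    # each with its own separate weight.
    block = {"prev=%s,cur=%s" % (t, t): 1.0 if prev_tag == cur_tag == t else 0.0
             for t in tags}
    return non_block, block

nb, b = tag_features("NP", "NP", ["NP", "VP", "PP"])
```

The non-block map has one weight to learn no matter how many tags there are; the block map has one weight per tag, and nothing ties those weights together.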

Are there other examples?

Is it worth implementing the more general feature extraction case?

May 1, 2009 at 10:33 am

I like to take the largest category and automagically make it the ‘-1’ in the K-1 setting.

What exactly is “maximum entropy” about a few additional features?

May 1, 2009 at 2:09 pm

LingPipe takes the last category, but the largest might be more robust.

The “max ent” part of this is completely irrelevant — it just gives the same solution as the MAP solution.

The issue is how to extract the feature vector. The usual practice in max ent is to use the single-vector encoding, which allows you to encode features that are (1) not category dependent, such as “my tag’s the same as the previous tag”, and (2) vary by category, so that you get a feature in the vector for one output but not others (e.g. my output is NP and I’m in a name list, as opposed to “I’m in a name list”, the latter of which is then a feature in every category). I’m not even sure how to talk about all this clearly.

May 1, 2009 at 3:37 pm

Thanks for the elucidation of the feature vector extraction process. Basically, with polytomous logistic regression, you effectively have interactions between the category variable and the betas, but “max ent” allows betas that don’t interact with the category variable. Sure, just one aspect of the model.

Max Ent is about the practice of fixing the marginal PDFs to match the data frequencies, but then maximizing the entropy of the joint PDF so as to keep the model smooth. Max Ent in the context of Bayes has also referred to priors preferring models that generate predictions with high entropy.

I continue to be confused about what’s Max Ent about CRF, other than through a duality that’s never exercised.

May 4, 2009 at 1:03 pm

Aleks, please let me apologize for my field (natural language processing), which has, in many places, confused the estimation strategy (e.g. max entropy) with the model (e.g. CRF or logistic regression), while failing to recognize that the standard max ent estimate is the same as the maximum likelihood estimate, and that the more complex max ent estimates using minimum KL-divergence from a “prior” distribution are the same as maximum a posteriori (MAP) estimates.

June 26, 2009 at 12:57 am

Why do the NLP folks confuse the estimation strategy with the model? I wish we’d stop doing this!

June 26, 2009 at 12:56 am

Dan Klein of Berkeley has one or two slides in his course on the feature extraction issue you brought up (see http://www.cs.berkeley.edu/~klein/cs288/sp09/SP09%20cs288%20lecture%205%20–%20maximum%20entropy%20(2PP).pdf ). The terminology he uses is “block feature vectors” and “non-block feature vectors”. The sole example he gives of the latter is “a parse tree’s features may be the productions present in the tree”. I don’t really follow the parsing literature much, so I don’t know if it is a real example or a contrived one, but Dan is no stranger to parsing, for what it’s worth.

June 26, 2009 at 11:01 am

Thanks for the pointer. I’m implementing CRFs using only what Dan’s calling “block features” (it’s tough coming up with names; “block” is also used for the number of items used in a training step in gradient descent).

I don’t see the point of the non-block features Dan’s using (p. 11 of the slides), which are sub-parse-trees. If my intuition’s right here and he’s doing the usual thing of CRF-like normalizing over the whole tree, the weights for those subtrees will just keep contributing to every supertree of which they’re a part, regardless of the root category of that tree. So it’d work like a nesting penalty or boost.

Dan promises to return to the non-block case later (after p. 11), but I couldn’t find anything, so maybe that’s a different slide set.

I’m very familiar with Dan’s work on parsing. In fact, the first paper I ever wrote in stat NLP was a joint paper with Chris Manning (Dan’s Ph.D. advisor) on parsing the Penn Treebank with a non-lexicalized left-corner parser, estimating online left-corner parser transition probabilities. If you want some more recent goodies in parsing from Dan’s alma mater, check out Jenny Finkel and Chris Manning’s 2008 CRF-based parsing papers (available on Jenny’s home page) and the more recent ones.

Dan knows the difference between estimation and modeling, but there’s the issue of whether or not you use whatever terminology’s standard in the field.

June 26, 2009 at 1:02 pm

After rattling around in my head for an hour or two, I figured out what’s going on in the parsing example. As I read it now, it’s more about efficiency than expressive power.

Dan must be assuming a single vector of coefficients, and then coding up structural assumptions about impossible rewrite rules by using “non-block” features. So there’ll be an (S (NP VP)) feature, but not a (PP (NP VP)) feature, and therefore coding a (. (NP VP)) “block feature” would be wasteful. If the set of possible mother categories for a given sequence of daughters is limited, you save space using the “non-block” encoding.

Of course, you can use these structural assumptions in search. We do the same thing with HMMs, which, when using a BIO-type encoding for chunkers, wind up disallowing some tag sequences on structural grounds.

June 26, 2009 at 1:06 pm

Good insight. I, too, struggled to understand exactly what was going on, but I agree with your analysis.

October 14, 2010 at 6:25 am

Having learned about Logistic Regression first and having realized that it’s the same thing as what NLP people call “Maximum Entropy” only later, I was also surprised when I saw that, in many presentations, they use a single weight vector.

I came to the conclusion that it’s just a representational convenience. When category-dependent features are used with one weight vector per category, you need to extract K feature vectors per data instance, which means instances are represented as D x K matrices (D for dimensionality, K for classes) and a dataset as an N x D x K array. If a single big weight vector is used instead, data instances can be represented as vectors and datasets as 2-d matrices.
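One way to see the bookkeeping is to check that the two layouts give the same category scores; a sketch with made-up numbers:

```python
D, K = 2, 3

# Per-category features phi(c_k, e): K vectors of length D per instance.
phi = [[1.0, 0.0], [0.0, 3.0], [2.0, 1.0]]
# Per-category layout: K weight vectors of length D.
betas = [[0.5, -0.5], [1.0, 0.0], [0.0, 2.0]]
scores_per_cat = [sum(w * f for w, f in zip(betas[k], phi[k]))
                  for k in range(K)]

# Single-vector layout: one flattened weight vector of length D*K,
# with the instance's features for category k occupying block k.
beta_flat = [w for bk in betas for w in bk]

def flat_features(k):
    out = [0.0] * (D * K)
    out[k * D:(k + 1) * D] = phi[k]
    return out

scores_flat = [sum(w * f for w, f in zip(beta_flat, flat_features(k)))
               for k in range(K)]
assert scores_flat == scores_per_cat
```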

Also, I would like to thank you for your white paper on logistic regression. It contains practically everything one needs to know about it.

One thing I have always wondered though: if I train K one-vs-the-rest binary logistic regression classifiers, will the solution be the same as the multinomial logistic regression solution (with K weight vectors)?

October 14, 2010 at 1:49 pm

As far as prediction goes, they have the same representation power. If you have K vectors, you can convert to the K-1 vector representation by subtracting the K-th vector from the others. You get the same predictions that way.

In the traditional maximum likelihood setting, the K-vector model is not identified, meaning that you can add a vector to each of the existing vectors and get the same model predictions. Using K-1 vectors with the K-th vector pinned to 0 identifies the model.
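The subtraction trick is easy to verify numerically; a sketch with made-up numbers:

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def to_k_minus_1(betas):
    """Subtract the K-th vector from every vector, pinning the last
    vector to zero; predictions are unchanged."""
    last = betas[-1]
    return [[b - l for b, l in zip(vec, last)] for vec in betas]

betas = [[1.0, -2.0], [0.5, 0.5], [2.0, 1.0]]  # K = 3, n = 2
x = [0.3, -1.2]
dot = lambda b, v: sum(bi * vi for bi, vi in zip(b, v))

p_k = softmax([dot(b, x) for b in betas])
p_km1 = softmax([dot(b, x) for b in to_k_minus_1(betas)])
assert all(abs(a - b) < 1e-9 for a, b in zip(p_k, p_km1))
```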

The story’s different with priors.

If you use Gaussian priors (L2/ridge), even the K-vector model is identified: the squared penalty balances the weights so they reach their overall minimal sum of squares.

But if you use Laplace priors (L1/lasso), you might not get identification, because the penalty is only the absolute value. For instance, if you have K = 2 and the prediction vector x is one-dimensional, then the model with beta[1] = 1 and beta[2] = 2 has lower prior probability than the predictively equivalent beta[1] = 0, beta[2] = 1. But that solution is equivalent in terms of the prior to beta[1] = -1/2 and beta[2] = 1/2.

What’s not often pointed out is that you get a different posterior with K-1 vectors than with K, because the penalty can be spread out. For instance, with a Gaussian prior in our example above, the K-1 vector representation gives beta[1] = -1 and beta[2] = 0, for a penalty proportional to the sum of the squares of the coefficients, (-1)**2 = 1. If you use K vectors, the same predictions maximize the prior probability at beta[1] = -1/2, beta[2] = 1/2, for a penalty proportional to (-1/2)**2 + (1/2)**2 = 1/2.