As far as prediction goes, they have the same representational power. If you have K vectors, you can convert to the K-1 vector representation by subtracting the K-th vector from all of them, pinning the K-th vector to zero. You get the same predictions that way.
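A quick numerical check of that equivalence (a sketch in NumPy; the sizes and random inputs are my own illustration): subtracting the K-th vector from every vector shifts all the logits by the same constant, which leaves the softmax unchanged.

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability before exponentiating
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
K, D = 4, 3
W = rng.normal(size=(K, D))   # K coefficient vectors
x = rng.normal(size=D)        # one input vector

# K-1 representation: subtract the K-th vector from all of them,
# pinning the last vector to zero
W_reduced = W - W[-1]

p_full = softmax(W @ x)
p_reduced = softmax(W_reduced @ x)

print(np.allclose(p_full, p_reduced))  # True: identical predictions
```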

In the traditional maximum likelihood setting, the K-vector model is not identified, meaning that you can add a vector to each of the existing vectors and get the same model predictions. Using K-1 vectors with the K-th vector pinned to 0 identifies the model.

The story’s different with priors.

If you use Gaussian priors (L2/ridge), even the K-vector model is identified. The squared penalty forces the predictively equivalent weights to balance out at the unique configuration with the minimal sum of squares.

But if you use a Laplace prior (L1/lasso), you might not get identification, because the penalty is only a sum of absolute values. For instance, if you have K=2 and the predictor x is one-dimensional, then the model with beta[1] = 1 and beta[2] = 2 has lower prior probability than the predictively equivalent beta[1] = 0, beta[2] = 1. But the latter has the same prior probability as beta[1] = -1/2, beta[2] = 1/2, because both have absolute values summing to 1.

What’s not often pointed out is that with a Gaussian prior you get a different posterior with K-1 vectors than with K, because the penalty can be spread out. For instance, in our example above, if we use the K-1 vector representation, then beta[1] = -1 and beta[2] = 0, for a total penalty proportional to the sum of the squared coefficients, (-1)**2 = 1. If you use K vectors, the same predictions maximize the prior probability at beta[1] = -1/2, beta[2] = 1/2, for a total penalty proportional to (-1/2)**2 + (1/2)**2 = 1/2.
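To make the arithmetic concrete, here is a tiny sketch (coefficient values taken from the example above) comparing the two penalties on the pinned and spread-out representations:

```python
# coefficient values from the example above
beta_pinned = [-1.0, 0.0]   # K-1 representation: beta[2] pinned to 0
beta_spread = [-0.5, 0.5]   # K representation: same predictions, spread out

l1 = lambda b: sum(abs(x) for x in b)   # Laplace (lasso) penalty
l2 = lambda b: sum(x ** 2 for x in b)   # Gaussian (ridge) penalty

print(l2(beta_pinned), l2(beta_spread))   # 1.0 0.5 -> different under L2
print(l1(beta_pinned), l1(beta_spread))   # 1.0 1.0 -> same under L1
```

So under the Gaussian prior the K-vector MAP solution spreads the weight evenly, while under the Laplace prior the two representations are penalized identically.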

I came to the conclusion that it’s just a representational convenience. When category-dependent features are used with one weight vector per category, one needs to extract K feature vectors per data instance, which means instances are represented as D x K matrices (D for dimensionality and K for categories) and a dataset as an N x D x K array. If a single big weight vector is used instead, data instances can be represented as vectors and datasets as 2-d matrices.
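A sketch of that equivalence (the shapes and random data are my own illustration): per-category dot products with K weight vectors give the same scores as one long weight vector applied to block-structured flattened features.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 5, 3, 4

X = rng.normal(size=(N, D, K))   # dataset as N x D x K: K feature vectors per instance
W = rng.normal(size=(K, D))      # one weight vector per category

# score for instance n and category k is dot(W[k], X[n, :, k])
scores = np.einsum('kd,ndk->nk', W, X)

# single-big-vector representation: concatenate the K weight vectors
# and flatten each (instance, category) pair into one long vector with
# the category-k features placed in block k
w_big = W.reshape(-1)                     # length D * K
X_flat = np.zeros((N, K, D * K))
for k in range(K):
    X_flat[:, k, k * D:(k + 1) * D] = X[:, :, k]

scores_flat = X_flat @ w_big              # N x K, same scores
print(np.allclose(scores, scores_flat))   # True
```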

Also, I would like to thank you for your white paper on logistic regression. It contains practically everything one needs to know about it.

One thing I have always wondered though: if I train K one-vs-the-rest binary logistic regression classifiers, will the solution be the same as the multinomial logistic regression solution (with K weight vectors)?

Good insight. I, too, struggled to understand exactly what was going on, but I agree with your analysis.

After it rattled around in my head for an hour or two, I figured out what’s going on in the parsing example. As I read it now, it’s more about efficiency than expressive power.

Dan must be assuming a single vector of coefficients, and then coding up structural assumptions about impossible rewrite rules by using “non-block” features. So there’ll be an (S (NP VP)) feature, but not a (PP (NP VP)) feature, and therefore coding a (. (NP VP)) “block feature” would be wasteful. If the set of possible mother categories for a given sequence of daughters is limited, you save space using the “non-block” encoding.

Of course, you can use these structural assumptions in search. We do the same thing with HMMs, which, when using BIO-type encodings for chunking, wind up disallowing some tag sequences on structural grounds.

I don’t see the point of the non-block features Dan’s using (p. 11 of the slides), which are parse subtrees. If my intuition’s right here and he’s doing the usual thing of CRF-like normalizing over the whole tree, the weights for those subtrees will just keep contributing to every supertree of which they’re a part, regardless of the root category of that tree. So it’d work like a nesting penalty or boost.

Dan promises to return to the non-block case later (after p. 11), but I couldn’t find anything, so maybe that’s a different slide set.

I’m very familiar with Dan’s work on parsing. In fact, the first paper I ever wrote in stat NLP was a joint paper with Chris Manning (Dan’s Ph.D. advisor) on parsing the Penn Treebank with a non-lexicalized left-corner parser, estimating online left-corner parser transition probabilities. If you want some more recent goodies in parsing from Dan’s alma mater, check out Jenny Finkel and Chris Manning’s 2008 CRF-based parsing papers (available on Jenny’s home page) and the more recent ones.

Dan knows the difference between estimation and modeling, but there’s the issue of whether or not you use whatever terminology’s standard in the field.

Why do the NLP folks confuse the estimation strategy with the model? I wish we’d stop doing this!

Max Ent is about the practice of fixing the marginal PDFs to match the empirical data frequencies, but then maximizing the entropy of the joint PDF so as to keep the model smooth. Max Ent in the Bayesian context has also referred to priors preferring models that generate predictions with high entropy.

I continue to be confused about what’s Max Ent about CRFs, other than through a duality that’s never exercised.

The “max ent” part of this is completely irrelevant — it just gives the same solution as the MAP solution.

The issue is how to extract the feature vector. The usual practice in max ent is to use the single-vector encoding, which lets you encode features that are (1) not category dependent, such as “my tag’s the same as the previous tag”, and (2) category dependent, so that a feature appears in the vector for one output but not the others (e.g., “my output is NP and I’m in a name list”, as opposed to “I’m in a name list”, the latter of which is a feature for every category). I’m not even sure how to talk about all this clearly.
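A minimal sketch of those two feature types under the single-vector encoding (the category set, feature names, and slot layout here are my own invented illustration):

```python
import numpy as np

CATS = ['NP', 'VP', 'PP']   # hypothetical output categories

def phi(same_as_prev_tag, in_name_list, cat):
    """Single-vector encoding for one (input, candidate category) pair."""
    v = np.zeros(1 + len(CATS))
    # (1) category-independent feature: fires identically for every
    # candidate category ("my tag's the same as the previous tag")
    if same_as_prev_tag:
        v[0] = 1.0
    # (2) category-dependent feature: one slot per category, so
    # "in a name list AND output is NP" fires only when the
    # candidate category is NP
    if in_name_list:
        v[1 + CATS.index(cat)] = 1.0
    return v

# The name-list feature lands in a different slot for each candidate
# category, so one shared weight vector can reward it for NP only.
print(phi(True, True, 'NP'))   # [1. 1. 0. 0.]
print(phi(True, True, 'VP'))   # [1. 0. 1. 0.]
```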
