[*There’s now a followup post, All Bayesian models are generative (in theory)*.]

I was helping Boyi Xie get ready for his Ph.D. qualifying exams in computer science at Columbia and at one point I wrote the following diagram on the board to lay out the generative/discriminative and Bayesian/frequentist distinctions in what gets modeled.

To keep it in the NLP domain, let’s assume we have a simple categorization problem to predict a category z for an input consisting of a bag of words vector w and parameters β.

Frequentist | Bayesian | |

Discriminative | p(z ; w, β) |
p(z, β ; w) = p(z | β ; w) * p(β) |

Generative | p(z, w ; β) |
p(z, w, β) = p(z, w | β) * p(β) |

If you’re not familiar with frequentist notation, items to the right of the semicolon (;) are not modeled probabilistically.

Frequentists model the probability of observations given the parameters. This involves a likelihood function.

Bayesians model the joint probability of data and parameters, which, given the chain rule, amounts to a likelihood function times a prior.

Generative models provide a probabilistic model of the predictors, here the words w, and the categories z, whereas discriminative models only provide a probabilistic model of the categories z given the words w.

April 12, 2013 at 3:43 pm |

Nice summary, Bob. Some would write “conditional” instead of “discriminative” and reserve “discriminative” for the case in which the method whereby the parameters of the conditional distribution are estimated employs a loss function with reference to true labels. Would you agree with this refinement?

April 13, 2013 at 1:28 pm |

I’m OK with “conditional”. The frequentists wouldn’t be, because we don’t have a probability model for x, so it’s not involved in a conditional probability statement (see my reply to David Hogg below).

In Bayesian stats, the “loss function” for “training” is always negative log probability. The posterior and posterior predictive distributions don’t even reference “loss”, per se.

You do sometimes see heuristic approximations of loss for point estimates or for variational inference, but that’s not changing the model itself or what the researcher wants to fit, just introducing a heuristic for computation.

How could a loss function for classification not involve the true labels?

April 12, 2013 at 11:07 pm |

Stupid question: What is the difference between the vertical bar symbol and the semi-colon?

April 13, 2013 at 1:21 pm |

You must’ve been reared as a Bayesian (like me).

Frequentist philosophy doesn’t let you treat parameters as random quantities. So a frequentist wouldn’t write the density of a random variable with a normal distribution with location mu and scale sigma as p(y | mu, sigma), because mu and sigma are not random quantities. Conditional probability, defined as

p(a | b) = p(a, b) / p(b)

only makes sense when p(b) makes sense, which it doesn’t for a frequentist.

Like frequentists, Bayesians treat unknown parameters as actually having fixed values. But Bayesians characterize our knowledge of the values as probabilistic, so they’re OK talking about the parameters of a distribution probabilistically.

April 13, 2013 at 10:24 pm |

Hmm, I think you’re slightly oversimplifying here by not talking about the models are trained and used.

I think “generative” and “discriminative” refer to different ways of defining the loss function used in training. Basically, generative training involves optimising a likelihood defined in terms of your the generative probabilities, and discriminative training involves optimising a likelihood defined in terms of your discriminative probabilities.

At run time the models trained either generatively or discriminatively are used in the same way, i.e., conditional on the observations w.

For a long time the generative/discriminative distinction was confused with issues about local vs global normalisation, but I think now it’s pretty clear that for lots of models (HMMs, CRFs, PCFGs, etc.) the expressive power of globally normalised models is no greater than the corresponding locally normalised models.

And of course at run time Bayesians want to integrate over the distribution over parameters β that were inferred at training time.

M

April 14, 2013 at 11:00 pm |

I agree with the second sentence of your second paragraph, but don’t see how it relates to the first. Do you mean something different by “training” than “estimation”?

The optimization-based estimators include maximum likelihood and penalized maximum likelihood on the frequentist side and posterior maximum (MAP), posterior mean (minimizes expected square error of parameter estimates), and posterior median (minimizes expected absolute error of estimates) on the Bayesian side.

As you say in your last paragraph, full Bayesian inference involves the entire posterior distribution. And then there’s VB, EP, INLA, etc., that approximate posterior distribution estimation problem with an optimization problem.

I don’t see how any of these choices of estimation or inference technique affect whether a model is generative or discriminative.

A generative model (at least as I defined it above) is more flexible in what it can be used for than a discriminative model. If you model words w and category z together, p(w,z), then you can derive the conditional p(z|w). But you can also derive the marginal p(w) and use it as a language model. Specifically, you can use a naive Bayes “classifier” (or HMM) as a language model, but you can’t use a logistic regression classifier (or CRF) that way. Everyone seems to evaluate their fancy LDA enhancements by their marginal probability assignments to new documents (i.e., the predictive posterior in Bayesian terminology).

You can also marginalize out category prevalence p(z) from a naive Bayes “classifier”, but you can’t do that with logistic regression because you don’t have the p(w) to plug into the sum rule.

April 13, 2013 at 10:25 pm |

Sorry, I meant “not talking about *how* the models are trained and used”.

M

April 22, 2013 at 7:05 am |

Well, think about the difference between an HMM and a CRF where each state is linked to the corresponding observation (i.e., the graphical model looks like a comb). The only difference between the HMM and CRF is whether the normalisation is local or global. We know that both HMMs and CRFs define the same class of distributions, and in fact we know how to map any CRF into an equivalent HMM that defines the same distribution as the CRF.

So the difference isn’t in the class of distributions HMMs and CRFs define. Rather, the difference is in the training (or model estimation) procedure; the loss function for the HMM is joint, while the loss function for the CRF is conditional.

Of course, if you optimise conditional log likelihood, you shouldn’t expect the resulting model to do a good job of predicting the joint distribution (even though you could in fact technically use a conditionally-estimated model to do so).

April 23, 2013 at 3:28 pm |

OK — I see what you’re getting at. I guess this is a standard way to think of this issue in machine learning or NLP. I’ve drunk too much of the stats Kool-Aid at this point, so I’d prefer not to think of HMMs or CRFs that way. I think of them as probabilistic models with different parameterizations and likelihood functions. In these terms, the goal of most “machine learning” applications is to maximize (regularized/penalized) likelihood on the training corpus, or sometimes with respect to a held-out corpus if there are unmodeled parameters such as quantity of regularization.

In graphical modeling terms, the difference is that the HMM “comb” model is a standard directed graphical model, whereas the CRF “comb” model is an undirected graphical model. The standard interpretations of these graphical modeling languages along with an indication of what’s data and what’s parameter leads to the difference.

Perhaps a classification problem is simple. I sometimes see the presentation in machine learning contexts that you are trying to build a linear classifier, with predictor vector , parameter vector and linear predictor .

Then we have a different error function given outcome . Then the standard log loss coupled with a logistic regression model corresponds to a loss of . Probit regression replaces inverse logit with the unit normal inverse CDF and robust (“robit”) regression being an inverse CDF for a Cauchy or other Student-t distribution.

Simple 0/1 loss is then $x^{\top}\beta > 0.5 ? (y-1) : (y-0)$. Hinge loss, as used by SVM is yet another formula, and so on.

The view is then that you minimize your loss on the training set. Easy to do with logit, probit and robit, still not so bad with SVMs (only a single non-differentiable point), but rather difficult to do directly with 0/1 loss (because there’s no useful gradient information).

May 23, 2013 at 9:11 pm |

I thought the representational equivalence holds as long as one restricts the CRF to use very simple graph structures. Does it still hold if the CRF has non-local, (and potentially non-regular) edges between the state node and a feature node ? Inference on such a CRF should be pretty much as easy as on a standard “chain CRF”. .

April 24, 2013 at 3:48 am |

Hi Bob, thanks for the reply.

What I’m reacting to is the (what I think is the conventional) view that the difference between the generative HMM and conditional CRF is whether you use local or global normalisation. For several years I thought that the CRF was more expressive than the HMM, and it was through many conversations with Stu Geman and his student Zhiyi Chi that I learnt that the globally normalised models are no more expressive than the locally normalised ones.

I heard Noah Smith express the opinion at a conference that globally normalised WCFGs are more expressive than PCFGs, so I twisted his arm to write the 2007 CL paper “Weighted and Probabilistic Context-Free Grammars Are Equally Expressive” with me to disabuse him of that false belief. (smile)

So these days I see the main difference between HMMs and CRFs to be the loss function used in training or estimation. But of course, precisely because the CRF loss function involves conditioning on some of the variables in the training data (rather than generating them), CRFs permit you to have features that depend on those variables in very complex ways.

April 26, 2013 at 9:59 pm |

Mark, I think your last paragraph is key. Any model that treats some inputs as given can easily incorporate very complex “features” of those inputs. In practice, this gives much more representational freedom to the modeler (but requires more global normalization as a consequence, right?). A generative model that incorporated those complex relationships is possible in theory, but in practice it would likely be mis-specified (because of the difficulty of getting those complex relationships right) and that could have bad consequences for predictive performance.

April 28, 2013 at 3:11 am

I agree completely — the graphical model that results from conditioning on a set of variables deletes those variables and the edges connected to them, which means that conditional estimation can be quite tractable in situations where joint estimation would be intractable. I see this as the main insight behind the CRF.

April 26, 2013 at 10:03 pm |

Mark, I think your last paragraph is key. Any model that conditions on some input variables can just as easily condition on arbitrary (fixed) functions of those inputs. This gives the modeler great freedom. However, if those functions depend on most or all of the input variables, then you will need global normalization, I believe. In principle, one could construct an equivalent generative model of the input, but it would likely be mis-specified, and that could damage the predictive power of the model.

April 26, 2013 at 10:07 pm |

As a machine learning person, I feel compelled to add a third column to your table in which we consider non-probabilistic “models” as well. Maybe the column heading should be “Vapniker”. The notation would be f(z; w, \beta) for the discriminative case. And this could be trained for 0/1 loss or hinge loss or whatever loss function you care about. I don’t think there is non-probabilistic equivalent of the generative model (although nearest-neighbor methods come close).

April 27, 2013 at 5:01 pm |

Absolutely — this is all a matter of perspective. The body of the post only considered probabilistic models. (I tend not to think of other kinds these days.)

I don’t know of a probabilistic interpretation of SVMs, but I haven’t looked. As usually stated, SVMs (and perceptrons) give you a classification machine, but don’t model either the probabilities of the predictors or the outcome categories. They’re more like the discriminative models than the generative ones if you work by analogy to logistic regression. After all, MAP estimates for logistic regression and the usual estimates for SVMs look the same when you fit logistic regression with a point estimate by minimizing log loss and fit an SVM by minimizing hinge loss.

April 28, 2013 at 2:06 pm

Bob Duin pointed out (in what I believe is a still-unpublished note) that the hard SVM (without the slack penalty) is a non-statistical classifier in that it produces exactly the same decision boundary regardless of the number of times a given data point appears in the training set. This makes it rather unusual in the Stat/ML/Pattern Recognition universe.

I also find the interpretation of SVMs as a robust classifier (Xu, et al. JMLR 2009) to be intriguing. In any case, I think there is a lot more going on in SVMs than just a convex approximation to a probabilistic model.

P.S. I use Platt scaling all the time. However, I don’t see how that gives us a probabilistic interpretation of SVMs. I think it just tells us that the discriminant function computed by an SVM is a useful input to a logistic regression.

April 28, 2013 at 3:23 am

Like you (I suspect), I’ve tended to stay away from SVMs and Empirical Risk Minimisation in my own work, so I don’t know as much about them as I should.

But SVMs do have a probabilistic interpretation, i.e., SVMs are designed to minimise the expected loss. There’s a certain attraction to this — in many applications this is precisely what we want to do.

Also, there is a close relationship between the Perceptron and log-linear “MaxEnt” classifier models. Specifically, if you estimate a log-linear classifier with stochastic gradient descent (i.e., mini-batch of size 1) and make a Viterbi approximation (i.e., assume that the expected feature values are the feature values of the mode label) then you obtain the Perceptron update rule! Perhaps even more surprisingly, in SGD, L2 regularisation becomes multiplicative weight decay and L1 regularisation becomes subtractive weight decay! (This isn’t too surprising after a bit of thought).

April 27, 2013 at 6:47 pm

Yes, you get a classification machine (or a regression machine or a ranking machine, depending on the loss function). So if the learning component is the last step in your pipeline, you can train it using the loss function of the actual task.

Probabilistic methods make sense when the component being learned is buried inside a large system, and you seek a modular solution. Probabilities provide a good intermediate representation that can be converted (in a principled way) to optimize many other loss functions. Of course if you are willing to perform end-to-end training of the entire system (in the style of deep neural networks), then you don’t need modularity and you can just use the task-specific loss.

April 27, 2013 at 7:10 pm

Great discussion. Platt ( http://research.microsoft.com/apps/pubs/?id=69187 ) offers a post hoc way to estimate a posterior class probability for an SVM. That’s not a probabilistic interpretation per se, but it allows one to use an SVM in settings where a probability is required. In the paper he also compares that method to a “kernel classifier with a logit link function and a regularized maximum likelihood score”. Perhaps the latter could be seen as the sought-for probabilistic interpretation of SVMs.

April 28, 2013 at 5:30 pm

You’re quite right that SVMs and Perceptrons aren’t probabilistic models (i.e., they don’t estimate the probabilities of different outputs). I just meant that the SVM is designed to optimise a probabilistically-defined objective, namely the expected loss. (Expectations require there to be an underlying probability distribution, of course).

Yes, that’s an interesting point about hard SVMs — I guess that follows from the fact that SVMs maximise the margin; it’s pretty clear that “duplicating” a data point has no effect whatsoever on the margin.

Interestingly, a number of psycholinguists (e.g., Bybee) suggest that humans pay far more attention to type frequency rather than token frequency when learning morphology; i.e., multiple presentations of the same data item doesn’t help human learning, rather we need to see different variants instantiating the same principle.

But type-based learning isn’t unique to SVMs — learning the parameters in the upper levels of hierarchical Bayesian models is type-based in a similar way.