All Bayesian Models are Generative (in Theory)


[This post is a followup to my previous post, Generative vs. discriminative; Bayesian vs. frequentist.]

I had a brief chat with Andrew Gelman about the topic of generative vs. discriminative models. It came up when I was asking him why he didn’t like the frequentist semicolon notation for variables that are not random. He said that in Bayesian analyses, all the variables are considered random. It just turns out we can sidestep having to model the predictors (i.e., covariates or features) in a regression. Andrew explains how this works in section 14.1, subsection “Formal Bayesian justification of conditional modeling” in Bayesian Data Analysis, 2nd Edition (the 3rd Edition is now complete and will be out in the foreseeable future). I’ll repeat the argument here.

Suppose we have predictor matrix X and outcome vector y. In regression modeling, we assume a coefficient vector \beta and explicitly model the outcome likelihood p(y|\beta,X) and coefficient prior p(\beta). But what about X? If we were modeling X, we’d have a full likelihood p(X,y|\beta,\psi), where \psi are the additional parameters involved in modeling X and the prior is now joint, p(\beta,\psi).

So how do we justify conditional modeling as Bayesians? We simply assume that \psi and \beta have independent priors, so that

p(\beta,\psi) = p(\beta) \times p(\psi).

The posterior then neatly factors as

p(\beta,\psi|y,X) = p(\psi|X) \times p(\beta|y,X).

Looking at just the inference for the regression coefficients \beta, we have the familiar expression

p(\beta|y,X) \propto p(\beta) \times p(y|\beta,X).
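
The intermediate steps can be spelled out. This sketch makes explicit one assumption the argument relies on but does not state: that the full likelihood itself factors as p(X,y|\beta,\psi) = p(X|\psi) \times p(y|\beta,X), so that \psi governs only X and \beta governs only y given X. With that and the independent priors, Bayes's rule gives

```latex
\begin{align*}
p(\beta,\psi \mid y,X)
  &\propto p(\beta,\psi) \, p(X,y \mid \beta,\psi) \\
  &= p(\beta) \, p(\psi) \, p(X \mid \psi) \, p(y \mid \beta,X) \\
  &= \bigl[ p(\psi) \, p(X \mid \psi) \bigr]
     \times \bigl[ p(\beta) \, p(y \mid \beta,X) \bigr] \\
  &\propto p(\psi \mid X) \times p(\beta \mid y,X).
\end{align*}
```

The first bracket is a function of \psi alone and the second of \beta alone, so each normalizes separately and the proportionality is in fact an equality.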

Therefore, we can think of everything as a joint model under the hood. Conditional regression models rest on the assumption that \beta and \psi are independent a priori, and that is what lets us ignore the inference for \psi. To quote BDA,

The practical advantage of using such a regression model is that it is much easier to specify a realistic conditional distribution of one variable given k others than a joint distribution on all k+1 variables.

We knew that all along.
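
The factorization can also be checked numerically. The following is a minimal sketch under toy assumptions that are not from BDA or the post: scalar \beta and \psi, x_n ~ Normal(\psi, 1), y_n ~ Normal(\beta x_n, 1), and a standard bivariate normal prior on (\beta, \psi) with correlation rho. All function and variable names are hypothetical. If the posterior factors, the unnormalized log posterior f(\beta, \psi) must be a sum g(\beta) + h(\psi), so the interaction f(b,p) - f(b,p0) - f(b0,p) + f(b0,p0) has to vanish everywhere.

```python
import math
import random


def normal_logpdf(value, mean, sd):
    """Log density of Normal(mean, sd) evaluated at value."""
    z = (value - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def log_joint(beta, psi, xs, ys, rho=0.0):
    """Unnormalized log posterior of (beta, psi) for a toy joint model.

    Toy assumptions (not from the post): x_n ~ Normal(psi, 1),
    y_n ~ Normal(beta * x_n, 1), and a standard bivariate normal prior
    on (beta, psi) with correlation rho. rho = 0 means independent priors.
    """
    log_prior = (
        -(beta * beta - 2.0 * rho * beta * psi + psi * psi)
        / (2.0 * (1.0 - rho * rho))
    )
    log_lik_x = sum(normal_logpdf(x, psi, 1.0) for x in xs)
    log_lik_y = sum(normal_logpdf(y, beta * x, 1.0) for x, y in zip(xs, ys))
    return log_prior + log_lik_x + log_lik_y


def max_interaction(xs, ys, rho, grid):
    """Largest |f(b,p) - f(b,p0) - f(b0,p) + f(b0,p0)| over the grid.

    This is zero exactly when f(b, p) = g(b) + h(p), that is, when the
    posterior factors into a beta part and a psi part.
    """
    b0 = p0 = grid[0]
    f00 = log_joint(b0, p0, xs, ys, rho)
    worst = 0.0
    for b in grid:
        for p in grid:
            d = (log_joint(b, p, xs, ys, rho) - log_joint(b, p0, xs, ys, rho)
                 - log_joint(b0, p, xs, ys, rho) + f00)
            worst = max(worst, abs(d))
    return worst


# Simulate a small data set from the toy model.
rng = random.Random(0)
xs = [rng.gauss(1.0, 1.0) for _ in range(50)]
ys = [rng.gauss(0.5 * x, 1.0) for x in xs]
grid = [-2.0 + 0.5 * i for i in range(9)]  # -2.0, -1.5, ..., 2.0

independent = max_interaction(xs, ys, rho=0.0, grid=grid)
correlated = max_interaction(xs, ys, rho=0.6, grid=grid)
print(independent, correlated)
```

With independent priors the interaction is zero up to floating-point error, so the model for X can be dropped without changing the inference for \beta. With a correlated prior the interaction is clearly nonzero, and X would then carry information about \beta through \psi.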

13 Responses to “All Bayesian Models are Generative (in Theory)”

  1. Cornbread Says:

    That independence assumption is something I rarely think about–very often I just run the regression and try to be surprised. It’s a great day when I get to be reminded of my own assumptions!

  2. Mike Collins Says:

    [ed.: added LaTeX escapes]

    Do you really need to introduce \psi?

    Couldn’t you also say that

    p(x, y, \beta) = p(x) p(\beta) p(y | \beta, x)

    where p(x) is an arbitrary density over the x’s?

    This seems a little simpler and is also correct I think(?)
    Here the independence assumption is that x and \beta
    are independent.

    • Bob Carpenter Says:

      Yes, I think that’s simpler.

      The question is whether it’s a stronger assumption. I’ll have to work out that Andrew’s assumptions imply yours and vice-versa assuming p(x) = p(x|\psi).

  3. Daniel Says:

    If you want to train it with ML, yeah, but you can even train a Bayesian model in a discriminative way …

    • Bob Carpenter Says:

      A Bayesian model plus data defines a posterior. Full Bayesian inference is uniquely determined by the model.

      There are techniques for approximate Bayesian inference such as using a point estimate based on MAP (equivalent to regularized or penalized MLE) or based on the posterior mean (or its L1 equivalent which minimizes a different point estimate loss function), or you can use variational inference or expectation propagation or a Laplace approximation to approximate the whole posterior.

      But none of this changes a “generative” model to a “discriminative” one.

  4. Fernando Says:

    If two parameters covary, I want to ask why? What causes them to covary?

    I’m not sure the above framework allows for such questions. Sure, one can add another hierarchy of parameters, but at some point the buck stops. And at that point Bayes seems to lack an answer.

    Some may argue the question is not pertinent. I would argue that the point of a (causal) model is precisely to add enough explanatory variables to ensure the independence of parameters.

    Or, to put it another way, to seek a factual explanation for the implicit heterogeneity.

    • Bob Carpenter Says:

      Statistics isn’t really about the “why”. You can build causal experiments or buy into the whole Judea Pearl structural graphical modeling thing, but the bottom line is that neither classical frequentist stats nor Bayesian stats really try to answer the “why” question.

      With a Bayesian posterior, you do model the posterior covariance. But the “why” is a different question.

      You are never going to ensure independence of parameters in models. Even in a simple linear regression, Y = \alpha X + \beta + \epsilon, you find correlation of the slope \alpha and intercept \beta. In a more realistic case, if I use income and education as predictors, you’ll find the coefficients correlated because the predictors are correlated. You could decorrelate the predictors for any given sample using SVD, but it’s a lousy assumption that the resulting transform will decorrelate the entire population.
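
[ed.: A minimal sketch of the two claims above, assuming a flat prior and known noise variance so that the posterior covariance of the coefficients is proportional to the inverse of X^T X. The function name and the simulated predictors are hypothetical.]

```python
import math
import random


def coefficient_correlation(col_a, col_b):
    """Correlation between the two coefficients of y = a*col_a + b*col_b + noise.

    Assumes a flat prior and known noise variance (illustrative assumptions),
    so the posterior covariance is proportional to the inverse of X^T X.
    For the 2x2 matrix [[saa, sab], [sab, sbb]] the inverse is
    [[sbb, -sab], [-sab, saa]] / det, which yields the correlation below.
    """
    saa = sum(a * a for a in col_a)
    sbb = sum(b * b for b in col_b)
    sab = sum(a * b for a, b in zip(col_a, col_b))
    return -sab / math.sqrt(saa * sbb)


rng = random.Random(1)
n = 200

# Slope and intercept: a predictor with positive mean plus a column of ones.
x = [rng.gauss(2.0, 1.0) for _ in range(n)]
ones = [1.0] * n
slope_intercept_corr = coefficient_correlation(x, ones)

# Two positively correlated, zero-mean predictors sharing a common factor z
# (stand-ins for something like income and education).
z = [rng.gauss(0.0, 1.0) for _ in range(n)]
x1 = [zi + 0.5 * rng.gauss(0.0, 1.0) for zi in z]
x2 = [zi + 0.5 * rng.gauss(0.0, 1.0) for zi in z]
predictors_corr = coefficient_correlation(x1, x2)

print(slope_intercept_corr, predictors_corr)
```

Both correlations come out negative in this setup: the slope and intercept trade off against each other whenever the predictor's mean is away from zero, and the two coefficients trade off whenever their predictors are positively correlated.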

      • Bob Carpenter Says:

        I overstated the above. You can do causal inference in various ways using statistics. I just meant that statistical inference itself isn’t specific to causal inference and doesn’t intrinsically say anything about causation.

  5. Fernando Says:

    In the linear regression you can often break that correlation by adding more variables. For example, in the time series case by adding Y_0 as a regressor for Y_t (ignoring problems of endogeneity for now). In general we want to explain fixed effects.

    Also that two regressors covary is neither necessary nor sufficient for the parameters attached to them to also covary. It might be an empirical regularity, but it need not.

    But we agree on this: “that statistical inference itself isn’t specific to causal inference and doesn’t intrinsically say anything about causation.”

    • Bob Carpenter Says:

      I realize that coefficients can covary even if the predictors don’t (like slope and intercept in a simple regression). But when will the predictors covary and the coefficients not (negatively) covary?

      • Fernando Says:

        Maybe I am not understanding, but if I play God for a minute and build my own model to generate my own data, I can draw x1 and x2 from a multivariate Normal where they covary, draw independent parameters b1 and b2, and make y = b1 x1 + b2 x2.

        Is Nature barred from doing that?

      • Bob Carpenter Says:

        I thought we were talking about statistical inference? Both Bayesians and frequentists view the parameters b1 and b2 as being fixed, and the problem as one of inferring what they are.

        I can’t speak for what Nature can and cannot do, but if the multivariate normal It creates is correlated for the predictors X1[n] and X2[n], then I, as a lowly observer of data, will get correlation in my estimates for b1 and b2.

        If I knew what the multivariate normal was that they were drawn from, I could decompose it into two different, independent variables Y1[n] and Y2[n] plus the translation, rotation and scaling matrix. So if you can model the generative process of X1 and X2, then you’d have the whole story. But there’d still be correlation in your posterior for all of these parameters (or a non-diagonal Fisher information matrix if you’re a frequentist).

  6. Fernando Says:

    Thanks Bob.

    The confusion arises because when I read “generative” models I think of the models Nature uses to generate the data we observe, otherwise known as structural models. Hence I thought you were imposing constraints on Nature.
