[*This post is a followup to my previous post, Generative vs. discriminative; Bayesian vs. frequentist.*]

I had a brief chat with Andrew Gelman about the topic of generative vs. discriminative models. It came up when I was asking him why he didn’t like the frequentist semicolon notation for variables that are not random. He said that in Bayesian analyses, all the variables are considered random. It just turns out we can sidestep having to model the predictors (i.e., covariates or features) in a regression. Andrew explains how this works in section 14.1, subsection “Formal Bayesian justification of conditional modeling” in *Bayesian Data Analysis*, 2nd Edition (the 3rd Edition is now complete and will be out in the foreseeable future). I’ll repeat the argument here.

Suppose we have predictor matrix $X$ and outcome vector $y$. In regression modeling, we assume a coefficient vector $\beta$ and explicitly model the outcome likelihood $p(y \mid \beta, X)$ and coefficient prior $p(\beta)$. But what about $X$? If we were modeling $X$, we’d have a full likelihood $p(X, y \mid \beta, \psi)$, where $\psi$ are the additional parameters involved in modeling $X$ and the prior is now joint, $p(\beta, \psi)$.

So how do we justify conditional modeling as Bayesians? We simply assume that $\psi$ and $\beta$ have independent priors, so that

$p(\beta, \psi) = p(\beta) \times p(\psi)$.

The posterior then neatly factors as

$p(\beta, \psi \mid y, X) = p(\psi \mid X) \times p(\beta \mid X, y)$.

Looking at just the inference for the regression coefficients $\beta$, we have the familiar expression

$p(\beta \mid y, X) \propto p(\beta) \times p(y \mid \beta, X)$.

Therefore, we can think of everything as a joint model under the hood. Regression models involve an independence assumption so we can ignore the inference for $\psi$. To quote *BDA*,

> The practical advantage of using such a regression model is that it is much easier to specify a realistic conditional distribution of one variable given others than a joint distribution on all variables.

We knew that all along.
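For readers who want the algebra behind the factoring spelled out, here is a sketch. The notation is assumed for illustration: $X$ for the predictors, $y$ for the outcomes, $\beta$ for the regression coefficients, and $\psi$ for the parameters of the model for $X$. Besides the independent priors, the second line relies on the full likelihood splitting as $p(X, y \mid \beta, \psi) = p(X \mid \psi) \, p(y \mid X, \beta)$, which is stated in the same *BDA* argument.

```latex
\begin{align*}
p(\beta, \psi \mid X, y)
  &\propto p(\beta, \psi) \, p(X, y \mid \beta, \psi) \\
  &= p(\beta) \, p(\psi) \, p(X \mid \psi) \, p(y \mid X, \beta) \\
  &= \bigl[ p(\psi) \, p(X \mid \psi) \bigr] \,
     \bigl[ p(\beta) \, p(y \mid X, \beta) \bigr] \\
  &\propto p(\psi \mid X) \, p(\beta \mid X, y).
\end{align*}
```

The two bracketed terms share no parameters, so integrating out $\psi$ leaves the second bracket untouched, which is why the model for $X$ can be dropped when only $\beta$ is of interest.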

May 24, 2013 at 8:35 pm

That independence assumption is something I rarely think about–very often I just run the regression and try to be surprised. It’s a great day when I get to be reminded of my own assumptions!

June 7, 2013 at 11:51 am

[ed.: added LaTeX escapes] Do you really need to introduce $\psi$?

Couldn’t you also say that

$p(X, y \mid \beta) = p(X) \times p(y \mid X, \beta)$,

where $p(X)$ is an arbitrary density over the $x$’s?

This seems a little simpler and is also correct I think(?)

Here the independence assumption is that $X$ and $\beta$ are independent.

June 7, 2013 at 12:17 pm

Yes, I think that’s simpler.

The question is whether it’s a stronger assumption. I’ll have to work out that Andrew’s assumptions imply yours and vice-versa assuming .

June 19, 2013 at 1:57 pm

If you want to train it with ML, yeah, but you can even train a Bayesian model in a discriminative way …

June 19, 2013 at 3:26 pm

A Bayesian model plus data defines a posterior. Full Bayesian inference is uniquely determined by the model.

There are techniques for approximate Bayesian inference such as using a point estimate based on MAP (equivalent to regularized or penalized MLE) or based on the posterior mean (or its L1 equivalent which minimizes a different point estimate loss function), or you can use variational inference or expectation propagation or a Laplace approximation to approximate the whole posterior.
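To make the MAP and penalized MLE equivalence concrete, here is a small worked case. The Gaussian prior $\beta \sim \mathrm{Normal}(0, \tau^2 I)$ is an assumption chosen for illustration, as are the symbols $\beta$ (coefficients), $y$ (outcomes), $X$ (predictors) and $\tau$ (prior scale).

```latex
\begin{align*}
\hat{\beta}_{\mathrm{MAP}}
  &= \arg\max_{\beta} \; p(\beta \mid y, X) \\
  &= \arg\max_{\beta} \; \bigl[ \log p(y \mid \beta, X) + \log p(\beta) \bigr] \\
  &= \arg\min_{\beta} \; \Bigl[ -\log p(y \mid \beta, X)
       + \frac{1}{2\tau^2} \, \lVert \beta \rVert_2^2 \Bigr].
\end{align*}
```

Under that assumed prior, the log prior acts as an L2 (ridge) penalty on the negative log likelihood, and a smaller $\tau$ means a stronger penalty. The point estimate changes with the prior, but the model itself is the same joint density.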

But none of this changes a “generative” model to a “discriminative” one.

July 10, 2013 at 2:06 pm

If two parameters covary, I want to ask why? What causes them to covary?

I’m not sure the above framework allows for such questions. Sure, one can add another hierarchy of parameters, but at some point the buck stops. And at that point Bayes seems to lack an answer.

Some may argue the question is not pertinent. I would argue that the point of a (causal) model is precisely to add enough explanatory variables to ensure the independence of parameters.

Or, to put it another way, to seek a factual explanation for the implicit heterogeneity.

July 10, 2013 at 2:17 pm

Statistics isn’t really about the “why”. You can build causal experiments or buy into the whole Judea Pearl structural graphical modeling thing, but the bottom line is that neither classical frequentist stats nor Bayesian stats really try to answer the “why” question.

With a Bayesian posterior, you do model the posterior covariance. But the “why” is a different question.

You are never going to ensure independence of parameters in models. Even in a simple linear regression, $y = \alpha + \beta x + \epsilon$, you find correlation of the slope $\beta$ and intercept $\alpha$. In a more realistic case, if I use income and education as predictors, you’ll find the coefficients correlated because the predictors are correlated. You could decorrelate the predictors for any given sample using SVD, but it’s a lousy assumption that the resulting transform will decorrelate the entire population.
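The slope and intercept point can be checked numerically. The sketch below uses the textbook OLS sampling covariance for a model of the form y = a + b*x + noise; with a flat prior and known noise scale, the posterior covariance has the same form. The function name `slope_intercept_correlation` and the toy data are made up for illustration.

```python
import math


def slope_intercept_correlation(xs):
    """Correlation between the OLS intercept and slope estimates
    for y = a + b*x + noise, given the predictor values xs.

    The noise variance multiplies every term below, so it cancels
    in the correlation and is left out.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    var_a = 1.0 / n + x_bar ** 2 / s_xx   # variance of the intercept estimate
    var_b = 1.0 / s_xx                    # variance of the slope estimate
    cov_ab = -x_bar / s_xx                # covariance of intercept and slope
    return cov_ab / math.sqrt(var_a * var_b)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]            # hypothetical predictor values
x_mean = sum(xs) / len(xs)

raw = slope_intercept_correlation(xs)
centered = slope_intercept_correlation([x - x_mean for x in xs])

print("correlation with raw x:     ", raw)
print("correlation with centered x:", centered)
```

For this sample, the estimates are strongly negatively correlated with the raw predictor and uncorrelated after centering. Centering only fixes the sample at hand, which is the same caveat as with the SVD transform.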

July 10, 2013 at 8:56 pm

I overstated the above. You can do causal inference in various ways using statistics. I just meant that statistical inference itself isn’t specific to causal inference and doesn’t intrinsically say anything about causation.

July 11, 2013 at 10:05 am

In the linear regression you can often break that correlation by adding more variables. For example, in the time series case by adding Y_0 as a regressor for Y_t (ignoring problems of endogeneity for now). In general we want to explain fixed effects.

Also that two regressors covary is neither necessary nor sufficient for the parameters attached to them to also covary. It might be an empirical regularity, but it need not.

But we agree on this: “that statistical inference itself isn’t specific to causal inference and doesn’t intrinsically say anything about causation.”

July 12, 2013 at 12:57 am

I realize that coefficients can covary even if the predictors don’t (like slope and intercept in a simple regression). But when will the predictors covary and the coefficients not (negatively) covary?

July 12, 2013 at 8:04 am

Maybe I am not understanding, but if I play God for a minute and build my own model to generate my own data, I can draw x1 and x2 from a multivariate Normal where they covary, draw independent parameters b1 and b2, and make y = b1x1 + b2x2.

Is Nature barred from doing that?

July 12, 2013 at 10:50 am

I thought we were talking about statistical inference? Both Bayesians and frequentists view the parameters b1 and b2 as being fixed, and the problem as one of inferring what they are.

I can’t speak for what Nature can and cannot do, but if the multivariate normal It creates is correlated for the predictors X1[n] and X2[n], then I, as a lowly observer of data, will get correlation in my estimates for b1 and b2.

If I knew what the multivariate normal was that they were drawn from, I could decompose it into two different, independent variables Y1[n] and Y2[n] plus the translation, rotation and scaling matrix. So if you can model the generative process of X1 and X2, then you’d have the whole story. But there’d still be correlation in your posterior for all of these parameters (or a non-diagonal Fisher information matrix if you’re a frequentist).
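Here is a rough simulation of the setup discussed in this thread, assuming a model of the form y = b1*x1 + b2*x2 + noise with no intercept. The OLS sampling covariance of the estimates is proportional to the inverse of X'X, so it depends only on the predictors. The helper names, the seed, the sample size and the population correlation of 0.8 are all arbitrary choices for illustration.

```python
import math
import random


def predictor_correlation(x1, x2):
    """Uncentered sample correlation of the two predictors."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    return s12 / math.sqrt(s11 * s22)


def coefficient_correlation(x1, x2):
    """Correlation of the OLS estimates of b1 and b2.

    The covariance is sigma^2 * inverse(X'X); sigma^2 cancels in
    the correlation, so only the 2x2 inverse is needed.
    """
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    det = s11 * s22 - s12 * s12
    # inverse of [[s11, s12], [s12, s22]] is [[s22, -s12], [-s12, s11]] / det
    var_b1, var_b2, cov_b12 = s22 / det, s11 / det, -s12 / det
    return cov_b12 / math.sqrt(var_b1 * var_b2)


random.seed(0)
rho = 0.8    # assumed population correlation between x1 and x2
n = 1000     # assumed sample size

x1, x2 = [], []
for _ in range(n):
    z1, z2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    x1.append(z1)
    x2.append(rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)  # correlated with x1

pred_corr = predictor_correlation(x1, x2)
coef_corr = coefficient_correlation(x1, x2)

print("predictor correlation:  ", pred_corr)
print("coefficient correlation:", coef_corr)
```

In this two-predictor case, the correlation of the estimates is exactly the negative of the uncentered sample correlation of the predictors, however the true b1 and b2 were chosen. The estimates are uncorrelated only when the sampled predictors are exactly orthogonal.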

July 12, 2013 at 11:46 am

Thanks Bob.

The confusion arises because when I read “generative” models I think of the models Nature uses to generate the data we observe, otherwise known as structural models. Hence I thought you were imposing constraints on Nature.