[This post is a follow-up to my previous post, Generative vs. discriminative; Bayesian vs. frequentist.]
I had a brief chat with Andrew Gelman about the topic of generative vs. discriminative models. It came up when I was asking him why he didn’t like the frequentist semicolon notation for variables that are not random. He said that in Bayesian analyses, all the variables are considered random. It just turns out we can sidestep having to model the predictors (i.e., covariates or features) in a regression. Andrew explains how this works in section 14.1, subsection “Formal Bayesian justification of conditional modeling” in Bayesian Data Analysis, 2nd Edition (the 3rd Edition is now complete and will be out in the foreseeable future). I’ll repeat the argument here.
Suppose we have a predictor matrix $x$ and an outcome vector $y$. In regression modeling, we assume a coefficient vector $\beta$ and explicitly model the outcome likelihood $p(y \mid x, \beta)$ and coefficient prior $p(\beta)$. But what about $x$? If we were modeling $x$, we'd have a full likelihood $p(x, y \mid \beta, \psi)$, where $\psi$ are the additional parameters involved in modeling $x$, and the prior would now be joint, $p(\beta, \psi)$.
So how do we justify conditional modeling as Bayesians? We simply assume that $\beta$ and $\psi$ have independent priors, so that

$$ p(\beta, \psi) = p(\beta)\, p(\psi). $$
By construction, the full likelihood also factors as $p(x, y \mid \beta, \psi) = p(x \mid \psi)\, p(y \mid x, \beta)$, since $\psi$ governs only $x$ and $\beta$ governs only $y$ given $x$. The posterior then neatly factors as

$$ p(\beta, \psi \mid x, y) = p(\psi \mid x)\, p(\beta \mid x, y). $$
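Spelled out, this is just Bayes's rule plus the two factorizations above:

$$
\begin{aligned}
p(\beta, \psi \mid x, y)
  &\propto p(\beta, \psi)\, p(x, y \mid \beta, \psi) \\
  &= p(\beta)\, p(\psi)\, p(x \mid \psi)\, p(y \mid x, \beta) \\
  &\propto p(\psi \mid x)\, p(\beta \mid x, y).
\end{aligned}
$$

The last step uses $p(\psi)\, p(x \mid \psi) \propto p(\psi \mid x)$ as a function of $\psi$ and $p(\beta)\, p(y \mid x, \beta) \propto p(\beta \mid x, y)$ as a function of $\beta$; since both sides are normalized densities over $(\beta, \psi)$, the proportionality is in fact equality.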
Looking at just the inference for the regression coefficients $\beta$, we have the familiar expression

$$ p(\beta \mid x, y) \propto p(\beta)\, p(y \mid x, \beta). $$
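Here's a quick numerical sanity check, using a toy model of my own choosing rather than anything from BDA: take $x_i \sim \mathrm{Normal}(\psi, 1)$ and $y_i \sim \mathrm{Normal}(\beta x_i, 1)$ with independent standard normal priors on $\beta$ and $\psi$, compute the posterior on a grid both ways, and confirm that the $\beta$ marginal from the joint model matches the posterior from the conditional (regression-only) model.

```python
import numpy as np

# Toy joint model (my own choice, not from BDA):
#   x_i ~ Normal(psi, 1),  y_i ~ Normal(beta * x_i, 1),
# with independent priors beta ~ Normal(0, 1) and psi ~ Normal(0, 1).
rng = np.random.default_rng(0)
n = 20
x = rng.normal(1.0, 1.0, size=n)      # data generated with psi = 1.0
y = rng.normal(0.5 * x, 1.0)          # data generated with beta = 0.5

def log_normal(z, mu, sigma=1.0):
    return -0.5 * ((z - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

betas = np.linspace(-3.0, 3.0, 601)   # grid over beta
psis = np.linspace(-3.0, 3.0, 601)    # grid over psi

# Log posterior over the (beta, psi) grid for the full joint model:
# log p(beta) + log p(psi) + log p(x | psi) + log p(y | x, beta).
log_joint = (
    log_normal(betas, 0.0)[:, None]
    + log_normal(psis, 0.0)[None, :]
    + log_normal(x[None, None, :], psis[None, :, None]).sum(axis=-1)
    + log_normal(y[None, None, :], betas[:, None, None] * x).sum(axis=-1)
)
joint = np.exp(log_joint - log_joint.max())
joint /= joint.sum()
beta_from_joint = joint.sum(axis=1)   # marginalize psi out

# Posterior over beta from the conditional (regression-only) model:
# log p(beta) + log p(y | x, beta).
log_cond = log_normal(betas, 0.0) + \
    log_normal(y[None, :], betas[:, None] * x[None, :]).sum(axis=-1)
cond = np.exp(log_cond - log_cond.max())
cond /= cond.sum()

# The two posteriors for beta agree up to floating-point noise.
print(np.abs(beta_from_joint - cond).max())
assert np.allclose(beta_from_joint, cond)
```

The inference for $\psi$ never touches the $\beta$ marginal, which is the whole point of the independence assumption.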
Therefore, we can think of everything as a joint model under the hood. Regression models just make an independence assumption that lets us ignore the inference for $\psi$. To quote BDA,
The practical advantage of using such a regression model is that it is much easier to specify a realistic conditional distribution of one variable given others than a joint distribution on all variables.
We knew that all along.