[There’s now a followup post, All Bayesian models are generative (in theory).]
I was helping Boyi Xie get ready for his Ph.D. qualifying exams in computer science at Columbia and at one point I wrote the following diagram on the board to lay out the generative/discriminative and Bayesian/frequentist distinctions in what gets modeled.
To keep it in the NLP domain, let’s assume we have a simple categorization problem: predict a category z from an input consisting of a bag-of-words vector w, with model parameters β.
|                | Frequentist | Bayesian                            |
|----------------|-------------|-------------------------------------|
| Discriminative | p(z ; w, β) | p(z, β ; w) = p(z \| β ; w) * p(β)  |
| Generative     | p(z, w ; β) | p(z, w, β) = p(z, w \| β) * p(β)    |
If you’re not familiar with frequentist notation, items to the right of the semicolon (;) are not modeled probabilistically.
Frequentists model the probability of the observations given the parameters; this conditional probability is the likelihood function.
Bayesians model the joint probability of the data and the parameters, which, by the chain rule, factors into a likelihood function times a prior.
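To make the factorization concrete, here’s a minimal sketch using a made-up coin-flip example (not from the post): observations z are Bernoulli with parameter β, and the Bayesian joint p(z, β) is just the frequentist likelihood multiplied by a prior on β.

```python
def likelihood(beta, heads, tails):
    """Frequentist object: p(z ; beta), the likelihood of the observed flips."""
    return beta**heads * (1 - beta)**tails

def prior(beta):
    """Uniform Beta(1,1) prior density on [0, 1] (an arbitrary choice here)."""
    return 1.0

def joint(beta, heads, tails):
    """Bayesian object: p(z, beta) = p(z | beta) * p(beta), by the chain rule."""
    return likelihood(beta, heads, tails) * prior(beta)

# The unnormalized posterior p(beta | z) is proportional to this joint.
print(joint(0.7, heads=7, tails=3))
```

With a uniform prior the joint is numerically equal to the likelihood, but the two play different roles: the Bayesian treats β as a random variable, the frequentist does not.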
Generative models provide a probabilistic model of both the predictors, here the words w, and the categories z, whereas discriminative models only provide a probabilistic model of the categories z given the words w.
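The distinction can be sketched with a toy naive-Bayes-style generative model over a two-word vocabulary (the categories, words, and parameter values below are all made up for illustration). Because the model defines the joint p(z, w ; β), it can both recover p(z | w) by Bayes’ rule and compute the marginal probability of the words, something a discriminative model such as logistic regression, which parameterizes p(z | w, β) directly, cannot do.

```python
# Made-up generative parameters beta: a category prior and
# per-category word distributions (naive Bayes style).
p_z = {"sports": 0.5, "politics": 0.5}
p_w_given_z = {
    "sports":   {"ball": 0.8, "vote": 0.2},
    "politics": {"ball": 0.3, "vote": 0.7},
}

def generative_joint(z, words):
    """p(z, w ; beta): the generative model scores categories AND words."""
    p = p_z[z]
    for w in words:
        p *= p_w_given_z[z][w]
    return p

def predict(words):
    """Bayes' rule recovers p(z | w) from the joint."""
    joint = {z: generative_joint(z, words) for z in p_z}
    total = sum(joint.values())
    return {z: p / total for z, p in joint.items()}

def marginal_words(words):
    """p(w): only available because the model is generative."""
    return sum(generative_joint(z, words) for z in p_z)

print(predict(["ball", "ball"]))
print(marginal_words(["ball", "ball"]))
```

A discriminative classifier trained on the same data could match (or beat) the predictions of `predict`, but it would have nothing to say about `marginal_words`.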