Ironically, naive Bayes, as standardly presented, is not Bayesian. At least not in the sense that we estimate a posterior distribution over parameters given data and use it for predictive inference for new documents.
What is Naive Bayes?
The naive Bayes model assumes that a corpus of documents is generated by selecting a category for each document and then generating the words of that document independently based on a category-specific distribution. The naivete in question is in assuming the words are independent, an assumption that’s clearly violated by natural language texts.
In symbols, if we have $N$ documents, we assign document $n$ to category $c_n$ based on a discrete distribution $\pi$ over categories:

$c_n \sim \mbox{Discrete}(\pi)$
We then generate the words $w_{n,1}, \ldots, w_{n,M_n}$ for document $n$ according to a discrete distribution $\phi_{c_n}$ over words for category $c_n$:

$w_{n,m} \sim \mbox{Discrete}(\phi_{c_n})$
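This generative story can be sketched in a few lines of code; the two-category setup and all parameter values below are made up for illustration:

```python
import random

random.seed(0)

# Illustrative parameters: 2 categories, vocabulary of 4 word ids.
pi = [0.6, 0.4]                 # discrete distribution over categories
phi = [
    [0.5, 0.3, 0.1, 0.1],       # word distribution for category 0
    [0.1, 0.1, 0.4, 0.4],       # word distribution for category 1
]

def generate_document(num_words):
    """Pick a category, then draw each word independently given it."""
    c = random.choices(range(len(pi)), weights=pi)[0]
    words = random.choices(range(len(phi[c])), weights=phi[c], k=num_words)
    return c, words

c, words = generate_document(5)
```

The independence shows up in the single `random.choices` call: once the category is fixed, each word is drawn from the same distribution with no dependence on its neighbors.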
Naive Bayes Priors
Given a corpus of training documents, we typically set $\pi$ and $\phi$ by maximum likelihood or additive smoothing. These are both maximum a posteriori (MAP) estimates given (uniform) Dirichlet priors.
Specifically, we assume prior category counts (plus 1) $\alpha$ and prior word counts (plus 1) $\beta$:

$\pi \sim \mbox{Dirichlet}(\alpha)$

$\phi_k \sim \mbox{Dirichlet}(\beta)$
As is clear from the posteriors defined in the next section, if we set $\alpha = \beta = 1$, the MAP parameter estimates are the maximum likelihood estimates (MLE). If $\alpha = \beta = 2$, we have what’s known as “add 1 smoothing” or “Laplace smoothing”.
Maximum a Posteriori Estimates
The maximum a posteriori (MAP) estimates $(\hat{\pi}, \hat{\phi})$ given priors $\alpha, \beta$ and training data $c, w$ are:

$(\hat{\pi}, \hat{\phi}) = \arg\max_{\pi,\phi} \ p(\pi, \phi \mid c, w, \alpha, \beta)$
The MAP parameter estimates are calculated using our old friend additive smoothing, where $K$ is the number of categories, $V$ the vocabulary size, $N$ the number of documents, $\mbox{count}(k)$ the number of documents assigned to category $k$, and $\mbox{count}(k,v)$ the number of times word $v$ appears in documents of category $k$:

$\hat{\pi}_k = \frac{\mbox{count}(k) + \alpha - 1}{N + K(\alpha - 1)}$

$\hat{\phi}_{k,v} = \frac{\mbox{count}(k,v) + \beta - 1}{\sum_{v'=1}^{V} \mbox{count}(k,v') + V(\beta - 1)}$
Note the subtraction of one from the prior counts; Dirichlet distributions are parameterized with $\alpha$ and $\beta$ being prior counts plus one. If $\alpha = \beta = 1$, that’s a prior count of zero for categories and words, and hence a maximum likelihood posterior.
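As a sketch of these smoothed MAP estimates computed from raw counts, here is a minimal version; the toy corpus of word-id documents and the choice $\alpha = \beta = 2$ (add-1 smoothing) are assumptions for illustration:

```python
from collections import Counter

def map_estimates(docs, K, V, alpha=2.0, beta=2.0):
    """docs: list of (category, word ids) pairs; returns (pi_hat, phi_hat).
    Additive smoothing with prior counts alpha - 1 and beta - 1."""
    cat_counts = Counter(c for c, _ in docs)
    word_counts = [Counter() for _ in range(K)]
    for c, words in docs:
        word_counts[c].update(words)
    N = len(docs)
    pi_hat = [(cat_counts[k] + alpha - 1) / (N + K * (alpha - 1))
              for k in range(K)]
    phi_hat = []
    for k in range(K):
        total = sum(word_counts[k].values())  # total words in category k
        phi_hat.append([(word_counts[k][v] + beta - 1) / (total + V * (beta - 1))
                        for v in range(V)])
    return pi_hat, phi_hat

docs = [(0, [0, 1, 0]), (1, [2, 3, 3]), (0, [0, 0, 1])]
pi_hat, phi_hat = map_estimates(docs, K=2, V=4)
```

With the default $\alpha = \beta = 2$, every category and word gets one phantom observation, so no estimated probability is ever zero.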
Inference with Point Estimates
With point estimates $(\hat{\pi}, \hat{\phi})$, inference is simple for a sequence of words $w = w_1, \ldots, w_M$. The probability of category $k$ given our parameters is:

$p(k \mid w, \hat{\pi}, \hat{\phi}) \propto \hat{\pi}_k \prod_{m=1}^{M} \hat{\phi}_{k,w_m}$
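In practice that product is computed in log space to avoid underflow on long documents; a minimal sketch, with toy parameter values assumed:

```python
import math

def classify(words, pi_hat, phi_hat):
    """Return normalized p(k | words) for each category k."""
    log_scores = [math.log(pi_hat[k])
                  + sum(math.log(phi_hat[k][w]) for w in words)
                  for k in range(len(pi_hat))]
    m = max(log_scores)
    unnorm = [math.exp(s - m) for s in log_scores]  # stabilize before exp
    z = sum(unnorm)
    return [u / z for u in unnorm]

pi_hat = [0.6, 0.4]
phi_hat = [[0.5, 0.3, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
probs = classify([0, 0, 1], pi_hat, phi_hat)
```

Since the words `[0, 0, 1]` are far more probable under category 0’s distribution, the posterior mass here concentrates on category 0.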
Bayesian Posteriors for Naive Bayes
Because the Dirichlet is conjugate to the discrete (or multinomial) distribution, the posterior parameter distributions $p(\pi \mid c, \alpha)$ and $p(\phi_k \mid c, w, \beta)$ given training data $c, w$ have the closed form solutions:

$\pi \mid c, \alpha \sim \mbox{Dirichlet}(\alpha + \mbox{count}(1), \ldots, \alpha + \mbox{count}(K))$

$\phi_k \mid c, w, \beta \sim \mbox{Dirichlet}(\beta + \mbox{count}(k,1), \ldots, \beta + \mbox{count}(k,V))$
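In code, conjugacy means the posterior Dirichlet parameters are just the prior parameters plus the observed counts; the toy two-document corpus below is assumed for illustration:

```python
from collections import Counter

def posterior_params(docs, K, V, alpha=1.0, beta=1.0):
    """Return Dirichlet posterior parameters for pi and for each phi_k."""
    cat_counts = Counter(c for c, _ in docs)
    word_counts = [Counter() for _ in range(K)]
    for c, words in docs:
        word_counts[c].update(words)
    pi_post = [alpha + cat_counts[k] for k in range(K)]
    phi_post = [[beta + word_counts[k][v] for v in range(V)]
                for k in range(K)]
    return pi_post, phi_post

docs = [(0, [0, 1, 0]), (1, [2, 3, 3])]
pi_post, phi_post = posterior_params(docs, K=2, V=4)
```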
Bayesian Inference for Naive Bayes
It’s easy to convert the point estimates used here to full Bayesian inference, thus creating a Bayesian version of naive Bayes. (The naivete, by the way, refers to the independence assumption over words, not the point estimates.)
We just replace the point estimates with integrals over their posteriors. For a new document $w' = w'_1, \ldots, w'_M$:

$p(k \mid w', c, w) = \int \!\! \int p(k \mid w', \pi, \phi) \ p(\pi, \phi \mid c, w, \alpha, \beta) \ d\pi \ d\phi$

where, as before,

$p(k \mid w', \pi, \phi) \propto \pi_k \prod_{m=1}^{M} \phi_{k,w'_m}$
As usual, the integrals could be evaluated by Gibbs sampling. Or the integrals could be factored into Dirichlet-Multinomial form and solved analytically, along the same lines as for the Beta-Binomial model.
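One simple way to approximate the integral is plain Monte Carlo: draw $(\pi, \phi)$ repeatedly from their Dirichlet posteriors and average the resulting category probabilities. This is a sketch, not Gibbs sampling proper, and the posterior counts in the usage line are made up:

```python
import math
import random

random.seed(0)

def sample_dirichlet(alphas):
    """Draw from Dirichlet(alphas) by normalizing Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def bayes_predict(words, pi_posterior, phi_posteriors, num_samples=2000):
    """Average p(k | words, pi, phi) over posterior draws of (pi, phi)."""
    K = len(pi_posterior)
    avg = [0.0] * K
    for _ in range(num_samples):
        pi = sample_dirichlet(pi_posterior)
        phi = [sample_dirichlet(phi_posteriors[k]) for k in range(K)]
        scores = [pi[k] * math.prod(phi[k][w] for w in words)
                  for k in range(K)]
        z = sum(scores)
        for k in range(K):
            avg[k] += scores[k] / z
    return [a / num_samples for a in avg]

# Illustrative posterior counts (prior + observed): 2 categories, 4 words.
probs = bayes_predict([0, 0, 1], [3.0, 2.0], [[5, 3, 2, 2], [2, 2, 5, 5]])
```

Unlike plugging in a point estimate, each draw uses a slightly different $(\pi, \phi)$, so the averaged prediction reflects parameter uncertainty and tends to be less extreme.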
Has anyone evaluated this setup for text classification? I’m thinking it might help with the over-attenuation often found in naive Bayes by allowing a more dispersed posterior compared to the point estimates.