Ironically, naive Bayes, as standardly presented, is not Bayesian. At least not in the sense that we estimate a posterior distribution over parameters $p(\theta \mid y)$ given data $y$ and use it for predictive inference for new data $\tilde{y}$.
What is Naive Bayes?
The naive Bayes model assumes that a corpus of documents is generated by selecting a category for each document and then generating the words of that document independently based on a category-specific distribution over words. The naivete in question is in assuming the words are independent, an assumption that’s clearly violated by natural language texts.
In symbols, if we have $J$ documents, we assign document $j$ to category $c_j \in 1{:}K$ based on a discrete distribution $\pi$ over categories:

$c_j \sim \mathsf{Discrete}(\pi)$ for $j \in 1{:}J$.

We then generate the $N_j$ words $w_{j,1}, \ldots, w_{j,N_j}$ for document $j$ according to a discrete distribution $\phi_{c_j}$ over words for category $c_j$:

$w_{j,n} \sim \mathsf{Discrete}(\phi_{c_j})$ for $n \in 1{:}N_j$.
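To make the notation concrete, here’s a minimal sketch of the generative story in Python with NumPy. The category count K, vocabulary size V, and uniform hyperparameters are placeholders, not anything tied to a real corpus.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 3, 1000                            # number of categories and vocabulary size
pi = rng.dirichlet(np.ones(K))            # discrete distribution over categories
phi = rng.dirichlet(np.ones(V), size=K)   # one word distribution per category

def generate_document(n_words):
    """Pick a category, then draw each word independently from that
    category's word distribution (the 'naive' independence assumption)."""
    c = rng.choice(K, p=pi)
    words = rng.choice(V, size=n_words, p=phi[c])
    return c, words

category, words = generate_document(50)
```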
Naive Bayes Priors
Given a corpus of training documents, we typically set $\pi$ and $\phi$ by maximum likelihood or additive smoothing. These are both maximum a posteriori (MAP) estimates given (uniform) Dirichlet priors.

Specifically, we assume prior category counts (plus 1) $\alpha$ and prior word counts (plus 1) $\beta$:

$\pi \sim \mathsf{Dirichlet}(\alpha)$

$\phi_k \sim \mathsf{Dirichlet}(\beta)$ for $k \in 1{:}K$.

As is clear from the posteriors defined in the next section, if we set $\alpha = \beta = 1$, the MAP parameter estimates are the maximum likelihood estimates (MLE). If $\alpha = \beta = 2$, we have what’s known as “add 1 smoothing” or “Laplace smoothing”.
Maximum a Posteriori Estimates
The maximum a posteriori (MAP) estimates $(\pi^*, \phi^*)$ given priors $\alpha, \beta$ and data $c, w$ are:

$(\pi^*, \phi^*) = \arg\max_{\pi, \phi} \; p(\pi, \phi \mid c, w, \alpha, \beta).$

The MAP parameter estimates are calculated using our old friend additive smoothing:

$\pi^*_k = \dfrac{\mathrm{count}(k) + \alpha - 1}{J + K(\alpha - 1)}$

$\phi^*_{k,v} = \dfrac{\mathrm{count}(k,v) + \beta - 1}{\sum_{v'=1}^V \mathrm{count}(k,v') + V(\beta - 1)}$

where $\mathrm{count}(k)$ is the number of training documents assigned to category $k$ and $\mathrm{count}(k,v)$ is the number of times word $v$ appears in those documents.

Note the subtraction of one from the prior counts; Dirichlet distributions are parameterized with $\alpha$ and $\beta$ being prior counts plus one. If $\alpha = \beta = 1$, that’s a prior count of zero for categories and words, and hence the MAP estimates reduce to maximum likelihood.
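Here’s what the additive-smoothing arithmetic looks like in code, assuming each training document is represented as a category label plus a list of integer word IDs. It’s only a sketch, but note that alpha = beta = 1 gives the MLE and alpha = beta = 2 gives add-1 smoothing, as above.

```python
import numpy as np

def map_estimates(doc_cats, doc_words, K, V, alpha=2.0, beta=2.0):
    """MAP estimates of pi and phi under symmetric Dirichlet(alpha) and
    Dirichlet(beta) priors.  alpha = beta = 1 gives maximum likelihood;
    alpha = beta = 2 gives add-1 (Laplace) smoothing."""
    cat_counts = np.zeros(K)
    word_counts = np.zeros((K, V))
    for c, words in zip(doc_cats, doc_words):
        cat_counts[c] += 1
        for w in words:
            word_counts[c, w] += 1
    pi_hat = (cat_counts + alpha - 1) / (len(doc_cats) + K * (alpha - 1))
    phi_hat = (word_counts + beta - 1) / (
        word_counts.sum(axis=1, keepdims=True) + V * (beta - 1))
    return pi_hat, phi_hat
```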
Inference with Point Estimates
With point estimates $\pi^*$ and $\phi^*$, inference is simple for a new document consisting of a sequence of $N$ words $w_1, \ldots, w_N$. The probability of category $k$ given our parameters is:

$p(k \mid w, \pi^*, \phi^*) = \dfrac{\pi^*_k \prod_{n=1}^N \phi^*_{k,w_n}}{\sum_{k'=1}^K \pi^*_{k'} \prod_{n=1}^N \phi^*_{k',w_n}}$
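In code, the point-estimate classifier is just a few lines; working in log space avoids underflow on long documents. This sketch assumes the pi_hat and phi_hat arrays from the previous snippet.

```python
import numpy as np

def classify_point(words, pi_hat, phi_hat):
    """Category probabilities for a word sequence under the point estimates,
    computed in log space to avoid underflow on long documents."""
    words = np.asarray(words)
    log_joint = np.log(pi_hat) + np.log(phi_hat[:, words]).sum(axis=1)
    log_joint -= log_joint.max()          # stabilize before exponentiating
    probs = np.exp(log_joint)
    return probs / probs.sum()
```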
Bayesian Posteriors for Naive Bayes
Because the Dirichlet is conjugate to the discrete (or multinomial) distribution, the posterior parameter distributions $p(\pi \mid c, \alpha)$ and $p(\phi_k \mid w, c, \beta)$ given training data $c, w$ have the closed-form solutions:

$p(\pi \mid c, \alpha) = \mathsf{Dirichlet}(\pi \mid \alpha')$, where $\alpha'_k = \alpha + \mathrm{count}(k)$, and

$p(\phi_k \mid w, c, \beta) = \mathsf{Dirichlet}(\phi_k \mid \beta'_k)$, where $\beta'_{k,v} = \beta + \mathrm{count}(k,v)$.
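Because the update is just “prior parameters plus observed counts,” computing the posterior Dirichlet parameters is trivial. Here’s a sketch, again assuming symmetric scalar priors and documents as lists of word IDs.

```python
import numpy as np

def dirichlet_posteriors(doc_cats, doc_words, K, V, alpha=1.0, beta=1.0):
    """Posterior Dirichlet parameters: the (symmetric) prior parameters
    plus the observed category and word counts."""
    alpha_post = np.full(K, alpha)
    beta_post = np.full((K, V), beta)
    for c, words in zip(doc_cats, doc_words):
        alpha_post[c] += 1
        for w in words:
            beta_post[c, w] += 1
    return alpha_post, beta_post
```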
Bayesian Inference for Naive Bayes
It’s easy to convert the point estimates used here to full Bayesian inference, thus creating a Bayesian version of naive Bayes. (The naivete, by the way, refers to the independence assumption over words, not the point estimates.)
We just replace the point estimates with integrals over their posteriors:

$p(k \mid w, c, \alpha, \beta) \;\propto\; \int\!\!\int \pi_k \prod_{n=1}^N \phi_{k,w_n} \; p(\pi \mid c, \alpha) \; p(\phi_k \mid w, c, \beta) \; d\phi_k \, d\pi,$

where, as before, $p(\pi \mid c, \alpha) = \mathsf{Dirichlet}(\pi \mid \alpha')$ and $p(\phi_k \mid w, c, \beta) = \mathsf{Dirichlet}(\phi_k \mid \beta'_k)$.
As usual, the integrals could be evaluated by Gibbs sampling. Or the integrals could be factored into Dirichlet-Multinomial form and solved analytically, along the same lines as for the Beta-Binomial model.
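Here’s a sketch of the analytic route: the integral over $\pi$ reduces to the posterior mean of $\pi_k$, and the integral over $\phi_k$ has the closed Dirichlet-multinomial form, which can be evaluated with log-gamma functions. It assumes the alpha_post and beta_post arrays from the previous snippet and hasn’t been tuned or evaluated against anything.

```python
import numpy as np
from scipy.special import gammaln

def classify_bayes(words, alpha_post, beta_post):
    """Category probabilities for a new document with pi and phi integrated
    out analytically.  The pi integral is the posterior mean of pi_k; the
    phi integral is the Dirichlet-multinomial compound, evaluated with
    log-gamma functions for numerical stability."""
    K, V = beta_post.shape
    m = np.bincount(np.asarray(words), minlength=V)  # word counts in the new document
    n = m.sum()
    log_prior = np.log(alpha_post) - np.log(alpha_post.sum())
    log_like = (gammaln(beta_post.sum(axis=1))
                - gammaln(beta_post.sum(axis=1) + n)
                + (gammaln(beta_post + m) - gammaln(beta_post)).sum(axis=1))
    log_joint = log_prior + log_like
    log_joint -= log_joint.max()
    probs = np.exp(log_joint)
    return probs / probs.sum()
```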
Evaluation?
Has anyone evaluated this setup for text classification? I’m thinking it might help with the over-attenuation often found in naive Bayes by allowing a more dispersed posterior compared to the point estimates.