Bayesian Naive Bayes, aka Dirichlet-Multinomial Classifiers

Ironically, naive Bayes, as standardly presented, is not Bayesian. At least not in the sense that we estimate a posterior distribution p(\theta|y) over parameters \theta given data y and use it for predictive inference over new data \tilde{y}.

What is Naive Bayes?

The naive Bayes model assumes that a corpus of documents is generated by selecting a category for each document and then generating the words of that document independently based on a category-specific distribution. The naivete in question lies in assuming the words are independent, an assumption that’s clearly violated by natural language texts.

In symbols, if we have N documents, we assign document n to category c_n based on a discrete distribution \theta over categories:

c_n \sim \mbox{\sf Discrete}(\theta) for 1 \leq n \leq N

We then generate I_n words y_{n,i} for document n according to a discrete distribution \phi_{c_n} over words for category c_n:

y_{n,i} \sim \mbox{\sf Discrete}(\phi_{c_n}) for 1 \leq n \leq N, 1 \leq i \leq I_n
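
To make the generative story concrete, here’s a minimal simulation sketch in Python/NumPy (not from the original post); the sizes C, V, N, the seed, and the Poisson document lengths are arbitrary choices for illustration, not part of the model.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical sizes for illustration: C categories, V word types, N documents.
    C, V, N = 3, 1000, 5

    theta = rng.dirichlet(np.ones(C))            # category distribution theta
    phi = rng.dirichlet(np.ones(V), size=C)      # per-category word distributions phi_c

    cats, docs = [], []
    for n in range(N):
        c_n = rng.choice(C, p=theta)             # c_n ~ Discrete(theta)
        I_n = 1 + rng.poisson(50)                # document length (not part of the model)
        y_n = rng.choice(V, size=I_n, p=phi[c_n])  # y_{n,i} ~ Discrete(phi_{c_n})
        cats.append(c_n)
        docs.append(y_n)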

Naive Bayes Priors

Given a corpus y of training documents, we typically set \theta and \phi by maximum likelihood or additive smoothing. These are both maximum a posteriori (MAP) estimates given (uniform) Dirichlet priors.

Specifically, we assume prior category counts (plus 1) \alpha and prior word counts (plus 1) \beta:

\theta \sim \mbox{\sf Dirichlet}(\alpha)

\phi_c \sim \mbox{\sf Dirichlet}(\beta) for 1 \leq c \leq C

As is clear from the posteriors defined in the next section, if we set \alpha=1, \beta=1, the MAP parameter estimates are the maximum likelihood estimates (MLE). If \alpha=2, \beta=2 we have what’s known as “add 1 smoothing” or “Laplace smoothing”.

Maximum a Posteriori Estimates

The maximum a posteriori (MAP) estimates \theta^*, \phi^* given priors \alpha,\beta and data y are:

\theta^*,\phi^* = \arg\max_{\theta,\phi} p(\theta,\phi|y,\alpha,\beta) = \arg\max_{\theta,\phi} p(y|\theta,\phi) \, p(\theta|\alpha) \, p(\phi|\beta)

The MAP parameter estimates are calculated using our old friend additive smoothing:

\theta^*_c \propto (\alpha-1) + \sum_{n = 1}^N \mbox{I}(c_n = c)

\phi^*_{c,w} \propto (\beta - 1) + \sum_{n = 1}^N \sum_{i = 1}^{I_n} \mbox{I}(c_n=c) \, \mbox{I}(y_{n,i} = w)

Note the subtraction of one from the prior counts; Dirichlet distributions are parameterized with \alpha and \beta being prior counts plus one. If \alpha=\beta=1, that’s a prior count of zero for categories and words, and the MAP estimates reduce to the maximum likelihood estimates.
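
As a sketch of the additive-smoothing computation (assuming symmetric scalar priors and the hypothetical docs and cats arrays from the simulation sketch above, not any particular library’s implementation):

    import numpy as np

    def map_estimates(docs, cats, C, V, alpha=2.0, beta=2.0):
        # Symmetric Dirichlet(alpha) and Dirichlet(beta) priors.
        # alpha = beta = 2 gives add-1 (Laplace) smoothing;
        # alpha = beta = 1 recovers maximum likelihood.
        theta_star = np.full(C, alpha - 1.0)
        phi_star = np.full((C, V), beta - 1.0)
        for y_n, c_n in zip(docs, cats):
            theta_star[c_n] += 1                      # category count
            np.add.at(phi_star[c_n], y_n, 1)          # word counts for category c_n
        theta_star /= theta_star.sum()
        phi_star /= phi_star.sum(axis=1, keepdims=True)
        return theta_star, phi_star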

Inference with Point Estimates

With point estimates, inference is simple for a sequence w of M words. The probability of category c given our parameters is:

p(c|w,\theta^*,\phi^*) \propto p(c|\theta^*) \prod_{m = 1}^M p(w_m|\phi_c^*)

p(c|\theta^*) = \mbox{\sf Discrete}(c|\theta^*) = \theta_c^*

p(w_m|\phi_c^*) = \mbox{\sf Discrete}(w_m|\phi_c^*) = \phi_{c,w_m}^*
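
In code, this is just a sum of log probabilities normalized at the end; the log-space arithmetic is a numerical-stability choice, not part of the model. A sketch, continuing with the hypothetical arrays above:

    import numpy as np

    def classify_map(w, theta_star, phi_star):
        # Posterior over categories for a test document w (array of word indices),
        # computed in log space and normalized to resolve the proportionality.
        log_post = np.log(theta_star) + np.log(phi_star[:, w]).sum(axis=1)
        log_post -= log_post.max()                    # guard against underflow
        post = np.exp(log_post)
        return post / post.sum()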

Bayesian Posteriors for Naive Bayes

Because the Dirichlet is conjugate to the discrete (or multinomial) distribution, the posterior parameter distributions p(\theta|y) and p(\phi|y) given training data y have the closed form solutions:

p(\theta|y) = \mbox{\sf Dirichlet}(\theta|\alpha'), where

\alpha'_c = \alpha + \sum_{n = 1}^N \mbox{I}(c_n = c)

and

p(\phi_c|y) = \mbox{\sf Dirichlet}(\phi_c|\beta'_c), where

\beta'_{c,w} = \beta + \sum_{n = 1}^N \sum_{i = 1}^{I_n} \mbox{I}(c_n=c) \, \mbox{I}(y_{n,i} = w)
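
The conjugate updates are just observed counts added to the prior parameters. A minimal sketch, again assuming symmetric scalar priors and the hypothetical docs and cats arrays from above:

    import numpy as np

    def posterior_params(docs, cats, C, V, alpha=1.0, beta=1.0):
        # alpha' and beta' are the prior parameters plus the observed counts.
        alpha_post = np.full(C, alpha)
        beta_post = np.full((C, V), beta)
        for y_n, c_n in zip(docs, cats):
            alpha_post[c_n] += 1
            np.add.at(beta_post[c_n], y_n, 1)
        return alpha_post, beta_post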

Bayesian Inference for Naive Bayes

It’s easy to convert the point-estimates used here to full Bayesian inference, thus creating a Bayesian version of naive Bayes. (The naivete, by the way, refers to the independence assumption over words, not the point estimates.)

We just replace the point estimates \theta^*, \phi^* with integrals over their posteriors.

p(c|w,\alpha',\beta') \propto \int \int p(c|\theta) \, p(\theta|\alpha') \, p(w|\phi_c) \, p(\phi_c|\beta'_c) \, d\phi_c \, d\theta

where as before,

p(c|\theta) = \theta_c, and

p(w|\phi_c) = \prod_{m=1}^M p(w_m|\phi_c) = \prod_{m=1}^M \phi_{c,w_m}.

As usual, the integrals could be evaluated by Gibbs sampling, or they could be factored into Dirichlet-Multinomial form and solved analytically, along the same lines as for the Beta-Binomial model.
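
Spelling out the analytic route: the integral over \theta yields the posterior mean of \theta_c, and the integral over \phi_c yields the Dirichlet-multinomial marginal likelihood of the test document’s word counts:

\int \theta_c \, \mbox{\sf Dirichlet}(\theta|\alpha') \, d\theta = \alpha'_c / \sum_{c'} \alpha'_{c'}

\int \prod_{m=1}^M \phi_{c,w_m} \, \mbox{\sf Dirichlet}(\phi_c|\beta'_c) \, d\phi_c = \frac{\Gamma(\sum_w \beta'_{c,w})}{\Gamma(\sum_w \beta'_{c,w} + M)} \prod_w \frac{\Gamma(\beta'_{c,w} + M_w)}{\Gamma(\beta'_{c,w})}

where M_w is the number of times word w occurs in the test document and M = \sum_w M_w. A sketch of this computation in log space, using SciPy’s gammaln and assuming the output of the hypothetical posterior_params function above:

    import numpy as np
    from scipy.special import gammaln

    def classify_bayes(w, alpha_post, beta_post):
        # Analytic posterior predictive p(c|w); w is an array of word indices.
        C, V = beta_post.shape
        counts = np.bincount(w, minlength=V)          # M_w, word counts in the test document
        M = counts.sum()
        B = beta_post.sum(axis=1)                     # sum_w beta'_{c,w} for each category
        log_theta = np.log(alpha_post) - np.log(alpha_post.sum())
        log_words = (gammaln(B) - gammaln(B + M)
                     + (gammaln(beta_post + counts) - gammaln(beta_post)).sum(axis=1))
        log_post = log_theta + log_words
        log_post -= log_post.max()                    # guard against underflow
        post = np.exp(log_post)
        return post / post.sum()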

Evaluation?

Has anyone evaluated this setup for text classification? I’m thinking it might help with the over-attenuation often found in naive Bayes by allowing a more dispersed posterior compared to the point estimates.

