## Bayesian Naive Bayes, aka Dirichlet-Multinomial Classifiers

Ironically, naive Bayes, as standardly presented, is not Bayesian. At least not in the sense that we estimate a posterior distribution $p(\theta|y)$ over parameters $\theta$ given data $y$ and use it for predictive inference for new $\tilde{y}$.

### What is Naive Bayes?

The naive Bayes model assumes that a corpus of documents is generated by selecting a category for each document and then generating the words of that document independently from a category-specific distribution. The naivete in question is the assumption that the words are independent, an assumption that's clearly violated by natural language text.

In symbols, if we have $N$ documents, we assign document $n$ to category $c_n$ based on a discrete distribution $\theta$ over categories:

$c_n \sim \mbox{\sf Discrete}(\theta)$ for $1 \leq n \leq N$

We then generate $I_n$ words $y_{n,i}$ for document $n$ according to a discrete distribution $\phi_{c_n}$ over words for category $c_n$:

$y_{n,i} \sim \mbox{\sf Discrete}(\phi_{c_n})$ for $1 \leq n \leq N, 1 \leq i \leq I_n$
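The generative story above can be sketched in a few lines of Python (a toy example; the two categories, the four-word vocabulary, and the parameter values are made up for illustration):

```python
import random

# Hypothetical toy parameters: C = 2 categories, vocabulary of 4 words.
theta = [0.5, 0.5]                 # discrete distribution over categories
phi = [[0.7, 0.1, 0.1, 0.1],       # word distribution phi_0 for category 0
       [0.1, 0.1, 0.1, 0.7]]       # word distribution phi_1 for category 1

def generate_doc(num_words):
    # Draw the document's category c_n ~ Discrete(theta), then draw
    # num_words i.i.d. words y_{n,i} ~ Discrete(phi[c_n]).
    c = random.choices(range(len(theta)), weights=theta)[0]
    words = random.choices(range(len(phi[c])), weights=phi[c], k=num_words)
    return c, words

random.seed(0)
corpus = [generate_doc(10) for _ in range(5)]   # N = 5 docs, I_n = 10 words
```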

### Naive Bayes Priors

Given a corpus $y$ of training documents, we typically set $\theta$ and $\phi$ by maximum likelihood or additive smoothing. These are both maximum a posteriori (MAP) estimates given (uniform) Dirichlet priors.

Specifically, we assume prior category counts (plus 1) $\alpha$ and prior word counts (plus 1) $\beta$:

$\theta \sim \mbox{\sf Dirichlet}(\alpha)$

$\phi_c \sim \mbox{\sf Dirichlet}(\beta)$ for $1 \leq c \leq C$

As is clear from the posteriors defined in the next section, if we set $\alpha=1, \beta=1$, the MAP parameter estimates are the maximum likelihood estimates (MLE). If $\alpha=2, \beta=2$ we have what’s known as “add 1 smoothing” or “Laplace smoothing”.

### Maximum a Posteriori Estimates

The maximum a posteriori (MAP) estimates $\theta^*, \phi^*$ given priors $\alpha,\beta$ and data $y$ are:

$\theta^*,\phi^* = \arg\max_{\theta,\phi} p(\theta,\phi|y,\alpha,\beta) = \arg\max_{\theta,\phi} p(y|\theta,\phi) \, p(\theta|\alpha) \, p(\phi|\beta)$

The MAP parameter estimates are calculated using our old friend additive smoothing:

$\theta^*_c \propto (\alpha-1) + \sum_{n = 1}^N \mbox{I}(c_n = c)$

$\phi^*_{c,w} \propto (\beta - 1) + \sum_{n = 1}^N \sum_{i = 1}^{I_n} \mbox{I}(c_n=c) \, \mbox{I}(y_{n,i} = w)$

Note the subtraction of one from the prior counts; Dirichlet distributions are parameterized with $\alpha$ and $\beta$ being prior counts plus one. If $\alpha=\beta=1$, that's a prior count of zero for categories and words, and the MAP estimates reduce to the maximum likelihood estimates.
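The MAP/additive-smoothing estimates can be sketched in pure Python (the function name and the flat count-based data layout are my own choices for the example):

```python
def map_estimates(categories, docs, num_cats, vocab_size, alpha, beta):
    # Additive smoothing with prior count (alpha - 1) per category and
    # (beta - 1) per word: alpha = beta = 1 gives the MLE,
    # alpha = beta = 2 gives add-1 (Laplace) smoothing.
    cat_counts = [alpha - 1.0] * num_cats
    word_counts = [[beta - 1.0] * vocab_size for _ in range(num_cats)]
    for c, words in zip(categories, docs):
        cat_counts[c] += 1
        for w in words:
            word_counts[c][w] += 1
    # Normalize the smoothed counts into probability estimates.
    theta = [x / sum(cat_counts) for x in cat_counts]
    phi = [[x / sum(row) for x in row] for row in word_counts]
    return theta, phi
```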

### Inference with Point Estimates

With point estimates, inference is simple for a sequence $w$ of $M$ words. The probability of category $c$ given our parameters is:

$p(c|w,\theta^*,\phi^*) \propto p(c|\theta^*) \prod_{m = 1}^M p(w_m|\phi_c^*)$

$p(c|\theta^*) = \mbox{\sf Discrete}(c|\theta^*) = \theta_c^*$

$p(w_m|\phi_c^*) = \mbox{\sf Discrete}(w_m|\phi_c^*) = \phi_{c,w_m}^*$
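A minimal sketch of this point-estimate classification rule, working in log space to avoid underflow on long documents (the function name is made up for the example):

```python
import math

def classify(words, theta, phi):
    # p(c|w) proportional to theta_c * prod_m phi_{c, w_m}, computed in
    # log space and then renormalized over categories.
    log_probs = []
    for c in range(len(theta)):
        lp = math.log(theta[c]) + sum(math.log(phi[c][w]) for w in words)
        log_probs.append(lp)
    # Log-sum-exp normalization back to a distribution over categories.
    m = max(log_probs)
    exps = [math.exp(lp - m) for lp in log_probs]
    z = sum(exps)
    return [e / z for e in exps]
```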

### Bayesian Posteriors for Naive Bayes

Because the Dirichlet is conjugate to the discrete (or multinomial) distribution, the posterior parameter distributions $p(\theta|y)$ and $p(\phi|y)$ given training data $y$ have the closed form solutions:

$p(\theta|y) = \mbox{\sf Dirichlet}(\theta|\alpha')$, where

$\alpha'_c = \alpha + \sum_{n = 1}^N \mbox{I}(c_n = c)$

and

$p(\phi_c|y) = \mbox{\sf Dirichlet}(\phi_c|\beta')$, where

$\beta'_{c,w} = \beta + \sum_{n = 1}^N \sum_{i = 1}^{I_n} \mbox{I}(c_n=c) \, \mbox{I}(y_{n,i} = w)$
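The conjugate update is just "prior parameters plus observed counts," which makes for a very short sketch (again with illustrative function and variable names):

```python
def posterior_params(categories, docs, num_cats, vocab_size, alpha, beta):
    # Dirichlet-discrete conjugacy: the posterior Dirichlet parameters
    # are the prior parameters plus the counts observed in training data.
    alpha_post = [alpha] * num_cats
    beta_post = [[beta] * vocab_size for _ in range(num_cats)]
    for c, words in zip(categories, docs):
        alpha_post[c] += 1            # category count for alpha'_c
        for w in words:
            beta_post[c][w] += 1      # word count for beta'_{c,w}
    return alpha_post, beta_post
```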

### Bayesian Inference for Naive Bayes

It’s easy to convert the point estimates used here to full Bayesian inference, thus creating a Bayesian version of naive Bayes. (The naivete, by the way, refers to the independence assumption over words, not the point estimates.)

We just replace the point estimates $\theta^*, \phi^*$ with integrals over their posteriors.

$p(c|w,\alpha',\beta') = \int \int p(c|\theta) \, p(\theta|\alpha') \, p(w|\phi_c) \, p(\phi_c|\beta') \, d\phi_c \, d\theta$

where as before,

$p(c|\theta) = \theta_c$, and

$p(w|\phi_c) = \prod_{m=1}^M p(w_m|\phi_c) = \prod_{m=1}^M \phi_{c,w_m}$.

As usual, the integrals could be evaluated by Gibbs sampling. Or the integrals could be factored into Dirichlet-Multinomial form and solved analytically, along the same lines as for the Beta-Binomial model.
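The analytic route mentioned above can be sketched directly: the integral over $\theta$ gives $\alpha'_c / \sum_{c'} \alpha'_{c'}$, and the integral over $\phi_c$ is a ratio of Dirichlet normalizers, computed here with log-gamma functions for stability. This is a sketch under those standard Dirichlet-multinomial identities, not an implementation from the original post:

```python
import math
from collections import Counter

def bayes_predict(words, alpha_post, beta_post):
    # Analytic Bayesian predictive: integrate out theta and phi_c under
    # their posterior Dirichlets.  The theta integral yields
    # alpha'_c / sum(alpha'); the phi_c integral is the
    # Dirichlet-multinomial compound, a ratio of gamma functions.
    counts = Counter(words)
    M = len(words)
    alpha_sum = sum(alpha_post)
    log_probs = []
    for c, beta_c in enumerate(beta_post):
        lp = math.log(alpha_post[c] / alpha_sum)
        beta_sum = sum(beta_c)
        lp += math.lgamma(beta_sum) - math.lgamma(beta_sum + M)
        for w, n in counts.items():
            lp += math.lgamma(beta_c[w] + n) - math.lgamma(beta_c[w])
        log_probs.append(lp)
    m = max(log_probs)
    exps = [math.exp(lp - m) for lp in log_probs]
    z = sum(exps)
    return [e / z for e in exps]
```

Compared with the point-estimate rule, this averages over all plausible parameter values, so categories with little training data get appropriately less confident predictions.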

### Evaluation?

Has anyone evaluated this setup for text classification? I’m thinking it might help with the over-attenuation often found in naive Bayes by allowing a more dispersed posterior compared to the point estimates.
