## Naive Bayes, Binomials, and Bags of Words

In the last blog post, “With Bayes’s Rule, it’s Elementary, says Sherlock”, Sherlock explicitly said that when Mr. Green laughs, 80% of the words are “haw” and 20% are “hee”.

This can’t mean every laugh utterance has exactly 80% “haw”s. With only three words it’s impossible to have an 80/20 balance. It also doesn’t mean that if he’s uttered 4 “haw”s in a row, the next word has to be a “hee”. Instead, the idea is that each time Mr. Green laughs, he picks the next word to utter with an 80% probability of “haw” and a 20% probability of “hee”, no matter what he’s uttered already.

Sherlock makes a strong independence assumption according to which each word in an utterance is generated independently. This allows us to factor

```
p(hee haw haw|Mr) = p(hee|Mr) * p(haw|Mr) * p(haw|Mr)
```

With this representation, it’s clear that

```
p(hee haw haw|Mr) = p(haw hee haw|Mr) = p(haw haw hee|Mr)
```

Order doesn’t matter, only the counts, which in this case is 2 “haw”s and 1 “hee”.
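To make the factorization concrete, here is a minimal Python sketch using the 80/20 figures above (the names `p_mr` and `p_utterance` are my own):

```python
# Mr. Green's per-word laugh probabilities, from the 80/20 split above.
p_mr = {"haw": 0.8, "hee": 0.2}

def p_utterance(words, p):
    """Probability of an utterance under the independence assumption:
    just the product of the per-word probabilities."""
    prob = 1.0
    for w in words:
        prob *= p[w]
    return prob

# Every ordering of 2 "haw"s and 1 "hee" gets the same probability,
# .2 * .8 * .8 = 16/125, because multiplication doesn't care about order.
p1 = p_utterance(["hee", "haw", "haw"], p_mr)
p2 = p_utterance(["haw", "haw", "hee"], p_mr)
```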

Real language is more subtle. If I say “car” in a conversation, I’m more likely to say “driver” or “gas” than if I hadn’t said “car”. Also, position matters: “Mrs. Green likes Mr. Green” is a different statement than “Mr. Green likes Mrs. Green”, not to mention the word salad “Green Mrs. Mr. likes Green”. That’s why they call strong independence assumptions “naive” — they’re usually not met by reality.

With naive Bayes, the probability of an utterance depends only on the counts of the words, not their order. In this sense, naive Bayes uses what is known as the “bag of words” representation. A bag of words contains a (non-negative, integer) count for each word, but does not represent order.

We can calculate the likelihood that Mr. Green utters 0, 1, 2, or 3 “hee”s when uttering a 3-word laugh. We just add up the probabilities of all the different permutations. That is:

```
p(0|Mr) = p(haw haw haw|Mr) = .8 * .8 * .8 = 64/125
p(1|Mr) = p(hee haw haw|Mr) + p(haw hee haw|Mr) + p(haw haw hee|Mr)
        = .2 * .8 * .8 + .8 * .2 * .8 + .8 * .8 * .2 = 48/125
p(2|Mr) = .2 * .2 * .8 + .2 * .8 * .2 + .8 * .2 * .2 = 12/125
p(3|Mr) = .2 * .2 * .2 = 1/125
```
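The sum over permutations can be checked mechanically. This sketch enumerates all eight 3-word laughs and groups their probabilities by the number of “hee”s, using Mr. Green’s 80/20 figures:

```python
from itertools import product

p_mr = {"haw": 0.8, "hee": 0.2}  # Mr. Green's word probabilities

# Accumulate the probability of every 3-word laugh, keyed by its "hee" count.
by_count = {n: 0.0 for n in range(4)}
for laugh in product(["hee", "haw"], repeat=3):
    prob = 1.0
    for w in laugh:
        prob *= p_mr[w]
    by_count[laugh.count("hee")] += prob
```

Running this reproduces 64/125, 48/125, 12/125, and 1/125 for 0 through 3 “hee”s.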

The same logic applies to Mrs. Green, who utters 60% “hee”s and 40% “haw”s. We summarize the probabilities of the various counts in the following table:

| # “hee”s | p(# “hee”s\|Mr) | p(# “hee”s\|Mrs) | p(Mr\|# “hee”s) |
| --- | --- | --- | --- |
| 0 | 1 * 64/125 | 1 * 8/125 | 8/9 |
| 1 | 3 * 16/125 | 3 * 12/125 | 4/7 |
| 2 | 3 * 4/125 | 3 * 18/125 | 2/11 |
| 3 | 1 * 1/125 | 1 * 27/125 | 1/28 |

The binomial coefficient (M choose N) is a mathematical expression standing for how many different ways there are to order M objects consisting of N objects of one type (e.g. “hee”) and (M-N) objects of the other type (e.g. “haw”). The general formula is:

```
(M choose N) = M! / [ N! * (M-N)! ]
```

Recall that the factorial is defined by `M! = M * (M-1) * (M-2) * ... * 2 * 1`. Working this out for our case of a three-word laugh, (3 choose 0) = (3 choose 3) = 1, whereas (3 choose 1) = (3 choose 2) = 3.
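Python’s standard library exposes both pieces, so the three-word case can be checked directly (`math.comb` computes (M choose N)):

```python
from math import comb, factorial

# (3 choose N): the number of ways to place N "hee"s in a 3-word laugh.
assert comb(3, 0) == comb(3, 3) == 1
assert comb(3, 1) == comb(3, 2) == 3

# math.comb agrees with the factorial formula M! / [N! * (M-N)!].
m, n = 3, 1
assert comb(m, n) == factorial(m) // (factorial(n) * factorial(m - n))
```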

Now what’s the probability that Mr. Green utters 1 “hee” in a 3-word laugh? We just multiply the probability of any one of the orderings by the number of variants. That is, `p(hee haw haw|Mr) = .2 * .8 * .8` times `(3 choose 1) = 3`, for a total of `3 * .2 * .8 * .8`. That’s what’s shown in the table.

In general, we have the following formula:

```
p(N successes|M trials, p chance of success)
  = Binomial(N|p,M)
  = (M choose N) * p^N * (1-p)^(M-N)
```

This is known as the binomial distribution over the non-negative integers (though all values above M have probability zero).
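The formula translates directly into Python (the function name `binomial_pmf` is my own):

```python
from math import comb

def binomial_pmf(n, p, m):
    """Probability of n successes in m independent trials,
    each succeeding with probability p."""
    return comb(m, n) * p**n * (1 - p)**(m - n)

# The four "hee" counts for Mr. Green (p = .2 per "hee", m = 3 words)
# exhaust all possible outcomes, so their probabilities sum to one.
total = sum(binomial_pmf(n, 0.2, 3) for n in range(4))
```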

It turns out we don’t even need to know the order of the “hee”s and “haw”s to estimate who’s laughing. Going back to our example of “hee haw haw”, we see that it contains 1 “hee” and 2 “haw”s. We have:

```
p(1 "hee"|3 words, Mr)  = Binom(1|3,.2) = .38
p(1 "hee"|3 words, Mrs) = Binom(1|3,.6) = .29
```

Applying Bayes’s rule as we did before, only now with counts rather than orderings, we have:

```
p(Mr|1 "hee", 3 words)
  = p(1 "hee"|3 words, Mr) * p(Mr) / p(1 "hee"|3 words)
  = p(1 "hee"|3 words, Mr) * p(Mr)
    / [ p(1 "hee"|3 words, Mr) * p(Mr)
      + p(1 "hee"|3 words, Mrs) * p(Mrs) ]
```

All of the binomial coefficients drop out of the equation, as there’s a (3 choose 1) factor in all of the `p(1 "hee"|3 words, X)` terms. The bottom line is that the naive Bayes model, which models order by ignoring it, and the binomial model, which explicitly models only the counts, lead to the same posterior probability estimate for who’s laughing.
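Here is that cancellation checked numerically, assuming equal priors p(Mr) = p(Mrs) = 1/2, consistent with the posterior column of the table above. The posterior is computed once with the binomial coefficient and once without it:

```python
from math import comb

def binomial_pmf(n, p, m):
    # (m choose n) * p^n * (1-p)^(m-n)
    return comb(m, n) * p**n * (1 - p)**(m - n)

def posterior_mr(like_mr, like_mrs, prior_mr=0.5):
    # Bayes's rule for two hypotheses, with p(Mrs) = 1 - p(Mr).
    return like_mr * prior_mr / (like_mr * prior_mr + like_mrs * (1 - prior_mr))

# With the (3 choose 1) coefficient in both likelihoods...
with_coef = posterior_mr(binomial_pmf(1, 0.2, 3), binomial_pmf(1, 0.6, 3))
# ...and without it: the shared coefficient cancels from the ratio.
without_coef = posterior_mr(0.2 * 0.8**2, 0.6 * 0.4**2)
```

Both calculations give 4/7, matching the table.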

Other probability models, such as logistic regression, can also use a bag of words representation; but logistic regression is not naive in that it doesn’t assume the words are independent.