It’s easy to convert naive Bayes into a linear classifier, then reduce the number of coefficient vectors to the number of categories minus 1. We’ll work through the binomial (two category) case here, where we can get away with a single parameter vector. The general case is the same with more subscripts.

First, let’s express naive Bayes as a linear classifier. Suppose we have a binary naive Bayes text classifier over tokens:

The typical parameterization is with a coefficient vector for each category:

Naive Bayes represents documents as bags of words, which provide a count for each word in the document. Specifically, an input document represented as a vector of token counts, with a constant zero dimension set to one for an intercept:

Then the joint probability of a category and a document is given by:

Thus it’s essentially a multinomial classifier.

Now suppose we only have two categories, 0 and 1. The conditional probabilities are computed in the usual way by marginalization:

Now suppose we define a new feature vector:

and look at its behavior when multiplied by an document token vector:

we see that it produces the log odds. Noting that the two conditionals add to 1:

we contiue the expansion to derive:

so that:

Therefore, we only need to store the single parameter vector and multiply it by our document count vector. That neatly cuts our storage and the number of arithmetic operations in half.

If we have K dimensions, we can get away with (K-1) feature vectors by subtracting the last feature vector from all the previous vectors. The time and space saving are 1/K.

In practice, there’s no need to allocate a vector of token counts. Instead, as each token is streamed out of the tokenizer, its coefficient is looked up and added to the running total, which is initialized with the category log probability difference (aka the intercept).

This recasting of naive Bayes as a linear classifier is not novel. The question becomes how best to choose the coefficient vector. Naive Bayes tends to be outperformed in terms of accuracy by discriminative methods like logistic regression or linear support vector machines, but these methods are orders of magnitude more expensive to train. There’s a good discussion of discriminative versus generative estimation of linear classifier coefficients in:

- Klein, Dan and Christopher D. Manning. 2002. Conditional structure versus conditional estimation in NLP models. In
*Proceedings of EMNLP*.

Note: Klein and Manning use a vector per category, as is traditional in natural language and machine learning approaches to both naive Bayes and logistic regression.

February 5, 2009 at 9:21 pm |

Hi,

I follow you on Twitter but 140 chars is short! I have some questions about using LingPipe to cluster questions and answers. I need to find similarities and extract formulaic language and such things.

Could you give me a hand with how I can use LingPipe for that? I can give you more information of course.

Thanks,

cj

February 6, 2009 at 12:02 pm |

String similarity, we can do. General boilerplate extraction, or content zoning, is something we’ve only played with. A good solution to the extraction problem requires 2D rendering of HTML, for instance.

Basic similarity detection in terms of tilings is something that’s very easy to implement with the Karp-Rabin algorithm. I’ve been looking for an excuse to add this functionality to LingPipe.

Our mailing list’s the best place to post questions you don’t mind others seeing:

http://groups.yahoo.com/group/LingPipe/

Otherwise, write to me directly (carp@alias-i.com) or to our LingPipe mail (lingpipe@alias-i.com), which others see, too.

PS: When you contact me another way, I’m going to remove these comments from the blog, where they’re out of place.

February 6, 2009 at 4:47 pm |

I should’ve also mentioned in the post that this only cuts down on the size of the vectors being stored and the amount of multiplication/addition in the arithmetic computation of the probability.

What remains is the storage space for the symbols (map from strings to vectors), which is going to dominate the size of the vectors themselves. That is, a single entry in a map from a string to a double is a fair bit larger than the double value that’s saved.

Also, looking up symbols in a hash map will dominate time. For each string, a hash code is computed by walking over all the characters, doing shifts, multiplies and adds for each one, followed by a lookup, which likely has to do a character-by-character string comparison.

As we’ve seen before, feature extraction by the tokenizer and storing the set of known symbols will still dominate computation.