So what do you do when a blog doesn’t allow comments? Comment from your own blog. The BioNLP mailing list had a link to:

It’s apparently posted by an M.S. student at U. Washington, but there’s no further evidence, even through whois, of who that is. And there are no comments allowed!

The perspective of a student learning computational linguistics is interesting. I was particularly struck by the absurdity of this year-old post:

- Emerging CL. 3 February 2008. Comparing Bernoulli and Multinomial [Classifiers]

It’s a classic “let’s compare two approaches” post, here covering the two simplest linear classifiers: Bernoulli (each feature is true/false) and multinomial (each feature gets a count, a.k.a. naive Bayes). Both use Bayes’s rule to compute the category probabilities:

Bayes’s Rule: p(cat|x) ∝ p(x|cat) p(cat)

Presumably we take the same space of features numbered 1 to D (the dimensionality). The Bernoulli model uses boolean feature values, setting:

p(feats|cat) = Π_{d in 1:D} feats(d) ? θ[d,cat] : (1 – θ[d,cat])

where θ[d,cat] is the probability that feature d is true in the specified category and feats(d) is true or false, based on whether the feature appeared in the input.
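As a minimal sketch (not the post’s actual code), the Bernoulli likelihood in Python, assuming feats is the set of true feature indices and theta holds one category’s parameters:

```python
import math

def bernoulli_log_likelihood(feats, theta):
    """Log p(feats|cat) for one category under the Bernoulli model.

    feats: set of feature indices that are true in the input.
    theta: list where theta[d] = p(feature d is true | cat).

    Every dimension contributes a factor, whether the feature is
    true (theta[d]) or false (1 - theta[d]).
    """
    return sum(
        math.log(theta[d]) if d in feats else math.log(1.0 - theta[d])
        for d in range(len(theta))
    )
```

Note that the sum runs over all D dimensions, which is exactly what makes the naive implementation slow.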

The multinomial uses count values (0,1,2,…) for features, setting:

p(feats|cat) ∝ Π_{d in 1:D} φ[d,cat]^{feats(d)}

where φ[d,cat] is the probability of the word d in the model and feats(d) is the number of times it appeared in the input being classified. We don’t usually compute the normalizing term for first-best classifiers.
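The multinomial version, again as a sketch with assumed names (counts maps feature index to count, phi holds one category’s word probabilities):

```python
import math

def multinomial_log_likelihood(counts, phi):
    """Unnormalized log p(feats|cat) under the multinomial model.

    counts: dict mapping feature index d -> number of occurrences.
    phi: list where phi[d] = p(word d | cat).

    Zero-count features contribute phi[d]**0 = 1 (log 0.0), so
    only the observed features enter the sum.
    """
    return sum(n * math.log(phi[d]) for d, n in counts.items())
```

Because the loop runs over the observed features only, this is sparse by construction, which is the speed advantage discussed below.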

There are two problems with this report. One, it shows the multinomial classifier being about 40 times faster! How can that be, given that the inside of the product is simpler for the Bernoulli model? The answer is that they must have used a sub-optimal implementation. I can guess what they did — implemented the Bernoulli as the naive product over all dimensions, and implemented the multinomial in the usual sparse way, ignoring the zero counts, which work out to φ^{0} = 1 and thus cancel.

The Bernoulli should be much faster if it’s implemented the right way, which is to start with an array of baseline scores computed per category assuming all features are false:

score[cat] = Π_{d in 1:D} (1 – θ[d,cat])

Then for each true feature d,

score[cat] *= θ[d,cat] / (1 – θ[d,cat])

If we convert all that to logs, the Bernoulli will be faster. And it’s a lot cheaper to compute boolean feature vectors than to use counts (you can use a set instead of a map, and in a good implementation, feature extraction dominates).
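Putting the two steps above together in log space, a sketch of the efficient implementation (names like precompute and bernoulli_scores are my own, not from the post):

```python
import math

def precompute(theta):
    """Precompute per-category constants from theta[cat][d] =
    p(feature d true | cat).

    Returns:
      baseline: log score per category with all features false.
      log_odds: per-feature adjustment log(theta/(1-theta)) per category.
    """
    baseline = {}
    log_odds = {}
    for cat, th in theta.items():
        baseline[cat] = sum(math.log(1.0 - p) for p in th)
        log_odds[cat] = [math.log(p) - math.log(1.0 - p) for p in th]
    return baseline, log_odds

def bernoulli_scores(true_feats, baseline, log_odds):
    """Score every category, touching only the true features."""
    return {
        cat: baseline[cat] + sum(log_odds[cat][d] for d in true_feats)
        for cat in baseline
    }
```

With the baselines and log odds precomputed at training time, classification costs one addition per true feature per category, the same sparse profile as the multinomial.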

The second problem is that the experiments explored a set of prior parameter values where the best performance was at the edge of the values tested. They tried priors of 0.1, 0.5, 1, and 2, with 0.1 performing the best. We know even more diffuse priors are good for naive Bayes, so I suspect setting the prior concentration to less than 0.1 would’ve worked even better.

The real issue is that I see conference papers with the same two flaws in evaluation. What typically happens with the professionals is that they use a baseline implementation of someone else’s algorithm with which they’re not familiar, and a well-tuned version of their own. It’s notoriously tricky to do fair efficiency and scalability tests without a no-holds-barred policy and a bakeoff. And even then, the skill of the programmers is likely to dominate the evaluation.

June 9, 2009 at 8:54 am |

The posts link to some writeups that the author has done for various topics, and those papers are attributed to “Bob New” who does appear to be a student at the University of Washington.

September 2, 2009 at 10:18 pm |

Thanks for the feedback on my blog. Sorry that I don’t have comments enabled.