I suggest you actually try to calibrate some real naive Bayes output. What you’ll find is that with 100+ word documents, all the predictions go to 0.999 and 0.001, and you get really strong overdispersion effects (which logistic regression also can’t correct; you need to transform the word vectors to do that, or build a proper negative-binomial-type model along the lines of Mosteller and Wallace’s 1964 Bayesian classifier!).
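A quick simulation (mine, not from the original comment) shows the saturation: under naive Bayes, the log odds are a sum of independent per-word log likelihood ratios, so they grow linearly with document length and long documents get pushed to the extremes. All numbers here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-word log likelihood ratios log p(w|class1)/p(w|class0).
# Even a small average ratio per word accumulates linearly in document length.
log_ratios = rng.normal(loc=0.05, scale=0.5, size=5000)

def nb_posterior(doc_len):
    """Naive Bayes posterior for class 1 on a random document, equal priors."""
    words = rng.integers(0, len(log_ratios), size=doc_len)
    log_odds = log_ratios[words].sum()
    return 1.0 / (1.0 + np.exp(-log_odds))

short = [nb_posterior(10) for _ in range(200)]
long_ = [nb_posterior(500) for _ in range(200)]

# Long documents pile up near 0 or 1; short ones stay in the middle.
extreme = np.mean([(p < 0.001) or (p > 0.999) for p in long_])
print(f"fraction of 500-word docs with posterior outside (0.001, 0.999): {extreme:.2f}")
```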

You can perform plug-in inference with a Bayesian posterior mean just as well as with a maximum likelihood estimate. You’ll find the Bayesian approach a bit more robust, but you can probably get most of the way there without much loss by regularizing heavily.
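As a toy illustration (mine, not from the comment): under a symmetric Dirichlet prior, the posterior mean for word probabilities is just additive smoothing, which is exactly the heavy-regularization route and keeps unseen words away from zero.

```python
import numpy as np

counts = np.array([0, 3, 7])   # word counts in one class; first word unseen
alpha = 1.0                    # symmetric Dirichlet prior (add-one smoothing)

# Maximum likelihood estimate vs. Bayesian posterior mean.
mle = counts / counts.sum()
posterior_mean = (counts + alpha) / (counts.sum() + alpha * len(counts))

print(mle)             # unseen word gets probability 0, hence -inf log likelihood
print(posterior_mean)  # the prior keeps every probability strictly positive
```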

P.S. My second paper on stats (circa 1998) used exactly this kind of post-calibration on an SVD-reduced representation of word vectors.

Thanks for the suggestion of the hierarchical prior. I was hoping for a method that doesn’t require sampling, because prediction speed is crucial in my case. Variational approximations are also not that fast, as far as I know?

No, you cannot post-calibrate.

I’d suggest something like a logistic regression with a hierarchical prior. The problem there is that the posterior isn’t conjugate, so you need MCMC methods or Laplace approximations or variational approximations, none of which are implemented in LingPipe.
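A minimal sketch of the Laplace route (my illustration, with a plain Gaussian prior standing in for the hierarchical one): find the MAP estimate, then approximate the posterior by a Gaussian whose covariance is the inverse Hessian of the negative log posterior at the mode. The data and prior variance here are made up.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

sigma2 = 4.0  # prior variance; a single-level stand-in for a hierarchical prior

def neg_log_posterior(w):
    z = X @ w
    # Bernoulli log likelihood (stable via logaddexp) plus Gaussian prior term.
    return np.logaddexp(0, z).sum() - y @ z + w @ w / (2 * sigma2)

w_map = minimize(neg_log_posterior, np.zeros(d), method="BFGS").x

# Laplace approximation: Gaussian centered at the MAP, covariance = inverse
# Hessian of the negative log posterior.
p = 1 / (1 + np.exp(-X @ w_map))
H = X.T @ (X * (p * (1 - p))[:, None]) + np.eye(d) / sigma2
cov = np.linalg.inv(H)
print(w_map, np.sqrt(np.diag(cov)))
```

Prediction with the Laplace posterior is fast, since it only requires Gaussian integrals or a plug-in at the mode, which speaks to the speed concern above.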

If the posteriors are not calibrated, then calibration techniques can be applied to obtain meaningful credible intervals, correct?

Given the failure of the independence assumptions under naive Bayes, the posteriors will **NOT** be well calibrated. So there’s not much use in computing them.
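The mechanism is easy to see in a contrived example (mine): if a single informative feature is effectively duplicated, naive Bayes multiplies its likelihood ratio in once per copy, so confidence grows with no new evidence.

```python
# One informative binary feature with likelihood ratio 3:1 for class 1.
# With equal priors, the true posterior is 3 / (3 + 1) = 0.75.
lr = 3.0
true_posterior = lr / (lr + 1)

# Naive Bayes sees the same feature duplicated k times and, assuming
# independence, raises the ratio to the k-th power: the posterior
# saturates even though there is no additional information.
for k in (1, 2, 5):
    nb_posterior = lr**k / (lr**k + 1)
    print(k, nb_posterior)
```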

The problem with mixing discrete parameters into HMC is that it breaks the continuity of the posterior geometry, which makes adaptation tricky.

Marginalizing out discrete parameters is *much* better, mainly because you get Rao-Blackwell-type speedups from using analytic expectations instead of samples.
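Concretely (my sketch, values made up): in a two-component Gaussian mixture, instead of sampling the discrete indicator z for each point, you sum it out with a log-sum-exp, leaving a smooth log density that HMC can handle.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

# Two-component Gaussian mixture with known scale and equal weights.
x = np.array([-1.2, 0.3, 2.5])
mus, sigma, log_pi = np.array([0.0, 2.0]), 1.0, np.log([0.5, 0.5])

# log p(x_i) = logsumexp over z of [log pi_z + log Normal(x_i | mu_z, sigma)];
# the discrete indicator never appears as a sampled parameter.
log_lik_per_z = log_pi + norm.logpdf(x[:, None], loc=mus, scale=sigma)
log_marginal = logsumexp(log_lik_per_z, axis=1).sum()
print(log_marginal)
```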

The problem for Stan is efficiently evaluating the conditional probabilities, up to a constant of proportionality, that you’d need for Gibbs sampling (and then you have the correlation problems of Gibbs). Metropolis isn’t very effective in high dimensions.
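For a discrete parameter with K values, a Gibbs step only needs the full conditional up to a constant; a minimal sketch (mine, with made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(2)

# Unnormalized log full conditional log p(z = k | everything else) + const.
unnorm_log_probs = np.array([-3.1, -0.4, -2.2])

# Normalize in log space, exponentiate, and draw the new value of z.
log_probs = unnorm_log_probs - np.logaddexp.reduce(unnorm_log_probs)
probs = np.exp(log_probs)
z = rng.choice(len(probs), p=probs)
print(probs, z)
```

The catch the comment points at is that evaluating those K unnormalized conditionals efficiently inside Stan’s architecture is the hard part, not the normalization itself.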

Doesn’t this work? Or does it work but isn’t easy to implement? Or would it be too slow?

c_{z_{a,b},*,y_{a,b}}^{-(a,b)} + \beta_{y_{a,b}} is the argument of the Gamma function, so I think they both depend on z_{a,b}.

Thanks.
