It’s not that someone can’t write an algorithm called “post-calibrate” and apply it to any old thing, it’s just that you won’t be able to actually calibrate the output. If you don’t believe me, try it. The theory here is that you have already lost the signal by piping it through the inappropriate naive Bayes likelihood function.

I suggest you actually try to calibrate some real naive Bayes output. What you'll find is that with documents of 100+ words, all the predictions go to 0.999 or 0.001, and you get really strong overdispersion effects, which logistic regression also can't correct. To fix that, you need to transform the word vectors, or build a proper negative-binomial-style model along the lines of Mosteller and Wallace's 1964 Bayesian classifier!
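
Here's a toy sketch of the saturation effect (the numbers are my own, not from the post): give every word the same small log-odds nudge toward class 1. Naive Bayes sums these contributions as if the words were independent, so the posterior saturates as documents get longer.

```python
import math

PER_WORD_LOG_ODDS = 0.05  # assumed value for a mildly informative word

def naive_bayes_posterior(n_words):
    # Independence assumption: per-word log odds just add up.
    total_log_odds = PER_WORD_LOG_ODDS * n_words
    return 1.0 / (1.0 + math.exp(-total_log_odds))

for n in (10, 100, 1000):
    print(n, naive_bayes_posterior(n))
```

At 10 words the posterior is a modest 0.62; at 100 words it's already past 0.99, and at 1000 words it's numerically indistinguishable from 1.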

You can perform heuristic inference with a Bayesian mean just as well as a maximum likelihood estimate. You’ll find the Bayesian approach a bit more robust, but you can probably get most of the way there without much loss by regularizing heavily.
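
A minimal sketch of that point for multinomial word probabilities (my own toy counts): the Bayesian posterior mean under a symmetric Dirichlet prior is just additive smoothing, so a heavily regularized (smoothed) estimate lands close to it, while the MLE assigns zero probability to unseen words.

```python
import numpy as np

counts = np.array([50, 3, 0, 1])          # toy word counts in one class
mle = counts / counts.sum()               # MLE: unseen word gets probability 0
alpha = 1.0                               # symmetric Dirichlet prior (add-one)
posterior_mean = (counts + alpha) / (counts.sum() + alpha * len(counts))
print(mle)
print(posterior_mean)
```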

P.S. My second paper on stats (circa 1998) used exactly this kind of post-calibration on an SVD-reduced representation of word vectors.

Why is post-calibration not possible in that case? There's a non-parametric method using isotonic regression that seems to be fairly standard (it's implemented in sklearn).
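
For reference, here's the sklearn method the commenter means, on made-up miscalibrated scores: isotonic regression fits a monotone map from raw scores to empirical label frequencies on held-out data.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
raw = rng.uniform(size=500)                  # e.g. overconfident posteriors
# Simulated labels whose true positive rate is raw**2, so raw is miscalibrated.
labels = (rng.uniform(size=500) < raw ** 2).astype(int)

iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw, labels)             # learn a monotone score -> probability map
calibrated = iso.predict(raw)
```

This is exactly the kind of post-hoc mapping the answer below argues cannot rescue naive Bayes output: a monotone remap can't undo saturation once nearly all scores sit at the extremes.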

Thanks for the suggestion with the hierarchical prior, I was hoping for a method that doesn’t require sampling, because prediction speed is crucial in my case. The variational approximations are also not that fast as far as I know?

Yes, you can just use the unnormalized posterior, as it'll be proportional to what you want, so normalizing gives you the exact probabilities.

No, you cannot post-calibrate.

I’d suggest something like a logistic regression with a hierarchical prior. The problem there is that the posterior isn’t conjugate, so you need MCMC methods, Laplace approximations, or variational approximations, none of which are implemented in LingPipe.
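
To illustrate the Laplace option, here's a hedged sketch on toy data with a plain Gaussian prior (a hierarchical prior would additionally put a hyperprior on the prior scale): find the posterior mode, then approximate the posterior as a Gaussian with covariance given by the inverse Hessian at the mode.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                     # toy design matrix
w_true = np.array([1.0, -2.0, 0.5])               # assumed true weights
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

prior_sd = 2.0   # fixed Gaussian prior scale (not hierarchical, for brevity)

def neg_log_posterior(w):
    logits = X @ w
    log_lik = y @ logits - np.logaddexp(0.0, logits).sum()  # stable Bernoulli
    log_prior = -0.5 * np.sum((w / prior_sd) ** 2)
    return -(log_lik + log_prior)

mode = minimize(neg_log_posterior, np.zeros(3)).x   # MAP estimate
p = 1.0 / (1.0 + np.exp(-X @ mode))
# Hessian of the negative log posterior at the mode:
hessian = X.T @ (X * (p * (1.0 - p))[:, None]) + np.eye(3) / prior_sd ** 2
cov = np.linalg.inv(hessian)    # Laplace: posterior ~ N(mode, cov)
```

Prediction is then fast: it's just the Gaussian approximation, no sampling at prediction time, which addresses the speed concern above.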

Thank you so much for your quick reply, that was enlightening! Do I understand correctly that I can simply use an unnormalized posterior (such as equation 64), evaluate it for all classes for each predicted instance, and normalize by the sum over classes? (Without an explicit formula for the normalization constant.) And then after normalizing, I integrate over the classes (assuming random order)?

If the posteriors are not calibrated, then calibration techniques can be applied to obtain meaningful credible intervals, correct?

Given the failure of the independence assumptions under naive Bayes, the posteriors will **NOT** be well calibrated. So there’s not much use in computing them.

If you’re asking that question, you’re a stats person :-)

The problem with mixing discrete parameters into HMC is that it breaks the continuity of the posterior geometry and thus makes adaptation tricky.

Marginalizing out discrete parameters is *much* better, mainly because you get Rao-Blackwell-type speedups from using analytic expectations instead of samples.

The problem for Stan is efficiently evaluating the conditional probabilities up to a constant of proportionality, which is what you’d need in order to run Gibbs sampling (and then you have the correlation problems of Gibbs). Metropolis isn’t very effective in high dimensions.

Doesn’t this work? Or does it work, but it isn’t easy to implement? Or would it be too slow?

I’m afraid I don’t have time to go through the doc symbol by symbol.

If the general product that involves \Gamma(c_{z_{a,b},*,y_{a,b}}^{-(a,b)} + \beta_{y_{a,b}}) does not depend on z_{a,b}, why does the term c_{z_{a,b},*,y_{a,b}}^{-(a,b)} + \beta_{y_{a,b}} depend on z_{a,b}, so that it cannot be dropped?

c_{z_{a,b},*,y_{a,b}}^{-(a,b)} + \beta_{y_{a,b}} is the argument of the Gamma function, so I think they both depend on z_{a,b}.

Thanks.
