Collapsed Gibbs Sampling for LDA and Bayesian Naive Bayes


[Update: 1 February 2014. David Bammam points out that there’s a mistake just before equation (62). The correct formula should be

\Gamma(x + q) = \Gamma(x) \times \prod_{i=1}^q (x + i - 1).

This has implications going forward replacing i with i - 1 which I don’t have time to change right now.]

[Update: 27 September 2010. I corrected several typos and brainos in the tech report after corrections in the comments (see below) from Arwen Twinkle. I also added some historical notes and references. Thanks for all the feedback.]

I’ve uploaded a short (though dense) tech report that works through the collapsing of Gibbs samplers for latent Dirichlet allocation (LDA) and the Bayesian formulation of naive Bayes (NB).

Thomas L. Griffiths and Mark Steyvers used the collapsed sampler for LDA in their (old enough to be open access) PNAS paper, Finding scientific topics. They show the final form, but don’t derive the integral or provide a citation.

I suppose these 25-step integrals are supposed to be child’s play. Maybe they are if you’re a physicist or theoretical statistician. But that was a whole lot of choppin’ with the algebra and the calculus for a simple country computational linguist like me.

On to Bayesian Naive Bayes

My whole motivation for doing the derivation was that someone told me that it wasn’t possible to integrate out the multinomials in naive Bayes (actually, they told me you’d be left with residual \Gamma functions). It seemed to me like it should be possible because the structure of the problem was so much like the LDA setup.

It turned out to be a little trickier than I expected and I had to generalize the LDA case a bit. But in the end, the result’s interesting. I didn’t wind up with what I expected. Instead, the derivation led to me to see that the collapsed sampler uses Bayesian updating at a per-token level within a doc. Thus the second token will be more likely than the first because the topic multinomial parameter will have been updated to take account of the assignment of the first item.

This is so cool. I actually learned something from a derivation.

In my prior post, Bayesian Naive Bayes, aka Dirichlet-Multinomial Classifiers, I provided a proper Bayesian definition of naive Bayes classifiers (though the model is also appropriate for soft clustering with Gibbs sampling replacing EM). Don’t get me started on the misappropriation of the term “Bayesian” by any system that applies Bayes’s rule, but do check out another of my previous posts, What is Bayesian Statistical Inference? if you’re curious.

Thanks to Wikipedia

I couldn’t have done the derivation for LDA (or NB) without the help of

It pointed me (implicitly) at a handful of invaluable tricks, such as

  • multiplying through by the appropriate Dirichlet normalizers to reduce an integral over a Dirichlet density kernel to a constant,
  • unfolding products based on position, unfolding a \Gamma() function for the position at hand, then refolding the rest back up so it could be dropped out, and
  • reparameterizing products for total counts based on sufficient statistics.

Does anyone know of another source that works through equations more gently? I went through the same exegesis for SGD estimation for multinomial logistic regression with priors a while back.

But Wikipedia’s Derivation is Wrong!

At least I’m pretty sure it is as of 5 PM EST, 13 July 2010.

Wikipedia’s calculation problem starts in the move from the fifth equation before the end to the fourth. At this stage, we’ve already eliminated all the integrals, but still have a mess of \Gamma functions left. The only hint at what’s going on is in the text above which says it drops terms that don’t depend on k (the currently considered topic assignment for the n-th word of the m-th document). The Wikipedia’s calculation then proceeds to drop the term \prod_{i \neq k} \Gamma(n^{i,-(m,n)}_{m,(\cdot)} + \alpha_i) without any justification. It clearly depends on k.

The problems continue in the move from the third equation before the end to the penultimate equation, where a whole bunch of \Gamma function applications are dropped, such as \Gamma(n^{k,-(m,n)}_{m,(\cdot)} + \alpha_k), which even more clearly depend on k.

It took me a few days to see what was going on, and I figured out how to eliminate the variables properly. I also explain each and every step for those of you like me who don’t eat, sleep and breathe differential equations. I also use the more conventional stats numbering system where the loop variable m ranges from 1 to M so you don’t need to keep (as large) a symbol table in your head.

I haven’t changed the Wikipedia page for two reasons: (1) I’d like some confirmation that I haven’t gone off the rails myself anywhere, and (2) their notation is hard to follow and the Wikipedia interface not so clean, so I worry I’d introduce typos or spend all day entering it.

LaTeX Rocks

I also don’t think I could’ve done this derivation without LaTeX. The equations are just too daunting for pencil and paper. The non-interactive format of LaTeX did get me thinking that there might be some platform for sketching math out there that would split the difference between paper and LaTeX. Any suggestions?

63 Responses to “Collapsed Gibbs Sampling for LDA and Bayesian Naive Bayes”

  1. Brad Says:

    There’s a typo in your report in equation 13: it should be p (… | alpha, beta) instead of p(…, alpha, beta).

  2. Mark Reid Says:

    On your last point, have you tried LyX? If you already know LaTeX it is a great note-taking tool, rendering equations on the fly.

  3. lingpipe Says:

    @Brad Thanks. Corrected. Writing math is like trying to write computer programs without a compiler. Which sort of leads into my request for tools to help:

    @Mark. Lyx looks more like online LaTeX, which is a step in the right direction (and reminds me of the LaTeX I used to run on a Mac back in the early 90s), but I was hoping for something with more math know-how built in. Something more like Eclipse meets LaTeX, ideally with some Mathematica-type functionality thrown in. I also don’t quite understand why the rendering is so bad in Lyx (judging from their screen shots) — are they using something like ASCIIMathML or one of the other converters that produce more ASCII-like output?

  4. Brendan O'Connor Says:

    Perfect timing — I started looking in to this just today. Thanks for writing this up.

    As for the tool you want — maybe you want Mathematica itself :)

  5. Alvin Kaule Says:

    Wray Buntine and Aleks Jakuline provide a Gibbs sampling algorithm for Discrete Component Analysis which is a general model and related to LDA.

    • lingpipe Says:

      Thanks. I really like the generality of that. They show how LDA is an instance of what they’re doing and then discuss the collapsed sampler under the heading “Rao-Blackwellization”. But their equation (21) only goes down to the Gamma function stage.

      They do make the critical observation that the collapsing increases the effective resampling rate of the theta parameters (and phi parameters), which helps convergence.

      I should ask Aleks if he knows who did all these derivations originally.

  6. Ramnath Balasubramanyan Says:

    This is very interesting.
    Gregor Heinrich has a tech report which goes through the steps of deriving the final form for a collapsed Gibbs sampler for LDA just like your tech report does.

    Click to access text-est.pdf

    • lingpipe Says:

      Thanks for the link. I like Heinrich’s generalization of the B() notation for the beta normalizer to the Delta() for the Dirichlet. It’s that step from (78) to (79) that I found so hard to figure out, and where the Wikipedia derivation went wrong.

      As I say in the blog post, my own derivation was based on the Wikipedia entry. I should say that in the tech report itself, now that I think about it.

      I’m really wondering who the original author (deriver?) of that was and if there’s a paper I could cite. There’s no ref for the collapsing derivation in the Heinrich paper you link. Actually there is, to Griffiths 2002, but that only says it’s possible and shows the final form! And to a McCallum et al. paper on author-topic models, but also no derivation there.

      • Ramnath Balasubramanyan Says:

        I agree, the Wikipedia derivation is needlessly over complicated.
        As far as moving from (78) to (79) in Heinrich’s paper goes, isn’t it a simple reduction using \Gamma (n) = (n-1) \Gamma (n-1)?

      • lingpipe Says:

        I think the problem with the Wikipedia entry, at least a week or two ago, was that it was wrong.

        The step’s a bit more involved because you have to reason about the counts with and without the item being sampled, unfold, then fold back together, all the while dropping constant terms that don’ otherwise cancel. With my limited experience manipulating these kinds of functions, it seemed pretty hard to me.

    • lingpipe Says:

      You should also be careful in interpreting the Heinrich paper. Its Figure 8 seems to imply that the maximum likelihood-like estimates for the discrete topic and word distributions, \theta and \varphi parameters can be used as Gibbs samples for those parameters. This isn’t the case, because they can wind up with far too little variance.

      If you need samples from the full posterior, you need to actually sample \theta and \varphi. This is actually really easy, because we have all the topic assignments z and priors \alpha and \beta, and the Dirichlet is conjugate for the discrete distribution. But we need to do that sampling to properly represent the variance of the estimates for the discrete parameters.

      • Ramnath Balasubramanyan Says:

        Thanks! I hadn’t noticed the potential problem in using the MLE estimates.

      • Mathieu Says:

        Thank you very much for sharing your derivation. This is greatly appreciated.

        Regarding the estimation of the thetas and phis, can you elaborate why sampling is needed? As Heinrich points out in equations 80-83, their distribution is a Dirichlet so their expectation is known.

        Regarding burn-in and lag, do you apply them with respect to the topic assignment samples or with respect to the phi and theta samples?

        Have you considered posing i=(a,b) to make some of the equations easier to read?

        If I remember correctly, another paper which includes the derivation for the Naive Bayes sampler is “Gibbs sampling for the uninitiated”, but probably not as detailed as yours.

      • lingpipe Says:

        @Mathieu Your question contains a clue to the answer. The expectations of the multinomial parameters are known. Even the closed form of their variance is known, because you the Dirichlet is conjugate.

        What you can’t do, though, is set the multinomial parameters to their expectations in each sample and then use those expectations for posterior inference. In general, the Bayesian posterior predictive distribution computes the expectation of the likelihood p(\tilde{y}|\gamma) given the posterior p(\gamma|y),

        p(\tilde{y}|y) = \int p(\tilde{y}|\gamma) p(\gamma|y) d\gamma.

        The point of Gibbs sampling is that it lets you approximate the posterior with samples in order to approximate the integral by

        \int p(\tilde{y}|\gamma) p(\gamma|y) d\gamma \approx \frac{1}{N} \sum_{n=1}^N p(\tilde{y}|\gamma^{(n)})

        where \gamma^{(n)} is the n-th Gibbs sample for the parameters \gamma.

        If instead of sampling the \theta,\varphi parameters in LDA, you use their expectations, you underestimate the variance of the posterior parameter distribution p(\theta,\varphi|y).

        You need to sample the multinomial from the Dirichlet in each sample and reason from that.

  7. Arwen Says:

    I have a question regrading equation (29) when you expand the gamma functions. When you expand the gamma in the numerator of the last term instead of having
    Gamma(X+1) = X*Gamma(X)
    you wind up with
    Gamma(X+1) = (X+1)*Gamma(X)
    Is this a typo or is there a particular trick I am missing in this equation?

    • lingpipe Says:

      Thanks for the error correction. Our technicians are working on a patch. I’ll post another comment when the new version’s out.

    • lingpipe Says:

      Thanks again. I updated equations (29) and (30) to get rid of the “+1” terms, which shouldn’t have been there. This should’ve been obvious to me when I proofread it for the umpteenth time given that the “+1” was gone again in equation (31). I blame not having unit tests for math!

  8. arwentwinkle Says:

    Question: in the step from (31) to (32) you replace the sum over J with the appropriate count (makes sense) and then add (J*beta_j). However, there is no longer an index variable j with which to select the jth element of the beta vector.
    Looking at the Griffiths and Steyvers paper, they are using a single element beta , symmetric Dirichlet prior, instead of a vector beta allowing non-symmetric prior. Also, the relationship between your final form (sans normalization) and the Griffith & Steyvers final form (5) isn’t crystal clear to me.
    Typo: In equations (28) – (30) the denominator of the first beta term [Gamma(sum over J)] I believe that you have dropped the superscript –(a,b) that should be attached to C_k,*,j

    I really appreciated your quick response previously. I am working through this derivation in detail because I have to do one using some slightly different distributions but same basic idea. I will be sighting you as your work has been invaluable to me. :)
    Thank you again. Very helpful piece of work and very well done.

    • lingpipe Says:

      I feel your pain. The reason I worked through it was so that I could get to the naive Bayes case. What app are you working on?

      Thanks again for the corrections. The reduction to J \times \beta_j in the denominator of (32) is wrong. It should be \sum_{j=1}^J \beta_j for the general case where the \beta_j may be different.

      The -(a,b) was intentionally dropped from the first terms based on the argument in the text before (28). The key is factoring out terms that that depend on (a,b) and those that don’t. This is why everything can drop out in the end up to proportionality.

      I still haven’t found a citation to whoever worked this out the first time. Maybe it’s just obvious to people with math brains the size of a planet.

      I’m looking forward to your comments on the naive Bayes derivation! That was a little trickier in the \Gamma() terms, but not much.

  9. Bf Says:

    I’ve been trying for a day now to find out how Griffiths and Steyvers arrive at those results after the integration. Now everything is clear. I wasn’t aware of the trick with the normalisation constant of the Dirichlet to get rid of the integral! Great work! Thanks for submitting it!

  10. Viet-An Nguyen Says:

    Thank you for your wonderful note. It has helped me a lot.
    I just have a question regarding the Gamma expansion at the top of page 9. Should the expansion be
    \Gamma(x+q) = \Gamma(x) + \prod_{i=0}^P{q-1} (x+i)
    instead of
    \Gamma(x+q) = \Gamma(x) + \prod_{i=1}^P{q} (x+i)

    • Bob Carpenter Says:

      It’s right the way it is. To see that, note that

      \Gamma(x+0) = \Gamma(x) \times \prod_{i=1}^0 (x+i) = \Gamma(x) + 0 = \Gamma(x).

      \Gamma(x+1) = \Gamma(x) \times \prod_{i=1}^1 (x+i) = \Gamma(x) \times x.

      Check out the Wikipedia page on the Gamma function.

      • chang Says:

        I think the Gamma Expansion should be
        \Gamma(x+q) = \Gamma(x) \prod_{i=1}^P{q} (x+i-1)
        instead of
        \Gamma(x+q) = \Gamma(x) \prod_{i=1}^P{q} (x+i)

        ortherwise you get
        \Gamma(x+1) = \Gamma(x) (x+1)

      • Bob Carpenter Says:

        Thanks! Don’t know what I was thinking (twice). We definitely want

        \Gamma(x+1) = x \times \Gamma(x).

        So the formula should be

        \Gamma(x + q) = \Gamma(x) \times \prod_{i = 1}^q (x + i - 1).

        I’ll update the PDF when I get a chance.

  11. Summing Out Topic Assignments in LDA « LingPipe Blog Says:

    […] a previous post, I showed how to integrate out LDA’s multinomial parameters, with the result being a discrete distribution. In this post, I’ll show how to turn this […]

  12. Kairit Says:

    I’m wondering if there are products over words missing in the first integral of 18 and 19.

    there is currently
    \int\prod_{m=1}^N p(z_m|\theta_m)p(\theta_m|\alpha)d\theta,

    but I think there should be:
    \int\prod_{m=1}^N\prod_{n=1}^{N_m} p(z_{m, n}|\theta_m)p(\theta_m|\alpha)d\theta,

    because z_m doesn’t make sense as there are more than one topic in each m-th document.

    I’m also confused about the derivation of the second term in 21. In 20 it has product over topics and inside every topic there is a product over all the words in all documents. So in every topic one iteration over all words in all documents will be made and by using the exponentiation of the sums it could be turned into the product over all topics and all word types. But this product over all topics and all word types must be done inside every iteration of outer product over topics. So in my understanding it is not possible to convert the second term of 20 which has outer product over topics and inner product over documents and words (regardless of the topic) into 21 where there is only outer product over topics and inner product over word types and outer product selecting the topics.

    However, for some reason I believe that you are right and I am mistaken, but where does my reasoning go wrong?

  13. Kairit Says:

    So, spending some more time staring at the formulas 20 and 21 and expanding the terms in a small example, I must agree with that derivation. So no need for explanation any more.

    • Bob Carpenter Says:

      Thanks for clarifying — I still hadn’t gotten around to diving back into the derivation.

      • tabproxy Says:

        Thank you for your great explanation. It helps me a lot :-) Probably this is the most comprehensive explanation I have ever found.

        Regarding Kairit’s question: “…if there are products over words missing in the first integral of 18 …” .. I believe it is true too, but perhaps could you please give a little more explanation ? Because I could not find a relationship between the first part of integral and the independence assumption (7).

        I really appreciate it. Please kindly advise. Many thxs in advance.

  14. Bob Carpenter Says:

    OK, I went back and had a look. The reason you can reduce (17) to (18) is that the words y are conditionally independent of everything but the word distributions \phi and topic assignments z. More specifically, a given word y[m,n] only depends on \phi_{z[m,n]}, the distribution over words for the topic assigned to word n in document m.

  15. Martin Says:

    Can you please explain why your result in (32) differs from the one in Griffiths2002 (1)?

  16. Bob Carpenter Says:

    I take it you mean

    Griffiths, T. L., & Steyvers, M. (2002). A probabilistic approach to semantic representation. Proceedings of the 24th Annual Conference of the Cognitive Science Society.

    Formula (1) in that paper is the same as formula [5] in their 2004 PNAS paper cited above.

    Part of the answer is that I allowed asymmetric priors. With a symmetric prior, the W\beta term in their denominator term is the same as what I called J \gamma in the note after the formula.

    The other step was dropping the topic-related denominator (in their notation, the n^{(d_i)}_{-i,\cdot} + T\alpha term); note that it’s the same for every j, so you don’t need to compute it.

  17. Ben Wing Says:

    Hello. I have been trying to work out in detail how collapsed Gibbs sampling works and in particular collapsing of Dirichlet priors.

    I’ve done a great deal of work on the following pages:

    1. Gibbs sampling (see in particular the section on collapsed sampling)
    2. Compound probability distribution (this includes a fair amount of math that underlies exactly how collapsing works in the general conjugate prior case)
    3. Dirichlet compound multinomial distribution (this is the name of the distribution that results from collapsing out a Dirichlet prior; nearly the whole page is my work and has lots of details about how to collapse out Dirichlet priors in various situations commonly seen in Bayesian networks, including both the LDA and Naive Bayes cases)

    Ben Wing

  18. Ben Wing Says:

    That should say, the specified pages in Wikipedia.

    In addition, the thing about the Naive Bayes case is that, yes, you can in fact “eliminate” the gammas but only at the expense of a huge number of factors. Essentially, you have a factor for every single token in your corpus. With a large corpus this will be much worse than the number of factors in the unmodified DCM (Dirichlet compound multinomial) distribution, although it’s true that in the latter case those factors have a gamma function in them.

    For example, if you have 5 topics, 50K word types and 20M word tokens, your modified expression has 20M factors while the unmodified DCM has 5*50K = 250K.

    Furthermore, you can’t reasonably multiply 20M factors and not overflow your floating point. This means you will actually have to add up 20M log factors (vs. adding 250K log-gamma factors in the unmodified DCM). Unless the polynomial implementing log is 100K the speed of the log-gamma polynomial (both are primitives in R, for example), you don’t gain anything.

    • Bob Carpenter Says:

      I haven’t actually evaluated this case, but it’s not just the number of expressions, it’s the effectiveness of the sampler. Often if you integrate/sum-out terms, you get better estimates of everything else, which can be worth more computational effort.

      But I was really just working everything out for theory’s sake and as an exercise for myself. I did the naive Bayes case because someone had assured me it couldn’t be done :-)

      With something like Hamiltonian Monte Carlo, you get a very efficient continuous-parameter sample in terms of computational cost per effective sample (the latter being the denominator term in the Monte Carlo version of the central limit theorem). But it can’t handle discrete parameters.

      An R “built in” is nothing more than a call to an underlying C++ function. lgamma() is much more expensive to evaluate than even a log() or exp(), which are in turn more expensive than multiplication (* operator), which is more expensive than addition (+ operator). But R’s so inefficient in its interpreter and some of its numerical representations that you can mask the cost of the underlying C++ operation.

  19. Ben Wing Says:

    Hey Bob, thanks for the quick response!

    The two things I was comparing are both collapsed versions. They should evaluate to exactly the same thing — the issue is only how those values are computed. Phil Resnik suggests not collapsing at all, which is another beast entirely.

    I don’t know about Hamiltonian Monte Carlo; will have to look into that.

    I wonder how much more expensive lgamma really is than log. Both cases are presumably implemented using polynomials, and it doesn’t seem to me that the number of polynomial terms can be that much huger.

    As for R, if you vectorize your operations you no longer the R hit to the same degree, since the looping in the vectorized primitives is all in C++, I’m pretty sure.

    Another thought that occurred to me: assuming that you have a symmetric prior, you could hugely speed up the evaluation of the direct DCM density by keeping track of frequencies of frequencies. I’m not sure how much effort that would be but I don’t think very much. Then, in this big product over vocabulary items, when we have a symmetric prior, what matters is only the number of times this word token occurs (its frequency, more or less) in documents labeled with this topic. All word tokens with the same frequency contribute the same term, so we only need to compute the log-gamma once and then multiply by the frequency of frequency. This should give us HUGE speedups for the cost of just a bit of extra bookkeeping. For your transformed version, I’m not sure whether this would be feasible — there are two different types of counts to keep track of (those across just this document, and those across all other documents having the same topic.

    Thanks again for writing this up with all the math!

    Take a look at the Wikipedia entry for Dirichlet compound multinomial distribution and see if it makes sense to you, and tell me what you think might help to clarify it. I’ve also gone in great detail, but from a different perspective: I want to lay out a step-by-step procedure for collapsing Dirichlet priors in various different Bayes networks with different configurations of the categorical nodes dependent on the Dirichlet priors. So I take pains to motivate exactly where the different steps come from, attempting to do things so that the actual amount of math gets reduced to a minimum. For example, if you have the right intuition about things, you don’t really need to integrate out all the theta’s or phi’s at once. Rather, once you understand how the integrating out and simplification works in the simple case of one Dirichlet prior with a set of dependent categorical variables and no other variables, you can leverage this to apply to all the other cases without having to redo all the computations from scratch.

    • Brendan O'Connor Says:

      FYI here are some lgamma() implementations.

      In Python:

      In Lightspeed, by Tom Minka. Uses the Lanczos method.

    • Bob Carpenter Says:

      Right — I was talkinag about not collapsing at all.

      Log gamma’s relatively expensive. You can control the speed/accuracy tradeoff somewhat with the order of Lanczos approximation. Different libraries use different ones. In the Columbia project we’ve been using the Boost C++ libs, mainly because we’re already using Boost but also because of restrictive (copyleft, mainly) licensing issues for most other packages. I wound up writing my own lgamma() for LingPipe for the same reason. There’s needless to say a discussion in Numerical Recipes.

      If you have a uniform Dirichlet, it drops out altogether! We’re actually making those calcs in our sampling software at Columbia because models often assume uniform (on the linear scale) priors.

      You can vectorize R operations sometimes, and also Python using numpy sometimes. The looping in the vectorized primitives in R is definitely C++.

      Lookup can sometimes be more expensive than recalculation, but that’s certainly not going to be the case here. As long as you don’t need to normalize (if the counts change, the gamma terms change), you should be able to cache. There’s been some work out of U Irvine by Welling et al. on efficiently handling large-K cases using more efficient categorical sampling (worth reading if you care about efficiency for LDA).

      I’ll take a look at the Dirichlet-multinomial page when I get a chance. The generalization is another reason I wanted to do naive Bayes, too.

  20. Ben Wing Says:

    Hey, I have a related question. The actual name of the page on the Dirichlet compound multinomial is “Multivariate Pólya distribution”. I don’t think this is the most common name for this distribution, and IMO it’s problematic for other reasons (not parallel to “beta-binomial”, “Pólya distribution” is multiply ambiguous, acute accent is somewhat annoying).

    Which of the following do you think should be the canonical name of this distribution, as given in Wikipedia?

    1. Multivariate Pólya distribution
    2. Dirichlet compound multinomial distribution
    3. Dirichlet-multinomial distribution
    4. Other names?

  21. Allen Riddell Says:

    Thanks for this!

    I think I’ve spotted a typo in equations 40 and 41. At the end of the lines, the \theta_{z_m}’s should be \phi_{z_m}.

    • Bob Carpenter Says:

      You’re absolutely right — the \theta_{z_m} should be \phi_{z_m} at the ends of equations (40) and (41).

      Now I’ll have to dig up the source and edit and post a revision.

      It’s amazingly difficult to get this much algebra written out consistently! If only I had the equivalent of a compiler, I could check some of my errors.

  22. Tom Taylor Says:

    Thanks Bob, and all the people responding; you’ve certainly provided for me a point of clarity in the LDA literature and explained the wikipedia LDA article. This helped me with my own mathematica collapsed sampling LDA implementation, which seems to work efficiently and gives me beautiful and intuitive word clusters.

    I do have a couple of niggling confusions that I haven’t resolved (and I may be drummed out of the mathematical profession for confessing this) that I hope you (all) might have the patience to help me clarify, even at this late date. These are 1) Is the collapsed Gibbs sampling really LDA, and 2) if not what does collapsed sampling really do?

    First of all the business of collapsed sampling seems to go back to the publications of Griffiths and Steyvers. Griffiths’ 2002 unpublished report (see ) seems to suggest that the collapsed sampling is not really LDA, because it doesn’t perform inference on the \alpha and \beta. This doesn’t address my worry though.

    If I’m understanding correctly, the logic of collapsed Gibbs sampling seems to be that we want to sample p(z_{a,b},\theta,\phi | y, z_{-(a,b)}, \alpha, \beta) by Gibbs sampling, but we don’t need \theta, \phi very much, so we marginalize them out and instead just sample p(z_{a,b} | y, z_{-(a,b)}, \alpha, \beta), and then can recover \theta, \phi at the end of it all. For this to be quite correct, I would want z_{a,b} | y, z_{-(a,b)},\theta,\phi,\alpha, \beta to be independent of \theta,\phi, although \theta, \phi could depend on z, y \alpha, \beta. I don’t see why this should be the case.

    By comparison, in an un-collapsed Gibbs sampling I think I would expect to alternate sampling p(z_{a,b} | y, z_{-(a,b)}, \theta, \phi, \alpha, \beta) with sampling p(\theta,\phi | y, z, \alpha, \beta). It looks to me like the evolutions of z, \theta, \phi are coupled to each other. That is, is it not the case that collapsed sampling is an approximation to Gibbs-sampled LDA by a sort of z sampling with mean-field-\theta-\phi? And while collapsed sampling is more efficient and arguably a good approximation of LDA, might it not incur a cost of neglecting to sample the odd z assignments from those instances that the \theta & \phi variables Gibbs-walk themselves into some unlikely corner?

    • brendan o'connor (@brendan642) Says:

      (1) “Is collapsed GS really LDA?” The basic collapsed Gibbs sampler for LDA only fits the “z” variables, so it’s a sampler for part of LDA. What some people do (I’m one of them) is also fit the Dirichlet concentration parameters (alpha and beta, in this notation), which you can do with, say, slice sampling (so it’s all one big bayesian gibbs sampler), or optimization moves (to be “empirical bayes”).

      (2) On recovering theta and phi at the end: the count tables define posterior dirichlets over them. So if you just take the empirical mean off the count table (plus dirichlet pseudocounts), that’s the posterior mean for a particular theta or phi. When I want to be careful, I take the mean across many different ‘z’ samplings (or, to better know the variability, take theta samples for each ‘z’ sampling, then aggregate all…).

      (3) On collapsed GS vs mean-field theta,phi — So the right way to think about it is contrasting to uncollapsed GS for theta,phi … or MAP (maximization move) inference for theta,phi. The advantage of both CGS and meanfield is they account for a large number of settings of theta,phi, instead of a single point. One nice thing about CGS is that it *correctly* accounts for all possible settings of theta,phi by integrating them out. Mean field only approximately accounts for this through the approximation. But perhaps the more important thing is that CGS is online and cheap: every time you update a single ‘z’, you’ve implicitly immediately updated the theta,phi posterior for the next sample, and there are no expensive calculations (no digammas, but not even exps or logs either!). And you can even play tricks due to count sparsity: for example, if there was no change to ‘z’, don’t update any count tables. This and other tricks make CGS incredibly fast.

      This paper is pretty interesting for comparing all these methods:

    • Bob Carpenter Says:

      To elaborate on Brendan’s excellent comments, a “collapsed” sampler is just marginalizing out some parameters. So you get proper posterior samples for the marginal posterior. And no, nothing gets neglected.

      There are many ways to block out Gibbs updates in an uncollapsed sampler and they will all have the right stationary distribution, but perhaps different efficiencie.

      But the real killer is that LDA is impossible to sample from due to the multi-modality. So you really might as well use something that does optimization like a variational estimator. I’m partial to Matt Hoffman’s online version, which is now part of the Vowpal Wabbit software.

      You can alternatively marginalize out the discrete parameters. I have a later blog post that shows how to do that (it’s much easier):

  23. Ramon Lopes Says:

    After Equation (30) in the document, it is stated ‘Which along with the topic denominator, may be dropped because they are constant’.

    I have been trying to figure out why the term \Gamma(c_{z_{a,b},*,y_{a,b}}^{-(a,b)} + \beta_{y_{a,b}}) does not depend on the current sample position (a; b), so that it can be dropped. In other words, what makes the term \Gamma(c_{z_{a,b},*,y_{a,b}}^{-(a,b)} + \beta_{y_{a,b}}) *not* depend on the current sample position (so that it can be dropped) whereas the term c_{z_{a,b},*,y_{a,b}}^{-(a,b)} + \beta_{y_{a,b}} does?

    • Bob Carpenter Says:

      The \alpha and \beta terms are constant hyperparameters and the other terms explicitly remove the contribution of (a,b); that’s what the -(a,b) notation is for. Just trace which of the terms is dropped. You may also want to keep in mind that y_{a,b} is also constant because it’s data. Only the z_{a,b} terms are non-variable.

      • Ramon Lopes Says:

        Thank you for your quick answer.

        From Equation (30) to (31), the term c_{z_{a,b},*,y_{a,b}}^{-(a,b)} + \beta_{y_{a,b}} does depend on the current sample position, but this term is also the argument for the Gamma function. Therefore, I cannot understand why the term Gamma(c_{z_{a,b},*,y_{a,b}}^{-(a,b)} + \beta_{y_{a,b}}) can be dropped.

        Thank you.

      • Bob Carpenter Says:

        I do’t see the term you’re referring to. Only the \Gamma function terms are dropped, and none of them have a z_{a,b} inside them.

      • Ramon Lopes Says:

        Equation (29) has the term Gamma(c_{z_{a,b},*,y_{a,b}}^{-(a,b)} + \beta_{y_{a,b}}), which refers to z_{a,b}. In Equation (30), that term was refold into the general product. Finally, that general product was dropped because it is not supposed to depend on z_{a,b}.

        The issue is just that the general product refers to the term Gamma(c_{z_{a,b},*,y_{a,b}}^{-(a,b)} + \beta_{y_{a,b}}) which in turn does depend on z_{a,b}. The same reasoning can be taken for the term \Gamma(c_{z_{a,b},a,*}^{-(a,b)} + \alpha_{z_{a,b}}). Thus, I cannot understand why Gamma(c_{z_{a,b},*,y_{a,b}}^{-(a,b)} + \beta_{y_{a,b}}) can be dropped.

        Thank you.

      • Bob Carpenter Says:

        The general product doesn’t depend on z_{a,b}. That’s the point.

      • Ramon Lopes Says:

        If the the general product that involves \Gamma(c_{z_{a,b},*,y_{a,b}}^{-(a,b)} + \beta_{y_{a,b}}) does not depend on z_{a,b}, why does the term c_{z_{a,b},*,y_{a,b}}^{-(a,b)} + \beta_{y_{a,b}} depend on z_{a,b} so that it cannot be dropped?

        c_{z_{a,b},*,y_{a,b}}^{-(a,b)} + \beta_{y_{a,b}} is the argument in the Gamma function, so I think they both depends on z_{a,b}.


      • Bob Carpenter Says:

        I’m afraid I don’t have time to go through the doc symbol by symbol.

  24. Sangmin Gang Says:

    I have a theoretical question (I’m not a stat person) out of curiosity. If discrete processes are not easy to use for HMC (i.e., in stan), why not pick only those variables out and apply other MC method on them? It should be possible (intuitively to me) that MC on discrete processes and HMC on continuous processes could be executed in alternating order. A most primitive version of this could be: 1) random walk Metropolis algorithm (or whatever works) for discrete processes while all continuous processes fixed, then 2) HMC for continuous processes while all discrete processed fixed, then 3) repeat 1, 2.

    Isn’t this work? Or, it works but not easy to implement? Or it will be too slow?

    • Bob Carpenter Says:

      If you’re asking that question, you’re a stats person :-)

      The problem for mixing discrete with HMC is that it messes with the continuity of the posterior geometry and thus makes adaptation tricky.

      Marginalizing out discrete parameters is much better for the main reason that you get the Rao-Blackwell type speedups of using analytic expectations instead of samples.

      The problem for Stan is efficiently evaluating the conditional probabilities up to a proportion you’d need to evaluate Gibbs (then you have the correlation problems with Gibbs). Metropolis isn’t very effective in high dimensions.

  25. DreamFlasher Says:

    Thank you very much for the clear derivation of Bayesian Naive Bayes. I would like to calculate credible intervals for the estimate, in your derivations you leave out some normalization factors. Do I assume correctly that it should be possible to have the credible intervals in closed-form, too?

  26. Bob Carpenter Says:

    Yes, you can just use the inverse CDFs for the posterior. For more complicated models, you’re going to need MCMC as there’s rarely conjugacy straight through to the posterior.

    Given the failure of the independence assumptions under naive Bayes, the posteriors will NOT be well calibrated. So there’s not much use in computing them.

    • DreamFlasher Says:

      Thank you so much for your quick reply, that was enlightening! Do I understand you correctly, that I simply can use an unnormalized posterior (such as equation 64), evaluate it for all classes per predicted instance and normalize by that? (Without having an extra formalization of the normalization constant) And then after the normalization I integrate over the classes (assume random order)?

      If the posteriors are not calibrated then calibration techniques can be used in order to do so and obtain meaningful redible intervals, correct?

      • Bob Carpenter Says:

        Yes, you can just use the unnormalized posterior, as it’ll be proportional to what you want, so normalizing gives you the exact probabilities.

        No, you cannot post-calibrate.

        I’d suggest something like a logistic regression with a hierarchical prior. The problem there is that the posterior isn’t conjugate, so you need MCMC methods or Laplace approximations or variational approximations, none of which are implemented in LingPipe.

      • DreamFlasher Says:

        Why is post-calibration not possible in that case? There’s a non-parametric method using isotonic regression which seems to be quite standard to use (it’s implemented in sklearn).

        Thanks for the suggestion with the hierarchical prior, I was hoping for a method that doesn’t require sampling, because prediction speed is crucial in my case. The variational approximations are also not that fast as far as I know?

      • Bob Carpenter Says:

        It’s not that someone can’t write an algorithm called “post-calibrate” and apply it to any old thing, it’s just that you won’t be able to actually calibrate the output. If you don’t believe me, try it. The theory here is that you have already lost the signal by piping it through the inappropriate naive Bayes likelihood function.

        I suggest you actually try to calibrate some real naive Bayes output. What you’ll find is that all the predictions go to 0.999 and 0.001 if you have 100+ word documents and you get really strong overdispersion effects (which logistic regression also can’t correct—you need to transform word vectors to do that or build a proper negative-binomial type model along the lines of the 1964 Bayesian classifier of Mosteller and Wallace!).

        You can perform heuristic inference with a Bayesian mean just as well as a maximum likelihood estimate. You’ll find the Bayesian approach a bit more robust, but you can probably get most of the way there without much loss by regularizing heavily.

        P.S. My second paper on stats (circa 1998) used exactly this kind of post-calibration on an SVD-reduced representation of word vectors.

Leave a Reply to Brendan O'Connor Cancel reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: