Collapsed Gibbs Sampling for LDA and Bayesian Naive Bayes

by

[Update: 27 September 2010. I corrected several typos and brainos in the tech report after corrections in the comments (see below) from Arwen Twinkle. I also added some historical notes and references. Thanks for all the feedback.]

I’ve uploaded a short (though dense) tech report that works through the collapsing of Gibbs samplers for latent Dirichlet allocation (LDA) and the Bayesian formulation of naive Bayes (NB).

Thomas L. Griffiths and Mark Steyvers used the collapsed sampler for LDA in their (old enough to be open access) PNAS paper, Finding scientific topics. They show the final form, but don’t derive the integral or provide a citation.

I suppose these 25-step integrals are supposed to be child’s play. Maybe they are if you’re a physicist or theoretical statistician. But that was a whole lot of choppin’ with the algebra and the calculus for a simple country computational linguist like me.

On to Bayesian Naive Bayes

My whole motivation for doing the derivation was that someone told me that it wasn’t possible to integrate out the multinomials in naive Bayes (actually, they told me you’d be left with residual \Gamma functions). It seemed to me like it should be possible because the structure of the problem was so much like the LDA setup.

It turned out to be a little trickier than I expected and I had to generalize the LDA case a bit. But in the end, the result’s interesting. I didn’t wind up with what I expected. Instead, the derivation led to me to see that the collapsed sampler uses Bayesian updating at a per-token level within a doc. Thus the second token will be more likely than the first because the topic multinomial parameter will have been updated to take account of the assignment of the first item.

This is so cool. I actually learned something from a derivation.

In my prior post, Bayesian Naive Bayes, aka Dirichlet-Multinomial Classifiers, I provided a proper Bayesian definition of naive Bayes classifiers (though the model is also appropriate for soft clustering with Gibbs sampling replacing EM). Don’t get me started on the misappropriation of the term “Bayesian” by any system that applies Bayes’s rule, but do check out another of my previous posts, What is Bayesian Statistical Inference? if you’re curious.

Thanks to Wikipedia

I couldn’t have done the derivation for LDA (or NB) without the help of

It pointed me (implicitly) at a handful of invaluable tricks, such as

  • multiplying through by the appropriate Dirichlet normalizers to reduce an integral over a Dirichlet density kernel to a constant,
  • unfolding products based on position, unfolding a \Gamma() function for the position at hand, then refolding the rest back up so it could be dropped out, and
  • reparameterizing products for total counts based on sufficient statistics.

Does anyone know of another source that works through equations more gently? I went through the same exegesis for SGD estimation for multinomial logistic regression with priors a while back.

But Wikipedia’s Derivation is Wrong!

At least I’m pretty sure it is as of 5 PM EST, 13 July 2010.

Wikipedia’s calculation problem starts in the move from the fifth equation before the end to the fourth. At this stage, we’ve already eliminated all the integrals, but still have a mess of \Gamma functions left. The only hint at what’s going on is in the text above which says it drops terms that don’t depend on k (the currently considered topic assignment for the n-th word of the m-th document). The Wikipedia’s calculation then proceeds to drop the term \prod_{i \neq k} \Gamma(n^{i,-(m,n)}_{m,(\cdot)} + \alpha_i) without any justification. It clearly depends on k.

The problems continue in the move from the third equation before the end to the penultimate equation, where a whole bunch of \Gamma function applications are dropped, such as \Gamma(n^{k,-(m,n)}_{m,(\cdot)} + \alpha_k), which even more clearly depend on k.

It took me a few days to see what was going on, and I figured out how to eliminate the variables properly. I also explain each and every step for those of you like me who don’t eat, sleep and breathe differential equations. I also use the more conventional stats numbering system where the loop variable m ranges from 1 to M so you don’t need to keep (as large) a symbol table in your head.

I haven’t changed the Wikipedia page for two reasons: (1) I’d like some confirmation that I haven’t gone off the rails myself anywhere, and (2) their notation is hard to follow and the Wikipedia interface not so clean, so I worry I’d introduce typos or spend all day entering it.

LaTeX Rocks

I also don’t think I could’ve done this derivation without LaTeX. The equations are just too daunting for pencil and paper. The non-interactive format of LaTeX did get me thinking that there might be some platform for sketching math out there that would split the difference between paper and LaTeX. Any suggestions?

30 Responses to “Collapsed Gibbs Sampling for LDA and Bayesian Naive Bayes”

  1. Brad Says:

    There’s a typo in your report in equation 13: it should be p (… | alpha, beta) instead of p(…, alpha, beta).

  2. Mark Reid Says:

    On your last point, have you tried LyX? If you already know LaTeX it is a great note-taking tool, rendering equations on the fly.

  3. lingpipe Says:

    @Brad Thanks. Corrected. Writing math is like trying to write computer programs without a compiler. Which sort of leads into my request for tools to help:

    @Mark. Lyx looks more like online LaTeX, which is a step in the right direction (and reminds me of the LaTeX I used to run on a Mac back in the early 90s), but I was hoping for something with more math know-how built in. Something more like Eclipse meets LaTeX, ideally with some Mathematica-type functionality thrown in. I also don’t quite understand why the rendering is so bad in Lyx (judging from their screen shots) — are they using something like ASCIIMathML or one of the other converters that produce more ASCII-like output?

  4. Brendan O'Connor Says:

    Perfect timing — I started looking in to this just today. Thanks for writing this up.

    As for the tool you want — maybe you want Mathematica itself :)

  5. Alvin Kaule Says:

    Wray Buntine and Aleks Jakuline provide a Gibbs sampling algorithm for Discrete Component Analysis which is a general model and related to LDA.

    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.645&rep=rep1&type=pdf

    • lingpipe Says:

      Thanks. I really like the generality of that. They show how LDA is an instance of what they’re doing and then discuss the collapsed sampler under the heading “Rao-Blackwellization”. But their equation (21) only goes down to the Gamma function stage.

      They do make the critical observation that the collapsing increases the effective resampling rate of the theta parameters (and phi parameters), which helps convergence.

      I should ask Aleks if he knows who did all these derivations originally.

  6. Ramnath Balasubramanyan Says:

    This is very interesting.
    Gregor Heinrich has a tech report which goes through the steps of deriving the final form for a collapsed Gibbs sampler for LDA just like your tech report does.
    http://www.arbylon.net/publications/text-est.pdf

    • lingpipe Says:

      Thanks for the link. I like Heinrich’s generalization of the B() notation for the beta normalizer to the Delta() for the Dirichlet. It’s that step from (78) to (79) that I found so hard to figure out, and where the Wikipedia derivation went wrong.

      As I say in the blog post, my own derivation was based on the Wikipedia entry. I should say that in the tech report itself, now that I think about it.

      I’m really wondering who the original author (deriver?) of that was and if there’s a paper I could cite. There’s no ref for the collapsing derivation in the Heinrich paper you link. Actually there is, to Griffiths 2002, but that only says it’s possible and shows the final form! And to a McCallum et al. paper on author-topic models, but also no derivation there.

      • Ramnath Balasubramanyan Says:

        I agree, the Wikipedia derivation is needlessly over complicated.
        As far as moving from (78) to (79) in Heinrich’s paper goes, isn’t it a simple reduction using \Gamma (n) = (n-1) \Gamma (n-1)?

      • lingpipe Says:

        I think the problem with the Wikipedia entry, at least a week or two ago, was that it was wrong.

        The step’s a bit more involved because you have to reason about the counts with and without the item being sampled, unfold, then fold back together, all the while dropping constant terms that don’ otherwise cancel. With my limited experience manipulating these kinds of functions, it seemed pretty hard to me.

    • lingpipe Says:

      You should also be careful in interpreting the Heinrich paper. Its Figure 8 seems to imply that the maximum likelihood-like estimates for the discrete topic and word distributions, \theta and \varphi parameters can be used as Gibbs samples for those parameters. This isn’t the case, because they can wind up with far too little variance.

      If you need samples from the full posterior, you need to actually sample \theta and \varphi. This is actually really easy, because we have all the topic assignments z and priors \alpha and \beta, and the Dirichlet is conjugate for the discrete distribution. But we need to do that sampling to properly represent the variance of the estimates for the discrete parameters.

      • Ramnath Balasubramanyan Says:

        Thanks! I hadn’t noticed the potential problem in using the MLE estimates.

      • Mathieu Says:

        Thank you very much for sharing your derivation. This is greatly appreciated.

        Regarding the estimation of the thetas and phis, can you elaborate why sampling is needed? As Heinrich points out in equations 80-83, their distribution is a Dirichlet so their expectation is known.

        Regarding burn-in and lag, do you apply them with respect to the topic assignment samples or with respect to the phi and theta samples?

        Have you considered posing i=(a,b) to make some of the equations easier to read?

        If I remember correctly, another paper which includes the derivation for the Naive Bayes sampler is “Gibbs sampling for the uninitiated”, but probably not as detailed as yours.

      • lingpipe Says:

        @Mathieu Your question contains a clue to the answer. The expectations of the multinomial parameters are known. Even the closed form of their variance is known, because you the Dirichlet is conjugate.

        What you can’t do, though, is set the multinomial parameters to their expectations in each sample and then use those expectations for posterior inference. In general, the Bayesian posterior predictive distribution computes the expectation of the likelihood p(\tilde{y}|\gamma) given the posterior p(\gamma|y),

        p(\tilde{y}|y) = \int p(\tilde{y}|\gamma) p(\gamma|y) d\gamma.

        The point of Gibbs sampling is that it lets you approximate the posterior with samples in order to approximate the integral by

        \int p(\tilde{y}|\gamma) p(\gamma|y) d\gamma \approx \frac{1}{N} \sum_{n=1}^N p(\tilde{y}|\gamma^{(n)})

        where \gamma^{(n)} is the n-th Gibbs sample for the parameters \gamma.

        If instead of sampling the \theta,\varphi parameters in LDA, you use their expectations, you underestimate the variance of the posterior parameter distribution p(\theta,\varphi|y).

        You need to sample the multinomial from the Dirichlet in each sample and reason from that.

  7. Arwen Says:

    I have a question regrading equation (29) when you expand the gamma functions. When you expand the gamma in the numerator of the last term instead of having
    Gamma(X+1) = X*Gamma(X)
    you wind up with
    Gamma(X+1) = (X+1)*Gamma(X)
    Is this a typo or is there a particular trick I am missing in this equation?

    • lingpipe Says:

      Thanks for the error correction. Our technicians are working on a patch. I’ll post another comment when the new version’s out.

    • lingpipe Says:

      Thanks again. I updated equations (29) and (30) to get rid of the “+1″ terms, which shouldn’t have been there. This should’ve been obvious to me when I proofread it for the umpteenth time given that the “+1″ was gone again in equation (31). I blame not having unit tests for math!

  8. arwentwinkle Says:

    Question: in the step from (31) to (32) you replace the sum over J with the appropriate count (makes sense) and then add (J*beta_j). However, there is no longer an index variable j with which to select the jth element of the beta vector.
    Looking at the Griffiths and Steyvers paper, they are using a single element beta , symmetric Dirichlet prior, instead of a vector beta allowing non-symmetric prior. Also, the relationship between your final form (sans normalization) and the Griffith & Steyvers final form (5) isn’t crystal clear to me.
    Typo: In equations (28) – (30) the denominator of the first beta term [Gamma(sum over J)] I believe that you have dropped the superscript –(a,b) that should be attached to C_k,*,j

    I really appreciated your quick response previously. I am working through this derivation in detail because I have to do one using some slightly different distributions but same basic idea. I will be sighting you as your work has been invaluable to me. :)
    Thank you again. Very helpful piece of work and very well done.

    • lingpipe Says:

      I feel your pain. The reason I worked through it was so that I could get to the naive Bayes case. What app are you working on?

      Thanks again for the corrections. The reduction to J \times \beta_j in the denominator of (32) is wrong. It should be \sum_{j=1}^J \beta_j for the general case where the \beta_j may be different.

      The -(a,b) was intentionally dropped from the first terms based on the argument in the text before (28). The key is factoring out terms that that depend on (a,b) and those that don’t. This is why everything can drop out in the end up to proportionality.

      I still haven’t found a citation to whoever worked this out the first time. Maybe it’s just obvious to people with math brains the size of a planet.

      I’m looking forward to your comments on the naive Bayes derivation! That was a little trickier in the \Gamma() terms, but not much.

  9. Bf Says:

    I’ve been trying for a day now to find out how Griffiths and Steyvers arrive at those results after the integration. Now everything is clear. I wasn’t aware of the trick with the normalisation constant of the Dirichlet to get rid of the integral! Great work! Thanks for submitting it!

  10. Viet-An Nguyen Says:

    Thank you for your wonderful note. It has helped me a lot.
    I just have a question regarding the Gamma expansion at the top of page 9. Should the expansion be
    \Gamma(x+q) = \Gamma(x) + \prod_{i=0}^P{q-1} (x+i)
    instead of
    \Gamma(x+q) = \Gamma(x) + \prod_{i=1}^P{q} (x+i)

  11. Summing Out Topic Assignments in LDA « LingPipe Blog Says:

    [...] a previous post, I showed how to integrate out LDA’s multinomial parameters, with the result being a discrete distribution. In this post, I’ll show how to turn this [...]

  12. Kairit Says:

    I’m wondering if there are products over words missing in the first integral of 18 and 19.

    there is currently
    \int\prod_{m=1}^N p(z_m|\theta_m)p(\theta_m|\alpha)d\theta,

    but I think there should be:
    \int\prod_{m=1}^N\prod_{n=1}^{N_m} p(z_{m, n}|\theta_m)p(\theta_m|\alpha)d\theta,

    because z_m doesn’t make sense as there are more than one topic in each m-th document.

    I’m also confused about the derivation of the second term in 21. In 20 it has product over topics and inside every topic there is a product over all the words in all documents. So in every topic one iteration over all words in all documents will be made and by using the exponentiation of the sums it could be turned into the product over all topics and all word types. But this product over all topics and all word types must be done inside every iteration of outer product over topics. So in my understanding it is not possible to convert the second term of 20 which has outer product over topics and inner product over documents and words (regardless of the topic) into 21 where there is only outer product over topics and inner product over word types and outer product selecting the topics.

    However, for some reason I believe that you are right and I am mistaken, but where does my reasoning go wrong?

  13. Kairit Says:

    So, spending some more time staring at the formulas 20 and 21 and expanding the terms in a small example, I must agree with that derivation. So no need for explanation any more.

    • Bob Carpenter Says:

      Thanks for clarifying — I still hadn’t gotten around to diving back into the derivation.

      • tabproxy Says:

        Thank you for your great explanation. It helps me a lot :-) Probably this is the most comprehensive explanation I have ever found.

        Regarding Kairit’s question: “…if there are products over words missing in the first integral of 18 …” .. I believe it is true too, but perhaps could you please give a little more explanation ? Because I could not find a relationship between the first part of integral and the independence assumption (7).

        I really appreciate it. Please kindly advise. Many thxs in advance.

  14. Bob Carpenter Says:

    OK, I went back and had a look. The reason you can reduce (17) to (18) is that the words y are conditionally independent of everything but the word distributions \phi and topic assignments z. More specifically, a given word y[m,n] only depends on \phi_{z[m,n]}, the distribution over words for the topic assigned to word n in document m.

  15. Martin Says:

    Can you please explain why your result in (32) differs from the one in Griffiths2002 (1)?

  16. Bob Carpenter Says:

    I take it you mean

    Griffiths, T. L., & Steyvers, M. (2002). A probabilistic approach to semantic representation. Proceedings of the 24th Annual Conference of the Cognitive Science Society.

    Formula (1) in that paper is the same as formula [5] in their 2004 PNAS paper cited above.

    Part of the answer is that I allowed asymmetric priors. With a symmetric prior, the W\beta term in their denominator term is the same as what I called J \gamma in the note after the formula.

    The other step was dropping the topic-related denominator (in their notation, the n^{(d_i)}_{-i,\cdot} + T\alpha term); note that it’s the same for every j, so you don’t need to compute it.

Leave a Reply

Fill in your details below or click an icon to log in:

Gravatar
WordPress.com Logo

Please log in to WordPress.com to post a comment to your blog.

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

Join 149 other followers