[Update: 20 April 2011. I replaced my original (correct, but useless) derivation with the derivation Matt Hoffman provided in the comments.]
In a previous post, I showed how to integrate out LDA’s multinomial parameters, with the result being a discrete distribution. In fact, we could go all the way to deriving the probability of each word’s topic assignment given all other parameters, thus defining a natural collapsed Gibbs sampling strategy.
In this post, I’ll show how to turn this around, summing out the discrete parameters in order to produce a fully continuous density. A nice feature of such a density is that it’s amenable to Hamiltonian Monte Carlo (HMC) methods, which may actually mix faster than the collapsed Gibbs sampler. Now that we have a working HMC implementation, it’ll be easy to test.
It’s always pretty straightforward to sum out discrete parameters. For instance, it’s commonly encountered in the textbook mixture model in the E step of EM (e.g., for soft K-means clustering or semi-supervised learning). It’s all based on the sum law, which says that if $p(x,z)$ is discrete in $z$ and $z$ can take on values $1,\ldots,K$, then

$$p(x) = \sum_{z=1}^{K} p(x,z).$$
All we need to do is repeatedly apply this rule to each of LDA’s topic assignments to tokens.
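Before doing that, here’s a minimal Python sketch of the sum law itself, marginalizing the discrete indicator $z$ out of a two-component Gaussian mixture. The mixture weights, means, and scales are made-up values for illustration only.

```python
# Minimal sketch of the sum law: marginalize the discrete component
# indicator z out of a two-component Gaussian mixture.
# All parameter values below are illustrative assumptions.
import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.7])      # p(z = k), the mixture weights
mu = np.array([-1.0, 2.0])     # component means
sigma = np.array([1.0, 0.5])   # component scales

def p_x(x):
    """p(x) = sum_k p(z = k) * p(x | z = k), with z summed out."""
    return np.sum(pi * norm.pdf(x, loc=mu, scale=sigma))

print(p_x(0.0))  # density of x = 0 with z marginalized away
```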
Latent Dirichlet Allocation
The LDA model involves a combination of Dirichlet and multinomial distributions. We observe the words $w_{d,n} \in 1{:}V$ for each document $d \in 1{:}D$, where document $d$ is of length $N_d$ and $V$ is the size of the vocabulary. We let $w$ be the set of all words observed. We assume there are $K$ topics. Next, there are multinomial parameters $\theta_d$ of dimension $K$ for the distribution of topics in document $d$. We also assume multinomial parameters $\phi_k$ of dimension $V$ for the distribution of words in topic $k$. For simplicity, we assume fixed Dirichlet hyperpriors $\alpha$ and $\beta$ of dimensionality $K$ and $V$ over the multinomial parameters $\theta_d$ and $\phi_k$.
The full probability model is defined by

$$p(\theta,\phi,z,w \mid \alpha,\beta) = \prod_{k=1}^{K} \mathrm{Dir}(\phi_k \mid \beta) \times \prod_{d=1}^{D} \Big( \mathrm{Dir}(\theta_d \mid \alpha) \times \prod_{n=1}^{N_d} \mathrm{Cat}(z_{d,n} \mid \theta_d) \times \mathrm{Cat}(w_{d,n} \mid \phi_{z_{d,n}}) \Big),$$

where $\mathrm{Cat}(z_{d,n} \mid \theta_d) = \theta_{d,z_{d,n}}$ and $\mathrm{Cat}(w_{d,n} \mid \phi_{z_{d,n}}) = \phi_{z_{d,n},w_{d,n}}$.
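To make the factorization concrete, here’s a small NumPy sketch of the generative process it encodes. The sizes ($D$, $K$, $V$, the document lengths) and the hyperparameter values are illustrative assumptions, not anything prescribed by the model.

```python
# Sketch of LDA's generative process; all sizes and hyperparameter
# values below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
D, K, V = 5, 3, 20              # documents, topics, vocabulary size
N = [10] * D                    # document lengths N_d
alpha = np.full(K, 1.0)         # Dirichlet hyperprior on theta_d
beta = np.full(V, 0.5)          # Dirichlet hyperprior on phi_k

phi = rng.dirichlet(beta, size=K)      # phi_k ~ Dir(beta), shape (K, V)
theta = rng.dirichlet(alpha, size=D)   # theta_d ~ Dir(alpha), shape (D, K)

docs = []
for d in range(D):
    z = rng.choice(K, size=N[d], p=theta[d])              # z_{d,n} ~ Cat(theta_d)
    w = np.array([rng.choice(V, p=phi[k]) for k in z])    # w_{d,n} ~ Cat(phi_{z_{d,n}})
    docs.append(w)
```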
Summing out the Topic Assignments
In our previous post, we showed how to simplify the expression for the distribution of topic assignments given words and hyperpriors, $p(z \mid w, \alpha, \beta)$. Here, we are going to derive $p(\theta, \phi \mid w, \alpha, \beta)$, the density of the document distributions $\theta$ and topic distributions $\phi$ given the words and hyperpriors.
Next, noting that $p(\theta,\phi \mid w,\alpha,\beta) \propto p(\theta,\phi,w \mid \alpha,\beta)$, and by repeated application of the sum rule for total probability (once for each topic assignment $z_{d,n}$, each of which appears in only one factor), we get

$$p(\theta,\phi \mid w,\alpha,\beta) \propto \prod_{k=1}^{K} \mathrm{Dir}(\phi_k \mid \beta) \times \prod_{d=1}^{D} \Big( \mathrm{Dir}(\theta_d \mid \alpha) \times \prod_{n=1}^{N_d} \sum_{z_{d,n}=1}^{K} \mathrm{Cat}(z_{d,n} \mid \theta_d)\,\mathrm{Cat}(w_{d,n} \mid \phi_{z_{d,n}}) \Big).$$

Expanding out the definitions, we are left with a nice tight expression that’s easy to calculate:

$$p(\theta,\phi \mid w,\alpha,\beta) \propto \prod_{k=1}^{K} \mathrm{Dir}(\phi_k \mid \beta) \times \prod_{d=1}^{D} \Big( \mathrm{Dir}(\theta_d \mid \alpha) \times \prod_{n=1}^{N_d} \sum_{k=1}^{K} \theta_{d,k}\,\phi_{k,w_{d,n}} \Big).$$
Of course, we’ll probably be doing this on the log scale, where

$$\log p(\theta,\phi \mid w,\alpha,\beta) = Z + \sum_{k=1}^{K} \log \mathrm{Dir}(\phi_k \mid \beta) + \sum_{d=1}^{D} \Big( \log \mathrm{Dir}(\theta_d \mid \alpha) + \sum_{n=1}^{N_d} \log \sum_{k=1}^{K} \theta_{d,k}\,\phi_{k,w_{d,n}} \Big),$$

where $Z$ is a normalizing term that we can ignore for most optimization, expectation or sampling-based methods. Note that the $\Gamma$ functions inside the Dirichlet terms go into the $Z$ term if we leave the priors $\alpha$ and $\beta$ constant.
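Putting the pieces together, here’s a sketch of that summed-out log density in Python. It assumes the toy `docs`, `theta`, `phi`, `alpha`, and `beta` from the generative sketch above; SciPy’s `dirichlet.logpdf` includes the $\Gamma$ normalizers, but with the priors held constant those only shift the result by a constant. An actual HMC implementation would also need gradients, e.g. via automatic differentiation.

```python
# Sketch of the summed-out LDA log density, up to an additive constant.
# Assumes docs, theta, phi, alpha, beta as simulated in the sketch above.
import numpy as np
from scipy.stats import dirichlet

def lda_log_density(theta, phi, docs, alpha, beta):
    """log p(theta, phi | w, alpha, beta) + const, with z summed out."""
    # Topic-word Dirichlet priors: sum_k log Dir(phi_k | beta)
    lp = sum(dirichlet.logpdf(phi[k], beta) for k in range(phi.shape[0]))
    for d, w in enumerate(docs):
        # Document-topic Dirichlet prior: log Dir(theta_d | alpha)
        lp += dirichlet.logpdf(theta[d], alpha)
        # Token terms: sum_n log sum_k theta_{d,k} * phi_{k, w_{d,n}}
        lp += np.sum(np.log(theta[d] @ phi[:, w]))
    return lp

print(lda_log_density(theta, phi, docs, alpha, beta))
```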