[**Update:** 20 April 2011. *I replaced my original (correct, but useless) derivation with the derivation Matt Hoffman provided in the comments.*]

In a previous post, I showed how to integrate out LDA’s multinomial parameters, with the result being a discrete distribution. In fact, we could go all the way to deriving the probability of each word’s topic assignment given all other parameters, thus defining a natural collapsed Gibbs sampling strategy.

In this post, I’ll show how to turn this around, summing out the discrete parameters in order to produce a fully continuous density. The nice feature about such a density is that it’s amenable to Hamiltonian Monte Carlo methods. These may actually mix faster than the collapsed Gibbs sampler. Now that we have a working HMC implementation, it’ll be easy to test.

It’s always pretty straightforward to sum out discrete parameters. For instance, it’s commonly encountered in the textbook mixture model in the E step of EM (e.g., for soft K-means clustering or semi-supervised learning). It’s all based on the sum rule, which says that if p(x, z) is a joint density in which z is discrete and can take on the values 1, …, K, then

p(x) = Σ_{k=1}^K p(x, z=k).
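As a tiny numeric check of the sum rule, here it is applied to a two-component Gaussian mixture in Python; all parameter values are made up for the example:

```python
import numpy as np

# Illustrative two-component Gaussian mixture; z ∈ {0, 1} is the
# discrete variable being summed out.
pi = np.array([0.3, 0.7])    # mixing weights p(z = k)
mu = np.array([-1.0, 2.0])   # component means
sigma = 1.0                  # shared standard deviation

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

x = 0.5
# Sum rule: p(x) = Σ_k p(z = k) p(x | z = k), with z summed out.
p_x = sum(pi[k] * normal_pdf(x, mu[k], sigma) for k in range(2))
```

The marginal p_x no longer mentions z at all, which is exactly the move made below for LDA’s topic assignments.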

All we need to do is repeatedly apply this rule to each of LDA’s topic assignments to tokens.

### Latent Dirichlet Allocation

The LDA model involves a combination of Dirichlet and multinomial distributions. We observe the words w_{d,1}, …, w_{d,N_d} for each document d ∈ 1:D, where document d is of length N_d. We let w be the set of all words observed. We assume there are K topics. Next, there are multinomial parameters θ_d of dimension K for the distribution of topics in document d. We also assume multinomial parameters φ_k of dimension V (the vocabulary size) for the distribution of words in topic k. For simplicity, we assume fixed Dirichlet hyperpriors α and β of dimensionality K and V over the multinomial parameters θ_d and φ_k.

The full probability model is defined by

p(θ, φ, z, w | α, β) = Π_{k=1}^K Dir(φ_k | β) × Π_{d=1}^D ( Dir(θ_d | α) × Π_{n=1}^{N_d} Mult(z_{d,n} | θ_d) × Mult(w_{d,n} | φ_{z_{d,n}}) ).
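To make the model concrete, here is a sketch of LDA’s generative story in Python; the sizes K, V, and D, the hyperparameter values, and the Poisson document lengths are illustrative assumptions, not part of the model definition:

```python
import numpy as np

# Illustrative sketch of LDA's generative process.
rng = np.random.default_rng(42)
K, V, D = 3, 10, 5             # topics, vocabulary size, documents
alpha = np.full(K, 0.5)        # Dirichlet hyperprior over theta_d
beta = np.full(V, 0.5)         # Dirichlet hyperprior over phi_k

phi = rng.dirichlet(beta, size=K)      # phi[k]: word distribution for topic k
theta = rng.dirichlet(alpha, size=D)   # theta[d]: topic distribution for doc d

docs = []
for d in range(D):
    N_d = rng.poisson(20)                         # document length (assumed)
    z = rng.choice(K, size=N_d, p=theta[d])       # topic assignment per token
    w = np.array([rng.choice(V, p=phi[k]) for k in z], dtype=int)
    docs.append(w)
```

It’s exactly the z array here that the rest of the post sums out.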

### Summing out the Topic Assignments

In our previous post, we showed how to simplify the expression for the distribution of topic assignments given words and hyperpriors, p(z | w, α, β). Here, we are going to derive p(θ, φ | w, α, β), the density of the document distributions θ and topic distributions φ given the words w and hyperpriors α and β.

Next, by repeated application of the sum rule for total probability, we get

p(θ, φ, w | α, β) = Σ_z p(θ, φ, z, w | α, β).

Expanding out the definitions, we are left with a nice tight expression that’s easy to calculate:

p(θ, φ, w | α, β) = Π_{k=1}^K Dir(φ_k | β) × Π_{d=1}^D Dir(θ_d | α) × Π_{d=1}^D Π_{n=1}^{N_d} Σ_{k=1}^K θ_{d,k} φ_{k,w_{d,n}}.

Of course, we’ll probably be doing this on the log scale, where

log p(θ, φ, w | α, β) = Z + Σ_{k=1}^K Σ_{v=1}^V (β_v − 1) log φ_{k,v} + Σ_{d=1}^D Σ_{k=1}^K (α_k − 1) log θ_{d,k} + Σ_{d=1}^D Σ_{n=1}^{N_d} log Σ_{k=1}^K θ_{d,k} φ_{k,w_{d,n}},

where Z is a normalizing term that we can ignore for most optimization, expectation or sampling-based methods. Note that the Γ functions inside the Dirichlet terms go into the Z term if we leave the priors α and β constant.
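Putting the pieces together, a sketch of the log density up to the constant Z (with the Dirichlet normalizers folded into Z, as just noted) might look like this in Python; the function name and the toy data at the bottom are illustrative:

```python
import numpy as np

# Sketch: log p(theta, phi, w | alpha, beta) up to the constant Z.
def lda_log_density(theta, phi, docs, alpha, beta):
    lp = np.sum((beta - 1.0) * np.log(phi))       # Dirichlet kernel for each phi_k
    lp += np.sum((alpha - 1.0) * np.log(theta))   # Dirichlet kernel for each theta_d
    for d, w in enumerate(docs):
        # token terms: log Σ_k theta[d, k] * phi[k, w_n]
        lp += np.sum(np.log(theta[d] @ phi[:, w]))
    return lp

# Illustrative toy example.
rng = np.random.default_rng(0)
K, V, D = 3, 8, 2
alpha, beta = np.full(K, 0.5), np.full(V, 0.5)
theta = rng.dirichlet(alpha, size=D)   # D x K
phi = rng.dirichlet(beta, size=K)      # K x V
docs = [np.array([0, 2, 5]), np.array([1, 1, 7, 3])]
lp = lda_log_density(theta, phi, docs, alpha, beta)
```

Everything here is smooth in θ and φ, which is what makes gradient-based methods like HMC applicable.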

April 14, 2011 at 10:33 pm |

Interesting. This also looks like it would be really easy to maximize numerically (at least up to a local optimum). Any idea why this isn’t done more often?

April 15, 2011 at 12:13 am |

There’s a pretty compelling connection between optimization and sampling when you start thinking about Hamiltonian Monte Carlo (Metropolis + gradients with really clever accept probabilities) and gradient-based optimization methods.

The lack of identifiability of topics and the highly multimodal nature of the optimization makes LDA a hard problem from a pure sampling or optimization point of view. Luckily, local optima work pretty well for most purposes.

April 17, 2011 at 10:13 pm |

The EM algorithm for pLSI does precisely that—it’s an algorithm for optimizing log p(w | θ, φ) over θ and φ. (Adding Dirichlet priors to θ and φ is easy, and worthwhile.)
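In code, the EM scheme this comment describes might look like the following sketch for maximizing log p(w | θ, φ) without priors; the function name, sizes, and toy data are all illustrative:

```python
import numpy as np

# One EM step for pLSI (illustrative sketch, no Dirichlet priors).
def plsi_em_step(theta, phi, docs):
    new_theta = np.zeros_like(theta)
    new_phi = np.zeros_like(phi)
    for d, w in enumerate(docs):
        # E step: responsibilities q(k | d, n) ∝ theta[d, k] * phi[k, w_n]
        q = theta[d][:, None] * phi[:, w]           # shape (K, N_d)
        q /= q.sum(axis=0, keepdims=True)
        # M step accumulators: expected topic and word counts
        new_theta[d] = q.sum(axis=1)
        np.add.at(new_phi, (slice(None), w), q)     # scatter-add over word types
    new_theta /= new_theta.sum(axis=1, keepdims=True)
    new_phi /= new_phi.sum(axis=1, keepdims=True)
    return new_theta, new_phi

# Illustrative toy run.
rng = np.random.default_rng(0)
K, V, D = 2, 5, 3
theta = rng.dirichlet(np.ones(K), size=D)
phi = rng.dirichlet(np.ones(V), size=K)
docs = [np.array([0, 1, 1, 4]), np.array([2, 3]), np.array([4, 4, 0])]
theta, phi = plsi_em_step(theta, phi, docs)
```

Adding MAP estimation with Dirichlet priors would amount to adding (α_k − 1) and (β_v − 1) pseudo-counts to the M-step accumulators.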

April 18, 2011 at 2:03 pm

But what’s going on with that huge sum term on the outside? I can’t see how it’d be computed. It’s going to be exponential in the number of tokens to compute by brute force, and on top of that, it has a number-of-tokens term on the inside from the products!

April 18, 2011 at 3:01 pm

[Update: I (Bob) combined Matt’s 3 comments into one fixed one. Thanks! And very cool use of unicode rather than LaTeX. FWIW, you can use LaTeX in WordPress with `$latex ... $`, but I don’t know how to preview, which is decidedly suboptimal for this sort of thing.]

Oh, sorry, I think I didn’t read this closely enough.

Note that the z variables are conditionally independent of one another given θ. So we can write

log p(θ,φ,z,w|α,β)

= log p(θ|α) + log p(φ|β) + Σ_d Σ_i [log p(z_{di}|θ_d) + log p(w_{di}|z_{di},φ)].

This just shows the conditional independence of z_{di}, w_{di} from the other z’s and w’s (given θ and φ).

Thus, when we want to marginalize out the z variables, we only have to worry about one word at a time. Now, marginalizing over z_{di}, we have:

p(w_{di}|θ_d,φ)

= Σ_k p(z_{di}=k|θ_d) p(w_{di}|z_{di}=k,φ)

= Σ_k θ_{dk}φ_{kw},

so

log p(θ,φ,w|α,β) =

log p(θ|α) + log p(φ|β) + Σ_d Σ_i log Σ_k θ_{dk} φ_{kw}.
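This identity is easy to check numerically: brute-force summation over all K^N topic assignments for a tiny document should match the product of per-token sums. A sketch, with all values illustrative:

```python
import numpy as np
from itertools import product

# Numeric check: summing over all K^N assignments z equals the
# product of per-token sums (illustrative values throughout).
rng = np.random.default_rng(1)
K, V = 2, 4
theta_d = rng.dirichlet(np.ones(K))      # topic distribution for one document
phi = rng.dirichlet(np.ones(V), size=K)  # per-topic word distributions
w = np.array([0, 3, 1])                  # three observed tokens

# Brute force: p(w | theta_d, phi) = Σ over all K^N assignments z.
brute = sum(
    np.prod([theta_d[z_i] * phi[z_i, w_i] for z_i, w_i in zip(z, w)])
    for z in product(range(K), repeat=len(w))
)

# Per-token marginalization: Π_n Σ_k theta_d[k] * phi[k, w_n].
per_token = np.prod([theta_d @ phi[:, w_i] for w_i in w])
```

The agreement is just the distributive law: the conditional independence of the z_{di} given θ and φ lets the exponential sum factor into one small sum per token.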

April 18, 2011 at 3:12 pm |

BTW, for those who’re interested, the first reference I know of that makes this observation in this context is Wray Buntine’s “Variational Extensions to EM and Multinomial PCA,” which recasts pLSI as a special case of exponential family PCA (see “A generalization of principal components to the exponential family” by Collins, Dasgupta, and Schapire).

I’m sure people noticed it before then, of course. Lee & Seung’s nonnegative matrix factorization paper, for example, optimizes an objective function very closely related to the pLSI objective.