Bayesian inference for LDA is intractable. And I mean really really deeply intractable in a way that nobody has figured or is ever likely to figure out how to solve.
Before sending me a “but, but, but, …” reply, you might want to bone up on the technical definition of Bayesian inference, which is a bit more than applying Bayes’s rule or using a prior,
- LingPipe Blog: What is Bayesian Statistical Inference?
and on computational intractability, which is a bit more involved than a program being slow,
- Wikipedia: Computationally Intractaability.
The reason Bayesian inference for LDA is intractable is that we can’t effectively integrate over its posterior. And I don’t just mean there’s no analytic solution; there rarely is for interesting Bayesian models. But for some models, you can use numerical techniques like MCMC to sample from the posterior and compute a posterior integral by averaging.
Posterior sampling doesn’t work for LDA. I know, I know, you’re going to tell me that lots of smart people have applied Gibbs sampling to LDA. You might even point out that LingPipe has a Gibbs sampler for LDA.
Yes, these systems return values. But they’re not sampling from the posterior according to the posterior distribution. The number of modes in the posterior makes comprehensive posterior sampling intractable.
And I’m not just talking about what people call “label switching”, which refers to the non-identifiability of the indexes. Label switching just means that you can permute the K indices corresponding to topics and get the same model predictions. It adds another layer of intractability, but it’s not the one that’s interesting — the real problem is multimodality.
So what does everybody do? Some use Gibbs sampling to take a single sample and then go with it. This is not a Bayesian approach; see the description of Bayesian inference above. It’s not even clear you can get a single effective sample this way (or at least it’s unclear if you can convince yourself you have an effective sample).
Others use variational inference to compute an approximate posterior mean corresponding to a single posterior mode. Variational inference approximates the posterior variance as well as the posterior mean. But alas, they don’t work that way for LDA, where there is no meaningful posterior mean due to lack of identifiability.
So what to do? The output of LDA sure does look cool. And clients do love reading the tea leaves, even if you obfuscate them in a word cloud.
I only implemented Gibbs for LDA in LingPipe because it was all that I understood when I did it. But now that I understand Bayesian inference much better, I agree with the experts: just do variational inference. You’re going to get a lousy point-based representation of the posterior no matter what you do, but variational inference is fast.
Specifically, I recommend Vowpal Wabbit package’s online LDA algorithm, which is based on Matt Hoffman’s stochastic variational LDA algorithm. It’ll also be much much more scalable than LingPipe’s sampling-based approach because it works online and doesn’t need to store all the data in memory, much like stochastic gradient descent.
The algorithm’s clean and it’d be easy enough to add such an online LDA package to LingPipe so that we could compute topic models over dozens of gigabytes of text really quickly (as Matt did with Wikipedia in the paper). But I don’t know that anyone has the time or inclination to do it.