Bayesian inference for LDA is intractable. And I mean really really deeply intractable in a way that nobody has figured or is ever likely to figure out how to solve.

Before sending me a “but, but, but, …” reply, you might want to bone up on the technical definition of Bayesian inference, which is a bit more than applying Bayes’s rule or using a prior,

- LingPipe Blog: What is Bayesian Statistical Inference?

and on computational intractability, which is a bit more involved than a program being slow,

- Wikipedia: Computationally Intractaability.

The reason Bayesian inference for LDA is intractable is that we can’t effectively integrate over its posterior. And I don’t just mean there’s no analytic solution; there rarely is for interesting Bayesian models. But for some models, you can use numerical techniques like MCMC to sample from the posterior and compute a posterior integral by averaging.

Posterior sampling doesn’t work for LDA. I know, I know, you’re going to tell me that lots of smart people have applied Gibbs sampling to LDA. You might even point out that LingPipe has a Gibbs sampler for LDA.

Yes, these systems return values. But they’re not sampling from the posterior according to the posterior distribution. The number of modes in the posterior makes comprehensive posterior sampling intractable.

And I’m not just talking about what people call “label switching”, which refers to the non-identifiability of the indexes. Label switching just means that you can permute the K indices corresponding to topics and get the same model predictions. It adds another layer of intractability, but it’s not the one that’s interesting — the real problem is multimodality.

So what does everybody do? Some use Gibbs sampling to take a single sample and then go with it. This is not a Bayesian approach; see the description of Bayesian inference above. It’s not even clear you can get a single effective sample this way (or at least it’s unclear if you can convince yourself you have an effective sample).

Others use variational inference to compute an approximate posterior mean corresponding to a single posterior mode. Variational inference approximates the posterior variance as well as the posterior mean. But alas, they don’t work that way for LDA, where there is no meaningful posterior mean due to lack of identifiability.

So what to do? The output of LDA sure does look cool. And clients do love reading the tea leaves, even if you obfuscate them in a word cloud.

I only implemented Gibbs for LDA in LingPipe because it was all that I understood when I did it. But now that I understand Bayesian inference much better, I agree with the experts: just do variational inference. You’re going to get a lousy point-based representation of the posterior no matter what you do, but variational inference is fast.

Specifically, I recommend Vowpal Wabbit package’s online LDA algorithm, which is based on Matt Hoffman’s stochastic variational LDA algorithm. It’ll also be much much more scalable than LingPipe’s sampling-based approach because it works online and doesn’t need to store all the data in memory, much like stochastic gradient descent.

The algorithm’s clean and it’d be easy enough to add such an online LDA package to LingPipe so that we could compute topic models over dozens of gigabytes of text really quickly (as Matt did with Wikipedia in the paper). But I don’t know that anyone has the time or inclination to do it.

February 18, 2013 at 5:47 pm |

Exactly to the point! :)

February 18, 2013 at 7:32 pm |

Thanks for the reality check, Bob. –Eric

February 19, 2013 at 6:50 am |

[…] Bayesian Inference for LDA is Intractable (lingpipe-blog.com) […]

February 19, 2013 at 9:59 am |

There are new spectral methods for doing LDA, stemming from the method of moments. They do not use the likelihood function at all.

See for example this recent paper, by Anandkumar et al.:

http://books.nips.cc/papers/files/nips25/NIPS2012_0441.pdf

February 20, 2013 at 11:17 am |

I’ve been seeing a lot of these spectral methods for mixture models (clustering) lately. Michael Collins and crew have been applying them to PCFGs and Hsu and crew have been doing HMMs.

I didn’t see a quick summary of their separability conditions, but claim to be relaxing a requirement that there be anchor words that only show up in a single topic. I’m surprised even distinct anchor words is enough separability. These results are pretty amazing.

Of course, they still don’t give you full Bayesian inference over LDA.

February 21, 2013 at 6:59 am |

Lending further support for your claim, have you seen Shay Cohen and Noah Smith’s work on the intractability of PCFG inference? Given the close connection between PCFG and LDA inference (LDA can be reduced to a PCFG) perhaps their result can be generalised to LDA? (Maybe they’ve already done it?)

February 21, 2013 at 3:16 pm |

No, I should ask Shay about it. He’s teaching a Bayesian NLP seminar this semester and the paper of yours and Sharon Goldwater’s on morphology is next on the reading list!

February 26, 2013 at 7:27 pm |

Is there any speed comparison between stochastic variational approximations and collapsed Gibbs Sampling (for LDA-ish models)? The papers I know about (http://arxiv.org/pdf/1206.7051v1.pdf and the one you linked) only look at stochastic variational vs batch variational; but from the Asuncion 2009ish paper I was under the impression CGS was a bit faster than batch variational, and I’d at least argue that’s due in part to its having a more online nature than batch variational.

March 1, 2013 at 12:06 am |

Another recent paper on method of moments type of methods,

with experiments, and I think is very cool, is “A Practical Algorithm for Topic Modeling with Provable Guarantees”

http://arxiv.org/pdf/1212.4777v1.pdf

November 24, 2017 at 11:14 am |

Bob, would you recommend against using the Variational Bayes option in Stan? If not, could you please help me understand why that’d be inferior to your recommendation?

November 24, 2017 at 1:17 pm |

Yes, the variational Bayes option in Stan uses a generic multivariate normal variational approximation. And it’s batch. I’d recommend Matt Hoffman’s highly scalable stochastic variational inference implementation in Vowpal Wabbit. It uses a custom variational approximation to LDA which should also be better than Stan’s generic multivariate normal approximation. LingPipe’s Monte Carlo implementation will be pretty fast, too. All of these methods will only find a single mode, so if you re-run with new random seeds, you’ll get different answers.