I read lots of papers, so I thought I’d start a blog thread to share what I’m thinking about them. This is the kickoff edition, based on a paper I read on the subway to and from Columbia Uni this moring:
- Porteous, Ian, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. 2008. Fast Collapsed Gibbs Sampling for Latent Dirichlet Allocation. KDD ’08.
In the first pass writing this, I forgot to mention the results. They were able to speed up LDA with 4000 latent topics over a pruned MEDLINE corpus of 8.2M docs, 140K words, and 730M token instances by a factor of 8 (with some footnotes). Some 800 topic runs for smaller data sets were sped up by a factor of 4. I’d have expected a larger speedup for 4K topics, but issues such as cache locality can have a huge effect on this kind of thing. One question is how much constant overhead was shared between the runs?
You might be familiar with collapsed Gibbs sampling for LDA through the LingPipe class
cluster.LatentDirichletAllocation, which is explained with simple and real-world examples in our clustering tutorial.
If you’re not familiar with LDA and its collapsed Gibbs sampler, I’m not going to have time to explain it in this note!
Collapsed Gibbs sampling for LDA was introduced in (Griffiths and Steyvers 2004). The collapsing is of the multinomials, so that a Gibbs sample is represented by an assignment of topics to each word in each document. The Dirichlet priors are hyperparameters (that is, set to constants, not fitted) in the collapsed sampler.
Porteous et al. present a method for speeding up the sampling step of the inner loop of collapsed LDA, but their method is actually much more general. The approach is pretty straightforward, and reminds me of arithmetic coding. It relies on being able to visit outcomes in roughly decreasing order or probability, compute their unnormalized probability efficiently, and computing an upper bound on the normalization constant given only the unnormalized probabilities of the items seen so far (plus other knowledge if it’s easy to compute).
Suppose we have
K topics and
W words and we’re looking at word
w in document
j. We can compute unnormalized probabilities
q[k] for each topic
k, so that the probability of a topic
p[k] = q[k]/Z, where the normalization term
Z=q+...+q[K]. If we can create a sequence of upper bounds
Z,...,Z[K], such that
Z[i] > Z[K] = Z, then we have a sequence of decreasing upper bounds on
q[k]/Z,...,q[k]/Z[K], such that the final bound is exact
q[k]/Z[K] = q[k]/Z = p[k].
The point of the algorithm is to commit to a sampled outcome as soon as soon as the probability is bounded. We start by generating random number
u in Uniform(0,1). If it is lower than
q/Z, we can return topic 1 without computing any of the other
Z[k]. If it is greater, we have to compute
Z, at which point we may or may not have bounded the result (we have if
(q+q)/Z < u). If the bounds are fairly tight and we consider the topics with high values of
q[k] first, we can potentially return answers with much less work than computing all normalized probabilities.
The amount of speedup is clearly related to the amount of skew in the distributions and the tightness of the upper bounds on the normalization constants. LDA topic distribution per word is highly skewed, especially as the Dirichlet parameter gets smaller, as is typically the case with large topic models. In cases where the distribution is uniform, the sort is poor, or the upper bounds are very loose, Porteous et al.’s optimization can take longer than Griffiths and Steyvers’ direct sampling.
For LDA, Porteous et al. sort topics by document probability, which is a good proxy for overall topic probability. They compute upper bounds
Z[k] using a nifty inequality discovered in the late 1800s called Hölder’s inequality (which is based on vector norms). They consider two parameterizations, but this is not the only way to bound probabilities.
Porteous et al.’s technique can be applied in many situations in graphical models. The situation in LDA has us estimate
p(topic=k|word,document,θ,φ) which is proportional to
p(topic|θ[doc]) * p(word|φ[k]). This is a typical setup if we’re Gibbs sampling an interior node in a graphical model. It involves the parent node (
θ[doc]), the daughter node (
w), and the daughter’s mother (aka an aunt)
I wouldn’t be surprised if general Gibbs sampling packages like HBC could incorporate Porteous et al.’s algorithm as a compiler optimization. It’d be even better if we could make like Java’s Just in Time (JIT) compiler and do run-time profiling to decide when to apply Porteous et al.’s algorithm.
Finally, a big raspberry for KDD for making their proceedings closed source through the pay-to-access ACM portal.