I’m sure people noticed it before then, of course. Lee & Seung’s nonnegative matrix factorization paper, for example, optimizes an objective function very closely related to the pLSI objective.

[Update: *I (Bob) combined Matt’s 3 comments into one fixed one. Thanks! And very cool use of unicode rather than LaTeX. FWIW, you can use LaTeX with $latex…$, but I don’t know how to preview in WordPress, which is decidedly suboptimal for this sort of thing.*]

Oh, sorry, I think I didn’t read this closely enough.

Note that the z variables are conditionally independent of one another given θ. So we can write

log p(θ,φ,z,w|α,β)

= log p(θ|α) + log p(φ|β) + Σ_d Σ_i [log p(z_{di}|θ_d) + log p(w_{di}|z_{di},φ)].

This just shows the conditional independence of z_{di}, w_{di} from the other z’s and w’s (given θ and φ).

Thus, when we want to marginalize out the z variables, we only have to worry about one word at a time. Now, marginalizing over z_{di}, we have:

p(w_{di}|θ_d,φ)

= Σ_k p(z_{di}=k|θ_d) p(w_{di}|z_{di}=k,φ)

= Σ_k θ_{dk}φ_{kw},

so

log p(θ,φ,w|α,β) =

log p(θ|α) + log p(φ|β) + Σ_d Σ_i log Σ_k θ_{dk} φ_{kw}.
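To make the cost of that last term concrete, here is a minimal NumPy sketch (toy sizes and random parameters are made up for illustration) that evaluates Σ_d Σ_i log Σ_k θ_{dk} φ_{kw}. After marginalizing z one token at a time, each token only needs a sum over K topics, so the whole thing is linear in the number of tokens:

```python
import numpy as np

rng = np.random.default_rng(0)

D, K, V, N = 4, 3, 10, 20  # toy sizes: docs, topics, vocab, tokens per doc

# Hypothetical parameters, rows summing to 1.
theta = rng.dirichlet(np.ones(K), size=D)   # theta[d, k] = p(z = k | doc d)
phi = rng.dirichlet(np.ones(V), size=K)     # phi[k, w]   = p(word w | topic k)
words = rng.integers(V, size=(D, N))        # words[d, i] = token w_{di}

# log p(w | theta, phi): one sum over K topics per token, so the total cost
# is O(D * N * K) -- linear in the number of tokens, not exponential.
log_lik = 0.0
for d in range(D):
    for i in range(N):
        log_lik += np.log(theta[d] @ phi[:, words[d, i]])

# Vectorized equivalent: p[d, w] = sum_k theta[d, k] * phi[k, w].
p = theta @ phi
log_lik_vec = np.log(p[np.arange(D)[:, None], words]).sum()
```

The two computations agree; the vectorized form just precomputes the D-by-V matrix of per-token probabilities in one matrix product.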

But what’s going on with that huge sum term on the outside? I can’t see how it’d be computed: marginalizing over all the z’s jointly by brute force is exponential in the number of tokens, and on top of that, each term carries a product over all the tokens!

The EM algorithm for pLSI does precisely that: it’s an algorithm for optimizing log p(w | θ, φ) over θ and φ. (Adding Dirichlet priors to θ and φ is easy, and worthwhile.)
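A minimal sketch of that EM loop, under the notation above (toy sizes and random data are made up; this is not Hofmann’s original implementation): the E-step computes responsibilities p(z_{di}=k | w_{di}, θ, φ) ∝ θ_{dk} φ_{kw} by Bayes’ rule, and the M-step re-estimates θ and φ from the resulting expected counts.

```python
import numpy as np

rng = np.random.default_rng(1)
D, K, V, N = 4, 3, 10, 20          # toy sizes: docs, topics, vocab, tokens/doc
words = rng.integers(V, size=(D, N))

# Random initialization (rows sum to 1).
theta = rng.dirichlet(np.ones(K), size=D)   # theta[d, k] = p(z = k | doc d)
phi = rng.dirichlet(np.ones(V), size=K)     # phi[k, w]   = p(word w | topic k)

def log_lik(theta, phi):
    p = theta @ phi                  # p[d, w] = sum_k theta[d, k] * phi[k, w]
    return np.log(p[np.arange(D)[:, None], words]).sum()

first = log_lik(theta, phi)
for _ in range(50):
    # E-step: r[d, i, k] = p(z_{di} = k | w_{di}, theta, phi), by Bayes' rule.
    r = theta[:, None, :] * phi[:, words].transpose(1, 2, 0)
    r /= r.sum(axis=2, keepdims=True)
    # M-step: maximum-likelihood re-estimates from expected counts.
    theta = r.sum(axis=1) / N        # each token's r sums to 1 over k
    phi = np.zeros((K, V))
    for d in range(D):
        for i in range(N):
            phi[:, words[d, i]] += r[d, i]
    phi /= phi.sum(axis=1, keepdims=True)

final = log_lik(theta, phi)          # EM never decreases the log likelihood
```

Adding the Dirichlet priors mentioned above just means adding pseudo-counts (α − 1, β − 1) to the expected counts in the M-step before normalizing.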

There’s a pretty compelling connection between optimization and sampling when you start thinking about Hamiltonian Monte Carlo (Metropolis + gradients with really clever accept probabilities) and gradient-based optimization methods.
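To see the "Metropolis + gradients" structure concretely, here is a toy sketch of HMC on a 1-D standard normal (target and tuning constants are made up for illustration): a leapfrog integrator follows the gradient, just like gradient descent with momentum, and a Metropolis accept step based on the change in total energy corrects for discretization error.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy target: standard normal, so U(q) = q^2 / 2 and grad U(q) = q.
U = lambda q: 0.5 * q * q
grad_U = lambda q: q

def hmc_step(q, eps=0.1, L=20):
    p = rng.normal()                  # fresh Gaussian momentum
    q_new, p_new = q, p
    # Leapfrog integration: gradient steps interleaved with position steps.
    p_new -= 0.5 * eps * grad_U(q_new)
    for _ in range(L - 1):
        q_new += eps * p_new
        p_new -= eps * grad_U(q_new)
    q_new += eps * p_new
    p_new -= 0.5 * eps * grad_U(q_new)
    # Metropolis accept/reject on the change in total energy H = U + p^2/2.
    h_old = U(q) + 0.5 * p * p
    h_new = U(q_new) + 0.5 * p_new * p_new
    return q_new if rng.random() < np.exp(h_old - h_new) else q

q, samples = 0.0, []
for _ in range(5000):
    q = hmc_step(q)
    samples.append(q)
samples = np.array(samples)
```

If the integrator were exact, energy would be conserved and every proposal would be accepted; the clever part is that the accept probability exp(H_old − H_new) exactly corrects whatever error the discretization introduces.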

The lack of identifiability of topics and the highly multimodal nature of the optimization makes LDA a hard problem from a pure sampling or optimization point of view. Luckily, local optima work pretty well for most purposes.
