I’ve recently run across not one but two papers on Bayesian language modeling. I haven’t been living under a rock, but it has taken me a while to get my head around Bayesian modeling.
The first of these is a very flexible hierarchical model:
David M. Blei, Andrew Y. Ng and Michael I. Jordan. 2003.
Latent Dirichlet Allocation. JMLR.
Their approach is so by-the-book that it would have made a nice example in either the hierarchical models chapter or the posterior mode chapter of Gelman et al.’s Bayesian Data Analysis. The model generates a document by first selecting a multinomial distribution over topics, then, for each word position, selecting a topic from that distribution and generating the word from a conditional probability estimate of words given the selected topic. The Dirichlet distribution in the title is a conjugate prior for multinomials; it represents the distribution from which the per-document topic distributions are drawn, and the posterior distribution (after EM-style training) is also a Dirichlet. The topics are latent in that they are bootstrapped using EM rather than trained from labeled data.
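The generative story can be sketched in a few lines. This is a toy simulation of LDA’s forward process only (no inference); the dimensions and hyperparameters are made-up illustration values, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and symmetric Dirichlet hyperparameters (assumed for illustration)
num_topics, vocab_size, doc_len = 3, 10, 20
alpha = 0.5   # prior over per-document topic distributions
beta = 0.1    # prior over per-topic word distributions

# One multinomial over words per topic, each drawn from a Dirichlet prior
topic_word = rng.dirichlet(np.full(vocab_size, beta), size=num_topics)

def generate_document():
    # 1. Draw this document's multinomial topic distribution from the Dirichlet prior
    theta = rng.dirichlet(np.full(num_topics, alpha))
    words = []
    for _ in range(doc_len):
        # 2. Draw a topic for this word position
        z = rng.choice(num_topics, p=theta)
        # 3. Draw the word from that topic's word distribution
        words.append(int(rng.choice(vocab_size, p=topic_word[z])))
    return words

doc = generate_document()
```

Inference then runs this story in reverse: given the words, recover the latent topic assignments and distributions.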
The obvious applications are in multi-class clustering, where an optimal mixture of topics may be inferred for each document. It’s also being used for dimensionality reduction in the same way as latent semantic indexing (LSI).
I’m going to use a simpler latent Dirichlet model for inferring paradigms for morphological stemming from unlabeled data. The inference is a bit simpler, as there’s only one paradigm chosen per word, in contrast with the multiple topics chosen per document. With EM-style training, this amounts to what’s often called "model-based clustering", "soft clustering", or just "EM clustering". My models aren’t simple conditionals, but will be a combination of length models (for stems and suffixes) and a character language model for generating the actual stems. Note that this proposal is not at all what Erwin Chan did recently in his 2006 SIGPHON paper Learning probabilistic paradigms for morphology in a latent class model; he applied LDA to perform a divisive clustering of stems and suffixes, with the goal of retrieving linguistically sensible hard clusters for stems and suffixes.
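To make the "one latent class per item, soft assignments via EM" idea concrete, here is a deliberately stripped-down sketch. The word list, the number of paradigms, and the per-cluster model (a character unigram multinomial rather than the length-plus-character-LM combination described above) are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: each word will get a posterior over latent paradigms.
words = ["walked", "talked", "jumped", "walking", "talking", "jumping"]
alphabet = sorted(set("".join(words)))
char_idx = {c: i for i, c in enumerate(alphabet)}
K = 2  # number of latent paradigms (assumed)

# Character count vector per word
counts = np.zeros((len(words), len(alphabet)))
for w_i, w in enumerate(words):
    for c in w:
        counts[w_i, char_idx[c]] += 1

# Random initialization: mixture weights and per-paradigm char distributions
pi = np.full(K, 1.0 / K)
phi = rng.dirichlet(np.ones(len(alphabet)), size=K)

for _ in range(50):
    # E-step: posterior responsibility of each paradigm for each word
    log_lik = counts @ np.log(phi).T + np.log(pi)
    log_lik -= log_lik.max(axis=1, keepdims=True)
    resp = np.exp(log_lik)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixture weights and character distributions
    pi = resp.mean(axis=0)
    phi = (resp.T @ counts) + 0.1   # small additive smoothing
    phi /= phi.sum(axis=1, keepdims=True)
```

The `resp` matrix holds the soft cluster assignments; hardening them (argmax per row) would give the kind of hard clusters Chan was after.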
Another paper in this paradigm brings in correlation between topics: Lafferty and Blei’s 2005 NIPS paper, Correlated Topic Models. The basic Dirichlet/multinomial model has been used for IR (Lafferty and Zhai), as has LDA (Zhai).
The second of these papers solves a problem I’ve been mulling over for years, which is how to model the Zipf-like (power law) distribution of language with a reasonable prior and inference scheme. For the answer, look no further than:
Yee Whye Teh. 2006. A Hierarchical Bayesian Language Model based on Pitman-Yor Processes. ACL.
The Pitman-Yor process is usually described via the (two-parameter) Chinese restaurant process. The standard presentation is that there’s a restaurant with infinitely many tables of infinite capacity and a sequence of customers showing up. Each customer either sits at an existing table with probability proportional to how many people are already sitting there, or sits at a new table with a small probability. This leads to a natural power-law distribution over table sizes. What’s really nice is that simulated draws can be made pretty much the way the model’s described. And what’s really neat is that the empirical Bayes priors look just like Kneser-Ney smoothing.
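Simulated draws really are that easy. Here’s a sketch of the two-parameter seating rule, where an occupied table attracts the next customer with probability proportional to its count minus the discount, and a new table opens with probability proportional to the strength plus the discount times the current number of tables (parameter values are just illustrative):

```python
import random

random.seed(42)

def pitman_yor_crp(num_customers, discount=0.5, strength=1.0):
    """Simulate seating in the two-parameter Chinese restaurant process.

    Customer n+1 joins existing table k with probability proportional to
    (count_k - discount), or starts a new table with probability
    proportional to (strength + discount * num_tables).
    """
    tables = []  # tables[k] = number of customers at table k
    for n in range(num_customers):
        r = random.random() * (n + strength)
        # Mass for starting a new table
        new_mass = strength + discount * len(tables)
        if r < new_mass:
            tables.append(1)
        else:
            r -= new_mass
            for k, c in enumerate(tables):
                r -= c - discount
                if r < 0:
                    tables[k] += 1
                    break
    return tables

seating = pitman_yor_crp(1000)
```

With a positive discount, a handful of tables end up huge while most stay small, which is the power-law behavior that makes this a good fit for word frequencies.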
The language model performs about as well as Kneser-Ney smoothing (which is itself just what is known as "update exclusion" in the prediction-by-partial matching compression literature) with token trigrams on 14M tokens of English news training data. LingPipe’s Witten-Bell smoothing for character n-grams performed indistinguishably from Kneser-Ney smoothing. In any case, Pitman-Yor’s far too costly for our uses, at least with the sampling Teh describes, but I’m curious as to how low we can bound the entropy rate of a source like MEDLINE.
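The Kneser-Ney connection mentioned above is visible in the Pitman-Yor predictive rule itself: the probability of a word is its count, absolutely discounted once per table serving it, interpolated with a base distribution. A hedged sketch (the function name and argument layout are mine, not Teh’s):

```python
def py_predictive(count_w, tables_w, total_count, total_tables,
                  discount, strength, base_prob):
    """Predictive probability of a word under a Pitman-Yor restaurant.

    count_w / tables_w: tokens of the word seen, and tables serving it.
    total_count / total_tables: totals across the whole restaurant.
    Mirrors interpolated Kneser-Ney: an absolute discount applied to the
    observed count, plus interpolation with the base distribution.
    """
    if total_count == 0:
        return base_prob
    direct = max(count_w - discount * tables_w, 0.0) / (strength + total_count)
    interp = (strength + discount * total_tables) / (strength + total_count)
    return direct + interp * base_prob
```

Setting the strength to zero makes the discounted-count term look exactly like interpolated Kneser-Ney’s, which is why the trained hyperparameters come out Kneser-Ney-shaped.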
I think the analogy could be improved. Instead of tables, imagine a Chinese restaurant that can make infinitely many different dishes. Each customer comes in and says “I’ll have what he’s having” with probability proportional to the number of people already having a dish, and says “I’ll have something no one else is having” with some small probability.
The obvious step is to combine these two ideas to get a very tight model. Use LDA to do what Chen and Goodman did with mixing topics, and use Pitman-Yor to do the estimation of the models (replacing the simple conditional estimates used in LDA). I wouldn’t be surprised if someone does this for ACL.