This forthcoming NIPS paper outlines a neat little idea for evaluating clustering output:

- Chang, Jonathan, Jordan Boyd-Graber, Sean Gerrish, Chong Wang and David M. Blei. 2009. Reading Tea Leaves: How Humans Interpret Topic Models. *NIPS*.

The question they pose is how to evaluate clustering output, specifically topic-word and document-topic coherence, for human interpretability.

### Bag of Words

Everything’s in the bag of words setting, so a topic is modeled as a discrete distribution over words.

### Multiple Topics per Document

They only consider models that allow multiple topics per document. Specifically, the clusterers all model a document as a discrete distribution over topics. The clusterers considered share strong family resemblances: probabilistic latent semantic indexing (pLSI) and two forms of latent Dirichlet allocation (LDA), the usual one and one in which topics for documents are drawn from a logistic normal prior modeling topic correlations rather than a uniform Dirichlet.

### Intrusion Tasks

To judge the coherence of a topic, they take the top six words from a topic, delete one of the words and insert a top word from a different topic. They then measure whether subjects can detect the “intruder”.

To judge the coherence of the topics assigned to a document, they do the same thing for document distributions: they take the top topics for a document, delete one and insert a topic not assigned with high probability to the document.
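
Here's a minimal sketch of how a word-intrusion item could be built from the description above. It follows my summary, not necessarily the paper's exact sampling procedure; the function name, the `topics` layout (one word list per topic, sorted by decreasing probability), and the rule for choosing the intruder are all my assumptions.

```python
import random

def make_word_intrusion_item(topics, topic_id, rng, top_n=6):
    """Build one word-intrusion item (hypothetical helper, not from the paper).

    topics: list of word lists, each sorted from most to least probable.
    Returns (shuffled_words, intruder).
    """
    top_words = topics[topic_id][:top_n]
    # Delete one of the topic's top words at random.
    removed = rng.choice(top_words)
    kept = [w for w in top_words if w != removed]
    # Take a top word from some other topic that is not already a top word here.
    other_ids = [t for t in range(len(topics)) if t != topic_id]
    rng.shuffle(other_ids)
    intruder = None
    for t in other_ids:
        for w in topics[t][:top_n]:
            if w not in top_words:
                intruder = w
                break
        if intruder is not None:
            break
    if intruder is None:
        raise ValueError("no suitable intruder found")
    # Shuffle so position gives the subject no hint.
    words = kept + [intruder]
    rng.shuffle(words)
    return words, intruder
```

The document-level task would have the same shape, with a document's top topics in place of a topic's top words.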

### Analysis

They considered two small corpora of roughly 10K articles, 10K word types and 1M tokens, one from Wikipedia pages and one from *NY Times* articles. These can be relatively long documents compared to tweets, customer support requests, MEDLINE abstracts, etc., but are shorter than full-text research articles or corporate 10K or 10Q statements.

They only consider 50, 100, and 150 topic models, and restrict parameterizations to add-1 smoothing (aka the Laplace form of the Dirichlet prior) for per-document topic distributions. I didn’t see any mention of what the prior was for the per-topic word distributions. I’ve found both of these parameters to have a huge effect on LDA output, with larger prior counts in both cases leading to more diffuse topic assignments to documents.

They only consider point estimates of the posteriors, which they compute using EM or variational inference. This is not surprising given the non-identifiability of topics in the Gibbs sampler.

### Mechanical Turker Voting

They used 8 mechanical Turkers per task (aka HIT) of 10 judgments (wildly overpaying at US$0.07 to US$0.15 per HIT).

### (Pseudo Expected) Predictive Log Likelihood

They do the usual sample cross-entropy rate evaluations (aka [pseudo expected] predictive log likelihoods). Reporting these to four decimal places is a mistake, because the different estimation methods for the various models have more variance than the differences shown here. Also, there’s a huge effect from the priors. For both points, check out Asuncion et al.’s analysis of LDA estimation, which the first author, Jonathan Chang, blogged about.
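
For concreteness, here's a rough sketch of a per-token sample cross-entropy rate for one held-out document under a fitted topic model. The names `theta` and `phi` and the data layout are my assumptions, and it sidesteps how the document's topic distribution gets estimated for held-out text, which is where much of the method-to-method variance comes from.

```python
import math

def cross_entropy_rate(doc_tokens, theta, phi):
    """Average negative log2 probability per token for one document.

    theta: list of topic probabilities for the document (assumed given).
    phi: list of dicts, phi[k][word] = probability of word under topic k.
    Assumes phi is smoothed so every token has nonzero probability;
    otherwise math.log2 raises ValueError.
    """
    total = 0.0
    for w in doc_tokens:
        # Marginalize over topics: p(w) = sum_k theta[k] * phi[k][w].
        p = sum(theta[k] * phi[k].get(w, 0.0) for k in range(len(theta)))
        total += math.log2(p)
    return -total / len(doc_tokens)
```

The predictive log likelihood is just the negated, unnormalized version of the same sum, so the two always rank models identically on a fixed test set.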

### Model Precision

Their evaluation for precision is the percentage of subjects who pick out the intruder. It’d be interesting to see the effect of adjusting for annotator bias and accuracy. This’d be easy to evaluate with any of our annotation models. For instance, it’d be interesting to see if it reduced the variance in their figure 3.
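
The raw measure is simple enough to write down. This is a sketch under my own naming, with no adjustment for annotator bias or accuracy:

```python
def model_precision(responses, intruder):
    """Fraction of subjects whose response matches the true intruder.

    responses: list of the words (or topics) picked by each subject.
    """
    if not responses:
        raise ValueError("need at least one response")
    return sum(1 for r in responses if r == intruder) / len(responses)

# Hypothetical example: eight subjects judge one item; six spot the intruder.
picks = ["x", "x", "y", "x", "x", "z", "x", "x"]
precision = model_precision(picks, "x")
```

With only eight votes per item, each item's precision moves in steps of 0.125, which is part of why I'd want to see an annotator model applied.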

There’s variation among the three models at different topics over the different corpora. I’m just not sure how far to trust their model precision estimates.

### Their Take Home Message

The authors drive home the point that traditional measures such as expected predictive log likelihood are negatively correlated with their notion of human evaluated precision. As I said, I’m skeptical about the robustness of this inference given the variation in estimation techniques and the strong effect of priors.

The authors go so far as to suggest using humans in the model selection loop or developing an alternative estimation technique. If they’d been statisticians rather than computer scientists, my guess is that they’d be calling for better models, not a new model selection or estimation technique!

### The Real Take Home Message

I think the real contribution here is the evaluation methodology. If you’re using clustering for exploratory data analysis, this might be a way to vet clusters for further consideration.

### What They Could’ve Talked About

Although they mention Griffiths and Steyvers’ work on using LDA for traditional psychometrics, I think a more interesting result is Griffiths and Steyvers’ use of KL divergence to measure the stability of topics across Gibbs samples (which I describe and recreate in the LingPipe clustering tutorial). Using KL divergence to compare different clusters may give you a Bayesian method to automatically assess Chang et al.’s notion of precision.
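
As a rough illustration of the idea, here's a sketch that scores topic pairs from two samples by symmetrized KL divergence and aligns them greedily. The function names and the greedy matching are my assumptions; this isn't necessarily the exact procedure in Griffiths and Steyvers or in the LingPipe tutorial.

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) in nats for two distributions over the same words.

    Assumes q[i] > 0 wherever p[i] > 0 (e.g. smoothed topic estimates).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def sym_kl(p, q):
    """Symmetrized KL divergence: the average of both directions."""
    return 0.5 * (kl(p, q) + kl(q, p))

def greedy_align(topics_a, topics_b):
    """Greedily pair topics across two samples, closest pairs first.

    topics_a, topics_b: lists of word distributions over a shared vocabulary.
    Returns a list of (index_a, index_b, divergence) triples.
    """
    pairs = sorted(
        (sym_kl(p, q), i, j)
        for i, p in enumerate(topics_a)
        for j, q in enumerate(topics_b)
    )
    used_a, used_b, matches = set(), set(), []
    for d, i, j in pairs:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            matches.append((i, j, d))
    return matches
```

Topics that line up across samples with low divergence are the stable ones; topics whose best match is still far away are the ones I'd be suspicious of.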