I just got back from a very nice talk at Columbia by Sanjoy Dasgupta, who’s on leave at Yahoo! from UCSD. The talk was basically about an old paper, Dasgupta and Schulman 2000, in which the authors introduce a “fat Gaussian” to overcome the curse of dimensionality in EM (soft) clustering.

Dasgupta started by making the point that a Gaussian with 1000 independent dimensions has almost all sample points roughly the same distance from the mean, which is easy to prove given the central limit theorem. In other words, the standard Gaussian defines a thin high-likelihood shell around the expected distance. In reality (or at least in experimental data sets), data’s more dispersed than independent high-dimensional Gaussians can represent. Therefore, they introduce fat Gaussians, which are just mixtures of Gaussians with different variances, thus averaging over a bunch of thin shells at their respective expected distances from the mean. He went on to show various separability results for clustering based on these fat Gaussians and mostly separable data.

What struck me most about the talk was Dasgupta’s answer to a question from an audience member who asked if his method would help determine the number of clusters. This is a perennial problem in clustering. K-means and EM-clustering require the number of clusters to be set ahead of time. Dasgupta’s answer was in two parts. The official answer was “no”. But he qualified that by saying that what he really believed was that the problem was ill-formed because different numbers of clusters were useful for different applications.

Some natural language processing problems are like this, and some are different. Categorizing documents by topic is like this, because the world doesn’t come pre-segmented into a known number of topics. For instance, we might categorize news into business, politics, sports and entertainment, leading to four categories. Or we might go further, and divide them into regional and international politics, baseball, soccer and bowling, and movies versus music. It depends on the application.

But one reason we’re interested in clustering is for coreference resolution, including database deduplication and grouping mentions of the same entity (such as a consumer electronics product, gene name or person) in text. For these tasks, there is a notion of “true” number of clusters, though one might quibble that Bob-as-cook is different than Bob-as-programmer.

EM clustering that estimates means and variances in a maximum likelihood setting is ill-formed. The max likelihood solution in the 2-cluster setting places one cluster mean on a single data point with zero variance, leading to an infinite value of the density for that point, with the other points belonging to a single high-variance covering distribution. One solution to this problem is to fix the variances to positive values ahead of time and only estimate the means.

Another thing that strikes me about EM clustering is that it’s almost always done with a uniform distribution over categories. That is, the likelihood of a point x being in cluster c is p(c) * p(x|c). I’ve never seen a presentation of Gaussian EM clustering where p(c) is taken to be anything other than uniform. That is, we’re not modeling the situation where clusters have different likelihoods — the generative model is to draw a cluster according to a uniform distribution and then draw a point from that cluster’s distribution.

The Bayesian (equivalently minimum description length) solution to the number of clusters problem also addresses the uniform-cluster size problem. Specifically, we put a prior on the model structure and then choose the clustering that maximizes the likelihood times the prior: p(data|model) * p(model). The standard Bayesian prior for this kind of setting would comprise something like a Dirichlet process prior on the distribution of categories p(c); we could also put Bayesian priors on the means or more commonly variances of the Gaussians. With a Dirichlet process, there is no bound on the number of categories, and draws from a Dirichlet process prior lead to power-law (Zipf-like) distributions of number and size of categories. You can see this working quite well in coreference resolution as evidenced by the recent paper Haghighi and Klein 2007.

The only problem with the Bayesian “solution” is that the concentration parameter of the Dirichlet process prior is just another way of setting k. That is, you need to make a prior assumption about how costly it is to introduce new categories. This is just replacing a direct control (number of clusters) with an indirect one (cost to introduce a new cluster). Haghighi and Klein, for instance, set the concentration parameter of their Dirichlet process prior empirically.

March 17, 2008 at 2:59 pm |

Nice post! The importance of the Bayesian solution is that one does not KNOW the number of clusters – and that the number does not MATTER: what’s important is to model the P(X|C).

But in some cases, C is interesting: we are actually interested in creating a new concept – to be used for communication and other problems. If that’s the intention, C has to be crisp.

March 20, 2008 at 2:08 pm |

Wow. Still not used to getting comments on the blog!

The issue about the number of clusters not mattering depends on what you’re trying to do with the Bayesian analysis. The problem in something like Gibbs is that you get exchangeability of latent topics, so those won’t be stable. The pairwise associations can be extracted (do entity x and y show up in the same cluster, or estimate likelihood of same given model), but that’s a lot of data to aggregate over samples when we need to scale.

But consider this case: estimating how many entity mentions there are in a corpus of text. That’s a case where you could do a Bayesian estimate of the number of clusters in a DP-prior setting.

– Bob