Yup, just like I was warned:
> Soft K-means can blow up. Put one cluster exactly on one data point and let its variance go to zero — you can obtain an arbitrarily large likelihood! Maximum likelihood methods can break down by finding highly tuned models that fit part of the data perfectly. This phenomenon is known as overfitting. The reason we are not interested in these solutions with enormous likelihood is this: sure, these parameter-settings may have enormous posterior probability density, but the density is large over only a very small volume of parameter space. So the probability mass associated with these likelihood spikes is usually tiny.

(MacKay 2003; Chapter 22, p. 308)
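MacKay's point is easy to see numerically: center a Gaussian exactly on a data point and its log likelihood diverges as the standard deviation shrinks. A quick hypothetical illustration (my code, just to make the spike concrete):

```python
import math

def gaussian_log_density(x, mu, sigma):
    """Log density of x under a Normal(mu, sigma^2)."""
    return -math.log(sigma * math.sqrt(2 * math.pi)) - (x - mu) ** 2 / (2 * sigma ** 2)

# A cluster sitting exactly on the data point x: as sigma -> 0,
# the log likelihood grows without bound -- the "likelihood spike."
x = 0.7
for sigma in (1.0, 0.1, 0.001, 1e-6):
    print(sigma, gaussian_log_density(x, x, sigma))
```

The density at the point grows like 1/sigma, but the spike's probability mass (density times a vanishing volume) is negligible, which is exactly MacKay's objection.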
The problem for my hierarchical annotation model, which estimates annotator sensitivities and specificities along with their beta priors, is that after a few samples, the annotator sensitivities wind up fairly close together, so the estimated prior variance drops. In the next round, the low-variance prior pulls the sensitivities even closer to the prior mean, which lowers the variance estimate again. You can see where this is going. (It’s “poof” goes the variance [to 0], or “kaboom” go the scale parameters and log likelihood [to infinity].)
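That feedback loop is easy to simulate. In this hypothetical toy version, each round fits the prior's mean and variance to the current sensitivities, then a crude shrinkage step (the 0.5 factor is made up, standing in for the pull of the low-variance prior in the real EM updates) drags the estimates toward the prior mean:

```python
def mean_var(xs):
    """Sample mean and (population) variance."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / len(xs)
    return m, v

# Made-up annotator sensitivities, fairly close together to start.
sens = [0.80, 0.85, 0.90, 0.95]
for rnd in range(5):
    m, v = mean_var(sens)
    print(rnd, round(m, 4), round(v, 6))
    # The low-variance prior pulls each estimate toward the prior mean;
    # this shrinkage is a stand-in for the actual posterior update.
    sens = [m + 0.5 * (s - m) for s in sens]
```

Each round cuts the variance by a factor of four; a few more rounds and "poof."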
The solution, of course, is to (a) fix priors, or (b) put a prior on the beta prior scale. Option (b) is required for the full Bayesian model, and since BUGS did the heavy lifting, it’s what I’ve used up until now. It’s been very stable.
Solution (b) is a pain for the beta distribution, because there’s no conjugate prior, which hugely complicates maximum likelihood estimation (see Thomas Minka’s Estimating a Dirichlet distribution; the beta is just a Dirichlet with K=2). I could just add in some more data samples and let them act as prior counts, even if I can’t characterize the resulting prior analytically. That should probably work, too, even with my admittedly low-road moment-matching implementation.
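For what it's worth, the moment-matching estimate itself is simple: equate the beta's mean and variance to the sample mean and variance and solve for the shape parameters. A sketch (my code, not LingPipe's):

```python
def beta_moment_match(xs):
    """Fit Beta(alpha, beta) to samples in (0,1) by matching moments."""
    n = len(xs)
    m = sum(xs) / n
    v = sum((x - m) ** 2 for x in xs) / n
    # A beta's variance is at most m*(1-m); beyond that, no fit exists.
    if v >= m * (1 - m):
        raise ValueError("sample variance too large for a beta fit")
    common = m * (1 - m) / v - 1  # equals alpha + beta
    return m * common, (1 - m) * common
```

Note that as the samples bunch up, `v` goes to zero and `common` blows up, which is exactly the scale-parameter "kaboom" above.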
For now, along with moment-matching for priors (which can diverge), I implemented solution (a), fixing the priors.
With fixed beta priors and EM estimates, I now have a straight-up implementation of the Dawid and Skene model (sorry, JSTOR’s is still the best reference online; if anyone finds an open version, let me know)!
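For concreteness, here's a minimal EM sketch of a binary Dawid-and-Skene-style model with per-annotator sensitivity and specificity. The data layout, initialization, iteration count, and the clamping guard are all my assumptions, not from the paper or from LingPipe:

```python
def em_dawid_skene(anns, n_iters=50, pi=0.5):
    """EM for binary Dawid & Skene: anns[i][j] = annotator j's 0/1 label on item i."""
    def clamp(x, lo=1e-6):
        # Guard against degenerate 0/1 probabilities (a poor man's prior).
        return min(max(x, lo), 1 - lo)

    n_items, n_ann = len(anns), len(anns[0])
    sens = [0.7] * n_ann  # P(label 1 | true 1), asymmetric init breaks the label-swap symmetry
    spec = [0.7] * n_ann  # P(label 0 | true 0)
    for _ in range(n_iters):
        # E-step: posterior probability each item's true label is 1.
        post = []
        for labels in anns:
            p1, p0 = pi, 1 - pi
            for j, y in enumerate(labels):
                p1 *= sens[j] if y == 1 else 1 - sens[j]
                p0 *= (1 - spec[j]) if y == 1 else spec[j]
            post.append(p1 / (p1 + p0))
        # M-step: re-estimate prevalence, sensitivities, specificities.
        pi = clamp(sum(post) / n_items)
        tot1 = sum(post)
        tot0 = n_items - tot1
        for j in range(n_ann):
            pos = sum(p for p, ls in zip(post, anns) if ls[j] == 1)
            neg = sum(1 - p for p, ls in zip(post, anns) if ls[j] == 0)
            sens[j] = clamp(pos / tot1)
            spec[j] = clamp(neg / tot0)
    return pi, sens, spec
```

With Beta(1,1) priors the MAP estimate coincides with maximum likelihood, so this plain EM is the fixed-prior model described above.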
The good news is that with fixed Beta(1,1) priors for sensitivity and specificity, the models converge lickety-split and get the same answers as the full Bayesian models (not so good if you’re looking for evidence that Bayesian models are the way to go). I could even compute beta estimates based on final sensitivity and specificity estimates if I wanted. And the results are still better than voting, even if we allow voting to reject annotators who are obviously bad.
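The voting baseline mentioned above is trivial by comparison; here's a hypothetical version (breaking ties toward 0 is my arbitrary choice, not the post's):

```python
def majority_vote(labels):
    """Majority vote over 0/1 labels; ties go to 0."""
    return 1 if 2 * sum(labels) > len(labels) else 0

print(majority_vote([1, 1, 0]))  # prints 1
print(majority_vote([0, 1]))     # tie, prints 0
```

The model beats this even when obviously bad annotators are dropped before voting, because it weights each annotator by estimated sensitivity and specificity rather than counting everyone equally.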
You can read MacKay’s chapter 22 or the whole book online:
- David J. C. MacKay. 2003. Information Theory, Inference, and Learning Algorithms. Cambridge University Press.
- Chapter 22, Maximum Likelihood and Clustering [pdf]
- PDF index by chapter (off-by-one error in numbering vs. PDF)
I love Cambridge University Press! They’re letting their authors distribute PDFs for free (recent examples: Marti Hearst’s book on search; Manning, Raghavan, and Schütze’s IR book).