## Collapsed Gibbs Sampler for Hierarchical Annotation Model

The R and BUGS combination is fine as far as it goes, but it’s slow, hard to debug, and doesn’t scale well. Because I’m going to need to process some big Turker-derived named entity corpora in the next month (more about that later), I thought I’d work on scaling the sampler.

Mitzi was away over the weekend, so I naturally spent my newly found “free time” coding and reading up on stats. While I was procrastinating refactoring feature extraction for CRFs reading a really neat paper (On Smoothing and Inference for Topic Models) from the Irvine paper mill (I also blogged about another of their paper’s on fast LDA sampling), it occurred to me that I could create a fast collapsed sampler for the multinomial form of the hierarchical models of annotation I’ve blogged about.

Hierarchical Model of Binomial Annotation

The model’s quite simple, at least in the binomial case. I’ve further simplified here to the case where every annotator labels every item, but the general case is just as easy (modulo indexing):

Variable Range Status Distribution Description
I > 0 input fixed number of Items
J > 0 input fixed number of annotators
π [0,1] estimated Beta(1,1) prevalence of category 1
c[i] {0,1} estimated Bern(π) category for item i
θ0[j] [0,1] estimated Beta(α0,β0) specificity of annotator j
θ1[j] [0,1] estimated Beta(α1,β1) sensitivity of annotator j
α0/(α0+β0) [0,1] estimated Beta(1,1) prior specificity mean
α0 + β0 (0,∞) estimated Pareto(1.5)* prior specificity scale
α1/(α1+β1) [0,1] estimated Beta(1,1) prior sensitivity mean
α1 + β1 (0,∞) estimated Pareto(1.5)* prior sensitivity scale
x[i,j] {0,1} input Bern(c[i,j]==1
? θ1[j]
: 1-θ0[j])
annotation of item i by annotator j

The Collapsed Sampler

The basic idea is to sample only the category assignments c[i] in each round of Gibbs sampling. With categories given, it’s easy to compute prevalence, annotator sensitivity and specificity given their conjugate priors.

The only thing we need to sample is c[i], and we can inspect the graphical model for dependencies: the parent π of the c[i], and the parents θ0 and θ1 of the descendants x[i,j] of c[i]. The formula’s straightforwardly derived with Bayes’ rule:

p(c[i]|x, θ0, θ1) p(c[i]) * Πj in 1:J p(x[i,j] | c[i], θ0[j], θ1[j])

Moment-Matching Beta Priors

*The only trick is estimating the priors over the sensitivities and specificities, for which I took the usual expedient of using moment matching. Note that this does not take into account the Pareto prior on scales of the prior specificity and sensitivity (hence the asterisk in the table). In particular, given a set of annotator specificities (and there were 200+ annotators for the named-entity data), we find the beta prior with mean matching the empirical mean and variance matching the empirical variance (requires some algebra).

I’m not too worried about the Pareto scale prior — it’s pretty diffuse. I suppose I could’ve done something with maximum likelihood rather than moment matching (but for all I know, this is the maximum likelihood solution! [update: it’s not the ML estimate; check out Thomas Minka’s paper Estimating a Dirichlet Distribution and references therein.]).

Initialization

The inputs are initial values for annotator specificity, annotator sensitivity, and prevalence. These are used to create the first category sample given the above equation, which allows us to define all the other variables for the first sample. Then each epoch just resamples all the categories, then recomputes all the other estimates. This could be made more stochastic by updating all of the variables after each category update, but it converges so fast as is, that it hardly seemed worth the extra coding effort. I made sure to scale for underflow, and that’s about it.

It took about an hour to think about (most of which was working out the moment matching algebra, which I later found in Gelman’s BDA book’s appendix), about two hours to code, and about forty-five minutes to write up for this blog entry.

Speed and Convergence

It’s very speedy and converges very very quickly compared to the full Gibbs sampler in BUGS. I spent another hour after everything was built and running writing the online sample handler that’d compute means and deviations for all the estimated variables, just like the R/BUGS interface prints out. Having the online mean and variance calculator was just what I needed (more about it later, too), especially as many of the values were very close to each other and I didn’t want to store all of the samples (part of what’s slowing BUGS down).

Identifiability

I didn’t run into identifiability problems, but in general, something needs to be done to get rid of the dual solutions (you’d get them here, in fact, if the initial sensitivities and specificities were worse than 0.5).

Open Question (for me)

My only remaining question is: why does this work? I don’t understand the theory of collapsed samplers. First, I don’t get nearly the range of possible samples for the priors given that they’re estimated from discrete sets. Second, I don’t apply beta-binomial inference in the main sampling equation — that is, I take the prevalence and annotator sensitivities and specificities as point estimates rather than integrating out their beta-binomial form. Is this kosher?

You can find the code in the LingPipe Sandbox in the project `hierAnno` (the original R and BUGS code and the data are also in that project). It’s not documented at all yet, but the one Ant task should run out of the box; I’ll probably figure out how to move the application into LingPipe.

[Update: The evaluation looks fine for named entities; I’m going to censor the data the same way as in the NAACL submission, and then evaluate again; with all the negative noise, it’s a bit worse than voting as is and the estimates aren’t useful because the specificites so dominate the calculations. For Snow et al.’s RTE data, the collapsed estimator explodes just like EM, with sensitivity scales diverging to infinity; either there’s a bug, the collapsed sampler isn’t stable, or I really do need a prior on those scales!]

[Update 2: The censored NE evaluation with collapsed Gibbs sampling and simple posterior evaluation by category sample worked out to have one fewer error than the full sampling model in BUGS (versus the original, albeit noisy, gold standard): collapsed Gibbs 232 errors, full Gibbs 233 errors, voting 242.5 errors (the half is from flipping coins on ties). Voting after throwing out the two really noisy annotators is probably competitive as is. I still don’t know why the RTE data’s blowing out variance.]

### 5 Responses to “Collapsed Gibbs Sampler for Hierarchical Annotation Model”

1. Brendan O'Connor Says:

fyi i got EM to stabilize on the RTE data; needs the right priors.

2. lingpipe Says:

The collapsed sampler converges almost instantly with any fixed beta priors for sensitivity and specificity.

What I can’t do with Brendan et al.’s RTE data is get a good estimate of the beta priors if I use moment-matching inside of EM; likelihood diverges with zero-variance beta priors. The “right” thing to do is a better estimate of the beta priors based on another layer of priors; that’s what I did in the paper and in R/BUGS (reparameterized, I used Beta(1,1) on the beta mean (alpha/(alpha+beta)), and Pareto(1.5) on the beta scale (alpha+beta). It’s easy when BUGS does all the work, and I may just need to implement a proper beta estimator with priors.

I’m in the middle of writing up a longer blog post with more details.

3. Brendan O'Connor Says:

ah. right. i meant that. got it.

4. Maria Bernal Says:

Hi, I’m getting started on statistics, and I was wondering how the Pareto scale prior can be made more and more diffuse. Ane help in this regard is greatly appreciated (some bebliographical reference would be great too). Many thanks in advance.

5. lingpipe Says:

I just pulled the Pareto prior out of Gelman et al.’s Bayesian Data Analysis, where they don’t even describe it as such and really only mention it in passing.

Like other power-law-type distributions, it gets more diffuse as the exponent is lowered. Below quadratic (exponent 2), it has infinite variance, so that’s pretty darn diffuse.

I just followed the BUGS manual. There’s also Wolfram: Pareto Distribution and Wikipedia: Pareto Distribution.