Whew. I spent most of the weekend finishing this off. It’s one-stop shopping for all the info on the models and applications I’ve been blogging about (hopefully with enough introductory material that it’ll be comprehensible even if you don’t know anything about latent data models, multilevel models, or Bayesian inference):

- Carpenter, Bob. 2008. Multilevel Bayesian Models of Categorical Data Annotation. Technical Report. Alias-i.

It’s 52 pages, but it’s mostly figures. As usual, any feedback here or to my e-mail (`carp@alias-i.com`) would be greatly appreciated. Here’s the abstract:

This paper demonstrates the utility of multilevel Bayesian models of data annotation for classifiers (also known as coding or rating). The observable data is the set of categorizations of items by annotators (also known as raters or coders) from which data may be missing at random or may be replicated (that is, it handles fixed panel and varying panel designs). Estimated model parameters include the prevalence of category 1 outcomes, the “true” category of each item (the latent class), the accuracy in terms of sensitivity and specificity of each annotator (latent annotator traits), the difficulty of each item (latent item traits). The multilevel parameters represent the average behavior and variances among the annotators and the items. We perform inference with Gibbs sampling, which approximates the full posterior distribution of parameters as a set of samples. Samples from the posterior category distribution may be used for probabilistic supervision and evaluation of classifiers, as well as in gold-standard adjudication and active learning. We evaluate our approach with simulated data and two real data sets, including data for which a prior “gold standard” exists.
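To make the abstract a bit more concrete, here’s a minimal sketch in Python of the simplest version of this kind of model. It is not the R/BUGS code from the paper. It assumes binary categories, fixed uniform Beta(1, 1) priors on prevalence, sensitivity, and specificity, and it leaves out the item-difficulty and multilevel parameters entirely. The function names `simulate` and `gibbs` and the majority-vote initialization are illustrative choices of mine.

```python
import random

def simulate(num_items, sens, spec, prevalence, rng):
    """Simulate a fixed-panel design: every annotator labels every item."""
    truth = [int(rng.random() < prevalence) for _ in range(num_items)]
    data = []  # (item, annotator, label) triples
    for i, c in enumerate(truth):
        for j in range(len(sens)):
            p_one = sens[j] if c == 1 else 1.0 - spec[j]
            data.append((i, j, int(rng.random() < p_one)))
    return truth, data

def gibbs(data, num_items, num_annotators, num_samples, burn_in, rng):
    """Return posterior means of the prevalence and of each item's category.

    Assumes Beta(1, 1) priors throughout (an assumption of this sketch).
    """
    by_item = [[] for _ in range(num_items)]
    for i, j, y in data:
        by_item[i].append((j, y))
    # Start the latent categories at the majority vote for each item.
    c = [int(2 * sum(y for _, y in obs) > len(obs)) for obs in by_item]
    prev_sum, cat_sum = 0.0, [0.0] * num_items
    for t in range(burn_in + num_samples):
        # Prevalence given categories: Beta(1 + #ones, 1 + #zeros).
        ones = sum(c)
        prev = rng.betavariate(1 + ones, 1 + num_items - ones)
        # Count each annotator's agreements and disagreements with c.
        tp = [0] * num_annotators
        fn = [0] * num_annotators
        tn = [0] * num_annotators
        fp = [0] * num_annotators
        for i, j, y in data:
            if c[i] == 1:
                (tp if y == 1 else fn)[j] += 1
            else:
                (fp if y == 1 else tn)[j] += 1
        sens = [rng.betavariate(1 + tp[j], 1 + fn[j])
                for j in range(num_annotators)]
        spec = [rng.betavariate(1 + tn[j], 1 + fp[j])
                for j in range(num_annotators)]
        # Resample each item's category given everything else.
        for i, obs in enumerate(by_item):
            p1, p0 = prev, 1.0 - prev
            for j, y in obs:
                p1 *= sens[j] if y == 1 else 1.0 - sens[j]
                p0 *= 1.0 - spec[j] if y == 1 else spec[j]
            c[i] = int(rng.random() < p1 / (p1 + p0))
        if t >= burn_in:
            prev_sum += prev
            for i in range(num_items):
                cat_sum[i] += c[i]
    return prev_sum / num_samples, [s / num_samples for s in cat_sum]
```

Because the data is a list of triples, the same sampler runs on varying panel designs where not every annotator labels every item. The per-item posterior means it returns are the sort of thing that could feed probabilistic supervision or adjudication.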

All the code (R, BUGS, with some Java for munging), all the data, and the source for the paper are available from the LingPipe sandbox in the project `hierAnno`; see the sandbox page for information on checking it out, or just use this command (you’ll need to have CVS installed):

cvs -d :pserver:anonymous@threattracker.com:/usr/local/sandbox checkout hierAnno

Next up, I’m hoping to collect some named entity data in time to write this all up for a NAACL/HLT 2009 submission, so I’m especially keen to get feedback before then.

November 18, 2008 at 12:36 am |

For a frequentist approach to this type of model, my package randomLCA is available on CRAN. It still needs some more work, but the vignette includes an analysis of the dentistry data. When I get a chance I’ll read your paper and compare.

November 18, 2008 at 2:27 pm |

I found fitting the dentistry data a challenge in the hierarchical setting with item-level effects because of the small number of coders per item. There’s a broad expanse of parameters that provide roughly the same deviance or log likelihood. So I’d suspect EM would have similar problems — how’d it work on the dentistry data?

I might also be able to speed up the sampler by starting at the ML solution found by EM if it’s fast and robust.
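A minimal sketch of what that initialization might look like, assuming the basic binary model with only prevalence and per-annotator sensitivity and specificity (no item effects, no hierarchical priors). The name `em_fit` and the small pseudocount are illustrative choices for this sketch, not anything taken from randomLCA or the paper’s code. The pseudocount keeps estimates off the boundary, so strictly speaking this is a lightly smoothed ML estimate.

```python
def em_fit(data, num_items, num_annotators, num_iters=50, pseudo=0.01):
    """Fit prevalence, sensitivity, and specificity by EM.

    data is a list of (item, annotator, label) triples with binary labels,
    so items may have different numbers of annotations.
    """
    by_item = [[] for _ in range(num_items)]
    for i, j, y in data:
        by_item[i].append((j, y))
    # q[i] approximates P(category of item i is 1); start from a
    # smoothed share of 1-votes for each item.
    q = [(sum(y for _, y in obs) + 0.5) / (len(obs) + 1.0) for obs in by_item]
    for _ in range(num_iters):
        # M-step: re-estimate parameters from expected counts.
        prev = sum(q) / num_items
        tp = [pseudo] * num_annotators
        fn = [pseudo] * num_annotators
        tn = [pseudo] * num_annotators
        fp = [pseudo] * num_annotators
        for i, j, y in data:
            if y == 1:
                tp[j] += q[i]
                fp[j] += 1.0 - q[i]
            else:
                fn[j] += q[i]
                tn[j] += 1.0 - q[i]
        sens = [tp[j] / (tp[j] + fn[j]) for j in range(num_annotators)]
        spec = [tn[j] / (tn[j] + fp[j]) for j in range(num_annotators)]
        # E-step: recompute each item's category probability.
        for i, obs in enumerate(by_item):
            p1, p0 = prev, 1.0 - prev
            for j, y in obs:
                p1 *= sens[j] if y == 1 else 1.0 - sens[j]
                p0 *= 1.0 - spec[j] if y == 1 else spec[j]
            q[i] = p1 / (p1 + p0)
    return prev, sens, spec, q
```

The returned `prev`, `sens`, and `spec` could serve as initial values for the sampler’s parameters, and thresholding `q` at 0.5 would give starting values for the latent categories.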

I really need to handle the varying panel situation in which not every coder annotates every item.

Here’s the link to Ken Beath’s randomLCA package, which implements Qu, Tan and Kutner’s (1996) random effects model 2LCR.

http://cran.r-project.org/web/packages/randomLCA/index.html

As an aside, I find it confusing when the doc has speculative comments about what might be coming next rather than sticking to what’s implemented.

November 19, 2008 at 3:45 pm |

Very nice, thanks for pulling all this together — I was starting to refer people to several different blog posts of yours for more information on the topic :)

What’s the CVS project name? I couldn’t find it on the webpage listing all the project names, and it seems to be necessary to do a checkout.

November 19, 2008 at 6:53 pm |

The project is named “hierAnno”. It’s not on the sandbox web page yet, because it wasn’t done in time for the last release. I edited the blog post to include the relevant command so it’s clearer.

November 19, 2008 at 7:12 pm |

Agreed about the speculative comments; it would be better to just deal with the present.

I’ll have a look at the rest when I get the chance. The dentistry data isn’t too bad to fit, but some of the other Qu et al. data is harder to fit. The paper is also full of typos. An interesting question will be whether MCMC will require a massive number of samples to fully explore the posterior probability for some of the models.

November 20, 2008 at 12:54 pm |

Somewhat counter-intuitively to me, the Gibbs samplers in BUGS needed fewer samples when there was more data. With bigger data sets, the sampler converged faster in real time and much faster in terms of number of epochs. I’m guessing that’s because the underlying distribution samplers in BUGS were doing less rejection.

With the dentistry data, I found it didn’t take a huge number of samples, but the posteriors were rather diffuse, especially on the item-level difficulty parameters (in the beta-binomial by item or logistic models) and on the hierarchical parameters for annotators (beta-binomial by annotator and logistic models). I also found it hard to fit the multiplicative slope parameter from Qu et al. (also in the Uebersax and Grove models): the posteriors were all over the place with very little difference in log likelihood.

Also, BUGS pitched a fit (throwing underflow/overflow exceptions) when I tried to swap in probit for logit. I’ve seen this mentioned before.

The other problem with BUGS and R is scaling. I’m about to create a 200K item data set using Mechanical Turk, and I may need to swap over to something like Hal Daumé’s Hierarchical Bayes Compiler (HBC). Looking at the HBC doc, though, it doesn’t look like it’ll implement hierarchical generalized linear models like logistic or probit regression.

May 6, 2010 at 11:59 am |

@lingpipe Infer.NET may be able to help with your larger data set.

http://research.microsoft.com/infernet/

It’s been used for datasets of up to 100M records by using the support for chunking up the data.

It offers variational EM and Expectation Propagation as its main inference algorithms (preliminary Gibbs support is in there as well). For a comparison to BUGS read this thread:

http://community.research.microsoft.com/forums/t/4823.aspx

We’re planning a new release shortly with a number of optimisations that should increase the size of dataset that can be used without chunking.

Hope that helps,

John Winn, Infer.NET team