Archive for the ‘Data Annotation’ Category

Multimodal Priors for Binomial Data

December 23, 2009

Andrey Rzhetsky gave me some great feedback on my talk at U. Chicago. I want to address one of the points he raised in this blog post:

What if the prior is multimodal?

This is a reasonable question. When you look at the Mechanical Turk annotation data we have, sensitivities and specificities are suggestively multimodal, with the two modes representing the spammers, and what I’ve decided to call “hammers” (because the opposite of “spam” is “ham” in common parlance).

Here’s a replay from an earlier post, Filtering Out Bad Annotators:

The center of each circle represents the sensitivity and specificity of an annotator relative to the gold standard from Snow et al.’s paper on Mech Turk for NLP.

Beta Priors are Unimodal or Amodal

But a beta distribution \mbox{\sf Beta}(\alpha,\beta) has a single mode at (\alpha-1)/(\alpha + \beta - 2) when \alpha, \beta > 1, and has no interior mode if \alpha \leq 1 or \beta \leq 1.
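As a quick check of the mode formula, here is a minimal Python sketch. The function name `beta_mode` is hypothetical, and it returns `None` whenever the condition \alpha, \beta > 1 fails.

```python
from typing import Optional


def beta_mode(alpha: float, beta: float) -> Optional[float]:
    """Mode of Beta(alpha, beta) from the formula above.

    Returns None unless both alpha > 1 and beta > 1, since the
    formula only applies in that case.
    """
    if alpha > 1.0 and beta > 1.0:
        return (alpha - 1.0) / (alpha + beta - 2.0)
    return None
```

For example, `beta_mode(3.0, 2.0)` evaluates (3 - 1)/(3 + 2 - 2), and `beta_mode(1.0, 1.0)` (the uniform distribution) returns `None`.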

Mixture Priors

One possibility would be to use a mixture of two beta distributions, where for \alpha_1,\beta_1,\alpha_2,\beta_2 > 0 and 0 \leq \lambda \leq 1, we define:

p(\theta|\alpha_1,\beta_1,\alpha_2,\beta_2,\lambda) = \lambda \ \mbox{\sf Beta}(\theta|\alpha_1,\beta_1) + (1 - \lambda) \ \mbox{\sf Beta}(\theta|\alpha_2,\beta_2).
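The mixture density is easy to evaluate directly. The following is a minimal Python sketch using only the standard library; the function names are hypothetical, and it assumes 0 < \theta < 1 so the logarithms are defined.

```python
import math


def beta_pdf(theta: float, alpha: float, beta: float) -> float:
    """Density of Beta(alpha, beta) at theta, for 0 < theta < 1."""
    # Log of the normalizing constant Gamma(a+b) / (Gamma(a) Gamma(b)).
    log_norm = (math.lgamma(alpha + beta)
                - math.lgamma(alpha) - math.lgamma(beta))
    return math.exp(log_norm
                    + (alpha - 1.0) * math.log(theta)
                    + (beta - 1.0) * math.log(1.0 - theta))


def beta_mixture_pdf(theta: float,
                     alpha1: float, beta1: float,
                     alpha2: float, beta2: float,
                     lam: float) -> float:
    """Two-component beta mixture density with mixing weight lam."""
    return (lam * beta_pdf(theta, alpha1, beta1)
            + (1.0 - lam) * beta_pdf(theta, alpha2, beta2))
```

With components such as Beta(20, 5) and Beta(5, 20) and equal weights, the density is much higher near 0.2 and 0.8 than at 0.5, which is the kind of two-humped shape a single beta cannot produce.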

Hierarchical Fit

We can fit the mixing weight \lambda along with the parameters of the beta mixture components in the usual way. It’s just one more variable to sample in the Gibbs sampler.

Hierarchical Priors

On the other hand, if you consider baseball batting data, there would be at least two modes in the prior for batting average corresponding to fielding position (i.e. shortstops and second-basemen typically don’t bat as well as outfielders and I imagine there’s more variance in batting ability for infielders). If we didn’t know the fielding position, it’d make sense to use a mixture prior. But if we knew the fielding position, we’d want to create a hierarchical model with a position prior nested inside of a league prior.

Coding Inference Talk at University of Chicago, Wed 16 Dec 2009

December 13, 2009

Update: It’s still Wednesday, but the right day is 16 December.

This Wednesday, I (Bob) will be giving a talk at the brand-spanking new Knapp Center for Biomedical Discovery at the University of Chicago.

2 PM, Wed 16 Dec 2009

Knapp Center for Biomedical Discovery (KCBD)
10th Floor, South Conference Room
900 East 57th Street
University of Chicago

Multilevel Models of Coding and Diagnosis
with Multiple Tests and No Gold Standards

Bob Carpenter, Alias-i

I’ll introduce multilevel generalizations of some well known models drawn from the epidemiology literature and evaluate their fit to diagnostic testing and linguistic annotation data. The analogy is that data annotation for machine learning (or, e.g., ICD-9 coding) is the same kind of process as diagnostic testing.

The observed data consists of multiple testers supplying results for multiple tested units. True labels for some units may be known (perhaps with selection bias), and not all units need to undergo every test.

In all models, there are parameters for outcome prevalence, the result for each unit, and some form of test accuracy. I’ll also consider models with parameters for individual test accuracy and bias (equivalently sensitivity and specificity in the binary case) and item difficulty/severity.

I’ll focus on the priors for annotator accuracy/bias and item difficulty, showing how diffuse hyperpriors allow them to be effectively inferred along with the other parameters using Gibbs sampling. The posterior samples may be used for inferences for diagnostic precision, multiple comparisons of test accuracies, population prevalence, unit-level labels, etc.

I’ll show that the resulting multilevel models can be fit using data simulated according to the model. I will then fit the model to a range of clinical and natural language data. I’ll discuss their advantages for inference with epidemiology data ranging from dentists diagnosing caries based on x-rays, oncologists diagnosing tumors based on slides, infection diagnosis based on exams and serum tests, and with natural language data including name spotting, word stemming, and classification.

I’ll conclude by discussing extensions to further pooling through random effects for different testing facilities, different kinds of annotators (e.g. doctors vs. ICD-9 specialists), different kinds of subjects/units (e.g. genetic predisposition to diseases, or articles drawn from different journals), etc.

All the software (Java, R, BUGS) and data discussed in this talk are freely available from the LingPipe sandbox in the hierAnno project.

You may already be familiar with all this from the data annotation thread on this blog.

Chang et al. (2009) Reading Tea Leaves: How Humans Interpret Topic Models

November 18, 2009

This forthcoming NIPS paper outlines a neat little idea for evaluating clustering output:

The question they pose is how to evaluate clustering output, specifically topic-word and document-topic coherence, for human interpretability.

Bag of Words

Everything’s in the bag of words setting, so a topic is modeled as a discrete distribution over words.

Multiple Topics per Document

They only consider models that allow multiple topics per document. Specifically, the clusterers all model a document as discrete distribution over topics. The clusterers considered share strong family resemblances: probabilistic latent semantic indexing (pLSI) and two forms of latent Dirichlet allocation (LDA), the usual one and one in which topics for documents are drawn from a logistic prior modeling topic correlations rather than a uniform Dirichlet.

Intrusion Tasks

To judge the coherence of a topic, they take the top six words from a topic, delete one of the words and insert a top word from a different topic. They then measure whether subjects can detect the “intruder”.
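Building one word-intrusion item as described above can be sketched in a few lines of Python. This follows the description in this post, not the authors' code; the function name, the choice of the other topic's highest-ranked unused word as intruder, and the random choice of which word to delete are all assumptions.

```python
import random


def word_intrusion_item(topic_words, other_topic_words, rng, n_top=6):
    """Build one word-intrusion item.

    topic_words and other_topic_words are word lists ranked by
    probability, most probable first. Returns (shown_words, intruder).
    """
    top = list(topic_words[:n_top])
    # Assumption: the intruder is the other topic's best word that is
    # not already among this topic's top words.
    intruder = next(w for w in other_topic_words if w not in top)
    # Delete one top word (chosen at random) and insert the intruder.
    top[rng.randrange(len(top))] = intruder
    # Shuffle so position gives no hint about which word was inserted.
    rng.shuffle(top)
    return top, intruder
```

A subject is then shown the shuffled list and asked to pick the word that does not belong.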

To judge the coherence of the topics assigned to a document, they do the same thing for document distributions: they take the top topics for a document, delete one and insert a topic not assigned with high probability to the document.


They considered two small corpora of roughly 10K articles, 10K word types and 1M tokens, one from Wikipedia pages and one from NY Times articles. These can be relatively long documents compared to tweets, customer support requests, MEDLINE abstracts, etc, but are shorter than full-text research articles or corporate 10K or 10Q statements.

They only consider 50, 100, and 150 topic models, and restrict parameterizations to add-1 smoothing (aka the Laplace form of the Dirichlet prior) for per-document topic distributions. I didn’t see any mention of what the prior was for the per-topic word distributions. I’ve found both of these parameters to have a huge effect on LDA output, with larger prior counts in both cases leading to more diffuse topic assignments to documents.

They only consider point estimates of the posteriors, which they compute using EM or variational inference. This is not surprising given the non-identifiability of topics in the Gibbs sampler.

Mechanical Turker Voting

They used 8 mechanical Turkers per task (aka HIT) of 10 judgments (wildly overpaying at US$0.07 to US$0.15 per HIT).

(Pseudo Expected) Predictive Log Likelihood

They do the usual sample cross-entropy rate evaluations (aka [pseudo expected] predictive log likelihoods). Reporting these to four decimal places is a mistake, because the different estimation methods for the various models have more variance than the differences shown here. Also, there’s a huge effect from the priors. For both points, check out Asuncion et al.’s analysis of LDA estimation, which the first author, Jonathan Chang, blogged about.

Model Precision

Their evaluation for precision is the percentage of subjects who pick out the intruder. It’d be interesting to see the effect of adjusting for annotator bias and accuracy. This’d be easy to evaluate with any of our annotation models. For instance, it’d be interesting to see if it reduced the variance in their figure 3.
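As a concrete reading of that definition, precision for one intrusion item is just the share of subjects whose choice matches the intruder. A minimal Python sketch (the function and argument names are hypothetical, and no adjustment for annotator bias or accuracy is made):

```python
def model_precision(choices, intruder):
    """Fraction of subjects whose chosen word is the true intruder."""
    if not choices:
        raise ValueError("need at least one subject response")
    return sum(1 for c in choices if c == intruder) / len(choices)
```

With 8 Turkers per task, 6 correct picks would give a precision of 0.75 for that item.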

There’s variation among the three models at different topics over the different corpora. I’m just not sure how far to trust their model precision estimates.

Their Take Home Message

The authors drive home the point that traditional measures such as expected predictive log likelihood are negatively correlated with their notion of human evaluated precision. As I said, I’m skeptical about the robustness of this inference given the variation in estimation techniques and the strong effect of priors.

The authors go so far as to suggest using humans in the model selection loop. Or developing an alternative estimation technique. If they’d been statisticians rather than computer scientists, my guess is that they’d be calling for better models, not a new model selection or estimation technique!

The Real Take Home Message

I think the real contribution here is the evaluation methodology. If you’re using clustering for exploratory data analysis, this might be a way to vet clusters for further consideration.

What They Could’ve Talked About

Although they mention Griffiths and Steyvers’ work on using LDA for traditional psychometrics, I think a more interesting result is Griffiths and Steyvers’ use of KL-divergence to measure the stability of topics across Gibbs samples (which I describe and recreate in the LingPipe clustering tutorial). Using KL divergence to compare different clusters may give you a Bayesian method to automatically assess Chang et al.’s notion of precision.
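For reference, KL divergence between two topics (each a discrete distribution over the same vocabulary) can be computed as in the Python sketch below. The symmetrized variant is an assumption added for illustration, since plain KL is not symmetric in its arguments; both functions assume the second distribution is nonzero wherever the first is.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions on the same support.

    Assumes q[k] > 0 wherever p[k] > 0; terms with p[k] == 0 contribute 0.
    """
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)


def symmetrized_kl(p, q):
    """Average of KL(p || q) and KL(q || p) (an assumed convention)."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

Matching each topic in one Gibbs sample to its lowest-divergence topic in another sample is one way to use this to gauge stability.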

Phrase Detectives Linguistic Annotation Game

November 10, 2009

Massimo Poesio just sent me a pointer to the following awesome web application:

Annotation Game

Annotation games (aka “games with a purpose”) were popularized by von Ahn’s ESP game, though I first heard about them through David Stork’s Open Mind project.

Unlike the mechanical Turk, the games approach tries to make the task somewhat fun and competitive. It seems like making the users “detectives” is a thin veneer of “fun”, but they maintain the metaphor beautifully throughout the whole site, so it works.

As with many games, Phrase Detectives pays out in leader board bragging rights and cash prizes rather than directly for work completed as on Mechanical Turk.

Phrase Detective Tasks

The really brilliant part is how they break the coref annotation task into four easy-to-answer questions about a single highlighted phrase.

  1. Not Mentioned Before: yes/no question as to whether the referent of highlighted phrase was previously mentioned in the text
  2. Mentioned Before: highlight previous mention of a given phrase
  3. Non-Referring: pick out non-referential nouns (like the “there” in “there is trouble brewing”)
  4. Property of Another Phrase: pick out other phrases that describe someone already mentioned (e.g. attributives or appositives)

The site also has nice clean, easy-to-follow graphics, and appears to still have users after two years.

Adjudication Phase

OK, they call it “Detectives Conference”, but the idea is you get to vote yes/no as to whether someone else’s answer is right. This is a good idea widely used on Mechanical Turk because it’s easier to check someone’s work than to create it from scratch.

Read All About It

It was developed by academics, so there are at least as many papers as contributors:

Coreference Annotation

There are “expert” annotated within-doc coref corpora for the MUC 7 and ACE 2005 evaluations (available from LDC, who charge an arm and a leg for this stuff, especially for commercial rights).

LingPipe does within-document coreference and we’ve worked on cross-document coreference.

More Like This

As soon as you find one of these things, you find more. Check out:

I’d love to hear about more of these if anyone knows any.

Whitehill, Ruvolo, Wu, Bergsma, and Movellan (2009) Optimal Integration of Labels from Labelers of Unknown Expertise

October 5, 2009

[Update: 4:51 PM, 5 October 2009 after corrections from Jacob Whitehill (thanks!); they did use a prevalence estimate and did allow mixed panel fits, as the post now reflects.]

Thanks to Panos for pointing out this upcoming NIPS poster, which makes a nice addition to our running data annotation thread:

The authors’ knowledge of the epidemiology literature was limited when they stated:

To our knowledge BLOG [bilinear log odds generative] is the first model in the literature to simultaneously estimate the true label, item difficulty, and coder expertise in an unsupervised manner.

Just check out the literature survey portion of my technical report for a range of different approaches to this problem, some of which have even been applied to binary image data classification such as Handelman’s X-ray data for dental caries diagnoses (see the tech report for the Handelman data).

Model Overview

In this case, the authors use a logistic scale (that’s the log-odds part) consisting of the product of an annotator accuracy term and an item (inverse) difficulty term (that’s the “bilinear” part). Although the authors mention item-response and Rasch models (see below), they do not exploit their full expressive power.

In particular, the authors model annotator accuracy, but do not break out sensitivity and specificity separately, and thus do not model annotator bias (a tendency to overlabel cases in one category or another). I and others have found huge degrees of annotator bias for real data (e.g. Handelman’s dentistry data and the Snow et al. NLP data).

The authors’ model also clips difficulty at a random coin flip, whereas in reality, some positive items may be so hard to find as to have less than a 50% chance of being modeled correctly.

They impose unit normal priors over annotator accuracy and normal priors over the log of item difficulty (ensuring item difficulties are non-negative). They fit the models with EM using conjugate gradient to solve the logistic regression in the M-step. Epidemiologists have fitted empirical Bayes priors by using other expert opinion, and I went further and actually fitted the full hierarchical Bayesian model using Gibbs sampling (in BUGS; the code is in the sandbox project).

Point estimates (like the EM-derived maximum a posteriori estimates the authors use) always underestimate posterior uncertainty compared to full Bayesian inference. Posterior uncertainty in item difficulty is especially difficult to estimate with only 10 annotators. In fact, we found the Bayesian posteriors for item difficulty to be so diffuse with only 10 annotators that using the full posterior effectively eliminated the item difficulty effect.

Synthetic Data

They run synthetic data and show fits. Not surprisingly given the results I’ve reported elsewhere about fitting item difficulty, they only report fits for difficulty with 50 annotators! (I found reasonable fits for a linear (non-multiplicative) model with 20 annotators, though recall the reviewers for my rejected ACL paper thought even 10 annotators was unrealistic for real data!)

They also simulate very low numbers of noisy annotators compared to the actual numbers found on Mechanical Turk (even with pre-testing, we had 1/10 noisy annotations and without testing, Snow et al. found 1/2 noisy labels). I was surprised they had such a hard time adjusting for the noisy labelers. I think this may be due to trying to model item difficulty. Without item difficulty, as in the Dawid and Skene-type models, there’s no problem filtering out bad annotators.

Pinning Values

The authors note that you can fix some values to known gold-standard values and thus improve accuracy. I noted this in my papers and in my comment on Dolores Labs’ CrowdControl, which only uses gold-standard values to evaluate labelers.

Real Data

They show some nice evaluations for image data consisting of synthetic images and classification of Duchenne smiles. As with other data sets of this kind (such as my results and Raykar et al.’s results), they show decreasing advantage of the model-based methods over pure voting as the number of annotators approaches 10. This is as we’d expect — the Bayesian priors and proper weighting are most important for sparse data.

Mathematical Presentation

The authors suppose J items (images in this case) and I annotators. The correct label for item j is c_j and the label provided by annotator i is x_{i,j}. They consider fitting for the case where not every annotator labels every item.

The authors model correctness of an annotation by:

\mbox{Pr}(x_{i,j} = c_j) = \mbox{logit}^{-1}(\alpha_i\beta_j)

where \alpha_i is a measure of an annotator’s ability and \beta_j > 0 a measure of inverse item difficulty. The authors observe some limits to help understand the parameterization. First, as inverse item difficulties approach 0, items become more difficult to label, and accuracy approaches chance (recall \mbox{logit}^{-1}(0) = 0.5):

\lim_{\beta_j \rightarrow 0} \mbox{Pr}(x_{i,j} = c_j) = 0.5.

As inverse item difficulties approach infinity, the item becomes easier to label:

\lim_{\beta_j \rightarrow \infty} \mbox{Pr}(x_{i,j} = c_j) = 1.0.

As annotator accuracy approaches infinity, accuracy approaches perfection:

\lim_{\alpha_i \rightarrow \infty} \mbox{Pr}(x_{i,j} = c_j) = 1.0.

As annotator accuracy approaches zero, accuracy approaches chance:

\lim_{\alpha_i \rightarrow 0} \mbox{Pr}(x_{i,j} = c_j) = 0.5.
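These limits are easy to confirm numerically. Below is a minimal Python sketch of the correctness probability \mbox{logit}^{-1}(\alpha_i\beta_j); the function names are hypothetical, and the inverse logit is written in a form that avoids overflow for large negative arguments.

```python
import math


def inv_logit(u: float) -> float:
    """Inverse logit (logistic sigmoid), stable for large |u|."""
    if u >= 0.0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def prob_correct(alpha_i: float, beta_j: float) -> float:
    """Pr(x[i,j] == c[j]) for annotator ability alpha_i and
    inverse item difficulty beta_j > 0."""
    return inv_logit(alpha_i * beta_j)
```

For instance, `prob_correct(0.0, 5.0)` is exactly 0.5 (chance), `prob_correct(2.0, 50.0)` is close to 1, and a negative ability such as `prob_correct(-1.0, 2.0)` falls below 0.5.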

If accuracy is less than zero, the annotator is adversarial. We didn’t find any adversarial annotators in any of our Mechanical Turk data, but there were lots of noisy ones, so some of the models I fit just constrained prior accuracies to be non-adversarial. Others have fixed priors to be non-adversarial. In some settings, I found initialization to non-adversarial accuracies in EM or Gibbs sampling led to the desired solution. Of course, when lots of annotators are adversarial and priors are uniform, the solution space is bimodal with a flip of every annotator’s adversarial status and every item’s label.

The authors also model prevalence with a p(c_j) term. If prevalence is 0.5, it drops out, but their Duchenne smile example was unbalanced, so the prevalence term is important.

Comparison with Item-Response and Ideal Point Models

The authors mention item-response theory (IRT), which is also where I got the idea to use these models in the first place (too bad the epidemiologists beat us all to the punch).

A basic IRT-like model for annotation would use the difference \delta_i - \beta_j between annotator accuracy \delta_i and item difficulty \beta_j. By allowing \beta_j to be positive or negative, you can model positive and negative items on the same scale. Or you can fit separate \delta_i and \beta_j for positive and negative items, thus independently modeling sensitivity and specificity.

Discriminativeness can be modeled by a multiplicative factor \alpha_i, producing a predictor \alpha_i \, (\delta_i - \beta_j). In this way, the \delta_i term models a positive/negative bias and the \alpha_i the sharpness of the decision boundary.
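A small Python sketch of the predictor \alpha_i \, (\delta_i - \beta_j) follows. Reading it as the log odds of a positive response, and pushing it through the inverse logit, is an assumption made for illustration; the function names are hypothetical.

```python
import math


def inv_logit(u: float) -> float:
    """Inverse logit (logistic sigmoid), stable for large |u|."""
    if u >= 0.0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def irt_response_prob(alpha_i: float, delta_i: float, beta_j: float) -> float:
    """Response probability with discrimination alpha_i,
    annotator location delta_i and item location beta_j."""
    return inv_logit(alpha_i * (delta_i - beta_j))
```

When \delta_i equals \beta_j the probability is 0.5 whatever the discrimination; increasing \alpha_i moves the probability further from 0.5 for the same gap, which is the sharper decision boundary described above.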

I’m a big fan of the approach in (Uebersax and Grove 1993), which handles ordinal responses as well as discrete ones. Too bad I can’t find an open-access version of the paper.

CrowdControl: Gold Standard Crowdsource Evals

September 17, 2009

CrowdControl is a new technology from Dolores Labs. It’s a component of their new offering CrowdFlower, a federated crowdsourcing application server/API. It fronts for providers like Amazon Mechanical Turk and others. The interface looks a little bit simpler than Mechanical Turk’s and there are other nice add-ons, including help with application building from the experts at Dolores. So far, I’ve only looked over Luke’s shoulder for a demo of the beta when I was out visiting Dolores Labs.

CrowdControl: Supervised Annotator Evaluation

So what’s CrowdControl? For now, at least, it’s described on the CrowdFlower Examples page. In a nutshell, it’s Dolores Labs’ strategy of inserting gold-standard items into crowdsourced tasks and using responses to them to estimate annotator accuracy in an ongoing fashion. You may have seen this technique described and evaluated for natural language data in:

Dolores Labs uses the evaluations to give annotators feedback, filter out bad/spam annotators, and derive posterior confidence estimates for labels.
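The basic gold-seeding idea can be sketched as below. This is not Dolores Labs’ implementation; the function name, the data layout, and the add-one smoothing (a Beta(1,1) prior on accuracy) are all assumptions for illustration.

```python
from collections import defaultdict


def accuracy_on_gold(responses, gold, prior_correct=1.0, prior_wrong=1.0):
    """Estimate per-annotator accuracy from seeded gold-standard items.

    responses: iterable of (annotator, item, label) triples.
    gold: dict mapping gold-standard item ids to their true labels.
    Items without a gold label are ignored.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for annotator, item, label in responses:
        if item in gold:
            total[annotator] += 1
            correct[annotator] += int(label == gold[item])
    # Smoothed estimate: posterior mean under a beta prior on accuracy.
    return {a: (correct[a] + prior_correct)
               / (total[a] + prior_correct + prior_wrong)
            for a in total}
```

An annotator who gets 3 of 3 gold items right is estimated at (3 + 1)/(3 + 2) = 0.8 rather than 1.0, which keeps a handful of lucky answers from looking like perfection.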

Unsupervised Approach

I took their data and showed you could do just as well as gold-standard seeding using completely unsupervised model-based gold-standard inference (see my blog post on Dolores’s data and my analysis and the related thread of posts).

The main drawback to using inference is that it requires more data to get off the ground (which isn’t usually a problem in this setting). It also requires a bit more computational power to run the model, as well as all the complexity in scheduling, which is admittedly a pain (we haven’t implemented it). It’d also be harder for customers to understand and runs the risk of giving workers bad feedback if you use it as a feedback mechanism (but then so does any non-highly-redundantly-rechecked gold standard).

Overestimating Confidence

I was mostly interested in posterior estimates of gold-standard labels and their accuracy. The problem is that the model they use in the paper above (Dawid and Skene’s) typically overestimates confidence due to false independence assumptions among annotations for an item.

The problem I have in my random effects models is that I can’t estimate the item difficulties accurately enough to draw measurably better predictive inferences. That is, I can get slightly more diffuse priors in the fully Bayesian item difficulty approach, but not enough of an estimate on item difficulty to really change the gold-standard inference.

Semi-Supervised Approach

A mixture of gold-standard inference and gold-standard seeding (a semi-supervised approach) should work even better. The model structure doesn’t even change — you just get more certainty for some votes. In fact, the Bayesian Gibbs sampling and point estimated EM computations generalize very neatly, too. I just haven’t gotten around to evaluating it.

Brodley and Friedl (1999) Identifying Mislabeled Training Data

August 27, 2009

Today’s blast from the past is the descriptively titled:

It’s amazing this paper is so long given the simplicity of their approach — it’s largely careful evaluation on three approaches versus five data sets and a thorough literature survey. It’s also amazing just how many people have tackled this problem over time in more or less the same way.

The approach is almost trivial: cross-validate using multiple classifiers, throw away training instances on which the classifiers disagree with the gold standard, then train on the remaining items. They consider three forms of disagreement: at least one classifier disagrees, a majority of classifiers disagree, and all classifiers disagree with the gold label. They consider three classifiers: 1-nearest-neighbor with Euclidean distance, linear classifiers (using the “thermal training rule”, whatever that is), and C4.5-trained decision trees.
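The three disagreement rules can be sketched as follows, assuming the cross-validated predictions have already been computed for each classifier. The function name and the rule labels ("any", "majority", "all") are hypothetical, not the paper’s terminology.

```python
def filter_mislabeled(gold, predictions, rule="majority"):
    """Return indices of training instances to keep.

    gold: list of gold labels, one per instance.
    predictions: one list per classifier of held-out (cross-validated)
        predictions, aligned with gold.
    rule: "any" drops an instance if at least one classifier disagrees
        with gold, "majority" if more than half disagree, "all" if
        every classifier disagrees.
    """
    n_classifiers = len(predictions)
    keep = []
    for i, g in enumerate(gold):
        n_disagree = sum(1 for preds in predictions if preds[i] != g)
        if rule == "any":
            flagged = n_disagree >= 1
        elif rule == "majority":
            flagged = n_disagree > n_classifiers / 2
        elif rule == "all":
            flagged = n_disagree == n_classifiers
        else:
            raise ValueError("unknown rule: %r" % (rule,))
        if not flagged:
            keep.append(i)
    return keep
```

The final classifier is then trained only on the kept indices; "any" is the most aggressive filter and "all" the most conservative.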

They conclude that filtering improves accuracy, though you’ll have to be more patient than me to dissect the dozens of detailed tables they provide for more insight. (Good for them for including confidence intervals; I still wish more papers did that today.)

The authors are careful to set aside 10% of each data set for testing; the cross-validation-based filtering is on the other 90% of the data, to which they’ve introduced simulated errors.

One could substitute annotators for algorithms, or even use a mixture of the two to help with active learning. And you could try whatever form of learners you like.

What I’d like to see is a comparison with probabilistic training on the posterior category assignments, as suggested by Padhraic Smyth et al. in their 1994 NIPS paper, Inferring ground truth from subjective labelling of Venus images. I’d also like to see more of an analysis of noisy training data on evaluation, along the lines of Lam and Stork’s 2003 IJCAI paper, Evaluating classifiers by means of test data with noisy labels. In my experience, gold standards are less pure than their audience of users imagines.

Rzhetsky, Shatkay and Wilbur (2009) How to Get the Most out of Your Curation Effort

August 25, 2009

Following the scientific zeitgeist, here’s another paper rediscovering epidemiological accuracy models for data coding, this time from Mitzi’s former boss and GeneWays mastermind Andrey Rzhetsky, IR and bio text mining specialist Hagit Shatkay, and the co-creator of the Biocreative entity data, NLM/NCBI’s own John Wilbur:

Their motivations and models look familiar. They use a Dawid-and-Skene-like multinomial model and a “fixed effects” correlation-based model (to account for the overdispersion relative to the independence assumptions of the multinomial model).

Neither they nor the reviewers knew about any of the other work in this area, which is not surprising given that it’s either very new, quite obscure, or buried in the epidemiology literature under a range of different names.

Data Distribution

What’s really cool is that they’ve distributed their data through PLoS. And not just the gold standard, all of the raw annotation data. This is a great service to the community.

What they Coded

Özlem Uzuner’s i2b2 Obesity Challenge and subsequent labeling we’ve done in house convinced me that modality/polarity is really important. (Yes, this should be obvious, but isn’t when you’ve spent years looking at newswire and encyclopedia data.)

Rzhetsky et al. used 8 coders (and 5 follow-up control coders) to triple code sentences (selected with careful distributions from various kinds of literature and paper sections) for:

  • Focus (Categorical): generic, methodological, scientific
  • Direct Evidence for Claim (Ordinal): 0, 1, 2, 3
  • Polarity: (Ordinal) Positive/Negative with scale 0,1,2,3 for certainty

I’m not sure why they allowed Positive+0 and Negative+0, as they describe 0 certainty as completely uncertain.

Given the ordinal nature of their data, they could’ve used something like Uebersax and Grove’s 1993 model based on ordinal regression (and a really nice decomposition of sensitivity and specificity into accuracy and bias).

Collapsed Gibbs Sampler for Hierarchical Annotation Model

July 6, 2009

The R and BUGS combination is fine as far as it goes, but it’s slow, hard to debug, and doesn’t scale well. Because I’m going to need to process some big Turker-derived named entity corpora in the next month (more about that later), I thought I’d work on scaling the sampler.

Mitzi was away over the weekend, so I naturally spent my newly found “free time” coding and reading up on stats. While I was procrastinating on refactoring feature extraction for CRFs by reading a really neat paper (On Smoothing and Inference for Topic Models) from the Irvine paper mill (I also blogged about another of their papers on fast LDA sampling), it occurred to me that I could create a fast collapsed sampler for the multinomial form of the hierarchical models of annotation I’ve blogged about.

Hierarchical Model of Binomial Annotation

The model’s quite simple, at least in the binomial case. I’ve further simplified here to the case where every annotator labels every item, but the general case is just as easy (modulo indexing):

Variable      Range   Status     Distribution                      Description
I             > 0     input      fixed                             number of items
J             > 0     input      fixed                             number of annotators
π             [0,1]   estimated  Beta(1,1)                         prevalence of category 1
c[i]          {0,1}   estimated  Bern(π)                           category for item i
θ0[j]         [0,1]   estimated  Beta(α0,β0)                       specificity of annotator j
θ1[j]         [0,1]   estimated  Beta(α1,β1)                       sensitivity of annotator j
α0/(α0+β0)    [0,1]   estimated  Beta(1,1)                         prior specificity mean
α0 + β0       (0,∞)   estimated  Pareto(1.5)*                      prior specificity scale
α1/(α1+β1)    [0,1]   estimated  Beta(1,1)                         prior sensitivity mean
α1 + β1       (0,∞)   estimated  Pareto(1.5)*                      prior sensitivity scale
x[i,j]        {0,1}   input      Bern(c[i]==1 ? θ1[j] : 1-θ0[j])   annotation of item i by annotator j
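To make the generative story concrete, here is a minimal Python sketch that simulates categories and annotations given fixed prevalence, specificities and sensitivities. It leaves out the beta priors and hyperpriors, and the function and argument names are hypothetical rather than taken from the sandbox code.

```python
import random


def simulate_annotations(num_items, num_annotators, prevalence,
                         specificity, sensitivity, rng):
    """Simulate the annotation model with every annotator labeling
    every item.

    specificity[j] and sensitivity[j] are annotator j's theta0 and theta1.
    Returns (categories, annotations) where annotations[i][j] is x[i,j].
    """
    # c[i] ~ Bern(prevalence)
    categories = [int(rng.random() < prevalence) for _ in range(num_items)]
    annotations = []
    for c in categories:
        row = []
        for j in range(num_annotators):
            # Probability of a 1 response: sensitivity for true
            # positives, 1 - specificity for true negatives.
            p_one = sensitivity[j] if c == 1 else 1.0 - specificity[j]
            row.append(int(rng.random() < p_one))
        annotations.append(row)
    return categories, annotations
```

Simulated data of this kind is useful for checking that a sampler recovers the parameters that generated it.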

The Collapsed Sampler

The basic idea is to sample only the category assignments c[i] in each round of Gibbs sampling. With categories given, it’s easy to compute prevalence, annotator sensitivity and specificity given their conjugate priors.

The only thing we need to sample is c[i], and we can inspect the graphical model for dependencies: the parent π of the c[i], and the parents θ0 and θ1 of the descendants x[i,j] of c[i]. The formula’s straightforwardly derived with Bayes’ rule:

p(c[i] | x, θ0, θ1) ∝ p(c[i]) * Π_{j in 1:J} p(x[i,j] | c[i], θ0[j], θ1[j])
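The sampling step for a single item can be sketched in Python as below. It works on the log scale to guard against underflow with many annotators; the function names are hypothetical, it assumes all parameters lie strictly between 0 and 1, and it is not the sandbox implementation.

```python
import math
import random


def category_posterior(x_i, prevalence, specificity, sensitivity):
    """Pr(c[i] = 1 | x[i,:], prevalence, specificity, sensitivity).

    x_i[j] is annotator j's 0/1 label for item i.
    """
    log_p1 = math.log(prevalence)        # log p(c[i] = 1)
    log_p0 = math.log(1.0 - prevalence)  # log p(c[i] = 0)
    for x, spec, sens in zip(x_i, specificity, sensitivity):
        log_p1 += math.log(sens if x == 1 else 1.0 - sens)
        log_p0 += math.log(1.0 - spec if x == 1 else spec)
    # Subtract the max before exponentiating so one term is exactly 1.
    m = max(log_p1, log_p0)
    p1 = math.exp(log_p1 - m)
    p0 = math.exp(log_p0 - m)
    return p1 / (p1 + p0)


def sample_category(x_i, prevalence, specificity, sensitivity, rng):
    """Draw c[i] from its conditional distribution."""
    p1 = category_posterior(x_i, prevalence, specificity, sensitivity)
    return int(rng.random() < p1)
```

One epoch of the sampler calls `sample_category` for every item and then re-estimates prevalence, sensitivities and specificities from the sampled categories.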

Moment-Matching Beta Priors

*The only trick is estimating the priors over the sensitivities and specificities, for which I took the usual expedient of using moment matching. Note that this does not take into account the Pareto prior on scales of the prior specificity and sensitivity (hence the asterisk in the table). In particular, given a set of annotator specificities (and there were 200+ annotators for the named-entity data), we find the beta prior with mean matching the empirical mean and variance matching the empirical variance (requires some algebra).
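The algebra works out as follows: a Beta(\alpha, \beta) with mean m and variance v has \alpha + \beta = m(1 - m)/v - 1, so \alpha = m(\alpha + \beta) and \beta = (1 - m)(\alpha + \beta). A minimal Python sketch, with a hypothetical function name and the variance computed with a divide-by-n denominator as an assumption:

```python
def beta_moment_match(values):
    """Fit Beta(alpha, beta) to values in (0, 1) by matching the
    empirical mean and variance. Returns (alpha, beta)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    # A beta distribution requires 0 < var < mean * (1 - mean).
    if not 0.0 < var < mean * (1.0 - mean):
        raise ValueError("moments are not attainable by a beta distribution")
    scale = mean * (1.0 - mean) / var - 1.0  # alpha + beta
    return mean * scale, (1.0 - mean) * scale
```

For example, values with mean 0.8 and variance 0.01 give \alpha + \beta = 0.16/0.01 - 1 = 15, so \alpha = 12 and \beta = 3.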

I’m not too worried about the Pareto scale prior — it’s pretty diffuse. I suppose I could’ve done something with maximum likelihood rather than moment matching (but for all I know, this is the maximum likelihood solution! [update: it’s not the ML estimate; check out Thomas Minka’s paper Estimating a Dirichlet Distribution and references therein.]).


The inputs are initial values for annotator specificity, annotator sensitivity, and prevalence. These are used to create the first category sample given the above equation, which allows us to define all the other variables for the first sample. Then each epoch just resamples all the categories, then recomputes all the other estimates. This could be made more stochastic by updating all of the variables after each category update, but it converges so fast as is that it hardly seemed worth the extra coding effort. I made sure to scale for underflow, and that’s about it.

It took about an hour to think about (most of which was working out the moment matching algebra, which I later found in Gelman’s BDA book’s appendix), about two hours to code, and about forty-five minutes to write up for this blog entry.

Speed and Convergence

It’s very speedy and converges very very quickly compared to the full Gibbs sampler in BUGS. I spent another hour after everything was built and running writing the online sample handler that’d compute means and deviations for all the estimated variables, just like the R/BUGS interface prints out. Having the online mean and variance calculator was just what I needed (more about it later, too), especially as many of the values were very close to each other and I didn’t want to store all of the samples (part of what’s slowing BUGS down).
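An online mean and variance accumulator of this kind can be written in a few lines. The Python sketch below uses Welford’s algorithm, which is a standard choice and an assumption here, not necessarily what the sandbox handler does; the class name is hypothetical.

```python
class OnlineMoments:
    """Running mean and variance without storing the samples."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        """Fold one new sample into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the updated mean, which keeps the update numerically
        # stable even when values are very close to each other.
        self.m2 += delta * (x - self.mean)

    def variance(self):
        """Unbiased sample variance (0.0 with fewer than two samples)."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0
```

One accumulator per estimated variable is enough to report posterior means and deviations after sampling, with memory use independent of the number of samples.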


I didn’t run into identifiability problems, but in general, something needs to be done to get rid of the dual solutions (you’d get them here, in fact, if the initial sensitivities and specificities were worse than 0.5).

Open Question (for me)

My only remaining question is: why does this work? I don’t understand the theory of collapsed samplers. First, I don’t get nearly the range of possible samples for the priors given that they’re estimated from discrete sets. Second, I don’t apply beta-binomial inference in the main sampling equation — that is, I take the prevalence and annotator sensitivities and specificities as point estimates rather than integrating out their beta-binomial form. Is this kosher?

Downloading from LingPipe Sandbox

You can find the code in the LingPipe Sandbox in the project hierAnno (the original R and BUGS code and the data are also in that project). It’s not documented at all yet, but the one Ant task should run out of the box; I’ll probably figure out how to move the application into LingPipe.

[Update: The evaluation looks fine for named entities; I’m going to censor the data the same way as in the NAACL submission, and then evaluate again; with all the negative noise, it’s a bit worse than voting as is and the estimates aren’t useful because the specificities so dominate the calculations. For Snow et al.’s RTE data, the collapsed estimator explodes just like EM, with sensitivity scales diverging to infinity; either there’s a bug, the collapsed sampler isn’t stable, or I really do need a prior on those scales!]

[Update 2: The censored NE evaluation with collapsed Gibbs sampling and simple posterior evaluation by category sample worked out to have one fewer error than the full sampling model in BUGS (versus the original, albeit noisy, gold standard): collapsed Gibbs 232 errors, full Gibbs 233 errors, voting 242.5 errors (the half is from flipping coins on ties). Voting after throwing out the two really noisy annotators is probably competitive as is. I still don’t know why the RTE data’s blowing out variance.]
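A fractional error count like 242.5 falls out naturally if each tie is scored as an expected half error. The sketch below assumes that scoring, rather than a single simulated coin flip per tie; the function name is hypothetical.

```python
def voting_errors(votes_per_item, gold):
    """Expected number of majority-vote errors against the gold labels.

    votes_per_item holds one list of 0/1 votes per item.
    """
    errors = 0.0
    for votes, truth in zip(votes_per_item, gold):
        ones = sum(votes)
        zeros = len(votes) - ones
        if ones == zeros:
            errors += 0.5  # a fair coin flip is wrong half the time
        elif (1 if ones > zeros else 0) != truth:
            errors += 1.0
    return errors
```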

Raykar et al. (2009) Supervised Learning from Multiple Experts: Whom to Trust when Everyone Lies a Bit

June 16, 2009

Brendan just sent me a link to this fresh-off-the-press ICML paper:

The scientific zeitgeist says to assess inter-annotator agreement, infer gold standards, and use them to train classifiers.

Raykar et al. use EM for a binomial model of annotator sensitivity and specificity (like Dawid and Skene’s original multinomial approach from the 1970s paper and the Snow et al. EMNLP paper). My experiments showed full Bayesian models slightly outperform EM, which slightly outperforms naive voting (the effects are stronger with fewer annotators).

The obvious thing to do is to take the output of the gold standard inference and use that to train a classifier. With EM, you can use the MAP estimate of category likelihoods (a fuzzy gold standard); with Bayesian models, you can sample from the posterior, which provides more dispersion. Smyth et al.’s 1995 NIPS paper showed EM-style training was effective for simulations.

I was just in San Francisco presenting this work to the Mechanical Turk Meetup, and Jenny Finkel opined that fuzzy training wouldn’t work well in practice. Even taking the discussion offline, I’m still not sure why she thinks that [update: see her comments below]. In some ways, if we use the fuzzy truth as the gold standard, then using it to train should perform better than quantizing the gold standard to 0/1. There’s not a problem with convexity; we just impute a big data set with Gibbs sampling and train on that. We could even train up an SVM or naive Bayes system that way.
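A minimal sketch of fuzzy training, assuming a plain logistic regression fit by batch gradient descent with the category probabilities used directly as targets. Training on a large data set imputed by sampling hard labels from those probabilities gives the same gradient in expectation. The function names, learning rate, and epoch count are arbitrary illustrative choices.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_soft_logistic(xs, soft_labels, epochs=500, rate=0.1):
    """Fit weights and bias to probabilistic targets in [0, 1]."""
    dim = len(xs[0])
    w = [0.0] * dim
    b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, p in zip(xs, soft_labels):
            # Cross-entropy gradient: prediction minus the soft target.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - p
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        w = [wi - rate * g / n for wi, g in zip(w, grad_w)]
        b -= rate * grad_b / n
    return w, b
```

The loss stays convex in the weights with soft targets, since it is a weighted sum of the two ordinary log losses per item.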

The interesting twist in the Raykar et al. paper is to jointly estimate a logistic regression classifier along with the gold standard. That is, throw the regression coefficients into the model and estimate them along with everything else. That’s the same linkage as I suggested above. But Raykar et al. go further — they let the trained model vote on the gold standard just like another annotator.
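A minimal sketch of how the classifier gets its vote, based on a reading of the E-step in the paper and not checked against the authors' code. The classifier's predicted probability for an item takes the place of the fixed prevalence, and the annotators' labels update it through their sensitivities and specificities. The function name is hypothetical.

```python
def gold_posterior(classifier_prob, labels, sens, spec):
    """Return p(category = 1) for one item given classifier and annotators.

    classifier_prob is the classifier's probability of category 1;
    labels is a list of (annotator, label) pairs with label in {0, 1}.
    """
    a = 1.0  # likelihood of the labels if the true category is 1
    b = 1.0  # likelihood of the labels if the true category is 0
    for j, y in labels:
        a *= sens[j] if y == 1 else 1.0 - sens[j]
        b *= 1.0 - spec[j] if y == 1 else spec[j]
    num = a * classifier_prob
    return num / (num + b * (1.0 - classifier_prob))
```

With no annotator labels the posterior is just the classifier's probability, and a classifier that says 0.1 is pulled back to 0.5 by a single positive label from an annotator with sensitivity and specificity of 0.9.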

Even though the annotation model corrects for individual annotator bias (or in this case, the logistic regression classifier’s bias as estimated), each annotator still affects the overall model through its bias-adjusted vote (if it didn’t, you couldn’t get off the ground at all). If you evaluate the classifier on a “gold standard” which was voted upon by a committee including the classifier itself, the classifier should perform better because it’s getting a vote on the truth!

The right question is whether Raykar et al.’s jointly estimated classifiers are “better” in some sense than ones trained on the imputed gold standard. For that, I’d think we’d need some kind of held-out eval, but that begs the question on inferring the gold standard. The gold standards behind Snow et al.’s work weren’t that pure after all (I have some commentary on discrepancies in the paper cited below).

I have considered using the trained classifier as another annotator when doing active learning of the kind proposed in Sheng et al.’s 2008 KDD paper on getting another label for an existing item vs. annotating a new item. In fact, there’s no reason in principle why you can’t have more than one classifier being trained along with annotator sensitivities and specificities.

Another nice idea in the Raykar et al. paper is the use of simulation from a known gold standard to create a fuzzy gold standard. That’s still questionable, in that it’s generating fake data that are known to follow the model. But everyone should do this in every way possible for all parts of their models, so you can bet I’ll be saving this one for my bag of tricks.
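A minimal sketch of that kind of simulation, assuming each simulated annotator labels every item independently according to its own sensitivity and specificity. The function name is hypothetical, and the output format (a list of (annotator, label) pairs per item) is just a convenient choice.

```python
import random

def simulate_labels(gold, sens, spec, rng):
    """Draw one 0/1 label per annotator per item from known gold categories."""
    items = []
    for c in gold:
        labels = []
        for j in range(len(sens)):
            # Chance of a 1: sensitivity on positives, 1 - specificity on negatives.
            p_one = sens[j] if c == 1 else 1.0 - spec[j]
            labels.append((j, int(rng.random() < p_one)))
        items.append(labels)
    return items
```

Because the gold categories and annotator accuracies are known by construction, the estimates recovered by any of the models can be checked directly against them.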

I’m a little unclear on why the numbers in the lefthand plots in figures 1 and 2 don’t have the same AUC value for the proposed algorithm. Figure 2 actually does evaluate the gold-standard estimation followed by classifier estimation. If I’m reading that figure right, then training on the imputed gold standard didn’t do measurably better than the majority voted baseline.

[Update with comment: The righthand plot in figure 2 is of the inferred gold standard versus the “golden gold standard”. It’s possible to plot this because the inferred gold standard is actually a point probability estimate of the item being in category 1.]

If we’re lucky, Raykar et al. will share their data. [Update 2: no luck — the data belongs to Siemens.]

P.S. All of these models assume the annotators don’t actually lie. Specifically, in order for the models to be identifiable, we need to assume the annotators are not adversarial (an adversarial annotator knows the right answer and intentionally lies, and thus performs worse than chance). There was, to reinforce the zeitgeist, also a paper about mixing in adversarial coders at ICML, Dekel and Shamir’s Good learners for evil teachers.