It’s striking just how much the zeitgeist affects scientific thinking. I’m thinking about my own present work on deriving gold standards at the moment, but this has been a recurring theme over my whole scientific career.

The first theorem I proved in grad school was the equivalence of ID/LP grammars and standard CFGs. It’s a particularly trivial reduction, so it was no surprise that Stu Shieber had already done it. But it was my first exposure to independently discovering a known result.

While at CMU, I was working with a student on a project on unification-based phonology, and it turns out my old Ph.D. advisor, Ewan Klein, had done the same thing. Surprising? Not really. We were working from Ewan and Steve Bird’s earlier work, and had done the next logical step. Well, so had they.

Fast forward to the present, and I find myself in the position of having independently recreated about thirty years’ worth of work in epidemiology (without the empirical legwork, of course). Am I a genius? Of course not. It’s just the benefit of a 2008 perspective on hierarchical Bayesian modeling and having the same problem to solve.

So how’d I stumble onto the same models? Well, first, I had a problem to solve, namely figuring out the recall of a set of annotations for gene mention linkage into Entrez-Gene without a gold standard. Then I started paying attention to the literature.

The first thing that struck me was Klebanov et al.’s paper on easy/hard cases for annotation and establishing a gold standard:

- Beata Beigman Klebanov, Eyal Beigman, Daniel Diermeier. 2008. Analyzing Disagreements. COLING.

This got me thinking about inferring gold standards rather than inter-annotator agreement rates, as well as the nature of errors. We’d done some data annotation ourselves, and this point of view makes a lot of sense.

I also found at the time Reidsma and Carletta’s paper on annotator bias, which brought home the point that different annotators had different abilities.

- Dennis Reidsma and Jean Carletta. In press. Reliability measurement without limits. To appear in *Computational Linguistics*.

This also felt right from our own and others’ data annotation efforts, but it’s just not built into things like the kappa statistics.

At the same time, I’d started dropping in on Andrew Gelman and Jennifer Hill’s working group on multiple imputation. My goal was to learn R and BUGS while learning how the statisticians thought about these latent data problems. I always recommend diving into the deep end on problems like this; it takes a while to pick up the natives’ lingo, but it’s almost always worth it. Reading through Gelman et al.’s *Bayesian Data Analysis* led me to hierarchical binomial models (it’s their first example). Reading Gelman and Hill’s regression book led me to the item-response models, which are the same kind of thing in a generalized linear model setting with item- and annotator-level predictors.

Hmm, I think, these models I’m learning about from Andrew and Jennifer could be applied to these problems I’m having in data annotation. So I wrote out the basic model I had in mind that separated out sensitivity and specificity so I could account for the kind of bias that Reidsma and Carletta mentioned. I coded it up in R and BUGS and it seemed to provide quite reasonable results. I was really really psyched. I haven’t been this excited about an idea I’ve had in ages. I wrote my model up on the board at the next group meeting, and Jennifer suggested adding difficulty coefficients for the items (problems to classify) like in the item-response models. As far as I know she knows as little about the work in epidemiology as I did. I then modified the binomial model for easy/normal mixtures. Now I have a whole suite of models that I’ve never seen before, though they’re built out of the same lego bricks as other models with which I am familiar.
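To make the sensitivity/specificity idea concrete, here is a minimal sketch in Python. It is not the BUGS model I coded up; it assumes a binary task and pretends the prevalence and each annotator’s sensitivity and specificity are already known, then applies Bayes’ rule to get the probability that an item is truly positive. The function name and all the numbers are made up for illustration.

```python
def posterior_positive(labels, sensitivity, specificity, prevalence):
    """Return P(item is truly positive | labels).

    labels: list of (annotator, label) pairs, label in {0, 1}.
    sensitivity[a]: P(annotator a says 1 | item is truly 1).
    specificity[a]: P(annotator a says 0 | item is truly 0).
    Annotators are assumed independent given the true category.
    """
    like_pos = prevalence          # joint probability with "truly positive"
    like_neg = 1.0 - prevalence    # joint probability with "truly negative"
    for annotator, label in labels:
        sens = sensitivity[annotator]
        spec = specificity[annotator]
        like_pos *= sens if label == 1 else 1.0 - sens
        like_neg *= 1.0 - spec if label == 1 else spec
    return like_pos / (like_pos + like_neg)


# Hypothetical annotators: "a" over-generates positives, "b" is conservative.
sensitivity = {"a": 0.9, "b": 0.6}
specificity = {"a": 0.8, "b": 0.95}

# The same split vote means different things depending on who said what.
p_a_says_yes = posterior_positive([("a", 1), ("b", 0)], sensitivity, specificity, 0.2)
p_b_says_yes = posterior_positive([("a", 0), ("b", 1)], sensitivity, specificity, 0.2)
print(p_a_says_yes, p_b_says_yes)
```

A single accuracy number per annotator can’t distinguish these two cases, which is why separating sensitivity from specificity matters for the kind of bias Reidsma and Carletta describe.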

So I write off to Beata and post a blog entry with some sketches of my models. Then I hear from Brendan O’Connor of Dolores Labs that Panos Ipeirotis’s blog has a post on Amazon Mechanical Turk with lots of references to articles trying to do the same thing.

My point of entry was Dawid and Skene’s 1979 use of EM to get a point estimate of annotator accuracy and gold standard categories. It turns out that Bruce and Wiebe (1999), in a paper on sentiment corpus construction, had a reference to Dawid and Skene as well as an application. I’d read Bruce and Wiebe’s paper, but the fundamental philosophical points it raises didn’t stand out at the time. I’m actually surprised more people don’t reference this.
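For anyone who hasn’t seen the EM approach, here is a rough sketch of the idea in Python. It is a simplified binary version, not Dawid and Skene’s original multi-category formulation and not the models from my earlier post. The function name, the small additive smoothing (which keeps estimates away from exactly 0 or 1), the majority-vote initialization, and the toy data are all my own assumptions for illustration.

```python
from collections import defaultdict


def em_binary_annotation(triples, iterations=50, smoothing=0.01):
    """EM point estimates from (item, annotator, label) triples, label in {0, 1}.

    Returns (p_pos, sensitivity, specificity, prevalence), where p_pos[item]
    is the estimated probability that the item's true category is 1.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    triples = list(triples)
    items = sorted({i for i, _, _ in triples})
    annotators = sorted({j for _, j, _ in triples})

    # Initialize each item's P(positive) with its fraction of positive votes.
    votes = defaultdict(list)
    for i, _, y in triples:
        votes[i].append(y)
    p_pos = {i: sum(votes[i]) / len(votes[i]) for i in items}

    for _ in range(iterations):
        # M-step: re-estimate prevalence and annotator accuracies
        # from expected counts under the current p_pos.
        prevalence = sum(p_pos.values()) / len(items)
        tp, pos = defaultdict(float), defaultdict(float)
        tn, neg = defaultdict(float), defaultdict(float)
        for i, j, y in triples:
            pos[j] += p_pos[i]
            neg[j] += 1.0 - p_pos[i]
            if y == 1:
                tp[j] += p_pos[i]
            else:
                tn[j] += 1.0 - p_pos[i]
        sens = {j: (tp[j] + smoothing) / (pos[j] + 2 * smoothing) for j in annotators}
        spec = {j: (tn[j] + smoothing) / (neg[j] + 2 * smoothing) for j in annotators}

        # E-step: recompute each item's P(positive) given the new parameters.
        like_pos = {i: prevalence for i in items}
        like_neg = {i: 1.0 - prevalence for i in items}
        for i, j, y in triples:
            like_pos[i] *= sens[j] if y == 1 else 1.0 - sens[j]
            like_neg[i] *= 1.0 - spec[j] if y == 1 else spec[j]
        p_pos = {i: like_pos[i] / (like_pos[i] + like_neg[i]) for i in items}

    return p_pos, sens, spec, prevalence


# Toy data: annotators "a" and "b" agree; "c" says 1 for everything.
# "c" skipped item 5 (missing) and "a" labeled item 0 twice (replicated).
data = [(i, "a", 1) for i in (0, 0, 1, 2)] + [(i, "a", 0) for i in (3, 4, 5)]
data += [(i, "b", 1) for i in (0, 1, 2)] + [(i, "b", 0) for i in (3, 4, 5)]
data += [(i, "c", 1) for i in (0, 1, 2, 3, 4)]

p_pos, sens, spec, prevalence = em_binary_annotation(data)
print(p_pos, sens, spec, prevalence)
```

Because the input is just a list of triples, nothing requires every annotator to label every item exactly once. The sketch multiplies raw probabilities, so a real implementation with many labels per item would want to work on the log scale.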

From this, I could find more entries in the epidemiology literature that cited it. What I found is that in a sequence of incremental steps, the epidemiologists had already explored exactly the same models over the past 30 years. Their models have more or less caught up to the ones that I sketched on the board (the basic form of which is the same graphical model as Dawid and Skene used) and have evolved in the direction Jennifer suggested.

There were only two aspects of what I came up with that I didn’t find in the epidemiology literature: inference for the hierarchical parameters (they tended to estimate them with prior knowledge and prune interactions based on knowledge of their diagnostic tests), and general missing and replicated data in the sense of not every annotator annotating every piece of data and some annotating a piece of data more than once. Once you kick over to general Bayesian inference, as provided by BUGS, handling the missing and replicated data is trivial if you code it up the way Gelman and Hill code up item-response models. The hierarchical parameters (for priors on annotator sensitivity and specificity and problem difficulty) also wind up being easy to estimate (actually, difficulty is hard; the models seem to be only weakly identified). This extension wasn’t so much an extension as the way I was taught to think about models by hanging out with Jennifer and Andrew and reading their books.

If you want other examples of the zeitgeist effect, I would highly recommend James Gleick’s biographies of Richard Feynman and Isaac Newton (though the former is the much better and more interesting book). As Newton said, *If I have seen a little further it is by standing on the shoulders of Giants*, though as it turns out, he either re-invented or borrowed the expression himself.

September 14, 2008 at 5:00 pm

Hi Bob,

Is this post an indirect way of saying that you were scooped on the annotation evaluation model you posted on the 5th Sept? If so, is there a specific paper on this kind of annotation evaluation that you can recommend? BTW, I think your model is very nice, as it gives us a way of estimating the accuracy of the annotations, which is what we really want to know.

Mark

September 15, 2008 at 10:23 am

Yes. I thought it was pretty direct.

I was scooped by roughly 30 years in the epidemiology literature for a simple model, and by 10 years in the epidemiology literature for both Bayesian models. They don’t seem to have ever estimated their hierarchical priors (the beta params), or dealt with partial data, but those are pretty simple additions.

That’d be about 10 years in the CL literature — Bruce and Wiebe applied the Dawid and Skene 1979 scheme in 1999, so I’d say I was scooped on the notion of trying to induce a gold standard by about 10 years.

The really cool stuff coming out now is the EMNLP paper from Dolores Labs and friends on using the Amazon Mechanical Turk to do annotations. I’m about to dump out results of the binomial tagging models on that data.

December 17, 2008 at 1:58 pm

There’s a great discussion of the topic of multiple discoveries at Peter Turney’s blog Apperceptual in a post titled “The Heroic Theory of Scientific Development”. Lots of great references in the comments, too.