Here I’m assuming conditional independence of the annotators given the true category of the item being annotated. If you actually look at the annotations, they’re highly correlated! In addition, the sensitivity and specificity terms for modeling annotator accuracy provide a bit more power to model shared biases (e.g., higher specificity than sensitivity).
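To make the assumption concrete, here’s a minimal sketch (all names and parameter values are hypothetical, not from the report) of a binary model in which, given the true category, each annotator’s label is an independent draw governed by that annotator’s sensitivity and specificity, so the likelihood factors into a product over annotators:

```python
import random

random.seed(0)

def simulate_labels(z, sensitivities, specificities):
    """Draw one label per annotator, independently given true category z."""
    labels = []
    for sens, spec in zip(sensitivities, specificities):
        p_one = sens if z == 1 else 1.0 - spec  # P(label = 1 | z)
        labels.append(1 if random.random() < p_one else 0)
    return labels

def likelihood(labels, z, sensitivities, specificities):
    """P(labels | z) factors into a product over annotators
    under conditional independence."""
    prob = 1.0
    for y, sens, spec in zip(labels, sensitivities, specificities):
        if z == 1:
            prob *= sens if y == 1 else 1.0 - sens
        else:
            prob *= (1.0 - spec) if y == 1 else spec
    return prob

# Annotators sharing a bias: specificity higher than sensitivity
# across the board (illustrative values).
sens = [0.70, 0.75, 0.65]
spec = [0.95, 0.90, 0.92]
labels = simulate_labels(1, sens, spec)
print(labels, likelihood(labels, 1, sens, spec))
```

The conditional-independence assumption is exactly the factorization in `likelihood`: nothing couples one annotator’s label to another’s once `z` is fixed, which is what the observed correlations violate.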

There are two extensions, only one of which I’ve pursued.

The first is to use a fixed-effects-type model and try to estimate the correlations among annotators. That’s a pretty big covariance matrix with 150 annotators, and most pairs of annotators share no data, making direct estimation pretty much impossible.
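To see the scale of the problem, a quick count (the assignment data below is a made-up toy example): 150 annotators give 150 × 149 / 2 pairwise correlations, and only pairs that co-annotate at least one item contribute any data toward their correlation at all.

```python
from itertools import combinations

# A full covariance matrix over J annotators has J*(J-1)/2
# free pairwise correlations.
J = 150
n_pairs = J * (J - 1) // 2
print(n_pairs)  # 11175

# Hypothetical assignments: item -> set of annotator ids who labeled it.
assignments = {
    "item1": {0, 1, 2},
    "item2": {1, 3},
    "item3": {4, 5},
}

# Only co-occurring pairs are even potentially estimable.
observed_pairs = set()
for annotators in assignments.values():
    observed_pairs.update(combinations(sorted(annotators), 2))
print(len(observed_pairs))  # 5 of 11175 pairs have any shared data
```

With realistic assignment patterns the observed fraction stays tiny, which is why direct estimation of the full matrix is hopeless without pooling.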

The second is to use a random-effects model, and I’ve done this and written it up in the tech report (linked from the white papers section of the blog) and sketched it in the talk and tutorial.

In the simplest random-effects model, each item being annotated gets an effect, which essentially models how difficult the item is to annotate. This is like the approach taken by Uebersax and Grove (1993) in their ordinal rating model, and has been widely adopted within epidemiology, where the random effect models something like size of tumor and the ordinal rating something like stage of cancer.
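One common way to write this down (a sketch with a hypothetical parameterization, not the exact one in the tech report) puts an annotator ability `alpha_j` and an item difficulty `beta_i` on the log-odds scale, with accuracy given by their difference, so a hard item drags every annotator’s accuracy down together and thereby induces correlation among their labels:

```python
import math

def inv_logit(x):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def p_correct(alpha_j, beta_i):
    """Chance that annotator j labels item i correctly:
    ability minus difficulty on the log-odds scale."""
    return inv_logit(alpha_j - beta_i)

alpha = 1.5          # a fairly accurate annotator (illustrative value)
easy, hard = -1.0, 2.0
print(round(p_correct(alpha, easy), 3))  # high accuracy on the easy item
print(round(p_correct(alpha, hard), 3))  # lower accuracy on the hard item
```

Marginalizing over the shared `beta_i` is what breaks the conditional independence of the raw annotator-accuracy model.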

The data from your (co-authored) *CIKM* paper, Assessor Error in Stratified Evaluation, is much richer. What I’d like to do is model accuracy based on strata effects. In the simplest case here, each stratum gets a difficulty parameter. In the richer model, you’d use those as priors for the individual item effects.
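The hierarchical version of that idea can be sketched like this (stratum names and numbers are hypothetical): each stratum gets a mean difficulty, and individual item difficulties are drawn around their stratum’s mean, so the stratum effects act as priors on the item effects.

```python
import random

random.seed(1)

# Hypothetical stratum-level mean difficulties on the log-odds scale.
stratum_mean = {"top": -0.5, "middle": 0.5, "bottom": 1.5}
tau = 0.3  # within-stratum spread of item difficulties

def draw_item_difficulty(stratum):
    """Item difficulty ~ Normal(stratum mean, tau): the stratum
    effect serves as the prior location for the item effect."""
    return random.gauss(stratum_mean[stratum], tau)

items = [("doc_a", "top"), ("doc_b", "middle"), ("doc_c", "bottom")]
for item_id, stratum in items:
    print(item_id, stratum, round(draw_item_difficulty(stratum), 2))
```

In the simplest case you’d drop the item-level draw entirely and use the stratum difficulty directly; the richer model keeps both levels, with `tau` controlling how much items can deviate from their stratum.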

The problem I had with individual item difficulty effects is the usual one: with only a handful of annotators (10 or fewer) per item, it’s hard to get a read on item difficulty. There are too many “degrees of freedom” in a possible explanation of the annotators’ labels; you can either say an item is hard or that its annotators are inaccurate.
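That tradeoff is easy to exhibit in the ability-minus-difficulty sketch above: accuracy depends on annotator ability and item difficulty only through their difference, so shifting both by the same amount leaves the likelihood unchanged, and with few labels per item the data barely distinguish the two stories.

```python
import math

def p_correct(alpha, beta):
    """Accuracy depends only on the difference alpha - beta."""
    return 1.0 / (1.0 + math.exp(-(alpha - beta)))

# Two explanations of the same observed accuracy:
print(p_correct(2.0, 1.0))   # able annotator, hard item
print(p_correct(0.5, -0.5))  # weaker annotator, easy item: same value
```

Hierarchical priors (like the stratum effects) are what pin this down, by shrinking item difficulties toward a shared location instead of letting each item absorb its annotators’ errors.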
