Following the scientific zeitgeist, here’s another paper rediscovering epidemiological accuracy models for data coding, this time from Mitzi’s former boss and GeneWays mastermind Andrey Rzhetsky, IR and bio text mining specialist Hagit Shatkay, and the co-creator of the Biocreative entity data, NLM/NCBI’s own John Wilbur:

- Rzhetsky, Andrey, Hagit Shatkay, and W. John Wilbur. 2009. How to get the most out of your curation effort. *PLoS Computational Biology*. [free download]

Their motivations and models look familiar. They use a Dawid-and-Skene-like multinomial model and a “fixed effects” correlation-based model (to account for the overdispersion relative to the independence assumptions of the multinomial model).

Neither they nor the reviewers knew about any of the other work in this area, which is not surprising given that it’s either very new, quite obscure, or buried in the epidemiology literature under a range of different names.

### Data Distribution

What’s really cool is that they’ve distributed their data through PLoS. And not just the gold standard, all of the raw annotation data. This is a great service to the community.

### What they Coded

Özlem Uzuner’s i2b2 Obesity Challenge and subsequent labeling we’ve done in house convinced me that modality/polarity is really important. (Yes, this should be obvious, but isn’t when you’ve spent years looking at newswire and encyclopedia data.)

Rzhetsky et al. used 8 coders (and 5 follow-up control coders) to triple code sentences (selected with careful distributions from various kinds of literature and paper sections) for:

- Focus (Categorical): generic, methodological, scientific
- Direct Evidence for Claim (Ordinal): 0, 1, 2, 3
- Polarity (Ordinal): Positive/Negative, with a scale of 0, 1, 2, 3 for certainty

I’m not sure why they allowed Positive+0 and Negative+0, as they describe 0 certainty as completely uncertain.

Given the ordinal nature of their data, they could’ve used something like Uebersax and Grove’s 1993 model based on ordinal regression (and a really nice decomposition of sensitivity and specificity into accuracy and bias).

August 25, 2009 at 6:45 pm

wow, this looks great. thanks for bringing it to people’s attention…

August 25, 2009 at 9:21 pm

Here’s another approach to ordinals:

Frank, E. and Hall, M. A simple approach to ordinal classification. http://researchcommons.waikato.ac.nz/bitstream/10289/64/1/content.pdf

August 26, 2009 at 4:10 pm

Frank and Hall’s method is indeed simple. They take an ordinal scale, using the example cool < mild < hot, and build two decision-tree classifiers to estimate p(y>cool|x) and p(y>mild|x), binarizing the data at each threshold. They then estimate

p(y=cool|x)=1-p(y>cool|x)

p(y=mild|x)=p(y>cool|x)-p(y>mild|x)

p(y=hot|x)=p(y>mild|x).
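As a minimal sketch of that combination step (the function name and the input probabilities are made up for illustration, not taken from Frank and Hall’s paper), the two binary estimates can be turned into class probabilities like this. The middle class is taken as p(y>cool|x) - p(y>mild|x) so that the three values sum to one. The second call shows what can happen when the two independently trained classifiers disagree about the ordering.

```python
def frank_hall_probs(p_gt_cool, p_gt_mild):
    """Combine two binary estimates into values for cool < mild < hot.

    p_gt_cool is an estimate of p(y > cool | x),
    p_gt_mild is an estimate of p(y > mild | x).
    """
    return {
        "cool": 1.0 - p_gt_cool,
        "mild": p_gt_cool - p_gt_mild,
        "hot": p_gt_mild,
    }


# Estimates that respect the ordering give a proper distribution.
consistent = frank_hall_probs(p_gt_cool=0.7, p_gt_mild=0.2)

# Nothing forces two separately trained classifiers to agree, so the
# middle class can come out negative.
inconsistent = frank_hall_probs(p_gt_cool=0.3, p_gt_mild=0.5)

print(consistent)
print(inconsistent)
```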

I don’t see why they say this is less ad hoc than ordinal logistic regression. Ordinal regression computes a single linear predictor β’x and models p(y=c|x) using cut points for the cool-mild and mild-hot boundaries. The cut points act like the usual intercepts, with probabilities interpolated on the logistic scale.

The big difference is that ordinal regression uses a single coefficient vector, which guarantees monotonicity of decision points (that is, if A > B on the ordinal scale, p(y>A|x) <= p(y>B|x)). With Frank and Hall’s method, it’s possible to have p(y=cool|x) > p(y=mild|x) and p(y=hot|x) > p(y=mild|x), which seems to violate the assumed order of the outcomes. Maybe this’d make sense with the right predictors, such as a highly weighted deviation-from-average-temperature predictor.
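Here is a rough sketch of the cut-point form of ordinal logistic regression described above, assuming the common parameterization p(y>k|x) = logit⁻¹(β’x - c_k). The function names, the cut-point values and the predictor values are illustrative assumptions. Because both thresholds share one linear predictor and the cut points are ordered, p(y>cool|x) can never fall below p(y>mild|x).

```python
import math


def inv_logit(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def ordinal_logit_probs(eta, cut_cool_mild, cut_mild_hot):
    """Class probabilities for cool < mild < hot.

    eta is the linear predictor (beta' x); the two cut points play the
    role of intercepts and must be increasing.
    """
    if not cut_cool_mild < cut_mild_hot:
        raise ValueError("cut points must be increasing")
    p_gt_cool = inv_logit(eta - cut_cool_mild)
    p_gt_mild = inv_logit(eta - cut_mild_hot)
    return {
        "cool": 1.0 - p_gt_cool,
        "mild": p_gt_cool - p_gt_mild,
        "hot": p_gt_mild,
    }


for eta in (-3.0, 0.0, 3.0):
    print(eta, ordinal_logit_probs(eta, cut_cool_mild=-1.0, cut_mild_hot=1.0))
```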

August 28, 2009 at 12:04 pm

You are on fire lately :-) Thanks for all the pointers!

Question: Do you have any idea how robust these techniques are to spammers?

For example, in your analysis of the data from Snow et al., https://lingpipe-blog.com/2008/09/15/dolores-labs-text-entailment-data-from-amazon-mechanical-turk/ you identified some annotators as spammers, sitting on the random line of the ROC plot.

Would it make sense to remove the random annotators altogether? What about removing the “almost random” guys? In an environment where we pay per label (e.g., MTurk) the benefit that these “too noisy” annotators bring may not be worth paying for.

But do you find that the labels of such annotators bring *any* benefit while computing the posterior classes? Or do they simply introduce noise in practice?

August 28, 2009 at 1:00 pm

They’re guessing randomly, so it’s pure noise. Given that you can estimate the bias of the noise, you can suppress it. The results I reported were similar with or without the random taggers in the mix.

What Dolores Labs and other folks do is seed the tasks with known gold-standard examples. That lets you find the random taggers more easily and also estimate the good taggers’ accuracy. You could, of course, mix the known cases with estimation via EM or Gibbs sampling pretty easily.
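A minimal sketch of the seeding idea, assuming binary labels and a small set of items with known categories. The function names, the add-one smoothing, the 0.1 tolerance and the toy data are all assumptions for illustration, not anything Dolores Labs has published. An annotator is flagged when their estimated sensitivity and specificity put them near the chance line of the ROC plot, where sensitivity = 1 - specificity.

```python
from collections import defaultdict


def annotator_rates(labels, gold):
    """Estimate (sensitivity, specificity) per annotator from seeded items.

    labels: iterable of (annotator, item, label) with label in {0, 1}
    gold:   dict mapping seeded item -> true category in {0, 1}
    """
    # counts[annotator][true category][label], with add-one smoothing
    counts = defaultdict(lambda: [[1, 1], [1, 1]])
    for annotator, item, label in labels:
        if item in gold:
            counts[annotator][gold[item]][label] += 1
    rates = {}
    for annotator, c in counts.items():
        sensitivity = c[1][1] / (c[1][0] + c[1][1])
        specificity = c[0][0] / (c[0][0] + c[0][1])
        rates[annotator] = (sensitivity, specificity)
    return rates


def looks_random(sensitivity, specificity, tol=0.1):
    """True if the annotator sits near the ROC chance line."""
    return abs(sensitivity + specificity - 1.0) < tol


# Toy data: one annotator who matches the gold labels, one who always says 1.
gold = {item: item % 2 for item in range(20)}
labels = [("careful", item, gold[item]) for item in gold]
labels += [("always_one", item, 1) for item in gold]

rates = annotator_rates(labels, gold)
for name, (sens, spec) in sorted(rates.items()):
    print(name, round(sens, 2), round(spec, 2), looks_random(sens, spec))
```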

August 28, 2009 at 1:47 pm

I was wondering what is the robustness of these techniques to random noise. I have some tasks on MTurk that attract a very significant number of spammers (right now, I have more than 50% spam submissions). So, I was just curious if you had any experience with similar settings. I noticed that the EM technique of Dawid & Skene started having problems with such a level of noise and with 4 possible categories.

Using an existing gold standard works but I somehow find the solution “less elegant” than inferring everything from the unfiltered stream of tags.

August 29, 2009 at 2:23 pm

There was around 50% noise in the data both Dolores Labs and we collected for the NLP tasks. In subsequent tasks, we used a qualifying test, which really reduced the amount of “spam”.

The binary model’s very robust at least up to 50% noise using full Gibbs sampling (including beta priors). I found collapsed Gibbs sampling (only sample categories, as in the collapsed LDA samplers) could run into the same problem as EM, which is driving the likelihood to infinity (it’s a density) by driving variance of some component to zero. If I set the beta priors (as Dawid and Skene and all of the following epidemiologists did), then the model’s very robust under collapsed sampling. The advantage of collapsing is that my Java code runs in split seconds whereas the full sampling scheme in BUGS takes minutes (and won’t scale up).
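To make the role of the priors concrete, here is a small sketch of the binary model fit with EM plus beta pseudo-counts. This is not the Gibbs sampler or the Java code mentioned above. It is a deterministic stand-in, and the pseudo-count values `a` and `b`, the function names and the toy data are all assumptions. Adding `a` and `b` to the expected counts corresponds to a MAP estimate under a Beta(a+1, b+1) prior, and it keeps every sensitivity and specificity strictly between 0 and 1, so no component can collapse.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def em_binary(labels, n_iter=50, a=2.0, b=1.0):
    """labels: list of (item, annotator, label) with label in {0, 1}.

    Returns (q, sens, spec) where q[item] estimates p(category 1).
    """
    items = sorted({i for i, _, _ in labels})
    annotators = sorted({j for _, j, _ in labels})
    # Initialise each item's posterior with its share of 1 votes.
    ones = {i: 0 for i in items}
    total = {i: 0 for i in items}
    for i, _, y in labels:
        ones[i] += y
        total[i] += 1
    q = {i: ones[i] / total[i] for i in items}
    for _ in range(n_iter):
        # M-step: expected counts plus pseudo-counts.
        hit = {j: a for j in annotators}
        pos = {j: a + b for j in annotators}
        rej = {j: a for j in annotators}
        neg = {j: a + b for j in annotators}
        for i, j, y in labels:
            pos[j] += q[i]
            neg[j] += 1.0 - q[i]
            if y == 1:
                hit[j] += q[i]
            else:
                rej[j] += 1.0 - q[i]
        sens = {j: hit[j] / pos[j] for j in annotators}
        spec = {j: rej[j] / neg[j] for j in annotators}
        prev = (1.0 + sum(q.values())) / (2.0 + len(items))
        # E-step: posterior log odds of category 1 for each item.
        log_odds = {i: math.log(prev / (1.0 - prev)) for i in items}
        for i, j, y in labels:
            if y == 1:
                log_odds[i] += math.log(sens[j] / (1.0 - spec[j]))
            else:
                log_odds[i] += math.log((1.0 - sens[j]) / spec[j])
        q = {i: sigmoid(v) for i, v in log_odds.items()}
    return q, sens, spec


# Toy data: 20 items, 4 mostly accurate annotators, and 2 annotators
# whose labels ignore the true category.
labels = []
for i in range(20):
    truth = i % 2
    for k in range(4):
        wrong = (i % 10 == k)
        labels.append((i, f"good{k}", 1 - truth if wrong else truth))
    labels.append((i, "spam0", 1 if i % 4 in (0, 1) else 0))
    labels.append((i, "spam1", 1 if i % 4 in (1, 2) else 0))

q, sens, spec = em_binary(labels)
for j in sorted(sens):
    print(j, round(sens[j], 2), round(spec[j], 2))
```

Setting `a` and `b` close to zero removes most of the smoothing, which is the regime where the estimates are free to run off toward 0 or 1.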

What I couldn’t do was estimate item difficulty parameters with only a handful of annotations per category. It took between 10 and 20 decently accurate annotators to get a handle on item difficulty. The spammers really impact this, because an alternative explanation for an error other than low accuracy is a difficult problem.

What I think is cool is that you can mix in the seeded gold-standard approach by fixing the category of the known items in the model. Or, you can have the gold-standard annotator mixed in as another annotator with very high fixed accuracy (or you could just estimate the gold-standard annotator’s accuracy).

August 31, 2009 at 5:11 pm

Indeed, adding examples with certain classes (or errorless annotators) is a nice touch and can be easily incorporated in the models.

“my Java code runs in split seconds”: Is the code available as part of the lingpipe distribution?

September 1, 2009 at 4:56 pm

It’s available in the sandbox. It only runs for binary problems as written, though it’d be easy enough to extend to multinomials. I discuss the sampler in the blog entry Collapsed Gibbs Sampler for Hier Anno. It contains info on how to get the collapsed Java sampler from the LingPipe sandbox (project “hierAnno”).

The sandbox repository also contains the R/BUGS code for all the other models (warning: my R and BUGS code’s pretty amateurish). Let me know if you need help running anything — the R and BUGS aren’t very well documented or easy to intuit.

October 2, 2009 at 7:30 pm

You may want to take a look at this approach: http://mplab.ucsd.edu/~jake/OptimalLabeling.pdf

Seems similar to what you have been trying. Do you think that your results match theirs?

October 3, 2009 at 12:07 pm

Thanks for the reference. I’ll write up a fuller review next week.

The authors are using a degenerate form of the item-response model with no annotator bias term, essentially tying annotator sensitivity and specificity. The authors use a linear predictor $\alpha_i \beta_j$, where $\alpha_i$ is annotator “expertise” and $\beta_j$ is inverse item difficulty, on a logistic scale, so the probability of a correct annotation is:

$\mathrm{logit}^{-1}(\alpha_i \beta_j)$.

The usual form of this model I’ve seen is

$\mathrm{logit}^{-1}(\delta_i (\theta_j - b_i))$,

where $\delta_i$ is annotator discriminativeness, determining the sharpness of divide (technically, the slope of the sigmoid) between positive and negative and thus high values reduce error. $\theta_j$ is the item “location”, with positive items and negative category items represented by positive and negative values respectively. The remaining term $b_i$ accounts for annotator bias in the sense of favoring positive versus negative items. It essentially models the position of an item to make the annotator flip a coin between positive and negative labels.

In any of the setups, $\alpha_i$, $\delta_i$, or $b_i$ may also be factored into one model for positive items (modeling sensitivity) and one for negative items (modeling specificity).
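For concreteness, here is a small sketch of the two forms discussed above, written as plain functions. The argument names (`expertise`, `inverse_difficulty`, `discrimination`, `location`, `bias`) and the numeric values are my own labels for illustration and may not match the notation in either paper. The first function gives the probability of a correct annotation in the tied model. The second gives the probability of a positive label in the form with an annotator bias term.

```python
import math


def inv_logit(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def p_correct_tied(expertise, inverse_difficulty):
    """Tied model: probability that the annotation is correct."""
    return inv_logit(expertise * inverse_difficulty)


def p_positive_label(discrimination, location, bias):
    """Model with a bias term: probability of a positive label.

    location is positive for positive items and negative for negative
    items; an item located exactly at the annotator's bias is a coin flip.
    """
    return inv_logit(discrimination * (location - bias))


# An annotator with zero expertise guesses at chance in the tied model.
print(p_correct_tied(expertise=0.0, inverse_difficulty=2.0))

# Higher discrimination sharpens the split around the bias point.
for discrimination in (0.5, 2.0, 8.0):
    print(discrimination, p_positive_label(discrimination, location=1.0, bias=0.25))
```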

I discuss fitting these models in my tech report and provide BUGS code in the sandbox project.

November 10, 2009 at 4:04 pm

A quick comment WRT the original post of “What they coded”

We coded: focus, evidence, polarity, certainty and direction

A complete description of what those mean, and their respective values, was given in an earlier paper:

H. Shatkay, F. Pan, A. Rzhetsky and W.J. Wilbur. “Multi-Dimensional Classification of Biomedical Text: Toward Automated, Practical Provision of High-Utility Text to Diverse Users.” Bioinformatics. 24(18). 2008. (pp. 2086-2093). http://bioinformatics.oxfordjournals.org/cgi/reprint/24/18/2086

November 10, 2009 at 4:32 pm

@Hagit: Thanks for the pointer.

I took a look and still don’t see the difference between P0 and N0 (positive with 0 certainty and negative with 0 certainty). There was only a P0 example in the paper. Is it really certainty so much as degree? That is, I can be certain there’s no polarity in an example, as when the example’s hypothetical, as in the example in the paper labeled P0: “Future structural and functional studies will be necessary to understand precisely how She2p binds ASH1 mRNA …”.