I took a look and still don’t see the difference between P0 and N0 (positive with 0 certainty and negative with 0 certainty). There was only a P0 example in the paper. Is it really certainty so much as degree? That is, I can be certain there’s no polarity in an example, as when the example’s hypothetical, as in the example in the paper labeled P0: “Future structural and functional studies will be necessary to understand precisely how She2p binds ASH1 mRNA …”.

We coded five dimensions: focus, evidence, polarity, certainty, and direction.

A complete description of what those mean, and their respective values, was given in an earlier paper:

H. Shatkay, F. Pan, A. Rzhetsky and W.J. Wilbur

“Multi-Dimensional Classification of Biomedical Text: Toward Automated, Practical Provision of High-Utility Text to Diverse Users.”

Bioinformatics. 24(18). 2008. (pp. 2086-2093).

http://bioinformatics.oxfordjournals.org/cgi/reprint/24/18/2086

The authors are using a degenerate form of the item-response model with no annotator bias term, essentially tying annotator sensitivity and specificity. The authors use a linear predictor $\alpha_i \beta_j$, where $\alpha_i$ is annotator “expertise” and $\beta_j$ is inverse item difficulty, on a logistic scale, so the probability of a correct annotation is:

$\mathrm{logit}^{-1}(\alpha_i \beta_j)$.

The usual form of this model I’ve seen is

$\mathrm{logit}^{-1}(\delta_i (\theta_j - \gamma_i))$,

where $\delta_i$ is annotator discriminativeness, determining the sharpness of the divide (technically, the slope of the sigmoid) between positive and negative, so that high values reduce error. $\theta_j$ is the item “location”, with positive and negative category items represented by positive and negative values respectively. The remaining term $\gamma_i$ accounts for annotator bias in the sense of favoring positive versus negative items. It essentially models the position an item must have to make the annotator flip a coin between positive and negative labels.

In any of the setups, the annotator parameters may also be factored into one model for positive items (modeling sensitivity) and one for negative items (modeling specificity).
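To make the two forms concrete, here is a minimal Python sketch. It assumes the first predictor is the product of annotator expertise and inverse item difficulty, and that the second is discriminativeness times the gap between item location and annotator bias. The function and argument names are made up for illustration and are not taken from the paper or the tech report.

```python
import math


def inv_logit(x):
    """Map a value on the logistic scale to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def prob_correct(expertise, inv_difficulty):
    # Assumed degenerate form: no bias term, so one number per annotator
    # drives accuracy on positive and negative items alike.
    return inv_logit(expertise * inv_difficulty)


def prob_positive_label(discrimination, item_location, annotator_bias):
    # Assumed usual form: the slope is the annotator's discriminativeness;
    # the annotator flips a coin when the item sits exactly at their bias.
    return inv_logit(discrimination * (item_location - annotator_bias))


if __name__ == "__main__":
    print(prob_correct(2.0, 1.5))
    print(prob_positive_label(3.0, 0.5, 0.5))
```

With zero expertise the first form gives a coin flip on every item, and in the second form a larger discriminativeness pushes the probability further from 0.5 for any item away from the annotator's bias point.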

I discuss fitting these models in my tech report and provide BUGS code in the sandbox project.

Seems similar to what you have been trying. Do you think that your results match theirs?

The sandbox repository also contains the R/BUGS code for all the other models (warning: my R and BUGS code’s pretty amateurish). Let me know if you need help running anything; the R and BUGS code isn’t very well documented or easy to intuit.

“my Java code runs in split seconds”: Is the code available as part of the lingpipe distribution?

The binary model’s very robust at least up to 50% noise using full Gibbs sampling (including beta priors). I found collapsed Gibbs sampling (only sampling categories, as in the collapsed LDA samplers) could run into the same problem as EM, which is driving the likelihood to infinity (it’s a density) by driving the variance of some component to zero. If I set the beta priors (as Dawid and Skene and all of the epidemiologists who followed them did), then the model’s very robust under collapsed sampling. The advantage of collapsing is that my Java code runs in split seconds whereas the full sampling scheme in BUGS takes minutes (and won’t scale up).
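For readers who want to see the shape of the full sampling scheme, here is a rough Python sketch of a Gibbs sampler for a binary model with beta priors on prevalence, sensitivity and specificity. It is not the Java or BUGS code mentioned above. The function name, the input format, and the default priors (including a Beta(2, 1) prior that mildly favors accurate annotators) are all assumptions made for illustration.

```python
import math
import random


def _sigmoid(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def _clamp(p, eps=1e-9):
    """Keep sampled probabilities away from 0 and 1 so logs stay finite."""
    return min(max(p, eps), 1.0 - eps)


def gibbs_binary(labels, n_items, n_annotators, n_iter=500, burn_in=100,
                 prev_prior=(1.0, 1.0), acc_prior=(2.0, 1.0), seed=0):
    """labels: iterable of (item, annotator, label) with label in {0, 1}.

    Returns the estimated posterior probability that each item is positive.
    """
    rng = random.Random(seed)
    by_item = [[] for _ in range(n_items)]
    for item, annotator, label in labels:
        by_item[item].append((annotator, label))

    # Initialize categories by majority vote (ties go to positive).
    z = [1 if 2 * sum(y for _, y in obs) >= len(obs) else 0 for obs in by_item]
    post_sum = [0.0] * n_items

    for it in range(n_iter):
        # Sample prevalence given the current categories.
        n_pos = sum(z)
        prev = _clamp(rng.betavariate(prev_prior[0] + n_pos,
                                      prev_prior[1] + n_items - n_pos))

        # Count [TP, FN, TN, FP] per annotator under the current categories.
        counts = [[0, 0, 0, 0] for _ in range(n_annotators)]
        for i, obs in enumerate(by_item):
            for j, y in obs:
                if z[i] == 1:
                    counts[j][0 if y == 1 else 1] += 1
                else:
                    counts[j][2 if y == 0 else 3] += 1

        # Sample each annotator's sensitivity and specificity.
        a, b = acc_prior
        sens = [_clamp(rng.betavariate(a + c[0], b + c[1])) for c in counts]
        spec = [_clamp(rng.betavariate(a + c[2], b + c[3])) for c in counts]

        # Sample each item's category given all the parameters.
        for i, obs in enumerate(by_item):
            log_odds = math.log(prev) - math.log(1.0 - prev)
            for j, y in obs:
                if y == 1:
                    log_odds += math.log(sens[j]) - math.log(1.0 - spec[j])
                else:
                    log_odds += math.log(1.0 - sens[j]) - math.log(spec[j])
            p_pos = _sigmoid(log_odds)
            z[i] = 1 if rng.random() < p_pos else 0
            if it >= burn_in:
                post_sum[i] += p_pos

    return [s / (n_iter - burn_in) for s in post_sum]
```

A collapsed sampler would instead integrate out prevalence, sensitivity and specificity and resample only the categories from the running counts, which is where the speedup comes from.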

What I couldn’t do was estimate item difficulty parameters with only a handful of annotations per category. It took between 10 and 20 decently accurate annotators to get a handle on item difficulty. The spammers really hurt here, because a difficult item is an alternative explanation for an error besides low annotator accuracy.

What I think is cool is that you can mix in the seeded gold-standard approach by fixing the category of the known items in the model. Or, you can have the gold-standard annotator mixed in as another annotator with very high fixed accuracy (or you could just estimate the gold-standard annotator’s accuracy).

Using an existing gold standard works, but I somehow find the solution “less elegant” than inferring everything from the unfiltered stream of tags.

What Dolores Labs and other folks do is seed the tasks with known gold-standard examples. That lets you find the random taggers more easily and also estimate the good taggers’ accuracy. You could, of course, mix the known cases with estimation via EM or Gibbs sampling pretty easily.
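As a small illustration of the seeding idea, the sketch below scores each tagger only on the seeded items and smooths the estimate with a beta prior. This is just one way to do it, not a description of what Dolores Labs actually runs. The function name, the input format, and the default Beta(1, 1) prior are assumptions.

```python
from collections import defaultdict


def accuracy_from_gold(labels, gold, prior=(1.0, 1.0)):
    """Posterior mean accuracy per annotator, using only seeded items.

    labels: iterable of (item, annotator, label)
    gold:   dict mapping seeded item -> known category
    prior:  (a, b) parameters of a beta prior on accuracy (assumed)
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, annotator, label in labels:
        if item in gold:  # items without a known category are ignored
            total[annotator] += 1
            correct[annotator] += int(label == gold[item])
    a, b = prior
    return {ann: (a + correct[ann]) / (a + b + total[ann]) for ann in total}


if __name__ == "__main__":
    labels = [(0, "a", 1), (1, "a", 0), (0, "b", 0), (1, "b", 0), (2, "a", 1)]
    gold = {0: 1, 1: 0}
    print(accuracy_from_gold(labels, gold))
```

Taggers whose estimate stays near 0.5 on a binary task after enough seeded items are the likely random ones.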

Question: Do you have any idea how robust these techniques are to spammers?

For example, in your analysis of the data from Snow et al. (https://lingpipe-blog.com/2008/09/15/dolores-labs-text-entailment-data-from-amazon-mechanical-turk/), you identified some annotators as spammers, sitting on the random line of the ROC plot.

Would it make sense to remove the random annotators altogether? What about removing the “almost random” guys? In an environment where we pay per label (e.g., MTurk) the benefit that these “too noisy” annotators bring may not be worth paying for.

But do you find that the labels of such annotators bring *any* benefit while computing the posterior classes? Or do they simply introduce noise in practice?
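One way to make “random” and “almost random” operational is to place each annotator on the ROC plot and measure the distance from the diagonal. The Python sketch below does that against reference categories, which could be gold labels or inferred ones. The function names and the 0.1 tolerance are arbitrary choices for illustration, not values from the analysis above.

```python
def roc_point(annotations, reference):
    """Return (false positive rate, true positive rate) for one annotator.

    annotations: iterable of (item, label) with label in {0, 1}
    reference:   dict mapping item -> reference category in {0, 1}
    Returns None when the annotator saw no positive or no negative items.
    """
    tp = fp = pos = neg = 0
    for item, label in annotations:
        if item not in reference:
            continue
        if reference[item] == 1:
            pos += 1
            tp += int(label == 1)
        else:
            neg += 1
            fp += int(label == 1)
    if pos == 0 or neg == 0:
        return None
    return fp / neg, tp / pos


def near_random(point, tol=0.1):
    """True if the point lies within tol of the diagonal TPR = FPR."""
    fpr, tpr = point
    return abs(tpr - fpr) < tol


if __name__ == "__main__":
    reference = {0: 1, 1: 1, 2: 0, 3: 0}
    always_yes = [(0, 1), (1, 1), (2, 1), (3, 1)]
    print(roc_point(always_yes, reference), near_random(roc_point(always_yes, reference)))
```

An annotator who always answers the same way lands at a corner of the diagonal, so the same check catches them along with the coin flippers.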