There’s a pointer to a very important EMNLP paper by Dan Jurafsky, Andrew Ng and Rion Snow evaluating five NLP tasks with annotations done by non-experts through Amazon’s Mechanical Turk (aka AMT). What I can’t believe is how cheap it is: they’re getting 3000 multiple-choice annotations for US$2 (yes, that’s two bucks).

I’m also thinking that a collapsed sampler, which samples only the category assignments, then computes sensitivity and specificity from those and the alpha and beta priors by moment matching, would do this whole thing almost instantly in a direct implementation.
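The moment-matching step could be sketched like this (a hypothetical helper, not the author’s implementation): given per-annotator sensitivity (or specificity) estimates from one sweep of the collapsed sampler, fit a Beta(alpha, beta) prior by matching the sample mean and variance.

```python
import statistics

def beta_moment_match(thetas):
    """Fit Beta(alpha, beta) to samples by matching mean and variance."""
    m = statistics.mean(thetas)
    v = statistics.variance(thetas)
    # For Beta(a, b): mean = a/(a+b), var = mean*(1-mean)/(a+b+1),
    # so (a + b) = mean*(1-mean)/var - 1.
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common

# e.g., sensitivities for five annotators from one sweep of the sampler
alpha, beta = beta_moment_match([0.85, 0.90, 0.78, 0.92, 0.88])
```

The fitted prior then feeds back into the next sweep, so the whole loop stays conjugate and cheap.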

One way the epidemiologists have suggested modeling the dependence between sensitivity and specificity is to use a 4-parameter Dirichlet prior over the rates FPR, TPR, FNR, TNR, which must sum to 1.0. But that only weakly relates TPR and TNR.
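As I read that parameterization (an assumption on my part about the exact form), the Dirichlet sits on the joint probabilities of the four confusion-matrix cells, and sensitivity and specificity are read off the conditionals:

```python
import random

def sample_dirichlet(alphas, rng=random.Random(0)):
    """Draw one sample from a Dirichlet via normalized gammas."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

# Dirichlet counts for the (TP, FN, TN, FP) cells; values are illustrative.
tp, fn, tn, fp = sample_dirichlet([40.0, 5.0, 35.0, 10.0])
sensitivity = tp / (tp + fn)  # P(label positive | truly positive)
specificity = tn / (tn + fp)  # P(label negative | truly negative)
```

The coupling is weak because sensitivity and specificity only interact through the cells competing for total probability mass, not through any shared “quality” parameter.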

I’m guessing you’re thinking about a kind of bias dimension, in which some annotators are more prone than others to classify items as category 1. You see some of this in the dentistry data. You also see cases where one annotator has both higher sensitivity and higher specificity than another annotator.

The ideal-point model of voting (like the item-response model for tests) is a natural way to look at the kind of systematic correlation between TPR and TNR you’re talking about. It models outcomes on the logit (or probit) scale and may include a bias term as an intercept varying by annotator.
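One way that structure could be written down (a sketch of the idea, not a fitted model): annotator j’s chance of a positive label is inv_logit(bias_j + disc_j * z), where z is +1 for truly positive items and -1 for truly negative ones. The shared discrimination parameter ties TPR and TNR together, while the intercept captures the bias dimension.

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def annotator_rates(disc, bias):
    """Map an annotator's discrimination and bias to (TPR, TNR)."""
    tpr = inv_logit(bias + disc)        # P(label=1 | truth=1)
    tnr = 1.0 - inv_logit(bias - disc)  # P(label=0 | truth=0)
    return tpr, tnr

# Raising bias trades TNR for TPR; raising discrimination raises both.
tpr_a, tnr_a = annotator_rates(disc=2.0, bias=0.5)
tpr_b, tnr_b = annotator_rates(disc=2.0, bias=-0.5)
```

This reproduces both phenomena from the dentistry data: annotators with the same discrimination but different biases sit at different TPR/TNR tradeoffs, while an annotator with higher discrimination dominates on both rates at once.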

This is a very nice model. How fast was it?

One technicality: the current model draws “true positive rate” and “true negative rate” independently of each other. In many cases there is a natural tradeoff between the two, so they will not be independent.

Would it be possible to alter the model so that it first draws the “quality” of the annotator (e.g., as the AUC), and then picks the tradeoff between “true positive rate” and “true negative rate”?
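One way such a reparameterization could look (a sketch under an equal-variance signal-detection assumption, not the model being discussed): draw the annotator’s quality as an AUC, convert it to a discriminability d′ via AUC = Φ(d′/√2), then let a threshold c pick the point on that ROC curve, i.e., the tradeoff between the two rates.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def rates_from_auc(auc, c):
    """Map (quality, tradeoff) to (sensitivity, specificity)."""
    d_prime = math.sqrt(2.0) * STD_NORMAL.inv_cdf(auc)
    sensitivity = 1.0 - STD_NORMAL.cdf(c - d_prime)  # TPR
    specificity = STD_NORMAL.cdf(c)                  # TNR
    return sensitivity, specificity

# Same quality (AUC = 0.9), two different tradeoffs:
liberal = rates_from_auc(0.9, c=0.3)       # favors sensitivity
conservative = rates_from_auc(0.9, c=1.5)  # favors specificity
```

Putting the prior on (AUC, c) instead of on the two rates directly makes the quality/tradeoff decomposition explicit, which is exactly the independence problem raised above.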

Here’s an overview … http://www.statmethods.net/advgraphs/axes.html (The Quick-R site is really nice in general.)
