Some of the LDC corpora have had better quality control, but not much. When you adjudicate, there’s no reason to believe the third annotator (or the so-called “expert”) is any more consistent or accurate than other annotators.

The real problem is correlation between annotators, or equivalently item-level difficulty. Even if you have 90% agreement on average, it’s clear from real data that the data are overdispersed relative to the independence assumption. Once you account for item difficulty and annotator correlation, the expected number of errors remains pretty high even with three annotators. You see this more clearly when you get 10 annotators, or when you sit down and annotate data yourself — some cases are just very hard.
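To make the overdispersion point concrete, here’s a small simulation comparing majority-vote error rates for three annotators who are independently 90% accurate on every item against annotators whose per-item accuracy varies with item difficulty. The 90% figure comes from the discussion above; the Beta parameters are made-up values chosen purely for illustration:

```python
import random

random.seed(1)

N = 100_000  # simulated items, 3 annotators each, binary labels

def majority_error_rate(acc_for_item):
    """Fraction of items where the 3-annotator majority vote is wrong."""
    errors = 0
    for _ in range(N):
        acc = acc_for_item()  # per-item annotator accuracy
        wrong = sum(random.random() > acc for _ in range(3))
        errors += wrong >= 2  # majority vote is wrong
    return errors / N

# Independence: every annotator is 90% accurate on every item.
# Analytically, P(majority wrong) = 3 * 0.1^2 * 0.9 + 0.1^3 = 0.028.
indep = majority_error_rate(lambda: 0.9)

# Item difficulty: per-item accuracy drawn from a Beta with mean 0.9
# but high variance, so some items are near-coin-flips for everyone.
overdisp = majority_error_rate(lambda: random.betavariate(1.8, 0.2))

print(f"independent:   {indep:.3f}")    # close to 0.028
print(f"overdispersed: {overdisp:.3f}")  # substantially higher
```

The same average accuracy, once concentrated into hard items, leaves majority voting wrong far more often than the independence calculation suggests.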

My own rejected ACL submission drew comments that considering 10 annotators was unrealistic (even though I’d collected the data for about $200 using Amazon Mechanical Turk).

Luckily, most estimation techniques, such as those for CRFs, logistic regression, or naive Bayes, are pretty robust to noise. The effect of noise on evaluation tends to be larger than its effect on estimation.

For natural language processing experiments, say an annotation problem, how many annotators would be enough for your experiments to be considered satisfactory? I was told that for NLP conference papers such as EACL, NAACL or ACL, two annotators would be enough to do experiments. Is this right? If that’s not the case, how many annotators would be needed for such labeling tasks to collect data and apply supervised learning methods such as CRFs or whatnot?

Many thanks for your timely answer.

Best regards,

Kemal.

This is a much stronger result than I got with the other NLP data, which looks much more like this paper’s figure 4, where the advantage of the model tails off as you approach 100% of the given annotations.

I can compute a smoother category posterior directly from the annotator accuracy posteriors and the prevalence (p(category)) posteriors. That’s what I did for the dentistry data (and I believe I took more than 100 samples, but I’d have to go look at the code).
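As a sketch of what that computation looks like, assuming a Dawid-Skene-style model with posterior samples for prevalence and per-annotator accuracy — the sample values below are simulated stand-ins, not fitted posteriors, and the function name is just for this example:

```python
import random

random.seed(0)

# Hypothetical posterior samples: prevalence pi = p(category == 1) and
# per-annotator accuracy theta[j] = p(label correct | annotator j).
# In a real fit these would come from MCMC; here they're simulated.
S = 200  # number of posterior samples
pi_samples = [random.betavariate(20, 30) for _ in range(S)]      # prevalence near 0.4
theta_samples = [[random.betavariate(18, 2) for _ in range(3)]   # 3 annotators, near 0.9
                 for _ in range(S)]

def category_posterior(labels):
    """p(category == 1 | labels), averaged over the posterior samples."""
    total = 0.0
    for pi, theta in zip(pi_samples, theta_samples):
        like1 = pi       # joint probability if the true category is 1
        like0 = 1 - pi   # joint probability if the true category is 0
        for lab, acc in zip(labels, theta):
            like1 *= acc if lab == 1 else 1 - acc
            like0 *= (1 - acc) if lab == 1 else acc
        total += like1 / (like1 + like0)
    return total / S

print(category_posterior([1, 1, 1]))  # near 1: unanimous positives
print(category_posterior([1, 1, 0]))  # high, but smoother than a hard vote
```

Averaging the per-sample category probabilities like this propagates the uncertainty in accuracy and prevalence, which is what makes the resulting posterior smoother than sampled categories.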

The likelihoods will change a lot if a feature with a very high coefficient changes. The idea that you can just keep pushing a feature’s weight up goes against my experience with regularization, which almost always helps posterior predictions (obviously not fit on the training data).

And you’re right that 0.7 on the logistic scale is a much more specific place than 0 or 1, which are super broad on the logistic scale (basically any predictor with absolute value greater than 40 or so yields a probability indistinguishable from 1 or 0 in double-precision arithmetic).
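A quick check of the saturation claim — with the naive logistic function as usually written, the positive side already rounds to exactly 1.0 in doubles by a predictor of about 37, while mid-scale values stay distinguishable:

```python
import math

def sigmoid(x):
    """Naive logistic function, as usually written (fine for positive x)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0.847))         # about 0.70: mid-scale values are distinguishable
print(sigmoid(37.0) == 1.0)   # True: exp(-37) is already below double epsilon
print(sigmoid(400.0) == 1.0)  # True: far out, everything has long since saturated
```

So on the logit scale the region mapping to "exactly 1" (or, symmetrically, to probabilities indistinguishable from 0) is a huge half-line, while 0.7 pins the predictor to a single narrow spot near 0.847.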

But how hard it tries to get that datum there depends on its relative weight in the corpus. It’s the same as regularization which is trying to get every feature to 0. You just have to weight how hard it tries. I think the 0.7 and 0.3 weightings are right.

PS: I also want to point out that I’m thinking log loss on held out data as the target, not 0/1 loss. One reason I care so much is that we’re trying to build very high recall gene mention extraction and database linkage.

I want to elaborate my thoughts on logistic regression a little bit more. If you are trying to learn that something is 0.7, then the values of the weights are sort of being pushed from both sides – you don’t want them too big or too small. Also, because of how the curve looks, the interactions between the weights have to be very precise to actually get it to 0.7. Now, at test time, you get a new datum and it’s going to have some slightly different set of features present. Those features really only have to be slightly different for the likelihood (0.7) to change a lot. Then, take the 0/1 case. Here the weights are only really being pushed from one side – make it true or make it false. While the interactions between the features are certainly still there, because some will appear with both positive and negative examples, my hypothesis is that the interactions between the weights don’t need to be as precise. If you have some feature that is pretty indicative of true, you can just keep pushing its weight upwards (modulo regularization). In your case you are trying to get the likelihood for the datum to a very specific place, while in my case I just keep pushing it further and further in some direction. Then, I think, if the features change slightly there is a lot more wiggle room for me to still get the right classification.

With 0/1 loss, I may be setting myself up for a fall, but for total corpus log loss, logistic regression’s error function is the right metric.

The posterior samples follow the posterior distribution, so they’re only as noisy as your uncertainty. My intuition then runs the opposite way. Which frightens me, because you have a lot more experience with these kinds of models than I do.

My intuition is that training on 0/1 outcome data tries too hard to force decisions about marginal cases one way or another. If there isn’t posterior certainty, then training with 95 cases where an example is positive and 5 where it’s negative (or even 50/50) performs a kind of regularization. For instance, in the 50/50 sampled case, it’s equivalent to extra regularization. The optimal parameters in the 50/50 case for the defined features are all zero, so training pushes the vectors that way. The logistic regression error function is the right one (at the combined corpus level) for fuzzy examples. For instance, it penalizes anything other than a 70% prediction. 0/1 doesn’t penalize any prediction in (0.5,1] and applies a 1 penalty to predictions in [0,0.5].
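A tiny numerical check of that claim: under a 70/30 label split, corpus log loss is uniquely minimized by predicting exactly 0.7, while expected 0/1 loss can’t tell any prediction above 0.5 from any other. The grid search and function names below are just for illustration:

```python
import math

def weighted_log_loss(p, pos_weight=0.7):
    """Expected log loss when 70% of copies of a datum are labeled 1, 30% labeled 0."""
    return -(pos_weight * math.log(p) + (1 - pos_weight) * math.log(1 - p))

def expected_zero_one(p, pos_weight=0.7):
    """Expected 0/1 loss under the same 70/30 split: flat on each side of 0.5."""
    return pos_weight * (p <= 0.5) + (1 - pos_weight) * (p > 0.5)

# Grid-search the prediction that minimizes corpus log loss.
grid = [i / 1000 for i in range(1, 1000)]
best = min(grid, key=weighted_log_loss)
print(best)  # 0.7: log loss is minimized at the true rate

# 0/1 loss gives no gradient toward 0.7: 0.51 and 0.99 look identical.
print(expected_zero_one(0.51) == expected_zero_one(0.99))  # True
```

This is the calibration property of log loss: the minimizer of the expected loss is the true probability, which is exactly why it’s the right corpus-level objective for fuzzy examples.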

Of course, most “gold standards” fudge this problem by censoring uncertain training data.

In the Snow et al. recreate-the-linguistic-gold-standard task, averaging over the posterior samples did help a little bit. But that’s different in that it’s averaging over annotator accuracy estimates, not over the samples themselves. With more posterior samples, I could’ve just sampled the categories and gotten the same result.

You could swap out SVMs or other models — anything that can handle non-separable training data.

I know these reasons are not that concrete – like I said, I basically tried this and failed and then never thought about it again until your talk. But I wonder if it has something to do with the shape of the sigmoid. You can try to train it to learn that something is 0.7, but the curve is relatively steep there, so slight changes at test time will result in much larger changes in the likelihood in the middle portion than at the extremities. I think if you want to specifically model that something is 70% probable (to be some particular label) then maybe you want to change the objective function entirely. And of course I am assuming here that you are training a log-linear model, but that seems like a reasonable assumption.
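For what it’s worth, that sensitivity claim can be checked directly from the sigmoid’s derivative, d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)), which peaks mid-scale and collapses at the extremes. The operating points below are just illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def slope(x):
    """Derivative of the sigmoid: sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1 - s)

# Sensitivity of the predicted probability to a small change in the
# linear predictor, at a few operating points on the curve.
for p_target, x in [(0.50, 0.0), (0.70, 0.847), (0.95, 2.944), (0.99, 4.595)]:
    print(f"p = {p_target:.2f}  slope = {slope(x):.3f}")
```

The slope at p = 0.7 is about 0.21, close to the 0.25 maximum at p = 0.5, so a small shift in the features moves the likelihood a lot there; out near p = 0.99 the slope is roughly 0.01, which is the wiggle room the 0/1 setup enjoys.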

(ps – it was nice to actually meet you!)
