Archive for the ‘Data Annotation’ Category

Ethics, logistic regression, and 0-1 loss

December 27, 2016

Andrew Gelman and David Madigan wrote a paper on why 0-1 loss is so problematic:

This is related to the issue of whether one should be training on an artificial gold standard. Suppose we have a bunch of annotators and we don’t have perfect agreement on items. What do we do? Well, in practice, machine learning evals tend to either (1) throw away the examples without agreement (e.g., the RTE evals, some BioCreative named entity evals, etc.), or (2) go with the majority label (everything else I know of). Either way, we are throwing away a huge amount of information by reducing the label to artificial certainty. You can see this pretty easily with simulations, and Raykar et al. showed it with real data.

Yet 0/1 corpora and evaluation remain the gold standard (pun intended) in most machine learning evals. Kaggle has gone largely to log loss, but even that’s very flat around the middle of the range, as Andrew discusses in this blog post:

The problem is that it’s very hard to train a model that’s well calibrated if you reduce the data to an artificial gold standard. If you don’t know what I mean by calibration, check out this paper:

It’s one of my favorites. Once you’ve understood it, it’s hard not to think of evaluating models in terms of calibration and sharpness.

Fokkens et al. on Replication Failure

September 25, 2013

After ACL, Becky Passonneau said I should read this:

You should, too. Or if you’d rather watch, there’s a video of their ACL presentation.

The Gist

The gist of the paper is that the authors tried to reproduce some results found in the literature for word sense identification and named-entity detection, and had a rather rough go of it. In particular, they found that every little decision they made impacted the evaluation error they got, including how to evaluate the error. As the narrative of their paper unfolded, all I could do was nod along. I loved that their paper was in narrative form — it made it very easy to follow.

The authors stress that we need more information for reproducibility and that reproducing known results should get more respect. No quibble there. A secondary motivation (after the pedagogical one) of the LingPipe tutorials was to show that we could reproduce existing results in the literature.

The Minor Fall

Although they’re riding one of my favorite hobby horses, they don’t quite lead it in the direction I would have. In fact, they stray into serious discussion of the point estimates of performance that their experiments yielded, despite the variation they see under cross-validation folds. So while they note that the rankings of approaches change based on what seem like minor details, they stop short of calling into question the whole idea of trying to assign a simple ranking. Instead, they stress getting to the bottom of why the rankings differ.

A Slightly Different Take on the Issue

What I would’ve liked to have seen in this paper is more emphasis on two key statistical concepts:

  1. the underlying sample variation problem, and
  2. the overfitting problem with evaluations.

Sample Variation

The authors talk about sample variation a bit when they consider different folds, etc., in cross-validation. For reasons I don’t understand they call it “experimental setup.” But they don’t put it in terms of estimation variance. That is, independently of getting the features to all line up, the performance and hence rankings of various approaches vary to a large extent because of sample variation. The easiest way to see this is to run cross-validation.

I would have stressed that the bootstrap method should be used to get at sample variation. The bootstrap differs from cross-validation in that it samples training and test subsets with replacement. The bootstrap effectively evaluates within-sample sampling variation, even in the face of non-i.i.d. samples (usually there is correlation among the items in a corpus due to choosing multiple sentences from a single document or even multiple words from the same sentence). The picture is even worse in that the data set at hand is rarely truly representative of the intended out-of-sample application, which is typically to new text. The authors touch on these issues with discussions of “robustness.”
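To make the sample-variation point concrete, here's a minimal sketch in R of bootstrapping an evaluation score. Everything in it is an assumption for illustration: the per-item outcomes are simulated, and the names `correct` and `boot_acc` are made up. It resamples test items only and treats them as independent; with correlated items you'd presumably want to resample larger units such as whole documents instead.

```r
set.seed(42)
# Simulated per-item outcomes for one system on a 500-item test set
# (1 = system got the item right, 0 = wrong); illustrative data only.
n_items = 500
correct = rbinom(n_items, 1, 0.85)

# Resample the test items with replacement and recompute accuracy each time.
n_boot = 2000
boot_acc = replicate(n_boot, mean(sample(correct, n_items, replace = TRUE)))

point_estimate = mean(correct)
interval = quantile(boot_acc, c(0.025, 0.975))  # percentile interval
print(point_estimate)
print(interval)
```

If the interval is wider than the gain being claimed over a baseline, the point estimates aren't telling you much about the ranking.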


Overfitting

The authors don’t really address the overfitting problem, though it is tangential to what they call the “versioning” problem (when mismatched versions of corpora such as WordNet produce different results).

Overfitting is a huge problem in the current approach to evaluation in NLP and machine learning research. I don’t know why anyone cares about the umpteenth evaluation on the same fold of the Penn Treebank that produces a fraction of a percentage gain. When the variation across folds is higher than your estimated gain on a single fold, it’s time to pack the papers in the file drawer, not in the proceedings of ACL. At least until we also get a journal of negative results to go with our journal of reproducibility.

The Major Lift

Perhaps hill-climbing error on existing data sets is not the best way forward methodologically. Maybe, just maybe, we can make the data better. It’s certainly what Breck and I tell customers who want to build an application.

If making the data better sounds appealing, check out

I find the two ideas so compelling that it’s the only area of NLP I’m actively researching. More on that later (or sooner if you want to read ahead).


(My apologies to Leonard Cohen for stealing his lyrics.)

VerbCorner: Another Game with Words

July 4, 2013

Games with Words continues to roll out new games, releasing VerbCorner.

VerbCorner contains a series of sub-games with titles like “Philosophical Zombie Hunter” and questions that sound like they’re straight out of an SAT reading comprehension test, starting with a long story I won’t repeat and then questions like:

Michelle smashed the gauble into quaps. Which of the following is *definitely* a real human?

  1. Michelle
  2. None of the above
  3. Can’t tell because in this context ‘smash’ has more than one meaning.
  4. Can’t tell because the sentence is ungrammatical.
  5. Can’t tell because I don’t know that verb.

Josh Hartshorne, one of the developers, writes that VerbCorner is “a new project I’m doing in collaboration with Martha Palmer at CU-Boulder, which is aimed at getting human judgments to inform the semantics assigned to verbs in VerbNet. You can find a press release about the project … You’ll see we’ve incorporated a lot of gamification elements in order to make the task more engaging (and also to improve annotation quality).”

This note from the press release harkens back to my philosophical semantics days, “If we do not want to define words in terms of other words, what should they be defined in terms of? This remains an open research question…”. I’ll say! The project has a generative semantics feel; in the press release, Josh is quoted as saying “There are a few dozen core components of meaning and there are tens of thousands of verbs in English.” Now I’m left hanging as to what a “core component of meaning” is!

Another Linguistic Corpus Collection Game

November 12, 2012

Johan Bos and his crew at the University of Groningen have a new suite of games aimed at linguistic data collection. You can find them at:

Wordrobe is currently hosting four games. Twins is aimed at part-of-speech tagging, Senses is for word sense annotation, Pointers for coref data, and Names for proper name classification.

One of the neat things about Wordrobe is that they try to elicit some notion of confidence by allowing users to “bet” on their answers.

They also discuss prizes, but I didn’t see any mention of what the prizes were.

The project is aimed at improving the Groningen Meaning Bank. I hope they release the raw user data as well as their best guess at a gold standard. I had some background discussion with Johan about annotation models, but they’re going to go with something relatively simple, which means there’s an opportunity to compare richer statistical models like the ones I’ve cited in the Data Annotation category of this blog.

Other Linguistic Games

The first linguistic game of which I was aware was von Ahn’s reCAPTCHA. Although aimed at capturing OCR annotations as a side effect, it is more of a security wall aimed at filtering out bots than a game. Arguably, I’ve been played by it more than the other way around.

A more linguistically relevant game is Poesio et al.’s Phrase Detectives, which is aimed at eliciting coreference annotations. I played through several rounds of every aspect of it. The game interface itself is very nice for a web app. Phrase Detectives occasionally has cash prizes, but it looks like they ran out of prize money because the last reference to prizes was July 2011.

Are they Really Games?

Phrase Detectives is more like an Amazon Mechanical Turk task with a backstory and leaderboard. I didn’t create a login for Wordrobe to try it, but I’m going out on a limb to guess it’s going to be similar given the descriptions of the games.

High Kappa Values are not Necessary for High Quality Corpora

October 2, 2012

I’m not a big fan of kappa statistics, to say the least. I point out several problems with kappa statistics right after the initial studies in this talk on annotation modeling.

I just got back from another talk on annotation where I was ranting again about the uselessness of kappa. In particular, this blog post is an attempt to demonstrate why a high kappa is not necessary. The whole point of building annotation models a la Dawid and Skene (as applied by Snow et al. in their EMNLP paper on gathering NLP data with Mechanical Turk) is that you can create a high-reliability corpus without even having high accuracy, much less acceptable kappa values; it’s the same kind of result as using boosting to combine multiple weak learners into a strong learner.

So I came up with some R code to demonstrate why a high kappa is not necessary without even bothering with generative annotation models. Specifically, I’ll show how you can wind up with a high-quality corpus even in the face of low kappa scores.

The key point is that annotator accuracy fully determines the accuracy of the resulting entries in the corpus. Chance adjustment has nothing at all to do with corpus accuracy. That’s what I mean when I say that kappa is not predictive. If I only know the annotator accuracies, I can tell you the expected accuracy of entries in the corpus, but if I only know kappa, I can’t tell you anything about the accuracy of the corpus (other than that all else being equal, higher kappa is better; but that’s also true of agreement, so kappa’s not adding anything).

First, the pretty picture (the colors are in honor of my hometown baseball team, the Detroit Tigers, clinching a playoff position).

[Figure: Kappa for varying prevalences and accuracies]

What you’re looking at is a plot of the kappa value vs. annotator accuracy and category prevalence in a binary classification problem. (It’s only the upper-right corner of a larger diagram that would let both accuracy and prevalence run from 0 to 1. Here’s the whole plot for comparison.

[Figure: Kappa for varying prevalences and accuracies]

Note that the results are symmetric in both accuracy and prevalence, because very low accuracy leads to good agreement in the same way that very high accuracy does.)

How did I calculate the values? First, I assumed accuracy was the same for both positive and negative categories (usually not the case — most annotators are biased). Prevalence is defined as the fraction of items belonging to category 1 (usually the “positive” category).

Everything else follows from the definitions of kappa, to result in the following definition in R to compute expected kappa from binary classification data with a given prevalence of category 1 answers and a pair of annotators with the same accuracies.

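# Expected kappa for two annotators with the same accuracy (acc)
# on binary data where category 1 has prevalence prev.
kappa_fun = function(prev,acc) {
  agr = acc^2 + (1 - acc)^2;                   # both right or both wrong
  cat1 = acc * prev + (1 - acc) * (1 - prev);  # rate of category 1 responses
  e_agr = cat1^2 + (1 - cat1)^2;               # chance agreement
  return((agr - e_agr) / (1 - e_agr));
}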

Just as an example, let’s look at prevalence = 0.2 and accuracy = 0.9 with say 1000 examples. The expected contingency table would be

        Cat1   Cat2
Cat1     170     90
Cat2      90    650

and the kappa coefficient would be 0.53, below anyone’s notion of “acceptable”.
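For anyone who wants to check the arithmetic, here's a small sketch that rebuilds the expected table and the kappa value from the prevalence and accuracy alone. The variable names are mine, and it assumes (as above) two annotators with the same accuracy on both categories who err independently.

```r
prev = 0.2   # prevalence of category 1
acc = 0.9    # accuracy shared by both annotators
n = 1000     # number of items

# Probability that both annotators respond category 1 / category 2.
both1 = prev * acc^2 + (1 - prev) * (1 - acc)^2
both2 = prev * (1 - acc)^2 + (1 - prev) * acc^2
# Each off-diagonal cell: one annotator is right and the other is wrong.
split = acc * (1 - acc)

tab = n * matrix(c(both1, split, split, both2), nrow = 2, byrow = TRUE,
                 dimnames = list(c("Cat1", "Cat2"), c("Cat1", "Cat2")))
print(tab)

agr = sum(diag(tab)) / n         # observed agreement
cat1 = sum(tab[1, ]) / n         # marginal rate of Cat1 responses
e_agr = cat1^2 + (1 - cat1)^2    # chance agreement
kappa_val = (agr - e_agr) / (1 - e_agr)
print(round(kappa_val, 2))
```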

The chance of actual agreement is the accuracy squared (both annotators are correct and hence agree) plus the square of one minus the accuracy (both annotators are wrong and hence agree; two wrongs make a right for kappa, another of its problems).

The proportion of category 1 responses (say positive responses) is the accuracy times the prevalence (true category is positive, correct response) plus one minus accuracy times one minus prevalence (true category is negative, wrong response).

Next, I calculate expected agreement a la Cohen’s kappa (which is the same as Scott’s pi in this case because the annotators have identical behavior and hence everything’s symmetric), which is just the resulting agreement from voting according to the prevalences. So that’s just the probability of category 1 squared (both annotators respond category 1) and the probability of a category 2 response (1 minus the probability of a category 1 response) squared.

Finally, I return the kappa value itself, which is defined as usual.

Back to the plot. The white border is set at .66, the lower-end threshold established by Krippendorff for somewhat acceptable kappas; the higher-end threshold of acceptable kappas set by Krippendorff was 0.8, and is also indicated on the legend.

In my own experience, there are almost no 90% accurate annotators for natural language data. It’s just too messy. But you need well more than 90% accuracy to get into acceptable kappa range on a binary classification problem. Especially if prevalence is high, because as prevalence goes up, kappa goes down.
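Here's a rough sketch of how you could back out the accuracy needed to hit a given kappa threshold at a given prevalence. It repeats the expected-kappa function so the block stands alone; `acc_needed` is a helper name I made up, and it just root-finds with R's `uniroot` over accuracies between 0.5 and 1.

```r
kappa_fun = function(prev, acc) {
  agr = acc^2 + (1 - acc)^2
  cat1 = acc * prev + (1 - acc) * (1 - prev)
  e_agr = cat1^2 + (1 - cat1)^2
  (agr - e_agr) / (1 - e_agr)
}

# Accuracy in (0.5, 1) at which expected kappa equals the threshold.
acc_needed = function(prev, threshold) {
  uniroot(function(acc) kappa_fun(prev, acc) - threshold,
          lower = 0.501, upper = 0.999, tol = 1e-9)$root
}

for (prev in c(0.5, 0.7, 0.9)) {
  cat(sprintf("prevalence %.1f: accuracy %.3f for kappa 0.66, %.3f for kappa 0.80\n",
              prev, acc_needed(prev, 0.66), acc_needed(prev, 0.80)))
}
```

At prevalence 0.5 the expected kappa works out to (2 * acc - 1)^2, so even in the balanced case you need accuracy a bit over 0.9 to reach 0.66.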

I hope this demonstrates why having a high kappa is not necessary.

I should add that Ron Artstein asked me after my talk what I thought would be a good thing to present if not kappa. I said basic agreement is more informative than kappa about how good the final corpus is going to be, but I want to go one step further and suggest you just inspect a contingency table. It’ll tell you not only what the agreement is, but also what each annotator’s bias is relative to the other (evidenced by asymmetric contingency tables).

In case anyone’s interested, here’s the R code I then used to generate the fancy plot:

library(ggplot2);

# evaluate expected kappa on a (K + 1) x (K + 1) grid over [0,1] x [0,1]
pos = 1;
K = 200;
prevalence = rep(NA,(K + 1)^2);
accuracy = rep(NA,(K + 1)^2);
kappa = rep(NA,(K + 1)^2);
for (m in 1:(K + 1)) {
  for (n in 1:(K + 1)) {
    prevalence[pos] = (m - 1) / K;
    accuracy[pos] = (n - 1) / K;
    kappa[pos] = kappa_fun(prevalence[pos],accuracy[pos]);
    pos = pos + 1;
  }
}
df = data.frame(prevalence=prevalence,
                accuracy=accuracy,
                kappa=kappa);
# axis limits zoom in on the upper-right corner of the grid
kappa_plot = 
  ggplot(df, aes(prevalence,accuracy,fill = kappa)) +
     labs(title = "Kappas for Binary Classification\n") +
     geom_tile() +
     scale_x_continuous(
                        limits =c(0.5,1)) +
     scale_y_continuous(
                        limits=c(0.85,1)) +
     scale_fill_gradient2("kappa", limits=c(0,1), midpoint=0.66,
                          low="orange", mid="white", high="blue");

Interannotator Agreement for Chunking Tasks Like Named Entities and Phrases

May 18, 2012

From the Emailbox

Krishna writes,

I have a question about using the chunking evaluation class for inter annotation agreement : how can you use it when the annotators might have missing chunks I.e., if one of the files contains more chunks than the other.

The answer’s not immediately obvious because the usual application of interannotator agreement statistics is to classification tasks (including things like part-of-speech tagging) that have a fixed number of items being annotated.

Chunker Evaluation

The chunker evaluations built into LingPipe calculate the usual precision and recall measures (see below). These evaluations compare a set of response chunkings to a set of reference chunkings. Usually the reference is drawn from a gold-standard corpus and the response from an automated system built to do chunking.

Precision (aka positive predictive accuracy) measures the proportion of chunks in the response that are also in the reference. Recall (aka sensitivity) measures the proportion of chunks in the reference that are in the response. If we swap the reference and response chunkings, we swap precision and recall.

True negatives aren’t really being counted here — theoretically there are a huge number of them — any possible span with any possible tag could have been labeled. LingPipe just sets the true negative count to zero, and as a result, specificity (TN/[TN+FP]) doesn’t make sense.

Interannotator Agreement

Suppose you have chunkings from two human annotators. Just treat one as the reference and one as the response and run a chunking evaluation. The precision and recall values will tell you which annotator returned more chunks. For instance, if precision is .95 and recall .75, you know that the annotator assigned as the reference chunking had a whole bunch of chunks the other annotator didn’t think were chunks, but most of the chunks found by the response annotator were also chunks of the reference annotator.

You can use F-measure as an overall single-number score.
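As a rough illustration of the idea (this is plain R, not LingPipe's evaluation API), here's a sketch that scores one annotator's chunks against another's. The chunk encoding, the example data, and the function name `chunk_agreement` are all made up for the example, and a chunk only counts as a match if both span and type agree exactly.

```r
# Hypothetical chunkings of the same text by two annotators, each chunk
# encoded as "start-end:TYPE" (character offsets plus entity type).
annotator_a = c("0-5:PER", "10-18:ORG", "25-31:LOC", "40-46:PER")  # reference
annotator_b = c("0-5:PER", "10-18:ORG", "40-46:ORG")               # response

chunk_agreement = function(reference, response) {
  tp = length(intersect(reference, response))  # exact span-and-type matches
  precision = tp / length(unique(response))
  recall = tp / length(unique(reference))
  f = if (tp == 0) 0 else 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f = f)
}

print(chunk_agreement(annotator_a, annotator_b))
# Swapping the roles swaps precision and recall; F is unchanged.
print(chunk_agreement(annotator_b, annotator_a))
```

Note that the two sets don't need to be the same size, which is the situation Krishna was asking about.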

The base metrics are all explained in

and their application to chunking in

Examples of running chunker evaluations can be found in

LingPipe Annotation Tool

If you’re annotating entity data, you might be interested in our learn-a-little, tag-a-little tool.

Now that Mitzi’s brought it up to compatibility with LingPipe 4, we should move citationEntities out of the sandbox and into a tutorial.

Mavandadi et al. (2012) Distributed Medical Image Analysis and Diagnosis through Crowd-Sourced Games: A Malaria Case Study

May 5, 2012

I found a link from Slashdot of all places to this forthcoming paper:

The main body of the paper is about the application to malaria diagnosis. But I’m more interested in the statistical techniques they used for crowd sourcing.

None of the nine authors, the reviewer(s) or editor(s) knew that their basic technique for analyzing crowd sourced data has been around for over 30 years. (I’m talking about the statistical technique here, not the application to distributed diagnosis of diseases, which I don’t know anything about.)

Of course, many of us reinvented this particular wheel over the past three decades, and the lack of any coherent terminology for the body of work across computer science, statistics, and epidemiology is part of the problem.

Previous Work

The authors should’ve cited the seminal paper in this field (at least it’s the earliest one I know — if you know earlier refs, please let me know):

  • Dawid, A. P. and A. M. Skene. 1979. Maximum likelihood estimation of observer error rates using the EM algorithm. Applied Statistics 28(1):20–28.

Here’s a 20-year-old paper on analyzing medical image data (dental X-rays) with similar models:

  • Espeland, M. A. and S. L. Handelman. 1989. Using latent class models to characterize and assess relative error in discrete measurements. Biometrics 45:587–599.

Mavandadi et al.

Mavandadi et al. use an approach they call a “binary channel model for gamers”. On page 4 of part II of the supplement to their paper, they define a maximum a posteriori estimate that is the same as Dawid and Skene’s maximum likelihood estimate. It’s the same wheel I reinvented in 2008 (I added hierarchical priors because I was asking Andrew Gelman and Jennifer Hill for advice) and that several groups have subsequently reinvented.

I didn’t understand the section about “error control coding” (starting with whether they meant the same thing as what I know as an “error correcting code”). Why have an annotator annotate an item an odd number of times and then take a majority vote? You can build a probabilistic model for reannotation of any number of votes (that presumably would take into account the correlation (fixed effect) of having the same annotator).

Role of Automatic Classifiers

As in Raykar et al.’s 2009 JMLR paper, Mavandadi et al. also include a machine-based system. But it is not tightly linked as in the work of Raykar et al. It’s just trained from the data a la Padhraic Smyth’s mid-1990s model of crowdsourcing crater location data and then training image analysis models on the resulting crowdsourced data.

Mavandadi et al. instead run their automatic classifier first, then if it’s not confident, hand it over to the crowd. This is, by the way, the standard practice in speech-recognition-based automated call centers.

Mavandadi et al. should check out (Sheng et al. 2008), which analyzes when you need to find another label, also using a Dawid-and-Skene-type model of data annotation. It’s also a rather common topic in the epidemiology literature, because it’s the basis of the decision as to which diagnostic test to administer next, if any, in situations like breast cancer diagnosis (which involves notoriously false-positive-prone image tests and notoriously false-negative-prone tissue tests).

I didn’t see any attempt by Mavandadi et al. to calibrate (or even measure) their system’s confidence assessments. I’d wait for that analysis before trusting their output.

Cross Validation vs. Inter-Annotator Agreement

March 12, 2012

Time, Negation, and Clinical Events

Mitzi’s been annotating clinical notes for time expressions, negations, and a couple other classes of clinically relevant phrases like diagnoses and treatments (I just can’t remember exactly which!). This is part of the project she’s working on with Noemie Elhadad, a professor in the Department of Biomedical Informatics at Columbia.

LingPipe Chunk Annotation GUI

Mitzi’s doing the phrase annotation with a LingPipe tool which can be found in

She even brought it up to date with the current release of LingPipe and generalized the layout for documents with subsections.

Our annotation tool follows the tag-a-little, train-a-little paradigm, in which an automatic system based on the already-annotated data is trained as you go to pre-annotate the data for a user to correct. This approach was pioneered in MITRE’s Alembic Workbench, which was used to create the original MUC-6 named-entity corpus.

The chunker underlying LingPipe’s annotation toolkit is based on LingPipe’s character language-model rescoring chunker, which can be trained online (that is, as the data streams in) and has quite reasonable out-of-the-box performance. It’s LingPipe’s best out-of-the-box chunker. In contrast, CRFs can be engineered to outperform the rescoring chunker with good feature engineering.

A very nice project would be to build a semi-supervised version of the rescoring chunker. The underlying difficulty is that our LM-based and HMM-based models take count-based sufficient statistics.

It Works!

Mitzi’s getting reasonable system accuracy under cross validation, with over 80% precision and recall (and hence over 80% balanced F-measure).

That’s not Cricket!

According to received wisdom in natural language processing, she’s left out a very important step of the standard operating procedure. She’s supposed to get another annotator to independently label the data and then measure inter-annotator agreement.

So What?

If we can train a system to perform at 80%+ F-measure under cross-validation, who cares if we can’t get another human to match Mitzi’s annotation?

We have something better — we can train a system to match Mitzi’s annotation!

In fact, training such a system is really all that we often care about. It’s much better to be able to train a system than another human to do the annotation.

The other thing we might want a corpus for is to evaluate a range of systems. There, if the systems are highly comparable, the fringes of the corpus matter. But perhaps the small, but still p < 0.05, differences in such systems don't matter so much. What the MT people have found is that even a measure that's roughly correlated with performance can be used to guide system development.

Error Analysis and Fixing Inconsistencies

Mitzi’s been doing the sensible thing of actually looking at the errors the system’s making under cross validation. In some of these cases, she’d clearly made a braino and annotated the data wrong. So she fixes it. And system performance goes up.

What Mitzi’s reporting is what I’ve always found in these tasks. For instance, she inconsistently annotated time plus date sequences, sometimes including the times and sometimes not. So she’s going back and correcting them so that a phrase consistently includes all of the time information (makes sense to me).

After a couple of days of annotation, you get a much stronger feeling for how the annotations should have gone all along. The annotations drifted so much over time in this fashion in the clinical notes annotated for the i2b2 Obesity Challenge that the winning team exploited time of labeling as an informative feature to predict co-morbidities of obesity!

That’s also not Cricket!

The danger with re-annotating is that the system’s response will bias the human annotations. System-label bias is also a danger with single annotation under the tag-a-little, learn-a-little setup. If you gradually change the annotation to match the system’s responses, you’ll eventually get to very good, if not perfect, performance under cross validation.

So some judgment is required in massaging the annotations into a coherent system, but one that you care about, not one driven by the learned system’s behavior.

On the other hand, you do want to choose features and chunkings the system can learn. So if you find you’re trying to make distinctions that are impossible for the system to learn, changing the coding standard to make it more learnable seems OK to me.

Go Forth and Multiply

Mitzi has only spent a few days annotating the data and the system’s already working well end to end. This is just the kind of use case that Breck and I had in mind when we built LingPipe in the first place. It’s so much fun seeing other people use your tools.

When Breck and Linnea and I were annotating named entities with the citationEntities tool, we could crank along at 5K tokens/hour without cracking a sweat. Two eight-hour days will net you 80K tokens of annotated data and a much deeper insight into the problem. In less than a person-week of effort, you’ll have a corpus the size of the MUC 6 entity corpus.

Of course, it’d be nice to roll in some active learning here. But that’s another story. As is measuring whether it’s better to have a bigger or a better corpus. This is the label-another-instance vs. label-a-fresh-instance decision problem that (Sheng et al. 2008) addressed directly.

Tang and Lease (2011) Semi-Supervised Consensus Labeling for Crowdsourcing

September 12, 2011

I came across this paper, which, among other things, describes the data collection being used for the 2011 TREC Crowdsourcing Track:

But that’s not why we’re here today. I want to talk about their modeling decisions.

Tang and Lease apply a Dawid-and-Skene-style model to crowdsourced binary relevance judgments for highly-ranked system responses from a previous TREC information retrieval evaluation. The workers judge document/query pairs as highly relevant, relevant, or irrelevant (though highly relevant and relevant are collapsed in the paper).

The Dawid and Skene model was relatively unsupervised, imputing all of the categories for items being classified as well as the response distribution for each annotator for each category of input (thus characterizing both bias and accuracy of each annotator).

Semi Supervision

Tang and Lease exploit the fact that in a directed graphical model, EM can be used to impute arbitrary patterns of missing data. They use this to simply add some known values for categories (here true relevance values). Usually, EM is being used to remove data, and that’s just how they pitch what they’re doing. They contrast the approach of Crowdflower (nee Dolores Labs) and Snow et al. as fully supervised. They thus provide a natural halfway point between Snow et al. and Dawid and Skene.
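To show the general shape of the idea, here's a minimal sketch of EM for a binary Dawid-and-Skene-style model in which some true categories are known and held fixed. It's my own illustration on simulated data, not Tang and Lease's code: the variable names, the simulated annotator parameters, the number of supervised items, and the vote-proportion initialization are all assumptions.

```r
set.seed(1)
n_items = 300; n_annot = 5
sens_true = c(0.90, 0.80, 0.70, 0.85, 0.60)   # P(label 1 | true 1), simulated
spec_true = c(0.95, 0.70, 0.80, 0.60, 0.90)   # P(label 0 | true 0), simulated
z = rbinom(n_items, 1, 0.3)                   # true categories
# y[i, j] = label from annotator j for item i
y = sapply(1:n_annot, function(j)
  rbinom(n_items, 1, ifelse(z == 1, sens_true[j], 1 - spec_true[j])))

known = rep(NA, n_items)      # NA = category unknown
known[1:30] = z[1:30]         # supervise the first 30 items
sup = !is.na(known)

clip = function(x) pmin(pmax(x, 1e-6), 1 - 1e-6)  # keep logs finite

# Initialize P(z_i = 1) from vote proportions, clamping known items.
p = rowMeans(y)
p[sup] = known[sup]

for (iter in 1:100) {
  # M step: expected-count estimates of prevalence, sensitivity, specificity
  prev = clip(mean(p))
  sens = clip(colSums(p * y) / sum(p))
  spec = clip(colSums((1 - p) * (1 - y)) / sum(1 - p))
  # E step: posterior probability of category 1, computed on the log scale
  log1 = log(prev) + y %*% log(sens) + (1 - y) %*% log(1 - sens)
  log0 = log(1 - prev) + y %*% log(1 - spec) + (1 - y) %*% log(spec)
  p_new = as.vector(1 / (1 + exp(log0 - log1)))
  p_new[sup] = known[sup]     # supervised items stay fixed
  if (max(abs(p_new - p)) < 1e-8) break
  p = p_new
}

em_acc = mean((p > 0.5) == z)
vote_acc = mean((rowMeans(y) > 0.5) == z)
print(round(rbind(sens, spec), 2))
print(c(em = em_acc, vote = vote_acc))
```

Setting `known` to all `NA` gives the fully unsupervised case, and supplying every category makes the M step a plain supervised estimate, which is the sense in which this sits halfway between the two.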

Good Results vs. NIST Gold

The key results are in the plots in figures 4 through 7, which plot performance versus amount of supervision (as well as fully unsupervised and majority vote approaches). They show the supervision helping relative to the fully unsupervised approach and the approach of training on just the labeled data.

Another benefit of adding supervised data (or adding unsupervised data if viewed the other way) is that you’ll get better estimates of annotator responses (accuracies and biases) and of topic prevalences.

Really Gold?

They get their gold-standard values from NIST, and the notion of relevance is itself rather vague and subjective, so the extra labels are only as golden as the NIST annotators. See below for more on this issue.

Voting: Quantity vs. Quality

Tang and Lease say that voting can produce good results with high quality annotators. It’ll also produce good results with a high quantity of annotators of low quality. As long as their results are independent enough, at least. This is what everyone else has seen (me with the Snow et al. data and Vikas Raykar et al. very convincingly in their JMLR paper).
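The quantity-vs.-quality tradeoff is easy to see with a binomial calculation. This is a sketch under strong assumptions: binary items, an odd number of annotators (so no ties), a single shared accuracy, and fully independent errors. The function name `majority_acc` is mine.

```r
# Probability that a majority of n independent annotators, each with
# accuracy acc, picks the right label on a binary item (n odd).
majority_acc = function(n, acc) {
  1 - pbinom(floor(n / 2), n, acc)
}

n = c(1, 3, 5, 11, 21, 51)
print(round(rbind(n = n,
                  acc_0.60 = majority_acc(n, 0.60),
                  acc_0.90 = majority_acc(n, 0.90)), 3))
```

Under these assumptions the voted accuracy climbs with the number of annotators as long as each one is better than chance; correlated errors would slow that climb.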

Regularized Estimates (vs. MLE vs. Bayesian)

I think it’d help if they regularized rather than took maximum likelihood estimates. Adding a bit of bias from regularization often reduces variance and thus expected error even more. It helps with fitting EM, too.

For my TREC entry, I went whole hog and sampled from the posterior of a Bayesian hierarchical model which simultaneously estimates the regularization parameters (now cast as priors) along with the other parameters.

I also use Bayesian estimates, specifically posterior means, which minimize expected squared error. MLE for the unregularized case and maximum a posteriori (MAP) estimates for the regularized case can both be viewed as taking posterior maximums (or modes) rather than means. These can be pretty different for the kinds of small count beta-binomial distributions used in Dawid and Skene-type models.
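A tiny worked example shows how far apart the three estimates can be with small counts. The counts and the Beta(2, 2) prior are arbitrary choices for illustration, not values from any of the models discussed here.

```r
# An annotator is right on 9 of 10 items; the prior on accuracy is Beta(a, b).
correct = 9; total = 10
a = 2; b = 2                  # arbitrary illustrative prior

mle = correct / total
post_a = a + correct          # posterior is Beta(post_a, post_b)
post_b = b + total - correct
map_est = (post_a - 1) / (post_a + post_b - 2)   # posterior mode
post_mean = post_a / (post_a + post_b)           # posterior mean

print(c(mle = mle, map = map_est, mean = post_mean))
```

With these numbers the MLE is 0.9, the MAP estimate is 10/12, and the posterior mean is 11/14, and the gaps shrink as the counts grow.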

Really Adversarial Turkers?

How in the world did they get a Mechanical Turker to have an accuracy of 0 with nearly 100 responses? That’s very very adversarial. I get higher accuracy estimates using their data for TREC and don’t get very good agreement with the NIST gold standard, so I’m really wondering about this figure and the quality of the NIST judgments.

Active Learning

Choosing labels for items on the margin of a classifier is not necessarily the best thing to do for active learning. You need to balance uncertainty with representativeness, or you’ll do nothing but label a sequence of outliers. There’s been ongoing work by John Langford and crew on choosing the right balance here.

Adding a Model

Vikas Raykar et al. in their really nice JMLR paper add a regression-based classifier to the annotators. I think this is the kind of thing Tang and Lease are suggesting in their future work section. They cite the Raykar et al. paper, but oddly not in this context, which for me, was its major innovation.

Not Quite Naive Bayes

Tang and Lease refer to the Dawid and Skene approach as “naive Bayes”, which is not accurate. I believe they’re thinking of generating the labels as analogous to generating tokens. But the normalization for that is wrong, being over annotators rather than over annotator/label pairs. If they had a term estimating the probability of an annotator doing an annotation, then it would reduce to naive Bayes if they allow multiple annotations by the same annotator independently (which they actually consider, but then rule out).

So it’s odd to see the Nigam et al. paper on semi-supervised naive Bayes text classification used as an example, as it’s not particularly relevant, so to speak. (I really like Nigam et al.’s paper, by the way — our semi-supervised naive Bayes tutorial replicates their results with some more analysis and some improved results.)

Two-Way vs. K-Way Independence

Another nitpick is that it’s not enough to assume every pair of workers is independent. The whole set needs to be independent, and these conditions aren’t the same. (I was going to link to the Wikipedia article on independent random variables, but it only considers the pairwise case. So you’ll have to go to a decent probability theory textbook like DeGroot and Schervish or Larsen and Marx, where you’ll get examples of three variables that are not independent though each pair is independent.)
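
A standard textbook-style illustration of the gap between pairwise and mutual independence uses two fair coin flips and their exclusive-or. The sketch below enumerates that example; the variable and function names are illustrative only.

```python
from itertools import product

# X and Y are independent fair coin flips; Z is their exclusive-or.
outcomes = [(x, y, x ^ y) for x, y in product([0, 1], repeat=2)]
prob = 1.0 / len(outcomes)  # each (x, y) combination is equally likely


def marginal(i):
    """P(variable i equals 1)."""
    return sum(prob for o in outcomes if o[i] == 1)


def joint(indices):
    """P(all of the listed variables equal 1)."""
    return sum(prob for o in outcomes if all(o[i] == 1 for i in indices))


# For binary variables, checking the "equals 1" events is enough.
pairwise_ok = all(
    joint((i, j)) == marginal(i) * marginal(j)
    for i, j in [(0, 1), (0, 2), (1, 2)]
)
mutual_ok = joint((0, 1, 2)) == marginal(0) * marginal(1) * marginal(2)

print("every pair independent:", pairwise_ok)
print("all three independent: ", mutual_ok)
```

Every pair factors, but the three-way joint does not: X, Y and Z can never all be 1 at once, whereas mutual independence would give that event probability 1/8.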

One Last Nitpick

A further nitpick is equation (6), the second line of which has an unbound i in the p[i] term. Instead, i needs to be bound to the true category for instance m.

Synthetic Data Generation?

I also didn’t understand their synthetic data generation in 3.1. If they generate accuracies, do they take the sensitivities and specificities to be the same (in their notation, pi[k,0,0] = pi[k,1,1])? In my (and others’) experience, there’s usually a significant bias so that sensitivity is not equal to specificity for most annotators.

Modeling Item Difficulty for Annotations of Multinomial Classifications

September 8, 2011

We all know from annotating data that some items are harder to annotate than others. We know from the epidemiology literature that the same holds true for medical tests applied to subjects, e.g., some cancers are easier to find than others.

But how do we model item difficulty? I’ll review how I’ve done this before using an IRT-like regression, then move on to Paul Mineiro’s suggestion for flattening multinomials, then consider a generalization of both these approaches.

Binary Data

My previous suggestions have been for binary data and were based on item-response theory (IRT) style models as applied to epidemiology by Uebersax and Grove. These are reflected, for instance, in my tutorial for LREC with Massimo Poesio. The basic idea is that you characterize an item i by position \alpha_i and an annotator by position \beta_j and discriminativeness \delta_j, then generate a label for item i by annotator j whose probability of being correct is:

y_{i,j} \sim \mbox{\sf Bern}(\mbox{logit}^{-1}(\delta_j(\alpha_i - \beta_j)))

The higher the value of \delta_j, the sharper the distinctions made by annotator j. You can break this out so there’s one model for positive items and another for negative items (essentially, one set of parameters for sensitivity and one for specificity).
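
As a rough sketch of how the discriminativeness parameter behaves, the snippet below evaluates the probability formula above for a blunt and a sharp annotator. The function names and the parameter values are made up for illustration.

```python
import math


def inv_logit(x):
    """Inverse logit (logistic sigmoid)."""
    return 1.0 / (1.0 + math.exp(-x))


def prob_correct(alpha_i, beta_j, delta_j):
    """Probability of a correct label: logit^{-1}(delta_j * (alpha_i - beta_j))."""
    return inv_logit(delta_j * (alpha_i - beta_j))


# Hypothetical item positions and two annotators sharing the same position beta_j = 0.
for alpha_i in (-1.0, 0.0, 1.0):
    blunt = prob_correct(alpha_i, beta_j=0.0, delta_j=0.5)
    sharp = prob_correct(alpha_i, beta_j=0.0, delta_j=5.0)
    print(f"alpha_i = {alpha_i:+.1f}  blunt = {blunt:.3f}  sharp = {sharp:.3f}")
```

Under the formula as written, both annotators sit at 0.5 when the item position equals the annotator position, and the annotator with the larger \delta_j moves away from 0.5 much faster as the two positions separate.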

If there’s no item difficulty, or equivalently all \alpha_i = 0, we can reduce the logistic regression to a simple binomial parameter

\theta_j = \mbox{logit}^{-1}(-\delta_j \beta_j)

which is the form of model I’ve been proposing for binary data with single response parameter \theta_j, with \theta_{0,j} for specificity and \theta_{1,j} for sensitivity.

Difficulty as Flattening Responses

Paul Mineiro’s blog post Low-Rank Confusion Modeling of Crowdsourced Workers introduces a general approach to handling item difficulty. If the true category is k, each annotator j has a response distribution parameterized by a multinomial parameter \theta_{j,k}.

This is just Dawid and Skene’s original model.

Mineiro applies a neat single-parameter technique to model item difficulty where the larger the parameter, the closer the response distribution is to uniform. That is, difficulty amounts to flattening an annotator’s response.

He does this using the standard temperature analogy from annealing. If the difficulty of item i of true category k is \alpha_i, the response of annotator j is flattened from \theta_{j,k} to being proportional to \theta_{j,k}^{1/\alpha_i}. The higher the value of \alpha_i, the greater the amount of flattening. A value of \alpha_i greater than 1 indicates an annotator will do worse than their basic response and a value less than 1 indicates they’ll do better (assuming they assign the true value the highest probability).
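
The flattening step can be sketched in a few lines. This is only an illustration of the power-and-renormalize idea described above, not Mineiro’s code; the function name and the response vector are hypothetical.

```python
def flatten(theta, alpha):
    """Raise each response probability to the power 1/alpha and renormalize."""
    powered = [t ** (1.0 / alpha) for t in theta]
    total = sum(powered)
    return [p / total for p in powered]


# Hypothetical response distribution; the first entry is the true category.
theta = [0.7, 0.2, 0.1]

same = flatten(theta, 1.0)    # alpha = 1 leaves the response unchanged
harder = flatten(theta, 4.0)  # alpha > 1 pushes toward uniform
easier = flatten(theta, 0.5)  # alpha < 1 sharpens toward the mode

for name, dist in [("same", same), ("harder", harder), ("easier", easier)]:
    print(name, [round(p, 3) for p in dist])
```

The ordering of the categories is preserved for every finite \alpha_i; only the spread changes.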

General Regression Approach

We can do almost the same thing in a more general regression setting. To start, convert an annotator’s response probability vector \theta_{j,k} to a regression representation \beta_{j,k} = \log \theta_{j,k}. To get a probability distribution back, apply softmax (i.e., multi-logit), where the probability of label k' for an item i with true label k for annotator j is proportional to \exp(\beta_{j,k,k'}).

We can encode Mineiro’s approach in this setting by adding a multiplicative term for the item, making the response proportional to \exp((1/\alpha_i) \beta_{j,k,k'}) = \exp(\beta_{j,k,k'})^{1/\alpha_i}. It would probably be more common to make the difficulty additive, so you’d get \exp(\alpha_i + \beta_{j,k,k'}), though a term shared across all responses k' cancels under normalization, so an additive difficulty only has an effect if it varies by response.

For instance, suppose we have an ice-cream-flavor classification task with four response categories, lemon, orange, vanilla and chocolate. Built into users’ responses (and in my hierarchical models into the hierarchical priors) are the basic confusions. For instance, a lemon ice cream would be more likely to be confused with orange than vanilla or chocolate. A more difficult item will flatten this response to one that’s more uniform. But until \alpha_i approaches infinity, we’ll still confuse orange with lemon more than with the others.

Problem with the Basic Approach

The fundamental problem with Mineiro’s approach is that there’s no way to have the difficulty be one of limited confusability. In the ice cream case, an item that’s very citrusy will never be confused with chocolate or vanilla, but might, in the limit of a very hard problem, have a uniform response between orange and lemon.

You also see this in ordinal problems, like Rzhetsky et al.’s modality and strength of assertion ordinal scale. A very borderline case might be confused between a 2 and a 3 (positive and very positive), but won’t be confused with a -3 (very negative) modality of assertion.

Generalized Multi-Parameter Approach

What I can do is convert everything to a regression again. And then the generalization’s obvious. Instead of a single difficulty parameter \alpha_i for an item, have a difficulty parameter for each response k, namely \alpha_{i,k}. Now the probability of annotator j responding with category k' when the true category is k is taken to be proportional to \exp(\beta_{j,k,k'} + \alpha_{i,k'}).

If you want it to be more like Mineiro’s approach, you could leave it on a multiplicative scale, and take the response to be proportional to \exp(\alpha_{i,k'} \beta_{j,k,k'}).
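
Here is a minimal sketch of the additive per-response version applied to the ice cream example. All the numbers, including the annotator’s base response for lemon items and the item-level offsets, are invented for illustration, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def response_probs(beta_jk, alpha_i):
    """P(response k') proportional to exp(beta_{j,k,k'} + alpha_{i,k'})."""
    return softmax([b + a for b, a in zip(beta_jk, alpha_i)])


flavors = ["lemon", "orange", "vanilla", "chocolate"]
# Assumed log-scale response of one annotator when the true flavor is lemon.
beta_lemon = [math.log(p) for p in [0.70, 0.20, 0.05, 0.05]]

easy_item = [0.0, 0.0, 0.0, 0.0]               # no item effect
citrusy_item = [0.0, math.log(3.5), 0.0, 0.0]  # only orange gets more tempting
shared_shift = [2.0, 2.0, 2.0, 2.0]            # same offset for every response

easy = response_probs(beta_lemon, easy_item)
citrusy = response_probs(beta_lemon, citrusy_item)
shifted = response_probs(beta_lemon, shared_shift)

for name, dist in [("easy", easy), ("citrusy", citrusy), ("shifted", shifted)]:
    print(name, dict(zip(flavors, (round(p, 3) for p in dist))))
```

In this toy setup the citrusy item ends up split evenly between lemon and orange while vanilla and chocolate stay rare, which is the limited confusability a single flattening parameter cannot express. The shared offset leaves the distribution unchanged, since softmax is invariant to adding a constant to every score.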

Of course, …

We’re never going to be able to fit such a model without a pile of annotations per item. It’s just basic stats — your estimates are only as good as (the square root of) the count of your data.
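
The square-root scaling is easy to see for a single binomial proportion. The sketch below uses the usual standard error formula sqrt(p(1 - p)/n); the accuracy value and the annotation counts are arbitrary choices for illustration.

```python
import math


def binomial_std_error(p, n):
    """Standard error of a proportion estimated from n independent draws."""
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical per-item response probability of 0.7.
for n in (10, 40, 160, 1000):
    print(f"n = {n:4d}  standard error = {binomial_std_error(0.7, n):.3f}")
```

Each fourfold increase in annotations only halves the uncertainty, so per-item, per-response parameters remain poorly determined at the annotation counts typical of crowdsourced data.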

Being Bayesian, at least this shouldn’t hurt us (as I showed in my tech report). Even if the point estimate is unreliable, if we apply full Bayesian posterior inference, we get back overdispersion relative to our point estimate and everything should work out in principle. I just didn’t find it helped much in the problems I’ve looked at, which had 10 noisy annotations per item.

But then, …

If we have contentful predictors for the items, we might be able to model the difficulties more directly as regressions on basic predictors. An example would be knowing the journal in which a paper appeared when classifying articles by subject. Some journals might be more interdisciplinary and have more confusable papers in general.