We all know from annotating data that some items are harder to annotate than others. We know from the epidemiology literature that the same holds true for medical tests applied to subjects, e.g., some cancers are easier to find than others.

But how do we model item difficulty? I’ll review how I’ve done this before using an IRT-like regression, then move on to Paul Mineiro’s suggestion for flattening multinomials, then consider a generalization of both these approaches.

### Binary Data

My previous suggestions have been for binary data and were based on item-response theory (IRT) style models as applied to epidemiology by Uebersax and Grove. These are reflected, for instance, in my tutorial for LREC with Massimo Poesio. The basic idea is that you characterize an item $i$ by position $\alpha_i$ and an annotator $j$ by position $\beta_j$ and discriminativeness $\delta_j$, then generate a label for item $i$ by annotator $j$ whose probability of being correct is:

$$\mathrm{logit}^{-1}(\delta_j (\alpha_i - \beta_j))$$

The higher the value of $\delta_j$, the sharper the distinctions made by annotator $j$. You can break this out so there’s one model for positive items and another for negative items (essentially, one set of parameters for sensitivity and one for specificity).

If there’s no item difficulty, equivalently all $\alpha_i = 0$, we can reduce the logistic regression to a simple binomial parameter

$$\theta_j = \mathrm{logit}^{-1}(-\delta_j \beta_j)$$

which is the form of model I’ve been proposing for binary data with single response parameter $\theta_j$, with $\theta_{0,j}$ for specificity and $\theta_{1,j}$ for sensitivity.
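As a minimal sketch of the binary model, the snippet below assumes the probability of a correct label is the inverse logit of `delta_j * (alpha_i - beta_j)`. The function names and the numeric values are hypothetical, chosen only for illustration. Under this sign convention a larger item position makes a correct label more likely.

```python
import math


def inv_logit(x: float) -> float:
    """Logistic sigmoid, the inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def prob_correct(alpha_i: float, beta_j: float, delta_j: float) -> float:
    """Probability that annotator j labels item i correctly (assumed form)."""
    return inv_logit(delta_j * (alpha_i - beta_j))


def simple_accuracy(beta_j: float, delta_j: float) -> float:
    """No-difficulty special case: every item position alpha_i is zero."""
    return prob_correct(0.0, beta_j, delta_j)


# Hypothetical annotator parameters, not taken from any data set.
beta_j, delta_j = -1.0, 2.0

easy = prob_correct(1.0, beta_j, delta_j)    # item well above the annotator's position
hard = prob_correct(-1.0, beta_j, delta_j)   # item exactly at the annotator's position
theta_j = simple_accuracy(beta_j, delta_j)   # single binomial parameter

print(f"easy={easy:.3f} hard={hard:.3f} theta_j={theta_j:.3f}")
```

An item sitting exactly at the annotator's position gets a coin-flip response, and the no-difficulty case collapses to one accuracy per annotator.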

### Difficulty as Flattening Responses

Paul Mineiro’s blog post Low-Rank Confusion Modeling of Crowdsourced Workers introduces a general approach to handling item difficulty. If the true category is $k$, each annotator $j$ has a response distribution parameterized by a multinomial parameter $\theta_{j,k}$.

This is just Dawid and Skene’s original model.

Mineiro applies a neat single-parameter technique to model item difficulty where the larger the parameter, the closer the response distribution is to uniform. That is, difficulty amounts to flattening an annotator’s response.

He does this with the standard temperature analogy used in annealing. If the difficulty of item $i$ of true category $k$ is $\alpha_i$, the response of annotator $j$ is flattened from $\theta_{j,k}$ to being proportional to $\theta_{j,k}^{1/\alpha_i}$. The higher the value of $\alpha_i$, the greater the amount of flattening. A value of $\alpha_i$ greater than 1 indicates an annotator will do worse than their basic response and a value less than 1 indicates they’ll do better (assuming they assign the true value the highest probability).
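Temperature-style flattening can be sketched as follows, assuming the flattened response is the annotator's response vector raised elementwise to the power one over the difficulty and then renormalized. The helper `flatten` and the example response vector are hypothetical.

```python
def flatten(theta, alpha):
    """Raise each response probability to 1/alpha and renormalize.

    alpha must be positive; alpha = 1 leaves the distribution unchanged,
    larger values push it toward uniform, smaller values sharpen it.
    """
    powered = [p ** (1.0 / alpha) for p in theta]
    total = sum(powered)
    return [p / total for p in powered]


# Hypothetical response distribution; the true category is listed first.
theta_jk = [0.7, 0.2, 0.05, 0.05]

base = flatten(theta_jk, 1.0)     # unchanged
harder = flatten(theta_jk, 4.0)   # flatter, less accurate
easier = flatten(theta_jk, 0.5)   # sharper, more accurate

print([round(p, 3) for p in harder])
print([round(p, 3) for p in easier])
```

Flattening never reorders the responses, so the relative ranking of the confusions stays the same at every finite difficulty.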

### General Regression Approach

We can do almost the same thing in a more general regression setting. To start, convert an annotator’s response probability vector $\theta_{j,k}$ to a regression representation $\log \theta_{j,k}$. To get a probability distribution back, apply softmax (i.e., multi-logit), where the probability of label $k'$ for an item $i$ with true label $k$ for annotator $j$ is proportional to

$$\exp(\log \theta_{j,k,k'})$$

We can encode Mineiro’s approach in this setting by adding a multiplicative term for the item, making the response proportional to $\exp(\log \theta_{j,k,k'} / \alpha_i) = \theta_{j,k,k'}^{1/\alpha_i}$. It would probably be more common to make the difficulty additive, so you’d get $\exp(\alpha_i + \log \theta_{j,k,k'})$.

For instance, suppose we have an ice-cream-flavor classification task with four response categories, lemon, orange, vanilla and chocolate. Built into users’ responses (and in my hierarchical models into the hierarchical priors) are the basic confusions. For instance, a lemon ice cream would be more likely to be confused with orange than vanilla or chocolate. A more difficult item will flatten this response to one that’s more uniform. But until the difficulty $\alpha_i$ approaches infinity, we’ll still confuse orange with lemon more than with the others.
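The regression view of the ice cream example might be sketched like this, assuming responses are obtained by applying softmax to log probabilities that are either divided by a single item difficulty (multiplicative form) or shifted by it (additive form). The response vector for a lemon item and the helper names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


flavors = ["lemon", "orange", "vanilla", "chocolate"]

# Hypothetical response distribution of one annotator when the truth is lemon.
theta_lemon = [0.6, 0.3, 0.05, 0.05]
log_theta = [math.log(p) for p in theta_lemon]


def multiplicative(log_theta, alpha_i):
    """Scale the log probabilities by 1/alpha_i before softmax."""
    return softmax([lt / alpha_i for lt in log_theta])


def additive(log_theta, alpha_i):
    """Shift every log probability by the same scalar alpha_i before softmax."""
    return softmax([alpha_i + lt for lt in log_theta])


hard = multiplicative(log_theta, 5.0)
shifted = additive(log_theta, 5.0)

for name, p_hard, p_shift in zip(flavors, hard, shifted):
    print(f"{name:10s} multiplicative={p_hard:.3f} additive={p_shift:.3f}")
```

In this sketch the hard item is flatter but orange still outranks vanilla and chocolate. It also shows that an offset shared by every response cancels when softmax normalizes, so an additive difficulty only changes the distribution if it differs across responses.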

### Problem with the Basic Approach

The fundamental problem with Mineiro’s approach is that there’s no way to have the difficulty be one of limited confusability. In the ice cream case, an item that’s very citrusy will never be confused with chocolate or vanilla, but might, in the limit of a very hard problem, have a uniform response between orange and lemon.

You also see this in ordinal problems, like Rzhetsky et al.’s modality and strength of assertion ordinal scale. A very borderline case might be confused between a 2 and a 3 (positive and very positive), but won’t be confused with a -3 (very negative) modality of assertion.

### Generalized Multi-Parameter Approach

What I can do is convert everything to a regression again, and then the generalization’s obvious. Instead of a single difficulty parameter $\alpha_i$ for an item, have a difficulty parameter for each response $k'$, namely $\alpha_{i,k'}$. Now the probability of annotator $j$ responding with category $k'$ when the true category is $k$ is taken to be proportional to $\exp(\alpha_{i,k'} + \log \theta_{j,k,k'})$.

If you want it to be more like Mineiro’s approach, you could leave it on a multiplicative scale, and take the response to be proportional to $\exp(\log \theta_{j,k,k'} / \alpha_{i,k'})$.
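Here is a small sketch of the additive per-response version, assuming one offset per response category is added to the annotator's log response probabilities before softmax. The offsets and the response vector are hypothetical, picked to mimic a citrusy item that is hard to place between lemon and orange.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def response_dist(log_theta_jk, alpha_i):
    """Response distribution proportional to exp(alpha_i[k'] + log_theta_jk[k'])."""
    return softmax([a + lt for a, lt in zip(alpha_i, log_theta_jk)])


flavors = ["lemon", "orange", "vanilla", "chocolate"]

# Hypothetical annotator response when the true flavor is lemon.
theta_lemon = [0.6, 0.3, 0.05, 0.05]
log_theta = [math.log(p) for p in theta_lemon]

# Hypothetical per-response difficulties: only the orange response is boosted.
alpha_citrus = [0.0, math.log(2.0), 0.0, 0.0]

dist = response_dist(log_theta, alpha_citrus)
for name, p in zip(flavors, dist):
    print(f"{name:10s} {p:.3f}")
```

With these made-up offsets the response is uniform between lemon and orange while vanilla and chocolate stay unlikely, which is the limited confusability a single difficulty parameter cannot express.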

### Of course, …

We’re never going to be able to fit such a model without a pile of annotations per item. It’s just basic stats — your estimates are only as good as (the square root of) the count of your data.
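To put a rough number on the square-root rule, the sketch below uses the standard error of a single binomial proportion as a stand-in. That is a simplification of the multi-parameter model, and the accuracy of 0.7 and the annotation counts are hypothetical.

```python
import math


def binomial_standard_error(p: float, n: int) -> float:
    """Standard error of an estimated proportion p from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical accuracy, with 10 versus 40 annotations per item.
se_10 = binomial_standard_error(0.7, 10)
se_40 = binomial_standard_error(0.7, 40)

print(f"n=10: {se_10:.3f}  n=40: {se_40:.3f}  ratio: {se_10 / se_40:.1f}")
```

Quadrupling the number of annotations only halves the standard error, and a per-response difficulty model has several such quantities to estimate for every item.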

Being Bayesian, at least this shouldn’t hurt us (as I showed in my tech report). Even if the point estimate is unreliable, if we apply full Bayesian posterior inference, we get back overdispersion relative to our point estimate and everything should work out in principle. I just didn’t find it helped much in the problems I’ve looked at, which had 10 noisy annotations per item.

### But then, …

If we have contentful predictors for the items, we might be able to model the difficulties more directly as regressions on basic predictors. An example would be knowing the journal a paper came from when classifying article subjects. Some journals might be more interdisciplinary and have more confusable papers in general.

September 8, 2011 at 6:14 pm |

Welinder et al. had a nice paper in which they embedded both the users and the examples in a vector space. This leads to a multi-dimensional concept of item difficulty. As you indicate, you need more ratings per item to support a richer concept of item difficulty, so I haven’t found this very compelling, but it does make cool pictures.

Re: ordinal problems, I have a specialized technique just for that based upon the polytomous Rasch model.

http://www.machinedlearnings.com/2011/02/ordered-values-and-mechanical-turk-part.html

September 9, 2011 at 5:47 pm |

Thanks for the references. I’m very interested in the ordinal problem for graded linguistic judgments and survey responses.