## Modeling Item Difficulty for Annotations of Multinomial Classifications

We all know from annotating data that some items are harder to annotate than others. We know from the epidemiology literature that the same holds true for medical tests applied to subjects, e.g., some cancers are easier to find than others.

But how do we model item difficulty? I’ll review how I’ve done this before using an IRT-like regression, then move on to Paul Mineiro’s suggestion for flattening multinomials, then consider a generalization of both these approaches.

### Binary Data

My previous suggestions have been for binary data and were based on item-response theory (IRT) style models as applied to epidemiology by Uebersax and Grove. These are reflected, for instance, in my tutorial for LREC with Massimo Poesio. The basic idea is that you characterize an item $i$ by position $\alpha_i$ and an annotator by position $\beta_j$ and discriminativeness $\delta_j$, then generate a label for item $i$ by annotator $j$ whose probability of being correct is:

$y_{i,j} \sim \mbox{\sf Bern}(\mbox{logit}^{-1}(\delta_j(\alpha_i - \beta_j)))$

The higher the value of $\delta_j$, the sharper the distinctions made by annotator $j$. You can break this out so there’s one model for positive items and another for negative items (essentially, one set of parameters for sensitivity and one for specificity).
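To make the role of $\delta_j$ concrete, here's a minimal sketch that just evaluates the Bernoulli parameter above. The function names and the numeric values are made up for illustration; nothing here is fitted to data.

```python
import math


def inv_logit(x: float) -> float:
    """Logistic sigmoid, the inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def response_prob(alpha_i: float, beta_j: float, delta_j: float) -> float:
    """Bernoulli parameter logit^{-1}(delta_j * (alpha_i - beta_j))."""
    return inv_logit(delta_j * (alpha_i - beta_j))


# Hypothetical values: one item position and two annotators at the same
# position who differ only in discriminativeness delta_j.
alpha_i = 1.0
p_blunt = response_prob(alpha_i, beta_j=0.0, delta_j=0.5)
p_sharp = response_prob(alpha_i, beta_j=0.0, delta_j=4.0)
print(f"blunt annotator: {p_blunt:.3f}, sharp annotator: {p_sharp:.3f}")
```

With the item held fixed, the annotator with the larger $\delta_j$ gets a probability further from 1/2, which is all "sharper distinctions" means here.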

If there’s no item difficulty, or equivalently if all $\alpha_i = 0$, the logistic regression reduces to a simple binomial parameter

$\theta_j = \mbox{logit}^{-1}(-\delta_j \beta_j)$

which is the form of model I’ve been proposing for binary data with a single response parameter $\theta_j$, or with $\theta_{0,j}$ for specificity and $\theta_{1,j}$ for sensitivity.

### Difficulty as Flattening Responses

Paul Mineiro’s blog post Low-Rank Confusion Modeling of Crowdsourced Workers introduces a general approach to handling item difficulty. If the true category is $k$, each annotator $j$ has a response distribution parameterized by a multinomial parameter $\theta_{j,k}$.

This is just Dawid and Skene’s original model.

Mineiro applies a neat single-parameter technique to model item difficulty where the larger the parameter, the closer the response distribution is to uniform. That is, difficulty amounts to flattening an annotator’s response.

He does this in the standard temperature-based analogy used in annealing. If the difficulty of item $i$ of true category $k$ is $\alpha_i$, the response of annotator $j$ is flattened from $\theta_{j,k}$ to being proportional to $\theta_{j,k}^{1/\alpha_i}$. The higher the value of $\alpha$, the greater the amount of flattening. A value of $\alpha_i$ greater than 1 indicates an annotator will do worse than their basic response and a value less than 1 indicates they’ll do better (assuming they assign the true value the highest probability).
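Here's a minimal sketch of the flattening operation as I've described it, assuming a made-up three-category response distribution. The function name `flatten` and the numbers are mine, not Mineiro's.

```python
from typing import List


def flatten(theta: List[float], alpha: float) -> List[float]:
    """Return the distribution proportional to theta[k] ** (1 / alpha)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    powered = [t ** (1.0 / alpha) for t in theta]
    total = sum(powered)
    return [p / total for p in powered]


# Hypothetical response distribution; the true category is the first one.
theta_jk = [0.7, 0.2, 0.1]

for alpha in (0.5, 1.0, 2.0, 100.0):
    print(alpha, [round(p, 3) for p in flatten(theta_jk, alpha)])
```

At $\alpha_i = 1$ you get the annotator's basic response back, values below 1 sharpen it, and large values push it toward uniform.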

### General Regression Approach

We can do almost the same thing in a more general regression setting. To start, convert an annotator’s response probability vector $\theta_{j,k}$ to a regression representation $\beta_{j,k} = \log \theta_{j,k}$. To get a probability distribution back, apply softmax (i.e., multi-logit), where the probability of label $k'$ for an item $i$ with true label $k$ for annotator $j$ is proportional to $\exp(\beta_{j,k,k'})$.

We can encode Mineiro’s approach in this setting by adding a multiplicative term for the item, making the response proportional to $\exp((1/\alpha_i) \beta_{j,k,k'}) = \exp(\beta_{j,k,k'})^{1/\alpha_i}$. It would probably be more common to make the difficulty additive, so you’d get $\exp(\alpha_i + \beta_{j,k,k'})$.

For instance, suppose we have an ice-cream-flavor classification task with four response categories, lemon, orange, vanilla and chocolate. Built into users’ responses (and in my hierarchical models into the hierarchical priors) are the basic confusions. For instance, a lemon ice cream would be more likely to be confused with orange than vanilla or chocolate. A more difficult item will flatten this response to one that’s more uniform. But until $\alpha_i$ approaches infinity, we’ll still confuse orange with lemon more than with the others.
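The ice cream case can be sketched on the log scale, assuming a made-up response distribution for one annotator when the true flavor is lemon. The probabilities and the helper names are hypothetical.

```python
import math
from typing import List

FLAVORS = ["lemon", "orange", "vanilla", "chocolate"]


def softmax(xs: List[float]) -> List[float]:
    """Exponentiate and normalize, shifting by the max for stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def flattened_response(theta: List[float], alpha_i: float) -> List[float]:
    """softmax(log(theta) / alpha_i), i.e. theta ** (1 / alpha_i) renormalized."""
    return softmax([math.log(t) / alpha_i for t in theta])


# Hypothetical annotator response when the true flavor is lemon.
theta_lemon = [0.80, 0.15, 0.03, 0.02]

for alpha_i in (1.0, 3.0, 30.0):
    probs = flattened_response(theta_lemon, alpha_i)
    print(alpha_i, dict(zip(FLAVORS, (round(p, 3) for p in probs))))
```

For any finite $\alpha_i$ the ranking lemon, orange, vanilla, chocolate is preserved, but as $\alpha_i$ grows the probability mass spreads over all four flavors.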

### Problem with the Basic Approach

The fundamental problem with Mineiro’s approach is that there’s no way to have the difficulty be one of limited confusability. In the ice cream case, an item that’s very citrusy will never be confused with chocolate or vanilla, but might, in the limit of a very hard problem, have a uniform response between orange and lemon.

You also see this in ordinal problems, like Rzhetsky et al.’s modality and strength of assertion ordinal scale. A very borderline case might be confused between a 2 and a 3 (positive and very positive), but won’t be confused with a -3 (very negative) modality of assertion.

### Generalized Multi-Parameter Approach

What I can do is convert everything to a regression again. And then the generalization’s obvious. Instead of a single difficulty parameter $\alpha_i$ for an item, have a difficulty parameter for each response $k$, namely $\alpha_{i,k}$. Now the probability of annotator $j$ responding with category $k'$ when the true category is $k$ is taken to be proportional to $\exp(\beta_{j,k,k'} + \alpha_{i,k'})$.

If you want it to be more like Mineiro’s approach, you could leave it on a multiplicative scale, and take the response to be proportional to $\exp(\alpha_{i,k'} \beta_{j,k,k'})$.
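Here's a minimal sketch of the additive version, assuming a made-up annotator row for true flavor lemon and a hand-picked item effect. All the numbers and names are hypothetical; in practice the $\alpha_{i,k'}$ would be estimated, not set by hand.

```python
import math
from typing import List

FLAVORS = ["lemon", "orange", "vanilla", "chocolate"]


def softmax(xs: List[float]) -> List[float]:
    """Exponentiate and normalize, shifting by the max for stability."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def response(beta_jk: List[float], alpha_i: List[float]) -> List[float]:
    """Probability of each response k', proportional to exp(beta_jk[k'] + alpha_i[k'])."""
    return softmax([b + a for b, a in zip(beta_jk, alpha_i)])


# Hypothetical annotator row for true flavor lemon, on the log scale.
beta_lemon = [math.log(p) for p in (0.80, 0.15, 0.03, 0.02)]

# No item effect: recovers the annotator's basic response.
easy_item = [0.0, 0.0, 0.0, 0.0]

# Hypothetical borderline citrus item: raise orange to lemon's level and
# leave vanilla and chocolate alone.
citrusy_item = [0.0, math.log(0.80 / 0.15), 0.0, 0.0]

for name, alpha_i in (("easy", easy_item), ("citrusy", citrusy_item)):
    probs = response(beta_lemon, alpha_i)
    print(name, dict(zip(FLAVORS, (round(p, 3) for p in probs))))
```

The citrusy item ends up split evenly between lemon and orange while vanilla and chocolate stay unlikely, which is the limited confusability a single $\alpha_i$ can't express. Adding the same constant to every $\alpha_{i,k'}$ leaves the softmax unchanged, so only the differences among an item's effects matter.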

### Of course, …

We’re never going to be able to fit such a model without a pile of annotations per item. It’s just basic stats — your estimates are only as good as (the square root of) the count of your data.
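As a rough illustration of the square-root point, here's the standard error of a simple binomial proportion at a few annotation counts. This is only the textbook formula, not an analysis of the multi-parameter model.

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error sqrt(p * (1 - p) / n) of a proportion estimated from n draws."""
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical response probability of 0.5, the worst case for variance.
for n in (10, 100, 1000):
    print(n, round(binomial_se(0.5, n), 4))
```

A hundred times as many annotations only cuts the standard error by a factor of ten.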

Being Bayesian, at least this shouldn’t hurt us (as I showed in my tech report). Even if the point estimate is unreliable, if we apply full Bayesian posterior inference, we get back overdispersion relative to our point estimate and everything should work out in principle. I just didn’t find it helped much in the problems I’ve looked at, which had 10 noisy annotations per item.

### But then, …

If we have contentful predictors for the items, we might be able to model the difficulties more directly as regressions on basic predictors. An example would be knowing the journal a paper came from when classifying articles by subject. Some journals might be more interdisciplinary and have more confusable papers in general.
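One way this might look for the single-parameter difficulty, as a sketch only: a log link from item predictors to $\alpha_i$. The log link, the predictor coding, and the coefficients are all my assumptions for illustration, not something fitted or taken from the models above.

```python
import math
from typing import List


def item_difficulty(x_i: List[float], gamma: List[float]) -> float:
    """alpha_i = exp(x_i . gamma); the log link (an assumption) keeps alpha_i positive."""
    return math.exp(sum(x * g for x, g in zip(x_i, gamma)))


# Hypothetical predictors: [intercept, 1.0 if the journal is interdisciplinary else 0.0].
# The coefficients are made up, not estimated from any data.
gamma = [0.0, 0.7]

specialist = item_difficulty([1.0, 0.0], gamma)
interdisciplinary = item_difficulty([1.0, 1.0], gamma)
print(f"specialist: {specialist:.2f}, interdisciplinary: {interdisciplinary:.2f}")
```

The payoff is that difficulty is shared across all papers from a journal, so you no longer need a pile of annotations on each individual item to estimate it.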

### 2 Responses to “Modeling Item Difficulty for Annotations of Multinomial Classifications”

1. Paul Mineiro (@PaulMineiro) Says:

Welinder et al. had a nice paper in which they embedded both the users and the examples in a vector space. This leads to a multi-dimensional concept of item difficulty. As you indicate, you need more ratings per item to support a richer concept of item difficulty, so I haven’t found this very compelling, but it does make cool pictures.

WelinderEtalNIPS10.pdf

Re: ordinal problems, I have a specialized technique just for that based upon polytomous Rasch.

http://www.machinedlearnings.com/2011/02/ordered-values-and-mechanical-turk-part.html

2. Bob Carpenter Says:

Thanks for the references. I’m very interested in the ordinal problem for graded linguistic judgments and survey responses.