## Need Another Label(s)!

It occurred to me while working through the math for my last post that there are situations when you not only need another label for an existing item, but need more than one to achieve an expected positive payoff.

### Prior and Posterior Odds Ratios

First, let’s summarize the result from adding another annotator. For a given image $i$, suppose the current odds (aka prior) of clean ($c_i = 1$) versus porn ($c_i = 0$) are $V:1$, which corresponds to a probability of being clean of $\mbox{Pr}[c_i =1] = V/(1+V)$. For instance, if the odds are 4:1 the image is clean, the probability it is clean is 4/5.

Now suppose we get a label $y_{i,j}$ from an annotator $j$ for image $i$. We update the odds to include the annotator's label, yielding new (posterior) odds $V'$ given by

$V' = V \times \mbox{Pr}[y_{i,j}|c_i=1] \, / \, \mbox{Pr}[y_{i,j}|c_i=0]$.

We just multiply the prior odds $V$ by the likelihood ratio of the annotation $y_{i,j}$ given that $c_i = 1$ or $c_i = 0$. The new probability of a clean image is thus $V'/(1+V')$.
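As a quick sketch, the update is a one-liner in Python. The function and variable names here are mine; `theta_1` and `theta_0` stand in for $\mbox{Pr}[y_{i,j}=1|c_i=1]$ and $\mbox{Pr}[y_{i,j}=1|c_i=0]$:

```python
def update_odds(V, y, theta_1, theta_0):
    """Multiply the prior odds of clean by the likelihood ratio of label y."""
    if y == 1:  # annotator says "clean"
        return V * theta_1 / theta_0
    else:       # annotator says "porn"
        return V * (1 - theta_1) / (1 - theta_0)

def odds_to_prob(V):
    """Convert odds V:1 to a probability."""
    return V / (1 + V)

# 4:1 odds that an image is clean corresponds to probability 4/5
assert odds_to_prob(4) == 0.8
```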

### The Porn Payoff

Suppose the payoff for the porn task is as follows. A porn image classified as clean has a payoff of -100, a porn image classified as porn has a payoff of 0, a clean image classified as clean has a payoff of 1, and a clean image classified as porn has a payoff of -1.

In this setup, we need odds of better than 100:1 (a bit more than 99% probability) that an image is clean before returning the decision that it's clean has a positive expected payoff. Returning the decision that an image is porn never has a positive expected payoff: the payoff is 0 for a true negative (classifying porn as porn) and -1 for rejecting a clean image, so unless we are 100% sure the image is porn, that decision carries an expected loss.

Now suppose that the prevalence of clean images is 20 in 21 (in practice, we'd work with an estimate). So we start with odds of 20:1. The decision to classify an image as clean before annotation has an expected payoff of [20*1 + 1*(-100)]/21, or about -4 per decision. The decision to classify an image as porn before annotation has an expected payoff of [20*(-1) + 1*0]/21, which is about -1 per decision.
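That expected-payoff arithmetic can be sketched in Python with exact fractions; the payoff table and function names below are mine, with the values taken from the setup above:

```python
from fractions import Fraction

# Payoffs from the setup above: (true category, decision) -> payoff.
PAYOFF = {
    ("clean", "clean"):    1,
    ("clean", "porn"):    -1,
    ("porn",  "clean"): -100,
    ("porn",  "porn"):     0,
}

def expected_payoff(V, decision):
    """Expected payoff of a decision when the odds of clean are V:1."""
    p_clean = Fraction(V, V + 1)
    return (p_clean * PAYOFF[("clean", decision)]
            + (1 - p_clean) * PAYOFF[("porn", decision)])

print(expected_payoff(20, "clean"))  # -80/21, about -4 per decision
print(expected_payoff(20, "porn"))   # -20/21, about -1 per decision
```

Note that `expected_payoff(V, "clean")` only goes positive once `V` exceeds 100, matching the 100:1 threshold above.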

### Need Multiple Annotators

Clearly we need to annotate an item before we can have a positive expected payoff.

Suppose for the sake of argument (and easy arithmetic) that we have an annotator with sensitivity 0.9 (they correctly classify 90% of the clean images as clean and reject 10% of the clean images as porn) and specificity 0.8 (they correctly classify 80% of the porn images as porn and let 20% through as clean).

In this context, we actually need more than one annotator to annotate an item before we get a positive expected payoff. We first need to work through our expectations properly. First, we start with 20:1 odds (probability 20/21) that an image is clean, so we can expect to see 20 clean images for each porn image. Then we have to look at the annotator's expected response. If the image is clean, there's a 90% chance the annotator says it's clean and a 10% chance they say it's porn. If the image is porn, there's an 80% chance they say it's porn and a 20% chance they say it's clean. That lets us work out the a priori expectations for each pairing of true category and response. For instance, the chance the image is clean and the annotator says it's porn is 20/21 * 0.1 = 2/21.
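Here is that joint expectation worked out with exact fractions (the variable names are mine; the sensitivity and specificity are the 0.9 and 0.8 from the example above):

```python
from fractions import Fraction

p_clean = Fraction(20, 21)                     # prior odds of 20:1
sens = Fraction(9, 10)                         # Pr[say clean | clean]
spec = Fraction(8, 10)                         # Pr[say porn | porn]

# Joint probability of each (true category, annotator response) pair.
joint = {
    ("clean", "clean"): p_clean * sens,
    ("clean", "porn"):  p_clean * (1 - sens),
    ("porn",  "porn"):  (1 - p_clean) * spec,
    ("porn",  "clean"): (1 - p_clean) * (1 - spec),
}

# e.g., the chance the image is clean AND the annotator says porn:
print(joint[("clean", "porn")])  # 2/21
assert sum(joint.values()) == 1
```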

We then calculate the updated odds under each possible annotator response and weight the results by our expectations before seeing the label.

I’ll let you work out the arithmetic, but the upshot is that with only one annotator, you can’t reach 100:1 odds or better that an image is clean. The key step is noting that we only get a positive expected payoff if we return that the image is clean, but even when the annotator provides a clean (1) label, the posterior odds are only 20/1 * 0.9/0.2 = 90:1. And don’t forget to factor in that we only land in the happy situation of getting a clean label about 87% of the time (20/21 * 0.9 + 1/21 * 0.2 ≈ 0.87), so the expected gain in payoff from the annotation is less than the improvement from 20:1 to 90:1 we get in the ideal case.
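To sketch the best case, suppose every annotator has the same sensitivity 0.9 and specificity 0.8 and all of them label the image clean, so each label multiplies the odds by the likelihood ratio 0.9/0.2 = 4.5 (the function name is mine):

```python
def posterior_odds(prior_odds, k, lr=0.9 / 0.2):
    """Odds of clean after k agreeing 'clean' labels, each with
    likelihood ratio lr = Pr[y=1|clean] / Pr[y=1|porn]."""
    return prior_odds * lr ** k

print(posterior_odds(20, 1))  # 90.0  -- one label falls short of 100:1
print(posterior_odds(20, 2))  # 405.0 -- two agreeing labels clear it
```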

In reality, you’re likely to have somewhat better porn/not-porn annotators than this, because it’s a relatively easy decision problem. But, as we mentioned last time, you’re also likely to have spammers, so the pool is really a mix of spammers and cooperative annotators.

### Unknown Annotators

I’ll say again that one of the pleasant properties of the hierarchical model extension of Dawid and Skene is that it allows us to predict the behavior of a new, unknown annotator. This is particularly useful in a Mechanical Turk setting, where we can’t choose our annotators directly (though we can feed items to an annotator if we write our own web app and they volunteer to do more).

For each posterior sample of the hyperpriors, we sample an annotator from the population distribution it defines, calculate what the outcome would be for that annotator, and average the results. What’s extra cool is that this captures all the extra dispersion in posterior inference that’s warranted by our uncertainty about the hyperpriors representing the population of annotators.
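Here’s a rough Monte Carlo sketch of that computation in Python. The hyperprior samples below are fabricated placeholders, not draws from a real fitted model, and parameterizing the annotator population with beta distributions over sensitivity and specificity is just one plausible choice for a Dawid-and-Skene-style hierarchical model:

```python
import random

# Made-up stand-ins for posterior draws of the hyperpriors: each draw gives
# (alpha, beta) parameters for the population distributions of annotator
# sensitivity and specificity.
posterior_hyper_samples = [
    {"sens": (18.0, 2.0), "spec": (8.0, 2.0)},
    {"sens": (20.0, 2.5), "spec": (9.0, 2.2)},
]

def predict_new_annotator(samples, n_draws=1000, rng=random.Random(0)):
    """For each hyperprior sample, draw annotators from the population it
    defines, then average their sensitivity and specificity."""
    sens_draws, spec_draws = [], []
    for s in samples:
        for _ in range(n_draws):
            sens_draws.append(rng.betavariate(*s["sens"]))
            spec_draws.append(rng.betavariate(*s["spec"]))
    return (sum(sens_draws) / len(sens_draws),
            sum(spec_draws) / len(spec_draws))

sens, spec = predict_new_annotator(posterior_hyper_samples)
print(sens, spec)  # expected accuracies for a brand-new annotator
```

Averaging over the hyperprior draws, rather than plugging in a point estimate, is what propagates the population-level uncertainty into the prediction.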