Active learning for classifiers dynamically orders items to be annotated by humans so as to train classifiers with as little human annotation effort as possible. Active learning typically starts with a few items chosen at random for annotation. After they’re annotated, one or more classifiers is trained using the annotated data. Then the next item to be annotated is chosen based on the behavior of the classifier(s). With one classifier, this is typically done by taking unlabeled items for which the classifier is most uncertain; with more than one classifier, disagreements among the classifiers are often used to rank items for annotation.

Suppose for simplicity that we have a binary classification problem, perfect annotators, and a single classifier. Furthermore, let’s assume we’ll settle for a greedy algorithm that always chooses one item at a time; it’s easy in principle, if not in notation and computation, to consider groups of items to annotate. As an item selection metric, I’m proposing **expected error reduction**.

If we have some items annotated, we can train a classifier. We can even use the unlabeled items if the classifier allows semi-supervised learning. With this probabilistic classifier, we can estimate `p[i] = Pr(c[i]=1|item[i])`, the probability that item `i` is of category 1.

The expected error under log loss for item `i` is `E(err[i]) = - p[i] * log(p[i]) - (1-p[i]) * log(1-p[i])`, which is just the loss if it’s category 1 times the likelihood that it’s category 1 plus the same for category 0. The total expected loss for the corpus, `E(err)`, is just the sum of the item losses, `Σ_i E(err[i])`.
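As a quick sketch (plain Python; the function names are mine, not from any library), the per-item expected loss is just the entropy of the classifier’s estimate, and the corpus loss is the sum:

```python
import math

def expected_item_error(p):
    """Expected log loss for an item with Pr(c=1|item) = p.

    This is the entropy: -p*log(p) - (1-p)*log(1-p).
    """
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain item contributes no expected loss
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def expected_corpus_error(probs):
    """Total expected loss: the sum of the per-item expected losses."""
    return sum(expected_item_error(p) for p in probs)

# A 50/50 item contributes log 2; a certain item contributes 0.
uncertain = expected_item_error(0.5)  # log(2) ≈ 0.693
certain = expected_item_error(1.0)    # 0.0
```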

The key step is estimating the effect of annotating item `i`. Consider labeling the item 1 and compute `E(err|c[i]=1)`, the new expected error for a corpus with the `i`-th item having category 1. Do the same for category 0. Then weight them by the classifier’s estimates. Labeling item `i` provides an expected error after the labeling of `p[i] * E(err|c[i]=1) + (1-p[i]) * E(err|c[i]=0)`. The next item to label is then the one with the largest expected error reduction.
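The greedy selection step might look like the following sketch (my own Python; `train` is a stand-in I’m assuming returns a classifier mapping an item to `Pr(c=1|item)` — retraining once per hypothetical label is exactly what makes this expensive):

```python
import math

def entropy(p):
    """Expected log loss for one item with Pr(c=1|item) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def expected_error_reduction(train, labeled, pool):
    """Greedily pick the unlabeled item whose annotation most reduces
    expected corpus error.  `train(labeled)` is assumed to return a
    function mapping an item to Pr(c=1|item)."""
    classify = train(labeled)
    current = sum(entropy(classify(x)) for x in pool)
    best_item, best_reduction = None, float("-inf")
    for i, x in enumerate(pool):
        p = classify(x)
        rest = pool[:i] + pool[i + 1:]  # item i is certain once labeled
        expected = 0.0
        for label, weight in ((1, p), (0, 1.0 - p)):
            # Retrain with the hypothetical label, re-score the rest.
            hypo = train(labeled + [(x, label)])
            expected += weight * sum(entropy(hypo(y)) for y in rest)
        reduction = current - expected
        if reduction > best_reduction:
            best_item, best_reduction = x, reduction
    return best_item

# Toy check with a made-up 1-D "classifier": smoothed label frequency
# on each side of a 0.5 threshold (purely illustrative).
def toy_train(labeled):
    def classify(x):
        group = [(y, c) for (y, c) in labeled if (y > 0.5) == (x > 0.5)]
        pos = sum(c for (_, c) in group)
        return (pos + 1.0) / (len(group) + 2.0)  # Laplace smoothing
    return classify

pick = expected_error_reduction(
    toy_train, [(0.1, 0), (0.9, 1)], [0.2, 0.8, 0.55])
```

Note that the expected error after labeling sums only over the remaining items, since a perfectly annotated item has no residual uncertainty.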

The beauty of this approach is that it takes all of our information into account. It considers the reduction in error from the example in hand, because that will have no uncertainty after it’s labeled. But it also considers the effect of labeling that item on all the other items. A danger of standard active learning is that it just keeps pulling out outlier examples that have high uncertainty but whose labeling doesn’t bear much on the overall problem. The estimated error reduction approach takes into account how much knowing the label for item `i` will affect inference about the entire corpus.

We can push as much Bayesian uncertainty in labeling through probabilistic supervision and evaluation as we have the computing power for.

The bummer’s that a naive implementation is quadratic in corpus size even if the training algorithm is linear (like logistic regression estimated by SGD). It’s easy to consider subsets to annotate in the same way; it just requires even more cycles to compute as the combinatorics of set selection interact with retraining.

The other shortcoming is that it doesn’t take into account expected noise from annotators. The noise issue is addressed in Victor Sheng et al.’s 2008 KDD paper “Get Another Label?”. It opens the problem up to considering not just new items, but getting further labels for existing items. The technically challenging step is to estimate annotator accuracy, though I think we have a good handle on that. But if you can’t re-use annotators whose accuracies have been inferred, then you need to be able to estimate the accuracy of a randomly chosen new annotator. Inference for new annotators is possible given the model I describe in Bayesian Hierarchical Models of Categorical Data Annotation.

Now all I need is an algorithm.

December 2, 2008 at 1:56 pm |

One can approximate the error through disagreement of several classification models (you can do this by bootstrapping for a single model, or one can get to this through an ensemble). Posterior uncertainty for a particular item is the Bayesian way of doing it.
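A minimal sketch of that committee idea (my own Python, not any particular library; `train` is assumed to return an item → Pr(c=1) classifier): bootstrap-resample the labeled data, train one model per resample, and rank pool items by vote entropy.

```python
import math
import random

def vote_entropy_ranking(train, labeled, pool, n_models=10, seed=0):
    """Rank pool items by committee disagreement (vote entropy).

    Each committee member is trained on a bootstrap resample of the
    labeled data; `train` is assumed to return an item -> Pr(c=1)
    classification function.
    """
    rng = random.Random(seed)
    committee = []
    for _ in range(n_models):
        resample = [labeled[rng.randrange(len(labeled))] for _ in labeled]
        committee.append(train(resample))

    def disagreement(x):
        votes = sum(1 for clf in committee if clf(x) > 0.5)
        p = votes / n_models
        if p <= 0.0 or p >= 1.0:
            return 0.0  # unanimous committee: no disagreement
        return -(p * math.log(p) + (1 - p) * math.log(1 - p))

    return sorted(pool, key=disagreement, reverse=True)

# Illustrative run with a trivial label-frequency "classifier".
def toy_train(labeled):
    pos = sum(c for (_, c) in labeled)
    p = (pos + 1.0) / (len(labeled) + 2.0)  # Laplace smoothing
    return lambda x: p

ranking = vote_entropy_ranking(toy_train, [(0.1, 0), (0.9, 1)], [0.2, 0.8])
```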

The problem is that such a method would concentrate effort on borderline cases that simply might not have a good classification.

It’s a complex problem – probably better solved for specific cases first and then generalized.

December 2, 2008 at 2:00 pm |

It’s the point Aleks brings up in his second paragraph that worries me about the posterior uncertainty selection method. What I was suggesting is looking at the expected effect on entire-corpus uncertainty of the next item(s) to tag. The proxies for this I’ve seen are based on lexical overlap with other entities for natural language classifiers. That is, take an uncertain item that’s a lot like other uncertain items, rather than just the single most uncertain item.

December 3, 2008 at 9:45 am |

One possibility I thought of is to introduce a wide outlier base class that is considered unlearnable, so learning it isn’t pursued.