Whew, that was a long title. Luckily, the paper’s worth it:
Settles, Burr. 2011. Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances. EMNLP.
It’s a paper that shows you how to use active learning to build reasonably high-performance classifier with only minutes of user effort. Very cool and right up our alley here at LingPipe.
The Big Picture
The easiest way to see what’s going on is with a screenshot of DUALIST, the system on which the paper is based:
It’s basically a tag-a-little, learn-a-little annotation tool for classifiers. I wrote something along these lines for chunk tagging (named entities, etc.) — you can find it in the LingPipe sandbox project citationEntities (called that because I originally used it to zone bibliogrphies in docs, citations in bibliographies and fields in citations). Mitzi just brought it up to date with the current LingPipe and generalized it for some large multi-part document settings.
In DUALIST, users provide two kinds of input:
- category classifications for documents
- words associated with categories
The left-hand-side of the interface presents a scrolling list of documents, with buttons for categories. There are then columns for categories with words listed under them. Users can highlight words in the lists that they believe are associated with the category. They may also enter new words that don’t appear on the lists.
Settles points out a difficult choice in the design. If you update the underlying model after every user choice, the GUI items are going to rearrange themselves. Microsoft tried this with Word, etc., arranging menus by frequency of use, and I don’t think anyone liked it. Constancy of where something’s located is very important. So what he did was let the user mark up a bunch of choices of categories and words, then hit the big submit button at the top, which would update the model. I did roughly the same thing with our chunk annotation interface.
There’s always a question in this kind of design whether to pre-populate the answers based on the model’s guesses (as far as I can tell, DUALIST does not pre-populate answers). Pre-populating answers makes the user’s life easier in that if the system is halfway decent, there’s less clicking. But it raises the possibility of bias, with users just going with what the system suggests without thinking too hard.
The underlying classification model is naive Bayes with a Dirichlet prior. Approximate inference is carried out using a maximum a posteriori (MAP) estimate of parameters. It’s pretty straightforward to implement naive Bayes this in a way that’s fast enough to use in this setting. The Dirichlet is conjugate to the multinomial so the posteriors are analytically tractable and the sufficient statistics are just counts of documents in each category and the count of words in documents of each category.
The innovation here is twofold.
The first innovation is that Settles uses EM to create a semi-supervised MAP estimate. As we’ve said before, it’s easy to use EM or some kind of posterior sampling like Gibbs sampling over a directed graphical model with any subset of its parameters or labels being unknown. So technically, this is straightforward. But it’s still a really good idea. Although semi-supervised classifiers are very popular, I’ve never seen it used in this kind of active-learning tagging interface. I should probably add this to our chunking tagger.
The second (and in my opinion more important) innovation is in letting users single out some words as being important words in categories. The way this gets pushed through to the model is by setting the component of the Dirichlet prior corresponding to the word/category pair to a larger value. Settles fits this value using held-out data rather than rolling it into the model itself with a prior. The results seem oddly insensitive to it, which surprised me (but see below).
Comments on the Classifier
Gadzooks! Settles seems to be missing the single biggest tuning parameter typically applied to naive Bayes — the document length normalizer. Perhaps he did this because when you document-length normalize, you no longer have a properly generative model that corresponds to the naive Bayes paradigm. But it makes a huge difference.
The LingPipe EM tutorial uses naive Bayes and the same 20 Newsgroups corpus as (Nigam, McCallum and Mitchell 2000) used for evaluation, and I showed the effect of document length normalization is huge (the Nigam et al. article has been cited nearly 2000 times!). You can do way better than Nigam et al.’s reported results by setting the document length norm to a small value like 5. (What document length norm’s doing is trying to correct for the lack of covariance and overdispersion modeling in naive multinomial document model — it’s the same kind of shenanigans you see in speech recognition in weighting the acoustic and language models and the same trick I just saw Kevin Knight pull out during a talk last week about decoding encrypted documents and doing machine translation.)
I think one of the reasons that the setting of the prior for important words has so little effect (see the performance figures) is that all of the priors are too high. If Settles really is starting with the Laplace prior (aka add 1), then that’s already too big for naive Bayes in this setting. Even the uniform prior (aka add 0) is too big. We’ve found that we need very small (less than 1) prior parameters for word-in-topic models unless there’s a whole lot of data (and the quantity Settles is using hasn’t gotten there by a long shot — you need to get enough so that the low counts dominate the prior before the effect of the prior washes out, so we’re talking gigabytes of text, at least).
Also, this whole approach is not Bayesian. It uses point estimates. For a discussion of what a properly Bayesian version of naive Bayes would look like, check out my previous blog post, Bayesian Naive Bayes, aka Dirichlet Multinomial Classifiers. For a description of what it means to be Bayesian, see my post What is Bayesian Statistical Inference?.
Confusion with Dirichlet-Multinomial Parameterization and Inference
There’s a confusion in the presentation of the Dirichlet prior and consequent estimation. The problem is that the prior parameter for a Dirichlet is conventionally the prior count (amount you add to the usual frequency counts) plus one. That’s why a prior of 1 is uniform (you add nothing to the frequency counts) and why a prior parameter of 2 corresponds to Laplace’s approach (add one to all frequency counts). The parameter is constrained to be positive, so what does a prior of 0.5 mean? It’s sort of like subtracting 1/2 from all the counts (sound familiar from Kneser-Ney LM smoothing?).
Now the maximum a posteriori estimate is just the estimate you get from adding the prior counts (parameter minus one) to all the empirical counts. It doesn’t even exist if the counts are less than 1, which can happen with Dirichlet parameter components that are less than 1. But Settles says he’s looking at the posterior expectation (think conditional expectation of parameters given data — the mean of the posterior distribution). The posterior average always exist (it has to given the bounded support here), but it requires you to add another one to all the counts.
To summarize, the mean of the Dirichlet distribution is
whereas the maximum (or mode) is
where the is read componentwise, so . This only exists if all .
That’s why a parameter of 1 corresponds to the uniform distribution and why a parameter of 2 (aka Laplace’s “add-one” prior) is not uniform.
Settles says he’s using Laplace and taking the posterior mean (which, by the way, is the “Bayesian point estimate” (oxymoron warning) minimizing expected square loss). But that’s not right. If he truly adds 1 to the empirical frequency counts, then he’s taking the posterior average with a Laplace prior (which is not uniform). This is equivalent to the posterior mode with a prior parameter of 3. But it’s not equivalent to either the posterior mode or mean with a uniform prior (i.e., prior parameter of 1).
Both documents and words are sorted by entropy-based active learning measures which the paper presents very clearly.
Documents are sorted by the conditional entropy of the category given the words in the model.
Word features are sorted by information gain, which is the reduction in entropy from the prevalence category distribution to the expected category distribution entropy conditoned on knowing the feature’s value.
Rather than sorting docs by highest classification uncertainty, as conditional entropy does, we’ve found it useful to sort docs by the lowest classification uncertainty! That is, we ask humans to label the docs about which the classifier is least uncertain. The motivation for this is that we’re often building high-precision (and relatively low recall) classifiers for customers and thus have a relatively high probability threshold to return a guess. So the higher ranked items are closer to the boundary we’re trying to learn. Also, we find in real world corpora that if we go purely by uncertainty, we get a long stream of outliers.
Settles does bring up the issue of whether using what’s effectively a kind of active learning mechanism trained with one classifier will be useful for other classifiers. We need to get someone like John Langford or Tong Zhang in here to prove some useful bounds. Their other work on active learning with weights is very cool.
I love the big submit button. What with Fitts’s law, and all.
I see a big problem with this interface for situations with more than a handful of categories. What would the full 20 Newsgroups look like? There aren’t enough room for more columns or a big stack of buttons.
Also, buttons seems like the wrong choice for selecting categories. These should probably be radio buttons to express the exclusivity and the fact that they don’t take action themselves. Typically, buttons cause some action.
Given the concluding comments, Settles doesn’t seem to know that you can do pretty much exactly the same thing in a “discriminative” classifier setting. For instance, logistic regression can be cast as just another directed graphical model with parameters for each word/category pair. So we could do full Bayes with no problem.
There are also plenty of online estimation procedures for weighted examples; you’ll need the weighting to deal with EM (see, e.g., (Karampatziakis and Langford 2010) for an online weighted training method that adjusts neatly for curvature in the objective; it’s coincidentally the paper I’m covering for the next Columbia Machine Learning Reading Group meeting).
The priors can be carried over to this setting, too, only now they’re priors on regression coefficients. See (Genkin, Lewis and Madigan 2007) for guidance. One difference is that you get a mean and a variance to play with.
Building it in LingPipe
LingPipe’s traditional naive Bayes implementation contains all that you need to build a system like DUALIST. Semi-supervised learning with EM is covered in our EM tutorial with naive Bayes as an example.
To solve some of the speed issues Settles brings up in the discussion section, you can always thread the retraining in the background. That’s what I did in the chunk tagger. With a discriminative “online” method, you can just keep cycling through epochs in the background, which gives you the effect of a hot warmup for subsequent examples. Also, you don’t need to run to convergence — the background models are just being used to select instances for labeling.