Semi-supervised learning involves training a classifier, tagger, or other model using a combination of labeled and unlabeled data. The challenge is to get some mileage out of the unlabeled data over a fully supervised baseline that uses only the labeled data. This is particularly helpful in an application context where you need to build classifiers on the fly with only a few handfuls of training examples.
Literature Survey
I’d highly recommend this very nicely written paper on using expectation maximization (EM) for semi-supervised learning:
- Kamal Nigam, Andrew McCallum, and Tom M. Mitchell. 2006. Semi-Supervised Text Classification Using EM. In O. Chapelle, B. Schölkopf, and A. Zien (Eds.), Semi-Supervised Learning. MIT Press.
They evaluate a semi-supervised naive Bayes classifier over the 20 Newsgroups data. This one figure and caption pretty much sums up the results:
As usual, they found EM-based training most useful when the labeled data is small.
The Algorithm
The basic EM algorithm for semi-supervised classification is quite simple:
classifier <- new classifier
train classifier on labeled data
repeat {
    lastClassifier <- classifier
    classifier <- new classifier
    train classifier on labeled data
    for (item in unlabeled data) {
        Expectation (E) step: compute E = p(cat|item) using lastClassifier
        Maximization (M) step: train classifier on (item,cat) with weight E
    }
} until (converged)
First train on the labeled data; then, for each unlabeled item, compute the conditional probability of each category given the item (these conditional probabilities act as expected category counts, hence the name “E step”), and train the new classifier on the item with these probabilities as weights (training is assumed to be a maximization, hence the “M step”). Sometimes only sufficient statistics are collected inside the inner for-loop, with the M step moved outside the loop.
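The loop is easy to sketch end to end. Everything below (NaiveBayes, emTrain, the smoothing constant) is made up for illustration and is not LingPipe’s API; it’s just a tiny fractional-count naive Bayes text classifier driven by the EM loop above:

```java
import java.util.*;

/** Toy fractional-count naive Bayes; illustrative names, not LingPipe's API. */
class NaiveBayes {
    static final double SMOOTH = 0.5;            // add-lambda smoothing
    final int numCats;
    final double[] catCount;                     // fractional category counts
    final List<Map<String,Double>> wordCount;    // fractional word counts per category
    final Set<String> vocab = new HashSet<>();

    NaiveBayes(int numCats) {
        this.numCats = numCats;
        catCount = new double[numCats];
        wordCount = new ArrayList<>();
        for (int c = 0; c < numCats; c++)
            wordCount.add(new HashMap<>());
    }

    // M-step contribution: train on one item with a fractional weight.
    void train(String[] words, int cat, double weight) {
        catCount[cat] += weight;
        for (String w : words) {
            vocab.add(w);
            wordCount.get(cat).merge(w, weight, Double::sum);
        }
    }

    // E step: p(cat|words) by Bayes' rule, computed in log space.
    double[] condProbs(String[] words) {
        double[] logp = new double[numCats];
        double totalCat = Arrays.stream(catCount).sum() + SMOOTH * numCats;
        for (int c = 0; c < numCats; c++) {
            logp[c] = Math.log((catCount[c] + SMOOTH) / totalCat);
            double totalWords = wordCount.get(c).values().stream()
                .mapToDouble(Double::doubleValue).sum() + SMOOTH * vocab.size();
            for (String w : words)
                logp[c] += Math.log(
                    (wordCount.get(c).getOrDefault(w, 0.0) + SMOOTH) / totalWords);
        }
        double max = Arrays.stream(logp).max().getAsDouble();
        double z = 0.0;
        double[] p = new double[numCats];
        for (int c = 0; c < numCats; c++) { p[c] = Math.exp(logp[c] - max); z += p[c]; }
        for (int c = 0; c < numCats; c++) p[c] /= z;
        return p;
    }

    // The EM loop from the pseudocode: retrain from scratch each epoch,
    // weighting each unlabeled item by the previous model's p(cat|item).
    static NaiveBayes emTrain(String[][] labeled, int[] cats,
                              String[][] unlabeled, int numCats, int epochs) {
        NaiveBayes classifier = new NaiveBayes(numCats);
        for (int i = 0; i < labeled.length; i++)
            classifier.train(labeled[i], cats[i], 1.0);
        for (int epoch = 0; epoch < epochs; epoch++) {
            NaiveBayes last = classifier;
            classifier = new NaiveBayes(numCats);
            for (int i = 0; i < labeled.length; i++)
                classifier.train(labeled[i], cats[i], 1.0);
            for (String[] item : unlabeled) {
                double[] e = last.condProbs(item);       // E step
                for (int c = 0; c < numCats; c++)
                    classifier.train(item, c, e[c]);     // weighted M step
            }
        }
        return classifier;
    }
}
```

The payoff is that a word appearing only in the unlabeled data (say “score” co-occurring with “ball”) gets pulled toward the category its unlabeled neighbors favor, which is exactly the effect Nigam et al. exploit.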
Convergence is usually measured by some error metric failing to improve, typically negative log likelihood. For a purely conditional model, this has to be the log likelihood of the categories given the inputs; for a joint model, it’s typically the joint log likelihood. The error can be calculated over the supervised data, over the supervised and unsupervised data together, or over some held-out evaluation set.
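A relative-improvement convergence test is a one-liner; the class and parameter names here are hypothetical, not LingPipe’s:

```java
// Hypothetical convergence check by relative improvement in log likelihood.
class Convergence {
    static boolean converged(double lastLogLik, double logLik,
                             double minRelativeImprovement) {
        // log likelihoods are non-positive and should rise toward 0 each epoch
        double relImprovement = (logLik - lastLogLik) / Math.abs(lastLogLik);
        return relImprovement < minRelativeImprovement;
    }
}
```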
API Design Issues
I believe the term of art for this kind of vexing design problem is “Ay, Carumba!”
The algorithm’s fully general and works on any classifier that computes conditional probabilities, which in LingPipe means an instance of Classifier<E,? extends ConditionalClassification>. It requires the classifier to be trainable with weights, but this can be finessed, and quantized at the same time, by multiplying the E values by a number, say 100, rounding, and training as if there were that many instances (making sure to adjust smoothing parameters to account for the multiplicative effect). In LingPipe, that covers the Bernoulli, naive Bayes, character and token-based language-model classifiers, K-nearest-neighbors, and logistic regression.
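The quantization fudge itself is trivial; quantize is a made-up helper, not a LingPipe method:

```java
// Sketch of the quantization fudge: scale a conditional probability into an
// integer instance count for a count-based trainer. Smoothing parameters
// must be scaled by the same factor to compensate for the inflated counts.
class Quantizer {
    static int quantize(double condProb, int scale) {
        // e.g. p(cat|item) = 0.637 with scale 100 trains as 64 instances
        return (int) Math.round(condProb * scale);
    }
}
```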
But we don’t need a classifier, we need a classifier factory, because we create new classifiers within the algorithm, so now we’re dealing with type Factory<Classifier<E,? extends ConditionalClassification>>.
But wait, that’s not all. We also need that classifier to be trainable, which we can approximate by requiring it to implement ClassificationHandler<E,Classification>. So what we really need to pull out of the factory is:
Factory<? extends Classifier<E,? extends ConditionalClassification> & ClassificationHandler<E,Classification>>
To truly encapsulate the entire EM algorithm, we need to take the data as input. The supervised data will need to be an instance of Corpus<ClassificationHandler<E,Classification>> if we’re going to use the quantization fudge for generality, or Corpus<ClassificationHandler<E,ConditionalClassification>> if we aren’t. The unsupervised data is much simpler, being an instance of Corpus<ObjectHandler<E>>.
Next, because we have the whole convergence loop inside, we need to set up a maximum number of epochs and a minimum amount of relative improvement in log likelihood to consider the result converged. As in our other algorithms, we’ll need to monitor feedback, so that leaves us with something like:
static <E> Classifier<E,? extends ConditionalClassification>
    emTrain(Factory<? extends Classifier<E,? extends ConditionalClassification>
                    & ClassificationHandler<E,Classification>> classifierFactory,
            Corpus<ClassificationHandler<E,ConditionalClassification>> labeledData,
            Corpus<ObjectHandler<E>> unlabeledData,
            int maxEpochs,
            double minRelativeImprovement,
            PrintWriter progressWriter) { ... }
Now let’s hope that everything here implements the same generic type E, or we’ll need to complicate things with (? super E) or (? extends E) depending on the polarity of the context. And keep in mind we’re assuming some form of general convergence measurement; if that needs to be configurable, we’re in for an even more complex method.
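The polarity point is just the usual producer/consumer rule for Java wildcards; this is a plain-Java illustration, unrelated to LingPipe’s API:

```java
import java.util.*;
import java.util.function.Consumer;

// Producers of E take (? extends E); consumers of E take (? super E).
class Wildcards {
    static <E> void handleAll(List<? extends E> items,
                              Consumer<? super E> handler) {
        for (E item : items)
            handler.accept(item);  // items produce E values; handler consumes them
    }
}
```

The same pressure shows up in emTrain: corpora that feed items to a handler want the handler’s type to be contravariant in E, while the classifier that produces classifications wants covariance.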
So far, we haven’t included any kind of annealing schedule, which Nigam et al. found to be useful in encouraging EM to converge reliably without getting stuck in local optima. Our logistic regression API already has a flexible annealing schedule abstract base class, so we could always include an implementation as a further argument.
I’m thinking it’d be easier, and it’ll certainly be more flexible, to just write the EM algorithm and monitor it on the outside rather than writing all the implementations required for the EM-training method. I’m already planning a simplified API for our existing implementations of iterative methods: regularized logistic regression, latent Dirichlet allocation (LDA), and singular value decomposition (SVD).