I promised to use “estimation” instead of “learning”, but I’m torn when it comes to standard terminology like “semi-supervised learning”, of which statistical estimation is only one approach.

My last post on semi-supervised learning considered the expectation maximization (EM) algorithm and how it could be integrated into LingPipe. In this post, I’m going to do the same for Gibbs sampling.

The algorithm is just as simple as EM, if not simpler:

```
classifier <- new classifier
train classifier on labeled data
for (item in unlabeled data) {
    increment data: item + random label
}
for (epoch in 1 to numEpochs) {
    for (item in unlabeled data) {
        decrement data: item + label[epoch-1,item]
        sample: label[epoch,item] from p(label|item)
        increment data: item + label[epoch,item]
    }
}
```

The algorithm has the same basic structure as EM: start by training a model on the labeled data, then assign random labels to the unlabeled data points and add them to the training counts. Next, iterate over all of the unlabeled data several times. In each epoch, the Gibbs sampler visits each item once: it first decrements the counts for the label from the previous epoch (not necessary in the first epoch, of course), then samples a label from the current model, then increments the current model’s data counts with the sampled label.
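The loop above can be sketched as a minimal runnable program. The `CountClassifier` here, with its crude add-alpha smoothing, is my own toy stand-in for a count-based classifier, not LingPipe’s API; the key operations are the `increment`/`decrement` pair the sampler needs.

```python
import random
from collections import defaultdict

# Toy count-based classifier (hypothetical, for illustration only).
class CountClassifier:
    def __init__(self, labels, alpha=1.0):
        self.labels = list(labels)
        self.alpha = alpha                       # add-alpha smoothing
        self.label_counts = defaultdict(float)   # c(label)
        self.item_counts = defaultdict(float)    # c(label, item)

    def increment(self, item, label):
        self.label_counts[label] += 1.0
        self.item_counts[(label, item)] += 1.0

    def decrement(self, item, label):
        self.label_counts[label] -= 1.0
        self.item_counts[(label, item)] -= 1.0

    def conditional(self, item):
        # crude smoothed score standing in for p(label) * p(item|label)
        w = [(self.label_counts[lab] + self.alpha)
             * (self.item_counts[(lab, item)] + self.alpha)
             for lab in self.labels]
        z = sum(w)
        return [x / z for x in w]

def gibbs(labeled, unlabeled, labels, num_epochs, rng):
    clf = CountClassifier(labels)
    for item, label in labeled:                  # train on labeled data
        clf.increment(item, label)
    assigned = [rng.choice(labels) for _ in unlabeled]
    for item, lab in zip(unlabeled, assigned):   # random initial labels
        clf.increment(item, lab)
    for _ in range(num_epochs):
        for i, item in enumerate(unlabeled):
            clf.decrement(item, assigned[i])     # remove last epoch's label
            probs = clf.conditional(item)
            assigned[i] = rng.choices(labels, weights=probs)[0]
            clf.increment(item, assigned[i])     # add the sampled label
    return clf, assigned

rng = random.Random(0)
clf, assigned = gibbs([("a", "pos"), ("b", "neg")],
                      ["a", "a", "b", "c"],
                      ["pos", "neg"], num_epochs=10, rng=rng)
```

Note that after every sweep the total counts are conserved: each decrement is paired with an increment, so the model always reflects exactly one label per item.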

Unlike EM, Gibbs sampling doesn’t create a new model in each epoch. If we can compute the expectation (E) step of the EM algorithm, we can trivially sample by ordering the labels, generating a random number in [0,1], and choosing the corresponding label based on cumulative probability [in fact, this is a clever way to do generation given an arithmetic coding]. The extra bit that Gibbs sampling needs is the ability to decrement counts in models.
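The cumulative-probability trick can be written in a few lines; this is a generic inverse-CDF sketch, with the label ordering and probabilities as inputs (they would come from the E-step computation):

```python
import random

# Sample a label by walking the cumulative distribution until it
# exceeds a uniform draw u in [0,1).
def sample_label(labels, probs, rng):
    u = rng.random()
    cumulative = 0.0
    for label, p in zip(labels, probs):
        cumulative += p
        if u < cumulative:
            return label
    return labels[-1]   # guard against floating-point shortfall

rng = random.Random(42)
draws = [sample_label(["a", "b", "c"], [0.2, 0.5, 0.3], rng)
         for _ in range(10000)]
```

Over many draws, each label’s empirical frequency should approach its probability.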

Sometimes Gibbs sampling is run in batches where all of the labels are sampled given the previous model and re-incremented; in this setup, you can use the same sort of model generation as in EM to avoid having to decrement, and can have the same freedom to define complex maximization (M) steps. If the estimators have simple sufficient statistics, such as most of the exponential family, like multinomials with conjugate Dirichlet priors or Gaussians, the models can be fairly easily updated online for each item sampled.
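To make the conjugacy point concrete, here is a sketch (my own illustration, not any library’s API) of a multinomial with a Dirichlet prior: the posterior predictive is just a ratio of smoothed counts, so incrementing or decrementing one observation is an O(1) update.

```python
from collections import defaultdict

# Multinomial with a symmetric Dirichlet(alpha) prior; the collapsed
# posterior predictive is (c(outcome) + alpha) / (total + alpha * K).
class DirichletMultinomial:
    def __init__(self, num_outcomes, alpha=1.0):
        self.alpha = alpha
        self.k = num_outcomes
        self.counts = defaultdict(float)
        self.total = 0.0

    def increment(self, outcome):
        self.counts[outcome] += 1.0
        self.total += 1.0

    def decrement(self, outcome):
        self.counts[outcome] -= 1.0
        self.total -= 1.0

    def predictive(self, outcome):
        return (self.counts[outcome] + self.alpha) / (self.total + self.alpha * self.k)

dm = DirichletMultinomial(num_outcomes=3, alpha=1.0)
for o in [0, 0, 1]:
    dm.increment(o)
p_before = dm.predictive(0)   # (2+1)/(3+3) = 0.5
dm.decrement(0)
p_after = dm.predictive(0)    # (1+1)/(2+3) = 0.4
```

No refitting is needed after a decrement; the predictive probability is always exact for the current counts.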

The result of Gibbs sampling is a (Markov) chain of samples for every parameter in `θ`, including the missing data labels (the missing data labels by themselves constitute a multiple imputation of the missing data). In practice, multiple Gibbs samplers are run, and their resulting sequences of estimates are checked for convergence. Typically the earlier epochs, before the multiple chains begin to mix well, are discarded. EM is deterministic, but requires several starting points to be evaluated to ensure convergence to the global maximum (unless the loss is convex, in which case only one run is necessary).
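One common convergence check along these lines is a Gelman-Rubin-style statistic comparing within-chain and between-chain variance for a scalar parameter; values near 1.0 suggest the chains have mixed. This is a bare-bones sketch of the split-free version of that statistic, not a substitute for a real diagnostics package:

```python
import statistics

# Gelman-Rubin-style R-hat for one scalar parameter across m chains
# of n samples each.
def r_hat(chains):
    n = len(chains[0])
    m = len(chains)
    means = [statistics.fmean(c) for c in chains]
    grand = statistics.fmean(means)
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)   # between-chain
    w = statistics.fmean(statistics.variance(c) for c in chains)  # within-chain
    var_plus = (n - 1) / n * w + b / n   # pooled posterior variance estimate
    return (var_plus / w) ** 0.5

mixed = r_hat([[1, 2, 3, 4], [2, 3, 1, 4]])        # chains agree
split = r_hat([[0, 0.1, 0, 0.1], [10, 10.1, 10, 10.1]])  # chains disagree
```

Chains sampling the same distribution give values near (or, with short chains, slightly below) 1.0, while chains stuck in different modes give large values.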

In either case, we want to take training cases (both supervised and unsupervised) `X` and draw inferences about new data `Y`, which we write as `p(Y|X)`. In both cases, we have the same parametric likelihood function `p(Y|θ)`.

The EM algorithm produces a maximum likelihood estimate `θ*` of the model parameters (or a maximum a posteriori estimate, if we have a non-uniform prior `p(θ)`). This estimate is then used to reason about future cases, approximating `p(Y|X)` by `p(Y|θ*)`.

The full definition of `p(Y|X)` requires the evaluation of an integral: `∫_Θ p(Y|θ) p(θ|X) dθ`. This integral averages the predictions `p(Y|θ)` weighted by the probability `p(θ|X)` of the parameters.

In Gibbs sampling, we get a sequence of samples of parameter values `θ^(1), ..., θ^(N)`. We use these to approximate the integral defining `p(Y|X)` as `Σ_n p(Y|θ^(n)) / N`.
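This Monte Carlo average is a one-liner once the likelihood is in hand. In this sketch each sample is just a Bernoulli parameter, purely for illustration:

```python
# Approximate p(Y|X) as the average of p(Y|theta) over retained
# Gibbs samples theta^(1), ..., theta^(N).
def predictive(y, theta_samples):
    def likelihood(y, theta):
        # Bernoulli likelihood: p(1|theta) = theta, p(0|theta) = 1 - theta
        return theta if y == 1 else 1.0 - theta
    return sum(likelihood(y, t) for t in theta_samples) / len(theta_samples)

samples = [0.4, 0.5, 0.6]        # stand-in for post-burn-in samples
p_one = predictive(1, samples)   # (0.4 + 0.5 + 0.6) / 3 = 0.5
```

Averaging the likelihoods (rather than averaging the parameters and plugging in) is what carries the parameter uncertainty through to the prediction; for this symmetric example the two happen to coincide, but in general they do not.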

The motivation for this more involved reasoning is that it incorporates all of our uncertainty about the parameter values `θ`, whereas EM places all of the weight on the maximum likelihood value. In some cases, the maximum likelihood parameters give a good approximation of `p(Y|X)`, and in other cases they don’t. Gibbs sampling will approximate `p(Y|X)` to arbitrary precision given enough samples; in practice, 100 to 1000 are often used.

The question is, will it matter in practice?

December 18, 2008 at 4:50 pm |

Jianfeng Gao and I looked at this for unsupervised POS tagging using an HMM model in our paper at this year’s EMNLP conference.

We compared four different kinds of Gibbs samplers to EM and Variational Bayes (which is an EM-like procedure, but estimates an approximation to the full posterior). Regarding your question, we found that Gibbs sampling produced markedly better solutions on small training data sets than either EM or Variational Bayes, but when we trained on 1,000,000 the difference largely disappeared (not surprising). Variational Bayes also scales much better to large data sets, which I guess is also not surprising. It would be interesting to know if this holds true for other domains as well.

December 19, 2008 at 11:04 am |

Nice. Gao and Johnson’s paper is very much on topic. It gives me even more motivation to try to understand the variational methods — the functional diff eqs are a bit frightening.

What I was thinking of doing is something very similar, only with varying amounts of labeled data.

I’m expecting that with more and more unlabeled data, the labeled data will become less significant. That is, the labeled data acts just like a prior (in fact, you can often code up priors using fake data points). On the other hand, training the labeled data first may drive the optimizers to a local maximum from which they can’t escape (which might be good for practical, if not theoretical, reasons in this example).