Large-scale and online classification problems require a classifier that allows online training. By large-scale, I mean problems so large they won’t fit in memory. For this blog entry, size doesn’t really matter. I’m worried about honest-to-goodness online requirements, which force a learner to get a training example, classify it, get feedback, do some learning, then forget it, all in a constant amount of time per example.

Client-side spam detection and many server-side solutions would require a truly online algorithm in this sense. So it’s perhaps not surprising that Goodman and Yih of Microsoft Research used an online approach in their paper Online Discriminative Spam Filter Training. They use standard binomial logistic regression (max likelihood, not regularized) with stochastic gradient descent. They don’t give many details about their implementation other than say it was a naive Perl implementation that was fast enough for doing evaluations.

Your typical “online” learner requires training examples to be processed one-at-a-time. This solves the large-scale problem, in that you only need to bring one example into memory at a time, assuming the parameters fit in memory. Even so, online learners like Bottou’s SGD and our own logistic regression still load all the data into memory, and then make multiple passes over it.

Langford, Li and Strehl’s Vowpal Wabbit doesn’t load all the data into memory. VW even bounds the size of the features so a constant stream of new features won’t blow out memory (see my previous blog entry on using hash codes as features). Unfortunately, VW runs from the command-line and requires the training examples to be in a data file before the program starts. This could be modified, because the good folks at Yahoo! released it with source under a BSD license.

Handling logistic regression in an online setting **without** regularization (priors) is straightforward. You just use stochastic gradient descent and take the loop over the data out of the algorithm and make it a function that updates the parameters. All you need is a set of parameter vectors for coefficients that’s resizable. This could either be done by making them map-like structures (presumably what Goodman and Yih did in Perl), or resizing them like Java’s `util.ArrayList`

. The latter is preferable from a size and speed point of view. The algorithm won’t converge in a single pass the way the multiple pass algorithms do, but it’ll be fast and quite straightforward.

This leaves two problems. First, how do we handle the learning rate? We can just leave it as a constant. Annealing doesn’t seem to make much sense, as the later examples will count less and less. This may be just the opposite of what you want in an online learning setting, where more recent information may be more representative of what’s to follow. In an API, we can just have a method to set the learning rate at any given point.

The second problem is what to do about regularization. In the standard Bayesian maximum a posteriori estimation setting, the prior is fixed ahead of time and contributes a single error term for all of the data. In a setting where you know the number of training instances, it’s easy to handle. You just regularize at a rate of 1/numberOfTrainingExamples per example. Regularization can be done lazily to preserve sparseness, as I point out in my tech report on lazy regularized SGD.

What I don’t know is how to do regularization online. What I’ve done in a preliminary implementation of truly online logistic regression is set a per-instance regularization discount. For instance, in a corpus with 100 training instances, that’d be 1/100.

But what should the regularization discount factor be for an unbounded number of instances when you have to apply it online? If we leave this discount at a constant, it has the effect of reducing the variance of the prior as the number of training examples increases. Normally, the data would outweigh any fixed prior. But with a constant regularization discount factor, the prior keeps up at its regular pace for all the data. This might even be desirable, as it will keep tamping down the coefficients. For instance, it’ll guarantee that a feature seen only once will eventually go to zero under a Laplace prior (and approach zero with a Gaussian or Cauchy prior).

It’d also be nice to be able to serialize the learner at any point, as well as compile it to a fixed representation, which catches up all of the lazy online regularization.

## Leave a Reply