No idea — you should ask the people who wrote the software, not us.

datasets. I understand that the Pegasos implementation currently only

produces the model file. How can I predict the actual class for a test

set? Can I use svm_classify or svm_perf_classify? Or what is the

formula I should use to write a predictor of my own? Thanks a lot

in advance.

Thanks,

Venkatesh
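For a linear model like the one Pegasos learns, prediction is just the sign of the inner product of the weight vector with the feature vector (plus a bias term, if the model has one). A minimal sketch, assuming you can read the learned weights out of the model file yourself; the function and variable names here are mine, not from any of these packages:

```python
import numpy as np

def predict(w, X, b=0.0):
    """Predict +1/-1 classes for a linear SVM via sign(w . x + b).

    w : weight vector learned by Pegasos (read from the model file)
    X : (n_examples, n_features) test matrix
    b : optional bias term (0.0 if the model has none)
    """
    scores = X @ w + b
    return np.where(scores >= 0, 1, -1)

# Example with a hypothetical 3-feature model:
w = np.array([0.5, -0.2, 0.1])
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])
print(predict(w, X))  # scores are 0.7 and -0.6, so classes +1 and -1
```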

1. “Oh, we have a huge amount of data, so online is effective”.

2. Online learning needs a random sample at each step.

3. In experiments, we loaded the whole data set into memory, so random access became easy.

But hey, if the data fits into memory, then it's not really "large". If the data has to reside on the hard drive, the CPU has to wait for the bus to fetch each example, and we lose all the benefit of the cheap updates in online learning.

Even worse, the data can be distributed across different data centers. Do we really want to move it all to some central node?

In such scenarios, a MapReduce framework for batch learning is much more effective.
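For reference, the per-step cost being debated here is the Pegasos update itself: pick one random example, check the margin, and do a rank-one update with step size 1/(λt). A minimal sketch in Python (names are mine; the optional projection onto the ball of radius 1/√λ is omitted):

```python
import numpy as np

def pegasos_train(X, y, lam=0.1, steps=1000, seed=0):
    """Minimal Pegasos SGD for a linear SVM (no bias, no projection step).

    Each iteration samples one random example -- the random access that
    is cheap in RAM but expensive on disk, as argued above.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, steps + 1):
        i = rng.integers(n)            # random sample at each step
        eta = 1.0 / (lam * t)          # Pegasos step-size schedule
        if y[i] * (w @ X[i]) < 1:      # margin violated: hinge subgradient
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:                          # only the regularizer contributes
            w = (1 - eta * lam) * w
    return w
```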

The real key is that it doesn't need the babysitting.

In practice, we're always evaluating different regularization parameters and different feature sets along with everything else, so being able to run Pegasos in parallel would make that search more efficient.

I think which method wins these contests almost always depends on the exact shape of the problem (sparsity, size, matrix conditioning, etc.).

Empirical story #1: I ran into basically the same convergence issues you describe when I implemented SGD for L2-regularized linear regression. My data set was bag-of-words NLP, but somewhat small (document titles: 40k examples, 20k features). Lots of tweaking changed the solution when using a convergence threshold, but held-out accuracy (proportion of variance explained) didn't always change much; it's pretty computationally expensive to find that out, though!
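A minimal sketch of the kind of setup being described: SGD for ridge (L2-regularized) linear regression with a convergence threshold on the per-epoch change in the weights. All names and hyperparameter values here are illustrative, not from the original experiment:

```python
import numpy as np

def sgd_ridge(X, y, lam=1e-3, lr=0.01, tol=1e-6, max_epochs=500, seed=0):
    """SGD for L2-regularized linear regression, stopping when the
    weight vector moves less than `tol` over a full epoch.
    Illustrative sketch only; tweaking lr/tol changes the solution."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for epoch in range(max_epochs):
        w_old = w.copy()
        for i in rng.permutation(n):
            err = X[i] @ w - y[i]                # squared-error residual
            w -= lr * (err * X[i] + lam * w)     # loss gradient + L2 term
        if np.linalg.norm(w - w_old) < tol:      # convergence threshold
            break
    return w
```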

I would love a magical method that required less tweaking and babysitting: less staring at an accuracy-versus-time graph, intuiting whether it has converged, while weighing that against a qualitative gut feeling about whether to wait longer. Quantifying the time cost is a first step, of course; but more broadly, I just don't have much confidence in the reproducibility of any of these algorithms. Pegasos sounds appealing for that reason, though if your annealing strategy plus parallelism can beat it, so be it.

Story #2: I was using random forests recently and tried using out-of-bag (OOB) error to figure out how many bootstrap rounds I needed before convergence. (Like SGD with a slow enough learning rate, bagging and random forests are guaranteed to converge given enough time, so this is another computation-versus-accuracy tradeoff.) OOB estimation is a neat trick, but it seems to overestimate how far you have to go relative to held-out accuracy. I really don't understand why. The original Breiman papers on random forests all gloss over this.
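For anyone who wants to see the OOB-versus-held-out gap for themselves, here is one way to track both as trees are added, using scikit-learn's `warm_start` so only the new trees are grown each round. This is a sketch on synthetic data, not the original experiment:

```python
# Compare OOB error to held-out accuracy as the forest grows.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(oob_score=True, warm_start=True, random_state=0)
for n in (50, 100, 200):
    rf.n_estimators = n              # warm_start: only new trees are fit
    rf.fit(X_tr, y_tr)
    print(n, round(rf.oob_score_, 3), round(rf.score(X_te, y_te), 3))
```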

LingPipe has a Java implementation of sparse, regularized, truncated stochastic gradient descent in:

`com.aliasi.stats.LogisticRegression`

It’s wrapped up for classification applications with programmatic feature extraction in:

`com.aliasi.classify.LogisticRegressionClassifier`.

The code’s available in the LingPipe distribution or online. The estimation’s all done in the source file:

`com/aliasi/stats/LogisticRegression.java`.

I also wrote up an SGD-for-logistic-regression white paper that goes over my rediscovery of Tong Zhang et al.'s truncated gradient with regularization (Normal, Laplace, or Cauchy priors). I also work through all the derivatives, etc., from scratch (I write these things for myself as I'm learning a field).
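As a rough illustration of the truncated-gradient idea (my own Python sketch, not LingPipe's Java code): take the usual logistic-regression gradient step, then shrink each weight toward zero by the Laplace-prior (L1) penalty, clipping at zero so the penalty can zero a weight out but never flip its sign:

```python
import numpy as np

def sgd_logistic_laplace(X, y, lam=0.01, lr=0.1, epochs=100, seed=0):
    """Illustrative SGD for logistic regression with a Laplace (L1) prior
    applied via truncation. Labels y are assumed to be 0/1."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            p = 1.0 / (1.0 + np.exp(-(X[i] @ w)))   # predicted P(y=1 | x)
            w += lr * (y[i] - p) * X[i]             # log-likelihood gradient
            # Truncated-gradient step: shrink toward zero, clip at zero.
            w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w
```

The truncation is what keeps the solution sparse: weights that the likelihood gradient cannot keep pushing away from zero get clipped exactly to zero.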

Thanks for linking to the papers.

You say you’ve been implementing these. C++? Java?
