I found Bottou’s suggestion of finding the best learning rate and then going with it to be unworkable, because the line search for the best rate is highly non-convex: there are good solutions at all sorts of distances that work well for the next 100K examples but badly for the 100K examples after that.

I went to the extreme of developing a GUI-based control where **I** could drive the rate, but I really couldn’t do any better than the exponential schedule (and who wants to spend a couple of minutes doing this?). It is informative to see what’s going on with the GUI, though. If you leave the rate constant, improvements gradually dwindle to nothing, then pick up again as the rate is lowered.
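For concreteness, here’s a minimal sketch of the kind of exponential schedule I mean; the function name, base rate, and decay factor are all illustrative, not values from the actual experiments:

```python
def exp_decay_rate(base_rate, decay, epoch):
    """Learning rate used during the given epoch (0-indexed)."""
    return base_rate * decay ** epoch

rates = [exp_decay_rate(0.1, 0.8, e) for e in range(5)]
# each epoch's rate is 0.8 times the previous one
```

The point is that the schedule is fixed up front, with no per-epoch line search, which is exactly what the GUI experiment failed to beat.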

I found classifier accuracy wasn’t very sensitive to how many decimal places of accuracy were in the coefficients, so in practice, this may not matter much.

The problems we’re working on are pretty unstable in the sense of being nearly separable on some folds, so replication is really critical to get any reliable estimate of held-out performance.

Luckily, with the Laplace prior, the cross-validation accuracies have much less variance across random folds than with maximum likelihood or even with Gaussian priors of the same variance.

The relation of parameters to training sample sizes is a big concern with cross-validation, especially since the usual advice is to fix hyperparameters with cross-validation and then train on all the data (not just the size of a fold).

One thing we want to do is to tackle the prior inference in a proper Bayesian way instead of going through CV. But there is some dependence on the parameters of the CV itself: a 2-fold CV will favor a more informative prior than leave-one-out.

Oh yes, one more thing. I stabilize CV by replicating it many times. For example, I do 5 runs of 5-fold CV, amounting to 25 test/train splits.
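That replication scheme can be sketched in a few lines of pure Python; the function name and the seeding are my own for illustration, not from any particular toolkit:

```python
import random

def replicated_kfold(n_items, k=5, runs=5, seed=0):
    """Yield (train, test) index lists for `runs` independent k-fold splits."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    for _ in range(runs):
        rng.shuffle(indices)  # fresh fold assignment each run
        folds = [indices[i::k] for i in range(k)]
        for held_out in range(k):
            test = folds[held_out]
            train = [i for f in range(k) if f != held_out for i in folds[f]]
            yield train, test

splits = list(replicated_kfold(100))  # 5 runs x 5 folds = 25 train/test splits
```

Averaging accuracy over all 25 splits washes out the noise from any single fold assignment, which is what makes the nearly-separable folds tolerable.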

I’m just using cross-validation the way you did in the paper of yours I cited in the previous blog entry in this thread, How Fat is Your (Prior’s) Tail?.

In some sense, the only thing I care about is performance on some held out test set. I don’t really care about minimizing the error function per se, except insofar as it leads to better held out performance. Cross-validation’s the best way I know to estimate that, but I’d love to hear other ideas.

On those lines, I’m finding the Laplace prior clearly superior, as suggested by Goodman and Genkin et al. (the same prior blog post has the refs). And I’m seeing about the same magnitude of effect as reported by Genkin et al. — 70% vs. 82% accuracy on the i2b2 Obesity challenge data, using the same bag-of-words features selected by a count threshold.

What’s interesting is that Goodman did the posterior histogram of parameters and saw they looked like a Laplace distribution. So using the Laplace as a prior is a truly Bayesian maneuver in this setting in that it encodes prior knowledge about likely parameter values.
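In MAP terms, a Laplace prior on the coefficients is the same thing as L1 regularization, so the qualitative effect is easy to reproduce with off-the-shelf tools. This sketch uses scikit-learn on synthetic data; none of it is the original i2b2 setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))  # 50 features, only 2 actually informative
y = (X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200) > 0).astype(int)

# penalty="l1" is MAP estimation under a Laplace prior; C is the inverse
# prior scale, so smaller C means a sharper prior pulling weights to zero.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
n_nonzero = int(np.count_nonzero(clf.coef_))  # most weights end up exactly 0
```

The sparsity is the visible signature of the Laplace prior: the uninformative bag-of-words features get driven to exactly zero rather than merely shrunk, as they would be under a Gaussian prior.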
