Alex Smola just started a blog, Adventures in Data, that focuses on many of the same issues as this one. His first few posts cover coding tweaks and theorems for gradient descent for linear classifiers, a problem Yahoo! has clearly put a lot of effort into.
Lazy Updates for SGD Regularization, Again
In the wide world of independently discovered algorithmic tricks, Smola has a post on Lazy Updates for Generic Regularization in SGD. It presents the same technique I summarized in my pedantically named white paper, Lazy Sparse Stochastic Gradient Descent for Regularized Multinomial Logistic Regression. It turns out to be a commonly used, but rarely discussed, technique.
I wrote a long comment on that post explaining that I originally implemented it that way in LingPipe, but eventually found it faster in real large-scale applications to block the prior updates.
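For readers who haven't seen the trick, here's a minimal sketch of the lazy approach for an L2 prior. This is my own illustration, not Smola's code or LingPipe's: each weight records the step at which it was last regularized, and the accumulated shrinkage is applied only when its feature next shows up in an instance, so updates stay sparse.

```python
import math

def lazy_sgd_logistic(instances, num_features, lr=0.1, lam=0.01, epochs=5):
    """Sparse SGD for binary logistic regression with a lazily applied L2 prior.

    instances: list of (features, label) pairs, where features is a dict
    mapping feature index -> value and label is 0 or 1.
    """
    w = [0.0] * num_features
    last = [0] * num_features  # step at which each weight was last regularized
    t = 0
    for _ in range(epochs):
        for feats, y in instances:
            t += 1
            # Catch up the skipped regularization steps, but only for the
            # features active in this instance.
            for j in feats:
                w[j] *= (1.0 - lr * lam) ** (t - last[j])
                last[j] = t
            # Sparse gradient step on the log loss.
            z = sum(w[j] * x for j, x in feats.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for j, x in feats.items():
                w[j] += lr * (y - p) * x
    # Bring every weight fully up to date before returning.
    for j in range(num_features):
        w[j] *= (1.0 - lr * lam) ** (t - last[j])
    return w
```

The catch-up multiplies by (1 − lr·λ) raised to the number of skipped steps, which compounds the individual shrinkage steps that a dense implementation would have applied one at a time.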
Your Mileage May Vary
To finally get to the point of the post (blame my love of rambling New Yorker articles), your mileage may vary.
The reason blocking works for us is that we have feature sets like character n-grams, where each text being classified produces on the order of 1/1000 of all features. So if you apply the prior update in a block every 1000 instances or so, it's a win, even before counting the savings from dropping the memory and time overhead of the sparseness bookkeeping, and the improved memory locality.
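The blocked variant can be sketched the same way (again, an illustration under my own assumed parameter names, not the LingPipe code): drop the per-feature timestamps entirely and make one dense pass over the whole weight vector every `block` updates.

```python
import math

def blocked_sgd_logistic(instances, num_features, lr=0.1, lam=0.01,
                         block=1000, epochs=5):
    """Sparse SGD with the L2 prior applied to ALL weights every `block` steps.

    No per-feature bookkeeping: when each instance touches roughly 1/block
    of the features, one dense pass per block costs about the same as the
    lazy catch-ups, without the timestamp array or the extra branching.
    """
    w = [0.0] * num_features
    t = 0
    for _ in range(epochs):
        for feats, y in instances:
            t += 1
            z = sum(w[j] * x for j, x in feats.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for j, x in feats.items():
                w[j] += lr * (y - p) * x
            if t % block == 0:
                # One dense pass applies `block` accumulated shrinkage steps.
                shrink = (1.0 - lr * lam) ** block
                for j in range(num_features):
                    w[j] *= shrink
    # Apply any leftover partial block of regularization.
    shrink = (1.0 - lr * lam) ** (t % block)
    for j in range(num_features):
        w[j] *= shrink
    return w
```

The trade-off is exactly the one described above: the dense pass is wasted work when features are very rare, but with n-gram-style feature densities the sequential sweep is cache-friendly and cheaper than tracking sparseness per feature.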
Horses for Courses
The only possible conclusion is that it’s horses for courses. Each application needs its own implementation for optimal results.
As another example, I can estimate my annotation models using Gibbs sampling in BUGS in 20 or 30 minutes. I can estimate them in my custom Java program in a second or two. I only wrote the program because I needed to scale to a named-entity corpus and BUGS fell over. As yet another example, I’m finding coding up genomic data very similar to text data (edit distance, Bayesian models) and also very different (4-character languages, super-long strings).
Barry Richards, the department head of Cog Sci when I was a grad student, gave a career retrospective talk about all the work he did in optimizing planning systems. Turns out each application required a whole new kind of software. As a former logician, he found the lack of generality very dispiriting.
Not being able to resist one further idiom, this sounds like a probletunity to me.
Threading Blog Discussions
This post is a nice illustration of Fernando Pereira’s comment on Hal Daumé III’s blog about the problematic nature of threading discussions across blogs.
That Zeitgeist Again
I need to generalize my earlier post about the scientific zeitgeist to include coding tricks. The zeitgeist post didn’t even discuss my rediscoveries in the context of SGD!