He mentioned an intriguing way to speed up linear classifiers at runtime. A binary linear classifier is based on the dot product of a weight vector β (almost always dense) with a feature vector x (almost always sparse). Examples of linear classifiers include perceptrons, logistic regression, naive Bayes/multinomial, and SVMs, and the same dot product is even the innermost loop in discriminative sequence models like CRFs.
The bottleneck at runtime for linear classifiers is converting the objects being classified into sparse vectors. If you use a symbol table, the bottleneck is the hash lookup in the symbol table. The feature weight vector β is almost always an array, so once you have the symbol ID, you pay for the array lookup, the multiplication of the feature value (typically 1 in NLP problems) by the weight found, and the addition to the running sum for the dot product. Only the array lookup is time consuming here.
Actually constructing the sparse vector itself would also be expensive, but this can be done implicitly, because all we need is the dot product of the vector with the parameter vector β.
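To make the baseline concrete, here's a minimal sketch of the symbol-table approach with the sparse vector left implicit. The names `symbol_table`, `beta`, and `score_with_symbol_table`, along with the toy features and weights, are made up for illustration.

```python
# Hypothetical symbol table and weights, invented for illustration only.
symbol_table = {"goalie": 0, "pitcher": 1, "puck": 2, "inning": 3}
beta = [1.5, -1.25, 0.75, -0.5]  # dense weight array indexed by symbol ID


def score_with_symbol_table(tokens, symbol_table, beta):
    """Dot product of beta with the implicit bag-of-words vector for tokens."""
    total = 0.0
    for token in tokens:
        # The expensive step: a hash lookup plus string match in the table.
        feature_id = symbol_table.get(token)
        if feature_id is not None:  # unknown features contribute nothing
            # Each occurrence has feature value 1, so we just add the weight.
            total += beta[feature_id] * 1.0
    return total


print(score_with_symbol_table(["goalie", "puck", "the"], symbol_table, beta))
```

No sparse vector is ever built. Each feature is looked up, its weight is added, and the feature is forgotten.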
So what happens if we replace the symbol ID generated by a symbol table with a hash code? Instant speedup. We eliminate the expensive hash lookup, which requires an array lookup that is almost certainly out of L2 cache, followed by iterating over the collision set doing string matches until we get a match or exhaust the bucket.
The price we pay is possible collisions. In effect, any two features that have the same hash code get conflated. If we’re doing 20 newsgroups and trying to distinguish hockey posts from baseball posts, it’s going to hurt accuracy if the hash codes of “goalie” and “pitcher” are the same, as they’re highly discriminative in this domain.
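A small sketch of what conflation looks like: features that land in the same slot are indistinguishable to the classifier. The function name `conflated_groups` is hypothetical, and CRC32 from Python's `zlib` is only a convenient stand-in for whatever hash function you'd actually use.

```python
import zlib
from collections import defaultdict


def conflated_groups(features, table_size):
    """Return the groups of features that share a slot in a hashed table."""
    buckets = defaultdict(list)
    for feature in features:
        # CRC32 is an assumed stand-in hash; reduce it to a slot index.
        slot = zlib.crc32(feature.encode("utf-8")) % table_size
        buckets[slot].append(feature)
    # Only slots holding two or more features represent conflation.
    return [group for group in buckets.values() if len(group) > 1]


# With a deliberately tiny table, collisions are forced by pigeonhole.
print(conflated_groups(["goalie", "pitcher", "puck", "inning"], 2))
```

Running this over a real feature set with a realistic table size would show which features, if any, end up sharing a weight.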
Now we’re going to use a hash code that produces numbers in a small range, say 0 to 2**18 - 1, or 18 bits, so that an array of floats or doubles of that size fits in L2 cache on our CPU. Now we’re really flying. The symbol we’re looking up will fit in a register, so computing its hash code will be pretty fast. With the symbol table, it was the lookup out of cache and the subsequent string matching that was the time sink.
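Here's a rough sketch of the hashed version, assuming CRC32 as the hash function and an 18-bit table. The names `feature_index`, `score_hashed`, and the toy weights are illustrative, not anyone's actual implementation, and Python won't show the cache effects that motivate the trick in a lower-level language.

```python
import zlib

NUM_BITS = 18
TABLE_SIZE = 1 << NUM_BITS  # 2**18 slots
MASK = TABLE_SIZE - 1       # keeps the low 18 bits of the hash code


def feature_index(token):
    """Map a feature string straight to a slot in [0, TABLE_SIZE)."""
    # CRC32 is an assumed stand-in; any fast, well-mixed hash would do.
    return zlib.crc32(token.encode("utf-8")) & MASK


def score_hashed(tokens, beta):
    """Dot product with hashed feature indices; no symbol table involved."""
    total = 0.0
    for token in tokens:
        total += beta[feature_index(token)]  # feature value is 1
    return total


# Toy weights, set through the same hash so that lookups line up.
beta = [0.0] * TABLE_SIZE
beta[feature_index("goalie")] += 1.5
beta[feature_index("pitcher")] -= 1.25

print(score_hashed(["goalie", "the", "goalie"], beta))
```

Training has to go through the same `feature_index` function, so the weights learned for a slot are the weights for whatever set of features hashes there.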
In practice, John reports that experiments they’ve done have shown that collisions aren’t a problem. He found this somewhat surprising, but I didn’t. Language is highly redundant, so a few features being conflated is unlikely to hurt performance much. It’d be interesting to see a plot of size of hash table vs. number of features vs. accuracy.
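For the table size vs. number of features part of that plot, a back-of-the-envelope model is easy to write down. The sketch below assumes an idealized hash that drops each feature into a uniformly random, independent slot, and the 100,000-feature figure is an arbitrary example. It says nothing about accuracy, which would have to be measured.

```python
def expected_collision_fraction(num_features, table_size):
    """Chance that a given feature shares its slot with at least one other.

    Assumes an idealized hash: uniform, independent slot per feature.
    """
    if num_features < 1 or table_size < 1:
        raise ValueError("num_features and table_size must be positive")
    # The other (num_features - 1) features must all miss this slot.
    return 1.0 - (1.0 - 1.0 / table_size) ** (num_features - 1)


# Arbitrary illustration: 100,000 distinct features at several table sizes.
for bits in (16, 18, 20, 22):
    fraction = expected_collision_fraction(100_000, 1 << bits)
    print(bits, round(fraction, 3))
```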
This approach extends to the more complex, structured features common in discriminative classifiers. We never need to build an explicit feature representation if we can generate a hash code for it.
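One way to do that is to feed the parts of a structured feature through an incremental hash, so the combined feature string is never materialized. The sketch below uses 32-bit FNV-1a as an assumed choice of hash. The function names and the example feature (previous tag, current tag, word) are hypothetical.

```python
FNV_OFFSET_BASIS = 2166136261  # standard 32-bit FNV offset basis
FNV_PRIME = 16777619           # standard 32-bit FNV prime
MASK_32 = 0xFFFFFFFF


def fnv1a_update(h, data):
    """Fold the bytes of data into the running 32-bit FNV-1a hash h."""
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & MASK_32
    return h


def structured_feature_index(parts, num_bits=18):
    """Hash a tuple of feature parts to a slot without joining them."""
    h = FNV_OFFSET_BASIS
    for part in parts:
        h = fnv1a_update(h, part.encode("utf-8"))
        # A separator byte so ("ab", "c") and ("a", "bc") hash different input.
        h = fnv1a_update(h, b"\x00")
    return h & ((1 << num_bits) - 1)  # keep the low num_bits bits


# Hypothetical CRF-style feature: previous tag, current tag, current word.
print(structured_feature_index(("prev=DET", "tag=NOUN", "word=dog")))
```

Simply masking off the low bits is the crudest way to shrink the hash to 18 bits. It's good enough for a sketch, though other reductions may mix better.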
If we ever have to make a simple classifier that really flies, this is what I’ll be thinking about. I might also be thinking about perfect hashing, because I’m a neat freak.