Netflix finally awarded the grand prize (the linked forum page points to the technical descriptions of the top systems).
What Did We Learn?
Off-the-shelf regression works pretty darn well. Improving on it takes a huge amount of both general and domain-specific effort.
The most surprising aspect of the competition to me was how close the final race was. I’m sure that was in part due to the openness of the competition, such as the requirement that winners of the intermediate progress prizes publicly disclose their approaches. As we tell our customers, it’s not necessarily the “secret sauce” or “magic pixie dust” so much as a whole lot of elbow grease.
This result actually undercuts the surprisingly common naive argument for machine learning over heuristic approaches: that with machine learning, you don’t need experts to analyze and tune the data; you just feed the data into the hopper and a system comes out the other end. Those of us who’ve been around the block a few times already knew this wasn’t true from the results of other competitions, such as the various BioCreative bakeoffs and the i2b2 Obesity Challenge.
It took teams of experts thousands of hours to reduce the average error from .95 stars (on a 1-5 star scale) to .85 stars. That’s only 1/10th of a star; I doubt the difference would even be noticeable on the Netflix interface. In relative terms, that’s a reduction from 23.75% error (.95/4) to 21.25% error (.85/4).
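As a quick sanity check, here is the normalization spelled out; it just treats the RMSE as a fraction of the 4-star range of a 1-5 scale:

```python
# A 1-5 star scale spans a 4-star range.
star_range = 5 - 1

# RMSE as a percentage of the rating range, before and after.
before = 0.95 / star_range * 100
after = 0.85 / star_range * 100

print(f"before: {before:.2f}%, after: {after:.2f}%")
print(f"reduction: {before - after:.2f} percentage points")
```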
What’s more, the computational resources required to achieve this reduction in error were vast. Hundreds of relatively complex classifiers of various kinds were combined into an ensemble for the final predictions.
The contest was over a relatively small static data collection (100M ratings, which fit in under 1GB memory), whereas Netflix is faced with a vast amount of dynamic data (new users, new films, new ratings).
The majority of the improvement came right away, with a simple partial singular value decomposition (SVD) implemented with a stochastic gradient descent fitter. When I say simple, I mean that the algorithm can be implemented efficiently in a matter of hours. I ported this approach for LingPipe’s SVD.
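For readers who haven’t seen it, this style of factorization (popularized during the contest by Simon Funk) really is only a few lines. Below is my own illustrative sketch, not LingPipe’s API; the function name, learning rate, and regularization constant are arbitrary choices:

```python
import random

def sgd_svd(ratings, n_users, n_items, n_factors=10,
            lr=0.01, reg=0.02, n_epochs=500, seed=0):
    """Fit a partial-SVD-style factorization of a sparse ratings
    matrix by stochastic gradient descent on squared error.

    ratings: list of (user, item, rating) triples.
    Returns factor matrices P, Q with prediction ~= dot(P[u], Q[i]).
    """
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - sum(pf * qf for pf, qf in zip(P[u], Q[i]))
            for f in range(n_factors):
                pf, qf = P[u][f], Q[i][f]  # use pre-update values
                P[u][f] += lr * (err * qf - reg * pf)
                Q[i][f] += lr * (err * pf - reg * qf)
    return P, Q

# Tiny toy example: 3 users, 2 movies, 4 observed ratings.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (2, 1, 1)]
P, Q = sgd_svd(ratings, n_users=3, n_items=2, n_factors=2)
```

Each observed rating contributes one gradient step per epoch, so the cost is linear in the number of ratings, which is what made this practical on 100M ratings.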
Finer-grained improvements came only after detailed study and analysis of the data, and involved things like modeling kinks in ratings due to unknown sources at particular dates and other non-linear effects such as user-specific rating scales.
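For concreteness, here is a hedged sketch of the baseline-effects layer such refinements sit on: a global mean plus shrunk per-user and per-item offsets (the actual prize models added date-dependent terms on top of this). The names and the shrinkage constant are my own:

```python
from collections import defaultdict

def fit_baselines(ratings, reg=5.0):
    """Estimate a global mean plus per-item and per-user rating
    offsets from (user, item, rating) triples, shrinking each
    offset toward zero in proportion to how little data it has."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    item_sum, item_cnt = defaultdict(float), defaultdict(int)
    for u, i, r in ratings:
        item_sum[i] += r - mu
        item_cnt[i] += 1
    b_item = {i: item_sum[i] / (reg + item_cnt[i]) for i in item_sum}
    user_sum, user_cnt = defaultdict(float), defaultdict(int)
    for u, i, r in ratings:
        user_sum[u] += r - mu - b_item[i]  # user offset net of item effect
        user_cnt[u] += 1
    b_user = {u: user_sum[u] / (reg + user_cnt[u]) for u in user_sum}
    return mu, b_user, b_item

def predict(mu, b_user, b_item, u, i):
    """Baseline prediction; unseen users/items fall back to the mean."""
    return mu + b_user.get(u, 0.0) + b_item.get(i, 0.0)

ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 3), (1, 1, 2)]
mu, b_user, b_item = fit_baselines(ratings)
```

The user offsets are exactly the “user-specific rating scales” above: some users hand out fives freely, others almost never do, and the factor models fit only the residuals left over after these effects are removed.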
More or Less Data?
While we could run experiments to see how well we could’ve done with only 1M or 10M ratings, what would’ve happened with 1G or 10G ratings? I can’t help being reminded of the Banko and Brill paper on large data sets, the moral of which was that more data trumps fancy learning techniques. I’d expect a particularly pronounced effect on the k-nearest-neighbor approaches.
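The subsampling experiment in the first sentence is easy to set up. A minimal sketch, assuming any fixed learner and an RMSE evaluator; the global-mean “model” below is just a placeholder to make it runnable:

```python
import math
import random

def learning_curve(ratings, fit, rmse, sizes, seed=0):
    """Train on nested random subsamples of the ratings and report
    held-out RMSE at each training-set size."""
    rng = random.Random(seed)
    shuffled = ratings[:]
    rng.shuffle(shuffled)
    cut = len(shuffled) // 10                       # fixed 10% test split
    held_out, pool = shuffled[:cut], shuffled[cut:]
    return [(n, rmse(fit(pool[:n]), held_out)) for n in sizes]

# Placeholder learner: predict the global mean rating.
def fit_mean(train):
    return sum(r for _, _, r in train) / len(train)

def rmse_mean(mu, test):
    return math.sqrt(sum((r - mu) ** 2 for _, _, r in test) / len(test))

# Synthetic ratings, just to exercise the harness.
ratings = [(k % 50, k % 30, k % 5 + 1) for k in range(1000)]
curve = learning_curve(ratings, fit_mean, rmse_mean, sizes=[100, 300, 900])
```

Swapping in the real fitter and plotting RMSE against training-set size would show directly whether the curve was still dropping at 100M ratings.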