Why Didn’t You Win the Netflix Prize?


Netflix awarded its first progress prize in the Netflix Prize competition. The winner was a group from AT&T Research consisting of Robert Bell, Yehuda Koren and Chris Volinsky. According to Netflix’s press release on the award, the AT&T team spent 2000 hours working on this project. That’s a full-time, don’t go to any meetings, don’t take any vacation, person-year. And it pretty much sums up why we weren’t even in the running.

How’d AT&T Do It?

Here’s all the juicy details, in AT&T’s Disclosure Tech Report. (Props to AT&T for letting them release this stuff.)

I’m not surprised by how they did it. I doubt anyone who’s studied machine learning in the past ten years will be surprised. The goal was to reduce prediction variance, so they built an ensemble (aka a committee). Of course they built an ensemble. It’s the most widely practiced method for reducing the variance of a predictor.

AT&T used a really big ensemble. One with 107 reasonably tuned learners, drawn from a half dozen or so classes. They implemented a rich K nearest neighbors approach, Toronto’s Boltzmann machine approach, singular value decompositions (regularized with both standard L2 and lasso-style L1 norms), some specialized factor models borrowed from Arek Paterek (another Netflix Prize participant), and several flavors of regression model.
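To see why committees help, here’s a toy sketch of the variance-reduction effect. This is not AT&T’s actual blend (they fit combination weights rather than taking a plain average); it just shows that averaging predictors with independent errors shrinks RMSE:

```python
import numpy as np

rng = np.random.default_rng(0)

# "True" ratings plus three predictors with independent noise,
# standing in for distinct ensemble members.
true = rng.uniform(1, 5, size=10_000)
members = [true + rng.normal(0.0, 0.5, size=true.shape) for _ in range(3)]

def rmse(pred, actual):
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

member_rmse = [rmse(p, true) for p in members]        # each around 0.5
ensemble_rmse = rmse(np.mean(members, axis=0), true)  # around 0.5 / sqrt(3)
```

With perfectly independent errors the average of n members cuts the noise by a factor of sqrt(n); real ensemble members are correlated, which is why it takes 107 of them to grind out a few hundredths of a star.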

How Good is a 10% Reduction in Error?

The US$1M prize is for a 10% reduction in the error of Netflix’s current system. The current Netflix system has an (average) error of around 0.951, so the US$1M prize would go to a system with an error of around 0.856. That’s on a 1-5 scale, so it’s an improvement of only about 1/10th of a star. The remaining error, +/- 0.86 stars, is relatively large. Because there are only 4 stars of range (5-1=4), the progress prize is for a system with 21.4% estimation error overall (0.856/4), down from the 23.8% or so of Netflix’s (stripped down) Cinematch algorithm. That’s pretty much the difference between 3 stars (liked it) and 2 stars (didn’t like it). Put another way, if Netflix used a 0-100 scale, these algorithms would be more than 20 points off in their predictions on average. (It’s actually a little more subtle than that, because they use root mean square error, not mean absolute error, which downweights small errors (less than 1 star) and amplifies larger ones, but the point still applies.)
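The arithmetic behind those percentages, as a quick sanity check (0.9514 is the approximate Cinematch baseline):

```python
cinematch_rmse = 0.9514            # Netflix's baseline error (approx.)
prize_rmse = 0.9 * cinematch_rmse  # a 10% reduction wins the grand prize

stars_range = 5 - 1                # ratings run 1 through 5 stars

# Error as a percentage of the full rating range.
baseline_pct = 100 * cinematch_rmse / stars_range  # about 23.8%
prize_pct = 100 * prize_rmse / stars_range         # about 21.4%
```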

The naive system that predicts from the average rating for the user and the average rating for the movie has an error of something like 1.1. It’s very difficult to eke out performance gains on such noisy data. It may simply not be possible to win the grand prize, given the inherent variability of people’s ratings.
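For concreteness, here’s a minimal sketch of one common form of that naive baseline: the global mean plus per-user and per-movie offsets. The function and variable names are mine, and the exact baseline quoted above may be defined a bit differently:

```python
from collections import defaultdict

def fit_baseline(ratings):
    """ratings: list of (user, movie, stars) triples. Returns a predictor
    combining the global mean with per-user and per-movie offsets."""
    global_mean = sum(r for _, _, r in ratings) / len(ratings)
    user_devs, movie_devs = defaultdict(list), defaultdict(list)
    for u, m, r in ratings:
        user_devs[u].append(r - global_mean)
        movie_devs[m].append(r - global_mean)
    user_off = {u: sum(v) / len(v) for u, v in user_devs.items()}
    movie_off = {m: sum(v) / len(v) for m, v in movie_devs.items()}
    def predict(user, movie):
        # Unseen users or movies fall back to the global mean.
        return global_mean + user_off.get(user, 0.0) + movie_off.get(movie, 0.0)
    return predict
```

For example, `fit_baseline([("a", "x", 5), ("a", "y", 3), ("b", "x", 4)])` gives a global mean of 4, so the prediction for user "a" on movie "x" is 4 plus "x"’s +0.5 offset, i.e. 4.5.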

Will Netflix Use It?

The AT&T approach is unlikely to be practical for Netflix. It involved combining hundreds of predictors, each of which is slow to train or to run, and most of which can only be run as-is on static data collections.

Why’d AT&T (or Anyone) Participate?

In the machine learning community, it’s common to participate in bakeoffs, even if there are no prizes at all. For instance, the 2007 KDD Cup used the Netflix data to run a bakeoff to predict which movies a user would rent and how many times a given title would be rented. These are perhaps even more pressing problems for Netflix. The research community’s happy to play along because we’re starved for real data on which to test our models. The researchers even published their results. The AT&T system was only possible, as they themselves acknowledge, because several other teams published their results, which were then incorporated as predictors into AT&T’s system.

Would LingPipe Help?

We’re about to roll out the next release of LingPipe, which contains an implementation of one of the factor models: (ridge) regularized singular value decomposition. This model alone gets you to 0.91 error on Netflix, but like the others, it is both static and expensive to compute (though runtime predictions are fast). I finally sorted out the regularized least squares stochastic gradient descent algorithm by following an implementation published through the Netflix Prize forums (full citations forthcoming in the Javadoc and tutorial) and a stack of books with chapters on regression (e.g. Hastie et al., Bishop, Gelman et al.).
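The flavor of that algorithm, in a stripped-down sketch (this is not LingPipe’s implementation, and the hyperparameters are illustrative): each observed rating gets a gradient step on the squared error plus a ridge penalty on the factors.

```python
import numpy as np

def sgd_svd(ratings, n_users, n_items, k=10, lr=0.02, reg=0.05, epochs=500):
    """Regularized SVD-style factorization fit by stochastic gradient
    descent. ratings is a list of (user_index, item_index, value)."""
    rng = np.random.default_rng(0)
    u = rng.normal(0.0, 0.1, (n_users, k))  # user factor vectors
    v = rng.normal(0.0, 0.1, (n_items, k))  # item factor vectors
    for _ in range(epochs):
        for i, j, r in ratings:
            err = r - u[i] @ v[j]
            u_old = u[i].copy()
            # Gradient step on squared error plus an L2 (ridge) penalty.
            u[i] += lr * (err * v[j] - reg * u[i])
            v[j] += lr * (err * u_old - reg * v[j])
    return u, v
```

A predicted rating is just the dot product of the relevant user and item factors, which is why runtime predictions are fast even though training crawls over every observed rating many times.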

You’ll also see that AT&T used some text features in comparing films in their KNN classifiers. There’s a wide range of other features (like actors, directors, rating, length, country of origin, etc.) that could be extracted, aligned with the film, and used.

I also implemented an EM-estimated variant of probabilistic latent semantic analysis (pLSA), which only achieved a 0.95 error. That may go into LingPipe at some point, as we’re working on general factor models (such as latent Dirichlet allocation and EM clustering) for text and document clustering. I think it’d work better with Gibbs sampling; the EM really gets stuck.