A Regression Problem
The contest involves predicting ratings for movies. The actual ratings given the movies by users are known to Netflix to be $r_1, \ldots, r_N$, where the ratings are integers $r_n \in \{1, 2, 3, 4, 5\}$ for $n \in 1{:}N$. A submission consists of floating point approximations $\hat{r}_1, \ldots, \hat{r}_N$ of real numbers, where $\hat{r}_n \in \mathbb{R}$ for $n \in 1{:}N$.
(Root Mean) Square Error
The error function used by Netflix was root mean square error (RMSE), which is defined as the square root of the mean of the square error:

$$\mathrm{RMSE}(\hat{r}, r) = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (\hat{r}_n - r_n)^2}$$
The mean scales square error to a per-example error, and the square root puts it back on the original point scale. But note that the best RMSE is achieved at the point of best SE, so the winning entry is the one with the best SE, so we won’t need to worry about means or square roots, just square error. (Note that square error is just Euclidean distance between the entry vector and true answer vector.)
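As a minimal sketch, the RMSE of a submission can be computed as follows (function and variable names here are illustrative, not Netflix's):

```python
import math

def rmse(predictions, truths):
    """Root mean square error between predicted and true ratings."""
    sq_err = sum((p - t) ** 2 for p, t in zip(predictions, truths))
    return math.sqrt(sq_err / len(truths))
```

Since the square root and the division by $N$ are both monotonic, two submissions ranked by RMSE are ranked identically by plain square error, which is why we can ignore the mean and the root in what follows.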
Graphing Upward Curving Error
Suppose we have a true rating of $r_n = 4$ for movie $n$. We plot square error $(x - 4)^2$ for guesses $x \in [1, 5]$:
Note that the error curves upward. Note that the error is much higher for outlier guesses; a guess of 1 has an error of 9 in this case, whereas a guess of 2 has an error of 4 and a guess of 3 an error of only 1. This tends to drive estimates toward the center of the range to guard against huge error. By way of comparison, absolute error (or taxicab distance) would be $|x - 4|$ and would not penalize extreme guesses nearly as severely as squared error.
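The contrast between the two loss functions can be sketched directly, using the true rating of 4 from the example:

```python
true_rating = 4

def square_error(guess):
    """Squared distance from the true rating; punishes outliers heavily."""
    return (guess - true_rating) ** 2

def absolute_error(guess):
    """Taxicab distance from the true rating; linear in the miss."""
    return abs(guess - true_rating)

for guess in [1, 2, 3, 5]:
    print(guess, square_error(guess), absolute_error(guess))
```

A guess of 1 costs 9 under square error but only 3 under absolute error, which is exactly the outlier-penalizing behavior described above.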
We have plotted as red circles the error for two guesses, $x = 3.0$ and $x = 4.5$, with values $(3.0 - 4)^2 = 1.0$ and $(4.5 - 4)^2 = 0.25$.
Now consider the guess that results from averaging the guesses 3.0 and 4.5, namely 3.75. It has lower error than either 3.0 or 4.5, namely $(3.75 - 4)^2 = 0.0625$. The error is plotted on the curve. Note that the error for the average of the guesses, 0.0625, is not only less than the average of the errors, 1.25/2 = 0.625, it is less than the error of either individual guess, 1.0 and 0.25.
The reason committees fare so well with root mean square error is that the error function is convex. Convexity for functions means the same thing as the dictionary definition — curved upwards. When functions curve upwards, as square error does, the error for averaging two guesses is always lower than the average error from the two guesses.
Mathematically, a function $f$ is convex if for all $x$ and $y$ and $\lambda \in [0, 1]$,

$$f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y)$$
This says that the value of the function at a weighted average of inputs is less than or equal to the weighted average of the values of the function at those inputs. In the example above, $f$ is the square error function and the two guesses are weighted equally, with $\lambda = 0.5$.
We know square error is convex because it's differentiable everywhere and its second derivative is positive:

$$f(x) = (x - r)^2, \quad f'(x) = 2(x - r), \quad f''(x) = 2 > 0$$
The result can also be easily established algebraically by expanding all of the terms, cancelling and rearranging.
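For the record, here is how that algebraic route goes: write $a = x - r$ and $b = y - r$, so the claimed inequality becomes $(\lambda a + (1-\lambda) b)^2 \leq \lambda a^2 + (1-\lambda) b^2$. Expanding and cancelling,

```latex
\lambda a^2 + (1-\lambda) b^2 - \bigl(\lambda a + (1-\lambda) b\bigr)^2
  = \lambda(1-\lambda) a^2 - 2\lambda(1-\lambda) ab + \lambda(1-\lambda) b^2
  = \lambda(1-\lambda)(a - b)^2
  \geq 0.
```

The difference is a non-negative multiple of a square, which establishes convexity directly.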
Everything generalizes to multiple dimensions by additivity. Because the full square error is the sum of the dimensionwise square errors, all the partial second derivatives are positive and the entire error function is convex.
Committees Average Their Members’ Predictions
All of the top-scoring Netflix Prize entries averaged the predictions of multiple base systems. Given the convexity of error, the error for the average of two equal scoring systems is less than or equal to their average error. So in general, two predictions with the same error can be averaged to produce a prediction with lower error. And as we saw in the example above, this can even happen when the predictions are not the same.
It can help to tune the weighting $\lambda$, typically by weighting the better scoring systems more heavily.
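A simple grid search over $\lambda$ on held-out ratings is one way to tune the blend; the function names below are hypothetical, and the top Netflix entries used considerably more sophisticated blending schemes:

```python
def blend(p1, p2, lam):
    """Weighted average of two prediction vectors: lam*p1 + (1-lam)*p2."""
    return [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]

def sq_err(preds, truths):
    """Total square error (mean and root don't change the argmin)."""
    return sum((p - t) ** 2 for p, t in zip(preds, truths))

def best_lambda(p1, p2, truths, steps=100):
    """Grid-search the mixing weight that minimizes square error."""
    return min((i / steps for i in range(steps + 1)),
               key=lambda lam: sq_err(blend(p1, p2, lam), truths))
```

For example, blending a system that guesses 3.0 with one that guesses 5.0 against a true rating of 4.0 puts the best weight at 0.5.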
Discrete Predictions and Posterior Variance
As we’ve been talking about in other blog posts, point estimates sweep posterior variance under the rug. The square error further pushes numerical scores toward non-committal average ratings.
Suppose we build a discrete 5-way classifier and estimate a posterior $p(r_n = k)$ for $k \in \{1, \ldots, 5\}$. The Bayesian estimator here would minimize expected square error with respect to the posterior, namely the expected rating, which weights each prediction by its estimated probability:

$$\hat{r}_n = \sum_{k=1}^{5} k \cdot p(r_n = k)$$
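A sketch of that expected-rating computation, with the posterior represented as a list of probabilities for ratings 1 through 5:

```python
def expected_rating(posterior):
    """Posterior is a list of probabilities for ratings 1..5.
    Returns the probability-weighted rating, the Bayes estimate
    under square error."""
    return sum(k * p for k, p in enumerate(posterior, start=1))
```

A posterior concentrated on rating 3 yields 3.0 exactly, and a uniform posterior also yields the midpoint 3.0 — different distributions, identical point estimates.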
For instance, here are plots of two posterior estimates with the same mean prediction, 2.95, the first of which is for a film the system thinks the user will love or hate,
and one for which the system is pretty sure the user will be indifferent,
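The contrast between the two plots can be reproduced numerically. The distributions below are hypothetical values chosen to share the mean 2.95, not the ones from the original plots:

```python
def mean(posterior):
    """Expected rating under a posterior over ratings 1..5."""
    return sum(k * p for k, p in enumerate(posterior, start=1))

def variance(posterior):
    """Posterior variance of the rating."""
    m = mean(posterior)
    return sum(p * (k - m) ** 2 for k, p in enumerate(posterior, start=1))

love_or_hate = [0.40, 0.10, 0.05, 0.05, 0.40]  # bimodal; hypothetical values
indifferent  = [0.02, 0.15, 0.70, 0.12, 0.01]  # peaked near 3; hypothetical
```

Both posteriors produce the same point prediction, but the bimodal one has far higher variance — exactly the information a single number throws away.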
I find these posterior distributions much more useful than a single point estimate. This is how Amazon and NewEgg display their customer ratings, though neither site predicts how much you’ll like something numerically the way Netflix does.