## Convexity of (Root) Mean Square Error, or Why Committees Won the Netflix Prize

As John Langord mentioned in his blog, the reason committees won the Netflix prize is because root mean square error is convex. Let’s see what that means with an example and with some math.

### A Regression Problem

The contest involves predicting ratings for $N$ movies. The actual ratings given the movies by users are known to Netflix to be $y = y_1,\ldots,y_N$, where the ratings are integers $y_n \in \{ 1, 2, 3, 4, 5 \}$ for $1 \leq n \leq N$. A submission consists of floating point approximations of real numbers $\hat{y} = \{ \hat{y}_1,\ldots,\hat{y}_N \}$ where $\hat{y}_n \in \mathbb{R}$ for $1 \leq n \leq N$.

### (Root Mean) Square Error

The error function used by Netflix was root mean square error (RMSE), which is defined as the square root of the mean of the square error:

$\mbox{RMSE}(\hat{y},y) = \sqrt{\mbox{MSE}(\hat{y},y)}$

$\mbox{MSE}(\hat{y},y) = \mbox{SE}(\hat{y},y)/N$

$\mbox{SE}(\hat{y},y) = \sum_{n=1}^N \mbox{SE}(\hat{y}_n,y_n)$

$\mbox{SE}(\hat{y}_n,y_n) = (y_n - \hat{y_n})^2$

The mean scales square error to a per-example error, and the square root puts it back on the original point scale. But note that the best RMSE is achieved at the point of best SE, so the winning entry is the one with the best SE, so we won’t need to worry about means or square roots, just square error. (Note that square error is just Euclidean distance between the entry vector and true answer vector.)

### Graphing Upward Curving Error

Suppose we have a true rating of $y_n = 4$ for movie $n$. We plot square error $\mbox{SE}(\hat{y}_n,y_n) = (y_n-\hat{y}_n)^2$ for $\hat{y}_n \in [1,5]$:

Note that the error curves upward. Note that the error is much higher for outlier guesses; a guess of 1 has an error of 9 in this case, whereas a guess of 2 has an error of 4 and a guess of 3 an error of only 1. This tends to drive estimates toward the center of the range to guard against huge error. By way of comparison, absolute (or taxicab distance) error would be $|y_n - \hat{y}_n|$ and would not penalize extreme guesses nearly as severely as squared error.

We have plotted as red circles the error for two guesses $3.0$ and $4.5$, with values

$\mbox{SE}(3.0,4)=(4-3.0)^2=1.0$

$\mbox{SE}(4.5,4) = (4-4.5)^2 = 0.25$.

Now consider the guess that results from averaging the guesses 3.0 and 4.5, namely 3.75. It has lower error than either 3.0 or 4.5,

$\mbox{SE}(3.75,4) = (4-3.75)^2 = 0.0625$

The error is plotted on the curve. Note that the error for the average of the guesses 0.0625 is not only less than the average of the errors, 1.25/2 = 0.625, it is less than the error of the individual guesses, 1.0 and 0.25.

$(\mbox{SE}(4.5,4) + \mbox{SE}(3.0,4))/2 \geq \mbox{SE}((4.5 + 3.0)/2,4)$

### Convexity

The reason committees fare so well with root mean square error is because the error function is convex. Convexity for functions means the same thing as the dictionary definition — curved upwards. When functions curve upwards, as square error does, the error for averaging two guesses is always lower than the average error from the two guesses.

Mathematically, a function $f$ is convex if for all $x$ and $x'$ and $\lambda \in [0,1]$,

$f(\lambda x + (1-\lambda) x') \leq \lambda f(x) + (1-\lambda) f(x')$

This says that the value of the function for a weigted average of inputs is less than or equal to the weighted average of the value of the function at those inputs. In the example above, $f$ is the square error function $f(x) = (y-x)^2$ and the examples are weighted equally with $\lambda = 1/2$.

We know square error is convex because its differentiable everywhere and its second derivative is positive:

$\frac{d^2}{d\hat{y}_n^2} \mbox{SE}(\hat{y}_n,y_n) = \frac{d^2}{d\hat{y}_n^2}(y_n - \hat{y}_n)^2 = 2 \hat{y}_n$

The result can also be easily established algebraically by expanding all of the terms, cancelling and rearranging.

Everything generalizes to multiple dimensions by additivity. Because the full square error is the sum of the dimensionwise square errors, all the partial second derivatives are positive and the entire error function is convex.

### Committees Average Their Members’ Predictions

All of the top-scoring Netflix Prize entries averaged the predictions of multiple base systems. Given the convexity of error, the error for the average of two equal scoring systems is less than or equal to their average error. So in general, two predictions with the same error can be averaged to produce a prediction with lower error. And as we saw in the example above, this can even happen when the predictions are not the same.

It can help to tune the weighting $\lambda$, typically by weighting the better scoring systems more heavily.

### Discrete Predictions and Posterior Variance

As we’ve been talking about in other blog posts, point estimates sweep posterior variance under the rug. The square error further pushes numerical scores toward non-comittal average ratings.

Suppose we build a discrete 5-way classifier and estimate a posterior $\mbox{Pr}(y_n=m)$ for $m \in \{ 1, 2, 3, 4, 5 \}$. The Bayesian estimator here would minimize expected square error with respect to the posterior, namely the expected rating, which weights each prediction by its estimated probability:

$\hat{y}_n = \sum_{m =1}^{5} m \cdot \mbox{Pr}(y_n=m)$

For instance, here are plots of two posterior estimates with the same mean prediction, 2.95, the first of which is for a film the system thinks the user will love or hate,

and one for which the system is pretty sure the user will be indifferent,

I find these posterior distributions much more useful than a single plot. This is how Amazon and NewEgg display their customer ratings, though neither site predicts how much you’ll like something numerically like Netflix does.

### 4 Responses to “Convexity of (Root) Mean Square Error, or Why Committees Won the Netflix Prize”

1. Max Gubin Says:

Actually, the situation with a real recommendation system is more complex because it can provide only a limited number of recommendations for a user, in other words, it provides only N top items with the best scores for a user. Any loss function that uses such a top window is extremely non-convex with many local minima (like loss functions in IR: P@N, NDCG). Popular Netflix methods like SVD and stochastic gradient descent won’t work well there. I won’t be surprised if people run into this problem when they try applying Netflix competition results in real systems.

2. lingpipe Says:

@Max Good point about convexity and ranking. Non-convex error functions make optimization-based estimation (training) difficult, too.

I don’t think the kind of non-convexity you’d get from top N or F-measure like scores would necessarily mean committee voting wouldn’t work for these kinds of tasks.

I think precision-at-N, where “relevant” means a customer rents it, is a better measure for Netflix’s task.

Furthermore, I think we need more results diversity — I don’t want it to recommend seven seasons of the Simpsons even if I will eventually rent them all and rank them highly. And I don’t want to keep being shown the same movie again and again as a recommendation, so there needs to be history.

I would also like Netflix and Amazon and PubMed to go deeper into more-like-this, because I find it useful for exploration.

Netflix’s own Cinematch system, which provided the baseline, was also based on regression. From what Netflix has said, they’re already incorporating some of the techniques from the competition. Certainly I’d think the features would port.

• Viviane Says:

Thank you Felipe. I have read your work. It is ineedd the first to my knowledge that uses the extended (also called Fisher’s non-central) hypergeometric distribution for IR. I found that sampling from extended hypergeometric becomes exponentially expensive as the sample size increases, and that there is ongoing research in statistics about .While researching on the use of the hypergeometric distribution for IR, I found a paper from Wilbur dating back to 1993! Wilbur models the vocabulary intersection between the query and a set of relevant documents using the central hypergeometric distribution. Little has been done since then, probably because the multinomial distribution is a good approximation to the hypergeometric for most IR scenarios, i.e., when the sample size (query) is cosiderably smaller than the population size (document). However, as we show in the paper, in the case of document-long queries, the multinomial approximation does not hold anymore, and the use of the “vanilla” hypergeometric distribution is required.

3. Search terms and the flu: prefering complex models | Ready-to-hand Says:

[…] One way of doing this is just taking a weighted average of the predictions of several simpler models. This works quite well when your measure of the value of your model is root mean squared error (RMSE)… […]