Expected Reciprocal Rank Evaluation Metric
In this post, I want to discuss the evaluation metric, expected reciprocal rank (ERR), which is the basis of Yahoo!’s Learning to Rank Challenge. The ERR metric was introduced in:
- Chapelle, Olivier, Donald Metzler, Ya Zhang, and Pierre Grinspan. 2009. Expected reciprocal rank for graded relevance. In CIKM.
The metric was designed to model a user’s search behavior better than previous metrics, many of which are discussed in the paper and related to expected reciprocal rank.
The Cascade Model
Expected reciprocal rank is based on the cascade model of search (there are citations in the paper). The cascade model assumes a user scans through ranked search results in order, and for each document, evaluates whether the document satisfies the query, and if it does, stops the search.
This model presupposes that a single document can satisfy a search. This is a reasonable model for question answering (e.g. [belgium capital], [raveonettes concert dates]) or for navigation queries (e.g. [united airlines], [new york times]).
In a cascade model, a highly ranked document that is likely to satisfy the search will dominate the metric.
But the cascade model is a poor model for the kinds of search I often find myself doing, which is more research survey oriented (e.g. [“stochastic gradient” “logistic regression” (regularization OR prior)], [splice variant expression RNA-seq]). In these situations, I want enough documents to feel like I’ve covered a field up to whatever need I have.
To make a cascade model complete, we need to model how likely it is that a given document will satisfy a given user query. For the learning to rank challenge, this is done by assigning each query-document pair an “editorial grade” $g$ from 0 to 4, with 0 meaning irrelevant and 4 meaning highly relevant. These are translated into probabilities of the document satisfying the search by mapping a grade $g$ to $R(g) = (2^g - 1)/2^4$, resulting in the following probabilities: 0, 1/16, 3/16, 7/16, and 15/16 for grades 0 through 4.
That is, when reviewing a search result for a query with an editorial grade of 3, there is a 7/16 chance the user will be satisfied with that document and hence terminate the search, and a 9/16 chance they will continue with the next item on the ranked list.
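As a minimal sketch, the grade-to-probability mapping above can be written in a few lines (the function name here is my own, not anything from the challenge code):

```python
def grade_to_prob(grade, max_grade=4):
    """Map an editorial grade g in {0, ..., max_grade} to the
    cascade-model satisfaction probability (2^g - 1) / 2^max_grade."""
    return (2 ** grade - 1) / 2 ** max_grade

# Grades 0, 1, 2, 3, 4 map to 0, 1/16, 3/16, 7/16, 15/16.
```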
Expected Reciprocal Rank
Expected reciprocal rank is just the expectation of the reciprocal of the rank at which a user stops. Suppose for a query $q$, a system returns a ranked list of documents $d_1, \ldots, d_n$, where the probability that document $d_k$ satisfies the user query is given by the transform of the editorial grade $g_k$ assigned to the query-document pair, which we write $R_k = R(g_k)$. If we let $S$ be a random variable denoting the rank at which we stop, the metric is the expectation of $1/S$,

$$\mathrm{ERR} = \mathbb{E}[1/S] = \sum_{k=1}^{n} \frac{1}{k} \, P(S = k) = \sum_{k=1}^{n} \frac{1}{k} \, R_k \prod_{i=1}^{k-1} (1 - R_i).$$
Stopping at rank $k$ involves being satisfied with document $d_k$, the probability of which is $R_k$. It also involves not having been satisfied with any of the previous documents ranked $1, \ldots, k-1$, the probability of which is $\prod_{i=1}^{k-1} (1 - R_i)$. This is all multiplied by $1/k$, because it’s the inverse stopping rank whose expectation is being computed.
For instance, suppose my system returns documents D1, D2, and D3 for a query Q, where the editorial grades for the document-query pairs are 3, 2, and 4 respectively. The expected reciprocal rank is computed as follows:

| Rank k | 1/Rank | Grade | p(satisfied by doc k) | p(stop at doc k) |
|---|---|---|---|---|
| 1 | 1/1 | 3 | 7/16 | 7/16 |
| 2 | 1/2 | 2 | 3/16 | 3/16 * (1 - 7/16) |
| 3 | 1/3 | 4 | 15/16 | 15/16 * (1 - 3/16) * (1 - 7/16) |
For instance, to stop at rank 2, I have to not be satisfied by the document at rank 1 and be satisfied by the document at rank 2.
We then just multiply the reciprocal ranks by the stop probabilities to get:
ERR = 1/1 * 7/16 + 1/2 * 3/16 * (1 - 7/16) + 1/3 * 15/16 * (1 - 3/16) * (1 - 7/16) = 0.63
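The whole computation can be sketched in a few lines of Python. This is a minimal implementation under the definitions above, not the challenge’s official scorer, and the function name is my own:

```python
def expected_reciprocal_rank(grades, max_grade=4):
    """Compute ERR for a ranked list of editorial grades under the
    cascade model: ERR = sum_k (1/k) * R_k * prod_{i<k} (1 - R_i)."""
    err = 0.0
    p_reached = 1.0  # probability the user reaches rank k unsatisfied
    for k, g in enumerate(grades, start=1):
        r = (2 ** g - 1) / 2 ** max_grade  # satisfaction probability R_k
        err += p_reached * r / k           # contribution of stopping at k
        p_reached *= 1 - r                 # user continues past rank k
    return err
```

Running it on the grades [3, 2, 4] from the example reproduces the 0.63 figure.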
Document Relevance Independence
One problem remaining for this metric (and others) is correlated documents. I often find myself doing a web search and getting lots and lots of hits that are essentially wrappers around the same PDF paper (e.g. CiteSeer, Google Scholar, ACM, several authors’ publication pages, departmental publications pages, etc). Assigning each of these an independent editorial grade does not make sense, because if one satisfies my need, they all will, and if one doesn’t satisfy my need, none of them will. They’re not exact duplicate pages, though, so it’s not quite just a deduplication problem. And of course, this is just the extreme end of correlation among results.
Lots of queries are ambiguous. For instance, consider the query [john langford], which is highly ambiguous. There’s an Austin photographer, a realtor, the machine learning blogger, etc. I probably want one of these, not all of them (unless I’m doing a research type search), so my chance of stopping at a given document is highly dependent on which John Langford is being mentioned. This is another example of non-independence of stopping criteria.
Reciprocal Rank vs. Rank
One nice feature of the expected reciprocal rank metric is that it always falls between 0 and 1, with 1 being best. This means that the loss due to messing up any single example is bounded. This is like 0/1 loss for classifiers, where the maximum loss for misclassifying a given example is 1.
Log loss for classifiers is the negative log probability of the correct response. This number isn’t bounded. Similarly, were we to measure expected rank rather than expected reciprocal rank, the loss for a single example would not be bounded other than by the number of possible matching docs. It actually seems to make more sense to me to use ranks because of their natural scale and ease of combination; expected rank measures the actual number of pages a user can be expected to consider.
With reciprocal rank, as the results lists get longer, the tail is inversely weighted. That is, finding a document at rank 20 adds at most 1/20 to the result. A list of 20 documents all with grade 0 (completely irrelevant) has an ERR of 0, and a list of 20 docs with the first 19 at grade 0 and the last at grade 4 (perfect) has an ERR of 1/20 * 15/16. In general, if the perfect result is at position k and all previous results are completely irrelevant, the ERR is 15/16 * 1/k, or roughly 1/k. This means there’s a huge difference between rank 1 and rank 2 (ERR ≈ 0.94 vs. ERR ≈ 0.47), with a much smaller absolute difference between rank 10 and rank 20 (ERR ≈ 0.094 vs. ERR ≈ 0.047). This matters when we’re averaging the ERR values for a whole bunch of examples.
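To see this tail behavior concretely, here is a small sketch (the function name is mine) of the ERR of a list whose only relevant document, a grade-4 one, sits at rank k, preceded by grade-0 documents:

```python
def err_perfect_at(k, max_grade=4):
    """ERR when the only relevant result (a grade-4 document) sits at
    rank k, preceded by k-1 grade-0 (irrelevant) documents."""
    r = (2 ** max_grade - 1) / 2 ** max_grade  # 15/16 for grade 4
    # Grade-0 documents have R = 0, so nothing discounts the stop
    # probability at rank k; the ERR is just r / k.
    return r / k

# The gap between ranks 1 and 2 (about 0.94 vs. 0.47) dwarfs the gap
# between ranks 10 and 20 (about 0.094 vs. 0.047).
```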
With ranks, there’s the huge problem of no results being found, which provides a degenerate expected rank calculation. Somehow we need the expected rank calculation to return a large value if there are no relevant results. Maybe if everyone’s evaluating the same documents it won’t matter.