Yahoo!’s Learning to Rank Challenge

[This is a refactoring of the material that should have been broken out from yesterday's discussion of the expected reciprocal rank evaluation metric.]

Yahoo! is hosting an online Learning to Rank Challenge. Sort of like a poor man’s Netflix, given that the top prize is US$8K. [Update: I clearly can't read. As Olivier Chapelle, one of the organizers points out, the rules clearly state that "If no member of a winning Team is able to attend, a representative of the Sponsor will give the presentation on the winning Team's behalf." Haifa's an interesting city, especially if you can hang out with the natives at the Technion, but it's an expensive trip in both time and money from the US.]

Like Netflix, this is an incredibly rich data set tied to a real large-scale application. I can imagine how much editorial time went into creating it. There are over 500K real query-document relevance judgments!

Training Data

The supervised training data consists of 500K or so query-document pairs, each represented as a 750-dimensional feature vector and assigned an “editorial grade” of relevance from 0 (completely irrelevant) to 4 (near-perfect match). There’s also a smaller set of 20K or so pairs, with some overlapping and some new features, for evaluating adaptation.
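For concreteness, here’s a sketch of what loading such data might look like, assuming a query-grouped SVM-light-style line format (“grade qid:id dim:value …”); the exact release format isn’t spelled out here, so treat the parsing details as assumptions rather than a description of the actual files.

    # Sketch of a training-data loader. The "grade qid:<id> <dim>:<val> ..."
    # line format is an assumption about the release, not a confirmed spec.
    def load_examples(path):
        examples = []  # list of (qid, grade, sparse-feature-dict) triples
        with open(path) as f:
            for line in f:
                fields = line.split()
                grade = int(fields[0])           # editorial grade, 0..4
                qid = fields[1].split(':')[1]    # query identifier
                feats = {}
                for tok in fields[2:]:           # sparse feature vector
                    dim, val = tok.split(':')
                    feats[int(dim)] = float(val)
                examples.append((qid, grade, feats))
        return examples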

Unfortunately, there’s no raw data, just the vectors.

Test Data

The contest involves ranking a set of documents with respect to a query, where the input is the set of query-document vectors. An evaluation set consists of query-document pairs represented as vectors, with any number of candidate documents per query. Systems must return a ranking of the candidate documents for each query.
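Mechanically, a submission is just a per-query ordering. Here’s a minimal sketch, assuming each test example carries a query id, a document id, and a feature vector, and assuming some trained scoring function (the score argument below is a stand-in, not part of the contest’s interface):

    from collections import defaultdict

    def rank_by_query(examples, score):
        # Group (qid, doc_id, feats) examples by query and sort each
        # group by descending model score; ties fall back to doc_id.
        by_query = defaultdict(list)
        for qid, doc_id, feats in examples:
            by_query[qid].append((score(feats), doc_id))
        rankings = {}
        for qid, scored in by_query.items():
            scored.sort(reverse=True)
            rankings[qid] = [doc_id for _, doc_id in scored]
        return rankings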

Binary Logistic Regression for Ranking

In case you were curious, I’m going to tackle the learning-to-rank problem by generalizing LingPipe’s logistic regression models to allow probabilistic training, that is, regressing against fractional outcomes in [0,1] rather than against 0/1 labels. Then I can train the editorial grades’ stopping probabilities directly. If I can predict those well, I can score well at ranking. I don’t see any advantage here to training something like DCA compared to a binary logistic regression on a per-example basis.
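To make “probabilistic training” concrete: ERR maps a grade g in 0..4 to a stopping probability (2^g − 1)/2^4, and the regression is then fit against those fractional targets; the cross-entropy gradient keeps its usual (prediction − target) form. Here’s a minimal stochastic-gradient sketch in plain Python, nothing LingPipe-specific; the learning rate, prior, and 0-based feature indexing are placeholder assumptions.

    import math

    def stop_prob(grade, max_grade=4):
        # ERR's mapping from editorial grade to stopping probability
        return (2 ** grade - 1) / float(2 ** max_grade)

    def train_probabilistic_logreg(data, num_dims, epochs=5, rate=0.1, l2=1e-4):
        # data: iterable of (sparse-feature-dict, target) pairs, where
        # target is a probability in [0,1] rather than a 0/1 label.
        w = [0.0] * num_dims
        for _ in range(epochs):
            for feats, target in data:
                z = sum(w[d] * v for d, v in feats.items())
                z = max(-35.0, min(35.0, z))          # avoid overflow
                pred = 1.0 / (1.0 + math.exp(-z))
                err = pred - target                   # d(cross-entropy)/dz
                for d, v in feats.items():
                    # shrink only the touched dimensions, for simplicity
                    w[d] -= rate * (err * v + l2 * w[d])
        return w

Given a loader like the one sketched above, the training set is then just pairs of feature vectors and stop_prob(grade) values.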

I’m not sure which regression packages that scale to data this size allow probabilistic training. It could always be faked by building an even bigger data set of appropriately weighted (or replicated) positive and negative items, as sketched below.
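The faking trick amounts to this: an example with fractional target p contributes the same expected gradient as a positive copy weighted by p plus a negative copy weighted by 1 − p; with a trainer that doesn’t support instance weights, the weights can be approximated by replication. A sketch (the replication count n is arbitrary):

    def to_weighted_binary(data):
        # (feats, p) -> a weight-p positive item and a weight-(1-p) negative item
        for feats, p in data:
            yield feats, 1, p
            yield feats, 0, 1.0 - p

    def to_replicated_binary(data, n=100):
        # For trainers without instance weights: replicate instead of weighting.
        for feats, p in data:
            num_pos = int(round(p * n))
            for _ in range(num_pos):
                yield feats, 1
            for _ in range(n - num_pos):
                yield feats, 0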

No IP Transfer, But a Non-Compete

I’d like to be able to play with these data sets, publish about them, and release demos based on them without transferring any intellectual property to a third party beyond a publication and publicity rights. IANAL, but I don’t think I can do that with this data.

We still haven’t gone through the legal fine print, nor is it clear it will be worth our while to do so in the end. It’s expensive to run this stuff by lawyers to look for gotchas. And my own read makes it seem like a no-go.

It doesn’t appear from a quick read that they require the winners to hand over the algorithm. They do require that the winners have the rights to their submissions. And they will take possession of copyright on submissions (presumably not including the algorithm itself). This is great, if I’m reading it right.

On the other hand, the non-compete clause (4a) immediately jumped out at me as a deal breaker:

THE TEAM WILL NOT ENTER THE ALGORITHM NOR ANY OUTPUT OR RESULTS THEREOF IN ANY OTHER COMPETITION OR PROMOTION OFFERED BY ANYONE OTHER THAN THE SPONSOR FOR ONE YEAR AFTER THE CONCLUSION OF THE CONTEST PERIOD;

Not only are they shouting, but I’m also not sure what constitutes “the algorithm”. It’d be absurd to tie up LingPipe’s logistic regression just by entering a contest. Yet it’s not clear what more there is to “THE ALGORITHM”.

And if we release something as part of LingPipe, our royalty-free license lets people do whatever they want with it. So I don’t see how we could comply.

The other issue that’s problematic for a company is (4g):

agrees to indemnify and hold the Contest Entities and their respective subsidiaries, affiliates, officers, directors, agents, co-branders or other partners, and any of their employees (collectively, the “Contest Indemnitees”), harmless from any and all claims, damages, expenses, costs (including reasonable attorneys’ fees) and liabilities (including settlements), brought or asserted by any third party against any of the Contest Indemnitees due to or arising out of the Team’s Submissions or Additional Materials, or any Team Member’s conduct during or in connection with this Contest, including but not limited to trademark, copyright, or other intellectual property rights, right of publicity, right of privacy and defamation.

One doesn’t like to put one’s company on the line for a contest that’ll at most net US$8K (less travel and conference registration) and minor publicity.

4 Responses to “Yahoo!’s Learning to Rank Challenge”

  1. Olivier Chapelle Says:

    Regarding the prize requirement: in fact, one of the rules states that “each winning Team will be required to create and submit to Sponsor a presentation”. There is no need to go to Haifa if you can’t make it.

    And about the clause 4a: I’m not a lawyer, but my understanding is that this clause is meant to prevent an entanglement resulting from simultaneous participation in two challenges with conflicting rules.

    • lingpipe Says:

      Thanks for the feedback; I updated the post accordingly. The rules were pretty clear when I actually read them! I should’ve sent you (and the other organizers) a draft first.

      It’s really hard for us poor (in the cash sense, not the “woe is me” sense) startups to interpret all the legalese. My own eyes sort of glaze over at the language, which always seems vaguely menacing.

  2. Mathieu Says:

    Just a minor detail but the vectors are 700-dimensional.

    The problem with treating learning to rank as a mere classification or regression problem is that you don’t use the *relative* positions of documents with respect to each other, and classification and regression are more general problems to solve.

    Something worth mentioning is that queries in the test set don’t exist in the training set. This means that the feature vectors necessarily contain information about both the document d and the query q, not d alone.

    I agree that the absence of the raw data is a bit disappointing. I guess two possible explanations are that Yahoo isn’t willing to disclose the raw data and that it would be too large. This means that the learning, and possibly feature selection or dimensionality reduction, are the main things that will distinguish the teams.

    Interestingly, the current 20 best teams are all within a 0.01 range of each other with respect to ERR.

    • lingpipe Says:

      The vectors also contain slightly different non-zero dimensions in the main and adaptation versions of the bakeoff.

      Because ERR depends only on the editorial grades, if you can predict the editorial grade of a document/query vector pair, you can optimize the ranking. (There’s a small sketch of the per-query ERR computation at the end of this comment.)

      I’d guess Yahoo! was just playing it safe on privacy after the AOL and Netflix debacles.
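      For the record, here’s a small sketch of the per-query ERR computation from grades in ranked order, using the (2^g − 1)/2^4 grade-to-probability mapping from the ERR definition:

          def err(ranked_grades, max_grade=4):
              # Expected reciprocal rank: probability of stopping at rank r,
              # discounted by 1/r, summed over ranks.
              p_continue = 1.0
              total = 0.0
              for rank, g in enumerate(ranked_grades, start=1):
                  r = (2 ** g - 1) / float(2 ** max_grade)
                  total += p_continue * r / rank
                  p_continue *= 1.0 - r
              return total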
