[This is a refactoring of the material that should have been broken out from yesterday's discussion of the expected reciprocal rank evaluation metric.]
Yahoo! is hosting an online Learning to Rank Challenge. Sort of like a poor man’s Netflix, given that the top prize is US$8K. [Update: I clearly can't read. As Olivier Chapelle, one of the organizers, points out, the rules clearly state that "If no member of a winning Team is able to attend, a representative of the Sponsor will give the presentation on the winning Team's behalf." Haifa's an interesting city, especially if you can hang out with the natives at the Technion, but it's an expensive trip in both time and money from the US.]
Like Netflix, this is an incredibly rich data set tied to a real large-scale application. I can imagine how much editorial time went into creating it. There are over 500K real query relevance judgments!
The supervised training data consists of 500K or so query-document pairs, each represented as a 750-dimensional vector and assigned an “editorial grade” of relevance from 0 (completely irrelevant) to 4 (near perfect match). There’s also a smaller set of 20K or so pairs, with some overlapping and some new features, for evaluating adaptation.
Unfortunately, there’s no raw data, just the vectors.
The contest involves ranking a set of documents with respect to a query, where the input is the set of document-query vectors. An evaluation set consists of a number of query-document pairs represented as vectors, where there may be any number of potential documents for each query. The systems must return rankings of the documents for each query.
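To make the task concrete, here is a minimal sketch of the ranking step in Python. It assumes some model has already produced a score for each query-document vector; the function name, the triple layout, and the toy IDs are all hypothetical and say nothing about the contest's actual submission format.

```python
from collections import defaultdict

def rank_documents(scored_pairs):
    """Group (query_id, doc_id, score) triples by query and
    return each query's documents ordered from highest to lowest score."""
    by_query = defaultdict(list)
    for query_id, doc_id, score in scored_pairs:
        by_query[query_id].append((doc_id, score))
    return {
        query_id: [doc for doc, _ in
                   sorted(docs, key=lambda pair: pair[1], reverse=True)]
        for query_id, docs in by_query.items()
    }

# Toy scores (made up for illustration); queries may have any number of documents.
scored = [("q1", "d1", 0.2), ("q1", "d2", 0.9),
          ("q2", "d3", 0.5), ("q1", "d4", 0.4)]
rankings = rank_documents(scored)
print(rankings)
```

The only thing the ranker needs from the model is a per-pair score, which is why a per-example regression is a plausible approach.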
Binary Logistic Regression for Ranking
In case you were curious, I’m going to tackle the learning to rank problem by generalizing LingPipe’s logistic regression models to allow probabilistic training, that is, training on target probabilities rather than just 0/1 outcomes. Then I can train on the stopping probabilities derived from the editorial grades directly. If I can predict those well, I can score well at ranking. I don’t see any advantage to training something like DCG compared to a binary logistic regression on a per-example basis here.
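As a rough sketch of what the training targets might look like, the snippet below converts an editorial grade into a stopping probability. It assumes the mapping used in the expected reciprocal rank metric, (2^g - 1) / 2^g_max with g_max = 4; the function name is made up for illustration.

```python
def stopping_probability(grade, max_grade=4):
    """Map an editorial grade in 0..max_grade to a stopping probability,
    assuming the expected reciprocal rank mapping (2^g - 1) / 2^max_grade."""
    if not 0 <= grade <= max_grade:
        raise ValueError("grade must be between 0 and max_grade")
    return (2 ** grade - 1) / 2 ** max_grade

# One target probability per editorial grade, 0 through 4.
targets = [stopping_probability(g) for g in range(5)]
print(targets)  # [0.0, 0.0625, 0.1875, 0.4375, 0.9375]
```

Under this assumption even a grade of 4 tops out at 15/16, so none of the targets other than grade 0 is a clean binary outcome, which is why ordinary 0/1 logistic regression training doesn't fit directly.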
I’m not sure which regression packages that scale to this size data allow probabilistic training. It could always be faked by making an even bigger data set with positive and negative items.
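Here is one way the faking could go, sketched in Python under the assumption that the regression package accepts per-example weights (with a package that doesn't, the weights could be approximated by repeating each item an integer number of times). Each example with target probability p becomes a positive item with weight p and a negative item with weight 1 - p. The function names are hypothetical.

```python
import math

def expand_soft_example(features, p):
    """Turn one example with target probability p into weighted
    binary examples of the form (features, label, weight)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    examples = []
    if p > 0.0:
        examples.append((features, 1, p))        # positive copy, weight p
    if p < 1.0:
        examples.append((features, 0, 1.0 - p))  # negative copy, weight 1 - p
    return examples

def weighted_log_loss(examples, predicted):
    """Weighted negative log likelihood when the model predicts
    probability `predicted` for the positive outcome."""
    return -sum(w * math.log(predicted if y == 1 else 1.0 - predicted)
                for _, y, w in examples)

expanded = expand_soft_example([1.0, 0.5], 0.25)
loss = weighted_log_loss(expanded, 0.6)
# Cross-entropy against the soft target 0.25, computed directly.
soft_loss = -(0.25 * math.log(0.6) + 0.75 * math.log(0.4))
print(expanded, loss, soft_loss)
```

The weighted loss on the expanded pair matches the cross-entropy against the soft target, so a binary learner fit on the expanded data is optimizing the same objective as probabilistic training would, at the cost of up to twice as many training items.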
No IP Transfer, But a Non-Compete
I’d like to be able to play with these data sets and publish about them and release demos based on them without transferring any intellectual property to a third party other than a publication and rights to publicity. IANAL, but I don’t think I can do that with this data.
We still haven’t gone through the legal fine print, nor is it clear it will be worth our while to do so in the end. It’s expensive to run this stuff by lawyers to look for gotchas. And my own read makes it seem like a no-go.
It doesn’t appear from a quick read that they require the winners to give them the algorithm. They do require the winners have the rights to their submission. And they will take possession of copyright on submissions (presumably not including the algorithm itself). This is great if I’m reading it right.
On the other hand, the non-compete clause (4a) immediately jumped out at me as a deal breaker:
THE TEAM WILL NOT ENTER THE ALGORITHM NOR ANY OUTPUT OR RESULTS THEREOF IN ANY OTHER COMPETITION OR PROMOTION OFFERED BY ANYONE OTHER THAN THE SPONSOR FOR ONE YEAR AFTER THE CONCLUSION OF THE CONTEST PERIOD;
Not only are they shouting, but I’m not sure what constitutes “the algorithm”. It’d be absurd to tie up LingPipe’s logistic regression by entering a contest. Yet it’s not clear what more there is to “THE ALGORITHM”.
And if we release something as part of LingPipe, our royalty-free license lets people do whatever they want with it. So I don’t see how we could comply.
The other issue that’s problematic for a company is (4g):
agrees to indemnify and hold the Contest Entities and their respective subsidiaries, affiliates, officers, directors, agents, co-branders or other partners, and any of their employees (collectively, the “Contest Indemnitees”), harmless from any and all claims, damages, expenses, costs (including reasonable attorneys’ fees) and liabilities (including settlements), brought or asserted by any third party against any of the Contest Indemnitees due to or arising out of the Team’s Submissions or Additional Materials, or any Team Member’s conduct during or in connection with this Contest, including but not limited to trademark, copyright, or other intellectual property rights, right of publicity, right of privacy and defamation.
One doesn’t like to put one’s company on the line for a contest that’ll at most net US$8K (less travel and conference registration) and minor publicity.