[Update: 13 October 2009: David Madigan says they got the wrong legalese the first go round and are going to replace it.]
The Coarse Print
There’s no way we’d enter, given the ridiculously restrictive OMOP Cup rules.
The official rules only mention source in passing and only suggest you send them a tech report. But the home page says you have to “share your method”.
In addition, you have to transfer the rights to the contest organizers, which means you wouldn’t even own the rights to use your own technique after submitting it! If that’s not enough, the rules also require you to indemnify the organizers against damages (e.g. when a patent troll sues them because they think your entry infringed their patent).
If you work for a company or for a university, you may not even have the right to reassign your intellectual property this way.
The Two Challenges
There are two predictive challenges, based on the same underlying training data. For full details, see the site above and the challenge overviews:
Data’s out now; progress prizes ($5K total) will be awarded at the end of November 2009, and grand prizes ($15K total) at the end of March 2010.
Simulated Training Data
Unfortunately, they’re using simulated data. For what it’s worth, here’s OMOP’s call for simulations, so you can figure out some of the basics of how they were planning to simulate.
The basic data sizes are bigger than for Netflix, consisting of roughly:
- 5K drugs
- 5K conditions
- 10M persons
- 300M condition occurrences over 10 years
- 90M drug exposures over 10 years
- 4K positive, 4K negative associations (labeled training data)
The basic observational training data is organized into four DB table dumps:
- Conditions: start date, person, condition
- Drug Exposure: start date, end date, person, drug
- Person Observations: start date, end date, person id, person status (alive/dead), prescription data (yes/no)
- Person Data: id, birth year, gender (M/F), race (white/non-white)
The labeled training data contains 4000 examples of positive drug-condition associations and 4000 examples of negative associations.
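In case it helps to see the shape of the data, here’s a minimal sketch of record types mirroring the four table dumps and the labeled associations as described above. The field names and types are my own guesses; the real column names and codes come from OMOP’s CDM spec, not from anything I’ve verified against the actual dumps.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical record types; actual OMOP CDM column names will differ.

@dataclass
class ConditionOccurrence:
    start_date: date
    person_id: int
    condition_id: int

@dataclass
class DrugExposure:
    start_date: date
    end_date: date
    person_id: int
    drug_id: int

@dataclass
class PersonObservation:
    start_date: date
    end_date: date
    person_id: int
    alive: bool        # person status (alive/dead)
    has_rx_data: bool  # prescription data (yes/no)

@dataclass
class Person:
    person_id: int
    birth_year: int
    gender: str  # "M" / "F"
    race: str    # "white" / "non-white"

@dataclass
class LabeledAssociation:
    drug_id: int
    condition_id: int
    positive: bool  # 4000 positive + 4000 negative examples
```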
I have no idea if the drug and condition data link to real-world drugs and conditions, though the challenge indicates they want you to use outside data, so they probably do (I’ll post any links that people send me about ways to use this data). OMOP’s common data model (CDM) specification is huge, and all you get in the data files are numerical codes.
For Challenge 1 ($10K grand prize), you just provide a score for each drug/condition pair, and the scores are only used for ranking.
For Challenge 2 ($5K grand prize), there’s a time component I didn’t understand, and they’re only using the first 500 of the 5000 drugs.
For the first bakeoff, it’s just average precision (see LingPipe’s ScoredPrecisionRecall class documentation or the task descriptions linked above for an explanation of (mean) average precision).
For the second bakeoff, it’s mean average precision, where means are over years.
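For readers who don’t want to chase the links, here’s a sketch of the standard definitions: average precision over a ranked list of binary relevance labels, and mean average precision as the mean of per-list APs (per year, in the second bakeoff). This is just the textbook metric, not the organizers’ actual scoring code.

```python
def average_precision(ranked_labels):
    """Average precision for a ranked list of 0/1 relevance labels,
    best-scored item first: the mean of precision@k taken at each
    rank k where a relevant item appears."""
    hits = 0
    precision_sum = 0.0
    for k, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / k
    return precision_sum / hits if hits else 0.0

def mean_average_precision(ranked_lists):
    """Mean of per-list average precisions (e.g., one list per year)."""
    return sum(average_precision(labels) for labels in ranked_lists) / len(ranked_lists)
```

Note that only the ranking matters here, which is why Challenge 1 can say the scores themselves are only used for ordering the drug/condition pairs.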
They’re supposed to have a leaderboard and a way of evaluating responses online as Netflix did. So far, I don’t see it on their site.
OMOP is the Observational Medical Outcomes Partnership, a “public-private partnership”, the stated goal of which is to “improve the monitoring of drugs for safety”.