Archive for the ‘Uncategorized’ Category

OMOP Cup: Drug Safety Surveillance Bakeoff

October 12, 2009

[Update: 13 October 2009: David Madigan says they got the wrong legalese the first go round and are going to replace it.]

David Madigan (co-lead of BMR) and crew just announced the 2009/2010 OMOP Cup, a short bakeoff aimed at predicting which drugs cause which conditions in patients.

The Coarse Print

There’s no way we’d enter, given the ridiculously restrictive OMOP Cup rules.

The official rules mention source code only in passing and only suggest you send them a tech report. But the home page says you have to “share your method”.

In addition, you have to transfer the rights to the contest organizers, which means you wouldn’t even own the rights to use your own technique after submitting it! If that’s not enough, the rules also require you to indemnify the organizers against damages (e.g. when a patent troll sues them because they think your entry infringed their patent).

If you work for a company or for a university, you may not even have the right to reassign your intellectual property this way.

The Two Challenges

There are two predictive challenges, based on the same underlying training data. For full details, see the site above and the challenge overviews.


The data’s out now; progress prizes ($5K total) will be awarded at the end of November 2009, and grand prizes ($15K total) at the end of March 2010.

Simulated Training Data

Unfortunately, they’re using simulated data. For what it’s worth, here’s OMOP’s call for simulations, so you can figure out some of the basics of how they were planning to simulate.

The basic data sizes are bigger than for Netflix, consisting of roughly:

  • 5K drugs
  • 5K conditions
  • 10M persons
  • 300M condition occurrences over 10 years
  • 90M drug exposures over 10 years
  • 4K positive, 4K negative associations (labeled training data)
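To get a feel for those sizes, here’s a back-of-the-envelope sketch in Python. The constants are just the rough figures from the list above, and treating every drug/condition pair as a candidate association is my assumption, not something the challenge description states.

```python
# Rough sizes quoted in the list above ("K" = thousand, "M" = million).
N_DRUGS = 5_000
N_CONDITIONS = 5_000
N_PERSONS = 10_000_000
N_CONDITION_OCCURRENCES = 300_000_000
N_DRUG_EXPOSURES = 90_000_000
N_LABELED = 4_000 + 4_000  # positive + negative labeled associations
YEARS = 10


def summarize():
    """Derive a few per-unit rates from the rough totals."""
    # Assumption: every (drug, condition) pair is a candidate association.
    candidate_pairs = N_DRUGS * N_CONDITIONS
    return {
        "candidate_pairs": candidate_pairs,
        "labeled_fraction": N_LABELED / candidate_pairs,
        "conditions_per_person_year": N_CONDITION_OCCURRENCES / N_PERSONS / YEARS,
        "exposures_per_person_year": N_DRUG_EXPOSURES / N_PERSONS / YEARS,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:g}")
```

Under that assumption there are 25M candidate pairs, of which the 8K labeled associations cover only a small fraction, and each simulated person averages about three condition occurrences and a bit under one drug exposure per year.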

The basic observational training data is organized into four DB table dumps:

  • Conditions: start date, person, condition

  • Drug Exposure: start date, end date, person, drug

  • Person Observations: start date, end date, person id, person status (alive/dead), prescription data (yes/no)

  • Person Data: id, birth year, gender (M/F), race (white/non-white)

The labeled training data contains 4000 examples of positive drug-condition associations and 4000 examples of negative associations.
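For concreteness, here’s a minimal Python sketch of how those records might be represented in memory. All of the class and field names are hypothetical (the real dumps only contain numerical codes and I haven’t checked their column names), and the helper at the end is just a naive illustration of joining two of the tables, not a proposed method.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ConditionOccurrence:
    start_date: date
    person_id: int
    condition_code: int


@dataclass(frozen=True)
class DrugExposure:
    start_date: date
    end_date: date
    person_id: int
    drug_code: int


@dataclass(frozen=True)
class ObservationPeriod:
    start_date: date
    end_date: date
    person_id: int
    alive: bool                  # person status (alive/dead)
    has_prescription_data: bool  # prescription data (yes/no)


@dataclass(frozen=True)
class Person:
    person_id: int
    birth_year: int
    gender: str  # "M" or "F"
    race: str    # "white" or "non-white"


@dataclass(frozen=True)
class LabeledAssociation:
    drug_code: int
    condition_code: int
    positive: bool  # True for a positive association, False for a negative one


def conditions_during_exposure(exposure, conditions):
    """Return the same person's conditions that start within the exposure window."""
    return [
        c for c in conditions
        if c.person_id == exposure.person_id
        and exposure.start_date <= c.start_date <= exposure.end_date
    ]
```

With 300M condition rows you would obviously want an index or a database join rather than a list scan; the point here is only the shape of the data.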

I have no idea if the drug and condition data link to real-world drugs and conditions, though the challenge indicates they want you to use outside data, so they probably do (I’ll post any links that people send me about ways to use this data). OMOP’s common data model (CDM) specification is huge, and all you get in the data files are numerical codes.

System Output

For Challenge 1 ($10K grand prize), you just provide a score for each drug/condition pair, and the scores are only used for ranking.

For Challenge 2 ($5K grand prize), there’s a time component I didn’t understand, and they’re only using the first 500 of the 5000 drugs.


For the first bakeoff, the evaluation metric is just average precision (see LingPipe’s ScoredPrecisionRecall class documentation or the task descriptions linked above for an explanation of (mean) average precision).
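Here’s a minimal Python sketch of average precision using the common textbook definition: rank the pairs by score, then average the precision measured at each rank where a true positive shows up. The function name and input format are mine, and the contest’s exact variant (tie handling, normalization) may differ from this.

```python
def average_precision(scored_pairs):
    """Average precision for an iterable of (score, is_positive) pairs.

    Pairs are ranked by descending score; ties keep their input order
    (an arbitrary choice made for this sketch).
    """
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Returning 0.0 when there are no positives is a convention of this sketch.
    return precision_sum / hits if hits else 0.0


if __name__ == "__main__":
    example = [(0.9, True), (0.8, False), (0.7, True), (0.1, False)]
    print(average_precision(example))
```

In the example, the positives land at ranks 1 and 3, so the precisions being averaged are 1/1 and 2/3. Note that only the ordering of the scores matters, not their values.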

For the second bakeoff, the metric is mean average precision, where the mean is taken over years.
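As a rough sketch of what averaging over years could look like, assuming you get one ranked list of true/false labels per year (that input format, and both function names, are my assumptions, since I didn’t follow the details of the time component):

```python
from statistics import mean


def average_precision_from_labels(labels_in_rank_order):
    """Average precision for a list of booleans already sorted best-first."""
    hits = 0
    precision_sum = 0.0
    for rank, is_positive in enumerate(labels_in_rank_order, start=1):
        if is_positive:
            hits += 1
            precision_sum += hits / rank
    # Returning 0.0 when there are no positives is a convention of this sketch.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings_by_year):
    """Unweighted mean of per-year average precision.

    rankings_by_year maps a year to that year's ranked list of booleans.
    """
    return mean(
        average_precision_from_labels(ranking)
        for ranking in rankings_by_year.values()
    )


if __name__ == "__main__":
    toy = {2001: [True, False], 2002: [False, True]}
    print(mean_average_precision(toy))
```

In the toy input, the first year has average precision 1.0 and the second 0.5, and each year counts equally regardless of how many pairs it contains.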

Leader Board

They’re supposed to have a leaderboard and a way of evaluating responses online as Netflix did. So far, I don’t see it on their site.

What’s OMOP?

OMOP is the Observational Medical Outcomes Partnership, a “public-private partnership”, the stated goal of which is to “improve the monitoring of drugs for safety”.

Joint Referential (Un)Certainty: The “Wallace and Gromit” Dilemma

March 26, 2009
[Image: Wallace in bed with cheese]

Mitzi and I were talking and she said she loved … “corn chips”. Hmm. I was expecting “cheese”, which is what she’s usually eating in the other room at this time. So I was primed to think “Wallace and Gromit”. But I couldn’t remember which of the pair, Wallace or Gromit, was which. I remember the characters. One’s a head-in-the-clouds human cheese lover, the other a clever pooch.

Back when I worked on semantics, this is just the kind of data that’d get me very excited. Why? Because it’s inconsistent with theories of reference, like the one I put forward in my semantics book.

My theory of plurals would have it that to understand a conjunction like “Wallace and Gromit”, you first identify a referent for each of the conjuncts, “Wallace” and “Gromit”, which can then be used to pick out a group.

In this case, I know the two members of the group, I just don’t know which had which name.

But maybe “Wallace and Gromit” as a whole is a name. That is, maybe it’s a frozen expression, at least for me. Lots of names are like that for me, like “Johnson and Johnson”. Speaking of Johnson and Johnson, could they be just one person? That is, could both conjuncts refer to the same person? It probably mostly refers to a company as a fixed expression these days.

At one point, “Johnson and Johnson” would’ve caused confusion for named entity detectors (conjunction of two person names, or a combined company name; annotation standards like Genia’s Technical Term Annotation let you keep both). This is a problem for us now in our high recall entity extraction with terms like “insulin receptor” — is that a reference to insulin (one thing), or the receptor (another thing)?

Mitzi’s a virtual font of referential uncertainty data tonight. She said she knew that “Abbott and Costello” (the comedy team) had first names “Lou” and “Bud”, but she didn’t know which went with which last name (hint: that’s Lou Costello on the left and Bud Abbott on the right).