i2b2 Obesity Challenge: No Machine Learning Necessary


I attended the i2b2 Obesity Challenge Workshop over the weekend, where the top-performing systems by all metrics were primarily hand-built rule-based systems. The papers gave me a sense of déjà vu; they were not only built just like the expert systems of the 1970s (such as Mycin), they were motivated by a desire for explainable conclusions. That is, a clinician is going to need to review the machine’s findings, and rules are easy to understand.

The task was to classify (anonymized) discharge summaries from Massachusetts General Hospital’s Weight Center, written for patients at risk of obesity or diabetes, as to whether each patient actually was obese and whether they had any of 15 other co-morbidities such as diabetes, coronary artery disease, congestive heart failure, gout, and sleep apnea. These discharge summaries are hundreds of sentences long and discuss everything from family history and patient medical history to lab test reports and prescription lists.

The best-performing machine learning systems that treated the docs as simple bags of words were rule learners like Ripper and decision trees. Linear classifiers performed best using the top few features (usually extracted by measuring information gain, which is classification entropy minus conditional entropy given the feature).
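To make the information gain definition concrete, here is a minimal sketch in plain Python. The toy documents, labels, and function names are invented for illustration and are not taken from any challenge entry; it assumes binary "term present or absent" features over bag-of-words documents.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(docs, labels, term):
    """Classification entropy minus conditional entropy given the term."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    conditional = sum(
        (len(part) / len(labels)) * entropy(part)
        for part in (present, absent)
        if part  # skip an empty side of the split
    )
    return entropy(labels) - conditional

# Toy data (hypothetical): each document is a bag (set) of words.
docs = [{"metformin", "insulin"}, {"insulin", "aspirin"}, {"aspirin"}, {"statin"}]
labels = ["Y", "Y", "N", "N"]

# Rank the vocabulary by information gain, highest first.
vocabulary = {term for d in docs for term in d}
ranked = sorted(vocabulary, key=lambda t: information_gain(docs, labels, t), reverse=True)
```

In this toy set, "insulin" splits the classes perfectly and so ranks first, while "aspirin" appears equally often in both classes and carries no information. Keeping only the top few entries of `ranked` is the kind of selection described above.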

In terms of feature extraction and document parsing, zoning really helped. The family history section (fairly easily extracted in this data) was a common source of false positives for naive systems, which attributed a relative’s diseases to the patient. The second important step was to import synonym and abbreviation dictionaries for drugs and diseases. We saw a lot of use of resources like UMLS and RxNorm for that. Given the task had yes/no/unknown categories, everyone expected approaches such as Chapman’s NegEx to have more of an impact than they did (though one team got more mileage by customizing NegEx with a specialized dictionary for the obesity task).
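As a rough illustration of zoning plus negation handling, the sketch below splits a note on all-caps section headers, skips the family history zone, and ignores mentions preceded by a negation cue in the same sentence. The header pattern, the trigger list, and the sample note are all assumptions made up for this example; this is a crude stand-in for the idea, not NegEx itself and not any team's actual system.

```python
import re

# Hypothetical header format: an all-caps line ending in a colon.
HEADER = re.compile(r"^([A-Z][A-Z ]+):\s*$", re.MULTILINE)

# Tiny invented trigger list; plain substring matching like this is crude.
NEGATION_TRIGGERS = ("no ", "denies ", "without ")

def split_zones(text):
    """Map each section header to the text that follows it."""
    zones = {}
    matches = list(HEADER.finditer(text))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        zones[m.group(1).strip()] = text[m.end():end].strip()
    return zones  # text before the first header is ignored

def mentions(text, term, skip_zones=("FAMILY HISTORY",)):
    """True if term appears un-negated outside the skipped zones."""
    for name, body in split_zones(text).items():
        if name in skip_zones:
            continue
        for sentence in re.split(r"[.\n]", body.lower()):
            pos = sentence.find(term)
            if pos >= 0 and not any(t in sentence[:pos] for t in NEGATION_TRIGGERS):
                return True
    return False

# Invented sample note.
note = """FAMILY HISTORY:
Mother had gout.
PAST MEDICAL HISTORY:
Sleep apnea. Denies gout.
"""
```

Here `mentions(note, "sleep apnea")` is true, while `mentions(note, "gout")` is false: one mention sits in the family history zone and the other follows "denies". A system without zoning would count the mother's gout as the patient's.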

These all point to the difference between this task and other classification tasks such as overall sentiment, topic, or language identification: it’s more of an information extraction problem than a full-text classification problem. In this, it’s like aspect-oriented sentiment extraction.

This bucks the prevailing trend in the field where recent bakeoff winners have been built following a three-step program:

1. collect and annotate data,

2. extract features with a rule-based system to create a vectorized representation of a document, then

3. fit one or more discriminative linear classifiers (e.g. SVMs, logistic regression, or perceptrons).

This is a hybrid method, which really undercuts all the claims of automation from the machine learning crowd. Perhaps that’s why everyone’s so obsessed with adaptation and semi-supervised learning these days. At the same time, all the rule-based systems leaned heavily on the data collection step to tune their rules.
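The three-step program can be sketched end to end in a few lines. Everything here is a toy assumption: the four-document corpus, the three hand-written rules, and the choice of a plain perceptron as the linear classifier are invented for illustration, not drawn from any bakeoff entry.

```python
# Step 1: a tiny annotated corpus (invented; 1 = diabetic, 0 = not).
corpus = [
    ("patient on insulin for diabetes", 1),
    ("metformin started, diabetes mellitus type 2", 1),
    ("no history of diabetes, takes aspirin", 0),
    ("admitted for knee pain", 0),
]

# Step 2: hand-written rules turn a document into a feature vector.
RULES = [
    lambda d: "insulin" in d or "metformin" in d,  # drug cue
    lambda d: "diabetes" in d,                      # disease mention
    lambda d: "no history of" in d,                 # negation cue
]

def featurize(doc):
    """One 0/1 feature per rule, plus a constant bias feature."""
    return [1.0 if rule(doc) else 0.0 for rule in RULES] + [1.0]

def score(weights, doc):
    return sum(w * v for w, v in zip(weights, featurize(doc)))

def predict(weights, doc):
    return 1 if score(weights, doc) > 0 else 0

# Step 3: fit a discriminative linear classifier (here, a perceptron).
def train_perceptron(data, epochs=20):
    weights = [0.0] * (len(RULES) + 1)
    for _ in range(epochs):
        for doc, label in data:
            error = label - predict(weights, doc)  # -1, 0, or +1
            for i, v in enumerate(featurize(doc)):
                weights[i] += error * v
    return weights

weights = train_perceptron(corpus)
```

The learner only ever sees what the rules in step 2 expose, which is the sense in which the method is a hybrid: the human effort moves from writing the classifier to writing the feature extractor.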

Clearly, none of the machine learning-based entries (including ours) spent nearly enough time on feature extraction. MITRE and Mayo Clinic leveraged Mayo’s existing entity extraction and normalization systems, and the results were pretty good, though they didn’t have time to customize the resources much for the challenge (the knowledge needed was fairly deep and broad, though as one team pointed out, fully accessible on the web through keyword lookups).

I also suggested to Özlem Uzuner (the challenge organizer) that we might run the same task again next year with another pass over the data by the annotators (my current hobby horse!). One of the huge pains for this kind of eval is scrubbing for anonymity, which makes large semi-supervised tasks problematic. It’s also hard to get good gold-standard agreement and arrive at a consistent coding standard with only a pair of annotators and a tie-breaker in a single pass. I’d love to have a chance to take the features of the winning systems and carry out step (2). I can’t do it now, because we had to destroy all the data after the workshop due to privacy and liability concerns.

Cincinnati Children’s Hospital managed to get their ICD-9-CM coding data released to the public, which I’m told is quite remarkable. Their Medical NLP Challenge to perform ICD-9 coding of radiology reports showed a similar pattern of results to the i2b2 Obesity Challenge, except for UPenn’s entry, which placed second following the three-step methodology above.

If you’re interested in how we did, we were in the middle of the pack of 28 systems. A few quick and dirty feature extraction tricks for associating drug terms and diseases and to distribute negation helped a bit, as did using information gain to select features before training with L1-regularized logistic regression.
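For readers unfamiliar with L1-regularized logistic regression, the sketch below fits one by proximal gradient descent, where the L1 penalty is applied as a soft-thresholding step after each gradient update. This is an illustrative stand-in, not the code we ran for the challenge; the function names, hyperparameters, and toy data are assumptions, and the intercept is left unpenalized.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(w, t):
    """Shrink w toward zero by t; the proximal step for an L1 penalty."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def fit_l1_logistic(X, y, l1=0.1, lr=0.5, epochs=500):
    """Minimize mean log loss + l1 * sum(|w|) by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj / n
        b -= lr * grad_b  # intercept is not penalized
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, grad_w)]
    return w, b

# Toy data (invented): column 0 predicts the label, column 1 is noise.
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
w, b = fit_l1_logistic(X, y)
```

On this toy data the weight on the noise column stays at exactly zero while the informative column gets a positive weight, which is the sparsity that makes L1 regularization a natural partner for feature selection.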
