Time, Negation, and Clinical Events
Mitzi’s been annotating clinical notes for time expressions, negations, and a couple other classes of clinically relevant phrases like diagnoses and treatments (I just can’t remember exactly which!). This is part of the project she’s working on with Noemie Elhadad, a professor in the Department of Biomedical Informatics at Columbia.
LingPipe Chunk Annotation GUI
Mitzi’s doing the phrase annotation with a LingPipe tool which can be found in
- LingPipe Sandbox: citationEntities project
She even brought it up to date with the current release of LingPipe and generalized the layout for documents with subsections.
Our annotation tool follows the tag-a-little, train-a-little paradigm, in which an automatic system based on the already-annotated data is trained as you go to pre-annotate the data for a user to correct. This approach was pioneered in MITRE’s Alembic Workbench, which was used to create the original MUC-6 named-entity corpus.
The chunker underlying LingPipe’s annotation toolkit is based on LingPipe’s character language-model rescoring chunker, which can be trained online (that is, as the data streams in) and has quite reasonable out-of-the-box performance. It’s LingPipe’s best out-of-the-box chunker. In contrast, CRFs can be engineered to outperform the rescoring chunker with good feature engineering.
A very nice project would be to build a semi-supervised version of the rescoring chunker. The underlying difficulty is that our LM-based and HMM-based models take count-based sufficient statistics.
Mitzi’s getting reasonable system accuracy under cross validation, with over 80% precision and recall (and hence over 80% balanced F-measure).
That’s not Cricket!
According to received wisdom in natural language processing, she’s left out a very important step of the standard operating procedure. She’s supposed to get another annotator to independently label the data and then measure inter-annotator agreement.
If we can train a system to performa at 80%+ F-measure under cross-validation, who cares if we can’t get another human to match Mitzi’s annotation?
We have something better — we can train a system to match Mitzi’s annotation!
In fact, training such a system is really all that we often care about. It’s much better to be able to train a system than another human to do the annotation.
The other thing we might want a corpus for is to evaluate a range of systems. There, if the systems are highly comparable, the fringes of the corpus matter. But perhaps the small, but still p < 0.05, differences in such systems don't matter so much. What the MT people have found is that even a measure that's roughly correlated with performance can be used to guide system development.
Error Analysis and Fixing Inconsistencies
Mitzi’s been doing the sensible thing of actually looking at the errors the system’s making under cross validation. In some of these cases, she’d clearly made a braino and annotated the data wrong. So she fixes it. And system performance goes up.
What Mitzi’s reporting is what I’ve always found in these tasks. For instance, she inconsistently annotated time plus date sequences, sometimes including the times and sometimes not. So she’s going back to correct to do it all consistently to include all of the time information in a phrase (makes sense to me).
After a couple of days of annotation, you get a much stronger feeling for how the annotations should have gone all along. The annotations drifted so much over time in this fashion in the clinical notes annotated for the i2b2 Obesity Challenge that the winning team exploited time of labeling as an informative feature to predict co-morbidities of obesity!
That’s also not Cricket!
The danger with re-annotating is that the system’s response will bias the human annotations. System-label bias is also a danger with single annotation under the tag-a-little, learn-a-little setup. If you gradually change the annotation to match the system’s responses, you’ll eventually get to very good, if not perfect, performance under cross validation.
So some judgment is required in massaging the annotations into a coherent system, but one that you care about, not one driven by the learned system’s behavior.
On the other hand, you do want to choose features and chunkings the system can learn. So if you find you’re trying to make distinctions that are impossible for the system to learn, then change the coding standard to make it more learnable, that seems OK to me.
Go Forth and Multiply
Mitzi has only spent a few days annotating the data and the system’s already working well end to end. This is just the kind of use case that Breck and I had in mind when we built LingPipe in the first place. It’s so much fun seeing other people use your tools
When Breck and Linnea and I were annotating named entities with the citationEntities tool, we could crank along at 5K tokens/hour without cracking a sweat. Two eight-hour days will net you 80K tokens of annotated data and a much deeper insight into the problem. In less than a person-week of effort, you’ll have a corpus the size of the MUC 6 entity corpus.
Of course, it’d be nice to roll in some active learning here. But that’s another story. As is measuring whether it’s better to have a bigger or a better corpus. This is the label-another-instance vs. label-a-fresh-instance decision problem that (Sheng et al. 2008) addressed directly.