Cross Validation vs. Inter-Annotator Agreement


Time, Negation, and Clinical Events

Mitzi’s been annotating clinical notes for time expressions, negations, and a couple other classes of clinically relevant phrases like diagnoses and treatments (I just can’t remember exactly which!). This is part of the project she’s working on with Noemie Elhadad, a professor in the Department of Biomedical Informatics at Columbia.

LingPipe Chunk Annotation GUI

Mitzi’s doing the phrase annotation with a LingPipe tool which can be found in

She even brought it up to date with the current release of LingPipe and generalized the layout for documents with subsections.

Our annotation tool follows the tag-a-little, learn-a-little paradigm, in which an automatic system is trained as you go on the already-annotated data and used to pre-annotate new data for the user to correct. This approach was pioneered in MITRE’s Alembic Workbench, which was used to create the original MUC-6 named-entity corpus.

The chunker underlying LingPipe’s annotation toolkit is based on LingPipe’s character language-model rescoring chunker, which can be trained online (that is, as the data streams in) and has quite reasonable out-of-the-box performance. It’s LingPipe’s best out-of-the-box chunker. In contrast, CRFs need good feature engineering before they outperform the rescoring chunker.

A very nice project would be to build a semi-supervised version of the rescoring chunker. The underlying difficulty is that our LM-based and HMM-based models take count-based sufficient statistics.
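To make "count-based sufficient statistics" concrete, here's a toy sketch in Python. It is not LingPipe's implementation; the class name, the bigram order, and the additive smoothing are all assumptions made for illustration. The point is only that the model's entire trained state is a table of counts, so online training amounts to incrementing them.

```python
from collections import Counter


class CharBigramCounts:
    """Toy character bigram model whose only trained state is counts.

    Hypothetical stand-in for illustration; not LingPipe's model.
    """

    def __init__(self, num_chars=256, smoothing=1.0):
        self.bigram = Counter()   # count of (previous char, current char)
        self.context = Counter()  # count of previous char as a context
        self.num_chars = num_chars
        self.smoothing = smoothing

    def train(self, text):
        """Online update: add this text's bigrams to the counts."""
        for prev, cur in zip(text, text[1:]):
            self.bigram[(prev, cur)] += 1
            self.context[prev] += 1

    def prob(self, prev, cur):
        """Additively smoothed estimate of P(cur | prev) from the counts."""
        num = self.bigram[(prev, cur)] + self.smoothing
        den = self.context[prev] + self.smoothing * self.num_chars
        return num / den


model = CharBigramCounts()
for phrase in ["3pm", "3am"]:  # data arriving one item at a time
    model.train(phrase)
p = model.prob("3", "p")  # (1 + 1) / (2 + 256)
```

Because the estimates are computed from the counts on demand, the order in which training items arrive makes no difference to the final model.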

It Works!

Mitzi’s getting reasonable system accuracy under cross validation, with over 80% precision and recall (and hence over 80% balanced F-measure).
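The "and hence" follows because balanced F-measure is the harmonic mean of precision and recall, so it can never fall below the smaller of the two. A minimal sketch in Python; the function name and the counts are made up for illustration and are not Mitzi's actual numbers.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-measure; beta=1.0 gives the balanced harmonic mean."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical chunk-level counts, for illustration only.
true_pos, false_pos, false_neg = 85, 15, 18
precision = true_pos / (true_pos + false_pos)
recall = true_pos / (true_pos + false_neg)
f1 = f_measure(precision, recall)

# The harmonic mean always lies between the two values it averages.
assert min(precision, recall) <= f1 <= max(precision, recall)
```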

That’s not Cricket!

According to received wisdom in natural language processing, she’s left out a very important step of the standard operating procedure. She’s supposed to get another annotator to independently label the data and then measure inter-annotator agreement.
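For reference, here's roughly what that measurement looks like when agreement is scored with Cohen's kappa over token-level labels. This is a sketch under assumptions: the label names and the two toy annotations are invented, and token-level kappa is only one of several ways people score agreement on chunking tasks.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single, identical label
    return (observed - expected) / (1.0 - expected)


# Invented token-level labels from two hypothetical annotators.
annotator_1 = ["O", "TIME", "TIME", "O", "NEG", "O"]
annotator_2 = ["O", "TIME", "O", "O", "NEG", "O"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 5 of 6 tokens, chance agreement is 15/36, and kappa works out to 5/7.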

So What?

If we can train a system to perform at 80%+ F-measure under cross-validation, who cares if we can’t get another human to match Mitzi’s annotation?

We have something better — we can train a system to match Mitzi’s annotation!

In fact, training such a system is really all that we often care about. It’s much better to be able to train a system than another human to do the annotation.

The other thing we might want a corpus for is to evaluate a range of systems. There, if the systems are highly comparable, the fringes of the corpus matter. But perhaps the small, but still p < 0.05, differences in such systems don't matter so much. What the MT people have found is that even a measure that's roughly correlated with performance can be used to guide system development.

Error Analysis and Fixing Inconsistencies

Mitzi’s been doing the sensible thing of actually looking at the errors the system’s making under cross validation. In some of these cases, she’d clearly made a braino and annotated the data wrong. So she fixes it. And system performance goes up.

What Mitzi’s reporting is what I’ve always found in these tasks. For instance, she inconsistently annotated time plus date sequences, sometimes including the times and sometimes not. So she’s going back and correcting them so that every such phrase consistently includes all of the time information (makes sense to me).
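The error-review loop can be sketched as follows: run cross validation, then list every held-out item where the system's guess differs from the annotation, and eyeball that list for annotation mistakes. The memorizing tagger below is a deliberately crude stand-in so the example runs on its own; it is not LingPipe's chunker, and the function names and toy data are assumptions.

```python
from collections import Counter, defaultdict


def train(examples):
    """Stand-in model: remember the most frequent label for each token."""
    counts = defaultdict(Counter)
    for token, label in examples:
        counts[token][label] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}


def predict(model, token, default="O"):
    """Look up the remembered label, falling back to a default."""
    return model.get(token, default)


def cross_validation_disagreements(examples, num_folds=5):
    """Return (index, token, annotated, predicted) for held-out mismatches."""
    disagreements = []
    for fold in range(num_folds):
        # Item i belongs to fold i % num_folds; train on all other folds.
        train_part = [ex for i, ex in enumerate(examples) if i % num_folds != fold]
        model = train(train_part)
        for i, (token, annotated) in enumerate(examples):
            if i % num_folds == fold:
                guess = predict(model, token)
                if guess != annotated:
                    disagreements.append((i, token, annotated, guess))
    return sorted(disagreements)


# Toy data with one inconsistent annotation at index 9.
examples = [("3pm", "TIME")] * 9 + [("3pm", "O")] + [("the", "O")] * 10
to_review = cross_validation_disagreements(examples)
```

On this toy data the only item flagged is the inconsistent one at index 9. With real data the list mixes genuine system errors with annotation slips, and a human still has to decide which is which.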

After a couple of days of annotation, you get a much stronger feeling for how the annotations should have gone all along. The annotations drifted so much over time in this fashion in the clinical notes annotated for the i2b2 Obesity Challenge that the winning team exploited time of labeling as an informative feature to predict co-morbidities of obesity!

That’s also not Cricket!

The danger with re-annotating is that the system’s response will bias the human annotations. System-label bias is also a danger with single annotation under the tag-a-little, learn-a-little setup. If you gradually change the annotation to match the system’s responses, you’ll eventually get to very good, if not perfect, performance under cross validation.

So some judgment is required in massaging the annotations into a coherent system, but one that you care about, not one driven by the learned system’s behavior.

On the other hand, you do want to choose features and chunkings the system can learn. So if you find you’re trying to make distinctions that are impossible for the system to learn, changing the coding standard to make it more learnable seems OK to me.

Go Forth and Multiply

Mitzi has only spent a few days annotating the data and the system’s already working well end to end. This is just the kind of use case that Breck and I had in mind when we built LingPipe in the first place. It’s so much fun seeing other people use your tools.

When Breck and Linnea and I were annotating named entities with the citationEntities tool, we could crank along at 5K tokens/hour without cracking a sweat. Two eight-hour days will net you 80K tokens of annotated data and a much deeper insight into the problem. In less than a person-week of effort, you’ll have a corpus the size of the MUC-6 entity corpus.

Of course, it’d be nice to roll in some active learning here. But that’s another story. As is measuring whether it’s better to have a bigger or a better corpus. This is the label-another-instance vs. label-a-fresh-instance decision problem that (Sheng et al. 2008) addressed directly.

8 Responses to “Cross Validation vs. Inter-Annotator Agreement”

  1. Chris Brew Says:

    The SOP was developed to handle corner cases where there was serious doubt about either (a) whether the annotations being asked for make sense at all or (b) whether anyone except the original designer of the annotation scheme was capable of understanding and applying it. Since Bob is pretty sure that Mitzi’s annotations make sense, and that having more of them is a good thing, it’s great to skip that extra step. Doubly so since she’s going to check them anyway.

    Were you to submit a paper to a conference, the absence of a second rater might engage the lizard brain of some reviewer, and lead to rejection. Actually, it could be that some reviewers can do this as a spinal reflex, with no use of the brain whatsoever.

  2. Bob Carpenter Says:

    A computer system that can be trained to make predictions that agree with one person’s annotation scheme seems to demonstrate replicability (b).

    I’m more concerned about (a) — whether they make sense. I just don’t see any way to independently evaluate that. I know people like to plug the results in some other end-to-end system they care about (e.g., use them as features to improve a classifier or tagger).

    This is one of the issues Becky Passoneau and I are wrestling with for word-sense annotation. Different people may understand word senses differently and some instances of words may fall between the cracks of any discrete sense inventory. So the task of finding “the” sense of an instance of a word may not make sense. There’ll always be some agreement and some disagreement until users do what’s commonly known as “semantics” and adjudicate the meaning of the words they’re using among themselves.

    But we can get good kappa scores and build systems that agree with any annotation scheme pretty well. The statistical classifiers are pretty robust to noise and there’s some core cases on which everyone seems to agree.

  3. Chris Brew Says:

    Adam Kilgarriff’s “I don’t believe in word senses” is essential reading for anyone who wants to annotate word senses. It is essentially a description of why there are cracks in any discrete word sense inventory.

  4. scottedwards2000 Says:

    Sorry, a bit off topic, Bob, but couldn’t find another way to contact you. I’ve been reading a number of articles on this blog and am excited and blown away by the progress in ML reflected here.

    I would absolutely love to fully be able to understand topics like SVM and SGD, but I come from a traditional stat/research methods background (psych) where ANCOVA and multiple linear regression are still state of the art.

    I hate to bother you, but would you be willing to recommend the best places to get started to get to a place where I can understand these topics? I’d like to REALLY understand them, not just know how to use them (I’ve taken this approach with traditional stat, and it has served me well – allowing me to avoid many mistakes “practitioners” make.)

    It seems to require more than just an understanding of linear algebra… :-)

    Thanks, for your time,
    Scott Edwards

    • Bob Carpenter Says:

      Just Google [bob carpenter] if you need to find me.

      Other than linear algebra and calc and a bit of algorithms, the ante’s pretty low if you already know basic stats. Then it depends what you want to do. “Machine learning” is understood even more broadly than statistics. Do you want to scale something simple to the web? Build something smaller scale but more involved for sequence data?

      I’ve always come at these things through applied problems that I’ve been interested in. The general advice to do it like I do it would be to read through NIPS proceedings until you see something you’d like to understand and then work backwards through all the bits required. Ideally with people around to help you get through the confusing parts.

      Or if you’re a more bottom-up person, follow one of the Stanford or MIT online courses. Or read Bishop’s book if you know linear algebra and calc or Nocedal and Wright if you care more about optimization.

    • scottedwards2000 Says:

      Thanks so much for the help, Bob! I’ve gotten a hold of Bishop’s book and it looks good, but despite some experience in linear algebra, I’ve already hit a few math items I have questions on. Besides the great online courses you mentioned, are there any good places to go for interactive help on such issues? I’ve seen a couple, but I’m wondering if you had any other suggestions where I could ask questions related to the heavy math in these books you recommended.

      • Bob Carpenter Says:

        Unfortunately, what counts as “heavy math” is very relative. That’s why I never know what to recommend.

        stats.stackexchange and metaoptimize are both reasonable, but the math in Bishop’s book isn’t “heavy” by the standards of these sites.

        A much easier book to start with is the O’Reilly NLTK book (it’s about the Natural Language Toolkit package in Python and is written assuming the reader is a linguist who’s never programmed or done any math after high school).

        Something in between the NLTK book and Bishop’s book is the Witten and Frank Data Mining book. It covers many of the same algorithms, but is much more practical.

        Next up from there would be Manning and Schuetze’s NLP book, but it’s now very dated and still doesn’t go into much of the actual math required.

        After that, I’m afraid you need to bite the bullet and learn the linear algebra. Strang’s book is great and I hear there’s also an MIT class online for it. And of course calc if you don’t know that — I don’t have a good reco for that.

  5. RONALD LOUI Says:

    Bob — email me — I have a q 4 u
