Ratinov and Roth (2009) Design Challenges and Misconceptions in Named Entity Recognition


If you’re interested in named entity (mention) recognition, you should check out this paper:

There are two things to read this for: (1) the compendium of features to consider, and (2) the analysis of greedy left-to-right decoding. You might also like their state-of-the-art results. (They only cover person, location, organization and miscellaneous entity mentions in English, so it’d be nice to see what other domains, entity types, and languages might require.)


Presumably you already knew from careful study of LingPipe (or Andrew Borthwick’s thesis) that begin-in-out (BIO) coding is inferior to begin-mid-end-whole-out (BMEWO, which the authors call BILOU, for “begin-in-last-out-unit”) in some cases. This is particularly important if you’re doing some kind of greedy decoding, because it gives you context info that couldn’t otherwise be resolved.
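To make the coding difference concrete, here is a minimal sketch of converting BIO tags to BILOU. The function name and the tag strings (B-PER, I-PER, and so on) are my own illustrative choices, not anything taken from the paper or from LingPipe.

```python
def bio_to_bilou(tags):
    """Convert a list of BIO tags (e.g. 'B-PER', 'I-PER', 'O') to BILOU tags."""
    bilou = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bilou.append("O")
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The mention continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            # A B- tag with no continuation is a single-token (unit) mention.
            bilou.append(("B-" if continues else "U-") + label)
        else:
            # An I- tag with no continuation is the last token of the mention.
            bilou.append(("I-" if continues else "L-") + label)
    return bilou


example = ["B-PER", "I-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG"]
print(bio_to_bilou(example))
# ['B-PER', 'I-PER', 'L-PER', 'O', 'U-LOC', 'O', 'B-ORG', 'L-ORG']
```

The extra L and U tags are what give a greedy decoder the end-of-mention information it can't go back and fix later.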

Local Features

Although the features they considered were hardly exhaustive, they covered a fair range in a fair amount of detail. If anyone knows a better compendium, or just wants to add features, please comment.

Their baseline system has the usual suspects: previous two predictions (making it a second-order model), word, word type/shape (no details given, but some systems use multiple type or shape features), prefixes and suffixes of the word (no arbitrary substrings, and no indication of the lengths they used), and windows of these features around the current word, as well as some interactions (specifically token n-grams and previous label). The problem with this kind of evaluation is that there are so many parameters to play with that there’s no way to evaluate all combinations of parameters.
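For readers who haven't built one of these, a rough sketch of this kind of local feature extractor follows. Everything specific here is an assumption on my part: the shape function, the affix lengths (1 to 4), the window size, and the feature naming are illustrative, since the paper doesn't give those details.

```python
import re


def word_shape(word):
    """Crude word shape: uppercase -> X, lowercase -> x, digit -> d, runs squeezed."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    # Collapse repeated characters so "Espada" and "Smith" share the shape "Xx".
    return re.sub(r"(.)\1+", r"\1", shape)


def local_features(tokens, i, prev_tags, window=2, max_affix=4):
    """String-valued features for tokens[i], given the tags predicted so far."""
    feats = []
    # Previous two predictions make this a second-order model.
    p1 = prev_tags[-1] if len(prev_tags) >= 1 else "<S>"
    p2 = prev_tags[-2] if len(prev_tags) >= 2 else "<S>"
    feats.append(f"prev1={p1}")
    feats.append(f"prev2={p2}|{p1}")
    # Words and shapes in a window around the current token.
    for offset in range(-window, window + 1):
        j = i + offset
        w = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
        feats.append(f"w[{offset}]={w}")
        feats.append(f"shape[{offset}]={word_shape(w)}")
    # Prefixes and suffixes of the current word (lengths are an assumption).
    word = tokens[i]
    for n in range(1, min(max_affix, len(word)) + 1):
        feats.append(f"pre={word[:n]}")
        feats.append(f"suf={word[-n:]}")
    return feats


print(local_features(["Pedro", "Espada", "returned"], 1, ["B-PER"]))
```

Each of window, max_affix and the shape definition is one of those parameters you can't evaluate exhaustively.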

Most unconventionally, they didn’t use any part-of-speech features (though see below on clustering for a hint as to why).

Contextual Features

They considered the words around a feature in 200 and 1000 token windows (I presume that’s +/- 100 and +/- 500 tokens). They ignored doc boundaries and stated consecutive docs were often related in their corpus, so your mileage may vary here. I’d stick to doc boundaries, myself.

Context Aggregation

This feature I’d never seen before. When they look at a word w[n] in context (w[n-2],w[n-1],w[n],w[n+1],w[n+2]), they create a vector by looking at all the instances within 200 tokens (+/- 200?). Sort of like Ando and Zhang (see paper refs), only much simpler and more context sensitive.
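Here's my guess at what this looks like in code, as a sketch only. I'm assuming exact string match on the token, a symmetric window of 200 tokens on either side, and a two-token context on each side; the function and feature names are made up, and the paper may well do this differently.

```python
def aggregated_context(tokens, n, span=200, k=2):
    """Pool the +/- k token contexts of every occurrence of tokens[n] near position n."""
    target = tokens[n]
    lo = max(0, n - span)
    hi = min(len(tokens), n + span + 1)
    feats = set()
    for m in range(lo, hi):
        if tokens[m] != target:
            continue
        # Collect the neighbors of this occurrence (including the one at n itself).
        for offset in range(-k, k + 1):
            if offset == 0:
                continue
            j = m + offset
            if 0 <= j < len(tokens):
                feats.add(f"agg[{offset}]={tokens[j]}")
    return feats


tokens = "Mr. Espada said that Senator Espada left".split()
print(sorted(aggregated_context(tokens, 1)))
```

The point is that the second "Espada" gets to see "Mr." as a left neighbor even though it only follows "Senator" locally.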

Two-Stage Predictions

They took the output of a baseline NE detector on a corpus and fed its voted results back into the system, restricting themselves to a window around the instance in question. Sort of like Krishnan and Manning (see paper refs), only context sensitive.

Prediction History

Working left-to-right through the corpus, they kept track of earlier predictions in the corpus, taking percentage of assignment as feature value (again, there’s all sorts of issues in scaling these features — logs might be good, and then there’s smoothing for low counts). Again, they restrict this to context, this time the previous 1000 tokens.
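A minimal sketch of the prediction-history idea, assuming exact token match and raw relative frequencies with no smoothing or log scaling (which, as noted, you'd probably want in practice). The names are mine, not the paper's.

```python
from collections import Counter


def history_features(tokens, predicted, i, span=1000):
    """Fraction of each label already assigned to earlier occurrences of tokens[i].

    `predicted` holds the labels chosen so far for tokens[0..i-1].
    """
    start = max(0, i - span)
    counts = Counter(
        predicted[j] for j in range(start, i) if tokens[j] == tokens[i]
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


tokens = ["Paris", "is", "Paris", "and", "Paris"]
predicted = ["LOC", "O", "PER", "O"]  # labels for the first four tokens
print(history_features(tokens, predicted, 4))
# {'LOC': 0.5, 'PER': 0.5}
```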

Unlabeled Text

Again, reminiscent of Rie Ando and Tong Zhang’s work, with an approach borrowed from Peter Brown’s earlier work at IBM (again, see refs), they derived features from Percy Liang’s hierarchical clustering of Reuters (it’s nice to have a pile of in-domain data). What’s neat is that they used sequences of branching decisions (e.g. 0110 for left,right,left,right) in the dendrogram as features, taking length 4 prefixes (which they claim look like POS tags, hence the explanation for why no POS tagger), as well as length 6, 10, and 20 (no explanation why these).
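The prefix trick is simple enough to sketch. The bit string below is invented for illustration; in a real system it would come from a lookup table mapping each word to its path in the cluster hierarchy.

```python
def cluster_prefix_features(bit_path, lengths=(4, 6, 10, 20)):
    """Turn a word's path in the cluster dendrogram into prefix features.

    Paths shorter than a requested length are used whole, since Python
    slicing just returns the full string in that case.
    """
    return [f"brown{n}={bit_path[:n]}" for n in lengths]


# Hypothetical path for some word; 0 = left branch, 1 = right branch.
print(cluster_prefix_features("0110100111"))
# ['brown4=0110', 'brown6=011010', 'brown10=0110100111', 'brown20=0110100111']
```

Short prefixes give coarse clusters and long ones give fine clusters, so the model gets several granularities at once.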

Dictionaries and Gazetteers

They threw in 30 dictionaries, derived from the web and Wikipedia, covering a range of classes. They’re included in the download, and the paper has some lists of Wikipedia categories used to define types of entities.
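One plausible way to turn dictionaries into per-token features is plain phrase matching, sketched below. The dictionary name, its contents, the four-token phrase limit, and the exact-match policy are all assumptions for the example; I haven't checked how their code does the matching.

```python
def gazetteer_features(tokens, gazetteers, max_len=4):
    """Mark every token covered by a dictionary phrase with that dictionary's name.

    `gazetteers` maps a dictionary name to a set of space-joined phrases.
    """
    feats = [set() for _ in tokens]
    for name, phrases in gazetteers.items():
        for start in range(len(tokens)):
            for length in range(1, max_len + 1):
                end = start + length
                if end > len(tokens):
                    break
                if " ".join(tokens[start:end]) in phrases:
                    for j in range(start, end):
                        feats[j].add(f"gaz={name}")
    return feats


gazetteers = {"location": {"Santa Barbara County", "Paris"}}
tokens = "at the Neverland Ranch in Santa Barbara County".split()
print(gazetteer_features(tokens, gazetteers))
```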


I ran some text from the front-page of the NYTimes through their online demo, which is a softball test. The main problem was messing up the character encodings (and whitespace). Here’s the output:

[LOC ALBANY ] � [PER Pedro Espada Jr. ] returned to the [MISC Democrats ] and was named [ORG Senate ] majority leader on Thursday as part of a deal worked out by [ORG Senate ] [MISC Democratic ] leaders , ending a monthlong stalemate that has hobbled state government . Skip to next paragraph Related City Room: [MISC Ravitch Signed Oath ] at [LOC Brooklyn ] Steakhouse [PER Paterson Picks M.T.A. Figure ] as His [ORG No. ] 2 ( July 9 , 2009 ) [MISC Mr. Espada�s ] return gave the [MISC Democrats ] 32 votes in the [ORG Senate ] , a clear two-vote margin that re-established their control of the chamber . Under the deal , Senator [PER Malcolm A. Smith ] of [LOC Queens ] will be president for an undetermined period of time , and Senator [PER John L. Sampson ] of [LOC Brooklyn ] will be the leader of the [MISC Democratic ] caucus . Details of the arrangement were explained by [PER Mr. Smith ] at a news conference. �At the end of the day , [MISC Democrats ] always come together , � he said . The defection of [PER Mr. Espada ] , of the [LOC Bronx ] , plunged [LOC Albany ] into chaos on June 8 . When it was his turn to speak at [ORG Thursday�s ] news conference , [PER Mr. Espada ] said , �This has obviously taken a toll on the institution.� He described the last several weeks as a time of �chaos and dysfunction� and added , �I profoundly apologize.� Talk of a deal started to bubble up Thursday afternoon . As it became clear that a deal was at hand , [PER Steve Pigeon ] , a top aide to the [LOC Rochester ] billionaire [PER Tom Golisano ] , a supporter of the [MISC Republican ] takeover , left [MISC Mr. Espada�s ] office , saying little . [PER Mr. Pigeon ] , who helped orchestrate the coup along with [PER Mr. Golisano ] , then huddled with a top [ORG Senate ] [MISC Republican ] , [PER George D. Maziarz ] , on a stairway near [MISC Mr. Espada�s ] office . After the conversation , [PER Mr. Maziarz ] was speaking as if a deal was a fait accompli . 
Asked if he was disappointed , he said he was not , and said he believed that the rules and reforms [MISC Republicans ] had pushed through last month would still stand .

Treating “Mr. Espada” as a miscellaneous entry is almost certainly a mistake they make because they messed up the char encodings and hence the tokenization missed the possessive apostrophe. They also mangled the whitespaces in the output. These detail issues can make or break a real system in practice, but they’re not all that hard to clean up.

More topical text from justjared.buzznet.com is harder:

Scope out these new exclusive pics of [PER Michael Jackson ] with two of his three kids � [LOC Paris ] and [PER Michael Joseph Jr ] . ( also known as Prince [PER Michael ] ) � from the new issue of [LOC OK ] ! . In this picture , Prince , 4 , and [PER Paris Jackson ] , 3 , play dress-up in 2001 at the [LOC Neverland Ranch ] in [LOC Santa Barbara County ] , [LOC Calif ] . Also pictured is [PER Michael ] at [LOC Neverland ] , celebrating [MISC Prince�s ] sixth birthday in 2003 with a [MISC Spider-Man-themed ] party . Cute !

Disclaimer: The LingPipe online tagger demo does much worse on this data. NE mention detection is very hard once you stray even a little bit outside of newswire.

Official Evaluation

They claim to have the highest F measure for CoNLL ’03 (F=0.908; Ando and Zhang had F=0.892). That’s F measure using an evaluation that ignores punctuation. But we all know these systems are overtrained at this point, and they also had the advantage of having clustering data over the same data set. Our readers probably know by now how much I distrust post-hoc results like these.
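For reference, the F measure here is the usual balanced one, the harmonic mean of precision and recall over predicted mentions. The sketch below assumes mentions are scored by exact match on (start, end, type) triples; the example spans are invented, and it makes no attempt to reproduce their punctuation-ignoring variant.

```python
def f_measure(gold, predicted):
    """Balanced F measure over sets of (start, end, type) mention triples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = {(0, 2, "PER"), (5, 6, "LOC")}
predicted = {(0, 2, "PER"), (5, 6, "ORG")}  # right span, wrong type: no credit
print(f_measure(gold, predicted))
# 0.5
```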

They did label some of their own web page data for which they acknowledge the borderline status of some of their labeling decisions even after adjudication; this directly supports the point I’ve been trying to make about labeling difficulty and noisy corpora; the MUC data’s also noisy, and I’m guessing so is the CoNLL data.

(Missing) Conditional Model Details

They report that there was very little search error in doing greedy decoding, and even less in doing a ten-item beam search, when evaluated on a non-contextualized baseline system.
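To pin down what greedy versus beam decoding means here, a small sketch: a left-to-right beam search in which a beam of size 1 is exactly greedy decoding. The scoring function is a toy stand-in I made up; in a real system it would be the classifier's score for a label given the features and the label history.

```python
def beam_decode(tokens, labels, score, beam_size=1):
    """Left-to-right beam search over label sequences; beam_size=1 is greedy."""
    beam = [(0.0, [])]  # (total score, label history)
    for i in range(len(tokens)):
        candidates = []
        for total, history in beam:
            for label in labels:
                step = score(tokens, i, history, label)
                candidates.append((total + step, history + [label]))
        # Keep only the best-scoring partial sequences.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
    return beam[0][1]


def toy_score(tokens, i, history, label):
    """Hypothetical scorer: capitalized tokens look like PER, everything else O."""
    if tokens[i][0].isupper():
        return 1.0 if label == "PER" else 0.0
    return 1.0 if label == "O" else 0.0


tokens = ["Pedro", "Espada", "returned"]
print(beam_decode(tokens, ["O", "PER"], toy_score, beam_size=1))
# ['PER', 'PER', 'O']
```

Search error is the case where the beam throws away a prefix that would have led to the highest-scoring full sequence.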

I’m not clear on exactly how they used regularized averaged perceptrons; they never spell out the model in detail. For instance, they don’t discuss how they do regularization or feature selection (e.g. is there a min count?). More confusingly, they don’t discuss how they extend perceptrons to structured predictions for sequence analysis, though from the rest of the paper, it looks like they’re deciding on a per-token basis (though if that’s the case, why do they say they need to convert to probabilities to run Viterbi?).

I love these feature survey papers. Unfortunately, drawing general conclusions from this kind of evaluation is nearly impossible, even when they’re done as carefully as this one.

We don’t know if a CRF would have worked better. Given where they go with the system, perhaps they should’ve used MEMMs and a per-tag error function.

8 Responses to “Ratinov and Roth (2009) Design Challenges and Misconceptions in Named Entity Recognition”

  1. fredrik Says:

    as regards to features, the paper by nadeau and sekine; “A survey of named entity recognition and classification” (2007) lists a fair amount of different features used in the literature. the paper is available here: http://nlp.cs.nyu.edu/sekine/papers/li07.pdf

  2. Lev Ratinov Says:

    Hi Bob.

    Thanks for evaluating our system. I have two comments.
    1) The Link you provided to our system is a bit incorrect.
    This is the link to use: http://l2r.cs.uiuc.edu/~cogcomp/LbjNer.php
    2) I hope you’ll agree with me that as a non-commercial demo we have hard time adjusting to different encodings without our web demo. It’s really easy to fix the encoding problem, and I’m a bit disappointed you didn’t try our system AFTER fixing the encoding, which is a much more fair comparison. I’ve done it, and the output is below. As you can see, our system does much better on the blog entry you mention, which, I agree with you is hard, since it’s not a newswire text.

    Scope out these new exclusive pics of [PER Michael Jackson ] with two of his three kids – [LOC Paris ] and [PER Michael Joseph Jr ] . ( also known as Prince [PER Michael ] ) – from the new issue of [LOC OK ] ! . In this picture , [PER Prince ] , 4 , and [PER Paris Jackson ] , 3 , play dress-up in 2001 at the [LOC Neverland Ranch ] in [LOC Santa Barbara County ] , [LOC Calif ] . Also pictured is [LOC Michaelat Neverland ] , celebrating [PER Prince ] ‘s sixth birthday in 2003 with a [MISC Spider-Man-themed ] party . Cute !

  3. lingpipe Says:

    @fredrik Thanks for the reference. I’ve been trying to find survey papers like this to do a meta-survey! Satoshi also has a fairly extensive NE ontology that’s been in play for various evaluations.

    @Lev Thanks for the link correction; I fixed it in the post.

    And if I didn’t make it clear enough in the post, I really like both your approach and how well it seems to work on the examples I tried (many more than I put in the post; those were just the first two I tried). I’d urge everyone else to try it. The demo’s very speedy, too, which is great. How big is the demo beam?

    I’m assuming something’s going wrong on your program’s decoding side, because when I cut and paste from the NY Times, in Latin1, to your web form, also Latin1 (both with inferred encodings from firefox), the characters get mangled on output. Is there a way I could’ve entered the curly apostrophe with the form the way it is now?

    I have no idea how to properly deal with different encodings in PHP. It’s a pain in Java servlets, because (a) you have to transcode through Latin1 to recover bytes, because the servlet interface is specified in unicode characters [or you have to do everything in bytes, losing the advantage of letting the implementation parse forms for you], and (b) you then have to either know the underlying charset (how I’ve built our demos) or determine it using something like a unicode library (how we deal with untrusted char data).

    Then there’s the issue of what to do with these things in the models. You wind up with names and things like that with very low frequency characters in them that aren’t in any of the character-based training data. We’ve had to deal with issues like this in training models from the French Treebank, because live French uses all sorts of different punctuation and the training corpus was fairly uniform.

  4. Yoav Says:

    fwiw, the length 4 and 6 prefixes of the Brown clustering algorithm were also shown to be useful for dependency parsing by Koo et al. 2008 (see paper for ref).

  5. Lev Ratinov Says:

    Hi, Lingpipe.

    Interesting comments. The beam size of the demo is 1, which is the size of our system we reported in the paper. We didn’t get much improvement when increasing the beam size to 100, but it was painfully slow. So everything we did was plain greedy.

    Unfortunately, I don’t know anything about PHP, I’m using a standard stub in our group for creating demos, and it has many problems.

    But giving it a second thought, I do agree that character encoding, dealing with punctuation etc are important. One thing that I didn’t talk in the paper, due to lack of space, but which had impact on the system performance for real-world text was tokenization and text normalization. We’re using two tokenization schemes in our system, which play with how we parse hyphens and punctuation marks… All this thing is not trivial and under-researched.

  6. Dr. Jochen L. Leidner Says:

    Tokenization is one of the key factors for high quality in NLP, yet Mikheev’s article in Computational Linguistics is about the only serious reference that I can think of that studies it. Tokenization is language-dependent, interacts with the textual format of the input text, which makes it a complex problem.

    I have seen people running around asking “Um… do you have a tokenizer?” at various institutions in more than one country, and many end up using some idiosyncratic Perl script that “somebody hacked up one night in the lab”.

    Even for English, to date there is no standard component that people can share, there is not even interest in the problem. Yes, people will tell you their experiences, but they would never consider publishing on the topic, because they consider it “solved”.

    • lingpipe Says:

      There’s definitely more written on tokenization in IR than in NLP, and even then the emphasis is on non-whitespace-separated languages like Chinese.

      There’s been a fair bit written on tokenization in the bio-NLP world, because of the complex token structures of biochemistry nomenclature (e.g. “p53wt” might become three tokens and “mRNA” two tokens).

      Another place you see it is in learning to segment unsegmented input (e.g. phonemes into words).

      There’s also a fair bit of attention in IR to issues of compound splitting, because of the Spider-man/Batman problem (“Spider-man” is two words, “Batman” is one, but many searchers get this wrong).

      One problem is that tokenization isn’t “natural”. Speakers of a language don’t have intuitions about tokenization, so no one ever runs bakeoffs on tokenization. So the “right” tokenization is often system dependent. Having said that, you could say the same about POS tagging.

      A big issue is that much of the test data is pre-tokenized. Consider CoNLL, the Penn Treebank, the Google N-gram corpus, etc. etc. Bakeoff organizers often consider this a service (e.g. Senseval and many KDD-like efforts), and only provide stemmed, tokenized and stoplisted input.

      What’s really frustrating is getting a resource without a tokenizer (e.g. CoNLL, Google N-grams). Then you can’t convert your system to run on wild text. At least the Penn Treebank distributes its crazy script (which first does sentence detection so it can treat end-of-sentence periods differently than sentence-internal ones).

      Then there are the probabilistic tokenizers, like in BioIE. I have too much carryover from my programming language days, but I like my tokenizers to be standalone and deterministic preprocessors, not part of some big joint probabilistic inference system. That’s why we tend to run finer-grained tokenizers than most.

      For our spelling correction and for our rescoring named entity detector, we rely heavily on tokenization. For spelling, we can determine the set of tokens which may be suggested as tokens and make correction sensitive to that. For rescoring named entities, we use a simple HMM, which determines hard, but fine-grained token boundaries, then rescore it in a non-tokenized longer-distance model.
