If you’re interested in named entity (mention) recognition, you should check out this paper:
- Ratinov, Lev and Dan Roth. 2009. Design Challenges and Misconceptions in Named Entity Recognition. In CoNLL.
- Try: Online Demo
- Download: Code and Resources
There are two things to read this for: (1) the compendium of features to consider, and (2) the analysis of greedy left-to-right decoding. You might also like their state-of-the-art results. (They only cover person, location, organization and miscellaneous entity mentions in English, so it’d be nice to see what other domains, entity types, and languages might require.)
BIO vs. BMEWO
Presumably you already knew from careful study of LingPipe (or Andrew Borthwick’s thesis) that begin-in-out (BIO) coding is inferior to begin-mid-end-whole-out (BMEWO, which the authors call BILOU, for “begin-in-last-out-unit”) in some cases. This is particularly important if you’re doing some kind of greedy decoding, because it gives you context info that couldn’t otherwise be resolved.
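To make the difference concrete, here’s a tiny sketch (in Python, with a made-up sentence) of how the same mentions come out under the two codings; note how the BILOU coding tells a greedy left-to-right decoder when a mention is closing:

```python
# Hypothetical sentence with one person mention and one location mention.
tokens = ["John", "Paul", "Jones", "visited", "New", "York", "."]

# BIO (begin-in-out): three tags per entity type.
bio   = ["B-PER", "I-PER", "I-PER", "O", "B-LOC", "I-LOC", "O"]

# BMEWO / BILOU: last tokens and single-token ("whole"/"unit") mentions get
# their own tags, so the decoder knows a mention is ending without peeking ahead.
bilou = ["B-PER", "I-PER", "L-PER", "O", "B-LOC", "L-LOC", "O"]

for tok, a, b in zip(tokens, bio, bilou):
    print(f"{tok:10s} BIO={a:7s} BILOU={b}")
```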
Although the features they considered were hardly exhaustive, they covered a fair range in a fair amount of detail. If anyone knows a better compendium, or just wants to add features, please comment.
Their baseline system has the usual suspects: the previous two predictions (making it a second-order model), the word, the word type/shape (no details given, but some systems use multiple type or shape features), prefixes and suffixes of the word (no arbitrary substrings, and no indication of which lengths they used), and windows of these features around the current word, as well as some interactions (specifically token n-grams and the previous label). The problem with this kind of evaluation is that there are so many parameters to play with that there’s no way to evaluate all the combinations.
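Here’s a minimal sketch of what per-token feature extraction along these lines might look like; the affix lengths, window size, and shape function are my guesses, not details from the paper:

```python
def word_shape(w):
    # Crude shape feature: map characters to X/x/d classes, e.g. "Albany" -> "Xxxxxx".
    return "".join("X" if c.isupper() else "x" if c.islower()
                   else "d" if c.isdigit() else c for c in w)

def token_features(tokens, i, prev_tags):
    w = tokens[i]
    feats = {
        f"w={w}": 1,
        f"shape={word_shape(w)}": 1,
        # previous two predictions make it a second-order model
        f"t-1={prev_tags[-1] if prev_tags else 'BOS'}": 1,
        f"t-2,t-1={prev_tags[-2:] if len(prev_tags) >= 2 else 'BOS'}": 1,
    }
    # prefixes and suffixes (lengths up to 4 are a guess; the paper doesn't say)
    for k in range(1, 5):
        feats[f"pre={w[:k]}"] = 1
        feats[f"suf={w[-k:]}"] = 1
    # window of surrounding words, here +/- 2
    for j in range(max(0, i - 2), min(len(tokens), i + 3)):
        feats[f"w[{j - i}]={tokens[j]}"] = 1
    return feats
```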
Most unconventionally, they didn’t use any part-of-speech features (though see below on clustering for a hint as to why).
They considered the words around a word in 200- and 1000-token windows (I presume that’s +/- 100 and +/- 500 tokens). They ignored doc boundaries, noting that consecutive docs were often related in their corpus, so your mileage may vary here. I’d stick to doc boundaries, myself.
This feature I’d never seen before. When they look at a word w[n] in context (w[n-2],w[n-1],w[n],w[n+1],w[n+2]), they create a vector by looking at all the instances within 200 tokens (+/- 200?). Sort of like Ando and Zhang (see paper refs), only much simpler and more context sensitive.
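My reading of that feature, as a sketch; the window size, context width, and the counting are my interpretation, not the paper’s exact recipe:

```python
from collections import Counter

def context_aggregation(tokens, i, window=200, ctx=2):
    """Aggregate the +/-ctx contexts of every other occurrence of tokens[i]
    found within `window` tokens of position i."""
    w = tokens[i]
    agg = Counter()
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j != i and tokens[j] == w:
            for k in range(max(0, j - ctx), min(len(tokens), j + ctx + 1)):
                if k != j:
                    agg[f"ctx[{k - j}]={tokens[k]}"] += 1
    return agg
```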
They took the output of a baseline NE detector on a corpus and fed its voted results back into the system, restricting themselves to a window around the instance in question. Sort of like Krishnan and Manning (see paper refs), only context sensitive.
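Something like the following, I think, where baseline_tags is a first pass over the whole corpus; the majority-vote feature and the window size are my guesses at the details:

```python
from collections import Counter

def two_stage_features(tokens, baseline_tags, i, window=200):
    """Features from a first-pass (baseline) tagger's decisions on other
    occurrences of tokens[i] within a window around position i."""
    w = tokens[i]
    votes = Counter()
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    for j in range(lo, hi):
        if j != i and tokens[j] == w:
            votes[baseline_tags[j]] += 1
    # majority vote over the window as a single feature
    if votes:
        return {f"stage1-majority={votes.most_common(1)[0][0]}": 1}
    return {}
```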
Working left-to-right through the corpus, they kept track of earlier predictions, taking the percentage of each assignment as a feature value (again, there are all sorts of issues in scaling these features; logs might be good, and then there’s smoothing for low counts). Again, they restrict this to context, this time the previous 1000 tokens.
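A sketch of that prediction-history feature, including the kind of log rescaling I’d be tempted to try; the class and its names are mine, not theirs:

```python
import math
from collections import Counter

class PredictionHistory:
    """Running record of how earlier occurrences of a word were tagged,
    restricted to the previous `window` tokens (1000 in the paper).
    record() is meant to be called for every token as it is decoded."""
    def __init__(self, window=1000):
        self.window = window
        self.history = []  # (word, tag) pairs in decoding order

    def features(self, word):
        recent = self.history[-self.window:]
        counts = Counter(tag for w, tag in recent if w == word)
        total = sum(counts.values())
        feats = {}
        for tag, c in counts.items():
            feats[f"hist-{tag}"] = c / total       # raw proportion
            feats[f"hist-log-{tag}"] = math.log(c + 1)  # a possible rescaling
        return feats

    def record(self, word, tag):
        self.history.append((word, tag))
```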
Again reminiscent of Rie Ando and Tong Zhang’s work, with an approach borrowed from Peter Brown’s earlier work at IBM (again, see refs), they derived features from Percy Liang’s hierarchical clustering of Reuters (it’s nice to have a pile of in-domain data). What’s neat is that they used sequences of branching decisions (e.g. 0110 for left, right, right, left) in the dendrogram as features, taking length-4 prefixes (which they claim look like POS tags, hence the explanation for why no POS tagger), as well as lengths 6, 10, and 20 (no explanation of why these lengths).
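The cluster-prefix feature itself is simple once you have the word-to-bit-string map; something like this sketch (the example paths are invented):

```python
# brown_clusters maps a word to its bit-string path in the dendrogram,
# e.g. {"Albany": "0110101100", "Thursday": "11101001", ...} (paths made up).
def brown_features(brown_clusters, word, prefix_lengths=(4, 6, 10, 20)):
    path = brown_clusters.get(word)
    if path is None:
        return {}
    # prefixes of the path: short prefixes behave like coarse POS-ish classes,
    # longer ones like finer-grained word classes
    return {f"brown-{k}={path[:k]}": 1 for k in prefix_lengths}
```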
Dictionaries and Gazetteers
They threw in 30 dictionaries, derived from the web and Wikipedia, covering a range of classes. They’re included in the download, and the paper has some lists of Wikipedia categories used to define types of entities.
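Dictionary features of this sort usually amount to phrase matching against the token window; here’s one way it might look (the dictionary format and the maximum phrase length are assumptions on my part, not taken from their code):

```python
def gazetteer_features(dictionaries, tokens, i, max_len=4):
    """Mark whether the current token falls inside any dictionary phrase.
    `dictionaries` maps a gazetteer name to a set of lowercased phrases
    (tuples of tokens); the names and format are made up."""
    feats = {}
    for name, phrases in dictionaries.items():
        for length in range(1, max_len + 1):
            for start in range(max(0, i - length + 1), i + 1):
                span = tuple(t.lower() for t in tokens[start:start + length])
                if len(span) == length and span in phrases:
                    feats[f"gaz-{name}"] = 1
    return feats
```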
I ran some text from the front-page of the NYTimes through their online demo, which is a softball test. The main problem was messing up the character encodings (and whitespace). Here’s the output:
[LOC ALBANY ] ï¿½ [PER Pedro Espada Jr. ] returned to the [MISC Democrats ] and was named [ORG Senate ] majority leader on Thursday as part of a deal worked out by [ORG Senate ] [MISC Democratic ] leaders , ending a monthlong stalemate that has hobbled state government . Skip to next paragraph Related City Room: [MISC Ravitch Signed Oath ] at [LOC Brooklyn ] Steakhouse [PER Paterson Picks M.T.A. Figure ] as His [ORG No. ] 2 ( July 9 , 2009 ) [MISC Mr. Espadaï¿½s ] return gave the [MISC Democrats ] 32 votes in the [ORG Senate ] , a clear two-vote margin that re-established their control of the chamber . Under the deal , Senator [PER Malcolm A. Smith ] of [LOC Queens ] will be president for an undetermined period of time , and Senator [PER John L. Sampson ] of [LOC Brooklyn ] will be the leader of the [MISC Democratic ] caucus . Details of the arrangement were explained by [PER Mr. Smith ] at a news conference. ï¿½At the end of the day , [MISC Democrats ] always come together , ï¿½ he said . The defection of [PER Mr. Espada ] , of the [LOC Bronx ] , plunged [LOC Albany ] into chaos on June 8 . When it was his turn to speak at [ORG Thursdayï¿½s ] news conference , [PER Mr. Espada ] said , ï¿½This has obviously taken a toll on the institution.ï¿½ He described the last several weeks as a time of ï¿½chaos and dysfunctionï¿½ and added , ï¿½I profoundly apologize.ï¿½ Talk of a deal started to bubble up Thursday afternoon . As it became clear that a deal was at hand , [PER Steve Pigeon ] , a top aide to the [LOC Rochester ] billionaire [PER Tom Golisano ] , a supporter of the [MISC Republican ] takeover , left [MISC Mr. Espadaï¿½s ] office , saying little . [PER Mr. Pigeon ] , who helped orchestrate the coup along with [PER Mr. Golisano ] , then huddled with a top [ORG Senate ] [MISC Republican ] , [PER George D. Maziarz ] , on a stairway near [MISC Mr. Espadaï¿½s ] office . After the conversation , [PER Mr. Maziarz ] was speaking as if a deal was a fait accompli . Asked if he was disappointed , he said he was not , and said he believed that the rules and reforms [MISC Republicans ] had pushed through last month would still stand .
Treating “Mr. Espada” as a miscellaneous entry is almost certainly a mistake they made because they messed up the char encodings, and hence the tokenization missed the possessive apostrophe. They also mangled the whitespace in the output. These detail issues can make or break a real system in practice, but they’re not all that hard to clean up.
More topical text from justjared.buzznet.com is harder:
Scope out these new exclusive pics of [PER Michael Jackson ] with two of his three kids ï¿½ [LOC Paris ] and [PER Michael Joseph Jr ] . ( also known as Prince [PER Michael ] ) ï¿½ from the new issue of [LOC OK ] ! . In this picture , Prince , 4 , and [PER Paris Jackson ] , 3 , play dress-up in 2001 at the [LOC Neverland Ranch ] in [LOC Santa Barbara County ] , [LOC Calif ] . Also pictured is [PER Michael ] at [LOC Neverland ] , celebrating [MISC Princeï¿½s ] sixth birthday in 2003 with a [MISC Spider-Man-themed ] party . Cute !
Disclaimer: The LingPipe online tagger demo does much worse on this data. NE mention detection is very hard once you stray even a little bit outside of newswire.
They claim the highest F measure to date on CoNLL ’03 (F=0.908, vs. Ando and Zhang’s F=0.892), using an evaluation that ignores punctuation. But we all know these systems are overtrained at this point, and they also had the advantage of clustering data derived from the same data set. Our readers probably know by now how much I distrust post-hoc results like these.
They did label some of their own web-page data, and they acknowledge the borderline status of some of their labeling decisions even after adjudication. This directly supports the point I’ve been trying to make about labeling difficulty and noisy corpora; the MUC data’s also noisy, and I’m guessing the CoNLL data is too.
(Missing) Conditional Model Details
They report that there was very little search error in doing greedy decoding, and even less in doing a ten-item beam search, when evaluated on a non-contextualized baseline system.
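For reference, greedy decoding and a ten-item beam differ only in how many tag histories you carry forward; a schematic version (the score function and its signature are invented for illustration):

```python
import heapq

def greedy_decode(tokens, score):
    """score(tokens, i, history) -> dict mapping tag -> local score."""
    history = []
    for i in range(len(tokens)):
        scores = score(tokens, i, history)
        history.append(max(scores, key=scores.get))
    return history

def beam_decode(tokens, score, beam=10):
    # beam of (total score, tag history) pairs
    beams = [(0.0, [])]
    for i in range(len(tokens)):
        candidates = []
        for total, history in beams:
            for tag, s in score(tokens, i, history).items():
                candidates.append((total + s, history + [tag]))
        beams = heapq.nlargest(beam, candidates, key=lambda x: x[0])
    return max(beams, key=lambda x: x[0])[1]
```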
I’m not clear on exactly how they used regularized averaged perceptrons; they never spell out the model in detail. For instance, they don’t discuss how they do regularization or feature selection (e.g., is there a minimum count?). More confusingly, they don’t discuss how they extend perceptrons to structured prediction for sequence analysis, though from the rest of the paper, it looks like they’re deciding on a per-token basis (though if that’s the case, why do they say they need to convert to probabilities to run Viterbi?).
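For what it’s worth, here’s what a bare-bones per-token averaged perceptron looks like; this is my guess at the general shape of such a learner, not a reconstruction of theirs (regularization and feature selection left out, since the paper doesn’t describe them):

```python
from collections import defaultdict

class AveragedPerceptron:
    """Per-token averaged perceptron sketch; tags is the full tag set."""
    def __init__(self, tags):
        self.tags = tags
        self.w = defaultdict(float)       # (feature, tag) -> weight
        self.totals = defaultdict(float)  # running sums for averaging
        self.t = 0                        # number of examples seen

    def predict(self, feats):
        def score(tag):
            return sum(v * self.w[(f, tag)] for f, v in feats.items())
        return max(self.tags, key=score)

    def update(self, feats, gold):
        self.t += 1
        guess = self.predict(feats)
        if guess != gold:
            for f, v in feats.items():
                for tag, delta in ((gold, v), (guess, -v)):
                    # accumulate (time-1) * delta for the lazy averaging trick
                    self.totals[(f, tag)] += (self.t - 1) * delta
                    self.w[(f, tag)] += delta

    def average(self):
        # averaged weights: w_avg = w - totals / t
        return {k: self.w[k] - self.totals[k] / self.t for k in self.w}
```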
I love these feature survey papers. Unfortunately, drawing general conclusions from this kind of evaluation is nearly impossible, even when they’re done as carefully as this one.
We don’t know whether a CRF would have worked better. Given where they go with the system, perhaps they should’ve used MEMMs and a per-tag error function.