Following a tip from the last NE survey, today we look at 15 years of named entity research in:
- Nadeau, David and Satoshi Sekine (2007) A survey of named entity recognition and classification. Linguisticae Investigationes 30(1):3–26.
This paper is extracted from the literature survey for Nadeau’s thesis on semi-supervised NE. Nice work!
It surveys work in many languages and language families, many genres of text or spoken transcription, and many entity types ranging from highly general to very specific.
It surveys the shift from hand-written patterns (not dead yet, especially in conjunction with learned feature weights) to supervised learning to semi-supervised learning, with a detour through approximate dictionary-matching methods such as Soundex, edit distance and Jaro-Winkler. They call bootstrapping from sources like Wikipedia unsupervised because the exact task isn’t labeled, but it feels semi-supervised to me.
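To make the approximate dictionary-matching idea concrete, here is a minimal sketch of edit-distance lookup against a gazetteer. It is not taken from the survey: the function names, the toy gazetteer, and the `max_dist` threshold are all assumptions made for illustration, and a real system would need an index rather than a linear scan.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_lookup(mention, gazetteer, max_dist=1):
    """Return gazetteer entries within max_dist edits of the mention (case-insensitive)."""
    m = mention.lower()
    return [entry for entry in gazetteer
            if edit_distance(m, entry.lower()) <= max_dist]


# Toy gazetteer, purely hypothetical.
gazetteer = ["Frederick Flintstone", "Acme Corp."]
print(fuzzy_lookup("Frederik Flintstone", gazetteer))
```

Soundex and Jaro-Winkler would slot into the same place as `edit_distance`, trading off differently between phonetic variation and typos.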
Mainly, I wanted to read this for the overview of features, which they cover far better than any other survey I’ve seen, breaking features down into (1) word-level features, such as case, punctuation, word shape, part-of-speech, morphology, and word function, (2) list lookup features such as common words, known entities, and entity cues (e.g. “Mr.” or “Corp.”), and (3) document and corpus features, such as multiple occurrence voting, local syntactic context, document meta-information and corpus frequency information.
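For readers who haven’t built one of these, the word-level and list-lookup features might look something like the sketch below. The feature names, the cue list, and the particular word-shape scheme (collapsing runs of the same character class) are my own assumptions for illustration, not the survey’s definitions.

```python
def word_shape(token: str) -> str:
    """Map uppercase to 'A', lowercase to 'a', digits to '0', and collapse repeated classes."""
    classes = []
    for ch in token:
        if ch.isupper():
            c = "A"
        elif ch.islower():
            c = "a"
        elif ch.isdigit():
            c = "0"
        else:
            c = ch  # keep punctuation and other symbols as-is
        if not classes or classes[-1] != c:
            classes.append(c)
    return "".join(classes)


# Toy entity-cue list, purely hypothetical.
ENTITY_CUES = {"mr.", "mrs.", "dr.", "corp.", "inc."}


def token_features(tokens, i):
    """Word-level and list-lookup features for the token at position i."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "shape": word_shape(tok),
        "is_capitalized": tok[:1].isupper(),
        "is_all_caps": tok.isupper(),
        "has_digit": any(ch.isdigit() for ch in tok),
        "has_punct": any(not ch.isalnum() for ch in tok),
        "is_cue": tok.lower() in ENTITY_CUES,
        "prev_is_cue": i > 0 and tokens[i - 1].lower() in ENTITY_CUES,
    }


print(token_features(["Mr.", "McDonald", "joined", "Acme", "Corp."], 1))
```

Document and corpus features, such as multiple occurrence voting, need a second pass over the text and so don’t fit in a per-token function like this one.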
They also cover eval, but I think this has been rather weak in the field, so I wouldn’t get too carried away with MUC or ACE-type evals.
In my mind, this survey needs a discussion of the following techniques to be brought up to date:
- Dan Roth’s joint relation/entity model and other joint inference systems such as Finkel and Manning’s parsing/entity model (the survey does talk about Ji and Grishman’s coref/entity model)
- Finkel et al.’s long-distance model with Gibbs sampling for inference
- Ando and Zhang’s feature induction method and similar clustering-type unsupervised methods
- Committee-based methods, which I’m not even sure have been used in this domain other than for mixing forward and backward Markovian models
- Domain adaptation methods, such as Finkel’s hierarchical models or Daumé’s frustratingly simple adaptation
- Nested named entities, as found in the Genia corpus
- Parsing-based approaches such as Finkel and Manning’s approach to nested entities
Another thing that’d be good to include in a broader survey would be the encoding of chunking problems as tagging problems, which the Ratinov and Roth paper touched on, but didn’t explore fully.
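As a concrete instance of encoding chunks as tags, here is a minimal sketch of the common BIO scheme, with a decoder to go back from tags to spans. The function names and the span representation (start, exclusive end, type) are assumptions for illustration; other encodings, such as ones that also mark the last token of a chunk, follow the same pattern with a larger tag set.

```python
def spans_to_bio(n_tokens, spans):
    """Encode non-overlapping (start, end, type) spans, end exclusive, as BIO tags."""
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for k in range(start + 1, end):
            tags[k] = f"I-{etype}"
    return tags


def bio_to_spans(tags):
    """Decode BIO tags back into (start, end, type) spans."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing chunk
        # Close the open chunk unless this tag continues it.
        if start is not None and tag != f"I-{etype}":
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # A stray I- tag with no open chunk is ignored in this sketch.
    return spans


tokens = ["John", "Smith", "visited", "Acme", "Corp."]
spans = [(0, 2, "PER"), (3, 5, "ORG")]
tags = spans_to_bio(len(tokens), spans)
print(list(zip(tokens, tags)))
print(bio_to_spans(tags))
```

Note that a flat encoding like this can’t represent nested entities, since each token gets exactly one tag.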
What I liked about this paper is that it avoided categorical conclusions about what works and what doesn’t; I think the jury’s still out on that one.