One thing I forgot to mention in the release notes for LingPipe 3.8.2 is that I’ve added a section on building an Arabic named entity recognizer to the
- LingPipe Named Entity Tutorial
Benajiba’s ANER Corpus
The new tutorial section is based on Yassine Benajiba’s freely distributed (thanks!) ANER corpus.
The corpus is 150K tokens in CoNLL format (easy to parse, but lossy for whitespace) using person, location, organization and miscellaneous types (like CoNLL’s English corpus). Here’s a sample (rendered with style value direction:rtl to get the ordering right; the actual file’s in the usual character order):
Benajiba also supplies small dictionaries for locations, organizations and persons, which we explain how to use in the demo.
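To make the "easy to parse, but lossy for whitespace" point concrete, here is a minimal Python sketch of reading a CoNLL-style file. It is not LingPipe code and not the parser used in the tutorial. It assumes each non-blank line holds one token and one tag separated by whitespace; the function names and the tag names in the example are made up for illustration.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (token text, entity tag)


def read_conll_tokens(lines: Iterable[str]) -> List[Token]:
    """Parse 'token tag' lines, skipping blank or unexpectedly shaped lines."""
    tokens: List[Token] = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # assumption: anything else is a separator or noise
        tokens.append((parts[0], parts[1]))
    return tokens


def rejoin(tokens: List[Token]) -> str:
    """Rebuild text with single spaces; the original spacing is not stored."""
    return " ".join(text for text, _ in tokens)


# Hypothetical sample in English; tag names are illustrative only.
sample = ["John B-PER\n", "Smith I-PER\n", "visited O\n", "Cairo B-LOC\n", ". O\n"]
tokens = read_conll_tokens(sample)
print(rejoin(tokens))
```

The rejoined text has a space before the final period, because the token-per-line layout does not record whether a token was followed by whitespace in the source text.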
LingPipe Model and Evaluation
I applied a simple sentence-chunking heuristic and then built a character 8-gram-based rescoring chunker using otherwise default parameters and training on all the data (but not the dictionaries).
There’s absolutely nothing Arabic-specific about the model.
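The post doesn't spell out the sentence-chunking heuristic, so the following Python sketch is only one plausible version, not the one used in the tutorial. It assumes tokens arrive as (text, tag) pairs and that a sentence ends at any token in a small, assumed set of sentence-final punctuation marks.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token text, entity tag)

# Assumed sentence-final tokens; "؟" is the Arabic question mark.
SENTENCE_END = {".", "!", "?", "؟"}


def split_sentences(tokens: List[Token]) -> List[List[Token]]:
    """Group a flat token stream into sentences at sentence-final tokens."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for token in tokens:
        current.append(token)
        if token[0] in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack final punctuation
        sentences.append(current)
    return sentences
```

A heuristic like this only looks at punctuation, so apart from the choice of punctuation marks it carries no language-specific knowledge.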
Overall performance is in line with Benajiba’s, while differing substantially on the four entity types. Here are the LingPipe 6-fold (125K token train, 25K token test) cross-validated results:
Dictionaries improved recall a bit, but hurt precision even more. Bigger dictionaries and more training data would certainly help here.
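For readers who want to see the shape of the evaluation, here is a rough Python sketch of 6-fold cross-validation with precision, recall and F-measure. The tutorial does this with LingPipe's own evaluation code, which is not reproduced here. The sketch assumes contiguous folds and exact-match scoring, where a chunk counts as correct only if its span and type both match the reference; the names and the chunk representation are invented for illustration.

```python
from typing import Iterator, List, Sequence, Set, Tuple

# (sentence index, start token, end token, entity type)
Chunk = Tuple[int, int, int, str]


def folds(items: Sequence, k: int) -> Iterator[Tuple[List, List]]:
    """Yield (train, test) pairs for k contiguous folds."""
    n = len(items)
    for i in range(k):
        start = i * n // k
        end = (i + 1) * n // k
        test = list(items[start:end])
        train = list(items[:start]) + list(items[end:])
        yield train, test


def precision_recall_f1(reference: Set[Chunk],
                        response: Set[Chunk]) -> Tuple[float, float, float]:
    """Score predicted chunks against reference chunks by exact match."""
    true_pos = len(reference & response)
    precision = true_pos / len(response) if response else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

With six folds over 150K tokens, each round trains on five sixths of the data (125K tokens) and tests on the remaining sixth (25K tokens). Under this scoring, a dictionary that adds correct matches raises recall, while every spurious match it adds lowers precision.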
For a description of the corpus, and a description and evaluation of Benajiba’s own Arabic NER system, see:
- Benajiba, Y. and P. Rosso. 2007. ANERsys 2.0: Conquering the NER task for the Arabic language by combining the Maximum Entropy with POS-tag information. In IICAI-2007.
- Benajiba, Y., P. Rosso, and J. M. Benedí. 2007. ANERsys: An Arabic Named Entity Recognition system based on Maximum Entropy. In CICLing-2007.
And of course, there’s more information in LingPipe’s Named Entity Tutorial.