Arabic Named Entity Recognition with the ANER Corpus

One thing I forgot to mention in the release notes for LingPipe 3.8.2 is that I’ve added a section on building an Arabic named entity recognizer to the

Benajiba’s ANER Corpus

It’s based on Yassine Benajiba’s freely distributed (thanks!) corpus:

It’s 150K tokens in CoNLL format (easy to parse, but lossy for whitespace) using person, location, organization and miscellaneous types (like CoNLL’s English corpus). Here’s a sample (rendered with style value direction:rtl to get the ordering right; the actual file’s in the usual character order):


فرانكفورت B-LOC
(د O
ب O
أ) O
أعلن O
اتحاد B-ORG
صناعة I-ORG
السيارات I-ORG
في O
ألمانيا B-LOC
...
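
To make the format concrete, here is a minimal Python sketch (not LingPipe's parser, which is Java) that groups the token/tag lines above into typed chunks. The function name `read_conll_chunks` is made up for this illustration, and the handling of a stray I- tag is an assumption rather than anything the corpus specifies.

```python
def read_conll_chunks(lines):
    """Group 'token tag' lines into (type, [tokens]) chunks using B-/I-/O tags."""
    chunks = []
    current = None  # (entity_type, tokens) for the chunk being built, if any
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes any open chunk.
            if current:
                chunks.append(current)
                current = None
            continue
        parts = line.rsplit(None, 1)  # the tag is the last whitespace-separated field
        if len(parts) != 2:
            raise ValueError("expected 'token tag', got: %r" % line)
        token, tag = parts
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            if current:
                chunks.append(current)
            # Assumption: an I- tag with no matching open chunk starts a new chunk.
            current = (tag[2:], [token]) if tag.startswith("I-") else None
    if current:
        chunks.append(current)
    return chunks


sample = [
    "فرانكفورت B-LOC",
    "(د O",
    "ب O",
    "أ) O",
    "أعلن O",
    "اتحاد B-ORG",
    "صناعة I-ORG",
    "السيارات I-ORG",
    "في O",
    "ألمانيا B-LOC",
]
chunks = read_conll_chunks(sample)
for entity_type, tokens in chunks:
    print(entity_type, " ".join(tokens))
```

On the sample this yields one LOC chunk, one three-token ORG chunk, and a second LOC chunk. Because the format keeps only tokens, the original spacing between them cannot be recovered, which is the whitespace loss mentioned above.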

Benajiba also supplies small dictionaries for locations, organizations and persons, which we explain how to use in the demo.
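
As a rough illustration of what using such a dictionary can look like, here is a Python sketch of greedy longest-match lookup over a token sequence. This is a generic technique, not necessarily what the demo does; the function name, the `max_len` limit, and the two dictionary entries (lifted from the sample above) are all assumptions made for the example.

```python
def dictionary_chunks(tokens, dictionary, max_len=4):
    """Greedy longest-match lookup.

    dictionary maps a tuple of tokens to an entity type.
    Returns (start, end, type) triples with end exclusive.
    """
    found = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in dictionary:
                match = (i, i + n, dictionary[key])
                break
        if match:
            found.append(match)
            i = match[1]  # continue after the matched phrase
        else:
            i += 1
    return found


tokens = ["فرانكفورت", "(د", "ب", "أ)", "أعلن",
          "اتحاد", "صناعة", "السيارات", "في", "ألمانيا"]

# Hypothetical entries built from the sample tokens, for illustration only;
# they are not taken from Benajiba's dictionary files.
dictionary = {
    tuple(tokens[5:8]): "ORG",  # the three-token organization name
    (tokens[9],): "LOC",        # the final location token
}

matches = dictionary_chunks(tokens, dictionary)
print(matches)
```

Matching like this ignores context: every occurrence of an entry gets tagged, whether or not it is used as a name in that sentence.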

LingPipe Model and Evaluation

I applied a simple sentence-chunking heuristic and then built a character 8-gram-based rescoring chunker using otherwise default parameters and training on all the data (but not the dictionaries).
There’s absolutely nothing Arabic-specific about the model.
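
One way to see why nothing language-specific is needed: a character n-gram model only ever looks at substrings of the raw text. The Python sketch below shows just the counting step for n-grams up to length 8. It is an illustration of the idea, not LingPipe's implementation, and it leaves out smoothing, the chunking itself, and the rescoring; the function name is invented for the example.

```python
from collections import Counter


def char_ngram_counts(text, max_n=8):
    """Count every character substring of length 1..max_n in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# The ORG phrase from the sample above; any Unicode string works the same way.
text = "اتحاد صناعة السيارات"
counts = char_ngram_counts(text)
print(len(counts), "distinct n-grams,", sum(counts.values()), "total")
```

The same function runs unchanged on English, Arabic, or any other script, since it never tokenizes or analyzes morphology.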

Overall performance is in line with Benajiba’s, while differing substantially on the four entity types. Here are the LingPipe 6-fold (125K token train, 25K token test) cross-validated results:

Type      Precision  Recall  F1
LOC       0.782      0.788   0.785
PERS      0.634      0.657   0.645
ORG       0.609      0.527   0.565
MISC      0.553      0.421   0.478
COMBINED  0.685      0.661   0.673
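
For readers who want to check the arithmetic, F1 is the harmonic mean of precision and recall. The Python sketch below recomputes F1 from the precision and recall columns above and also shows how all three numbers fall out of sets of chunks. The exact-match scoring (a predicted chunk counts only if its boundaries and type both match) is an assumption borrowed from the usual CoNLL convention; the post does not spell out the scoring rule.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def prf(gold, predicted):
    """Precision, recall, F1 over sets of (start, end, type) chunks.

    Assumes exact-match scoring: boundaries and type must both agree.
    """
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall, f1(precision, recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "LOC": (0.782, 0.788, 0.785),
    "PERS": (0.634, 0.657, 0.645),
    "ORG": (0.609, 0.527, 0.565),
    "MISC": (0.553, 0.421, 0.478),
    "COMBINED": (0.685, 0.661, 0.673),
}
for name, (p, r, reported_f1) in reported.items():
    print(name, round(f1(p, r), 3), reported_f1)
```

Each recomputed value agrees with the reported F1 to three decimal places.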

Dictionaries improved recall a bit, but hurt precision even more. Bigger dictionaries and more training data would certainly help here.
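
The 6-fold setup behind these numbers (roughly 125K tokens for training and 25K for testing in each fold) can be sketched in a few lines of Python. Splitting into contiguous blocks is an assumption, since the post does not say how the folds were drawn, and a real run would split on sentence boundaries rather than raw positions so that no sentence straddles two folds.

```python
def k_fold(items, k=6):
    """Yield (train, test) pairs, holding out one contiguous block per fold."""
    n = len(items)
    for fold in range(k):
        start = fold * n // k
        end = (fold + 1) * n // k
        yield items[:start] + items[end:], items[start:end]


# Stand-in for a 150K-token corpus: integers instead of real tokens.
corpus = list(range(150_000))
folds = list(k_fold(corpus, k=6))
for train, test in folds:
    print(len(train), len(test))
```

Every item is used for testing exactly once, and the per-fold scores are then pooled or averaged to give the cross-validated result.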

References

For a description of the corpus, and a description and evaluation of Benajiba’s own Arabic NER system, see:

And of course, there’s more information in LingPipe’s Named Entity Tutorial.

12 Responses to “Arabic Named Entity Recognition with the ANER Corpus”

  1. Laxmikant Agrawal Says:

    I am using LingPipe for named entity recognition in my research, and I want to create my own model. I am using the same code that the named entity tutorial uses for the CONLL2002 dataset, but I am getting an error saying illegal line =Codexis B_ORG. The first line in my training file is Codexis B_ORG. The same code works for the CONLL2002 training files. I created my training file manually. Is there anything special I need to consider? The error comes from parseString() in LineTaggingParser.java, which checks whether the line matches the pattern. I don’t see any mismatch with the pattern. Please help me out.

    • Bob Carpenter Says:

      You may have just gotten bitten by my updating the tutorials to reflect the actual tags used in CoNLL: B-ORG, not B_ORG. The last release of LingPipe (3.9.1) wouldn’t parse CoNLL, but would’ve parsed data with tags like B_ORG.

      All you need to do is convert your tags to CoNLL format, B-ORG, not B_ORG, and you should be good to go.

  2. Laxmikant Agrawal Says:

    I tried that already, but I am getting the following stack trace.

    F:\TextMining_Dr.Singh\LingPipeDir\lingpipe-3.9.1\demos\tutorial\ne>ant -Ddata.contracts=contracts/ train-contract
    Buildfile: F:\TextMining_Dr.Singh\LingPipeDir\lingpipe-3.9.1\demos\tutorial\ne\build.xml

    compile:
    [mkdir] Created dir: F:\TextMining_Dr.Singh\LingPipeDir\lingpipe-3.9.1\demos\tutorial\ne\build\classes
    [javac] Compiling 14 source files to F:\TextMining_Dr.Singh\LingPipeDir\lingpipe-3.9.1\demos\tutorial\ne\build\classes

    jar:
    [jar] Building jar: F:\TextMining_Dr.Singh\LingPipeDir\lingpipe-3.9.1\demos\tutorial\ne\neDemo.jar

    train-contract:
    [java] Setting up Chunker Estimator
    [java] Setting up Data Parser
    [java] Training with Data from File=contracts\contract1.train
    [java] java.util.regex.Matcher[pattern=(\S+)\s(\S+\s)?(O|[B|I]-\S+) region=0,14 lastmatch=]
    [java] Exception in thread "main" java.lang.IllegalArgumentException: Illegal =Codexis B-ORG
    [java] at com.aliasi.tag.LineTaggingParser.parseString(LineTaggingParser.java:158)
    [java] at Conll2002ChunkTagParser.parseString(Conll2002ChunkTagParser.java:60)
    [java] at com.aliasi.corpus.StringParser.parse(StringParser.java:68)
    [java] at com.aliasi.corpus.Parser.parse(Parser.java:99)
    [java] at com.aliasi.corpus.Parser.parse(Parser.java:115)
    [java] at TrainConll2002.main(TrainConll2002.java:43)
    [java] Java Result: 1

    • lingpipe Says:

      This really isn’t the best place for a discussion of bugs in LingPipe because I don’t check it as often as mail and it’s not where others go to find bug reports and/or patches.

      We have a mailing list linked from the web page and a direct mail address, bugs@alias-i.com.

      To answer the question, you need to upgrade to LingPipe 3.9.2. I fixed the bug I introduced in 3.9.1 with respect to NE parsing in CoNLL format. I tested the new parser configurations with all the data we talk about, and it now works with B- and I- tags.

  3. Laxmikant Agrawal Says:

    Sorry for posting a bug report here. Thanks for the answer to my query.

  4. noha Says:

    Hi, I have a question: I can’t work out how to compute the efficiency of the NER system, and what about
    Thanks in advance.

    • Bob Carpenter Says:

      I’m not sure what you mean — there are a bunch of things people often call “efficiency”. The big ones are memory and time. Our three built-in statistical NER systems vary on both of these. If you go to the last section of our NER tutorial, you’ll see a small comparison. The absolute amounts will depend on the number of categories, amount of pruning in the parameters, and the amount of training data and number of features.

      The best thing to do is find the NER you want and then time it on your hardware. For absolute throughput, keep in mind all our NER systems can be multithreaded with a single model in memory.

  5. ziadov Says:

    Could anyone please send me the class src/ANERXVal.java? I couldn’t find it on the server :(

    Thanks in advance.

  6. Cyber Says:

    Hey LingPipe, you have made my life easier with your NLP tutorials and resources. I have successfully built an NER model for the Tigrigna language (an under-resourced language, like Arabic) using the resources in LingPipe (ANERXval, an HMM model). Now I need to build a model using a CRF, with two experimental setups: a partition of the entire corpus into a training set (2/3 or 80%) and a test set (1/3 or 20%), and cross-validation with folds, the same as ANERXval. The CRF-based NER system needs to work with a scarce corpus and read Unicode text files.
    What do you recommend? Please point me to resources and other links for performing such NLP tasks. Thank you very much for your excellent tutorials and support.
