One thing I forgot to mention in the release notes for LingPipe 3.8.2 is that I’ve added a section on building an Arabic named entity recognizer to the
- LingPipe Named Entity Tutorial

Benajiba’s ANER Corpus
It’s based on Yassine Benajiba’s freely distributed (thanks!) ANER Corpus.
It’s 150K tokens in CoNLL format (easy to parse, but lossy for whitespace), using person, location, organization, and miscellaneous entity types (like CoNLL’s English corpus). Here’s a sample (rendered with the style value direction:rtl to get the ordering right; the actual file’s in the usual character order):
فرانكفورت B-LOC
(د O
ب O
أ) O
أعلن O
اتحاد B-ORG
صناعة I-ORG
السيارات I-ORG
في O
ألمانيا B-LOC
...
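To make the tag scheme concrete, here is a minimal sketch of how the B-/I-/O tags group into entity spans. This is plain illustrative Java, not LingPipe’s own parser, and the class and method names are made up:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal BIO decoder: groups consecutive B-X/I-X tagged tokens into entity spans.
public class BioDecoder {

    // Returns a list of {type, phrase} pairs for parallel token/tag arrays.
    public static List<String[]> decode(String[] tokens, String[] tags) {
        List<String[]> entities = new ArrayList<String[]>();
        StringBuilder phrase = null;
        String type = null;
        for (int i = 0; i < tokens.length; ++i) {
            String tag = tags[i];
            if (tag.startsWith("B-")) {
                if (phrase != null)
                    entities.add(new String[] { type, phrase.toString() });
                type = tag.substring(2);                 // open a new entity
                phrase = new StringBuilder(tokens[i]);
            } else if (tag.startsWith("I-") && phrase != null
                       && type.equals(tag.substring(2))) {
                phrase.append(' ').append(tokens[i]);    // continue the open entity
            } else {                                     // "O" or a stray I- tag
                if (phrase != null)
                    entities.add(new String[] { type, phrase.toString() });
                phrase = null;
                type = null;
            }
        }
        if (phrase != null)
            entities.add(new String[] { type, phrase.toString() });
        return entities;
    }
}
```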
Benajiba also supplies small dictionaries for locations, organizations and persons, which we explain how to use in the demo.
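The demo covers the details of how the dictionaries are actually used; purely as an illustration of one way to fold such a gazetteer into a chunker, here is a sketch with LingPipe’s exact dictionary chunker. The entries, the score of 1.0, and the tokenizer choice are assumptions for illustration, not the demo’s configuration:

```java
import com.aliasi.chunk.Chunking;
import com.aliasi.chunk.ExactDictionaryChunker;
import com.aliasi.dict.DictionaryEntry;
import com.aliasi.dict.MapDictionary;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;

// Sketch: wrap a small gazetteer in an exact-match dictionary chunker.
public class DictionarySketch {
    public static void main(String[] args) {
        MapDictionary<String> dictionary = new MapDictionary<String>();
        dictionary.addEntry(new DictionaryEntry<String>("فرانكفورت", "LOC", 1.0));
        dictionary.addEntry(new DictionaryEntry<String>("ألمانيا", "LOC", 1.0));

        ExactDictionaryChunker chunker
            = new ExactDictionaryChunker(dictionary,
                                         IndoEuropeanTokenizerFactory.INSTANCE,
                                         false,  // returnAllMatches
                                         true);  // caseSensitive
        Chunking chunking
            = chunker.chunk("أعلن اتحاد صناعة السيارات في ألمانيا");
        System.out.println(chunking.chunkSet());
    }
}
```

Benajiba’s dictionary files would be read in and added entry by entry in the same way.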
LingPipe Model and Evaluation
I applied a simple sentence-chunking heuristic and then built a character 8-gram-based rescoring chunker using otherwise default parameters and training on all the data (but not the dictionaries).
There’s absolutely nothing Arabic-specific about the model.
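For concreteness, a minimal training sketch along the lines of the tutorial code looks roughly like this. The 8-gram order comes from the description above; the other parameter values, the toy training example, and the output file name are illustrative assumptions, not the demo’s exact settings:

```java
import java.io.File;
import java.io.IOException;

import com.aliasi.chunk.CharLmRescoringChunker;
import com.aliasi.chunk.ChunkFactory;
import com.aliasi.chunk.ChunkingImpl;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.AbstractExternalizable;

// Sketch of training a character-LM rescoring chunker, roughly as in the NE tutorial.
public class TrainSketch {
    public static void main(String[] args) throws IOException {
        TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
        int numChunkingsRescored = 64;   // assumed default
        int nGram = 8;                   // character 8-grams, as in the post
        int numChars = 256;              // assumed size of the character set
        double interpolationRatio = nGram;

        CharLmRescoringChunker estimator
            = new CharLmRescoringChunker(factory, numChunkingsRescored,
                                         nGram, numChars, interpolationRatio);

        // One toy training chunking; in the demo a corpus parser feeds these in.
        String text = "John Smith lives in Berlin .";
        ChunkingImpl chunking = new ChunkingImpl(text);
        chunking.add(ChunkFactory.createChunk(0, 10, "PERS"));
        chunking.add(ChunkFactory.createChunk(20, 26, "LOC"));
        estimator.handle(chunking);

        // Compile the trained estimator to a file-based model.
        AbstractExternalizable.compileTo(estimator, new File("aner.model"));
    }
}
```

The compiled model can then be read back with an ObjectInputStream and used as a Chunker, as in the tutorial.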
Overall performance is in line with Benajiba’s, though it differs substantially across the four entity types. Here are LingPipe’s 6-fold (125K-token train, 25K-token test) cross-validated results:
Type | Precision | Recall | F1 |
---|---|---|---|
LOC | 0.782 | 0.788 | 0.785 |
PERS | 0.634 | 0.657 | 0.645 |
ORG | 0.609 | 0.527 | 0.565 |
MISC | 0.553 | 0.421 | 0.478 |
COMBINED | 0.685 | 0.661 | 0.673 |
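For reference, F1 here is the balanced harmonic mean of precision and recall, F1 = 2PR/(P+R), which you can check against the table with a couple of lines of Java:

```java
// F1 is the harmonic mean of precision and recall: F1 = 2PR / (P + R).
public class F1Check {
    static double f1(double p, double r) {
        return 2.0 * p * r / (p + r);
    }
    public static void main(String[] args) {
        System.out.printf("LOC  F1 = %.3f%n", f1(0.782, 0.788)); // prints 0.785
        System.out.printf("MISC F1 = %.3f%n", f1(0.553, 0.421)); // prints 0.478
    }
}
```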
Dictionaries improved recall a bit, but hurt precision even more. Bigger dictionaries and more training data would certainly help here.
References
For a description of the corpus, and a description and evaluation of Benajiba’s own Arabic NER system, see:
- Benajiba, Y. and P. Rosso. 2007. ANERsys 2.0: Conquering the NER task for the Arabic language by combining the Maximum Entropy with POS-tag information. In IICAI-2007.
- Benajiba, Y., P. Rosso, and J. M. Benedí. 2007. ANERsys: An Arabic Named Entity Recognition system based on Maximum Entropy. In CICLing-2007.
And of course, there’s more information in LingPipe’s Named Entity Tutorial.
April 27, 2010 at 7:12 pm
I am using LingPipe for named entity recognition in my research. I want to create my own model. For that I am using the same code as shown in the named entity tutorial for the CoNLL 2002 dataset, but I am getting errors saying illegal line =Codexis B_ORG. The first line in my training file is Codexis B_ORG. The same code works for the CoNLL 2002 training files. I created the training file manually. Is there anything special I need to consider? The error comes from the LineTaggingParser.java class, from parseString() in that file, where it checks whether the line matches the pattern or not. I don’t see any mismatch with the pattern. Please help me out.
April 28, 2010 at 12:53 am
You may have just gotten bitten by my updating the tutorials to reflect the actual tags used in CoNLL, B-ORG, not B_ORG. The last release of LingPipe (3.9.1) wouldn’t parse CoNLL, but would’ve parsed data with tags like B_ORG.
All you need to do is convert your tags to CoNLL format, B-ORG, not B_ORG, and you should be good to go.
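If it helps, a quick one-off conversion could look something like the following standalone sketch (the file names are placeholders, and this isn’t part of the LingPipe distribution):

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

// One-off fix-up: rewrite tags like B_ORG/I_ORG to CoNLL-style B-ORG/I-ORG.
public class FixTags {
    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
            new InputStreamReader(new FileInputStream("contract1.train"), "UTF-8"));
        PrintWriter out = new PrintWriter(
            new OutputStreamWriter(new FileOutputStream("contract1.fixed.train"), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null)
            out.println(line.replaceAll("([BI])_(\\S+)$", "$1-$2"));
        in.close();
        out.close();
    }
}
```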
April 28, 2010 at 11:42 am
I tried it already, but I am getting the following stack trace.
F:\TextMining_Dr.Singh\LingPipeDir\lingpipe-3.9.1\demos\tutorial\ne>ant -Ddata.contracts=contracts/ train-contract
Buildfile: F:\TextMining_Dr.Singh\LingPipeDir\lingpipe-3.9.1\demos\tutorial\ne\build.xml
compile:
[mkdir] Created dir: F:\TextMining_Dr.Singh\LingPipeDir\lingpipe-3.9.1\demos\tutorial\ne\build\classes
[javac] Compiling 14 source files to F:\TextMining_Dr.Singh\LingPipeDir\lingpipe-3.9.1\demos\tutorial\ne\build\classes
jar:
[jar] Building jar: F:\TextMining_Dr.Singh\LingPipeDir\lingpipe-3.9.1\demos\tutorial\ne\neDemo.jar
train-contract:
[java] Setting up Chunker Estimator
[java] Setting up Data Parser
[java] Training with Data from File=contracts\contract1.train
[java] java.util.regex.Matcher[pattern=(\S+)\s(\S+\s)?(O|[B|I]-\S+) region=0,14 lastmatch=]
[java] Exception in thread "main" java.lang.IllegalArgumentException: Illegal =Codexis B-ORG
[java] at com.aliasi.tag.LineTaggingParser.parseString(LineTaggingParser.java:158)
[java] at Conll2002ChunkTagParser.parseString(Conll2002ChunkTagParser.java:60)
[java] at com.aliasi.corpus.StringParser.parse(StringParser.java:68)
[java] at com.aliasi.corpus.Parser.parse(Parser.java:99)
[java] at com.aliasi.corpus.Parser.parse(Parser.java:115)
[java] at TrainConll2002.main(TrainConll2002.java:43)
[java] Java Result: 1
April 29, 2010 at 11:25 am
This really isn’t the best place for a discussion of bugs in LingPipe because I don’t check it as often as mail and it’s not where others go to find bug reports and/or patches.
We have a mailing list linked from the web page and a direct mail address, bugs@alias-i.com.
To answer the question, you need to upgrade to LingPipe 3.9.2. I fixed the bug I introduced in 3.9.1 with respect to NE parsing in CoNLL format. I tested the new parser configurations with all the data we talk about, and it now works with B- and I- tags.
April 29, 2010 at 11:52 am
Sorry for posting a bug report here. Thanks for the answer to my query.
March 21, 2011 at 7:37 am
Hi, I have a question: I can’t see how we can compute the efficiency of the NER system, and what about
Thanks in advance
March 21, 2011 at 12:42 pm
I’m not sure what you mean — there are a bunch of things people often call “efficiency”. The big ones are memory and time. Our three built-in statistical NER systems vary on both of these. If you go to the last section of our NER tutorial, you’ll see a small comparison. The absolute amounts will depend on the number of categories, amount of pruning in the parameters, and the amount of training data and number of features.
The best thing to do is find the NER you want and then time it on your hardware. For absolute throughput, keep in mind all our NER systems can be multithreaded with a single model in memory.
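For rough wall-clock throughput on your own hardware, something as simple as the following sketch is usually enough (the model file name and test text are placeholders):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;

import com.aliasi.chunk.Chunker;

// Crude throughput check: characters chunked per second for a compiled model.
public class TimeChunker {
    public static void main(String[] args)
        throws IOException, ClassNotFoundException {
        ObjectInputStream objIn
            = new ObjectInputStream(new FileInputStream(new File("ne-model.bin")));
        Chunker chunker = (Chunker) objIn.readObject();
        objIn.close();

        String text = "Whatever test text you care about goes here.";
        long start = System.nanoTime();
        long chars = 0;
        for (int i = 0; i < 1000; ++i) {   // repeat to smooth out timing noise
            chunker.chunk(text);
            chars += text.length();
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("%.0f chars/second%n", chars / secs);
    }
}
```

Since a single compiled model can serve several decoding threads, you can also run the same loop from multiple threads to estimate multithreaded throughput.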
May 14, 2012 at 5:21 pm
Could anyone please send me this class,
src/ANERXVal.java,
because I couldn’t find it on the server :(
Thanks in advance
May 15, 2012 at 1:41 pm
That’s a lower-case “v” for some reason:
http://alias-i.com/lingpipe/demos/tutorial/ne/src/ANERXval.java
The web site mirrors the source tar ball, so it’s in
/demos/tutorial/ne/src/ANERXval.java
in the release.
May 21, 2012 at 12:31 pm
Thanks a lot for the help :)
August 24, 2016 at 6:02 am
Hey LingPipe, you have made my life easier with your NLP tutorials and resources. I have successfully built an NER model for Tigrigna (an under-resourced language, like Arabic) using the resources in LingPipe (ANERXval, an HMM model). But now I need to build the model using a CRF, with this experiment set-up: partition the entire corpus into a training set (2/3 or 80%) and a test set (1/3 or 20%), and also run cross-validation, i.e. the same fold-based cross-validation as ANERXval but using a CRF to build the NER system (especially for scarce corpora), reading Unicode text files.
What do you recommend I do? Please point me to resources and other links for performing such NLP tasks. Thank you a lot for your excellent tutorials and support.
August 24, 2016 at 9:38 am
The best source for learning how to use LingPipe at this point is our book. See the
LingPipe Cookbook. It’s available on Amazon too.
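For the partitioning part of the question, which doesn’t depend on the choice of CRF versus HMM, a minimal sketch of an 80/20 split over sentence-level examples might look like this (class and method names are made up; for k-fold cross-validation you would instead rotate which fold serves as the test set):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch: shuffle sentence-level examples and split them 80/20 into train/test.
public class PartitionSketch {
    public static <E> List<List<E>> split(List<E> examples,
                                          double trainFraction,
                                          long seed) {
        List<E> shuffled = new ArrayList<E>(examples);
        Collections.shuffle(shuffled, new Random(seed));  // fixed seed for repeatability
        int cut = (int) Math.round(trainFraction * shuffled.size());
        List<List<E>> parts = new ArrayList<List<E>>();
        parts.add(shuffled.subList(0, cut));                 // training set
        parts.add(shuffled.subList(cut, shuffled.size()));   // test set
        return parts;
    }
}
```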