LingPipe Talk at Columbia Uni, 26 Oct 2006

I’ll be giving the following talk at Columbia next Thursday.

4:15-5:15 PM
CS Conference Room
Mudd Building

Character Language Modeling for
Word Segmentation and Entity Detection

Bob Carpenter
Alias-i, Inc.

I’ll discuss the application of LingPipe’s character
language models to two problems in Chinese language
processing: word segmentation and named entity extraction.

For word segmentation, we use the same noisy channel
model as we use for spelling correction. The source model
is a character language model trained on word-segmented
Chinese data. The channel model is weighted edit distance;
for word segmentation, this is merely deterministic space
deletion. There are no Chinese-specific features at all
in the models. The bakeoff F1 measure for our segmenter
was .961; the winning F1 was .972.
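
As a rough illustration of the idea (not LingPipe's code), here is
a minimal Python sketch. Because the channel only deletes spaces,
every candidate that matches the input once its spaces are removed
has channel probability 1, so decoding reduces to finding the
space insertion that the source model scores highest. The names
CharNGramLM and segment are hypothetical, and the trigram model
with additive smoothing is an assumed stand-in for a real
character language model.

```python
import math
from collections import Counter


class CharNGramLM:
    """Toy character trigram model with additive smoothing (assumed stand-in)."""

    def __init__(self, segmented_lines, order=3, alpha=0.1):
        assert order >= 2, "this sketch needs at least one character of context"
        self.order = order
        self.alpha = alpha
        self.ngrams = Counter()    # counts of context + next character
        self.contexts = Counter()  # counts of context alone
        self.vocab = {" "}
        pad = " " * (order - 1)
        for line in segmented_lines:
            # Normalize to single spaces; a trailing space marks the line end.
            chars = pad + " ".join(line.split()) + " "
            self.vocab.update(chars)
            for i in range(order - 1, len(chars)):
                ctx = chars[i - order + 1:i]
                self.ngrams[ctx + chars[i]] += 1
                self.contexts[ctx] += 1

    def log_prob(self, context, ch):
        ctx = context[-(self.order - 1):]
        # One extra vocabulary slot is reserved for unseen characters.
        num = self.ngrams[ctx + ch] + self.alpha
        den = self.contexts[ctx] + self.alpha * (len(self.vocab) + 1)
        return math.log(num / den)


def segment(lm, text):
    """Return text with spaces inserted so as to maximize the LM score."""
    width = lm.order - 1
    pad = " " * width
    # Viterbi search: the state is the last (order - 1) output characters.
    hyps = {pad: (0.0, pad)}
    for ch in text:
        new_hyps = {}
        for ctx, (score, out) in hyps.items():
            # Option 1: emit the character directly.
            options = [(score + lm.log_prob(ctx, ch), out + ch)]
            # Option 2: emit a space first (never at the very start).
            if not out.endswith(" "):
                with_space = score + lm.log_prob(ctx, " ")
                ctx2 = (ctx + " ")[-width:]
                options.append((with_space + lm.log_prob(ctx2, ch),
                                out + " " + ch))
            for s, o in options:
                key = o[-width:]
                if key not in new_hyps or s > new_hyps[key][0]:
                    new_hyps[key] = (s, o)
        hyps = new_hyps
    # Score the end of the line (a final space) and keep the best output.
    _, best_out = max(
        ((s + lm.log_prob(ctx, " "), o) for ctx, (s, o) in hyps.items()),
        key=lambda pair: pair[0],
    )
    return best_out[len(pad):]


if __name__ == "__main__":
    # Tiny made-up training set; real training data would be a segmented corpus.
    lm = CharNGramLM(["我 爱 北京", "我 爱 上海", "北京 很 大"])
    print(segment(lm, "我爱北京"))
```

Nothing in the sketch looks inside the characters themselves, which
is the sense in which such a model has no Chinese-specific features.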

For named entity extraction, we use a two-stage process.
The first stage is an HMM with character language model
emissions. For Chinese, where we consider each character
a token, this reduces to the more usual multinomial emission HMM.
We code entity extraction as a tagging problem, using fine-grained
states to effectively encode a higher-order HMM.
The second stage, rescoring, uses a pure character language model approach
that allows longer-distance dependencies, encoding chunking
information as characters within the models. As with word
segmentation, there are no Chinese-specific features. The bakeoff
F1 for our entity extractor was .855; the winning F1 was .890.
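
To make the "entity extraction as tagging" step concrete, here is
a small Python sketch of one way to turn entity chunks into
per-character tags and back. The B_/M_/E_/W_ prefixes (begin,
middle, end, whole) and the function names are assumptions for
illustration only; the state set LingPipe actually uses may differ.
The point is just that position-within-entity information lives in
the tag itself, so a first-order tagger over these tags sees more
context than one over plain entity types.

```python
def chunks_to_tags(text, chunks):
    """Encode non-overlapping (start, end, type) chunks as one tag per character.

    Offsets are character positions with an exclusive end.
    """
    tags = ["O"] * len(text)
    for start, end, typ in chunks:
        if end - start == 1:
            tags[start] = "W_" + typ          # whole entity in one character
        else:
            tags[start] = "B_" + typ          # first character
            for i in range(start + 1, end - 1):
                tags[i] = "M_" + typ          # interior characters
            tags[end - 1] = "E_" + typ        # last character
    return tags


def tags_to_chunks(tags):
    """Decode a tag sequence back into (start, end, type) chunks."""
    chunks = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        position, typ = tag.split("_", 1)
        if position == "W":
            chunks.append((i, i + 1, typ))
            start = None
        elif position == "B":
            start = i
        elif position == "E" and start is not None:
            chunks.append((start, i + 1, typ))
            start = None
        # "M" tags simply extend the chunk that is currently open.
    return chunks


if __name__ == "__main__":
    text = "江泽民在北京"
    chunks = [(0, 3, "PER"), (4, 6, "LOC")]
    tags = chunks_to_tags(text, chunks)
    for ch, tag in zip(text, tags):
        print(ch, tag)
    print(tags_to_chunks(tags))
```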

Time permitting, I’ll discuss our confidence-ranking entity
and part-of-speech taggers and show some output from MEDLINE
POS tagging and gene mention extraction.

The LingPipe web site provides tutorials on both word segmentation and entity extraction. There are also web demos for both applications. The sandbox contains the complete code used to generate entries for the SIGHAN bakeoff; the data is available from SIGHAN. Two papers covering roughly the same material as the talk are available at:
