Postponed: Character LMs with a Dozen Apps – Columbia Uni Talk


I was scheduled to give a talk on Thursday, 23 March 2006, as part of Columbia University’s OTSLAC series (http://www1.cs.columbia.edu/nlp/otslac.html), but it has been postponed; it will now happen some time this semester. It’ll be in the CS Seminar Room in the Mudd Building on Columbia’s main campus.

Here’s the abstract:

LingPipe's Character Language Models with a Dozen Applications

Bob Carpenter        http://www.colloquial.com/carp
Alias-i, Inc.        http://www.alias-i.com/lingpipe

LingPipe 2 applies character-level language models to a range of
common natural language applications:

  Compression: LMs provide general text compression.  We consider (1)
  compression of NY Times data at 1.4 bits/character.
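
How the compression rate falls out of the LM: an arithmetic coder driven by the model spends roughly -log2 P(c | context) bits on each character, so bits/character is just the model's average per-character log loss over the text. Below is a minimal Java sketch of that computation only; it is not LingPipe code, and the add-one-smoothed bigram over ASCII is a stand-in for LingPipe's interpolated n-gram process LMs, which (together with the NY Times training data) is what actually reaches 1.4 bits/character.

    import java.util.HashMap;
    import java.util.Map;

    /** Bits-per-character of a text under a character bigram LM (add-one smoothed). */
    public class BitsPerChar {

        // ASCII-only for the sketch; real character LMs smooth over the full char set.
        private static final int ALPHABET_SIZE = 128;

        private final Map<String,Integer> bigramCounts = new HashMap<>();
        private final Map<Character,Integer> contextCounts = new HashMap<>();

        /** Accumulate bigram counts from training text. */
        void train(String text) {
            for (int i = 1; i < text.length(); ++i) {
                bigramCounts.merge(text.substring(i - 1, i + 1), 1, Integer::sum);
                contextCounts.merge(text.charAt(i - 1), 1, Integer::sum);
            }
        }

        /** -log2 P(c | prev): the bits an arithmetic coder spends encoding c. */
        double bits(char prev, char c) {
            int big = bigramCounts.getOrDefault("" + prev + c, 0);
            int ctx = contextCounts.getOrDefault(prev, 0);
            double p = (big + 1.0) / (ctx + ALPHABET_SIZE);   // add-one smoothing
            return -Math.log(p) / Math.log(2.0);
        }

        /** Average bits per character over a test text: the achievable compression rate. */
        double bitsPerChar(String text) {
            double total = 0.0;
            for (int i = 1; i < text.length(); ++i)
                total += bits(text.charAt(i - 1), text.charAt(i));
            return total / (text.length() - 1);
        }

        public static void main(String[] args) {
            BitsPerChar lm = new BitsPerChar();
            lm.train("the quick brown fox jumps over the lazy dog. the dog barks.");
            System.out.printf("bits/char = %.2f%n", lm.bitsPerChar("the lazy dog barks."));
        }
    }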

  Noisy Channel Decoding: LMs model the source and weighted edit
  distance the channel.  This supports (2) corpus-sensitive
  language-independent spelling correction, (3) case renormalization
  of monocased data, and (4) Chinese tokenization.  Supports N-best
  output with joint probabilities.
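
Concretely, noisy-channel decoding scores each candidate source string d for an observed string o by log P(d) + log P(o | d), with the first term from the character LM and the second from the weighted edit distance. Here is a rough sketch of that ranking step in plain Java; the two interfaces are placeholders for trained models rather than LingPipe's API, and candidate generation is assumed to happen elsewhere.

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    /** Noisy-channel ranking: sort candidates by log P(candidate) + log P(observed | candidate). */
    public class NoisyChannelDecoder {

        interface SourceModel  { double log2Prob(String text); }                     // character LM
        interface ChannelModel { double log2Prob(String observed, String source); }  // weighted edit distance

        private final SourceModel source;
        private final ChannelModel channel;

        NoisyChannelDecoder(SourceModel source, ChannelModel channel) {
            this.source = source;
            this.channel = channel;
        }

        /** Joint log probability of the candidate source and the observed (possibly corrupted) input. */
        double jointLog2Prob(String observed, String candidate) {
            return source.log2Prob(candidate) + channel.log2Prob(observed, candidate);
        }

        /** N-best output: candidates ordered by joint probability, best first. */
        List<String> nBest(String observed, List<String> candidates) {
            return candidates.stream()
                .sorted(Comparator.comparingDouble((String c) -> jointLog2Prob(observed, c)).reversed())
                .collect(Collectors.toList());
        }
    }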

  Classification: LMs model probabilities of texts given categories.
  This supports: (5) sentence-level subjectivity and review-level
  polarity classification, (6) spam detection, (7) topic classification,
  (8) language identification, and (9) word-sense disambiguation of gene
  names to support rejection of non-genes and the linkage of ambiguous
  genes to EntrezGene.  Supports one-vs-all or many-way classification
  with joint and conditional probabilities.
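
The classification setup is Bayes' rule with one character LM per category: a text is scored as log P(text | category) + log P(category), the joint scores pick the best category, and normalizing them across categories gives the conditional probabilities. Below is a small sketch under those assumptions; the CharLM interface stands in for a trained per-category model and is not LingPipe's actual classifier API.

    import java.util.HashMap;
    import java.util.Map;

    /** Many-way (or one-vs-all) classification with one character LM per category. */
    public class LMClassifier {

        interface CharLM { double log2Prob(String text); }   // trained on that category's texts

        private final Map<String,CharLM> categoryLMs;
        private final Map<String,Double> log2Priors;         // log2 P(category)

        LMClassifier(Map<String,CharLM> categoryLMs, Map<String,Double> log2Priors) {
            this.categoryLMs = categoryLMs;
            this.log2Priors = log2Priors;
        }

        /** Joint score: log2 P(text | category) + log2 P(category). */
        double jointLog2Prob(String text, String category) {
            return categoryLMs.get(category).log2Prob(text) + log2Priors.get(category);
        }

        /** Conditional P(category | text), obtained by normalizing the joint scores. */
        Map<String,Double> conditionalProbs(String text) {
            Map<String,Double> joint = new HashMap<>();
            for (String cat : categoryLMs.keySet())
                joint.put(cat, jointLog2Prob(text, cat));
            double max = joint.values().stream().mapToDouble(Double::doubleValue).max().orElse(0.0);
            double z = 0.0;
            for (double lp : joint.values())
                z += Math.pow(2.0, lp - max);                // subtract max for numerical stability
            Map<String,Double> conditional = new HashMap<>();
            for (Map.Entry<String,Double> e : joint.entrySet())
                conditional.put(e.getKey(), Math.pow(2.0, e.getValue() - max) / z);
            return conditional;
        }
    }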

  Tagging: LMs model emission probabilities for Hidden Markov models.
  This supports: (10) part-of-speech tagging, (11) phrase chunking,
  and (12) named-entity extraction.  Supports tag-level or
  analysis-level n-best and confidence ranking.  Chunking supports
  chunk-level confidence.
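
In the HMM, the emission distribution P(token | tag) is itself a character LM trained on the tokens seen with that tag, which is what lets the tagger score words it has never seen. Here is a compact first-best Viterbi sketch under that assumption; the Transitions and Emissions interfaces are placeholders for trained models rather than LingPipe's API, and n-best or confidence-ranked output would keep more than one back-pointer per cell.

    /** Viterbi decoding for an HMM whose emission scores come from per-tag character LMs. */
    public class LMHmmTagger {

        interface Transitions { double log2Prob(int fromTag, int toTag); } // fromTag = -1 means start
        interface Emissions   { double log2Prob(int tag, String token); }  // per-tag character LM

        private final int numTags;
        private final Transitions trans;
        private final Emissions emit;

        LMHmmTagger(int numTags, Transitions trans, Emissions emit) {
            this.numTags = numTags;
            this.trans = trans;
            this.emit = emit;
        }

        /** Best tag sequence for a non-empty token sequence. */
        int[] tag(String[] tokens) {
            int n = tokens.length;
            double[][] score = new double[n][numTags];
            int[][] back = new int[n][numTags];
            for (int t = 0; t < numTags; ++t)
                score[0][t] = trans.log2Prob(-1, t) + emit.log2Prob(t, tokens[0]);
            for (int i = 1; i < n; ++i) {
                for (int t = 0; t < numTags; ++t) {
                    double best = Double.NEGATIVE_INFINITY;
                    int bestPrev = 0;
                    for (int p = 0; p < numTags; ++p) {
                        double s = score[i - 1][p] + trans.log2Prob(p, t);
                        if (s > best) { best = s; bestPrev = p; }
                    }
                    score[i][t] = best + emit.log2Prob(t, tokens[i]);
                    back[i][t] = bestPrev;
                }
            }
            int[] tags = new int[n];
            int bestLast = 0;
            for (int t = 1; t < numTags; ++t)
                if (score[n - 1][t] > score[n - 1][bestLast]) bestLast = t;
            tags[n - 1] = bestLast;
            for (int i = n - 1; i > 0; --i)
                tags[i - 1] = back[i][tags[i]];
            return tags;
        }
    }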

The talk will focus on three representative applications in detail: Chinese
tokenization, gene mention linkage with EntrezGene, and named-entity
extraction.

Alias-i develops search- and data-mining-style exploratory
applications, both of which are enhanced by n-best analyses and
confidence-ranked analyses.  Decoding with n-best and confidence also
supports hierarchical modeling, rescoring by downstream processes
(such as parsing or coreference), and rescoring by higher-order and/or
more richly featured models which would be too expensive to decode
directly.
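
The rescoring setup in that last paragraph is plain re-ranking: decode an n-best list with the cheap model, score each hypothesis again with the richer model, and re-sort on a combination of the two scores. A minimal sketch follows; the interpolation weight lambda and the ExpensiveModel interface are illustrative assumptions, not a LingPipe interface.

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    /** Re-rank an n-best list with a richer model that would be too slow to decode directly. */
    public class NBestRescorer {

        /** A hypothesis from the first-pass decoder with its original log score. */
        record Hypothesis(String analysis, double baseLog2Score) { }

        interface ExpensiveModel { double log2Score(String analysis); }

        /** Interpolate base and rescoring scores with weight lambda and sort best first. */
        static List<Hypothesis> rescore(List<Hypothesis> nBest, ExpensiveModel model, double lambda) {
            return nBest.stream()
                .sorted(Comparator.comparingDouble((Hypothesis h) ->
                        (1.0 - lambda) * h.baseLog2Score() + lambda * model.log2Score(h.analysis()))
                    .reversed())
                .collect(Collectors.toList());
        }
    }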
