Postponed: Character LMs with a Dozen Apps – Columbia Uni Talk


I was scheduled to give a talk on Thursday, 23 March 2006 as part of Columbia University’s OTSLAC series, but it has been postponed to some time later this semester. It’ll be in the CS Seminar Room in the Mudd Building on Columbia’s main campus.

Here’s the abstract:

LingPipe's Character Language Models with a Dozen Applications

Bob Carpenter
Alias-i, Inc.

LingPipe 2 applies character-level language models to a range of
common natural language applications:

  Compression: LMs provide general text compression.  We consider (1)
  compression of NY Times data at 1.4 bits/character.
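
To make the compression application concrete, here's a minimal sketch of the idea (toy code, not LingPipe's API): an interpolated character trigram model whose cross-entropy on test text is the bits/character an arithmetic coder driven by the model would achieve. The add-one smoothing, fixed interpolation weights, alphabet size, and training strings are all illustrative assumptions; a real 1.4 bits/char result needs a higher-order model and better smoothing.

import java.util.HashMap;
import java.util.Map;

public class CharLmBitsPerChar {
    static final int ALPHABET = 256;             // assumed alphabet size
    private final Map<String, Integer> counts = new HashMap<>();
    private int total = 0;                       // total characters trained on

    public void train(String text) {
        for (int i = 0; i < text.length(); ++i) {
            ++total;
            // count the 1-, 2-, and 3-grams ending at position i
            for (int n = 1; n <= 3 && n <= i + 1; ++n)
                counts.merge(text.substring(i + 1 - n, i + 1), 1, Integer::sum);
        }
    }

    private int count(String s) { return counts.getOrDefault(s, 0); }

    // add-one estimate of P(c | ctx)
    private double est(String ctx, char c) {
        int contextCount = ctx.isEmpty() ? total : count(ctx);
        return (count(ctx + c) + 1.0) / (contextCount + ALPHABET);
    }

    // fixed-weight interpolation of trigram, bigram, and unigram estimates
    private double prob(String text, int i) {
        char c = text.charAt(i);
        return 0.6 * est(text.substring(Math.max(0, i - 2), i), c)
             + 0.3 * est(text.substring(Math.max(0, i - 1), i), c)
             + 0.1 * est("", c);
    }

    // cross-entropy on test text = bits/character under arithmetic coding
    public double bitsPerChar(String test) {
        double bits = 0.0;
        for (int i = 0; i < test.length(); ++i)
            bits -= Math.log(prob(test, i)) / Math.log(2.0);
        return bits / test.length();
    }

    public static void main(String[] args) {
        CharLmBitsPerChar lm = new CharLmBitsPerChar();
        lm.train("the cat sat on the mat. the dog sat on the log.");
        System.out.printf("%.2f bits/char%n",
                          lm.bitsPerChar("the cat sat on the log."));
    }
}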

  Noisy Channel Decoding: LMs model the source, and a weighted edit
  distance models the channel.  This supports (2) corpus-sensitive,
  language-independent spelling correction, (3) case renormalization
  of monocased data, and (4) Chinese tokenization.  Supports n-best
  output with joint probabilities.
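
A minimal sketch of the noisy-channel setup (toy code, not LingPipe's API): each candidate source string s for an observed string o is scored by log P(s) + log P(o | s), with the channel approximated as a Levenshtein edit distance scaled by a fixed per-edit log probability and the source as a hand-set log-probability table. The per-edit cost, candidate set, and source probabilities are all illustrative assumptions; a real system would use trained edit weights and a character LM for the source.

import java.util.Map;

public class NoisyChannelSpell {
    static final double EDIT_LOG2_PROB = -4.0;   // assumed log2 cost per edit

    // log2 P(observed | source): edit distance scaled by per-edit cost
    static double channelLog2Prob(String source, String observed) {
        int m = source.length(), n = observed.length();
        int[][] d = new int[m + 1][n + 1];
        for (int i = 0; i <= m; ++i) d[i][0] = i;
        for (int j = 0; j <= n; ++j) d[0][j] = j;
        for (int i = 1; i <= m; ++i)
            for (int j = 1; j <= n; ++j) {
                int sub = source.charAt(i - 1) == observed.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + sub,
                           Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        return EDIT_LOG2_PROB * d[m][n];
    }

    public static void main(String[] args) {
        String observed = "langauge";
        // toy source model: log2 P(source) for a few candidate corrections
        Map<String, Double> sourceLog2Prob = Map.of(
            "language", -11.0, "langue", -18.0, "lounge", -15.0);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : sourceLog2Prob.entrySet()) {
            // joint log probability: source term plus channel term
            double score = e.getValue() + channelLog2Prob(e.getKey(), observed);
            System.out.printf("%-10s %6.1f%n", e.getKey(), score);
            if (score > bestScore) { bestScore = score; best = e.getKey(); }
        }
        System.out.println("best correction: " + best);
    }
}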

  Classification: LMs model probabilities of texts given categories.
  This supports: (5) sentence-level subjectivity and review-level
  polarity classification, (6) spam detection, (7) topic classification,
  (8) language identification, and (9) word-sense disambiguation of gene
  names to support rejection of non-genes and the linkage of ambiguous
  genes to EntrezGene.  Supports one-vs-all or many-way classification
  with joint and conditional probabilities.
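
A minimal sketch of LM-based classification (toy code, not LingPipe's API), using language identification as the example: train one add-one-smoothed character bigram model per category and classify by the highest log P(text | cat); with the uniform prior assumed here, the log P(cat) term drops out of the argmax. The training strings and bigram order are illustrative assumptions.

import java.util.HashMap;
import java.util.Map;

public class CharLmClassifier {
    static final int ALPHABET = 256;   // assumed alphabet size
    // per-category character n-gram counts
    final Map<String, Map<String, Integer>> counts = new HashMap<>();

    void train(String cat, String text) {
        Map<String, Integer> c = counts.computeIfAbsent(cat, k -> new HashMap<>());
        for (int i = 1; i < text.length(); ++i) {
            c.merge(text.substring(i - 1, i + 1), 1, Integer::sum);  // bigram
            c.merge(text.substring(i - 1, i), 1, Integer::sum);      // context
        }
    }

    // log2 P(text | cat) under the category's add-one-smoothed bigram model
    double log2Prob(String cat, String text) {
        Map<String, Integer> c = counts.get(cat);
        double lp = 0.0;
        for (int i = 1; i < text.length(); ++i) {
            int bigram = c.getOrDefault(text.substring(i - 1, i + 1), 0);
            int context = c.getOrDefault(text.substring(i - 1, i), 0);
            lp += Math.log((bigram + 1.0) / (context + ALPHABET)) / Math.log(2.0);
        }
        return lp;
    }

    // argmax over categories; with a uniform prior, log P(cat) is constant
    String classify(String text) {
        String best = null;
        double bestLp = Double.NEGATIVE_INFINITY;
        for (String cat : counts.keySet()) {
            double lp = log2Prob(cat, text);
            if (lp > bestLp) { bestLp = lp; best = cat; }
        }
        return best;
    }

    public static void main(String[] args) {
        CharLmClassifier clf = new CharLmClassifier();
        clf.train("en", "the quick brown fox jumps over the lazy dog");
        clf.train("es", "el veloz zorro marron salta sobre el perro perezoso");
        System.out.println(clf.classify("the dog is lazy"));   // prints: en
    }
}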

  Tagging: LMs model emission probabilities for Hidden Markov models.
  This supports: (10) part-of-speech tagging, (11) phrase chunking,
  and (12) named-entity extraction.  Supports tag-level or
  analysis-level n-best and confidence ranking.  Chunking supports
  chunk-level confidence.
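
A minimal sketch of LM emissions in an HMM (toy code, not LingPipe's API): a Viterbi tagger whose emission scores log P(word | tag) come from per-tag character bigram models, so even unseen words get sensible scores from their spelling. The three-tag set, training sentences, and add-one smoothing are illustrative assumptions.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CharLmHmmTagger {
    static final int ALPHABET = 256;                         // assumed alphabet size
    static final String[] TAGS = { "DET", "NOUN", "VERB" };  // assumed tiny tag set

    final Map<String, Map<String, Integer>> emit = new HashMap<>();  // per-tag char counts
    final Map<String, Integer> trans = new HashMap<>();              // "prev next" tag counts
    final Map<String, Integer> tagCounts = new HashMap<>();

    // train from word/TAG pairs, e.g. "the/DET dog/NOUN runs/VERB"
    void train(String tagged) {
        String prev = null;
        for (String pair : tagged.split(" ")) {
            int slash = pair.lastIndexOf('/');
            String word = pair.substring(0, slash), tag = pair.substring(slash + 1);
            Map<String, Integer> c = emit.computeIfAbsent(tag, k -> new HashMap<>());
            String padded = " " + word + " ";                // mark word boundaries
            for (int i = 1; i < padded.length(); ++i) {
                c.merge(padded.substring(i - 1, i + 1), 1, Integer::sum);
                c.merge(padded.substring(i - 1, i), 1, Integer::sum);
            }
            tagCounts.merge(tag, 1, Integer::sum);
            if (prev != null) trans.merge(prev + " " + tag, 1, Integer::sum);
            prev = tag;
        }
    }

    // log2 P(word | tag) from the tag's add-one-smoothed character bigram model
    double emitLog2(String tag, String word) {
        Map<String, Integer> c = emit.getOrDefault(tag, Map.of());
        String padded = " " + word + " ";
        double lp = 0.0;
        for (int i = 1; i < padded.length(); ++i) {
            int bigram = c.getOrDefault(padded.substring(i - 1, i + 1), 0);
            int context = c.getOrDefault(padded.substring(i - 1, i), 0);
            lp += Math.log((bigram + 1.0) / (context + ALPHABET)) / Math.log(2.0);
        }
        return lp;
    }

    // log2 P(next | prev) with add-one smoothing over the tag set
    double transLog2(String prev, String next) {
        int n = trans.getOrDefault(prev + " " + next, 0);
        int d = tagCounts.getOrDefault(prev, 0);
        return Math.log((n + 1.0) / (d + TAGS.length)) / Math.log(2.0);
    }

    // standard Viterbi decoding over the fixed tag set
    List<String> tag(String[] words) {
        double[][] score = new double[words.length][TAGS.length];
        int[][] back = new int[words.length][TAGS.length];
        for (int t = 0; t < TAGS.length; ++t)
            score[0][t] = emitLog2(TAGS[t], words[0]);
        for (int i = 1; i < words.length; ++i)
            for (int t = 0; t < TAGS.length; ++t) {
                score[i][t] = Double.NEGATIVE_INFINITY;
                for (int s = 0; s < TAGS.length; ++s) {
                    double sc = score[i - 1][s] + transLog2(TAGS[s], TAGS[t]);
                    if (sc > score[i][t]) { score[i][t] = sc; back[i][t] = s; }
                }
                score[i][t] += emitLog2(TAGS[t], words[i]);
            }
        int best = 0;
        for (int t = 1; t < TAGS.length; ++t)
            if (score[words.length - 1][t] > score[words.length - 1][best]) best = t;
        List<String> tags = new ArrayList<>();
        for (int i = words.length - 1; i >= 0; --i) {
            tags.add(0, TAGS[best]);
            best = back[i][best];
        }
        return tags;
    }

    public static void main(String[] args) {
        CharLmHmmTagger tagger = new CharLmHmmTagger();
        tagger.train("the/DET dog/NOUN runs/VERB");
        tagger.train("a/DET cat/NOUN sleeps/VERB");
        // "cats" and "run" are unseen words tagged from their spelling alone
        System.out.println(tagger.tag(new String[] { "the", "cats", "run" }));
    }
}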

The talk will focus on three representative applications in detail: Chinese
tokenization, gene mention linkage with EntrezGene, and named-entity
extraction.

Alias-i develops search- and data-mining-style exploratory
applications, both of which are enhanced by n-best and
confidence-ranked analyses.  Decoding with n-best output and
confidence also supports hierarchical modeling, rescoring by
downstream processes (such as parsing or coreference), and rescoring
by higher-order and/or more richly featured models that would be too
expensive to decode with directly.
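
The rescoring pattern is simple enough to show in a few lines (toy code, not LingPipe's API): a cheap first-pass decoder emits an n-best list of scored analyses, and an expensive second-pass model re-ranks just those candidates instead of searching the full space. The Candidate record and the stand-in second-pass scorer are illustrative assumptions.

import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;
import java.util.stream.Collectors;

public class NBestRescoring {
    record Candidate(String analysis, double log2Prob) {}

    // re-rank first-pass candidates by an expensive second-pass scorer
    static List<Candidate> rescore(List<Candidate> nBest,
                                   ToDoubleFunction<String> secondPass) {
        return nBest.stream()
            .map(c -> new Candidate(c.analysis(),
                                    secondPass.applyAsDouble(c.analysis())))
            .sorted(Comparator.comparingDouble(Candidate::log2Prob).reversed())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Candidate> nBest = List.of(
            new Candidate("analysis-a", -10.0),
            new Candidate("analysis-b", -11.0));
        // stand-in for a richer model too expensive to decode with directly
        ToDoubleFunction<String> parser = a -> a.endsWith("b") ? -9.0 : -12.0;
        rescore(nBest, parser).forEach(c ->
            System.out.printf("%s %.1f%n", c.analysis(), c.log2Prob()));
    }
}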
