I’m giving a talk on Thursday 23 March 2006 as part of Columbia University’s OTSLAC series http://www1.cs.columbia.edu/nlp/otslac.html. It’ll be in the CS Seminar Room in the Mudd Building on Columbia’s main campus.
Here’s the abstract:
LingPipe's Character Language Models with a Dozen Applications

Bob Carpenter
http://www.colloquial.com/carp
Alias-i, Inc.
http://www.alias-i.com/lingpipe

LingPipe 2 applies character-level language models to a range of common natural language applications:

Compression: LMs provide general text compression. We consider (1) compression of NY Times data at 1.4 bits/character.

Noisy Channel Decoding: LMs model the source and weighted edit distance the channel. This supports (2) corpus-sensitive, language-independent spelling correction, (3) case renormalization of monocased data, and (4) Chinese tokenization. Decoding supports n-best output with joint probabilities.

Classification: LMs model the probabilities of texts given categories. This supports (5) sentence-level subjectivity and review-level polarity classification, (6) spam detection, (7) topic classification, (8) language identification, and (9) word-sense disambiguation of gene names to support rejection of non-genes and the linkage of ambiguous genes to EntrezGene. Classification supports one-vs-all or many-way classification with joint and conditional probabilities.

Tagging: LMs model emission probabilities for hidden Markov models. This supports (10) part-of-speech tagging, (11) phrase chunking, and (12) named-entity extraction. Tagging supports tag-level or analysis-level n-best output and confidence ranking; chunking supports chunk-level confidence.

The talk will focus on three representative applications in detail: Chinese tokenization, gene mention linkage with EntrezGene, and named-entity extraction. Alias-i develops search- and data-mining-style exploratory applications, both of which are enhanced by n-best and confidence-ranked analyses. Decoding with n-best output and confidence also supports hierarchical modeling, rescoring by downstream processes (such as parsing or coreference), and rescoring by higher-order and/or more richly featured models that would be too expensive to decode directly.
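Since the abstract is terse, here's a restatement (my notation, not the talk's) of the standard decompositions it relies on: compression as cross-entropy, noisy channel decoding, classification by Bayes' rule, and HMM tagging with LM emissions.

```latex
% Compression: the bits/character rate is the cross-entropy of the
% character LM on held-out text c_1 .. c_n (the 1.4 bits/char figure):
\text{bits/char} = -\frac{1}{n} \sum_{i=1}^{n} \log_2 P(c_i \mid c_1, \ldots, c_{i-1})

% Noisy channel decoding: the character LM models the source P(s) and
% weighted edit distance models the channel P(t | s); decoding recovers
% the most probable source text s for an observed text t:
\hat{s} = \arg\max_s \; P(s \mid t) = \arg\max_s \; P(s) \, P(t \mid s)

% Classification: one character LM per category c estimates P(text | c);
% the joint P(c) P(text | c) ranks categories, and normalizing it yields
% the conditional category probabilities mentioned above:
\hat{c} = \arg\max_c \; P(c) \, P(\text{text} \mid c)

% Tagging: character LMs supply the emission terms P(w_i | t_i) of a
% hidden Markov model over a tag sequence t_1 .. t_n:
P(t_1, \ldots, t_n, w_1, \ldots, w_n) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)
```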
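To make the classification case concrete, here's a minimal, self-contained Java sketch of character-LM classification. This is not LingPipe's API: the class name, the trigram order, and the add-alpha smoothing are illustrative assumptions (LingPipe's own models use higher orders and more sophisticated smoothing). It trains one character n-gram model per category and picks the category that gives the text the highest probability, as in language identification (application 8).

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch (not LingPipe's API) of character-LM classification:
 *  one character trigram model per category, classify by max log probability. */
public class CharLmClassifier {

    private static final int N = 3;          // n-gram order (assumption: trigrams)
    private static final double ALPHA = 0.5; // add-alpha smoothing constant (assumption)
    private static final int VOCAB = 256;    // crude character alphabet size for smoothing

    // category -> (context -> (next char -> count))
    private final Map<String, Map<String, Map<Character, Integer>>> counts = new HashMap<>();

    /** Update one category's n-gram counts with one training text. */
    public void train(String category, String text) {
        Map<String, Map<Character, Integer>> model =
            counts.computeIfAbsent(category, k -> new HashMap<>());
        String padded = pad(text);
        for (int i = N - 1; i < padded.length(); i++) {
            String context = padded.substring(i - (N - 1), i);
            char next = padded.charAt(i);
            model.computeIfAbsent(context, k -> new HashMap<>())
                 .merge(next, 1, Integer::sum);
        }
    }

    /** Log probability of a text under one category's character model. */
    private double logProb(String category, String text) {
        Map<String, Map<Character, Integer>> model = counts.get(category);
        String padded = pad(text);
        double logProb = 0.0;
        for (int i = N - 1; i < padded.length(); i++) {
            String context = padded.substring(i - (N - 1), i);
            char next = padded.charAt(i);
            Map<Character, Integer> dist = model.getOrDefault(context, Map.of());
            int count = dist.getOrDefault(next, 0);
            int total = dist.values().stream().mapToInt(Integer::intValue).sum();
            // add-alpha smoothed conditional P(next | context)
            logProb += Math.log((count + ALPHA) / (total + ALPHA * VOCAB));
        }
        return logProb;
    }

    /** Many-way classification: the highest joint-probability category wins.
     *  A log category prior could be added to the score for non-uniform priors. */
    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : counts.keySet()) {
            double score = logProb(category, text);
            if (score > bestScore) { bestScore = score; best = category; }
        }
        return best;
    }

    /** Pad with boundary spaces so every position has a full (N-1)-char context. */
    private static String pad(String text) {
        return " ".repeat(N - 1) + text;
    }

    public static void main(String[] args) {
        CharLmClassifier clf = new CharLmClassifier();
        clf.train("english", "the quick brown fox jumps over the lazy dog");
        clf.train("french",  "le renard brun saute par dessus le chien paresseux");
        System.out.println(clf.classify("the dog sleeps"));  // should print: english
        System.out.println(clf.classify("le chien dort"));   // should print: french
    }
}
```

The same scoring machinery extends to the other classification applications in the abstract: normalizing the per-category scores gives conditional probabilities, and thresholding a single category's score against a background model gives one-vs-all rejection, as in filtering non-gene mentions.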