How Can I Build a Classifier with no Negative Data?

July 17, 2008 by lingpipe

As part of our NIH grant, we’re working on the database linkage problem from gene/protein mentions in MEDLINE to database entries in EntrezGene. Basically, it’s what the biologists call "gene normalization", and was the basis of Biocreative Task 2.

I can summarize the problem we’re having with a simple example. We’d like to classify all 17M or so entries in MEDLINE as to whether they’re about genomics or not. EntrezGene provides links to 200K citations that are about particular genes, so we have a pile of articles about genomics (making up about 300 million characters). What we don’t have is any negative training data.

So my question is: how do I build a classifier for articles about genomics versus those that are not about genomics?

The job running in the background giving me time to write this post is generating cross-validation on cross-entropy rates for all of these 200K citations. That is, I train a character-level language model on 180K citations and use it to evaluate the other 20K, for all possible choices. This gives me a received set of expected scores for positive examples (assuming there’s no bias in that 200K set, which there is in terms of recency and particular gene focus, not to mention focus on human genomics). I’m going to plot this and see what the curve looks like. Empirically, we can then set a threshold that would accept 99% of the articles we have. Unfortunately, I have no idea how well this’ll work in practice at rejecting the articles that aren’t about genomics.

For the genomics/non-genomics problem, we can just annotate a few thousand examples; it’ll only take a day or two.

The real problem is that we want to build models to classify contexts for the 30K or so human gene entries in Entrez. Some of them have a handful of example docs, some have hundreds. We’re going to pull out articles with potential mentions, then filter with the classifier. It’s related to what we did in Phase I of our grant and reported in our gene linkage white paper. In that setting, we can generate candidate docs using approximate matching of aliases, then use the scores to rank the possible docs according to their language model scores against the known articles for the gene in question. This is great in a search context, but doesn’t give us a go/no-go decision point, which we need for some of our downstream applications.

If anyone knows how to tackle this problem, I’d love to hear about it. I might even implement it as part of LingPipe if the idea’s simple and general enough.

Hyphenation as a Noisy Channel

July 11, 2008 by lingpipe

Noisy channel models with very simple deterministic channels can be surprisingly effective at simple linguistic tasks like word splitting. We’ve used them for Chinese word segmentation (aka word boundary detection, aka tokenization), not to mention spelling correction.

In this blog entry, I’ll provide a first look at our forthcoming tutorial on hyphenation. The hyphenation problem is that of determining if a hyphen can be inserted between two positions in a word. Hyphenation is an orthographic process, which means it operates over spellings. In contrast, syllabification is a phonological (or phonetic) process, which means it operates over sounds. Hyphenation roughly follows syllabification, but is also prone to follow morphological split points.

The hyphenation problem isn’t even well-defined on a per-word level. There are pairs such as num-ber (something you count with) and numb-er (when you get more numb) that have the same spelling, but different pronunciations and corresponding hyphenations depending on how they are used. I’ll just ignore this problem here; in our evaluation, ambiguities always produce at least one error.

The Noisy Channel Model

The noisy channel model consists of a source that generates messages and a noisy channel along which they are passed. The receiver’s job is to recover (i.e. decode) the underlying message. For hyphenation, the source is a model of what English hyphenation looks like. The channel model deterministically removes spaces. The receiver thus needs to figure out where the hyphens should be reinserted to recover the original message at the source.

The training corpus for the source model consists of words with hyphenations represented by hyphens. The model is just a character language model (I used 8-grams, but it’s not very length sensitive). This gives us estimates of probabilities like p("che-ru-bic") and p("cher-u-bic"). Our channel model just deterministically removes spaces, so that p("cherubic"|"che-ru-bic") = 1.0.

To use the noisy channel model to find a hyphenation given an unhyphenated word, we just find the hyphenation h that is most likely given the word w, using Bayes’s rule in a maximization etting: ARGMAXh p(h|w) = ARGMAXh p(w|h)*p(h). Because the channel probabilities p(w|h) are always 1.0 if the characters in w match those in h, this reduces to finding the hyphenation h yielding character sequence w which maximizes p(h). For example, the model will estimate "che-ru-bic" to be a more likely hyphenation than "cher-u-bic" if p("che-ru-bic") > p("cher-u-bic").

Decoding is fairly efficient for this task, despite using a high-order n-gram language model, because the channel bounds the combinatorics by only allowing a single hyphen to be inserted at each point; dynamic programming into language model states can reduce them even further.

English Evaluation

So how do we test how well the model works? We just became members of the Linguistic Data Consortium, who distribute Baayen, Piepenbrock and Gulikers’ CELEX 2 corpus. CELEX is a great corpus. It contains pronunciations, syllabifications, and hyphenations for a modest sized dictionary of Dutch, English and German. For instance, there are 66,332 distinct English words in the corpus (there are many more entries, but these constitute duplicates and compounds whose constituents have their normal hyphenations). These 66,332 word have 66,418 unique hyphenations, meaning 86 of the words lead to ambiguities. This is about 1/10th of a percent, so I won’t worry about it here.

I evaluated with randomly partitioned 10-fold cross-validation. With the noisy channel model above, LingPipe had a 95.4% whole word accuracy, with a standard deviation of 0.2%. That means we got 95.4% of the words completely correctly hyphenated. We can also look at the 111,521 hyphens in the corpus, over which we had a 97.3% precision and a 97.4% recall. That is, we missed 2.6% true hyphenation points (false negatives), and 2.7% of the hyphenations we returned were spurious (false positives). Finally, we can look at per-decision accuracy, for which there were 482,045 positions between characters, over which we were 98.8% accurate.

Forward and Backward Models

But that’s not all. Like HMMs, there’s a degree of label bias in a left-to-right language model. So I reversed all the data and built a right-to-left (or back-to-front) model. Using n-best extraction, I ran this two ways. First, I just added the score to the forward model to get an interpolated score. Somewhat surprisingly, it behaved almost identically to the forward-only model, with slightly lower per-hyphen and per-decision scores. More interestingly, I then ran them in intersect mode, which means only returning a hyphen if it was in the first-best analysis of both the left-to-right and right-to-left models. This lowered per-word accuracy to 94.6% (from 95.4%), but raised precision to 98.0% (from 97.3%) while only lowering recall to 96.7% (from 97.4%). Overall, hyphenation is considered to be a precision business in application, as it’s usually used to split words across lines in documents, and many split points might work.

Is it State of the Art?

The best results I’ve seen for this task were reported in Bartlett, Kondrak and Cherry’s 2008 ACL paper Automatic Syllabification with Structured SVMs for Letter-To-Phoneme Conversion, which also received an outstanding paper award. They treated this problem as a tagging problem and applied structured support vector machines (SVMs). On a fixed 5K testing set, they report a 95.65% word accuracy, which is slightly higher than our 95.4%. The 95% binomial confidence intervals for 5000 test cases are +/- 0.58% and our measured standard deviation was .2%, with results ranging from 95.1 to 95.7% on different folds. Their paper also tackled pronuncation, for which their hyphenation was only one feature.

German and Dutch are easier to hyphenate than English. Syllabification in all of these languages is also easier than hyphenation.

But is it better than 1990?

Before someone in the audience gets mad at me, I want to point out that we could follow Coker, Church and Liberman (1990)’s lead in reporting results:

The Bell Laboratories Text-to-Speech system, TTS, takes a radical dictionary-based approach; dictionary methods (with morphological and analogical extensions) are used for the vast majority of words. Only a fraction of a percent (0.5% of words overall; 0.1% of lowercase words) are left for letter-to-sound rules.

Although this insight won’t get us into NIPS, it’s how we’d field an application.

Lucene’s Missing Token Stream Factory

July 6, 2008 by lingpipe

While we’re on the subject of tokenization, every time I (Bob) use Lucene, I’m struck by its lack of a tokenizer factory abstraction.

Lucene’s abstract class TokenStream defines an iterator-style method next() that returns the next token or null if there are no more tokens.

Lucene uses the abstract class Token for tokens. A Lucene token contains a string representing its text, a start and end position, and a lexical type which I’ve never seen used. Because LingPipe has to handle many tokens quickly without taxing garbage collection, it doesn’t create objects for them beyond their string texts. But that’s the subject of another blog entry.

A Lucene Document is essentially a mapping from field names, represented as strings, to values, also represented as strings. Each field in a document may be treated differently with respect to tokenization. For instance, some might be dates, others ordinary text, and others keyword identifiers).

To handle this fielded document structure, the Lucene class analysis.Analyzer maps field names, represented as strings, and values, represented as instances of java.io.Reader, to token streams. The choice of Reader for values is itself puzzling because it introduces I/O exceptions and the question of who’s responsible for closing the reader.

Lucene overloads the analyzer class itself to provide the functionality of LingPipe’s tokenizer factory. Lucene classes such as SimpleAnalyzer and CJKAnalyzer return the same token stream no matter which field is specified. In other words, the field name is ignored.

What would be useful would be a Lucene interface analysis.TokenStreamFactory with a simple method TokenStream tokenizer(CharSequence input) (note how we’ve replaced the analyzer’s reader input with a generic string). Then analyzers could be built by associating token stream factories with fields. This would be the natural place to implement Lucene’s simple analyzer, CJK analyzer, and so on. The current analyzer behavior is then easily derived with an analyzer which sets a default token stream factory for fields.

The Curse of “Intelligent” Tokenization

June 26, 2008 by lingpipe

We’re now running into a problem we’ve run into before: so-called “intelligent” tokenization. The earliest version of this of which I’m aware is the Penn Treebank tokenization, which assumes sentence splitting has already been done. That way, the end-of-sentence punctuation can be treated differently than other punctuation. Specifically, “Mr. Smith ran. Then he jumped.” gets split into two sentences, “Mr. Smith ran.” and “Then he jumped.”. Now the fun starts. The periods at the end of a sentence are split off into their own token. The period after “Mr” remains, so the tokens are “Mr.”, “Smith”, “ran” and “.”. Note that the Treebank tokenizer also replaces double quotes with either left or right LaTex-style quotes, so there’s no way to reconstruct the underlying text from the tokens. Like many other projects, they also throw away the whitespace information in the data, so there’s no way to train something to do tokenization that’s whitespace sensitive because we just don’t have the whitespace. That’s what you get for releasing data like “John/PN ran/V ./PUNCT” — you just don’t know if there was space between that final verb “ran” and the full stop “.”. You’ll also see their script builds in all sorts of knowledge about English, such as a handful of contractions, so that “cannot” is split out into two tokens, “can” and “not”.

The most recent form of “intelligent” tokenization I’ve seen is perpetrated by UPenn, this time as part of their BioIE project. There, the data’s not even consistently tokenized, because they left it to annotators to decide on token boundaries. They then use statistical chunkers to uncover the tokenizations probabilistically.

Google’s n-gram data is also distributed without a tokenizer. It looks very simple, but there are lots of heuristic boundary conditions that make it very hard to run on new text. Practically speaking, the data’s out of bounds anyway because of its research-only license. Unlike the Penn Treebank or French Treebank, there’s no option to buy commercial licenses.

I’ve just been struggling with the French Treebank, which follows the Penn Treebank’s lead in using “intelligent” tokenization. The problem for us is that we don’t know French, so we can’t quite read the technical docs (in French), nor induce from the corpus how tokenization was done.

This is all terribly problematic for the “traditional” parsing model of first running tokenization, then running analysis. The problem is that the tokenization depends on the analysis and vice-versa. At least with the Penn approach, there’s code to do their ad hoc sentence splitting and then their ad hoc heuristic tokenization. It may not be coherent from an English point of view (a handful of contractions will be split; others won’t), but at least it’s reproducible.

Our own approach (in practice — in theory we can plug and play any tokenizer that can be coded) has been to take very fine-grained tokenizations so that the tokenization would be compatible with any old kind of tagger. Our HMM chunker pays attention to tokenization, but the rescoring chunker uses whitespace and longer-distance token information.

At the BioNLP workshop at ACL 2008, Peter Corbett presented a paper (with Anne Copestake) on Cascaded Classifiers for Confidence-Based Chemical Named Entity Recognition. It’s a really neat paper that addresses issues of confidence estimation, and particularly trading precision for recall (or vice-versa). But they weren’t able to reproduce our 99.99% recall gene/protein name extraction. During the question period, we got to the bottom of what was going on, which turned out to be intelligent tokenization making mistakes so that entities weren’t extractable because they were only parts of tokens. I’m hoping Peter does the analysis to see how many entities are permanently lost due to tokenization errors.

So why do people do “intelligent” tokenization? The hope is that by making the token decisions more intelligently, downstream processing like part-of-speech tagging is easier. For instance, it’s difficult to even make sense of assigning part-of-speech tags to three tokens derived from “p-53″, namely “p”, “-” and “53″. Especially if you throw away whitespace information.

Tokenization is particularly vexing in the bio-medical text domain, where there are tons of words (or at least phrasal lexical entries) that contain parentheses, hyphens, and so on. This turned out to be a problem for WordNet).

In some sense, tokenization is even more vexing in Chinese, which isn’t written with spaces. To get around that problem, our named-entity detector just treats each character as a token; that worked pretty well for our entry in the SIGHAN 3 bakeoff. There were even two papers on jointly modeling segmentation and tagging for Chinese at this year’s ACL (Jiang et al. and Zhang et al.). Joint modeling of this kind seems like a promising approach to allowing “intelligent” tokenization; by extending the tokenization model far enough, we could even maintain high end-to-end recall, which is not possible with a state-of-the-art first-best probabilistic tokenizer.

Shame and Joy of Open Source: Perf Bug in FastCache

June 18, 2008 by lingpipe

Hats off to Brian Wilson of Endeca for profiling and then debugging a huge problem in my (Bob’s) implementation of com.aliasi.util.FastCache. The joy is that Brian could find and diagnose the bug; the shame is mine from having made this colossal blunder in the first place and then not spotting it with size-sensitive tests.

The immediate upshot is that I’m going to roll out a 3.5.1 version with the bug patched this weekend.

The fast cache is designed to be a thread-safe frequency-of-use based cache with pruning triggered by a new entry exceeding the load factor. The pruning is supposed to scale counts by dividing by 2 and then discard records with zero counts.

You might be able to spot the problem yourself by looking at the implementation, which is online at: 3.5.0 version of FastCache.java.

I’ll give you a hint. It’s in the prune() method.

Specifically, in this line:

while (record != null
         && (record.mCount >>> 2) == 0)

Here’s what it should be:

while (record != null
         && (record.mCount = (record.mCount >>> 1)) == 0)

Here’s the 3.5.1 version of FastCache.java.

The problem was that I didn’t decrement the counts. So why didn’t we notice this before? Because the implementation in place now is not only sound (that is, it produces the right answers), but it’s also relatively well-behaved in the face of natural power-law distributed natural language data. Every time the size threshold gets exceeded, all records with counts of 3 or less get discarded. Since language is dominated by 1 and 2 counts in the tail, this isn’t such a problem in processes that don’t run forever on a really diverse set of data with lots of local duplicate terms.

The other problem is that I only meant to divide by 2, not by 4, which a right-shift (zero filled) by 2 achieves. I’ll blame the psycholinguistic lexical priming effect for the 2 bug (I was thinking divide by 2 so inserted a 2).

This is a very serious bug because not only does it allow the cache to grow without bound, every time a new entry is added after it exceeds capacity with 4 counts and above, the entire cache gets visited. In the original implementation, pruning recreated a whole bunch of new map entry records, and also a whole bunch of soft reference wrappers. I also tweaked the code so as not to recreate records, but just modify theones that exist. This is the clue that Brian found in profiling the code. And even then I didn’t spot the bug.

Ironically, I’m in Columbus for the 2008 Assoc for Comp Ling Conference, where I’m co-organizing (with Kevin Bretonnel Cohen, who did most of the work) a workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing.

Book: Building Search Applications: Lucene, LingPipe and Gate

June 12, 2008 by lingpipe

There’s a book about LingPipe!

The title is linked to the Amazon page; it’s also available as an inexpensive download from Lulu.

The Bottom Line

The subtitle “A practical guide to building search applications using open source software” pretty much sums it up (comment added June 22, 2008: please see Seth Grimes’s comment below about LingPipe’s royalty-free license not being compatible with other open-source licenses). It takes a reader that knows Java, but nothing at all about search or associated text processing algorithms, and provides a hands-on, step-by-step guide for building a state-of-the-art search engine.

I (Bob) gave Manu feedback on a draft, but there wasn’t much to correct on the LingPipe side, so I can vouch for the book’s technical accuracy. (Disclaimer: I didn’t actually try to run the code.)

Chapter by Chapter Overview

After (1) a brief discussion of application issues, the chapters include (2) tokenization in all three frameworks, (3) indexing with Lucene, (4) searching with Lucene, (5) sentence extraction, part-of-speech tagging, interesting/significant phrase extraction, and entity extraction with LingPipe and Gate (6) clustering with LingPipe, (7) topic and language classification with LingPipe, (8 ) enterprise and web search, page rank/authority calculation, and crawling with Nutch, (9) tracking news, sentiment analysis with LingPipe, detecting offensive content and plagiarism, and finally, (10) future directions including vertical search, tag-based search and question-answering.

For those wanting introductions to the LingPipe APIs mentioned above, Konchady’s book is a gentler starting point than our own tutorials.

That may sound like a whole lot of ground to cover in 400 pages, but Konchady pulls the reader along by illustrating everything with working code and not getting bogged down in theoretical boundary conditions. There are pointers to theory, and a bit of math where necessary, but the book never loses sight of its goal of providing a practical introduction. In that way, it’s like the Manning in Action series.

The book’s hot of the presses, so it’s up to date with Lucene 2.3 and LingPipe 3.3.

About the Author

Manu Konchady’s an old hand at search and text processing. You may remember him from such books as Text Mining Application Programming and High Performance Parallel Algorithms for Scientific Computing with Application to a Coupled Ocean Model.

Per-Tag Error Function for CRFs and MEMMs for Removing Label Bias

June 9, 2008 by lingpipe

As you may have noticed from the posts, I (Bob) have been swotting up on discriminative methods, such as logistic regression and conditional random fields (CRFs).

A couple posts ago, I mentioned Martin Jansche’s training of a logistic regression-like classifier based on (an approximation of) F-measure error. The goal was to build a classifier with a better F-measure than one trained on traditional log loss.

Along these same lines, I came across a paper minimizing yet another error function, this time for sequence taggers:

Kakade et al. look at discriminative sequence tagging models including max-entropy Markov models (MEMM) and conditional random fields (CRF). MEMMs and CRFs are both trained like logistic regression using log loss. The key point is that this is defined on a per-input basis, not a per-tag basis. Kakade et al. use a per-tag error function instead, with some surprising results.

Here’s the general setup. The training data consists of n pairs of tag/token sequences tags[i], tokens[i] for i < n. The models (either CRFs or MEMMs) supply probability estimates p(tags[i] | tokens[i], β ), and the error for parameters β is the sum of the log loss values per input in the data:

Err(β ) = Σi -log p(tags[i] | tokens[i], β )

The chain rule reduces this for bigram context models such as MEMMs (but not CRFs) to:

Err(β ) =  Σi Σj - log p(tags[i][j] | tags[i][j-1], tokens[i], β )

and as they point out, this decomposition is the source of the so-called label-bias problem in MEMMs as originally discussed in Léon Bottou’s Thesis.

What Kakade et al. do is replace this whole-sequence error with the following error function, which is decomposed per tag:

Err2(β ) = Σi Σj - log p(tags[i][j] | tokens[i], β )

As with Jansche’s paper, the goal is to devise an error function that matches the evaluation function more closely (e.g. per-tag part-of-speech tagging error). And also like Jansche’s paper, the work is all in the gradient calculations of the new error function. That, and the actual training algorithm, which as you can guess, makes use of the forward-backward algorithm to compute p(tag[i][j] | tokens[i]) (note that this use of forward-backward is implemented in LingPipe’s HMM tag-word lattice class).

What’s really neat about this technique is how well it works. They use a synthetic example, and one example with very high accuracy (McCallum’s FAQ dataset). It would sure be nice to see if this worked on a data set we care more about, like part-of-speech tagging.

The big problem is that per-tag optimized output is not very suitable for tagging in support of named entity extraction or other chunking tasks, because the first-best output isn’t guaranteed to be consistent.

An Online Logistic Regression API with Regularization

June 3, 2008 by lingpipe

Large-scale and online classification problems require a classifier that allows online training. By large-scale, I mean problems so large they won’t fit in memory. For this blog entry, size doesn’t really matter. I’m worried about honest-to-goodness online requirements, which force a learner to get a training example, classify it, get feedback, do some learning, then forget it, all in a constant amount of time per example.

Client-side spam detection and many server-side solutions would require a truly online algorithm in this sense. So it’s perhaps not surprising that Goodman and Yih of Microsoft Research used an online approach in their paper Online Discriminative Spam Filter Training. They use standard binomial logistic regression (max likelihood, not regularized) with stochastic gradient descent. They don’t give many details about their implementation other than say it was a naive Perl implementation that was fast enough for doing evaluations.

Your typical “online” learner requires training examples to be processed one-at-a-time. This solves the large-scale problem, in that you only need to bring one example into memory at a time, assuming the parameters fit in memory. Even so, online learners like Bottou’s SGD and our own logistic regression still load all the data into memory, and then make multiple passes over it.

Langford, Li and Strehl’s Vowpal Wabbit doesn’t load all the data into memory. VW even bounds the size of the features so a constant stream of new features won’t blow out memory (see my previous blog entry on using hash codes as features). Unfortunately, VW runs from the command-line and requires the training examples to be in a data file before the program starts. This could be modified, because the good folks at Yahoo! released it with source under a BSD license.

Handling logistic regression in an online setting without regularization (priors) is straightforward. You just use stochastic gradient descent and take the loop over the data out of the algorithm and make it a function that updates the parameters. All you need is a set of parameter vectors for coefficients that’s resizable. This could either be done by making them map-like structures (presumably what Goodman and Yih did in Perl), or resizing them like Java’s util.ArrayList. The latter is preferable from a size and speed point of view. The algorithm won’t converge in a single pass the way the multiple pass algorithms do, but it’ll be fast and quite straightforward.

This leaves two problems. First, how do we handle the learning rate? We can just leave it as a constant. Annealing doesn’t seem to make much sense, as the later examples will count less and less. This may be just the opposite of what you want in an online learning setting, where more recent information may be more representative of what’s to follow. In an API, we can just have a method to set the learning rate at any given point.

The second problem is what to do about regularization. In the standard Bayesian maximum a posteriori estimation setting, the prior is fixed ahead of time and contributes a single error term for all of the data. In a setting where you know the number of training instances, it’s easy to handle. You just regularize at a rate of 1/numberOfTrainingExamples per example. Regularization can be done lazily to preserve sparseness, as I point out in my tech report on lazy regularized SGD.

What I don’t know is how to do regularization online. What I’ve done in a preliminary implementation of truly online logistic regression is set a per-instance regularization discount. For instance, in a corpus with 100 training instances, that’d be 1/100.

But what should the regularization discount factor be for an unbounded number of instances when you have to apply it online? If we leave this discount at a constant, it has the effect of reducing the variance of the prior as the number of training examples increases. Normally, the data would outweigh any fixed prior. But with a constant regularization discount factor, the prior keeps up at its regular pace for all the data. This might even be desirable, as it will keep tamping down the coefficients. For instance, it’ll guarantee that a feature seen only once will eventually go to zero under a Laplace prior (and approach zero with a Gaussian or Cauchy prior).

It’d also be nice to be able to serialize the learner at any point, as well as compile it to a fixed representation, which catches up all of the lazy online regularization.

Collocations, Chi-Squared Independence, and N-gram Count Boundary Conditions

May 28, 2008 by lingpipe

Pearson’s chi-squared independence test may be used to compute collocations in a text corpus; that is, sequences of words that occur together more than they might by chance. This typically finds names, idioms, set-phrases and the like in text. There’s a whole chapter on collocations in Manning and Schütze’s Stat NLP book, and there’s a Lingpipe Interesting Phrases Tutorial.

In the case of bigrams, we have two words A and B, and want to know if they occur as a bigram (A,B) more often than chance. The simple binary chi-squared independence test is based on a confusion matrix which computes bigram counts for (+A,+B), (+A,-B), c(-A,-B), and (-A,-B), where (+A,+B) is the count of the bigram (A,B), (+A,-B) is the count of all bigrams (A,D) where D != B, and (-A,-B) is the count of all bigrams (C,D) where C != A and D != B. The formula for the test statistic itself is defined in the javadoc for method Statistics.chiSquaredIndependence(double,double,double,double).

The LingPipe class lm.TokenizedLM provides a convenient wrapper with which to count sequences of token n-grams. The tokenized LM class has a train(CharSequence) method which allows texts to be given. This method stores the n-gram counts after tokenizing. So let’s see what happens if we give it two sentences as inputs:

  • John ran
  • John ran home

This produces the following counts:

n-gram Count
John 2
ran 2
home 1
(John,ran) 2
(ran,home) 1
(John,ran,home) 1

This causes a subtle problem for chi-squared statistics: the count for a final word may be higher than its count as the first element of a bigram. The same problem happens for initial words and their counts as the second element of a bigram. So what are our exact bigram counts for (ran,home)? They’re (+ran,+home)=1, (+ran,-home)=0, (-ran,+home)=0, and (-ran,-home)=2. (Leave aside for now the fact that this isn’t high enough counts for the significance test to be accurate; it’s just an example of a point.)

The trie data structure used to represent counts (lm.TrieIntSeqCounter) can easily compute the number of words following a given word with a local lookup of a word’s daughters in the trie. It can’t easily compute the number of words preceding a given word; that would require iterating over all characters. So would computing all bigrams.

So the chi-squared implementation takes a shortcut. It approximates count(+ran,-home)=0 by count(+ran)-count(+ran,+home) = 2-1. We’re in effect overcounting the negative cases. Thus our chi-squared statistic is wrong.

But what if we implicitly assume there are boundary tokens, so we model the following sequences:

  • BOS John ran EOS
  • BOS John ran home EOS

giving us the following additional unigram and bigram counts:

n-gram Count
BOS 2
EOS 2
(BOS,John) 2
(ran,EOS) 1
(home,EOS) 1

Our approximate counts now become exact:

  • count(+ran,+home)=1
  • count(+ran,-home) = count(+ran)-count(+ran,+home) = 1
  • count(-ran,+home) = count(home)-count(+ran,+home) = 0
  • count(-ran,-home) = totalCount-count(+ran,+home)-count(-ran,+home)-count(+ran,-home) = 9-0-1-1 = 7

Et voilà. An ex post facto rationalization. What LingPipe is counting in chi-squared collocation tests includes the implicit boundaries.

F-measure Loss for “Logistic Regression”

May 23, 2008 by lingpipe

Given our recent inclusion of regularized logistic regression into LingPipe and our ongoing focus on high-recall classifiers, taggers and chunkers, I’ve been studying this paper:

First a recap. The only difference between logistic regression, perceptron and SVM models is the error function being optimized. They’re all simple linear models. Logistic regression uses negative log likelihood as error, whereas perceptrons use 0/1 loss and SVMs hinge loss. What Martin’s doing in the paper cited above is presenting yet another loss function - F-measure based loss. The reason he still calls it logistic regression is that he’s using the same generalized linear model link function — the logit (aka inverse logistic or inverse sigmoid) to convert linear basis predictors into probability estimates.

What’s different is that he’s using F-measure as error. So why is this so hard? Standard logistic regression models present a concave error minimization problem (equivalently, a convex maximum a posteriori (MAP) probability maximization problem). F-measure, as illustrated in the wonderfully elaborate 3-d diagrams in the paper, does not present a concave error function. This makes the optimization problem intractable in theory.

Aside from framing the problem, the insight reported in this paper was using expectations as defined by the logistic regression model’s probability estimates to approximate the delta functions needed for truly defining F-measure loss. After that, there’s a lot of heavy lifting on the implementation and evaluation side.

The really nice part about Martin’s formulation is that it works not only for the standard balanced F(0.5) measure (equally weighted harmonic mean of precision and recall), but also for arbitrary F(α ) measures, where α determines the weighting of precision and recall. Martin evaluated α=0.5 (balanced), and α=0.75 (recall weighted 0.75, precision weighted 0.25), and as expected, the α=0.75 setting indeed produced higher recall (though counter-intuitively not a better F(0.75) measure).

He also evaluated the result of training a regular old maximum likelihood logistic regression model and showed that it performed terribly (the task, by the way, was sentence extraction for summarization — I think it’d be easier for the rest of us to evaluate these techniques on something we can reproduce like the Reuters corpus or 20 newsgroups). He then showed that a posterior fitting of the binary classification threshold away from 0.0 improved performance immensely.

I’m left with two questions. (1) If you want high recall, can you just train a regular logistic regression method and then set the acceptance probability lower than 0.5? and (2) how does a reasonably regularized logistic regression fare? I saw the same kind of bad performance for maximum likelihood and ridge regression (Gaussian priors) in Genkin, Lewis and Madigan’s paper about large-scale logistic regression with priors (also well worth reading).

I have a long history with Martin’s F-measure error paper. Martin couldn’t get a visa to travel from the U.S. to Canada for the conferences. So he had me tack up the poster for this paper for him. Too bad I didn’t understand the technical details at the time. (During the same trip, I presented his treebank transfer paper, which as far as I know, was the first truly Bayesian NLP paper and is well worth reading.)