Given our recent inclusion of regularized logistic regression into LingPipe and our ongoing focus on high-recall classifiers, taggers and chunkers, I’ve been studying this paper:

- Martin Jansche. Maximum expected F-measure training of logistic regression models. In *Human Language Technology Conference / Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP)*. October 2005.

First, a recap. The only difference among logistic regression, perceptron, and SVM models is the error function being optimized; they're all simple linear models. Logistic regression uses negative log likelihood as its error, whereas perceptrons use 0/1 loss and SVMs use hinge loss. What Martin's doing in the paper cited above is presenting yet another loss function: F-measure-based loss. The reason he still calls it logistic regression is that he's using the same generalized linear model link function, the logit, whose inverse, the logistic (sigmoid) function, converts the linear predictor into a probability estimate.
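To make the "same linear model, different loss" point concrete, here is a minimal sketch (the function names are mine, not from the paper) computing all three losses on the same raw linear score:

```python
import math

def linear_score(w, b, x):
    """Raw linear predictor w . x + b shared by all three model types."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def log_loss(score, y):
    """Logistic regression: negative log likelihood, y in {-1, +1}."""
    return math.log(1.0 + math.exp(-y * score))

def zero_one_loss(score, y):
    """Perceptron-style 0/1 loss: did we get the sign right?"""
    return 0.0 if y * score > 0 else 1.0

def hinge_loss(score, y):
    """SVM hinge loss: penalize margins smaller than 1."""
    return max(0.0, 1.0 - y * score)

w, b = [0.8, -0.4], 0.1
x, y = [1.0, 2.0], 1              # a positive example
s = linear_score(w, b, x)         # 0.8 - 0.8 + 0.1 = 0.1
print(log_loss(s, y), zero_one_loss(s, y), hinge_loss(s, y))
```

The example is correctly classified (score barely positive), so 0/1 loss is zero, but both log loss and hinge loss still penalize the small margin, which is exactly why they yield different trained weights.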

What’s different is that he’s using F-measure as error. So why is this so hard? Standard logistic regression presents a convex error minimization problem (equivalently, a concave log likelihood, or maximum a posteriori (MAP) probability, maximization problem). F-measure, as illustrated in the wonderfully elaborate 3-d diagrams in the paper, is not a convex error function. This means gradient-based optimization carries no guarantee of finding a global optimum.

Aside from framing the problem, the key insight reported in this paper is to use expectations, as defined by the logistic regression model’s probability estimates, to approximate the indicator (delta) functions needed to define F-measure loss exactly. After that, there’s a lot of heavy lifting on the implementation and evaluation side.
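The core trick can be sketched in a few lines: replace each hard 0/1 prediction indicator with the model's probability, so the soft true-positive, false-positive, and false-negative counts (and hence F-measure) become smooth functions of the model parameters. This is a minimal illustration in my own notation, not code from the paper:

```python
def soft_f_measure(probs, labels):
    """Approximate (expected) balanced F-measure: the hard indicator
    [prediction = 1] is replaced by the model probability p_i, giving
    soft counts that are differentiable in the model parameters."""
    tp = sum(p for p, y in zip(probs, labels) if y == 1)        # E[TP]
    fp = sum(p for p, y in zip(probs, labels) if y == 0)        # E[FP]
    fn = sum(1.0 - p for p, y in zip(probs, labels) if y == 1)  # E[FN]
    return 2.0 * tp / (2.0 * tp + fp + fn)

probs  = [0.9, 0.8, 0.3, 0.2]   # model probabilities of the positive class
labels = [1,   1,   0,   1]     # gold labels
print(round(soft_f_measure(probs, labels), 3))  # → 0.731
```

Because every term is a smooth function of the probabilities, this objective has a usable (non-constant) gradient, unlike true F-measure, which is piecewise constant in the parameters.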

The really nice part about Martin’s formulation is that it works not only for the standard balanced F(0.5) measure (the equally weighted harmonic mean of precision and recall), but also for arbitrary F(α) measures, where α determines the relative weighting of precision and recall. Martin evaluated α=0.5 (balanced) and α=0.75 (recall weighted 0.75, precision weighted 0.25), and as expected, the α=0.75 setting indeed produced higher recall (though, counter-intuitively, not a better F(0.75) score).
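Following the post's convention that α is the weight on recall (with α=0.5 recovering the balanced harmonic mean), the weighted measure is just a weighted harmonic mean of precision and recall; a small sketch:

```python
def f_alpha(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall, where alpha is
    the weight on recall (alpha = 0.5 gives balanced F-measure)."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / ((1.0 - alpha) / precision + alpha / recall)

print(round(f_alpha(0.6, 0.9, 0.5), 3))    # balanced: → 0.72
print(round(f_alpha(0.6, 0.9, 0.75), 3))   # recall-weighted: → 0.8
```

With the same precision/recall pair, upweighting recall raises the score, which is why a recall-weighted objective pushes training toward high-recall solutions.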

He also evaluated the result of training a regular old maximum likelihood logistic regression model and showed that it performed terribly (the task, by the way, was sentence extraction for summarization; I think it’d be easier for the rest of us to evaluate these techniques on something we can reproduce, like the Reuters corpus or 20 Newsgroups). He then showed that a post-hoc fitting of the binary classification threshold away from 0.0 improved performance immensely.

I’m left with two questions. (1) If you want high recall, can you just train a regular logistic regression model and then set the acceptance probability threshold lower than 0.5? And (2) how does a reasonably regularized logistic regression fare? I saw the same kind of bad performance for maximum likelihood and ridge regression (Gaussian priors) in Genkin, Lewis and Madigan’s paper on large-scale logistic regression with priors (also well worth reading).

I have a long history with Martin’s F-measure error paper. Martin couldn’t get a visa to travel from the U.S. to Canada for the conferences. So he had me tack up the poster for this paper for him. Too bad I didn’t understand the technical details at the time. (During the same trip, I presented his treebank transfer paper, which as far as I know, was the first truly Bayesian NLP paper and is well worth reading.)

May 27, 2008 at 9:59 am |

Hi Bob! Thanks for this very useful writeup. I should first point out that the technique in my 2005 paper is substantially equivalent to one proposed by Mozer et al. in NIPS 14. The common idea in both Mozer et al.’s formulation as well as my 2005 paper was to approximate the expected F-score in terms of approximate true positives etc. in order to get an objective function with a non-constant gradient. You can get at the same goal without having to approximate: see my 2007 ACL paper (http://aclweb.org/anthology/P07-1093). The general technique for computing the expected F-score exactly also plays a role in the corresponding decoder, which is discussed in the more recent paper as well. Based on my recent experience, it’s the decoder I would focus on first, in preference to the training objective, when building a classifier for a new problem.

The maximum expected F-score decoder gives you different classification quality than simply adjusting the threshold parameter in your classifier. Or rather, you don’t know what the optimal threshold value should be on test data without knowledge of the true labels. You do get higher recall by lowering your acceptance probability, but since you sacrifice precision along with gains in recall, you generally don’t know how low you have to go in order to maximize F-score. You could calibrate that threshold on held-out data for which you have labels, but it won’t carry over to unseen data whose class distribution is very different from your held-out data. The decoder on the other hand will consider all relevant ways of labeling unseen data and optimize the expected F-score (under a fixed model in my naive implementation) on those data. I’ll post some of the slides that go with the talk for my ACL paper (which I never gave), which show results on standard UCI datasets.
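[Editor's sketch: a crude approximation of the decoder idea described above, not the exact expected-F computation from the 2007 paper. It assumes labels are independent given the model and uses soft counts; under that assumption the optimizer is always a top-k prefix by probability, so only n+1 candidate labelings need scoring.]

```python
def expected_f_decode(probs):
    """Select the set of items to label positive that maximizes an
    approximate expected F1, using soft counts: E[TP] is the sum of
    p_i over selected items, E[FP] the sum of (1 - p_i) over selected
    items, E[FN] the sum of p_i over rejected items. Only top-k
    prefixes by probability are considered."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    best_k, best_f = 0, 0.0
    for k in range(1, len(probs) + 1):
        e_tp = sum(probs[i] for i in order[:k])
        e_fp = sum(1.0 - probs[i] for i in order[:k])
        e_fn = sum(probs[i] for i in order[k:])
        f = 2.0 * e_tp / (2.0 * e_tp + e_fp + e_fn)
        if f > best_f:
            best_k, best_f = k, f
    return sorted(order[:best_k])

print(expected_f_decode([0.9, 0.6, 0.4, 0.1]))  # → [0, 1, 2]
```

Notice the decoder accepts the item with probability 0.4, below the naive 0.5 cutoff, because doing so raises expected F; no hand-tuned threshold is involved, which is the point of the comment above.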

Finally, you’re absolutely right that “logistic regression” is a red herring (my words). The general optimization techniques work (at a minimum) for any probabilistic classification model that can be trained using gradient-based maximum likelihood. This, too, is much clearer in my 2007 paper, which drops any reference to “logistic regression” other than as a concrete example.

June 9, 2008 at 5:54 pm |

[…] Jansche’s training of a logistic regression-like classifier based on (an approximation of) F-measure error. The goal was to build a classifier with a better F-measure than one trained on traditional log […]