<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: F-measure Loss for &#8220;Logistic Regression&#8221;</title>
	<atom:link href="http://lingpipe-blog.com/2008/05/23/f-measure-loss-for-logistic-regression/feed/" rel="self" type="application/rss+xml" />
	<link>http://lingpipe-blog.com/2008/05/23/f-measure-loss-for-logistic-regression/</link>
	<description>Natural Language Processing and Text Analytics</description>
	<lastBuildDate>Sat, 04 Feb 2012 20:56:48 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>By: Per-Tag Error Function for Conditional Random Fields for Removing Label Bias &#171; LingPipe Blog</title>
		<link>http://lingpipe-blog.com/2008/05/23/f-measure-loss-for-logistic-regression/#comment-2480</link>
		<dc:creator><![CDATA[Per-Tag Error Function for Conditional Random Fields for Removing Label Bias &#171; LingPipe Blog]]></dc:creator>
		<pubDate>Mon, 09 Jun 2008 22:54:29 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=97#comment-2480</guid>
		<description><![CDATA[[...] Jansche&#8217;s training of a logistic regression-like classifier based on (an approximation of) F-measure error. The goal was to build a classifier with a better F-measure than one trained on traditional log [...]]]></description>
		<content:encoded><![CDATA[<p>[...] Jansche&#8217;s training of a logistic regression-like classifier based on (an approximation of) F-measure error. The goal was to build a classifier with a better F-measure than one trained on traditional log [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Martin Jansche</title>
		<link>http://lingpipe-blog.com/2008/05/23/f-measure-loss-for-logistic-regression/#comment-2412</link>
		<dc:creator><![CDATA[Martin Jansche]]></dc:creator>
		<pubDate>Tue, 27 May 2008 14:59:53 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe.wordpress.com/?p=97#comment-2412</guid>
		<description><![CDATA[Hi Bob!  Thanks for this very useful writeup.  I should first point out that the technique in my 2005 paper is substantially equivalent to one proposed by Mozer et al. in NIPS 14.  The common idea in both Mozer et al.&#039;s formulation and my 2005 paper was to approximate the expected F-score in terms of approximate true positives, etc., in order to get an objective function with a non-constant gradient.  You can get at the same goal without having to approximate: see my 2007 ACL paper (http://aclweb.org/anthology/P07-1093).  The general technique for computing the expected F-score exactly also plays a role in the corresponding decoder, which is discussed in the more recent paper as well.  Based on my recent experience, it&#039;s the decoder I would focus on first, in preference to the training objective, when building a classifier for a new problem.

The maximum expected F-score decoder gives you different classification quality than simply adjusting the threshold parameter in your classifier.  Or rather, you don&#039;t know what the optimal threshold value should be on test data without knowledge of the true labels.  You do get higher recall by lowering your acceptance probability, but since you sacrifice precision along with gains in recall, you generally don&#039;t know how low you have to go in order to maximize F-score.  You could calibrate that threshold on held-out data for which you have labels, but it won&#039;t carry over to unseen data whose class distribution is very different from your held-out data.  The decoder, on the other hand, will consider all relevant ways of labeling unseen data and optimize the expected F-score (under a fixed model in my naive implementation) on those data.  I&#039;ll post some of the slides that go with the talk for my ACL paper (which I never gave), which show results on standard UCI datasets.

Finally, you&#039;re absolutely right that &quot;logistic regression&quot; is a red herring (my words).  The general optimization techniques work (at a minimum) for any probabilistic classification model that can be trained using gradient-based maximum likelihood.  This, too, is much clearer in my 2007 paper, which drops any reference to &quot;logistic regression&quot; other than as a concrete example.]]></description>
		<content:encoded><![CDATA[<p>Hi Bob!  Thanks for this very useful writeup.  I should first point out that the technique in my 2005 paper is substantially equivalent to one proposed by Mozer et al. in NIPS 14.  The common idea in both Mozer et al.&#8217;s formulation and my 2005 paper was to approximate the expected F-score in terms of approximate true positives, etc., in order to get an objective function with a non-constant gradient.  You can get at the same goal without having to approximate: see my 2007 ACL paper (<a href="http://aclweb.org/anthology/P07-1093" rel="nofollow">http://aclweb.org/anthology/P07-1093</a>).  The general technique for computing the expected F-score exactly also plays a role in the corresponding decoder, which is discussed in the more recent paper as well.  Based on my recent experience, it&#8217;s the decoder I would focus on first, in preference to the training objective, when building a classifier for a new problem.</p>
<p>The maximum expected F-score decoder gives you different classification quality than simply adjusting the threshold parameter in your classifier.  Or rather, you don&#8217;t know what the optimal threshold value should be on test data without knowledge of the true labels.  You do get higher recall by lowering your acceptance probability, but since you sacrifice precision along with gains in recall, you generally don&#8217;t know how low you have to go in order to maximize F-score.  You could calibrate that threshold on held-out data for which you have labels, but it won&#8217;t carry over to unseen data whose class distribution is very different from your held-out data.  The decoder, on the other hand, will consider all relevant ways of labeling unseen data and optimize the expected F-score (under a fixed model in my naive implementation) on those data.  I&#8217;ll post some of the slides that go with the talk for my ACL paper (which I never gave), which show results on standard UCI datasets.</p>
<p>Finally, you&#8217;re absolutely right that &#8220;logistic regression&#8221; is a red herring (my words).  The general optimization techniques work (at a minimum) for any probabilistic classification model that can be trained using gradient-based maximum likelihood.  This, too, is much clearer in my 2007 paper, which drops any reference to &#8220;logistic regression&#8221; other than as a concrete example.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

