<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments on: Comparing Precision-Recall Curves the Bayesian Way?</title>
	<atom:link href="http://lingpipe-blog.com/2010/01/29/comparing-precision-recall-curves-bayesian-way/feed/" rel="self" type="application/rss+xml" />
	<link>http://lingpipe-blog.com/2010/01/29/comparing-precision-recall-curves-bayesian-way/</link>
	<description>Natural Language Processing and Text Analytics</description>
	<lastBuildDate>Wed, 08 Feb 2012 17:47:08 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>By: Brendan O'Connor</title>
		<link>http://lingpipe-blog.com/2010/01/29/comparing-precision-recall-curves-bayesian-way/#comment-6318</link>
		<dc:creator><![CDATA[Brendan O'Connor]]></dc:creator>
		<pubDate>Thu, 04 Feb 2010 17:51:15 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=3502#comment-6318</guid>
		<description><![CDATA[Just go for the bootstrap... so much easier!]]></description>
		<content:encoded><![CDATA[<p>Just go for the bootstrap&#8230; so much easier!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Andrew Clegg</title>
		<link>http://lingpipe-blog.com/2010/01/29/comparing-precision-recall-curves-bayesian-way/#comment-6252</link>
		<dc:creator><![CDATA[Andrew Clegg]]></dc:creator>
		<pubDate>Tue, 02 Feb 2010 03:58:59 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=3502#comment-6252</guid>
		<description><![CDATA[Had to go and read up on bootstrapping -- it&#039;s been years since I learned about sampling etc. in a Master&#039;s course. But yes, I can see the appeal of that.

I&#039;d been using AUC of the P-R curve as the point statistic for comparison, as most of the algorithms predict conservatively and don&#039;t actually reach 99% recall...]]></description>
		<content:encoded><![CDATA[<p>Had to go and read up on bootstrapping &#8212; it&#8217;s been years since I learned about sampling etc. in a Master&#8217;s course. But yes, I can see the appeal of that.</p>
<p>I&#8217;d been using AUC of the P-R curve as the point statistic for comparison, as most of the algorithms predict conservatively and don&#8217;t actually reach 99% recall&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Aleks Jakulin</title>
		<link>http://lingpipe-blog.com/2010/01/29/comparing-precision-recall-curves-bayesian-way/#comment-6235</link>
		<dc:creator><![CDATA[Aleks Jakulin]]></dc:creator>
		<pubDate>Mon, 01 Feb 2010 20:51:49 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=3502#comment-6235</guid>
		<description><![CDATA[Many people are fond of aROC (or area under the ROC curve) as a statistic. I still prefer to just throw in my utilities and go decision-theoretic.]]></description>
		<content:encoded><![CDATA[<p>Many people are fond of aROC (or area under the ROC curve) as a statistic. I still prefer to just throw in my utilities and go decision-theoretic.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: lingpipe</title>
		<link>http://lingpipe-blog.com/2010/01/29/comparing-precision-recall-curves-bayesian-way/#comment-6234</link>
		<dc:creator><![CDATA[lingpipe]]></dc:creator>
		<pubDate>Mon, 01 Feb 2010 20:42:03 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=3502#comment-6234</guid>
		<description><![CDATA[As I said below, I&#039;m now thinking a bootstrap analysis would be simpler.  But you&#039;d still need a statistic to measure.  I really like precision at 99% recall, but that&#039;s just us.  Recall at high precision is also interesting.  In our experience, apps tend to be heavily recall or precision focused.]]></description>
		<content:encoded><![CDATA[<p>As I said below, I&#8217;m now thinking a bootstrap analysis would be simpler.  But you&#8217;d still need a statistic to measure.  I really like precision at 99% recall, but that&#8217;s just us.  Recall at high precision is also interesting.  In our experience, apps tend to be heavily recall or precision focused.</p>
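<p>A minimal sketch of that bootstrap, not anyone&#8217;s actual evaluation code. It assumes two systems scored the same items, that higher scores rank first, and that the statistic is precision at a target recall. The names <code>precision_at_recall</code> and <code>bootstrap_interval</code>, the number of resamples, and the 95% percentile interval are illustrative choices.</p>

```python
import random


def precision_at_recall(scored, target_recall):
    """Precision at the shallowest rank cutoff whose recall reaches target_recall.

    scored: list of (score, is_relevant) pairs; higher scores rank first.
    Ties keep their input order. Returns 0.0 if nothing is relevant.
    """
    total_relevant = sum(1 for _, rel in scored if rel)
    if total_relevant == 0:
        return 0.0
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    hits = 0
    for rank, (_, rel) in enumerate(ranked, start=1):
        if rel:
            hits += 1
            if hits / total_relevant >= target_recall:
                return hits / rank
    return 0.0  # only reached if target_recall is above 1.0


def bootstrap_interval(scores_a, scores_b, labels,
                       target_recall=0.99, n_boot=1000, seed=0):
    """Paired bootstrap over items for the difference (system A minus system B).

    Each replicate resamples item indices with replacement, so both systems
    are always evaluated on the same resampled items.
    Returns an approximate 95% percentile interval (lower, upper).
    """
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        sample_a = [(scores_a[i], labels[i]) for i in idx]
        sample_b = [(scores_b[i], labels[i]) for i in idx]
        diffs.append(precision_at_recall(sample_a, target_recall)
                     - precision_at_recall(sample_b, target_recall))
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return lower, upper
```

<p>If the interval excludes zero, that is some evidence the two systems differ on this statistic. Swapping in recall at high precision only requires replacing the statistic function.</p>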
]]></content:encoded>
	</item>
	<item>
		<title>By: Andrew Clegg</title>
		<link>http://lingpipe-blog.com/2010/01/29/comparing-precision-recall-curves-bayesian-way/#comment-6230</link>
		<dc:creator><![CDATA[Andrew Clegg]]></dc:creator>
		<pubDate>Mon, 01 Feb 2010 18:46:49 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=3502#comment-6230</guid>
		<description><![CDATA[Err, Gold Standard, sorry.

Okay, I&#039;ll punt the raw results to you when we have clearance to share them. Thanks -- always glad to have a new angle, and I don&#039;t think Bayesianly very much myself.]]></description>
		<content:encoded><![CDATA[<p>Err, Gold Standard, sorry.</p>
<p>Okay, I&#8217;ll punt the raw results to you when we have clearance to share them. Thanks &#8212; always glad to have a new angle, and I don&#8217;t think Bayesianly very much myself.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: lingpipe</title>
		<link>http://lingpipe-blog.com/2010/01/29/comparing-precision-recall-curves-bayesian-way/#comment-6229</link>
		<dc:creator><![CDATA[lingpipe]]></dc:creator>
		<pubDate>Mon, 01 Feb 2010 18:34:22 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=3502#comment-6229</guid>
		<description><![CDATA[I&#039;d think an evaluation based on PR curves should be run just from the ranked results of each system and a notion of which results are &quot;relevant&quot;.  Of course, you can compute document ranks from document scores.  

What&#039;s a &quot;GS&quot;?  

I&#039;m in no hurry; I know the biologists are close with their data and wouldn&#039;t want to step on anyone&#039;s toes.]]></description>
		<content:encoded><![CDATA[<p>I&#8217;d think an evaluation based on PR curves should be run just from the ranked results of each system and a notion of which results are &#8220;relevant&#8221;.  Of course, you can compute document ranks from document scores.  </p>
<p>What&#8217;s a &#8220;GS&#8221;?  </p>
<p>I&#8217;m in no hurry; I know the biologists are close with their data and wouldn&#8217;t want to step on anyone&#8217;s toes.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: lingpipe</title>
		<link>http://lingpipe-blog.com/2010/01/29/comparing-precision-recall-curves-bayesian-way/#comment-6228</link>
		<dc:creator><![CDATA[lingpipe]]></dc:creator>
		<pubDate>Mon, 01 Feb 2010 18:30:22 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=3502#comment-6228</guid>
		<description><![CDATA[Both great ideas with big &quot;BUT&quot;s for Andrew&#039;s proposed application of running an information retrieval-like evaluation.

1.  Not everyone uses a probabilistic system, 

2.  Many IR systems don&#039;t have an estimation/training phase, and

3.  When running a retrospective evaluation, you typically just have ranked responses to queries.  

Having said all that, it occurs to me from Aleks&#039;s suggestions you might be able to use a bootstrap analysis.]]></description>
		<content:encoded><![CDATA[<p>Both great ideas with big &#8220;BUT&#8221;s for Andrew&#8217;s proposed application of running an information retrieval-like evaluation.</p>
<p>1.  Not everyone uses a probabilistic system, </p>
<p>2.  Many IR systems don&#8217;t have an estimation/training phase, and</p>
<p>3.  When running a retrospective evaluation, you typically just have ranked responses to queries.  </p>
<p>Having said all that, it occurs to me from Aleks&#8217;s suggestions you might be able to use a bootstrap analysis.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Aleks Jakulin</title>
		<link>http://lingpipe-blog.com/2010/01/29/comparing-precision-recall-curves-bayesian-way/#comment-6204</link>
		<dc:creator><![CDATA[Aleks Jakulin]]></dc:creator>
		<pubDate>Sun, 31 Jan 2010 03:50:39 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=3502#comment-6204</guid>
		<description><![CDATA[ROC is a model comparison that turns probability into a rank. Not sure I buy it. If you feed probabilities into a decision-theoretic utility, it will give you the optimal threshold automatically.

Now, if you insist on doing significance testing on ROC curves, why not do a bunch of 2-fold cross validations over a number of permutations of the indices, and paint the ROC for each one of them? If you run 100 cross validations, you will have 100 ROC curves, and if you try two models on the same split, you will even be able to compare them. I&#039;ve done such things in my PhD.]]></description>
		<content:encoded><![CDATA[<p>ROC is a model comparison that turns probability into a rank. Not sure I buy it. If you feed probabilities into a decision-theoretic utility, it will give you the optimal threshold automatically.</p>
<p>Now, if you insist on doing significance testing on ROC curves, why not do a bunch of 2-fold cross validations over a number of permutations of the indices, and paint the ROC for each one of them? If you run 100 cross validations, you will have 100 ROC curves, and if you try two models on the same split, you will even be able to compare them. I&#8217;ve done such things in my PhD.</p>
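<p>A small sketch of the first point, under stated assumptions: a binary decision, calibrated probabilities, and one utility per outcome. The function name <code>optimal_threshold</code> and the four utility parameters are hypothetical. Predicting positive pays off when p*u_tp + (1-p)*u_fp is at least p*u_fn + (1-p)*u_tn, and solving for p gives the threshold.</p>

```python
def optimal_threshold(u_tp, u_fp, u_fn, u_tn):
    """Probability at or above which predicting 'positive' maximizes expected utility.

    Assumes being right beats being wrong for each true class, that is
    u_tp > u_fn and u_tn > u_fp; otherwise no interior threshold exists.
    """
    gain_if_positive = u_tp - u_fn  # benefit of saying 'positive' on a true positive
    gain_if_negative = u_tn - u_fp  # benefit of saying 'negative' on a true negative
    if gain_if_positive <= 0 or gain_if_negative <= 0:
        raise ValueError("utilities must reward correct decisions")
    return gain_if_negative / (gain_if_positive + gain_if_negative)


# Symmetric utilities give the familiar 0.5 cutoff.
symmetric = optimal_threshold(u_tp=1, u_fp=-1, u_fn=-1, u_tn=1)

# A false positive costing nine times what a true positive earns pushes the cutoff up.
cautious = optimal_threshold(u_tp=1, u_fp=-9, u_fn=0, u_tn=0)
```

<p>With these inputs <code>symmetric</code> is 0.5 and <code>cautious</code> is 0.9, so the threshold follows from the utilities rather than from reading a point off a curve.</p>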
]]></content:encoded>
	</item>
	<item>
		<title>By: Andrew Clegg</title>
		<link>http://lingpipe-blog.com/2010/01/29/comparing-precision-recall-curves-bayesian-way/#comment-6200</link>
		<dc:creator><![CDATA[Andrew Clegg]]></dc:creator>
		<pubDate>Sat, 30 Jan 2010 11:13:49 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=3502#comment-6200</guid>
		<description><![CDATA[Hi Bob, I was hoping you would jump in at some point.

A bit of background on the system: It&#039;s not actually a document IR system, it&#039;s a set of algorithms (and ensembles) for selecting and ranking relevant proteins from a much larger set of candidates, using various evidence sources -- gene expression profiles, evolutionary relatedness via domain fusion, MEDLINE co-occurrence, semantic similarity of GO annotations, yadda yadda.

Dave&#039;s comment about the &#039;traditional sciences&#039; is bang on, as this is going to a bioinformatics journal (formerly a trad molecular biology one), and yes, there is a certain element of ass-covering. However the deadline looms so the significance testing will probably have to wait until a reviewer requests it.

In the meantime, it&#039;s just become an interesting intellectual challenge.

Ken, you&#039;re right about the difference between significantly different vs. better, I was thinking of the former really. I think any reviewer who insisted on the latter would be making too many assumptions about what &#039;better&#039; means, since as you point out, curves cross, and part of our motivation for presenting various ensembles is that different ones emphasize precision or recall.

Bob, can you do any informative hacking just based on the scores, or would you need the actual GS and predictions? If you need the raw data, I&#039;ll need to &#039;anonymize&#039; it, as our collaborators who assembled the GS by hand don&#039;t want it published until their own paper is out.]]></description>
		<content:encoded><![CDATA[<p>Hi Bob, I was hoping you would jump in at some point.</p>
<p>A bit of background on the system: It&#8217;s not actually a document IR system, it&#8217;s a set of algorithms (and ensembles) for selecting and ranking relevant proteins from a much larger set of candidates, using various evidence sources &#8212; gene expression profiles, evolutionary relatedness via domain fusion, MEDLINE co-occurrence, semantic similarity of GO annotations, yadda yadda.</p>
<p>Dave&#8217;s comment about the &#8216;traditional sciences&#8217; is bang on, as this is going to a bioinformatics journal (formerly a trad molecular biology one), and yes, there is a certain element of ass-covering. However the deadline looms so the significance testing will probably have to wait until a reviewer requests it.</p>
<p>In the meantime, it&#8217;s just become an interesting intellectual challenge.</p>
<p>Ken, you&#8217;re right about the difference between significantly different vs. better, I was thinking of the former really. I think any reviewer who insisted on the latter would be making too many assumptions about what &#8216;better&#8217; means, since as you point out, curves cross, and part of our motivation for presenting various ensembles is that different ones emphasize precision or recall.</p>
<p>Bob, can you do any informative hacking just based on the scores, or would you need the actual GS and predictions? If you need the raw data, I&#8217;ll need to &#8216;anonymize&#8217; it, as our collaborators who assembled the GS by hand don&#8217;t want it published until their own paper is out.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ken Williams</title>
		<link>http://lingpipe-blog.com/2010/01/29/comparing-precision-recall-curves-bayesian-way/#comment-6197</link>
		<dc:creator><![CDATA[Ken Williams]]></dc:creator>
		<pubDate>Fri, 29 Jan 2010 22:19:12 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=3502#comment-6197</guid>
		<description><![CDATA[There seem to be two different questions asked here.  The first is whether the two curves are significantly *different*.  The second is whether one curve is significantly *better* than another.

They&#039;re different questions because one curve can be higher on the left than another curve and lower on the right.  The two curves are then different, but it&#039;s a subjective question whether one is better than the other.

The first question also seems easier, because you &quot;just&quot; want to model whether the two curves might plausibly have been generated from the same system.

For the second question I&#039;d probably want to integrate the PR function times a utility function that tells you the desired tradeoff between P &amp; R.  As Dave says though, carrying a significance test through that process seems horribly messy.]]></description>
		<content:encoded><![CDATA[<p>There seem to be two different questions asked here.  The first is whether the two curves are significantly *different*.  The second is whether one curve is significantly *better* than another.</p>
<p>They&#8217;re different questions because one curve can be higher on the left than another curve and lower on the right.  The two curves are then different, but it&#8217;s a subjective question whether one is better than the other.</p>
<p>The first question also seems easier, because you &#8220;just&#8221; want to model whether the two curves might plausibly have been generated from the same system.</p>
<p>For the second question I&#8217;d probably want to integrate the PR function times a utility function that tells you the desired tradeoff between P &amp; R.  As Dave says though, carrying a significance test through that process seems horribly messy.</p>
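<p>One way to read that integral, as a rough sketch only: treat the curve as a list of (recall, precision) points, treat the utility as a weight over recall, and integrate precision times weight with the trapezoid rule. The name <code>weighted_pr_area</code> and the weight-over-recall form are assumptions; other utility shapes are equally defensible.</p>

```python
def weighted_pr_area(points, weight):
    """Trapezoid-rule integral of precision(recall) * weight(recall) over recall.

    points: iterable of (recall, precision) pairs, in any order.
    weight: function from recall to a non-negative weight (hypothetical utility).
    """
    pts = sorted(points)
    area = 0.0
    for (r0, p0), (r1, p1) in zip(pts, pts[1:]):
        f0 = p0 * weight(r0)
        f1 = p1 * weight(r1)
        area += 0.5 * (f0 + f1) * (r1 - r0)
    return area


curve = [(0.0, 1.0), (0.5, 1.0), (1.0, 0.5)]

# A constant weight recovers the plain area under the P-R curve.
plain_area = weighted_pr_area(curve, lambda r: 1.0)

# A weight that grows with recall favors recall-focused applications.
recall_focused = weighted_pr_area(curve, lambda r: 2.0 * r)
```

<p>This gives a single number per system, which could then be fed to a resampling procedure to get at significance, messy as that may be.</p>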
]]></content:encoded>
	</item>
</channel>
</rss>

