<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
		>
<channel>
	<title>Comments for LingPipe Blog</title>
	<atom:link href="http://lingpipe-blog.com/comments/feed/" rel="self" type="application/rss+xml" />
	<link>http://lingpipe-blog.com</link>
	<description>Natural Language Processing and Text Analytics</description>
	<lastBuildDate>Sun, 28 Apr 2013 22:30:55 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
	<item>
		<title>Comment on Generative vs. Discriminative; Bayesian vs. Frequentist by Mark Johnson</title>
		<link>http://lingpipe-blog.com/2013/04/12/generative-vs-discriminative-bayesian-vs-frequentist/#comment-22345</link>
		<dc:creator><![CDATA[Mark Johnson]]></dc:creator>
		<pubDate>Sun, 28 Apr 2013 22:30:55 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=6601#comment-22345</guid>
		<description><![CDATA[You&#039;re quite right that SVMs and Perceptrons aren&#039;t probabilistic models (i.e., they don&#039;t estimate the probabilities of different outputs).  I just meant that the SVM is designed to optimise a probabilistically-defined objective, namely the expected loss.  (Expectations require there to be an underlying probability distribution, of course).

Yes, that&#039;s an interesting point about hard SVMs -- I guess that follows from the fact that SVMs maximise the margin; it&#039;s pretty clear that &quot;duplicating&quot; a data point has no effect whatsoever on the margin.

Interestingly, a number of psycholinguists (e.g., Bybee) suggest that humans pay far more attention to type frequency than to token frequency when learning morphology; i.e., multiple presentations of the same data item don&#039;t help human learning; rather, we need to see different variants instantiating the same principle.

But type-based learning isn&#039;t unique to SVMs -- learning the parameters in the &lt;a href=&quot;http://books.nips.cc/papers/files/nips18/NIPS2005_0333.pdf&quot; rel=&quot;nofollow&quot;&gt;upper levels of hierarchical Bayesian models is type-based in a similar way&lt;/a&gt;.]]></description>
		<content:encoded><![CDATA[<p>You&#8217;re quite right that SVMs and Perceptrons aren&#8217;t probabilistic models (i.e., they don&#8217;t estimate the probabilities of different outputs).  I just meant that the SVM is designed to optimise a probabilistically-defined objective, namely the expected loss.  (Expectations require there to be an underlying probability distribution, of course).</p>
<p>Yes, that&#8217;s an interesting point about hard SVMs &#8212; I guess that follows from the fact that SVMs maximise the margin; it&#8217;s pretty clear that &#8220;duplicating&#8221; a data point has no effect whatsoever on the margin.</p>
<p>Interestingly, a number of psycholinguists (e.g., Bybee) suggest that humans pay far more attention to type frequency than to token frequency when learning morphology; i.e., multiple presentations of the same data item don&#8217;t help human learning; rather, we need to see different variants instantiating the same principle.</p>
<p>But type-based learning isn&#8217;t unique to SVMs &#8212; learning the parameters in the <a href="http://books.nips.cc/papers/files/nips18/NIPS2005_0333.pdf" rel="nofollow">upper levels of hierarchical Bayesian models is type-based in a similar way</a>.</p>
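<p>A minimal sketch of the duplication point above, assuming a linear classifier with hypothetical helper names <code>min_margin</code> and <code>total_hinge</code>. The hard-margin SVM maximises the smallest margin over the training points, and a minimum does not change when a point is repeated, whereas a summed hinge loss (as in the soft-margin case) does.</p>

```python
import math

def min_margin(w, b, data):
    # Smallest geometric margin y * (w.x + b) / ||w|| over the training set.
    # This is the quantity a hard-margin SVM maximises over (w, b).
    norm = math.sqrt(sum(v * v for v in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in data)

def total_hinge(w, b, data):
    # Summed hinge loss, the data term of a soft-margin SVM objective.
    return sum(max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
               for x, y in data)

# Toy data (made up for illustration): one positive and one negative point.
data = [([2.0, 0.0], 1), ([-1.0, 0.0], -1)]
duplicated = data + [data[1]]   # present the second point twice
w, b = [0.5, 0.0], 0.0

print(min_margin(w, b, data), min_margin(w, b, duplicated))
print(total_hinge(w, b, data), total_hinge(w, b, duplicated))
```

<p>The first line prints the same value twice; the second does not, which is the sense in which only the soft-margin objective is sensitive to token counts.</p>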
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Generative vs. Discriminative; Bayesian vs. Frequentist by tdietterich</title>
		<link>http://lingpipe-blog.com/2013/04/12/generative-vs-discriminative-bayesian-vs-frequentist/#comment-22344</link>
		<dc:creator><![CDATA[tdietterich]]></dc:creator>
		<pubDate>Sun, 28 Apr 2013 19:06:31 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=6601#comment-22344</guid>
		<description><![CDATA[Bob Duin pointed out (in what I believe is a still-unpublished note) that the hard SVM (without the slack penalty) is a non-statistical classifier in that it produces exactly the same decision boundary regardless of the number of times a given data point appears in the training set. This makes it rather unusual in the Stat/ML/Pattern Recognition universe.

I also find the interpretation of SVMs as a robust classifier (Xu, et al. JMLR 2009) to be intriguing. In any case, I think there is a lot more going on in SVMs than just a convex approximation to a probabilistic model. 

P.S. I use Platt scaling all the time. However, I don&#039;t see how that gives us a probabilistic interpretation of SVMs. I think it just tells us that the discriminant function computed by an SVM is a useful input to a logistic regression.]]></description>
		<content:encoded><![CDATA[<p>Bob Duin pointed out (in what I believe is a still-unpublished note) that the hard SVM (without the slack penalty) is a non-statistical classifier in that it produces exactly the same decision boundary regardless of the number of times a given data point appears in the training set. This makes it rather unusual in the Stat/ML/Pattern Recognition universe.</p>
<p>I also find the interpretation of SVMs as a robust classifier (Xu, et al. JMLR 2009) to be intriguing. In any case, I think there is a lot more going on in SVMs than just a convex approximation to a probabilistic model. </p>
<p>P.S. I use Platt scaling all the time. However, I don&#8217;t see how that gives us a probabilistic interpretation of SVMs. I think it just tells us that the discriminant function computed by an SVM is a useful input to a logistic regression.</p>
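<p>A simplified sketch of the idea in the P.S., assuming the SVM scores are already computed. It fits a one-input logistic regression P(y = 1 | s) = sigmoid(a * s + b) by plain gradient descent; the names <code>fit_platt</code> and <code>platt_prob</code>, the learning rate, and the toy data are assumptions, and the sign convention and target smoothing of the original Platt method are omitted.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_platt(scores, labels, lr=0.1, steps=2000):
    # Fit P(y = 1 | s) = sigmoid(a * s + b) by gradient descent on mean log loss.
    # scores: discriminant values from an already-trained SVM; labels: 0 or 1.
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y   # derivative of log loss wrt (a*s + b)
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def platt_prob(score, a, b):
    # Calibrated probability for a new SVM score.
    return sigmoid(a * score + b)

# Toy, non-separable scores and labels (made up for illustration).
a, b = fit_platt([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0], [0, 0, 1, 0, 1, 1])
print(platt_prob(-2.0, a, b), platt_prob(2.0, a, b))
```

<p>Note that the SVM itself is untouched: the score is simply treated as a single input feature to a logistic regression.</p>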
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Convergence is Relative: SGD vs. Pegasos, LibLinear, SVM^light, and SVM^perf by p</title>
		<link>http://lingpipe-blog.com/2009/04/08/convergence-relative-sgd-pegasos-liblinear-svmlight-svmper/#comment-22343</link>
		<dc:creator><![CDATA[p]]></dc:creator>
		<pubDate>Sun, 28 Apr 2013 11:04:58 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=1186#comment-22343</guid>
		<description><![CDATA[I am using Pegasos SVM but do not understand it properly. How can I calculate the accuracy &amp; execution time? Is there any file or program available that calculates the accuracy, or any other option?]]></description>
		<content:encoded><![CDATA[<p>I am using Pegasos SVM but do not understand it properly. How can I calculate the accuracy &amp; execution time? Is there any file or program available that calculates the accuracy, or any other option?</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Generative vs. Discriminative; Bayesian vs. Frequentist by Mark Johnson</title>
		<link>http://lingpipe-blog.com/2013/04/12/generative-vs-discriminative-bayesian-vs-frequentist/#comment-22341</link>
		<dc:creator><![CDATA[Mark Johnson]]></dc:creator>
		<pubDate>Sun, 28 Apr 2013 08:23:02 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=6601#comment-22341</guid>
		<description><![CDATA[Like you (I suspect), I&#039;ve tended to stay away from SVMs and Empirical Risk Minimisation in my own work, so I don&#039;t know as much about them as I should.

But SVMs do have a probabilistic interpretation, i.e., SVMs are designed to minimise the expected loss.  There&#039;s a certain attraction to this -- in many applications this is precisely what we want to do.

Also, there is a close relationship between the Perceptron and log-linear &quot;MaxEnt&quot; classifier models.  Specifically, if you estimate a log-linear classifier with stochastic gradient descent (i.e., mini-batch of size 1) and make a Viterbi approximation (i.e., assume that the expected feature values are the feature values of the mode label) then you obtain the Perceptron update rule!  Perhaps even more surprisingly, in SGD, L2 regularisation becomes multiplicative weight decay and L1 regularisation becomes subtractive weight decay! (This isn&#039;t too surprising after a bit of thought).]]></description>
		<content:encoded><![CDATA[<p>Like you (I suspect), I&#8217;ve tended to stay away from SVMs and Empirical Risk Minimisation in my own work, so I don&#8217;t know as much about them as I should.</p>
<p>But SVMs do have a probabilistic interpretation, i.e., SVMs are designed to minimise the expected loss.  There&#8217;s a certain attraction to this &#8212; in many applications this is precisely what we want to do.</p>
<p>Also, there is a close relationship between the Perceptron and log-linear &#8220;MaxEnt&#8221; classifier models.  Specifically, if you estimate a log-linear classifier with stochastic gradient descent (i.e., mini-batch of size 1) and make a Viterbi approximation (i.e., assume that the expected feature values are the feature values of the mode label) then you obtain the Perceptron update rule!  Perhaps even more surprisingly, in SGD, L2 regularisation becomes multiplicative weight decay and L1 regularisation becomes subtractive weight decay! (This isn&#8217;t too surprising after a bit of thought).</p>
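<p>A minimal sketch of the relationship described above, assuming a multiclass log-linear model with one weight vector per label; the function names are hypothetical. With <code>viterbi=True</code> the expected feature values are replaced by those of the mode label, and the SGD step reduces to the multiclass Perceptron update. The two decay functions show what the gradient step on each regulariser looks like.</p>

```python
import math

def scores(w, x):
    # w: one weight vector per label; x: feature vector.
    return [sum(wi * xi for wi, xi in zip(wk, x)) for wk in w]

def softmax(s):
    m = max(s)
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    return [v / z for v in e]

def sgd_step(w, x, y, eta, viterbi=False):
    # One SGD step (mini-batch of size 1) on the conditional log-likelihood.
    s = scores(w, x)
    if viterbi:
        # Viterbi approximation: put all probability mass on the mode label.
        mode = max(range(len(s)), key=lambda k: s[k])
        p = [1.0 if k == mode else 0.0 for k in range(len(s))]
    else:
        p = softmax(s)
    # Gradient wrt w[k] is (indicator(k == y) - p[k]) * x.
    return [[wkj + eta * ((1.0 if k == y else 0.0) - p[k]) * xj
             for wkj, xj in zip(w[k], x)]
            for k in range(len(w))]

def l2_decay(w, eta, lam):
    # Gradient step on (lam / 2) * ||w||^2: multiply every weight by (1 - eta * lam).
    return [[(1.0 - eta * lam) * v for v in wk] for wk in w]

def l1_decay(w, eta, lam):
    # Gradient step on lam * ||w||_1: subtract eta * lam * sign(v) from every weight.
    return [[v - eta * lam * (math.copysign(1.0, v) if v != 0.0 else 0.0)
             for v in wk] for wk in w]

w0 = [[0.0, 0.0], [0.0, 0.0]]
w1 = sgd_step(w0, [1.0, 2.0], 1, 1.0, viterbi=True)   # mistake: weights move
w2 = sgd_step(w1, [1.0, 2.0], 1, 1.0, viterbi=True)   # correct: no change
print(w1, w2)
```

<p>With the Viterbi approximation the step is zero whenever the mode label is the correct one, which is exactly the mistake-driven behaviour of the Perceptron.</p>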
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Generative vs. Discriminative; Bayesian vs. Frequentist by Mark Johnson</title>
		<link>http://lingpipe-blog.com/2013/04/12/generative-vs-discriminative-bayesian-vs-frequentist/#comment-22340</link>
		<dc:creator><![CDATA[Mark Johnson]]></dc:creator>
		<pubDate>Sun, 28 Apr 2013 08:11:31 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=6601#comment-22340</guid>
		<description><![CDATA[I agree completely -- the graphical model that results from conditioning on a set of variables deletes those variables and the edges connected to them, which means that conditional estimation can be quite tractable in situations where joint estimation would be intractable.  I see this as the main insight behind the CRF.]]></description>
		<content:encoded><![CDATA[<p>I agree completely &#8212; the graphical model that results from conditioning on a set of variables deletes those variables and the edges connected to them, which means that conditional estimation can be quite tractable in situations where joint estimation would be intractable.  I see this as the main insight behind the CRF.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Generative vs. Discriminative; Bayesian vs. Frequentist by Eric</title>
		<link>http://lingpipe-blog.com/2013/04/12/generative-vs-discriminative-bayesian-vs-frequentist/#comment-22339</link>
		<dc:creator><![CDATA[Eric]]></dc:creator>
		<pubDate>Sun, 28 Apr 2013 00:10:36 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=6601#comment-22339</guid>
		<description><![CDATA[Great discussion.  Platt ( http://research.microsoft.com/apps/pubs/?id=69187 ) offers a post hoc way to estimate a posterior class probability for an SVM.  That&#039;s not a probabilistic interpretation per se, but it allows one to use an SVM in settings where a probability is required. In the paper he also compares that method to a &quot;kernel classifier with a logit link function and a regularized maximum likelihood score&quot;.  Perhaps the latter could be seen as the sought-for probabilistic interpretation of SVMs.]]></description>
		<content:encoded><![CDATA[<p>Great discussion.  Platt ( <a href="http://research.microsoft.com/apps/pubs/?id=69187" rel="nofollow">http://research.microsoft.com/apps/pubs/?id=69187</a> ) offers a post hoc way to estimate a posterior class probability for an SVM.  That&#8217;s not a probabilistic interpretation per se, but it allows one to use an SVM in settings where a probability is required. In the paper he also compares that method to a &#8220;kernel classifier with a logit link function and a regularized maximum likelihood score&#8221;.  Perhaps the latter could be seen as the sought-for probabilistic interpretation of SVMs.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Generative vs. Discriminative; Bayesian vs. Frequentist by tdietterich</title>
		<link>http://lingpipe-blog.com/2013/04/12/generative-vs-discriminative-bayesian-vs-frequentist/#comment-22338</link>
		<dc:creator><![CDATA[tdietterich]]></dc:creator>
		<pubDate>Sat, 27 Apr 2013 23:47:22 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=6601#comment-22338</guid>
		<description><![CDATA[Yes, you get a classification machine (or a regression machine or a ranking machine, depending on the loss function). So if the learning component is the last step in your pipeline, you can train it using the loss function of the actual task. 

Probabilistic methods make sense when the component being learned is buried inside a large system, and you seek a modular solution. Probabilities provide a good intermediate representation that can be converted (in a principled way) to optimize many other loss functions. Of course if you are willing to perform end-to-end training of the entire system (in the style of deep neural networks), then you don&#039;t need modularity and you can just use the task-specific loss.]]></description>
		<content:encoded><![CDATA[<p>Yes, you get a classification machine (or a regression machine or a ranking machine, depending on the loss function). So if the learning component is the last step in your pipeline, you can train it using the loss function of the actual task. </p>
<p>Probabilistic methods make sense when the component being learned is buried inside a large system, and you seek a modular solution. Probabilities provide a good intermediate representation that can be converted (in a principled way) to optimize many other loss functions. Of course if you are willing to perform end-to-end training of the entire system (in the style of deep neural networks), then you don&#8217;t need modularity and you can just use the task-specific loss.</p>
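<p>A minimal sketch of the modularity argument above, assuming the upstream component outputs class probabilities and the downstream task supplies a loss matrix; <code>bayes_decision</code> and the numbers are hypothetical. The same probabilities yield different decisions under different losses without retraining.</p>

```python
def bayes_decision(probs, loss):
    # probs[k] is P(class k | x); loss[a][k] is the cost of action a when the
    # true class is k. Returns the action with the smallest expected loss.
    expected = [sum(l * p for l, p in zip(row, probs)) for row in loss]
    return min(range(len(expected)), key=lambda a: expected[a])

probs = [0.7, 0.3]                        # output of some probabilistic component
zero_one = [[0.0, 1.0], [1.0, 0.0]]       # symmetric 0/1 loss
asymmetric = [[0.0, 10.0], [1.0, 0.0]]    # missing class 1 is ten times worse

print(bayes_decision(probs, zero_one), bayes_decision(probs, asymmetric))
```

<p>Under the 0/1 loss the decision is the most probable class; under the asymmetric loss it flips to class 1 even though that class has lower probability.</p>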
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Generative vs. Discriminative; Bayesian vs. Frequentist by Bob Carpenter</title>
		<link>http://lingpipe-blog.com/2013/04/12/generative-vs-discriminative-bayesian-vs-frequentist/#comment-22337</link>
		<dc:creator><![CDATA[Bob Carpenter]]></dc:creator>
		<pubDate>Sat, 27 Apr 2013 22:01:14 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=6601#comment-22337</guid>
		<description><![CDATA[Absolutely --- this is all a matter of perspective.  The body of the post only considered probabilistic models.  (I tend not to think of other kinds these days.)

I don&#039;t know of a probabilistic interpretation of SVMs, but I haven&#039;t looked.   As usually stated, SVMs (and perceptrons) give you a classification machine, but don&#039;t model either the probabilities of the predictors or the outcome categories.  They&#039;re more like the discriminative models than the generative ones if you work by analogy to logistic regression.  After all, MAP estimates for logistic regression and the usual estimates for SVMs look the same when you fit logistic regression with a point estimate by minimizing log loss and fit an SVM by minimizing hinge loss.  ]]></description>
		<content:encoded><![CDATA[<p>Absolutely &#8212; this is all a matter of perspective.  The body of the post only considered probabilistic models.  (I tend not to think of other kinds these days.)</p>
<p>I don&#8217;t know of a probabilistic interpretation of SVMs, but I haven&#8217;t looked.   As usually stated, SVMs (and perceptrons) give you a classification machine, but don&#8217;t model either the probabilities of the predictors or the outcome categories.  They&#8217;re more like the discriminative models than the generative ones if you work by analogy to logistic regression.  After all, MAP estimates for logistic regression and the usual estimates for SVMs look the same when you fit logistic regression with a point estimate by minimizing log loss and fit an SVM by minimizing hinge loss.  </p>
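<p>A minimal sketch of the comparison above, assuming labels in {-1, +1}, a linear model without a bias term, and a squared-norm penalty (which corresponds to a Gaussian prior in the MAP reading of logistic regression); the function names are hypothetical. The two objectives share the same form and differ only in the per-example loss applied to the margin y * w.x.</p>

```python
import math

def log_loss(margin):
    # Negative log-likelihood of a logistic model, as a function of the margin.
    return math.log1p(math.exp(-margin))

def hinge_loss(margin):
    # SVM hinge loss, as a function of the margin.
    return max(0.0, 1.0 - margin)

def objective(loss, w, data, lam):
    # Penalised empirical loss: lam * ||w||^2 plus the summed per-example loss.
    total = lam * sum(v * v for v in w)
    for x, y in data:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        total += loss(margin)
    return total

# Toy data (made up for illustration): the second point is misclassified.
data = [([2.0, 0.0], 1), ([0.5, 0.0], -1)]
w = [1.0, 0.0]
print(objective(log_loss, w, data, 0.5), objective(hinge_loss, w, data, 0.5))
```

<p>Swapping <code>log_loss</code> for <code>hinge_loss</code> is the only change needed to move from the regularised logistic regression objective to the SVM one.</p>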
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Generative vs. Discriminative; Bayesian vs. Frequentist by tdietterich</title>
		<link>http://lingpipe-blog.com/2013/04/12/generative-vs-discriminative-bayesian-vs-frequentist/#comment-22332</link>
		<dc:creator><![CDATA[tdietterich]]></dc:creator>
		<pubDate>Sat, 27 Apr 2013 03:07:46 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=6601#comment-22332</guid>
		<description><![CDATA[As a machine learning person, I feel compelled to add a third column to your table in which we consider non-probabilistic &quot;models&quot; as well. Maybe the column heading should be &quot;Vapniker&quot;. The notation would be f(z; w, \beta) for the discriminative case. And this could be trained for 0/1 loss or hinge loss or whatever loss function you care about. I don&#039;t think there is a non-probabilistic equivalent of the generative model (although nearest-neighbor methods come close).]]></description>
		<content:encoded><![CDATA[<p>As a machine learning person, I feel compelled to add a third column to your table in which we consider non-probabilistic &#8220;models&#8221; as well. Maybe the column heading should be &#8220;Vapniker&#8221;. The notation would be f(z; w, \beta) for the discriminative case. And this could be trained for 0/1 loss or hinge loss or whatever loss function you care about. I don&#8217;t think there is a non-probabilistic equivalent of the generative model (although nearest-neighbor methods come close).</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Generative vs. Discriminative; Bayesian vs. Frequentist by tdietterich</title>
		<link>http://lingpipe-blog.com/2013/04/12/generative-vs-discriminative-bayesian-vs-frequentist/#comment-22331</link>
		<dc:creator><![CDATA[tdietterich]]></dc:creator>
		<pubDate>Sat, 27 Apr 2013 03:03:41 +0000</pubDate>
		<guid isPermaLink="false">http://lingpipe-blog.com/?p=6601#comment-22331</guid>
		<description><![CDATA[Mark, I think your last paragraph is key. Any model that conditions on some input variables can just as easily condition on arbitrary (fixed) functions of those inputs. This gives the modeler great freedom. However, if those functions depend on most or all of the input variables, then you will need global normalization, I believe. In principle, one could construct an equivalent generative model of the input, but it would likely be mis-specified, and that could damage the predictive power of the model.]]></description>
		<content:encoded><![CDATA[<p>Mark, I think your last paragraph is key. Any model that conditions on some input variables can just as easily condition on arbitrary (fixed) functions of those inputs. This gives the modeler great freedom. However, if those functions depend on most or all of the input variables, then you will need global normalization, I believe. In principle, one could construct an equivalent generative model of the input, but it would likely be mis-specified, and that could damage the predictive power of the model.</p>
]]></content:encoded>
	</item>
</channel>
</rss>