I thought the representational equivalence holds as long as one restricts the CRF to use very simple graph structures. Does it still hold if the CRF has non-local (and potentially non-regular) edges between the state node and a feature node? Inference on such a CRF should be pretty much as easy as on a standard “chain CRF”.

You’re quite right that SVMs and Perceptrons aren’t probabilistic models (i.e., they don’t estimate the probabilities of different outputs). I just meant that the SVM is designed to optimise a probabilistically-defined objective, namely the expected loss. (Expectations require there to be an underlying probability distribution, of course).

Yes, that’s an interesting point about hard SVMs — I guess that follows from the fact that SVMs maximise the margin; it’s pretty clear that “duplicating” a data point has no effect whatsoever on the margin.
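This is easiest to see in one dimension, where the max-margin separator is just the midpoint between the innermost points of the two classes. A minimal sketch (the data and the helper `max_margin_threshold` are made up for illustration, not from any real library):

```python
# Hard-margin "SVM" in 1-D: for linearly separable 1-D data, the
# max-margin threshold is the midpoint between the closest points of
# the two classes. Duplicating a training point cannot change the
# innermost points, so the boundary cannot move.

def max_margin_threshold(neg, pos):
    """Midpoint between the innermost points of the two classes."""
    return (max(neg) + min(pos)) / 2.0

neg = [0.0, 1.0, 2.0]
pos = [4.0, 5.0]
t1 = max_margin_threshold(neg, pos)

# Duplicate one negative point 100 times: the boundary does not move.
t2 = max_margin_threshold(neg + [1.0] * 100, pos)
print(t1, t2)  # prints 3.0 3.0
```

A soft-margin SVM (with slack penalties) would *not* be invariant in this way, since duplicated points multiply their slack contribution.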

Interestingly, a number of psycholinguists (e.g., Bybee) suggest that humans pay far more attention to type frequency than to token frequency when learning morphology; i.e., multiple presentations of the same data item don’t help human learning; rather, we need to see different variants instantiating the same principle.

But type-based learning isn’t unique to SVMs — learning the parameters in the upper levels of hierarchical Bayesian models is type-based in a similar way.

Bob Duin pointed out (in what I believe is a still-unpublished note) that the hard SVM (without the slack penalty) is a non-statistical classifier in that it produces exactly the same decision boundary regardless of the number of times a given data point appears in the training set. This makes it rather unusual in the Stat/ML/Pattern Recognition universe.

I also find the interpretation of SVMs as a robust classifier (Xu et al., JMLR 2009) to be intriguing. In any case, I think there is a lot more going on in SVMs than just a convex approximation to a probabilistic model.

P.S. I use Platt scaling all the time. However, I don’t see how that gives us a probabilistic interpretation of SVMs. I think it just tells us that the discriminant function computed by an SVM is a useful input to a logistic regression.

Like you (I suspect), I’ve tended to stay away from SVMs and Empirical Risk Minimisation in my own work, so I don’t know as much about them as I should.

But SVMs do have a probabilistic interpretation, i.e., SVMs are designed to minimise the expected loss. There’s a certain attraction to this — in many applications this is precisely what we want to do.

Also, there is a close relationship between the Perceptron and log-linear “MaxEnt” classifier models. Specifically, if you estimate a log-linear classifier with stochastic gradient descent (i.e., mini-batch of size 1) and make a Viterbi approximation (i.e., assume that the expected feature values are the feature values of the mode label) then you obtain the Perceptron update rule! Perhaps even more surprisingly, in SGD, L2 regularisation becomes multiplicative weight decay and L1 regularisation becomes subtractive weight decay! (This isn’t too surprising after a bit of thought).
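That correspondence is easy to sketch in code. The following is a toy illustration under my own assumptions (the joint feature map `phi` and the function names are made up, not from any library): an SGD step on a multiclass log-linear model, with the expected feature vector replaced by the features of the highest-scoring label (the “Viterbi” approximation), is exactly the perceptron update, and the L2 penalty turns into multiplicative weight decay.

```python
import numpy as np

def phi(x, y, n_labels):
    """Joint feature map: copy x into the block for label y."""
    f = np.zeros(len(x) * n_labels)
    f[y * len(x):(y + 1) * len(x)] = x
    return f

def sgd_viterbi_update(w, x, y_true, n_labels, lr=1.0, l2=0.0):
    """One SGD step on a log-linear classifier, with the model
    expectation E[phi] approximated by phi at the mode label."""
    scores = [w @ phi(x, y, n_labels) for y in range(n_labels)]
    y_hat = int(np.argmax(scores))        # mode label ("Viterbi")
    w = (1.0 - lr * l2) * w               # L2 becomes multiplicative decay
    # Exact SGD would subtract E[phi(x, .)]; plugging in phi(x, y_hat)
    # gives precisely the perceptron update:
    w = w + lr * (phi(x, y_true, n_labels) - phi(x, y_hat, n_labels))
    return w
```

With `l2=0`, a correctly classified example leaves the weights unchanged and a misclassified one adds the true-label features and subtracts the predicted-label features, which is the perceptron rule.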

I agree completely — the graphical model that results from conditioning on a set of variables deletes those variables and the edges connected to them, which means that conditional estimation can be quite tractable in situations where joint estimation would be intractable. I see this as the main insight behind the CRF.

Great discussion. Platt ( http://research.microsoft.com/apps/pubs/?id=69187 ) offers a post hoc way to estimate a posterior class probability for an SVM. That’s not a probabilistic interpretation per se, but it allows one to use an SVM in settings where a probability is required. In the paper he also compares that method to a “kernel classifier with a logit link function and a regularized maximum likelihood score”. Perhaps the latter could be seen as the sought-for probabilistic interpretation of SVMs.
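For readers who haven’t seen it, Platt’s method fits a sigmoid P(y=1 | f) = 1 / (1 + exp(A·f + B)) to held-out SVM scores f. A rough sketch by gradient descent on the negative log-likelihood (the scores and labels below are made up; Platt’s actual paper uses a more careful second-order fit with regularised targets):

```python
import math

def platt_fit(scores, labels, lr=0.01, steps=5000):
    """Fit P(y=1 | f) = 1 / (1 + exp(A*f + B)) by gradient descent
    on the negative log-likelihood, for labels y in {0, 1}."""
    A, B = 0.0, 0.0
    for _ in range(steps):
        gA = gB = 0.0
        for f, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(A * f + B))
            # d(NLL)/dA = (p - y) * (-f), d(NLL)/dB = (p - y) * (-1)
            gA += (p - y) * -f
            gB += (p - y) * -1.0
        A -= lr * gA
        B -= lr * gB
    return A, B
```

In practice the scores should come from a held-out set (not the SVM’s own training data), or the fitted probabilities will be badly overconfident.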

Yes, you get a classification machine (or a regression machine or a ranking machine, depending on the loss function). So if the learning component is the last step in your pipeline, you can train it using the loss function of the actual task.

Probabilistic methods make sense when the component being learned is buried inside a large system, and you seek a modular solution. Probabilities provide a good intermediate representation that can be converted (in a principled way) to optimize many other loss functions. Of course if you are willing to perform end-to-end training of the entire system (in the style of deep neural networks), then you don’t need modularity and you can just use the task-specific loss.

Absolutely — this is all a matter of perspective. The body of the post only considered probabilistic models. (I tend not to think of other kinds these days.)

I don’t know of a probabilistic interpretation of SVMs, but I haven’t looked. As usually stated, SVMs (and perceptrons) give you a classification machine, but don’t model either the probabilities of the predictors or the outcome categories. By analogy to logistic regression, they’re more like discriminative models than generative ones. After all, if you fit logistic regression with a MAP point estimate by minimizing log loss, and fit an SVM by minimizing hinge loss, the two estimation problems look the same up to the choice of loss function.
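To make that analogy concrete, here’s a toy sketch (my own names and made-up data, not from the post): both estimators minimise “sum of margin losses plus an L2 penalty”, where the margin is m = y·(w·x) with y in {−1, +1}; they differ only in the per-example loss applied to m.

```python
import math

def log_loss(m):
    """Logistic regression: MAP with a Gaussian prior minimises this."""
    return math.log(1.0 + math.exp(-m))

def hinge_loss(m):
    """SVM: hinge loss on the margin."""
    return max(0.0, 1.0 - m)

def objective(w, data, loss, lam=0.1):
    """Shared template: sum of margin losses + L2 penalty.
    data is a list of (x, y) pairs with y in {-1, +1}."""
    fit = sum(loss(y * sum(wi * xi for wi, xi in zip(w, x)))
              for x, y in data)
    return fit + lam * sum(wi * wi for wi in w)
```

Swapping `log_loss` for `hinge_loss` in `objective` is the entire difference between the two point estimates; the hinge is a convex upper bound on 0–1 loss that is exactly zero past margin 1, which is what produces sparse support vectors.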
