Thanks for the empirical feedback and paper pointer.

My perceptron example was purely theoretical. As I mentioned, I haven’t tried it out empirically.

One explanation (again theoretical) of why perceptrons or logistic regression can get into trouble is unbounded coefficients. If you have a feature that only occurs in one category of example, its coefficient can run off to infinity in maximum likelihood logistic regression or CRFs, and even in perceptrons when the problem’s not separable, other features are running interference, and there are lots of epochs.
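A tiny sketch of that failure mode (the toy data, learning rate, and penalty strength are all made up for illustration): a binary feature that fires only for positive examples gives maximum likelihood logistic regression a coefficient that grows without bound as training continues, while an L2 penalty (Gaussian prior) pins it down.

```python
import math

# Hypothetical toy data as (x, y) pairs: the binary feature x fires
# only on positive examples, so x == 1 implies y == 1.
data = [(1, 1), (1, 1), (0, 0), (0, 1), (0, 0)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(epochs, lr=0.5, l2=0.0):
    """Logistic regression by gradient ascent; l2=0.0 means no prior."""
    w = 0.0  # coefficient for the one feature (no intercept, for clarity)
    for _ in range(epochs):
        grad = sum(x * (y - sigmoid(w * x)) for x, y in data) - l2 * w
        w += lr * grad
    return w

# Without a prior, w keeps climbing toward infinity the longer we train;
# with the L2 penalty it converges to a finite value.
w_100, w_1000 = fit(100), fit(1000)
w_reg_100, w_reg_1000 = fit(100, l2=0.1), fit(1000, l2=0.1)
```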

What do you mean by max margin? Perceptrons aren’t max margin. With SVMs’ hinge loss, once you’re a bit beyond the correct side of the decision boundary, there’s zero error, which isn’t the case for CRFs. I’m not up on all the non-probabilistic linear models.
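To make the contrast concrete, here’s a sketch of the two losses as a function of the margin m = y·score, with y in {-1, +1} (the margin values are made-up numbers): hinge loss is exactly zero once m ≥ 1, while log loss stays positive no matter how far past the boundary you are.

```python
import math

def hinge(m):
    # SVM hinge loss: zero error once the margin clears 1.
    return max(0.0, 1.0 - m)

def log_loss(m):
    # Logistic/CRF log loss: log(1 + exp(-m)), never exactly zero.
    return math.log(1.0 + math.exp(-m))

margins = [0.5, 1.0, 2.0, 5.0]
hinges = [hinge(m) for m in margins]
logs = [log_loss(m) for m in margins]
```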

How different is an averaged perceptron from a perceptron empirically? I’d expect the averaged one would have more stable estimates.

A big problem with all of these approaches is that we’re fitting a model to the training data, not unseen new data. That’s one of the reasons regularization or priors (or early stopping in perceptrons/SVM) help in practice. Without regularization, coefficients can shoot off toward infinity, which can push estimates for held out data toward 0 or 1.

As to the S. B. Cohen and N. Smith paper, I’m a big believer in joint inference. It seems far more sensible than pipelining because you can account for interactions. I would think you would get the same kind of label bias problem in pipelined log-linear models as you’d get for HMMs.

It also looks like they used maximum likelihood estimates (or finite epoch approximations thereof) and that the blending was of unnormalized models with an oracle estimate, so I’m not sure what conclusions we can draw from it, but I may be misunderstanding this in a quick read.

As you point out, it may also be the model(s). It’s always possible to build a model with bad predictive power that thinks it has good predictive power (partly, it’s the training versus test split). It’s tricky to combine a very rich model like parsing with a simpler model like morphology. The same thing is problematic when stacking PCFGs and HMM POS taggers — the PCFGs just aren’t very good models. For instance, PCFGs by themselves aren’t very good POS taggers, nor are they very good language models. Also, as you point out for the Cohen and Smith paper, you can wind up predicting the POS tags and/or lexical items twice.

It’s pretty well known that naive Bayes needs to be calibrated because of its faulty independence assumptions. Ditto for combining acoustic models and language models in speech recognition, because the independence assumptions in the acoustic model are so much worse than in the language model.
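A back-of-the-envelope illustration of that miscalibration (the prior and likelihood ratio are invented numbers): if two features are perfectly correlated, naive Bayes multiplies the same likelihood ratio in twice, so its posterior comes out more extreme than the single honest estimate.

```python
# Hypothetical numbers: even prior, and a feature whose likelihood
# ratio P(feature | class 1) / P(feature | class 0) is 3.
p_prior = 0.5
lik_ratio = 3.0

def posterior(n_copies):
    # Naive Bayes treats n_copies correlated features as independent,
    # multiplying the same likelihood ratio in n_copies times.
    odds = (p_prior / (1.0 - p_prior)) * lik_ratio ** n_copies
    return odds / (1.0 + odds)

p_once = posterior(1)   # the honest estimate
p_twice = posterior(2)  # the double-counted, overconfident one
```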

I think the best way to characterize how well the model’s doing in terms of prediction is to measure log loss on held out data. First-best evaluation isn’t even what’s being optimized with a CRF or logistic regression! A proxy would be some characterization of the relation between predicted probabilities and error rates if you’re using first-best analyses.
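A minimal sketch of that evaluation (the held-out probabilities and labels below are invented for illustration): average negative log probability assigned to the true labels. Note that two models can have identical first-best accuracy while one pays far more log loss for being overconfident.

```python
import math

def mean_log_loss(probs, labels):
    """probs: predicted P(y=1) on held out data; labels: true y in {0, 1}."""
    total = 0.0
    for p, y in zip(probs, labels):
        total -= math.log(p if y == 1 else 1.0 - p)
    return total / len(labels)

# Both toy models make the same first-best decisions (and the same one
# mistake on the third item), but the overconfident one pays much more.
calibrated = mean_log_loss([0.8, 0.7, 0.6], [1, 1, 0])
overconfident = mean_log_loss([0.99, 0.99, 0.99], [1, 1, 0])
```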

– about the (averaged) perceptron: empirically, it doesn’t behave as you describe. In my experience (all involving tons of features), the exponentiated and normalized perceptron scores are actually much closer to 0 or 1 (and more accurate) than the corresponding logistic regression models. This does not mean they’re good at assigning probabilities, just that they’re not centered around 1/2 as you predicted. And if you’re going for a maximum-margin kind of algorithm, then you’re practically guaranteed to be far from 1/2.

Behavior may be different with fewer features.

– a (sort of) empirical demonstration that CRF probabilities do not necessarily combine well with other probabilities can be seen in this paper by Cohen and Smith (http://www.cs.cmu.edu/~scohen/joint+emnlp07.pdf). While I don’t quite like the modeling in the paper (some things are modeled twice), it does show that a combination of a CRF and another probabilistic model benefits from some scaling. Their joint models are a “product-of-experts” of the form log(PCFG_prob) + alpha*log(CRF_prob), and they get substantial gains in accuracy by tuning the alpha value (though as I said, I’m not quite certain whether this is true for CRFs in general, or just a byproduct of the particular model assumptions they make).
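The blend can be sketched like this (the candidate log probabilities are made-up numbers; `alpha` plays the role of the tuned weight): combine the two models’ log probabilities, then renormalize over the candidate analyses. An alpha below 1 downweights an overconfident CRF.

```python
import math

def blend(pcfg_logps, crf_logps, alpha):
    # Product-of-experts score per candidate: log(PCFG) + alpha * log(CRF),
    # renormalized over the candidate set.
    scores = [lp + alpha * lc for lp, lc in zip(pcfg_logps, crf_logps)]
    z = math.log(sum(math.exp(s) for s in scores))
    return [math.exp(s - z) for s in scores]

# Two candidate analyses: the PCFG is indifferent, the CRF is confident.
pcfg = [math.log(0.5), math.log(0.5)]
crf = [math.log(0.9), math.log(0.1)]
probs_raw = blend(pcfg, crf, alpha=1.0)     # trusts the CRF as-is
probs_damped = blend(pcfg, crf, alpha=0.5)  # flattens the CRF's confidence
```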

I’ve been thinking about this issue a lot lately, and would love to hear others’ experiences or analyses.

In practice, we’ve found logistic regression classification over text with character n-gram features to be pretty well calibrated. Specifically, we’ve estimated some large-scale multinomial logit models and their predicted probabilities correlate well with probabilities of errors on held out data. This contrasts with naive Bayes, which is notoriously uncalibrated.
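A rough sketch of that kind of calibration check (the binning scheme and the held-out numbers below are assumptions for illustration, not our actual setup): bin held-out predictions by confidence and compare each bin’s mean predicted probability to its empirical rate of positives.

```python
def calibration_bins(probs, labels, n_bins=2):
    """Group (P(y=1), y) pairs into equal-width confidence bins and
    return (mean predicted probability, empirical positive rate) per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        i = min(int(p * n_bins), n_bins - 1)
        bins[i].append((p, y))
    out = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            pos_rate = sum(y for _, y in b) / len(b)
            out.append((mean_p, pos_rate))
    return out

# Toy held-out set: five predictions at 0.8, four of which are correct,
# so the predicted probability matches the empirical rate.
bins_out = calibration_bins([0.8, 0.8, 0.8, 0.8, 0.8], [1, 1, 1, 1, 0])
```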

For concreteness, consider binary logistic regression (aka “max ent” with no outcome-category-specific features) rather than CRFs, because they’re simpler and more easily related to the traditional use of perceptrons. The same argument applies to CRFs (which are like multinomial logistic regression) versus structured SVMs or perceptrons.

The reason to trust logistic regression more than perceptrons when evaluating probabilistically is that they’re trained with log loss, which pushes the coefficient vectors toward values that mirror the training set probabilities. That is, there’s a penalty for assigning probabilities less than 1 to true cases and probabilities greater than 0 to false cases. This assumes we have a binary training set (i.e. we don’t have probabilistic training, where there’s some uncertainty of an item’s category).

Consider a really simple linearly separable training set with a handful of examples that perceptrons classify perfectly after one pass. All we know is that the sigmoid-transformed scores of true cases will be greater than 1/2 and those of false cases less than 1/2. The probabilities for true cases will almost certainly be much less than 1 (though still greater than 1/2), and for false cases much greater than 0 (though still less than 1/2). That’s because perceptrons minimize 0/1 loss, and don’t care what the scores are as long as they’re on the right side of 1/2.
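Here’s a toy run of that scenario (the two-example dataset is invented): the perceptron updates only on mistakes, so after it separates the data it stops moving, and the sigmoid-transformed scores sit just past 1/2 rather than near 0 or 1.

```python
import math

# Two trivially separable examples, labels y in {-1, +1}.
data = [([1.0, 0.0], 1), ([0.0, 1.0], -1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w = [0.0, 0.0]
for _ in range(10):  # epochs; updates fire only on mistakes
    for x, y in data:
        score = sum(wi * xi for wi, xi in zip(w, x))
        if y * score <= 0:
            w = [wi + y * xi for wi, xi in zip(w, x)]

scores = [sum(wi * xi for wi, xi in zip(w, x)) for x, _ in data]
probs = [sigmoid(s) for s in scores]  # just past 1/2, nowhere near 0 or 1
```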

For logistic regression, as we increase the number of epochs in an iterative estimator (e.g. gradient descent optimizers such as SGD or quasi-Newton methods like L-BFGS), we’ll approach values of 0.0 for false cases and 1.0 for true cases. Of course, we’ll never actually converge (except to within epsilon), because we never quite get to 0 or 1 due to the shape of the logistic sigmoid.

The problem is that with only 0 or 1 target values in the training set, there’s really no reason to assume that cases near the decision boundary get reasonably estimated probabilities. That’s because there’s no reason to believe the logistic sigmoid interpolates well. And given how flat the long tails are, there’s some reason to believe it won’t be well calibrated.

Adding informative priors will help with convergence, though keep in mind that with Laplace priors (aka L1 regularization) and perfectly correlated features, the model’s still not identified. (It is identified with Gaussian or elastic net priors.) We’ve also found it helps a bit with calibration.

As I said earlier, I’m curious what others have found. I’m afraid we haven’t run on any standard data sets we can share.

My comment might be really naive, but it’s been bothering me a lot lately, and I would really like to get an answer to it.

What bothers me about those conditionally trained CRFs (and also MaxEnt, for that matter) is that I don’t understand why I should trust the resulting models as probability distributions any more than, say, linear models trained using a perceptron, with the resulting scores exponentiated and normalized only at prediction time.
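For concreteness, the exponentiate-and-normalize step I mean is just a prediction-time softmax over the linear model’s per-class scores (the scores below are invented):

```python
import math

def softmax(scores):
    """Turn arbitrary per-class linear scores into a distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# E.g. perceptron scores for three classes, normalized only at prediction time.
dist = softmax([2.0, 0.5, -1.0])
```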

Is there any good reason to believe that we can really use probabilities from CRFs as part of a larger inference system (say, by combining them with noisy input probabilities, as mentioned in the previous post, or with any other kind of prior knowledge in the form of a probability distribution)? Is there any reason to believe this is really the correct approach to take?
