## McCallum, Bellare and Pereira (2005) A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance

Following up on my previous post on alignment with CRFs, I’ve found a related paper that I’d like to review here. I’ve spent the last 20 years following Fernando around the intellectual landscape, from logic programming and grammatical formalisms to statistics and biology. (So it’s perhaps not surprising that we’re both speaking next week at the Edinburgh HCRC and Cog Sci Reunion.)

I have to admit I’d read this paper in the dim and distant past (OK, it’s from 2005), so something might’ve stuck in my head. But once you grok CRFs, building new models is easy, because as I said last time, CRFs are really good at normalizing sequential labeling or classification problems.

### Applications

McCallum et al. applied their model to matching restaurant names and addresses and also to matching bibliographic citations. In contrast, I was interested in matching short reads against reference genomes (for DNA sequencing) or gene models (for mRNA sequencing).

### What’s Being Modeled?

McCallum et al.’s goal was to model the probability that two strings match. This reduces the problem to a binary classification problem.

In contrast, what I was modeling in the last post is the probability of a query/read sequence being generated from a reference sequence. The reads are more like the tags generated in a tagging CRF. The alignment is a local alignment of the read to the reference.

We both approached the problem by first defining a notion of an alignment sequence and its relation to the strings. They modeled the probability of a global alignment given both strings, whereas I modeled the probability of a read locally aligned to a reference sequence.

The difference is in normalization terms, both of which are easy to calculate via dynamic programming.

As they point out in their discussion, what they did is to pair HMMs (i.e. as used by [Ristad and Yianilos 1998]), as CRFs are to regular HMMs. I turned a similar crank to estimate a related probability.

### Other Innovations

What I really like about McCallum et al.’s approach is that it allows chunked matching operations. Although they don’t consider affine gap models (where longer gaps aren’t that much more unlikely than shorter gaps), they do consider token-sensitive operations like deleting a token that appears in both strings or in a dictionary, and consider deleting to the end of the current token (useful for matching acronyms). The possibilities are endless.

With their CRF FST implementation in Mallet, I’m guessing it was as easy to code as the paper makes it seem. The FST coding is straightforward, with the only problem being the usual one of explosion of dependencies. Neatly (for both McCallum et al. and me), the sum-product (forward/back) and max-product (Viterbi) dynamic programming algorithms fall out of the encoding.

They don’t allow non-local transpositions (as in, say a soft TF/IDF-like approach to matching), so everything remains reasonably tractable.

They note that the estimation problem is not concave, because they use latent variables (alignments and states aren’t directly observed here). I’m pretty sure I’ll have the same problem.

They treat matches and mismatches as a mixture model, with two distinct subgraphs of the FSTs. This allows them to have separate features as in a max entropy or discrete choice (DCA) model and unlike in a logistic regression-type approach where there are no outcome-specific features.

Section 5.1 provides a hint at the kinds of features they use to set the values for the various edit operations, like is a numerical character substituted for another numerical character. Curiously, there are no features for the actual characters being matched, just classes. Maybe they didn’t have enough data.

What I added that they didn’t have was a different normalization and features for the alignment position of local alignments (useful for modeling positional or polymer biases in reads).

### Direct Regression

(Tsuruoka, McNaught, Tsujii and Ananiadou 2007) introduced an approach to string similarity that was a straight-up logistic regression based on features extracted from the two strings. These features were things like n-grams.

This also sounds like a good idea. I’m guessing the error patterns would be dissimilar enough that the two classifies could be fruitfully combined.

The Tsuruoka model would be easy to implement in LingPipe. I had to do a custom implementation for my own model decoder and haven’t built the estimator yet.

### Further Applications and Generalizations

I’d love to see something like this applied to the BioCreative III gene linkage task (and thank you, organizers for switching to camel case from seemingly random case). You’d have to generalize somewhat, because you’re matching against a whole set of known aliases for a gene. The discriminative training would be a huge help to sort out common phrases, the importance of numbers, the correspondence of Greek and Latin characters, etc.

McCallum et al. only had a chance to scratch the surface of the feature space that might be useful for this and related problems. I’m surprised now that I haven’t seen more work building on this.

It’d also be easy to take any of the pair HMM applications to things like phonology or morphology that are out there.

### 4 Responses to “McCallum, Bellare and Pereira (2005) A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance”

1. Yoav Says:

Nice post (also the previous ones).

My comment might be really naive, but if it bothers me a lot lately, and I would really like to get an answer for it..

What bothers me about those conditionally trained CRFs (and also MaxEnt, for that matter) is that I don’t understand why I should trust the resulting models as probability distributions more than, say, linear models trained using perceptron and then having the the resulting scores exponentiated and normalized only at prediction time?

Is there any good reason to believe that we can really use probabilities from CRFs as part of a larger inference system (say by combining them with noisy input probabilities, as mentioned in the previous post, or any other kind of prior knowledge in the form of probability distribution)? is there any reason to believe this is really the correct approach to take?

• lingpipe Says:

I’ve been thinking about this issue a lot lately, and would love to hear other’s experiences or analyses.

In practice, we’ve found logistic regression classification over text with character n-gram features to be pretty well calibrated. Specifically, we’ve estimated some large-scale multinomial logit models and their predicted probabilities correlate well with probabilities of errors on held out data. This contrasts with naive Bayes, which is notoriously uncalibrated.

For concreteness, consider binary logistic regression (aka “max ent” with no outcome-category-specific features) rather than CRFs, because they’re simpler and more easily related to the traditional use of perceptrons. The same argument applies to CRFs (which are like multinomial logistic regression) versus structured SVMs or perceptrons.

The reason to trust logistic regression more than perceptrons when evaluating probabilistically is that they’re trained with log loss, which pushes the coefficient vectors toward values that mirror the training set probabilities. That is, there’s a penalty for assigning probabilities less than 1 to true cases and probabilities greater than 0 to false cases. This assumes we have a binary training set (i.e. we don’t have probabilistic training, where there’s some uncertainty of an item’s category).

Consider a really simple linearly separable training set with a handful of examples that perceptrons classify perfectly after one pass. All we know is that the value of true cases will be greater than 1/2 and false cases less than 1/2. The probabilities for true cases will almost certainly be much less than 1 (though still greater than 1/2), and for false cases will be much greater than 0 (though still less than 1/2). That’s because perceptrons minimize 0/1 loss, and don’t care what the scores are as long as they’re on the right side of 1/2.

For logistic regression, as we increase the number of epochs in an iterative estimator (e.g. gradient descent optimizers such as SGD or quasi-Newton methods like L-BFGS), we’ll approach values of 0.0 for false cases and 1.0 for true cases. Of course, we’ll never actually converge (except to within epsilon), because we never quite get to 0 or 1 due to the shape of the logistic sigmoid.

The problem is that with only 0 or 1 target values in the training set, there’s really no reason to assume that probabilities for cases near the decision boundary have reasonably estimated probabilities. That’s because there’s no reason to believe the logistic sigmoid interpolates well. And given how flat the long tails are, there’s some reason to believe it won’t be well calibrated.

Adding informative priors will help with convergence, though keep in mind that with Laplace priors (aka L1 regularization) and perfectly correlated features, the model’s still not identified. (It is identified with Gaussian or elastic net priors.) We’ve also found it helps a bit with calibration.

As I said earlier, I’m curious what others have found. I’m afraid we haven’t run on any standard data sets we can share.

2. Yoav Says:

Two quick empirical notes:

– about the (averaged) perceptron: empirically, it doesn’t behave as you describe. In my experience (all involving tons of features), the exponentiated and normalized perceptron scores are actually much closer to 0 or 1 (and more accurate..) than the corresponding logistic regression models. This does not mean they good at assigning probabilities, just that they are not centered around 1/2 as you predicted. And if going for a maximum-margin kind of algorithm, than you are practically guaranteed to be far from 1/2.
Behavior may be different with fewer features.

– a (sort of) empirical demonstration that CRF probabilities do not neccesarilly combine well with other probabilities can be seen in this paper bu Cohen and Smith (http://www.cs.cmu.edu/~scohen/joint+emnlp07.pdf). While I don’t quite like the modeling in the paper (some things are modeled twice), it does show that a combination of CRF and another probabilistic model benefits from some scaling. Their joint models are a “product-of-experts”, of the form log(PCFG_prob) + alpha*log(CRF_prob), and they get substential gains in accuracy by tuning the alpha value (though as I said, I am not quite certain if this is true for CRFs in general, or just a byproduct of the particular model assumptions they make).

• lingpipe Says:

Thanks for the empirical feedback and paper pointer.

My perceptron example was purely theoretical. As I mentioned, I haven’t tried it out empirically.

An explanation (again theoretical) of why perceptrons or logistic regression can get into trouble is with unbounded coefficients. If you have a feature that only occurs in one category of example, it can run off to infinity in maximum likelihood logistic regression or CRFs, and even in perceptrons if the problem’s not separable are other features running interference and lots of epochs.

What do you mean by max margin? Perceptrons aren’t max margin. With SVM’s hinge loss, once you’re a bit beyond the correct side of the decision boundary, there’s zero error in SVMs, which isn’t the case for CRFs. I’m not up on all the non-probabilistic linear models.

How different is an averaged perceptron from a perceptron empirically? I’d expect the averaged one would have more stable estimates.

A big problem with all of these approaches is that we’re fitting a model to the training data, not unseen new data. That’s one of the reasons regularization or priors (or early stopping in perceptrons/SVM) help in practice. Without regularization, coefficients can shoot off toward infinity, which can push estimates for held out data toward 0 or 1.

As to the S B Cohen and N Smith paper, I’m a big believer in joint inference. It seems far more sensible than pipelining because you can account for interactions. I would think you would get the same kind of label bias problem in pipelined log-linear models as you’d get for HMMs.

It also looks like they used maximum likelihood estimates (or finite epoch approximations thereof) and that the blending was of unnormalized models with an oracle estimate, so I’m not sure what conclusions we can draw from it, but I may be misunderstanding this in a quick read.

As you point out, it may also be the model(s). It’s always possible to build a model with bad predictive power that thinks it has good predictive power (partly, it’s the training versus test split). It’s tricky to combine a very rich model like parsing with a simpler model like morphology. The same thing is problematic when stacking PCFGs and HMM POS taggers — the PCFGs just aren’t very good models. For instance, PCFGs by themselves aren’t very good POS taggers, nor are they very good language models. Also, as you point out for the Cohen and Smith paper, you can wind up predicting the POS tags and/or lexical items twice.

It’s pretty well known that naive Bayes needs to be calibrated because of their faulty independence assumptions. Ditto for combining acoustic models and language models in speech rec, because the independence assumptions in the acoustic model are so much worse than in the language model.

I think the best way to characterize how well the model’s doing in terms of prediction is to model log loss on held out data. First-best evaluation isn’t even what’s being optimized with a CRF or logistic regression! A proxy would be some characterization of the relation between predicted probabilities and error rates if you’re using first-best analyses.