Following up on my previous post on alignment with CRFs, I’ve found a related paper that I’d like to review here. I’ve spent the last 20 years following Fernando around the intellectual landscape, from logic programming and grammatical formalisms to statistics and biology. (So it’s perhaps not surprising that we’re both speaking next week at the Edinburgh HCRC and Cog Sci Reunion.)
- McCallum, Andrew, Kedar Bellare and Fernando Pereira. 2005. A conditional random field for discriminatively-trained finite-state string edit distance. In UAI.
I have to admit I’d read this paper in the dim and distant past (OK, it’s from 2005), so something might’ve stuck in my head. But once you grok CRFs, building new models is easy, because, as I said last time, CRFs are really good at normalizing sequential labeling or classification problems.
McCallum et al. applied their model to matching restaurant names and addresses and also to matching bibliographic citations. In contrast, I was interested in matching short reads against reference genomes (for DNA sequencing) or gene models (for mRNA sequencing).
What’s Being Modeled?
McCallum et al.’s goal was to model the probability that two strings match. This reduces matching to a binary classification problem.
In contrast, what I was modeling in the last post was the probability of a query/read sequence being generated from a reference sequence. The reads are more like the tags generated by a tagging CRF. The alignment is a local alignment of the read to the reference.
We both approached the problem by first defining a notion of an alignment sequence and its relation to the strings. They modeled the probability of a global alignment given both strings, whereas I modeled the probability of a read locally aligned to a reference sequence.
The difference is in normalization terms, both of which are easy to calculate via dynamic programming.
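To make the normalization difference concrete, here’s a toy sketch (mine, not either paper’s): a local normalizer sums alignment scores over every position at which the whole read could sit in the reference, whereas a global normalizer would sum over full alignments of both strings. Gaps are omitted and the per-base log-probabilities are made up, just to keep the sketch short.

```python
import math

def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(v - m) for v in xs))

# Illustrative per-base log-probabilities (assumed, not from either paper).
LP_MATCH = math.log(0.99)
LP_MISMATCH = math.log(0.01 / 3)

def read_span_score(read, ref_span):
    """Log-score of an ungapped alignment of the read to one reference span."""
    return sum(LP_MATCH if r == g else LP_MISMATCH
               for r, g in zip(read, ref_span))

def local_log_z(read, ref):
    """Local normalizer: sum over all start positions in the reference."""
    k = len(read)
    return logsumexp([read_span_score(read, ref[i:i + k])
                      for i in range(len(ref) - k + 1)])

# A read that fits at two reference positions contributes mass from both.
z = local_log_z("AC", "ACAC")
```

The real models, of course, sum over gapped alignments too, which is where the dynamic programming comes in.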
As they point out in their discussion, their model stands to pair HMMs (as used by Ristad and Yianilos 1998) as CRFs stand to regular HMMs. I turned a similar crank to estimate a related probability.
What I really like about McCallum et al.’s approach is that it allows chunked matching operations. Although they don’t consider affine gap models (where longer gaps aren’t much more unlikely than shorter gaps), they do consider token-sensitive operations, like deleting a token that appears in both strings or in a dictionary, and deleting to the end of the current token (useful for matching acronyms). The possibilities are endless.
With their CRF FST implementation in Mallet, I’m guessing it was as easy to code as the paper makes it seem. The FST coding is straightforward, with the only problem being the usual explosion of dependencies. Neatly (for both McCallum et al. and me), the sum-product (forward/backward) and max-product (Viterbi) dynamic programming algorithms fall out of the encoding.
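Here’s a minimal sketch of that “falling out” (my own toy weights, not the paper’s model): the same lattice recursion over edit operations computes either the log partition function or the best-alignment score, depending only on whether you accumulate with log-sum-exp or max.

```python
import math

def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(v - m) for v in xs))

def lattice_score(x, y, w_match, w_sub, w_gap, op):
    """Score the edit lattice of x and y with per-operation log-weights.
    op = logsumexp gives the forward (sum-product) value; op = max gives
    the Viterbi (max-product) value. Weights here are illustrative."""
    m, n = len(x), len(y)
    f = [[None] * (n + 1) for _ in range(m + 1)]
    f[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0 and j > 0:  # substitution or match
                w = w_match if x[i - 1] == y[j - 1] else w_sub
                cands.append(f[i - 1][j - 1] + w)
            if i > 0:            # delete x[i-1]
                cands.append(f[i - 1][j] + w_gap)
            if j > 0:            # insert y[j-1]
                cands.append(f[i][j - 1] + w_gap)
            f[i][j] = op(cands)
    return f[m][n]

x, y = "kitten", "sitting"
log_z = lattice_score(x, y, 0.0, -2.0, -3.0, logsumexp)  # forward
best = lattice_score(x, y, 0.0, -2.0, -3.0, max)         # Viterbi
assert best <= log_z  # the best alignment is bounded by the log-sum
```

Swapping the operator is the whole difference between the two algorithms, which is why both come for free once the lattice is encoded.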
They don’t allow non-local transpositions (as in, say, a soft TF-IDF-like approach to matching), so everything remains reasonably tractable.
They note that the estimation problem is not concave, because they use latent variables (alignments and states aren’t directly observed here). I’m pretty sure I’ll have the same problem.
They treat matches and mismatches as a mixture model, with two distinct subgraphs of the FSTs. This allows them to have outcome-specific features, as in a max-entropy or discrete choice (DCA) model, and unlike a logistic regression-type approach, where there are no outcome-specific features.
Section 5.1 provides a hint at the kinds of features they use to set the values for the various edit operations, such as whether one numerical character is substituted for another. Curiously, there are no features for the actual characters being matched, just their classes. Maybe they didn’t have enough data.
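To illustrate what class-based edit-operation features look like, here’s a hypothetical sketch in that spirit. The classes and feature names are my own invention, not the paper’s.

```python
# Hypothetical class-based substitution features: features fire on
# character *classes* (digit, letter, punctuation) rather than on the
# literal characters, as Section 5.1 hints. Names are my own.

def char_class(c):
    if c.isdigit():
        return "digit"
    if c.isalpha():
        return "letter"
    return "punct"

def substitution_features(a, b):
    """Feature dict for the operation of substituting character b for a."""
    feats = {"subst": 1.0}
    feats["subst:%s-for-%s" % (char_class(b), char_class(a))] = 1.0
    if a == b:
        feats["identical"] = 1.0
    if a.lower() == b.lower():
        feats["case-insensitive-match"] = 1.0
    return feats

# Substituting '5' for '6' fires the digit-for-digit class feature
# without any feature naming the literal characters.
feats = substitution_features("6", "5")
assert "subst:digit-for-digit" in feats
```

Backing off to classes like this is one way to cope with sparse training data, which may be exactly why they did it.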
What I added that they didn’t have was a different normalization and features for the alignment position of local alignments (useful for modeling positional or polymer biases in reads).
(Tsuruoka, McNaught, Tsujii and Ananiadou 2007) introduced an approach to string similarity that was a straight-up logistic regression based on features extracted from the two strings. These features were things like n-grams.
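A sketch of that flavor of feature extraction (my own simplification, not Tsuruoka et al.’s exact feature set): pull character n-grams from both strings and encode which are shared, producing a feature map you could hand to any logistic regression package.

```python
# Tsuruoka-style string-pair features (simplified, assumed): character
# n-grams of each string, with features for shared vs. unshared grams.

def char_ngrams(s, n):
    """All character n-grams of s, with '#' boundary markers."""
    s = "#" + s + "#"
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def pair_features(s1, s2, n=2):
    """Binary feature dict over the n-grams of a string pair, suitable
    as input to a plain logistic regression classifier."""
    g1, g2 = char_ngrams(s1, n), char_ngrams(s2, n)
    feats = {}
    for g in g1 & g2:
        feats["shared:" + g] = 1.0
    for g in g1 ^ g2:
        feats["unshared:" + g] = 1.0
    return feats

feats = pair_features("Smith", "Smyth")
assert "shared:Sm" in feats and "unshared:my" in feats
```

Note there’s no alignment at all here, which is what makes it a straight-up classification approach rather than a latent-variable model.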
This also sounds like a good idea. I’m guessing the error patterns would be dissimilar enough that the two classifiers could be fruitfully combined.
The Tsuruoka model would be easy to implement in LingPipe. I had to do a custom implementation for my own model decoder and haven’t built the estimator yet.
Further Applications and Generalizations
I’d love to see something like this applied to the BioCreative III gene linkage task (and thank you, organizers, for switching to camel case from seemingly random case). You’d have to generalize somewhat, because you’re matching against a whole set of known aliases for a gene. The discriminative training would be a huge help to sort out common phrases, the importance of numbers, the correspondence of Greek and Latin characters, etc.
McCallum et al. only had a chance to scratch the surface of the feature space that might be useful for this and related problems. I’m surprised now that I haven’t seen more work building on this.
It’d also be easy to apply this to any of the pair HMM applications out there, like phonology or morphology.