This post is about probabilistic edit distances used as noisy channel models with applications to biological short-read sequencing and natural language spelling correction. (Spoiler: Normalization is over channel outputs and the forward (sum) algorithm is required instead of the usual Viterbi (max) algorithm.)

### Spelling Correction and Chinese Tokenization

LingPipe models both spelling correction and Chinese tokenization (word splitting) using noisy channel models, and we have tutorials covering both applications.

The two components of a noisy channel model are the source, from which messages consisting of strings are generated probabilistically, and the channel, which models the probability $p(y \mid x)$ of observing the signal $y$ when the message is $x$. For spell checking, the source model of what people are likely to say may be estimated using document data and/or query logs, and the channel model of possible brainos and typos may be estimated using matched query/correction data (which itself may be inferred via EM from query logs and/or document data). For Chinese, the channel deterministically deletes spaces, which requires probabilistic reconstruction for the observer, who doesn’t get word boundary information.

LingPipe’s spelling package implements (confusion-matrix weighted) edit distance, as well as other models such as n-gram comparison. This is all described in the spelling tutorial as it relates to spell checking.

LingPipe doesn’t enforce a probabilistic interpretation of edit distance. Furthermore, decoding of the intended message given the observed signal is implemented using the Viterbi approximation, which estimates the channel probability based on the single best alignment between the message $x$ and the observed signal $y$.

### Mature mRNA Sequencing

Mitzi’s lab’s working on sequencing the untranslated end regions of mature mRNA (aka the 3′ UTR) in C. elegans (aka the nematode worm), a widely studied model organism. The 3′ UTR is the sequence of bases after the stop codon and before the poly-A tail found on mature mRNA.

### Short Read Sequencers

I’ve been talking with Mitzi lately about the next generation short-read DNA sequencers, like Illumina and SOLiD. Perhaps it’s not surprising, given that it’s locally coherent string data, that the same techniques are used in bioinformatics as in computational linguistics.

### Isoform Expression Models

I’ve been formulating Bayesian models for transcriptome analysis, in particular isoform (aka alternative splicing) expression. As usual, there’s an underlying generative model of the process. One of the components of this model is how the read observed from the device (the observed data) is generated from the underlying biological sequence (the estimand of interest).

The noisy channel model is appropriate for this setting; I just need to make it probabilistic.

### Levenshtein Distance

The simplest models of spelling correction or genetic sequencing involve simple Levenshtein distance calculations. The Levenshtein distance is the minimum number of insert, delete or substitution operations required to turn one sequence into the other. It’s symmetric. But it’s not probabilistic.
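The textbook dynamic program for Levenshtein distance can be sketched in a few lines. This is a minimal Python version (not LingPipe’s implementation) that keeps only two rows of the table:

```python
def levenshtein(s, t):
    """Minimum number of insert, delete, or substitute operations turning s into t."""
    m, n = len(s), len(t)
    prev = list(range(n + 1))  # distances between s[:0] and each prefix of t
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete s[i-1]
                          curr[j - 1] + 1,     # insert t[j-1]
                          prev[j - 1] + cost)  # match or substitute
        prev = curr
    return prev[n]

print(levenshtein("ACGTACG", "AGTTAGG"))  # prints 3
```

Note that the distance is symmetric, as the post says, because each insert in one direction corresponds to a delete in the other.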

### Probabilistic Channels

It’s actually quite simple to make the channel probabilistic. Just make sure all the moves from a given state sum to 1.0. A state corresponds to a prefix of the reference sequence $x$. Legal moves are to skip (delete) the next symbol from the message $x$, to add (insert) a spurious symbol into the observation $y$, to match the next symbols of $x$ and $y$, or to substitute a different symbol in $y$ for the next symbol of $x$.

### Viterbi versus Forward Algorithm

The usual approach to this calculation is to make the so-called Viterbi assumption and look at the probability of the single best sequence of moves that consumes the message sequence and produces the observed sequence. Given a sequence of edit operations, we can generate an alignment between the sequences. For instance, consider matching the message ACGTACG to the observed signal AGTTAGG. The following alignment consists of a match of A, a delete of C, two matches GT, an insertion of a T, a match on A, a substitution of a G for a C, and a final match on G:

```
ACGT-ACG
|D||I|S|
A-GTTAGG
```

To make the probabilities add up to 1.0 for the moves from a particular state, we need to subdivide the inserts and substitutions. With the genetic alphabet of symbols {A, C, G, T}, the 9 possible moves are subdivided by output, I(A), I(C), I(G), I(T), D, S(A), S(C), S(G), S(T). We have probabilities for these outputs for each symbol. In the genetic case, that’s four multinomials of order 9.

Note that we’ve included the match among the substitution operations for simplicity. If we know the next output symbol from a state, only one of the four substitutions will be legal — the one that outputs the next output symbol.
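As a concrete (and entirely hypothetical) parameterization, here is a Python sketch of the four multinomials of order 9: for each next reference symbol, one distribution over the four inserts, the delete, and the four substitutions (the substitution that outputs the reference symbol itself being the match). The numbers are made up; the point is that the moves from each state sum to 1.0:

```python
ALPHABET = "ACGT"

def make_channel(p_match=0.96, p_delete=0.01, p_insert=0.01, p_sub=0.02):
    """Build per-reference-symbol move distributions.

    Caller must ensure p_match + p_delete + p_insert + p_sub == 1.0;
    insert and substitution mass is split uniformly over symbols.
    """
    channel = {}
    for ref in ALPHABET:
        moves = {}
        for b in ALPHABET:                       # four inserts I(b)
            moves["I(%s)" % b] = p_insert / 4.0
        moves["D"] = p_delete                    # one delete
        for b in ALPHABET:                       # four substitutions S(b)
            moves["S(%s)" % b] = p_match if b == ref else p_sub / 3.0
        channel[ref] = moves
    return channel

channel = make_channel()
for ref, moves in channel.items():
    assert abs(sum(moves.values()) - 1.0) < 1e-12  # moves from a state sum to 1
```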

But there are many more alignments for the same pair of strings. Some even have the same Levenshtein distance. For instance, we can align the two sequences using three substitutions:

```
ACGTACG
|SS||S|
AGTTAGG
```

Probabilistic edit distance assigns a probability to a path of operations, which corresponds to an alignment, say $a$. Given the underlying message (or reference sequence) $x$, the observed sequence $y$ is determined by the edit operations $a$.

What we need to do is integrate (or sum) out the alignments in the usual way in order to calculate the probability of the observed signal $y$ (query/read) given the message $x$ (meaning/reference sequence):

$p(y \mid x) = \sum_a p(y, a \mid x)$

The Viterbi approximation is:

$p(y \mid x) \approx \max_a p(y, a \mid x)$

The approximation is only reasonable from a probabilistic point of view if the best alignment accounts for most of the probability mass, or if we are only computing quantities up to proportionality and the best alignment’s probability is a roughly constant fraction of the total generation probability.

Simply using the forward algorithm instead of the Viterbi algorithm computes this in $\mathcal{O}(mn)$ time (and space), where $m$ is the length of $x$ and $n$ is the length of $y$.
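The two algorithms share the same dynamic program and differ only in whether incoming paths are summed or maxed. Here is a minimal Python sketch with made-up move probabilities, ignoring stop/length effects for simplicity; pass `agg=sum` for the forward algorithm and `agg=max` for Viterbi:

```python
def channel_prob(x, y, p_match=0.96, p_sub=0.02 / 3, p_ins=0.01 / 4, p_del=0.01,
                 agg=sum):
    """Probability of observing y given message x, aggregating over
    alignments with agg=sum (forward) or agg=max (Viterbi)."""
    m, n = len(x), len(y)
    # f[i][j] aggregates alignments of x[:i] with y[:j]
    f = [[0.0] * (n + 1) for _ in range(m + 1)]
    f[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            paths = []
            if i > 0:                       # delete x[i-1]
                paths.append(f[i - 1][j] * p_del)
            if j > 0:                       # insert y[j-1]
                paths.append(f[i][j - 1] * p_ins)
            if i > 0 and j > 0:             # substitute (match as special case)
                p = p_match if x[i - 1] == y[j - 1] else p_sub
                paths.append(f[i - 1][j - 1] * p)
            f[i][j] = agg(paths)
    return f[m][n]

forward = channel_prob("ACGTACG", "AGTTAGG", agg=sum)
viterbi = channel_prob("ACGTACG", "AGTTAGG", agg=max)
assert viterbi <= forward  # the best alignment is just one term in the sum
```

The gap between the two numbers is exactly the mass contributed by all the non-best alignments, which is what the Viterbi approximation throws away.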

We can actually tighten up the space requirements to $\mathcal{O}(\min(m, n))$; this is how I did things for LingPipe’s implementation and what I contributed for Lucene’s fuzzy matcher. We can tighten even further if we have an upper bound on the maximum number of edits to consider, say $d$. Then the time complexity reduces to $\mathcal{O}(d \cdot \max(m, n))$. (As for all things string-algorithm related, see Gusfield’s string algorithms book and Durbin, Eddy, Krogh and Mitchison’s probabilistic string matching book.)

Of course, we’ll compute it all on a log scale so the products become sums. This will require our old friend the log sum of exponentials operation.
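The log sum of exponentials is easy to implement stably by factoring out the maximum term before exponentiating; a minimal sketch:

```python
import math

def log_sum_exp(log_probs):
    """Stable computation of log(sum(exp(lp) for lp in log_probs))."""
    mx = max(log_probs)
    if mx == float("-inf"):   # all terms are zero probabilities
        return mx
    return mx + math.log(sum(math.exp(lp - mx) for lp in log_probs))

# log(0.5) summed with log(0.5) recovers log(1.0) = 0.0
assert abs(log_sum_exp([math.log(0.5), math.log(0.5)])) < 1e-12
```

Subtracting the maximum guarantees every exponentiated term is at most 1.0, so nothing overflows even when the log probabilities are hugely negative.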

### Filtering

For the biological sequence matching problem, we’re facing on the order of 100M reads of 50 bases each, matching against a set of 30K isoforms of 1.5K bases each. This is prohibitive for the forward algorithm. There are various ways to share the prefix operations (again, see Gusfield’s book).

Typically, though, practitioners use n-gram sequence indexes to spot potential matches, only considering sequences with good n-gram overlap, thus eliminating most of the work up front. This can be done either heuristically, as in BLAST (described in both books), or exactly, as in SHRiMP (the latter link is to a paper with a good description of the SOLiD color space reads and uses automaton composition to compose noisy channels, one for color read errors, one for sequencing errors, and one for mutations).
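Here is a toy Python sketch of the basic idea (not BLAST or SHRiMP, just an inverted index from n-grams to sequences, with made-up parameter values):

```python
from collections import defaultdict

def ngrams(seq, n=8):
    """Set of all length-n substrings of seq."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}

def build_index(references, n=8):
    """Map each n-gram to the set of reference ids containing it."""
    index = defaultdict(set)
    for ref_id, seq in references.items():
        for g in ngrams(seq, n):
            index[g].add(ref_id)
    return index

def candidates(read, index, n=8, min_hits=2):
    """References sharing at least min_hits n-grams with the read;
    only these go on to the expensive alignment step."""
    hits = defaultdict(int)
    for g in ngrams(read, n):
        for ref_id in index.get(g, ()):
            hits[ref_id] += 1
    return {ref_id for ref_id, count in hits.items() if count >= min_hits}
```

Most reads share almost no n-grams with most references, so the candidate sets are tiny compared to the full set of isoforms, which is what makes the subsequent alignment pass tractable.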

This is like the notion of blocking in database record linkage (including database deduplication), which we also tend to carry out using character n-grams for natural language sequences.

The transcendental math’s still going to dominate the calculations, so typically a Viterbi pass using Levenshtein distance is done after blocking with n-grams to further filter out unlikely alignments.

Biological sequence analysis sure looks a lot like natural language sequence analysis!
