In my last post on Probabilistic Edit Channel Models, I tried to motivate a properly calculated marginal probability of an observation (e.g., a read) given a source (e.g., a reference sequence).

In this post, I’ll show you how to build a properly normalized probabilistic generalization of the well-known Needleman-Wunsch algorithm for global alignment that computes full marginal probabilities rather than best path probabilities. Global alignment matches all of its two input sequences.

This approach is easily generalized to the Smith-Waterman algorithm for local alignments or to fully matching a read sequence against a subsequence of a reference sequence. See the last section below.

### Edit Operations

We’ll assume a simple alphabet of bases, {A,C,G,T}. We’ll also assume four possible actions given a reference sequence and a position in that reference sequence:

- delete the next symbol from the reference (advance the ref position)
- insert a symbol into the read (don’t advance the ref position)
- generate a matched symbol for the read (advance ref)
- generate a mismatched symbol for the read (advance ref)

There’s only one way to delete, which we’ll write D. There are four possiblities for insertion depending on the base inserted, which we’ll write IA, IC, IG, IT. There are a total of four possibilities for substitution and matching, depending on the base generated in the read, SA, SC, SG, ST; the one with the symbol corresponding to the next reference symbol is the match case.

### Probabilistic Edits

Given a reference sequence, we generate a read sequence by iterating over the reference sequence base by base, and for each base, performing one of the above operations. If the operation is not a delete, the position in the reference sequence advances.

For instance, consider the alignment we introduced in the previous post,

```
A C G T - A C G
| D | | I | S |
A - G T T A G G
```

which is generated by the sequence of operations:

Reference Position | Read Generated | Operation | Prob |
---|---|---|---|

^ACGTACG | |||

A^CGTACG | A | SA | p(SA | A) |

AC^GTACG | A | D | p(D | C) |

ACG^TACG | AG | SG | p(SG | G) |

ACGT^ACG | AGT | SG | p(ST | T) |

ACGT^ACG | AGTT | IT | p(IT | T) |

ACGTA^CG | AGTTA | SA | p(SA | A) |

ACGTAC^G | AGTTAG | SA | p(SG | C) |

ACGTACG^ | AGTTAGG | SG | p(SG | G) |

What we need to estimate is the conditional probability of one of the nine edit operations given the next base in the input. We’ll assume the edit probabilities are known. Typically, these edits are inferred using little or no training data through the EM algorithm.

The sequence of edit operations is: σ = [SA, D, SG, SG, IT, SA, SA, SG]. Given the reference sequence ACGTACG and the edit operations σ, the read AGTTAGG is uniquely determined. With this clean generative model, given an input reference sequence, the probability of all read sequences sums to 1.0. Of course, we will know the read in most of what we do and only need to compute its probability.

The probability of that sequence of edit operations given the reference sequence ACGTACG is given by the product of all the probabilities in the rightmost column above:

= p(SA|A) p(D|C) p(SG|G) p(ST|T) p(IT|T) p(SA|A) p(SG|C) p(SG|G)

What we’re interested in calculating here is p(AGTTAGG | ACGTACG), namely the probability of the read given the reference sequence. This is done by summing over all possible alignments:

Note that this probability will be zero if the edit sequence and reference don’t generate the read .

The usual alignment programs compute the best alignment, namely:

We could use a regular in place of to compute the probability of the best alignment instead of the alignment itself.

Note that we take the maximum probability; usually the algorithm is formulated with edit costs and it takes the minimum cost. We could convert to the usual setup with minimization by reformulating in terms of negative log probabilities.

### Needleman-Wunsch Algorithm

The Needlman-Wunsch algorithm is a classic instance of dynamic programming where the distances computed for shorter substrings are extended in the inner loop to distances for longer strings. The first two loops deal with the boundary conditions of all-insertions and all-deletions

double nw(byte[] reads, byte[] refs) { int[][] d = new int[refs.length+1][reads.length+1]; d[0][0] = 0.0; for (int i = 1; i < refs.length; ++i) d[0][i] = d[0][i-1] + log p(D|refs[i]); for (int j = 1; j < refs.length; ++j) d[j][0] = d[j-1][0] + log p(I_reads[j] | refs[0]) for (int i = 1; i < refs.length; ++i) for (int j = 1; j < reads.length; ++j) d[j][i] = max( d[i][j-1] + log p(D | refs[i]), d[j-1][i] + log p(I_reads[j] | refs[i]), d[j-1][i-1] + log p(S-reads[j] | refs[i]) ); return d[refs.length][reads.length]; }

Given its simple loop, this algorithm clearly takes time proportional to read time reference length.

### Marginal Needleman-Wunsch

I’ll call this the marginal Needleman-Wunsch, but if someone knows a better name, let me know. It bears the same relation to the standard N-W algorithm as the forward algorithm for HMMs compares to the Viterbi algorithm.

double nw(byte[] reads, byte[] refs) { int[][] d = new int[refs.length+1][reads.length+1]; d[0][0] = 0.0; for (int i = 1; i < refs.length; ++i) d[0][i] = d[0][i-1] + log p(D|refs[i]); for (int j = 1; j < refs.length; ++j) d[j][0] = d[j-1][0] + log p(I_reads[j] | refs[0]) for (int i = 1; i < refs.length; ++i) for (int j = 1; j < reads.length; ++j) d[j][i] =logSumExps( d[i][j-1] + log p(D | refs[i]), d[j-1][i] + log p(I_reads[j] | refs[i]), d[j-1][i-1] + log p(S-reads[j] | refs[i]) ); return d[refs.length][reads.length]; }

The only change is that we’re calling the familiar log sum of exponentials function here rather than max. Recall the definition:

We use this because in our case, x, y, and z are log probabilities, so that exp(x), exp(y), and exp(z) are linear probabilities, so that their sum is a linear probability, and hence the final result is the total probability of the paths converted back to the log scale. We don’t just maintain linear probabilities because we’re likely to hit underflow problems. (This may not actually be an issue for the limited number of errors allowed in these models; if not, then we just need a sum of the linear probabilities, which will be much faster to execute computationally.)

That’s it. We’ve computed the (log of) the sum of all path probabilities.

### When the Reference is Exhausted

One small detail: we need to be able to generate symbols in the read after the reference is exhausted, so we have to include that option in the algorithm. That is, we need five discrete distributions over edit operations, one for each next base {A,C,G,T} in the reference, and one to use when the reference is exhausted.

Although not conceptually difficult or expensive to compute, it complicates the boundary conditions for the algorithm and the conceptualization of the algorithm, so the algorithm isn’t technically correct as stated. To correc it, the probability terms need to be fiddled to account for the boundary.

### Generalizing to Local Alignments

The Smith-Waterman-like generalization to local alignments requires allowing zero-cost deletes on the first and last row of the edit matrix, and zero-cost inserts on the first and last column of the matrix.

This isn’t quite properly normalized, but can be if we assume a uniform model of read start and end positions.

We can also restrict local alignments in the usual way to the case where the read completely matches a substring of the reference sequence by only allowing zero-cost deletes on the first and last row of the distance matrix.

### Multiple Edits and Correlated Errors

The model above is Markovian; only the current reference symbol determines the output symbol.

This assumption may be relaxed to allow compound edit operations instead of the simple Markovian insertion, substitution, and deletion operations of the Needleman-Wunsch algorithm.

For example, University of Toronto’s SHRiMP alignment package considers pairwise operations generated due to the AB SOLiD platform’s colorspace reads (there’s a shortage of capital “I”s for names, hence “Alias-i”). With this approach, SHRiMP can model the the correlated errors from the SOLiD colorspace reads relative to a reference sequence due to to singular nucleotide polymorphisms (SNPs) in the samples relative to the reads (which induce two sequential, correlated, color space read errors).

SHRiMP’s generalization of Needleman-Wunsch is similar both algorithmically and conceptually to the phonetic extensions to he edit distance component of the noisy channel model for natural language spelling correction proposed by Kristina Toutanova and Bob Moore in their 2002 ACL paper, Pronunciation Modeling for Improved Spelling Correction.