Archive for the ‘Bioinformatics’ Category

Computing Marginal Channel Probabilities with Generalized Needleman-Wunsch

December 8, 2009

In my last post on Probabilistic Edit Channel Models, I tried to motivate a properly calculated marginal probability of an observation (e.g., a read) given a source (e.g., a reference sequence).

In this post, I’ll show you how to build a properly normalized probabilistic generalization of the well-known Needleman-Wunsch algorithm for global alignment that computes full marginal probabilities rather than best-path probabilities. Global alignment aligns both of its input sequences in their entirety.

This approach is easily generalized to the Smith-Waterman algorithm for local alignments or to fully matching a read sequence against a subsequence of a reference sequence. See the last section below.

Edit Operations

We’ll assume a simple alphabet of bases, {A,C,G,T}. We’ll also assume four possible actions given a reference sequence and a position in that reference sequence:

  • delete the next symbol from the reference (advance the ref position)
  • insert a symbol into the read (don’t advance the ref position)
  • generate a matched symbol for the read (advance ref)
  • generate a mismatched symbol for the read (advance ref)

There’s only one way to delete, which we’ll write D. There are four possibilities for insertion depending on the base inserted, which we’ll write IA, IC, IG, IT. There are a total of four possibilities for substitution and matching, depending on the base generated in the read, SA, SC, SG, ST; the one whose symbol corresponds to the next reference symbol is the match case.
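To make the parameterization concrete, here’s a minimal sketch of the edit model in Java. The names (Edit, EditModel) and the 0..3 base encoding are mine for illustration, not from any particular package:

// The nine edit operations, with one probability distribution over them
// per reference base.
enum Edit { D, I_A, I_C, I_G, I_T, S_A, S_C, S_G, S_T }

class EditModel {
    // probs[refBase][edit.ordinal()] = p(edit | refBase);
    // refBase encodes A=0, C=1, G=2, T=3, and each row sums to 1.0.
    private final double[][] probs;

    EditModel(double[][] probs) { this.probs = probs; }

    double prob(Edit edit, int refBase) {
        return probs[refBase][edit.ordinal()];
    }

    double logProb(Edit edit, int refBase) {
        return Math.log(prob(edit, refBase));
    }
}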

Probabilistic Edits

Given a reference sequence, we generate a read sequence by iterating over the reference sequence base by base and, for each base, performing one of the above operations. If the operation is not an insertion, the position in the reference sequence advances.

For instance, consider the alignment we introduced in the previous post,

A C G T - A C G
| D | | I | S |
A - G T T A G G

which is generated by the sequence of operations:

Reference Position   Read Generated   Operation   Prob
^ACGTACG
A^CGTACG             A                SA          p(SA | A)
AC^GTACG             A                D           p(D | C)
ACG^TACG             AG               SG          p(SG | G)
ACGT^ACG             AGT              ST          p(ST | T)
ACGT^ACG             AGTT             IT          p(IT | T)
ACGTA^CG             AGTTA            SA          p(SA | A)
ACGTAC^G             AGTTAG           SG          p(SG | C)
ACGTACG^             AGTTAGG          SG          p(SG | G)

What we need to estimate is the conditional probability of each of the nine edit operations given the next base in the input. We’ll assume the edit probabilities are known. In practice, they are typically inferred with the EM algorithm using little or no labeled training data.

The sequence of edit operations is: σ = [SA, D, SG, ST, IT, SA, SG, SG]. Given the reference sequence ACGTACG and the edit operations σ, the read AGTTAGG is uniquely determined. With this clean generative model, given an input reference sequence, the probabilities of all read sequences sum to 1.0. Of course, in most of what we do we will already know the read and only need to compute its probability.

The probability of that sequence of edit operations given the reference sequence ACGTACG is given by the product of all the probabilities in the rightmost column above:

p(σ|ACGTACG)
= p(SA|A) p(D|C) p(SG|G) p(ST|T) p(IT|T) p(SA|A) p(SG|C) p(SG|G)
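Here’s a hedged sketch of that generative process, using the hypothetical EditModel above: walk the edits left to right, emit read bases for inserts and substitutions, advance the reference for deletes and substitutions, and accumulate the log probability as you go. (The sketch assumes the edit sequence never runs past the end of the reference; that case is discussed further below.)

import java.util.List;

class EditDemo {

    // map a base character to its 0..3 index (A, C, G, T)
    static int baseToIndex(char base) { return "ACGT".indexOf(base); }

    // the base emitted by an insert or substitution, read off the enum
    // name (e.g. I_T emits 'T', S_G emits 'G')
    static char emittedBase(Edit edit) {
        String name = edit.name();
        return name.charAt(name.length() - 1);
    }

    // Apply a sequence of edit operations to a reference, producing the
    // read and accumulating the log probability of the edit sequence.
    static String applyEdits(EditModel model, String ref, List<Edit> edits) {
        StringBuilder read = new StringBuilder();
        double logProb = 0.0;
        int pos = 0;                                  // position in the reference
        for (Edit edit : edits) {
            boolean insert = edit.name().startsWith("I");
            // deletes/substitutions are conditioned on the base being consumed;
            // inserts on the base just consumed, matching the worked example above
            int ctx = insert && pos > 0 ? pos - 1 : pos;
            logProb += model.logProb(edit, baseToIndex(ref.charAt(ctx)));
            if (edit == Edit.D) {
                ++pos;                                // delete: advance, emit nothing
            } else if (insert) {
                read.append(emittedBase(edit));       // insert: emit, don't advance
            } else {
                read.append(emittedBase(edit));       // substitute/match: emit and advance
                ++pos;
            }
        }
        System.out.println("log p(sigma | " + ref + ") = " + logProb);
        return read.toString();
    }
}

Calling applyEdits with the reference ACGTACG and σ = [S_A, D, S_G, S_T, I_T, S_A, S_G, S_G] returns AGTTAGG and accumulates exactly the log of the product above.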

What we’re interested in calculating here is p(AGTTAGG | ACGTACG), namely the probability of the read given the reference sequence. This is done by summing over all possible alignments:

p(\mbox{\tt AGTTAGG}|\mbox{\tt ACGTACG}) = \sum_\sigma p(\mbox{\tt AGTTAGG},\sigma | \mbox{\tt ACGTACG})

Note that this probability will be zero if the edit sequence \sigma and reference \mbox{\tt ACGTACG} don’t generate the read \mbox{\tt AGTTAGG}.

The usual alignment programs compute the best alignment, namely:

\arg\max_{\sigma} p(\mbox{\tt AGTTAGG},\sigma | \mbox{\tt ACGTACG})

We could use a regular \max_{\sigma} in place of \arg\max_{\sigma} to compute the probability of the best alignment instead of the alignment itself.

Note that we take the maximum probability; usually the algorithm is formulated with edit costs and it takes the minimum cost. We could convert to the usual setup with minimization by reformulating in terms of negative log probabilities.

Needleman-Wunsch Algorithm

The Needleman-Wunsch algorithm is a classic instance of dynamic programming in which the values computed for shorter prefixes are extended in the inner loop to values for longer prefixes. The first two loops handle the boundary conditions of all-deletions and all-insertions.

// Global alignment: log probability of the best alignment.
// logProbDelete, logProbInsert, and logProbSubst are assumed lookups
// into the edit model: log p(D | ref), log p(I_read | ref), and
// log p(S_read | ref) respectively.
double nw(byte[] reads, byte[] refs) {

    double[][] d = new double[refs.length + 1][reads.length + 1];

    d[0][0] = 0.0;

    // boundary: the first i reference bases are all deleted
    for (int i = 1; i <= refs.length; ++i)
        d[i][0] = d[i-1][0] + logProbDelete(refs[i-1]);

    // boundary: the first j read bases are all insertions
    for (int j = 1; j <= reads.length; ++j)
        d[0][j] = d[0][j-1] + logProbInsert(reads[j-1], refs[0]);

    for (int i = 1; i <= refs.length; ++i)
        for (int j = 1; j <= reads.length; ++j)
            d[i][j] = Math.max(d[i-1][j] + logProbDelete(refs[i-1]),
                      Math.max(d[i][j-1] + logProbInsert(reads[j-1], refs[i-1]),
                               d[i-1][j-1] + logProbSubst(reads[j-1], refs[i-1])));

    return d[refs.length][reads.length];
}

Given its simple nested loops, this algorithm clearly takes time proportional to read length times reference length.

Marginal Needleman-Wunsch

I’ll call this the marginal Needleman-Wunsch, but if someone knows a better name, let me know. It bears the same relation to the standard N-W algorithm as the forward algorithm for HMMs bears to the Viterbi algorithm.

// Marginal alignment: log of the sum over all alignment probabilities.
// The probability lookups are the same assumed helpers as above;
// logSumExps(x, y, z) = log(exp(x) + exp(y) + exp(z)).
double nw(byte[] reads, byte[] refs) {

    double[][] d = new double[refs.length + 1][reads.length + 1];

    d[0][0] = 0.0;

    // boundary: the first i reference bases are all deleted
    for (int i = 1; i <= refs.length; ++i)
        d[i][0] = d[i-1][0] + logProbDelete(refs[i-1]);

    // boundary: the first j read bases are all insertions
    for (int j = 1; j <= reads.length; ++j)
        d[0][j] = d[0][j-1] + logProbInsert(reads[j-1], refs[0]);

    for (int i = 1; i <= refs.length; ++i)
        for (int j = 1; j <= reads.length; ++j)
            d[i][j] = logSumExps(d[i-1][j] + logProbDelete(refs[i-1]),
                                 d[i][j-1] + logProbInsert(reads[j-1], refs[i-1]),
                                 d[i-1][j-1] + logProbSubst(reads[j-1], refs[i-1]));

    return d[refs.length][reads.length];
}

The only change is that we’re calling the familiar log sum of exponentials function here rather than max. Recall the definition:

\mbox{logSumExp}(x,y,z) = \log (e^x + e^y + e^z)

We use this because in our case x, y, and z are log probabilities, so exp(x), exp(y), and exp(z) are linear probabilities, their sum is a linear probability, and the final result is the total probability of the paths converted back to the log scale. We don’t just maintain linear probabilities because we’re likely to hit underflow problems. (This may not actually be an issue for the limited number of errors allowed in these models; if not, we just need a sum of the linear probabilities, which is much faster to compute.)
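For reference, here’s a minimal, numerically stable implementation of the three-argument version used above; subtracting the maximum before exponentiating keeps the exponentials from overflowing:

// Numerically stable log(exp(x) + exp(y) + exp(z)).
static double logSumExps(double x, double y, double z) {
    double max = Math.max(x, Math.max(y, z));
    if (max == Double.NEGATIVE_INFINITY)
        return Double.NEGATIVE_INFINITY;   // all inputs are log(0)
    return max + Math.log(Math.exp(x - max)
                        + Math.exp(y - max)
                        + Math.exp(z - max));
}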

That’s it. We’ve computed the (log of) the sum of all path probabilities.

When the Reference is Exhausted

One small detail: we need to be able to generate symbols in the read after the reference is exhausted, so we have to include that option in the algorithm. That is, we need five discrete distributions over edit operations, one for each next base {A,C,G,T} in the reference, and one to use when the reference is exhausted.

Although not conceptually difficult or expensive to compute, this complicates the algorithm’s boundary conditions and its conceptualization, so the algorithm isn’t technically correct as stated. To correct it, the probability terms need to be fiddled to account for the boundary.
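Here’s one way that fiddling might be sketched, assuming the edit distributions are extended with a fifth, end-of-reference context; the sentinel name and indexing are hypothetical, not from any actual implementation:

// Hypothetical sentinel context: "the reference is exhausted".  With
// bases encoded A=0, C=1, G=2, T=3, the edit model keeps five rows of
// probabilities; in the REF_END row only insertions are non-zero.
static final int REF_END = 4;

// In the marginal algorithm's inner loop, insertions in the final
// reference row (i == refs.length, i.e. the whole reference consumed)
// would then be scored as logProbInsert(reads[j-1], REF_END) instead of
// being conditioned on a reference base.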

Generalizing to Local Alignments

The Smith-Waterman-like generalization to local alignments requires allowing zero-cost deletes on the first and last row of the edit matrix, and zero-cost inserts on the first and last column of the matrix.

This isn’t quite properly normalized, but can be if we assume a uniform model of read start and end positions.

We can also restrict local alignments in the usual way to the case where the read completely matches a substring of the reference sequence by only allowing zero-cost deletes on the first and last row of the distance matrix.
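Here’s a sketch of that last variant in the same matrix layout as the code above: free deletes before the read starts, free deletes after it ends, summed in log space over all possible end positions. The probability lookups are the same assumed helpers as before.

// Marginal score of a read against *some* substring of the reference.
// Not quite normalized, per the caveat above.
double nwReadVsSubstring(byte[] reads, byte[] refs) {

    double[][] d = new double[refs.length + 1][reads.length + 1];

    // free deletes before the read starts: the read may begin anywhere
    for (int i = 0; i <= refs.length; ++i)
        d[i][0] = 0.0;

    for (int j = 1; j <= reads.length; ++j)
        d[0][j] = d[0][j-1] + logProbInsert(reads[j-1], refs[0]);

    for (int i = 1; i <= refs.length; ++i)
        for (int j = 1; j <= reads.length; ++j)
            d[i][j] = logSumExps(d[i-1][j] + logProbDelete(refs[i-1]),
                                 d[i][j-1] + logProbInsert(reads[j-1], refs[i-1]),
                                 d[i-1][j-1] + logProbSubst(reads[j-1], refs[i-1]));

    // free deletes after the read ends: sum (in log space) over all
    // reference positions where the read could finish
    double total = Double.NEGATIVE_INFINITY;
    for (int i = 0; i <= refs.length; ++i)
        total = logSumExps(total, d[i][reads.length],
                           Double.NEGATIVE_INFINITY);  // two-way sum via the 3-arg helper
    return total;
}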

Multiple Edits and Correlated Errors

The model above is Markovian; the distribution over edit operations depends only on the current reference symbol.

This assumption may be relaxed to allow compound edit operations instead of the simple Markovian insertion, substitution, and deletion operations of the Needleman-Wunsch algorithm.

For example, University of Toronto’s SHRiMP alignment package considers pairwise operations generated by the AB SOLiD platform’s colorspace reads (there’s a shortage of capital “I”s for names, hence “Alias-i”). With this approach, SHRiMP can model the correlated errors in SOLiD colorspace reads relative to a reference sequence due to single nucleotide polymorphisms (SNPs) in the samples relative to the reference (which induce two sequential, correlated, colorspace read errors).

SHRiMP’s generalization of Needleman-Wunsch is similar, both algorithmically and conceptually, to the phonetic extensions to the edit distance component of the noisy channel model for natural language spelling correction proposed by Kristina Toutanova and Bob Moore in their 2002 ACL paper, Pronunciation Modeling for Improved Spelling Correction.

Probabilistic Edit Channel Models

December 4, 2009

This post is about probabilistic edit distances used as noisy channel models with applications to biological short-read sequencing and natural language spelling correction. (Spoiler: Normalization is over channel outputs and the forward (sum) algorithm is required instead of the usual Viterbi (max) algorithm.)

Spelling Correction and Chinese Tokenization

LingPipe models both spelling correction and Chinese tokenization (word splitting) using noisy channel models. We have two tutorials: one for spelling correction and one for Chinese word segmentation.

The two components of a noisy channel model are the source p(\sigma), from which messages \sigma consisting of strings are generated probabilistically, and the channel p(\tau|\sigma), which models the probability of observing the signal \tau when the message is \sigma. For spell checking, the source model of what people are likely to say may be estimated using document data and/or query logs, and the channel model of possible brainos and typos may be estimated using matched query/correction data (which itself may be inferred via EM from query logs and/or document data). For Chinese, the channel deterministically deletes spaces, which the observer, who doesn’t get word boundary information, must reconstruct probabilistically.
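In equation form, the noisy-channel decoder picks the message that is most probable given the observed signal, which by Bayes’s rule (dropping the constant p(\tau)) is:

\hat{\sigma} = \arg\max_\sigma p(\sigma|\tau) = \arg\max_\sigma p(\sigma) \, p(\tau|\sigma)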

LingPipe’s spelling package implements (confusion-matrix weighted) edit distance, as well as other models such as n-gram comparison. This is all described in the spelling tutorial as it relates to spell checking, and also in:

LingPipe doesn’t enforce a probabilistic interpretation of edit distance. Furthermore, decoding of the intended message given the observed signal is implemented using the Viterbi approximation, which estimates the channel probability based on the single best alignment between the message \sigma and the observed signal \tau.

Mature mRNA Sequencing

[Figure: mRNA diagram]

Mitzi’s lab’s working on sequencing the untranslated end regions of mature mRNA (aka the 3′ UTR) in C. elegans (aka the nematode worm), a widely studied model organism. The 3′ UTR is the sequence of bases after the stop codon and before the poly-A tail found on mature mRNA.

Short Read Sequencers

I’ve been talking with Mitzi lately about the next generation short-read DNA sequencers, like Illumina and SOLiD. Perhaps it’s not surprising, given that it’s locally coherent string data, that the same techniques are used in bioinformatics as in computational linguistics.

Isoform Expression Models

I’ve been formulating Bayesian models for transcriptome analysis, in particular isoform (aka alternative splicing) expression. As usual, there’s an underlying generative model of the process. One component of this model is how the read \tau observed from the device (the observed data) is generated from the underlying biological sequence \sigma (the estimand of interest).

The noisy channel model is appropriate for this setting; I just need to make it probabilistic.

Levenshtein Distance

The simplest models of spelling correction or genetic sequencing involve simple Levenshtein distance calculations. The Levenshtein distance is the minimum number of insertion, deletion, or substitution operations required to turn one sequence into the other. It’s symmetric. But it’s not probabilistic.
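For concreteness, here’s the textbook dynamic-programming version (not LingPipe’s implementation):

// Classic Levenshtein distance: minimum number of insertions, deletions,
// and substitutions needed to turn x into y.
static int levenshtein(String x, String y) {
    int[][] d = new int[x.length() + 1][y.length() + 1];
    for (int i = 0; i <= x.length(); ++i)
        d[i][0] = i;                              // delete all of x's first i chars
    for (int j = 0; j <= y.length(); ++j)
        d[0][j] = j;                              // insert all of y's first j chars
    for (int i = 1; i <= x.length(); ++i) {
        for (int j = 1; j <= y.length(); ++j) {
            int subst = x.charAt(i - 1) == y.charAt(j - 1) ? 0 : 1;
            d[i][j] = Math.min(d[i - 1][j - 1] + subst,     // substitute or match
                      Math.min(d[i - 1][j] + 1,             // delete
                               d[i][j - 1] + 1));           // insert
        }
    }
    return d[x.length()][y.length()];
}

For example, levenshtein("ACGTACG", "AGTTAGG") returns 3, matching the three-operation alignments discussed below.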

Probabilistic Channels

It’s actually quite simple to model p(\tau|\sigma). Just make sure all the moves from a given state sum to 1.0. A state corresponds to a prefix of the message sequence \sigma (the reference). Legal moves are to skip (delete) the next symbol of the message \sigma, to add (insert) a spurious symbol into the observation \tau, to match the next symbols of \sigma and \tau, or to substitute the next symbol of \tau for the next symbol of \sigma.

Viterbi versus Forward Algorithm

The usual approach to calculation is to make the so-called Viterbi assumption and look at the probability of the single best sequence of moves that consumes the message sequence and produces the observed sequence. Given a sequence of edit operations, we can generate an alignment between the sequences. For instance, consider matching the message ACGTACG to the observed signal AGTTAGG. The following alignment consists of a match of A, a delete of C, two matches GT, an insertion of a T, a match on A, a substitution of a G for a C, and a final match on G:

ACGT-ACG
|D||I|S|
A-GTTAGG

To make the probabilities add up to 1.0 for the moves from a particular state, we need to subdivide the inserts and substitutions. With the genetic alphabet of symbols {A, C, G, T}, the 9 possible moves are subdivided by output, I(A), I(C), I(G), I(T), D, S(A), S(C), S(G), S(T). We have probabilities for these outputs for each symbol. In the genetic case, that’s four multinomials of order 9.

Note that we’ve included the match among the substitution operations for simplicity. If we know the next output symbol from a state, only one of the four substitutions will be legal — the one that outputs the next output symbol.

But there are many more alignments for the same pair of strings. Some even have the same Levenshtein distance. For instance, we can align the two sequences using three substitutions:

ACGTACG
|SS||S|
AGTTAGG

Edit distance assigns a probability to a path of operations, which corresponds to an alignment a, say p(a,\tau|\sigma). Given the underlying message (or sequence) \sigma, the observed sequence \tau is determined by the edit operations a.

What we need to do is integrate (or sum) out the alignments in the usual way in order to calculate the probability of the observed signal (query/read) \tau given the message (meaning/reference sequence) \sigma:

p(\tau|\sigma) = \sum_a p(\tau,a|\sigma)

The Viterbi approximation is:

p(\tau|\sigma) \approx \max_a p(\tau,a|\sigma)

The approximation is only reasonable from a probabilistic point of view if the best alignment accounts for most of the probability mass. Or, if we are doing calculations proportionally and the best alignment probability is a constant fraction of the total generation probability.

Simply using the forward algorithm instead of the Viterbi algorithm computes this in {\mathcal O}(m \times n) time (and space) where n = |\tau| is the length of \tau and m = |\sigma| the length of \sigma.

We can actually tighten up the space requirements to \mbox{max}(m,n); this is how I did things for LingPipe’s implementation and what I contributed for Lucene’s fuzzy matcher. We can tighten even further if we have an upper bound on the maximum number of edits to consider, say d. Then the time complexity reduces to {\mathcal O}(d \times \max(m,n)). (As for all things string-algorithm related, see Gusfield’s string algorithms book and Durbin, Eddy, Krogh and Mitchison’s probabilistic string matching book.)
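Here’s a sketch of the rolling-row trick for the plain Levenshtein case; the probabilistic version works the same way with log-sum-exp (or max) in place of min:

// Space-efficient Levenshtein: keep only the previous and current rows
// of the dynamic-programming matrix instead of the full table.
static int levenshteinTwoRow(String x, String y) {
    int[] prev = new int[y.length() + 1];
    int[] curr = new int[y.length() + 1];
    for (int j = 0; j <= y.length(); ++j)
        prev[j] = j;                               // first row: all insertions
    for (int i = 1; i <= x.length(); ++i) {
        curr[0] = i;                               // first column: all deletions
        for (int j = 1; j <= y.length(); ++j) {
            int subst = x.charAt(i - 1) == y.charAt(j - 1) ? 0 : 1;
            curr[j] = Math.min(prev[j - 1] + subst,
                      Math.min(prev[j] + 1, curr[j - 1] + 1));
        }
        int[] tmp = prev; prev = curr; curr = tmp; // roll the rows
    }
    return prev[y.length()];
}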

Of course, we’ll compute it all on a log scale so the products become sums. This will require our old friend the log sum of exponentials operation.

Filtering

For the biological sequence matching problem, we’re facing on the order of 100M reads of 50 bases each, matching against a set of 30K isoforms of 1.5K bases each. This is prohibitive for the forward algorithm. There are various ways to share the prefix operations (again, see Gusfield’s book).

Typically, though, practitioners use n-gram sequence indexes to spot potential matches, only considering sequences with good n-gram overlap, thus eliminating most of the work up front. This can be done either heuristically, as in BLAST (described in both books), or exactly, as in SHRiMP (the latter link is to a paper with a good description of the SOLiD color space reads and uses automaton composition to compose noisy channels, one for color read errors, one for sequencing errors, and one for mutations).
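As a toy sketch of the blocking idea (nothing like BLAST’s or SHRiMP’s actual index structures), a k-mer set per reference sequence is enough to count shared k-mers and discard hopeless candidates:

import java.util.HashSet;
import java.util.Set;

// Toy n-gram (k-mer) blocking: index reference k-mers, then count how
// many of a read's k-mers hit the reference.  Only reads with enough
// shared k-mers go on to the expensive alignment step.
class KmerFilter {
    private final int k;
    private final Set<String> refKmers = new HashSet<>();

    KmerFilter(String reference, int k) {
        this.k = k;
        for (int i = 0; i + k <= reference.length(); ++i)
            refKmers.add(reference.substring(i, i + k));
    }

    // number of the read's k-mers that also occur in the reference
    int sharedKmers(String read) {
        int hits = 0;
        for (int i = 0; i + k <= read.length(); ++i)
            if (refKmers.contains(read.substring(i, i + k)))
                ++hits;
        return hits;
    }
}

A read would then only be run through the full alignment against references whose shared k-mer count clears some threshold.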

This is like the notion of blocking in database record linkage (including database deduplication), which we also tend to carry out using character n-grams for natural language sequences.

The transcendental math’s still going to dominate the calculations, so typically a Viterbi pass using Levenshtein distance is done after blocking with n-grams to further filter out unlikely alignments.

Biological sequence analysis sure looks a lot like natural language sequence analysis!

