Having spent hours figuring out how to properly normalize edit probabilities of an edit channel model so they sum to 1 over all reads of a given length given a reference sequence, the answer seems obvious in retrospect. Use conditional random fields (CRFs), because they can normalize anything and are easy to program. An added benefit is that biologists are used to thinking of edit costs on a log scale.
Once we have a conditional model with an arbitrary linear basis, it is straightforward to account for differential expression given location or initial hexamer (more on that below).
The Sequence Alignment Problem
The first step in most DNA or RNA analysis pipelines is sequence alignment, wherein the bases read from a biological sample by a sequencer are aligned with the bases of one or more reference sequences (genome sequences or gene models of RNA transcripts). The present generation of sequencers (e.g. Illumina or SOLiD) produce tens or even hundreds of millions of reads of 50 bases or so per run.
Typically, simpler models like pure edit distance are computed, using hashing heuristics to narrow search or suffix arrays to share prefix calculation. The model described here is intended to be run over a narrower set of alignment hypotheses.
Generative Expression Models Require Probabilistic Alignments
In a previous post, I described a Bayesian hierarchical model of mRNA expression. That model contains a random variable for each read that is dependent on the sequence from which it is read. Inference with this model requires us to calcluate the probability of a read given a reference sequence. The model described here directly plugs into the expression model. Probabilistic alignments may also be used in SNP calling, splice-variant discovery, or any other applications of sequence alignment.
Our model is based directly on the simple notion of alignment, which consists of a sequence of matches/substitutions, inserts (gaps in reference), and deletes (gaps in query/read). Repeating the example from my earlier post on probabilistic channel models, the following is an alignment of the read/query sequence
AGTTAGG (below) to the reference sequence
ACGT-ACG |D||I|S| A-GTTAGG
C in the reference is deleted, the second
T in the read is inserted, and the second
G in the read is substituted for the second
C in the reference. We can represent alignments as sequences of edit operations. For the alignment above, we have the sequence of eight edits
m(A), d(C), m(G), m(T), i(T), m(A), s(G,C), m(G).
The alignment indicates the bases involved so that the reference and query/read sequences may be reconstructed from the alignment.
As we noted before, there are typically very many ways to align a read to a reference. Here’s an alternative alignment,
ACGTACG |SS||S| AGTTAGG
which corresponds to the following sequence of seven edits
m(A), s(G,C), s(T,G), m(T), m(A), s(G,C), m(G).
Read Probabilities from Alignment Probabilities
Suppose we have a reference sequence of bases . The CRF we propose will model the probability of alignment aligning at position of reference sequence and producing a read of length . We’ll let be the read generated by the alignment .
We compute the probability of a read of length starting at position of the reference sequence as
We can then marginalize out the starting position by summing to calculate the probability of a read,
In general, CRFs are based on arbitrary potential functions for cliques of dependent variables which combine linearly by summation over all cliques. We break down the basis into a contribution from edits and starting position. For edits, we take
where is the potential of the edit operation . Recall that the edits include their edited bases, so we have four match edits
m(Z), twelve substitution edits
Z1 != Z2, four insert edits
i(Z) and four delete edits
d(Z), for a total of 24 edit operations. So provides values in for each of the 24 operations, and the weight of a sequence of operations is the sum of the operation weights.
For starting positions, we take a potential function for starting an alignment at position of reference sequence .
The probability of an alignment at a starting position is defined to be proportional to the exponentiated sum of potentials,
where we take to be an indicator variable with value if and otherwise. This makes sure our distrbution is normalized for the length of reads we care about, and will also make the dynamic programming required for inference tractable.
Normalization is straightforward to calculate in theory. To compute normalized probabilities, we take
As they say in the old country, Bob’s your uncle.
Sum-Product Algorithm for Read Probabilities
It’s just the usual sum-product algorithm for CRFs. Its just like the usual Smith-Waterman style edit algorithm except that (1) we must consume all the read, (2) start costs are given by , and (3) we use log sums of exponentials rather than maxes to compute cell values without underflow.
Sum-Product for Normalization
To compute the normalizing term , we run a similar sum-product algorithm, only with an extra sum for all the possible reads.
Viterbi Algorithm for Best Alignments
With the same caveats (1) and (2) above, we can use the usual maximum operation to produce the best alignment. We can also follow LingPipe’s approach and use Viterbi in the forward direction and A* in the reverse direction to enumerate the n-best alignments.
Affine Gap Basis and Edit Context Sensitivity
It’s easy to take affine gaps into account. Just condition the edit operations on the previous edit operation. You have to add the relevant context to dynamic programming as in the usual affine extension to Smith-Waterman alignment.
It’s also easy to model things like the uncertainty caused in Illumina reads by homopolymers (sequences repeating the same base) by allowing context for the previous base. This is also easy to dynamic program, though adding context adds to space and time requirements.
Modeling Uneven Reads
Depending on the sample preparation process in RNA sequencing experiments (RNA-Seq) (e.g. RNA fragmentation by hydrolysis versus oligo-primed cDNA fragmentation by sonication), there may be effects in read coverage based on position (e.g. RNA fragmentation depleted at 3′ and 5′ end or cDNA fragmentation favoring reads toward the 3′ end) or on the sequence of bases near the start of the read (e.g. amplification by “random” hexamer priming).
These effects are not small. [Hansen, Brenner and Dudoit 2010] showed that hexamer binding in the illumina process can cause over a 100-fold difference in expression depending on the first two hexamers in a read. Similarly, see figure 3 in [Wang, Gerstein and Snyder 2009], which demonstrates over a tenfold effect on expression based on position. (I sure wish there was variance in the Wang et al. graphs.)
Modeling Input Quality
We can also add input quality to our calculations if it’s on a log scale. Then we have to select which base was actually read, making it more like the use of CRFs for tagging. For an example of how to do this for standard edit costs, see the description of the GNUMAP system in [Clement et al. 2010]. As the authors point out, this is a serious issue with reads as noisy as those coming off the Illumina or SOLiD platforms.