Having spent hours figuring out how to properly normalize edit probabilities of an edit channel model so they sum to 1 over all reads of a given length given a reference sequence, the answer seems obvious in retrospect. Use conditional random fields (CRFs), because they can normalize anything and are easy to program. An added benefit is that biologists are used to thinking of edit costs on a log scale.
Once we have a conditional model with an arbitrary linear basis, it is straightforward to account for differential expression given location or initial hexamer (more on that below).
The Sequence Alignment Problem
The first step in most DNA or RNA analysis pipelines is sequence alignment, wherein the bases read from a biological sample by a sequencer are aligned with the bases of one or more reference sequences (genome sequences or gene models of RNA transcripts). The present generation of sequencers (e.g. Illumina or SOLiD) produce tens or even hundreds of millions of reads of 50 bases or so per run.
Typically, simpler models like pure edit distance are used, with hashing heuristics to narrow the search or suffix arrays to share prefix computations. The model described here is intended to be run over a narrower set of alignment hypotheses.
Generative Expression Models Require Probabilistic Alignments
In a previous post, I described a Bayesian hierarchical model of mRNA expression. That model contains a random variable for each read that is dependent on the sequence from which it is read. Inference with this model requires us to calculate the probability of a read given a reference sequence. The model described here directly plugs into the expression model. Probabilistic alignments may also be used in SNP calling, splice-variant discovery, or any other applications of sequence alignment.
Alignments
Our model is based directly on the simple notion of alignment, which consists of a sequence of matches/substitutions, inserts (gaps in reference), and deletes (gaps in query/read). Repeating the example from my earlier post on probabilistic channel models, the following is an alignment of the read/query sequence AGTTAGG (below) to the reference sequence ACGTACG (above):
```
ACGT-ACG
|D||I|S|
A-GTTAGG
```
The first C in the reference is deleted, the second T in the read is inserted, and the second G in the read is substituted for the second C in the reference. We can represent alignments as sequences of edit operations. For the alignment above, we have the sequence of eight edits m(A), d(C), m(G), m(T), i(T), m(A), s(G,C), m(G).
The alignment indicates the bases involved so that the reference and query/read sequences may be reconstructed from the alignment.
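As a quick illustration, here is a minimal Python sketch (the function name `reconstruct` and the tuple encoding of edits are my own choices for this post, not anything standard) that recovers both sequences from an edit sequence in this notation:

```python
# A small sketch: recover the read and the reference from an alignment given
# as a list of edit operations.  Edits are encoded as tuples:
# ('m', base), ('s', read_base, ref_base), ('i', read_base), ('d', ref_base).

def reconstruct(alignment):
    read, ref = [], []
    for op in alignment:
        if op[0] == 'm':        # match: same base in read and reference
            read.append(op[1]); ref.append(op[1])
        elif op[0] == 's':      # substitution: read base aligned to reference base
            read.append(op[1]); ref.append(op[2])
        elif op[0] == 'i':      # insert: base in read only (gap in reference)
            read.append(op[1])
        elif op[0] == 'd':      # delete: base in reference only (gap in read)
            ref.append(op[1])
    return ''.join(read), ''.join(ref)

alignment = [('m', 'A'), ('d', 'C'), ('m', 'G'), ('m', 'T'),
             ('i', 'T'), ('m', 'A'), ('s', 'G', 'C'), ('m', 'G')]
print(reconstruct(alignment))   # ('AGTTAGG', 'ACGTACG')
```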
As we noted before, there are typically very many ways to align a read to a reference. Here’s an alternative alignment,
```
ACGTACG
|SS||S|
AGTTAGG
```
which corresponds to the following sequence of seven edits m(A), s(G,C), s(T,G), m(T), m(A), s(G,C), m(G).
Read Probabilities from Alignment Probabilities
Suppose we have a reference sequence of $M$ bases, $y = y_1, \ldots, y_M$. The CRF we propose will model the probability $p(a, m \mid y, n)$ of an alignment $a$ aligning at position $m$ of reference sequence $y$ and producing a read of length $n$. We’ll let $\mathrm{read}(a)$ be the read generated by the alignment $a$.

We compute the probability of a read $x$ of length $n$ starting at position $m$ of the reference sequence as

$p(x, m \mid y, n) = \sum_{\{a \,:\, \mathrm{read}(a) = x\}} p(a, m \mid y, n)$.

We can then marginalize out the starting position by summing to calculate the probability of a read,

$p(x \mid y, n) = \sum_{m=1}^{M} p(x, m \mid y, n)$.
Linear Basis
In general, CRFs are based on arbitrary potential functions for cliques of dependent variables which combine linearly by summation over all cliques. We break down the basis into a contribution from edits and starting position. For edits, we take
$\varphi(a) = \sum_{k=1}^{K} \varphi(a_k)$,

where $\varphi(a_k)$ is the potential of the edit operation $a_k$ in the alignment $a = a_1, \ldots, a_K$. Recall that the edits include their edited bases, so we have four match edits m(Z), twelve substitution edits s(Z1,Z2) with Z1 != Z2, four insert edits i(Z), and four delete edits d(Z), for a total of 24 edit operations. So $\varphi$ provides a value in $\mathbb{R}$ for each of the 24 operations, and the weight of a sequence of operations is the sum of the operation weights.
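To make that concrete, here is a small Python sketch of the 24-entry potential table and the summed alignment potential. The particular weight values are invented for illustration only; in practice they would come from domain knowledge or be estimated from known alignments.

```python
# Hypothetical edit potentials on the log scale, one per operation; the
# values are invented for illustration, not estimated from data.
BASES = 'ACGT'
phi = {}
for z in BASES:
    phi[('m', z)] = 0.0      # matches
    phi[('i', z)] = -4.0     # inserts (gap in reference)
    phi[('d', z)] = -4.0     # deletes (gap in read)
for z1 in BASES:
    for z2 in BASES:
        if z1 != z2:
            phi[('s', z1, z2)] = -3.0   # substitutions

assert len(phi) == 24        # 4 matches + 12 substitutions + 4 inserts + 4 deletes

def alignment_potential(alignment):
    """phi(a) = sum_k phi(a_k): the weight of an alignment is the sum of
    the weights of its edit operations."""
    return sum(phi[op] for op in alignment)
```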
For starting positions, we take a potential function $\psi(m, y)$ for starting an alignment at position $m$ of reference sequence $y$.
The probability of an alignment at a starting position is defined to be proportional to the exponentiated sum of potentials,

$p(a, m \mid y, n) \propto \exp\big(\varphi(a) + \psi(m, y)\big) \, \mathbb{I}\big(\mathrm{len}(\mathrm{read}(a)) = n\big)$,

where we take $\mathbb{I}(\mathrm{len}(\mathrm{read}(a)) = n)$ to be an indicator variable with value 1 if $\mathrm{len}(\mathrm{read}(a)) = n$ and 0 otherwise. This makes sure our distribution is normalized for the length of reads we care about, and will also make the dynamic programming required for inference tractable.
Normalization is straightforward to calculate in theory. To compute normalized probabilities, we take

$p(a, m \mid y, n) = \frac{1}{Z(y, n)} \exp\big(\varphi(a) + \psi(m, y)\big) \, \mathbb{I}\big(\mathrm{len}(\mathrm{read}(a)) = n\big)$, where

$Z(y, n) = \sum_{m} \sum_{a} \exp\big(\varphi(a) + \psi(m, y)\big) \, \mathbb{I}\big(\mathrm{len}(\mathrm{read}(a)) = n\big)$.
As they say in the old country, Bob’s your uncle.
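For a toy reference and a short read length, $Z(y, n)$ and the normalized read probabilities can be computed exactly by brute-force enumeration of alignments, which makes a handy sanity check for the dynamic programs described next. Here is a rough sketch, reusing the hypothetical `phi` and `alignment_potential` from above and adding an equally made-up start potential `psi` (uniform here); it assumes an alignment ends as soon as its read is complete (no trailing deletes).

```python
import math

# Hypothetical start-position potential; uniform here, but it could encode
# positional bias (see "Modeling Uneven Reads" below).
def psi(m, y):
    return 0.0

def enumerate_alignments(y, m, n):
    """Yield (alignment, read) pairs for alignments that start at 0-based
    reference position m and produce a read of exactly n bases."""
    def rec(pos, read_len, align, read):
        if read_len == n:
            yield list(align), ''.join(read)
            return
        if pos < len(y):
            rb = y[pos]
            yield from rec(pos + 1, read_len + 1, align + [('m', rb)], read + [rb])   # match
            for z in 'ACGT':                                                          # substitutions
                if z != rb:
                    yield from rec(pos + 1, read_len + 1, align + [('s', z, rb)], read + [z])
            yield from rec(pos + 1, read_len, align + [('d', rb)], read)              # delete
        for z in 'ACGT':                                                              # insert
            yield from rec(pos, read_len + 1, align + [('i', z)], read + [z])
    yield from rec(m, 0, [], [])

def Z(y, n):
    """Brute-force normalizer: sum of exponentiated potentials over all
    alignments at all start positions producing reads of length n."""
    return sum(math.exp(alignment_potential(a) + psi(m, y))
               for m in range(len(y))
               for a, _ in enumerate_alignments(y, m, n))

def read_prob(x, y):
    """Brute-force p(x | y, n) with n = len(x)."""
    n = len(x)
    num = sum(math.exp(alignment_potential(a) + psi(m, y))
              for m in range(len(y))
              for a, r in enumerate_alignments(y, m, n)
              if r == x)
    return num / Z(y, n)

print(read_prob('AG', 'ACGT'))   # feasible only for tiny examples
```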
Sum-Product Algorithm for Read Probabilities
It’s just the usual sum-product algorithm for CRFs. It’s just like the usual Smith-Waterman style edit algorithm except that (1) we must consume all of the read, (2) start costs are given by $\psi$, and (3) we use log sums of exponentials rather than maxes to compute cell values without underflow.
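Here is a rough sketch of that forward recursion in Python, reusing the hypothetical `phi` and `psi` from the earlier sketches; the function name `forward` and its `combine` argument are my own. `combine` defaults to log-sum-exp, and swapping in `max` gives the Viterbi variant discussed below. Subtracting $\log Z(y, n)$ from the result gives the log read probability.

```python
import math

def logsumexp(vals):
    """Numerically stable log(sum(exp(v) for v in vals))."""
    top = max(vals)
    if top == float('-inf'):
        return top
    return top + math.log(sum(math.exp(v - top) for v in vals))

def forward(x, y, combine=logsumexp):
    """Log of the unnormalized weight, summed over alignments and start
    positions, of producing read x from reference y.  The recursion mirrors
    Smith-Waterman except that (1) the whole read must be consumed,
    (2) start costs psi(m, y) are added, and (3) cells combine their incoming
    edits with log-sum-exp instead of max.  Pass combine=max for Viterbi."""
    n, M = len(x), len(y)
    NEG = float('-inf')
    # alpha[i][j]: combined log weight of alignment prefixes that have
    # produced the read prefix x[:i] and consumed the reference up to j.
    alpha = [[NEG] * (M + 1) for _ in range(n + 1)]
    for j in range(M + 1):
        for i in range(n + 1):
            terms = []
            if i == 0 and j < M:
                terms.append(psi(j, y))                      # start alignment at reference position j
            if i > 0 and j > 0:                              # match or substitution
                xb, yb = x[i - 1], y[j - 1]
                op = ('m', xb) if xb == yb else ('s', xb, yb)
                terms.append(alpha[i - 1][j - 1] + phi[op])
            if i > 0:                                        # insert (gap in the reference)
                terms.append(alpha[i - 1][j] + phi[('i', x[i - 1])])
            if j > 0 and i < n:                              # delete (gap in the read); no trailing deletes
                terms.append(alpha[i][j - 1] + phi[('d', y[j - 1])])
            if terms:
                alpha[i][j] = combine(terms)
    return combine([alpha[n][j] for j in range(M + 1)])

# log p(x | y, n) = forward(x, y) - log Z(y, len(x)), with Z computed by
# brute force above or by the dynamic program in the next section.
```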
Sum-Product for Normalization
To compute the normalizing term $Z(y, n)$, we run a similar sum-product algorithm, only with an extra sum over all possible reads.
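A sketch of that variant: the same table as the hypothetical `forward` above, but every step that emits a read base sums over all four bases instead of conditioning on an observed read (again reusing `phi`, `psi`, and `logsumexp` from the earlier sketches).

```python
def log_Z(y, n):
    """Log of the normalizing term Z(y, n)."""
    M = len(y)
    NEG = float('-inf')
    alpha = [[NEG] * (M + 1) for _ in range(n + 1)]
    for j in range(M + 1):
        for i in range(n + 1):
            terms = []
            if i == 0 and j < M:
                terms.append(psi(j, y))
            if i > 0 and j > 0:
                yb = y[j - 1]
                for xb in 'ACGT':                            # sum over the emitted read base
                    op = ('m', xb) if xb == yb else ('s', xb, yb)
                    terms.append(alpha[i - 1][j - 1] + phi[op])
            if i > 0:
                for xb in 'ACGT':
                    terms.append(alpha[i - 1][j] + phi[('i', xb)])
            if j > 0 and i < n:
                terms.append(alpha[i][j - 1] + phi[('d', y[j - 1])])
            if terms:
                alpha[i][j] = logsumexp(terms)
    return logsumexp([alpha[n][j] for j in range(M + 1)])
```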
Viterbi Algorithm for Best Alignments
With the same caveats (1) and (2) above, we can use the usual maximum operation to produce the best alignment. We can also follow LingPipe’s approach and use Viterbi in the forward direction and A* in the reverse direction to enumerate the n-best alignments.
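In terms of the sketches above, the Viterbi score is just the forward recursion with `max` as the combine operation; recovering the best alignment itself (or the n-best list) additionally requires backpointers, which the sketch omits.

```python
# Best unnormalized log alignment weight, reusing the hypothetical forward()
# sketch with max in place of log-sum-exp.
best_log_weight = forward('AGTTAGG', 'ACGTACG', combine=max)
```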
Affine Gap Basis and Edit Context Sensitivity
It’s easy to take affine gaps into account. Just condition the edit operations on the previous edit operation. You have to add the relevant context to dynamic programming as in the usual affine extension to Smith-Waterman alignment.
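One way to sketch that conditioning: make the gap potentials depend on the kind of the previous edit, so opening a gap can be penalized more heavily than extending one. The numbers below are invented for illustration, and `phi_affine` is my own name, not standard terminology.

```python
# Hypothetical affine-gap potentials: the cost of an insert or delete depends
# on whether the previous edit was of the same kind (gap extension) or not
# (gap open).  Values invented for illustration.
GAP_OPEN, GAP_EXTEND = -6.0, -1.0

def phi_affine(prev_kind, op):
    """Potential of edit op given the kind ('m', 's', 'i', 'd', or None at
    the start of the alignment) of the previous edit."""
    kind = op[0]
    if kind in ('i', 'd'):
        return GAP_EXTEND if prev_kind == kind else GAP_OPEN
    return phi[op]   # matches and substitutions unchanged

# In the dynamic program the previous edit kind becomes part of the state,
# e.g. alpha[i][j][prev_kind], multiplying time and space by a small constant.
```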
It’s also easy to model things like the uncertainty caused in Illumina reads by homopolymers (sequences repeating the same base) by allowing context for the previous base. This is also easy to dynamic program, though adding context adds to space and time requirements.
Modeling Uneven Reads
Depending on the sample preparation process in RNA sequencing experiments (RNA-Seq) (e.g. RNA fragmentation by hydrolysis versus oligo-primed cDNA fragmentation by sonication), there may be effects in read coverage based on position (e.g. RNA fragmentation depleted at 3′ and 5′ end or cDNA fragmentation favoring reads toward the 3′ end) or on the sequence of bases near the start of the read (e.g. amplification by “random” hexamer priming).
These effects are not small. [Hansen, Brenner and Dudoit 2010] showed that hexamer binding in the Illumina process can cause over a 100-fold difference in expression depending on the first two hexamers in a read. Similarly, see figure 3 in [Wang, Gerstein and Snyder 2009], which demonstrates over a tenfold effect on expression based on position. (I sure wish there was variance in the Wang et al. graphs.)
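In the model above, both kinds of bias can be folded into the start potential: let $\psi$ depend not only on the position $m$ but also on the reference bases at the start of the read. A rough sketch with hypothetical, unestimated weight tables (`POSITION_WEIGHT` and `HEXAMER_WEIGHT` are placeholders, not real data):

```python
# Hypothetical start potential combining a positional bias with a weight for
# the initial hexamer of the read.  Both tables are placeholders; in practice
# they would be estimated, e.g. within the hierarchical expression model.
POSITION_WEIGHT = {}   # position decile (0-9) -> log weight
HEXAMER_WEIGHT = {}    # initial reference hexamer, e.g. 'ACGTAC' -> log weight

def psi_biased(m, y):
    decile = min(10 * m // max(len(y), 1), 9)
    return (POSITION_WEIGHT.get(decile, 0.0)
            + HEXAMER_WEIGHT.get(y[m:m + 6], 0.0))
```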
Modeling Input Quality
We can also add input quality to our calculations if it’s on a log scale. Then we have to select which base was actually read, making it more like the use of CRFs for tagging. For an example of how to do this for standard edit costs, see the description of the GNUMAP system in [Clement et al. 2010]. As the authors point out, this is a serious issue with reads as noisy as those coming off the Illumina or SOLiD platforms.
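One way to sketch this (my own illustration, not GNUMAP’s actual scoring): treat each read position as a distribution over bases derived from its Phred quality, and take the potential for aligning that position to a reference base as the log expectation of the exponentiated match/substitution potentials, reusing the hypothetical `phi` from above and assuming the error mass spreads evenly over the other three bases.

```python
import math

def base_distribution(called_base, q):
    """Distribution over the true read base given the called base and its
    Phred quality q; error mass spread evenly over the other bases
    (an assumption of this sketch)."""
    p_err = 10.0 ** (-q / 10.0)
    return {z: (1.0 - p_err) if z == called_base else p_err / 3.0
            for z in 'ACGT'}

def phi_quality(called_base, q, ref_base):
    """Quality-aware potential for aligning a read position (called_base with
    quality q) to ref_base: log expectation of the exponentiated
    match/substitution potentials under the base distribution."""
    total = 0.0
    for z, p in base_distribution(called_base, q).items():
        op = ('m', z) if z == ref_base else ('s', z, ref_base)
        total += p * math.exp(phi[op])
    return math.log(total)
```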
Comments

April 26, 2010 at 1:09 pm
It seems like this might be better understood as a Markov random field (MRF) over two sequences. It doesn’t seem to be a CRF, since it’s not conditionally trained. But you are using the same dynamic programming inference techniques (forward-backward, Viterbi) to compute marginal distributions that you would use with a CRF with this structure.
April 26, 2010 at 2:22 pm
Excellent point; for reference, here’s the Wikipedia page on Markov random fields, which has a section comparing them to conditional random fields.
I think I really do have a CRF, because I’m modeling $p(a, m \mid y, n)$, the conditional probability of generating an alignment $a$ yielding a read of length $n$ at position $m$ in reference sequence $y$.
An obvious way to discriminatively train would be to take a known set of alignments and infer appropriate weights for the clique potentials. As presented above, we just guess at these using domain knowledge. This is exactly what I’d like to do in a hierarchical model. It’d be easy to extend the Gibbs sampler to sample the edit weights given alignments, and given the edit weights, everything else would be as above.
A natural Markov random field would be the joint probability of a read and reference sequence, though you’d probably only want to generate the pieces of the reference aligned. Here, I’m assuming the reference is known and generating the conditional probability of an alignment.
May 14, 2012 at 7:58 pm
Shameless plug: Here is a related paper (http://aclweb.org/anthology-new/D/D08/D08-1113.pdf), in which we presented a CRF-like model for sequence alignment.
May 15, 2012 at 1:54 pm
Thanks. Your model looks very similar (if not identical) to what I was proposing. I’ll probably blog more about it after reading your paper. I’m trying to get back into this line of work.
I’m also very interested in Wicentowski’s models for morphology; I used them before (with some generalizations to language model emissions instead of simple multinomials). They work pretty darn well out of the box for languages that are well modeled by the affixing schemes. What I’d really like to do is combine Wicentowski-style morphology with Goldwater/Johnson-style morphology induction.
Of course, the features are going to need to be different for sequence alignment than you were suggesting for words, as there isn’t a notion of word (though there’s some higher-level structure in the form of motifs, and perhaps other repeated patterns that haven’t been well characterized for regulation).