In my last post on this topic, Sequence Alignment with Conditional Random Fields, I introduced a properly normalized Smith-Waterman-like global aligner (which required the entire read to be aligned to a subsequence of the reference sequence). The motivation was modeling the probability of a read being observed given a reference sequence from which it originated.
Context Plus Read Quality
Today I want to generalize the model in two ways. First, I’ll add contextual effects like affine gap scores and poly-base errors on Illumina.
Second, I’ll take into consideration per-base read quality, as produced by Illumina’s prb
files (with simple fastq format, where you only get the probability of the best base, we can either treat the others as uniformly likely or use knowledge of the confusion error profile of the sequencer).
We’ll still take start position into account, for issues like modeling hexamer priming or fractionation effects in sample prep.
The dynamic programming algorithms work as before with finer-grained cells. Further, you can infer what the actual read was, though I believe a hierarchical model will be better for this, because the sequence polymorphism can be separated from the read errors generatively.
Modeling Alignments
What we’re interested in for our work on RNA-Seq based mRNA expression is modeling is , the probability of a read
given a reference sequence
. Rather than approaching this directly, we’ll take the usual route of modeling a (latent) alignment
at a (latent) start position
,
, and then marginalizing,
.
The best alignment will be computable via a standard sequence max algorithm, and the marginalization through the standard product-sum algorithm. Both run in time proportional to the read length times reference sequence length (of course, the latter will be trimmed down to potential matches in a region). Both can be sped up with upper bounds on distance.
The Model
We assume there is a score for each start position
in reference sequence
. For instance, this may be a log probability of starting at a position given the lab fractionation and priming/amplification protocol.
We will also assume there is a quality score on a (natural) log probability scale for the probability of base
at position
in read
. For instance, these may be the Phred quality scores reported in Illumina’s
prb
files converted to (natural) log probability of match.
Our definition only normalizes for reads of fixed length, which we’ll write .
We will recursively define a score on an additive scale and then exponentiate to normalize. In symbols,
where we define the alignment score function , for alignment sequence
, read
and read position
, reference sequence
and reference position
and previous alignment operation
, recursively by
Reminds me of my Prolog programming days.
The Edit Operations
The notation is meant to be the edit operation
followed by edit sequence
. The notation
is for the empty edit sequence.
The notation is for the substitution of an observed base
from the read for base
of the reference sequence; if
it’s a particular mismatch, and if
, it’s a match score. The notation
is for the insertion of observed base
and
for the deletion of base
in the reference sequence. Insertion and deletion are separated because gaps in the reference sequence and read may not be equally likely given the sequencer and sample being sequenced.
Note that the base case requires the entire read to be consumed by requring .
Scoring Functions
Note that the score functions, ,
, and
, have access to the read
, the reference sequence
, the present position in the read
and reference sequence
, as well as the previous edit operation
.
Enough Context
This context is enough to provide affine gap penalties, because the second delete will have a context indicating the previous operation was a delete, so the score can be higher (remember, bigger is better here). It’s also enough for dealing with poly-deletion, because you have the identity of the previous base on which the operation took place.
Further note that by defining the start position score by a function , which includes a reference to not only the position
but the entire reference sequence
, it is possible to take not only position but also information in the bases in
at positions around
into account.
Dynamic Programming Same as Before
The context is also restricted enough so that the obvious dynamic programming algorithms for first-best and marginalization apply. The only tricky part is computing the normalization for , which has only been defined up to a constant. As it turns out, the same dynamic programming algorithm works to compute the normalizing constant as for a single probabilistic read.
In practice, you have the usual dynamic programming table, only it keeps track of the last base matched in the read, the last base matched in the reference sequence, and whether the last operation was an insertion, deletion, or substitution. To compute the normalizer, you treat each base log prob from a pseudoread of the given read length as having score 0 and accumulate all the matches in the usual way with the sum-product algorithm. (You’ll need to work on the log scale and use the log-sum-of-exponentials operation at each stage to prevent underflow.)
In general, these dynamic programming algorithms will work anywhere you have this kind of tail-recursive definition over a finite set of operations.
Inferring the Actual Read Sequence for SNP, Etc. Calling
Because of variation among individuals, the sequence of bases in a read won’t necessarily match the reference sequence. We can marginalize out the probability that the read sequence is given that it was generated from reference sequence
with read
(remember the read’s non-deterministic here),
,
where is the read produced from alignment
, which itself is defined recursively by
,
,
, and
,
where is the empty string and
string concatenation.
I believe this is roughly what the GNUMAP mapper does (here’s the pay-per-view paper).
Reference Sequences with Uncertainty
We could extend the algorithm even further if we have a non-deterministic reference sequence. For instance, we can use data from the HapMap project to infer single nucleotide polymorphisms (SNPs). We just do the same thing for reference side non-determinsm as on the read side, assuming there’s some score associated with each choice and adding the choice into the context mix.
Leave a Reply