## Contextual Effects and Read Quality in a Probabilistic Aligner

In my last post on this topic, Sequence Alignment with Conditional Random Fields, I introduced a properly normalized Smith-Waterman-like global aligner (which required the entire read to be aligned to a subsequence of the reference sequence). The motivation was modeling the probability of a read being observed given a reference sequence from which it originated.

Today I want to generalize the model in two ways. First, I’ll add contextual effects like affine gap scores and poly-base errors on Illumina.

Second, I’ll take into consideration per-base read quality, as produced by Illumina’s prb files (with simple fastq format, where you only get the probability of the best base, we can either treat the others as uniformly likely or use knowledge of the confusion error profile of the sequencer).

We’ll still take start position into account, for issues like modeling hexamer priming or fractionation effects in sample prep.

The dynamic programming algorithms work as before with finer-grained cells. Further, you can infer what the actual read was, though I believe a hierarchical model will be better for this, because the sequence polymorphism can be separated from the read errors generatively.

### Modeling Alignments

What we’re interested in for our work on RNA-Seq based mRNA expression is modeling is $p(y|x)$, the probability of a read $y$ given a reference sequence $x$. Rather than approaching this directly, we’ll take the usual route of modeling a (latent) alignment $A$ at a (latent) start position $i$, $p(y,A,i|x)$, and then marginalizing, $p(y|x) = \sum_{A,i} p(y,A,i|x)$.

The best alignment will be computable via a standard sequence max algorithm, and the marginalization through the standard product-sum algorithm. Both run in time proportional to the read length times reference sequence length (of course, the latter will be trimmed down to potential matches in a region). Both can be sped up with upper bounds on distance.

### The Model

We assume there is a score $\mbox{\sc start}(i,x) \in \mathbb{R}$ for each start position $i$ in reference sequence $x$. For instance, this may be a log probability of starting at a position given the lab fractionation and priming/amplification protocol.

We will also assume there is a quality score on a (natural) log probability scale $\mbox{\sc logprb}(y,b,j) \in \mathbb{R}$ for the probability of base $b$ at position $j$ in read $y$. For instance, these may be the Phred quality scores reported in Illumina’s prb files converted to (natural) log probability of match.

Our definition only normalizes for reads of fixed length, which we’ll write $\mbox{\sc readlen}$.

We will recursively define a score on an additive scale and then exponentiate to normalize. In symbols, $p(y,A,i|x) \propto \exp\big(\mbox{\sc start}(i,x) + f(a,i,0,x,y,\mbox{\rm start})\big)$

where we define the alignment score function $f(A,i,j,x,y,a)$, for alignment sequence $A$, read $x$ and read position $i$, reference sequence $y$ and reference position $j$ and previous alignment operation $a$, recursively by $\begin{array}{rcl}f([],i,\mbox{\sc readlen},x,y,a) & = & 0.0\\[18pt]f(\mbox{\rm sub}(x_i,b)+A,i,j,x,y,a) & = & \mbox{\sc subst}(x,y,i,j,a)\\ & + & \mbox{\sc logprb}(y,b,j)\\ & + & f(A,i+1,j+1,x,y,\mbox{\rm sub}(x_i,b))\\[12pt]f(\mbox{\rm del}(x_i)+A,i,j,x,y,a) & = & \mbox{\sc del}(x,y,i,j,a)\\ & + & f(A,i+1,j,x,y,\mbox{\rm del}(x_i))\\[12pt]f(\mbox{\rm ins}(b)+A,i,j,x,y,a) & = & \mbox{\sc ins}(x,y,i,j,a)\\ & + & \mbox{\sc logprb}(y,b,j)\\ & + & f(A,i,j+1,x,y,\mbox{\rm ins}(b))\end{array}$

Reminds me of my Prolog programming days.

#### The Edit Operations

The notation $a + A$ is meant to be the edit operation $a$ followed by edit sequence $A$. The notation $[]$ is for the empty edit sequence.

The notation $\mbox{\rm sub}(x_i,b)$ is for the substitution of an observed base $b$ from the read for base $x_i$ of the reference sequence; if $x_i \neq b$ it’s a particular mismatch, and if $x_i = b$, it’s a match score. The notation $\mbox{\rm ins}(b)$ is for the insertion of observed base $b$ and $\mbox{\rm del}(x_i)$ for the deletion of base $x_i$ in the reference sequence. Insertion and deletion are separated because gaps in the reference sequence and read may not be equally likely given the sequencer and sample being sequenced.

Note that the base case requires the entire read to be consumed by requring $j = \mbox{\sc readlen}$.

#### Scoring Functions

Note that the score functions, $\mbox{\sc subst}(x,y,i,j,a) \in \mathbb{R}$, $\mbox{\sc del}(x,y,i,j,a) \in \mathbb{R}$, and $\mbox{\sc ins}(x,y,i,j,a) \in \mathbb{R}$, have access to the read $y$, the reference sequence $x$, the present position in the read $j$ and reference sequence $i$, as well as the previous edit operation $a$.

### Enough Context

This context is enough to provide affine gap penalties, because the second delete will have a context indicating the previous operation was a delete, so the score can be higher (remember, bigger is better here). It’s also enough for dealing with poly-deletion, because you have the identity of the previous base on which the operation took place.

Further note that by defining the start position score by a function $\mbox{\sc start}(x,i)$, which includes a reference to not only the position $i$ but the entire reference sequence $i$, it is possible to take not only position but also information in the bases in $x$ at positions around $i$ into account.

### Dynamic Programming Same as Before

The context is also restricted enough so that the obvious dynamic programming algorithms for first-best and marginalization apply. The only tricky part is computing the normalization for $p(y,A,i|x)$, which has only been defined up to a constant. As it turns out, the same dynamic programming algorithm works to compute the normalizing constant as for a single probabilistic read.

In practice, you have the usual dynamic programming table, only it keeps track of the last base matched in the read, the last base matched in the reference sequence, and whether the last operation was an insertion, deletion, or substitution. To compute the normalizer, you treat each base log prob from a pseudoread of the given read length $\mbox{\sc readlen}$ as having score 0 and accumulate all the matches in the usual way with the sum-product algorithm. (You’ll need to work on the log scale and use the log-sum-of-exponentials operation at each stage to prevent underflow.)

In general, these dynamic programming algorithms will work anywhere you have this kind of tail-recursive definition over a finite set of operations.

### Inferring the Actual Read Sequence for SNP, Etc. Calling

Because of variation among individuals, the sequence of bases in a read won’t necessarily match the reference sequence. We can marginalize out the probability that the read sequence is $z$ given that it was generated from reference sequence $x$ with read $y$ (remember the read’s non-deterministic here), $p(z|x,y) = \sum_{i,\mbox{yield}(A) = z} p(A,i,x|y)$,

where $\mbox{yield}(A)$ is the read produced from alignment $A$, which itself is defined recursively by $\mbox{yield}([]) = \epsilon$, $\mbox{\rm yield}(\mbox{\rm del}(x) + A) = \mbox{\rm yield}(A)$, $\mbox{\rm yield}(\mbox{subst}(b,x_i) + A) = b \cdot \mbox{\rm yield}(A)$, and $\mbox{\rm yield}(\mbox{ins}(b)+A) = b \cdot \mbox{\rm yield}(A)$,

where $\epsilon$ is the empty string and $\cdot$ string concatenation.

I believe this is roughly what the GNUMAP mapper does (here’s the pay-per-view paper).

### Reference Sequences with Uncertainty

We could extend the algorithm even further if we have a non-deterministic reference sequence. For instance, we can use data from the HapMap project to infer single nucleotide polymorphisms (SNPs). We just do the same thing for reference side non-determinsm as on the read side, assuming there’s some score associated with each choice and adding the choice into the context mix.