## Inferring Splice-Variant mRNA Expression from RNA-Seq

I introduced the problem of splice-variant expression from RNA-Seq data in my previous post on maximum likelihood estimates of RNA-Seq expression. In this post, I’ll show you a Bayesian approach to the problem from which we can compute a full posterior over the latent read mappings and relative mature mRNA expression of gene splice variants.

### The Model

The data from which we infer expression is

• $K \in \mathbb{N}^{+}$, the number of splice variants,
• $N \in \mathbb{N}^{+}$, the number of reads, and
• $y_1,\ldots,y_N$, the reads.

For the purpose of this model, the reads are likely to be characterized in terms of the output of a sequencer, such as base calls with quality scores.

We assume two model hyperparameters,

• $\varphi$, genetic variation, sample preparation, and sequencing parameters, and
• $\alpha_1,\ldots,\alpha_K \in \mathbb{R}^{+}$, the prior read counts (plus one).

The general-purpose parameter $\varphi$ reflect how reads are generated, including the expected variation from the reference genome, the sample preparation process, and the sequencing platform’s error profile. The variation from the reference genome will determine indel and SNP rates, for instance using a probabilistically normalized affine edit distance model. Details of the sample preparation process, such as amplicon selection, fractionation technique, and so on, also determine which mRNA sequences are actually observed (through cDNA). Finally, the sequencing platform’s error profile, such as color confusability in image space, decreasing accuracy from the start to end of the read, and sensitivity to context such as the previously read gene (as in color-space dependencies in SOLiD or errors due to multiple copies of the same base).

We will infer two of the model parameters,

• $t_1,\ldots,t_N \in 1:K$, the mapping of read to splice variant, and
• $\theta_1,\ldots,\theta_K \in [0,1]$ such that $\sum_{k=1}^K \theta_k = 1$, the read expression probabilities.

The model structure in sampling notation is

• $\theta \sim \mbox{\sf Dirichlet}(\alpha)$
• $t_n \sim \mbox{\sf Discrete}(\theta)$ for $n \in 1:N$
• $y_n \sim \mbox{\sf Channel}(t_n,\varphi)$ for $n \in 1:N$.

We select the expression levels $\theta$ based on prior counts.

The heart of the mapping model is the discrete sampling of the latent mappings $t_n$ based on the expression level parameter $\theta$. This is effectively a beta-binomial model of expression at the corpus level.

The mysterious part of this whole thing is the channel model, which assigns the probability of a given read $y_n$ being observed given it arose from the splice variant $t_n$ under the sequencing model parameterized by $\varphi$.

### Simple Channel Examples

For purposes of illustration, we’re going to assume a really simple instance of this model.

First, we’ll assume that $\alpha_k = 1$ for $k \in 1:K$, which renders $\mbox{\sf Dirichlet}(\alpha)$ a uniform distribution over possible values of $\theta$. With replicates at the sequencing run level (or even over channels for Solexa/Illumina data), we could build a hierarchical model and estimate $\alpha$.

Second, we’ll assume a simple uniform read model. Specifically, we’ll assume that a read is just as likely to come from anywhere on the mRNA; a richer model would account for known variation due to the experimental procedure.

Third, we’ll assume a mapping program that aligns reads with the reference gene models for splice variant sequences. If the mapper provides multiple reads, we will treat them all as equally likely; in a more articulated model, we would take into account both sequencing quality scores, the error properties of the sequencer, and likely variation such as indel and SNP rates in estimating the probability of a read given a splice variant.

Although cross-junction splice mapping is straightforward to model versus a known reference gene model, we’ll restrict attention to exon reads here.

With all of these assumptions in place, if read $y_n$ is mapped to splice variant $t_n \in 1:K$, we set

$\mbox{\sf Channel}(y_n|t_n,\varphi) = 1 / (\mbox{len}(t_n) - \mbox{len}(y_n))$.

The denominator $\mbox{len}(t_n) - \mbox{len}(y_n)$ reflects the number of positions a sequence of the given read length $\mbox{len}(y_n)$ may occur in an mRNA of length $\mbox{len}(t_n)$.

### A Simple BUGS Implementation

Although not scalable to RNA-Seq data sizes, it’s simple to investigate the properties of our model in BUGS.

model {
theta[1:K] ~ ddirch(alpha[1:K])
for (n in 1:N) {
t[n] ~ dcat(theta[1:K])
y[n] ~ dcat(rho[t[n],1:K])
}
}


BUGS programs look just like the sampling notation defining the model; ddirch is the Dirichlet distribution and dcat is the discrete (aka categorical) distribution.

I call it from R in the usual way. Here’s just the part where R tells BUGS how to run the model. After the BUGS window is closed, R plots and attaches the data.

library("R2WinBUGS")

data <- list("alpha", "rho", "y", "N", "K")
parameters <- c("theta","t")
inits <- function() { list(theta=c(1/3,1/3,1/3)) }
rnaexp <- bugs(data, inits, parameters,
"c:/carp/devmitz/carp/isoexp/src/Bugs/isoexp-multi.bug",
n.chains=3, n.iter=50000,
debug=TRUE, clearWD=TRUE,
bugs.directory="c:\\WinBUGS\\WinBUGS14",
DIC=FALSE)
print(rnaexp)
plot(rnaexp)
attach.bugs(rnaexp)


The other debugging problem I had is that you absolutely need to tell BUGS not to compute DIC (a kind of model comparison stat); otherwise, BUGS crashes when trying to compute DIC.

Attaching the data converts the parameter of the model to matrix data representing the Gibbs samples.

### An Example

I’ll close with an example that illustrates how the R and BUGS program work, and how the model is able to infer expression even without a single non-multi-mapped read. The gene model and expression levels for the simulation are defined in a comment in the R code. We’ll consider a case with three exons and three isoforms:

# exons  A BB C  (B twice as long as others)
# var1 = A BB
# var2 =   BB C
# var3 = A    C


There are three exons, A, BB, and C, where the notation is meant to indicate the middle exon is twice as long as the others (to spice up the read calculation a bit). There are three splice variants, labeled var1, var2, and var3. Variant 1 is made up of exons A and BB, variant 2 of exons BB and C, and varient 3 of exons A and C.

Without cross-junction reads, it’s not possible to get a uniquely mapped read. Any read mapping to exon A will match splice variants 1 and 3, any read mapping to exon BB will match variants 1 and 2, and any read mapping to exon C will match 2 and 3.

Next, we specify the number of samples drawn from each exon in each splice variant.

#       exon
# var  1  2  3  Total theta*
# 1    1  2    | 3     3/19
#
# 2       4  2 | 6     6/19
#
# 3    5     5 | 10    10/19
#      --------
# tot  6  6  7   18


There are three samples from variant 1, 6 samples from variant 2, and 10 samples from variant 3. These are distributed uniformly across the exons. Here, variant 1 has 1 sample from exon 1 and 2 samples from exon 2, variant 2 has 4 samples from exon 2 and 2 samples from exon 3, whereas variant 3 has 5 samples each from exon 1 and exon 3.

If we adjust for length, variant 1 has 1 mRNA transcript, variant 2 has 2 transcripts, and variant 3 has 5 transcripts; these are computed by dividing by length. But our expression parameter theta ($\theta$) is defined by reads, not by transcripts.

The observed data is much simpler — it’s just the total number of reads for samples mapping to each exon. We see that there are 6 samples mapping to exon 1, 6 samples mapping to exon 2, and 7 samples mapping to exon 3.

### Coding the Example in R

We set the Dirichlet prior to be uniform, whihc also determines our number of splice variants K:

alpha <- c(1,1,1)
K <- length(alpha)


We actually assume there are 10 times that many samples as in our figure, setting the read data as:

y <- c(rep(1,60),rep(2,60),rep(3,70))
N <- length(y)


That is, there are 60 reads from exon 1, 60 reads from exon 2 and 70 reads from exon 3, and N=190.

Finally, we use R’s matrix constructor to set up the read model:

rho <- matrix(c(1/3,2/3,0,  0,2/3,1/3,  1/2,0,1/2),
nrow=3,ncol=3,byrow=TRUE)


I tried to make this human readable: a sample from the first splice variant has a 1/3 chance of originating from exon 1 and a 2/3 chance of originating from exon 2 (because exon 2 is twice as long as exon 1). A sample from the third splice variant has an equal chance of being from either exon 1 or exon 3 (exon 1 and exon 3 are the same length).

### The Output

I ran the R program, which called BUGS, which terminated neatly and returned to R. The $\hat{R}$ values were all pinned at 1.0, indicating good chain mixing and hence convergence of the sampler.

### Bayeisan Point Estimate

We can compute any statistics we want now with the posterior samples. For instance, theta[,1] contains all the Gibbs samples for the variable $\theta_1$.

We can then generate the Bayesian (L2) estimates for expression levels:

> thetaHat = c(mean(theta[,1]), mean(theta[,2]), mean(theta[,3]))
> print(thetaHat,digits=2)
[1] 0.17 0.31 0.52


The “right” answer for our simulation is also easy to compute with R:

> print(c(3/19, 6/19, 10/19), digits=2)
[1] 0.16 0.32 0.53


Our estimates are pretty darn close. (Recall we can normalize these sample counts by length to get mRNA transcript counts for each variant.)

How certain are we of these estimates? Not very, because there were only 190 samples and lots of mapping uncertainty. It’s easy to estimate posterior 95% intervals, 50% intervals, and medians using R’s quantile function over the Gibbs samples:

> print(quantile(theta[,1], c(0.025,0.25,0.5,0.75,0.975)),
digits=2)
2.5%   25%   50%   75% 97.5%
0.025 0.104 0.167 0.231 0.340
> print(quantile(theta[,2], c(0.025,0.25,0.5,0.75,0.975)),
digits=2)
2.5%   25%   50%   75% 97.5%
0.13  0.25  0.31  0.37  0.47
> print(quantile(theta[,3], c(0.025,0.25,0.5,0.75,0.975)),
digits=2)
2.5%   25%   50%   75% 97.5%
0.42  0.49  0.52  0.56  0.61


With more data, and especially with uniquely mapping reads (e.g. across unique junction boundaries or in unique exons for a variant), posterior uncertainty will be greatly reduced.

### A Triangle Plot and Correlated Expressions

Given that we only have three splice variants, we can create a so-called triangle plot of the expression levels, which will show correlations among the expression levels estimated for the variants on the simplex. Here’s the plot:

I couldn’t figure out how to label the axes in R’s ade4 package, but it’s clear from the means that are labeled that the upper left axis (mean of 0.17) is for splice variant 1, the bottom axis (mean of 0.308) is for variant 2, and the top right axis (mean of 0.522) for variant 3.

It takes some practice to view triangle plots like this; it’s perhaps easier to view two at a time. Here’s the two-way interaction plots:

These show the strong negative correlation between estimates of the expression for variants 1 and 2, but little correlation between these and variant 3. This makes sense given variants 1 and 2 share the large exon BB.

Finally, here’s the source that generated the PNG file for the triangle plot:

library("ade4")
png(file="splice-variant-exp-tri.png")
draw.line=TRUE,show.position=FALSE,scale=FALSE)
dev.off()


and for the ordinary plots:

png(file="splice-variant-exp-12.png")
plot(theta[,c(1,2)],xlab="variant 1",ylab="variant 2",
xlim=c(0,1),ylim=c(0,1),pch=20,main="1 vs. 2")
dev.off()

png(file="splice-variant-exp-13.png")
plot(theta[,c(1,3)],xlab="variant 1",ylab="variant 3",
xlim=c(0,1),ylim=c(0,1),pch=20,main="1 vs. 3")
dev.off()

png(file="splice-variant-exp-23.png")
plot(theta[,c(2,3)],xlab="variant 2",ylab="variant 3",
xlim=c(0,1),ylim=c(0,1),pch=20,main="2 vs. 3")
dev.off()