## Li, Ruotti, Stewart, Thomson and Dewey (2010) RNA-Seq Gene Expression Estimation with Read Mapping Uncertainty

It was bound to happen. The model I was proposing for splice-variant (or isoform) expression was too obvious (and too good!) for it not to be independently discovered. Seeing my ideas discovered by others is on one hand a bummer, but on the other hand gives me confidence that I’m thinking about these models in the right way.

The following paper presents roughly the same model of expression I presented in a previous blog post, Inferring Splice Variant mRNA Expression with RNA-Seq.

It also incorporates some of the features I suggested for edit distance in another blog post, Sequence alignment with conditional random fields.

### The Basic Expression Model

In fact, it’d almost be easier to say what’s different in Li et al.’s model. The expression model is almost identical, down to the name of the variable, $\theta$, used for read expression. The model in this paper assumes $N$ reads, implicitly assumes an uniform prior for $\theta \sim \mbox{\sf Dirichlet}({\bf 1})$, then for each read $1 \leq n \leq N$, chooses splice variants $z_n \sim \mbox{\sf Discrete}(\theta)$. So far, identical (other than that I considered arbitrary Dirichlet priors).

[Update, June 10, 2010: I forgot to mention:

### Noise Isoform

Li et al. include a so-called “noise isoform”, which they assign a simple probability distribution to. This will act as a sink for reads that do not map elsewhere. I don’t quite see given how it’s defined how anything would get mapped to it if we restricted attention to reads that mapped within a few edits with BOWTIE.]

Things look different when Li et al. start to generate reads $y_n$. What I did was assume an arbitrary distribution $\mbox{\sf Read}(y_n|z_n,\varphi)$, with parameters $\varphi$ characterizing the likelihood of observing read $y_n$ from reference source $z_n$. Li et al. decompose part of $\mbox{\sf Read}$ in their model. First, they assume $s_n$ is the start position of read $y_n$ on reference sequence $z_n$ and assume $o_n$ is the strand, both of which are generated from the ref sequence $z_n$, by distributions $p(o_n|z_n)$ and $p(s_n|z_n)$.

This is where Li et al.’s proposal relates to my probabilistic aligner proposal. In particular, with the position, strand and reference sequence, $s_n, o_n, z_n$, the reads may be defined to be sensitive to location (say in relative position measured from 5′ to 3′), or to underlying sequence (such as the hexamer bias due to priming studied in [Hansen, Brenner and Dudoit 2010]). They only study the positional biases, and for their data, they were small. But groovy nonetheless. It’s building in the right kind of sensitivity to the biological sample prep and sequencing.

### Non-Probabilistic Alignment

[Updated, June 10, 2010: I don’t know what I was thinking — Li et al. definitely use probabilistic alignment. In fact, they use a position-specific edit matrix $w_t$ for substitutions (no indels). I’m not sure how the matrices are estimated.]

What they’re not using is the the quality scores on the reads themselves (as is done in Clement et al.’s GNUMap aligner).

I think there’s an opportunity to squeeze more sensitivity and specificity out of the model by using the quality scores from the sequencer (if they can be calibrated properly, that is).

The aligner I propose also normalizes the edits to proper probabilities using a sum-product algorithm; it’d be easy to extend to read quality as in GNUMap, or to compose it with a color space finite-state transducer, as in SHRiMP for SOLiD reads.

### EM MLE vs. Gibbs Sampling Bayesian Posterior

The other main difference is that I was proposing using Gibbs sampling to produce samples from the posterior $p(\theta|y,\alpha,\varphi)$ of expression given reads $y$ and channel model $\varphi$ and Dirichlet prior $\alpha$. Li et al. use EM to find the maximum likelihood estimate, $\theta^* = \arg\max p(y|\theta,\varphi)$. As usual, the MLE is just the MAP estimate with a uniform prior, so in my model, the MLE is $\theta^* = \arg\max_\theta p(\theta|y,\alpha={\bf 1},\varphi)$.

### Bootstrap Standard Error Estimates

One of the main arguments I made for the Bayesian approach is that, like all truly Bayesian approaches, it supports posterior inference with uncertainty. This is very useful for doing things like pairwise or multiple comparisons of expression or adjusting for false discovery rates.

Li et al. use a bootstrap estimate of standard error, which is great to see. I wish more researchers provided variance estimates.

The danger with only reporting standard error (or variance) in these skewed binomials (or multinomials) is that the parameter value’s very close to the edge of the allowed values, so the 95% interval can contain illegal values. You see the same problem for normal approximations of variance for the Poisson, where a few standard deviations below the mean can result in negative counts.

[Update: June 10, 2010 Of course, you can plot plug-in quantiles such as 95% intervals with the bootstrap, too. It’s really pretty much like Gibbs sampling in terms of the application, if not philosophy.]

### Simulations

They run a bunch of simulation experiments to show that this kind of model makes sense. I did the same thing on a (much much) smaller scale. They use Langmead et al.’s BOWTIE aligner with up to two mismatches, which seems a rather dodgy basis for this kind of expression model. It will depend on the settings, but the defaults provide a greedy heuristic search that’s not guaranteed to be complete in the sense of finding all or even the best alignment.

[Update: June 10, 2010: BOWTIE has a --all setting that the documentation to generate all matching reads, but there’s also a maximum number of backtracks parameter that can eliminate some matches if there are 2 or more edits allowed.

Even if BOWTIE can be configured to find all the matches up to edit distance 2, there’s no reason to assign them all the same probability in the model or to assume that a read is accurately mapped at edit distance 2 and not at edit distance 3 or greater.

My understanding is that due to its best-first heuristic search, BOWTIE does not guarantee it will find all reads even up to a given edit distance.]

What we really need is some real applications. Mitzi and I are reanalyzing the human liver and kidney Illumina RNA-Seq data described in (Marioni et al. 2008), and also some new data generated by Ken Birnbaum (and Paul Scheid) at NYU using a single-cell protocol on SOLiD on Arabidopsis roots over a 5-day period. Paul Scheid, the director of the SOLiD core facilty at NYU, just presented the data at a SOLiD user’s group meeting he hosted at NYU last Friday. The short story is that Mitzi crunched the data to show that you can use a single-cell protocol on SOLiD and use SHRiMP for alignment to derive expression results similar to that estimated from parallel microarray day using Li and Wong’s D-Chip factor model for microarray expression.

### 20 to 25 Bases!

The conclusion of Li et al. is that if each base costs the same to read independent of read length, then according to their simulations, the optimal read length for caclulating variant expression is around 20 to 25 bases, not the 50 or longer that Illumina and SOLiD are producing these days.