Archive for the ‘Bioinformatics’ Category

Sampling, Modeling and Measurement Error in Inference from Clinical Text

August 18, 2011

Here’s a link to the slides of the talk I presented recently at ICML:

It’s basically a list of the kinds of things that can go wrong and introduce error (bias and noise) into inferences. Although the examples are mostly clinical (with one on baseball and one on cancer clusters), the point is generally applicable.

Small, Focused Workshops

I really like small, focused workshops, and this one was very good, with lots of presentations on people’s practical experiences launching systems in hospitals and working on fascinating text mining problems from clinical notes.

Thanks to the Organizers

Thanks again to the organizers, especially Faisal Farooq, who handled all the paperwork. It’s a pretty thankless job in my experience, but having done it myself, I can really appreciate how much work it is to run something that comes off smoothly.

I don’t know how long the page will last, but here’s a link to the workshop itself:

Unintended (Beneficial) Consequences

When Noémie Elhadad invited me to give a talk, I met with her to see if there was a topic I could talk about. During that meeting, she mentioned how hard it had been to hire an NLP programmer in biomedical informatics (it’s just as hard if not harder at a small company). The upshot is that Mitzi got a new job at Columbia into the bargain. In a way, it’s too bad, because I miss talking to Mitzi about her work in genomics, about which I know relatively little compared to NLP.

Contextual Effects and Read Quality in a Probabilistic Aligner

June 14, 2010

In my last post on this topic, Sequence Alignment with Conditional Random Fields, I introduced a properly normalized Smith-Waterman-like global aligner (which required the entire read to be aligned to a subsequence of the reference sequence). The motivation was modeling the probability of a read being observed given a reference sequence from which it originated.

Context Plus Read Quality

Today I want to generalize the model in two ways. First, I’ll add contextual effects like affine gap scores and poly-base errors on Illumina.

Second, I’ll take into consideration per-base read quality, as produced by Illumina’s prb files (with simple fastq format, where you only get the probability of the best base, we can either treat the others as uniformly likely or use knowledge of the confusion error profile of the sequencer).

We’ll still take start position into account, for issues like modeling hexamer priming or fractionation effects in sample prep.

The dynamic programming algorithms work as before with finer-grained cells. Further, you can infer what the actual read was, though I believe a hierarchical model will be better for this, because the sequence polymorphism can be separated from the read errors generatively.

Modeling Alignments

What we’re interested in for our work on RNA-Seq based mRNA expression is modeling is p(y|x), the probability of a read y given a reference sequence x. Rather than approaching this directly, we’ll take the usual route of modeling a (latent) alignment A at a (latent) start position i, p(y,A,i|x), and then marginalizing,

p(y|x) = \sum_{A,i} p(y,A,i|x).

The best alignment will be computable via a standard sequence max algorithm, and the marginalization through the standard product-sum algorithm. Both run in time proportional to the read length times reference sequence length (of course, the latter will be trimmed down to potential matches in a region). Both can be sped up with upper bounds on distance.

The Model

We assume there is a score \mbox{\sc start}(i,x) \in \mathbb{R} for each start position i in reference sequence x. For instance, this may be a log probability of starting at a position given the lab fractionation and priming/amplification protocol.

We will also assume there is a quality score on a (natural) log probability scale \mbox{\sc logprb}(y,b,j) \in \mathbb{R} for the probability of base b at position j in read y. For instance, these may be the Phred quality scores reported in Illumina’s prb files converted to (natural) log probability of match.

Our definition only normalizes for reads of fixed length, which we’ll write \mbox{\sc readlen}.

We will recursively define a score on an additive scale and then exponentiate to normalize. In symbols,

p(y,A,i|x) \propto \exp\big(\mbox{\sc start}(i,x) + f(a,i,0,x,y,\mbox{\rm start})\big)

where we define the alignment score function f(A,i,j,x,y,a), for alignment sequence A, read x and read position i, reference sequence y and reference position j and previous alignment operation a, recursively by

\begin{array}{rcl}f([],i,\mbox{\sc readlen},x,y,a) & = & 0.0\\[18pt]f(\mbox{\rm sub}(x_i,b)+A,i,j,x,y,a) & = &  \mbox{\sc subst}(x,y,i,j,a)\\ & + & \mbox{\sc logprb}(y,b,j)\\ & + & f(A,i+1,j+1,x,y,\mbox{\rm sub}(x_i,b))\\[12pt]f(\mbox{\rm del}(x_i)+A,i,j,x,y,a) & = &  \mbox{\sc del}(x,y,i,j,a)\\ & + & f(A,i+1,j,x,y,\mbox{\rm del}(x_i))\\[12pt]f(\mbox{\rm ins}(b)+A,i,j,x,y,a) & = &  \mbox{\sc ins}(x,y,i,j,a)\\ & + & \mbox{\sc logprb}(y,b,j)\\ & + & f(A,i,j+1,x,y,\mbox{\rm ins}(b))\end{array}

Reminds me of my Prolog programming days.

The Edit Operations

The notation a + A is meant to be the edit operation a followed by edit sequence A. The notation [] is for the empty edit sequence.

The notation \mbox{\rm sub}(x_i,b) is for the substitution of an observed base b from the read for base x_i of the reference sequence; if x_i \neq b it’s a particular mismatch, and if x_i = b, it’s a match score. The notation \mbox{\rm ins}(b) is for the insertion of observed base b and \mbox{\rm del}(x_i) for the deletion of base x_i in the reference sequence. Insertion and deletion are separated because gaps in the reference sequence and read may not be equally likely given the sequencer and sample being sequenced.

Note that the base case requires the entire read to be consumed by requring j = \mbox{\sc readlen}.

Scoring Functions

Note that the score functions, \mbox{\sc subst}(x,y,i,j,a) \in \mathbb{R}, \mbox{\sc del}(x,y,i,j,a) \in \mathbb{R}, and \mbox{\sc ins}(x,y,i,j,a) \in \mathbb{R}, have access to the read y, the reference sequence x, the present position in the read j and reference sequence i, as well as the previous edit operation a.

Enough Context

This context is enough to provide affine gap penalties, because the second delete will have a context indicating the previous operation was a delete, so the score can be higher (remember, bigger is better here). It’s also enough for dealing with poly-deletion, because you have the identity of the previous base on which the operation took place.

Further note that by defining the start position score by a function \mbox{\sc start}(x,i), which includes a reference to not only the position i but the entire reference sequence i, it is possible to take not only position but also information in the bases in x at positions around i into account.

Dynamic Programming Same as Before

The context is also restricted enough so that the obvious dynamic programming algorithms for first-best and marginalization apply. The only tricky part is computing the normalization for p(y,A,i|x), which has only been defined up to a constant. As it turns out, the same dynamic programming algorithm works to compute the normalizing constant as for a single probabilistic read.

In practice, you have the usual dynamic programming table, only it keeps track of the last base matched in the read, the last base matched in the reference sequence, and whether the last operation was an insertion, deletion, or substitution. To compute the normalizer, you treat each base log prob from a pseudoread of the given read length \mbox{\sc readlen} as having score 0 and accumulate all the matches in the usual way with the sum-product algorithm. (You’ll need to work on the log scale and use the log-sum-of-exponentials operation at each stage to prevent underflow.)

In general, these dynamic programming algorithms will work anywhere you have this kind of tail-recursive definition over a finite set of operations.

Inferring the Actual Read Sequence for SNP, Etc. Calling

Because of variation among individuals, the sequence of bases in a read won’t necessarily match the reference sequence. We can marginalize out the probability that the read sequence is z given that it was generated from reference sequence x with read y (remember the read’s non-deterministic here),

p(z|x,y) = \sum_{i,\mbox{yield}(A) = z} p(A,i,x|y),

where \mbox{yield}(A) is the read produced from alignment A, which itself is defined recursively by

\mbox{yield}([]) = \epsilon,

\mbox{\rm yield}(\mbox{\rm del}(x) + A) = \mbox{\rm yield}(A),

\mbox{\rm yield}(\mbox{subst}(b,x_i) + A) = b \cdot \mbox{\rm yield}(A), and

\mbox{\rm yield}(\mbox{ins}(b)+A) = b \cdot \mbox{\rm yield}(A),

where \epsilon is the empty string and \cdot string concatenation.

I believe this is roughly what the GNUMAP mapper does (here’s the pay-per-view paper).

Reference Sequences with Uncertainty

We could extend the algorithm even further if we have a non-deterministic reference sequence. For instance, we can use data from the HapMap project to infer single nucleotide polymorphisms (SNPs). We just do the same thing for reference side non-determinsm as on the read side, assuming there’s some score associated with each choice and adding the choice into the context mix.

Li, Ruotti, Stewart, Thomson and Dewey (2010) RNA-Seq Gene Expression Estimation with Read Mapping Uncertainty

June 9, 2010

It was bound to happen. The model I was proposing for splice-variant (or isoform) expression was too obvious (and too good!) for it not to be independently discovered. Seeing my ideas discovered by others is on one hand a bummer, but on the other hand gives me confidence that I’m thinking about these models in the right way.

The following paper presents roughly the same model of expression I presented in a previous blog post, Inferring Splice Variant mRNA Expression with RNA-Seq.

It also incorporates some of the features I suggested for edit distance in another blog post, Sequence alignment with conditional random fields.

The Basic Expression Model

In fact, it’d almost be easier to say what’s different in Li et al.’s model. The expression model is almost identical, down to the name of the variable, \theta, used for read expression. The model in this paper assumes N reads, implicitly assumes an uniform prior for \theta \sim \mbox{\sf Dirichlet}({\bf 1}), then for each read 1 \leq n \leq N, chooses splice variants z_n \sim \mbox{\sf Discrete}(\theta). So far, identical (other than that I considered arbitrary Dirichlet priors).

[Update, June 10, 2010: I forgot to mention:

Noise Isoform

Li et al. include a so-called “noise isoform”, which they assign a simple probability distribution to. This will act as a sink for reads that do not map elsewhere. I don’t quite see given how it’s defined how anything would get mapped to it if we restricted attention to reads that mapped within a few edits with BOWTIE.]

Position and Strand-Based Reads

Things look different when Li et al. start to generate reads y_n. What I did was assume an arbitrary distribution \mbox{\sf Read}(y_n|z_n,\varphi), with parameters \varphi characterizing the likelihood of observing read y_n from reference source z_n. Li et al. decompose part of \mbox{\sf Read} in their model. First, they assume s_n is the start position of read y_n on reference sequence z_n and assume o_n is the strand, both of which are generated from the ref sequence z_n, by distributions p(o_n|z_n) and p(s_n|z_n).

This is where Li et al.’s proposal relates to my probabilistic aligner proposal. In particular, with the position, strand and reference sequence, s_n, o_n, z_n, the reads may be defined to be sensitive to location (say in relative position measured from 5′ to 3′), or to underlying sequence (such as the hexamer bias due to priming studied in [Hansen, Brenner and Dudoit 2010]). They only study the positional biases, and for their data, they were small. But groovy nonetheless. It’s building in the right kind of sensitivity to the biological sample prep and sequencing.

Non-Probabilistic Alignment

[Updated, June 10, 2010: I don’t know what I was thinking — Li et al. definitely use probabilistic alignment. In fact, they use a position-specific edit matrix w_t for substitutions (no indels). I’m not sure how the matrices are estimated.]

What they’re not using is the the quality scores on the reads themselves (as is done in Clement et al.’s GNUMap aligner).

I think there’s an opportunity to squeeze more sensitivity and specificity out of the model by using the quality scores from the sequencer (if they can be calibrated properly, that is).

The aligner I propose also normalizes the edits to proper probabilities using a sum-product algorithm; it’d be easy to extend to read quality as in GNUMap, or to compose it with a color space finite-state transducer, as in SHRiMP for SOLiD reads.

EM MLE vs. Gibbs Sampling Bayesian Posterior

The other main difference is that I was proposing using Gibbs sampling to produce samples from the posterior p(\theta|y,\alpha,\varphi) of expression given reads y and channel model \varphi and Dirichlet prior \alpha. Li et al. use EM to find the maximum likelihood estimate, \theta^* = \arg\max p(y|\theta,\varphi). As usual, the MLE is just the MAP estimate with a uniform prior, so in my model, the MLE is \theta^* = \arg\max_\theta p(\theta|y,\alpha={\bf 1},\varphi).

Bootstrap Standard Error Estimates

One of the main arguments I made for the Bayesian approach is that, like all truly Bayesian approaches, it supports posterior inference with uncertainty. This is very useful for doing things like pairwise or multiple comparisons of expression or adjusting for false discovery rates.

Li et al. use a bootstrap estimate of standard error, which is great to see. I wish more researchers provided variance estimates.

The danger with only reporting standard error (or variance) in these skewed binomials (or multinomials) is that the parameter value’s very close to the edge of the allowed values, so the 95% interval can contain illegal values. You see the same problem for normal approximations of variance for the Poisson, where a few standard deviations below the mean can result in negative counts.

[Update: June 10, 2010 Of course, you can plot plug-in quantiles such as 95% intervals with the bootstrap, too. It’s really pretty much like Gibbs sampling in terms of the application, if not philosophy.]


They run a bunch of simulation experiments to show that this kind of model makes sense. I did the same thing on a (much much) smaller scale. They use Langmead et al.’s BOWTIE aligner with up to two mismatches, which seems a rather dodgy basis for this kind of expression model. It will depend on the settings, but the defaults provide a greedy heuristic search that’s not guaranteed to be complete in the sense of finding all or even the best alignment.

[Update: June 10, 2010: BOWTIE has a --all setting that the documentation to generate all matching reads, but there’s also a maximum number of backtracks parameter that can eliminate some matches if there are 2 or more edits allowed.

Even if BOWTIE can be configured to find all the matches up to edit distance 2, there’s no reason to assign them all the same probability in the model or to assume that a read is accurately mapped at edit distance 2 and not at edit distance 3 or greater.

My understanding is that due to its best-first heuristic search, BOWTIE does not guarantee it will find all reads even up to a given edit distance.]

What we really need is some real applications. Mitzi and I are reanalyzing the human liver and kidney Illumina RNA-Seq data described in (Marioni et al. 2008), and also some new data generated by Ken Birnbaum (and Paul Scheid) at NYU using a single-cell protocol on SOLiD on Arabidopsis roots over a 5-day period. Paul Scheid, the director of the SOLiD core facilty at NYU, just presented the data at a SOLiD user’s group meeting he hosted at NYU last Friday. The short story is that Mitzi crunched the data to show that you can use a single-cell protocol on SOLiD and use SHRiMP for alignment to derive expression results similar to that estimated from parallel microarray day using Li and Wong’s D-Chip factor model for microarray expression.

20 to 25 Bases!

The conclusion of Li et al. is that if each base costs the same to read independent of read length, then according to their simulations, the optimal read length for caclulating variant expression is around 20 to 25 bases, not the 50 or longer that Illumina and SOLiD are producing these days.

Expected Error Histograms from Illumina 36bp Reads

May 10, 2010

Mitzi and I have been working on reproducing the results reported in the following (hugely influential) recent paper.

The paper analyzes human liver and kidney tissue using both an Affymetrix micro-array and the Illumina short-read sequencer. More about Affy vs. Illumina later; for now, I just want to describe the quality of the Illumina reads.

Error Profile of Illumina

What’s nice about modern bioinformatics is that many authors submit their data to public repositories. So Mitzi was able to download and crunch Marioni et al.’s Illumina data.

For each 36bp read, Illumina provides a probability estimate for each base. For instance, we might have the 17th position assign a probability of 0.97 to A, 0.02 to C, 0.001 to G, and 0.009 to T. Marioni et al., like most researchers, simply take the first-best calls (see below for some exceptions), thus assigning position 17 in the above example to A. But according to Illumina, there’s only a 97% chance that call is correct!

Calculating Expected Errors per Read

So how many errors do we expect in a 36bp read? If q_i is the probability that the most likely base at position i is correct, the expected number of errors in the first I reads is \sum_{i = 1}^{I} (1 - q_i).

With a little Java, Python and R (more about the package we’re working on later), Mitzi plotted a histogram for all 66.4M reads from Marioni’s 3pM kidney data:

Illumina expected errors, 36bp

But it turns out that like most researchers, Marioni et al. truncated the reads because they’re noisier toward the tail (3′ end) of the reads. Specifically, they only used the first 32bp. Here’s the same plot with the last 4bps truncated:

Illumina expected errors, 32bp

As you can see, this dramatically reduces the expected number of errors.

Calculating Expected Number of Correct Reads

Mitzi also plotted the probability that the first best calls up to length I are all correct for a read, which is \prod_{i=1}^{I} q_i. The probabilities are so low it requires a log (base 10) scale when using all 36bps:

Illumina expected correctly called reads

The startling conclusion is that there’s almost no chance at all that the first-best base calls leads to the correct sequence. This has unsettling ramifications for procedures like SNP calling.

The Done Thing

What Marioni et al. did is typical. They used a heuristic aligner (Bowtie) which ignores the read quality scores and allows only a small number of edits. They then censor (aka discard) any read from their data set that doesn’t align “uniquely”. A unique alignment is taken to be one that has a unique maximum quality score with fewer than the maximum number of allowed edits. That is, if there’s an alignment with one edit, and seven with two edits, the alignment is considered unique; but if there are two alignments with one edit, the read’s discarded.

Why throw away non-unique reads? There’s good reason to believe that the unique alignments are not unique. Not eveyone discards non-unique reads; there are so-called “recovery” strategies, which I discussed in a previous post describing my joint mapping recovery and mRNA expression model.

Why throw away quality scores from the aligner? I only know of two aligners that use quality scores, Gnumap and BWA, but these systems are far less popular than simple edit counting systems like Bowtie.

Why only count edits? A proper channel model should be better. For instance, the SHRiMP system employs a more refined channel model (which my probabilistic alignment model refines with proper normalization and integrating out of the actual sequence of edits).

The answer’s simple really: it’s easier to code and easier to explain.

Can We Do Better?

Like every newbie, I feel we can. Not surprisingly, I think we should use all the data we have, integrating a proper probabilistic channel model for alignment with a Bayesian mRNA expression model.

Mitzi’s busy doing much more liberal alignments with SHRiMP (which requires a Viterbi approximation to the channel model) and then we’ll run them through the scalable collapsed Gibbs sampler I wrote the for mRNA expression model. After that, we’ll work in the more refined channel edit model that can account for biases induced by the wet lab process and the sequencer, and the quality scores coming from the sequencer.


No, Alias-i has no plans to work on Bioinformatics; this is still just a hobby for me. Maybe I should choose a hobby that’s a little more dissimilar from my day job.

McCallum, Bellare and Pereira (2005) A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance

April 27, 2010

Following up on my previous post on alignment with CRFs, I’ve found a related paper that I’d like to review here. I’ve spent the last 20 years following Fernando around the intellectual landscape, from logic programming and grammatical formalisms to statistics and biology. (So it’s perhaps not surprising that we’re both speaking next week at the Edinburgh HCRC and Cog Sci Reunion.)

I have to admit I’d read this paper in the dim and distant past (OK, it’s from 2005), so something might’ve stuck in my head. But once you grok CRFs, building new models is easy, because as I said last time, CRFs are really good at normalizing sequential labeling or classification problems.


McCallum et al. applied their model to matching restaurant names and addresses and also to matching bibliographic citations. In contrast, I was interested in matching short reads against reference genomes (for DNA sequencing) or gene models (for mRNA sequencing).

What’s Being Modeled?

McCallum et al.’s goal was to model the probability that two strings match. This reduces the problem to a binary classification problem.

In contrast, what I was modeling in the last post is the probability of a query/read sequence being generated from a reference sequence. The reads are more like the tags generated in a tagging CRF. The alignment is a local alignment of the read to the reference.

We both approached the problem by first defining a notion of an alignment sequence and its relation to the strings. They modeled the probability of a global alignment given both strings, whereas I modeled the probability of a read locally aligned to a reference sequence.

The difference is in normalization terms, both of which are easy to calculate via dynamic programming.

As they point out in their discussion, what they did is to pair HMMs (i.e. as used by [Ristad and Yianilos 1998]), as CRFs are to regular HMMs. I turned a similar crank to estimate a related probability.

Other Innovations

What I really like about McCallum et al.’s approach is that it allows chunked matching operations. Although they don’t consider affine gap models (where longer gaps aren’t that much more unlikely than shorter gaps), they do consider token-sensitive operations like deleting a token that appears in both strings or in a dictionary, and consider deleting to the end of the current token (useful for matching acronyms). The possibilities are endless.

With their CRF FST implementation in Mallet, I’m guessing it was as easy to code as the paper makes it seem. The FST coding is straightforward, with the only problem being the usual one of explosion of dependencies. Neatly (for both McCallum et al. and me), the sum-product (forward/back) and max-product (Viterbi) dynamic programming algorithms fall out of the encoding.

They don’t allow non-local transpositions (as in, say a soft TF/IDF-like approach to matching), so everything remains reasonably tractable.

They note that the estimation problem is not concave, because they use latent variables (alignments and states aren’t directly observed here). I’m pretty sure I’ll have the same problem.

They treat matches and mismatches as a mixture model, with two distinct subgraphs of the FSTs. This allows them to have separate features as in a max entropy or discrete choice (DCA) model and unlike in a logistic regression-type approach where there are no outcome-specific features.

Section 5.1 provides a hint at the kinds of features they use to set the values for the various edit operations, like is a numerical character substituted for another numerical character. Curiously, there are no features for the actual characters being matched, just classes. Maybe they didn’t have enough data.

What I added that they didn’t have was a different normalization and features for the alignment position of local alignments (useful for modeling positional or polymer biases in reads).

Direct Regression

(Tsuruoka, McNaught, Tsujii and Ananiadou 2007) introduced an approach to string similarity that was a straight-up logistic regression based on features extracted from the two strings. These features were things like n-grams.

This also sounds like a good idea. I’m guessing the error patterns would be dissimilar enough that the two classifies could be fruitfully combined.

The Tsuruoka model would be easy to implement in LingPipe. I had to do a custom implementation for my own model decoder and haven’t built the estimator yet.

Further Applications and Generalizations

I’d love to see something like this applied to the BioCreative III gene linkage task (and thank you, organizers for switching to camel case from seemingly random case). You’d have to generalize somewhat, because you’re matching against a whole set of known aliases for a gene. The discriminative training would be a huge help to sort out common phrases, the importance of numbers, the correspondence of Greek and Latin characters, etc.

McCallum et al. only had a chance to scratch the surface of the feature space that might be useful for this and related problems. I’m surprised now that I haven’t seen more work building on this.

It’d also be easy to take any of the pair HMM applications to things like phonology or morphology that are out there.

Sequence Alignment with Conditional Random Fields

April 25, 2010

Having spent hours figuring out how to properly normalize edit probabilities of an edit channel model so they sum to 1 over all reads of a given length given a reference sequence, the answer seems obvious in retrospect. Use conditional random fields (CRFs), because they can normalize anything and are easy to program. An added benefit is that biologists are used to thinking of edit costs on a log scale.

Once we have a conditional model with an arbitrary linear basis, it is straightforward to account for differential expression given location or initial hexamer (more on that below).

The Sequence Alignment Problem

The first step in most DNA or RNA analysis pipelines is sequence alignment, wherein the bases read from a biological sample by a sequencer are aligned with the bases of one or more reference sequences (genome sequences or gene models of RNA transcripts).  The present generation of sequencers (e.g. Illumina or SOLiD) produce tens or even hundreds of millions of reads of 50 bases or so per run.

Typically, simpler models like pure edit distance are computed, using hashing heuristics to narrow search or suffix arrays to share prefix calculation. The model described here is intended to be run over a narrower set of alignment hypotheses.

Generative Expression Models Require Probabilistic Alignments

In a previous post, I described a Bayesian hierarchical model of mRNA expression. That model contains a random variable for each read that is dependent on the sequence from which it is read. Inference with this model requires us to calcluate the probability of a read given a reference sequence. The model described here directly plugs into the expression model. Probabilistic alignments may also be used in SNP calling, splice-variant discovery, or any other applications of sequence alignment.


Our model is based directly on the simple notion of alignment, which consists of a sequence of matches/substitutions, inserts (gaps in reference), and deletes (gaps in query/read). Repeating the example from my earlier post on probabilistic channel models, the following is an alignment of the read/query sequence AGTTAGG (below) to the reference sequence ACGTACG (above):


The first C in the reference is deleted, the second T in the read is inserted, and the second G in the read is substituted for the second C in the reference. We can represent alignments as sequences of edit operations. For the alignment above, we have the sequence of eight edits

m(A), d(C), m(G), m(T), i(T), m(A), s(G,C), m(G).

The alignment indicates the bases involved so that the reference and query/read sequences may be reconstructed from the alignment.

As we noted before, there are typically very many ways to align a read to a reference. Here’s an alternative alignment,


which corresponds to the following sequence of seven edits

m(A), s(G,C), s(T,G), m(T), m(A), s(G,C), m(G).

Read Probabilities from Alignment Probabilities

Suppose we have a reference sequence of M bases x = x_1,\ldots,x_M. The CRF we propose will model the probability p(a,m|x,N) of alignment a aligning at position m of reference sequence x and producing a read of length N. We’ll let \mbox{read}(a) be the read generated by the alignment a.

We compute the probability of a read y of length N starting at position m of the reference sequence as

p(y,m|x,N) = \sum_{\{a \ : \ \mbox{read}(a) = y\}} p(a,m|x,N).

We can then marginalize out the starting position m by summing to calculate the probability of a read,

p(y|x,N) = \sum_{m =1}^M p(y,m|x,N).

Linear Basis

In general, CRFs are based on arbitrary potential functions for cliques of dependent variables which combine linearly by summation over all cliques. We break down the basis into a contribution from edits and starting position. For edits, we take

\varphi(a,N) = \sum_{m = 1}^M \varphi(a_m)

where \varphi(a_m) is the potential of the edit operation a_m. Recall that the edits include their edited bases, so we have four match edits m(Z), twelve substitution edits s(Z1,Z2) with Z1 != Z2, four insert edits i(Z) and four delete edits d(Z), for a total of 24 edit operations. So \varphi(\cdot) provides values in (-\infty,\infty) for each of the 24 operations, and the weight of a sequence of operations is the sum of the operation weights.

For starting positions, we take a potential function \psi(m,x) for starting an alignment at position m of reference sequence x = x_1,\ldots,x_M.

The probability of an alignment at a starting position is defined to be proportional to the exponentiated sum of potentials,

p(a,m|x,N) \propto \exp(\varphi(a,N) + \psi(m,x) + \xi(a,N)),

where we take \xi(a,N) to be an indicator variable with value 0 if \mbox{length}(\mbox{read}(a)) = N and -\infty otherwise. This makes sure our distrbution is normalized for the length of reads we care about, and will also make the dynamic programming required for inference tractable.

Normalization is straightforward to calculate in theory. To compute normalized probabilities, we take

p(a,m|x,N) = \frac{1}{Z(x,N)} \exp(\varphi(a,N) + \psi(m,x) + \xi(a,N)), where

Z(x,N) = \sum_{a,m}  \exp(\varphi(a,N) + \psi(m,x) + \xi(a,N)).

As they say in the old country, Bob’s your uncle.

Sum-Product Algorithm for Read Probabilities

It’s just the usual sum-product algorithm for CRFs. Its just like the usual Smith-Waterman style edit algorithm except that (1) we must consume all the read, (2) start costs are given by \psi, and (3) we use log sums of exponentials rather than maxes to compute cell values without underflow.

Sum-Product for Normalization

To compute the normalizing term Z(x,N), we run a similar sum-product algorithm, only with an extra sum for all the possible reads.

Viterbi Algorithm for Best Alignments

With the same caveats (1) and (2) above, we can use the usual maximum operation to produce the best alignment. We can also follow LingPipe’s approach and use Viterbi in the forward direction and A* in the reverse direction to enumerate the n-best alignments.

Affine Gap Basis and Edit Context Sensitivity

It’s easy to take affine gaps into account. Just condition the edit operations on the previous edit operation. You have to add the relevant context to dynamic programming as in the usual affine extension to Smith-Waterman alignment.

It’s also easy to model things like the uncertainty caused in Illumina reads by homopolymers (sequences repeating the same base) by allowing context for the previous base. This is also easy to dynamic program, though adding context adds to space and time requirements.

Modeling Uneven Reads

Depending on the sample preparation process in RNA sequencing experiments (RNA-Seq) (e.g. RNA fragmentation by hydrolysis versus oligo-primed cDNA fragmentation by sonication), there may be effects in read coverage based on position (e.g. RNA fragmentation depleted at 3′ and 5′ end or cDNA fragmentation favoring reads toward the 3′ end) or on the sequence of bases near the start of the read (e.g. amplification by “random” hexamer priming).

These effects are not small. [Hansen, Brenner and Dudoit 2010] showed that hexamer binding in the illumina process can cause over a 100-fold difference in expression depending on the first two hexamers in a read. Similarly, see figure 3 in [Wang, Gerstein and Snyder 2009], which demonstrates over a tenfold effect on expression based on position. (I sure wish there was variance in the Wang et al. graphs.)

Modeling Input Quality

We can also add input quality to our calculations if it’s on a log scale. Then we have to select which base was actually read, making it more like the use of CRFs for tagging. For an example of how to do this for standard edit costs, see the description of the GNUMAP system in [Clement et al. 2010]. As the authors point out, this is a serious issue with reads as noisy as those coming off the Illumina or SOLiD platforms.

LDA Clustering for Gene Pathway Inference

February 25, 2010

Genetic activation is hierarchically structured at many different levels. Tight clusters of associated genes are co-activated to construct molecular machines or participate in genetic pathways. Biologists spend considerable time trying to work out these pathways down to the rate equation level. More abstract interaction graphs are available from the KEGG Pathway Database, the web page for which describes it as “Wiring diagrams of molecular interactions, reactions, and relations.”

I’ve blogged previously about inferring gene or mRNA transcription levels. Now suppose you have several tissue samples from the same organism (e.g. from different cells or at different developmental stages or just replicates of the same experiment). These might be mapped to genes, to splice variants, or to individual exons. What we’d like to be able to do is start making inferences about genetic pathways and their activation.

The Simplest Model: LDA

I wrote the simplest possible model down, stepped back, and realized it was just latent Dirichlet allocation (LDA), where “documents” are samples and “words” are mapped reads.

I really like BUGS notation for models because it combines the ease of a graphical model presentation with the precision of a programming language.

S: number of samples
T: number of pathways
K: number of genes (or variants or exons)
R[s]: number of reads in sample s
y[s,r]: mapping of read r in sample s to gene in 1:K
alpha: T-dimensional prior on pathway expression in sample
beta: K-dimensional prior on gene expression in pathway

phi[t]: gene expression in pathway t
theta[s]: pathway expression in sample s
t[s,r]: pathway for read r in sample s

for (t in 1:T) {
    phi[t] ~ Dirichlet(beta)
for (s in 1:S) {
    theta[s] ~ Dirichlet(alpha)
    for (r in 1:R[s]) {
        t[s,r] ~ Discrete(theta[s])
        y[s,r] ~ Discrete(phi[t[s,r]])

Note that the expression parameters theta and phi are scaled as ratios over reads. These values may be normalized by gene (or variant or exon) length (taking into account the vagaries of the mapping program) to molar units (by analogy with the number of reads being like weight).

These models are really easy to fit and really easy to scale. As output, we get an estimate of the joint posterior distribution p(theta,phi|y,alpha,beta) over pathway expression and pathway composition (we just marginalize out the nuisance parameter t in the usual way by summation).

LingPipe’s implementation is explained with sample code in:

The inputs are the hyperparameters alpha (the dimension of which determine the number of pathways T) and beta (determining the number of genes K), and the (ragged) input matrix y (determining the number of samples S and reads per sample R[s]). The output is a sequence of Gibbs samples. Each sample contains a value for all of the parameters: the source t of reads, the gene in pathway expressions phi and the pathway in sample values theta. We can just discard t to get a simulation of the posterior p(theta,phi|y,alpha,beta).

Reasoning with Samples

As I discussed in our tutorial (and Griffiths and Steyvers discuss in the paper cited), the cluster assignments themselves in LDA are not identifiable. What we do get is a kind of random effects mixture model of correlation among genes.

On the plus side, we can easily carry out multiple comparisons or other plug-in inferences from any derived sample statistics (such as pairwise or groupwise gene co-activation).

Given that we’re usually using clustering like LDA as part of exploratory data analysis, we can just look at samples and topics and see what pops up. And if we have ground truth, as supplied at least partially by KEGG, we can look at whether or not known pathways emerge as “topics”.

Joint Inference with Mapping

Perhaps the most pleasant feature of Bayesian inference is the ease with which it lets you combine models to perform joint inference at multiple levels or over multiple domains. Because all variable quantities are treated as first-class random variables, and reasoning is carried out by integrating uncertainty, models may be plugged together hierarchically whenever the components fit. Here, the y variables inferred by the RNA-Seq expression model would be the y variables assumed as data here. Very convenient

Inference is the same, at least in theory — there are just more variables to sample. The big difference is that now y has multiple constraints, so its conditional sampling distribution has to be worked out and simulated. BUGS does this kind of thing automatically. I’m still struggling with Casella and Robert’s Monte Carlo Statistical Methods.

Hierarchical Model of Priors

Just as I did for the hierarchical Bayesian model of batting ability and the hierarchical model of categorical data coding, with enough samples S that are in some sense exchangeable, it’s possible to estimate the priors alpha or beta in models like these.

All the usual benefits accrue, like smoothed multiple comparisons.

We just put a diffuse “hyperprior” on alpha and/or beta and include sampling for those variables. (Again, a bit easier said than done for those still working out all these simulations.) But again, it’s easy in BUGS.

Better Priors than Dirichlet

The Dirichlet prior only has a number of parameters equal to the number of dimensions and thus cannot fully characterize covariance of expected expression. A more reasonable model might be a multivariate normal on the logit scale. This is what Andrew Gelman suggested when I asked a while ago, and it’s only just sinking in.

Another problem with the Dirichlet (and multivariate normal) priors are that they fix the number of clusters ahead of time. It’d be possible to plug in something like a Dirichlet process prior here. I’m not sure if the model it entails is reasonable. I’ll have a better idea after fitting some data.

Not Just Theory

Well, OK, this post is just theory. But Mitzi’s already run the expression models I coded based on the previous post, and next up, I’ll try to convince her to run LDA on the data. When we have something ready to release to he public (warning: biologists tend to keep their data close), I’ll announce it on the blog.

Prior Art?

I’m guessing this (or at least the non-Bayesian version of it) has been done, but I don’t know how to look it up (there’s also confusion with “linear discriminant analysis” and “low density array”). And there’s lots of relevant stuff related to other factor models out there, which were popular for analyzing micro-array data. Feel free to post relevant references in the comments. The only things I could find (albeit after only half an hour or so of searching) are:

Also, the Stephens Lab at U.Chicago (which brought you the awesome Marioni et al. paper on RNA-Seq expression (using Poisson GLMs) mention LDA, but in the context of population genetics.

Inferring Splice-Variant mRNA Expression from RNA-Seq

February 5, 2010

I introduced the problem of splice-variant expression from RNA-Seq data in my previous post on maximum likelihood estimates of RNA-Seq expression. In this post, I’ll show you a Bayesian approach to the problem from which we can compute a full posterior over the latent read mappings and relative mature mRNA expression of gene splice variants.

The Model

The data from which we infer expression is

  • K \in \mathbb{N}^{+}, the number of splice variants,
  • N \in \mathbb{N}^{+}, the number of reads, and
  • y_1,\ldots,y_N, the reads.

For the purpose of this model, the reads are likely to be characterized in terms of the output of a sequencer, such as base calls with quality scores.

We assume two model hyperparameters,

  • \varphi, genetic variation, sample preparation, and sequencing parameters, and
  • \alpha_1,\ldots,\alpha_K \in \mathbb{R}^{+}, the prior read counts (plus one).

The general-purpose parameter \varphi reflect how reads are generated, including the expected variation from the reference genome, the sample preparation process, and the sequencing platform’s error profile. The variation from the reference genome will determine indel and SNP rates, for instance using a probabilistically normalized affine edit distance model. Details of the sample preparation process, such as amplicon selection, fractionation technique, and so on, also determine which mRNA sequences are actually observed (through cDNA). Finally, the sequencing platform’s error profile, such as color confusability in image space, decreasing accuracy from the start to end of the read, and sensitivity to context such as the previously read gene (as in color-space dependencies in SOLiD or errors due to multiple copies of the same base).

We will infer two of the model parameters,

  • t_1,\ldots,t_N \in 1:K, the mapping of read to splice variant, and
  • \theta_1,\ldots,\theta_K \in [0,1] such that \sum_{k=1}^K \theta_k = 1, the read expression probabilities.

The model structure in sampling notation is

  • \theta \sim \mbox{\sf Dirichlet}(\alpha)
  • t_n \sim \mbox{\sf Discrete}(\theta) for n \in 1:N
  • y_n \sim \mbox{\sf Channel}(t_n,\varphi) for n \in 1:N.

We select the expression levels \theta based on prior counts.

The heart of the mapping model is the discrete sampling of the latent mappings t_n based on the expression level parameter \theta. This is effectively a beta-binomial model of expression at the corpus level.

The mysterious part of this whole thing is the channel model, which assigns the probability of a given read y_n being observed given it arose from the splice variant t_n under the sequencing model parameterized by \varphi.

Simple Channel Examples

For purposes of illustration, we’re going to assume a really simple instance of this model.

First, we’ll assume that \alpha_k = 1 for k \in 1:K, which renders \mbox{\sf Dirichlet}(\alpha) a uniform distribution over possible values of \theta. With replicates at the sequencing run level (or even over channels for Solexa/Illumina data), we could build a hierarchical model and estimate \alpha.

Second, we’ll assume a simple uniform read model. Specifically, we’ll assume that a read is just as likely to come from anywhere on the mRNA; a richer model would account for known variation due to the experimental procedure.

Third, we’ll assume a mapping program that aligns reads with the reference gene models for splice variant sequences. If the mapper provides multiple reads, we will treat them all as equally likely; in a more articulated model, we would take into account both sequencing quality scores, the error properties of the sequencer, and likely variation such as indel and SNP rates in estimating the probability of a read given a splice variant.

Although cross-junction splice mapping is straightforward to model versus a known reference gene model, we’ll restrict attention to exon reads here.

With all of these assumptions in place, if read y_n is mapped to splice variant t_n \in 1:K, we set

\mbox{\sf Channel}(y_n|t_n,\varphi) = 1 / (\mbox{len}(t_n) - \mbox{len}(y_n)).

The denominator \mbox{len}(t_n) - \mbox{len}(y_n) reflects the number of positions a sequence of the given read length \mbox{len}(y_n) may occur in an mRNA of length \mbox{len}(t_n).

A Simple BUGS Implementation

Although not scalable to RNA-Seq data sizes, it’s simple to investigate the properties of our model in BUGS.

model {
    theta[1:K] ~ ddirch(alpha[1:K])
    for (n in 1:N) {
        t[n] ~ dcat(theta[1:K])
        y[n] ~ dcat(rho[t[n],1:K])

BUGS programs look just like the sampling notation defining the model; ddirch is the Dirichlet distribution and dcat is the discrete (aka categorical) distribution.

I call it from R in the usual way. Here’s just the part where R tells BUGS how to run the model. After the BUGS window is closed, R plots and attaches the data.


data <- list("alpha", "rho", "y", "N", "K")
parameters <- c("theta","t")
inits <- function() { list(theta=c(1/3,1/3,1/3)) }
rnaexp <- bugs(data, inits, parameters,
        n.chains=3, n.iter=50000,
        debug=TRUE, clearWD=TRUE,"c:\\WinBUGS\\WinBUGS14",

The other debugging problem I had is that you absolutely need to tell BUGS not to compute DIC (a kind of model comparison stat); otherwise, BUGS crashes when trying to compute DIC.

Attaching the data converts the parameter of the model to matrix data representing the Gibbs samples.

An Example

I’ll close with an example that illustrates how the R and BUGS program work, and how the model is able to infer expression even without a single non-multi-mapped read. The gene model and expression levels for the simulation are defined in a comment in the R code. We’ll consider a case with three exons and three isoforms:

# exons  A BB C  (B twice as long as others)
# var1 = A BB
# var2 =   BB C
# var3 = A    C

There are three exons, A, BB, and C, where the notation is meant to indicate the middle exon is twice as long as the others (to spice up the read calculation a bit). There are three splice variants, labeled var1, var2, and var3. Variant 1 is made up of exons A and BB, variant 2 of exons BB and C, and varient 3 of exons A and C.

Without cross-junction reads, it’s not possible to get a uniquely mapped read. Any read mapping to exon A will match splice variants 1 and 3, any read mapping to exon BB will match variants 1 and 2, and any read mapping to exon C will match 2 and 3.

Next, we specify the number of samples drawn from each exon in each splice variant.

#       exon
# var  1  2  3  Total theta*
# 1    1  2    | 3     3/19
# 2       4  2 | 6     6/19
# 3    5     5 | 10    10/19
#      --------
# tot  6  6  7   18

There are three samples from variant 1, 6 samples from variant 2, and 10 samples from variant 3. These are distributed uniformly across the exons. Here, variant 1 has 1 sample from exon 1 and 2 samples from exon 2, variant 2 has 4 samples from exon 2 and 2 samples from exon 3, whereas variant 3 has 5 samples each from exon 1 and exon 3.

If we adjust for length, variant 1 has 1 mRNA transcript, variant 2 has 2 transcripts, and variant 3 has 5 transcripts; these are computed by dividing by length. But our expression parameter theta (\theta) is defined by reads, not by transcripts.

The observed data is much simpler — it’s just the total number of reads for samples mapping to each exon. We see that there are 6 samples mapping to exon 1, 6 samples mapping to exon 2, and 7 samples mapping to exon 3.

Coding the Example in R

We set the Dirichlet prior to be uniform, whihc also determines our number of splice variants K:

alpha <- c(1,1,1)
K <- length(alpha)

We actually assume there are 10 times that many samples as in our figure, setting the read data as:

y <- c(rep(1,60),rep(2,60),rep(3,70))
N <- length(y)

That is, there are 60 reads from exon 1, 60 reads from exon 2 and 70 reads from exon 3, and N=190.

Finally, we use R’s matrix constructor to set up the read model:

rho <- matrix(c(1/3,2/3,0,  0,2/3,1/3,  1/2,0,1/2),

I tried to make this human readable: a sample from the first splice variant has a 1/3 chance of originating from exon 1 and a 2/3 chance of originating from exon 2 (because exon 2 is twice as long as exon 1). A sample from the third splice variant has an equal chance of being from either exon 1 or exon 3 (exon 1 and exon 3 are the same length).

The Output

I ran the R program, which called BUGS, which terminated neatly and returned to R. The \hat{R} values were all pinned at 1.0, indicating good chain mixing and hence convergence of the sampler.

Bayeisan Point Estimate

We can compute any statistics we want now with the posterior samples. For instance, theta[,1] contains all the Gibbs samples for the variable \theta_1.

We can then generate the Bayesian (L2) estimates for expression levels:

> thetaHat = c(mean(theta[,1]), mean(theta[,2]), mean(theta[,3]))
> print(thetaHat,digits=2)
[1] 0.17 0.31 0.52

The “right” answer for our simulation is also easy to compute with R:

> print(c(3/19, 6/19, 10/19), digits=2)
[1] 0.16 0.32 0.53

Our estimates are pretty darn close. (Recall we can normalize these sample counts by length to get mRNA transcript counts for each variant.)

How certain are we of these estimates? Not very, because there were only 190 samples and lots of mapping uncertainty. It’s easy to estimate posterior 95% intervals, 50% intervals, and medians using R’s quantile function over the Gibbs samples:

> print(quantile(theta[,1], c(0.025,0.25,0.5,0.75,0.975)),
 2.5%   25%   50%   75% 97.5%
0.025 0.104 0.167 0.231 0.340
> print(quantile(theta[,2], c(0.025,0.25,0.5,0.75,0.975)),
 2.5%   25%   50%   75% 97.5%
 0.13  0.25  0.31  0.37  0.47
> print(quantile(theta[,3], c(0.025,0.25,0.5,0.75,0.975)),
 2.5%   25%   50%   75% 97.5%
 0.42  0.49  0.52  0.56  0.61

With more data, and especially with uniquely mapping reads (e.g. across unique junction boundaries or in unique exons for a variant), posterior uncertainty will be greatly reduced.

A Triangle Plot and Correlated Expressions

Given that we only have three splice variants, we can create a so-called triangle plot of the expression levels, which will show correlations among the expression levels estimated for the variants on the simplex. Here’s the plot:

I couldn’t figure out how to label the axes in R’s ade4 package, but it’s clear from the means that are labeled that the upper left axis (mean of 0.17) is for splice variant 1, the bottom axis (mean of 0.308) is for variant 2, and the top right axis (mean of 0.522) for variant 3.

It takes some practice to view triangle plots like this; it’s perhaps easier to view two at a time. Here’s the two-way interaction plots:

two-way plots of RNA-seq expression 1 vs 2

two-way plots of RNA-seq expression 1 vs 3

two-way plots of RNA-seq expression 2 vs 3

These show the strong negative correlation between estimates of the expression for variants 1 and 2, but little correlation between these and variant 3. This makes sense given variants 1 and 2 share the large exon BB.

Finally, here’s the source that generated the PNG file for the triangle plot:


and for the ordinary plots:

plot(theta[,c(1,2)],xlab="variant 1",ylab="variant 2",
  xlim=c(0,1),ylim=c(0,1),pch=20,main="1 vs. 2")

plot(theta[,c(1,3)],xlab="variant 1",ylab="variant 3",
  xlim=c(0,1),ylim=c(0,1),pch=20,main="1 vs. 3")

plot(theta[,c(2,3)],xlab="variant 2",ylab="variant 3",
  xlim=c(0,1),ylim=c(0,1),pch=20,main="2 vs. 3")

Marginalizing Latent Variables in EM

January 19, 2010

After getting carried away in my last post on this topic, Maximum Likelihood Estimates of Expression in RNA-Seq are Winner-Take-All Biased, I want to explain what went wrong when I “turned the crank” (an idiom in mathematics for solving an equation by brute force without thinking too much about it).

Expression Mixture Model

Recall that we were looking at a mixture model, which is the bread and butter of maximum likelihood estimators like EM. We defined a joint likelihood function for a sequence of observed reads r, gene or splice variant sources g, and mRNA expression \theta, setting

p(r,g|\theta) = \prod_{n=1}^N p(r_n,g_n|\theta) = \prod_{n=1}^N p(r_n|g_n) \ p(g_n|\theta).

Maximum Likelihood

Maximum likelihood is usually stated as finding the model parameters that maximize the likelihood function consisting of the probability of the observed data given the model parameters.

What’s a Parameter?

So that’s what I did, calculating

g^*, \theta^* = \arg \max_{g,\theta} p(r|g,\theta).

The problem is that there’s an implicit “except the latent parameters” inside the usual understanding the MLE.

What I should have done is marginalized out the latent gene sources g, taking

\theta^* = \arg \max_{\theta} p(r|\theta).

That requires marginalizing out the latent mapping of reads r_n to gene or splice variant sources g_n,

p(r|\theta) = \prod_{n=1}^N p(r_n|\theta_n) = \prod_{n=1}^N \sum_{k = 1}^K p(r_n|k) \ p(k|\theta).

On a log scale, that’s

\log p(r|\theta) = \sum_{n=1}^N \log \sum_{k = 1}^K p(r_n|k) \ p(k|\theta).

The Example

I’ll repeat the example here for convenience:

Isoform Expression Example

Suppose I have a gene with two isoforms, A and B, spliced from three exons, 1, 2, and 3. For simplicitly, we’ll suppose all the exons are the same length. Suppose isoform A is made up of exon 1 and exon 2, and isoform B of exon 1 and 3. Now suppose that you run your RNA-Seq experiment and find 70 reads mapping to exon 1, 20 reads mapping to exon 2, and 50 reads mapping to exon 3. In table form, we have the following mapping counts:

Exon 1 Exon 2 Exon 3 Total
Iso 1 N 20 0 20+N
Iso 2 70-N 0 50 120-N
Total 70 20 50 140

The 20 reads mapping to exon 2 must be part of isoform 1, and similarly the 50 reads mapping to exon 3 must belong to isoform 2. That leaves the 70 reads falling in exon 1 to spread in some proportion between isoform 1 and isoform 2.

Turning the Crank In the Right Direction

By assuming each splice variant consists of two exons of the same length, the probability of a read in an exon is 1/2 for each exon in the splice variant and 0 for the exon not in the variant.

Now when we turn the crank, we see that

\log p(r|\theta) = \sum_{n=1}^N \log \sum_{k=1}^K p(r_n|k) p(k|\theta)

{} \ \ \ \ =  70 \log \sum_{k=1}^K p(1|k) p(k|\theta) + 20 \log \sum_{k=1}^K p(2|k) p(k|\theta)  + 50 \log \sum_{k=1}^K p(3|k) p(k\theta)

= 20 \log \frac{\theta_1}{2} + 50 \log \frac{\theta_2}{2} + 70 \log (\frac{\theta_1}{2} + \frac{\theta_2}{2})

= 20 \log \theta_1 + 50 \log \theta_2 + (70 \log 1 -20\log 2 - 50 \log 2 - 70 \log 2).

The reduction on the third line is because the probability of a read in the second exon, p(2|k), is only non-zero for isoform k=1, and similarly for a read in the third exon, where p(3|k) is nly non-zero for the second gene or splice variant, k=2. The final reduction follows because \theta_1 + \theta_2 = 1.

Not so Biased After All

After marginalizing out the read mappings g, the MLE for expression is right where we’d expect it, at the solution of

\frac{d}{d\theta} \log p(r|\theta) = 0,

which is


It turns out EM isn’t so biased for these discrete mixtures, at least when you turn the crank the right way.

Maximum Likelihood Estimates of Expression in RNA-Seq are Winner-Take-All Biased

January 14, 2010

Update: I turned the crank a little too hard on this post. While it’s technically correct according to one way of interpreting maximum likelihood, the more standard approach does not suffer the same bias. I describe the “right” way to do EM in the followup post,

While this is good news for EM (and RNA-Seq), it undercuts one of the motivations for the Bayesian estimator, which I’ll present soon.

I spent much of the winter break looking over Mitzi’s shoulder at papers on RNA-Seq. For those who don’t know what that it is, I added an appendix Background on RNA-Seq going over the problems.

Statistical Problem Synopsis

As usual, the problem is that the MLE falls on the parameter space boundary. Because of this, the MLE has what is known as statistical bias. In the case of MLEs for RNA-Seq expression, the bias is toward a winner-take-all solution in which one isoform is assigned all the uncertain mappings.

You’ve Seen Biased MLEs Before

You’ve seen it in the context of estimating the variance of a distribution from samples y = y_1,\ldots,y_n. The maximum likelihood estimate \mu^* of the mean is the same as the unbiased estimate \bar{\mu},

\mu^* = \bar{\mu} = \frac{1}{n} \sum_{n = 1}^N y_n.

But the maximum likelihood estimate {\sigma^2}^* of variance is biased:

{\sigma^2}^* = \frac{1}{n} \sum_{n=1}^N (y_n - \mu^*)^2.

It’s biased on the low side (because it’s based on the estimate of the mean, not the actual mean). The unbiased estimate arises by subtracting one from the denominator,

\bar{\sigma^2} = \frac{1}{n-1} \sum_{n=1}^N (y_n - \mu^*)^2.

But unlike the case for expression I’ll discuss next, as the sample size approaches infinity, the unbiased and MLE estimates of variance converge to the same value.

Isoform Expression Example

Suppose I have a gene with two isoforms, A and B, spliced from three exons, 1, 2, and 3. For simplicitly, we’ll suppose all the exons are the same length. Suppose isoform A is made up of exon 1 and exon 2, and isoform B of exon 1 and 3. Now suppose that you run your RNA-Seq experiment and find 70 reads mapping to exon 1, 20 reads mapping to exon 2, and 50 reads mapping to exon 3. In table form, we have the following mapping counts:

Exon 1 Exon 2 Exon 3 Total
Iso 1 N 20 0 20+N
Iso 2 70-N 0 50 120-N
Total 70 20 50 140

The 20 reads mapping to exon 2 must be part of isoform 1, and similarly the 50 reads mapping to exon 3 must belong to isoform 2. That leaves the 70 reads falling in exon 1 to spread in some proportion between isoform 1 and isoform 2. So we’ve let N represent the total (out of 70) number of multi-mapped reads assigned to isoform 1.

Calculating Expression and Likelihood

For a given value of N, we can directly calculate the maximum likelihood estimate for isoform expression for isoform 1 (\theta) and isoform 2 (1 - \theta), by

\theta = (20+N)/140 and 1-\theta = (120-N)/140.

The (unnormalized) likelihood function for N, which determines \theta, is

p(N|\theta) \propto \theta^{20+N} (1-\theta)^{120-N}

and the (unnormalized) log likelihood is

\log p(N|\theta) = (20+N) \log \theta + (120-N) \log (1 - \theta).

Here’s a plot of the (unnormalized) log likelihood versus N that illustrates the problem with the MLE estimator:

log likelihood plot

The maximum likelihood (which is at the same point as the maximum of the unnormalized likelihood) is achieved at the boundary of the possible parameter values, so the MLE is N^* = 0. The MLE thus assigns all the multi-mapped reads to the second isoform, which is why I called it a “winner take all bias”.

The unbiased estimator is the expected value of N,

\bar{N} = \mathbb{E}[N],

which will obviously fall somewhere between 0 and 70, but closer to 0 than 70 given the skew of the likelihood function.

Just to be clear, it’s not just the MLE that’s a problem. Maximum a posteriori (MAP) estimates, which use Bayesian priors but not Bayesian estimators, have the same problem.

Who’s using MLE and MAP?

It’s actually even more popular in biology than in computational linguistics. And while there’s not much that’s been published on RNA-Seq expression profiling, I know these three papers (and would be glad to hear about more):

Ironically, MLE is being sold as an improvement to the one-step remapping approach outlined in the following two papers, the first of which overlaps significantly in authorship with the de Hoon et al. paper cited above:

While I like the fact that the former papers write down a full generative model, I say “ironically”, because the one-step remapping approach gets the “right” answer for the example above. The remapping of multi-maps by Mortazavi et al. is done proportionally to the expression computed over the uniquely mapped reads. In our example, that’s 20 to 50, so the 70 multi-mapped reads are split in the same proportion, with 20 going to isoform 1 and 50 to isoform 2, giving isoform 1 an expression of 40 and isoform 2 an expression of 100.

Stay Tuned for the Next Installment

In the next post in this series, I’ll show how to estimate the full Bayesian posterior using Gibbs sampling, incorporating probabilistic information from alignment. From the Bayesian posterior, unbiased point estimate is easily derived as is estimation variance (i.e. “confidence”). I’ll even show how to derive answers when there are no multi-mapped reads, and when exons and isoforms are of different sizes.

Background on RNA-Seq

For mammalian (and other complex) genomes, genes are made up of exons (of a few dozen to a few hundred bases) and introns (from very small to tens of thousands of bases in humans). RNA is transcribed from a gene’s DNA.

After transcription, the spliceosome chops out the introns and splices the exons together, adds a cap on the 5′ end (start) and a sequence of A bases (the poly(A) tail) on the 3′ end (end). The result is so-called mature mRNA (messenger RNA), because it is ready to be translated into a protein.

The spliceosome, for various not fully understood reasons (including regulation, cryptic [ambiguous] splice junctions, and perhaps chromatin structure), does not always splice all of the exons together to form mRNA. In different contexts, alternative splicing variants, the variant proteins produced from which are called isoforms, are generated at different levels of expression (concentration).

The ribosome complex translates mature mRNA (which has four bases) into the 20 different amino acids making up proteins Three bases, together called a codon, are used per amino acid, with the so-called genetic code mapping codons to amino acids; there is also a start codon and stop codon so the ribosome knows where to start and end translation. The (short) portions of the mRNA before the start codon is called the 5′ UTR, and the portion of mRNA between the stop codon and the poly(A) tail is called the 3′ UTR.

In RNA-Seq experiments, mature mRNA is isolated in the lab (typically by grabbing its poly(A) tail), then broken up into short sequences using specialized enzymes called RNases. The resulting short sequences are then copied to complementary DNA (cDNA), tens of millions short sequences which are then randomly attached to a slide (how this is done depends on the particular sequencer). Then the sequencer reads the bases, which is a very noisy process; if you’re lucky, you get confidence scores along with the reads that can be interpreted as probabilities of errors. The different machines have different error profiles.

We now have tens of millions of noisily read sequences. These are next mapped to known genome positions using a combination of indexing and edit distance (there are some really groovy algorithms here; more on those later). Reads that map to one position with a higher score than all other positions are said to map uniquely (though they may map to other positions with lower likelihood). Reads that map to multiple positions are called multi-maps.

The genome (drafts) have been marked up for exons (though this is also noisy) in what are called gene models (no Wikipedia ref I could find!). Multi-maps are very common if the target is an isoform, because isoforms share exons. But they also appear elsewhere because of the large number of paralogs (similar sequences) in copies and copy variants of genes in large genomes like those of mammals.

The problem we’re looking at here is calculating isoform expression taking into account multi-maps.