Update: I turned the crank a little too hard on this post. While it’s technically correct according to one way of interpreting maximum likelihood, the more standard approach does not suffer the same bias. I describe the “right” way to do EM in the followup post,
While this is good news for EM (and RNA-Seq), it undercuts one of the motivations for the Bayesian estimator, which I’ll present soon.
Statistical Problem Synopsis
As usual, the problem is that the MLE falls on the parameter space boundary. Because of this, the MLE has what is known as statistical bias. In the case of MLEs for RNA-Seq expression, the bias is toward a winner-take-all solution in which one isoform is assigned all the uncertain mappings.
You’ve Seen Biased MLEs Before
You’ve seen it in the context of estimating the variance of a distribution from samples . The maximum likelihood estimate of the mean is the same as the unbiased estimate ,
But the maximum likelihood estimate of variance is biased:
It’s biased on the low side (because it’s based on the estimate of the mean, not the actual mean). The unbiased estimate arises by subtracting one from the denominator,
But unlike the case for expression I’ll discuss next, as the sample size approaches infinity, the unbiased and MLE estimates of variance converge to the same value.
Isoform Expression Example
Suppose I have a gene with two isoforms, A and B, spliced from three exons, 1, 2, and 3. For simplicitly, we’ll suppose all the exons are the same length. Suppose isoform A is made up of exon 1 and exon 2, and isoform B of exon 1 and 3. Now suppose that you run your RNA-Seq experiment and find 70 reads mapping to exon 1, 20 reads mapping to exon 2, and 50 reads mapping to exon 3. In table form, we have the following mapping counts:
|Exon 1||Exon 2||Exon 3||Total|
The 20 reads mapping to exon 2 must be part of isoform 1, and similarly the 50 reads mapping to exon 3 must belong to isoform 2. That leaves the 70 reads falling in exon 1 to spread in some proportion between isoform 1 and isoform 2. So we’ve let N represent the total (out of 70) number of multi-mapped reads assigned to isoform 1.
Calculating Expression and Likelihood
For a given value of , we can directly calculate the maximum likelihood estimate for isoform expression for isoform 1 () and isoform 2 (), by
The (unnormalized) likelihood function for , which determines , is
and the (unnormalized) log likelihood is
Here’s a plot of the (unnormalized) log likelihood versus N that illustrates the problem with the MLE estimator:
The maximum likelihood (which is at the same point as the maximum of the unnormalized likelihood) is achieved at the boundary of the possible parameter values, so the MLE is . The MLE thus assigns all the multi-mapped reads to the second isoform, which is why I called it a “winner take all bias”.
The unbiased estimator is the expected value of N,
which will obviously fall somewhere between 0 and 70, but closer to 0 than 70 given the skew of the likelihood function.
Just to be clear, it’s not just the MLE that’s a problem. Maximum a posteriori (MAP) estimates, which use Bayesian priors but not Bayesian estimators, have the same problem.
Who’s using MLE and MAP?
It’s actually even more popular in biology than in computational linguistics. And while there’s not much that’s been published on RNA-Seq expression profiling, I know these three papers (and would be glad to hear about more):
- de Hoon, Michiel J. L. et al. (2010) Cross-mapping and the identification of editing sites in mature microRNAs in high-throughput sequencing libraries. Genome Research.
- Marioni, John C. et al. (2009) RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Research.
- Jiang, Hui et al. (2009) Statistical inferences for isoform expression in RNA-Seq. Bioinformatics.
Ironically, MLE is being sold as an improvement to the one-step remapping approach outlined in the following two papers, the first of which overlaps significantly in authorship with the de Hoon et al. paper cited above:
- Hashimoto, T. et al. (2009) Probabilistic resolution of multi-mapping reads in massively parallel sequencing data using MuMRescueLite. Bioinformatics
- Mortazavi, A. et al. (2009) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods.
While I like the fact that the former papers write down a full generative model, I say “ironically”, because the one-step remapping approach gets the “right” answer for the example above. The remapping of multi-maps by Mortazavi et al. is done proportionally to the expression computed over the uniquely mapped reads. In our example, that’s 20 to 50, so the 70 multi-mapped reads are split in the same proportion, with 20 going to isoform 1 and 50 to isoform 2, giving isoform 1 an expression of 40 and isoform 2 an expression of 100.
Stay Tuned for the Next Installment
In the next post in this series, I’ll show how to estimate the full Bayesian posterior using Gibbs sampling, incorporating probabilistic information from alignment. From the Bayesian posterior, unbiased point estimate is easily derived as is estimation variance (i.e. “confidence”). I’ll even show how to derive answers when there are no multi-mapped reads, and when exons and isoforms are of different sizes.
Background on RNA-Seq
For mammalian (and other complex) genomes, genes are made up of exons (of a few dozen to a few hundred bases) and introns (from very small to tens of thousands of bases in humans). RNA is transcribed from a gene’s DNA.
After transcription, the spliceosome chops out the introns and splices the exons together, adds a cap on the 5′ end (start) and a sequence of A bases (the poly(A) tail) on the 3′ end (end). The result is so-called mature mRNA (messenger RNA), because it is ready to be translated into a protein.
The spliceosome, for various not fully understood reasons (including regulation, cryptic [ambiguous] splice junctions, and perhaps chromatin structure), does not always splice all of the exons together to form mRNA. In different contexts, alternative splicing variants, the variant proteins produced from which are called isoforms, are generated at different levels of expression (concentration).
The ribosome complex translates mature mRNA (which has four bases) into the 20 different amino acids making up proteins Three bases, together called a codon, are used per amino acid, with the so-called genetic code mapping codons to amino acids; there is also a start codon and stop codon so the ribosome knows where to start and end translation. The (short) portions of the mRNA before the start codon is called the 5′ UTR, and the portion of mRNA between the stop codon and the poly(A) tail is called the 3′ UTR.
In RNA-Seq experiments, mature mRNA is isolated in the lab (typically by grabbing its poly(A) tail), then broken up into short sequences using specialized enzymes called RNases. The resulting short sequences are then copied to complementary DNA (cDNA), tens of millions short sequences which are then randomly attached to a slide (how this is done depends on the particular sequencer). Then the sequencer reads the bases, which is a very noisy process; if you’re lucky, you get confidence scores along with the reads that can be interpreted as probabilities of errors. The different machines have different error profiles.
We now have tens of millions of noisily read sequences. These are next mapped to known genome positions using a combination of indexing and edit distance (there are some really groovy algorithms here; more on those later). Reads that map to one position with a higher score than all other positions are said to map uniquely (though they may map to other positions with lower likelihood). Reads that map to multiple positions are called multi-maps.
The genome (drafts) have been marked up for exons (though this is also noisy) in what are called gene models (no Wikipedia ref I could find!). Multi-maps are very common if the target is an isoform, because isoforms share exons. But they also appear elsewhere because of the large number of paralogs (similar sequences) in copies and copy variants of genes in large genomes like those of mammals.
The problem we’re looking at here is calculating isoform expression taking into account multi-maps.