Mitzi and I have been working on reproducing the results reported in the following (hugely influential) recent paper.
- Marioni JC, Mason CE, Mane SM, Stephens M, and Gilad Y. 2008. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res.
The paper analyzes human liver and kidney tissue using both an Affymetrix micro-array and the Illumina short-read sequencer. More about Affy vs. Illumina later; for now, I just want to describe the quality of the Illumina reads.
Error Profile of Illumina
What’s nice about modern bioinformatics is that many authors submit their data to public repositories. So Mitzi was able to download and crunch Marioni et al.’s Illumina data.
For each 36bp read, Illumina provides a probability estimate for each base. For instance, we might have the 17th position assign a probability of 0.97 to A, 0.02 to C, 0.001 to G, and 0.009 to T. Marioni et al., like most researchers, simply take the first-best calls (see below for some exceptions), thus assigning position 17 in the above example to A. But according to Illumina, there’s only a 97% chance that call is correct!
Calculating Expected Errors per Read
So how many errors do we expect in a 36bp read? If is the probability that the most likely base at position is correct, the expected number of errors in the first reads is .
With a little Java, Python and R (more about the package we’re working on later), Mitzi plotted a histogram for all 66.4M reads from Marioni’s 3pM kidney data:
But it turns out that like most researchers, Marioni et al. truncated the reads because they’re noisier toward the tail (3′ end) of the reads. Specifically, they only used the first 32bp. Here’s the same plot with the last 4bps truncated:
As you can see, this dramatically reduces the expected number of errors.
Calculating Expected Number of Correct Reads
Mitzi also plotted the probability that the first best calls up to length are all correct for a read, which is . The probabilities are so low it requires a log (base 10) scale when using all 36bps:
The startling conclusion is that there’s almost no chance at all that the first-best base calls leads to the correct sequence. This has unsettling ramifications for procedures like SNP calling.
The Done Thing
What Marioni et al. did is typical. They used a heuristic aligner (Bowtie) which ignores the read quality scores and allows only a small number of edits. They then censor (aka discard) any read from their data set that doesn’t align “uniquely”. A unique alignment is taken to be one that has a unique maximum quality score with fewer than the maximum number of allowed edits. That is, if there’s an alignment with one edit, and seven with two edits, the alignment is considered unique; but if there are two alignments with one edit, the read’s discarded.
Why throw away non-unique reads? There’s good reason to believe that the unique alignments are not unique. Not eveyone discards non-unique reads; there are so-called “recovery” strategies, which I discussed in a previous post describing my joint mapping recovery and mRNA expression model.
Why throw away quality scores from the aligner? I only know of two aligners that use quality scores, Gnumap and BWA, but these systems are far less popular than simple edit counting systems like Bowtie.
Why only count edits? A proper channel model should be better. For instance, the SHRiMP system employs a more refined channel model (which my probabilistic alignment model refines with proper normalization and integrating out of the actual sequence of edits).
The answer’s simple really: it’s easier to code and easier to explain.
Can We Do Better?
Mitzi’s busy doing much more liberal alignments with SHRiMP (which requires a Viterbi approximation to the channel model) and then we’ll run them through the scalable collapsed Gibbs sampler I wrote the for mRNA expression model. After that, we’ll work in the more refined channel edit model that can account for biases induced by the wet lab process and the sequencer, and the quality scores coming from the sequencer.
No, Alias-i has no plans to work on Bioinformatics; this is still just a hobby for me. Maybe I should choose a hobby that’s a little more dissimilar from my day job.