## Expected Error Histograms from Illumina 36bp Reads

Mitzi and I have been working on reproducing the results reported in the following (hugely influential) recent paper.

The paper analyzes human liver and kidney tissue using both an Affymetrix micro-array and the Illumina short-read sequencer. More about Affy vs. Illumina later; for now, I just want to describe the quality of the Illumina reads.

### Error Profile of Illumina

What’s nice about modern bioinformatics is that many authors submit their data to public repositories. So Mitzi was able to download and crunch Marioni et al.’s Illumina data.

For each 36bp read, Illumina provides a probability estimate for each base. For instance, we might have the 17th position assign a probability of 0.97 to A, 0.02 to C, 0.001 to G, and 0.009 to T. Marioni et al., like most researchers, simply take the first-best calls (see below for some exceptions), thus assigning position 17 in the above example to A. But according to Illumina, there’s only a 97% chance that call is correct!

### Calculating Expected Errors per Read

So how many errors do we expect in a 36bp read? If $q_i$ is the probability that the most likely base at position $i$ is correct, the expected number of errors in the first $I$ positions is $\sum_{i = 1}^{I} (1 - q_i)$.
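To make the arithmetic concrete, here's a little Python sketch (not the code we actually ran) that computes the expected error count straight from a FASTQ-style quality string, assuming the Sanger-style Phred+33 ASCII encoding; older Illumina pipelines used Phred+64, so adjust the offset accordingly:

```python
def expected_errors(quality_string, offset=33):
    """Sum of per-base error probabilities, sum_i (1 - q_i)."""
    total = 0.0
    for ch in quality_string:
        phred = ord(ch) - offset           # integer Phred score Q
        p_error = 10.0 ** (-phred / 10.0)  # P(call wrong) = 10^(-Q/10)
        total += p_error
    return total

# A Phred score of 20 means a 1% error probability, so a 36bp read
# at uniform Q20 ('5' is ASCII 53, Phred 20) has 0.36 expected errors.
print(expected_errors("5" * 36))
```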

With a little Java, Python and R (more about the package we’re working on later), Mitzi plotted a histogram for all 66.4M reads from Marioni’s 3pM kidney data:

But it turns out that like most researchers, Marioni et al. truncated the reads because they’re noisier toward the tail (3′ end) of the reads. Specifically, they only used the first 32bp. Here’s the same plot with the last 4bps truncated:

As you can see, this dramatically reduces the expected number of errors.
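A toy calculation shows why truncation helps so much. The per-base error profile below is made up purely for illustration (a clean 5′ start and a noisy 3′ tail), but the shape is typical:

```python
def expected_errors_from_probs(error_probs):
    """Expected error count: sum of per-base error probabilities."""
    return sum(error_probs)

# Hypothetical per-base P(error): 32 clean positions, then a noisy tail.
profile = [0.005] * 32 + [0.05, 0.08, 0.10, 0.12]

print(expected_errors_from_probs(profile))       # all 36bp
print(expected_errors_from_probs(profile[:32]))  # first 32bp only
```

The last four positions contribute more expected errors than the first thirty-two combined.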

### Calculating Expected Number of Correct Reads

Mitzi also plotted the probability that the first best calls up to length $I$ are all correct for a read, which is $\prod_{i=1}^{I} q_i$. The probabilities are so low it requires a log (base 10) scale when using all 36bps:
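Working on the log scale is also the numerically sane way to compute this product; summing $\log_{10} q_i$ avoids underflow for long reads. A quick sketch with made-up error rates:

```python
import math

def log10_prob_all_correct(error_probs):
    """log10 of prod_i q_i, where q_i = 1 - P(error at i)."""
    return sum(math.log10(1.0 - p) for p in error_probs)

# 36 bases each with a 1% error rate: about a 70% chance the whole
# read is called correctly.
print(10 ** log10_prob_all_correct([0.01] * 36))

# A noisier (hypothetical) tail drags the probability down fast.
print(10 ** log10_prob_all_correct([0.01] * 26 + [0.05] * 10))
```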

The startling conclusion is that there's almost no chance at all that the first-best base calls lead to the correct sequence. This has unsettling ramifications for procedures like SNP calling.

### The Done Thing

What Marioni et al. did is typical. They used a heuristic aligner (Bowtie) which ignores the read quality scores and allows only a small number of edits. They then censor (aka discard) any read from their data set that doesn’t align “uniquely”. A unique alignment is taken to be one that has a unique maximum quality score with fewer than the maximum number of allowed edits. That is, if there’s an alignment with one edit, and seven with two edits, the alignment is considered unique; but if there are two alignments with one edit, the read’s discarded.

Why throw away non-unique reads? There's good reason to believe that many of the supposedly unique alignments aren't actually unique. Not everyone discards non-unique reads; there are so-called “recovery” strategies, which I discussed in a previous post describing my joint mapping recovery and mRNA expression model.

Why throw away quality scores from the aligner? I only know of two aligners that use quality scores, Gnumap and BWA, but these systems are far less popular than simple edit counting systems like Bowtie.

Why only count edits? A proper channel model should be better. For instance, the SHRiMP system employs a more refined channel model (which my probabilistic alignment model refines with proper normalization and integrating out of the actual sequence of edits).

The answer’s simple really: it’s easier to code and easier to explain.

### Can We Do Better?

Like every newbie, I feel we can. Not surprisingly, I think we should use all the data we have, integrating a proper probabilistic channel model for alignment with a Bayesian mRNA expression model.

Mitzi’s busy doing much more liberal alignments with SHRiMP (which requires a Viterbi approximation to the channel model), and then we’ll run them through the scalable collapsed Gibbs sampler I wrote for the mRNA expression model. After that, we’ll work in the more refined channel edit model, which can account for biases induced by the wet lab process and the sequencer, as well as the quality scores coming from the sequencer.

### Bioinformatics?

No, Alias-i has no plans to work on Bioinformatics; this is still just a hobby for me. Maybe I should choose a hobby that’s a little more dissimilar from my day job.

### 2 Responses to “Expected Error Histograms from Illumina 36bp Reads”

1. Mikael Says:

Actually, Bowtie does use quality scores, although you can set the program options so that it doesn’t. The default setting for Bowtie is to allow a certain number of mismatches in a “seed” region and, given that the read passed that first criterion, allow mismatches as long as their combined quality score does not pass a certain threshold.

• lingpipe Says:

Thanks for the clarification. I’ll link to the comment from the erroneous part of the blog post itself.

Here’s a link to Bowtie’s doc for their `-n` alignment mode. It uses the sum of the quality scores at the mismatched base positions (quality here is probability of error on a Phred scale), requiring it to be below a given threshold, and giving priority to better scores in cases of conflicts. They note that the backtracking size needs to be high in order to guarantee correctness (i.e., not make search mistakes).
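As I understand the doc, the scoring criterion (setting aside the seed-mismatch check and the backtracking search) is roughly the following; the function and parameter names here are mine, not Bowtie's, though the default budget of 70 matches their `-e` option:

```python
def passes_quality_budget(qualities, mismatch_positions, threshold=70):
    """True if the summed Phred quality at mismatched positions is
    within the budget (a rough sketch of Bowtie's -n mode rule)."""
    return sum(qualities[i] for i in mismatch_positions) <= threshold

quals = [40] * 36                               # uniformly high-quality read
print(passes_quality_budget(quals, [5]))        # one Q40 mismatch: 40 <= 70
print(passes_quality_budget(quals, [5, 12]))    # 80 > 70: rejected
```

Note how this penalizes high-confidence mismatches more than low-confidence ones, which is at least a step toward using the quality information.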

I guess you could call this “quality aware”, but it’s not the kind of thing I was thinking about, which would actually use all the base probabilities in scoring the whole alignment. Like, say, GnuMap does.