
Li, Ruotti, Stewart, Thomson and Dewey (2010) RNA-Seq Gene Expression Estimation with Read Mapping Uncertainty

June 9, 2010

It was bound to happen. The model I was proposing for splice-variant (or isoform) expression was too obvious (and too good!) for it not to be independently discovered. Seeing my ideas discovered by others is on one hand a bummer, but on the other hand gives me confidence that I’m thinking about these models in the right way.

The following paper presents roughly the same model of expression I presented in a previous blog post, Inferring Splice Variant mRNA Expression with RNA-Seq.

It also incorporates some of the features I suggested for edit distance in another blog post, Sequence alignment with conditional random fields.

The Basic Expression Model

In fact, it’d almost be easier to say what’s different in Li et al.’s model. The expression model is almost identical, down to the name of the variable, \theta, used for read expression. The model in this paper assumes N reads, implicitly assumes a uniform prior \theta \sim \mbox{\sf Dirichlet}({\bf 1}), then for each read 1 \leq n \leq N, chooses splice variants z_n \sim \mbox{\sf Discrete}(\theta). So far, identical (other than that I considered arbitrary Dirichlet priors).
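In code, the shared generative core is tiny. Here’s a Python sketch (illustrative only, not either paper’s implementation) of drawing the latent variant assignments z_n \sim \mbox{\sf Discrete}(\theta):

```python
import random

def simulate_variant_assignments(theta, n_reads, seed=0):
    """Draw z_n ~ Discrete(theta) for each of n_reads reads.

    theta: expression probabilities over the K splice variants.
    Returns a list of variant indices, one per read.
    """
    rng = random.Random(seed)
    return rng.choices(range(len(theta)), weights=theta, k=n_reads)

z = simulate_variant_assignments([0.2, 0.3, 0.5], 10000)
# Empirical frequencies approach theta as the number of reads grows.
```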

[Update, June 10, 2010: I forgot to mention:

Noise Isoform

Li et al. include a so-called “noise isoform,” to which they assign a simple probability distribution. It acts as a sink for reads that do not map elsewhere. Given how it’s defined, though, I don’t quite see how anything would get mapped to it if we restricted attention to reads that BOWTIE maps within a few edits.]

Position and Strand-Based Reads

Things look different when Li et al. start to generate reads y_n. What I did was assume an arbitrary distribution \mbox{\sf Read}(y_n|z_n,\varphi), with parameters \varphi characterizing the likelihood of observing read y_n from reference source z_n. Li et al. decompose part of \mbox{\sf Read} in their model. First, they assume s_n is the start position of read y_n on reference sequence z_n and o_n is the strand, both generated from the reference sequence z_n by distributions p(s_n|z_n) and p(o_n|z_n).

This is where Li et al.’s proposal relates to my probabilistic aligner proposal. In particular, with the position, strand and reference sequence, s_n, o_n, z_n, the reads may be defined to be sensitive to location (say in relative position measured from 5′ to 3′), or to underlying sequence (such as the hexamer bias due to priming studied in [Hansen, Brenner and Dudoit 2010]). They only study the positional biases, and for their data, they were small. But groovy nonetheless. It’s building in the right kind of sensitivity to the biological sample prep and sequencing.

Non-Probabilistic Alignment

[Updated, June 10, 2010: I don’t know what I was thinking — Li et al. definitely use probabilistic alignment. In fact, they use a position-specific edit matrix w_t for substitutions (no indels). I’m not sure how the matrices are estimated.]

What they’re not using is the quality scores on the reads themselves (as is done in Clement et al.’s GNUMap aligner).

I think there’s an opportunity to squeeze more sensitivity and specificity out of the model by using the quality scores from the sequencer (if they can be calibrated properly, that is).

The aligner I propose also normalizes the edits to proper probabilities using a sum-product algorithm; it’d be easy to extend to read quality as in GNUMap, or to compose it with a color space finite-state transducer, as in SHRiMP for SOLiD reads.

EM MLE vs. Gibbs Sampling Bayesian Posterior

The other main difference is that I was proposing using Gibbs sampling to produce samples from the posterior p(\theta|y,\alpha,\varphi) of expression given reads y and channel model \varphi and Dirichlet prior \alpha. Li et al. use EM to find the maximum likelihood estimate, \theta^* = \arg\max p(y|\theta,\varphi). As usual, the MLE is just the MAP estimate with a uniform prior, so in my model, the MLE is \theta^* = \arg\max_\theta p(\theta|y,\alpha={\bf 1},\varphi).
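For concreteness, here’s a minimal Python sketch (my own, not Li et al.’s code) of the EM iteration for this multinomial mixture: each read carries a likelihood for every isoform it maps to, the E-step computes responsibilities under the current \theta, and the M-step renormalizes the expected counts.

```python
def em_expression(compat, n_iters=200):
    """EM for the maximum likelihood expression estimate theta*.

    compat[n] is a dict mapping isoform k -> p(read n | isoform k),
    over the isoforms read n aligns to.
    """
    K = 1 + max(k for row in compat for k in row)
    theta = [1.0 / K] * K
    for _ in range(n_iters):
        counts = [0.0] * K
        for row in compat:
            # E-step: posterior responsibility of isoform k for this read.
            z = sum(theta[k] * lik for k, lik in row.items())
            for k, lik in row.items():
                counts[k] += theta[k] * lik / z
        # M-step: renormalize the expected counts.
        total = sum(counts)
        theta = [c / total for c in counts]
    return theta

# Two reads ambiguous between isoforms 0 and 1, two unique to isoform 1;
# the MLE ends up attributing the ambiguous reads to isoform 1.
theta = em_expression([{0: 1.0, 1: 1.0}] * 2 + [{1: 1.0}] * 2)
```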

Bootstrap Standard Error Estimates

One of the main arguments I made for the Bayesian approach is that, like all truly Bayesian approaches, it supports posterior inference with uncertainty. This is very useful for doing things like pairwise or multiple comparisons of expression or adjusting for false discovery rates.

Li et al. use a bootstrap estimate of standard error, which is great to see. I wish more researchers provided variance estimates.

The danger with only reporting standard error (or variance) for these skewed binomials (or multinomials) is that the parameter value may be very close to the edge of its legal range, so the 95% interval can contain illegal values. You see the same problem with normal approximations of variance for the Poisson, where a few standard deviations below the mean can yield negative counts.
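A quick numeric illustration of the edge problem (hypothetical counts):

```python
import math

# Estimated proportion: 2 successes in 100 trials.
n, k = 100, 2
p = k / n
se = math.sqrt(p * (1 - p) / n)  # normal-approximation standard error
lo, hi = p - 1.96 * se, p + 1.96 * se
# lo is negative: an illegal value for a probability.
```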

[Update: June 10, 2010 Of course, you can plot plug-in quantiles such as 95% intervals with the bootstrap, too. It’s really pretty much like Gibbs sampling in terms of the application, if not philosophy.]
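The percentile bootstrap behind such plug-in intervals is only a few lines; here’s a minimal sketch (mine, not Li et al.’s):

```python
import random

def bootstrap_percentile(data, stat, n_boot=2000, lo=0.025, hi=0.975, seed=0):
    """Percentile bootstrap interval for stat(data): resample with
    replacement, recompute the statistic, take empirical quantiles."""
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(stat([rng.choice(data) for _ in range(n)])
                   for _ in range(n_boot))
    return stats[int(lo * n_boot)], stats[int(hi * n_boot)]

mean = lambda xs: sum(xs) / len(xs)
# 95% interval for the mean of a small 0/1 sample:
lower, upper = bootstrap_percentile([0, 0, 0, 1, 1, 0, 1, 0, 0, 1], mean)
```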


They run a bunch of simulation experiments to show that this kind of model makes sense. I did the same thing on a (much much) smaller scale. They use Langmead et al.’s BOWTIE aligner with up to two mismatches, which seems a rather dodgy basis for this kind of expression model. It will depend on the settings, but the defaults provide a greedy heuristic search that’s not guaranteed to be complete in the sense of finding all alignments, or even the best one.

[Update: June 10, 2010: BOWTIE has a --all setting that the documentation says will generate all matching alignments, but there’s also a maximum-number-of-backtracks parameter that can eliminate some matches when 2 or more edits are allowed.

Even if BOWTIE can be configured to find all the matches up to edit distance 2, there’s no reason to assign them all the same probability in the model or to assume that a read is accurately mapped at edit distance 2 and not at edit distance 3 or greater.

My understanding is that due to its best-first heuristic search, BOWTIE does not guarantee it will find all reads even up to a given edit distance.]

What we really need is some real applications. Mitzi and I are reanalyzing the human liver and kidney Illumina RNA-Seq data described in (Marioni et al. 2008), and also some new data generated by Ken Birnbaum (and Paul Scheid) at NYU using a single-cell protocol on SOLiD on Arabidopsis roots over a 5-day period. Paul Scheid, the director of the SOLiD core facility at NYU, just presented the data at a SOLiD user’s group meeting he hosted at NYU last Friday. The short story is that Mitzi crunched the data to show that you can use a single-cell protocol on SOLiD and use SHRiMP for alignment to derive expression results similar to those estimated from parallel microarray data using Li and Wong’s D-Chip factor model for microarray expression.

20 to 25 Bases!

The conclusion of Li et al. is that if each base costs the same to read independent of read length, then according to their simulations, the optimal read length for calculating variant expression is around 20 to 25 bases, not the 50 or longer that Illumina and SOLiD are producing these days.

Upgrading Java Classes with Backward-Compatible Serialization

May 4, 2010

When using Java’s default serialization, any change to the member variables or method signatures of a class breaks backward compatibility, in the sense that anything serialized from the old class won’t deserialize with the new class. The same goes for similar changes to superclasses, or for changing the superclass type itself.

After a bit of overview, I’ll show you the trick I’ve been using to maintain backward compatibility when upgrading classes with new meaningful parameters.

Serial Version ID

The problem stems from the fact that Java uses reflection to compute a serial version ID for each class. (You can calculate it yourself using the serialver command that ships with Sun/Oracle’s JDK.) When methods or member variables change, the serial version changes. To deserialize, the serialized class’s ID must match the current class’s ID.

As long as the (non-transient) member variables don’t change, this problem can be tackled by explicitly declaring the serial version ID. (See the Serializable interface javadoc for how to declare). With an explicit serial version ID declared, the serialization mechanism uses the declared value rather than computing it from the class signature via reflection.

If there are serializable objects out there you need to deserialize, use serialver to calculate what the version ID used to be, then declare it in the class. If you’re starting from scratch, the ID can be anything.

So always use an explicit serial version when you first create a class. It can’t hurt, and it’s likely to help with backward compatibility.

Taking Control with the Externalizer Interface

But what if (non-transient) member variables (that is, non-static instance variables) change? Version IDs won’t help, because the default read and write implemented through reflection will try to deserialize the new variables, find they’re not there, and throw an exception.

You need to take control using the Externalizable interface, which extends Serializable. Now you control what gets read and written.

Serialization and deserialization will still work, as long as the new member variable is either not final or is defined in the nullary (no-arg) constructor.

The Simplest Case of The Problem

Suppose we have the class

public class Foo implements Externalizable {
    static final long serialVersionUID = 42L;
    public String mArg1;
    public Foo() { }
    public void setArg1(String arg1) { mArg1 = arg1; }
    public void readExternal(ObjectInput in) 
        throws IOException, ClassNotFoundException {
        mArg1 = (String) in.readObject();
    }
    public void writeExternal(ObjectOutput out) 
        throws IOException {
        out.writeObject(mArg1);
    }
}
So far, so good. If I add the following main(),

    public static void main(String[] args) 
        throws IOException, ClassNotFoundException {
        Foo foo = new Foo();
        foo.mArg1 = "abc";
        System.out.println("1: " + foo.mArg1);

        FileOutputStream out = new FileOutputStream("temp.Foo");
        ObjectOutput objOut = new ObjectOutputStream(out);
        objOut.writeObject(foo);
        objOut.close();

        FileInputStream in = new FileInputStream("temp.Foo");
        ObjectInput objIn = new ObjectInputStream(in);
        Foo foo2 = (Foo) objIn.readObject();
        System.out.println("2: " + foo2.mArg1);
    }

and compile and run, I get

c:\carp>java Foo
1: abc
2: abc
c:\carp>ls -l temp.Foo
----------+ 1 Bob Carpenter None 31 2010-05-04 15:14 temp.Foo

So far, so good.

But now suppose I add a second variable to Foo,

public class Foo ...
    public String mArg2;
    public void setArg2(String arg2) { mArg2 = arg2; }

And try to run just the deserialization with the existing file,

    public static void main(String[] args) 
        throws IOException, ClassNotFoundException {

        FileInputStream in = new FileInputStream("temp.Foo");
        ObjectInput objIn = new ObjectInputStream(in);
        Foo foo2 = (Foo) objIn.readObject();
        System.out.println("2: " + foo2.mArg1);
    }

Because I haven’t changed readExternal(), we still get the same answer.

But what if I want to serialize the second argument? And give it a default value of null for instances serialized before it was added? The obvious changes to read and write don’t work,

    public void readExternal(ObjectInput in) 
        throws IOException, ClassNotFoundException {
        mArg1 = (String) in.readObject();
        mArg2 = (String) in.readObject();
    }
    public void writeExternal(ObjectOutput out) 
        throws IOException {
        out.writeObject(mArg1);
        out.writeObject(mArg2);
    }

It’ll throw an exception, as in

c:\carp>java Foo
Exception in thread "main"
        at Foo.readExternal(
        at Foo.main(

Now what?

The Trick: Marker Objects

Here’s what I’ve been doing (well, actually, I’m doing this through a serialization proxy, but the idea’s the same),

    public void readExternal(ObjectInput in) 
        throws IOException, ClassNotFoundException {
        Object obj = in.readObject();
        if (Boolean.TRUE.equals(obj)) {
            mArg1 = (String) in.readObject();
            mArg2 = (String) in.readObject();
        } else {
            mArg1 = (String) obj;
        }
    }
    public void writeExternal(ObjectOutput out) 
        throws IOException {
        out.writeObject(Boolean.TRUE);
        out.writeObject(mArg1);
        out.writeObject(mArg2);
    }

The new implementation always writes an instance of Boolean.TRUE (any small object that’s easily identifiable would work), followed by the values of the first and second arguments. (In practice, these might be null in this class, so we’d have to be a bit more careful.) The trick is that the external read method reads an object, then checks whether it’s Boolean.TRUE. If it’s not, we have an instance serialized by the previous version, so we cast that object to a string, assign it to the first argument, and return. If it is, we read two arguments and assign them.

Now the result of deserializing our previous instance works again:

c:\carp>java Foo
2: abc

And there you have it, backward compatibility with new non-transient member variables.

The Fine Print

The trick only works if there’s an object that’s being serialized (or a primitive with a known restricted range of possible values). It doesn’t have to be the first variable, but there has to be something you can test to tell which version of the class you have.

Sequence Alignment with Conditional Random Fields

April 25, 2010

Having spent hours figuring out how to properly normalize edit probabilities of an edit channel model so they sum to 1 over all reads of a given length given a reference sequence, the answer seems obvious in retrospect. Use conditional random fields (CRFs), because they can normalize anything and are easy to program. An added benefit is that biologists are used to thinking of edit costs on a log scale.

Once we have a conditional model with an arbitrary linear basis, it is straightforward to account for differential expression given location or initial hexamer (more on that below).

The Sequence Alignment Problem

The first step in most DNA or RNA analysis pipelines is sequence alignment, wherein the bases read from a biological sample by a sequencer are aligned with the bases of one or more reference sequences (genome sequences or gene models of RNA transcripts). The present generation of sequencers (e.g. Illumina or SOLiD) produces tens or even hundreds of millions of reads of 50 or so bases per run.

Typically, simpler models like pure edit distance are computed, using hashing heuristics to narrow search or suffix arrays to share prefix calculation. The model described here is intended to be run over a narrower set of alignment hypotheses.

Generative Expression Models Require Probabilistic Alignments

In a previous post, I described a Bayesian hierarchical model of mRNA expression. That model contains a random variable for each read that is dependent on the sequence from which it is read. Inference with this model requires us to calculate the probability of a read given a reference sequence. The model described here directly plugs into the expression model. Probabilistic alignments may also be used in SNP calling, splice-variant discovery, or any other applications of sequence alignment.


Our model is based directly on the simple notion of alignment, which consists of a sequence of matches/substitutions, inserts (gaps in reference), and deletes (gaps in query/read). Repeating the example from my earlier post on probabilistic channel models, the following is an alignment of the read/query sequence AGTTAGG (below) to the reference sequence ACGTACG (above):


The first C in the reference is deleted, the second T in the read is inserted, and the second G in the read is substituted for the second C in the reference. We can represent alignments as sequences of edit operations. For the alignment above, we have the sequence of eight edits

m(A), d(C), m(G), m(T), i(T), m(A), s(G,C), m(G).

The alignment indicates the bases involved so that the reference and query/read sequences may be reconstructed from the alignment.
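The reconstruction is mechanical. Here’s a Python sketch, encoding edits as tuples with s(read base, reference base) as in the example above:

```python
def reconstruct(edits):
    """Recover (reference, read) from a sequence of edit operations.

    Ops: ("m", b) match, ("s", read_b, ref_b) substitution,
         ("i", b) insert (gap in reference), ("d", b) delete (gap in read).
    """
    ref, read = [], []
    for op in edits:
        if op[0] == "m":
            ref.append(op[1]); read.append(op[1])
        elif op[0] == "s":
            read.append(op[1]); ref.append(op[2])
        elif op[0] == "i":
            read.append(op[1])
        elif op[0] == "d":
            ref.append(op[1])
    return "".join(ref), "".join(read)

edits = [("m", "A"), ("d", "C"), ("m", "G"), ("m", "T"),
         ("i", "T"), ("m", "A"), ("s", "G", "C"), ("m", "G")]
ref, read = reconstruct(edits)  # → ("ACGTACG", "AGTTAGG")
```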

As we noted before, there are typically very many ways to align a read to a reference. Here’s an alternative alignment,


which corresponds to the following sequence of seven edits

m(A), s(G,C), s(T,G), m(T), m(A), s(G,C), m(G).

Read Probabilities from Alignment Probabilities

Suppose we have a reference sequence of M bases x = x_1,\ldots,x_M. The CRF we propose will model the probability p(a,m|x,N) of alignment a aligning at position m of reference sequence x and producing a read of length N. We’ll let \mbox{read}(a) be the read generated by the alignment a.

We compute the probability of a read y of length N starting at position m of the reference sequence as

p(y,m|x,N) = \sum_{\{a \ : \ \mbox{read}(a) = y\}} p(a,m|x,N).

We can then marginalize out the starting position m by summing to calculate the probability of a read,

p(y|x,N) = \sum_{m =1}^M p(y,m|x,N).

Linear Basis

In general, CRFs are based on arbitrary potential functions for cliques of dependent variables which combine linearly by summation over all cliques. We break down the basis into a contribution from edits and starting position. For edits, we take

\varphi(a,N) = \sum_{m = 1}^{\mbox{length}(a)} \varphi(a_m)

where \varphi(a_m) is the potential of the edit operation a_m. Recall that the edits include their edited bases, so we have four match edits m(Z), twelve substitution edits s(Z1,Z2) with Z1 != Z2, four insert edits i(Z) and four delete edits d(Z), for a total of 24 edit operations. So \varphi(\cdot) provides values in (-\infty,\infty) for each of the 24 operations, and the weight of a sequence of operations is the sum of the operation weights.
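Enumerating the operation inventory in code makes the count of 24 explicit (a sketch):

```python
from itertools import product

BASES = "ACGT"
matches = [("m", b) for b in BASES]
subs = [("s", r, q) for r, q in product(BASES, BASES) if r != q]
inserts = [("i", b) for b in BASES]
deletes = [("d", b) for b in BASES]
EDIT_OPS = matches + subs + inserts + deletes
# 4 matches + 12 substitutions + 4 inserts + 4 deletes = 24 operations.
```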

For starting positions, we take a potential function \psi(m,x) for starting an alignment at position m of reference sequence x = x_1,\ldots,x_M.

The probability of an alignment at a starting position is defined to be proportional to the exponentiated sum of potentials,

p(a,m|x,N) \propto \exp(\varphi(a,N) + \psi(m,x) + \xi(a,N)),

where we take \xi(a,N) to be an indicator variable with value 0 if \mbox{length}(\mbox{read}(a)) = N and -\infty otherwise. This makes sure our distribution is normalized for the length of reads we care about, and will also make the dynamic programming required for inference tractable.

Normalization is straightforward to calculate in theory. To compute normalized probabilities, we take

p(a,m|x,N) = \frac{1}{Z(x,N)} \exp(\varphi(a,N) + \psi(m,x) + \xi(a,N)), where

Z(x,N) = \sum_{a,m}  \exp(\varphi(a,N) + \psi(m,x) + \xi(a,N)).

As they say in the old country, Bob’s your uncle.

Sum-Product Algorithm for Read Probabilities

It’s the usual sum-product algorithm for CRFs: just like the standard Smith-Waterman-style edit algorithm, except that (1) we must consume the entire read, (2) start costs are given by \psi, and (3) we use log sums of exponentials rather than maxes to compute cell values without underflow.
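The log-sum-exp used for those cell values is the standard underflow-avoiding trick (sketch):

```python
import math

def log_sum_exp(xs):
    """Stable log(sum(exp(x) for x in xs)); replacing this with max(xs)
    in the DP recursion gives the Viterbi cell values instead."""
    m = max(xs)
    if m == -math.inf:  # all terms are zero probability
        return m
    return m + math.log(sum(math.exp(x - m) for x in xs))
```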

Sum-Product for Normalization

To compute the normalizing term Z(x,N), we run a similar sum-product algorithm, only with an extra sum for all the possible reads.

Viterbi Algorithm for Best Alignments

With the same caveats (1) and (2) above, we can use the usual maximum operation to produce the best alignment. We can also follow LingPipe’s approach and use Viterbi in the forward direction and A* in the reverse direction to enumerate the n-best alignments.

Affine Gap Basis and Edit Context Sensitivity

It’s easy to take affine gaps into account. Just condition the edit operations on the previous edit operation. You have to add the relevant context to dynamic programming as in the usual affine extension to Smith-Waterman alignment.
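A sketch of what conditioning on the previous operation looks like for affine gaps (the numeric potentials are illustrative, not taken from anywhere):

```python
def edit_potential(op, prev_op,
                   gap_open=-4.0, gap_extend=-1.0, match=1.0, sub=-2.0):
    """Affine-gap potential: an insert/delete is cheaper when it extends
    a gap of the same kind than when it opens a new one."""
    kind = op[0]
    if kind == "m":
        return match
    if kind == "s":
        return sub
    # insert ("i") or delete ("d"): extension if the previous op matches
    if prev_op is not None and prev_op[0] == kind:
        return gap_extend
    return gap_open
```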

It’s also easy to model things like the uncertainty caused in Illumina reads by homopolymers (sequences repeating the same base) by allowing context for the previous base. This is also easy to dynamic program, though adding context adds to space and time requirements.

Modeling Uneven Reads

Depending on the sample preparation process in RNA sequencing experiments (RNA-Seq) (e.g. RNA fragmentation by hydrolysis versus oligo-primed cDNA fragmentation by sonication), there may be effects in read coverage based on position (e.g. RNA fragmentation depleted at 3′ and 5′ end or cDNA fragmentation favoring reads toward the 3′ end) or on the sequence of bases near the start of the read (e.g. amplification by “random” hexamer priming).

These effects are not small. [Hansen, Brenner and Dudoit 2010] showed that hexamer binding in the Illumina process can cause over a 100-fold difference in expression depending on the first two hexamers in a read. Similarly, see figure 3 in [Wang, Gerstein and Snyder 2009], which demonstrates over a tenfold effect on expression based on position. (I sure wish there were variance estimates in the Wang et al. graphs.)

Modeling Input Quality

We can also add input quality to our calculations if it’s on a log scale. Then we have to select which base was actually read, making it more like the use of CRFs for tagging. For an example of how to do this for standard edit costs, see the description of the GNUMAP system in [Clement et al. 2010]. As the authors point out, this is a serious issue with reads as noisy as those coming off the Illumina or SOLiD platforms.

Inferring Splice-Variant mRNA Expression from RNA-Seq

February 5, 2010

I introduced the problem of splice-variant expression from RNA-Seq data in my previous post on maximum likelihood estimates of RNA-Seq expression. In this post, I’ll show you a Bayesian approach to the problem from which we can compute a full posterior over the latent read mappings and relative mature mRNA expression of gene splice variants.

The Model

The data from which we infer expression is

  • K \in \mathbb{N}^{+}, the number of splice variants,
  • N \in \mathbb{N}^{+}, the number of reads, and
  • y_1,\ldots,y_N, the reads.

For the purpose of this model, the reads are likely to be characterized in terms of the output of a sequencer, such as base calls with quality scores.

We assume two model hyperparameters,

  • \varphi, genetic variation, sample preparation, and sequencing parameters, and
  • \alpha_1,\ldots,\alpha_K \in \mathbb{R}^{+}, the prior read counts (plus one).

The general-purpose parameter \varphi reflects how reads are generated, including the expected variation from the reference genome, the sample preparation process, and the sequencing platform’s error profile. The variation from the reference genome will determine indel and SNP rates, for instance using a probabilistically normalized affine edit distance model. Details of the sample preparation process, such as amplicon selection, fractionation technique, and so on, also determine which mRNA sequences are actually observed (through cDNA). Finally, the sequencing platform’s error profile covers effects such as color confusability in image space, decreasing accuracy from the start to the end of the read, and sensitivity to context (as in the color-space dependencies of SOLiD reads or errors due to multiple copies of the same base).

We will infer two of the model parameters,

  • t_1,\ldots,t_N \in 1:K, the mapping of read to splice variant, and
  • \theta_1,\ldots,\theta_K \in [0,1] such that \sum_{k=1}^K \theta_k = 1, the read expression probabilities.

The model structure in sampling notation is

  • \theta \sim \mbox{\sf Dirichlet}(\alpha)
  • t_n \sim \mbox{\sf Discrete}(\theta) for n \in 1:N
  • y_n \sim \mbox{\sf Channel}(t_n,\varphi) for n \in 1:N.

We select the expression levels \theta based on prior counts.

The heart of the mapping model is the discrete sampling of the latent mappings t_n based on the expression level parameter \theta. This is effectively a Dirichlet-multinomial model of expression at the corpus level (the multi-variant generalization of the beta-binomial).

The mysterious part of this whole thing is the channel model, which assigns the probability of a given read y_n being observed given it arose from the splice variant t_n under the sequencing model parameterized by \varphi.

Simple Channel Examples

For purposes of illustration, we’re going to assume a really simple instance of this model.

First, we’ll assume that \alpha_k = 1 for k \in 1:K, which renders \mbox{\sf Dirichlet}(\alpha) a uniform distribution over possible values of \theta. With replicates at the sequencing run level (or even over channels for Solexa/Illumina data), we could build a hierarchical model and estimate \alpha.

Second, we’ll assume a simple uniform read model. Specifically, we’ll assume that a read is just as likely to come from anywhere on the mRNA; a richer model would account for known variation due to the experimental procedure.

Third, we’ll assume a mapping program that aligns reads with the reference gene models for splice variant sequences. If the mapper provides multiple mappings for a read, we will treat them all as equally likely; in a more articulated model, we would take into account sequencing quality scores, the error properties of the sequencer, and likely variation such as indel and SNP rates in estimating the probability of a read given a splice variant.

Although cross-junction splice mapping is straightforward to model versus a known reference gene model, we’ll restrict attention to exon reads here.

With all of these assumptions in place, if read y_n is mapped to splice variant t_n \in 1:K, we set

\mbox{\sf Channel}(y_n|t_n,\varphi) = 1 / (\mbox{len}(t_n) - \mbox{len}(y_n) + 1).

The denominator \mbox{len}(t_n) - \mbox{len}(y_n) + 1 counts the number of positions at which a sequence of the given read length \mbox{len}(y_n) may start in an mRNA of length \mbox{len}(t_n).
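In code (a sketch; a read of length L can start at M − L + 1 positions in a sequence of length M):

```python
def channel_prob(variant_len, read_len):
    """Uniform-start channel: probability of a read given its variant.

    A read of length L has variant_len - L + 1 possible start positions.
    """
    n_starts = variant_len - read_len + 1
    return 1.0 / n_starts if n_starts > 0 else 0.0
```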

A Simple BUGS Implementation

Although not scalable to RNA-Seq data sizes, it’s simple to investigate the properties of our model in BUGS.

model {
    theta[1:K] ~ ddirch(alpha[1:K])
    for (n in 1:N) {
        t[n] ~ dcat(theta[1:K])
        y[n] ~ dcat(rho[t[n],1:K])
    }
}

BUGS programs look just like the sampling notation defining the model; ddirch is the Dirichlet distribution and dcat is the discrete (aka categorical) distribution.

I call it from R in the usual way. Here’s just the part where R tells BUGS how to run the model. After the BUGS window is closed, R plots and attaches the data.


data <- list("alpha", "rho", "y", "N", "K")
parameters <- c("theta","t")
inits <- function() { list(theta=c(1/3,1/3,1/3)) }
rnaexp <- bugs(data, inits, parameters,
        n.chains=3, n.iter=50000, DIC=FALSE,
        debug=TRUE, clearWD=TRUE,
        bugs.directory="c:\\WinBUGS\\WinBUGS14")

One debugging problem I had is that you absolutely need to tell BUGS not to compute DIC (a kind of model comparison stat); otherwise, BUGS crashes when trying to compute it for this model.

Attaching the data converts the parameter of the model to matrix data representing the Gibbs samples.

An Example

I’ll close with an example that illustrates how the R and BUGS program work, and how the model is able to infer expression even without a single non-multi-mapped read. The gene model and expression levels for the simulation are defined in a comment in the R code. We’ll consider a case with three exons and three isoforms:

# exons  A BB C  (B twice as long as others)
# var1 = A BB
# var2 =   BB C
# var3 = A    C

There are three exons, A, BB, and C, where the notation is meant to indicate the middle exon is twice as long as the others (to spice up the read calculation a bit). There are three splice variants, labeled var1, var2, and var3. Variant 1 is made up of exons A and BB, variant 2 of exons BB and C, and variant 3 of exons A and C.

Without cross-junction reads, it’s not possible to get a uniquely mapped read. Any read mapping to exon A will match splice variants 1 and 3, any read mapping to exon BB will match variants 1 and 2, and any read mapping to exon C will match 2 and 3.

Next, we specify the number of samples drawn from each exon in each splice variant.

#       exon
# var  1  2  3  Total theta*
# 1    1  2    | 3     3/19
# 2       4  2 | 6     6/19
# 3    5     5 | 10    10/19
#      --------
# tot  6  6  7   19

There are 3 samples from variant 1, 6 samples from variant 2, and 10 samples from variant 3, for 19 in total. Within each variant, these are distributed uniformly by position, so the double-length exon BB draws twice as many samples. Here, variant 1 has 1 sample from exon 1 and 2 samples from exon 2, variant 2 has 4 samples from exon 2 and 2 samples from exon 3, whereas variant 3 has 5 samples each from exon 1 and exon 3.

If we adjust for length, variant 1 has 1 mRNA transcript, variant 2 has 2 transcripts, and variant 3 has 5 transcripts; these are computed by dividing by length. But our expression parameter theta (\theta) is defined by reads, not by transcripts.
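That length adjustment is a one-line computation (sketch, measuring variant length in exon units with BB counting twice):

```python
# Reads per variant divided by variant length gives relative
# transcript counts; exon BB counts as two length units.
reads = {"var1": 3, "var2": 6, "var3": 10}
length = {"var1": 3, "var2": 3, "var3": 2}  # A+BB, BB+C, A+C
transcripts = {v: reads[v] / length[v] for v in reads}
# → var1: 1.0, var2: 2.0, var3: 5.0
```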

The observed data is much simpler — it’s just the total number of reads for samples mapping to each exon. We see that there are 6 samples mapping to exon 1, 6 samples mapping to exon 2, and 7 samples mapping to exon 3.

Coding the Example in R

We set the Dirichlet prior to be uniform, which also determines our number of splice variants K:

alpha <- c(1,1,1)
K <- length(alpha)

We actually assume there are 10 times that many samples as in our figure, setting the read data as:

y <- c(rep(1,60),rep(2,60),rep(3,70))
N <- length(y)

That is, there are 60 reads from exon 1, 60 reads from exon 2 and 70 reads from exon 3, and N=190.

Finally, we use R’s matrix constructor to set up the read model:

rho <- matrix(c(1/3,2/3,0,  0,2/3,1/3,  1/2,0,1/2),
              ncol=3, byrow=TRUE)

I tried to make this human readable: a sample from the first splice variant has a 1/3 chance of originating from exon 1 and a 2/3 chance of originating from exon 2 (because exon 2 is twice as long as exon 1). A sample from the third splice variant has an equal chance of being from either exon 1 or exon 3 (exon 1 and exon 3 are the same length).
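As a sanity check, theta and rho together should reproduce the observed exon proportions (a quick Python sketch):

```python
# theta: read-level expression; rho[k][e] = p(exon e | variant k).
theta = [3/19, 6/19, 10/19]
rho = [[1/3, 2/3, 0],
       [0, 2/3, 1/3],
       [1/2, 0, 1/2]]
# Marginal probability that a read falls in each exon:
exon_p = [sum(theta[k] * rho[k][e] for k in range(3)) for e in range(3)]
# → approximately [6/19, 6/19, 7/19], matching the 60/60/70 read split.
```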

The Output

I ran the R program, which called BUGS, which terminated neatly and returned to R. The \hat{R} values were all pinned at 1.0, indicating good chain mixing and hence convergence of the sampler.

Bayesian Point Estimate

We can compute any statistics we want now with the posterior samples. For instance, theta[,1] contains all the Gibbs samples for the variable \theta_1.

We can then generate the Bayesian (L2) estimates for expression levels:

> thetaHat = c(mean(theta[,1]), mean(theta[,2]), mean(theta[,3]))
> print(thetaHat,digits=2)
[1] 0.17 0.31 0.52

The “right” answer for our simulation is also easy to compute with R:

> print(c(3/19, 6/19, 10/19), digits=2)
[1] 0.16 0.32 0.53

Our estimates are pretty darn close. (Recall we can normalize these sample counts by length to get mRNA transcript counts for each variant.)

How certain are we of these estimates? Not very, because there were only 190 samples and lots of mapping uncertainty. It’s easy to estimate posterior 95% intervals, 50% intervals, and medians using R’s quantile function over the Gibbs samples:

> print(quantile(theta[,1], c(0.025,0.25,0.5,0.75,0.975)), digits=3)
 2.5%   25%   50%   75% 97.5%
0.025 0.104 0.167 0.231 0.340
> print(quantile(theta[,2], c(0.025,0.25,0.5,0.75,0.975)), digits=2)
 2.5%   25%   50%   75% 97.5%
 0.13  0.25  0.31  0.37  0.47
> print(quantile(theta[,3], c(0.025,0.25,0.5,0.75,0.975)), digits=2)
 2.5%   25%   50%   75% 97.5%
 0.42  0.49  0.52  0.56  0.61

With more data, and especially with uniquely mapping reads (e.g. across unique junction boundaries or in unique exons for a variant), posterior uncertainty will be greatly reduced.

A Triangle Plot and Correlated Expressions

Given that we only have three splice variants, we can create a so-called triangle plot of the expression levels, which will show correlations among the expression levels estimated for the variants on the simplex. Here’s the plot:

I couldn’t figure out how to label the axes in R’s ade4 package, but it’s clear from the means that are labeled that the upper left axis (mean of 0.17) is for splice variant 1, the bottom axis (mean of 0.308) is for variant 2, and the top right axis (mean of 0.522) for variant 3.

It takes some practice to view triangle plots like this; it’s perhaps easier to view two at a time. Here are the two-way interaction plots:

two-way plots of RNA-seq expression 1 vs 2

two-way plots of RNA-seq expression 1 vs 3

two-way plots of RNA-seq expression 2 vs 3

These show the strong negative correlation between estimates of the expression for variants 1 and 2, but little correlation between these and variant 3. This makes sense given variants 1 and 2 share the large exon BB.

Finally, here’s the source that generated the PNG file for the triangle plot:


and for the ordinary plots:

plot(theta[,c(1,2)],xlab="variant 1",ylab="variant 2",
  xlim=c(0,1),ylim=c(0,1),pch=20,main="1 vs. 2")

plot(theta[,c(1,3)],xlab="variant 1",ylab="variant 3",
  xlim=c(0,1),ylim=c(0,1),pch=20,main="1 vs. 3")

plot(theta[,c(2,3)],xlab="variant 2",ylab="variant 3",
  xlim=c(0,1),ylim=c(0,1),pch=20,main="2 vs. 3")

Hierarchical Bayesian Batting Ability, with Multiple Comparisons

November 4, 2009

Just in time for the last game(s) of this year’s World Series, the final installment of my example of Bayesian stats using baseball batting ability. I took us off-topic (text analytics) with a general intro to Bayesian stats (What is “Bayesian” Statistical Inference?). As a first example, I compared Bayesian calculations of binomial posteriors with point estimates (Batting Averages: Bayesian vs. MLE Estimates). In exploring the question of “where do priors come from?”, I started with simple non-Bayesian point estimates (Moment Matching for Empirical Bayes Beta Priors for Batting Averages). Finally, I realized I’d meant “ability” where I’d said “average” (former is a latent parameter, latter is a statistic calculated from at bats), when I considered Bayesian point estimates (Bayesian Estimators for the Beta-Binomial Model of Batting Ability).

Hierarchical Model Recap

For batter j, the number of hits n_j in N_j at bats is modeled as a binomial,

n_j \sim \mbox{\sf Binom}(\theta_j,N_j) for j \in 1:J,

where the ability, or chance of getting a hit, for batter j is \theta_j. Ability is modeled as having a beta distribution

\theta_j \sim \mbox{\sf Beta}(\alpha,\beta) for j \in 1:J

with prior number of hits \alpha-1 and prior number of outs \beta-1. These parameters, which act as a prior for the binomial parameter, are themselves given priors. The mean of the beta, \alpha/(\alpha+\beta), is given a uniform prior,

\alpha/(\alpha+\beta) \sim \mbox{\sf Beta}(1,1).

The prior for the scale \alpha+\beta is modeled by a Pareto distribution, which has probability proportional to 1/(\alpha+\beta)^{2.5},

\alpha + \beta \sim \mbox{Pareto}(1.5).
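Putting the pieces together, the generative story is short enough to sketch in code. The fitting below is done with BUGS, not this; the following is a hypothetical Python simulation of the model, with sample_prior and simulate_season being made-up names for illustration:

```python
import random

random.seed(1)

def sample_prior():
    # phi = alpha/(alpha+beta) ~ Uniform(0,1), the uniform prior on the beta mean;
    # kappa = alpha+beta ~ Pareto(1.5), density proportional to 1/kappa^2.5, kappa >= 1
    phi = random.uniform(0, 1)
    kappa = random.paretovariate(1.5)
    return phi * kappa, (1 - phi) * kappa

def simulate_season(at_bats):
    # draw abilities theta_j ~ Beta(alpha, beta), then hits n_j ~ Binom(theta_j, N_j)
    alpha, beta = sample_prior()
    thetas = [random.betavariate(alpha, beta) for _ in at_bats]
    hits = [sum(random.random() < th for _ in range(n))
            for th, n in zip(thetas, at_bats)]
    return thetas, hits

thetas, hits = simulate_season([60, 482, 521, 695])
```

Running the simulation forward like this is also a cheap sanity check that the priors produce plausible abilities.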


I used the 2006 AL position player data (given in this previous blog post). That fixes the number of players J and the hits n_j and at-bats N_j for each player j \in 1:J.

I then used the BUGS encoding of the model above (as shown in this previous post). BUGS automatically implements Gibbs sampling, a form of Markov chain Monte Carlo (MCMC) inference. I ran 3 chains of 1000 samples each, retaining the last 500 samples, for 1500 posterior samples total. All parameters had \hat{R} values very near 1, indicating convergence of the Markov chains. As usual, we read off statistics from the samples and used the sampled values for inference, where they allow the integrals involved in Bayesian inference (as described in this blog post) to be computed.

Beta Posterior

We can inspect the posterior for the beta mean \alpha/(\alpha+\beta) and scale \alpha+\beta parameters. We plot them as a 2D scatterplot with their means picked out as lines.

beta parameters posterior

As usual, the scale is much harder to estimate than the mean.

Ability Posteriors

We can also plot the ability for particular players. Here are histograms of sampled posteriors for the players with the best average posterior ability estimates, with their hits and at-bats provided as labels above the plot:

Notice how the variance of the posterior is determined by the number of at bats. The player with 60 at bats has a much wider posterior than the player with 695 at bats.

Multiple Comparisons: Who’s the Best Batter?

We’re going to do what Andrew, Jennifer and Masanao suggest, which is to use our hierarchical model to make a posterior comparison of players’ abilities. In particular, we’ll estimate the posterior probability that a particular player is the best player. We simply look at all 1500 posterior samples, which include ability samples as plotted above, and count how many times a player has the highest ability in a sample. Then divide by 1500, and we’re done. It’s a snap in R, and here are the results, for the same batters as the plot above:

Average  At-Bats  Pr(best ability)
 .367      60          0.02
 .342     482          0.08
 .347     521          0.12
 .322     695          0.02
 .330     648          0.04
 .343     623          0.11
 .330     607          0.04

The .347 batter with 521 at bats not only has the highest estimated chance in our model of having the highest ability, he also won the batting crown (Joe Mauer of Minnesota). The beta-binomial hierarchical model estimates only a 12% chance that this batter has the highest ability. The estimate is very close for the .343 batter with 623 at bats (Derek Jeter of the Yankees). [It turns out the race to the batting crown came down to the wire.]

The larger number of at bats provides more evidence that the batter has a higher ability than average, thus pulling the posterior ability estimate further away from the prior mean. Finally, note that we’re assigning some chance that the .367 batter with only 60 at bats is best in the league. That’s because when the samples are on the high side of the distribution, this batter’s best.
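For concreteness, the tally described above is a few lines of code. Here’s a Python sketch (the real analysis uses R over BUGS output), with made-up Beta(80,200) draws standing in for the actual Gibbs samples:

```python
import random

# Stand-in for the Gibbs output: 1500 posterior samples of 7 players' abilities.
# Beta(80, 200) is a made-up posterior just to exercise the tally.
random.seed(7)
num_samples, num_players = 1500, 7
samples = [[random.betavariate(80, 200) for _ in range(num_players)]
           for _ in range(num_samples)]

# For each posterior sample, find the player with the highest sampled ability;
# the fraction of samples a player "wins" estimates Pr(best ability).
wins = [0] * num_players
for row in samples:
    wins[row.index(max(row))] += 1
pr_best = [w / num_samples for w in wins]
```

The same counting trick estimates the posterior probability of any event definable over the sampled parameters, not just “who’s best”.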

Examples of “Futility of Commenting Code”

October 19, 2009

Continuing on from the previous blog entry, The Futility of Commenting Code, I’d like to address the dissenting comment of Sandman, who claims to have only seen the kind of caricature comments I made up in posts about commenting. Well, here goes reality.

A Real Example

Consider the following real example, which I chose because I knew the source for Apache projects was available online, and because I suspected Apache Commons would be less heavily vetted than more popular projects. I dove into the e-mail package, because it’s the first one I actually use. I found this example right away:

pieces of which I repeat here:

/**
 * Create a datasource from a String.
 * @param data A String.
 * @param aType A String.
 * @throws IOException IOException
 */
public ByteArrayDataSource(String data, String aType) throws IOException {
    this.type = aType;
    try {
        baos = new ByteArrayOutputStream();
        // Assumption that the string contains only ASCII
        // characters! Else just pass in a charset into this
        // constructor and use it in getBytes().
        baos.write(data.getBytes("iso-8859-1"));
    } catch (UnsupportedEncodingException uex) {
        throw new IOException("The Character Encoding is not supported.");
    } finally {
        if (baos != null) {
            ...
I wasn’t talking about API comments, but these display the same problem. This public constructor is documented with “Create a datasource from a String”, but in fact, there are two string parameters, both documented as “a String”. That’s what the declaration says, so this is exactly the kind of redundant comment I was talking about.

Next, consider the one code comment. It starts on the wrong foot, with “Assumption that the string contains only ASCII characters!”. If you look at the method call, data.getBytes("iso-8859-1"), you’ll see that it’s actually more general than documented, in that it’ll work for any ISO-8859-1 characters (aka Latin1). The second part of the comment, “Else just pass in a charset into this constructor and use it in getBytes()” makes no sense, because the character encoding is hard coded, and there is no char encoding argument to the constructor.

Furthermore, the catch block (with its throw) should just be removed. It’s catching an UnsupportedEncodingException, which extends IOException, then throwing a fresh IOException. With the catch block gone, an unsupported encoding throws an unsupported encoding exception; you don’t even need to change the signature, because unsupported encoding exceptions are kinds of I/O exceptions. There are two problems with the code as is. First, the IOException reports an unhelpful message; the unsupported encoding exception has the info on what went wrong in its message. Second, it’s throwing a more general type, making it harder for catchers to do the right thing. You might also consider the fact that the original stack trace is lost a problem.

Another instance of (almost) commenting the language is later in the same file:

try {
    int length = 0;
    byte[] buffer = new byte[ByteArrayDataSource.BUFFER_SIZE];
    bis = new BufferedInputStream(aIs);
    baos = new ByteArrayOutputStream();
    osWriter = new BufferedOutputStream(baos);
    //Write the InputData to OutputStream
    while ((length = bis.read(buffer)) != -1) {
        osWriter.write(buffer, 0, length);
    }
} ...

I’d argue comments like “Write the InputData to OutputStream” are useless, because this is the idiom for buffered writes.

Aside on Bad Code

Maybe you think you should always buffer streams. In this case, that’s wrong. Both bufferings simply cause twice as many assignments as necessary.

Buffering the input stream is useless because you’re already reading into the byte array buffer.

Buffering the output stream is useless, because you’re writing to a byte array output stream baos.

Furthermore, you don’t need to close or flush a byte array output stream, because both are no-ops (see the ByteArrayOutputStream javadoc).

Finally, the naming is wonky, because osWriter is not a writer, it’s an output stream.

This isn’t an isolated case. Other files in this package have similar problems with doc.

Another Example

While we’re picking on Apache Commons, I checked out another package I’ve used, fileUpload.

public class DiskFileUpload extends FileUploadBase {
    // ------------------------------------ Data members

    /**
     * The factory to use to create new form items.
     */
    private DefaultFileItemFactory fileItemFactory;

    // ------------------------------------ Constructors

That’s the kind of pre-formatted useless stuff I was talking about. We know what a member variable and constructor look like in Java. There weren’t any other code comments in that file.

In that same package, the fileupload.ParameterParser class has some, though, and I’m guessing they’re of the kind that others mentioned as “good” comments, such as:

    // Trim leading white spaces
    while ((i1 < i2) && (Character.isWhitespace(chars[i1]))) {
        i1++;
    }

I’d argue that perhaps a method called trimLeadingWhiteSpace() implemented the same way would be clearer. But if you’re not going to define new methods, I’d say this kind of comment helps. But always verify that the code does what it says it does; don’t take the comment’s word (or method name’s word) for it.

I couldn’t leave that file without commenting on their return-only-at-the-end-of-a-function style:

String result = null;
if (i2 > i1) {
    result = new String(chars, i1, i2 - i1);
}
return result;

I have no idea why people do it, but as you can see in this case, it just makes the code hard to follow.

More Examples

I thought only a couple of examples wouldn’t be convincing. So here are some links and examples:

public boolean isFull() {
    // size() is synchronized
    return (size() == maxSize());
}

Documenting what’s clear from the code. (And nice section divider comments.)

// Return the parser we already created (if any)
if (parser != null) {
    return (parser);
}

...
} catch (RuntimeException e) {
    // rethrow, after logging
    log.error(e.getMessage(), e);
    throw e;
}
Yes, that’s what return, log, and throw do.

public LogFactoryImpl() {
    initDiagnostics(); // method on this object
}

Yes, it really says “method on this object”.

There’s lots more, so I’ll leave finding them as an exercise.

It’s Not All Bad

There were useful code comments in those packages that explained how the code corresponded to an algorithm in Knuth, or how complex invariants were being preserved in complex loops. I’m really not saying don’t comment code at all.

The Futility of Commenting Code

October 15, 2009

I often read about novice programmers’ aspirations for a greater density of code comments, because they think it’s “professional”. I have news for you. Professional coders don’t comment their own code much and never trust the comments of others they find in code. Instead, we try to learn to read code and write more readable code.

API Documentation vs. Code Comments

I’m a big believer in clear API documentation. Java makes this easy with javadoc. It even produces relatively spiffy results.

But if you look at the LingPipe source code, you’ll find very few code comments.

Comments Lie

The reason to be very suspicious of code comments is that they can lie. The code is what’s executed, so it can’t lie. Sure, it can be obfuscated, but that’s not the same as lying.

I don’t mean little white lies, I mean big lies that’ll mess up your code if you believe them. I mean comments like “verifies the integrity of the object before returning”, when it really doesn’t.

A typical reason for slippage between comments and code is patches to the code that are not accompanied by patches to the comments. (This can happen for API doc, too, if you fiddle with your APIs.)

Another common reason is that the code author didn’t actually understand what the code was doing, so wrote comments that were wrong. If you really mess up along these lines, your code’s likely to be called out on blogs like Coding Horror or Daily WTF.

Most Comments Considered Useless

The worst offenses in the useless category are things that simply repeat what the code says.

// Set x to 3
x = 3;

It’s even better when embedded in a bloated Fortran-like comment block:

/************** add1 **********************
 *  A function to add 1 to integers       *
 *  arguments                             *
 *     + n, any old integer               *
 ******************************************/
public int add1(int n) { return n+1; }

Thanks. I didn’t know what int meant or what n+1 did. This is a classic rookie mistake, because rookies don’t know the language or its libraries, so often what they do seems mysterious to them. For instance, commenting n >>> 2 with “shift two bits to right and fill in on the left with zeroes” may seem less egregious, but your reader should know that >>> is the unsigned shift operator (or look it up).

There is a grey line in the sand (I love mixed metaphors) here. When you’re pulling out techniques like those in Bloch and Gafter’s Java Puzzlers, you might want to reconsider how you code something, or adding a wee comment.

Eliminate, don’t Comment Out, Dead Code

I remember being handed a 30-page Tcl/Tk script at Bell Labs back in the 1990s. It ran some kind of speech recognition pipeline, because that’s what I was up to at the time. I picked it up and found dozens of methods with the same name, and lots of commented-out code. This makes it really hard to follow what’s going on, especially if whole blocks get commented out with /* ... */.

Please don’t do this. Learn a version control system instead, like SVN.

When do I Comment Code?

I add comments in code in two situations. The first is when I wrote something inefficiently, but I know a better way to do it. I’ll write something like “use dynamic programming to reduce quadratic to linear”. This is a bad practice, and I wish I could stop myself from doing it. I feel bad writing something inefficient when I know how to do it more efficiently, and I certainly don’t want people reading the code to think I’m clueless.

I know only one compelling reason to leave comments: when you write code that’s not idiomatic in the language it’s written in, or when you do something for efficiency that’s obscure. And even then, keep the comments telegraphic, like “C style strings for efficiency”, “unfolds the E step and M step of EM”, “first step unfolded for boundary checks” or something along those lines.

Update: 13 Dec 2012. I’ve thought about this issue a bit more and wanted to add another reason to comment code: to associate the code with an algorithm. If you’re implementing a relatively complex algorithm, then you’re going to have the algorithm design somewhere and it can be helpful to indicate which parts of the code correspond to which parts of the algorithm. Ideally, though, you’d just write functions with good names to do the stages of the algorithm if they’re clear. But often that’s not really an option because of the shape of the algorithm, mutability of arguments, etc.

Also, I want to be clear that I’m not talking about API comments in code, which I really like. For instance, we do that in LingPipe using Javadoc. I think these comments are really really important, but I think of them somehow more as specifications than as comments.

Write Readable Code Instead

What you should be doing is trying to write code that’s more readable.

I don’t actually mean in Knuth’s literate programming sense; Knuth wants programs to look more like natural language, which is a Very Bad Idea. For one, Knuth has a terrible track record, bringing us TeX, which is a great typesetting language, but impossible to read, and a three-volume set of great algorithms written in some of the most impenetrable, quirky pseudocode you’re ever likely to see.

Instead, I think we need to become literate in the idioms and standard libraries of the programming language we’re using and then write literately in that language.

The biggest shock for me in moving from academia to industry is just how much of other people’s code I have to read. It’s a skill, and I got much better at it with practice. Just like reading English.

Unfortunately, effective programming is as hard, if not harder, than effective essay writing. Writing understandable code that’s also efficient enough and general enough takes lots of revision, and ideally feedback from a good code reviewer.

Not everyone agrees about what’s most readable. Do I call out my member variables from local variables with a prefix? I do, but lots of perfectly reasonable programmers disagree, including the coders for Java itself. Do I use really verbose names for variables in loops? I don’t, but some programmers do. Do I use four-word-long class names, or abbreviations? The former, but it’s another judgment call.

However you code, try to do it consistently. Also, remember that coding standards and macro libraries are not good places to innovate. It takes a while to develop a consistent coding style, but it’s worth the investment.

Bayesian Counterpart to Fisher Exact Test on Contingency Tables

October 13, 2009

I want to expand a bit on Andrew’s post, in which he outlines a simple Bayesian analysis of 2×2 contingency tables to replace Fisher’s exact test (or a chi-square test) for contingency tables.

Contingency Table

I’ll use the data in the example in the Wikipedia article on contingency tables:

         Left-Handed   Right-Handed   TOTAL
Male       9 (y1)          43         52 (n1)
Female     4 (y2)          44         48 (n2)
TOTAL     13               87         100


The statistical hypothesis we wish to investigate is whether the proportion of left-handed men is greater than, equal to, or less than the proportion of left-handed women.

Beta-Binomial Model

The observed data has y_1 = 9 left-handed men of a total of n_1 = 52 men and y_2 = 4 left-handed women of a total of n_2 = 48 women.

Letting \theta_1 \in [0,1] be the proportion of left-handed men and \theta_2 \in [0,1] the proportion of left-handed women, Andrew suggested modeling the number of left-handers as a binomial,

y_1 \sim \mbox{\sf Binom}(n_1,\theta_1), and

y_2 \sim \mbox{\sf Binom}(n_2,\theta_2).

He then suggested simple uniform priors on the proportions, which we know are equivalent to \mbox{\sf Beta}(1,1) priors:

\theta_1 \sim \mbox{\sf Unif}(0,1) = \mbox{\sf Beta}(1,1), and

\theta_2 \sim \mbox{\sf Unif}(0,1) = \mbox{\sf Beta}(1,1).

Given the conjugacy of the beta for the binomial, we can analytically compute the posteriors:

p(\theta_1|y_1,n_1) = \mbox{\sf Beta}(\theta_1|y_1+1, n_1-y_1+1), and

p(\theta_2|y_2,n_2) = \mbox{\sf Beta}(\theta_2|y_2+1, n_2-y_2+1)

We could try to compute the posterior density of \delta = \theta_1 - \theta_2 analytically by solving the integral

p(\delta|y,n) = \int_{-\infty}^{\infty} \mbox{\sf Beta}(\theta|y_1+1,n_1-y_1+1) \mbox{\sf Beta}(\theta-\delta|y_2+1,n_2-y_2+1) d\theta

and then evaluating \mbox{Pr}[\delta > 0].

Instead, we’ll just follow Andrew’s sage advice and estimate the integral by simulation. That is, we’ll take M samples from the joint posterior

p(\theta_1,\theta_2|y_1,n_1,y_2,n_2) = p(\theta_1|y_1,n_1) \times p(\theta_2|y_2,n_2),

which we write as

(\theta_1^{(1)},\theta_2^{(1)}), (\theta_1^{(2)},\theta_2^{(2)}), (\theta_1^{(3)},\theta_2^{(3)}), \ldots, (\theta_1^{(M)},\theta_2^{(M)})

and then use the Monte Carlo approximation,

\mbox{Pr}[\theta_1 > \theta_2]

\approx \frac{1}{M} \ \sum_{m=1}^{M} \mathbb{I}(\theta_1^{(m)} > \theta_2^{(m)}),

where \mathbb{I}() is the indicator function, taking on value 1 if its property is true and 0 otherwise.

In words, we estimate the probability that \theta_1 is greater than \theta_2 as the fraction of samples m \in 1:M in which \theta_1^{(m)} > \theta_2^{(m)}.

R Code

I love R. Here’s the R code to compute estimated posteriors by simulation and visualize them.

n1 = 52 # men
y1 = 9  # left-handed men
n2 = 48 # women
y2 = 4  # left-handed women

I = 10000 # simulations
theta1 = rbeta(I, y1+1, (n1-y1)+1)  
theta2 = rbeta(I, y2+1, (n2-y2)+1)
diff = theta1-theta2  # simulated diffs

quantiles = quantile(diff,c(0.005,0.025,0.5,0.975,0.995))
print(quantiles, digits=2)

print("Probability higher % men than women lefties:")
print(mean(theta1 > theta2))

print("Mean difference theta1-theta2")
print(mean(diff))

plot(density(diff),
     xlab="theta1 - theta2",
     ylab="p(theta1 - theta2 | y, n)",
     main="Posterior Simulation of Male - Female Lefties")
abline(v=quantiles[2], col="blue")
abline(v=quantiles[4], col="blue")

Running this code prints out:

> source("twoByTwoPosterior.R")
  0.5%   2.5%    50%  97.5%  99.5% 
-0.091 -0.046  0.084  0.218  0.262 

[1] "Probability higher % men than women lefties:"
[1] 0.901

[1] "Mean difference theta1-theta2"
[1] 0.08484979

The first numbers displayed are the posterior quantiles for the difference. The posterior median (50% quantile) difference is 0.084. The other numbers are quantiles which determine 95% and 99% posterior intervals, which span (0.025,0.975) and (0.005,0.995) respectively. The posterior central 95% quantile is thus (-0.046,0.218). Quantiles get the obvious interpretation in Bayesian stats — given the model and data, we assign a 95% probability to the true difference lying in the interval (-.046,0.218).

The second number printed is the posterior probability that men have a higher rate of left-handedness than women, which is 0.901, or about 90%. R makes this simple by letting us write mean(theta1 > theta2). This is also the value you’d get from integrating the posterior density from 0 to \infty.

The final number is the mean difference between theta1 and theta2, which is the Bayesian (L2) point estimate of the difference. This value is 0.0848, a bit higher than the median difference of 0.084.

The code also generates the following graph of the posterior:

The vertical lines are drawn at the boundaries of the central 95% region of the posterior difference p(\theta_1 - \theta_2 | n, y).


Don’t ask. You’re a Bayesian now.

Instead, say that according to your model, based on the observed data, there’s a 90% chance there’s a higher prevalence of left-handedness among men than women, or that there’s a 95% chance the difference between male and female prevalence of left-handedness falls in the range (-.05,0.22). [Remember not to use too many digits. The outputs from R were so you could see the small differences.]

Batting Averages: Bayesian vs. MLE Estimates

September 11, 2009

It’s American baseball season, and I’ve been itching to do more baseball stats. So in the interest of explaining Bayesian posterior inference with an example…

Batting Averages

A player’s batting average is his number of “hits” divided by his number of “at bats“.

Binomial Model of Batting

For simplicity, we’ll model a sequence of N at bats using a binomial distribution with a \theta probability of getting a hit in each at bat. This assumes that the sequence of at bats is independent and identically distributed, aka iid.

Of course this is wrong. All models are wrong. It’s just that this one’s more wrong than usual, despite the popularity of the batting average as a summary statistic. (See the second-to-last section for a reference to a better model.)

Maximum Likelihood Estimate (MLE)

Given a sample (aka training data) consisting of a player with m hits in M at bats (where m > 0,  (M-m) > 0), the maximum-likelihood estimate (MLE) of the ability \theta is m/M, aka the player’s batting average.

Uniform Prior

We assume a uniform (aka noninformative) prior distribution over abilities.

This is also wrong. We know major league position players don’t bat under 0.200 or they’d be fired, and no one’s had a batting average of over 0.400 since Ted Williams.

But we’ll have to come back to issues with the prior later. This post is just to clarify what integration over the posterior does for inference, not what an informative prior does.

Bayesian Posterior Estimate

The beta prior is conjugate to the binomial. Because the uniform distribution may be expressed as a beta distribution, \mbox{\sf Beta}(1,1), the posterior is a beta distribution. Because the beta and binomial play nicely together, the posterior ability distribution p(\theta|m,M) may be expressed analytically as \mbox{\sf Beta}(1+m, 1+M-m). The maximum of \mbox{\sf Beta}(1+m,1+M-m) is m/M, so the maximum a posteriori (MAP) estimate of \theta is also m/M, the same as the maximum likelihood estimate.

Posterior Predictive Inference

Now let’s look at prediction. For concreteness, suppose we’ve observed M=10 at bats, of which m=3 were hits, for a batting average of 3/10 = 0.300. What we want to know is the likelihood of n hits in the next N=5 at bats.

There are six possible outcomes, 0, 1, 2, 3, 4, or 5 hits. Using the MLE or MAP point estimate, this distribution is \mbox{\sf Binomial}(n|5,0.300).

The Bayesian estimate requires integrating over the posterior, effectively averaging the binomial prediction for each ability by the ability’s probability. The posterior in this case is \mbox{\sf Beta}(3+1,10-3+1)=\mbox{\sf Beta}(4,8), so the integral results in the well known beta-binomial distribution:

\mbox{\sf BetaBinomial}(n|5, 4,8) = \int_0^1 \mbox{\sf Binomial}(n|5,\theta) \mbox{\sf Beta}(\theta|4,8) d\theta

I solved the integral numerically using Monte Carlo integration, which is a fancy way of saying I took a bunch of samples \theta_n from the beta posterior \mbox{\sf Beta}(4,8) and plugged them into \mbox{\sf Binomial}(5,\theta_n) and averaged the resulting predictions. It’s a few lines of R (see below). The beta-binomial may be expressed analytically, but that’s even more typing than the Monte Carlo solution. I’m sure there’s an R package that’d do it for me.
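For the record, the analytic form really is just a few lines with log-gamma functions. Here’s a Python sketch (not the R I used) evaluating \mbox{\sf BetaBinomial}(n|5,4,8) directly; log_beta and beta_binomial_pmf are names I made up:

```python
import math

def log_beta(a, b):
    # log B(a,b) = log Gamma(a) + log Gamma(b) - log Gamma(a+b)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(n, N, alpha, beta):
    # BetaBinomial(n | N, alpha, beta) = C(N,n) B(n+alpha, N-n+beta) / B(alpha, beta)
    return math.comb(N, n) * math.exp(log_beta(n + alpha, N - n + beta)
                                      - log_beta(alpha, beta))

pmf = [beta_binomial_pmf(n, 5, 4, 8) for n in range(6)]
print([round(p, 3) for p in pmf])
# → [0.181, 0.302, 0.275, 0.165, 0.064, 0.013]
```

These analytic values agree with the Bayesian column in the table below up to Monte Carlo error.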

Comparing the Results

The important issue is to look at the outputs. Given that we’ve seen a batter get 3 hits in 10 at bats, here’s our prediction using a binomial model with uniform prior, shown with both the MLE/MAP point estimate and the average of the Bayesian predictions:

Point vs. Bayesian Estimates
Hits   MLE/MAP   Bayesian
 0      0.168     0.180
 1      0.360     0.302
 2      0.309     0.275
 3      0.132     0.165
 4      0.028     0.064
 5      0.002     0.012

The Bayesian estimate here is more dispersed (flatter, higher variance, higher entropy) than the MLE estimate. Of course, a true Bayesian estimate would provide a joint distribution over outcomes 0,1,2,3,4,5 (which reduce to a simplex, because any 5 values determine the 6th) for downstream reasoning. The nice thing about passing uncertainty along by integration is that it’s very plug-and-play.

Advanced Reading

Luckily for us baseball stats geeks, there’s some advanced reading hot off the press from Wharton Business School:

R Source

Here’s how I generated the table.

# observed
hits = 3
atBats = 10

# predicted
nextAtBats = 5

# y: MLE predictive distro
average = hits/atBats
y = rep(0,nextAtBats+1);
for (i in 0:nextAtBats) y[i+1] = dbinom(i,nextAtBats,average)

# z: Bayesian predictive distro
numSamps = 20000
z = rep(0,nextAtBats+1)
for (k in 1:numSamps) {
    sampledAvg = rbeta(1,1+hits,1+atBats-hits)
    for (i in 0:nextAtBats)
        z[i+1] = z[i+1] + dbinom(i,nextAtBats,sampledAvg)
}
for (i in 0:nextAtBats)
    z[i+1] = z[i+1] / numSamps

print("Point"); print(y, digits=3)
print("Bayes"); print(z, digits=3)

What is “Bayesian” Statistical Inference?

September 9, 2009

Warning: Dangerous Curves This entry presupposes you already know some math stats, such as how to manipulate joint, marginal and conditional distributions using the chain rule (product rule) and marginalization (sum/integration rule). You should also be familiar with the difference between model parameters (e.g. a regression coefficient or Poisson rate parameter) and observable samples (e.g. reported annual income or the number of fumbles in a quarter of a given American football game).


I followed up this post with some concrete examples for binary and multinomial outcomes:

Bayesian Inference is Based on Probability Models

Bayesian models provide full joint probability distributions over both observable data and unobservable model parameters. Bayesian statistical inference is carried out using standard probability theory.

What’s a Prior?

The full Bayesian probability model includes the unobserved parameters. The marginal distribution over parameters is known as the “prior” parameter distribution, as it may be computed without reference to observable data. The conditional distribution over parameters given observed data is known as the “posterior” parameter distribution.

Non-Bayesian Statistics

Non-Bayesian statisticians eschew probability models of unobservable model parameters. Without such models, non-Bayesians cannot perform probabilistic inferences available to Bayesians, such as defining the probability that a model parameter (such as the mean height of an adult male American) is in a given range (say 5’6″ to 6’0″).

Instead of modeling the posterior probabilities of parameters, non-Bayesians perform hypothesis testing and compute confidence intervals, the subtleties of interpretation of which have confused introductory statistics students for decades.

Bayesian Technical Apparatus

The sampling distribution, with probability function p(y|\theta), models the probability of observable data y given unobserved (and typically unobservable) model parameters \theta. (The sampling distribution probability function p(y|\theta) is called the likelihood function when viewed as a function of \theta for fixed y.)

The prior distribution, with probability function p(\theta), models the probability of the parameters \theta.

The full joint distribution over parameters and data has a probability function given by the chain rule,

p(y,\theta) = p(y|\theta) \times p(\theta).

The posterior distribution gives the probability of parameters \theta given the observed data y. The posterior probability function p(\theta|y) is derived from the sampling and prior distributions via Bayes’s rule,

p(\theta|y)
{} = \frac{\displaystyle p(y,\theta)}{\displaystyle p(y)}       by the definition of conditional probability,

{} = \frac{\displaystyle p(y|\theta) \times p(\theta)}{\displaystyle p(y)}       by the chain rule,

{} = \frac{\displaystyle p(y|\theta) \times p(\theta)}{\int_{\Theta} \displaystyle p(y,\theta') \, d\theta'}       by the rule of total probability,

{} = \frac{\displaystyle p(y|\theta) \times p(\theta)}{\int_{\Theta} \displaystyle p(y|\theta') \times p(\theta') \, d\theta'}       by the chain rule, and

{} \propto p(y|\theta) \times p(\theta)       because y is constant.
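The chain of identities above can be traced numerically. A minimal sketch with a hypothetical two-hypothesis model: a coin is either fair (theta = 0.5) or biased (theta = 0.8), with a uniform prior over the two hypotheses, and we observe three heads in three flips.

```python
# Hypothetical discrete example of Bayes's rule. Two parameter values,
# uniform prior, data y = three heads in three flips.

prior = {0.5: 0.5, 0.8: 0.5}                              # p(theta)
likelihood = {th: th ** 3 for th in prior}                # p(y|theta)
joint = {th: likelihood[th] * prior[th] for th in prior}  # chain rule
evidence = sum(joint.values())           # rule of total probability, p(y)
posterior = {th: joint[th] / evidence for th in joint}    # Bayes's rule

print(posterior)  # mass shifts toward theta = 0.8
```

Note that the evidence p(y) here is a sum rather than an integral, since the parameter space is discrete; the normalization step is what the proportionality in the last line of the derivation elides.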

The posterior predictive distribution p(\tilde{y}|y) for new data \tilde{y} given observed data y averages the sampling distribution over the parameters, weighting each parameter value by its posterior probability,

p(\tilde{y}|y) = \int_{\Theta} \, p(\tilde{y}|\theta) \times p(\theta|y) \, d\theta.

The key feature is the incorporation into predictive inference of the uncertainty in the posterior parameter estimate. In particular, the posterior predictive distribution is an overdispersed variant of the sampling distribution, with the extra dispersion arising from integrating over the posterior.
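When the predictive integral has no convenient closed form, it can be approximated by simple Monte Carlo, averaging the sampling distribution over posterior draws. A minimal sketch, assuming a beta(3, 5) posterior over a coin’s heads probability and a single new flip:

```python
import random

random.seed(0)

# Draw theta from an assumed beta(3, 5) posterior, then average the
# sampling distribution p(ytilde = 1 | theta) = theta over the draws.
M = 100_000
draws = [random.betavariate(3, 5) for _ in range(M)]
p_heads = sum(draws) / M   # Monte Carlo estimate of p(ytilde = 1 | y)

# Exact answer for a beta(3, 5) posterior: E[theta | y] = 3 / (3 + 5).
print(p_heads)
```

The estimate converges to the exact predictive probability 0.375 as the number of draws grows; the same averaging scheme applies to any posterior from which we can sample.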

Conjugate Priors

Conjugate priors, where the prior and posterior are drawn from the same family of distributions, are convenient but not necessary. For instance, if the sampling distribution is binomial, a beta-distributed prior leads to a beta-distributed posterior. With a beta posterior and binomial sampling distribution, the posterior predictive distribution is beta-binomial, the overdispersed form of the binomial. If the sampling distribution is Poisson, a gamma-distributed prior leads to a gamma-distributed posterior; the posterior predictive distribution is negative binomial, the overdispersed form of the Poisson.
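The beta-binomial case makes conjugacy concrete: the posterior update is just addition of counts, and the posterior predictive has a closed form. A sketch with assumed prior parameters and counts:

```python
from math import comb, gamma

def beta_fn(x, y):
    # Beta function via gammas; fine for the small values used here.
    return gamma(x) * gamma(y) / gamma(x + y)

def beta_binomial_pmf(k, n, a, b):
    # Posterior predictive p(k successes in n new trials | beta(a, b)).
    return comb(n, k) * beta_fn(k + a, n - k + b) / beta_fn(a, b)

a, b = 2.0, 2.0        # assumed beta prior parameters
n, k = 10, 7           # assumed data: 7 successes in 10 trials

# Conjugate update: posterior is beta(a + k, b + n - k).
a_post, b_post = a + k, b + (n - k)

# Posterior predictive over k' = 0..5 successes in 5 new trials.
pred = [beta_binomial_pmf(kp, 5, a_post, b_post) for kp in range(6)]
print(a_post, b_post)   # 9.0 5.0
print(sum(pred))        # ~1.0: a proper distribution over k'
```

The update step is the whole appeal of conjugacy: no integration is needed, only bookkeeping on the sufficient statistics.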

Point Estimate Approximations

An approximate alternative to full Bayesian inference uses p(\tilde{y}|\hat{\theta}) for prediction, where \hat{\theta} is a point estimate.

The maximum \theta^{*} of the posterior distribution defines the maximum a posteriori (MAP) estimate,

\theta^{*} = \arg\max_{\theta} p(y|\theta) p(\theta).

If the prior p(\theta) is uniform, the MAP estimate coincides with the maximum likelihood estimate (MLE), which maximizes the likelihood function p(y|\theta). Because the MLE does not require a proper prior, the corresponding posterior may be improper (i.e., may not integrate to 1).
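The relation between MAP and MLE is easy to see in the binomial model with a beta prior, where both estimates have closed forms. A sketch with assumed counts:

```python
def map_estimate(k, n, a, b):
    # Mode of the beta(a + k, b + n - k) posterior (requires a, b >= 1).
    return (a + k - 1) / (a + b + n - 2)

def mle(k, n):
    # Maximizer of the binomial likelihood.
    return k / n

k, n = 7, 10                      # assumed counts
print(map_estimate(k, n, 2, 2))   # beta(2, 2) prior pulls toward 0.5
print(mle(k, n))                  # 0.7
print(map_estimate(k, n, 1, 1))   # uniform beta(1, 1) prior: MAP = MLE
```

With the uniform beta(1, 1) prior the MAP formula reduces to k / n exactly, matching the claim above.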

Another point estimate under the Bayesian model is the posterior mean,

\hat{\theta} = \int_{\Theta} \theta \times p(\theta|y) \, d\theta.

This quantity is often used as a Bayesian point estimator because it minimizes expected squared loss between an estimate and the actual value. The posterior median may also be used as an estimate — it minimizes expected absolute error of the estimate.
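These loss-minimization properties can be checked numerically on a hypothetical discrete posterior by grid search over candidate estimates:

```python
posterior = {0.2: 0.6, 0.6: 0.25, 0.9: 0.15}   # assumed p(theta | y)

def expected_loss(est, loss):
    # Expectation of loss(est, theta) under the discrete posterior.
    return sum(p * loss(est, th) for th, p in posterior.items())

grid = [i / 1000 for i in range(1001)]
best_sq = min(grid, key=lambda e: expected_loss(e, lambda e_, t: (e_ - t) ** 2))
best_abs = min(grid, key=lambda e: expected_loss(e, lambda e_, t: abs(e_ - t)))

post_mean = sum(th * p for th, p in posterior.items())
print(best_sq)    # matches the posterior mean, 0.405
print(best_abs)   # 0.2, the posterior median
```

The grid minimizer under squared loss lands on the posterior mean, and the minimizer under absolute loss lands on the posterior median, as claimed.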

Point estimates may be reasonably accurate if the posterior has low variance. If the posterior is diffuse, prediction with point estimates tends to be underdispersed, in the sense of underestimating the variance of the predictive distribution. This is a kind of overfitting which, unlike the usual situation of overfitting due to model complexity, arises from the oversimplification of the variance component of the predictive model.
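The underdispersion of plug-in prediction is easy to quantify in the beta-binomial case. A sketch assuming a beta(3, 5) posterior and ten new trials, comparing the plug-in binomial variance with the full beta-binomial predictive variance:

```python
a, b = 3.0, 5.0       # assumed beta posterior parameters
n_new = 10            # number of new trials to predict

theta_hat = a / (a + b)   # posterior mean point estimate, 0.375

# Plug-in prediction: variance of binomial(n_new, theta_hat).
var_plugin = n_new * theta_hat * (1 - theta_hat)

# Full posterior predictive: beta-binomial variance (closed form).
var_full = n_new * a * b * (a + b + n_new) / ((a + b) ** 2 * (a + b + 1))

print(var_plugin)   # 2.34375
print(var_full)     # 4.6875: twice as dispersed in this example
```

The plug-in predictive understates the variance because it treats the point estimate as if it were known exactly, which is precisely the overfitting of the variance component described above.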