Genetic activation is hierarchically structured at many different levels. Tight clusters of associated genes are co-activated to construct molecular machines or participate in genetic pathways. Biologists spend considerable time trying to work out these pathways down to the rate equation level. More abstract interaction graphs are available from the KEGG Pathway Database, the web page for which describes it as “Wiring diagrams of molecular interactions, reactions, and relations.”

I’ve blogged previously about inferring gene or mRNA transcription levels. Now suppose you have several tissue samples from the same organism (e.g. from different cells or at different developmental stages or just replicates of the same experiment). These might be mapped to genes, to splice variants, or to individual exons. What we’d like to be able to do is start making inferences about genetic pathways and their activation.

### The Simplest Model: LDA

I wrote the simplest possible model down, stepped back, and realized it was just latent Dirichlet allocation (LDA), where “documents” are samples and “words” are mapped reads.

I really like BUGS notation for models because it combines the ease of a graphical model presentation with the precision of a programming language.

DataS: number of samples T: number of pathways K: number of genes (or variants or exons) R[s]: number of reads in sample s y[s,r]: mapping of read r in sample s to gene in 1:K alpha: T-dimensional prior on pathway expression in sample beta: K-dimensional prior on gene expression in pathwayParametersphi[t]: gene expression in pathway t theta[s]: pathway expression in sample s t[s,r]: pathway for read r in sample sModelfor (t in 1:T) { phi[t] ~ Dirichlet(beta) } for (s in 1:S) { theta[s] ~ Dirichlet(alpha) for (r in 1:R[s]) { t[s,r] ~ Discrete(theta[s]) y[s,r] ~ Discrete(phi[t[s,r]]) } }

Note that the expression parameters theta and phi are scaled as ratios over reads. These values may be normalized by gene (or variant or exon) length (taking into account the vagaries of the mapping program) to molar units (by analogy with the number of reads being like weight).

These models are really easy to fit and really easy to scale. As output, we get an estimate of the joint posterior distribution p(theta,phi|y,alpha,beta) over pathway expression and pathway composition (we just marginalize out the nuisance parameter t in the usual way by summation).

LingPipe’s implementation is explained with sample code in:

- LingPipe: Clustering Tutorial

The inputs are the hyperparameters alpha (the dimension of which determine the number of pathways T) and beta (determining the number of genes K), and the (ragged) input matrix y (determining the number of samples S and reads per sample R[s]). The output is a sequence of Gibbs samples. Each sample contains a value for all of the parameters: the source t of reads, the gene in pathway expressions phi and the pathway in sample values theta. We can just discard t to get a simulation of the posterior p(theta,phi|y,alpha,beta).

### Reasoning with Samples

As I discussed in our tutorial (and Griffiths and Steyvers discuss in the paper cited), the cluster assignments themselves in LDA are not identifiable. What we do get is a kind of random effects mixture model of correlation among genes.

On the plus side, we can easily carry out multiple comparisons or other plug-in inferences from any derived sample statistics (such as pairwise or groupwise gene co-activation).

Given that we’re usually using clustering like LDA as part of exploratory data analysis, we can just look at samples and topics and see what pops up. And if we have ground truth, as supplied at least partially by KEGG, we can look at whether or not known pathways emerge as “topics”.

### Joint Inference with Mapping

Perhaps the most pleasant feature of Bayesian inference is the ease with which it lets you combine models to perform joint inference at multiple levels or over multiple domains. Because all variable quantities are treated as first-class random variables, and reasoning is carried out by integrating uncertainty, models may be plugged together hierarchically whenever the components fit. Here, the y variables inferred by the RNA-Seq expression model would be the y variables assumed as data here. Very convenient

Inference is the same, at least in theory — there are just more variables to sample. The big difference is that now y has multiple constraints, so its conditional sampling distribution has to be worked out and simulated. BUGS does this kind of thing automatically. I’m still struggling with Casella and Robert’s *Monte Carlo Statistical Methods*.

### Hierarchical Model of Priors

Just as I did for the hierarchical Bayesian model of batting ability and the hierarchical model of categorical data coding, with enough samples S that are in some sense exchangeable, it’s possible to estimate the priors alpha or beta in models like these.

All the usual benefits accrue, like smoothed multiple comparisons.

We just put a diffuse “hyperprior” on alpha and/or beta and include sampling for those variables. (Again, a bit easier said than done for those still working out all these simulations.) But again, it’s easy in BUGS.

### Better Priors than Dirichlet

The Dirichlet prior only has a number of parameters equal to the number of dimensions and thus cannot fully characterize covariance of expected expression. A more reasonable model might be a multivariate normal on the logit scale. This is what Andrew Gelman suggested when I asked a while ago, and it’s only just sinking in.

Another problem with the Dirichlet (and multivariate normal) priors are that they fix the number of clusters ahead of time. It’d be possible to plug in something like a Dirichlet process prior here. I’m not sure if the model it entails is reasonable. I’ll have a better idea after fitting some data.

### Not Just Theory

Well, OK, this post is just theory. But Mitzi’s already run the expression models I coded based on the previous post, and next up, I’ll try to convince her to run LDA on the data. When we have something ready to release to he public (warning: biologists tend to keep their data close), I’ll announce it on the blog.

### Prior Art?

I’m guessing this (or at least the non-Bayesian version of it) has been done, but I don’t know how to look it up (there’s also confusion with “linear discriminant analysis” and “low density array”). And there’s lots of relevant stuff related to other factor models out there, which were popular for analyzing micro-array data. Feel free to post relevant references in the comments. The only things I could find (albeit after only half an hour or so of searching) are:

- Barnett, John and Tommi Jaakkola. 2007. Topic Models for Gene Expression Analysis. Web page.
- Gerber Georg et al. 2007. Automated Discovery of Functional Generality of Human Gene Expression Programs.
*PLoS Comp. Bio.* - Flaherty, Patrick et al. 2005. A Latent Variable Model for Chemogenomic Profiling.
*Bioinformatics*.

Also, the Stephens Lab at U.Chicago (which brought you the awesome Marioni et al. paper on RNA-Seq expression (using Poisson GLMs) mention LDA, but in the context of population genetics.

February 26, 2010 at 4:10 am |

D. Blei, T. Griffiths, and M. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM 2010

Y. W. Teh, M. I. Jordan, M. J. Beal and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101, 1566-1581, 2006.

February 26, 2010 at 9:23 am |

Simon Rogers, Mark Girolami, Colin Campbell and Rainer Breitling. The latent process decomposition of cDNA microarray data sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics. Vol. 2, No. 2. pp. 143-156. 2005.

Yiming Ying, Peng Li and Colin Campbell. A marginalized variational Bayesian approach to the analysis of array data. BMC Proceedings 2008, 2(Suppl 4):S7.

… This paper proposes a collapsed variational Bayesian inference for latent process decomposition.

February 28, 2010 at 8:36 pm |

Hi Bob

Your gripe with the Dirichlet was exactly our motivation for the “correlated topic model.” It uses a multivariate gaussian on the logit scale, which is called the logistic normal. (See the excellent work of Aitchison.) Inference is complicated because we no longer enjoy conjugacy between the topic proportions and topic indicators. But the model is more expressive.

If interested, see the paper here:

There is C code here

http://www.cs.princeton.edu/~blei/ctm-c/index.html

(Forgive the self promotion…)

Best,

Dave

March 1, 2010 at 4:16 pm |

My bad for not citing that paper to begin with. I even saw you give the talk on it at NYU! I think my problem was that I didn’t understand multivariate normal priors and their relation to multinomials until after reading Gelman and Hill’s book and mulling over Gelman’s comment about why he didn’t like Dirichlet priors.

What’d be nice is to combine this idea with predictors for the pools in a general multilevel model.

P.S. The WordPress spam detector doesn’t like links, which is why I had to manually approve this one.