LDA Clustering for Gene Pathway Inference


Genetic activation is hierarchically structured at many different levels. Tight clusters of associated genes are co-activated to construct molecular machines or participate in genetic pathways. Biologists spend considerable time trying to work out these pathways down to the rate equation level. More abstract interaction graphs are available from the KEGG Pathway Database, the web page for which describes it as “Wiring diagrams of molecular interactions, reactions, and relations.”

I’ve blogged previously about inferring gene or mRNA transcription levels. Now suppose you have several tissue samples from the same organism (e.g. from different cells or at different developmental stages or just replicates of the same experiment). These might be mapped to genes, to splice variants, or to individual exons. What we’d like to be able to do is start making inferences about genetic pathways and their activation.

The Simplest Model: LDA

I wrote the simplest possible model down, stepped back, and realized it was just latent Dirichlet allocation (LDA), where “documents” are samples and “words” are mapped reads.

I really like BUGS notation for models because it combines the ease of a graphical model presentation with the precision of a programming language.

S: number of samples
T: number of pathways
K: number of genes (or variants or exons)
R[s]: number of reads in sample s
y[s,r]: mapping of read r in sample s to gene in 1:K
alpha: T-dimensional prior on pathway expression in sample
beta: K-dimensional prior on gene expression in pathway

phi[t]: gene expression in pathway t
theta[s]: pathway expression in sample s
t[s,r]: pathway for read r in sample s

for (t in 1:T) {
    phi[t] ~ Dirichlet(beta)
for (s in 1:S) {
    theta[s] ~ Dirichlet(alpha)
    for (r in 1:R[s]) {
        t[s,r] ~ Discrete(theta[s])
        y[s,r] ~ Discrete(phi[t[s,r]])

Note that the expression parameters theta and phi are scaled as ratios over reads. These values may be normalized by gene (or variant or exon) length (taking into account the vagaries of the mapping program) to molar units (by analogy with the number of reads being like weight).

These models are really easy to fit and really easy to scale. As output, we get an estimate of the joint posterior distribution p(theta,phi|y,alpha,beta) over pathway expression and pathway composition (we just marginalize out the nuisance parameter t in the usual way by summation).

LingPipe’s implementation is explained with sample code in:

The inputs are the hyperparameters alpha (the dimension of which determine the number of pathways T) and beta (determining the number of genes K), and the (ragged) input matrix y (determining the number of samples S and reads per sample R[s]). The output is a sequence of Gibbs samples. Each sample contains a value for all of the parameters: the source t of reads, the gene in pathway expressions phi and the pathway in sample values theta. We can just discard t to get a simulation of the posterior p(theta,phi|y,alpha,beta).

Reasoning with Samples

As I discussed in our tutorial (and Griffiths and Steyvers discuss in the paper cited), the cluster assignments themselves in LDA are not identifiable. What we do get is a kind of random effects mixture model of correlation among genes.

On the plus side, we can easily carry out multiple comparisons or other plug-in inferences from any derived sample statistics (such as pairwise or groupwise gene co-activation).

Given that we’re usually using clustering like LDA as part of exploratory data analysis, we can just look at samples and topics and see what pops up. And if we have ground truth, as supplied at least partially by KEGG, we can look at whether or not known pathways emerge as “topics”.

Joint Inference with Mapping

Perhaps the most pleasant feature of Bayesian inference is the ease with which it lets you combine models to perform joint inference at multiple levels or over multiple domains. Because all variable quantities are treated as first-class random variables, and reasoning is carried out by integrating uncertainty, models may be plugged together hierarchically whenever the components fit. Here, the y variables inferred by the RNA-Seq expression model would be the y variables assumed as data here. Very convenient

Inference is the same, at least in theory — there are just more variables to sample. The big difference is that now y has multiple constraints, so its conditional sampling distribution has to be worked out and simulated. BUGS does this kind of thing automatically. I’m still struggling with Casella and Robert’s Monte Carlo Statistical Methods.

Hierarchical Model of Priors

Just as I did for the hierarchical Bayesian model of batting ability and the hierarchical model of categorical data coding, with enough samples S that are in some sense exchangeable, it’s possible to estimate the priors alpha or beta in models like these.

All the usual benefits accrue, like smoothed multiple comparisons.

We just put a diffuse “hyperprior” on alpha and/or beta and include sampling for those variables. (Again, a bit easier said than done for those still working out all these simulations.) But again, it’s easy in BUGS.

Better Priors than Dirichlet

The Dirichlet prior only has a number of parameters equal to the number of dimensions and thus cannot fully characterize covariance of expected expression. A more reasonable model might be a multivariate normal on the logit scale. This is what Andrew Gelman suggested when I asked a while ago, and it’s only just sinking in.

Another problem with the Dirichlet (and multivariate normal) priors are that they fix the number of clusters ahead of time. It’d be possible to plug in something like a Dirichlet process prior here. I’m not sure if the model it entails is reasonable. I’ll have a better idea after fitting some data.

Not Just Theory

Well, OK, this post is just theory. But Mitzi’s already run the expression models I coded based on the previous post, and next up, I’ll try to convince her to run LDA on the data. When we have something ready to release to he public (warning: biologists tend to keep their data close), I’ll announce it on the blog.

Prior Art?

I’m guessing this (or at least the non-Bayesian version of it) has been done, but I don’t know how to look it up (there’s also confusion with “linear discriminant analysis” and “low density array”). And there’s lots of relevant stuff related to other factor models out there, which were popular for analyzing micro-array data. Feel free to post relevant references in the comments. The only things I could find (albeit after only half an hour or so of searching) are:

Also, the Stephens Lab at U.Chicago (which brought you the awesome Marioni et al. paper on RNA-Seq expression (using Poisson GLMs) mention LDA, but in the context of population genetics.

4 Responses to “LDA Clustering for Gene Pathway Inference”

  1. sth4nth Says:

    D. Blei, T. Griffiths, and M. Jordan. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM 2010

    Y. W. Teh, M. I. Jordan, M. J. Beal and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101, 1566-1581, 2006.

  2. tmasada Says:

    Simon Rogers, Mark Girolami, Colin Campbell and Rainer Breitling. The latent process decomposition of cDNA microarray data sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics. Vol. 2, No. 2. pp. 143-156. 2005.

    Yiming Ying, Peng Li and Colin Campbell. A marginalized variational Bayesian approach to the analysis of array data. BMC Proceedings 2008, 2(Suppl 4):S7.
    … This paper proposes a collapsed variational Bayesian inference for latent process decomposition.

  3. david blei Says:

    Hi Bob

    Your gripe with the Dirichlet was exactly our motivation for the “correlated topic model.” It uses a multivariate gaussian on the logit scale, which is called the logistic normal. (See the excellent work of Aitchison.) Inference is complicated because we no longer enjoy conjugacy between the topic proportions and topic indicators. But the model is more expressive.

    If interested, see the paper here:

    Click to access BleiLafferty2007.pdf

    There is C code here

    (Forgive the self promotion…)


    • lingpipe Says:

      My bad for not citing that paper to begin with. I even saw you give the talk on it at NYU! I think my problem was that I didn’t understand multivariate normal priors and their relation to multinomials until after reading Gelman and Hill’s book and mulling over Gelman’s comment about why he didn’t like Dirichlet priors.

      What’d be nice is to combine this idea with predictors for the pools in a general multilevel model.

      P.S. The WordPress spam detector doesn’t like links, which is why I had to manually approve this one.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: