Archive for the ‘Statistics’ Category

Your Mileage May Vary

August 25, 2010

Alex Smola just introduced a blog, Adventures in Data, that’s focusing on many of the same issues as this one. His first few posts have been on coding tweaks and theorems for gradient descent for linear classifiers, a problem Yahoo!’s clearly put a lot of effort into.

Lazy Updates for SGD Regularization, Again

In the wide world of independently discovering algorithmic tricks, Smola has a post on Lazy Updates for Generic Regularization in SGD. It presents the same technique as I summarized in my pedantically named white paper Lazy Sparse Stochastic Gradient Descent for Regularized Multinomial Logistic Regression. It turns out to be a commonly used, but not often discussed technique.

Blocking’s Faster

I wrote a long comment on that post, explaining that I originally implemented it that way in LingPipe, but eventually found it faster in real large-scale applications to block the prior updates.

Your Mileage May Vary

To finally get to the point of the post (blame my love of rambling New Yorker articles), your mileage may vary.

The reason blocking works for us is that we have feature sets like character n-grams where each text being classified produces on the order of 1/1000 of all features. So if you apply the prior update to every feature in one block once every 1000 examples, it's a win. That's not even counting the elimination of the memory and time overhead of keeping the record of sparseness, or the improved memory locality.
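To make the contrast concrete, here's a minimal Python sketch of the lazy-update side of the trick (my own illustrative code using a hinge-loss objective, not LingPipe's implementation): each weight catches up on its skipped regularization steps only when its feature next appears.

```python
def sgd_lazy_l2(examples, n_features, lr=0.1, lam=0.01):
    """SGD with hinge loss and L2 regularization, applying each weight's
    accumulated shrinkage lazily, only when its feature next appears.
    examples: list of (sparse feature dict, label in {-1, +1})."""
    w = [0.0] * n_features
    last = [0] * n_features  # step at which each weight was last regularized
    t = 0
    for x, y in examples:
        t += 1
        for j in x:
            # catch up on the skipped per-step shrinkage for this feature
            w[j] *= (1.0 - lr * lam) ** (t - last[j])
            last[j] = t
        margin = y * sum(w[j] * v for j, v in x.items())
        if margin < 1.0:  # hinge-loss subgradient step
            for j, v in x.items():
                w[j] += lr * y * v
    # final catch-up so every weight reflects the full regularization
    for j in range(n_features):
        w[j] *= (1.0 - lr * lam) ** (t - last[j])
    return w
```

The blocked variant simply runs the dense shrinkage loop over all weights once every k examples instead of tracking per-feature timestamps.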

Horses for Courses

The only possible conclusion is that it’s horses for courses. Each application needs its own implementation for optimal results.

As another example, I can estimate my annotation models using Gibbs sampling in BUGS in 20 or 30 minutes. I can estimate them in my custom Java program in a second or two. I only wrote the program because I needed to scale to a named-entity corpus and BUGS fell over. As yet another example, I’m finding coding up genomic data very similar to text data (edit distance, Bayesian models) and also very different (4-character languages, super-long strings).

Barry Richards, the department head of Cog Sci when I was a grad student, gave a career retrospective talk about all the work he did in optimizing planning systems. Turns out each application required a whole new kind of software. As a former logician, he found the lack of generality very dispiriting.

Not being able to resist one further idiom, this sounds like a probletunity to me.

Threading Blog Discussions

This post is a nice illustration of Fernando Pereira’s comment on Hal Daumé III’s blog about the problematic nature of threading discussions across blogs.

That Zeitgeist Again

I need to generalize my earlier post about the scientific zeitgeist to include coding tricks. The zeitgeist post didn’t even discuss my rediscoveries in the context of SGD!

Collapsed Gibbs Sampling for LDA and Bayesian Naive Bayes

July 13, 2010

[Update: 1 February 2014. David Bammam points out that there’s a mistake just before equation (62). The correct formula should be

\Gamma(x + q) = \Gamma(x) \times \prod_{i=1}^q (x + i - 1).

This has implications going forward, replacing i with i - 1, which I don’t have time to change right now.]
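The corrected identity is easy to spot-check numerically (a throwaway sketch, not part of the tech report):

```python
import math

def gamma_shift(x, q):
    """Check Gamma(x + q) = Gamma(x) * prod_{i=1..q} (x + i - 1)."""
    prod = 1.0
    for i in range(1, q + 1):
        prod *= x + i - 1
    return math.gamma(x) * prod

# e.g. gamma_shift(2.5, 3) should agree with math.gamma(5.5)
```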

[Update: 27 September 2010. I corrected several typos and brainos in the tech report after corrections in the comments (see below) from Arwen Twinkle. I also added some historical notes and references. Thanks for all the feedback.]

I’ve uploaded a short (though dense) tech report that works through the collapsing of Gibbs samplers for latent Dirichlet allocation (LDA) and the Bayesian formulation of naive Bayes (NB).

Thomas L. Griffiths and Mark Steyvers used the collapsed sampler for LDA in their (old enough to be open access) PNAS paper, Finding scientific topics. They show the final form, but don’t derive the integral or provide a citation.

I suppose these 25-step integrals are supposed to be child’s play. Maybe they are if you’re a physicist or theoretical statistician. But that was a whole lot of choppin’ with the algebra and the calculus for a simple country computational linguist like me.

On to Bayesian Naive Bayes

My whole motivation for doing the derivation was that someone told me that it wasn’t possible to integrate out the multinomials in naive Bayes (actually, they told me you’d be left with residual \Gamma functions). It seemed to me like it should be possible because the structure of the problem was so much like the LDA setup.

It turned out to be a little trickier than I expected and I had to generalize the LDA case a bit. But in the end, the result’s interesting. I didn’t wind up with what I expected. Instead, the derivation led me to see that the collapsed sampler uses Bayesian updating at a per-token level within a doc. Thus the second token will be more likely than the first because the topic multinomial parameter will have been updated to take account of the assignment of the first item.

This is so cool. I actually learned something from a derivation.
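The per-token updating view can be checked numerically for a single Dirichlet-multinomial (a sketch with my own function names): the collapsed Gamma-function marginal equals the product of per-token predictive probabilities, each conditioned on the counts of the tokens before it.

```python
import math

def log_marginal_gamma(counts, alpha):
    """Collapsed Dirichlet-multinomial log marginal likelihood:
    Gamma(V a)/Gamma(N + V a) * prod_w Gamma(n_w + a)/Gamma(a)."""
    V, N = len(counts), sum(counts)
    lp = math.lgamma(V * alpha) - math.lgamma(N + V * alpha)
    for n_w in counts:
        lp += math.lgamma(n_w + alpha) - math.lgamma(alpha)
    return lp

def log_marginal_sequential(tokens, V, alpha):
    """Same marginal, token by token: each token's predictive
    probability conditions on the counts of the tokens before it."""
    seen = [0] * V
    lp = 0.0
    for t, w in enumerate(tokens):
        lp += math.log((seen[w] + alpha) / (t + V * alpha))
        seen[w] += 1
    return lp
```

By exchangeability, the two agree exactly, which is just the per-token Bayesian updating written out as a product.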

In my prior post, Bayesian Naive Bayes, aka Dirichlet-Multinomial Classifiers, I provided a proper Bayesian definition of naive Bayes classifiers (though the model is also appropriate for soft clustering with Gibbs sampling replacing EM). Don’t get me started on the misappropriation of the term “Bayesian” by any system that applies Bayes’s rule, but do check out another of my previous posts, What is Bayesian Statistical Inference? if you’re curious.

Thanks to Wikipedia

I couldn’t have done the derivation for LDA (or NB) without the help of Wikipedia’s article on latent Dirichlet allocation.

It pointed me (implicitly) at a handful of invaluable tricks, such as

  • multiplying through by the appropriate Dirichlet normalizers to reduce an integral over a Dirichlet density kernel to a constant,
  • unfolding products based on position, unfolding a \Gamma() function for the position at hand, then refolding the rest back up so it could be dropped out, and
  • reparameterizing products for total counts based on sufficient statistics.

Does anyone know of another source that works through equations more gently? I went through the same exegesis for SGD estimation for multinomial logistic regression with priors a while back.

But Wikipedia’s Derivation is Wrong!

At least I’m pretty sure it is as of 5 PM EST, 13 July 2010.

Wikipedia’s calculation problem starts in the move from the fifth equation before the end to the fourth. At this stage, we’ve already eliminated all the integrals, but still have a mess of \Gamma functions left. The only hint at what’s going on is in the text above which says it drops terms that don’t depend on k (the currently considered topic assignment for the n-th word of the m-th document). Wikipedia’s calculation then proceeds to drop the term \prod_{i \neq k} \Gamma(n^{i,-(m,n)}_{m,(\cdot)} + \alpha_i) without any justification. It clearly depends on k.

The problems continue in the move from the third equation before the end to the penultimate equation, where a whole bunch of \Gamma function applications are dropped, such as \Gamma(n^{k,-(m,n)}_{m,(\cdot)} + \alpha_k), which even more clearly depend on k.

It took me a few days to see what was going on, and I figured out how to eliminate the variables properly. I also explain each and every step for those of you like me who don’t eat, sleep and breathe differential equations. I also use the more conventional stats numbering system where the loop variable m ranges from 1 to M so you don’t need to keep (as large) a symbol table in your head.

I haven’t changed the Wikipedia page for two reasons: (1) I’d like some confirmation that I haven’t gone off the rails myself anywhere, and (2) their notation is hard to follow and the Wikipedia interface not so clean, so I worry I’d introduce typos or spend all day entering it.

LaTeX Rocks

I also don’t think I could’ve done this derivation without LaTeX. The equations are just too daunting for pencil and paper. The non-interactive format of LaTeX did get me thinking that there might be some platform for sketching math out there that would split the difference between paper and LaTeX. Any suggestions?

Contextual Effects and Read Quality in a Probabilistic Aligner

June 14, 2010

In my last post on this topic, Sequence Alignment with Conditional Random Fields, I introduced a properly normalized Smith-Waterman-like global aligner (which required the entire read to be aligned to a subsequence of the reference sequence). The motivation was modeling the probability of a read being observed given a reference sequence from which it originated.

Context Plus Read Quality

Today I want to generalize the model in two ways. First, I’ll add contextual effects like affine gap scores and poly-base errors on Illumina.

Second, I’ll take into consideration per-base read quality, as produced by Illumina’s prb files. (With the simpler FASTQ format, where you get only the probability of the best base, we can either treat the other bases as uniformly likely or use knowledge of the sequencer’s confusion error profile.)

We’ll still take start position into account, for issues like modeling hexamer priming or fractionation effects in sample prep.

The dynamic programming algorithms work as before with finer-grained cells. Further, you can infer what the actual read was, though I believe a hierarchical model will be better for this, because the sequence polymorphism can be separated from the read errors generatively.

Modeling Alignments

What we’re interested in for our work on RNA-Seq based mRNA expression is modeling p(y|x), the probability of a read y given a reference sequence x. Rather than approaching this directly, we’ll take the usual route of modeling a (latent) alignment A at a (latent) start position i, p(y,A,i|x), and then marginalizing,

p(y|x) = \sum_{A,i} p(y,A,i|x).

The best alignment will be computable via a standard max-product (Viterbi) algorithm, and the marginalization through the standard sum-product algorithm. Both run in time proportional to the read length times the reference sequence length (of course, the latter will be trimmed down to potential matches in a region). Both can be sped up with upper bounds on distance.

The Model

We assume there is a score \mbox{\sc start}(i,x) \in \mathbb{R} for each start position i in reference sequence x. For instance, this may be a log probability of starting at a position given the lab fractionation and priming/amplification protocol.

We will also assume there is a quality score on a (natural) log probability scale \mbox{\sc logprb}(y,b,j) \in \mathbb{R} for the probability of base b at position j in read y. For instance, these may be the Phred quality scores reported in Illumina’s prb files converted to (natural) log probability of match.

Our definition only normalizes for reads of fixed length, which we’ll write \mbox{\sc readlen}.

We will recursively define a score on an additive scale and then exponentiate to normalize. In symbols,

p(y,A,i|x) \propto \exp\big(\mbox{\sc start}(i,x) + f(A,i,0,x,y,\mbox{\rm start})\big)

where we define the alignment score function f(A,i,j,x,y,a), for alignment sequence A, reference sequence x and reference position i, read y and read position j, and previous alignment operation a, recursively by

\begin{array}{rcl}
f([],i,\mbox{\sc readlen},x,y,a) & = & 0.0
\\[12pt]
f(\mbox{\rm sub}(x_i,b)+A,i,j,x,y,a) & = & \mbox{\sc subst}(x,y,i,j,a) + \mbox{\sc logprb}(y,b,j)\\
& + & f(A,i+1,j+1,x,y,\mbox{\rm sub}(x_i,b))
\\[12pt]
f(\mbox{\rm del}(x_i)+A,i,j,x,y,a) & = & \mbox{\sc del}(x,y,i,j,a) + f(A,i+1,j,x,y,\mbox{\rm del}(x_i))
\\[12pt]
f(\mbox{\rm ins}(b)+A,i,j,x,y,a) & = & \mbox{\sc ins}(x,y,i,j,a) + \mbox{\sc logprb}(y,b,j)\\
& + & f(A,i,j+1,x,y,\mbox{\rm ins}(b))
\end{array}

Reminds me of my Prolog programming days.

The Edit Operations

The notation a + A is meant to be the edit operation a followed by edit sequence A. The notation [] is for the empty edit sequence.

The notation \mbox{\rm sub}(x_i,b) is for the substitution of an observed base b from the read for base x_i of the reference sequence; if x_i \neq b it’s a particular mismatch, and if x_i = b, it’s a match score. The notation \mbox{\rm ins}(b) is for the insertion of observed base b and \mbox{\rm del}(x_i) for the deletion of base x_i in the reference sequence. Insertion and deletion are separated because gaps in the reference sequence and read may not be equally likely given the sequencer and sample being sequenced.

Note that the base case requires the entire read to be consumed by requiring j = \mbox{\sc readlen}.

Scoring Functions

Note that the score functions, \mbox{\sc subst}(x,y,i,j,a) \in \mathbb{R}, \mbox{\sc del}(x,y,i,j,a) \in \mathbb{R}, and \mbox{\sc ins}(x,y,i,j,a) \in \mathbb{R}, have access to the read y, the reference sequence x, the present position in the read j and reference sequence i, as well as the previous edit operation a.

Enough Context

This context is enough to provide affine gap penalties, because the second delete will have a context indicating the previous operation was a delete, so the score can be higher (remember, bigger is better here). It’s also enough for dealing with poly-deletion, because you have the identity of the previous base on which the operation took place.

Further note that by defining the start position score by a function \mbox{\sc start}(i,x), which includes a reference to not only the position i but the entire reference sequence x, it is possible to take not only position but also the bases in x at positions around i into account.

Dynamic Programming Same as Before

The context is also restricted enough so that the obvious dynamic programming algorithms for first-best and marginalization apply. The only tricky part is computing the normalization for p(y,A,i|x), which has only been defined up to a constant. As it turns out, the same dynamic programming algorithm works to compute the normalizing constant as for a single probabilistic read.

In practice, you have the usual dynamic programming table, only it keeps track of the last base matched in the read, the last base matched in the reference sequence, and whether the last operation was an insertion, deletion, or substitution. To compute the normalizer, you treat each base log prob from a pseudoread of the given read length \mbox{\sc readlen} as having score 0 and accumulate all the matches in the usual way with the sum-product algorithm. (You’ll need to work on the log scale and use the log-sum-of-exponentials operation at each stage to prevent underflow.)
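Here’s a minimal Python sketch of the sum-product recursion (my own simplification: scores depend only on the bases involved, with no previous-operation context and no start scores, and the read aligns from reference position 0):

```python
import math

def logsumexp(xs):
    m = max(xs)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in xs))

def log_marginal_alignment(x, y, subst, delete, insert):
    """Sum-product DP over alignments of read y against reference x,
    starting at reference position 0 and consuming the whole read.
    f[i][j] = log-sum-exp of scores of partial alignments that have
    consumed x[:i] of the reference and y[:j] of the read."""
    I, J = len(x), len(y)
    NEG = float("-inf")
    f = [[NEG] * (J + 1) for _ in range(I + 1)]
    f[0][0] = 0.0
    for i in range(I + 1):
        for j in range(J + 1):
            if f[i][j] == NEG:
                continue
            if i < I and j < J:  # substitution (or match)
                f[i + 1][j + 1] = logsumexp([f[i + 1][j + 1], f[i][j] + subst(x[i], y[j])])
            if i < I:            # deletion of a reference base
                f[i + 1][j] = logsumexp([f[i + 1][j], f[i][j] + delete(x[i])])
            if j < J:            # insertion of a read base
                f[i][j + 1] = logsumexp([f[i][j + 1], f[i][j] + insert(y[j])])
    # read fully consumed; marginalize over how much reference was used
    return logsumexp([f[i][J] for i in range(I + 1)])
```

Swapping logsumexp for max gives the first-best (Viterbi) variant; the full model would additionally thread the previous operation through the cell state.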

In general, these dynamic programming algorithms will work anywhere you have this kind of tail-recursive definition over a finite set of operations.

Inferring the Actual Read Sequence for SNP, Etc. Calling

Because of variation among individuals, the sequence of bases in a read won’t necessarily match the reference sequence. We can marginalize out the probability that the read sequence is z given that it was generated from reference sequence x with read y (remember the read’s non-deterministic here),

p(z|x,y) = \sum_{i,\,A \,:\, \mbox{yield}(A) = z} p(A,i|x,y),

where \mbox{yield}(A) is the read produced from alignment A, which itself is defined recursively by

\mbox{yield}([]) = \epsilon,

\mbox{\rm yield}(\mbox{\rm del}(x_i) + A) = \mbox{\rm yield}(A),

\mbox{\rm yield}(\mbox{\rm sub}(x_i,b) + A) = b \cdot \mbox{\rm yield}(A), and

\mbox{\rm yield}(\mbox{\rm ins}(b)+A) = b \cdot \mbox{\rm yield}(A),

where \epsilon is the empty string and \cdot string concatenation.
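In code, the yield function is just a filter over the edit operations (the tuple encoding of operations here is my own):

```python
def yield_read(alignment):
    """Read produced by an alignment: deletions contribute nothing;
    substitutions and insertions contribute their observed base.
    Operations encoded as (op, base) pairs, base None for deletions."""
    return "".join(base for op, base in alignment if op in ("sub", "ins"))
```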

I believe this is roughly what the GNUMAP mapper does (here’s the pay-per-view paper).

Reference Sequences with Uncertainty

We could extend the algorithm even further if we have a non-deterministic reference sequence. For instance, we can use data from the HapMap project to infer single nucleotide polymorphisms (SNPs). We just do the same thing for reference-side non-determinism as on the read side, assuming there’s some score associated with each choice and adding the choice into the context mix.

Li, Ruotti, Stewart, Thomson and Dewey (2010) RNA-Seq Gene Expression Estimation with Read Mapping Uncertainty

June 9, 2010

It was bound to happen. The model I was proposing for splice-variant (or isoform) expression was too obvious (and too good!) for it not to be independently discovered. Seeing my ideas discovered by others is on one hand a bummer, but on the other hand gives me confidence that I’m thinking about these models in the right way.

The following paper presents roughly the same model of expression I presented in a previous blog post, Inferring Splice Variant mRNA Expression with RNA-Seq.

It also incorporates some of the features I suggested for edit distance in another blog post, Sequence alignment with conditional random fields.

The Basic Expression Model

In fact, it’d almost be easier to say what’s different in Li et al.’s model. The expression model is almost identical, down to the name of the variable, \theta, used for read expression. The model in this paper assumes N reads, implicitly assumes a uniform prior for \theta \sim \mbox{\sf Dirichlet}({\bf 1}), then for each read 1 \leq n \leq N, chooses splice variants z_n \sim \mbox{\sf Discrete}(\theta). So far, identical (other than that I considered arbitrary Dirichlet priors).

[Update, June 10, 2010: I forgot to mention:

Noise Isoform

Li et al. include a so-called “noise isoform”, which they assign a simple probability distribution to. This will act as a sink for reads that do not map elsewhere. Given how it’s defined, I don’t quite see how anything would get mapped to it if we restricted attention to reads that mapped within a few edits with BOWTIE.]

Position and Strand-Based Reads

Things look different when Li et al. start to generate reads y_n. What I did was assume an arbitrary distribution \mbox{\sf Read}(y_n|z_n,\varphi), with parameters \varphi characterizing the likelihood of observing read y_n from reference source z_n. Li et al. decompose part of \mbox{\sf Read} in their model. First, they assume s_n is the start position of read y_n on reference sequence z_n and assume o_n is the strand, both of which are generated from the ref sequence z_n, by distributions p(o_n|z_n) and p(s_n|z_n).

This is where Li et al.’s proposal relates to my probabilistic aligner proposal. In particular, with the position, strand and reference sequence, s_n, o_n, z_n, the reads may be defined to be sensitive to location (say in relative position measured from 5′ to 3′), or to underlying sequence (such as the hexamer bias due to priming studied in [Hansen, Brenner and Dudoit 2010]). They only study the positional biases, and for their data, they were small. But groovy nonetheless. It’s building in the right kind of sensitivity to the biological sample prep and sequencing.

Non-Probabilistic Alignment

[Updated, June 10, 2010: I don’t know what I was thinking — Li et al. definitely use probabilistic alignment. In fact, they use a position-specific edit matrix w_t for substitutions (no indels). I’m not sure how the matrices are estimated.]

What they’re not using is the quality scores on the reads themselves (as is done in Clement et al.’s GNUMap aligner).

I think there’s an opportunity to squeeze more sensitivity and specificity out of the model by using the quality scores from the sequencer (if they can be calibrated properly, that is).

The aligner I propose also normalizes the edits to proper probabilities using a sum-product algorithm; it’d be easy to extend to read quality as in GNUMap, or to compose it with a color space finite-state transducer, as in SHRiMP for SOLiD reads.

EM MLE vs. Gibbs Sampling Bayesian Posterior

The other main difference is that I was proposing using Gibbs sampling to produce samples from the posterior p(\theta|y,\alpha,\varphi) of expression given reads y and channel model \varphi and Dirichlet prior \alpha. Li et al. use EM to find the maximum likelihood estimate, \theta^* = \arg\max p(y|\theta,\varphi). As usual, the MLE is just the MAP estimate with a uniform prior, so in my model, the MLE is \theta^* = \arg\max_\theta p(\theta|y,\alpha={\bf 1},\varphi).
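For intuition, here’s a toy EM sketch for the shared expression model, assuming the read-given-isoform likelihoods have been precomputed into a matrix (my own simplification; Li et al.’s model also generates positions and strands):

```python
def em_expression(like, n_iters=200):
    """EM for the expression mixture: like[n][k] = p(read n | isoform k),
    assumed precomputed by an aligner/channel model.  Returns the MLE of
    theta, which maximizes prod_n sum_k theta[k] * like[n][k]."""
    N, K = len(like), len(like[0])
    theta = [1.0 / K] * K
    for _ in range(n_iters):
        counts = [0.0] * K
        for row in like:
            # E-step: posterior responsibility of each isoform for this read
            joint = [theta[k] * row[k] for k in range(K)]
            z = sum(joint)
            for k in range(K):
                counts[k] += joint[k] / z
        # M-step: theta proportional to expected read counts
        theta = [c / N for c in counts]
    return theta
```

With a Dirichlet prior on \theta, a collapsed Gibbs sampler over the same responsibilities yields posterior draws of \theta rather than a point estimate.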

Bootstrap Standard Error Estimates

One of the main arguments I made for the Bayesian approach is that, like all truly Bayesian approaches, it supports posterior inference with uncertainty. This is very useful for doing things like pairwise or multiple comparisons of expression or adjusting for false discovery rates.

Li et al. use a bootstrap estimate of standard error, which is great to see. I wish more researchers provided variance estimates.

The danger with only reporting standard error (or variance) in these skewed binomials (or multinomials) is that the parameter value’s very close to the edge of the allowed values, so the 95% interval can contain illegal values. You see the same problem for normal approximations of variance for the Poisson, where a few standard deviations below the mean can result in negative counts.
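A two-line illustration of the boundary problem (hypothetical numbers):

```python
import math

def wald_interval(successes, n, z=1.96):
    """Normal-approximation (Wald) 95% interval for a binomial
    proportion; near the boundary it can extend outside [0, 1]."""
    p = successes / n
    se = math.sqrt(p * (1.0 - p) / n)
    return (p - z * se, p + z * se)

# wald_interval(2, 50) has a negative lower bound
```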

[Update: June 10, 2010 Of course, you can plot plug-in quantiles such as 95% intervals with the bootstrap, too. It’s really pretty much like Gibbs sampling in terms of the application, if not philosophy.]


They run a bunch of simulation experiments to show that this kind of model makes sense. I did the same thing on a (much much) smaller scale. They use Langmead et al.’s BOWTIE aligner with up to two mismatches, which seems a rather dodgy basis for this kind of expression model. It will depend on the settings, but the defaults provide a greedy heuristic search that’s not guaranteed to be complete in the sense of finding all or even the best alignment.

[Update: June 10, 2010: BOWTIE has a --all setting that, according to the documentation, generates all matching reads, but there’s also a maximum-number-of-backtracks parameter that can eliminate some matches if there are 2 or more edits allowed.

Even if BOWTIE can be configured to find all the matches up to edit distance 2, there’s no reason to assign them all the same probability in the model or to assume that a read is accurately mapped at edit distance 2 and not at edit distance 3 or greater.

My understanding is that due to its best-first heuristic search, BOWTIE does not guarantee it will find all reads even up to a given edit distance.]

What we really need is some real applications. Mitzi and I are reanalyzing the human liver and kidney Illumina RNA-Seq data described in (Marioni et al. 2008), and also some new data generated by Ken Birnbaum (and Paul Scheid) at NYU using a single-cell protocol on SOLiD on Arabidopsis roots over a 5-day period. Paul Scheid, the director of the SOLiD core facility at NYU, just presented the data at a SOLiD users’ group meeting he hosted at NYU last Friday. The short story is that Mitzi crunched the data to show that you can use a single-cell protocol on SOLiD with SHRiMP for alignment to derive expression results similar to those estimated from parallel microarray data using Li and Wong’s D-Chip factor model for microarray expression.

20 to 25 Bases!

The conclusion of Li et al. is that if each base costs the same to read independent of read length, then according to their simulations, the optimal read length for calculating variant expression is around 20 to 25 bases, not the 50 or longer that Illumina and SOLiD are producing these days.

Piecewise-Linear or Isotonic Regression for Calibrating Naive Bayes (or Other Predictors)

June 3, 2010

I was just reading a really nifty Bayesian approach to clickthrough rate estimation from Microsoft Research Cambridge (UK), which used as a baseline the following paper’s approach to calibrating naive Bayes classifiers:

I’ve blogged about Rennie et al.’s approach to tackling the poor assumptions of Naive Bayes for text classifiers; oddly, the Rennie et al. 2003 ICML paper I blogged about before doesn’t cite the 2001 Zadrozny and Elkan ICML paper on the same topic.

Failed Independence Assumptions in Naive Bayes

The well-known problem with naive Bayes (i.e. a discrete mixture of multinomials) derives from using features that are correlated in a model that assumes they’re not. This is, in fact, the naivete from which the name is derived, though it’s a misnomer, because the model’s right for some situations. For instance, it’s common to see naive Bayes applied to model bags of words in human language, which are highly correlated both locally (by syntax and collocation) and globally (by topic). It’s what you’ll get from LingPipe’s NaiveBayesClassifier and TradNaiveBayesClassifier classes.

In practice, what happens, especially for longer documents, is that the probabilities tend toward 0 or 1 because of redundant tokens with high correlations being treated as independent. This is especially easy to see for repeated words. Suppose we’re classifying sports documents and an article is 90% likely to be about basketball if it mentions “lebron” (actually it’s probably higher, due to the celebrity of LeBron James). So what if we see “lebron” twice? In the model, the probability of being about basketball goes up to 0.9^2 / (0.9^2 + 0.1^2) \approx 0.988. The problem is that two “lebron”s don’t tell you much more about the topic than one. An article about the filmmaker Eddie Lebron or the salsa band The Lebrón Brothers is also likely to mention “lebron” more than once. Once you’ve seen one, the effect of the next one should go down. (The fact that all three uses of “lebron” are spelled differently is an orthogonal issue I shouldn’t have confused here; the same holds for a fully capitalized name like “Smith”.)
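The arithmetic of the example, as a quick sketch (equal priors, with 0.9 as the per-occurrence likelihood in favor of basketball):

```python
def nb_posterior(p_word_topic, p_word_other, n_occurrences, prior=0.5):
    """Two-category naive Bayes posterior after seeing the same word n
    times, each occurrence treated as independent evidence."""
    a = prior * p_word_topic ** n_occurrences
    b = (1.0 - prior) * p_word_other ** n_occurrences
    return a / (a + b)

# nb_posterior(0.9, 0.1, 2) is 0.9^2 / (0.9^2 + 0.1^2), about 0.988
```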

Measuring Term Correlation

One thing to do is try to measure the correlation. Along these lines, I really liked Ken Church’s analysis of correlation in a paper titled One term or two? (but only available pay-per-view) and Martin Jansche’s careful follow up.

It’s traditional to hack calibrations by converting every document x to the same virtual document length \lambda, using p_{\lambda}(x) \propto p(x)^{\lambda/{\rm length}(x)}. This is what we do in our traditional naive Bayes class in LingPipe (it doesn’t change first-best results, but helps enormously for EM and presumably for n-best results ordering).
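A sketch of the virtual-document-length hack on the log scale (illustrative code, not LingPipe’s): the rescaling is monotonic per document, which is why first-best results are unchanged.

```python
import math

def length_normalize_log_probs(log_probs, doc_length, virtual_length):
    """Rescale a document's per-category log probabilities as if the
    document had fixed virtual length lambda:
    p_lambda(x) propto p(x)^(lambda / length(x))."""
    scale = virtual_length / doc_length
    scaled = [lp * scale for lp in log_probs]
    # renormalize on the log scale
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(v - m) for v in scaled))
    return [v - log_z for v in scaled]
```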

Binned Calibration

What Zadrozny and Elkan did for naive Bayes is to train a classifier and then calibrate its predictions. In particular, they sort the inputs by predicted probability for the most likely category. They then sort these into ten bins of equal size. For each bin, they calculate the proportion of true answers in the bin. New items are assigned probabilities based on which bin their estimate for the probability of the best category falls into.

They talk about using held-out data, but don’t. I’m not sure why — naive Bayes is pretty easy to overfit. Ideally they’d have just done cross-validation on the whole data set to get predictions, or use a representative held-out data set of appropriate size.
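A minimal sketch of the binning scheme as I read it (function names are mine):

```python
def binned_calibration(scores, labels, n_bins=10):
    """Sort items by predicted probability, split into equal-size bins,
    and map each bin to its empirical accuracy.  Returns a list of
    (bin upper bound, calibrated probability) pairs."""
    pairs = sorted(zip(scores, labels))
    size = len(pairs) // n_bins
    bins = []
    for b in range(n_bins):
        hi = (b + 1) * size if b < n_bins - 1 else len(pairs)
        chunk = pairs[b * size : hi]
        accuracy = sum(label for _, label in chunk) / len(chunk)
        bins.append((chunk[-1][0], accuracy))
    return bins

def calibrate(score, bins):
    """Calibrated probability for a new item: the empirical accuracy of
    the bin its raw score falls into."""
    for upper, accuracy in bins:
        if score <= upper:
            return accuracy
    return bins[-1][1]
```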

Isotonic Regression

The Microsoft paper referred to Zadrozny and Elkan’s approach as isotonic regression, but on my reading, Zadrozny and Elkan didn’t enforce isotonicity (i.e. upward monotonicity) on the bin probabilities. They probably didn’t need to with lots of items per bin and a reasonably ordered (if not calibrated) set of naive Bayes results. Structural constraints like isotonicity (or similarly intentioned smoothness priors for something like loess) are especially helpful if some of the bins don’t contain much data or you’re fitting a more complex model.

Binning versus Piecewise Linear Normalization

Binning may be useful for absolute probability prediction, but it’s not very useful for sorting results because it’s not a properly isotonic function (that is, we can have p(x_1) > p(x_2) and p'(x_1) = p'(x_2)).

Instead of binning, a more reasonable solution seems to be a piecewise linear approximation. Just fit linear functions through each bin that remain isotonic. That’s what you typically see in non-parametric statistics approaches (e.g. for normalizing intensities in micro-array data or empirically adjusting for false discovery rates).
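The standard way to enforce isotonicity on per-bin estimates is the pool-adjacent-violators algorithm; here’s a sketch for unit weights:

```python
def pav_isotonic(values):
    """Pool adjacent violators: least-squares, non-decreasing fit to a
    sequence (e.g. per-bin accuracies ordered by predicted probability).
    Unit weights assumed."""
    blocks = []  # each block is [mean, size]
    for v in values:
        blocks.append([float(v), 1])
        # merge backwards while monotonicity is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, n2 = blocks.pop()
            v1, n1 = blocks.pop()
            blocks.append([(v1 * n1 + v2 * n2) / (n1 + n2), n1 + n2])
    out = []
    for v, n in blocks:
        out.extend([v] * n)
    return out
```

The piecewise-linear calibrator then interpolates between these pooled values instead of returning flat bin constants.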

Length, Anyone?

In addition to predicted probability on new items, I’d think document length would be a huge factor here. In particular, a two-way binning by length and by prediction probability makes sense to me if there’s enough data. As documents get longer, the overconfidence due to failed independence assumptions gets worse.

Category Effects?

I’d also think you could adjust for category effects. There’s often one or more small foreground categories and a big background category. I’d expect them to calibrate differently.

Comparative Effects

It’s common in estimating confidence for speech recognizer output to look at not only the first-best result, but the difference between the first-best and second-best result’s probability. That may also help here.

Hierarchical Agglomerative Clustering is Non-Deterministic

May 26, 2010

Some of you may have followed the recent back-and-forth on the LingPipe newsgroup concerning the non-deterministic behavior of LingPipe’s implementation of complete-link and single-link agglomerative clustering.

It’s not our fault. Really. The basic algorithm is non-deterministic.

The Usual Algorithm

As it’s usually presented, we initialize with every element in its own singleton cluster. Then at each step, we choose two clusters to merge based on the distance between them. For single-link clustering, distance is the distance between the two closest points (i.e. min of pairwise distances); for complete-link, the distance between the two furthest points (i.e. max of pairwise distances). Usually we choose the closest pair of clusters to merge. But what if there’s a tie?

Some Examples

Neither Wikipedia nor the new(ish) Stanford IR book discusses what to do in the case of ties, leaving the algorithm non-deterministic. Here are links:

(As of today), the Wikipedia says [my emphasis], “This method builds the hierarchy from the individual elements by progressively merging clusters. … The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.”

Similarly, the Stanford IR book’s section on hierarchical agglomerative clustering says [my emphasis], “In each iteration, the two most similar clusters are merged.”

Neither source says what to do when there are ties.


Suppose we have four points on a line, {1, 3, 4, 6}, with the usual Euclidean distance function (e.g., d(1,3)=2, d(3,4)=1, d(6,1)=5, d(4,4)=0, …). And let’s suppose we go with complete-link clustering. There are two alternative possibilities when we come to the second step:


Step 0:   1   3   4   6
Step 1:   1   {3,4}:1.0   6
Step 2:   {1, {3,4}:1.0}:3.0   6      or      1   {{3,4}:1.0, 6}:3.0
Step 3:   {{1, {3,4}:1.0}:3.0, 6}:5.0      or      {1, {{3,4}:1.0, 6}:3.0}:5.0

because d(1,{3,4}) = 3.0 and d({3,4},6) = 3.0.

Without the scores, the dendrograms are {{1,{3,4}},6} vs. {1,{{3,4},6}}. Very different structures.

Single-link clustering suffers from the same non-determinism. Specifically, for the example above, the same first step applies, and then the same ambiguity in the second step where both distances are 2.0 (rather than the 3.0 in the complete-link algorithm).
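Here’s a small Python demonstration (my own toy code, not LingPipe’s implementation) that complete-link clustering on {1, 3, 4, 6} yields different dendrograms depending purely on how the step-two tie is broken:

```python
def complete_link(points, tie_break):
    """Complete-link agglomerative clustering over 1-D points.
    Clusters are (tree, members) pairs; the tree records merge order.
    tie_break chooses among equally distant cluster pairs, which is
    exactly where the algorithm is non-deterministic."""
    clusters = [(p, (p,)) for p in points]
    while len(clusters) > 1:
        best, candidates = None, []
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # complete link: distance between the two furthest points
                d = max(abs(a - b) for a in clusters[i][1] for b in clusters[j][1])
                if best is None or d < best:
                    best, candidates = d, [(i, j)]
                elif d == best:
                    candidates.append((i, j))
        i, j = tie_break(candidates)
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# tie_break=lambda cands: cands[0]   merges 1 with {3,4} first
# tie_break=lambda cands: cands[-1]  merges 6 with {3,4} first
```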

Breaking the Symmetry

Usually an implementation will break the tie in some programmatic way, such as choosing the first one to appear in some data structure. The problem for LingPipe, which uses hash-based Java collections over objects with reference-based equality, is that the ordering’s not fixed. (This is well documented, though a bit subtle.) Specifically, different runs can produce different results. See the last blog post for more details on the non-determinism of Java hash-based collections.

Cophenetic Distance

I suspect it’s not possible to normalize at the dendrogram level for either complete-link or single-link clustering.

On one interpretation of clustering, dendrograms allow us to infer cophenetic distances, which is the distance at which clusters are merged. Unfortunately, the cophenetic distances derived from the two clusterings resulting from the non-deterministic complete-link example above are different, with either 1 or 6 being closer to {3,4} depending on which is merged first. I’m guessing the same thing will happen with single-link, but haven’t worked out the details (in this example, the cophenetic distances are the same).

Thus I believe the real problem’s in the model implicitly assumed in the clustering strategies.

Furthermore, both of these algorithms are notoriously non-robust to small differences in distances or the set of objects being clustered.

Does it Matter?

Are there applications where this will matter? I suppose if you really believe in clustering results and try to analyze them via cophenetic correlation or other measures, it will.

LREC 2010 Tutorial: Modeling Data Annotation

May 17, 2010

Massimo and I are giving our tutorial this afternoon in Malta at LREC 2010. Here’s a link to my slides:

Thanks to Massimo for lots of great suggestions and feedback: from when I gave the talk in Rovereto, Italy (U. Trento) last week, from our discussions between then and now, and on the first few drafts of the slides.

And here’s a link to Massimo’s slides:

Massimo’s part was based on his and Ron’s CL Journal review of kappa-like stats, and the tutorials they’ve given in the past. As such, they’re much more mature. The examples are really tight and to the point.

Expected Error Histograms from Illumina 36bp Reads

May 10, 2010

Mitzi and I have been working on reproducing the results reported in the following (hugely influential) recent paper.

The paper analyzes human liver and kidney tissue using both an Affymetrix micro-array and the Illumina short-read sequencer. More about Affy vs. Illumina later; for now, I just want to describe the quality of the Illumina reads.

Error Profile of Illumina

What’s nice about modern bioinformatics is that many authors submit their data to public repositories. So Mitzi was able to download and crunch Marioni et al.’s Illumina data.

For each 36bp read, Illumina provides a probability estimate for each base. For instance, position 17 of a read might assign probability 0.97 to A, 0.02 to C, 0.001 to G, and 0.009 to T. Marioni et al., like most researchers, simply take the first-best calls (see below for some exceptions), thus calling position 17 in the above example as A. But according to Illumina, there’s only a 97% chance that call is correct!

Calculating Expected Errors per Read

So how many errors do we expect in a 36bp read? If q_i is the probability that the most likely base at position i is correct, the expected number of errors over the first I positions is \sum_{i = 1}^{I} (1 - q_i).
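As a quick sketch of the calculation, with a made-up 36bp quality profile (the real q_i values come from Illumina’s per-base probability estimates):

```python
# Sketch: expected number of errors in a read, given per-position
# probabilities q[i] that the first-best base call is correct.
# The quality profile below is invented for illustration.

def expected_errors(q):
    """Expected number of miscalled bases: sum of (1 - q_i)."""
    return sum(1.0 - qi for qi in q)

q = [0.99] * 20 + [0.97] * 10 + [0.90] * 6  # toy 36bp profile, noisy 3' tail

print(expected_errors(q))       # all 36 positions
print(expected_errors(q[:32]))  # truncated to 32bp, as Marioni et al. did
```

Truncating the noisy tail drops the expected error count substantially, which is exactly the pattern in the histograms below.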

With a little Java, Python and R (more about the package we’re working on later), Mitzi plotted a histogram for all 66.4M reads from Marioni’s 3pM kidney data:

Illumina expected errors, 36bp

But it turns out that like most researchers, Marioni et al. truncated the reads because they’re noisier toward the tail (3′ end) of the reads. Specifically, they only used the first 32bp. Here’s the same plot with the last 4bps truncated:

Illumina expected errors, 32bp

As you can see, this dramatically reduces the expected number of errors.

Calculating Expected Number of Correct Reads

Mitzi also plotted the probability that the first best calls up to length I are all correct for a read, which is \prod_{i=1}^{I} q_i. The probabilities are so low it requires a log (base 10) scale when using all 36bps:
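A sketch of the all-correct probability on the log (base 10) scale, again with an invented quality profile:

```python
import math

# Sketch: probability that every first-best call in a read is correct,
# prod_i q_i, computed on a log (base 10) scale to avoid underflow.
# The quality profile is invented for illustration.

def log10_prob_all_correct(q):
    return sum(math.log10(qi) for qi in q)

q = [0.99] * 20 + [0.97] * 10 + [0.90] * 6  # toy 36bp profile

print(log10_prob_all_correct(q))       # all 36 bases
print(log10_prob_all_correct(q[:32]))  # truncated to 32bp
```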

Illumina expected correctly called reads

The startling conclusion is that there’s almost no chance at all that the first-best base calls lead to the correct sequence. This has unsettling ramifications for procedures like SNP calling.

The Done Thing

What Marioni et al. did is typical. They used a heuristic aligner (Bowtie) which ignores the read quality scores and allows only a small number of edits. They then censor (aka discard) any read from their data set that doesn’t align “uniquely”. A unique alignment is taken to be one that has a unique maximum quality score with fewer than the maximum number of allowed edits. That is, if there’s an alignment with one edit, and seven with two edits, the alignment is considered unique; but if there are two alignments with one edit, the read’s discarded.
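The uniqueness rule as described reduces to a few lines; this sketch (my paraphrase, not Bowtie’s actual code) takes the edit counts of a read’s candidate alignments:

```python
# Sketch of the "unique alignment" filter described above: keep a read only
# if exactly one of its candidate alignments achieves the minimum edit count.

def is_unique(edit_counts):
    """edit_counts: number of edits for each candidate alignment of a read."""
    best = min(edit_counts)
    return edit_counts.count(best) == 1

# One 1-edit alignment and seven 2-edit alignments: the read is kept.
print(is_unique([1, 2, 2, 2, 2, 2, 2, 2]))  # True
# Two 1-edit alignments: the read is discarded.
print(is_unique([1, 1, 2]))                 # False
```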

Why throw away non-unique reads? There’s good reason to believe that the unique alignments are not unique. Not everyone discards non-unique reads; there are so-called “recovery” strategies, which I discussed in a previous post describing my joint mapping recovery and mRNA expression model.

Why throw away quality scores from the aligner? I only know of two aligners that use quality scores, Gnumap and BWA, but these systems are far less popular than simple edit counting systems like Bowtie.

Why only count edits? A proper channel model should be better. For instance, the SHRiMP system employs a more refined channel model (which my probabilistic alignment model refines with proper normalization and integrating out of the actual sequence of edits).

The answer’s simple really: it’s easier to code and easier to explain.

Can We Do Better?

Like every newbie, I feel we can. Not surprisingly, I think we should use all the data we have, integrating a proper probabilistic channel model for alignment with a Bayesian mRNA expression model.

Mitzi’s busy doing much more liberal alignments with SHRiMP (which requires a Viterbi approximation to the channel model) and then we’ll run them through the scalable collapsed Gibbs sampler I wrote for the mRNA expression model. After that, we’ll work in the more refined channel edit model that can account for biases induced by the wet lab process and the sequencer, and the quality scores coming from the sequencer.


No, Alias-i has no plans to work on Bioinformatics; this is still just a hobby for me. Maybe I should choose a hobby that’s a little more dissimilar from my day job.

McCallum, Bellare and Pereira (2005) A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance

April 27, 2010

Following up on my previous post on alignment with CRFs, I’ve found a related paper that I’d like to review here. I’ve spent the last 20 years following Fernando around the intellectual landscape, from logic programming and grammatical formalisms to statistics and biology. (So it’s perhaps not surprising that we’re both speaking next week at the Edinburgh HCRC and Cog Sci Reunion.)

I have to admit I’d read this paper in the dim and distant past (OK, it’s from 2005), so something might’ve stuck in my head. But once you grok CRFs, building new models is easy, because as I said last time, CRFs are really good at normalizing sequential labeling or classification problems.


McCallum et al. applied their model to matching restaurant names and addresses and also to matching bibliographic citations. In contrast, I was interested in matching short reads against reference genomes (for DNA sequencing) or gene models (for mRNA sequencing).

What’s Being Modeled?

McCallum et al.’s goal was to model the probability that two strings match. This reduces the problem to a binary classification problem.

In contrast, what I was modeling in the last post is the probability of a query/read sequence being generated from a reference sequence. The reads are more like the tags generated in a tagging CRF. The alignment is a local alignment of the read to the reference.

We both approached the problem by first defining a notion of an alignment sequence and its relation to the strings. They modeled the probability of a global alignment given both strings, whereas I modeled the probability of a read locally aligned to a reference sequence.

The difference is in normalization terms, both of which are easy to calculate via dynamic programming.

As they point out in their discussion, their model stands to pair HMMs (e.g., as used by [Ristad and Yianilos 1998]) as CRFs stand to regular HMMs. I turned a similar crank to estimate a related probability.

Other Innovations

What I really like about McCallum et al.’s approach is that it allows chunked matching operations. Although they don’t consider affine gap models (where longer gaps aren’t that much more unlikely than shorter gaps), they do consider token-sensitive operations like deleting a token that appears in both strings or in a dictionary, and consider deleting to the end of the current token (useful for matching acronyms). The possibilities are endless.

With their CRF FST implementation in Mallet, I’m guessing it was as easy to code as the paper makes it seem. The FST coding is straightforward, with the only problem being the usual one of explosion of dependencies. Neatly (for both McCallum et al. and me), the sum-product (forward/back) and max-product (Viterbi) dynamic programming algorithms fall out of the encoding.

They don’t allow non-local transpositions (as in, say, a soft TF/IDF-like approach to matching), so everything remains reasonably tractable.

They note that the estimation problem is not concave, because they use latent variables (alignments and states aren’t directly observed here). I’m pretty sure I’ll have the same problem.

They treat matches and mismatches as a mixture model, with two distinct subgraphs of the FSTs. This allows them to have separate features as in a max entropy or discrete choice (DCA) model and unlike in a logistic regression-type approach where there are no outcome-specific features.

Section 5.1 provides a hint at the kinds of features they use to set the values for the various edit operations, like is a numerical character substituted for another numerical character. Curiously, there are no features for the actual characters being matched, just classes. Maybe they didn’t have enough data.

What I added that they didn’t have was a different normalization and features for the alignment position of local alignments (useful for modeling positional or polymer biases in reads).

Direct Regression

(Tsuruoka, McNaught, Tsujii and Ananiadou 2007) introduced an approach to string similarity that was a straight-up logistic regression based on features extracted from the two strings. These features were things like n-grams.
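A sketch of that kind of feature extraction, with hypothetical features (shared and unshared character bigrams) that aren’t necessarily the ones Tsuruoka et al. used:

```python
# Sketch of the Tsuruoka et al. style direct-regression setup: logistic
# regression over features extracted from a string pair. The features here
# (shared/unshared character bigrams) are hypothetical illustrations; the
# regression itself would be fit with any standard package.

def char_ngrams(s, n):
    """Set of character n-grams of string s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def pair_features(s1, s2, n=2):
    """Toy feature map over a string pair: shared and unshared n-grams."""
    a, b = char_ngrams(s1, n), char_ngrams(s2, n)
    return {
        "shared": len(a & b),
        "only_1": len(a - b),
        "only_2": len(b - a),
    }

print(pair_features("kinase", "kinases"))
```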

This also sounds like a good idea. I’m guessing the error patterns would be dissimilar enough that the two classifiers could be fruitfully combined.

The Tsuruoka model would be easy to implement in LingPipe. I had to do a custom implementation for my own model decoder and haven’t built the estimator yet.

Further Applications and Generalizations

I’d love to see something like this applied to the BioCreative III gene linkage task (and thank you, organizers, for switching to camel case from seemingly random case). You’d have to generalize somewhat, because you’re matching against a whole set of known aliases for a gene. The discriminative training would be a huge help in sorting out common phrases, the importance of numbers, the correspondence of Greek and Latin characters, etc.

McCallum et al. only had a chance to scratch the surface of the feature space that might be useful for this and related problems. I’m surprised now that I haven’t seen more work building on this.

It’d also be easy to take any of the pair HMM applications to things like phonology or morphology that are out there.

Sequence Alignment with Conditional Random Fields

April 25, 2010

Having spent hours figuring out how to properly normalize the edit probabilities of a channel model so that they sum to 1 over all reads of a given length given a reference sequence, I find the answer obvious in retrospect: use conditional random fields (CRFs), because they can normalize anything and are easy to program. An added benefit is that biologists are used to thinking of edit costs on a log scale.

Once we have a conditional model with an arbitrary linear basis, it is straightforward to account for differential expression given location or initial hexamer (more on that below).

The Sequence Alignment Problem

The first step in most DNA or RNA analysis pipelines is sequence alignment, wherein the bases read from a biological sample by a sequencer are aligned with the bases of one or more reference sequences (genome sequences or gene models of RNA transcripts).  The present generation of sequencers (e.g. Illumina or SOLiD) produce tens or even hundreds of millions of reads of 50 bases or so per run.

Typically, simpler models like pure edit distance are computed, using hashing heuristics to narrow search or suffix arrays to share prefix calculation. The model described here is intended to be run over a narrower set of alignment hypotheses.

Generative Expression Models Require Probabilistic Alignments

In a previous post, I described a Bayesian hierarchical model of mRNA expression. That model contains a random variable for each read that is dependent on the sequence from which it is read. Inference with this model requires us to calculate the probability of a read given a reference sequence. The model described here directly plugs into the expression model. Probabilistic alignments may also be used in SNP calling, splice-variant discovery, or any other applications of sequence alignment.


Our model is based directly on the simple notion of alignment, which consists of a sequence of matches/substitutions, inserts (gaps in the reference), and deletes (gaps in the query/read). Repeating the example from my earlier post on probabilistic channel models, the following is an alignment of the read/query sequence AGTTAGG (below) to the reference sequence ACGTACG (above), with - marking gaps:

A C G T - A C G
A - G T T A G G

The first C in the reference is deleted, the second T in the read is inserted, and the second G in the read is substituted for the second C in the reference. We can represent alignments as sequences of edit operations. For the alignment above, we have the sequence of eight edits

m(A), d(C), m(G), m(T), i(T), m(A), s(G,C), m(G).

The alignment indicates the bases involved so that the reference and query/read sequences may be reconstructed from the alignment.

As we noted before, there are typically very many ways to align a read to a reference. Here’s an alternative, gap-free alignment,

A C G T A C G
A G T T A G G
which corresponds to the following sequence of seven edits

m(A), s(G,C), s(T,G), m(T), m(A), s(G,C), m(G).

Read Probabilities from Alignment Probabilities

Suppose we have a reference sequence of M bases x = x_1,\ldots,x_M. The CRF we propose will model the probability p(a,m|x,N) of alignment a aligning at position m of reference sequence x and producing a read of length N. We’ll let \mbox{read}(a) be the read generated by the alignment a.

We compute the probability of a read y of length N starting at position m of the reference sequence as

p(y,m|x,N) = \sum_{\{a \ : \ \mbox{read}(a) = y\}} p(a,m|x,N).

We can then marginalize out the starting position m by summing to calculate the probability of a read,

p(y|x,N) = \sum_{m =1}^M p(y,m|x,N).
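A toy numeric check of the two sums, with invented alignment probabilities:

```python
# Toy numeric sketch of the two sums above: first summing alignment
# probabilities p(a, m | x, N) over alignments that produce the same read,
# then marginalizing out the start position m. All probabilities invented.

p_align = [
    # (read produced, start position, alignment id, probability)
    ("AGT", 1, "a1", 0.10),
    ("AGT", 1, "a2", 0.05),  # a second alignment yielding the same read
    ("AGT", 3, "a3", 0.02),
    ("ACT", 2, "a4", 0.83),
]

def p_read_at(y, m):
    """p(y, m | x, N): sum over alignments a with read(a) = y at start m."""
    return sum(p for yy, mm, _, p in p_align if yy == y and mm == m)

def p_read(y):
    """p(y | x, N): marginalize out the start position m."""
    return sum(p for yy, _, _, p in p_align if yy == y)

print(round(p_read_at("AGT", 1), 6))  # 0.15
print(round(p_read("AGT"), 6))        # 0.17
```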

Linear Basis

In general, CRFs are based on arbitrary potential functions for cliques of dependent variables which combine linearly by summation over all cliques. We break down the basis into a contribution from edits and starting position. For edits, we take

\varphi(a,N) = \sum_{k = 1}^{\mbox{length}(a)} \varphi(a_k)

where \varphi(a_k) is the potential of the edit operation a_k (using k to index positions in the edit sequence, since m is already in use for the starting position). Recall that the edits include their edited bases, so we have four match edits m(Z), twelve substitution edits s(Z1,Z2) with Z1 != Z2, four insert edits i(Z) and four delete edits d(Z), for a total of 24 edit operations. So \varphi(\cdot) provides values in (-\infty,\infty) for each of the 24 operations, and the weight of a sequence of operations is the sum of the operation weights.
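Enumerating the 24 operations directly:

```python
from itertools import product

# Sketch: enumerating the 24 edit operations over the DNA alphabet,
# each carrying the base(s) it edits.
bases = "ACGT"
matches = [("m", b) for b in bases]                                # 4
subs = [("s", x, y) for x, y in product(bases, bases) if x != y]   # 12
inserts = [("i", b) for b in bases]                                # 4
deletes = [("d", b) for b in bases]                                # 4
ops = matches + subs + inserts + deletes
print(len(ops))  # 24
```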

For starting positions, we take a potential function \psi(m,x) for starting an alignment at position m of reference sequence x = x_1,\ldots,x_M.

The probability of an alignment at a starting position is defined to be proportional to the exponentiated sum of potentials,

p(a,m|x,N) \propto \exp(\varphi(a,N) + \psi(m,x) + \xi(a,N)),

where we take \xi(a,N) to be an indicator variable with value 0 if \mbox{length}(\mbox{read}(a)) = N and -\infty otherwise. This makes sure our distribution is normalized for the length of reads we care about, and will also make the dynamic programming required for inference tractable.

Normalization is straightforward to calculate in theory. To compute normalized probabilities, we take

p(a,m|x,N) = \frac{1}{Z(x,N)} \exp(\varphi(a,N) + \psi(m,x) + \xi(a,N)), where

Z(x,N) = \sum_{a,m}  \exp(\varphi(a,N) + \psi(m,x) + \xi(a,N)).

As they say in the old country, Bob’s your uncle.

Sum-Product Algorithm for Read Probabilities

It’s just the usual sum-product algorithm for CRFs. It’s just like the usual Smith-Waterman-style edit algorithm, except that (1) we must consume the entire read, (2) start costs are given by \psi, and (3) we use log sums of exponentials rather than maxes to compute cell values without underflow.
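The log-sum-of-exponentials operation that replaces max in the dynamic-programming cells looks like this:

```python
import math

# Sketch of the log-sum-exp operation used in place of max when filling
# dynamic-programming cells: it sums probabilities on the log scale
# without underflow by factoring out the maximum term.

def log_sum_exp(log_vals):
    m = max(log_vals)
    return m + math.log(sum(math.exp(v - m) for v in log_vals))

# Summing two tiny probabilities on the log scale; exp(-1000) underflows
# to 0.0 directly, but log_sum_exp recovers -1000 + ln 2 (about -999.31).
print(log_sum_exp([-1000.0, -1000.0]))
```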

Sum-Product for Normalization

To compute the normalizing term Z(x,N), we run a similar sum-product algorithm, only with an extra sum for all the possible reads.

Viterbi Algorithm for Best Alignments

With the same caveats (1) and (2) above, we can use the usual maximum operation to produce the best alignment. We can also follow LingPipe’s approach and use Viterbi in the forward direction and A* in the reverse direction to enumerate the n-best alignments.

Affine Gap Basis and Edit Context Sensitivity

It’s easy to take affine gaps into account. Just condition the edit operations on the previous edit operation. You have to add the relevant context to dynamic programming as in the usual affine extension to Smith-Waterman alignment.
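A sketch of the conditioning trick, with made-up potential values:

```python
# Sketch (made-up potentials): affine gaps by conditioning each edit's
# potential on the previous edit, so opening a gap costs more than
# extending one. Only match ("m") and delete ("d") shown for brevity.

phi = {
    ("m", "m"): 0.0,
    ("m", "d"): -4.0,  # gap open: expensive
    ("d", "d"): -0.5,  # gap extend: cheap
    ("d", "m"): 0.0,
}

def alignment_potential(ops, prev="m"):
    """Sum of context-conditioned edit potentials along an edit sequence."""
    total = 0.0
    for op in ops:
        total += phi[(prev, op)]
        prev = op
    return total

# One three-base gap (one open, two extends) scores better than three
# separate one-base gaps (three opens):
print(alignment_potential(["d", "d", "d"]))            # -5.0
print(alignment_potential(["d", "m", "d", "m", "d"]))  # -12.0
```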

It’s also easy to model things like the uncertainty caused in Illumina reads by homopolymers (sequences repeating the same base) by allowing context for the previous base. This is also easy to dynamic program, though adding context adds to space and time requirements.

Modeling Uneven Reads

Depending on the sample preparation process in RNA sequencing experiments (RNA-Seq) (e.g. RNA fragmentation by hydrolysis versus oligo-primed cDNA fragmentation by sonication), there may be effects in read coverage based on position (e.g. RNA fragmentation depleted at 3′ and 5′ end or cDNA fragmentation favoring reads toward the 3′ end) or on the sequence of bases near the start of the read (e.g. amplification by “random” hexamer priming).

These effects are not small. [Hansen, Brenner and Dudoit 2010] showed that hexamer binding in the Illumina process can cause over a 100-fold difference in expression depending on the first two hexamers in a read. Similarly, see figure 3 in [Wang, Gerstein and Snyder 2009], which demonstrates over a tenfold effect on expression based on position. (I sure wish there were variance estimates in the Wang et al. graphs.)

Modeling Input Quality

We can also add input quality to our calculations if it’s on a log scale. Then we have to select which base was actually read, making it more like the use of CRFs for tagging. For an example of how to do this for standard edit costs, see the description of the GNUMAP system in [Clement et al. 2010]. As the authors point out, this is a serious issue with reads as noisy as those coming off the Illumina or SOLiD platforms.