LingPipe To-Do List

February 8, 2010 by lingpipe

I never catch up with the pile of API designs and features I have planned for the future of LingPipe. After that last release, I’d like to just list the major items on the “to-do” list for LingPipe. Comments appreciated, as always. As are feature requests. It’s not very expensive to add to this list, and you never know what’ll grab me to implement next.

Demos and Commands

We could really use command-line options for much of our technology. I think we’re losing market share because LingPipe requires Java coding. A good start might be a command-line package for classifiers like all the other ones out there to allow users to plug and play.

We could also use more demos. Right now, we just have a few annotation demos up, but that’s only a small part of what we do.

Improvements and Generalizations

Many of our modules are not written as generally as possible, either at the level of API or the level of algorithms.

  • Fully online stochastic gradient for logistic regression, conditional random fields (CRF), and singular value decomposition (SVD)
  • All iterative algorithms producing iterators over epochs. Right now, only latent Dirichlet allocation (LDA) is set up this way, but I could do this for semi-supervised expectation-maximization (EM) classifier training, and all the SGDs for logistic regression, CRFs and SVD.
  • Refactoring SVD to produce iterators over states/dimensions
  • Weighted examples for just about anything trainable from LM classifiers to logistic regression; this would make it trivial to train on probabilistic examples by weighting categories by probability.
  • Entropy-based pruning for language models.
  • Information gain for feature selection; minimum count feature selection
  • Generalize min-max heaps to take a java.util.Comparator rather than requiring entries to implement LingPipe’s util.Scored.
  • Soft k-means abstracted from demo into package for clustering
  • More efficient vector processing for logistic regression, CRFs, etc., where there is no map from strings to numbers, and not necessarily even a vector processed. In most of these modules, we need to compute a vector dot-product with a coefficient vector, not actually construct a feature vector. Ideally, we could go all the way to Vowpal-Wabbit-like implicit feature maps.
  • Integrate short priority queues where appropriate, because they’re more efficient than the general bounded priority queues we’re currently using
  • More general priors and annealing for SVD to match the other versions of SGD.
  • More efficient sorted collection with removes for more efficient hierarchical clustering.
  • Removing all the specific corpus.Handler subinterfaces other than ObjectHandler. Then I can generalize cross-validating corpora, and just have the object handled as a type parameter for corpus.Parser and corpus.Corpus.
  • Add iterator methods to corpora that can enumerate instances rather than requiring handling via callbacks to visitors.
  • Privatize everything that can be.
  • Finalize methods and classes that shouldn’t be overridden. I’ve been very sloppy about this all along.
  • More agressively copy incoming arguments to constructors and wrap getters in immutable views. Later classes are much better than earlier ones at doing this. (Maybe I should just say re-read Effective Java and try one more time to apply all its advice.)

Rearrangements

So many things should move around that it’d be impossible to mention them all. For instance, putting all the HMM classes in the HMM package, and all the LM classes in the LM package, for a start.

I’m planning to move most of the corpus-specific parsers (in corpus.parsers) out of LingPipe to wherever they’re used.

I’m also planning to move the entire MEDLINE package to the sandbox project lingmed.

I should rename classes like util.Math which conflict with Java library class names.

New Modules

  • Wikipedia Wikitext parser (probably not for LingPipe proper)
  • Probabilistic context-free grammar (PCFG) parser. Possibly with Collins-style rules.
  • Discriminative statistical context-free grammars with more global tree kernel features
  • Dependency parsers with the same characteristics as the CFGs.
  • Linear support vector machines (SVM) with regularization via SGD.
  • Morphological analyzer (to work as a stemmer, lemmatizer or feature generator), preferably with semi-supervised learning. I’d like to take an approach that blends the best aspects of Yarowsky and Wicentowski’s morphology model and Goldwater and Johnson’s context-sensitive Bayesian models.
  • Some kind of feature language for CRFs
  • More finely synched cache (along the lines of that suggested in Goetz et al.’s awesome Java Concurrency in Practice)
  • Logistic regression for a long-distance, rescoring tagger or chunker
  • Longer-distance maximum-entropy Markov model (MEMM) type taggers and chunkers; with a greedy option as discussed by Ratinov and Roth.
  • Higher-order HMM rescoring taggers and chunkers
  • More efficient primitive collections; (I just finished a map implementation for boolean features)
  • Unions, differences, etc. for feature extractors
  • Decision tree classifiers
  • Meta-learning, like boosting (requires weighted training instances)
  • Jaro-Winkler (or other edit distances) trie versus trie (scalable all versus all processors based on prefix tries).
  • Prune zeros out of coefficient vectors and symbol tables for classifiers, CRFs, etc.
  • Standard linear regression (in addition to logistic).
  • Other loss functions for linear and generalized regressions, such as probit and robit.
  • Dirichlet-process-based clusterer
  • Discriminative choice analysis (DCA) estimators, classifiers and coref implementation (the base classes are nearly done).

Please let us know if there’s something you’d like to see on this list.

Inferring Splice-Variant mRNA Expression from RNA-Seq

February 5, 2010 by lingpipe

I introduced the problem of splice-variant expression from RNA-Seq data in my previous post on maximum likelihood estimates of RNA-Seq expression. In this post, I’ll show you a Bayesian approach to the problem from which we can compute a full posterior over the latent read mappings and relative mature mRNA expression of gene splice variants.

The Model

The data from which we infer expression is

  • K \in \mathbb{N}^{+}, the number of splice variants,
  • N \in \mathbb{N}^{+}, the number of reads, and
  • y_1,\ldots,y_N, the reads.

For the purpose of this model, the reads are likely to be characterized in terms of the output of a sequencer, such as base calls with quality scores.

We assume two model hyperparameters,

  • \varphi, genetic variation, sample preparation, and sequencing parameters, and
  • \alpha_1,\ldots,\alpha_K \in \mathbb{R}^{+}, the prior read counts (plus one).

The general-purpose parameter \varphi reflect how reads are generated, including the expected variation from the reference genome, the sample preparation process, and the sequencing platform’s error profile. The variation from the reference genome will determine indel and SNP rates, for instance using a probabilistically normalized affine edit distance model. Details of the sample preparation process, such as amplicon selection, fractionation technique, and so on, also determine which mRNA sequences are actually observed (through cDNA). Finally, the sequencing platform’s error profile, such as color confusability in image space, decreasing accuracy from the start to end of the read, and sensitivity to context such as the previously read gene (as in color-space dependencies in SOLiD or errors due to multiple copies of the same base).

We will infer two of the model parameters,

  • t_1,\ldots,t_N \in 1:K, the mapping of read to splice variant, and
  • \theta_1,\ldots,\theta_K \in [0,1] such that \sum_{k=1}^K \theta_k = 1, the read expression probabilities.

The model structure in sampling notation is

  • \theta \sim \mbox{\sf Dirichlet}(\alpha)
  • t_n \sim \mbox{\sf Discrete}(\theta) for n \in 1:N
  • y_n \sim \mbox{\sf Channel}(t_n,\varphi) for n \in 1:N.

We select the expression levels \theta based on prior counts.

The heart of the mapping model is the discrete sampling of the latent mappings t_n based on the expression level parameter \theta. This is effectively a beta-binomial model of expression at the corpus level.

The mysterious part of this whole thing is the channel model, which assigns the probability of a given read y_n being observed given it arose from the splice variant t_n under the sequencing model parameterized by \varphi.

Simple Channel Examples

For purposes of illustration, we’re going to assume a really simple instance of this model.

First, we’ll assume that \alpha_k = 1 for k \in 1:K, which renders \mbox{\sf Dirichlet}(\alpha) a uniform distribution over possible values of \theta. With replicates at the sequencing run level (or even over channels for Solexa/Illumina data), we could build a hierarchical model and estimate \alpha.

Second, we’ll assume a simple uniform read model. Specifically, we’ll assume that a read is just as likely to come from anywhere on the mRNA; a richer model would account for known variation due to the experimental procedure.

Third, we’ll assume a mapping program that aligns reads with the reference gene models for splice variant sequences. If the mapper provides multiple reads, we will treat them all as equally likely; in a more articulated model, we would take into account both sequencing quality scores, the error properties of the sequencer, and likely variation such as indel and SNP rates in estimating the probability of a read given a splice variant.

Although cross-junction splice mapping is straightforward to model versus a known reference gene model, we’ll restrict attention to exon reads here.

With all of these assumptions in place, if read y_n is mapped to splice variant t_n \in 1:K, we set

\mbox{\sf Channel}(y_n|t_n,\varphi) = 1 / (\mbox{len}(t_n) - \mbox{len}(y_n)).

The denominator \mbox{len}(t_n) - \mbox{len}(y_n) reflects the number of positions a sequence of the given read length \mbox{len}(y_n) may occur in an mRNA of length \mbox{len}(t_n).

A Simple BUGS Implementation

Although not scalable to RNA-Seq data sizes, it’s simple to investigate the properties of our model in BUGS.

model {
    theta[1:K] ~ ddirch(alpha[1:K])
    for (n in 1:N) {
        t[n] ~ dcat(theta[1:K])
        y[n] ~ dcat(rho[t[n],1:K])
    }
}

BUGS programs look just like the sampling notation defining the model; ddirch is the Dirichlet distribution and dcat is the discrete (aka categorical) distribution.

I call it from R in the usual way. Here’s just the part where R tells BUGS how to run the model. After the BUGS window is closed, R plots and attaches the data.

library("R2WinBUGS")

data <- list("alpha", "rho", "y", "N", "K")
parameters <- c("theta","t")
inits <- function() { list(theta=c(1/3,1/3,1/3)) }
rnaexp <- bugs(data, inits, parameters,
       "c:/carp/devmitz/carp/isoexp/src/Bugs/isoexp-multi.bug",
        n.chains=3, n.iter=50000,
        debug=TRUE, clearWD=TRUE,
        bugs.directory="c:\\WinBUGS\\WinBUGS14",
        DIC=FALSE)
print(rnaexp)
plot(rnaexp)
attach.bugs(rnaexp)

The other debugging problem I had is that you absolutely need to tell BUGS not to compute DIC (a kind of model comparison stat); otherwise, BUGS crashes when trying to compute DIC.

Attaching the data converts the parameter of the model to matrix data representing the Gibbs samples.

An Example

I’ll close with an example that illustrates how the R and BUGS program work, and how the model is able to infer expression even without a single non-multi-mapped read. The gene model and expression levels for the simulation are defined in a comment in the R code. We’ll consider a case with three exons and three isoforms:

# exons  A BB C  (B twice as long as others)
# var1 = A BB
# var2 =   BB C
# var3 = A    C

There are three exons, A, BB, and C, where the notation is meant to indicate the middle exon is twice as long as the others (to spice up the read calculation a bit). There are three splice variants, labeled var1, var2, and var3. Variant 1 is made up of exons A and BB, variant 2 of exons BB and C, and varient 3 of exons A and C.

Without cross-junction reads, it’s not possible to get a uniquely mapped read. Any read mapping to exon A will match splice variants 1 and 3, any read mapping to exon BB will match variants 1 and 2, and any read mapping to exon C will match 2 and 3.

Next, we specify the number of samples drawn from each exon in each splice variant.

#       exon
# var  1  2  3  Total theta*
# 1    1  2    | 3     3/19
#
# 2       4  2 | 6     6/19
#
# 3    5     5 | 10    10/19
#      --------
# tot  6  6  7   18

There are three samples from variant 1, 6 samples from variant 2, and 10 samples from variant 3. These are distributed uniformly across the exons. Here, variant 1 has 1 sample from exon 1 and 2 samples from exon 2, variant 2 has 4 samples from exon 2 and 2 samples from exon 3, whereas variant 3 has 5 samples each from exon 1 and exon 3.

If we adjust for length, variant 1 has 1 mRNA transcript, variant 2 has 2 transcripts, and variant 3 has 5 transcripts; these are computed by dividing by length. But our expression parameter theta (\theta) is defined by reads, not by transcripts.

The observed data is much simpler — it’s just the total number of reads for samples mapping to each exon. We see that there are 6 samples mapping to exon 1, 6 samples mapping to exon 2, and 7 samples mapping to exon 3.

Coding the Example in R

We set the Dirichlet prior to be uniform, whihc also determines our number of splice variants K:

alpha <- c(1,1,1)
K <- length(alpha)

We actually assume there are 10 times that many samples as in our figure, setting the read data as:

y <- c(rep(1,60),rep(2,60),rep(3,70))
N <- length(y)

That is, there are 60 reads from exon 1, 60 reads from exon 2 and 70 reads from exon 3, and N=190.

Finally, we use R’s matrix constructor to set up the read model:

rho <- matrix(c(1/3,2/3,0,  0,2/3,1/3,  1/2,0,1/2),
              nrow=3,ncol=3,byrow=TRUE)

I tried to make this human readable: a sample from the first splice variant has a 1/3 chance of originating from exon 1 and a 2/3 chance of originating from exon 2 (because exon 2 is twice as long as exon 1). A sample from the third splice variant has an equal chance of being from either exon 1 or exon 3 (exon 1 and exon 3 are the same length).

The Output

I ran the R program, which called BUGS, which terminated neatly and returned to R. The \hat{R} values were all pinned at 1.0, indicating good chain mixing and hence convergence of the sampler.

Bayeisan Point Estimate

We can compute any statistics we want now with the posterior samples. For instance, theta[,1] contains all the Gibbs samples for the variable \theta_1.

We can then generate the Bayesian (L2) estimates for expression levels:

> thetaHat = c(mean(theta[,1]), mean(theta[,2]), mean(theta[,3]))
> print(thetaHat,digits=2)
[1] 0.17 0.31 0.52

The “right” answer for our simulation is also easy to compute with R:

> print(c(3/19, 6/19, 10/19), digits=2)
[1] 0.16 0.32 0.53

Our estimates are pretty darn close. (Recall we can normalize these sample counts by length to get mRNA transcript counts for each variant.)

How certain are we of these estimates? Not very, because there were only 190 samples and lots of mapping uncertainty. It’s easy to estimate posterior 95% intervals, 50% intervals, and medians using R’s quantile function over the Gibbs samples:

> print(quantile(theta[,1], c(0.025,0.25,0.5,0.75,0.975)),
        digits=2)
 2.5%   25%   50%   75% 97.5%
0.025 0.104 0.167 0.231 0.340
> print(quantile(theta[,2], c(0.025,0.25,0.5,0.75,0.975)),
        digits=2)
 2.5%   25%   50%   75% 97.5%
 0.13  0.25  0.31  0.37  0.47
> print(quantile(theta[,3], c(0.025,0.25,0.5,0.75,0.975)),
       digits=2)
 2.5%   25%   50%   75% 97.5%
 0.42  0.49  0.52  0.56  0.61

With more data, and especially with uniquely mapping reads (e.g. across unique junction boundaries or in unique exons for a variant), posterior uncertainty will be greatly reduced.

A Triangle Plot and Correlated Expressions

Given that we only have three splice variants, we can create a so-called triangle plot of the expression levels, which will show correlations among the expression levels estimated for the variants on the simplex. Here’s the plot:

I couldn’t figure out how to label the axes in R’s ade4 package, but it’s clear from the means that are labeled that the upper left axis (mean of 0.17) is for splice variant 1, the bottom axis (mean of 0.308) is for variant 2, and the top right axis (mean of 0.522) for variant 3.

It takes some practice to view triangle plots like this; it’s perhaps easier to view two at a time. Here’s the two-way interaction plots:

two-way plots of RNA-seq expression 1 vs 2

two-way plots of RNA-seq expression 1 vs 3

two-way plots of RNA-seq expression 2 vs 3

These show the strong negative correlation between estimates of the expression for variants 1 and 2, but little correlation between these and variant 3. This makes sense given variants 1 and 2 share the large exon BB.

Finally, here’s the source that generated the PNG file for the triangle plot:

library("ade4")
png(file="splice-variant-exp-tri.png")
triangle.plot(theta,addmean=TRUE,labeltriangle=TRUE,
    draw.line=TRUE,show.position=FALSE,scale=FALSE)
dev.off()

and for the ordinary plots:

png(file="splice-variant-exp-12.png")
plot(theta[,c(1,2)],xlab="variant 1",ylab="variant 2",
  xlim=c(0,1),ylim=c(0,1),pch=20,main="1 vs. 2")
dev.off()

png(file="splice-variant-exp-13.png")
plot(theta[,c(1,3)],xlab="variant 1",ylab="variant 3",
  xlim=c(0,1),ylim=c(0,1),pch=20,main="1 vs. 3")
dev.off()

png(file="splice-variant-exp-23.png")
plot(theta[,c(2,3)],xlab="variant 2",ylab="variant 3",
  xlim=c(0,1),ylim=c(0,1),pch=20,main="2 vs. 3")
dev.off()

Evaluating with Unbalanced Training Data

February 4, 2010 by lingpipe

For simplicity, we’ll only consider binary classifiers, so the population consists of positive and negative instances, such as relevant and irrelevant documents for a search query, or patients who have or do not have a disease.

Quite often, the available training data for a task is not balanced with the same ratio of positive and negative instances as the population of interest. For instance, in many information retrieval evaluations, the training data is positive biased because it is data annoted from the top results of many search engines. In many epidemiological applications, there is also positive bias because a study selects from a high risk subpopulation of the actual population.

Even with unbalanced training data, we might still want to be able to calculate precision, recall, and similar statistics for the actual population. It’s easy if we know (or can estimate) the true percentage of positive instances in the population.

Specificity and Sensitivity vs. Precision and Recall

Using the usual notation for true and false positives and negatives,

\mbox{sens} = \mbox{TP}/(\mbox{TP} + \mbox{FN}), and

\mbox{spec} = \mbox{TN}/(\mbox{TN} + \mbox{FP}).

Sensitivity and specificity are accuracies on positive (\mbox{TP} + \mbox{FN}) and negative cases (\mbox{TN} + \mbox{FP}), respecitively.

Prevalence for a sample may be calculated from the true and false positive and negative counts, by

\pi = (\mbox{TP} + \mbox{FN})/(\mbox{TP} + \mbox{FP} + \mbox{TN} + \mbox{FN}).

Recall is just sensitivity, but precision is the percentage of positive responses that are correct, namely

\mbox{prec} = \mbox{TP}/(\mbox{TP} + \mbox{FP}).

Prevalence Adjusted Contingency Matrix

Suppose we know the test population prevalence \pi', the probability of a random population member being a positive instance. Then we can adjust evaluation statistics over a training population with any ratio of positive to negative examples by recomputing expected true and false positive and negative values.

The expected true and false positive and negative counts in test data with prevalence \pi' over N samples, given test sensitivity and specificity \mbox{sens} and \mbox{spec}, are

\mbox{TP}' = N \cdot \mbox{sens} \cdot \pi',

\mbox{TN}' = N \cdot \mbox{spec} \cdot (1 - \pi'),

\mbox{FP}' = N \cdot (1 - \mbox{spec}) \cdot (1 - \pi'), and

\mbox{FN}' = N \cdot (1-\mbox{sens}) \cdot \pi'.

Sensitivity and Specificity are Invariant to Prevalence

Although the adjusted contingency matrix counts (true and false positive and negatives) vary based on prevalence, sensitivity and specificity derived from them do not. That’s because specificity is accuracy on negative cases and sensitivity the accuracy on positive cases.

Plug-In Population Statistic Estimates

Precision, on the other hand, is not invariant to prevalence. But we can compute its expected value in a population with known prevalence using the adjusted contingency matrix counts,

\mbox{prec}' = \mbox{TP}'/(\mbox{TP}' + \mbox{FP}').

The counts N all cancel; we can really work with true and false positive and negative rates instead of counts.

Related Work

This is related to my earlier blog post on estimating population prevalence using a classifier with known bias and an unbiased population sample:

The approach described in the linked blog post may be used to estimate the population prevalence \pi' given only an arbitrarily biased labeled training set and an unlabeled and unbiased sample from the population.

LingPipe 3.9 Released

February 3, 2010 by lingpipe

Yatta! CRFs for tagging and chunking and a whole lot more are out and available for download from the usual place:

The home page details the updates. Major changes in addition to CRFs include a new tag package to which HMMs were retrofitted, serialization for tokenized LMs, updates to 2010 MEDLINE, and some new utility methods and classes.

4.0 Up Next

We’re no longer planning to adopt the AGPL license, but we are still planning major changes for 4.0, including getting rid of all the currently deprecated classes and methods. Many existing classes are likely to be relocated.

Hopefully this will be done soon, as we’re not planning any new functionality for 4.0. Yet.

Programming esprit d’escalier

February 3, 2010 by lingpipe

Mixing Metaphors

Last night I was struck by a bolt of esprit d’escalier.*

Anecdotal Evidence

I was refactoring our part-of-speech tutorial to the new tagging interface introduced for CRFs. I’d spent over an hour stuck on a bug for unknown token evaluations, which had evocatively repeatable bug patterns. Finally, I had the sense to give up, get on the subway, and go home. As soon as I was sitting on my sofa watching Defying Gravity, it occurred to me what the problem was. I was updating the known token set for the initial segment of the test set and then recalculating test set scores. D’oh!

Down the Garden Path

The problem is that your brain gets into a rut. To borrow a metaphor from psycholinguistics, you wander down a programming garden path.

The Zen of Being Your Own Rubber Ducky

Way back in high school (post-internt, pre-web), Douglas Hofstadter’s book Gödel, Escher, Bach introduced me to the joy of the kōan. I liked the one about running around the house without thinking about white elephants. That is exactly how it feels when you’re down the garden path, trying to escape the labyrinth to reach enlightenment.

I contend that if you were good enough at zen meditation, you could be your own rubber ducky.


* The Wikipedia translates the expression “esprit d’escalier” as “staircase wit” and defines it succintly as “thinking of a clever comeback when it is too late”. I love the Wikipedia; they even point to an esprit d’escalier-themed Seinfeld episode, “The Comeback”.

Hunt and Thomas introduced the term “rubber ducky” in their book The Pragmatic Programmer. It’s about how useful it is to explain your problem out loud, even if only to a rubber ducky. The best rubber ducky is another programmer who actually understands what you’re talking about at some level. Mitzi and I have found each other more useful than our cats as rubber duckies.

Comparing Precision-Recall Curves the Bayesian Way?

January 29, 2010 by lingpipe

Back Story

There’s an interesting thread on the BioNLP mailing list (here’s a link to the publicly readable thread). The thread was kicked off when Andrew Clegg asked:

Suppose I have two precision-recall curves for two different IR algorithms on the same test data and query.

What would be the correct statistical test to determine if they are significantly different?

I’m particularly sympathetic to (part of) Bill Hersh’s second response.

… All statistical inference tells you is how likely your differences are due to chance. It says nothing about whether the difference is important. …

to which Andrew replied

… in my case it’s the other way round, I think we’ve already displayed a clear benefit in terms of usability, I’m just thinking of including significance testing in case a very picky reviewer asks if it’s down to chance.

Can you say CYA?

There were a host of classical tests proposed including the new-to-me Page trend test. Others suggesting taking an aggregate statistic like area-under-the-curve and doing a standard difference test.

ROC Curves vs. PR Curves

There’s much more literature on receiver operating characteristic curves (aka ROC curves) than precision-recall (PR) curves. Just as a refresher, if we take a given number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), precision, recall, sensitivity and specificity are:

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Recall = Sensitivity = TP / (TP + FN)

Precision = TP / (TP + FP)

If we have a ranked list of outputs for which we know their gold-standard category (relevant or not in the case of IR), we can consider initial segments of the ranked list and compute the precision/recall at that point. The ROC curve plots 1-specificity (the false positive rate) versus sensitivity (the true positive rate) at various operating points for an algorithm. The PR curve plots precision versus recall.

In PR curves, it is often the case that only “interpolated” results are considered, which means taking the convex shell of the curve. This is easily accomplished by making sure the function is non-decreasing by propagating later higher results back. For instance, if precision at 1 recall is 1/2 and precision at 2 recall is 2/3 and at 3 recall is 1/2, then the precision at 1 recall is interpolated to 2/3 so the function is non-decreasing. (I’m not sure what this does to hypothesis testing.)

There’s a lot more information and examples in the class documentation for

The Bayesian Way

It’s in exactly these kinds of situations where a Bayesian approach seems natural.

Just add a prior and compute a posterior over entire PR curves. Then compare the PR curves any way you want. Say by calculating how likely it is for one PR curve to be above the other at every operating point. Or how likely it is for one to be above the other at a given operating point (say 99% recall). Or how likely one is to be substantially better than another, or how much you expect one to be higher than another.

See my earlier post on Bayesian counterpart to Fisher’s exact test on contingency table data.

The nifty part is that “adjustments” for multiple comparisons are built in. So we can ask what’s the posterior probability of one system having at least 1% higher precision than all others at 99% recall.

But once we’re doing multiple comparisons, it makes sense to build a hierarchical model. Often, a system will report results on multiple queries, so there’s a set of precision-recall curves for each system, leading to two levels of grouping.

I provide an example of Bayesian multiple comparisons in a hierarchical model in my earlier blog post on Bayesian batting ability with multiple comparisons.

Will it Work?

How do I model the posterior PR (or ROC) curve? From that, a regular or interpolated PR (or ROC) curve can be generated if desired.

I’m specifically worried about precision and recall sharing the TP term in their numerators and denominators; ROC curves seem simpler.

I’m also worried about the low counts at the low recall end of the curves.

And about the correlations among the different operating points.

It doesn’t seem that a naive independent binomial for each recall point would be justified, though it’d make the posteriors very clean, especially with uniform priors.

For instance, suppose that after the first 5 true positives in the systems’ ranked lists, we have seen 7, 10, 4, 11, and 21 false positives for the five systems? The maximum likelihood point estimate of precision with a uniform prior is then 5/12, 5/15, 5/9, 5/16, and 5/26. Can we just use Beta(6,13), …, Beta(6,27) as the posteriors for comparison? Somehow it just seems wrong.

Any Ideas?

Does anyone know how to compute PR or ROC posteriors appropriately?

Does Anyone Have Data?

Andrew, or anyone: if you have the ranked data and gold standard answers, I’d love to code this up in R/BUGS and see what it looks like in the context of data someone cares about.

The Long Road to CRFs

January 28, 2010 by lingpipe

CRFs are Done

The first bit of good news is that LingPipe 3.9 is within days of release. CRFs are coded, documented, unit tested, and I’ve even written a long-ish tutorial with hello-world examples for tagging and chunking, and a longer example of chunking with complex features evaluated over:

And They’re Speedy

The second bit of good news is that it looks like we have near state-of-the-art performance in terms of speed. It’s always hard to compare systems without exactly recreating the feature extractors, requirements for convergence, hardware setup and load, and so on. I was looking at

for comparison. Okazaki also evaluated first-order chain CRFs, though on the CoNLL 2000 English phrase chunking data, which has fewer tags than the CoNLL 2003 English named entity data.

While my estimator may be a tad slower (it took about 10s/epoch during stochastic gradient), I’m applying priors, and I think the tag set is a bit bigger. (Of course, if you use IO encoding rather than BIO encoding, like the Stanford named entity effort, then there’d be even fewer tags; on the other hands, if I followed Turian et al. (ref below), or the way we handle HMM encoding, there’d be more.)

It also looks like our run time is faster than the other systems benchmarked if you don’t consider feature extraction time (and I don’t think they did in the eval, but I may be wrong). It’s running at 70K tokens/second for full forward-backward decoding; first-best would be faster.

All the Interfaces, Please

Like for HMMs, I implemented first-best, n-best with conditional probabilities, and a full forward-backward confidence evaluation. For taggers, confidence is per tag per token; for chunkers, it’s per chunk.

Final Improvements

Yesterday, I was despairing a bit over how slow my approach was. Then I looked at my code, instrumented the time spent in each component, and had my D’oh! moment(s).

Blocked Prior Updates

The first problem was that I was doing dense, stochastic prior updates. That is, for every instance, I walked over the entire set of dense coefficient vectors, calculated the gradient, and applied it. This was dominating estimation time.

So I applied a blocking strategy whereby the prior gradient is only applied every so often (say, every 100 instances). This is the strategy discussed in Langford, Li and Zhang’s truncated gradient paper.

I can vouch for the fact that result vectors didn’t change much, and speed was hugely improved to the point where the priors weren’t taking much of the estimation time at all.

Caching Features

The other shortcoming of my initial implementation was that I was extracting features online rather than extracting them all into a cache. After cleaning up the prior, feature extraction was taking orders of magnitude longer than everything else. So I built a cache, and added yet another parameter to control whether to use it or not. With the cache and rich feature extractors, the estimator needs 2GB; with online feature extraction, it’s about 20 times slower, but only requires around 300MB of memory or less.

Bug Fixes

There were several subtle and not-so-subtle bugs that needed to be fixed along the way.

One problem was that I forgot to scale the priors based on the number of training instances during estimation. This led to huge over-regularization. When I fixed this problem, the results started looking way better.

Structural Zeros

Another bug-like problem I had is that when decoding CRFs for chunkers, I needed to rule out certain illegal tag sequences. The codec I abstracted to handle the encoding of chunkers and taggers and subsequent decoding could compute legal tag sequences and consistency with tokenizers, but the CRF itself couldn’t. So I was getting illegal tag sequences out that caused the codec to crash.

So I built in structural zeros. The simplest way to do it seemed to be to add a flag that only allowed tag transitions seen in the training data. This way, I could enforce legal start tags, legal end tags, and legal transitions. This had the nice side benefit of speeding things up, because I could skip calculations for structural zeros. (This is one of the reasons Thorsten Brants’ TnT is so fast — it also applies this strategy to tags, only allowing tags seen in training data for given tokens.)

Feature Extraction Encapsulation

I was almost ready to go a couple of days ago. But then I tried to build a richer feature extractor for the CoNLL entity data that used part-of-speech tags, token shape features, contextual features, prefixes and suffixes, etc. Basically the “baseline” features suggested in Turian, Ratinov, Bengio and Roth’s survey of cluster features (more to come on that paper).

It turns out that the basic node and edge feature extractors, as I proposed almost six months ago, weren’t quite up to the job.

So I raised the abstraction level so that the features for a whole input were encapsulated in a features object that was called lazily by the decoders and/or estimator. This allowed things like part-of-speech taggings to be computed once and then cached.

This will also allow online document features (like previous tagging decisions) to be rolled into the feature extractor, though it’ll take some work.

And a Whole Lotta’ Interfaces and Retrofitting

I added a whole new package, com.aliasi.tag, to characterize the output of a first-best, n-best, and marginal tag probability tagger. I then implemented these with CRFs and retrofitted them for HMMs. I also pulled out the evaluator and generalized it.

Along the way, I deprecated a few interfaces, like corpus.ChunkHandler, which is no longer needed given corpus.ObjectHandler<Chunking>.

Still No Templated Feature Extraction

Looking at other CRF implementations, and talking to others who’d used them, I see that designing a language to specify feature extractions is popular. Like other decisions in LingPipe, I’ve stuck to code-based solutions. The problem with this is that it limits our users to Java developers.

Dekel and Shamir (2009) Vox Populi: Collecting High-Quality Labels from a Crowd

January 22, 2010 by lingpipe

Aleks just sent me a pointer to this paper from last year’s COLT:

The paper presents, proves theorems about, and evaluates a very simple idea in the context of crowdsourcing binary classifier data coding. Unlike other approaches I’ve blogged about, Dekel and Shamir use only one coder per item, and often coders only label a handful of items.

They train an SVM on all the labels, then use the SVM to evaluate coders. They then prune out low-quality coders using the SVM as a reference.

This paper reminded me of (Brodley and Friedl 1999), which was also about using multiple trained classifiers to remove bad labels. Brodley and Friedl remove items on an item by item basis rather than on an annotator by annotator basis.

This paper is also somewhat reminiscent of (Raykar et al. 2009), who train a classifier and use it as another annotator.

Lots of Theory

There’s lots of theory saying why this’ll work. It is a COLT paper after all.

Evaluation: Web Page Query Relevance

They ran an eval on Amazon’s Mechanical Turk asking people to judge a pair consisting of a web query and web page as to whether the page is relevant to the query.

They used 15 coders per query/page pair in order to define a gold standard by voting. They then evaluated their approach using only one coder per item by subsampling their larger data set.

One thing I noticed is that as they lowered the maximum number of items labeled by an annotator, their average annotator accuracy went up. This is consistent with our findings and those in Snow et al. that the spammy Mechanical Turkers complete a disproportionate number of tasks.

With only one coder per item, Dekel and Shamir can’t evaluate the way everyone else is evaluating crowdsourcing, because they know their resulting data will remain noisy because it’s only singly annotated.

Estimates of coder sensitivity and specificity could be made based on their performance relative to the SVM’s best guesses. That’d provide some handle on final corpus quality in terms of false positives and negatives relative to ground truth.

Rather than evaluating trained classifier performance, Dekel and Shamir measure the “noise level” of the resulting data set after pruning. What we really care about in practice is how much pruning bad annotators helps train a more reliable classifier (or help evaluate if that’s the goal). They discuss this kind of end-to-end evaluation under “future work”! It’s really striking how different the evaluation versus theory requirements are for different conferences.

Marginalizing Latent Variables in EM

January 19, 2010 by lingpipe

After getting carried away in my last post on this topic, Maximum Likelihood Estimates of Expression in RNA-Seq are Winner-Take-All Biased, I want to explain what went wrong when I “turned the crank” (an idiom in mathematics for solving an equation by brute force without thinking too much about it).

Expression Mixture Model

Recall that we were looking at a mixture model, which is the bread and butter of maximum likelihood estimators like EM. We defined a joint likelihood function for a sequence of observed reads r, gene or splice variant sources g, and mRNA expression \theta, setting

p(r,g|\theta) = \prod_{n=1}^N p(r_n,g_n|\theta) = \prod_{n=1}^N p(r_n|g_n) \ p(g_n|\theta).

Maximum Likelihood

Maximum likelihood is usually stated as finding the model parameters that maximize the likelihood function consisting of the probability of the observed data given the model parameters.

What’s a Parameter?

So that’s what I did, calculating

g^*, \theta^* = \arg \max_{g,\theta} p(r|g,\theta).

The problem is that there’s an implicit “except the latent parameters” inside the usual understanding the MLE.

What I should have done is marginalized out the latent gene sources g, taking

\theta^* = \arg \max_{\theta} p(r|\theta).

That requires marginalizing out the latent mapping of reads r_n to gene or splice variant sources g_n,

p(r|\theta) = \prod_{n=1}^N p(r_n|\theta_n) = \prod_{n=1}^N \sum_{k = 1}^K p(r_n|k) \ p(k|\theta).

On a log scale, that’s

\log p(r|\theta) = \sum_{n=1}^N \log \sum_{k = 1}^K p(r_n|k) \ p(k|\theta).

The Example

I’ll repeat the example here for convenience:

Isoform Expression Example

Suppose I have a gene with two isoforms, A and B, spliced from three exons, 1, 2, and 3. For simplicitly, we’ll suppose all the exons are the same length. Suppose isoform A is made up of exon 1 and exon 2, and isoform B of exon 1 and 3. Now suppose that you run your RNA-Seq experiment and find 70 reads mapping to exon 1, 20 reads mapping to exon 2, and 50 reads mapping to exon 3. In table form, we have the following mapping counts:

  Exon 1 Exon 2 Exon 3 Total
Iso 1 N 20 0 20+N
Iso 2 70-N 0 50 120-N
Total 70 20 50 140

The 20 reads mapping to exon 2 must be part of isoform 1, and similarly the 50 reads mapping to exon 3 must belong to isoform 2. That leaves the 70 reads falling in exon 1 to spread in some proportion between isoform 1 and isoform 2.

Turning the Crank In the Right Direction

By assuming each splice variant consists of two exons of the same length, the probability of a read in an exon is 1/2 for each exon in the splice variant and 0 for the exon not in the variant.

Now when we turn the crank, we see that

\log p(r|\theta) = \sum_{n=1}^N \log \sum_{k=1}^K p(r_n|k) p(k|\theta)

{} \ \ \ \ =  70 \log \sum_{k=1}^K p(1|k) p(k|\theta) + 20 \log \sum_{k=1}^K p(2|k) p(k|\theta)  + 50 \log \sum_{k=1}^K p(3|k) p(k\theta)

= 20 \log \frac{\theta_1}{2} + 50 \log \frac{\theta_2}{2} + 70 \log (\frac{\theta_1}{2} + \frac{\theta_2}{2})

= 20 \log \theta_1 + 50 \log \theta_2 + (70 \log 1 -20\log 2 - 50 \log 2 - 70 \log 2).

The reduction on the third line is because the probability of a read in the second exon, p(2|k), is only non-zero for isoform k=1, and similarly for a read in the third exon, where p(3|k) is nly non-zero for the second gene or splice variant, k=2. The final reduction follows because \theta_1 + \theta_2 = 1.

Not so Biased After All

After marginalizing out the read mappings g, the MLE for expression is right where we’d expect it, at the solution of

\frac{d}{d\theta} \log p(r|\theta) = 0,

which is

\theta^*=20/70.

It turns out EM isn’t so biased for these discrete mixtures, at least when you turn the crank the right way.

DCA for Discriminative Coreference Resolution

January 15, 2010 by lingpipe

DCA isn’t Just Tied Logistic Regression

After posting about discrete choice analysis (DCA) a few days ago, then adding a post about analyzing transportation choices, it occurred to me that DCA isn’t really just a form of logistic regression with parameter tying in its most general form. In its most general form, any number of choices can be presented at run time, whereas in both the machine-learning max-ent style approaches and the stats textbook type approaches to logistic regression, there is always a fixed, finite set of outcome categories.

DCA for Within-Document Coreference

One of the projects on our plate right now involves within-document coreference resolution, which involves resolving which noun phrases in a document refer to the same entities. Thus it’s a bit more general than the classical anaphora resolution problem, which finds antecedents for pronouns. In our case, two mentions of “John Smith” in the same document need to be linked (if they refer to the same person; they won’t if you’re reading Amit Bagga and Breck’ss John Smith resolution paper, or for that matter, the recreation of their results in LingPipe’s clustering tutorial).

LingPipe’s heuristic coreference package is based on Breck’s thesis work (called CogNIAC). If you troll through the code or read Breck’s paper, you’ll see that it’s essentially a greedy online algorithm that visits the entity mentions in a document in order, and for each mention either links it to a previous linked chain of mentions, or it starts a new chain consisting only of the current mention. Essentially, it’s a greedy online clustering algorithm.

The resolution of a mention is guided by matchers and anti-matchers that score candidate antecedent mention chains based on properties such as the closest matching alias (using a complex comparison allowing for missing tokens), known alias associations, discourse proximity (how far away the last mention in a chain is and how many are intervening), entity type (person, location, and ogranization, and for persons, gender).

So far, there’s nothing for linking common nouns (as there is in ACE), and nothing for linking compound entities (e.g. “John and Mary … They …”).

A DCA Model

We could just throw all those features into a DCA type model! The discrete choices consist of each candidate antecedent mention chain along with the choice of creating a new mention chain. If we can extract a feature vector \phi(c,m) \in \mathbb{R}^n given a mention chain c and mention m, we’re in business. We could potentially even use the other mention chains in generating the features for mention m, as in \phi(\mathcal{C},c,m) where \mathcal{C} is the set of previous mention chains, c \in \mathcal{C} is the candidate mention chain, and m is the mention being resolved. With all existing mention chains in play, we can also generate a reasonable feature vector \psi(\mathcal{C},m) for the choice of creating a new mention chain out of m given the antecedent candidates \mathcal{C}.

The model consists of a coefficient vector \beta \in \mathbb{R}^n. Given a mention m and set of potential antecedent chains \mathcal{C}, the probability of a given antecedent is defined in the usual way for log-linear logistic-type models. For linking to an antecedent chain c \in \mathcal{C}, we have

p(c|\mathcal{C},m) \propto \exp(\beta^{\top} \phi(\mathcal{C},c,m)).

For a new chain, which we’ll write n, we have

p(n|\mathcal{C},m) \propto \exp(\beta^{\top} \psi(\mathcal{C},m)).

Given annotated data and a feature extractor, the training cases easy to create. Each instance consists of the features for all previous mention chains and for creating a fresh along with the decision.

Does Any NLP Package Implement DCA?

Alas, LingPipe can’t handle these models the way I wrote logistic regression. In fact, I’m not sure anyone’s NLP package can handle the full DCA type models.

I’m afraid I can never quite figure out from the Mallet javadoc what exactly it can and can’t do because everything’s so abstract and there are so many implementations of the interfaces and so little high-level doc. But classes like cc.mallet.classify.MaxEnt seem to assume a fixed set of output categories.

The Open NLP javadoc is much more straightforward; their opennlp.maxent.GISModel also requires a fixed set of categories.

It also doesn’t look like Weka supports DCA-style models from the Weka Javadoc, but again, the interfaces are so general, and the implementations so plentiful, I’m not sure.

I’d love to hear about which packages do support this kind of thing. Presumably something the econometricians are using would do it.

Implementing DCA with SGD

DCA won’t be at all hard to write. In fact, having just finished CRFs (stay tuned for the release, probably next week).

Stochastic gradient descent (or a quasi-Newton method) is easy to adapt. The correct antecedent candidate or choice to create a new chain is treated as a positive (1) outcome, with all other candidate antecedent mention chains treated as negative (0) outcomes.

Coefficient Priors

Of course, we can put priors on the coefficients.

It sure would be nice to be able to create Bayesian estimators. Having just blogged about MLE and MAP’s bias, I’m worried about the MAP estimates for logistic regression all these systems (including LingPipe) use.

OK, you could use SVMs, too

Of course, SVM (hinge), perceptron (0/1), or other simple loss functions could be used instead of the logistic loss function. SGD will still work as is. And you could use non-linear kernels, which often squeeze a bit more accuracy out for very little work (contrasted with hand tuning non-linear features, such as interaction features, in a standard linear model).

Coreference Training Data?

Well, there was MUC 7 which kicked things off, and then ACE more recently. MUC-6 and MUC-7 and several years of ACE datasets are available from the LDC for a fee.

A couple months ago, I blogged about Massimo Poesio et al.’s effort to collect a coreference corpus using the Phrase Detectives annotation as game. This should be out soon, and be much larger than the MUC and ACE datasets.