Custom Java Map for Binary Features

February 9, 2010 by lingpipe

The built-in Java collection implementations tend to be heavy. For instance, hash maps are based on hash sets of map entries, where a map entry is a simple container object with a reference to a key and a value.

For many natural language applications, such as CRF taggers and chunkers, feature maps are often designed to be boolean, with features taking only binary values, 0 or 1. These maps are almost always sparse, only representing the one values. Given the proportion of time (most of it) and memory (pretty much all of it) spent extracting features compared to decoding in CRFs or classification with logistic regression, it seems worthwhile to consider more minimal representations.

Luckily, the Java collections were designed to be easy to extend (thanks Dr. Bloch). In this post, I’ll walk you through LingPipe’s new binary map class, which implements a generic map to numbers based on a pluggable set implementation. Because these maps need to be modifiable so that they can be easily built up in feature extractors, there’s a lot of work to do in the implementation. There are so many inner classes that the whole resembles a set of nesting Russian dolls (pictured above).

Class Declaration

Skipping all the java.util imports, we declare the map to be in our util package and extend Java’s abstract map:

package com.aliasi.util;

public class BinaryMap<E> extends AbstractMap<E,Number> {
...

The return type will be a number, but the keys may be any old type as specified by the generic E.

Member Variable and Constructors

There’s only a single member variable, containing the set of items that map to 1, and it’s specified in the constructor, defaulting to a hash map:

...
    private final Set<E> mPositiveSet;

    public BinaryMap() {
        this(new HashSet<E>());
    }

    public BinaryMap(Set<E> positiveSet) {
        mPositiveSet = positiveSet;
    }
...

Implementing the Entry Set Method

Java’s abstract map class does so much work that only one method must be implemented, entrySet(), which returns a set view of the map entries underlying the map. The abstract map class can then use that to implement everything from gets and puts to size to hash code methods.

Here, we implement it to return a custom entry set (see below):

...
    @Override
    public Set<Map.Entry<E,Number>> entrySet() {
        return new EntrySet();
    }
    ...
}

I’ll fill in implementations of the other map methods later, although this one method would be enough to get the job done in terms of space savings. It’d just be inefficient in terms of speed.

The Entry Set Implementation

Like nesting Russian dolls, this implementation’s going to involve quite a few inner classes. Here’s the entry set implementation itself, which is an inner class of the top level binary map class:

    class EntrySet extends AbstractSet<Map.Entry<E,Number>> {
        @Override
        public int size() {
            return mPositiveSet.size();
        }
        @Override
        public Iterator<Map.Entry<E,Number>> iterator() {
            return new EntrySetIterator();
        }
        @Override
        public void clear() {
            mPositiveSet.clear();
        }
        @Override
        public boolean contains(Object o) {
            if (!(o instanceof Map.Entry<?,?>))
                return false;
            @SuppressWarnings("unchecked") // checked above
            Map.Entry<?,?> entry = (Map.Entry<?,?>) o;
            return ONE.equals(entry.getValue())
                && mPositiveSet.contains(entry.getKey());
        }
        @Override
        public boolean isEmpty() {
            return mPositiveSet.isEmpty();
        }
        @Override
        public boolean remove(Object o) {
            if (!(o instanceof Map.Entry<?,?>))
                return false;
            @SuppressWarnings("unchecked") // checked above
            Map.Entry<?,?> entry = (Map.Entry<?,?>) o;
            return ONE.equals(entry.getValue())
                && mPositiveSet.remove(entry.getKey());
        }
    }

Only the iterator and size method are necessary. The size method just delegates to the containing class’s positive set’s size method. The iterator method returns a custom iterator implementation we describe below, and the size method delegate’s to the containing binary map’s positive set. (That’s why it’s an inner class, not a static nested class.) The remaining methods, the clear, contains, is-emtpy and remove methods are provided for efficiency; I grayed out their code to signal its optionality.

Some Constants

I declared two constants for convenience in the top-level binary map class:

...
    public static final Number ONE = Integer.valueOf(1);

    public static final Object[] EMPTY_OBJECT_ARRAY
        = new Object[0];
...

Implementing Entry Set’s Iterator

The entry set iterator is declared as an inner class of the top-level binary map. It’s declared to implement the appropriate iterator interface. The code consists of the iterator method implementations for the has-next, next, and remove methods.

    class EntrySetIterator
        implements Iterator<Map.Entry<E,Number>> {

        private final Iterator<E> mIterator
            = mPositiveSet.iterator();

        public boolean hasNext() {
            return mIterator.hasNext();
        }
        public Map.Entry<E,Number> next() {
            return new PositiveMapEntry();
        }
        public void remove() {
            mIterator.remove();
        }
...

When it is constructed, it uses the containing binary map’s positive set to create an iterator over the positive instances. The has-next and remove methods are simply delegated to the binary map’s positive set iterator. This means that when we’re iterating over the entry set for the top-level maps, remove calls have the intended effect on the top-level map, namely removing one of its positive entries. Of course, if the positive entry set on which the map is based is immutable, the remove operation will throw an unsupported operation exception, as documented in the iterator interface documentation.

The entry itself just returns an instance of another inner class, a positive entry map.

Positive Map Entry Implementation

The positive map entry is itself an inner class declared inside of the entry set iterator (which is itself an inner class of the top-level binary map).

...
        class PositiveMapEntry implements Map.Entry<E,Number> {
            private final E mE = mIterator.next();
            public E getKey() {
                return mE;
            }
            public Number getValue() {
                return ONE;
            }
            public Number setValue(Number value) {
                throw new UnsupportedOperationException();
            }
            public boolean equals(Object that) {
                if (!(that instanceof Map.Entry<?,?>))
                    return false;
                @SuppressWarnings("unchecked") // checked above
                    Map.Entry<?,?> e1 = (Map.Entry<?,?>) that;
                Map.Entry<?,?> e2 = this;
                return
                    (e1.getKey()==null
                     ? e2.getKey()==null
                     : e1.getKey().equals(e2.getKey()))
                    &&
                    (e1.getValue()==null
                     ? e2.getValue()==null
                     : e1.getValue().equals(e2.getValue()));
            }
            public int hashCode() {
                return (getKey()==null ? 0 : getKey().hashCode())
                    ^ (getValue()==null ? 0 : getValue().hashCode());
            }
        }
   }

During construction, it just consumes the next element from the iterator’s contained positive set iterator. That’s then held locally as the value to return for the entry’s key. The value is just the constant number 1. Note that the set method throws an unsupported operation exception; it could’ve been defined to do nothing for a value of 1, remove the positive entry for the value of 0, and throw an illegal argument exception otherwise. But that would almost certainly cause the iterator from which the value was drawn to throw a concurrent modification exception.

The equality and hash code methods are described in Java’s map entry interface; they must be defined this way for compatibility with Java’s collections.

Map Methods for Efficiency

The remaining methods defined in the top level map are for efficiency. Here they are:

...
    @Override
    public Number get(Object key) {
        return mPositiveSet.contains(key)
            ? ONE
            : null;
    }

    @Override
    public Number remove(Object key) {
        return mPositiveSet.remove(key)
            ? ONE
            : null;
    }

    @Override
    public int size() {
        return mPositiveSet.size();
    }

    @Override
    public Collection<Number> values() {
        return new Values();
    }

    @Override
    public void clear() {
        mPositiveSet.clear();
    }

    @Override
    public boolean containsKey(Object o) {
        return mPositiveSet.contains(o);
    }

    @Override
    public boolean containsValue(Object o) {
        return ONE.equals(o) && !isEmpty();
    }

    @Override
    public boolean isEmpty() {
        return mPositiveSet.isEmpty();
    }

    @Override
    public Number put(E e, Number n) {
        if (!ONE.equals(n))
            throw new IllegalArgumentException();
        return  mPositiveSet.add(e) ? null : ONE;
    }
...

All of these implementations are pretty much straightforward delegations to the contained set of objects mapping to 1. The return types of methods such as the put method are defined in the map interface documentation. There are additional methods that could’ve been added for even more efficiency, such as computing hash code and equality as specified in the map interface’s documentation.

Values Collection Implementation

Alas, one more doll to unpack, the value collection implementation.

    class Values extends AbstractCollection<Number> {
        @Override
        public int size() {
            return isEmpty() ? 0 : 1;
        }
        @Override
        public Iterator<Number> iterator() {
            return new ValuesIterator();
        }
        @Override
        public void clear() {
            mPositiveSet.clear();
        }
        @Override
        public boolean contains(Object o) {
            return ONE.equals(o) && !isEmpty();
        }
        @Override
        public boolean isEmpty() {
            return mPositiveSet.isEmpty();
        }
        @Override
        public boolean remove(Object o) {
            if (!ONE.equals(o))
                return false;
            boolean removedSomething = !isEmpty();
            mPositiveSet.clear();
            return removedSomething;
        }
        @Override
        public boolean removeAll(Collection<?> c) {
            if (!c.contains(ONE))
                return false;
            boolean removedSomething = !isEmpty();
            mPositiveSet.clear();
            return removedSomething;
        }
        @Override
        public boolean retainAll(Collection<?> c) {
            if (isEmpty()) return false;
            if (c.contains(ONE)) return false;
            mPositiveSet.clear();
            return true;
        }
        @Override
        public Object[] toArray() {
            return isEmpty()
                ? EMPTY_OBJECT_ARRAY
                : new Object[] { ONE };
        }
    }
...

Values Iterator Implementation

The final inner class is the values iterator, which also gets a custom implementation.

...
    class ValuesIterator implements Iterator<Number> {
        boolean finished = mPositiveSet.isEmpty();
        boolean mayRemove = false;
        public boolean hasNext() {
            return !finished;
        }
        public Number next() {
            if (!hasNext())
                throw new NoSuchElementException();
            finished = true;
            mayRemove = true;
            return ONE;
        }
        public void remove() {
            if (!mayRemove)
                throw new IllegalStateException();
            mayRemove = false;
            mPositiveSet.clear();
        }
    }
...

This one’s a little trickier because it doesn’t delegate the remove method, but keeps track of the state. Iterators only allow their remove method to be called once after each next call.

Override Considered Helpful

Extending Java collections is a good illustration of the utility of the override attribute. If you mark a method definition with the attribute @Override and it doesn’t actually override a superclass method, the compiler throws an error.

The reason it’s so crucial for collections is that many of the collection methods are defined at object level, like set’s contain(Object) method. Even though in a type-safe world, that object is only going to be of the set’s type, given Java’s retrofitted type system, this can’t be guaranteed. The types in the collections are as strong as they can be given Java’s type system and the need for backward compatibility with pre-generics Java.

Serialization and Synchronization

I also made the map serializable using the serialization proxy pattern as supported by LingPipe’s util.AbstractExternalizable base class.

For synchronization, the class may be wrapped in the synchronized wrappers from Java’s java.util.Collections utility class.

Whew!

No wonder programming books are so darn long. And this was a relatively simple example that only took me a few hours to write and unit test. Check it out in the next LingPipe release. It’ll be paired with a binary vector implementation along the same lines.

LingPipe To-Do List

February 8, 2010 by lingpipe

I never catch up with the pile of API designs and features I have planned for the future of LingPipe. After that last release, I’d like to just list the major items on the “to-do” list for LingPipe. Comments appreciated, as always. As are feature requests. It’s not very expensive to add to this list, and you never know what’ll grab me to implement next.

Demos and Commands

We could really use command-line options for much of our technology. I think we’re losing market share because LingPipe requires Java coding. A good start might be a command-line package for classifiers like all the other ones out there to allow users to plug and play.

We could also use more demos. Right now, we just have a few annotation demos up, but that’s only a small part of what we do.

External Interface Implementations

We’ve never really gotten around to properly integrating large chunks of LingPipe in deployment container wrappers such as Apacche UIMA or development environments such as U. Sheffield’s GATE system.

We don’t support any kind of RDF output for the semantic web.

We’ve pretty much stopped trying to write XML output format wrappers for everything.

Clusters and Clouds

We’ve also not provided any kind of wrappers to make it easy for people to run large LingPipe jobs on local clusters or distributed cloud computing platforms like Amazon’s EC2.

Multi-threaded Implementation

Although almost all of the run time classes for LingPipe are thread safe, at least for read operations like classification or clustering or chunking. But what we don’t have is any kind of threading in our base classes. I just bought a quad-core notebook, so it’s probably time to start thinking about this.

There are all sorts of opportunities for concurrency in basic LingPipe classes, from K-means clustering to the per-epoch log loss reports in logistic regression or CRFs to any of the cross-validating evaluations. The tricky part is concurrent training, though that’s also possible for approaches such as EM. And would be possible if I reorganized logistic regression or CRFs to more directly support blocked updates, because each instance in a block’s gradient may be computed independently.

Improvements and Generalizations

Many of our modules are not written as generally as possible, either at the level of API or the level of algorithms.

  • Fully online stochastic gradient for logistic regression, conditional random fields (CRF), and singular value decomposition (SVD)
  • All iterative algorithms producing iterators over epochs. Right now, only latent Dirichlet allocation (LDA) is set up this way, but I could do this for semi-supervised expectation-maximization (EM) classifier training, and all the SGDs for logistic regression, CRFs and SVD.
  • Refactoring SVD to produce iterators over states/dimensions
  • Weighted examples for just about anything trainable from LM classifiers to logistic regression; this would make it trivial to train on probabilistic examples by weighting categories by probability.
  • Entropy-based pruning for language models.
  • Information gain for feature selection; minimum count feature selection
  • Generalize min-max heaps to take a java.util.Comparator rather than requiring entries to implement LingPipe’s util.Scored.
  • Soft k-means abstracted from demo into package for clustering
  • More efficient vector processing for logistic regression, CRFs, etc., where there is no map from strings to numbers, and not necessarily even a vector processed. In most of these modules, we need to compute a vector dot-product with a coefficient vector, not actually construct a feature vector. Ideally, we could go all the way to Vowpal-Wabbit-like implicit feature maps.
  • Integrate short priority queues where appropriate, because they’re more efficient than the general bounded priority queues we’re currently using
  • More general priors and annealing for SVD to match the other versions of SGD.
  • More efficient sorted collection with removes for more efficient hierarchical clustering.
  • Removing all the specific corpus.Handler subinterfaces other than ObjectHandler. Then I can generalize cross-validating corpora, and just have the object handled as a type parameter for corpus.Parser and corpus.Corpus.
  • Add iterator methods to corpora that can enumerate instances rather than requiring handling via callbacks to visitors.
  • Privatize everything that can be.
  • Finalize methods and classes that shouldn’t be overridden. I’ve been very sloppy about this all along.
  • More agressively copy incoming arguments to constructors and wrap getters in immutable views. Later classes are much better than earlier ones at doing this. (Maybe I should just say re-read Effective Java and try one more time to apply all its advice.)

Rearrangements

So many things should move around that it’d be impossible to mention them all. For instance, putting all the HMM classes in the HMM package, and all the LM classes in the LM package, for a start.

I’m planning to move most of the corpus-specific parsers (in corpus.parsers) out of LingPipe to wherever they’re used.

I’m also planning to move the entire MEDLINE package to the sandbox project lingmed.

I should rename classes like util.Math which conflict with Java library class names.

New Modules

  • Wikipedia Wikitext parser (probably not for LingPipe proper)
  • Probabilistic context-free grammar (PCFG) parser. Possibly with Collins-style rules.
  • Discriminative statistical context-free grammars with more global tree kernel features
  • Dependency parsers with the same characteristics as the CFGs.
  • Linear support vector machines (SVM) with regularization via SGD.
  • Morphological analyzer (to work as a stemmer, lemmatizer or feature generator), preferably with semi-supervised learning. I’d like to take an approach that blends the best aspects of Yarowsky and Wicentowski’s morphology model and Goldwater and Johnson’s context-sensitive Bayesian models.
  • Some kind of feature language for CRFs
  • More finely synched cache (along the lines of that suggested in Goetz et al.’s awesome Java Concurrency in Practice)
  • Logistic regression for a long-distance, rescoring tagger or chunker
  • Longer-distance maximum-entropy Markov model (MEMM) type taggers and chunkers; with a greedy option as discussed by Ratinov and Roth.
  • Higher-order HMM rescoring taggers and chunkers
  • More efficient primitive collections; (I just finished a map implementation for boolean features)
  • Unions, differences, etc. for feature extractors
  • Decision tree classifiers
  • Meta-learning, like boosting (requires weighted training instances)
  • Jaro-Winkler (or other edit distances) trie versus trie (scalable all versus all processors based on prefix tries).
  • Prune zeros out of coefficient vectors and symbol tables for classifiers, CRFs, etc.
  • Standard linear regression (in addition to logistic).
  • Other loss functions for linear and generalized regressions, such as probit and robit.
  • Dirichlet-process-based clusterer
  • Discriminative choice analysis (DCA) estimators, classifiers and coref implementation (the base classes are nearly done).

Please let us know if there’s something you’d like to see on this list.

Inferring Splice-Variant mRNA Expression from RNA-Seq

February 5, 2010 by lingpipe

I introduced the problem of splice-variant expression from RNA-Seq data in my previous post on maximum likelihood estimates of RNA-Seq expression. In this post, I’ll show you a Bayesian approach to the problem from which we can compute a full posterior over the latent read mappings and relative mature mRNA expression of gene splice variants.

The Model

The data from which we infer expression is

  • K \in \mathbb{N}^{+}, the number of splice variants,
  • N \in \mathbb{N}^{+}, the number of reads, and
  • y_1,\ldots,y_N, the reads.

For the purpose of this model, the reads are likely to be characterized in terms of the output of a sequencer, such as base calls with quality scores.

We assume two model hyperparameters,

  • \varphi, genetic variation, sample preparation, and sequencing parameters, and
  • \alpha_1,\ldots,\alpha_K \in \mathbb{R}^{+}, the prior read counts (plus one).

The general-purpose parameter \varphi reflect how reads are generated, including the expected variation from the reference genome, the sample preparation process, and the sequencing platform’s error profile. The variation from the reference genome will determine indel and SNP rates, for instance using a probabilistically normalized affine edit distance model. Details of the sample preparation process, such as amplicon selection, fractionation technique, and so on, also determine which mRNA sequences are actually observed (through cDNA). Finally, the sequencing platform’s error profile, such as color confusability in image space, decreasing accuracy from the start to end of the read, and sensitivity to context such as the previously read gene (as in color-space dependencies in SOLiD or errors due to multiple copies of the same base).

We will infer two of the model parameters,

  • t_1,\ldots,t_N \in 1:K, the mapping of read to splice variant, and
  • \theta_1,\ldots,\theta_K \in [0,1] such that \sum_{k=1}^K \theta_k = 1, the read expression probabilities.

The model structure in sampling notation is

  • \theta \sim \mbox{\sf Dirichlet}(\alpha)
  • t_n \sim \mbox{\sf Discrete}(\theta) for n \in 1:N
  • y_n \sim \mbox{\sf Channel}(t_n,\varphi) for n \in 1:N.

We select the expression levels \theta based on prior counts.

The heart of the mapping model is the discrete sampling of the latent mappings t_n based on the expression level parameter \theta. This is effectively a beta-binomial model of expression at the corpus level.

The mysterious part of this whole thing is the channel model, which assigns the probability of a given read y_n being observed given it arose from the splice variant t_n under the sequencing model parameterized by \varphi.

Simple Channel Examples

For purposes of illustration, we’re going to assume a really simple instance of this model.

First, we’ll assume that \alpha_k = 1 for k \in 1:K, which renders \mbox{\sf Dirichlet}(\alpha) a uniform distribution over possible values of \theta. With replicates at the sequencing run level (or even over channels for Solexa/Illumina data), we could build a hierarchical model and estimate \alpha.

Second, we’ll assume a simple uniform read model. Specifically, we’ll assume that a read is just as likely to come from anywhere on the mRNA; a richer model would account for known variation due to the experimental procedure.

Third, we’ll assume a mapping program that aligns reads with the reference gene models for splice variant sequences. If the mapper provides multiple reads, we will treat them all as equally likely; in a more articulated model, we would take into account both sequencing quality scores, the error properties of the sequencer, and likely variation such as indel and SNP rates in estimating the probability of a read given a splice variant.

Although cross-junction splice mapping is straightforward to model versus a known reference gene model, we’ll restrict attention to exon reads here.

With all of these assumptions in place, if read y_n is mapped to splice variant t_n \in 1:K, we set

\mbox{\sf Channel}(y_n|t_n,\varphi) = 1 / (\mbox{len}(t_n) - \mbox{len}(y_n)).

The denominator \mbox{len}(t_n) - \mbox{len}(y_n) reflects the number of positions a sequence of the given read length \mbox{len}(y_n) may occur in an mRNA of length \mbox{len}(t_n).

A Simple BUGS Implementation

Although not scalable to RNA-Seq data sizes, it’s simple to investigate the properties of our model in BUGS.

model {
    theta[1:K] ~ ddirch(alpha[1:K])
    for (n in 1:N) {
        t[n] ~ dcat(theta[1:K])
        y[n] ~ dcat(rho[t[n],1:K])
    }
}

BUGS programs look just like the sampling notation defining the model; ddirch is the Dirichlet distribution and dcat is the discrete (aka categorical) distribution.

I call it from R in the usual way. Here’s just the part where R tells BUGS how to run the model. After the BUGS window is closed, R plots and attaches the data.

library("R2WinBUGS")

data <- list("alpha", "rho", "y", "N", "K")
parameters <- c("theta","t")
inits <- function() { list(theta=c(1/3,1/3,1/3)) }
rnaexp <- bugs(data, inits, parameters,
       "c:/carp/devmitz/carp/isoexp/src/Bugs/isoexp-multi.bug",
        n.chains=3, n.iter=50000,
        debug=TRUE, clearWD=TRUE,
        bugs.directory="c:\\WinBUGS\\WinBUGS14",
        DIC=FALSE)
print(rnaexp)
plot(rnaexp)
attach.bugs(rnaexp)

The other debugging problem I had is that you absolutely need to tell BUGS not to compute DIC (a kind of model comparison stat); otherwise, BUGS crashes when trying to compute DIC.

Attaching the data converts the parameter of the model to matrix data representing the Gibbs samples.

An Example

I’ll close with an example that illustrates how the R and BUGS program work, and how the model is able to infer expression even without a single non-multi-mapped read. The gene model and expression levels for the simulation are defined in a comment in the R code. We’ll consider a case with three exons and three isoforms:

# exons  A BB C  (B twice as long as others)
# var1 = A BB
# var2 =   BB C
# var3 = A    C

There are three exons, A, BB, and C, where the notation is meant to indicate the middle exon is twice as long as the others (to spice up the read calculation a bit). There are three splice variants, labeled var1, var2, and var3. Variant 1 is made up of exons A and BB, variant 2 of exons BB and C, and varient 3 of exons A and C.

Without cross-junction reads, it’s not possible to get a uniquely mapped read. Any read mapping to exon A will match splice variants 1 and 3, any read mapping to exon BB will match variants 1 and 2, and any read mapping to exon C will match 2 and 3.

Next, we specify the number of samples drawn from each exon in each splice variant.

#       exon
# var  1  2  3  Total theta*
# 1    1  2    | 3     3/19
#
# 2       4  2 | 6     6/19
#
# 3    5     5 | 10    10/19
#      --------
# tot  6  6  7   18

There are three samples from variant 1, 6 samples from variant 2, and 10 samples from variant 3. These are distributed uniformly across the exons. Here, variant 1 has 1 sample from exon 1 and 2 samples from exon 2, variant 2 has 4 samples from exon 2 and 2 samples from exon 3, whereas variant 3 has 5 samples each from exon 1 and exon 3.

If we adjust for length, variant 1 has 1 mRNA transcript, variant 2 has 2 transcripts, and variant 3 has 5 transcripts; these are computed by dividing by length. But our expression parameter theta (\theta) is defined by reads, not by transcripts.

The observed data is much simpler — it’s just the total number of reads for samples mapping to each exon. We see that there are 6 samples mapping to exon 1, 6 samples mapping to exon 2, and 7 samples mapping to exon 3.

Coding the Example in R

We set the Dirichlet prior to be uniform, whihc also determines our number of splice variants K:

alpha <- c(1,1,1)
K <- length(alpha)

We actually assume there are 10 times that many samples as in our figure, setting the read data as:

y <- c(rep(1,60),rep(2,60),rep(3,70))
N <- length(y)

That is, there are 60 reads from exon 1, 60 reads from exon 2 and 70 reads from exon 3, and N=190.

Finally, we use R’s matrix constructor to set up the read model:

rho <- matrix(c(1/3,2/3,0,  0,2/3,1/3,  1/2,0,1/2),
              nrow=3,ncol=3,byrow=TRUE)

I tried to make this human readable: a sample from the first splice variant has a 1/3 chance of originating from exon 1 and a 2/3 chance of originating from exon 2 (because exon 2 is twice as long as exon 1). A sample from the third splice variant has an equal chance of being from either exon 1 or exon 3 (exon 1 and exon 3 are the same length).

The Output

I ran the R program, which called BUGS, which terminated neatly and returned to R. The \hat{R} values were all pinned at 1.0, indicating good chain mixing and hence convergence of the sampler.

Bayeisan Point Estimate

We can compute any statistics we want now with the posterior samples. For instance, theta[,1] contains all the Gibbs samples for the variable \theta_1.

We can then generate the Bayesian (L2) estimates for expression levels:

> thetaHat = c(mean(theta[,1]), mean(theta[,2]), mean(theta[,3]))
> print(thetaHat,digits=2)
[1] 0.17 0.31 0.52

The “right” answer for our simulation is also easy to compute with R:

> print(c(3/19, 6/19, 10/19), digits=2)
[1] 0.16 0.32 0.53

Our estimates are pretty darn close. (Recall we can normalize these sample counts by length to get mRNA transcript counts for each variant.)

How certain are we of these estimates? Not very, because there were only 190 samples and lots of mapping uncertainty. It’s easy to estimate posterior 95% intervals, 50% intervals, and medians using R’s quantile function over the Gibbs samples:

> print(quantile(theta[,1], c(0.025,0.25,0.5,0.75,0.975)),
        digits=2)
 2.5%   25%   50%   75% 97.5%
0.025 0.104 0.167 0.231 0.340
> print(quantile(theta[,2], c(0.025,0.25,0.5,0.75,0.975)),
        digits=2)
 2.5%   25%   50%   75% 97.5%
 0.13  0.25  0.31  0.37  0.47
> print(quantile(theta[,3], c(0.025,0.25,0.5,0.75,0.975)),
       digits=2)
 2.5%   25%   50%   75% 97.5%
 0.42  0.49  0.52  0.56  0.61

With more data, and especially with uniquely mapping reads (e.g. across unique junction boundaries or in unique exons for a variant), posterior uncertainty will be greatly reduced.

A Triangle Plot and Correlated Expressions

Given that we only have three splice variants, we can create a so-called triangle plot of the expression levels, which will show correlations among the expression levels estimated for the variants on the simplex. Here’s the plot:

I couldn’t figure out how to label the axes in R’s ade4 package, but it’s clear from the means that are labeled that the upper left axis (mean of 0.17) is for splice variant 1, the bottom axis (mean of 0.308) is for variant 2, and the top right axis (mean of 0.522) for variant 3.

It takes some practice to view triangle plots like this; it’s perhaps easier to view two at a time. Here’s the two-way interaction plots:

two-way plots of RNA-seq expression 1 vs 2

two-way plots of RNA-seq expression 1 vs 3

two-way plots of RNA-seq expression 2 vs 3

These show the strong negative correlation between estimates of the expression for variants 1 and 2, but little correlation between these and variant 3. This makes sense given variants 1 and 2 share the large exon BB.

Finally, here’s the source that generated the PNG file for the triangle plot:

library("ade4")
png(file="splice-variant-exp-tri.png")
triangle.plot(theta,addmean=TRUE,labeltriangle=TRUE,
    draw.line=TRUE,show.position=FALSE,scale=FALSE)
dev.off()

and for the ordinary plots:

png(file="splice-variant-exp-12.png")
plot(theta[,c(1,2)],xlab="variant 1",ylab="variant 2",
  xlim=c(0,1),ylim=c(0,1),pch=20,main="1 vs. 2")
dev.off()

png(file="splice-variant-exp-13.png")
plot(theta[,c(1,3)],xlab="variant 1",ylab="variant 3",
  xlim=c(0,1),ylim=c(0,1),pch=20,main="1 vs. 3")
dev.off()

png(file="splice-variant-exp-23.png")
plot(theta[,c(2,3)],xlab="variant 2",ylab="variant 3",
  xlim=c(0,1),ylim=c(0,1),pch=20,main="2 vs. 3")
dev.off()

Evaluating with Unbalanced Training Data

February 4, 2010 by lingpipe

For simplicity, we’ll only consider binary classifiers, so the population consists of positive and negative instances, such as relevant and irrelevant documents for a search query, or patients who have or do not have a disease.

Quite often, the available training data for a task is not balanced with the same ratio of positive and negative instances as the population of interest. For instance, in many information retrieval evaluations, the training data is positive biased because it is data annoted from the top results of many search engines. In many epidemiological applications, there is also positive bias because a study selects from a high risk subpopulation of the actual population.

Even with unbalanced training data, we might still want to be able to calculate precision, recall, and similar statistics for the actual population. It’s easy if we know (or can estimate) the true percentage of positive instances in the population.

Specificity and Sensitivity vs. Precision and Recall

Using the usual notation for true and false positives and negatives,

\mbox{sens} = \mbox{TP}/(\mbox{TP} + \mbox{FN}), and

\mbox{spec} = \mbox{TN}/(\mbox{TN} + \mbox{FP}).

Sensitivity and specificity are accuracies on positive (\mbox{TP} + \mbox{FN}) and negative cases (\mbox{TN} + \mbox{FP}), respecitively.

Prevalence for a sample may be calculated from the true and false positive and negative counts, by

\pi = (\mbox{TP} + \mbox{FN})/(\mbox{TP} + \mbox{FP} + \mbox{TN} + \mbox{FN}).

Recall is just sensitivity, but precision is the percentage of positive responses that are correct, namely

\mbox{prec} = \mbox{TP}/(\mbox{TP} + \mbox{FP}).

Prevalence Adjusted Contingency Matrix

Suppose we know the test population prevalence \pi', the probability of a random population member being a positive instance. Then we can adjust evaluation statistics over a training population with any ratio of positive to negative examples by recomputing expected true and false positive and negative values.

The expected true and false positive and negative counts in test data with prevalence \pi' over N samples, given test sensitivity and specificity \mbox{sens} and \mbox{spec}, are

\mbox{TP}' = N \cdot \mbox{sens} \cdot \pi',

\mbox{TN}' = N \cdot \mbox{spec} \cdot (1 - \pi'),

\mbox{FP}' = N \cdot (1 - \mbox{spec}) \cdot (1 - \pi'), and

\mbox{FN}' = N \cdot (1-\mbox{sens}) \cdot \pi'.

Sensitivity and Specificity are Invariant to Prevalence

Although the adjusted contingency matrix counts (true and false positive and negatives) vary based on prevalence, sensitivity and specificity derived from them do not. That’s because specificity is accuracy on negative cases and sensitivity the accuracy on positive cases.

Plug-In Population Statistic Estimates

Precision, on the other hand, is not invariant to prevalence. But we can compute its expected value in a population with known prevalence using the adjusted contingency matrix counts,

\mbox{prec}' = \mbox{TP}'/(\mbox{TP}' + \mbox{FP}').

The counts N all cancel; we can really work with true and false positive and negative rates instead of counts.

Related Work

This is related to my earlier blog post on estimating population prevalence using a classifier with known bias and an unbiased population sample:

The approach described in the linked blog post may be used to estimate the population prevalence \pi' given only an arbitrarily biased labeled training set and an unlabeled and unbiased sample from the population.

LingPipe 3.9 Released

February 3, 2010 by lingpipe

Yatta! CRFs for tagging and chunking and a whole lot more are out and available for download from the usual place:

The home page details the updates. Major changes in addition to CRFs include a new tag package to which HMMs were retrofitted, serialization for tokenized LMs, updates to 2010 MEDLINE, and some new utility methods and classes.

4.0 Up Next

We’re no longer planning to adopt the AGPL license, but we are still planning major changes for 4.0, including getting rid of all the currently deprecated classes and methods. Many existing classes are likely to be relocated.

Hopefully this will be done soon, as we’re not planning any new functionality for 4.0. Yet.

Programming esprit d’escalier

February 3, 2010 by lingpipe

Mixing Metaphors

Last night I was struck by a bolt of esprit d’escalier.*

Anecdotal Evidence

I was refactoring our part-of-speech tutorial to the new tagging interface introduced for CRFs. I’d spent over an hour stuck on a bug for unknown token evaluations, which had evocatively repeatable bug patterns. Finally, I had the sense to give up, get on the subway, and go home. As soon as I was sitting on my sofa watching Defying Gravity, it occurred to me what the problem was. I was updating the known token set for the initial segment of the test set and then recalculating test set scores. D’oh!

Down the Garden Path

The problem is that your brain gets into a rut. To borrow a metaphor from psycholinguistics, you wander down a programming garden path.

The Zen of Being Your Own Rubber Ducky

Way back in high school (post-internt, pre-web), Douglas Hofstadter’s book Gödel, Escher, Bach introduced me to the joy of the kōan. I liked the one about running around the house without thinking about white elephants. That is exactly how it feels when you’re down the garden path, trying to escape the labyrinth to reach enlightenment.

I contend that if you were good enough at zen meditation, you could be your own rubber ducky.


* The Wikipedia translates the expression “esprit d’escalier” as “staircase wit” and defines it succintly as “thinking of a clever comeback when it is too late”. I love the Wikipedia; they even point to an esprit d’escalier-themed Seinfeld episode, “The Comeback”.

Hunt and Thomas introduced the term “rubber ducky” in their book The Pragmatic Programmer. It’s about how useful it is to explain your problem out loud, even if only to a rubber ducky. The best rubber ducky is another programmer who actually understands what you’re talking about at some level. Mitzi and I have found each other more useful than our cats as rubber duckies.

Comparing Precision-Recall Curves the Bayesian Way?

January 29, 2010 by lingpipe

Back Story

There’s an interesting thread on the BioNLP mailing list (here’s a link to the publicly readable thread). The thread was kicked off when Andrew Clegg asked:

Suppose I have two precision-recall curves for two different IR algorithms on the same test data and query.

What would be the correct statistical test to determine if they are significantly different?

I’m particularly sympathetic to (part of) Bill Hersh’s second response.

… All statistical inference tells you is how likely your differences are due to chance. It says nothing about whether the difference is important. …

to which Andrew replied

… in my case it’s the other way round, I think we’ve already displayed a clear benefit in terms of usability, I’m just thinking of including significance testing in case a very picky reviewer asks if it’s down to chance.

Can you say CYA?

There were a host of classical tests proposed including the new-to-me Page trend test. Others suggesting taking an aggregate statistic like area-under-the-curve and doing a standard difference test.

ROC Curves vs. PR Curves

There’s much more literature on receiver operating characteristic curves (aka ROC curves) than precision-recall (PR) curves. Just as a refresher, if we take a given number of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN), precision, recall, sensitivity and specificity are:

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Recall = Sensitivity = TP / (TP + FN)

Precision = TP / (TP + FP)

If we have a ranked list of outputs for which we know their gold-standard category (relevant or not in the case of IR), we can consider initial segments of the ranked list and compute the precision/recall at that point. The ROC curve plots 1-specificity (the false positive rate) versus sensitivity (the true positive rate) at various operating points for an algorithm. The PR curve plots precision versus recall.

In PR curves, it is often the case that only “interpolated” results are considered, which means taking the convex shell of the curve. This is easily accomplished by making sure the function is non-decreasing by propagating later higher results back. For instance, if precision at 1 recall is 1/2 and precision at 2 recall is 2/3 and at 3 recall is 1/2, then the precision at 1 recall is interpolated to 2/3 so the function is non-decreasing. (I’m not sure what this does to hypothesis testing.)

There’s a lot more information and examples in the class documentation for

The Bayesian Way

It’s in exactly these kinds of situations where a Bayesian approach seems natural.

Just add a prior and compute a posterior over entire PR curves. Then compare the PR curves any way you want. Say by calculating how likely it is for one PR curve to be above the other at every operating point. Or how likely it is for one to be above the other at a given operating point (say 99% recall). Or how likely one is to be substantially better than another, or how much you expect one to be higher than another.

See my earlier post on Bayesian counterpart to Fisher’s exact test on contingency table data.

The nifty part is that “adjustments” for multiple comparisons are built in. So we can ask what’s the posterior probability of one system having at least 1% higher precision than all others at 99% recall.

But once we’re doing multiple comparisons, it makes sense to build a hierarchical model. Often, a system will report results on multiple queries, so there’s a set of precision-recall curves for each system, leading to two levels of grouping.

I provide an example of Bayesian multiple comparisons in a hierarchical model in my earlier blog post on Bayesian batting ability with multiple comparisons.

Will it Work?

How do I model the posterior PR (or ROC) curve? From that, a regular or interpolated PR (or ROC) curve can be generated if desired.

I’m specifically worried about precision and recall sharing the TP term in their numerators and denominators; ROC curves seem simpler.

I’m also worried about the low counts at the low recall end of the curves.

And about the correlations among the different operating points.

It doesn’t seem that a naive independent binomial for each recall point would be justified, though it’d make the posteriors very clean, especially with uniform priors.

For instance, suppose that after the first 5 true positives in the systems’ ranked lists, we have seen 7, 10, 4, 11, and 21 false positives for the five systems? The maximum likelihood point estimate of precision with a uniform prior is then 5/12, 5/15, 5/9, 5/16, and 5/26. Can we just use Beta(6,13), …, Beta(6,27) as the posteriors for comparison? Somehow it just seems wrong.

Any Ideas?

Does anyone know how to compute PR or ROC posteriors appropriately?

Does Anyone Have Data?

Andrew, or anyone: if you have the ranked data and gold standard answers, I’d love to code this up in R/BUGS and see what it looks like in the context of data someone cares about.

The Long Road to CRFs

January 28, 2010 by lingpipe

CRFs are Done

The first bit of good news is that LingPipe 3.9 is within days of release. CRFs are coded, documented, unit tested, and I’ve even written a long-ish tutorial with hello-world examples for tagging and chunking, and a longer example of chunking with complex features evaluated over:

And They’re Speedy

The second bit of good news is that it looks like we have near state-of-the-art performance in terms of speed. It’s always hard to compare systems without exactly recreating the feature extractors, requirements for convergence, hardware setup and load, and so on. I was looking at

for comparison. Okazaki also evaluated first-order chain CRFs, though on the CoNLL 2000 English phrase chunking data, which has fewer tags than the CoNLL 2003 English named entity data.

While my estimator may be a tad slower (it took about 10s/epoch during stochastic gradient), I’m applying priors, and I think the tag set is a bit bigger. (Of course, if you use IO encoding rather than BIO encoding, like the Stanford named entity effort, then there’d be even fewer tags; on the other hands, if I followed Turian et al. (ref below), or the way we handle HMM encoding, there’d be more.)

It also looks like our run time is faster than the other systems benchmarked if you don’t consider feature extraction time (and I don’t think they did in the eval, but I may be wrong). It’s running at 70K tokens/second for full forward-backward decoding; first-best would be faster.

All the Interfaces, Please

Like for HMMs, I implemented first-best, n-best with conditional probabilities, and a full forward-backward confidence evaluation. For taggers, confidence is per tag per token; for chunkers, it’s per chunk.

Final Improvements

Yesterday, I was despairing a bit over how slow my approach was. Then I looked at my code, instrumented the time spent in each component, and had my D’oh! moment(s).

Blocked Prior Updates

The first problem was that I was doing dense, stochastic prior updates. That is, for every instance, I walked over the entire set of dense coefficient vectors, calculated the gradient, and applied it. This was dominating estimation time.

So I applied a blocking strategy whereby the prior gradient is only applied every so often (say, every 100 instances). This is the strategy discussed in Langford, Li and Zhang’s truncated gradient paper.

I can vouch for the fact that result vectors didn’t change much, and speed was hugely improved to the point where the priors weren’t taking much of the estimation time at all.

Caching Features

The other shortcoming of my initial implementation was that I was extracting features online rather than extracting them all into a cache. After cleaning up the prior, feature extraction was taking orders of magnitude longer than everything else. So I built a cache, and added yet another parameter to control whether to use it or not. With the cache and rich feature extractors, the estimator needs 2GB; with online feature extraction, it’s about 20 times slower, but only requires around 300MB of memory or less.

Bug Fixes

There were several subtle and not-so-subtle bugs that needed to be fixed along the way.

One problem was that I forgot to scale the priors based on the number of training instances during estimation. This led to huge over-regularization. When I fixed this problem, the results started looking way better.

Structural Zeros

Another bug-like problem I had is that when decoding CRFs for chunkers, I needed to rule out certain illegal tag sequences. The codec I abstracted to handle the encoding of chunkers and taggers and subsequent decoding could compute legal tag sequences and consistency with tokenizers, but the CRF itself couldn’t. So I was getting illegal tag sequences out that caused the codec to crash.

So I built in structural zeros. The simplest way to do it seemed to be to add a flag that only allowed tag transitions seen in the training data. This way, I could enforce legal start tags, legal end tags, and legal transitions. This had the nice side benefit of speeding things up, because I could skip calculations for structural zeros. (This is one of the reasons Thorsten Brants’ TnT is so fast — it also applies this strategy to tags, only allowing tags seen in training data for given tokens.)

Feature Extraction Encapsulation

I was almost ready to go a couple of days ago. But then I tried to build a richer feature extractor for the CoNLL entity data that used part-of-speech tags, token shape features, contextual features, prefixes and suffixes, etc. Basically the “baseline” features suggested in Turian, Ratinov, Bengio and Roth’s survey of cluster features (more to come on that paper).

It turns out that the basic node and edge feature extractors, as I proposed almost six months ago, weren’t quite up to the job.

So I raised the abstraction level so that the features for a whole input were encapsulated in a features object that was called lazily by the decoders and/or estimator. This allowed things like part-of-speech taggings to be computed once and then cached.

This will also allow online document features (like previous tagging decisions) to be rolled into the feature extractor, though it’ll take some work.

And a Whole Lotta’ Interfaces and Retrofitting

I added a whole new package, com.aliasi.tag, to characterize the output of a first-best, n-best, and marginal tag probability tagger. I then implemented these with CRFs and retrofitted them for HMMs. I also pulled out the evaluator and generalized it.

Along the way, I deprecated a few interfaces, like corpus.ChunkHandler, which is no longer needed given corpus.ObjectHandler<Chunking>.

Still No Templated Feature Extraction

Looking at other CRF implementations, and talking to others who’d used them, I see that designing a language to specify feature extractions is popular. Like other decisions in LingPipe, I’ve stuck to code-based solutions. The problem with this is that it limits our users to Java developers.

Dekel and Shamir (2009) Vox Populi: Collecting High-Quality Labels from a Crowd

January 22, 2010 by lingpipe

Aleks just sent me a pointer to this paper from last year’s COLT:

The paper presents, proves theorems about, and evaluates a very simple idea in the context of crowdsourcing binary classifier data coding. Unlike other approaches I’ve blogged about, Dekel and Shamir use only one coder per item, and often coders only label a handful of items.

They train an SVM on all the labels, then use the SVM to evaluate coders. They then prune out low-quality coders using the SVM as a reference.

This paper reminded me of (Brodley and Friedl 1999), which was also about using multiple trained classifiers to remove bad labels. Brodley and Friedl remove items on an item by item basis rather than on an annotator by annotator basis.

This paper is also somewhat reminiscent of (Raykar et al. 2009), who train a classifier and use it as another annotator.

Lots of Theory

There’s lots of theory saying why this’ll work. It is a COLT paper after all.

Evaluation: Web Page Query Relevance

They ran an eval on Amazon’s Mechanical Turk asking people to judge a pair consisting of a web query and web page as to whether the page is relevant to the query.

They used 15 coders per query/page pair in order to define a gold standard by voting. They then evaluated their approach using only one coder per item by subsampling their larger data set.

One thing I noticed is that as they lowered the maximum number of items labeled by an annotator, their average annotator accuracy went up. This is consistent with our findings and those in Snow et al. that the spammy Mechanical Turkers complete a disproportionate number of tasks.

With only one coder per item, Dekel and Shamir can’t evaluate the way everyone else is evaluating crowdsourcing, because they know their resulting data will remain noisy because it’s only singly annotated.

Estimates of coder sensitivity and specificity could be made based on their performance relative to the SVM’s best guesses. That’d provide some handle on final corpus quality in terms of false positives and negatives relative to ground truth.

Rather than evaluating trained classifier performance, Dekel and Shamir measure the “noise level” of the resulting data set after pruning. What we really care about in practice is how much pruning bad annotators helps train a more reliable classifier (or help evaluate if that’s the goal). They discuss this kind of end-to-end evaluation under “future work”! It’s really striking how different the evaluation versus theory requirements are for different conferences.

Marginalizing Latent Variables in EM

January 19, 2010 by lingpipe

After getting carried away in my last post on this topic, Maximum Likelihood Estimates of Expression in RNA-Seq are Winner-Take-All Biased, I want to explain what went wrong when I “turned the crank” (an idiom in mathematics for solving an equation by brute force without thinking too much about it).

Expression Mixture Model

Recall that we were looking at a mixture model, which is the bread and butter of maximum likelihood estimators like EM. We defined a joint likelihood function for a sequence of observed reads r, gene or splice variant sources g, and mRNA expression \theta, setting

p(r,g|\theta) = \prod_{n=1}^N p(r_n,g_n|\theta) = \prod_{n=1}^N p(r_n|g_n) \ p(g_n|\theta).

Maximum Likelihood

Maximum likelihood is usually stated as finding the model parameters that maximize the likelihood function consisting of the probability of the observed data given the model parameters.

What’s a Parameter?

So that’s what I did, calculating

g^*, \theta^* = \arg \max_{g,\theta} p(r|g,\theta).

The problem is that there’s an implicit “except the latent parameters” inside the usual understanding the MLE.

What I should have done is marginalized out the latent gene sources g, taking

\theta^* = \arg \max_{\theta} p(r|\theta).

That requires marginalizing out the latent mapping of reads r_n to gene or splice variant sources g_n,

p(r|\theta) = \prod_{n=1}^N p(r_n|\theta_n) = \prod_{n=1}^N \sum_{k = 1}^K p(r_n|k) \ p(k|\theta).

On a log scale, that’s

\log p(r|\theta) = \sum_{n=1}^N \log \sum_{k = 1}^K p(r_n|k) \ p(k|\theta).

The Example

I’ll repeat the example here for convenience:

Isoform Expression Example

Suppose I have a gene with two isoforms, A and B, spliced from three exons, 1, 2, and 3. For simplicitly, we’ll suppose all the exons are the same length. Suppose isoform A is made up of exon 1 and exon 2, and isoform B of exon 1 and 3. Now suppose that you run your RNA-Seq experiment and find 70 reads mapping to exon 1, 20 reads mapping to exon 2, and 50 reads mapping to exon 3. In table form, we have the following mapping counts:

  Exon 1 Exon 2 Exon 3 Total
Iso 1 N 20 0 20+N
Iso 2 70-N 0 50 120-N
Total 70 20 50 140

The 20 reads mapping to exon 2 must be part of isoform 1, and similarly the 50 reads mapping to exon 3 must belong to isoform 2. That leaves the 70 reads falling in exon 1 to spread in some proportion between isoform 1 and isoform 2.

Turning the Crank In the Right Direction

By assuming each splice variant consists of two exons of the same length, the probability of a read in an exon is 1/2 for each exon in the splice variant and 0 for the exon not in the variant.

Now when we turn the crank, we see that

\log p(r|\theta) = \sum_{n=1}^N \log \sum_{k=1}^K p(r_n|k) p(k|\theta)

{} \ \ \ \ =  70 \log \sum_{k=1}^K p(1|k) p(k|\theta) + 20 \log \sum_{k=1}^K p(2|k) p(k|\theta)  + 50 \log \sum_{k=1}^K p(3|k) p(k\theta)

= 20 \log \frac{\theta_1}{2} + 50 \log \frac{\theta_2}{2} + 70 \log (\frac{\theta_1}{2} + \frac{\theta_2}{2})

= 20 \log \theta_1 + 50 \log \theta_2 + (70 \log 1 -20\log 2 - 50 \log 2 - 70 \log 2).

The reduction on the third line is because the probability of a read in the second exon, p(2|k), is only non-zero for isoform k=1, and similarly for a read in the third exon, where p(3|k) is nly non-zero for the second gene or splice variant, k=2. The final reduction follows because \theta_1 + \theta_2 = 1.

Not so Biased After All

After marginalizing out the read mappings g, the MLE for expression is right where we’d expect it, at the solution of

\frac{d}{d\theta} \log p(r|\theta) = 0,

which is

\theta^*=20/70.

It turns out EM isn’t so biased for these discrete mixtures, at least when you turn the crank the right way.