Archive for the ‘LingPipe News’ Category

LingPipe Book Draft Available

August 5, 2010

We’ve started work on a proper LingPipe book. You can download the current partial draft of the book and sample code from:

In case you were wondering why the blog’s been quieter these days, this is it!

Our Goals

Our goal is to produce something with a little more breadth and depth and much more narrative structure than the current LingPipe tutorials. Something that a relative Java and natural language processing novice could work through from beginning to end, coming out with a fairly comprehensive knowledge of LingPipe and a good overview of some aspects of natural language processing.

We’re not trying to write an introduction to Java, but there’s a lot more detail on Java in the book than in LingPipe’s javadoc or the current tutorials.

Progress so Far

So far, there’s a getting started chapter with sections on all the tools we use, a hello world example and an introduction to Ant. The second chapter is all about character encodings and characters and strings in Java, as well as an introduction to the International Components for Unicode (ICU) library for normalization, encoding detection, and transliteration. The third chapter covers regular expressions, focusing on their interaction with Unicode. The fourth chapter is on input and output in Java, including files, byte and character streams, the various interfaces and buffering, compression, standard input and output, reading from URLs and resources on the classpath, and serialization, including the serialization proxy. The fifth chapter is on tokenization, including an overview of all of LingPipe’s built-in tokenizers and how to build your own.

The first appendiex is an introduction to Java, including the primitives, objects and arrays. The second appendix contains suggested further reading in areas related to natural language processing.

I hope to keep churning out chapters at around one per week. As I complete chapters, I’ll release new versions.

Comments Most Appreciated

C’mon — you know you want to be listed in the front of the book for sending us feedback. Any comments, suggestions, etc., should go to

The book’s not been copy-edited yet, even by me, so I don’t need to know about misspellings, sentences that run off into space, or that kind of thing.

I would love to get feedback about the general level of description, the tone, or get suggestions for demos or how to improve the existing code or descriptions.

Eventual Publication

We’ll be publishing it ourselves, probably through CreateSpace. That’ll let us sell through Amazon.

If it turns out to be 800 pages long, as we expect, we should be able to sell it for around US$ 20 (in the US anyway).

We plan to continue distributing PDF versions for free.

It’s about Time

I’m psyched, because it’s been over ten years since my last book.

I’m also working on a book on Bayesian categorical modeling — the posts here on non-NLP-related Bayesian stats, and posts like working through the collapsed samplers for LDA and naive Bayes, are warmups for that book. Don’t hold your breath; I’m trying to finish the LingPipe book first.

LingPipe 4.0.0 Released

June 2, 2010

After a week of relentless deprecation and pushing every last change through the unit tests and tutorials, we’re simultaneously rolling out LingPipe 4.0.0 and LingPipe 3.9.3. As usual, the latest is available from the

There’s a link from there to the latest 3.9 release, 3.9.3. 3.9.3 is a new release, and is only very minimally different than 3.9.2. I expected more updates for deprecation, but the whole process went more smoothly than anticipated.

Migrating from LingPipe 3.9 to LingPipe 4.0

Models are Backward Compatible

The good news is that all of the models compiled in 3.9 will run as is in 4.0 (using the 4.0 methods, of course). You can read about the mild wizardry required to do this in a previous blog entry, Upgrading Java classes with backward-compatible serialization.

Remove Deprecation Warnings from Code

The migration process for existing code importing LingPipe classes is simple. Get your code to where it compiles in 3.9 with no deprecation warnings and you’re good to go for 4.0. All of the deprecated methods and classes in 3.9 indicate what to replace them with in their javadoc.

3.9 Branched and Supported

We’re expecting to have to support 3.9 for a while longer, because 4.0 isn’t backward compatible without code changes. We’re anticipating continuing to patch bugs for 3.9 and release subsequent versions on the 3.9 branch. At least until all of our support-paying customers upgrade all of their code to 4.0.

The Time to Make it Shorter

Other than updating all the jars, there are no new features in 4.0. In fact, it’s smaller than 3.9. As Pascal said (originally in French, referring to a letter),

I have only made this longer, because I have not had the time to make it shorter.

Well, we finally found the time to make LingPipe shorter.

LingPipe 3.9.2 Released

April 23, 2010

LingPipe Version 3.9.2 is now available from the

The home page lists the updates. It’s backward compatible with 3.9.1.

LingPipe 3.9.1 Released

March 15, 2010

We’re happy to announce the release of LingPipe 3.9.1, which is available for download from:

The home page contains the major release notes, the highlights of which are discrete choice analysis models and more deprecations and consolidation of interfaces.

Let us know if you have questions or run into any problems.

The Path to LingPipe 4.0

We’re now working on LingPipe 4.0, which will remove all of the classes and methods marked as deprecated in 3.9.1. If you manage to compile without deprecation warnings in 3.9.1, you should be able to compile in LingPipe 4.0.

Custom Java Map for Binary Features

February 9, 2010

Update: 10 February, 2010. You can either jump to David R. MacIver’s comment which shows that my work is not yet done, or try to figure out what’s wrong with the implementation below.

Update II: 10 February, 2010. Of course, I should’ve also added a top-level method to BinaryMap to return the postive set as the key set.

The built-in Java collection implementations tend to be heavy. For instance, hash maps are based on hash sets of map entries, where a map entry is a simple container object with a reference to a key and a value.

For many natural language applications, such as CRF taggers and chunkers, feature maps are often designed to be boolean, with features taking only binary values, 0 or 1. These maps are almost always sparse, only representing the one values. Given the proportion of time (most of it) and memory (pretty much all of it) spent extracting features compared to decoding in CRFs or classification with logistic regression, it seems worthwhile to consider more minimal representations.

Luckily, the Java collections were designed to be easy to extend (thanks Dr. Bloch). In this post, I’ll walk you through LingPipe’s new binary map class, which implements a generic map to numbers based on a pluggable set implementation. Because these maps need to be modifiable so that they can be easily built up in feature extractors, there’s a lot of work to do in the implementation. There are so many inner classes that the whole resembles a set of nesting Russian dolls (pictured above).

Class Declaration

Skipping all the java.util imports, we declare the map to be in our util package and extend Java’s abstract map:

package com.aliasi.util;

public class BinaryMap<E> extends AbstractMap<E,Number> {

The return type will be a number, but the keys may be any old type as specified by the generic E.

Member Variable and Constructors

There’s only a single member variable, containing the set of items that map to 1, and it’s specified in the constructor, defaulting to a hash map:

    private final Set<E> mPositiveSet;

    public BinaryMap() {
        this(new HashSet<E>());

    public BinaryMap(Set<E> positiveSet) {
        mPositiveSet = positiveSet;

Implementing the Entry Set Method

Java’s abstract map class does so much work that only one method must be implemented, entrySet(), which returns a set view of the map entries underlying the map. The abstract map class can then use that to implement everything from gets and puts to size to hash code methods.

Here, we implement it to return a custom entry set (see below):

    public Set<Map.Entry<E,Number>> entrySet() {
        return new EntrySet();

I’ll fill in implementations of the other map methods later, although this one method would be enough to get the job done in terms of space savings. It’d just be inefficient in terms of speed.

The Entry Set Implementation

Like nesting Russian dolls, this implementation’s going to involve quite a few inner classes. Here’s the entry set implementation itself, which is an inner class of the top level binary map class:

    class EntrySet extends AbstractSet<Map.Entry<E,Number>> {
        public int size() {
            return mPositiveSet.size();
        public Iterator<Map.Entry<E,Number>> iterator() {
            return new EntrySetIterator();
        public void clear() {
        public boolean contains(Object o) {
            if (!(o instanceof Map.Entry<?,?>))
                return false;
            @SuppressWarnings("unchecked") // checked above
            Map.Entry<?,?> entry = (Map.Entry<?,?>) o;
            return ONE.equals(entry.getValue())
                && mPositiveSet.contains(entry.getKey());
        public boolean isEmpty() {
            return mPositiveSet.isEmpty();
        public boolean remove(Object o) {
            if (!(o instanceof Map.Entry<?,?>))
                return false;
            @SuppressWarnings("unchecked") // checked above
            Map.Entry<?,?> entry = (Map.Entry<?,?>) o;
            return ONE.equals(entry.getValue())
                && mPositiveSet.remove(entry.getKey());

Only the iterator and size method are necessary. The size method just delegates to the containing class’s positive set’s size method. The iterator method returns a custom iterator implementation we describe below, and the size method delegate’s to the containing binary map’s positive set. (That’s why it’s an inner class, not a static nested class.) The remaining methods, the clear, contains, is-emtpy and remove methods are provided for efficiency; I grayed out their code to signal its optionality.

Some Constants

I declared two constants for convenience in the top-level binary map class:

    public static final Number ONE = Integer.valueOf(1);

    public static final Object[] EMPTY_OBJECT_ARRAY
        = new Object[0];

Implementing Entry Set’s Iterator

The entry set iterator is declared as an inner class of the top-level binary map. It’s declared to implement the appropriate iterator interface. The code consists of the iterator method implementations for the has-next, next, and remove methods.

    class EntrySetIterator 
        implements Iterator<Map.Entry<E,Number>> {

        private final Iterator<E> mIterator 
            = mPositiveSet.iterator();

        public boolean hasNext() {
            return mIterator.hasNext();
        public Map.Entry<E,Number> next() {
            return new PositiveMapEntry();
        public void remove() {

When it is constructed, it uses the containing binary map’s positive set to create an iterator over the positive instances. The has-next and remove methods are simply delegated to the binary map’s positive set iterator. This means that when we’re iterating over the entry set for the top-level maps, remove calls have the intended effect on the top-level map, namely removing one of its positive entries. Of course, if the positive entry set on which the map is based is immutable, the remove operation will throw an unsupported operation exception, as documented in the iterator interface documentation.

The entry itself just returns an instance of another inner class, a positive entry map.

Positive Map Entry Implementation

The positive map entry is itself an inner class declared inside of the entry set iterator (which is itself an inner class of the top-level binary map).

        class PositiveMapEntry implements Map.Entry<E,Number> {
            private final E mE =;
            public E getKey() {
                return mE;
            public Number getValue() {
                return ONE;
            public Number setValue(Number value) {
                throw new UnsupportedOperationException();
            public boolean equals(Object that) {
                if (!(that instanceof Map.Entry<?,?>))
                    return false;
                @SuppressWarnings("unchecked") // checked above
                    Map.Entry<?,?> e1 = (Map.Entry<?,?>) that;
                Map.Entry<?,?> e2 = this;
                     ? e2.getKey()==null
                     : e1.getKey().equals(e2.getKey()))
                     ? e2.getValue()==null
                     : e1.getValue().equals(e2.getValue()));
            public int hashCode() {
                return (getKey()==null ? 0 : getKey().hashCode())
                    ^ (getValue()==null ? 0 : getValue().hashCode());

During construction, it just consumes the next element from the iterator’s contained positive set iterator. That’s then held locally as the value to return for the entry’s key. The value is just the constant number 1. Note that the set method throws an unsupported operation exception; it could’ve been defined to do nothing for a value of 1, remove the positive entry for the value of 0, and throw an illegal argument exception otherwise. But that would almost certainly cause the iterator from which the value was drawn to throw a concurrent modification exception.

The equality and hash code methods are described in Java’s map entry interface; they must be defined this way for compatibility with Java’s collections.

Map Methods for Efficiency

The remaining methods defined in the top level map are for efficiency. Here they are:

    public Number get(Object key) {
        return mPositiveSet.contains(key)
            ? ONE
            : null;

    public Number remove(Object key) {
        return mPositiveSet.remove(key)
            ? ONE
            : null;

    public int size() {
        return mPositiveSet.size();

    public Collection<Number> values() {
        return new Values();

    public void clear() {

    public boolean containsKey(Object o) {
        return mPositiveSet.contains(o);

    public boolean containsValue(Object o) {
        return ONE.equals(o) && !isEmpty();

    public boolean isEmpty() {
        return mPositiveSet.isEmpty();

    public Number put(E e, Number n) {
        if (!ONE.equals(n))
            throw new IllegalArgumentException();
        return  mPositiveSet.add(e) ? null : ONE;

All of these implementations are pretty much straightforward delegations to the contained set of objects mapping to 1. The return types of methods such as the put method are defined in the map interface documentation. There are additional methods that could’ve been added for even more efficiency, such as computing hash code and equality as specified in the map interface’s documentation.

Values Collection Implementation

Alas, one more doll to unpack, the value collection implementation.

    class Values extends AbstractCollection<Number> {
        public int size() {
            return isEmpty() ? 0 : 1;
        public Iterator<Number> iterator() {
            return new ValuesIterator();
        public void clear() {
        public boolean contains(Object o) {
            return ONE.equals(o) && !isEmpty();
        public boolean isEmpty() {
            return mPositiveSet.isEmpty();
        public boolean remove(Object o) {
            if (!ONE.equals(o))
                return false;
            boolean removedSomething = !isEmpty();
            return removedSomething;
        public boolean removeAll(Collection<?> c) {
            if (!c.contains(ONE))
                return false;
            boolean removedSomething = !isEmpty();
            return removedSomething;
        public boolean retainAll(Collection<?> c) {
            if (isEmpty()) return false;
            if (c.contains(ONE)) return false;
            return true;
        public Object[] toArray() {
            return isEmpty() 
                ? EMPTY_OBJECT_ARRAY 
                : new Object[] { ONE };

Values Iterator Implementation

The final inner class is the values iterator, which also gets a custom implementation.

    class ValuesIterator implements Iterator<Number> {
        boolean finished = mPositiveSet.isEmpty();
        boolean mayRemove = false;
        public boolean hasNext() {
            return !finished;
        public Number next() {
            if (!hasNext())
                throw new NoSuchElementException();
            finished = true;
            mayRemove = true;
            return ONE;
        public void remove() {
            if (!mayRemove)
                throw new IllegalStateException();
            mayRemove = false;

This one’s a little trickier because it doesn’t delegate the remove method, but keeps track of the state. Iterators only allow their remove method to be called once after each next call.

Override Considered Helpful

Extending Java collections is a good illustration of the utility of the override attribute. If you mark a method definition with the attribute @Override and it doesn’t actually override a superclass method, the compiler throws an error.

The reason it’s so crucial for collections is that many of the collection methods are defined at object level, like set’s contain(Object) method. Even though in a type-safe world, that object is only going to be of the set’s type, given Java’s retrofitted type system, this can’t be guaranteed. The types in the collections are as strong as they can be given Java’s type system and the need for backward compatibility with pre-generics Java.

Serialization and Synchronization

I also made the map serializable using the serialization proxy pattern as supported by LingPipe’s util.AbstractExternalizable base class.

For synchronization, the class may be wrapped in the synchronized wrappers from Java’s java.util.Collections utility class.


No wonder programming books are so darn long. And this was a relatively simple example that only took me a few hours to write and unit test. Check it out in the next LingPipe release. It’ll be paired with a binary vector implementation along the same lines.

LingPipe To-Do List

February 8, 2010

I never catch up with the pile of API designs and features I have planned for the future of LingPipe. After that last release, I’d like to just list the major items on the “to-do” list for LingPipe. Comments appreciated, as always. As are feature requests. It’s not very expensive to add to this list, and you never know what’ll grab me to implement next.

Demos and Commands

We could really use command-line options for much of our technology. I think we’re losing market share because LingPipe requires Java coding. A good start might be a command-line package for classifiers like all the other ones out there to allow users to plug and play.

We could also use more demos. Right now, we just have a few annotation demos up, but that’s only a small part of what we do.

External Interface Implementations

We’ve never really gotten around to properly integrating large chunks of LingPipe in deployment container wrappers such as Apacche UIMA or development environments such as U. Sheffield’s GATE system.

We don’t support any kind of RDF output for the semantic web.

We’ve pretty much stopped trying to write XML output format wrappers for everything.

Clusters and Clouds

We’ve also not provided any kind of wrappers to make it easy for people to run large LingPipe jobs on local clusters or distributed cloud computing platforms like Amazon’s EC2.

Multi-threaded Implementation

Although almost all of the run time classes for LingPipe are thread safe, at least for read operations like classification or clustering or chunking. But what we don’t have is any kind of threading in our base classes. I just bought a quad-core notebook, so it’s probably time to start thinking about this.

There are all sorts of opportunities for concurrency in basic LingPipe classes, from K-means clustering to the per-epoch log loss reports in logistic regression or CRFs to any of the cross-validating evaluations. The tricky part is concurrent training, though that’s also possible for approaches such as EM. And would be possible if I reorganized logistic regression or CRFs to more directly support blocked updates, because each instance in a block’s gradient may be computed independently.

Improvements and Generalizations

Many of our modules are not written as generally as possible, either at the level of API or the level of algorithms.

  • Fully online stochastic gradient for logistic regression, conditional random fields (CRF), and singular value decomposition (SVD)
  • All iterative algorithms producing iterators over epochs. Right now, only latent Dirichlet allocation (LDA) is set up this way, but I could do this for semi-supervised expectation-maximization (EM) classifier training, and all the SGDs for logistic regression, CRFs and SVD.
  • Refactoring SVD to produce iterators over states/dimensions
  • Weighted examples for just about anything trainable from LM classifiers to logistic regression; this would make it trivial to train on probabilistic examples by weighting categories by probability.
  • Entropy-based pruning for language models.
  • Information gain for feature selection; minimum count feature selection
  • Generalize min-max heaps to take a java.util.Comparator rather than requiring entries to implement LingPipe’s util.Scored.
  • Soft k-means abstracted from demo into package for clustering
  • More efficient vector processing for logistic regression, CRFs, etc., where there is no map from strings to numbers, and not necessarily even a vector processed. In most of these modules, we need to compute a vector dot-product with a coefficient vector, not actually construct a feature vector. Ideally, we could go all the way to Vowpal-Wabbit-like implicit feature maps.
  • Integrate short priority queues where appropriate, because they’re more efficient than the general bounded priority queues we’re currently using
  • More general priors and annealing for SVD to match the other versions of SGD.
  • More efficient sorted collection with removes for more efficient hierarchical clustering.
  • Removing all the specific corpus.Handler subinterfaces other than ObjectHandler. Then I can generalize cross-validating corpora, and just have the object handled as a type parameter for corpus.Parser and corpus.Corpus.
  • Add iterator methods to corpora that can enumerate instances rather than requiring handling via callbacks to visitors.
  • Privatize everything that can be.
  • Finalize methods and classes that shouldn’t be overridden. I’ve been very sloppy about this all along.
  • More agressively copy incoming arguments to constructors and wrap getters in immutable views. Later classes are much better than earlier ones at doing this. (Maybe I should just say re-read Effective Java and try one more time to apply all its advice.)
  • Serialize dynamic LM classifiers and serializing tokenized LMs to support it.


So many things should move around that it’d be impossible to mention them all. For instance, putting all the HMM classes in the HMM package, and all the LM classes in the LM package, for a start.

I’m planning to move most of the corpus-specific parsers (in corpus.parsers) out of LingPipe to wherever they’re used.

I’m also planning to move the entire MEDLINE package to the sandbox project lingmed.

I should rename classes like util.Math which conflict with Java library class names.

New Modules

  • Wikipedia Wikitext parser (probably not for LingPipe proper)
  • Probabilistic context-free grammar (PCFG) parser. Possibly with Collins-style rules.
  • Discriminative statistical context-free grammars with more global tree kernel features
  • Dependency parsers with the same characteristics as the CFGs.
  • Linear support vector machines (SVM) with regularization via SGD.
  • Morphological analyzer (to work as a stemmer, lemmatizer or feature generator), preferably with semi-supervised learning. I’d like to take an approach that blends the best aspects of Yarowsky and Wicentowski’s morphology model and Goldwater and Johnson’s context-sensitive Bayesian models.
  • Some kind of feature language for CRFs
  • More finely synched cache (along the lines of that suggested in Goetz et al.’s awesome Java Concurrency in Practice)
  • Logistic regression for a long-distance, rescoring tagger or chunker
  • Longer-distance maximum-entropy Markov model (MEMM) type taggers and chunkers; with a greedy option as discussed by Ratinov and Roth.
  • Higher-order HMM rescoring taggers and chunkers
  • More efficient primitive collections; (I just finished a map implementation for boolean features)
  • Unions, differences, etc. for feature extractors
  • Decision tree classifiers
  • Meta-learning, like boosting (requires weighted training instances)
  • Jaro-Winkler (or other edit distances) trie versus trie (scalable all versus all processors based on prefix tries).
  • Prune zeros out of coefficient vectors and symbol tables for classifiers, CRFs, etc.
  • Standard linear regression (in addition to logistic).
  • Other loss functions for linear and generalized regressions, such as probit and robit.
  • Dirichlet-process-based clusterer
  • Discriminative choice analysis (DCA) estimators, classifiers and coref implementation (the base classes are nearly done).

Please let us know if there’s something you’d like to see on this list.

LingPipe 3.9 Released

February 3, 2010

Yatta! CRFs for tagging and chunking and a whole lot more are out and available for download from the usual place:

The home page details the updates. Major changes in addition to CRFs include a new tag package to which HMMs were retrofitted, serialization for tokenized LMs, updates to 2010 MEDLINE, and some new utility methods and classes.

4.0 Up Next

We’re no longer planning to adopt the AGPL license, but we are still planning major changes for 4.0, including getting rid of all the currently deprecated classes and methods. Many existing classes are likely to be relocated.

Hopefully this will be done soon, as we’re not planning any new functionality for 4.0. Yet.

The Long Road to CRFs

January 28, 2010

CRFs are Done

The first bit of good news is that LingPipe 3.9 is within days of release. CRFs are coded, documented, unit tested, and I’ve even written a long-ish tutorial with hello-world examples for tagging and chunking, and a longer example of chunking with complex features evaluated over:

And They’re Speedy

The second bit of good news is that it looks like we have near state-of-the-art performance in terms of speed. It’s always hard to compare systems without exactly recreating the feature extractors, requirements for convergence, hardware setup and load, and so on. I was looking at

for comparison. Okazaki also evaluated first-order chain CRFs, though on the CoNLL 2000 English phrase chunking data, which has fewer tags than the CoNLL 2003 English named entity data.

While my estimator may be a tad slower (it took about 10s/epoch during stochastic gradient), I’m applying priors, and I think the tag set is a bit bigger. (Of course, if you use IO encoding rather than BIO encoding, like the Stanford named entity effort, then there’d be even fewer tags; on the other hands, if I followed Turian et al. (ref below), or the way we handle HMM encoding, there’d be more.)

It also looks like our run time is faster than the other systems benchmarked if you don’t consider feature extraction time (and I don’t think they did in the eval, but I may be wrong). It’s running at 70K tokens/second for full forward-backward decoding; first-best would be faster.

All the Interfaces, Please

Like for HMMs, I implemented first-best, n-best with conditional probabilities, and a full forward-backward confidence evaluation. For taggers, confidence is per tag per token; for chunkers, it’s per chunk.

Final Improvements

Yesterday, I was despairing a bit over how slow my approach was. Then I looked at my code, instrumented the time spent in each component, and had my D’oh! moment(s).

Blocked Prior Updates

The first problem was that I was doing dense, stochastic prior updates. That is, for every instance, I walked over the entire set of dense coefficient vectors, calculated the gradient, and applied it. This was dominating estimation time.

So I applied a blocking strategy whereby the prior gradient is only applied every so often (say, every 100 instances). This is the strategy discussed in Langford, Li and Zhang’s truncated gradient paper.

I can vouch for the fact that result vectors didn’t change much, and speed was hugely improved to the point where the priors weren’t taking much of the estimation time at all.

Caching Features

The other shortcoming of my initial implementation was that I was extracting features online rather than extracting them all into a cache. After cleaning up the prior, feature extraction was taking orders of magnitude longer than everything else. So I built a cache, and added yet another parameter to control whether to use it or not. With the cache and rich feature extractors, the estimator needs 2GB; with online feature extraction, it’s about 20 times slower, but only requires around 300MB of memory or less.

Bug Fixes

There were several subtle and not-so-subtle bugs that needed to be fixed along the way.

One problem was that I forgot to scale the priors based on the number of training instances during estimation. This led to huge over-regularization. When I fixed this problem, the results started looking way better.

Structural Zeros

Another bug-like problem I had is that when decoding CRFs for chunkers, I needed to rule out certain illegal tag sequences. The codec I abstracted to handle the encoding of chunkers and taggers and subsequent decoding could compute legal tag sequences and consistency with tokenizers, but the CRF itself couldn’t. So I was getting illegal tag sequences out that caused the codec to crash.

So I built in structural zeros. The simplest way to do it seemed to be to add a flag that only allowed tag transitions seen in the training data. This way, I could enforce legal start tags, legal end tags, and legal transitions. This had the nice side benefit of speeding things up, because I could skip calculations for structural zeros. (This is one of the reasons Thorsten Brants’ TnT is so fast — it also applies this strategy to tags, only allowing tags seen in training data for given tokens.)

Feature Extraction Encapsulation

I was almost ready to go a couple of days ago. But then I tried to build a richer feature extractor for the CoNLL entity data that used part-of-speech tags, token shape features, contextual features, prefixes and suffixes, etc. Basically the “baseline” features suggested in Turian, Ratinov, Bengio and Roth’s survey of cluster features (more to come on that paper).

It turns out that the basic node and edge feature extractors, as I proposed almost six months ago, weren’t quite up to the job.

So I raised the abstraction level so that the features for a whole input were encapsulated in a features object that was called lazily by the decoders and/or estimator. This allowed things like part-of-speech taggings to be computed once and then cached.

This will also allow online document features (like previous tagging decisions) to be rolled into the feature extractor, though it’ll take some work.

And a Whole Lotta’ Interfaces and Retrofitting

I added a whole new package, com.aliasi.tag, to characterize the output of a first-best, n-best, and marginal tag probability tagger. I then implemented these with CRFs and retrofitted them for HMMs. I also pulled out the evaluator and generalized it.

Along the way, I deprecated a few interfaces, like corpus.ChunkHandler, which is no longer needed given corpus.ObjectHandler<Chunking>.

Still No Templated Feature Extraction

Looking at other CRF implementations, and talking to others who’d used them, I see that designing a language to specify feature extractions is popular. Like other decisions in LingPipe, I’ve stuck to code-based solutions. The problem with this is that it limits our users to Java developers.

LingPipe Classifiers and Chunkers for Endeca Extend Partner Program

November 3, 2009

A couple weeks ago, Endeca made the following press release:

The “leading text analytics software vendors” are us (props to Breck for naming us with an “A”), Basis Technology, Lexalytics, MetaCarta, NetOwl, Nstein, Semantia and Temis. But wait, that’s not all. A slew of text analytics companies had either joined earlier or announced joining now, including ChoiceStream, BayNote, Lexalytics, Coremetrics, NStein, and Searchandise.

It’s no surprise that we’re all thinking Endeca has quite a bit of potential as a channel partner.

After the usual marketing blather (e.g. “leveraging the extensibility of the McKinley platform”, “lower cost of ownership”, “value-added capabilities”, etc.) and vague promises (e.g. “unrestricted exploration of unstructured content”), the third paragraph of Endeca’s press release explains what it’s all about in allowing Endeca’s search customers to

… run their data through an Endeca Extend partner solution, extract additional meta-data elements from the text, and append that meta-data to the original content

Endeca Records

Endeca stores documents in record data structures, which associate string keys with lists of string values. This is the same rought structure as is found in a Lucene Document.

One striking difference is that Endeca’s API is cleaner and better documented. Overall, I’m very impressed with Endeca’s API. Looking at their API reminds me of the APIs we built at SpeechWorks, where wiser heads prevailed on me to forego complex controls designed for control-freak grad students in favor of making easy things easy.

Another striking difference is that Lucene’s document structure is much richer, allowing for binary blobs to be stored by those trying to use Lucene as a database. Lucene also allows both documents as a whole and fields within a document to be boosted, adding a multiplier to their search scores for matching queries.

Manipulator Extensions

Endeca’s produced an API for extensions. An extension visits records, modifies them, and writes them back to the index. It can also write into its own scratch space on the file system and generate all new records.

An extension consists of three components: configuration, factory, and runtime.

Class 1. Configuration

The bean-like configuration class provides setters and getters for strings, booleans, integers, and doubles. These are labeled with attributes and accessed through reflection. There’s then a method to validate a configuration that returns a list of errors as structured objects. I’m a big fan of immutable objects, so working with beans drives me crazy. They could use some more doc on concurrency and lifecycle order; as is, I was conservative and programmed defensively against changes in config.

Configuration is handled through an administrator interface. As I said, it’s bean-like.

Class 2. Factory

There is then a factory class with a method that returns the config class (so the admin interface can tell what kind of config to build for it). It also contains a method that takes an Endeca application context and configuration and produces a runtime application. The context provides services like logging, a path to local file space, and a hook into a pipe into which modified records may be sent.

Class 3. Runtime

The runtime simply provides a record visitor method. To write out changes, you grab the output channel from the context provided to the factory. There are also some lifecycle methods used as callbacks: interrupt processing, processing of records is complete, and final cleanup. You can still write out answers during the completion callback.

Endeca’s Demo Manipulator Extension

Endeca has great programmers and their Java API design was really clear. I love it when vendors follow standard patterns and idioms in their API designs. Especially when they use generics usefully.

The PDF developer doc’s still in progress, but their Javadoc’s mostly in place. What was really sweet is that they gave us a working demo extension program with all of its configuration, tests, and even mock objects for use in JUnit testing the entire framework without a complete install of Endeca’s platform. I’m so happy when someone sends me a Java package that unpacks then compiles with Ant without griping.

LingPipe Classifier CAS Manipulator Extension

The first extension I wrote is configured with a path to a serialized text classifier on the classpath. I then configured a list of field names (only strings are available, so I went with comma-separated values) from which to collect text, and a field name into which to write the result of classification. [Correction: 5 Nov 2009, Endeca let me know that they had this covered; if I declare the variables in the bean-like configuration to be list-like values, the reflection-based config system will figure it out. This is awesome. I always hate rolling in ad-hoc little parsing “languages” like CSV in config. It’s just sooo hard to doc and code correctly.]

LingPipe Chunker CAS Manipulator Extension

The second extension is a chunker. It requires a path to a chunker. Optionally, it allows a sentence detector to be configured for preprocessing (most of our chunkers work better at the sentence level). It also optionally allows a dictionary (and tokenizer factory) to be specified for overriding the chunks found by the chunker. Then, a list of field names from which to read text. The output gets written into chunk-specific fields. Because a given field name can contain multiple values, you can keep the resulting spans separate.

Endeca’s Faceting

Endeca’s big on faceted search. You may be familiar with it from two of the best online stores, NewEgg and Amazon.

It’s easy to treat our classifier plugin output as a facet. For instance, classify documents by sentiment and now sentiment’s a facet. Do a search, and you’ll get a summary of how many positive and how many negative documents, with an option to restrict search to either subset.

It’s also easy to treat our chunker output as a facet. For instance, if you include a company name chunker, you’ll be able to use companies as facets (e.g. as NewEgg does with manufacturers, though documents may contain references to more than one company).

Buying Plugins

Drop Breck a line.

Now that I have my head around the bigger picture, it’s pretty easy to build these kinds of extensions. So if there’s something you’d like integrated into Endeca and you’re willing to pay for it, let us know.

Coding Chunkers as Taggers: IO, BIO, BMEWO, and BMEWO+

October 14, 2009

I’ve finished up the first order linear-chain CRF tagger implementation and a bunch of associated generalizations in the tagging interface. Now it’s time to code up chunkers using CRFs, and I’m faced with the usual problem of how to encode chunks, which are about character spans, as taggings, which are sequences of tokens (words or other units) and tags (aka labels, categories).

An Example

Here’s an example of a tokenized string and its encoding using several standards:

Yesterday O O O BOS_O
afternoon 0 O O O
, 0 O O O_PER
traveled O 0 O PER_O
to O O O O_LOC


IO Encoding

The simplest encoding is the IO encoding, which tags each token as either being in (I_X) a particular type of named entity type X or in no entity (O). This encoding is defective in that it can’t represent two entities next to each other, because there’s no boundary tag.

BIO Encoding

The “industry standard” encoding is the BIO encoding (anyone know who invented this encoding?). It subdivides the in tags as either being begin-of-entity (B_X) or continuation-of-entity (I_X).

BMEWO Encoding

The BMEWO encoding further distinguishes end-of-entity (E_X) tokens from mid-entity tokens (M_X), and adds a whole new tag for single-token entities (W_X). I believe the BMEWO encoding was introduced in Andrew Borthwick’s NYU thesis and related papers on “max entropy” named entity recognition around 1998, following Satoshi Sekine’s similar encoding for decision tree named entity recognition. (Satoshi and David Nadeau just released their Survey of NER.)

BMEWO+ Encoding

I introduced the BMEWO+ encoding for the LingPipe HMM-based chunkers. Because of the conditional independence assumptions in HMMs, they can’t use information about preceding or following words. Adding finer-grained information to the tags themselves implicitly encodes a kind of longer-distance information. This allows a different model to generate words after person entities (e.g. John said), for example, than generates words before location entities (e.g. in Boston). The tag transition constraints (B_X must be followed by M_X or E_X, etc.) propagate decisions, allowing a strong location-preceding word to trigger a location.

Note that it also adds a begin and end of sequence subcategorization to the out tags. This helped reduce the confusion between English sentence capitalization and proper name capitalization.

Transition and Feature Tying

Consider the BMEWO notation. At least for regular named-entity recognition in text, we probably don’t want to distinguish being preceded by an end-of-entity token labeled E_X from a whole entity token labeled W_X.

In HMMs, we’d tie transitions, so p(T|E_X) = p(T|W_X) for all tags T.

In CRFs, we could tie the feature generation process, so having either preceding tag (end or whole entity) produces the same features, and hence the same probability estimates.


First-best decoding in a first-order CRF or HMM is quadratic in the number of tags (and linear in the number of tokens). Here’s a rundown:

Encoding # Tags N=1 N=3 N=20
IO N+1 2 4 21
BIO 2N+1 3 7 41
BMEWO 4N+1 5 13 81
BMEWO+ 7N+3 10 24 143

Legal Transition Pruning

It’s not quite as bad as number of tags squared in practice, because not every tag can follow every other tag in the more complex codings. Even in BIO encoding, I_X can only follow B_X. In BMEWO encoding, M_X and E_X can only follow B_X or I_X, and O can only follow O, E_X, or W_X.

Of course, this is a major pain to compute and specify in the API, as you need some way of figuring out legal transitions. For HMMs, I used the fact that transition probabilities were zero to rule out consideration. I’ll have to add some way to specify legal transitions if I implement something general for CRF tagger decoding.

Necessary for CRFs?

Jenny Finkel recently told me she really thought IO encoding is all you need in most cases. It’s how I coded a single-entity problem for the mechanical Turkers for NE annotation. As far as encodability, it’s fine for MUC, almost OK for Biocreative [there are a couple contiguous entities, I think], but degenerate in general. The really nice thing about this encoding is that for a single entity type, we can get away with a single feature vector.

The recent Ratinov and Roth NE survey concluded BMEWO was superior to BIO for a second-order online logistic model (basically what’s known as a MEMM, but with best guess at each token enforced going forward).

I found BMEWO+ worked best for single-entity and small entity set tagging for our HMMs.

What to Do for First-Order CRFs?

What I don’t know is what I should do for CRFs. I’ve settled on only using first-order CRFs, and not having feature extraction depend on the output tag (or more precisely, every feature is crossed with every output tag at every position).

I’m tempted to try to punt and write up a flexible TaggingChunkingCodec interface and let the user specify it. The real problem is that evaluation’s not simple in this context, because results vary pretty dramatically based on features. That’s why I love and hate CRFs.