LingPipe 3.2.0 Released

The latest release of LingPipe is LingPipe 3.2.0. It replaces LingPipe 3.1.2 and is backward compatible with it, with one exception (see the next section).

Backward Incompatibility

The p-value methods in stats.BinomialDistribution and stats.MultinomialDistribution have been removed. The javadoc for the classes includes the code for the removed implementations, which were based on the Jakarta Commons Math library.

Zero External Dependencies

We removed the p-value methods because they were the last remaining piece of functionality in LingPipe that required external dependencies. As of this release, LingPipe no longer depends on the Jakarta Commons Math library or the Apache Xerces XML libraries. The XML functionality that Xerces provided has been folded into Java itself.

New Features

Singular Value Decomposition

We’ve included an implementation of singular value decomposition. It uses stochastic gradient descent, so it scales to large matrices and allows partial matrix input (matrices with some unknown values). The implementation is encapsulated in a single class, matrix.SvdMatrix, whose javadoc explains the mathematics of SVD and our implementation.
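
To give a feel for why gradient descent copes with unknown values, here is a minimal sketch in plain Java. It is not the code in matrix.SvdMatrix, and every name in it (SgdFactorizationSketch, fit, predict) is hypothetical. It fits a low-rank factorization using only the observed cells, and it leaves out the orthogonalization and singular-value extraction that a real SVD requires.

```java
import java.util.Random;

/** Illustrative only: NOT LingPipe's matrix.SvdMatrix. All names are hypothetical. */
final class SgdFactorizationSketch {

    /**
     * Fits factors u (rows x rank) and v (cols x rank) so that the dot product
     * u[i] . v[j] approximates each observed cell. observed[n] = {row, col}
     * and values[n] is the known value of that cell; unknown cells are simply
     * never visited.
     */
    static double[][][] fit(int[][] observed, double[] values, int rows, int cols,
                            int rank, double learningRate, double regularization,
                            int epochs, long seed) {
        Random random = new Random(seed);
        double[][] u = smallPositive(rows, rank, random);
        double[][] v = smallPositive(cols, rank, random);
        for (int epoch = 0; epoch < epochs; ++epoch) {
            for (int n = 0; n < observed.length; ++n) {
                double[] ui = u[observed[n][0]];
                double[] vj = v[observed[n][1]];
                double error = values[n] - dot(ui, vj);
                for (int f = 0; f < rank; ++f) {
                    double uf = ui[f];
                    double vf = vj[f];
                    // Gradient step on squared error with an L2 penalty.
                    ui[f] += learningRate * (error * vf - regularization * uf);
                    vj[f] += learningRate * (error * uf - regularization * vf);
                }
            }
        }
        return new double[][][] { u, v };
    }

    /** Reconstructs one cell, observed or not, from the fitted factors. */
    static double predict(double[][][] factors, int row, int col) {
        return dot(factors[0][row], factors[1][col]);
    }

    private static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; ++i) sum += a[i] * b[i];
        return sum;
    }

    /** Small positive initial values (an arbitrary choice for this sketch). */
    private static double[][] smallPositive(int n, int rank, Random random) {
        double[][] m = new double[n][rank];
        for (double[] row : m)
            for (int f = 0; f < rank; ++f) row[f] = 0.1 + 0.1 * random.nextDouble();
        return m;
    }

    public static void main(String[] args) {
        // Rank-1 matrix [[1, 2], [2, 4]] with the cell (1, 1) withheld.
        int[][] observed = { {0, 0}, {0, 1}, {1, 0} };
        double[] values = { 1.0, 2.0, 2.0 };
        double[][][] factors = fit(observed, values, 2, 2, 1, 0.05, 0.0, 2000, 42L);
        System.out.println("estimate for withheld cell: " + predict(factors, 1, 1));
    }
}
```

The learning rate, regularization weight, and epoch count are the kind of tuning parameters such a fit exposes; the values used in main are arbitrary choices for this toy example.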

SGML Normalizer

We were tired of writing one-off SGML entity normalizers, so we imported a class representing the entire standard, util.Sgml.
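
For readers who have not written one of those one-off normalizers, the sketch below shows the general shape of the task in plain Java. It does not use util.Sgml, whose API is not described in this post. The class name, the method name, and the five-entry table are illustrative assumptions; the real standard defines far more entities.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: a tiny hand-rolled normalizer, not LingPipe's util.Sgml. */
final class EntityNormalizerSketch {

    // A deliberately tiny table; a full normalizer would cover the whole standard.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("amp", "&");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("eacute", "\u00E9");
        ENTITIES.put("nbsp", "\u00A0");
    }

    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    /** Replaces known named entities; unknown ones are left untouched. */
    static String normalize(String text) {
        Matcher matcher = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String replacement = ENTITIES.get(matcher.group(1));
            String emitted = (replacement != null) ? replacement : matcher.group();
            // quoteReplacement keeps '$' and '\' in the output from being interpreted.
            matcher.appendReplacement(out, Matcher.quoteReplacement(emitted));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("caf&eacute; &amp; bar &unknown;"));
    }
}
```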

Line Tokenizer Factory

To support our work on document parsing (text extraction, bibliography extraction, e-mail signature extraction, etc.), we’ve included a new tokenizer, tokenizer.LineTokenizerFactory, that produces tokens based on lines of text. It is used in the sandbox project citationEntities.
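
The idea of treating each line as a token can be sketched in a few lines of plain Java. This is not tokenizer.LineTokenizerFactory and does not use LingPipe's tokenizer interfaces. The class and method names are hypothetical, and dropping empty lines is an assumption of the sketch, not a statement about how the LingPipe class behaves.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: not LingPipe's tokenizer.LineTokenizerFactory. */
final class LineTokensSketch {

    /** Returns each non-empty line of the text as one token. */
    static List<String> lineTokens(String text) {
        List<String> tokens = new ArrayList<>();
        // The limit -1 keeps trailing empty strings so they are filtered explicitly.
        for (String line : text.split("\\r?\\n", -1)) {
            if (!line.isEmpty()) {
                tokens.add(line);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String citation = "Smith, J. 2007.\nA title.\r\n\nJournal.";
        for (String token : lineTokens(citation)) {
            System.out.println("[" + token + "]");
        }
    }
}
```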

Soundex Tokenizer Filter

We’ve provided an implementation of Soundex as a tokenizer filter in tokenizer.SoundexFilterTokenizer. Soundex reduces words to a simplified, (English) pronunciation-based representation. Our main motivation was to explore Soundex codes as features in our feature-based classifiers. The javadoc contains a complete description of the Soundex algorithm.
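
As a concrete illustration of what a pronunciation-based code looks like, here is a minimal plain-Java sketch of the commonly described American Soundex rules. It is not the code in tokenizer.SoundexFilterTokenizer, and the variant documented in the LingPipe javadoc may differ in its details. The class and method names are hypothetical.

```java
/** Illustrative only: common American Soundex, not LingPipe's implementation. */
final class SoundexSketch {

    // Digit for each letter A..Z; '0' marks letters that produce no digit.
    private static final String CODES = "01230120022455012623010202";

    /** Encodes a word as its first letter followed by three digits. */
    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        char lastCode = '0';
        for (int i = 0; i < word.length() && out.length() < 4; ++i) {
            char c = Character.toUpperCase(word.charAt(i));
            if (c < 'A' || c > 'Z') continue;       // ignore non-letters
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);                      // first letter is kept as is
                lastCode = code;
            } else if (c == 'H' || c == 'W') {
                // H and W do not separate letters that share a code.
            } else if (code == '0') {
                lastCode = '0';                     // vowels separate repeated codes
            } else {
                if (code != lastCode) out.append(code);
                lastCode = code;
            }
        }
        while (out.length() < 4) out.append('0');   // pad short codes with zeros
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Robert"));   // R163
        System.out.println(encode("Rupert"));   // R163
    }
}
```

Because Robert and Rupert share a code, a classifier that uses these codes as features treats the two spellings as the same feature.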

Distances and Proximities

We made sure that every class implementing one of the interfaces util.Distance or util.Proximity also implements the other. This allows any of these classes to be plugged into distance-based clusterers and classifiers (e.g. k nearest neighbors).
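
The benefit of implementing both interfaces can be sketched as follows. The two interfaces declared here are local stand-ins written for this example, not the actual util.Distance and util.Proximity declarations. The Euclidean class, the 1/(1 + d) conversion from distance to proximity, and the nearest-neighbor helper are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: a consumer that needs nothing more than a distance. */
final class NearestNeighborSketch {

    /** Returns the candidate closest to the query under the given distance. */
    static <E> E nearest(E query, List<E> candidates, DistanceSketch<? super E> distance) {
        E best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (E candidate : candidates) {
            double d = distance.distance(query, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        EuclideanSketch metric = new EuclideanSketch();
        List<double[]> points = Arrays.asList(new double[] {3, 4}, new double[] {1, 1});
        double[] hit = nearest(new double[] {0, 0}, points, metric);
        System.out.println(Arrays.toString(hit));
        System.out.println(metric.proximity(new double[] {0, 0}, new double[] {3, 4}));
    }
}

/** Local stand-in for a distance interface (hypothetical declaration). */
interface DistanceSketch<E> {
    double distance(E a, E b);
}

/** Local stand-in for a proximity interface (hypothetical declaration). */
interface ProximitySketch<E> {
    double proximity(E a, E b);
}

/** One object usable wherever either a distance or a proximity is expected. */
final class EuclideanSketch implements DistanceSketch<double[]>, ProximitySketch<double[]> {

    public double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; ++i) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Assumed conversion: larger distance means smaller proximity. */
    public double proximity(double[] a, double[] b) {
        return 1.0 / (1.0 + distance(a, b));
    }
}
```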

Tutorials

Singular Value Decomposition

The SVD Tutorial walks through applications to latent semantic indexing and basic n-gram smoothing. It also covers all the tuning parameters (learning rate and annealing, initial values, early stopping, regularization, etc.).

Word Sense Disambiguation Tutorial

The Word Sense Tutorial provides details on creating a complete Senseval word sense disambiguation system using contextual classification. Word sense disambiguation is the problem of determining which dictionary entry (for a given dictionary) corresponds to a given token of a word in a text.