I never catch up with the pile of API designs and features I have planned for the future of LingPipe. With the last release out the door, I'd like to list the major items on the LingPipe to-do list. Comments are appreciated, as always, as are feature requests. It's not very expensive to add to this list, and you never know what'll grab me to implement next.
Demos and Commands
We could really use command-line interfaces for much of our technology. I think we're losing market share because LingPipe requires Java coding. A good start might be a command-line classifier package, like the many already out there, that would let users plug and play.
We could also use more demos. Right now, we just have a few annotation demos up, but that’s only a small part of what we do.
External Interface Implementations
We’ve never really gotten around to properly integrating large chunks of LingPipe into deployment container wrappers such as Apache UIMA or development environments such as the University of Sheffield’s GATE system.
We don’t support any kind of RDF output for the semantic web.
We’ve pretty much stopped trying to write XML output format wrappers for everything.
Clusters and Clouds
We’ve also not provided any kind of wrappers to make it easy for people to run large LingPipe jobs on local clusters or distributed cloud computing platforms like Amazon’s EC2.
Almost all of LingPipe’s runtime classes are thread safe, at least for read operations like classification, clustering, or chunking. What we don’t have is any kind of threading in our base classes. I just bought a quad-core notebook, so it’s probably time to start thinking about this.
There are all sorts of opportunities for concurrency in basic LingPipe classes, from K-means clustering to the per-epoch log loss reports in logistic regression or CRFs to any of the cross-validating evaluations. The tricky part is concurrent training, though even that’s possible for approaches such as EM. It would also be possible if I reorganized logistic regression or CRFs to support blocked updates more directly, because the gradient for each instance in a block may be computed independently.
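To make the last point concrete, here's a minimal sketch, not LingPipe's actual API, of why blocked updates parallelize: the per-instance logistic-regression gradients are independent, so the map step can fan out across cores and only the final sum is sequential. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical sketch: gradient of logistic-regression log loss over a
// block of instances, computed by mapping over instances in parallel
// and summing the independent per-instance gradients.
public class BlockedGradient {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    static double dot(double[] w, double[] x) {
        double s = 0.0;
        for (int i = 0; i < w.length; ++i) s += w[i] * x[i];
        return s;
    }

    // gradient contribution of one instance: (sigmoid(w.x) - y) * x
    static double[] instanceGradient(double[] w, double[] x, double y) {
        double err = sigmoid(dot(w, x)) - y;
        double[] g = new double[w.length];
        for (int i = 0; i < w.length; ++i) g[i] = err * x[i];
        return g;
    }

    // the map step touches each instance independently, so it may run on
    // as many threads as there are cores; only the reduction sums results
    static double[] blockGradient(double[] w, double[][] xs, double[] ys) {
        return IntStream.range(0, xs.length)
            .parallel()
            .mapToObj(i -> instanceGradient(w, xs[i], ys[i]))
            .reduce(new double[w.length],
                    (a, b) -> {
                        double[] c = new double[a.length];
                        for (int i = 0; i < a.length; ++i) c[i] = a[i] + b[i];
                        return c;
                    });
    }

    public static void main(String[] args) {
        double[] w = { 0.0, 0.0 };
        double[][] xs = { { 1.0, 0.0 }, { 0.0, 1.0 } };
        double[] ys = { 1.0, 0.0 };
        System.out.println(Arrays.toString(blockGradient(w, xs, ys)));
    }
}
```

Because addition is associative, the parallel reduction gives the same answer as a sequential loop, which is exactly what makes blocked SGD updates safe to parallelize within a block.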
Improvements and Generalizations
Many of our modules are not written as generally as possible, either at the level of API or the level of algorithms.
- Fully online stochastic gradient for logistic regression, conditional random fields (CRF), and singular value decomposition (SVD)
- All iterative algorithms producing iterators over epochs. Right now, only latent Dirichlet allocation (LDA) is set up this way, but I could do the same for semi-supervised expectation-maximization (EM) classifier training and all the SGD-based estimators for logistic regression, CRFs, and SVD.
- Refactoring SVD to produce iterators over states/dimensions
- Weighted examples for just about anything trainable from LM classifiers to logistic regression; this would make it trivial to train on probabilistic examples by weighting categories by probability.
- Entropy-based pruning for language models.
- Information gain for feature selection; minimum count feature selection
- Generalize min-max heaps to take a java.util.Comparator rather than requiring entries to implement LingPipe’s Scored interface
- Soft k-means abstracted from demo into package for clustering
- More efficient vector processing for logistic regression, CRFs, etc., where there is no map from strings to numbers, and not necessarily even a vector processed. In most of these modules, we need to compute a vector dot-product with a coefficient vector, not actually construct a feature vector. Ideally, we could go all the way to Vowpal-Wabbit-like implicit feature maps.
- Integrate short priority queues where appropriate, because they’re more efficient than the general bounded priority queues we’re currently using
- More general priors and annealing for SVD to match the other versions of SGD.
- More efficient sorted collection with removes for more efficient hierarchical clustering.
- Removing all the specific corpus.Handler subinterfaces other than ObjectHandler. Then I can generalize cross-validating corpora and just have the type of object handled be a type parameter for the corpus.
- Add iterator methods to corpora that can enumerate instances rather than requiring handling via callbacks to visitors.
- Privatize everything that can be.
- Finalize methods and classes that shouldn’t be overridden. I’ve been very sloppy about this all along.
- More aggressively copy incoming arguments to constructors and wrap getters in immutable views. Later classes are much better than earlier ones at doing this. (Maybe I should just re-read Effective Java and try one more time to apply all its advice.)
- Serialization for dynamic LM classifiers, and serialization for tokenized LMs to support it.
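On the bounded priority queue generalization above, here's a minimal sketch of what the Comparator-based version might look like. This is not LingPipe's implementation; the class name is hypothetical, and it leans on java.util.PriorityQueue for the heap.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch of a Comparator-based bounded priority queue:
// keeps the k best elements under an arbitrary java.util.Comparator,
// instead of requiring elements to implement a scoring interface.
public class BoundedQueue<E> {

    private final PriorityQueue<E> queue; // min-heap: worst element at head
    private final Comparator<? super E> comparator;
    private final int maxSize;

    public BoundedQueue(Comparator<? super E> comparator, int maxSize) {
        this.queue = new PriorityQueue<>(comparator);
        this.comparator = comparator;
        this.maxSize = maxSize;
    }

    public void offer(E e) {
        if (queue.size() < maxSize) {
            queue.add(e);
        } else if (comparator.compare(e, queue.peek()) > 0) {
            queue.poll();   // evict the current worst element
            queue.add(e);
        }
    }

    public E pollWorst() {
        return queue.poll();
    }

    public int size() {
        return queue.size();
    }
}
```

For example, a `BoundedQueue<Integer>` of size 3 over natural ordering fed 5, 1, 9, 3, 7 would retain 5, 7, and 9. Taking the comparator as a constructor argument keeps element types completely decoupled from the heap.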
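The Vowpal-Wabbit-style implicit feature map mentioned above is easy to sketch: hash feature names straight into the coefficient array so that neither a symbol table nor an explicit feature vector is ever built. This is a toy illustration under assumed names, not LingPipe code.

```java
// Hypothetical sketch of an implicit (hashed) feature map: the dot
// product with the coefficient vector is computed directly from
// (name, value) pairs, with no string-to-integer symbol table and no
// materialized feature vector.
public class HashedDotProduct {

    private final double[] coefficients;

    public HashedDotProduct(int numBuckets) {
        coefficients = new double[numBuckets];
    }

    private int bucket(String featureName) {
        // mask the hash to a non-negative value, then mod into range
        return (featureName.hashCode() & 0x7fffffff) % coefficients.length;
    }

    public void setWeight(String featureName, double weight) {
        coefficients[bucket(featureName)] = weight;
    }

    // each (name, value) pair contributes value * coefficient[hash(name)]
    public double dotProduct(String[] names, double[] values) {
        double sum = 0.0;
        for (int i = 0; i < names.length; ++i)
            sum += values[i] * coefficients[bucket(names[i])];
        return sum;
    }
}
```

The trade-off is that distinct features may collide in a bucket, sacrificing a little accuracy for constant memory and no symbol-table lookups.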
So many things should move around that it’d be impossible to mention them all. For instance, putting all the HMM classes in the HMM package, and all the LM classes in the LM package, for a start.
I’m planning to move most of the corpus-specific parsers (in corpus.parsers) out of LingPipe to wherever they’re used.
I’m also planning to move the entire MEDLINE package to the sandbox project lingmed.
I should rename classes like util.Math that conflict with Java library class names.
- Wikipedia Wikitext parser (probably not for LingPipe proper)
- Probabilistic context-free grammar (PCFG) parser. Possibly with Collins-style rules.
- Discriminative statistical context-free grammars with more global tree kernel features
- Dependency parsers with the same characteristics as the CFGs.
- Linear support vector machines (SVM) with regularization via SGD.
- Morphological analyzer (to work as a stemmer, lemmatizer or feature generator), preferably with semi-supervised learning. I’d like to take an approach that blends the best aspects of Yarowsky and Wicentowski’s morphology model and Goldwater and Johnson’s context-sensitive Bayesian models.
- Some kind of feature language for CRFs
- More finely synched cache (along the lines of that suggested in Goetz et al.’s awesome Java Concurrency in Practice)
- Logistic regression for a long-distance, rescoring tagger or chunker
- Longer-distance maximum-entropy Markov model (MEMM) type taggers and chunkers, with a greedy option as discussed by Ratinov and Roth.
- Higher-order HMM rescoring taggers and chunkers
- More efficient primitive collections (I just finished a map implementation for boolean features)
- Unions, differences, etc. for feature extractors
- Decision tree classifiers
- Meta-learning, like boosting (requires weighted training instances)
- Jaro-Winkler (or other edit distance) trie-versus-trie comparison (scalable all-versus-all processing based on prefix tries).
- Prune zeros out of coefficient vectors and symbol tables for classifiers, CRFs, etc.
- Standard linear regression (in addition to logistic).
- Other loss functions for linear and generalized regressions, such as probit and robit.
- Dirichlet-process-based clusterer
- Discriminative choice analysis (DCA) estimators, classifiers and coref implementation (the base classes are nearly done).
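Since standard linear regression by SGD comes up in a couple of the items above, here's a minimal sketch of what it amounts to: least-squares loss, per-instance gradient steps, and an optional L2 (Gaussian prior) regularizer. All names here are hypothetical, and the learning rate and epoch count are arbitrary choices for the example.

```java
// Hypothetical sketch of plain least-squares linear regression
// trained by stochastic gradient descent with L2 regularization.
public class SgdLinearRegression {

    // one SGD epoch over the data; updates the weight vector in place
    static void epoch(double[] w, double[][] xs, double[] ys,
                      double learningRate, double l2) {
        for (int n = 0; n < xs.length; ++n) {
            double predicted = 0.0;
            for (int i = 0; i < w.length; ++i)
                predicted += w[i] * xs[n][i];
            double err = predicted - ys[n];
            // squared-error gradient plus L2 penalty gradient
            for (int i = 0; i < w.length; ++i)
                w[i] -= learningRate * (err * xs[n][i] + l2 * w[i]);
        }
    }

    static double[] fit(double[][] xs, double[] ys,
                        double learningRate, double l2, int epochs) {
        double[] w = new double[xs[0].length];
        for (int e = 0; e < epochs; ++e)
            epoch(w, xs, ys, learningRate, l2);
        return w;
    }

    public static void main(String[] args) {
        // data on the line y = 2x, with a constant intercept column;
        // SGD should recover weights near (0, 2)
        double[][] xs = { { 1, 1 }, { 1, 2 }, { 1, 3 }, { 1, 4 } };
        double[] ys = { 2, 4, 6, 8 };
        double[] w = fit(xs, ys, 0.05, 0.0, 500);
        System.out.printf("%.2f %.2f%n", w[0], w[1]);
    }
}
```

Swapping the squared-error term for a logistic or probit loss turns the same loop into the generalized regressions mentioned above, which is the appeal of organizing all of them around a shared SGD core.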
Please let us know if there’s something you’d like to see on this list.