Archive for the ‘LingPipe News’ Category

Log Sum of Exponentials

June 25, 2009

I’ve decided, at least in the first pass, to calculate forward/backward for CRFs on a log scale. The alternative way to maintain floating point stability, by keeping log multipliers for columns of outputs, was just making my head hurt too much in the nested loops.

The log computation involves taking the logarithm of a sum over the exponentiated values found in an array. In symbols:

$z= \log \sum_i \exp(x_i)$

The problem, of course, is that if any of the x values get sufficiently large (greater than log(Double.MAX_VALUE), which is about 710), they overflow to positive infinity when you calculate the exponent; if all of the values are tiny (less than -750 or so), they all underflow to 0.0 and the resulting log sum diverges to negative infinity.

Luckily, we can get more arithmetic stability with a little algebra. First, find the maximum value in x:

$m = \max_i x_i$

Then use it to normalize the largest value to 0:

$\begin{array}{lll} \log \sum_i \exp(x_i) & = & \log \sum \frac{\exp(m)}{\exp(m)}\exp(x_i) \\[12pt] & = & \log \exp(m) \sum \frac{1}{\exp(m)} \exp(x_i) \\[12pt]& = & \log \exp(m) + \log \sum \exp(-m) \exp(x_i)\\[12pt]& = & m + \log \sum \exp(x_i - m) \end{array}$

So the largest value passed to the exponential function is 0. If there are really tiny values after subtraction, they’ll become zero and drop out, as they should with limited precision arithmetic.

I’ve implemented a new static utility method to compute the log of the sum of the exponentiated values of an array:

com.aliasi.util.Math.logSumOfExponentials(double[]).

It’ll be out in the next release (hopefully along with CRFs).

In case you’re curious where this comes up, here’s the inductive step of the forward pass of the forward-backward algorithm for CRFs (φ(n,k’) is the feature vector for token position n and previous outcome k’; β[k] is the coefficient vector for outcome k):

$\alpha_{n,k} = \sum_{k'} \alpha_{n-1,k'} \exp \big( \phi(n,k')^{\top} \beta_k \big)$

and here it is on the log scale:

$\log \alpha_{n,k} = \log \sum_{k'} \exp\big(\, \log \alpha_{n-1,k'} + \phi(n,k')^{\top} \beta_k\, \big)$

Here’s our Java source from com.aliasi.util.Math:

    public static double logSumOfExponentials(double[] xs) {
        if (xs.length == 1) return xs[0];
        double max = maximum(xs);
        double sum = 0.0;
        for (int i = 0; i < xs.length; ++i)
            if (xs[i] != Double.NEGATIVE_INFINITY)
                sum += java.lang.Math.exp(xs[i] - max);
        return max + java.lang.Math.log(sum);
    }
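To see why the max trick matters, here's a standalone sketch (independent of the LingPipe classes) comparing the naive computation with the stabilized one:

```java
public class LogSumExp {
    // log(exp(x_0) + exp(x_1) + ...) computed stably via the max trick
    static double logSumOfExponentials(double[] xs) {
        if (xs.length == 1) return xs[0];
        double max = xs[0];
        for (double x : xs)
            if (x > max) max = x;
        double sum = 0.0;
        for (double x : xs)
            if (x != Double.NEGATIVE_INFINITY)
                sum += Math.exp(x - max);  // largest argument is 0.0
        return max + Math.log(sum);
    }

    public static void main(String[] args) {
        // Naive computation overflows: exp(1000) is already infinite
        System.out.println(Math.log(Math.exp(1000.0) + Math.exp(999.0)));
        // Stable version: 1000 + log(1 + exp(-1)), roughly 1000.3133
        System.out.println(logSumOfExponentials(new double[]{1000.0, 999.0}));
    }
}
```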


Chain Conditional Random Fields: Implementation and Design Issues

June 18, 2009

I’m starting to implement conditional random fields (CRF) for LingPipe. (See, I really don’t hate CRFs.) If you don’t know CRFs, here are three good tutorials:

Implementing CRFs brings up several design issues.

N-gram Order

Like HMMs, chain CRFs extract features over the sequence of tags. The issue is whether the features can span more than a single pair of tags (a bigram). For instance, can I use the previous label and label two back to help model the current label?

Although it’d be nice to implement everything very generally for arbitrary n-gram contexts, it’s actually a pain in practice: everything gets deeper loops, and allocating the arrays is hard to do generically in Java. And then extracting answers with confidence and doing n-best also gets more complex.

I’m also not sure that any order beyond first is going to be tractable enough for training.

Feature Extraction

I plan to do the usual factoring of features into node features (depend only on inputs) and edge features (depend on input and also previous category). Even though edge features are more general, it’s much more efficient to implement the usually much more prevalent node features independently (it all goes through the forward-backward algorithm). Nodes and edges have the following interfaces:

interface ChainCrf.Node {
    int position();
    String[] words();
}


and

interface ChainCrf.Edge {
    String previousTag();
    int position();
    String[] words();
}


Then we have two feature extractors required to estimate a CRF from a corpus of labeled tag data:

• FeatureExtractor<ChainCrf.Edge>, and
• FeatureExtractor<ChainCrf.Node>.

One or More Coefficient Vectors?

Given the feature extraction, I have to follow our logistic regression model in having different coefficient vectors per category rather than a single coefficient vector that essentially concatenates all the individual vectors. Having different vectors makes feature extraction much more economical. In the “standard” CRF presentation, a separate feature extraction is done for each possible tag, both for nodes and edges. That does allow some richer features (e.g. using a feature for labels that don’t change entity type, such as B_PERSON to I_PERSON), and also allows features to be structural zeroes for some tags.

Note that the output tag is implicit in the proposed feature extraction, which means it’s always included to distinguish both node and edge features across output tags.

K-1 Coefficient Vectors vs. K Coefficient Vectors

That leaves the issue of whether to use K-1 coefficient vectors (where K is the number of tags), as you see in typical statistics presentations of logistic regression. The first advantage is one less vector to store and, even more importantly, one fewer vector to update. The second is that the model is statistically identified, even under Laplace priors or maximum likelihood.

The alternative is to use K coefficient vectors, as you see in the typical “machine learning” presentation of CRFs or logistic regression. I’m planning to go with the stats approach, as I did for our logistic regression.

Order of Coefficient Vectors and Sparsity

There’s an issue of how to store the coefficient matrix, either tag dominant or feature dominant. That is, do we have a map from features to categories to numerical values, or from categories to features to numerical values? The latter is the natural way to do it if you look at the typical definitions. The former way would be useful if many features are non-zero for half the categories or fewer. I don’t think that’ll happen much given the way the regressions work, so I’m planning to go with the latter (categories to features to numerical values).

I pretty much have to use sparse vectors for the edge and node features. But what about the model’s regression coefficients? I’ll be taking the coefficient vector’s dot-product with the sparse edge and node features. That’s much more efficient for dense coefficient vectors. I could save some space for really big models, but I think the time’s going to be more precious here. Luckily, this implementation detail will be easy to change under the hood if I change my mind later.

Arithmetic Precision

We won’t be able to estimate coefficients with more accuracy than is representable as a float. So for storing the models, floats make most sense. The problem is that at run time, we need to do double-precision arithmetic for decoding (in all three cases, Viterbi first best, Viterbi/A* n-best, or full forward-backward marginal decoding). It’s slower to cast floats to doubles every time you do a multiply, so I’ll probably keep the coefficients dense and doubles as long as they’ll fit. If they won’t fit, I’ll probably switch the implementation to sparse floats. It’ll all be transparent and backward compatible at the API level if I make the switch after the first release.

Start/End Tags

For HMMs, I added a BEGIN-OF-SENTENCE tag and an END-OF-SENTENCE tag. The start tag is not generated, but used as context to generate the first word in the sentence. The end-of-sentence tag is generated, because it’s required to get the normalizations right.

For CRFs, it seems to make most sense to include a BEGIN-OF-SENTENCE tag that doesn’t have an estimated coefficient, but is just used in defining the edge features for the first token in a sentence. The end marker is actually derivable from the node (word-only) features. So is the begin marker, but it seems simpler this way than to have a null value passed in to the edge feature extractor.

I’m not even considering other implementations. I’ll parameterize it the same way as for logistic regression. This means I’ll allow coefficient priors (Laplace, Gaussian, Cauchy, with the option of non-zero means and different variances/scales per dimension). I’ll also allow an annealing schedule.

SGD Code Re-use

A serious question is whether I should try to re-use the SGD code in logistic regression. The basic algorithm’s the same, only now I have to do forward-backward in the inner loop to compute the gradient of the log likelihood function. I’m thinking I won’t share the code, but that’s mainly because the refactoring to something with the right callbacks in the inner loop looks daunting.

Forward/Backward Decoding and N-best

I need to do forward/backward for training, not just confidence-based extraction. What I’d like to do is provide a common format for the lattice output I can share between HMM taggers and CRFs.

I’ve already defined a new package com.aliasi.tag that has three interfaces for tagging: first-best, n-best (joint or conditional probabilities), and marginal (per tag) taggers matching those for HMMs.

Relation to Chunking

I’d also like to provide a general tagger to chunker implementation, which means more refactoring, but will immediately let me use CRFs to do named-entity extraction with n-best and confidence-based (per entity mention) extractions. I’d very much like to refactor the CharLmHmm chunker to rely on a generic tagger. This should be doable.

Unfortunately, storing in memory all of the features for an input task the size of the Penn Treebank POS data (about 50 tags and 1M words) is prohibitive. There are 50M edge feature extractions and 1M node feature extractions. Even if the edge feature extractions are tiny (say 40 bytes), that’s a lot of memory; if we add in previous tag plus word features, it won’t fit in memory even on a big workstation. If there are hundreds of word features per node, which is conservative if we use 5-word contexts, n-grams and subword features, and represent them tightly, we’re still looking at several KB/vector and thus several GB of memory.

So I’m inclined to do the feature extraction online. This is making me queasy, though, because it’ll be very expensive if I do it naively in a single thread (even measured relative to forward/backward, perhaps dominating training time).

Alternatively, I can run feature extraction in one thread and gradient computation in another. Then we probably won’t notice the feature extraction if there are enough CPUs free.

The other alternative, which matches what many CRF packages do, is to require all of the feature vectors for the training corpus to be serialized. Then they can be read back in much more efficiently than creating them from scratch. It won’t take that much disk space, but the requirement will be nontrivial (perhaps as much as 20 or 30GB with Treebank POS size corpora and subword features).

Generic Inputs

There’s really nothing word-specific about CRFs. In fact, they could take arbitrarily structured inputs instead of word tokens. This would require generifying the CRF interface to the type of objects being tagged, and similarly generifying feature extraction and the lattice representations. I could probably get away then with having the HMMs just instantiate the generic to String.

I may just go ahead and do this once it’s implemented for strings. It’ll probably be easier to refactor a string-based implementation than to start with something more generic. (Thanks again to those Java engineers for making things so easy to modify with backward compatibility.)

Overflow / Underflow

Overflow happens when a linear predictor gets too big and exponentiating it overflows. We handle this by computing softmax the right way.

It’s easy to underflow during forward-backward, which is on a linear scale. Sutton and McCallum outline two approaches to handling this. One is what we did for HMM forward/backward, which is to normalize each forward vector (the forward values for all tags at a particular token) to sum to 1.0. In this case, the usual computation for the normalizing constant Z just drops out.
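Here's a minimal sketch of that rescaling for a single induction step; the names (forwardStep, potential) are mine for illustration, not LingPipe's:

```java
public class ScaledForward {
    // One induction step of the forward pass with per-token rescaling.
    // alphaPrev[k']: normalized forward values for the previous token;
    // potential[kPrev][k]: exponentiated linear predictor for the transition.
    static double[] forwardStep(double[] alphaPrev, double[][] potential) {
        int numTags = potential[0].length;
        double[] alpha = new double[numTags];
        double sum = 0.0;
        for (int k = 0; k < numTags; ++k) {
            for (int kPrev = 0; kPrev < alphaPrev.length; ++kPrev)
                alpha[k] += alphaPrev[kPrev] * potential[kPrev][k];
            sum += alpha[k];
        }
        // Rescale the column to sum to 1.0; the product of the per-column
        // sums recovers the normalizer Z, which otherwise drops out.
        for (int k = 0; k < numTags; ++k)
            alpha[k] /= sum;
        return alpha;
    }
}
```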

The other approach Sutton and McCallum mention, which I’d never heard of before reading their paper, is to keep the forward/backward values on the log scale and then perform computations in the exponentiated domain and rescale. This can actually be done with reasonable precision (always a problem with logs/exps). As Sutton and McCallum note, this can be expensive because of the increased number of calls to log() and exp().

Unsupported Features

I have to admit I didn’t understand the discussion of “unsupported” features in Sutton and McCallum, which are defined to be features that don’t appear in the training data. This may be because I’ve already factored my features so that there are no output-specific features, so there may not be any such “unsupported” features. I’m definitely restricting to only features that appear in the training data, because I can’t train anything using stochastic gradient.

Boolean Feature Vectors

Many NLP applications use boolean feature vectors, which only allow values 0 or 1. By restricting to boolean feature vectors, many computations can be made more efficient. I think I may go partway there by including dense and sparse boolean feature vector representations. The sparse ones would require a very high degree of sparsity to be worth it in terms of space, but they may dominate in things like cross-products because they’d use very few comparisons.
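For instance, here's the kind of saving a sparse boolean representation buys in a dot product with a dense coefficient vector (a sketch, not a proposed API):

```java
public class SparseBool {
    // Dot product of a sparse boolean feature vector (represented by its
    // sorted non-zero dimensions) with a dense coefficient vector: no
    // multiplications at all, just one addition per non-zero dimension.
    static double dot(int[] nonZeroDims, double[] beta) {
        double sum = 0.0;
        for (int dim : nonZeroDims)
            sum += beta[dim];  // feature value is implicitly 1.0
        return sum;
    }
}
```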

Quantized Information Gain (Conditional Entropy) for Real- and Count-Valued Features

May 14, 2009

I’ve just implemented information gain for feature selection, but I’m unhappy with how poorly its selection criterion is suited to linear classifiers like logistic regression, which take real-valued inputs, or even naive Bayes, which takes count-valued inputs. I’ve been thinking the best I can do is to quantize the input features into three categories: positive, negative, and zero.

Information Gain (aka Mutual Information)

Information gain (closely related to mutual information) is defined intuitively as the reduction in your uncertainty about the category of an item being classified (X) provided by knowing the value of a feature (Y). Uncertainty is measured in entropy for distributions, sample entropy or estimated model entropy for data sets. Uncertainty given the value of a feature is conditional entropy. The gain is the difference between the entropy and conditional entropy.

In symbols, the entropy H(X) of a discrete random variable X with distribution p() is

H(X) = – Σx in X p(x) log p(x).

Conditional entropy is just the entropy of random variable X given the value of Y,

H(X|Y) = Σ y in Y p(y) H(X|Y=y),

where H(X|Y=y) is the entropy of the variable X restricted to cases where Y=y,

H(X|Y=y) = – Σx in X p(x|y) log p(x|y).

The information gain of a feature Y with respect to X is just its mutual information with respect to X:

IGX(Y) = I(X; Y) = H(X) – H(X|Y).

Because H(X) is fixed, a feature Y has a higher information gain than feature Z if H(X|Y) < H(X|Z). Information gain is just mutual information as a function of Y given a fixed X. This confusing terminology is like referring to p(y|θ) as a likelihood function when considered a function of θ with y fixed and as a conditional probability function when considered as a function of y with θ fixed.

Count and Continuous Variables

In the most common case, Y is a boolean (Bernoulli) random variable, taking on only two possible values, 0 and 1. More than two outcomes is typically coded with boolean indicator variables, exactly one of which takes on value 1 for any input. In these cases, information gain makes sense.

But we’re usually dealing with count variables (e.g. naive Bayes, language model classifiers) or continuous variables (e.g. logistic regression, K-nearest neighbors, decision trees). The definition of conditional entropy already covers the count case, because counts are countable. But it assumes that naive Bayes could somehow treat the counts non-monotonically. In fact, naive Bayes probability estimates are always monotonic in the features (because they simply multiply counts by word log probability estimates).

Continuous variables are handled by replacing the sum with an integral. But with discriminative classifiers like logistic regression, we don’t even have a generative model of the input data, so what do we integrate? Logistic regression, at least without expanding the feature basis with interactions, quadratic terms, etc., is also monotonic, though it can go up and down because of the possibility of negative values.

What I’ve been toying with doing is just quantizing continuous variables and count variables into three categories: negative (Y < 0), zero (Y = 0), and positive (Y > 0). If no one has any objections or better ideas, that’s what’s going into com.aliasi.features for the next release. It reduces to the usual definition for binary features (coded either way, as variables that take on only values 0 or 1, or take on only values -1 or 1).
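Here's a minimal sketch of the proposed three-way quantization and the resulting information gain, computed from co-occurrence counts (the method names are mine for illustration, not the com.aliasi.features API):

```java
import java.util.Arrays;

public class InfoGain {
    // Quantize a real- or count-valued feature into negative/zero/positive.
    static int quantize(double y) {
        return y < 0.0 ? 0 : (y == 0.0 ? 1 : 2);
    }

    // Entropy (in nats) of a distribution given by unnormalized counts.
    static double entropy(double[] counts) {
        double total = Arrays.stream(counts).sum();
        double h = 0.0;
        for (double c : counts)
            if (c > 0.0)
                h -= (c / total) * Math.log(c / total);
        return h;
    }

    // counts[q][x] = number of training items with quantized feature
    // value q (0, 1, or 2) and category x.  IG = H(X) - H(X|Y).
    static double informationGain(double[][] counts) {
        int numCats = counts[0].length;
        double[] catCounts = new double[numCats];
        double total = 0.0;
        for (double[] row : counts)
            for (int x = 0; x < numCats; ++x) {
                catCounts[x] += row[x];
                total += row[x];
            }
        double hX = entropy(catCounts);   // H(X)
        double hXgivenY = 0.0;            // H(X|Y) = sum_y p(y) H(X|Y=y)
        for (double[] row : counts) {
            double rowTotal = Arrays.stream(row).sum();
            if (rowTotal > 0.0)
                hXgivenY += (rowTotal / total) * entropy(row);
        }
        return hX - hXgivenY;
    }
}
```

A perfectly predictive quantized feature recovers all of H(X); a feature independent of the category has gain zero.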

This is pretty much what George Forman did in his 2003 JMLR paper on feature selection, An Extensive Empirical Study of Feature Selection Metrics for Text Classification (in their special issue on feature selection). Spoiler Warning: Forman found information gain to be among the top feature selection methods for text classifiers, but only considered count data quantized into two categories, zero and positive.

LingPipe 3.8.1 Released

May 12, 2009

We just released LingPipe 3.8.1, which patches LingPipe 3.8.0. As usual, it’s available from the home page:

There are only two changes:

1. Bug fixes/performance for logistic regression

We removed the object allocations introduced in 3.8 to reduce garbage collection overhead.

We fixed a bug introduced in 3.8 that prevented all but the first category in a multinomial logistic regression from being regularized.

2. Start/End points for tokenizers

We added an end point method to tokenizers, and implemented start/end with unit tests for all of the supplied tokenizers. This makes for easier text highlighting, especially for filtered tokenizers like Porter stemming or stoplisting.

Java Object Allocations Costly in Inner Loops

May 11, 2009

I’d pretty much stopped worrying about short-lived object allocations in Java. GC’s gotten so fast it’s usually not the bottleneck in anything I’ve profiled lately, especially for short-lived objects in programs where throughput rather than low maximum latency is critical.

That is, until I got a phone call from a customer asking if I’d added multi-threading to the logistic regression estimator. The short answer was “no”.

Hmm. Logistic regression version 3.8 uses two threads. What changed? Just a really tiny, really short-lived object allocation inside the next-to-innermost loop. It was a wrapper for the non-zero dimensions. Mike added it in order to clean up my original code, which had cut-and-pasted the prior/regularization/shrinkage steps in multiple places to deal with middle/end-of-loop differences and sparse/dense array differences.

Here’s what version 3.8.0 looks like (slightly simplified), with this loop being executed once per training epoch:

for (int j = 0; j < numTrainingInstances; ++j) {
    Vector xsJ = xs[j];
    int csJ = cs[j];
    for (int k = 0; k < numOutcomesMinus1; ++k) {
        PrimitiveInt dimensions
            = hasSparseInputs
            ? Iterators.intArray(xsJ.nonZeroDimensions())
            : Iterators.intRange(numDimensions);
        // call reconstructed from the method signature below
        adjustWeightsWithPrior(weightVectors[k], j, dimensions,
                               prior, learningRate, numTrainingInstances,
                               lastRegularizations);
    }
    double[] conditionalProbs = regression.classify(xsJ);
    for (int k = 0; k < numOutcomesMinus1; ++k) {
        // hypothetical name; the likelihood-gradient call was elided
        adjustWeightsWithGradient(
            conditionalProbs[k], learningRate, xsJ, k, csJ);
    }
    ...
}


and then here’s the code in the inner loop:

void adjustWeightsWithPrior(DenseVector weightVectorsK,
                            int curInstance,
                            PrimitiveInt dimensions,
                            RegressionPrior prior,
                            double learningRate,
                            int numTrainingInstances,
                            int[] lastRegularizations) {

    while (dimensions.hasNext()) {
        int dim = dimensions.nextPrimitive();
        if (curInstance == lastRegularizations[dim])
            continue;
        int skippedDimMultiplier
            = curInstance - lastRegularizations[dim];
        double weightVectorsKDim = weightVectorsK.value(dim);
        if (weightVectorsKDim == 0.0)
            continue;
        // gradient of the log prior at the current coefficient value
        double priorGradient = prior.gradient(weightVectorsKDim, dim);
        double delta
            = (skippedDimMultiplier * learningRate * priorGradient)
            / numTrainingInstances;
        double newVal = weightVectorsKDim > 0.0
            ? Math.max(0.0, weightVectorsKDim - delta)
            : Math.min(0.0, weightVectorsKDim - delta);
        weightVectorsK.setValue(dim, newVal);
        lastRegularizations[dim] = curInstance;
    }
}


With the primitive int iterator (in util.Iterators along with factory creation methods as used above), there’s no further boxing/unboxing required for the numbers.

Mike’s changes seemed like a good idea to me when I code reviewed it, as they made the code much cleaner and more modular. And after all, the innermost loop went over the non-zero dimensions themselves, doing multiply/add steps after method calls. And usually hundreds of these. And all the code to actually follow the likelihood gradient is even more expensive than the prior gradients. So how expensive could the object allocation be, relatively speaking?

Very. It was enough to cause a second thread to be pinned on the computer running the job. That’s exactly what you’d expect if you maxed out the garbage collector with short-lived object allocations. In the server versions of the JVMs (pretty much all we use, as all our machines are 64 bit), garbage collection runs concurrently by default.

At work, on our dual quad-core Xeons with big caches and fast memory, you couldn’t even notice it in terms of either wall clock time or its effect on the system. Debugging on my Core Duo notebook was a different story. Firing up logistic regression version 3.8.0 so hosed the machine I couldn’t even open Firefox without a 2 minute wait for the screen to draw.

To make a long story short, I ripped the lid off again and we now have the best of both worlds. I refactored Mike’s refactoring to preserve the modularity but remove the object allocation. Profiling showed it worked in that logistic regression estimation is only using a single thread, which is what you’d expect, because there’s no object allocation anywhere. It’s also faster on my notebook.

The moral of the story is that fast Java programs need to look like C (without the malloc, of course). In most cases, we’re not deep enough in inner loops in CPU-intensive enough jobs for this to matter. Alas, sometimes, we are.

Stay tuned — I’m about to roll out LingPipe 3.8.1 which has this patch in place. For comparison, here’s the new code:

int[] allDimensions = new int[numDimensions];
...
for (int k = 0; k < numOutcomesMinus1; ++k) {
    ...
    int[] dimensions
        = hasSparseInputs
        ? xsJ.nonZeroDimensions()
        : allDimensions;
    adjustWeightsWithPrior(..., dimensions, ...);


where nonZeroDimensions() returns the array of non-zero dimensions for a sparse vector. The update code is pretty much the same:

void adjustWeightsWithPrior(...
                            int[] dimensions,
                            ...) {
    for (int i = 0; i < dimensions.length; ++i) {
        int dim = dimensions[i];
        ...


(Warning: There’s another bug in the above code relative to 3.7.0, namely that the lastRegularization gets reset too soon, preventing the 2nd through K-1st coefficient vectors from getting regularized. On the plus side, we’ve also fixed a bug and a separate arithmetic overflow problem from 3.7.0.)

Max Entropy vs. Logistic Regression: Feature Extraction and Coefficient Encoding

April 30, 2009

I’m designing a conditional random field (CRF) package for LingPipe, so I’ve been reading through the literature on CRFs, most of which is expressed in terms of maximum entropy. I can’t recommend Ryan McDonald’s tutorial slides highly enough:

Among other things, Ryan shows how the maximum entropy estimate subject to expectation matching constraints yields the same unique solution as the maximum likelihood estimate (where the derivative of error shows the max likelihood estimate matches expectations). But I’m not going to harp on terminology or philosophy today, but rather on the more immediate practical problems of feature extraction and coefficient encoding.

K vs. K-1 Coefficient Vectors?

Consider multinomial classifiers with K outcome categories c[1],…,c[K] and n-dimensional vector inputs x. In logistic regression, as typically done in stats, you have K-1 n-dimensional coefficient vectors β[1], …, β[K-1] and the probability distribution over categories given an input vector x is defined so that p(c[k]|x) is proportional to exp(β[k] * x) for 1 <= k <= K-1, and p(c[K]|x) is proportional to exp(0 * x) = 1.

Sometimes, you see K coefficient vectors and p(c[K]|x) is proportional to exp(β[K] * x) just like for the other dimensions. For instance, Dave Lewis and David Madigan et al.’s package Bayesian Multinomial Regression (BMR) lets you choose K or K-1 coefficient vectors; LingPipe uses the K-1 encoding. Dave (Lewis) told me they included the K-vector approach because their users found it easier to interpret than pinning the last coefficient vector to 0.

The problem with the K coefficient vector approach is that the parameters are not identified in the statistical sense. Basically, if you subtract β[k] from all the coefficient vectors (leaving β[k] = 0), you get exactly the same predictions. Zero-centered (or other) priors can identify unique solutions that minimize coefficient values. Gaussian priors (aka L2 regularization) always lead to a unique solution. Laplace priors might not. Imagine a very simple binary (outcomes K = 2) single regression (dimensions n = 1), and the vector β[1] = 1.0 in the K-1 coefficient coding, which implicitly sets β[2] = 0.0. For an input x, we get

p(c[1]|x) = exp(x * 1.0) / ( exp(x * 1.0) + exp(x * 0.0) )

p(c[2]|x) = exp(x * 0.0) / ( exp(x * 1.0) + exp(x * 0.0) ).

We get the same results with the K-vector coefficient encoding, taking β[1] = 0.5 and β[2] = -0.5, namely

p(c[1]|x) = exp(x * 0.5) / ( exp(x * 0.5) + exp(x * -0.5) )

p(c[2]|x) = exp(x * -0.5) / ( exp(x * 0.5) + exp(x * -0.5) )

(To see the equivalence, multiply the numerator and denominator of the second set of equations by exp(x*0.5). )

Coefficients of 0.5 and -0.5 are more likely in a zero-centered Gaussian prior (L2 regularization) than coefficients of 1.0 and 0.0, but have the same likelihood under a Laplace prior (L1 regularization).
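The equivalence is easy to check numerically; here's a standalone sketch of the two encodings for the binary single-regression example above:

```java
public class SoftmaxEncoding {
    // p(c[1]|x) and p(c[2]|x) for scalar input x under coefficients b1, b2
    static double[] probs(double x, double b1, double b2) {
        double e1 = Math.exp(x * b1);
        double e2 = Math.exp(x * b2);
        return new double[]{ e1 / (e1 + e2), e2 / (e1 + e2) };
    }

    public static void main(String[] args) {
        double x = 0.7;
        double[] kMinus1 = probs(x, 1.0, 0.0);   // K-1 coding: beta[2] pinned to 0
        double[] kVector = probs(x, 0.5, -0.5);  // K coding: both shifted by -0.5
        // Identical predictions: the shift cancels in the normalization
        System.out.println(kMinus1[0] + " == " + kVector[0]);
    }
}
```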

Sampling approaches to modeling the posterior distribution p(θ|X) are more problematic, because the usual measures of Gibbs sampler convergence (e.g. R hat) won’t work. The usual solution is to monitor the K-1 dimensional transform, because that is identified.

Using a Single Coefficient Vector

What always throws me when I read max entropy presentations is that they don’t use K or K-1 vectors, they use a single vector, with a different kind of feature extraction. Specifically, there’s a different input feature vector for each category, traditionally written φ(c[k],e) for input e and category c[k]. Recall that the usual set up in logistic regression is to use a single feature extractor ψ(e) that is independent of category (the value of ψ(e) is the input vector x we used above before talking about feature extraction). Now with K vectors for input e, φ(c[1],e) through φ(c[K],e), we have a single coefficient vector β and we model probabilities p(c[k]|e) as being proportional to φ(c[k],e) * β.

It’s easy to code the usual logistic setup in the max entropy pattern. Just set β = β[1],…,β[K] and set φ(c[k],e) to (0,…,0,ψ(e),0,…,0), with ψ(e) in the k-th block. Going the other way around doesn’t work, because there may be features extracted by the more general φ that can’t be replicated in the usual logistic setup.
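That block encoding can be sketched as follows (helper names are mine, for illustration only): placing ψ(e) in the k-th block of a single long vector makes the single-vector dot product φ(c[k],e) * β equal the per-category dot product ψ(e) * β[k].

```java
public class PhiEncoding {
    // Build phi(c[k], e) = (0,...,0, psi(e), 0,...,0), psi(e) in block k.
    static double[] phi(double[] psi, int k, int numCats) {
        double[] result = new double[psi.length * numCats];
        System.arraycopy(psi, 0, result, k * psi.length, psi.length);
        return result;
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; ++i)
            sum += a[i] * b[i];
        return sum;
    }
}
```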

I asked Hal Daumé III when he was in town if people used this flexibility in the models. He said usually not, but he did mention one case he’d seen, which is in first-order CRFs, where the conditioning is on the previous category as well as the entire input string. He said he’d seen people code up the feature “my tag is the same as the previous tag”, which is not something you can code directly in the logistic setting. There, you can only get “is my tag T and is the previous tag T” duplicated for every tag T.

Are there other examples?

Is it worth implementing the more general feature extraction case?

LingPipe 3.8.0 Released

April 16, 2009

LingPipe 3.8.0 is now available at:

The big news is that 3.8 is likely to be the last backward-compatible release before migration to 4.0, which will remove methods, classes, and constants deprecated in 3.8.

Backward-Compatible Java (De)Serialization

April 10, 2009

Time to ‘fess up. Sometimes when you look at code you wrote in the past, it looks very ugly. I took a very ill-advised expedient in serializing tokenized language models in the original implementation.

Current TokenizerFactory Serialization

I simply wrote out their fully qualified class name:

void writeExternal(ObjectOutput out) throws IOException {
    out.writeUTF(mLM.mTokenizerFactory.getClass().getName());
    ...
}


Then I read it back in the obvious way:

Object read(ObjectInput in)
    throws IOException, ClassNotFoundException {

    String className = in.readUTF();
    TokenizerFactory factory
        = (TokenizerFactory) Reflection.newInstance(className);
    ...
}


where Reflection.newInstance(String) is our (about to be deprecated) utility to create an object from the name of its class using the no-arg constructor.

What was I thinking? The sad truth is that I was too lazy to write all those serializers. Always think twice before implementing something because it’s easy. The whole util.Reflection was just such a hack; the original reflection implementations threw all those exceptions for a reason!

The Real Problem

The problem here is that folks ran into run-time issues with their factories (and ours) not having no-arg constructors. For instance, suppose we have a factory that requires a string and integer to construct:

class SomeFactory implements TokenizerFactory {
    SomeFactory(String a, int b) { ... }
}


A Hacked Solution

In practice, when you’re building a specific model, there’s a fixed value for the constructor args. So you can define a quick adapter class:

class MyFactory extends SomeFactory {
    public MyFactory() { super("foo", 42); }
}


Refactoring the Right Way

So I want to now refactor to serialize the factory rather than writing the name of its class. But how to handle backward compatibility? At least in this case, it wasn’t too hard. I use the fact that I used to write out a string to write out a “magic value” as a flag, in this case the empty string, because it’s short and it can’t be a class name.

void writeExternal(ObjectOutput out) throws IOException {
    if (mLM.mTokenizerFactory instanceof Serializable) {
        out.writeUTF("");
        out.writeObject(mLM.mTokenizerFactory);
    } else {
        out.writeUTF(mLM.mTokenizerFactory.getClass().getName());
    }
    ...
}


Object read(ObjectInput in)
    throws IOException, ClassNotFoundException {

    String className = in.readUTF();
    TokenizerFactory factory = null;
    if ("".equals(className)) {
        factory = (TokenizerFactory) in.readObject();
    } else {
        factory
            = (TokenizerFactory)
            Reflection.newInstance(className);
    }
    ...
}


There you have it, backward compatibility.

Tokenizer Pattern Smackdown: Factories, Constructors, Singletons, Filters, and Adapters

April 9, 2009

We’re rolling out a suite of feature extraction utilities and tokenizer factories for the upcoming 3.8 release. This has caused me (and those in my questioning range, namely Mike, Mitzi and Breck), considerable anguish. I’m trying to be a good boy and follow Josh Bloch’s advice in Effective Java. But like Paul Grice’s conversational maxims, which urge you to be both concise and unambiguous, Bloch’s individually reasonable sounding advice is rather contradictory when considered collectively.

The Tokenization Interfaces

We’re pretty much locked into our basic interfaces and base classes, which for the point of this discussion may be thought of as:

Tokenizer
    String nextToken();

TokenizerFactory
    Tokenizer tokenizer(char[] cs, int start, int end);


Chaining Tokenizers with Filters

Tokenizers like to chain. Typically, you start with a base tokenizer, like our Indo-European tokenizer, then filter it by lowercasing tokens, converting to stems, removing stopwords, and so on. This is a classic instance of the filter pattern.

Way back in version 1.0, tokenizers were often external to models. This is still the case, for instance, with HMMs, which take in an array of strings (representing tokens) and return an array of strings (representing tags, or hidden states). It’s then the client’s job to make sure the tokens supplied at run time match those used for training.

The natural thing to do was to use a filter pattern to implement. With simple tokenizers and filtered tokenizers, we’d do things like:

Tokenizer ieTok = new IndoEuropeanTokenizer(cs,start,len);
Tokenizer lowTok = new LowerCaseFilterTokenizer(ieTok);
Tokenizer stopTok = new EnglishStopFilterTokenizer(lowTok);
String[] tokens = stopTok.tokenize();


Looks just like java.io streams, readers, and writers in action. That is, we construct an instance of tokenizer using a constructor (in this case, IndoEuropeanTokenizer(cs,start,len)). Then we apply a bunch of filters through their constructors (e.g. LowerCaseFilterTokenizer(ieTok)). Each filter holds onto its contained tokenizer and modifies its output in some way.
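Here’s a minimal self-contained sketch of that delegation, with simplified stand-in interfaces rather than LingPipe’s actual classes:

```java
// Simplified stand-in for the tokenizer interface.
interface Tokenizer {
    String nextToken();  // returns null when exhausted
}

// A trivial base tokenizer over a fixed array of tokens.
class ArrayTokenizer implements Tokenizer {
    private final String[] mTokens;
    private int mPos = 0;
    ArrayTokenizer(String... tokens) { mTokens = tokens; }
    public String nextToken() {
        return mPos < mTokens.length ? mTokens[mPos++] : null;
    }
}

// A filter tokenizer holds its contained tokenizer and
// rewrites each token on the way out.
class LowerCaseFilterTokenizer implements Tokenizer {
    private final Tokenizer mTokenizer;  // the wrapped tokenizer
    LowerCaseFilterTokenizer(Tokenizer tokenizer) {
        mTokenizer = tokenizer;
    }
    public String nextToken() {
        String token = mTokenizer.nextToken();
        return token == null ? null : token.toLowerCase();
    }
}
```

Chaining then looks just like wrapping a Reader around a Stream: new LowerCaseFilterTokenizer(new ArrayTokenizer("Foo","BAR")).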

Type Raising: Factories

In order to supply tokenization capabilities to a model, we need a way to create tokenizers from strings, which is where the tokenizer factory comes in. The interface is easy, but the implementation is harder. To apply the tokenizer-level filters, we need to use:

TokenizerFactory tf = new TokenizerFactory() {
    public Tokenizer tokenizer(char[] cs, int start, int len) {
        Tokenizer ieTok
            = new IndoEuropeanTokenizer(cs,start,len);
        Tokenizer lowTok
            = new LowerCaseFilterTokenizer(ieTok);
        Tokenizer stopTok
            = new EnglishStopFilterTokenizer(lowTok);
        return stopTok;
    }
};


Oh, and you probably want to make that implement java.io.Serializable, which means you need a static nested class or new class file on which to hang the declaration. Writing robust serialization is a pain (LingPipe’s util.AbstractExternalizable helps immensely in following Josh Bloch’s recommended practices).

Filtering Tokenizer Factories

This was all getting to be too much. What we really need is something to filter tokenizer factories. What we eventually settled on was the same-old pattern, reasoning that it’d be much more familiar to programmers because of its ubiquity in the Java libraries (like java.io and org.xml.sax). What we now have is something like this:

TokenizerFactory ieTf = new IndoEuropeanTokenizerFactory();
TokenizerFactory lowTf
    = new LowerCaseTokenizerFactory(ieTf);
TokenizerFactory stopTf
    = new EnglishStopTokenizerFactory(lowTf);


Singleton Base Factories

Typically, a tokenizer factory is stateless and thread safe, and hence can be modeled using the singleton pattern, meaning a single instance that is used wherever an instance of that tokenizer factory is needed.

So I’ve gone through and deprecated the constructors (e.g. new IndoEuropeanTokenizerFactory()) in favor of static final instances (e.g. IndoEuropeanTokenizerFactory.INSTANCE), which is by far the safest way to implement singletons in Java. Java’s class loader even provides lazy instantiation: the singleton’s created when the class is loaded, and being static and final, it’s always safely published.
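The static-final-instance idiom can be sketched as follows; the class and its tokenize method here are hypothetical stand-ins, not LingPipe’s actual implementation:

```java
// Sketch of the static-final-instance singleton idiom: the class
// loader creates INSTANCE lazily on first use of the class, and
// static final fields are always safely published across threads.
class SingletonFactory {

    public static final SingletonFactory INSTANCE
        = new SingletonFactory();

    private SingletonFactory() { }  // no outside instantiation

    // stand-in for real tokenization: split on whitespace
    public String[] tokenize(String in) {
        return in.split("\\s+");
    }
}
```

Clients just reference SingletonFactory.INSTANCE; there is never a second instance to worry about.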

It’s an easy change for clients, with the nullary constructor call in the first line replaced by the instance:

TokenizerFactory ieTf = IndoEuropeanTokenizerFactory.INSTANCE;
...


Singleton Tokenizer Factory Filters?

So why not do the same thing with the filters? That is, define an interface such as this one:

TokenizerFactoryFilter
    TokenizerFactory filter(TokenizerFactory factory);


They’re easy to implement, too, given that we’ve already defined the constructor-based filter LowerCaseTokenizerFactory. Here we’ve just gone ahead and assigned one to a singleton, as we don’t need more than one instance of the filter:

static final TokenizerFactoryFilter LOWER_CASE_FILTER
    = new TokenizerFactoryFilter() {
        public TokenizerFactory filter(TokenizerFactory f) {
            return new LowerCaseTokenizerFactory(f);
        }
    };


We can even collect the singleton tokenizer factories and filters into a big utility class and privatize the classes that implement them. Then what we’re left with is something like this:

TokenizerFactory f =
ENGLISH_STOP_FILTER
.filter(LOWER_CASE_FILTER
.filter(INDO_EUROPEAN_FACTORY));


In the end, we decided against it, as cool as it is, and as much as it follows all of Josh Bloch’s advice to a tee. By hiding all the classes, you cut down on cognitive load. Unfortunately, you also cut down on findability and wind up with one big class with tons of doc and singletons in it. You do get the advantage of being able to return whatever types you want and retain the freedom to change implementations (modulo serializability backward compatibility) at will.

Let’s Not Innovate on Style

But it was so non-standard that it’d just be confusing to clients. So we went with the more familiar constructor-based filter pattern, though I did make the base factories singletons.

What I did need to do was add adapter implementations. Most of the tokenizer factory filter implementations really only modify the tokenizer, and typically only the tokens and whitespaces individually at that. So I broke out ye olde adapter pattern, first for tokenizers:

abstract class ModifiedTokenizerFactory
    implements TokenizerFactory {

    private final TokenizerFactory mBaseFactory;

    public ModifiedTokenizerFactory(TokenizerFactory factory) {
        mBaseFactory = factory;
    }

    public Tokenizer tokenizer(char[] cs, int start, int len) {
        Tokenizer tokenizer
            = mBaseFactory.tokenizer(cs,start,len);
        return filter(tokenizer);
    }

    public abstract Tokenizer filter(Tokenizer tok);

}


and then one more level to filter the tokenizer itself (in simplified form; the real implementation has to deal with deletions of tokens, too, in order to generalize to stop lists):

abstract class ModifyTokenFactory
    extends ModifiedTokenizerFactory {

    public ModifyTokenFactory(TokenizerFactory factory) {
        super(factory);
    }

    public Tokenizer filter(final Tokenizer in) {
        return new Tokenizer() {
            public String nextToken() {
                return modify(in.nextToken());
            }
        };
    }

    abstract String modify(String token);

}


It makes implementing the filters a piece of cake. The basic idea is to adapt a token modifier up to a tokenizer modifier and then a tokenizer factory modifier.
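For instance, with the adapters in place, a concrete filter like lowercasing boils down to a single method. Here’s a self-contained sketch with simplified stand-in interfaces (not LingPipe’s actual classes):

```java
// Simplified stand-ins for the tokenizer interfaces.
interface SimpleTokenizer {
    String nextToken();  // returns null when exhausted
}

interface SimpleTokenizerFactory {
    SimpleTokenizer tokenizer(char[] cs, int start, int len);
}

// Adapter: raises a per-token modifier up to a full factory filter.
abstract class TokenModifierFactory implements SimpleTokenizerFactory {
    private final SimpleTokenizerFactory mBaseFactory;
    TokenModifierFactory(SimpleTokenizerFactory factory) {
        mBaseFactory = factory;
    }
    public SimpleTokenizer tokenizer(char[] cs, int start, int len) {
        final SimpleTokenizer base = mBaseFactory.tokenizer(cs,start,len);
        return new SimpleTokenizer() {
            public String nextToken() {
                String tok = base.nextToken();
                return tok == null ? null : modify(tok);
            }
        };
    }
    abstract String modify(String token);
}

// A trivial base factory that splits on whitespace.
class WhitespaceFactory implements SimpleTokenizerFactory {
    public SimpleTokenizer tokenizer(char[] cs, int start, int len) {
        final String[] toks = new String(cs,start,len).split("\\s+");
        return new SimpleTokenizer() {
            private int mPos = 0;
            public String nextToken() {
                return mPos < toks.length ? toks[mPos++] : null;
            }
        };
    }
}

// The payoff: a concrete filter is a one-method subclass.
class LowerCaseFactory extends TokenModifierFactory {
    LowerCaseFactory(SimpleTokenizerFactory f) { super(f); }
    String modify(String token) { return token.toLowerCase(); }
}
```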

Feature Extractors: More of the Same

Most of this discussion applies to feature extractors, too, which we’ve modified in the same way as tokenizers for 3.8. In fact, when you create feature extractors on top of tokenizers, it looks even more like Java I/O, because you move from tokenizer factories to feature extractors the same way Java I/O moves from streams to readers/writers.

JDK 7 Twice as Fast* as JDK 6 for Arrays and Arithmetic

March 30, 2009

The 7th version of the Java Developer’s Kit (aka JDK 7) delivers quite a speed boost over JDK 6 for array accesses. For us, this is huge. It’s like another year and a half of Moore’s law for free. Only in software. And you don’t even have to write multi-threaded code.

I’ve been profiling my new K-Means++ implementation for the next LingPipe release on some randomly generated data. It’s basically a stress test for array gets, array sets, and simple multiply-add arithmetic. Many LingPipe modules are like this at run-time: named entity, part-of-speech tagging, language modeling, LM-based classifiers, and much more.
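The flavor of that inner loop can be sketched as follows (a stand-in, not LingPipe’s actual k-means code); it’s nothing but array gets and multiply-adds, which is exactly where bounds-check elimination pays off:

```java
// A stripped-down multiply-add inner loop of the sort the clustering
// code stresses: nothing but array reads and arithmetic.
class MultiplyAddBench {

    // squared Euclidean distance, the k-means inner-loop workhorse
    public static double sqDist(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; ++i) {
            double diff = x[i] - y[i];
            sum += diff * diff;  // multiply-add; bounds checks dominate
        }
        return sum;
    }

    public static void main(String[] args) {
        int dims = 1000;
        double[] x = new double[dims];
        double[] y = new double[dims];
        for (int i = 0; i < dims; ++i) {
            x[i] = i;
            y[i] = i + 1.0;
        }
        long start = System.nanoTime();
        double total = 0.0;
        for (int n = 0; n < 10000; ++n)
            total += sqDist(x,y);
        long elapsed = System.nanoTime() - start;
        System.out.println("total=" + total + " ns=" + elapsed);
    }
}
```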

While I was waiting for a run using JDK 1.6 to finish, I installed the following beta release of JDK 7:

> java -version
java version "1.7.0-ea"
Java(TM) SE Runtime Environment (build 1.7.0-ea-b52)
Java HotSpot(TM) 64-Bit Server VM (build 15.0-b03, mixed mode)


You can get it, too:

I believe much of the reason it’s faster is the work of these fellows:

Java’s always suffered relative to C in straight matrix multiplication because Java does range checks on every array access (set or get). With some clever static and run-time analysis, Würthinger et al. are able to eliminate most of the array bounds checks. They show on matrix benchmarks that this one improvement doubles the speed of the LU matrix factorization benchmark in the U.S. National Institute of Standards and Technology (NIST) benchmark suite SciMark 2, which, like our clustering algorithm, is basically just a stress test for array access and arithmetic.

So far, my tests have only been on a Thinkpad Z61P notebook running Windows Vista (64 bit) with an Intel Core 2 CPU (T2700; 2.0GHz), and 4GB of reasonably zippy memory. I don’t know if the speedups will be as great for other OSes or for 32-bit JDKs.

I’m pretty excited about the new fork-join concurrency, too, as it’s just what we’ll need to parallelize the inner loops without too much work for us or the operating system.

*Update: 2:30 PM, 30 March 2009 JDK 7 is only about 15% faster than Sun’s JDK 6 on my quad Xeons (E5410, 2.33GHz) at work running the same code. I’ll have to check the exact specs on both of my memory buses. The notebook has surprisingly fast memory and the Xeon’s running ECC registered memory that I don’t think is quite as fast.

Update: 11:00 AM, 31 March 2009 Like other matrix algorithms, k-means clustering is extremely front-side-bus sensitive (connection between memory and the CPU), because the bottleneck is often between memory and the CPU’s L2 cache. Memory’s significantly slower than CPU these days.

The dual quad-core Intel Xeon E5410s have 12MB of L2 cache each at 2.33GHz, whereas the Thinkpad Z61P’s Intel Core 2 Mobile T7200 has only 4MB of L2 cache at 2GHz. The Core 2 has a 667 MHz front-side bus, whereas the Xeon reports a 1333 MHz front-side bus (though that may just be a difference in how the specs are reported). I actually don’t know what kind of memory’s in the workstation; I’ll have to crack it open and look. I’ve got 4GB of RAM in the notebook, but the motherboard can only see 3GB (that is, it’s not Windows; the same thing happened when I installed Ubuntu on the notebook, and it’s a known design limitation in many motherboards). I have 16GB of RAM in the workstation and the motherboard can see all of it. But it has two physical chips, each of which shares the memory, so the motherboard’s architecture is very different. There are so many confounding factors that I can’t tease apart what’s speeding up JDK 7 so much on the notebook.

Anyway, go forth and test. If you’re using a machine like my notebook to do number crunching, JDK 7 really is twice as fast as JDK 6 for matrix algorithms.