Chain Conditional Random Fields: Implementation and Design Issues

I’m starting to implement conditional random fields (CRFs) for LingPipe. (See, I really don’t hate CRFs.) If you don’t know CRFs, there are good tutorials out there; Sutton and McCallum’s, which I refer to repeatedly below, goes into the most implementation detail.

Implementing CRFs brings up several design issues.

N-gram Order

Like HMMs, chain CRFs extract features over the sequence of tags. The issue is whether the features can span more than a single pair of tags (a bigram). For instance, can I use the previous label and label two back to help model the current label?

Although it’d be nice to implement everything generally enough to handle arbitrary n-gram contexts, it’s a pain in practice: every loop gets one level deeper, and allocating the lattices for arbitrary orders is hard to do generically with Java arrays. Extracting answers with confidence and doing n-best also gets more complex.

I’m also not sure that any order beyond first is going to be tractable enough for training.
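
To make the “deeper loops” point concrete, here’s a minimal first-order Viterbi recursion (illustrative only, not LingPipe code). A second-order model would add another nested loop over the tag two back and index the lattice by tag pairs rather than single tags.

// First-order Viterbi sketch (backpointers omitted for brevity).
// score[n][k] = best log score of any tag sequence ending in tag k at token n.
// nodeScore[n][k] is the log potential for tag k at position n;
// edgeScore[j][k] is the log potential for the transition from tag j to tag k.
public class FirstOrderViterbiSketch {
    static double[][] viterbi(double[][] nodeScore, double[][] edgeScore) {
        int numToks = nodeScore.length;
        int numTags = nodeScore[0].length;
        double[][] score = new double[numToks][numTags];
        score[0] = nodeScore[0].clone();
        for (int n = 1; n < numToks; ++n) {          // token position
            for (int k = 0; k < numTags; ++k) {      // current tag
                double best = Double.NEGATIVE_INFINITY;
                for (int j = 0; j < numTags; ++j)    // previous tag
                    best = Math.max(best, score[n-1][j] + edgeScore[j][k]);
                // a second-order model adds a fourth loop here, over the tag
                // two back, and score[][] gets indexed by pairs of tags
                score[n][k] = best + nodeScore[n][k];
            }
        }
        return score;
    }
}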

Feature Extraction

I plan to do the usual factoring of features into node features (which depend only on the inputs) and edge features (which depend on the inputs and the previous category). Even though edge features are strictly more general, it’s much more efficient to handle the usually far more prevalent node features separately (everything still goes through the forward-backward algorithm). Nodes and edges have the following interfaces:

interface ChainCrf.Node {
    int position();    // token position in the input
    String[] words();  // the full input token sequence
}

and

interface ChainCrf.Edge {
    String previousTag();  // tag assigned to the previous token
    int position();        // token position in the input
    String[] words();      // the full input token sequence
}

Then we have the two feature extractors required to estimate a CRF from a corpus of labeled tag data (a sketch of a possible node extractor follows the list):

  • FeatureExtractor<ChainCrf.Edge>, and
  • FeatureExtractor<ChainCrf.Node>.
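
Here’s a sketch of what a node feature extractor might look like over the proposed interfaces, assuming the usual LingPipe FeatureExtractor contract of a single features() method returning a map from feature names to numbers. The particular features (token identity, prefix, sentence-initial flag) are just illustrations, not part of any release.

import java.util.HashMap;
import java.util.Map;

// Illustrative node feature extractor; not LingPipe code.
public class SimpleNodeFeatureExtractor
    implements FeatureExtractor<ChainCrf.Node> {

    public Map<String,? extends Number> features(ChainCrf.Node node) {
        Map<String,Number> feats = new HashMap<String,Number>();
        String word = node.words()[node.position()];
        feats.put("TOK_" + word, 1);                // current token identity
        feats.put("PREFIX_" + prefix(word,3), 1);   // simple subword feature
        if (node.position() == 0)
            feats.put("INITIAL_TOKEN", 1);          // sentence-initial flag
        return feats;
    }

    static String prefix(String word, int len) {
        return word.length() <= len ? word : word.substring(0,len);
    }
}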

One or More Coefficient Vectors?

Given the feature extraction, I have to follow our logistic regression model in having different coefficient vectors per category rather than a single coefficient vector that essentially concatenates all the individual vectors. Having different vectors makes feature extraction much more economical. In the “standard” CRF presentation, a separate feature extraction is done for each possible tag, both for nodes and edges. That does allow some richer features (e.g. using a feature for labels that don’t change entity type, such as B_PERSON to I_PERSON), and also allows features to be structural zeroes for some tags.

Note that the output tag is implicit in the proposed feature extraction: because each output tag gets its own coefficient vector, every extracted feature is effectively crossed with the output tag, so node and edge features are still distinguished across output tags.

K-1 Coefficient Vectors vs. K Coefficient Vectors

That leaves the issue of whether to use K-1 coefficient vectors (where K is the number of tags), as you see in typical statistics presentations of logistic regression. The first advantage is one less vector to store and, more importantly, one less vector to update. The second is that the model is identified (has a unique solution) even under maximum likelihood or Laplace priors.

The alternative is to use K coefficient vectors, as you see in the typical “machine learning” presentation of CRFs or logistic regression. I’m planning to go with the stats approach, as I did for our logistic regression.
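
Concretely, with the stats-style parameterization the last tag’s coefficients are pinned to zero, so only K-1 vectors ever get stored or updated. Here’s an illustrative sketch (not the LingPipe implementation) of computing the per-tag linear predictors under that convention for a sparse feature vector:

// coeffs has K-1 rows; the K-th tag's coefficients are implicitly all zero.
// featIds/featVals encode a sparse feature vector.
static double[] tagScores(double[][] coeffs, int[] featIds, double[] featVals) {
    int numTags = coeffs.length + 1;           // K tags, K-1 coefficient vectors
    double[] scores = new double[numTags];     // last entry stays 0.0
    for (int k = 0; k < numTags - 1; ++k)
        for (int i = 0; i < featIds.length; ++i)
            scores[k] += coeffs[k][featIds[i]] * featVals[i];
    return scores;
}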

Order of Coefficient Vectors and Sparsity

There’s an issue of how to store the coefficient matrix: tag dominant or feature dominant. That is, do we have a map from features to categories to numerical values, or from categories to features to numerical values? The latter is the natural way to do it if you look at the typical definitions. The former way would be useful if many features are non-zero for half the categories or fewer. I don’t think that’ll happen much given the way the regressions work, so I’m planning to go with features to categories to numerical values.

I pretty much have to use sparse vectors for the edge and node features. But what about the model’s regression coefficients? I’ll be taking the coefficient vector’s dot-product with the sparse edge and node features. That’s much more efficient for dense coefficient vectors. I could save some space for really big models, but I think the time’s going to be more precious here. Luckily, this implementation detail will be easy to change under the hood if I change my mind later.
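
The reason dense coefficient vectors make the inner loop cheap is that the dot product with a sparse feature vector is just one array lookup per non-zero feature. A minimal sketch:

// Dot product of a sparse feature vector (parallel index/value arrays)
// with a dense coefficient vector.
static double dotProduct(int[] featIds, double[] featVals, double[] coeffs) {
    double sum = 0.0;
    for (int i = 0; i < featIds.length; ++i)
        sum += featVals[i] * coeffs[featIds[i]];
    return sum;
}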

Arithmetic Precision

We won’t be able to estimate coefficients with more accuracy than is representable as a float, so for storing the models, floats make the most sense. The problem is that at run time, we need to do double-precision arithmetic for decoding (in all three cases: Viterbi first best, Viterbi/A* n-best, and full forward-backward marginal decoding). It’s slower to cast floats to doubles every time you do a multiply, so I’ll probably keep the coefficients dense and stored as doubles at run time as long as they’ll fit. If they won’t fit, I’ll probably switch the implementation to sparse floats. It’ll all be transparent and backward compatible at the API level if I make the switch after the first release.

Start/End Tags

For HMMs, I added a BEGIN-OF-SENTENCE tag and an END-OF-SENTENCE tag. The start tag is not generated, but used as context to generate the first word in the sentence. The end-of-sentence tag is generated, because it’s required to get the normalizations right.

For CRFs, it seems to make the most sense to include a BEGIN-OF-SENTENCE tag that doesn’t have an estimated coefficient, but is just used in defining the edge features for the first token in a sentence. The end marker is actually derivable from the node (word-only) features. So is the begin marker, but this seems simpler than having a null value passed to the edge feature extractor.
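
In code, that just means the edge context for the first token reports the begin tag as its previous tag. Here’s a hypothetical helper over the proposed ChainCrf.Edge interface (the "BOS" tag name is an illustration, not a LingPipe constant):

// Builds the edge context at a position, substituting a distinguished
// begin-of-sentence tag for the (nonexistent) previous tag at position 0.
static ChainCrf.Edge edgeAt(final int pos, final String[] words, final String[] tags) {
    final String prev = (pos == 0) ? "BOS" : tags[pos - 1];
    return new ChainCrf.Edge() {
        public String previousTag() { return prev; }
        public int position() { return pos; }
        public String[] words() { return words; }
    };
}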

Stochastic Gradient Descent

I’m not even considering other estimators. I’ll parameterize it the same way as for logistic regression. This means I’ll allow coefficient priors (Laplace, Gaussian, Cauchy, with the option of non-zero means and different variances/scales per dimension). I’ll also allow an annealing schedule.
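
The per-coefficient update has the same shape as in regularized logistic regression: learning rate times the gradient of the log likelihood (observed minus expected feature values, with the expectations coming from forward-backward) plus the gradient of the log prior. Here’s a sketch for a single dimension, assuming a Gaussian prior and a simple annealing schedule; the names and the schedule are illustrative, not the LingPipe parameterization:

// One SGD step for one coefficient. gradLogLik is the (observed - expected)
// feature value for this dimension; the Gaussian prior contributes -beta/variance.
static double sgdStep(double beta, double gradLogLik, double priorVariance,
                      double initialRate, double annealingRate, long step) {
    double rate = initialRate / (1.0 + step / annealingRate);  // annealing schedule
    double gradLogPrior = - beta / priorVariance;              // Gaussian log prior gradient
    return beta + rate * (gradLogLik + gradLogPrior);
}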

SGD Code Re-use

A serious question is whether I should try to re-use the SGD code from logistic regression. The basic algorithm’s the same; only now I have to run forward-backward in the inner loop to compute the gradient of the log likelihood. I’m thinking I won’t share the code, but that’s mainly because refactoring it to something with the right callbacks in the inner loop looks daunting.

Forward/Backward Decoding and N-best

I need to do forward/backward for training, not just confidence-based extraction. What I’d like to do is provide a common format for the lattice output I can share between HMM taggers and CRFs.

I’ve already defined a new package com.aliasi.tag that has three interfaces for tagging: first-best, n-best (joint or conditional probabilities), and marginal (per tag) taggers matching those for HMMs.
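
For the sake of discussion, the three interfaces might look roughly like the following. These names and signatures are hypothetical shapes only, not the released com.aliasi.tag API:

import java.util.Iterator;
import java.util.List;

interface FirstBestTagger<E> {
    List<String> tag(List<E> tokens);                // single best tag sequence
}

interface NBestTagger<E> {
    Iterator<ScoredTagging<E>> tagNBest(List<E> tokens, int maxResults);
}

interface MarginalTagger<E> {
    TagLattice<E> tagMarginal(List<E> tokens);       // per-position, per-tag marginals
}

interface ScoredTagging<E> {
    List<String> tags();
    double score();                                  // joint or conditional log probability
}

interface TagLattice<E> {
    double logProbability(int position, String tag);
}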

Relation to Chunking

I’d also like to provide a general tagger-to-chunker adapter, which means more refactoring, but would immediately let me use CRFs to do named-entity extraction with n-best and confidence-based (per entity mention) extraction. I’d very much like to refactor the CharLmHmm chunker to rely on a generic tagger. This should be doable.
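
The core of such an adapter is just decoding a BIO tag sequence into chunk spans. A minimal sketch (token-index spans only; chunk types and character offsets omitted, and not LingPipe code):

import java.util.ArrayList;
import java.util.List;

// Turns a tag sequence like [B_PER, I_PER, O, B_LOC] into token spans
// {start, end} with end exclusive.
public class BioDecoder {
    static List<int[]> decode(String[] tags) {
        List<int[]> chunks = new ArrayList<int[]>();
        int start = -1;
        for (int i = 0; i <= tags.length; ++i) {
            boolean inside = i < tags.length && !tags[i].equals("O");
            boolean begins = inside && tags[i].startsWith("B_");
            if (start >= 0 && (!inside || begins))
                chunks.add(new int[] { start, i });   // close the open chunk
            start = begins ? i : (inside ? start : -1);
        }
        return chunks;
    }
}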

Threading and/or File Compilation

Unfortunately, storing in memory all of the feature vectors for an input the size of the Penn Treebank POS data (about 50 tags and 1M words) is prohibitive. There are 50M edge feature extractions and 1M node feature extractions. Even if each edge feature extraction is tiny (say 40 bytes), that’s a lot of memory, and if we add in previous-tag-plus-word features, it won’t fit in memory even on a big workstation. If there are hundreds of word features per node, which is conservative if we use 5-word contexts, n-grams, and subword features, then even represented tightly we’re looking at several KB per vector and thus several GB of memory.

So I’m inclined to do the feature extraction online. This is making me queasy, though, because it’ll be very expensive if I do it naively in a single thread (even measured relative to forward/backward, perhaps dominating training time).

Alternatively, I can run feature extraction in one thread and gradient computation in another. Then we probably won’t notice the feature extraction if there are enough CPUs free.
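
A bounded producer/consumer queue is enough to get that overlap while capping how many extracted feature vectors sit in memory at once. A sketch (SentenceFeatures, extractFeatures() and updateGradient() are placeholders, not LingPipe classes):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedEpoch {

    static void runEpoch(final String[][] sentences) throws InterruptedException {
        final BlockingQueue<SentenceFeatures> queue
            = new ArrayBlockingQueue<SentenceFeatures>(64);   // bound caps memory use

        Thread extractorThread = new Thread(new Runnable() {
            public void run() {
                try {
                    for (String[] sentence : sentences)
                        queue.put(extractFeatures(sentence)); // blocks when queue is full
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        extractorThread.start();

        for (int i = 0; i < sentences.length; ++i)
            updateGradient(queue.take());                     // blocks until features are ready
        extractorThread.join();
    }

    // placeholders standing in for real feature extraction and the SGD update
    static SentenceFeatures extractFeatures(String[] sentence) { return new SentenceFeatures(); }
    static void updateGradient(SentenceFeatures features) { }
    static class SentenceFeatures { }
}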

The other alternative, which matches what many CRF packages do, is to require all of the feature vectors for the training corpus to be serialized to disk. Then they can be read back in much more efficiently than they can be recreated from scratch. It won’t take that much disk space, but the requirement is nontrivial (perhaps as much as 20 or 30 GB with Treebank-POS-sized corpora and subword features).

Generic Inputs

There’s really nothing word-specific about CRFs. In fact, they could take arbitrarily structured inputs instead of word tokens. This would require generifying the CRF interface over the type of objects being tagged, and similarly generifying feature extraction and the lattice representations. I could probably then get away with having the HMMs just instantiate the generic type to String.

I may just go ahead and do this once it’s implemented for strings. It’ll probably be easier to refactor a string-based implementation than to start with something more generic. (Thanks again to those Java engineers for making things so easy to modify with backward compatibility.)

Overflow / Underflow

Overflow happens when a linear predictor gets so big that exponentiating it overflows. We handle this by computing softmax the right way: subtract the maximum linear predictor from all of the predictors before exponentiating.

It’s easy to underflow during forward-backward, which operates on a linear scale. Sutton and McCallum outline two approaches to handling this. One is what we did for HMM forward/backward: normalize each forward vector (the forward values for all tags at a particular token) to sum to 1.0. In this case, the usual computation for the normalizing constant Z just drops out; log Z is recoverable as the sum of the logs of the per-position normalizers.
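
Here’s a sketch of that rescaling approach on linear-scale potentials (illustrative only, not LingPipe code): each forward vector is normalized to sum to 1.0, and the accumulated log scale factors give log Z.

// nodePot[n][k] and edgePot[j][k] are exponentiated (linear-scale) potentials.
static double logPartition(double[][] nodePot, double[][] edgePot) {
    int numToks = nodePot.length;
    int numTags = nodePot[0].length;
    double[] alpha = nodePot[0].clone();
    double logZ = 0.0;
    for (int n = 0; n < numToks; ++n) {
        if (n > 0) {
            double[] next = new double[numTags];
            for (int k = 0; k < numTags; ++k) {
                for (int j = 0; j < numTags; ++j)
                    next[k] += alpha[j] * edgePot[j][k];
                next[k] *= nodePot[n][k];
            }
            alpha = next;
        }
        double scale = 0.0;
        for (int k = 0; k < numTags; ++k)
            scale += alpha[k];
        logZ += Math.log(scale);            // log Z accumulates the scale factors
        for (int k = 0; k < numTags; ++k)
            alpha[k] /= scale;              // rescale forward vector to sum to 1.0
    }
    return logZ;
}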

The other approach Sutton and McCallum mention, which I’d never heard of before reading their paper, is to keep the forward/backward values on the log scale and carry out the additions by exponentiating relative to a reference value and rescaling. This can actually be done with reasonable precision (precision is always a worry with logs and exps). As Sutton and McCallum note, it can be expensive because of the increased number of calls to log() and exp().

Unsupported Features

I have to admit I didn’t understand the discussion of “unsupported” features in Sutton and McCallum, which are defined to be features that don’t appear in the training data. This may be because I’ve already factored my features so that there are no output-specific features, so there may not be any such “unsupported” features. I’m definitely restricting the model to features that appear in the training data, because stochastic gradient gives me no way to train anything else.

Boolean Feature Vectors

Many NLP applications use boolean feature vectors, which only allow values 0 or 1. By restricting to boolean feature vectors, many computations can be made more efficient. I think I may go partway there by including dense and sparse boolean feature vector representations. The sparse ones would require a very high degree of sparsity to be worth it in terms of space, but they may dominate in things like cross-products because they’d use very few comparisons.
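
For instance, with sparse boolean vectors stored as sorted index arrays, a dot product (the building block of those cross-products) reduces to counting the intersection of two sorted int arrays, with no multiplications at all. A minimal sketch:

// Dot product of two sparse boolean vectors represented as sorted index arrays:
// the result is just the size of the intersection.
static int dotProduct(int[] sortedIdxsA, int[] sortedIdxsB) {
    int count = 0;
    int i = 0, j = 0;
    while (i < sortedIdxsA.length && j < sortedIdxsB.length) {
        if (sortedIdxsA[i] == sortedIdxsB[j]) { ++count; ++i; ++j; }
        else if (sortedIdxsA[i] < sortedIdxsB[j]) ++i;
        else ++j;
    }
    return count;
}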

6 Responses to “Chain Conditional Random Fields: Implementation and Design Issues”

  1. Jeff Says:

    Have you looked at the Mallet implementation? It’s one of the standard implementations from Andrew McCallum’s lab.

  2. lingpipe Says:

    I’ve glanced through the source for Mallet, JavaNLP, Carafe, and read most of the papers. I cite the Sutton and McCallum tutorial above — it goes into the most implementation detail.

    I’m not the world’s best reader of other people’s code (that’d be Ezra), but CRF algorithms are particularly difficult because they embed I/O, tokenization, serialization, feature extraction, forward-backward, and some kind of regularized logistic regression optimizer. The research packages tend to provide lots of options because they’ve been doing research on these things for years. I get lost just at the “what exactly are these input types” stage with Mallet and JavaNLP.

    I’m also not sure the people who implemented this stuff wouldn’t do it differently given a second chance.

  3. Bernhard Says:

It’s hard to understand what a ‘feature’ is in CRFs. In HMMs it’s essentially a word/token. What is it in CRFs? Can you help out with an intuitive explanation?

  4. lingpipe Says:

    HMMs are generative, modeling p(words,tags) and using Bayes’ rule to estimate p(tags|words), whereas CRFs are conditional, modeling p(tags|words) directly.

    HMMs model label transitions as a multinomial p(tag|previousTag) per previous tag. In NLP, emissions are also typically multinomial (like naive Bayes) for p(word|tag).

    CRFs are more like logistic regression in that they extract features from the input words using arbitrary amounts of context, outside knowledge, etc. The words are explanatory variables and are not themselves modeled. The chain part means you can also extract features from pairs of labels plus the context. The key differentiator of CRFs is that they normalize over the whole input, not per tag.

    Really, though, you should read one of the tutorials.

  5. Bernhard Says:

Just trying to get a handle on the specialized wording, which is identical in almost all tutorials.

What I understand from your explanation is that a CRF generates features in a fashion similar to logistic regression, that is, binomial probabilities from an arbitrary number of regression coefficients (betas). If that’s so, then thanks for that. This helps.

Would you mind elaborating a bit on what you mean by ‘normalization’?

  6. lingpipe Says:

@Bernhard: Yes, and no. At a high level, it’s essentially a logistic regression problem at the p(tags|tokens) level. In chain CRFs, the input features are factored to depend only on pairs of adjacent tags and all the tokens. So the features stand in a one-to-one relation with the dimensions of the regression coefficient vectors.

What makes CRFs computationally challenging is making sure p(tags|words) sums to 1.0 over the exponentially sized set of possible tag sequences. Computationally, you usually start with something proportional to p(tags|words) and then normalize by summing. The challenge is that there are exponentially many sequences of tags. The good news is that you can use a version of forward/backward (a linear-time dynamic programming algorithm) to compute the sum.
