Enriched Inputs for Long-Distance CRF Tagging


I’ve been wrestling with how to extract features for CRFs. Looking at papers like (Ratinov and Roth 2009), and also at what others had done before, it’s pretty clear that longer-distance contextual features are going to be helpful. These can be at the global (corpus) or more local (sentence/paragaph) level. At the corpus level, we have things like majority taggings by baseline taggers. At the sentence level, we have things like part-of-speech tags.

I’ve been struggling with how to preserve a tagging interface like we use for HMMs:

String[] tag(String[] inputs);

And while the Node/Edge feature factoring is theoretically sound, it’s wrong computationally if the input is an an array of strings. We often want to bring in richer contextual or long-distance information that we don’t want to recompute on a per-node or per-edge basis.

At the same time, I’m doing my homework and looking at packages like CRFSuite, which follow the CoNLL input format. With CoNLL input format, rather than only token strings, you have multiple features per input item. Here’s an example from the CRFSuite tutorial (linked above):

London JJ B-NP
shares NNS I-NP
closed VBD B-VP
moderately RB B-ADVP
lower JJR I-ADVP
in IN B-PP
thin JJ B-NP
trading NN I-NP
. . O

The first column is a token, the second column a part-of-speech tag, and the third column a BIO encoding of phrasal chunks. This can go on with more and more information from dictionaries, etc., as additional columns. We might even want to bring in Ando and Zhang-like spectral features.

CRFSuite then gives you a language for extracting features from the columns. For instance, the following specifies a node feature consisting of a window of +/- 2 tokens around the current token:

${x[t-2].token}, ${x[t-1].token}, ${x[t].token}, 
${x[t+1].token}, ${x[t+2].token}

whereas this one supplies part-of-speech interaction (bigram) features:

${x[t-2].pos}/${x[t-1].pos}, ${x[t-1].pos}/${x[t].pos},
 ${x[t].pos}/${x[t+1].pos}, ${x[t+1].pos}/${x[t+2].pos}

While I’m not sure I want to develop a language and parser, I am going to generalize the input from String[] to List<E>, where, for instance, E could be structured as in the CoNLL example as a sequence of strings (one per column, perhaps with named accessors, like getPartOfSpeech()). This reduces the the earlier token-tagging model when E is String. The benefit is that I can stick to the simple Node and Edge feature extractor model, only the nodes and edges will have all this extra information available; the downside is that clients will have to compute it for inputs. This could be wrapped in a bigger chunker, but hey, no one said CRF feature extraction was easy.

This will give the client a way to pass in extra information. For instance, one column could be reserved for a boolean flag indicating whether the token was tagged as part of an entity earlier in the sequence of sentences being processed. Several other columns could indicates gazeteer matches. It puts more work on the client, but I don’t see how to specify a reasonable feature extractor otherwise that doesn’t do a ton of redundant work. (I thought about adding a metadata object along with the sequence; as is, the items in the sequence can still link back to some kind of metadata.) One alternative would be to extract a whole feature lattice at once, although that precludes later implementing agressive beam search, which mainly saves time in feature extraction.

I’m also going to get rid of begin-of-sentence features; those can be encoded in the node features.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: