In the first-order chain CRF case, the issue is whether the feature extractor has the signature φ(inputs,position,previousTag) or ψ(inputs,position,previousTag,currentTag). With the former, the current tag is implicitly included, so each feature output by φ is multiplied by the number of output tags. As noted by Sutton and McCallum, this can save both time and space in feature generation, which dominates the run-time performance of linear classifiers like CRFs.
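To make the difference between the two signatures concrete, here's a minimal Python sketch. The tag set, the feature names, and the `expand` helper are all made up for illustration; this is not any particular library's API.

```python
from typing import Dict, List, Tuple

TAGS = ["B-PER", "I-PER", "O"]  # hypothetical tag set for illustration


def phi(inputs: List[str], position: int, previous_tag: str) -> Dict[str, float]:
    """Implicit-current-tag extractor: called once per (position, previous tag)."""
    return {
        "word=" + inputs[position]: 1.0,
        "prev=" + previous_tag: 1.0,
    }


def psi(inputs: List[str], position: int, previous_tag: str,
        current_tag: str) -> Dict[str, float]:
    """Explicit-current-tag extractor: called once per (position, previous tag, current tag)."""
    return {
        "word=" + inputs[position] + "&tag=" + current_tag: 1.0,
        "prev=" + previous_tag + "&tag=" + current_tag: 1.0,
    }


def expand(features: Dict[str, float], tags: List[str]) -> Dict[Tuple[str, str], float]:
    """Cross each phi feature with every output tag: one parameter per (feature, tag) pair."""
    return {(name, tag): value for name, value in features.items() for tag in tags}


tokens = ["John", "Smith", "ran"]

# One call to phi, then a cheap cross with the tag set.
implicit = expand(phi(tokens, 1, "B-PER"), TAGS)

# One call to psi per candidate current tag.
explicit: Dict[str, float] = {}
for tag in TAGS:
    explicit.update(psi(tokens, 1, "B-PER", tag))
```

Both routes end up with the same number of parameters in this toy case, but φ runs the extraction code once per position and previous tag, whereas ψ runs it once per candidate current tag as well.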
I’ve asked many people over the past few months about how they use label-specific edge and node features for CRFs. Jenny Finkel had the clearest motivation, which is that some tags are related and may benefit from tied features. For instance, if I’m doing BIO encoding for chunkers, I might want ψ(in,pos,prev,B-PER) and ψ(in,pos,prev,I-PER) to generate some shared features that separate tokens in person names from other tokens, such as “input word at position is in a list of person name tokens”. In the implicit current tag approach using φ(in,pos,prev), every feature has a distinct instance for each output category, so we’d have one feature for output label B-PER (first token in a person name) and another for I-PER (non-initial token in a person name).
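Here's a rough sketch of what that kind of tying could look like with an explicit-current-tag extractor. The name list, the `entity_type` helper, and the feature naming scheme are assumptions of mine for illustration, not anything Jenny or Sutton and McCallum specify.

```python
from typing import Dict, List

# Hypothetical list of person-name tokens, for illustration only.
PERSON_NAME_TOKENS = {"john", "mary", "smith"}


def entity_type(tag: str) -> str:
    """Strip the BIO prefix: 'B-PER' and 'I-PER' both map to 'PER'; 'O' stays 'O'."""
    return tag.split("-", 1)[1] if "-" in tag else tag


def psi(inputs: List[str], position: int, previous_tag: str,
        current_tag: str) -> Dict[str, float]:
    """Explicit-current-tag extractor that emits both untied and tied features."""
    word = inputs[position]
    features = {
        # Untied: a separate parameter for every full tag.
        "word=" + word + "&tag=" + current_tag: 1.0,
    }
    if word.lower() in PERSON_NAME_TOKENS:
        # Tied: B-PER and I-PER produce the same feature name,
        # so they share a single parameter.
        features["in_person_list&type=" + entity_type(current_tag)] = 1.0
    return features


tokens = ["John", "Smith", "ran"]
begin_features = psi(tokens, 0, "O", "B-PER")
inside_features = psi(tokens, 0, "O", "I-PER")
shared = set(begin_features) & set(inside_features)
```

Here `shared` holds only the tied gazetteer feature; the word feature stays distinct for B-PER and I-PER.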
One advantage of tying is that it reduces the overall number of parameters, which is always good when it’s motivated; here, that means the parameters really should have the same value. It can also help when they shouldn’t really have the same value, but we don’t have enough data to estimate them separately. If I implement the factored approach with φ(in,pos,prev), I won’t be able to tie parameters at all.
I’d love to hear other opinions on the matter from people with experience training and tuning CRFs.
Having also just read Jenny’s hierarchical modeling paper, I realized that we could impose a hierarchical model on the coefficients. Tying simply reduces to a prior with zero variance and unknown mean (the prior mean’s the tied value). That’d be the “right” way to do this, of course, as it’d allow any degree of pooling between completely pooled (tying) and unpooled (completely separate).
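As a rough illustration of the pooling idea, here's a sketch that assumes a normal prior on each tag's coefficient for a single feature, all sharing one mean. The function names and the choice of a normal prior are my assumptions; the hierarchical modeling paper may set things up differently.

```python
import math
from typing import Dict


def hierarchical_log_prior(coefficients: Dict[str, float], mean: float,
                           sigma: float) -> float:
    """Sum of Normal(mean, sigma) log densities over the per-tag coefficients."""
    return sum(
        -0.5 * ((beta - mean) / sigma) ** 2
        - math.log(sigma * math.sqrt(2.0 * math.pi))
        for beta in coefficients.values()
    )


def pooling_gap(sigma: float) -> float:
    """How much more the prior favors equal coefficients over spread-out ones."""
    tied = {"B-PER": 1.0, "I-PER": 1.0}
    spread = {"B-PER": 0.5, "I-PER": 1.5}
    return (hierarchical_log_prior(tied, 1.0, sigma)
            - hierarchical_log_prior(spread, 1.0, sigma))


tight_gap = pooling_gap(0.01)   # small sigma: strong pull toward the shared mean
loose_gap = pooling_gap(100.0)  # large sigma: coefficients are nearly free
```

As sigma shrinks toward zero, the penalty for coefficients that differ grows without bound, which is the tying limit. As sigma grows, the prior barely distinguishes tied from untied values.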
P.S. Jenny also helped me understand what Sutton and McCallum meant by “unsupported features”, which are features that do not show up in any positive instance in the training data; Sutton and McCallum just said “those which never occur in the training data”, and I didn’t think anyone included features that didn’t occur at all in the training data.
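Under that reading, a feature is unsupported when it fires on some training input but never alongside the gold tag it's paired with. Here's a small sketch that splits the crossed (feature, tag) pairs accordingly. The toy extractor, the `"BOS"` start marker, and the one-sentence training set are made up for illustration.

```python
from typing import List, Set, Tuple

TAGS = ["B-PER", "I-PER", "O"]  # hypothetical tag set


def phi(inputs: List[str], position: int, previous_tag: str) -> List[str]:
    """Toy implicit-current-tag extractor returning feature names."""
    return ["word=" + inputs[position], "prev=" + previous_tag]


def split_by_support(
    training: List[Tuple[List[str], List[str]]], tags: List[str]
) -> Tuple[Set[Tuple[str, str]], Set[Tuple[str, str]]]:
    """Return (supported, unsupported) sets of (feature, tag) pairs."""
    supported: Set[Tuple[str, str]] = set()
    observed: Set[str] = set()
    for tokens, gold_tags in training:
        previous = "BOS"  # hypothetical begin-of-sequence marker
        for position, tag in enumerate(gold_tags):
            for name in phi(tokens, position, previous):
                supported.add((name, tag))  # fires with the gold tag
                observed.add(name)
            previous = tag
    # Crossing every observed feature with every tag yields all candidate pairs.
    crossed = {(name, tag) for name in observed for tag in tags}
    return supported, crossed - supported


training = [(["John", "Smith", "ran"], ["B-PER", "I-PER", "O"])]
supported, unsupported = split_by_support(training, TAGS)
```

In this toy example, ("word=John", "B-PER") is supported, while ("word=John", "O") is unsupported: the word occurs in training, just never with that tag.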