Archive for the ‘LingPipe News’ Category

API Design: Should I Reify Taggings for CRFs and HMMs?

October 6, 2009

I finished the technical core and testing for CRFs, including regularized stochastic gradient estimation as well as first-best, n-best and marginal decoders. Now I’m thinking about retrofitting everything so that CRFs work the same way as HMMs from an interface perspective.

Current HMM Interfaces

For hmm.HmmDecoder, the current interface is:

class hmm.HmmDecoder

    String[] firstBest(String[] emissions);

    Iterator<ScoredObject<String[]>> nBest(String[] emissions, int maxN);

    TagWordLattice lattice(String[] emissions);

Note that this interface uses arrays for inputs and outputs (I wrote it before generics), and constructs n-best lists and scored outputs with standardized wrappers/interfaces: Java's java.util.Iterator and LingPipe's com.aliasi.util.ScoredObject.

Tag Package

I started a new package, tag, with three new interfaces for first-best, n-best, and marginal taggers. I'll adapt the HMM decoders to implement them and deprecate the methods described above (if necessary). The big question is what should the interfaces look like?

First-Best Taggers

I generalized inputs to lists of generic objects -- they used to be fixed to arrays of strings. The first interface is for first-best tagging:

interface tag.Tagger<E>

     Tagging<E> tag(List<E> tokens)

with an associated tag.Tagging abstract base class:

class tag.Tagging<E>

    int size();

    String tag(int n); 

    E token(int n);

    List<E> tokens();

    List<String> tags();

Just to be safe, the public constructor copies the input tags and tokens, and the methods that return the lists provide unmodifiable views. I also have the tags being returned as a list. There's nothing in principle preventing the tags from being structured -- it just didn't seem worth the implementation hassle and resulting interface complexity.
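A minimal sketch of what that might look like (the constructor's length check and the member names are my own assumptions, not the final API):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class Tagging<E> {
    private final List<E> mTokens;
    private final List<String> mTags;

    // defensive copies, so later mutation of the caller's lists can't corrupt the tagging
    public Tagging(List<E> tokens, List<String> tags) {
        if (tokens.size() != tags.size())
            throw new IllegalArgumentException("tokens and tags must be the same length");
        mTokens = new ArrayList<E>(tokens);
        mTags = new ArrayList<String>(tags);
    }

    public int size() { return mTokens.size(); }
    public String tag(int n) { return mTags.get(n); }
    public E token(int n) { return mTokens.get(n); }

    // unmodifiable views, so callers can't mutate the internal lists
    public List<E> tokens() { return Collections.unmodifiableList(mTokens); }
    public List<String> tags() { return Collections.unmodifiableList(mTags); }
}
```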

Question 1

Should there be a Tagging interface (or should I just return lists of tags and leave it up to users to keep track of inputs)?

N-Best Sequence Taggers

There's another interface for n-best taggers:

interface tag.NBestTagger<E>

        Iterator<ScoredTagging<E>> tagNBest(List<E> tokens, int maxResults);

The ScoredTagging object extends Tagging and implements the interface util.Scored:

interface util.Scored

    double score();

I like the iterator over the results, which works well in practice. What I'm less certain about is:

Question 2

Should I create a ScoredTagging interface (or just use ScoredObject<Tagging>)?

Marginal (per Tag, per Phrase) Tagging

The final interface is for marginal results, which allow the computation of the conditional probability of a sequence of tags at a given position given the input sequence.

interface tag.MarginalTagger<E>

    TagLattice<E> tagMarginal(List<E> tokens);

This one's easy, as I need a structure for the result. What I need to do is adapt the HMM TagWordLattice so that it implements the common interface TagLattice<String>. I think all that needs to have in it is:

abstract class tag.TagLattice

    double logProbability(int n, int tag);

    double logProbability(int n, int[] tags);

    SymbolTable tagSymbolTable(); 

Of course, I can add utility methods for linear probabilities and accessing tags by name rather than symbol table identifier.
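For example, the linear-probability and tag-by-name conveniences could be simple wrappers over the abstract accessors. A sketch (the ToySymbolTable stand-in and the method names are mine, not the LingPipe API):

```java
import java.util.HashMap;
import java.util.Map;

// toy stand-in for a symbol table mapping tag names to integer IDs
class ToySymbolTable {
    private final Map<String, Integer> mIds = new HashMap<String, Integer>();
    ToySymbolTable(String... tags) {
        for (String tag : tags)
            mIds.put(tag, mIds.size());
    }
    int symbolToID(String tag) { return mIds.get(tag); }
}

abstract class TagLattice {
    public abstract double logProbability(int n, int tag);
    public abstract ToySymbolTable tagSymbolTable();

    // convenience: linear-scale probability
    public double probability(int n, int tag) {
        return Math.exp(logProbability(n, tag));
    }

    // convenience: look up a tag by name rather than symbol-table ID
    public double logProbability(int n, String tag) {
        return logProbability(n, tagSymbolTable().symbolToID(tag));
    }
}
```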

Forward-Backward Lattice

There's also the implementation issue of whether to combine the currently separate forward-backward (or generalized sum-product) implementations of the tag lattices in HMMs and CRFs, and then whether to expose the implementation methods, which are necessary for extracting n-best phrases in chunkers. The forward/backward phrase extraction requires forward, backward, and transition scores, and an overall normalizer (conventionally written as Z) to normalize scores to conditional probabilities.

interface tag.ForwardBackwardLattice extends tag.TagLattice

    double logForward(int n, int tagTo);

    double logBackward(int n, int tagTo);

    double logTransition(int n, int tagFrom, int tagTo);

    double logZ();
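The payoff of exposing these is that the marginal lattice methods fall out directly: the log marginal for tag t at position n is logForward(n, t) + logBackward(n, t) - logZ(). Here's a toy single-position illustration of that identity (everything below is a made-up stand-in, not the LingPipe API):

```java
class ToyForwardBackwardLattice {
    // log-scale potentials for a single-position tagging problem
    private final double[] mLogPotential;

    ToyForwardBackwardLattice(double[] logPotential) {
        mLogPotential = logPotential;
    }

    double logForward(int n, int tagTo) { return mLogPotential[tagTo]; }
    double logBackward(int n, int tagTo) { return 0.0; } // nothing to the right of position 0

    double logZ() {
        double sum = 0.0;
        for (double lp : mLogPotential)
            sum += Math.exp(lp);
        return Math.log(sum);
    }

    // log p(tag at n = t | input) = logForward(n, t) + logBackward(n, t) - logZ()
    double logProbability(int n, int tag) {
        return logForward(n, tag) + logBackward(n, tag) - logZ();
    }
}
```

With potentials in ratio 1:3, the marginals come out to 0.25 and 0.75, as expected.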

Should I just require a forward-backward lattice in the interface? I never feel it's worth it to add really generic interfaces, such as:

MarginalTagger<E, T extends TagLattice<E>>

    T tagMarginal(List<E> tokens);

That'd let the classes that need the forward-backward version specify T.

Taggings and Handlers

The next big issue is whether to change the way taggings are supplied to taggers. The current method uses a specific handler:

interface corpus.TagHandler

    void handle(String[] toks, String[] whitespaces, 
                String[] tags);

This allows whitespaces to be provided, but we never use them at the tagger level. This brings up the question of whether to reify taggings for training:

Question 3

Should I replace TagHandler with ObjectHandler<Tagging>?

A "yes" answer would allow me to deprecate TagHandler going forward and makes tagging more parallel to chunking. But it's a lot of work (not that it should worry you), will require lots of changes to existing code, and it messes with longer-term backward compatibility.
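If the answer is yes, a small adapter could keep existing TagHandler implementations alive through the deprecation window. A sketch, with minimal stand-ins for the interfaces involved (SimpleTagging and the adapter name are hypothetical):

```java
import java.util.List;

// minimal stand-ins for the interfaces under discussion
interface TagHandler {
    void handle(String[] toks, String[] whitespaces, String[] tags);
}

interface ObjectHandler<T> {
    void handle(T e);
}

class SimpleTagging {
    final List<String> tokens;
    final List<String> tags;
    SimpleTagging(List<String> tokens, List<String> tags) {
        this.tokens = tokens;
        this.tags = tags;
    }
}

// presents a legacy TagHandler as an ObjectHandler over taggings,
// dropping the whitespaces we never use at the tagger level
class TagHandlerAdapter implements ObjectHandler<SimpleTagging> {
    private final TagHandler mHandler;
    TagHandlerAdapter(TagHandler handler) { mHandler = handler; }
    public void handle(SimpleTagging tagging) {
        String[] toks = tagging.tokens.toArray(new String[0]);
        String[] tags = tagging.tags.toArray(new String[0]);
        mHandler.handle(toks, null, tags); // no whitespace information available
    }
}
```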

Evaluator Overloading

Right now tagger evaluators only work for HMMs. I need to generalize. The classifier evaluator inspects the classifier being evaluated with reflection to see what it can do, and evaluates all of the parts that can be evaluated. Which brings up:

Question 4

Should I create three distinct evaluators, one for first-best, one for n-best, and one for marginal taggers (or should I just overload a single one)?

Speak Now, or Forever ...

My current thinking is to answer "yes" to all the questions. So now's a good time to chime in if you don't like those answers!

LingPipe / GATE Integration

August 12, 2009

Following on their 2009 summer school in July, Hamish Cunningham and crew have finished the initial integration of LingPipe into GATE (aka the “General Architecture for Text Engineering”, described on their home page as the “Eclipse of natural language engineering”).

So far, the GATE integration includes LingPipe’s tokenization, sentence splitting, part-of-speech tagging, named-entity (mention) recognition, and language ID (which I suppose means classifiers are integrated to some degree).

The integration is now available in the GATE nightly builds. You can find it under gate/plugins/LingPipe.

Although I don’t see them yet, Hamish tells me there will soon be details in:

The integration was carried out with Alias-i’s permission and LingPipe remains under our Royalty-Free License. GATE itself is distributed under the more permissive LGPL.

Zou and Hastie (2005) Regularization and Variable Selection via the Elastic Net [Prior]

July 30, 2009

I don’t know how the elastic net prior (aka regularizer) snuck by me unnoticed. It was introduced in:

I was reading about it in the technical report appearing as part of the documentation to their glmnet R package:

The idea’s basically an “interpolation” between the Laplace (L1, lasso) and Gaussian (L2, ridge) priors:

p_{\lambda,\alpha}(\beta) \propto \exp(-\lambda(\alpha|\beta|_2^2 + (1-\alpha)|\beta|_1))

where β is the vector of coefficients, λ a factor for the prior variance, and α the blending ratio between the L2 and L1 priors, themselves represented by the squared L2 norm and (unsquared) L1 norm terms.

Note that the interpolation happens inside the exponential form, so it’s not a mixture of a Gaussian and Laplace prior, but rather a mixture of their bases. Thus on the log scale used in gradient methods, the prior takes on a pleasantly differentiable form:

\log p_{\lambda,\alpha}(\beta) = -\log Z_{\lambda,\alpha} -\lambda(\alpha|\beta|_2^2 + (1-\alpha)|\beta|_1)

where Z is the usual regularization term, here dependent on the variance and mixture parameters.

The later paper divides the L2 component by 2, which gives you a slightly different blend. In general, I suppose you could give the two distributions being blended arbitrary variances λ1 and λ2.

The authors are particularly interested in two situations that are problematic for L1 or L2 alone:

  • the number of dimensions is larger than the number of training items
  • the features/coefficients are highly correlated

Note that bag-of-words and other lexically-based feature representations in natural language classification and tagging are of exactly this nature. In the former case, L1 breaks down with too few features selected. In the latter case, L2 tends to select all the features and L1 only one of the features. With the elastic net, with most of the weight on L1, you get the benefits of L1’s sparseness and fat tails, with better behavior in the face of correlated variables and large numbers of dimensions.

There’s a beautiful plot in the later paper with prior variance decreasing on the X-axis versus the fitted coefficient values on the Y-axis, with a line for each dimension/feature (and there are lots of non-zero coefficients overlaid on the Gaussian, which never pushes any feature to 0 through regularization).

The Zou and Hastie paper also discusses “bridge regression” (bad pun), which takes an arbitrary Ln penalty for n in the interval [1,2], and thus provides a slightly different way of blending L1 and L2 priors. The limits at 1 and 2 are the L1 and L2 priors. Although the gradient is still easy to calculate, the resulting solutions for n > 1 are not sparse for the same reason L2 isn’t sparse — the shrinkage is proportional to a fraction of the current value.

Given the general exponential form, the gradient of the elastic net’s log probability is easy to compute. It just reduces to interpolating between the L1 and L2 gradients. I’ll probably just go ahead and add elastic net priors to LingPipe as implementations of stats.RegressionPrior. That’ll let them plug-and-play in multinomial logistic regression and CRFs.
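Concretely, with the parameterization above, the per-coefficient gradient of the log prior is just the blend of the L2 term's proportional shrinkage and the L1 term's constant push toward zero. A sketch (the class and method names are mine, following the formula above rather than any existing LingPipe signature):

```java
class ElasticNetGradient {
    // d/dβ log p(β) = -λ(2αβ + (1-α) sgn(β)), ignoring the constant -log Z term;
    // α = 1 recovers the Gaussian (L2) gradient, α = 0 the Laplace (L1) gradient
    static double logPriorGradient(double beta, double lambda, double alpha) {
        return -lambda * (2.0 * alpha * beta + (1.0 - alpha) * Math.signum(beta));
    }
}
```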

Arabic Named Entity Recognition with the ANER Corpus

July 28, 2009

One thing I forgot to mention on the release notes for LingPipe 3.8.2 is that I’ve added a section on building an Arabic named entity recognizer to the LingPipe Named Entity Tutorial.

Benajiba’s ANER Corpus

It’s based on Yassine Benajiba‘s freely distributed (thanks!) corpus:

It’s 150K tokens in CoNLL format (easy to parse, but lossy for whitespace) using person, location, organization and miscellaneous types (like CoNLL’s English corpus). Here’s a sample (rendered with style value direction:rtl to get the ordering right; the actual file’s in the usual character order):

فرانكفورت B-LOC
(د O
ب O
أ) O
أعلن O
اتحاد B-ORG
صناعة I-ORG
السيارات I-ORG
في O
ألمانيا B-LOC

Benajiba also supplies small dictionaries for locations, organizations and persons, which we explain how to use in the demo.

LingPipe Model and Evaluation

I applied a simple sentence-chunking heuristic and then built a character 8-gram-based rescoring chunker using otherwise default parameters and training on all the data (but not the dictionaries).
There’s absolutely nothing Arabic-specific about the model.

Overall performance is in line with Benajiba’s, while differing substantially on the four entity types. Here are the LingPipe 6-fold (125K token train, 25K token test) cross-validated results:

Type Precision Recall F1
LOC 0.782 0.788 0.785
PERS 0.634 0.657 0.645
ORG 0.609 0.527 0.565
MISC 0.553 0.421 0.478
COMBINED 0.685 0.661 0.673

Dictionaries improved recall a bit, but hurt precision even more. Bigger dictionaries and more training data would certainly help here.


For a description of the corpus, and a description and evaluation of Benajiba’s own Arabic NER system, see:

And of course, there’s more information in LingPipe’s Named Entity Tutorial.

LingPipe 3.8.2 Released

July 28, 2009

LingPipe 3.8.2 is now available from:

This is a patch release, but includes lots of little utility methods and even some utility classes. The home page above lists the changes.

Label-Specific Features versus Priors for Tying CRF Features

July 22, 2009

In the first order chain CRF case, the issue is whether the feature extractor has the signature φ(inputs,position,previousTag) or ψ(inputs,position,previousTag,currentTag). With the former, the current tag is implicitly included, so each feature output by φ is multiplied by the number of output tags. As noted by Sutton and McCallum, this can save both time and space in feature generation, which dominates the run-time performance of linear classifiers like CRFs.

I’ve asked many people over the past few months about how they use label-specific edge and node features for CRFs. Jenny Finkel had the clearest motivation, which is that some tags are related and may benefit from tied features. For instance, if I’m doing BIO encoding for chunkers, I might want ψ(in,pos,prev,B-PER) and ψ(in,pos,prev,I-PER) to generate some shared features which separate tokens in people names from other tokens, such as “input word at position is in a list of person name tokens”. In the implicit current tag approach using φ(in,pos,prev), any feature has a distinct instance for each output category, so we’d have one feature for output label B-PER (first token in a person name) and another for I-PER (non-initial token in person name).

One advantage of tying is that it reduces the overall number of parameters, which is always good if it’s motivated, which here means the parameters really should have the same value. It can also help if they shouldn’t really have the same value, but we don’t have enough data to estimate them properly separately. If I implement the factored approach with φ(in,pos,prev), I won’t be able to tie parameters at all.
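To make the tying concrete, here's a toy ψ-style extractor where B-PER and I-PER share one feature while still getting their own label-specific ones (the class, method, and feature names are entirely made up for illustration):

```java
import java.util.HashSet;
import java.util.Set;

class ToyTiedFeatureExtractor {
    // ψ(input, position, previousTag, currentTag): label-specific features,
    // with one feature deliberately shared across the B-PER and I-PER labels
    Set<String> features(String[] input, int pos, String prevTag, String curTag) {
        Set<String> feats = new HashSet<String>();
        feats.add("TOK_" + input[pos] + "_" + curTag);  // ordinary per-label feature
        if (curTag.equals("B-PER") || curTag.equals("I-PER"))
            feats.add("PERSON_TOKEN_" + input[pos]);    // tied across person labels
        return feats;
    }
}
```

The tied feature gets a single coefficient estimated from both B-PER and I-PER contexts; the φ-style factoring can't express this, since every feature is automatically split per output label.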

I’d love to hear other opinions on the matter from people with experience training and tuning CRFs.

Having also just read Jenny’s hierarchical modeling paper, it occurred to me that we could impose a hierarchical model on the coefficients. Tying simply reduces to a prior with zero variance and unknown mean (the prior mean’s the tied value). That’d be the “right” way to do this, of course, as it’d allow any degree of pooling between completely pooled (tying) and unpooled (completely separating).

P.S. Jenny also helped me understand what Sutton and McCallum meant by “unsupported features”, which are features that do not show up in any positive instance in the training data; Sutton and McCallum just said “those which never occur in the training data”, and I didn’t think anyone included features that didn’t occur at all in the training data.

Enriched Inputs for Long-Distance CRF Tagging

July 13, 2009

I’ve been wrestling with how to extract features for CRFs. Looking at papers like (Ratinov and Roth 2009), and also at what others had done before, it’s pretty clear that longer-distance contextual features are going to be helpful. These can be at the global (corpus) or more local (sentence/paragraph) level. At the corpus level, we have things like majority taggings by baseline taggers. At the sentence level, we have things like part-of-speech tags.

I’ve been struggling with how to preserve a tagging interface like we use for HMMs:

String[] tag(String[] inputs);

And while the Node/Edge feature factoring is theoretically sound, it’s wrong computationally if the input is an array of strings. We often want to bring in richer contextual or long-distance information that we don’t want to recompute on a per-node or per-edge basis.

At the same time, I’m doing my homework and looking at packages like CRFSuite, which follow the CoNLL input format. With CoNLL input format, rather than only token strings, you have multiple features per input item. Here’s an example from the CRFSuite tutorial (linked above):

London JJ B-NP
shares NNS I-NP
closed VBD B-VP
moderately RB B-ADVP
lower JJR I-ADVP
in IN B-PP
thin JJ B-NP
trading NN I-NP
. . O

The first column is a token, the second column a part-of-speech tag, and the third column a BIO encoding of phrasal chunks. This can go on with more and more information from dictionaries, etc., as additional columns. We might even want to bring in Ando and Zhang-like spectral features.

CRFSuite then gives you a language for extracting features from the columns. For instance, the following specifies a node feature consisting of a window of +/- 2 tokens around the current token:

${x[t-2].token}, ${x[t-1].token}, ${x[t].token}, 
${x[t+1].token}, ${x[t+2].token}

whereas this one supplies part-of-speech interaction (bigram) features:

${x[t-2].pos}/${x[t-1].pos}, ${x[t-1].pos}/${x[t].pos},
 ${x[t].pos}/${x[t+1].pos}, ${x[t+1].pos}/${x[t+2].pos}

While I’m not sure I want to develop a language and parser, I am going to generalize the input from String[] to List<E>, where, for instance, E could be structured as in the CoNLL example as a sequence of strings (one per column, perhaps with named accessors, like getPartOfSpeech()). This reduces to the earlier token-tagging model when E is String. The benefit is that I can stick to the simple Node and Edge feature extractor model, only the nodes and edges will have all this extra information available; the downside is that clients will have to compute it for inputs. This could be wrapped in a bigger chunker, but hey, no one said CRF feature extraction was easy.
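As a sketch of what a structured E might look like for the CoNLL case, along with a node feature extractor mirroring the first CRFSuite window template above (class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// one row of a CoNLL-style file, with named accessors per column
class ConllToken {
    private final String mToken, mPos, mChunk;
    ConllToken(String token, String pos, String chunk) {
        mToken = token; mPos = pos; mChunk = chunk;
    }
    String token() { return mToken; }
    String getPartOfSpeech() { return mPos; }
    String chunk() { return mChunk; }
}

class ConllNodeFeatures {
    // node features for a +/-2 token window, like the CRFSuite template above
    static List<String> tokenWindow(List<ConllToken> x, int t) {
        List<String> feats = new ArrayList<String>();
        for (int i = -2; i <= 2; ++i) {
            int j = t + i;
            if (j >= 0 && j < x.size())
                feats.add("tok[" + i + "]=" + x.get(j).token());
        }
        return feats;
    }
}
```

Part-of-speech bigram features would work the same way over getPartOfSpeech(), and extra columns (dictionary matches, spectral features) just become extra accessors.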

This will give the client a way to pass in extra information. For instance, one column could be reserved for a boolean flag indicating whether the token was tagged as part of an entity earlier in the sequence of sentences being processed. Several other columns could indicate gazetteer matches. It puts more work on the client, but I don’t see how to specify a reasonable feature extractor otherwise that doesn’t do a ton of redundant work. (I thought about adding a metadata object along with the sequence; as is, the items in the sequence can still link back to some kind of metadata.) One alternative would be to extract a whole feature lattice at once, although that precludes later implementing aggressive beam search, which mainly saves time in feature extraction.

I’m also going to get rid of begin-of-sentence features; those can be encoded in the node features.

Deleting Values in Welford’s Algorithm for Online Mean and Variance

July 7, 2009

To take a Gibbs sample, the previous sampled value for a variable is deleted from the sufficient statistics of the estimators that depend on it, the variable is resampled, and the variable is added back into the estimators. What if our variable’s normal? Well, it turns out we can generalize Welford’s algorithm, which I describe in a comment to:

Here’s the basic algorithm, cribbed and simplified from John Cook’s site (see reference below), along with an extra method (unHandle(double)) added in for deletes:

private long mN = 0L;
private double mM = 0.0;
private double mS = 0.0;

public void handle(double x) {
    ++mN;
    double nextM = mM + (x - mM) / mN;
    mS += (x - mM) * (x - nextM);
    mM = nextM;
}

public void unHandle(double x) {
    if (mN == 0) {
        throw new IllegalStateException();
    } else if (mN == 1) {
        mN = 0; mM = 0.0; mS = 0.0;
    } else {
        double mMOld = (mN * mM - x) / (mN - 1);
        mS -= (x - mM) * (x - mMOld);
        mM = mMOld;
        --mN;
    }
}

public double mean() {
    return mM;
}

public double varianceUnbiased() {
    return mN > 1 ? mS / (mN - 1) : 0.0;
}

Works like a charm. I’m sure someone’s done this already, but I didn’t find any references in a quick check.

The same technique can obviously be applied to covariance and correlation calculations.

This’ll be in the next release of LingPipe.


  • Welford, B. P. 1962. Note on a method for calculating corrected sums of squares and products. Technometrics 4(3): 419-420.
  • Cook, John. Accurately computing running variance. Web page.

Collapsed Gibbs Sampler for Hierarchical Annotation Model

July 6, 2009

The R and BUGS combination is fine as far as it goes, but it’s slow, hard to debug, and doesn’t scale well. Because I’m going to need to process some big Turker-derived named entity corpora in the next month (more about that later), I thought I’d work on scaling the sampler.

Mitzi was away over the weekend, so I naturally spent my newly found “free time” coding and reading up on stats. While I was procrastinating over refactoring feature extraction for CRFs by reading a really neat paper (On Smoothing and Inference for Topic Models) from the Irvine paper mill (I also blogged about another of their papers on fast LDA sampling), it occurred to me that I could create a fast collapsed sampler for the multinomial form of the hierarchical models of annotation I’ve blogged about.

Hierarchical Model of Binomial Annotation

The model’s quite simple, at least in the binomial case. I’ve further simplified here to the case where every annotator labels every item, but the general case is just as easy (modulo indexing):

Variable     Range   Status     Distribution                      Description
I            > 0     input      fixed                             number of items
J            > 0     input      fixed                             number of annotators
π            [0,1]   estimated  Beta(1,1)                         prevalence of category 1
c[i]         {0,1}   estimated  Bern(π)                           category for item i
θ0[j]        [0,1]   estimated  Beta(α0,β0)                       specificity of annotator j
θ1[j]        [0,1]   estimated  Beta(α1,β1)                       sensitivity of annotator j
α0/(α0+β0)   [0,1]   estimated  Beta(1,1)                         prior specificity mean
α0 + β0      (0,∞)   estimated  Pareto(1.5)*                      prior specificity scale
α1/(α1+β1)   [0,1]   estimated  Beta(1,1)                         prior sensitivity mean
α1 + β1      (0,∞)   estimated  Pareto(1.5)*                      prior sensitivity scale
x[i,j]       {0,1}   input      Bern(c[i]==1 ? θ1[j] : 1-θ0[j])   annotation of item i by annotator j

The Collapsed Sampler

The basic idea is to sample only the category assignments c[i] in each round of Gibbs sampling. With categories given, it’s easy to compute prevalence, annotator sensitivity and specificity given their conjugate priors.

The only thing we need to sample is c[i], and we can inspect the graphical model for dependencies: the parent π of the c[i], and the parents θ0 and θ1 of the descendants x[i,j] of c[i]. The formula’s straightforwardly derived with Bayes’ rule:

p(c[i] | x, θ0, θ1) ∝ p(c[i]) * Πj in 1:J p(x[i,j] | c[i], θ0[j], θ1[j])

Moment-Matching Beta Priors

*The only trick is estimating the priors over the sensitivities and specificities, for which I took the usual expedient of using moment matching. Note that this does not take into account the Pareto prior on scales of the prior specificity and sensitivity (hence the asterisk in the table). In particular, given a set of annotator specificities (and there were 200+ annotators for the named-entity data), we find the beta prior with mean matching the empirical mean and variance matching the empirical variance (requires some algebra).
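For the record, the moment-matching algebra works out simply: if the sample mean is m and the sample variance is v, the Beta(α,β) with those moments has α + β = m(1-m)/v - 1, with α and β splitting that total in proportion m : (1-m). A sketch (class and method names are mine):

```java
class BetaMomentMatch {
    // returns {alpha, beta} for the Beta distribution with mean m and variance v;
    // requires 0 < m < 1 and 0 < v < m(1-m)
    static double[] match(double m, double v) {
        double scale = m * (1.0 - m) / v - 1.0;  // this is alpha + beta
        return new double[] { m * scale, (1.0 - m) * scale };
    }
}
```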

I’m not too worried about the Pareto scale prior — it’s pretty diffuse. I suppose I could’ve done something with maximum likelihood rather than moment matching (but for all I know, this is the maximum likelihood solution! [update: it’s not the ML estimate; check out Thomas Minka’s paper Estimating a Dirichlet Distribution and references therein.]).


The inputs are initial values for annotator specificity, annotator sensitivity, and prevalence. These are used to create the first category sample given the above equation, which allows us to define all the other variables for the first sample. Then each epoch just resamples all the categories, then recomputes all the other estimates. This could be made more stochastic by updating all of the variables after each category update, but it converges so fast as is, that it hardly seemed worth the extra coding effort. I made sure to scale for underflow, and that’s about it.
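The per-item sampling step, in log scale with the underflow rescaling mentioned above, might look like this (a sketch with point estimates plugged in, as described; all names are mine):

```java
import java.util.Random;

class CategorySampler {
    // sample c[i] in {0,1} given prevalence pi and per-annotator
    // sensitivity theta1[j] = p(x=1 | c=1) and specificity theta0[j] = p(x=0 | c=0);
    // xi[j] is annotator j's binary label for item i
    static int sampleCategory(int[] xi, double pi,
                              double[] theta0, double[] theta1, Random rng) {
        double logP1 = Math.log(pi);
        double logP0 = Math.log(1.0 - pi);
        for (int j = 0; j < xi.length; ++j) {
            logP1 += Math.log(xi[j] == 1 ? theta1[j] : 1.0 - theta1[j]);
            logP0 += Math.log(xi[j] == 0 ? theta0[j] : 1.0 - theta0[j]);
        }
        // rescale by the max before leaving log scale, to avoid underflow
        double max = Math.max(logP0, logP1);
        double e0 = Math.exp(logP0 - max);
        double e1 = Math.exp(logP1 - max);
        return rng.nextDouble() < e1 / (e0 + e1) ? 1 : 0;
    }
}
```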

It took about an hour to think about (most of which was working out the moment matching algebra, which I later found in Gelman’s BDA book’s appendix), about two hours to code, and about forty-five minutes to write up for this blog entry.

Speed and Convergence

It’s very speedy and converges very very quickly compared to the full Gibbs sampler in BUGS. I spent another hour after everything was built and running writing the online sample handler that’d compute means and deviations for all the estimated variables, just like the R/BUGS interface prints out. Having the online mean and variance calculator was just what I needed (more about it later, too), especially as many of the values were very close to each other and I didn’t want to store all of the samples (part of what’s slowing BUGS down).


I didn’t run into identifiability problems, but in general, something needs to be done to get rid of the dual solutions (you’d get them here, in fact, if the initial sensitivities and specificities were worse than 0.5).

Open Question (for me)

My only remaining question is: why does this work? I don’t understand the theory of collapsed samplers. First, I don’t get nearly the range of possible samples for the priors given that they’re estimated from discrete sets. Second, I don’t apply beta-binomial inference in the main sampling equation — that is, I take the prevalence and annotator sensitivities and specificities as point estimates rather than integrating out their beta-binomial form. Is this kosher?

Downloading from LingPipe Sandbox

You can find the code in the LingPipe Sandbox in the project hierAnno (the original R and BUGS code and the data are also in that project). It’s not documented at all yet, but the one Ant task should run out of the box; I’ll probably figure out how to move the application into LingPipe.

[Update: The evaluation looks fine for named entities; I’m going to censor the data the same way as in the NAACL submission, and then evaluate again; with all the negative noise, it’s a bit worse than voting as is and the estimates aren’t useful because the specificities so dominate the calculations. For Snow et al.’s RTE data, the collapsed estimator explodes just like EM, with sensitivity scales diverging to infinity; either there’s a bug, the collapsed sampler isn’t stable, or I really do need a prior on those scales!]

[Update 2: The censored NE evaluation with collapsed Gibbs sampling and simple posterior evaluation by category sample worked out to have one fewer error than the full sampling model in BUGS (versus the original, albeit noisy, gold standard): collapsed Gibbs 232 errors, full Gibbs 233 errors, voting 242.5 errors (the half is from flipping coins on ties). Voting after throwing out the two really noisy annotators is probably competitive as is. I still don’t know why the RTE data’s blowing out variance.]

Medical Subject Headings (MeSH) Parser

July 2, 2009

I just finished up the doc on a complete parser and object model for the U.S. National Library of Medicine’s XML-based Medical Subject Headings (MeSH) distribution. MeSH contains over 25,000 controlled vocabulary items in the biomedical domain, organized into highly structured and cross-linked records. Here are links that explain what’s in MeSH:

What’s really neat is that MEDLINE is distributed with MeSH terms added by a dedicated staff of research librarians.

The parser and object representations follow the XML pretty closely, which may be familiar from LingPipe’s longstanding MEDLINE parsing and object package (which has its own tutorial). The parsers for MeSH and MEDLINE can take the compressed distribution files and parse out object-oriented representations of the associated MeSH or MEDLINE records.

The parsing pattern employed in LingPipe’s corpus package is very much like SAX’s parser/handler pattern. At one point, I wrote a whole blog entry on LingPipe’s parser/handler framework. A handler is attached to a parser, and as the parser parses the raw data, it sends object-oriented representations of records to the handler. There’s a simple demo command in the package that just prints them out to show how the code can be used.
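Stripped to its essentials, the pattern looks like this (toy classes for illustration; the real versions live in com.aliasi.corpus and the MEDLINE/MeSH packages):

```java
// a handler receives one object-oriented record at a time
interface RecordHandler<T> {
    void handle(T record);
}

// a parser drives the handler as it walks the raw data;
// a real MeSH parser would emit structured record objects, not strings
class ToyLineParser {
    private RecordHandler<String> mHandler;
    void setHandler(RecordHandler<String> handler) { mHandler = handler; }
    void parse(String rawData) {
        for (String line : rawData.split("\n"))
            mHandler.handle(line);
    }
}
```

Attaching a handler that prints each record reproduces what the demo command in the package does.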

As part of the preparations for LingPipe 4.0, I’m going to be moving the MEDLINE package (com.aliasi.medline) out of LingPipe proper and into the LingMed sandbox project. LingMed will probably graduate from the sandbox and start getting distributed on its own.

LingMed also includes parsers for distributions of NLM’s Entrez-Gene, NLM et al.’s Online Mendelian Inheritance in Man (OMIM) and the GO Consortium’s Gene Ontology (GO). These all work using the same basic SAX-like pattern.

LingMed also has data access layer implementations of search for many of these data sets (like Entrez-Gene and MEDLINE) integrated with Lucene. In particular, you can do fielded searches, yet retrieve object-oriented results (rather than Lucene documents) out the other side.

We’ll continue to add to the parsing and search base over time. Mitzi‘s working on Wormbase as part of her job at NYU, and I’ve done the preliminary parsing for Uniprot.

Downloading LingMed

You can find instructions on the LingPipe site for anonymous subversion (SVN) access to our sandbox, which includes specific information about checking out the LingMed project.

Let us know if you find it useful or want us to add other features. We’d love to get some other folks using this. As is, the documentation’s a bit sketchy in places. Please feel free to send questions to the LingPipe mailing list or to me directly.