API Design: Should I Reify Taggings for CRFs and HMMs?


I finished the technical core and testing for CRFs, including regularized stochastic gradient estimation as well as first-best, n-best and marginal decoders. Now I’m thinking about retrofitting everything so that CRFs work the same way as HMMs from an interface perspective.

Current HMM Interfaces

For hmm.HmmDecoder, the current interface is:

class hmm.HmmDecoder

    String[] firstBest(String[] emissions);

    Iterator<ScoredObject<String[]>> nBest(String[] emissions, int maxN);

    TagWordLattice lattice(String[] emissions);

Note that this is using arrays for inputs and outputs (I wrote it before generics), and constructs n-best lists and scored outputs using standardized wrappers/interfaces like Java's java.util.Iterator and com.aliasi.util.ScoredObject from LingPipe.

Tag Package

I started a new package, tag, with three new interfaces for first-best, n-best, and marginal taggers. I'll adapt the HMM decoders to implement them and deprecate the methods described above (if necessary). The big question is what should the interfaces look like?

First-Best Taggers

I generalized inputs to lists of generic objects -- they used to be fixed to arrays of strings. The first interface is for first-best tagging:

interface tag.Tagger<E>

     Tagging<E> tag(List<E> tokens)

with an associated tag.Tagging abstract base class:

class tag.Tagging<E>

    int size();

    String tag(int n); 

    E token(int n);

    List<E> tokens();

    List<String> tags();

Just to be safe, the public constructor copies the input tags and tokens, and the methods that return the lists provide unmodifiable views. The tags themselves are plain strings, returned as a list. There's nothing in principle preventing the tags from being structured -- it just didn't seem worth the implementation hassle and resulting interface complexity.
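The copy-on-construct and unmodifiable-view behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual LingPipe source: the class is made concrete for brevity, and the length check in the constructor is an assumption that the post does not mention.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of tag.Tagging<E>; not the actual LingPipe source.
class Tagging<E> {
    private final List<E> tokens;
    private final List<String> tags;

    Tagging(List<E> tokens, List<String> tags) {
        // Assumption: one tag per token; the post does not state this check.
        if (tokens.size() != tags.size()) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        // Defensive copies, so later changes to the caller's lists do not leak in.
        this.tokens = Collections.unmodifiableList(new ArrayList<>(tokens));
        this.tags = Collections.unmodifiableList(new ArrayList<>(tags));
    }

    int size() { return tokens.size(); }

    String tag(int n) { return tags.get(n); }

    E token(int n) { return tokens.get(n); }

    // Both accessors return unmodifiable views of the internal copies.
    List<E> tokens() { return tokens; }

    List<String> tags() { return tags; }
}
```

With this shape, mutating the list passed to the constructor has no effect on the tagging, and calling add() on the result of tags() throws UnsupportedOperationException.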

Question 1

Should there be a Tagging interface (or should I just return lists of tags and leave it up to users to keep track of inputs)?

N-Best Sequence Taggers

There's another interface for n-best taggers:

interface tag.NBestTagger<E>

    Iterator<ScoredTagging<E>> tagNBest(List<E> tokens, int maxResults);

The ScoredTagging object extends Tagging and implements the interface util.Scored:

interface util.Scored

    double score();

I like the iterator over the results, which works well in practice. What I'm less certain about is:

Question 2

Should I create a ScoredTagging interface (or just use ScoredObject<Tagging>)?
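To make the trade-off in Question 2 concrete, here is a rough side-by-side sketch of the two options. All classes below are simplified stand-ins written for this illustration (including the ScoredObject wrapper, which only mimics the general shape of com.aliasi.util.ScoredObject); they are assumptions, not the library's code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stand-in for util.Scored from the post.
interface Scored {
    double score();
}

// Minimal stand-in for tag.Tagging<E>, just enough to illustrate the choice.
class Tagging<E> {
    private final List<E> tokens;
    private final List<String> tags;

    Tagging(List<E> tokens, List<String> tags) {
        this.tokens = Collections.unmodifiableList(new ArrayList<>(tokens));
        this.tags = Collections.unmodifiableList(new ArrayList<>(tags));
    }

    int size() { return tokens.size(); }
    E token(int n) { return tokens.get(n); }
    String tag(int n) { return tags.get(n); }
}

// Option A: a dedicated subclass, so tags and score live on one object.
class ScoredTagging<E> extends Tagging<E> implements Scored {
    private final double score;

    ScoredTagging(List<E> tokens, List<String> tags, double score) {
        super(tokens, tags);
        this.score = score;
    }

    @Override public double score() { return score; }
}

// Option B: a generic wrapper (simplified stand-in for ScoredObject).
class ScoredObject<T> implements Scored {
    private final T object;
    private final double score;

    ScoredObject(T object, double score) {
        this.object = object;
        this.score = score;
    }

    T getObject() { return object; }

    @Override public double score() { return score; }
}
```

The difference at the call site is one level of indirection: with option A a client writes result.tag(0), with option B it writes result.getObject().tag(0). Option A costs an extra class; option B reuses an existing one.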

Marginal (per Tag, per Phrase) Tagging

The final interface is for marginal results, which allow the computation of the conditional probability of a sequence of tags at a given position given the input sequence.

interface tag.MarginalTagger<E>

    TagLattice<E> tagMarginal(List<E> tokens);

This one's easy, as I need a structure for the result. What I need to do is adapt the HMM TagWordLattice so that it implements the common interface TagLattice<String>. I think all it needs to have in it is:

abstract class tag.TagLattice<E>

    double logProbability(int n, int tag);

    double logProbability(int n, int[] tags);

    SymbolTable tagSymbolTable(); 

Of course, I can add utility methods for linear probabilities and accessing tags by name rather than symbol table identifier.
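The utility methods mentioned above could be layered on the three core methods along these lines. This is only a sketch under assumptions: the SymbolTable interface here is a local stand-in with assumed method names and an assumed -1 return for unknown symbols, the type parameter is omitted for brevity, and natural logarithms are assumed (the post does not say which base the lattice uses).

```java
// Local stand-in for a symbol table; names and the -1 convention are assumptions.
interface SymbolTable {
    int symbolToID(String symbol);  // assumed to return -1 for unknown symbols
    String idToSymbol(int id);
}

// Sketch of tag.TagLattice with utility methods built on the core methods.
abstract class TagLattice {
    // Core methods from the post.
    public abstract double logProbability(int n, int tag);
    public abstract double logProbability(int n, int[] tags);
    public abstract SymbolTable tagSymbolTable();

    // Linear-scale probability of a single tag (assumes natural logs).
    public double probability(int n, int tag) {
        return Math.exp(logProbability(n, tag));
    }

    // Linear-scale probability of a tag sequence starting at position n.
    public double probability(int n, int[] tags) {
        return Math.exp(logProbability(n, tags));
    }

    // Access by tag name rather than symbol table identifier.
    public double logProbability(int n, String tag) {
        int id = tagSymbolTable().symbolToID(tag);
        // An unknown tag gets log probability -infinity, i.e. probability 0.
        return id < 0 ? Double.NEGATIVE_INFINITY : logProbability(n, id);
    }

    public double probability(int n, String tag) {
        return Math.exp(logProbability(n, tag));
    }
}
```

Because the utilities only call the abstract methods, both the HMM and CRF lattices would inherit them without extra work.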

Forward-Backward Lattice

There's also the implementation issue of whether to combine the currently separate forward-backward (or generalized sum-product) implementations of the tag lattices in HMMs and CRFs, and then whether to expose the implementation methods, which are necessary for extracting n-best phrases in chunkers. The forward/backward phrase extraction requires forward, backward, and transition scores, and an overall normalizer (conventionally written as Z) to normalize scores to conditional probabilities.

class tag.ForwardBackwardLattice<E> extends tag.TagLattice<E>

    double logForward(int n, int tagTo);

    double logBackward(int n, int tagTo);

    double logTransition(int n, int tagFrom, int tagTo);

    double logZ();
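As a rough illustration of why these four methods suffice, here is a sketch of how tag and phrase marginals could be derived from them. The indexing conventions are assumptions, since the post does not spell them out: logForward(n, t) is taken to include the emission at position n, logBackward(n, t) to exclude it, and logTransition(n, from, to) to score the move from position n-1 to n including the emission at n. The helper class name is hypothetical.

```java
// Stand-in for the proposed tag.ForwardBackwardLattice (generics omitted).
interface ForwardBackwardLattice {
    double logForward(int n, int tagTo);
    double logBackward(int n, int tagTo);
    double logTransition(int n, int tagFrom, int tagTo);
    double logZ();
}

// Hypothetical helper showing how marginals follow from the lattice scores.
class LatticeMarginals {

    // log p(tag at position n | input) = forward + backward - log Z
    static double logProbability(ForwardBackwardLattice lat, int n, int tag) {
        return lat.logForward(n, tag) + lat.logBackward(n, tag) - lat.logZ();
    }

    // log p(tags[0..k-1] at positions n..n+k-1 | input):
    // forward into the first tag, transitions through the phrase,
    // backward out of the last tag, normalized by log Z.
    static double logProbability(ForwardBackwardLattice lat, int n, int[] tags) {
        if (tags.length == 0) {
            return 0.0; // empty phrase has probability 1
        }
        double sum = lat.logForward(n, tags[0]);
        for (int i = 1; i < tags.length; ++i) {
            sum += lat.logTransition(n + i, tags[i - 1], tags[i]);
        }
        int last = tags.length - 1;
        sum += lat.logBackward(n + last, tags[last]);
        return sum - lat.logZ();
    }
}
```

Under these conventions the single-tag marginal is just the one-element case of the phrase marginal, which is one argument for letting TagLattice's two logProbability methods be implemented once on top of the forward-backward methods.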

Should I just require a forward-backward lattice in the interface? I never feel it's worth it to add really generic interfaces, such as:

Tagger<E, T extends TagLattice<E>>

    T tagMarginal(List<E> tokens);

That'd let the classes that need the forward-backward version specify T.
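For comparison, the fully generic variant might look something like the sketch below. The names MarginalTagger and PhraseExtractor, and the cut-down lattice classes, are hypothetical stand-ins chosen for this illustration; only the shape of the type parameters comes from the post.

```java
import java.util.List;

// Cut-down stand-ins for the lattice classes discussed in the post.
abstract class TagLattice<E> {
    public abstract double logProbability(int n, int tag);
}

abstract class ForwardBackwardLattice<E> extends TagLattice<E> {
    public abstract double logZ();
}

// The "really generic" interface: the lattice type is a second parameter.
interface MarginalTagger<E, T extends TagLattice<E>> {
    T tagMarginal(List<E> tokens);
}

// Hypothetical client (e.g. part of a chunker) that needs the richer lattice.
// It demands a tagger whose T is some ForwardBackwardLattice<E>.
class PhraseExtractor<E> {
    private final MarginalTagger<E, ? extends ForwardBackwardLattice<E>> tagger;

    PhraseExtractor(MarginalTagger<E, ? extends ForwardBackwardLattice<E>> tagger) {
        this.tagger = tagger;
    }

    // No cast needed: the static type already exposes logZ().
    double logNormalizer(List<E> tokens) {
        return tagger.tagMarginal(tokens).logZ();
    }
}
```

The gain is that clients needing forward-backward scores get them without a cast; the cost is that every declaration of a tagger carries a second type argument, which is the kind of interface weight the post is wary of.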

Taggings and Handlers

The next big issue is whether to change the way taggings are supplied to taggers. The current method uses a specific handler:

interface corpus.TagHandler

    void handle(String[] toks, String[] whitespaces, 
                String[] tags);

This allows whitespaces to be provided, but we never use them at the tagger level. This brings up the question of whether to reify taggings for training:

Question 3

Should I replace TagHandler with ObjectHandler<Tagging>?

A "yes" answer would allow me to deprecate TagHandler going forward and would make tagging more parallel to chunking. But it's a lot of work (not that it should worry you), will require lots of changes to existing code, and messes with longer-term backward compatibility.
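One way to soften the backward-compatibility cost during a deprecation period is an adapter from the old handler to the new one. The sketch below is an assumption about how that could look, not something the post proposes: the interfaces are local stand-ins with the signatures shown above, and TagHandlerAdapter is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Local stand-ins for the two handler interfaces discussed in the post.
interface TagHandler {
    void handle(String[] toks, String[] whitespaces, String[] tags);
}

interface ObjectHandler<E> {
    void handle(E e);
}

// Minimal stand-in for tag.Tagging<E>.
class Tagging<E> {
    private final List<E> tokens;
    private final List<String> tags;

    Tagging(List<E> tokens, List<String> tags) {
        this.tokens = Collections.unmodifiableList(new ArrayList<>(tokens));
        this.tags = Collections.unmodifiableList(new ArrayList<>(tags));
    }

    List<E> tokens() { return tokens; }
    List<String> tags() { return tags; }
}

// Hypothetical adapter: legacy code keeps calling TagHandler.handle(),
// while the receiver is written against ObjectHandler<Tagging<String>>.
class TagHandlerAdapter implements TagHandler {
    private final ObjectHandler<Tagging<String>> handler;

    TagHandlerAdapter(ObjectHandler<Tagging<String>> handler) {
        this.handler = handler;
    }

    @Override
    public void handle(String[] toks, String[] whitespaces, String[] tags) {
        // Whitespaces are dropped; the post notes they are unused at the tagger level.
        handler.handle(new Tagging<String>(Arrays.asList(toks), Arrays.asList(tags)));
    }
}
```

Existing parsers that emit the three-array form could then feed new trainers unchanged by wrapping the trainer in the adapter.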

Evaluator Overloading

Right now tagger evaluators only work for HMMs. I need to generalize. The classifier evaluator inspects the classifier being evaluated with reflection to see what it can do, and evaluates all of the parts that can be evaluated. Which brings up:

Question 4

Should I create three distinct evaluators, one for first-best, one for n-best, and one for marginal taggers (or should I just overload a single one)?
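The single-evaluator option could dispatch on what the tagger implements, roughly as sketched below. This uses plain instanceof checks as a simple form of runtime type inspection, which is an assumption about the approach rather than a description of the existing classifier evaluator. The interfaces are simplified stand-ins with reduced return types, and TaggerEvaluatorSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified stand-ins for the three tagger interfaces; return types are
// reduced to plain collections and arrays for this illustration only.
interface Tagger<E> {
    List<String> tag(List<E> tokens);
}

interface NBestTagger<E> {
    Iterator<List<String>> tagNBest(List<E> tokens, int maxResults);
}

interface MarginalTagger<E> {
    double[][] tagMarginal(List<E> tokens);
}

// Hypothetical single evaluator that checks which evaluations apply.
class TaggerEvaluatorSketch {

    // Returns the names of the evaluations the given object supports,
    // in a fixed order: first-best, n-best, marginal.
    static List<String> supportedEvaluations(Object tagger) {
        List<String> supported = new ArrayList<>();
        if (tagger instanceof Tagger) {
            supported.add("first-best");
        }
        if (tagger instanceof NBestTagger) {
            supported.add("n-best");
        }
        if (tagger instanceof MarginalTagger) {
            supported.add("marginal");
        }
        return supported;
    }
}
```

A real evaluator would run the corresponding evaluation in each branch instead of recording a name. With three distinct evaluators the same checks disappear, at the price of three classes for clients to wire up.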

Speak Now, or Forever ...

My current thinking is to answer "yes" to all the questions. So now's a good time to chime in if you don't like those answers!

2 Responses to “API Design: Should I Reify Taggings for CRFs and HMMs?”

  1. Paraba Says:

    Question 1:

    I would say: go with the list. The Tagging interface doesn’t seem to provide anything interesting, unless you plan to add something to it in the future. I guess one thing that would make life easier in some cases is a way to access pairs easily, i.e. if the Tagging interface had a method that returned List<Pair>, but this also seems stupid.

    Question 2:
    I would prefer ScoredObject. Again, unless you plan to add something to ScoredTagging in the future.

    I’m not actually using LingPipe, so I cannot really answer Questions 3 and 4. Also, don’t take my recommendation too seriously.

    • lingpipe Says:

      Taggings add two classes, Tagging and ScoredTagging, along with their associated object allocations and space, and the pressure that puts on garbage collection (relatively speaking, not much of an issue here). I originally went with arrays of strings, which are even more economical than lists of strings.

      If I add Tagging, I can eliminate the interface corpus.TagHandler in favor of ObjectHandler<Tagging>. One longer-term benefit of this is that it’s easy to write generic cross-validating corpora (which is one case where you need the pairs of inputs and outputs to be iterated over). A downside is that you can only implement ObjectHandler for one generic type, so how would I handle both taggings and chunkings?

      If I kept string arrays as outputs instead of Tagging, it’d certainly mean less fiddling with the existing HMM classes. But there’s no way to generify the handlers because TagHandler is really tied to char sequence inputs.
