I finished the technical core and testing for CRFs, including regularized stochastic gradient estimation as well as first-best, n-best and marginal decoders. Now I’m thinking about retrofitting everything so that CRFs work the same way as HMMs from an interface perspective.
Current HMM Interfaces
The current interface for hmm.HmmDecoder is:

    class hmm.HmmDecoder
        String[] firstBest(String[] emissions);
        Iterator<ScoredObject<String[]>> nBest(String[] emissions, int maxN);
        TagWordLattice lattice(String[] emissions);

Note that this uses arrays for inputs and outputs (I wrote it before generics), and constructs n-best lists and scored outputs using LingPipe's standard wrapper interface, com.aliasi.util.ScoredObject.
I started a new package,
tag, with three new interfaces for first-best, n-best, and marginal taggers. I'll adapt the HMM decoders to implement them and deprecate the methods described above if necessary. The big question is: what should the interfaces look like?
I generalized inputs to lists of generic objects -- they used to be fixed to arrays of strings. The first interface is for first-best tagging:
    interface tag.Tagger<E>
        Tagging<E> tag(List<E> tokens);
with an associated
tag.Tagging abstract base class:
    class tag.Tagging<E>
        int size();
        String tag(int n);
        E token(int n);
        List<E> tokens();
        List<String> tags();
Just to be safe, the public constructor copies the input tags and tokens, and the methods that return lists provide unmodifiable views. The tags are also returned as a list. There's nothing in principle preventing the tags from being structured -- it just didn't seem worth the implementation hassle and resulting interface complexity.
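To make the defensive copying and unmodifiable views concrete, here's a minimal sketch of such a base class. Everything beyond the listed method signatures (field names, the length check) is my assumption, not the actual implementation:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the Tagging<E> base class described above:
// the constructor defensively copies its inputs, and the list
// accessors return unmodifiable views so clients can't mutate them.
public class Tagging<E> {
    private final List<E> mTokens;
    private final List<String> mTags;

    public Tagging(List<E> tokens, List<String> tags) {
        if (tokens.size() != tags.size())
            throw new IllegalArgumentException("tokens and tags must be the same length");
        mTokens = new ArrayList<E>(tokens);   // defensive copy
        mTags = new ArrayList<String>(tags);  // defensive copy
    }

    public int size() { return mTokens.size(); }

    public String tag(int n) { return mTags.get(n); }

    public E token(int n) { return mTokens.get(n); }

    public List<E> tokens() { return Collections.unmodifiableList(mTokens); }

    public List<String> tags() { return Collections.unmodifiableList(mTags); }
}
```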
N-Best Sequence Taggers
There's another interface for n-best taggers:
    interface tag.NBestTagger<E>
        Iterator<ScoredTagging<E>> tagNBest(List<E> tokens, int maxResults);
The ScoredTagging class extends Tagging and implements the interface:

    interface util.Scored
        double score();
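As a sketch of how a client might consume the n-best interface, here's a loop over scored taggings. The toy tagger and demo classes are purely hypothetical stand-ins to make the loop runnable, with the full Tagging superclass elided:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Minimal stand-in for the n-best interface described in the post.
interface NBestTagger<E> {
    Iterator<ScoredTagging<E>> tagNBest(List<E> tokens, int maxResults);
}

// A scored tagging pairs a tag sequence with its score; the Tagging
// superclass is elided here for brevity.
class ScoredTagging<E> {
    private final List<String> mTags;
    private final double mScore;

    ScoredTagging(List<String> tags, double score) {
        mTags = tags;
        mScore = score;
    }

    public List<String> tags() { return mTags; }

    public double score() { return mScore; }
}

public class NBestDemo {
    // Toy tagger returning two fixed analyses in decreasing score order.
    static NBestTagger<String> demoTagger() {
        return (tokens, maxResults) ->
            Arrays.asList(
                new ScoredTagging<String>(Arrays.asList("NP", "VBD"), -1.2),
                new ScoredTagging<String>(Arrays.asList("NP", "NN"), -3.7)
            ).iterator();
    }

    public static void main(String[] args) {
        Iterator<ScoredTagging<String>> it =
            demoTagger().tagNBest(Arrays.asList("John", "ran"), 2);
        while (it.hasNext()) {
            ScoredTagging<String> tagging = it.next();
            System.out.println(tagging.score() + " : " + tagging.tags());
        }
    }
}
```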
I like the iterator over the results, which works well in practice. What I'm less certain about is:
Marginal (per Tag, per Phrase) Tagging
The final interface is for marginal results, which allow the computation of the conditional probability of a sequence of tags at a given position given the input sequence.
    interface tag.MarginalTagger<E>
        TagLattice<E> tagMarginal(List<E> tokens);
This one's easy, as I need a structure for the result. What I need to do is adapt the HMM TagWordLattice so that it implements the common interface TagLattice<String>. I think all it needs is:

    abstract class tag.TagLattice<E>
        double logProbability(int n, int tag);
        double logProbability(int n, int[] tags);
        SymbolTable tagSymbolTable();
Of course, I can add utility methods for linear probabilities and accessing tags by name rather than symbol table identifier.
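Those utility methods would be thin wrappers over the log-scale, identifier-based core. Here's a sketch of what I mean; the SymbolTable is reduced to a plain tag list and the demo lattice is invented, just to keep the example self-contained:

```java
import java.util.Arrays;
import java.util.List;

// Sketch of the utility-method layer: linear probabilities and access
// by tag name delegate to the log-scale, identifier-based core method.
abstract class TagLattice<E> {
    public abstract double logProbability(int n, int tag);

    public abstract List<String> tagList();  // stand-in for tagSymbolTable()

    // Utility: linear-scale probability.
    public double probability(int n, int tag) {
        return Math.exp(logProbability(n, tag));
    }

    // Utility: look up by tag name rather than symbol identifier.
    public double probability(int n, String tag) {
        return probability(n, tagList().indexOf(tag));
    }
}

public class LatticeDemo {
    // Toy lattice over one position with two tags.
    static TagLattice<String> demoLattice() {
        final double[][] logProbs = { { Math.log(0.9), Math.log(0.1) } };
        return new TagLattice<String>() {
            public double logProbability(int n, int tag) { return logProbs[n][tag]; }
            public List<String> tagList() { return Arrays.asList("NP", "VB"); }
        };
    }

    public static void main(String[] args) {
        System.out.println(demoLattice().probability(0, "NP"));
    }
}
```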
There's also the implementation issue of whether to combine the currently separate forward-backward (or generalized sum-product) implementations of the tag lattices in HMMs and CRFs, and then whether to expose the implementation methods, which are necessary for extracting n-best phrases in chunkers. The forward/backward phrase extraction requires forward, backward, and transition scores, and an overall normalizer (conventionally written as
Z) to normalize scores to conditional probabilities.
    abstract class tag.ForwardBackwardLattice<E> extends tag.TagLattice<E>
        double logForward(int n, int tagTo);
        double logBackward(int n, int tagTo);
        double logTransition(int n, int tagFrom, int tagTo);
        double logZ();
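To show how a chunker would combine these scores, here's a hypothetical helper: the conditional log probability of a tag sequence over a phrase is the forward score into its first tag, plus the interior transition scores, plus the backward score out of its last tag, minus log Z. The exact convention for which scores include emissions is an assumption here, and the demo lattice just exercises the arithmetic:

```java
// Sketch of phrase probability extraction from a forward-backward
// lattice; the helper and demo classes are hypothetical.
abstract class ForwardBackwardLattice {
    abstract double logForward(int n, int tagTo);

    abstract double logBackward(int n, int tagTo);

    abstract double logTransition(int n, int tagFrom, int tagTo);

    abstract double logZ();

    // log P(tags at positions start..start+tags.length-1 | input):
    // forward into the first tag, interior transitions, backward out
    // of the last tag, normalized by log Z.
    double logPhraseProb(int start, int[] tags) {
        double logProb = logForward(start, tags[0]);
        for (int i = 1; i < tags.length; ++i)
            logProb += logTransition(start + i - 1, tags[i - 1], tags[i]);
        logProb += logBackward(start + tags.length - 1, tags[tags.length - 1]);
        return logProb - logZ();
    }
}

public class PhraseProbDemo {
    // Toy lattice with constant scores, just to exercise the arithmetic.
    static ForwardBackwardLattice toyLattice() {
        return new ForwardBackwardLattice() {
            double logForward(int n, int tagTo) { return -1.0; }
            double logBackward(int n, int tagTo) { return -2.0; }
            double logTransition(int n, int tagFrom, int tagTo) { return -0.5; }
            double logZ() { return 2.0; }
        };
    }

    public static void main(String[] args) {
        // -1.0 + -0.5 + -2.0 - 2.0 = -5.5
        System.out.println(toyLattice().logPhraseProb(0, new int[] { 0, 1 }));
    }
}
```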
Should I just require a forward-backward lattice in the interface? I never feel it's worth it to add really generic interfaces, such as:
    interface tag.Tagger<E, T extends TagLattice<E>>
        T tagMarginal(List<E> tokens);
That'd let the classes that need the forward-backward version specify it as the lattice type parameter.
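For concreteness, here's a sketch of that fully generic alternative, with the lattice classes stubbed out. Names other than those already mentioned in the post (CrfTagger in particular) are hypothetical:

```java
import java.util.List;

// Stubbed lattice hierarchy.
abstract class TagLattice<E> { }

abstract class ForwardBackwardLattice<E> extends TagLattice<E> { }

// The lattice type is a type parameter bounded by TagLattice<E>.
interface Tagger<E, T extends TagLattice<E>> {
    T tagMarginal(List<E> tokens);
}

// A CRF-style implementation can then declare the richer return type
// directly, so clients get a forward-backward lattice without casting.
class CrfTagger implements Tagger<String, ForwardBackwardLattice<String>> {
    public ForwardBackwardLattice<String> tagMarginal(List<String> tokens) {
        return new ForwardBackwardLattice<String>() { };  // stub lattice
    }
}
```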
Taggings and Handlers
The next big issue is whether to change the way taggings are supplied to taggers for training. The current method uses a specific handler:

    interface corpus.TagHandler
        void handle(String[] toks, String[] whitespaces, String[] tags);
This allows whitespaces to be provided, but we never use them at the tagger level. This brings up the question of whether to reify taggings for training:
A "yes" answer would allow me to deprecate
TagHandler going forward and would make tagging more parallel to chunking. But it's a lot of work (not that it should worry you), will require lots of changes to existing code, and messes with longer-term backward compatibility.
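Here's one way the reified alternative might look: training data arrives as whole Tagging objects through a single generic handler, parallel to the way chunkings are handled. Both types are reduced to essentials, and all names beyond Tagging are assumptions:

```java
import java.util.Arrays;
import java.util.List;

// A single generic handler interface replaces the tag-specific one.
interface ObjectHandler<T> {
    void handle(T object);
}

// Reified tagging, reduced to its essentials.
class Tagging<E> {
    final List<E> tokens;
    final List<String> tags;

    Tagging(List<E> tokens, List<String> tags) {
        this.tokens = tokens;
        this.tags = tags;
    }
}

// A trainer would implement the handler and accumulate statistics
// from each supervised tagging it receives; this one just counts tokens.
class TaggingCounter implements ObjectHandler<Tagging<String>> {
    int tokenCount = 0;

    public void handle(Tagging<String> tagging) {
        tokenCount += tagging.tokens.size();
    }
}

public class HandlerDemo {
    public static void main(String[] args) {
        TaggingCounter trainer = new TaggingCounter();
        trainer.handle(new Tagging<String>(
            Arrays.asList("John", "ran"), Arrays.asList("NP", "VBD")));
        System.out.println(trainer.tokenCount);
    }
}
```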
Right now, the tagger evaluators only work for HMMs, so I need to generalize them. The classifier evaluator uses reflection to inspect the classifier being evaluated, determine what it can do, and evaluate all of the parts that can be evaluated. Which brings up:
Speak Now, or Forever ...
My current thinking is to answer "yes" to all the questions. So now's a good time to chime in if you don't like those answers!