I finished the technical core and testing for CRFs, including regularized stochastic gradient estimation as well as first-best, n-best and marginal decoders. Now I’m thinking about retrofitting everything so that CRFs work the same way as HMMs from an interface perspective.
Current HMM Interfaces
For hmm.HmmDecoder, the current interface is:
class hmm.HmmDecoder
    String[] firstBest(String[] emissions);
    Iterator<ScoredObject<String[]>> nBest(String[] emissions, int maxN);
    TagWordLattice lattice(String[] emissions);
Note that this uses arrays for inputs and outputs (I wrote it before generics), and constructs n-best lists and scored outputs using standardized wrappers/interfaces like Java's java.util.Iterator and com.aliasi.util.ScoredObject from LingPipe.
Tag Package
I started a new package, tag, with three new interfaces for first-best, n-best, and marginal taggers. I'll adapt the HMM decoders to implement them and deprecate the methods described above (if necessary). The big question is what should the interfaces look like?
First-Best Taggers
I generalized inputs to lists of generic objects -- they used to be fixed to arrays of strings. The first interface is for first-best tagging:
interface tag.Tagger<E>
    Tagging<E> tag(List<E> tokens);
with an associated tag.Tagging abstract base class:
class tag.Tagging<E>
    int size();
    String tag(int n);
    E token(int n);
    List<E> tokens();
    List<String> tags();
Just to be safe, the public constructor copies the input tags and tokens, and the methods that return the lists provide unmodifiable views. I also have the tags being returned as a list of strings. There's nothing in principle preventing the tags from being structured -- it just didn't seem worth the implementation hassle and resulting interface complexity.
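To make the copying and unmodifiable-view behavior concrete, here is a minimal sketch of what such a class could look like. It is only an illustration under my own assumptions: the class is concrete rather than abstract, the length check is something I added, and none of this is the actual LingPipe source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the Tagging class described above; not LingPipe's source.
class Tagging<E> {
    private final List<E> tokens;
    private final List<String> tags;

    Tagging(List<E> tokens, List<String> tags) {
        // Assumed sanity check: one tag per token.
        if (tokens.size() != tags.size()) {
            throw new IllegalArgumentException("Tokens and tags must have the same length.");
        }
        // Defensive copies, each wrapped once in an unmodifiable view.
        this.tokens = Collections.unmodifiableList(new ArrayList<E>(tokens));
        this.tags = Collections.unmodifiableList(new ArrayList<String>(tags));
    }

    int size() { return tokens.size(); }
    E token(int n) { return tokens.get(n); }
    String tag(int n) { return tags.get(n); }
    List<E> tokens() { return tokens; }
    List<String> tags() { return tags; }
}

class TaggingDemo {
    public static void main(String[] args) {
        List<String> toks = new ArrayList<String>(Arrays.asList("John", "ran"));
        List<String> tags = new ArrayList<String>(Arrays.asList("NP", "VBD"));
        Tagging<String> tagging = new Tagging<String>(toks, tags);
        toks.set(0, "Mary"); // changing the caller's list leaves the tagging untouched
        System.out.println(tagging.token(0) + "/" + tagging.tag(0));
    }
}
```

The copy in the constructor protects against later changes to the caller's lists, and the unmodifiable wrappers mean that tokens() and tags() can hand out the same list objects on every call without risk.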
Should there be a Tagging interface (or should I just return lists of tags and leave it up to users to keep track of inputs)?
N-Best Sequence Taggers
There's another interface for n-best taggers:
interface tag.NBestTagger<E>
    Iterator<ScoredTagging<E>> tagNBest(List<E> tokens, int maxResults);
The ScoredTagging object extends Tagging and implements the interface util.Scored:
interface util.Scored
    double score();
I like the iterator over the results, which works well in practice. What I'm less certain about is:
Should I create a ScoredTagging interface (or just use ScoredObject<Tagging>)?
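The two options differ mostly in how client code reaches the tags. The sketch below puts them side by side. Everything in it is a local stand-in: Tagging is stripped down to the bare minimum, and ScoredWrapper is a hypothetical class playing the role of ScoredObject, not the LingPipe class itself.

```java
import java.util.Arrays;
import java.util.List;

interface Scored {
    double score();
}

// Stripped-down stand-in for Tagging (no defensive copying here).
class Tagging<E> {
    private final List<E> tokens;
    private final List<String> tags;
    Tagging(List<E> tokens, List<String> tags) {
        this.tokens = tokens;
        this.tags = tags;
    }
    E token(int n) { return tokens.get(n); }
    String tag(int n) { return tags.get(n); }
}

// Option 1: a dedicated subclass that is itself a tagging.
class ScoredTagging<E> extends Tagging<E> implements Scored {
    private final double score;
    ScoredTagging(List<E> tokens, List<String> tags, double score) {
        super(tokens, tags);
        this.score = score;
    }
    public double score() { return score; }
}

// Option 2: a generic wrapper (hypothetical stand-in for ScoredObject).
class ScoredWrapper<T> implements Scored {
    private final T object;
    private final double score;
    ScoredWrapper(T object, double score) {
        this.object = object;
        this.score = score;
    }
    T getObject() { return object; }
    public double score() { return score; }
}

class ScoredDemo {
    public static void main(String[] args) {
        List<String> toks = Arrays.asList("John", "ran");
        List<String> tags = Arrays.asList("NP", "VBD");
        ScoredTagging<String> direct = new ScoredTagging<String>(toks, tags, -1.5);
        ScoredWrapper<Tagging<String>> wrapped =
            new ScoredWrapper<Tagging<String>>(new Tagging<String>(toks, tags), -1.5);
        System.out.println(direct.tag(1) + " " + direct.score());
        System.out.println(wrapped.getObject().tag(1) + " " + wrapped.score());
    }
}
```

With the subclass, an n-best result can be passed anywhere a plain tagging is expected. With the wrapper, there is one less class to maintain, at the cost of an extra getObject() call and a second object allocation per result.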
Marginal (per Tag, per Phrase) Tagging
The final interface is for marginal results, which allow the computation of the conditional probability of a sequence of tags at a given position given the input sequence.
interface tag.MarginalTagger<E>
    TagLattice<E> tagMarginal(List<E> tokens);
This one's easy, as I need a structure for the result. What I need to do is adapt the HMM TagWordLattice so that it implements the common interface TagLattice<String>. I think all it needs to have in it is:
abstract class tag.TagLattice<E>
    double logProbability(int n, int tag);
    double logProbability(int n, int[] tags);
    SymbolTable tagSymbolTable();
Of course, I can add utility methods for linear probabilities and accessing tags by name rather than symbol table identifier.
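The utility methods mentioned above could be layered on top of the abstract method along these lines. This is a sketch under stated assumptions: log probabilities are taken to be natural logs, a plain list of tag names stands in for the symbol table, and the names TagLatticeSketch, FixedLattice, and tagId are all hypothetical. The phrase-level logProbability(int, int[]) is left out to keep it short.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; a List<String> of tag names stands in for the symbol table.
abstract class TagLatticeSketch {
    abstract List<String> tagList();
    abstract double logProbability(int n, int tag);

    // Map a tag name to its integer identifier.
    int tagId(String tag) {
        int id = tagList().indexOf(tag);
        if (id < 0) {
            throw new IllegalArgumentException("Unknown tag: " + tag);
        }
        return id;
    }

    // Access by tag name rather than identifier.
    double logProbability(int n, String tag) {
        return logProbability(n, tagId(tag));
    }

    // Linear-scale probabilities, assuming natural logs.
    double probability(int n, int tag) {
        return Math.exp(logProbability(n, tag));
    }

    double probability(int n, String tag) {
        return Math.exp(logProbability(n, tag));
    }
}

// Toy implementation backed by a fixed table of log probabilities.
class FixedLattice extends TagLatticeSketch {
    private final List<String> tags;
    private final double[][] logProbs; // logProbs[n][tag]

    FixedLattice(List<String> tags, double[][] logProbs) {
        this.tags = tags;
        this.logProbs = logProbs;
    }
    List<String> tagList() { return tags; }
    double logProbability(int n, int tag) { return logProbs[n][tag]; }
}

class LatticeDemo {
    public static void main(String[] args) {
        double[][] table = { { Math.log(0.75), Math.log(0.25) } };
        FixedLattice lattice = new FixedLattice(Arrays.asList("N", "V"), table);
        System.out.println(lattice.probability(0, "N"));
    }
}
```

Keeping only the integer-indexed log method abstract means each implementation (HMM or CRF) supplies one method and inherits the conveniences.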
Forward-Backward Lattice
There's also the implementation issue of whether to combine the currently separate forward-backward (or generalized sum-product) implementations of the tag lattices in HMMs and CRFs, and then whether to expose the implementation methods, which are necessary for extracting n-best phrases in chunkers. The forward/backward phrase extraction requires forward, backward, and transition scores, and an overall normalizer (conventionally written as Z) to normalize scores to conditional probabilities.
tag.ForwardBackwardLattice extends tag.TagLattice
    double logForward(int n, int tagTo);
    double logBackward(int n, int tagTo);
    double logTransition(int n, int tagFrom, int tagTo);
    double logZ();
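Here is a toy sketch of how those four quantities combine into per-tag and per-phrase conditional probabilities. The conventions are my assumptions, not necessarily the ones in the HMM or CRF code: natural logs, position-independent transition potentials, a forward score that includes the node potential at its own position, a backward score that excludes it, and a logTransition(n, from, to) that covers the move into position n together with the node potential at n. The class name ToyLattice is hypothetical.

```java
// Toy linear-chain lattice in log space (hypothetical sketch; conventions are assumptions).
class ToyLattice {
    private final double[][] node;  // node[n][k]: log potential of tag k at position n
    private final double[][] trans; // trans[j][k]: log potential of moving from tag j to tag k
    private final double[][] fwd;
    private final double[][] bwd;
    private final double logZ;

    ToyLattice(double[][] node, double[][] trans) {
        this.node = node;
        this.trans = trans;
        int len = node.length;
        int numTags = trans.length;
        fwd = new double[len][numTags];
        bwd = new double[len][numTags]; // last row stays 0.0, that is, log 1
        for (int k = 0; k < numTags; ++k) {
            fwd[0][k] = node[0][k];
        }
        for (int n = 1; n < len; ++n) {
            for (int k = 0; k < numTags; ++k) {
                double[] terms = new double[numTags];
                for (int j = 0; j < numTags; ++j) {
                    terms[j] = fwd[n - 1][j] + trans[j][k];
                }
                fwd[n][k] = node[n][k] + logSumExp(terms);
            }
        }
        for (int n = len - 2; n >= 0; --n) {
            for (int k = 0; k < numTags; ++k) {
                double[] terms = new double[numTags];
                for (int j = 0; j < numTags; ++j) {
                    terms[j] = trans[k][j] + node[n + 1][j] + bwd[n + 1][j];
                }
                bwd[n][k] = logSumExp(terms);
            }
        }
        logZ = logSumExp(fwd[len - 1]);
    }

    // Numerically stable log(sum(exp(xs))).
    static double logSumExp(double[] xs) {
        double max = Double.NEGATIVE_INFINITY;
        for (double x : xs) max = Math.max(max, x);
        if (max == Double.NEGATIVE_INFINITY) return max;
        double sum = 0.0;
        for (double x : xs) sum += Math.exp(x - max);
        return max + Math.log(sum);
    }

    double logForward(int n, int tagTo) { return fwd[n][tagTo]; }
    double logBackward(int n, int tagTo) { return bwd[n][tagTo]; }
    double logTransition(int n, int tagFrom, int tagTo) {
        return trans[tagFrom][tagTo] + node[n][tagTo];
    }
    double logZ() { return logZ; }

    // Marginal log probability of a single tag at position n.
    double logProbability(int n, int tag) {
        return fwd[n][tag] + bwd[n][tag] - logZ;
    }

    // Marginal log probability of the tag sequence starting at position n.
    double logProbability(int n, int[] tags) {
        int last = n + tags.length - 1;
        double logProb = fwd[n][tags[0]] + bwd[last][tags[tags.length - 1]] - logZ;
        for (int i = 1; i < tags.length; ++i) {
            logProb += logTransition(n + i, tags[i - 1], tags[i]);
        }
        return logProb;
    }
}
```

Under these conventions the phrase probability is the forward score into the first tag, the transition scores along the phrase, and the backward score out of the last tag, all divided by Z. Summing the two-tag phrase probabilities over the second tag recovers the single-tag marginal, which is a handy sanity check for any real implementation.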
Should I just require a forward-backward lattice in the interface? I never feel it's worth it to add really generic interfaces, such as:
Tagger<E, T extends TagLattice<E>>
    T tagMarginal(List<E> tokens);
That'd let the classes that need the forward-backward version specify T.
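For what the doubly generic version buys, consider this sketch. The interfaces are trimmed-down local stand-ins, and UniformTagger is a made-up tagger that assigns every tag the same probability, used only so there is something to run.

```java
import java.util.Arrays;
import java.util.List;

// Trimmed-down local stand-ins for the interfaces discussed above.
interface TagLattice<E> {
    double logProbability(int n, int tag);
}

interface ForwardBackwardLattice<E> extends TagLattice<E> {
    double logZ();
}

interface MarginalTagger<E, T extends TagLattice<E>> {
    T tagMarginal(List<E> tokens);
}

// Hypothetical tagger with all potentials equal, so every tag is equally likely.
class UniformTagger implements MarginalTagger<String, ForwardBackwardLattice<String>> {
    private final int numTags;

    UniformTagger(int numTags) {
        this.numTags = numTags;
    }

    public ForwardBackwardLattice<String> tagMarginal(final List<String> tokens) {
        final double logNumTags = Math.log(numTags);
        return new ForwardBackwardLattice<String>() {
            public double logProbability(int n, int tag) {
                return -logNumTags;
            }
            public double logZ() {
                // With unit potentials, Z counts the numTags^length tag sequences.
                return tokens.size() * logNumTags;
            }
        };
    }
}

class GenericTaggerDemo {
    public static void main(String[] args) {
        UniformTagger tagger = new UniformTagger(4);
        // No cast is needed to reach the forward-backward methods.
        ForwardBackwardLattice<String> lattice = tagger.tagMarginal(Arrays.asList("John", "ran"));
        System.out.println(lattice.logZ());
    }
}
```

A chunker that needs forward-backward scores can then require a MarginalTagger<E, ForwardBackwardLattice<E>> and avoid a downcast. The price is the extra type parameter on every declaration that mentions the tagger.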
Taggings and Handlers
The next big issue is whether to change the way taggings are supplied to taggers. The current method uses a specific handler:
interface corpus.TagHandler
    void handle(String[] toks, String[] whitespaces, String[] tags);
This allows whitespaces to be provided, but we never use them at the tagger level. This brings up the question of whether to reify taggings for training:
Should I replace TagHandler with ObjectHandler<Tagging>?
A "yes" answer would allow me to deprecate TagHandler going forward and make tagging more parallel to chunking. But it's a lot of work (not that it should worry you), will require lots of changes to existing code, and it messes with longer-term backward compatibility.
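One way to soften the backward-compatibility hit would be an adapter that lets existing parsers keep calling the old handler while new code consumes taggings. The sketch below uses local stand-ins for all three types, and TagHandlerAdapter is a hypothetical name. It is an idea for a migration path, not something that exists in the code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Local stand-ins for the interfaces under discussion.
interface TagHandler {
    void handle(String[] toks, String[] whitespaces, String[] tags);
}

interface ObjectHandler<E> {
    void handle(E e);
}

class Tagging<E> {
    private final List<E> tokens;
    private final List<String> tags;
    Tagging(List<E> tokens, List<String> tags) {
        this.tokens = Collections.unmodifiableList(new ArrayList<E>(tokens));
        this.tags = Collections.unmodifiableList(new ArrayList<String>(tags));
    }
    List<E> tokens() { return tokens; }
    List<String> tags() { return tags; }
}

// Hypothetical adapter: accepts old-style callbacks, forwards reified taggings.
class TagHandlerAdapter implements TagHandler {
    private final ObjectHandler<Tagging<String>> handler;

    TagHandlerAdapter(ObjectHandler<Tagging<String>> handler) {
        this.handler = handler;
    }

    public void handle(String[] toks, String[] whitespaces, String[] tags) {
        // Whitespaces are dropped, since they are never used at the tagger level.
        handler.handle(new Tagging<String>(Arrays.asList(toks), Arrays.asList(tags)));
    }
}

class AdapterDemo {
    public static void main(String[] args) {
        ObjectHandler<Tagging<String>> printer = new ObjectHandler<Tagging<String>>() {
            public void handle(Tagging<String> tagging) {
                System.out.println(tagging.tokens() + " " + tagging.tags());
            }
        };
        TagHandler oldStyle = new TagHandlerAdapter(printer);
        oldStyle.handle(new String[] { "John", "ran" },
                        new String[] { "", " ", "" },
                        new String[] { "NP", "VBD" });
    }
}
```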
Evaluator Overloading
Right now tagger evaluators only work for HMMs. I need to generalize. The classifier evaluator inspects the classifier being evaluated with reflection to see what it can do, and evaluates all of the parts that can be evaluated. Which brings up:
Should I create three distinct evaluators, one for first-best, one for n-best, and one for marginal taggers (or should I just overload a single one)?
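The single-evaluator option could inspect the tagger at runtime in the same spirit as the classifier evaluator. Here is a rough sketch that uses instanceof checks as the simplest form of that inspection. The interfaces are simplified stand-ins (plain tag lists and a probability table instead of Tagging and TagLattice), and TaggerEvaluator and ConstantTagger are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified stand-ins for the three tagger interfaces.
interface Tagger<E> {
    List<String> tag(List<E> tokens);
}

interface NBestTagger<E> {
    Iterator<List<String>> tagNBest(List<E> tokens, int maxResults);
}

interface MarginalTagger<E> {
    double[][] tagMarginal(List<E> tokens);
}

// Hypothetical single evaluator that checks what the tagger can do.
class TaggerEvaluator {
    private final Object tagger;

    TaggerEvaluator(Object tagger) {
        this.tagger = tagger;
    }

    // Names of the evaluations this tagger supports, in a fixed order.
    List<String> supportedEvaluations() {
        List<String> result = new ArrayList<String>();
        if (tagger instanceof Tagger<?>) result.add("first-best");
        if (tagger instanceof NBestTagger<?>) result.add("n-best");
        if (tagger instanceof MarginalTagger<?>) result.add("marginal");
        return result;
    }
}

// Toy tagger implementing two of the three interfaces.
class ConstantTagger implements Tagger<String>, NBestTagger<String> {
    public List<String> tag(List<String> tokens) {
        List<String> tags = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); ++i) {
            tags.add("N");
        }
        return tags;
    }

    public Iterator<List<String>> tagNBest(List<String> tokens, int maxResults) {
        List<List<String>> results = new ArrayList<List<String>>();
        if (maxResults > 0) {
            results.add(tag(tokens));
        }
        return results.iterator();
    }
}

class EvaluatorDemo {
    public static void main(String[] args) {
        TaggerEvaluator evaluator = new TaggerEvaluator(new ConstantTagger());
        System.out.println(evaluator.supportedEvaluations());
    }
}
```

The trade-off is that a single evaluator accepts any object and only discovers at runtime what it can evaluate, whereas three distinct evaluators would get compile-time checking of the tagger type.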
Speak Now, or Forever ...
My current thinking is to answer "yes" to all the questions. So now's a good time to chime in if you don't like those answers!
October 7, 2009 at 2:10 am
Question 1:
I would say: go with the list. The Tagging interface doesn’t seem to provide anything interesting, unless you plan to add something to it in the future. One thing that would make life easier in some cases is a way to access pairs easily, i.e. if the Tagging interface had a method that returned List<Pair>, but this also seems stupid.
Question 2:
I would prefer ScoredObject. Again, unless you plan to add something to ScoredTagging in the future.
I’m not actually using LingPipe, so I cannot really answer Questions 3 and 4; also, don’t take my recommendations too seriously.
October 7, 2009 at 12:52 pm
Taggings add two classes, Tagging and ScoredTagging, along with their associated object allocations and space and the pressure that puts on garbage collection (relatively speaking, not much of an issue here). I originally went with arrays of strings, which are even more economical than lists of strings.
If I add Tagging, I can eliminate the interface corpus.TagHandler in favor of ObjectHandler<Tagging>. One longer-term benefit of this is that it’s easy to write generic cross-validating corpora (which is one case where you need the pairs of inputs and outputs to be iterated). A downside is that you can only implement ObjectHandler for one generic type. How would I handle taggings and chunkings?
If I kept string arrays as outputs instead of Tagging, it’d certainly mean less fiddling with the existing HMM classes. But there’s no way to generify the handlers because TagHandler is really tied to char sequence inputs.