Now that we’ve been gearing up some applications involving logistic regression and feature-based clustering, I’ve been getting lots of requests for how to do things in LingPipe. The answers all point to two general design patterns:
- Wikipedia: Adapter Pattern (aka Wrapper)
- Wikipedia: Decorator Pattern
Really, I’m not an architecture astronaut. No more pattern talk, I promise, just case studies.
Case Study 1: Single Evaluation Spanning Cross-Validations
Now that we have a cross-validating corpus built in for classifiers, it’s getting heavy use. The typical use is to evaluate each fold (perhaps collecting cross-fold stats for summaries):
XValidatingClassifierCorpus corpus = ...;
for (int fold = 0; fold < numFolds; ++fold) {
    corpus.setFold(fold);
    Classifier classifier = trainClassifier(corpus,fold);
    // a fresh evaluator per fold, so stats are per-fold only
    ClassifierEvaluator eval = new ClassifierEvaluator(classifier);
    corpus.visitTest(eval);
}
You can print out stats in the loop, or collect up stats for the whole corpus. But what if we want to combine the evaluations? The trick is to write a simple mutable classifier filter (more patterns):
class Filter implements Classifier {
    Classifier mC;
    public Classification classify(Object x) {
        return mC.classify(x);
    }
}
Now, we pass the filter into the evaluator and set the classifier in it before each round:
XValidatingClassifierCorpus corpus = ...;
Filter filter = new Filter();
ClassifierEvaluator eval = new ClassifierEvaluator(filter);
for (int fold = 0; fold < numFolds; ++fold) {
    corpus.setFold(fold);
    Classifier classifier = trainClassifier(corpus,fold);
    filter.mC = classifier;  // swap in this fold's classifier
    corpus.visitTest(eval);
}
That’s it. After the loop’s done, the single evaluator holds the combined eval under cross-validation. We got around the immutability of the single classifier held by the evaluator by writing a simple filter that has a mutable object.
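To see the trick in isolation, here's a minimal self-contained sketch. The SimpleClassifier and SimpleEvaluator types are simplified stand-ins I made up for illustration, not LingPipe's actual interfaces, and the per-fold "classifiers" are hard-coded lambdas rather than trained models.

```java
/** Hypothetical stand-in for a classifier interface; not LingPipe's. */
interface SimpleClassifier {
    String classify(String input);
}

/** Mutable filter: delegates to whichever classifier is currently set. */
class MutableFilter implements SimpleClassifier {
    SimpleClassifier delegate;
    public String classify(String input) {
        return delegate.classify(input);
    }
}

/** Hypothetical evaluator that holds one fixed classifier and counts hits. */
class SimpleEvaluator {
    private final SimpleClassifier classifier;
    int numCases = 0;
    int numCorrect = 0;

    SimpleEvaluator(SimpleClassifier classifier) {
        this.classifier = classifier;
    }

    void handle(String input, String expected) {
        ++numCases;
        if (expected.equals(classifier.classify(input)))
            ++numCorrect;
    }
}

class CombinedEvalDemo {
    static SimpleEvaluator run() {
        MutableFilter filter = new MutableFilter();
        SimpleEvaluator eval = new SimpleEvaluator(filter);

        // "fold 0": a pretend model that always answers "pos"
        filter.delegate = in -> "pos";
        eval.handle("great", "pos");
        eval.handle("awful", "neg");

        // "fold 1": a pretend model that always answers "neg"
        filter.delegate = in -> "neg";
        eval.handle("bad", "neg");

        return eval;  // one evaluator, counts accumulated over both folds
    }

    public static void main(String[] args) {
        SimpleEvaluator eval = run();
        System.out.println(eval.numCorrect + " of " + eval.numCases + " correct");
    }
}
```

The evaluator never learns that the classifier behind the filter changed between folds; it just keeps counting.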
Case Study 2: Nullary Tokenizer Factory Constructors
This case just came up on our mailing list. I made a regrettable design decision in writing only the fully qualified class name of a tokenizer factory during serialization, so that it gets reconstituted using reflection over the nullary (no-arg) constructor. But what if you want a regular-expression based tokenizer factory, which has no nullary constructor? Simple: write an adapter.
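// assumes the base class is LingPipe's RegExTokenizerFactory, which takes the regex as a String
class MyRegexTokenizerFactory extends RegExTokenizerFactory {
    static final String MY_REGEX = ...;
    public MyRegexTokenizerFactory() { super(MY_REGEX); }
}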
That’s it. Same behavior only now we have the necessary nullary constructor for my brain-damaged serializer.
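Here's a rough standalone sketch of why the adapter helps. PatternFactory and WordPatternFactory are hypothetical stand-ins (not LingPipe classes), and reconstitute() only imitates what a name-only serializer has to do on the way back in; it uses the standard Class.forName() and getDeclaredConstructor() reflection calls.

```java
/** Hypothetical factory with only a one-argument constructor. */
class PatternFactory {
    private final String regex;
    PatternFactory(String regex) { this.regex = regex; }
    String regex() { return regex; }
}

/** Adapter: pins the argument so that a nullary constructor exists. */
class WordPatternFactory extends PatternFactory {
    static final String WORD_REGEX = "\\w+";
    public WordPatternFactory() { super(WORD_REGEX); }
}

class ReflectionDemo {
    /** Rebuilds an object from nothing but its class name. */
    static Object reconstitute(String className) {
        try {
            return Class.forName(className)
                        .getDeclaredConstructor()  // looks up the no-arg constructor
                        .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot reconstitute " + className, e);
        }
    }

    public static void main(String[] args) {
        // Works: the adapter has a nullary constructor.
        PatternFactory f
            = (PatternFactory) reconstitute(WordPatternFactory.class.getName());
        System.out.println("regex = " + f.regex());

        // Fails: the base class has no nullary constructor.
        try {
            reconstitute(PatternFactory.class.getName());
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```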
Case Study 3: Decorators
Let’s say you have a corpus and it’s being passed into some program that takes a long time to chew on it but doesn’t give any feedback. We can instrument the corpus with a decorator to give us a little feedback:
final Corpus corpus = ...;
Corpus decoratedCorpus = new Corpus() {
    public void visitTest(Handler h) {
        System.out.println("visiting test");
        corpus.visitTest(h);
    }
    ...
};
Yes, that’s it. Well, actually we need to fill in the ellipses with the same thing for visitTrain().
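For a version you can actually run, here's a sketch with both methods filled in. ItemHandler and ItemCorpus are toy interfaces I invented for the example (they are not LingPipe's Handler and Corpus), and the decorator keeps its messages in a list as well as printing them so the behavior is easy to check.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins for a handler and a corpus; not LingPipe's. */
interface ItemHandler {
    void handle(String item);
}

interface ItemCorpus {
    void visitTrain(ItemHandler h);
    void visitTest(ItemHandler h);
}

/** Decorator: reports each visit, then delegates to the wrapped corpus. */
class LoggingCorpus implements ItemCorpus {
    private final ItemCorpus corpus;
    final List<String> log = new ArrayList<>();

    LoggingCorpus(ItemCorpus corpus) { this.corpus = corpus; }

    private void report(String msg) {
        log.add(msg);
        System.out.println(msg);
    }

    public void visitTrain(ItemHandler h) {
        report("visiting train");
        corpus.visitTrain(h);
    }

    public void visitTest(ItemHandler h) {
        report("visiting test");
        corpus.visitTest(h);
    }
}

class LoggingCorpusDemo {
    /** A tiny fixed corpus with one training item and one test item. */
    static ItemCorpus tinyCorpus() {
        return new ItemCorpus() {
            public void visitTrain(ItemHandler h) { h.handle("train item"); }
            public void visitTest(ItemHandler h) { h.handle("test item"); }
        };
    }

    public static void main(String[] args) {
        LoggingCorpus decorated = new LoggingCorpus(tinyCorpus());
        decorated.visitTrain(item -> System.out.println("  got " + item));
        decorated.visitTest(item -> System.out.println("  got " + item));
    }
}
```

Whatever consumes the corpus sees the same items in the same order; the only difference is the progress messages.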
On the same topic, suppose we have a text corpus and we want to restrict it to only texts of length 30 to 50 (yes, that just came up this week in a client project). We just apply the same trick twice, filtering the corpus by filtering the handler:
final Corpus<TextHandler> corpus = ...;
Corpus<TextHandler> boundedCorpus
    = new Corpus<TextHandler>() {
        public void visitTest(final TextHandler handler) {
            corpus.visitTest(new TextHandler() {
                public void handle(String in) {
                    // pass along only texts of length 30 to 50
                    if (in.length() >= 30 && in.length() <= 50)
                        handler.handle(in);
                }
            });
        }
    };
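If you want to play with the length filter outside of LingPipe, something like the following sketch should do. StringHandler and StringCorpus are made-up single-method interfaces standing in for the real ones, and the bounds are parameters (treated as inclusive here) rather than hard-coded.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins for a text handler and a corpus; not LingPipe's. */
interface StringHandler {
    void handle(String text);
}

interface StringCorpus {
    void visitTest(StringHandler handler);
}

class LengthFilterDemo {
    /** Wraps a corpus so handlers only see texts with minLen <= length <= maxLen. */
    static StringCorpus bounded(final StringCorpus corpus,
                                final int minLen, final int maxLen) {
        return new StringCorpus() {
            public void visitTest(final StringHandler handler) {
                // filter the corpus by filtering the handler
                corpus.visitTest(new StringHandler() {
                    public void handle(String text) {
                        if (text.length() >= minLen && text.length() <= maxLen)
                            handler.handle(text);
                    }
                });
            }
        };
    }

    /** Collects every text the corpus sends to a handler. */
    static List<String> collect(StringCorpus corpus) {
        List<String> out = new ArrayList<>();
        corpus.visitTest(out::add);
        return out;
    }

    /** A toy corpus with texts of length 2, 4, and 8. */
    static StringCorpus toyCorpus() {
        return h -> {
            h.handle("ab");
            h.handle("abcd");
            h.handle("abcdefgh");
        };
    }

    public static void main(String[] args) {
        System.out.println(collect(bounded(toyCorpus(), 3, 5)));
    }
}
```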
Basically, these approaches all put something in between the implementation you have and the implementation that’s needed, usually acting as a filter. While it’s not quite Lisp, it’s getting close in terms of parenthesis nesting, and is a handy tool for your Java toolbox.