Optimizing Feature Extraction

A widely quoted optimization tip is that the fastest code is code that isn’t executed. The same goes for memory; if you want to use less, don’t allocate. I’m currently applying this optimization technique to feature extraction. As I’ve discussed here before, feature extraction is the slowest aspect of linear classification, which is otherwise just a sequence of multiply and add operations.

As things stand in LingPipe 3.6, features are extracted from generic objects and passed to classifiers as maps from feature names, represented as strings, to numeric values. This is mediated by the feature extractor interface:

interface util.FeatureExtractor<E>
    Map<String,? extends Number> features(E element);
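
For example, a bag-of-words extractor over strings might look like the following sketch (the class name and tokenization scheme are mine, not LingPipe's):

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: a bag-of-words extractor counting
    // whitespace-separated tokens as features.
    class BagOfWordsExtractor implements FeatureExtractor<CharSequence> {
        public Map<String,? extends Number> features(CharSequence in) {
            Map<String,Integer> counts = new HashMap<String,Integer>();
            for (String token : in.toString().split("\\s+")) {
                if (token.length() == 0) continue;  // skip leading-split artifact
                Integer count = counts.get(token);
                counts.put(token, count == null ? 1 : count + 1);
            }
            return counts;
        }
    }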

Even worse, the map gets converted to an instance of matrix.Vector using a symbol.SymbolTable to store the feature-to-dimension mapping. For binary logistic regression, we then take the inner product of that vector with the parameter vector stored in the classifier; results greater than zero are assigned to category 1.
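
Schematically, classification in 3.6 boils down to something like this (the symbol table and parameter vector are represented here by a plain map and array; the real LingPipe API differs):

    import java.util.Map;

    // Schematic of the 3.6-style pipeline, not the actual LingPipe code:
    // look up each feature's dimension, then accumulate the inner product.
    static double score(Map<String,? extends Number> featureMap,
                        Map<String,Integer> featureToDimension,
                        double[] parameter) {
        double sum = 0.0;
        for (Map.Entry<String,? extends Number> entry : featureMap.entrySet()) {
            Integer dimension = featureToDimension.get(entry.getKey());
            if (dimension != null)
                sum += entry.getValue().doubleValue() * parameter[dimension];
        }
        return sum;   // category 1 if sum > 0.0
    }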

Using callbacks for inversion of control, we can get rid of the map and the vector altogether. At the cost of some noise from hash collisions, we can also get rid of the symbol table.

First, we need a vector entry handler interface:

interface corpus.VectorEntryHandler
    void handle(int dimension, double value);

Implementations of this interface receive callbacks to process vector entries. For binary logistic regression, the handler is simple, incrementing an accumulator by the product of the entry's value and the parameter vector's value for that dimension:

class classify.LinearClassifier.Handler implements VectorEntryHandler
    double mAccum = 0.0;
    void handle(int dimension, double value) {
        // inner product, accumulated one entry at a time
        mAccum += value * mParameter.value(dimension);
    }

A handler is used for each classification, with dimensions and values fed to it through callbacks. Calling the handler twice with the same dimension is equivalent to calling it once with the summed feature values, which is especially convenient in structured classification settings.
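
For example, with the Handler above, two calls on the same dimension accumulate the same score as one call with the summed value:

    Handler handler = new Handler();
    handler.handle(7, 1.0);
    handler.handle(7, 2.0);
    // handler.mAccum is now 3.0 * mParameter.value(7), exactly as if
    // we had made the single call handler.handle(7, 3.0).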

Then, instead of implementing FeatureExtractor, we require a parser:

interface corpus.VectorEntryParser<E>
    void parse(E e, VectorEntryHandler handler);

For instance, the type E could be an XML document represented as a DOM tree, or it might be a sequence of tokens tagged with part-of-speech tags. The parser extracts features and sends them to the handler. In code, the classifier is simple:

class classify.LinearClassifier<E>
    implements Classifier<E,Classification>

    LinearClassifier(VectorEntryParser<E> parser, Vector parameter) ...

    Classification classify(E in) {
        Handler handler = new Handler();   // fresh accumulator per input
        mParser.parse(in, handler);        // entries stream in via callbacks
        double result = handler.mAccum;    // inner product with the parameters
        ...
    }

The handler defined above is then an inner class within LinearClassifier.
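
To make the parser side concrete, here is a sketch of a parser over token sequences; the class name is made up, and a plain map stands in for symbol.SymbolTable:

    import java.util.Map;

    // Illustrative only: parses a token array by mapping each token
    // to its dimension and emitting a count of one per occurrence.
    class TokenParser implements VectorEntryParser<String[]> {
        private final Map<String,Integer> mFeatureToDimension;
        TokenParser(Map<String,Integer> featureToDimension) {
            mFeatureToDimension = featureToDimension;
        }
        public void parse(String[] tokens, VectorEntryHandler handler) {
            for (String token : tokens) {
                Integer dimension = mFeatureToDimension.get(token);
                if (dimension != null)
                    handler.handle(dimension, 1.0);  // repeats sum, as noted above
            }
        }
    }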

With this arrangement, there is no need to create the feature strings themselves if the dimensions can be computed directly; we can simply use hash codes as dimensions. With n-grams, we can even use the Karp-Rabin algorithm to compute the hash codes incrementally.
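
Here is a sketch of the rolling-hash idea for character n-grams; the base and number of dimensions are arbitrary illustrative choices, and production code would want a more carefully chosen hash:

    // Sketch: emit a dimension for every character n-gram of the text
    // using a Karp-Rabin rolling hash, with no feature strings created.
    static void parseNGrams(String text, int n, VectorEntryHandler handler) {
        if (text.length() < n) return;
        final long BASE = 257L;
        final long NUM_DIMS = 1L << 20;          // hash into ~1M dimensions
        long highPower = 1L;                     // BASE^(n-1) mod NUM_DIMS
        for (int i = 1; i < n; ++i)
            highPower = (highPower * BASE) % NUM_DIMS;
        long hash = 0L;
        for (int i = 0; i < n; ++i)              // hash the first n-gram
            hash = (hash * BASE + text.charAt(i)) % NUM_DIMS;
        handler.handle((int) hash, 1.0);
        for (int i = n; i < text.length(); ++i) {
            long dropped = (text.charAt(i - n) * highPower) % NUM_DIMS;
            hash = ((hash - dropped + NUM_DIMS) * BASE + text.charAt(i)) % NUM_DIMS;
            handler.handle((int) hash, 1.0);     // constant work per position
        }
    }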

For convenience, I’ve also implemented adapters based on feature parsers, which are analogous to vector entry parsers, only with string-based features in place of vector dimensions.
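
A minimal sketch of what such an adapter might look like; the interface and class names here are guesses at the pattern, not LingPipe's actual classes:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical names throughout: string-based feature callbacks
    // adapted to dimension-based vector entry callbacks.
    interface FeatureEntryHandler {
        void handle(String feature, double value);
    }

    class FeatureToVectorAdapter implements FeatureEntryHandler {
        private final Map<String,Integer> mSymbolTable
            = new HashMap<String,Integer>();
        private final VectorEntryHandler mHandler;
        FeatureToVectorAdapter(VectorEntryHandler handler) {
            mHandler = handler;
        }
        public void handle(String feature, double value) {
            Integer dimension = mSymbolTable.get(feature);
            if (dimension == null) {             // intern a new feature
                dimension = mSymbolTable.size();
                mSymbolTable.put(feature, dimension);
            }
            mHandler.handle(dimension, value);   // forward by dimension
        }
    }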

4 Responses to “Optimizing Feature Extraction”

  1. Jon Says:

    We had to do a similar optimization when we built our classification server. In C++ each STL map entry has an overhead of 28 bytes if I recall correctly, and strings in best case 4 bytes, worst case (strings longer than 16) 20 bytes. Using maps and strings for features was not possible for large frequency distributions – using an intermediate map for training which is “compacted” into raw memory (where we can do binary search) every now and then lowered memory consumption a lot. Do you know how much overhead maps have in Java?

    Nice blog, keep up the good work! /Jon

  2. lingpipe Says:

    In Java, Map is an interface, so we have to talk about specific implementations, like TreeMap and HashMap.

    Objects are the real killer, though. Every object stores a reference (4 byte integer) as well as a pointer to the class itself (4 byte integer, I think), so you’re in for 8 bytes/object minimum. Strings store an underlying string (array of characters, meaning an object with associated pointer), as well as integer start/length, so you’re in for at least 24 bytes with an empty string.

Then if you look at HashMap, it’s structured as an array of Map.Entry implementations. HashMap implements Map.Entry as linked-list nodes, with a generic key, generic value, next pointer, and integer hash, so 16 bytes plus object overhead, or about 24 bytes.

    I think this can actually be worse in that some implementations store a 4-byte synchronization key on every object.

And it’s not just size: all the hash lookups are extremely slow compared to the basic multiply/add operations of a linear classifier.

It’s possible to make smaller map implementations, as I do implicitly with tries in the lm package and with the util.SmallObjectToDouble map class in LingPipe. (SmallSet is easier to understand and more general if anyone’s looking for examples.)

  3. Jon Says:

Thanks for the good explanation!

  4. uClassify blog » More memory or smaller memories? Says:

    […] A similar optimization for Java is described on the LingPipe blog. […]
