Custom Java Map for Binary Features

by

Update: 10 February, 2010. You can either jump to David R. MacIver’s comment which shows that my work is not yet done, or try to figure out what’s wrong with the implementation below.

Update II: 10 February, 2010. Of course, I should’ve also added a top-level method to BinaryMap to return the postive set as the key set.

The built-in Java collection implementations tend to be heavy. For instance, hash maps are based on hash sets of map entries, where a map entry is a simple container object with a reference to a key and a value.

For many natural language applications, such as CRF taggers and chunkers, feature maps are often designed to be boolean, with features taking only binary values, 0 or 1. These maps are almost always sparse, only representing the one values. Given the proportion of time (most of it) and memory (pretty much all of it) spent extracting features compared to decoding in CRFs or classification with logistic regression, it seems worthwhile to consider more minimal representations.

Luckily, the Java collections were designed to be easy to extend (thanks Dr. Bloch). In this post, I’ll walk you through LingPipe’s new binary map class, which implements a generic map to numbers based on a pluggable set implementation. Because these maps need to be modifiable so that they can be easily built up in feature extractors, there’s a lot of work to do in the implementation. There are so many inner classes that the whole resembles a set of nesting Russian dolls (pictured above).

Class Declaration

Skipping all the java.util imports, we declare the map to be in our util package and extend Java’s abstract map:

package com.aliasi.util;

public class BinaryMap<E> extends AbstractMap<E,Number> {
...

The return type will be a number, but the keys may be any old type as specified by the generic E.

Member Variable and Constructors

There’s only a single member variable, containing the set of items that map to 1, and it’s specified in the constructor, defaulting to a hash map:

...
    private final Set<E> mPositiveSet;

    public BinaryMap() {
        this(new HashSet<E>());
    }

    public BinaryMap(Set<E> positiveSet) {
        mPositiveSet = positiveSet;
    }
...

Implementing the Entry Set Method

Java’s abstract map class does so much work that only one method must be implemented, entrySet(), which returns a set view of the map entries underlying the map. The abstract map class can then use that to implement everything from gets and puts to size to hash code methods.

Here, we implement it to return a custom entry set (see below):

...
    @Override
    public Set<Map.Entry<E,Number>> entrySet() {
        return new EntrySet();
    }
    ...
}

I’ll fill in implementations of the other map methods later, although this one method would be enough to get the job done in terms of space savings. It’d just be inefficient in terms of speed.

The Entry Set Implementation

Like nesting Russian dolls, this implementation’s going to involve quite a few inner classes. Here’s the entry set implementation itself, which is an inner class of the top level binary map class:

    class EntrySet extends AbstractSet<Map.Entry<E,Number>> {
        @Override
        public int size() {
            return mPositiveSet.size();
        }
        @Override
        public Iterator<Map.Entry<E,Number>> iterator() {
            return new EntrySetIterator();
        }
        @Override
        public void clear() {
            mPositiveSet.clear();
        }
        @Override
        public boolean contains(Object o) {
            if (!(o instanceof Map.Entry<?,?>))
                return false;
            @SuppressWarnings("unchecked") // checked above
            Map.Entry<?,?> entry = (Map.Entry<?,?>) o;
            return ONE.equals(entry.getValue())
                && mPositiveSet.contains(entry.getKey());
        }
        @Override
        public boolean isEmpty() {
            return mPositiveSet.isEmpty();
        }
        @Override
        public boolean remove(Object o) {
            if (!(o instanceof Map.Entry<?,?>))
                return false;
            @SuppressWarnings("unchecked") // checked above
            Map.Entry<?,?> entry = (Map.Entry<?,?>) o;
            return ONE.equals(entry.getValue())
                && mPositiveSet.remove(entry.getKey());
        }
    }

Only the iterator and size method are necessary. The size method just delegates to the containing class’s positive set’s size method. The iterator method returns a custom iterator implementation we describe below, and the size method delegate’s to the containing binary map’s positive set. (That’s why it’s an inner class, not a static nested class.) The remaining methods, the clear, contains, is-emtpy and remove methods are provided for efficiency; I grayed out their code to signal its optionality.

Some Constants

I declared two constants for convenience in the top-level binary map class:

...
    public static final Number ONE = Integer.valueOf(1);

    public static final Object[] EMPTY_OBJECT_ARRAY
        = new Object[0];
...

Implementing Entry Set’s Iterator

The entry set iterator is declared as an inner class of the top-level binary map. It’s declared to implement the appropriate iterator interface. The code consists of the iterator method implementations for the has-next, next, and remove methods.

    class EntrySetIterator 
        implements Iterator<Map.Entry<E,Number>> {

        private final Iterator<E> mIterator 
            = mPositiveSet.iterator();

        public boolean hasNext() {
            return mIterator.hasNext();
        }
        public Map.Entry<E,Number> next() {
            return new PositiveMapEntry();
        }
        public void remove() {
            mIterator.remove();
        }
...

When it is constructed, it uses the containing binary map’s positive set to create an iterator over the positive instances. The has-next and remove methods are simply delegated to the binary map’s positive set iterator. This means that when we’re iterating over the entry set for the top-level maps, remove calls have the intended effect on the top-level map, namely removing one of its positive entries. Of course, if the positive entry set on which the map is based is immutable, the remove operation will throw an unsupported operation exception, as documented in the iterator interface documentation.

The entry itself just returns an instance of another inner class, a positive entry map.

Positive Map Entry Implementation

The positive map entry is itself an inner class declared inside of the entry set iterator (which is itself an inner class of the top-level binary map).

...
        class PositiveMapEntry implements Map.Entry<E,Number> {
            private final E mE = mIterator.next();
            public E getKey() {
                return mE;
            }
            public Number getValue() {
                return ONE;
            }
            public Number setValue(Number value) {
                throw new UnsupportedOperationException();
            }
            public boolean equals(Object that) {
                if (!(that instanceof Map.Entry<?,?>))
                    return false;
                @SuppressWarnings("unchecked") // checked above
                    Map.Entry<?,?> e1 = (Map.Entry<?,?>) that;
                Map.Entry<?,?> e2 = this;
                return
                    (e1.getKey()==null
                     ? e2.getKey()==null
                     : e1.getKey().equals(e2.getKey()))
                    &&
                    (e1.getValue()==null
                     ? e2.getValue()==null
                     : e1.getValue().equals(e2.getValue()));
            }
            public int hashCode() {
                return (getKey()==null ? 0 : getKey().hashCode())
                    ^ (getValue()==null ? 0 : getValue().hashCode());
            }
        }
   }

During construction, it just consumes the next element from the iterator’s contained positive set iterator. That’s then held locally as the value to return for the entry’s key. The value is just the constant number 1. Note that the set method throws an unsupported operation exception; it could’ve been defined to do nothing for a value of 1, remove the positive entry for the value of 0, and throw an illegal argument exception otherwise. But that would almost certainly cause the iterator from which the value was drawn to throw a concurrent modification exception.

The equality and hash code methods are described in Java’s map entry interface; they must be defined this way for compatibility with Java’s collections.

Map Methods for Efficiency

The remaining methods defined in the top level map are for efficiency. Here they are:

...
    @Override
    public Number get(Object key) {
        return mPositiveSet.contains(key)
            ? ONE
            : null;
    }

    @Override
    public Number remove(Object key) {
        return mPositiveSet.remove(key)
            ? ONE
            : null;
    }

    @Override
    public int size() {
        return mPositiveSet.size();
    }

    @Override
    public Collection<Number> values() {
        return new Values();
    }

    @Override
    public void clear() {
        mPositiveSet.clear();
    }

    @Override
    public boolean containsKey(Object o) {
        return mPositiveSet.contains(o);
    }

    @Override
    public boolean containsValue(Object o) {
        return ONE.equals(o) && !isEmpty();
    }

    @Override
    public boolean isEmpty() {
        return mPositiveSet.isEmpty();
    }

    @Override
    public Number put(E e, Number n) {
        if (!ONE.equals(n))
            throw new IllegalArgumentException();
        return  mPositiveSet.add(e) ? null : ONE;
    }
...

All of these implementations are pretty much straightforward delegations to the contained set of objects mapping to 1. The return types of methods such as the put method are defined in the map interface documentation. There are additional methods that could’ve been added for even more efficiency, such as computing hash code and equality as specified in the map interface’s documentation.

Values Collection Implementation

Alas, one more doll to unpack, the value collection implementation.

    class Values extends AbstractCollection<Number> {
        @Override
        public int size() {
            return isEmpty() ? 0 : 1;
        }
        @Override
        public Iterator<Number> iterator() {
            return new ValuesIterator();
        }
        @Override
        public void clear() {
            mPositiveSet.clear();
        }
        @Override
        public boolean contains(Object o) {
            return ONE.equals(o) && !isEmpty();
        }
        @Override
        public boolean isEmpty() {
            return mPositiveSet.isEmpty();
        }
        @Override
        public boolean remove(Object o) {
            if (!ONE.equals(o))
                return false;
            boolean removedSomething = !isEmpty();
            mPositiveSet.clear();
            return removedSomething;
        }
        @Override
        public boolean removeAll(Collection<?> c) {
            if (!c.contains(ONE))
                return false;
            boolean removedSomething = !isEmpty();
            mPositiveSet.clear();
            return removedSomething;
        }
        @Override
        public boolean retainAll(Collection<?> c) {
            if (isEmpty()) return false;
            if (c.contains(ONE)) return false;
            mPositiveSet.clear();
            return true;
        }
        @Override
        public Object[] toArray() {
            return isEmpty() 
                ? EMPTY_OBJECT_ARRAY 
                : new Object[] { ONE };
        }
    }
...

Values Iterator Implementation

The final inner class is the values iterator, which also gets a custom implementation.

...
    class ValuesIterator implements Iterator<Number> {
        boolean finished = mPositiveSet.isEmpty();
        boolean mayRemove = false;
        public boolean hasNext() {
            return !finished;
        }
        public Number next() {
            if (!hasNext())
                throw new NoSuchElementException();
            finished = true;
            mayRemove = true;
            return ONE;
        }
        public void remove() {
            if (!mayRemove)
                throw new IllegalStateException();
            mayRemove = false;
            mPositiveSet.clear();
        }
    }
...

This one’s a little trickier because it doesn’t delegate the remove method, but keeps track of the state. Iterators only allow their remove method to be called once after each next call.

Override Considered Helpful

Extending Java collections is a good illustration of the utility of the override attribute. If you mark a method definition with the attribute @Override and it doesn’t actually override a superclass method, the compiler throws an error.

The reason it’s so crucial for collections is that many of the collection methods are defined at object level, like set’s contain(Object) method. Even though in a type-safe world, that object is only going to be of the set’s type, given Java’s retrofitted type system, this can’t be guaranteed. The types in the collections are as strong as they can be given Java’s type system and the need for backward compatibility with pre-generics Java.

Serialization and Synchronization

I also made the map serializable using the serialization proxy pattern as supported by LingPipe’s util.AbstractExternalizable base class.

For synchronization, the class may be wrapped in the synchronized wrappers from Java’s java.util.Collections utility class.

Whew!

No wonder programming books are so darn long. And this was a relatively simple example that only took me a few hours to write and unit test. Check it out in the next LingPipe release. It’ll be paired with a binary vector implementation along the same lines.

5 Responses to “Custom Java Map for Binary Features”

  1. David R. MacIver Says:

    This would probably be a better idea if java.util.HashSet weren’t internally implemented on top of a HashMap…

    In the meantime, you might be better off using gnu trove (it’s LGPL, so the license is tolerable) or implementing your own open addressing based HashSet (it’s not overly hard) if you don’t want the dependency.

    • lingpipe Says:

      Great catch! It made writing the whole post worthwhile.

      Does anyone know why the @author(s) Bloch and Gafter did it this way? It seems so wrongheaded. The only motivation I see is to have a single implementation of hashing. Their approach also makes having null entries simpler, but come on, that could’ve been handled with a single boolean, though it’d sure complicate all the implementations.

      Somewhere in the back of my mind, I knew that Java’s hash sets were implemented on top of their maps. If I’d profiled my new implementation versus the default one, it would’ve also become obvious.

      We try to stay away from any dependencies other than Java. Paying customers seem to be allergic to GPL-like dependencies (though tolerant of Apache and often excepting MySQL and Linux from their GPL aversion).

      I have written a bunch of other collection extensions, mainly for small sized collections to save space. Next up, a space efficient resizable hash set.

      I’ve been meaning to write primitive-specific collections for a while, but I so rarely use Java collections for primitives in a tight loop that they’ve never been a bottleneck, so I’ve never gotten around to it.

      • David R. MacIver Says:

        I’m not entirely sure why it was done this way, but there’s a whole pile of weird behaviours around the core collection implementations. Another example of weirdness is the way the implementations of LinkedHashMap/HashMap interact (there’s a package private constructor for HashMap which takes a boolean argument. If you pass true to it it behaves like a LinkedHashMap, if you pass false to it it behaves like a normal one).

        I ended up writing my own HashMap implementation for the Scala standard library and was surprised at how easy it was to beat java.util.HashMap in performance.

        The avoidance of dependencies makes sense. I figured that was probably the case, just thought I’d mention it as prior art rather than jump right in and tell you to write your own. :-)

    • lingpipe Says:

      P.S. I added David R. MacIver’s blog t our roll. It has the same kind of programming and NLP focus as this blog.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s