Data Structure for Parse Trees?


Writing parsers used to be my bread and butter, but I’m a bit stumped on how to create a reasonable object-oriented data structure for parse trees. Here’s a simple parse tree in the Penn Treebank‘s notation (borrowed from Wikipedia: Treebank):

(S (NP (NNP John))
   (VP (VPZ loves)
       (NP (NNP Mary)))
   (. .))

Conceptually trees come in two varieties. Lexical trees are made up of a root category and a word. For example, the lexical tree (NNP John) has a root category NNP and a word John. Phrasal trees are made up of a root category and a sequence of subtrees. For example, the phrasal tree (VP (VPZ loves) (NP (NNP Mary))) consists has root category VP and two subtrees, the lexical subtree (VPZ loves) and the phrasal subtree (NP (NNP Mary)). The latter tree, rooted at NP, only has a single subtree, (NNP Mary). On the other hand, the top level tree, rooted at S, has three subtrees, rooted at NP, VP, and . (period).

Java Data Structures?

So my question is, how do I represent parse trees in Java? For simplicity assume categories are strings (rather than, say generic objects or integer symbol table representations) and that words are also strings.

A First Attempt

An obvious first step would be to define an interface for trees,

interface Tree {
    String root();

and then extend it for lexical trees,

interface LexicalTree extends Tree {
    String word();

and phrasal trees,

interface PhrasalTree extends Tree {
    List<Tree> subtrees();

This pretty much matches our informal description of trees directly.

The Problem

So what’s the problem?

Waiter, Taste My Soup

Coming to America Movie Poster

Eddie Murphy‘s memorable schtick from Coming to America provides a hint:

Eddie Murphy – as the old white man in the barber shop says, “Wait a minute. Wait a minute. Wait. Stop right there! Listen. Stop right there, man.

A man goes into a restaurant. You listenin’? A man goes into a restaurant, and he sits down, he’s having a bowl of soup and he says to the waiter, “waiter come taste the soup.”

Waiter says, “Is something wrong with the soup?”

He says, “Taste the soup.”

He [waiter] says, “Is there something wrong with the soup? Is the soup too hot?”

He says, “Will you taste the soup?”

[Waiter says ,] “What’s wrong, is the soup too cold?”

[He says, ] “Will you just taste the soup?”

[Waiter says, ] “Alright, I’ll taste the soup – where’s the spoon? Aha. Aha!”

Not to mention, a widely applicable moral that may be applied to software. Maybe “eating your customer’s soup” should be as well known an idiom as “eating your own dogfood.”

The Problem is Unions

So now that you sussed out the moral of Mr. Murphy’s story, and diligently applied it by trying to use the tree interface, hopefully, like Mr. Murphy’s waiter, you see the problem.

To do anything other than get the root of a tree, you need to use instanceof and then apply an unchecked cast. Double Yucky.

The root of the problem (pun intended) is that the tree data structure is essentially a union data structure (aka a “variant record”). You see these in straight-up C all the time.

An Alternative

What I actually have in place is a C-like approach that doesn’t require any unchecked casts. But it’s giving off a really bad smell. Darn near boufin.

interface Tree {
    String rootCategory();
    boolean isLexical();
    String word();
    List<Tree> subtrees();

The idea is to use the isLexical() test and if it’s true, use word() to get the word and if it’s false, use subtrees() to get the list of subtrees. In C, there’s often a key indicating which type a union data structure contains.

I can’t decide which is better, returning null from word() if it’s not a lexical tree, or throwing an unsupported operation exception (it’s not quite an illegal state, as my tree implementations would almost certainly be immutable).

Any Better Ideas?

I’m very open to suggestions for a better way to represent trees.

20 Responses to “Data Structure for Parse Trees?”

  1. James Says:

    Why can’t there be a ‘words’ method shared for all trees which just joins and toStrings their lexical bits?

    And purely lexical nodes would simply return no subtrees…

    Well, perhaps it’s a little neater; at least you avoid the null/unsupported op mess.

    • lingpipe Says:

      I could certainly add something that returns a list of words. I won’t have the whitespace in this simple interface, though you could imagine having that available somewhere.

      Purely lexical trees and empty productions would return no subtrees.

  2. msb Says:

    James’ response is close to what I was going to say and have used before: isLexical() returns (subtrees().isEmpty()) and instead of “word()” have “text()” that either returns the lone word for a leaf node or a concat of all words in the phrase. It would be up to the user to not call “text()” on all nodes if only needed for leaves.

    • lingpipe Says:

      The word wold be a singleton for either a lexical tree or for a phrasal tree with a lexical tree as its only daughter. As I said in the last reply, the subtrees list would be empty for both lexical trees and for nullary rules (empty productions). So if I went this route, isLexical() would return something like words().length == 1 && subtrees().isEmpty().

  3. Yoav Says:

    a/ you can have String getData() which will return the phrase/pos/word, List subtrees() which will return an empty list for leaves. You can also add an int depth() method do distinguish terminals/pre-terminals/non-terminals.

    b/ String headWord() can also be well defined for both lexical and phrasal levels (and is quite useful to have anyhow). If you are willing to have heads, you can have a reasonably neat interface using headWord(), depth(), rootCategory(), and keeping both the word and pos-tag in the same item.

    • lingpipe Says:

      Suggestion (a) just pushes the problem down a level by returning a union object for getData(). Packages like Mallet and Stanford NLP seem to have adopted something like this approach on steroids.

      I’m not sure what you mean by depth. How deep the tree is (usually called “height”) or the depth down from the root for a node or tree in a larger tree? The latter’s problematic in a bottom-up parser. Height’s easy to compute bottom-up and depth’s easy to compute with top-down tree building.

      Depth is enough to tease apart lexical and phrasal rules. A tree with depth 1 would be lexical and anything else phrasal.

      As to (b), I’m actually building something for a client that wants to build semantic grammars and isn’t going to fuss with a notion of syntactic head. If I was building something like a lexicalized syntactic CFG (e.g. Collins’ parser, as I’ve done before), then I’d need the head information.

      Unless I’m missing something, heads don’t interact with the lexical/phrasal tree distinction. For instance, the phrasal tree (NP (NNP Mary)) and lexical tree (NNP Mary) have the same head.

      • Yoav Goldberg Says:

        I meant depth as “height” (how deep is the tree).

        Yes, heads don’t interact with the phrasal/lexical distinction. This is why adopting heads would allow you to define “getWord()” and “getCategory()” for every node in the tree, with reasonable semantics attached. (getCategory will return the phrasal category for phrases, and the pos-tag for lexical items, while getWord will return the head word either way).

      • lingpipe Says:

        Ah, now I see what you’re getting at. I tend to think of methods as doing single duty rather than what you’re proposing for getWord(), which presupposes that lexical entries will have themselves as their head.

        I haven’t thought about heads in a while, but don’t you sometimes want to have the head be something like the lemma, so it’s not quite the same as the lexical phrase?

        There are all sorts of other complications in the back of my mind, like how to deal with phrasal lexical entries — do they have a single daughter with word that just happens to have whitespace, or can a lexical entry be a series of words? I lean toward the latter, which complicates some of the suggestions above (though nothing worse than requiring an is-lexical flag).

      • Yoav Goldberg Says:

        lemma is not a complication, you can have the phrasal element return the lemma as the head, and the lexical element return the actual word form. But it would be nicer to use a specialized “Token” class instead of a plain String, in which case you could store a bunch of other information as well (indexes in original sentence, orthographic form, lemma, “word signatures”, etc.). And by subclassing the Token you can easily add phrasal lexical entries later.

  4. example Says:

    I think you’re over thinking this to a certain extent. I mean, it doesn’t really matter if it’s aesthetically pleasing if it works. There are lots of different ways to do it, and they will all work just about as well. So just pick one and go with it.

    It really depends on what you’re planning on doing with it.

    • lingpipe Says:

      Good question! Initially, I’m planning on using it to provide return results for a handwritten semantic grammar. I’ll probably wind up needing to wrap things in Python, so the Java may really not matter much as the intended user might not even see it. I don’t even know enough to formulate the problem in Python.

      Part of the “aesthetics” is not having the compiler gripe. If you require an unchecked cast, the compiler will spit out warnings whenever you use the class (unless you suppress the warning). It also introduces possible type errors, which I prefer to let the compiler catch because I’m very error prone in my typing (probably because I rely on the compiler).

  5. Mark Says:

    I’ve worried about this myself a lot. There are two separate problems here. (1) How do you distinguish lexical or terminal nodes from phrasal or nonterminal nodes, and (2) How can you design a data structure that will permit you to add arbitrary extra information to tree nodes, but will also permit you to write generic tree algorithms (e.g., tree walkers).

    I think I have something useful to say about problem (2), so I will start with it. In more detail, the problem arises because trees are recursive data structures that contain pointers to child nodes that need to be the same type as themselves. I’m not sure if the Java type system is sophisticated enough to permit this, but in C++ the “Curiously recurring template pattern” lets you solve this very nicely. The generic tree node type contains two pointers — one to its right sibling, and one to its first child. These pointers are pointers to the specialised node type, which is passed in as a template argument. Crucially, the specialised node type can contain additional class variables. In the reranking parser I use this to specify things such as the head, the parent of this node, the left and right string positions of this node, etc.

    Problem (1) is easy to solve if you’re willing to waste a bit indicating whether this is a terminal or nonterminal node. The problem is that wasting a bit is the same as wasting a word (unless you’re willing to diddle around with the bits in the pointers), and words are precious — especially when you want to store as many tree nodes as I do!

    Since I’m generally working with Penn Treebank style trees these days, even “empty nodes” have preterminal and terminal strings. So a terminal is a node with no children (i.e., a NULL child pointer), a preterminal in a node with a terminal child, and everything else is a nonterminal.

    • lingpipe Says:

      Right. I should’ve mentioned that this is the other usual way to define trees recursively: a tree is either a word, or a category and a sequence of subtrees.

      Putting this all in the lost art of Backus-Naur form (BNF), using “x” for record types and “|” for unions:

      Word-as-Tree Approach
      Tree ::= Word | Cat x Tree*

      Pre-Terminal Approach
      Tree ::= Cat x Word | Cat x Tree*

      Some of the earlier discussion would’ve done something like

      Tree ::= Cat x (Word | Tree*)

      The pre-terminal approach is a bit easier in my experience because every tree has a root category that way. And the root’s what participates in further derivations in a CFG.

      I use exactly the kind of representation you’re suggesting (link to daughter plus link to next sib) in other places in LingPipe. Often, I can keep the daughters contiguous in a compiled structure (as in language model tries).

      Alas, given Java’s generics through erasure approach, you can’t base anything at run time on the generic type.

      Don’t even get me started on space. Java objects are hugely wasteful compared to C structs. I’m already in for at least 8 bytes per object, not counting its data. At that point, I can just use different classes, so no need for an extra flag bit. (Instead, it’s implicitly a pointer to the class in the object implementation.)

      This is the single thing I miss most about C — simple structs. I wind up writing lots of parallel arrays to get around the problem.

  6. Martin Says:

    I think your first attempt would work in connection with the Visitor pattern. The specific “code smell” of instanceof tests plus down-casts is often a good indicator that the Visitor pattern may be applicable. Specifically:

    abstract class TreeNode {
    abstract void accept(TreeNodeVisitor v);

    abstract class LexicalTreeNode extends TreeNode {
    final void accept(TreeNodeVisitor v) { v.visit(this); }
    abstract String word();

    abstract class PhrasalTreeNode extends TreeNode {
    final void accept(TreeNodeVisitor v) { v.visit(this); }
    abstract List subtrees();

    interface TreeNodeVisitor {
    void visit(LexicalTreeNode n);
    void visit(PhrasalTreeNode n);

    class MyTreeNodeVisitor implements TreeNodeVisitor {
    void visit(LexicalTreeNode n) {
    // do stuff with n.word()
    void visit(PhrasalTreeNode n) {
    // do stuff with n.subtrees()

    Somewhere else:
    MyTreeNodeVisitor v = new MyTreeNodeVisitor();
    // initialize v
    TreeNode n = parser.getParseTree(); // …

    This requires that you code all traversals of trees with specific implementations of the Visitor interface. Which is not too bad, since you can use anonymous instantiations for one-off traversals, in which case the line count of your code isn’t much different from the case where you had done it with instanceof tests and casts.

    This design has drawbacks if you think you’ll frequently add immediate subtypes of TreeNode, but in your situation I’m assuming this won’t be the case. You may want to add different concrete implementations of LexicalTreeNode, but for purposes of the Visitor, you probably don’t care to distinguish them.

    • lingpipe Says:

      At least that’ll encapsulate all the branching logic behind the client.

      I’m thinking a visitor for the tree node’s going to be a bit rough on the clients.

      You’re right in that I don’t anticipate adding more kinds of nodes conceptually, but perhaps adding different implementations for efficiency. Not that it does much good once you’ve already wasted a dozen bytes for an Object instance.

  7. Saira Says:

    can anyone tell me the API for making parse trees in java using CFGs???

    • lingpipe Says:

      There isn’t an official API. You might want to check out some of the existing ones, which are pretty complex.

      OpenNLP contains a parser package; the class Parse represents the tree-like info.

      Mallet also contains a Tree class, but it’s based on general graphs with constraints.

      JavaNLP also contains a Tree class.

  8. Rod Says:

    The Object-Oriented Way™ is to replace
    boolean isLexical();
    with two child classes, sent a common message which causes the variant functions to be executed — in C++ this would be:

    class CTree{
    String getWord();
    class CLexical:public CTree{
    String getWord();
    class CNonLexical:public CTree{
    String getWord();
    String CLexical::getWord(){return this.word();}
    String CNonLexical::getWord(){
    String result=new String();
    return result;

    • Bob Carpenter Says:

      To make the above OO in C++, you need to declare getWord() as virtual in CTree.

      But this doesn’t solve the problem of being able to inspect which subclass you’re an instance of from the outside. As you imply, it is considered good OO practice to try to inspect the subclasses of the class or interface you’re using (see, e.g., two of my favorite books on Java coding, Martin Fowler’s Refactoring and Josh Bloch’s Effective Java). One problem is that it makes the code fragile — if you add a new subclass, you have to go change client code.

      But let’s for the sake of argument say that you want to. It certainly makes it more flexible for users of the subclasses when you haven’t anticipated everything a client might want to do with it. And in the case of parse trees, we know there are only two types we care about.

      Using instanceof in Java is expensive — even more expensive than function resolution. More expensive than calling a non-final method on a superclass. The result is the isLexical() method.

    • Bob Carpenter Says:

      The more modern C++ style would be to template out the “concept” of CTree. That cuts down on the overhead of virtual functions.

      Then, you can use the curiously recurring template pattern to get some of the base class functionality you’re used to, but without the cost of virtual function resolution at runtime.

      The Eigen matrix and linear algebra package do this very nicely for matrix operations.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s