Tokenization vs. Eager Regular Expressions

by

We now have regular-expression based chunkers and tokenizers in LingPipe. They work by compiling a regex using java.util.regex.Pattern and then running a java.util.regex.Matcher‘s find() method over the inputs and pulling out the matches. Recently, I (Bob) have been tuning part-of-speech taggers for French trained on the French Treebank.

I thought writing a regex-based tokenizer would make sense to match the way the treebank itself tokenized as best as possible. Unfortunately, it’s impossible to write a pure tokenizer to match because, like the Penn Treebank and Penn BioIE projects, the annotators decided to use syntactic and semantic contextual information in deciding when to group a sequence of characters into a token. For instance, hyphens after prefixes (a lexical syntactic issue) are coded one way and hyphens before suffixes another; in the Penn Treebank, periods at the end of sentences (a contextual decision) are coded as separate tokens whereas those appearing sentence internally are coded as part of the token they follow.

It turns out that regular expressions don’t work the way I thought they would. I wanted to write a bunch of regexes and then or (|) them together to produce a larger regular expression that would greedily match as much as it could in each chunk.

Let’s consider a simple example:

(a|b)+|(a|c)+

And consider running a find against the string "aacac". What do we get? Let’s ask Java.

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class Regex {

    public static void main(String[] args) {
        test("(a|b)+|(a|c)+", "aacac");
    }

    static void test(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(input);
        matcher.find();
        System.out.println("regex=" + regex 
                           + " input=" + input 
                           + " found=" +  matcher.group());
    }

}

This just sets up the pattern based on the regex, generates a matcher from the input and then runs find on the matcher and prints out the first thing found. What’ll it do?

c:\carp\temp>javac Regex.java

c:\carp\temp>java -cp . Regex
regex=(a|b)+|(a|c)+ input=aacac found=aa

Ouch. I was expecting the whole thing to match. Apparently, that’s what a POSIX regex would do. But Java follows the Perl model, in which eagerness overcomes greediness. Specifically, if the first disjunct of an alternation matches, the matcher does not try to find a longer match in the second disjunct. Unfortunately, there are no greediness/reluctance modifiers for the disjunct.

So what do we do? Refactor the regex, of course. How about this one?

a*(b(a|b)*|c(a|c)*)?

This should do the trick of matching the longest possible sequence of alternating as and bs or alternating as and cs. Sure enough, adding the following to the main() method:

    test("a*(b(a|b)*|c(a|c)*)?", "aacac");

produces the expected output:

regex=a*(b(a|b)*|c(a|c)*)? input=aacac found=aacac

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s