But does it do Greek? Unicode, Java Chars and the BMP

Yes, Extended Greek will work just fine in LingPipe. LingPipe processes at the char level, not at the byte level.

Extended Greek involves characters in the 0x1F00 to 0x1FFF range. That's in the Basic Multilingual Plane (BMP) of Unicode.

The Ancient Greek numbers at 10140-1018F, which fall outside the BMP, could be problematic if you need them, depending on what you're doing. Even these should work OK in most of LingPipe.

The significance of being in the BMP is that the UTF-16 encoding, and hence Java's char encoding, is transparent. Specifically, for a code point in the BMP, the UTF-16 encoding is just the two-byte sequence making up the unsigned integer representation of the code point. For any such code point, Java's char representation is just that unsigned integer.

The significance of all this for LingPipe is that we treat char values in java.lang.String and java.lang.CharSequence as characters. Thus if we have a 5-gram language model, we treat that as 5 chars.
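For example, here's a plain-Java sketch (not LingPipe code; the class name is just for illustration) of pulling character 2-grams out of an Extended Greek word. Because every character is in the BMP, each char is a whole character, so char n-grams and character n-grams coincide:

public class CharNGrams {
    public static void main(String[] args) {
        // "ἐλπίς" (hope): five Greek characters, all in the BMP,
        // so the string is exactly five chars long.
        String word = "\u1F10\u03BB\u03C0\u03AF\u03C2";
        System.out.println("chars: " + word.length());   // 5

        // Slide a window over chars to get character 2-grams;
        // for BMP-only text this never splits a character.
        int n = 2;
        for (int i = 0; i + n <= word.length(); ++i)
            System.out.println(word.substring(i, i + n));
    }
}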

The problem for code points beyond the BMP is that they take two chars (a surrogate pair) to represent in Java. For these code points, Java's String.length() is not the length of the string in terms of the number of code points, but in terms of the number of chars. Thus it actually overstates the length.
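Here's a quick plain-Java check of that, using a code point from the Ancient Greek numbers block mentioned above (the class name is just for illustration):

public class BeyondBmp {
    public static void main(String[] args) {
        // U+10144 is beyond the BMP, so Java represents it
        // as a surrogate pair of two chars.
        String s = new String(Character.toChars(0x10144));

        System.out.println(s.length());                       // 2 chars
        System.out.println(s.codePointCount(0, s.length()));  // 1 code point
        System.out.println(Character.isHighSurrogate(s.charAt(0)));  // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));   // true
    }
}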

Non-BMP code points can be problematic depending on the tokenizers used, which we define at the char level, not at the code point level. So you could write a nefarious tokenizer that splits the surrogate pairs of chars representing non-BMP code points across different tokens. Our character n-gram tokenizers are nefarious in this way, but even they won't necessarily be problematic in their intended applications (like classification).
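To see the effect concretely, here's a plain-Java sketch of char 2-grams over a string containing a non-BMP code point; it's not our tokenizer implementation, just an illustration of how a char-level window splits a surrogate pair:

public class NefariousNGrams {
    public static void main(String[] args) {
        // "ABC" followed by U+10144: five chars but only four code points.
        String text = "ABC" + new String(Character.toChars(0x10144));

        // Char 2-grams: the window starting at 'C' pairs it with an
        // isolated high surrogate, splitting the surrogate pair.
        for (int i = 0; i + 2 <= text.length(); ++i) {
            String gram = text.substring(i, i + 2);
            System.out.println(Integer.toHexString(gram.charAt(0)) + " "
                               + Integer.toHexString(gram.charAt(1)));
        }
    }
}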

As long as you're not nefarious, and don't mind a little inaccuracy in things like per-char cross-entropy rates due to the overstated lengths, everything should actually work OK. If you tokenize on whitespace, you should be OK with non-BMP code points in our classifiers, POS taggers and chunkers.

Places where they won't make sense are things like edit distance, where our API is defined in terms of chars, not code points. And string comparisons like TF/IDF over character n-grams will be off in that they use chars, not code points, so the dimensionality winds up being a little higher. For applications like classification, this shouldn't matter. Of course, it'd be possible to write a proper n-gram code point (vs. char) tokenizer by taking the surrogate pairs into account.
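Here's a minimal sketch of such a code point n-gram extractor, using the standard String.codePointCount() and String.offsetByCodePoints() methods; it's plain Java for illustration, not part of LingPipe:

public class CodePointNGrams {
    public static void main(String[] args) {
        String text = "ABC" + new String(Character.toChars(0x10144));
        int n = 2;

        // Number of code points, which may be smaller than text.length().
        int numCodePoints = text.codePointCount(0, text.length());

        // Slide a window of n code points, converting code point offsets
        // back to char offsets so surrogate pairs are never split.
        for (int i = 0; i + n <= numCodePoints; ++i) {
            int start = text.offsetByCodePoints(0, i);
            int end = text.offsetByCodePoints(start, n);
            System.out.println(text.substring(start, end));
        }
    }
}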

Here’s a little test program that doesn’t involve LingPipe so you (and I) can see what’s up with the encodings:

public class Greek {
    public static void main(String[] args) throws Exception {
        // U+1F10, GREEK SMALL LETTER EPSILON WITH PSILI, is in the BMP,
        // so it fits in a single char.
        char c1 = (char) 0x1F10;
        String s = new String(new char[] { c1 });

        // Encoding as UTF-16LE yields exactly two bytes, low byte first;
        // the 0xFF mask prints each byte as an unsigned value.
        byte[] bytes = s.getBytes("UTF-16LE");
        for (int i = 0; i < bytes.length; ++i)
            System.out.println("byte " + i + " "
                               + Integer.toHexString(bytes[i] & 0xFF));

        // Decoding the same two bytes (in little-endian order) produces a
        // string containing a single char equal to the original code point.
        String s2 = new String(new byte[] { (byte) 0x10, (byte) 0x1F },
                               "UTF-16LE");
        for (int i = 0; i < s2.length(); ++i)
            System.out.println("char " + i + "="
                               + Integer.toHexString((int) s2.charAt(i)));
    }
}
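
If I've got the byte order right, running it should print something like this, with the two little-endian bytes for U+1F10 followed by a single char whose hex value is the code point itself:

byte 0 10
byte 1 1f
char 0=1f10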