Props to Mark Davis for Unicode and ICU


Three cheers for Mark Davis, IBM tech staffer extraordinaire. Not only is he a co-author of Unicode, he somehow found time to co-write an amazing Java package that does all the Unicode munging and normalization you could dream of. It also supplies a whole bunch of charset encoders and decoders that don’t ship with Sun’s JVM. It’s all up to date with the latest Unicode and plays nicely with both the 1.4 JDK and the 1.5 JDK.

We’ve been working on international search, specifically in Chinese, Japanese, Korean, and Arabic, and we needed to be able to do Unicode normalization. For instance, the precomposed Japanese character ぺ (code point 0x307A) is functionally equivalent to the pair of characters へ and ゚ (code points 0x3078 for the base character and 0x309A for the combining mark, the latter of which probably looks like garbage even if you have Japanese fonts installed). See the Unicode chart for Hiragana for a nice PDF rendering.
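Here is a minimal sketch of that equivalence. It assumes the JDK's own `java.text.Normalizer` (added in Java 6) as a dependency-free stand-in for ICU's normalizer, and the class and method names are made up for illustration. NFC composes a base character and combining mark into the single precomposed character; NFD goes the other way.

```java
import java.text.Normalizer;

// Illustrative class name; uses the JDK's Normalizer rather than ICU4J.
class CanonicalEquivalenceSketch {

    // Canonical composition: base + combining mark -> precomposed character.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Canonical decomposition: precomposed character -> base + combining mark.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String precomposed = "\u307A";     // hiragana "pe" as one code point, U+307A
        String sequence = "\u3078\u309A";  // U+3078 "he" + U+309A combining mark

        // The raw strings differ, so a naive comparison misses the match.
        System.out.println(precomposed.equals(sequence));            // false
        // After normalizing to a common form they compare equal.
        System.out.println(compose(sequence).equals(precomposed));   // true
        System.out.println(decompose(precomposed).equals(sequence)); // true
    }
}
```

The point for search is that both the indexed text and the query have to go through the same form, whichever one you pick.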

But it turns out this problem is widespread. There are single Unicode characters for typesetting ligatures like “ffi” in English. And characters like Ö (letter o with an umlaut) may be written as two characters, with the diaeresis (the two-dot diacritic) supplied as a combining character. There are also full-width and half-width variants of characters (full-width Latin letters and half-width katakana, for example) which should be normalized. And font variants with their own characters (e.g. cursive vs. non-cursive in many languages). Oh, and fractions, superscripts, subscripts and scientific symbols. Normalization even handles rotated geometric shapes. The rules for Hangul (Korean) are too complex to go into here, but suffice it to say there’s an algorithm behind it and a single “character” may be composed of three Unicode characters.
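A few of these cases can be sketched in code. As an assumption for the sake of a self-contained example, this uses the JDK's `java.text.Normalizer` rather than ICU4J, and the class name and the `hex` helper are hypothetical. Each input is normalized with NFKC and the resulting code points are printed in hex, which sidesteps console font and encoding trouble.

```java
import java.text.Normalizer;

// Illustrative class name; uses the JDK's Normalizer rather than ICU4J.
class CompatibilitySketch {

    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Hypothetical helper: render a string as "U+XXXX U+XXXX ...".
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Ligature "ffi" (U+FB03) becomes three plain letters.
        System.out.println(hex(nfkc("\uFB03")));   // U+0066 U+0066 U+0069
        // O + combining diaeresis (U+0308) composes to U+00D6.
        System.out.println(hex(nfkc("O\u0308")));  // U+00D6
        // Full-width Latin capital A (U+FF21) becomes ASCII "A".
        System.out.println(hex(nfkc("\uFF21")));   // U+0041
        // Vulgar fraction one half (U+00BD) becomes 1, fraction slash, 2.
        System.out.println(hex(nfkc("\u00BD")));   // U+0031 U+2044 U+0032
        // Superscript two (U+00B2) becomes the digit 2.
        System.out.println(hex(nfkc("\u00B2")));   // U+0032
        // Three Hangul jamo compose algorithmically into one syllable.
        System.out.println(hex(nfkc("\u1112\u1161\u11AB"))); // U+D55C
    }
}
```

Note that this is lossy by design: once the superscript is folded to a plain digit, you can't get it back, which is fine for a search key but not for display text.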

The normalization we’re using is NFKC, which involves aggressive decomposition (called “compatibility”) followed by standard recomposition (called “canonical”). Read the spec (Unicode Standard Annex #15, Unicode Normalization Forms).
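To make the two steps concrete, here is a rough sketch, again assuming the JDK's `java.text.Normalizer` in place of ICU4J (whose `com.ibm.icu.text.Normalizer2` class plays the same role). The names `searchKey` and `twoStep` are hypothetical. The example uses half-width katakana: compatibility decomposition widens both the letter and its voiced sound mark, and canonical composition then fuses them into one character.

```java
import java.text.Normalizer;

// Illustrative class name; uses the JDK's Normalizer rather than ICU4J.
class SearchKeySketch {

    // Hypothetical helper: apply the same NFKC normalization to
    // indexed text and to queries so equivalent spellings match.
    static String searchKey(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // The same result spelled out as two steps: compatibility
    // decomposition (NFKD), then canonical composition (NFC).
    static String twoStep(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFKD);
        return Normalizer.normalize(decomposed, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Half-width katakana "ka" (U+FF76) + half-width voiced mark (U+FF9E).
        String halfWidth = "\uFF76\uFF9E";
        // Ordinary katakana "ga" as a single code point, U+30AC.
        String fullWidth = "\u30AC";

        System.out.println(searchKey(halfWidth).equals(fullWidth));          // true
        System.out.println(twoStep(halfWidth).equals(searchKey(halfWidth))); // true
    }
}
```

In practice you'd call the one-step form; the two-step version is only there to show what the name NFKC is describing.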

Mark Davis, co-author of the normalization spec, also co-authored a Java package to deal with the vagaries of Unicode programmatically. What we’ve been using is its normalization support. If you use character data at all, drop what you’re doing (unless it’s reading the spec, in which case you may finish) and check out IBM’s open source International Components for Unicode (ICU), which is available in C and Java versions.
