The Latin1 Transcoding Trick (for Java Servlets, etc.)

This post uses Java servlets as an example, but the trick applies to any situation where a program automatically converts streams of bytes to streams of Unicode characters.

The Trick

Suppose you have an API that insists on converting an as-yet-unseen stream of bytes to characters for you (e.g. servlets), but lets you set the character encoding if you want.

Because Latin1 (officially, ISO-8859-1) maps bytes one-to-one to Unicode code points, setting Latin1 as the character encoding means that you get a single Java char for each byte. These char values may be translated back into bytes with a simple cast:

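// Converter stands in for whatever API does the byte-to-char conversion
Converter.setEncoding("ISO-8859-1");
char[] cs = Converter.getChars();
byte[] bs = new byte[cs.length];
for (int i = 0; i < cs.length; ++i)
    bs[i] = (byte) cs[i];  // narrowing cast keeps the low 8 bits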

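Now you have the original bytes back again in the array bs. In Java, char values are unsigned 16-bit values, whereas byte values are signed 8-bit values. The cast is a narrowing conversion that keeps only the low-order 8 bits of each char. Because Latin1 decoding only produces char values in the range 0 to 255, no information is lost, and under two's-complement representation the values 128 through 255 come back as the bytes -128 through -1. (A program at the end of this post verifies that this works.)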
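To make the arithmetic concrete, here is a minimal sketch of the cast in both directions. The class name Latin1CastDemo and its two helper methods are made up for this illustration; only the cast and the `& 0xFF` mask are the point. Going from byte back to char needs the mask, because a negative byte would otherwise be sign-extended.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of any library.
class Latin1CastDemo {

    // char -> byte: the narrowing cast drops the high 8 bits,
    // which are all zero for Latin1-decoded text.
    static byte[] charsToBytes(char[] cs) {
        byte[] bs = new byte[cs.length];
        for (int i = 0; i < cs.length; ++i)
            bs[i] = (byte) cs[i];
        return bs;
    }

    // byte -> char: mask with 0xFF so that negative bytes map to
    // 128..255 instead of being sign-extended.
    static char[] bytesToChars(byte[] bs) {
        char[] cs = new char[bs.length];
        for (int i = 0; i < bs.length; ++i)
            cs[i] = (char) (bs[i] & 0xFF);
        return cs;
    }

    public static void main(String[] args) {
        byte[] original = { (byte) 0xE9, 0x41, (byte) 0xFF };
        char[] cs = new String(original, StandardCharsets.ISO_8859_1).toCharArray();
        byte[] recovered = charsToBytes(cs);
        System.out.println((int) cs[0] + " -> " + recovered[0]);   // prints "233 -> -23"
        System.out.println(Arrays.equals(original, recovered));    // prints "true"
    }
}
```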

You can now use your own character encoding detection or pull out a general solution like the International Components for Unicode (ICU), which I highly recommend. It tracks the Unicode standard very closely and performs character encoding detection, fully general and configurable Unicode normalization, and even transliteration.
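As a rough idea of what home-grown detection on the recovered bytes could look like, here is a deliberately naive sketch that uses only the standard library: it tries a strict UTF-8 decode and falls back to Latin1 if the bytes are malformed. The class NaiveDetector and its methods are hypothetical names for this post, and the fallback choice is an assumption. A real detector such as ICU's weighs far more evidence.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical two-candidate detector, for illustration only.
class NaiveDetector {

    // True if the bytes decode in the given charset with no
    // malformed or unmappable sequences.
    static boolean decodesStrictly(byte[] bs, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bs));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Assumption: anything that is not valid UTF-8 is treated as Latin1.
    static Charset guess(byte[] bs) {
        return decodesStrictly(bs, StandardCharsets.UTF_8)
            ? StandardCharsets.UTF_8
            : StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guess(utf8));
        System.out.println(guess(latin1));
    }
}
```

Pure ASCII input is valid UTF-8, so this sketch reports UTF-8 for it, which is harmless because both candidates decode ASCII identically.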

Use in Servlets for Forms

I learned this trick from Jason Hunter’s excellent book, Java Servlet Programming (2nd Edition, O’Reilly). Hunter uses the trick for decoding form data. The problem is that a form submission does not reliably tell the server which character encoding was used for the form. What Hunter does is add a hidden field that carries the name of the character encoding, then apply the Latin1 transcoding trick to recover the actual string.

Here’s an illustrative code snippet, copied more or less directly from Hunter’s book:

public void doGet(HttpServletRequest req, ...) {
    ...
    String encoding = req.getParameter("charset");
    String text = req.getParameter("text");
    text = new String(text.getBytes("ISO-8859-1"), encoding);
    ...

Of course, this assumes that getParameter() uses Latin1 to do the decoding, so that getBytes("ISO-8859-1") returns the original bytes. According to Hunter, this is typically what happens because browsers insist on submitting forms using ISO-8859-1, no matter what the user has chosen as an encoding.
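The string-level version of the trick can be tried without a servlet container. The sketch below simulates the situation under the assumption just stated: text is encoded as UTF-8 on the wire, decoded as Latin1 on arrival, and then repaired. The class FormRedecode and the method redecode are hypothetical names used only here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class FormRedecode {

    // Undo a Latin1 decode and redo it with the encoding actually used.
    static String redecode(String latin1Decoded, Charset actual) {
        byte[] original = latin1Decoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, actual);
    }

    public static void main(String[] args) {
        String typed = "na\u00efve \u65e5\u672c";              // what the user entered
        byte[] wire = typed.getBytes(StandardCharsets.UTF_8);  // bytes sent to the server
        // Simulates a container that decodes the parameter as Latin1.
        String garbled = new String(wire, StandardCharsets.ISO_8859_1);
        String fixed = redecode(garbled, StandardCharsets.UTF_8);
        System.out.println(garbled.equals(typed));  // prints "false"
        System.out.println(fixed.equals(typed));    // prints "true"
    }
}
```

If the first decode was not actually Latin1, the string may contain chars above 255, which getBytes replaces with '?', so the original bytes cannot be recovered that way.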

We borrowed this solution for our demos (all the servlet code is in the distribution under $LINGPIPE/demos/generic), though we let the user choose the encoding (which is itself problematic because of the way browsers work, but that’s another story).

Testing Transcoding

The following program checks the round trip for every possible byte value,

public class Test {
    public static void main(String[] args) throws Exception {
        // every possible byte value, from -128 to 127
        byte[] bs = new byte[256];
        for (int i = -128; i < 128; ++i)
            bs[i+128] = (byte) i;
        // decode as Latin1: one char per byte
        String s = new String(bs,"ISO-8859-1");
        char[] cs = s.toCharArray();
        // encode back to bytes as Latin1
        byte[] bs2 = s.getBytes("ISO-8859-1");
        // columns: original byte, decoded char value, recovered byte
        for (int i = 0; i < 256; ++i)
            System.out.printf("%d %d %d\n",
                             (int)bs[i],(int)cs[i],(int)bs2[i]);
    }
}

which prints out

c:\carp\temp>javac Test.java

c:\carp\temp>java Test
-128 128 -128
-127 129 -127
...
-2 254 -2
-1 255 -1
0 0 0
1 1 1
...
126 126 126
127 127 127
