This post uses Java servlets as an example, but the trick applies more broadly to any situation where a program automatically converts streams of bytes to streams of Unicode characters.
The Trick
Suppose you have an API that insists on converting an as-yet-unseen stream of bytes to characters for you (e.g. servlets), but lets you set the character encoding if you want.
Because Latin1 (officially, ISO-8859-1) maps bytes one-to-one onto the first 256 Unicode code points, setting Latin1 as the character encoding means that you get a single Java char for each byte. These char values may be translated back into bytes with a simple cast:
Converter.setEncoding("ISO-8859-1");
char[] cs = Converter.getChars();
byte[] bs = new byte[cs.length];
for (int i = 0; i < cs.length; ++i)
    bs[i] = (byte) cs[i];
Now you have the original bytes back again in the array bs. In Java, char values act as unsigned 16-bit values, whereas byte values are signed 8-bit values. The cast preserves values through the magic of narrowing integer conversion and two's-complement notation: the low-order eight bits of each char are exactly the bits of the original byte. (I attach a program at the end that verifies this works.)
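Converter above just stands in for whatever API is doing the conversion. As a minimal self-contained sketch of the same round trip using the standard String API (the class and variable names here are my own):

import java.util.Arrays;

public class RoundTrip {
    public static void main(String[] args) throws Exception {
        // Arbitrary bytes, including values outside the ASCII range.
        byte[] original = { (byte) 0xE9, (byte) 0xFF, 0x41, 0x00, (byte) 0x80 };

        // Decoding as Latin1 yields exactly one char per byte, each in 0..255.
        char[] cs = new String(original, "ISO-8859-1").toCharArray();

        // Casting each char back down to byte recovers the original values.
        byte[] recovered = new byte[cs.length];
        for (int i = 0; i < cs.length; ++i)
            recovered[i] = (byte) cs[i];

        System.out.println(Arrays.equals(original, recovered));  // prints true
    }
}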
You can now use your own character encoding detection or pull out a general solution like the International Components for Unicode (ICU), which I highly recommend: it tracks the Unicode standard very closely and provides character encoding detection, fully general and configurable Unicode normalization, and even transliteration.
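For instance, ICU4J ships a CharsetDetector class that can guess the encoding of a byte array. A sketch, assuming the ICU4J jar is on the classpath (the sample input and class name are mine; detection is statistical, so it works best on longer inputs):

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class DetectExample {
    public static void main(String[] args) throws Exception {
        // Stand-in for the raw bytes recovered via the Latin1 trick.
        byte[] bs = "détente, naïve café, Übergröße".getBytes("UTF-8");

        CharsetDetector detector = new CharsetDetector();
        detector.setText(bs);
        CharsetMatch match = detector.detect();  // best guess; null if no match

        System.out.println("charset=" + match.getName()
                           + " confidence=" + match.getConfidence());

        String text = match.getString();  // decode using the detected charset
        System.out.println(text);
    }
}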
Use in Servlets for Forms
I learned this trick from Jason Hunter’s excellent book, Java Servlet Programming (2nd Edition, O’Reilly). Hunter uses the trick for decoding form data. The problem is that there’s no way in HTML to declare what character encoding is used on a form. Hunter’s solution is to add a hidden field holding the name of the character encoding, then apply the Latin1 transcoding trick to recover the actual string.
Here’s an illustrative code snippet, copied more or less directly from Hunter’s book:
public void doGet(HttpServletRequest req, ...) {
    ...
    String encoding = req.getParameter("charset");
    String text = req.getParameter("text");
    text = new String(text.getBytes("ISO-8859-1"), encoding);
    ...
}
Of course, this assumes that getParameter() uses Latin1 to do the decoding, so that getBytes("ISO-8859-1") returns the original bytes. According to Hunter, this is typically what happens, because browsers insist on submitting forms using ISO-8859-1 no matter what encoding the user has chosen.
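Fleshed out, a servlet along these lines might look like the following sketch (the class name and the UTF-8 fallback are mine; the parameter names follow Hunter’s example):

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class FormDecodingServlet extends HttpServlet {
    public void doGet(HttpServletRequest req, HttpServletResponse res)
            throws ServletException, IOException {
        // Hidden field declaring the form's real encoding; fall back if absent.
        String encoding = req.getParameter("charset");
        if (encoding == null) encoding = "UTF-8";

        // The container decoded the parameter as Latin1, so getBytes("ISO-8859-1")
        // recovers the original bytes, which we re-decode with the real encoding.
        String text = req.getParameter("text");
        if (text != null)
            text = new String(text.getBytes("ISO-8859-1"), encoding);

        res.setContentType("text/plain; charset=UTF-8");
        PrintWriter out = res.getWriter();
        out.println(text);
    }
}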
We borrowed this solution for our demos (all the servlet code is in the distribution under $LINGPIPE/demos/generic
), though we let the user choose the encoding (which is itself problematic because of the way browsers work, but that’s another story).
Testing Transcoding
public class Test {
    public static void main(String[] args) throws Exception {
        // Fill an array with every possible byte value, -128 through 127.
        byte[] bs = new byte[256];
        for (int i = -128; i < 128; ++i)
            bs[i + 128] = (byte) i;

        // Decode the bytes as Latin1, then re-encode the resulting chars.
        String s = new String(bs, "ISO-8859-1");
        char[] cs = s.toCharArray();
        byte[] bs2 = s.getBytes("ISO-8859-1");

        // Print original byte, decoded char, and re-encoded byte side by side.
        for (int i = 0; i < 256; ++i)
            System.out.printf("%d %d %d\n",
                              (int) bs[i], (int) cs[i], (int) bs2[i]);
    }
}
which prints out
c:\carp\temp>javac Test.java

c:\carp\temp>java Test
-128 128 -128
-127 129 -127
...
-2 254 -2
-1 255 -1
0 0 0
1 1 1
...
126 126 126
127 127 127