The Latin1 Transcoding Trick for Ant


A while back Bob blogged about The Latin1 Transcoding Trick for Java Servlets, etc.

Suppose you have an API that insists on converting an as-yet-unseen stream of bytes to characters for you (e.g. servlets), but lets you set the character encoding if you want.

Because Latin1 (officially, ISO-8859-1) maps bytes one-to-one to Unicode code points, setting Latin1 as the character encoding means that you get a single Java char for each byte.
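To make the one-to-one claim concrete, here is a minimal sketch that decodes all 256 byte values as ISO-8859-1 and checks that each comes back as a single char whose code point equals the byte value. The class and method names are made up for illustration; only the standard `String` and `StandardCharsets` APIs are assumed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1RoundTrip {
    // Returns every possible byte value, 0x00 through 0xFF.
    static byte[] allByteValues() {
        byte[] bytes = new byte[256];
        for (int i = 0; i < 256; i++) {
            bytes[i] = (byte) i;
        }
        return bytes;
    }

    // True if decoding as ISO-8859-1 yields exactly one char per byte and
    // each char's code point equals the unsigned value of its byte.
    static boolean isOneToOne(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.ISO_8859_1);
        if (decoded.length() != bytes.length) {
            return false;
        }
        for (int i = 0; i < bytes.length; i++) {
            if (decoded.charAt(i) != (bytes[i] & 0xFF)) {
                return false;
            }
        }
        return true;
    }

    // True if decoding and then re-encoding as ISO-8859-1 gives back the same bytes.
    static boolean roundTrips(byte[] bytes) {
        String decoded = new String(bytes, StandardCharsets.ISO_8859_1);
        return Arrays.equals(bytes, decoded.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println(isOneToOne(all)); // prints "true"
        System.out.println(roundTrips(all)); // prints "true"
    }
}
```

Because nothing is lost in either direction, you can always recover the original bytes later and decode them with whatever encoding they were really in.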

Another situation where this trick comes in real handy is dealing with the way that Ant compiles its logfiles.

If, like me, you’re fond of debug-by-printf and you use Ant to compile and run your programs, then you might have run into a problem that has given rise to many StackOverflow queries: when you use an Ant task to run the program and instrument your code with print statements to standard out, Ant replaces non-ASCII characters with question marks. When the problem you’re trying to debug is making sure that non-ASCII characters are being processed correctly, this is both misleading and maddening. The standard advice on StackOverflow is to set the shell environment variable ANT_OPTS using the following incantation (for bash shell):

export ANT_OPTS="-Dfile.encoding=UTF-8"

This works as long as you’re working with UTF-8 encoded character data and your terminal’s encoding is set to UTF-8 as well. Here’s a solution that works no matter what character encoding is in play:

export ANT_OPTS="-Dfile.encoding=Latin1"

It’s the ol’ Latin1 transcoding trick!

Of course you already know about character encodings. Do you know about Ant’s IO System? Here’s what Ant contributor Conor MacNeill says:

The Ant IO system is designed to associate any output sent to System.out and System.err with the task that generated the output and to log the output accordingly.

Ant’s Main class installs its own output streams into System.out and System.err. These streams are instances of DemuxOutputStreams.

Using the source code for Ant 1.9.0, in class Main we see that System.in, System.out, and System.err are all reassigned to Ant’s DemuxInputStream and DemuxOutputStream, which extend InputStream and OutputStream, respectively:

System.setIn(new DemuxInputStream(project));
System.setOut(new PrintStream(new DemuxOutputStream(project, false)));
System.setErr(new PrintStream(new DemuxOutputStream(project, true)));

The call to the PrintStream constructor is the one-arg constructor PrintStream(OutputStream out). Because no file encoding is specified, the encoding used is the default charset for the JVM that’s running Ant. This is specified by the system property file.encoding. This property varies depending on your platform and locale. To check this, try this on your machine:
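To see what the choice of encoding does inside a PrintStream, here is a small sketch that uses the three-argument constructor PrintStream(OutputStream, boolean, String) to set the charset explicitly and captures the bytes that come out. This is an illustration only, not Ant's code, and the class and helper names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.util.Arrays;

class PrintStreamEncodingDemo {
    // Prints the text through a PrintStream that encodes with the named
    // charset and returns the bytes that reached the underlying stream.
    static byte[] printWith(String charsetName, String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café"; U+00E9 is outside US-ASCII

        // The encoder substitutes '?' (byte 63) for the unmappable character.
        System.out.println(Arrays.toString(printWith("US-ASCII", text)));   // [99, 97, 102, 63]
        // Latin1 emits the code point value as a single byte (0xE9 == -23 as a signed byte).
        System.out.println(Arrays.toString(printWith("ISO-8859-1", text))); // [99, 97, 102, -23]
        // UTF-8 emits two bytes for U+00E9 (0xC3 0xA9).
        System.out.println(Arrays.toString(printWith("UTF-8", text)));      // [99, 97, 102, -61, -87]
    }
}
```

The one-arg constructor behaves like one of these calls with the JVM's default charset filled in, which is why the output depends on file.encoding.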

public class GetDefaultEncoding {
    public static void main(String[] args) {
        System.out.println(System.getProperty("file.encoding")); // e.g. UTF-8, US-ASCII, or Cp1252
    }
}

On my Mac running OS X the default is US-ASCII (because the default locale on this machine is en_US). On my Windows XP machine the default is Cp1252 (Windows Latin1, which differs from ISO-8859-1 just enough to be noticeable).

At the point where Ant’s DemuxOutputStream takes in the bytes sent to System.out by a Java task, any character outside of the default character set is replaced by a question mark. When Latin1 is the default encoding, all bytes are valid Latin1 characters and their Unicode code point value is the same as the byte value, so the bytes from the Java task pass through the Ant IO system unchanged.
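The pass-through described above can be imitated in a few lines. This sketch is a simplified stand-in for an intermediary that decodes bytes to chars and later encodes them again with the same charset; it is not Ant's actual implementation, and the class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TranscodingHop {
    // Imitates one hop through a layer that turns bytes into a String and
    // back into bytes, using the same charset in both directions.
    static byte[] passThrough(byte[] input, Charset hopCharset) {
        String asChars = new String(input, hopCharset);
        return asChars.getBytes(hopCharset);
    }

    public static void main(String[] args) {
        // The program under test writes UTF-8 bytes for "café".
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // With US-ASCII in the middle, both bytes of the UTF-8 sequence for
        // U+00E9 are invalid, and each one comes out as '?'.
        byte[] viaAscii = passThrough(utf8, StandardCharsets.US_ASCII);
        System.out.println(new String(viaAscii, StandardCharsets.US_ASCII)); // prints "caf??"

        // With Latin1 in the middle, every byte survives untouched.
        byte[] viaLatin1 = passThrough(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(viaLatin1, utf8)); // prints "true"
    }
}
```

The Latin1 hop never needs to know that the data is UTF-8, which is the whole point of the trick.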

As long as the next process in the chain (e.g. the terminal app) is configured to handle whatever encoding your text data is in, you’re good to go.
