SAXWriter updated for XHTML subset of XML


At Alias-i recently, we’ve been writing lots of XHTML. You may have noticed the new LingPipe site, for example.

Now we’re trying to generate XHTML automatically. The XML written by com.aliasi.xml.SAXWriter, although valid XML, does not comply with XHTML’s requirements. XHTML puts two additional restrictions on XML elements: (1) elements with no content and no attributes must be written with a space before the closing slash, as in <br />, and (2) elements with attributes and no content must be closed with a separate end tag, as in <a name="foo"></a>.

I extended LingPipe’s SAXWriter class to support this. The old constructors still provide the old behavior, which writes elements as compactly as possible. Two new constructors allow a flag to specify the more verbose output required by XHTML. This wasn’t too hard; in fact, the unit tests were harder than adding the new formatting code.
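The two formatting rules are easy to see in isolation. Here is a minimal sketch of how an element with no content might be rendered in each mode. The class EmptyElementSketch, the method emptyElement and the xhtmlMode parameter are names made up for this illustration; this is not the SAXWriter code itself, and the real constructor signatures may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EmptyElementSketch {

    /**
     * Renders an element that has no content.
     * Hypothetical helper for illustration only; not LingPipe's API.
     */
    public static String emptyElement(String name,
                                      Map<String, String> atts,
                                      boolean xhtmlMode) {
        StringBuilder sb = new StringBuilder("<").append(name);
        for (Map.Entry<String, String> e : atts.entrySet()) {
            sb.append(' ').append(e.getKey())
              .append("=\"").append(escape(e.getValue())).append('"');
        }
        if (!xhtmlMode) {
            // Compact form: always self-close with no extra space.
            return sb.append("/>").toString();
        }
        if (atts.isEmpty()) {
            // Rule (1): no attributes, so add a space before the slash.
            return sb.append(" />").toString();
        }
        // Rule (2): attributes present, so close with a separate end tag.
        return sb.append("></").append(name).append('>').toString();
    }

    /** Escapes the characters that are unsafe inside a quoted attribute. */
    public static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        Map<String, String> none = new LinkedHashMap<>();
        Map<String, String> atts = new LinkedHashMap<>();
        atts.put("name", "foo");

        System.out.println(emptyElement("br", none, false)); // <br/>
        System.out.println(emptyElement("br", none, true));  // <br />
        System.out.println(emptyElement("a", atts, false));  // <a name="foo"/>
        System.out.println(emptyElement("a", atts, true));   // <a name="foo"></a>
    }
}
```

Keeping the mode as a single boolean mirrors the flag described above: callers that never pass it get the compact output they always had.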

The original motivation for the SAXWriter was that the Xerces example writer demo was so minimal. The new Xerces2-J sax.Writer is much slicker. Even so, it still lets you change the character set on the output writer without updating the XML declaration, which always declares UTF-8. Very odd, given the care they’ve now taken with the rest of the program. It’s worth studying, especially for tips on XML 1.1.
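To make the character set point concrete, here is a rough sketch of one way to keep the declaration and the output encoding in sync: derive both from the same Charset. The class DeclarationSketch and the method writeWithDeclaration are invented for this example; neither Xerces nor LingPipe is claimed to work this way.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

public class DeclarationSketch {

    /**
     * Writes an XML declaration followed by the given body, using one
     * Charset for both the declared encoding and the actual bytes.
     * Hypothetical helper for illustration only.
     */
    public static byte[] writeWithDeclaration(String charsetName, String body) {
        Charset charset = Charset.forName(charsetName);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            // The declared encoding comes from the same Charset the writer
            // uses, so the two cannot drift apart.
            writer.write("<?xml version=\"1.0\" encoding=\""
                         + charset.name() + "\"?>");
            writer.write(body);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] out = writeWithDeclaration("ISO-8859-1", "<p>\u00e9</p>");
        System.out.println(new String(out, Charset.forName("ISO-8859-1")));
    }
}
```

A writer that hard-codes UTF-8 in the declaration would produce a document whose bytes and declared encoding disagree as soon as any other character set is chosen.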

Warning: The feature described here will be in the 2.2.1 release, which is not yet scheduled.
