Thursday, May 24, 2007

Character encoding matters in Java

Last week, a co-worker told me he had been racking his brain for several days over one special character in an XML file. The project had to migrate from Microsoft Windows to a GNU/Linux platform, and a few automated tests kept failing. In the end, it all came down to a single "e" character with a diaeresis (ë). The character was read correctly on Windows, but the same test file produced an unexpected character on the Linux side.

What is the problem?


On most Microsoft platforms, a character encoding (CP1252, also called Windows-1252) is available for the main Western European languages. Besides the Microsoft mapping, there is also an international standard, ISO-8859-1 (also known as LATIN-1), which is supported on almost every system.

What most developers forget when performing character I/O in Java is to think about encoding. As long as you stay on a single platform and always work with files in the same encoding, everything works fine. Problems occur when you migrate to another platform, or have to deal with text files in other encodings. In this particular case, the XML file was encoded in the Windows default, CP1252. On the Linux side, however, the default encoding was ISO-8859-1, so the same file was interpreted differently and unexpected text was read from it.

Find a solution


How to solve this? Use a common encoding like ISO-8859-1 or, even better, UTF-8 from the Unicode standard. The latter supports almost every language in the world and is well supported on many platforms. The next step is to apply that encoding explicitly in both the writing and the reading code. If you look at the InputStreamReader and OutputStreamWriter documentation in the java.io package, you'll find that their constructors accept an explicit encoding. An alternative is to specify the default encoding at Java VM startup with the file.encoding system property (for example -Dfile.encoding=UTF-8), although this is a JVM-wide setting rather than a per-file one.
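As a minimal sketch of the explicit approach, the class below writes and reads a temporary file with a fixed charset instead of the platform default. The class and method names (ExplicitEncodingDemo, writeText, readText) are made up for this illustration, and it assumes Java 7 or later for StandardCharsets and try-with-resources; on older JVMs the charset name can be passed as a string such as "UTF-8" instead.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original project.
public class ExplicitEncodingDemo {

    // Writes text using the given charset, never the platform default.
    public static void writeText(File file, String text, Charset charset) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), charset))) {
            writer.write(text);
        }
    }

    // Reads the whole file back, decoding with the given charset.
    public static String readText(File file, Charset charset) throws IOException {
        StringBuilder result = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset))) {
            int c;
            while ((c = reader.read()) != -1) {
                result.append((char) c);
            }
        }
        return result.toString();
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("encoding-demo", ".txt");
        file.deleteOnExit();

        String original = "Citro\u00EBn"; // contains an "e" with a diaeresis
        writeText(file, original, StandardCharsets.UTF_8);
        String roundTrip = readText(file, StandardCharsets.UTF_8);

        // The text survives the round trip on any platform,
        // because neither side relies on the default encoding.
        System.out.println(original.equals(roundTrip));
    }
}
```

Because the charset is an explicit argument on both sides, the result no longer depends on whether the code runs on Windows or Linux.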

Some extra catches


  • UTF-8 is a superset of US-ASCII, but NOT of LATIN-1

  • use the same encoding for the DTD/XSD, not only the actual XML file

  • some Microsoft text editors add an extra BOM (byte order mark) prefix to Unicode documents, which not every parser expects

  • CVS clients may strip or add carriage return and/or newline characters, but this is usually not a problem in XML documents
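The first and third catches can be checked with a few lines of Java. This is only a sketch: the class and method names (EncodingCatchesDemo, sameBytesInUtf8AndLatin1, stripBom) are invented for the illustration, and the BOM helper assumes the text has already been decoded as UTF-8, in which case Java leaves the BOM in place as a leading U+FEFF character.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
public class EncodingCatchesDemo {

    // True when the text has the same byte representation in UTF-8 and LATIN-1.
    public static boolean sameBytesInUtf8AndLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(utf8, latin1);
    }

    // Drops a leading BOM character (U+FEFF) if one is present.
    public static String stripBom(String text) {
        if (text.startsWith("\uFEFF")) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        // Plain ASCII is encoded identically in both encodings.
        System.out.println(sameBytesInUtf8AndLatin1("e"));      // prints true

        // "e" with a diaeresis is one byte in LATIN-1 but two bytes in UTF-8.
        System.out.println(sameBytesInUtf8AndLatin1("\u00EB")); // prints false

        // A decoded document that started with a BOM keeps it as its first character.
        String decoded = "\uFEFF<root/>";
        System.out.println(stripBom(decoded));                  // prints <root/>
    }
}
```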



Useful links

* ISO-8859-1 at Wikipedia
* UTF-8 at Wikipedia

Friday, May 4, 2007

Hang JBoss app by unresponsive SMTP

Several calls just came in, asking why our ordering system was unavailable. Users could no longer log in on the initial web page. The web-based J2EE application is hosted on a JBoss-3.2.7 server and normally sends out an error mail when a fatal error occurs. Accessing the JBoss JMX console worked fine, and the log did not show an obvious problem.

In the end, "netstat" came to the rescue: while the application server kept accumulating open connections, one single connection stayed in a half-open state. It was a TCP connection to a remote SMTP server. A quick call to the operations department revealed a new problem with the mail server, and after a reboot, the ordering system was quickly back on track.

The bottom line: we must make our application less dependent on external services. That means asynchronous mailing in both the application and the logging, and possibly stricter timeout settings on network connections such as SMTP. This will probably require further investigation of the log4j framework, in particular AsyncAppender and SMTPAppender.
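For the timeout part, a possible starting point is the pair of JavaMail properties mail.smtp.connectiontimeout and mail.smtp.timeout, both given in milliseconds and unlimited by default. The sketch below only builds the Properties object; the class name, the host smtp.example.com and the chosen values are placeholders, and whether our JBoss mail service and the log4j SMTPAppender actually pick these properties up is an assumption that still needs to be verified.

```java
import java.util.Properties;

// Hypothetical helper class; names and values are placeholders.
public class SmtpTimeoutSettings {

    // Builds JavaMail session properties with explicit SMTP timeouts.
    public static Properties withTimeouts(String host, int connectMillis, int readMillis) {
        Properties props = new Properties();
        props.setProperty("mail.smtp.host", host);
        // Maximum time to wait while opening the TCP connection.
        props.setProperty("mail.smtp.connectiontimeout", String.valueOf(connectMillis));
        // Maximum time to wait for a reply on an open connection.
        props.setProperty("mail.smtp.timeout", String.valueOf(readMillis));
        return props;
    }

    public static void main(String[] args) {
        Properties props = withTimeouts("smtp.example.com", 5000, 10000);
        // With JavaMail on the classpath, these properties would be handed
        // to javax.mail.Session.getInstance(props) when creating the session.
        System.out.println(props.getProperty("mail.smtp.connectiontimeout"));
        System.out.println(props.getProperty("mail.smtp.timeout"));
    }
}
```

With finite timeouts, a hanging SMTP server should make the send call fail with an exception instead of blocking a request thread indefinitely.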