Thursday, May 24, 2007

Character encoding matters in Java

Last week, a co-worker told me he had been racking his brain for several days over one special character in an XML file. The project had to migrate from Microsoft Windows to a GNU/Linux platform, and a few automated tests kept failing. In the end, it all came down to a single "e" character with a diaeresis. The character was read correctly on Windows, but the same test file resulted in an unexpected character on the Linux side.

What is the problem?


On most Microsoft platforms, a character encoding called CP1252 is available for the main languages of Western Europe. Besides the Microsoft mapping, there is also an international standard, ISO-8859-1 (also known as LATIN-1), which is supported on almost every system.

What most developers forget when performing character I/O operations in Java is to think about encoding. When no encoding is specified, readers and writers fall back to the default encoding of the JVM. As long as you stay on a single platform and always work with files in the same encoding, that works fine. Problems occur when you migrate to another platform or have to deal with text files in other encodings. In this particular example, the XML file was encoded in the Windows default, CP1252. On the Linux side, however, the default encoding was ISO-8859-1, so the same text file was interpreted differently. The result is unexpected text read from files.
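
To make the effect concrete, here is a minimal sketch of the same bytes being decoded with two different encodings. The class name DecodeMismatch is made up for this post, and the UTF-8 / ISO-8859-1 pair was picked only because the difference is easy to see; it is not necessarily the exact pair from the story above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeMismatch {
    // Interpret raw bytes as text using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "e" with diaeresis, written as an escape so the encoding
        // of this source file itself does not matter.
        String original = "\u00EB";

        // What ends up on disk when the text is written as UTF-8 (bytes C3 AB).
        byte[] onDisk = original.getBytes(StandardCharsets.UTF_8);

        String right = decode(onDisk, StandardCharsets.UTF_8);
        String wrong = decode(onDisk, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false: decoded as two characters
        System.out.println("default: " + Charset.defaultCharset());
    }
}
```

The last line prints whatever default the JVM picked up, which is the value that code without an explicit encoding silently depends on.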

Find a solution


How to solve this? Use a common encoding like ISO-8859-1. Or even better, UTF-8 from the Unicode standard. The latter adds support for almost every language in the world and is well supported on many platforms. The next step is to apply it in both the writing and the reading code. If you look into the InputStreamReader and OutputStreamWriter documentation from the java.io package, you'll easily find that they accept an explicit encoding. An alternative is to set the default encoding at Java VM startup with the file.encoding system property (for example, -Dfile.encoding=UTF-8).
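
As a rough sketch of the first approach, the code below passes UTF-8 explicitly to both OutputStreamWriter and InputStreamReader. The class name ExplicitEncoding, the helper methods and the sample element are invented for illustration; only the java.io and java.nio.charset classes are real.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitEncoding {
    // Write text as UTF-8, regardless of the platform default.
    static void write(File file, String text) throws IOException {
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(file), StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    // Read the first line back, again decoding explicitly as UTF-8.
    static String readFirstLine(File file) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream(file), StandardCharsets.UTF_8))) {
            return in.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("encoding-demo", ".xml");
        file.deleteOnExit();

        String text = "<name>No\u00EBl</name>";
        write(file, text);
        System.out.println(text.equals(readFirstLine(file))); // true
    }
}
```

Because both sides name the encoding, the round trip no longer depends on the operating system or on the file.encoding setting of the JVM that runs it.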

Some extra catches


  • UTF-8 is a superset of US-ASCII, but NOT of LATIN-1

  • use the same encoding for the DTD/XSD, not only the actual XML file

  • some Microsoft text editors add an extra BOM (byte order mark) prefix to Unicode documents; in UTF-8 it is optional, and not every tool expects it

  • CVS clients may strip or add carriage return and/or newline characters, but this is usually not a problem in XML documents
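
The first and third catch can be checked with a few lines of Java. This is only a sketch: the class EncodingCatches and its two helpers are hypothetical names, and stripping the BOM from an already decoded string is just one possible way to deal with it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingCatches {
    // True when both charsets encode the text to identical bytes.
    static boolean sameBytes(String text, Charset a, Charset b) {
        return Arrays.equals(text.getBytes(a), text.getBytes(b));
    }

    // Drop a leading U+FEFF, which is what a UTF-8 BOM (bytes EF BB BF) decodes to.
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // Pure ASCII text has the same bytes in US-ASCII and UTF-8.
        System.out.println(sameBytes("plain",
                StandardCharsets.US_ASCII, StandardCharsets.UTF_8)); // true

        // An accented character takes one byte in LATIN-1 but two in UTF-8.
        System.out.println(sameBytes("No\u00EBl",
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)); // false

        // A tiny document as an editor might save it: BOM followed by "<a/>".
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        String decoded = new String(withBom, StandardCharsets.UTF_8);
        System.out.println(decoded.length());            // 5: the BOM survives decoding
        System.out.println(stripBom(decoded).length());  // 4
    }
}
```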



Useful links

* ISO-8859-1 at Wikipedia
* UTF-8 at Wikipedia
