What is the problem?
On most Microsoft platforms there is a character encoding, CP1252 (also known as windows-1252), that covers the main languages of Western Europe. Besides the Microsoft mapping, there is also the international standard ISO-8859-1 (also known as LATIN-1), which is supported on almost every system.
What most developers forget when performing character I/O in Java is to think about encoding. As long as you stay on a single platform and always work with files encoded with the same scheme, everything works fine. Problems occur when you migrate to another platform or have to deal with text files in alternate encodings. In this particular example, the XML file was encoded in the (default) CP1252. On the Linux side, however, the default encoding scheme is ISO-8859-1, and the same text file is interpreted differently. This results in unexpected text being read from files.
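To make the mismatch concrete, here is a minimal sketch (the class name and byte values are chosen just for illustration) that decodes the same three bytes with both encodings:

    import java.io.UnsupportedEncodingException;

    public class DecodeDemo {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // In CP1252 these bytes are the euro sign and two curly quotes
            byte[] data = { (byte) 0x80, (byte) 0x93, (byte) 0x94 };

            // Decoded as CP1252, visible characters come out
            System.out.println(new String(data, "windows-1252"));

            // Decoded as ISO-8859-1, the very same bytes map to
            // invisible C1 control characters
            System.out.println(new String(data, "ISO-8859-1"));
        }
    }

The bytes on disk never change; only the interpretation of them does.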
Finding a solution
How to solve this? Use a common encoding like ISO-8859-1, or even better, UTF-8 from the Unicode standard. The latter supports almost every language in the world and is well supported on most platforms. The next step is to specify that encoding explicitly in both the write and the read code. If you look into the InputStreamReader and OutputStreamWriter documentation from the java.io package, you'll easily find that they accept a custom encoding scheme. An alternate way is to specify the encoding at Java VM startup, with the file.encoding system property (java -Dfile.encoding=UTF-8 ...), although explicit encodings in code are more robust.
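A minimal sketch of such explicit-encoding I/O, assuming a scratch file demo.txt (the file name and sample text are just for illustration):

    import java.io.*;

    public class ExplicitEncodingDemo {
        public static void main(String[] args) throws IOException {
            File file = new File("demo.txt");

            // Write with an explicit encoding instead of the platform default
            try (Writer out = new OutputStreamWriter(new FileOutputStream(file), "UTF-8")) {
                out.write("Caf\u00e9 \u20ac"); // "Café €": characters that differ between CP1252 and ISO-8859-1
            }

            // Read back with the same explicit encoding
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
                System.out.println(in.readLine());
            }
        }
    }

Because the charset name is passed at every stream construction, the code behaves the same no matter which platform default is in effect.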
Some extra catches
- UTF-8 is a superset of US-ASCII, but NOT of LATIN-1: characters above 127 are encoded as two bytes in UTF-8
- use the same encoding for the DTD/XSD, not only for the actual XML file
- some Microsoft text editors add an extra, non-standard BOM prefix (byte order mark) to Unicode documents; see the sketch after this list
- CVS clients may strip or add carriage return and/or newline characters, but this is usually not a problem in XML documents
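Standard Java readers do not skip a UTF-8 BOM automatically, so a parser may see a stray character at the start of the file. One way to deal with this is sketched below (BomAwareReader and openUtf8SkippingBom are hypothetical names, and the three-byte read is kept deliberately simple):

    import java.io.*;

    public class BomAwareReader {
        // The UTF-8 BOM is the three-byte sequence EF BB BF
        public static Reader openUtf8SkippingBom(File file) throws IOException {
            PushbackInputStream in = new PushbackInputStream(new FileInputStream(file), 3);
            byte[] bom = new byte[3];
            int read = in.read(bom, 0, 3);
            boolean hasBom = read == 3
                    && (bom[0] & 0xFF) == 0xEF
                    && (bom[1] & 0xFF) == 0xBB
                    && (bom[2] & 0xFF) == 0xBF;
            if (!hasBom && read > 0) {
                in.unread(bom, 0, read); // no BOM found: push the bytes back
            }
            return new InputStreamReader(in, "UTF-8");
        }
    }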
Useful links
* ISO-8859-1 at Wikipedia
* UTF-8 at Wikipedia