Thursday, May 19, 2011

Encoding form data in Java servlets

Today I once again got tripped up, and frustrated, by bad handling of non-ISO-8859-1 form data (ISO-8859-1 is also known as Latin-1) in a Java web application. Russian and German users rightfully complain about losing their localized input once they press the submit button. There are a few things I had heard in the past but had to look up again, because I tend to forget them easily (at least for the Java web app world):
  • explicitly declare a Unicode encoding in the response, both in the HTTP header (for example Content-Type: text/html; charset=UTF-8) AND in the HTML meta data

  • set the character encoding on the incoming HttpServletRequest (setCharacterEncoding) BEFORE reading any parameter value
By default, most web application servers simply decode request parameters as Latin-1, which does NOT match the UTF-8 encoding the browser used to submit the form. The full story is rather complicated and drags along a lot of legacy combined with pragmatic choices. But in the end, following those two rules gets you much closer to a solution in most cases.
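
To make the mismatch concrete, here is a minimal sketch that uses only the JDK, no servlet container. The class FormEncodingDemo and its two helper methods are hypothetical names I made up for illustration. They only approximate what a browser and a server do with a form value, so take it as a sketch and not as how any particular container is implemented.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class, not part of the servlet API.
final class FormEncodingDemo {

    // Roughly what a browser does for a UTF-8 page:
    // percent-encode the UTF-8 bytes of the typed value.
    static String browserEncode(String typed) {
        try {
            return URLEncoder.encode(typed, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Roughly what the server does: turn the bytes back into text
    // using whatever charset it believes the request is in.
    static String serverDecode(String raw, String charset) {
        try {
            return URLDecoder.decode(raw, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // German greeting with u-umlaut and sharp s, written with escapes
        // so the source file encoding does not matter.
        String typed = "Gr\u00fc\u00dfe";
        String raw = browserEncode(typed);

        System.out.println(raw); // Gr%C3%BC%C3%9Fe
        System.out.println(typed.equals(serverDecode(raw, "ISO-8859-1"))); // false
        System.out.println(typed.equals(serverDecode(raw, "UTF-8")));      // true
    }
}
```

Decoding with ISO-8859-1 turns every UTF-8 byte into its own character, which is exactly the garbage our users see. In a servlet, calling request.setCharacterEncoding("UTF-8") before the first getParameter call is what selects the second, correct decoding.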

In general, UTF-8 seems to have the best and widest support, so I suggest sticking with it. If you use the Spring Framework, have a look at the CharacterEncodingFilter.

As a side note: one specific JSF 1.2 application does NOT show the encoding problem, even without setting the character encoding manually on the request. I still need to find out WHY it works fine there. Maybe the application server is set up slightly differently, so that UTF-8 is the default mapping, or maybe I'm simply going blind. :-)

AJAX exception?

For some reason, our Firefox 4 browser submits AJAX POST data with an explicit character set indication (UTF-8) in the HTTP header, and therefore those AJAX submits (based on RichFaces 3.x) work fine out of the box. I should investigate why the behavior is different here. Is it a feature of XMLHttpRequest, or is it some fancy JavaScript generated by the library we use?
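
Whatever the cause on the browser side, the server-side effect can be sketched in a few lines: when the Content-Type header carries a charset parameter, the container can use it, and otherwise it falls back to its default. The class ContentTypeCharset and the method charsetOf below are hypothetical names for illustration, and the ISO-8859-1 fallback is my assumption about the typical container default. This is a simplified sketch, not the parsing code of any real server.

```java
import java.util.Locale;

// Hypothetical helper, not taken from any application server.
final class ContentTypeCharset {

    // Assumed fallback when the request does not name a charset.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Returns the charset parameter of a Content-Type header value,
    // or the default when the header or the parameter is missing.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return DEFAULT_CHARSET;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                if (!value.isEmpty()) {
                    return value;
                }
            }
        }
        return DEFAULT_CHARSET;
    }

    public static void main(String[] args) {
        // AJAX-style submit with an explicit charset
        System.out.println(charsetOf("application/x-www-form-urlencoded; charset=UTF-8"));
        // Plain form submit without a charset
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
    }
}
```

With the first header the sketch yields UTF-8, with the second it falls back to ISO-8859-1, which matches what I see: the AJAX submits survive and the plain form submits do not.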
