End-to-end internationalization of Web applications

Going beyond the JDK

The final blow that knocks ISO-8859 out of the realm of fully internationalized applications is its lack of support for multibyte characters such as those found in Asian languages. Although wider character encodings and modal 8-bit encodings support these character sets, they also cannot coexist with other encodings.

For this reason, the Unicode Consortium developed the Unicode Standard. Unicode was created to be a single character set covering all characters; its code space has room for more than a million of them. One encoding for Unicode is the variable-width UTF-8 encoding. UTF-8 is compatible with US-ASCII: any character supported by the US-ASCII encoding is encoded into a single byte in UTF-8 using the same US-ASCII encoding value. UTF-8 marks a multibyte sequence by setting the most significant bit of every byte in the sequence, starting with the first. The UTF-16 encoding is similar, but every character is at least two bytes wide.
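As a rough illustration of the byte layouts described above, the following sketch prints the encoded bytes of a few characters in hex. The class name `EncodedBytes` and its helper methods are hypothetical names chosen for this example; only the standard `java.nio.charset.StandardCharsets` constants are assumed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodedBytes {
    // Encodes the text with the given charset and returns the bytes as
    // space-separated, two-digit uppercase hex.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    static String hexUtf8(String text) {
        return hex(text, StandardCharsets.UTF_8);
    }

    // UTF_16BE is used so that no byte-order mark is prepended.
    static String hexUtf16(String text) {
        return hex(text, StandardCharsets.UTF_16BE);
    }

    public static void main(String[] args) {
        System.out.println(hexUtf8("A"));       // 41 (same value as US-ASCII)
        System.out.println(hexUtf8("\u00E9"));  // C3 A9 (e with acute accent)
        System.out.println(hexUtf8("\u65E5"));  // E6 97 A5 (a CJK ideograph)
        System.out.println(hexUtf16("A"));      // 00 41 (two bytes even for ASCII)
    }
}
```

Every byte of the multibyte sequences is 0x80 or above, which is how a decoder can tell them apart from single-byte US-ASCII characters.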

To be fully internationalized, and to avoid headaches, pick a UTF encoding and use it throughout your application. Both UTF-8 and UTF-16 can represent exactly the same characters, although a document made up predominantly of US-ASCII characters will be about half the size in UTF-8 that it is in UTF-16, because the default character width is one byte instead of two.
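The size difference can be checked directly. This is a minimal sketch, assuming a hypothetical `EncodedSize` helper class; it counts encoded bytes without a byte-order mark.

```java
import java.nio.charset.StandardCharsets;

final class EncodedSize {
    // Number of bytes the text occupies when encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies when encoded as UTF-16
    // (big-endian, without a byte-order mark).
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "Hello, world";
        System.out.println(utf8Length(ascii));   // 12
        System.out.println(utf16Length(ascii));  // 24

        // Three CJK ideographs: here the ratio tips the other way.
        String cjk = "\u65E5\u672C\u8A9E";
        System.out.println(utf8Length(cjk));     // 9
        System.out.println(utf16Length(cjk));    // 6
    }
}
```

The halving applies to text dominated by US-ASCII characters; for text dominated by characters that need three bytes in UTF-8, UTF-16 can be the more compact choice.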

The right input requires the right output

Web applications both send and receive text, so you must treat the character encoding of user-submitted text as carefully as the encoding of your Website's pages.

If your Website collects user input through an HTML form text field, you must know the character encoding used by the browser submitting the form. First, let's start with the bad news: the browser probably won't tell you what encoding it used. Some browsers may indicate the encoding in an HTTP header, and some browser-specific mechanisms exist to indicate encoding, but you must still deal with the reality that many browsers simply won't tell you how the data was encoded.

The HTML 4.0 standard introduced the accept-charset attribute on the <form> element to indicate which character encodings the server accepts. Unfortunately, the browser may disregard this value altogether, rendering the attribute essentially useless for controlling character encoding.

What you can do consistently with common modern browsers is assume the text's character encoding in a form submission is the same as the page encoding of the HTML containing the submitted form. Thus, if the form is contained on a page rendered with UTF-8, you can assume the submitted form text content is also UTF-8-encoded.
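On the server side, that assumption might look like the sketch below, which decodes a URL-encoded form value using the page's encoding. The class `FormDecoding`, the constant `PAGE_ENCODING`, and the sample values are hypothetical; `java.net.URLDecoder` is the standard JDK class. In a servlet container, the comparable step would typically be a call to `request.setCharacterEncoding("UTF-8")` before any parameter is read.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class FormDecoding {
    // Assumption for this sketch: the page containing the form was served as UTF-8.
    static final String PAGE_ENCODING = "UTF-8";

    // Decodes one application/x-www-form-urlencoded value, interpreting
    // the percent-escaped bytes in the page's encoding.
    static String decodeField(String rawValue) {
        try {
            return URLDecoder.decode(rawValue, PAGE_ENCODING);
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "caf%C3%A9" decodes to "caf" followed by U+00E9.
        System.out.println(decodeField("caf%C3%A9"));
        // "%E6%97%A5%E6%9C%AC" decodes to U+65E5 U+672C.
        System.out.println(decodeField("%E6%97%A5%E6%9C%AC"));
    }
}
```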

One caveat is that many browsers, including Internet Explorer and Netscape, allow the user to change which encoding is used to interpret the page after the page has loaded. A user could ask the browser to display a UTF-8-encoded document as if it were ISO-8859-1-encoded. If the page contains only US-ASCII characters, it will not look any different to the user. However, any submitted form text will be encoded differently from what the server anticipates. Again, if the submitted text is US-ASCII compatible, the server won't be any the wiser. But if any of the submitted text is in the upper end of the ISO-8859-1 encoding space, it will not be decoded properly, and the server will see it as garbage.
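The failure mode can be simulated without a browser. The sketch below, built around a hypothetical `EncodingMismatch` class, encodes text as ISO-8859-1 (standing in for the browser after the user's override) and decodes the bytes as UTF-8 (standing in for the server).

```java
import java.nio.charset.StandardCharsets;

final class EncodingMismatch {
    // Simulates a form value that the browser encoded as ISO-8859-1
    // but that the server decodes as UTF-8.
    static String submitAsLatin1ReadAsUtf8(String typed) {
        byte[] wire = typed.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wire, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Pure US-ASCII text survives the mismatch unchanged.
        System.out.println(submitAsLatin1ReadAsUtf8("cafe"));

        // U+00E9 becomes the single byte 0xE9, which is not valid UTF-8 on
        // its own; the decoder substitutes the replacement character U+FFFD.
        String garbled = submitAsLatin1ReadAsUtf8("caf\u00E9");
        System.out.println(garbled.equals("caf\u00E9"));   // false
        System.out.println(garbled.indexOf('\uFFFD') >= 0); // true
    }
}
```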
