End-to-end internationalization of Web applications

Going beyond the JDK

A typical Web application workflow involves a user loading one of your Webpages into her browser, filling out HTML form parameters, and submitting data back to the server. The server makes decisions based on this data, sends the data to other components such as databases and Web services, and renders a response back to the browser. At each step along the way, a globally aware application must pay attention to the user's locale and the text's character encoding.

The JDK provides many facilities to enable an internationalized workflow from within your Java code, and the Apache Struts framework extends them even further. However, you must still take particular care in managing how data gets into your application code and how your application interacts with other components in an internationalized manner. It is at these interfaces that internationalization is most thinly documented and supported.

In this article, you explore what you need to accomplish when developing an internationalized Web application. You also learn some best practices that will make your global applications successful.

A refresher on character encoding

Depending on what article, book, or standard you read, you'll notice subtle differences in the use of the terms character set and character encoding. Loosely speaking, a character set is a collection of the atomic letters, numbers, punctuation marks, and dingbats used to construct textual documents for one or more locales. A character encoding defines a mapping of numbers to the members of a character set. Although not truly synonymous, the terms are often used interchangeably.
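To make the distinction concrete, the following minimal sketch encodes a single character under three encodings and prints the resulting byte values. The class name `EncodingDemo` and the helper `hexBytes` are invented here for illustration; the sketch relies only on `String.getBytes(Charset)` and the constants in `java.nio.charset.StandardCharsets`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    /** Encodes text with the given charset and returns the bytes as space-separated hex. */
    static String hexBytes(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so bytes above 0x7F are not printed as negative values.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE

        // One character, three different numeric representations.
        System.out.println("ISO-8859-1: " + hexBytes(eAcute, StandardCharsets.ISO_8859_1)); // E9
        System.out.println("UTF-8:      " + hexBytes(eAcute, StandardCharsets.UTF_8));      // C3 A9
        // US-ASCII has no mapping for this character, so getBytes substitutes '?' (3F).
        System.out.println("US-ASCII:   " + hexBytes(eAcute, StandardCharsets.US_ASCII));   // 3F
    }
}
```

The character is the same in all three cases. Only the encoding, that is, the mapping between the character and its numeric form, changes.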

The familiar 7-bit US-ASCII encoding maps a Latin character set suitable for American users, but it proves unsuitable for global applications. To accommodate additional characters, ligatures, and diacritics, the 8-bit ISO-8859 series of encodings was created. These standards augment US-ASCII by using the eighth bit, which opens up 128 additional code positions above the ASCII range. The most common encoding (and, for many browsers and application servers, the default) is ISO-8859-1, or Latin Alphabet No. 1, which supports Western European character sets. Other encodings include ISO-8859-7 for Greek characters and ISO-8859-10 for Nordic languages.

Many applications are built solely around the ISO-8859-1 encoding. Although this encoding accommodates a wide scope of users, and might prove sufficient for many applications, it does not cover a complete character set. An application could, of course, select an appropriate ISO-8859 encoding based on the user's locale, but that approach creates a good deal of grief. One problem is that the byte-sized ISO-8859 encodings cannot coexist on the same page, because the upper halves of their encoding spaces map the same numbers to different characters. Another headache comes from receiving HTML form input from users using different encodings. When this data is stored in a database using byte-sized characters, you also need to store the encoding associated with each field.
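The clash in the upper half of the encoding space is easy to demonstrate. The sketch below decodes the same byte value under ISO-8859-1 and ISO-8859-7 and prints the resulting Unicode code points. The class name `UpperHalfDemo` and its helper methods are invented for illustration, and the sketch assumes that your Java runtime supports the ISO-8859-7 charset, which the platform specification does not guarantee.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UpperHalfDemo {

    /** Decodes a single byte value (0..255) using the given charset. */
    static String decode(int byteValue, Charset charset) {
        return new String(new byte[] { (byte) byteValue }, charset);
    }

    /** Formats the first code point of a string as U+XXXX. */
    static String codePoint(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        Charset latin1 = StandardCharsets.ISO_8859_1;
        // Assumption: the runtime ships ISO-8859-7; forName throws
        // UnsupportedCharsetException if it does not.
        Charset greek = Charset.forName("ISO-8859-7");

        // Lower half: both encodings agree with US-ASCII.
        System.out.println(codePoint(decode(0x41, latin1))); // U+0041, LATIN CAPITAL LETTER A
        System.out.println(codePoint(decode(0x41, greek)));  // U+0041, LATIN CAPITAL LETTER A

        // Upper half: the same byte is a different character in each encoding.
        System.out.println(codePoint(decode(0xE1, latin1))); // U+00E1, LATIN SMALL LETTER A WITH ACUTE
        System.out.println(codePoint(decode(0xE1, greek)));  // U+03B1, GREEK SMALL LETTER ALPHA
    }
}
```

Because the byte 0xE1 carries no meaning on its own, a stored byte-sized field is interpretable only if the name of its encoding travels with it.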

Resources