End-to-end internationalization of Web applications

Going beyond the JDK

A typical Web application workflow involves a user loading one of your Webpages into her browser, filling out HTML form parameters, and submitting data back to the server. The server makes decisions based on this data, sends the data to other components such as databases and Web services, and renders a response back to the browser. At each step along the way, a globally aware application must pay attention to the user's locale and the text's character encoding.

The JDK provides many facilities to enable an internationalized workflow from within your Java code, and the Apache Struts framework extends them even further. However, you must still take particular care in managing how data gets into your application code and how your application interacts with other components. It is at these interfaces that internationalization support is most thinly documented.

In this article, you explore what you need to accomplish when developing an internationalized Web application. You also learn some best practices that will make your global applications successful.

A refresher on character encoding

Depending on what article, book, or standard you read, you'll notice subtle differences in the use of the terms character set and character encoding. Loosely speaking, a character set is a collection of the atomic letters, numbers, punctuation marks, and dingbats used to construct textual documents for one or more locales. A character encoding defines a mapping of numbers to the members of a character set. Although not truly synonymous, the terms are often used interchangeably.
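
For example, the US-ASCII encoding maps the number 65 to the capital letter A, a mapping you can verify directly in Java:

   char letter = 'A';
   int number = letter;   // 65, the US-ASCII (and Unicode) value for 'A'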

The familiar 7-bit US-ASCII encoding maps a Latin character set suitable for American users, but it proves unsuitable for global applications. To accommodate additional characters, ligatures, and diacritics, the 8-bit ISO-8859 series of encodings was created. These standards augment US-ASCII by extending the encodings to include 128 additional characters. The most common encoding (and, for many browsers and application servers, the default) is ISO-8859-1, or Latin Alphabet No. 1, which supports Western European character sets. Other encodings include ISO-8859-7 for Greek characters and ISO-8859-10 for Nordic languages.

Many applications are built solely around the ISO-8859-1 encoding. Although this encoding accommodates a wide scope of users, and might prove sufficient for many applications, it is not a complete character set. An application could, of course, select an appropriate ISO-8859 encoding based on the user's locale, but that approach creates a good deal of grief. One problem is that the byte-sized ISO-8859 encodings cannot coexist on the same page because the upper halves of their encoding spaces map the same numbers to different characters. Another headache comes from receiving HTML form input from users using different encodings: when this data is stored in a database using byte-sized characters, you must also store the encoding associated with each field.

The final blow that knocks ISO-8859 out of the realm of fully internationalized applications is its lack of support for multibyte characters such as those found in Asian languages. Wider character encodings and modal 8-bit encodings do support these character sets, but they likewise cannot coexist with other encodings.

For this reason, the Unicode Consortium developed the Unicode Standard. Unicode was created to be a character set of all characters and can represent millions of them. One encoding for Unicode is the variable-width UTF-8 encoding. UTF-8 is compatible with US-ASCII: any character supported by US-ASCII is encoded in UTF-8 as a single byte with the same US-ASCII value. UTF-8 indicates the presence of a multibyte character by setting the most significant bit of the first byte. The UTF-16 encoding is similar, but all characters are at least two bytes wide.
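
For example, the letter é (Unicode value U+00E9) occupies a single byte in ISO-8859-1 but two bytes in UTF-8, and the first byte's most significant bits mark it as the start of a two-byte sequence:

   byte[] bytes = "\u00E9".getBytes( "UTF-8" );
   // bytes is { (byte) 0xC3, (byte) 0xA9 }: the leading byte's
   // high bits flag a two-byte sequence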

To be fully internationalized, and to avoid headaches, pick a UTF encoding and use it throughout your application. Both UTF-8 and UTF-16 provide precisely the same support, although documents with characters taken predominantly from the US-ASCII range will be about half the size when encoded in UTF-8 as in UTF-16 because the default character width is one byte instead of two.
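
You can observe the size difference with a couple of lines of plain Java (a minimal sketch; String.getBytes() throws java.io.UnsupportedEncodingException if the encoding name is unknown):

   String text = "Hello, world";               // US-ASCII-only content
   byte[] utf8  = text.getBytes( "UTF-8" );    // 12 bytes: one per character
   byte[] utf16 = text.getBytes( "UTF-16" );   // 26 bytes: a 2-byte byte-order
                                               // mark plus 2 bytes per character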

The right input requires the right output

Text is both sent and received by Web applications, so you must address the character encoding of user-submitted text as carefully as the encoding of your Website's pages.

If your Website collects user input through an HTML form text field, you must know the character encoding used by the browser submitting the form. Let's start with the bad news: the browser probably won't tell you what encoding it used. Some browsers indicate the encoding in an HTTP header, and some browser-specific mechanisms exist for it, but you must still deal with the reality that many browsers simply won't tell you how the data was encoded.

The HTML 4.0 standard introduced the accept-charset attribute on the <form> element to indicate which character encodings the server will accept. Unfortunately, the browser may disregard this value altogether, rendering this construct essentially useless for controlling character encoding.
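
For reference, the attribute is declared on the form itself (the action URL and field name here are hypothetical):

   <form action="/search" method="post" accept-charset="UTF-8">
      <input type="text" name="query" />
   </form>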

What you can do consistently with common modern browsers is assume the character encoding of text in a form submission is the same as the page encoding of the HTML containing the submitted form. Thus, if the form is contained on a page rendered with UTF-8, you can assume the submitted form text content is also UTF-8-encoded.

One caveat is that many browsers, including Internet Explorer and Netscape, allow the user to change which encoding is used to interpret the page after the page has loaded. A user could ask the browser to display a UTF-8-encoded document as if it were actually ISO-8859-1-encoded. If the page contains only US-ASCII characters, the page will not look different to the user. However, any submitted form text will be encoded differently than the server anticipates. Again, if the submitted text is US-ASCII compatible, the server won't be any wiser. However, if any of the submitted text falls in the upper end of the ISO-8859-1 encoding space, it will not be decoded properly; the server will view it as garbage.

This risk arises only when a user forces the page to be interpreted with an encoding for which it was not intended. In general, assuming the submitted text uses the same encoding as the form page is perfectly reasonable.

As noted earlier, there are problems associated with applications that render different pages using different encodings, and needing to know the browser's character encoding only adds to the mess. The character encoding used to decode submitted text must be set by calling setCharacterEncoding() on the ServletRequest object before calling getParameter(). Hence, you cannot embed the page encoding in a hidden form field unless you bypass the Servlet API (which is not recommended). Your best solution is to pick a single UTF encoding, such as UTF-8, and use it consistently throughout your application.
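
In servlet code, the ordering constraint looks like this (a minimal sketch; the form field name is illustrative):

   public void doPost( HttpServletRequest request,
                       HttpServletResponse response )
                             throws ServletException, IOException
   {
      // Must be called before the first getParameter() call;
      // otherwise the container decodes the parameters using
      // its default encoding (often ISO-8859-1)
      request.setCharacterEncoding( "UTF-8" );
      String comment = request.getParameter( "comment" );
      // ... process and respond ...
   }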

Controlling output character encoding

Because the output character encoding effectively determines the input character encoding, you must ensure the pages sent to your users are encoded as you intend.

You have several options for controlling output character encoding in a J2EE application. If you're writing a servlet, you can set the content type directly on the ServletResponse object. In doing so, however, be sure to use the java.io.PrintWriter to render your output. If you write directly to the java.io.OutputStream, your response will not be encoded as you intended:

   // Inside a servlet's doGet() or doPost() method:
   // Always set the content type before getting the PrintWriter
   response.setContentType( "text/html; charset=UTF-8" );
   // Now, get the writer that will handle your output
   PrintWriter writer = response.getWriter();

Setting the content type directly on the response object in a servlet is essentially the same as using a JSP (JavaServer Pages) page directive like this:

   <%@ page contentType="text/html; charset=UTF-8" %>

Both methods set the output response encoding, but they share a shortcoming: if you use the same page encoding throughout your Web application, you'll need to replicate this code in all of your application's servlets and JSP pages. Are you certain you, or another developer on your team, won't forget this subtle one-liner somewhere? If you set the encoding in a servlet, you can, of course, encapsulate this behavior in a common base class for all of your servlets. However, this approach isn't recommended; because Java restricts you to single inheritance of implementation, it prevents your servlets from extending other framework-related base classes.

If you're using Struts, you're in luck. The contentType attribute on the controller element in your struts-config.xml file can be used to set your responses' default character encoding:

   <controller contentType="text/html; charset=UTF-8" />

This attribute sets only the default encoding. A JSP page directive setting the content type, or setting the content type on the response object, overrides this setting.

If your Struts application has workflows that pass through servlets, or go directly to JSP pages without first passing through Struts, this configuration setting won't help.

Also, if your application contains static HTML documents, the problem proves even more difficult. You can use an http-equiv setting in an HTML <meta> tag to specify an output encoding, but that doesn't mean the editor actually used that encoding to save the file! (I talk more about conflicting encoding information later.)
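
For completeness, the tag belongs in the document's <head> section and looks like this:

   <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">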

The broader solution, controlling the output encoding for JSP pages, servlets, and static HTML in a single place, is to add a javax.servlet.Filter implementation to your application. First, implement a filter that wraps the servlet response object:

   import java.io.IOException;
   import javax.servlet.Filter;
   import javax.servlet.FilterChain;
   import javax.servlet.FilterConfig;
   import javax.servlet.ServletException;
   import javax.servlet.ServletRequest;
   import javax.servlet.ServletResponse;
   import javax.servlet.http.HttpServletResponse;

   public class UTF8EncodingFilter implements Filter
   {
      public void init( FilterConfig filterConfig ) throws ServletException
      {
         // This would be a good place to collect a parameterized
         // default encoding type.  For brevity, we're going to
         // use a hard-coded value in this example.
      }
      public void doFilter( ServletRequest request,
                            ServletResponse response,
                            FilterChain filterChain )
                                     throws IOException, ServletException
      {
         // Wrap the response object.  You should create a mechanism
         // to ensure the response object only gets wrapped once.
         // In this example, the response object will inappropriately
         // get wrapped multiple times during a forward.
         if (response instanceof HttpServletResponse)
         {
            response = new UTF8EncodingServletResponse(
                              (HttpServletResponse) response );
         }
         filterChain.doFilter( request, response );
      }
      public void destroy()
      {
         // no-op
      }
   }
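
To put the filter in front of every request, register it in your application's web.xml deployment descriptor. Here's a sketch, assuming the filter class above (the filter name is arbitrary, and the class would normally carry a package prefix):

   <filter>
      <filter-name>utf8EncodingFilter</filter-name>
      <filter-class>UTF8EncodingFilter</filter-class>
   </filter>
   <filter-mapping>
      <filter-name>utf8EncodingFilter</filter-name>
      <url-pattern>/*</url-pattern>
   </filter-mapping>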

The servlet response wrapper should set the default content type before the application attempts to read the submitted form parameters. Here, we override the call to setContentType(), which will be called at least once during the request (by the application server). If no explicit character encoding is specified (for example, the content type is simply set to "text/html" instead of "text/html; charset=ISO-8859-1"), we'll set the encoding to UTF-8, as shown in the code below. It's important, however, to make sure you only do this to text documents and not images or similar binary files.

   import javax.servlet.http.HttpServletResponse;
   import javax.servlet.http.HttpServletResponseWrapper;

   public class UTF8EncodingServletResponse
                     extends HttpServletResponseWrapper
   {
      private boolean encodingSpecified = false;

      public UTF8EncodingServletResponse( HttpServletResponse response )
      {
         super( response );
      }

      public void setContentType( String type )
      {
         String explicitType = type;
         // If a specific encoding has not already been set by the app,
         // let's see if this is a call to specify it.  If the content
         // type doesn't explicitly set an encoding, make it UTF-8.
         if (!encodingSpecified)
         {
            String lowerType = type.toLowerCase();
            // See if this is a call to explicitly set the character encoding.
            if (lowerType.indexOf( "charset" ) < 0)
            {
               // If no character encoding is specified, we still need to
               // ensure the app is specifying text content.
               if (lowerType.startsWith( "text/" ))
               {
                  // App is sending a text response, but no encoding
                  // is specified, so we'll force it to UTF-8.
                  explicitType = type + "; charset=UTF-8";
               }
            }
            else
            {
               // App picked a specific encoding, so let's make
               // sure we don't override it.
               encodingSpecified = true;
            }
         }
         // Delegate to supertype to record encoding.
         super.setContentType( explicitType );
      }
   }
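
With this wrapper in place, a servlet or JSP page that sets a bare "text/html" content type is transparently upgraded to "text/html; charset=UTF-8", while any component that names an explicit charset keeps its choice. Binary responses such as images pass through untouched because their content types don't begin with "text/".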