End-to-end internationalization of Web applications

Going beyond the JDK

A typical Web application workflow involves a user loading one of your Webpages into her browser, filling out HTML form parameters, and submitting data back to the server. The server makes decisions based on this data, sends the data to other components such as databases and Web services, and renders a response back to the browser. At each step along the way, a globally aware application must pay attention to the user's locale and the text's character encoding.

The JDK provides many facilities to enable an internationalized workflow from within your Java code, and the Apache Struts framework extends it even further. However, you must still take particular care when managing how data gets into your application code and how your application interacts with other components in an internationalized manner. It is at these interfaces that internationalization is most thinly documented and supported.

In this article, you explore what you need to accomplish when developing an internationalized Web application. You also learn some best practices that will make your global applications successful.

A refresher on character encoding

Depending on what article, book, or standard you read, you'll notice subtle differences in the use of the terms character set and character encoding. Loosely speaking, a character set is a collection of the atomic letters, numbers, punctuation marks, and dingbats used to construct textual documents for one or more locales. A character encoding defines a mapping of numbers to the members of a character set. Although not truly synonymous, the terms are often used interchangeably.

The familiar US-ASCII encoding maps a Latin character set suitable for American users, but it proves unsuitable for global applications. To accommodate additional characters, ligatures, and diacritics, the 8-bit ISO-8859 series of encodings was created. These standards augment US-ASCII by extending the encoding to include 128 additional characters. The most common encoding (and, for many browsers and application servers, the default) is ISO-8859-1, or Latin Alphabet No. 1, which supports Western European character sets. Other encodings include ISO-8859-7 for Greek characters and ISO-8859-10 for Nordic languages.
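To make the relationship between the 7-bit and 8-bit encodings concrete, here is a minimal sketch using only the JDK's built-in charsets. The class and method names (Latin1Demo, latin1Value, isAscii) are invented for this illustration. It shows that a character in the lower half keeps its US-ASCII value in ISO-8859-1, while an accented character lands in the upper 128 values and has no US-ASCII code at all.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of any framework discussed here.
class Latin1Demo {
    // Returns the unsigned byte value that ISO-8859-1 assigns to c.
    static int latin1Value(char c) {
        byte[] bytes = String.valueOf(c).getBytes(StandardCharsets.ISO_8859_1);
        return bytes[0] & 0xFF;
    }

    // True if US-ASCII has a code for c.
    static boolean isAscii(char c) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        System.out.println(latin1Value('A'));       // 65, same as US-ASCII
        System.out.println(latin1Value('\u00e9'));  // 233, e-acute in the upper half
        System.out.println(isAscii('A'));           // true
        System.out.println(isAscii('\u00e9'));      // false
    }
}
```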

Many applications are built solely around the ISO-8859-1 encoding. Although this encoding accommodates a wide scope of users, and might prove sufficient for many applications, it is not a complete character set. An application could, of course, select an appropriate ISO-8859 encoding based on the user's locale, but that only creates a good deal of grief. One problem is that the byte-sized ISO-8859 encodings cannot coexist on the same page because the upper halves of their encoding spaces map numbers to different characters. Another headache comes from receiving HTML form input from users using different encodings. When this data is stored in a database using byte-sized characters, you also need to store the encoding associated with the field.
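The coexistence problem is easy to reproduce. The following sketch decodes the same byte under two members of the ISO-8859 family and prints the resulting Unicode code points. The class and method names are hypothetical, and the example assumes your Java runtime ships the ISO-8859-7 charset (only ISO-8859-1 is guaranteed on every JVM).

```java
import java.nio.charset.Charset;

// Hypothetical demo class for illustration only.
class UpperHalfDemo {
    // Decodes a single byte under the named charset and returns its code point.
    static int decode(int byteValue, String charsetName) {
        byte[] raw = { (byte) byteValue };
        return new String(raw, Charset.forName(charsetName)).codePointAt(0);
    }

    public static void main(String[] args) {
        // The lower half agrees: 0x41 is 'A' in both encodings.
        System.out.printf("U+%04X%n", decode(0x41, "ISO-8859-1"));
        System.out.printf("U+%04X%n", decode(0x41, "ISO-8859-7"));
        // The upper half does not: 0xE9 is e-acute in Latin-1 but iota in Greek.
        System.out.printf("U+%04X%n", decode(0xE9, "ISO-8859-1")); // U+00E9
        System.out.printf("U+%04X%n", decode(0xE9, "ISO-8859-7")); // U+03B9
    }
}
```

A byte stored without its encoding is therefore ambiguous, which is why a database of byte-sized characters would need an encoding recorded alongside each field.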

The final blow that knocks ISO-8859 out of the realm of fully internationalized applications is its lack of support for multibyte characters such as those found in Asian languages. Although wider character encodings and modal 8-bit encodings support these character sets, they also cannot coexist with other encodings.
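You can ask the JDK directly whether an encoding has a code for a given character. This short sketch (the class and method names are made up for the example) checks one Japanese character against ISO-8859-1 and against UTF-8.

```java
import java.nio.charset.Charset;

// Hypothetical helper for illustration only.
class CoverageDemo {
    // True if the named charset can represent the character c.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char kanji = '\u65e5'; // the character for "sun" or "day"
        System.out.println(canEncode("ISO-8859-1", kanji)); // false
        System.out.println(canEncode("UTF-8", kanji));      // true
    }
}
```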

For this reason, the Unicode Consortium developed the Unicode Standard. Unicode was created to be a single character set covering all characters, and it can represent more than a million of them. One encoding for Unicode is the variable-width UTF-8. UTF-8 is compatible with US-ASCII: the first seven bits overlap precisely. Any character supported by the US-ASCII encoding is encoded into a single byte in UTF-8 using the same US-ASCII encoding value. UTF-8 indicates the presence of a multibyte encoding by setting the most significant bit of every byte in the sequence, starting with the first. The UTF-16 encoding is similar, but all characters are at least two bytes wide.
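A quick way to see these properties is to dump the encoded bytes in hexadecimal. The sketch below uses a hypothetical helper named hex; UTF-16BE is chosen instead of plain UTF-16 so that Java does not prepend a byte-order mark to the output.

```java
import java.nio.charset.Charset;

// Hypothetical demo class for illustration only.
class Utf8BytesDemo {
    // Encodes text with the named charset and returns the bytes as hex pairs.
    static String hex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("A", "UTF-8"));         // 41 (same as US-ASCII)
        System.out.println(hex("\u00e9", "UTF-8"));    // C3 A9 (high bit set on both bytes)
        System.out.println(hex("\u65e5", "UTF-8"));    // E6 97 A5 (three bytes)
        System.out.println(hex("A", "UTF-16BE"));      // 00 41 (always at least two bytes)
    }
}
```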

To be fully internationalized, and to avoid headaches, pick a UTF encoding and use it throughout your application. Both UTF-8 and UTF-16 provide precisely the same support, although documents with characters taken predominantly from the US-ASCII encoding and encoded in UTF-8 will be about half the size of a UTF-16-encoded document because the default character width is one byte instead of two.
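The size difference can be measured directly. In this sketch (names are invented for the example), UTF-16BE is used so the byte-order mark does not inflate the count. The second sample is included as a reminder that the "half the size" rule applies to mostly US-ASCII text; for East Asian text the comparison can go the other way.

```java
import java.nio.charset.Charset;

// Hypothetical demo class for illustration only.
class EncodedSizeDemo {
    // Number of bytes needed to store text in the named charset.
    static int size(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String english = "Hello, world";            // 12 US-ASCII characters
        String japanese = "\u65e5\u672c\u8a9e";     // 3 characters ("Japanese language")

        System.out.println(size(english, "UTF-8"));     // 12
        System.out.println(size(english, "UTF-16BE"));  // 24
        System.out.println(size(japanese, "UTF-8"));    // 9
        System.out.println(size(japanese, "UTF-16BE")); // 6
    }
}
```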

The right input requires the right output

Text is both sent and received by Web applications, so you must address the character encoding of user submitted text as carefully as the encoding of your Website's pages.

If your Website collects user input through an HTML form text field, you must know the character encoding used by the browser submitting the form. First, let's start with the bad news: the browser probably won't tell you what encoding it used. Some browsers may indicate the encoding in an HTTP header, and some browser-specific mechanisms exist to indicate encoding, but you must still deal with the reality that many browsers simply won't tell you how the data was encoded.

The HTML 4.0 standard introduced the accept-charset attribute on the <form> element to indicate what character encodings the server must accept. Unfortunately, the browser may disregard this value altogether, thus rendering this construct essentially useless for controlling character encoding.

What you can do consistently with common modern browsers is assume the text's character encoding in a form submission is the same as the page encoding of the HTML containing the submitted form. Thus, if the form is contained on a page rendered with UTF-8, you can assume the submitted form text content is also UTF-8.

One caveat is that many browsers, including Internet Explorer and Netscape, allow the user to change which encoding is used to interpret the page after the page has loaded. A user could request the browser to display a UTF-8-encoded document as if it were actually ISO-8859-1-encoded. If the page contains only US-ASCII characters, the page will not look different to the user. However, any submitted form text will be encoded differently than what the server anticipates. Again, if the submitted text is US-ASCII compatible, the server won't be any wiser. However, if any of the submitted text is in the upper end of the ISO-8859-1 encoding space, it will not be decoded properly, and the server will view it as garbage.
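The mismatch can be simulated without a browser or a server. This sketch uses the JDK's URLEncoder and URLDecoder as stand-ins for the browser's form encoding and the server's parameter decoding; that pairing is an assumption made for illustration, and the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical simulation of a form round trip; not real browser or container code.
class FormMismatchDemo {
    // Encodes text as the "browser" would, then decodes it as the "server" would.
    static String submitAndDecode(String text, String browserEncoding, String serverEncoding) {
        try {
            String onTheWire = URLEncoder.encode(text, browserEncoding);
            return URLDecoder.decode(onTheWire, serverEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding", e);
        }
    }

    public static void main(String[] args) {
        String cafe = "caf\u00e9"; // contains one character outside US-ASCII

        // Matching encodings: the text survives the round trip.
        System.out.println(cafe.equals(submitAndDecode(cafe, "UTF-8", "UTF-8")));      // true
        // User forced ISO-8859-1, server expects UTF-8: the accent is lost.
        System.out.println(cafe.equals(submitAndDecode(cafe, "ISO-8859-1", "UTF-8"))); // false
        // Pure US-ASCII text hides the mismatch entirely.
        System.out.println("cafe".equals(submitAndDecode("cafe", "ISO-8859-1", "UTF-8"))); // true
    }
}
```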

This risk only results when a user forces the page to be interpreted with an encoding for which it was not intended. In general, assuming the submitted text uses the same encoding as the form page is perfectly reasonable.

As noted earlier, there are problems associated with applications that render different pages using different encodings, and needing to know the browser's character encoding only adds to the mess. The character encoding used to decode submitted text must be set by calling setCharacterEncoding() on the ServletRequest before any request parameters are read. Hence, you cannot embed the page encoding in a hidden form field unless you bypass the Servlet API (which is not recommended). Your best solution is to pick a single UTF encoding, such as UTF-8, and use it consistently throughout your application.

Controlling output character encoding

Because the output character encoding controls input character encoding, you must ensure the pages sent to your user are encoded as you intended.

You have several options for controlling output character encoding in a J2EE application. If you're writing a servlet, you can set the content type directly on the ServletResponse object. In doing so, however, be sure to use the java.io.PrintWriter to render your output. If you write directly to the java.io.OutputStream, your response will not be encoded as you intended:

   ServletResponse response = getServletResponse();
   // Always set the content type before getting the PrintWriter
   response.setContentType( "text/html; charset=UTF-8" );
   // Now, get the writer that will handle your output
   PrintWriter writer = response.getWriter();
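To see why the writer matters, consider what a character-to-byte writer does with the same text under two encodings. The sketch below is a standalone approximation that wraps a ByteArrayOutputStream; it is not how any particular application server implements getWriter(), and the class and method names are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.nio.charset.Charset;

// Hypothetical stand-in for a response body; not container code.
class WriterEncodingDemo {
    // Writes text through a PrintWriter bound to the named charset
    // and returns the bytes that would go over the wire.
    static byte[] render(String text, String charsetName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        PrintWriter writer =
            new PrintWriter(new OutputStreamWriter(out, Charset.forName(charsetName)));
        writer.print(text);
        writer.flush(); // push buffered characters through the encoder
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // a single accented character
        System.out.println(render(text, "UTF-8").length);      // 2
        System.out.println(render(text, "ISO-8859-1").length); // 1
    }
}
```

The writer applies the charset for you; bytes pushed straight into an output stream go out exactly as you produced them, whatever the declared content type says.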

Setting the content type directly on the response object in a servlet is essentially the same as using a JSP (JavaServer Pages) page directive like this:

   <%@ page contentType="text/html; charset=UTF-8" %>

Both methods set the output response encoding, but they have a shortcoming. If you use the same page encoding throughout your Web application, you'll need to replicate this code throughout all of your application's servlets and JSP pages. Are you certain you, or another developer on your team, won't forget this subtle one-liner in any of your code? If you set the encoding in the servlet, then you can, of course, encapsulate this behavior in a common subclass for all of your servlets. However, this approach isn't recommended; it now prevents you from subclassing from other framework-related base classes because Java restricts you to single-inheritance of implementation.

If you're using Struts, you're in luck. The contentType attribute on the <controller> element in your struts-config.xml file can be used to set your responses' default character encodings:

   <controller contentType="text/html; charset=UTF-8" />

This attribute only sets the default encoding type. A JSP page directive setting the content type, or setting the content type on the response object, overrides this setting.

If your Struts application has workflows that pass through servlets, or go directly to JSP pages without first passing through Struts, this configuration setting won't help.

Also, if your application contains static HTML documents, the problem proves even more difficult. You can use an http-equiv setting in an HTML <meta> tag to specify an output encoding, but that doesn't mean the editor really used that encoding to save the file! (I talk more about conflicting encoding information later.)

The broader solution to control output-encoding for JSP pages, servlets, and static HTML in a single place is to add a javax.servlet.Filter implementation to your application. First, implement a filter that wraps the servlet response object:

   import java.io.IOException;
   import javax.servlet.Filter;
   import javax.servlet.FilterChain;
   import javax.servlet.FilterConfig;
   import javax.servlet.ServletException;
   import javax.servlet.ServletRequest;
   import javax.servlet.ServletResponse;
   import javax.servlet.http.HttpServletResponse;

   public class UTF8EncodingFilter implements Filter {
      public void init( FilterConfig filterConfig ) throws ServletException {
         // This would be a good place to collect a parameterized
         // default encoding type.  For brevity, we're going to
         // use a hard-coded value in this example.
      }

      public void doFilter( ServletRequest request,
                            ServletResponse response,
                            FilterChain filterChain )
                                     throws IOException, ServletException {
         // Wrap the response object.  You should create a mechanism
         // to ensure the response object only gets wrapped once.
         // In this example, the response object will inappropriately
         // get wrapped multiple times during a forward.
         response = new UTF8EncodingServletResponse( (HttpServletResponse) response );
         filterChain.doFilter( request, response );
      }

      public void destroy() {
         // no-op
      }
   }

The servlet response wrapper should set the default content type before the application attempts to read the submitted form parameters. Here, we override the call to setContentType(), which will be called at least once during the request (by the application server). If no explicit character encoding is specified (for example, the content type is simply set to "text/html" instead of "text/html; charset=ISO-8859-1"), we'll set the encoding to UTF-8, as shown in the code below. It's important, however, to make sure you only do this to text documents and not images or similar binary files.

   import javax.servlet.http.HttpServletResponse;
   import javax.servlet.http.HttpServletResponseWrapper;

   public class UTF8EncodingServletResponse extends HttpServletResponseWrapper {
      private boolean encodingSpecified = false;

      public UTF8EncodingServletResponse( HttpServletResponse response ) {
         super( response );
      }

      public void setContentType( String type ) {
         String explicitType = type;
         // If a specific encoding has not already been set by the app,
         // let's see if this is a call to specify it.  If the content
         // type doesn't explicitly set an encoding, make it UTF-8.
         if (!encodingSpecified) {
            String lowerType = type.toLowerCase();
            // See if this is a call to explicitly set the character encoding.
            if (lowerType.indexOf( "charset" ) < 0) {
               // If no character encoding is specified, we still need to
               // ensure the app is specifying text content.
               if (lowerType.startsWith( "text/" )) {
                  // App is sending a text response, but no encoding
                  // is specified, so we'll force it to UTF-8.
                  explicitType = type + "; charset=UTF-8";
               }
            } else {
               // App picked a specific encoding, so let's make
               // sure we don't override it.
               encodingSpecified = true;
            }
         }
         // Delegate to supertype to record encoding.
         super.setContentType( explicitType );
      }
   }
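The decision the wrapper makes can be checked in isolation, without a servlet container. The following sketch pulls the content-type rule into a plain static method; the class and method names are hypothetical, and it deliberately leaves out the encodingSpecified flag that the wrapper keeps between calls.

```java
import java.util.Locale;

// Hypothetical standalone version of the wrapper's content-type rule.
class ContentTypeDefaults {
    // Appends a UTF-8 charset to text content types that do not name one;
    // every other content type is returned unchanged.
    static String withDefaultCharset(String type) {
        String lower = type.toLowerCase(Locale.ROOT);
        boolean hasCharset = lower.indexOf("charset") >= 0;
        if (!hasCharset && lower.startsWith("text/")) {
            return type + "; charset=UTF-8";
        }
        return type;
    }

    public static void main(String[] args) {
        System.out.println(withDefaultCharset("text/html"));
        // text/html; charset=UTF-8
        System.out.println(withDefaultCharset("text/html; charset=ISO-8859-1"));
        // text/html; charset=ISO-8859-1 (explicit choice is respected)
        System.out.println(withDefaultCharset("image/png"));
        // image/png (binary content is left alone)
    }
}
```

Keeping the rule in a pure method like this makes it straightforward to unit test the three cases the article calls out: unspecified text, explicitly encoded text, and binary content.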