Multibyte-character processing in J2EE

Develop J2EE applications with multibyte characters

The Chinese language is one of the most complex and comprehensive languages in the world. Sometimes I feel lucky to be Chinese, specifically when I see some of my foreign friends struggle to learn the language, especially writing Chinese characters. However, I do not feel so lucky when developing localized Web applications using J2EE. This article explains why.

Though the Java platform and most J2EE servers support internationalization well, I am still confronted by many multibyte-character problems when developing Chinese or Japanese language-based applications:

  • What is the difference between encoding and charset?
  • Why do multibyte-character applications display differently when ported from one operating system to another?
  • Why do multibyte-character applications display differently when ported from one application server to another?
  • Why do my multibyte-character applications display well on the Internet Explorer browser but not on the Mozilla browser?
  • Why do applications on most J2EE servers display poorly when using UTF-16 (universal transformation format) encoding?

If you are asking the same set of questions, this article helps you answer them.

Basic knowledge of characters

Characters have existed long before computers. More than 3,000 years ago, special characters (named Oracles) appeared in ancient China. These characters have special visual forms and special meanings, with most having names and pronunciations. All of these facets compose the character repertoire, a set of distinct characters defined by a special language, with no relationship to the computer at all. Over thousands of years, many languages evolved and thousands of characters were created. And now we are trying to digitize all these characters into 1s and 0s, so computers can understand them.

When typing words with a keyboard, you deal with character input methods. For simple characters, there is one-to-one mapping between a key and a character. For a more complex language, a character needs multiple keystrokes.

Before you can see characters on the screen, the operating system must store characters in memory. In fact, the OS defines a one-to-one correspondence between characters in a character repertoire and a set of nonnegative integers, which are stored in memory and used by the OS. These integers are called character code.

Characters can be stored in a file or transmitted through the network. Software uses character encoding to define a method (algorithm) for mapping sequences of a character's character code into sequences of octets. Some character code maps into one byte, such as ASCII code; other character code, such as Chinese and Japanese, map into two or more bytes, depending on the different character-encoding schemas.

Different languages may use different character repertoires; each character repertoire uses some special encodings. Sometimes, when you choose a language, you may choose a character repertoire implicitly, which uses an implied character encoding. For example, when you choose the Chinese language, you may, by default, use the GBK Chinese character repertoire and a special encoding schema also named GBK.

I avoid the term character set because it causes confusion. Apparently, character set is the synonym of character repertoire. Character set is misused in the HTTP Mime (Multipurpose Internet Mail Extensions) header, where "charset" is used for "encoding."

One of Java's many features is the 16-bit character. This feature supports Unicode use, a standard way of representing many different kinds of characters in various languages. Unfortunately, this character also causes many problems when developing multibyte J2EE applications, which this article focuses on.

Development phases cause display problems

J2EE application development includes several phases (shown in Figure 1); each phase can cause multibyte-character display problems.

Figure 1. J2EE application development life cycle

Coding phase

When you code your J2EE applications, most likely, you use an IDE like JBuilder or NetBeans, or an editor like UltraEdit or vi. Whatever you choose, if you have a literal string in your JSP (JavaServer Pages), Java, or HTML files, and if these literal strings are multibyte characters such as Chinese or Japanese, most likely, you will encounter display problems if you are not careful.

A literal string is static information stored in files. Different encodings are used for different language characters. Most IDEs set their default encoding to ISO-8859-1, which is for ASCII characters and causes multibyte characters to lose information. For example, in the Chinese version of NetBeans, the default setting for file encoding is, unfortunately, ISO-8859-1. When I edit a JSP file with some Chinese characters (shown in Figure 2), everything seems correct. As I mentioned above, we know that all these characters shown in the screen are in memory, having no direct relationship with encoding. After saving this file, if you close the IDE and reopen it, these characters appear incomprehensible (shown in Figure 3) because ISO-8859-1 encoding loses some information when storing Chinese characters.

Figure 2. Chinese characters in NetBeans
Figure 3. Chinese characters in chaos

Character-encoding APIs

There are several APIs in the servlet and JSP specifications that handle the character-encoding process in J2EE applications. For a servlet request, setCharacterEncoding() sets the encoding schema for the current HTTP request's body. For a servlet response, setContentType() and setLocale() set Mime header encoding for the output HTTP response.

These APIs cause no problems themselves. On the contrary, the problems exist when you forget to use them. For example, in some servers, you can display multibyte characters properly without using any of the above APIs in your code, but when you run the application in other servers, characters appear incomprehensible. The reason for this multibyte-character display problem lies in how the servers treat character encoding during HTTP requests and responses. The following rules apply to most servers when determining the character encoding in requests and responses:

When processing a servlet request, the server uses the following order of precedence, first to last, to determine the request character encoding:

  • Code-specific settings (for example: the setCharacterEncoding() method)
  • Vendor-specific settings
  • The default setting

When processing a servlet response, the server uses the following order of precedence, first to last, to determine the response character encoding:

  • Code-specific settings (for example: the setContentType() and setLocale() methods)
  • Vendor-specific settings
  • The default setting

According to the above rules, if you give instruction codes using these APIs, all servers will obey them when choosing the character-encoding schema. Otherwise, different servers will behave differently. Some vendors use hidden fields in the HTTP form to determine the request encoding, others use specific settings in their own configuration files. The default settings can differ also. Most vendors use

ISO-8859-1

as default settings, while a few use the OS's locale settings. Thus, some multibyte character-based applications have display problems when porting to another vendor's J2EE server.

Compile phase

You can store multibyte literal strings, if correctly set, in source files during the edit phase. But these source files cannot execute directly. If you write servlet code, these Java files must be compiled to classfiles before deploying to the application server. For JSP, the application server will automatically compile the JSP files to the classfiles before executing them. During the compile phase, character-encoding problems are still possible. To see the following simple demo, download this article's source code.

Listing 1. EncodingTest.java

1      import java.io.ByteArrayOutputStream;
2      import java.io.OutputStreamWriter;
3
4      public class EncodingTest {  
5         public static void main(String[] args) {
6            OutputStreamWriter out = new OutputStreamWriter(new ByteArrayOutputStream());
7            System.out.println("Current Encoding:  "+out.getEncoding());
8            System.out.println("Literal output:  ��ã�"); // You may not see this Chinese String
9         }
10     }

Some explanation about the source code:

  • We use the following code to determine the system's current encoding:
  •  6    OutputStreamWriter out = new OutputStreamWriter(new ByteArrayOutputStream());
    7    System.out.println("Current Encoding:  "+out.getEncoding());
    
  • Line 8 includes a direct print-out of a Chinese character literal string (you may not see this string correctly due to your OS language settings)
  • Store this Java source file with GBK encoding

Look at the execution result shown in Figure 4.

Figure 4. Sample output. Click on thumbnail to view full-sized image.

From the result in Figure 4, we can conclude that:

  • The Java compiler (javac) uses the system's language environment as the default encoding setting, so does the Java Runtime Environment.
  • Only the first result is correct; other strings display incomprehensibly.
  • Only when the runtime encoding setting is the same as the one used to store the source file can multibyte literal strings display correctly (alternatively, you must convert from one encoding schema to another; please see the "Runtime phase" section).

Server configuration phase

Before you run your J2EE application, you should configure your application to meet special needs. In the previous section, we found that different language settings can cause literal-string display problems. Actually, different levels of configuration exist, and they all can cause problems for multibyte characters.

OS level

Language support of the operating system is most important. The language supports on the server side will affect JVM default encoding settings as described above. And the language support on the client side, such as font, can also directly affect character display, but this article doesn't focus on that.

J2EE application server level

Most servers have a per-server setting to configure the default behavior of character-encoding processing. For example, Listing 2 is part of Tomcat's configuration file (located in $TOMCAT_HOME/conf/web.xml):

Listing 2. web.xml

<servlet>
        <servlet-name>jsp</servlet-name>
        <servlet-class>org.apache.jasper.servlet.JspServlet</servlet-class>
        <init-param>
            <param-name>fork</param-name>
            <param-value>false</param-value>
        </init-param>
        <init-param>
            <strong>
            <param-name>javaEncoding</param-name>>
            <param-value>>UTF8</param-value>
            </strong>
        </init-param>
        <load-on-startup>3</load-on-startup>
  </servlet>

Tomcat uses parameter javaEncoding to define Java file encoding for generating Java source files from JSP files; the default is UTF-8. That means if you store Chinese characters in your JSP with GBK encoding and you want to display your characters using UTF-8 (browser setting), problems may result.

JVM level

Most servers can have multiple instances simultaneously, and each server instance can have an individual JVM instance. Plus, you can have separate settings for each JVM instance. Most servers have locale settings for each instance to define the default language support.

Figure 5. Sun ONE Application Server setting

Shown in Figure 5, the Sun ONE (Open Network Environment) Application Server has a per-instance setting for locale. This setting indicates the default behavior of encoding characters for the logging system and standard output.

On the other hand, different servers may use distinct JVM versions; and different JDK versions support various encoding standards. All these issues can cause porting problems. For example, Sun ONE Application Server and Tomcat support J2SE 1.4, while others support only J2SE 1.3. J2SE 1.4 supports Unicode 3.1, which has many new features previous versions lacked.

Per-application level

Every application deployed on the server can be configured with its unique encoding settings before it runs within the server container. This feature allows multiple applications using different languages to run inside one server instance. For example, in some servers, you can give the following character-encoding settings for each deployed application to indicate which encoding schema your application should use:

<locale-charset-info default-locale="en_US">
      </locale-charset-map locale="zh_CN" agent="Mozilla/4.77 [en] (Windows NT 5.0; U)"  charset="GBK">
</locale-charset-info>

The reason for all these configuration levels is flexibility and maintenance. However, unfortunately, they will cause problems when porting from one server to another, because not all configurations adhere to standards. For example, if you develop your application in a server that supports the locale-charset-info setting, you may have difficulties if you want to port the application to another server that does not support this encoding setting.

Runtime phase

At runtime, your J2EE application most likely communicates with other external systems. Your applications may read and write files, or use databases to manage your data. In other cases, an LDAP (lightweight directory access protocol) server stores identity information. Under all these situations, data exchange is needed between J2EE applications and external systems. If your data contains multibyte characters such as Chinese characters, you may face some issues.

Most external systems have their own encoding settings. For example, an LDAP server most likely uses UTF-8 to encode characters; Oracle Database System uses environment variable NLS_LANG to indicate encoding style. If you install Oracle on a Chinese OS, this variable resets to ZHS16GBK by default, which uses GBK encoding to store Chinese characters. So if your J2EE application's encoding settings differ from the external system, conversion is needed. The following code is common for these situations:

byte[] defaultBytes = original.getBytes(current_encoding);
String newEncodingStr = new String(defaultBytes, old_encoding);

The above code shows how to convert a string from one encoding to another. For example, you have stored a username (multibyte characters) in an LDAP server, which uses UTF-8 encoding and your J2EE application uses GBK encoding. So when your application gets usernames from LDAP, they may not be encoded correctly. To resolve this, you can use original.getBytes("GBK") to get the original bytes. Then construct a new string using new String(defaultBytes, "UTF-8"), which can display correctly.

Client display phase

Most J2EE applications now use the browser-server architecture, which employs browsers as their clients. To display multibyte characters correctly in browsers, you should take note of the following:

Browser language support

To display multibyte characters correctly, the browser and the OS where the browser runs should have language-specific supports, such as fonts and the character repertoire.

Browser encoding settings

The HTML header that the server returns, such as <meta http-equiv="content-type" content="text/html;charset=gb2312"> gives the browser an instruction about which encoding this page uses. Otherwise, the browser uses the default encoding setting or automatically detects one. Alternatively, users can set the page's encoding as shown in Figure 6.

Figure 6. Netscape's encoding-setting page

Thus, if a page lacks any instructions, the multibyte characters may display incorrectly. Under such situations, users must manually set the current page's encoding.

HTTP POST encoding

The situation grows more complicated when you post data to the server using the form tag in HTML pages. Which encoding the browser uses depends on the current page's encoding settings, which contains the form tag. That means if you construct an ASCII-encoded HTML page using ISO-8859-1, in this page, a user cannot possibly post Chinese characters. Since all post data uses ISO-8859-1 encoding, it causes Chinese characters to lose some bytes. That is the HTML standard, which all browsers abide by.

HTTP GET encoding

Things become more troublesome when you add multibyte characters to URL links, like <A href = getuser.jsp?name=**>View detail information of this user</A> (** represents multibyte characters). Such scenarios are common; for example, you can put usernames or other information in links and transfer them to the next page. But when non US-ASCII characters appear in a URL, its format is not clearly defined in RFC (request for comment) 2396. Different browsers use their own methods for encoding multicharacters in URLs.

Take Mozilla, for example, (shown in figures 7, 8, 9, 10); it will always perform URL encoding before the HTTP request is sent. As we know, during the URL-encoding process, a multibyte character first converts into two or more bytes using some encoding scheme (such as UTF-8 or GBK). Then, each byte is characterized by the 3-character string %xy, where xy is the byte's two-digit hexadecimal representation. For more information, please consult the HTML Specification. However, which encoding scheme the URL-encoding method uses, depends on the current page's encoding scheme.

I use the following gbk_test.jsp page as a demo:

Listing 3. gbk_test.jsp

<%@page contentType="text/html;charset=GBK"%>
<HTML>
   <BODY>
      <a href='/chartest/servlet/httpGetTest?name=王'><h1>Test for GBK encoded URL</h1></a>
   </BODY>
</HTML>

The x738b is the escape sequence of a Chinese character that is my family name. This page displays as Figure 7.

Figure 7. URL in Mozilla

When the mouse moves above the link, you can see the link's address in the status bar, which shows a Chinese character embedded inside this URL. When you click the link in the page, you can see clearly in the address bar that this character is URL-encoded. Character x738b encodes to %CD%F5, which is the result of URL encoding combined with GBK encoding. And on the server side, I can get the query string using a simple method, request.getQueryString(). In the next line, I use another method, request.getParameter(String), to show this character as a comparison to the query string, shown in Figure 8 .

Figure 8. URL encoding in Mozilla

When I change the current page's encoding from GBK to UTF-8, then click the link in the page again, you can see the result: x738b encodes to %E7%8E%8B, shown as Figure 9, which is the result of URL encoding combined with UTF-8 encoding.

Figure 9. URL encoding in Mozilla

But Microsoft Internet Explorer treats the multibyte URL differently. IE never completes URL encoding before the HTTP request is sent; the encoding scheme the URL-encoding method uses depends on the current page's encoding scheme, shown in Figure 10.

Figure 10. No URL encoding in IE

IE also has an advanced optional setting that forces the browser to always send the URL request with UTF-8 encoding, shown in Figure 11.

Figure 11. Advance option setting in IE

According to the above explanation, you will face a problem: if your application pages have multibyte characters embedded into URL links and can work using Mozilla with GBK encoding, this application will encounter problems when users employ IE with the setting that forces the browser to always send the URL request with UTF-8 encoding.

Solution to multibyte-character problems

Writing J2EE applications that can run on any server and be viewed correctly using any browsers is a challenge. Some solutions for multibyte-character problems in J2EE applications follow:

General principle: Never assume any default settings on both the client side (browser) and server side.

  • In the edit phase, never assume that your IDE's default encoding settings are what you want; set them manually.
  • If your IDE does not support a specific language, you can use the \uXXXX escape sequence in your Java code and use the \uXXXX escape sequence in your HTML pages, or use the native2ascii tool shipped with the JDK to convert the native literal string to a Unicode escape sequence. That can help you avoid most of your problems.
  • In the coding phase, never assume your server's default encoding-processing settings are correct . Use the following methods to give specific instructions:
  • Request: setCharacterEncoding()
  • Response: setContentType(), setLocale(), <%@ page contentType="text/html; charset=encoding" %>
  • When developing multilanguage applications, choose a UTF-8-encoding scheme or use the \uXXXX escape sequence for all language characters.
  • When compiling a Java class, ensure the current language environment variables and encoding scheme are correctly set.
  • In the configuration phase, use the standard setting as much as possible. For example, in the Servlet 2.4 specification, a standard is available for configuring every application's character-encoding scheme:
  • <locale-encoding-mapping-list>
        <locale-encoding-mapping>
            <locale>ja</locale>
            <encoding>Shift_JIS</encoding>
        </locale-encoding-mapping>
    </locale-encoding-mapping-list>
    
  • When communicating with an external system, find out as much as possible about that system's encoding scheme. Do the conversion if different encoding is used. Use UnicodeFormatter.java as a debugger to print all the bytes:
  • Listing 4. UnicodeFormatter.java

  • import java.io.*;
    public class UnicodeFormatter  {
       static public String byteToHex(byte b) {
          // Returns hex String representation of byte b
          char hexDigit = {
             '0', '1', '2', '3', '4', '5', '6', '7',
             '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
          };
          char array = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
          return new String(array);
       }
       static public String charToHex(char c) {
          // Returns hex String representation of char c
          byte hi = (byte) (c >>> 8);
          byte lo = (byte) (c & 0xff);
          return byteToHex(hi) + byteToHex(lo);
       }
    } 
    
  • Always give obvious instructions to browsers in HTML pages, such as <meta http-equiv="content-type" content="text/html;charset=gb2312">, and do not assume that the browsers' default settings are correct.
  • Do not embed multibyte characters into links. For example, do not take usernames as query strings, take the user's ID instead.
  • If your links must embed multibyte characters, encode the URL manually, either through server-side Java programming or client-side programming, such as JavaScript or VBscript.

A harder problem to solve: UTF-16

Using the above knowledge, let's analyze a real problem in one of my ISV's (independent software vendor) projects: UTF-16 in J2EE.

The current Chinese character standard (GB18030) defines and supports 27,484 Chinese characters. Though this number seems large, it is not substantial enough to satisfy all Chinese people. Today, the Chinese language has more than 60,000 characters and is rapidly increasing every year. This situation will greatly hinder the Chinese government in its effort to achieve information digitalization. For example, my sister's given name is not in the standard character set, so bank or mail-system computers cannot print it.

My ISV wants to build a complete Chinese character system to satisfy all people. It defines its own character repertoire. Two options exist for defining these characters' character code: Use the GB18030 standard, which can extend to more than 160,00,000 characters. Or use Unicode 3.1, which can support 1,112,064 characters. The GB18030 standard defines encoding rules, also called GB18030; it is simple to use and the current JDK supports it. However, if we use Unicode 3.1, we can choose from three encoding schemes: UTF-8, UTF-16, or UTF-32.

My ISV wants to use UTF-16 encoding to handle its Unicode extension for Chinese characters. The most important feature of UTF-16 encoding is that all the ASCII characters are encoded as 16-bit units, which causes problems at all phases. After trying several servers, the ISV found that J2EE applications cannot support UTF-16 encoding at all. Is this true? Let's analyze every development phase to find the problems.

Edit phase

If we have multibyte literal strings in our Java, JSP, or HTML source files, we need the IDE's support. I use NetBeans, which can easily support UTF-16 encoding; just set the text-encoding attribute to UTF-16. Figure 12 shows a simple UTF-16-encoded JSP page containing only the static literal string "hello world!" This page executes in Tomcat and displays in Mozilla.

Figure 12. UTF-16 page in Mozilla

Compile phase

Since we have UTF-16-encoded characters in our Java or JSP source files, we need compiler support. We can use javac -encoding UTF-16 to compile Java source files. With NetBeans, setting the compiler attribute through the GUI is easy. By running some simple code, we find that we can use UTF-16-encoded characters in servlet files and execute them with no problems.

Compiling JSP files dynamically at runtime proves trickier. Fortunately, most servers can be configured to set Java encoding for its JSP pages. But, unfortunately, when tested in Tomcat and Sun ONE Application Server, I found that the Jasper tool, which converts JSP files to servlet Java source files, fails to recognize JSP tags, such as <%page..%>, encoded with UTF-16—all these tags are treated as literal strings! I think the root cause may lie in Jasper—which most application servers use as a JSP compiler—because it uses byte unit to detect JSP special tokens and tags.

Browser test

Presently, we find that JSP cannot support literal UTF-16-encoded characters because of the failure to detect the UTF-16-encoded JSP tags. But servlets can work with no problems.

Hold on! To make the test more meaningful, let's add a POST function to our test code to let users post some UTF-16 encoded characters through the HTML's form tag. Download the following demos from this article's source code: servlet PostForm.java and servlet ByteTest.java. Servlet PostForm.java is used to output a UTF-16-encoded page with a form ready to post data to the server. And in ByteTest.java, I do not use request.getParameter() to show the post data from the browser because I am unsure if the server is configured for UTF-16 encoding. Instead, I use request.getInputStream() to retrieve the raw data from the request and print every byte of whatever we get from the browser.

Listing 5. PostForm.java

public class PostForm extends HttpServlet {
    ....    
    protected void processRequest(HttpServletRequest request, HttpServletResponse response)
    throws ServletException, IOException {
        response.setContentType("text/html;charset=UTF-16");
        PrintWriter out = response.getWriter();
        out.println("<html><head>");
        out.println("<meta content="text/html; charset=UTF-16\" http-equiv="content-type\">");
        out.println("</head><body>");
        out.println("<form action=\"servlet/ByteTest\" method=\"POST\">");
        out.println("<input type=\"text\" name=\"name\"><input type=\"submit\">");
        out.println("</form></body></html>");
        out.close();
    }
    
   ....    
}

Listing 6. ByteTest.java

public class ByteTest extends HttpServlet {
      ... 
    protected void processRequest(HttpServletRequest request, HttpServletResponse response)
    throws ServletException, IOException {
        ServletInputStream in = request.getInputStream();
        response.setContentType("text/html");
        PrintWriter out = response.getWriter();
        byte[] postdata = new byte[50];
        int size = in.read(postdata,0,50);
        in.close();
        out.println("<html>");
        out.println("<head>");
        out.println("<title>Servlet</title>");
        out.println("</head>");
        out.println("<body>");
        printBytes(out,postdata, size, "postdata");
        out.println("</body>");
        out.println("</html>");
        out.close();
    }
...
}

When run, the PostForm page is obviously encoded with UTF-16. So what is the output result from the servlet ByteTest?

  • Internet Explorer: Whatever characters we input, the browser performs UTF-8 encoding in this UTF-16-encoded page.
  • Mozilla: Whatever characters we input in this UTF-16-encoded page, only one character = is shown, an obviously wrong result.

Result

UTF-16 encoding can be used in a J2EE application only if it:

  • Uses only servlet technology
  • Limits the browser types to IE
  • Performs UTF-8 decoding on the server side in spite of UTF-16-encoded pages on the browser side

In fact, the UTF-8 encoding schema can be used in J2EE applications with no difficulties. In Unicode 3.1, UTF-8 can encode the same number of characters as UTF-16. UTF-8 differs from UTF-16 in storage and processing efficiency.

Conclusion

So, if you have multibyte-character problems in J2EE applications, you must dive into every phase of the development lifecycle, check the configurations on both the server and client side, and use the debug tools to help you find these problems' root cause.

Wang Yu presently works for Sun Microsystems as a Java technology engineer and technology architecture consultant. His duties include supporting local ISVs, evangelizing and consulting on important Java technologies such as J2EE, EJB (Enterprise JavaBeans), JSP/Servlet, JMS (Java Message Service), and Web services technologies.

Learn more about this topic

Join the discussion
Be the first to comment on this article. Our Commenting Policies
See more