Multibyte-character processing in J2EE

Develop J2EE applications with multibyte characters


Runtime phase

At runtime, your J2EE application most likely communicates with external systems. Your applications may read and write files or use databases to manage data. In other cases, an LDAP (Lightweight Directory Access Protocol) server stores identity information. In all these situations, data must be exchanged between the J2EE application and external systems. If that data contains multibyte characters, such as Chinese characters, you may face some issues.

Most external systems have their own encoding settings. For example, an LDAP server most likely uses UTF-8 to encode characters, while Oracle uses the environment variable NLS_LANG to indicate the encoding style. If you install Oracle on a Chinese OS, this variable defaults to ZHS16GBK, which uses GBK encoding to store Chinese characters. So if your J2EE application's encoding settings differ from the external system's, a conversion is needed. The following code is common in these situations:

// Re-encode with the charset the string was (incorrectly) decoded with
// to recover the raw bytes...
byte[] defaultBytes = original.getBytes(current_encoding);
// ...then decode those bytes with the charset the data was actually stored in
String newEncodingStr = new String(defaultBytes, old_encoding);

The above code converts a string from one encoding to another. For example, suppose you have stored a username (multibyte characters) in an LDAP server that uses UTF-8 encoding, while your J2EE application uses GBK encoding. When your application fetches usernames from LDAP, they may not be decoded correctly. To resolve this, use original.getBytes("GBK") to recover the original bytes, then construct a new string with new String(defaultBytes, "UTF-8"), which displays correctly.
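To make the conversion concrete, here is a minimal, self-contained sketch of this repair (the class name is hypothetical). Note that the round trip is lossless only when every intermediate byte sequence happens to be representable in the wrong charset; characters whose bytes have no GBK mapping would be corrupted, so fixing the encoding settings themselves is always preferable:

import java.io.UnsupportedEncodingException;

public class EncodingRepairDemo {
   public static void main(String[] args) throws UnsupportedEncodingException {
      // Bytes as stored in the LDAP server (UTF-8)...
      byte[] stored = "\u4e2d\u6587".getBytes("UTF-8");
      // ...wrongly decoded by the application as GBK
      String garbled = new String(stored, "GBK");
      // Re-encode as GBK to recover the raw bytes, then decode as UTF-8
      String repaired = new String(garbled.getBytes("GBK"), "UTF-8");
      System.out.println(repaired); // prints the original characters again
   }
}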

Client display phase

Most J2EE applications now use the browser-server architecture, which employs browsers as their clients. To display multibyte characters correctly in browsers, you should take note of the following:

Browser language support

To display multibyte characters correctly, the browser and the OS on which the browser runs must have language-specific support, such as fonts and the character repertoire.

Browser encoding settings

The HTML header that the server returns, such as <meta http-equiv="content-type" content="text/html;charset=gb2312">, tells the browser which encoding the page uses. Otherwise, the browser falls back to its default encoding setting or tries to detect one automatically. Alternatively, users can set the page's encoding manually, as shown in Figure 6.

Figure 6. Netscape's encoding-setting page

Thus, if a page lacks such an instruction, multibyte characters may display incorrectly, and users must set the current page's encoding manually.
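A minimal JSP sketch of this practice follows: the page directive makes the server send the charset in the HTTP Content-Type header, while the meta tag repeats it inside the document for cases, such as locally saved pages, where no HTTP header is available:

<%@page contentType="text/html;charset=gb2312"%>
<HTML>
   <HEAD>
      <meta http-equiv="content-type" content="text/html;charset=gb2312">
   </HEAD>
   <BODY>
      ...
   </BODY>
</HTML>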

HTTP POST encoding

The situation grows more complicated when you post data to the server through the form tag in HTML pages. The encoding the browser uses for the posted data depends on the encoding of the page that contains the form tag. That means a user cannot post Chinese characters from an HTML page encoded in ISO-8859-1: all posted data is encoded in ISO-8859-1, so the multibyte characters lose bytes. This behavior is part of the HTML standard, and all browsers abide by it.
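On the server side, the corresponding servlet must tell the container which charset to use when decoding the posted bytes. The following is a minimal sketch (the servlet and parameter names are hypothetical); note that setCharacterEncoding() must be called before the first getParameter() call and must match the charset of the page that contained the form:

import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class PostHandler extends HttpServlet {
   protected void doPost(HttpServletRequest request, HttpServletResponse response)
         throws ServletException, IOException {
      // Must match the encoding of the page that contained the form tag
      request.setCharacterEncoding("GBK");
      String name = request.getParameter("name");
      // Declare the response encoding explicitly as well
      response.setContentType("text/html;charset=GBK");
      response.getWriter().println("Hello, " + name);
   }
}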

HTTP GET encoding

Things become more troublesome when you add multibyte characters to URL links, like <A href = getuser.jsp?name=**>View detail information of this user</A> (** represents multibyte characters). Such scenarios are common; for example, you can put usernames or other information in links and transfer them to the next page. But when non-US-ASCII characters appear in a URL, its format is not clearly defined in RFC (Request for Comments) 2396, so different browsers use their own methods for encoding multibyte characters in URLs.

Take Mozilla, for example (shown in Figures 7, 8, 9, and 10); it always performs URL encoding before the HTTP request is sent. During URL encoding, a multibyte character is first converted into two or more bytes using some encoding scheme (such as UTF-8 or GBK). Each byte is then represented by the three-character string %xy, where xy is the byte's two-digit hexadecimal value; for more information, consult the HTML specification. Which encoding scheme the URL-encoding step uses, however, depends on the current page's encoding scheme.
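You can reproduce this two-step process with java.net.URLEncoder. Here is a quick sketch (the class name is hypothetical) showing how the charset choice changes the encoded result for the character \u738b used in the demo below:

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UrlEncodingDemo {
   public static void main(String[] args) throws UnsupportedEncodingException {
      // The same character produces different %xy sequences per charset
      System.out.println(URLEncoder.encode("\u738b", "GBK"));   // prints %CD%F5
      System.out.println(URLEncoder.encode("\u738b", "UTF-8")); // prints %E7%8E%8B
   }
}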

I use the following gbk_test.jsp page as a demo:

Listing 3. gbk_test.jsp

<%@page contentType="text/html;charset=GBK"%>
<HTML>
   <BODY>
      <a href='/chartest/servlet/httpGetTest?name=王'><h1>Test for GBK encoded URL</h1></a>
   </BODY>
</HTML>

The character in the link, 王 (Unicode escape \u738b), is my family name. This page displays as shown in Figure 7.

Figure 7. URL in Mozilla

When the mouse hovers over the link, you can see the link's address in the status bar, with the Chinese character embedded in the URL. When you click the link, the address bar clearly shows that this character has been URL-encoded: \u738b becomes %CD%F5, the result of URL encoding combined with GBK encoding. On the server side, I retrieve the query string with a simple method, request.getQueryString(). On the next line, I use another method, request.getParameter(String), to show the same character as a comparison to the query string, as shown in Figure 8.

Figure 8. URL encoding in Mozilla
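The httpGetTest servlet itself is not listed in this article, but a minimal doGet() sketch (class name assumed) that would produce the output in Figure 8 looks like this:

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class HttpGetTest extends HttpServlet {
   protected void doGet(HttpServletRequest request, HttpServletResponse response)
         throws ServletException, IOException {
      response.setContentType("text/html;charset=GBK");
      PrintWriter out = response.getWriter();
      // The raw query string, still URL-encoded, for example name=%CD%F5
      out.println("Query string: " + request.getQueryString());
      // The same parameter after the container has decoded it
      out.println("Parameter: " + request.getParameter("name"));
   }
}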

When I change the current page's encoding from GBK to UTF-8 and click the link again, the result changes: \u738b encodes to %E7%8E%8B, as shown in Figure 9, the result of URL encoding combined with UTF-8 encoding.

Figure 9. URL encoding in Mozilla

But Microsoft Internet Explorer treats multibyte URLs differently. IE performs no URL encoding before the HTTP request is sent; it transmits the multibyte characters as raw bytes in the current page's encoding scheme, as shown in Figure 10.

Figure 10. No URL encoding in IE

IE also has an advanced optional setting that forces the browser to always send the URL request with UTF-8 encoding, shown in Figure 11.

Figure 11. Advanced option setting in IE

Given the above, you will face a problem: if your application's pages embed multibyte characters in URL links and work in Mozilla with GBK encoding, the application will break when users employ IE with the option that forces the browser to always send URL requests with UTF-8 encoding.

Solution to multibyte-character problems

Writing J2EE applications that can run on any server and display correctly in any browser is a challenge. Some solutions for multibyte-character problems in J2EE applications follow:

General principle: Never assume any default settings on either the client side (browser) or the server side.

  • In the edit phase, never assume that your IDE's default encoding settings are what you want; set them manually.
  • If your IDE does not support a specific language, you can use the \uXXXX escape sequence in your Java code and the &#xXXXX; escape sequence in your HTML pages, or use the native2ascii tool shipped with the JDK to convert native literal strings to Unicode escape sequences (see the usage example after this list). That helps you avoid most of these problems.
  • In the coding phase, never assume your server's default encoding-processing settings are correct. Use the following methods to give explicit instructions:
  • Request: setCharacterEncoding()
  • Response: setContentType(), setLocale(), <%@ page contentType="text/html; charset=encoding" %>
  • When developing multilanguage applications, choose a UTF-8-encoding scheme or use the \uXXXX escape sequence for all language characters.
  • When compiling a Java class, ensure the current language environment variables and encoding scheme are correctly set.
  • In the configuration phase, use the standard setting as much as possible. For example, in the Servlet 2.4 specification, a standard is available for configuring every application's character-encoding scheme:
    <locale-encoding-mapping-list>
        <locale-encoding-mapping>
            <locale>ja</locale>
            <encoding>Shift_JIS</encoding>
        </locale-encoding-mapping>
    </locale-encoding-mapping-list>
    
  • When communicating with an external system, learn as much as possible about that system's encoding scheme, and perform a conversion if the encodings differ. Use UnicodeFormatter.java as a debugging aid to print all the bytes:
    Listing 4. UnicodeFormatter.java

    import java.io.*;

    public class UnicodeFormatter {
       // Returns the hex string representation of byte b
       public static String byteToHex(byte b) {
          char[] hexDigit = {
             '0', '1', '2', '3', '4', '5', '6', '7',
             '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'
          };
          char[] array = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
          return new String(array);
       }

       // Returns the hex string representation of char c
       public static String charToHex(char c) {
          byte hi = (byte) (c >>> 8);
          byte lo = (byte) (c & 0xff);
          return byteToHex(hi) + byteToHex(lo);
       }
    }
    
  • Always give explicit instructions to browsers in HTML pages, such as <meta http-equiv="content-type" content="text/html;charset=gb2312">, and do not assume that the browsers' default settings are correct.
  • Do not embed multibyte characters in links. For example, pass the user's ID as a query string rather than the username.
  • If your links must embed multibyte characters, encode the URL manually, either through server-side Java programming or client-side scripting such as JavaScript or VBScript, as in the sketch after this list.
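For the last point, one way to perform the manual encoding on the server side is shown below. This is a hypothetical sketch based on gbk_test.jsp from Listing 3; it reuses that page's servlet path, and the \u738b literal stands in for any multibyte parameter value:

<%@page contentType="text/html;charset=GBK" import="java.net.URLEncoder"%>
<%-- Encode the multibyte parameter explicitly, so the link no longer
     depends on browser-specific URL-encoding behavior --%>
<% String name = "\u738b"; %>
<HTML>
   <BODY>
      <a href='/chartest/servlet/httpGetTest?name=<%= URLEncoder.encode(name, "GBK") %>'>
         <h1>View detail information of this user</h1>
      </a>
   </BODY>
</HTML>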
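As for the native2ascii tip in the edit-phase item above, typical usage looks like this (the file names are hypothetical): the tool reads a natively encoded file and writes an ASCII file in which every non-ASCII character becomes a \uXXXX escape sequence:

native2ascii -encoding GBK LoginResources_zh.txt LoginResources_ascii.txt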

A harder problem to solve: UTF-16

Using the above knowledge, let's analyze a real problem in one of my ISV's (independent software vendor) projects: UTF-16 in J2EE.

The current Chinese character standard (GB18030) defines and supports 27,484 Chinese characters. Though this number seems large, it still does not satisfy everyone: the Chinese language today has more than 60,000 characters, and the count grows every year. This situation greatly hinders the Chinese government's effort toward information digitalization. For example, my sister's given name is not in the standard character set, so bank and mail-system computers cannot print it.

My ISV wants to build a complete Chinese character system to satisfy everyone, so it defines its own character repertoire. Two options exist for assigning character codes to these characters: use the GB18030 standard, which can extend to more than 1,600,000 characters, or use Unicode 3.1, which supports 1,112,064 characters. The GB18030 standard defines its own encoding rules, also called GB18030; it is simple to use, and the current JDK supports it. If we use Unicode 3.1, however, we can choose from three encoding schemes: UTF-8, UTF-16, or UTF-32.

My ISV wants to use UTF-16 encoding to handle its Unicode extension for Chinese characters. The most important feature of UTF-16 is that all characters, including ASCII characters, are encoded as 16-bit units, which causes problems at every phase. After trying several servers, the ISV concluded that J2EE applications cannot support UTF-16 encoding at all. Is this true? Let's analyze every development phase to find the problems.
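A quick illustration of why this feature is troublesome: even a pure-ASCII string becomes a sequence of byte pairs, including zero bytes that byte-oriented tools may treat as terminators. The class name below is hypothetical:

import java.io.UnsupportedEncodingException;

public class Utf16Demo {
   public static void main(String[] args) throws UnsupportedEncodingException {
      byte[] bytes = "hi".getBytes("UTF-16");
      for (int i = 0; i < bytes.length; i++) {
         // Prints: fe ff 00 68 00 69 (byte-order mark, then 16 bits per character)
         System.out.printf("%02x ", bytes[i]);
      }
   }
}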

Edit phase

If we have multibyte literal strings in our Java, JSP, or HTML source files, we need the IDE's support. I use NetBeans, which can easily support UTF-16 encoding; just set the text-encoding attribute to UTF-16. Figure 12 shows a simple UTF-16-encoded JSP page containing only the static literal string "hello world!" This page executes in Tomcat and displays in Mozilla.

Figure 12. UTF-16 page in Mozilla

Compile phase

Since we have UTF-16-encoded characters in our Java or JSP source files, we need compiler support. We can use javac -encoding UTF-16 to compile Java source files. With NetBeans, setting the compiler attribute through the GUI is easy. By running some simple code, we find that we can use UTF-16-encoded characters in servlet files and execute them with no problems.

Compiling JSP files dynamically at runtime proves trickier. Fortunately, most servers can be configured to set the Java encoding for their JSP pages. Unfortunately, when I tested Tomcat and Sun ONE Application Server, I found that the Jasper tool, which converts JSP files into servlet Java source files, fails to recognize JSP tags, such as <%@ page ... %>, when they are encoded in UTF-16: all these tags are treated as literal strings! I suspect the root cause lies in Jasper, which most application servers use as a JSP compiler, because it works byte by byte when detecting JSP special tokens and tags.

Browser test

So far, we have found that JSP cannot support UTF-16-encoded literal characters because UTF-16-encoded JSP tags go undetected, but servlets work without problems.

Hold on! To make the test more meaningful, let's add a POST function to our test code so users can post UTF-16-encoded characters through HTML's form tag. Download the following demos from this article's source code: servlet PostForm.java and servlet ByteTest.java. PostForm.java outputs a UTF-16-encoded page with a form ready to post data to the server. In ByteTest.java, I do not use request.getParameter() to show the posted data from the browser, because I am unsure whether the server is configured for UTF-16 encoding. Instead, I use request.getInputStream() to retrieve the raw data from the request and print every byte of whatever we get from the browser.

Listing 5. PostForm.java
