Multibyte-character processing in J2EE

Develop J2EE applications with multibyte characters

The Chinese language is one of the most complex and comprehensive languages in the world. Sometimes I feel lucky to be Chinese, specifically when I see some of my foreign friends struggle to learn the language, especially writing Chinese characters. However, I do not feel so lucky when developing localized Web applications using J2EE. This article explains why.

Though the Java platform and most J2EE servers support internationalization well, I am still confronted by many multibyte-character problems when developing Chinese or Japanese language-based applications:

  • What is the difference between encoding and charset?
  • Why do multibyte-character applications display differently when ported from one operating system to another?
  • Why do multibyte-character applications display differently when ported from one application server to another?
  • Why do my multibyte-character applications display well on the Internet Explorer browser but not on the Mozilla browser?
  • Why do applications on most J2EE servers display poorly when using UTF-16 (Unicode Transformation Format, 16-bit) encoding?

If you are asking the same set of questions, this article helps you answer them.

Basic knowledge of characters

Characters existed long before computers. More than 3,000 years ago, special characters (known as oracle bone script) appeared in ancient China. These characters have special visual forms and special meanings, with most having names and pronunciations. All of these facets compose the character repertoire, a set of distinct characters defined by a particular language, with no relationship to the computer at all. Over thousands of years, many languages evolved and thousands of characters were created. And now we are trying to digitize all these characters into 1s and 0s, so computers can understand them.

When typing words with a keyboard, you deal with character input methods. For simple characters, there is one-to-one mapping between a key and a character. For a more complex language, a character needs multiple keystrokes.

Before you can see characters on the screen, the operating system must store them in memory. To do so, the OS defines a one-to-one correspondence between the characters in a character repertoire and a set of nonnegative integers, which are stored in memory and used by the OS. These integers are called character codes.

Characters can also be stored in a file or transmitted through the network. For that, software uses a character encoding: a method (algorithm) for mapping a sequence of character codes into a sequence of octets. Some character codes map into one byte, such as ASCII codes; others, such as the codes for Chinese and Japanese characters, map into two or more bytes, depending on the character-encoding schema.
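To make the distinction between a character code and its encoded octets concrete, here is a minimal Java sketch. The class name CodeVersusEncoding and its helper methods are illustrative assumptions, not part of this article's source code. It uses only charsets that every Java platform is required to provide, and it writes the Chinese character as a Unicode escape so the example does not depend on the encoding of the source file itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodeVersusEncoding {

    // The character code: a nonnegative integer, independent of any byte layout.
    public static int characterCode(String s) {
        return s.codePointAt(0);
    }

    // The encoding: the octets a given schema produces for the text, shown as hex.
    public static String encodedHex(String s, Charset charset) {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            if (hex.length() > 0) {
                hex.append(' ');
            }
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String latin = "A";
        String chinese = "\u4e2d"; // a common Chinese character, written as an escape

        System.out.println("A: code=" + characterCode(latin)
                + " UTF-8=[" + encodedHex(latin, StandardCharsets.UTF_8) + "]");
        System.out.println("U+4E2D: code=" + characterCode(chinese)
                + " UTF-8=[" + encodedHex(chinese, StandardCharsets.UTF_8) + "]"
                + " UTF-16BE=[" + encodedHex(chinese, StandardCharsets.UTF_16BE) + "]");
    }
}
```

The same character code (0x4E2D) becomes three octets in UTF-8 and two in UTF-16BE, which is why a reader must know the encoding before the bytes mean anything.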

Different languages may use different character repertoires; each character repertoire uses some special encodings. Sometimes, when you choose a language, you may choose a character repertoire implicitly, which uses an implied character encoding. For example, when you choose the Chinese language, you may, by default, use the GBK Chinese character repertoire and a special encoding schema also named GBK.

I avoid the term character set because it causes confusion. Strictly speaking, character set is a synonym for character repertoire. However, the term is misused in the HTTP MIME (Multipurpose Internet Mail Extensions) header, where "charset" actually means "encoding."
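As an illustration of where that "charset" label shows up, the following sketch pulls the charset parameter out of a Content-Type header value such as text/html; charset=GBK. The class and method names are hypothetical, and the parsing is deliberately simplified (for example, it does not handle semicolons inside quoted values), so treat it as a sketch rather than a full header parser.

```java
import java.util.Locale;

public class ContentTypeCharset {

    // Returns the value of the "charset" parameter, which really names an
    // encoding, or null when the header value carries no such parameter.
    public static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type (for example "text/html"); parameters follow.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (name.equals("charset")) {
                String value = param.substring(eq + 1).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=GBK"));
        System.out.println(charsetOf("text/html"));
    }
}
```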

One of Java's many features is the 16-bit character. This feature supports the use of Unicode, a standard way of representing many different kinds of characters in various languages. Unfortunately, the same feature also causes many problems when developing multibyte J2EE applications, and those problems are the focus of this article.

Development phases cause display problems

J2EE application development includes several phases (shown in Figure 1); each phase can cause multibyte-character display problems.

Figure 1. J2EE application development life cycle

Coding phase

When you code your J2EE applications, most likely, you use an IDE like JBuilder or NetBeans, or an editor like UltraEdit or vi. Whatever you choose, if you have a literal string in your JSP (JavaServer Pages), Java, or HTML files, and if these literal strings are multibyte characters such as Chinese or Japanese, most likely, you will encounter display problems if you are not careful.

A literal string is static information stored in files. Different encodings are used for different language characters. Most IDEs set their default encoding to ISO-8859-1, a single-byte encoding for Western European (Latin-1) characters that causes multibyte characters to lose information. For example, in the Chinese version of NetBeans, the default setting for file encoding is, unfortunately, ISO-8859-1. When I edit a JSP file with some Chinese characters (shown in Figure 2), everything seems correct. As mentioned above, the characters shown on the screen are held in memory as character codes and have no direct relationship with encoding. After saving this file, if you close the IDE and reopen it, these characters appear incomprehensible (shown in Figure 3) because ISO-8859-1 cannot represent Chinese characters, so information is lost when the file is stored.

Figure 2. Chinese characters in NetBeans
Figure 3. Chinese characters in chaos
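The loss described above can be reproduced without an IDE. The following minimal sketch (the class name LossySave and the method saveAndReopen are made up for illustration) simulates saving text with one encoding and reopening it with the same encoding. With ISO-8859-1, each Chinese character is replaced by a question mark at save time, so nothing can bring it back afterward.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LossySave {

    // Simulates "save with this encoding, then reopen with the same encoding".
    public static String saveAndReopen(String text, Charset encoding) {
        byte[] onDisk = text.getBytes(encoding); // what the editor writes to the file
        return new String(onDisk, encoding);     // what the editor reads back later
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // two Chinese characters, written as escapes

        // ISO-8859-1 has no mapping for these characters, so each becomes '?'.
        System.out.println("ISO-8859-1 round trip: "
                + saveAndReopen(chinese, StandardCharsets.ISO_8859_1));

        // UTF-8 can represent them, so the round trip preserves the text.
        System.out.println("UTF-8 round trip preserved: "
                + saveAndReopen(chinese, StandardCharsets.UTF_8).equals(chinese));
    }
}
```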

Character-encoding APIs

There are several APIs in the servlet and JSP specifications that handle the character-encoding process in J2EE applications. For a servlet request, setCharacterEncoding() sets the encoding schema for the current HTTP request's body. For a servlet response, setContentType() and setLocale() set the MIME header encoding for the output HTTP response.

These APIs cause no problems themselves; the problems arise when you forget to use them. For example, in some servers, you can display multibyte characters properly without using any of the above APIs in your code, but when you run the application in other servers, characters appear incomprehensible. The reason for this multibyte-character display problem lies in how the servers treat character encoding during HTTP requests and responses. The following rules apply to most servers when determining the character encoding in requests and responses:

When processing a servlet request, the server uses the following order of precedence, first to last, to determine the request character encoding:

  • Code-specific settings (for example: the setCharacterEncoding() method)
  • Vendor-specific settings
  • The default setting

When processing a servlet response, the server uses the following order of precedence, first to last, to determine the response character encoding:

  • Code-specific settings (for example: the setContentType() and setLocale() methods)
  • Vendor-specific settings
  • The default setting

According to the above rules, if you set the encoding explicitly in code using these APIs, all servers will obey your setting when choosing the character-encoding schema. Otherwise, different servers will behave differently. Some vendors use hidden fields in the HTTP form to determine the request encoding; others use specific settings in their own configuration files. The default settings can differ as well. Most vendors use ISO-8859-1 as the default setting, while a few use the OS's locale settings. Thus, some multibyte character-based applications have display problems when ported to another vendor's J2EE server.
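The order of precedence described above can be sketched as a small function. This is only a model of the rule, not any server's actual implementation: the class name, the resolve() method, and the use of null to mean "not set" are all assumptions made for illustration. The fallback value reflects the observation that most vendors default to ISO-8859-1.

```java
public class EncodingPrecedence {

    // Assumed fallback; some vendors use the OS locale instead.
    public static final String DEFAULT_ENCODING = "ISO-8859-1";

    // codeSetting models a value set in code (for example through
    // setCharacterEncoding() or setContentType()); vendorSetting models a
    // vendor-specific configuration value. null means "not set".
    public static String resolve(String codeSetting, String vendorSetting) {
        if (codeSetting != null) {
            return codeSetting;      // 1. code-specific setting wins
        }
        if (vendorSetting != null) {
            return vendorSetting;    // 2. then the vendor-specific setting
        }
        return DEFAULT_ENCODING;     // 3. finally the server default
    }

    public static void main(String[] args) {
        System.out.println(resolve("GBK", "UTF-8")); // code-specific setting
        System.out.println(resolve(null, "UTF-8"));  // vendor-specific setting
        System.out.println(resolve(null, null));     // default setting
    }
}
```

Only the first case behaves identically on every server, which is the practical argument for always setting the encoding in code.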

Compile phase

You can store multibyte literal strings, if correctly set, in source files during the edit phase. But these source files cannot execute directly. If you write servlet code, these Java files must be compiled to classfiles before deploying to the application server. For JSP, the application server will automatically compile the JSP files to the classfiles before executing them. During the compile phase, character-encoding problems are still possible. To see the following simple demo, download this article's source code.

Listing 1. EncodingTest.java

1      import java.io.ByteArrayOutputStream;
2      import java.io.OutputStreamWriter;
3
4      public class EncodingTest {  
5         public static void main(String[] args) {
6            OutputStreamWriter out = new OutputStreamWriter(new ByteArrayOutputStream());
7            System.out.println("Current Encoding:  "+out.getEncoding());
8            System.out.println("Literal output:  ��ã�"); // You may not see this Chinese String
9         }
10     }

Some explanation about the source code:

  • We use the following code (lines 6 and 7) to determine the system's current encoding:

    6    OutputStreamWriter out = new OutputStreamWriter(new ByteArrayOutputStream());
    7    System.out.println("Current Encoding:  "+out.getEncoding());
    
  • Line 8 includes a direct print-out of a Chinese character literal string (you may not see this string correctly due to your OS language settings)
  • Store this Java source file with GBK encoding
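On Java 5 and later (newer than the J2SE versions this article discusses), the default encoding can also be queried directly. The following sketch places the listing's approach next to Charset.defaultCharset(); the class and method names are illustrative. Note that getEncoding() may return a historical name such as UTF8 or Cp1252, while Charset.name() returns the canonical name.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

public class DefaultEncoding {

    // The listing's approach: ask a writer that was created without an
    // explicit encoding which encoding it ended up with.
    public static String viaWriter() {
        OutputStreamWriter out = new OutputStreamWriter(new ByteArrayOutputStream());
        return out.getEncoding();
    }

    // Direct query: the canonical name of the JVM's default charset.
    public static String viaCharset() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Writer encoding:  " + viaWriter());
        System.out.println("Default charset:  " + viaCharset());
        // The system property the default has traditionally been derived from.
        System.out.println("file.encoding:    " + System.getProperty("file.encoding"));
    }
}
```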

Look at the execution result shown in Figure 4.

Figure 4. Sample output

From the result in Figure 4, we can conclude that:

  • The Java compiler (javac) uses the system's language environment as its default encoding setting, as does the Java Runtime Environment.
  • Only the first result is correct; other strings display incomprehensibly.
  • Only when the runtime encoding setting is the same as the one used to store the source file can multibyte literal strings display correctly (alternatively, you must convert from one encoding schema to another; please see the "Runtime phase" section).
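The mismatch behind these conclusions can be sketched in a few lines. The example below stands in for the GBK case with UTF-8 and ISO-8859-1, because those two charsets are guaranteed to exist on every Java platform; the class and method names are illustrative. It shows text stored in one encoding and read as another, and then the conversion that recovers it. At compile time, the equivalent remedy is to tell javac the source file's real encoding with its -encoding option.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MismatchedDecoding {

    // Bytes written with 'stored' but interpreted as if they were 'assumed'.
    public static String decodeAs(String text, Charset stored, Charset assumed) {
        return new String(text.getBytes(stored), assumed);
    }

    // Undo the mistake: turn the garbled text back into the original bytes
    // using the wrongly assumed charset, then decode with the right one.
    // This works only when the wrong decoding lost nothing, as is the case
    // with ISO-8859-1, which maps every byte value to a character.
    public static String repair(String garbled, Charset assumed, Charset stored) {
        return new String(garbled.getBytes(assumed), stored);
    }

    public static void main(String[] args) {
        String original = "\u4f60\u597d"; // two Chinese characters, as escapes

        String garbled = decodeAs(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("Garbled length: " + garbled.length()); // one char per stored byte

        String repaired = repair(garbled, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("Repaired equals original: " + repaired.equals(original));
    }
}
```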

Server configuration phase

Before you run your J2EE application, you should configure your application to meet special needs. In the previous section, we found that different language settings can cause literal-string display problems. Actually, different levels of configuration exist, and they all can cause problems for multibyte characters.

OS level

The operating system's language support is the most important level. Language support on the server side affects the JVM's default encoding settings, as described above. Language support on the client side, such as fonts, can also directly affect character display, but this article doesn't focus on that.

J2EE application server level

Most servers have a per-server setting to configure the default behavior of character-encoding processing. For example, Listing 2 is part of Tomcat's configuration file (located in $TOMCAT_HOME/conf/web.xml):

Listing 2. web.xml

<servlet>
        <servlet-name>jsp</servlet-name>
        <servlet-class>org.apache.jasper.servlet.JspServlet</servlet-class>
        <init-param>
            <param-name>fork</param-name>
            <param-value>false</param-value>
        </init-param>
        <init-param>
            <param-name>javaEncoding</param-name>
            <param-value>UTF8</param-value>
        </init-param>
        <load-on-startup>3</load-on-startup>
</servlet>

Tomcat uses the javaEncoding parameter to define the file encoding of the Java source files it generates from JSP files; the default is UTF-8. That means if you store Chinese characters in your JSP with GBK encoding and you want to display your characters using UTF-8 (browser setting), problems may result.

JVM level

Most servers can have multiple instances simultaneously, and each server instance can have an individual JVM instance. Plus, you can have separate settings for each JVM instance. Most servers have locale settings for each instance to define the default language support.

Figure 5. Sun ONE Application Server setting

As shown in Figure 5, the Sun ONE (Open Network Environment) Application Server has a per-instance setting for locale. This setting determines the default character encoding for the logging system and standard output.

In addition, different servers may use distinct JVM versions, and different JDK versions support different encoding standards. All these issues can cause porting problems. For example, Sun ONE Application Server and Tomcat support J2SE 1.4, while others support only J2SE 1.3. J2SE 1.4 supports Unicode 3.0, which has many new features previous versions lacked.

Per-application level

Every application deployed on the server can be configured with its unique encoding settings before it runs within the server container. This feature allows multiple applications using different languages to run inside one server instance. For example, in some servers, you can give the following character-encoding settings for each deployed application to indicate which encoding schema your application should use:

<locale-charset-info default-locale="en_US">
      <locale-charset-map locale="zh_CN" agent="Mozilla/4.77 [en] (Windows NT 5.0; U)" charset="GBK"/>
</locale-charset-info>

All these configuration levels exist for the sake of flexibility and maintainability. Unfortunately, they also cause problems when porting from one server to another, because not all configurations adhere to standards. For example, if you develop your application in a server that supports the locale-charset-info setting, you may have difficulties if you want to port the application to another server that does not support this encoding setting.
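One way to reduce that dependence is to keep the locale-to-encoding mapping in the application itself rather than in a vendor descriptor. The sketch below is purely hypothetical: the LocaleCharsetMap class, its methods, and the idea of a fallback charset are assumptions for illustration, and it ignores the agent attribute shown in the descriptor above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class LocaleCharsetMap {

    private final Map<Locale, String> charsets = new HashMap<>();
    private final String fallbackCharset;

    // fallbackCharset is used for any locale without an explicit mapping.
    public LocaleCharsetMap(String fallbackCharset) {
        this.fallbackCharset = fallbackCharset;
    }

    // Registers an encoding name for a locale; returns this for chaining.
    public LocaleCharsetMap map(Locale locale, String charset) {
        charsets.put(locale, charset);
        return this;
    }

    // Looks up the encoding name the application should use for a locale.
    public String charsetFor(Locale locale) {
        return charsets.getOrDefault(locale, fallbackCharset);
    }

    public static void main(String[] args) {
        LocaleCharsetMap mapping = new LocaleCharsetMap("ISO-8859-1")
                .map(Locale.SIMPLIFIED_CHINESE, "GBK"); // zh_CN

        System.out.println(mapping.charsetFor(Locale.SIMPLIFIED_CHINESE));
        System.out.println(mapping.charsetFor(Locale.US));
    }
}
```

The value returned by such a lookup could then be passed to the servlet APIs discussed earlier, so the behavior no longer depends on which descriptor elements a given server understands.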
