An in-depth look at Java's character type

Eight (bits) is not enough -- Java's character type adds another eight

At Haloid, the executives were focused on the generation of copies, not on the folks who received documents. Similarly, in Java the classes are focused on converting between Unicode characters and the native form understood by the underlying platform (the system upon which the JVM is running). Typical (and by far the easiest) usage of the InputStreamReader class is simply to instantiate a reader around a byte stream, as shown above. For disk files, the process is streamlined into a single class named FileReader. In typical usage, when the class is instantiated, it "plugs in" the default encoder for the platform. On Windows and Unix this appears to be "8859_1" -- the encoder that reads ISO Latin-1 files and converts them into Unicode. However, as with copiers, there is another use: converting from some other platform's native format into Unicode, and then from Unicode into the local platform's format. I will demonstrate a bit later how this ability (multiplatform conversion) is completely missing from the current design.
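To make the typical usage concrete, here is a minimal sketch of both idioms -- wrapping a byte stream in an InputStreamReader, and the FileReader shorthand for disk files. (The file name data.txt is, of course, just a placeholder.)

    import java.io.*;

    public class ReaderDemo {
        public static void main(String[] args) throws IOException {
            // Wrap any byte stream in a reader; the platform's default
            // encoder (8859_1 on Windows and Unix) turns the incoming
            // bytes into Unicode chars.
            InputStreamReader in =
                new InputStreamReader(new FileInputStream("data.txt"));

            // FileReader streamlines the same idiom for disk files.
            FileReader fr = new FileReader("data.txt");

            int c;
            while ((c = fr.read()) != -1)
                System.out.print((char) c);  // each value read is a Unicode char

            fr.close();
            in.close();
        }
    }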

There are other encoders -- those that convert between JIS (Japanese) characters and Unicode, Han (Chinese) characters and Unicode, and so on. And yet, if you peruse the documentation that comes with the JDK (also available at the JavaSoft Web site), you will be hard-pressed to find out what the other encoders are. Further, there is no apparent API with which one might enumerate the available encoders. The only way to really know the list of the available encoders is to read the source code for InputStreamReader. And that, unfortunately, uncovers a brand new wart.

In the source for InputStreamReader, the conversion function is encapsulated in a Sun-specific class called sun.io.ByteToCharConverter. Like all classes in the private sun.* packages, it is neither documented nor shipped in source form with the JDK, so what this class actually does is something of a mystery. Fortunately, we can get the source code for the entire JDK from Sun, and therein discover what this class does. There is a constructor to InputStreamReader that takes a string naming the encoder to use. In the Sun implementation, that string is pasted into the template "sun.io.ByteToCharXXX," where the XXX part is replaced by the string you passed to the constructor. Then, by looking at the classes packed into the classes.zip file supplied with the JDK, you can identify several encoders with names like ByteToChar8859_1, ByteToCharCp1255, and so on. Again, as they are Sun classes, they aren't documented. However, you can find some out-of-date documentation in the form of the Internationalization Specification on the JavaSoft Web site. Within that document there is a page describing most, if not all, of the supported encoders; a link to the exact page is provided in the Resources section. The problem, which I alluded to earlier, is that this architecture will not allow you to add your own converters.
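In the absence of an enumeration API, the closest you can come programmatically is to probe: hand candidate names to the InputStreamReader constructor and see which ones it rejects. A short sketch follows; the first two names come from the classes.zip listing, and "NoSuchEncoding" is deliberately bogus:

    import java.io.*;

    public class EncoderProbe {
        public static void main(String[] args) {
            String[] names = { "8859_1", "Cp1255", "NoSuchEncoding" };
            for (int i = 0; i < names.length; i++) {
                try {
                    // Internally, JDK 1.1 maps this name onto a
                    // sun.io.ByteToCharXXX class (ByteToChar8859_1, ...).
                    InputStreamReader r = new InputStreamReader(
                        new ByteArrayInputStream(new byte[0]), names[i]);
                    System.out.println(names[i] + ": supported (" +
                                       r.getEncoding() + ")");
                } catch (UnsupportedEncodingException e) {
                    // Trial and error is the only "enumeration" offered.
                    System.out.println(names[i] + ": not supported");
                }
            }
        }
    }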

The reason I consider this design flawed is twofold: there is no requirement for any Java-conformant platform to support any encoders beyond a basic platform-to-Unicode converter and its inverse; and there is no way to enumerate the encoders that are supported. This suggests that the people who wrote this code assumed that data files existing on a particular platform were, by definition, created on that platform. And yet, in this world of Internet-connected files on heterogeneous hardware, there is a significant likelihood that the file you want to read was not created on the local platform and therefore will require a special encoder. That requirement brings me to my final problem with this part of the system.

Assume for the moment that the file you are trying to read, and perhaps parse, in your Java application came from a non-local platform -- let's say a Commodore 64. Commodore used its own modified version of ASCII (PETSCII), so in order to read and convert a Commodore 64 file to Unicode, I need a specialized subclass of ByteToChar that understands Commodore's character format. The design of the Reader class allows for this; however, in the JDK implementation of the design, the classes needed for character conversion are part of the Sun private package sun.io. Thus, I would have to use an interface that isn't documented or published! Further, I would have to put my converter in the sun.io package on a JDK system for it to work correctly! As in the Xerox case, a large constituency for this part of the system may very well be programmers who are supporting legacy data, and those programmers will be forced to work around this particular wart. The good news, however, is that working with the data is possible at all, so to a large extent progress has been made.
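To make the workaround concrete, here is a minimal sketch of such a converter written outside the framework: a Reader subclass that maps Commodore bytes to Unicode by hand. (The mapping shown is deliberately abbreviated; a faithful PETSCII table would account for all 256 byte values.)

    import java.io.*;

    // Since sun.io offers no public hook for new converters, do the
    // byte-to-char conversion ourselves in a Reader subclass.
    public class Petscii64Reader extends Reader {
        private final InputStream in;

        public Petscii64Reader(InputStream in) { this.in = in; }

        public int read(char[] buf, int off, int len) throws IOException {
            byte[] raw = new byte[len];
            int n = in.read(raw, 0, len);
            if (n == -1) return -1;
            for (int i = 0; i < n; i++)
                buf[off + i] = toUnicode(raw[i] & 0xFF);
            return n;
        }

        // Abbreviated, illustrative PETSCII mapping.
        private char toUnicode(int b) {
            if (b >= 0x41 && b <= 0x5A) return (char) (b + 0x20); // a-z
            if (b >= 0xC1 && b <= 0xDA) return (char) (b - 0x80); // A-Z
            return (char) b; // digits and most punctuation match ASCII
        }

        public void close() throws IOException { in.close(); }
    }

Note what the sketch demonstrates: because this class lives outside the sun.io converter framework, nothing else in java.io -- FileReader included -- can ever find it or select it by encoder name.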

Wrapping up

The central theme of this column is that Java, unlike C, explicitly requires that the values contained in variables of type char be treated as Unicode characters. This makes the linkage between a character's numerical value and its glyph explicit. The concept, while simple, is often uncomfortable for seasoned programmers who are getting their first taste of writing programs that operate on non-English strings. Further, a ton of work has been done on codifying the various languages around the world, and the result of this work is the Unicode standard. More information on Unicode, and on which glyphs are represented by which code points, can be found on the Unicode home page.
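A two-line illustration of that linkage: the numerical value and the glyph are simply two views of the same char.

    public class GlyphDemo {
        public static void main(String[] args) {
            char eAcute = '\u00E9';            // the code point U+00E9
            System.out.println((int) eAcute);  // its numerical value: 233
            System.out.println(eAcute);        // its glyph, e-acute
                                               // (output encoding permitting)
        }
    }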

Finally, because the linkage between language and strings is more explicit in Java 1.1, Java programmers are forced to consider the other ways in which characters may be represented on the platform on which they are running. This globalization of programming represents a positive step toward bringing computers to people in the language in which they are most comfortable expressing themselves. All in all, in the words of Martha Stewart, "It's a good thing."

Chuck McManis is currently the director of system software at FreeGate Corp., a venture-funded start-up that is exploring opportunities in the Internet marketplace. Before joining FreeGate, Chuck was a member of the Java Group. He joined the Java Group just after the formation of FirstPerson Inc. and was a member of the portable OS group (the group responsible for the OS portion of Java). Later, when FirstPerson was dissolved, he stayed with the group through the development of the alpha and beta versions of the Java platform. He created the first "all Java" home page on the Internet when he did the programming for the Java version of the Sun home page in May 1995. He also developed a cryptographic library for Java and versions of the Java class loader that could screen classes based on digital signatures. Before joining FirstPerson, Chuck worked in the operating systems area of SunSoft, developing networking applications, where he did the initial design of NIS+. Check out his home page.

