An in-depth look at Java's character type

Eight (bits) is not enough -- Java's character type adds another eight

The 1.1 version of Java introduces a number of classes for dealing with characters. These new classes create an abstraction for converting from a platform-specific notion of character values into Unicode values. This column looks at what has been added, and the motivations for adding these character classes.

Type char

Perhaps the most abused base type in the C language is the type char. The char type is abused in part because it is defined to be 8 bits, and for the last 25 years, 8 bits has also defined the smallest indivisible chunk of memory on computers. When you combine the latter fact with the fact that the ASCII character set was defined to fit in 7 bits, the char type made a very convenient "universal" type. Further, in C, a pointer to a variable of type char became the universal pointer type, because anything that could be referenced as a char could also be referenced as any other type through casting.

The use and abuse of the char type in the C language led to many incompatibilities between compiler implementations, so in the ANSI standard for C two specific changes were made: The universal pointer was redefined to be a pointer to void, thus requiring an explicit declaration by the programmer; and the signedness of plain char was left to the implementation, with the signed char and unsigned char types added so that programmers could say exactly how character values behave in numeric computations. Then, in the mid-1980s, engineers and users figured out that 8 bits was insufficient to represent all of the characters in the world. Unfortunately, by that time, C was so entrenched that people were unwilling, perhaps even unable, to change the definition of the char type.

Now flash forward to the '90s, to the early beginnings of Java. One of the many principles laid down in the design of the Java language was that characters would be 16 bits. This choice supports the use of Unicode, a standard way of representing many different kinds of characters in many different languages. Unfortunately, it also set the stage for a variety of problems that are only now being rectified.

What is a character anyway?

I knew I was in trouble when I found myself asking the question, "So what is a character?" Well, a character is a letter, right? A bunch of letters make up a word, words form sentences, and so on. The reality, however, is that the relationship between the representation of a character on a computer screen, called its glyph, and the numerical value that specifies that glyph, called a code point, is not really straightforward at all.

I consider myself lucky to be a native speaker of the English language. First, because it was the common language of a significant number of those who contributed to the design and development of the modern-day digital computer; second, because it has a relatively small number of glyphs. There are 95 printable characters in the ASCII definition that can be used to write English. Compare this to Chinese, where more than 20,000 glyphs are defined and even that definition is incomplete. From its early beginnings in Morse and Baudot code, the overall simplicity (few glyphs, statistical frequency of appearance) of the English language made it the lingua franca of the digital age. But as the number of people entering the digital age increased, so did the number of non-native English speakers. As their numbers grew, more and more people were disinclined to accept that computers used ASCII and spoke only English. This greatly increased the number of "characters" computers needed to understand. As a result, the number of glyphs encoded by computers had to double.

The number of available characters doubled when the venerable 7-bit ASCII code was incorporated into an 8-bit character encoding called ISO Latin-1 (or ISO 8859-1, "ISO" being the International Organization for Standardization). As you may have gathered from the encoding's name, this standard allowed for the representation of many of the Latin-derived languages used on the European continent. Just because the standard was created, however, didn't mean it was usable. At the time, a lot of computers had already started putting the other 128 "characters" that an 8-bit character can represent to some advantage. The two surviving examples of the use of these extra characters are the IBM Personal Computer (PC) and the most popular computer terminal ever, the Digital Equipment Corporation VT-100. The latter lives on in the form of terminal emulator software.

The actual time of death for the 8-bit character will no doubt be debated for decades, but I peg it at the introduction of the Macintosh computer in 1984. The Macintosh brought two revolutionary concepts into mainstream computing: character fonts that were stored in RAM, and WorldScript, which could be used to represent characters in any language. Of course, this was simply a copy of what Xerox had been shipping on its Dandelion-class machines in the form of the Star word processing system, but the Macintosh brought these new character sets and fonts to an audience that was still using "dumb" terminals. Once started, the use of different fonts could not be stopped -- it was just too appealing to too many people. By the late '80s, the pressure to standardize the use of all these characters came to a head with the formation of the Unicode Consortium, which published its first specification in 1990. Unfortunately, during the '80s and even into the '90s, the number of character sets multiplied. Very few of the engineers who were creating new character codes at the time considered the nascent Unicode standard viable, and so they created their own mappings of codes to glyphs. So while Unicode was not well accepted, the notion that there were only 128, or at most 256, characters available was definitely gone. After the Macintosh, support for different fonts became a must-have feature for word processing. Eight-bit characters were fading into extinction.

Java and Unicode

I entered the story in 1992 when I joined the Oak group (the Java language was called Oak when it was first developed) at Sun. The base type char was defined to be 16 unsigned bits, the only unsigned type in Java. The rationale for the 16-bit character was that it would support any Unicode character representation, thus making Java suitable for representing strings in any language supported by Unicode. But being able to represent the string and being able to print it have always been separate problems. Given that most of the experience in the Oak group came from Unix systems and Unix-derived systems, the most comfortable character set was, again, ISO Latin-1. Also, with the Unix heritage of the group, the Java I/O system was modeled in large part on the Unix stream abstraction, whereby every I/O device can be represented by a stream of 8-bit bytes. This combination left a mismatch in the language between 8-bit input devices and the 16-bit characters of Java. Thus, anywhere Java strings had to be read from or written to an 8-bit stream, there was a small bit of code, a hack, to magically map 8-bit characters into 16-bit Unicode.
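
To see what this definition means in practice, here is a minimal sketch (the class name is mine) showing that a char widens into an int with zero extension, unlike every other integral type in Java:

public class chardemo {
    public static void main(String args[]) {
        char c = 0xFFFF;          // 65535, the maximum char value
        int i = c;                // widens with zero extension -- char is unsigned
        System.out.println(i);    // prints 65535, never -1
    }
}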

In the 1.0 versions of the Java Developer Kit (JDK), the input hack was in the DataInputStream class, and the output hack was the entire PrintStream class. (Actually there was an input class named TextInputStream in the alpha 2 release of Java, but it was supplanted by the DataInputStream hack in the actual release.) This continues to cause problems for beginning Java programmers, as they search desperately for the Java equivalent of the C function getc(). Consider the following Java 1.0 program:

import java.io.*;
public class bogus {
    public static void main(String args[]) {
        FileInputStream fis;
        DataInputStream dis;
        char c;
        try {
            fis = new FileInputStream("data.txt");
            dis = new DataInputStream(fis);
            while (true) {
                // readChar consumes TWO bytes from the stream and
                // assembles them into a single 16-bit char -- this is
                // the source of the junk output described below
                c = dis.readChar();
                System.out.print(c);
                System.out.flush();
                if (c == '\n') break;
            }
            fis.close();
        } catch (Exception e) { }   // the EOFException at end of file lands here
        System.exit(0);
    }
}

At first glance, this program would appear to open a file, read it one character at a time, and exit when the first newline is read. In practice, however, what you get is junk output. The reason you get junk is that readChar reads 16-bit Unicode characters -- consuming two bytes of the file for each one -- while System.out.print prints what it assumes are ISO Latin-1 8-bit characters. If you change the above program to use the readLine function of DataInputStream, however, it will appear to work, because readLine maps each byte it reads directly into the low 8 bits of a char. (The related readUTF function is the one that reads the format defined, with a passing nod to the Unicode specification, as "modified UTF-8"; UTF-8 is the format that Unicode specifies for representing Unicode characters in an 8-bit stream.)

So the situation in Java 1.0 is that Java strings are composed of 16-bit Unicode characters, but there is only one mapping from 8-bit characters into Unicode, the one for ISO Latin-1. Fortunately, Unicode defines code page "0" -- that is, the 256 characters whose upper 8 bits are all zero -- to correspond exactly to the ISO Latin-1 set. Thus the mapping is trivial, and as long as you use only ISO Latin-1 character files, you won't have any problems when the data leaves a file, is manipulated by a Java class, and is then rewritten to a file.
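
For illustration, here is the readLine variant just described -- a sketch only, since readLine was deprecated in Java 1.1 precisely because of this byte-to-char shortcut:

import java.io.*;
public class bogus2 {
    public static void main(String args[]) {
        try {
            FileInputStream fis = new FileInputStream("data.txt");
            DataInputStream dis = new DataInputStream(fis);
            // readLine maps each byte into the low 8 bits of a char, which
            // happens to coincide with the ISO Latin-1-to-Unicode mapping
            String line = dis.readLine();
            System.out.println(line);
            fis.close();
        } catch (Exception e) { }
        System.exit(0);
    }
}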

There were two problems with burying the input conversion code into these classes: Not all platforms stored their multilingual files in the formats these classes assumed (one byte per Latin-1 character in readLine, modified UTF-8 in readUTF); and certainly, the applications on those platforms didn't necessarily expect non-Latin characters in these forms. Therefore, the implementation support was incomplete, and there was no easy way to add the needed support in a later release.

Java 1.1 and Unicode

The Java 1.1 release introduced an entirely new set of interfaces for handling characters, called Readers and Writers. I modified the class named bogus from above into a class named cool. The cool class uses an InputStreamReader class to process the file rather than the DataInputStream class. Note that InputStreamReader is a subclass of the new Reader class, and that its output-side counterpart, PrintWriter, is a subclass of the new Writer class. (System.out itself remains a PrintStream, but in 1.1 PrintStream was reworked to convert characters using the platform's default encoding.) The code for this example is shown below:

import java.io.*;
public class cool {
    public static void main(String args[]) {
        FileInputStream fis;
        InputStreamReader isr;
        int c;
        try {
            fis = new FileInputStream("data.txt");
            isr = new InputStreamReader(fis);
            System.out.println("Using encoding : " + isr.getEncoding());
            while (true) {
                // read returns an int so that -1 can signal end of file;
                // cast to char only after checking for that value
                c = isr.read();
                if (c == -1) break;
                System.out.print((char) c);
                System.out.flush();
                if (c == '\n') break;
            }
            isr.close();
        } catch (Exception e) { }
        System.exit(0);
    }
}

The primary difference between this example and the previous code listing is the use of the InputStreamReader class rather than the DataInputStream class. The example also adds a line that prints out the encoding the InputStreamReader is using.

The important point is that the existing conversion code, once undocumented (and ostensibly unknowable) and embedded inside the implementation of the readLine method of the DataInputStream class, has been removed (actually, its use is deprecated; it will be removed in a future release). In the 1.1 version of Java, the mechanism that performs the conversion is now encapsulated in the Reader class. This encapsulation provides a way for the Java class libraries to support many different external representations of non-Latin characters while always using Unicode internally.
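
To see that encapsulation at work, here is a minimal sketch (the file name and encoding name are illustrative) that selects a specific converter by name rather than accepting the platform default:

import java.io.*;
public class pick {
    public static void main(String args[]) {
        try {
            // the two-argument constructor selects a converter by name;
            // an unknown name raises UnsupportedEncodingException
            InputStreamReader isr = new InputStreamReader(
                new FileInputStream("data.txt"), "8859_1");
            System.out.println("Using encoding : " + isr.getEncoding());
            isr.close();
        } catch (IOException e) { }
    }
}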

Of course, like the original I/O subsystem design, there are symmetric counterparts to the reading classes that perform writing. The class OutputStreamWriter can be used to write strings to an output stream, the class BufferedWriter adds a layer of buffering, and so on.
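
As a sketch of that output path (the file name, encoding name, and text are illustrative), the same pattern applies in reverse:

import java.io.*;
public class put {
    public static void main(String args[]) {
        try {
            FileOutputStream fos = new FileOutputStream("data.out");
            // the Writer converts Java's 16-bit characters back into the
            // named 8-bit encoding; BufferedWriter adds buffering on top
            Writer w = new BufferedWriter(
                new OutputStreamWriter(fos, "8859_1"));
            w.write("Hello, world\n");
            w.close();   // flushes the buffer and closes the stream
        } catch (IOException e) { }
    }
}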

Trading warts or real progress?

The somewhat lofty goal of the design of the Reader and Writer classes was to tame what is currently a hodge-podge of representation standards for the same information by providing a standard way of converting back and forth between the legacy representation -- be it Macintosh Greek or Windows Cyrillic -- and Unicode. So, a Java class that deals with strings need not change when it moves from platform to platform. This might be the end of the story, except that now that the conversion code is encapsulated, the question arises as to what that code assumes.

While researching this column, I was reminded of a famous quote from a Xerox executive (before it was Xerox, when it was the Haloid Company) about the photocopier being superfluous because it was fairly easy for a secretary to put a piece of carbon paper into her typewriter and make a copy of a document while she was creating the original. Of course, what is obvious in hindsight is that the photocopy machine benefits the person receiving a document much more than it does a person generating a document. JavaSoft has shown a similar lack of insight into the use of the character encoding and decoding classes in its design of this part of the system.

At Haloid, the executives were focused on the generation of copies, not on the folks who received documents. Similarly, in Java the classes are focused on getting Unicode characters converted into the native form understood by the underlying platform (the system upon which the JVM is running). Typical (and by far the easiest) usage of the InputStreamReader class is simply to instantiate a reader around a byte stream, as shown above. For disk files, the process is streamlined into a single class named FileReader. In typical usage, when the class is instantiated, it "plugs in" the default encoder for the platform. On Windows and Unix this appears to be "8859_1" -- the encoder that reads ISO Latin-1 files and converts them into Unicode. However, as with the copier, there is another use, which is to convert from some other platform's native format into Unicode and then convert that Unicode into the local platform's format. I will demonstrate a bit later where this ability (multiplatform conversion) is completely lacking in the current design.
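
For encodings that Sun happens to ship, this second use does work. Here is a sketch (the file names and the "Cp1251" Windows Cyrillic encoding name are illustrative) that decodes a file from another platform's format and re-encodes it in the local platform's default; the wart appears when the converter you need is not among those supplied:

import java.io.*;
public class xlate {
    public static void main(String args[]) {
        try {
            // decode the foreign bytes into Unicode characters...
            Reader in = new InputStreamReader(
                new FileInputStream("foreign.txt"), "Cp1251");
            // ...then re-encode the Unicode with the default converter
            Writer out = new OutputStreamWriter(
                new FileOutputStream("local.txt"));
            int c;
            while ((c = in.read()) != -1)
                out.write(c);
            in.close();
            out.close();
        } catch (IOException e) { }
    }
}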

There are other encoders -- those that convert between JIS (Japanese) characters and Unicode, Han (Chinese) characters and Unicode, and so on. And yet, if you peruse the documentation that comes with the JDK (also available at the JavaSoft Web site), you will be hard-pressed to find out what the other encoders are. Further, there is no apparent API with which one might enumerate the available encoders. The only way to really know the list of the available encoders is to read the source code for InputStreamReader. And that, unfortunately, uncovers a brand new wart.
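
(In practice, the closest thing to an enumeration is to probe: construct an InputStreamReader with a candidate name and catch the UnsupportedEncodingException if the name is unknown. A sketch -- the candidate names below are merely examples:)

import java.io.*;
public class probe {
    static boolean hasEncoder(String name) {
        try {
            // constructing a reader is the only portable way to ask
            // whether a converter with this name exists
            new InputStreamReader(new ByteArrayInputStream(new byte[0]), name);
            return true;
        } catch (UnsupportedEncodingException e) {
            return false;
        }
    }
    public static void main(String args[]) {
        String names[] = { "8859_1", "Cp1255", "SJIS", "NoSuchEncoding" };
        for (int i = 0; i < names.length; i++)
            System.out.println(names[i] + " supported: " + hasEncoder(names[i]));
    }
}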

In the source for InputStreamReader, the code encapsulates the conversion function in a Sun-specific class called sun.io.ByteToCharConverter. Like all classes in the sun.* packages, this class is neither shipped in source form with the JDK nor documented, so what it actually does is something of a mystery. Fortunately, we can get the source code for the entire JDK from Sun, and therein discover what this class does. There is a constructor for InputStreamReader that takes a string naming the encoder to use. In the Sun implementation, that string is pasted into the template "sun.io.ByteToCharXXX", where the XXX part is replaced by the string you passed to the constructor. Then, by looking at the classes that are packed into the classes.zip file supplied with the JDK, you can identify several encoders with names like ByteToChar8859_1, ByteToCharCp1255, and so on. Again, as they are Sun classes, they aren't documented. However, you can find some out-of-date documentation in the form of the Internationalization Specification on the JavaSoft Web site. Within that document there is a page describing most, if not all, of the supported encoders. A link to the exact page is provided in the Resources section. The problem, which I alluded to earlier, is that this architecture will not allow you to add your own converters.

The reason I consider this design flawed is twofold: There is no requirement for any Java-conformant platform to support any encoders beyond a basic platform-to-Unicode conversion and its inverse; and there is no way to enumerate the encoders that are supported. This suggests that what was going through the minds of the people who wrote this code was that data files existing on a particular platform were, by definition, created on that platform. And yet, in this world of Internet-connected files on heterogeneous hardware, there is a significant likelihood that the file you want to read was not created on the local platform and therefore will require a special encoder. This requirement brings me to my final problem with this part of the system.

Assume for the moment that the file you are trying to read, and perhaps parse, in your Java application came from a non-local platform -- let's say a Commodore 64. Commodore used its own modified version of ASCII, so in order to read and convert the Commodore 64 file to Unicode, I need to use a specialized subclass of ByteToCharConverter that understands Commodore's character format. The design of the Reader class allows for this; however, in the JDK implementation of the design, the classes needed for character conversion are part of the Sun private package sun.io. Thus, I would have to use an interface that isn't documented or published! Further, I would have to put my converter in the sun.io package on a JDK system for it to work correctly! As in the Xerox case, a large client of this part of the system may very well be programmers who are supporting legacy data. Those programmers will be forced to work around this particular wart. The good news, however, is that working with the data is possible at all, so to a large extent progress has been made.

Wrapping up

The central theme of this column is that Java, unlike C, explicitly requires that the values contained in variables of type char be treated as Unicode characters. This fixes the linkage between a character's numerical value and its glyph. This concept, while simple, is often uncomfortable for seasoned programmers who are getting their first taste of writing programs that operate on non-English strings. Further, a ton of work has been done on codifying the various languages around the world, and the result of this work is the Unicode standard. More information on Unicode, and on which glyphs are represented by which code points, can be found on the Unicode home page.

Finally, because the linkage between language and strings is more explicit in Java 1.1, Java programmers are forced to consider the other ways in which characters may be represented on the platform on which they are running. This globalization of programming represents a positive step toward bringing computers to people in the language in which they are most comfortable expressing themselves. All in all, in the words of Martha Stewart, "It's a good thing."

Chuck McManis currently is the director of system software at FreeGate Corp., a venture-funded start-up that is exploring opportunities in the Internet marketplace. Before joining FreeGate, Chuck was a member of the Java Group. He joined the Java Group just after the formation of FirstPerson Inc. and was a member of the portable OS group (the group responsible for the OS portion of Java). Later, when FirstPerson was dissolved, he stayed with the group through the development of the alpha and beta versions of the Java platform. He created the first "all Java" home page on the Internet when he did the programming for the Java version of the Sun home page in May 1995. He also developed a cryptographic library for Java and versions of the Java class loader that could screen classes based on digital signatures. Before joining FirstPerson, Chuck worked in the operating systems area of SunSoft, developing networking applications, where he did the initial design of NIS+. Check out his home page.
