An in-depth look at Java's character type

Eight (bits) is not enough -- Java's character type adds another eight

The 1.1 version of Java introduces a number of classes for dealing with characters. These new classes create an abstraction for converting from a platform-specific notion of character values into Unicode values. This column looks at what has been added, and the motivations for adding these character classes.

Type char

Perhaps the most abused base type in the C language is the type char. The char type is abused in part because it is defined to be 8 bits, and for the last 25 years, 8 bits has also defined the smallest indivisible chunk of memory on computers. When you combine the latter fact with the fact that the ASCII character set was defined to fit in 7 bits, the char type makes a very convenient "universal" type. Further, in C, a pointer to a variable of type char became the universal pointer type because anything that could be referenced as a char could also be referenced as any other type through the use of casting.

The use and abuse of the char type in the C language led to many incompatibilities between compiler implementations, so the ANSI standard for C made two specific changes: the universal pointer was redefined to have a type of void, thus requiring an explicit declaration by the programmer; and the treatment of characters in numeric computations was pinned down (the signedness of a plain char was made explicitly implementation-defined, with signed char and unsigned char types added for code that cares). Then, in the mid-1980s, engineers and users figured out that 8 bits was insufficient to represent all of the characters in the world. Unfortunately, by that time, C was so entrenched that people were unwilling, perhaps even unable, to change the definition of the char type. Now flash forward to the '90s, to the early beginnings of Java. One of the many principles laid down in the design of the Java language was that characters would be 16 bits. This choice supports the use of Unicode, a standard way of representing many different kinds of characters in many different languages. Unfortunately, it also set the stage for a variety of problems that are only now being rectified.

What is a character anyway?

I knew I was in trouble when I found myself asking the question, "So what is a character?" Well, a character is a letter, right? A bunch of letters make up a word, words form sentences, and so on. The reality, however, is that the relationship between the representation of a character on a computer screen, called its glyph, and the numerical value that specifies that glyph, called a code point, is not straightforward at all.

I consider myself lucky to be a native speaker of the English language. First, because it was the common language of a significant number of those who contributed to the design and development of the modern-day digital computer; second, because it has a relatively small number of glyphs. There are 96 printable characters in the ASCII definition that can be used to write English. Compare this to Chinese, where more than 20,000 glyphs are defined, and even that definition is incomplete. From early beginnings in Morse and Baudot code, the overall simplicity (few glyphs, statistical frequency of appearance) of the English language has made it the lingua franca of the digital age. But as the number of people entering the digital age has grown, so has the number of non-native English speakers among them. As their numbers grew, these users became increasingly disinclined to accept that computers used ASCII and spoke only English. This greatly increased the number of "characters" computers needed to understand. As a result, the number of glyphs encoded by computers had to double.

The number of available characters doubled when the venerable 7-bit ASCII code was incorporated into an 8-bit character encoding called ISO Latin-1 (or ISO 8859-1, "ISO" being the International Organization for Standardization). As you may have gathered from the encoding's name, this standard allowed for the representation of many of the Latin-derived languages used on the European continent. Just because the standard was created, however, didn't mean it was usable. At the time, many computers had already started putting the other 128 values representable in an 8-bit character to some advantage of their own. The two surviving examples of the use of these extra characters are the IBM Personal Computer (PC) and the most popular computer terminal ever, the Digital Equipment Corporation VT-100. The latter lives on in the form of terminal emulator software.

The actual time of death for the 8-bit character will no doubt be debated for decades, but I peg it at the introduction of the Macintosh computer in 1984. The Macintosh brought two revolutionary concepts into mainstream computing: character fonts that were stored in RAM, and WorldScript, which could be used to represent characters in any language. Of course, this was simply a copy of what Xerox had been shipping on its Dandelion-class machines in the form of the Star word processing system, but the Macintosh brought these new character sets and fonts to an audience that was still using "dumb" terminals. Once started, the use of different fonts could not be stopped -- it was just too appealing to too many people. By the late '80s, the pressure to standardize the use of all these characters came to a head with the formation of the Unicode Consortium, which published its first specification in 1990. Unfortunately, during the '80s and even into the '90s, the number of character sets multiplied. Very few of the engineers creating new character codes at the time considered the nascent Unicode standard viable, and so they created their own mappings of codes to glyphs. So while Unicode was not well accepted, the notion that there were only 128 or at most 256 characters available was definitely gone. After the Macintosh, support for different fonts became a must-have feature for word processing. Eight-bit characters were fading into extinction.

Java and Unicode

I entered the story in 1992 when I joined the Oak group (the Java language was called Oak when it was first developed) at Sun. The base type char was defined to be 16 unsigned bits, the only unsigned type in Java. The rationale for the 16-bit character was that it would support any Unicode character representation, thus making Java suitable for representing strings in any language supported by Unicode. But being able to represent a string and being able to print it have always been separate problems. Given that most of the experience in the Oak group came from Unix systems and Unix-derived systems, the most comfortable character set was, again, ISO Latin-1. Also, with the Unix heritage of the group, the Java I/O system was modeled in large part on the Unix stream abstraction, whereby every I/O device can be represented by a stream of 8-bit bytes. This combination left a mismatch in the language between 8-bit input devices and the 16-bit characters of Java. Thus, anywhere Java strings had to be read from or written to an 8-bit stream, there was a small bit of code, a hack, to magically map 8-bit characters into 16-bit Unicode.
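
The mapping that hack performed amounts to zero-extending each byte into a 16-bit char. Here is a minimal illustration of the idea (my own sketch, not the actual JDK code); note that Java bytes are signed, so the value must be masked before widening:

```java
public class ByteToChar {
    public static void main(String[] args) {
        // The ISO Latin-1 byte for 'é' is 0xE9. As a Java byte this
        // reads as -23, so mask with 0xFF before widening to a char.
        byte b = (byte) 0xE9;
        char c = (char) (b & 0xFF);
        System.out.println((int) c);   // 233, the Unicode code point U+00E9
    }
}
```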

In the 1.0 versions of the Java Developer Kit (JDK), the input hack was in the DataInputStream class, and the output hack was the entire PrintStream class. (Actually there was an input class named TextInputStream in the alpha 2 release of Java, but it was supplanted by the DataInputStream hack in the actual release.) This continues to cause problems for beginning Java programmers, as they search desperately for the Java equivalent of the C function getc(). Consider the following Java 1.0 program:

import java.io.*;
public class bogus {
    public static void main(String args[]) {
        FileInputStream fis;
        DataInputStream dis;
        char c;
        try {
            fis = new FileInputStream("data.txt");
            dis = new DataInputStream(fis);
            while (true) {
                c = dis.readChar();
                System.out.print(c);
                System.out.flush();
                if (c == '\n') break;
            }
            fis.close();
        } catch (Exception e) { }
        System.exit(0);
    }
}

At first glance, this program would appear to open a file, read it one character at a time, and exit when the first newline is read. However, in practice, what you get is junk output. And the reason you get junk is that readChar reads 16-bit Unicode characters and System.out.print prints out what it assumes are ISO Latin-1 8-bit characters. However, if you change the above program to use the readLine function of DataInputStream, it will appear to work because the code in readLine reads a format that is defined with a passing nod to the Unicode specification as "modified UTF-8." (UTF-8 is the format that Unicode specifies for representing Unicode characters in an 8-bit input stream.)

So the situation in Java 1.0 is that Java strings are composed of 16-bit Unicode characters, but there is only one mapping, which maps ISO Latin-1 characters into Unicode. Fortunately, Unicode defines code page "0" -- that is, the 256 characters whose upper 8 bits are all zero -- to correspond exactly to the ISO Latin-1 set. Thus, the mapping is trivial, and as long as you are only using ISO Latin-1 character files, you won't have any problems when the data leaves a file, is manipulated by a Java class, and then rewritten to a file.
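
You can see the modified UTF-8 format concretely by round-tripping a string through DataOutputStream.writeUTF and DataInputStream.readUTF, which write and read that format. A small sketch of my own, using in-memory streams so no file is needed:

```java
import java.io.*;

public class UtfRoundTrip {
    public static void main(String[] args) throws IOException {
        String s = "café";   // four chars, one of them non-ASCII

        // writeUTF emits a 2-byte length followed by the modified UTF-8 bytes
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        new DataOutputStream(baos).writeUTF(s);
        byte[] encoded = baos.toByteArray();
        // 'é' (U+00E9) takes two bytes in UTF-8: 5 data bytes + 2-byte length
        System.out.println("encoded length: " + encoded.length);   // 7

        // readUTF reverses the conversion back into 16-bit Unicode chars
        DataInputStream dis =
            new DataInputStream(new ByteArrayInputStream(encoded));
        System.out.println(dis.readUTF().equals(s));   // true
    }
}
```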

There were two problems with burying the input conversion code in these classes: not all platforms stored their multilingual files in modified UTF-8 format; and certainly, the applications on these platforms didn't necessarily expect non-Latin characters in this form. Therefore, the implementation support was incomplete, and there was no easy way to add the needed support in a later release.

Java 1.1 and Unicode

The Java 1.1 release introduced an entirely new set of interfaces for handling characters, called Readers and Writers. I modified the class named bogus from above into a class named cool. The cool class uses an InputStreamReader class to process the file rather than the DataInputStream class. Note that InputStreamReader is a subclass of the new Reader class, just as the new PrintWriter class is a subclass of the Writer class. The code for this example is shown below:

import java.io.*;
public class cool {
    public static void main(String args[]) {
        FileInputStream fis;
        InputStreamReader isr;
        int c;
        try {
            fis = new FileInputStream("data.txt");
            isr = new InputStreamReader(fis);
            System.out.println("Using encoding : " + isr.getEncoding());
            while (true) {
                c = isr.read();   // returns -1 at end of stream
                if (c == -1) break;
                System.out.print((char) c);
                System.out.flush();
                if (c == '\n') break;
            }
            fis.close();
        } catch (Exception e) { }
        System.exit(0);
    }
}

The primary difference between this example and the previous code listing is the use of the InputStreamReader class rather than the DataInputStream class. The example also adds a line that prints out the encoding used by the InputStreamReader class.

The important point is that the existing conversion code, once undocumented (and ostensibly unknowable) and embedded inside the implementation of the readLine method of the DataInputStream class, has been removed (actually, its use is deprecated; it will be removed in a future release). In the 1.1 version of Java, the mechanism that performs the conversion is now encapsulated in the Reader class. This encapsulation provides a way for the Java class libraries to support many different external representations of non-Latin characters while always using Unicode internally.
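
The conversion to use is selected by naming an encoding in the InputStreamReader constructor. A minimal sketch of my own (the name "8859_1" is the Java 1.1-era name for ISO Latin-1):

```java
import java.io.*;

public class EncodingDemo {
    public static void main(String[] args) throws IOException {
        // The single byte 0xE9 means 'é' in ISO Latin-1, but something
        // else entirely (or nothing at all) under other encodings.
        byte[] bytes = { (byte) 0xE9 };

        // Naming an encoding in the constructor selects the conversion.
        InputStreamReader latin1 =
            new InputStreamReader(new ByteArrayInputStream(bytes), "8859_1");
        char c = (char) latin1.read();
        System.out.println((int) c);   // 233, i.e. Unicode U+00E9
    }
}
```

The same class, used on a stream of bytes in a different encoding, would produce different Unicode characters from the same byte values; that is the whole point of the encapsulation.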

Of course, like the original I/O subsystem design, there are symmetric counterparts to the reading classes that perform writing. The class OutputStreamWriter can be used to write strings to an output stream, the class BufferedWriter adds a layer of buffering, and so on.
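The writing side mirrors the reading side. A short sketch of my own, again using an in-memory stream and the 1.1-era encoding name "8859_1":

```java
import java.io.*;

public class WriterDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();

        // OutputStreamWriter converts 16-bit chars into bytes in the named
        // encoding; BufferedWriter layers buffering on top of it.
        BufferedWriter out = new BufferedWriter(
            new OutputStreamWriter(baos, "8859_1"));
        out.write("café");
        out.flush();

        // In ISO Latin-1 every character fits in one byte.
        System.out.println(baos.toByteArray().length);   // 4
    }
}
```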

Trading warts or real progress?

The somewhat lofty goal of the design of the Reader and Writer classes was to tame what is currently a hodge-podge of representation standards for the same information by providing a standard way of converting back and forth between the legacy representation -- be it Macintosh Greek or Windows Cyrillic -- and Unicode. So, a Java class that deals with strings need not change when it moves from platform to platform. This might be the end of the story, except that now that the conversion code is encapsulated, the question arises as to what that code assumes.

While researching this column, I was reminded of a famous quote from a Xerox executive (before it was Xerox, when it was the Haloid Company) about the photocopier being superfluous because it was fairly easy for a secretary to put a piece of carbon paper into her typewriter and make a copy of a document while she was creating the original. Of course, what is obvious in hindsight is that the photocopy machine benefits the person receiving a document much more than it does a person generating a document. JavaSoft has shown a similar lack of insight into the use of the character encoding and decoding classes in their design of this part of the system.
