An in-depth look at Java's character type

Eight (bits) is not enough -- Java's character type adds another eight

The 1.1 version of Java introduces a number of classes for dealing with characters. These new classes create an abstraction for converting from a platform-specific notion of character values into Unicode values. This column looks at what has been added, and the motivations for adding these character classes.

Type char

Perhaps the most abused base type in the C language is the type char. The char type is abused in part because it is defined to be 8 bits, and for the last 25 years, 8 bits has also defined the smallest indivisible chunk of memory on computers. When you combine the latter fact with the fact that the ASCII character set was defined to fit in 7 bits, the char type makes a very convenient "universal" type. Further, in C, a pointer to a variable of type char became the universal pointer type because anything that could be referenced as a char could also be referenced as any other type through the use of casting.

The use and abuse of the char type in the C language led to many incompatibilities between compiler implementations, so in the ANSI standard for C, two specific changes were made: The universal pointer was redefined to be a pointer to void (void *), thus requiring an explicit declaration by the programmer; and the numerical value of characters was considered to be signed, thus defining how they would be treated when used in numeric computations.

Then, in the mid-1980s, engineers and users figured out that 8 bits was insufficient to represent all of the characters in the world. Unfortunately, by that time, C was so entrenched that people were unwilling, perhaps even unable, to change the definition of the char type. Now flash forward to the '90s, to the beginnings of Java. One of the many principles laid down in the design of the Java language was that characters would be 16 bits. This choice supports the use of Unicode, a standard way of representing many different kinds of characters in many different languages. Unfortunately, it also set the stage for a variety of problems that are only now being rectified.
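As a minimal sketch of what the 16-bit decision means in practice, the class below prints the width of a Java char and the numeric value of a character that would not fit in 8 bits. The class and method names are hypothetical, chosen only for illustration, and the Character.SIZE constant is assumed to be available (it was added in a later release than the 1.1 version discussed here).

```java
class CharWidth {
    // Width of the char type in bits, as reported by the Character.SIZE constant
    // (assumption: a Java release that defines Character.SIZE).
    static int bitsPerChar() {
        return Character.SIZE;
    }

    // Largest numeric value a single char can hold.
    static int maxCharValue() {
        return Character.MAX_VALUE;
    }

    public static void main(String[] args) {
        char omega = '\u03A9';                    // Greek capital omega, outside the 8-bit range
        System.out.println(bitsPerChar());        // 16
        System.out.println(maxCharValue());       // 65535
        System.out.println((int) omega);          // 937
    }
}
```

The value 937 is larger than 255, so this character could not be stored in an 8-bit C-style char without some additional encoding scheme.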

What is a character anyway?

I knew I was in trouble when I found myself asking the question, "So what is a character?" Well, a character is a letter, right? A bunch of letters make up a word, words form sentences, and so on. The reality, however, is that the relationship between the representation of a character on a computer screen, called its glyph, and the numerical value that specifies that glyph, called a code point, is not really straightforward at all.
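One way to see the gap between glyphs and numeric values is the sketch below, which assumes nothing beyond the standard String and char types. The class name and the codeOf helper are hypothetical. It shows that the same on-screen glyph, an "e" with an acute accent, can be stored either as a single value or as a plain "e" followed by a combining accent.

```java
class GlyphVersusCode {
    // Numeric value stored in a char (hypothetical helper, for illustration only).
    static int codeOf(char c) {
        return c;
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9";   // accented e stored as one value
        String combined = "e\u0301";     // plain e followed by a combining acute accent

        System.out.println(codeOf('A'));                    // 65
        System.out.println(codeOf(precomposed.charAt(0)));  // 233
        System.out.println(precomposed.length());           // 1
        System.out.println(combined.length());              // 2
        System.out.println(precomposed.equals(combined));   // false
    }
}
```

Both strings would normally be drawn as the same glyph, yet they differ in length and do not compare as equal, which is one reason the question "what is a character?" has no simple answer.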

I consider myself lucky to be a native speaker of the English language. First, because it was the common language of a significant number of those who contributed to the design and development of the modern-day digital computer; second, because it has a relatively small number of glyphs. There are 95 printable characters (counting the space) in the ASCII definition that can be used to write English. Compare this to Chinese, where there are over 20,000 glyphs defined and even that definition is incomplete. From early beginnings in Morse and Baudot code, the overall simplicity (few glyphs, statistical frequency of appearance) of the English language has made it the lingua franca of the digital age. But as the number of people entering the digital age has increased, so has the number of non-native English speakers. As the numbers grew, more and more people were disinclined to accept that computers used ASCII and spoke only English. This greatly increased the number of "characters" computers needed to understand. As a result, the number of glyphs encoded by computers had to double.

Resources