Java Tutor is my platform for teaching about Java 7+ and JavaFX 2.0+, mainly via programming projects.
Did you know that a class can be nested within an interface, as in interface A { class B {} }? Also, were you aware that you can “add” cast operators together in an assignment statement such as byte i = (byte)+(short)+(double) 2000;? (You’re not really adding cast operators.) Neither of these oddities is the subject of this post. Instead, I’m going to show you that not every value that can be assigned to a char variable denotes a character.
A character set is a collection of characters, and a coded character set is a character set in which code points (numeric values) are associated with characters. For example, the American Standard Code for Information Interchange (ASCII) is a coded character set (e.g., hexadecimal value 41 is assigned to “A”).
ASCII is an old coded character set standard with an English language bias. In 1987, work began on a universal coded character set that could accommodate all of the characters of the world’s living (and, eventually, dead) languages. The resulting standard became known as Unicode.
Unicode-based text is encoded for storage or transmission by using a Unicode Transformation Format (UTF) encoding. Various UTF encodings have been devised, with the variable-length UTF-8 and UTF-16 encodings being commonly used.
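To make "variable-length" concrete, here is a minimal sketch that counts the bytes each encoding needs for a few sample characters. The class and method names (EncodingLengths, utf8Length, utf16Length) are hypothetical, and the sketch assumes Java 7 or later for java.nio.charset.StandardCharsets. It uses UTF-16BE so that no byte-order mark is counted.

```java
import java.nio.charset.StandardCharsets;

public class EncodingLengths
{
   // Number of bytes needed to encode text in UTF-8.
   static int utf8Length(String text)
   {
      return text.getBytes(StandardCharsets.UTF_8).length;
   }

   // Number of bytes needed to encode text in big-endian UTF-16
   // (this charset does not emit a byte-order mark).
   static int utf16Length(String text)
   {
      return text.getBytes(StandardCharsets.UTF_16BE).length;
   }

   public static void main(String[] args)
   {
      // U+0041, U+00E9, U+20AC, and U+10000 (written as a surrogate pair)
      String[] samples = { "A", "\u00e9", "\u20ac", "\ud800\udc00" };
      for (String s : samples)
         System.out.printf("U+%04X: UTF-8 = %d bytes, UTF-16 = %d bytes%n",
                           s.codePointAt(0), utf8Length(s), utf16Length(s));
   }
}
```

UTF-8 needs one, two, three, and four bytes for these samples, whereas UTF-16 needs two bytes for each of the first three and four bytes for U+10000.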
A plane is a group of 65,536 code points; Unicode supports 17 planes. The first plane (code points 0 through 65535), known as the Basic Multilingual Plane (BMP), represents its code points via syntax U+hhhh, where each “h” represents a hexadecimal digit. Additional planes are known as supplementary planes or astral planes; code points in these planes are represented via syntax U+hhhhh or syntax U+hhhhhh.
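Because each plane holds 65,536 (2 to the power 16) code points, a code point's plane number is simply the code point shifted right by 16 bits. The following sketch illustrates this idea; the class name PlaneOf and its planeOf() method are my own inventions, whereas Character.isValidCodePoint() and Character.isBmpCodePoint() are standard (the latter requires Java 7).

```java
public class PlaneOf
{
   // Returns the plane (0 through 16) that contains the given code point.
   static int planeOf(int codePoint)
   {
      if (!Character.isValidCodePoint(codePoint))
         throw new IllegalArgumentException("not a code point: " + codePoint);
      return codePoint >>> 16; // 65,536 code points per plane
   }

   public static void main(String[] args)
   {
      int[] samples = { 0x0041, 0xFFFF, 0x10000, 0x1F600, 0x10FFFF };
      for (int cp : samples)
         System.out.printf("U+%04X: plane %d, in BMP = %b%n",
                           cp, planeOf(cp), Character.isBmpCodePoint(cp));
   }
}
```

U+0041 and U+FFFF land in plane 0 (the BMP), U+10000 and U+1F600 in plane 1, and U+10FFFF in plane 16, which accounts for all 17 planes.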
| Note: Planes are divided into blocks, where a block is a named and arbitrarily sized (but always a multiple of 16) range of code points within a plane. For example, “C0 Controls and Basic Latin” refers to code points U+0000 through U+007F in the BMP. (This block is essentially the ASCII block.) |
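Java exposes blocks through the Character.UnicodeBlock class. The following sketch (the BlockLookup class and blockOf() method are hypothetical names) looks up the block for a few code points. Note that Java names the ASCII block BASIC_LATIN, and that the set of known blocks depends on the Unicode version supported by your Java runtime.

```java
public class BlockLookup
{
   // Returns the Unicode block containing the code point, or null when the
   // code point is not a member of a block known to this Java runtime.
   static Character.UnicodeBlock blockOf(int codePoint)
   {
      return Character.UnicodeBlock.of(codePoint);
   }

   public static void main(String[] args)
   {
      int[] samples = { 0x0041, 0x007F, 0x0080, 0x10000 };
      for (int cp : samples)
         System.out.printf("U+%04X is in block %s%n", cp, blockOf(cp));
   }
}
```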
Code points found in the BMP are directly accessible: UTF-16 represents each one with a single 16-bit code unit. However, code points in the supplementary planes, which represent supplementary characters, are accessed indirectly via surrogate (substitute) pairs in UTF-16.
Code points ranging from U+D800 through U+DBFF (1,024 code points) are known as high-surrogate code points; code points ranging from U+DC00 through U+DFFF (1,024 code points) are known as low-surrogate code points. A high-surrogate code point (also known as a leading surrogate code point) followed by a low-surrogate code point (also known as a trailing surrogate code point) denotes a surrogate pair that UTF-16 uses to represent the 1,048,576 code points that exist outside of the BMP.
| Note: Leading and trailing surrogate code points never represent characters by themselves. As a result, only code points less than U+D800 and code points greater than U+DFFF (up to and including U+10FFFF) can be used to represent characters. |
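You can check which range a UTF-16 code unit falls into with Character.isHighSurrogate() and Character.isLowSurrogate(). Here is a small sketch; the SurrogateCheck class and its classify() method are names I made up for illustration.

```java
public class SurrogateCheck
{
   // Classifies a UTF-16 code unit according to the surrogate ranges.
   static String classify(char c)
   {
      if (Character.isHighSurrogate(c))
         return "high surrogate";
      if (Character.isLowSurrogate(c))
         return "low surrogate";
      return "not a surrogate";
   }

   public static void main(String[] args)
   {
      // Boundary values on either side of the two surrogate ranges
      char[] samples = { '\ud7ff', '\ud800', '\udbff', '\udc00', '\udfff',
                         '\ue000' };
      for (char c : samples)
         System.out.printf("U+%04X: %s%n", (int) c, classify(c));
   }
}
```

Only U+D7FF and U+E000 are reported as non-surrogates; the four values in between mark the first and last code points of the high-surrogate and low-surrogate ranges.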
Surrogate pair (U+D800, U+DC00) represents the first supplementary character. Ignoring the most significant six bits of each surrogate results in a combined 20-bit value consisting of all 0s; adding hexadecimal 10000 to this value yields U+10000. Surrogate pair (U+DBFF, U+DFFF) represents the last supplementary character. Ignoring the most significant six bits of each surrogate results in a combined 20-bit value consisting of all 1s (hexadecimal FFFFF); adding hexadecimal 10000 to this value yields U+10FFFF. Subtracting U+10000 from U+10FFFF and adding 1 to the result yields 1,048,576. As you can see, surrogate pairs are able to represent the 1,048,576 code points for the supplementary characters.
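The bit manipulation just described can be sketched by hand. The SurrogateMath class and its combine() and split() methods are hypothetical names; in real code you would rely on Character.toCodePoint() and Character.toChars() instead. Neither method validates its arguments, so treat this as an illustration of the arithmetic only.

```java
public class SurrogateMath
{
   // Combines a surrogate pair into a supplementary code point.
   static int combine(char high, char low)
   {
      int top10 = high & 0x3FF;    // drop the six most significant bits
      int bottom10 = low & 0x3FF;  // drop the six most significant bits
      return ((top10 << 10) | bottom10) + 0x10000;
   }

   // Splits a supplementary code point into its surrogate pair.
   static char[] split(int codePoint)
   {
      int offset = codePoint - 0x10000; // 20-bit value
      char high = (char) (0xD800 | (offset >>> 10));
      char low = (char) (0xDC00 | (offset & 0x3FF));
      return new char[] { high, low };
   }

   public static void main(String[] args)
   {
      System.out.printf("U+%X%n", combine('\ud800', '\udc00'));
      System.out.printf("U+%X%n", combine('\udbff', '\udfff'));
      char[] pair = split(0x1F600);
      System.out.printf("U+%X U+%X%n", (int) pair[0], (int) pair[1]);
   }
}
```

The first two lines of output are U+10000 and U+10FFFF, matching the first and last supplementary characters; U+1F600 splits into U+D83D and U+DE00.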
As Unicode evolved to support supplementary characters, considerable thought went into how Java would respond. For example, would its character type be expanded to 32 bits? How would APIs handle these characters? Eventually, a tier-based solution was chosen for inclusion in Java 5:
The int type would be used to represent code points in low-level APIs (e.g., java.lang.Character’s class methods)
Sequences of char values always would be interpreted as UTF-16 sequences
Methods for mapping between char and code point-based representations would be provided
Under this new approach, a char value represents a UTF-16 code unit, which isn’t always sufficient to represent a character code point because the code unit might represent a leading surrogate code point (e.g., char c = '\ud800';) or a trailing surrogate code point (e.g., char c = '\udfff';).
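One practical consequence is that a string's length (its number of code units) can exceed its number of characters. The following sketch demonstrates the difference; the CodeUnitsVsCodePoints class and its two helper methods are hypothetical names wrapped around standard String methods.

```java
public class CodeUnitsVsCodePoints
{
   // Number of UTF-16 code units (char values) in the string.
   static int codeUnits(String s)
   {
      return s.length();
   }

   // Number of Unicode code points in the string.
   static int codePoints(String s)
   {
      return s.codePointCount(0, s.length());
   }

   public static void main(String[] args)
   {
      // "A" followed by the first supplementary character (U+10000)
      String s = "A" + new String(Character.toChars(0x10000));
      System.out.println("code units: " + codeUnits(s));
      System.out.println("code points: " + codePoints(s));

      // Walk the string one code point (not one char) at a time.
      for (int i = 0; i < s.length(); )
      {
         int cp = s.codePointAt(i);
         System.out.printf("U+%04X%n", cp);
         i += Character.charCount(cp); // advance by one or two code units
      }
   }
}
```

The string contains three code units but only two code points, and the loop visits U+0041 followed by U+10000.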
The Character class provides methods that let you map between various char and code point-based representations and more. Consider the following examples:
Character.toCodePoint(char high, char low) maps two UTF-16 code units to a code point.
Character.toChars(int codePoint) maps the given code point to one or two UTF-16 code units, wrapped into a char[] array.
Character.isSurrogatePair(char high, char low) returns Boolean true when the UTF-16 code units represent a surrogate pair; otherwise, false is returned.
These methods are fun to play with, and Listing 1 presents a small Mapper application that uses them to help you learn more about supplementary characters.
public class Mapper
{
   public static void main(String[] args)
   {
      if (args.length < 1 || args.length > 2)
      {
         System.err.println("usage: java Mapper cuhigh culow");
         System.err.println("       java Mapper cp");
         return;
      }
      if (args.length == 1)
      {
         int value = Integer.parseInt(args[0], 16);
         char[] codeUnits = Character.toChars(value);
         if (codeUnits.length == 1)
            System.out.printf("Code units = U+%X%n", (int) codeUnits[0]);
         else
            System.out.printf("Code units = U+%X U+%X%n",
                              (int) codeUnits[0], (int) codeUnits[1]);
      }
      else
      {
         char value1 = (char) Integer.parseInt(args[0], 16);
         char value2 = (char) Integer.parseInt(args[1], 16);
         if (!Character.isSurrogatePair(value1, value2))
         {
            System.err.println("Not a surrogate pair");
            return;
         }
         System.out.printf("Code point = U+%X%n",
                           Character.toCodePoint(value1, value2));
      }
   }
}
Listing 1: Mapping code points to code units and vice versa
You can pass one or two hexadecimal-based arguments to Mapper. A single argument identifies a code point and two arguments identify code units. For example, java Mapper 10000 outputs the following:
Code units = U+D800 U+DC00
In contrast, java Mapper D800 DC00 outputs the following:
Code point = U+10000
Be careful to specify valid surrogate values; otherwise, Mapper outputs an error message. For example, java Mapper 0 FFFF outputs the following:
Not a surrogate pair
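Listing 1 doesn't guard against every kind of bad input. Integer.parseInt() throws NumberFormatException when an argument isn't hexadecimal, and Character.toChars() throws IllegalArgumentException when the value lies outside the range U+0000 through U+10FFFF. The following sketch shows one possible way to validate a code point argument first; the SafeParse class, its parseCodePoint() method, and the -1 error value are assumptions of mine and are not part of Mapper.

```java
public class SafeParse
{
   // Parses text as a hexadecimal code point. Returns the code point, or -1
   // when text isn't hexadecimal or isn't in the range 0 through 10FFFF.
   static int parseCodePoint(String text)
   {
      try
      {
         int value = Integer.parseInt(text, 16);
         return Character.isValidCodePoint(value) ? value : -1;
      }
      catch (NumberFormatException nfe)
      {
         return -1;
      }
   }

   public static void main(String[] args)
   {
      String[] samples = { "10000", "10FFFF", "110000", "xyz" };
      for (String s : samples)
      {
         int cp = parseCodePoint(s);
         if (cp == -1)
            System.out.println(s + ": not a valid code point");
         else
            System.out.printf("%s: U+%X%n", s, cp);
      }
   }
}
```

The first two samples are accepted; 110000 is rejected because it exceeds U+10FFFF, and xyz is rejected because it isn't hexadecimal.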
| Note: For more information on Unicode and supplementary characters, check out Wikipedia’s Unicode entry and Oracle’s Supplementary Characters in the Java Platform article. |
Explain what byte i = (byte)+(short)+(double) 2000; means.
What other methods does Character provide for dealing with supplementary characters?
You can download this post's code and answers here. Code was developed and tested with JDK 7u2 on a Windows XP SP3 platform.
* * *
I welcome your input to this blog, and will write about relevant topics that you suggest. While waiting for the next blog post, check out my TutorTutor website to learn more about Java and other computer technologies (and that's just the beginning).
| Learn more about the Java 7 language and many of its APIs by reading my book Beginning Java 7. You can obtain information about this book here and here. |