Overhauling the Java UTF-8 charset

By Xueming Shen

The UTF-8 charset implementation, which is available in all JDK/JRE releases from Sun, was recently updated to reject non-shortest-form UTF-8 byte sequences, because the old implementation could be leveraged in security attacks. Since then I have been asked many times what this "non-shortest-form" issue is and what its possible impact might be, so here are some answers.

The first question usually goes: "What is the non-shortest-form issue?"

The detailed and official answer is at Unicode Corrigendum #1: UTF-8 Shortest Form. Simply put, the problem is that a Unicode character can be represented in more than one way (form) under the definition of "UTF-8 encoding" that many people carry in their heads. When asked what UTF-8 encoding looks like, the simplest explanation is the following bit pattern:

#  Bits  Bit pattern
1   7    0xxxxxxx
2  11    110xxxxx 10xxxxxx
3  16    1110xxxx 10xxxxxx 10xxxxxx
4  21    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The pattern is close, but it's actually wrong under the current definition of UTF-8: it has a loophole that allows more than one form to represent the same Unicode character.

For the ASCII characters U+0000 to U+007F, for example, the UTF-8 encoding form maintains transparency, so they keep their ASCII code values 0x00..0x7F (the one-byte form) in UTF-8. Based on the preceding pattern, however, these characters could also be represented in a two-byte form as [c0, 80]..[c1, bf], the "non-shortest form".

The following code prints all of the non-shortest two-byte forms for these ASCII characters, if you run it against an "old" version of the JDK/JRE (Java Runtime Environment).


    public class NonShortestForm {
        public static void main(String[] args) throws Exception {
            // Try every two-byte sequence with leading byte C0 or C1; an old
            // JDK/JRE decodes each one to an ASCII character.
            byte[] bb = new byte[2];
            for (int b1 = 0xc0; b1 < 0xc2; b1++) {        // leading bytes C0..C1
                for (int b2 = 0x80; b2 < 0xc0; b2++) {    // continuation bytes 80..BF
                    bb[0] = (byte)b1;
                    bb[1] = (byte)b2;
                    String cstr = new String(bb, "UTF8"); // decode as UTF-8
                    char c = cstr.toCharArray()[0];
                    System.out.printf("[%02x, %02x] -> U+%04x [%s]%n",
                                      b1, b2, c & 0xffff, (c >= 0x20) ? cstr : "ctrl");
                }
            }
        }
    }

The output on an old runtime would be as follows:


...
[c0, a0] -> U+0020 [ ]
[c0, a1] -> U+0021 [!]
...
[c0, b6] -> U+0036 [6]
[c0, b7] -> U+0037 [7]
[c0, b8] -> U+0038 [8]
[c0, b9] -> U+0039 [9]
...
[c1, 80] -> U+0040 [@]
[c1, 81] -> U+0041 [A]
[c1, 82] -> U+0042 [B]
[c1, 83] -> U+0043 [C]
[c1, 84] -> U+0044 [D]
...

So, for a string like "ABC", the old decoder would accept two different UTF-8 byte sequences:


"0x41 0x42 0x43" and "0xc1 0x81 0xc1 0x82 0xc1 0x83"

The Unicode Corrigendum #1: UTF-8 Shortest Form specifies explicitly that "The definition of each UTF specifies the illegal code unit sequences in that UTF. For example, the definition of UTF-8 (D36) specifies that code unit sequences such as [C0, AF] are illegal."

Our old implementation accepted those non-shortest forms when decoding (though it never generated them when encoding). The new UTF_8 charset now rejects non-shortest-form byte sequences for all BMP characters; only the "legal byte sequences" listed below are accepted.


    /*  Legal UTF-8 Byte Sequences
     *
     * #    Code Points      Bits   Bit/Byte pattern
     * 1                     7      0xxxxxxx
     *      U+0000..U+007F          00..7F
     * 2                     11     110xxxxx    10xxxxxx
     *      U+0080..U+07FF          C2..DF      80..BF
     * 3                     16     1110xxxx    10xxxxxx    10xxxxxx
     *      U+0800..U+0FFF          E0          A0..BF      80..BF
     *      U+1000..U+FFFF          E1..EF      80..BF      80..BF
     * 4                     21     11110xxx    10xxxxxx    10xxxxxx    10xxxxxx
     *     U+10000..U+3FFFF         F0          90..BF      80..BF      80..BF
     *     U+40000..U+FFFFF         F1..F3      80..BF      80..BF      80..BF
     *    U+100000..U+10FFFF        F4          80..8F      80..BF      80..BF
     */
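
If you want to check which behavior your runtime has, you can decode a known-illegal sequence with a decoder configured to report malformed input. Here is a small sketch (the class name is just for illustration): a fixed runtime throws on the decode, while an old one quietly yields '/'.

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    public class ShortestFormCheck {
        public static void main(String[] args) {
            // [C0, AF] is the non-shortest (illegal) form of U+002F '/'
            byte[] bad = { (byte)0xc0, (byte)0xaf };
            CharsetDecoder dec = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                System.out.println("decoded: " + dec.decode(ByteBuffer.wrap(bad)));
            } catch (CharacterCodingException e) {
                System.out.println("rejected: " + e); // expected on a fixed runtime
            }
        }
    }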

The next question usually is: "What would be the issue/problem if we keep using the old version of JDK/JRE?"

First, I'm not a lawyer (oops, I meant I'm not a security expert :-), so my word does not count. We consulted our security experts instead. Their conclusion is that while "it is not a security vulnerability in Java SE per se, it may be leveraged to attack systems running software that relies on the UTF-8 charset to reject these non-shortest form of UTF-8 sequences".

Here is a simple scenario that might give you an idea of what "may be leveraged to attack..." really means:

  1. A Java application would like to filter an incoming UTF-8 input stream to reject certain keywords, for example "ABC".
  2. Instead of decoding the input UTF-8 byte sequences into Java char representation and then filtering out the keyword string "ABC" at the Java char level, for example:
    
           String utfStr = new String(bytes, "UTF-8");
           if ("ABC".equals(utfStr)) { ... }
    
    the application might choose to filter for the raw UTF-8 byte sequence "0x41 0x42 0x43" (only) directly in the UTF-8 byte input stream, relying on (assuming) the Java UTF-8 charset to reject any non-shortest form of the target keyword, if there is any.
  3. The consequence is that the non-shortest-form input "0xc1 0x81 0xc1 0x82 0xc1 0x83" penetrates the filter and can trigger a security vulnerability if the underlying JDK/JRE runtime is an OLD version, as the sketch after this list shows.
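
To make the scenario concrete, here is a minimal sketch of such a naive byte-level filter being bypassed. The class and method names are hypothetical, and the filter logic is deliberately simplistic:

    public class Utf8FilterBypass {
        // Hypothetical filter: flags input only if it contains the raw
        // shortest-form bytes 0x41 0x42 0x43 ("ABC").
        static boolean containsRawABC(byte[] in) {
            for (int i = 0; i + 2 < in.length; i++) {
                if (in[i] == 0x41 && in[i + 1] == 0x42 && in[i + 2] == 0x43)
                    return true;
            }
            return false;
        }

        public static void main(String[] args) throws Exception {
            // "ABC" encoded in non-shortest two-byte forms
            byte[] sneaky = { (byte)0xc1, (byte)0x81, (byte)0xc1, (byte)0x82,
                              (byte)0xc1, (byte)0x83 };
            System.out.println("filter sees keyword: " + containsRawABC(sneaky));
            // Old JDK/JRE: prints "ABC", so the keyword slipped through.
            // Fixed JDK/JRE: prints replacement characters instead.
            System.out.println("decoded: " + new String(sneaky, "UTF-8"));
        }
    }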

So the recommendation is: Update to the latest JDK/JRE releases to avoid the potential risk.

Wait, there is another big bonus for updating: performance.

The UTF-8 charset implementation had not been updated or touched for years. UTF-8 is very widely used as the default encoding for XML, and more and more websites use it as their page encoding. Given that, we had taken the defensive position of "don't change it if it works" over the past years.

So Martin and I decided to take this opportunity to give it a speed boost as well. The following data is from one run of my benchmark, comparing the decoding/encoding operations of the new and old implementations under the -server VM. (This is not an official benchmark; it is provided only to give a rough idea of the performance boost.)

The new implementation is much faster, especially when decoding or encoding single-byte (ASCII) data. The new decoder and encoder are faster under the -client VM as well, but the gap is not as big as under the -server VM. I wanted to show you the best :-)


    Method                        New    Old  (millis)
    Decoding 1b UTF-8         :  1786  12689
    Decoding 2b UTF-8         : 21061  30769
    Decoding 3b UTF-8         : 23412  44256
    Decoding 4b UTF-8         : 30732  35909
    Decoding 1b (direct)UTF-8 : 16015  22352
    Decoding 2b (direct)UTF-8 : 63813  82686
    Decoding 3b (direct)UTF-8 : 89999 111579
    Decoding 4b (direct)UTF-8 : 73126  60366
    Encoding 1b UTF-8         :  2528  12713
    Encoding 2b UTF-8         : 14372  33246
    Encoding 3b UTF-8         : 25734  26000
    Encoding 4b UTF-8         : 23293  31629
    Encoding 1b (direct)UTF-8 : 18776  19883
    Encoding 2b (direct)UTF-8 : 50309  59327
    Encoding 3b (direct)UTF-8 : 77006  74286
    Encoding 4b (direct)UTF-8 : 61626  66517
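
The benchmark harness itself isn't shown here, but if you want a rough feel for the "1b UTF-8" decoding case on your own machine, a minimal timing loop along these lines (class name and buffer sizes are arbitrary) would do:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.util.Arrays;

    public class Utf8DecodeTiming {
        public static void main(String[] args) {
            byte[] ascii = new byte[1024 * 1024];
            Arrays.fill(ascii, (byte)'A');          // single-byte (ASCII) input
            CharsetDecoder dec = Charset.forName("UTF-8").newDecoder();
            CharBuffer out = CharBuffer.allocate(ascii.length);
            long start = System.nanoTime();
            for (int i = 0; i < 100; i++) {         // repeat to smooth out noise
                dec.reset();
                out.clear();
                dec.decode(ByteBuffer.wrap(ascii), out, true);
            }
            System.out.printf("decoded 100 x 1MB in %d ms%n",
                              (System.nanoTime() - start) / 1000000);
        }
    }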

The new UTF-8 charset implementation has been integrated into JDK 7, OpenJDK 6, JDK 6 update 11 and later, JDK 5.0u17, and 1.4.2_19.

If you are interested in what the change looks like, you can take a peek at the webrev of the new UTF_8.java for OpenJDK7.

Xueming Shen is an engineer at Sun Microsystems, working in the Java core technologies group.