Java Tip 117: Transfer binary data in an XML document

Three ways to encode and decode binary data for embedding within an XML document

XML has gained considerable popularity over the past few years as a solution to enterprise integration problems. It provides the means for exchanging data in a platform- and language-independent manner. To achieve this independence, XML trades encoding efficiency and network bandwidth for simplicity. Applications use XML documents as a universal datatype for passing data to one another without worrying about whether both sides use the same distributed object framework.

While incorporating XML into your distributed applications, you may encounter the need to transfer binary data as part of your XML document. For example, you may need to pass binary images to the client embedded within an XML document that also carries other data elements. Simply embedding the raw byte values within the XML document won't work, both because of the XML specification's valid-character restriction and because of the character encoding and decoding that takes place as the document travels from its source to its parsing destination.

According to the XML 1.0 specification, valid character values include the following ranges of hexadecimal values: 0x9, 0xA, 0xD, 0x20-0xD7FF, 0xE000-0xFFFD, and 0x10000-0x10FFFF. The specification also uses the character definition specified by the ISO/IEC 10646 standard and requires that all conforming XML processors "...accept the UTF-8 and UTF-16 encodings of 10646."
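To make the character restriction concrete, here is a minimal sketch that tests code points against the ranges listed above. The class and method names (XmlChars, isValidXmlChar, countInvalidByteValues) are illustrative assumptions, not part of any library.

```java
// Sketch only: checks code points against the XML 1.0 ranges quoted above.
// XmlChars and its methods are hypothetical names chosen for illustration.
final class XmlChars {
    private XmlChars() {}

    /** Returns true if the code point falls in one of the XML 1.0 valid ranges. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Counts the byte values 0-255 that, read as code points, are not valid. */
    static int countInvalidByteValues() {
        int invalid = 0;
        for (int b = 0; b < 256; b++) {
            if (!isValidXmlChar(b)) {
                invalid++;
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        System.out.println(isValidXmlChar(0x41)); // 'A' is a valid character
        System.out.println(isValidXmlChar(0x00)); // NUL is not
        System.out.println(countInvalidByteValues());
    }
}
```

Even before any encoding issue comes into play, 29 of the 256 possible byte values (everything below 0x20 except tab, line feed, and carriage return) map to code points that may not appear in an XML 1.0 document.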

For readers not familiar with ISO/IEC 10646, the standard was first published in 1993 by the International Organization for Standardization (ISO). Its objective is to specify a binary encoding for the characters used in every written language. To provide compatibility between the multilingual encodings and the many existing software applications that use the ASCII standard, the ISO has defined several transformation formats, including the UTF-8 and UTF-16 encodings. For more information about the ISO/IEC 10646 standard and the UTF encodings, see the Resources section.

What does all this have to do with the problem at hand? Well, if you embed the raw binary data inside an element of the XML document, the receiving XML processor attempts to interpret the byte sequence according to the UTF-8 or UTF-16 encoding. The parser will most likely encounter invalid sequences and fail.
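The failure is easy to reproduce. The sketch below, which assumes the DOM parser bundled with the JDK, wraps a payload unmodified in a hypothetical data element and reports whether the parser accepts the result. The class and method names are illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;

// Sketch only: RawBytesInXml and the <data> element are hypothetical names.
final class RawBytesInXml {
    private RawBytesInXml() {}

    /** Wraps the payload bytes, unmodified, in a data element. */
    static byte[] wrap(byte[] payload) {
        byte[] open = "<data>".getBytes(StandardCharsets.UTF_8);
        byte[] close = "</data>".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(open, 0, open.length);
        out.write(payload, 0, payload.length);
        out.write(close, 0, close.length);
        return out.toByteArray();
    }

    /** Returns true if the JDK's default DOM parser accepts the document. */
    static boolean parses(byte[] document) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(document));
            return true;
        } catch (SAXException | IOException e) {
            // Invalid characters or malformed byte sequences end up here.
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] text = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] binary = {(byte) 0x00, (byte) 0xFF, (byte) 0x89, 'P', 'N', 'G'};
        System.out.println("plain text parses: " + parses(wrap(text)));
        System.out.println("raw binary parses: " + parses(wrap(binary)));
    }
}
```

The binary payload contains both a NUL byte, which is outside the valid character ranges, and the byte 0xFF, which never occurs in well-formed UTF-8, so the parser rejects the document. The default error handler may also print a diagnostic to standard error.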

This implies that you must encode your own binary data into the valid character set before embedding it into the XML document. Obviously, you then have to decode the data on the receiving side. In the rest of this tip, I describe three different approaches for encoding binary data before embedding it into an XML document.

The brute force approach

The direct approach to solving this encoding problem converts each byte of binary data into its two-character hexadecimal representation. That way, each of the 256 possible byte values is encoded as two characters drawn from the set 0-9, a-f:

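  // readFile and padHexString are helper methods not shown here.
  byte[] buffer = readFile(filename);
  int readBytes = buffer.length;
  StringBuffer hexData = new StringBuffer();
  for (int i = 0; i < readBytes; i++) {
     // Mask with 0xff to treat the byte as unsigned, then pad to two hex digits.
     hexData.append(padHexString(Integer.toHexString(0xff & buffer[i])));
  }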
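Because the snippet above depends on helpers that are not shown, here is a self-contained sketch of the same idea together with the matching decoder for the receiving side. HexCodec and its methods are hypothetical names, the inline padding stands in for padHexString, and the sketch uses StringBuilder (available since Java 5) where the original uses StringBuffer.

```java
// Sketch only: HexCodec is a hypothetical name, not the author's class.
final class HexCodec {
    private HexCodec() {}

    /** Encodes each byte as exactly two lowercase hexadecimal characters. */
    static String encode(byte[] data) {
        StringBuilder hex = new StringBuilder(data.length * 2);
        for (byte b : data) {
            String s = Integer.toHexString(0xff & b);
            if (s.length() == 1) {
                hex.append('0'); // left-pad values below 0x10
            }
            hex.append(s);
        }
        return hex.toString();
    }

    /** Decodes a string produced by encode back into the original bytes. */
    static byte[] decode(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd-length hex string");
        }
        byte[] data = new byte[hex.length() / 2];
        for (int i = 0; i < data.length; i++) {
            String pair = hex.substring(2 * i, 2 * i + 2);
            data[i] = (byte) Integer.parseInt(pair, 16);
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] original = {0x00, 0x0f, (byte) 0xff};
        String encoded = encode(original);
        System.out.println(encoded);
        System.out.println(java.util.Arrays.equals(original, decode(encoded)));
    }
}
```

The encoded text contains only the characters 0-9 and a-f, all of which are valid XML characters, at the cost of doubling the size of the data.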


As the code above illustrates, the conversion is simple enough. Timing the conversion routine on a typical desktop PC (an 800 MHz Pentium III with 256 MB of memory) gave my team a conversion rate of 485 KB/sec. Note that we used a StringBuffer rather than plain String concatenation to build the character representation of the binary buffer, which avoids the unnecessary cost of repeatedly creating and then releasing String instances. If necessary, you could accelerate this conversion using a hexadecimal lookup table. Timing that variant on the same PC gave my team a conversion rate of 1,920 KB/sec, about a fourfold increase.
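One way a lookup-table version could be written is sketched below. This is an assumption about the general technique, not the author's original code, and the names HexTable and DIGITS are hypothetical. Each byte is split into its high and low four bits, and each half indexes directly into a 16-entry character table, which avoids creating an intermediate String per byte.

```java
// Sketch only: a table-driven hex encoder; HexTable and DIGITS are
// hypothetical names and this is not the author's original code.
final class HexTable {
    private static final char[] DIGITS = "0123456789abcdef".toCharArray();

    private HexTable() {}

    /** Encodes each byte as two hex characters using a table lookup. */
    static String encode(byte[] data) {
        char[] out = new char[data.length * 2];
        for (int i = 0; i < data.length; i++) {
            int v = data[i] & 0xff;           // treat the byte as unsigned
            out[2 * i] = DIGITS[v >>> 4];     // high four bits
            out[2 * i + 1] = DIGITS[v & 0x0f]; // low four bits
        }
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(encode(new byte[] {0x00, 0x0f, (byte) 0xff}));
    }
}
```

Any speedup from this variant depends on the JVM and hardware, so it is worth measuring in your own environment rather than assuming the figures quoted above.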

Resources