Simplify XML processing with VTD-XML

A new option that overcomes the problems of DOM and SAX

Figure 3. Big XML files. Click on thumbnail to view full-sized image.

Eight years since its inception, XML has already taken off as an open, semi-structured data format for storing data as well as exchanging data over the Web. Due to its simplicity and human readability, XML has seen its popularity skyrocket among application developers and has become an indispensable part of enterprise architecture.

Although it is difficult to enumerate the number of ways XML is being used, one can be certain about one thing: XML must be parsed before anything else can be done. In fact, choosing the right parser is often one of the first decisions that enterprise developers must tackle in their projects. And again and again, that decision comes down to the two popular XML processing models: the Document Object Model (DOM) and the Simple API for XML (SAX).

At first glance, the respective strengths and weaknesses of DOM and SAX seem complementary: DOM builds in-memory object graphs; SAX is event-based and stores nothing in memory. So if the document size is small and the data access pattern, complex, DOM is the way to go; otherwise, use SAX.

However, the truth is never so simplistic. More often than not, developers are unwilling to use SAX because of its complexity, yet still do because no other viable choice is available. Otherwise, if the XML file size is just slightly larger than a few hundreds of kilobytes, DOM's memory overhead and performance drag become a tough roadblock for application developers, preventing them from meeting their projects' minimum performance goals.

But is SAX really that much better? SAX's advertised parsing performance—typically several times faster than DOM—is actually often deceiving. It turns out that the awkward, forward-only nature of SAX parsing not only requires extra implementation effort, but also incurs performance penalties when the document structure becomes only slightly complex. If developers choose not to scan the document multiple times, they will have to buffer the document or build custom object models.

Either way, performance suffers, as exemplified by Apache Axis. On its FAQ page, Axis claims to internally use SAX to create a higher-performing implementation, yet it still builds its own object model that is quite DOM-like, resulting in negligible performance improvements when compared with its predecessor (Apache SOAP). In addition, SAX doesn't work well with XPath, and in general can't drive XSLT (Extensible Stylesheet Language Transformation) processing. So SAX parsing skirts the real problems of XML processing.

Seeking an easier-to-use alternative to SAX, a growing number of developers have turned to StAX (Streaming API for XML). Compared with SAX, StAX parsers pull tokens from XML files instead of using call backs. While they noticeably improve the usability, the fundamental issues persist—StAX's forward-only parsing style still requires tedious implementation effort and, along with it, hidden performance costs.

The bottom line: For any XML processing model to be broadly useful, it must present the hierarchical structure of XML and nothing less. The reason is because XML is designed to move complex data over the Web, and conveying the structural information is an inherent part of what XML does.

VTD-XML changes the game

Suppose we were to start XML processing from scratch to overcome the aforementioned issues with DOM and SAX. The new model should probably have the following properties:

  • Random access capable: The processing model should allow the developer to navigate some sort of hierarchical structure either manually or, better, by using XPath.
  • High performance: The performance should be substantially better than DOM and SAX. And the performance should be "honest," meaning the measurement must include the time spent on building the hierarchical structure.
  • Low memory usage: To make the processing model applicable to a broad range of scenarios and file sizes, it must present the full structure of XML with a minimum amount of memory usage.

Designed to fulfill those goals, VTD-XML is the next-generation open source XML processing model that brings fundamental and all-around improvements over DOM and SAX. One key optimization of VTD-XML is non-extractive tokenization. Internally, VTD-XML retains in memory the intact and undecoded XML message, and represents tokens exclusively based on a binary encoding specification called Virtual Token Descriptor. A VTD record is a 64-bit integer that encodes the token length, starting offset, type, and nesting depth of a token in XML.

Here is a bit of the history of VTD-XML in case you are interested: The basic concept was conceived as a way to port XML processing on dedicated hardware, in the form of FPGA or ASIC, to enable network switches and routers to process XML content at very high speeds. Later, the VTD-XML project team decided to open source VTD-XML, and the initial release—of version 0.5 and implemented in Java—took place in May 2004. Since that release, VTD-XML has undergone several rounds of improvements and matured considerably. In version 0.8, the C version of VTD-XML was released along with the Java version. Built-in XPath support was introduced in version 1.0 and released in October of 2005. The latest release, version 1.5, features a rewritten parsing engine that is more modular and higher performing.

Also introduced in this release is a feature called buffer reuse. The basic idea is that when an XML application sitting behind a network connection needs to repetitively process many incoming XML documents, the application can actually reuse memory buffers allocated during the first processing run. In other words, allocate buffers once and use them many, many times. Specific to VTD-XML, this feature allows the complete elimination of both object creation and garbage collection cost (50-80 percent of overhead in DOM and SAX) from XML processing. The project Website contains the latest software downloads and an in-depth technical description of VTD-XML.

A quick example

To give a feel of VTD-XML's programming style, this article first compares code using both VTD-XML and DOM to parse and navigate a simple XML file named test.xml, whose text content is shown below:

 <purchaseOrder orderDate="1999-10-21">
       <item partNum="872-AA">
         <productName>Lawnmower</productName>
         <quantity>1</quantity>
         <USPrice>148.95</USPrice>
       </item>
</purchaseOrder>

The VTD-XML version looks like this:

 

import com.ximpleware.*; import com.ximpleware.parser.*; import java.io.*;

public class use_vtd { public static void main(String[] args){ try{ File f = new File("test.xml"); FileInputStream fis = new FileInputStream(f); byte[] ba = new byte[(int)f.length()]; fis.read(ba); VTDGen vg = new VTDGen(); vg.setDoc(ba); vg.parse(false); VTDNav vn = vg.getNav(); if (vn.matchElement("purchaseOrder")){ System.out.println(" orderDate==>" + vn.toString(vn.getAttrVal("orderDate"))); if (vn.toElement(VTDNav.FIRST_CHILD,"item")){ if (vn.toElement(VTDNav.FIRST_CHILD)){ do { System.out.print( vn.toString(vn.getCurrentIndex())); System.out.print("==>");

System.out.println( vn.toString(vn.getText())); } while(vn.toElement(VTDNav.NEXT_SIBLING)); } } } } catch (Exception e){ System.out.println("exception occurred ==>"+e); } } }

The DOM version of the same application is shown below:

 

import java.io.*; import org.w3c.dom.*; import org.w3c.*; import javax.xml.parsers.*; import javax.xml.parsers.DocumentBuilder; import javax.xml.parsers.DocumentBuilderFactory; import javax.xml.parsers.FactoryConfigurationError; import javax.xml.parsers.ParserConfigurationException; import org.w3c.dom.*; import org.xml.sax.SAXException;

public class use_dom { public static void main(String[] args){ try{ DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); DocumentBuilder parser = factory.newDocumentBuilder(); Document d= parser.parse("test.xml"); Element root = d.getDocumentElement(); if (root.getNodeName().compareTo("purchaseOrder")==0){ System.out.println(" orderDate==> " + root.getAttribute("orderDate"));

Node n = root.getFirstChild(); if (n != null){ do { if (n.getNodeType() == Node.ELEMENT_NODE && n.getNodeName().compareTo("item")==0){ Node n2 = n.getFirstChild(); if (n2!=null){ do { if (n2.getNodeType() == Node.ELEMENT_NODE){ System.out.println( n2.getNodeName() + "==>" + n2.getFirstChild().getNodeValue() ); } }while((n2=n2.getNextSibling())!=null); } } }while ((n=n.getNextSibling()) != null ); } } } catch (Exception e){ System.out.println("exception occurred ==>"+e); } } }

As illustrated in code examples above, VTD-XML navigates the XML hierarchy using a cursor-based API. In contrast, the DOM API navigates the hierarchy by requesting object references. Please visit the VTD-XML project Website for more technical materials and code examples that explain VTD-XML in great depth.

Benchmarking VTD-XML

Next, let's compare VTD-XML's performance and memory usage with some popular XML parsers. It should be noted that most articles containing benchmark numbers, such as "XML Documents on the Run" by Dennis Sosnoski (JavaWorld, April 2002), are from several years ago. Since then, better and faster hardware are following Moore's Law and becoming cheaper than ever. At the same time, XML parsing and the Java virtual machine have not been standing still—they have seen improvements in many key areas.

Test setup

The test platform is a Sony VAIO laptop equipped with a Pentium M 1.7 GHz processor (2 MB integrated L2 cache) and 512 MB DDR2 RAM. The front bus is clocked at 400 MHz. The OS is Windows XP Professional Edition with services pack 2. The JVM is version 1.5.0_06.

The benchmark tests the latest versions of the following XML parsers:

  • Xerces DOM 2.7.1, with and without deferred node expansion
  • Xerces SAX 2.7.1
  • Piccolo SAX 1.04
  • XPP3 1.1.3.4.O
  • VTD-XML 1.5, with and without buffer reuse

I selected a large collection of XML documents of varying sizes and structural complexities for the test. Depending on the file size, the test documents are grouped into three categories. Small files are less than 10 KB in size. Mid-sized files are between 10 KB and 1 MB. Files larger than 1 MB are considered big.

The server JVM was used for all performance measurements to obtain the peak performance. In those tests, the benchmark programs first looped through the parsing or navigation routines numerous times so that the JVM performed the dynamic, just-in-time optimization of the byte code, before averaging the performance of subsequent iterations as the final results. To reduce timing variation due to disk I/O, the benchmark programs read all XML files into in-memory buffers prior to the test runs.

Note: Interested readers can download the benchmark program from Resources.

Parsing throughput comparisons

This section presents the XML parsing performance in both latency and throughput. Notice that while VTD-XML and DOM are directly comparable, it is not fair to compare VTD-XML with SAX or Pull because they don't build any hierarchical structure in memory. So performance for SAX and Pull only serves as an additional reference point.

Throughput

Figure 1. Small files. Click on thumbnail to view full-sized image.
Figure 2. Medium-sized files. Click on thumbnail to view full-sized image.
Figure 3. Big XML files. Click on thumbnail to view full-sized image.

Latency comparisons

Table 1. Small files

File name/sizeVTD-XML (ms)VTD-XML buffer reuse (ms)SAX (ms)DOM(ms)DOM deferred(ms)Piccolo (ms)Pull (ms)
soap2.xml (1727 bytes)0.04460.03460.07820.11220.162250.0920.066
nav_48_0.xml (4608 bytes)0.10540.09280.2660.370.3850.27840.1742
cd_catalog.xml (5035 bytes)0.1180.1080.190.3480.40.20.214
nav_63_0.xml (6848 bytes)0.1490.1350.3540.5130.5570.4840.242
nav_78_0.xml (6920 bytes)0.1530.1420.37040.5880.520.420.29

Table 2. Medium XML files

1 2 Page
Join the discussion
Be the first to comment on this article. Our Commenting Policies
See more