Simplify XML processing with VTD-XML

A new option that overcomes the problems of DOM and SAX

Figure 3. Big XML files. Click on thumbnail to view full-sized image.

Eight years since its inception, XML has already taken off as an open, semi-structured data format for storing data as well as exchanging data over the Web. Due to its simplicity and human readability, XML has seen its popularity skyrocket among application developers and has become an indispensable part of enterprise architecture.

Although it is difficult to enumerate the number of ways XML is being used, one can be certain about one thing: XML must be parsed before anything else can be done. In fact, choosing the right parser is often one of the first decisions that enterprise developers must tackle in their projects. And again and again, that decision comes down to the two popular XML processing models: the Document Object Model (DOM) and the Simple API for XML (SAX).

At first glance, the respective strengths and weaknesses of DOM and SAX seem complementary: DOM builds in-memory object graphs; SAX is event-based and stores nothing in memory. So if the document size is small and the data access pattern, complex, DOM is the way to go; otherwise, use SAX.

However, the truth is never so simplistic. More often than not, developers are unwilling to use SAX because of its complexity, yet still do because no other viable choice is available. Otherwise, if the XML file size is just slightly larger than a few hundreds of kilobytes, DOM's memory overhead and performance drag become a tough roadblock for application developers, preventing them from meeting their projects' minimum performance goals.

But is SAX really that much better? SAX's advertised parsing performance—typically several times faster than DOM—is actually often deceiving. It turns out that the awkward, forward-only nature of SAX parsing not only requires extra implementation effort, but also incurs performance penalties when the document structure becomes only slightly complex. If developers choose not to scan the document multiple times, they will have to buffer the document or build custom object models.

Either way, performance suffers, as exemplified by Apache Axis. On its FAQ page, Axis claims to internally use SAX to create a higher-performing implementation, yet it still builds its own object model that is quite DOM-like, resulting in negligible performance improvements when compared with its predecessor (Apache SOAP). In addition, SAX doesn't work well with XPath, and in general can't drive XSLT (Extensible Stylesheet Language Transformation) processing. So SAX parsing skirts the real problems of XML processing.

Seeking an easier-to-use alternative to SAX, a growing number of developers have turned to StAX (Streaming API for XML). Compared with SAX, StAX parsers pull tokens from XML files instead of using call backs. While they noticeably improve the usability, the fundamental issues persist—StAX's forward-only parsing style still requires tedious implementation effort and, along with it, hidden performance costs.

The bottom line: For any XML processing model to be broadly useful, it must present the hierarchical structure of XML and nothing less. The reason is because XML is designed to move complex data over the Web, and conveying the structural information is an inherent part of what XML does.

VTD-XML changes the game

Suppose we were to start XML processing from scratch to overcome the aforementioned issues with DOM and SAX. The new model should probably have the following properties:

  • Random access capable: The processing model should allow the developer to navigate some sort of hierarchical structure either manually or, better, by using XPath.
  • High performance: The performance should be substantially better than DOM and SAX. And the performance should be "honest," meaning the measurement must include the time spent on building the hierarchical structure.
  • Low memory usage: To make the processing model applicable to a broad range of scenarios and file sizes, it must present the full structure of XML with a minimum amount of memory usage.

Designed to fulfill those goals, VTD-XML is the next-generation open source XML processing model that brings fundamental and all-around improvements over DOM and SAX. One key optimization of VTD-XML is non-extractive tokenization. Internally, VTD-XML retains in memory the intact and undecoded XML message, and represents tokens exclusively based on a binary encoding specification called Virtual Token Descriptor. A VTD record is a 64-bit integer that encodes the token length, starting offset, type, and nesting depth of a token in XML.

Here is a bit of the history of VTD-XML in case you are interested: The basic concept was conceived as a way to port XML processing on dedicated hardware, in the form of FPGA or ASIC, to enable network switches and routers to process XML content at very high speeds. Later, the VTD-XML project team decided to open source VTD-XML, and the initial release—of version 0.5 and implemented in Java—took place in May 2004. Since that release, VTD-XML has undergone several rounds of improvements and matured considerably. In version 0.8, the C version of VTD-XML was released along with the Java version. Built-in XPath support was introduced in version 1.0 and released in October of 2005. The latest release, version 1.5, features a rewritten parsing engine that is more modular and higher performing.

Also introduced in this release is a feature called buffer reuse. The basic idea is that when an XML application sitting behind a network connection needs to repetitively process many incoming XML documents, the application can actually reuse memory buffers allocated during the first processing run. In other words, allocate buffers once and use them many, many times. Specific to VTD-XML, this feature allows the complete elimination of both object creation and garbage collection cost (50-80 percent of overhead in DOM and SAX) from XML processing. The project Website contains the latest software downloads and an in-depth technical description of VTD-XML.

A quick example

To give a feel of VTD-XML's programming style, this article first compares code using both VTD-XML and DOM to parse and navigate a simple XML file named test.xml, whose text content is shown below:

 <purchaseOrder orderDate="1999-10-21">
       <item partNum="872-AA">
         <productName>Lawnmower</productName>
         <quantity>1</quantity>
         <USPrice>148.95</USPrice>
       </item>
</purchaseOrder>

The VTD-XML version looks like this:

 

import com.ximpleware.*; import com.ximpleware.parser.*; import java.io.*;

public class use_vtd { public static void main(String[] args){ try{ File f = new File("test.xml"); FileInputStream fis = new FileInputStream(f); byte[] ba = new byte[(int)f.length()]; fis.read(ba); VTDGen vg = new VTDGen(); vg.setDoc(ba); vg.parse(false); VTDNav vn = vg.getNav(); if (vn.matchElement("purchaseOrder")){ System.out.println(" orderDate==>" + vn.toString(vn.getAttrVal("orderDate"))); if (vn.toElement(VTDNav.FIRST_CHILD,"item")){ if (vn.toElement(VTDNav.FIRST_CHILD)){ do { System.out.print( vn.toString(vn.getCurrentIndex())); System.out.print("==>");

System.out.println( vn.toString(vn.getText())); } while(vn.toElement(VTDNav.NEXT_SIBLING)); } } } } catch (Exception e){ System.out.println("exception occurred ==>"+e); } } }

The DOM version of the same application is shown below:

 

import java.io.*; import org.w3c.dom.*; import org.w3c.*; import javax.xml.parsers.*; import javax.xml.parsers.DocumentBuilder; import javax.xml.parsers.DocumentBuilderFactory; import javax.xml.parsers.FactoryConfigurationError; import javax.xml.parsers.ParserConfigurationException; import org.w3c.dom.*; import org.xml.sax.SAXException;

public class use_dom { public static void main(String[] args){ try{ DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); DocumentBuilder parser = factory.newDocumentBuilder(); Document d= parser.parse("test.xml"); Element root = d.getDocumentElement(); if (root.getNodeName().compareTo("purchaseOrder")==0){ System.out.println(" orderDate==> " + root.getAttribute("orderDate"));

Node n = root.getFirstChild(); if (n != null){ do { if (n.getNodeType() == Node.ELEMENT_NODE && n.getNodeName().compareTo("item")==0){ Node n2 = n.getFirstChild(); if (n2!=null){ do { if (n2.getNodeType() == Node.ELEMENT_NODE){ System.out.println( n2.getNodeName() + "==>" + n2.getFirstChild().getNodeValue() ); } }while((n2=n2.getNextSibling())!=null); } } }while ((n=n.getNextSibling()) != null ); } } } catch (Exception e){ System.out.println("exception occurred ==>"+e); } } }

As illustrated in code examples above, VTD-XML navigates the XML hierarchy using a cursor-based API. In contrast, the DOM API navigates the hierarchy by requesting object references. Please visit the VTD-XML project Website for more technical materials and code examples that explain VTD-XML in great depth.

Benchmarking VTD-XML

Next, let's compare VTD-XML's performance and memory usage with some popular XML parsers. It should be noted that most articles containing benchmark numbers, such as "XML Documents on the Run" by Dennis Sosnoski (JavaWorld, April 2002), are from several years ago. Since then, better and faster hardware are following Moore's Law and becoming cheaper than ever. At the same time, XML parsing and the Java virtual machine have not been standing still—they have seen improvements in many key areas.

Test setup

The test platform is a Sony VAIO laptop equipped with a Pentium M 1.7 GHz processor (2 MB integrated L2 cache) and 512 MB DDR2 RAM. The front bus is clocked at 400 MHz. The OS is Windows XP Professional Edition with services pack 2. The JVM is version 1.5.0_06.

The benchmark tests the latest versions of the following XML parsers:

  • Xerces DOM 2.7.1, with and without deferred node expansion
  • Xerces SAX 2.7.1
  • Piccolo SAX 1.04
  • XPP3 1.1.3.4.O
  • VTD-XML 1.5, with and without buffer reuse

I selected a large collection of XML documents of varying sizes and structural complexities for the test. Depending on the file size, the test documents are grouped into three categories. Small files are less than 10 KB in size. Mid-sized files are between 10 KB and 1 MB. Files larger than 1 MB are considered big.

The server JVM was used for all performance measurements to obtain the peak performance. In those tests, the benchmark programs first looped through the parsing or navigation routines numerous times so that the JVM performed the dynamic, just-in-time optimization of the byte code, before averaging the performance of subsequent iterations as the final results. To reduce timing variation due to disk I/O, the benchmark programs read all XML files into in-memory buffers prior to the test runs.

Note: Interested readers can download the benchmark program from Resources.

Parsing throughput comparisons

This section presents the XML parsing performance in both latency and throughput. Notice that while VTD-XML and DOM are directly comparable, it is not fair to compare VTD-XML with SAX or Pull because they don't build any hierarchical structure in memory. So performance for SAX and Pull only serves as an additional reference point.

Throughput

Figure 1. Small files. Click on thumbnail to view full-sized image.
Figure 2. Medium-sized files. Click on thumbnail to view full-sized image.
Figure 3. Big XML files. Click on thumbnail to view full-sized image.

Latency comparisons

Table 1. Small files

File name/sizeVTD-XML (ms)VTD-XML buffer reuse (ms)SAX (ms)DOM(ms)DOM deferred(ms)Piccolo (ms)Pull (ms)
soap2.xml (1727 bytes)0.04460.03460.07820.11220.162250.0920.066
nav_48_0.xml (4608 bytes)0.10540.09280.2660.370.3850.27840.1742
cd_catalog.xml (5035 bytes)0.1180.1080.190.3480.40.20.214
nav_63_0.xml (6848 bytes)0.1490.1350.3540.5130.5570.4840.242
nav_78_0.xml (6920 bytes)0.1530.1420.37040.5880.520.420.29

Table 2. Medium XML files

File name/sizeVTD-XML (ms)VTD-XML buffer reuse (ms)SAX (ms)DOM(ms)DOM deferred(ms)Piccolo (ms)Pull (ms)
nav_50_0.xml (10304 bytes)0.20.1850.550.8020.7730.7010.398
officeOrder.xml (10591 bytes)0.1860.1740.410.6170.6150.5260.432
form.xml (15845 bytes)0.2740.2580.2270.2140.4860.7730.921
book.xml (22996 bytes)0.3680.3540.7432.3912.0460.8430.857
soap_small.xml (26734 bytes)0.580.5631.2213.8253.0681.3461.137
cd.xml (30831 bytes)0.5690.5491.2055.0924.3761.2111.362
bioinfo.xml (34759 bytes)0.5290.5171.0684.1264.3661.1881.33
soap_mid.xml (134334 bytes)2.8852.8046.02832.84621.8966.6685.828

Table 3. Big XML files

File name/sizeVTD-XML (ms)VTD-XML buffer reuse (ms)SAX (ms)DOM(ms)DOM deferred(ms)Piccolo (ms)Pull (ms)
po1m.xml (1.01 MB)25.7120.0836.4186.16115.6747.6263.27
soap.xml (2.59 MB)64.757.27123.18502.32380.74134.8393.96
bioinfo_big.xml (4.27 MB)70.173.9131.8629.1442.02151.62177.64
SUAS.xml (13.13 MB)359.91315.24665.361961.011296.08820.38637.72
address.xml (15.24 MB)315.06276658.562158.51822.22617.48684.57

Memory usage comparisons

Since SAX and Pull do not build datastructures in memory, the meaningful comparison is between DOM (both with and without deferred node expansion) and VTD-XML. To this end, this section benchmarks the multiplying factor, which is the ratio between the memory usage and the document size for large files (as memory usage is a particular concern for large files).

Figure 4. Click on thumbnail to view full-sized image.

Navigation performance comparisons

This section presents the navigation performance in terms of latency for VTD-XML and DOM (without deferred node expansion), which is the time it takes to visit every node in the document. To traverse the nodes, the DOM code relies on the nodeIterator interface, while the VTD-XML code calls the member methods selectElement(...) and iterate(...) of the class AutoPilot. As expected, navigation is much faster than parsing. For VTD-XML, the navigation cost is between 15 percent and 30 percent of the parsing cost. The ratio for DOM is 5 percent to 7 percent. Not that VTD-XML navigates slower than DOM; the difference is entirely the result of VTD-XML's far superior parsing performance.

Table 4. Small files

File name/sizeVTD-XML (ms)DOM(ms)
soap2.xml (1727 bytes)0.006710.00676
nav_48_0.xml (4608 bytes)0.0280.0155
cd_catalog.xml (5035 bytes)0.03880.0385
nav_63_0.xml (6848 bytes)0.04310.0238
nav_78_0.xml (6920 bytes)0.0430.0244

Table 5. Mid-sized files

File name/sizeVTD-XML (ms)DOM(ms)
nav_50_0.xml (10304 bytes)0.0630.034
officeOrder.xml (10591 bytes)0.07880.051
form.xml (15845 bytes)0.0650.046
book.xml (22996 bytes)0.1490.144
soap_small.xml (26734 bytes)0.2250.193
cd.xml (30831 bytes)0.2260.3
bioinfo.xml (34759 bytes)0.2360.178
soap_mid.xml (134334 bytes)1.611.151

Table 6. Large files

File name/sizeVTD-XML (ms)DOM(ms)
po1m.xml (1.01 MB)11.1910.84
soap.xml (2.59 MB)32.8435.44
bioinfo_big.xml (4.27 MB)30.4338.26
SUAS.xml (13.13 MB)21.4321.82
address.xml (15.24 MB)132.18

130.8

Result analysis

In Dennis Sosnoski's JavaWorld

article from four years ago, Piccolo was chosen as the overall winner among many SAX implementations. This has changed: The latest Xerces SAX parser has overtaken the leadership position as the best performing SAX parser. Also this article's test results indicated that, when compared with Xerces SAX, XPP3 turned in a robust performance and wasn't behind by much.

Also interesting is the finding that when the document size is small (less than 10 Kb), DOM parsing performance isn't behind SAX by as much as when the document sizes are big. For small XML files, DOM's deferred node expansion results in slower parsing performance than DOM with full node expansion.

Yet VTD-XML's performance so dominates the other parsers that it is in a class by itself. And the real comparison is between VTD-XML with and without buffer reuse. The significant better memory usage means VTD-XML can be used to process large XML files, and the performance benefit applies broadly to XML of all sizes.Conclusion

VTD-XML is a new, next-generation XML parser that overcomes many technical issues currently surrounding DOM and SAX. The combination of VTD-XML's high performance and low-memory usage has a few interesting implications. First, given how cheap DRAM (dynamic random access memory) chips are, unless there is absolutely no way to hold the XML document in memory, basing application development on a SAX parser offers few incentives. Second, applications become simple to write with VTD-XML, and they will also be much faster than previously thought possible. From big to small, VTD-XML's coverage of XML file sizes means that selecting the right processing model is now a simple process and developers no longer have to switch between the two drastically different parsing styles of DOM and SAX. Last, but not the least, VTD-XML should offer convincing answers to a few long-standing complaints about XML. For example, VTD-XML has built-in native XML indexing capability that should, once and for all, change the perception that XML is slow. Performance-wise, VTD-XML should mark the beginning of the "10x XML" era. And more importantly, the next stop for VTD-XML, just around the corner, is the era "100x XML."

Jimmy Zhang is founder of XimpleWare, a provider of high-performance XML processing solutions. He has working experiences in the fields of electronic design automation and voice-over IP with numerous Silicon Valley tech companies. He graduated from University of California, Berkeley with both a MS and a BS from the department of EECS.

Learn more about this topic

Join the discussion
Be the first to comment on this article. Our Commenting Policies
See more