Simplify XML processing with VTD-XML

A new option that overcomes the problems of DOM and SAX

Figure 3. Big XML files.

Eight years after its inception, XML has taken off as an open, semi-structured format for storing data and for exchanging it over the Web. Thanks to its simplicity and human readability, XML has seen its popularity skyrocket among application developers and has become an indispensable part of enterprise architecture.

Although it is difficult to enumerate all the ways XML is being used, one thing is certain: XML must be parsed before anything else can be done with it. In fact, choosing the right parser is often one of the first decisions that enterprise developers must make in their projects. And again and again, that decision comes down to the two popular XML processing models: the Document Object Model (DOM) and the Simple API for XML (SAX).

At first glance, the respective strengths and weaknesses of DOM and SAX seem complementary: DOM builds an in-memory object graph of the whole document; SAX is event-based and retains nothing in memory once an event has been delivered. So if the document is small and the data access pattern is complex, DOM is the way to go; otherwise, use SAX.
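To make the contrast concrete, here is a minimal sketch that counts the same elements once with DOM and once with SAX, using only the JAXP classes that ship with the JDK. The class name, method names, and the sample document are illustrative assumptions, not material from any particular project.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class DomVsSaxSketch {
    // Hypothetical sample document, invented for this illustration.
    static final String XML =
        "<orders><order id=\"1\"><item>pen</item></order>"
        + "<order id=\"2\"><item>ink</item></order></orders>";

    // DOM: the whole document becomes an in-memory tree that can be
    // queried and revisited in any order after parsing finishes.
    static int countItemsWithDom(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList items = doc.getElementsByTagName("item");
        return items.getLength();
    }

    // SAX: the parser pushes events at a handler as it reads; nothing is
    // kept unless the handler itself records it (here, a single counter).
    static int countItemsWithSax(String xml) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if ("item".equals(qName)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countItemsWithDom(XML));
        System.out.println(countItemsWithSax(XML));
    }
}
```

Both methods return the same count for the sample input. The difference lies in what remains afterward: the DOM version still holds a navigable tree, while the SAX version holds only the counter.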

However, the truth is never so simple. More often than not, developers are reluctant to use SAX because of its complexity, yet use it anyway because no other viable choice is available. Conversely, once an XML file grows beyond a few hundred kilobytes, DOM's memory overhead and performance drag become a tough roadblock for application developers, preventing them from meeting their projects' minimum performance goals.

But is SAX really that much better? SAX's advertised parsing performance, typically several times faster than DOM, is often deceptive. It turns out that the awkward, forward-only nature of SAX parsing not only requires extra implementation effort, but also incurs performance penalties as soon as the document structure becomes even slightly complex. If developers choose not to scan the document multiple times, they have to buffer the document or build custom object models.

Either way, performance suffers, as exemplified by Apache Axis. On its FAQ page, Axis claims to use SAX internally to create a higher-performing implementation, yet it still builds its own object model that is quite DOM-like, resulting in negligible performance improvements compared with its predecessor (Apache SOAP). In addition, SAX doesn't work well with XPath, and in general can't drive XSLT (Extensible Stylesheet Language Transformations) processing. So SAX parsing skirts the real problems of XML processing.
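The buffering burden described above shows up even in a tiny case. The following sketch, built on assumptions (a hypothetical catalog document, class name, and method name), collects the titles of books whose price exceeds a threshold. Because each price arrives after its title in the event stream, the handler has to hold on to the title until the price is known.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxBufferingSketch {
    // Hypothetical sample document, invented for this illustration.
    static final String XML = "<catalog>"
        + "<book><title>A</title><price>10</price></book>"
        + "<book><title>B</title><price>50</price></book>"
        + "</catalog>";

    static List<String> titlesAbove(String xml, double minPrice) throws Exception {
        List<String> result = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            // Buffered by hand: SAX cannot look ahead to the price
            // or go back to the title once the event has passed.
            private String pendingTitle;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                text.setLength(0); // start collecting text for this element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // may be called in several chunks
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("title".equals(qName)) {
                    pendingTitle = text.toString();
                } else if ("price".equals(qName)) {
                    if (Double.parseDouble(text.toString().trim()) > minPrice) {
                        result.add(pendingTitle);
                    }
                } else if ("book".equals(qName)) {
                    pendingTitle = null; // reset state for the next record
                }
                text.setLength(0);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(titlesAbove(XML, 20.0));
    }
}
```

Every additional field or level of nesting adds another piece of hand-managed state to the handler, which is how SAX applications tend to drift toward a custom object model of their own.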

Seeking an easier-to-use alternative to SAX, a growing number of developers have turned to StAX (Streaming API for XML). Unlike SAX parsers, StAX parsers let the application pull tokens from XML files instead of receiving them through callbacks. While StAX noticeably improves usability, the fundamental issues persist: its forward-only parsing style still requires tedious implementation effort and, along with it, hidden performance costs.
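As a rough illustration of the pull style, the sketch below filters a hypothetical catalog document with the JDK's javax.xml.stream API, returning the titles of books priced above a threshold. The class name, method name, and sample data are assumptions made for this example. The loop reads more naturally than a callback handler, but the code must still carry the title forward by hand because the cursor only moves forward.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class StaxPullSketch {
    // Hypothetical sample document, invented for this illustration.
    static final String XML = "<catalog>"
        + "<book><title>A</title><price>10</price></book>"
        + "<book><title>B</title><price>50</price></book>"
        + "</catalog>";

    static List<String> titlesAbove(String xml, double minPrice) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
            .createXMLStreamReader(new StringReader(xml));
        List<String> result = new ArrayList<>();
        String pendingTitle = null; // still buffered by hand, as with SAX
        try {
            // The application drives the parser: it asks for the next token.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("title".equals(name)) {
                        pendingTitle = reader.getElementText();
                    } else if ("price".equals(name)) {
                        double price = Double.parseDouble(reader.getElementText().trim());
                        if (price > minPrice) {
                            result.add(pendingTitle);
                        }
                    }
                }
            }
        } finally {
            reader.close();
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(titlesAbove(XML, 20.0));
    }
}
```

The control flow now lives in ordinary application code, which is the usability gain. Nothing here allows stepping back to an earlier element, so any query that needs data out of document order still calls for buffering or a second pass.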
