XML documents on the run, Part 1

SAX speeds through XML documents with parse-event streams

One of the oldest approaches to processing XML documents in Java also proves one of the fastest: parse-event streams. That approach became standardized in Java with the SAX (Simple API for XML) interface specification, later revised as SAX2 to include support for XML Namespaces.

Event-stream processing offers advantages beyond speed. Because the parser reports the document's contents on the fly, your program can start handling the document as soon as its first part has been read. Other approaches generally require you to parse the complete document before you start working with it -- fine if the document comes off a local disk drive, but if the document is sent from another system, waiting for the complete document can cause significant delays.

Event-stream processing also removes the document size limits that memory imposes. In contrast, approaches that store the document's representation in memory can run out of space with very large documents. Setting a hard limit on the size of real-world documents is often difficult, and hitting such a limit is potentially a major problem in many applications.
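As a rough illustration of both points, the sketch below feeds a SAX parser from a Reader that generates its document lazily, so the complete text never exists in memory at once. The handler counts elements and notes how many characters the parser had pulled from the Reader when the first event arrived. The class names and the quotes/quote element names are invented for this illustration and are not taken from the article's example code. It assumes the JAXP parser bundled with the JDK.

```java
import java.io.Reader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingCount {

    /** Hypothetical Reader that produces a <quotes> document one record at a time. */
    static class GeneratedDocReader extends Reader {
        private final int records;
        private int next = 0;                 // index of the next record to generate
        private String pending = "<quotes>";  // text not yet handed to the parser
        private int pos = 0;                  // read position within pending
        private boolean rootClosed = false;
        long delivered = 0;                   // total characters handed to the parser

        GeneratedDocReader(int records) { this.records = records; }

        /** Moves on to the next piece of the document; false when nothing is left. */
        private boolean advance() {
            if (next < records) {
                pending = "<quote n=\"" + (next++) + "\"/>";
            } else if (!rootClosed) {
                pending = "</quotes>";
                rootClosed = true;
            } else {
                return false;
            }
            pos = 0;
            return true;
        }

        @Override
        public int read(char[] buf, int off, int len) {
            int written = 0;
            while (written < len) {
                if (pos == pending.length() && !advance()) break;
                int n = Math.min(len - written, pending.length() - pos);
                pending.getChars(pos, pos + n, buf, off + written);
                pos += n;
                written += n;
            }
            delivered += written;
            return (written == 0 && len > 0) ? -1 : written;
        }

        @Override
        public void close() { }
    }

    /** Counts quote elements; records how much input had been read at the first event. */
    static class CountingHandler extends DefaultHandler {
        private final GeneratedDocReader source;
        long quotes = 0;
        long readAtFirstElement = -1;

        CountingHandler(GeneratedDocReader source) { this.source = source; }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (readAtFirstElement < 0) readAtFirstElement = source.delivered;
            if ("quote".equals(qName)) quotes++;
        }
    }

    /** Returns { quote count, characters read at first element, total characters read }. */
    public static long[] run(int records) throws Exception {
        GeneratedDocReader reader = new GeneratedDocReader(records);
        CountingHandler handler = new CountingHandler(reader);
        SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(reader), handler);
        return new long[] { handler.quotes, handler.readAtFirstElement, reader.delivered };
    }

    public static void main(String[] args) throws Exception {
        long[] r = run(100000);
        System.out.println("quote elements counted:   " + r[0]);
        System.out.println("chars read at first event: " + r[1]);
        System.out.println("chars read in total:       " + r[2]);
    }
}
```

The exact number of characters read before the first event depends on the parser's internal buffering, so treat that figure as indicative only. The point is that it is a small fraction of the whole document, and that the handler keeps nothing but two counters no matter how many records pass through.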

A note on the source code

This article features two example source code files: stock.jar and option.jar, both found in a downloadable zip file in Resources. Each jar file includes full example implementations, along with sample documents and test driver programs. To try an example, create a new directory, then extract the jar's files to that directory with jar xvf stock.jar or jar xvf option.jar. The readme.txt file gives instructions for setting up and running the test drivers.

The event view

Parsers with event-stream interfaces deliver a document one piece at a time. Think of the document's text as spread out in time, as it would be if read from a stream. The parser looks for significant document components (start and end tags, character data, and so on) in the text, generating parse events for each.

For example, here's a simple document:

<author>
  <first-name>Dennis</first-name>
  <last-name>Sosnoski</last-name>
</author>


The table shows the parse-event sequence a SAX2 parser would generate for this document (though the parser can divide up the character data reported by characters events differently than I've shown, as I discuss when I get to the actual code).

Parse events for document

Text processed     Parse event
-----------------  --------------------------
""                 startDocument()
"<author>"         startElement("author")
"\n "              characters("\n ")
"<first-name>"     startElement("first-name")
"Dennis"           characters("Dennis")
"</first-name>"    endElement("first-name")
"\n "              characters("\n ")
"<last-name>"      startElement("last-name")
"Sosnoski"         characters("Sosnoski")
"</last-name>"     endElement("last-name")
"\n"               characters("\n")
"</author>"        endElement("author")
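A sequence like the one in the table can be reproduced with a small handler that records each callback. The following is a minimal sketch, not the article's own example code: the EventTrace class and its trace method are names made up here, and it uses the JAXP parser bundled with the JDK. Because a parser may split character data across several characters events, the sketch buffers the text and records it as one entry when the next tag arrives. It also records endDocument(), which the parser reports after the root element closes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EventTrace extends DefaultHandler {
    private final List<String> events = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    /** Records any buffered character data as a single entry. */
    private void flushText() {
        if (text.length() > 0) {
            // Show newlines as \n so each entry stays on one line.
            events.add("characters(\"" + text.toString().replace("\n", "\\n") + "\")");
            text.setLength(0);
        }
    }

    @Override
    public void startDocument() {
        events.add("startDocument()");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushText();
        events.add("startElement(\"" + qName + "\")");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one run of text, so only buffer here.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushText();
        events.add("endElement(\"" + qName + "\")");
    }

    @Override
    public void endDocument() {
        events.add("endDocument()");
    }

    /** Parses the given document text and returns the recorded events in order. */
    public static List<String> trace(String xml) throws Exception {
        EventTrace handler = new EventTrace();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<author>\n  <first-name>Dennis</first-name>\n"
                   + "  <last-name>Sosnoski</last-name>\n</author>";
        for (String event : trace(xml)) {
            System.out.println(event);
        }
    }
}
```

The sketch reads the qName argument rather than localName because the default JAXP factory is not namespace-aware, in which case localName may be empty.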

Notice in the table that the parse events include both start of element and end of element notifications -- important information for your program because it lets you track the document's nested structure. Without the end notifications, you couldn't know which elements or character data are part of the content of some earlier element. Also note that the parse events include all the character data in the document, even the whitespace sequences most people would consider unimportant.
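To make the nesting point concrete, here is one way a handler might use the start and end notifications: it keeps a stack of the currently open elements and reports each piece of non-whitespace text along with the path of elements that contains it. This is a sketch under simple assumptions (no mixed content, no namespaces); the PathTracker class and its collect method are hypothetical names chosen for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PathTracker extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>();  // open elements, root first
    private final StringBuilder text = new StringBuilder();
    private final List<String> values = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0);      // discard whitespace seen before this start tag
        open.addLast(qName);    // this element is now the innermost open one
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            // The stack tells us which elements this text belongs to.
            values.add(String.join("/", open) + "=" + value);
        }
        text.setLength(0);
        open.removeLast();      // the end event closes the innermost element
    }

    /** Parses the document text and returns one "path=value" entry per text value. */
    public static List<String> collect(String xml) throws Exception {
        PathTracker handler = new PathTracker();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<author>\n  <first-name>Dennis</first-name>\n"
                   + "  <last-name>Sosnoski</last-name>\n</author>";
        for (String entry : collect(xml)) {
            System.out.println(entry);
        }
        // prints:
        // author/first-name=Dennis
        // author/last-name=Sosnoski
    }
}
```

Without the endElement calls the stack could only grow, and "Sosnoski" would appear to sit inside first-name. The whitespace-only character data still reaches the handler; this sketch simply chooses to drop it.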

Resources
  • For full details of the SAX specification, currently at version 2.0.1, go to
    http://www.saxproject.org
  • The "Links" page within the SAX Project site features links to numerous related areas, including an assortment of SAX2 parsers:
    http://www.saxproject.org/?selected=links
  • Sun Microsystems' JAXP page gives links to downloads, documentation, and other resources:
    http://java.sun.com/xml/jaxp/index.html
  • For another take on working with the SAX2 APIs, check out Robert Hustead's "Mapping XML to Java" JavaWorld series in which he describes a class library for working with SAX2: