Optimize with a SATA RAID Storage Solution
Range of capacities as low as $1250 per TB. Ideal if you currently rely on servers/disks/JBODs
One of the oldest approaches to processing XML documents in Java also proves one of the fastest: parse-event streams. That approach became standardized in Java with the SAX (Simple API for XML) interface specification, later revised as SAX2 to include support for XML Namespaces.
Read the whole "XML Documents on the Run" series:
Event-stream processing offers other advantages beyond just speed. Because the parser processes the document on the fly, you can handle it as soon as you read its first part. Other approaches generally require you to parse the complete document before you start working with it -- fine if the document comes off a local disk drive, but if the document is sent from another system, parsing the complete document can cause significant delays.
Event-stream processing also eliminates any document size limits. In contrast, approaches that store the document's representation in memory can run out of space with very large documents. Setting a hard limit on a real-world document's size is often difficult, and potentially a major problem in many applications.
This article features two example source code files:
option.jar, both found in a downloadable zip file in Resources. Each jar file includes full example implementations, along with sample documents and test driver programs. To try an example,
create a new directory, then extract the jar's files to that directory with
jar xvf stock.jar or
jar xvf option.jar. The
readme.txt file gives instructions for setting up and running the test drivers.
Parsers with event-stream interfaces deliver a document one piece at a time. Think of the document's text as spread out in time, as it would be if read from a stream. The parser looks for significant document components (start and end tags, character data, and so on) in the text, generating parse events for each.
For example, here's a simple document:
<author> <first-name>Dennis</first-name> <last-name>Sosnoski</last-name> </author>
The table shows the parse-event sequence a SAX2 parser would generate for this document (though the parser can divide up the
character data reported by
characters events differently than I've shown, as I discuss when I get to the actual code).
Notice in the table that the parse events include both start of element and end of element notifications -- important information for your program because it lets you track the document's nested structure. Without the end notifications, you couldn't know which elements or character data are part of the content of some earlier element. Also note that the parse events include all the character data in the document, even the whitespace sequences most people would consider unimportant.