XML documents on the run, Part 3

How do SAX2 parsers perform compared to new XMLPull parsers?

In Parts 1 and 2 of this three-part series, I explained both push-style (SAX2, the Simple API for XML version 2) and pull-style XML parsers. The pull-side story continues to change rapidly, so, as promised, I'll update you on the latest developments. These include the new Common API for XML Pull Parsing, or XMLPull, announced earlier this month. (Talk about hot off the presses!)

But that's not all: In Part 2 I left loyal readers hanging on performance differences. Pull parsers offer some big ease-of-use advantages compared to SAX2, but can they measure up to SAX2's industrial-strength performance? You'll find out in this article's second half, in which I show performance tests pitting five top SAX2 parsers against two new XMLPull parsers.

XMLPull

Just this month the ringleaders from the two leading pull-parser implementations announced XMLPull. Stefan Haustein from the kXML project and Aleksander Slominski from XPP3 (XML Pull Parser), both feeling that the lack of a common API hindered wider pull parsing adoption, began work on XMLPull in December 2001. The resulting API reflects their substantial experience, drawing from their respective projects to produce an approach that works well for a wide range of applications.

XMLPull supports everything from J2ME (Java 2 Platform, Micro Edition) to J2EE (Java 2 Platform, Enterprise Edition). The J2ME requirement forced the designers to create a simple interface with the minimum number of classes, so that it functions well in limited-memory environments. In J2EE situations, by contrast, memory isn't usually an issue, but flexibility and performance are key. Accommodating both extremes with a single interface is tough. Does XMLPull succeed? I tackle that question below. Let's start by looking at the basic interface.

The all-in-one approach

The XMLPull API consists of a single interface, org.xmlpull.v1.XmlPullParser, along with two supporting classes: org.xmlpull.v1.XmlPullParserException and org.xmlpull.v1.XmlPullParserFactory. The XmlPullParser defines XMLPull's interesting parts, so let's examine the interface and ignore the two support classes.

Think of the XmlPullParser interface as defining a special kind of iterator. That iterator delivers an XML document's components to you one at a time. It's up to you, in your program, to decide when you're done with the current component and ready to move to the next one.

The parser always holds a particular state that matches the current component type. Many of XmlPullParser's methods prove meaningful only when the parser is in a particular state, identified by a set of constant definitions in the interface. When you begin parsing a document, the parser always resides in the START_DOCUMENT state.

How do you determine the parser's state once you begin parsing? There are two ways: as the value returned by a call to the interface's next() or nextToken() method, either of which advances the parser to the next document component, or as the value returned by getEventType(), which just gives you the current state without advancing.

Cleared for access

XMLPull offers two access levels to the document data, letting you choose the detail level your program wants to see. When you call the next() method, the parser ignores a document's minor details and only reports the meatier components: elements and text. The next() method limits the values to four:

  • START_TAG for an element's start tag
  • TEXT for character data content
  • END_TAG for an element's end tag
  • END_DOCUMENT for when you've reached the end of the document data

In contrast, the nextToken() method provides more detailed access to the document structure, including components such as processing instructions, comments, entity references, and more. In fact, the nextToken() method gives a "full disclosure" document view; where next() silently skips components it doesn't report, nextToken() reports everything.

Why support full disclosure in a parser API? Reporting everything present in the input stream allows you to layer functionality. For example, neither current XMLPull implementation supports document validation, but the nextToken() view of the document offers enough detail that validation could sit as a wrapper layer on top of the basic parsers. With that approach, a single validation implementation would add validation support to every XMLPull implementation.

Layering represents a powerful feature. The original SAX interface did not report all the information needed for document validation, so parser writers had to build validation into the parser if they wanted to support it at all. That led to duplicated effort to implement validation within different parsers. Even now many SAX2 parsers do not support validation. In contrast, XMLPull's design avoids the problem completely.

Basic component handling

Most XML applications need only five basic parser states: the initial START_DOCUMENT state plus the four events the next() method reports. Of the five, only START_TAG and TEXT warrant a closer look, as START_DOCUMENT, END_TAG, and END_DOCUMENT are self-explanatory.

START_TAG provides information from an element's start tag, including the element's attributes. The XmlPullParser interface defines three methods for accessing the element name information: getName() for the local name, along with getNamespace() and getPrefix() for namespace information. The interface also defines six methods for accessing attribute values: getAttributeValue(namespace, name) to retrieve an attribute value by name, along with getAttributeCount(), getAttributeName(index), getAttributeNamespace(index), getAttributePrefix(index), and getAttributeValue(index) for direct indexed access to attributes.

TEXT supplies character-data content information. You can access the character data in two ways. First, the getText() method returns just the text as a string and spares you any details. Second, the getTextCharacters(holder) method gives access to the raw characters (as with the characters(ch, start, length) handler call in the SAX2 interface). The latter method requires some explanation: it returns the array that holds the characters, while the starting position in the array and the length of the character data come back in the int[2] array passed as the call parameter, with the start position at [0] and the number of characters at [1].
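The holder convention is easy to misread, so here is a minimal sketch of the calling pattern. To keep it runnable without an XMLPull implementation on the classpath, the class below is a hypothetical stand-in (FakeTextSource is not part of XMLPull); only the shape of getTextCharacters(int[]) mirrors the interface, and the buffer contents are made up for illustration.

```java
/**
 * Hypothetical stand-in that mimics only the holder convention of
 * XmlPullParser.getTextCharacters(int[]); it is not a real parser.
 */
class FakeTextSource {
    // A parser may hand back its internal buffer; the text is only a slice of it.
    private final char[] buffer = "<price>19.95</price>".toCharArray();

    /** Returns the whole buffer; start offset goes in holder[0], length in holder[1]. */
    char[] getTextCharacters(int[] holder) {
        holder[0] = 7;  // index of the first text character, just past "<price>"
        holder[1] = 5;  // "19.95" is five characters long
        return buffer;
    }

    /** Caller side: copy out only the slice that the holder describes. */
    static String readText(FakeTextSource source) {
        int[] holder = new int[2];
        char[] chars = source.getTextCharacters(holder);
        return new String(chars, holder[0], holder[1]);
    }
}
```

The point to notice is that the returned array may be larger than the text itself, so the caller should read only holder[1] characters starting at holder[0] rather than treating the whole array as the text.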

That's all you need to know for most XMLPull uses. You'll find much more in the API, including access to the internal namespace stack, document text position, and element nesting depth, but you can dig into these details directly in the Javadocs if you're interested.

Convert from XPP2

In Part 2, I included code for processing a financial-trade history document using the XPP2 pull-parser interface. Let's look at the changes required to bring that code up to XMLPull compatibility.

Fortunately, you'll need to substantially change only the PullWrapper class, since it has most of the parser-dependent code. Here's the new version:

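import java.io.IOException;

import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserException;
import org.xmlpull.v1.XmlPullParserFactory;

public class PullWrapper
{
    /** Parser in use. */
    protected XmlPullParser m_parser;
    
    /** Constructor. Builds the shared objects used for parsing. */
    public PullWrapper() throws XmlPullParserException {
        XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
        m_parser = factory.newPullParser();
    }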
    
    /** Parse start of element from document. */
    protected void parseStartTag(String tag) 
        throws IOException, XmlPullParserException {
        while (true) {
            switch (m_parser.next()) {
                
                case XmlPullParser.START_TAG:
                    if (m_parser.getName().equals(tag)) {
                        return;
                    }
                    // Fall through for error handling.
                
                case XmlPullParser.END_TAG:
                case XmlPullParser.END_DOCUMENT:
                    throw new XmlPullParserException
                        ("Missing expected start tag " + tag);
            }
        }
    }
    
    /** Parse end of element from document. */
    protected String parseEndTag(String tag) 
        throws IOException, XmlPullParserException {
        String text = null;
        while (true) {
            switch (m_parser.next()) {
                
                case XmlPullParser.TEXT:
                    text = m_parser.getText().trim();
                    break;
                
                case XmlPullParser.END_TAG:
                    if (m_parser.getName().equals(tag)) {
                        return text;
                    }
                    // Fall through for error handling.
                
                case XmlPullParser.START_TAG:
                case XmlPullParser.END_DOCUMENT:
                    throw new XmlPullParserException
                        ("Missing expected end tag " + tag);
            }
        }
    }
    
    /** Parse element, returning content with white space trimmed. */
    protected String parseElementContent(String tag) 
        throws IOException, XmlPullParserException {
        parseStartTag(tag);
        return parseEndTag(tag);
    }
    
    /** Get attribute value from current start tag. */
    protected String attributeValue(String name) 
        throws IOException, XmlPullParserException {
        String value = m_parser.getAttributeValue(null, name);
        if (value == null) {
            throw new XmlPullParserException("Missing attribute " + name);
        } else {
            return value;
        }
    }
}

Not much has changed, except that the XPP2 interface used separate objects (XmlStartTag and XmlEndTag) to report information about a start or end tag, while the XMLPull common API makes the information directly available from the parser.

The only other necessary change: Remove the call to the parser's reset() method from the TradePullHandler class. When that's done, everything works as expected, and the example program can now use any XMLPull implementation (currently XPP3 and kXML, but more will be coming soon).

The once and future standard

A new Java Community Process (JCP) specification request proposes a standard API for Java pull parsers. I can't yet say what will happen because the project, JSR-173: Streaming API for XML, has just started, but the results will prove important for the long term.

You don't, however, need to wait for the specification. Using a wrapper, such as the PullWrapper class shown above, can both isolate your program from the API details and provide you with higher-level building blocks for your application code (such as the parseElementContent(name) and attributeValue(name) methods). If the underlying parser API changes, you need only modify the wrapper code, not your application.
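To make the wrapper idea concrete, here is a minimal sketch of the same building blocks written against a different pull API. It assumes the javax.xml.stream API (the form in which JSR-173 was eventually released, included in current JDKs) rather than XMLPull. The class name StaxWrapper and the unchecked exceptions are choices made for this sketch, not part of the example code from Part 2.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical counterpart of PullWrapper built on javax.xml.stream (an assumption, not XMLPull). */
class StaxWrapper {
    private XMLStreamReader reader;

    StaxWrapper(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Deliver each run of character data as one event instead of several chunks.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            reader = factory.createXMLStreamReader(new StringReader(xml));
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Cannot create parser", e);
        }
    }

    /** Advance to the expected start tag, skipping text, comments, and the like. */
    void parseStartTag(String tag) {
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(tag)) {
                    return;
                }
                if (event == XMLStreamConstants.START_ELEMENT
                        || event == XMLStreamConstants.END_ELEMENT) {
                    break; // wrong element, or the parent closed first
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Parse error", e);
        }
        throw new IllegalStateException("Missing expected start tag " + tag);
    }

    /** Advance to the expected end tag, returning the trimmed text seen on the way. */
    String parseEndTag(String tag) {
        String text = null;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.CHARACTERS) {
                    text = reader.getText().trim();
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals(tag)) {
                    return text;
                } else if (event == XMLStreamConstants.START_ELEMENT
                        || event == XMLStreamConstants.END_ELEMENT) {
                    break; // unexpected nesting
                }
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Parse error", e);
        }
        throw new IllegalStateException("Missing expected end tag " + tag);
    }

    /** Parse element, returning content with white space trimmed. */
    String parseElementContent(String tag) {
        parseStartTag(tag);
        return parseEndTag(tag);
    }

    /** Get attribute value from current start tag. */
    String attributeValue(String name) {
        String value = reader.getAttributeValue(null, name);
        if (value == null) {
            throw new IllegalStateException("Missing attribute " + name);
        }
        return value;
    }
}
```

Application code that calls only parseStartTag(), parseElementContent(), and attributeValue() would not need to change when the parser underneath does. For a made-up document such as <trade id="42"><symbol> SUNW </symbol></trade>, calling parseStartTag("trade"), then attributeValue("id"), then parseElementContent("symbol") walks the document in order.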

Performance comparisons

In this series I've shown how application programming can be much easier for pull parsers than for the SAX2-style push parsers. But do the relatively new pull parsers deliver the performance necessary for real applications? It's time to run some tests and find out!

For this purpose, I modified my existing XMLBench program. The original version tested the performance of XML document models in Java (DOM (Document Object Model), JDOM, dom4j, and so on). For these tests, I extended the program to also measure SAX2 and XMLPull parser speed. Although the test mainly concerns performance, it also checks that each parser accurately reports the basics of the document structure (element count, attribute count, total attribute length, and character-data content text).
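As an illustration of that kind of structural check, here is a minimal sketch of a SAX2 handler that gathers comparable summary values. It is not the XMLBench code: the class name DocSummary, the field names, and the use of the JDK's default SAX2 parser are assumptions, and the sketch totals the length of the character data rather than keeping the text itself.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that tallies basic structure counts for one document. */
class DocSummary extends DefaultHandler {
    int elements;        // number of start tags seen
    int attributes;      // number of attributes across all elements
    int attributeChars;  // total length of all attribute values
    int contentChars;    // total length of reported character data

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        elements++;
        attributes += atts.getLength();
        for (int i = 0; i < atts.getLength(); i++) {
            attributeChars += atts.getValue(i).length();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one run of text, so only add lengths.
        contentChars += length;
    }

    /** Parses one document held in a string and returns its summary. */
    static DocSummary summarize(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            DocSummary summary = new DocSummary();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), summary);
            return summary;
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
    }
}
```

Comparing such a summary across parsers gives a cheap sanity check: two parsers that disagree on any of the four totals for the same document are not reporting the same structure.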

Test documents

The tests use various XML documents to both verify that the parsers can handle the documents properly and to measure how different document sizes and structures affect performance. I break the results into small, mid-sized, and large documents.

The small document tests use the following document collections (each collection consisting of 20 to 30 individual documents, with performance measured for the whole collection rather than individual documents):

  • soaps (0.4-1.4 KB): SOAP (Simple Object Access Protocol) request and response messages, taken from the Apache Axis interoperability test results. The documents use namespaces extensively and some attributes along with character-data content.
  • fms (each about 5 KB): RDF (Resource Description Framework) files giving new release information from freshmeat.net. They heavily use namespaces and include a few attributes along with character-data content.
  • ants (0.5-9.9 KB): Ant build.xml files taken from Jakarta Taglibs projects. They heavily use attributes and comments with little character data content and no namespaces.
  • webs (0.2-36 KB): Jakarta Taglibs taglib.tld and web.xml files. The documents use character data content with no attributes and no namespaces.
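Measuring a whole collection rather than single documents can be sketched as follows. This is not the XMLBench code: the class name CollectionTimer, the use of the JDK's default SAX2 parser, and the choice to keep the best of several passes (to reduce warm-up noise) are all assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical timing helper that treats a document collection as one unit of work. */
class CollectionTimer {
    /**
     * Parses every document in the collection once per pass and returns the
     * shortest total time, in nanoseconds, observed over the given number of passes.
     */
    static long bestCollectionTime(List<String> docs, int passes) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            long best = Long.MAX_VALUE;
            for (int pass = 0; pass < passes; pass++) {
                long start = System.nanoTime();
                for (String doc : docs) {
                    // An empty handler: events are delivered and discarded.
                    parser.parse(new InputSource(new StringReader(doc)), new DefaultHandler());
                }
                best = Math.min(best, System.nanoTime() - start);
            }
            return best;
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
    }
}
```

Timing the collection as a whole smooths over the very short parse time of any single small document, which would otherwise be dominated by timer resolution.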

The mid-sized document tests use the following individual documents:

  • soap.xml (131 KB): A generated SOAP document containing a large values list. It includes some namespaces and attributes and a flat structure consisting of simple elements with short character-data content. Aleksander Slominski generated this document as a SOAP test case.
  • much_ado.xml (197 KB): The Shakespeare play marked up as XML from Jon Bosak's document collection. It includes no namespaces or attributes and has a flat structure consisting of simple elements with relatively long character-data content.
  • periodic_table.xml (114 KB): Elliotte Rusty Harold's periodic table of the elements in XML. It includes no namespaces and has light attribute usage and a fairly complex structure consisting mainly of elements with short character data content.
  • xml.xml (192 KB): The XML specification as XHTML, with the DTD (document type definition) reference removed and all entities substituted (necessary for some models used in the tests). I chose this document as a typical document presentation markup, with heavy mixed content. It has no namespaces and light attribute use.

The large document tests employed David Mertz's weblog.xml (2.9 MB), a Web server log file formatted as XML. It comprises approximately 10,000 elements representing page hits, each containing several child elements with character-data content for the information fields. It has no namespaces and no attributes.

I've included each document, except for xml.xml, in the XMLBench download. I couldn't include xml.xml because its license requires that it be distributed only in unmodified form. (To try it yourself, remove the DTD reference and substitute &#xxx; values for entities.)

The parsers tested

I tested five SAX2 parsers, along with both currently available XMLPull implementations. First, here's the SAX2 parser list:

  • Crimson: Apache project based on the Sun Project X parser. Crimson is stable and distributed under an Apache license.
  • Xerces: Apache project based on the IBM XML4J parser. It's also stable and distributed under an Apache license.
  • Xerces2: Xerces redesign and new implementation as a separate Apache project. It's fairly new but stable and also distributed under an Apache license.
  • AElfred2: The parser included in the GNU JAXP (Java API for XML Processing) project, based on the Microstar AElfred parser. It's a beta distributed under the GPL (GNU General Public License) with a library exception.
  • Piccolo: New parser development using parser generator tools. It's distributed under the LGPL (GNU Lesser General Public License).

The XMLPull implementations are:

  • kXML: A compact, J2ME-compatible parser. The XMLPull version is in beta, distributed under an LGPL license.
  • XPP3: A compact parser originally designed for SOAP. The XMLPull version is in beta, distributed under an Apache-style license.

Performance results

I tested the parsers on a Red Hat 7.2 Linux system with a 1.4-GHz Athlon processor and 256 MB of RAM. The tests used Sun's JRE (Java Runtime Environment) 1.3.1 running under Linux kernel version 2.4.18.

Figure 1. Small document parsing speed

Figure 1 shows the small document test results. The XMLPull parsers performed extremely well with the small documents, beating all the SAX2 parsers except the new Piccolo parser. Piccolo blows away even the fast pull parsers, though, giving the fastest times for every tested document-type collection.

The other SAX2 parsers don't fare nearly as well as Piccolo. AElfred2 and Xerces2 both deliver acceptable performance, although taking more than twice as long as Piccolo. The original Xerces and Crimson bring up the rear; if you're working mainly with small files, you should probably avoid these two.

Figure 2. Mid-sized document parsing speed

For the mid-sized XML documents shown in Figure 2, performance differs little between the XMLPull parsers and most of the SAX2 parsers. Piccolo again stands out, though, with about twice the speed of the slowest parsers. AElfred2 and Crimson both show relatively poor performance with the SOAP document, which doesn't contain much character-data content. The XMLPull parsers, along with Xerces2, show relatively poor performance on the content-heavy much_ado.xml file. The performance range here is much smaller than for the small documents, however.

Figure 3. Large document parsing speed

The large document results in Figure 3 show a smaller performance range than the mid-sized documents. Here again, Piccolo proves the best. The XMLPull parsers and Xerces2 trail the other parsers in this test. Keep in mind that this test uses only one document, so the results aren't as representative of general performance as the earlier tests.

Performance summary

The results show that the new XMLPull parsers perform well compared to the older SAX2 parsers. The new Piccolo SAX2 parser does even better, but that's probably because of the parser's implementation rather than the SAX2 interface. We'll see how performance works out once the XMLPull implementations have been tuned for better performance. In fact, a new XPP3 version claims some dramatic performance gains. Check my XMLBench Website for updated test results, beginning in early May 2002.

Note: None of the best performers support validation. Of the parsers tested, Xerces2 offers the best validation support and shows reasonable performance. The performance tests are done with validation disabled, though, so performance may prove different when it's turned on. If you need validation in your application and performance is an issue, try using different validating parsers to see which works best in your environment.

Parsing wrap-up

Now that you've seen both usage and performance comparisons for SAX2 and XMLPull parsers, how do you decide what to use for your applications? Here's a list of pros and cons:

  • SAX2 pros:
    • Stable: SAX2 features many compatible implementations and good support
    • Widely used: Compatibility with existing code is often important
    • Great validation support: The only real choice if you need document validation on parse
  • SAX2 cons:
    • No parse control: The parser takes control and doesn't give it back until it's done
    • Event-driven programming: Data is delivered piecemeal to your handler
  • XMLPull pros:
    • Controlled parsing: Your program completely controls the parsing
    • Event-request programming: You request events as you want to process them
  • XMLPull cons:
    • New: Only two implementations currently exist
    • No validation support: Layering approach should give great support in time, but not yet
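The two SAX2 cons in the list above show up even in a tiny task. The following minimal sketch pulls the text of the first element with a given name out of a document using the JDK's SAX2 parser; the class name ElementTextHandler and its helper method are hypothetical. Note the state the handler has to carry between callbacks, and that the parser keeps running to the end of the document after the answer has been found.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical SAX2 handler that captures the text of the first element with a given name. */
class ElementTextHandler extends DefaultHandler {
    private final String target;
    private final StringBuilder text = new StringBuilder();
    private boolean inside;   // true while between the target's start and end tags
    private String result;    // set once, when the first target element closes

    ElementTextHandler(String target) {
        this.target = target;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (result == null && qName.equals(target)) {
            inside = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text can arrive in several pieces, so it has to be accumulated.
        if (inside) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (inside && qName.equals(target)) {
            inside = false;
            result = text.toString().trim();
        }
    }

    /** Parses the document and returns the trimmed text, or null if the element is absent. */
    static String firstText(String xml, String target) {
        try {
            ElementTextHandler handler = new ElementTextHandler(target);
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.result;
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
    }
}
```

With a pull parser the same task is a short loop in the calling code that simply stops requesting events once it has the text, which is the "controlled parsing" advantage listed for XMLPull.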

Both interface types perform well, so you don't need to focus on performance when making your decision. Be careful with your specific SAX2 parser choice if performance is a concern, though; you've seen that the choice can make a big difference in some circumstances.

If you do use SAX2, a flexible approach such as the one I outlined in Part 2 can help you manage event-driven programming's complexities. If you go with an XMLPull approach, using a wrapper to provide higher-level primitives for your program (as shown in Part 2 and updated here) makes the programming even easier. In either case, I hope you enjoy your work with event-stream XML processing!

Dennis Sosnoski is an enterprise architecture consultant and developer with more than 30 years' experience. As the president and lead consultant of Seattle-based consulting company Sosnoski Software Solutions, he's spent the last four years designing and building enterprise applications in Java. Dennis started working with XML in Java two-and-a-half years ago, and in addition to originating both commercial and open source projects for XML, he's chaired the Seattle Java-XML SIG since its founding in 1999.

Learn more about this topic

  • Visit my XMLBench homepage for the latest parser performance updates and to check out Java document-model performance comparisons
    http://www.sosnoski.com/opensrc/xmlbench
  • For full details of the SAX specification, currently at version 2.0.1, go to
    http://www.saxproject.org
  • For full details of the XMLPull specification, currently at version 1.0.7, go to
    http://www.xmlpull.org
