XML documents on the run, Part 3

How do SAX2 parsers perform compared to new XMLPull parsers?

In Parts 1 and 2 of this three-part series, I explained both push-style XML parsers (Simple API for XML 2, or SAX2) and pull-style XML parsers. The pull-side story continues to change rapidly, so, as promised, I'll update you on the latest developments. These include the new Common API for XML Pull Parsing, or XMLPull, announced earlier this month. (Talk about hot off the presses!)

But that's not all: In Part 2 I left loyal readers hanging on performance differences. Pull parsers offer some big ease-of-use advantages compared to SAX2, but can they measure up to SAX2's industrial-strength performance? You'll find out in this article's second half, in which I show performance tests pitting five top SAX2 parsers against two new XMLPull parsers.

XMLPull

Just this month the ringleaders from the two leading pull-parser implementations announced XMLPull. Stefan Haustein from the kXML project and Aleksander Slominski from XPP3 (XML Pull Parser), both feeling that the lack of a common API hindered wider pull parsing adoption, began work on XMLPull in December 2001. The resulting API reflects their substantial experience, drawing from their respective projects to produce an approach that works well for a wide range of applications.

XMLPull supports everything from J2ME (Java 2 Platform, Micro Edition) to J2EE (Java 2 Platform, Enterprise Edition). The J2ME requirement forced the designers to create a simple interface with the minimum number of classes needed to function well in limited-memory environments. In J2EE settings, by contrast, memory usually isn't an issue, but flexibility and performance are key. Accommodating both extremes with a single interface is tough. Does XMLPull succeed? I tackle that question below. Let's start by looking at the basic interface.

The all-in-one approach

The XMLPull API consists of a single interface, org.xmlpull.v1.XmlPullParser, along with two supporting classes: org.xmlpull.v1.XmlPullParserException and org.xmlpull.v1.XmlPullParserFactory. The XmlPullParser defines XMLPull's interesting parts, so let's examine the interface and ignore the two support classes.

Think of the XmlPullParser interface as defining a special kind of iterator. That iterator delivers an XML document's components to you one at a time. It's up to you, in your program, to decide when you're done with the current component and ready to move to the next one.

The parser always holds a particular state that matches the current component type. Many of XmlPullParser's methods prove meaningful only when the parser is in a particular state, identified by a set of constant definitions in the interface. When you begin parsing a document, the parser always resides in the START_DOCUMENT state.

How do you determine the parser's state once you begin parsing? There are two ways. The first is the value returned by a call to the interface's next() or nextToken() method, either of which advances the parser to the next document component. The second is the value returned by getEventType(), which reports the current state without advancing.
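
To make the two routes concrete, here's a minimal sketch. It is not XMLPull code: it assumes you have only the JDK on hand, so it uses javax.xml.stream.XMLStreamReader as a stand-in. That API follows the same pattern (next() advances and returns the new event type, getEventType() reports the current one), though its constants are named differently (START_ELEMENT rather than START_TAG). The class and method names are mine, for illustration only.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; javax.xml.stream stands in for XMLPull (an assumption).
class ParserStateDemo {

    /** Returns true if both ways of reading the parser state always agree. */
    static boolean statesAgree(String xml) {
        try {
            XMLStreamReader parser = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            // Before the first advance, the parser reports the start-of-document state.
            if (parser.getEventType() != XMLStreamConstants.START_DOCUMENT) {
                return false;
            }
            while (parser.hasNext()) {
                int returned = parser.next();         // Way 1: value returned by advancing.
                int current = parser.getEventType();  // Way 2: query without advancing.
                if (returned != current) {
                    return false;
                }
            }
            return parser.getEventType() == XMLStreamConstants.END_DOCUMENT;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(statesAgree("<trade><price>12.5</price></trade>"));
    }
}
```

In practice you'd usually switch on the value next() returns, and fall back on getEventType() only in helper methods that receive a parser already positioned somewhere in the document.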

Cleared for access

XMLPull offers two access levels to the document data, letting you choose the detail level your program wants to see. When you call the next() method, the parser ignores a document's minor details and only reports the meatier components: elements and text. The next() method limits the values to four:

  • START_TAG for an element's start tag
  • TEXT for character data content
  • END_TAG for an element's end tag
  • END_DOCUMENT for when you've reached the end of the document data

In contrast, the nextToken() method provides more detailed access to the document structure, including components such as processing instructions, comments, entity references, and more. In fact, the nextToken() method gives a "full disclosure" document view; where next() silently skips components it doesn't report, nextToken() reports everything.
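
One way to picture the difference: the coarse view is roughly the detailed view with the minor events filtered out. The sketch below is not XMLPull code. It assumes only the JDK and uses javax.xml.stream.XMLStreamReader, whose next() reports comments and processing instructions much as nextToken() does. The coarseNext() helper is hypothetical; it skips those events the way XMLPull's next() is described as doing above.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; javax.xml.stream stands in for XMLPull (an assumption).
class TwoViewsDemo {

    /** Advance past comments and processing instructions to the next major event. */
    static int coarseNext(XMLStreamReader parser) throws XMLStreamException {
        while (true) {
            int event = parser.next();
            if (event != XMLStreamConstants.COMMENT
                    && event != XMLStreamConstants.PROCESSING_INSTRUCTION) {
                return event;
            }
        }
    }

    /** Count the events seen in either the coarse or the detailed view. */
    static int countEvents(String xml, boolean coarse) {
        try {
            XMLStreamReader parser = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            int count = 0;
            while (parser.hasNext()) {
                if (coarse) {
                    coarseNext(parser);
                } else {
                    parser.next();
                }
                count++;
            }
            return count;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<r><!-- note --><a>t</a><?app hint?></r>";
        System.out.println("detailed: " + countEvents(xml, false));
        System.out.println("coarse:   " + countEvents(xml, true));
    }
}
```

For this sample document the two counts differ by exactly the one comment and the one processing instruction.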

Why support full disclosure in a parser API? Reporting everything present in the input stream allows you to layer functionality. For example, neither current XMLPull implementation supports document validation, but the nextToken() view of the document offers enough detail that validation could sit as a wrapper layer on top of the basic parsers. With that approach, a single validation implementation would add validation support to every XMLPull implementation.

Layering represents a powerful feature. The original SAX interface did not report all the information needed for document validation, so parser writers had to build validation into the parser if they wanted to support it at all. That led to duplicated effort to implement validation within different parsers. Even now many SAX2 parsers do not support validation. In contrast, XMLPull's design avoids the problem completely.

Basic component handling

Most XML applications need only the basic document components: the four event types the next() method reports, plus the initial START_DOCUMENT state. Of these five, only START_TAG and TEXT warrant a closer look; START_DOCUMENT, END_TAG, and END_DOCUMENT are self-explanatory.

START_TAG provides information from an element's start tag, including the element's attributes. The XmlPullParser interface defines three methods for accessing the element name information: getName() for the local name, along with getNamespace() and getPrefix() for namespace information. The interface also defines six methods for accessing attribute values: getAttributeValue(namespace, name) to retrieve an attribute value by name, along with getAttributeCount(), getAttributeName(index), getAttributeNamespace(index), getAttributePrefix(index), and getAttributeValue(index) for direct indexed access to attributes.
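
Here's a minimal sketch of the two attribute access styles, by index and by name. Again, it assumes only the JDK, so it uses javax.xml.stream.XMLStreamReader as a stand-in for an XMLPull parser. The shape of the calls is the same, but note that the stand-in names the indexed name accessor getAttributeLocalName(index) where XMLPull has getAttributeName(index). The demo class and its helper methods are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; javax.xml.stream stands in for XMLPull (an assumption).
class AttributeAccessDemo {

    /** Create a parser and advance it to the document's first start tag. */
    private static XMLStreamReader openAtFirstTag(String xml) throws XMLStreamException {
        XMLStreamReader parser = XMLInputFactory.newInstance()
            .createXMLStreamReader(new StringReader(xml));
        while (parser.next() != XMLStreamConstants.START_ELEMENT) {
            // Skip anything that precedes the root element.
        }
        return parser;
    }

    /** Indexed access: walk every attribute of the first start tag. */
    static Map<String, String> firstTagAttributes(String xml) {
        try {
            XMLStreamReader parser = openAtFirstTag(xml);
            Map<String, String> attrs = new LinkedHashMap<>();
            for (int i = 0; i < parser.getAttributeCount(); i++) {
                attrs.put(parser.getAttributeLocalName(i), parser.getAttributeValue(i));
            }
            return attrs;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Named access: returns null when the attribute is absent. */
    static String firstTagAttribute(String xml, String name) {
        try {
            // A null namespace means the namespace is not checked.
            return openAtFirstTag(xml).getAttributeValue(null, name);
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<trade symbol=\"IBM\" shares=\"100\"/>";
        System.out.println(firstTagAttributes(xml));
        System.out.println(firstTagAttribute(xml, "symbol"));
    }
}
```

Named lookup suits application code that knows which attributes it expects; indexed access suits generic code, such as a benchmark that has to count every attribute it sees.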

TEXT supplies character-data content information. You can access the character data in two ways: First, the getText() method can get just the text as a string and avoid any details. Second, the getTextCharacters(holder) method can access the raw characters (as with the characters(ch, start, length) handler call in the SAX2 interface). The latter method requires some explanation: it directly returns an array that holds the characters, but the starting position in the array and the length of the character data are returned as values in the int[2] array passed as a call parameter—the start position at [0] and the number of characters at [1].
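
The holder convention is easier to see in code. This sketch doesn't use a real parser: the getTextCharacters(holder) method below is a hypothetical stand-in that mimics the contract just described, with a hard-coded buffer and offsets, so you can see how a caller unpacks the result.

```java
// Hypothetical stand-in; not part of XMLPull. Buffer contents and offsets are made up.
class TextHolderDemo {

    /** Pretend parser buffer: the text sits somewhere inside a larger array. */
    private static final char[] BUFFER = "<price>12.5</price>".toCharArray();

    /** Mimics the contract: returns the array, reports start and length via holder. */
    static char[] getTextCharacters(int[] holder) {
        holder[0] = 7;  // Start position of the character data within BUFFER.
        holder[1] = 4;  // Number of characters of character data.
        return BUFFER;
    }

    /** Caller side: allocate the two-element holder, then slice the returned array. */
    static String currentText() {
        int[] holder = new int[2];
        char[] chars = getTextCharacters(holder);
        return new String(chars, holder[0], holder[1]);
    }

    public static void main(String[] args) {
        System.out.println(currentText());  // prints "12.5"
    }
}
```

The point of the design is that the parser can hand back its own internal buffer without copying; the caller pays for a String only if it actually needs one.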

That's all you need to know for most XMLPull uses. You'll find much more in the API, including access to the internal namespace stack, document text position, and element nesting depth, but you can dig into these details directly in the Javadocs if you're interested.

Convert from XPP2

In Part 2, I included code for processing a financial-trade history document using the XPP2 pull-parser interface. Let's look at the changes required to bring that code up to XMLPull compatibility.

Fortunately, you'll need to substantially change only the PullWrapper class, since it has most of the parser-dependent code. Here's the new version:

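import java.io.IOException;

import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserException;
import org.xmlpull.v1.XmlPullParserFactory;

public class PullWrapper
{
    /** Parser in use. */
    protected XmlPullParser m_parser;
    
    /** Constructor. Builds the shared objects used for parsing. */
    public PullWrapper() throws XmlPullParserException {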
        XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
        m_parser = factory.newPullParser();
    }
    
    /** Parse start of element from document. */
    protected void parseStartTag(String tag) 
        throws IOException, XmlPullParserException {
        while (true) {
            switch (m_parser.next()) {
                
                case XmlPullParser.START_TAG:
                    if (m_parser.getName().equals(tag)) {
                        return;
                    }
                    // Fall through for error handling.
                
                case XmlPullParser.END_TAG:
                case XmlPullParser.END_DOCUMENT:
                    throw new XmlPullParserException
                        ("Missing expected start tag " + tag);
            }
        }
    }
    
    /** Parse end of element from document. */
    protected String parseEndTag(String tag) 
        throws IOException, XmlPullParserException {
        String text = null;
        while (true) {
            switch (m_parser.next()) {
                
                case XmlPullParser.TEXT:
                    text = m_parser.getText().trim();
                    break;
                
                case XmlPullParser.END_TAG:
                    if (m_parser.getName().equals(tag)) {
                        return text;
                    }
                    // Fall through for error handling.
                
                case XmlPullParser.START_TAG:
                case XmlPullParser.END_DOCUMENT:
                    throw new XmlPullParserException
                        ("Missing expected end tag " + tag);
            }
        }
    }
    
    /** Parse element, returning content with white space trimmed. */
    protected String parseElementContent(String tag) 
        throws IOException, XmlPullParserException {
        parseStartTag(tag);
        return parseEndTag(tag);
    }
    
    /** Get attribute value from current start tag. */
    protected String attributeValue(String name) 
        throws IOException, XmlPullParserException {
        String value = m_parser.getAttributeValue(null, name);
        if (value == null) {
            throw new XmlPullParserException("Missing attribute " + name);
        } else {
            return value;
        }
    }
}

Not much has changed, except that the XPP2 interface used separate objects (XmlStartTag and XmlEndTag) to report information about a start or end tag, while the XMLPull common API makes the information directly available from the parser.

The only other necessary change: Remove the call to the parser's reset() method from the TradePullHandler class. When that's done, everything works as expected, and the example program can now use any XMLPull implementation (currently XPP3 and kXML, but more will be coming soon).

The once and future standard

A new Java Community Process (JCP) specification request proposes a standard API for Java pull parsers. I can't yet say what will come of it, because the project, JSR-173: Streaming API for XML, has just started, but the results will prove important for the long term.

You don't, however, need to wait for the specification. Using a wrapper, such as the PullWrapper class shown above, can both isolate your program from the API details and provide you with higher-level building blocks for your application code (such as the parseElementContent(name) and attributeValue(name) methods). If the underlying parser API changes, you need only modify the wrapper code, not your application.
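
To show how little the application would notice, here's a sketch of the same building blocks over a different pull API. It assumes the javax.xml.stream API that current JDKs ship (the API that came out of JSR-173), and StaxWrapper is a hypothetical name of my own. Two deliberate differences from PullWrapper: errors surface as unchecked IllegalStateException to keep the sketch short, and parseEndTag() returns an empty string rather than null when the element has no text.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical wrapper offering PullWrapper's building blocks over javax.xml.stream.
class StaxWrapper {
    private XMLStreamReader m_parser;

    StaxWrapper(String xml) {
        try {
            m_parser = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Advance the parser, converting the checked exception to an unchecked one. */
    private int advance() {
        try {
            return m_parser.next();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Parse start of element from document. */
    void parseStartTag(String tag) {
        while (true) {
            switch (advance()) {
                case XMLStreamConstants.START_ELEMENT:
                    if (m_parser.getLocalName().equals(tag)) {
                        return;
                    }
                    // Fall through for error handling.
                case XMLStreamConstants.END_ELEMENT:
                case XMLStreamConstants.END_DOCUMENT:
                    throw new IllegalStateException("Missing expected start tag " + tag);
                default:
                    break;  // Skip text, comments, and other minor events.
            }
        }
    }

    /** Parse end of element from document, returning trimmed text content. */
    String parseEndTag(String tag) {
        StringBuilder text = new StringBuilder();
        while (true) {
            switch (advance()) {
                case XMLStreamConstants.CHARACTERS:
                    text.append(m_parser.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if (m_parser.getLocalName().equals(tag)) {
                        return text.toString().trim();
                    }
                    // Fall through for error handling.
                case XMLStreamConstants.START_ELEMENT:
                case XMLStreamConstants.END_DOCUMENT:
                    throw new IllegalStateException("Missing expected end tag " + tag);
                default:
                    break;  // Skip comments and other minor events.
            }
        }
    }

    /** Parse element, returning content with white space trimmed. */
    String parseElementContent(String tag) {
        parseStartTag(tag);
        return parseEndTag(tag);
    }

    /** Get attribute value from current start tag. */
    String attributeValue(String name) {
        String value = m_parser.getAttributeValue(null, name);
        if (value == null) {
            throw new IllegalStateException("Missing attribute " + name);
        }
        return value;
    }

    public static void main(String[] args) {
        StaxWrapper w = new StaxWrapper("<trade id=\"42\"><symbol> IBM </symbol></trade>");
        w.parseStartTag("trade");
        System.out.println(w.attributeValue("id") + " " + w.parseElementContent("symbol"));
    }
}
```

Application code written against parseElementContent(name) and attributeValue(name) wouldn't need to change between the two wrappers, which is the whole point of the isolation layer.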

Performance comparisons

In this series I've shown how application programming can be much easier for pull parsers than for the SAX2-style push parsers. But do the relatively new pull parsers deliver the performance necessary for real applications? It's time to run some tests and find out!

For this purpose, I modified my existing XMLBench program. The original version tested the performance of XML document models in Java: DOM (Document Object Model), JDOM, dom4j, and so on. For these tests, I extended the program to also measure SAX2 and XMLPull parser speed. Although the test mainly concerns performance, it also checks that each parser accurately reports the basics of the document structure (element count, attribute count, total attribute length, and character-data content text).
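
To give a feel for that structure check, here's a rough sketch of a SAX2 handler that gathers the same kind of summary. It is not XMLBench's actual code; the StructureCheck class and its field names are hypothetical, and it tracks the character-data length rather than the text itself to stay short. It uses only the SAX2 classes bundled with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler; a sketch of the kind of summary described, not XMLBench itself.
class StructureCheck extends DefaultHandler {
    int elements;        // Number of start tags seen.
    int attributes;      // Number of attributes across all start tags.
    int attributeChars;  // Total length of all attribute values.
    int contentChars;    // Total length of character-data content.

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        elements++;
        attributes += atts.getLength();
        for (int i = 0; i < atts.getLength(); i++) {
            attributeChars += atts.getValue(i).length();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        contentChars += length;
    }

    /** Parse the document once and return the collected counts. */
    static StructureCheck summarize(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            StructureCheck check = new StructureCheck();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), check);
            return check;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        StructureCheck c = summarize(
            "<trades><trade id=\"1\" sym=\"IBM\">ok</trade><trade id=\"22\"/></trades>");
        System.out.println(c.elements + " elements, " + c.attributes + " attributes, "
            + c.attributeChars + " attribute chars, " + c.contentChars + " content chars");
    }
}
```

Comparing these totals across parsers catches one that silently drops or duplicates content, which would otherwise make its timing numbers meaningless.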

Test documents

The tests use various XML documents both to verify that the parsers handle the documents properly and to measure how different document sizes and structures affect performance. I break the results into small, mid-sized, and large documents.

The small document tests use the following document collections (each collection consisting of 20 to 30 individual documents, with performance measured for the whole collection rather than individual documents):

  • soaps (0.4-1.4 KB): SOAP (Simple Object Access Protocol) request and response messages, taken from the Apache Axis interoperability test results. The documents use namespaces extensively and some attributes along with character-data content.
  • fms (each about 5 KB): RDF (Resource Description Framework) files giving new release information from freshmeat.net. They heavily use namespaces and include a few attributes along with character-data content.
  • ants (0.5-9.9 KB): Ant build.xml files taken from Jakarta Taglibs projects. They heavily use attributes and comments with little character-data content and no namespaces.
  • webs (0.2-36 KB): Jakarta Taglibs taglib.tld and web.xml files. The documents use character-data content with no attributes and no namespaces.

The mid-sized document tests use the following individual documents:
