XML documents on the run, Part 3

How do SAX2 parsers perform compared to new XMLPull parsers?



Cleared for access

XMLPull offers two access levels to the document data, letting you choose how much detail your program sees. When you call the next() method, the parser ignores a document's minor details and reports only the meatier components: elements and text. The next() method returns one of only four event types:

  • START_TAG for an element's start tag
  • TEXT for character data content
  • END_TAG for an element's end tag
  • END_DOCUMENT for when you've reached the end of the document data


In contrast, the nextToken() method provides more detailed access to the document structure, including components such as processing instructions, comments, entity references, and more. In fact, the nextToken() method gives a "full disclosure" document view; where next() silently skips components it doesn't report, nextToken() reports everything.
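To make the relationship between the two access levels concrete, here is a toy model. It is not the XMLPull implementation: the ToyPullSource class, its Kind enum, and the canned event list are all hypothetical stand-ins that run on a plain JDK. The sketch only shows that next() can be thought of as nextToken() with the minor components filtered out; a real parser does more work in next(), such as expanding entity references.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Iterator;
import java.util.List;

/** Toy model of the two access levels; NOT the real XMLPull parser. */
class ToyPullSource {

    /** Hypothetical event kinds, loosely named after the XMLPull constants. */
    enum Kind { START_TAG, TEXT, END_TAG, END_DOCUMENT, COMMENT, PROCESSING_INSTRUCTION }

    /** The four kinds that the simplified view reports. */
    private static final EnumSet<Kind> BASIC =
        EnumSet.of(Kind.START_TAG, Kind.TEXT, Kind.END_TAG, Kind.END_DOCUMENT);

    private final Iterator<Kind> events;

    ToyPullSource(List<Kind> events) {
        this.events = events.iterator();
    }

    /** Detailed view: report every component in document order. */
    Kind nextToken() {
        return events.hasNext() ? events.next() : Kind.END_DOCUMENT;
    }

    /** Simplified view: silently skip anything outside the four basic kinds. */
    Kind next() {
        Kind kind;
        do {
            kind = nextToken();
        } while (!BASIC.contains(kind));
        return kind;
    }

    /** Canned events standing in for a document like <a><!-- note -->hi</a>. */
    static List<Kind> sample() {
        return Arrays.asList(Kind.START_TAG, Kind.COMMENT, Kind.TEXT,
                             Kind.END_TAG, Kind.END_DOCUMENT);
    }
}
```

Reading the sample through next() never surfaces the COMMENT event, while reading it through nextToken() reports it between the start tag and the text.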

Why support full disclosure in a parser API? Reporting everything present in the input stream allows you to layer functionality. For example, neither current XMLPull implementation supports document validation, but the nextToken() parse view of the document offers enough detail that validation could sit as a wrapper layer on top of the basic parsers. With that approach, a single validation implementation would add validation support to every XMLPull implementation.
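The layering idea can be sketched as a wrapper written once against a shared interface. Everything in this sketch is hypothetical: TokenSource is a drastically reduced stand-in for a common parser API, ListSource plays the role of one basic parser, and the allowed-name check in CheckingSource is only a placeholder for real validation logic. The element names in the usage note are likewise invented.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Set;

/** Hypothetical minimal pull interface standing in for a shared parser API. */
interface TokenSource {
    /** Returns the next element name, or null at the end of the document. */
    String nextStartTag();
}

/** One possible "basic parser": replays a fixed list of element names. */
class ListSource implements TokenSource {
    private final Iterator<String> names;

    ListSource(List<String> names) {
        this.names = names.iterator();
    }

    public String nextStartTag() {
        return names.hasNext() ? names.next() : null;
    }
}

/** Wrapper layer: checks each name against an allowed set, then passes it on. */
class CheckingSource implements TokenSource {
    private final TokenSource inner;
    private final Set<String> allowed;

    CheckingSource(TokenSource inner, Set<String> allowed) {
        this.inner = inner;
        this.allowed = allowed;
    }

    public String nextStartTag() {
        String name = inner.nextStartTag();
        // Placeholder check; real validation would consult a DTD or schema.
        if (name != null && !allowed.contains(name)) {
            throw new IllegalStateException("Element not allowed: " + name);
        }
        return name;
    }
}
```

Because CheckingSource depends only on the TokenSource interface, the same wrapper works over any implementation of it. Wrapping a ListSource that yields "trade" and "price" with an allowed set containing both names passes the names through unchanged, while an unexpected name raises an exception.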

Layering represents a powerful feature. The original SAX interface did not report all the information needed for document validation, so parser writers had to build validation into the parser if they wanted to support it at all. That led to duplicated effort to implement validation within different parsers. Even now many SAX2 parsers do not support validation. In contrast, XMLPull's design avoids the problem completely.

Basic component handling

Most XML applications need only the four basic document components the next() method reports. Of the four, only START_TAG and TEXT warrant a closer look, as END_TAG and END_DOCUMENT are self-explanatory.

START_TAG provides information from an element's start tag, including the element's attributes. The XmlPullParser interface defines three methods for accessing the element name information: getName() for the local name, along with getNamespace() and getPrefix() for namespace information. The interface also defines six methods for accessing attribute values: getAttributeValue(namespace, name) to retrieve an attribute value by name, along with getAttributeCount(), getAttributeName(index), getAttributeNamespace(index), getAttributePrefix(index), and getAttributeValue(index) for direct indexed access to attributes.

TEXT supplies character-data content information. You can access the character data in two ways. First, the getText() method returns the text as a string, sparing you the details. Second, the getTextCharacters(holder) method gives access to the raw characters (as with the characters(ch, start, length) handler call in the SAX2 interface). The latter method requires some explanation: it returns the array that holds the characters, while the starting position in the array and the length of the character data come back as values in the int[2] array passed as a call parameter, with the start position at [0] and the number of characters at [1].
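The holder-array convention is easier to see in code. The sketch below is a stand-in that runs without any parser library: ToyTextSource and its fixed buffer are hypothetical, and only the calling convention (a returned char array plus an int[2] holder for start and length) mirrors what the text describes.

```java
/** Toy text source mimicking the holder-array calling convention. */
class ToyTextSource {

    // Shared buffer: the text "42.50" sits inside the surrounding markup.
    private final char[] buffer = "<price>42.50</price>".toCharArray();

    /**
     * Returns the shared buffer. The start offset is written to holder[0]
     * and the number of characters to holder[1].
     */
    char[] getTextCharacters(int[] holder) {
        holder[0] = 7;  // index just past "<price>"
        holder[1] = 5;  // length of "42.50"
        return buffer;
    }

    /** Caller side: turn the (array, start, length) triple into a String. */
    static String readText(ToyTextSource source) {
        int[] holder = new int[2];
        char[] chars = source.getTextCharacters(holder);
        return new String(chars, holder[0], holder[1]);
    }
}
```

The caller must read only the slice described by the holder, since the returned array may contain other data before and after the text. With a real XmlPullParser, the caller side would presumably look like the three lines in readText(), using the parser's own getTextCharacters(holder) call.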

That's all you need to know for most XMLPull uses. You'll find much more in the API, including access to the internal namespace stack, document text position, and element nesting depth, but you can dig into these details directly in the Javadocs if you're interested.

Convert from XPP2

In Part 2, I included code for processing a financial-trade history document using the XPP2 pull-parser interface. Let's look at the changes required to bring that code up to XMLPull compatibility.

Fortunately, you'll need to substantially change only the PullWrapper class, since it has most of the parser-dependent code. Here's the new version:

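import java.io.IOException;

import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserException;
import org.xmlpull.v1.XmlPullParserFactory;

public class PullWrapper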
{
    /** Parser in use. */
    protected XmlPullParser m_parser;
    
    /** Constructor. Builds the shared objects used for parsing. */
    public PullWrapper() throws XmlPullParserException {
        XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
        m_parser = factory.newPullParser();
    }
    
    /** Parse start of element from document. */
    protected void parseStartTag(String tag) 
        throws IOException, XmlPullParserException {
        while (true) {
            switch (m_parser.next()) {
                
                case XmlPullParser.START_TAG:
                    if (m_parser.getName().equals(tag)) {
                        return;
                    }
                    // Fall through for error handling.
                
                case XmlPullParser.END_TAG:
                case XmlPullParser.END_DOCUMENT:
                    throw new XmlPullParserException
                        ("Missing expected start tag " + tag);
            }
        }
    }
    
    /** Parse end of element from document. */
    protected String parseEndTag(String tag) 
        throws IOException, XmlPullParserException {
        String text = null;
        while (true) {
            switch (m_parser.next()) {
                
                case XmlPullParser.TEXT:
                    text = m_parser.getText().trim();
                    break;
                
                case XmlPullParser.END_TAG:
                    if (m_parser.getName().equals(tag)) {
                        return text;
                    }
                    // Fall through for error handling.
                
                case XmlPullParser.START_TAG:
                case XmlPullParser.END_DOCUMENT:
                    throw new XmlPullParserException
                        ("Missing expected end tag " + tag);
            }
        }
    }
    
    /** Parse element, returning content with white space trimmed. */
    protected String parseElementContent(String tag) 
        throws IOException, XmlPullParserException {
        parseStartTag(tag);
        return parseEndTag(tag);
    }
    
    /** Get attribute value from current start tag. */
    protected String attributeValue(String name) 
        throws IOException, XmlPullParserException {
        String value = m_parser.getAttributeValue(null, name);
        if (value == null) {
            throw new XmlPullParserException("Missing attribute " + name);
        } else {
            return value;
        }
    }
}


Not much has changed, except that the XPP2 interface used separate objects (XmlStartTag and XmlEndTag) to report information about a start or end tag, while the XMLPull common API makes the information directly available from the parser.

The only other necessary change: Remove the call to the parser's reset() method from the TradePullHandler class. When that's done, everything works as expected, and the example program can now use any XMLPull implementation (currently XPP3 and kXML, but more will be coming soon).

The once and future standard

A new Java Community Process (JCP) specification request proposes a standard API for Java pull parsers. The project, JSR-173: Streaming API for XML, has only just started, so I can't yet say what will come of it, but the results will prove important for the long term.
