XML documents on the run, Part 2

Better SAX2 handling and the pull-parser alternative

Is event-driven programming for SAX2 (Simple API for XML) endangering your sanity? Part 1 of this three-part series introduced SAX2 parsing and should have put you back in touch with reality. That article supplied basic handler techniques for keeping your code manageable, and we'll build on them here.

In this article, I extend the SAX2 handling approach suggested in Part 1 to cope with multiple levels of nested structure within an XML document. With that approach, you implement one class for each type of structure you need to handle, which keeps your code clean and avoids much of the messiness of event-driven programming.

Our quest for improved XML event-stream processing doesn't end with SAX2, though. I also introduce the pull-parser approach, which is gaining attention as a SAX2 alternative. With pull parsing, your program keeps control rather than relinquishing it to the parser, letting you avoid the event-driven hassles completely!

Note: You can download this article's example source code from Resources.

Handling SAX2

I ended Part 1 with ways to extend the event-driven handling model we'd started to develop. I mentioned that you could enhance the interface to include start elements and nesting, and promised I'd address that in Part 2. So, let's get to it!

First, here's a more complicated version of the trade history documents from Part 1:

<?xml version="1.0"?>
<trade-history>
  <option-trade>
    <symbol>SUNW</symbol>
    <tracking id="7495733">
      <time>08:45:19</time>
      <seller ident="XBA" type="direct"/>
      <buyer ident="ZFT" type="agent"/>
      <exchange>XA</exchange>
    </tracking>
    <option-type>call</option-type>
    <strike-price>100</strike-price>
    <expiration-month>9</expiration-month>
    <trade-price>13.47</trade-price>
    <quantity>500</quantity>
  </option-trade>
  <stock-trade>
    <symbol>SUNW</symbol>
    <tracking id="7499345">
      <time>08:45:19</time>
      <seller ident="CCC" type="agent"/>
      <buyer ident="ABT" type="agent"/>
      <exchange>XA</exchange>
    </tracking>
    <price>86.24</price>
    <quantity>500</quantity>
  </stock-trade>
  ...
</trade-history>

The new version adds a tracking element, present in both the stock-trade and option-trade elements. The tracking element provides information applicable to all trade types. That information includes the trade time, which we'd previously included directly in the stock and option trade information, as well as additional items to track the parties involved and the exchange where the trade took place.

In Part 1, to keep things simple, I stuck to the basics of using element content for our information. Now that you've seen the basics, in this article I extend the coverage to attributes. Whether to use attributes or element content for a piece of information is mainly a matter of style, and you'll often need to work with documents that combine the two approaches. I set up the trade-history document's new format with that in mind, and I included attribute values for useful information in the added tracking-element substructure. We'll look at how to handle such information in the following code examples.
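As a quick illustration of the difference, here's a minimal sketch of reading attribute values with SAX2. It isn't part of the article's download; the AttributeDemo class and its readParties() method are names I've made up for this example, and I've assumed a namespace-aware JAXP parser so that the local name is always filled in. Attribute values arrive with the start tag, so no content buffering is involved:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the article's example code.
public class AttributeDemo {

    /** Parses the XML text and returns one "element ident type" entry per party. */
    public static List<String> readParties(String xml) throws Exception {
        final List<String> parties = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String lname, String qname,
                Attributes attrs) {

                // Attributes are delivered together with the start tag.
                if ("seller".equals(lname) || "buyer".equals(lname)) {
                    // The empty string means "no namespace" for the attribute.
                    parties.add(lname + " " + attrs.getValue("", "ident")
                        + " " + attrs.getValue("", "type"));
                }
            }
        };

        // Assumption: namespace-aware parsing, so local names are reported.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(
            new InputSource(new StringReader(xml)), handler);
        return parties;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<tracking id=\"7495733\">"
            + "<seller ident=\"XBA\" type=\"direct\"/>"
            + "<buyer ident=\"ZFT\" type=\"agent\"/>"
            + "</tracking>";
        for (String party : readParties(xml)) {
            System.out.println(party);
        }
    }
}
```

Run against the tracking fragment in main(), this prints one line for the seller and one for the buyer, each with its ident and type values.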

A better interface

The last example from Part 1 employed a simple interface for our own handler classes:

public interface EndElementHandler
{
    public void endElement(String lname, String content);
}

Since that code set the handler directly and used only element content, just one simple method was necessary. Now we want to handle attributes as well as content. We also want to nest handlers -- have one handler pass off control to a separate handler for processing a substructure, like the tracking information in our revised document. Figure 1 shows how this should work when we're processing a stock-trade element, for example.

Figure 1. Stackable handlers in action: tracking element within the stock-trade element

We'll need a more complex interface to handle these requirements; it'll need support for attributes and some way to set a nested handler. Here's the definition of the StructureHandler interface we'll use for this purpose:

import org.xml.sax.Attributes;

public interface StructureHandler
{
    // Start of the root element in the structure being handled.
    public void startElement(String lname, Attributes attrs);
    
    // End of the root element in the structure being handled.
    public void endElement(String lname, String content);
    
    // Start of child element in the structure being handled -- this can
    //  invoke a nested handler, by passing back a non-null value.
    public StructureHandler startChild(String lname, Attributes attrs);
    
    // End of child element handled directly.
    public void endDirectChild(String lname, String content);
    
    // End of child element handled by nested handler.
    public void endStructureChild(String lname, StructureHandler handler);
}

The above interface gives us the information necessary for handling our new, more complicated document format. The startElement() method call informs us that we're beginning our handling and gives a convenient hook for any initialization code. The endElement() method call then informs us when we're finished.

The startChild() method supplies the information for a child element start tag, and gives us the choice of handling it directly (by returning null) or invoking a nested handler (by returning the handler instance). If we handle the child directly, we'll get a call to endDirectChild() on the end tag; if we invoke a nested handler, we'll get a call to endStructureChild() on the end tag instead. If we use the nested handler, we won't be called for anything between the start and end of the child element -- the nested handler receives the calls for any contained children.

Build a base

We want several classes to implement the StructureHandler interface; most won't actually use all the methods, though. To simplify our later code, we can define a simple base class with dummy interface-method implementations. Our other classes can then subclass that base and override only those methods they actually need to use.

Here's the base class implementation:

import org.xml.sax.Attributes;

public class StructureHandlerBase implements StructureHandler
{
    public void startElement(String lname, Attributes attrs) {}
    
    public void endElement(String lname, String content) {}
    
    public StructureHandler startChild(String lname, Attributes attrs) {
        return null;
    }
    
    public void endDirectChild(String lname, String content) {}
    
    public void endStructureChild(String lname, StructureHandler handler) {}
}

Not the most sophisticated code in the world, but it saves us duplicating these dummy methods in classes that don't need them.
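To make the override pattern concrete, here's a rough sketch of two subclasses that each override only the methods they need. Everything with a Demo prefix, along with the HandlerSketch wrapper and its simulate() method, is hypothetical and isn't one of the article's handler classes. The interface and base class are repeated as nested types purely so the sketch compiles on its own, and simulate() makes by hand the calls a driver class would normally make:

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical wrapper class, used only to keep this sketch self-contained.
public class HandlerSketch {

    // Copies of the article's interface and base class, nested here only so
    // that the sketch compiles by itself.
    interface StructureHandler {
        void startElement(String lname, Attributes attrs);
        void endElement(String lname, String content);
        StructureHandler startChild(String lname, Attributes attrs);
        void endDirectChild(String lname, String content);
        void endStructureChild(String lname, StructureHandler handler);
    }

    static class StructureHandlerBase implements StructureHandler {
        public void startElement(String lname, Attributes attrs) {}
        public void endElement(String lname, String content) {}
        public StructureHandler startChild(String lname, Attributes attrs) {
            return null;
        }
        public void endDirectChild(String lname, String content) {}
        public void endStructureChild(String lname, StructureHandler handler) {}
    }

    /** Hypothetical handler for a tracking substructure: two overrides. */
    static class DemoTrackingHandler extends StructureHandlerBase {
        final List<String> seen = new ArrayList<>();

        @Override
        public void startElement(String lname, Attributes attrs) {
            seen.add("id=" + attrs.getValue("id"));
        }

        @Override
        public void endDirectChild(String lname, String content) {
            seen.add(lname + "=" + content);
        }
    }

    /** Hypothetical parent: delegates tracking, handles the rest directly. */
    static class DemoTradeHandler extends StructureHandlerBase {
        final List<String> seen = new ArrayList<>();

        @Override
        public StructureHandler startChild(String lname, Attributes attrs) {
            // A non-null return asks the driver to switch to a nested handler.
            return "tracking".equals(lname) ? new DemoTrackingHandler() : null;
        }

        @Override
        public void endDirectChild(String lname, String content) {
            seen.add(lname + "=" + content);
        }

        @Override
        public void endStructureChild(String lname, StructureHandler handler) {
            // Collect whatever the nested handler gathered.
            seen.addAll(((DemoTrackingHandler) handler).seen);
        }
    }

    /** Plays the driver's role by hand for one direct and one nested child. */
    public static List<String> simulate() {
        DemoTradeHandler trade = new DemoTradeHandler();

        // <symbol>SUNW</symbol>: startChild() returns null, so it's direct.
        if (trade.startChild("symbol", new AttributesImpl()) == null) {
            trade.endDirectChild("symbol", "SUNW");
        }

        // <tracking id="7499345">...</tracking>: a nested handler takes over.
        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "id", "id", "CDATA", "7499345");
        StructureHandler nested = trade.startChild("tracking", attrs);
        nested.startElement("tracking", attrs);
        nested.endDirectChild("exchange", "XA");
        nested.endElement("tracking", "");
        trade.endStructureChild("tracking", nested);
        return trade.seen;
    }

    public static void main(String[] args) {
        System.out.println(simulate());
    }
}
```

Neither Demo class implements all five interface methods, which is the whole point of the base class. The cast in endStructureChild() is safe here only because this parent hands out exactly one kind of nested handler.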

Drive the interface

To use the spiffy new interface, we must modify our SAX2 handler class from the examples in Part 1. The following class extends the SAX2 DefaultHandler base class and overrides the methods we use for our application. Here's the new version:

import java.util.Stack;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class StructuredDocumentHandler extends DefaultHandler
{
    /** Structure handler context stack. */
    protected Stack m_contextStack;
    
    /** Current nested element depth. */
    protected int m_nestingDepth;
    
    /** Depth at which to pop handler context. */
    protected int m_contextDepth;
    
    /** Active structure handler. */
    protected StructureHandler m_handler;
    
    /** Character data collection buffer. */
    protected StringBuffer m_contentBuffer = new StringBuffer();
    
    public StructuredDocumentHandler(StructureHandler handler) {
        
        // set base handler for document
        m_handler = handler;
        m_contextDepth = -1;
        m_contextStack = new Stack();
    }
    
    public void startElement(String uri, String lname, String qname,
        Attributes attributes) {
            
        // Initialize content and check handler.
        m_contentBuffer.setLength(0);
        StructureHandler next = m_handler.startChild(lname, attributes);
        if (next != null) {
            
            // Save current handler context.
            HandlerContext context = 
                new HandlerContext(m_contextDepth, m_handler);
            m_contextStack.push(context);
            
            // Change to new nested handler.
            m_handler = next;
            m_contextDepth = m_nestingDepth;
            next.startElement(lname, attributes);
        }
        
        // Bump the nested element count.
        m_nestingDepth++;
    }
    
    public void characters(char[] chars, int start, int length) {
        m_contentBuffer.append(chars, start, length);
    }
    
    public void endElement(String uri, String lname, String qname) {
        
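        // Clean up content and check if context end. The buffer is cleared
        //  so that a child's text isn't reported again for its parent.
        String content = m_contentBuffer.toString().trim();
        m_contentBuffer.setLength(0);
        m_nestingDepth--;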
        if (m_nestingDepth == m_contextDepth) {
            
            // Report end element for current handler.
            m_handler.endElement(lname, content);
            
            // Restore higher level handler context.
            HandlerContext context = (HandlerContext)m_contextStack.pop();
            StructureHandler last = m_handler;
            m_handler = context.getHandler();
            m_contextDepth = context.getDepth();
            
            // Report child structure end to higher level handler.
            m_handler.endStructureChild(lname, last);
            
        } else {
            
            // Report end of child element.
            m_handler.endDirectChild(lname, content);
            
        }
    }
    
    protected class HandlerContext
    {
        private final int m_depth;
        private final StructureHandler m_handler;
        
        protected HandlerContext(int depth, StructureHandler handler) {
            m_depth = depth;
            m_handler = handler;
        }
        
        protected int getDepth() {
            return m_depth;
        }
        
        protected StructureHandler getHandler() {
            return m_handler;
        }
    }
}

StructuredDocumentHandler works with a stack of saved handler contexts. m_handler always references the active StructureHandler instance, which stays in effect as long as we remain within the element for which it was supplied. m_contextDepth gives the depth of that element (the root of the structure the current handler processes), and m_nestingDepth tracks our current depth within the document structure -- how many start tags we've seen without seeing the corresponding end tags.

When we're passed a nested handler (within the startElement() method, if the call to startChild() returns non-null), we create an inner helper-class instance, HandlerContext. That helper remembers the handler we used previously, along with the depth of its root element. We save the new helper class instance on the m_contextStack stack so we can restore the information later.

Once the current handler finishes (in endElement(), when m_nestingDepth equals m_contextDepth), we retrieve the saved HandlerContext instance from the stack and restore the information for the higher-level handler.
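If the depth bookkeeping seems abstract, the following stripped-down sketch may help. It keeps only the two counters and drops the handler stack entirely, so it illustrates the depth-matching idea rather than replacing the class above. The DepthDemo class and its trace() method are hypothetical names, and I've again assumed a namespace-aware JAXP parser:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the article's example code.
public class DepthDemo {

    /** Records where a tracking structure is entered and left. */
    public static List<String> trace(String xml) throws Exception {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {

            // Start tags seen so far without a matching end tag.
            private int nestingDepth;

            // Depth of the element that opened the current context,
            // or -1 when no context is active.
            private int contextDepth = -1;

            @Override
            public void startElement(String uri, String lname, String qname,
                Attributes attrs) {
                if (contextDepth < 0 && "tracking".equals(lname)) {
                    // Remember the depth before counting this element.
                    contextDepth = nestingDepth;
                    events.add("enter " + lname + " at depth " + nestingDepth);
                }
                nestingDepth++;
            }

            @Override
            public void endElement(String uri, String lname, String qname) {
                nestingDepth--;
                if (nestingDepth == contextDepth) {
                    // Back at the depth where the context began, so this
                    // end tag closes the element that opened it.
                    events.add("leave " + lname + " at depth " + nestingDepth);
                    contextDepth = -1;
                }
            }
        };

        // Assumption: namespace-aware parsing, so local names are reported.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(
            new InputSource(new StringReader(xml)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<stock-trade><symbol>SUNW</symbol>"
            + "<tracking id=\"7499345\"><time>08:45:19</time>"
            + "<exchange>XA</exchange></tracking>"
            + "<quantity>500</quantity></stock-trade>";
        for (String event : trace(xml)) {
            System.out.println(event);
        }
    }
}
```

The end tags for time and exchange arrive at a greater depth than the saved context depth, so only the closing tracking tag triggers the "leave" event. The real class does the same comparison, then pops the saved context to restore the outer handler and its depth.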

Character-data content accumulation works the same as in the examples from Part 1. We trim whitespace from the accumulated content and pass it to the appropriate StructureHandler method within our endElement() handling. The StructureHandler implementation can decide for itself how to handle the content.

This code may seem a little complex for something I promised would simplify your programming, but it is general purpose and not specific to our document format. You can use this same StructuredDocumentHandler implementation (as well as the earlier interface and base class) for most SAX2-based parsing purposes. For some document types, you might need to add the namespace URI (Uniform Resource Identifier) as a parameter to the interface methods, while for others you might need to remove the whitespace trimming on the content. These issues are beyond the scope of this article, though. For our purposes, this implementation gives everything we need in a convenient form, as we'll see next.

Get on track

Next comes the document-specific part of the code. With our interface structure, we use a separate StructureHandlerBase subclass for each main element type we want to handle in the document. Altogether, we define four subclasses: one each for the option-trade, stock-trade, and tracking elements, along with one for the document root trade-history element. We start from the bottom, with the tracking element handler class and the helper class it uses for passing information in and out:
