XML documents on the run, Part 1

SAX speeds through XML documents with parse-event streams

One of the oldest approaches to processing XML documents in Java also proves one of the fastest: parse-event streams. That approach became standardized in Java with the SAX (Simple API for XML) interface specification, later revised as SAX2 to include support for XML Namespaces.

Event-stream processing offers other advantages beyond just speed. Because the parser processes the document on the fly, you can handle it as soon as you read its first part. Other approaches generally require you to parse the complete document before you start working with it -- fine if the document comes off a local disk drive, but if the document is sent from another system, parsing the complete document can cause significant delays.

Event-stream processing also eliminates any document size limits. In contrast, approaches that store the document's representation in memory can run out of space with very large documents. Setting a hard limit on a real-world document's size is often difficult, and potentially a major problem in many applications.

A note on the source code

This article features two example source code files: stock.jar and option.jar, both found in a downloadable zip file in Resources. Each jar file includes full example implementations, along with sample documents and test driver programs. To try an example, create a new directory, then extract the jar's files to that directory with jar xvf stock.jar or jar xvf option.jar. The readme.txt file gives instructions for setting up and running the test drivers.

The event view

Parsers with event-stream interfaces deliver a document one piece at a time. Think of the document's text as spread out in time, as it would be if read from a stream. The parser looks for significant document components (start and end tags, character data, and so on) in the text, generating parse events for each.

For example, here's a simple document:

<author>
  <first-name>Dennis</first-name>
  <last-name>Sosnoski</last-name>
</author>

The table shows the parse-event sequence a SAX2 parser would generate for this document (though the parser can divide up the character data reported by characters events differently than I've shown, as I discuss when I get to the actual code).

Parse events for document

Text processed      Parse event
------------------  ----------------------------
""                  startDocument()
"<author>"          startElement("author")
"\n "               characters("\n ")
"<first-name>"      startElement("first-name")
"Dennis"            characters("Dennis")
"</first-name>"     endElement("first-name")
"\n "               characters("\n ")
"<last-name>"       startElement("last-name")
"Sosnoski"          characters("Sosnoski")
"</last-name>"      endElement("last-name")
"\n"                characters("\n")
"</author>"         endElement("author")

Notice in the table that the parse events include both start of element and end of element notifications -- important information for your program because it lets you track the document's nested structure. Without the end notifications, you couldn't know which elements or character data are part of the content of some earlier element. Also note that the parse events include all the character data in the document, even the whitespace sequences most people would consider unimportant.

With the event-driven approach, your application turns control over to the parser, passing it the document (as a stream or URI/URL). The parser reads the document and breaks it into components, calling a method in a handler class supplied by your program to report each event. That isn't the only way of working with parse-event streams (as I'll show in Part 2), but it's the most widely used approach at present.
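To see that call sequence for yourself, here is a minimal sketch of a handler that simply records each event it receives. The EventLister class and its list() method are names I've made up for illustration, and the parser is obtained through JAXP (covered later in this article) only because that ships with the JDK. Note that the parser also reports an endDocument() event once it finishes, and that it may split the character data into different chunks than the table shows.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records each parse event as one line of text.
class EventLister extends DefaultHandler {
    private final List<String> events = new ArrayList<String>();

    @Override
    public void startDocument() {
        events.add("startDocument()");
    }

    @Override
    public void endDocument() {
        events.add("endDocument()");
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        events.add("startElement(\"" + local + "\")");
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        events.add("endElement(\"" + local + "\")");
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Show newlines as \n so whitespace-only events stay visible
        String text = new String(ch, start, length).replace("\n", "\\n");
        events.add("characters(\"" + text + "\")");
    }

    // Parses the XML text and returns the events in the order they arrived.
    static List<String> list(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        EventLister handler = new EventLister();
        reader.setContentHandler(handler);
        // parse() returns only after the whole document has been reported
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<author>\n  <first-name>Dennis</first-name>\n"
            + "  <last-name>Sosnoski</last-name>\n</author>";
        for (String event : list(doc)) {
            System.out.println(event);
        }
    }
}
```

Running main() against the sample document should print one line per event, in the same order as the table.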

SAX and SAX2

Most event-stream parsers for XML in Java first used SAX. Unlike most other Internet and Web standards, SAX originally materialized without the official involvement of any sponsoring standards organization. Instead, it developed through a series of discussions, prototypes, and eventual consensus, coordinated by David Megginson on the XML-DEV mailing list.

SAX2 extends the SAX API to include full support for XML Namespaces. It also incorporates fixes to the original SAX interface. Most current parsers implement the SAX2 interface natively, though the original SAX interface is available if desired. New development should probably use the SAX2 interface even if Namespaces are not required, if for no other reason than to avoid deprecated APIs. The example code in this article follows that approach.

Event-driven programming

Enough background material; let's plunge into programming the interface. You first want to get a parser, in the form of an org.xml.sax.XMLReader instance. These parser instances are serially reusable, meaning you can use one for parsing as many documents as you like, but only one document at a time. Indeed, if you're writing a simple single-threaded application, you can simply use the same instance over and over.

Usually you get the XMLReader by calling the static org.xml.sax.helpers.XMLReaderFactory.createXMLReader() method (you need to have a SAX2 parser implementation in your classpath for this to work, of course; see Resources for a link to the SAX2 project page where you can find a list of parsers supporting SAX2). createXMLReader() lets you specify a particular implementation class, or you can simply use the default one defined by a system property.

Once you have the XMLReader, you can set and check a variety of options for the parser. You can also hook up various handler types for the parse events. Each handler type must implement a particular interface. For your purposes, you'll build on the handy handler base class defined by SAX2, org.xml.sax.helpers.DefaultHandler, which supplies default implementations for the full handler set. By using that as a base class, you can override only the methods you're interested in, while not worrying about the rest.
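As a rough sketch of how those pieces fit together, the code below creates an XMLReader, plugs in a DefaultHandler subclass that overrides a single method, and reuses the same reader for two documents in a row. ElementCounter and countElements() are hypothetical names chosen for this illustration. Be aware that newer JDK releases flag XMLReaderFactory as deprecated, so expect a compiler warning there.

```java
import java.io.StringReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical handler: overrides one callback and inherits the
// do-nothing defaults from DefaultHandler for everything else.
class ElementCounter extends DefaultHandler {
    private int count;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        count++;
    }

    int getCount() {
        return count;
    }

    // Parses the XML text with the supplied reader and returns the element count.
    static int countElements(XMLReader reader, String xml) throws Exception {
        ElementCounter counter = new ElementCounter();
        reader.setContentHandler(counter);
        reader.setErrorHandler(counter);
        reader.parse(new InputSource(new StringReader(xml)));
        return counter.getCount();
    }

    public static void main(String[] args) throws Exception {
        // Uses the default parser implementation found at runtime
        XMLReader reader = XMLReaderFactory.createXMLReader();
        // The same reader handles both documents, one after the other
        System.out.println(countElements(reader, "<a><b/><b/></a>"));
        System.out.println(countElements(reader, "<x/>"));
    }
}
```

Each call installs a fresh handler, so the counts don't carry over from one document to the next even though the reader is shared.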

If you're working with Sun's JAXP (Java API for XML Processing) 1.1 or higher, you can get your SAX2 parser instance through the JAXP API. With this approach, you first call the static javax.xml.parsers.SAXParserFactory.newInstance() method to get a SAXParserFactory instance, then use that instance's newSAXParser() method to get a javax.xml.parsers.SAXParser instance. That gives you an interface for parsing a document using a specified DefaultHandler.

Both approaches support a variety of options for the parser type you want to create, including whether or not you want to validate the parsed documents. Let's ignore most of those options (and the whole validation issue) for this introduction to SAX2 parsing, but you can find the full details on the official SAX2 and JAXP sites.

One option I won't ignore is the namespace handling. Directly created SAX2 parsers default to namespace-handling enabled, while those created through JAXP have it disabled by default. This option affects how element names are reported, even if you don't use namespaces in your documents. For the sample code in this article, I assume that namespace handling is enabled. The easiest way to enable it with JAXP is to call the SAXParserFactory.setNamespaceAware() method with a true value before creating your parser.
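Here's a small sketch of the JAXP route that lets you flip the namespace setting and see what the handler receives for each element. NameReporter and report() are names invented for this example, as is the urn:example namespace URI in the test document. With namespace handling enabled you should see the namespace URI and local name filled in; with it disabled, what arrives in those two arguments depends on the parser, so rely only on the qualified name in that case.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: notes the three name arguments of each start tag.
class NameReporter extends DefaultHandler {
    private final List<String> names = new ArrayList<String>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        names.add("uri=" + uri + " local=" + local + " qName=" + qName);
    }

    // Parses the XML text through JAXP with the requested namespace setting.
    static List<String> report(String xml, boolean namespaceAware) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Must be set before the parser is created
        factory.setNamespaceAware(namespaceAware);
        SAXParser parser = factory.newSAXParser();
        NameReporter handler = new NameReporter();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.names;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<t:trade xmlns:t=\"urn:example\"/>";
        System.out.println("enabled:  " + report(doc, true));
        System.out.println("disabled: " + report(doc, false));
    }
}
```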

So far this doesn't sound too bad, but the interesting part starts when you call the parser with a document. The parser won't return from that call until parsing completes, but in the meantime, it'll call your handler methods for each and every parse event of the types you registered to handle. Your handler code makes sense of the call sequence and interprets it for your application.

Writing event-driven programs, as this handler technique is known, can be difficult. The problem: event streams turn the normal program structure inside out; instead of your program running the operation and requesting what it wants from the document, it hooks to an event stream hose that pushes the document at it, one small piece at a time.

Most applications need more structure than basic event streams provide. If you're working with an event-based parser, you must provide that structure by keeping state information that tracks your location in the document. Your state-information needs depend on the structure level you're working with. Using an event-based approach to handling your documents will be easiest when you work with simple structures within the document.
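One common way of keeping that state, shown here only as a sketch, is a stack of the currently open element names plus a buffer for character data. The PathTracker class and its leaves() method are hypothetical names for this illustration, and the code assumes simple documents in which text appears only inside leaf elements. Because the parser may deliver an element's text in several characters() calls, the handler accumulates the pieces and looks at the result only when the end tag arrives.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: tracks the open elements and collects leaf text.
class PathTracker extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<String>();
    private final StringBuilder text = new StringBuilder();
    private final List<String> leaves = new ArrayList<String>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        open.push(local);
        text.setLength(0);      // discard whitespace seen before this tag
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);   // may be called more than once
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        String value = text.toString().trim();
        if (value.length() > 0) {
            leaves.add(path() + "=" + value);
        }
        text.setLength(0);
        open.pop();
    }

    // Builds a slash-separated path from the outermost open element inward.
    private String path() {
        StringBuilder sb = new StringBuilder();
        Iterator<String> it = open.descendingIterator();
        while (it.hasNext()) {
            if (sb.length() > 0) {
                sb.append('/');
            }
            sb.append(it.next());
        }
        return sb.toString();
    }

    // Parses the XML text and returns "path=value" for each leaf with text.
    static List<String> leaves(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        PathTracker handler = new PathTracker();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.leaves;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<author>\n  <first-name>Dennis</first-name>\n"
            + "  <last-name>Sosnoski</last-name>\n</author>";
        System.out.println(leaves(doc));
    }
}
```

The deeper and more irregular the document structure, the more of this bookkeeping you need, which is why the event-based approach suits simple structures best.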

Watch the market

As an example, we'll work with a document that gives the history of stock trades over some span of time:

<?xml version="1.0"?>
<trade-history>
  <stock-trade>
    <symbol>SUNW</symbol>
    <time>08:45:19</time>
    <price>86.24</price>
    <quantity>500</quantity>
  </stock-trade>
  <stock-trade>
    <symbol>MSFT</symbol>
    <time>08:45:20</time>
    <price>22.26</price>
    <quantity>1000</quantity>
  </stock-trade>
</trade-history>

For each trade, the document above includes the symbol for the traded stock, the time the trade occurred, the price, and the number of shares, all as content of specific elements. The above sample shows only two trades (taking place at some unspecified future date), but you could easily extend it to any number of trades over any time period. In particular, it makes sense to use such a format in a ticker stream that provides a feed of all trades on an exchange during a trading day.

Suppose you want to parse such a stream and track all stock information, including high, low, and last trade prices for the day, along with share and dollar volumes, for each stock traded. An event-stream parser approach should give you what you need -- you can handle each individual stock-trade element as it's received, immediately updating your accumulated information so that it's always kept up to date. For your hypothetical ticker stream, this immediate handling is important, since the document won't end until the market closes. If you couldn't access the information until then, it might be too late to do you any good!

The stream may include many trades for each stock, so you'll need some form of data structure to hold onto the tracking information. Here's a class to handle that aspect:

import java.util.HashMap;

public class StockTrack
{
    // Map of stock symbols to tracking information
    protected static HashMap s_symbolMap = new HashMap();
    
    // Instance variables for information on a particular stock
    protected String m_stockSymbol;     // Symbol for this stock
    protected String m_lastTime;        // Time of last trade
    protected double m_highPrice;       // High trade price
    protected double m_lowPrice;        // Low trade price
    protected double m_lastPrice;       // Last trade price
    protected int m_totalShares;        // Total number of shares traded
    protected double m_totalDollars;    // Total dollar volume traded
    
    protected StockTrack(String sym) {
        m_stockSymbol = sym;
        m_lowPrice = Double.MAX_VALUE;
    }
    
    public String getSymbol() {
        return m_stockSymbol;
    }
    
    public String getLastTime() {
        return m_lastTime;
    }
    
    public double getHighPrice() {
        return m_highPrice;
    }
    
    public double getLowPrice() {
        return m_lowPrice;
    }
    
    public double getLastPrice() {
        return m_lastPrice;
    }
    
    public int getShareVolume() {
        return m_totalShares;
    }
    
    public double getDollarVolume() {
        return m_totalDollars;
    }
    
    public static StockTrack getTrack(String sym) {
        StockTrack track = (StockTrack)s_symbolMap.get(sym);
        if (track == null) {
            track = new StockTrack(sym);
            s_symbolMap.put(sym, track);
        }
        return track;
    }
    
    public static void recordTrade(String sym, String time, double price, 
        int shares) {
        StockTrack track = getTrack(sym);
        track.m_lastTime = time;
        if (track.m_highPrice < price) {
            track.m_highPrice = price;
        }
        if (track.m_lowPrice > price) {
            track.m_lowPrice = price;
        }
        track.m_lastPrice = price;
        track.m_totalShares += shares;
        track.m_totalDollars += shares*price;
    }
}

In StockTrack, a static HashMap links stock symbols with their respective tracking information. Along with some access methods for the member variables, the class includes a protected constructor and a pair of public client access methods. getTrack() looks up the tracking information for a supplied stock symbol, creating a new instance of the class for that stock symbol if no trades have been recorded yet. recordTrade() records a stock trade, first finding the stock information with getTrack(), then updating the information to reflect the new trade.

You should now have all you need for tracking. Next, let's look at how to interface between a parse-event handler and the tracking code.
