Take the sting out of SAX

Generate SAX parsers with XML Schemas

A Simple API for XML (SAX) parser is an invaluable tool for parsing XML files, especially large input files that do not fit in main memory. A SAX parser can also prove helpful if you have a slow input stream, like an Internet connection, and you need to process bytes as soon as they arrive instead of waiting for the complete input. As a bonus, a well-designed SAX parser is generally faster than processing a DOM (Document Object Model) tree: you need only one pass over the XML data, as opposed to the two passes needed with a DOM tree (one to build the tree, and one to do the processing).

Unfortunately, a SAX parser can be difficult to develop because of its event-driven nature. In this article, I create a source code generator that will help you easily develop a SAX parser.

Note: I don't explain SAX in detail here; see Resources below for some excellent references.

SAX reviewed

SAX is a standard API for parsers that read an XML input stream, like a file or network connection, and trigger events in an event-handler class. Many different SAX parser implementations are available for Java. In my examples here, I use Xerces from the Apache XML Project, one of the most popular parser implementations.
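To make the event model concrete, here is a minimal sketch that records the start and end events a parser fires for a small document. It assumes the JAXP SAXParserFactory bundled with the JDK rather than a direct Xerces call, and the class name SaxEventTrace is made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the article's download.
class SaxEventTrace {
    /** Parses the XML text and returns the element events in the order SAX reports them. */
    static List<String> trace(String xml) {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                events.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }
        };
        try {
            // The parser pushes events into the handler while it reads the input.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(trace("<name><first>John</first><last>Dole</last></name>"));
    }
}
```

Running main prints [start:name, start:first, end:first, start:last, end:last, end:name]. The handler never sees the document as a whole, only this stream of callbacks.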

Listings 1 and 2 below show an XML file and a SAX event handler, respectively. (You can download all source code and examples for this article from Resources.)

Listing 1. Example XML

<company name="My Widgets Inc.">
  <employees>
    <employee>
      <name>
        <first>John</first>
        <last>Dole</last>
      </name>
      <office>1-50</office>
      <telephone>123456</telephone>
    </employee>
    <employee>
      <name>
        <first>Jane</first>
        <last>Dole</last>
      </name>
      <office>1-51</office>
      <telephone>123457</telephone>
    </employee>
  </employees>
</company>

Listing 2. SAX handler

    public void startElement(java.lang.String uri, java.lang.String localName, java.lang.String qName, Attributes attributes) throws SAXException
    {
        text.reset();
        
        if (qName.equals ("company"))
        {
            String name = attributes.getValue("name");
            String header = "Employee Listing For "+name;
            System.out.println (header);
            System.out.println ();
        }
        
    }
    public void endElement(java.lang.String uri, java.lang.String localName, java.lang.String qName) throws SAXException
    {
        if (qName.equals ("first"))
        {
            firstName = getText();
        }
        if (qName.equals ("last"))
        {
            lastName = getText();
        }
        
        if (qName.equals ("office"))
        {
            office = getText();
        }
        
        if (qName.equals ("telephone"))
        {
            telephone = getText ();
        }
        
        if (qName.equals ("employee"))
        {
            System.out.println (office + "\t" + firstName + "\t" +
                lastName + "\t" + telephone);
        }
        
    }

The SAX handler above merely prints the XML file's data to the standard output device. It prints a header line containing the company name followed by tab-delimited employee data.
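Listing 2 shows only the two element callbacks. It relies on a text buffer, a getText() helper, and a characters() callback that are not shown. The following sketch is one plausible way to complete it. The class name, the field declarations, and the helper bodies are assumptions, and the output is collected in a StringBuilder instead of being printed so that it is easy to inspect.

```java
import java.io.CharArrayWriter;
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical completion of Listing 2; not the article's downloadable source.
class EmployeeListingHandler extends DefaultHandler {
    private final CharArrayWriter text = new CharArrayWriter();
    private final StringBuilder out = new StringBuilder();
    private String firstName, lastName, office, telephone;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.reset(); // forget any text seen before this start tag
        if (qName.equals("company")) {
            out.append("Employee Listing For ").append(attributes.getValue("name")).append("\n\n");
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may report one text node in several chunks, so accumulate.
        text.write(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        switch (qName) {
            case "first":     firstName = getText(); break;
            case "last":      lastName = getText(); break;
            case "office":    office = getText(); break;
            case "telephone": telephone = getText(); break;
            case "employee":
                out.append(office).append('\t').append(firstName).append('\t')
                   .append(lastName).append('\t').append(telephone).append('\n');
                break;
            default:
                break;
        }
    }

    /** Returns the buffered text without leading and trailing white space. */
    private String getText() {
        return text.toString().trim();
    }

    /** Parses the XML text and returns the listing instead of printing it. */
    static String render(String xml) {
        EmployeeListingHandler handler = new EmployeeListingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return handler.out.toString();
    }
}
```

Calling EmployeeListingHandler.render() with the XML from Listing 1 should give the header line, a blank line, and one tab-delimited line per employee. Note how much of the class exists only to remember state between callbacks.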

As you can see from Listing 2, parsing even a simple XML file can produce a significant amount of source code. SAX's event-driven (as opposed to document-driven) nature also makes the source code difficult to maintain and debug because you must be constantly aware of the parser's state when writing SAX code. Writing a SAX parser for complex document definitions can prove even more demanding; see Resources for challenging real-life examples.

We must reduce the work involved in writing an event-handler structure so we have more time to work on actual processing.

XML Schemas

To lighten our workload, we can automate most of the process of writing the event-handler structure. Luckily, the computer already knows the format of the XML file we will parse; the format is defined in a computer-readable DTD (document type definition) or in an XML Schema. I explore ways to use this knowledge for generating source code that removes the sting from SAX parser development. For this article, I rely on XML Schemas only. Though younger than DTDs, the XML Schema standard will probably replace DTDs in the future. You can easily convert your existing DTD files to XML Schemas with the help of some simple tools.

The first step towards building our code generator is to load the information contained in the XML Schema into a memory model. For this article, I use a simple memory model that defines only the XML element and attribute names, as well as the elements' relationships to each other. This custom model eases the code generation process. My simplified memory model consists of two classes: Element and Elements. The former stores information for one element, and the latter manages a list of elements.
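As a rough idea of what such a model could look like, the sketch below defines the two classes with just enough behavior for the templates later in this article, which read a name, a list of children, a hasChildren() check, and a root element. Everything beyond those four members, including the getOrCreate() method and the rule that the first registered element is the root, is an assumption and may differ from the downloadable source.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** One XML element declared in the schema: its name, attribute names, and child elements. */
class Element {
    private final String name;
    private final List<String> attributes = new ArrayList<>();
    private final List<Element> children = new ArrayList<>();

    Element(String name) {
        this.name = name;
    }

    public String getName() { return name; }
    public List<String> getAttributes() { return attributes; }
    public List<Element> getChildren() { return children; }

    /** A "complex" element in the article's sense is one that has child elements. */
    public boolean hasChildren() { return !children.isEmpty(); }
}

/** Registry of all known elements, kept in the order they were first seen. */
class Elements {
    private final Map<String, Element> byName = new LinkedHashMap<>();

    /** Returns the element with this name, creating it on first use (hypothetical helper). */
    public Element getOrCreate(String name) {
        return byName.computeIfAbsent(name, Element::new);
    }

    /** Assumption: the first element registered is the document root. */
    public Element getRootElement() {
        return byName.isEmpty() ? null : byName.values().iterator().next();
    }
}
```

Bean-style getters matter here because a template engine typically resolves a reference such as element.Name by calling getName() on the object.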

Next, we need a mechanism that populates the memory model from an XML Schema. Because an XML Schema is also an XML file, you can use a SAX parser to parse it and populate the memory model. In this case, a SAX parser is a good choice: you need only handle events for the element and attribute definitions you're interested in, and you can ignore the rest by letting the unneeded SAX events pass unhandled. See Resources for the XML Schema parser's full source code.
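The following sketch shows the idea of reading a schema with SAX while ignoring most of it. It is not the article's schema parser: it stores the result in two plain maps to stay self-contained, and it assumes a schema where child elements and attributes are declared inside the element they belong to (anonymous, nested complex types) and where any ref values are unprefixed. The class and field names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical schema reader; handles only xs:element and xs:attribute.
class SchemaOutlineHandler extends DefaultHandler {
    private static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";

    final Map<String, List<String>> childNames = new LinkedHashMap<>();
    final Map<String, List<String>> attributeNames = new LinkedHashMap<>();
    private final Deque<String> open = new ArrayDeque<>(); // enclosing xs:element declarations

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (!XSD_NS.equals(uri)) return;
        if (localName.equals("element")) {
            String name = atts.getValue("name");
            if (name == null) name = atts.getValue("ref"); // reference to a global declaration
            if (name == null) throw new SAXException("xs:element without name or ref");
            childNames.computeIfAbsent(name, k -> new ArrayList<>());
            attributeNames.computeIfAbsent(name, k -> new ArrayList<>());
            if (!open.isEmpty()) childNames.get(open.peek()).add(name);
            open.push(name);
        } else if (localName.equals("attribute") && !open.isEmpty()) {
            String name = atts.getValue("name");
            if (name != null) attributeNames.get(open.peek()).add(name);
        }
        // Every other schema construct (types, sequences, annotations) passes unhandled.
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (XSD_NS.equals(uri) && localName.equals("element")) open.pop();
    }

    /** Parses schema text and returns the populated handler. */
    static SchemaOutlineHandler read(String xsd) {
        SchemaOutlineHandler handler = new SchemaOutlineHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // needed to see the schema namespace URI
            factory.newSAXParser().parse(new InputSource(new StringReader(xsd)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse schema", e);
        }
        return handler;
    }
}
```

The stack of open declarations is what turns the flat event stream back into parent and child relationships.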

Once we load the XML Schema information into memory, we can start generating source code for our new SAX parser.

Source code templates

To generate the SAX parser's source code, I use a text-based template engine, which lets me easily insert the memory model's information into source code templates. My favorite text-based template engine is Velocity from Apache's Jakarta project.

You can easily change my source code templates to suit your needs; doing so requires a text editor for editing the templates and only a basic knowledge of Velocity's syntax.
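If you have not used a template engine before, the core idea is placeholder substitution: references such as ${element.Name} in the template text are replaced with values taken from the memory model. The sketch below is a plain-Java stand-in for that idea, not Velocity and not its API. It supports only flat ${key} lookups from a map, with none of Velocity's directives or property navigation, and the class name TinyTemplate is made up.

```java
import java.util.Collections;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in for a template engine: replaces ${key} placeholders with map values.
class TinyTemplate {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([A-Za-z_.]+)\\}");

    static String render(String template, Map<String, String> values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = values.get(matcher.group(1));
            if (value == null) {
                throw new IllegalArgumentException("No value for " + matcher.group(1));
            }
            // quoteReplacement keeps '$' and '\' in the value from being interpreted.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "public class ${element.Name}handler extends DefaultHandler";
        System.out.println(render(line, Collections.singletonMap("element.Name", "company")));
    }
}
```

Running main prints: public class companyhandler extends DefaultHandler. A real engine adds loops and conditionals on top of this, which is what lets one template produce a method per child element.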

My SAX parser source code templates generate a separate event handler, or Java class, for each complex XML element. I define a complex element as one that might contain other XML elements. Methods inside the complex elements' event handlers handle simple elements, that is, those elements that contain only text content and/or attributes. Because the code is split across multiple classes, you can more easily find the right place to insert custom source code. The separate event handlers also make code easier to maintain, should any bugs occur later.

The first source code template is for the class that handles events for complex XML elements. It creates methods for each child element as well as temporary storage for XML attributes:

Listing 3. Event handler template

package ${package};
// JDK Classes
import java.util.*;
import java.io.*;
// Xerces Classes
import org.xml.sax.*;
import org.apache.xerces.parsers.*;
import org.xml.sax.helpers.DefaultHandler;
public class ${element.Name}handler extends DefaultHandler
{
    private CharArrayWriter text = new CharArrayWriter ();
    private Stack path;
    private Map params;
    private DefaultHandler parent;
    private SAXParser parser;
    public ${element.Name}handler(Stack path, Map params, Attributes attributes, SAXParser parser, DefaultHandler parent)  throws SAXException
    {
        this.path = path;
        this.params = params;
        this.parent = parent;
        this.parser = parser;
        start(attributes);
    }
    
## Some code omitted
    public void endElement(java.lang.String uri, java.lang.String localName, java.lang.String qName) throws SAXException
    {
        if (qName.equals("${element.Name}"))
        {
            end();
            path.pop();
            parser.setContentHandler (parent);
        }
        #foreach ($child in $element.Children)
          #if ($child.hasChildren())
          #else
            if (qName.equals("${child.Name}")) end${child.Name} ();
          #end
        #end
    }

The second class template is the entry point for the SAX parser and is responsible for initialization tasks and for calling the root element's handler:

Listing 4. The parser template

## Some code omitted
    public void startElement(java.lang.String uri, java.lang.String localName, java.lang.String qName, Attributes attributes) throws SAXException
    {
        if (qName.equals("${elements.RootElement.Name}"))
        {
            DefaultHandler handler = new ${elements.RootElement.Name}handler(path,params,attributes,parser,this);
            path.push ("${elements.RootElement.Name}");
            parser.setContentHandler (handler);
        }
    }
## Some code omitted
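Both templates rely on the same run-time trick: when a complex element starts, the current handler installs a child handler as the parser's content handler, and the child hands control back when its element ends. The sketch below shows that hand-off written by hand for the name element. It assumes the standard XMLReader interface from JAXP instead of the Xerces SAXParser class the templates use, and the class names, constructor signatures, and the fullName field are hypothetical.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Handles everything inside <name>, then returns control to its parent. */
class NameHandlerSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final Map<String, String> params;
    private final XMLReader reader;
    private final ContentHandler parent;

    NameHandlerSketch(Map<String, String> params, XMLReader reader, ContentHandler parent) {
        this.params = params;
        this.reader = reader;
        this.parent = parent;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("first")) {
            params.put("firstname", text.toString().trim());
        } else if (qName.equals("last")) {
            params.put("lastname", text.toString().trim());
        } else if (qName.equals("name")) {
            reader.setContentHandler(parent); // complex element closed: hand control back
        }
    }
}

/** Parent handler: delegates <name> to the child and reads the result from the shared map. */
class EmployeeHandlerSketch extends DefaultHandler {
    private final Map<String, String> params = new HashMap<>();
    private final XMLReader reader;
    String fullName;

    EmployeeHandlerSketch(XMLReader reader) {
        this.reader = reader;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (qName.equals("name")) {
            reader.setContentHandler(new NameHandlerSketch(params, reader, this));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("employee")) {
            fullName = params.get("firstname") + " " + params.get("lastname");
        }
    }

    static String parse(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            EmployeeHandlerSketch root = new EmployeeHandlerSketch(reader);
            reader.setContentHandler(root);
            reader.parse(new InputSource(new StringReader(xml)));
            return root.fullName;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

For the input <employee><name><first>John</first><last>Dole</last></name><office>1-50</office></employee>, parse() returns "John Dole". SAX allows the content handler to be replaced in the middle of a parse, which is what makes this per-element split possible.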

The controller class

Now we simply put everything together in a controller class (download the class's source code from Resources). A controller class handles the process's logic, as in the MVC (Model-View-Controller) pattern.

Called Generator, the controller class requires two command-line parameters. The first parameter indicates the XML Schema to use, and the second gives the output classes' package name. Generator then loads the XML Schema into memory and executes the source code templates.

With the Generator class, you can easily create a SAX parser. To illustrate how to use the SAX generator, let's create a SAX parser for Listing 1's XML. I include that listing's XML Schema (example1.xsd) in Resources as well as the SAX generator's source and binary versions. Before you use the SAX generator's prepackaged binary version, read the readme.txt file for usage directions and required external jar libraries. Also, make sure you correctly set your $JAVA_HOME environment variable. Now you can use generate.bat (for Windows machines) or generate.sh (for Unix/Linux machines) to start the SAX generator. To create a SAX parser for example1.xsd, execute one of the following on the command line:

For Windows:

generate examples\example-1.xsd com.mycompany.package

For Unix/Linux:

./generate.sh examples/example-1.xsd com.mycompany.package     

The first parameter indicates the XML Schema the program should use to build the SAX parser; the second parameter indicates the Java package name for the new classes.

This process gives you a set of new classes that form the basis of a new SAX parser. They are located in your SAX generator's output/ subdirectory. Assuming you used example1.xsd, you will have classes called CompanyHandler, EmployeesHandler, EmployeeHandler, and NameHandler.

Use the generated SAX parser

To use the generated SAX parser, you must create an entry-point class instance, named Parser by default, and call the parse() method. Listing 5 shows you how to initiate the SAX parser:

Listing 5. Initiate the SAX parser

public static void main (String[] args) throws Exception
{
    Parser parser = new Parser();
    // try-with-resources closes the file even if parsing fails
    try (FileInputStream fis = new FileInputStream (args[0]))
    {
        parser.parse (fis);
    }
}

At this stage, the newly generated classes do nothing; we must write implementations for the relevant methods. We must write an implementation for the CompanyHandler class to print the company heading. Currently, the CompanyHandler class has only empty methods: The SAX parser calls this handler's start() method when it encounters the <company> element; the end() method executes when the closing </company> is parsed. The startEmployees() method executes when the parser enters an <employees> element.

In this case, we want to print the company name when the <company> element starts, so we must add code to the start() method. Note that the SAX generator has already declared local variables for the element's attributes. After we add code to print the header line, the method looks like this:

Listing 6. Print the company header

public void start (Attributes attributes)  throws SAXException
{
    // Attribute names are case-sensitive; Listing 1 declares the attribute as "name"
    String name = attributes.getValue("name");
    System.out.println ("Employee Listing For "+name);
    System.out.println ();
}

Before we can print the employee information, we first must handle the <name> element. Because this element is complex, it has a separate handler class and needs some way to communicate name information back to the EmployeeHandler class. For this purpose, I created a global map object, called params, which allows you to pass information from one handler to another. The Parser class automatically creates this map for you; to use the map, you simply need to add some information to it.

Now we need to add the text data enclosed in the XML elements to the params map. To access an XML element's text content, use the getText() method. This utility method returns the text enclosed in an element with leading and trailing white-space characters removed. The following code snippet adds the text from <first> and <last> to the params map:

Listing 7. Add information to global params

public void endfirst () throws SAXException
{
    params.put ("firstname",getText());
}
  
public void endlast () throws SAXException
{
    params.put ("lastname",getText());
}

Now we can retrieve this information in the EmployeeHandler class. Because we must wait until the entire element has been parsed (remember the <name> tag has to be handled before we print the employee information), we must add the following code to the end() method:

Listing 8. The employee handler
