Programming XML in Java, Part 1

Create Java apps with SAX appeal

So, you understand (more or less) how you would represent your data in XML, and you're interested in using XML to solve many of your data-management problems. Yet you're not sure how to use XML with your Java programs.

This article is a follow-up to my introductory article, "XML for the absolute beginner", in the April 1999 issue of JavaWorld (see the Resources section below for the URL). That article described XML; I will now build on that description and show in detail how to create an application that uses the Simple API for XML (SAX), a lightweight and powerful standard Java API for processing XML.

The example code here uses the SAX API to read an XML file and create a useful structure of objects. By the time you've finished this article, you'll be ready to create your own XML-based applications.

The virtue of laziness

Larry Wall, mad genius creator of Perl (the second-greatest programming language in existence), has stated that laziness is one of the "three great virtues" of a programmer (the other two being impatience and hubris). Laziness is a virtue because a lazy programmer will go to almost any length to avoid work, even going so far as creating general, reusable programming frameworks that can be used repeatedly. Creating such frameworks entails a great deal of work, but the time saved on future assignments more than makes up for the initial effort invested. The best frameworks let programmers do amazing things with little or no work -- and that's why laziness is virtuous.

XML is an enabling technology for the virtuous (lazy) programmer. A basic XML parser does a great deal of work for the programmer, recognizing tokens, translating encoded characters, enforcing rules on XML file structure, checking the validity of some data values, and making calls to application-specific code, where appropriate. In fact, early standardization, combined with a fiercely competitive marketplace, has produced scores of freely available implementations of standard XML parsers in many languages, including C, C++, Tcl, Perl, Python, and, of course, Java.

The SAX API is one of the simplest and most lightweight interfaces for handling XML. In this article, I'll use IBM's XML4J implementation of SAX, but since the API is standardized, your application could substitute any package that implements SAX.

SAX is an event-based API, operating on the callback principle. An application programmer will typically create a SAX Parser object, and pass it both input XML and a document handler, which receives callbacks for SAX events. The SAX Parser converts its input into a stream of events corresponding to structural features of the input, such as XML tags or blocks of text. As each event occurs, it is passed to the appropriate method of a programmer-defined document handler, which implements the callback interface org.xml.sax.DocumentHandler. The methods in this handler class perform the application-specific functionality during the parse.

For example, imagine that a SAX parser receives the tiny XML document shown in Listing 1 below. (See Resources for the XML file.)

<POEM>
<AUTHOR>Ogden Nash</AUTHOR>
<TITLE>Fleas</TITLE>
<LINE>Adam</LINE>
<LINE>Had 'em.</LINE>
</POEM>

Listing 1. XML representing a short poem

When the SAX parser encounters the <POEM> tag, it calls the user-defined DocumentHandler.startElement() with the string POEM as its first argument. You implement the startElement() method to do whatever the application is meant to do when a POEM begins. The stream of events and resulting calls for the piece of XML above appear in Table 1 below.

Table 1. The sequence of callbacks SAX produces while parsing Listing 1

Item encountered          Parser callback
{Beginning of document}   startDocument()
<POEM>                    startElement("POEM", {AttributeList})
"\n"                      characters("<POEM>\n...", 6, 1)
<AUTHOR>                  startElement("AUTHOR", {AttributeList})
"Ogden Nash"              characters("<POEM>\n...", 15, 10)
</AUTHOR>                 endElement("AUTHOR")
"\n"                      characters("<POEM>\n...", 34, 1)
<TITLE>                   startElement("TITLE", {AttributeList})
"Fleas"                   characters("<POEM>\n...", 42, 5)
</TITLE>                  endElement("TITLE")
"\n"                      characters("<POEM>\n...", 55, 1)
<LINE>                    startElement("LINE", {AttributeList})
"Adam"                    characters("<POEM>\n...", 62, 4)
</LINE>                   endElement("LINE")
"\n"                      characters("<POEM>\n...", 73, 1)
<LINE>                    startElement("LINE", {AttributeList})
"Had 'em."                characters("<POEM>\n...", 80, 8)
</LINE>                   endElement("LINE")
"\n"                      characters("<POEM>\n...", 95, 1)
</POEM>                   endElement("POEM")
{End of document}         endDocument()
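
To watch a callback sequence like this one on your own machine, a small tracing handler is enough. The sketch below is illustrative only: EventTracer and its trace() method are hypothetical names, and it assumes a JDK whose JAXP SAXParser.getParser() hands back a SAX 1.0 Parser, standing in for the IBM parser used in this article. It records the text passed to characters() rather than buffer offsets, since those depend on the parser's internal buffer, and a parser may also deliver one run of text in several chunks.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

/** Hypothetical helper: records each callback in the order the parser makes it. */
@SuppressWarnings("deprecation")
class EventTracer extends HandlerBase {
    final List<String> events = new ArrayList<>();

    @Override public void startDocument() { events.add("startDocument()"); }
    @Override public void endDocument() { events.add("endDocument()"); }

    @Override public void startElement(String name, AttributeList atts) {
        events.add("startElement(" + name + ")");
    }

    @Override public void endElement(String name) {
        events.add("endElement(" + name + ")");
    }

    @Override public void characters(char[] ch, int start, int length) {
        // Show newlines as \n so whitespace-only events are visible.
        String text = new String(ch, start, length).replace("\n", "\\n");
        events.add("characters(\"" + text + "\")");
    }

    /** Parses the XML text and returns the recorded callbacks. */
    static List<String> trace(String xml) {
        EventTracer tracer = new EventTracer();
        try {
            // Assumption: JAXP supplies the SAX 1.0 Parser instead of XML4J.
            Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
            parser.setDocumentHandler(tracer);
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
        return tracer.events;
    }

    public static void main(String[] args) {
        String poem = "<POEM>\n<AUTHOR>Ogden Nash</AUTHOR>\n<TITLE>Fleas</TITLE>\n"
                + "<LINE>Adam</LINE>\n<LINE>Had 'em.</LINE>\n</POEM>";
        trace(poem).forEach(System.out::println);
    }
}
```

Running main() prints one line per callback for the poem in Listing 1, which you can compare against Table 1.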

You create a class that implements DocumentHandler to respond to events that occur in the SAX parser. These events aren't Java events as you may know them from the Abstract Window Toolkit (AWT). They are conditions the SAX parser detects as it parses, such as the start of a document or the occurrence of a closing tag in the input stream. As each of these conditions (or events) occurs, SAX calls the method corresponding to the condition in its DocumentHandler.

So, the key to writing programs that process XML with SAX is to figure out what the DocumentHandler should do in response to a stream of method callbacks from SAX. The SAX parser takes care of all the mechanics of identifying tags, substituting entity values, and so on, leaving you free to concentrate on the application-specific functionality that uses the data encoded in the XML.

Table 1 shows only events associated with elements and characters. SAX also includes facilities for handling other structural features of XML files, such as entities and processing instructions, but these are beyond the scope of this article.

The astute reader will notice that an XML document can be represented as a tree of typed objects, and that the order of the stream of events presented to the DocumentHandler corresponds to a depth-first traversal of the document tree in document order: startElement() fires on the way down into an element, and endElement() fires on the way back out. (It isn't essential to understand this point, but the concept of an XML document as a tree data structure is useful in more sophisticated types of document processing, which will be covered in later articles in this series.)

The key to understanding how to use SAX is understanding the DocumentHandler interface, which I will discuss next.

Customize the parser with org.xml.sax.DocumentHandler

Since the DocumentHandler interface is so central to processing XML with SAX, it's worthwhile to understand what the methods in the interface do. I'll cover the essential methods in this section, and skip those that deal with more advanced topics. Remember, DocumentHandler is an interface, so the methods I'm describing are methods that you will implement to handle application-specific functionality whenever the corresponding event occurs.

Document initialization and cleanup

For each document parsed, the SAX XML parser calls the DocumentHandler interface methods startDocument() (called before processing begins) and endDocument() (called after processing is complete). You can use these methods to initialize your DocumentHandler to prepare it for receiving events and to clean up or produce output after parsing is complete. endDocument() is particularly interesting, since it's only called if an input document has been successfully parsed. If the Parser generates a fatal error, it simply aborts the event stream and stops parsing, and endDocument() is never called.
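
As a rough illustration of this lifecycle, the sketch below resets its state in startDocument() and only trusts its result once endDocument() has run. ElementCounter and countElements() are hypothetical names, and the parser is obtained through the JDK's JAXP classes as a stand-in for XML4J. Parsers can differ in exactly what they do after a fatal error, so the sketch treats either an exception or a missing endDocument() call as a failed parse.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

/** Hypothetical helper: counts elements, reporting a total only for complete parses. */
@SuppressWarnings("deprecation")
class ElementCounter extends HandlerBase {
    private int count;
    private boolean finished;

    @Override public void startDocument() {
        // Reset state so one handler instance can be reused across documents.
        count = 0;
        finished = false;
    }

    @Override public void startElement(String name, AttributeList atts) {
        count++;
    }

    @Override public void endDocument() {
        // Reached only when the parser has worked through the whole input.
        finished = true;
    }

    /** Returns the number of elements, or -1 if the parse did not complete. */
    static int countElements(String xml) {
        ElementCounter counter = new ElementCounter();
        try {
            // Assumption: JAXP supplies the SAX 1.0 Parser instead of XML4J.
            Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
            parser.setDocumentHandler(counter);
            parser.setErrorHandler(counter); // HandlerBase rethrows fatal errors
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            return -1; // the parser aborted the event stream
        }
        return counter.finished ? counter.count : -1;
    }
}
```

Calling ElementCounter.countElements() on a well-formed document returns its element count; on a document with a missing close tag it returns -1.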

Processing tags

The SAX parser calls startElement() whenever it encounters an open tag, and endElement() whenever it encounters a close tag. These methods often contain the code that does the majority of the work while parsing an XML file. startElement()'s first argument is a string, which is the tag name of the element encountered. The second argument is an object of type AttributeList, an interface defined in package org.xml.sax that provides sequential or random access to element attributes by name. (You've undoubtedly seen attributes before in HTML; in the line <TABLE BORDER="1">, BORDER is an attribute whose value is "1"). Since Listing 1 includes no attributes, they don't appear in Table 1. You'll see examples of attributes in the sample application later in this article.
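
Here is a minimal sketch of both styles of attribute access, using the HTML-like TABLE example above. AttributeDumper and dump() are hypothetical names; getLength(), getName(int), getValue(int), and getValue(String) are the AttributeList methods it relies on. As an assumption, the parser comes from the JDK's JAXP classes rather than XML4J.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

/** Hypothetical helper: lists the attributes of every element it sees. */
@SuppressWarnings("deprecation")
class AttributeDumper extends HandlerBase {
    final List<String> seen = new ArrayList<>();

    @Override public void startElement(String name, AttributeList atts) {
        // Sequential access: walk the list by index.
        for (int i = 0; i < atts.getLength(); i++) {
            seen.add(name + "@" + atts.getName(i) + "=" + atts.getValue(i));
        }
        // Random access: look a value up by name (null if the attribute is absent).
        String border = atts.getValue("BORDER");
        if (border != null) {
            seen.add("border width: " + border);
        }
    }

    /** Parses the XML text and returns a description of each attribute found. */
    static List<String> dump(String xml) {
        AttributeDumper dumper = new AttributeDumper();
        try {
            // Assumption: JAXP supplies the SAX 1.0 Parser instead of XML4J.
            Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
            parser.setDocumentHandler(dumper);
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
        return dumper.seen;
    }
}
```

Note that the AttributeList is only valid for the duration of the startElement() call, so copy out any values you need to keep.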

Since SAX doesn't provide any information about the context of the elements it encounters (that <AUTHOR> appears inside <POEM> in Listing 1 above, for example), it is up to you to supply that information. Application programmers often use stacks in startElement() and endElement(), pushing objects onto a stack when an element starts, and popping them off of the stack when the element ends.
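
The stack technique might look something like the following sketch, which keeps the names of the currently open elements on a stack and records the full path of each element as it starts. PathTracker and paths() are hypothetical names, and the parser is again assumed to come from the JDK's JAXP classes rather than XML4J.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

/** Hypothetical helper: tracks which elements are open to give each one a context. */
@SuppressWarnings("deprecation")
class PathTracker extends HandlerBase {
    private final Deque<String> open = new ArrayDeque<>();
    final List<String> paths = new ArrayList<>();

    @Override public void startElement(String name, AttributeList atts) {
        open.addLast(name);                  // push: this element is now open
        paths.add(String.join("/", open));   // e.g. POEM/AUTHOR
    }

    @Override public void endElement(String name) {
        open.removeLast();                   // pop: back in the parent's context
    }

    /** Parses the XML text and returns the path of every element, in document order. */
    static List<String> paths(String xml) {
        PathTracker tracker = new PathTracker();
        try {
            // Assumption: JAXP supplies the SAX 1.0 Parser instead of XML4J.
            Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
            parser.setDocumentHandler(tracker);
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
        return tracker.paths;
    }
}
```

A real application would more likely push the objects it is building rather than bare tag names, but the push-on-start, pop-on-end pattern is the same.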

Processing blocks of text

The characters() method reports character content in the XML document -- characters that don't appear inside an XML tag, in other words. This method's signature is a bit odd. The first argument is an array of characters (char[]), the second is an index into that array indicating the first character of the range to be processed, and the third argument is the length of the character range.

It might seem that an easier API would have simply passed a String object containing the data, but characters() was defined in this way for efficiency reasons. The parser has no way of knowing whether or not you're going to use the characters, so as the parser parses its input buffer, it passes a reference to the buffer and the indices of the string it is viewing, trusting that you will construct your own String if you want one. It's a bit more work, but it lets you decide whether or not to incur the overhead of String construction for content pieces in an XML file.

The characters() method handles both regular text content and content inside CDATA sections, which are used to prevent blocks of literal text from being parsed by an XML parser.
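
A short sketch of handling this signature follows. It copies only the reported slice of the parser's buffer into a StringBuilder, which also copes with a parser that delivers one run of text in several characters() calls. TextCollector and collect() are hypothetical names, and the JAXP parser is an assumption standing in for XML4J.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

/** Hypothetical helper: gathers all character content in a document. */
@SuppressWarnings("deprecation")
class TextCollector extends HandlerBase {
    private final StringBuilder text = new StringBuilder();

    @Override public void characters(char[] ch, int start, int length) {
        // Copy only the slice [start, start + length); the rest of the
        // array belongs to the parser and may hold unrelated input.
        text.append(ch, start, length);
    }

    /** Parses the XML text and returns its concatenated character content. */
    static String collect(String xml) {
        TextCollector collector = new TextCollector();
        try {
            // Assumption: JAXP supplies the SAX 1.0 Parser instead of XML4J.
            Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
            parser.setDocumentHandler(collector);
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
        return collector.text.toString();
    }
}
```

Feeding it a LINE element that mixes ordinary text, an entity reference such as &amp;amp;, and a CDATA section shows all three arriving through the same callback.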

Other methods

There are three other methods in the DocumentHandler interface: ignorableWhitespace(), processingInstruction(), and setDocumentLocator(). ignorableWhitespace() reports occurrences of white space, and is usually unused in nonvalidating SAX parsers (such as the one we're using for this article); processingInstruction() handles most things within <? and ?> delimiters; and setDocumentLocator() is optionally implemented by SAX parsers to give you access to the locations of SAX events in the original input stream. You can read up on these methods by following the links on the SAX interfaces in Resources.
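
For a feel of how two of these methods fit together, the sketch below saves the Locator if the parser offers one and uses it to note where each processing instruction appeared. InstructionLogger, log(), and the render instruction in the usage note are hypothetical; because a parser is not obliged to supply a Locator, the code allows for it being null. The parser is assumed to come from the JDK's JAXP classes rather than XML4J.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.Parser;

/** Hypothetical helper: records processing instructions and where they occurred. */
@SuppressWarnings("deprecation")
class InstructionLogger extends HandlerBase {
    private Locator locator; // stays null if the parser does not supply one
    final List<String> instructions = new ArrayList<>();

    @Override public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override public void processingInstruction(String target, String data) {
        String where = (locator == null)
                ? "unknown line"
                : "line " + locator.getLineNumber();
        instructions.add(target + " [" + data + "] at " + where);
    }

    /** Parses the XML text and returns one entry per processing instruction. */
    static List<String> log(String xml) {
        InstructionLogger logger = new InstructionLogger();
        try {
            // Assumption: JAXP supplies the SAX 1.0 Parser instead of XML4J.
            Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
            parser.setDocumentHandler(logger);
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
        return logger.instructions;
    }
}
```

For input such as <POEM><?render font="serif"?></POEM>, the handler receives "render" as the target and font="serif" as the data.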

Implementing all of the methods in an interface can be tedious if you're only interested in the behavior of one or two of them. The SAX package includes a class called HandlerBase that basically does nothing, but can help you take advantage of just one or two of these methods. Let's examine this class in more detail.

HandlerBase: A do-nothing class

Often, you're only interested in implementing one or two methods in an interface, and want the other methods to simply do nothing. The class org.xml.sax.HandlerBase simplifies the implementation of the DocumentHandler interface by implementing all of the interface's methods with do-nothing bodies. Then, instead of implementing DocumentHandler, you can subclass HandlerBase, and only override the methods that interest you.

For example, say you wanted to write a program that just printed the title of any XML-formatted poem (like the poem in Listing 1). You could define a new DocumentHandler, like the TitleFinder class in Listing 2 below, that subclasses HandlerBase and overrides only the methods you need. (See Resources for an HTML file of TitleFinder.)

import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;
import org.xml.sax.helpers.ParserFactory;

/**
 * SAX DocumentHandler class that prints the contents of the "TITLE" element
 * of an input document.
 */
public class TitleFinder extends HandlerBase {
    // True while the parser is between <TITLE> and </TITLE>.
    boolean _isTitle = false;

    public TitleFinder() {
        super();
    }

    /**
     * Print any text found inside a <TITLE> element.
     */
    public void characters(char[] chars, int iStart, int iLen) {
        if (_isTitle) {
            String sTitle = new String(chars, iStart, iLen);
            System.out.println("Title: " + sTitle);
        }
    }

    /**
     * Mark title element end.
     */
    public void endElement(String element) {
        if (element.equals("TITLE")) {
            _isTitle = false;
        }
    }

    /**
     * Find contents of titles in the XML file named by the first argument.
     */
    public static void main(String args[]) {
        TitleFinder titleFinder = new TitleFinder();
        try {
            Parser parser = ParserFactory.makeParser("com.ibm.xml.parsers.SAXParser");
            parser.setDocumentHandler(titleFinder);
            parser.parse(new InputSource(args[0]));
        } catch (Exception ex) {
            // OK, so sometimes laziness *isn't* a virtue: at least report the failure.
            System.err.println("Parse failed: " + ex);
        }
    }

    /**
     * Mark title element start.
     */
    public void startElement(String element, AttributeList attrlist) {
        if (element.equals("TITLE")) {
            _isTitle = true;
        }
    }
}


Listing 2. TitleFinder: A DocumentHandler derived from HandlerBase that prints TITLEs
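
The same subclassing trick works when only a single callback matters. As a further illustrative sketch, the hypothetical TagNameCollector below overrides nothing but startElement() and inherits do-nothing behavior for everything else. It assumes a JDK whose JAXP SAXParser.getParser() provides the SAX 1.0 Parser, so it runs without the IBM parser class on the classpath.

```java
import java.io.StringReader;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;

/** Hypothetical helper: collects the distinct tag names used in a document. */
@SuppressWarnings("deprecation")
class TagNameCollector extends HandlerBase {
    final Set<String> names = new TreeSet<>();

    // The only override; HandlerBase supplies empty bodies for the rest.
    @Override public void startElement(String name, AttributeList atts) {
        names.add(name);
    }

    /** Parses the XML text and returns its tag names in alphabetical order. */
    static Set<String> tagNames(String xml) {
        TagNameCollector collector = new TagNameCollector();
        try {
            // Assumption: JAXP supplies the SAX 1.0 Parser instead of XML4J.
            Parser parser = SAXParserFactory.newInstance().newSAXParser().getParser();
            parser.setDocumentHandler(collector);
            parser.parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
        return collector.names;
    }
}
```

For the poem in Listing 1, tagNames() returns the four names AUTHOR, LINE, POEM, and TITLE.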
