Programming XML in Java, Part 3

DOMination: Take control of structured documents with the Document Object Model

The Simple API for XML (SAX) is an excellent interface for many XML applications. It is intuitive, extremely easy to learn, and, as its name implies, simple. Any Java programmer can, in just an hour or two, learn to use and develop an application using SAX. It is especially useful in situations where the data in an XML file is already in a form that is structurally similar to the desired output. For instance, the recipe example in Part 2 of this series formatted Recipe XML into an HTML representation of a recipe page and a shopping list. The structure of the output HTML was very similar to the structure of the input XML. The ingredients in the Recipe XML were grouped together in an <Ingredients> element; the ingredients in the output HTML were grouped together in an unordered list (<ul>). The tags were somewhat different, but the basic structure was the same.


In real data-processing situations, however, the structure of the input data often differs greatly from the eventual output structure. Since SAX passes SAX events to a programmer-defined handler in the order in which they appear in the input XML, as the programmer you are responsible for any data restructuring or reordering. Also, if the same data is to be used in more than one place in the output, you must either perform multiple passes over the XML or arrange for the handler to "remember" that data while producing output. One example of this was the recipe title in Part 2, which the handler maintained in an internal variable for use both in the browser title bar and in the Web page itself.
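To make the "remembering" idea concrete, here is a minimal sketch of a handler that stores a title during the parse and then uses it twice. It is not the handler from Part 2: the element names <Recipe> and <Title>, the class name, and the toHtml() helper are all hypothetical, and the sketch assumes the SAX 2 DefaultHandler class and the JAXP parser factory that ship with current JDKs.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Sketch only: <Recipe> and <Title> are hypothetical element names.
class TitleRememberingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder(); // text of the current element
    private String title = "";                              // remembered for later reuse

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("Title".equals(qName)) {
            title = text.toString().trim(); // store it; SAX will not offer it again
        }
    }

    // Parses the XML once, then uses the remembered title in two places.
    static String toHtml(String xml) throws Exception {
        TitleRememberingHandler handler = new TitleRememberingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return "<html><head><title>" + handler.title + "</title></head>"
             + "<body><h1>" + handler.title + "</h1></body></html>";
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toHtml("<Recipe><Title>Lime Jello</Title></Recipe>"));
    }
}
```

Even in this tiny case, the handler needs a field whose only job is to carry data from one part of the parse to another. That bookkeeping is what grows as the output structure drifts away from the input structure.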

For tasks of low and intermediate complexity, SAX works just fine. As an application's complexity (and functionality) increases, however, the SAX handler code can become extremely difficult to understand. SAX code can spend most of its time storing information from the input in an internal form usable for producing the desired output. When using SAX, you are generally responsible for creating an internal object model of your application's information.

DOM to the rescue

The Document Object Model, or DOM, is a standardized object model for XML documents. DOM is a set of interfaces describing an abstract structure for an XML document. Programs that access document structures through the DOM interface can arbitrarily insert, delete, and rearrange the nodes of an XML document programmatically.

DOM and SAX parsers work in different ways. A SAX parser processes the XML document as it parses the XML input stream, passing SAX events to a programmer-defined handler method. A DOM parser, on the other hand, parses the entire input XML stream and returns a Document object. Document is the programmatic, language-neutral interface that represents a document. The Document returned by the DOM parser has an API that lets you manipulate a (virtual) tree of Node objects; this tree represents the structure of the input XML. Figures 1 and 2 illustrate this difference between the APIs.

Figure 1. The SAX parser calling programmer-defined handler routines
Figure 2. The DOM parser returning a Document object

In Figure 1, you see the SAX parser calling programmer-defined handler routines for each tag in the XML document. In Figure 2, the DOM parser returns a Document object, which represents the hierarchical structure of the tags (and such other informational elements as attributes, text blocks, and so on) in the original XML. When the parse has completed, you use the methods that are in the Document API to access the contents of the XML tree.
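The DOM side of that picture might look like the following sketch. It assumes the JAXP DocumentBuilderFactory bundled with current JDKs, which is only one of several ways to obtain a DOM parser; the class name and the sample recipe markup are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Sketch only: the recipe markup below is a made-up sample.
class DomParseSketch {
    // The parser consumes the whole input and hands back a Document.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
            "<Recipe><Ingredients><Item>flour</Item><Item>sugar</Item></Ingredients></Recipe>");

        // Only after the parse has completed do you start asking questions of the tree.
        Element root = doc.getDocumentElement();
        System.out.println(root.getTagName());                            // prints "Recipe"
        System.out.println(root.getElementsByTagName("Item").getLength()); // prints 2
    }
}
```

Notice that there is no handler at all: the program asks the tree for what it wants, in whatever order it likes.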

One major benefit of the DOM parser is that it provides random access to the structures inside the XML tree. Imagine, for example, that you are writing a genealogy application that could show any individual's relatives from that individual's point of view. The original XML document representing your family would include you as the child of two parents and possibly a parent of one or more children. Now let's say you want to create a program that could print a personal report for any person in the tree. If you were to write that program using SAX, you'd have two tasks. First, you'd probably need to build a representation of your family tree in memory, so you could access any node in the tree and print that node's relatives. Your second task, after the parse was complete, would be to print the genealogy report starting at a specified node in the tree.

A DOM parser would relieve you of the first task, building the family tree, by actually building a tree of objects for you, as shown in Figure 2. You could produce an identical report, but you'd do half as much work (or even less).
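Here is a rough sketch of what the second task could look like once the parser has built the tree. The PERSON element, its NAME attribute, and the class and method names are all hypothetical, and the sketch simplifies the problem by nesting each person inside a single parent, so it models only one line of descent rather than two parents per child.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Sketch only: PERSON and NAME are hypothetical names for a nested family tree.
class GenealogySketch {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Random access: jump straight to any person in the tree.
    static Element findPerson(Document doc, String name) {
        NodeList people = doc.getElementsByTagName("PERSON");
        for (int i = 0; i < people.getLength(); i++) {
            Element person = (Element) people.item(i);
            if (name.equals(person.getAttribute("NAME"))) {
                return person;
            }
        }
        return null;
    }

    // Look upward: the enclosing PERSON, if there is one.
    static String parentOf(Element person) {
        Node parent = person.getParentNode();
        if (parent instanceof Element && "PERSON".equals(((Element) parent).getTagName())) {
            return ((Element) parent).getAttribute("NAME");
        }
        return null;
    }

    // Look downward: the PERSON elements directly inside this one.
    static List<String> childrenOf(Element person) {
        List<String> names = new ArrayList<>();
        for (Node n = person.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && "PERSON".equals(((Element) n).getTagName())) {
                names.add(((Element) n).getAttribute("NAME"));
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<FAMILY><PERSON NAME=\"Ada\"><PERSON NAME=\"Ben\">"
                + "<PERSON NAME=\"Cy\"/><PERSON NAME=\"Di\"/></PERSON></PERSON></FAMILY>");
        Element ben = findPerson(doc, "Ben");
        System.out.println("Parent: " + parentOf(ben));
        System.out.println("Children: " + childrenOf(ben));
    }
}
```

The report can start from any person because every node already knows its parent and its children; none of that navigation had to be written by hand during the parse.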

The origins of SAX and DOM are different as well. SAX, originally an interface for writing XML parsers, was created by a group of people on the XML-DEV mailing list. DOM was created and is maintained by the members of the W3C (World Wide Web Consortium) DOM working group as a standard API for accessing XML structures. In fact, many DOM parsers use a SAX parser to create the document tree that the parser returns.

It would be incorrect to say that DOM is superior to SAX. DOM provides an information model that is richer and correspondingly more complex than the one provided by SAX. With a SAX parser, the handler object receives a stream of tokens only once. A DOM parser lets you look at any node in the tree as many times as you like, manipulate the tree, write the tree out in different formats, and pass the tree to other pieces of software that understand the DOM interfaces.
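As a small illustration of manipulating a tree and writing it back out, consider the sketch below. One caveat: DOM Level 1 itself says nothing about how to serialize a tree, so the sketch borrows the identity Transformer from JAXP, which is my assumption rather than anything the DOM specification requires. The LIST and ITEM element names and the class name are invented.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Sketch only: LIST and ITEM are hypothetical element names.
class RearrangeSketch {
    // Expects a root element with at least two children and no whitespace between them.
    static String rearrange(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();

        // Calling insertBefore() on a node that is already in the tree moves it.
        root.insertBefore(root.getLastChild(), root.getFirstChild());

        // New nodes are created by the Document and then attached to the tree.
        Element added = doc.createElement("ITEM");
        added.appendChild(doc.createTextNode("c"));
        root.appendChild(added);

        // Write the modified tree back out as text (JAXP, not part of DOM Level 1).
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rearrange("<LIST><ITEM>a</ITEM><ITEM>b</ITEM></LIST>"));
    }
}
```

With SAX, reordering the two items would have meant buffering the first until the second had gone by. Here it is a single method call on the tree.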

So far, I've told you that a DOM document is made up of Node objects, but I haven't told you precisely what a Node object is. Of exactly what kinds of objects is this document tree composed? The answer, it turns out, is that any object can appear in the tree of DOM nodes, as long as that object implements one of the DOM interfaces. I'll look at the types of DOM interfaces in the next section.

Anatomy of a document

Figure 3 below illustrates the inheritance graph of the DOM Level 1 interfaces. (DOM Level 1 is the first, simplest version of the DOM specification from the W3C. DOM Levels 2 and 3 are currently under development. See Resources for a link to the official documentation.) As you can see, just about everything in a document tree is a Node. Most DOM interfaces are descended from Node.

Figure 3. The inheritance graph of the DOM Level 1 interfaces

DOM defines a document as a tree of objects that implement the interfaces in the DOM package. All of the objects in the tree implement Node, because every DOM interface that can appear in the tree is a subinterface of Node. Element, for example, inherits the methods of Node, as well as additional methods necessary to represent a single tag in a structured document (which is its role).

Note that the DOM package does not consist of classes; rather, it contains only interfaces (with one exception). This is because DOM is a specification of interfaces between pieces of software, not a particular implementation of DOM document Nodes. This is powerful partly because the interface specification defines what the program does, and different vendors can provide various implementations for the interfaces. In fact, most DOM parsers include implementation classes that implement all of the interfaces in the package. DOM parsers generally return trees of these implementation classes, but all the application programmer knows about these returned objects is that they implement the appropriate interface.
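You can see the split between interface and implementation for yourself with a few lines of reflection. This sketch assumes the org.w3c.dom package and the JAXP factory bundled with current JDKs; the class name is made up, and the implementation class names it prints depend entirely on which parser your runtime supplies.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Sketch only: shows that application code sees interfaces, not vendor classes.
class InterfaceOnlySketch {
    static Document emptyDocument() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    }

    public static void main(String[] args) throws Exception {
        Document doc = emptyDocument();
        Element word = doc.createElement("WORD"); // the Document acts as the node factory

        System.out.println(Document.class.isInterface());     // prints true
        System.out.println(DOMException.class.isInterface()); // prints false: the one exception

        // The concrete classes behind the interfaces are vendor-specific,
        // so the names printed here vary from parser to parser.
        System.out.println(doc.getClass().getName());
        System.out.println(word.getClass().getName());
    }
}
```

Because the code never names the vendor's classes, the same program should keep working if a different DOM implementation is dropped in.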

The Node interface represents the general node in a DOM tree.

For any particular node, the interface has methods for accessing the node's child nodes, its parent node, and the Document node at the top of the tree in which the node lives -- essentially, all of the methods needed to access and manipulate the tree of nodes. Elements, Comments, Text, and so on are all types of Nodes.
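Because every object in the tree is a Node, a few Node methods are enough to visit an entire document. The following sketch walks a tree recursively and prints an indented outline; the class name and sample markup are hypothetical, and the parser is once again the JAXP one bundled with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Sketch only: a generic walk that relies on nothing but the Node interface.
class TreeWalkSketch {
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Appends one line per node, indented two spaces per level of depth.
    static void walk(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.getNodeName()).append('\n');
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, depth + 1, out);
        }
    }

    static String outline(String xml) throws Exception {
        StringBuilder out = new StringBuilder();
        walk(parse(xml), 0, out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(outline("<A><B>hi</B><!--note--></A>"));

        // Any node can also find its parent and the Document it lives in.
        Document doc = parse("<A><B>hi</B></A>");
        Node b = doc.getDocumentElement().getFirstChild();
        System.out.println(b.getParentNode().getNodeName());   // prints "A"
        System.out.println(b.getOwnerDocument() == doc);       // prints true
    }
}
```

The walk never asks what kind of node it is looking at. Text and comment nodes show up under the generic names #text and #comment, and the Document itself as #document.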

Here are the subinterfaces of Node that form the document tree:

  • Element: The Element interface represents a single tag in an XML document. (There are interfaces for such objects in the DOM for HTML as well, but I'll limit this discussion to XML.) This interface inherits all of Node's methods; it also adds methods for manipulating the Element's attributes and for accessing all sub-Elements with a particular tag name.
  • CharacterData: The CharacterData interface represents (what else?) character data. Its subinterfaces are Text, CDATASection, and Comment (see below for descriptions). The CharacterData interface provides methods for adding, deleting, inserting, and otherwise manipulating the text data in the node.
  • Text: This subinterface of CharacterData is a representation of character data content within an element or attribute. The text inside a Text node contains no markup. Any entity, comment, or other text that contains markup will appear in separate nodes.
  • CDATASection: CDATASection is a subinterface of Text that can contain markup. The markup within a CDATASection is not interpreted by the XML parser. This makes it easier to create text in the document that contains many characters that might be misinterpreted as markup. A CDATASection in an XML document begins with the markup <![CDATA[ and ends with ]]>. So, for example, the following CDATASection:

    <![CDATA[Markup &amp; Mayhem]]>

    represents the text:

    Markup &amp; Mayhem

    It would represent:

    Markup & Mayhem

    outside the context of the CDATASection.

  • Attr: Attr nodes contain those variable = value pairs that you see within element tags. In the tag:

    <a href="">

    the attribute is href, and the attribute value is the URL. Attributes are a little strange. Although they inherit the methods of their superinterface Node, they do not exist outside the context of a particular tag; they do not have their own identity. Therefore, the Node methods having to do with a node's position in the DOM tree, such as getParentNode(), getPreviousSibling(), and getNextSibling(), simply return null when called. Attrs implement the Node interface, but aren't truly document nodes.

  • Comment: This node simply contains a comment. One difference between SAX and DOM is that DOM preserves comments, while SAX does not (at least not in version 1). If you are writing a system that requires that comments be preserved, or that uses the contents of comments in some way, you'll have to choose either DOM or SAX 2 (the new version of SAX).
  • Entity: Entity represents an entity in the XML document. The &amp; example above is an example of an entity. In this case, it is very similar to a #define in a C program. Since any entity may be defined outside of the XML file being parsed, entities are read-only.
  • ProcessingInstruction: Processing instructions usually appear near the top of the XML file, although they may occur anywhere in the document. The most familiar construct with this syntax is the XML declaration, <?xml version="1.0"?>, though strictly speaking the XML specification does not classify the declaration as a processing instruction. You can use processing instructions to control external programs, the XML parser, or other processing steps. Processing instructions are roughly similar to #pragma in C.
  • DocumentType: The document type node is a placeholder for the DTD in the document tree. The document type node contains little more than a list of entities and notations defined for the document.
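A short sketch can confirm a couple of the behaviors described in the list above: the difference between a CDATASection and ordinary parsed text, and the odd status of Attr nodes. The DOC, A, and B element names, the example.com URL, and the class and method names are all hypothetical. The sketch assumes the default settings of the JAXP parser bundled with the JDK (in particular, that CDATA sections are not coalesced into plain text), and it uses getTextContent(), which was added to Node after DOM Level 1.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.CDATASection;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Sketch only: element names and the URL are made up for illustration.
class NodeKindsSketch {
    static final String SAMPLE =
          "<DOC><A><![CDATA[Markup &amp; Mayhem]]></A>"
        + "<B>Markup &amp; Mayhem</B>"
        + "<a href=\"http://example.com/\"/></DOC>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // The only child of <A> is a CDATASection; its contents are left uninterpreted.
    static String cdataText(Document doc) {
        Node inA = doc.getDocumentElement().getFirstChild().getFirstChild();
        return (inA instanceof CDATASection) ? ((CDATASection) inA).getData() : null;
    }

    // Outside a CDATA section, the parser resolves the entity reference.
    static String parsedText(Document doc) {
        return doc.getElementsByTagName("B").item(0).getTextContent();
    }

    static Attr hrefOf(Document doc) {
        Element link = (Element) doc.getDocumentElement().getLastChild();
        return link.getAttributeNode("href");
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(cdataText(doc));   // prints "Markup &amp; Mayhem"
        System.out.println(parsedText(doc));  // prints "Markup & Mayhem"

        Attr href = hrefOf(doc);
        System.out.println(href.getName() + " = " + href.getValue());
        // An Attr is a Node, but it has no position in the tree of its own.
        System.out.println(href.getParentNode());      // prints null
        System.out.println(href.getPreviousSibling()); // prints null
        System.out.println(href.getNextSibling());     // prints null
    }
}
```

To get from an Attr back to the tag that carries it, later DOM levels added getOwnerElement(); in DOM Level 1 you have to keep track of the owning Element yourself.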

There are other interfaces in the DOM package, which you can explore by reading the documentation (see Resources). The interfaces described here are sufficient for the purposes of this discussion, however.

Now that you understand what the various DOM node interfaces do, I'll provide an example of an XML document that has been parsed into a tree of object instances that implement these interfaces. Imagine you had the following simple XML document:

Listing 1. A simple XML document describing a word in a vocabulary

001 <?xml version="1.0" encoding="UTF-8"?>
002 <!-- This document contains a small vocabulary -->
003 <!DOCTYPE VOCABULARY SYSTEM "vocabulary.dtd">
004 <VOCABULARY>
005   <WORD GENDER="m">
006     <ENGLISH>suspicion</ENGLISH>
007     <FRENCH>soup&ccedil;on</FRENCH>
008   </WORD>
009 </VOCABULARY>

If you were to give this document to a DOM parser, the Document you received as a result would have a structure something like the one shown in Figure 4:

Figure 4. The DOM tree corresponding to Listing 1

As you can see in Figure 4, the DOM tree directly reflects the structure of the document shown in Listing 1. At the top of the tree is a Document node, which has two children. The first child of the top Document node corresponds to the DOCTYPE declaration on line 3. The second child of the Document node contains the WORD node. The WORD node, in turn, contains the FRENCH and ENGLISH versions of the vocabulary word. You'll notice that the FRENCH version of the word contains an odd symbol, &ccedil;, which signifies the French ç. This construction is known as a character entity, and is defined in the DTD file declared in line 3. (See Resources for a link to the DTD file.)
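If you'd like to poke at such a tree yourself, the following sketch parses a variant of the vocabulary document and reads back a few of its nodes. Two things in it are my assumptions rather than part of the original example. First, so that the sketch needs no external file, the reference to vocabulary.dtd is replaced by an internal DTD subset that declares only the ccedil entity. Second, the WORD element is wrapped in the VOCABULARY root element that the DOCTYPE names. The class and method names are hypothetical, and getTextContent() is a later addition to the DOM, not part of Level 1.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Sketch only: an internal DTD subset stands in for the external vocabulary.dtd.
class VocabularySketch {
    static final String XML =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!-- This document contains a small vocabulary -->\n"
        + "<!DOCTYPE VOCABULARY [ <!ENTITY ccedil \"&#231;\"> ]>\n"
        + "<VOCABULARY>\n"
        + "  <WORD GENDER=\"m\">\n"
        + "    <ENGLISH>suspicion</ENGLISH>\n"
        + "    <FRENCH>soup&ccedil;on</FRENCH>\n"
        + "  </WORD>\n"
        + "</VOCABULARY>\n";

    static Document parse() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
    }

    // Returns the text inside the first element with the given tag name.
    static String textOf(Document doc, String tagName) {
        return doc.getElementsByTagName(tagName).item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse();

        // The DocumentType node stands in for the DOCTYPE declaration.
        System.out.println(doc.getDoctype().getName());              // prints "VOCABULARY"

        Element word = (Element) doc.getElementsByTagName("WORD").item(0);
        System.out.println(word.getAttribute("GENDER"));             // prints "m"
        System.out.println(textOf(doc, "ENGLISH"));                  // prints "suspicion"

        // The parser has already replaced &ccedil; with the character it stands for.
        System.out.println(textOf(doc, "FRENCH").equals("soup\u00E7on")); // prints true
    }
}
```

The sketch finds elements by tag name rather than by position because, without element declarations in the DTD, the parser keeps the whitespace between tags as extra Text nodes.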
