Programming XML in Java, Part 2

Experience the joy of SAX, LAX, and DTDs

If you read last month's article, you already understand how you can use SAX (the Simple API for XML) to process XML documents. (If you haven't read it yet, you may want to start there; see "Read the Whole Series!" below.) In that article, I explained how application writers implement the SAX DocumentHandler interface, which takes a specific action when a particular condition (such as the start of a tag) occurs during the parsing of an XML document. But what good is that function? Read on.

Programming XML in Java: Read the whole series!


You'll also remember that an XML parser checks that the document is well formed (meaning, roughly, that all of the open and close tags match and don't overlap in nonsensical ways). But even well-formed documents can contain meaningless data or have a senseless structure. How can such conditions be detected and reported?

This article answers both questions through an illustrative example. I'll start with the latter question: once the document is parsed, how do you ensure that the XML your program is processing actually makes sense? Then I'll demonstrate an extension to SAX that I call LAX (the Lazy API for XML), which makes writing handlers for SAX events even easier. Finally, I'll tie all of the themes together and demonstrate the technology's usefulness with a small example that produces both formatted recipes and shopping lists from the same XML document.

Garbage in, garbage out

One thing you may have heard about XML is that it lets the system developer define custom tags. With a nonvalidating parser (discussed in Part 1 of this series), you certainly have that ability. You can make up any tag you want and, as long as you balance your open and close tags and don't overlap them in absurd ways, the nonvalidating SAX parser will parse the document without any problems. For example, a nonvalidating SAX parser would correctly parse and fire events for the document in Listing 1.

Listing 1. A well-formed, meaningless document

001 <?xml version="1.0"?>
002 <Art CENTURY="20">
003   <Dada>
004     <Author CENTURY="18" NOMDEPLUME="Voltaire">
005       François-Marie Arouet
006     </Author>
007     <Tree SPECIES="Maple">
008       <Yes/>
009       <Book AUTHOR="Musashi, Miyamoto">
010         <Title LANG="English">The Book of Five Rings</Title>
011         <Title LANG="Nihongo">Go Rin No Sho</Title>
012         <Filter POLY="Chebyshev" POLES="2"/>
013         <Title LANG="Espanol">El Libro de Cinco Anillos</Title>
014         <Title LANG="Francais">Le Livre de Cinq Bagues</Title>
015       </Book>
016       <Bahrain FORMAT="MP3">
017         <Cathedral CITTA="Firenze">
018           <Nome>Santa Maria del Fiore</Nome>
019           <Architetto>Brunelleschi, Filippo (1377-1466)</Architetto>
020           <Ora FORMAT="DMY24">22032000134591</Ora>
021         </Cathedral>
022       </Bahrain>
023       <Phobias>
024         <Herbs NAME="Ma Huang"/>
025         <Appliance COLOR="Harvest Gold">Yuck</Appliance>
026       </Phobias>
027     </Tree>
028   </Dada>
029 </Art>

A nonvalidating SAX parser would produce a valid event stream for the document in Listing 1 because the input document is well formed. It's really stupid input, but it is well formed. Every opening tag has a corresponding close tag, and the tags don't overlap (meaning there are no combinations of tags like <A><B></A></B>). So a nonvalidating SAX parser will have no problem with Listing 1.
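The distinction can be tried out in a few lines of Java. What follows is a minimal sketch, not code from this series: it assumes the JAXP SAXParserFactory and the SAX 2 DefaultHandler that ship with current JDKs (rather than the SAX 1 DocumentHandler discussed in Part 1), and the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
public class WellFormedCheck {

    /** Returns true if a nonvalidating parser reads the whole document without a fatal error. */
    public static boolean isWellFormed(String xml) {
        try {
            // A factory is nonvalidating unless setValidating(true) is called.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false; // DefaultHandler rethrows fatal (well-formedness) errors
        } catch (Exception e) {
            throw new IllegalStateException(e); // parser setup or I/O problem
        }
    }

    public static void main(String[] args) {
        // Meaningless but properly nested tags: accepted.
        System.out.println(isWellFormed("<Art><Dada><Yes/></Dada></Art>")); // true
        // Overlapping tags: rejected.
        System.out.println(isWellFormed("<A><B></A></B>"));                 // false
    }
}
```

The parser in this sketch never looks at what the tags mean; it only checks that they nest properly.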

Unfortunately, if you write a program that, for example, summarizes museum collections, formats architectural information, or prints multilingual card catalogs for libraries, your program could read this really stupid XML and produce really stupid output, because it might pull out tags it recognizes (like <Dada>, <Cathedral>, or <Book>). As the saying goes, "Garbage in, garbage out."

To minimize the chance that your program produces garbage, you should devise a way to detect and reject garbage in the input. Then, given meaningful input, you can focus on creating reasonable output.

Think of a document as having three levels of correctness: lexical, syntactic, and semantic. Lexical correctness is what I mean when I say "well formed": the basic structure of the document is reasonable and correct, but nothing about the content of the tags is checked. Any tag can occur inside any other tag any number of times, any tag can take any attribute, and attributes can take on any value. So, Listing 1 is well formed, but it makes no sense, because there is no control over what tags and attributes appear in the structure, and where.

Syntactic correctness means that the document is not only well formed, but that it also contains certain tags, in certain combinations. An XML document can include a section, called a document type definition (DTD), that specifies the rules for syntactic correctness.

A DTD lets a system designer create a custom markup language, a dialect of XML. A DTD indicates which tags may (or must) occur inside other specified tags, what attributes a tag may have, the required order of the tags, and so on. A validating parser uses a DTD to check the document it is parsing for syntactic correctness. The parser prints error and warning messages for any problems it finds, and then rejects any document that doesn't conform to the DTD. The application programmer can then write code assuming that the structure of the document is correct, because the parser already checked it.

So, for example, in Listing 1 a designer might write a DTD that defines a <Book> tag as containing only one or more <Title> tags. The parser would report the presence of the <Filter> tag in line 12 as an error, because the DTD doesn't allow it.
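Here is a rough sketch of what that check could look like in Java. The DTD below is hypothetical (no DTD exists for Listing 1), the class and method names are invented for illustration, and the code assumes the JAXP SAXParserFactory and SAX 2 DefaultHandler rather than the SAX 1 interfaces covered in Part 1. The DTD is placed in the document's internal subset so the example is self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, for illustration only.
public class BookValidation {

    // Assumed DTD: a Book contains one or more Title tags and nothing else.
    static final String DTD =
        "<!DOCTYPE Book [\n"
      + "<!ELEMENT Book (Title)+>\n"
      + "<!ATTLIST Book AUTHOR CDATA #IMPLIED>\n"
      + "<!ELEMENT Title (#PCDATA)>\n"
      + "<!ATTLIST Title LANG CDATA #IMPLIED>\n"
      + "]>\n";

    /** Validates the given Book element against the DTD and returns the errors reported. */
    public static List<String> validate(String book) {
        List<String> errors = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true); // ask for a validating parser
            factory.newSAXParser().parse(
                new InputSource(new StringReader(DTD + book)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        // Validity errors are recoverable, so collect them and keep parsing.
                        errors.add("line " + e.getLineNumber() + ": " + e.getMessage());
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e); // setup, I/O, or well-formedness problem
        }
        return errors;
    }

    public static void main(String[] args) {
        String bad = "<Book AUTHOR=\"Musashi, Miyamoto\">"
                   + "<Title LANG=\"English\">The Book of Five Rings</Title>"
                   + "<Filter POLY=\"Chebyshev\" POLES=\"2\"/>"
                   + "</Book>";
        // Prints each validity error the parser reports for the stray Filter tag.
        validate(bad).forEach(System.out::println);
    }
}
```

The exact wording of the error messages depends on the parser implementation, so the sketch prints them rather than matching on their text.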

A DTD is also an excellent way to specify the input to your program. An XML input document either corresponds to a particular DTD or it doesn't. Your program can correctly process any input that conforms to a given DTD. A DTD also lets you test your application for correctness or completeness; if an input document conforms to the DTD, but your program doesn't process it properly, then you have a bug or a missing feature.

XML parsers don't provide much in the way of checking for semantic correctness. Semantic correctness means that the actual instance data is true for the purposes of the application. A validating parser could report an error when it finds a FORMAT attribute on a <Bahrain> tag (as occurs in line 16, Listing 1). But it's a lot to ask any parser to check whether the Cathedral of Santa Maria del Fiore is in Bahrain or in Italy. Semantic correctness remains the domain of your application: it's up to you to add meaning to the XML document you've defined. A validating XML parser and a DTD help to automate the detection of gross lexical and syntactic errors in the input to your program, allowing you to focus on the data's meaning.

As a side note, the HTML used to create Web pages is specified in an SGML DTD, which is considerably more complex and powerful than an XML DTD. XML DTDs are essentially a subset of these SGML DTDs, with some minor notational differences. The HTML DTD clearly specifies what kind of input an HTML-processing program can accept. XHTML, an XML-compatible version of HTML, specifies an XML DTD for HTML. It has just been released by the World Wide Web Consortium (W3C).

In the next section, I'll create a DTD for a small XML dialect for describing recipes.

Parlez-vous DTD?

Two people generally can't talk to one another unless they speak a mutually understood language. Likewise, two programs can't communicate via XML unless the programs agree on the XML language they use. A DTD defines a set of rules for the allowable tags and attributes in an XML document, and the order and cardinality of the tags. Programs using the DTD must still agree on what the tags mean (semantics again), but a DTD defines the words (or, the tags) and the grammatical rules for a particular XML dialect.

Listing 2 shows a simple DTD for a tiny XML language I call Recipe XML.

Listing 2. The DTD for Recipe XML

001 <!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>
003 <!ELEMENT Name (#PCDATA)>
005 <!ELEMENT Description (#PCDATA)>
007 <!ELEMENT Ingredients (Ingredient)*>
009 <!ELEMENT Ingredient (Qty, Item)>
010 <!ATTLIST Ingredient
011   vegetarian CDATA "true">
013 <!ELEMENT Qty (#PCDATA)>
014 <!ATTLIST Qty
015   unit CDATA #IMPLIED>
017 <!ELEMENT Item (#PCDATA)>
018 <!ATTLIST Item
019   optional CDATA "0">
021 <!ELEMENT Instructions (Step)+>
022 <!ELEMENT Step (#PCDATA)>

The DTD in Listing 2 defines a complete, tiny language for transmitting recipes. Programs that use this DTD can count on the structure of conforming files to match the rules in the DTD.

I'll go over this file, line by line:

001 <!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>

This line defines a tag using <!ELEMENT. The entire line from the opening <!ELEMENT to the closing > is called an element type declaration. The declaration says that a Recipe is composed of a Name, followed by the optional occurrence of a Description, Ingredients, and Instructions. The comma operator (,) indicates the valid tags the defined tag may contain, and the order in which those tags must appear. The question mark operator (?) indicates that the item to its left is optional. Since Name has only a comma operator after it, a Recipe must have precisely one Name. The parentheses are for grouping, and don't appear in the input document.

Therefore, the sequence:


is a valid Recipe, because it matches the DTD (that is, it consists of a <Name> followed optionally by a <Description>). However:

   <Description>Italian dessert</Description>

is not a valid Recipe, because the Description comes before the Name.
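The ordering rule can be checked mechanically with a validating parser. The following is a minimal sketch under a few assumptions: it embeds the declarations from Listing 2 in an internal subset, uses the JAXP SAXParserFactory and SAX 2 DefaultHandler (not the SAX 1 DocumentHandler from Part 1), and invents the class name, method name, and the recipe name "Tiramisu" purely for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, for illustration only.
public class RecipeOrderCheck {

    // The declarations from Listing 2, wrapped in an internal DTD subset.
    static final String DTD =
        "<!DOCTYPE Recipe [\n"
      + "<!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>\n"
      + "<!ELEMENT Name (#PCDATA)>\n"
      + "<!ELEMENT Description (#PCDATA)>\n"
      + "<!ELEMENT Ingredients (Ingredient)*>\n"
      + "<!ELEMENT Ingredient (Qty, Item)>\n"
      + "<!ATTLIST Ingredient vegetarian CDATA \"true\">\n"
      + "<!ELEMENT Qty (#PCDATA)>\n"
      + "<!ATTLIST Qty unit CDATA #IMPLIED>\n"
      + "<!ELEMENT Item (#PCDATA)>\n"
      + "<!ATTLIST Item optional CDATA \"0\">\n"
      + "<!ELEMENT Instructions (Step)+>\n"
      + "<!ELEMENT Step (#PCDATA)>\n"
      + "]>\n";

    /** Returns true if the Recipe element conforms to the DTD above. */
    public static boolean isValid(String recipe) {
        boolean[] valid = {true};
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(
                new InputSource(new StringReader(DTD + recipe)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        valid[0] = false; // any validity error means nonconforming
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return valid[0];
    }

    public static void main(String[] args) {
        // Name first, then the optional Description: conforms.
        System.out.println(isValid("<Recipe><Name>Tiramisu</Name>"
            + "<Description>Italian dessert</Description></Recipe>")); // true
        // Description before Name: violates the required order.
        System.out.println(isValid("<Recipe><Description>Italian dessert</Description>"
            + "<Name>Tiramisu</Name></Recipe>"));                      // false
    }
}
```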

003 <!ELEMENT Name (#PCDATA)>

This line states that a Name tag (or element) contains no other tag types, and may contain text between its open and close tags; #PCDATA stands for parsed character data. A validating parser will mark any tag within a Name tag as an error.

007 <!ELEMENT Ingredients (Ingredient)*>

This line states that an Ingredients tag may contain zero or more Ingredient tags. The asterisk or star operator (*) indicates the tag's zero-or-more cardinality.

010 <!ATTLIST Ingredient
011   vegetarian CDATA "true">

An attribute list declaration, which uses <!ATTLIST, defines the attributes for a tag. Only attributes within the attribute list declaration for a tag are allowed. This line says that the Ingredient tag previously defined has a single attribute, vegetarian, which is character data (CDATA), and whose default value is "true". Attribute list declarations all follow this pattern; one may define multiple attributes, each with a type and default value, following the tag name.
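One practical effect of a default value is that the parser supplies it whenever the document omits the attribute. The sketch below tries to show this; it assumes the JAXP SAXParserFactory and the SAX 2 DefaultHandler (not the SAX 1 DocumentHandler from Part 1), and the class name, method name, and ingredient data are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, for illustration only.
public class DefaultAttributeDemo {

    // Just the declarations needed for one Ingredient, taken from Listing 2.
    static final String DTD =
        "<!DOCTYPE Ingredient [\n"
      + "<!ELEMENT Ingredient (Qty, Item)>\n"
      + "<!ATTLIST Ingredient vegetarian CDATA \"true\">\n"
      + "<!ELEMENT Qty (#PCDATA)>\n"
      + "<!ELEMENT Item (#PCDATA)>\n"
      + "]>\n";

    /** Returns the vegetarian attribute as the handler sees it on the Ingredient tag. */
    public static String vegetarianValue(String ingredient) {
        String[] seen = new String[1];
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(DTD + ingredient)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if (qName.equals("Ingredient")) {
                            seen[0] = atts.getValue("vegetarian");
                        }
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        // Attribute omitted: the parser fills in the default from the DTD.
        System.out.println(vegetarianValue(
            "<Ingredient><Qty>2</Qty><Item>eggs</Item></Ingredient>"));   // true
        // Attribute given explicitly: the document's value wins.
        System.out.println(vegetarianValue(
            "<Ingredient vegetarian=\"false\"><Qty>1</Qty><Item>bacon</Item></Ingredient>")); // false
    }
}
```

The handler code never has to check whether the attribute was present, because the default arrives in the Attributes object like any other value.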

014 <!ATTLIST Qty
015   unit CDATA #IMPLIED>

This attribute list declaration gives #IMPLIED in place of a default value for the unit attribute. That means that the attribute may or may not appear with the tag, and the DTD supplies no default; if it doesn't appear, the application must decide what value to use. This is how you create an optional attribute.
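In handler code, an omitted #IMPLIED attribute simply isn't there, so the application has to choose a fallback. Here is a minimal sketch of that pattern, assuming the JAXP SAXParserFactory and SAX 2 DefaultHandler (not the SAX 1 DocumentHandler from Part 1). The class name, the method name, and the fallback unit "each" are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, for illustration only.
public class ImpliedAttributeDemo {

    // The Qty declarations from Listing 2.
    static final String DTD =
        "<!DOCTYPE Qty [\n"
      + "<!ELEMENT Qty (#PCDATA)>\n"
      + "<!ATTLIST Qty unit CDATA #IMPLIED>\n"
      + "]>\n";

    /** Returns the unit on a Qty tag, or an application-chosen fallback if it is absent. */
    public static String unitOf(String qty) {
        String[] unit = new String[1];
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(DTD + qty)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        if (qName.equals("Qty")) {
                            unit[0] = atts.getValue("unit"); // null when omitted
                        }
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        // "each" is an assumed fallback; the DTD itself supplies nothing.
        return unit[0] != null ? unit[0] : "each";
    }

    public static void main(String[] args) {
        System.out.println(unitOf("<Qty unit=\"cup\">1</Qty>")); // cup
        System.out.println(unitOf("<Qty>3</Qty>"));              // each
    }
}
```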

021 <!ELEMENT Instructions (Step)+>

This line states that an Instructions tag, if present, must contain at least one Step. The plus-sign operator (+) indicates one or more occurrences of the item to its left.
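The difference between the star and plus operators shows up clearly with empty tags. The sketch below is one way to see it; it assumes the JAXP SAXParserFactory and SAX 2 DefaultHandler (not the SAX 1 DocumentHandler from Part 1), and the class name, method name, and sample step text are invented for illustration. The root tag name is passed in so that Ingredients and Instructions can each be validated on their own.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, for illustration only.
public class CardinalityCheck {

    // The Ingredients and Instructions declarations from Listing 2.
    static final String DECLARATIONS =
        "<!ELEMENT Ingredients (Ingredient)*>\n"
      + "<!ELEMENT Ingredient (Qty, Item)>\n"
      + "<!ATTLIST Ingredient vegetarian CDATA \"true\">\n"
      + "<!ELEMENT Qty (#PCDATA)>\n"
      + "<!ATTLIST Qty unit CDATA #IMPLIED>\n"
      + "<!ELEMENT Item (#PCDATA)>\n"
      + "<!ATTLIST Item optional CDATA \"0\">\n"
      + "<!ELEMENT Instructions (Step)+>\n"
      + "<!ELEMENT Step (#PCDATA)>\n";

    /** Validates a fragment whose top-level tag is the given root name. */
    public static boolean isValid(String root, String fragment) {
        String doc = "<!DOCTYPE " + root + " [\n" + DECLARATIONS + "]>\n" + fragment;
        boolean[] valid = {true};
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(true);
            factory.newSAXParser().parse(
                new InputSource(new StringReader(doc)),
                new DefaultHandler() {
                    @Override
                    public void error(SAXParseException e) {
                        valid[0] = false; // a validity error was reported
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return valid[0];
    }

    public static void main(String[] args) {
        // (Ingredient)* allows zero Ingredient tags.
        System.out.println(isValid("Ingredients", "<Ingredients/>"));   // true
        // (Step)+ requires at least one Step.
        System.out.println(isValid("Instructions", "<Instructions/>")); // false
        System.out.println(isValid("Instructions",
            "<Instructions><Step>Beat the eggs.</Step></Instructions>")); // true
    }
}
```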

DTDs have more operators and conventions, but this example covers the basics. (You can find out the whole scoop on DTDs in XML in the XML recommendation; see Resources.)

DTDs are meta-information; that is, they are information about information. You may already be familiar with this concept. A table in a relational database has a schema describing such things as the column names, data types, sizes, and default values for its data. But the table description doesn't contain data values; it contains a description of the values. Likewise, a DTD is a simple sort of schema that defines what may be in a particular document type. (There is currently an effort underway to create an XML schema that is much more like a database schema; see Resources.)

DTDs are also a bit like BNF, or Backus-Naur Form (see Resources for a discussion), which describes transformation rules for grammars; however, BNF can express structures that XML DTDs cannot.

An XML document declares its DTD with a <!DOCTYPE declaration, as shown in Listing 3. The document type declaration specifies the external DTD used to validate the document. The top-level tag of the document must match the name given in the <!DOCTYPE declaration (in this case, Recipe).
