Programming XML in Java, Part 2

Experience the joy of SAX, LAX, and DTDs

If you read last month's article, you already understand how you can use SAX (the Simple API for XML) to process XML documents. (If you haven't read it yet, you may want to start there; see "Read the Whole Series!" below.) In that article, I explained how application writers implement the SAX DocumentHandler interface, which takes a specific action when a particular condition (such as the start of a tag) occurs during the parsing of an XML document. But what good is that function? Read on.
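As a quick reminder of what such a handler looks like, here is a minimal sketch. It makes a few assumptions that are not from the original article: it uses the JAXP classes bundled with the JDK and extends DefaultHandler (the SAX2 successor to the DocumentHandler interface discussed here), and the class name StartTagCounter and the sample document are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts how many times each start tag occurs.
// Assumption: DefaultHandler (SAX2) stands in for the SAX1 DocumentHandler.
class StartTagCounter extends DefaultHandler {
    private final Map<String, Integer> counts = new TreeMap<>();

    // The parser calls this method once for every start tag it encounters.
    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        counts.merge(qName, 1, Integer::sum);
    }

    // Parses the given XML text and returns the tag counts, sorted by tag name.
    static Map<String, Integer> count(String xml) {
        StartTagCounter handler = new StartTagCounter();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse document", e);
        }
        return handler.counts;
    }

    public static void main(String[] args) {
        // Sample input is made up for illustration.
        System.out.println(count("<Recipe><Item/><Item/></Recipe>"));
    }
}
```

The handler never asks the parser for anything; it simply reacts to the events the parser fires, which is the pattern the rest of this article builds on.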

Programming XML in Java: Read the whole series!

You'll also remember that an XML parser checks that the document is well formed (meaning, roughly, that all of the open and close tags match and don't overlap in nonsensical ways). But even well-formed documents can contain meaningless data or have a senseless structure. How can such conditions be detected and reported?

This article answers both questions through an illustrative example. I'll start with the latter question: once the document is parsed, how do you ensure that the XML your program is processing actually makes sense? Then I'll demonstrate an extension to XML that I call LAX (the Lazy API for XML), which makes writing handlers for SAX events even easier. Finally, I'll tie all of the themes together and demonstrate the technology's usefulness with a small example that produces both formatted recipes and shopping lists from the same XML document.

Garbage in, garbage out

One thing you may have heard about XML is that it lets the system developer define custom tags. With a nonvalidating parser (discussed in Part 1 of this series), you certainly have that ability. You can make up any tag you want and, as long as you balance your open and close tags and don't overlap them in absurd ways, the nonvalidating SAX parser will parse the document without any problems. For example, a nonvalidating SAX parser would correctly parse and fire events for the document in Listing 1.

Listing 1. A well-formed, meaningless document



A nonvalidating SAX parser would produce a valid event stream for the document in Listing 1 because the input document is well formed. It's really stupid input, but it is well formed. Every opening tag has a corresponding close tag, and the tags don't overlap (meaning there are no combinations of tags like <A><B></A></B>). So a nonvalidating SAX parser will have no problem with Listing 1.
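To make the distinction concrete, the following sketch feeds two tiny documents to a nonvalidating parser. It assumes the JAXP parser that ships with the JDK; the class name WellFormedCheck and the sample strings are hypothetical and are not taken from Listing 1.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: asks a nonvalidating parser whether a document is well formed.
class WellFormedCheck {
    static boolean isWellFormed(String xml) {
        try {
            // A factory is nonvalidating unless setValidating(true) is called.
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler()); // ignores all events
            return true;
        } catch (SAXException e) {
            // The parser reports ill-formed input as a fatal error.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        // Properly nested tags: accepted, however silly the content.
        System.out.println(isWellFormed("<A><B></B></A>"));
        // Overlapping tags: rejected.
        System.out.println(isWellFormed("<A><B></A></B>"));
    }
}
```

Note that the check says nothing about whether the tags mean anything; it only confirms that they nest properly.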

Unfortunately, if you write a program that, for example, summarizes museum collections, formats architectural information, or prints multilingual card catalogs for libraries, your program could read this really stupid XML and produce really stupid output, because it might pull out tags it recognizes (like <Dada>, <Cathedral>, or <Book>). As the saying goes, "Garbage in, garbage out."

To minimize the chance that your program produces garbage, you should devise a way to detect and reject garbage in the input. Then, given meaningful input, you can focus on creating reasonable output.

Think of a document as having three levels of correctness: lexical, syntactic, and semantic. Lexical correctness is what I mean when I say "well formed": the basic structure of the document is reasonable and correct, but nothing about the content of the tags is checked. Any tag can occur inside any other tag any number of times, any tag can take any attribute, and attributes can take on any value. So, Listing 1 is well formed, but it makes no sense, because there is no control over what tags and attributes appear in the structure, and where.

Syntactic correctness means that the document is not only well formed, but that it also contains certain tags, in certain combinations. An XML document can include a section, called a document type definition (DTD), that specifies the rules for syntactic correctness.

A DTD lets a system designer create a custom markup language, a dialect of XML. A DTD indicates which tags may (or must) occur inside other specified tags, what attributes a tag may have, the required order of the tags, and so on. A validating parser uses a DTD to check the document it is parsing for syntactic correctness. The parser prints error and warning messages for any problems it finds, and then rejects any document that doesn't conform to the DTD. The application programmer can then write code assuming that the structure of the document is correct, because the parser already checked it.

So, for example, for the document in Listing 1, a designer might write a DTD that defines a <Book> tag as containing only one or more <Title> tags. The parser would report the presence of the <Filter> tag in line 12 as an error, because the DTD doesn't allow it.
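The sketch below shows roughly how that check might look in code. It is an illustration under several assumptions: it uses the JDK's JAXP parser with validation switched on, the DTD is a made-up internal subset modeled on the <Book> rule just described, and the class name DtdCheck and the sample content are hypothetical. With this API, validity problems arrive through the handler's error() callback, so the application decides whether to reject the document.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: validates a document against its DTD and collects the problems.
class DtdCheck {
    // Made-up DTD: a Book may contain only one or more Title tags.
    static final String BOOK_DTD =
        "<!DOCTYPE Book [\n"
        + "  <!ELEMENT Book (Title+)>\n"
        + "  <!ELEMENT Title (#PCDATA)>\n"
        + "  <!ELEMENT Filter (#PCDATA)>\n"
        + "]>\n";

    static List<String> validate(String xml) {
        List<String> problems = new ArrayList<>();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true); // turn on DTD validation
        try {
            factory.newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    // Validity errors are reported here; parsing continues.
                    @Override
                    public void error(SAXParseException e) {
                        problems.add("line " + e.getLineNumber() + ": " + e.getMessage());
                    }
                });
        } catch (SAXException e) {
            // Ill-formed input ends the parse with a fatal error.
            problems.add("fatal: " + e.getMessage());
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Parser setup or I/O failed", e);
        }
        return problems;
    }

    public static void main(String[] args) {
        String bad = BOOK_DTD
            + "<Book><Title>Ulysses</Title><Filter>coffee</Filter></Book>";
        // An empty list means the document conforms to its DTD.
        validate(bad).forEach(System.out::println);
    }
}
```

A caller would typically treat a non-empty list as grounds for refusing the document before any application logic runs.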

A DTD is also an excellent way to specify the input to your program. An XML input document either corresponds to a particular DTD or it doesn't. Your program can correctly process any input that conforms to a given DTD. A DTD also lets you test your application for correctness or completeness; if an input document conforms to the DTD, but your program doesn't process it properly, then you have a bug or a missing feature.
