XML for the absolute beginner

A guided tour from HTML to processing XML with Java


Make up a markup

While a well-formed document is well-formed because it follows rules defined by the XML spec, a valid document is valid because it matches its document type definition (DTD). The DTD is the grammar for a markup language, defined by the designer of the markup language. For my little XML recipe in Listing 3, for example, that designer would be me. The DTD specifies what elements may exist, what attributes the elements may have, what elements may or must be found inside other elements, and in what order.

Nonvalidating parsers read the XML and, if it's well-formed, give you back the document structure as a tree of objects. We'll discuss the document structure you get from a parser in the section below entitled "The Document Object Model." If the document is well-formed but the elements are nonsensical (as was the case with the two <Qty> elements in the <Ingredient> above), that's your problem.

This is, in fact, how HTML browsers work. Generally, HTML parsers are nonvalidating. The various "HTML checking" parsers, which report syntax errors in HTML, are essentially validating HTML parsers (with additional functionality, like link checking).

Validating parsers read XML, verify that it's well-formed (just as nonvalidating parsers do), and then go on to determine whether the document's element tags are legal, whether the attribute names make sense, whether every element nested inside another element belongs there, and so on.
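To make the difference concrete, here is a minimal Java sketch that feeds the same text to a nonvalidating and a validating parser. It uses the JDK's javax.xml.parsers API, which this article has not introduced; the class name, the three-line inline DTD, and the sample ingredient are all made up for illustration. (DTD syntax is explained later in this section.) The nonvalidating parse should report nothing, while the validating parse should complain about the second <Qty>.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class WellFormedVersusValid {
    // Hypothetical sample: an inline DTD followed by an <Ingredient> with two <Qty> elements.
    static final String TWO_QTY =
          "<!DOCTYPE Ingredient [\n"
        + "  <!ELEMENT Ingredient (Qty, Item)>\n"
        + "  <!ELEMENT Qty (#PCDATA)>\n"
        + "  <!ELEMENT Item (#PCDATA)>\n"
        + "]>\n"
        + "<Ingredient><Qty>1</Qty><Qty>2</Qty><Item>lime Jello</Item></Ingredient>";

    /** Parses the text and returns the validity errors the parser reported. */
    static List<String> parse(String xml, boolean validating) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(validating); // false: check well-formedness only
        DocumentBuilder builder = factory.newDocumentBuilder();
        List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { }
            // Validity problems arrive here; they do not stop the parse.
            @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
            // Well-formedness problems are fatal, so let them propagate.
            @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("nonvalidating errors: " + parse(TWO_QTY, false));
        System.out.println("validating errors:    " + parse(TWO_QTY, true));
    }
}
```

Both parsers would reject text that isn't well-formed (an unclosed tag, say) with a fatal error; only the validating one cares what the tags mean.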

The DTD defines the document type. It accounts for the Extensible in XML. The DTD is how you actually define a new markup language -- what I often call a dialect of XML. DTDs currently are being written for an enormous number of different problem domains, and each DTD defines a new markup language. New markup languages now exist, or are being designed, to mark up the plays of Shakespeare; to define general data resources (RDF); to model information in the health care industry (HL7 SGML/XML); to typeset, display, and actively use mathematical equations (MathML); and to perform electronic data interchange (XML/EDI). There's even a proposal for a markup language for business data in the footwear industry (FDX). (No, I'm not joking.)

Central to each of these new languages is a DTD that describes what tags the markup language has, what those tags' attributes may be, and how they may be combined. A DTD specifies very clearly what information may or may not be included in a markup language. For instance, the DTD for HTML does not allow for markup tags to select paper size for printing.

Let's take a look at a DTD for the recipe XML in Listing 3. I'm going to call it JWSRML (JavaWorld Scary Recipe Markup Language). Apologies to anyone already using that acronym.

<!-- This is the example DTD for the example XML -->
<!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Description (#PCDATA)>
<!ELEMENT Ingredients (Ingredient)*>
<!ELEMENT Ingredient (Qty, Item)>
<!ATTLIST Item optional CDATA "0"
                  isVegetarian CDATA "true">
<!ELEMENT Instructions (Step)+>

Listing 4. The DTD for JWSRML

The document type definition in Listing 4 defines a language for a validating parser to accept -- meaning, the parser will produce errors if the rules listed in the DTD aren't followed. To get a general idea of how a DTD works, let's look at what a few of the lines in this file mean.

  • <!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>

    The <!ELEMENT...> statement defines a tag in the document. This statement defines a <Recipe> tag, stating that it must contain a <Name> and can also contain an optional <Description> (the question mark [?] denotes optionality), an optional <Ingredients> tag, and an optional <Instructions> tag, in that order.

  • <!ELEMENT Name (#PCDATA)>

    This simply states that a <Name> tag can contain character data and nothing else.

  • <!ATTLIST Item optional CDATA "0" isVegetarian CDATA "true">

    This section states that the <Item> tag has two possible attributes: optional, whose default value is 0; and isVegetarian, whose default value is true. Notice that attribute values aren't limited to numbers; they can be any text.
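Attribute defaults are easy to see in action. The following Java sketch is only an illustration: it assumes the JDK's built-in DOM parser, inlines the <!ATTLIST> from Listing 4, and adds an <!ELEMENT Item (#PCDATA)> declaration and a sample <Item> that are my own assumptions. The sample sets optional explicitly and leaves isVegetarian out, so the parser should fill in the default for the latter.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AttributeDefaultsSketch {
    // The ATTLIST comes from Listing 4; the ELEMENT line and the <Item> are assumed.
    static final String XML =
          "<!DOCTYPE Item [\n"
        + "  <!ELEMENT Item (#PCDATA)>\n"
        + "  <!ATTLIST Item optional CDATA \"0\"\n"
        + "                 isVegetarian CDATA \"true\">\n"
        + "]>\n"
        + "<Item optional=\"1\">Tabasco sauce</Item>";

    /** Returns the value of the named attribute on the sample <Item>. */
    static String attributeOf(String name) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(XML)));
        return doc.getDocumentElement().getAttribute(name);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("optional     = " + attributeOf("optional"));     // set in the document
        System.out.println("isVegetarian = " + attributeOf("isVegetarian")); // supplied by the DTD
    }
}
```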

A DTD is associated with an XML document by way of a document type declaration, which appears at the top of the XML file (after the <?xml...?> line). The document type declaration may either contain an inline copy of the document type definition or refer to that definition by a system filename or URI (uniform resource identifier). For example,

<!DOCTYPE Recipe SYSTEM "example.dtd">

tells the parser to start looking for a <Recipe> tag as the top-level tag of the document. It also declares that the DTD is in the system file example.dtd.
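Here is a hedged Java sketch of that declaration at work. So that it runs without any files on disk, it hands the parser the text of example.dtd from a string by way of an EntityResolver; that trick, the class name, and the sample recipes in the usage below are my own. The DTD text is Listing 4 plus declarations for <Qty>, <Item>, and <Step> that I'm assuming hold plain character data, since a validating parser needs every element declared.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DoctypeSketch {
    // Listing 4, plus assumed (#PCDATA) declarations for Qty, Item, and Step.
    static final String EXAMPLE_DTD =
          "<!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>\n"
        + "<!ELEMENT Name (#PCDATA)>\n"
        + "<!ELEMENT Description (#PCDATA)>\n"
        + "<!ELEMENT Ingredients (Ingredient)*>\n"
        + "<!ELEMENT Ingredient (Qty, Item)>\n"
        + "<!ELEMENT Qty (#PCDATA)>\n"   // assumption
        + "<!ELEMENT Item (#PCDATA)>\n"  // assumption
        + "<!ATTLIST Item optional CDATA \"0\" isVegetarian CDATA \"true\">\n"
        + "<!ELEMENT Instructions (Step)+>\n"
        + "<!ELEMENT Step (#PCDATA)>\n"; // assumption

    /** Validates a <Recipe> against example.dtd and returns the errors reported. */
    static List<String> validate(String recipeElement) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n"
                   + "<!DOCTYPE Recipe SYSTEM \"example.dtd\">\n"
                   + recipeElement;
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Stand-in for the file system: serve example.dtd from the string above.
        builder.setEntityResolver((publicId, systemId) ->
            systemId != null && systemId.endsWith("example.dtd")
                ? new InputSource(new StringReader(EXAMPLE_DTD))
                : null);
        List<String> errors = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { }
            @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
            @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validate("<Recipe><Name>Scary Surprise</Name></Recipe>"));
        System.out.println(validate("<Recipe><Description>No name!</Description></Recipe>"));
    }
}
```

The first recipe satisfies the DTD, so its error list should be empty; the second leaves out the required <Name>, so the parser should report a problem.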

There are other characters and notations in the DTD, but writing DTDs is a topic unto itself. If you're interested in learning more, check out the DTD-related links in Resources.

You now know a lot about how XML is structured and controlled, but you haven't heard what it's good for. Why are people so excited about this technology?

So, what good is made-up markup?

Here are some benefits of representing information in XML:

  • XML is at least as readable as HTML and probably more so

    Anyone who understands, more or less, what HTML is will probably understand just about everything in Listing 3. They're also likely to have a good idea of what the markup means, since the markup uses fairly intuitive terms (<Ingredient> rather than something like <OBJECT CLASSID="000DDA23432...">).

  • The tags don't have anything to do with how the document is displayed

    Listing 3 is pure content: It's information. The markup indicates what the information means, not how to display it. The formatting information for an XML file (if there is any need for formatting) is usually written in a style language and stored separately from the XML. (See the sections on CSS and XSL below for more on formatting XML.) Separation of content and presentation is a key concept inherited from SGML.

  • A lot of the programming is already done for you

    If you write a DTD and use a validating parser, much of the error checking for the validity of your input is done by the parser. There's no need to write the parser yourself, since there are so many high-quality parsers available for free. If you want to change the language, you simply change the DTD; the parser then obeys your new rules. Moreover, if your system needs to interoperate with other systems, you can choose a standard DTD (like XML/EDI, for example), so that other systems will automatically understand your system's vocabulary, and vice versa.

  • XML is more versatile than HTML

    Let's think about all the ways a document like Listing 3 could be used:

    • You could display this recipe in an online recipe database, with a page style easily modifiable across all recipes

    • The recipes are automatically scalable, convenient if you're planning a dinner party for 200

    • The recipe is already in a standard recipe format for transmission to the database

    • Online recipe servers could exchange recipes using this format, or recipe applications could share data

    • Such recipes would be much easier to search accurately (for example, "all recipes with lime Jello and Tabasco sauce") than HTML would be

    • It would be easy, based on the contents of your "legacy" pantry inventory database, to produce a shopping list
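As a small taste of the search idea in the list above, here is a Java sketch that answers the "lime Jello and Tabasco sauce" question for a single recipe. It assumes the <Recipe>/<Ingredients>/<Ingredient>/<Item> layout from the DTD; the sample recipe, the class name, and the method name are hypothetical.

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RecipeSearchSketch {
    // Hypothetical recipe in the JWSRML layout.
    static final String SAMPLE =
          "<Recipe><Name>Scary Surprise</Name><Ingredients>"
        + "<Ingredient><Qty>1</Qty><Item>lime Jello</Item></Ingredient>"
        + "<Ingredient><Qty>3</Qty><Item optional=\"1\">Tabasco sauce</Item></Ingredient>"
        + "</Ingredients></Recipe>";

    /** Returns true if the recipe lists every one of the wanted items. */
    static boolean hasAllItems(String recipeXml, String... wanted) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(recipeXml)));
        // Collect the text of every <Item>, ignoring case and stray whitespace.
        Set<String> found = new HashSet<>();
        NodeList items = doc.getElementsByTagName("Item");
        for (int i = 0; i < items.getLength(); i++) {
            found.add(items.item(i).getTextContent().trim().toLowerCase());
        }
        for (String item : wanted) {
            if (!found.contains(item.toLowerCase())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hasAllItems(SAMPLE, "lime Jello", "Tabasco sauce"));
    }
}
```

Because the search looks only inside <Item> elements, a recipe whose instructions merely mention Tabasco sauce wouldn't match, which is exactly the accuracy a text search over HTML can't offer.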

CSS (Cascading Style Sheets) and XSL (the Extensible Stylesheet Language) are the style languages for XML: They supply the formatting that the markup itself leaves out. Let's take a quick look at these two technologies.

In Listing 3 above, you've seen what may be your first XML document. You've got a problem with that document, though: It's going to be pretty difficult to convince the browser manufacturers (not to mention the W3C) to add the <Ingredient> tag to their browsers. What if there were a way to turn this XML into a text file, a PostScript document, a photo-typesetting file, or input to a text-to-speech system for the visually impaired? Or what if the XML could somehow be transformed into HTML and displayed in a browser?

The members of the appropriate committees at the W3C have addressed these concerns with two specifications: CSS and XSL. While both are declarative languages (meaning that there are no instructions in the first-do-this, then-do-that sense), they serve different functions. CSS is a current W3C recommendation that works with HTML or XML; it is simpler to use and less powerful than XSL, and most current-generation browsers support it (to varying degrees). XSL is used exclusively to format XML or SGML and is more complex and powerful than CSS.

Great strides have been made with XSL in the past year. While XSL is still just a "working draft" (meaning its design isn't yet complete), you can experiment today with working implementations of the draft. Just this month (March 18, 1999), Microsoft released Internet Explorer 5.0, which includes support for part of the XSL specification. And Mozilla (the open source project based on the Netscape source code) can display XML using CSS. At the XTech '99 conference in San Jose, CA, in early March, Sun Microsystems "pre-announced" a request for proposals (for a grant) and a contest relating to the implementation of an XSL batch-processor and the addition of full XSL to Mozilla. (See Resources.)

Again, the purpose of creating these new standards is to make most things very simple for most people, just as HTML has made hypertext and structured documents accessible to your grandma (or your nine-year-old).

Cascading Style Sheets: not just for HTML anymore

You probably already know that HTML documents have a common tree-like structure wherein elements are nested inside other elements. Nonetheless, take a look at Listing 5 below.

<HTML><BODY>
<H1>A Theory About the Brontosaurus</H1>
My theory about the brontosaurus is...
</BODY></HTML>

Listing 5. <HTML> contains <BODY> contains <H1> contains text

As the caption says, the <H1> element is contained inside the <BODY> element, which itself is contained inside the <HTML> element. And, of course, the title itself is inside the <H1> element.

The whole idea of a style sheet is to use these structural relationships to indicate where changes in text style, spacing, and so on should occur. Then, a style sheet can be "applied" to a document, to change its overall look. For example, Listing 6 shows a tiny style sheet that sets the font size, color, and underlining for the <H1> heading in Listing 5.

<STYLE TYPE="text/css">
H1 { color: red; font-size: 16pt; text-decoration: underline }
</STYLE>

Listing 6. A style sheet that sets the style for <H1> in Listing 5

If this style sheet were to appear at the top of the document, most HTML browsers these days would use the settings in the style sheet (or simply "style"), and change all <H1> headings to 16-point, red, underlined type. Styled with our style sheet, our little document would look something like this:

<SPAN STYLE="color: red; font-size: 16pt; text-decoration: underline"> A Theory About the Brontosaurus </SPAN>

My theory about the brontosaurus is...

(If this example doesn't show up properly, you either have styles turned off in your browser or you're using an old browser that doesn't support styles.) A document can reference its style sheet with a hyperlink, and some browsers allow you to switch style sheets for the document you're viewing, effectively changing how the document looks on the fly.

These style sheets are called cascading style sheets, because styles (like fonts, colors, and so on) for one markup element "cascade" down, and apply to all of the element's contents. For example, if a paragraph tag (<P>) is set to show its text in red, all text and any other elements inside that paragraph will be displayed in red, unless one of the paragraph's subelements specifies a color for its contents.
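The "flows down unless overridden" rule can be modeled in a few lines of Java. This is a toy sketch, not how any browser implements CSS: it handles one property (color), keys rules by tag name only, and assumes black as the fallback. The sample paragraph, the rule table, and every name in it are made up.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class InheritedColorSketch {
    // Hypothetical paragraph with two subelements.
    static final String SAMPLE = "<P>My theory is <EM>mine</EM> and <B>bold</B>.</P>";

    /** Walks from the element up through its ancestors until one has a color rule. */
    static String effectiveColor(Element element, Map<String, String> colorRules) {
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            String color = colorRules.get(n.getNodeName());
            if (color != null) {
                return color; // the nearest rule wins
            }
        }
        return "black"; // assumed default when no rule applies
    }

    /** Looks up the color of the first element with the given tag in SAMPLE. */
    static String colorOf(String tagName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(SAMPLE)));
        Map<String, String> rules = new HashMap<>();
        rules.put("P", "red");   // like  P { color: red }
        rules.put("B", "blue");  // like  B { color: blue }
        Element element = (Element) doc.getElementsByTagName(tagName).item(0);
        return effectiveColor(element, rules);
    }

    public static void main(String[] args) throws Exception {
        System.out.println("EM is " + colorOf("EM")); // no rule of its own, so it takes P's color
        System.out.println("B is " + colorOf("B"));   // has its own rule
    }
}
```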

The example we just looked at was for HTML, but what about XML? CSS can be used to style XML, too, and in precisely the same way. You simply specify the style for, say, an <Ingredient>, and all the ingredients look the same. And, interestingly enough, if you change the style sheet, the formatting of all ingredients changes. It's really quite powerful.

Most browsers these days (Netscape 4 and above, Internet Explorer 3 and above, Opera 3.5 and above) implement CSS pretty consistently for HTML. You'll be reading a lot in the next few months about CSS and XML availability in browsers. Also, keep in mind that CSS could be used to apply style to documents on the server and serve "straight HTML" without the CSS markup.

As powerful as CSS is, it has one major limitation: It can't "transform" the data it's styling. CSS can make an HTML or XML document look different, and even hide elements, but it can't reshuffle, cross-reference, or restructure them. For example, say you wanted to transform the XML recipe in Listing 3 to the HTML in Listing 1. Notice that you want the title to appear both in the browser's title bar (in an HTML <TITLE> element), and as a heading on the page (in an <H3> element), as is shown in Listing 1. CSS can't do that; all it can do is apply style to an existing structure.

To take an existing XML structure and produce a new structure in another format (in this case, HTML), you need XSL: the Extensible Stylesheet Language.
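To give a feel for what such a transformation looks like, here is a Java sketch that copies a recipe's <Name> into both a <TITLE> and an <H3>. Treat it as an illustration under several assumptions: the stylesheet uses the syntax of the XSLT 1.0 Recommendation, which postdates the working draft discussed here; the code relies on the JDK's javax.xml.transform API; and the recipe, its name, and the class name are made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class RecipeToHtmlSketch {
    // Hypothetical recipe; only <Name> matters for this example.
    static final String RECIPE =
        "<Recipe><Name>Hypothetical Jello Surprise</Name></Recipe>";

    // The template uses the recipe's <Name> twice: once for the title bar, once as a heading.
    static final String STYLESHEET =
          "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"html\" indent=\"no\"/>"
        + "<xsl:template match=\"/Recipe\">"
        + "<HTML><HEAD><TITLE><xsl:value-of select=\"Name\"/></TITLE></HEAD>"
        + "<BODY><H3><xsl:value-of select=\"Name\"/></H3></BODY></HTML>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet to the recipe and returns the resulting HTML text. */
    static String toHtml(String recipeXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter html = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(recipeXml)),
                              new StreamResult(html));
        return html.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toHtml(RECIPE));
    }
}
```

The one <Name> in the source ends up in two places in the output, which is precisely the kind of restructuring CSS can't express.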
