While a well-formed document is well-formed because it follows rules defined by the XML spec, a valid document is valid because it matches its document type definition (DTD). The DTD is the grammar for a markup language, defined by the designer of the markup language. For my little XML recipe in Listing 3, for example, that designer would be me. The DTD specifies what elements may exist, what attributes the elements may have, what elements may or must be found inside other elements, and in what order.
Nonvalidating parsers read the XML and, if it's well-formed, give you back the document structure as a tree of objects. We'll discuss the document structure you get from a parser in the section below entitled "The Document Object Model." If the document is well-formed but the elements are nonsensical (as was the case with the two <Qty> elements in the <Ingredient> above), that's your problem.
This is, in fact, how HTML browsers work. Generally, HTML parsers are nonvalidating. The various "HTML checking" parsers, which report syntax errors in HTML, are essentially validating HTML parsers (with additional functionality, like link checking).
Validating parsers read XML, verify that it's well-formed (just as nonvalidating parsers do), and then go on to determine whether the document's element tags are legal, whether the attribute names make sense, whether every element nested inside another element belongs there, and so on.
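To make the distinction concrete, here's a minimal sketch using Python's standard xml.dom.minidom, which is a nonvalidating parser. The two sample documents, including the units and the item name, are made up for illustration. The first is well-formed but contains two <Qty> elements in one <Ingredient>; the second isn't well-formed at all.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Well-formed, but nonsensical: one <Ingredient> holds two <Qty> elements.
# (Hypothetical sample data, not taken from the article's listings.)
ODD_BUT_WELL_FORMED = """<Ingredient>
  <Qty unit="cup">1</Qty>
  <Qty unit="pinch">2</Qty>
  <Item>Salt</Item>
</Ingredient>"""

# Not well-formed: the <Qty> element is never closed.
NOT_WELL_FORMED = "<Ingredient><Qty unit='cup'>1<Item>Salt</Item></Ingredient>"


def parse_nonvalidating(text):
    """Return the root element, or None if the text isn't well-formed."""
    try:
        return minidom.parseString(text).documentElement
    except ExpatError:
        return None


root = parse_nonvalidating(ODD_BUT_WELL_FORMED)
print(len(root.getElementsByTagName("Qty")))  # both <Qty> elements are accepted
print(parse_nonvalidating(NOT_WELL_FORMED))   # None: rejected as not well-formed
```

The parser objects only to the second document. Deciding that two quantities for one ingredient make no sense is left to a validating parser, or to your own code.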
The DTD defines the document type. It accounts for the Extensible in XML. The DTD is how you actually define a new markup language -- what I often call a dialect of XML. DTDs currently are being written for an enormous number of different problem domains, and each DTD defines a new markup language. New markup languages now exist, or are being designed, to mark up the plays of Shakespeare; to define general data resources (RDF); to model information in the health care industry (HL7 SGML/XML); to typeset, display, and actively use mathematical equations (MathML); and to perform electronic data interchange (XML/EDI). There's even a proposal for a markup language for business data in the footwear industry (FDX). (No, I'm not joking.)
Central to each of these new languages is a DTD that describes what tags the markup language has, what those tags' attributes may be, and how they may be combined. A DTD specifies very clearly what information may or may not be included in a markup language. For instance, the DTD for HTML does not allow for markup tags to select paper size for printing.
Let's take a look at a DTD for the recipe XML in Listing 3. I'm going to call it JWSRML (JavaWorld Scary Recipe Markup Language). Apologies to anyone already using that acronym.
<!-- This is the example DTD for the example XML -->
<!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Description (#PCDATA)>
<!ELEMENT Ingredients (Ingredient)*>
<!ELEMENT Ingredient (Qty, Item)>
<!ELEMENT Qty (#PCDATA)>
<!ATTLIST Qty unit CDATA #REQUIRED>
<!ELEMENT Item (#PCDATA)>
<!ATTLIST Item optional CDATA "0"
isVegetarian CDATA "true">
<!ELEMENT Instructions (Step)+>
Listing 4. The DTD for JWSRML
The document type definition in Listing 4 defines a language for a validating parser to accept -- meaning, the parser will produce errors if the rules listed in the DTD aren't followed. To get a general idea of how a DTD works, let's look at what a few of the lines in this file mean.
<!ELEMENT Recipe (Name, Description?, Ingredients?, Instructions?)>
The <!ELEMENT...> statement defines a tag in the document. This statement defines a <Recipe> tag, stating that it can contain a <Name>, an optional <Description> (the question mark [?] denotes optionality), an optional <Ingredients> tag, and an optional <Instructions> tag.
<!ELEMENT Name (#PCDATA)>
The <Name> tag can contain character data and nothing else.
<!ATTLIST Item optional CDATA "0" isVegetarian CDATA "true">
The <Item> tag has two possible attributes: optional, whose default value is 0; and isVegetarian, whose default value is true. Notice that attribute values aren't limited to numbers; they can be any text.
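One way to get a feel for content models like these is to notice that they behave much like regular expressions over the names of an element's children. The sketch below is only an illustration of that idea, not how a real validating parser works: the regular expressions are hand-translated from Listing 4 rather than read from the DTD, attributes and #PCDATA are ignored, and the sample recipes and the function name content_errors are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Content models from Listing 4, hand-translated to regular expressions over
# the sequence of child tag names (each name is followed by one space).
CONTENT_MODELS = {
    "Recipe": r"Name (Description )?(Ingredients )?(Instructions )?",
    "Ingredients": r"(Ingredient )*",
    "Ingredient": r"Qty Item ",
    "Instructions": r"(Step )+",
}


def content_errors(element):
    """Collect messages for elements whose children break their content model."""
    errors = []
    model = CONTENT_MODELS.get(element.tag)
    if model is not None:
        children = "".join(child.tag + " " for child in element)
        if re.fullmatch(model, children) is None:
            found = children.strip() or "(nothing)"
            errors.append(f"<{element.tag}> may not contain: {found}")
    for child in element:
        errors.extend(content_errors(child))
    return errors


# Hypothetical sample documents.
GOOD = ET.fromstring(
    "<Recipe><Name>Toast</Name><Ingredients>"
    "<Ingredient><Qty unit='slice'>1</Qty><Item>Bread</Item></Ingredient>"
    "</Ingredients></Recipe>"
)
BAD = ET.fromstring(
    "<Recipe><Name>Toast</Name><Ingredients>"
    "<Ingredient><Qty unit='slice'>1</Qty><Qty unit='slice'>2</Qty>"
    "<Item>Bread</Item></Ingredient>"
    "</Ingredients></Recipe>"
)

print(content_errors(GOOD))  # an empty list: every content model is satisfied
print(content_errors(BAD))   # the doubled <Qty> is reported
```

The question mark, asterisk, and plus sign mean the same thing in both notations (optional, zero or more, one or more), which is why the translation is nearly mechanical.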
A DTD is associated with an XML document by way of a document type declaration, which appears at the top of the XML file (after the <?xml...?> line). The document type declaration may contain either an inline copy of the document type definition or a reference to that definition as a system filename or URI (uniform resource identifier). For example,
<!DOCTYPE Recipe SYSTEM "example.dtd">
tells the parser to start looking for a <Recipe> tag as the top-level tag of the document. It also declares that the DTD is in the system file example.dtd.
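As a rough illustration, the sketch below parses a tiny document that carries this declaration and then prints what the parser recorded about it. It uses Python's standard xml.dom.minidom, which is nonvalidating and never opens example.dtd; the recipe name is made up.

```python
from xml.dom import minidom

# The declaration from the text, followed by a minimal document.
# The recipe content is hypothetical.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE Recipe SYSTEM "example.dtd">
<Recipe>
  <Name>Toast</Name>
</Recipe>"""

doc = minidom.parseString(DOCUMENT)

print(doc.doctype.name)             # top-level tag named by the declaration
print(doc.doctype.systemId)         # system file said to hold the DTD
print(doc.documentElement.tagName)  # top-level tag actually found
```

A validating parser would go one step further: it would load the DTD named by the system identifier and check the rest of the document against it.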
There are other characters and notations in the DTD, but writing DTDs is a topic unto itself. If you're interested in learning more, check out the DTD-related links in Resources.
You now know a lot about how XML is structured and controlled, but you haven't heard what it's good for. Why are people so excited about this technology?
Here are some benefits of representing information in XML:
<Ingredient><OBJECT CLASSID="000DDA23432...">). In fact, CSS (Cascading Style Sheets) and XSL (the Extensible Stylesheet Language) do precisely that: They're the style languages for XML. Let's take a quick look at these two technologies.
In Listing 3 above, you've seen what may be your first XML document. You've got a problem with that document, though: It's going to be pretty difficult to convince the browser manufacturers (not to mention the W3C) to add the <Ingredient> tag to their browsers. What if there were a way to turn this XML into a text file, a PostScript document, a photo-typesetting file, or input to a text-to-speech system for the visually impaired? Or what if the XML could somehow be transformed into HTML and displayed in a browser?
The members of the appropriate committees at the W3C have addressed these concerns with two specifications: CSS and XSL. While both are declarative languages (meaning that there are no instructions in the first-do-this, then-do-that sense), they serve different functions. CSS is a current W3C recommendation that can be used with HTML or XML. It is simpler to use and less powerful than XSL, and it is supported (to varying degrees) by most current-generation browsers. XSL is used exclusively to format XML or SGML and is more complex and powerful than CSS.
Great strides have been made with XSL in the past year. While XSL is still just a "working draft" (meaning its design isn't yet complete), you can experiment today with working implementations of the draft. Just this month (March 18, 1999), Microsoft released Internet Explorer 5.0, which includes support for part of the XSL specification. And Mozilla (the open source project based on the Netscape source code) can display XML using CSS. At the XTech '99 conference in San Jose, CA, in early March, Sun Microsystems "pre-announced" a request for proposals (for a grant) and a contest relating to the implementation of an XSL batch-processor and the addition of full XSL to Mozilla. (See Resources.)
Again, the purpose of creating these new standards is to make most things very simple for most people, just as HTML has made hypertext and structured documents accessible to your grandma (or your nine-year-old).
You probably already know that HTML documents have a common tree-like structure wherein elements are nested inside other elements. Nonetheless, take a look at Listing 5 below.
<HTML>
<HEAD></HEAD>
<BODY>
<H1>A Theory About the Brontosaurus</H1>
My theory about the brontosaurus is...
</BODY>
</HTML>
Listing 5. <HTML> contains <BODY> contains <H1> contains text
As the caption says, the <H1> element is contained inside the <BODY> element, which itself is contained inside the <HTML> element. And, of course, the title itself is inside the <H1> element.
The whole idea of a style sheet is to use these structural relationships to indicate where changes in text style, spacing, and so on should occur. Then, a style sheet can be "applied" to a document, to change its overall look. For example, Listing 6 shows a tiny style sheet that sets the font size, color, and underlining for the <H1> heading in Listing 5.
<STYLE TYPE="text/css">
<!--
H1 { color: red; font-size: 16pt; text-decoration: underline }
-->
</STYLE>
Listing 6. A style sheet that sets the style for <H1> in Listing 5
If this style sheet were to appear at the top of the document, most HTML browsers these days would use the settings in the style sheet (or simply "style"), and change all <H1> headings to red, underlined, 16-point type. Styled with our style sheet, our little document would look something like this:
<SPAN STYLE="color: red; font-size: 16pt; text-decoration: underline"> A Theory About the Brontosaurus </SPAN>My theory about the brontosaurus is...
(If this example doesn't show up properly, you either have styles turned off in your browser or you're using an old browser that doesn't support styles.) A document can reference its style sheet with a hyperlink, and some browsers allow you to switch style sheets for the document you're viewing, effectively changing how the document looks on the fly.
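These style sheets are called cascading style sheets because rules from several sources (the browser's defaults, the reader's preferences, and the document's own style sheets) "cascade" into one another, with the more specific rule winning. In addition, many styles (like fonts and colors) set for one markup element are passed down and apply to all of the element's contents.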
For example, if a paragraph tag (<P>) is set to show its text in red, all text and any other elements inside that paragraph will be displayed in red, unless one of the paragraph's sub elements specifies a color for its contents.
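The paragraph example can be sketched as a walk down the element tree, in which each element either takes a color from its own rule or falls back on the color handed down by its parent. This is a deliberately tiny model and not how a browser implements CSS: it handles one property, matches rules by tag name only, and assumes a default color of black. The rule table, the sample markup, and the function name are all invented for the illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical rule table: tag name -> color. Only <P> and <EM> set a color.
COLOR_RULES = {"P": "red", "EM": "blue"}

DOCUMENT = ET.fromstring(
    "<BODY><P>Red text <B>still red</B> <EM>blue <B>also blue</B></EM></P>"
    "<H1>Default</H1></BODY>"
)


def effective_colors(element, inherited="black", path=""):
    """Yield (path, color) pairs; a child inherits unless its own tag has a rule."""
    color = COLOR_RULES.get(element.tag, inherited)
    here = f"{path}/{element.tag}"
    yield here, color
    for child in element:
        yield from effective_colors(child, color, here)


for where, color in effective_colors(DOCUMENT):
    print(f"{where}: {color}")
```

The <B> inside the paragraph comes out red because nothing overrides it, while the <B> inside <EM> comes out blue because <EM> set a color of its own for its contents.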
The example we just looked at was for HTML, but what about XML? CSS can be used to style XML, too, and in precisely the same way. You simply specify the style for, say, an <Ingredient>, and all the ingredients look the same. And, interestingly enough, if you change the style sheet, the formatting of all ingredients changes. It's really quite powerful.
Most browsers these days (Netscape 4 and above, Internet Explorer 3 and above, Opera 3.5 and above) implement CSS pretty consistently for HTML. You'll be reading a lot in the next few months about CSS and XML availability in browsers. Also, keep in mind that CSS could be used to apply style to documents on the server and serve "straight HTML" without the CSS markup.
As powerful as CSS is, it has one major limitation: It can't "transform" the data it's styling. CSS can make an HTML or XML document look different, and even hide elements, but it can't reshuffle, cross-reference, or restructure them. For example, say you wanted to transform the XML recipe in Listing 3 to the HTML in Listing 1. Notice that you want the title to appear both in the browser's title bar (in an HTML <TITLE> element), and as a heading on the page (in an <H3> element), as is shown in Listing 1. CSS can't do that; all it can do is apply style to an existing structure.
To take an existing XML structure and produce a new structure of something else (in this case, HTML), you need XSL: the Extensible Stylesheet Language.
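To show the kind of restructuring involved, here's a minimal sketch written in plain Python rather than XSL. It builds a new HTML tree from a recipe and uses the recipe's name twice, once in <TITLE> and once in <H3>. The sample recipe is hypothetical and merely shaped like the DTD in Listing 4, and the exact layout of the generated HTML is an assumption.

```python
import xml.etree.ElementTree as ET

# A hypothetical recipe shaped like the DTD in Listing 4.
RECIPE = ET.fromstring(
    "<Recipe><Name>Toast</Name><Description>Warm bread.</Description></Recipe>"
)


def recipe_to_html(recipe):
    """Build a new HTML tree in which the recipe name is used twice."""
    name = recipe.findtext("Name", default="")
    html = ET.Element("HTML")
    head = ET.SubElement(html, "HEAD")
    ET.SubElement(head, "TITLE").text = name  # shown in the browser's title bar
    body = ET.SubElement(html, "BODY")
    ET.SubElement(body, "H3").text = name     # shown as a heading on the page
    description = recipe.findtext("Description")
    if description:
        ET.SubElement(body, "P").text = description
    return html


page = recipe_to_html(RECIPE)
print(ET.tostring(page, encoding="unicode"))
```

The output tree has a different shape from the input tree, and one piece of source data appears in two places. That is exactly what a style language limited to decorating an existing structure can't express.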