And yet, despite the omnipresence and popularity of HTML, it is severely limited in what it can do. It's fine for disseminating informal documents, but HTML now is being used to do things it was never designed for. Trying to design heavy-duty, flexible, interoperable data systems from HTML is like trying to build an aircraft carrier with hacksaws and soldering irons: the tools (HTML and HTTP) just aren't up to the job.
The good news is that many of the limitations of HTML have been overcome in XML, the Extensible Markup Language. XML is easily comprehensible to anyone who understands HTML, but it is much more powerful. More than just a markup language, XML is a metalanguage -- a language used to define new markup languages. With XML, you can create a language crafted specifically for your application or domain.
XML will complement, rather than replace, HTML. Whereas HTML is used for formatting and displaying data, XML represents the contextual meaning of the data.
This article will present the history of markup languages and how XML came to be. We'll look at sample data in HTML and move gradually into XML, demonstrating why it provides a superior way to represent data. We'll explore the reasons you might need to invent a custom markup language, and I'll teach you how to do it. We'll cover the basics of XML notation, and how to display XML with two different sorts of style languages. Then, we'll dive into the Document Object Model, a powerful tool for manipulating documents as objects (or manipulating object structures as documents, depending upon how you look at it). We'll go over how to write Java programs that extract information from XML documents, with a pointer to a free program useful for experimenting with these new concepts. Finally, we'll take a look at an Internet company that's basing its core technology strategy on XML and Java.
Is XML for you?
Though this article is written for anyone interested in XML, it has a special relationship to the JavaWorld series on XML JavaBeans. (See Resources for links to related articles.) If you've been reading that series and aren't quite "getting it," this article should clarify how to use XML with beans. If you are getting it, this article serves as the perfect companion piece to the XML JavaBeans series, since it covers topics untouched therein. And, if you're one of the lucky few who still have the XML JavaBeans articles to look forward to, I recommend that you read the present article first as introductory material.
A note about Java
There's so much recent XML activity in the computer world that even an article of this length can only skim the surface. Still, the whole point of this article is to give you the context you need to use XML in your Java program designs. This article also covers how XML operates with existing Web technology, since many Java programmers work in such an environment.
XML opens the Internet and Java programming to portable, nonbrowser functionality. XML frees Internet content from the browser in much the same way Java frees program behavior from the platform. XML makes Internet content available to real applications.
Java is an excellent platform for using XML, and XML is an outstanding data representation for Java applications. I'll point out some of Java's strengths with XML as we go along.
Let's begin with a history lesson.
The HTML we all know and love (well, that we know, anyway) was originally designed by Tim Berners-Lee at CERN (le Conseil Européen pour la Recherche Nucléaire, or the European Laboratory for Particle Physics) in Geneva to allow physics nerds (and even non-nerds) to communicate with each other. HTML was released in December 1990 within CERN, and became publicly available in the summer of 1991 for the rest of us. CERN and Berners-Lee gave away the specifications for HTML, HTTP, and URLs, in the fine old tradition of Internet share-and-enjoy.
Berners-Lee defined HTML in SGML, the Standard Generalized Markup Language. SGML, like XML, is a metalanguage -- a language used for defining other languages. Each so-defined language is called an application of SGML. HTML is an application of SGML.
SGML emerged from research done primarily at IBM on text document representation in the late '60s. IBM created GML ("Generalized Markup Language"), a predecessor language to SGML, and in 1978 the American National Standards Institute (ANSI) created its first version of SGML. The first industry standard was released in 1983, the draft international standard followed in 1985, and the first international standard was published in 1986. Interestingly enough, the first SGML standard was published using an SGML system developed by Anders Berglund at CERN, the organization that, as we have seen, gave us HTML and the Web.
SGML is widely used in large industries and governments, including large aerospace, automotive, and telecommunications companies. It is used as a document standard at the United States Department of Defense and the Internal Revenue Service. (For readers outside of the US, the IRS are the tax guys.)
Albert Einstein said everything should be made as simple as possible, and no simpler. The reason SGML isn't found in more places is that it's extremely sophisticated and complex. And HTML, which you can find everywhere, is very simple; for a lot of applications, it's too simple.
HTML is a language designed to "talk about" documents: headings, titles, captions, fonts, and so on. It's heavily document structure- and presentation-oriented.
Admittedly, artists and hackers have been able to work miracles with the relatively dull tool called HTML. But HTML has serious drawbacks that make it a poor fit for designing flexible, powerful, evolutionary information systems. Here are a few of the major complaints:
SGML has none of these weaknesses, but in order to be general, it's hair-tearingly complex (at least in its complete form). The language used to format SGML (its "style language"), called DSSSL (Document Style Semantics and Specification Language), is extremely powerful but difficult to use. How do we get a language that's roughly as easy to use as HTML but has most of the power of SGML?
As the Web exploded in popularity and people all over the world began learning about HTML, they fairly quickly started running into the limitations outlined above. Heavy-metal SGML wonks, who had been working with SGML for years in relative obscurity, suddenly found that everyday people had some understanding of the concept of markup (that is, HTML). SGML experts began to consider the possibility of using SGML on the Web directly, instead of using just one application of it (again, HTML). At the same time, they knew that SGML, while powerful, was simply too complex for most people to use.
In the summer of 1996, Jon Bosak (currently online information technology architect at Sun Microsystems) convinced the W3C to let him form a committee on using SGML on the Web. He created a high-powered team of muckety-mucks from the SGML world. By November of that year, these folks had created the beginnings of a simplified form of SGML that incorporated tried-and-true features of SGML but with reduced complexity. This was, and is, XML.
In March 1997, Bosak released his landmark paper, "XML, Java and the Future of the Web" (see Resources). Now, two years later (a very long time in the life of the Web), Bosak's short paper is still a good, if dated, introduction to why using XML is such an excellent idea.
SGML was created for general document structuring, and HTML was created as an application of SGML for Web documents. XML is a simplification of SGML for general Web use.
All this talk of "inventing your own tags" is pretty foggy: What kind of tags would a developer want to invent and how would the resulting XML be used? In this section, we'll go over an example that compares and contrasts information representation in HTML and XML. In a later section ("XSL: I like your style") we'll go over XML display.
First, we'll take an example of a recipe, and display it as one possible HTML document. Then, we'll redo the example in XML and discuss what that buys us.
Take a look at the little chunk of HTML in Listing 1:
    <!-- The original html recipe -->
    <HTML>
    <HEAD>
    <TITLE>Lime Jello Marshmallow Cottage Cheese Surprise</TITLE>
    </HEAD>
    <BODY>
    <H3>Lime Jello Marshmallow Cottage Cheese Surprise</H3>
    My grandma's favorite (may she rest in peace).
    <H4>Ingredients</H4>
    <TABLE BORDER="1">
    <TR BGCOLOR="#308030"><TH>Qty</TH><TH>Units</TH><TH>Item</TH></TR>
    <TR><TD>1</TD><TD>box</TD><TD>lime gelatin</TD></TR>
    <TR><TD>500</TD><TD>g</TD><TD>multicolored tiny marshmallows</TD></TR>
    <TR><TD>500</TD><TD>ml</TD><TD>cottage cheese</TD></TR>
    <TR><TD></TD><TD>dash</TD><TD>Tabasco sauce (optional)</TD></TR>
    </TABLE>
    <P>
    <H4>Instructions</H4>
    <OL>
    <LI>Prepare lime gelatin according to package instructions...</LI>
    <!-- and so on -->
    </BODY>
    </HTML>
Listing 1. Some HTML
(A printable version of this listing can be found at example.html.)
Looking at the HTML code in Listing 1, it's probably clear to just about anyone that this is a recipe for something (something awful, but a recipe nonetheless). In a browser, our HTML produces something like this:
    Lime Jello Marshmallow Cottage Cheese Surprise
    My grandma's favorite (may she rest in peace).
Listing 2. What the HTML in Listing 1 looks like in a browser
Now, there are a number of advantages to representing this recipe in HTML, as follows:
There's one major problem with HTML as a data format, however. The meaning of the various pieces of data in the document is lost. It's really hard to take general HTML and figure out what the data in the HTML mean. The fact that there's an <Ingredient> of this recipe with a <Qty> (quantity) of 500 ml of <Item> cottage cheese would be very hard to extract from this document in a way that's generally meaningful.
Now, the idea of data in an HTML document meaning something may be a bit hard to grasp. Web pages are fine for the human reader, but if a program is going to process a document, it requires unambiguous definitions of what the tags mean. For instance, the <TITLE> tag in an HTML document encloses the title of the document. That's what the tag means, and it doesn't mean anything else.
Similarly, an HTML <TR> tag means "table row," but that's of little use if your program is trying to read recipes in order to, say, create a shopping list. How could a program find a list of ingredients from a Web page formatted in HTML?
Sure, you could write a program that grabs the headers out of the document, reads the table column headers, figures out the quantities and units of each ingredient, and so on. The problem is, everyone formats recipes differently. What if you're trying to get this information from, say, the Julia Child Web site, and she keeps messing around with the formatting? If Julia changes the order of the columns or stops using tables, she'll break your program! (Though it has to be said: If Julia starts publishing recipes like this, she may want to think about changing careers.)
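To make that fragility concrete, here is a minimal sketch of such a scraper in Java. The class name BrittleRecipeScraper, the method scrape, and the regular expression are all invented for illustration; the only thing taken from the article is the table row layout of Listing 1. The scraper has no way to know what a cell means, so it guesses from column position.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical scraper: assumes every ingredient row looks exactly like
// <TR><TD>qty</TD><TD>unit</TD><TD>item</TD></TR>, as in Listing 1.
final class BrittleRecipeScraper {
    private static final Pattern ROW = Pattern.compile(
        "<TR><TD>(.*?)</TD><TD>(.*?)</TD><TD>(.*?)</TD></TR>");

    static List<String> scrape(String html) {
        List<String> rows = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            // The meaning of each cell is inferred purely from its position.
            rows.add("qty=" + m.group(1) + ", unit=" + m.group(2) + ", item=" + m.group(3));
        }
        return rows;
    }

    public static void main(String[] args) {
        String asPublished = "<TR><TD>500</TD><TD>ml</TD><TD>cottage cheese</TD></TR>";
        String reordered = "<TR><TD>cottage cheese</TD><TD>500</TD><TD>ml</TD></TR>";
        System.out.println(scrape(asPublished)); // [qty=500, unit=ml, item=cottage cheese]
        System.out.println(scrape(reordered));   // [qty=cottage cheese, unit=500, item=ml]
    }
}
```

The second call still "works" in the sense that it returns a row, but the result is nonsense: nothing in the HTML says which cell is the quantity.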
Now, imagine that this recipe page came from data in a database and you'd like to be able to ship this data around. Maybe you'd like to add it to your huge recipe database at home, where you can search and use it however you like. Unfortunately, your input is HTML, so you'll need a program that can read this HTML, figure out what all the "Ingredients," "Instructions," "Units," and so forth are, and then import them to your database. That's a lot of work, especially since all of that semantic information -- again, the meaning of the data -- existed in that original database but was obscured in the process of being transformed into HTML.
Now, imagine you could invent your own custom language for describing recipes. Instead of describing how the recipe was to be displayed, you'd describe the information structure in the recipe: how each piece of information would relate to the other pieces.
Let's just make up a markup language for describing recipes, and rewrite our recipe in that language, as in Listing 3.
    <?xml version="1.0"?>
    <Recipe>
      <Name>Lime Jello Marshmallow Cottage Cheese Surprise</Name>
      <Description>
        My grandma's favorite (may she rest in peace).
      </Description>
      <Ingredients>
        <Ingredient>
          <Qty unit="box">1</Qty>
          <Item>lime gelatin</Item>
        </Ingredient>
        <Ingredient>
          <Qty unit="g">500</Qty>
          <Item>multicolored tiny marshmallows</Item>
        </Ingredient>
        <Ingredient>
          <Qty unit="ml">500</Qty>
          <Item>Cottage cheese</Item>
        </Ingredient>
        <Ingredient>
          <Qty unit="dash"/>
          <Item optional="1">Tabasco sauce</Item>
        </Ingredient>
      </Ingredients>
      <Instructions>
        <Step>
          Prepare lime gelatin according to package instructions
        </Step>
        <!-- And so on... -->
      </Instructions>
    </Recipe>
Listing 3. A custom markup language for recipes
It will come as little surprise to you, being the astute reader you are, that this recipe in its new format is actually an XML document. Maybe the fact that the file started with the odd header <?xml version="1.0"?> gave it away; in fact, every XML file should begin with this header. We've simply invented markup tags that have a particular meaning; for example, "An <Ingredient> is a <Qty> (quantity in specified units) of a single <Item>, which is possibly optional." Our XML document describes the information in the recipe in terms of recipes, instead of in terms of how to display the recipe (as in HTML). The semantics, or meaning of the information, is maintained in XML because that's what the tag set was designed to do.
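Because the tags carry the meaning, pulling a shopping list out of the recipe no longer depends on layout. The sketch below is one way to do it, assuming the JAXP and DOM classes that ship with current JDKs (javax.xml.parsers and org.w3c.dom) rather than any particular parser from Resources. The class name ShoppingList, the method ingredients, and the abbreviated RECIPE string are hypothetical; the element and attribute names come from Listing 3.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ShoppingList {
    // Abbreviated copy of the recipe in Listing 3 (ingredients only).
    static final String RECIPE =
        "<?xml version=\"1.0\"?>"
        + "<Recipe><Name>Lime Jello Marshmallow Cottage Cheese Surprise</Name><Ingredients>"
        + "<Ingredient><Qty unit=\"box\">1</Qty><Item>lime gelatin</Item></Ingredient>"
        + "<Ingredient><Qty unit=\"g\">500</Qty><Item>multicolored tiny marshmallows</Item></Ingredient>"
        + "<Ingredient><Qty unit=\"ml\">500</Qty><Item>Cottage cheese</Item></Ingredient>"
        + "<Ingredient><Qty unit=\"dash\"/><Item optional=\"1\">Tabasco sauce</Item></Ingredient>"
        + "</Ingredients></Recipe>";

    // Returns one "quantity unit item" line per <Ingredient> element.
    static List<String> ingredients(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("Ingredient");
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element ingredient = (Element) nodes.item(i);
                // Elements are looked up by name, not by position on the page.
                Element qty = (Element) ingredient.getElementsByTagName("Qty").item(0);
                Element item = (Element) ingredient.getElementsByTagName("Item").item(0);
                String line = qty.getTextContent().trim() + " "
                    + qty.getAttribute("unit") + " "
                    + item.getTextContent().trim();
                lines.add(line.trim()); // an empty <Qty/> leaves a leading space
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException("could not read recipe", e);
        }
    }

    public static void main(String[] args) {
        for (String line : ingredients(RECIPE)) {
            System.out.println(line);
        }
    }
}
```

Reordering <Qty> and <Item> inside an <Ingredient>, or changing how the recipe is displayed, would not affect this code, since it never looks at presentation.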
Notes on notation
It's important to get some nomenclature straight. In Figure 1, you see a start tag, which begins an enclosed area of text, known as an Item, according to the tag name. As in HTML, XML tags may include a list of attributes (each consisting of an attribute name and an attribute value). The Item defined by the tag ends with the end tag.
Figure 1. An XML start tag and its corresponding end tag
Not every tag encloses text. In HTML, the <BR> tag means "line break" and contains no text. In XML, a start tag with no matching end tag isn't allowed. Instead, XML has empty tags, denoted by a slash before the final right-angle bracket in the tag. Figure 2 shows an empty tag from our XML recipe. Note that empty tags may have attributes. This empty tag example is standard XML shorthand for <Qty unit="dash"></Qty>.
Figure 2. An empty tag
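One way to convince yourself that the empty-tag form and the start-tag/end-tag pair are the same thing is to parse both and compare the results. This is a small sketch under the assumption that the JDK's bundled JAXP parser is used; EmptyTagDemo, root, and sameEmptyElement are made-up names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class EmptyTagDemo {
    // Parses a one-element document and returns its root element.
    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    // True when both inputs yield an element with the same name, the same
    // "unit" attribute, and no children at all.
    static boolean sameEmptyElement(String a, String b) {
        Element x = root(a);
        Element y = root(b);
        return x.getTagName().equals(y.getTagName())
            && x.getAttribute("unit").equals(y.getAttribute("unit"))
            && !x.hasChildNodes()
            && !y.hasChildNodes();
    }

    public static void main(String[] args) {
        System.out.println(
            sameEmptyElement("<Qty unit=\"dash\"/>", "<Qty unit=\"dash\"></Qty>"));
    }
}
```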
In addition to these notational differences from HTML, the structural rules of XML are more strict. Every XML document must be well-formed. What does that mean? Read on!
Ooh-la-la! Well-formed XML
The concept of well-formedness comes from mathematics: It's possible to write mathematical expressions that don't mean anything. For example, the expression
2 ( + + 5 (=) 9 > 7
looks (sort of) like math, but it isn't math because it doesn't follow the notational and structural rules for a mathematical expression (not on this planet, at least). In other words, the "expression" above isn't well-formed. Mathematical expressions must be well-formed before you can do anything useful with them, because expressions that aren't well-formed are meaningless.
A well-formed XML document is simply one that follows all of the notational and structural rules for XML. Programs that intend to process XML should reject any input XML that doesn't follow the rules for being well-formed. The most important of these rules are as follows:
No unclosed tags. A lot of HTML opens a list item with <LI> and never "closes" it with </LI>. The browser just figures out where the </LI> would be and automatically inserts it for you. XML doesn't allow this kind of sloppiness. Every start tag must have a corresponding end tag. This is because part of the information in an XML file has to do with how different elements of information relate to one another, and if the structure is ambiguous, so is the information. So, XML simply doesn't allow ambiguous structure. This nonambiguous structure also allows XML documents to be processed as data structures (trees), as I'll explain shortly in the discussion of the Document Object Model.

No overlapping tags. An element that opens inside another element must also close inside it. For example, the sequence

    <Tomato>Let's call <Potato>the whole thing</Tomato> off</Potato>

isn't well-formed because <Potato> opens inside of <Tomato> but doesn't close inside of <Tomato>. The correct sequence must be

    <Tomato>Let's call <Potato>the whole thing</Potato> off</Tomato>

In other words, the structure of the document must be strictly hierarchical.

Attribute values must be quoted. HTML lets you get away with unquoted attribute values (for example, <TABLE BORDER=1>, where there are no quotes around the attribute value). In XML, every attribute value must have quotes (<TABLE BORDER="1">).

Markup characters in text must be escaped. The characters (<), (&), (>), and (") can be written as the character entities (&lt;), (&amp;), (&gt;), and (&quot;), respectively. These characters are special characters for XML. An XML file with a bare (<) or (&) in the text enclosed in tags isn't well-formed, and correctly designed XML parsers will produce an error for such input. A bare (>) or (") in ordinary text is generally legal, but a double quote inside a double-quoted attribute value must be written as (&quot;).
'Well-formed' means 'parsable'
A generic XML parser is a program or class that can read any well-formed XML at its input. Many vendors now offer XML parsers in Java for free (you'll find links to these packages in Resources at the bottom of this article). XML parsers recognize well-formed documents and produce error messages (much like a compiler would) when they receive input that isn't well-formed. As we'll see, this functionality is very handy for the programmer: You simply call the parser you've selected and it takes care of the error detection and so on. While all XML parsers check the well-formedness of documents (meaning, as we've seen, that all the tags make sense, are nested properly, and so on), validating XML parsers go one step further. Validating parsers also confirm whether the document is valid; that is, that the structure and number of tags make sense.
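The error detection described above is easy to try out. The sketch below hands a few strings to a non-validating parser and reports whether each one is accepted. It assumes the JAXP parser bundled with current JDKs rather than one of the packages listed in Resources, and the names WellFormedCheck and isWellFormed are invented for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedCheck {
    // Returns true if the parser accepts the input, false if it reports
    // a well-formedness error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the input
        } catch (Exception e) {
            throw new IllegalStateException("parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<OL><LI>one<LI>two</OL>",                                          // unclosed tags
            "<OL><LI>one</LI><LI>two</LI></OL>",
            "<Tomato>Let's call <Potato>the whole thing</Tomato> off</Potato>", // overlapping tags
            "<TABLE BORDER=1></TABLE>",                                         // unquoted attribute
            "<TABLE BORDER=\"1\"></TABLE>",
            "<Step>2 < 5</Step>",                                               // bare < in text
            "<Step>2 &lt; 5</Step>"
        };
        for (String sample : samples) {
            System.out.println(isWellFormed(sample) + "  " + sample);
        }
    }
}
```

The calling code contains no XML rules of its own; it only asks the parser and reacts to the answer, which is the convenience the paragraph above describes.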
As an example of a document that isn't valid, most browsers will display a document that (nonsensically) has two <TITLE> elements, but how can this be? Only one title or no title makes sense.
For another example, imagine that in Listing 3 the "cottage cheese" ingredient looked like this:
    <Ingredient>
      <Qty unit="ml">500
        <Qty unit="g">9</Qty>
      </Qty>
      <Item>Cottage cheese</Item>
    </Ingredient>
This XML document is certainly well-formed, but it doesn't make sense. It isn't structurally valid. It is nonsense for a <Qty> to contain a <Qty>. What's the <Qty> of this <Ingredient>?
The problem is, we have a document that's well-formed, but it isn't very useful because the XML doesn't make sense. We need a way to specify what makes an XML document valid. For example, how can we specify that a <Qty> tag may contain only text (and not any other elements) and report as errors any other case?
The answer to this question lies in something called the document type definition, which we'll look at next.
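As a preview of the difference between the two kinds of parsers, the sketch below feeds the nonsensical ingredient (a <Qty> nested inside a <Qty>) to the same parser twice, once with validation off and once with it on. Everything specific here is an assumption made for illustration: the JDK's bundled JAXP parser stands in for whichever parser you choose, the names ValidityCheck and parses are invented, and the tiny document type definition in the DTD constant is a guess at rules one might write for this element, not the definition developed later in this article.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

final class ValidityCheck {
    // Hypothetical rules: an Ingredient is one Qty followed by one Item,
    // and both Qty and Item may contain only text.
    static final String DTD =
        "<!DOCTYPE Ingredient ["
        + "<!ELEMENT Ingredient (Qty, Item)>"
        + "<!ELEMENT Qty (#PCDATA)>"
        + "<!ATTLIST Qty unit CDATA #REQUIRED>"
        + "<!ELEMENT Item (#PCDATA)>"
        + "]>";

    // Returns true if the parser accepts DTD + xml without reporting an error.
    static boolean parses(String xml, boolean validating) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(validating);
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e; // validity violations are reported here
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException("parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        String nested = "<Ingredient><Qty unit=\"ml\">500<Qty unit=\"g\">9</Qty></Qty>"
            + "<Item>Cottage cheese</Item></Ingredient>";
        System.out.println("non-validating parser accepts it: " + parses(nested, false));
        System.out.println("validating parser accepts it:     " + parses(nested, true));
    }
}
```

The non-validating run only checks that the tags are balanced and nested, so the nonsense gets through; the validating run has rules to compare against and can report the nested <Qty> as an error.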