HTML (Hypertext Markup Language) is currently the document format of the World Wide Web. Lately, though, there's been a lot of noise about XML (Extensible Markup Language), which provides, among other things, the ability to define new markup tags (the bits between <angle brackets>), or even whole new markup languages. Some pundits even claim that XML may supplant HTML as the dominant information format on the Web.
Read the whole "XML JavaBeans" series:
- Part 1. Make JavaBeans mobile and interoperable with XML
- Part 2. Automatically convert JavaBeans to XML documents
- Part 3. Integrate the XMLBeans package with the Java core
For some, XML seems to be one of those ideas that, while exciting at first, isn't entirely usable in practice. How would a developer use XML in a real-life system? What good is the ability to define custom tags if no browsers understand them? In this month's column, we'll look at a possible application of XML -- namely, using it as a serialization format for JavaBeans.
First, you'll read a quick rundown of what XML is and why so many people are so excited about it. Next, you'll hear about the World Wide Web Consortium's (W3C's) Document Object Model, the proposed standard for representing documents as data structures. As an example of processing a document as a data structure, we'll describe a very small custom markup language, and then implement a class that reads an XML file and transforms it into a JavaBean.
Please note that the primary purpose of this article is to provide an example of XML in use. While it is not an introduction to XML for the complete novice, this article should be comprehensible with just a bit of preparatory reading (see the introductory articles listed in the Resources section).
What's wrong with HTML, anyway?
There's a great deal of introductory material on the Web about XML, so we're going to go over XML basics pretty quickly. Let's start by discussing why XML is necessary in the first place.
It's easy to make the argument that HTML enabled the explosion of the Web. Among the many strengths that have made HTML the dominant format for Web documents are the following:
- HTML is very easy to learn and use. Practically anyone with a pulse can learn to write HTML. Reading HTML in a Web browser is so simple and intuitive that just about everyone grasps it instantly.
- Logical layout makes HTML documents portable. HTML markup describes to a browser what roles various pieces of text play in a document (title, list element, and so on), and the browser is free to decide how (or if) to display them. This provides a great deal of device independence.
- Hypertext forms webs of knowledge. One of the most useful features of HTML for many applications is the ability to make information "come alive" and refer to other information.
- HTML forms a framework for composite documents. The addition of applets and other sorts of "active" page elements provides immense creative control to developers on the Web "platform."
Despite these and the many other strengths that make HTML so useful and, well, cool, it has some serious drawbacks that are rapidly becoming obstacles to using it in serious data applications:
- HTML is a rapidly growing monster. It was originally designed for sharing documents between scientists at CERN. (CERN stands for Conseil Européen pour la Recherche Nucléaire, the European Council for Nuclear Research, though its Web site consistently describes it as the European Laboratory for Particle Physics.) They wanted structured text with some simple outline capabilities, simple hyperlinks, primitive font control, and maybe some pretty pictures and colors, and that's what they created. It was simple, elegant, and useful. It's still useful, but simple and elegant have gone out the window as developers have demanded, and browser creators have developed, new features for HTML. The HTML specification has ballooned to enormous size with the addition of such features as scripting, frames, layers, tables, forms, style sheets, objects, applets, and on and on.
- HTML is set in stone. Within a particular version of the HTML standard, only certain tags, such as <TITLE> or <B> (for boldface), are recognizable HTML tags. If you're working in HTML, you're stuck with the tags recognized by the HTML spec (or your particular browser). If you want to define your own tags for some reason, you're out of luck.
- HTML is very browser-centric. HTML documents are, by and large, plain text with markup to provide display organization, some font control, and graphic content. They are documents written for humans to read, not for client-side programs to analyze and present. Because of this, HTML is not a good choice as an information format for automated data processing systems.
- HTML mostly addresses presentation, not content. Generally, HTML tags describe how or in what context to display a particular piece of text. The semantics of the text, that is, what that text actually means, is lost in HTML.
What do the data mean?
This last deficiency of HTML is the clincher. As data become more mobile in data processing systems, it's necessary to transfer both the information and meta-information about what the data mean. A number in an HTML table may or may not reliably mean something when the document is read by a program. An XML document can be designed to express not simply how to display the data, but what the data mean.
For example, an HTML table can display statistics for an individual baseball player, as in Figure 1.
<TABLE>
  <TR ALIGN="CENTER" VALIGN="BOTTOM" BGCOLOR="#008080">
    <TD><B>NO.</B></TD>
    <TD><B>PLAYER</B></TD>
    <TD><B>High School</B></TD>
    <TD><B>AB</B></TD>
    <TD><B>R</B></TD>
    ... (and so on) ...
Figure 1. Batting averages in an HTML <TABLE>
A row-column representation of these data is fine if what's needed is simply a static display of data in this particular format, but it's not a great representation if you want to associate meaning with the data in your application. Try writing a program that reads the HTML above, retrieves the information about, say, the hitter's runs-batted-in, and then does something with that quantity. With HTML, that's not easy to do in a general way. Imagine, though, that your data file looked something like what appears in Figure 2:
<?xml version="1.0"?>
<Player>
  <Name><First>Jonas</First><Last>Grumby</Last></Name>
  <Number>12</Number>
  <HighSchool>Eaton</HighSchool>
  <Stats>
    <Year>1997</Year>
    <AtBats>69</AtBats>
    <Runs>31</Runs>
    <Hits>30</Hits>
    <HomeRuns>2</HomeRuns>
    <RunsBattedIn>15</RunsBattedIn>
  </Stats>
</Player>
Figure 2. The batting information specified in XML
Figure 2 is a sample of XML that represents the same information as in Figure 1. It would be easy to pick out the "runs-batted-in" statistic in this document. The document could change structure radically, and the <RunsBattedIn> tag would still be relatively easy to find. The XML code in Figure 2 contains the same information as the HTML code in Figure 1, but it's represented in a way that indicates what the data mean, not just how to present the data.
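To make that claim concrete, here is a minimal sketch of looking up a statistic by tag name rather than by position. It assumes the JAXP parser (javax.xml.parsers) bundled with current JDKs, which is not something this article itself relies on, and the class name StatFinder and method firstValue are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any package described in this article.
final class StatFinder {

    /**
     * Returns the trimmed text of the first element named tagName,
     * no matter how deeply it is nested, or null if there is none.
     */
    static String firstValue(String xml, String tagName) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList matches = doc.getElementsByTagName(tagName);
            return matches.getLength() == 0
                ? null
                : matches.item(0).getTextContent().trim();
        } catch (Exception e) {
            // Malformed input or parser setup failure
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String original = "<Player><Stats><Year>1997</Year>"
            + "<RunsBattedIn>15</RunsBattedIn></Stats></Player>";
        // Same tag, radically different nesting
        String restructured = "<Team><Player><Season><Batting>"
            + "<RunsBattedIn>15</RunsBattedIn>"
            + "</Batting></Season></Player></Team>";

        System.out.println(firstValue(original, "RunsBattedIn"));
        System.out.println(firstValue(restructured, "RunsBattedIn"));
    }
}
```

Both calls find the same value, because the lookup depends on the tag name and not on the row and column in which the number happens to sit.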
Just as in HTML, a style sheet can be associated with XML, though XML's style language, XSL, is more powerful and cryptic than HTML's Cascading Style Sheets. In fact, XSL can convert XML into HTML for display by a browser! The XML above could be displayed in a browser just as it appears in Figure 1, but client-side programs could also collect and use such statistics, since there's an indication (via the tag) of what the data mean.
You may be wondering how I knew what tags to use in creating my sample XML file. Where did the tag names (like RunsBattedIn) come from? The answer is: I made them up. I just invented markup tags for my application out of thin air! Creating a new markup language is just like creating any other kind of custom file format. A developer simply creates a file format that meets the needs of the application. XML files are special in that they conform to the XML definition, and so programs that process them can expect input of a certain structure, and can reasonably reject inputs that don't follow that structure.
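As a rough illustration of "rejecting inputs that don't follow that structure," the sketch below asks a parser whether a string is well-formed XML. It assumes the JAXP parser bundled with current JDKs; the class name WellFormedCheck and the method isWellFormed are hypothetical names chosen for this example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only.
final class WellFormedCheck {

    /** Returns true if the parser accepts the input as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input (mismatched tags, stray quotes, ...)
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("Parser could not be used", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<Stats><Hits>30</Hits></Stats>"));
        System.out.println(isWellFormed("<Stats><Hits>30</Stats>"));
    }
}
```

Note that this checks only well-formedness. The made-up tag names are accepted because nothing here says which tags are allowed.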
In the example above, I've created a new XML sublanguage simply by inventing new tags and using them consistently. XML also provides the option of specifying a Document Type Definition (DTD), which is a specification of what elements form a valid document. A developer has much more control over the format of an XML document with a DTD than without one. We're not going to cover DTDs in this article, but they are a core XML concept.
If you think XML looks like HTML, it's because they're close cousins. Both descend from SGML (Standard Generalized Markup Language), which is a metalanguage -- that is, a language for describing languages. SGML is an extremely powerful, flexible, and complex tool, and its complexity has led to its use primarily in huge organizations, like governments and large corporations. HTML is an application of SGML, and is actually specified as a DTD in SGML. XML is a subset of SGML that retains most of SGML's power while simplifying it for use by common mortals. (Are you burned out on acronyms yet?)
Referring again to Figure 2, notice that the XML indicates what the data mean, not how they are to be displayed. Notice also that the tags certainly are not standard HTML. (Let's hope that the <RunsBattedIn> tag is never made part of the HTML standard!) This example shows one of the strengths of XML: the ability to define custom markup tags to suit a particular application. Finally, notice that the batting average doesn't appear in the XML. That's because the average could be calculated from the other values.
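For instance, a batting average is hits divided by at-bats, so a program holding the Figure 2 values can derive it on demand. The following is a small sketch with a hypothetical class name, BattingAverage:

```java
import java.util.Locale;

// Hypothetical helper for illustration only.
final class BattingAverage {

    /** Batting average = hits / at-bats. */
    static double of(int hits, int atBats) {
        if (atBats <= 0) {
            throw new IllegalArgumentException("atBats must be positive");
        }
        return (double) hits / atBats;
    }

    /** Formats the average to three decimal places. */
    static String format(int hits, int atBats) {
        // Locale.ROOT keeps the decimal point a '.' on every machine
        return String.format(Locale.ROOT, "%.3f", of(hits, atBats));
    }

    public static void main(String[] args) {
        // Hits and AtBats from Figure 2: 30 / 69
        System.out.println(format(30, 69)); // prints 0.435
    }
}
```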
One of XML's most powerful abilities is that, with XML, a system designer can create a custom data markup language that maintains the semantics of the data.
Now that we've seen what an XML file looks like, let's find out how to process XML in Java.
The Document Object Model (DOM)
Languages like HTML and XML are often called structured markup languages. This means that the markup in a file has a particular structure that means something to the applications that process it. For example, HTML files start with a <HEAD> element, which contains a <TITLE>, which contains text, and so on. An HTML browser displays the information in an HTML file based on this structure.
The structure of XML and HTML is so simple and consistent that it's very easy to represent any XML or HTML document as a tree of objects, whether in Java or in some other programming language. The World Wide Web Consortium (W3C) has defined a complete set of objects to be used for processing XML and HTML documents as trees. The specification for this set of objects is called the Document Object Model (DOM; see the link to the DOM spec in Resources below). Let's go back to our example and see how to represent a document as a tree.
Take another look at the XML in Figure 2. You'll notice that the document has a hierarchical structure: elements contain other elements. For example, the <Name> element contains a <First> element and a <Last> element. This "contains" relationship could be represented in a class diagram, as shown in Figure 3. In this diagram, a class Player contains an integer Number and a string High School, and also contains references to two other objects, a PersonName object and a Statistics object. Each of these objects contains other objects, as well.
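In Java, that containment might be sketched as follows. The class names Player, PersonName, and Statistics come from the description above. The field names are assumptions inferred from the tags in Figure 2, and these are plain immutable classes rather than full JavaBeans with no-argument constructors and accessor methods.

```java
// A name contains a first and a last name.
final class PersonName {
    final String first;
    final String last;

    PersonName(String first, String last) {
        this.first = first;
        this.last = last;
    }
}

// One season of batting statistics (field names assumed from Figure 2).
final class Statistics {
    final int year;
    final int atBats;
    final int runs;
    final int hits;
    final int homeRuns;
    final int runsBattedIn;

    Statistics(int year, int atBats, int runs, int hits,
               int homeRuns, int runsBattedIn) {
        this.year = year;
        this.atBats = atBats;
        this.runs = runs;
        this.hits = hits;
        this.homeRuns = homeRuns;
        this.runsBattedIn = runsBattedIn;
    }
}

// A player contains two simple values and references to two other objects.
final class Player {
    final PersonName name;
    final int number;
    final String highSchool;
    final Statistics stats;

    Player(PersonName name, int number, String highSchool, Statistics stats) {
        this.name = name;
        this.number = number;
        this.highSchool = highSchool;
        this.stats = stats;
    }

    public static void main(String[] args) {
        // The values from Figure 2
        Player p = new Player(new PersonName("Jonas", "Grumby"), 12, "Eaton",
                              new Statistics(1997, 69, 31, 30, 2, 15));
        System.out.println(p.name.last + " RBI: " + p.stats.runsBattedIn);
    }
}
```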
HTML documents have this "contains" structure, too, but in an HTML document, the structure reflects information about the document (the title, headings, and so on) instead of reflecting the structure of the information represented by the document. The XML document above reflects how information relates to other information.
In general, you can think of an XML (or HTML) document as a tree structure, with the "contains" relationship in the document corresponding to a parent-child relationship in the tree. One possible graphic representation of this idea appears in Figure 4. (The particular shapes I chose for the document aren't standard notation. They simply distinguish different object types.)
So, a program can represent an XML document as a tree of Element objects, each of which may contain other Element objects and Text objects. The entire document is rooted in a single Document object. The Text objects hold the character data of the elements that contain them.
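The tree just described can be walked with a few lines of recursive code. The sketch below uses the org.w3c.dom interfaces together with the JAXP parser bundled with current JDKs, which is an assumption about your environment and not the parser this series uses. TreePrinter and outline are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
final class TreePrinter {

    /** Parses the XML and returns an indented outline of its DOM tree. */
    static String outline(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc, 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static void walk(Node node, int depth, StringBuilder out) {
        String label;
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
                label = "Document";
                break;
            case Node.ELEMENT_NODE:
                label = "Element " + node.getNodeName();
                break;
            case Node.TEXT_NODE: {
                String text = node.getNodeValue().trim();
                if (text.isEmpty()) {
                    return; // skip whitespace between tags
                }
                label = "Text \"" + text + "\"";
                break;
            }
            default:
                return; // ignore comments and other node types
        }
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(label).append('\n');
        // The "contains" relationship becomes parent-child recursion
        for (Node child = node.getFirstChild(); child != null;
             child = child.getNextSibling()) {
            walk(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        System.out.print(outline(
            "<Name><First>Jonas</First><Last>Grumby</Last></Name>"));
    }
}
```

Each level of indentation in the output corresponds to one step down the tree, from the Document to its Element objects to the Text objects that carry the data.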