Breaking news in XML

Despite the scanty turnout, the recent XTech 2000 show produced several important XML/Java-related announcements

Although sparsely attended, the recent XTech 2000 conference, held February 28 to March 2 in San Jose, Calif., featured some notable XML/Java announcements:

  • JAXP: The Java standard for XML
  • XML: Unexpected security holes
  • EasySAX: A better parsing mechanism?
  • SML: Is a simpler XML a good idea?
  • RELAX: Your schema is here!

JAXP: The Java standard for XML

Sun announced the Java API for XML Parsing (JAXP) standard, released February 25. The API comes with a reference implementation, but developers can plug in parsers from different vendors without changing their programs.

In a sense, this announcement is long overdue -- the Simple API for XML (SAX) and Document Object Model (DOM) standards have been around for quite a while. Sun's strategy here is to implement only accepted standards, defining the bare minimum of additional APIs necessary to package them together in a way that makes parsers into a pluggable commodity for developers. Even those APIs had to go through Sun's Java Community Process (JCP), both to achieve the best possible standard and to forestall any impression that Sun was trying to dictate standards in any way. So the initial JAXP APIs took quite a bit of time.
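To make the pluggability idea concrete, here is a minimal sketch. It assumes the javax.xml.parsers API as it ships in current JDKs and the SAX 2 DefaultHandler class, both of which postdate the initial JAXP release described above; the class name JaxpSketch and the method countElements are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; not part of any Sun sample code.
final class JaxpSketch {

    /** Counts the elements in an XML string using whichever SAX parser JAXP locates. */
    static int countElements(String xml) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        };
        try {
            // The factory lookup is the pluggability layer:
            // no vendor-specific parser class is named anywhere in this program.
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println(countElements("<shoe><size>9</size></shoe>"));
    }
}
```

Because the program asks a factory for a parser rather than constructing one, a different vendor's parser can be selected at launch time (for example, by setting the javax.xml.parsers.SAXParserFactory system property to that vendor's factory class) without touching the code.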

James Davidson, specification lead for Sun, also announced that the next version of JAXP would cover DOM Level 2, the XSLT stylesheet/transformation specification, and, if the specification is completed in time, SAX level 2. Presumably, those implementations will take much less time, since they won't be stalled by the need to define the pluggability layer that JAXP provides.

XML: Unexpected security holes

David Megginson, of Megginson Technologies, gave an amusing yet ultimately serious talk on XML's potential vulnerability to content vandalism by even unsophisticated hackers. The problems mostly stem from the ability to reference remote stylesheets in a document. A highly secure industrial system might reference a large stylesheet on a relatively insecure campus computer, for example.

A cracker could then modify that stylesheet in ways that changed the perceived content of the page. Megginson used some amusing examples to demonstrate the potential results of such an attack. He showed that a bolded "not" in a sentence could be changed to match the background, making it disappear. (If "not" came at the end of a line, its disappearance might not be noticed, drastically changing the sentence's meaning.) In another example, Megginson showed how the ability to add decorations to a line in a stylesheet would make it possible to add the words "BIG LIE:" to the beginning of a list item.

The bottom line for industry: most potential stylesheet security problems can be avoided by copying stylesheets to a secure local area and referencing them there. That might not be the ideal answer, but it is a highly effective, relatively low-cost solution that is likely to be the norm for years to come.
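One way to act on that advice is to audit documents for stylesheet references that point off the local system. The following is a rough sketch of such a check, not anything Megginson presented: it assumes stylesheets are attached with the xml-stylesheet processing instruction, treats only http, https, and ftp URLs as remote, and uses the hypothetical names StylesheetAudit and remoteStylesheets.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical audit helper, for illustration only.
final class StylesheetAudit {

    // Pulls the href pseudo-attribute out of the processing instruction's data.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']*)[\"']");

    /** Returns the hrefs of xml-stylesheet instructions that point at remote hosts. */
    static List<String> remoteStylesheets(String xml) {
        List<String> remote = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void processingInstruction(String target, String data) {
                if (!"xml-stylesheet".equals(target) || data == null) {
                    return;
                }
                Matcher m = HREF.matcher(data);
                // Assumption: only these URL schemes count as "remote".
                if (m.find() && m.group(1).matches("(?i)(https?|ftp)://.*")) {
                    remote.add(m.group(1));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return remote;
    }
}
```

A document that passes this check can still be vulnerable in other ways, such as a local stylesheet that itself imports a remote one, so the sketch is a starting point rather than a complete defense.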

EasySAX: A better parsing mechanism?

Paul Prescod, a consulting engineer at Isogen, introduced a novel approach to XML processing called EasySAX, though a more accurate name might have been BetterDOM or SmallerDOM. He implemented his parser in Python, but the interest it will generate makes a Java implementation likely.

Prescod noted that SAX programming requires that you write your own dispatch code when processing elements. When a SAX startElement event occurs, for example, you must write code like this:

   if (element.equals("shoe")) {
      ...
   } else if (element.equals("size")) {
      ...
   }

Furthermore, if the processing you do for an element depends on the current context, then you have to save your own state, as seen in this example:

  startElement(String element) {
    if (element.equals("title")) titleText = true;
    ...
  }
  characters(...) {
    if (titleText) {
       ++fontSize;  bold = true;
       ...
    } else {
       ...
    }
  }

One goal for EasySAX, then, was to eliminate such issues by allowing context-sensitive processing.
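The flavor of context-sensitive processing can be sketched in Java on top of ordinary SAX. What follows is not EasySAX's API, which was written in Python and is not reproduced here. It is a hypothetical illustration in which handlers are registered against an element path, so the dispatch code and the state flags from the fragments above disappear from application code. The names PathDispatch, on, and run are assumptions, and the sketch ignores mixed content and namespaces.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical path-keyed dispatcher; not the EasySAX design itself.
final class PathDispatch extends DefaultHandler {
    private final Map<String, Consumer<String>> handlers = new HashMap<>();
    private final Deque<String> path = new ArrayDeque<>();   // root first
    private final StringBuilder text = new StringBuilder();

    /** Registers a handler for the text of elements at a path such as "book/title". */
    PathDispatch on(String elementPath, Consumer<String> handler) {
        handlers.put(elementPath, handler);
        return this;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        path.addLast(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The current path is the context; look up the handler registered for it.
        Consumer<String> handler = handlers.get(String.join("/", path));
        if (handler != null) {
            handler.accept(text.toString().trim());
        }
        path.removeLast();
        text.setLength(0);
    }

    /** Parses the document, firing registered handlers as matching elements close. */
    void run(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), this);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
    }
}
```

With this in place, a book title and a chapter title can be given different treatment simply by registering one handler for "book/title" and another for "book/chapter/title"; no titleText flag is needed.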

Another goal for EasySAX: improve on the DOM mechanism by putting into memory only those parts of the tree that you visit, rather than the entire tree. That makes it possible to process huge data sets that would be difficult to process any other way. To improve efficiency, Prescod's mechanism also allows the developer to ask that an entire subtree be loaded into memory. (For more information, see the Resources section below.)
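A rough Java approximation of the partial-tree idea is to stream through a document with SAX and build DOM nodes only for the subtrees you ask for, discarding everything else as it goes by. The sketch below is an assumption-laden illustration rather than Prescod's mechanism: SubtreeLoader and load are hypothetical names, selection is by element name only, and namespaces are ignored.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: materializes only the requested subtrees as DOM nodes.
final class SubtreeLoader extends DefaultHandler {
    private final String wanted;
    private final Document doc;                       // node factory only
    private final List<Element> subtrees = new ArrayList<>();
    private Node current;                             // null while outside a wanted subtree

    private SubtreeLoader(String wanted, Document doc) {
        this.wanted = wanted;
        this.doc = doc;
    }

    /** Returns one detached DOM subtree per outermost element named {@code wanted}. */
    static List<Element> load(String xml, String wanted) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            SubtreeLoader loader = new SubtreeLoader(wanted, doc);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), loader);
            return loader.subtrees;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (current == null && !qName.equals(wanted)) {
            return;                                   // outside a wanted subtree: build nothing
        }
        Element e = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            e.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        if (current == null) {
            subtrees.add(e);                          // root of a new subtree
        } else {
            current.appendChild(e);
        }
        current = e;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current != null) {
            current = current.getParentNode();        // null once the subtree root closes
        }
    }
}
```

Memory use then scales with the subtrees retained rather than with the whole document, which is the property that makes very large inputs tractable.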

The combination of large-model processing, efficient in-memory representation, and context-sensitive processing capability creates an appealing mechanism for processing XML. There is every reason to think that some clever Java programmer will take the same basic idea and implement it in Java.

SML: Is a simpler XML a good idea?

One of the more controversial proposals was the concept that XML -- and the parsers that depend on it -- should be simplified by taking away some of the little-used ingredients, like notations, that cause parser-development headaches. The resulting slimmed-down XML would be called simple XML, or SML.

Software AG's Mike Champion, author Simon St. Laurent, and DocuVerse's Don Park led the pro-SML discussion. They stressed that simplifying XML would mean smaller, faster, and easier-to-develop parsers, which in turn would make it easier to embed XML processing in small devices. It would also make it easier for plain-text filters to be built using Perl scripts and the like, because the number of strange, seldom-used cases would decrease.

There was no clear agreement, however, on the right ingredients to eliminate. The SML proposal calls for the elimination of attributes and CDATA sections, as well as processing instructions, comments, notations, DTDs, mixed content (text and elements), and external parsed entities. The so-called Common XML proposal, by contrast, aims at voluntary restrictions on which parts of XML you use. It leaves in attributes and mixed content, but in other respects calls for XML users to leave out the same XML constructs that SML disallows.
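To get a feel for how much existing XML would fall outside such a subset, one could scan documents for the constructs in question. The sketch below is a hypothetical checker, not part of either proposal: it flags only attributes, processing instructions, comments, and CDATA sections (the constructs that map directly onto SAX callbacks) and leaves out DTDs, notations, mixed content, and external entities. The names SmlCheck and disallowedConstructs are assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical checker covering a subset of the constructs the SML proposal would drop.
final class SmlCheck {

    /** Returns the names of the flagged constructs that appear in the document. */
    static Set<String> disallowedConstructs(String xml) {
        Set<String> found = new TreeSet<>();
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (atts.getLength() > 0) {
                    found.add("attribute");
                }
            }

            @Override
            public void processingInstruction(String target, String data) {
                found.add("processing instruction");
            }

            @Override
            public void comment(char[] ch, int start, int length) {
                found.add("comment");
            }

            @Override
            public void startCDATA() {
                found.add("CDATA section");
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // Comments and CDATA boundaries are reported only to a registered lexical handler.
            parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return found;
    }
}
```

Under Common XML's looser rules, the "attribute" finding would simply be ignored, since that proposal leaves attributes in.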

For an opposing voice, Evan Lenz, a student at North Seattle Community College, advanced an interesting philosophical argument against the proposed simplification. He posited that XML's real power stems from the standardization it enjoys. Because so many parsers, utilities, and XML-based languages are coming online, and because XML erases the distinction between documents and data, a whole host of fascinating applications, including AI projects, are becoming possible. If that train is to stay on track, he argued, XML should not be simplified in any way.

As is the case in most interesting arguments, neither camp is entirely wrong. XML could benefit from simplification, most notably in the area of notations -- an SGML holdover for binary (multimedia) objects that is better done with MIME standards.

On the other hand, anyone attempting to simplify the standard needs to ask a fundamental question: what basic subset must we keep to support all of the XML tools we have developed, and still need to develop? For example, the RELAX schema specification, discussed in more detail below, uses attributes. So perhaps attributes really need to be retained. The RELAX presentation also implied the need for mixed content, so perhaps that is necessary as well.

One questioner pointed out that CDATA sections require a lot of parser code, which makes XML processing difficult. But without CDATA sections, how could you put a line drawing into XML? Would you be forced to use graphic-authoring tools? More importantly, since Extensible Stylesheet Language (XSL) uses CDATA sections for embedding processing scripts in a stylesheet, CDATA would appear to be necessary there as well.

Similarly, without external-parsed entities, how would you reference material from another document and include it inline? To be fair, SML is targeted at a world of pure data, rather than at defining a general-purpose standard useful for both data and documents, as XML is. For that purpose, then, perhaps external references are not required. (The RELAX schema standard described below doesn't include them, either, so perhaps they really are more trouble than they are worth.)

So, while some simplification seems like a good idea, it is not clear that we know exactly which simplification is in order. For the time being, then, we should probably let things sit -- right after we get rid of notations.

RELAX: Your schema is here!

Members of the development community have been eagerly awaiting a schema standard they could sink their teeth into. Schemata perform serious data validation and are fundamental to the process of automatically generating Java classes for XML data. Consequently, the need for a schema standard is strong.

(For a more detailed discussion on the advantages of schemata over DTDs, see the Sidebar below.)

However, the hoped-for W3C XML Schema standard remains in development. The industry players who are developing the standard have a long list of must-have features. The eventual result, by all accounts, will turn out to be something of a monolith. It will do what everyone says they need it to do, but it's going to take a lot of complex code to do it, and it's taking quite a bit of time for it to take shape.

Meanwhile, a former member of the schema-standards team came up with a better way, the Regular Language description for XML, or RELAX. This Japanese standard is due to be submitted as a fast-track ISO proposal this summer. Makoto Murata, its author, took what appears in retrospect to be a simple idea: take the DTD, reformulate it in XML, take advantage of the structuring to provide context-sensitive definitions, and add the vitally important content validation.

The result: a specification that is simpler than XML Schema Part 1 (Structures), but which includes all of the content-validation mechanisms contained in XML Schema Part 2 (Datatypes). What may be as important to developers as the quality and simplicity of the standard, though, is the fact that it is available today. In addition, RELAX aims to future-proof your schema definitions by making them compatible with the parts of XML Schema that have already been defined.

At the moment, the RELAX project is at the alpha stage. The full specification and implementation should be available by the end of the summer, but a tutorial is available now (see Resources). Note: the RELAX English translation is almost finished.

The RELAX DTD translator available now allows developers to move away from DTDs immediately. In addition, the promise of compatibility with the XML Schema means that nothing is lost by making the translation. Due to that intelligent combination, RELAX seems well-positioned for early adoption by developers hungry for a schema. And if it fulfills its promise, it may well become the de facto standard for a long time to come.

Eric Armstrong has been programming and writing professionally since before there were personal computers. His production experience includes artificial intelligence (AI) programs, system libraries, real-time programs, and business applications in a variety of languages. He is currently on contract at Sun's Java Software division in the San Francisco Bay Area, and he is a regular contributor to JavaWorld. He wrote The JBuilder2 Bible and authored the Java/XML programming tutorial available at http://www.java.sun.com/xml.

Learn more about this topic
