Practical XML Schema

A Java programmer's guide to XML Schema and Namespaces

XML has two main advantages: first, it offers a standard way of structuring data, and, second, we can specify the vocabulary the data uses. We can define the vocabulary (what elements and attributes an XML document can use) using either a document type definition (DTD) or the XML Schema language.

DTDs were inherited from XML's origins as SGML (Standard Generalized Markup Language) and, as such, are limited in their expressiveness. DTDs are for expressing a text document's structure, so all entities are assumed to be text. The XML Schema language more closely resembles the way a database describes data.

Schemas provide the ability to define an element's type (string, integer, etc.) and much finer constraints (a positive integer, a string starting with an uppercase letter, etc.). DTDs enforce a strict ordering of elements; schemas have a more flexible range of options (elements can be optional as a group, in any order, in strict sequence, etc.). Finally, schemas are written in XML, whereas DTDs have their own syntax.

As you'll see in this article, schemas themselves are quite straightforward—I find them easier than DTDs as there is no extra syntax to remember. The difficulties arise in using XML Namespaces and in getting the Java parsers to validate XML against a schema.

In this article, I first cover the basics of XML Schema, then validate XML against a schema using several popular APIs, and finally cover some of the more powerful elements of the XML Schema language. But first, a short detour.

A detour via the W3C

XML, the XML Schema language, XML Namespaces, and a whole range of other standards (such as Cascading Style Sheets (CSS), HTML and XHTML, SOAP, and pretty much any standard that starts with an X) are defined by the World Wide Web Consortium, otherwise known as the W3C. A document is XML only if it conforms to the XML Recommendation issued by the W3C.

Various experts and interested parties gather under the umbrella of the W3C and, after much deliberation, issue a recommendation. Companies, individuals, or foundations such as Apache then write implementations of those recommendations.

This article's documents are a combination of these three recommendations:

  • XML 1.0
  • XML Namespaces
  • XML Schema

XML 1.0 or 1.1

XML exists in two versions: 1.0 defined in 1998 and 1.1 defined in 2004. XML 1.1 adds very little to 1.0: support for defining elements and attributes in languages such as Mongolian or Burmese, support for IBM mainframe end-of-line characters, and almost nothing else. For the vast majority of applications, these changes are not needed. Plus, a document declared as XML 1.1 will be rejected by a 1.0 parser. So stick with 1.0.

Well-formed and valid XML

For an application to accept an XML document, it must be both well formed and valid. These terms are defined in the XML 1.0 Recommendation, with XML Schema extending the meaning of valid.

To be well formed, an XML document must follow these rules:

  • The document must have exactly one root element.
  • Every element is either self closing (like <this />) or has a closing tag.
  • Elements are nested properly (i.e., <this><and></this></and> is not allowed).
  • The document has no stray markup characters in its text. A literal < or & outside of tags must be written as &lt; or &amp;, and > is customarily written as &gt;.
  • Attribute values are quoted.

For the full formal details, see Resources.

When producing XML, remember to escape text fields that might contain special characters such as &. This is a common oversight.
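The escaping rule can be sketched in Java as follows. The class and method names here (XmlEscape, escapeText) are hypothetical and not part of any library; the sketch assumes the string is destined for element text, not for an attribute value.

```java
// Hypothetical helper for illustration only; the names are not from any library.
final class XmlEscape {

    /** Replaces the characters that are special in element text with entity references. */
    static String escapeText(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break; // must always be escaped
                case '<': out.append("&lt;");  break; // must always be escaped
                case '>': out.append("&gt;");  break; // escaped by convention
                default:  out.append(c);
            }
        }
        return out.toString();
    }
}
```

A value written into an attribute would also need its quote characters escaped (for example as &quot;). In practice, an XML writer API usually handles all of this, so a helper like this matters mainly when XML is assembled by string concatenation.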

A document that is not well formed is not really XML and doesn't conform to the W3C's stipulations for an XML document. A parser will fail when given that document, even if validation is turned off.
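As a rough illustration, the following Java sketch parses a string with validation switched off and reports whether the parser accepted it. The class name WellFormedCheck and its method are made up for this example; the parser calls are the standard JAXP ones.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only.
final class WellFormedCheck {

    /** Returns true if the text parses as XML; no DTD or schema validation is requested. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false); // already the default, stated for clarity
            DocumentBuilder builder = factory.newDocumentBuilder();
            // DefaultHandler rethrows fatal errors rather than printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the text as not well formed
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this sketch, isWellFormed("<this><and></this></and>") returns false even though no validation was requested, while isWellFormed("<this><and/></this>") returns true.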

To be valid, a document must be well formed, it must have an associated DTD or schema, and it must comply with that DTD or schema. Ensuring a document is well formed is easy. In this article, we focus on ensuring our documents are valid.

Let's get right down to it. First, we're going to need an XML file to validate.

The XML document

Let's assume we have a client (say a terminal in a shop) that posts an XML order back to a server. The XML might look like this:

<?xml version="1.0" encoding="UTF-8"?>
<order>
    <user>
        <fullname>Bob Jones</fullname>
        <deliveryAddress>
            123 This road,
            That town,
            Bobsville
        </deliveryAddress>
    </user>
    <products>
        <product id="12345" quantity="1" />
        <product id="3232" quantity="3" />
    </products>
</order>

Save this document somewhere. We will use it later in this article to try out validation and some interesting schema rules.

The first line, <?xml version="1.0" encoding="UTF-8"?>, is the XML declaration, which opens the document's prolog. It is optional in XML 1.0 and compulsory in XML 1.1. If it is absent, parsers assume we're using XML 1.0, but we like to be thorough.

The schema

For the server to validate our XML, we need a schema:

 

<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            elementFormDefault="qualified"
            xmlns="urn:nonstandard:test"
            targetNamespace="urn:nonstandard:test">

    <xsd:element name="order" type="Order" />

    <xsd:complexType name="Order">
        <xsd:all>
            <xsd:element name="user" type="User" minOccurs="1" maxOccurs="1" />
            <xsd:element name="products" type="Products" minOccurs="1" maxOccurs="1" />
        </xsd:all>
    </xsd:complexType>

    <xsd:complexType name="User">
        <xsd:all>
            <xsd:element name="deliveryAddress" type="xsd:string" />
            <xsd:element name="fullname">
                <xsd:simpleType>
                    <xsd:restriction base="xsd:string">
                        <xsd:maxLength value="30" />
                    </xsd:restriction>
                </xsd:simpleType>
            </xsd:element>
        </xsd:all>
    </xsd:complexType>

    <xsd:complexType name="Products">
        <xsd:sequence>
            <xsd:element name="product" type="Product" minOccurs="1" maxOccurs="unbounded" />
        </xsd:sequence>
    </xsd:complexType>

    <xsd:complexType name="Product">
        <xsd:attribute name="id" type="xsd:long" use="required" />
        <xsd:attribute name="quantity" type="xsd:positiveInteger" use="required" />
    </xsd:complexType>

</xsd:schema>

Save this schema as test.xsd in the same directory as the XML document. And, for the moment, ignore the root node's attributes and the fact that everything is prefixed with xsd.

The first entry after the root schema element is:

 <xsd:element name="order" type="Order" />  

This says our document will have an element called order of type Order. This element is a global declaration (with scope like a global variable). In fact, it is our only global element, so it will be the root element of any document that conforms to this schema.

An element's type will be either built-in (such as string, long, or positiveInteger) or custom. Custom types can be either a simpleType or a complexType. Simple types are variations on the built-in types: either a restriction, a list, or a union. If the element has children or attributes, its type will always be a complexType. For a full list of built-in types, see Resources.

Our Order is a complex type made up of two elements: user and products. These two elements are local. We cannot refer to them anywhere outside the Order type. This distinction between global and local declarations will prove important when we look at XML Namespaces.

The User type is again made up of two elements. The first, deliveryAddress, is of built-in type string. The second, fullname, lacks a type in its element declaration. Instead, the type is given in-line. This is an anonymous type in that we cannot refer to it anywhere else by name as it doesn't have a name. Anonymous types prevent reuse, and I find them harder to read than named types. Unless a type is simple and unlikely to be reused, avoiding anonymous types is best. The type of fullname is the built-in string type, like deliveryAddress, but with the restriction that it has a maximum length of 30 characters.

The Products type is simply a sequence of product entries. Inside a sequence, a child element can be declared with a maxOccurs greater than 1, so it can appear multiple times; inside all, each child can appear at most once.

Finally, the Product type has two attributes and no body. For an example of a type with both attributes and a body, see the "Database Style Constraints: Primary Keys and Foreign Keys" section that appears later in this article.

Add a schema

We must link the document to the schema. To do this, we only need to change the root element. Thus, the start of the document becomes:

 <?xml version="1.0" encoding="UTF-8"?>
 <order xmlns="urn:nonstandard:test"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="urn:nonstandard:test 
     file:./test.xsd">
   <user>
     (...)

Edit the XML document you saved earlier and change the root element to match the entry above.
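The Java side of validation is covered later in the article. As a brief preview, here is a minimal sketch using the standard javax.xml.validation API, where the schema is handed to the validator explicitly in code. The class name SchemaCheck and its methods are hypothetical, and other APIs would work as well.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only.
final class SchemaCheck {

    /** Returns null if the document is valid, otherwise the first error message reported. */
    static String firstError(Source schemaSource, Source documentSource) {
        Schema schema;
        try {
            SchemaFactory factory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            schema = factory.newSchema(schemaSource); // compiles the schema
        } catch (SAXException e) {
            throw new IllegalArgumentException("the schema itself could not be compiled", e);
        }
        Validator validator = schema.newValidator();
        try {
            validator.validate(documentSource); // throws on the first validation error
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Convenience overload for schema and document held in strings. */
    static String firstError(String schemaXml, String documentXml) {
        return firstError(new StreamSource(new StringReader(schemaXml)),
                          new StreamSource(new StringReader(documentXml)));
    }
}
```

For the files saved earlier, the first overload could be called with new StreamSource(new java.io.File("test.xsd")) and a similar source for the order document; the file name of the order document is whatever you chose when saving it.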

To understand what we have just added, we need to know about XML Namespaces, but first, let's review URIs.

A quick detour via URIs

A Uniform Resource Identifier (URI) is a compact string of characters for identifying an abstract or physical resource. It can be almost anything. An absolute URI has the format <scheme>:<scheme-specific-part>, where <scheme> starts with a letter and is followed by any combination of letters, digits, plus signs (+), periods (.), and hyphens (-); schemes are conventionally written in lowercase. The scheme-specific part can be almost anything. A relative URI doesn't even need the "scheme" part. So this:something is a valid URI, and anythingAtAll is a valid relative URI. To make this workable, a URI is usually a name or a locator.
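To see the absolute/relative distinction in practice, the sketch below feeds a few strings to the standard java.net.URI class, which applies its own reading of the URI syntax rules. The class name UriParts and its method are invented for this example.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration only.
final class UriParts {

    /** Classifies a string as an absolute URI (with its scheme), a relative URI, or neither. */
    static String describe(String text) {
        try {
            URI uri = new URI(text);
            // isAbsolute() is true exactly when the URI has a scheme.
            return uri.isAbsolute()
                    ? "absolute, scheme=" + uri.getScheme()
                    : "relative";
        } catch (URISyntaxException e) {
            return "not a URI";
        }
    }
}
```

Here describe("this:something") gives "absolute, scheme=this", describe("anythingAtAll") gives "relative", and describe("urn:nonstandard:test") gives "absolute, scheme=urn".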

A Uniform Resource Name (URN) identifies a resource permanently, independent of where it is located. Good examples are a book's ISBN or a product's barcode.

A Uniform Resource Locator (URL) identifies a resource by its location. URIs, in the context of XML Namespaces, are nearly always URLs. The URI identifying a namespace is not required to point to a document, so, if the URI is pasted into a browser, it may not find anything. However, as the URI identifying your namespace looks exactly like a URL, users will expect there to be something at that address, so it is good practice to put something there. Sun and the W3C, for example, have pages at their namespace URLs.

This article's example document does not have a URL as its namespace identifier; instead, it has a made-up URN. Though unusual, it helps to show that the namespace identifier is just that: an identifier. In a real application, our root element would probably read:

 

<order xmlns="http://www.mycompany.com/xml/myproject"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.mycompany.com/xml/myproject
       file:./test.xsd">

Namespaces

An XML namespace is a collection of names, identified by a URI reference, which are used in XML documents as element types and attribute names. A namespace in XML is a bit like a package in Java. It groups a set of elements together. The element user in the urn:nonstandard:test namespace differs from an element user in any other namespace.

Only one namespace can be the default—the others must be given a prefix. The xmlns attribute (which comes from the XML Namespaces Recommendation) defines the default namespace—i.e., the namespace for unprefixed elements. The form xmlns:xsd defines the namespace for entries prefixed with xsd (xsd is commonly used for the schema prefix, but any prefix would do).

When defining a schema, we refer to our own types (Order, User, Product, etc.) and use types from the schema namespace (element, complexType, string, etc.). For this reason, we usually prefix the schema namespace. We could also prefix our types instead and use the schema namespace unprefixed. The first part of our schema would then look like this:

 

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        targetNamespace="urn:nonstandard:test"
        elementFormDefault="qualified"
        xmlns:ts="urn:nonstandard:test">

    <element name="order" type="ts:Order" />

    <complexType name="Order">
        <sequence>
            <element name="user" type="ts:User" minOccurs="1" maxOccurs="1" />
            <element name="products" type="ts:Products" minOccurs="1" maxOccurs="1" />
        </sequence>
    </complexType>
    (...)

Prefixed names are called qualified names. They contain a single colon separating the name into a namespace prefix and a local part. The prefix, which is mapped to a URI reference, selects a namespace.
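The following Java sketch shows that the prefix is only a local shorthand: what identifies an element is the pair of namespace URI and local part. It parses a document with a namespace-aware DOM parser and returns the root element's name as a javax.xml.namespace.QName. The class name RootName is made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only.
final class RootName {

    /** Returns the root element's namespace URI and local part, ignoring any prefix. */
    static QName of(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // JAXP parsers are not namespace aware by default
            Element root = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            // A null namespace URI (element in no namespace) becomes the empty string.
            return new QName(root.getNamespaceURI(), root.getLocalName());
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalArgumentException("could not parse the document", e);
        }
    }
}
```

With this sketch, <order xmlns="urn:nonstandard:test"/> and <ts:order xmlns:ts="urn:nonstandard:test"/> produce equal QName values, while a plain <order/> with no namespace declaration does not.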

In writing a schema, we define new elements and attributes. The targetNamespace attribute specifies the namespace these new elements will be a part of. An XML document that conforms to this schema will declare that namespace (via an xmlns or xmlns:prefix attribute).
