XML has two main advantages: first, it offers a standard way of structuring data, and, second, we can specify the vocabulary the data uses. We can define the vocabulary (what elements and attributes an XML document can use) using either a document type definition (DTD) or the XML Schema language.
DTDs were inherited from XML's origins as SGML (Standard Generalized Markup Language) and, as such, are limited in their expressiveness. DTDs are for expressing a text document's structure, so all entities are assumed to be text. The XML Schema language more closely resembles the way a database describes data.
Schemas provide the ability to define an element's type (string, integer, etc.) and much finer constraints (a positive integer, a string starting with an uppercase letter, etc.). DTDs offer little flexibility over the ordering of elements; schemas have a more flexible range of options (elements can be optional as a group, in any order, in strict sequence, etc.). Finally, schemas are written in XML, whereas DTDs have their own syntax.
As you'll see in this article, schemas themselves are quite straightforward—I find them easier than DTDs as there is no extra syntax to remember. The difficulties arise in using XML Namespaces and in getting the Java parsers to validate XML against a schema.
In this article, I first cover the basics of XML Schema, then validate XML against some schema using several popular APIs, and finally cover some of the more powerful elements of the XML Schema language. But first, a short detour.
XML, the XML Schema language, XML Namespaces, and a whole range of other standards (such as Cascading Style Sheets (CSS), HTML and XHTML, SOAP, and pretty much any standard that starts with an X) are defined by the World Wide Web Consortium, otherwise known as the W3C. A document is XML only if it conforms to the XML Recommendation issued by the W3C.
Various experts and interested parties gather under the umbrella of the W3C and, after much deliberation, issue a recommendation. Companies, individuals, or foundations such as Apache then write implementations of those recommendations.
This article's documents combine three of these recommendations: XML, XML Schema, and XML Namespaces.
XML exists in two versions: 1.0 defined in 1998 and 1.1 defined in 2004. XML 1.1 adds very little to 1.0: support for defining elements and attributes in languages such as Mongolian or Burmese, support for IBM mainframe end-of-line characters, and almost nothing else. For the vast majority of applications, these changes are not needed. Plus, a document declared as XML 1.1 will be rejected by a 1.0 parser. So stick with 1.0.
For an application to accept an XML document, it must be both well formed and valid. These terms are defined in the XML 1.0 Recommendation, with XML Schema extending the meaning of valid.
To be well formed, an XML document must follow these rules:
Every element is either empty (<this />) or has a closing tag.
Elements are properly nested (<this><and></this></and> is not allowed).
The characters <, >, and & outside of tags are replaced by &lt;, &gt;, and &amp;.
For the full formal details, see Resources.
When producing XML, remember to escape text fields that might contain special characters such as &. This is a common oversight.
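As a rough illustration of that escaping step, here is a minimal sketch of a helper that replaces the special characters before text is written into a document. The class and method names (XmlEscape, escape) are made up for this example; in practice an XML-writing API would normally do this for you.

```java
public class XmlEscape {

    /** Replaces the characters that are special in XML text and attribute values. */
    public static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break; // only needed inside attribute values
                case '\'': out.append("&apos;"); break; // only needed inside attribute values
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints <fullname>Jones &amp; Sons &lt;Ltd&gt;</fullname>
        System.out.println("<fullname>" + escape("Jones & Sons <Ltd>") + "</fullname>");
    }
}
```

The ampersand has to be handled like every other character in a single pass, as above; running separate replace calls in the wrong order would escape the & of an entity that was just produced.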
A document that is not well formed is not really XML and doesn't conform to the W3C's stipulations for an XML document. A parser will fail when given that document, even if validation is turned off.
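To see this behavior in practice, the following sketch parses a string with the JDK's standard DOM parser (JAXP) with validation explicitly turned off and reports whether the parse succeeded. The class name WellFormedCheck and its method are invented for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class WellFormedCheck {

    /** Returns true if the parser accepts the text, false if it reports a parse error. */
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false); // no validation: only well-formedness is checked
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser gave up on the document (the JDK parser may also log to stderr).
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<order><user /></order>"));    // properly nested
        System.out.println(isWellFormed("<this><and></this></and>"));   // overlapping tags
        System.out.println(isWellFormed("<name>Tom & Jerry</name>"));   // unescaped &
    }
}
```

Only the first of the three calls in main should be accepted; the other two break the nesting and escaping rules listed above.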
To be valid, a document must be well formed, it must have an associated DTD or schema, and it must comply with that DTD or schema. Ensuring a document is well formed is easy. In this article, we focus on ensuring our documents are valid.
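The difference between well formed and valid can be sketched with the JDK's javax.xml.validation package. The one-element schema below is a made-up example with no target namespace, chosen only to keep the sketch small; it is not the schema used in the rest of this article, and the class name ValidityCheck is likewise an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class ValidityCheck {

    // Hypothetical schema: a single <quantity> element holding a positive integer.
    static final String SCHEMA =
          "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
        + "  <xsd:element name='quantity' type='xsd:positiveInteger' />"
        + "</xsd:schema>";

    /** Returns true if the document is well formed and complies with SCHEMA. */
    public static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(SCHEMA)));
            Validator validator = schema.newValidator();
            // With no custom error handler, the first validation error is thrown.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well formed, or well formed but not valid
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<quantity>3</quantity>"));      // complies with the schema
        System.out.println(isValid("<quantity>three</quantity>"));  // well formed, not valid
        System.out.println(isValid("<quantity>3"));                 // not even well formed
    }
}
```

The second document is the interesting case: any parser will read it without complaint, and only the schema reveals that its content is wrong.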
Let's get right down to it. First, we're going to need an XML file to validate.
Let's assume we have a client (say a terminal in a shop) that posts an XML order back to a server. The XML might look like this:
<?xml version="1.0" encoding="UTF-8"?>
<order>
  <user>
    <fullname>Bob Jones</fullname>
    <deliveryAddress>
      123 This road,
      That town,
      Bobsville
    </deliveryAddress>
  </user>
  <products>
    <product id="12345" quantity="1" />
    <product id="3232" quantity="3" />
  </products>
</order>
Save this document somewhere. We will use it later in this article to try out validation and some interesting schema rules.
The first line, <?xml version="1.0" encoding="UTF-8"?>, is the XML declaration, which opens the document's prolog. It is optional in XML 1.0 and compulsory in XML 1.1. If it is absent, parsers assume we're using XML 1.0, but we like to be thorough.
For the server to validate our XML, we need a schema:
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            elementFormDefault="qualified"
            xmlns="urn:nonstandard:test"
            targetNamespace="urn:nonstandard:test">
  <xsd:element name="order" type="Order" />
  <xsd:complexType name="Order">
    <xsd:all>
      <xsd:element name="user" type="User" minOccurs="1" maxOccurs="1" />
      <xsd:element name="products" type="Products" minOccurs="1" maxOccurs="1" />
    </xsd:all>
  </xsd:complexType>
  <xsd:complexType name="User">
    <xsd:all>
      <xsd:element name="deliveryAddress" type="xsd:string" />
      <xsd:element name="fullname">
        <xsd:simpleType>
          <xsd:restriction base="xsd:string">
            <xsd:maxLength value="30" />
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:element>
    </xsd:all>
  </xsd:complexType>
  <xsd:complexType name="Products">
    <xsd:sequence>
      <xsd:element name="product" type="Product" minOccurs="1" maxOccurs="unbounded" />
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="Product">
    <xsd:attribute name="id" type="xsd:long" use="required" />
    <xsd:attribute name="quantity" type="xsd:positiveInteger" use="required" />
  </xsd:complexType>
</xsd:schema>
Save this schema as test.xsd in the same directory as the XML document. For the moment, ignore the root node's attributes and the fact that everything is prefixed with xsd.
The first entry after the root schema element is:
<xsd:element name="order" type="Order" />
This says our document will have an element called order of type Order. This element is a global declaration (with scope like a global variable). In fact, it is our only global element, so it will be the root element of any document that conforms to this schema.