Implement complicated data transformations with SAX and XSLT

Standard Java API provides powerful tools for XML data transformations

I was once asked to help in a project that required a simple data transformer for converting raw bill data into different bill layouts. After receiving a brief introduction to the problem, I suggested using XSLT (Extensible Stylesheet Language Transformations).

When I dug deeper into the requirements, it turned out the problem was not as simple as I had first thought. The input data was manageable, but the data needed to perform the transformation simply could not be depicted with a set of static XSLT stylesheets. Part of the transformation data was dynamic and stored in two separate databases. In addition, to produce the bill layouts, the program had to perform relatively complex calculations on the input data using numbers fetched from the two databases. The XSLT solution was quietly forgotten.

The core problem in this case was the data needed for directing the transformation—it was dynamic. In a perfect world, you would never face this issue. Preparation of the input data and the transformation should be clearly separated, so all the information needed for the transformation could easily be included in a single XSLT template. Unfortunately, we don't live in a perfect world, and the requirements of real-life projects are sometimes quite bizarre.

This article suggests one solution to the problem described above. I show by example how the power of SAX (Simple API for XML) can be harnessed to enhance the applicability of XSLT. In addition, I show how XSLT can be used even if neither the input data nor the desired output is XML.

Introduction to XSLT

XSLT is a programming language for transforming XML data. XSLT stylesheets can be applied to transform an XML document into another XML format or practically any other format. While XSLT may not be a simple language to learn—especially to those more familiar with Java-like languages—it is a powerful and flexible way to accomplish relatively complicated data transformations. If you are not familiar with XSLT, plenty of excellent tutorials are available. See, for instance, Chapter 17 of the XML Bible.

Though XSLT is a great language, some tasks are difficult, or nearly impossible, to accomplish with it. Transformations that must combine data fields taken from several elements of the input XML are usually possible, but often extremely difficult to write. If the data directing the transformation is itself dynamic, XSLT alone is not enough. XSLT templates are static in nature, and, while it may be possible to dynamically regenerate the templates, I can't imagine a situation where this would be feasible. (If you have a different opinion, feel free to send me feedback.)

After experimenting with various ideas, I concluded that the easiest way to accomplish complicated transformations using XSLT was to manipulate the input XML before feeding it to the XSLT transformer. This may sound ridiculously complicated and inefficient, but it turns out that, with SAX, manipulating the XML data on the fly is quite easy.

SAX is an event-driven interface for parsing an XML document. As the SAX parser reads XML data, it issues "callback" notifications for the XML constructs it recognizes. For instance, when the parser encounters an XML start tag, it invokes the startElement callback. The name of the tag and other relevant information are passed as parameters of the callback. SAX is a good choice when efficient XML parsing is needed. For more information about SAX, see Sun's tutorial on JAXP. In this article, I use SAX to modify the flow of events before forwarding them to the XSLT transformer.
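To make the callback idea concrete, here is a minimal sketch that is not part of the article's downloadable examples. It registers a handler with the JAXP SAX parser and records each startElement callback. The class name SaxCallbackDemo and the method startTags are made up for illustration, and the code is written for a current JDK rather than J2SE 1.4.2.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative only: class and method names are not from the article's examples.
public class SaxCallbackDemo {

    /** Parses the XML string and returns the tag names in the order startElement fired. */
    public static List<String> startTags(String xml) throws Exception {
        final List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // The parser calls this once for every start tag it encounters.
                names.add(qName);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ORDER_INFO><CUSTOMER GROUP=\"exclusive\">"
                   + "<ID>234</ID></CUSTOMER></ORDER_INFO>";
        System.out.println(startTags(xml)); // prints [ORDER_INFO, CUSTOMER, ID]
    }
}
```

The handler never sees the document as a whole; it only reacts to events as they arrive, which is what makes it possible to alter the event stream in flight.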

SAX and XSLT are both included in JAXP (Java API for XML Processing), which has been a part of J2SE since version 1.4.

Overview of the examples

This section walks through the examples included with this article and shows what SAX and XSLT can do together.

Running the examples

If you are not interested in running the examples, skip this section. This article, however, relies heavily on the code examples, so I advise you to at least look at the code. The examples have been tested with J2SE 1.4.2 on Windows. No other packages are needed to run the applications. The instructions, however, assume you use the Ant build tool. If you don't want to use Ant, you can still build and run the examples, but that requires a bit more work.

Detailed instructions are in the README.txt file included in the zip file downloadable from Resources. Once you have unzipped the package and set the relevant environment variables (explained in README.txt), you can use the following Ant commands:

  • ant build: Erases the files created in the build and compiles the whole source code again
  • ant clean: Erases the files created in the build
  • ant example1: Runs Example 1
  • ant example2: Runs Example 2
  • ant example2b: Runs Example 2b (a variation of Example 2)
  • ant example3: Runs Example 3
  • ant example4: Runs Example 4

Overview of Example 1

Though Example 1 is a basic XSLT transformer, introducing it proves necessary because it represents the basis on which the following examples are built. Figure 1 shows Example 1's conceptual picture.

Figure 1. Conceptual view of Example 1.

The input data (1.1) is proprietary XML data, which models the report of the customers' orders. The data looks like this:

 

<?xml version="1.0"?>

<ORDER_INFO>
   <CUSTOMER GROUP="exclusive">
      <ID>234</ID>
      <SERVICE_ORDERS>
         <ORDER>
            <PRODUCT_ID>1231</PRODUCT_ID>
            <PRICE>100</PRICE>
            <TIMESTAMP>2004-06-05:14:40:05</TIMESTAMP>
         </ORDER>
         <ORDER>
            <PRODUCT_ID>2001</PRODUCT_ID>
            <PRICE>20</PRICE>
            <TIMESTAMP>2004-06-12:15:00:44</TIMESTAMP>
         </ORDER>
      </SERVICE_ORDERS>
   </CUSTOMER>...

Example 1's complete input data is in file <EXAMPLE_ROOT>/input/orderInfo_1.1.xml. From now on, <EXAMPLE_ROOT> refers to the directory to which you have unzipped this article's examples.

Figure 1's XSLT template (1.2) is a regular XSLT stylesheet that transforms the input data into HTML. The XSLT template looks like this:

 

<?xml version="1.0"?>

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
   <xsl:output omit-xml-declaration="yes"/>

   <xsl:template match="/">
      <xsl:apply-templates select="ORDER_INFO"/>
   </xsl:template>

   <xsl:template match="ORDER_INFO">
      <HTML>
         <HEAD>
            <TITLE>Customers&apos; Order information</TITLE>
         </HEAD>
         <BODY>
            <H1>Customers&apos; Order information</H1>
            <xsl:apply-templates select="CUSTOMER"/>
            <xsl:apply-templates select="PRICE_SUMMARY"/>
         </BODY>
      </HTML>
   </xsl:template>...

The complete transformation template is in file <EXAMPLE_ROOT>/template/transform_1.2.xml. When you run Example 1, the program writes the output file <EXAMPLE_ROOT>/output/result_1.3.html.

To readers familiar with XSLT, Example 1 should look like a nice and easy programming exercise. The XSLT template only includes data required for formatting the output and does not include any complicated calculations. Let's go to Example 2 for a more challenging case.
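For readers who have not used the JAXP transformation API, the following sketch shows roughly what a basic XSLT transformer like Example 1 boils down to. It is not the article's actual code: the class BasicXsltDemo, the method transform, and the tiny inline stylesheet are assumptions made for illustration, and the stylesheet emits plain text rather than the HTML produced by transform_1.2.xml.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustrative only: not the transformer class shipped with the article.
public class BasicXsltDemo {

    /** Applies the stylesheet text to the XML text and returns the result as a string. */
    public static String transform(String xml, String xslt) throws Exception {
        // Compile the stylesheet into a Transformer.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        // Run the transformation from the input source to the output result.
        transformer.transform(new StreamSource(new StringReader(xml)),
                              new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ORDER_INFO><CUSTOMER GROUP=\"exclusive\">"
                   + "<ID>234</ID></CUSTOMER></ORDER_INFO>";
        // A deliberately small stylesheet, assumed here for the demo.
        String xslt =
              "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">"
            + "<xsl:output method=\"text\"/>"
            + "<xsl:template match=\"/\">"
            + "<xsl:for-each select=\"ORDER_INFO/CUSTOMER\">"
            + "Customer <xsl:value-of select=\"ID\"/>;"
            + "</xsl:for-each>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";
        System.out.println(transform(xml, xslt));
    }
}
```

In a file-based setup, the two StreamSource objects and the StreamResult would simply wrap files such as the input, template, and output files named above.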

Overview of Example 2

Example 2 may be this article's most interesting sample. The example actually consists of two different transformations (Transformation 2a and Transformation 2b) that we will consider separately, starting with Transformation 2a. Example 2's conceptual picture is shown in Figure 2.

Figure 2. Conceptual view of Example 2.

The input data (1.1) and the XSLT template (1.2) match those in Example 1. Before the XSLT transformation is applied, the input data goes through the preprocessor, which is actually a set of Java classes that manipulate the XML data using SAX's event-filtering feature. The datasource is a set of classes that implement a sort of dummy datasource. In the real application, this could be a database interface, for example. This dummy database is included to show a simplistic pattern for enriching the XML data with the dynamic data fetched from an external datasource. I wanted to make the examples as simple as possible to install and run, so I did not implement any real database connections—the dummy implementation hopefully gives you the idea.

When you run the examples, the program (when the mode parameter is set to debug) echoes the XML data coming from the preprocessor to the standard output stream (System.out). This data is the preprocessor's output, which now serves as the new input data to the transformer. Echoing the preprocessor output is a handy way to debug the transformations completed during preprocessing. We'll consider this feature's implementation later in this article.

When you run Transformation 2a, the following data is echoed to the screen:

 <ORDER_INFO>
   <CUSTOMER GROUP="exclusive">
      <ID>
         Jill
      </ID>
      <SERVICE_ORDERS>
         <ORDER>
            <PRODUCT_ID>
               Doohickey
            </PRODUCT_ID>
            <PRICE>
               100
            </PRICE>
         </ORDER>...

If you compare this data to the original input data (<EXAMPLE_ROOT>/input/orderInfo_1.1.xml), you'll notice the following differences at the beginning of the data:

  • The value of the first CUSTOMER/ID element has changed from 234 to Jill.
  • The value of the first CUSTOMER/SERVICE_ORDERS/ORDER/PRODUCT_ID element has changed from 1231 to Doohickey.
  • The TIMESTAMP element has been removed.

The preprocessor has replaced the values of CUSTOMER/ID and CUSTOMER/SERVICE_ORDERS/ORDER/PRODUCT_ID with the values received from its internal mock database. It has also filtered out the TIMESTAMP elements and their values. This modified data now represents the input to the XSLT transformer.
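A rough sketch of this kind of event filtering, assuming the standard XMLFilterImpl helper class, might look like the following. It is not the article's preprocessor: the class PreprocessorSketch, the run method, and the map-based lookup tables standing in for the mock database are all hypothetical. The sketch also assumes that the replaced elements contain only text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Illustrative only: a hypothetical filter, not the article's preprocessor classes.
public class PreprocessorSketch extends XMLFilterImpl {
    // Stand-in for the mock datasource: element name -> (raw value -> display value).
    private final Map<String, Map<String, String>> tables;
    private Map<String, String> current;            // non-null while buffering a looked-up element
    private final StringBuilder buffer = new StringBuilder();
    private int skipDepth = 0;                      // > 0 while inside a TIMESTAMP element

    public PreprocessorSketch(XMLReader parent, Map<String, Map<String, String>> tables) {
        super(parent);
        this.tables = tables;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (skipDepth > 0 || "TIMESTAMP".equals(local)) { skipDepth++; return; } // drop event
        current = tables.get(local);
        buffer.setLength(0);
        super.startElement(uri, local, qName, atts);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (skipDepth > 0) return;                                     // text inside TIMESTAMP
        if (current != null) { buffer.append(ch, start, length); return; } // hold back raw value
        super.characters(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (skipDepth > 0) { skipDepth--; return; }
        if (current != null) {
            String raw = buffer.toString().trim();
            String shown = current.containsKey(raw) ? current.get(raw) : raw;
            super.characters(shown.toCharArray(), 0, shown.length());  // emit replacement text
            current = null;
        }
        super.endElement(uri, local, qName);
    }

    /** Runs the filter and serializes the filtered events with an identity transformer. */
    public static String run(String xml, Map<String, Map<String, String>> tables)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader filter = new PreprocessorSketch(factory.newSAXParser().getXMLReader(), tables);
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new SAXSource(filter, new InputSource(new StringReader(xml))),
                           new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Map<String, Map<String, String>> tables = new HashMap<>();
        tables.put("ID", Collections.singletonMap("234", "Jill"));
        tables.put("PRODUCT_ID", Collections.singletonMap("1231", "Doohickey"));
        System.out.println(run("<ORDER_INFO><CUSTOMER GROUP=\"exclusive\"><ID>234</ID>"
            + "<SERVICE_ORDERS><ORDER><PRODUCT_ID>1231</PRODUCT_ID><PRICE>100</PRICE>"
            + "<TIMESTAMP>2004-06-05:14:40:05</TIMESTAMP></ORDER></SERVICE_ORDERS>"
            + "</CUSTOMER></ORDER_INFO>", tables));
    }
}
```

Because the filter is itself an XMLReader, the transformer cannot tell it apart from an ordinary parser. Passing a stylesheet source to newTransformer would apply an XSLT template to the filtered events instead of merely echoing them.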

Transformation 2a's output file is <EXAMPLE_ROOT>/output/result_2.1.html.

When you run Transformation 2b, the data echoed to the screen may at first appear similar to that of Transformation 2a. The difference is the PRICE_SUMMARY element at the end, inserted by the preprocessor:

  <PRICE_SUMMARY>
      <PRODUCT>
         <NAME>
            Doohickey
         </NAME>
         <SUM>
            110
         </SUM>
      </PRODUCT>
      <PRODUCT>
         <NAME>
            Nose Cleaner
         </NAME>
         <SUM>
            10
         </SUM>
      </PRODUCT>
      <PRODUCT>
         <NAME>
            Raccoon
         </NAME>
         <SUM>
            40
         </SUM>
      </PRODUCT>
   </PRICE_SUMMARY>
</ORDER_INFO>

The purpose of this example is to demonstrate that the preprocessor can also be applied to include new XML elements, values of which may be calculated directly from the input XML—or by using some supplementary data from an external source.
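One way such an insertion could be sketched is shown below, again assuming XMLFilterImpl. The filter passes all events through unchanged, keeps a running total per product, and emits a PRICE_SUMMARY element just before the closing ORDER_INFO tag. The class PriceSummaryFilter and its helper methods are hypothetical, prices are assumed to be integers, and the totals are keyed by the PRODUCT_ID text as it arrives rather than by product names fetched from a datasource.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Illustrative only: a hypothetical filter, not the article's implementation.
public class PriceSummaryFilter extends XMLFilterImpl {
    private static final Attributes NO_ATTS = new AttributesImpl();
    private final Map<String, Integer> sums = new LinkedHashMap<>(); // product -> total price
    private final StringBuilder text = new StringBuilder();          // text of current element
    private String product;                                          // last PRODUCT_ID seen

    public PriceSummaryFilter(XMLReader parent) { super(parent); }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        text.setLength(0);
        super.startElement(uri, local, qName, atts);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        text.append(ch, start, length);      // remember the text, but still forward it
        super.characters(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        String value = text.toString().trim();
        if ("PRODUCT_ID".equals(local)) {
            product = value;
        } else if ("PRICE".equals(local) && product != null) {
            sums.merge(product, Integer.parseInt(value), Integer::sum);
        } else if ("ORDER_INFO".equals(local)) {
            emitSummary();                   // inject new elements before the root closes
        }
        text.setLength(0);
        super.endElement(uri, local, qName);
    }

    private void emitSummary() throws SAXException {
        open("PRICE_SUMMARY");
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            open("PRODUCT");
            leaf("NAME", e.getKey());
            leaf("SUM", String.valueOf(e.getValue()));
            close("PRODUCT");
        }
        close("PRICE_SUMMARY");
    }

    // The helpers call super.* directly so synthetic events bypass the overrides above.
    private void open(String name) throws SAXException { super.startElement("", name, name, NO_ATTS); }
    private void close(String name) throws SAXException { super.endElement("", name, name); }
    private void leaf(String name, String value) throws SAXException {
        open(name);
        super.characters(value.toCharArray(), 0, value.length());
        close(name);
    }

    /** Runs the filter and serializes the resulting events with an identity transformer. */
    public static String run(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader filter = new PriceSummaryFilter(factory.newSAXParser().getXMLReader());
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        identity.transform(new SAXSource(filter, new InputSource(new StringReader(xml))),
                           new StreamResult(out));
        return out.toString();
    }
}
```

Calling PriceSummaryFilter.run on a small ORDER_INFO document returns the original content followed by the generated summary, which a stylesheet can then format like any other element.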

The output file of Example 2b is <EXAMPLE_ROOT>/output/result_2.1b.html.

Why are these examples interesting? At the conceptual level, this data transformation approach does not seem groundbreaking. The interesting point is that small, dynamic enhancements to the input XML are relatively easy to implement with SAX, yet they enable transformations that are impossible with plain XSLT. On the other hand, implementing the whole transformation using only SAX would be possible, but tedious. Using SAX and XSLT together opens almost endless possibilities in implementing complicated data transformations.

I discuss the pattern for extending the XSLT transformer with the preprocessor in the section entitled "A Deeper Look into Example 2." You'll see that, with minor changes, this article's code examples can be applied to many different purposes.

Examples 3 and 4 enhance Example 2 with capabilities for reading and producing non-XML data. If you are not interested in these options, jump directly to the implementation details of Examples 1 and 2.

Overview of Example 3

Sometimes input data comes from several sources, and it is not always in XML format. Example 3 shows how SAX can be used to generate events even from non-XML data, thus making it possible to apply XSLT. Example 3's conceptual picture is shown in Figure 3.

Figure 3. Conceptual view of Example 3.

Example 3's input data (3.1) looks like this:

 3
exclusive:234
2
Order:1231
Price:100
Timestamp:2004-06-05:14:40:05
Order:2001
Price:20
Timestamp:2004-06-12:15:00:44...

Example 3's complete input data is in file <EXAMPLE_ROOT>/input/orderInfoAsText_3.1.txt.
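The general idea of producing SAX events from text can be sketched without any XML parser at all. The example below is not the article's Example 3 code and does not interpret the real input format; it assumes a simplified, hypothetical format in which every "Name:value" line becomes one element, and lines without a colon are skipped. The class TextToSaxSketch and the RECORDS root element are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.Locale;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.AttributesImpl;

// Illustrative only: assumes a simplified "Name:value" line format.
public class TextToSaxSketch {

    /** Reads "Name:value" lines and reports them to the handler as SAX events. */
    public static void fireEvents(String text, ContentHandler out) throws Exception {
        AttributesImpl noAtts = new AttributesImpl();
        out.startDocument();
        out.startElement("", "RECORDS", "RECORDS", noAtts);   // hypothetical root element
        BufferedReader lines = new BufferedReader(new StringReader(text));
        for (String line = lines.readLine(); line != null; line = lines.readLine()) {
            int colon = line.indexOf(':');
            if (colon <= 0) continue;                          // this sketch skips other lines
            // Split on the first colon only, so values may contain colons themselves.
            String name = line.substring(0, colon).trim().toUpperCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            out.startElement("", name, name, noAtts);
            out.characters(value.toCharArray(), 0, value.length());
            out.endElement("", name, name);
        }
        out.endElement("", "RECORDS", "RECORDS");
        out.endDocument();
    }

    /** Feeds the events into an identity TransformerHandler to obtain XML text. */
    public static String toXml(String text) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter result = new StringWriter();
        handler.setResult(new StreamResult(result));
        fireEvents(text, handler);
        return result.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml("Order:1231\nPrice:100\nTimestamp:2004-06-05:14:40:05"));
    }
}
```

The XSLT engine only consumes events, so it does not matter whether they come from an XML parser or from hand-written code. Passing a stylesheet source to newTransformerHandler would run an XSLT transformation over these generated events.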
