Transparently cache XSL transformations with JAXP

Boost performance and retain usability by implementing implicit caching inside transformer factories

No doubt, XSLT (Extensible Stylesheet Language Transformations) is a powerful technology that has many applications in the XML world. In particular, numerous Web developers can take advantage of XSLT at the presentation layer to gain convenience and flexibility. However, the price of these advantages is higher memory and CPU load, which makes developers more attentive to optimization and caching techniques when using XSLT. Caching is even more important in Web environments, where numerous threads share stylesheets.

In these cases, proper transformation caching proves vital for performance. A usual recommendation when using the Java API for XML Processing (JAXP) is to load transformations into a Templates object and then use this object to produce a Transformer rather than instantiate a Transformer object directly from the factory. That way, a Templates object may be reused to produce more transformers later and save time on stylesheet parsing and compilation. In "Top Ten Java and XSLT Tips," Eric Burke gives the following code in Tip 1:

Source xsltSource = new StreamSource(xsltFile);
TransformerFactory transFact = TransformerFactory.newInstance();
Templates cachedXSLT = transFact.newTemplates(xsltSource);
Transformer trans = cachedXSLT.newTransformer();

In this example, the transformation from xsltFile is first loaded into the cachedXSLT Templates object, which is then used to create a new transformer object, trans. The advantage is that later, when we need another transformer object, the parsing and compilation phases can be skipped:

Transformer anotherTrans = cachedXSLT.newTransformer();
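
The reuse pattern can be tried end to end with a small, self-contained sketch. The inline stylesheet and the class and method names below are hypothetical and serve only to keep the example runnable without external files; it assumes the default XSLT processor bundled with the JDK.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TemplatesReuseDemo {
    // Hypothetical inline stylesheet, used only to keep the example self-contained.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/greeting'>Hello, <xsl:value-of select='.'/>!</xsl:template>"
      + "</xsl:stylesheet>";

    // Parse and compile the stylesheet once; the result can be shared.
    static Templates compile() {
        try {
            return TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(XSLT)));
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    // Each call derives a fresh Transformer from the shared Templates,
    // skipping the parsing and compilation phases.
    static String transform(Templates templates, String xml) {
        try {
            Transformer transformer = templates.newTransformer();
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Templates cached = compile();
        System.out.println(transform(cached, "<greeting>World</greeting>"));
        System.out.println(transform(cached, "<greeting>JAXP</greeting>"));
    }
}
```

A Templates object is meant to be shared between threads, whereas each Transformer should be used by one thread at a time, which is why the sketch creates a new transformer per call.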

Although this technique improves performance (especially when the same stylesheets are used repeatedly, as in Web applications), it is not convenient for the developer. Apart from the Templates-based transformer instantiation, you must track the date of the last stylesheet modification, reload outdated transformations, provide safe and efficient multithreaded access to the stylesheet cache, and take care of many other small details. Even a natural move, encapsulating all the required functionality in a standalone transformer cache implementation, will not save a developer from third-party modules that use standard JAXP routines without any caching. A good example of such a module is the JSTL x:transform tag: its current implementation in the org.apache.taglibs.standard.tag.common.xml.TransformSupport and org.apache.taglibs.standard.tag.el.xml.TransformTag classes directly uses the TransformerFactory's newTransformer(...) method. Obviously, x:transform cannot take advantage of any external caching implementation.

There is, however, a simple and elegant solution to this problem. Since JAXP lets us replace the TransformerFactory implementation in use, why not simply write a factory with intrinsic caching capabilities?

This idea is not difficult to implement. We could extend any suitable TransformerFactory implementation (I use Michael Kay's Saxon 7.3) and override the parent's newTransformer(...) method so that transformations loaded from the file-based stream sources are cached and returned from the cache, if the transformations were not modified since the last load. A new version of the newTransformer(...) method looks like the following:

public Transformer newTransformer(final Source source)
  throws TransformerConfigurationException
{
  // Check that the source is a StreamSource with a system ID
  // (a source built from a Reader or InputStream may have none)
  if (source instanceof StreamSource && source.getSystemId() != null)
    try
    {
      // Create URI of the source
      final URI uri = new URI(source.getSystemId());
      // If URI points to a file, load transformer from the file
      // (or from the cache)
      if ("file".equalsIgnoreCase(uri.getScheme()))
        return newTransformer(new File(uri));
    }
    catch (URISyntaxException urise)
    {
      throw new TransformerConfigurationException(urise);
    }
  // All other sources are handled by the parent factory, uncached
  return super.newTransformer(source);
}

As you can see, if the transformer's source is not a stream source or does not point to a file, the parent implementation of newTransformer(...) returns the transformer. But if the source is a file-based stream source, we have the opportunity to implement more intelligent transformation loading with the help of a cache.
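
To see which sources would take the cached path, the check can be isolated in a small helper. The sketch below is illustrative only: the class SourceFiles and the method fileOf are hypothetical names, and the helper mirrors the test above while additionally returning null for sources that carry no system ID.

```java
import java.io.File;
import java.io.StringReader;
import java.net.URI;
import java.net.URISyntaxException;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;

class SourceFiles {
    // Returns the file behind a file-based StreamSource, or null if the
    // source is of another kind, has no system ID, or is not a file URI.
    static File fileOf(Source source) {
        if (!(source instanceof StreamSource)) {
            return null;
        }
        String systemId = source.getSystemId();
        if (systemId == null) {
            return null;
        }
        try {
            URI uri = new URI(systemId);
            return "file".equalsIgnoreCase(uri.getScheme()) ? new File(uri) : null;
        } catch (URISyntaxException | IllegalArgumentException e) {
            // Malformed system ID, or a file URI that File cannot represent
            return null;
        }
    }

    public static void main(String[] args) {
        // A StreamSource built from a File carries a file: system ID.
        System.out.println(fileOf(new StreamSource(new File("style.xsl"))));
        // A StreamSource built from a Reader has no system ID.
        System.out.println(fileOf(new StreamSource(new StringReader("<x/>"))));
        // A non-file URI is left to the parent factory.
        System.out.println(fileOf(new StreamSource("http://example.org/style.xsl")));
    }
}
```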

The caching algorithm for file-based stylesheets is quite simple: for a given file, we check whether a Templates object with the same absolute file name is already stored in the cache. If it is not, we create and cache a new Templates object for this file. If an entry is already in the cache, we check whether the file was updated since Templates was last loaded by comparing the date of the file's last modification with the timestamp stored in the cache entry. If the file was updated, Templates must be reloaded; otherwise it may be taken from the cache. Finally, with the Templates object (loaded from the cache or from the disk, depending on the situation), we simply produce a new transformer. An implementation of this algorithm is the following method:

protected Transformer newTransformer(final File file)
    throws TransformerConfigurationException
  {
    // Search the cache for the templates entry
    TemplatesCacheEntry templatesCacheEntry = read(file.getAbsolutePath());
    // If entry is found
    if (templatesCacheEntry != null)
    {
      // Check timestamp of modification
      if (templatesCacheEntry.lastModified
        < templatesCacheEntry.templatesFile.lastModified())
        // Clear entry, if it is obsolete
        templatesCacheEntry = null;
    }
    // If no templatesEntry is found or this entry was obsolete
    if (templatesCacheEntry == null)
    {
      logger.debug("Loading transformation [" + file.getAbsolutePath() + "].");
      // If this file does not exist, throw the exception
      if (!file.exists())
      {
        throw new TransformerConfigurationException(
          "Requested transformation ["
          + file.getAbsolutePath()
          + "] does not exist.");
      }
      // Create new cache entry
      templatesCacheEntry =
        new TemplatesCacheEntry(newTemplates(new StreamSource(file)), file);
      // Save this entry to the cache
      write(file.getAbsolutePath(), templatesCacheEntry);
    }
    else
    {
      logger.debug("Using cached transformation [" + file.getAbsolutePath() + "].");
    }
    return templatesCacheEntry.templates.newTransformer();
  }
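
The listing relies on a TemplatesCacheEntry type that is not shown here. A minimal sketch, assuming only what the listing uses (the fields templates, templatesFile, and lastModified, and a constructor taking a Templates and a File), might look like the following. The isStale() helper is hypothetical and simply names the timestamp comparison made above; the actual class in the downloadable source may differ.

```java
import java.io.File;
import javax.xml.transform.Templates;

class TemplatesCacheEntry {
    // Compiled stylesheet shared by all transformers created from this entry
    final Templates templates;
    // File the stylesheet was loaded from
    final File templatesFile;
    // Modification time of the file, captured when the entry was created
    final long lastModified;

    TemplatesCacheEntry(Templates templates, File templatesFile) {
        this.templates = templates;
        this.templatesFile = templatesFile;
        this.lastModified = templatesFile.lastModified();
    }

    // Hypothetical helper: true if the file changed after this entry was created
    boolean isStale() {
        return lastModified < templatesFile.lastModified();
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("style", ".xsl");
        file.deleteOnExit();
        TemplatesCacheEntry entry = new TemplatesCacheEntry(null, file);
        System.out.println("Stale right after load: " + entry.isStale());
        // Simulate a later edit by moving the timestamp forward
        boolean touched = file.setLastModified(entry.lastModified + 10_000L);
        System.out.println("Timestamp updated: " + touched
            + ", stale now: " + entry.isStale());
    }
}
```

Because the entry is immutable, it can be handed to several reader threads without further synchronization once it has been stored in the cache.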

However, we must consider another issue: thread safety. Because many concurrent threads share the cache, we must take certain precautions to make read operations (retrieving cache entries from the cache) and write operations (saving newly loaded stylesheets into the cache) safe. In the code above, read(...) and write(...) must not cause conflicts, even when several threads run them in parallel.

Although Java offers advanced synchronization services, the problem here is not synchronization as such, but the balance between synchronization and performance. The simplest solution is full synchronization: we declare the whole newTransformer(...) method synchronized and either store the cache entries in a synchronized container or access the cache in synchronized blocks. This, however, proves inefficient. Because only a limited number of stylesheets exists and they do not change often, the transformation cache will be read far more frequently than it is written. Full synchronization blocks concurrent readers, which is not always necessary and may create a bottleneck.

On the other hand, using unsynchronized containers, like HashMap, to store cache entries is dangerous. If we take no precautions, simultaneous reading and writing can leave the map in an inconsistent state and make the system unstable.

What we basically have here is a classic readers/writers problem: for a given resource, there may be either one writer or several readers at any moment in time. This classic problem has a classic solution, which we will take from Doug Lea's Concurrent Programming in Java. The idea is to track execution state by counting active or waiting reading and writing threads, and to allow reading only when no active writers exist and writing only when neither active readers nor writers exist.

To do that, we extract access to the cache into two methods, read() and write():

protected TemplatesCacheEntry read(final String absolutePath)
{
  beforeRead();
  try
  {
    return (TemplatesCacheEntry) templatesCache.get(absolutePath);
  }
  finally
  {
    // Always release the read slot, even if the lookup throws
    afterRead();
  }
}

protected void write(final String absolutePath,
  final TemplatesCacheEntry templatesCacheEntry)
{
  beforeWrite();
  try
  {
    templatesCache.put(absolutePath, templatesCacheEntry);
  }
  finally
  {
    // Always release the write slot, even if the update throws
    afterWrite();
  }
}

Two pairs of methods, beforeRead()/afterRead() and beforeWrite()/afterWrite(), perform thread synchronization, ensuring safe but efficient access to the cache:

protected synchronized void beforeRead()
{
  while (activeWriters > 0)
    try
    {
      wait();
    }
    catch (InterruptedException iex)
    {
    }
  ++activeReaders;
}
protected synchronized void afterRead()
{
  --activeReaders;
  notifyAll();
}
protected synchronized void beforeWrite()
{
  while (activeReaders > 0 || activeWriters > 0)
    try
    {
      wait();
    }
    catch (InterruptedException iex)
    {
    }
  ++activeWriters;
}
protected synchronized void afterWrite()
{
  --activeWriters;
  notifyAll();
}
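
On Java 5 and later, the same readers/writers policy is available ready-made in java.util.concurrent.locks.ReentrantReadWriteLock. The following sketch is an alternative to the hand-written before/after methods, not the implementation described in this article; the class name ReadWriteCache is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ReadWriteCache<K, V> {
    // Plain HashMap, guarded entirely by the read/write lock below
    private final Map<K, V> entries = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Many readers may hold the read lock at the same time.
    V read(K key) {
        lock.readLock().lock();
        try {
            return entries.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    // The write lock is exclusive: it waits for active readers and writers.
    void write(K key, V value) {
        lock.writeLock().lock();
        try {
            entries.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        ReadWriteCache<String, String> cache = new ReadWriteCache<>();
        cache.write("/tmp/style.xsl", "compiled templates placeholder");
        System.out.println(cache.read("/tmp/style.xsl"));
        System.out.println(cache.read("/tmp/missing.xsl"));
    }
}
```

The try/finally blocks guarantee that a lock is released even if the map operation throws, which plays the same role as the afterRead() and afterWrite() calls above.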

With access to the cache implemented as shown above, we finally have a transformer factory that transparently implements efficient caching of file-based stylesheets (you can download the full source code from Resources). The only thing left is to make our factory available through standard JAXP routines.

Several approaches are available for making the TransformerFactory.newInstance() method return an instance of a custom transformer factory implementation. The most straightforward way is to specify the factory's class name in the javax.xml.transform.TransformerFactory system property:

System.setProperty("javax.xml.transform.TransformerFactory",
  "de.fzi.dbs.transform.CachingTransformerFactory");

This approach has the advantage of taking the highest priority in JAXP's lookup order, but the disadvantage that the property must be set explicitly, either in code or on the command line.

Another approach uses a JRE (Java Runtime Environment)-wide configuration file ${JRE_HOME}/lib/jaxp.properties to specify your own class name:

...
# Specifies transformer factory implementation
javax.xml.transform.TransformerFactory=de.fzi.dbs.transform.CachingTransformerFactory
...

The last approach uses the Services API to provide the transformer factory name in library meta-information. Just create a file named javax.xml.transform.TransformerFactory in your JAR file's META-INF/services directory. This file should contain a single line specifying the class name of the custom transformer factory. This method, however, carries a risk: another JAR may also try to set the factory class through the Services API. For instance, if you put your JAR and Saxon's JAR in your Web application's WEB-INF/lib directory, the actual factory used by JAXP will depend on the order in which these JARs load. To avoid this uncertainty in Web applications, simply configure your factory in the WEB-INF/classes/META-INF/services/javax.xml.transform.TransformerFactory file. In our case, it will contain the single line de.fzi.dbs.transform.CachingTransformerFactory.
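
Whichever approach is used, it is worth confirming which factory JAXP actually resolves at runtime. The following sketch only inspects the current configuration and changes nothing; the class name FactoryCheck is hypothetical.

```java
import javax.xml.transform.TransformerFactory;

class FactoryCheck {
    // Name of the system property JAXP consults first
    static final String PROPERTY = "javax.xml.transform.TransformerFactory";

    // Class name of the factory that TransformerFactory.newInstance() returns
    static String activeFactoryClassName() {
        return TransformerFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // null here means the property is not set and JAXP falls back
        // to its other lookup mechanisms
        System.out.println("System property: " + System.getProperty(PROPERTY));
        System.out.println("Factory in use:  " + activeFactoryClassName());
    }
}
```

If the caching factory is configured correctly, the second line should show its class name rather than that of the underlying XSLT processor.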

Make cache usage transparent

Now that everything is in place, you have one less headache. You no longer have to worry about loading, caching, and reloading stylesheets. Third-party libraries that use standard JAXP will benefit from caching as well. Concurrent access to the cache will not cause conflicts, and the cache will not become a bottleneck.

There are, however, several disadvantages to using this implementation. First, this factory caches only those stylesheets loaded from files. The reason is that, while we can easily check the timestamp of a file's last modification, this is not always possible for other sources. Another problem remains with stylesheets that import or include other stylesheets: modifying an imported or included stylesheet will not trigger a reload of the main stylesheet. Finally, extending an existing factory implementation binds you to a certain XSLT processor (unless you write a caching extension for every factory you might use). Fortunately, in most cases, these issues are not crucial, and we can enjoy the advantages of factory-based caching: transparency, convenience, and performance.

Alexey Valikov is a computer scientist with an extensive programming background, especially in Java and XML technologies. His current research at FZI (Research Center for Computer Science, Karlsruhe/Germany) is focused on efficiency issues in Web application development. Working in FZI's XML Competence Center, he also consults, teaches XML technologies, and takes part in European Commission research projects. Alexey authored a popular practical guide to XSLT, The Technology of XSLT, published in Russian.
