Is XML Slow? - Benchmarking XML

 

A common assumption holds that representing and processing the data of an entire transaction as an XML document is the root cause of slow performance in high-volume systems. That assumption is usually flawed: the real culprit is most often a wrong technology or approach chosen by the development team. This article demonstrates how different technologies compare against each other in processing XML documents of varying structural complexity and size. This should allow development teams to choose the most suitable technologies for their application stack, rather than assuming that the slowness is the result of the underlying XML processing.

This article uses the Java programming language to evaluate and demonstrate the performance of different XML processing techniques; similar results have been observed across other languages and platforms (.NET, etc.). As part of this study we also make recommendations on which approach offers the best performance for a given set of system requirements. Specifically, the article demonstrates how the various technologies compare in terms of the time each one needs to process an XML document and convert it into the corresponding object graph.

Techniques Used for Comparison

We picked the most powerful and most common techniques in use by the development community today, then ran a suite of XML-processing tests with each of the chosen alternatives to compare, side by side, the time each one takes to deserialize the same documents.

The different alternatives that we considered for this article are:

  • Native serialization/deserialization provided by the Java programming language: Java provides automatic serialization, which requires that the class implement the marker interface java.io.Serializable. Java then handles serialization internally for any such object. The standard encoding uses a simple translation of the fields into a byte stream. Primitives as well as non-transient, non-static referenced objects are encoded into the stream. Each object referenced by the serialized object and not marked transient must also be serializable; if any object in the complete graph of non-transient object references is not serializable, serialization fails. The developer can influence this behavior by marking fields as transient, or by redefining the serialization of an object so that some portion of the reference graph is truncated and not serialized.

  • Java Architecture for XML Binding (JAXB): JAXB is an API, with a reference implementation and a set of tools, that allows automatic two-way mapping between XML documents and Java objects. Given a Document Type Definition (DTD) or a schema, the JAXB compiler can generate a set of Java classes that let developers build applications that read, manipulate and recreate XML documents without writing any logic to process XML elements. In other words, JAXB allows Java developers to access and process XML data without having to know XML or XML processing in great detail; for example, there is no need to create a SAX parser or write callback methods. One of the more useful benefits is that the JAXB XJC compiler generates the corresponding Java mapping objects and processing code, which frees developers from manually writing and debugging conversion code. With the auto-generated code, developers can write applications that access XML data through Java interfaces without worrying about the structure of the data. It is also important to know that JAXB uses SAX as its underlying parsing mechanism.

  • JAXB with the FastInfoSet library: Fast Infoset is a binary format standard for XML from the ITU-T and ISO that reduces the size of text-based XML documents (XML infosets) and improves parsing speed. Based on ASN.1 notation, Fast Infoset documents can be compressed to as little as 20% of their original size; when an external schema is used, they can be compressed to an extraordinary 5% of the original size.

  • JAXB with custom compression: In order to have a meaningful comparison between FastInfoSet and JAXB, we added a compression/decompression stage to JAXB's processing of the XML documents and report this variant separately.

  • Deserialization using a DOM parser: Since JAXB uses SAX as its underlying parsing mechanism, we thought it logical to also include DOM parsing, which loads the entire XML document into memory for manipulation, as part of this study.
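
As an illustration of the first technique, here is a minimal, self-contained sketch of a native Java serialization round trip. The Message class is hypothetical (the article's actual test classes ship with its downloadable source) and exists only to show how the Serializable marker interface and transient fields behave:

```java
import java.io.*;

// Hypothetical message class for illustration only; not taken from the
// article's test suite.
class Message implements Serializable {
    private static final long serialVersionUID = 1L;
    String id;
    double amount;
    transient String debugNote;   // transient fields are skipped by serialization

    Message(String id, double amount, String debugNote) {
        this.id = id;
        this.amount = amount;
        this.debugNote = debugNote;
    }
}

public class NativeSerializationDemo {
    public static void main(String[] args) throws Exception {
        Message original = new Message("TXN-1", 99.95, "scratch data");

        // Serialize: the object graph is encoded into a byte stream.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }

        // Deserialize: rebuild the object graph from the byte stream.
        Message copy;
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            copy = (Message) in.readObject();
        }

        System.out.println(copy.id + " " + copy.amount);   // non-transient fields survive
        System.out.println(copy.debugNote);                // transient field comes back null
    }
}
```

Note that the byte stream produced here is tied to the class's serialVersionUID and field layout, which is precisely the portability drawback weighed in the analysis that follows.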

Comparison Report

We executed 100,000 iterations* of deserializing a set of XML files of varying sizes, from very small to large, with each of the selected technologies, and observed the following results. The goal was to transmit the same atomic piece of information (e.g., a message) in various formats: XML, binary, etc. What is important is that, semantically, we transmit exactly the same information with no loss due to compression. The figure below shows the results of our tests.

*(100,000 iterations resulted in convergence of the numbers; increasing the iteration count further produced no significant deviation in the observed results.)
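
As a rough sketch of the measurement approach (the actual harness, payloads and per-technology drivers are in the article's downloadable source, so the class and constants below are assumptions), the following times repeated DOM parses of one small document and reports the average per iteration:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

public class BenchmarkSketch {
    // Tiny stand-in payload; the article's tests used files from very small to large.
    static final String XML =
        "<message><id>TXN-1</id><amount>99.95</amount></message>";

    public static void main(String[] args) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        int iterations = 10_000;   // the article used 100,000 until numbers converged

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            // The measured unit of work: one full parse of the document.
            builder.parse(new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8)));
        }
        double avgMs = (System.nanoTime() - start) / 1_000_000.0 / iterations;

        System.out.println("Average time per parse: " + avgMs + " ms");
    }
}
```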


Figure 1: Comparison Report showing average time taken (ms) in processing XML files of various sizes using most commonly used alternatives/libraries.

Analysis of Test Results & Recommendations

We observed the following results from the tests and make corresponding recommendations for the various alternatives:

  • Java deserialization turned out to be the fastest of all the techniques (except for very small files, where the overhead of the native Java deserialization process made it the slowest).

    Recommendations: Java serialization performed the fastest of all the alternatives. However, the serialized data is in a native binary format tied to a particular version of the compiled code, which makes it non-portable across compilations. This is a big drawback, especially when the data must be shared by different modules of a distributed enterprise application and continuous development of the underlying source code is a given. In addition, because the serialized data is binary, it cannot be read in any editor without first being deserialized by the Java serialization framework, which counts as another significant drawback.

    Hence we do not recommend Java serialization or other binary formats except for fairly stable and small applications.

  • The FastInfoSet library performed the slowest overall. Although generally seen as a library that speeds up data processing, our results show that the overhead of the FastInfoSet data structures makes it the slowest technique here: almost 2x slower than JAXB and 3x slower than native Java deserialization.

    We observed that FastInfoSet remained slower even when we used the highest level of compression from the ZLIB compression library and processed the compressed data stream (effectively, we added decompression time to the JAXB tests for a meaningful comparison with FastInfoSet).

    Recommendations: The FastInfoSet library was the slowest of all the alternatives at processing the data. However, it also compressed the data to a much smaller size, reducing both the time taken to transport the data over the wire and the storage footprint. As an example, an input XML file of 400,384 bytes was converted into a compressed binary XML format of only 40,451 bytes, a significant reduction.*

    It is also important to understand that, unlike Java deserialization, where the data is converted into an unreadable binary format, the compressed binary variant produced by FastInfoSet is actually binary XML: it can be read and selectively bound using StAX filters, and it is interoperable with .NET.

    Hence we recommend FastInfoSet only when you are more concerned with the size of the data on the wire and on disk than with processing time, where FastInfoSet is clearly the slowest of the alternatives.

    *(Table 2 below shows that compressing the raw data directly yields a much smaller size than FastInfoSet. The underlying data in Fast Infoset is a compressed binary stream that can be read using StAX filters.)

  • JAXB performed considerably faster than FastInfoSet and slightly slower than Java deserialization. DOM has its own advantages and disadvantages, but the overall flexibility offered by JAXB makes it the most viable option. JAXB has two phases: code generation and parsing. You can either use the JAXB XJC compiler to auto-generate the Java objects for the types defined in the corresponding XML schema, or write the Java objects mapping to the XML types yourself. Auto-generating the code is a great time saver and productivity boost, because the generated Java classes map correctly to the corresponding XML types.

    Also, since JAXB works directly on the XML text, there is no binary format involved, and the XML text can be selectively evaluated or read on demand without any involvement of the JAXB framework.

    Recommendations: JAXB is our recommended choice. Its slightly slower speed compared to Java deserialization is more than compensated for by the advantages of code generation from an XML schema and by its ability to bind the XML at runtime to any version of the source code. JAXB also scores high in our recommendation because the underlying XML remains readable as raw text.

  • The DOM parser performed slightly faster than JAXB and slightly slower than Java deserialization for larger files. For smaller files, the DOM parser was the fastest.

    Recommendations: As mentioned earlier, a DOM parser can be used when an XML document is not too large and you want the entire document available in memory for data manipulation. DOM parses the whole document and constructs a complete document tree in memory before returning control to the client (a DOM tree is best thought of as a map of maps). Even DOM parsers that employ deferred node expansion, and can therefore parse a document partially, have high resource demands, because the document tree must still be at least partially constructed in memory. With DOM you must also do all the downstream processing yourself, such as converting numeric data into a binary representation.

    Hence we do not recommend DOM except for very small XML files, and in situations where the memory footprint is not a concern.
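
The compression/decompression stage from the "JAXB with custom compression" variant can be sketched with the JDK's built-in ZLIB support in java.util.zip; the repetitive payload below is only a stand-in for a marshalled XML document, not the article's actual test data:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ZlibStageDemo {
    // Compress marshalled XML bytes with ZLIB at the highest compression level.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress back to the original XML text before handing it to the unmarshaller.
    static byte[] decompress(byte[] input) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // Repetitive XML compresses well, mirroring the size reductions reported above.
        StringBuilder sb = new StringBuilder("<messages>");
        for (int i = 0; i < 200; i++) {
            sb.append("<message><id>TXN-").append(i).append("</id></message>");
        }
        sb.append("</messages>");
        byte[] xml = sb.toString().getBytes(StandardCharsets.UTF_8);

        byte[] packed = compress(xml);
        byte[] restored = decompress(packed);

        System.out.println("original: " + xml.length
            + " bytes, compressed: " + packed.length + " bytes");
        System.out.println("lossless round trip: "
            + new String(restored, StandardCharsets.UTF_8).equals(sb.toString()));
    }
}
```

The round trip is lossless, which matches the study's requirement that the same information be transmitted with no loss due to compression.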

Conclusion - Is XML Slow?

We have shown in this article that XML is generally not slow; performance depends largely on the technology being used. Even though DOM (Xerces) is on average 25% faster, JAXB is still the recommended choice when you consider all the work the JAXB unmarshaller does, such as converting numeric data into a binary representation, work that many applications would eventually need to do anyway after loading the XML as a DOM tree. XML is not much slower than the alternative lossless transmission mechanisms, and its versatility and readability make it a better choice for any enterprise application.

Source code for the application can be obtained by clicking this link. It is released under the Apache License, Version 2.0, may be used freely for your needs, and is provided on an "As Is" basis, without warranties or conditions of any kind.

Test Architecture

We conducted the tests on XML files of varying sizes, running each file through the five alternatives discussed earlier in the article. The architecture of the sample system used to conduct the tests is shown below.

Architecture

Test Result Metrics

Table 1 below displays the test results observed when running 100,000 iterations of the tests, which were used to generate the comparison report shown in Figure 1.

Table 1

Table 2 below displays the size, in bytes, to which the data was converted/processed by each of the chosen technology alternatives.

Table 2

References/Links/Literature

http://en.wikipedia.org/wiki/XML

http://jaxb.java.net/

http://en.wikipedia.org/wiki/Serialization

http://encyclopedia.thefreedictionary.com/Fast+Infoset+Project