soap.xml (131 KB): A generated SOAP document containing a large values list. It includes some namespaces and attributes and a flat structure consisting of simple elements with short character-data content. Aleksander Slominski generated this document as a SOAP test case.
much_ado.xml (197 KB): The Shakespeare play marked up as XML from Jon Bosak's document collection. It includes no namespaces or attributes and has a flat structure consisting of simple elements with relatively long character-data content.
periodic_table.xml (114 KB): Elliotte Rusty Harold's periodic table of the elements in XML. It includes no namespaces and has light attribute usage and a fairly complex structure consisting mainly of elements with short character-data content.
xml.xml (192 KB): The XML specification as XHTML, with the DTD (document type definition) reference removed and all entities substituted (necessary for some models used in the tests). I chose this document as typical document presentation markup, with heavy mixed content. It has no namespaces and light attribute use.
The large document tests employed David Mertz's weblog.xml (2.9 MB), a Web server log file formatted as XML. It comprises approximately 10,000 elements representing page hits, each containing several child elements with character-data content for the information fields. It has no namespaces and no attributes.
I've included each document, except for xml.xml, in the XMLBench download. I couldn't include xml.xml because its license requires that it be distributed only in unmodified form. (To try it yourself, remove the DTD reference and substitute &#xxx; values for entities.)
The parsers tested
I tested five SAX2 parsers, along with both currently available XMLPull implementations. First, here's the SAX2 parser list:
- Crimson: Apache project based on the Sun Project X parser. Crimson is stable and distributed under an Apache license.
- Xerces: Apache project based on the IBM XML4J parser. It's also stable and distributed under an Apache license.
- Xerces2: Xerces redesign and new implementation as a separate Apache project. It's fairly new but stable and also distributed under an Apache license.
- AElfred2: The parser included in the GNU JAXP (Java API for XML Parsing) project, based on the Microstar AElfred parser. It's a beta distributed as GPL (GNU General Public License) with library exception license.
- Piccolo: New parser development using parser generator tools. It's distributed under an LGPL (GNU Library or Lesser General Public License) license.
The XMLPull implementations are:
- kXML: A compact, J2ME-compatible parser. The XMLPull version is in beta, distributed under an LGPL license.
- XPP3: A compact parser originally designed for SOAP. The XMLPull version is in beta, distributed under an Apache-style license.
I tested the parsers on a Red Hat 7.2 Linux system with a 1.4-GHz Athlon processor and 256 MB of RAM. The tests used Sun's JRE (Java Runtime Environment) 1.3.1 running under Linux kernel version 2.4.18.
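For readers who want to take a comparable measurement themselves, here is a minimal timing sketch. It is not the XMLBench code: the `ParseTimer` class and `bestParseNanos` method are hypothetical names, it times whichever SAX2 parser JAXP selects by default, and it relies on `System.nanoTime()`, which requires a newer JRE than the 1.3.1 release used for the tests above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical timing harness for illustration; not the XMLBench implementation.
class ParseTimer {

    /** Parses the in-memory document `passes` times; returns the best time in nanoseconds. */
    static long bestParseNanos(byte[] document, int passes) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(false);   // the article's tests ran with validation disabled
        SAXParser parser = factory.newSAXParser();

        long best = Long.MAX_VALUE;
        for (int i = 0; i < passes; i++) {
            long start = System.nanoTime();
            // DefaultHandler ignores every event, so only parser overhead is timed.
            parser.parse(new ByteArrayInputStream(document), new DefaultHandler());
            best = Math.min(best, System.nanoTime() - start);
        }
        return best;
    }

    public static void main(String[] args) throws Exception {
        byte[] doc = "<log><hit><page>/index.html</page></hit></log>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("best parse time (ns): " + bestParseNanos(doc, 20));
    }
}
```

Parsing from a byte array keeps disk I/O out of the measurement, and taking the best of several passes reduces the influence of JIT warm-up and garbage collection on the reported time.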
Figure 1 shows the small document test results. The XMLPull parsers performed extremely well with the small documents, beating all the SAX2 parsers except the new Piccolo parser. Piccolo blows away even the fast pull parsers, though, giving the fastest times for every tested document-type collection.
The other SAX2 parsers don't fare nearly as well as Piccolo. AElfred2 and Xerces2 both deliver acceptable performance, although taking more than twice as long as Piccolo. The original Xerces and Crimson bring up the rear; if you're working mainly with small files, you should probably avoid these two.
For the mid-sized XML documents shown in Figure 2, performance differs little between the XMLPull parsers and most of the SAX2 parsers. Piccolo again stands out, though, with about twice the speed of the slowest parsers. AElfred2 and Crimson both show relatively poor performance with the SOAP document, which doesn't contain much character-data content. The XMLPull parsers, along with Xerces2, show relatively poor performance on the content-heavy much_ado.xml file. The performance range here is much smaller than for the small documents, however.
The large document results in Figure 3 show a smaller performance range than the mid-sized documents. Here again, Piccolo proves the best. The XMLPull parsers and Xerces2 trail the other parsers in this test. Keep in mind this test uses only one document, so the results aren't as representative of general performance as the earlier tests.
The results show that the new XMLPull parsers perform well compared to the older SAX2 parsers. The new Piccolo SAX2 parser does even better, but that's probably because of the parser's implementation rather than the SAX2 interface. We'll see how the comparison works out once the XMLPull implementations have been tuned. In fact, a new XPP3 version claims some dramatic performance gains. Check my XMLBench Website for updated test results, beginning in early May 2002.
Note: None of the best performers support validation. Of the parsers tested, Xerces2 offers the best validation support and shows reasonable performance. The performance tests are done with validation disabled, though, so performance may prove different when it's turned on. If you need validation in your application and performance is an issue, try using different validating parsers to see which works best in your environment.
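If you want to try a validating parser both ways, the following sketch shows one way to switch validation on and off through JAXP and count the validity errors reported. The `ValidationToggle` class and `countValidityErrors` method are hypothetical names for illustration, and the sketch assumes the parser JAXP selects supports DTD validation.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; assumes the JAXP default parser can validate.
class ValidationToggle {

    /** Parses the document and returns how many validity errors the parser reported. */
    static int countValidityErrors(String xml, boolean validate) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(validate);   // false matches the article's test setup
        XMLReader reader = factory.newSAXParser().getXMLReader();

        final int[] errors = {0};
        reader.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors[0]++;   // validity errors are recoverable, so just count them
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return errors[0];
    }
}
```

Running the same document through this method with `validate` set to `true` and then `false`, inside a timing loop of your own, gives a rough idea of what validation costs for a given parser.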
Now that you've seen both usage and performance comparisons for SAX2 and XMLPull parsers, how do you decide what to use for your applications? Here's a list of pros and cons:
- SAX2 pros:
  - Stable: SAX2 features many compatible implementations and good support
  - Widely used: Compatibility with existing code is often important
  - Great validation support: The only real choice if you need document validation on parse
- SAX2 cons:
  - No parse control: The parser takes control and doesn't give it back until it's done
  - Event-driven programming: Data is delivered piecemeal to your handler
- XMLPull pros:
  - Controlled parsing: Your program completely controls the parsing
  - Event-request programming: You request events as you want to process them
- XMLPull cons:
  - New: Only two implementations currently exist
  - No validation support: Layering approach should give great support in time, but not yet
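The control-flow difference in the list above can be sketched with a small example that extracts the first `page` value from a log-style document. The pull half uses StAX (`javax.xml.stream`), the pull API bundled with later JDKs, as a stand-in for XMLPull so that the example runs without extra libraries; the XMLPull calls differ in name but follow the same request-an-event pattern. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative comparison; StAX stands in for XMLPull on the pull side.
class PushVersusPull {

    /** Push style: the parser drives and calls the handler until the document ends. */
    static String firstPageWithSax(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        final boolean[] state = {false, false};   // {inside first <page>, already captured}
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals("page") && !state[1]) state[0] = true;
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                if (state[0]) text.append(ch, start, length);   // data arrives piecemeal
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("page") && state[0]) { state[0] = false; state[1] = true; }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    /** Pull style: the caller requests events and stops as soon as it has its answer. */
    static String firstPageWithPull(String xml) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals("page")) {
                    return reader.getElementText();
                }
            }
            return "";
        } finally {
            reader.close();
        }
    }
}
```

The SAX version needs state flags to remember where it is between callbacks and keeps receiving events after it has its answer, while the pull version keeps that state in ordinary control flow and simply returns.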
Both interface types perform well, so you don't need to focus on performance when making your decision. Be careful with your specific SAX2 parser choice if performance is a concern, though; you've seen that the choice can make a big difference in some circumstances.
If you do use SAX2, a flexible approach such as the one I outlined in Part 2 can help you manage event-driven programming's complexities. If you go with an XMLPull approach, using a wrapper to provide higher-level primitives for your program (as shown in Part 2 and updated here) makes the programming even easier. In either case, I hope you enjoy your work with event-stream XML processing!
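As a rough illustration of what such a wrapper might look like, here is a minimal sketch of higher-level primitives layered over a pull parser. It is not the wrapper from Part 2: the `PullWrapper` class and its `skipTo`, `text`, and `allText` methods are hypothetical, and it again uses the JDK's StAX API as a stand-in for an XMLPull implementation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical wrapper for illustration; names are not taken from Part 2.
class PullWrapper {
    private final XMLStreamReader reader;

    PullWrapper(String xml) throws XMLStreamException {
        reader = XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
    }

    /** Advances to the next start tag with the given name; returns false at end of document. */
    boolean skipTo(String name) throws XMLStreamException {
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals(name)) {
                return true;
            }
        }
        return false;
    }

    /** Returns the text content of the element the parser is currently positioned on. */
    String text() throws XMLStreamException {
        return reader.getElementText();
    }

    void close() throws XMLStreamException {
        reader.close();
    }

    /** Collects the text of every element with the given name, in document order. */
    static List<String> allText(String xml, String name) throws XMLStreamException {
        PullWrapper in = new PullWrapper(xml);
        try {
            List<String> values = new ArrayList<>();
            while (in.skipTo(name)) {
                values.add(in.text());
            }
            return values;
        } finally {
            in.close();
        }
    }
}
```

With primitives like these, application code reads as a sequence of "skip to this element, take its text" steps rather than a loop over raw event codes.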
Learn more about this topic
- To download this article's source code, go to
- Dennis Sosnoski's "XML Documents on the Run" (JavaWorld):
  - Part 1: SAX speeds through XML documents with parse-event streams (February 2002)
  - Part 2: Better SAX2 handling and the pull-parser alternative (March 2002)
  - Part 3: How do SAX2 parsers perform compared to new XMLPull parsers? (April 2002)
- Visit my XMLBench homepage for the latest parser performance updates and to check out Java document-model performance comparisons
- For full details of the SAX specification, currently at version 2.0.1, go to
- For the five SAX2 parsers included in the performance tests, see:
- For full details of the XMLPull specification, currently at version 1.0.7, go to
- Here are the XMLPull parsers tested:
- For more SAX2 and pull-parser stories, visit the Java and XML section of JavaWorld's Topical Index
- Dennis Sosnoski also wrote JavaWorld's "Java Performance Programming" series:
  - Part 1: Learn how to reduce program overhead and improve performance by controlling object creation and garbage collection (November 1999)
  - Part 2: Reduce overhead and execution errors through type-safe code (December 1999)
  - Part 3: See how collections alternatives measure up in performance, and find out how to get the most out of each type (February 2000)