
Java Tip 128: Create a quick-and-dirty XML parser

Parse valid XML using minimal code

XML is a popular data format for several reasons: it is human readable, self-describing, and portable. Unfortunately, many Java-based XML parsers are very large; for example, Sun Microsystems' jaxp.jar and parser.jar libraries are 1.4 MB each. If memory is limited (for example, in a Java 2 Platform, Micro Edition (J2ME) environment) or bandwidth is at a premium (for example, in an applet), those large parsers might not be a viable solution.

Those libraries are large partly because they offer a lot of functionality, perhaps more than you require. They validate XML against DTDs (document type definitions), possibly schemas, and more. However, you might already know that your application will receive valid XML. You might also have decided that you need only the UTF-8 character set. In that case, all you really want is event-based processing of XML elements and translation of the standard XML entities: you want a nonvalidating parser.

Note: You can download this article's source code in Resources.

Why not just use SAX?

You could implement the SAX (Simple API for XML) interfaces with limited functionality, throwing an exception named NotImplemented whenever you encounter something unnecessary.

That approach would undoubtedly yield something much smaller than the 1.4 MB jaxp.jar/parser.jar libraries. You can cut the code size even further, however, by defining your own classes. In fact, the package we construct here will be considerably smaller than the jar file containing the SAX interface definitions.

Our quick-and-dirty parser is event-based like the SAX parser. Also like the SAX parser, it lets you implement an interface to catch and process events corresponding to attributes and start/end element tags. Hopefully, those of you who have used SAX will find this parser familiar.

Limit XML functionality

Many people want XML's simple, self-describing textual data format. They want to easily pick out elements, attributes and their values, and elements' textual content. With that in mind, let's consider what functionality we need to preserve.

Our simple parsing package has just one class, QDParser, and one interface, DocHandler. The QDParser itself has one public static method, parse(DocHandler,Reader), which we will implement as a finite state machine.
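To make the "one class, one interface, one state machine" design concrete, here is a minimal sketch rather than the article's actual source. The class name TinyParserSketch, the callback names in this DocHandler, and the three states are assumptions chosen for illustration. The sketch handles only start tags, end tags, and text; it ignores attributes, entities, comments, and self-closing tags.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class TinyParserSketch {

    /** Hypothetical callback interface; the method names are assumptions modeled on SAX. */
    public interface DocHandler {
        void startElement(String tag);
        void endElement(String tag);
        void text(String str);
    }

    /** The states of this illustrative machine (not the article's actual state set). */
    private enum State { TEXT, OPEN_TAG, CLOSE_TAG }

    /** Reads one character at a time and fires an event on each state transition. */
    public static void parse(DocHandler doc, Reader r) throws IOException {
        State state = State.TEXT;
        StringBuilder buf = new StringBuilder();
        int c;
        while ((c = r.read()) != -1) {
            char ch = (char) c;
            switch (state) {
                case TEXT:
                    if (ch == '<') {
                        // Flush any pending character data before entering a tag.
                        if (buf.length() > 0) {
                            doc.text(buf.toString());
                            buf.setLength(0);
                        }
                        state = State.OPEN_TAG;
                    } else {
                        buf.append(ch);
                    }
                    break;
                case OPEN_TAG:
                    if (ch == '/' && buf.length() == 0) {
                        state = State.CLOSE_TAG;   // "</" begins an end tag
                    } else if (ch == '>') {
                        doc.startElement(buf.toString());
                        buf.setLength(0);
                        state = State.TEXT;
                    } else {
                        buf.append(ch);
                    }
                    break;
                case CLOSE_TAG:
                    if (ch == '>') {
                        doc.endElement(buf.toString());
                        buf.setLength(0);
                        state = State.TEXT;
                    } else {
                        buf.append(ch);
                    }
                    break;
            }
        }
    }

    /** Convenience helper for the demo: records every event as a string. */
    public static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        try {
            parse(new DocHandler() {
                public void startElement(String tag) { log.add("start:" + tag); }
                public void endElement(String tag) { log.add("end:" + tag); }
                public void text(String str) { log.add("text:" + str); }
            }, new StringReader(xml));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        // prints [start:a, start:b, text:hi, end:b, end:a]
        System.out.println(events("<a><b>hi</b></a>"));
    }
}
```

The sketch uses Java SE conveniences (generics, enums) for readability; a J2ME port would need older equivalents such as Vector and int constants.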

Our limited-functionality parser treats the DTD declaration (<!DOCTYPE>) and processing instructions, including the XML declaration (<?xml version="1.0"?>), simply as comments. Their presence won't confuse the parser, but it won't use their content either.
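One way to picture "treat it as a comment" is a routine that skips such constructs outright. The following is a standalone sketch under stated assumptions, not the parser's own code: the class MarkupSkipper and its methods are hypothetical. It tracks square brackets so a DOCTYPE with an internal subset is skipped as one unit, but it does not account for quoted strings containing '>' or brackets, or for CDATA sections.

```java
public class MarkupSkipper {

    /**
     * If a processing instruction, comment, or DOCTYPE starts at 'start',
     * returns the index just past it; otherwise returns 'start' unchanged.
     */
    public static int skip(String s, int start) {
        if (s.startsWith("<?", start)) {
            int end = s.indexOf("?>", start + 2);
            return end < 0 ? s.length() : end + 2;
        }
        if (s.startsWith("<!--", start)) {
            int end = s.indexOf("-->", start + 4);
            return end < 0 ? s.length() : end + 3;
        }
        if (s.startsWith("<!DOCTYPE", start)) {
            int depth = 0; // nesting level of [ ... ] in the internal subset
            for (int i = start; i < s.length(); i++) {
                char c = s.charAt(i);
                if (c == '[') {
                    depth++;
                } else if (c == ']') {
                    depth--;
                } else if (c == '>' && depth == 0) {
                    return i + 1;
                }
            }
            return s.length(); // unterminated DOCTYPE: skip to end of input
        }
        return start;
    }

    /** Returns the document with all skippable constructs removed. */
    public static String strip(String xml) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < xml.length()) {
            int next = skip(xml, i);
            if (next > i) {
                i = next;               // ignore the whole construct
            } else {
                out.append(xml.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?><!DOCTYPE a [<!ENTITY x \"y\">]><a>hi</a>";
        System.out.println(strip(xml)); // prints <a>hi</a>
    }
}
```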

Because we won't process DOCTYPE, our parser cannot read custom entity definitions. We will have only the standard ones available: &amp;, &lt;, &gt;, &apos;, and &quot;. If this is a problem, you can insert code to expand custom definitions, as the source code shows. Alternatively, you could preprocess the document, replacing custom entity references with their expanded text before handing the document to the QDParser.
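Translating the five standard entities takes only a small lookup table. The sketch below is illustrative and not taken from the article's source: the class EntityDecoder and its decode method are hypothetical names. It leaves unrecognized references untouched, which is also the place where custom expansions could be added to the table.

```java
import java.util.HashMap;
import java.util.Map;

public class EntityDecoder {

    // The five entities predefined by XML, keyed by name without '&' and ';'.
    private static final Map<String, String> STANDARD = new HashMap<>();
    static {
        STANDARD.put("amp", "&");
        STANDARD.put("lt", "<");
        STANDARD.put("gt", ">");
        STANDARD.put("apos", "'");
        STANDARD.put("quot", "\"");
    }

    /** Replaces standard entity references; anything else is copied through as-is. */
    public static String decode(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '&') {
                int semi = text.indexOf(';', i + 1);
                if (semi > 0) {
                    String value = STANDARD.get(text.substring(i + 1, semi));
                    if (value != null) {
                        out.append(value);
                        i = semi + 1; // resume after ';' so output is never re-decoded
                        continue;
                    }
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("a &lt; b &amp;&amp; c &gt; d")); // prints a < b && c > d
    }
}
```

Because decoding resumes after the semicolon, input such as &amp;lt; yields the literal text &lt; rather than being expanded a second time.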

Our parser also cannot support conditional sections, for example <![INCLUDE[ ... ]]> or <![IGNORE[ ... ]]>. Because we cannot define custom entities in DOCTYPE, we don't really need this functionality anyway. If a document contains such sections, we could process them before the data is sent to our limited-space application.


Resources