Cut, paste, split, and assemble XML documents with VTD-XML

VTD-XML eliminates the performance overhead associated with updating XML

 import com.ximpleware.*;
import java.io.*;
// This example shows how to delete all CDs priced above $10
public class cut {
  public static void main(String[] args){
    try{
        VTDGen vg = new VTDGen();
        File fo = new File("cd_after.xml");
        FileOutputStream fos = new FileOutputStream(fo);
        if (vg.parseFile("cd.xml",false)){
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            ap.selectXPath("/CATALOG/CD[PRICE > 10]");
            // flb holds the offset and length of each segment to be skipped
            FastLongBuffer flb = new FastLongBuffer(4); // Page size is 2^4 = 16
            int i;
            byte[] xml = vn.getXML().getBytes();
            while( (i=ap.evalXPath())!= -1){
               flb.append(vn.getElementFragment());
            }
            int size = flb.size();
            if (size == 0){
                fos.write(xml); // No change needed because no CD is priced above $10
            }
            else{
               int os1 = 0;
               for (int k = 0;k<size; k++){
                   fos.write(xml, os1, flb.lower32At(k)-1 - os1);
                   os1 = flb.upper32At(k) + flb.lower32At(k);
               }
               fos.write(xml, os1, xml.length - os1);
           }
        }
        fos.close();
    }
    catch (Exception e){
        System.out.println("exception occurred ==>"+e);
    }
  }
}

The following is the output XML named cd_after.xml:

 

<CATALOG>
    <CD>
        <TITLE>Hide your heart</TITLE>
        <ARTIST>Bonnie Tyler</ARTIST>
        <COUNTRY>UK</COUNTRY>
        <COMPANY>CBS Records</COMPANY>
        <PRICE>9.90</PRICE>
        <YEAR>1988</YEAR>
    </CD>
    <CD>
        <TITLE>Greatest Hits</TITLE>
        <ARTIST>Dolly Parton</ARTIST>
        <COUNTRY>USA</COUNTRY>
        <COMPANY>RCA</COMPANY>
        <PRICE>9.90</PRICE>
        <YEAR>1982</YEAR>
    </CD>
</CATALOG>
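The cut example relies on each long in flb carrying two values: lower32At(k) is the fragment's starting offset and upper32At(k) is its length. The following is a minimal sketch of that skip-and-copy idea in plain Java, without the VTD-XML library. The class CutSketch and its helpers (pack, offsetOf, lengthOf, cut) are hypothetical names introduced only for illustration, and the tiny sample document is made up.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class CutSketch {
    // Hypothetical helper: offset goes in the lower 32 bits, length in the upper 32 bits.
    static long pack(int offset, int length) {
        return ((long) length << 32) | (offset & 0xFFFFFFFFL);
    }

    static int offsetOf(long fragment) { return (int) fragment; }

    static int lengthOf(long fragment) { return (int) (fragment >>> 32); }

    // Copies doc while skipping the given fragments.
    // Assumes the fragments are in document order and do not overlap.
    static byte[] cut(byte[] doc, long[] fragments) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pos = 0; // first byte not yet copied
        for (long f : fragments) {
            out.write(doc, pos, offsetOf(f) - pos); // bytes before the fragment
            pos = offsetOf(f) + lengthOf(f);        // jump past the fragment
        }
        out.write(doc, pos, doc.length - pos);      // remaining tail
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] doc = "<a><b>1</b><c>2</c><d>3</d></a>".getBytes(StandardCharsets.US_ASCII);
        long[] toCut = { pack(11, 8) }; // <c>2</c> starts at byte 11 and is 8 bytes long
        System.out.println(new String(cut(doc, toCut), StandardCharsets.US_ASCII));
        // prints <a><b>1</b><d>3</d></a>
    }
}
```

Unlike the listing, this sketch copies every byte up to the fragment's first byte, so whitespace around a removed element stays in the output.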

Pasting XML

Our second example copies part of the content from one XML file to another. The source XML is the same cd.xml used in the first example. This time, our application copies all CDs priced below $10 from cd.xml and pastes them into cd2.xml (shown below), which contains a few other CDs priced below $10.

 <CATALOG>
    <CD>
        <TITLE>Eros</TITLE>
        <ARTIST>Eros Ramazzotti</ARTIST>
        <COUNTRY>EU</COUNTRY>
        <COMPANY>BMG</COMPANY>
        <PRICE>9.90</PRICE>
        <YEAR>1997</YEAR>
    </CD>
    <CD>
        <TITLE>Sylvias Mother</TITLE>
        <ARTIST>Dr.Hook</ARTIST>
        <COUNTRY>UK</COUNTRY>
        <COMPANY>CBS</COMPANY>
        <PRICE>8.10</PRICE>
        <YEAR>1973</YEAR>
    </CD>
    <CD>
        <TITLE>When a man loves a woman</TITLE>
        <ARTIST>Percy Sledge</ARTIST>
        <COUNTRY>USA</COUNTRY>
        <COMPANY>Atlantic</COMPANY>
        <PRICE>8.70</PRICE>
        <YEAR>1987</YEAR>
    </CD>
</CATALOG>

Our application is quite similar to the first example, except that this time it also parses cd2.xml, so that the elements from cd.xml matching the XPath expression /CATALOG/CD[PRICE < 10] can be inserted right after the CD entitled Sylvias Mother:

 

import com.ximpleware.*;
import java.io.*;
// This example shows how to copy/paste elements between XML files
public class paste {
  public static void main(String[] args){
    try{
        VTDGen vg = new VTDGen();
        File fo = new File("cd_after.xml");
        FileOutputStream fos = new FileOutputStream(fo);
        if (vg.parseFile("cd.xml",false)){
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            ap.selectXPath("/CATALOG/CD[PRICE < 10]");
            // flb holds the offset and length of each segment to be copied
            FastLongBuffer flb = new FastLongBuffer(4);
            int i;
            byte[] xml = vn.getXML().getBytes();
            while( (i=ap.evalXPath())!= -1){
                flb.append(vn.getElementFragment());
            }
            VTDNav vn2 = null;
            if (vg.parseFile("cd2.xml",false)){
                vn2 = vg.getNav();
                AutoPilot ap2 = new AutoPilot(vn2);
                ap2.selectXPath("/CATALOG/CD[TITLE=\"Sylvias Mother\"]");
                byte[] xml2 = vn2.getXML().getBytes();
                long l2 = 0;
                if (ap2.evalXPath()!=-1){ // eval XPath just once
                    l2 = vn2.getElementFragment();
                }
                int os = (int) l2;        // lower 32 bits: offset of the anchor CD
                int len = (int) (l2>>32); // upper 32 bits: length of the anchor CD
                int size = flb.size();
                if (size == 0){
                    fos.write(xml2);
                }
                else{
                    // Write cd2.xml up to the end of the anchor CD
                    fos.write(xml2, 0, os + len);
                    for (int k = 0; k<size; k++){
                        fos.write("\n".getBytes());
                        fos.write(xml, flb.lower32At(k), flb.upper32At(k));
                    }
                    // Write the rest of cd2.xml, starting right after the anchor CD
                    fos.write(xml2, os + len, xml2.length - (os + len));
                }
                fos.close();
            }
        }
    }
    catch (Exception e){
        System.out.println("exception occurred ==>"+e);
    }
  }
}

This time, the output XML contains five CDs, all priced below $10:

 <CATALOG>
    <CD>
        <TITLE>Eros</TITLE>
        <ARTIST>Eros Ramazzotti</ARTIST>
        <COUNTRY>EU</COUNTRY>
        <COMPANY>BMG</COMPANY>
        <PRICE>9.90</PRICE>
        <YEAR>1997</YEAR>
    </CD>
    <CD>
        <TITLE>Sylvias Mother</TITLE>
        <ARTIST>Dr.Hook</ARTIST>
        <COUNTRY>UK</COUNTRY>
        <COMPANY>CBS</COMPANY>
        <PRICE>8.10</PRICE>
        <YEAR>1973</YEAR>
    </CD>
    <CD>
        <TITLE>Hide your heart</TITLE>
        <ARTIST>Bonnie Tyler</ARTIST>
        <COUNTRY>UK</COUNTRY>
        <COMPANY>CBS Records</COMPANY>
        <PRICE>9.90</PRICE>
        <YEAR>1988</YEAR>
    </CD>
    <CD>
        <TITLE>Greatest Hits</TITLE>
        <ARTIST>Dolly Parton</ARTIST>
        <COUNTRY>USA</COUNTRY>
        <COMPANY>RCA</COMPANY>
        <PRICE>9.90</PRICE>
        <YEAR>1982</YEAR>
    </CD>
    <CD>
        <TITLE>When a man loves a woman</TITLE>
        <ARTIST>Percy Sledge</ARTIST>
        <COUNTRY>USA</COUNTRY>
        <COMPANY>Atlantic</COMPANY>
        <PRICE>8.70</PRICE>
        <YEAR>1987</YEAR>
    </CD>
</CATALOG>
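The offset arithmetic behind a paste is easy to get wrong by one byte, so it may help to see it in isolation. The sketch below, in plain Java without the VTD-XML library, splits the target at the first byte after the anchor element (offset plus length), writes the head, then the copied fragments, then the tail. PasteSketch and pasteAfter are hypothetical names, and the two small documents are invented for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class PasteSketch {
    // Inserts fragments of source into target right after the anchor element.
    // Each fragment is a long: offset in the lower 32 bits, length in the upper 32 bits.
    static byte[] pasteAfter(byte[] target, int anchorOffset, int anchorLength,
                             byte[] source, long[] fragments) {
        int insertAt = anchorOffset + anchorLength; // first byte after the anchor
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(target, 0, insertAt);             // head, including the anchor
        for (long f : fragments) {
            int offset = (int) f;
            int length = (int) (f >>> 32);
            out.write(source, offset, length);      // one copied fragment
        }
        out.write(target, insertAt, target.length - insertAt); // untouched tail
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] target = "<x><p/><q/></x>".getBytes(StandardCharsets.US_ASCII);
        byte[] source = "<s><m/><n/></s>".getBytes(StandardCharsets.US_ASCII);
        // <m/> is at offset 3 and <n/> at offset 7; both are 4 bytes long
        long[] fragments = { (4L << 32) | 3, (4L << 32) | 7 };
        byte[] result = pasteAfter(target, 3, 4, source, fragments); // anchor is <p/>
        System.out.println(new String(result, StandardCharsets.US_ASCII));
        // prints <x><p/><m/><n/><q/></x>
    }
}
```

Because the head and tail lengths add up to the length of the target, no byte of the target is duplicated or lost.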

Splitting XML

In the same cd.xml used in the previous examples, each CD element is itself well-formed XML. Our next example writes every CD matching /CATALOG/CD[PRICE > 10] to its own file:

 

import com.ximpleware.*;
import java.io.*;
// This example shows how to split XML
public class split {
  public static void main(String[] args){
    try{
        VTDGen vg = new VTDGen();
        if (vg.parseFile("cd.xml",false)){
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            ap.selectXPath("/CATALOG/CD[PRICE > 10]");
            // flb holds the offset and length of each segment to be written out
            FastLongBuffer flb = new FastLongBuffer(4);
            int i;
            byte[] xml = vn.getXML().getBytes();
            while( (i=ap.evalXPath())!= -1){
                flb.append(vn.getElementFragment());
            }
            int size = flb.size();
            if (size != 0){
                for (int k = 0;k<size; k++){
                    File fo = new File("cd_"+k+".xml");
                    FileOutputStream fos = new FileOutputStream(fo);
                    fos.write(xml, flb.lower32At(k), flb.upper32At(k));
                    fos.close();
                }
            }
        }
    }
    catch (Exception e){
        System.out.println("exception occurred ==>"+e);
    }
  }
}
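As a library-free illustration of the same split step, the sketch below slices a byte array into one piece per fragment and then writes each piece to its own file. SplitSketch and split are hypothetical names, the sample document is invented, and the files go to a temporary directory rather than the working directory.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SplitSketch {
    // Returns one byte array per fragment.
    // Each fragment is a long: offset in the lower 32 bits, length in the upper 32 bits.
    static List<byte[]> split(byte[] doc, long[] fragments) {
        List<byte[]> parts = new ArrayList<>();
        for (long f : fragments) {
            int offset = (int) f;
            int length = (int) (f >>> 32);
            parts.add(Arrays.copyOfRange(doc, offset, offset + length));
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "<a><b>1</b><c>2</c></a>".getBytes(StandardCharsets.US_ASCII);
        // <b>1</b> starts at byte 3, <c>2</c> at byte 11; both are 8 bytes long
        List<byte[]> parts = split(doc, new long[] { (8L << 32) | 3, (8L << 32) | 11 });
        Path dir = Files.createTempDirectory("split");
        for (int k = 0; k < parts.size(); k++) {
            Path file = dir.resolve("cd_" + k + ".xml");
            Files.write(file, parts.get(k)); // opens, writes, and closes the file
            System.out.println(file + " <- " + new String(parts.get(k), StandardCharsets.US_ASCII));
        }
    }
}
```

Copying each slice costs an extra allocation that the listing avoids by writing straight from the original buffer; the copy is used here only to keep the slicing logic separate from file I/O.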

Assembling XML

This final example pulls all CDs released before 1990 from cd.xml and cd2.xml, using the XPath expression /CATALOG/CD[YEAR < 1990], and writes them into result.xml. It also demonstrates XPath reuse and late binding of the AutoPilot object: the expression is selected once, before any document is parsed, and the AutoPilot is then bound to each document in turn.

 

import com.ximpleware.*;
import java.io.*;
// This example shows how to assemble CDs released before 1990 from two XML files
public class assemble {
  public static void main(String[] args){
    try{
        VTDGen vg = new VTDGen();
        File fo = new File("result.xml");
        FileOutputStream fos = new FileOutputStream(fo);
        AutoPilot ap = new AutoPilot();
        FastLongBuffer flb = new FastLongBuffer(4);
        ap.selectXPath("/CATALOG/CD[YEAR < 1990]");
        fos.write("<result>\n".getBytes());
        if (vg.parseFile("cd.xml",false)){
            VTDNav vn = vg.getNav();
            ap.bind(vn);
            // flb holds the offset and length of each segment to be copied
            int i;
            byte[] xml = vn.getXML().getBytes();
            while( (i=ap.evalXPath())!= -1){
                flb.append(vn.getElementFragment());
            }
            int size = flb.size();
            for (int k = 0;k<size; k++){
                fos.write("\n".getBytes());
                fos.write(xml, flb.lower32At(k), flb.upper32At(k));
            }
            ap.resetXPath(); // Reset XPath so it can be reused
            flb.clear();
        }
        if (vg.parseFile("cd2.xml",false)){
            VTDNav vn = vg.getNav();
            // Reuse AutoPilot
            ap.bind(vn);
            // flb holds the offset and length of each segment to be copied
            int i;
            byte[] xml = vn.getXML().getBytes();
            while( (i=ap.evalXPath())!= -1){
                flb.append(vn.getElementFragment());
            }
            int size = flb.size();
            for (int k = 0;k<size; k++){
                fos.write("\n".getBytes());
                fos.write(xml, flb.lower32At(k), flb.upper32At(k));
            }
        }
        fos.write("\n</result>".getBytes());
        fos.close();
    }
    catch (Exception e){
        System.out.println("exception occurred ==>"+e);
    }
  }
}

As expected, result.xml now contains all CDs released before 1990:

 

<result>
    <CD>
        <TITLE>Empire Burlesque</TITLE>
        <ARTIST>Bob Dylan</ARTIST>
        <COUNTRY>USA</COUNTRY>
        <COMPANY>Columbia</COMPANY>
        <PRICE>10.90</PRICE>
        <YEAR>1985</YEAR>
    </CD>
    <CD>
        <TITLE>Hide your heart</TITLE>
        <ARTIST>Bonnie Tyler</ARTIST>
        <COUNTRY>UK</COUNTRY>
        <COMPANY>CBS Records</COMPANY>
        <PRICE>9.90</PRICE>
        <YEAR>1988</YEAR>
    </CD>
    <CD>
        <TITLE>Greatest Hits</TITLE>
        <ARTIST>Dolly Parton</ARTIST>
        <COUNTRY>USA</COUNTRY>
        <COMPANY>RCA</COMPANY>
        <PRICE>9.90</PRICE>
        <YEAR>1982</YEAR>
    </CD>
    <CD>
        <TITLE>Sylvias Mother</TITLE>
        <ARTIST>Dr.Hook</ARTIST>
        <COUNTRY>UK</COUNTRY>
        <COMPANY>CBS</COMPANY>
        <PRICE>8.10</PRICE>
        <YEAR>1973</YEAR>
    </CD>
    <CD>
        <TITLE>When a man loves a woman</TITLE>
        <ARTIST>Percy Sledge</ARTIST>
        <COUNTRY>USA</COUNTRY>
        <COMPANY>Atlantic</COMPANY>
        <PRICE>8.70</PRICE>
        <YEAR>1987</YEAR>
    </CD>
</result>
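The compile-once, bind-later pattern is not specific to VTD-XML. As a rough analogy only, the sketch below uses the JDK's own javax.xml.xpath API, which is DOM-based and therefore not the technique this article advocates, to compile one expression and evaluate it against two documents. ReuseSketch and titlesBefore1990 are hypothetical names, and the inline documents are trimmed-down stand-ins for cd.xml and cd2.xml.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ReuseSketch {
    // Compiles the expression once, then evaluates it against each document in turn.
    static List<String> titlesBefore1990(String... xmlDocs) throws Exception {
        XPathExpression expr = XPathFactory.newInstance().newXPath()
                .compile("/CATALOG/CD[YEAR < 1990]/TITLE");
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        List<String> titles = new ArrayList<>();
        for (String xml : xmlDocs) {
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // The document is supplied only at evaluation time
            NodeList hits = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
            for (int i = 0; i < hits.getLength(); i++) {
                titles.add(hits.item(i).getTextContent());
            }
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String first = "<CATALOG><CD><TITLE>Empire Burlesque</TITLE><YEAR>1985</YEAR></CD></CATALOG>";
        String second = "<CATALOG><CD><TITLE>Eros</TITLE><YEAR>1997</YEAR></CD>"
                + "<CD><TITLE>Sylvias Mother</TITLE><YEAR>1973</YEAR></CD></CATALOG>";
        System.out.println(titlesBefore1990(first, second));
        // prints [Empire Burlesque, Sylvias Mother]
    }
}
```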

You can download all the code samples in this article from Resources.

Wrapping up

This article has demonstrated VTD-XML's incremental update capability and its ability to cut, paste, split, and assemble XML documents. I hope it has furthered your understanding of why VTD-XML is the next-generation XML-processing API that goes beyond DOM and SAX in virtually every way imaginable. Given the overwhelming benefit of non-extractive tokenization, you may be wondering why it wasn't applied to text processing earlier. The rest of this article outlines some of my thoughts concerning the past, present, and future of VTD-XML, as well as some of the development occurring in the XML community.

The problem has changed

Both DOM and SAX are based on traditional text-processing techniques invented years ago for designing compilers, which do not demand high performance because a compiler is only used to generate binary code, such as executable files or library files. HTML rendering using DOM also does not demand high performance because, in terms of user experience, the difference between 10 ms and 200 ms is not huge. In an environment where a constant stream of XML must be parsed, updated, and queried in real time, DOM, SAX, and their underlying text-processing approaches have finally begun to show their age, paving the way for innovative new techniques such as VTD-XML.

Do not underestimate the harm of bad standards

Although DOM and SAX are widely adopted, it is not clear whether they have done more good than harm in terms of helping enterprises realize the full business benefit of XML. Have you wondered why memory usage and performance of any DOM implementation have not improved significantly for the past eight years? The fact is that there is little anyone can do to make DOM leaner or run faster. The DOM specification is in fact based on the assumption that the hierarchical structure of XML consists entirely of objects exposing the Node interface. The most any DOM implementation can do is alter the implementation of the object sitting behind the Node interface, making excessive object allocation inevitable.
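To make the allocation point concrete, here is a small sketch that counts the Node objects reachable from the root element of a DOM tree built by the JDK's standard parser. NodeCountSketch and countNodes are hypothetical names, and the input is a CD element from cd2.xml trimmed to two children. It is only an illustration of how nodes multiply, not a benchmark.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeCountSketch {
    // Counts this node plus all of its descendants (elements and text nodes).
    static int countNodes(Node node) {
        int count = 1;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countNodes(child);
        }
        return count;
    }

    // Parses the XML string into a DOM tree and counts nodes from the root element down.
    static int countNodes(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return countNodes(doc.getDocumentElement());
    }

    public static void main(String[] args) throws Exception {
        String xml = "<CD><TITLE>Eros</TITLE><YEAR>1997</YEAR></CD>";
        System.out.println(countNodes(xml)); // prints 5: CD, TITLE, its text, YEAR, its text
    }
}
```

Even this two-field record needs five Node objects, before counting the strings and internal bookkeeping each implementation adds behind the interface.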
