Modifying archives, Part 2: The Archive class

The Archive class allows you to write or modify stored archive files

Author's note: Before we get started on this month's article, I'd like to mention that my new book on Java threading, Taming Java Threads (APress, June 2000 (see Resources)), is finally out. The book shows you how to create production-quality multithreaded programs; it presents a full-blown industrial-strength threading library along with a lot of advice about threading pitfalls and good architecture. Much of the material in this book first appeared in JavaWorld as a nine-part series on threading (see Resources), though the material has been expanded considerably and the code has been cleaned up and expanded as well.

Modifying Archives

As I discussed in Part 1 of this series, the built-in Java archive classes contain no support for modifying an existing archive. They only let you build one from scratch. To modify an archive, you must copy it to another archive, performing the modifications along the way. Three classes are involved in the transfer:

  • ZipFile: Represents the file as a whole; you get ZipEntry objects that represent the archive's contents from here. The constructor takes the full path name of the .zip or .jar file as an argument.
  • ZipEntry: Essentially the directory entry for a file within the archive. You get an InputStream for a particular file within the archive by calling a_ZipFile_object.getInputStream(a_ZipEntry).
  • ZipOutputStream: An output stream that builds an archive. You can write ZipEntry objects onto this stream as well as the actual data (the ZipEntry object has to be written first, then the data). A ZipOutputStream is a standard java.io-style decorator used along the same lines as BufferedOutputStream. You pass an OutputStream representing the physical archive file to the ZipOutputStream as a constructor argument, and you write to the ZipOutputStream wrapper.
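As a rough illustration of how these three classes fit together, here is a minimal sketch that builds a one-file archive and then reads that file back. The class and method names (ZipBasics, writeArchive, readEntry) are made up for the example and are not part of the Archive class; the code also assumes a Java version with try-with-resources and InputStream.readAllBytes() (Java 9 or later).

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, for illustration only.
class ZipBasics
{
    /** Builds a one-file archive: the ZipEntry is written first, then the data. */
    static void writeArchive( File archive, String name, String text ) throws IOException
    {
        // The ZipOutputStream decorates the stream for the physical archive file.
        try( ZipOutputStream out = new ZipOutputStream( new FileOutputStream(archive) ) )
        {   out.putNextEntry( new ZipEntry(name) );
            out.write( text.getBytes(StandardCharsets.UTF_8) );
            out.closeEntry();
        }
    }

    /** Reads one file back by asking the ZipFile for that entry's InputStream. */
    static String readEntry( File archive, String name ) throws IOException
    {
        try( ZipFile zip = new ZipFile(archive) )
        {   ZipEntry entry = zip.getEntry( name );
            if( entry == null )
                throw new FileNotFoundException( name );

            try( InputStream in = zip.getInputStream(entry) )
            {   return new String( in.readAllBytes(), StandardCharsets.UTF_8 );
            }
        }
    }
}
```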

Next, we see the general (but not so easy) process for modifying an archive:

  1. Get all the ZipEntry objects for the existing archive.
  2. Create a temporary file to hold the new archive as it's being built.
  3. Wrap that temporary file in a ZipOutputStream.
  4. To remove a file:
    • Remove its entry from the list of ZipEntry objects made in Step 1.
  5. To replace a file in the archive:
    1. Remove the old ZipEntry from the list of entries made in Step 1.
    2. Make a new ZipEntry by copying relevant fields from the old one.
    3. Put the new ZipEntry into the ZipOutputStream.
    4. Copy the new contents of the file to the ZipOutputStream.
    5. Tell the ZipOutputStream that you're done with the entry.
    6. Close the InputStream from which you read the new contents.
  6. To add a file to the archive:
    • It's just like replacing a file, but there's no ZipEntry in the old archive, so you have to create one from scratch.
  7. Once you've made all the modifications, transfer the contents of the files represented by the ZipEntry objects that remain in the list created in Step 1 (that is, the files you haven't deleted or replaced). To do this, you'll have to open an InputStream for each of the entries remaining in the list (by asking the ZipFile for an InputStream for a particular ZipEntry), then transfer bytes from that stream to the ZipOutputStream using the process described earlier.
  8. Close the new and old archives, then rename the new one to have the same name as the old one.
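The steps above can be sketched with the standard java.util.zip classes roughly as follows. This is a simplified illustration, not the code from the Archive class: the ZipRewrite class and its rewrite() method are hypothetical, the replacement contents are assumed to fit in memory as byte arrays, every entry is written with the default DEFLATED method, and cleanup of the temporary file after a failure is left out. It assumes Java 9 or later for InputStream.transferTo().

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.Enumeration;
import java.util.Map;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, for illustration only.
class ZipRewrite
{
    /**
     * Copies the archive to a temporary file, skipping every entry named in
     * removals and writing each entry in replacements with its new contents.
     * A name in replacements that isn't in the old archive is simply added.
     */
    static void rewrite( File archive, Set<String> removals,
                         Map<String,byte[]> replacements ) throws IOException
    {
        File directory = archive.getAbsoluteFile().getParentFile();
        File temporary = File.createTempFile( "archive", ".tmp", directory );

        try( ZipFile old = new ZipFile(archive);
             ZipOutputStream out = new ZipOutputStream( new FileOutputStream(temporary) ) )
        {
            // Replaced and added files: the entry goes in first, then the data.
            for( Map.Entry<String,byte[]> replacement : replacements.entrySet() )
            {   out.putNextEntry( new ZipEntry(replacement.getKey()) );
                out.write( replacement.getValue() );
                out.closeEntry();
            }

            // Everything that wasn't removed or replaced is copied through.
            for( Enumeration<? extends ZipEntry> e = old.entries(); e.hasMoreElements(); )
            {   ZipEntry source = e.nextElement();
                String   name   = source.getName();
                if( removals.contains(name) || replacements.containsKey(name) )
                    continue;

                ZipEntry copy = new ZipEntry( name );  // sizes and CRC are recomputed
                copy.setTime( source.getTime() );
                out.putNextEntry( copy );
                try( InputStream in = old.getInputStream(source) )
                {   in.transferTo( out );
                }
                out.closeEntry();
            }
        }

        // Both archives are closed now, so the new one can take the old one's name.
        Files.move( temporary.toPath(), archive.toPath(),
                    StandardCopyOption.REPLACE_EXISTING );
    }
}
```

Creating the temporary file in the same directory as the archive is a design choice in this sketch: it keeps the final rename on one file system.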

To make matters worse, the requirements for writing a compressed (ZipEntry.DEFLATED) file differ from those for writing an uncompressed (ZipEntry.STORED) file. The ZipEntry for uncompressed files must be initialized with a CRC value (a checksum) and file size before it can be written to the ZipOutputStream. The checksum can be built using Java's CRC32 class (which is passed the bytes that comprise the file and provides a checksum when all the bytes have been passed in). The ZipEntry must be written before the file contents, however, so you have to process the data twice -- once to figure out the CRC and once again to copy the bytes to the ZipOutputStream. Fortunately, the process isn't so brain dead for a compressed file; you can give the ZipOutputStream a ZipEntry with uninitialized size and CRC fields, and the ZipOutputStream will fill in the fields for you as it does the compression.
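The two-pass requirement for STORED entries looks roughly like this in code. This is a minimal sketch that assumes the file's bytes are already in memory; the StoredEntryWriter class and writeStored() method are hypothetical names used only for the example.

```java
import java.io.IOException;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, for illustration only.
class StoredEntryWriter
{
    /** Writes data as an uncompressed (STORED) entry with the given name. */
    static void writeStored( ZipOutputStream out, String name, byte[] data )
                                                            throws IOException
    {
        // First pass: compute the checksum over every byte of the file.
        CRC32 crc = new CRC32();
        crc.update( data, 0, data.length );

        // A STORED entry must carry its size and CRC before it's written.
        ZipEntry entry = new ZipEntry( name );
        entry.setMethod( ZipEntry.STORED );
        entry.setSize( data.length );
        entry.setCompressedSize( data.length );  // nothing is compressed
        entry.setCrc( crc.getValue() );

        // Second pass: the entry goes in first, then the same bytes again.
        out.putNextEntry( entry );
        out.write( data );
        out.closeEntry();
    }
}
```

For a DEFLATED entry, the setMethod(), setSize(), setCompressedSize(), and setCrc() calls can simply be omitted.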

The entire process proves ridiculously complicated, and it's mysterious to me that Sun, which has worked so hard to hide complexity elsewhere in the Java packages, has given us this hideous mechanism for archive management.

Using the Archive class

I've hidden all this complexity in the Archive class -- the subject of this month's Java Toolbox. Compared to Sun's APIs, the Archive is refreshingly easy to use.

To get started, first create an Archive object using one of two constructors:

In both cases, the first argument is a pathname string that identifies the .zip or .jar file that you want to access. Note: the file doesn't need to exist if you're creating a new archive, as opposed to modifying or examining an existing one.

Meanwhile, the compress argument tells the Archive what to do with new files that you add to the archive. If it's true, we compress the new files (using the maximum compression ratio); otherwise, we simply store the files. If you're modifying a file in an existing archive, the original compression mode is preserved.

Once you've created the Archive, reading or writing a file is simply a matter of asking for an appropriate InputStream or OutputStream. Three methods are provided for this purpose:

  • InputStream input_stream_for(String internal_path)
  • OutputStream output_stream_for(String internal_path, boolean appending)
  • OutputStream output_stream_for(String internal_path)

The internal_path argument specifies the path (within the archive) to the file you want to access. If you specify the appending flag and it's true, then the characters sent to the returned OutputStream are appended to the existing file rather than overwriting the original contents. The version of output_stream_for() without an appending argument always overwrites. When you're done with the read or write operation, just close the stream in the normal way (by passing it a close() message).

You can remove a file from the archive by calling:

void remove(java.lang.String internal_path) 

which works as expected.

When you're done with the archive, you have to close it using one of two methods. The close() request closes the archive file, preserving any changes you've made, while the revert() request closes the archive file, discarding any changes you've made. It's important to call revert() if you're discarding changes (as opposed to simply abandoning the Archive reference) because otherwise temporary files used to perform the archive manipulation will remain on the disk.

Listing 1 shows a simple example of a program that copies standard input into a file called input.txt in the root directory of an archive called input.zip. If the archive already exists, then the existing input.txt file is overwritten with new contents.

Listing 1. Arc.java
import com.holub.io.Archive;
import java.io.*;

public class Arc
{
    public static void main(String[] s) throws Exception
    {
        // Open input.zip; the file doesn't have to exist yet.
        Archive archive = new Archive("input.zip");

        // Overwrites input.txt in the archive's root directory.
        OutputStream out = archive.output_stream_for( "input.txt" );

        int c;
        while( (c = System.in.read()) != -1 )
            out.write( c );

        out.close();        // close the stream first, ...
        archive.close();    // ... then close the archive, preserving the changes
    }
}

Be aware that Archive is thread safe in a rather primitive way: no two threads may access the Archive simultaneously. The input_stream_for(...) or output_stream_for(...) methods effectively lock the Archive object, and it remains locked until the stream returned by one or the other of these methods closes. Any thread that tries to get an input or output stream or otherwise use the Archive object while a stream remains active will block until that stream closes. Keep in mind that at some point I might make this access a bit less restrictive. Indeed, there's no theoretical reason why one thread couldn't read an archive while another is writing, for example, provided that they aren't accessing the same file. I haven't had need of this behavior, however, so I haven't implemented it.
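The "locked until the stream closes" behavior can be illustrated with a small stand-alone sketch. This is not the locking code in the Archive class; it's a hypothetical Exclusive_store class (with made-up output_stream() and is_idle() methods) that shows one way to get the same effect, using a one-permit java.util.concurrent.Semaphore that is acquired when a stream is handed out and released when that stream is closed.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.Semaphore;

// Hypothetical class, for illustration only.
class Exclusive_store
{
    // One permit: whoever holds it owns the whole store until the stream closes.
    private final Semaphore lock = new Semaphore( 1 );
    private final ByteArrayOutputStream contents = new ByteArrayOutputStream();

    /** Blocks until no other stream is active, then hands out a stream. */
    OutputStream output_stream() throws InterruptedException
    {
        lock.acquire();
        return new FilterOutputStream( contents )
        {   private boolean closed = false;

            @Override public synchronized void close() throws IOException
            {   if( closed )        // a second close() must not release twice
                    return;
                closed = true;
                flush();
                lock.release();     // the next waiting thread may now proceed
            }
        };
    }

    /** True when no stream is outstanding. */
    boolean is_idle()
    {   return lock.availablePermits() == 1;
    }
}
```

A semaphore is used here rather than a reentrant lock because the stream might be closed by a different thread than the one that requested it, and because a second request from the same thread should block as well.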

There's one other foible of the existing implementation: once you've written to a file within the archive, that file is no longer available for reading. If you want to read the modified file, you'll have to close() the Archive and then reopen it by creating a new Archive object. Again, I could change this behavior to allow reading a modified file, but I saw no reason to complicate the code by implementing features that I didn't use.

The final implementation detail focuses on three minor methods that let you get (limited) information about entries in the original source archive without having to deal with the ZipEntry object. First up, the is_newer_than(String file_name, Date d) method returns true if the file indicated by the first argument was modified after the Date passed in as the second argument. Second, the is_older_than(String file_name, Date d) method does the obvious. Finally, the contains(String file_name) method returns true if the source archive contains the file specified by its argument. If you need to delve deeper into the attributes of the entries, you'll have to open a ZipFile and get the attributes that way. Note that none of these three methods works reliably if you enquire about a file that's been modified or didn't exist in the original source archive.

The architecture

The architecture for the Archive class falls roughly under the aegis of the Abstract Factory design pattern, so let's look at the pattern first. A good example of a pure Abstract Factory in Java is a Collection with respect to an Iterator. In this example, you can write a method that traverses a data structure entirely in terms of interfaces, without having any idea what data structure you're traversing. Here's a method that leverages this ability to print all the elements of some unknown data structure:

public void print( Collection data_structure )
{
    Iterator i = data_structure.iterator();
    while( i.hasNext() )
    {   Object current = i.next();
        System.out.println( current.toString() );
    }
}

This way of working gives you tremendous flexibility at implementation time -- you can completely change the data structure and the way it's traversed without at all modifying the print() method.

Figure 1. The Abstract Factory pattern

Figure 1 shows the general pattern, which you might implement as follows:

public interface Collection
{   //...
    Iterator iterator();
}
public interface Iterator
{   boolean hasNext();
    Object next();
    void remove();
}
public class LinkedList implements Collection
{
    // Interface methods are implicitly public, so the implementations must be too.
    private static class List_iterator implements Iterator
    {   public boolean hasNext(){  /*...*/ }
        public Object  next()   {  /*...*/ }
        public void    remove() {  /*...*/ }
    }
    public Iterator iterator()
    {   return new List_iterator();
    }
}

At the basic level, the implementation of the Iterator interface is completely hidden from the user of the interface. The user knows literally nothing about the implementation other than the fact that it implements a well-defined interface. The iterator() factory method returns an instance of a private inner class that implements a public interface, thereby guaranteeing that even if the users of the Iterator object know the actual class name, they still can't access any methods of that implementation class other than the ones defined in the public interface. (Other methods may exist to provide a private communication system between the Iterator implementation and the object across which it's iterating.)
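For a version of this idea that actually compiles against the real java.util.Iterator interface, consider the following minimal sketch. The Countdown class and its Countdown_iterator inner class are invented for the example; the point is only that iterator() is declared to return the public interface while the class that does the work stays private. It assumes Java 8 or later, where Iterator.remove() has a default implementation.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical class, for illustration only: iterates start, start-1, ..., 1.
class Countdown implements Iterable<Integer>
{
    private final int start;

    Countdown( int start )
    {   this.start = start;
    }

    // Private implementation: callers only ever see java.util.Iterator.
    private static class Countdown_iterator implements Iterator<Integer>
    {   private int current;

        Countdown_iterator( int start )
        {   current = start;
        }
        public boolean hasNext()
        {   return current > 0;
        }
        public Integer next()
        {   if( current <= 0 )
                throw new NoSuchElementException();
            return current--;
        }
    }

    // The factory method: its return type is the public interface.
    public Iterator<Integer> iterator()
    {   return new Countdown_iterator( start );
    }
}
```

Code outside Countdown can't even declare a variable of type Countdown_iterator, so the implementation can be swapped out without touching any caller.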

The Archive

Figure 2 shows the static model for the Archive class implemented in Listing 2. As you can see from the figure, Archive follows the Abstract Factory pattern with one exception: there is no player in the actual Abstract Factory role -- there's only a Concrete Factory. The main point, though, is that when you ask an Archive for an OutputStream (by calling output_stream_for (Listing 2, line 124)), the method returns an instance of a private inner class Archive_OutputStream (Listing 2, line 370) that implements a public interface (java.io.OutputStream). The same reasoning applies to the InputStream derivative returned from input_stream_for(...) (Listing 2, line 195). (Yeah, I know that InputStream and OutputStream are abstract classes, not Java interfaces, but they're both interfaces in the design sense of the word regardless of the implementation details.)

You can find lots of similar examples of this design pattern in Java. For example, a URL object returns a generic implementation of the URLConnection interface in response to an openConnection() request. You, the user of the URL object, know nothing about the class that extends URLConnection. You must access it through the effective interface. (Again, URLConnection is a class that's being used here as an interface in the design sense.)
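You can see the hidden class for yourself with a few lines of code. The following sketch asks a URL for a connection and reports the name of the concrete class it got back; the Connection_demo class, the concrete_class_for() method, and the example file: URL are all made up for the illustration, and the class name printed depends on the protocol and the Java runtime.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical class, for illustration only.
class Connection_demo
{
    /** Returns the name of the concrete class hiding behind URLConnection. */
    static String concrete_class_for( String url ) throws IOException
    {
        // openConnection() only builds the object; nothing is opened until
        // connect() is called or the connection is read from.
        URLConnection connection = URI.create( url ).toURL().openConnection();
        return connection.getClass().getName();
    }

    public static void main( String[] args ) throws IOException
    {   // The path is arbitrary; the file doesn't need to exist.
        System.out.println( concrete_class_for("file:///tmp/example.txt") );
    }
}
```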

Figure 2. The Archive-class static model

The Archive_InputStream class
