Java Tip 26: How to improve Java's I/O performance

The JDK 1.0.2 java.io package has meant problems for I/O performance, but here's a tip for making the situation better -- plus an extra tip on turning off synchronization

Java's I/O performance has been a bottleneck for many Java applications because of the poorly designed and implemented java.io package in JDK 1.0.2. A key problem is buffering: most classes in java.io are not buffered. In fact, the only classes with buffers are BufferedInputStream and BufferedOutputStream, and they provide a very limited set of methods. For example, most file-related applications need to parse a file line by line, but the only input stream class that provides a readLine method is DataInputStream, and it has no internal buffer. Its readLine method actually reads the input stream character by character until it hits a "\n" or "\r\n". Each character read involves a file I/O operation, which is extremely inefficient for a large file: without a buffer, reading a 5-megabyte file requires at least 5 million file I/O operations.
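The cost of unbuffered line reading can be made visible by counting how many read requests reach the underlying source. The sketch below is written for a current JDK and uses an in-memory byte array in place of a file; ReadCallDemo, CountingInputStream, sampleData, and countCalls are hypothetical names introduced only for this illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

class ReadCallDemo {
    // Hypothetical helper: counts every read request that reaches the wrapped stream.
    static class CountingInputStream extends FilterInputStream {
        long calls = 0;

        CountingInputStream(InputStream in) { super(in); }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    // Builds 'size' bytes of text made of 80-byte lines ending in '\n'.
    static byte[] sampleData(int size) {
        byte[] data = new byte[size];
        Arrays.fill(data, (byte) 'x');
        for (int i = 79; i < size; i += 80) {
            data[i] = '\n';
        }
        return data;
    }

    // Reads every line with DataInputStream.readLine and returns how many
    // read requests reached the underlying source.
    @SuppressWarnings("deprecation") // readLine is deprecated in later JDKs
    static long countCalls(byte[] data, boolean buffered) {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(data));
        InputStream in = buffered ? new BufferedInputStream(source, 8192) : source;
        DataInputStream din = new DataInputStream(in);
        try {
            while (din.readLine() != null) {
                // discard the line; only the call count matters here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.calls;
    }

    public static void main(String[] args) {
        byte[] data = sampleData(100_000);
        System.out.println("unbuffered read requests: " + countCalls(data, false));
        System.out.println("buffered read requests:   " + countCalls(data, true));
    }
}
```

Without the buffer, the count grows with the number of bytes in the input; with an 8K buffer in between, it grows with the number of buffer refills. With a real file, each of those requests would be a trip to the operating system.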

The new JDK 1.1 improves I/O performance by adding a collection of Reader and Writer classes. On large files, the readLine method in BufferedReader is at least 10 to 20 times faster than the one in DataInputStream. Unfortunately, JDK 1.1 does not solve all the performance problems. For example, RandomAccessFile is a very useful class when you want to parse a large file without reading it all into memory. Still, it is not buffered in JDK 1.1, and no equivalent Reader class has been provided.
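For sequential reading, the BufferedReader approach looks roughly like the following. This is a minimal sketch written for a current JDK; BufferedLineDemo and readLines are hypothetical names, and a StringReader stands in for the FileReader you would normally pass.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class BufferedLineDemo {
    // Reads all lines from any Reader through BufferedReader.readLine,
    // which accepts "\n", "\r", and "\r\n" as line terminators.
    static List<String> readLines(Reader source) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // prints [first, second, third]
        System.out.println(readLines(new StringReader("first\nsecond\r\nthird")));
    }
}
```

This covers reading a file front to back. It does not help when you need to jump around in a file with seek, which is the case the rest of this tip addresses.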

How to tackle the I/O problem

To tackle the problem of inefficient file I/O, we need a buffered RandomAccessFile class. We derive a new class from RandomAccessFile so that it can reuse all of the existing methods. The new class is named Braf (short for buffered RandomAccessFile).

  import java.io.*;

  public class Braf extends RandomAccessFile {
  }

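For efficiency reasons, we define a byte buffer instead of a char buffer. The variables buf_end, buf_pos, and real_pos record the effective positions in the buffer and in the file, and BUF_SIZE holds the buffer's capacity:

  byte buffer[];
  int BUF_SIZE;       // capacity of the buffer, set in the constructor
  int buf_end = 0;    // number of valid bytes currently in the buffer
  int buf_pos = 0;    // index of the next byte to hand out
  long real_pos = 0;  // file offset just past the last byte read from the file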

A new constructor is added with an additional parameter to specify the size of the buffer:

  public Braf(String filename, String mode, int bufsize) 
   throws IOException{
    super(filename,mode);
    invalidate();
    BUF_SIZE = bufsize;
    buffer = new byte[BUF_SIZE];    
  }

The new read method is written so that it always reads from the buffer first. It overrides the native read method in the original class, which is not called until the buffer has been exhausted. In that case, the fillBuffer method is called to refill the buffer, and fillBuffer in turn invokes the original read. The private method invalidate marks the buffer as no longer containing valid contents. This is necessary when the seek method moves the file pointer outside the buffer.

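  public final int read() throws IOException {
    if (buf_pos >= buf_end) {
      if (fillBuffer() < 0)
        return -1;    // end of file
    }
    if (buf_end == 0) {
      return -1;
    } else {
      // mask so that bytes 0x80-0xFF come back as 128-255, not as negative values
      return buffer[buf_pos++] & 0xFF;
    }
  }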
  private int fillBuffer() throws IOException {
    int n = super.read(buffer, 0, BUF_SIZE);
    if(n >= 0) {
      real_pos +=n;
      buf_end = n;
      buf_pos = 0;
    }
    return n;
  }
  private void invalidate() throws IOException {
    buf_end = 0;
    buf_pos = 0;
    real_pos = super.getFilePointer();
  }
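One detail matters whenever a byte from the buffer is handed back as an int: RandomAccessFile.read() is documented to return a value from 0 to 255, with -1 reserved for end of file, but Java bytes are signed. The small sketch below (ByteMaskDemo, unmasked, and masked are hypothetical names) shows why the byte should be masked with 0xFF before it is returned.

```java
class ByteMaskDemo {
    // Returning the byte directly sign-extends it, so 0xFF becomes -1.
    static int unmasked(byte b) {
        return b;
    }

    // Masking keeps the result in the range 0..255.
    static int masked(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xFF;
        System.out.println(unmasked(b)); // prints -1, indistinguishable from end of file
        System.out.println(masked(b));   // prints 255
    }
}
```

For plain ASCII text the two versions behave the same, which is why the difference is easy to miss until a binary file or an 8-bit character set comes along.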

The other, parameterized read method is overridden as well. The code for the new read is listed below. If the buffer holds enough unread bytes, it simply calls System.arraycopy to copy a portion of the buffer directly into the user-provided array. Buffering the read methods yields the most significant performance gain, because read is heavily used in the getNextLine method, which is our replacement for readLine.

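  public int read(byte b[], int off, int len) throws IOException {
    int leftover = buf_end - buf_pos;
    if (len <= leftover) {
      // fast path: the request is satisfied entirely from the buffer
      System.arraycopy(buffer, buf_pos, b, off, len);
      buf_pos += len;
      return len;
    }
    // slow path: go byte by byte, refilling the buffer as needed
    for (int i = 0; i < len; i++) {
      int c = this.read();
      if (c != -1)
        b[off + i] = (byte) c;
      else
        return (i == 0) ? -1 : i;    // -1 at end of file, as in RandomAccessFile
    }
    return len;
  }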

The original methods getFilePointer and seek need to be overridden as well in order to take advantage of the buffer. Most of the time, both methods will simply operate inside the buffer.

  public long getFilePointer() throws IOException {
    long l = real_pos;
    return (l - buf_end + buf_pos);
  }
  public void seek(long pos) throws IOException {
    // keep the distance as a long; casting it to int first could wrap
    // around for targets more than 2 GB away from the buffer
    long n = real_pos - pos;
    if (n >= 0 && n <= buf_end) {
      buf_pos = buf_end - (int) n;
    } else {
      super.seek(pos);
      invalidate();
    }
  }
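The position arithmetic is easier to follow with concrete numbers. The sketch below pulls the two calculations out into standalone functions; BufferPositionDemo, logicalPosition, and seekInBuffer are hypothetical names used only for this worked example, and the -1 return value is a convention of the sketch, not of Braf.

```java
class BufferPositionDemo {
    // The position reported to the caller: the real file offset, minus the
    // bytes sitting in the buffer, plus the bytes already consumed from it.
    static long logicalPosition(long realPos, int bufEnd, int bufPos) {
        return realPos - bufEnd + bufPos;
    }

    // Returns the new buf_pos if 'target' lies inside the buffered range,
    // or -1 if a real seek on the file would be needed.
    static int seekInBuffer(long realPos, int bufEnd, long target) {
        long n = realPos - target; // distance back from the end of the buffered range
        return (n >= 0 && n <= bufEnd) ? bufEnd - (int) n : -1;
    }

    public static void main(String[] args) {
        // After the first 8192-byte fill from offset 0, with 100 bytes consumed:
        System.out.println(logicalPosition(8192, 8192, 100)); // prints 100
        // Seeking to offset 5000 stays inside the buffer:
        System.out.println(seekInBuffer(8192, 8192, 5000));   // prints 5000
        // Seeking to offset 20000 falls outside it:
        System.out.println(seekInBuffer(8192, 8192, 20000));  // prints -1
    }
}
```

After a second fill, real_pos is 16384 while buf_end is still 8192, so the buffer covers offsets 8192 to 16384. A seek to 9000 then gives n = 7384 and buf_pos = 808, and the logical position works out to 16384 - 8192 + 808 = 9000, as expected.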

Most important, a new method, getNextLine, is added to replace the readLine method. We cannot simply override readLine because it is declared final in the original class. The getNextLine method first checks whether the buffer still contains unread contents. If it doesn't, the buffer is refilled. If the newline delimiter can be found in the buffer, a line is read from the buffer and converted into a String. Otherwise, the method simply calls read to proceed byte by byte. Although the code of the latter portion is similar to the original readLine, performance is better here because the read method is buffered in the new class.

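  /**
   * Returns the next line as a String, or null at end of file.
   */
  public final String getNextLine() throws IOException {
    String str = null;
    if (buf_end - buf_pos <= 0) {
      if (fillBuffer() < 0) {
        return null;    // end of file is not an error; mirror readLine
      }
    }
    // look for a line terminator among the unread bytes
    int lineend = -1;
    for (int i = buf_pos; i < buf_end; i++) {
      if (buffer[i] == '\n') {
        lineend = i;
        break;
      }
    }
    if (lineend < 0) {
      // no terminator in the buffer: fall back to buffered byte-by-byte reads
      StringBuffer input = new StringBuffer(256);
      int c;
      while (((c = read()) != -1) && (c != '\n')) {
        input.append((char) c);
      }
      if ((c == -1) && (input.length() == 0)) {
        return null;
      }
      // strip the '\r' of a "\r\n" terminator, as the buffer path does
      int len = input.length();
      if (c == '\n' && len > 0 && input.charAt(len - 1) == '\r')
        input.setLength(len - 1);
      return input.toString();
    }
    // drop a trailing '\r' only if it belongs to this line
    if (lineend > buf_pos && buffer[lineend - 1] == '\r')
      str = new String(buffer, 0, buf_pos, lineend - buf_pos - 1);
    else
      str = new String(buffer, 0, buf_pos, lineend - buf_pos);
    buf_pos = lineend + 1;
    return str;
  }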

With the new Braf class, we have seen performance improve by a factor of at least 25 over RandomAccessFile when a large file is parsed line by line. The technique described here also applies wherever intensive file I/O operations are involved.
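To check the speedup on your own machine, a small timing harness is enough. The sketch below is written for a current JDK and times only RandomAccessFile.readLine, since Braf itself is not part of the block; LineTimingDemo, LineSource, countLines, writeSample, and countWithRandomAccessFile are hypothetical names. Any method that returns the next line, or null at end of file, can be passed to countLines in the same way.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class LineTimingDemo {
    // Hypothetical abstraction: returns the next line, or null at end of file.
    interface LineSource {
        String next() throws IOException;
    }

    // Drains the source and returns the number of lines it produced.
    static long countLines(LineSource source) {
        long lines = 0;
        try {
            while (source.next() != null) {
                lines++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Writes a temporary text file containing the given number of short lines.
    static Path writeSample(int lineCount) {
        try {
            Path file = Files.createTempFile("braf-demo", ".txt");
            file.toFile().deleteOnExit();
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < lineCount; i++) {
                sb.append("line ").append(i).append('\n');
            }
            Files.write(file, sb.toString().getBytes(StandardCharsets.US_ASCII));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts the lines of a file using the unbuffered RandomAccessFile.readLine.
    static long countWithRandomAccessFile(Path file) {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            return countLines(raf::readLine);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeSample(50_000);
        long start = System.nanoTime();
        long lines = countWithRandomAccessFile(file);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(lines + " lines via RandomAccessFile.readLine in " + millis + " ms");
    }
}
```

To time the class from this tip, open the same file with new Braf(path, "r", bufsize) and pass braf::getNextLine to countLines. The ratio you measure will depend on the JVM, the operating system, and the buffer size.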

Synchronization turn-off: An extra tip

Besides the I/O problem discussed above, another factor that slows down Java programs is the synchronized keyword. Generally, the overhead of calling a synchronized method is about 6 times that of a conventional method. If you are writing an application without multithreading (or a part of an application in which you know for sure that only one thread is involved), you don't need anything to be synchronized. Currently, there is no mechanism in Java to turn off synchronization. A simple trick is to take the source code of a class, remove the synchronized keywords, and compile the result as a new class. For example, in BufferedInputStream, both read methods are synchronized, and all other I/O methods depend on them. You can copy the source code from the BufferedInputStream.java provided with JavaSoft's JDK 1.1, rename the class to, say, NewBIS, remove the synchronized keywords from NewBIS.java, and compile NewBIS.
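The same trick can be tried on a toy class before applying it to library source. In the sketch below, SyncDemo, SyncCounter, and PlainCounter are hypothetical names; PlainCounter is a copy of SyncCounter with the synchronized keyword removed, so it is safe only when a single thread uses it.

```java
class SyncDemo {
    // Original-style class: every call acquires the object's monitor.
    static class SyncCounter {
        private long value;
        synchronized void increment() { value++; }
        synchronized long get() { return value; }
    }

    // Same source with 'synchronized' removed; single-threaded use only.
    static class PlainCounter {
        private long value;
        void increment() { value++; }
        long get() { return value; }
    }

    static long runSynchronized(int calls) {
        SyncCounter c = new SyncCounter();
        for (int i = 0; i < calls; i++) {
            c.increment();
        }
        return c.get();
    }

    static long runPlain(int calls) {
        PlainCounter c = new PlainCounter();
        for (int i = 0; i < calls; i++) {
            c.increment();
        }
        return c.get();
    }

    public static void main(String[] args) {
        int calls = 10_000_000;
        long t0 = System.nanoTime();
        long a = runSynchronized(calls);
        long t1 = System.nanoTime();
        long b = runPlain(calls);
        long t2 = System.nanoTime();
        System.out.println("synchronized: " + a + " calls in " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("plain:        " + b + " calls in " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

Both versions produce the same count from a single thread. The timing difference you observe depends on the JVM, so measure it on the runtime you actually deploy on before copying and editing library classes.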

Nick Zhang is a senior software engineer at Enterprise Integration Technologies. When he is away from Java, he listens to and sometimes plays LaoSheng roles in Peking Opera.