Tweak your IO performance for faster runtime

Increase the speed of Java programs by tuning IO performance

Most of you are probably tired of hearing the standard claim that Java programs are slower than C programs. In reality, the situation is more complex than that trite assertion. Many Java programs are slow, but that is not necessarily an intrinsic characteristic of programs written in Java. Many Java programs can perform just as fast as comparable programs written in C or C++, but only when the designer and programmer pay careful attention to performance issues throughout the development process.

In this article, I will focus on tuning IO performance. Many applications spend much of their runtime on network or file IO, and poorly performing IO code can run many times slower than well-tuned IO code.

General performance-tuning concepts

Some general concepts appear over and over again when assessing the performance of Java programs. The examples in this article focus specifically on tuning an IO-bound application, but you can apply those principles to many other performance situations.

Perhaps the most important principle is this: Measure early, measure often. You can't effectively manage performance if you don't know the source of your problem. Programmers are notoriously bad at guessing where performance problems lie. Spending days tuning a subsystem that accounts for 1 percent of an application's total runtime simply cannot yield more than a 1 percent improvement in application performance. Instead of guessing, use performance measurement tools -- such as profiling or nonintrusive time-stamped logging -- to identify where your application spends its time and to focus your energy on those hot spots. After you tune, measure again. Not only will measuring help you concentrate your energy where it will count the most, but it will also provide evidence of the success -- or lack thereof -- of your efforts.

In tuning a program for performance, you may want to measure a number of quantities such as total runtime, average memory usage, peak memory usage, throughput, request latency, and object creations. Which factors you should focus on depend on your situation and performance requirements. Several good commercial profiling tools are available that can help you measure many of those quantities, but you don't need to buy an expensive package to collect useful performance data.

In the performance data gathered for this article, I focused entirely on runtime and, to gather my measurements, I used a class similar to the simple Timer class below. (It can easily be extended to support additional operations such as pause() and restart().) Timer also allows you to gather timing information without cluttering your output with time-stamped logging statements, which can also distort runtime measurements since the logging statements may create objects and perform IO.

public class Timer {
  // A simple "stopwatch" class with millisecond accuracy
  private long startTime, endTime;
  public void start()   {  startTime = System.currentTimeMillis();       }
  public void stop()    {  endTime   = System.currentTimeMillis();       }
  public long getTime() {  return endTime - startTime;                   }

One of the most common causes of Java performance problems is the excessive creation of temporary objects. While newer Java VMs more effectively reduce the performance impact of creating many small objects, the fact remains that object creation is still an expensive operation. Since strings are immutable, the String class is one of the biggest offenders; every time a String is modified, one or more new objects is created. That brings us to our second performance principle: Avoid excessive object instantiations.

IO performance tuning

Many applications process large volumes of data, and IO is one of those areas where small differences in implementation can have a large impact on performance. I derived this article's examples from tuning a text-processing application that consumes and analyzes large quantities of text. The amount of time spent reading and processing the input was significant, and the methods employed in tuning the application provide good examples of the performance principles stated above.

One of the principal culprits that affects Java IO performance is the use of character-by-character IO -- calling the or the methods to read one character. Java inherited that idiom from C programming, in which it is quite common -- and efficient -- to read a file by repeatedly calling getc(). In C, character IO is fast because the getc() and putc() functions are implemented as macros and provide buffered file access, which means they only take a few machine instructions to execute. In Java, you encounter a quite different situation. Not only do you incur the cost of one or more method calls for each character but, more significantly, if you don't use any sort of buffering, you also suffer the cost of a system call to obtain the character. While a Java program that relies on read() may look and function just like its C counterpart, it will not perform in the same way. Fortunately, Java offers several easy approaches to achieving higher performance IO.

You can address buffering in one of two ways -- use the standard BufferedReader and BufferedInputStream classes or use the block-read methods to read larger blocks of data at a time. The former offers a quick and easy solution, and substantially improves performance with little additional coding and opportunity for error. The do-it-yourself technique offers a somewhat more complicated -- although still not difficult -- remedy, and it is even faster. It also features some additional advantages, which I will describe later.

To measure the effect of different IO buffering strategies, I wrote six small programs that read several hundred files and examine each character. Table 1 shows the runtimes of those six programs, using five commonly used Linux Java VMs -- the Sun 1.1.7, 1.2.2, and 1.3 Java VMs, and the 1.1.8 and 1.3 Java VMs from IBM.

The programs are:

  1. RawBytes: Read data one byte at a time, using
  2. RawChars: Read data one char at a time, using
  3. BufferedIS: Wrap FileInputStream with BufferedInputStream and read data one byte at a time with read()
  4. BufferedR: Wrap FileReader with BufferedReader and read data one char at a time with read()
  5. SelfBufferedIS: Read data 1 K at a time, using[]), and access the data from the buffer
  6. SelfBufferedR: Read data 1 K at a time, using[]), and access the data from the buffer
Table 1. Runtimes of six programs using common Linux VMs
 Sun 1.1.7IBM 1.1.8Sun 1.2.2Sun 1.3IBM 1.3

We can draw several obvious conclusions from the performance data in Table 1 (measured as total runtime to process several hundred text files after adjusting for Java VM and program startup):

  • InputStream runs faster than Reader. A char uses two bytes to represent a character, whereas byte requires only one, so using byte to store characters requires less memory and fewer machine instructions. More importantly, using byte eliminates the need to perform Unicode conversion. The choice of character representation will depend on your requirements -- if you need internationalization support, you will have to use char, but if you read from a source that is inherently ASCII (such as HTTP or MIME headers) or you know the input text will be English, you can use the faster byte.
  • Unbuffered character IO runs really slow. Character IO is bad, but unbuffered character IO is terrible. At the very least, you should wrap your streams with buffered streams, which can make your IO code run up to 10 times faster.
  • Block IO with self-buffering runs even faster than character IO with buffered streams. While buffered streams eliminate the system-call overhead of reading each character, which is significant, they still require one or more method calls in the middle of a tight loop. Self-buffered IO code runs 2 to 4 times faster than buffered streams and 4 to 40 times faster than unbuffered IO.

One item that is not obvious from the chart is that character IO can undermine the performance advantages of faster Java VMs. In most performance tests, the IBM 1.1.8 Linux Java VM runs approximately twice as fast as the Sun 1.1.7 Linux Java VM. But the RawBytes and RawChars tests show the two as equally slow. The extra time they spend executing system calls to read the data negates the speedup offered by the faster Java VM.

Using block IO results in another subtle benefit. When used, buffered character IO sometimes requires more coordination between components, creating more room for errors. Often, a component will complete the IO on behalf of an application; the application passes a Reader or InputStream to the component, which will then consume the contents of the stream. Some components may erroneously assume the stream will be a buffered stream without documenting that requirement. Or the application programmer may fail to notice when the component does document that condition. In such cases, the IO ends up unintentionally unbuffered, with serious performance consequences. When using block IO instead, that confusion cannot occur. (It is usually better to design components so that they cannot be misused, rather than rely on the documentation to ensure their proper use.)

Those simple tests show that the most obvious way to complete a simple task such as reading text can run 40-60 times slower than a more carefully chosen alternative. In those tests, the programs did perform some computation as it retrieved and examined each character. In situations where you just want to copy data from one stream to another, the performance gap between unbuffered character IO and block IO is even greater; block IO can run 300-500 times faster than unbuffered character IO. (See Chapter 4 in Java Platform Performance, for data. Direct link available in Resources.)

Next step -- measure again

The performance-tuning process is a necessarily iterative one because secondary performance issues often don't surface until the primary performance issue is addressed. In the case of the text-processing application, initial profiling showed that the program spent a tremendous percentage of its time reading characters, and adding buffering resulted in a dramatic performance improvement. Only after the first bottleneck -- character IO -- was removed did additional hot spots appear. Profiling again now showed that the program was consuming a lot of time creating String objects, and it turned out that it was creating more than one String object for each word of the input text.

That text analysis system featured a modular design in which users can chain together various text-processing operations to achieve a desired result. For example, a user might chain together the word identifier, which reads the input characters and groups them into words; the lowercase converter, which converts words into lowercase; and the stemmer, which converts words to their base form (e.g., converts jumper and jumped into jump).

While the benefits of that sort of building-block approach are obvious, it can impose performance penalties. Because the interfaces between the components are rigid (each takes a String as its input and produces another String as output), some duplication of effort across components may result. If certain combinations of components are frequently used together, optimizing those common cases may be worthwhile.

In that system, it turned out that in actual use, users would almost always use the lowercase converter component immediately following the word identifier component. The word identifier would examine each character, looking for a word boundary, filling up a word buffer as it went. When it identified a complete word, it would create a String object for it. The next component in the chain, in that case the lowercase converter, would then call String.toLowerCase() on that String, thereby creating another String object. Using those two components together generates two String objects for every word in the input text. Since the two were so often used together, adding an optimized lowercase word indentifier, which combined the functions of those two components -- but which only created a single String object for each word -- produced a significant additional performance improvement. Table 2 below shows that process' results.

Table 2. Program runtimes improve
  Sun 1.1.7IBM 1.1.8Sun 1.2.2Sun 1.3IBM 1.3
AWord filter only23.03.610.72.62.9
BWord filter + lower case filter39.66.713.93.93.9
CCombined word and lower case filter29.03.812.93.13.1
 Temporary String creation time (B-C)

We can draw several useful observations from Table 2:

1 2 Page 1