Accelerate your Java apps!

Where does the time go? Find out with these speed benchmarks

As a Java programmer, knowing the performance characteristics of different Java environments running on different operating systems is crucial. Having this information at hand can prepare you for potential bottlenecks, and it can save you from building bottlenecks into your apps accidentally. This article tests six different Java environments -- some with a just-in-time (JIT) compiler, some without -- running on four OSs and provides valuable benchmarks that can help you out with your Java development efforts.

The testing process

To understand performance characteristics and therefore where to expect bottlenecks, I ran benchmark tests on the following typical Java language constructs: method call, try/catch set-up, object creation, array creation, and array accessing. I didn't run tests of network I/O, disk I/O, or AWT performance -- the focus was purely on Java language performance. The tests were designed to avoid paging to disk.

Most of the tests required no garbage collection, so general system performance cannot be inferred by simply adding the results of the various tests together. I ran no general computation tests such as "Tower of Hanoi" or "Sieve of Eratosthenes"; such tests are designed to show relative speeds across platforms and rarely reveal where the bottlenecks are.

Target systems and environments

The benchmark tests I ran for this article were performed on a range of hardware systems and Java environments. The Java environments were:

Java Environments

  Description                                                  JIT
  Netscape Navigator 4.05 for Windows NT/95                    Symantec Java! ByteCode Compiler Version 210.065
  Netscape Navigator 4.05 for Power Macintosh                  Yes
  Internet Explorer 4.0 for Windows NT/95                      Yes
  Symantec Visual Cafe PDE 2.1a for Windows NT/95 (JDK 1.1.4)  Symantec Java! ByteCode Compiler Version i300.009
  Netscape Navigator 4.05 for SPARC                            No
  Netscape Navigator 4.05 for Linux                            No

The hardware/OS platforms were:

Hardware/OS Systems

  OS                  CPU                         RAM
  Windows NT SP3      Pentium Pro, 200 MHz        128 MB
  Macintosh 7.6.1     PowerPC 604e, 180 MHz
  Solaris 2.5.1       UltraSPARC-1, 167 MHz       128 MB
  Red Hat Linux 5.1   Pentium-II, 266 MHz         128 MB
  Windows NT          Dual Pentium Pro, 180 MHz   32 MB

To compare the various systems, I converted the time it took to perform the various operations into clock cycles. Why? This conversion makes it possible to compare machines running CPUs at different speeds. In general, comparing different CPUs to each other in such a crude way can be dangerous, because the amount of work that can be done in a single clock cycle can vary a lot from CPU to CPU. The 80486, for example, averages about 2 clock cycles per instruction, while the Pentium executes closer to 1. Fortunately, the PowerPC 604e, UltraSPARC, Pentium Pro, and Pentium-II are roughly comparable. While cache behavior could be different between the various systems, this seems not to affect the performance much. All the tests ran without paging to disk.
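
As a quick illustration of the arithmetic, a time measured in microseconds multiplied by the clock rate in megahertz gives the count in clock cycles (the 1.2-microsecond figure comes from the object-creation results later in this article):

    // Converting a measured time into clock cycles: microseconds * MHz = cycles.
    double elapsedMicroseconds = 1.2;   // fastest object-creation time, reported below
    double clockMHz = 200.0;            // the 200-MHz Pentium Pro
    double clockCycles = elapsedMicroseconds * clockMHz;   // 240 clock cycles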

Method calls

The ability to write and call methods (or functions) is a critical tool for building and maintaining large systems. Methods allow programs to be broken into smaller, more easily handled chunks. However, if method calls slow down a running program, programmers will design systems with bigger parts and fewer method calls.

Object-oriented programming increases the number of method calls when compared to equivalent procedural programs because it encourages more data encapsulation (among other things). Compare these two lines of code and notice the extra method call in the line showing encapsulation:

Without encapsulation: int x = someObject.x;
With encapsulation: int x = someObject.getX();

Encapsulation increases the number of method calls in a program, so it is essential that those method calls execute quickly. If method calls don't execute quickly, programmers often attempt to speed up their programs by avoiding encapsulating the data in their programs. Examples of this lack of encapsulation can be seen in some of the standard Java classes. The class java.awt.Dimension, for example, is written with both of its data members public. A better design would have hidden the data members by making them private and providing public accessor methods:

    private int height;
    private int width;

    public int getHeight()
    {
        return height;
    }

    public int getWidth()
    {
        return width;
    }

Because the early Java environments shipped without JIT compilers, method calls were much slower in them than they are in current Java environments. The encapsulation shown above may have been unacceptably slow in those early environments, which is probably why these data members were left public.

Fortunately, today's JIT-enabled Java environments perform method calls much faster than earlier non-JIT-enabled environments. There is less of a need to make speed-versus-encapsulation tradeoffs in these environments. With the best JIT, static methods returning nothing and taking no arguments execute in 2 clock cycles. Non-static method calls returning integer quantities execute in 7 clock cycles. Non-static method calls returning floating-point numbers execute in 8 clock cycles.

By making these accessor methods final, you can expect to reduce these times by one clock cycle. When running in a Java environment without a JIT, method calls take anywhere between 280 and 500 clock cycles. A good JIT can speed up method calls by a factor of more than 100 -- so in target environments with a good JIT, you can have both encapsulation and speed. In environments without a JIT or with a poor JIT, programmers must decide on a case-by-case basis whether speed or encapsulation is more important. A good JIT can make this decision unnecessary.
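
For readers who want to reproduce this kind of measurement, the sketch below shows one way to time method calls. It is not the harness used for the numbers in this article, just a minimal example with hypothetical class and method names; on a JIT that inlines calls or discards unused results, the loop may need adjusting.

    public class MethodCallTimer
    {
        private int x;

        public int getX() { return x; }              // non-static call returning an int
        public final int getXFinal() { return x; }   // final accessor; may save a cycle

        public static void main(String[] args)
        {
            MethodCallTimer obj = new MethodCallTimer();
            int iterations = 10000000;
            int sum = 0;

            long start = System.currentTimeMillis();
            for (int i = 0; i < iterations; i++)
            {
                sum += obj.getX();
            }
            long elapsed = System.currentTimeMillis() - start;

            // Time an empty loop the same way and subtract it to estimate the
            // per-call overhead; printing sum keeps the calls from being optimized away.
            System.out.println(elapsed + " ms for " + iterations +
                               " calls (sum=" + sum + ")");
        }
    }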

The graph below shows the effect adding parameters has on the time a method call takes under various JIT-enabled Java runtimes. While the time a method call takes to execute varies considerably from one runtime to another, adding parameters generally increases the cost of the call; adding a single parameter often makes no measurable difference, and only rarely does an extra parameter make a call faster. Also note that, regardless of the number of parameters, there is some general overhead in setting up a method call. Once you have decided to call a method, adding a few parameters has little impact on the time the call takes.

Notice that the JIT for Netscape Navigator on the Macintosh runs 25 percent as fast as the JITs on Windows. I have no numbers for JIT-enabled runtimes on Solaris. If you expect to support Macintosh and Windows clients, be sure to do your performance benchmarking on Macintoshes as well as on Windows clients.

Several popular environments do not yet come with a JIT. The graph below shows the effects of adding parameters to the time a method call takes under two non-JIT-enabled Java runtimes. You'll see that the cost of adding parameters is still mostly monotonic but that the general overhead of setting up a function call is very high.

Both Linux and Solaris have Java environments with JITs, but not under Navigator. I did not have access to these environments and have no data for them.

Finally, this graph compares the best Java time with C/C++.

Java seems to be only about 1 clock cycle slower than C++.

Recommendations

If you expect to run with a reasonable JIT, method calls are no more expensive in Java than they are in C or C++. If you expect to run on a system without a JIT or without a very good one, this is something you'll have to pay attention to in the speed-critical portions of your application.

Object creation

Modern microprocessors run at speeds of up to 600 MHz. Unfortunately, modern DRAM runs considerably slower. In burst mode, a modern SDRAM runs at about 100 MHz. If programs accessed memory in a truly random fashion, CPUs would spend most of their time waiting for DRAM. Fortunately, programs don't access memory in a random fashion. If a memory location has been accessed recently, it is quite likely to be accessed again soon. This property is called locality of reference.

Unfortunately, the locality of reference for Java programs can be worse than it is in equivalent C or C++ programs. This is due to object creation. Object creation in a Java program is fundamentally different than that in an equivalent C or C++ program. Many of the small temporary objects that C or C++ would create on the stack are created on the heap in Java.

In C and C++, when these objects are discarded at the end of a method call, the space is available for more temporary objects, and the stack area is almost always in the on-board Level 1 cache.

In Java, the objects are discarded, but the space typically is not reclaimed until the next garbage collection -- which usually doesn't happen until the heap memory is exhausted. The space for the next temporary object in a Java program always comes off the heap. The space for the new temporary object is rarely in a cache, and so the initial use of the temporary object should run slower than the initial use of a temporary object in a C or C++ program.

With the current Java runtimes, performance is also affected because creating a new object is roughly as costly as a malloc in C or a new operation in C++. Creating a new object in any of these ways takes hundreds or even thousands of clock cycles. In C and C++, creating an object on the stack takes about 20 clock cycles. C and C++ programs create many temporary objects on the stack, but Java programs don't have this option.
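
The sketch below (with hypothetical class and method names) shows the kind of short-lived temporary that lands on the heap in Java; an equivalent local object in C or C++ would live on the stack:

    // A small value class used only as a short-lived temporary.
    class Point2D
    {
        double x, y;
        Point2D(double x, double y) { this.x = x; this.y = y; }
    }

    class Geometry
    {
        // Each call allocates a Point2D on the heap, where it stays unreclaimed
        // until the next garbage collection; in C or C++ an equivalent local
        // object would be created and discarded on the stack.
        static double distanceFromOrigin(double x, double y)
        {
            Point2D p = new Point2D(x, y);
            return Math.sqrt(p.x * p.x + p.y * p.y);
        }
    }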

The graph below shows the cost of creating objects of various sizes under different JIT-enabled Java runtimes. The fastest time is 1.2 microseconds running under Microsoft Internet Explorer (IE) 4.0 on a 200-MHz Pentium Pro. The time would likely be cut to about 0.6 microseconds on a 400-MHz Pentium-II.

Netscape Navigator on Macintosh, IE on Windows NT, and Symantec on Windows NT all show reasonable profiles: the time required to create an object increases as the object size increases.

Navigator on Windows, however, shows very different behavior. The time it takes to create objects of different sizes cycles up and down. Objects containing 4 integers ("ints") and objects containing 12 ints are created fastest. A similar profile appears under Navigator for Solaris as well. I have no explanation for this profile, but there is a workaround: if object creation proves to be a performance bottleneck in an applet running under Navigator, padding objects up to one of the faster-creating sizes should speed the applet up.
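
As a sketch of what such padding might look like (the field names are hypothetical, and this is only worth doing if profiling under Navigator shows object creation to be the bottleneck):

    class Record
    {
        int a, b, c, d, e, f, g, h, i, j;   // ten ints of real data
        int pad1, pad2;                     // padding up to twelve ints, one of the
                                            // faster-creating sizes under Navigator
    }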

The graph below shows object-creation times for objects of different sizes when running under non-JIT-enabled systems.

One important thing to notice is that a JIT does not speed up object creation nearly as much as it speeds up method calls. Because of this, the bottlenecks for the same program running under a JIT-enabled environment and under a non-JIT-enabled environment may well be different -- if only subtly. When profiling a Java application, consider whether the environment you profile in has performance characteristics similar to the runtime environment or environments you expect to deploy on.

Finally, the following graph shows the Java object-creation time in the best environment, compared to C/C++ object creation on the stack. The speed difference is about 10 to 12 times. Certain shortcuts can speed up C/C++ object creation on the stack slightly more than these already-fast times, but the shortcuts can reduce the maintainability of the C/C++ code. Java doesn't give you the option of taking these shortcuts.

Recommendations

In all Java runtime environments, object creation can become a performance bottleneck. Here JITs don't help very much! Object creation is something you must pay attention to for all Java programs. This doesn't mean you should avoid object creation at all cost -- simply be aware of where you are creating lots of objects, and watch for bottlenecks there.
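
One common way to relieve such a bottleneck is to reuse a scratch object instead of allocating a fresh one on every pass through a hot loop. The sketch below (with hypothetical names) shows the idea using a reusable StringBuffer:

    class LineFormatter
    {
        // One scratch buffer reused across calls instead of a new
        // StringBuffer allocated per call.
        private final StringBuffer scratch = new StringBuffer();

        String format(int[] values)
        {
            scratch.setLength(0);   // reset the buffer rather than allocate a new one
            for (int i = 0; i < values.length; i++)
            {
                scratch.append(values[i]).append(' ');
            }
            return scratch.toString();
        }
    }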

Synchronization

Correct, nontrivial, multithreaded environments require some degree of synchronization. The chart below shows the times required for various monitor operations in clock cycles.

Cost of synchronized operations on object and class, with and without the monitor already held (times in clock cycles)

  System                                 Object synch   Object synch   Class synch   Class synch
                                         w/o monitor    w/ monitor     w/o monitor   w/ monitor
  Navigator 4 under Windows NT           589            259            571           289
  Navigator 4 under MacOS                1209           1116           1195          1193
  Internet Explorer 4 under Windows NT   39             41             255           236
  Symantec with JIT under Windows NT     92             36             88            32
  Navigator 4 under Solaris              371            258            465           352
  Navigator 4 under Linux                488            417            519           429

Several things are worth noting:

First, synchronizing on an already-acquired monitor is usually, but not always, cheaper than acquiring and releasing the monitor.

Second, the time required for the same operation when executed by different Java runtimes can vary by a ratio of almost 40 to 1! The bottlenecks on various systems may well be different if there is a significant amount of synchronization. Profiling for performance bottlenecks should, if at all possible, be achieved on a runtime similar to the target runtime.
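
To make the table's terminology concrete, the sketch below (a hypothetical class) shows the cases being measured: synchronizing on an object's monitor, re-entering a monitor the thread already holds, and synchronizing on the class:

    class Counter
    {
        private int count;
        private static int total;

        // Object synchronization: acquires this instance's monitor.
        synchronized void increment()
        {
            count++;
        }

        // "Object synch w/ monitor": the nested calls re-enter a monitor this
        // thread already holds, which is usually (but not always) cheaper.
        synchronized void incrementTwice()
        {
            increment();
            increment();
        }

        // Class synchronization: a static synchronized method locks the
        // Class object rather than any instance.
        static synchronized void incrementTotal()
        {
            total++;
        }
    }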
