How does the performance of Java applications compare with similar fully optimized C++ programs in theory, benchmarks, and real-world applications?
Most industry analysts make the blanket assumption that Java will always suffer performance disadvantages compared with other languages because Java was developed to allow Java programs to run on multiple platforms. You can find this assumption woven through almost all mainstream articles and opinions regarding Java and NCs in modern enterprises.
So we decided to find out for ourselves just how much of a disadvantage there is between Java and C++, the language to which Java is most often compared. We examined the architectural components of Java and compared the performance of programs written in Java to similar programs in C++.
We expected to see a modicum of lagging performance in Java in each test, though we were skeptical about it running several times slower than C++. To our shock, we rarely found any differences in speed at all. Where Java is significantly slower than C++, it's due to Java's stringent security model or to garbage collection.
While we welcome any performance improvements in any language, it appears that, used in the proper context, Java has already come a long way since its inception -- far enough to be considered a top performer alongside C++ in many cases.
The test suite
The analysis divides the execution of a program into four functional groups:
- Loading Program Executable
- Running Program Instructions
- Allocating Memory
- Accessing System Resources
These are functions that must be performed by any program running on a computer regardless of its implementation language.
Using these functional groups, we can develop a theoretical performance comparison between programs written in C++ and Java. In addition, we tried different coding approaches to demonstrate the performance characteristics for programs in each group.
The tests are divided into three programs:
- The Simple Loop Test, which tests performance on function calls, mathematical operations, and up-casting/down-casting
- The Memory Allocation Test, which tests memory allocation and release
- The Bouncing Globes Test, which measures animation performance and handling of resources
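The sources for these tests are not reproduced here, but a minimal sketch of what a simple loop benchmark might look like in Java follows; the method names and iteration count are our own illustration, not the actual test code:

```java
public class SimpleLoop {
    // A trivial method call inside the loop exercises call overhead.
    static int addOne(int x) {
        return x + 1;
    }

    // Runs the loop and returns the accumulated result so the
    // compiler cannot optimize the whole loop away.
    static long run(int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            total += addOne(i);
        }
        return total;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        long result = run(10000000);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("result=" + result + " elapsed=" + elapsed + "ms");
    }
}
```

Timing an equivalent loop and function call in C++ gives the comparison point; returning the total keeps a clever compiler from eliminating the loop entirely.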
All performance numbers were generated using the following environment:
- Platform: Windows NT 4.0 service pack 2
- Hardware configuration: Pentium Pro 200, 128MB RAM
- Software: Visual C++ 5.0 and Sun JDK 1.1.5
For the purposes of this article, we define a "platform" as a combination of CPU type and operating system. For example, Windows NT running on an Intel processor is one platform, Linux running on Intel processor is another, and Linux running on a Digital Alpha processor is still another platform.
Loading Program Executable (LPE)
Developers create new programs by writing code into one or more source files. A compiler/linker translates these source files into executable files. These executable files can be run on the target machine. The first step in running a program is to load the executable files into memory.
Performance implications: Java versus native C++ (LPE)
If the program is located on a disk drive local to the target platform, loading time of larger programs is rarely of concern. If, however, the program is located on a Web site on the Internet or a corporate intranet, executable size may become the limiting factor in the performance of a program. Over the Internet or intranet, Java programs and resources are, in general, much smaller and faster to load than native applications. There are two main contributors to this size difference: executable size and selective loading.
Windows NT executables that are written in C++ are significantly larger than similar Java executables. There are three contributing factors that account for this size difference.
First, the binary executable format for C++ programs can inflate code by as much as a factor of two over Java code.
Second, Java provides a series of well-defined, consistent libraries, such as mathematics, various network services, collection classes, and graphics classes. Every Java virtual machine (JVM) contains these libraries. In contrast, C++ defines application programming interfaces (APIs) that allow developers to access many of these functions in a consistent manner. Unfortunately, if a developer wants to use any special functions outside the core C++ API, he must deliver the implementation of the supporting libraries with his program. The inclusion of these libraries can double or even treble the size of delivered code.
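As a small illustration (our own, not taken from the article's test suite), a Java program can lean on the collection and networking classes that ship inside every JVM without delivering any library code of its own:

```java
import java.net.URL;
import java.util.Vector;

public class BuiltinLibraries {
    // The collection classes ship inside every JVM; nothing extra
    // has to be delivered alongside the program.
    static int countItems() {
        Vector names = new Vector();
        names.addElement("globe");
        names.addElement("loop");
        return names.size();
    }

    // So do the networking classes: parsing a URL requires no
    // additional libraries to be shipped with the executable.
    static String hostOf(String spec) {
        try {
            return new URL(spec).getHost();
        } catch (Exception e) {
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(countItems() + " items, host="
                + hostOf("http://www.javaworld.com/index.html"));
    }
}
```

The equivalent C++ program would have to ship its own collection and networking library code, which is exactly the size difference described above.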
C++ program loaders must generally load the entire executable file before execution begins.
There are two ways to link DLLs in Win32: implicitly at load time and explicitly at run time. Implicitly linked DLLs (by far the most common) are loaded before the program begins executing. Explicitly linked DLLs are loaded upon request, but are rarely used because the syntax for loading the DLLs and accessing the contained library functions requires a large programming effort in Win32.
In addition, there is no type checking for explicitly linked DLLs, so there is no way to tell if a DLL has changed except via a program crash. Finally, if the program is on a remote drive, the DLL must still be brought onto the local machine before it can be used. Establishing an automatic method for doing this in Win32 requires extensive programming on the part of the developer.
In contrast, the Java Loader can selectively load classes in a properly designed program as they are needed. For example, consider a full-featured word processor with such features as a thesaurus, a spell checker, mail merge, and export. These features typically produce multi-megabyte files.
The average user will use only a small fraction of the features at any given time. If the program were written in C++, the user would have to load the entire file before proceeding. If it is written in Java, only those features immediately needed, such as the main window, are downloaded. The user downloads additional features only as she needs them.
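From the programmer's side, on-demand loading can be arranged explicitly with Class.forName, which makes the Java loader fetch a class only when it is first requested. This is our own sketch; the feature class here is a stand-in from the core library rather than a real spell checker:

```java
public class LazyFeatures {
    // Loads and instantiates a feature class only on first use.
    // Until this runs, the Java loader never fetches the class file,
    // so unused features cost nothing at startup.
    static Object loadFeature(String className) {
        try {
            Class featureClass = Class.forName(className);
            return featureClass.newInstance();
        } catch (Exception e) {
            return null; // feature unavailable
        }
    }

    public static void main(String[] args) {
        // Pretend the user just clicked "Spell check": only now is the
        // class file for that feature loaded. java.util.Hashtable is a
        // stand-in for a hypothetical SpellChecker class.
        Object checker = loadFeature("java.util.Hashtable");
        System.out.println("Loaded: "
                + (checker == null ? "nothing" : checker.getClass().getName()));
    }
}
```

Even without such explicit calls, the Java loader resolves classes as execution first reaches them, which is what makes the word processor scenario above possible.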
Real-world example (LPE)
The table below shows the sizes of the programs and resources (if applicable) used in this article. The huge size differences in the C++ versus Java programs are due to the size of the libraries that are required for the C++ program to run. The difference in the resource sizes is due to the compression difference between the GIF files used in the Java program versus the bitmap files used in the C++ program.
|Program Name|Program Size: C++ vs Java|Resource Size: C++ vs Java|
|---|---|---|
|Simple Loop|46K vs 3.9K|N/A|
|Memory Allocation|34K vs 1.4K|N/A|
|Bouncing Globes|103K vs 21K|485K vs 153K|
Running Program Instructions (RPI)
Once the program executable is loaded, its instructions must be executed by the target platform's CPU. In traditional C++ programming, these executable files contain binary instructions that are executed by the target platform's CPU. A developer must create a different executable for each target platform by recompiling the original source code. In addition, the peculiarities of each target platform usually force the developer to modify the original source code.
In contrast, the executable files (called class files) produced by a Java compiler contain platform-independent bytecode, which cannot run on a target platform until it is translated into binary instructions suitable for that platform's CPU. The JVM is responsible for performing this translation, using one of two methods: a bytecode interpreter or a just-in-time (JIT) compiler.
Performance implications: Java versus native C++ (RPI)
Therein lies Java's bad reputation. Most performance perceptions for Java were derived from older JVMs that included bytecode interpretation as the only method for running program instructions.
Bytecode interpreters perform many times slower than comparable C++ programs because each bytecode instruction must be interpreted every time it is executed, which can lead to a great deal of unnecessary overhead. For example, if you code a repeat loop, and that loop executes the same set of bytecode instructions many times, the JVM will have to perform the exact same interpretation process on every instruction over and over again, each time it processes an iteration of the loop.
The perceptions earned by early JVMs are no longer valid, since most JVMs are delivered with JIT compilers. A JIT compiler translates and stores the entire class file. This eliminates the need for repeated translations of each bytecode instruction.
Performance among compilers
C++ compilers are able to improve the performance of a piece of code by detecting and improving inefficiencies through a process called code optimization. For example, a good compiler can detect if the programmer was sloppy and performed a "static" calculation within a loop. In this case, even though the calculation takes place within the loop, its results remain constant throughout the loop no matter how the program is used.
Recognizing this, the compiler will move the calculation outside of the loop. It will perform this calculation once before the loop is executed, and then use the constant value within the loop without affecting the logic of the program.
This type of optimization is called expression raising (known elsewhere as loop-invariant code motion). Performing most optimizations requires knowledge about a group of instructions and may require multiple passes over those instructions. In the expression raising example above, all of the instructions in the loop must be known ahead of time in order to determine whether the calculation truly produces a constant value for each execution of the loop.
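Our own small illustration of the sloppy version and the version the optimizer effectively produces:

```java
public class ExpressionRaising {
    // Sloppy: the product a * b is recomputed on every iteration,
    // even though neither operand changes inside the loop.
    static long sloppy(int a, int b, int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += a * b;
        }
        return total;
    }

    // What the optimizer effectively produces: the constant
    // calculation is moved ("raised") out of the loop and
    // performed only once, without changing the program's logic.
    static long hoisted(int a, int b, int n) {
        long total = 0;
        int product = a * b; // computed once, before the loop
        for (int i = 0; i < n; i++) {
            total += product;
        }
        return total;
    }

    public static void main(String[] args) {
        // Both versions must produce identical results.
        System.out.println(sloppy(3, 7, 1000) == hoisted(3, 7, 1000));
    }
}
```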
A JVM without a JIT sees each instruction as it is executed, so it cannot perform these types of optimizations on the fly. A JIT can, however, perform code optimization on the entire class file.
As a result, the only significant performance difference between a Java program run with a JIT and a native C++ application will be the amount of time it takes to perform the initial translation of the class file and the types of optimization that are performed.
This overhead will be a significant proportion of the total execution time only if a program is composed of a large number of Java classes, each of which is used just a few times. Real-world programs use the same classes many times, so the time spent translating a class will be very low compared to the time spent actually running the code within it.
Originally, most companies that produced JITs tried to make intelligent decisions about which classes should be compiled and which should not, based on number of times a particular class was used in a program. Many of these companies have since changed their JIT compilers to translate all code, because it turns out the overhead for the translation process is usually insignificant.
Theory and practice
In theory there should be only a negligible difference between JIT-compiled Java bytecode and native C++. In practice, there are two factors that cause performance differences.
First, when a bytecode instruction is translated into one or more platform-specific instructions, there will usually be several valid translations. Each of these translations produces the same result but may have vastly different performance characteristics. If the programmers who create the JIT and the C++ compiler are of the same caliber, the performance of both solutions should be similar. (For the purposes of this discussion, we're only considering performance optimizations.)
Second, there is a significant trade-off between compilation time and the number or level of optimizations that are performed on a piece of code. In the expression raising example above, it is usually fairly easy to examine all of the instructions in a loop to determine if a calculation changes throughout the operation of the loop.
In contrast, there is an optimization technique called dead code elimination that is much harder to perform. This optimization determines if a particular piece of code is ever used during program execution. If not, it eliminates the code from the executable file.
The performance gains from dead code elimination can be significant, but the overhead of the optimization calculation would most likely be prohibitive in a JIT compiler. It should be noted, however, that dead code is a result of sloppy programming. Competent programmers should be sensitive to the necessity of eliminating any dead code without having to rely upon the compiler to do it for them.
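For illustration (our own example, not from the tests), this is the kind of code a dead code elimination pass can remove:

```java
public class DeadCode {
    static int compute(int x) {
        int result = x * 2;
        if (false) {
            // This branch can never execute; an optimizer can prove
            // it dead and drop it from the generated code entirely.
            result = expensiveFallback(x);
        }
        return result;
    }

    // Reachable only from the dead branch above; with whole-program
    // analysis the method body itself becomes a candidate for
    // elimination from the executable.
    static int expensiveFallback(int x) {
        return x * x * x;
    }

    public static void main(String[] args) {
        System.out.println(compute(21));
    }
}
```

Proving that the branch and the method are truly unreachable requires analysis across the whole program, which is why this optimization is expensive for a JIT.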
The common optimizations that compilers perform may be divided into groups based on performance gains and computational expense:
- Primary and secondary optimizations typically afford a program 10 to 15 percent performance gains with minimal computational overhead.
- Tertiary optimizations can add an additional 5 percent performance gain, but at much greater expense.
This discussion has covered only two of the simpler optimizations that a compiler can perform. For a complete discussion of compilers and optimization theory, see "Compilers: Principles, Techniques, and Tools" and "Crafting a Compiler" (in the Resources section at the end of the article).