JVM performance optimization, Part 2: Compilers

Use the right Java compiler for your Java application

1 2 Page 2
Page 2 of 2

As compared to a client-side compiler, a server-side compiler usually increases code performance by a measurable 30 percent to 50 percent. In most cases that performance improvement will balance the additional resource cost.

Tiered compilation combines the best features of both compilers. Client-side compilation yields quick startup time and speedy optimization, while server-side compilation delivers more advanced optimizations later in the execution cycle.

Some common compiler optimizations

I've so far discussed the value of optimizing code and how and when common JVM compilers optimize code. I'll conclude with some of the actual optimizations available to compilers. JVM optimization actually happens at the bytecode level (or on lower representative language levels), but I'll demonstrate the optimizations using the Java language. I couldn't possibly cover all of the JVM optimizations in this section; rather, I mean to inspire you to explore on your own and learn about the hundreds of advanced optimizations and innovations in compiler technology (see Resources).

Dead code elimination

Dead code elimination is what it sounds like: the elimination of code that has never been called -- i.e., "dead" code. If a compiler discovers during runtime that some instructions are unnecessary, it will simply eliminate them from the execution instruction set. For example, in Listing 1 a certain value assignment for a variable is never used and can be fully ignored at execution time. On a bytecode level this could correspond to never needing to execute the load of the value into a register. Not having to do the load means less CPU time, and hence a quicker code execution, and therefore the application -- especially if the code is hot and called several times per second.

Listing 1 shows Java code exemplifying a variable that is never used, an unnecessary operation.

Listing 1. Dead code

int timeToScaleMyApp(boolean endlessOfResources) {
   int reArchitect = 24;
   int patchByClustering = 15;
   int useZing = 2;

       return reArchitect + useZing;
       return useZing;

On a bytecode level, if a value is loaded but never used, the compiler can detect this and eliminate the dead code, as shown in Listing 2. Never executing the load saves CPU time and thus improves the program's execution speed.

Listing 2. The same code following optimization

int timeToScaleMyApp(boolean endlessOfResources) {
   int reArchitect = 24;
   //unnecessary operation removed here...
   int useZing = 2;

       return reArchitect + useZing;
       return useZing;

Redundancy elimination is a similar optimization that removes duplicate instructions to improve application performance.


Many optimizations try to eliminate machine-level jump instructions (e.g., JMP for x86 architectures). A jump instruction changes the instruction pointer register and thereby transfers the execution flow. This is an expensive operation relative to other ASSEMBLY instructions, which is why it is a common target to reduce or eliminate. A very useful and well-known optimization that targets this is called inlining. Since jumping is expensive, it can be helpful to inline many frequent calls to small methods, with different entry addresses, into the calling function. The Java code in Listings 3 through 5 exemplifies the benefits of inlining.

Listing 3. Caller method

int whenToEvaluateZing(int y) {
   return daysLeft(y) + daysLeft(0) + daysLeft(y+1);

Listing 4. Called method

int daysLeft(int x){
   if (x == 0)
      return 0;
      return x - 1;

Listing 5. Inlined method

int whenToEvaluateZing(int y){
   int temp = 0;
   if(y == 0) temp += 0; else temp += y - 1;
   if(0 == 0) temp += 0; else temp += 0 - 1;
   if(y+1 == 0) temp += 0; else temp += (y + 1) - 1;
   return temp; 

In Listings 3 through 5 the calling method makes three calls to a small method, which we assume for this example's sake is more beneficial to inline than to jump to three times.

It might not make much difference to inline a method that is called rarely, but inlining a so-called "hot" method that is frequently called could mean a huge difference in performance. Inlining also frequently makes way for further optimizations, as shown in Listing 6.

Listing 6. After inlining, more optimizations can be applied

int whenToEvaluateZing(int y){
   if(y == 0) return y;
   else if (y == -1) return y - 1;
   else return y + y - 1;

Loop optimization

Loop optimization plays a big role when it comes to reducing the overhead that comes with executing loops. Overhead in this case means expensive jumps, number of checks of the condition, non-optimal instruction pipeline (i.e., an order of instructions that causes no-operations or extra cycles in the CPU). There are many kinds of loop optimizations, amounting to a vast set of optimizations. Notables include:

  • Combining loops: When two nearby loops are iterated the same amount of times, the compiler can try to combine the bodies of the loops, to be executed at the same time (in parallel) in the case where nothing in the bodies reference each other, i.e., they are fully independent of each other.
  • Inversion loops: Basically you replace a regular while loop with a do-while loop. And the do-while loop is set within an if clause. This replacement leads to two less jumps. However, it adds to the condition check and hence increases the code size. This optimization is an excellent example of how using slightly more resources leads to a more efficient code - a cost-gain balance the compiler has to evaluate and decide on dynamically during runtime.
  • Tiling loops: Reorganizes the loop so that it iterates over blocks of data that are sized to fit in the cache.
  • Unrolling loops: Reduces the number of times the loop condition has to be evaluated and also the number of jumps. You can think of this as "inlining" several iterations of the body to be executed without crossing the loop condition. Unrolling loops comes with risk, as it might decrease performance by impairing the pipeline and causing multiple redundant instruction fetches. Again, this is a judgment call by the compiler to make at runtime, i.e., if the gain is enough, the cost might be worth it.

This has been an overview of what a compiler does on a bytecode level (and below) to improve an application's execution performance on a target platform. The optimizations discussed are common and popular, but only a brief sampling of the available options. These have been very simple and broad explanations, which hopefully serve to pique your interest for more in-depth exploration. See Resources for further reading.

In conclusion: Reflection points and highlights

Use different compilers for different needs.

  • Interpretation is the simplest form of bytecode translation to machine instructions, and works based on an instruction lookup table.
  • Compilers allow for optimization based on performance counters, but will require some additional resources (code cache, optimization threads, etc.)
  • Client-side compilers improve the performance of execution code by an order of magnitude (5 to 10 times better) when compared to interpreted code.
  • Server-side compilers improve application performance by 30 percent to 50 percent over client-side compilers, but utilize more resources.
  • Tiered compilation provides the best of two worlds. Enable client compilation to get your code performing well quickly, and server compilation over time, to make frequently called code execute even better.

There are many possible code optimizations. An important task for the compiler is to analyze all possibilities and weigh the cost of using an optimization against the execution speed benefit of the output machine code.

Eva Andreasson has been involved with Java virtual machine technologies, SOA, cloud computing, and other enterprise middleware solutions for 10 years. She joined the startup Appeal Virtual Solutions (later acquired by BEA Systems) in 2001 as a developer of the JRockit JVM. Eva has been awarded two patents for garbage collection heuristics and algorithms. She also pioneered Deterministic Garbage Collection which later became productized through JRockit Real Time. Eva has worked closely with Sun and Intel on technical partnerships, as well as various integration projects of JRockit Product Group, WebLogic, and Coherence (post Oracle acquisition in 2008). In 2009 Eva joined Azul Systems as product manager for the new Zing Java Platform. Recently she switched gears and joined the team at Cloudera as senior product manager for Cloudera's Hadoop distribution, where she is engaged in the exciting future and innovation path of highly scalable, distributed data processing frameworks.

Learn more about this topic

  • "JVM performance optimization, Part 1: A JVM technology primer" (Eva Andreasson, JavaWorld, August 2012) launches the JVM performance optimization series with an overview of how a classic Java virtual machine works, including Java's write-once, run-anywhere engine, garbage collection basics, and some common GC algorithms.
  • See "Watch your HotSpot compiler go" (Vladimir Roubtsov, JavaWorld.com, April 2003) for more about the mechanics of hotspot optimization and why it pays to warm up your compiler.
  • If you want to learn more about bytecode and the JVM, see "Bytecode basics" (Bill Venners, JavaWorld, 1996), which takes an initial look at the bytecode instruction set of the Java virtual machine, including primitive types operated upon by bytecodes, bytecodes that convert between types, and bytecodes that operate on the stack.
  • The Java compiler javac is fully discussed in the formal Java platform documentation.
  • Get more of the basics of JVM (JIT) compilers, see the IBM Research Java JIT compiler page.
  • Also see Oracle JRockit's "Understanding Just-In-Time Compilation and Optimization" (Oracle® JRockit Introduction Release R28).
  • Dr. Cliff Click gives a complete tutorial on tiered compilation in his Azul Systems blog (July 2010).
  • Learn more about using performance counters for JVM performance optimization: "Using Platform-Specific Performance Counters for Dynamic Compilation" (Florian Schneider and Thomas R. Gross; Proceedings of the 18th international conference on Languages and Compilers for Parallel Computing, published by ACM Digital Library).
  • Oracle JRockit: The Definitive Guide (Marcus Hirt, Marcus Lagergren; Packt Publishing, 2010): A complete guide to the JRockit JVM.
1 2 Page 2
Page 2 of 2