JVM performance optimization, Part 2: Compilers

Use the right Java compiler for your Java application

Java compilers take center stage in this second article in the JVM performance optimization series. Eva Andreasson introduces the different breeds of compiler and compares performance results from client, server, and tiered compilation. She concludes with an overview of common JVM optimizations such as dead-code elimination, inlining, and loop optimization.

A Java compiler is the source of Java's famous platform independence. A software developer writes the best Java application that he or she can, and then the compiler works behind the scenes to produce efficient and well-performing execution code for the intended target platform. Different kinds of compilers meet various application needs, thus yielding specific desired performance results. The more that you understand about compilers, in terms of how they work and what kinds are available, the more you'll be able to optimize Java application performance.

This second article in the JVM performance optimization series highlights and explains the differences between various Java virtual machine compilers. I'll also discuss some common optimizations used by Just-In-Time (JIT) compilers for Java. (See "JVM performance optimization, Part 1" for a JVM overview and introduction to the series.)

What is a compiler?

Simply speaking, a compiler takes a programming language as input and produces an executable language as output. One commonly known compiler is javac, which is included in all standard Java development kits (JDKs). javac takes Java source code as input and translates it into bytecode -- the executable language of the JVM. The bytecode is stored in .class files, which are loaded into the Java runtime when the Java process is started.
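For example, you can compile a source file with javac and then inspect the resulting bytecode with javap, the disassembler included in the JDK. (The file name Add.java is just a placeholder for a class containing the code you want to examine.)

javac Add.java
javap -c Add

The javap -c output lists the bytecode instructions for each method, much like the bytecode listings later in this article.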

Bytecode can't be read by standard CPUs and needs to be translated into an instruction language that the underlying execution platform can understand. The component in the JVM that is responsible for translating bytecode to executable platform instructions is yet another compiler. Some JVM compilers handle several levels of translation; for instance, a compiler might create various levels of intermediate representation of the bytecode before it turns into actual machine instructions, the final step of translation.

Bytecode and the JVM

If you want to learn more about bytecode and the JVM, see "Bytecode basics" (Bill Venners, JavaWorld).

From a platform-agnostic perspective, we want to keep code platform-independent as far as possible, so that the last translation level -- from the lowest representation to actual machine code -- is the step that locks execution to a specific platform's processor architecture. The highest level of separation is between static and dynamic compilers. From there, we have options depending on what execution environment we're targeting, what performance results we desire, and what resource restrictions we need to meet. I briefly discussed static and dynamic compilers in Part 1 of this series. In the following sections I'll explain a bit more.

Static vs. dynamic compilation

An example of a static compiler is the previously mentioned javac. With static compilers the input code is translated once, and the output executable is the form that will be used every time the program is executed. Unless you make changes to your original source and recompile the code, the output will always be the same; this is because the input is static and the compiler is a static compiler.

In a static compilation, the following Java code

static int add7(int x) {
    return x + 7;
}

would result in something similar to this bytecode:

iload_0
bipush 7
iadd
ireturn

A dynamic compiler translates from one language to another dynamically, meaning that the translation happens as the code is executed -- during runtime! Dynamic compilation and optimization give runtimes the advantage of being able to adapt to changes in application load. Dynamic compilers are very well suited to Java runtimes, which commonly execute in unpredictable and ever-changing environments. Most JVMs use a dynamic compiler such as a Just-In-Time (JIT) compiler. The catch is that dynamic compilation and code optimization sometimes need extra data structures, threads, and CPU resources. The more advanced the optimization or bytecode-context analysis, the more resources are consumed by compilation. In most environments the overhead is still very small compared to the significant performance gain of the output code.
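One easy way to see a dynamic compiler at work is to ask it to log its activity. On HotSpot-based JVMs, for example, the -XX:+PrintCompilation option prints a line for each method as it is JIT-compiled (MyApp is a placeholder for your main class):

java -XX:+PrintCompilation MyApp

You'll typically see a burst of compilation at startup that tapers off as the hot parts of the application are translated.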

JVM varieties and Java platform independence

All JVM implementations have one thing in common, which is their attempt to get application bytecode translated into machine instructions. Some JVMs interpret application code on load and use performance counters to focus on "hot" code. Some JVMs skip interpretation and rely on compilation alone. Compilation's resource intensiveness can be a bigger hit (especially for client-side applications), but it also enables more advanced optimizations. See Resources for more information.

If you are new to Java, the intricacies of JVMs will be a lot to wrap your head around. The good news is that you don't really need to! The JVM manages code compilation and optimization, so you don't have to worry about machine instructions and the optimal way of writing application code for an underlying platform architecture.

From Java bytecode to execution

Once you have your Java code compiled into bytecode, the next step is to translate the bytecode instructions into machine code. This can be done by either an interpreter or a compiler.

Interpretation

The simplest form of bytecode execution is called interpretation. An interpreter simply looks up the hardware instructions for each bytecode instruction and sends them off to be executed by the CPU.

You could think of interpretation as being like using a dictionary: for a specific word (bytecode instruction) there is an exact translation (machine code instruction). Since the interpreter reads and immediately executes one bytecode instruction at a time, there is no opportunity to optimize over an instruction set. An interpreter also has to do the interpretation every time a bytecode instruction is invoked, which makes it fairly slow. Interpretation is an accurate way of executing code, but the unoptimized output instruction set will likely not be the highest-performing sequence for the target platform's processor.
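If you want to feel the difference for yourself, HotSpot-based JVMs offer the -Xint option, which disables the JIT compiler and forces the JVM to run in interpreted-only mode (MyApp is a placeholder):

java -Xint MyApp

Comparing a run like this to a normal run is a crude but instructive way to see how much of your application's performance comes from compilation.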

Compilation

A compiler, on the other hand, loads the entire code to be executed into the runtime. As it translates bytecode, it has the ability to look at the entire or partial runtime context and make decisions about how to actually translate the code. Its decisions are based on analysis of code graphs, such as the different execution branches of instructions, and on runtime-context data.

When a bytecode sequence is translated into a machine-code instruction set and optimizations can be done to this instruction set, the replacing instruction set (that is, the optimized sequence) is stored in a structure called the code cache. The next time that bytecode is executed, the previously optimized code can be immediately located in the code cache and used for execution. In some cases a performance counter might kick in and override the previous optimization, in which case the compiler will run a new optimization sequence. The advantage of a code cache is that the resulting instruction set can be executed at once -- no need for interpretive lookups or recompilation! This speeds up execution time, especially for Java applications where the same methods are called multiple times.
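The code cache is a limited region of memory. On HotSpot-based JVMs its size can be tuned with the -XX:ReservedCodeCacheSize option if the compiled code for a large application stops fitting (the value below is only an example):

java -XX:ReservedCodeCacheSize=256m MyApp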

Optimization

Along with dynamic compilation comes the opportunity to insert performance counters. The compiler might, for instance, insert a performance counter that counts every time a bytecode block (e.g., one corresponding to a specific method) is called. Compilers use data about how "hot" a given bytecode is to determine where in the code optimizations will best impact the running application. Runtime profiling data enables the compiler to make a rich set of code-optimization decisions on the fly, further improving code-execution performance. As more refined code-profiling data becomes available, it can be used to make additional and better optimization decisions, such as how to better sequence instructions in the compiled-to language, whether to replace a set of instructions with a more efficient set, or even whether to eliminate redundant operations.
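On HotSpot-based JVMs, one of these counter-driven decisions is even exposed as a tunable: -XX:CompileThreshold sets how many times a method must be invoked before it is considered hot enough to compile (the value below is illustrative; the defaults differ between the client and server compilers):

java -XX:CompileThreshold=10000 MyApp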

Example

Consider the Java code:

static int add7(int x) {
    return x + 7;
}

This could be statically compiled by javac to the bytecode:

iload_0
bipush 7
iadd
ireturn

When the method is called, the bytecode block will be dynamically compiled into machine instructions. When a performance counter (if present for the code block) hits a threshold, the code might also get optimized. The end result could look like the following machine instruction set for a given execution platform:

lea rax,[rdx+7]
ret

Different compilers for different applications

Different Java applications have different needs. Long-running enterprise server-side applications could allow for more optimizations, while smaller client-side applications may need fast execution with minimal resource consumption. Let's consider three different compiler settings and their respective pros and cons.

Client-side compilers

A well-known optimizing compiler is C1, the compiler that is enabled through the -client JVM startup option. As the option name suggests, C1 is a client-side compiler. It is designed for client-side applications that have fewer resources available and are, in many cases, sensitive to application startup time. C1 uses performance counters for code profiling to enable simple, relatively unintrusive optimizations.
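Enabling C1 is as simple as adding the option to your launch command (MyApp is a placeholder; note that some 64-bit JVM builds ignore -client and always use the server compiler):

java -client MyApp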

Server-side compilers

For long-running applications such as server-side enterprise Java applications, a client-side compiler might not be enough. A server-side compiler like C2 could be used instead. C2 is usually enabled by adding the JVM startup option -server to your startup command line. Since most server-side programs are expected to run for a long time, enabling C2 means that you will be able to gather more profiling data than you would with a short-running, lightweight client application, and so you'll be able to apply more advanced optimization techniques and algorithms.
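The launch command mirrors the client-side case (MyApp is again a placeholder):

java -server MyApp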

Tip: Warm up your server-side compiler

For server-side deployments it may take some time before the compiler has optimized the initial "hot" parts of the code, so server-side deployments often require a "warm up" phase. Before doing any kind of performance measurement on a server-side deployment, make sure that your application has reached the steady state! Allowing the compiler enough time to compile properly will work to your benefit! (See the JavaWorld article "Watch your HotSpot compiler go" for more about warming up your compiler and the mechanics of profiling.)
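Here is a minimal sketch of the idea, reusing the add7 method from earlier: drive the hot code long enough for the JIT compiler to kick in before starting the measurement. (The iteration counts are illustrative, and real benchmarks need more care than this.)

public class WarmupDemo {
    static int add7(int x) {
        return x + 7;
    }

    public static void main(String[] args) {
        int sink = 0;

        // Warm-up phase: invoke the method enough times that the JIT
        // compiler's performance counters mark it as hot and compile it.
        for (int i = 0; i < 100000; i++) {
            sink += add7(i);
        }

        // Measurement phase: time the steady-state (compiled) code.
        long start = System.nanoTime();
        for (int i = 0; i < 100000; i++) {
            sink += add7(i);
        }
        long elapsed = System.nanoTime() - start;

        // Printing sink keeps the loops from being removed as dead code.
        System.out.println("elapsed ns: " + elapsed + " (sink=" + sink + ")");
    }
}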

A server compiler accounts for more profiling data than a client-side compiler does and allows more complex branch analysis, meaning that it considers which optimization path would be more beneficial. Having more profiling data available yields better optimization decisions, and hence better application performance. Of course, doing more extensive profiling and analysis requires expending more resources on the compiler. A JVM with C2 enabled will use more threads and more CPU cycles, require a larger code cache, and so on.

Tiered compilation

Tiered compilation combines client-side and server-side compilation. Azul first made tiered compilation available in its Zing JVM. More recently (as of Java SE 7) it has been adopted by the Oracle Java HotSpot JVM. Tiered compilation gives you the advantages of both the client and server compilers in your JVM. The client compiler is most active during application startup and handles optimizations triggered by lower performance-counter thresholds. The client-side compiler also inserts performance counters and prepares instruction sets for more advanced optimizations, which will be addressed at a later stage by the server-side compiler. Tiered compilation is a very resource-efficient way of profiling because the compiler is able to collect data during low-impact compiler activity, and that data can be used for more advanced optimizations later. This approach also yields more information than you'll get from using interpreted-code profile counters alone.
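As of Java SE 7, tiered compilation ships with the HotSpot JVM but is not enabled by default; you can switch it on explicitly (MyApp is a placeholder):

java -server -XX:+TieredCompilation MyApp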

The chart in Figure 1 depicts the performance differences between pure interpretation, client-side, server-side, and tiered compilation. The X-axis shows execution time (time unit) and the Y-axis shows performance (ops/time unit).

Figure 1. Performance differences between compilers

Compiler performance comparisons

Compared to purely interpreted code, using a client-side compiler leads to approximately 5 to 10 times better execution performance (in ops/s). The variation in gain depends on how efficient the compiler is, what optimizations are enabled or implemented, and (to a lesser extent) how well-designed the application is with regard to the target platform of execution. The latter is really something a Java developer should never have to worry about, though.

Compared to a client-side compiler, a server-side compiler usually increases code performance by a measurable 30 to 50 percent. In most cases that performance improvement will outweigh the additional resource cost.
