Make Java fast: Optimize!

How to get the greatest performance out of your code through low-level optimizations in Java -- complete with a Benchmark applet

According to the pioneering computer scientist Donald Knuth, "Premature optimization is the root of all evil." Any article on optimization must start by pointing out that there are usually more reasons not to optimize than to optimize.

  • If your code already works, optimizing it is a sure way to introduce new, and possibly subtle, bugs

  • Optimization tends to make code harder to understand and maintain

  • Some of the techniques presented here increase speed by reducing the extensibility of the code

  • Optimizing code for one platform may actually make it worse on another platform

  • A lot of time can be spent optimizing for little gain in performance, and the result is often obfuscated code

  • If you're overly obsessed with optimizing code, people will call you a nerd behind your back

Before optimizing, you should carefully consider whether you need to optimize at all. Optimization in Java can be an elusive target since the execution environments vary. Using a better algorithm probably will yield a bigger performance increase than any amount of low-level optimizations and is more likely to deliver an improvement under all execution conditions. As a rule, high-level optimizations should be considered before doing low-level optimizations.

So why optimize?

If it's such a bad idea, why optimize at all? Well, in an ideal world you wouldn't. But the reality is that sometimes the biggest problem with a program is that it requires simply too many resources, and these resources (memory, CPU cycles, network bandwidth, or a combination) may be limited. Code fragments that occur multiple times throughout a program are likely to be size-sensitive, while code with many execution iterations may be speed-sensitive.

Make Java fast!

Because Java is an interpreted language with a compact bytecode, speed -- or the lack of it -- is what most often pops up as a problem. We'll primarily look at how to make Java run faster rather than making it fit into a smaller space -- although we'll point out where and how these approaches affect memory or network bandwidth. The focus will be on the core language rather than on the Java APIs.

By the way, one thing we won't discuss here is the use of native methods written in C or assembly. While using native methods can give the ultimate performance boost, it does so at the cost of Java's platform independence. It is possible to write both a Java version of a method and native versions for selected platforms; this leads to increased performance on some platforms without giving up the ability to run on all platforms. But this is all I'm going to say on the subject of replacing Java with C code. (See the Java Tip, "Write native methods" for more information on this topic.) Our focus in this article is on how to make Java fast.

90/10, 80/20, hut, hut, hike!

As a rule, 90 percent of a program's execution time is spent executing 10 percent of the code. (Some people use the 80 percent/20 percent rule, but my experience writing and optimizing commercial games in several languages for the last 15 years has shown that the 90 percent/10 percent formula is typical for performance-hungry programs, since few tasks tend to be performed with great frequency.) Optimizing the other 90 percent of the program (where 10 percent of the execution time was spent) has no noticeable effect on performance. If you were able to make that 90 percent of the code execute twice as fast, the program would only be 5 percent faster. So the first task in optimizing code is to identify the 10 percent (frequently it is less than this) of the program that consumes most of the execution time. This isn't always where you expect it to be.

General optimization techniques

There are several common optimization techniques that apply regardless of the language being used. Some of these techniques, such as global register allocation, are sophisticated strategies to allocate machine resources (for example, CPU registers) and don't apply to Java bytecodes. We'll focus on the techniques that basically involve restructuring code and substituting equivalent operations within a method.

Strength reduction

Strength reduction occurs when an operation is replaced by an equivalent operation that executes faster. The most common example of strength reduction is using the shift operator to multiply and divide integers by a power of 2. For example, x >> 2 can be used in place of x / 4, and x << 1 replaces x * 2.
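
As a minimal sketch (the variable names here are arbitrary), the substitution looks like this. Note that for negative values x >> 2 rounds differently than x / 4, so the shift is only an exact replacement when x is known to be non-negative:

int x = 20;
int quarter = x >> 2;    // same result as x / 4 (x is non-negative here)
int doubled = x << 1;    // same result as x * 2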

Common subexpression elimination

Common subexpression elimination removes redundant calculations. Instead of writing

double x = d * (lim / max) * sx;
double y = d * (lim / max) * sy;

the common subexpression is calculated once and used for both calculations:

double depth = d * (lim / max);
double x = depth * sx;
double y = depth * sy;

Code motion

Code motion moves code that performs an operation or calculates an expression whose result doesn't change, or is invariant. The code is moved so that it only executes when the result may change, rather than executing each time the result is required. This is most common with loops, but it can also involve code repeated on each invocation of a method. The following is an example of invariant code motion in a loop:

for (int i = 0; i < x.length; i++)
    x[i] *= Math.PI * Math.cos(y);

becomes

double picosy = Math.PI * Math.cos(y);
for (int i = 0; i < x.length; i++)
    x[i] *= picosy;

Unrolling loops

Unrolling loops reduces the overhead of loop control code by performing more than one operation each time through the loop, and consequently executing fewer iterations. Working from the previous example, if we know that the length of x[] is always a multiple of two, we might rewrite the loop as:

double picosy = Math.PI * Math.cos(y);
for (int i = 0; i < x.length; i += 2) {
    x[i] *= picosy;
    x[i+1] *= picosy;
}

In practice, unrolling loops such as this -- in which the value of the loop index is used within the loop and must be separately incremented -- does not yield an appreciable speed increase in interpreted Java because the bytecodes lack instructions to efficiently combine the "+1" into the array index.
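
If the length of x[] can't be guaranteed to be a multiple of two, the same idea still applies; a common variation (sketched below, not part of the original example) is to unroll the bulk of the loop and handle any leftover element separately:

double picosy = Math.PI * Math.cos(y);
int n = x.length - (x.length % 2);    // largest even number of elements
for (int i = 0; i < n; i += 2) {      // unrolled portion
    x[i] *= picosy;
    x[i+1] *= picosy;
}
if (n < x.length)                     // handle the odd element, if any
    x[n] *= picosy;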

All of the optimization tips in this article embody one or more of the general techniques listed above.

Putting the compiler to work

Modern C and Fortran compilers produce highly optimized code. C++ compilers generally produce less efficient code, but are still well along the path to producing optimal code. All of these compilers have gone through many generations under the influence of strong market competition and have become finely honed tools for squeezing every last drop of performance out of ordinary code. They almost certainly use all of the general optimization techniques presented above. But there are still plenty of tricks left for making compilers generate efficient code.

javac, JITs, and native code compilers

The level of optimization that javac performs when compiling code at this point is minimal. It defaults to doing the following:

  • Constant folding -- the compiler resolves any constant expressions such that i = (10 * 10) compiles to i = 100.

  • Branch folding (most of the time) -- unnecessary goto bytecodes are avoided.

  • Limited dead code elimination -- no code is produced for statements like if(false) i = 1.

The level of optimization javac provides should improve, probably dramatically, as the language matures and compiler vendors begin to compete in earnest on the basis of code generation. Java is just now getting second-generation compilers.

Then there are just-in-time (JIT) compilers that convert Java bytecodes into native code at run time. Several are already available, and while they can increase the execution speed of your program dramatically, the level of optimization they can perform is constrained because optimization occurs at run time. A JIT compiler is more concerned with generating the code quickly than with generating the quickest code.

Native code compilers that compile Java directly to native code should offer the greatest performance but at the cost of platform independence. Fortunately, many of the tricks presented here will be achieved by future compilers, but for now it takes a little work to get the most out of the compiler.

javac does offer one performance option you can enable: invoking the -O option to cause the compiler to inline certain method calls:

javac -O MyClass.java

Inlining a method call inserts the code for the method directly into the code making the method call. This eliminates the overhead of the method call. For a small method, this overhead can represent a significant percentage of its execution time. Note that only methods declared as private, static, or final can be considered for inlining, because only these methods are statically resolved by the compiler. Also, synchronized methods won't be inlined. The compiler will only inline small methods, typically those consisting of only one or two lines of code.
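
For instance, in a hypothetical class like the one below, the first method is the kind of small, statically resolved accessor that -O can inline, while the second is excluded because it is synchronized:

class Counter {
    private int count;

    final int get() { return count; }             // small and final: a candidate for inlining

    synchronized void increment() { count++; }    // synchronized: won't be inlined
}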

Unfortunately, the 1.0 versions of the javac compiler have a bug: when the -O option is used, they can generate code that won't pass the bytecode verifier. (The bytecode verifier checks code before it is allowed to run to make sure that it doesn't violate any Java rules.) This has been fixed in JDK 1.1. The problem is that the compiler will inline methods that reference class members inaccessible to the calling class. For example, if the following classes are compiled together using the -O option

class A {
    private static int x = 10;
    public static int getX() { return x; }
}
 
class B {
    int y = A.getX();
}

the call to A.getX() in class B will get inlined in class B as if B had been written as:

class B {
    int y = A.x;
}

However, this causes bytecodes that access the private A.x variable to be generated in B's code. This code will execute fine, but since it violates Java's access restrictions, it will be flagged by the verifier with an IllegalAccessError the first time the code is executed.

This bug doesn't make the -O option useless, but you do have to be careful about how you use it. If invoked on a single class, it can inline certain method calls within the class without risk. Several classes can be inlined together as long as there aren't any potential access restrictions. And some code (such as applications) isn't subjected to the bytecode verifier. You can ignore the bug if you know your code will only execute without being subjected to the verifier. For additional information, see my javac -O FAQ.

Profilers

Fortunately, the JDK comes with a built-in profiler to help identify where time is spent in a program. It will keep track of the time spent in each routine and write the information to the file java.prof. To run the profiler, use the -prof option when invoking the Java interpreter:

java -prof myClass

Or for use with an applet:

java -prof sun.applet.AppletViewer myApplet.html

There are a few caveats for using the profiler. The profiler output is not particularly easy to decipher. Also, in JDK 1.0.2 it truncates method names to 30 characters, so it may not be possible to distinguish some methods. Unfortunately, with the Mac there is no means to invoke the profiler, so Mac users are out of luck. On top of all this, Sun's Java documentation page (see Resources) no longer includes the documentation for the -prof option. However, if your platform does support the -prof option, either Vladimir Bulatov's HyperProf or Greg White's ProfileViewer can be used to help interpret the results (see Resources).

It is also possible to "profile" code by inserting explicit timing into the code:

long start = System.currentTimeMillis();
// do operation to be timed here
long time = System.currentTimeMillis() - start;

System.currentTimeMillis() returns time in milliseconds (1/1000ths of a second). However, some systems, such as a Windows PC, have a system timer with less (much less) resolution than 1/1000th of a second. Even 1/1000th of a second isn't precise enough to accurately time many operations. In these cases, or on systems with low-resolution timers, it may be necessary to time how long it takes to repeat the operation n times and then divide the total time by n to get the actual time. Even when profiling is available, this technique can be useful for timing a specific task or operation.
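
A rough sketch of that repeat-and-divide approach (the repetition count and the operation being timed are placeholders) might look like this:

int n = 100000;                            // enough repetitions to exceed the timer's resolution
long start = System.currentTimeMillis();
for (int i = 0; i < n; i++) {
    // do operation to be timed here
}
long total = System.currentTimeMillis() - start;
double perOperation = (double) total / n;  // average time per operation, in milliseconds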

Here are a few closing notes on profiling:

  • Always time the code before and after making changes to verify that, at least on the test platform, your changes improved the program

  • Try to make each timing test under identical conditions

  • If possible, contrive a test that doesn't rely on any user input, as the variations in user response can cause the results to fluctuate

The Benchmark applet

The Benchmark applet measures the time it takes to do an operation thousands (or even millions) of times, subtracts the time spent doing operations other than the test (such as the loop overhead), and then uses this information to compute how long each operation took. It runs each test for approximately one second. In an attempt to eliminate random delays from other operations the computer may perform during a test, it runs each test three times and uses the best result. It also attempts to eliminate garbage collection as a factor in the tests. Because of this, the more memory available to the benchmark, the more accurate the benchmark results are.
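
The sketch below illustrates the general idea -- take the best of several runs and subtract the loop overhead -- but it is only an approximation of the technique, not the applet's actual code; reps and emptyLoopTime are assumed to be determined elsewhere:

long best = Long.MAX_VALUE;
for (int run = 0; run < 3; run++) {        // best of three runs to minimize random delays
    System.gc();                           // try to keep garbage collection out of the timing
    long start = System.currentTimeMillis();
    for (int i = 0; i < reps; i++) {
        // operation being benchmarked goes here
    }
    long elapsed = System.currentTimeMillis() - start;
    if (elapsed < best)
        best = elapsed;
}
double perOperation = (double) (best - emptyLoopTime) / reps;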
