Make Java fast: Optimize!

How to get the greatest performance out of your code through low-level optimizations in Java -- complete with a Benchmark applet


(The Benchmark applet is embedded at this point in the article; a Java-enabled browser is required to run it.)

Briefly, the individual benchmarks are:

  • Calibrate -- calculates the loop overhead for the one-second timed loop. It's included in the benchmarks to make it easy to test the environment the applet is running in.

  • Mixed -- a collection of tests from the other benchmarks.

  • Loop -- tests a variety of looping constructs for empty loops and array loops.

  • Variable -- tests accessing different classes of variables.

  • Method -- shows the timing of method invocations under various conditions.

  • Operator -- times the execution of arithmetic operators on the data types.

  • Casting -- tests casting and instanceof expressions.

  • Instantiation -- tests object and array instantiations.

  • Exception -- times throwing and catching exceptions and try/catch/finally statements.

  • Thread -- measures the timing for thread instantiation and execution as well as task switching with wait()/notify(). (Developing a test for threads that actually ran on all the platforms I could test turned out to be quite a challenge.)

To test another Java virtual machine (JVM) besides your browser, download the applet and run it. To test output from another compiler, download the source, recompile the code, and run it. You also can easily add your own tests to the benchmark. For comparison, here are benchmark results for a few Java runtimes.

Understanding the benchmarks

The purpose of benchmarking operations is to know which operations to avoid or to replace with alternate implementations. This varies between environments -- and especially between interpreters and JIT compilers, which convert Java's bytecodes into native code at run time. Interpreted code benefits from choosing operations that map onto cheap bytecode instructions, while a JIT compiler often has strengths and weaknesses independent of the bytecode. We'll go over each of the benchmarks and identify the types of operations that work best for both interpreted and non-interpreted environments. (Note that some compilers, such as Asymetrix's SuperCede on the PC, produce native code directly. Wherever there is a reference to a JIT compiler, you can substitute any compiler that produces native code.)

The loop benchmark

Almost all speed problems involve loops, and the loop itself is an easy place to get a quick performance boost. The impact of the loop overhead depends on how many iterations the loop executes and how much it does during each iteration. Obviously, the more iterations and the less work done per iteration, the greater the benefit of optimizing the loop itself. The basic loop we're going to look at is:

for (i = 0; i < N; i++) {
    // do something
}

The first rule is to use a local int variable for the loop counter. We'll see the reason for this in the Variable benchmark. It's unlikely that the loop counter would be anything other than a local int, but I mention it just in case. If the loop variable i doesn't need to count from 0 to N-1 inside the loop, then the loop can be restructured to use a more efficient test. Rather than comparing i against N at each iteration, which requires N to be loaded (assuming N isn't a constant or static final variable), restructure the loop to take advantage of the efficient checks against 0. Comparing against zero is almost always more efficient in any language, since the underlying tests are based on < 0, <= 0, == 0, != 0, >= 0, and > 0. To optimize, restructure the loop as:

for (i = N; --i >= 0; ) {
    // do something
}

By combining the predecrement of i with the comparison against 0, the condition flags needed for the comparison are already set -- so no explicit comparison is required. Interpreted code doesn't actually get this free test, because the bytecodes used for conditional tests require the condition value to be on the stack; but it does get a cheaper test, because i is used for the test directly without requiring an explicit comparison against a loaded value. Code from a JIT compiler would almost certainly get the full benefit of the free, implicit comparison against zero.

One common use of small loops is to iterate through the elements in an array. If the purpose of the loop is simply to copy elements from one array to another array of the same data type (that is, from an array of int to another array of int -- not from an array of byte to an array of int, where a data type conversion is required), then use System.arraycopy(). This is a special-purpose, optimized native method for copying array elements; for copies of more than a few elements, it's by far the fastest way to copy them.
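
For instance, an element-by-element copy loop like the first one below can be replaced by a single call to System.arraycopy() (a minimal sketch; the array names are made up):

int[] src = new int[1000];
int[] dst = new int[1000];

// Element-by-element copy -- every access is bounds checked.
for (int i = src.length; --i >= 0; )
    dst[i] = src[i];

// The same copy using the optimized native method:
// arraycopy(source, sourcePosition, destination, destinationPosition, length)
System.arraycopy(src, 0, dst, 0, src.length);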

If you're not copying and want to iterate to the end of the array, there is an alternative. Because all array accesses in Java are bounds checked, you can rely on this implicit testing and simply catch the ArrayIndexOutOfBoundsException, which will be thrown when the loop runs past the end of the array.

try {
    for (i = 0; ; i++)
        array[i] = i;
}
catch (ArrayIndexOutOfBoundsException e) {}

The above isn't necessarily good coding style, but coding style is frequently compromised by optimizations. You'll want to make sure the loop performs enough iterations to cover the cost of throwing the exception. (See the exception benchmark for these costs.) Because an exception is not cheap to throw, this technique is usually worthwhile only when the loop typically performs a lot of iterations. In fact, consider using an exception to terminate a loop only as a last-ditch effort, since the results are likely to vary on different platforms.

The variable benchmark

The performance of a variable is determined by its scope and its type. Local method variables are the fastest to access. There are a couple of reasons for this. One is that there is no reference either to the object instance, in the case of an instance variable, or to the class, in the case of a static variable; there is simply an index into the method's data space. The other reason, for interpreted code at least, is that there are special bytecodes with implicit offsets for accessing the first four local variables and parameters. Variables are assigned to slots in a method according to the following rules (a short example follows the list):

  • If it is an instance method, the first local variable is always the this reference (this is implicitly passed as the left-most parameter to an instance method).

  • After the this reference, if any, is assigned, the parameters to the method are assigned to slots, starting with the left-most parameter through the right-most parameter.

  • Any remaining slots are filled by local variables in the order that they are declared.

  • Each double or long occupies two variable slots. The first slot must be within the first four slots to use the special bytecodes for accessing the long or double variable.
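
As an example of these rules, here is how the slots would be assigned for a hypothetical instance method (the method and its variables are invented for illustration):

int sample (int a, long b, int c) {  // slot 0: this, slot 1: a, slots 2-3: b, slot 4: c
    int x = c + a;                   // slot 5: x; only this, a, and b fall within the
    return x;                        // first four slots that have short-form bytecodes
}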

Under a JIT compiler, some local variables are likely to get a further boost in performance by being kept in a register.

The fastest types of variables are int and reference variables. For values stored in variables (as opposed to values stored in arrays), int, boolean, byte, short, and char variables are all represented as 32-bit values and use the same bytecodes for loading and storing. In arrays, boolean and byte values are stored as 8-bit values, while short and char values are stored as 16-bit values. int, float, and object reference values are always stored as 32-bit values, and long and double values are always stored as 64-bit values.

Operations on byte, short, and char values are performed as ints, and assigning the result to a variable of the corresponding type requires an explicit cast. In the following, (b + c) is an int value and has to be cast to be assigned to a byte variable:

byte a, b, c;
a = (byte) (b + c);

This cast requires an extra bytecode instruction, so there is another penalty for using smaller types. The speed differences in using these types when using a JIT compiler are less clear. In this case, it depends on what data types the processor handles efficiently.

When using arrays in Java, it's important to realize how an array initializer is implemented. The array

byte[][] myData = {{1, 2}, {3, 4}};

is initialized at run time one element at a time. It is translated literally by the compiler to the equivalent of:

byte[][] myData = new byte[2][];
myData[0] = new byte[2];
myData[0][0] = 1;
myData[0][1] = 2;
myData[1] = new byte[2];
myData[1][0] = 3;
myData[1][1] = 4;

This array initialization has a couple of ramifications: 1) Embedding data in arrays in the class file can bloat your class files. For data of any significant size, it is probably faster to load the data separately. 2) An array initialized as a local variable executes all of this code each time the method is invoked. Declare the array as a static or instance member to eliminate the per-invocation initialization. Even if you need a fresh copy each time the method is called, for non-trivial arrays it is faster to store a single initialized copy and clone it when needed:

private static final int[] data = {1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89};

int[] fibo (int inc) {
    // Clone the stored array rather than re-running an initializer on every call.
    int[] result = (int[]) data.clone();
    for (int i = result.length; --i >= 0; )
        result[i] += inc;
    return result;
}

(For a multidimensional array, System.arraycopy() and clone() both will only do a shallow copy of the outermost dimension. The subarrays will be shared unless they are individually copied or cloned.)
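
If the rows must not be shared, each subarray has to be cloned individually. A minimal sketch, reusing the myData array from above:

byte[][] copy = new byte[myData.length][];
for (int i = myData.length; --i >= 0; )
    copy[i] = (byte[]) myData[i].clone();  // clone each row so the copies don't share subarrays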

The method benchmark

There are four different bytecodes for invoking methods, and even more quick bytecodes. Quick bytecodes are special bytecodes used by the JVM to accelerate operations at run time. You can think of the quick bytecodes as the interpreter's version of a JIT compiler: they convert an operation that requires looking up a method (or field or class) by name into one that uses an index, so the name only has to be resolved once. The quick bytecodes are generated only at run time and are never produced by the compiler. You don't really need to know anything about quick bytecodes other than that they make certain operations faster the second and subsequent times they are executed. The first time the method benchmark is run in an interpreter after the Benchmark applet is loaded, it tests the non-quick bytecodes. After that, the interpreter has changed the code to use quick bytecodes, so the benchmark runs those tests four to five times faster.

There are two basic categories of methods: those that can be statically resolved by the compiler and the rest, which must be dynamically resolved at run time. To statically resolve a method, the compiler must be able to determine absolutely which method should be invoked. Methods declared as static, private, and final, as well as all constructors, may be resolved at compile time because the class the method belongs to is known. A statically resolved method executes more quickly than a dynamically resolved method.

The best way to optimize method calls is to eliminate them. The compiler has an option for inlining certain statically resolved methods (as described above in "Putting the compiler to work") -- or you can simply inline method calls yourself by replacing the calls with equivalent code. In a critical section of code, manually inlining calls to small methods such as Math.min(), Math.max(), and Math.abs() can speed up those operations dramatically.
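
As a sketch of what manual inlining looks like, the commented-out library call below is replaced with an equivalent conditional expression (the clip() method is invented for the example, and this is only worth doing in genuinely hot code):

int clip (int x) {
    // return Math.max(x, 0);   // a method invocation on every call
    return (x > 0) ? x : 0;     // the same result with the call inlined by hand
}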

The next best option is to convert method and especially interface invocations to static, private, or final calls. If you have methods in your class that don't make use of instance variables or call other instance methods, they can be declared static. If a method doesn't need to be overridden, it can be declared final. And methods that shouldn't be called from outside the class should always be declared private anyway.
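
For example, in a made-up class the declarations might look like this:

class Renderer {
    private int frames;                        // instance state

    static int scale (int value, int factor) { // uses no instance state, so it can be static
        return value * factor;
    }

    final int frameCount () {                  // doesn't need to be overridden, so make it final
        return frames;
    }

    private void reset () {                    // only used inside the class, so keep it private
        frames = 0;
    }
}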

By far the most expensive type of call in Java is to a synchronized method. Not only does the overhead of calling a synchronized method dramatically increase the calling time, especially with a JIT compiler, but there is always the potential that the call will be delayed waiting for the monitor for the object. And when the method returns, the monitor must be released, which takes more time.

There are a couple of things you can do to speed up synchronized code. Again, the first line of defense is to eliminate calls to synchronized methods, so make sure the operations actually require synchronization. Be careful, though; this is not an area of your program to be overly ambitious in optimizing, as the problems you can cause can be difficult to track down.

Because of the way a synchronized method is invoked, it is slightly faster to call a synchronized method than to enter a synchronized block. Also, a call to a synchronized method when the monitor is already owned by the thread executes somewhat faster than the synchronized method when the monitor isn't owned. So for the four combinations of these you get:

class Test {
    static int x;
 
    void test () {
        meth1();
        meth2();               // slowest
        synchronized (this) {
            meth1();           // fastest
            meth2();
        }
    }
 
    synchronized void meth1 () {
        x = 1;
    }
 
    void meth2 () {
        synchronized (this) { x = 1; }
    }
}

Of course, these timings don't include the time it took to enter the synchronized block inside test(). If you have a heavily synchronized class that is calling lots of synchronized methods from other synchronized methods, you may want to consider having synchronized methods delegate the work to private non-synchronized methods to avoid the overhead of reacquiring the monitor.

class X {
    public synchronized void setStuff (int i) {
        doSetStuff();
    }
 
    private void doSetStuff () {
        // setStuff code goes here
    }
 
    synchronized void doLotsOfStuff () {
        doSetStuff();  // instead of calling setStuff()
        // do more stuff...
    }
}

The operator benchmark

The operator benchmark runs a lot of tests but doesn't tell us too much. When run on an interpreter, however, it does point out one interesting tidbit about the bytecodes. The int++ test executes significantly faster than any of the other operators. This is because the only bytecode that directly modifies a variable, instead of operating on the stack, is the iinc instruction, which increments a local int directly by a sign-extended byte (-128 to 127). This bytecode is one big reason why you should always use a local int for a loop counter.

In general, this is the benchmark that illustrates where strength-reduction optimization techniques can be used. Under an interpreter there is no real difference between writing x = x + y; and x += y;, because the bytecodes generated are identical when x is a local variable. You should still use the compound operator because it makes the operation being performed easier to see. However, if x is an instance variable, an array element, or just about any other lvalue, then x += y; becomes the preferred expression. (An lvalue refers to the subexpression on the left-hand side of an assignment operator.)
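
For instance, with an array element as the lvalue, the compound form loads the array reference and index only once (a small sketch with made-up names):

int[] counts = new int[16];
int i = 3, y = 5;

counts[i] = counts[i] + y;   // loads the array reference and the index twice
counts[i] += y;              // loads them once -- preferred when the lvalue isn't a local variable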

Another issue is using shifts rather than multiplying and dividing by powers of 2. Many modern RISC CPUs have eliminated, or nearly eliminated, the difference in execution time between a multiply and a shift. However, divide remains a slow operation on most, if not all, architectures. So while x <<= 1; generally has little to no advantage over x *= 2;, x >>= 1; is still an improvement over x /= 2;.
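
A sketch of the substitution; note that for negative values an arithmetic shift right rounds toward negative infinity while integer division rounds toward zero, so the two forms only match when the value is known to be non-negative:

int half (int x) {        // assumes x >= 0
    // return x / 2;      // integer divide -- slow on most architectures
    return x >> 1;        // same result for non-negative x
}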

If you are running under an interpreter, then the benchmark also should show the general advantage int offers over using byte, short, char, or long.

The casting benchmark

Casting object references to refer to different object types (object casts) can get pretty dense in Java code -- especially if you work with a lot of vectors or other container classes. It turns out that any object cast that can't be resolved at compile time (an object cast that can be resolved at compile time is an unnecessary cast) takes long enough that it is better to save the result of the cast in a local variable than to repeat the cast. So instead of writing:

boolean equals (Object obj) {
    if (obj instanceof Point)
        return (((Point)obj).x == this.x && ((Point)obj).y == this.y);
    return false;
}

write it as:

boolean equals (Object obj) {
    if (obj instanceof Point) {
        Point p = (Point)obj;
        return (p.x == this.x && p.y == this.y);
    }
    return false;
}

Try it out yourself. These two examples are the last two tests in the casting benchmark.

If the cast is to an interface, it's probably at least twice as expensive as casting to a class. In fact, there is one type of cast that can take much longer to execute. If you have an object hierarchy like

interface X {}
class Super {}
class Sub extends Super implements X {}

then the following cast, to an interface implemented by a subclass, takes anywhere from two to three times as long as casting to the subclass.

Super cali = new Sub();
X x = (X) cali;

The further the separation between the interface and the subclass (that is, the further back in the interface inheritance chain the cast interface is from the implemented interface), the longer the cast takes to resolve.

Also beware of unnecessary uses of instanceof. The following cast is resolved by the compiler and produces no runtime code to implement the unnecessary cast.

Point q = new Point();
Point p = (Point) q;

However,

Point p = new Point();
boolean b = (p instanceof Point);

cannot be resolved by the compiler because instanceof must return false if p == null.

Casting data types is simpler and cheaper than casting objects because the type of the data value being cast is already known (an int, unlike an object reference, can never actually turn out to be an instance of a subclass). However, since int is the natural size used by the JVM (ints are the common data type that all the other data types can be directly converted to and from), beware of using the other data types. Casting from a long to a short requires first casting from the long to an int and then from the int to a short.
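
For example, the following cast compiles to two conversion bytecodes, l2i followed by i2s:

long bigValue = 70000L;
short smallValue = (short) bigValue;  // long-to-int, then int-to-short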

The instantiation benchmark

Instantiating an object is fairly expensive in terms of CPU cycles. On top of that, discarded objects will need to be garbage collected at some point. The efficiency of garbage collection is difficult to measure, and the Benchmark applet makes no attempt to measure it. However, if you run the instantiation benchmark, most of the extra time between the "test" time and the total running time is spent garbage collecting, so this gives some idea of the impact garbage collection can have on the performance of code that instantiates lots of objects. It appears to take about as long to garbage collect an object as it takes to create one.

Also, the longer the hierarchy chain for the object, the more constructors that must be called. This adds to the instantiation overhead. If you add extra layers to your class hierarchy for increased abstraction, you'll also get increased instantiation time.

The best option is to avoid instantiating objects in tight loops. Where possible, reuse an existing object. The loop

for (int i = 0; i < limit; i++) {
    StringBuffer sb = new StringBuffer();
    // do something with sb...
}

would be faster if written as

StringBuffer sb = new StringBuffer();
for (int i = 0; i < limit; i++) {
    sb.setLength(0);
    // do something with sb...
}

One other minor issue with instantiating objects: When an object is instantiated, all of its instance variables are automatically initialized to their default values, which, not coincidentally, is the value with all bits set to zero for every data type. Explicitly initializing instance variables to their default values duplicates the default initialization, generates additional bytecodes, and makes the instantiation take longer. (There are rare cases when the explicit initialization is needed to reinitialize a variable that was altered from its default value during the superclass's constructor.)
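
For example, the explicit initializers in this made-up class just repeat work the JVM has already done:

class Counter {
    int count = 0;        // redundant -- count is already 0 after allocation
    Object owner = null;  // redundant -- object references already default to null
    int total;            // same initial value, but no extra bytecodes in the constructor
}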

The exception benchmark

Throwing exceptions and catching them with try/catch blocks is normally an exceptional circumstance and therefore not typically a performance concern for optimization. Perhaps surprisingly, a try {} statement is "free" -- that is to say, no bytecodes are generated for it. Unless an exception is thrown, all a try {} statement adds to your code is a goto bytecode to skip the catch () {} block and one or more entries in the method's exception table. (An exception table keeps track of the beginning and ending bytecode offsets for try statements and the offsets of the corresponding catch and finally statements.) A finally statement adds a little more overhead, as it is implemented as an inline subroutine. When an exception is thrown, a quick check against the exception table for each method in the call chain determines whether the exception is handled by that method. Overall, the try/catch/finally mechanism in the JVM is efficient. (When translated to native code, however, try/catch/finally can present a challenge to producing optimized code because of issues with register variables, so using these statements may reduce the efficiency of code generated by a JIT compiler.)

As you can see from the exception benchmark, instantiating a new exception is not as streamlined as throwing and catching one. Instantiating an exception generates and stores a snapshot of the stack at the point of instantiation, and that is where most of the time goes. If you're throwing exceptions regularly as part of the normal operation of your code, you may want to rethink your design. But if you do have a need to do this (such as returning a result from several calls deep), and if you're simply going to catch and handle the exception, then repeatedly instantiating an exception can be a waste of time. Instead, it is possible to reuse an exception:

MyException myexcept = new MyException();
void doit (int val) throws MyException {
    if (val == 0)
        throw myexcept;
}

(I warned you that optimizing isn't pretty.)

The thread benchmark

I could write an entire article on the difficulties and bugs in various Java implementations that I encountered while writing a thread benchmark. Suffice it to say that just getting threads to behave similarly on all platforms can be a difficult proposition in itself. There are a number of performance issues related to threads, and they tend to be platform-specific.

To start with, each running thread has both a native, or "C," stack and a Java stack, so memory can be an issue. There is an overhead for switching between threads (a context switch), which can vary significantly between platforms. The thread benchmark shows the timing of using wait()/notify() to initiate context switches. Using wait()/notify() also requires releasing and acquiring the monitor for an object, so a context switch without the synchronization overhead would be somewhat faster. Synchronized access between threads, as discussed in the method benchmark section, can result in deadlocks and delays, or at a minimum it adds the monitor overhead to calls.

For example, a game that uses sprite animation could implement the sprites with a separate thread for each sprite. (A sprite is an arbitrarily shaped graphic that moves over the background without destroying it.) If there were many sprites, the overhead just from the context switches could bring the system to its knees, let alone what would happen if several sprites wanted to synchronize on the same object (like the game field) at the same time. A more efficient implementation would be to implement a single sprite manager that would handle all of the sprites in a single thread.
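
A minimal sketch of the single-thread approach follows; the Sprite interface and its update() method are assumptions made for this example:

interface Sprite {
    void update ();  // move or animate the sprite by one step
}

class SpriteManager implements Runnable {
    private final java.util.Vector sprites = new java.util.Vector();

    void add (Sprite s) {
        sprites.addElement(s);
    }

    public void run () {
        while (true) {   // one thread updates every sprite -- no per-sprite context switches
            for (int i = sprites.size(); --i >= 0; )
                ((Sprite) sprites.elementAt(i)).update();
            try {
                Thread.sleep(50);   // frame delay between updates
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}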

Timer threads provide another opportunity for optimization. The Benchmark applet uses threads to time its tests. It starts a new thread for each test, which is simple and straightforward. The Benchmark applet would spend a lot less time with the timer if it simply left a single thread running and signaled it when to start the timer. However, following my own advice, I didn't bother optimizing the timer because doing so would unnecessarily complicate the code.

Conclusions

Low-level optimizations in Java can increase the speed of your code. Frequently, just being aware of which operations are time consuming is enough to avoid problems in the first place. If it's necessary to optimize your code, make sure to follow these simple guidelines:

  • Don't optimize unless you know it's necessary

  • If you do need to optimize, start with the algorithm

  • Before you optimize the low-level operation of your code, profile or time the existing code to determine where the time is being spent

  • Profile the code again afterwards to see what effect your changes had

Doug Bell is vice president of development for FTL Games. FTL Games has been producing entertainment software for dozens of different home game consoles and home computers since the early '80s (before it was cool). During his 15-plus years in the entertainment software business, he has led the design and development of many top-selling commercial games, including the groundbreaking DungeonMaster RPG series. He has been working full-time on programming in Java for the last year and is a frequent contributor to the technical discussions in the comp.lang.java.* newsgroups. He is a co-founder of the San Diego County Java SIG user's group. He can be reached at dbell@shvn.com, and his neglected home page can be found at http://www.2nu.com/Doug/index.html.

