Make Java fast: Optimize!

How to get the greatest performance out of your code through low-level optimizations in Java -- complete with a Benchmark applet

According to the pioneering computer scientist Donald Knuth, "Premature optimization is the root of all evil." Any article on optimization must start by pointing out that there are usually more reasons not to optimize than to optimize.

  • If your code already works, optimizing it is a sure way to introduce new, and possibly subtle, bugs

  • Optimization tends to make code harder to understand and maintain

  • Some of the techniques presented here increase speed by reducing the extensibility of the code

  • Optimizing code for one platform may actually make it worse on another platform

  • A lot of time can be spent optimizing, with little gain in performance, and can result in obfuscated code

  • If you're overly obsessed with optimizing code, people will call you a nerd behind your back

Before optimizing, you should carefully consider whether you need to optimize at all. Optimization in Java can be an elusive target since the execution environments vary. Using a better algorithm probably will yield a bigger performance increase than any amount of low-level optimizations and is more likely to deliver an improvement under all execution conditions. As a rule, high-level optimizations should be considered before doing low-level optimizations.

So why optimize?

If it's such a bad idea, why optimize at all? Well, in an ideal world you wouldn't. But the reality is that sometimes the biggest problem with a program is that it simply requires too many resources, and these resources (memory, CPU cycles, network bandwidth, or a combination) may be limited. Code fragments that occur multiple times throughout a program are likely to be size-sensitive, while code with many execution iterations may be speed-sensitive.

Make Java fast!

Because Java is an interpreted language with a compact bytecode, speed -- or the lack of it -- is what most often pops up as a problem. We'll primarily look at how to make Java run faster rather than making it fit into a smaller space -- although we'll point out where and how these approaches affect memory or network bandwidth. The focus will be on the core language rather than on the Java APIs.

By the way, one thing we won't discuss here is the use of native methods written in C or assembly. While using native methods can give the ultimate performance boost, it does so at the cost of Java's platform independence. It is possible to write both a Java version of a method and native versions for selected platforms; this leads to increased performance on some platforms without giving up the ability to run on all platforms. But this is all I'm going to say on the subject of replacing Java with C code. (See the Java Tip, "Write native methods" for more information on this topic.) Our focus in this article is on how to make Java fast.

90/10, 80/20, hut, hut, hike!

As a rule, 90 percent of a program's execution time is spent executing 10 percent of the code. (Some people use the 80 percent/20 percent rule, but my experience writing and optimizing commercial games in several languages for the last 15 years has shown that the 90 percent/10 percent formula is typical for performance-hungry programs, since a few tasks tend to be performed with great frequency.) Optimizing the other 90 percent of the program (where only 10 percent of the execution time was spent) has no noticeable effect on performance. If you were able to make that 90 percent of the code execute twice as fast, the program would only be about 5 percent faster. So the first task in optimizing code is to identify the 10 percent (frequently it is less than this) of the program that consumes most of the execution time. This isn't always where you expect it to be.

General optimization techniques

There are several common optimization techniques that apply regardless of the language being used. Some of these techniques, such as global register allocation, are sophisticated strategies to allocate machine resources (for example, CPU registers) and don't apply to Java bytecodes. We'll focus on the techniques that basically involve restructuring code and substituting equivalent operations within a method.

Strength reduction

Strength reduction occurs when an operation is replaced by an equivalent operation that executes faster. The most common example of strength reduction is using the shift operator to multiply and divide integers by a power of 2. For example, x >> 2 can be used in place of x / 4 (for non-negative values of x), and x << 1 replaces x * 2.
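
As a minimal sketch (not from the benchmarks in this article), here is strength reduction applied to an integer divide; note that the shift form is equivalent only for non-negative values:

// x >> 2 rounds toward negative infinity while x / 4 rounds toward zero,
// so the shift replacement assumes x is non-negative.
static int quarter (int x) {
    return x >> 2;    // same result as x / 4 when x >= 0
}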

Common subexpression elimination

Common subexpression elimination removes redundant calculations. Instead of writing

double x = d * (lim / max) * sx;
double y = d * (lim / max) * sy;

the common subexpression is calculated once and used for both calculations:

double depth = d * (lim / max);
double x = depth * sx;
double y = depth * sy;

Code motion

Code motion moves code that performs an operation or calculates an expression whose result doesn't change -- that is, code that is invariant. The code is moved so that it executes only when the result may change, rather than each time the result is required. This is most common with loops, but it can also involve code repeated on each invocation of a method. The following is an example of invariant code motion in a loop:

for (int i = 0; i < x.length; i++)
    x[i] *= Math.PI * Math.cos(y);

becomes

double picosy = Math.PI * Math.cos(y);
for (int i = 0; i < x.length; i++)
    x[i] *= picosy;

Unrolling loops

Unrolling loops reduces the overhead of loop control code by performing more than one operation each time through the loop, and consequently executing fewer iterations. Working from the previous example, if we know that the length of x[] is always a multiple of two, we might rewrite the loop as:

double picosy = Math.PI * Math.cos(y);
for (int i = 0; i < x.length; i += 2) {
    x[i] *= picosy;
    x[i+1] *= picosy;
}

In practice, unrolling loops such as this -- in which the value of the loop index is used within the loop and must be separately incremented -- does not yield an appreciable speed increase in interpreted Java because the bytecodes lack instructions to efficiently combine the "+1" into the array index.

All of the optimization tips in this article embody one or more of the general techniques listed above.

Putting the compiler to work

Modern C and Fortran compilers produce highly optimized code. C++ compilers generally produce less efficient code, but are still well along the path to producing optimal code. All of these compilers have gone through many generations under the influence of strong market competition and have become finely honed tools for squeezing every last drop of performance out of ordinary code. They almost certainly use all of the general optimization techniques presented above. But there are still plenty of tricks left for making compilers generate efficient code.

javac, JITs, and native code compilers

The level of optimization that javac performs when compiling code is, at this point, minimal. It defaults to doing the following:

  • Constant folding -- the compiler resolves any constant expressions such that i = (10 * 10) compiles to i = 100.

  • Branch folding (most of the time) -- unnecessary goto bytecodes are avoided.

  • Limited dead code elimination -- no code is produced for statements like if(false) i = 1.

The level of optimization javac provides should improve, probably dramatically, as the language matures and compiler vendors begin to compete in earnest on the basis of code generation. Java just now is getting second-generation compilers.

Then there are just-in-time (JIT) compilers that convert Java bytecodes into native code at run time. Several are already available, and while they can increase the execution speed of your program dramatically, the level of optimization they can perform is constrained because optimization occurs at run time. A JIT compiler is more concerned with generating the code quickly than with generating the quickest code.

Native code compilers that compile Java directly to native code should offer the greatest performance, but at the cost of platform independence. Fortunately, many of the tricks presented here will be performed automatically by future compilers, but for now it takes a little work to get the most out of the compiler.

javac does offer one performance option you can enable: invoking the -O option to cause the compiler to inline certain method calls:

javac -O MyClass.java

Inlining a method call inserts the code for the method directly into the code making the method call. This eliminates the overhead of the method call. For a small method this overhead can represent a significant percentage of its execution time. Note that only methods declared as either private, static, or final can be considered for inlining, because only these methods are statically resolved by the compiler. Also, synchronized methods won't be inlined. The compiler will only inline small methods typically consisting of only one or two lines of code.
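
As a rough illustration (the class and method here are hypothetical), a tiny accessor like the following is the kind of method -O may inline, since it is statically resolved, not synchronized, and only one line long:

class Coordinate {
    private int x;
 
    // Small, final, and not synchronized -- a candidate for inlining
    // at the call site when compiled with -O.
    final int getX () { return x; }
}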

Unfortunately, the 1.0 versions of the javac compiler have a bug: when the -O option is used, they will inline methods that reference class members inaccessible to the calling class, generating code that can't pass the bytecode verifier. (The bytecode verifier checks code before it is allowed to run to make sure that it doesn't violate any Java rules.) This has been fixed in JDK 1.1. For example, if the following classes are compiled together using the -O option

class A {
    private static int x = 10;
    public static int getX () { return x; }
}
 
class B {
    int y = A.getX();
}

the call to A.getX() in class B will get inlined in class B as if B had been written as:

class B {
    int y = A.x;
}

However, this causes bytecodes that access the private A.x variable to be generated in B's code. This code will execute fine, but since it violates Java's access restrictions, it will get flagged by the verifier with an IllegalAccessError the first time the code is executed.

This bug doesn't make the -O option useless, but you do have to be careful about how you use it. If invoked on a single class, it can inline certain method calls within the class without risk. Several classes can be inlined together as long as there aren't any potential access restrictions. And some code (such as applications) isn't subjected to the bytecode verifier. You can ignore the bug if you know your code will only execute without being subjected to the verifier. For additional information, see my javac -O FAQ.

Profilers

Fortunately, the JDK comes with a built-in profiler to help identify where time is spent in a program. It will keep track of the time spent in each routine and write the information to the file java.prof. To run the profiler, use the -prof option when invoking the Java interpreter:

java -prof myClass

Or for use with an applet:

java -prof sun.applet.AppletViewer myApplet.html

There are a few caveats for using the profiler. The profiler output is not particularly easy to decipher. Also, in JDK 1.0.2 it truncates method names to 30 characters, so it may not be possible to distinguish some methods. Unfortunately, on the Mac there is no means to invoke the profiler, so Mac users are out of luck. On top of all this, Sun's Java documentation page (see Resources) no longer includes the documentation for the -prof option. However, if your platform does support the -prof option, either Vladimir Bulatov's HyperProf or Greg White's ProfileViewer can be used to help interpret the results (see Resources).

It is also possible to "profile" code by inserting explicit timing into the code:

long start = System.currentTimeMillis();
// do operation to be timed here
long time = System.currentTimeMillis() - start;

System.currentTimeMillis() returns time in milliseconds (1/1000ths of a second). However, some systems, such as a Windows PC, have a system timer with less (much less) resolution than 1/1000th of a second. Even 1/1000th of a second isn't long enough to accurately time many operations. In these cases, or on systems with low-resolution timers, it may be necessary to time how long it takes to repeat the operation n times and then divide the total time by n to get the actual time per operation. Even when profiling is available, this technique can be useful for timing a specific task or operation.
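
Here is a minimal sketch of that repeat-and-divide technique; the repetition count and the placeholder operation are illustrative only:

int n = 100000;                           // chosen so the loop runs well past the timer resolution
long start = System.currentTimeMillis();
for (int i = n; --i >= 0; ) {
    // operation to be timed goes here
}
long total = System.currentTimeMillis() - start;
double msPerOp = (double) total / n;      // average time per operation, in milliseconds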

Here are a few closing notes on profiling:

  • Always time the code before and after making changes to verify that, at least on the test platform, your changes improved the program

  • Try to make each timing test under identical conditions

  • If possible, contrive a test that doesn't rely on any user input, as the variations in user response can cause the results to fluctuate

The Benchmark applet

The Benchmark applet measures the time it takes to do an operation thousands (or even millions) of times, subtracts the time spent doing operations other than the test (such as the loop overhead), and then uses this information to compute how long each operation took. It runs each test for approximately one second. In an attempt to eliminate random delays from other operations the computer may perform during a test, it runs each test three times and uses the best result. It also attempts to eliminate garbage collection as a factor in the tests. Because of this, the more memory available to the benchmark, the more accurate the benchmark results are.

The times for operations are given in nanoseconds, or billionths of a second, designated by "ns." While this is a very small increment of time, it actually may not be small enough to measure some operations accurately on very fast computers! If you are lucky enough to be using a fast machine, the benchmark will adjust the timing units to picoseconds (trillionths of a second), designated by "ps." If, on the other hand, your computer is slow, it may adjust back to using microseconds (millionths of a second), designated by "µs," or even all the way back to milliseconds (thousandths of a second), designated by "ms."

To run the Benchmark applet, select the tests you wish to run on the left, click the Run Benchmark button, and sit back and wait. Even moving the mouse around the screen can affect the accuracy of the benchmark, so just be patient while it runs the tests. You can stop the current test by clicking the Stop button, although it may take the applet a second or two to respond. Hitting Clear simply clears the text area. If Console is checked, the output will be mirrored to the Java console. If you want to save the results, you may need to check this.

Briefly, the individual benchmarks are:

  • Calibrate -- calculates the loop overhead for the one-second timed loop. It's included in the benchmarks to make it easy to test the environment the applet is running in.

  • Mixed -- a collection of tests from the other benchmarks.

  • Loop -- tests a variety of looping constructs for empty loops and array loops.

  • Variable -- tests accessing different classes of variables.

  • Method -- shows the timing of method invocations under various conditions.

  • Operator -- times the execution of arithmetic operators on the data types.

  • Casting -- tests casting and instanceof expressions.

  • Instantiation -- tests object and array instantiations.

  • Exception -- times throwing and catching exceptions and try/catch/finally statements.

  • Thread -- measures the timing for thread instantiation and execution as well as task switching with wait()/notify(). (Developing a test for threads that actually ran on all the platforms I could test turned out to be quite a challenge.)

To test another Java virtual machine (JVM) besides your browser, download the applet and run it. To test output from another compiler, download the source, recompile the code, and run it. You also can easily add your own tests to the benchmark. For comparison, here are benchmark results for a few Java runtimes.

Understanding the benchmarks

The purpose of benchmarking operations is to know which operations to avoid or to replace with alternate implementations. This varies between environments -- and especially between interpreters and JIT compilers that convert Java's bytecodes into native code at run time. Interpreted code will want to take advantage of the bytecode instructions, while a JIT compiler often will have strengths and weaknesses separate from the bytecode. We'll go over each of the benchmarks and identify the types of operations that work best for both interpreted and non-interpreted environments. (Note that some compilers, such as Asymetrix's SuperCede on the PC, produce native code directly. Wherever there is a reference to a JIT compiler, you can substitute any compiler that produces native code.)

The loop benchmark

Almost all speed problems involve loops, and the loop itself is an easy place to get a quick performance boost. The impact of the loop overhead depends on how many iterations the loop executes and how much work it does during each iteration. Obviously, the more iterations and the less work done per iteration, the greater the benefit of optimizing the loop itself. The basic loop we're going to look at is:

for (i = 0; i < N; i++) {
    // do something
}

The first rule is to use a local int variable for the loop counter. We'll see the reason for this in the Variable benchmark. It's unlikely that the loop counter would be anything other than a local int, but I mention it just in case. If it isn't necessary inside the loop for the loop variable i to count from 0 to N-1, then the loop can be restructured to use a more efficient test. Rather than comparing i against N at each iteration, which requires N to be loaded (assuming N isn't a constant or static final variable), restructure the loop to take advantage of the efficient checks against 0. Comparing against zero is almost always more efficient in any language since the underlying tests are based on < 0, <= 0, == 0, != 0, >= 0, and > 0. To optimize, restructure the loop as:

for (i = N; --i >= 0; ) {
    // do something
}

By combining the predecrement of i with the comparison against 0, the necessary condition flags for the comparison will already be set -- so no explicit comparison is required. Interpreted code doesn't actually get this free test because the bytecodes used for conditional tests require the condition value to be on the stack. But interpreted code does get a cheaper test because i is used for the test directly without requiring an explicit comparison. JIT compiler code would almost certainly get the benefit of both the free test and the implicit comparison to zero.

One common use of small loops is to iterate through the elements in an array. If the purpose of the loop simply is to copy elements from one array to another array of the same data type (that is, from an array of int to another array of int -- not from an array of byte to an array of int where a data type conversion is required), then use System.arraycopy(). This is a special-purpose optimized native method for copying array elements; for copies of more than a few elements, it's by far the fastest way to copy array elements.
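
For instance, a quick sketch of copying one int array into another of the same size with System.arraycopy():

int[] src = new int[1000];
int[] dst = new int[1000];
// Copies src.length elements, starting at index 0 in both arrays;
// for more than a few elements this beats an explicit copy loop.
System.arraycopy(src, 0, dst, 0, src.length);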

If you're not copying and want to iterate to the end of the array, there is an alternative. Because all array accesses in Java are bounds checked, you can rely on this implicit testing and simply catch the ArrayIndexOutOfBoundsException, which will be thrown when the loop runs past the end of the array.

try {
    for (i = 0; ; i++)
        array[i] = i;
}
catch (ArrayIndexOutOfBoundsException e) {}

The above isn't necessarily good coding style, but coding style frequently is compromised by optimizations. You'll want to make sure the loop performs enough iterations to cover the cost of throwing the exception. (See the exception benchmark for these cases.) Because an exception is not cheap to throw, this technique usually would be used only when the loop typically would perform a lot of iterations. In fact, consider using an exception to terminate a loop as a last-ditch effort, since the results are likely to vary on different platforms.

The variable benchmark

The performance of a variable is determined by its scope and its type. Local method variables are the fastest to access. There are a couple of reasons for this. One is that there is no reference to the object instance (in the case of an instance variable) or to the class (in the case of a static variable); there is simply an index into the method's data space. The other reason, for interpreted code at least, is that there are special bytecodes with implicit offsets for accessing the first four local variables and parameters. Variables are assigned to slots in a method according to the following rules (a small annotated sketch follows the list):

  • If it is an instance method, the first local variable is always the this reference (this is implicitly passed as the left-most parameter to an instance method).

  • After the this reference, if any, is assigned, the parameters to the method are assigned to slots, starting with the left-most parameter through the right-most parameter.

  • Any remaining slots are filled by local variables in the order that they are declared.

  • Each double or long occupies two variable slots. The first slot must be within the first four slots to use the special bytecodes for accessing the long or double variable.
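
Here is a small annotated sketch (the class and method are hypothetical) showing how these rules assign slots:

class SlotDemo {
    // Instance method: slot 0 holds this, slot 1 holds a, slot 2 holds b,
    // slots 3 and 4 hold d (a double takes two slots), and slot 5 holds i.
    // Only variables in the first four slots get the special short-form
    // load/store bytecodes.
    int sum (int a, int b, double d) {
        int i = a + b;
        return i + (int) d;
    }
}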

Under a JIT compiler, some local variables are likely to get a further boost in performance by being kept in a register.

The fastest types of variables are int and reference variables. For values stored in variables (as opposed to values stored in arrays), int, boolean, byte, short, and char variables all are represented as 32-bit values and use the same bytecodes for loading and storing. In arrays, boolean and byte values are stored as 8-bit values, while short and char values are stored as 16-bit values. int, float, and object reference values are always stored as 32-bits each, and long and double values are always stored as 64-bit values.

Operations on byte, short, and char values are performed as ints, and assigning the result to a variable of the corresponding type requires an explicit cast. In the following, (b + c) is an int value and has to be cast to assign it to a byte variable:

byte a, b, c;
a = (byte) (b + c);

This cast requires an extra bytecode instruction, so there is another penalty for using smaller types. The speed differences in using these types when using a JIT compiler are less clear. In this case, it depends on what data types the processor handles efficiently.

When using arrays in Java, it's important to realize how an array initializer is implemented. The array

byte[][] myData = {{1, 2}, {3, 4}};

is initialized at run time one element at a time. It is translated literally by the compiler to the equivalent of:

byte[][] myData = new byte[2][];
myData[0] = new byte[2];
myData[0][0] = 1;
myData[0][1] = 2;
myData[1] = new byte[2];
myData[1][0] = 3;
myData[1][1] = 4;

This array initialization has a couple of ramifications: 1) Embedding data in arrays in the class file can bloat your class files; for data of any significant size, it is probably faster to load the data separately. 2) An initialized array declared as a local variable will execute all of this code each time the method is invoked. Declare the array as a static or instance member to eliminate the initialization on each call. Even if you need a fresh copy each time the method is called, for non-trivial arrays it is faster to store a single initialized copy and make a copy of it:

private static int[] data = {1,1,2,3,5,8,13,21,34,55,89};
int[] fibo (int inc) {
    int[] data = (int[])this.data.clone();
    for (int i = data.length; --i >=0; )
        data[i] += inc;
    return data;
}

(For a multidimensional array, System.arraycopy() and clone() both will only do a shallow copy of the outermost dimension. The subarrays will be shared unless they are individually copied or cloned.)

The method benchmark

There are four different bytecodes for invoking methods, and even more quick bytecodes. Quick bytecodes are special bytecodes used by the JVM to accelerate operations at run time. You can think of the quick bytecodes as the interpreter's version of a JIT compiler. They convert an operation that requires looking up a method (or field or class) by name into one that uses an index, so that the name only has to be resolved once. The quick bytecodes are generated only at run time and are never produced by the compiler. You don't really need to know anything about quick bytecodes other than that they make certain operations faster the second and subsequent times they are executed. The first time the method benchmark is run in an interpreter after the Benchmark applet is loaded, it runs a test of the non-quick bytecodes. After that, the interpreter has changed the code to use quick bytecodes, so the benchmark then runs those tests four to five times faster.

There are two basic categories of methods: those that can be statically resolved by the compiler and the rest, which must be dynamically resolved at run time. To statically resolve a method, the compiler must be able to determine absolutely which method should be invoked. Methods declared as static, private, and final, as well as all constructors, may be resolved at compile time because the class the method belongs to is known. A statically resolved method executes more quickly than a dynamically resolved method.

The best way to optimize method calls is to eliminate them. The compiler has an option for inlining certain statically resolved methods (as described above in "Putting the compiler to work") -- or you can simply inline method calls yourself by replacing the calls with equivalent code. In a critical section of code, calling small, easily inlined methods such as Math.min(), Math.max(), and Math.abs() can speed up those operations dramatically.

The next best option is to convert method and especially interface invocations to static, private, or final calls. If you have methods in your class that don't make use of instance variables or call other instance methods, they can be declared static. If a method doesn't need to be overridden, it can be declared final. And methods that shouldn't be called from outside the class should always be declared private anyway.
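
A minimal sketch of the idea (the class and methods here are hypothetical):

class Geometry {
    // Uses no instance state, so it can be declared static and
    // resolved -- and possibly inlined -- at compile time.
    static int square (int n) { return n * n; }
 
    // Never meant to be overridden, so final lets the call be
    // resolved statically.
    final int doubled (int n) { return n + n; }
}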

By far the most expensive type of call in Java is to a synchronized method. Not only does the overhead of calling a synchronized method dramatically increase the calling time, especially with a JIT compiler, but there is always the potential that the call will be delayed waiting for the monitor for the object. And when the method returns, the monitor must be released, which takes more time.

There are a couple things you can do to speed up synchronized code. Again, the first line of defense is to eliminate calls to synchronized methods, so make sure that the operations actually require synchronization. Be careful though; this is not an area of your program to be overly ambitious in optimizing as the problems you can cause can be difficult to track down.

Because of the way a synchronized method is invoked, it is slightly faster to call a synchronized method than to enter a synchronized block. Also, a call to a synchronized method when the monitor is already owned by the thread executes somewhat faster than the synchronized method when the monitor isn't owned. So for the four combinations of these you get:

class Test {
    static int x;
 
    void test () {
        meth1();
        meth2();               // slowest
        synchronized (this) {
            meth1();           // fastest
            meth2();
        }
    }
 
    synchronized void meth1 () {
        x = 1;
    }
 
    void meth2 () {
        synchronized (this) { x = 1; }
    }
}

Of course, these timings don't include the time it took to enter the synchronized block inside test(). If you have a heavily synchronized class that is calling lots of synchronized methods from other synchronized methods, you may want to consider having synchronized methods delegate the work to private non-synchronized methods to avoid the overhead of reacquiring the monitor.

class X {
    public synchronized void setStuff (int i) {
        doSetStuff();
    }
 
    private void doSetStuff () {
        // setStuff code goes here
    }
 
    synchronized void doLotsOfStuff () {
        doSetStuff();  // instead of calling setStuff()
        // do more stuff...
    }
}

The operator benchmark

The operator benchmark runs a lot of tests but doesn't tell us too much. When run on an interpreter, however, it does point out one interesting tidbit about the bytecodes. The int++ test executes significantly faster than any of the other operators. This is because the only bytecode that directly modifies a variable instead of operating on the stack is the iinc instruction, which increments a local int directly by a sign-extended byte (-128 to 127). This bytecode is one big reason why you should always use a local int for a loop counter.

In general, it is this benchmark that illustrates where strength reduction optimization techniques can be used. Under an interpreter there is no real difference between writing x = x + y; and x += y;, because the bytecodes generated are identical when x is a local variable. You should still use the compound operator because it is easier to see the operation being performed. However, if x is changed to an instance variable, array element, or just about any other lvalue, then x += y; becomes the preferred expression. (An lvalue refers to the subexpression on the left-hand side of an assignment operator.)
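
A small sketch of the array-element case (illustrative only; the variable names are made up):

int[] values = new int[10];
int i = 3, y = 5;
// Preferred: the array reference and index are pushed on the stack only once.
values[i] += y;
// Same result, but the array reference and index are evaluated twice
// in the generated bytecodes.
values[i] = values[i] + y;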

Another issue is using shifts rather than multiplying and dividing by powers of 2. Many modern RISC CPUs have eliminated, or nearly eliminated, the difference in execution time between multiply and shift. However, divide remains a slow operation on most, if not all, architectures. So while x <<= 1; generally has little or no advantage over x *= 2;, x >>= 1; is still an improvement over x /= 2;.

If you are running under an interpreter, then the benchmark also should show the general advantage int offers over using byte, short, char, or long.

The casting benchmark

Casting object references to different object types (object casts) can get pretty dense in Java -- especially if you work with a lot of vectors or other container classes. It turns out that any object cast that can't be resolved at compile time (one that can be resolved at compile time is an unnecessary cast) is expensive enough that it is better to save the cast object in a local variable than to repeat the cast. So instead of writing:

public boolean equals (Object obj) {
    if (obj instanceof Point)
        return (((Point)obj).x == this.x && ((Point)obj).y == this.y);
    return false;
}

write it as:

public boolean equals (Object obj) {
    if (obj instanceof Point) {
        Point p = (Point)obj;
        return (p.x == this.x && p.y == this.y);
    }
    return false;
}

Try it out yourself. These two examples are the last two tests in the casting benchmark.

If the cast is to an interface, it's probably at least twice as slow as casting to a class. In fact, there is one type of cast that can take much longer to execute. If you have an object hierarchy like

interface X {}
class Super {}
class Sub extends Super implements X {}

then the following cast, to an interface implemented by a subclass, takes anywhere from two to three times as long as casting to the subclass.

Super cali = new Sub();
X x = (X) cali;

The further the separation between the interface and the subclass (that is, the further back in the interface inheritance chain the cast interface is from the implemented interface), the longer the cast takes to resolve.

Also beware of unnecessary casts and uses of instanceof. The following cast is resolved by the compiler and produces no runtime code for the unnecessary cast.

Point q = new Point();
Point p = (Point) q;

However,

Point p = new Point();
boolean b = (p instanceof Point);

cannot be resolved by the compiler because instanceof must return false if p == null.

Casting data types is simpler and cheaper than casting objects because the type of the data value being cast is known at compile time (an int, for example, is never actually some subclass of int). However, since int is the natural size used by the JVM (int is the common data type that all other data types can be directly converted to and from), beware of using the other data types. Casting from a long to a short requires first casting from the long to an int and then from the int to a short.
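
For example, this narrowing cast (a trivial sketch) compiles to two conversion bytecodes:

long big = 70000L;
// Compiles to l2i (long to int) followed by i2s (int to short);
// there is no single long-to-short bytecode.
short small = (short) big;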

The instantiation benchmark

Instantiating an object is fairly expensive in terms of CPU cycles. On top of that, discarded objects will need to be garbage collected at some point. The efficiency of garbage collection is difficult to measure, and the Benchmark applet makes no attempt to measure it. However, if you run the instantiation benchmark, most of the extra time between the "test" time and the total running time is spent garbage collecting, so this gives some idea of the impact garbage collection can have on the performance of code that instantiates lots of objects. It appears to take about as long to garbage collect an object as it takes to create one.

Also, the longer the hierarchy chain for the object, the more constructors that must be called. This adds to the instantiation overhead. If you add extra layers to your class hierarchy for increased abstraction, you'll also get increased instantiation time.

The best option is to avoid instantiating objects in tight loops. Where possible, reuse an existing object. The loop

for (int i = 0; i < limit; i++) {
    StringBuffer sb = new StringBuffer();
    // do something with sb...
}

would be faster if written as

StringBuffer sb = new StringBuffer();
for (int i = 0; i < limit; i++) {
    sb.setLength(0);
    // do something with sb...
}

One other minor issue with instantiating objects: When an object is instantiated, all of its instance variables automatically are initialized to their default values, which not coincidentally is the value with all bits set to zero for all data types. If the instance variables are explicitly initialized to their default values, this will duplicate the default initialization, generate additional bytecodes, and make the instantiation take longer. (There are rare cases when the explicit initialization is needed to reinitialize a variable that was altered from the default value during the super class's constructor.)
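
For example, in the following sketch (a hypothetical class) the first two initializers merely repeat the default initialization and generate extra bytecodes in the constructor, while the third field relies on the default:

class Counter {
    int count = 0;       // redundant: ints already default to 0
    Object last = null;  // redundant: references already default to null
    int total;           // relies on default initialization -- no extra bytecodes
}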

The exception benchmark

Throwing exceptions and catching them with try/catch blocks is normally an exceptional circumstance and therefore not typically a performance concern for optimization. Perhaps surprisingly, a try {} statement is "free" -- that is to say, no bytecodes are generated for it, and unless an exception is thrown, the only overhead a try {} statement adds to your code is a goto bytecode to skip the catch () {} clause and one or more entries in the method's exception table. (An exception table keeps track of the beginning and ending bytecode offsets for try statements and the offsets of the corresponding catch and finally statements.) A finally statement adds a little more overhead because it is implemented as an inline subroutine. When an exception is thrown, a quick check against the exception table for each method in the call chain determines whether or not the exception is handled by that method. Overall, the try/catch/finally mechanism in the JVM is efficient. (When translated to native code, try/catch/finally can present a challenge to producing optimized code because of issues with register variables, so using these statements may reduce the efficiency of code generated by a JIT compiler.)

As you can see from the exception benchmark, instantiating a new exception is not as streamlined as throwing and catching the exception. An exception generates and stores a snapshot of the stack at the time the exception was instantiated. This is where most of the time for instantiating an exception goes. If you're throwing exceptions regularly as part of the normal operation of your code, you may want to rethink your design. But, if you do have a need to do this (such as returning a result from several calls deep), and if you're simply going to catch and handle the exception, then repeatedly instantiating an exception can be a waste of time. Instead, it is possible to reuse an exception:

MyException myexcept = new MyException();
void doit (int val) throws MyException {
    if (val == 0)
        throw myexcept;
}

(I warned you that optimizing isn't pretty.)

The thread benchmark

I could write an entire article on the difficulties and bugs in various Java implementations that I encountered while writing a thread benchmark. Suffice it to say that just getting threads to behave similarly on all platforms can be a difficult proposition in itself. There are a number of performance issues related to threads, and they tend to be platform-specific.

To start with, each running thread has both a native or "C" stack and a Java stack, so memory can be an issue. There is an overhead for switching between threads (a context switch), which can vary significantly between platforms. The thread benchmark shows the timing of using wait()/notify() to initiate context switches. Using wait()/notify() also requires releasing and acquiring the monitor for the object, so a context switch without the synchronized overhead would be somewhat faster. Synchronized access between threads, as discussed in the method benchmark section, can result in deadlocks and delays, or at a minimum adds the overhead of the monitor to each call.

For example, a game that uses sprite animation could implement the sprites with a separate thread for each sprite. (A sprite is an arbitrarily shaped graphic that moves over the background without destroying it.) If there were many sprites, the overhead just from the context switches could bring the system to its knees, let alone what would happen if several sprites wanted to synchronize on the same object (like the game field) at the same time. A more efficient implementation would be to implement a single sprite manager that would handle all of the sprites in a single thread.
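
A minimal sketch of that single-thread approach (the SpriteManager class and its update step are hypothetical, not taken from the article's code):

class SpriteManager implements Runnable {
    private final java.util.Vector sprites = new java.util.Vector();
 
    void addSprite (Object sprite) {
        sprites.addElement(sprite);
    }
 
    public void run () {
        while (true) {
            // One thread updates every sprite -- no per-sprite context switch.
            for (int i = sprites.size(); --i >= 0; ) {
                // move or animate sprites.elementAt(i) here
            }
            try { Thread.sleep(50); }              // roughly 20 updates per second
            catch (InterruptedException e) { return; }
        }
    }
}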

Timer threads provide another opportunity for optimization. The Benchmark applet uses threads to time its tests. It starts a new thread for each test, which is simple and straightforward. The Benchmark applet would spend a lot less time with the timer if it simply left a single thread running and signaled it when to start the timer. However, following my own advice, I didn't bother optimizing the timer because doing so would unnecessarily complicate the code.

Conclusions

Low-level optimizations in Java can increase the speed of your code. Frequently, just being aware of which operations are time consuming is enough to avoid problems in the first place. If it's necessary to optimize your code, make sure to follow these simple guidelines:

  • Don't optimize unless you know it's necessary

  • If you do need to optimize, start with the algorithm

  • Before you optimize the low-level operation of your code, profile or time the existing code to determine where the time is being spent

  • Profile the code again afterwards to see what effect your changes had

Doug Bell is vice president of development for FTL Games. FTL Games has been producing entertainment software for dozens of different home game consoles and home computers since the early '80s (before it was cool). During his 15-plus years in the entertainment software business, he has led the design and development of many top-selling commercial games, including the groundbreaking DungeonMaster RPG series. He has been working full-time on programming in Java for the last year and is a frequent contributor to the technical discussions in the comp.lang.java.* newsgroups. He is a co-founder of the San Diego County Java SIG user's group. He can be reached at dbell@shvn.com, and his neglected home page can be found at http://www.2nu.com/Doug/index.html.
