Starting line: Warming up the engines
Lord knows, interpreters are slow. Pre-compiling Java source code into portable bytecodes saves the time needed for translating syntax, but interpreting bytecodes in the Java virtual machine (JVM) is still exceedingly slow compared to the native code produced by a compiler. Thus, Java performance is generally deemed acceptable for small applets but not for any sizable application.
A just-in-time compiler (JIT) runs many times faster than an interpreter. It makes a drastic, noticeable difference when you're running even an interactive application. Although a JIT falls short of compiled-code speeds, it greatly extends Java's applicability. Even so, a JIT-compiled application remains limited by compute-intensive work. As long as the app mostly interacts with the user or does I/O, it's fine. But if it starts doing a lot of graphics processing or extensive computing, performance rapidly drops to unacceptable levels.
Sun's new HotSpot technology is a dynamic compiler -- a compiler built into the virtual machine that promises to run as fast as, or faster than, compiled code in the majority of applications. HotSpot promises to extend Java's applicability into a wide variety of areas, from servers to mainstream desktop applications -- without sacrificing portability. According to Sun's Java roadmap, a HotSpot JVM developer's release was scheduled to be available early in the first quarter of this year, with a deployable HotSpot JVM implementation to ship this summer.
This article explores the inner workings of Sun's dynamic compiler, HotSpot, and details the kinds of applications in which HotSpot is -- and is not -- more effective than the fastest JIT. (For more information on compilers and interpreters, see the sidebar How compilers and interpreters work. To find out how a JIT improves performance, see the sidebar How a JIT works.)
They're off!: Burning rubber with dynamic compilation
HotSpot is a dynamic compiler. It combines the best features of a JIT compiler and an interpreter in a single package. That two-in-one combination provides a number of important benefits for code performance, as you will shortly see.
For all of its apparent newness, the concept of dynamic compilation is based on research done over the past 10 years at Stanford University and the University of California, Santa Barbara (UCSB). So the technology is actually fairly stable, but it's HotSpot that has propelled dynamic compilation into the limelight. Beyond dynamic compilation, HotSpot also addresses two other major time-sinks: garbage collection and synchronization.
The cost of computing: garbage collection and synchronization
Unfortunately, in the majority of applications, optimizing Java code is not all there is to improving performance. Some of the same features that help Java developers increase productivity and write more versatile, bug-free applications also consume mass quantities of runtime compute cycles. The major cycle-stealing features are garbage collection and threads. As the pie chart below shows, together these two features take up about 40 percent of the average application's run time.
Java's built-in garbage collection eases the programming burden tremendously. It all but eliminates those "memory leaks" that used to freeze up a system when a program kept allocating memory without releasing it, until there was no longer any room for the system to do anything. But garbage collection takes time. Optimizing the application's bytecodes with a JIT has no effect at all on the virtual machine's garbage collector. It still has to spend the time to do its thing. In fact, some applications allocate and release memory so frequently that the time spent in garbage collection dwarfs the time spent in the application itself. In such cases, optimized application code is of little value.
The other major drain on system resources is thread synchronization. Java's threads (concurrently executing code segments) provide a lot of flexibility. For example, threads allow buffering input to improve performance, and let developers write a server app that services multiple clients simultaneously. But synchronization takes time, a surprising amount of time -- in fact, nearly 20 percent of a normal application's CPU time. Furthermore, that 20 percent is time that can't be optimized by compiling the application's code.
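The cost shows up even in simple code. In this minimal sketch (the class and its names are invented for illustration), every call to a synchronized method must acquire and release the object's monitor, whether or not any other thread is actually contending for it:

```java
// Illustrative sketch: a thread-safe counter. Each call to increment()
// or get() executes monitor-enter/monitor-exit around the method body,
// paying the synchronization cost on every single invocation.
public class SyncCounter {
    private int count = 0;

    public synchronized void increment() {
        count++;
    }

    public synchronized int get() {
        return count;
    }

    public static void main(String[] args) throws InterruptedException {
        final SyncCounter c = new SyncCounter();
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) c.increment();
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // Without synchronization, lost updates would make this total
        // unpredictable; with it, we pay monitor overhead 200,000 times.
        System.out.println(c.get());
    }
}
```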
In addition to dynamic compilation, HotSpot uses an improved method of garbage collection called generational garbage collection. This technique has seen a lot of academic work, but is now finding a major application in the Java language. The idea behind generational garbage collection is that memory tends to be allocated, and then discarded, in chronological groups.
Dividing the heap into different "generations" of allocated storage has two important benefits. First, it allows the garbage collector to work incrementally. Instead of traversing all of memory at one time -- which can bring the application to a screeching halt -- the generational collector can traverse smaller groups of allocations. So it runs frequently, doing a little at a time, and therefore has a much less noticeable effect on performance. This is especially important for interactive applications, where users prefer predictable performance to erratic behavior. Second, working on smaller groups streamlines garbage collection so that each pass completes more quickly.
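The allocation pattern that generational collection exploits shows up in perfectly ordinary code. In this illustrative sketch, the String temporaries created inside the loop become garbage almost as soon as they are made, while the buffer survives the whole computation -- exactly the young-objects-die-young behavior a generational collector is tuned for:

```java
// Illustrative sketch of generational allocation behavior.
public class Generations {
    public static String buildReport(int n) {
        StringBuffer report = new StringBuffer(); // long-lived object
        for (int i = 0; i < n; i++) {
            // Short-lived: this String is dead the moment it is appended,
            // so it can be reclaimed cheaply from the "young" generation.
            String line = "item " + i;
            report.append(line).append('\n');
        }
        return report.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildReport(3));
    }
}
```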
HotSpot also improves the synchronization process. In many cases, what used to be multiple lines of synchronization code has been reduced to a single instruction. So using multiple threads in an application has virtually no penalty.
Together, improved garbage collection and synchronization can make a drastic difference in an application's runtime -- even without optimizing the application code. But this article is about how dynamic compilation works, so let's get on with that.
How the dynamic compiler works
HotSpot includes both a dynamic compiler and a virtual machine to interpret bytecodes, as shown in the following figure.
When bytecodes are first loaded, they are run through the interpreter. The profiler keeps a record of runtimes for each method. When a method is found to be taking a lot of time, HotSpot compiles and optimizes it. Every future call to that method uses the native machine instructions produced by the compiler.
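The counting-and-thresholding behavior described above might be sketched roughly as follows. The class, method names, and the threshold value are all invented for illustration; HotSpot's actual profiling heuristics are proprietary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of counter-based hot-method detection.
// The threshold is an invented value, not HotSpot's.
public class Profiler {
    private static final int HOT_THRESHOLD = 10_000;
    private final Map<String, Integer> counts = new HashMap<>();

    // Called by the interpreter each time a method is entered.
    // Returns true exactly once, when the method crosses the
    // threshold and should be handed to the optimizing compiler.
    public boolean recordInvocation(String method) {
        int n = counts.merge(method, 1, Integer::sum);
        return n == HOT_THRESHOLD;
    }
}
```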
As with a JIT, the results of the compilation are not kept between runs. Because the bytecodes are more compact, this saves loading time as well as storage space. It also retains portability, and allows the optimization to reflect the way the program is currently being used. No attempt is made to save information that might help future runs become efficient more quickly; although that is a possibility, Sun has no plans to pursue it at this time.
Profiling and compiling heuristics
The most fascinating aspect of the whole system (to anyone interested in the subject of machine reasoning, at least) is how the profiler decides which methods to optimize. At present, the system uses a very minimal heuristic that does not involve any sort of artificial intelligence. Although the exact details are proprietary, it's reasonable to suppose that the system predicts the time it will take to optimize a given method. Once the time spent in that method reaches a predefined threshold set at, say, 70 percent of the predicted optimization time, it becomes a candidate for optimization. (The exact threshold is undoubtedly the result of a considerable amount of research.)
Fortunately, say the folks at Sun, it's unnecessary for such a system to be perfect. As long as it's right most of the time, it can make an astonishing difference in how a program runs.
And, who knows? There is always the possibility that some day more advanced artificial intelligence heuristics can be used. There are no firm plans to pursue this possibility -- the company expects to have its hands full refining its current implementation over the next year or two -- but it's an interesting possibility to consider.
The advantages of dynamic compilation
The beauty of integrating a compiler with an interpreter is that it can do things a normal static compiler simply can't do. In particular, it can perform optimistic compilation and take advantage of runtime information to do robust inlining.
Optimistic compilation: Because the interpreter is always available to run the bytecodes, HotSpot's dynamic compiler can assume that exceptions and other hard-to-optimize cases don't happen. If they do, HotSpot can revert to interpreting the bytecodes. For a static compiler, optimizing in a way that handles all the unusual cases is a very difficult and time-consuming process. By ignoring those cases, HotSpot can rapidly produce highly optimized code for the code that is most likely to run. That produces a lot of bang for the buck -- significant performance improvement for a small investment in optimization time.
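One way to picture this (a hypothetical illustration, not HotSpot's actual mechanism): in a method like the one below, the normal arithmetic path can be compiled aggressively on the assumption that the exception never fires, while the rare exception path is left for the interpreter to handle if it ever does.

```java
// Illustrative sketch: a method with a hot common path and a
// rare exceptional path.
public class Optimistic {
    static int safeDivide(int a, int b) {
        try {
            return a / b;   // common path: easy to optimize aggressively
        } catch (ArithmeticException e) {
            return 0;       // rare path: can be left to the interpreter
        }
    }
}
```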
Runtime information: The second major advantage of dynamic compilation is the ability to take into account information that is known only at runtime. Again, the details are proprietary. But it's not hard to imagine, for example, that if a method is invoked in a loop that has an upper index value of 10,000, it's an immediate candidate for optimization. If, on the other hand, the upper loop index value is 1, the optimizer will know to ignore that method and let it be interpreted. Because a static compiler has no way of knowing for sure what the values might be, it can't make such judgments.
One of the important optimizations HotSpot performs is inlining frequently-called methods. That means that instead of being invoked with the overhead of a method call, the method's code is copied into the calling routine and executed as though it had been coded there. The smaller the method, the more sense inlining makes. A one-line method might spend more than twice as much time entering and exiting the routine as it does executing.
Inlining provides a tremendous boost in performance, but it has a size penalty. If every method in a program were copied to every place it was called from, the program would balloon to five or ten times its original size. But HotSpot optimizes only the critical sections of a program. The 80/20 rule says that 80 percent of a program's runtime is spent in 20 percent of the code. By inlining as much as possible of that 20 percent, HotSpot can achieve a significant boost in performance with an acceptable size penalty. Here, runtime information makes a big difference, because HotSpot knows where to find that critical 20 percent.
As with all good things, of course, there are tradeoffs. If a method is very large, the time spent entering and exiting it is only a small fraction of its execution time. In such a case, the space required to duplicate the method may not compare favorably to the time saved by inlining. So HotSpot undoubtedly has an upper limit on how large an inlined method can be. (The exact details, however, are proprietary.)
Finally, because HotSpot does optimistic compilation, it can inline only the normally-executing methods, instead of every possible method. So it can ignore difficult cases that rarely occur instead of working hard to resolve them. That way, you get the maximum performance payoff for the minimum optimization effort.
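To see what inlining buys, compare a hypothetical call-heavy loop with its hand-inlined equivalent. This is the transformation the optimizer performs automatically; the names here are invented for illustration.

```java
// Illustrative sketch of inlining a one-line method.
public class Inlining {
    // Tiny method: the call/return overhead can rival the work done.
    static int square(int x) {
        return x * x;
    }

    // Before inlining: a method call on every iteration.
    static int sumOfSquaresCalling(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += square(i);
        }
        return sum;
    }

    // After inlining: the method body is copied into the call site,
    // so the loop runs with no call overhead at all.
    static int sumOfSquaresInlined(int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i * i;
        }
        return sum;
    }
}
```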
When dynamic compilation succeeds -- and when it doesn't
In the large majority of cases, HotSpot provides an extreme performance advantage over JIT technology, not to mention normal JVM interpretation. But, as with any performance-enhancing technology, the HotSpot dynamic compiler fails to achieve optimum performance in certain cases. For example, the profiler might decide to optimize a piece of code just as it finishes executing for the last time. In a pathological case, the optimizer could conceivably optimize every method only after its final execution, which obviously wouldn't help at all. Fortunately, such cases are rare.
HotSpot would perform badly in a 100 percent compute-bound application with little or no garbage collection, no thread synchronization, and very small amounts of initialization code. An example of such an application might be one that multiplied the first 1,000 prime numbers together and displayed the result. For such a program, a good JIT would provide superior performance because the JIT doesn't wait before optimizing -- it optimizes everything preemptively.
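Such a compute-bound program might look like the following sketch. BigInteger is used because the product of even a few dozen primes overflows every primitive type; nearly all of the program's time goes to raw arithmetic that a JIT would have compiled from the very first iteration.

```java
import java.math.BigInteger;

// Illustrative sketch of a purely compute-bound program: no I/O beyond
// the final print, no thread synchronization, minimal initialization.
public class PrimeProduct {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Multiplies the first `count` primes together.
    static BigInteger productOfPrimes(int count) {
        BigInteger product = BigInteger.ONE;
        int found = 0;
        for (int n = 2; found < count; n++) {
            if (isPrime(n)) {
                product = product.multiply(BigInteger.valueOf(n));
                found++;
            }
        }
        return product;
    }

    public static void main(String[] args) {
        // First 5 primes: 2 * 3 * 5 * 7 * 11 = 2310
        System.out.println(productOfPrimes(5));
    }
}
```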