JVM performance optimization, Part 3: Garbage collection

Choose the right garbage collector for your application needs

1 2 3 Page 2
Page 2 of 3

The cost of most parallel collection -- and do consider this, especially for production environments -- is that application threads cannot do any work during a GC, just like with copying collectors. Using a parallel collector that implements stop-the-world collection will have a major impact on response-time sensitive applications, especially if you have a lot of references to trace, which will happen with many live or complex data structures on the heap. (Remember that for mark-and-sweep collectors the time to free up new memory is dependent on the time it takes to trace the live data set plus the time to traverse the heap during the sweep phase.) For a monolithic parallel approach using all resources in parallel, this entire time will be a pause, and that pause corresponds to the entire GC cycle.

Concurrent collectors

A concurrent collector is a much better fit for applications that are sensitive to response time. Concurrent means that some (or most) garbage collection work is performed concurrently with the running application threads. As not all resources are used for GC, you will have the challenge of deciding when to start a garbage collection in order to allow enough time for the cycle to end. You need enough time to trace the live set and reclaim the memory before the application runs out of memory. If the garbage collection doesn't complete in time the application will throw an out-of-memory error. You don't want to do garbage collection all the time because that would consume application resources, thus impacting throughput. It can be extra tricky to keep that balance in very dynamic environments, so heuristics have been designed to determine when to start garbage collection and when to do various GC optimizing tasks and how much at a time, etc.

Another challenge is to determine when it is safe to perform operations that require a complete and true snapshot of the world -- for instance, you need to know when all live objects have been marked, and thus when to switch to the sweep phase. In the monolithic stop-the-world scenario employed by most parallel collectors, this phase-switching is less tricky because the world is already standing still. But in concurrent implementations it might not be safe to switch phases immediately. For instance, if an application has modified an area that has already been traced and marked by the collector, new or unmarked references may have been touched, which would make them live. In some implementations this situation will put your application at risk for long-time running re-marking loops, potentially making it hard for your application to get new free memory when it needs it.

The takeaway from this discussion so far is that you have numerous options among garbage collectors and GC algorithms, some better suited than others to specific application types and workloads. Not only are there different algorithms, but there are actually different implementations of the various algorithms. So it is wise to be informed about your application's allocation needs and characteristics before simply specifying a garbage collector on the command line. In the next section we'll look at some of the pitfalls of the Java platform memory model -- and by pitfalls I mean places where Java developers tend to make assumptions that lead to worse performance for dynamic production loads, not better.

Why tuning doesn't replace garbage collection

Most Java developers know that there are choices to be made if you want to maximize Java performance. The current variety of JVMs, garbage collectors, and the overwhelming selection of tuning options can lead developers to spend a lot of deployment time on the never-ending task of performance tuning. This has led some to conclude that GC is bad and that tuning so that GC happens infrequently or for just a short time is a successful workaround. But there are risks to doing GC this way.

Consider what it means to tune against specific application needs. Most tuning parameters -- such as allocation rate, object sizes, timing of response-time sensitive tasks, and how fast objects die -- are tuned specifically for the application's allocation rate, such as the test workload at hand. The end result could be either (or both) of these:

  1. Things that worked during testing fail in production.
  2. Workload or application changes require you to re-tune the application entirely.

Tuning is something that will always need to be repeated! Concurrent garbage collectors in particular can require a lot of tuning -- especially in production environments. You need the heuristics to match your specific application's needs for the expected worst-case load. The end result becomes a very rigid configuration, leading to a lot of resource waste. This approach to tuning (tuning to eliminate GC) is kind of a Don Quixote quest -- a chase after an imagined enemy with sure-to-lose cause. The fact is that the more you tune your collectors to fit a specific load, the farther you'll be from the dynamic features of a Java runtime. After all, how many applications really have a static load? And how reliably can you really predict for an expected load?

So, what if you didn't focus on tuning? What could you be doing differently to prevent out-of-memory errors and improve response times? The first step toward finding out is to identify the real challenge to Java application performance.

Fragmentation

The real Java performance challenge is not the garbage collector itself; it is fragmentation, and how the garbage collector deals with it. Fragmentation is a state of the heap where free memory is available but not in a big enough consecutive memory space to host a new object about to be allocated. As I mentioned in Part 1, a fragment is either a residual space in the Java heap's TLABs, or (more often) a space previously occupied by small objects between longer living objects.

Over time, as an application runs, these fragments of unusable memory will appear all over the heap. In some cases the state will get worse with statically tuned options (like promotion rates, free lists, etc.) that can't keep up with the needs of a dynamic application. Such left-over (fragmented) space can't be used efficiently by an application. If you don't do anything about it you'll eventually end up with back-to-back garbage collections, the garbage collector's attempt to free up memory for a new object allocation request. In the worst-case scenario, even a number of consecutive GCs will not free up memory (too many fragments) and the JVM will be forced to throw an out-of-memory error. You can deal with fragmentation by restarting the application, which makes your heap a new space of consecutive memory mapped to your process, such that you'll be able to allocate objects freely again. Restarting leads to downtime, however, and besides the heap would eventually get fragmented again and you would need to do another restart.

OutOfMemoryErrors, hung processes, and logs showing that your garbage collector is working overtime are all symptoms of a GC struggling to free up memory, and are indicators of a fragmented heap. While some developers seek to resolve fragmentation by simply re-tuning their garbage collector, I argue that we should explore more innovative solutions to the problem of fragmentation before it becomes an issue for GC. The remainder of this article will focus on the two most effective approaches to managing fragmentation: generational garbage collection and compaction.

Generational garbage collection

You may have heard the theory that in production environments most objects die young. Generational garbage collection is an approach to GC that springs from this theory. In a generational garbage collection you basically divide the heap into different spaces (or generations), each covering different ages of objects, where age could mean the number of GC cycles that the object has survived (that is, how many times it has remained referenced).

When the space for new allocations, also known as young space, nursery, or young generation, is full, objects still referenced within that space are moved into the next generation. Most commonly there are only two generations. Usually generational garbage collectors are one-directional copying collectors, which I discussed earlier. Some more-recent implementations of young generation collectors are parallel collectors. It is also possible to implement different collecting algorithms for young space and old space. If you are using a parallel and/or copying collector then your young generation collectors will be a stop-the-world collector (see earlier explanation).

Old space or old generation hosts the promoted objects that stay referenced for a certain time or certain numbers of young space collections. On occasion, really large objects might be allocated directly into old space, as large objects are more costly to move.

The mechanics of generational GC

In a generational approach, garbage collection happens less frequently in old space and more frequent in young space, where the GC cycle is hopefully shorter and less intrusive. Having a young space can, on rare occasions, lead to more frequent collections in old space, however. Typically, this would happen if you had tuned the nursery size to be too large compared to your current heap size, and if your application's objects tended to survive for a long time (or if you have tuned your promotion rate "incorrectly"). In that case the old generation space would become too small to host all of your long-living objects and the old generation GC would struggle freeing up space to make room for promoted objects. In general, though, by having a generational garbage collector you will gain application performance and latency consistency.

A nice side-effect of having a young generation is that fragmentation is somewhat addressed. Or rather, worst case scenario is postponed. The young, small objects that otherwise might cause fragments are cleaned out of the way. The old space also becomes more compact, because the long-living objects are more tightly allocated as they are moved to the older generation. Over time -- if you run long enough -- the old space will still become fragmented, however. In this case, you will end up with one or several full stop-the-world collections, potentially forcing the JVM to throw an OutOfMemoryError or an error indicating failure to promote. Having a nursery postpones that worst-case scenario, however, and for some applications that is good enough. (In some case it will postpone the full-stop scenario to a point in the application's lifecycle where it doesn't matter anymore.) For most applications having a nursery functions as stop-gap, but it does decrease the frequency of stop-the-world GC and OutOfMemoryError exceptions.

Tuning generational GC

As stated above, with the use of generations comes the responsibility and repetitive work of tuning young generation size and promotion rate for every new version of your application and for every load change. I can't stress enough the trade-off of specializing your runtime: by choosing a fixed number that optimizes for a specific load, you reduce your garbage collector's ability to respond dynamically to change -- and change is inevitable.

My rule-of-thumb for tuning nursery size is that it should be as large as you can make it while ensuring the latency of the young space during stop-the-world collections. (Assuming that the young space is configured to use a parallel collector that is!) Keep in mind, too, that you need to leave enough old space in the heap for long-lived objects -- with margin! Here are some additional factors to consider when tuning a generational garbage collector:

  1. Most implementations of young space collection will ultimately result in a stop-the-world collection, and the larger the nursery, the longer the associated pause-time will be. So for applications that will be heavily impacted by that GC pause, you should carefully consider how large you want your young space to be.
  2. You can mix GC algorithms for your heap generations. You could potentially make the young generation GC parallel, while using a concurrent algorithm for old space collection.
  3. When you see frequent promotion failures it is usually a sign that your old space is fragmented. Promotion failure means that there isn't a large enough space in old space to fit a surviving object from young space. If this happens, consider tweaking the promotion rate (the age tuning option) or make sure your old-space GC algorithm is a compacting one (discussed in the next section) and tune the compaction in a way that fits your current application load. You may also increase the heap size and generation sizes, but doing so could impact the pause times of old space collections further -- remember that fragmentation is inevitable!
  4. A generational collector work best for any Java application that has many short-lived small objects that will die within their first collection cycle. Generational GC works well to reduce fragmentation in this scenario, mainly by postponing it to a point in the application lifecycle where it may no longer be an issue.

Compaction

Although generational garbage collection postpones the worst-case scenario of fragmentation and OutOfMemoryError, compaction is the only real way of handling fragmentation. Compaction refers to a GC strategy of moving objects together to free up larger consecutive chunks of memory. Compaction thus creates large enough free memory to host new objects.

Moving objects and updating their references is a stop-the-world operation, which comes at a cost to Java performance. (This is true in all cases but one, which I'll discuss in the next article in this series.) The more popular (meaning referenced) an object is the longer the pause time will be. In a case where very little memory is left in the heap and the fragmentation situation is severe -- which usually becomes the case the longer an application runs -- compacting a popular area of the heap may mean seconds of pause time. Compacting the entire heap, which you would do if your system is close to running out of memory, could take tens of seconds.

1 2 3 Page 2
Page 2 of 3