Tackle Java server capacity problems

Improve the capacity of your Java server application through load testing and analysis

Engineers and their managers are familiar with organizing a set of concrete tasks and driving them to completion. Simple performance problems, which can be isolated by a single developer on a personal machine, are straightforward to manage and remedy. However, large capacity problems, occurring when the system is under load, are common, and handling them requires a completely different approach. These problems require an isolated test environment, a simulated load, and careful analysis and tracking of changes.

In this article, I create a test environment using some easily obtainable tools and equipment. I then walk through the analysis of two capacity problems, focusing on memory and synchronization issues, which can be difficult to expose using a simple profiler. By walking through a concrete example, I hope to make tackling complex capacity problems less daunting and provide insight into the general process.

Improving server capacity

Server capacity improvements are inherently data driven. Making any application or environment changes without reliable data will generally yield poor results. Profilers provide valuable information about Java server applications, but they are frequently inaccurate because data derived from a single application user may look entirely different from the data derived from dozens or even hundreds of application users. Utilizing profilers to optimize application performance during development is a good place to start, but augmenting this common approach by analyzing the application under load yields far better overall results.

Analyzing a server application under load requires a few basic elements:

  1. A controlled environment to load-test the application
  2. A controlled synthetic load to drive the application to full capacity
  3. Data collection from monitors, applications, and the load-testing software itself
  4. The tracking of capacity changes

Do not underestimate this last requirement: without tracking capacity, you have no way of actually managing the project. A 10 or 20 percent gain in capacity makes no noticeable difference when only a single person is using the application, so the improvement is not necessarily obvious to everyone supporting the project. Yet a 20 percent improvement is significant, and by tracking the capacity improvements, you can provide important feedback and keep the project on track.

As important as tracking capacity is, it is sometimes necessary to invalidate previous test results in order to make future results more accurate. Over the course of a capacity project, improving the load test's accuracy may require changes to the simulation and environment. These changes are necessary, and by holding the application constant and load testing before and after each change, you can carefully record the transition.

A controlled environment

A controlled environment, at a minimum, requires three dedicated machines: one generates the load; a second, the controller, communicates with the first to set up the test scenario and receive feedback on the test; and the third runs your application. In addition, the network between the load and application machines should be isolated from the rest of the LAN. The controller receives feedback from the loaded application machines about OS metrics, hardware utilization, and application metrics, especially, in this case, the JVM's.

Load simulation

The most accurate simulations are constructed using actual user datasets and, in the case of Web servers, access logs. If you either have not deployed yet or lack access to actual user data, then you can do well enough by constructing likely scenarios, querying sales and product management teams for specifics, and making a few educated guesses. Reconciling the discrepancies between the load test and actual user experience is an ongoing process.

Several user scenarios are generally necessary in simulation. For instance, in a common address book application, you have separate scenarios for users updating the address book and those querying it. In the simple GrinderServlet class that serves as my test application, I have only one scenario. A single user accesses the servlet 10 times in succession, pausing briefly between each request. Though this application is trivial, I wanted to replicate a couple of common attributes. Users do not make requests of a server continuously without pause. Without allowing for brief pauses, I would have an inaccurate understanding of the number of active users supportable.

The other reason for stringing 10 requests together is that a real application is unlikely to consist of one HTTP request. Single, separate requests could affect numerous elements in the environment. Specifically, Tomcat may create separate sessions for each request, and the HTTP protocol allows separate requests to reuse connections. I am shaping my load test somewhat to avoid confounding artifacts.
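The shape of that scenario can be sketched in plain Java. VirtualUser, the Runnable request hook, and the fixed pause are illustrative assumptions of mine; in a real test, the load tool issues the HTTP requests:

```java
// Sketch of one simulated user: 10 requests in succession with a short
// "think time" pause between each. VirtualUser and the Runnable request
// hook are illustrative names, not part of OpenSTA or any other tool.
public class VirtualUser {
    private final Runnable request;   // issues one HTTP request in a real test
    private final long thinkTimeMs;   // pause between requests
    private int requestsMade = 0;

    public VirtualUser(Runnable request, long thinkTimeMs) {
        this.request = request;
        this.thinkTimeMs = thinkTimeMs;
    }

    // Run the whole scenario: 10 requests with pauses in between.
    public void runScenario() {
        for (int i = 0; i < 10; ++i) {
            request.run();
            ++requestsMade;
            if (i < 9) {
                try {
                    Thread.sleep(thinkTimeMs);  // simulated think time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }

    public int getRequestsMade() { return requestsMade; }
}
```

Without the pauses, each virtual user would hammer the server continuously, and the count of "active users" supported would be meaningless.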

The GrinderServlet does not operate on data of any sort, but data access is frequently at the core of real applications. For such applications, when composing a load test, you will need to create a simulated dataset and then construct usage scenarios parameterized with the simulated data.

For example, if your scenario involves a user logging into a Web application, selecting a user at random from a list of possible users is more accurate than using only one. Otherwise, you may mistakenly invoke caching systems, other optimizations, or some subtle and unlikely element of your application that completely distorts your results.
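A minimal sketch of that parameterization, assuming a simple list of simulated accounts (UserPicker and the seeded Random are my own illustrative choices, not part of any load tool):

```java
import java.util.List;
import java.util.Random;

// Sketch of parameterizing a login scenario with simulated user data:
// each virtual user logs in as a randomly chosen account, so caches and
// other per-user optimizations are exercised realistically.
public class UserPicker {
    private final List<String> users;
    private final Random rng;

    public UserPicker(List<String> users, long seed) {
        this.users = users;
        this.rng = new Random(seed);  // seeded for repeatable test runs
    }

    // Select a random account from the simulated dataset.
    public String pickUser() {
        return users.get(rng.nextInt(users.size()));
    }
}
```

Seeding the random generator keeps successive test runs comparable, which matters when you are tracking capacity changes across runs.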

Load-testing software

Load-testing software allows you to construct scenarios and drive a load against your test server. OpenSTA is the load-testing software I use in the following examples. It is fairly simple and quick to learn, archives data that is easy to export, supports scripts parameterized with user data, and monitors a variety of information. Its main drawback is that it is Windows based, but that was not a problem for my environment. Many other solutions are available, for example, Apache's JMeter and Mercury's LoadRunner. All three of these solutions can spread load generation across a cluster of servers and collect the results on a central control server. Your tests will be more accurate if you use separate, dedicated servers to generate the load and ensure they do not exhaust their hardware resources.

The GrinderServlet

The GrinderServlet class, shown in Listing 1, and the Grinder class, shown in Listing 2, make up my test application.

Listing 1

package pub.capart;

import java.io.*;
import java.util.*;
import javax.servlet.*;
import javax.servlet.http.*;

public class GrinderServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse res)
            throws ServletException, IOException {
        Grinderv1 grinder = Grinderv1.getGrinder();
        long t1 = System.currentTimeMillis();
        grinder.grindCPU(13);
        long t2 = System.currentTimeMillis();

        PrintWriter pw = res.getWriter();
        pw.print("<html>\n<body>\n");
        pw.print("Grind Time = " + (t2 - t1));
        pw.print("\n</body>\n</html>\n");
    }
}

Listing 2

package pub.capart;

/**
 * This is a simple class designed to simulate an application consuming
 * CPU, memory, and contending for a synchronization lock.
 */
public class Grinderv1 {
    private static Grinderv1 singleton = new Grinderv1();
    private static final String randstr =
        "this is just a random string that I'm going to add up many many times";

    public static Grinderv1 getGrinder() {
        return singleton;
    }

    public synchronized void grindCPU(int level) {
        StringBuffer sb = new StringBuffer();
        String s = randstr;
        for (int i = 0; i < level; ++i) {
            sb.append(s);
            s = getReverse(sb.toString());
        }
    }

    public String getReverse(String s) {
        StringBuffer sb = new StringBuffer(s);
        sb = sb.reverse();
        return sb.toString();
    }
}

These listings are brief, but interesting to study because they reproduce two common problems. The most glaring is probably the bottleneck caused by the synchronized modifier on the grindCPU() method, but the memory consumption will actually prove to be an even worse problem. The results of my first load test, displayed in Figure 1, show a modest load gently ramped up against version one of the GrinderServlet. Ramping up the load is important, because otherwise you are simulating a vastly larger initial load. It is also more accurate to "warm up" your application and avoid artifacts such as JSP (JavaServer Pages) compilation. I generally run a single simulated user through the application before beginning the load test.

Figure 1

I use the same capacity summary plot throughout this article. Much more information is available when performing a load test, but this provides a useful summary. The top panel contains throughput, the number of completed requests per second, and request-duration information from the load-testing software. Throughput most accurately quantifies capacity. The second panel contains the number of active users and a failure rate. I consider timeouts, bad server responses, and any requests taking more than five seconds to be failures. The third panel contains JVM memory statistics and CPU utilization. The CPU is an average of user time across all processors. All machines used in my load testing have two processors. The memory statistics contain a graph of garbage collections and the rate of garbage collections per second.
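The numbers in the top two panels can be derived from raw request records roughly as follows. The five-second threshold matches the failure definition above, but RequestRecord itself is my assumed structure, not the load tool's actual output format:

```java
import java.util.List;

// Sketch of deriving the summary-plot metrics: throughput (completed
// requests per second) and the failure rate, where a timeout, a bad
// server response, or any request over five seconds counts as a failure.
public class CapacitySummary {
    public static class RequestRecord {
        final long durationMs;
        final boolean errored;  // timeout or bad server response
        public RequestRecord(long durationMs, boolean errored) {
            this.durationMs = durationMs;
            this.errored = errored;
        }
    }

    private static final long FAILURE_THRESHOLD_MS = 5000;

    // Completed requests divided by elapsed test time.
    public static double throughput(int completedRequests, double testSeconds) {
        return completedRequests / testSeconds;
    }

    // Fraction of requests that errored or exceeded the threshold.
    public static double failureRate(List<RequestRecord> records) {
        long failures = records.stream()
            .filter(r -> r.errored || r.durationMs > FAILURE_THRESHOLD_MS)
            .count();
        return (double) failures / records.size();
    }
}
```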

The two most obvious features of Figure 1 are the 50 percent CPU utilization (this test was run on a dual-CPU machine) and the enormous amount of memory being consumed and immediately released. The reasons for both should be readily apparent after examining Listing 2. The synchronized modifier serializes all processing, restricting the work to a single CPU. The algorithm itself consumes enormous amounts of memory in local variables.

CPU is frequently the limiting resource, and it is tempting to assume in this test that if I can utilize both processors without adding extra overhead, I will double capacity. The garbage collector is so active that it is impossible to see individual collections. The memory deallocated per second holds at roughly 100 megabytes for the majority of the load test, and this will turn out to be the limiting factor. The number of failures is also striking and may actually render the application completely unusable.

Monitoring

After generating a reasonable user load, monitoring tools are necessary to gain visibility into the running processes. My ideal monitoring environment gives me access to a tremendous variety of information:

  1. Hardware utilization of all computers, network devices, and so forth
  2. JVM statistics
  3. Individual Java method timings
  4. Database performance information, including everything from SQL queries to general metrics and statistics
  5. Metrics from all other applications involved

All monitoring affects the load test; the effect can be ignored only if it is small. Fundamentally, if I were to retrieve all the information above at once, I would cripple the system I am trying to test. However, retrieving all of it is possible without invalidating the load test if it is retrieved cautiously and not all at once. Setting up timers for only specific methods, retrieving only low-overhead metrics from the computer hardware, and sampling data at a fairly low rate are a few of the precautions frequently employed. It is best to benchmark your test without monitors (except on the load-test servers themselves, which always require monitoring) and then compare against a benchmark with monitors in place. Invasive monitoring is sometimes worthwhile for diagnosis, but the load-test results gathered while it runs cannot be reported as valid.
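One way to keep method timing cheap enough for a loaded server is to accumulate in-memory counters for just a few chosen methods and let the monitor sample them at a low rate. MethodTimer below is my own minimal sketch, not a real monitoring product:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Sketch of low-overhead timing for specific methods: counters are
// updated at method exit and polled periodically by the monitor,
// rather than logging every invocation.
public class MethodTimer {
    private static final ConcurrentHashMap<String, LongAdder> totalNanos =
        new ConcurrentHashMap<>();
    private static final ConcurrentHashMap<String, LongAdder> calls =
        new ConcurrentHashMap<>();

    // Record one timed invocation of the named method.
    public static void record(String method, long nanos) {
        totalNanos.computeIfAbsent(method, k -> new LongAdder()).add(nanos);
        calls.computeIfAbsent(method, k -> new LongAdder()).increment();
    }

    // Average time per call in nanoseconds; zero if never recorded.
    public static long averageNanos(String method) {
        LongAdder t = totalNanos.get(method);
        LongAdder c = calls.get(method);
        return (t == null || c == null || c.sum() == 0) ? 0 : t.sum() / c.sum();
    }
}
```

A handful of hot methods can then be wrapped by hand: take System.nanoTime() before and after the call and pass the difference to record().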

Archiving all monitoring data to a central controller affords the best analysis, but using dynamic runtime utilities can also provide useful information. For example, command line utilities such as ps, top, and vmstat provide information about Unix machines; and the perfmon utility provides information about Windows machines. Tools such as TeamQuest, BMC Patrol, SGI's Performance Co-Pilot, and ISM's PerfMan install agents on all computers in the test environment and push back the desired information to a central controller, generally providing archival and visualization. For this article, I use an open source version of Performance Co-Pilot for my basic hardware statistics. I found it to have minimal impact on the test, and it provided data in a straightforward fashion.

Java profilers provide a wealth of information, but generally they are too invasive to be useful when load testing. Tools are available that instrument just a few methods in your code, allowing you to do some analysis even on a loaded server, but it is still easy to invalidate the test. During these tests, I enabled verbose garbage collection to collect memory information. I also used both the jconsole and jstack utilities, included in J2SE 1.5, to examine the running VM under heavy load. I did not keep the load test's results in these cases because I felt the data was compromised.

Synchronization bottleneck

Thread dumps are a valuable resource when diagnosing server issues, especially synchronization problems. The jstack utility connects to a running process and dumps each thread's stack trace. Previously, VMs on Unix machines would dump thread stacks to standard out when signal 3 (SIGQUIT) was received, and VMs on Windows would respond similarly when Ctrl-Break was pressed in the console window. In this first test, jstack indicated that many threads were blocked on the grindCPU() method.
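When an external tool cannot be attached, a thread dump can also be taken from inside the VM with the standard Thread.getAllStackTraces() API. The ThreadDumper wrapper below is my own sketch of the idea, not jstack's implementation:

```java
import java.util.Map;

// Sketch of a programmatic thread dump, similar in spirit to jstack or
// a signal-3 dump: print every thread's name, state, and stack frames.
public class ThreadDumper {
    public static String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e
                 : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" state=")
              .append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

In a dump like this, many threads showing state=BLOCKED in the same method point straight at a synchronization bottleneck.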

You may have noticed that the synchronized modifier on the grindCPU() method in Listing 2 is not actually necessary. I remove it for the next test, shown in Figure 2.

Figure 2

In Figure 2, you will notice that capacity has actually decreased. I am utilizing more CPU, but both the throughput and number of failures have become worse. The garbage collection cycle has changed, but I am still deallocating 100 megabytes per second. It's obvious that I have not removed the primary bottleneck.

Un-contended synchronization is costly in comparison to a simple function call. Contended synchronization is far worse because, in addition to the memory-visibility guarantees that synchronization must maintain, the VM must also manage the waiting threads (see "Threading Lightly" by Brian Goetz (developerWorks, July 2001)). In this case, these costs are actually smaller than the memory bottleneck. In fact, by releasing the synchronization bottleneck, I place even more pressure on the VM's memory system and end up with slightly worse throughput, even though I am consuming far more CPU resources. It is obviously best to begin working on the worst bottleneck, but which one that is can be hard to determine. Ensuring that your VM's memory handling is healthy is usually a good place to begin.

Memory bottleneck

Now, I restore the synchronization and work on the memory problems instead. Listing 3 is a revision of the Grinder class, Grinderv2, which reuses StringBuffer instances. The load-test results are shown in Figure 3.

Listing 3

package pub.capart;

/**
 * This is a simple class designed to simulate an application consuming
 * CPU, memory, and contending for a synchronization lock.
 */
public class Grinderv2 {
    private static Grinderv2 singleton = new Grinderv2();
    private static final String randstr =
        "this is just a random string that I'm going to add up many many times";
    private StringBuffer sbuf = new StringBuffer();
    private StringBuffer sbufrev = new StringBuffer();

    public static Grinderv2 getGrinder() {
        return singleton;
    }

    public synchronized void grindCPU(int level) {
        sbufrev.setLength(0);
        sbufrev.append(randstr);
        sbuf.setLength(0);
        for (int i = 0; i < level; ++i) {
            sbuf.append(sbufrev);
            reverse();
        }
    }

    // Reverse the contents of sbuf into sbufrev, reusing the buffer.
    private void reverse() {
        sbufrev.setLength(0);
        sbufrev.append(sbuf);
        sbufrev.reverse();
    }
}

Figure 3

In general, it is a bad idea to reuse StringBuffers, but this is a synthetic case, and I am merely trying to reproduce a couple of common problems, not provide idiomatic solutions. The memory data has disappeared from the figure because the garbage collector did not run during the test. The throughput has increased dramatically and the CPU utilization is back at 50 percent. Listing 3 optimizes more than just memory usage, but I think the improvements are largely due to less hyperactive memory consumption.
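A more idiomatic way to reuse buffers without sharing them (an alternative I am sketching here, not code from the article's listings) is a ThreadLocal: each worker thread keeps its own builder, which removes the allocation churn without requiring a shared lock:

```java
// Sketch of per-thread buffer reuse with ThreadLocal. Each thread gets
// its own StringBuilder whose backing array is recycled across requests,
// so no synchronization on a shared buffer is needed.
public class Grinder {
    private static final String RANDSTR =
        "this is just a random string that I'm going to add up many many times";

    private static final ThreadLocal<StringBuilder> BUF =
        ThreadLocal.withInitial(StringBuilder::new);

    // Returns the final buffer length so the work is observable.
    public static int grindCPU(int level) {
        StringBuilder sb = BUF.get();
        sb.setLength(0);  // reuse the backing array from the last request
        String s = RANDSTR;
        for (int i = 0; i < level; ++i) {
            sb.append(s);
            // The reversal still allocates; the point here is only the
            // per-thread reuse of the main buffer.
            s = new StringBuilder(sb).reverse().toString();
        }
        return sb.length();
    }
}
```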

Synchronization bottleneck revisited

Listing 4 is another revision of the Grinder class, Grinderv3, with a minimal resource pool implementation. Figure 4 shows the load-test results.

Listing 4

package pub.capart;

/**
 * This is just a dummy class designed to simulate a process consuming
 * CPU, memory, and contending for a synchronization lock.
 */
public class Grinderv3 {
    private static Grinderv3 grinders[];
    private static int grinderRoundRobin = 0;
    private static final String randstr =
        "this is just a random string that I'm going to add up many many times";
    private StringBuffer sbuf = new StringBuffer();
    private StringBuffer sbufrev = new StringBuffer();

    static {
        grinders = new Grinderv3[10];
        for (int i = 0; i < grinders.length; ++i) {
            grinders[i] = new Grinderv3();
        }
    }

    public synchronized static Grinderv3 getGrinder() {
        Grinderv3 g = grinders[grinderRoundRobin];
        grinderRoundRobin = (grinderRoundRobin + 1) % grinders.length;
        return g;
    }

    public synchronized void grindCPU(int level) {
        sbufrev.setLength(0);
        sbufrev.append(randstr);
        sbuf.setLength(0);
        for (int i = 0; i < level; ++i) {
            sbuf.append(sbufrev);
            reverse();
        }
    }

    // Reverse the contents of sbuf into sbufrev, reusing the buffer.
    private void reverse() {
        sbufrev.setLength(0);
        sbufrev.append(sbuf);
        sbufrev.reverse();
    }
}

Figure 4

Throughput has increased appreciably, and I am using better than 50 percent of the CPU resources. Both contended and un-contended synchronization are costly, but generally, the largest synchronization cost is reduced vertical scalability. My load test is no longer saturating the machine, however, so I now revise it with more virtual users, shown in Figure 5.

Figure 5

Notice in Figure 5 that the throughput dips when the load hits saturation and then comes back up once the load eases somewhat. In addition, notice that the load pushes the CPUs to 100 percent throughout most of the test, suggesting that the server is beyond its optimal throughput. One of the outcomes of load testing is a capacity plan. Subjecting your application to loads beyond its capacity rating yields lower throughput. Pushing a server well past its capacity rating is a stress test, and stress tests are harder to analyze because many elements of the loaded servers begin to behave nonlinearly.
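The round-robin pool in Listing 4 hands out an instance whether or not it is busy; a checkout-style pool built on java.util.concurrent is a common alternative design. GrinderPool below is my own sketch, not the article's code:

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of a checkout/check-in resource pool: callers take an idle
// instance and block when all instances are busy, instead of queueing
// on a shared monitor as in Listing 4's round-robin scheme.
public class GrinderPool<T> {
    private final BlockingQueue<T> idle;

    public GrinderPool(List<T> instances) {
        this.idle = new ArrayBlockingQueue<>(instances.size(), false, instances);
    }

    // Blocks until an instance is free.
    public T acquire() throws InterruptedException {
        return idle.take();
    }

    // Return an instance to the pool after use.
    public void release(T instance) {
        idle.offer(instance);
    }

    public int available() {
        return idle.size();
    }
}
```

The tradeoff is that callers wait for a free instance rather than piling onto a busy one, which keeps per-instance work serialized without unbounded monitor contention.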

Horizontal scalability

Scaling horizontally allows for much larger capacities, but is not always cost effective. An application constructed to run on separate servers may be considerably more complex than an application running in a single VM. However, horizontal scaling supports the largest increases in capacity.

Figure 6 is the result of my final load test. I have load-balanced three mostly identical servers, with slight variations in memory and CPU speeds, and rerun the previous load test. The overall throughput is better than three times that of a single machine, and the CPUs are never completely loaded. I show only the CPU results of a single machine in Figure 6, but the others are similar.

Figure 6

Conclusion

I once spent nine months at a well funded startup that deployed a complex Java server application with no time spent on capacity planning. Poor performance placed customer contracts in jeopardy; developers worked long hours with profilers, finding small improvements without addressing the primary bottlenecks; and complete confusion ensued. Eventually, load testing provided the answers, but only after considerable pain.

The next startup I went to had a far worse problem: the application was performing at 1/100 of its expected capacity. However, by detecting the problem early and recognizing the necessity of load testing, it was resolved quickly, and in a structured and collegial atmosphere. Load testing is not terribly costly in comparison to overall software development costs, but the risks avoided are considerable.

Ivan Small has more than 14 years of experience developing software. He began his career leading software development for the Supernova Cosmology Project at LBNL. The project was one of two that resulted in the discovery of anti-gravity and an ever-expanding universe. He has since worked on projects ranging from data mining to Java enterprise applications. Currently he is a principal software engineer at Innovative Interfaces.

