Is your code ready for the next wave in commodity computing?

Prepare yourself for multicore processing

"Hardware is really just software crystallized early," says Alan C. Kay in his paper "The Early History of Smalltalk" (ACM, 1993). That quote captures the inspiration for this article. Software developers have always been at the mercy of hardware manufacturers, although we've had a pretty easy ride of it since the inception of the computing industry. From then until now, increasing speeds of every component in the standard Von Neumann architecture have given our software essentially free increases in performance.

No longer.

All of the main hardware manufacturers (Intel, IBM, Sun, and AMD) have realized the problems inherent in jacking up the clock speeds of CPUs and are in the process of rolling out a fundamental change in processor architecture—multicore units with more than one processing element, providing true hardware support for multiple threads of execution, instead of the simulated support of the past.

Like every other change you have come across in your career as a Java programmer, this one brings both opportunities and challenges. In this article, I highlight those opportunities and challenges, and detail how to avoid the main pitfalls.

Why parallelize?

Let's take a step back and examine the factors that have precipitated the advent of parallel computing hardware in all tiers of IT, as opposed to specialized high-end niches. Why would we want hardware that can execute software in true parallel mode? For two reasons: You need an application to run more quickly on a given dataset and/or you need an application to support more end users or a larger dataset.

And if we want a "faster" or "more powerful" application, where powerful means handling more and more users, then we have two options:

  1. Increase the power of the system resources
  2. Add more resources to the system

If we increase the power of the system resources (i.e., in some semantic sense, replace or extend the system within its original boundaries), then we are scaling the system vertically; for example, replacing a 1-GHz Intel CPU with a pin-compatible 2-GHz version, a straight swap. If, however, we choose to add to the system resources such that we extend beyond the original boundaries of the system, then we are scaling the system horizontally; for example, adding another node to our Oracle 10g RAC cluster to improve overall system performance.

Finally, I'd like to make one more point on the advent of parallel hardware—you may not need it or even want it for your application, but you have no choice in the matter—CPUs that provide true hardware support for multiple threads are becoming the norm, not the exception. Some estimates indicate that 75 percent of the Intel CPUs that ship by the end of 2007 will be multicore. Intel itself estimates 25 percent by the end of 2006.

Let's switch gears for a moment and place ourselves in the shoes of a hardware designer. How can we feed the insatiable appetite of software programmers for more powerful hardware? We've spent the last 20 years delivering the promise of Moore's Law, to the extent that we are running into fundamental problems of physics. Continuing to increase the clock speed of single processing units is not a sustainable solution moving forward because of the power required and heat generated. The next logical move is to add more processing units to those self-same chips. But that means no more free lunch. Those software engineers will need to explicitly take advantage of the new hardware resources at their disposal, or they will not realize the benefit.

This approach is well-proven in academic and specialized industrial applications, and is also biologically plausible. Say what? As you read this article, you are processing and understanding it using the most complex parallel piece of hardware known to man—your brain (a BNN, or biological neural network, composed of 100 billion simple processing units). ANNs (artificial neural networks) represent a branch of AI (artificial intelligence) dedicated to reproducing the characteristics of BNNs.

So in summary then, we want to—in fact we need to—parallelize hardware if we are to "kick on" and provide more computing resources in a sustainable way to software applications.

Hardware: State of the nation

Let's now briefly look (an in-depth look would be a meaty article in its own right) at how the main hardware vendors have commoditized parallel hardware architectures. Essentially, every hardware vendor has had to create its own viewpoint on how to support parallelism in hardware, implement it, and then pass it to its marketing department to explain how much better its secret sauce is over the opposition's. The challenge for system designers is to select the right hardware platform for their production environment. The challenge for programmers, especially the team of programmers who develop the JDK for these platforms, is two-fold:

  1. How do I take advantage of the new hardware platform (for example the IBM POWER5+ or Sun UltraSparc T1 chipset) without straying from the promise of "Write Once, Run Anywhere" (WORA) for Java?
  2. How do I use the operating system provided by the hardware vendor, again, in an optimal way to map onto key Java constructs like java.lang.Thread, but remain true to WORA?

C and C++ programmers have the luxury of being able to provide architecture-specific flags to GCC (GNU Compiler Collection) or g++ (GNU C++) to optimize for those architectures. Java programmers have no such compile-time luxury. Instead, we must tune the running system itself, often using the JVM's -XX flags (see Resources) to tweak behavior for the actual platform used.

Now, let's examine the main characteristics of the hardware offerings from the main vendors.

Intel

Intel's offering was once synonymous with the concept of hyperthreading (see Resources for more information), but has now matured to include true multicore support—for example, the dual core CPU now shipping inside the new Apple MacBook. Initially, Intel's implementation of hyperthreading replicated some parts of the CPU architecture, but retained single copies of others, creating bottlenecks under certain conditions. This type of implementation must be regarded as a stop-gap measure at best, designed to keep the market happy while the true multicore support (two cores at first) filtered into production. Intel plans to release a four-core CPU next year for the server market.

See Resources for more details on Intel's offerings, as well as offerings from the other vendors mentioned here.

AMD

AMD (Advanced Micro Devices) offers dual-core CPUs. In addition, it uses its Direct Connect and HyperTransport technologies to team those CPUs together to create larger clusters of CPUs.

Sun

Sun has truly embraced the power of parallel computing on the server, and its current line-up of T1 (Niagara) systems and planned successors reflects this. Sun was producing high-end servers long before the T1-based family, of course, but the T1 really starts to deliver on the promise of parallel hardware as a commodity.

Unfortunately, the T1 has a flaw, addressed only in the T2—poor parallel floating-point operation support. So if you need this feature, the T1 is not for you, but it is ideal in a classic Web server setting. Apart from this weakness, the T1 offers up to 32 truly independent threads by using an eight-core CPU, with each core able to handle four different active thread contexts.

IBM

Of all the hardware vendors considered here, IBM has arguably one of the most potent offerings with its Cell processor (developed in conjunction with Sony and Toshiba), which provides the multicore brawn inside the Sony PlayStation 3, or PS3. The Cell processor couples a POWER-architecture core with eight additional processing units that burn through the massive number of vector and floating-point operations needed to render video games.

More mainstream, IBM's main offering is the dual-core POWER5+ CPU, with POWER6 scheduled for delivery in 2007 (probably also dual-core). Core for core, most industry observers say that a POWER CPU is much faster than the competition. Like Sun, IBM has significant experience (including mainframe technology) to bring to bear on the engineering problems inherent in building multicore CPUs. Interestingly, with the Cell forming part of the Sony PS3, IBM is uniquely placed to put its parallel hardware in tens of millions of homes over the next decade—an interesting point to ponder for client-side programmers.

Putting it all together: A taxonomy

Simple taxonomy for current parallel hardware processing components from vendors.

In much the same way as Michael J. Flynn proposed a classification scheme for parallel architectures in 1966, we can get a good view of how the various hardware models stack up against each other if we produce a taxonomy for the current mainstream processing components (where component is a high-level term describing how that vendor has implemented its parallel hardware strategy). The primary decision for this map was to decide on the axes for the model—I chose number of cores per processing component, or CPU, and the power of those cores, tentatively based on clock speed (I know that this is a generalization, and some vendors will balk at it, but an in-depth benchmarking of each vendor offering is not this article's main purpose). A more detailed examination of per-core performance would also take in memory bandwidth and latency, pipelining, and other techniques.

In any event, the map makes one clear point—Sun has a unique offering at the moment in the market. In fact, my own subjective opinion on the current state of play is that Sun has the best offering, especially when considered from a Java architect's point of view. The combination of T1 hardware providing 32 hardware threads, the Java 5 VM, and Solaris 10 as a bullet-proof OS is hard to resist in the enterprise space. The price is compelling too. The only fly in the ointment is the weak FPU support. What if one of the core frameworks I use suddenly introduces a feature that exposes this weakness in the T1? The Rock iteration of the T1 addresses this weakness, as does the Niagara 2 to a certain extent.

On the other hand, if you are building a system that does not need and will never need 32 hardware threads, then look at the AMD, Intel, and IBM offerings—the individual cores in these CPUs are more powerful than a T1 core. Floating-point performance will be better too.

Hardware: Looking forward

Given that current available processors range from 2 to 32 cores, but vector processors with at least 1,024 processing units have been available for about two decades (Gustafson (see below) wrote his seminal speedup note in 1987 based on a 1,024-processor system), what can we expect for the future? Examining the potential rate at which vendors can add cores to a single die (and Sun is leading the mainstream vanguard in this regard) and applying the simplest form of Moore's Law (a doubling of capacity every 18 months), that suggests we could expect to see 1,024 cores on a CPU in about 7.5 years, or 2014. Unrealistic? If we were to couple four of Sun's 32-thread T1 chips using AMD's HyperTransport technology, we could have 128 cores today. I don't doubt that significant hardware problems need to be solved in scaling out multicore CPUs, but, equally, much R&D is being invested in this area too.
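
The arithmetic behind that 7.5-year estimate is easy to check; a minimal sketch, using the core counts and 18-month doubling period assumed above:

```java
public class CoreProjection {
    public static void main(String[] args) {
        int coresNow = 32;        // today's high-water mark (Sun T1 threads)
        int target = 1024;        // the classic vector-processor scale
        double monthsPerDoubling = 18;  // simplest form of Moore's Law

        // Doublings needed: log2(1024 / 32) = 5
        double doublings = Math.log((double) target / coresNow) / Math.log(2);
        double years = doublings * monthsPerDoubling / 12;
        System.out.printf("%.0f doublings -> %.1f years%n", doublings, years);
    }
}
```

Five doublings at 18 months each lands on 7.5 years, i.e., roughly 2014.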

Finally, chips with 1,024 cores only make sense (for now at least) in a server-side environment—commodity client-side computing simply doesn't need this kind of horsepower, yet.

Software: State of the nation

Having surveyed the hardware landscape, let's look at some of the fundamental laws that define the theoretical limits on how scalable software can be. First up, Amdahl's Law.

In true sound bite mode, Gene Amdahl was a pessimist. In 1967, he stated that any given calculation can be split into two parts: F and (1 - F), where F is the fraction that is serial and (1 - F), the remaining parallelizable fraction. He then expressed the theoretical speedup attainable for any algorithm thus defined as follows:


speedup = 1 / (F + (1 - F) / N)


N = number of processors used
F = the serialized fraction

Simplifying this relationship reveals that the limiting factor on speedup is the ratio 1/F. In other words, if 20 percent of an algorithm must be executed in serial, then we can hope to achieve a maximum speedup of 5 times, no matter how many CPUs we add, and that assumes we get linear speedup. If the serial percentage is 1 percent, then the theoretical maximum rises to 100 times.
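
Amdahl's equation is simple enough to evaluate directly; a minimal sketch in Java (the serial fractions and processor count are illustrative, not benchmarks):

```java
public class AmdahlsLaw {

    // Theoretical speedup for serial fraction f on n processors.
    static double speedup(double f, int n) {
        return 1.0 / (f + (1.0 - f) / n);
    }

    public static void main(String[] args) {
        // 20 percent serial: speedup flattens out near 1/0.2 = 5x.
        System.out.printf("f=0.20, n=1000: %.2f%n", speedup(0.20, 1000));
        // 1 percent serial: the ceiling rises toward 1/0.01 = 100x.
        System.out.printf("f=0.01, n=1000: %.2f%n", speedup(0.01, 1000));
    }
}
```

Note how even 1,000 processors cannot push past the 1/F ceiling.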

However, it turned out that Amdahl's equation omits one critical variable, which John Gustafson of Sandia National Laboratories added in 1987—the size of the problem. Basically, if we can grow the problem size so that the serial fraction becomes less important and the parallelizable fraction more important, then we can achieve any speedup or efficiency we want. Mathematically:


Scaled speedup = N + (1 - N) * s'


N = number of processors
s' = the serial fraction of the run time as measured on the parallel system
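
Gustafson's equation is just as easy to evaluate; a minimal sketch in Java (the serial fraction and processor counts are illustrative):

```java
public class GustafsonsLaw {

    // Scaled speedup for n processors when s is the serial fraction
    // of the run time observed on the parallel system.
    static double scaledSpeedup(int n, double s) {
        return n + (1 - n) * s;
    }

    public static void main(String[] args) {
        // With a 1 percent serial component, speedup tracks the
        // processor count almost linearly as the problem grows.
        System.out.printf("n=32:   %.2f%n", scaledSpeedup(32, 0.01));
        System.out.printf("n=1024: %.2f%n", scaledSpeedup(1024, 0.01));
    }
}
```

Unlike Amdahl's fixed-size ceiling, the scaled speedup keeps climbing with N.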

Applying the 80/20 rule once more, it is Gustafson's Law that affects typical enterprise Java applications most of all. As software engineers and architects, we spend most of our time designing and building systems that need to scale up as the problem itself scales up (whether that is the number of concurrent users or the size of the data that needs to be analyzed). Gustafson's Law tells us that there is no intrinsic theoretical limit to the scalability of the systems we build.

In fact, we can regard the vast majority of Java EE (Java EE is Sun's new name for J2EE) systems as a simplified case of a parallelizable problem, where the average task size is small, task boundaries are well defined, and the tasks carried out in the system have few interdependencies.

Software: Looking forward

What advances can we expect to see in terms of additional software support for parallel hardware in the future?

Tools in IDEs to predict parallelism and/or optimize for parallelism

Software engineers will be expected to have more explicit knowledge about concurrency, which will be reflected in the tools we use. Java IDEs like Eclipse, NetBeans, and IntelliJ will support refactoring code to make it more thread-friendly in the same way that I can extract an interface, fix imports, or change the scope of a variable today.

Java to provide first-class support to create parallel constructs

In a five-year timeframe, the core Java programming language will be extended to provide first-class, explicit support for parallel computing, in much the same way that first-class support for XML is being planned at the moment. This support will greatly enhance the power of tools to automatically suggest simple "parallel-friendly" modifications to algorithms.

Java: Strong and weak points relating to concurrency

This article so far has given you a basic but firm grounding in parallel computing and some of the theory behind it. Now we move on to apply that knowledge to the design and implementation of Java-based systems (bearing in mind that I do not plan to give a tutorial on Thread, ThreadLocal, or use of the synchronized keyword—see Resources for sources dealing with these topics).

Let's use Amdahl (the pessimist) and Gustafson (the optimist) one more time in this article to drive the next few sections:

Amdahl: "Forget about increased performance. If I were you, I'd be more worried about making sure my application simply runs at least as well on the new chips and that they don't expose some weird threading or race condition bug."

Gustafson: "Hey, the new hardware is here! Threads, threads, and more threads! It's one big thread party!"

(Actually, Gene Amdahl is a successful entrepreneur, so I doubt he is pessimistic. It's also poetic license for me to assume that Gustafson goes around shouting "It's one big thread party!" I hope both authors will forgive my using their names to create a literary construct in this article. Note to self: Consider copying Dan Brown and using biblical characters in next JavaWorld article.)

The fundamental question to ask is: Do the Java programming language and platform give developers the capability to correctly and efficiently take advantage of parallel processing hardware?

One of Java's strongest points is the java.lang.Thread class, which has been in the core JDK libraries since the 1.0 release. One of the most significant upgrades to the Java platform in release 5 and higher is the new java.util.concurrent library. It is long overdue. (Doug Lea created and maintained his util.concurrent package for a long time before it moved into the core JDK.) Key types to examine here are the java.util.concurrent.Executor interface and the java.util.concurrent.Executors factory class. There are two important points to note about this library:

  1. Entire chunks of the library are not needed by 80 percent of Java developers, in particular the subpackage dealing with locks (java.util.concurrent.locks) and atomics (java.util.concurrent.atomic)
  2. A corollary of this point is that, as a Java programmer, you should look to your main framework providers/vendors to provide your application with threading support instead of providing it yourself
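
A minimal sketch of putting the Executor machinery to work: the pool is sized to the hardware threads the JVM reports via Runtime.availableProcessors() (the class name and trivial tasks are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PoolDemo {
    public static void main(String[] args) throws Exception {
        // Size the pool to the hardware threads the JVM can see;
        // on a fully populated T1 this can report up to 32.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        // Submit independent tasks; the Executor decides which thread
        // (and hence which core) runs each one.
        List<Future<Integer>> results = new ArrayList<Future<Integer>>();
        for (int i = 1; i <= 8; i++) {
            final int n = i;
            results.add(pool.submit(new Callable<Integer>() {
                public Integer call() {
                    return n * n;   // stand-in for real work
                }
            }));
        }

        int sum = 0;
        for (Future<Integer> f : results) {
            sum += f.get();   // blocks until that task has completed
        }
        pool.shutdown();
        System.out.println("sum of squares 1..8 = " + sum);
    }
}
```

The same code runs unchanged on a single-core laptop and a 32-thread T1; only the pool size differs, which is exactly the WORA behavior we want.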

Looking beyond explicit support for parallel processing, Java EE developers have been exposed to an inherently threaded model since the inception of enterprise Java. All the building blocks of the Java EE specification—servlets, session beans, and entity beans—have always been regarded as components, where threading issues need to be explicitly considered and handled by the developer. It is fair to assume then that advances in parallel processing hardware will cause no issues at all for this segment of the Java ecosystem. Developers will still create bugs relating to unordered access to shared variables by threads and create bottlenecks by the inappropriate use of Singleton objects in their code, but by and large, these bugs/anti-patterns and their resolution are understood and well documented.
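
A minimal sketch of that "unordered access to shared variables" bug class, contrasting an unprotected counter with java.util.concurrent.atomic.AtomicInteger (the class name and iteration counts are illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SharedCounter {
    static int plain = 0;                                   // unprotected
    static final AtomicInteger safe = new AtomicInteger();  // thread-safe

    public static void main(String[] args) throws Exception {
        Runnable work = new Runnable() {
            public void run() {
                for (int i = 0; i < 100000; i++) {
                    plain++;                  // read-modify-write race
                    safe.incrementAndGet();   // atomic equivalent
                }
            }
        };
        Thread a = new Thread(work), b = new Thread(work);
        a.start(); b.start();
        a.join(); b.join();
        // On a true multicore CPU, "plain" is likely to fall short of
        // 200000, and the shortfall varies from run to run; "safe" is
        // always exactly 200000.
        System.out.println("plain = " + plain + ", safe = " + safe.get());
    }
}
```

The insidious part is that on a single-core machine the race may almost never bite, which is precisely why such code must be retested on multicore hardware.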

However, Java by itself offers weaker support here than popular frameworks such as Spring and Hibernate, which encapsulate threading good practices. Java EE developers want to delegate threading issues to the framework and focus on implementing business logic.

Moving off the server and onto the desktop, client-side developers will have more work to do. Threading issues have always been well known on the client; multiple helper classes and best practices abound to prevent improper use of the AWT (Abstract Window Toolkit) main UI thread (SwingWorker, javax.swing.SwingUtilities.invokeLater, etc.), while the entire model of the latest UX buzzword—Ajax (Asynchronous JavaScript and XML)—is based on using JavaScript to asynchronously receive data from the server to update the rendered HTML shown to the user.
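
A minimal sketch of the off-main-thread pattern those helper classes encode: do the slow work on a worker thread, then hand the result back to the event dispatch thread via SwingUtilities.invokeLater (the field and result string stand in for real widget state such as a JLabel):

```java
import java.util.concurrent.CountDownLatch;
import javax.swing.SwingUtilities;

public class EdtPattern {
    static volatile String labelText = "loading...";  // stands in for a JLabel

    public static void main(String[] args) throws Exception {
        final CountDownLatch done = new CountDownLatch(1);
        // Do the slow work on a worker thread, never on the UI thread...
        new Thread(new Runnable() {
            public void run() {
                final String result = "42 rows fetched"; // pretend this took seconds
                // ...then hand the UI update back to the event dispatch thread.
                SwingUtilities.invokeLater(new Runnable() {
                    public void run() {
                        labelText = result;  // stands in for JLabel.setText()
                        done.countDown();
                    }
                });
            }
        }).start();
        done.await();
        System.out.println(labelText);
    }
}
```

SwingWorker packages exactly this choreography; the point is that on a multicore client the worker and the event dispatch thread can genuinely run simultaneously, so the hand-off discipline matters more, not less.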

On an interesting side note, it is fair to say that Java on the client is under pressure from Ajax-enabled browser applications, with the crowd storming the fort being led by Google Suggest, Maps, Spreadsheet, etc. Therefore, we need to ask the question: How do Internet Explorer, Firefox, Safari, etc. handle multicore CPUs, or, put another way, what are the threading models for these applications and their built-in JavaScript interpreter engines?

As it turns out, this is a difficult question to answer, and the answer will change over time. (Why wouldn't Microsoft and Mozilla add threading code to take advantage of additional cores—just as a Java programmer would?) At present, I believe that the answer is that all browsers have one main UI thread, just like the main UI thread in AWT/Swing. The JavaScript interpreter gets another thread to process events that occur during a page's lifecycle.

In summary then, a client machine with more than one true processing unit could run your application differently than a client with only one processing unit. The only way to certify your client-side Java threaded application is to test it on a true multicore CPU.

Conclusion

In summary, the main message of this article is that the next wave of hardware currently filtering onto the client fundamentally differs from what it replaces in one important regard: as a programmer or architect, you must explicitly take advantage of the new parallel computing resources through the efficient creation and utilization of threads to pass performance improvements on to your end users. Stepping back, at a minimum, you must ensure that your application continues to operate correctly on the new multicore hardware, something you never had to do before.

On the server, the commoditization of multicore CPUs continues apace, reducing the price point of servers dramatically. Small companies can now own the same level of computing power for one tenth or even one hundredth of what it cost just five years ago. And you ain't seen nothing yet, folks.

In the next installment of this article, I will move from the theoretical into the practical, proposing and implementing a vendor-independent threading benchmark that you can use to ascertain the performance of any given hardware-OS-JVM combination. Stay tuned!

Humphrey Sheil is chief technical architect for CedarOpenAccounts, a UK supplier of financial enterprise applications for service industries. He holds a BS and an MS in computer science from University College, Dublin, Ireland, and maintains his Weblog here.

Learn more about this topic
