"Hardware is really just software crystallized early," says Alan C. Kay in his paper "The Early History of Smalltalk" (ACM, 1993). That quote captures the inspiration for this article. Software developers have always been at the mercy of hardware manufacturers, although we've had a fairly easy ride of it since the inception of the computing industry. From then until now, increasing speeds in every component of the standard von Neumann architecture have given our software essentially free performance gains.
All of the main hardware manufacturers (Intel, IBM, Sun, and AMD) have recognized the problems inherent in continually raising CPU clock speeds and are rolling out a fundamental change in processor architecture: multicore units with more than one processing element, providing true hardware support for multiple threads of execution rather than the simulated support of the past.
Like every other change you have come across in your career as a Java programmer, this one brings both opportunities and challenges. In this article, I highlight those opportunities and challenges, and detail how to address the main challenges identified.
Let's take a step back and examine the factors that have precipitated the advent of parallel computing hardware in all tiers of IT, as opposed to specialized high-end niches. Why would we want hardware that can execute software in true parallel mode? For two reasons: You need an application to run more quickly on a given dataset and/or you need an application to support more end users or a larger dataset.
And if we want a "faster" or "more powerful" application, where powerful means handling more and more users, then we have two options:
- Increase the power of the system resources
- Add more resources to the system
If we increase the power of the system resources (i.e., in some semantic sense, replace or extend the system within its original boundaries), then we are scaling the system vertically; for example, replacing a 1-GHz Intel CPU with a pin-compatible 2-GHz version is a straight swap. If, however, we choose to add to the system resources such that we extend beyond the original boundaries of the system, then we are scaling the system horizontally; for example, adding another node to our Oracle 10g RAC cluster to improve overall system performance.
Finally, I'd like to make one more point on the advent of parallel hardware—you may not need it or even want it for your application, but you have no choice in the matter—CPUs that provide true hardware support for multiple threads are becoming the norm, not the exception. Some estimates indicate that 75 percent of the Intel CPUs that ship by the end of 2007 will be multicore. Intel itself estimates 25 percent by the end of 2006.
Let's switch gears for a moment and place ourselves in the shoes of a hardware designer. How can we feed the insatiable appetite of software programmers for more powerful hardware? We've spent the last 20 years delivering the promise of Moore's Law, to the extent that we are running into fundamental problems of physics. Continuing to increase the clock speed of single processing units is not a sustainable solution moving forward because of the power required and heat generated. The next logical move is to add more processing units to those self-same chips. But that means no more free lunch. Those software engineers will need to explicitly take advantage of the new hardware resources at their disposal or not realize the benefit.
This approach is well-proven in academic and specialized industrial applications, and is also biologically plausible. Say what? As you read this article, you are processing and understanding it using the most complex parallel piece of hardware known to man—your brain (a BNN, or biological neural network, composed of 100 billion simple processing units). ANNs (artificial neural networks) represent a branch of AI (artificial intelligence) dedicated to reproducing the characteristics of BNNs.
So in summary then, we want to—in fact we need to—parallelize hardware if we are to "kick on" and provide more computing resources in a sustainable way to software applications.
Hardware: State of the nation
Let's now briefly look (an in-depth look would be a meaty article in its own right) at how the main hardware vendors have commoditized parallel hardware architectures. Essentially, every hardware vendor has had to form its own viewpoint on how to support parallelism in hardware, implement it, and then pass it to its marketing department to explain how much better its secret sauce is than the opposition's. The challenge for system designers is to select the right hardware platform for their production environment. The challenge for programmers, especially the teams who develop the JDK for these platforms, is twofold:
- How do I take advantage of the new hardware platform (for example the IBM POWER5+ or Sun UltraSparc T1 chipset) without straying from the promise of "Write Once, Run Anywhere" (WORA) for Java?
- How do I use the operating system provided by the hardware vendor, again in an optimal way, to map onto key Java constructs like java.lang.Thread, yet remain true to WORA?
C and C++ programmers have the luxury of being able to provide architecture-specific flags to GCC (GNU Compiler Collection) or g++ (GNU C++) to optimize for those architectures. Java programmers have no such compile-time luxury. Instead, we must tune the running system itself, often using the JVM's -XX flags (see Resources) to tweak behavior for the actual platform used.
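One consequence of this runtime-only tuning model is that portable Java code must discover its parallelism budget at runtime rather than at compile time. The sketch below (the class and method names are my own, not from any API discussed here) shows the idiomatic way to do that: ask the JVM how many hardware threads the platform exposes and size a worker pool accordingly, so the same bytecode adapts to a dual-core laptop or a 32-thread T1 server without recompilation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolSizing {
    // Ask the JVM how many hardware threads the platform exposes.
    public static int poolSize() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) throws InterruptedException {
        int n = poolSize();
        // One worker per hardware thread: a common starting point
        // for CPU-bound work, tunable per platform.
        ExecutorService pool = Executors.newFixedThreadPool(n);
        for (int i = 0; i < n; i++) {
            final int id = i;
            pool.execute(new Runnable() {
                public void run() {
                    System.out.println("worker " + id + " on "
                            + Thread.currentThread().getName());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

Nothing here strays from WORA: the code compiles once and queries the runtime, not the compiler, for the parallelism actually available.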
Now, let's examine the main characteristics of the hardware offerings from the main vendors.
Intel's offering was once synonymous with the concept of hyperthreading (see Resources for more information), but has now matured to include true multicore support—for example, the dual core CPU now shipping inside the new Apple MacBook. Initially, Intel's implementation of hyperthreading replicated some parts of the CPU architecture, but retained single copies of others, creating bottlenecks under certain conditions. This type of implementation must be regarded as a stop-gap measure at best, designed to keep the market happy while the true multicore support (two cores at first) filtered into production. Intel plans to release a four-core CPU next year for the server market.
See Resources for more details on Intel's offerings, as well as offerings for the other vendors mentioned here.
AMD (Advanced Micro Devices) offers dual-core CPUs. In addition, it uses its Direct Connect and HyperTransport technologies to team those CPUs together to create larger clusters of CPUs.
Sun has truly embraced the power of parallel computing on the server, and its current lineup of T1 (Niagara) systems, with Niagara 2 and Rock planned to follow, reflects this. Sun produced high-end servers long before the T1-based family, of course, but the T1 really starts to deliver on the promise of parallel hardware as a commodity.
Unfortunately, the T1 has a flaw that is addressed only in its successors: poor parallel floating-point support. If you need that capability, the T1 is not for you; it is, however, ideal in a classic Web server setting. Apart from this weakness, the T1 offers up to 32 truly independent threads, using an eight-core CPU with each core able to handle four active thread contexts.
Of all the hardware vendors considered here, IBM has arguably one of the most potent offerings with its Cell processor (developed in conjunction with Sony and Toshiba), providing the multicore brawn inside the Sony PlayStation 3, or PS3. The Cell processor is essentially a POWER5 CPU that uses eight more processing units to burn through the massive number of vector and floating-point operations needed to render video games.
In the mainstream, IBM's flagship offering is the dual-core POWER5+ CPU, with the POWER6 scheduled for delivery in 2007 (probably also dual-core). Core for core, most industry observers say that a POWER CPU is much faster than the competition. Like Sun, IBM has significant experience (including mainframe technology) to bring to bear on the engineering problems inherent in building multicore CPUs. Interestingly, with the Cell forming part of the Sony PS3, IBM is uniquely placed to put its parallel hardware in tens of millions of homes over the next decade, an interesting point to ponder for client-side programmers.
Putting it all together: A taxonomy
In much the same way as Michael J. Flynn proposed a classification scheme for parallel architectures in 1966, we can get a good view of how the various hardware models stack up against each other by producing a taxonomy for the current mainstream processing components (where component is a high-level term for how a vendor has implemented its parallel hardware strategy). The primary decision for this map was the choice of axes: I chose the number of cores per processing component, or CPU, and the power of those cores, tentatively based on clock speed. (I know this is a generalization, and some vendors will balk at it, but in-depth benchmarking of each vendor's offering is not this article's purpose.) A more detailed examination of per-core performance would also consider memory bandwidth and latency, pipelining, and other techniques.
In any event, the map makes one clear point—Sun has a unique offering at the moment in the market. In fact, my own subjective opinion on the current state of play is that Sun has the best offering, especially when considered from a Java architect's point of view. The combination of T1 hardware providing 32 hardware threads, the Java 5 VM, and Solaris 10 as a bullet-proof OS is hard to resist in the enterprise space. The price is compelling too. The only fly in the ointment is the weak FPU support. What if one of the core frameworks I use suddenly introduces a feature that exposes this weakness in the T1? The Rock iteration of the T1 addresses this weakness, as does the Niagara 2 to a certain extent.
On the other hand, if you are building a system that does not need and will never need 32 hardware threads, then look at the AMD, Intel, and IBM offerings—the individual cores in these CPUs are more powerful than a T1 core. Floating-point performance will be better too.
Hardware: Looking forward
Given that currently available processors range from 2 to 32 cores, but vector processors with at least 1,024 processing units have been available for about two decades (Gustafson, discussed below, wrote his seminal speedup note in 1987 based on a 1,024-processor system), what can we expect for the future? Examining the rate at which vendors can add cores to a single die (Sun is leading the mainstream vanguard in this regard) and applying the simplest form of Moore's Law (a doubling of capacity every 18 months) suggests we could see 1,024 cores on a CPU in about 7.5 years, around 2014. Unrealistic? If we were to couple four of Sun's 32-thread T1 chips using AMD's HyperTransport technology, we could have 128 hardware threads today. I don't doubt that significant hardware problems must be solved to scale out multicore CPUs, but, equally, much R&D is being invested in this area.
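The 7.5-year figure is simple arithmetic: going from 32 cores to 1,024 is five doublings, at 18 months each. A short back-of-the-envelope sketch (my own helper, assuming the 18-month doubling period stated above) makes the projection explicit:

```java
public class CoreProjection {
    // Years until startCores reaches targetCores, assuming the core
    // count doubles every 18 months (the simplest Moore's Law form).
    public static double yearsToReach(int startCores, int targetCores) {
        double years = 0;
        for (int cores = startCores; cores < targetCores; cores *= 2) {
            years += 1.5;
        }
        return years;
    }

    public static void main(String[] args) {
        // 32 -> 64 -> 128 -> 256 -> 512 -> 1024: five doublings.
        System.out.println(yearsToReach(32, 1024) + " years"); // 7.5 years
    }
}
```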
Finally, chips with 1,024 cores only make sense (for now at least) in a server-side environment—commodity client-side computing simply doesn't need this kind of horsepower, yet.
Software: State of the nation
Having surveyed the hardware landscape, let's look at some of the fundamental laws that define the theoretical limits on how scalable software can be. First up, Amdahl's Law.
In true sound bite mode, Gene Amdahl was a pessimist. In 1967, he stated that any given calculation can be split into two parts: F and (1 - F), where F is the fraction that is serial and (1 - F), the remaining parallelizable fraction. He then expressed the theoretical speedup attainable for any algorithm thus defined as follows:
speedup = 1 / (F + (1 - F) / N)
N = number of processors used
F = the serialized fraction
Simplifying this relationship reveals that speedup is ultimately bounded by 1/F. In other words, if 20 percent of an algorithm must execute serially, then the best we can hope for is a 5-times speedup, no matter how many CPUs we add, and that assumes linear speedup in the parallel portion. If the serial fraction is 1 percent, the theoretical maximum is a 100-times speedup.
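Amdahl's bound is easy to verify numerically. This small sketch (class and method names are mine, not from any source) plugs a 20 percent serial fraction into the formula above and shows the curve flattening well below the 1/F ceiling:

```java
public class Amdahl {
    // f = serial fraction of the algorithm, n = number of processors.
    public static double speedup(double f, int n) {
        return 1.0 / (f + (1.0 - f) / n);
    }

    public static void main(String[] args) {
        // With f = 0.2, the ceiling is 1/F = 5, however large n grows.
        System.out.println(speedup(0.2, 4));    // 2.5
        System.out.println(speedup(0.2, 1024)); // about 4.98
    }
}
```

Note how little the last thousand processors buy us: most of the theoretical gain is already spent by the time the serial fraction dominates.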
However, it turned out that Amdahl's equation omits one critical factor, which John Gustafson of Sandia National Laboratories added in 1987: the size of the problem. Basically, if we can grow the problem size so that the serial fraction shrinks in importance while the parallelizable fraction grows, then we can achieve whatever speedup or efficiency we want. Mathematically:
Scaled speedup = N + (1 - N) * s'
N = number of processors
s' = the serial fraction of the task as measured on the parallel system
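To make the contrast with Amdahl concrete, here is a companion sketch (again, the class and method names are my own) that evaluates the scaled-speedup formula above:

```java
public class Gustafson {
    // n = number of processors,
    // s = serial fraction of the task, measured on the parallel system.
    public static double scaledSpeedup(int n, double s) {
        return n + (1 - n) * s;
    }

    public static void main(String[] args) {
        // With 1,024 processors and a 1 percent serial fraction:
        // 1024 + (1 - 1024) * 0.01 = 1013.77
        System.out.println(scaledSpeedup(1024, 0.01));
    }
}
```

Unlike Amdahl's formula, this result keeps growing with N, because the problem size is assumed to grow along with the machine.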