Seymour Cray's Cray and Danny Hillis's Connection Machine -- these two types of supercomputers are what dreams are made of, at least to programmers who, like myself, hunger for infinite processing power. These Testarossas of the number-crunching world can simulate weather, car crashes and impacting anti-tank shells, miniature ecosystems and the global economy, the effects of imaginary pollution disasters, the internals of stars, and the detonation of an H-bomb.
Supercomputers are the ultimate kit for more than day-dreaming power-programmers. Superpower governments and giant multinationals rely heavily on them. Their astronomical costs have meant that few programmers have had the pleasure of even writing a single machine instruction for these ivory-tower instruments of knowledge.
But what does all this have to do with Java?
"Knock, knock. Can I borrow your CPU for a minute?"
Consider for a second your own PC: You're most likely staring at a Pentium-class machine, with 16 or more megabytes of RAM. Now look at your humble telephone socket: For many of us, this is your PC's connection to the Internet, linking your machine to
more of these same Pentium-class machines. From this perspective, the Internet metamorphoses itself as the
supercomputer in the world. And all that is necessary to harvest its awesome potential is to convince as many as possible of these Internet-connected machines to do some work for you, while you sit back, still staring at your diminutive PC.
But how do you get millions of Internet nodes to execute some code?
Enter the ubiquitous Web browser (which by now is likely to be Java-compatible). Most applets embedded in a Web page these days range in usefulness from the entertaining but excrutiatingly slow to the annoying crash-on-arrival type. Why not distribute some useful applets for a change, like ones tackling a massive computational problem whose solution could benefit the world, instead of the military or some profit-obsessed oil giant?
Raytracing as an example DAMPP application
Since I am hardly a nuclear physicist or a meteorologist or a world-renowned economist, I have chosen to tackle a subject most computer users and programmers will be more comfortable with: raytraced image generation. Raytracing is a technique that to date has produced the most life-like, photorealistic synthetic images to come out of machines. Some of the best, mind-blowing movie special effects in these past few years were done using raytracing, although most are generated using much cheaper (that is, faster) techniques that end up producing less realistic images or animation.
Although raytracing is not what this article is about, I shall give you a very brief summary of the technique: Computer generated images all have at their heart a mathematical description of the scene to be generated. Life-like images are produced from three-dimensional mathematical descriptions, where the scene or model is composed out of simple building blocks like flat polygons, spheres, cubes, cones, and lines. To obtain realistic images, the algorithm must model color and light and all its physical side-effects (shadows, reflections, transparancy, brightness, and darkness) as close to the real phenomenon as possible. This is where the cheaper image-rendering algorithms use shortcuts that trade image quality for speed of generation. Raytracing does the opposite. Its goal is maximum realism at the expense of quick rendering.
Raytracing models reality very closely indeed. It attempts to model rays of light that, after emanating from some light source(s), interact with the model and eventually end up hitting your eye's retina. Concretely, the raytracing algorithm is extremely simple, in principle: For each image pixel to be rendered, it traces an imaginary beam of light from your eye (the viewpoint), going through the pixel and intersecting at some point the computer model of the scene to be drawn. At such an intersection the algorithm determines the pixel's color (and intensity) by tracing more imaginary light beams, this time emanating from the light source(s) in the scene's model. Hence ray-tracing, a very elegant but incredibly compute-intensive solution. (It is essentially a brute-force approach.)
Raytracing has one characteristic that's quite relevant to parallel computation: Each pixel can be computed totally independently from all others. In parallel jargon, there are no data dependencies. This means raytracing can be easily adapted to a parallel distributed environment, which is exactly what I have done with a raytracer written in Java, by Frederico Inacio de Moraes (who kindly gave me permission to use his code for this project).
The DAMPP Design: JobMasters and WorkerApplets
Let us now forget about raytracing and design the system this article is all about: a distributed applet-based (massively) parallel processing (DAMPP) system.
We need applets to calculate some parts of a parallel computation and we need to somehow assemble all the produced results in one central location. That central location will be the WWW server that carries the Web page containing our applets. These millions of applets, working like ants toward a common goal, are in fact just clones. So from now on I will simply talk about the WorkerApplet. (Keep in mind, though, that hundreds or thousands of these cloned applets can be instantiated at any one time, working in parallel on a given problem.)
On the WWW server runs another server program (along with the HTTP server), which I call the JobMaster. It is the command and control center of a DAMPP system. See the DAMPP high-level protocol diagram for the big picture.
DAMPP high-level protocol
Here's the scenario of the two components (WorkerApplet and JobMaster) interacting with each other:
A user surfs the Web and hits the page containing the worker applet. The worker applet loads and starts running immediately. The user is not even aware of this fact since the applet is totally invisible; it does not boast flashy animations or a GUI. It is simply a harmless visitor process in the machine.
The first thing the applet does is call home to tell the JobMaster it arrived safe and sound, and is ready to do some work (Step 1 in the DAMPP high-level protocol diagram). The JobMaster responds by issuing the applet with a job "specification" (Step 2). This is a parcel of information that allows the applet to start working on a sub-part of the big problem being attacked (in our raytracing example application, each applet calculates individual image lines).
Since Web users don't surf the Web for days on end, the applet knows it's living on borrowed time. At any moment the user can decide that surfing real waves is more exciting than surfing cyberwaves, and so kill his browser along with our poor applet. Worker applets therefore are designed to be dispensible, thus conforming nicely to our current throw-away consumer society <sigh>. DAMPP is designed to cope with applets being killed prematurely; all it does is eventually re-issue the same job to another applet on another surfer's computer.
Whenever an applet does manage to complete its calculations, it contacts the JobMaster again to hand over its results (Step 3) - and then, like the male bee for which there is no more use, commits suicide.
If we switch our perspective to that of the JobMaster, we see two types of incoming communications: job requests and returned results. To simplify implementation and enhance performance, these two very different streams will use separate channels and both rely on UDP (User Datagram Protocol) for the underlying communication protocol. UDP is one of the TCP/IP protocols, sitting between IP and TCP as far as complexity and capabilities are concerned. I chose the UDP protocol over the more-frequently used TCP protocol for three reasons:
- UDP's connectionless communication mode is perfectly suited to our design
- UDP consumes less Internet bandwidth than TCP for the same amount of information transferred
- UDP is faster
UDP has one disadvantage: It does not guarantee delivery of information. (That is why TCP, which does guarantee delivery, is used more often.) But since the number of applet hosts on the Internet is, in all practicality, limitless, any lost results or any applet-server rendezvous gone wrong can be safely ignored. This is analoguous to your body not blinking an eyelid at the death of one of its own cells: Statistically and practically, it is a meaningless event. The design of the DAMPP computation engine can therefore rely, in a similar vein as our bodies, on massive built-in redundancy to achieve its goals with a very high degree of probability.
The overall picture is now clear: A swarm of applets passively and invisibly are being executed by client machines all over the world, contacting their home server to request jobs, executing those jobs and returning results to the server.
DAMPP implementation details
We are now ready to talk Java code!
The whole DAMPP system consists of just five classes: three classes for the applet and three for the server (one class is shared by both). These classes are, for the applet:
- RayTrace (which extends WorkerApplet, instead of java.applet.Applet)
for the server:
and shared by both:
At this proof-of-concept stage, the applet side already has been decomposed into a generic platform for parallel processing (WorkerApplet) and the application-specific code (RayTrace). Since this article is all about the distributed parallel processing, and not about raytracing, I shall not go into detail of how RayTrace works. So let's look at WorkerApplet's implementation instead.
WorkerApplet is a semi-abstract (or semi-concrete -- both views are valid) foundation class for any concrete applets wishing to be the parallel building block for a DAMPP parallel computation engine. Actually, WorkerApplet consists mostly of concrete parts; the only abstract part left to be implemented in subclasses is a
run() method to satisfy the standard Java
Runnable interface. This method ensures that your application applets execute as a thread in the client's browser, thus allowing other applets to execute concurrrently with the DAMPP applet.
WorkerApplet is an abstraction layer meant to hide all the DAMPP communication details between applet and server. To this end WorkerApplet provides two methods:
- byte requestJob()
- void returnResults(byte array)
These two methods correspond directly with steps 1 and 3 of the applet-server protocol shown in the DAMPP high-level protocol diagram.
requestJob() initially is used by an applet to find out what part of the larger problem it should tackle. The return type of the method is
byte which means that an unstructured array of bytes is returned. This is because WorkerApplet does not impose any restrictions on the type or structure of the job specification message issued by the server. In the case of the raytracer demonstration application, the job specification essentially consists of the scanline number (the y-coordinate) of the image scanline that the applet should generate. Other applications would use completely different job specification messages; this is left entirely to the specific application.
Once the calculations are done (and these should be designed to require no more than a couple of minutes at most, lest you risk loosing too many applets prematurely due to them being killed off by users exiting their browsers), the applet sends its results back to the server via method
returnResults(). This method also uses an unstructured byte array as an argument to allow any type of information to be returned. In the case of our raytracer, a complete scanline of 24-bit color pixels is returned each time.
Since the system relies on an exchange of UDP packets (which are analogous to telegrams), and not on TCP-style persistent connections (which are more analoguous to telephone connections), the server needs a reply address to be able to reply to the applet.
Job Request UDP packet structure
As mentioned before, the results produced by the applets are transmitted back to the server using a different "channel" than that used for job requests. More specifically, the results get transmitted to one of many reply ports managed by the server. Each of these reply ports is also handled by its own thread. This arrangement performs more effectively than relying on a single port, since the result packets need a lot more processing on arrival than the simple job request packets emanating from the applets. If a single thread were to be responsible for just one global results port, then it would be overrun by packets and data would be lost. So by multiplexing the results channel over several ports and several threads, incoming data is much less likely to cause a server overrun. The actual number of reply ports is set to 3 by default, but can be configured by the webmaster to any number between 1 and 100.
The job specification UDP packet is mostly application-specific, except for the first 2 bytes of the packet. These encode the reply port to address when sending the results back. This small field is stripped from the packet before the packet is passed to the concrete applet that invoked
requestJob(). For the record, our raytracing example simply adds a single 16-bit scanline number that completely specifies what the applet has to do (that is, the job spec). This means the total size of the job spec packet for our raytracer is just 4 bytes.
Let us fast-forward past all the raytracing code (the application-specific part) and follow events from the point where the applet calls
returnResults(). The application applet deals only with its own application-specific messages, so it does not specify any reply ports to
returnResults(). The underlying WorkerApplet stored the port address from the incoming job spec, so it can now send the results to that port.
Assuming all UDP packets arrived safely, here's what happens on the applet side:
- Get job spec
- Execute job
- Return results
A multitude of threads
The design of the JobMaster server is somewhat more complex as it relies on multiple threads servicing multiple UDP ports. The general structure is one of a single "master" thread and a configurable number of slave threads. The master thread simply loops endlessly, waiting for job requests to arrive and immediately replying with job specifications. You can follow the exact implementation of all this in method
Like the master thread, each slave thread is responsible for one receive UDP port. Both entities (thread and port) are encapsulated in an instance of class
ResultsReceiver thread loops forever, "catching" the results from applets and passing these results to the main program. (See the
run() method for the exact implementation).
Back in the JobMaster class, the results are processed in the
collateResults() method, whose implementation is application-specific. In our raytracing example, the results are single scanlines of the image being generated. These scanlines are copied to an off-screen Image, which is used to redraw and refresh a window on the server console where the progress of the parallel task can be followed.
Classic multithreading pitfalls
At this point I need to explain more about how the server assembles the raytraced picture and how exactly jobs are issued. Although these are application-specific, the issues tackled for the raytracer will crop up in most DAMPP applications. To manage the whole process of distributing the raytracing algorithm to remote processors, the server maintains a central data structure: a simple array of booleans, with one boolean for each scanline of the image. In other words, IMAGE_HEIGHT flags. These flags are used to track which scanlines already have been generated and received, and which haven't.
When the server receives a job request, it simply picks the next scanline which has not yet been received, and tells the applet to work on this scanline. When method
collateResults() receives a scanline from one of the receiver threads, it first checks whether this scanline is actually outstanding. The fault-tolerant design of DAMPP means more than one applet theoretically can be working on the same problem, so when two duplicate results are received, the result arriving last will be rejected as being redundant. This does not pose any problems, and in fact is just a side-effect of the robustness of the system.
The potential problem, though, lies in different threads accessing shared data structures and corrupting the structures' state. The core flags array is accompanied by a simple
(int) counter that also tracks how many scanlines have been successfully received. These two objects form a non-atomic whole, which needs to be kept rigorously in sync for the server to perform its essential bookkeeping. Therefore a locking mechanism is required to modify or access the two objects atomically. This is achieved using Java's synchronized keyword.
Usually, the synchronized keyword is used on methods to mean that the entire method needs to be atomic with respect to some Java object. But when implementing multithreading systems, you have to watch out for a potential performance bottleneck: If many threads need to access a common structure, locking and releasing of this structure needs to be done as swiftly as possible. Otherwise you risk creating long queues of threads waiting to access the shared structure. You should only lock (in other words, use the synchronized keyword) on the set of statements which absolutely require them. In
collateResults() I synchronize the code as tightly as possible, around the flags array object.
Warning: Prototype ahead!
First of all, the minimalistic reliance on UDP for all JobMaster-WorkerApplet communications means the rendezvous protocol currently is very brittle. If a job request or job specification UDP packet ever gets lost during this initial stage of the protocol, then the (current) WorkerApplet fails to detect this, resulting in the applet becoming silent (forever waiting for a UDP packet that will never arrive). The upcoming 1.1 java.net API contains the most convenient solution: enhanced UDP support that will allow timeouts to be generated when waiting for a UDP packet. These timeouts can then trivially be exploited to allow the applet to retransmit its packets until a reply finally gets through.
Secondly, at the declaration level, class
JobMaster is declared as being a subclass of
java.awt.Frame. This is because the
JobMaster pops up an AWT window to display the raytraced image as it is being calculated by the World Wide Web itself. The window also contains a couple of buttons to reset the picture (start from scratch) and quit the server. This means the
JobMaster class contains an event handler method to deal with button events too.
Since the discussed implementation of the DAMPP system is a proof-of-concept prototype, the server combines (rather monolithically) both application-specific code and code that could be abstracted into the server equivalent of
WorkerApplet -- that is, an application-independent foundation class that programmers could subclass and extend to deal with any suitable parallel problem. Decomposing the current server into independent parts would be the next step towards a more flexible DAMPP implementation.
Joining the DAMPP club
The DAMPP system will not handle every type of parallel problem well. Problems ideally suited to DAMPP should exhibit the following characteristics:
- application-specifc computations should be expressible in a small program (since applets need to contain most of this code)
- sub-problems should be easily definable using few parameters (this is to keep job spec packets small)
- nodes tackling a sub-problem should not have to exchange data with other nodes
- the computed results should be of reasonable size (otherwise the
JobMasterwill not be able to handle the load of the results stream)
- the problem should be allowed to run for hours or days
If a DAMPP system is used within the confines of a company's intranet, using a fast LAN backbone as the communication medium, then all these requirements can be relaxed or even ignored. The third requirement, though, would mean additional work on
WorkerApplet before a flexible system would be obtained.
Mass hysteria, ethical considerations and "You scratch my back, I scratch yours."
I can already hear the storm of shouts: "Viruses! He's distributing computational viruses on the Net!" Calm yourselves. Nothing is further from the truth. DAMPP applets should be embedded only in pages that have been clearly labeled to contain such applets. Users will be able to use their own free will, anywhere in the world, to decide whether they wish to have their (currently wasted) background CPU power borrowed and employed to a useful end or not.
One way to counter the inevitable segment of people who will refuse to donate some of their idle CPU power is to use a DAMPP system on sites where users are currently getting a lot of goodies for nothing: news sites, shareware sites, entertainment sites, software upgrade sites, etc… All of these sites could quite ethically, in my humble opinion, trade some of their valuable Web content for some minute amount of processing on client machines. After all, Web surfing and downloading from the Web are two activities that leave the client processor idle most of the time (yes, even with downloads), so only the selfish will object to this spare processor capacity being used by the site from which he or she is taking so much. The alternative means the processor is only used to generate heat, and nothing more.
What do you think has been running in the background while you were reading this article?
That's right! The DAMPP raytracing applet.
If you enable your Java console and check its contents, you should see that the applet has (most likely) already finished its brief calculation and returned its results to our server.
If the applet happens to still be running, and you strongly object to this applet's presence, just quit your browser and the applet will die with it… but possibly think again after the deed; your machine now sits idle again.