Make room for JavaSpaces, Part 6

Build and use distributed data structures in your JavaSpaces programs

The design of any space-based application typically revolves around one or more distributed data structures. These are data structures that exist in a space where multiple processes can access and work on them at the same time -- something that is often difficult to achieve in distributed computing models.

As Figure 1 below shows, message passing and remote method invocation systems usually sequester data structures behind a centralized manager process; therefore, any process that wants to manipulate the structures will have to wait its turn and ask the manager to perform the operation on its behalf. In other words, multiple processes can't truly access a data structure simultaneously in these conventional systems.

Figure 1. Conventional systems barricade data behind a centralized manager process (graphic courtesy of Sun Microsystems)

Distributed data structures take a very different approach that decouples the data from any one process. Distributed data structures are represented as collections of objects that, as I said earlier, can be accessed and modified by multiple processes concurrently. As Figure 2 suggests, processes can work on different pieces of the structure at the same time without getting in each other's way.

Figure 2. JavaSpaces supports concurrent access to distributed data structures. (graphic courtesy of Sun Microsystems)

The design of any distributed data structure requires both a representation of the data structure and a protocol that processes follow to ensure that they can manipulate the structure safely and fairly. Now I'll explain how you can use entries to represent distributed data structures, and how you can use space operations to implement distributed protocols.

TEXTBOX: TEXTBOX_HEAD: Make room for JavaSpaces: Read the whole series!

Build distributed data structures with entries

Let's say you want to store an array of values in a space, and that you'd like the array to be accessible and modifiable by multiple processes simultaneously. For example, suppose your array holds today's highest temperatures for all 50 US states, and that weather stations in the various states will need to read and alter the values throughout the day.

The first approach you might think up is to store the array in an entry, like this:

public class HighTemps implements Entry {
    public Integer[] values;
    public HighTemps() {
    }
}

If a weather station wants to read or alter the high temperature for its state, it must read or remove the entire array. With this scheme, you haven't yet created a truly distributed array, since multiple processes cannot access or modify the array values simultaneously. The entry itself can become a bottleneck if many processes are trying to access the array concurrently.

You can try a different approach and decompose the temperature array into a collection of entries, each of which represents a single array element. Here's how you would represent an array element:

public class HighTemp implements Entry {
    public Integer index;
    public Integer value;
    
    public HighTemp () {
    }
    public HighTemp(int index, int value) {
        this.index = new Integer(index);
        this.value = new Integer(value);
    }
}

Each HighTemp entry has an index (the element's position in the array) and a value (the element's temperature value). Here's how to initialize a 10-element array (leaving out the try/catch clause for simplicity):

for (int i = 1; i <= 10 ; i++) {
    HighTemp temp = new HighTemp(i, -99);
    space.write(temp, null, Lease.FOREVER);
}

This loop deposits 10 HighTemp entries into the space, initializing each with an index and a value of -99. Here's how you would read one of the temperature entries -- the fifth one, in this example:

// create a template to retrieve the element
HighTemp template = new HighTemp();
template.index = new Integer(5);
// read it from the space
HighTemp temp = (HighTemp)space.read(template, null, Long.MAX_VALUE);

To modify that temperature, you take the temperature entry from the space (using the same template), update its value, and then write the entry back to the space as follows:

// remove the element 
HighTemp temp = (HighTemp)space.take(template, null, Long.MAX_VALUE);
// alter the value
temp.value = new Integer(105);
    
// return the element to the space
space.write(temp, null, Lease.FOREVER);

This example demonstrates the basics of building and using a very simple distributed data structure. Since this distributed array is represented by multiple entries in the space, multiple processes can access and modify it concurrently -- for example, one process can modify element 0 without interfering with another process modifying element 48.

Along with the distributed data structure's representation, you've seen the protocol that processes follow to deal with the array. To read an array value, a process reads that value's entry in the space. To modify an array value, a process first takes the entry from the space, updates its value field, and then writes the entry back to the space. You could augment this ultra-simple protocol in various ways: for example, you could also store the size of the array, copy a value from one element to another, or delete an element.

While the distributed array data structure has a counterpart in the sequential programming world, you'll find that some distributed data structures are completely new. In the compute server example discussed in Part 2 and Part 4 of this series, you already encountered a distributed data structure that has no sequential analog: bags (which were used for holding task and result entries). Let's take a closer look.

Unordered structures: Bags

Unlike sequential programs, space-based programs often make use of unordered data structures -- in other words, collections of objects. These collections are sometimes called bags, because they are like bags of objects. With ordered collections, you pinpoint and manipulate specific elements in the structure; with unordered collections, you perform just two basic operations: put (i.e., add a new object to the collection) and get (i.e., remove and return any object from the collection). Think of a box that holds raffle tickets. If you want to participate, you fill out a ticket with your information and just throw it into the box, without caring where it goes. When the contest is over, someone reaches into the box and grabs an arbitrary card -- any card will do. Bags are very easy to create in space-based programs: to create one, you simply define an entry class and write as many instances of that entry into the space as you want.

You've already seen the use of task bags and result bags in the compute server example earlier in this series. Recall from that example that a master process divides a large problem into a number of tasks and writes task entries into a task bag in the space. One or more worker processes look for these tasks, remove them from the space, compute them, and write result entries back into a result bag in the space. The master process looks for and collects these result entries and combines them into a meaningful result, such as a ray-traced image or decrypted password.

Besides their use in parallel computations, task and result bags can be used by processes to request and supply services without having to know exactly who they are requesting a service from or supplying a service to. For example, a process that needs some kind of service can drop a task entry into the task bag, and any available service that knows how to fulfill that task can pick it up, perform it, and write the result entry to the result bag, from which the original process will retrieve it.

Ordered structures: Channels

When people first see the JavaSpaces API, they often ask, "How can I get a list of all the entries in a space?" It's a legitimate concern, since programmers often want to iterate over a set of objects. The simple answer is that the API itself provides no way to iterate over all the entries in the space. You can think of a space itself as just a big bag of entries -- an unordered collection of objects. The good news is that you can impose your own structure on a space by building various ordered distributed data structures within it. Once you've done this, entries can be enumerated by performing simple operations on those data structures.

In the remainder of this article, I'll explore how to create and work with one very useful type of ordered distributed data structure -- a channel. You can think of a channel as an information pipe that takes in a series of objects at one end and delivers them at the other end in the same order. At the input end, there may be one or more processes piping in messages or requests. At the output end, there may be one or more processes reading or removing these objects.

Channels turn out to be very versatile data structures, and can be used to achieve a variety of communication patterns in JavaSpaces programs (refer to Chapter 5 of JavaSpaces Principles, Patterns, and Practice for more details). In the rest of this article, I'll walk step-by-step through the details of implementing one kind of channel distributed data structure, and show you how it is central to a real-world program -- a distributed MP3 encoding application. Before diving down into the details, I'll give an overview of the application we'll be building.

A distributed MP3 encoding application

Most folks have heard of MP3, a wildly popular sound compression format. Sounds encoded in MP3 format almost match CDs in quality, but MP3 files are much smaller than the equivalent files on a CD. If you've ever played around with digital audio files, you may have performed encoding -- the process of taking uncompressed digital audio data (for example, wav files on your PC) and compressing them according to a specific compression scheme such as MP3. MP3 encoding can be a computationally intensive process, especially if the original wav file is large or you have a large collection of wav files to encode.

You might imagine submitting your wav files to a high-powered, net-connected MP3 encoding service that would perform the wav-to-MP3 encoding for you and return the MP3 data. If many users simultaneously send encoding requests to a single MP3 encoder, the requests would need to queue up until they could be serviced. However, since the encoding tasks are completely independent of each other, they lend themselves well to being serviced in parallel by a farm of MP3 encoders. By using a space, you can easily build a distributed MP3-encoding application; Figure 3 shows the architecture of the system.

Figure 3. The architecture of the distributed MP3-encoding application

At first, you might think that your distributed application should just use bags, as the compute server did. In other words, the application could revolve around a bag of MP3 requests (deposited by MP3Requester processes and picked up by MP3Worker processes) and a bag of MP3 results (deposited by MP3Workers and picked up by MP3Requesters). This scheme has one major drawback: there is no guarantee of fairness. In other words, there's no assurance that some newer encoding requests won't be serviced before older requests. In fact, some requests might languish forever and never get picked up, while others are serviced. Ideally, you'd like for encoding requests to be handled on a first-come, first-served basis, with a guarantee of eventually being serviced (provided that at least one MP3 worker remains up and running to perform encoding). It turns out that a channel distributed data structure is exactly what you need to accomplish those goals.

As you can see from Figure 3, MP3Requester processes add MP3 encoding requests to the tail end of an MP3 Request channel, and MP3Worker processes take the requests off the head end of the channel, in order. An MP3Worker calls a piece of third-party software to perform the actual wav to MP3 encoding; when that software generates the MP3 data, the worker writes the data to a result bag in the space (you can assume for this example that the order in which results are picked up is not important). In the meantime, the MP3Requester processes constantly monitor the space to pick up and display results that have been tagged for them.

Now that you have the aerial view of the application, I'll dive down into the details.

The MP3 Request channel

To implement the MP3 Request channel, you use an ordered collection of MP3Request entries -- one for each request in the channel. Each MP3Request entry holds its position in the channel. For instance, the third request to be added to the channel has a position of three (assuming that the request sequence starts at one).

1 2 3 Page
Join the discussion
Be the first to comment on this article. Our Commenting Policies
See more