Survival of the fittest Jini services, Part 2

Understanding reliability in distributed transactions

Since no network functions reliably all the time, developers need
methodologies and tools to construct systems that remain dependable
despite intermittent network failures. In this article, Frank
Sommers argues that developers should distinguish between the
systems themselves and the computations they wish to perform on
them, as the two might have different characteristics of
reliability. 

Failure, according to Merriam-Webster's Dictionary, is a state in which something is "unable to perform a normal function." Inasmuch as a network's normal function is to transmit information between two or more hosts, experience shows that most networks often cannot perform that function as expected. In other words, failure is as much a characteristic of the network as is its normal operation.

In many aspects of life, we have learned to live in the presence of failure. In a large city, many new shops spring up all the time, while others close their doors for good. Unless you are a shop owner, you are not likely to lose sleep over that fact. Instead, you are interested in being able to obtain the goods you are looking to buy, at a reasonable price and in close proximity.

Taking a similar approach to building Jini-based distributed systems might be helpful. We cannot make a large network, such as the Internet, more reliable. But we can make the computations we wish to perform over that network as reliable as possible. Your users -- whether people or other Jini services -- are primarily interested in the computations your service provides. Ensuring the reliability of those computations in the presence of network and component failures will likely lead to your service's longevity.

By reliability I mean a set of guarantees that hold, no matter what. In other words, as long as the computation produces a result, that result should keep with a set of guarantees. If the computation cannot ensure those guarantees, then it should abort and not return a result.

We are all familiar with this notion of reliability: When people wish to accomplish a goal together, they typically agree to a verbal or written contract, which thereafter binds each party to its terms. Thenceforth, the participants perform all their actions related to the common task in accord with those promises. And, should the parties fail to keep their promises, all actions under contract will produce "unreliable" and unpleasant results.

The equivalent of such "rules of the game" between components of software-based distributed systems is the transaction: components participating in a computation agree to a set of rules, and each component thereafter adheres to those rules during the computation.

In distributed systems, such as Jini networks, components typically reside on distant network hosts. This is significant, because it means that no component can, by itself, ascertain whether the other components adhere to the rules. A component can only implement the rules and then communicate to the others that it, indeed, keeps with those rules.

Distributed transactions, therefore, are made up of the rules (semantics) by which the services must abide, and a coordination mechanism between the services that ensures that the rules hold for the whole computation. If even one service indicates that it cannot guarantee its promise, that mechanism will abort the transaction.

The problem of four generals

The story of the four generals, inspired by Leslie Lamport's "Byzantine Generals," illustrates the kind of guarantees distributed transactions must promise, and the way participants in it might communicate. In this example, the generals and their armies are metaphors for distributed services, and carrying out their battle plan is analogous to a distributed computation. This scenario is known as the coordinated attack problem.

The generals, each commanding an army, plot to capture a medieval fortress. Alone, no one army can force the besieged defenders to surrender. Together, however, they are more than a match for the city's defenses. Therefore, to win, the generals agree to fight only given the following battle conditions:

  • They must all attack at the same time. If any one army calls off the attack, the others must immediately retreat.
  • None of the armies may violate its own internal rules during the battle.
  • The attack must be a surprise. All preparations must be kept secret and made in isolation from all but the generals' most trusted confidants.
  • Victory must be permanent; for instance, the armies must be ready to occupy the city after the battle.

The four conditions for the generals' battle plan are the ACID guarantees:

  • A stands for atomicity: Either all the armies attack, or none of them do. One or two armies attacking would cause the battle to be lost, and is not permissible.
  • C means consistency: The armies must maintain their internal rules for order (consistency).
  • I stands for isolation: All the preparations for the attack must be hidden from those not involved in its planning. If the attack is called off, no one outside the generals' close circle should sense that any activity has taken place.
  • Finally, D implies durability: The results of the battle must survive the fight itself.

Next, the generals need a way to coordinate their activities. They settle on the following communication protocol:

  • Prior to the battle, each army makes the necessary preparations. When each is ready, its commanding general lets the others know that he is prepared to move forward.
  • Once each general is sure that all the others are prepared, he sends another message to the effect that his troops are now committed to battle (in effect, are marching against the fortress).

This protocol consists of two phases: The first indicates preparedness; the second, a fully committed state. It is often called the two-phase commit protocol -- or 2PC protocol, for short. Jini services participating in a distributed computation must use a similar mechanism to coordinate a transaction's completion, or commitment.

The only remaining concern for the generals is how to exchange messages. To indicate preparedness, each general must send a message to the others. Between the 4 generals, 12 messages are exchanged for the protocol's first phase. For N generals, N * (N-1) messages must be sent for each stage. This is bad news: If additional armies were involved in the attack, many more messages would be needed. For 10 generals, this arrangement would require an unmanageable 90 messages. Should any message get lost, the battle could not begin, since the generals could not be sure that the conditions were right for attack.

Instead of sending messengers directly to each other, the generals could decide to set up a central command post. Each general would send a messenger only to this command post to obtain the status of the other armies. With this arrangement, only two messages from each army are passed for each protocol stage: one delivering a general's message, and the other coming from the command post with an order to either proceed with or abort the plan. With this in place, only 8 messages are needed to indicate battle preparedness by our 4 generals. If 10 generals must coordinate their movements, then introducing the central post reduces the required messages from 90 to only 20. This command post is called the transaction manager, or coordinator, for the 2PC protocol.

Figure 1. Communication messages for the two-phase commit protocol

A Jini bookstore

The Jini Distributed Transactions Specification defines a transaction manager, which is a Jini service, and also describes transaction participants and transaction clients. Together, these entities make up a Jini distributed transaction. In addition, the spec defines default transaction semantics for the ACID properties. The net.jini.core.transaction and net.jini.core.transaction.server packages provide the API for services to interact with the transaction manager, and also offer classes for the default transaction semantics.

By separating transaction semantics from a coordination mechanism, the transaction specification allows for other, user-defined transaction semantics. These semantics might promise guarantees other than, or in addition to, the ACID ones, but transactions using those semantics could still employ the 2PC protocol.

To illustrate the benefit of transactions in service-to-service interaction, we will construct a Jini bookstore service. Like any bookstore, the service lets you search for and order books. Unlike most bookstores, however, its implementation relies on other Jini services for payment processing and order shipping. In this and the next article in this series, we will dissect the BookStore service to see how it provides high reliability even in the presence of intermittent network failures.

The bookstore makes available on the Internet (possibly in public lookup services) a Jini proxy object, which exposes something similar to the following service interface:

public interface BookStore {
  public Collection findBooks(Book template)
    throws java.rmi.RemoteException;
  public OrderConfirmation buyBook( Book book,
                                    Account creditCard,
                                    Customer customer,
                                    Address shipTo,
                                    int daysToDelivery)
    throws NoSuchBookException, CreditCardException,
      DeliveryException, BookStoreException,
      java.rmi.RemoteException;
}

The findBooks() method consumes a template and returns a collection of Book objects satisfying the template's specified fields (for instance, the author's name). The buyBook() method is more involved. It requires us to specify the desired book, as well as objects representing a credit card, customer information, a shipping address, and the number of days in which we want the book to be delivered. A successful purchase returns a confirmation, which includes the information a customer would need to make a delivery complaint. The buyBook() method declares a number of runtime exceptions to indicate failure in processing the purchase request.

Credit card companies provide Jini services to facilitate account debits and credits. The interface of the CreditCard service might be as follows:

public interface CreditCard {
  public ChargeConfirmation debit(Account account, Charge charge) 
    throws NoSuchAccountException, CannotChargeException, 
       CreditCardException, RemoteException;
  public PaymentConfirmation pay(Account account, Payment payment) 
    throws NoSuchAccountException, CreditCardException, 
      RemoteException;
  public CurrentBalance getBalance(Account account) 
    throws NoSuchAccountException, CreditCardException, 
      RemoteException;
}

The methods of this interface let the user charge her account, make payments, and inquire about the current available balance. Each method returns an object representing the result of the action, or, if the action did not succeed, any declared exception.

The final piece in the bookstore puzzle is the shipping company. Its Jini service proxies offer the following functionality:

public interface ShippingCompany {
  public PickupGuarantee checkPickup(Address origin, 
                                     Address destination, 
                                     PackageDesc package, 
                                     int daysToShip) 
    throws ShippingException, RemoteException;
  public PickupConfirmation schedulePickup(PickupGuarantee guar) 
    throws NoSuchGuaranteeException, ShippingException, 
     RemoteException; 
}

The checkPickup() method requests the origin and destination addresses, a description (including the package's approximate weight) and the requested number of delivery days. If the shipping company can deliver the package within the specified timeframe, it returns a PickupGuarantee object. This object contains the delivery price and an expiration time that indicates how long the guarantee remains valid. On the other hand, if a shipping company cannot guarantee the requested delivery, the method returns a null value.

Ideally, we'd want the best delivery price. Therefore, we'd inquire with many companies by calling checkPickup() on their service objects. This way, we can trade time for money: the more companies we inquire with, the better price we might obtain -- although it takes longer to make all those method calls. (A shipping company might offer a good price, but set a short expiration time for the PickupGuarantee -- in other words, if you act now, you can ship your package cheaply. This would be the equivalent of a sale on the Jini-enabled, service-oriented Web.)

Once we choose a shipping company, we must pass the appropriate PickupGuarantee object to its schedulePickup() method. That company then returns a PickupConfirmation object, representing a receipt for the scheduled package pickup. This method also declares a number of exceptions, should a problem occur when accepting the pickup request.

Figure 2. Interaction of services in support of a Jini bookstore

The BookStore service must provide the ACID guarantees when buying a book:

1 2 3 4 Page 1
Page 1 of 4