Survival of the fittest Jini services, Part 1

Ensure the quality of Web services in the age of calm computing


Guaranteed 'quality of service'

Jini's infrastructure is designed to encourage the elimination of unreliable services. First, a service participates in the Jini federation based on a lease. The service's implementation must renew the lease periodically, or it will expire and the service will no longer be advertised in lookup services. Even if a lease is kept current, a client might experience network or other communication problems with the service's implementation. A Jini client could eliminate such a service's proxy from its lookup cache. Further, it could record service implementations that often produce errors and exclude them from future lookup requests.
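As a rough illustration of that client-side bookkeeping, consider the following sketch. It assumes Jini's ServiceItemFilter interface (from net.jini.lookup) and tracks error counts per ServiceID; the threshold of three failures is an arbitrary choice, not something the Jini specification prescribes.

import java.util.HashMap;
import java.util.Map;

import net.jini.core.lookup.ServiceID;
import net.jini.core.lookup.ServiceItem;
import net.jini.lookup.ServiceItemFilter;

// Records how often each service implementation has failed, and filters
// out services that have crossed an (arbitrary) error threshold.
public class FlakyServiceFilter implements ServiceItemFilter {

    private static final int MAX_FAILURES = 3; // arbitrary threshold

    private final Map<ServiceID, Integer> failures = new HashMap<ServiceID, Integer>();

    // Call this whenever a proxy obtained from lookup throws a RemoteException.
    public synchronized void recordFailure(ServiceID id) {
        Integer count = failures.get(id);
        failures.put(id, count == null ? 1 : count + 1);
    }

    // ServiceItemFilter callback: reject services with a poor track record.
    public synchronized boolean check(ServiceItem item) {
        Integer count = failures.get(item.serviceID);
        return count == null || count < MAX_FAILURES;
    }
}

A client could pass such a filter to its lookup requests, so that flaky implementations simply stop showing up in results.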

On today's Web of HTML pages, dependability is, at best, an anecdotal property: People talk about Websites being more or less reliable, more or less fast in responding to requests, and more or less accurate in terms of the information they provide. We need to pin down dependability far better if we plan to truly depend on these ubiquitously available information systems.

Computer scientist Gerhard Weikum has studied the notions of dependability of Web-based information systems. He argues that a service-oriented Web will only start to evolve when services strive toward a guaranteed quality of service (QoS). (See Resources for a link to Weikum's online papers.) Such guarantees have both qualitative and quantitative dimensions.

Qualitative QoS means that the service produces results that are acceptable in some human context. For instance, using a flight reservation system should mean that an airline seat is reserved, a credit card is charged, and a confirmation is received. Just one or two of these actions (for instance, charging a credit card without reserving the seat) would not be acceptable from a business perspective.

Quantitative QoS implies that a service provides acceptable performance in some measurable dimension. For instance, a service that provides streaming video must deliver a certain number of frames per second to achieve smooth playback. Or, a reservation system must respond to a reservation request within a certain time period (a few hundred milliseconds).

A software service can verify quantitative QoS guarantees relatively easily: If our reservation system does not respond within the specified time frame, a client service can conclude that it is not meeting its QoS guarantees. Other things being equal, the client could then look for another service that performs better.
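Such a check can be as simple as timing the call. The fragment below is a minimal sketch under assumed names: the ReservationService interface and its reserve() method are hypothetical, and the 500-millisecond budget stands in for whatever guarantee the service advertises.

import java.rmi.RemoteException;

public class QosCheck {

    // Hypothetical service interface; only the timing pattern matters here.
    public interface ReservationService {
        int reserve(String flight, String passenger) throws RemoteException;
    }

    private static final long MAX_RESPONSE_MILLIS = 500; // assumed QoS guarantee

    // Returns true if the call both succeeds and meets its response-time guarantee.
    public static boolean meetsQos(ReservationService svc) {
        long start = System.currentTimeMillis();
        try {
            svc.reserve("LH 457", "F. Sommers");
        } catch (RemoteException e) {
            return false; // a communication failure also violates the guarantee
        }
        return System.currentTimeMillis() - start <= MAX_RESPONSE_MILLIS;
    }
}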

The Rio project has already proposed a QoS notion for Jini services. QoS in Rio is based on Jini services having a set of requirements to properly execute in a runtime environment (such as bandwidth, memory, or disk requirements), and a runtime environment having a well-defined set of capabilities (such as available memory, disk space, bandwidth, and so forth). Rio strives to provide the best possible execution guarantees to a Jini service by matching a service's requirements with an environment's capabilities.

The notion of specifying a software service's requirements can be extended with a stochastic scale of (quantitative) service quality under a given environment's capabilities. For instance, a streaming video service could specify that it can deliver X number of frames per second, given a network bandwidth of N. A client could obtain this information from a service's service item, which could provide Entry objects for this purpose. These numbers would still only be approximate, but they would help clients locate services that match a required QoS level.
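In Jini, such attributes travel as Entry objects in the service item. The sketch below shows one assumed shape for such an attribute: Jini entries must be public classes with public, object-typed, serializable fields and a no-argument constructor, but the field names here (framesPerSecond, requiredBandwidthKbps) are illustrative, not part of any standard entry.

import net.jini.entry.AbstractEntry;

// A hypothetical QoS attribute a streaming video service could attach to
// its service item. Lookup matching compares entry fields one by one.
public class VideoQos extends AbstractEntry {

    public Integer framesPerSecond;       // the X frames per second above
    public Integer requiredBandwidthKbps; // the bandwidth N assumed for that rate

    public VideoQos() {} // entries require a public no-arg constructor

    public VideoQos(Integer framesPerSecond, Integer requiredBandwidthKbps) {
        this.framesPerSecond = framesPerSecond;
        this.requiredBandwidthKbps = requiredBandwidthKbps;
    }
}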

Many service qualities are not so easy to measure. Quality, at any rate, is often in the eye of the beholder. It is helpful at this point to establish some terminology, some of which Weikum also describes as important for Web-based information services.

Qualities of a service

Reliability is the guarantee that, within a given period of time, a service won't break in a way that the user (human or another service) notices the failure. For instance, if you reserve an airline seat, but no record of your reservation exists when you arrive at the airport, you know that the system is not reliable.

The trick is to make highly reliable systems out of unreliable components. John von Neumann, author of The Computer and the Brain, was the first to study this problem in the 1940s (see Resources); in his time, computer parts were notoriously flaky. In his work on the logical theory of automata, von Neumann strove to account for the probability of failure in a complex system. He presented a method that applies probabilistic logic to describe partial failure as an integral part of a system, and created a mathematical model of the human brain with a failure rate of once in 60 years. To achieve that level of reliability, he required 20,000-fold redundancy: when a part failed, there had to be about 20,000 spares. Nowadays, disk mirroring, RAID, and redundant network connections are common ways to build a reliable system out of unreliable parts.

Jini is the first system for distributed computing designed with the explicit acknowledgment that every component of a distributed system, including the network, is unreliable. Because it faces this issue squarely, it is the first successful infrastructure for building Internet-wide reliable systems out of unreliable components.

Availability means that the system is up and running. Periodically, every system fails, resulting in downtime. When that happens, the system needs repair before it can work again. How often a system fails is measured by the mean time between failures (MTBF). Modern hardware has an incredibly high MTBF. Disk-drive maker Western Digital, for instance, specifies an MTBF of 300,000 hours: "If a disk drive is operated 24 hours a day, 365 days a year, it would take an average of 34 years before this disk drive will fail." (See "What Is Mean Time Between Failure (MTBF)?")

A fun exercise is to find out the MTBF of equipment in your office or house (your TV set, refrigerator, or car). Then compare those figures to the MTBF of the operating system on your personal computer -- when did you have to reboot your machine last? In the world of competing, service-oriented software on the Web, users (human and software alike) will likely demand that software service providers give guarantees similar to the one Western Digital offers.

Often a system component fails, but the system can mask the failure, thus increasing system-level MTBF. Sectors on magnetic media, for instance, often fail, but disk drives and drive controllers can mask that by remembering the failed sector and avoiding it in future write requests.

If you deploy a software service that relies on other services, as illustrated in Figure 2, you might consider a similar strategy: keep a redundant set of component services, and, when one fails, mark it so that your service won't use it again for a while. This technique, of course, reinforces the survival of only the fittest services, since the flaky ones will be universally avoided.
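A minimal sketch of that strategy follows. The ComponentService interface and its perform() method are hypothetical stand-ins for whatever your service delegates to, and the ten-minute quarantine period is an arbitrary choice.

import java.rmi.RemoteException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Tries each redundant component service in turn; a service that fails is
// quarantined for a while, so that subsequent requests skip it.
public class FailoverInvoker {

    public interface ComponentService { // hypothetical component interface
        Object perform(Object request) throws RemoteException;
    }

    private static final long QUARANTINE_MILLIS = 10 * 60 * 1000; // arbitrary

    private final Map<ComponentService, Long> quarantinedUntil =
            new HashMap<ComponentService, Long>();

    public Object invoke(List<ComponentService> replicas, Object request)
            throws RemoteException {
        long now = System.currentTimeMillis();
        for (ComponentService svc : replicas) {
            Long until = quarantinedUntil.get(svc);
            if (until != null && until > now) continue; // still marked as flaky
            try {
                return svc.perform(request);
            } catch (RemoteException e) {
                quarantinedUntil.put(svc, now + QUARANTINE_MILLIS); // mark it, move on
            }
        }
        throw new RemoteException("all redundant component services failed");
    }
}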

Suppose your service fails once a year -- it has an MTBF of one year -- and takes one hour to repair (a mean time to repair, or MTTR, of one hour). This means that your service will be unavailable for one hour every year. Availability is computed as: Availability = MTBF / (MTBF + MTTR). Thus, your service's availability is 8,760 (hours in a year) / (8,760 + 1), or about 99.989 percent -- close to "four 9s" availability.
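A few lines of arithmetic make the formula concrete. This snippet simply evaluates the figures from the text, including the one-minute failover case discussed below.

public class Availability {

    // Availability = MTBF / (MTBF + MTTR), both in the same time unit.
    static double availability(double mtbfHours, double mttrHours) {
        return mtbfHours / (mtbfHours + mttrHours);
    }

    public static void main(String[] args) {
        double hoursPerYear = 8760.0;
        // One failure a year, one hour to repair: ~0.99989.
        System.out.println(availability(hoursPerYear, 1.0));
        // Same failure rate, but failing over to a replica in one minute:
        // ~0.999998, a nearly two-orders-of-magnitude reduction in downtime.
        System.out.println(availability(hoursPerYear, 1.0 / 60.0));
    }
}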

Clearly, one way to increase availability is to reduce MTTR. A few years ago, I noticed that the new version of a popular operating system restarted much quicker than its predecessor did: If you must reboot, reboot quickly. Well, it's one way to increase availability ...

But a better way to increase availability is to have a replica of your service running on the network. When the primary version fails, client requests can fail over quickly to the replica. From the client's point of view, the failover counts as a repair (since it doesn't care which copy it uses, as long as the replicas behave identically). If your service fails over to a replica in one minute, for instance, you will have increased your availability to about 99.9998 percent (525,600 / 525,601), or five 9s -- a nearly two-orders-of-magnitude reduction in downtime. As a client, you should not wait too long before failing over to a replica, but neither should you fail over too hastily, lest you incur the failover overhead for what was only a transient glitch. This is yet another reason why services must provide explicit guarantees, or guidelines, to clients about their promised availability. Of course, as a service provider, you must then fix the broken service and make a replica available as soon as possible.

Integrity is the promise that the system will accurately capture, or facilitate, a business process or some aspect of the real world. A system's designer must explicitly ensure that this property holds true during the system's lifetime. For instance, an airline reservation system should capture all reservations for a flight -- it should never project an incomplete picture of this information, or allow the number of reservations to exceed the number of airplane seats.

I will not discuss security here in detail, but the notion of trusting a service becomes paramount if we want to see a worldwide object Web offering vast amounts of information. Clearly, the concepts of trust and security in this massive world of information and ubiquitous computing are far more complex than what exists today on the Web of HTML pages.

Ideally, we would want to verify that our system is secure, maintains its integrity, and provides good availability and reliability. Verifiability of a service's behavior, however, is a tricky issue. In an enterprise MIS setting, or one in which we control a software system's complete development, we can run tests that probe the system's salient properties. Because we know what the output should be for a given input, we can verify that the system behaves as we expect. But how do we run these tests when vital parts of the system are operated by entities over which we have no control, or when pieces of the system are discovered only at runtime?

Verifiability is an important way to establish trust, and therefore, to facilitate a ubiquitous object Web. Jini has a big advantage in this area over other distributed object-interoperation techniques. Jini relies on programming interfaces defined in the Java language for object-to-object interactions. Such interfaces extend the notion of object-oriented encapsulation to the network. An object's public interface is essentially its contract with the rest of the world: It promises to do what its public methods advertise. This promise is formulated both in terms of programming language (Java), and by way of a semantic specification typically provided in a human language. Given these two sources of information, a system's designer can verify whether a service on the network actually performs as promised.
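As an illustration of such a network-level contract, here is a hypothetical Jini service interface. The name and methods are invented for this example; the Javadoc comments carry the human-language half of the promise.

import java.rmi.Remote;
import java.rmi.RemoteException;

/**
 * Contract for a seat reservation service. The method signatures form the
 * machine-checkable half of the contract; these comments are the semantic,
 * human-language half against which a designer can verify the service.
 */
public interface SeatReservation extends Remote {

    /**
     * Reserves one seat on the given flight and charges the given card.
     * Either both actions happen, or neither does.
     *
     * @return a positive confirmation number once the seat is reserved
     */
    int reserve(String flightNumber, String cardNumber) throws RemoteException;

    /** Releases a previously confirmed reservation. */
    void cancel(int confirmationNumber) throws RemoteException;
}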

As a developer of Jini services, you must keep in mind that some operations might not be testable. Testing, in essence, means that we measure a system's state, S0, at time T0. We take similar measurements after performing a set of operations, giving state S1 at time T1. The difference between S0 and S1 is Delta(S0, S1). We then compare the difference our algorithms lead us to expect, Delta(expected), with Delta(S0, S1), and see whether the two match.
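In code, that measurement discipline might look like the sketch below; the Account interface and its methods are hypothetical, and the point is only the comparison of the observed delta with the expected one.

import java.rmi.RemoteException;

public class DeltaTest {

    // Hypothetical account interface, used only to illustrate delta testing.
    public interface Account {
        float balance(String cardNumber) throws RemoteException;
        int chargeCard(String cardNumber, float amount) throws RemoteException;
    }

    // Verifies that charging the card moves the observable state by exactly
    // the expected amount: Delta(S0, S1) should equal Delta(expected).
    public static boolean chargeIsObservable(Account acct, String card, float amount)
            throws RemoteException {
        float s0 = acct.balance(card);   // measure S0 at T0
        acct.chargeCard(card, amount);   // the operation under test
        float s1 = acct.balance(card);   // measure S1 at T1
        return Math.abs((s0 - s1) - amount) < 0.005f; // allow for rounding
    }
}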

A problem results when we can't measure either S0 or S1. Imagine a service that charges our credit card a given amount for the purchase of a good (such as an airline ticket), via the method int chargeCard(String cardNumber, float amount, Merchandise product), which returns a confirmation number. The service also has a method that lets us inquire about the current balance: float balance(String cardNumber). State S0 at time T0 shows a certain balance. Then we charge an airline ticket. However, the chargeCard() method fails to return a confirmation number. At that point, we request the balance again at T1, and it produces the same value as before.

What went wrong here? At first, it seems that the purchase didn't go through, and we should therefore retry it. Or could it be that the confirmation number got lost in transit on the network? Or perhaps, between T0 and T1, the credit card company deposited a payment check of ours for exactly the ticket's amount?

One way to handle a situation like this is to make method calls idempotent: Calling a method twice or more should have the same effect as calling it once. Thus, the client of the credit-charging service can retry the operation until it receives either a confirmation number or an exception. If the card has already been charged, the retry has no effect; if it has not been charged, the operation will likely succeed on a subsequent try. So while the service is not really testable, at least it provides some integrity guarantees.
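One common way to achieve idempotency, sketched below under assumed names, is to have the client supply a unique request identifier so the service can detect and discard duplicates. Nothing here is prescribed by Jini; it is just one plausible shape for such an interface.

import java.rmi.RemoteException;
import java.util.UUID;

public class IdempotentCharge {

    // Hypothetical charging interface: the requestId makes retries safe,
    // because the service charges a given requestId at most once.
    public interface ChargeService {
        int chargeCard(String requestId, String cardNumber, float amount)
                throws RemoteException;
    }

    // Retry until we receive a confirmation number; duplicates are harmless.
    public static int chargeWithRetry(ChargeService svc, String card, float amount)
            throws InterruptedException {
        String requestId = UUID.randomUUID().toString(); // one ID for all retries
        while (true) {
            try {
                return svc.chargeCard(requestId, card, amount);
            } catch (RemoteException e) {
                Thread.sleep(1000); // back off, then retry the same request
            }
        }
    }
}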

Qualities of a process

Once we understand the requirements we can expect, and demand, from a service, we should ask the broader question of whether the service is helpful to us at all. Specifically, we expect that the service will provide some guarantees with regard to the processes we are trying to accomplish with it. This question is similar to what an employer might ask about an employee: Given that an employee is reliable and highly available (shows up every day), has integrity, and can easily be checked in terms of work accuracy, is this person helpful to the company's business objectives? Is the employee supporting the organization's processes?

Process-centric notions of QoS take precisely this viewpoint. Specifically, we want to ensure that nothing bad happens by using the service, and that something good eventually happens. In computing terminology, we strive for safety and liveness. (See Leslie Lamport's "Proving the Correctness of Multiprocess Programs".) While these requirements sound simple, implementing them in actual information systems is not trivial.

A classic trip-planning problem illustrates the challenges. Suppose you wish to travel from Los Angeles to Salzburg, Austria. The trip will involve a flight from Los Angeles to Frankfurt, and then one from Frankfurt to Salzburg. Once there, you will drive a rental car to your hotel. And, of course, you will need to ensure that you have a hotel room at your destination. Further, you have a fixed travel budget. In addition, you would like to benefit from your frequent flyer club memberships with three different airlines. This scenario is illustrated in Figure 3.

Figure 3: Trip planning workflow with services

Flights, hotel rooms, and rental cars are all exposed on the network as services. Metaservices, such as flight, hotel, and rental car reservation systems, and credit card processors are also exposed. We must ensure that the interaction of these services produces some expected benefit (or at the least does not cause damage). An individual service, therefore, must guarantee that it is composable with other services in a meaningful way. This is very important, because the combination of services that produces the biggest benefit (such as saving the most money and providing the most convenient travel schedule) is the winner.

This problem has been studied in workflow systems, and we can now apply similar techniques on the service-oriented object Web. Clearly, having just one leg of the trip completed is not sufficient; after all, you want to get to your destination. The service must also take into account your budget and frequent flyer memberships. In addition, because the object Web offers vast amounts of information, the best service (or "metaservice") will likely use more types of information. For instance, it may find that, based on your previous trips, you are eligible for a free hotel room in Frankfurt, and that staying there one night would reduce your airfare significantly. Looking at your previous correspondence and contacts, it might also discover that your uncle, who lives in Frankfurt, happens to have an evening open on his calendar, and that your favorite restaurant there is still taking reservations for dinner. You would then have the option of staying in Frankfurt for the night and having dinner with your uncle while also saving money. It is precisely this sort of service-to-service interaction that makes calm computing such an enticing vision.

Over the past few years, transaction-processing concepts have been extended to facilitate this sort of workflow, or service composition. Two conceptual extensions are of particular interest here: complex, or nested, transactions, and transaction logic. The former takes each step of trip planning and fires off a small transaction with each service involved. These small transactions are nested inside the big transaction, the trip planning itself. Transaction logic provides a way to specify conditions that must hold during a transaction's execution, and allows a transaction to branch into different execution paths under different conditions.

Transactions are often explained with a banking example, using an ATM. When you withdraw cash from an ATM, you expect both that your account balance will be reduced and that the machine will eject the cash; the withdrawal is an atomic operation. The transaction should also not muddle your bank account: if the cash machine breaks down during the operation, the transaction should leave behind consistent results. If your wife withdraws cash from the same account at a different location, you probably don't want the two transactions to interfere. If the account's balance covers only one of the withdrawals, your bank will want to ensure that the two transactions are isolated; the system actually executes them one after the other, so that you can't both withdraw the same funds simultaneously. Finally, you want your transaction's results saved persistently, so that they remain available after the transaction completes. These guarantees of atomicity, consistency, isolation, and durability -- ACID, for short -- are the distinguishing guarantees that transactions provide in a computation.

Transaction-oriented computing considers transactions to be the primary units of computation in a distributed system. According to transaction-processing pioneer Jim Gray, without transactions, you cannot dependably automate processes in distributed systems (see Resources).

In the travel-planning example, you could consider the trip a transaction, a process having ACID guarantees. Each step in the planning would then be a subtransaction. The services involved in every step are then transaction participants. Note that a transaction participant might, in turn, have nested transactions (that consist of other participants) -- for instance, legacy systems at an airline, or an object representing a rental car. This is illustrated in Figure 4.
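In Jini terms, the outermost trip transaction could be created through a transaction manager, as in the rough sketch below. Obtaining the TransactionManager proxy (txnMgr here) via lookup is elided, and the participant services are hypothetical.

import net.jini.core.transaction.Transaction;
import net.jini.core.transaction.TransactionFactory;
import net.jini.core.transaction.server.TransactionManager;

public class TripPlanner {

    private static final long TXN_LEASE_MILLIS = 5 * 60 * 1000; // assumed lease time

    public interface FlightService { // hypothetical participant
        void reserve(String from, String to, Transaction txn) throws Exception;
    }

    public interface HotelService { // hypothetical participant
        void reserve(String city, Transaction txn) throws Exception;
    }

    // txnMgr would be discovered via Jini lookup; flight and hotel join the
    // transaction as participants, making each step a subtransaction.
    public void planTrip(TransactionManager txnMgr,
                         FlightService flight, HotelService hotel) throws Exception {
        Transaction.Created created = TransactionFactory.create(txnMgr, TXN_LEASE_MILLIS);
        Transaction txn = created.transaction;
        try {
            flight.reserve("LAX", "FRA", txn);
            flight.reserve("FRA", "SZG", txn);
            hotel.reserve("Salzburg", txn);
            txn.commit(); // all steps succeed together...
        } catch (Exception e) {
            txn.abort();  // ...or none of them take effect
            throw e;
        }
    }
}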

Figure 4: Nested transactions

In addition to ACID guarantees, we also care about other guarantees in the trip-planning example: our trip must be within budget, and must take us to our destination. If we consider the transaction to be a computational entity -- an object, for instance -- then we can "push" the responsibility of ensuring these constraints inside the transaction. Because more than one good solution might satisfy your goals, the system should let the user select the most appropriate one. But it must also ensure that options don't become invalid while the user ponders the choice (because someone else in the meantime reserved that last airline seat, for example).

Imagine, then, a system where planning your trip would involve creating many transactions, evaluating the outcome of each, letting you select the most desirable outcome, and finally reversing the effects of the undesired transactions. That way, the outcome of this selection offers ACID guarantees, and the system ensures that you'll get the best possible trip, that no "bad" outcomes will occur (such as going over budget), and that the process terminates in a state acceptable to you (all tickets reserved and paid for). How to specify this sort of logic is beyond the scope of this article, but you might want to check out some references on advanced transaction models and transaction logic in Resources.

Spheres of control

I'd like to conclude this article by revisiting an idea originally suggested in the late 1970s by two IBM researchers, Charles T. Davies and Larry Bjork. Extending their revolutionary idea might provide a unifying metaphor for the issues of dependability guarantees of Web-based object services.

In this scenario, when you request the planning of a trip from the system -- interacting with it via whatever user interface the service provides, be it voice, a computer, and so forth -- the trip-planning service will locate all the component services needed to accomplish this objective and bring them under a sphere of control. A sphere of control (SoC), thus, is a logical grouping of services on the network that share a common set of guarantees. This is illustrated in Figure 5.

Figure 5: Dynamic spheres of control

To quote from Davies' paper, "Data Processing Spheres of Control" (IBM Systems Journal, 1978):

The kinds of control considered, for which there are potentially many instances, are process atomicity, process commitment, controlled dependence, resource allocation, in-process recovery, post-process recovery, system recovery, intraprocess auditing, interprocess auditing, and consistency.... Spheres of control exist for other purposes ... examples are privacy control, transaction control, and version control for both data and procedures.

At the outset of a new SoC, services agree upon this common set of guarantees, and then interact inside this control regime. A sphere of control, therefore, would be a "contract," which services deployed by organizations or individuals could use when interacting with one another on the object Web. An SoC would have both quantitative and qualitative guarantees for each participant service, and for the SoC as a whole. SoCs, therefore, would become a unit of trust and execution on the object Web.

An SoC would be represented as an object that is saved persistently, even beyond the lifetime of the process it facilitates. The advantages of this persistence are evident when we consider auditing or reversing a process. In the latter case, suppose your trip planning was done via an SoC, and you needed to cancel your trip. Since your process used many services, you'd need to instruct each one to reverse its original action. Why not delegate this job to the SoC object representing the trip reservation? Since it records what steps were taken, which services were used, and what agreements were made with each service, it could arrange a reverse action with each service.
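No standard API for spheres of control exists; the interface below is purely speculative, a sketch of what such a persistent SoC object might expose, with every name invented for this example.

import java.util.List;

// Speculative sketch of a persistent sphere-of-control object. It records
// the participants and agreements of a completed process, so that the
// process can later be audited or reversed step by step.
public interface SphereOfControl {

    // A record of what one participant promised and did.
    interface ServiceAgreement {
        String serviceName();
        String guarantees(); // the qualitative and quantitative terms agreed on
    }

    class ReversalException extends Exception {
        public ReversalException(String message) { super(message); }
    }

    // The services that interacted inside this control regime.
    List<ServiceAgreement> participants();

    // Asks every participant to reverse its original action -- for
    // instance, canceling each reservation of a planned trip.
    void reverse() throws ReversalException;
}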

All your SoC objects would be available on the Web (for your own use only), and you could perform auditing and action reversal from virtually anywhere. With this in place, you could ask your car to find the restaurant you liked so much a few months before: your car would provide an interface, possibly via voice, to a space where all your SoC objects are accessible. A service could find the SoC objects that have restaurant services as participants, some of which you might have marked at the time as being excellent (using your PDA). It could then read a few attributes, such as the restaurant's name, from these back to you via the car's speaker system. You could then instruct your car to make a reservation at the desired restaurant, which it will do by creating a new SoC with the object representing the restaurant on the network. This is calm computing in action.

Conclusion

Services that can provide guarantees of dependability, both qualitative and quantitative, will have a better chance of survival in the coming age of calm computing. People have always relied on guarantees in their dealings with one another. The notion of law and the development of legal systems are, in many ways, attempts to establish sets of expectations people can rely on: when two people sign a contract, they explicitly promise certain guarantees to each other. If they fail to keep those guarantees, the contract is annulled, and others will decline to deal with the faulty party thenceforth. The state of the Web today resembles a society without contract law: it is immensely useful, but its arbitrariness is becoming an obstacle to the more sophisticated forms of use that a vision of ubiquitously available information requires. In the Jini community, we should start thinking seriously about how best to facilitate such guarantees for the services we develop.

Frank Sommers is founder and CEO of AutoSpaces.com, a startup focusing on bringing Jini technology to the automotive software market. He also serves as vice president of technology at Nowcom Corp., an outsourcing firm based in Los Angeles. He has been programming in Java since 1995, after attending the first public demonstration of the language on the Sun Microsystems campus in November of that year. His interests include parallel and distributed computing, the discovery and representation of knowledge in databases, as well as the philosophical foundations of computing. When not thinking about computers, he composes and plays piano, studies the symphonies of Gustav Mahler, or explores the writings of Aristotle and Ayn Rand. Sommers would like to thank Jungwon Jin, aka Nugu, for her tireless support and unfailing belief.