Accelerate your RMI programming

Clear up the performance bottlenecks created by RMI

RMI lets you write distributed Java programs with minimal extra work. Unfortunately, RMI also introduces new opportunities for performance bottlenecks in addition to those typically present in nonnetworked Java programs. You can reduce or eliminate many of these bottlenecks with careful design or with hand optimizations. This article shows how to safely optimize around a number of those bottlenecks.

Where does the time go?

An RMI call consists of three basic steps: pounding an object or object tree into an array of bytes, sending those bytes across a network, and rebuilding the byte array into an equivalent object or object tree at the far end. We will ignore a fourth activity -- shipping missing classes across the network -- because it only happens during the first RMI call with new classes.

The object-to-bytes and bytes-to-object conversions are paired, and in this article we treat them as a unit for throughput purposes. The time spent sending bytes over the network tends to be independent of the time spent converting objects to bytes and back again. Instead, that time is governed mostly by the byte array's length, so optimizations that shrink the serialized form often speed up the network step as well, simply because fewer bytes are transmitted.
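To see how long the byte array for a given object actually is, you can run the conversion step by hand. Here is a minimal sketch (the helper class is ours, not part of RMI); it approximates what RMI sends per argument, though RMI adds some protocol overhead of its own:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;

    public class SerializedSize
    {
        // Serializes an object into memory and reports the resulting
        // byte count.
        public static int sizeOf(Object obj) throws IOException
        {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(obj);
            out.close();
            return bytes.size();
        }
    }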

The network

RMI is built on top of network sockets and, thus, network problems can slow RMI calls. For applets running over the Internet, you may not be able to ensure that the network settings provide reasonable throughput. However, you do have control over applets and applications that run only on local networks. At the very least, you should ensure that your part of the network connection is not responsible for any bandwidth limitations.

The cost of suboptimal network settings can be higher than you might assume. On one internal application, we found that machines on some small subnets were running 50 times faster than others. Further investigation revealed that the slow machines had misconfigured networks and were getting only 100-200 Kbps out of a 100 Mbps network!

The netperf tool

We measured our network performance with a tool called netperf, which determines the network bandwidth between machines under a variety of configurations. It showed us that our fast machines were running at 70 Mbps and the slow machines at 100-200 Kbps. Because the hardware was identical, this quickly led us to compare network configurations, and the problem was soon resolved.

The lesson here is that even a fast network can run slowly if misconfigured. Before spending time optimizing your code, ensure that the problem isn't your network.

The TcpWindowSize parameter

TcpWindowSize is one of many TCP/IP tuning parameters, and it seems to have the most direct influence over network bandwidth. It controls the amount of data one machine can send to another without receiving an acknowledgment packet from the other side. A detailed discussion of this parameter is beyond this article's scope, but the short summary is that reliable networks, like internal Ethernet networks, run faster with large window sizes, while unreliable networks, like the Internet, run faster with smaller ones. Usually the default size is fine, and you shouldn't change this parameter without measuring the benefit. In one example, we found that changing TcpWindowSize increased the netperf bandwidth by about 10 percent but increased our RMI bandwidth by less than 1 percent. Your application may behave differently, especially over a less reliable network. In any event, if you tune this parameter to speed up netperf, verify that it helps your RMI performance as well.
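TcpWindowSize itself is an operating system setting, but Java exposes the closely related socket buffer sizes, which bound the TCP window a connection can advertise. A minimal sketch for inspecting the defaults your JVM picks up (treat setReceiveBufferSize() and setSendBufferSize() as hints to the TCP implementation, not guarantees):

    import java.net.Socket;
    import java.net.SocketException;

    public class BufferSizes
    {
        public static void main(String[] args) throws SocketException
        {
            // An unconnected socket is enough to query the defaults,
            // which usually mirror the operating system's settings.
            Socket socket = new Socket();
            System.out.println("receive buffer: " + socket.getReceiveBufferSize());
            System.out.println("send buffer:    " + socket.getSendBufferSize());
        }
    }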

RMI infrastructure settings

After your network is properly configured, you should verify that your RMI system properties are configured properly as well. Unless told otherwise, RMI uses default values for a number of configurable system properties. These properties control things like the level of logging the RMI infrastructure provides, the maximum time between distributed garbage collection checks, and various time-out values. The default values are often fine, but two of them can hinder performance.

Our project consists of software dynamically configured at runtime to run over a network of N processors. If N is one, then all the software runs on one CPU. For larger values of N, we put our software components on multiple CPUs and use RMI to communicate between them. The software behaves identically with one or N CPUs, but runs faster with more CPUs.

After testing and debugging our software running on one CPU, we reconfigured it to run on five. With only one CPU, we used up our heap before running a full garbage collection, but with five CPUs we got a full garbage collection with only 25 percent of the heap exhausted. These extra garbage collections hurt our performance, and we eventually concluded that they had to be RMI related.

We discovered that RMI forces a full garbage collection if one has not run within a configurable interval, which defaults to 60 seconds. Our heap was large enough that we used only about 25 percent of it before the forced collection occurred.

You can configure this interval with two system properties: sun.rmi.dgc.client.gcInterval and sun.rmi.dgc.server.gcInterval. Changing these values affects the garbage collection of remote objects, so don't change them casually. In our case, however, the forced garbage collections were unnecessary, and setting the properties to much higher values sped up our application without any negative consequences. Sun has a full list of these configurable RMI properties on its Website; see Resources.
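For example, to stretch the forced-collection interval from the one-minute default to one hour, you could set both properties when launching the JVM. The values are in milliseconds, one hour is only an illustration rather than a recommendation, and YourRMIServer stands in for your own main class:

    java -Dsun.rmi.dgc.client.gcInterval=3600000 \
         -Dsun.rmi.dgc.server.gcInterval=3600000 \
         YourRMIServer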

Latency and bandwidth

David Clark of M.I.T. once said of networks:

Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed -- you can't bribe God.

Latency is the time between an action's start and finish, like a function call or a set of function calls. Bandwidth is the amount of data that can be shipped in a given amount of time.

To illustrate the difference, imagine a courier service that ships 1 GB Jaz cartridges from Los Angeles to San Francisco, one cartridge per trip, driving Volkswagens. Driving at the speed limit, a driver makes the trip in 8 hours: a bandwidth of 125 MB/hour and a latency of 8 hours.

To improve latency, we buy our courier a Ferrari and encourage speeding. Now he gets from Los Angeles to San Francisco in 3 hours. Our latency has dropped from 8 hours to 3 hours. If this isn't enough, we charter a private jet and drop the latency to 2 hours, including time traveling to and from the airports.

To improve bandwidth, the driver takes 100 cartridges on each trip instead of just one. With the jet, our latency is still 2 hours, but our bandwidth has increased to 50 GB/hour: 100 GB every 2 hours. For more bandwidth still, we purchase a semi-trailer and ship 100,000 cartridges at a time. Our bandwidth now goes to 12,500 GB/hour, but our latency is back to 8 hours. Whether you want the bandwidth or the latency depends on what you are trying to do.

When programming with RMI, both latency and bandwidth are important. For internal Ethernet networks, the bandwidth is often more than adequate, so latency becomes the focus of attention. Dialup connections are much slower; there, bandwidth is often so limited that it becomes the more important concern.

Unfortunately, a desire for high bandwidth often conflicts with a desire for small latency. Sending characters across a network one at a time gives the best per-character latency, but the smallest bandwidth. Bundling characters into packets increases the bandwidth, but adds latency.

In network programming, high latency isn't usually something you fix but rather something you must design around. In practice, you usually want RMI calls to send or return as much information as possible to reduce the number of calls.

As an example, the time to move a fixed number of data items over RMI varies with the number of items sent in each call (the bucket size). One run, using strings, looks like Table 1:

Table 1. Time spent based on number of items sent with each RMI method call

Objects   Bucket Size   Time without Bucket (ms)   Time using Bucket (ms)
32768     1             8812                       20028
32768     2             8772                       10065
32768     4             8722                        5047
32768     8             8743                        2543
32768     16            8732                        1312
32768     32            8723                         691
32768     64            8703                         390
32768     128           8743                         221
32768     256           8792                         161
32768     512           8782                         100
32768     1024          8843                          80
32768     2048          8773                          70
32768     4096          8673                          70
32768     8192          8662                          70
32768     16384         8672                          61
32768     32768         8682                          60

With RMI in the middle, it is critical to design APIs that send collections of data rather than one piece at a time, as the sketch below illustrates.
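The bucketing idea translates directly into interface design. The following sketch shows the shape of the two approaches; the interface and method names are ours, not those used for the Table 1 benchmark:

    import java.rmi.Remote;
    import java.rmi.RemoteException;

    public interface ItemSink extends Remote
    {
        // One network round trip per item: per-call latency dominates.
        void putItem(String item) throws RemoteException;

        // One network round trip per bucket: the same per-call latency
        // is amortized across every item in the array.
        void putItems(String[] items) throws RemoteException;
    }

Clients then accumulate items locally and flush them through putItems() in batches.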

Serialization and externalization

Frequently, reworking the relevant APIs provides enough of a speedup, and at that point you can stop. If not, the next place to look is the serialization step of your RMI calls.

Though serialization provides great flexibility, its overhead can be quite high. This is because serialization uses Java's reflection mechanism to determine what data to serialize and how to serialize it. You can convert objects to byte arrays more efficiently by writing out objects yourself with Java's externalization mechanism.

Externalize to avoid reflection

The default serialization provided by the tagging java.io.Serializable interface uses reflection. You don't need to write any code beyond adding the tagging interface to the classes you want to serialize. The cost of this approach is that reflection runs slower than straight method calls. Consider the serialization performance of the following two class fragments:

    public class SerializedClass implements Serializable
    {
        private String aString;
        private int anIntA;
        private int anIntB;
        private float[] floatArray;
        private Dimension dimensionA;
        private Dimension dimensionB;
        // No more code related to serialization!
        // ... rest of class omitted ...
    }
    public class ExternalizedClass implements Externalizable
    {
        private String aString;
        private int anIntA;
        private int anIntB;
        private float[] floatArray;
        private Dimension dimensionA;
        private Dimension dimensionB;

        // Note: an Externalizable class must also have a public no-argument
        // constructor; deserialization uses it to create the object before
        // calling readExternal().

        public void writeExternal(ObjectOutput out) throws IOException
        {
            out.writeObject(aString);
            out.writeInt(anIntA);
            out.writeInt(anIntB);
            out.writeObject(floatArray);
            out.writeObject(dimensionA);
            out.writeObject(dimensionB);
        }

        public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException
        {
            aString = (String)in.readObject();
            anIntA = in.readInt();
            anIntB = in.readInt();
            floatArray = (float[])in.readObject();
            dimensionA = (Dimension)in.readObject();
            dimensionB = (Dimension)in.readObject();
        }
        // ... rest of class omitted ...
    }
Notice that the only real difference here is that the Externalized class has two extra methods, writeExternal() and readExternal().

Using various JVMs, the time to serialize and then deserialize the two classes in our example is shown in Table 2.

Table 2: Time to serialize and deserialize two classes

JVM                   Serialization (µs/object)         Externalization (µs/object)
                      serialize + deserialize = total   serialize + deserialize = total
JDK 1.1 w/JIT         293 + 531 = 824                   193 + 304 = 497
JDK 1.2 classic       328 + 446 = 774                   278 + 274 = 552
JDK 1.2 w/HotSpot 1   656 + 521 = 1177                  538 + 346 = 884
JDK 1.2 w/HotSpot 2   569 + 536 = 1105                  490 + 330 = 820
JDK 1.3 client        635 + 410 = 1045                  506 + 308 = 814
JDK 1.3 server        578 + 405 = 983                   455 + 311 = 766
IBM JDK 1.3           434 + 306 = 740                   368 + 220 = 588
JDK 1.4 beta server   459 + 358 = 817                   378 + 265 = 643

The above times are in microseconds per object and were tested on a 550 MHz Pentium-III running Windows NT 4.0. Please keep in mind that these times are only illustrative. A different example would show different relative performances.
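If you want numbers like these for your own classes, you can time round trips through an in-memory stream. A minimal sketch, assuming a class like those above (for meaningful results, ignore the first runs so JIT warmup doesn't skew the average):

    import java.io.*;

    public class SerializationTimer
    {
        public static void main(String[] args) throws Exception
        {
            final int COUNT = 10000;
            Object sample = new SerializedClass(); // or ExternalizedClass
            long start = System.currentTimeMillis();
            for (int i = 0; i < COUNT; i++)
            {
                // Serialize into memory...
                ByteArrayOutputStream bytes = new ByteArrayOutputStream();
                ObjectOutputStream out = new ObjectOutputStream(bytes);
                out.writeObject(sample);
                out.close();
                // ...and deserialize it right back.
                ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()));
                in.readObject();
                in.close();
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println((elapsed * 1000.0) / COUNT + " microseconds/object");
        }
    }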

The default serializable class is simple to write and use, but runs slower than the externalized class. The externalized class, in turn, requires more coding, and any child classes must handcode their own read and write methods. Therefore, future additions to an externalized class hierarchy will likely require more debugging than equivalent additions made using serialization.

Object packing

If you still need more speed, then you can resort to more drastic measures, such as packing the objects into shorter byte arrays.

In our example above, if anIntA and anIntB each require only 16 bits, we can store them together in a single 32-bit int:
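Here is a minimal sketch of one way to do that packing inside writeExternal() and readExternal(), assuming both values always fit in 16 signed bits:

    public void writeExternal(ObjectOutput out) throws IOException
    {
        out.writeObject(aString);
        // Pack both 16-bit values into one 32-bit int:
        // anIntA in the high 16 bits, anIntB in the low 16 bits.
        out.writeInt((anIntA << 16) | (anIntB & 0xFFFF));
        out.writeObject(floatArray);
        out.writeObject(dimensionA);
        out.writeObject(dimensionB);
    }

    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException
    {
        aString = (String)in.readObject();
        int packed = in.readInt();
        anIntA = packed >> 16;    // arithmetic shift restores the sign
        anIntB = (short)packed;   // low 16 bits, sign-extended
        floatArray = (float[])in.readObject();
        dimensionA = (Dimension)in.readObject();
        dimensionB = (Dimension)in.readObject();
    }

This trims the two ints from 8 bytes to 4 in the serialized stream; applied across many fields and many objects, such savings add up.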
