Accelerate your RMI programming

Speed up performance bottlenecks created by RMI

RMI lets you write distributed Java programs with minimal extra work. Unfortunately, RMI also introduces new opportunities for performance bottlenecks in addition to those typically present in nonnetworked Java programs. You can reduce or eliminate many of these bottlenecks with careful design or with hand optimizations. This article shows how to safely optimize around a number of those bottlenecks.

Where does the time go?

An RMI call consists of three basic steps: pounding an object or object tree into an array of bytes, sending those bytes across a network, and rebuilding the byte array into an equivalent object or object tree at the far end. We will ignore a fourth activity -- shipping missing classes across the network -- because it only happens during the first RMI call with new classes.

The object-to-bytes and bytes-to-object conversions are paired, and in this article we treat them as a unit for throughput purposes. The time spent sending bytes over the network tends to be independent of the time spent converting objects to bytes and back again; it is governed mainly by the byte array's length. Optimizations that shrink the serialized form therefore speed up the network leg as well, simply because fewer bytes are transmitted.

The network

RMI is built on top of network sockets and, thus, network problems can slow RMI calls. For applets running over the Internet, you may not be able to ensure that the network settings provide reasonable throughput. However, you do have control over applets and applications that run only on local networks. At the very least, you should ensure that your part of the network connection is not responsible for any bandwidth limitations.

The cost of suboptimal network settings can be higher than you might assume. On one internal application, we found that some machines on small subnets were running 50 times faster than others. Further investigation revealed that the slow machines had their networks misconfigured and were only getting 100-200 Kbps out of a 100 Mbps network!

The netperf tool

We measured our network performance using a tool called netperf, which lets you measure the network bandwidth between machines under a variety of configurations. It showed us that our fast machines were running at 70 Mbps and the slow machines at 100-200 Kbps. As the hardware was identical, this quickly led us to compare network configurations, and the problem was resolved.

The lesson here is that even a fast network can run slowly if misconfigured. Before spending time optimizing your code, ensure that the problem isn't your network.

The TcpWindowSize parameter

TcpWindowSize is one of many TCP/IP tuning parameters, and it seems to have the most direct influence on network bandwidth. TcpWindowSize controls the amount of data one machine can send to another without receiving an acknowledgment packet from the other side. A detailed discussion of this parameter is beyond this article's scope, but the short summary is that reliable networks, like internal Ethernet networks, run faster with large window sizes, while unreliable networks, like the Internet, run faster with smaller ones. Usually the default size is fine, and you shouldn't change this parameter without measuring the benefit. In one of our tests, changing TcpWindowSize increased the netperf bandwidth by about 10 percent but increased our RMI bandwidth by less than 1 percent. Your application may behave differently, especially over a less reliable network. In any event, if you tune this parameter to speed up netperf, verify that it helps your RMI performance as well.
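Java exposes a rough per-socket analogue of this window tuning: the send and receive buffer size hints (SO_SNDBUF/SO_RCVBUF). The sketch below is ours, not from the original article; the operating system treats the requested sizes as hints and may round or clamp them, so read the value back and measure before relying on this.

```java
import java.net.Socket;

// Hypothetical sketch: request larger per-socket buffers, the in-JVM cousin
// of the OS-level TcpWindowSize setting. The OS may round or clamp the
// request, so we read back what was actually granted.
public class WindowHint {
    static int requestedReceiveBuffer() {
        try (Socket s = new Socket()) {
            s.setReceiveBufferSize(64 * 1024);  // ask for a 64 KB receive window
            s.setSendBufferSize(64 * 1024);
            return s.getReceiveBufferSize();    // what the OS actually granted
        } catch (Exception e) {
            return -1;                          // no socket support available
        }
    }
    public static void main(String[] args) {
        System.out.println(requestedReceiveBuffer());
    }
}
```

As with TcpWindowSize itself, verify with netperf-style measurements that any buffer change helps your RMI traffic before keeping it.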

RMI infrastructure settings

After your network is properly configured, verify that your RMI system properties are configured properly too. Unless told otherwise, RMI uses default values for a number of configurable system properties. These properties control things like the level of logging the RMI infrastructure provides, the maximum time between distributed garbage collection checks, and various time-out values. The default values are often fine, but two of these properties can hinder performance.

Our project consists of software dynamically configured at runtime to run over a network of N processors. If N is one, then all the software runs on one CPU. For larger values of N, we put our software components on multiple CPUs and use RMI to communicate between them. The software behaves identically with one or N CPUs, but runs faster with more CPUs.

After testing and debugging our software running on one CPU, we reconfigured it to run on five. With only one CPU, we used up our heap before running a full garbage collection, but with five CPUs we got a full garbage collection with only 25 percent of the heap exhausted. These extra garbage collections hurt our performance, and we eventually concluded that they had to be RMI related.

We discovered that RMI is configured to force a full garbage collection if one has not run within a specified interval; the default interval is 60 seconds. Our heap was large enough that we only used about 25 percent of it before the forced garbage collection.

You can configure this garbage collection interval with two system properties: sun.rmi.dgc.client.gcInterval and sun.rmi.dgc.server.gcInterval. Changing these values affects the garbage collection of remote objects, so don't change them casually. In our case, however, the forced garbage collections were unnecessary. Raising these properties to much higher values sped up our application without any negative consequences. Sun has a full list of these configurable RMI properties on its Website; see Resources.
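A launcher along these lines (a hypothetical sketch, not our production code) raises both intervals to one hour before any remote objects are exported:

```java
// Hypothetical launcher sketch: raise RMI's forced-GC interval from the
// 60,000 ms default to one hour. Set these properties before exporting any
// remote objects, and only after confirming the forced collections are
// unnecessary for your application.
public class LaunchServer {
    static void configureDgcIntervals(String millis) {
        System.setProperty("sun.rmi.dgc.client.gcInterval", millis);
        System.setProperty("sun.rmi.dgc.server.gcInterval", millis);
    }
    public static void main(String[] args) {
        configureDgcIntervals("3600000");   // one hour, in milliseconds
        // ... create and export the RMI server objects here ...
    }
}
```

The same effect comes from passing -Dsun.rmi.dgc.client.gcInterval=3600000 and -Dsun.rmi.dgc.server.gcInterval=3600000 on the java command line, which avoids touching the code at all.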

Latency and bandwidth

David Clark of M.I.T. once said of networks:

Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed -- you can't bribe God.

Latency is the time between an action's start and finish, like a function call or a set of function calls. Bandwidth is the amount of data that can be shipped in a given amount of time.

To illustrate the difference, imagine a courier service that ships 1 GB Jaz cartridges from Los Angeles to San Francisco driving Volkswagens. Driving at the speed limit, a driver makes the trip in 8 hours. We have a bandwidth of 125 MB/hour and a latency of 8 hours.

To improve latency, we buy our courier a Ferrari and encourage speeding. Now he gets from Los Angeles to San Francisco in 3 hours. Our latency has dropped from 8 hours to 3 hours. If this isn't enough, we charter a private jet and drop the latency to 2 hours, including time traveling to and from the airports.

To improve bandwidth, the driver takes 100 cartridges on each trip instead of just one. With the jet, our latency is still 2 hours, but our bandwidth has increased to 50 GB/hour. For more bandwidth, we purchase a semi-trailer and ship 100,000 cartridges at a time. Our bandwidth now goes to 12,500 GB/hour, but our latency is back to 8 hours. Whether you want the bandwidth or the latency depends on what you are trying to do.
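The courier arithmetic can be checked mechanically. This small sketch is ours, not the authors', and treats 1 GB as 1,000 MB for round numbers:

```java
// Worked version of the courier arithmetic: bandwidth is payload divided by
// trip time, and is independent of the latency of any single trip.
public class CourierMath {
    static double bandwidthMBPerHour(int cartridges, double tripHours) {
        double mbPerCartridge = 1000.0;   // one 1 GB cartridge, round numbers
        return cartridges * mbPerCartridge / tripHours;
    }
    public static void main(String[] args) {
        System.out.println(bandwidthMBPerHour(1, 8.0));       // Volkswagen: 125 MB/hour
        System.out.println(bandwidthMBPerHour(100, 2.0));     // jet, 100 cartridges: 50,000 MB/hour
        System.out.println(bandwidthMBPerHour(100000, 8.0));  // semi: 12,500,000 MB/hour = 12,500 GB/hour
    }
}
```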

When programming with RMI, both latency and bandwidth matter. On internal Ethernet networks, the bandwidth is often more than adequate, so latency becomes the focus of attention. Dialup connections are much slower, and their limited bandwidth is often the bottleneck, so bandwidth becomes more important.

Unfortunately, a desire for high bandwidth often conflicts with a desire for low latency. Sending characters across a network one at a time gives the best per-character latency but the smallest bandwidth. Bundling characters into packets increases the bandwidth but adds latency.

In network programming, high latency isn't usually something you fix but rather something you must design around. In practice, you usually want RMI calls to send or return as much information as possible to reduce the number of calls.

As an example, the time to move a certain number of data items over RMI varies based on the number of items sent with each call. One example, using strings, looks like Table 1:

Table 1. Time spent based on number of items sent with each RMI method call

Objects   Bucket Size   Time without Bucket (ms)   Time using Bucket (ms)
32768           1                 8812                     20028
32768           2                 8772                     10065
32768           4                 8722                      5047
32768           8                 8743                      2543
32768          16                 8732                      1312
32768          32                 8723                       691
32768          64                 8703                       390
32768         128                 8743                       221
32768         256                 8792                       161
32768         512                 8782                       100
32768        1024                 8843                        80
32768        2048                 8773                        70
32768        4096                 8673                        70
32768        8192                 8662                        70
32768       16384                 8672                        61
32768       32768                 8682                        60

With RMI in the middle, designing APIs to send data collections rather than one data piece at a time is critical.
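One way to retrofit bucketing onto a chatty interface is a client-side buffer that accumulates items and forwards them in groups. The sketch below is hypothetical (the Sink interface and all names are ours; Sink stands in for an RMI remote interface):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client-side bucket: put() buffers items locally and the sink's
// putAll() ships a whole bucket at once, so the (remote) call count drops by
// a factor of the bucket size.
public class Bucket {
    interface Sink { void putAll(List<String> items); }  // stand-in for a remote interface

    private final Sink sink;
    private final int capacity;
    private final List<String> pending = new ArrayList<>();

    Bucket(Sink sink, int capacity) {
        this.sink = sink;
        this.capacity = capacity;
    }

    void put(String item) {
        pending.add(item);
        if (pending.size() >= capacity) flush();
    }

    void flush() {                       // call once more when the stream of items ends
        if (!pending.isEmpty()) {
            sink.putAll(new ArrayList<>(pending));  // one round trip per bucket
            pending.clear();
        }
    }
}
```

With a bucket size of 256, the 32,768 put() calls from Table 1 collapse into 128 putAll() round trips; remember the final flush() so a partial bucket isn't stranded.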

Serialization and externalization

Frequently, reworking the relevant APIs speeds things up enough, and you can stop there. If not, you must optimize the serialization step of your RMI calls.

Though serialization provides great flexibility, its overhead can be quite high because serialization uses Java's reflection mechanism to determine what data to serialize and how to serialize it. You can convert objects to byte arrays more efficiently by writing the objects out yourself with Java's externalization mechanism.

Externalize to avoid reflection

The default serialization provided by the tagging java.io.Serializable interface uses reflection. You don't need to write any code beyond adding the tagging interface to the classes you want to serialize. The cost of this convenience is that reflection runs slower than straight method calls. The following two class fragments illustrate the difference:

    public class SerializedClass implements Serializable
    {
        private String aString;
        private int anIntA;
        private int anIntB;
        private float[] floatArray;
        private Dimension dimensionA;
        private Dimension dimensionB;
        // No more code related to serialization!
        // Etc.
    public class ExternalizedClass implements Externalizable
    {
        private String aString;
        private int anIntA;
        private int anIntB;
        private float[] floatArray;
        private Dimension dimensionA;
        private Dimension dimensionB;
       
        public void writeExternal(ObjectOutput out) throws IOException
        {
            out.writeObject(aString);
            out.writeInt(anIntA);
            out.writeInt(anIntB);
            out.writeObject(floatArray);
            out.writeObject(dimensionA);
            out.writeObject(dimensionB);
        }
        public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException
        {
            aString = (String)in.readObject();
            anIntA = in.readInt();
            anIntB = in.readInt();
            floatArray = (float[])in.readObject();
            dimensionA = (Dimension)in.readObject();
            dimensionB = (Dimension)in.readObject();
        }
        // Etc.

Notice that the only real difference here is that the Externalized class has two extra methods, writeExternal() and readExternal().

Using various JVMs, the time to serialize, and then deserialize, the two classes for our example is shown in Table 2.

Table 2: Time to serialize and deserialize two classes

JVM                   Serialization (µs/object)         Externalization (µs/object)
                      serialize + deserialize = total   serialize + deserialize = total
JDK 1.1 w/JIT         293 + 531 =  824                  193 + 304 = 497
JDK 1.2 classic       328 + 446 =  774                  278 + 274 = 552
JDK 1.2 w/HotSpot 1   656 + 521 = 1177                  538 + 346 = 884
JDK 1.2 w/HotSpot 2   569 + 536 = 1105                  490 + 330 = 820
JDK 1.3 client        635 + 410 = 1045                  506 + 308 = 814
JDK 1.3 server        578 + 405 =  983                  455 + 311 = 766
IBM JDK 1.3           434 + 306 =  740                  368 + 220 = 588
JDK 1.4 beta server   459 + 358 =  817                  378 + 265 = 643

The above times are in microseconds per object and were tested on a 550 MHz Pentium-III running Windows NT 4.0. Please keep in mind that these times are only illustrative. A different example would show different relative performances.

The default serialization is simple to write and use but runs slower than the externalized class. The externalized class requires more coding, and child classes must now handcode their own read and write methods. Future additions to this class hierarchy will therefore likely require more debugging under externalization than under serialization.
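To compare the two approaches yourself, a round trip through in-memory streams is enough. The sketch below is our own harness, not the one behind Table 2; wrapping timing loops around toBytes() and fromBytes() would yield numbers of the same kind:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Minimal round-trip harness: serialize any Serializable or Externalizable
// object to a byte array and back, the same conversion RMI performs per call.
public class RoundTrip {
    static byte[] toBytes(Object o) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(o);
            oos.close();
            return bos.toByteArray();
        } catch (Exception e) { throw new RuntimeException(e); }
    }
    static Object fromBytes(byte[] bytes) {
        try {
            return new ObjectInputStream(new ByteArrayInputStream(bytes)).readObject();
        } catch (Exception e) { throw new RuntimeException(e); }
    }
    public static void main(String[] args) {
        System.out.println(fromBytes(toBytes("hello")));  // prints hello
    }
}
```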

Object packing

If you still need more speed, then you can resort to more drastic measures, such as packing the objects into shorter byte arrays.

In our example above, if anIntA and anIntB each fit in 16 bits, then we can store them together in a single int:

    public class ExternalizedClass implements Externalizable
    {
        private String aString;
        private int anIntA;
        private int anIntB;
        private float[] floatArray;
        private Dimension dimensionA;
        private Dimension dimensionB;
       
        public void writeExternal(ObjectOutput out) throws IOException
        {
            out.writeObject(aString);
            int toWrite =  (anIntA << 16 ) & 0xffff0000;    // <- Difference
            toWrite |= anIntB & 0x0000ffff;                 // <- Difference
            out.writeInt(toWrite);                          // <- Difference
            out.writeObject(floatArray);
            // pack any types that can be packed.
            out.writeObject(dimensionA);
            out.writeObject(dimensionB);
        }
        public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException
        {
            aString = (String)in.readObject();
            int written = in.readInt();                     // <- Difference
            anIntA = written >>> 16;                        // <- Difference (>>> avoids sign extension)
            anIntB = written & 0x0000ffff;                  // <- Difference
            floatArray = (float[])in.readObject();
            dimensionA = (Dimension)in.readObject();
            dimensionB = (Dimension)in.readObject();
        }

This reduces the number of bytes that must be sent over the network. In the example above, this won't matter over a fast ethernet network, but might improve a slower dialup network connection. It also helps larger objects containing a lot of packable data.
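The bit arithmetic deserves a sanity check. This standalone sketch (ours) round-trips two 16-bit values through one int, using the unsigned shift >>> on the way out so values at or above 0x8000 survive:

```java
// Round-trip check for the packing arithmetic: two unsigned 16-bit values
// share one 32-bit int. Note >>> on the read side -- the signed shift >>
// would sign-extend whenever the high value is 0x8000 or more.
public class PackDemo {
    static int pack(int high, int low) { return (high << 16) | (low & 0xffff); }
    static int high(int packed)        { return packed >>> 16; }
    static int low(int packed)         { return packed & 0xffff; }
    public static void main(String[] args) {
        int p = pack(40000, 123);       // 40000 >= 0x8000: >> would break here
        System.out.println(high(p));    // prints 40000
        System.out.println(low(p));     // prints 123
    }
}
```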

Flatten contained objects

Another approach to minimizing the number of sent bytes is to directly serialize the contents of contained objects.

Serializing multiple objects, even with explicit serialization, requires more time and space than serializing the same amount of data in one object. This suggests optimizing RMI calls by serializing the contents of objects contained in other objects, but not the contained objects themselves. Upon deserialization, new contained objects are constructed using the contents.

We modify our example further to look like this:

    public class ObjectFlattening implements Externalizable
    {
        private static final int NOT_CHILD_CLASS = 0;
        private static final int CHILD_CLASS = 1;
        private String aString;
        private int anIntA;
        private int anIntB;
        private float[] floatArray;
        private Dimension dimensionA;
        private Dimension dimensionB;
        public void writeExternal(ObjectOutput out) throws IOException
        {
            out.writeObject(aString);
            
            int toWrite =  (anIntA << 16 ) & 0xffff0000;
            toWrite |= anIntB & 0x0000ffff;
            out.writeInt(toWrite);
            
            out.writeObject(floatArray);
            
            if ( dimensionA.getClass().equals(Dimension.class) )
            {
                out.writeInt(NOT_CHILD_CLASS);
                out.writeInt(dimensionA.width);
                out.writeInt(dimensionA.height);
            }
            else
            {
                out.writeInt(CHILD_CLASS);
                out.writeObject(dimensionA);
            }
            
            out.writeObject(dimensionB);
        }
        public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException
        {
            aString = (String)in.readObject();
            
            int written = in.readInt();
            anIntA = written >>> 16;        // unsigned shift avoids sign extension
            anIntB = written & 0x0000ffff;
            
            floatArray = (float[])in.readObject();
            
            int type = in.readInt();
            if ( type == NOT_CHILD_CLASS )
            {
                int width = in.readInt();
                int height = in.readInt();
                dimensionA = new Dimension(width, height);
            }
            else
            {
                dimensionA = (Dimension)in.readObject();
            }
            
            dimensionB = (Dimension)in.readObject();
        }

Table 3 shows the time required for the optimizations made thus far.

Table 3. Time spent with total optimizations made

JVM                   Serialization (µs/object)         Optimizations so far (µs/object)
                      serialize + deserialize = total   serialize + deserialize = total
JDK 1.1 w/JIT         293 + 531 =  824                  194 + 256 = 450
JDK 1.2 classic       328 + 446 =  774                  269 + 270 = 539
JDK 1.2 w/HotSpot 1   656 + 521 = 1177                  528 + 314 = 842
JDK 1.2 w/HotSpot 2   569 + 536 = 1105                  443 + 330 = 773
JDK 1.3 client        635 + 410 = 1045                  517 + 272 = 789
JDK 1.3 server        578 + 405 =  983                  438 + 296 = 734
IBM JDK 1.3           434 + 306 =  740                  354 + 214 = 568
JDK 1.4 beta server   459 + 358 =  817                  366 + 260 = 626

This optimization, flattening contained objects, may introduce subtle bugs. Therefore, we must handle the following three issues:

  1. Class changes: As written, the containing class assumes that it knows the contents of the contained class. In general, this assumption is a bug waiting to happen: if a new member variable is added to the contained class, the serialization/deserialization will break. Debugging this is straightforward, but it may take a while to discover that a bug even exists!

    Fortunately, in our example, Dimension is part of the standard Java class libraries and won't change. If we had created Dimension, we would modify it so that it wrote out its contents into a byte stream that was passed in by containing classes (in our example, we have optimized ObjectFlattening to flatten the contained objects).

  2. Object slicing: Occurs when a child class's object is serialized, but the parent class's object is reconstructed during deserialization. This can be a serious problem when flattening objects that can be different classes in a class hierarchy. If the flattened classes are made final, child classes won't exist, so object slicing isn't a problem. But if they can't be made final, the parent class must know about all the child classes for successful deserialization, and you must pack an identifying int with the serialized bytes. This is exactly the sort of problem that the default serialization mechanism handles so we don't need to, but if you need maximum speed, then you may have to manage this yourself. In our example above, we handle it using the type parameter.
  3. Object sharing: Refers to the fact that two references to the same contained object may get converted into two references to two different contained objects upon deserialization. This is because deserialization creates new objects and doesn't know enough to restore lost shared references. In our example above, dimensionA and dimensionB might both refer to the same Dimension object before serialization. After we modify the readExternal() and writeExternal() methods to flatten these objects, we lose this reference sharing. This must be managed on a case-by-case basis, but fortunately isn't usually an issue.
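The third issue is easy to demonstrate in isolation. In this sketch (ours, not from the article), default serialization preserves a shared reference while hand flattening silently duplicates it:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Demonstrates the object-sharing pitfall of flattening: default
// serialization writes back-references so identity survives the round trip,
// while writing contents by hand rebuilds independent copies.
public class SharingDemo {
    static boolean sharedAfterDefault() {
        try {
            int[] shared = {1, 2};
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            oos.writeObject(new Object[] { shared, shared });  // two refs, one object
            oos.close();
            Object[] back = (Object[]) new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray())).readObject();
            return back[0] == back[1];   // identity preserved by the stream
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    static boolean sharedAfterFlatten() {
        try {
            int[] shared = {1, 2};
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(bos);
            for (int field = 0; field < 2; field++)   // both fields flattened by hand
                for (int v : shared) oos.writeInt(v);
            oos.close();
            ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray()));
            int[] a = { in.readInt(), in.readInt() };
            int[] b = { in.readInt(), in.readInt() };
            return a == b;               // always false: two freshly built arrays
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    public static void main(String[] args) {
        System.out.println(sharedAfterDefault());   // prints true
        System.out.println(sharedAfterFlatten());   // prints false
    }
}
```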

Use nulls for default objects

Our last optimization uses null for default object values. In our example, we have a class containing a location stored as Dimension objects. Assume that many objects in our example have a position of (0, 0). Finally, assume that we wrote our class so that location will never be null -- a reasonable assumption as this permits the client code to avoid dealing with null as a special case.

The trick is to send a null as a special case for the (0, 0) location. Modifying our example class one last time, we get this:

    public class ObjectFlattening implements Externalizable
    {
        private static final int NOT_CHILD_CLASS = 0;
        private static final int CHILD_CLASS = 1;
        private String aString;
        private int anIntA;
        private int anIntB;
        private float[] floatArray;
        private Dimension dimensionA;
        private Dimension dimensionB;
        public void writeExternal(ObjectOutput out) throws IOException
        {
            out.writeObject(aString);
            
            int toWrite =  (anIntA << 16 ) & 0xffff0000;
            toWrite |= anIntB & 0x0000ffff;
            out.writeInt(toWrite);
            
            out.writeObject(floatArray);
            
            if ( dimensionA.getClass().equals(Dimension.class) )
            {
                out.writeInt(NOT_CHILD_CLASS);
                out.writeInt(dimensionA.width);
                out.writeInt(dimensionA.height);
            }
            else
            {
                out.writeInt(CHILD_CLASS);
                out.writeObject(dimensionA);
            }
            
            if ( dimensionB.getClass().equals(Dimension.class)
                 && dimensionB.width == 0 && dimensionB.height == 0 )
                out.writeObject(null);              // (0, 0) is the default: send null
            else
                out.writeObject(dimensionB);
        }
        public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException
        {
            aString = (String)in.readObject();
            
            int written = in.readInt();
            anIntA = written >>> 16;                // unsigned shift avoids sign extension
            anIntB = written & 0x0000ffff;
            
            floatArray = (float[])in.readObject();
            
            int type = in.readInt();
            if ( type == NOT_CHILD_CLASS )
            {
                int width = in.readInt();
                int height = in.readInt();
                dimensionA = new Dimension(width, height);
            }
            else
            {
                dimensionA = (Dimension)in.readObject();
            }
            
            Dimension temp = (Dimension)in.readObject();
            dimensionB = (temp == null) ? new Dimension(0, 0) : temp;
        }

Table 4 shows the time to serialize/deserialize objects with all our optimizations.

Table 4. Time spent with total optimizations

JVM                   Serialization (µs/object)         Total optimizations (µs/object)
                      serialize + deserialize = total   serialize + deserialize = total
JDK 1.1 w/JIT         293 + 531 =  824                  137 + 160 = 297
JDK 1.2 classic       328 + 446 =  774                  227 + 173 = 400
JDK 1.2 w/HotSpot 1   656 + 521 = 1177                  461 + 198 = 659
JDK 1.2 w/HotSpot 2   569 + 536 = 1105                  410 + 189 = 599
JDK 1.3 client        635 + 410 = 1045                  427 + 193 = 620
JDK 1.3 server        578 + 405 =  983                  387 + 172 = 559
IBM JDK 1.3           434 + 306 =  740                  308 + 139 = 447
JDK 1.4 beta server   459 + 358 =  817                  344 + 144 = 488

The check for (0, 0) is fairly obvious; the check for Dimension is less so. Why do we need to verify that the class is exactly java.awt.Dimension? To avoid object slicing: if we skipped the check and the object actually belonged to a child class of java.awt.Dimension, we would lose information when serializing and generate an object of the wrong type when deserializing.

Note that you can lose object sharing with this approach, just like you can with object flattening. If you have multiple references pointing to the same object, this approach breaks that relationship and you instead get each reference pointing to its own object after deserialization.

Again, this is usually not an issue, but you must be aware of it when optimizing.

Putting this all together

As shown by Table 1 (repeated here), combining RMI method calls provides the most dramatic increases in speed:

Table 1. Time spent based on number of items sent with each RMI method call

Objects   Bucket Size   Time without Bucket (ms)   Time using Bucket (ms)
32768           1                 8812                     20028
32768           2                 8772                     10065
32768           4                 8722                      5047
32768           8                 8743                      2543
32768          16                 8732                      1312
32768          32                 8723                       691
32768          64                 8703                       390
32768         128                 8743                       221
32768         256                 8792                       161
32768         512                 8782                       100
32768        1024                 8843                        80
32768        2048                 8773                        70
32768        4096                 8673                        70
32768        8192                 8662                        70
32768       16384                 8672                        61
32768       32768                 8682                        60

Configuring TcpWindowSize and the RMI system properties helps only slightly. But both are easy to do, so investigate them before changing your object serialization strategies.

Finally, collecting the data for our example across all the serialization speedups in Table 5, we find that we can accelerate serialization and deserialization, but only at the cost of added complexity.

Table 5. Time for all serialization speedups

JVM                   Serialization      Externalization    Packing            Flattening         Null Default
JDK 1.1 w/JIT         293 + 531 =  824   193 + 304 = 497    194 + 304 = 498    194 + 256 = 450    137 + 160 = 297
JDK 1.2 classic       328 + 446 =  774   278 + 274 = 552    270 + 275 = 545    269 + 270 = 539    227 + 173 = 400
JDK 1.2 w/HotSpot 1   656 + 521 = 1177   538 + 346 = 884    501 + 365 = 866    528 + 314 = 842    461 + 198 = 659
JDK 1.2 w/HotSpot 2   569 + 536 = 1105   490 + 330 = 820    459 + 341 = 800    443 + 330 = 773    410 + 189 = 599
JDK 1.3 client        635 + 410 = 1045   506 + 308 = 814    514 + 282 = 796    517 + 272 = 789    427 + 193 = 620
JDK 1.3 server        578 + 405 =  983   455 + 311 = 766    447 + 305 = 752    438 + 296 = 734    387 + 172 = 559
IBM JDK 1.3           434 + 306 =  740   368 + 220 = 588    363 + 221 = 584    354 + 214 = 568    308 + 139 = 447
JDK 1.4 beta server   459 + 358 =  817   378 + 265 = 643    376 + 255 = 631    366 + 260 = 626    344 + 144 = 488

(All times in µs/object; each cell is serialize + deserialize = total.)

In our example, explicitly externalizing a class rather than using the default serialization mechanism helps on every JVM we tested. Packing and flattening don't help much in our example, although they would in others. Packing and flattening can also help by reducing the number of bytes that must be sent over a slow connection, although that benefit won't show up in the serialization times above. Finally, using null for default values accelerates things again for this example on all JVMs. Other examples might show no benefit.

Hit your target

RMI programming introduces opportunities for performance bottlenecks beyond those typically present in nonnetworked Java programs. Further, some of the speed-increasing optimizations can introduce subtle bugs. We have presented some of these bottlenecks and shown how to speed past them, while pointing out pitfalls and tips for avoiding them.

With the exception of configuring the RMI system parameters and setting the TcpWindowSize, each acceleration makes the code more complicated and difficult to debug and maintain. Thus, it is important to have a firm target in mind and cease complicating the code once you meet that target.

Ashok Mathew, a Sun-certified programmer, has been programming for more than five years. He works full time at KLA-Tencor, and his interests include systems programming and world travel. Mark Roulo, JavaWorld's former Java Tip technical coordinator, has been programming professionally since 1989 and has been using Java since the alpha-3 release. He works full time at KLA-Tencor and, with Ashok, is part of a team that has built and shipped a 650KLOC distributed parallel multicomputer application for image processing (among other things) that is written almost entirely in Java.
