Java performance programming, Part 3: Managing collections

See how collections alternatives measure up in performance, and find out how to get the most out of each type

In Java performance programming, Part 2, we looked into the basics of casting -- what it is, what it costs, and some ways you can reduce your use of it. This month, we'll cover the related issue of collections in Java, looking first at the basic collections support supplied in the original versions of Java, then moving on to the vastly improved support added in JDK 1.2. Looking ahead, we'll also examine the current state of the Java Community Process proposal for generic types in Java -- including both the benefits we're likely to see from this proposal and the areas in which it perhaps falls short.

We'll then take a look at ways to implement type-specific collections in Java, including an approach that uses a common base class to eliminate some of the duplicated code for these collections. Finally, we'll finish off this month's performance tips with the results of timing tests to see how our alternatives measure up.

(Note: See the sidebar below for my views on the effectiveness of Sun's Java Community Process.)

Vintage collections: Vector and Hashtable

If you're developing for JDK 1.1.x, your standard collections support includes just two choices: java.util.Vector and java.util.Hashtable. (There is also java.util.Stack, but this is only a minor variation on Vector). To make matters even worse, the two choices you do have available are noticeably handicapped in the performance area. Both classes use generic java.lang.Object references, creating the need for casting -- with the overhead associated with it -- on virtually every access, and both use synchronized methods for all accesses.

You need synchronized access methods for modifiable collections when they're used by multiple threads. However, in many cases a single thread owns and exclusively accesses a collection. In such a situation, synchronized access methods are overkill, and result in a substantial performance penalty.

We'll discuss the issue of synchronization further in the next article in this series, along with other threading issues. If you can't wait until then, the performance comparison table near the end of this article includes figures for the legacy Vector class that should give you a feeling for how heavy use of synchronization can affect performance.

Given the performance limitations, these old-time collection workhorses are generally best avoided in heavily used code. Just remember that they're still useful when you need only a simple collection and when performance is not a major issue.

There is a way to get improved standard collections support in JDK 1.1.x, but it comes with a price in terms of code compatibility. We'll discuss this further in the next section of this article, but for now let's limit our discussion to these two basic collections classes provided as part of the standard Java class library.

Change your Vector

If you must use JDK 1.1.x collections in heavily executed code, you can often still work around the limitations of these classes. An opportunity to do so arises when you want to dynamically create a collection of some type by accumulating the elements one at a time, as long as you won't be adding to the collection after you've finished creating it. In such a situation, java.util.Vector provides a simple means of accumulating the elements, although the need to access the elements while doing so leaves you at a performance disadvantage.

To get around this, it's easy to convert a Vector into a type-specific array; once you've turned the collection into an array, access is very fast. For the JDK 1.1.x version of Vector, you can take a direct approach to the conversion:

    // Convert a vector of Integer into an array of Integer.
    Vector vector = ...
    Integer[] array = new Integer[vector.size()];
    for (int i = 0; i < array.length; i++) {
        array[i] = (Integer)vector.elementAt(i);
    }

You can also extend Vector to implement a generic form of conversion:

    public class ConvertibleVector extends Vector
    {
        public Object buildArray(Class type) {
            Object copy = Array.newInstance(type, elementCount);
            System.arraycopy(elementData, 0, copy, 0, elementCount);
            return copy;
        }
        // Assumes there's at least one element and it's not null!
        public Object buildArray() {
            if (elementCount > 0 && elementData[0] != null) {
                return buildArray(elementData[0].getClass());
            } else {
                throw new IllegalArgumentException("cannot convert to array");
        }
}

This form assumes that all elements of the Vector are of the same type. The buildArray() method builds an array of elements of that type and returns it. The caller still needs to cast the result to an array of the specific type, though:

    // Convert a vector of Integer into an array of Integer.
    ConvertibleVector vector = ...
    Integer[] array = (Integer[])vector.buildArray();

Customize Hashtable

java.util.Hashtable is more problematic. In addition to the issues involved in casting and synchronization, Hashtable also requires that the keys be objects (rather than primitive types), and then creates more objects to link the keys and values. This is not a major concern in the standard use of a hashtable, when lookups are much more frequent than adds, but can lead to a lot of unnecessary object creation in cases that don't fit this usual pattern.

You can avoid these potential performance problems by using a custom hashtable implementation in place of the standard implementation. On my Website, I've provided simplified example implementations of such custom hashtables that work with int and String key values for objects implementing appropriate interfaces (see the Resources section below for the URL). These examples provide considerably better performance than the standard Hashtable implementation for simple accumulate-only hashtables, and can be extended to include additional functionality.

The next generation: The collections framework

Fortunately, Sun recognized the limitations of Java's original collection classes fairly early. For JDK 1.2, the company designed the collections framework, a greatly enhanced architecture for representing and manipulating collections. Based on six interrelated collection interfaces, the framework provides a variety of efficient implementations, as well as some extremely useful utility classes. The collections are unsynchronized by default, but the framework provides a simple wrapper class to build synchronized versions of the collections when necessary.

Unsynchronized equivalents of the legacy collection classes java.util.Vector and java.util.Hashtable exist in the collections framework (these classes are java.util.ArrayList and java.util.HashMap, respectively), while the legacy classes themselves remain synchronized but add some additional operations. All of the collections classes include support for converting the collection to an array of values via toArray() methods similar to the buildArray() method shown above.

The utility classes include the java.util.Collections class, which implements a variety of useful algorithms for collections, and the java.util.Arrays class, which sorts, searches, compares, and fills arrays of both primitive and generic reference types. Having efficient standard implementations available for these common operations on arrays is especially useful when working with primitive values, since they allow you to directly use efficient arrays of primitives and do not require as much custom code.

The Resources section below lists several articles and documents for more details on the collections framework.

The collections classes are also available for JDK 1.1.x, but only in a modified form (see Resources for a link to this, as well). Since it's generally not possible to add classes to the standard java.xxx packages in a JVM, the form used for JDK 1.1.x supplies the collections framework as a set of classes in the com.sun.java.util.collections package.

These classes can be used by JDK 1.1.x programs to provide the same functionality as their JDK 1.2 equivalents, but, because of the different class paths, they are not assignment or serialization compatible. They are also only licensed for use in JDK 1.1.x JVMs, so code that needs to run on both JDK 1.1.x and JDK 1.2 JVMs cannot use this alternative. Converting source code from the JDK 1.1.x form to the JDK 1.2 form requires only a simple search-and-replace operation, but this still means that the same class files cannot be used for both JDKs.

If you're supporting only JDK 1.1.x, or if you don't mind distributing separate versions of your class files for JDK 1.1.x and JDK 1.2, the add-on collections framework for JDK 1.1.x is well worth a look.

Parameterized types in Java

One of the most requested enhancements in the Java Developer Connection's bug database is support for templates or some form of parameterized types in Java. The idea here is to allow code to be written in a general form that can be applied to a variety of different data types. When you want an instance of the parameterized class that's specific to a particular type of value, you just create one by specifying the actual value type as a parameter.

This issue is especially relevant to collections, since their use is one of the common cases in which parameterized types would be helpful. With parameterized types, you could define a collection as containing objects of a specific type and have the type checked by the compiler whenever the collection was used. Most developers agree that this would be an improvement over the current approach of detecting collection usage errors by means of runtime exceptions -- which usually occur either during demonstrations to upper management or after the system is put into full production!

There are different ways in which support for parameterized types could be implemented in Java. The choices arouse strong emotions for many developers -- as can be seen in the comments to the request for enhancement linked in the Resources section -- with some adamantly demanding support for C++-style templates and others vehemently opposed to any such "perversion" of Java.

With C++ templates, the compiler constructs the code to follow a model or template provided by the developer. Within the constructed code, a specific type is substituted for the placeholder type used in the template definition. This allows a type-specific implementation of an algorithm to be generated automatically from the generic algorithm template to suit the needs of the program.

The upside of this approach is that it allows multiple highly efficient type-specific implementations to be generated from a single source template. You can write a binary search algorithm, for instance, which can then be used with any primitive or (in C++) object type that supports the comparison operations used by the algorithm. It also could be implemented as a compiler change for Java, without necessarily requiring any JVM changes.

The downside? This approach creates multiple copies of the generated code -- one copy for each specific type used with the template. This is the main reason that many Java developers object to a template approach of this type, since it can potentially lead to the proliferation of class files, or at least to multiple specialized copies of class files within a JVM.

Another approach to providing a form of parameterized types is to specify for the compiler the actual type of the objects used in a particular case, while still writing the implementation for generic objects as with the current collections framework. The compiler could then check the types of values passed in to ensure that they match the specified type, and could transparently generate code to cast returned generic values to the specified type.

This approach has the advantage of eliminating multiple copies of the base code, but the disadvantages of poorer performance (due to the casting) and a lack of support for primitive types. It would also require extensions to the class file format, to support specifying parameterized types, and probably to the JVMs as well, to support checking for proper usage of these types.

Community process to the rescue?

Sun has initiated a specification proposal -- JSR-000014, "Add Generic Types to the Java Programming Language" -- under the Java Community Process (JCP) to investigate adding support for some form of parameterized types to Java (see Resources for a link). The recent formation of an Expert Group indicates that the specification proposal, which was approved in May, has started moving forward. (See the sidebar below for my thoughts on the JCP.)

Sun's Gilad Bracha serves as the contact person for this particular JSR. In a telephone interview, he provided some details of the direction being taken by this process and the time frame in which we might see some results.

The proposal was written with the second approach discussed above in mind, with a compiler that can check types and generate casts as necessary to provide type-specific interfaces to code written for generic objects. The focus is on code clarity and compile-time error checking, rather than on performance, though the proposal does not foresee any significant effects on performance. It also explicitly does not require any support for primitives as parameterized types. (For an in-depth discussion of one group's work on this problem, see "Behold the Power of Parametric Polymorphism in this month's JavaWorld.)

The formation of the Expert Group is the first step in the process of turning the proposal into any actual change to the Java specifications. From this point, the process will continue with a discussion among the experts, leading eventually to a draft specification change. This draft will go through one or more cycles of internal review among the participant companies before it is released as a public proposal. Finally, a reference implementation and reference test suite will be developed and released, and at that point the recommended specification change can be adopted for inclusion into a version of the JDK.

There is no guarantee that the proposal will actually result in a change to the Java specifications, and even if it does, the entire process will take at least one year -- and quite possibly two or more -- to complete. All things considered, it seems likely that we won't actually be able to make use of parameterized types in standard Java until sometime in 2002 or 2003. Even when (or if) this change comes about, it's unlikely to include support for primitive types, so there will likely be a use for type-specific collections in high usage code for some time to come.

Type-specific collections

Although the standard Java libraries don't include support for type-specific collections, it's possible to build template implementations of collections that can be specialized to a particular type by searching and replacing. Here's a (fairly lengthy) example of a collection class, similar to Vector, implemented specifically for java.lang.Integer:

public class IntegerArray
{
    protected int countPresent;
    protected int maximumGrowth;
    protected Integer[] baseArray;
    public IntegerArray(int size, int growth) {
        maximumGrowth = growth;
        baseArray = new Integer[size];
    }
        public IntegerArray(int size) {
        this(size, Integer.MAX_VALUE);
    }
    // Implementation method to increase the underlying array size.
    protected void growArray(int required) {
        int size = Math.max(required,
            baseArray.length + Math.min(baseArray.length, maximumGrowth));
        Integer[] grown = new Integer[size];
        System.arraycopy(baseArray, 0, grown, 0, baseArray.length);
        baseArray = grown;
    }
    // Append a value to the collection.
    public int add(Integer value) {
        int index = countPresent++;
        if (countPresent > baseArray.length) {
            growArray(countPresent);
        }
        baseArray[index] = value;
        return index;
    }
    
    // Insert a value into the collection.
    public void add(int index, Integer value) {
        if (index >= 0 && index <= countPresent) {
            if (++countPresent > baseArray.length) {
                growArray(countPresent);
            }
            if (index < countPresent) {
                System.arraycopy(baseArray, index, baseArray, index + 1,
                    countPresent - index - 1);
            }
            baseArray[index] = value;
        } else {
            throw new ArrayIndexOutOfBoundsException("Invalid index value");
        }
    }
    // Remove a value from the collection.
    public void remove(int index) {
        if (index >= 0 && index < countPresent) {
            if (index < --countPresent){
                System.arraycopy(baseArray, index + 1, baseArray, index,
                    countPresent - index);
                baseArray[countPresent] = null;
            }
        } else {
            throw new ArrayIndexOutOfBoundsException("Invalid index value");
        }
    }
    // Make sure we have at least a specified capacity.
    public void ensureCapacity(int min) {
        if (min > baseArray.length) {
            growArray(min);
        }
    }
    // Set the collection empty.
    public void clear() {
        countPresent = 0;
    }
    // Get number of values in collection.
    public int size() {
        return countPresent;
    }
    // Set the size of the collection.
    public void setSize(int count) {
        if (count > baseArray.length) {
            growArray(count);
        } else if (count < countPresent) {
            for (int i = count; i < countPresent; i++) {
                baseArray[i] = null;
            }
        }
        countPresent = count;
    }
    // Get value from the collection.
    public Integer get(int index) {
        if (index < countPresent) {
            return baseArray[index];
        } else {
            throw new ArrayIndexOutOfBoundsException("Invalid index value");
        }
    }
    // Set the value at a position in the collection.
    public void set(int index, Integer value) {
        if (index < countPresent) {
            baseArray[index] = value;
        } else {
            throw new ArrayIndexOutOfBoundsException("Invalid index value");
        }
    }
    // Convert to an array.
    public Integer[] buildArray() {
        Integer[] copy = new Integer[countPresent];
        System.arraycopy(baseArray, 0, copy, 0, countPresent);
        return copy;
    }
}

In C++, it would be easy to make this a template and generate an implementation specific to work with any type we wanted to use. The equivalent process in Java requires a search and replace (and continued maintenance of multiple copies of the code). We can convert this collection to any type of element we want by substituting the type name for Integer (though for primitive types we'd probably want to handle the class name separately, to get something like IntArray rather than intArray).

Factoring the base

If widely used, however, this approach definitely contributes to code bloat (along with maintenance problems). To mitigate this problem, with a little effort we can define a base class -- ArrayBase in the code below -- for type-specific collections of this type. Our newly defined class will allow a portion of the code to be shared between specific types:

import java.lang.reflect.Array;
public abstract class ArrayBase
{
    protected int countPresent;
    protected int countLimit;
    protected int maximumGrowth;
    public ArrayBase(int size, int growth, Class type) {
        Object array = Array.newInstance(type, size);
        countLimit = size;
        maximumGrowth = growth;
        setArray(array);
    }
    public ArrayBase(int size, Class type) {
        this(size, Integer.MAX_VALUE, type);
    }
    protected abstract Object getArray();
    protected abstract void setArray(Object array);
    protected abstract void discardValues(int from, int to);
    // Implementation method to increase the underlying array size.
    protected void growArray(int required) {
        Object base = getArray();
        int size = Math.max(required,
            countLimit + Math.min(countLimit, maximumGrowth));
        Class type = base.getClass().getComponentType();
        Object grown = Array.newInstance(type, size);
        System.arraycopy(base, 0, grown, 0, countLimit);
        countLimit = size;
        setArray(grown);
    }
    // Get next add position for appending, increasing size if needed.
    protected int getAddIndex() {
        int index = countPresent++;
        if (countPresent > countLimit) {
            growArray(countPresent);
        }
        return index;
    }
    // Make room to insert a value at a specified index.
    protected void makeInsertSpace(int index) {
        if (index >= 0 && index <= countPresent) {
            if (++countPresent > countLimit) {
                growArray(countPresent);
            }
            if (index < countPresent - 1) {
                Object array = getArray();
                System.arraycopy(array, index, array, index + 1,
                    countPresent - index - 1);
            }
        } else {
            throw new ArrayIndexOutOfBoundsException("Invalid index value");
        }
    }
    // Remove a value from the collection.
    public void remove(int index) {
        if (index >= 0 && index < countPresent) {
            if (index < --countPresent){
                Object base = getArray();
                System.arraycopy(base, index + 1, base, index,
                    countPresent - index);
                discardValues(countPresent, countPresent + 1);
            }
        } else {
            throw new ArrayIndexOutOfBoundsException("Invalid index value");
        }
    }
    // Make sure we have at least a specified capacity.
    public void ensureCapacity(int min) {
        if (min > countLimit) {
            growArray(min);
        }
    }
    // Set the collection empty.
    public void clear() {
        setSize(0);
    }
    // Get number of values in collection.
    public int size() {
        return countPresent;
    }
    // Set the size of the collection.
    public void setSize(int count) {
        if (count > countLimit) {
            growArray(count);
        } else if (count < countPresent) {
            discardValues(count, countPresent);
        }
        countPresent = count;
    }
    // Convert to an array of specified type.
    protected Object buildArray(Class type) {
        Object copy = Array.newInstance(type, countPresent);
        System.arraycopy(getArray(), 0, copy, 0, countPresent);
        return copy;
    }
}

ArrayBase factors out a lot of the common code that would otherwise be present in each of the type-specific collections. By using some helpful methods to construct arrays from runtime parameters, we're able to avoid copying the code that sizes and reallocates the underlying array. In addition, by using a set of abstract methods in the base class to call out to the subclass (as discussed in "Java Performance Programming, Part 2"), we're able to avoid a lot of casting in the subclasses while still allowing the base class to manipulate the array used for the collection without knowledge of the specific type of the elements.

IntBaseArray below serves as an example of a type-specific collection class extending our new base class, this time demonstrating its use for primitive int values:

public class IntBaseArray extends ArrayBase
{
    protected int[] baseArray;
    
    public IntBaseArray(int size, int growth) {
        super(size, growth, Integer.TYPE);
    }
    public IntBaseArray(int size) {
        super(size, Integer.TYPE);
    }
    // Implementation of callout to get the underlying array.
    protected Object getArray() {
        return baseArray;
    }
    // Implementation of callout to set the underlying array.
    protected void setArray(Object array) {
        baseArray = (int[]) array;
    }
    // Implementation of callout to initialize a portion of the array.
    protected void discardValues(int from, int to) {
        for (int i = from; i < to; i++) {
            baseArray[i] = 0;
        }
    }
    // Append a value to the collection.
    public int add(int value) {
        int index = getAddIndex();
        baseArray[index] = value;
        return index;
    }
    // Insert a value into the collection.
    public void add(int index, int value) {
        makeInsertSpace(index);
        baseArray[index] = value;
    }
    // Get value from the collection.
    public int get(int index) {
        if (index < countPresent) {
            return baseArray[index];
        } else {
            throw new ArrayIndexOutOfBoundsException("Invalid index value");
        }
    }
    // Set the value at a position in the collection.
    public void set(int index, int value) {
        if (index < countPresent) {
            baseArray[index] = value;
        } else {
            throw new ArrayIndexOutOfBoundsException("Invalid index value");
        }
    }
    // Convert to an array.
    public int[] buildArray() {
        return (int[]) buildArray(Integer.TYPE);
    }
}

This is not only much shorter than the type-specific collection class with which we started; it also has been pared down to a bare minimum in terms of complexity. Most of the methods remaining are nothing but error checkers and a simple operation or two, so they're much less likely to create any maintenance problems than the original code. The real meat of the implementation is now in the base class, and even if we create a dozen type-specific subclasses, the bulk of the code can easily be modified or fixed.

The performance tests

All this code for type-specific collections has been fun to write, and hopefully has been interesting to glance over. Since this series is about performance, though, we need to see how the different implementation approaches compare in order to get a feeling for when it may be worthwhile to use a type-specific collection.

Performance test 1: Bubble sort

Table 1 below shows how several alternatives perform for a typical case in which values in a collection are accessed more often than new ones are added or existing ones are deleted. This test performs a simple bubble sort on the values in a collection. For the base case, it uses a simple fixed-size array of int values to give us a line on the best possible performance for this type of operation with current JVMs. The test then tries the same operation using various types of resizable collections.

 JRE 1.1.8 (Sun)JRE 1.1.8 (IBM)JRE 1.2.2 (Classic)JRE 1.2.2 (HotSpot 2.0 beta)
int[]

8

9

7

10

IntArray

36

39

36

20

IntBaseArray

34

40

36

21

IntegerArray

72

125

74

36

ObjectArray

98

142

102

45

Vector

451

148

292

90

Table 1. Bubble sort performance (typical time in seconds, 20,000 values)

The IntArray version of the test uses a collection of int values based on the first type-specific collection example code, while IntBaseArray uses the version obtained by subclassing our ArrayBase class. Both of these are much slower than the simple array case, but there's not much performance difference between the two, showing that we haven't hurt performance by going to the common base class.

The IntegerArray test uses the example code for a type-specific collection of Integer values from this article, while ObjectArray uses code that is basically the same, but has been altered to use Object values instead. The latter case is essentially the same as ArrayList from the collections framework discussed in the first part of this article, requiring casting to convert returned reference values from Object to Integer.

These two versions of the test show the added overhead of working with objects rather than primitive values (even when only the added time needed to access a value from an object, and not any object-creation time, is included). Some JVMs do better than others at handling these cases, but they all experience a substantial performance drop when moving from primitive values to objects and a similar performance drop when the overhead of casting objects to a specific type is added in.

The final version of this test uses Integer values with our old friend Vector, adding the overhead of synchronization to each access. This is a topic we'll discuss in more detail next month, but for now it's interesting to see the extreme performance hit this causes with the standard JVMs.

Performance test 2: Add elements to a collection

Table 2 shows how these same approaches perform when elements are added to a collection. This test repeatedly fills a collection with values, discarding the existing set of values each time it begins with a new set.

 JRE 1.1.8 (Sun)JRE 1.1.8 (IBM)JRE 1.2.2 (Classic)JRE 1.2.2 (HotSpot 2.0 beta)
int[]

5

3

9

8

IntArray

17

15

19

25

IntBaseArray

17

17

19

23

IntegerArray

120

56

106

96

ObjectArray

125

63

106

96

Vector

184

60

131

104

Table 2. Performance in adding to a collection (typical time in seconds, 200,000 values, 200 iterations)

The performance here largely matches the results in the first set of tests, with the exception of the larger difference between the collections of primitive types and those of objects. This difference is due to the relatively long object-creation time, since the objects being added to the collections are created as they're added. It demonstrates the advantage of working with a collection of primitive types as opposed to objects when elements are continually being added and deleted from the collection.

Conclusion

The collections framework added in JDK 1.2 goes a long way toward eliminating the performance penalties of the earlier collections support. There are still some ways to improve performance further, though, in cases in which code executes frequently enough to make the effort worthwhile.

One simple approach (by far the fastest overall, for collections that are fixed in size after initial construction) is to convert your collection to a type-specific array. The collections framework includes direct support for doing just this, and with some of our examples we've seen how support for this can easily be added to the JDK 1.1.x java.util.Vector class as well.

Another alternative is to use a type-specific collection. This offers especially large performance gains when working with primitive values, since these cannot be used without object wrappers in the standard collections. With a little work on the class structure it's possible to eliminate most of the duplicated code you'd otherwise need for type-specific variations of a collection, so this can be a reasonable approach to take if you frequently work with collections of a specific type and need the best performance.

December's article showed the double-barrelled performance and clarity costs of casting in your code, and gave some structural techniques to use for avoiding it. In this article we've gone on to show the benefits of arrays and type-specific collections, which again keep your code cleaner and more efficient while eliminating possible runtime errors due to casting.

Next month, we'll take on the performance costs of threading in your programs -- investigating the costs associated with various types of operations and outlining some ways to minimize the price you pay for the convenience of multithreaded programming in Java.

Dennis Sosnoski is a software professional with over 25 years experience in developing for a wide range of platforms and application areas. An early adoptee of C++, he was just as quick to convert to using Java when it became available, and for the last three years it's been his preferred development platform. Dennis is the President and Lead Consultant of Sosnoski Software Solutions, Inc., a Java consulting and contract software development firm that's not a monopoly even though it's headquartered in Redmond, Washington.

Learn more about this topic