Design for performance, Part 1: Interfaces matter

Avoid performance hazards when designing Java classes

Many programmers don't start thinking about performance management until late in the development cycle. Often, they hold off on performance tuning until the end, hoping perhaps to avoid it entirely -- and sometimes this strategy is successful. However, early design decisions can affect the need for and success of performance tuning. If performance is likely to become an issue in your program, performance management should be integrated into the design and development cycle from day one.

This series explores some of the ways in which early design decisions can significantly affect application performance. In this article, I look at one of the most common performance problems: temporary object creation. A class's object-creational behavior is often determined at design time -- and not always deliberately -- sowing the seeds for performance problems later.


Performance problems come in many varieties. The easiest to fix are those where you simply have chosen a poor algorithm for performing a computation -- such as using a bubble sort to sort a large data set -- or where you are recomputing a frequently used data item every time it is used instead of caching it. You can easily spot these types of bottlenecks using profiling and, once found, they usually can be corrected quickly. However, many Java performance problems stem from a deeper and harder-to-fix source -- the interface design of a program's components.

Most programs today are constructed from components that have been either developed internally or acquired from an outside vendor. Even when programs don't rely heavily on pre-existing components, the object-oriented design process encourages applications to be factored into components, as this simplifies the design, development, and testing process. While these advantages are undeniable, you should recognize that the interfaces implemented by components might have a significant effect on the behavior and performance of the programs that use them.

At this point, you may be asking what interfaces have to do with performance. Not only does a class's interface define what functions the class can perform, but it also can define its object-creational behavior and the sequence of method calls required to use it. How a class defines its constructors and methods will dictate whether an object can be reused, whether its methods will create -- or require its client to create -- intermediate objects, and how many method calls a client needs to make in order to use that class. All of these factors affect program performance.

Watch out for object creations

One of the fundamental Java performance management principles is this: Avoid excessive object creation. This doesn't mean that you should give up the benefits of object-oriented programming by not creating any objects, but you should be wary of object creation inside of tight loops when executing performance-critical code. Object creation is expensive enough that you should avoid unnecessarily creating temporary or intermediate objects in situations where performance is an issue.

The String class is a major source of object creation in programs that manipulate text. Because Strings are immutable, a new object must be created each time a String is modified or constructed. As a result, performance-conscious programmers avoid excessive use of String. However, this is often impossible. Even when you eliminate reliance on String from your code, you frequently find yourself using components whose interfaces are defined only in terms of String. Thus, you end up being forced to use String anyway.
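As a small illustration of the point about immutability, the sketch below builds the same text two ways. The class and method names are hypothetical, and StringBuilder is used on the assumption of a Java 5 or later runtime (on the JDKs of this article's era, StringBuffer filled the same role).

```java
// Hypothetical helper class, for illustration only.
final class HeaderJoiner {

    /** Builds the text with String concatenation; each += yields a new String. */
    static String joinWithStrings(String[] parts) {
        String result = "";
        for (String part : parts) {
            result += part;          // result now refers to a brand-new String
        }
        return result;
    }

    /** Builds the text in one growable buffer; only toString() creates a String. */
    static String joinWithBuilder(String[] parts) {
        StringBuilder buffer = new StringBuilder();
        for (String part : parts) {
            buffer.append(part);     // appends in place; no new String per pass
        }
        return buffer.toString();
    }
}
```

Both methods return equal text. The difference is only in how many intermediate String objects are created along the way, which grows with the number of parts in the first method and stays at one in the second.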

Example: Regular expression matching

As an example, suppose you write a mail server called MailBot. MailBot needs to process the MIME header lines -- such as the send date or the sender's email address -- located at the top of each message. It will process the MIME header lines using a component for matching regular expressions to make the procedure easier. MailBot is smart enough not to create a String object for each header line or header element. Instead, it fills up a character buffer with the input text and identifies the headers to be processed by indexing into this buffer. MailBot will call the regular expression matcher to process each header line, so the matcher's performance will have a significant effect on MailBot's overall performance.

Let's start with an example of a very poor interface for your regular expression matcher class:

public class AwfulRegExpMatcher { 
  /** Create a matcher with the given regular expression and which will
   *  operate on the given input string */
  public AwfulRegExpMatcher(String regExp, String inputText);
  /** Retrieve the next match of the pattern against the input text,
      returning the matched text if possible or null if not */
  public String getNextMatch();
}

Even if this class implements an efficient regular expression-matching algorithm, any program that uses it heavily will suffer. Since the matcher object is tied to the input text, every time you want to invoke it, you will have to first construct a new matcher object. Since you aim to reduce unnecessary object creations, having the ability to reuse the matcher would seem an obvious place to start.

The class definition below illustrates another possible interface for your matcher, which allows for reuse of the matcher, but is still pretty bad:

public class BadRegExpMatcher { 
  public BadRegExpMatcher(String regExp);
  /** Attempts to match the specified regular expression against the input
      text, returning the matched text if possible or null if not */
  public String match(String inputText);
  /** Get the next match against the input text, or return null if no match */
  public String getNextMatch();
}

Ignoring the more subtle points of regular expression matching -- such as returning matched subexpressions -- what's wrong with this seemingly harmless class definition? From a functionality point of view, nothing. But from a performance point of view, a lot. First, the matcher requires its caller to create a String to represent the text to be matched. MailBot tries to avoid generating String objects, but when it finds a header it wants to parse with a regular expression, it has to create a String to satisfy BadRegExpMatcher:

  BadRegExpMatcher dateMatcher = new BadRegExpMatcher(...);
  while (...) {
    ...
    String headerLine = new String(myBuffer, thisHeaderStart, 
                                   thisHeaderEnd-thisHeaderStart);
    String result = dateMatcher.match(headerLine);
    if (result == null) { ... }
  }

Second, the matcher creates the result string even if MailBot is interested only in whether the string matched or not, and doesn't require the matched text. This means that in order to simply use BadRegExpMatcher to validate that a date header conforms to a specific format, you must create two String objects -- the input to the matcher, and the resulting matched text. Two objects may not seem like very many, but if you have to create two objects for each header line of each mail message that MailBot processes, this could significantly influence performance. The fault doesn't lie in the design of MailBot but in the design of -- or the choice to use -- the BadRegExpMatcher class.

Note that returning a lighter-weight Match object -- which could expose the getOffset(), getLength(), and getMatchString() methods -- instead of returning a String would not improve performance by much. While creating a Match object is probably cheaper than creating a String -- since creating a String involves allocating a char[] array and copying the data -- you still create an intermediate object that is of little value to your caller.
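To make the Match idea concrete, here is a minimal sketch of what such a result object might look like. Only the three accessor names come from the text above; the fields, the constructor, and the choice to share the caller's char[] rather than copy it are assumptions made for illustration.

```java
/** Hypothetical Match result object; only the accessor names come from the article. */
final class Match {
    private final char[] source;  // shared reference to the searched text, not a copy
    private final int offset;
    private final int length;

    Match(char[] source, int offset, int length) {
        this.source = source;
        this.offset = offset;
        this.length = length;
    }

    int getOffset() { return offset; }

    int getLength() { return length; }

    /** Creates the String only when the caller asks for it. */
    String getMatchString() {
        return new String(source, offset, length);  // copies length chars
    }
}
```

Even in this form, every successful match still allocates one Match instance, which is the cost the paragraph above warns about when the caller only wants a yes-or-no answer.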

It's bad enough that BadRegExpMatcher forces you to provide it with input in the form that it wants to see, rather than in the form that you can more efficiently provide. But using BadRegExpMatcher comes with another risk, one that is potentially even more hazardous to MailBot's performance: You began with the noble intention of avoiding the use of Strings when processing the mail headers. But since you are forced to create many String objects anyway to satisfy BadRegExpMatcher, you might be tempted to abandon that goal and use String even more liberally. Now, one component's bad design has infected the program that uses it. Even if you later find a better regular expression component that doesn't require you to provide it with a String, your whole program might be infected by then.

A better interface

How can you redesign BadRegExpMatcher so that it does not cause such problems? First, the matcher should try not to dictate the format of its input. It should be willing to accept the input in whatever formats its caller can efficiently provide. Second, it should not automatically generate a String for the resulting match; it should return enough information so that the caller can create it if desired. (It can also provide a method to do this, as a convenience, but its use should not be required.) Here is a better interface:

class BetterRegExpMatcher { 
  public BetterRegExpMatcher(...);
  /** Provide matchers for multiple formats of input -- String,
      character array, and subset of character array.  Return -1 if no
      match was made; return offset of match start if a match was
      made.  */
  public int match(String inputText);
  public int match(char[] inputText);
  public int match(char[] inputText, int offset, int length);
  /** Get the next match against the input text, if any */
  public int getNextMatch();
  /** If a match was made, returns the length of the match; the caller
      can reconstruct the match text from the offset returned by
      match() and this length */
  public int getMatchLength();
  /** Convenience routine to get the match string, in the event the
      caller happens to want a String */
  public String getMatchText();
}

The new interface frees the caller from having to convert the input text into the format desired by the matcher. MailBot can now call match() as follows:

  int resultOffset = dateMatcher.match(myBuffer, thisHeaderStart, 
                                       thisHeaderEnd-thisHeaderStart);
  if (resultOffset < 0) { ... }

This accomplishes the desired goal without creating any new objects. As an added bonus, its interface design style adheres to Java's "lots-of-simple-methods" design philosophy.

The exact performance impact of the extra object creations depends on the amount of work performed by match(). You can determine an upper bound on the performance gap by creating and timing do-nothing implementations of the two regular expression matcher classes. Using the Sun 1.3 JDK, the code fragments above ran nearly 50 times faster with the dummy BetterRegExpMatcher class than with the dummy BadRegExpMatcher class. Using a simple implementation that supports substring matching only, BetterRegExpMatcher ran five times faster than its BadRegExpMatcher counterpart.
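The substring-only implementation used for that measurement is not shown in the article. As a rough sketch of what one could look like (an assumption, not the author's code), the class below follows the BetterRegExpMatcher method signatures but treats the pattern as a literal string. The class name SubstringMatcher, and the convention that getNextMatch() and getMatchLength() return -1 when there is no current match, are choices made here for illustration.

```java
/** Sketch: literal substring matching behind a BetterRegExpMatcher-style interface. */
final class SubstringMatcher {
    private final char[] pattern;
    private char[] input;         // text being searched; char[] input is not copied
    private int end;              // exclusive end of the searchable region
    private int matchStart = -1;  // offset of the current match, or -1

    SubstringMatcher(String literal) {
        this.pattern = literal.toCharArray();   // one-time cost at construction
    }

    public int match(String inputText) {
        return match(inputText.toCharArray());  // convenience overload; allocates a copy
    }

    public int match(char[] inputText) {
        return match(inputText, 0, inputText.length);
    }

    /** Core method: searches inputText[offset, offset + length) in place. */
    public int match(char[] inputText, int offset, int length) {
        this.input = inputText;
        this.end = offset + length;
        this.matchStart = find(offset);
        return matchStart;
    }

    /** Continues the search one position past the current match (overlaps allowed). */
    public int getNextMatch() {
        if (matchStart < 0) {
            return -1;
        }
        matchStart = find(matchStart + 1);
        return matchStart;
    }

    public int getMatchLength() {
        return matchStart < 0 ? -1 : pattern.length;
    }

    /** The only method that creates a String for the result, and only on request. */
    public String getMatchText() {
        return matchStart < 0 ? null : new String(input, matchStart, pattern.length);
    }

    private int find(int from) {
        for (int i = from; i + pattern.length <= end; i++) {
            int j = 0;
            while (j < pattern.length && input[i + j] == pattern[j]) {
                j++;
            }
            if (j == pattern.length) {
                return i;
            }
        }
        return -1;
    }
}
```

With a matcher like this, a call such as match(myBuffer, thisHeaderStart, thisHeaderEnd-thisHeaderStart) scans the caller's buffer directly and allocates nothing. The two shorter overloads simply delegate to the three-argument one.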

Interchange types

BadRegExpMatcher forced MailBot to convert the input text from the character array it already had into a String, resulting in an unnecessary object creation. Ironically, many implementations of BadRegExpMatcher would immediately convert that String right back into a character array for easy access to the input text. Not only does this allocate yet another object, but it means that you performed all that work only to end up with the same representation that you started with. Neither MailBot nor BadRegExpMatcher actually wanted to deal with a String -- String just seemed like the obvious format for exchanging text between components.
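The round trip described above can be sketched in a few lines. The class and method names are hypothetical stand-ins, not the internals of any real matcher: unpack() plays the part of a String-only matcher that immediately unpacks its argument, and callThroughString() plays the part of MailBot.

```java
// Hypothetical stand-in for a String-only matcher; not the article's code.
final class RoundTrip {

    /** What a String-based matcher might do first with its argument. */
    static char[] unpack(String inputText) {
        return inputText.toCharArray();   // copy #2: String -> new char[]
    }

    /** What the caller must do to reach such a matcher from a char buffer. */
    static char[] callThroughString(char[] buffer, int start, int end) {
        String headerLine = new String(buffer, start, end - start);  // copy #1
        return unpack(headerLine);
    }
}
```

The array that comes back holds the same characters that were already sitting in the caller's buffer, but two allocations and two copies were needed to get there.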

In the BadRegExpMatcher example above, the String class acts as an interchange type. An interchange type is one in which neither the caller nor the callee is likely to actually have or want the data in that format, but both can easily convert to or from it. Defining an interface in terms of an interchange type reduces the interface's complexity while maintaining its flexibility, but sometimes this simplicity comes at the cost of performance.

A prime example of an interchange type is the JDBC ResultSet interface. It is unlikely that any native database interface would provide its results as a ResultSet interface, but the JDBC driver can easily wrap the native representation provided by the database with a ResultSet implementation. Similarly, no client program would naturally represent its data records as a ResultSet, but you can convert from a ResultSet into the desired representation with little difficulty. In the case of JDBC, you accept this level of overhead because it comes with the valuable benefit of standardization and portability across database implementations. However, be aware of the performance overhead that interchange types can carry.
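As an illustration of the client side of that conversion, the sketch below copies rows out of a ResultSet into a record type of the program's own. The MailHeader and HeaderRows classes and the column labels "name" and "value" are assumptions invented for this example; only ResultSet.next() and ResultSet.getString(String) are standard JDBC calls.

```java
/** Hypothetical client-side record type; not part of JDBC or the article. */
final class MailHeader {
    final String name;
    final String value;

    MailHeader(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

final class HeaderRows {

    /** Copies every remaining row into the client's own representation. */
    static java.util.List<MailHeader> toHeaders(java.sql.ResultSet rs)
            throws java.sql.SQLException {
        java.util.List<MailHeader> headers = new java.util.ArrayList<>();
        while (rs.next()) {   // advance the cursor; false when rows run out
            // Column labels "name" and "value" are assumed for this sketch.
            headers.add(new MailHeader(rs.getString("name"), rs.getString("value")));
        }
        return headers;
    }
}
```

Each row costs at least one extra object on top of whatever the driver created internally, which is the interchange overhead the paragraph above describes.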

In the case of RegExpMatcher, defining the interface solely using String offered little in terms of reduced complexity or increased portability. It did cost something in performance, and thus does not seem to be a sound tradeoff. When designing component interfaces, using interchange types is often tempting, as it makes the interfaces look "cleaner," but you should still ensure that you make smart tradeoffs. Sometimes -- as was the case with RegExpMatcher -- there are alternate input or output formats that your callers are more likely to use; you should consider whether you can easily accommodate them as well.
