Make bad code good

Refactor broken Java code for fun and profit

So, you've inherited some code. It's 50,000 lines of the oldest code on the project. The authors have left the company and won't return your calls. It's undocumented, badly designed, and buggy; yet the rest of the team uses it all the time. There's absolutely no option but to make it work, and that's your job. What do you do?

Don't stress. This article will make some concrete suggestions as to the sorts of things you can do, from the simple and mindless to the complex and dangerous. The process can be mostly evolutionary and absolutely safe. In a couple of months, colleagues will stop complaining about your section of the code, and a couple of months after that, your revamped code will be the pride of the project.

Some of the suggestions that follow are nothing new. Indeed, concepts such as documenting code and following coding standards have been drummed into programmers since they went to school. However, the usual reasons given are motherhood statements -- things you wouldn't say the opposite of (be a good boy, write understandable code, for example). I've tried to explain why you would want to do those things, and what their positive effects are.

So, let's get to it.

Start with the easy things

When making bad code good, it's best to start with easy changes, to make sure you don't break anything. A large lump of bad code breaks easily, and nobody can be expected to fix bugs in it without some preparation.

Fix the Javadoc comments

You never realize a comment's importance until you need to read it. Even if nobody else reads it (and usually, they don't), Javadoc comments are important for the code authors to help them remember what the code/class/parameter should do. Pieces of information particularly important to include are:

  • Whether an object parameter may be null, and what it means if it is
  • Whether a parameter needs to be mutable or not
  • Whether a return value is mutable
  • Whether a return value may be null
  • Whether changes to the return value affect the returner's internal state

Fixing the Javadoc comments doesn't mean just adding them where there were none before. You should also delete misleading and trivial comments (such as documenting setters and getters) as they engender a false sense of security. Important information exists for inclusion in Javadoc comments, so there's no point wasting the time of potential readers.

When you write new methods and classes, try to include some Javadoc comments (just a few) to explain what they do. Even if it's obvious to you, the next person to get the code might thank you for the effort. If you don't know what to write in the Javadoc comments, you have no business writing the code.

Javadoc comments can also serve as a record of what you learned about the code. When you figure out what a particularly tricky, clever, or ugly method does (especially a private or protected method with no implied contract), write a comment to record it for posterity. This expands the amount of the code under your control and helps you later when you're trying to decipher something related.

Of course, Javadoc comments should only explain the effect of a method, never the implementation (because you might change that). If code within a method needs explanation, use standard code comments. A school of thought exists that says with good enough identifiers you don't need code comments at all. I'm not that extreme, but I take the point. Indeed, many code comments don't tell you anything new. I urge you to leave those uninformative comments out, as they only serve to make simple code verbose and intimidating.

Fix the code formatting

Code formatting standards always spark contention. I'm in favor of coding standards, as long as they're the ones I already use, of course. If you take over a piece of badly formatted code (different from how you would do it), and you're having trouble reading it, you'll want to reformat it. Having a project-wide coding standard makes it less defensible for people to reformat code to their personal unreadable style and more likely that any particular piece of code makes sense to you the first time you read it. The most widely accepted coding standard is probably Sun's.

The advantage of reformatting badly formatted, badly written code is that as you change it, you leave a clear trail indicating where you've been. If you find yourself looking at a badly formatted method, you know not to trust it because the expert (that's you) has not put a seal of approval on it. Moreover, if you reformat code, at least you know you've read it. You may not have understood it, but you might notice repeated sections of code, calls to obsolete methods, or any number of obviously bad things. If you're lucky, you might find a bug caused by indentation that doesn't match the brackets.

While you're messing with the code, you might like to rename identifiers to be meaningful. I have almost given up using abbreviations in identifier names because I find them difficult to understand and hard to pronounce. There's no point having an easy-to-type identifier if it makes so little sense that you can't remember it. And please, use ALLCAPS identifiers only for final variables!

Finally, I think it's just dishonest to claim that code that does not meet project coding standards is good code.

Follow project conventions

Your project, if it is a big one, probably includes some conventions. Examples that I have encountered include:

  • Error logging and tracing: The project may specify a standard interface to the logging system. Code that does not use the standard interface may write directly to System.out, causing messy output that can't be turned off by the logging system configuration. With a standard logging interface, logging implementations can be changed as the project evolves without breaking existing code.
  • Access to resources: There may be conventions for location of images to display in the user interface. Code that does not comply with the conventions, for example by referring to absolute file locations rather than the class path, will break as soon as the project needs to be packaged for release.
  • Use of CORBA: Distributed object systems, such as CORBA, add complexity to a project. CORBA's limited capability to pass objects as arguments can, unless the programmers are vigilant, result in struct types being used instead of proper objects. Project standards can require that CORBA objects be wrapped in an OO/CORBA impedance mismatch adapter, which restricts the CORBA types to the remote access layer where they belong. Similarly, ORB instance creation is a task that belongs in the remote access layer, not in general code.

Changing bad code to follow these project conventions yields low-hanging fruit in the form of immediate results. Inventing conventions where they do not exist proves more difficult, but in all likelihood is worth the effort.

Refactoring

With any luck, by fixing the Javadoc comments, reformatting the code, and changing the code to comply with project conventions, you've learnt something about the code. Armed with this experience, you're prepared to change it -- the refactoring process. In case you're unfamiliar with the term, refactoring means changing the design of existing code. (See Resources for more on refactoring.)

Write automated unit tests

Unit tests are one of those things that we all think of as being done by the mythical perfect programmer, yet something for which the rest of us just never find time (but read Test Infected). However, with some experience, you'll find that unit tests save you time. Automated unit tests help you go fast. Here's how.

In all likelihood, if you are working on a large piece of bad code, it's part of a bigger project. It might be a distributed system with assorted servers and associated processes, and some sort of user interface. The only way to tell whether your code works is to start up some of the system, try a few things, and see if it works. Starting all these processes, and remembering the whole sequence, wastes your time. You're employed to fix bugs, not wait for computers. The solution: write small, quick tests that you can run quickly and that will tell you whether or not the code works. A JUnit test gives you a green light or a red light. Isn't that easier than checking all the output to make sure it is correct? With proper procedures, testing the correctness of the code becomes an absolute no-brainer.

Of course, writing unit tests is not easy, which is one of the reasons they're not as popular as they should be. Writing unit tests for distributed systems is very difficult indeed. However, writing correct code without testing proves even harder than that, so the easy way out is to write unit tests. You'll find that if you start with unit tests for simple parts of the code that have small interfaces, you'll eventually have the skills to tackle larger and more complex parts of the code. And almost certainly, you'll find some embarrassingly stupid bugs along the way.

Unit tests are very important because they tell you when you break things. And things are very likely to be broken, given the daring nature of the upcoming refactorings. By the way, if you break something, and somebody else detects it first, your unit tests need to be improved. Other people are handy like that.

You aren't gonna need it

One of the Extreme Programming mantras says, "You aren't gonna need it." That is, unless you have a requirement for some code today, you have no reason to write it, and to make your job easier, you shouldn't. Unnecessary code becomes particularly easy to detect in code already in use -- it's the bits nobody calls because they don't work. Usually, the requirement has lapsed because nobody expects it to ever work. In this case, it's easiest to delete the code. Think of every line you delete as being something you'll never need to understand, document, reformat, or refactor.

Also, find ideas that were started but never completed, and delete them. You'll find that a lot of refactorings or design decisions that you don't understand were made only to accommodate changes that never saw the light of day, and by deprecating those features, you can simplify the design. If it later turns out that you need the feature back, you can put it in and do it right.

Another fertile source of unnecessary code includes completed but nonetheless unused features. These are a consequence of designing code before knowing what it is supposed to do. Take them out. Methods and classes will vanish before your very eyes, and nobody will notice.

Break up big classes

With big classes the usual causes are:

  • No clear definition of what the class should do, resulting in it doing rather a lot of different things
  • Out-of-control inner classes
  • Numerous static and instance methods
  • Excessive numbers of convenience methods
  • Cut-and-pasted code

My rule of thumb states that 500 lines of code amounts to a big class. Classes over 1,000 lines are bad and must be fixed, and classes over 2,000 lines equal acts of terrorism. In the paragraphs below you'll find suggestions for five areas on how to break up big classes.

Firstly, extract the named inner classes. Some people write inner classes so they can achieve friend classes as you would in C++; for those people there is no solution but to stop them from writing Java. However, developers create many inner classes simply to put them near the code from which they are used. In such situations the inner classes can easily be extracted. However, tiny, anonymous inner classes are usually so intimately attached to their containing class that they can safely be left in the code. If they really are a problem, give them a name and move them out.

Secondly, examine the class's static and instance (nonstatic) methods. If you have a lot of both, your class may be trying to be two different things. Perhaps the static methods implement some sort of singleton, while the instance methods implement things referred to by the singleton. If that's the case, it won't be much effort to break the class in two, with one taking over the responsibilities of the singleton and the other taking over the responsibilities of the instances. You might also find some static methods that just provide some useful functionality. They can be extracted into a tools class. The difficult part about this sort of refactoring is finding all the references to this class that no longer compile. Refer to the "Tools" section below for some ideas.

A third idea: closely examine the class's instance variables. If two or more must be kept in sync (for example, a map from As to Bs and the reverse map from Bs to As), you should extract those variables into a new class. The new class's methods will replace any direct access to those instance variables in the original class. By placing all of the objects' behavior into one class, the relationship between those variables will be more aptly defined and more difficult to break.

Fourthly, if your class is one of several implementations of an interface, you might find that all implementations of that class share some common behavior, but the code to implement it is duplicated in each class -- a sure sign that you need an abstract superclass that implements the shared behavior. By taking the shared behavior up into the superclass, you're removing code from two classes, and coincidentally documenting the expected behavior of implementations of the interface. At the very least, you can write an empty abstract superclass, extend it in each implementation, and move shared behavior into the superclass as you resolve incompatibilities.

When dealing with interfaces of classes, it's best to keep them small. The Law of Demeter says that you should never write code that invokes a getter on an object returned from a getter, for example x.getBlah().getFoo(). Instead, x's class should have a getFoo() method that does that for you. I think that law is wrong. If x.getBlah()returns a complex object, the public interface expands way out of control. Furthermore, changes to the interface of x.getBlah() cause changes to the interface of x. For that reason, I suggest exactly the opposite: Convenience methods with simple alternatives should be removed from the public interface of the class. If you suspect that a public method does not need to be public, you can do an experiment by deprecating it. You won't break the build, and you'll soon find out if anyone cares whether or not it can be called.

Finally, it's my conviction that big classes with multiple instance variables and synchronized methods represent reliable sources of deadlocks. Programmers often use synchronization to protect against concurrent access to instance variables that could corrupt relationships among those variables. For example, consider the case above where we had a map and its reverse as instance variables of a class. Obviously, synchronization should be used to ensure that updates to the pair of variables remain consistent. If the class has another, separate pair of related variables, synchronization should be used to ensure their consistency as well. However, as the instance has only one synchronization lock, access to the separate pairs of dependent variables is sequentialized as well, although not required. With the inevitable addition of a little more complexity, the unnecessary synchronization will result in a deadlock, when all that was being attempted was to protect the instance variables. The related variables should be extracted into subclasses, where they can have their own synchronization locks.

Break up big methods

Just as big classes prove difficult to understand, so do big methods, whose usual causes include:

  • Too many options, causing the method to do too many things
  • Not enough support from other methods, causing the method to do tasks at a lower level than it should
  • Overly complicated exception handling

My rule of thumb for methods: If I can't fit the entire thing on one screen, it's too long. It's an entirely practical rule with no theoretical basis, but it seems right and works for me. Remember though, when you read a method of 100 statements for the first time, it looks like 100 unrelated statements, and it is only after some study that you can see the internal structure. Life would be easier for you and for others if you made that internal structure explicit.

The most obvious target when attacking overgrown methods is code repeated in multiple methods. You would have noticed these when you were reformatting the code. A typical case occurs when a getter method may throw an exception, for example to indicate that the value is not available. Of course, everywhere you need to get the value, you handle the exception and log an error message. This means that in every place you get the value, instead of one simple function invocation you have five or six lines of code. Duplicate those in 10 methods where you need to look up that value, and you have 50 or 60 lines of bad code, even though all you are trying to do is get some value.

To solve the problem, you can start by extracting the five or six lines of code into a new private (or protected) method. Decide on some way of handling the exceptional condition, for example by returning a default value, or by throwing another exception type that is propagated by the caller. Code the exception logging and handling only in the new method, and fix the callers to simply invoke the new method. This suggestion seems entirely trivial and it's basic object-oriented practice, but I have seen good programmers fall into the trap many times. The best defense: refuse to write any line of code twice, and always make a new method in preference to duplicating even a small piece of code.

Loops serve as another obvious target for removal from a method. Often a method will call another method to get a Collection back from it, and immediately iterate over the Collection. This is known as mixed levels of abstraction -- the caller doesn't need to look inside the collection to get it, but the caller does need to look inside to use it. It's cleaner to write a method that will do the iteration over the collection for you, so the original method can use the collection without being concerned about its implementation. By the way, studying the Java Collections API can save you a lot of effort in this type of code.

Once you've extracted any loops that don't belong, it'll be easier to read the method. You'll be able to see what is being done to what without the distraction of loop variables, indentation, and Iterator declarations. You might then notice that a chunk of the method only works with one of the parameters, and the results of that computation are combined with the values of the other parameters separately. This is a perfect opportunity to extract the operations on that parameter into a separate method. If you give the new method a good name, it'll be much clearer as to what's happening.

Moving on, exceptions represent an excellent source of long methods and hard-to-read code. As exceptions are a new thing for most people (since Java was invented, or C++ became popular), there's not much folklore on how to use them. Even worse, as exceptions usually indicate that something has gone wrong, and you hardly ever know why, you don't know what to do when you do get one. This generally results in overly long, complex, and useless exception-handling code.

With exceptions, it's best to realize that you don't have to handle every one. If you don't know what to do with a ClassNotFoundException, or an InterruptedException, don't do anything. You're allowed to pass them back to your caller. Of course, that changes the method's signature, which is not always possible, so it's best to declare your methods as throwing a nested exception (an exception that has another exception as its cause). That way, your caller can discover that the FooException you threw was caused by a ClassNotFoundException, and deal with it appropriately (or rethrow it).

It's advisable for your foo package to have a generic FooException to throw from any method, just in case something goes wrong. Then your caller has to catch only FooException. For particular problems, you can implement FooNotFoundException and IllegalFooException. As long as these subclass FooException, you won't have to change your method signature, and the caller can include enough detail to make sensible responses to the problem.

Now consider the following code, which uses the Reflection API in a fairly simple way:

public Object getInstance(String s) throws FooException {
   try {
      Class c = Class.forName(s);
      Class[] argTypes = { Integer.TYPE };
      Constructor constr = c.getConstructor(argTypes);
      Object[] args = { new Integer(0) };
      return constr.newInstance(args);
   } catch (ClassNotFoundException ex) {
      throw new FooException("class not found", ex);
   } catch (NoSuchMethodException ex) {
      throw new FooException("no such constructor", ex);
   } catch (InvocationTargetException ex) {
      throw new FooException("exception in constructor", ex);
   } catch (IllegalAccessException ex) {
      throw new FooException("can't access constructor", ex);
   } catch (InstantiationException ex) {
      throw new FooException("something bad", ex);
   }
}

In the example above, I consider the five catch statements in a row to be bad code, as they're essentially useless, they take up a lot of space, and if I ever change FooException I might have to modify them. Instead, I recommend putting them in a method by themselves, like this:

public Object newGetInstance(String s) throws FooException {
   try {
      Class c = Class.forName(s);
      Class[] argTypes = { Integer.TYPE };
      Constructor constr = c.getConstructor(argTypes);
      Object[] args = { new Integer(0) };
      return constr.newInstance(args);
   } catch (Exception ex) {
      throwFooException(ex, "constructing new " + s);
      return null;
   }
}
private void throwFooException(Throwable ex1, String explain)
      throws FooException {
   explain = toString() + ": " + explain;
   try {
      throw ex1;
   } catch (ClassNotFoundException ex) {
      throw new FooException(explain, ex);
   } catch (NoSuchMethodException ex) {
      throw new FooException(explain, ex);
   } catch (InvocationTargetException ex) {
      ex = ex1.getTargetException();
      throw new FooException(explain, ex);
   } catch (IllegalAccessException ex) {
      throw new FooException(explain, ex);
   } catch (InstantiationException ex) {
      throw new FooException(explain, ex);
   } catch (Exception ex) {
      throw new FooException(explain, ex);
   }
}

I claim that almost all of the class's exceptions will be converted into FooException, and almost always in the same mindless way, so you might as well do it explicitly and efficiently. The throwFooException() method can be reused many times. For each invocation, you can change 10 lines of exception-handling code into 3. In addition, you automatically add contextual information (toString()), providing useful debugging information in all FooExceptions. If you're ever required to improve that debugging information, you only have to do it in one place. Remember the lesson from Animal Farm: "Less code good, more code bad." (That's not exactly what George Orwell wrote ...)

Rearchitecting

As you increase your knowledge of the code, making it more comprehensible and better factored, you'll inevitably find some bits you don't like. When you started this project, you weren't really capable of changing much, but now you've played with the code, and it's not as damaged as it once was. Along the way, you have seen some bad things and had some ideas on how to fix them. It's time to make some serious changes.

Rewrite code you don't understand

If you know what the code should do, and it seems to do it, but you can't figure out quite how, it's bad code. If the original author did something tricky, some comments should have been provided. More likely though, the author was clueless and the code really is a disaster. Rewrite it.

By rewriting the code, you have the opportunity to do it the easy way, and to include useful comments. Start by writing down in Javadoc comments what the new code does (just to make sure that you know). Then you can take advantage of the byproducts of your refactoring: call this new method that does that, use that new data structure to store this. The new code will make more sense to everybody and will probably run faster as well. Your unit tests will tell you whether your new code works. Go into your bug tracking system and close all the bugs you have fixed.

Move to a layered architecture

Let's hope that your inherited, formerly bad code now looks better. It's time to look at the bigger picture: how the packages relate to each other. Using some sort of tool that tells you what classes invoke what methods (that information can be extracted from the class files), find out the interclass dependencies. From those dependencies, infer the interpackage dependencies. If you can't draw a hierarchy (without any loops!) of what packages depend on what other packages, you have some architectural work to do.

Using your instincts, a code browser, and a great deal of guesswork, figure out what such an architecture might look like one fine day. Then identify the particular problem spots. A particular class that frequently refers to other packages or a particular constant referred to from other packages hints that classes or parts of classes reside in the wrong package. Ask yourself whether it would make sense to move that code to another package.

After a lot of consideration, you can derive a package hierarchy that looks sensible. Although it may require a lot of work to achieve the clean architecture, it gives you a goal to work toward. Your unit tests have clarified what the code should do, and now the architecture allocates the responsibility to do it to packages and classes. All you have to do is fill in the code.

Tools

All this cutting and pasting, extracting, and creating does not come without a cost. You will find yourself mindlessly changing parameter lists more frequently than you would like. You will have more compilation errors than you ever dreamed possible. You will (sometimes) insert more bugs than you take out. To help you through the boring bits, you need good tool support.

Assuming you have a good editor with search and replace functionality (preferably on multiple files at a time), you need a fast compiler to quickly detect what you've broken. Multiple invocations of javac probably won't run fast enough, instead, to avoid the startup cost, you may turn to a tool like Jikes or something that keeps javac running in the background.

A good IDE, one that compiles fast and saves you from referring to the Javadoc all the time, will stand you in good stead. I was never a fan of Java IDEs, but they seem to have evolved into useful tools. Being able to compile without changing windows and losing concentration proves vital.

Moreover, a reverse-engineering tool would be useful for determining the architecture of the system. As implied above, I don't think the architecture of the system is anything to be concerned about when you start. After all, there's not much you can do about it. However, when you have the small details under control, you'll almost certainly need automated support to figure out the existing architecture. Look for the buzzword "round-trip engineering" on the Web.

Note: I won't make any specific recommendations for tools, as I have not used most of the sophisticated ones. I'm usually happy with find, grep, and a Python interpreter. I've cobbled together utilities to do most of the things I need.

Conclusion

If you're just starting out with bad code, you've got a long way to go. But don't fret, when the going gets rough, the tough get their chance to shine. You can write good code with a smug sense of superiority and a more than average amount of job security. Moreover, you can learn from the mistakes of others (but blame them when you mess up). You can run your unit tests and say, "It works now." And you can write on your resumé, "I can make bad code good!"

That's got to be worth paying for.

Acknowledgements

I would like to thank all the bad programmers in the world, without whom life would present no challenges. I would also like to thank all the good programmers in the world, without whom sympathetic shoulders to cry upon would not be found. Working on bad code can be frustrating, and kindred spirits will keep you sane.

Further, I would like to acknowledge the contribution of the Extreme Programming inventors, and all contributors on the WikiWikiWeb, without whom the evils of Big Design Up Front and the value of refactoring and unit tests would not be duly recognized.

I would like to thank Ian Clatworthy, Darryl Collins, Christian Larney, Simon McMullen, and Thomas Maslen for their comments on the first draft; they have made this a better article. I would also like to thank David O'Brien for his comments on my English -- it's amazing what you can learn from a professional.

Dr. John Farrell has been a Java programmer since the days when even programmers thought it would be mostly used for writing fancy Webpages. He currently maintains a distributed object server infrastructure that stretches from JDBC through CORBA to just below the GUI layer. It was formerly bad code, but he made it good. Before being a Java programmer, John completed a PhD in functional programming, which gave him an appreciation of elegant languages that run slowly; and worked as a C programmer, during which time he gained an appreciation of automatic garbage collection. He currently spreads the word that Java runs fast enough for real applications, if done right; and that good object-oriented design almost always represents the right way. On the second Monday of every month, he is an active heckler at the Australian Java User Group meetings in Brisbane, Australia.

Learn more about this topic

Join the discussion
Be the first to comment on this article. Our Commenting Policies
See more