Can Assure save Java from the perils of multithreading?

Not without your help, but it's a big step in a good direction

The Java language contains one feature that is so dangerous, so difficult to avoid using, so hard to use correctly, and so pervasively used incorrectly that it has to rank as a serious design flaw. That feature is multithreading. Experienced programmers (myself included) are aware of the hazards associated with multithreading, and employ various strategies to ameliorate the problem. I've been dealing with various forms of multithreading since my first real job in the industry, which was sometime around 1973; and I consider myself a competent and careful programmer.

Therefore, I was shocked by what Assure showed me about some of my "thoroughly debugged" Java code. Assure is an easy-to-use runtime behavioral monitoring and analysis tool developed by Kuck & Associates Inc. (KAI). It found problems that were so glaringly obvious (once pointed out) that I would be embarrassed to outline them. It also found problems so subtle that I had to think for a long time before I could even imagine how such a problem could arise.

This article will explore the tip of the iceberg, the very easiest part of the problem to see, by using Assure to debug example code provided by Sun with the JDK. Wouldn't you assume that such code was as close to bug-free as is humanly possible? This isn't the case, as we shall see. In the process of reading this article, perhaps you'll learn a little about writing code that is thread-safe to begin with, as I did in writing this article.

The extent of the multithreading problem in Java

Languages and environments other than Java permit the use of multiple threads. But Java applications are virtually impossible to write without using them, and most apps use a lot of them. Even the most trivial applet will have three threads: the applet's main run loop, the Java virtual machine's event loop, and the Java display manager's paint loop.

To this we can probably add even more threads for applications with animation timing loops, sound, and network communication. There can be no argument that Java's runtime environment is replete with threads, all running simultaneously and manipulating the same data. To deal with this chaos, Java provides just one low-level construct: the synchronized keyword.

The hazards of synchronization

Synchronization is both necessary and sufficient to control the execution of multithreaded code, but that's akin to saying a raft and an oar are enough to cross any body of water. Sure, you can do it. In practice, however, synchronization is very difficult to use correctly. Here's a list of some of the problems inherent with synchronization:

  • If you don't use synchronization enough, and in the right places, your program will execute using inconsistent sets of data, and thus will produce unpredictable results. This situation is called a data race, because multiple threads race each other to produce and/or consume their common data. This kind of problem typically manifests itself as random, nonreproducible bugs that, when you investigate, appear to result from "impossible" configurations of your data, such as:

    if(a==1) { b=2; }
    if((a==1)&&(b!=2)) { throw new Error("impossible error!"); }
    
  • If you use synchronization too much, or in the wrong places, your threads can deadlock, with two or more threads permanently blocked because each is waiting for another to relinquish some resource they both want to use. This kind of problem typically results in the application "locking up."

  • Threads that temporarily have nothing to do must be suspended and resumed when there is something for them to do again. If a thread should be resumed, but no other thread issues the command, a permanently stalled thread can result.
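The first hazard in the list above can be made concrete with a small sketch. The class and method names here (SafeCounter, countWithThreads) are hypothetical and do not come from any JDK demo. The sketch assumes several threads incrementing one shared counter: because increment() is synchronized, no update is lost. Removing the keyword would turn count++ into an unprotected read-modify-write, and the final total could then come up short on some runs.

```java
// Hypothetical illustration; SafeCounter is not part of any JDK demo.
class SafeCounter {
    private int count = 0;

    // Without 'synchronized', count++ is a read-modify-write that two
    // threads can interleave, silently losing updates (a data race).
    synchronized void increment() {
        count++;
    }

    synchronized int get() {
        return count;
    }

    // Starts 'threads' workers that each call increment() 'perThread' times,
    // waits for all of them, and returns the final count.
    static int countWithThreads(int threads, final int perThread) {
        final SafeCounter counter = new SafeCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < perThread; j++) {
                        counter.increment();
                    }
                }
            });
            workers[i].start();
        }
        for (int i = 0; i < threads; i++) {
            try {
                workers[i].join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        // Four workers, 100000 increments each.
        System.out.println(countWithThreads(4, 100000)); // prints 400000
    }
}
```

Note that get() is synchronized too: a reader that skips the lock is racing just as much as a writer that does.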

Common strategies for dealing with multiple threads

Without turning this article into a tutorial on multithreaded programming, here are some good rules of thumb that are commonly used in Java programming. (For previous JavaWorld articles on multithreading, see Resources.)

  • Do all of your drawing in one place: either do all your drawing in the paint method, or do all your drawing in your run loop, and use the paint method only to set a "needs to be painted" flag. Conversely, do not do any drawing directly in response to various mouse, button, and network events.

  • Use semaphores to indicate when data is ready to be consumed by another thread.

  • Do as little as possible within synchronized methods. In particular, do not do anything that will not finish after some clearly foreseeable series of actions. Do not do anything inside synchronization that might block indefinitely, particularly I/O.

  • Use local variables rather than global variables.

  • Try to organize your code so each piece of data is manipulated in exactly one thread. Any data that is not shared between threads is guaranteed to be safe.

  • Synchronize code that you know will be used by multiple threads and that uses state that may be shared among those threads.
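As a rough sketch of the "data is ready" rule in the list above, a one-slot handoff can be built from synchronized, wait, and notifyAll. The Mailbox class and its method names are hypothetical, chosen only for illustration. The sketch assumes one producing thread and one consuming thread. Each wait() sits in a loop that rechecks the flag, and every state change is followed by notifyAll(); omitting the notification is one way to produce the permanently stalled thread described earlier.

```java
// Hypothetical one-slot handoff; the names are illustrative only.
class Mailbox {
    private Object item;            // the shared data
    private boolean ready = false;  // the "data is ready" signal

    // Called by the producing thread.
    synchronized void put(Object value) throws InterruptedException {
        while (ready) {
            wait();                 // previous item has not been consumed yet
        }
        item = value;
        ready = true;
        notifyAll();                // wake a waiting consumer
    }

    // Called by the consuming thread; blocks until data is ready.
    synchronized Object take() throws InterruptedException {
        while (!ready) {
            wait();                 // releases the lock while suspended
        }
        Object value = item;
        item = null;
        ready = false;
        notifyAll();                // wake a waiting producer
        return value;
    }

    // Demo helper: a producer thread sends 1..count through the mailbox;
    // the calling thread consumes them and returns their sum.
    static int sumThroughMailbox(final int count) {
        final Mailbox box = new Mailbox();
        Thread producer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 1; i <= count; i++) {
                        box.put(Integer.valueOf(i));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < count; i++) {
                sum += ((Integer) box.take()).intValue();
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        }
        return sum;
    }
}
```

Calling Mailbox.sumThroughMailbox(100) passes the integers 1 through 100 between the two threads one at a time and returns 5050.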

Assure: A tool that addresses synchronization problems

One day I was trying to solve a particularly hard-to-find display glitch, the kind that, to my eye, simply reeked of a synchronization problem. As I frequently do when temporarily stumped, I consulted my peers on Usenet, using Deja News (see Resources), and found a brief reference to Assure. Assure is a runtime analysis tool that examines your program's behavior in two phases:

  • In the first phase, you run your Java application (or applet) using Assure's specialized version of the Java virtual machine, which produces a log file.

  • In the second phase, you run the Assure tool itself to analyze the log and present the results. Assure looks for dynamic situations where data races, deadlocks, and thread stalls might occur. The algorithms Assure uses to analyze the target program's behavior are not foolproof -- they indicate the possibility, not the certainty, that a problem exists; and they're not guaranteed to detect all problems that do exist.

Debugging threads using Assure

For the purposes of this article, I decided to use one of Sun's own examples as a guinea pig, both because these are supposed to be good examples and because they're readily available, simple, clean little programs. I didn't have to look very hard; the Molecule Viewer demo was the second one I examined with Assure.

The Molecule Viewer is a fairly simple applet that allows you to view a molecular model from various angles, using the mouse to control the point of view. This demo is found in demo/MoleculeViewer/ below your JDK 1.1 directory.

The first step is to run the demo using Assure's Java runtime, instead of the regular Sun JDK (or your browser's Java, or whatever other Java engine you would normally use):

cd \java\jdk-1.1\demo\MoleculeViewer
d:\java\assurej11\bin\appletviewer -J-assurej example1.html

The model pops up. Drag it around a little using the mouse, then hit Exit. Now run Assure on the log file that has been created. Because I didn't specify a name, the file is called assurej.kgi.

d:\java\assurej11\bin\assurej assurej.kgi

This creates a screen-filling array of windows.

The top window shows a mouse-sensitive tree of classes and the problems detected in each. The bottom two windows present the source from the classes corresponding to the error selected in the top pane. Additional information to help you understand the nature of the complaint and the context of the error is available in pop-up windows.

Assure is quite easy to use, and presents its results in a clear, concise, and convenient manner. Unfortunately -- and this is not Assure's fault -- it requires significant brainpower to understand what Assure is reporting. The same human limitations that lead to threading errors also make it difficult to see them, even when they're right under your nose! The cruel thing is that sometimes nothing is wrong, because you know something about the overall behavior of the program which Assure is unaware of. But more often than not, there really is a problem, and it's your own limited brainpower and inability to think like a computer that has you believing otherwise.

How to use Assure: A working example

The initial run of Assure on the Molecule Viewer produced 63 complaints about Matrix3D and 25 complaints about XYZApp, the two Java source files in the demo applet. It also produced 40 complaints about code inside the JVM -- but I don't want to even think about those!

Examining the complaints about Matrix3D, the problem looked obvious. Essentially, the functions that change the matrix (xrot, yrot, and so on) were not synchronized with respect to the matrix multiply routine, mult.

Here's a snippet from the original sources:

void mult(Matrix3D rhs) {
    float lxx = xx * rhs.xx + yx * rhs.xy + zx * rhs.xz;
    float lxy = xy * rhs.xx + yy * rhs.xy + zy * rhs.xz;
    float lxz = xz * rhs.xx + yz * rhs.xy + zz * rhs.xz;
    float lxo = xo * rhs.xx + yo * rhs.xy + zo * rhs.xz + rhs.xo;
    float lyx = xx * rhs.yx + yx * rhs.yy + zx * rhs.yz;
    float lyy = xy * rhs.yx + yy * rhs.yy + zy * rhs.yz;
    float lyz = xz * rhs.yx + yz * rhs.yy + zz * rhs.yz;
    float lyo = xo * rhs.yx + yo * rhs.yy + zo * rhs.yz + rhs.yo;
    ...
}

void yrot(double theta) {
    theta *= (pi / 180);
    double ct = Math.cos(theta);
    double st = Math.sin(theta);
    float Nxx = (float) (xx * ct + zx * st);
    float Nxy = (float) (xy * ct + zy * st);
    float Nxz = (float) (xz * ct + zz * st);
    float Nxo = (float) (xo * ct + zo * st);
    ...
    xo = Nxo;
    xx = Nxx;
    xy = Nxy;
    xz = Nxz;
}

No doubt mult was being used inside paint, while xrot, yrot, and friends were being called during some mouse event in another thread. Clearly these need to be synchronized, and that's exactly what I did next. Because Assure reports only on code that actually ran, I synchronized all of the matrix manipulation primitives, not just the ones it flagged. This may be overkill, but in this application the synchronization overhead isn't critical, and each method holds its lock only for a short, bounded computation.

Note that this problem could have been fixed in other ways -- for example, by recording the x, y, z motion required in the mouse event, and actually doing the rotation in the paint method, before painting. This would mean that no synchronization would be required in the matrix manipulations themselves.
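A minimal sketch of that alternative might look like the following. PendingRotation and its methods are hypothetical names, not part of the Molecule Viewer sources. The sketch assumes the mouse handler only records angles and the paint method is the sole caller of the matrix routines. It also simplifies by summing angles between paints, which is not identical to composing each small rotation in turn.

```java
// Hypothetical sketch; PendingRotation is not part of the Molecule Viewer.
class PendingRotation {
    private double dx = 0.0; // accumulated x-axis rotation, in degrees
    private double dy = 0.0; // accumulated y-axis rotation, in degrees

    // Called from the mouse-event thread: record the motion, touch no matrix.
    synchronized void add(double xDegrees, double yDegrees) {
        dx += xDegrees;
        dy += yDegrees;
    }

    // Called from the paint thread: fetch and clear the accumulated motion
    // in one atomic step. Returns {xDegrees, yDegrees}.
    synchronized double[] takeAndReset() {
        double[] result = { dx, dy };
        dx = 0.0;
        dy = 0.0;
        return result;
    }
}
```

In paint, the code would call takeAndReset() and then apply xrot and yrot with the returned angles before drawing. The only shared state is the pair of accumulated angles, so the matrix itself is touched by exactly one thread.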

So, with my chosen synchronization strategy, I added synchronized to most of the method declarations in Matrix3D and ran the test again, fully expecting no complaints.

synchronized void yrot(double theta) {...}
synchronized void xrot(double theta) {...}
synchronized void mult(Matrix3D rhs) {...}
/* etc. */

Imagine my surprise when the problems were still there! I pondered the situation. I formulated theories about bugs in Assure. I reexamined my assumptions about how synchronization works. And eventually, I realized that mult uses two instances of Matrix3D, and both of them had to be locked at the same time. A synchronized method locks only the object it is invoked on, so the instance variables of the right-hand side of the matrix multiply were being read through direct field accesses (rhs.xx), which are not synchronized with respect to anything.
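The point is easy to check for yourself. The sketch below uses a hypothetical stand-in class (LockProbe, not the real Matrix3D) and the standard Thread.holdsLock method, which appeared in JDK releases later than the 1.1 runtime used in this article. It reports which monitors the current thread owns while inside a synchronized instance method.

```java
// Hypothetical stand-in for Matrix3D, for illustration only.
class LockProbe {
    float xx = 1.0f;

    // A synchronized instance method locks 'this' and nothing else.
    // Returns {holds lock on this, holds lock on rhs}.
    synchronized boolean[] locksHeldDuringMult(LockProbe rhs) {
        return new boolean[] {
            Thread.holdsLock(this),
            Thread.holdsLock(rhs)
        };
    }
}
```

For two distinct objects a and b, a.locksHeldDuringMult(b) yields true for the left operand and false for the right one, so any other thread remains free to modify b in the middle of the multiply.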

The simplest solution is to make sure the right-hand operand is also synchronized by introducing a synchronized function call. While this may be crude, it's safe to do in this particular case.

Here's the essence of the changed code; mult, the function the outside world uses, calls a second function mult_other to synchronize the other matrix. mult_other calls a third function, which is merely a renamed version of the original mult, to do the actual work.

public synchronized void mult(Matrix3D rhs) {
    // get the right-hand side synchronized
    rhs.mult_other(this);
}

public synchronized void mult_other(Matrix3D lhs) {
    // remember that matrix multiplication isn't commutative, so get the
    // operands back in the original order
    lhs.real_mult(this);
}

/** Multiply this matrix by a second: M = M*R */
private void real_mult(Matrix3D rhs) {
    ...
}

With this additional change, Matrix3D generates no complaints from Assure.
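To see the three-method pattern run end to end, here is a reduced sketch using a hypothetical 2x2 matrix class (Mat2) in place of Matrix3D. The class, its fields, and the bothLocked flag are assumptions added purely for demonstration; only the locking structure mirrors the code above. Thread.holdsLock is a standard method from JDK releases later than 1.1.

```java
// Hypothetical 2x2 stand-in for Matrix3D; only the locking pattern matters.
class Mat2 {
    float a, b, c, d;     // row-major layout: [a b; c d]
    boolean bothLocked;   // set by realMult, for demonstration only

    Mat2(float a, float b, float c, float d) {
        this.a = a;
        this.b = b;
        this.c = c;
        this.d = d;
    }

    // Public entry point: locks this (the left operand) ...
    public synchronized void mult(Mat2 rhs) {
        rhs.multOther(this);
    }

    // ... then locks the right operand and restores the operand order.
    public synchronized void multOther(Mat2 lhs) {
        lhs.realMult(this);
    }

    // M = M * R, executed while both monitors are held.
    private void realMult(Mat2 rhs) {
        bothLocked = Thread.holdsLock(this) && Thread.holdsLock(rhs);
        // Compute every new entry before storing any, so m.mult(m) also works.
        float na = a * rhs.a + b * rhs.c;
        float nb = a * rhs.b + b * rhs.d;
        float nc = c * rhs.a + d * rhs.c;
        float nd = c * rhs.b + d * rhs.d;
        a = na;
        b = nb;
        c = nc;
        d = nd;
    }
}
```

Because Java monitors are reentrant, multiplying a matrix by itself does not block: the second synchronized call simply reacquires a lock the thread already owns.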

The complaints about the molecule demo concerned data races. Since the demo contained no synchronization, deadlocks and stalls were not even a possibility before eliminating the races, and Assure shows that (probably) no deadlocks and stalls were introduced by the synchronization added to eliminate the races. But this "synchronize everything" solution is likely to create instances of deadlocks and stalls in more complex programs. If this were anything other than a trivial demo program, now would be an excellent time to reevaluate the basic solution strategy (of adding synchronized everywhere), because clearly every method that uses two or more matrices would have to get this treatment. This kind of unenforceable rule about coding behavior is sure to be a rich source of bugs in the future.
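One concrete way the nested-lock approach can go wrong: if one thread calls a.mult(b) while another calls b.mult(a), each can take its first lock and then wait forever for the second. A common remedy, sketched below under stated assumptions, is to acquire the two locks in a fixed global order. Everything here (OrderedMat, the id field, crossMultiply) is hypothetical and uses java.util.concurrent.atomic.AtomicLong, which arrived in JDK releases well after 1.1. This is one possible approach, not something Assure or the original demo prescribes.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of ordered locking; not from the demo's sources.
class OrderedMat {
    private static final AtomicLong NEXT_ID = new AtomicLong();
    private final long id = NEXT_ID.getAndIncrement(); // defines the lock order
    private float scale; // stand-in for the real matrix fields

    OrderedMat(float scale) {
        this.scale = scale;
    }

    // Locks both operands, always lower id first, so two threads calling
    // a.mult(b) and b.mult(a) can never each hold one lock and wait for
    // the other.
    void mult(OrderedMat rhs) {
        OrderedMat first = (id <= rhs.id) ? this : rhs;
        OrderedMat second = (first == this) ? rhs : this;
        synchronized (first) {
            synchronized (second) {
                scale *= rhs.scale;
            }
        }
    }

    synchronized float get() {
        return scale;
    }

    // Demo helper: two threads multiply in opposite directions 'rounds'
    // times. Returns true if both finish within the timeout.
    static boolean crossMultiply(final int rounds, long timeoutMillis) {
        final OrderedMat a = new OrderedMat(1.0f);
        final OrderedMat b = new OrderedMat(1.0f);
        Thread t1 = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < rounds; i++) a.mult(b);
            }
        });
        Thread t2 = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < rounds; i++) b.mult(a);
            }
        });
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();
        try {
            t1.join(timeoutMillis);
            t2.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !t1.isAlive() && !t2.isAlive();
    }
}
```

The ordering rule lives in one method rather than in a convention every caller must remember, but it is still a rule that each new two-matrix method has to follow.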
