Programming Java threads in the real world, Part 2

The perils of race conditions, deadlock, and other threading problems

In his recent " Design for thread safety" article (part of the Design Techniques column), Bill Venners introduced the notion of a race condition, a situation whereby two threads simultaneously contend for the same object and, as a consequence, leave the object in an undefined state. Bill pointed out that Java's synchronized keyword is in the language to help avoid these problems, and he provided a straightforward example of its use. Last month's article was the first in a series on threads. The present article continues that series, covering race conditions in depth. It also discusses various deadlock scenarios, which are related to race conditions but are much harder to find and debug. This month, we'll focus primarily on the problems encountered when programming Java threads. Subsequent Java Toolbox columns will focus entirely on solutions.

Monitors, mutexes, and bathrooms

Just to make sure we're all starting from the same place, a little review of terms is in order. The central concept for synchronization in the Java model is the monitor, developed some 20 years ago by C. A. R. Hoare. A monitor is a body of code guarded by a mutual-exclusion semaphore (or, to use a term coined at Digital Equipment Corp., a mutex). The central notion of a mutex concerns ownership. Only one thread can own the mutex at a time. If a second thread tries to "acquire" ownership, it will block (be suspended) until the owning thread "releases" the mutex. If several threads are all waiting to acquire the same mutex, they will all be released simultaneously when the owning thread releases the mutex. The released threads will then have to sort out amongst themselves who gains ownership. (Typically, priority order, FIFO order, or some combination thereof is used to determine which thread gains control.) You guard a block of code by acquiring a mutex at the top of the block and releasing it at the bottom. The code comprising the monitor does not have to be contiguous: several discontiguous code blocks can all acquire and release the same mutex, in which case all of this code is considered to be in the same monitor because it uses a common mutex.

The best analogy I've heard for a monitor is an airplane bathroom. Only one person can be in the bathroom at a time (we hope). Everybody else is queued up in a rather narrow aisle waiting to use it. As long as the door is locked, the bathroom is unaccessible. Given these terms, in our analogy the object is the airplane, the bathroom is the monitor (assuming there's only one bathroom), and the lock on the door is the mutex.

In Java, every object has one and only one monitor and mutex associated with it. The single monitor has several doors into it, however, each indicated by the synchronized keyword. When a thread passes over the synchronized keyword, it effectively locks all the doors. Of course, if a thread doesn't pass across the synchronized keyword, it hasn't locked the door, and some other thread is free barge in at any time.

Bear in mind that the monitor is associated with the object, not the class. Several threads can all be executing the same method in parallel, but the receiving objects (as identified by the this references) have to be different. For example, several instances of a thread-safe queue could be in use by several threads. These threads could simultaneously enqueue objects to different queues, but they could not enqueue to the same queue at the same time. Only one thread at a time can be in the monitor for a given queue.

To refine the earlier analogy, the airplane is still the object, but the monitor is really all the bathrooms combined (each synchronized chunk of code is a bathroom). When a thread enters any bathroom, the doors on all the bathrooms are locked. Different instances of a class are different airplanes, however, and if the bathroom doors in your airplane are unlocked, you needn't really care about the state of the doors in the other airplanes.

Why is stop() deprecated in JDK 1.2?

The fact that the monitor is built into every Java object is actually somewhat controversial. Some of the problems associated with bundling the condition variable and mutex together inside every Java object have been fixed in JDK 1.2, by the simple expediency of deprecating the most problematic methods of the Thread class: stop() and suspend(). You can deadlock a thread if you call either of these methods from inside a synchronized method of your own. Look at the following method, remembering that the mutex is the lock on the door, and the thread that locks the door "owns" the monitor. That's why the stop() and suspend() methods are deprecated in JDK 1.2. Consider this method:

class some_class
{   //...
    synchronized void f()
    {   Thread.currentThread().stop();

Now consider what happens when a thread calls f(). The thread acquires the lock when entering the monitor but then stops. The mutex is not released by stop(). This is the equivalent of someone going into the bathroom, locking the door, and flushing himself down the toilet! Any other thread that now tries to call f() (or any other synchronized method of the class) on the same object is blocked forever. The suspend() method (which is also deprecated) has the same problem. The sleep() method (which is not deprecated) can be just as tricky. (Someone goes into the bathroom, locks the door, and falls asleep). Also remember that Java objects, even those that extend Thread or implement Runnable, continue to exist, even if the associated thread has stopped. You can indeed call a synchronized method on an object associated with a stopped thread. Be careful.

Race conditions and spin locks

A race condition occurs when two threads try to access the same object at the same time, and the behavior of the code changes depending on who wins. The following diagram shows a single (unsynchronized) object accessed simultaneously by multiple threads. A thread can be preempted in fred() after modifying one field but before modifying the other. If another thread comes along at that point and calls any of the methods shown, the object will be left in an unstable state, because the initial thread will eventually wake up and modify the other field.

Usually, you think of objects sending messages to other objects. In multithreaded environments, you must think about message handlers running on threads. Think: this thread causes this object to send a message to that object. A race condition can occur when two threads cause messages to be sent to the same object at the same time.

Working with wait() and notify()

The following code demonstrates a blocking queue used for one thread to notify another when some event occurs. (We'll see a more realistic version of these in a future article, but for now I want to keep things simple.) The basic notion is that a thread that tries to dequeue from an empty queue will block until some other thread puts something in the queue.

class Notifying_queue 
    private static final queue_size = 10;
    private Object[]  queue = new Object[ queue_size ];
    private int       head  = 0;
    private int       tail  = 0;
    public void synchronized enqueue( Object item )
        queue[++head %= queue_size ]= item; 
        this.notify();                  // The "this" is there only to
    }                                   // improve readability.
    public Object synchronized dequeue( )
        {   if( head == tail ) // <=== This is a bug
        catch( InterruptedException e )
            // If we get here, we were not actually notified.
            // returning null doesn't indicate that the
            // queue is empty, only that the waiting was
            // abandoned.
            return null;
        return queue[++tail %= queue_size ]; 

Starting with an empty queue, let's follow the sequence of operations in play if one thread does a dequeue and then another (at some later time) enqueues an item.

  1. The dequeueing thread calls dequeue(), entering the monitor (and locking out other threads) until the monitor is released. The thread tests for an empty queue (head==tailwait(). (I'll explain why this is a bug in a moment.)

  2. The wait() releases the lock. (The current thread temporarily leaves the monitor.) It then blocks on a second synchronization object called a condition variable. (The basic notion of a condition variable is that a thread blocks until some condition becomes true. In the case of Java's built-in condition variable, the thread will wait until the notified condition is set to true by some other thread calling notify.) It's important to realize that the waiting thread has released the monitor at this juncture.

  3. A second thread now comes along and enqueues an object, eventually calling notify(), thereby releasing the waiting (dequeueing) thread.

  4. The dequeueing thread now has to wait to reacquire the monitor before wait() can return, so it blocks again, this time on the lock associated with the monitor.

  5. The en queueing thread returns from the enqueue() method releasing the monitor.

  6. The dequeueing thread acquires the monitor, wait() returns, an object is dequeued, and dequeue() returns, releasing the monitor.

Race conditions following a wait()

Now, what sorts of problems can arise? The main ones are unexpected race conditions. First, what if you replaced the notify() call with a call to notifyAll()? Imagine that several threads were simultaneously trying to dequeue something from the same, empty, queue. All of these threads are blocked on a single condition variable, waiting for an enqueue operation to be executed. When enqueue() sets the condition to true by calling notifyAll(), the threads are all released from the wait (moved from a suspended to a runnable state). The released threads all try to acquire the monitor at once, but only one would "win" and get the enqueued object. The problem is that the others would then get the monitor too, in some undefined sequence, after the winning thread had dequeued the object and returned from dequeue(). Since these other threads don't retest for an empty queue, they will dequeue garbage. (Looking at the code, the dequeue() method will advance the tail pointer to the next slot, the contents of which are undefined since the queue is empty.)

Fix the problem by replacing the if statement with a while loop (called a spin lock). Now the threads that don't get the object will go back to waiting:

    public Object synchronized dequeue( )
        {   while( head == tail )  // used to be an if
        catch( InterruptedException e )
        {   return null;
        return queue[++tail %= queue_size ]; 

That while loop also solves another, less obvious problem. What if we leave the notify() statement in place, and don't put in a notifyAll()? Since notify() releases only one waiting thread, won't that solve the problem? It turns out that the answer is no. Here's what can happen:

  1. notify() is called by the enqueueing thread, releasing the condition variable.

  2. The dequeueing thread is then preempted, after being released from the wait on the condition variable, but before it tries to reacquire the monitor's lock.

  3. A second thread calls dequeue() at exactly this point, and successfully enters the monitor because no other threads are officially waiting. This second thread successfully dequeues the object; wait() is never called since the queue isn't empty.

  4. The original thread is now allowed to run, it acquires the monitor, doesn't test for empty a second time, and then dequeues garbage.

This second scenario is easily fixed the same way as the first: replace the if statement with a while loop.

Note that the Java specification does not require that wait() be implemented as an atomic operation (that is, one that can't be preempted while moving from the condition variable to the monitor's lock). This means that using an if statement instead of a while loop might work in some Java implementations, but the behavior is really undefined. Using a spin lock instead of a simple if statement is cheap insurance against implementations that don't treat wait() as atomic.

Threads are not objects

Now let's move on to harder-to-find problems. The first difficulty is the commonplace confusion of threads and objects: Methods run on threads, objects do not. Put another way, the only way to get a method to run on a given thread is to call it (either directly or indirectly) from that thread's run() method. Simply putting it into the Thread derivative is not enough. For example, look at the following simple thread (which just prints its fields every so often):

class My_thread extends Thread
    private int field_1 = 0;
    private int field_2 = 0;
    public void run()
        setDaemon(true);  // this thread will not keep the app alive
        while( true )
            System.out.println( " field_1=" + field_1
                                    " field_2=" + field_2 );
    synchronized public void modify( int new_value )
    {   field_1 = new_value;
            field_2 = new_value;

You could start up the thread and send it a message like this:

My_thread test = new My_thread;
1 2 Page 1
Page 1 of 2