Until recently, true concurrency has been impossible on most computers marketed to consumers. Most have been one-processor models capable of executing only a single thread in any given time slice. Operating systems simulate doing many things at once by rapidly time-slicing between threads, a practice known as temporal multithreading.

True concurrency, also known as simultaneous multithreading (SMT), occurs when multiple threads execute instructions during the same clock cycle. While in the past, SMT has been possible only on high-end, multiple-processor systems, the advent of inexpensive multiple-core processors is bringing SMT-capable systems to average consumers. The laptop you pick up at Best Buy this holiday season might very well sport a Centrino Duo capable of simultaneously executing two threads. Intel recently announced a new line of quad-core processors. Soon, single-processor, single-core systems will be considered relics.

These new systems give you, the high-performance Java developer, a tremendous opportunity for speeding up your programs. However, they won't inherit the advantages of SMT automatically; you must adapt your algorithms so the time-consuming sections are performed by multiple threads designed to run simultaneously. The addition of high-level concurrency utilities to Java 5 greatly eases this process. Using two important classes from the package `java.util.concurrent`

, I'll demonstrate how to speed up time-consuming tasks by having them use the optimal number of threads for the systems on which they are executed.

## A simple approach

The basic approach is rather simple:

**Develop a single-threaded, sequential, robust, and clearly organized version of your algorithm.**If you're reading this article to speed up something that exists, you may already have this step covered.**Identify the subtasks.**Examine the algorithm to identify the discrete stages. This shouldn't be difficult if you performed Step 1 adequately. Each subtask probably has its own method, or at least a clearly-delineated block of code. If you have difficulty identifying the subtasks, you probably need to understand the algorithm better or reorganize your code. In that case, return to Step 1.**Benchmark the algorithm.**In other words, time the identified subtasks to determine the fraction of time consumed by each. This should be a trivial matter of adding some timing and output statements.**Delegate the most time-consuming subtasks to a thread pool.**Now we come to the most difficult part: reorganizing your code so the slowest subtasks are performed concurrently by multiple subtask threads. Luckily, two utility classes in`java.util.concurrent`

greatly simplify the process.

## Introduction to K-means

For an example, I'll show you how to adapt K-means clustering to SMT. Don't panic if you've never heard of K-means. It's a real-world clustering algorithm that is easy to understand, and it benefits enormously from SMT. K-means is the algorithm on which I cut my SMT teeth.

My employer became frustrated when our main product took hours and hours to cluster a particularly large dataset. Overhearing him mutter in the elevator about needing to convert K-means from Java to C was all the motivation I needed. I more than doubled the product's speed by simple refactoring and optimization. Then, knowing that the processing server was a quad-CPU Sun, I decided to make key sections concurrent. Since this was before Java 5, I used utilities included with version 1.0.3 of the Colt libraries. (These utilities, written by Doug Lea, have since been integrated into `java.util.concurrent`

.) The SMT adaptation achieved a further tripling in speed. My total efforts sped up K-means eight-fold and made the boss so happy, we avoided the C conversion nightmare.

Before going into K-means further, I recommend that you download the files kmeans.jar and kmeans_src.jar from Resources. Only the key sections of code will be included in this article, so you'll want to browse the full listings.

Extract the sources from the jar using the command `jar xvf kmeans_src.jar`

. It contains three versions of K-means in the files: `BasicKMeans.java`

, `BenchmarkedKMeans.java`

, and `ConcurrentKMeans.java`

. These files correspond to the results of Steps 1, 3, and 4 of our basic approach.

K-means splits a large set of coordinates into groups called clusters. More formally:

- K-means partitions N coordinates into K clusters, where N and K are positive integers, and K is less than N. The number of clusters K is an input parameter to K-means. K-means doesn't determine what K should be.
- A coordinate is a series of numbers. All have the same length M. The version of K-means presented in this article stores them in a two-dimensional N by M array of doubles.
- A cluster consists of a center and a list of the coordinates indices that belong to it. The center, also of length M, is simply the average of the member coordinates.
- At the conclusion of K-means, every coordinate belongs to one and only one cluster.

K-means partitions the coordinates into clusters through the following iterative sequence:

**Cluster initialization.**K clusters must somehow be initialized. Although K-Means may do this in several different ways, the version included in this article randomly picks K of the coordinates to serve as the initial cluster centers. This is done in the method`initCenters()`

.**Distance computation.**To determine which cluster a coordinate should be assigned to, K-means compares the distance between the coordinate and each of the clusters. To minimize the number of times distances must be computed, this version keeps all NK distances stored in a two-dimensional array of doubles. This step computes them all in the method`computeDistances()`

.**Initial cluster assignment.**Each coordinate is assigned to the nearest cluster by the method`makeAssignments()`

.**Cluster center computation.**Now K-means enters an iteration loop. The first step in the loop is to compute the cluster centers changed by the last call to`makeAssignments()`

.**Distance computation.**K-means calls`computeDistances()`

again, but this time, it only computes distances to clusters changed by the last call to`makeAssignments()`

.**Cluster assignment.**Coordinates are again assigned to clusters by`makeAssignments()`

. This time, the number of movesâ€”the number of coordinates switching clustersâ€”is noted. K-means loops back to Step 4 unless at least one of two stopping criteria is met: the number of moves is 0 or the iteration counter reaches an upper limit. If either is met, K-means exits.

Try running K-means from the jar you downloaded using the command `java -Xms200m -Xmx200m -jar kmeans.jar`

. If necessary, spell out the path to `java.exe`

, version 5 or later. The two virtual machine options set the minimum and maximum heap sizes. Increase or lower these as appropriate for your system. Set them to the same value so expansions of the heap do not affect the timing results.

The figure above shows the graphical user interface. Three text fields let you specify the number of coordinates, clusters, and the seed for the random number generator. (An explicit seed permits repeatable results between runs.) The combo box lets you select one of three implementations of K-means: `BasicKMeans`

, `BenchmarkedKMeans`

, or `ConcurrentKMeans`

. When the last is selected, another text field is enabled, letting you enter the number of threads used in the SMT-adapted steps. This field is initialized to the optimal number of threads for your system, which is conveniently obtained by `Runtime.getRuntime().availableProcessors().`

If you see 1 in this field when you first bring up the program, your system is unfortunately not SMT-capable. You can still run the example, but you won't observe a speedup when you increase the number of threads.

The Run KMeans button executes the selected version. As it executes, you will see real-time results in the text area. You'll probably notice the absence of input fields for coordinate length and maximum number of iterations. The test program always uses coordinates of length 100 randomly generated using a Gaussian distribution. The iteration limit is always 500.

## Identifying the subtasks of K-means

Following the steps of our basic approach, let's now identify our algorithm's subtasks, which we can see in `BasicKMeans`

's `run()`

method:

**Listing 1. The run method of BasicKMeans**

` ````
public void run() {
try {
// Note the start time.
long startTime = System.currentTimeMillis();
postKMeansMessage("K-Means clustering started");
// Randomly initialize the cluster centers creating the
// array mProtoClusters.
initCenters();
postKMeansMessage("... centers initialized");
// Perform the initial computation of distances.
computeDistances();
// Make the initial cluster assignments.
makeAssignments();
// Number of moves in the iteration and the iteration counter.
int moves = 0, it = 0;
// Main Loop:
//
// Two stopping criteria:
// - no moves in makeAssignments
// (moves == 0)
// OR
// - the maximum number of iterations has been reached
// (it == mMaxIterations)
//
do {
// Compute the centers of the clusters that need updating.
computeCenters();
// Compute the stored distances between the updated clusters and the
// coordinates.
computeDistances();
// Make this iteration's assignments.
moves = makeAssignments();
it++;
postKMeansMessage("... iteration " + it + " moves = " + moves);
} while (moves > 0 && it < mMaxIterations);
// Transform the array of ProtoClusters to an array
// of the simpler class Cluster.
mClusters = generateFinalClusters();
long executionTime = System.currentTimeMillis() - startTime;
postKMeansComplete(mClusters, executionTime);
} catch (Throwable t) {
postKMeansError(t);
} finally {
// Clean up temporary data structures used during the algorithm.
cleanup();
}
}
```

Even though I described six steps in the K-means algorithm, it has only four subtasks, since `computeDistances()`

and `makeAssignments()`

each handle two steps. We may think of K-means, therefore, as consisting of the following subtasks:

**Cluster initialization:**Performed by`initCenters()`

**Distance computation:**Performed by`computeDistances()`

**Cluster assignment:**Performed by`makeAssignments()`

**Cluster center computation:**Performed by`computeCenters()`

## Benchmarking K-means

Now that the subtasks have been identified, it's time to go to the benchmarking phase. `BenchMarkedKMeans.java`

is the version to use for benchmarking. It is nothing but `BasicKMeans.java`

with the addition of a few statements for timing of the subtasks. Listing 2 shows its `computeDistances()`

method, which is the same as `BasicKMeans`

's method, except for the two statements that update the object variable `mComputeDistancesMS`

upon every call to the method. Similar timing statements are in the other subtasks methods to increment other timing variables.

**Listing 2. The computeDistances()method of BenchMarkedKMeans**

` ````
/**
* Compute distances between coordinates and cluster centers,
* storing them in the distance cache. Only distances that
* need to be computed are computed. This is determined by
* distance update flags in the protocluster objects.
*/
private void computeDistances() throws InsufficientMemoryException {
// Mark the time the method was entered.
long t = System.currentTimeMillis();
int numCoords = mCoordinates.length;
int numClusters = mProtoClusters.length;
if (mDistanceCache == null) { // Distance cache instantiated on first call.
// Explicit garbage collection to reduce likelihood of insufficient
// memory.
System.gc();
// Ensure there is enough memory available for the distances.
// Throw an exception if not.
long memRequired = 8L * numCoords * numClusters;
if (Runtime.getRuntime().freeMemory() < memRequired) {
throw new InsufficientMemoryException();
}
// Instantiate an array to hold the distances between coordinates
// and cluster centers.
mDistanceCache = new double[numCoords][numClusters];
}
for (int coord=0; coord < numCoords; coord++) {
// Update the distances between the coordinate and all
// clusters currently in contention with update flags set.
for (int clust=0; clust < numClusters; clust++) {
ProtoCluster cluster = mProtoClusters[clust];
if (cluster.getConsiderForAssignment() && cluster.needsUpdate()) {
mDistanceCache[coord][clust] =
distance(mCoordinates[coord], cluster.getCenter());
}
}
}
// Increment the timing variable by the number of ms taken by the method
// call.
mComputeDistancesMS += (System.currentTimeMillis() - t);
}
```

Run `BenchMarkedKMeans`

with the default numbers of coordinates and clusters, 25,000 and 300, respectively, and you'll see the timing results in the text area. On my dual-processor Xeon system, 93 percent of the time is eaten up by distance computation, with cluster assignment coming in a distant second of 5 percent. Cluster initialization doesn't even register, which is not surprising, since it's performed only once and involves no extensive calculations. You might be a little surprised that cluster center computation consumes so little time, considering that it is done every iteration and contains numerous averaging operations. The lesson here: never skip the benchmarking step! If I had assumed `computeCenters()`

took a significant part of the processing time, I might have expended a great deal of time and energy adapting it to SMT to attain a trivial gain at most.