Until recently, true concurrency has been impossible on most computers marketed to consumers. Most have been one-processor models capable of executing only a single thread in any given time slice. Operating systems simulate doing many things at once by rapidly time-slicing between threads, a practice known as temporal multithreading.
True concurrency, also known as simultaneous multithreading (SMT), occurs when multiple threads execute instructions during the same clock cycle. While in the past, SMT has been possible only on high-end, multiple-processor systems, the advent of inexpensive multiple-core processors is bringing SMT-capable systems to average consumers. The laptop you pick up at Best Buy this holiday season might very well sport a Centrino Duo capable of simultaneously executing two threads. Intel recently announced a new line of quad-core processors. Soon, single-processor, single-core systems will be considered relics.
These new systems give you, the high-performance Java developer, a tremendous opportunity to speed up your programs. However, they won't inherit the advantages of SMT automatically; you must adapt your algorithms so that the time-consuming sections are performed by multiple threads designed to run simultaneously. The addition of high-level concurrency utilities in Java 5 greatly eases this process. Using two important classes from the package java.util.concurrent, I'll demonstrate how to speed up time-consuming tasks by having them use the optimal number of threads for the systems on which they are executed.
The basic approach is rather simple:

1. Write a basic, single-threaded version of the algorithm.
2. Identify the subtasks that make up the algorithm.
3. Benchmark the subtasks to find out where the processing time goes.
4. Adapt the time-consuming subtasks to SMT.

The high-level utilities in java.util.concurrent greatly simplify the process.
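To give a feel for how java.util.concurrent helps, here is a minimal sketch of the general pattern: split a time-consuming loop into slices and hand the slices to an ExecutorService. The class and method names here are my own illustration, not code from the article's download.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {

    // Splits the index range [0, data.length) into one slice per thread,
    // sums the slices concurrently, and combines the partial results.
    static double sum(final double[] data, int threadCount) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        try {
            int chunk = (data.length + threadCount - 1) / threadCount;
            List<Future<Double>> partials = new ArrayList<Future<Double>>();
            for (int t = 0; t < threadCount; t++) {
                final int from = t * chunk;
                final int to = Math.min(from + chunk, data.length);
                partials.add(pool.submit(new Callable<Double>() {
                    public Double call() {
                        double s = 0.0;
                        for (int i = from; i < to; i++) {
                            s += data[i];
                        }
                        return s;
                    }
                }));
            }
            double total = 0.0;
            for (Future<Double> f : partials) {
                total += f.get(); // blocks until the slice finishes
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        double[] data = new double[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = 1.0;
        }
        System.out.println(sum(data, Runtime.getRuntime().availableProcessors()));
    }
}
```

On an SMT-capable system, the slices genuinely execute at the same time; on a single-core system the same code still runs correctly, just without the speedup.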
For an example, I'll show you how to adapt K-means clustering to SMT. Don't panic if you've never heard of K-means. It's a real-world clustering algorithm that is easy to understand, and it benefits enormously from SMT. K-means is the algorithm on which I cut my SMT teeth.
My employer became frustrated when our main product took hours and hours to cluster a particularly large dataset. Overhearing him mutter in the elevator about needing to convert K-means from Java to C was all the motivation I needed. I more than doubled the product's speed through simple refactoring and optimization. Then, knowing that the processing server was a quad-CPU Sun, I decided to make key sections concurrent. Since this was before Java 5, I used utilities included with version 1.0.3 of the Colt libraries. (These utilities, written by Doug Lea, have since been integrated into java.util.concurrent.) The SMT adaptation achieved a further tripling in speed. My total efforts sped up K-means eight-fold and made the boss so happy that we avoided the C conversion nightmare.
Before going into K-means further, I recommend that you download the files kmeans.jar and kmeans_src.jar from Resources. Only the key sections of code will be included in this article, so you'll want to browse the full listings.
Extract the sources from the jar using the command jar xvf kmeans_src.jar. The jar contains three versions of K-means: BasicKMeans.java, BenchmarkedKMeans.java, and ConcurrentKMeans.java. These files correspond to the results of Steps 1, 3, and 4 of the basic approach.
K-means splits a large set of coordinates into groups called clusters. It partitions the coordinates into clusters through the following iterative sequence:

1. Randomly initialize the cluster centers: initCenters().
2. Compute the distances between the coordinates and the cluster centers: computeDistances().
3. Make the initial cluster assignments: makeAssignments().
4. Compute the centers of the clusters that need updating: computeCenters().
5. computeDistances() again, but this time, it only computes distances to clusters changed by the last call to makeAssignments().
6. makeAssignments(). This time, the number of moves (the number of coordinates switching clusters) is noted. K-means loops back to Step 4 unless at least one of two stopping criteria is met: the number of moves is 0 or the iteration counter reaches an upper limit. If either is met, K-means exits.
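The iterative sequence just described can be sketched in miniature. The following single-threaded toy clusters 2-D points; it is my own illustration, not code from kmeans_src.jar, and the class and method names are invented.

```java
import java.util.Random;

public class TinyKMeans {

    // A toy, single-threaded K-means over 2-D points, illustrating the
    // initialize / assign / recompute-centers loop described in the text.
    public static int[] cluster(double[][] points, int k, int maxIterations, long seed) {
        Random rnd = new Random(seed);

        // initCenters: pick k distinct points as the initial centers.
        double[][] centers = new double[k][];
        int[] idx = new int[points.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        for (int c = 0; c < k; c++) {
            int j = c + rnd.nextInt(points.length - c);
            int tmp = idx[c]; idx[c] = idx[j]; idx[j] = tmp;
            centers[c] = points[idx[c]].clone();
        }

        int[] assignment = new int[points.length];
        for (int it = 0; it < maxIterations; it++) {
            // computeDistances + makeAssignments: move each point to its
            // nearest center, counting how many points switch clusters.
            int moves = 0;
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = points[p][0] - centers[c][0];
                    double dy = points[p][1] - centers[c][1];
                    double d = dx * dx + dy * dy;
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assignment[p] != best) moves++;
                assignment[p] = best;
            }
            // Stopping criterion: no moves (after at least one full pass).
            if (moves == 0 && it > 0) break;

            // computeCenters: each center becomes the mean of its members.
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                sums[assignment[p]][0] += points[p][0];
                sums[assignment[p]][1] += points[p][1];
                counts[assignment[p]]++;
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    centers[c][0] = sums[c][0] / counts[c];
                    centers[c][1] = sums[c][1] / counts[c];
                }
            }
        }
        return assignment;
    }
}
```

The real implementation in kmeans_src.jar works on coordinates of arbitrary length and caches distances, but the control flow is the same.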
Try running K-means from the jar you downloaded using the command java -Xms200m -Xmx200m -jar kmeans.jar. If necessary, spell out the path to java.exe, version 5 or later. The two virtual machine options set the minimum and maximum heap sizes. Raise or lower these as appropriate for your system, but set them to the same value so expansions of the heap do not affect the timing results.

User interface for testing K-means.
The figure above shows the graphical user interface. Three text fields let you specify the number of coordinates, the number of clusters, and the seed for the random number generator. (An explicit seed permits repeatable results between runs.) The combo box lets you select one of three implementations of K-means: BasicKMeans, BenchmarkedKMeans, or ConcurrentKMeans. When the last is selected, another text field is enabled, letting you enter the number of threads used in the SMT-adapted steps. This field is initialized to the optimal number of threads for your system, conveniently obtained by Runtime.getRuntime().availableProcessors(). If you see 1 in this field when you first bring up the program, your system is unfortunately not SMT-capable. You can still run the example, but you won't observe a speedup when you increase the number of threads.
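You can check what the thread field will default to on your machine with a one-liner. This tiny program is my own illustration of the call the test program uses.

```java
public class CpuCount {
    public static void main(String[] args) {
        // The number of processors (or hardware threads) available to the
        // JVM; on an SMT-capable system this is greater than 1.
        int n = Runtime.getRuntime().availableProcessors();
        System.out.println("Available processors: " + n);
    }
}
```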
The Run KMeans button executes the selected version. As it executes, you will see real-time results in the text area. You'll probably notice the absence of input fields for coordinate length and maximum number of iterations. The test program always uses coordinates of length 100 randomly generated using a Gaussian distribution. The iteration limit is always 500.
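Generating the test data described above is straightforward: java.util.Random's nextGaussian() yields normally distributed values, and an explicit seed makes runs repeatable. The exact generator in kmeans_src.jar may differ; this is a sketch with invented names.

```java
import java.util.Random;

public class CoordinateGenerator {

    // Generates numCoords coordinates of the given length, each component
    // drawn from a standard Gaussian distribution. Using a fixed seed
    // produces identical data on every run, so timing comparisons between
    // implementations are apples-to-apples.
    public static double[][] generate(int numCoords, int coordLength, long seed) {
        Random random = new Random(seed);
        double[][] coordinates = new double[numCoords][coordLength];
        for (int i = 0; i < numCoords; i++) {
            for (int j = 0; j < coordLength; j++) {
                coordinates[i][j] = random.nextGaussian();
            }
        }
        return coordinates;
    }
}
```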
Following the steps of our basic approach, let's now identify our algorithm's subtasks, which we can see in BasicKMeans's run() method:
Listing 1. The run method of BasicKMeans
public void run() {
    try {
        // Note the start time.
        long startTime = System.currentTimeMillis();
        postKMeansMessage("K-Means clustering started");
        // Randomly initialize the cluster centers creating the
        // array mProtoClusters.
        initCenters();
        postKMeansMessage("... centers initialized");
        // Perform the initial computation of distances.
        computeDistances();
        // Make the initial cluster assignments.
        makeAssignments();
        // Number of moves in the iteration and the iteration counter.
        int moves = 0, it = 0;
        // Main Loop:
        //
        // Two stopping criteria:
        // - no moves in makeAssignments
        //   (moves == 0)
        // OR
        // - the maximum number of iterations has been reached
        //   (it == mMaxIterations)
        //
        do {
            // Compute the centers of the clusters that need updating.
            computeCenters();
            // Compute the stored distances between the updated clusters and the
            // coordinates.
            computeDistances();
            // Make this iteration's assignments.
            moves = makeAssignments();
            it++;
            postKMeansMessage("... iteration " + it + " moves = " + moves);
        } while (moves > 0 && it < mMaxIterations);
        // Transform the array of ProtoClusters to an array
        // of the simpler class Cluster.
        mClusters = generateFinalClusters();
        long executionTime = System.currentTimeMillis() - startTime;
        postKMeansComplete(mClusters, executionTime);
    } catch (Throwable t) {
        postKMeansError(t);
    } finally {
        // Clean up temporary data structures used during the algorithm.
        cleanup();
    }
}
Even though I described six steps in the K-means algorithm, it has only four subtasks, since computeDistances() and makeAssignments() each handle two steps. We may think of K-means, therefore, as consisting of the following subtasks:
- initCenters()
- computeDistances()
- makeAssignments()
- computeCenters()

Now that the subtasks have been identified, it's time to move on to the benchmarking phase. BenchmarkedKMeans.java is the version to use for benchmarking. It is nothing but BasicKMeans.java with the addition of a few statements for timing the subtasks. Listing 2 shows its computeDistances() method, which is the same as BasicKMeans's method except for the two statements that update the object variable mComputeDistancesMS upon every call to the method. Similar timing statements appear in the other subtask methods to increment other timing variables.
Listing 2. The computeDistances() method of BenchmarkedKMeans
/**
 * Compute distances between coordinates and cluster centers,
 * storing them in the distance cache. Only distances that
 * need to be computed are computed. This is determined by
 * distance update flags in the protocluster objects.
 */
private void computeDistances() throws InsufficientMemoryException {
    // Mark the time the method was entered.
    long t = System.currentTimeMillis();
    int numCoords = mCoordinates.length;
    int numClusters = mProtoClusters.length;
    if (mDistanceCache == null) { // Distance cache instantiated on first call.
        // Explicit garbage collection to reduce likelihood of insufficient
        // memory.
        System.gc();
        // Ensure there is enough memory available for the distances.
        // Throw an exception if not.
        long memRequired = 8L * numCoords * numClusters;
        if (Runtime.getRuntime().freeMemory() < memRequired) {
            throw new InsufficientMemoryException();
        }
        // Instantiate an array to hold the distances between coordinates
        // and cluster centers.
        mDistanceCache = new double[numCoords][numClusters];
    }
    for (int coord = 0; coord < numCoords; coord++) {
        // Update the distances between the coordinate and all
        // clusters currently in contention with update flags set.
        for (int clust = 0; clust < numClusters; clust++) {
            ProtoCluster cluster = mProtoClusters[clust];
            if (cluster.getConsiderForAssignment() && cluster.needsUpdate()) {
                mDistanceCache[coord][clust] =
                    distance(mCoordinates[coord], cluster.getCenter());
            }
        }
    }
    // Increment the timing variable by the number of ms taken by the method
    // call.
    mComputeDistancesMS += (System.currentTimeMillis() - t);
}
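Listing 2 calls a distance(double[], double[]) helper that isn't shown here; for K-means it is typically the Euclidean distance between a coordinate and a cluster center. A version along these lines would fit (a sketch, not the actual code from kmeans_src.jar):

```java
public class Distance {

    // Euclidean distance between two vectors of equal length: the square
    // root of the sum of squared component differences.
    public static double distance(double[] coord, double[] center) {
        double sumSquared = 0.0;
        for (int i = 0; i < coord.length; i++) {
            double diff = coord[i] - center[i];
            sumSquared += diff * diff;
        }
        return Math.sqrt(sumSquared);
    }
}
```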
Run BenchmarkedKMeans with the default numbers of coordinates and clusters, 25,000 and 300, respectively, and you'll see the timing results in the text area. On my dual-processor Xeon system, 93 percent of the time is eaten up by distance computation, with cluster assignment coming in a distant second at 5 percent. Cluster initialization doesn't even register, which is not surprising, since it's performed only once and involves no extensive calculations. You might be a little surprised that cluster center computation consumes so little time, considering that it is done every iteration and contains numerous averaging operations. The lesson here: never skip the benchmarking step! If I had assumed computeCenters() took a significant part of the processing time, I might have expended a great deal of time and energy adapting it to SMT, only to attain a trivial gain at most.
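The benchmarking pattern used in BenchmarkedKMeans (snapshot the clock on entry, add the elapsed time to a per-subtask accumulator on exit) can be wrapped generically if you'd rather not scatter timing statements through your methods. This helper is my own sketch, with invented names, not part of the article's download.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SubtaskTimer {

    // Accumulated wall-clock milliseconds, keyed by subtask name.
    private final Map<String, Long> totals = new LinkedHashMap<String, Long>();

    // Runs the task and charges its elapsed time to the named subtask,
    // even if the task throws.
    public void time(String subtask, Runnable task) {
        long start = System.currentTimeMillis();
        try {
            task.run();
        } finally {
            Long previous = totals.get(subtask);
            long elapsed = System.currentTimeMillis() - start;
            totals.put(subtask, (previous == null ? 0L : previous) + elapsed);
        }
    }

    // Total milliseconds charged to the subtask so far (0 if never run).
    public long totalMillis(String subtask) {
        Long t = totals.get(subtask);
        return t == null ? 0L : t;
    }
}
```

With a wrapper like this, each subtask call site becomes timer.time("computeDistances", ...) and the per-subtask totals can be reported at the end of a run.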