MapReduce programming with Apache Hadoop

Process massive data sets in parallel on large clusters

A more complicated scenario

Processing 30 input elements doesn't make for a very interesting scenario. Imagine instead that there are 100,000 elements of data to be processed, and that the task at hand is to count the total number of occurrences of the word JavaWorld. The data may be structured or unstructured. Here's how you'd approach it:

  • Assume that the data is somehow divided into smaller chunks and placed into buckets. You now have a total of 10 buckets, with 10,000 elements of data in each of them. (Don't worry for the moment about who exactly does the dividing.)
  • Apply a function named map(), which executes your search algorithm on a single bucket and does so for all of the buckets in parallel, storing the result of processing each bucket in another set of buckets, called result buckets. Note that there may be more than one result bucket.
  • Apply a function named reduce() to each of these result buckets. This function iterates through a result bucket, takes each value, and performs some kind of processing if needed. That processing may aggregate the individual values or apply some kind of business logic to the aggregated or individual values. This step, too, takes place concurrently.
  • Finally, you will get the result you expected.

These four steps are very simple but there is so much power in them! Let's look at the details.

Dividing the data

In Step 1, note that the buckets created for you may live on a single machine or be spread across multiple machines (in the latter case, they must all belong to the same cluster). In practice, that means that in large enterprise projects, multiple terabytes or petabytes of data can be segmented into thousands of buckets on different machines in the cluster and processed in parallel, giving the user an extremely fast response. Google uses this concept to index every Web page it crawls. The gains are even greater if you take advantage of the filesystem used to store the data on the individual machines of the cluster; Google uses its proprietary Google File System (GFS) for this.
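
To make Step 1 concrete, here is a minimal, single-JVM sketch in plain Java that divides a list of text elements into a fixed number of buckets. The class and method names (BucketSplitter, splitIntoBuckets) are invented for this illustration; on a real cluster the framework and the underlying filesystem handle this partitioning for you, across many machines rather than inside one process.

import java.util.ArrayList;
import java.util.List;

public class BucketSplitter {

    // Step 1: divide the input data into bucketCount roughly equal buckets.
    public static List<List<String>> splitIntoBuckets(List<String> data, int bucketCount) {
        List<List<String>> buckets = new ArrayList<List<String>>();
        int bucketSize = (data.size() + bucketCount - 1) / bucketCount; // ceiling division
        for (int start = 0; start < data.size(); start += bucketSize) {
            int end = Math.min(start + bucketSize, data.size());
            buckets.add(new ArrayList<String>(data.subList(start, end)));
        }
        return buckets;
    }

    public static void main(String[] args) {
        // Fabricate 100,000 data elements; every seventh one mentions JavaWorld.
        List<String> data = new ArrayList<String>();
        for (int i = 0; i < 100000; i++) {
            data.add(i % 7 == 0 ? "JavaWorld element " + i : "element " + i);
        }
        List<List<String>> buckets = splitIntoBuckets(data, 10);
        System.out.println(buckets.size() + " buckets, " + buckets.get(0).size() + " elements each");
    }
}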

The map() function

In Step 2, the map() function knows exactly where it should go to process the data. The source of the data may be memory, disk, or another node in the cluster. Note that moving the data to the place where the map() function resides is more costly and time-consuming than executing the function at the place where the data already resides. If you write a C++ or Java program to process data on multiple threads, the program typically fetches data from a data source (often a remote database server) and executes on the machine where your application is running. In MapReduce implementations, by contrast, the computation happens on the distributed nodes that hold the data.
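
As a rough stand-in for Step 2, the sketch below runs the map() step over each bucket concurrently with an ExecutorService. Bear in mind the limits of the analogy: threads in a single JVM all read local memory, whereas a real MapReduce runtime ships the map() code to the nodes that already hold the data, for exactly the cost reasons described above. The countOccurrences() helper and the bucket layout carry over from the Step 1 sketch and are illustrative assumptions, not part of any Hadoop API.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelMapper {

    // The map() step applied to a single bucket: count the elements
    // that contain the search term.
    public static int countOccurrences(List<String> bucket, String term) {
        int count = 0;
        for (String element : bucket) {
            if (element.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // Run the map() step over every bucket concurrently; the returned list
    // holds the intermediate "result buckets", one per input bucket.
    public static List<Integer> mapAll(List<List<String>> buckets, final String term)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(buckets.size());
        List<Future<Integer>> futures = new ArrayList<Future<Integer>>();
        for (final List<String> bucket : buckets) {
            futures.add(pool.submit(new Callable<Integer>() {
                public Integer call() {
                    return countOccurrences(bucket, term);
                }
            }));
        }
        List<Integer> resultBuckets = new ArrayList<Integer>();
        for (Future<Integer> future : futures) {
            resultBuckets.add(future.get()); // wait for each bucket's count
        }
        pool.shutdown();
        return resultBuckets;
    }
}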

The reduce() function

In Step 3, the reduce() function operates on one or more lists of intermediate results, fetching each of them from memory, disk, or across the network and applying a function to each element of each list. The final result of the complete operation is produced by collating and interpreting the results from all of the processes running reduce() operations.
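
Continuing the same sketch, a minimal reduce() for this scenario just iterates over the intermediate result buckets and sums them; in the word-count example the "business logic" is simple addition. A real framework would typically run several reduce() tasks concurrently, each collating its own share of the intermediate results.

import java.util.List;

public class CountReducer {

    // The reduce() step: collate the intermediate result buckets into one total.
    public static int reduce(List<Integer> resultBuckets) {
        int total = 0;
        for (int partialCount : resultBuckets) {
            total += partialCount;
        }
        return total;
    }
}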

In Step 4, you get the final output, which may contain no data at all (zero) or one or more data elements.
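
Putting the four steps together, a small driver might look like the following sketch. It simply wires up the illustrative classes from the earlier sketches (BucketSplitter, ParallelMapper, CountReducer); none of these names come from Hadoop itself.

import java.util.ArrayList;
import java.util.List;

public class MapReduceSketch {

    public static void main(String[] args) throws Exception {
        // Fabricate 100,000 data elements; every seventh one mentions JavaWorld.
        List<String> data = new ArrayList<String>();
        for (int i = 0; i < 100000; i++) {
            data.add(i % 7 == 0 ? "JavaWorld element " + i : "element " + i);
        }

        // Step 1: divide the data into 10 buckets.
        List<List<String>> buckets = BucketSplitter.splitIntoBuckets(data, 10);

        // Step 2: map() over each bucket in parallel, producing result buckets.
        List<Integer> resultBuckets = ParallelMapper.mapAll(buckets, "JavaWorld");

        // Steps 3 and 4: reduce() the result buckets into the final output.
        int total = CountReducer.reduce(resultBuckets);
        System.out.println("Occurrences of JavaWorld: " + total);
    }
}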

With this simple Java program under your belt, you're ready to understand the more complex MapReduce implementation in Apache Hadoop.
