Processing 30 input elements doesn't really make for an interesting scenario. Imagine instead that there are 100,000 elements of data to be processed. The task at hand is to count the total number of occurrences of the word JavaWorld. The data may be structured or unstructured. Here's how you'd approach it:
1. The input data is split into a number of smaller datasets, called buckets. (This is typically done for you, as discussed below.)
2. You call map(), which in turn executes your search algorithm on a single bucket and repeats it concurrently for all the buckets in parallel, storing the result of processing each bucket in another set of buckets, called result buckets. Note that there may be more than one result bucket.
3. You call reduce() on each of these result buckets. This function iterates through the result buckets, takes in each value, and performs some kind of processing, if needed. The processing may either aggregate the individual values or apply some kind of business logic to the aggregated or individual values. This step, too, takes place concurrently.
4. You collect the final output produced by the reduce() operations.
These four steps are very simple, but there is so much power in them! Let's look at the details.
In Step 1, note that the buckets created for you may be on a single machine or on multiple machines (though in that case they must be in the same cluster). In practice, that means that in large enterprise projects, multiple terabytes or petabytes of data could be segmented into thousands of buckets on different machines in the cluster, and processing could be performed in parallel, giving the user an extremely fast response. Google uses this concept to index every Web page it crawls. If you also take advantage of the underlying filesystem used to store the data on the individual machines of the cluster, the results can be even more impressive. Google uses its proprietary Google File System (GFS) for this.
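To make Step 1 concrete on a single machine, here is a minimal Java sketch of splitting the input elements into buckets. The Bucketing class, the split() helper, and the round-robin distribution are all illustrative assumptions; a real MapReduce framework would segment the data across the machines (and the distributed filesystem) of the cluster for you.

    import java.util.ArrayList;
    import java.util.List;

    public class Bucketing {
        // Step 1 sketch: split the input elements into a fixed number of buckets.
        static List<List<String>> split(List<String> input, int bucketCount) {
            List<List<String>> buckets = new ArrayList<>();
            for (int i = 0; i < bucketCount; i++) {
                buckets.add(new ArrayList<>());
            }
            // Round-robin distribution keeps the buckets roughly equal in size.
            for (int i = 0; i < input.size(); i++) {
                buckets.get(i % bucketCount).add(input.get(i));
            }
            return buckets;
        }
    }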
In Step 2, the map() function knows exactly where it should go to process the data. The source of the data may be memory, disk, or another node in the cluster. Note that bringing the data to the place where the map() function resides is more costly and time-consuming than letting the function execute at the place where the data resides.
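Continuing the single-machine sketch, Step 2 can be modeled with a thread pool: one map task per bucket, each counting occurrences of the search word in its own bucket, with each result stored in a result bucket. The MapPhase class and countOccurrences() helper are assumptions for illustration only; in a real implementation each task would execute on the node where its bucket resides.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class MapPhase {
        // Step 2 sketch: run the map function concurrently, one task per bucket.
        static List<Integer> map(List<List<String>> buckets, String word)
                throws InterruptedException, ExecutionException {
            // One thread per bucket (assumes at least one bucket).
            ExecutorService pool = Executors.newFixedThreadPool(buckets.size());
            List<Future<Integer>> futures = new ArrayList<>();
            for (List<String> bucket : buckets) {
                // Each Future plays the role of one result bucket.
                futures.add(pool.submit(() -> countOccurrences(bucket, word)));
            }
            List<Integer> resultBuckets = new ArrayList<>();
            for (Future<Integer> f : futures) {
                resultBuckets.add(f.get());
            }
            pool.shutdown();
            return resultBuckets;
        }

        // Counts how many times the word appears in one bucket.
        static int countOccurrences(List<String> bucket, String word) {
            int count = 0;
            for (String element : bucket) {
                for (String token : element.split("\\s+")) {
                    if (token.equals(word)) {
                        count++;
                    }
                }
            }
            return count;
        }
    }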
If you write a C++ or Java program to process data on multiple threads, the program fetches the data from a data source (typically a remote database server) and processes it on the machine where your application is running. In MapReduce implementations, the computation instead happens on the distributed nodes that hold the data.
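For contrast, here is a rough sketch of that traditional pattern: the entire dataset is first pulled across the network to the application machine, and only then processed with local threads (via a parallel stream). The fetchAllFromRemoteSource() method is a placeholder assumption standing in for a JDBC or similar remote query, not a real API.

    import java.util.Arrays;
    import java.util.List;

    public class TraditionalApproach {
        static long countWord(String word) {
            // All of the data crosses the network before any processing begins.
            List<String> data = fetchAllFromRemoteSource();
            // Multi-threaded processing, but only on this one machine.
            return data.parallelStream()
                       .flatMap(line -> Arrays.stream(line.split("\\s+")))
                       .filter(word::equals)
                       .count();
        }

        // Placeholder for a remote fetch (e.g., a JDBC query); returns canned data here.
        static List<String> fetchAllFromRemoteSource() {
            return List.of("JavaWorld is a JavaWorld article", "more text");
        }
    }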
In Step 3, the reduce() function operates on one or more lists of intermediate results, fetching each of them from memory, disk, or a network transfer and performing a function on each element of each list. The final result of the complete operation is produced by collating and interpreting the results from all of the processes running reduce() operations.
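Finishing the single-machine sketch, Step 3 becomes a reduce() that iterates over the result buckets produced by the map phase; here the only "business logic" is summing the per-bucket counts. The ReducePhase class is, again, an illustrative assumption.

    import java.util.List;

    public class ReducePhase {
        // Step 3 sketch: collate the intermediate results from all result buckets.
        static int reduce(List<Integer> resultBuckets) {
            int total = 0;
            for (int partialCount : resultBuckets) {
                total += partialCount;  // aggregate the per-bucket counts
            }
            return total;
        }
    }

Chaining the sketches together, ReducePhase.reduce(MapPhase.map(Bucketing.split(input, 4), "JavaWorld")) yields the total count, which is the final output described in Step 4 below.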
In Step 4, you get the final output, which may be empty or may consist of one or more data elements.
With this simple Java program under your belt, you're ready to understand the more complex MapReduce implementation in Apache Hadoop.