MapReduce programming with Apache Hadoop

Process massive data sets in parallel on large clusters

Google and its MapReduce framework may rule the roost when it comes to massive-scale data processing, but there's still plenty of that goodness to go around. This article gets you started with Hadoop, the open source MapReduce implementation for processing large data sets. Authors Ravi Shankar and Govindu Narendra first demonstrate the powerful combination of map and reduce in a simple Java program, then walk you through a more complex data-processing application based on Hadoop. Finally, they show you how to install Hadoop and deploy the application in both standalone mode and clustering mode.

Are you amazed by the fast responses you get when searching the Web with Google or Yahoo? Have you ever wondered how these services manage to search millions of pages and return results in milliseconds? Much of the massive-scale data processing behind such services is based on the MapReduce model, which Google developed and implemented as a proprietary framework. The Apache Software Foundation has implemented its own open source MapReduce framework, called Hadoop. Hadoop is used by Yahoo and many other companies whose success is built on processing massive amounts of data. In this article we'll help you discover whether it might also be a good solution for your distributed data-processing needs.

We'll start with an overview of MapReduce, followed by a couple of Java programs that demonstrate the simplicity and power of the framework. We'll then introduce you to Hadoop's MapReduce implementation and walk through a complex application that searches a huge log file for a specific string. Finally, we'll show you how to install Hadoop in a Microsoft Windows environment and deploy the application -- first as a standalone application and then in clustering mode.

You won't be an expert in all things Hadoop when you're done reading this article, but you will have enough material to explore and possibly implement Hadoop for your own large-scale data-processing requirements.

About MapReduce

MapReduce is a programming model designed for processing large data sets. The model was developed by Jeffrey Dean and Sanjay Ghemawat at Google (see "MapReduce: Simplified data processing on large clusters"). At its core, MapReduce is a combination of two functions -- map() and reduce() -- as its name suggests.
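Before diving into the sample application, it may help to see the shape of the idea in miniature. The tiny program below is not from the article and is not Hadoop; it is a hypothetical illustration (the class name MapReduceShape is made up) in which map() turns one input record into an intermediate value -- here, a string's length -- and reduce() folds two intermediate values into one, here by addition.

import java.util.Arrays;
import java.util.List;

// Hypothetical, minimal illustration of the map/reduce idea (not Hadoop).
public class MapReduceShape
{
    // "map": transform one input record into an intermediate value
    static int map(String record)
    {
        return record.length();
    }

    // "reduce": fold two intermediate values into one
    static int reduce(int left, int right)
    {
        return left + right;
    }

    public static void main(String[] args)
    {
        List<String> data = Arrays.asList("alpha", "beta", "gamma");

        int total = 0;
        for (String record : data)
        {
            // map each record, then fold the result into the running total
            total = reduce(total, map(record));
        }
        System.out.println("Total characters: " + total);   // prints 14
    }
}

MapReduce frameworks take exactly this separation of concerns and scale it out: the map calls run in parallel across many machines, and the reduce step combines their partial results.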

A quick look at a sample Java program will help you get your bearings in MapReduce. This application implements a very simple version of the MapReduce framework, but it isn't built on Hadoop. The simple, abstracted program illustrates the core parts of the MapReduce framework and the terminology associated with it. The application creates some strings, counts the number of characters in each string, and finally sums the counts to show the total number of characters. Listing 1 contains the program's Main class.

Listing 1. Main class for a simple MapReduce Java app

public class Main
{

    public static void main(String[] args)
    {

        MyMapReduce my = new MyMapReduce();
        my.init();

    }
}

Listing 1 just instantiates a class called MyMapReduce, which is shown in Listing 2.

Listing 2. MyMapReduce.java

import java.util.*;

public class MyMapReduce

...


Download complete Listing 2

As you can see, the crux of the class lies in just four methods (a condensed sketch of the whole class follows this list):

  • The init() method creates some dummy data (just 30 strings). This data serves as the input data for the program. Note that in the real world, this input could be gigabytes, terabytes, or petabytes of data!
  • The step1ConvertIntoBuckets() method segments the input data. In this example, the data is divided into five smaller chunks and put inside an ArrayList named buckets. You can see that the method takes a list containing all of the input data and an int value, numberOfBuckets, which has been hardcoded to five; if you divide 30 strings into five buckets, each bucket will hold six strings. Each bucket in turn is represented as an ArrayList. These array lists are finally put into another list and returned. So, at the end of the method, you have an array list holding five buckets (array lists) of six strings each.

    These buckets can be put in memory (as in this case), saved to disk, or put onto different nodes in a cluster!

  • step2RunMapFunctionForAllBuckets() is the next method invoked from init(). This method internally creates five threads (because there are five buckets -- the idea is to start a thread for each bucket). The class responsible for the threading is StartThread, which is implemented as an inner class. Each thread processes one bucket and puts its individual result into another array list named intermediateresults. All of the computation and threading take place within the same JVM, and the whole process runs on a single machine.

    If the buckets were on different machines, a master would need to monitor them to know when each computation is finished, whether processing has failed on any of the nodes, and so on. Ideally, the master would have the computation performed on the nodes where the data lives, rather than pulling the data from all of the nodes to the master itself and executing it there.

  • The step3RunReduceFunctionForAllBuckets() method collates the results from intermediateresults, sums them up, and gives you the final output.

    Note that intermediateresults needs to combine the results from the parallel processing described in the previous bullet point. The exciting part is that this combining step can also happen concurrently!
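To make this walkthrough concrete, here is a condensed, hypothetical sketch of what such a MyMapReduce class can look like. It is not the downloadable Listing 2: the method names follow the article, but the exact signatures, the field names (other than intermediateresults), and the body of the StartThread inner class are assumptions made for illustration.

import java.util.ArrayList;
import java.util.List;

// Condensed, illustrative sketch of MyMapReduce -- not the complete Listing 2.
public class MyMapReduce
{
    // Holds one partial result per bucket, filled in by the worker threads.
    private final List<Integer> intermediateresults = new ArrayList<Integer>();

    public void init()
    {
        // Create 30 dummy input strings (stand-ins for gigabytes of real data).
        List<String> inputData = new ArrayList<String>();
        for (int i = 1; i <= 30; i++)
        {
            inputData.add("input record number " + i);
        }

        List<List<String>> buckets = step1ConvertIntoBuckets(inputData, 5);
        step2RunMapFunctionForAllBuckets(buckets);
        int total = step3RunReduceFunctionForAllBuckets();
        System.out.println("Total number of characters: " + total);
    }

    // Step 1: split the input into numberOfBuckets chunks (five buckets of six strings here).
    private List<List<String>> step1ConvertIntoBuckets(List<String> list, int numberOfBuckets)
    {
        int bucketSize = list.size() / numberOfBuckets;
        List<List<String>> buckets = new ArrayList<List<String>>();
        for (int i = 0; i < numberOfBuckets; i++)
        {
            buckets.add(new ArrayList<String>(
                    list.subList(i * bucketSize, (i + 1) * bucketSize)));
        }
        return buckets;
    }

    // Step 2: start one thread per bucket; each thread "maps" its bucket to a character count.
    private void step2RunMapFunctionForAllBuckets(List<List<String>> buckets)
    {
        List<Thread> threads = new ArrayList<Thread>();
        for (List<String> bucket : buckets)
        {
            Thread t = new Thread(new StartThread(bucket));
            threads.add(t);
            t.start();
        }
        // The "master" simply waits for every worker to finish.
        for (Thread t : threads)
        {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
    }

    // Step 3: reduce the partial results to a single total.
    private int step3RunReduceFunctionForAllBuckets()
    {
        int total = 0;
        for (int partial : intermediateresults)
        {
            total += partial;
        }
        return total;
    }

    // Inner worker class: counts the characters in one bucket.
    private class StartThread implements Runnable
    {
        private final List<String> bucket;

        StartThread(List<String> bucket)
        {
            this.bucket = bucket;
        }

        public void run()
        {
            int count = 0;
            for (String s : bucket)
            {
                count += s.length();
            }
            synchronized (intermediateresults)
            {
                intermediateresults.add(count);
            }
        }
    }
}

Running the Main class from Listing 1 against this sketch prints the total character count once all five worker threads have finished; the join() calls play the role of a master waiting for its workers.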
