MapReduce programming with Apache Hadoop

Process massive data sets in parallel on large clusters


Installing a MapReduce application in real cluster mode

Running the sample application in standalone mode will prove that things are working properly, but it isn't really very exciting. To really demonstrate the power of Hadoop, you'll want to execute it in real cluster mode.

  1. Pick six open port numbers that you can use; this example uses 8000 through 8005. (If netstat shows that any of these are already in use, feel free to substitute any six free ports of your choice; a quick check is sketched just after this list.) You will also need four machines, MACH1 through MACH4, interconnected over either a wired or wireless LAN. In the sample scenario described here, they are connected via a home network.
  2. MACH1 will be the NameNode, and MACH2 will be the JobTracker. As mentioned earlier, in a cluster-based environment there will be only one of each.
  3. Open the file named hadoop-site.xml under the conf directory of your Hadoop installation. Change the values to match those shown in Listing 9.

    Listing 9. hadoop-site.xml

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    
    <!-- Put site-specific property overrides in this file. -->
    
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>MACH1:8000</value>
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>MACH2:8000</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
      <property>
        <name>dfs.secondary.info.port</name>
        <value>8001</value>
      </property>
      <property>
        <name>dfs.info.port</name>
        <value>8002</value>
      </property>
      <property>
        <name>mapred.job.tracker.info.port</name>
        <value>8003</value>
      </property>
      <property>
        <name>tasktracker.http.port</name>
        <value>8004</value>
      </property>
    </configuration>
    
  4. Open the file named masters under the conf directory. Here you need to add the master NameNode and JobTracker names, as shown in Listing 10. (If there are existing entries, please replace them with those shown in the listing).

    Listing 10. Adding NameNode and JobTracker names

    MACH1
    MACH2
    
  5. Open the file named slaves under the conf directory. This is where you put the names of DataNodes, as shown in Listing 11. (Again, if there are existing entries in this file, please replace them.)

    Listing 11. Adding DataNode names

    MACH3
    MACH4
    
  6. Now you're ready to go, and it's time to start the Hadoop cluster. Log on to each node, accepting the defaults. Log into your NameNode as follows:

    ssh MACH1
    
    Now go to the Hadoop directory:

    cd /hadoop0.17.1/
    
    Execute the start script to launch HDFS:

    bin/start-dfs.sh
    
    (Note that you can stop this later with the stop-dfs.sh command.)
  7. Start the JobTracker exactly as above, with the following commands:

    ssh MACH2
    cd /hadoop0.17.1/
    bin/start-mapred.sh
    
    (Again, this can be stopped later by the corresponding stop-mapred.sh command.)
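
As promised in step 1, here is a quick way to confirm that the six ports you picked are actually free. This is only a minimal sketch, assuming a Unix-like shell on each machine; run it on every node before you edit hadoop-site.xml.

    # List any listener already bound to one of the candidate ports 8000-8005.
    # No output means the ports are free; otherwise substitute different ones.
    netstat -an | grep ':800[0-5]'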

You can now execute the EchoOche application just as described in the previous section. The difference is that the program will now run across a cluster of DataNodes. You can confirm this through the Web interface provided with Hadoop: point a browser on the NameNode (MACH1) to http://localhost:8002. (The default is actually port 50070; to see why you use port 8002 here, take a closer look at Listing 9.) You should see a page similar to the one in Figure 2, showing the details of the NameNode and the DataNodes it manages.
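
If you prefer the command line, the same cluster view is available from HDFS itself. The following sketch assumes you run it from the Hadoop directory on the NameNode (MACH1); the dfsadmin report lists each DataNode (MACH3 and MACH4) along with its capacity and status, mirroring what the Web interface on port 8002 displays.

    ssh MACH1
    cd /hadoop0.17.1/
    # Print an HDFS summary plus one entry per live DataNode.
    bin/hadoop dfsadmin -report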

Figure 2. Hadoop Web interface, showing the number of nodes and their status

This Web interface provides many details to browse through, including the full statistics of your application. Hadoop comes with several different Web interfaces by default; you can find their default ports in hadoop-default.xml. For example, in this sample configuration, http://localhost:8003 on the JobTracker machine (MACH2) shows JobTracker statistics. (The default port is 50030.)
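
If you're curious where those defaults come from, you can look them up directly. The following is just a quick sketch, assuming hadoop-default.xml sits under the conf directory of this 0.17.x installation; it prints the two properties that Listing 9 overrides with ports 8002 and 8003.

    cd /hadoop0.17.1/
    # Show the shipped default Web-UI ports (50070 and 50030) that Listing 9 overrides.
    grep -E -A 1 'dfs.info.port|mapred.job.tracker.info.port' conf/hadoop-default.xml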

In conclusion

In this article, we've presented the fundamentals of MapReduce programming with the open source Hadoop framework. This excellent framework speeds up the processing of large amounts of data by distributing the work across a cluster of machines. It can be adopted and customized to meet various development requirements, and it scales simply by adding more nodes for processing. The extensibility and simplicity of the framework are the key differentiators that make it a promising tool for data processing.

About the author

Ravi Shankar is an assistant vice president of technology development, currently working in the financial industry. He is a Sun Certified Programmer and Sun Certified Enterprise Architect with 15 years of industry experience. He has been a presenter at international conferences like JavaOne 2004, IIWAS 2004, and IMECS 2006. Ravi served earlier as a technical member of the OASIS Framework for Web Services Implementation Committee. He spends most of his leisure time exploring new technologies.

Govindu Narendra is a technical architect working on parallel-processing technologies and portal development for data warehousing. He is a Sun Certified Programmer.


