MapReduce programming with Apache Hadoop

Process massive data sets in parallel on large clusters


Installing a MapReduce application in standalone mode

Unlike a Java EE application that can easily be deployed onto an app server, a MapReduce application using Hadoop requires some extra steps for deployment. First, you should understand the default, out-of-the-box way that Hadoop operates: standalone mode. The following steps describe how to set up the application on a Windows XP Professional system; the process would be almost identical for any other environment, with an important exception that you'll learn more about in a moment.

  1. Ensure that version 5.0 or above of Java is installed on your machine.
  2. Download the latest version of Hadoop. At the time this article was published, the latest distribution was version 0.18.0. Save the download into a directory -- this example will use D:\hadoop.
  3. Make sure that you're logged in with an OS user name that doesn't contain spaces. For example, a username like "Ravi" should be used rather than "Ravi Shankar". This avoids some problems with SSH communication (which should be fixed in later versions). Also make sure that your system requires a username and password to log on at startup; do not bypass authentication, because SSH authenticates against your Windows login during its handshakes.
  4. As mentioned earlier, you will need an execution environment for shell scripts. If you're using a Unix-like OS, you will already have a command line available to you; but on a Windows machine, you will need to install the Cygwin tools. Download the Cygwin package, making sure that you have selected the openSSH package (under the Net category) before you begin. For the other packages, you can simply use the defaults.
  5. In this example, Java has been installed in D:\Tiger. You need to make Hadoop aware of this directory. Go to your Hadoop installation in the D:\hadoop directory, then to the conf subdirectory. Open the file named hadoop-env.sh and change the value of JAVA_HOME (uncommenting, if necessary) to the following:

    export JAVA_HOME=/cygdrive/d/Tiger
    
    (Note the /cygdrive prefix. This is how Cygwin maps your Windows directories to a Unix-style directory format.)
  6. Start Cygwin by choosing Start > All Programs > Cygwin > Cygwin Bash Shell.
  7. In Hadoop, communication between different processes across different machines is achieved through SSH, so the next important step is to get sshd running. If you're using SSH for the first time, please note that sshd needs a config file to run, which is generated by the following command:

    ssh-host-config
    
    When you enter this, you will usually be prompted for the value of the CYGWIN environment variable; enter ntsec tty. If you are then asked whether privilege separation should be used, answer no. If asked for your consent to install sshd as a service, answer yes.

    Once this has been set up, start the sshd service by typing:

    /usr/sbin/sshd
    
    To make sure that sshd is running, check the process status:

    ps | grep sshd
    
  8. If sshd is running, you can try to SSH to localhost:

    ssh localhost
    
    If you're asked for a passphrase to SSH to the localhost, press Ctrl-C and enter:

    ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
    cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
    
  9. Try running the example programs available at the Hadoop site. If all of the above steps have gone as they should, you should get the expected output.
  10. Now it's time to create the input data for the EchoOhce application (a reminder sketch of the EchoOhce class appears after these steps):

    echo "Hello" >> word1
    echo "World" >> word2
    echo "Goodbye" >> word3
    echo "JavaWorld" >> word4
    
  11. Next, create a directory in HDFS and put the files you created in Step 10 into it. Note that you do not need to create any partitions for HDFS: it comes as part of the Hadoop installation, and all you need to do is execute the following commands (a programmatic equivalent is sketched after these steps):

    bin/hadoop dfs -mkdir words
    bin/hadoop dfs -put word1 words/
    bin/hadoop dfs -put word2 words/
    bin/hadoop dfs -put word3 words/
    bin/hadoop dfs -put word4 words/
    
  12. Next, create a JAR file for the sample application. As an easy and extensible approach, create two environment variables on your machine, HADOOP_HOME and HADOOP_VERSION. (For the sample under consideration, the values will be D:\hadoop and 0.18.0, respectively, matching the distribution downloaded in Step 2.) Now you can create EchoOhce.jar with the following commands:

    mkdir EchoOhce_classes
    javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d EchoOhce_classes EchoOhce.java
    jar -cvf EchoOhce.jar -C EchoOhce_classes/ .
    
  13. Finally, it's time to see the output. Run the application with the following command:

    bin/hadoop jar EchoOhce.jar com.javaworld.mapreduce.EchoOhce words result
    
    You will see an output screen with details like the following:

    08/07/18 11:14:45 INFO streaming.StreamJob:  map 0%  reduce 0%
    08/07/18 11:14:52 INFO streaming.StreamJob:  map 40%  reduce 0%
    08/07/18 11:14:53 INFO streaming.StreamJob:  map 80%  reduce 0%
    08/07/18 11:14:54 INFO streaming.StreamJob:  map 100%  reduce 0%
    08/07/18 11:15:03 INFO streaming.StreamJob:  map 100%  reduce 100%
    08/07/18 11:15:03 INFO streaming.StreamJob: Job complete: job_20080718003_0007
    08/07/18 11:15:03 INFO streaming.StreamJob: Output: result
    
    Now go to the result directory and look in the output file there. It should contain the following:

    Hello olleH
    World dlroW
    Goodbye eybdooG
    JavaWorld dlroWavaJ
    
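For reference, here is a minimal sketch of what the EchoOhce class driving these steps might look like, written against the 0.18-era org.apache.hadoop.mapred API. The actual listing was developed earlier in this article; the inner class names Map and Reduce below are illustrative assumptions. The mapper emits each input word as a key with its reversal as the value, and the reducer simply passes those pairs through:

    package com.javaworld.mapreduce;

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class EchoOhce {

        // Mapper: emit each input word as the key, its reversal as the value
        public static class Map extends MapReduceBase
                implements Mapper<LongWritable, Text, Text, Text> {
            public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException {
                String word = value.toString().trim();
                String reversed = new StringBuilder(word).reverse().toString();
                output.collect(new Text(word), new Text(reversed));
            }
        }

        // Reducer: pass each (word, reversed) pair through unchanged
        public static class Reduce extends MapReduceBase
                implements Reducer<Text, Text, Text, Text> {
            public void reduce(Text key, Iterator<Text> values,
                    OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException {
                while (values.hasNext()) {
                    output.collect(key, values.next());
                }
            }
        }

        // Driver: args[0] is the input directory (words),
        // args[1] the output directory (result)
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(EchoOhce.class);
            conf.setJobName("echoohce");

            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);
            conf.setMapperClass(Map.class);
            conf.setReducerClass(Reduce.class);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }

Because each input file holds a single word, the default TextInputFormat hands the mapper one line per file, and the default TextOutputFormat writes each word and its reversal to the output as a key-value pair, which is exactly what the result file shows.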
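As an aside, the bin/hadoop dfs commands in Step 11 also have a programmatic equivalent. The following sketch uses the org.apache.hadoop.fs.FileSystem API to create the words directory and copy the four input files into it; the class name PutWords is hypothetical, chosen here for illustration:

    package com.javaworld.mapreduce;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PutWords {
        public static void main(String[] args) throws Exception {
            // Picks up fs.default.name from the Hadoop configuration;
            // in standalone mode this is the local file system
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Equivalent of: bin/hadoop dfs -mkdir words
            fs.mkdirs(new Path("words"));

            // Equivalent of: bin/hadoop dfs -put wordN words/
            for (String name : new String[] { "word1", "word2", "word3", "word4" }) {
                fs.copyFromLocalFile(new Path(name), new Path("words/" + name));
            }
        }
    }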
