MapReduce programming with Apache Hadoop
Process massive data sets in parallel on large clusters
By Ravi Shankar and Govindu Narendra, JavaWorld.com, 09/23/08
Installing a MapReduce application in standalone mode
Unlike a Java EE application that can easily be deployed onto an app server, a MapReduce application using Hadoop requires
some extra steps for deployment. First, you should understand the default, out-of-the-box way that Hadoop operates: standalone
mode. The following steps describe how to set up the application on a Windows XP Professional system; the process would be
almost identical for any other environment, with an important exception that you'll learn more about in a moment.
1. Ensure that version 5.0 or above of Java is installed on your machine.
2. Download the latest version of Hadoop. At the time this article was published, the latest distribution was version 0.18.0. Save the download into a directory; this example will use D:\hadoop.
3. Make sure that you're logged in with an OS username that doesn't contain spaces. For example, a username like "Ravi" should be used rather than "Ravi Shankar"; usernames with spaces cause problems (to be fixed in later versions) with SSH communication. Please also make sure that your system requires a username and password to log on at startup, and do not bypass authentication: SSH synchronizes with the Windows login during its handshakes.
4. As mentioned earlier, you will need an execution environment for shell scripts. If you're using a Unix-like OS, you will already have a command line available to you; but on a Windows machine, you will need to install the Cygwin tools. Download the Cygwin package, making sure that you have selected the openSSH package (under the Net category) before you begin. For the other packages, you can simply use the defaults.
5. In this example, Java has been installed in D:\Tiger. You need to make Hadoop aware of this directory. Go to your Hadoop installation in the D:\hadoop directory, then to the conf subdirectory. Open the file named hadoop-env.sh and change the value of JAVA_HOME (uncommenting, if necessary) to the following:
export JAVA_HOME=/cygdrive/d/Tiger
(Note the /cygdrive prefix. This is how Cygwin maps your Windows drives to a Unix-style directory format.)
6. Start Cygwin by choosing Start > All Programs > Cygwin > Cygwin Bash Shell.
7. In Hadoop, communication between different processes across different machines is achieved through SSH, so the next important step is to get sshd running. If you're using SSH for the first time, note that sshd needs a configuration to run, which is generated by the following command:
ssh-host-config
When you enter this, you will usually be prompted for the value of the CYGWIN environment variable; enter ntsec tty. If you are asked whether privilege separation should be used, answer no. If asked for your consent to install sshd as a service, answer yes.
Once this has been set up, start the sshd service by typing:
net start sshd
To make sure that sshd is running, check the process status:
ps -ef | grep sshd
8. If sshd is running, you can try to SSH to localhost:
ssh localhost
If you're asked for a passphrase, press Ctrl-C and set up passwordless SSH with an empty-passphrase key pair:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
9. Try running the example programs available at the Hadoop site. If all of the above steps have gone as they should, you should get the expected output.
10. Now it's time to create the input data for the EchoOhce application:
echo "Hello" >> word1
echo "World" >> word2
echo "Goodbye" >> word3
echo "JavaWorld" >> word4
11. Next, you need to put the files you created in Step 10 into HDFS, after first creating a directory for them. Note that you do not need to create any partitions for HDFS; it comes as part of the Hadoop installation, and all you need to do is execute the following commands:
bin/hadoop dfs -mkdir words
bin/hadoop dfs -put word1 words/
bin/hadoop dfs -put word2 words/
bin/hadoop dfs -put word3 words/
bin/hadoop dfs -put word4 words/
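As an aside, these dfs shell commands have programmatic equivalents in Hadoop's org.apache.hadoop.fs.FileSystem API. The following is a minimal sketch of the same upload, assuming the file and directory names used above; the class name PutWords is just for illustration and is not part of the sample application:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: programmatic equivalent of the dfs -mkdir/-put commands above
public class PutWords {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // reads the files in conf/
        FileSystem fs = FileSystem.get(conf);      // the configured file system
        fs.mkdirs(new Path("words"));              // same as: bin/hadoop dfs -mkdir words
        for (String name : new String[] {"word1", "word2", "word3", "word4"}) {
            // same as: bin/hadoop dfs -put <name> words/
            fs.copyFromLocalFile(new Path(name), new Path("words", name));
        }
    }
}

For this example the shell commands are all you need, but the same API is what your MapReduce code uses under the covers to read and write job data.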
12. Next, create a JAR file for the sample application. As an easy and extensible approach, create two environment variables on your machine, HADOOP_HOME and HADOOP_VERSION. (For the sample under consideration, the values will be D:\hadoop and 0.18.0, the version downloaded in Step 2.) Now you can create EchoOhce.jar with the following commands:
mkdir EchoOhce_classes
javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d EchoOhce_classes EchoOhce.java
jar -cvf EchoOhce.jar -C EchoOhce_classes/ .
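The javac command assumes EchoOhce.java is in your current directory. For reference, here is a minimal sketch of what such a class could look like, written against the org.apache.hadoop.mapred API that ships with Hadoop 0.18; the ReverseMapper and PassThroughReducer names are this sketch's own, not necessarily those of the article's listing:

package com.javaworld.mapreduce;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class EchoOhce {

    // Mapper: emits each input word as the key and its reverse as the value
    public static class ReverseMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            String word = value.toString().trim();
            String reversed = new StringBuilder(word).reverse().toString();
            output.collect(new Text(word), new Text(reversed));
        }
    }

    // Reducer: simply writes each word/reversed pair to the output
    public static class PassThroughReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            while (values.hasNext()) {
                output.collect(key, values.next());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(EchoOhce.class);
        conf.setJobName("echoohce");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(ReverseMapper.class);
        conf.setReducerClass(PassThroughReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // words
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // result
        JobClient.runJob(conf);
    }
}

With the default TextInputFormat, each one-word input file yields a single map record, so the job's output pairs every word with its mirror image, matching the result shown below.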
13. Finally, it's time to see the output. Run the application with the following command:
bin/hadoop jar EchoOhce.jar com.javaworld.mapreduce.EchoOhce words result
You will see an output screen with details like the following:
08/07/18 11:14:45 INFO streaming.StreamJob: map 0% reduce 0%
08/07/18 11:14:52 INFO streaming.StreamJob: map 40% reduce 0%
08/07/18 11:14:53 INFO streaming.StreamJob: map 80% reduce 0%
08/07/18 11:14:54 INFO streaming.StreamJob: map 100% reduce 0%
08/07/18 11:15:03 INFO streaming.StreamJob: map 100% reduce 100%
08/07/18 11:15:03 INFO streaming.StreamJob: Job complete: job_20080718003_0007
08/07/18 11:15:03 INFO streaming.StreamJob: Output: result
Now go to the result directory and look at the output file inside. It should contain the following:
Hello olleH
World dlroW
Goodbye eybdooG
JavaWorld dlroWavaJ