The surest way to understand how Hadoop works is to walk through the process of writing a Hadoop MapReduce application. For the remainder of this article, we'll be working with EchoOhce, a simple MapReduce application that reverses strings. The input strings to be reversed stand in for the large volumes of data that MapReduce applications typically work with. The example distributes the data across nodes, performs the reversal operations, combines the resulting strings, and then outputs the results. This application provides an opportunity to examine all of the main concepts of Hadoop. After you understand how it works, you'll see how it can be deployed in different modes.
First, take a look at the package declaration and imports in Listing 3. The EchoOhce class is in the com.javaworld.mapreduce package.
package com.javaworld.mapreduce;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
The first set of imports is for the standard Java classes, and the second set is for the MapReduce implementation.
The EchoOhce class begins by extending org.apache.hadoop.conf.Configured and implementing the interface org.apache.hadoop.util.Tool, as you can see in Listing 4.
public class EchoOhce extends Configured implements Tool {
    // ... your code goes here
}
The Configured class is responsible for delivering the configuration parameters specified in certain XML files. It does so when the programmer invokes its getConf() method, which returns an instance of org.apache.hadoop.conf.Configuration -- essentially a holder for resources specified as name-value pairs in XML data. Each resource is named either by a String or by an org.apache.hadoop.fs.Path instance.
By default, two resources are loaded, in order, from the classpath:
hadoop-default.xml: Read-only defaults for the Hadoop installation.
hadoop-site.xml: Site-specific configuration, which overrides the defaults.
Note that applications may add as many additional resources as they need; those, too, are loaded in order from the classpath.
You can find out more in the Hadoop API documentation for the addResource() and addFinalResource() methods. addFinalResource() lets you declare a resource final, so that subsequently loaded resources cannot alter its values.
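As a hedged illustration of the resource format (the property name here is invented for the example), a site resource such as hadoop-site.xml holds name-value pairs in this XML shape; marking a property final prevents later resources from overriding it:

```xml
<configuration>
  <property>
    <name>example.echo.greeting</name>
    <value>hello</value>
    <final>true</final>
  </property>
</configuration>
```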
You might have noticed that the code implements an interface named Tool. This interface supports methods for handling generic command-line options. It requires the programmer to write a run() method that takes a String array as its parameter and returns an int; the returned integer indicates whether the execution succeeded. Once you've implemented the run() method in your class, you can write your main() method, as in Listing 5.
public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new EchoOhce(), args);
    System.exit(res);
}
The org.apache.hadoop.util.ToolRunner class invokes the run() method implemented in the EchoOhce class. The ToolRunner utility helps to run classes that implement the Tool interface. With this facility, developers can avoid writing a custom handler to process various input options.
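If you want to see this control flow without Hadoop on the classpath, here is a minimal plain-Java sketch of the Tool/ToolRunner contract. The MiniTool and MiniRunner names are invented for illustration; they are not Hadoop classes:

```java
// MiniTool mimics org.apache.hadoop.util.Tool: run() returns an exit code.
interface MiniTool {
    int run(String[] args) throws Exception;
}

// MiniRunner mimics ToolRunner: the real one parses generic options
// (such as -conf and -D) before delegating to the tool's run() method.
class MiniRunner {
    static int run(MiniTool tool, String[] args) throws Exception {
        // The real ToolRunner strips generic options here; we just delegate.
        return tool.run(args);
    }
}

public class ToolSketch {
    public static void main(String[] args) throws Exception {
        // A tool that succeeds (returns 0) only when given at least one argument.
        MiniTool echoTool = toolArgs -> toolArgs.length > 0 ? 0 : 1;
        int res = MiniRunner.run(echoTool, new String[] {"hello"});
        System.out.println(res);  // prints 0; the caller would pass this to System.exit()
    }
}
```

The point of the pattern is that run() communicates success or failure purely through its return value, which main() then turns into the process exit status.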
Now you can jump into the actual MapReduce implementation. You're going to write two inner classes within the EchoOhce class. They are:
Map: Includes functionality for processing input key-value pairs to generate output key-value pairs.
Reduce: Includes functionality for collecting output from parallel map processing and outputting that collected data.
Figure 1 illustrates how the sample app will work.
First, take a look at the Map class in Listing 6.
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

    private Text inputText = new Text();
    private Text reverseText = new Text();

    public void map(LongWritable key, Text inputs,
                    OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {

        String inputString = inputs.toString();
        int length = inputString.length();
        StringBuffer reverse = new StringBuffer();
        for (int i = length - 1; i >= 0; i--) {
            reverse.append(inputString.charAt(i));
        }
        inputText.set(inputString);
        reverseText.set(reverse.toString());
        output.collect(inputText, reverseText);
    }
}
As mentioned earlier, the EchoOhce application must take an input string, reverse it, and emit a key-value pair containing the input and reversed strings together. The map() method receives its inputs and an output collector as parameters. From the inputs it extracts the input String, uses the plain Java API to reverse it, and then forms a key-value pair from the input String and the reversed String. The OutputCollector instance accumulates the result of this processing. Assume that this is one result obtained from one execution of the map() function on one of the nodes.
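Stripped of the Hadoop types, the work one map() call performs is plain Java. This sketch (the mapOne method name is invented) shows the (input, reversed) pair a single call would emit:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;

public class MapStepSketch {
    // Mirrors the body of map(): reverse the input and pair it with the original.
    static Entry<String, String> mapOne(String input) {
        StringBuilder reverse = new StringBuilder();
        for (int i = input.length() - 1; i >= 0; i--) {
            reverse.append(input.charAt(i));
        }
        return new SimpleEntry<>(input, reverse.toString());
    }

    public static void main(String[] args) {
        System.out.println(mapOne("hadoop"));  // prints hadoop=poodah
    }
}
```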
Obviously, you'll need to combine all such outputs. This is exactly what the reduce() method of the Reduce class, shown in Listing 7, will do.
public static class Reduce extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output,
                       Reporter reporter) throws IOException {

        while (values.hasNext()) {
            output.collect(key, values.next());
        }
    }
}
The MapReduce framework knows how many OutputCollectors there are and which are to be combined for the final result. The reduce() method actually does the grunt work.
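To make that concrete without the framework, here is a hedged plain-Java sketch of what the shuffle phase and reduce() together accomplish: pairs with the same key are grouped, and reduce() simply forwards each value to the output. The class and method names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReduceStepSketch {
    // Stands in for the framework's shuffle phase: group all values
    // emitted under the same key across the map outputs.
    static Map<String, List<String>> shuffle(List<String[]> pairs) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> mapOutputs = List.of(
            new String[] {"abc", "cba"},
            new String[] {"ohce", "echo"});
        // Like reduce(): emit every (key, value) from the grouped results.
        shuffle(mapOutputs).forEach((key, values) ->
            values.forEach(value -> System.out.println(key + "\t" + value)));
    }
}
```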
Finally, to complete EchoOhce's Main class, you need to set the values for your configuration. Basically, these values inform the MapReduce framework about the
types of the output keys and values, the names of the Map and Reduce classes, and so on. The complete run() method is shown in Listing 8.
public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), EchoOhce.class);
    conf.setJobName("EchoOhce");
    ...
As you can see in the listing, you must first create a JobConf instance; org.apache.hadoop.mapred.JobConf extends the Configuration class. JobConf has the primary responsibility of sending your map and reduce implementations to the Hadoop framework for execution. Once the JobConf instance has been given the appropriate values for your MapReduce implementation, you invoke the most important method, runJob(), on the org.apache.hadoop.mapred.JobClient class, passing in the JobConf instance. JobClient internally communicates with the org.apache.hadoop.mapred.JobTracker class and provides facilities for submitting jobs, tracking their progress, accessing logs, and getting cluster status.
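For reference, a completed run() under the old org.apache.hadoop.mapred API might look like the sketch below. The input/output path handling is an assumption based on the surrounding description, not the article's exact listing:

```java
public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), EchoOhce.class);
    conf.setJobName("EchoOhce");

    // Tell the framework the types of the output keys and values...
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    // ...and the Map and Reduce classes to use.
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);

    // Input and output locations, taken here from the command line (an assumption).
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Submit the job via JobClient and wait for it to finish.
    JobClient.runJob(conf);
    return 0;
}
```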
That should give you a good sense of how EchoOhce, a sample MapReduce application, works. We'll conclude with instructions for installing the relevant software and running the application.