MapReduce programming with Apache Hadoop

Process massive data sets in parallel on large clusters

Apache Hadoop

Hadoop is an open source implementation of the MapReduce programming model. Hadoop does not rely on the Google File System (GFS); instead, it uses its own Hadoop Distributed File System (HDFS). HDFS reliably replicates data blocks and places them on different nodes; Hadoop then performs computation on those nodes. HDFS is similar to other filesystems, but is designed to be highly fault tolerant. It does not require any high-end hardware and runs on commodity machines and software; it is also scalable, which is one of the primary design goals of the implementation. HDFS is independent of any specific hardware or software platform, and is therefore easily portable across heterogeneous systems.
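
To make this concrete, here is a minimal sketch (not code from this article) that uses Hadoop's Java FileSystem API to write a small file into HDFS and then ask where the file's replicated blocks were placed. The hdfs:// URI, the host name, and the paths are placeholder assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in practice this comes from the XML configuration files.
        conf.set("fs.default.name", "hdfs://namenode.example.com:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/demo/hello.txt");

        // Write a small file; HDFS splits it into blocks and replicates them across DataNodes.
        FSDataOutputStream out = fs.create(file);
        out.writeBytes("Hello, HDFS\n");
        out.close();

        // Ask where the replicas of each block of the file were placed.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            for (String host : block.getHosts()) {
                System.out.println("Block replica on: " + host);
            }
        }
        fs.close();
    }
}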

If you've worked with clustered Java EE applications, you're probably familiar with the concept of a master instance that manages other application server instances (called slaves) in a network deployment architecture. These master instances may be called deployment managers (in WebSphere), manager servers (in WebLogic), or admin servers (in Tomcat). It is the master server instance's responsibility to delegate various responsibilities to the slave application server instances, to listen for handshaking signals from each instance in order to decide which are alive and which are dead, to perform IP multicasting whenever required to synchronize serializable sessions and data, and to handle other similar tasks. The master stores the metadata and relevant port information of the slaves and works with them collaboratively, so that the end user perceives only a single instance.

HDFS works in much the same way. In the HDFS architecture, the master is called the NameNode and the slaves are called DataNodes. There is only a single NameNode in HDFS, whereas there are many DataNodes across the cluster, usually one per node. HDFS provides a namespace (similar to a package in Java, a tablespace in Oracle, or a namespace in C++) for storing user data. A file may be split into one or more data blocks, and these data blocks are stored in a set of DataNodes. The NameNode holds the metadata describing how a file maps to its blocks and which DataNodes store each block. Note that not every request needs to pass through the NameNode: client requests to read and write file data are served directly by the DataNodes, whereas namespace operations such as opening, closing, and renaming files and directories are performed by the NameNode. The NameNode is also responsible for issuing instructions to the DataNodes for data block creation, replication, and deletion.
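
As a rough illustration of this division of labor, the following sketch (again illustrative, not taken from the article) performs two namespace operations, mkdirs and rename, which are resolved by the NameNode, and then opens a file for reading, where the bytes are streamed from the DataNodes that hold the blocks. The paths are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class NamespaceVsDataDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Namespace operations: metadata only, handled by the NameNode.
        fs.mkdirs(new Path("/demo/incoming"));
        fs.rename(new Path("/demo/hello.txt"), new Path("/demo/incoming/hello.txt"));

        // Data operation: the client gets block locations from the NameNode,
        // then reads the bytes directly from the DataNodes.
        FSDataInputStream in = fs.open(new Path("/demo/incoming/hello.txt"));
        IOUtils.copyBytes(in, System.out, 4096, true); // true: close the stream when done
        fs.close();
    }
}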

A typical HDFS deployment has a dedicated machine that runs only the NameNode. Each of the other machines in the cluster typically runs one instance of the DataNode software, though the architecture does allow multiple DataNodes on the same machine. The NameNode manages the metadata repository and control functions but never handles user data itself. To persist changes to the metadata, the NameNode uses a special log called the EditLog.

Deploying Hadoop

Though Hadoop is a pure Java implementation, you can use it in two different ways: you can either take advantage of the streaming API provided with it or use Hadoop Pipes. The latter option lets you build Hadoop applications in C++; this article focuses on the former.
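
To illustrate the streaming contract, here is a minimal, hypothetical word-count mapper written as an ordinary Java program: Hadoop Streaming feeds it input records on standard input and expects tab-separated key/value pairs on standard output, which is the same protocol a mapper written in any other language would follow.

import java.io.BufferedReader;
import java.io.InputStreamReader;

// A streaming mapper is just a program that reads lines from stdin and
// writes "key<TAB>value" lines to stdout; Hadoop Streaming handles the rest.
public class WordCountStreamingMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");
                }
            }
        }
    }
}

Such a program would typically be wired into a job through the hadoop-streaming jar's -input, -output, -mapper, and -reducer options; the exact invocation depends on your Hadoop version.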

Hadoop's main design goal is to provide storage and communication on lots of homogeneous commodity machines. The implementers selected Linux as their initial platform for development and testing; hence, if you're working with Hadoop on Windows, you will have to install separate software to mimic the shell environment.

Hadoop can run in three different ways, depending on how the processes are distributed:

  • Standalone mode: This is the default mode provided with Hadoop. Everything is run as a single Java process.
  • Pseudo-distributed mode: Here, Hadoop is configured to run on a single machine, with different Hadoop daemons run as different Java processes.
  • Fully distributed or cluster mode: Typically, one machine in the cluster is designated as the NameNode and another machine as the JobTracker. There is exactly one NameNode in each cluster, which manages the namespace, filesystem metadata, and access control. You can also set up an optional SecondaryNameNode, which periodically checkpoints the NameNode's metadata to aid recovery. The rest of the machines in the cluster act as both DataNodes and TaskTrackers. The DataNodes hold the system data; each DataNode manages its own local storage, typically its local hard disk. The TaskTrackers carry out map and reduce operations. A configuration sketch for these modes follows this list.
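
Which mode a job runs in is ultimately a question of configuration. The sketch below is an illustration using the configuration keys of the Hadoop 0.x era (fs.default.name and mapred.job.tracker); the host names and ports are placeholders, and in a real deployment these values normally live in Hadoop's XML configuration files rather than in code.

import org.apache.hadoop.conf.Configuration;

public class ModeConfigSketch {
    // Standalone mode: local filesystem plus the in-process "local" job runner.
    static Configuration standalone() {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "file:///");
        conf.set("mapred.job.tracker", "local");
        return conf;
    }

    // Pseudo-distributed or fully distributed mode: point the client at the
    // NameNode (for HDFS) and the JobTracker; hosts and ports are placeholders.
    static Configuration distributed(String namenodeHost, String jobTrackerHost) {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://" + namenodeHost + ":9000");
        conf.set("mapred.job.tracker", jobTrackerHost + ":9001");
        return conf;
    }
}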
