Hadoop is an open source implementation of the MapReduce programming model. Rather than relying on the Google File System (GFS), Hadoop uses its own Hadoop Distributed File System (HDFS). HDFS reliably replicates data blocks and places them on different nodes, where Hadoop then performs the computation. HDFS is similar to other filesystems but is designed to be highly fault tolerant. It requires no high-end hardware, running instead on commodity machines, and it is scalable, which was one of the primary design goals of the implementation. HDFS is also independent of any specific hardware or software platform, making it easily portable across heterogeneous systems.
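To make the programming model concrete, here is a minimal sketch of the canonical word-count job, written against Hadoop's original org.apache.hadoop.mapred API (the class name and the command-line paths are placeholders; newer Hadoop releases moved to a different package):

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Map step: emit (word, 1) for every token in the input line.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reduce step: sum the counts collected for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
```

Packaged into a jar, such a job would be submitted with something like bin/hadoop jar wordcount.jar WordCount /input /output, where the jar name and HDFS paths are again placeholders.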
If you've worked with clustered Java EE applications, you're probably familiar with the concept of a master instance that manages the other application server instances (called slaves) in a network deployment architecture. These master instances may be called deployment managers (in WebSphere) or administration servers (in WebLogic). The master server instance is responsible for delegating work to the slave application server instances, listening for handshaking signals from each instance to decide which are alive and which are dead, performing IP multicasting when required to synchronize serializable sessions and data, and other similar tasks. The master stores the slaves' metadata and relevant port information, and the instances work collaboratively so that the end user perceives only a single instance.
HDFS works in much the same way. In the HDFS architecture, the master is called the NameNode and the slaves are called DataNodes. There is only a single NameNode in HDFS, whereas there are many DataNodes across the cluster, usually one per node. HDFS exposes a filesystem namespace (analogous to a package in Java, a tablespace in Oracle, or a namespace in C++) in which user data is stored. A file is split into one or more data blocks, and these blocks are stored on a set of DataNodes. The NameNode holds the metadata describing how blocks map to files and which DataNode stores which block. Note that not every request needs to pass through the NameNode: clients' requests to read and write file data are served directly by the DataNodes, whereas namespace operations such as opening, closing, and renaming files and directories are handled by the NameNode. The NameNode is also responsible for instructing DataNodes to create, replicate, and delete data blocks.
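This division of labor is easiest to see from the client side. The following sketch uses Hadoop's standard FileSystem client API; the file paths and the string written are invented for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster settings (e.g., the NameNode address)
    // from the configuration files on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/hello.txt"); // hypothetical path

    // create() is a namespace operation handled by the NameNode,
    // which allocates blocks; the bytes then stream to DataNodes.
    FSDataOutputStream out = fs.create(file);
    out.writeUTF("hello, HDFS");
    out.close();

    // On read, the client asks the NameNode for block locations,
    // then pulls the data directly from the DataNodes holding them.
    FSDataInputStream in = fs.open(file);
    System.out.println(in.readUTF());
    in.close();

    // Another pure namespace operation handled by the NameNode.
    fs.rename(file, new Path("/user/demo/hello-renamed.txt"));
  }
}
```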
A typical HDFS deployment has a dedicated machine that runs only the NameNode. Each of the other machines in the cluster typically runs one instance of the DataNode software, though the architecture does allow multiple DataNodes on the same machine. The NameNode acts as the repository and arbiter of all metadata but never handles user data itself. To persist that metadata, the NameNode uses a special transaction log called the EditLog.
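How the pieces find each other is configured rather than coded. As a sketch only, using the 0.x-era property and file names (later releases split this file into core-site.xml and hdfs-site.xml; the hostname, port, and local paths are placeholders), the essentials look like this:

```xml
<!-- hadoop-site.xml -->
<configuration>
  <property>
    <!-- where clients and DataNodes find the NameNode -->
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>
  <property>
    <!-- local directory where the NameNode persists its metadata,
         including the EditLog -->
    <name>dfs.name.dir</name>
    <value>/var/hadoop/name</value>
  </property>
  <property>
    <!-- number of copies HDFS keeps of each data block -->
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```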
Though Hadoop is a pure Java implementation, there are two ways to use it without writing Java code: a streaming API, which lets any executable act as mapper or reducer, and Hadoop Pipes, which lets you build Hadoop apps in C++. This article will focus on the former.
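As a taste of what's to come, a streaming job is launched from the command line, with ordinary executables standing in for the mapper and reducer. In this sketch the input and output paths are placeholders, and the exact name and location of the streaming jar vary by Hadoop version:

```
# Count words with Hadoop Streaming: /bin/cat passes each line through
# unchanged as the map step, and /usr/bin/wc aggregates as the reduce step.
bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -input  /user/demo/input \
    -output /user/demo/output \
    -mapper /bin/cat \
    -reducer /usr/bin/wc
```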
Hadoop's main design goal is to provide storage and computation across lots of homogeneous commodity machines. Its implementers selected Linux as the initial platform for development and testing; hence, if you're working with Hadoop on Windows, you will have to install separate software (such as Cygwin) to mimic the shell environment.
Hadoop can run in three different ways, depending on how the processes are distributed: