HDFS Architecture | Hadoop Training in Hyderabad

An application accesses HDFS by linking the HDFS client library into its address space. The client library manages communication from the application to the NameNode and the DataNodes. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are many DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes that they run on.


The NameNode and DataNode are pieces of software designed to run on commodity machines, which typically run a GNU/Linux operating system (OS). HDFS is built using the Java language, so any machine that supports Java can run the NameNode or DataNode software. This means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software; each of the other machines in the cluster runs an instance of the DataNode software. The architecture does not prevent running multiple DataNodes on the same machine, though this is rare in practice.


1. HDFS Files


There is a distinction between an HDFS file and a native file on the host computer. Computers in an HDFS installation serve as either the NameNode or a DataNode, and each has its own native file system. Information about an HDFS file (its metadata) is managed by the NameNode and stored in the NameNode's host file system. The data contained in an HDFS file is managed by the DataNodes and stored on each DataNode's host file system.


HDFS exposes a file system namespace and allows user data to be stored in HDFS files. An HDFS file consists of a number of blocks; each block is 64 MB by default. Each block is replicated a specified number of times, and the replicas are stored on different DataNodes, chosen according to the load on each DataNode and its location, to provide both speed of transfer and resiliency in case of the failure of a rack. See Block Allocation below for a description of the algorithm.
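The way a file's length maps onto fixed-size blocks can be sketched as follows. This is an illustrative Python model, not HDFS code; the 64 MB constant is the default block size mentioned above:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # default HDFS block size: 64 MB

def split_into_blocks(file_length: int, block_size: int = BLOCK_SIZE) -> list:
    """Return the sizes of the blocks a file of the given length occupies."""
    if file_length <= 0:
        return []
    full, remainder = divmod(file_length, block_size)
    sizes = [block_size] * full
    if remainder:
        sizes.append(remainder)  # the last block may be smaller than block_size
    return sizes
```

For example, a 200 MB file occupies three full 64 MB blocks plus one 8 MB block; only the final block is partially filled.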


HDFS uses a standard hierarchical directory structure: files exist in directories, which may in turn be sub-directories of other directories, and so on. There is no concept of a current directory within HDFS; files are always referred to by their fully qualified names. The sections that follow describe how the client interacts with the NameNode and the DataNodes.
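Because there is no current directory, every reference resolves against the root. A minimal sketch of building such a fully qualified name (illustrative only; the helper name is hypothetical):

```python
def qualified_name(*components: str) -> str:
    """Build a fully qualified HDFS path from directory components.

    There is no current-directory state: every file is addressed by its
    absolute path from the root, e.g. /user/alice/data.txt.
    """
    parts = [c.strip("/") for c in components if c.strip("/")]
    return "/" + "/".join(parts)
```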


The NameNode executes HDFS file system namespace operations, such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. Its metadata includes the list of blocks belonging to each file, the current locations of the block replicas on the DataNodes, the state of each file, and the access control information.
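The two core metadata maps the paragraph describes can be modeled as a toy in-memory structure (all file, block, and node names here are hypothetical; real NameNode internals are far richer):

```python
# Namespace: which blocks make up each file.
file_to_blocks = {"/user/alice/data.txt": ["blk_1", "blk_2"]}

# Block map: which DataNodes hold a replica of each block.
block_locations = {
    "blk_1": ["dn-3", "dn-7", "dn-9"],
    "blk_2": ["dn-1", "dn-3", "dn-8"],
}

def locate(path: str) -> list:
    """Answer a client's open() request: each block paired with its replica locations."""
    return [(blk, block_locations[blk]) for blk in file_to_blocks[path]]
```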


The DataNodes are responsible for serving read and write requests from HDFS clients. They perform block replica creation, deletion, and replication upon instruction from the NameNode, and they report the state of their replicas back to the NameNode.
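The reporting step can be sketched as merging one DataNode's list of local replicas into the NameNode's block map (an illustrative model with hypothetical names, not the actual block-report protocol):

```python
def block_report(datanode: str, local_replicas: set, block_locations: dict) -> None:
    """Merge one DataNode's report of its replicas into the NameNode's block map."""
    for nodes in block_locations.values():
        nodes.discard(datanode)  # drop stale entries for this node first
    for blk in local_replicas:
        block_locations.setdefault(blk, set()).add(datanode)
```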


The existence of a single NameNode greatly simplifies the architecture of a cluster. The NameNode is the arbitrator and repository for all HDFS metadata. Clients write data to and read data from the DataNodes directly, so client data never flows through the NameNode.
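This separation of metadata and data paths can be sketched as follows: the client makes a small metadata lookup against the NameNode, then transfers the bulk data directly from a DataNode. The names and storage layout are hypothetical:

```python
# Hypothetical NameNode metadata: file -> list of (block id, replica locations).
namenode = {"/logs/a.txt": [("blk_1", ["dn-3", "dn-7"])]}

# Hypothetical DataNode storage: node -> {block id: bytes}.
datanode_storage = {
    "dn-3": {"blk_1": b"hello"},
    "dn-7": {"blk_1": b"hello"},
}

def read_block(path: str, index: int) -> bytes:
    block_id, locations = namenode[path][index]  # small metadata-only lookup
    node = locations[0]                          # pick a replica (nearest, in practice)
    return datanode_storage[node][block_id]      # bulk transfer bypasses the NameNode
```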


2. Block Allocation



Each block is replicated some number of times; the default replication factor for HDFS is three. When addBlock() is invoked, space is allocated for each replica, and each replica is placed on a different DataNode. The algorithm for performing this allocation attempts to balance performance and reliability by considering:


  The dynamic load on the set of DataNodes: preference is given to less heavily loaded DataNodes.


  The location of the DataNodes: communication between two nodes in different racks has to go through switches, so network bandwidth between machines in the same rack is greater than between machines in different racks.

When the replication factor is three, HDFS's placement policy is to put one replica on a node in the writer's local rack, a second on a node in a different (remote) rack, and the third on a different node in that same remote rack. This policy cuts inter-rack write traffic, which generally improves write performance. Because the chance of a rack failure is far less than that of a node failure, co-locating two replicas in one remote rack does not hurt data reliability and availability guarantees; it does, however, reduce the aggregate network bandwidth used when reading data, since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file are not distributed evenly across racks: one-third of the replicas are on one node, and the other two-thirds are on nodes in a single different rack. Overall, the policy improves write performance without compromising data reliability or read performance.
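The placement described above can be sketched as follows. This is an illustrative model with hypothetical node and rack names; the real allocator also weighs DataNode load and randomizes its choices:

```python
def place_replicas(writer_rack: str, racks: dict) -> list:
    """Default three-replica placement: one node in the writer's local rack,
    plus two distinct nodes in a single remote rack."""
    local = racks[writer_rack][0]
    remote_rack = next(r for r in racks if r != writer_rack)
    remote_a, remote_b = racks[remote_rack][:2]
    return [local, remote_a, remote_b]
```

Note that the three replicas span only two unique racks, matching the trade-off described above: lower inter-rack write traffic at the cost of some read bandwidth.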