MapReduce is a programming model for processing large volumes of data. Hadoop can run MapReduce programs written in various languages, such as Java, Ruby, Python, and C++. MapReduce programs are parallel in nature and are therefore well suited to large-scale data analysis using many machines in a cluster.
MapReduce programs work in two phases:
1. Map phase
2. Reduce phase.
The input to each phase is a set of key-value pairs. In addition, the programmer needs to specify two functions: a map function and a reduce function.
Input Splits
The input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is the chunk of the input that is consumed by a single map task.
Mapping
This is the first phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In a word-count job, for example, the job of the mapping phase is to count the number of occurrences of each word in its input split and prepare a list of (word, frequency) pairs.
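As a concrete sketch of what the map function can look like for this word-count example, here is an illustrative mapper written against Hadoop's Java MapReduce API (the class name WordCountMapper is an assumption, not something from the original text):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative word-count mapper: for every line in its split,
// it emits a (word, 1) key-value pair.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1)
        }
    }
}
```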
Shuffling
This phase consumes the output of the mapping phase. Its task is to group the related records from the mapping-phase output: occurrences of the same word are clubbed together along with their respective frequencies, so that each word ends up with the list of counts emitted for it.
Reducing
In this phase, the output values from the shuffling phase are aggregated. The reduce function combines the values for each key and returns a single output value. In short, this phase summarizes the complete dataset.
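A matching reduce function for the word-count example might look like the following sketch (again using Hadoop's Java API; the class name WordCountReducer is illustrative). The shuffle phase has already grouped the counts, so the reducer receives each word together with the list of 1s emitted for it:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative word-count reducer: sums the grouped counts for each word
// and emits a single (word, total) pair.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total);   // single output value per word
    }
}
```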
The process in detail
One map task is created for each split, and it executes the map function for every record in the split.
It is usually beneficial to have many splits, because the time taken to process a single split is small compared with the time taken to process the whole input. When the splits are smaller, the processing is better load-balanced, since the splits are processed in parallel.
However, it is not desirable to have splits that are too small. When splits are too small, the overhead of managing the splits and of creating map tasks begins to dominate the total job execution time.
For most jobs, it is better to make the split size equal to the size of an HDFS block (64 MB by default).
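The split size is normally taken from the HDFS block size, but it can also be influenced per job. The following is a minimal sketch using Hadoop's FileInputFormat helpers; the 64 MB and 128 MB values are examples only, not recommendations from the original text:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-example");

        // By default FileInputFormat derives the split size from the HDFS block size.
        // These calls set explicit bounds on the split size (values are examples only).
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB minimum
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // 128 MB maximum
    }
}
```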
Map tasks write their output to the local disk of the node on which they run, not to HDFS.
The reason for choosing the local disk over HDFS is to avoid the replication that takes place when data is stored in HDFS.
Map output is intermediate output: it is processed by reduce tasks to produce the final output.
Once the job is complete, the map output is thrown away, so storing it in HDFS with replication would be overkill.
If a node fails before its map output has been consumed by the reduce task, Hadoop reruns the map task on another node and re-creates the map output.
Reduce tasks do not benefit from data locality. The output of every map task is fed to the reduce task, so map output is transferred across the network to the machine where the reduce task is running.
On that machine, the outputs are merged and then passed to the user-defined reduce function.
Unlike the map output, the reduce output is stored in HDFS (the first replica is stored on the local node and the other replicas are stored on off-rack nodes).
How Does MapReduce Organize Work?
Hadoop divides the job into tasks. There are two types of tasks:
Map tasks (Splits & Mapping)
Reduce tasks (Shuffling, Reducing)
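To see how a job is wired together before it is divided into these tasks, here is a minimal driver sketch that submits the illustrative word-count mapper and reducer shown earlier (the class names and the use of command-line arguments for the input and output paths are assumptions):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        // One map task is created per input split; the number of reduce
        // tasks is configurable (one here, for a single output file).
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setNumReduceTasks(1);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```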
The complete execution process is controlled by two types of entities:
Jobtracker: acts as the master (responsible for the complete execution of a submitted job)
Task Trackers: act as slaves, each performing part of the job
For every job submitted for execution in the system, there is one Jobtracker, which resides on the Namenode, and there are many Task Trackers, which reside on the Datanodes.
A job is divided into many tasks, which are then run on multiple data nodes in the cluster.
It is the responsibility of the Jobtracker to coordinate this activity by scheduling tasks to run on different data nodes.
Execution of each individual task is looked after by a Task Tracker, which resides on every data node that executes part of the job.
The Task Tracker's responsibility is to send progress reports to the Jobtracker.
In addition, the Task Tracker periodically sends a 'heartbeat' signal to the Jobtracker to notify it of the current state of the system.
Thus the Jobtracker keeps track of the overall progress of each job. In the event of a task failure, the Jobtracker can reschedule the task on a different Task Tracker.