MapReduce Overview | Hadoop Training Hyderabad







  MapReduce is a programming model for processing large data sets. Hadoop can run MapReduce programs written in various languages, including Java, Ruby, Python, and C++. MapReduce programs are inherently parallel, which makes them well suited to large-scale data analysis using many machines in a cluster.


MapReduce programs work in two phases:


 1. Map phase


 2. Reduce phase.


The input to each phase is a set of key-value pairs. The programmer needs to specify two functions: a map function and a reduce function.


Input Splits:


The input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a chunk of the input that is consumed by a single map task.

Mapping


This is the first phase in the execution of a MapReduce program. In this phase, the data in each split is passed to a mapping function to produce output values. In the classic word-count example, the job of the mapping phase is to count the number of occurrences of each word in its input split and to emit a list of (word, frequency) pairs, as sketched below.
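As an illustration only, here is a minimal word-count mapper written against the standard Hadoop MapReduce API (org.apache.hadoop.mapreduce); the class and variable names are illustrative, not part of any library.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (word, 1) for every word in each line of the input split.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // one occurrence of this word
        }
    }
}
```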


Shuffling


This phase consumes the output of the mapping phase. Its task is to group the related records from the mapping phase output, so that the same words are clubbed together along with their respective frequencies.


Reducing


In this phase, the output values from the shuffling phase are aggregated. The reduce function combines the values for each key and returns a single output value. In short, this phase summarizes the complete dataset.
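A matching word-count reducer, again only a sketch against the standard Hadoop MapReduce API, sums the counts that the shuffle has grouped by word:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Aggregates the per-word counts produced by the mappers.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        result.set(sum);
        context.write(key, result);   // (word, total frequency)
    }
}
```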


The process in detail


  One map task is created for each split, and it executes the map function for each record in the split.


  It is usually beneficial to have many splits, because the time taken to process a single split is small compared to the time taken to process the whole input. When the splits are smaller, the processing is better load-balanced when the splits are processed in parallel.


  However, it is also not desirable to have splits that are too small. When splits are too small, the overhead of managing the splits and of map task creation begins to dominate the total job execution time.


  For most jobs, it is best to make the split size equal to the size of an HDFS block (which is 64 MB by default).
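As a hedged sketch of how this alignment might be expressed in a job configuration: the property names below (dfs.blocksize, mapreduce.input.fileinputformat.split.minsize/maxsize) are the ones used by recent Hadoop releases and may differ in older versions, so treat them as an assumption.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: aligning the MapReduce split size with the HDFS block size.
public class SplitSizeConfig {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        long blockSize = 64L * 1024 * 1024;                  // 64 MB, as in the text
        conf.setLong("dfs.blocksize", blockSize);            // HDFS block size for new files
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", blockSize);
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", blockSize);
        return conf;
    }
}
```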


  Map tasks write their output to the local disk of the node on which they run, not to HDFS.


  The reason for choosing the local disk over HDFS is to avoid the replication that would take place if the output were stored in HDFS.


  Map output is intermediate output, which is processed by reduce tasks to produce the final output.


  Once the job is complete, the map output is thrown away, so storing it in HDFS with replication would be overkill.


  If a node fails before its map output has been consumed by the reduce task, Hadoop reruns the map task on another node and re-creates the map output.

  Reduce tasks do not benefit from data locality. The output of every map task is fed to the reduce task, so map output is transferred across the network to the machine where the reduce task is running.


  On that machine, the outputs are merged and then passed to the user-defined reduce function.


  Unlike the map output, the reduce output is stored in HDFS (the first replica is stored on the local node and the other replicas are stored on off-rack nodes).


How MapReduce Organizes Work


Hadoop divides the job into tasks. There are two types of tasks:


  Map tasks (Splits & Mapping)

  Reduce tasks (Shuffling, Reducing)


The complete execution process is controlled by two types of entities:


  JobTracker: acts as the master (responsible for the complete execution of a submitted job)


  Many TaskTrackers: act as slaves, each performing part of the job


For every job submitted for execution in the system, there is one JobTracker, which resides on the NameNode machine, and there are many TaskTrackers, which live on the DataNodes.


  A job is divided into many tasks, which are then run on many data nodes in the cluster.


  It is the responsibility of the JobTracker to coordinate this activity by scheduling tasks to run on different data nodes.


  Execution of an individual task is looked after by a TaskTracker, which resides on every data node executing part of the job.

  The TaskTracker's responsibility is to send progress reports to the JobTracker.


  In addition, the TaskTracker periodically sends a 'heartbeat' signal to the JobTracker to notify it of the current state of the node.


  Thus the JobTracker keeps track of the overall progress of each job. In the event of a task failure, the JobTracker can reschedule it on a different TaskTracker.
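To tie these pieces together, here is a hedged sketch of a driver that submits a word-count job; it assumes the standard org.apache.hadoop.mapreduce API and reuses the illustrative TokenizerMapper and IntSumReducer classes from earlier. The framework then schedules the map tasks (splits and mapping) and the reduce tasks (shuffling and reducing) across the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Submits the word-count job; task scheduling and failure handling are
// performed by the framework as described above.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```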

HDFS Architecture | Hadoop Training in Hyderabad






An application links the HDFS client library into its own address space, and the client library manages the communication from the application to the NameNode and the DataNodes. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. There are also many DataNodes, usually one per compute node in the cluster, which manage the storage attached to the nodes that they run on.


The NameNode and DataNode are pieces of software designed to run on commodity machines, which typically run a GNU/Linux operating system (OS). HDFS is built using the Java language, so any machine that supports Java can run the NameNode or the DataNode software. This means HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software, while each of the other machines in the cluster runs the DataNode software. The architecture does not prevent running many DataNodes on the same machine.


1. HDFS Files


There is a distinction between an HDFS file and a native file on the host computer. Computers in an HDFS installation are divided into NameNode and DataNode roles, and each computer has its own native file system. Information about an HDFS file (its metadata) is managed by the NameNode and stored in the NameNode's host file system. The data contained in an HDFS file is managed by DataNodes and stored on the DataNodes' host computer file systems.


HDFS exposes a file system namespace and allows user data to be stored in HDFS files. An HDFS file consists of a number of blocks, each 64 MB by default. Each block is replicated a specified number of times, and the replicas are stored on different DataNodes, chosen to balance load across DataNodes and to provide both speed of transfer and resiliency in case of the failure of a rack. See Block Allocation below for a description of the algorithm.


HDFS uses a standard directory structure: files exist in directories, which may in turn be sub-directories of other directories, and so on. There is no concept of a current directory within HDFS; HDFS files are referred to by their fully qualified names.


The NameNode executes HDFS file system namespace operations, such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. Its metadata includes the list of blocks belonging to each file, the current locations of the block replicas on the DataNodes, the state of each file, and the access control information.


The DataNodes are responsible for serving read and write requests from HDFS clients. They also perform block replica creation, deletion, and replication upon instruction from the NameNode, and they report the state of their replicas back to the NameNode.


There is a single NameNode in a cluster, and it is the arbitrator and repository for all HDFS metadata. Clients write data to, and read data from, the DataNodes directly, so client data never flows through the NameNode.
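As an illustration of how a client works through the client library (the NameNode resolves the path to block locations, while the bytes stream from the DataNodes), here is a minimal read sketch using the public org.apache.hadoop.fs.FileSystem API; the path is hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Reads a text file from HDFS line by line. The NameNode supplies block
// locations; the data itself is streamed from the DataNodes holding replicas.
public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/example.txt")),
                                           StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```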


2. Block Allocation



Each block is replicated some number of times; the default replication factor for HDFS is three. When addBlock() is invoked, space is allocated for each replica, with each replica placed on a different DataNode. The algorithm for performing this allocation attempts to balance performance and reliability, and it considers the following factors:


  The dynamic load on the set of DataNodes: less heavily loaded DataNodes are preferred.


 The location of the DataNodes: communication between two nodes in different racks has to go through switches, so network bandwidth between machines in the same rack is greater than between machines in different racks.

 When the replication factor is three, HDFS's placement policy is to put one replica on a node in the local rack, another on a node in a different (remote) rack, and the last on a different node in that same remote rack. This policy cuts the inter-rack write traffic, which generally improves write performance. The chance of a rack failure is far less than that of a node failure, so this policy does not hurt data reliability and availability guarantees. It does, however, reduce the aggregate network bandwidth used when reading data, since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file are not evenly distributed across the racks: one-third of the replicas are on one node, and the other two-thirds are on nodes in a different rack. This policy improves write performance without compromising data reliability or read performance.
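As a small, hedged illustration of controlling replication from a client (the file path and factor are illustrative), the public FileSystem API allows the replication factor of an existing file to be changed; the NameNode then schedules additional replicas, or removes excess ones, on the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Requests a new replication factor for an existing HDFS file.
public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/example.txt");              // hypothetical path
            boolean accepted = fs.setReplication(file, (short) 3);  // default factor is 3
            System.out.println("Replication change accepted: " + accepted);
        }
    }
}
```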

Overview of HDFS | Hadoop Training in Hyderabad

 
HDFS is an Apache Software Foundation project, a subproject of the Apache Hadoop project. Hadoop is ideal for storing large amounts of data, such as terabytes and petabytes, and it uses HDFS as its storage system. HDFS lets you connect nodes contained within clusters over which data files are distributed, and then access and store the data files as one seamless file system. Access to data files is handled in a streaming manner, meaning that applications and commands are executed directly using the MapReduce processing model.

HDFS provides high-throughput access to large data sets. The points below give a high-level view of its primary features.


HDFS has many similarities with other distributed file systems, but it differs in several respects. One noticeable difference is HDFS's write-once-read-many model, which relaxes concurrency control requirements, simplifies data coherency, and enables high-throughput access.

Another attribute of HDFS is the viewpoint that it is usually better to locate processing logic near the data rather than moving the data to the application space.

HDFS restricts data writing to one writer at a time. Bytes are always appended to the end of a stream, and byte streams are guaranteed to be stored in the order written.
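A minimal write sketch, assuming the public org.apache.hadoop.fs.FileSystem API and an illustrative path; it shows the single-writer, append-to-end style of access described above.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Creates an HDFS file with a single writer; bytes are appended to the end
// of the stream and stored in the order written.
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/events.log"))) {
            out.write("first record\n".getBytes(StandardCharsets.UTF_8));
            out.write("second record\n".getBytes(StandardCharsets.UTF_8));
            // Once closed, the file follows the write-once-read-many model.
        }
    }
}
```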

HDFS has many goals. Here are some of the most notable:

  Fault tolerance, by detecting faults and applying quick, automatic recovery

  Data access via MapReduce streaming

  A simple and robust coherency model

  Processing logic close to the data, rather than the data close to the processing logic

  Portability across commodity hardware and operating systems

  Scalability to store and process large amounts of data

  Economy, by distributing data and processing across clusters of commodity personal computers

  Efficiency, by distributing data and the logic to process it in parallel on the nodes where the data is located

  Reliability, by maintaining many copies of data and redeploying processing logic in the event of failures

HDFS provides interfaces for applications to move themselves closer to where the data is located.

Big Data Use Cases for Modern Business | Hadoop Training Institutes In Hyderabad

 



Today's organizations have huge amounts of data coming from all aspects of their operations. You have probably read about the power of big data during a morning coffee break, but how can big data actually provide business intelligence, unlike other data mining techniques? It is certainly different from running SQL queries or navigating your Excel spreadsheets. The use cases below show how.


1: Log Analytics


Log data is the foundation of many big data applications in business. Log management and analysis tools have been around since long before big data, but with the exponential growth of business activities and transactions, log data must now be stored, processed, and presented in the most efficient, cost-effective manner.
Big data platforms provide the ability to collect, process, and analyze massive amounts of log data, going beyond dumping the data into a relational database and retrieving it through SQL queries. The synergy between log search capabilities and big data analytics lets organizations build log analytics applications for a range of business goals, from IT system security and network performance to market trends and e-commerce personalization.


2: E-Commerce Personalization


When you browse Amazon.com or eBay to find that perfect gift, you type into search boxes, click on the navigation bar, expand product descriptions, or add a product to your cart. To an e-commerce company, these actions are key to optimizing the entire shopping experience, and the tasks of collecting, processing, and analyzing this clickstream and transaction data are a natural fit for big data in e-commerce.
A powerful search and big data analytics platform allows e-commerce companies to clean and enrich product data for a better search experience on both desktops and mobile devices. Predictive analytics and machine learning can infer user preferences from log data and then present products in a most-likely-to-buy order that maximizes conversion. There is also a new movement towards real-time e-commerce personalization, enabled by big data's massive processing power.


3: Recommendation Engines


 If you use online media services such as YouTube, Netflix, or Spotify, you have noticed the recommendations for videos, movies, or music. Doesn't it feel great to have a selection personalized just for you? It's easy, it's time-saving, and it's a satisfying user experience. You may also have noticed that the more videos and movies you watched, the better those recommendations became. As the media and entertainment space fills with strong competitors, the ability to deliver the top user experience will be the winning factor.


  Big data provides the scalability and power to process huge amounts of both structured and unstructured data, allowing companies to analyze billions of clicks and viewing records to drive recommendations. Machine learning and predictive analytics then tailor those recommendations to each user's taste.


4: Automated Candidate Placement in Recruiting


 Recruiting is a race to place candidates as quickly as possible in a competitive environment, and matching resume keywords with job descriptions no longer provides the desired results. Big data for recruiting can speed up and automate the placement process.


  Big data recruitment platforms aggregate candidate databases and provide a complete view of each candidate, covering education, experience, skill sets, job titles, certifications, geography, and anything else available. They then compare this against the company's past hiring experience, salaries, and previously successful candidates. These platforms can anticipate recruiting needs and suggest candidates before positions are even posted, allowing recruiters to be more proactive – a competitive edge against their competitors.


5: Insurance Fraud Detection


Organizations that handle many financial transactions are continually searching for more innovative, effective approaches to fighting fraud. Medical insurance agencies are no exception, as fraud can cost the industry up to $5. Traditionally, fraud investigators need to work with BI analysts to run complex SQL queries on bill and claim data and then wait weeks or months to get the results back. This lengthy delay in processing legal fraud cases means huge losses for the business.

With big data, billions of billing and claim records can be pulled into a search engine, so that investigators can analyze individual records through ad hoc searches on a graphical interface. Predictive analytics and machine learning capabilities also let big data systems raise automatic red-flag alerts when they recognize a pattern that matches a previously known fraud scheme.


6: Relevancy and Retention Boost for Online Publishing


Research publishing companies want to give their online subscribers the most relevant content; they want to build authority, expand the subscriber base, and boost the bottom line, investing great SEO effort to make the publishing site searchable.
First, a powerful search engine helps clean and enrich the research documents' metadata, so that subscribers can find the most relevant content and explore related content. Then machine learning and predictive analytics rank the most valuable documents in the top results.

Analyze Big Data with Hadoop | Hadoop Training Institutes Hyderabad








What is Big Data Analysis


Data comes from social media websites, sensors, devices, video/audio, networks, log files, and the web, and it is generated in real time and on a very large scale. Big data analytics is the process of examining this large amount of different data types, or big data, to uncover hidden patterns, unknown correlations, and other useful information.


Advantages of Big Data Analysis


Big data analysis allows market analysts, researchers, and business users to develop deep insights from the available data. Business users are able to analyze the data, and the key early indicators from this analysis can mean fortunes for the business. Some exemplary use cases are as follows:


  Users browse travel portals and shopping sites, search for flights and hotels, and add products to their carts. Ad-targeting companies can analyze this wide variety of data and activity and offer the user discounts and deals based on the user's browsing history and product purchase history.


  In telecommunications, customers move from one service provider to another. By analyzing call data records, a provider can identify the various issues faced by its customers. These issues could be as wide-ranging as call drops or network congestion problems. Based on these issues, the company can identify whether it needs to place a new tower in a particular urban area, or whether it needs to revise its marketing strategy for a particular region because a new player has come up there.


Case Study – Stock market data



Now let's look at a case study for analyzing stock market data. We will use big data technologies to analyze a 'New York Stock Exchange' dataset and calculate statistics such as covariance for the stock data, solving both the storage and processing problems related to a huge volume of data.


Covariance


Covariance is a financial term that represents the degree to which two stocks or financial instruments move together or apart from each other. It gives investors the chance to choose investments that match their respective risk profiles; statistically, it measures how one investment moves in relation to the other.


A positive covariance means the returns move together: the two investment instruments or stocks tend to be up or down during the same time periods.


A negative covariance means the returns move inversely: one investment instrument tends to be up while the other is down.


This helps a stock broker recommend suitable stocks to his customers.
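For reference, here is a minimal in-memory sketch of the sample covariance of two equal-length return series. In a real job this calculation would be distributed with MapReduce over the full dataset; the array names and values are purely illustrative.

```java
// Computes the sample covariance of two equal-length series of returns.
public final class Covariance {

    public static double sampleCovariance(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("need two series of equal length >= 2");
        }
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;

        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += (x[i] - meanX) * (y[i] - meanY);
        }
        return sum / (x.length - 1);   // positive: move together, negative: move apart
    }

    public static void main(String[] args) {
        double[] stockA = {0.01, -0.02, 0.03, 0.02};   // illustrative daily returns
        double[] stockB = {0.02, -0.01, 0.04, 0.01};
        System.out.println(sampleCovariance(stockA, stockB));
    }
}
```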


Dataset

The sample dataset provided is a comma-separated file that contains stock information such as the stock opening price, the stock's highest price, and so on.


The dataset provided is a small sample with around 3,500 records, but in a real production environment there could be huge stock data running into gigabytes or terabytes.
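As a hedged sketch of how such records might feed the MapReduce pipeline described earlier, the mapper below parses one CSV line and emits (symbol, closing price). The column layout is assumed for illustration only and must be adapted to the actual dataset.

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assumed (illustrative) CSV layout: exchange,symbol,date,open,high,low,close,...
public class StockPriceMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {

    private final Text symbol = new Text();
    private final DoubleWritable close = new DoubleWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length > 6) {
            symbol.set(fields[1]);                        // stock symbol
            close.set(Double.parseDouble(fields[6]));     // closing price
            context.write(symbol, close);
        }
    }
}
```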

An introduction to Apache Hadoop | Hadoop Training in Hyderabad






Apache Hadoop is an open source software framework for the storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop is an Apache top-level project, built and used by a global community of contributors and users, and it is licensed under the Apache License 2.0.


 Hadoop was created by Doug Cutting and Mike Cafarella in 2005 to support distribution for the Nutch search engine project. Doug, who was working at Yahoo! at the time and is now Chief Architect of Cloudera, named the project after his son's toy elephant. Cutting's son was 2 years old at the time and just beginning to talk, and he called his stuffed yellow elephant "Hadoop".


The Apache Hadoop framework is composed of the following modules:


  Hadoop Common:


It contains the libraries and utilities needed by the other Hadoop modules.





  Hadoop Distributed File System (HDFS):


A distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.




  Hadoop YARN:


A resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications.




  Hadoop MapReduce:


A programming model for large-scale data processing.


All the modules in Hadoop are designed with the fundamental assumption that hardware failures are common and should be handled automatically in software by the framework. Hadoop's MapReduce and HDFS were originally derived from Google's MapReduce and Google File System papers.


 The entire Hadoop "platform" is now commonly considered to include a number of related projects as well, such as Apache Pig, Apache Hive, Apache HBase, and others.


 For end-users, though MapReduce Java code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program. Apache Pig and Apache Hive, among other related projects, expose higher-level user interfaces such as Pig Latin and a SQL variant. The Hadoop framework itself is mostly written in the Java programming language, with some native code in C, and its command line utilities are written as shell scripts.

Brief History Of Big Data | Hadoop Training Institutes Hyderabad




Big data is a very large amount of structured or unstructured data. This data is so "big" that it becomes problematic to handle using conventional database techniques and software. A Big Data Scientist is someone who evaluates these large amounts of data; their most important task is to make bulk data understood.


The term "Big Data" also refers to the tools and procedures an organization uses to process a large volume of data. It is estimated that around 91% of the data today has been created in the last 3 years. This need for data handling has led to a need for developing and using big data technologies.


Clear examples of Big Data could be:


Around 600 million tweets are sent per day, which is more than 6,840 tweets per second.


VISA handles around 172,800,000 card transactions every day.


The 3 Vs in Big Data


These are Variety, Velocity, and Volume.






Variety: the varying formats of data, such as databases, Excel sheets, documents, and several other existing formats.




Velocity: the rate at which the data keeps changing, or in other words the rate at which data is created and updated.




Volume: the size of the available data. Today, data sizes have become enormous, ranging from gigabytes to petabytes.


Skills Required to Become a Big Data Scientist


A big data scientist needs a set of technical skills, visualization skills, and business domain expertise, and should also have strong analytical and problem-solving skills.




1. Technical Skills:


Knowledge of at least one big data technology such as Hadoop.


Knowledge of programming and scripting languages like Java and Python.


Knowledge of database management and SQL.


Knowledge of data modeling and relational databases.


Knowledge of statistical tools like SAS and Excel.






2. Visualization Skills:




These include presentation skills and knowledge of tools like PowerPoint, Google Visualization API, Tableau, MS Paint, etc.




3. Business Skills:


These include knowledge of the business domain where you are going to work, as well as an understanding of risk analysis, etc.








Big data has many applications for capital market companies and other industries:




Exploring data
Finding and managing data is a big challenge for every enterprise. Big data technologies can help these enterprises explore their "big data".


Risk Analytics


Risks, fraud, and security can be monitored and controlled by using big data technologies. This can benefit banking, insurance, etc.




Trading Analytics


Companies can analyze their customers and their needs by using big data technologies for data processing.




Medical Data Management




Big data can help manage patient data in the medical sector.




Telecom Data Management




Big data can be used to decrease processing time when managing call data in the telecom sector, and to optimize locations for telecom services.




Financial data management


Financial services companies process several million transactions every day. Big data technologies can help such companies manage this data.




Tax Compliance




Big data can help in detecting tax-related fraud.




Data tagging




Big data can help in organizing and tagging information within data sets.