Big Data Map Reduce | IndianTechnoEra

What is Map Reduce?

MapReduce is a programming model for processing and generating large data sets. Frameworks that implement it, such as Hadoop MapReduce, distribute the work so that the data is processed in parallel across multiple nodes.

A MapReduce job is divided into two phases, the map phase and the reduce phase. The map phase takes an input dataset and processes it into a set of intermediate key-value pairs. The framework then groups these pairs by key, and the reduce phase combines the values for each key into a smaller set of values. The output of the reduce phase is the final result of the MapReduce job.
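The two phases can be illustrated with a word count. The following is a minimal single-process sketch, not a distributed implementation; the function names `map_phase`, `reduce_phase`, and `run_job` are chosen here for illustration and do not come from any framework.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(key, values):
    """Collapse all counts for one word into a single total."""
    return key, sum(values)

def run_job(lines):
    # Map: turn every input record into intermediate key-value pairs.
    intermediate = [pair for line in lines for pair in map_phase(line)]
    # Group by key (a real framework does this between the two phases).
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    # Reduce: one call per distinct key.
    return dict(reduce_phase(key, values) for key, values in grouped.items())

word_counts = run_job(["big data map reduce", "map reduce map"])
print(word_counts)
```

Here the word "map" appears three times across both lines, so its three intermediate pairs are reduced to a single count of 3.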

MapReduce is an effective way of processing and analyzing large datasets because the work is spread across multiple nodes. For large inputs this is typically faster than a traditional system that processes the data sequentially on a single node. MapReduce also provides fault tolerance and scalability, which makes it suitable for handling large datasets.

Map Reduce Architecture

[Figure: Map Reduce architecture, with an example]

Anatomy (Phase) of a Map Reduce

1. Input: The input to a Map Reduce job is typically a large data set that is stored in a distributed file system. The data set is divided into chunks that are processed in parallel.

2. Map: The Map phase processes each input chunk and produces a set of intermediate key-value pairs.

3. Partition function: The partition function assigns the output of each Map function to the appropriate reducer. It is given the key, the value, and the number of reducers, and it returns the index of the reducer that should receive the pair.

4. Shuffle and Sort: The Shuffle and Sort phase collects the intermediate pairs from the Map phase, sorts them by key, and groups together all values that share a key.

5. Reduce: The Reduce phase processes each key together with its list of values and generates a set of output values.

6. Output: The output of a Map Reduce job is typically stored in a distributed file system.
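The six steps above can be sketched end to end in a single process. This is an illustration under assumptions, not how any particular framework is implemented: the chunk size, the number of reducers, and the CRC32-based `partition` function are all chosen here for the example (CRC32 is used only because it gives the same result on every run).

```python
import zlib
from itertools import groupby

def split_input(records, chunk_size):
    """Input: divide the data set into chunks that can be processed independently."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def map_chunk(chunk):
    """Map: emit one (word, 1) pair per word in the chunk."""
    return [(word, 1) for line in chunk for word in line.split()]

def partition(key, num_reducers):
    """Partition: return the index of the reducer responsible for this key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def run_pipeline(records, chunk_size=2, num_reducers=2):
    buckets = [[] for _ in range(num_reducers)]
    for chunk in split_input(records, chunk_size):
        for key, value in map_chunk(chunk):
            buckets[partition(key, num_reducers)].append((key, value))
    output = {}
    for bucket in buckets:
        # Shuffle and Sort: order each reducer's pairs by key so equal keys are adjacent.
        bucket.sort(key=lambda pair: pair[0])
        for key, group in groupby(bucket, key=lambda pair: pair[0]):
            # Reduce: one key with all of its values.
            output[key] = sum(value for _, value in group)
    # Output: a real job would write this to the distributed file system.
    return output

result = run_pipeline(["a b", "b c", "a a"])
print(result)
```

Because the partition function depends only on the key, every pair with the same key lands in the same bucket, which is what lets a single reducer see all values for that key.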

a) Map Reduce Job Run

1. Job Submission: The user submits the job to the JobTracker, which then looks for available TaskTrackers to execute the job.

2. Job Initialization: The JobTracker initializes the job and distributes the job configuration to the TaskTrackers.

3. Map Task Assignment: The JobTracker assigns the map tasks to the TaskTrackers.

4. Map Task Execution: The TaskTrackers execute the map tasks in parallel and generate intermediate key-value pairs.

5. Shuffle and Sort: The TaskTrackers shuffle and sort the intermediate key-value pairs and send them to the reducer.

6. Reduce Task Assignment: The JobTracker assigns the reduce tasks to the TaskTrackers.

7. Reduce Task Execution: The TaskTrackers execute the reduce tasks in parallel and generate the output key-value pairs.

8. Output: The output key-value pairs are written to the output file.
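The job run can be imitated on one machine with a thread pool. In this sketch the pool stands in for the TaskTrackers and the call that hands work to the pool stands in for the JobTracker; the names `map_task`, `reduce_task`, and `run_parallel_job` are hypothetical, and a single reduce task is assumed for brevity.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def map_task(split):
    """One map task: count the words in its own input split."""
    return Counter(word for line in split for word in line.split())

def reduce_task(partials):
    """One reduce task: merge the partial counts from every map task."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

def run_parallel_job(splits, workers=3):
    # The pool plays the role of the TaskTrackers; mapping the splits onto it
    # plays the role of the JobTracker assigning one map task per split.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_task, splits))
    # The reduce task starts only after every map task has finished.
    return reduce_task(partials)

job_output = run_parallel_job([["x y"], ["y z"], ["x x"]])
print(job_output)
```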

b) Map Reduce failures

1. System Errors: System errors occur when the MapReduce job fails due to a problem with the underlying infrastructure, such as an overloaded cluster or insufficient memory.

2. Application Errors: Application errors occur when the MapReduce job fails due to a bug in the application code, such as incorrect logic or an unhandled exception in a Map or Reduce function.

3. Data Errors: Data errors occur when the input data is incorrect or corrupt. These errors can be caused by bad formatting or missing data.

4. Scheduling Errors: Scheduling errors occur when the job is not scheduled properly. This can lead to jobs not running when they should, or running too frequently.

5. Network Errors: Network errors occur when the job fails due to a problem with the network connection. This can be caused by a slow connection or a dropped connection.

c) Map Reduce job scheduling

1. The client sends a request to the JobTracker to start a job.

2. The JobTracker asks the NameNode where the input data blocks are located and learns which TaskTrackers are available from the heartbeats they send.

3. The JobTracker creates a job configuration and divides the job into tasks.

4. The JobTracker sends the tasks to the TaskTrackers.

5. The TaskTrackers execute the task on the node where the data resides.

6. The TaskTrackers report the progress of the tasks back to the JobTracker.

7. The JobTracker monitors the progress of the tasks and reschedules any tasks that fail.

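8. The TaskTrackers write the output of the reduce tasks to the distributed file system and report completion to the JobTracker.

9. The JobTracker marks the job as complete once all tasks have finished and makes the final status available to the client.

10. The client learns from the JobTracker that the job has finished and reads the results from the output location.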
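The idea of running a task on the node where its data resides can be sketched as a small scheduling function. This is a simplified illustration, not the scheduler of any real framework; the function name `schedule`, the block and node names, and the slot counts are all made up for the example.

```python
def schedule(tasks, block_locations, free_slots):
    """Assign each task to a node, preferring a node that stores its input block.

    tasks: list of block ids, one map task per block
    block_locations: block id -> list of nodes holding a copy of that block
    free_slots: node -> number of tasks the node can still accept
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block in tasks:
        local = [n for n in block_locations.get(block, []) if slots.get(n, 0) > 0]
        remote = [n for n, free in slots.items() if free > 0]
        candidates = local or remote  # fall back to any free node
        if not candidates:
            assignment[block] = None  # no capacity; a real scheduler would retry later
            continue
        node = candidates[0]
        slots[node] -= 1
        assignment[block] = node
    return assignment

plan = schedule(
    tasks=["blk1", "blk2", "blk3"],
    block_locations={"blk1": ["nodeA"], "blk2": ["nodeA", "nodeB"], "blk3": ["nodeC"]},
    free_slots={"nodeA": 1, "nodeB": 2, "nodeC": 0},
)
print(plan)
```

In this run `blk1` and `blk2` are placed on nodes that hold their data, while `blk3` falls back to `nodeB` because `nodeC`, the only node storing it, has no free slot.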

d) Map Reduce Shuffle and Sort

1. Partitioning: Before the shuffle and sort can begin, the map output must be partitioned. Partitioning distributes the data to the reducers based on the partitioning function. This function is usually based on the key of the key-value pair.

2. Shuffle: The shuffle phase is when the map output is sent to the reducers. Each map task sorts its output by key within each partition, and each reducer then fetches its own partition from every map task.

3. Sort: On the reducer node, the fetched map outputs are kept in key order. This ensures that all values with the same key are grouped together in one location.

4. Merge: The reducer merges the sorted map output into one sorted list. This list is then used by the reducer to generate the final output.
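The four steps can be sketched with the standard library: each mapper's output is partitioned and sorted, and each reducer merges the sorted runs it fetches. This is a single-process illustration; `by_first_letter` is a hypothetical partitioner that assumes exactly two reducers, and the sample data is invented.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def map_side_sort(pairs, num_reducers, partitioner):
    """Partition one mapper's output and sort each partition by key."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partitioner(key, num_reducers)].append((key, value))
    return [sorted(p, key=itemgetter(0)) for p in partitions]

def reducer_input(sorted_runs):
    """Merge the sorted runs fetched from every mapper, then group by key."""
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    return [(key, [v for _, v in group])
            for key, group in groupby(merged, key=itemgetter(0))]

def by_first_letter(key, num_reducers):
    """Hypothetical partitioner for two reducers: a-m -> 0, everything else -> 1."""
    return 0 if key[0] <= "m" else 1

mapper_outputs = [
    [("pear", 1), ("apple", 1), ("apple", 1)],
    [("zebra", 1), ("apple", 1), ("pear", 1)],
]
runs = [map_side_sort(out, 2, by_first_letter) for out in mapper_outputs]
# Shuffle: reducer r fetches partition r from every mapper.
reducer0 = reducer_input([run[0] for run in runs])
reducer1 = reducer_input([run[1] for run in runs])
print(reducer0, reducer1)
```

Merging already-sorted runs is cheaper than sorting everything again on the reducer, which is why the sort is done on the map side first.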

e) Map Reduce task execution

1. The user submits their MapReduce job to the cluster.

2. The JobTracker splits the job into tasks and assigns them to TaskTrackers.

3. The TaskTrackers execute the Map and Reduce tasks, producing intermediate key-value pairs and output files.

4. The JobTracker monitors the progress of the tasks and coordinates the execution of the tasks.

5. The Map tasks read input files, process the data, and produce intermediate key-value pairs.

6. The Reduce tasks combine the intermediate key-value pairs and produce output files.

7. The output files are written to HDFS, and the job is completed.

Map Reduce Types

MapReduce is a type of parallel computing for processing large data sets.

It consists of two processes, Map and Reduce, which are used to process and analyze large volumes of data.

Types of MapReduce include:

1. Batch Processing - This is the most common type of MapReduce, where large data sets are processed in batches. This is typically used for analyzing large amounts of data to produce summary reports.

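2. Online Processing - This covers low-latency workloads, such as web search engines or streaming media applications. Because MapReduce jobs are batch-oriented, they usually prepare the data for such services (for example, by building a search index) rather than answer requests in real time.

3. Iterative Processing - This type of MapReduce is used for iterative algorithms, such as machine learning algorithms, typically by running a chain of jobs in which each job reads the output of the previous one.

4. Streaming Processing - This type of processing handles data that is continuously generated, such as from sensors or other input sources. With MapReduce, such data is usually collected and processed in small, frequent batches rather than record by record.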

5. Graph Processing - This type of MapReduce is used for processing graph data, such as in social networks or other connected data sets.

MapReduce Formats:

Commonly used MapReduce input formats include:

1. TextInputFormat: Used to read plain text files.

2. KeyValueTextInputFormat: Used to read plain text files where each line holds a key and a value separated by a delimiter (a tab by default).

3. SequenceFileInputFormat: Used to read sequence files.

4. NLineInputFormat: Used to split text input so that each map task receives a fixed number (N) of lines.

5. AvroKeyInputFormat: Used to read Avro data files.

6. XMLInputFormat: Used to read XML files. It is not part of core Hadoop and is usually supplied by a third-party library.

7. DBInputFormat: Used to read data from a database.

8. TableInputFormat: Used to read data from an HBase table.
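The difference between the three text-based formats is easiest to see in the records they hand to the map function. The functions below are plain Python imitations written for this illustration, not the Hadoop classes themselves, and they ignore details such as split boundaries and character encodings other than UTF-8.

```python
def text_input_format(data):
    """Like TextInputFormat: key = byte offset of the line, value = the line."""
    records, offset = [], 0
    for line in data.splitlines(keepends=True):
        records.append((offset, line.rstrip("\n")))
        offset += len(line.encode("utf-8"))
    return records

def key_value_text_input_format(data, separator="\t"):
    """Like KeyValueTextInputFormat: split each line at the first separator."""
    records = []
    for line in data.splitlines():
        key, _, value = line.partition(separator)
        records.append((key, value))
    return records

def n_line_splits(data, n):
    """Like NLineInputFormat: group the input into splits of n lines each."""
    lines = data.splitlines()
    return [lines[i:i + n] for i in range(0, len(lines), n)]

sample = "id1\talpha\nid2\tbeta\nid3\tgamma\n"
print(text_input_format(sample))
print(key_value_text_input_format(sample))
print(n_line_splits(sample, 2))
```

With the same three-line input, the first function keys each record by its position in the file, the second keys it by the text before the tab, and the third controls how many lines go to each map task.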

Map Reduce Features

1. Scalable: MapReduce is designed to be easily scalable, allowing for easy parallelization of tasks across multiple nodes of a cluster.

2. Fault Tolerance: MapReduce provides fault tolerance by automatically re-executing any failed tasks. This ensures that the data is accurately processed without any data loss.

3. Flexible: MapReduce is flexible and can be used for a wide variety of data processing tasks, such as search, sorting, indexing, and data mining.

4. Cost-Effective: MapReduce is cost-effective, as it enables distributed processing of large datasets across multiple nodes of a cluster. This reduces the cost of processing data significantly.

5. Easy to Use: MapReduce is relatively easy to use, as it uses a simple programming model (Map and Reduce functions) and provides a higher-level abstraction layer that hides the complexity of distributed computing.
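The fault tolerance feature, re-executing failed tasks, can be sketched as a retry loop. This is a simplified illustration: `run_with_retries` and `flaky_word_count` are hypothetical names, the limit of four attempts is an assumption made for the example, and a real framework would also move the task to a different node.

```python
def run_with_retries(task, task_input, max_attempts=4):
    """Run a task, re-executing it when an attempt fails.

    Returns the task result and the number of attempts that were needed.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(task_input), attempt
        except Exception as error:  # a real framework would distinguish error types
            last_error = error
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error

# Hypothetical flaky task: fails on its first two calls, then succeeds.
calls = {"count": 0}

def flaky_word_count(line):
    calls["count"] += 1
    if calls["count"] <= 2:
        raise OSError("simulated node failure")
    return len(line.split())

result, attempts = run_with_retries(flaky_word_count, "map reduce job")
print(result, attempts)
```

Re-execution is safe here because the task depends only on its input, so running it again produces the same result.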

Map Reduce advantages

1. Scalability: MapReduce can easily scale up from a single server to thousands of machines, each offering local computation and storage.

2. Fault Tolerance: When a machine fails, MapReduce re-runs the affected tasks on other machines, and the input data is typically replicated by the distributed file system. Together these reduce the risk of data loss due to hardware failure.

3. Cost: MapReduce can be used to process large datasets quickly, which means it can be used to reduce the cost of running data intensive tasks.

4. High Availability: The distributed nature of MapReduce means that it can run on multiple machines at once, allowing for high availability.

5. Simplicity: MapReduce is designed to be easy to use and understand, making it a great choice for processing large datasets.

Map Reduce disadvantages

1. Complexity: MapReduce is a relatively complex system to set up and manage. It requires expert knowledge of distributed systems, data processing and related software to be able to get the most out of it.

2. Latency: MapReduce jobs have higher latency compared to traditional databases. Each job pays a startup and scheduling cost, intermediate data is written to disk, and the reduce phase cannot finish until every map task has completed.

3. Limited support for iterative algorithms: MapReduce is not well suited for iterative algorithms, as they require multiple passes over the same data set and each pass runs as a separate job that rereads its input from disk.

4. Limited support for interactive queries: MapReduce is not well suited for interactive queries as it requires the data to be read from disk for each query.

5. Performance issues: MapReduce can be slow for certain types of queries as it has to read and process large amounts of data. This can be mitigated by using techniques such as caching and data partitioning.

SQL vs Hadoop/MapReduce

[Figure: SQL vs Hadoop/MapReduce]
