Chapter 2: MapReduce and the New Software Stack
Key Concepts and Summaries:

Cluster Computing:
- Modern large-scale applications use clusters consisting of multiple compute nodes.
- Each node contains a processor, memory, and local storage.
- Nodes are typically mounted on racks connected via Ethernet or a high-speed switch.

Distributed File Systems:
- Files are split into chunks (typically 64 MB) and stored redundantly across nodes.
- This architecture provides both fault tolerance and parallelism for data access.

MapReduce Programming Model:
- A framework for processing large-scale data across clusters.
- Relies on Map and Reduce functions defined by the user (see the word-count sketch after this list):
  - Map: Processes input and outputs key-value pairs.
  - Reduce: Aggregates values associated with the same key.

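To make the model concrete, here is a minimal word-count sketch in plain Python. The names (map_fn, reduce_fn, run_mapreduce) are illustrative, not part of any real framework's API; the group-by-key step in the driver stands in for the shuffle that a MapReduce implementation performs between the two phases.

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce: sum all counts associated with the same word."""
    return (word, sum(counts))

def run_mapreduce(map_fn, reduce_fn, inputs):
    """Driver: run Map, group by key (the 'shuffle'), then run Reduce."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(run_mapreduce(map_fn, reduce_fn, docs))
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```
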
The Role of the Master Process:
- Manages and distributes Map and Reduce tasks.
- Monitors for node failures and reassigns tasks as needed.

Fault Tolerance:
- Tasks that fail are automatically restarted.
- A job needs to be restarted only if the Master node fails.

Applications of MapReduce:
- Well-suited for relational algebra, matrix operations, and index construction (see the matrix-vector sketch after this list).
- Also useful for large-scale search engines and analytics tasks.

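A canonical matrix example is matrix-vector multiplication: Map emits a partial product m_ij * v_j keyed by row i, and Reduce sums the partial products for each row. The single-process sketch below assumes the vector v fits in memory; the data and names are made up for illustration.

```python
from collections import defaultdict

# Sparse matrix M as (row, col, value) triples, and a vector v.
M = [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)]
v = [4.0, 5.0]

# Map: each entry (i, j, m_ij) contributes m_ij * v_j, keyed by row i.
partial = defaultdict(list)
for i, j, m_ij in M:
    partial[i].append(m_ij * v[j])

# Reduce: sum the contributions for each row to get (M v)_i.
result = {i: sum(vals) for i, vals in partial.items()}
print(result)  # {0: 13.0, 1: 15.0}
```
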
Hadoop Ecosystem:
- Open-source implementation of MapReduce and HDFS.
- Widely adopted in enterprise environments and research.

Workflow Systems:
- Generalize MapReduce to support a DAG (Directed Acyclic Graph) of computations.
- Enable more complex data processing pipelines.

Spark:
- Introduces Resilient Distributed Datasets (RDDs).
- Supports lazy evaluation and lineage tracking.
- Efficient for iterative algorithms (e.g., in machine learning).

TensorFlow:
- Designed for machine learning applications.
- Uses tensors (multi-dimensional arrays) and supports complex operations such as backpropagation and gradient descent.

1. Cluster Computing
- A cluster is a collection of interconnected computers (nodes) that work together as a single system.
- Each node has its own processor, memory, and disk.
- Clusters allow parallel processing of large-scale data.

2. Distributed File Systems
- Data is split into large chunks (e.g., 64 MB) and stored across multiple nodes.
- Each chunk is replicated on different machines to provide fault tolerance.
- Examples include the Google File System (GFS) and the Hadoop Distributed File System (HDFS); a toy chunking-and-placement sketch follows.

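As a rough illustration of chunking and replication, the sketch below splits a file's bytes into fixed-size chunks and assigns each chunk to several distinct nodes round-robin. This is not GFS's or HDFS's actual placement policy (real systems are rack-aware); the chunk size and replication factor are just typical defaults.

```python
def split_into_chunks(data: bytes, chunk_size: int):
    """Split a file's contents into fixed-size chunks."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def place_replicas(num_chunks: int, nodes: list, replicas: int = 3):
    """Assign each chunk to `replicas` distinct nodes, round-robin.
    Toy policy only; real systems place replicas with rack awareness."""
    assert replicas <= len(nodes)
    return {c: [nodes[(c + r) % len(nodes)] for r in range(replicas)]
            for c in range(num_chunks)}

nodes = ["node1", "node2", "node3", "node4"]
chunks = split_into_chunks(b"x" * 200, chunk_size=64)  # 64 bytes here; 64 MB in practice
print(len(chunks), place_replicas(len(chunks), nodes))
# 4 {0: ['node1', 'node2', 'node3'], 1: ['node2', 'node3', 'node4'], ...}
```
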
3. MapReduce Framework
- A programming model for processing large datasets with two main steps:
  - Map: Processes input data and emits intermediate key-value pairs.
  - Reduce: Aggregates all values for each key and emits a final output.
- Enables automatic parallelism and fault recovery (the word-count sketch above illustrates the two steps).

4. Master Process in MapReduce
- A central coordinator that:
  - Assigns Map and Reduce tasks to worker nodes.
  - Monitors task progress.
  - Detects failures and restarts tasks on other machines if needed.

5. Fault Tolerance
- MapReduce is resilient to node failure:
  - Failed tasks are re-executed on another machine.
  - Input and intermediate data are persisted to enable reprocessing.
- Only a failure of the master node is catastrophic (a toy master/worker retry loop is sketched below).

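The retry behavior from sections 4 and 5 can be pictured as a loop: the master hands a task to a worker and, on failure, reassigns it to another worker. The sketch below is a deliberately simplified, single-threaded simulation with made-up names; a real master tracks many tasks concurrently and detects failures via heartbeats.

```python
import random

def flaky_worker(worker_id, task):
    """Simulate a worker that crashes with some probability."""
    if random.random() < 0.3:
        raise RuntimeError(f"{worker_id} failed on {task}")
    return f"{task} done by {worker_id}"

def master(tasks, workers):
    """Assign each task; on failure, re-execute it on another worker."""
    results = []
    for task in tasks:
        for worker in workers:  # try workers until one succeeds
            try:
                results.append(flaky_worker(worker, task))
                break
            except RuntimeError as err:
                print(f"master: {err}; reassigning")
        else:
            raise RuntimeError(f"all workers failed on {task}")
    return results

print(master(["map-0", "map-1", "reduce-0"], ["w1", "w2", "w3"]))
```
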
6. Applications of MapReduce
- Can be used to implement:
  - Relational algebra operations, e.g., join and group by (a natural-join sketch follows this list).
  - Matrix computations for scientific data processing.
  - Index building for search engines.

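For the join, the standard MapReduce scheme keys each tuple by the join attribute and tags it with its relation; the Reduce step then pairs up tuples from the two relations that share a key. Below is a compact single-process sketch of the natural join of R(a, b) and S(b, c) on attribute b, with illustrative data:

```python
from collections import defaultdict

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2")]   # R(a, b)
S = [("b1", "c1"), ("b2", "c2"), ("b2", "c3")]   # S(b, c)

# Map: tag each tuple with its relation and key it by the join attribute b.
intermediate = defaultdict(list)
for a, b in R:
    intermediate[b].append(("R", a))
for b, c in S:
    intermediate[b].append(("S", c))

# Reduce: for each join key b, pair every R-tuple with every S-tuple.
joined = []
for b, tagged in intermediate.items():
    r_vals = [v for tag, v in tagged if tag == "R"]
    s_vals = [v for tag, v in tagged if tag == "S"]
    joined.extend((a, b, c) for a in r_vals for c in s_vals)

print(sorted(joined))
# [('a1', 'b1', 'c1'), ('a2', 'b1', 'c1'), ('a3', 'b2', 'c2'), ('a3', 'b2', 'c3')]
```
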
7. Hadoop Ecosystem
- Hadoop is an open-source platform that implements:
  - The MapReduce programming model.
  - The HDFS file system.
- Widely used in industry for big data processing.

8. Workflow Systems
- Extensions of MapReduce to support multi-step computations.
- Data flows through a Directed Acyclic Graph (DAG), where each node is a task.
- Enables more complex data processing pipelines beyond a single MapReduce job (a minimal DAG-execution sketch follows).

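To show the DAG idea in miniature, the sketch below uses Python's standard-library graphlib to run some hypothetical tasks in dependency order. Real workflow systems add scheduling, data movement between tasks, and fault tolerance on top of this ordering.

```python
from graphlib import TopologicalSorter

# Each node is a task; the set holds the tasks it depends on.
dag = {
    "extract": set(),
    "clean":   {"extract"},
    "join":    {"clean"},
    "report":  {"join"},
    "index":   {"clean"},  # two tasks can share an upstream result
}

# Stand-in task bodies; a real system would run jobs here.
tasks = {name: (lambda n=name: print(f"running {n}")) for name in dag}

# static_order() yields tasks only after all their dependencies.
for name in TopologicalSorter(dag).static_order():
    tasks[name]()
```
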
9. Apache Spark
- A fast, in-memory alternative to Hadoop MapReduce.
- Introduces Resilient Distributed Datasets (RDDs):
  - Immutable, partitioned collections of objects.
  - Track lineage for fault recovery.
- Efficient for iterative algorithms (like those used in machine learning); a small PySpark example follows.

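Here is a small PySpark word count, runnable locally (assuming pyspark is installed, e.g. via pip install pyspark). The transformations (flatMap, map, reduceByKey) are lazy: they only record lineage, and nothing executes until the collect() action triggers the computation.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")

lines = sc.parallelize(["the quick brown fox", "the lazy dog", "the fox"])

# Transformations are lazy: they build up the RDD lineage graph.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

# collect() is an action: it forces the actual distributed computation.
print(sorted(counts.collect()))
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]

sc.stop()
```
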
10. TensorFlow
- A framework for numerical computation, especially for deep learning.
- Uses tensors (multi-dimensional arrays) and defines a computation graph.
- Supports operations such as gradient descent and backpropagation (see the short example below).
- Facilitates distributed training of neural networks.

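A short TensorFlow 2.x example of gradient descent with backpropagation (assuming tensorflow is installed): tf.GradientTape records the forward computation, and tape.gradient backpropagates through it to produce the gradients used in each update. The tiny dataset and learning rate are made up for illustration.

```python
import tensorflow as tf

# Fit y = w*x + b to a tiny dataset by repeated gradient-descent steps.
x = tf.constant([1.0, 2.0, 3.0, 4.0])
y = tf.constant([3.0, 5.0, 7.0, 9.0])        # true relation: y = 2x + 1

w = tf.Variable(0.0)
b = tf.Variable(0.0)
lr = 0.05                                    # learning rate (illustrative)

for step in range(200):
    with tf.GradientTape() as tape:          # records ops for backprop
        loss = tf.reduce_mean((w * x + b - y) ** 2)
    dw, db = tape.gradient(loss, [w, b])     # backpropagation
    w.assign_sub(lr * dw)                    # gradient-descent update
    b.assign_sub(lr * db)

print(w.numpy(), b.numpy())                  # approaches w ≈ 2, b ≈ 1
```
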