◐ system-design/batch-processing.md
30. Batch Processing: MapReduce & Spark
Batch processing handles large bounded data sets — terabytes to petabytes — producing reports, models, and derived data. The patterns shaped modern data engineering and big-data system design.
~6 min read·updated 5/29/2026
30. Batch Processing: MapReduce & Spark
Batch processing handles large bounded data sets — terabytes to petabytes — producing reports, models, and derived data. The patterns shaped modern data engineering and big-data system design.
30.1 The Unix philosophy as a model
grep | sort | uniq -c | sort -nr | head — a pipeline of small composable tools acting on a stream. Batch processing at scale generalizes this idea.
Properties to preserve:
- Each stage takes input, produces output (no shared state).
- Output is purely a function of input (idempotent, retriable).
- Clear interfaces between stages.
This is the conceptual basis for MapReduce, Spark, Beam.
30.2 MapReduce (Google, 2004)
The paper that launched the big-data era.
The model
Two phases:
- Map: for each input record, emit zero or more (key, value) pairs.
- Reduce: for each unique key, process the list of values.
Example: count words in 10 TB of text.
- Map:
("hello world hello") → [("hello", 1), ("world", 1), ("hello", 1)] - Shuffle (framework): group by key.
("hello", [1,1,1]), ("world", [1]) - Reduce: sum.
("hello", 3), ("world", 1)
Why it scaled
- Map and Reduce are stateless functions → run any of them on any worker.
- Data-local computation: schedule map tasks on the node holding the input HDFS block (data locality).
- Fault tolerance: failed task → re-run on another node.
- Linear scaling: 10× nodes → 10× throughput (within limits).
The shuffle is the bottleneck
Between map and reduce, all-to-all data exchange happens (every reducer pulls its keys from every mapper). I/O and network heavy. Sort-based, often disk-spilled.
Limitations
- Only two stages: complex jobs become long chains; intermediate writes to disk between every stage.
- Disk-bound: every shuffle goes through disk.
- High latency: minutes to hours.
- Verbose: lots of boilerplate per job.
These limits made MapReduce slow for iterative algorithms (ML), interactive queries, and anything chained.
30.3 Hadoop ecosystem
Open-source MapReduce ecosystem:
- HDFS: distributed file system (clone of GFS).
- MapReduce: batch compute.
- YARN: resource manager (cluster scheduling).
- Hive: SQL on HDFS, compiled to MapReduce/Tez/Spark.
- HBase: wide-column on HDFS (clone of Bigtable).
- Pig: dataflow language on MapReduce.
By ~2015 most workloads moved to Spark. Hadoop persists at older enterprises and as a storage layer.
30.4 Apache Spark
The MapReduce successor.
Key innovations
- In-memory caching of intermediate data (RDDs / DataFrames) — 10-100× faster for iterative jobs.
- DAG execution: arbitrary graphs of operations, not just map/reduce.
- Lazy evaluation: build a plan, optimize, execute when needed.
- Higher-level APIs: DataFrame, SQL, MLlib, GraphX, Structured Streaming.
RDD (Resilient Distributed Dataset)
The original Spark abstraction. An immutable distributed collection. Operations: transformations (map, filter, join — lazy) and actions (count, collect, save — trigger execution).
Fault tolerance: an RDD remembers how it was computed (lineage). On failure, recompute the missing partition from its lineage.
DataFrame / Dataset
SQL-like, schema-aware, with Catalyst optimizer (rule-based + cost-based query optimization). Most Spark code uses DataFrames now.
Cluster execution
- Driver: orchestrates.
- Cluster manager: YARN, Kubernetes, Spark Standalone, Mesos.
- Executors: workers that run tasks and hold cached data.
Why Spark won
- Dramatic speedup vs MapReduce.
- Unified API for batch, SQL, streaming, ML.
- Active development, large community.
- Excellent Python/SQL ergonomics (PySpark).
30.5 SQL on big data
Most batch workloads at scale are SQL-shaped. Several engines compete:
- Hive: SQL on HDFS, compiles to MapReduce/Tez/Spark.
- Presto / Trino: distributed SQL, federates many sources (HDFS, S3, Kafka, MySQL).
- Spark SQL: Spark's SQL layer.
- BigQuery (Google): serverless, massively parallel, columnar storage; based on Dremel paper.
- Snowflake: managed, separation of storage + compute, scale on demand.
- Redshift (AWS): column-oriented, MPP.
- ClickHouse: open source, blazing fast columnar.
- DuckDB: embedded analytic SQL; "SQLite for analytics."
Lakehouse pattern
- Storage: cheap object store (S3) holds raw data.
- Format: open columnar (Parquet, ORC) plus table metadata (Apache Iceberg, Delta Lake, Hudi).
- Compute: Spark, Trino, Snowflake, etc., on demand.
- Decouples storage cost from compute cost.
This is the dominant modern architecture. Companies migrating off proprietary data warehouses.
30.6 Workflow / orchestration
Batch jobs are usually pipelines. You need a workflow scheduler.
- Apache Airflow: Python DAGs; rich UI; the de facto standard.
- Prefect, Dagster: modern Python-first alternatives.
- Luigi: older, by Spotify.
- Argo Workflows: K8s-native.
- AWS Step Functions, Google Cloud Composer (managed Airflow).
Concerns:
- Idempotent tasks (rerun safe).
- Backfill capability.
- SLA monitoring.
- Sensors (wait for upstream data).
- Retries with backoff.
30.7 Storage formats
For analytics, columnar formats win.
Parquet
- Columnar.
- Per-column compression (dictionary, RLE, bit-packing).
- Row group metadata for predicate pushdown.
- Open spec; widely supported.
ORC
Similar to Parquet; popular in Hive ecosystems.
Avro
Row-oriented; great for streaming and schema evolution. Used as Kafka payload format with schema registry.
When to use what
- Parquet: analytical queries on columnar data.
- Avro: streaming events with schema evolution.
- JSON: ad hoc, exploration; never for bulk analytics.
Table formats (over Parquet)
- Iceberg: schema/partition evolution, snapshots, time travel. Adopted by Netflix, Apple.
- Delta Lake: similar; from Databricks.
- Hudi: optimized for upserts, CDC.
These add transactionality, schema management, and time travel to data-lake files.
30.8 Common batch patterns
ETL (Extract, Transform, Load)
Pull from sources → transform → load to warehouse. Classic; still dominant.
ELT (Extract, Load, Transform)
Pull → load raw → transform in warehouse SQL. Cheaper compute; warehouses are powerful enough.
Slowly Changing Dimensions (SCD)
Tracking historical changes to entities (e.g., customer address over time). Type 1 (overwrite), Type 2 (history rows with valid_from/valid_to), Type 4 (separate history table).
Aggregation pipelines
Roll up raw events into hourly/daily summaries. Drop raw after retention.
Materialized views (incremental)
Compute on writes; query from materialized form. Druid, Pinot, ClickHouse-style.
30.9 Performance levers
- Partitioning: by date, region, etc. Skip irrelevant partitions.
- File size: 100 MB - 1 GB sweet spot for Parquet (avoid the "small files" problem of millions of tiny files killing the metadata layer).
- Compaction: merge small files into bigger ones.
- Compression: Snappy / Zstd / LZ4 — usually default.
- Predicate pushdown: skip files / row groups that don't match filter.
- Projection pushdown: read only needed columns.
- Broadcast joins: replicate small table to every worker; avoid shuffle.
- Skew mitigation: salt hot keys.
30.10 Common bugs
- Skewed partitions: 99% of work on one reducer → job hangs.
- Out-of-memory on big shuffles: need to spill to disk; tune.
- Time zones: UTC always.
- Date partition mistakes: re-running for "yesterday" without bounded backfill.
- Schema drift: upstream adds a field; downstream breaks. Use schema registry.
- Idempotency: rerunning produces duplicates; use dedup keys.
30.11 What an interviewer wants
- Explain MapReduce model and shuffle bottleneck.
- Articulate why Spark won.
- Describe lakehouse: object store + open formats + flexible compute.
- Know Iceberg / Delta / Hudi as the table-format trio.
- Pick batch vs streaming based on latency requirement.
30.12 Google's lineage
- MapReduce (paper 2004): two-stage compute on GFS.
- Dremel (paper 2010): interactive SQL on nested columnar data → BigQuery.
- FlumeJava (2010): higher-level API on top of MapReduce.
- MillWheel (paper 2013): low-latency stream processing.
- Dataflow / Apache Beam (2015): unified batch + stream model.
Beam's model: bounded (batch) and unbounded (stream) data are the same; runners (Spark, Flink, Dataflow) execute the unified pipeline.
Key takeaways
- MapReduce introduced data-parallel batch on commodity clusters; superseded by Spark.
- Spark unified batch, SQL, streaming, ML in one engine.
- Lakehouse = cheap object storage + open columnar (Parquet) + table format (Iceberg) + flexible compute.
- Workflow orchestration (Airflow / Dagster / Prefect) is the glue.
- Performance: partition, compact, prune, broadcast small joins, address skew.
// 0 views