system-design/batch-processing.md

30. Batch Processing: MapReduce & Spark

Batch processing handles large bounded data sets — terabytes to petabytes — producing reports, models, and derived data. The patterns shaped modern data engineering and big-data system design.

~6 min read·updated 5/29/2026

30. Batch Processing: MapReduce & Spark

Batch processing handles large bounded data sets — terabytes to petabytes — producing reports, models, and derived data. The patterns shaped modern data engineering and big-data system design.

30.1 The Unix philosophy as a model

grep | sort | uniq -c | sort -nr | head — a pipeline of small composable tools acting on a stream. Batch processing at scale generalizes this idea.

Properties to preserve:

  • Each stage takes input, produces output (no shared state).
  • Output is purely a function of input (idempotent, retriable).
  • Clear interfaces between stages.

This is the conceptual basis for MapReduce, Spark, Beam.

30.2 MapReduce (Google, 2004)

The paper that launched the big-data era.

The model

Two phases:

  • Map: for each input record, emit zero or more (key, value) pairs.
  • Reduce: for each unique key, process the list of values.

Example: count words in 10 TB of text.

  • Map: ("hello world hello") → [("hello", 1), ("world", 1), ("hello", 1)]
  • Shuffle (framework): group by key. ("hello", [1,1,1]), ("world", [1])
  • Reduce: sum. ("hello", 3), ("world", 1)

Why it scaled

  • Map and Reduce are stateless functions → run any of them on any worker.
  • Data-local computation: schedule map tasks on the node holding the input HDFS block (data locality).
  • Fault tolerance: failed task → re-run on another node.
  • Linear scaling: 10× nodes → 10× throughput (within limits).

The shuffle is the bottleneck

Between map and reduce, all-to-all data exchange happens (every reducer pulls its keys from every mapper). I/O and network heavy. Sort-based, often disk-spilled.

Limitations

  • Only two stages: complex jobs become long chains; intermediate writes to disk between every stage.
  • Disk-bound: every shuffle goes through disk.
  • High latency: minutes to hours.
  • Verbose: lots of boilerplate per job.

These limits made MapReduce slow for iterative algorithms (ML), interactive queries, and anything chained.

30.3 Hadoop ecosystem

Open-source MapReduce ecosystem:

  • HDFS: distributed file system (clone of GFS).
  • MapReduce: batch compute.
  • YARN: resource manager (cluster scheduling).
  • Hive: SQL on HDFS, compiled to MapReduce/Tez/Spark.
  • HBase: wide-column on HDFS (clone of Bigtable).
  • Pig: dataflow language on MapReduce.

By ~2015 most workloads moved to Spark. Hadoop persists at older enterprises and as a storage layer.

30.4 Apache Spark

The MapReduce successor.

Key innovations

  • In-memory caching of intermediate data (RDDs / DataFrames) — 10-100× faster for iterative jobs.
  • DAG execution: arbitrary graphs of operations, not just map/reduce.
  • Lazy evaluation: build a plan, optimize, execute when needed.
  • Higher-level APIs: DataFrame, SQL, MLlib, GraphX, Structured Streaming.

RDD (Resilient Distributed Dataset)

The original Spark abstraction. An immutable distributed collection. Operations: transformations (map, filter, join — lazy) and actions (count, collect, save — trigger execution).

Fault tolerance: an RDD remembers how it was computed (lineage). On failure, recompute the missing partition from its lineage.

DataFrame / Dataset

SQL-like, schema-aware, with Catalyst optimizer (rule-based + cost-based query optimization). Most Spark code uses DataFrames now.

Cluster execution

  • Driver: orchestrates.
  • Cluster manager: YARN, Kubernetes, Spark Standalone, Mesos.
  • Executors: workers that run tasks and hold cached data.

Why Spark won

  • Dramatic speedup vs MapReduce.
  • Unified API for batch, SQL, streaming, ML.
  • Active development, large community.
  • Excellent Python/SQL ergonomics (PySpark).

30.5 SQL on big data

Most batch workloads at scale are SQL-shaped. Several engines compete:

  • Hive: SQL on HDFS, compiles to MapReduce/Tez/Spark.
  • Presto / Trino: distributed SQL, federates many sources (HDFS, S3, Kafka, MySQL).
  • Spark SQL: Spark's SQL layer.
  • BigQuery (Google): serverless, massively parallel, columnar storage; based on Dremel paper.
  • Snowflake: managed, separation of storage + compute, scale on demand.
  • Redshift (AWS): column-oriented, MPP.
  • ClickHouse: open source, blazing fast columnar.
  • DuckDB: embedded analytic SQL; "SQLite for analytics."

Lakehouse pattern

  • Storage: cheap object store (S3) holds raw data.
  • Format: open columnar (Parquet, ORC) plus table metadata (Apache Iceberg, Delta Lake, Hudi).
  • Compute: Spark, Trino, Snowflake, etc., on demand.
  • Decouples storage cost from compute cost.

This is the dominant modern architecture. Companies migrating off proprietary data warehouses.

30.6 Workflow / orchestration

Batch jobs are usually pipelines. You need a workflow scheduler.

  • Apache Airflow: Python DAGs; rich UI; the de facto standard.
  • Prefect, Dagster: modern Python-first alternatives.
  • Luigi: older, by Spotify.
  • Argo Workflows: K8s-native.
  • AWS Step Functions, Google Cloud Composer (managed Airflow).

Concerns:

  • Idempotent tasks (rerun safe).
  • Backfill capability.
  • SLA monitoring.
  • Sensors (wait for upstream data).
  • Retries with backoff.

30.7 Storage formats

For analytics, columnar formats win.

Parquet

  • Columnar.
  • Per-column compression (dictionary, RLE, bit-packing).
  • Row group metadata for predicate pushdown.
  • Open spec; widely supported.

ORC

Similar to Parquet; popular in Hive ecosystems.

Avro

Row-oriented; great for streaming and schema evolution. Used as Kafka payload format with schema registry.

When to use what

  • Parquet: analytical queries on columnar data.
  • Avro: streaming events with schema evolution.
  • JSON: ad hoc, exploration; never for bulk analytics.

Table formats (over Parquet)

  • Iceberg: schema/partition evolution, snapshots, time travel. Adopted by Netflix, Apple.
  • Delta Lake: similar; from Databricks.
  • Hudi: optimized for upserts, CDC.

These add transactionality, schema management, and time travel to data-lake files.

30.8 Common batch patterns

ETL (Extract, Transform, Load)

Pull from sources → transform → load to warehouse. Classic; still dominant.

ELT (Extract, Load, Transform)

Pull → load raw → transform in warehouse SQL. Cheaper compute; warehouses are powerful enough.

Slowly Changing Dimensions (SCD)

Tracking historical changes to entities (e.g., customer address over time). Type 1 (overwrite), Type 2 (history rows with valid_from/valid_to), Type 4 (separate history table).

Aggregation pipelines

Roll up raw events into hourly/daily summaries. Drop raw after retention.

Materialized views (incremental)

Compute on writes; query from materialized form. Druid, Pinot, ClickHouse-style.

30.9 Performance levers

  • Partitioning: by date, region, etc. Skip irrelevant partitions.
  • File size: 100 MB - 1 GB sweet spot for Parquet (avoid the "small files" problem of millions of tiny files killing the metadata layer).
  • Compaction: merge small files into bigger ones.
  • Compression: Snappy / Zstd / LZ4 — usually default.
  • Predicate pushdown: skip files / row groups that don't match filter.
  • Projection pushdown: read only needed columns.
  • Broadcast joins: replicate small table to every worker; avoid shuffle.
  • Skew mitigation: salt hot keys.

30.10 Common bugs

  • Skewed partitions: 99% of work on one reducer → job hangs.
  • Out-of-memory on big shuffles: need to spill to disk; tune.
  • Time zones: UTC always.
  • Date partition mistakes: re-running for "yesterday" without bounded backfill.
  • Schema drift: upstream adds a field; downstream breaks. Use schema registry.
  • Idempotency: rerunning produces duplicates; use dedup keys.

30.11 What an interviewer wants

  • Explain MapReduce model and shuffle bottleneck.
  • Articulate why Spark won.
  • Describe lakehouse: object store + open formats + flexible compute.
  • Know Iceberg / Delta / Hudi as the table-format trio.
  • Pick batch vs streaming based on latency requirement.

30.12 Google's lineage

  • MapReduce (paper 2004): two-stage compute on GFS.
  • Dremel (paper 2010): interactive SQL on nested columnar data → BigQuery.
  • FlumeJava (2010): higher-level API on top of MapReduce.
  • MillWheel (paper 2013): low-latency stream processing.
  • Dataflow / Apache Beam (2015): unified batch + stream model.

Beam's model: bounded (batch) and unbounded (stream) data are the same; runners (Spark, Flink, Dataflow) execute the unified pipeline.

Key takeaways

  • MapReduce introduced data-parallel batch on commodity clusters; superseded by Spark.
  • Spark unified batch, SQL, streaming, ML in one engine.
  • Lakehouse = cheap object storage + open columnar (Parquet) + table format (Iceberg) + flexible compute.
  • Workflow orchestration (Airflow / Dagster / Prefect) is the glue.
  • Performance: partition, compact, prune, broadcast small joins, address skew.

// 0 views

main
UTF-8·typescript