30. Batch Processing: MapReduce & Spark

Batch processing handles large bounded data sets — terabytes to petabytes — producing reports, models, and derived data. The patterns shaped modern data engineering and big-data system design.

30.1 The Unix philosophy as a model

grep | sort | uniq -c | sort -nr | head — a pipeline of small composable tools acting on a stream. Batch processing at scale generalizes this idea.

Properties to preserve:

Each stage takes input, produces output (no shared state).
Output is purely a function of input (idempotent, retriable).
Clear interfaces between stages.

This is the conceptual basis for MapReduce, Spark, Beam.

30.2 MapReduce (Google, 2004)

The paper that launched the big-data era.

The model

Two phases:

Map: for each input record, emit zero or more (key, value) pairs.
Reduce: for each unique key, process the list of values.

Example: count words in 10 TB of text.

Map: ("hello world hello") → [("hello", 1), ("world", 1), ("hello", 1)]
Shuffle (framework): group by key. ("hello", [1,1,1]), ("world", [1])
Reduce: sum. ("hello", 3), ("world", 1)

Why it scaled

Map and Reduce are stateless functions → run any of them on any worker.
Data-local computation: schedule map tasks on the node holding the input HDFS block (data locality).
Fault tolerance: failed task → re-run on another node.
Linear scaling: 10× nodes → 10× throughput (within limits).

The shuffle is the bottleneck

Between map and reduce, all-to-all data exchange happens (every reducer pulls its keys from every mapper). I/O and network heavy. Sort-based, often disk-spilled.

Limitations

Only two stages: complex jobs become long chains; intermediate writes to disk between every stage.
Disk-bound: every shuffle goes through disk.
High latency: minutes to hours.
Verbose: lots of boilerplate per job.

These limits made MapReduce slow for iterative algorithms (ML), interactive queries, and anything chained.

30.3 Hadoop ecosystem

Open-source MapReduce ecosystem:

HDFS: distributed file system (clone of GFS).
MapReduce: batch compute.
YARN: resource manager (cluster scheduling).
Hive: SQL on HDFS, compiled to MapReduce/Tez/Spark.
HBase: wide-column on HDFS (clone of Bigtable).
Pig: dataflow language on MapReduce.

By ~2015 most workloads moved to Spark. Hadoop persists at older enterprises and as a storage layer.

30.4 Apache Spark

The MapReduce successor.

Key innovations

In-memory caching of intermediate data (RDDs / DataFrames) — 10-100× faster for iterative jobs.
DAG execution: arbitrary graphs of operations, not just map/reduce.
Lazy evaluation: build a plan, optimize, execute when needed.
Higher-level APIs: DataFrame, SQL, MLlib, GraphX, Structured Streaming.

RDD (Resilient Distributed Dataset)

The original Spark abstraction. An immutable distributed collection. Operations: transformations (map, filter, join — lazy) and actions (count, collect, save — trigger execution).

Fault tolerance: an RDD remembers how it was computed (lineage). On failure, recompute the missing partition from its lineage.

DataFrame / Dataset

SQL-like, schema-aware, with Catalyst optimizer (rule-based + cost-based query optimization). Most Spark code uses DataFrames now.

Cluster execution

Driver: orchestrates.
Cluster manager: YARN, Kubernetes, Spark Standalone, Mesos.
Executors: workers that run tasks and hold cached data.

Why Spark won

Dramatic speedup vs MapReduce.
Unified API for batch, SQL, streaming, ML.
Active development, large community.
Excellent Python/SQL ergonomics (PySpark).

30.5 SQL on big data

Most batch workloads at scale are SQL-shaped. Several engines compete:

Hive: SQL on HDFS, compiles to MapReduce/Tez/Spark.
Presto / Trino: distributed SQL, federates many sources (HDFS, S3, Kafka, MySQL).
Spark SQL: Spark's SQL layer.
BigQuery (Google): serverless, massively parallel, columnar storage; based on Dremel paper.
Snowflake: managed, separation of storage + compute, scale on demand.
Redshift (AWS): column-oriented, MPP.
ClickHouse: open source, blazing fast columnar.
DuckDB: embedded analytic SQL; "SQLite for analytics."

Lakehouse pattern

Storage: cheap object store (S3) holds raw data.
Format: open columnar (Parquet, ORC) plus table metadata (Apache Iceberg, Delta Lake, Hudi).
Compute: Spark, Trino, Snowflake, etc., on demand.
Decouples storage cost from compute cost.

This is the dominant modern architecture. Companies migrating off proprietary data warehouses.

30.6 Workflow / orchestration

Batch jobs are usually pipelines. You need a workflow scheduler.

Apache Airflow: Python DAGs; rich UI; the de facto standard.
Prefect, Dagster: modern Python-first alternatives.
Luigi: older, by Spotify.
Argo Workflows: K8s-native.
AWS Step Functions, Google Cloud Composer (managed Airflow).

Concerns:

Idempotent tasks (rerun safe).
Backfill capability.
SLA monitoring.
Sensors (wait for upstream data).
Retries with backoff.

30.7 Storage formats

For analytics, columnar formats win.

Parquet

Columnar.
Per-column compression (dictionary, RLE, bit-packing).
Row group metadata for predicate pushdown.
Open spec; widely supported.

ORC

Similar to Parquet; popular in Hive ecosystems.

Avro

Row-oriented; great for streaming and schema evolution. Used as Kafka payload format with schema registry.

When to use what

Parquet: analytical queries on columnar data.
Avro: streaming events with schema evolution.
JSON: ad hoc, exploration; never for bulk analytics.

Table formats (over Parquet)

Iceberg: schema/partition evolution, snapshots, time travel. Adopted by Netflix, Apple.
Delta Lake: similar; from Databricks.
Hudi: optimized for upserts, CDC.

These add transactionality, schema management, and time travel to data-lake files.

30.8 Common batch patterns

ETL (Extract, Transform, Load)

Pull from sources → transform → load to warehouse. Classic; still dominant.

ELT (Extract, Load, Transform)

Pull → load raw → transform in warehouse SQL. Cheaper compute; warehouses are powerful enough.

Slowly Changing Dimensions (SCD)

Tracking historical changes to entities (e.g., customer address over time). Type 1 (overwrite), Type 2 (history rows with valid_from/valid_to), Type 4 (separate history table).

Aggregation pipelines

Roll up raw events into hourly/daily summaries. Drop raw after retention.

Materialized views (incremental)

Compute on writes; query from materialized form. Druid, Pinot, ClickHouse-style.

30.9 Performance levers

Partitioning: by date, region, etc. Skip irrelevant partitions.
File size: 100 MB - 1 GB sweet spot for Parquet (avoid the "small files" problem of millions of tiny files killing the metadata layer).
Compaction: merge small files into bigger ones.
Compression: Snappy / Zstd / LZ4 — usually default.
Predicate pushdown: skip files / row groups that don't match filter.
Projection pushdown: read only needed columns.
Broadcast joins: replicate small table to every worker; avoid shuffle.
Skew mitigation: salt hot keys.

30.10 Common bugs

Skewed partitions: 99% of work on one reducer → job hangs.
Out-of-memory on big shuffles: need to spill to disk; tune.
Time zones: UTC always.
Date partition mistakes: re-running for "yesterday" without bounded backfill.
Schema drift: upstream adds a field; downstream breaks. Use schema registry.
Idempotency: rerunning produces duplicates; use dedup keys.

30.11 What an interviewer wants

Explain MapReduce model and shuffle bottleneck.
Articulate why Spark won.
Describe lakehouse: object store + open formats + flexible compute.
Know Iceberg / Delta / Hudi as the table-format trio.
Pick batch vs streaming based on latency requirement.

30.12 Google's lineage

MapReduce (paper 2004): two-stage compute on GFS.
Dremel (paper 2010): interactive SQL on nested columnar data → BigQuery.
FlumeJava (2010): higher-level API on top of MapReduce.
MillWheel (paper 2013): low-latency stream processing.
Dataflow / Apache Beam (2015): unified batch + stream model.

Beam's model: bounded (batch) and unbounded (stream) data are the same; runners (Spark, Flink, Dataflow) execute the unified pipeline.

Key takeaways

MapReduce introduced data-parallel batch on commodity clusters; superseded by Spark.
Spark unified batch, SQL, streaming, ML in one engine.
Lakehouse = cheap object storage + open columnar (Parquet) + table format (Iceberg) + flexible compute.
Workflow orchestration (Airflow / Dagster / Prefect) is the glue.
Performance: partition, compact, prune, broadcast small joins, address skew.