31. Stream Processing: Kafka Streams, Flink, Exactly-Once

Stream processing operates on unbounded data — events arriving continuously. Done right, you replace nightly batch jobs with seconds-fresh insights and react to events as they happen.

~6 min read·updated 5/29/2026

31.1 Why streaming

  • Lower latency: react in seconds, not hours.
  • Continuous insight: dashboards live; alerts immediate.
  • Decouples producers and consumers: services emit events; many consumers process.
  • Elastic processing: workers scale with load.

Use cases: fraud detection, real-time analytics, dashboards, ML feature stores, alerts, ETL pipelines.

31.2 Streams vs batch

  • Batch: bounded data; result is "complete." Triggered by time.
  • Stream: unbounded; result is "the answer so far." Triggered by data.

Batch is a special case of stream (a finite stream). Apache Beam models both with one API.

31.3 The streaming stack

Source: an event log

  • Kafka: dominant.
  • Pulsar: Kafka competitor; tiered storage; better multi-tenancy.
  • Kinesis: AWS managed.
  • Pub/Sub: GCP managed.

Processor: a stream framework

  • Kafka Streams: library; runs in your JVM app; stateful via embedded RocksDB.
  • Apache Flink: dedicated cluster; rich windowing, exactly-once, true (record-at-a-time) streaming. Widely used for large-scale stateful streaming.
  • Spark Structured Streaming: micro-batch model; mature; integrates with Spark batch.
  • Apache Storm, Samza: older; mostly displaced.
  • ksqlDB: SQL on Kafka Streams.
  • Materialize, Decodable, Estuary: SQL streaming as a service.

Sink

Where results land: another Kafka topic, a DB, a search index, a dashboard, a data warehouse.

31.4 Time in streams: event time vs processing time

Critical distinction.

  • Event time: when the event actually happened (in the data's payload).
  • Processing time: when the stream processor received it.

In an ideal world the two are equal. In reality, events arrive late, out of order, and in bursts. Real-time analytics must account for both.

Watermarks

A watermark is a heuristic: "all events with event time ≤ T have arrived." Allows the system to close windows and emit results.

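Watermarks are generated from the data (e.g., max event time seen minus a bound on expected out-of-orderness). Late events (those arriving after the watermark has passed their timestamp) are dropped, sent to a side output, or used to update windows that were already emitted.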

Late data strategies

  • Drop.
  • Side output for separate handling.
  • Update windows (allowing fixes; expensive).
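The watermark rule and the "side output" strategy above can be sketched in a few lines. This is a minimal illustration, not any framework's API: `split_late_events` and its `max_out_of_orderness` parameter are hypothetical names, timestamps are plain integers (seconds), and lateness is judged per event rather than per window.

```python
def split_late_events(events, max_out_of_orderness):
    """Route (event_time, payload) pairs to an on-time list or a late side output.

    The watermark trails the largest event time seen so far by
    max_out_of_orderness. An event whose timestamp is at or below the
    current watermark is treated as late.
    """
    watermark = float("-inf")
    max_event_time = float("-inf")
    on_time, late = [], []
    for event_time, payload in events:  # events are in arrival order
        if event_time <= watermark:
            late.append((event_time, payload))      # side output
        else:
            on_time.append((event_time, payload))
        # Advance the watermark only after classifying the event.
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - max_out_of_orderness
    return on_time, late, watermark
```

With a bound of 5 seconds, an event stamped 14 that arrives after an event stamped 20 lands in the late list, because the watermark has already advanced to 15. A larger bound tolerates more disorder at the cost of delaying results.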

31.5 Windowing

Aggregations need bounds. Common windows:

Tumbling

Fixed-size, non-overlapping. "Hourly" bucket.

Hopping (sliding)

Fixed-size, overlapping. "5-minute window advancing every 1 minute" → one result per minute, and each event falls into 5 overlapping windows.
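Window assignment for tumbling and hopping windows is simple arithmetic on the event timestamp. The sketch below assumes integer timestamps in seconds and half-open windows `[start, end)` aligned to zero; the function names are illustrative, not taken from any library.

```python
def tumbling_window(ts, size):
    """Return the single [start, end) window of length `size` containing ts."""
    start = ts - (ts % size)
    return (start, start + size)


def hopping_windows(ts, size, hop):
    """Return every [start, end) window of length `size`, starting each `hop`,
    that contains ts. With size=300 and hop=60 that is 5 windows."""
    windows = []
    start = ts - (ts % hop)        # latest window start at or before ts
    while start > ts - size:       # window still covers ts
        windows.append((start, start + size))
        start -= hop
    return sorted(windows)
```

For example, `hopping_windows(430, 300, 60)` yields five windows, from `(180, 480)` through `(420, 720)`, which is why hopping windows multiply state and output by roughly `size / hop`.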

Session

Variable-size, separated by gaps of inactivity. "User session" = activity grouped with < 30 min gap.
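A session window can be sketched as grouping one key's timestamps whenever the gap to the previous event is below the inactivity threshold. This simplified version sorts the timestamps first, so it sidesteps the session merging a real engine performs when an out-of-order event bridges two open sessions; `sessionize` is a hypothetical helper.

```python
def sessionize(timestamps, gap):
    """Group one key's event timestamps into (first, last) sessions.

    Two consecutive events belong to the same session when they are
    less than `gap` apart.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] < gap:
            sessions[-1][1] = ts          # extend the open session
        else:
            sessions.append([ts, ts])     # start a new session
    return [tuple(s) for s in sessions]
```

With a 30-minute gap (1800 seconds), events at 0, 600, and 1500 form one session and events at 4000 and 4100 form a second.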

Global / fixed-key

No window; aggregate forever per key.

Custom

Application-defined boundaries.

31.6 State in streams

Most useful streams are stateful: aggregates, joins, sessions.

Where state lives

  • In memory (small): fast; lost on crash.
  • Embedded persistent KV (RocksDB): used by Flink, Kafka Streams. Backed up via checkpoints.
  • External KV (Redis, DynamoDB): visible across processes; latency cost.

Checkpointing

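Flink takes consistent snapshots of state across all operators using asynchronous barrier snapshotting, a variant of the Chandy-Lamport algorithm. On failure, it restores the last checkpoint and replays events from the Kafka offsets stored in that checkpoint, which gives exactly-once state semantics.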
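The core idea, that state and source offset are snapshotted together so a replay never double-counts, can be shown with a toy single-operator simulation. It deliberately omits barriers and multiple operators, so it is not Flink's algorithm; `run_with_checkpoints`, `checkpoint_every`, and `crash_at` are made-up names for illustration.

```python
def run_with_checkpoints(log, checkpoint_every, crash_at=None):
    """Count events per key from a replayable log (a list), checkpointing
    the (offset, counts) pair together every `checkpoint_every` events.

    If crash_at is set, in-memory state is discarded once at that offset
    and processing resumes from the last checkpoint.
    """
    checkpoint = {"offset": 0, "counts": {}}
    offset, counts = 0, {}
    crashed = False
    while offset < len(log):
        if crash_at is not None and offset == crash_at and not crashed:
            crashed = True
            # Simulated failure: restore both offset and state from the snapshot.
            offset = checkpoint["offset"]
            counts = dict(checkpoint["counts"])
            continue
        key = log[offset]
        counts[key] = counts.get(key, 0) + 1
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = {"offset": offset, "counts": dict(counts)}
    return counts
```

Because the offset and the counts are restored as one unit, the run with a crash produces the same counts as the run without one. Restoring only the offset, or only the state, would either double-count or lose events.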

31.7 Joins in streams

Stream-stream join

Two streams, joined within a window. Events from the two sides can arrive in either order, so each side is buffered in state until the window expires. Used for: clickstream + purchase, ad-impression + click.
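A windowed stream-stream join can be sketched as two per-key buffers: each arriving event probes the opposite buffer, then is stored for later arrivals. The version below is an assumption-laden illustration: `interval_join` is a hypothetical function, and eviction assumes timestamps arrive roughly in order (a real engine would evict based on the watermark).

```python
from collections import defaultdict


def interval_join(events, window):
    """Join events given as (side, ts, key, value) in arrival order,
    where side is "L" or "R". Emits (key, left_value, right_value) for
    pairs with the same key whose timestamps differ by at most `window`.
    """
    buffers = {"L": defaultdict(list), "R": defaultdict(list)}
    max_ts = float("-inf")
    out = []
    for side, ts, key, value in events:
        other = "R" if side == "L" else "L"
        # Probe the opposite side's buffer for matches.
        for other_ts, other_value in buffers[other][key]:
            if abs(ts - other_ts) <= window:
                left, right = (value, other_value) if side == "L" else (other_value, value)
                out.append((key, left, right))
        buffers[side][key].append((ts, value))
        # Evict entries that can no longer match (assumes near-ordered input).
        max_ts = max(max_ts, ts)
        horizon = max_ts - window
        for buf in buffers.values():
            for k in list(buf):
                buf[k] = [(t, v) for t, v in buf[k] if t >= horizon]
                if not buf[k]:
                    del buf[k]
    return out
```

The buffers hold at most one window's worth of events per side, which is where the state cost of a join comes from.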

Stream-table join

Stream joined against a slowly-changing table (often broadcast or lookup). Used for enrichment: events × user dimension.
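Enrichment against a slowly-changing table amounts to a lookup against the latest row seen so far. In this hypothetical sketch, table updates and stream events share one arrival-ordered input, and the field name `user` is an assumption for the example.

```python
def enrich(events):
    """Process ("table", key, row) updates and ("stream", key, payload) events
    in arrival order. Each stream event is joined with the table row that is
    current at the moment the event arrives (None if the key is unknown)."""
    table = {}
    out = []
    for kind, key, value in events:
        if kind == "table":
            table[key] = value                      # upsert the dimension row
        else:
            out.append({**value, "user": table.get(key)})
    return out
```

Note the temporal semantics: an event is enriched with whatever the table held when it arrived, so an earlier event keeps the old row even after the table changes.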

Table-table join (continuous)

Both sides continuous; result kept up to date.

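Joins require state proportional to the join window, because every event is buffered until the window expires. Keep windows as short as the use case allows and monitor state size.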

31.8 Exactly-once processing

Achieved by:

  1. Idempotent / transactional sink: writes are deduplicated, or use a transaction.
  2. Checkpointed state: on failure, restart from last consistent point.
  3. Replayable source: Kafka holds events for the retention window.
  4. Coordinated checkpoint commits: state checkpoint + source offset commit are atomic.

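Flink + Kafka achieves this end-to-end with a two-phase commit: the Kafka sink writes inside a transaction that is pre-committed when the checkpoint is taken and committed once the checkpoint completes. Downstream consumers must read with isolation.level=read_committed to see only committed output.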

Note: "exactly-once" is for the stream pipeline. Side effects to external systems (sending an email, hitting a third-party API) still need idempotency keys.
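The "idempotent sink" ingredient can be sketched with a deduplicating store. The assumption here is that each record carries a stable idempotency key, and the example uses the source (partition, offset) pair for that; `IdempotentSink` and `deliver` are illustrative names, not a real client API.

```python
class IdempotentSink:
    """Stores each record at most once, keyed by a caller-supplied key."""

    def __init__(self):
        self.rows = {}
        self.duplicates_skipped = 0

    def write(self, key, value):
        """Return True if the record was stored, False if it was a duplicate."""
        if key in self.rows:
            self.duplicates_skipped += 1
            return False
        self.rows[key] = value
        return True


def deliver(records, sink):
    """Write (partition, offset, value) records using (partition, offset)
    as the idempotency key, so a replay after failure is harmless."""
    for partition, offset, value in records:
        sink.write((partition, offset), value)
```

Delivering the same batch twice, as happens when a job restarts from a checkpoint, leaves the sink with one copy of each record. The same keying idea applies to external side effects such as emails or third-party API calls.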

31.9 Backpressure

Producers may outpace consumers. Strategies:

  • Pull-based (Kafka): consumer fetches at its own pace; broker buffers.
  • Push-based with credits (gRPC, reactive streams): consumer signals capacity.
  • Drop on overload: shed events.
  • Auto-scale consumers: add workers based on lag.

Watch consumer lag. Alert when it grows beyond a threshold.
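Consumer lag is the distance between the log's end offset and the consumer's committed offset, per partition. A minimal sketch of the calculation and a simple alert rule follows; the function names, the dictionary inputs, and the "strictly growing" rule are assumptions, and in practice the offsets would come from your broker's admin tooling.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition: log end offset minus committed consumer offset.
    Partitions with no committed offset are treated as starting from 0."""
    return {p: end - committed_offsets.get(p, 0) for p, end in end_offsets.items()}


def lag_is_growing(total_lag_samples, threshold):
    """True when the latest total lag exceeds `threshold` and the samples
    have been strictly increasing, i.e. the consumer is falling behind."""
    if not total_lag_samples:
        return False
    increasing = all(a < b for a, b in zip(total_lag_samples, total_lag_samples[1:]))
    return total_lag_samples[-1] > threshold and increasing
```

Alerting on sustained growth rather than on a single high reading avoids paging on short bursts that the consumer absorbs on its own.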

31.10 Stateful operators in detail

KStream / KTable / GlobalKTable (Kafka Streams)

  • KStream: an event stream.
  • KTable: a changelog stream interpreted as a table (latest value per key).
  • GlobalKTable: a KTable replicated to all workers (for broadcast joins).

Powerful, but a learning curve. The "duality" of streams and tables is foundational.
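The duality can be made concrete without any Kafka dependency: replaying a changelog yields a table, and a table can be emitted back out as a changelog. This is a conceptual sketch with made-up function names; it follows the Kafka convention that a `None` value is a tombstone that deletes the key.

```python
def to_table(changelog):
    """Fold a changelog of (key, value) records into a table holding the
    latest value per key. A value of None is a tombstone (delete)."""
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


def to_changelog(table):
    """Emit a table as a stream of (key, value) upserts."""
    return list(table.items())
```

Replaying `to_changelog(to_table(log))` rebuilds the same table as replaying `log`, which is the property that lets a state store be restored from its changelog topic.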

Flink's data model

DataStream API for low-level operators, with the Table API and SQL layered on top for relational queries over streams.

31.11 Common patterns

Real-time dashboards

Events → aggregate by minute → push to dashboard via WebSocket.

Fraud detection

Card swipes → stateful pattern detection (e.g., > $X within Y seconds in different countries) → alert.
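The fraud rule above is a keyed, stateful operator: per card, keep the recent swipes and test the pattern on each new one. The sketch below assumes swipes arrive in timestamp order and reads "> $X within Y seconds in different countries" as total spend above a limit across more than one country; `detect_fraud` and its parameters are hypothetical.

```python
from collections import defaultdict, deque


def detect_fraud(swipes, amount_limit, window_seconds):
    """swipes: (ts, card, country, amount) tuples in timestamp order.
    Returns (ts, card) alerts when a card's spend within the trailing
    window exceeds amount_limit across more than one country."""
    recent = defaultdict(deque)   # per-card state: swipes inside the window
    alerts = []
    for ts, card, country, amount in swipes:
        window = recent[card]
        window.append((ts, country, amount))
        # Expire swipes that fell out of the trailing window.
        while window and window[0][0] < ts - window_seconds:
            window.popleft()
        total = sum(a for _, _, a in window)
        countries = {c for _, c, _ in window}
        if total > amount_limit and len(countries) > 1:
            alerts.append((ts, card))
    return alerts
```

The per-card deque is exactly the kind of state a stream processor would keep in its keyed state store and include in checkpoints.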

Feature store updates

Real-time signals (click rates, user-recent-actions) computed via Flink → feature store → ML serving.

Stream-to-warehouse

Kafka events → micro-batches → Iceberg / BigQuery for analytics.

Change data capture (CDC)

DB → CDC tool (Debezium) → Kafka → downstream consumers (search index, data lake, microservices).

Outbox + CDC (recap)

Service writes outbox → Debezium reads WAL → Kafka. The reliable pattern for transactional event publishing.

31.12 Lambda and Kappa architectures

Lambda

Two pipelines:

  • Batch layer: accurate, slow.
  • Speed (stream) layer: fast, approximate.
  • Serving layer: merges both.

Built before streaming was reliable enough on its own. Now widely considered legacy because of the maintenance burden of keeping two pipelines in sync.

Kappa

Single stream pipeline. Replay history when you need a recompute (Kafka retention or archived to object store).

Most modern systems are Kappa-shaped: one stream pipeline, with batch jobs only when necessary.

31.13 Materialized views

Stream processing → continuously updated materialized view in a DB / cache / index. Queries hit the materialized form.

Examples:

  • Materialize, RisingWave: SQL streaming DBs that maintain materialized views in real time.
  • Druid, Pinot: real-time OLAP DBs that ingest streams and serve sub-second queries.
  • Custom: Flink job → Postgres view.

This pattern is increasingly the default for "real-time analytics" and dashboards.

31.14 Operational notes

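  • Schema evolution: events outlive the code that wrote them, so old consumers must read new data (forward compatibility) and new consumers must read old data (backward compatibility). Use a schema registry to enforce this.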
  • Reprocessing: Kafka allows resetting consumer offsets to replay history. Plan for it.
  • Hot keys: skewed key distribution → one worker overloaded. Salt or repartition.
  • State size: monitor it; checkpoints and recovery slow down as state grows. Incremental checkpoints (supported by Flink's RocksDB backend) help.
  • Downstream backpressure: handle gracefully; never drop silently without metrics.
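Salting a hot key, mentioned in the list above, is usually paired with a two-stage aggregation: partial results per salted key, then a merge on the original key. The sketch below is illustrative only; it assumes the set of hot keys is known in advance, uses a round-robin salt, and assumes keys never contain the `#` separator.

```python
from collections import Counter


def salted_key(key, seq, buckets):
    """Spread one logical key over `buckets` physical keys (round-robin)."""
    return f"{key}#{seq % buckets}"


def two_stage_count(keys, hot_keys, buckets):
    """Stage 1 counts per salted key, so a hot key's load is split across
    `buckets` workers. Stage 2 strips the salt and merges the partials."""
    partial = Counter()
    for seq, key in enumerate(keys):
        physical = salted_key(key, seq, buckets) if key in hot_keys else key
        partial[physical] += 1
    final = Counter()
    for physical, n in partial.items():
        final[physical.split("#", 1)[0]] += n
    return partial, final
```

The second stage only sees one partial result per bucket rather than every event, so the merge itself does not recreate the hot spot.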

31.15 What an interviewer wants

  • Streams vs batch trade-off.
  • Event time vs processing time + watermarks.
  • Windowing types (tumbling, hopping, session).
  • Exactly-once semantics: how Flink + Kafka achieve it.
  • Choose Flink for stateful streaming; Kafka Streams for in-app embedded.

31.16 Google's contributions

  • MillWheel (2013 paper): low-latency stream processing.
  • Dataflow (2015 paper): unified batch + stream model. Open-sourced as Apache Beam.
  • The "Streaming Systems" book (Tyler Akidau et al.) — the definitive treatment.

Beam's model is the most coherent treatment of "what does a streaming computation mean": triggers, accumulation modes, panes, watermarks. Worth knowing.

Key takeaways

  • Stream processing operates on unbounded data; replaces or complements batch.
  • Event time + watermarks are essential for correctness with late/out-of-order data.
  • Windowing: tumbling, hopping, session.
  • Exactly-once via Flink + Kafka transactions is achievable end-to-end (within the pipeline).
  • Kappa architecture (one stream pipeline) is the modern default; Lambda is legacy.
  • Materialized views over streams are the new "real-time analytics" default.
