31. Stream Processing: Kafka Streams, Flink, Exactly-Once

Stream processing operates on unbounded data — events arriving continuously. Done right, you replace nightly batch jobs with seconds-fresh insights and react to events as they happen.

31.1 Why streaming

Lower latency: react in seconds, not hours.
Continuous insight: dashboards live; alerts immediate.
Decouples producers and consumers: services emit events; many consumers process.
Elastic processing: workers scale with load.

Use cases: fraud detection, real-time analytics, dashboards, ML feature stores, alerts, ETL pipelines.

31.2 Streams vs batch

Batch: bounded data; result is "complete." Triggered by time.
Stream: unbounded; result is "the answer so far." Triggered by data.

Batch is a special case of stream (a finite stream). Apache Beam models both with one API.

31.3 The streaming stack

Source: an event log

Kafka: dominant.
Pulsar: Kafka competitor; tiered storage; better multi-tenancy.
Kinesis: AWS managed.
Pub/Sub: GCP managed.

Processor: a stream framework

Kafka Streams: library; runs in your JVM app; stateful via embedded RocksDB.
Apache Flink: dedicated cluster; rich windowing, exactly-once, true streaming. Industry leader for stateful streaming.
Spark Structured Streaming: micro-batch model; mature; integrates with Spark batch.
Apache Storm, Samza: older; mostly displaced.
ksqlDB: SQL on Kafka Streams.
Materialize, Decodable, Estuary: SQL streaming as a service.

Sink

Where results land: another Kafka topic, a DB, a search index, a dashboard, a data warehouse.

31.4 Time in streams: event time vs processing time

Critical distinction.

Event time: when the event actually happened (in the data's payload).
Processing time: when the stream processor received it.

In an ideal world, the two are equal. In reality, events arrive late, out of order, and in bursts, so real-time analytics must account for both clocks.

Watermarks

A watermark is a heuristic: "all events with event time ≤ T have arrived." Allows the system to close windows and emit results.

Generated from the data (e.g., max event time seen − a fixed out-of-orderness bound). Late events (those arriving after the watermark has passed their event time) are either dropped, sent to a side output, or trigger window updates.

Late data strategies

Drop.
Side output for separate handling.
Update windows (allowing fixes; expensive).
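
A minimal sketch of a bounded-out-of-orderness watermark feeding the "side output" strategy. The class name, the route function, and the 5-second delay are assumptions made for illustration, not any framework's API. As a simplification, lateness is judged per event here, whereas real engines judge it against window boundaries.

```python
class BoundedOutOfOrdernessWatermark:
    """Watermark = max event time seen so far minus a fixed delay (an assumed bound)."""

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self.max_event_time = float("-inf")

    def current(self):
        return self.max_event_time - self.max_delay

    def observe(self, event_time):
        self.max_event_time = max(self.max_event_time, event_time)


def route(events, max_delay):
    """Split (event_time, payload) pairs, given in arrival order, into on-time and late."""
    watermark = BoundedOutOfOrdernessWatermark(max_delay)
    on_time, late = [], []
    for event_time, payload in events:
        if event_time <= watermark.current():
            late.append(payload)      # side output: the watermark already passed this timestamp
        else:
            on_time.append(payload)
        watermark.observe(event_time)
    return on_time, late


events = [(10, "a"), (12, "b"), (6, "c"), (20, "d"), (13, "e"), (16, "f")]
on_time, late = route(events, max_delay=5)
print(on_time, late)
```

A larger delay admits more stragglers but holds windows open longer; that is the latency/completeness trade-off a watermark encodes.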

31.5 Windowing

Aggregations need bounds. Common windows:

Tumbling

Fixed-size, non-overlapping. "Hourly" bucket.

Hopping (sliding)

Fixed-size, overlapping. "5-minute window every 1 minute" → each event falls into 5 windows, and one result is emitted per minute.
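
Window assignment for tumbling and hopping windows is plain arithmetic on the event timestamp. The sketch below assumes integer timestamps in seconds and half-open [start, end) windows aligned to zero; the function names are hypothetical.

```python
def tumbling_window(ts, size):
    """Return the single [start, end) window of length `size` containing ts."""
    start = ts - ts % size
    return (start, start + size)


def hopping_windows(ts, size, hop):
    """Return every [start, end) window of length `size`, started every `hop`, containing ts."""
    windows = []
    start = ts - ts % hop            # latest window start at or before ts
    while start > ts - size:         # walk back while the window still covers ts
        windows.append((start, start + size))
        start -= hop
    return sorted(windows)


print(tumbling_window(3725, 3600))
print(hopping_windows(430, size=300, hop=60))
```

With size=300 and hop=60, the event at t=430 lands in five windows, which is why hopping aggregations cost roughly size/hop times the state of a tumbling one.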

Session

Variable-size, separated by gaps of inactivity. "User session" = activity grouped with < 30 min gap.
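
A rough sketch of session assignment for one key, assuming all timestamps are already known and sorted. A streaming engine instead merges session windows incrementally as events arrive; the sessionize name and the 1800-second gap are illustrative.

```python
def sessionize(timestamps, gap):
    """Group timestamps into (first, last) sessions separated by >= gap of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][1] < gap:
            sessions[-1][1] = ts          # extend the current session
        else:
            sessions.append([ts, ts])     # inactivity gap reached: start a new session
    return [tuple(s) for s in sessions]


print(sessionize([0, 600, 1500, 4000, 4100, 9000], gap=1800))
```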

Global / fixed-key

No window; aggregate forever per key.

Custom

Application-defined boundaries.

31.6 State in streams

Most useful streams are stateful: aggregates, joins, sessions.

Where state lives

In memory (small): fast; lost on crash.
Embedded persistent KV (RocksDB): used by Flink, Kafka Streams. Backed up via checkpoints (Flink) or changelog topics (Kafka Streams).
External KV (Redis, DynamoDB): visible across processes; latency cost.

Checkpointing

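Flink takes consistent snapshots of state across all operators using asynchronous barrier snapshotting, a variant of the Chandy-Lamport algorithm. On failure, it restores the last checkpoint and replays events from the Kafka offsets stored in that checkpoint, so each event affects state exactly once.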
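
The core idea, saving state and source offset together and rewinding both on failure, can be shown with a toy single-operator job. This is a sketch under heavy simplification: a Python list stands in for a Kafka partition, and CountingJob is a hypothetical name, not Flink's mechanism (which coordinates many operators with barriers).

```python
import copy


class CountingJob:
    """Toy job: per-key counts over a replayable log, with checkpoints."""

    def __init__(self, log):
        self.log = log                # stands in for a replayable Kafka partition
        self.offset = 0               # next log position to read
        self.state = {}               # operator state: key -> count
        self.checkpoint = (0, {})     # (offset, state), always saved together

    def process(self, n):
        for _ in range(n):
            if self.offset >= len(self.log):
                break
            key = self.log[self.offset]
            self.state[key] = self.state.get(key, 0) + 1
            self.offset += 1

    def take_checkpoint(self):
        self.checkpoint = (self.offset, copy.deepcopy(self.state))

    def crash_and_recover(self):
        offset, state = self.checkpoint
        self.offset, self.state = offset, copy.deepcopy(state)


job = CountingJob(["a", "b", "a", "c", "a", "b"])
job.process(3)
job.take_checkpoint()
job.process(2)            # progress made after the checkpoint...
job.crash_and_recover()   # ...is discarded and replayed from offset 3
job.process(10)
print(job.state)
```

Because the offset and the state are restored as a pair, the two events processed before the crash are counted once, not twice.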

31.7 Joins in streams

Stream-stream join

Two streams, joined on a key within a time window. Matching events can arrive in either order, so both sides are buffered until the window expires. Used for: clickstream + purchase, ad-impression + click.
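
A minimal sketch of a windowed stream-stream join, assuming events are tagged "L" or "R" and matched when their timestamps differ by at most the window. The interval_join name and event layout are assumptions. State eviction is omitted for brevity; a real engine drops buffered events once the watermark passes their window.

```python
from collections import defaultdict


def interval_join(events, window):
    """events: (side, key, ts) tuples in arrival order; side is "L" or "R".

    Emits (key, left_ts, right_ts) whenever |left_ts - right_ts| <= window.
    """
    buffers = {"L": defaultdict(list), "R": defaultdict(list)}
    out = []
    for side, key, ts in events:
        other = "R" if side == "L" else "L"
        for other_ts in buffers[other][key]:          # probe the opposite side's state
            if abs(ts - other_ts) <= window:
                left_ts, right_ts = (ts, other_ts) if side == "L" else (other_ts, ts)
                out.append((key, left_ts, right_ts))
        buffers[side][key].append(ts)                 # then buffer for future matches
    return out


events = [
    ("L", "ad1", 100), ("R", "ad1", 130),   # left arrives first
    ("R", "ad2", 140), ("L", "ad2", 150),   # right arrives first
    ("R", "ad1", 500),                      # outside the window: no match
]
joined = interval_join(events, window=60)
print(joined)
```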

Stream-table join

Stream joined against a slowly-changing table (often broadcast or lookup). Used for enrichment: events × user dimension.
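
Enrichment can be sketched as a lookup against the latest table value at the moment each stream event arrives. The enrich function and the tagged-tuple input are hypothetical conventions chosen for this example.

```python
def enrich(events):
    """events: ("table", user_id, attrs) updates the table side;
    ("stream", user_id, action) is enriched with the current table row."""
    users = {}      # table side: latest attributes per user
    out = []
    for kind, user_id, value in events:
        if kind == "table":
            users[user_id] = value
        else:
            out.append({"user_id": user_id, "action": value, "user": users.get(user_id)})
    return out


enriched = enrich([
    ("table", "u1", {"tier": "free"}),
    ("stream", "u1", "click"),
    ("table", "u1", {"tier": "pro"}),
    ("stream", "u1", "purchase"),
    ("stream", "u2", "click"),      # no table row yet: enriched with None
])
print(enriched)
```

Note that the join is not symmetric: a table update does not re-emit earlier stream events, so the "click" keeps the tier that was current when it arrived.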

Table-table join (continuous)

Both sides continuous; result kept up to date.

Joins require state proportional to the join window. Tune carefully.

31.8 Exactly-once processing

Achieved by:

Idempotent / transactional sink: writes are deduplicated, or use a transaction.
Checkpointed state: on failure, restart from last consistent point.
Replayable source: Kafka holds events for the retention window.
Coordinated checkpoint commits: state checkpoint + source offset commit are atomic.

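Flink + Kafka achieves this end-to-end with a two-phase commit: the Kafka sink writes inside a transaction that is pre-committed at each checkpoint and committed once the checkpoint completes.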

Note: "exactly-once" is for the stream pipeline. Side effects to external systems (sending an email, hitting a third-party API) still need idempotency keys.
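
For such side effects, one common approach is to deduplicate on an idempotency key so that a replay after recovery does not repeat the effect. A sketch, assuming an in-memory key set (a real sink would persist the keys, ideally in the same transaction as the write); IdempotentSink is a hypothetical name.

```python
class IdempotentSink:
    """Performs each side effect at most once per idempotency key."""

    def __init__(self):
        self.seen = set()         # keys already applied (in-memory for illustration)
        self.delivered = []       # stands in for emails sent, API calls made, etc.

    def write(self, idempotency_key, payload):
        if idempotency_key in self.seen:
            return False          # duplicate from a replay: skip
        self.seen.add(idempotency_key)
        self.delivered.append(payload)
        return True


sink = IdempotentSink()
batch = [("order-1", "email A"), ("order-2", "email B")]
for key, payload in batch:
    sink.write(key, payload)
for key, payload in batch:        # the same batch replayed after a failure
    sink.write(key, payload)
print(sink.delivered)
```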

31.9 Backpressure

Producers may outpace consumers. Strategies:

Pull-based (Kafka): consumer fetches at its own pace; broker buffers.
Push-based with credits (gRPC, reactive streams): consumer signals capacity.
Drop on overload: shed events.
Auto-scale consumers: add workers based on lag.

Watch consumer lag. Alert when it grows beyond a threshold.
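
Lag per partition is the log end offset minus the consumer group's committed offset. The sketch below assumes both are already available as plain dicts; the function names and the alert rule (above threshold and strictly rising) are illustrative choices, not a standard.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the group's committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


def lag_is_growing(samples, threshold):
    """True if total lag exceeds threshold and rose across consecutive samples."""
    totals = [sum(sample.values()) for sample in samples]
    rising = all(a < b for a, b in zip(totals, totals[1:]))
    return totals[-1] > threshold and rising


lag = consumer_lag({0: 100, 1: 250}, {0: 90, 1: 100})
print(lag)
```

Alerting on a rising trend rather than a single reading avoids paging on a short burst the consumers are already draining.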

31.10 Stateful operators in detail

KStream / KTable / GlobalKTable (Kafka Streams)

KStream: an event stream.
KTable: a changelog stream interpreted as a table (latest value per key).
GlobalKTable: a KTable replicated to all workers (for broadcast joins).

Powerful, but a learning curve. The "duality" of streams and tables is foundational.
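
The duality can be made concrete: folding a changelog stream yields a table, and every table update is itself a changelog record. A plain-Python sketch (not the Kafka Streams API), treating a None value as a tombstone that deletes the key:

```python
def materialize(changelog):
    """Fold (key, value) changelog records into a table of latest values per key."""
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)      # tombstone: delete the key
        else:
            table[key] = value        # upsert: later records win
    return table


changelog = [("u1", "free"), ("u2", "free"), ("u1", "pro"), ("u2", None)]
table = materialize(changelog)
print(table)
```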

Flink's data model

DataStream + Table API. SQL-on-streams.

31.11 Common patterns

Real-time dashboards

Events → aggregate by minute → push to dashboard via WebSocket.

Fraud detection

Card swipes → stateful pattern detection (e.g., > $X within Y seconds in different countries) → alert.
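
A sketch of the stateful rule above, keeping a short per-card history. It assumes swipes for each card arrive in timestamp order; the FraudDetector class, its thresholds, and the field layout are all hypothetical.

```python
from collections import defaultdict, deque


class FraudDetector:
    """Flags a card that spends over a threshold in more than one country within a window."""

    def __init__(self, window_seconds, amount_threshold):
        self.window = window_seconds
        self.threshold = amount_threshold
        self.recent = defaultdict(deque)   # card -> deque of (ts, amount, country)

    def on_swipe(self, card, ts, amount, country):
        history = self.recent[card]
        while history and ts - history[0][0] > self.window:
            history.popleft()              # evict swipes older than the window
        history.append((ts, amount, country))
        total = sum(a for _, a, _ in history)
        countries = {c for _, _, c in history}
        return total > self.threshold and len(countries) > 1


detector = FraudDetector(window_seconds=60, amount_threshold=500)
alerts = [
    detector.on_swipe("c1", 0, 300, "US"),
    detector.on_swipe("c1", 30, 300, "FR"),    # 600 across two countries in 30 s
    detector.on_swipe("c2", 30, 900, "US"),    # large, but a single country
    detector.on_swipe("c1", 200, 100, "US"),   # earlier swipes have expired
]
print(alerts)
```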

Feature store updates

Real-time signals (click rates, user-recent-actions) computed via Flink → feature store → ML serving.

Stream-to-warehouse

Kafka events → micro-batches → Iceberg / BigQuery for analytics.

Change data capture (CDC)

DB → CDC tool (Debezium) → Kafka → downstream consumers (search index, data lake, microservices).

Outbox + CDC (recap)

Service writes outbox → Debezium reads WAL → Kafka. The reliable pattern for transactional event publishing.
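
The key property is that the business row and the outbox row commit in one local transaction. The sketch below uses SQLite from the standard library; the table layout and function names are assumptions, and a polling relay stands in for Debezium, which would tail the database log instead.

```python
import json
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, topic TEXT, payload TEXT, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def place_order(conn, order_id, amount):
    event = json.dumps({"type": "OrderPlaced", "order_id": order_id, "amount": amount})
    with conn:  # one transaction: both rows commit, or neither does
        conn.execute("INSERT INTO outbox (topic, payload) VALUES (?, ?)", ("orders", event))
        conn.execute("INSERT INTO orders (id, amount) VALUES (?, ?)", (order_id, amount))


def relay(conn, publish):
    """Polling stand-in for a CDC tool: publish unpublished rows, then mark them."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, payload)   # a crash before the UPDATE means a resend: at-least-once
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


conn = make_db()
place_order(conn, 1, 50)
sent = []
relay(conn, lambda topic, payload: sent.append((topic, payload)))
print(sent)
```

Delivery is at-least-once, so downstream consumers still need to deduplicate, for example on order_id.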

31.12 Lambda and Kappa architectures

Lambda

Two pipelines:

Batch layer: accurate, slow.
Speed (stream) layer: fast, approximate.
Serving layer: merges both.

Built before streaming was reliable enough alone. Now considered legacy due to the maintenance burden of keeping two pipelines in sync.

Kappa

Single stream pipeline. When you need a recompute, replay history from the log (within Kafka retention) or from events archived to an object store.

Most modern systems are Kappa-shaped: one stream pipeline, with batch jobs only when necessary.

31.13 Materialized views

Stream processing → continuously updated materialized view in a DB / cache / index. Queries hit the materialized form.
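
What "continuously updated" means in practice: each input change adjusts the view incrementally, retracting the old contribution before adding the new one, rather than recomputing from scratch. A sketch with a hypothetical RevenueByCountryView fed by order upserts:

```python
class RevenueByCountryView:
    """Incrementally maintained view: total order amount per country."""

    def __init__(self):
        self.orders = {}    # order_id -> (country, amount), needed to retract old values
        self.view = {}      # country -> running total (what queries read)

    def apply(self, order_id, country, amount):
        old = self.orders.get(order_id)
        if old is not None:
            old_country, old_amount = old
            self.view[old_country] -= old_amount      # retract the previous contribution
        self.orders[order_id] = (country, amount)
        self.view[country] = self.view.get(country, 0) + amount


view = RevenueByCountryView()
view.apply(1, "US", 100)
view.apply(2, "US", 50)
view.apply(1, "DE", 120)    # order 1 is corrected: moves from US to DE
print(view.view)
```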

Examples:

Materialize, RisingWave: SQL streaming DBs that maintain materialized views in real time.
Druid, Pinot: real-time OLAP DBs that ingest streams and serve sub-second queries.
Custom: Flink job → Postgres view.

This pattern is increasingly the default for "real-time analytics" and dashboards.

31.14 Operational notes

Schema evolution: streams persist, so old code must read new data and new code must read old data. Use a schema registry to enforce compatibility.
Reprocessing: Kafka allows resetting consumer offsets to replay history. Plan for it.
Hot keys: skewed key distribution → one worker overloaded. Salt or repartition.
State size: monitor; checkpoints get slow at large state.
Downstream backpressure: handle gracefully; never drop silently without metrics.
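
Salting a hot key can be sketched as a two-stage aggregation: stage one counts under salted keys so the load spreads across workers, and stage two strips the salt and merges the partial results. The "#" separator, the function names, and round-robin salting are assumptions made for this example.

```python
from collections import Counter


def salted_key(key, sequence, num_salts):
    """Append a rotating salt so one logical key maps to num_salts physical keys."""
    return f"{key}#{sequence % num_salts}"


def two_stage_count(keys, num_salts):
    partial = Counter()
    for i, key in enumerate(keys):
        partial[salted_key(key, i, num_salts)] += 1       # stage 1: spread across salts
    final = Counter()
    for salted, count in partial.items():
        final[salted.rsplit("#", 1)[0]] += count          # stage 2: strip salt and merge
    return partial, final


partial, final = two_stage_count(["hot"] * 8 + ["cold"] * 2, num_salts=4)
print(dict(partial))
print(dict(final))
```

The cost is a second shuffle and a merge step, so salting is usually applied only to keys known to be skewed.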

31.15 What an interviewer wants

Streams vs batch trade-off.
Event time vs processing time + watermarks.
Windowing types (tumbling, hopping, session).
Exactly-once semantics: how Flink + Kafka achieve it.
Choose Flink for stateful streaming; Kafka Streams for in-app embedded.

31.16 Google's contributions

MillWheel (2013 paper): low-latency stream processing.
Dataflow (2015 paper): unified batch + stream model. Open-sourced as Apache Beam.
The "Streaming Systems" book (Tyler Akidau et al.) — the definitive treatment.

Beam's model is the most coherent treatment of "what does a streaming computation mean": triggers, accumulation modes, panes, watermarks. Worth knowing.

Key takeaways

Stream processing operates on unbounded data; replaces or complements batch.
Event time + watermarks are essential for correctness with late/out-of-order data.
Windowing: tumbling, hopping, session.
Exactly-once via Flink + Kafka transactions is achievable end-to-end (within the pipeline).
Kappa architecture (one stream pipeline) is the modern default; Lambda is legacy.
Materialized views over streams are the new "real-time analytics" default.