system-design/async-patterns.md

22. Async Patterns: Pub/Sub, CQRS, Event Sourcing, Saga

The patterns in this chapter all flow from one insight: the act of changing state and the consequences of that change can be decoupled in time. Used wisely, this unlocks scale, integration, and auditability.

~6 min read·updated 5/29/2026

22. Async Patterns: Pub/Sub, CQRS, Event Sourcing, Saga

The patterns in this chapter all flow from one insight: the act of changing state and the consequences of that change can be decoupled in time. Used wisely, this unlocks scale, integration, and auditability.

22.1 Pub/Sub vs queue (recap)

  • Queue: each message consumed by exactly one worker.
  • Pub/Sub: each message delivered to all subscribers.

Most modern brokers support both via consumer groups (Kafka): one consumer group per logical subscriber.

22.2 Event-driven architecture (EDA)

Services emit events ("something happened") rather than commanding others ("do this"). Other services react.

Example flow:

  • User places order → OrderService writes order, emits OrderPlaced event.
  • InventoryService subscribes → reserves stock.
  • PaymentService subscribes → charges card.
  • NotificationService subscribes → emails confirmation.
  • AnalyticsService subscribes → updates dashboards.

Adding a new behavior (e.g., loyalty points) = new subscriber, no change to OrderService.

Event vs command vs message

  • Command: imperative ("ChargeCard"). Targeted at a known recipient.
  • Event: declarative ("OrderPlaced"). Broadcast.
  • Message: any data passed between services.

Naming convention: events use past-tense verbs.

Event design

  • Include enough data so consumers don't need to call back ("event-carried state transfer").
  • But not so much that events become bloated and brittle.
  • Include event_id, event_type, event_version, occurred_at, aggregate_id, payload.
  • Schema-evolve carefully (chapter 8).

Trade-offs

  • + Decoupled producers/consumers; new features without breaking changes.
  • + Audit trail; replayable history.
  • Hard to trace a flow; requires distributed tracing.
  • Eventual consistency everywhere.
  • Order matters but is harder to guarantee.

22.3 Event Sourcing

Don't store current state. Store the sequence of events that led to it. Derive state by replaying.

Example: bank account.

  • Traditional: a row {user, balance: 1500}.
  • Event sourcing: a stream Deposited(1000), Deposited(500), Withdrew(0). Current balance computed by sum.

Pros

  • Complete audit trail: every change is recorded with intent.
  • Time travel: state at any historical moment.
  • Replay for new projections: build a new view by re-running events.
  • Natural fit for CQRS: events are the input to read models.

Cons

  • Huge cognitive shift: most engineers have never modeled this way.
  • Event schemas evolve: events live forever; you must handle old schemas.
  • Snapshots needed: replaying millions of events is slow; periodically snapshot state.
  • Querying current state is non-trivial: you build read models for queries.
  • Overkill for most domains.

When it shines

  • Audit-heavy domains: finance, healthcare, legal.
  • Complex domain logic where causality matters.
  • Domains where new questions arise (replay events to answer).

When it hurts

  • CRUD apps where the only question is "what's the current state?"
  • Teams without strong DDD background.

Tools: EventStoreDB, Kafka (used as event store), Axon, custom on Postgres.

22.4 CQRS (Command Query Responsibility Segregation)

Separate the write model (commands) from the read model (queries).

  • Writes go through a command handler, validate, mutate.
  • Reads use a separate, optimized read model.
  • Read model is a projection of the write model (often eventually consistent).

Why

  • Read and write loads scale differently.
  • Reads need different shapes than writes.
  • One write event can update many read models (denormalized for fast queries).

Example

  • Write: PlaceOrder(items, address, payment) → write to orders table, emit OrderPlaced.
  • Read 1: customer order history view.
  • Read 2: warehouse pick list view.
  • Read 3: revenue dashboard.

Each read model is a separate denormalized projection updated from events.

CQRS without event sourcing

You can do CQRS without event sourcing — just write to the canonical store, then async-update read views. Common in any system with cache.

CQRS with event sourcing is the "full" pattern.

Cost

  • Eventual consistency between write and reads.
  • More moving parts.
  • "Where do I look up the current state?" answer becomes "depends on the question."

22.5 Saga (recap from chapter 16)

Multi-step distributed workflow with compensations.

  • Choreography: each service reacts to events; cycles emerge.
  • Orchestration: central coordinator (Temporal, Step Functions) drives.

Use when business operations span services and you need eventual completion.

22.6 The Outbox Pattern (deep dive)

The cornerstone of correct event-driven systems.

Problem

You write to your DB. Then you publish an event. If publish fails, DB has the change but no event. Inconsistency.

Solution

  1. In one local transaction: write the change AND insert an outbox_events row with the event payload.
  2. A separate process reads outbox rows → publishes to broker → marks outbox row sent.
BEGIN; UPDATE orders SET status='paid' WHERE id=42; INSERT INTO outbox(id, type, payload, created_at) VALUES (gen_uuid(), 'OrderPaid', '{"order_id":42}', NOW()); COMMIT;

Then a publisher polls (or streams via CDC) the outbox table and emits to Kafka. Once acked, marks published_at. Cleans up after retention.

CDC (Change Data Capture) variant

A CDC tool (Debezium) reads the DB's WAL/binlog and streams changes to Kafka. The outbox table is captured automatically — no polling, low latency.

This is the production way to do event-driven systems as of 2026.

22.7 Inbox pattern (consumer side)

Mirror of outbox.

  • Consumer receives event → checks inbox_events for event_id.
  • If present → already processed, skip.
  • Else → process, then write (event_id, processed_at) to inbox in the same transaction as the side effect.

Combined with outbox + at-least-once delivery → effectively-once processing.

22.8 Eventual consistency UX

Async patterns force the UI to deal with state in flight. Common patterns:

  • Optimistic UI: render assumed result immediately; reconcile later (Linear, Notion).
  • Status indicator: "Order placed (processing)" → "Order confirmed."
  • Polling: fetch status; show progress.
  • WebSocket update: server pushes when ready.

Hide eventual consistency from users who don't care; surface it when they do.

22.9 Idempotency, retries, dedup (recap)

Every async system needs:

  • Idempotency keys on commands.
  • Retries with backoff and jitter.
  • Dedup on the consumer (inbox).
  • DLQ for poison messages.

If you skip these, you'll have surprising bugs at 1 AM.

22.10 Choreography vs orchestration

Choreography:

  • Pro: decoupled, scales, no central failure point.
  • Con: hard to see the whole flow; cycles emerge; refactoring is painful.

Orchestration:

  • Pro: explicit workflow; observable; easier to debug; natural for human approvals.
  • Con: orchestrator is critical infra; another DB to operate.

Practical advice: start with choreography for simple flows; promote to orchestration when the flow has > 4-5 steps or human approvals.

Tools for orchestration:

  • Temporal: durable workflows; code-as-workflow; great DX. Best-in-class.
  • AWS Step Functions: serverless; JSON-defined state machines.
  • Apache Airflow: batch / data pipelines, less for online.
  • Cadence (Uber): predecessor to Temporal.

22.11 Event versioning

Events are forever. Schema evolution rules (chapter 8) apply hard.

  • Add fields with defaults.
  • Never remove fields.
  • New event types over modifying old ones.
  • Consumers must handle old + new versions during transition.

22.12 When NOT to go async

  • Synchronous, low-latency APIs (search results, login).
  • Tight transactional consistency required (move money between two of your accounts).
  • Tiny scale where the operational overhead isn't worth it.
  • Strong ordering with low throughput (queue paradigm fights you here).

A simple sync REST + DB is often the right answer. Async is a tool, not a religion.

22.13 Practical reference architecture

For a typical e-commerce or SaaS platform:

[ Client ]
    │
[ API Gateway ] ─── auth, rate limit, routing
    │
[ Service A ] ──┐
    │           │
[ Postgres A ]  │  outbox CDC → [ Kafka ]
                │
[ Service B ] ←─┘  reacts; updates own DB & emits
    │
[ Postgres B ]
    │
   ... and so on

Within Kafka:

  • Many topics by domain event.
  • Many consumer groups (one per consumer service).
  • Schema registry for compatibility.
  • Connectors to S3 (archive), Snowflake (analytics), Elasticsearch (search).

Key takeaways

  • Pub/sub + outbox is the modern reliable async pattern.
  • Event sourcing is powerful but heavy; reserve for audit-critical domains.
  • CQRS separates write & read models; pairs well with events.
  • Choreography for simple flows; orchestration (Temporal) when complexity grows.
  • Always: idempotency keys, backoff, dedup, DLQ.

// 1 view

main
UTF-8·typescript