◐ system-design/async-patterns.md
22. Async Patterns: Pub/Sub, CQRS, Event Sourcing, Saga
The patterns in this chapter all flow from one insight: the act of changing state and the consequences of that change can be decoupled in time. Used wisely, this unlocks scale, integration, and auditability.
~6 min read·updated 5/29/2026
22. Async Patterns: Pub/Sub, CQRS, Event Sourcing, Saga
The patterns in this chapter all flow from one insight: the act of changing state and the consequences of that change can be decoupled in time. Used wisely, this unlocks scale, integration, and auditability.
22.1 Pub/Sub vs queue (recap)
- Queue: each message consumed by exactly one worker.
- Pub/Sub: each message delivered to all subscribers.
Most modern brokers support both via consumer groups (Kafka): one consumer group per logical subscriber.
22.2 Event-driven architecture (EDA)
Services emit events ("something happened") rather than commanding others ("do this"). Other services react.
Example flow:
- User places order →
OrderServicewrites order, emitsOrderPlacedevent. InventoryServicesubscribes → reserves stock.PaymentServicesubscribes → charges card.NotificationServicesubscribes → emails confirmation.AnalyticsServicesubscribes → updates dashboards.
Adding a new behavior (e.g., loyalty points) = new subscriber, no change to OrderService.
Event vs command vs message
- Command: imperative ("ChargeCard"). Targeted at a known recipient.
- Event: declarative ("OrderPlaced"). Broadcast.
- Message: any data passed between services.
Naming convention: events use past-tense verbs.
Event design
- Include enough data so consumers don't need to call back ("event-carried state transfer").
- But not so much that events become bloated and brittle.
- Include
event_id,event_type,event_version,occurred_at,aggregate_id,payload. - Schema-evolve carefully (chapter 8).
Trade-offs
- + Decoupled producers/consumers; new features without breaking changes.
- + Audit trail; replayable history.
- − Hard to trace a flow; requires distributed tracing.
- − Eventual consistency everywhere.
- − Order matters but is harder to guarantee.
22.3 Event Sourcing
Don't store current state. Store the sequence of events that led to it. Derive state by replaying.
Example: bank account.
- Traditional: a row
{user, balance: 1500}. - Event sourcing: a stream
Deposited(1000), Deposited(500), Withdrew(0). Current balance computed by sum.
Pros
- Complete audit trail: every change is recorded with intent.
- Time travel: state at any historical moment.
- Replay for new projections: build a new view by re-running events.
- Natural fit for CQRS: events are the input to read models.
Cons
- Huge cognitive shift: most engineers have never modeled this way.
- Event schemas evolve: events live forever; you must handle old schemas.
- Snapshots needed: replaying millions of events is slow; periodically snapshot state.
- Querying current state is non-trivial: you build read models for queries.
- Overkill for most domains.
When it shines
- Audit-heavy domains: finance, healthcare, legal.
- Complex domain logic where causality matters.
- Domains where new questions arise (replay events to answer).
When it hurts
- CRUD apps where the only question is "what's the current state?"
- Teams without strong DDD background.
Tools: EventStoreDB, Kafka (used as event store), Axon, custom on Postgres.
22.4 CQRS (Command Query Responsibility Segregation)
Separate the write model (commands) from the read model (queries).
- Writes go through a command handler, validate, mutate.
- Reads use a separate, optimized read model.
- Read model is a projection of the write model (often eventually consistent).
Why
- Read and write loads scale differently.
- Reads need different shapes than writes.
- One write event can update many read models (denormalized for fast queries).
Example
- Write:
PlaceOrder(items, address, payment)→ write to orders table, emitOrderPlaced. - Read 1: customer order history view.
- Read 2: warehouse pick list view.
- Read 3: revenue dashboard.
Each read model is a separate denormalized projection updated from events.
CQRS without event sourcing
You can do CQRS without event sourcing — just write to the canonical store, then async-update read views. Common in any system with cache.
CQRS with event sourcing is the "full" pattern.
Cost
- Eventual consistency between write and reads.
- More moving parts.
- "Where do I look up the current state?" answer becomes "depends on the question."
22.5 Saga (recap from chapter 16)
Multi-step distributed workflow with compensations.
- Choreography: each service reacts to events; cycles emerge.
- Orchestration: central coordinator (Temporal, Step Functions) drives.
Use when business operations span services and you need eventual completion.
22.6 The Outbox Pattern (deep dive)
The cornerstone of correct event-driven systems.
Problem
You write to your DB. Then you publish an event. If publish fails, DB has the change but no event. Inconsistency.
Solution
- In one local transaction: write the change AND insert an
outbox_eventsrow with the event payload. - A separate process reads outbox rows → publishes to broker → marks outbox row sent.
BEGIN; UPDATE orders SET status='paid' WHERE id=42; INSERT INTO outbox(id, type, payload, created_at) VALUES (gen_uuid(), 'OrderPaid', '{"order_id":42}', NOW()); COMMIT;
Then a publisher polls (or streams via CDC) the outbox table and emits to Kafka. Once acked, marks published_at. Cleans up after retention.
CDC (Change Data Capture) variant
A CDC tool (Debezium) reads the DB's WAL/binlog and streams changes to Kafka. The outbox table is captured automatically — no polling, low latency.
This is the production way to do event-driven systems as of 2026.
22.7 Inbox pattern (consumer side)
Mirror of outbox.
- Consumer receives event → checks
inbox_eventsforevent_id. - If present → already processed, skip.
- Else → process, then write
(event_id, processed_at)to inbox in the same transaction as the side effect.
Combined with outbox + at-least-once delivery → effectively-once processing.
22.8 Eventual consistency UX
Async patterns force the UI to deal with state in flight. Common patterns:
- Optimistic UI: render assumed result immediately; reconcile later (Linear, Notion).
- Status indicator: "Order placed (processing)" → "Order confirmed."
- Polling: fetch status; show progress.
- WebSocket update: server pushes when ready.
Hide eventual consistency from users who don't care; surface it when they do.
22.9 Idempotency, retries, dedup (recap)
Every async system needs:
- Idempotency keys on commands.
- Retries with backoff and jitter.
- Dedup on the consumer (inbox).
- DLQ for poison messages.
If you skip these, you'll have surprising bugs at 1 AM.
22.10 Choreography vs orchestration
Choreography:
- Pro: decoupled, scales, no central failure point.
- Con: hard to see the whole flow; cycles emerge; refactoring is painful.
Orchestration:
- Pro: explicit workflow; observable; easier to debug; natural for human approvals.
- Con: orchestrator is critical infra; another DB to operate.
Practical advice: start with choreography for simple flows; promote to orchestration when the flow has > 4-5 steps or human approvals.
Tools for orchestration:
- Temporal: durable workflows; code-as-workflow; great DX. Best-in-class.
- AWS Step Functions: serverless; JSON-defined state machines.
- Apache Airflow: batch / data pipelines, less for online.
- Cadence (Uber): predecessor to Temporal.
22.11 Event versioning
Events are forever. Schema evolution rules (chapter 8) apply hard.
- Add fields with defaults.
- Never remove fields.
- New event types over modifying old ones.
- Consumers must handle old + new versions during transition.
22.12 When NOT to go async
- Synchronous, low-latency APIs (search results, login).
- Tight transactional consistency required (move money between two of your accounts).
- Tiny scale where the operational overhead isn't worth it.
- Strong ordering with low throughput (queue paradigm fights you here).
A simple sync REST + DB is often the right answer. Async is a tool, not a religion.
22.13 Practical reference architecture
For a typical e-commerce or SaaS platform:
[ Client ]
│
[ API Gateway ] ─── auth, rate limit, routing
│
[ Service A ] ──┐
│ │
[ Postgres A ] │ outbox CDC → [ Kafka ]
│
[ Service B ] ←─┘ reacts; updates own DB & emits
│
[ Postgres B ]
│
... and so on
Within Kafka:
- Many topics by domain event.
- Many consumer groups (one per consumer service).
- Schema registry for compatibility.
- Connectors to S3 (archive), Snowflake (analytics), Elasticsearch (search).
Key takeaways
- Pub/sub + outbox is the modern reliable async pattern.
- Event sourcing is powerful but heavy; reserve for audit-critical domains.
- CQRS separates write & read models; pairs well with events.
- Choreography for simple flows; orchestration (Temporal) when complexity grows.
- Always: idempotency keys, backoff, dedup, DLQ.
// 1 view