20. Microservices, Monoliths, Service Mesh
Microservices vs monolith is a religious war that should be a pragmatic choice. Both have their place. The right answer depends on team size, change rate, and operational maturity.
~6 min read·updated 5/29/2026
20.1 Definitions
Monolith
One codebase, one deployable, one process per replica. All features share a database (usually).
Modular monolith
One deployable, but internally organized into well-bounded modules with explicit interfaces. Sometimes a stepping stone to microservices; often the right end state.
Service-Oriented Architecture (SOA)
Multiple services, often coarse-grained, communicating via contracts (SOAP historically). Older sibling of microservices.
Microservices
Many small services, each owned by a small team, each independently deployable, each with its own data store. Communicate over network (HTTP/gRPC/queue).
20.2 The case for monoliths (or modular monoliths)
- Simpler operations: one deploy, one log stream, one DB to back up.
- Lower latency: function calls, not network calls.
- Easier transactions: one DB → ACID for free.
- Easier debugging: one stack trace, no distributed trace.
- Faster iteration when team is small (< ~30 engineers).
- Refactoring is local: rename a function across the codebase in one PR.
Famous examples: Stack Overflow ran for years on a small SQL Server + IIS monolith; Shopify, Basecamp, and GitHub started monolithic and stayed mostly so for ages.
20.3 The case for microservices
- Independent scaling: scale auth separately from payments.
- Polyglot: each service can pick the right language.
- Team independence: separate deploys, separate on-call, separate ownership.
- Fault isolation: one bad service shouldn't crash the rest (in theory).
- Tech evolution: rewrite or replace one service without touching others.
20.4 The cost of microservices
- Distributed system problems: network latency, partial failure, eventual consistency, distributed transactions, debugging hell.
- Operational complexity: dozens or hundreds of services, each with its own deploy pipeline, DB, runbook, dashboards.
- Testing: integration tests cross many services; consumer-driven contracts emerge.
- Cross-service refactoring: a domain change might touch 5 services and 3 schemas; coordination required.
- Latency: every internal RPC adds 1-10 ms; chains of 5 services = 5-50 ms baseline.
The microservice premium (Fowler): microservices carry a baseline cost in operations and complexity that only pays off once the system is complex enough. As a rule of thumb, below ~30 engineers the premium is rarely worth it.
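The latency figure in the list above is plain arithmetic. A minimal sketch, assuming the calls run strictly one after another and each hop costs between 1 and 10 ms as stated; the function name is hypothetical:

```python
def chain_latency_ms(
    hops: int, per_hop_ms: tuple[float, float] = (1.0, 10.0)
) -> tuple[float, float]:
    """Return (best, worst) added latency for `hops` sequential RPCs."""
    low, high = per_hop_ms
    return hops * low, hops * high


best, worst = chain_latency_ms(5)
print(f"5 sequential hops add {best:.0f}-{worst:.0f} ms before any real work")
```

Calls issued in parallel cost roughly the slowest hop rather than the sum, which is one reason to flatten deep call chains.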
20.5 When to split
Triggers for splitting a monolith:
- Team > ~30 engineers and deploys are mutually blocking.
- Subsystems have wildly different scale needs (e.g., a small core service + a 100x bigger search backend).
- Subsystems have different reliability needs (payment must be 99.99%; rest can be 99.9%).
- Subsystems are owned by different orgs.
- Different release cadences (mobile API vs internal admin).
20.6 How to split
Bounded contexts (DDD)
Identify domain boundaries: payments, orders, inventory, search. Each becomes a candidate service. Don't split by technical layer (web tier, business tier, data tier); that produces a distributed monolith.
Database per service
A core principle. Each service owns its DB. No other service reads it directly.
Why: you can change schema without coordinating; you can scale each DB independently; tenant data lives where it belongs.
Cost: cross-service queries become RPCs; joins disappear; data must be duplicated and kept eventually consistent.
Anti-patterns when splitting
- Distributed monolith: services that all change together. You got the cost without the benefit.
- Shared DB across services: tight coupling, no independence. Just merge them back.
- Sync RPC chains 5+ deep: latency catastrophe; one slow service = everything slow.
- Chatty interfaces: 30 RPCs per user request. Combine.
- Premature splitting: the right boundaries emerge from operating the system; carve them out as you learn.
20.7 Communication patterns
Synchronous (REST/gRPC)
Direct request-response. Easy to reason about. Tight coupling.
When: read paths, simple CRUD, where the caller needs an answer to proceed.
Asynchronous (queue / event)
Producer fires; consumer eventually processes.
When: writes that don't need an immediate response, fan-out, decoupling, batching.
Pub/sub
Service emits domain events; others subscribe.
When: multiple downstream interests, decoupling, integration without API contracts proliferating.
Saga
Workflow across services with compensations. (See chapter 16.)
Default mix
Read paths sync (RPC). Write paths often async with outbox + events. Critical workflows orchestrated via saga (Temporal, Step Functions).
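One way to picture "outbox + events" is the sketch below. It assumes a single SQLite database standing in for the service's own store; the table names, the `OrderPlaced` event shape, and the `publish` callback are all hypothetical. The point is that the business row and the event row commit in the same local transaction, and a separate relay publishes afterwards.

```python
import json
import sqlite3


def place_order(db: sqlite3.Connection, order_id: str, total_cents: int) -> None:
    """Write the order and its event in one local transaction."""
    with db:  # commits on success, rolls back if either INSERT fails
        db.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        payload = json.dumps(
            {"type": "OrderPlaced", "order_id": order_id, "total_cents": total_cents}
        )
        db.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))


def relay_outbox(db: sqlite3.Connection, publish) -> int:
    """Publish unsent outbox rows in insertion order; return how many were sent."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE sent = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        # If the process dies between publish and UPDATE, the row is sent again
        # on the next run, so consumers must tolerate duplicates.
        publish(json.loads(payload))
        with db:
            db.execute("UPDATE outbox SET sent = 1 WHERE id = ?", (row_id,))
    return len(rows)


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total_cents INTEGER NOT NULL)")
db.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "payload TEXT NOT NULL, "
    "sent INTEGER NOT NULL DEFAULT 0)"
)

published = []  # stand-in for a message broker
place_order(db, "o-1", 2500)
relay_outbox(db, published.append)
```

The relay gives at-least-once delivery, which is why the idempotency advice elsewhere in this chapter matters on the consumer side.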
20.8 API design between services
(See chapter 21.) Contracts are the boundary; treat them like public APIs.
- Versioning: never break consumers without notice.
- Backwards compatibility: schema evolution rules (chapter 8).
- Idempotency keys: assume retries.
- Timeouts everywhere: never block forever.
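To make "idempotency keys: assume retries" concrete, here is a minimal in-memory sketch. `PaymentService`, its `charge` method, and the response shape are hypothetical; a real service would keep the key store in durable storage and make the check-and-store step atomic.

```python
class PaymentService:
    """Hypothetical handler that deduplicates requests by idempotency key."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}  # idempotency key -> stored response
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A retry carrying a key we have already seen gets the stored response
        # back instead of triggering a second charge.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        response = {
            "charge_id": f"ch-{self.charges_made}",
            "amount_cents": amount_cents,
        }
        self._results[idempotency_key] = response
        return response


svc = PaymentService()
first = svc.charge("key-123", 2500)
retry = svc.charge("key-123", 2500)  # e.g. the client timed out and retried
```

Both calls return the same `charge_id`, and only one charge is recorded.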
20.9 Service discovery
How do services find each other?
Client-side (Eureka, Consul)
Client queries discovery; picks an instance; calls directly.
Server-side (DNS, K8s service)
Client calls a stable name; routing layer resolves to instance.
Service mesh (Envoy + control plane)
Sidecar proxy handles discovery transparently; app code talks to localhost.
In Kubernetes, every Service has a stable DNS name (payments.default.svc.cluster.local) that resolves to a virtual ClusterIP; kube-proxy rules (iptables or IPVS) then forward traffic to the backing pods.
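The client-side variant can be sketched in a few lines. This is an illustration only: `Registry` is a hypothetical in-memory stand-in for a discovery service, not the Eureka or Consul API, and the addresses are made up.

```python
import itertools


class Registry:
    """Hypothetical in-memory stand-in for a discovery service."""

    def __init__(self) -> None:
        self._instances: dict[str, list[str]] = {}

    def register(self, service: str, address: str) -> None:
        self._instances.setdefault(service, []).append(address)

    def lookup(self, service: str) -> list[str]:
        return list(self._instances.get(service, []))


class RoundRobinClient:
    """Client-side discovery: look up instances, pick one, call it directly."""

    def __init__(self, registry: Registry, service: str) -> None:
        self._registry = registry
        self._service = service
        self._counter = itertools.count()

    def pick(self) -> str:
        instances = self._registry.lookup(self._service)
        if not instances:
            raise LookupError(f"no instances registered for {self._service}")
        return instances[next(self._counter) % len(instances)]


registry = Registry()
registry.register("payments", "10.0.0.1:8080")
registry.register("payments", "10.0.0.2:8080")
client = RoundRobinClient(registry, "payments")
picks = [client.pick() for _ in range(3)]
```

With server-side discovery or a mesh, this selection logic moves out of the application and into the routing layer or sidecar.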
20.10 Service mesh
A layer that handles cross-cutting concerns for service-to-service traffic.
Components
- Data plane: sidecar proxy (Envoy, Linkerd-proxy) per service instance. Intercepts all traffic.
- Control plane: configures the data plane (Istiod, Linkerd control).
What it gives you
- mTLS between services (zero-trust networking).
- L7 routing (canary, A/B, traffic shifting).
- Retries, timeouts, circuit breakers, outlier detection.
- Telemetry (metrics, distributed traces) for free.
- Rate limiting.
- Policy (deny service-A → service-B).
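A mesh applies circuit breaking inside the sidecar rather than in application code, but the underlying logic is small. The sketch below is a generic illustration, not Envoy's or Linkerd's implementation; the class name, thresholds, and injectable `clock` are assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call re-opens the circuit; otherwise open it once
            # consecutive failures reach the threshold.
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast while the circuit is open keeps a struggling dependency from tying up callers, which is the fault-isolation benefit the list above refers to.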
Cost
- Operational complexity (sidecars, control plane, version upgrades).
- Latency (sidecar adds ~1ms).
- Resource usage (sidecar per pod = CPU/RAM).
- Debugging: another layer to look at.
When it's worth it
- Tens of services or more in production.
- Strong security requirements (zero-trust).
- Need for traffic shaping (canary across many services).
For 5-10 services, simple HTTP clients with retries + timeouts + good logging are usually enough.
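For a small fleet, the retry half of that advice fits in one helper. A minimal sketch, assuming the wrapped callable enforces its own timeout (for example by passing a `timeout` argument to the HTTP client); the function name, defaults, and the set of retryable exceptions are assumptions.

```python
import random
import time


def call_with_retries(
    fn,
    attempts=3,
    base_delay_s=0.1,
    max_delay_s=2.0,
    retry_on=(ConnectionError, TimeoutError),
    sleep=time.sleep,
):
    """Call fn(); on a retryable error, back off exponentially with jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Full jitter: sleep a random amount up to the capped backoff.
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

Only retry operations that are safe to repeat, and keep the attempt count low so retries do not amplify an outage.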
20.11 API gateway
Single entry point for external clients. Common in microservices.
Responsibilities:
- TLS termination
- Auth (OAuth, JWT validation)
- Rate limiting
- Request routing to services
- Request/response transformation
- Aggregation of multiple service calls
- Observability (logging, tracing)
Examples: Kong, Apigee, AWS API Gateway, Envoy Gateway, Tyk.
Risk: gateway becomes a god object. Keep logic minimal.
Backend-for-Frontend (BFF)
A gateway per client type (mobile-bff, web-bff). Lets each frontend get tailored APIs without overloading the general API.
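As an illustration of the BFF idea, the sketch below shapes the same backend data differently for two clients. The two `get_*` functions are stubs standing in for calls to hypothetical catalog and review services, and all field names are invented for the example.

```python
def get_product(product_id: str) -> dict:
    """Stub standing in for a call to a hypothetical catalog service."""
    return {
        "id": product_id,
        "name": "Kettle",
        "description": "A 1.7 litre electric kettle.",
        "price_cents": 3499,
        "image_urls": [
            "https://example.com/kettle-large.jpg",
            "https://example.com/kettle-thumb.jpg",
        ],
    }


def get_reviews(product_id: str) -> list[dict]:
    """Stub standing in for a call to a hypothetical review service."""
    return [
        {"rating": 5, "text": "Boils fast."},
        {"rating": 4, "text": "A bit loud."},
    ]


def mobile_product_page(product_id: str) -> dict:
    """mobile-bff: one small payload with only what the app screen shows."""
    product = get_product(product_id)
    reviews = get_reviews(product_id)
    ratings = [r["rating"] for r in reviews]
    return {
        "id": product["id"],
        "name": product["name"],
        "price_cents": product["price_cents"],
        "thumbnail": product["image_urls"][-1],
        "avg_rating": sum(ratings) / len(ratings) if ratings else None,
    }


def web_product_page(product_id: str) -> dict:
    """web-bff: the full product record plus every review."""
    return {**get_product(product_id), "reviews": get_reviews(product_id)}
```

Each BFF stays thin: it composes and trims, while business rules remain in the services behind it.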
20.12 Data ownership and eventual consistency
In microservices, you can't JOIN across service boundaries. Solutions:
- Replicated read models: service A subscribes to service B's events, builds a local view.
- API composition: in the read path, query both services and merge.
- CQRS + event sourcing: write to one model; project to many read models.
Consistency is eventual. Design UX for it ("your order is being processed" while the saga finishes).
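A replicated read model can be sketched as an event handler that maintains a local view. The example assumes a hypothetical shipping service consuming order events; the event types, the `event_id` field, and the class name are illustrative, and ordering problems are ignored.

```python
class OrderSummaryView:
    """Local read model built from another service's order events."""

    def __init__(self) -> None:
        self._orders: dict[str, dict] = {}
        self._seen_event_ids: set[str] = set()

    def apply(self, event: dict) -> None:
        # At-least-once delivery means duplicates; skip events already applied.
        if event["event_id"] in self._seen_event_ids:
            return
        self._seen_event_ids.add(event["event_id"])

        order_id = event["order_id"]
        if event["type"] == "OrderPlaced":
            self._orders[order_id] = {
                "status": "placed",
                "total_cents": event["total_cents"],
            }
        elif event["type"] == "OrderCancelled" and order_id in self._orders:
            self._orders[order_id]["status"] = "cancelled"

    def get(self, order_id: str):
        """Answer reads locally, with no call to the order service."""
        return self._orders.get(order_id)


view = OrderSummaryView()
events = [
    {"event_id": "e1", "type": "OrderPlaced", "order_id": "o-1", "total_cents": 2500},
    {"event_id": "e1", "type": "OrderPlaced", "order_id": "o-1", "total_cents": 2500},
    {"event_id": "e2", "type": "OrderCancelled", "order_id": "o-1"},
]
for event in events:
    view.apply(event)
```

The view lags the source by however long events take to arrive, which is the eventual consistency the UX has to account for.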
20.13 Operational maturity required
Microservices demand:
- Containerization (Docker).
- Orchestration (Kubernetes, ECS, Nomad).
- CI/CD per service.
- Centralized logging, metrics, tracing (the three pillars; chapter 26).
- Service catalog / inventory.
- Incident response across teams.
- SLOs and error budgets per service.
Without these, microservices will pull a small team underwater fast.
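An error budget is simply the downtime an SLO permits over a window. A small worked calculation, assuming a 30-day window and availability measured in wall-clock minutes (the function name is hypothetical):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO like 0.999."""
    return (1.0 - slo) * window_days * 24 * 60


print(f"{error_budget_minutes(0.999):.1f}")   # 43.2
print(f"{error_budget_minutes(0.9999):.2f}")  # 4.32
```

The tenfold gap between 99.9% and 99.99% is why subsystems with different reliability targets are candidates for separate services.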
20.14 Practical migration strategy
Strangler fig: incrementally extract services from a monolith.
- Identify a bounded context.
- Build the new service alongside the monolith.
- Route traffic for that context to new service via a façade.
- Once stable, delete monolith code.
- Repeat.
Don't rewrite the whole thing in one shot. Big bang rewrites are graveyards.
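The façade in the steps above is essentially a routing table that grows one entry per extracted context. A minimal sketch, assuming routing by URL path prefix; the class name and the internal URLs are hypothetical.

```python
class StranglerFacade:
    """Routes each request path to an extracted service or to the monolith."""

    def __init__(self, monolith_url: str) -> None:
        self._monolith_url = monolith_url
        self._extracted: dict[str, str] = {}  # path prefix -> new service URL

    def extract(self, prefix: str, service_url: str) -> None:
        """Start sending traffic under `prefix` to the new service."""
        self._extracted[prefix.rstrip("/")] = service_url

    def upstream_for(self, path: str) -> str:
        # Match whole path segments so "/payments" does not capture
        # "/payments-legacy"; the longest matching prefix wins.
        matches = [
            p for p in self._extracted if path == p or path.startswith(p + "/")
        ]
        if not matches:
            return self._monolith_url  # not extracted yet
        return self._extracted[max(matches, key=len)]


facade = StranglerFacade("http://monolith.internal")
facade.extract("/payments", "http://payments.internal")
```

Rolling back an extraction is then a one-line change: remove the prefix and traffic returns to the monolith.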
20.15 Google's approach
Google internally uses many small services with an extremely sophisticated ecosystem:
- Borg for orchestration (Kubernetes is its open-source descendant).
- Stubby for RPC (gRPC is its open-source successor).
- Protocol Buffers for serialization.
- Chubby (lock service) for coordination.
- Spanner / Bigtable for storage.
- Monorepo so cross-service changes are atomic at the source level (in one PR).
- Strong observability and SRE practices.
The monorepo + atomic cross-service changes addresses the biggest microservices pain point (coordinated changes). Most companies don't have this.
Key takeaways
- Monoliths are great until they aren't. Modular monoliths buy a long runway.
- Microservices win at scale (people and load) at high operational cost.
- Database per service is foundational; cross-service joins are forbidden.
- Service mesh handles cross-cutting concerns at scale; overkill for small fleets.
- Strangler fig over big bang rewrites.