29. Deployment Patterns: Blue/Green, Canary, Feature Flags
Shipping code is a system design problem. Bad deployment practices are how systems with great architecture still take outages.
~6 min read·updated 5/29/2026
29.1 The principles
- Deployments should be boring: predictable, reversible, observable.
- Smaller is safer: each PR is its own change; ship small often.
- Decouple deploy from release: ship the code, then turn it on later (feature flags).
- Always have a rollback plan: if you can't roll back in <5 min, you're flying blind.
29.2 Deployment models
Big bang
Everyone, all at once. Don't.
Rolling update
Update N pods at a time; wait for healthy; continue. Default for K8s Deployments (batch size is set by maxSurge / maxUnavailable). Zero downtime if pods cooperate: readiness probes gate traffic and shutdown drains in-flight requests.
Pros: simple; gradual. Cons: brief mixed state (old + new); rollback requires re-deploy; bug exposed to all once roll completes.
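The rolling loop can be sketched in a few lines. This is a toy model, not how the Kubernetes controller is implemented: the pod names, the `rolling_update` function, and the `is_healthy` callback are all hypothetical.

```python
from typing import Callable, Dict, List


def rolling_update(
    pods: Dict[str, str],
    new_version: str,
    batch_size: int,
    is_healthy: Callable[[str, str], bool],
) -> bool:
    """Update pods in batches; stop at the first unhealthy batch.

    `pods` maps a pod name to its running version and is mutated in place.
    Returns True if every pod reached `new_version`.
    """
    names: List[str] = sorted(pods)
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        for name in batch:
            pods[name] = new_version  # replace the pod with the new version
        # Wait-for-healthy step: a real controller polls readiness probes here.
        if not all(is_healthy(name, new_version) for name in batch):
            return False  # halt; pods not yet reached still run the old version
    return True


pods = {"pod-a": "v1", "pod-b": "v1", "pod-c": "v1", "pod-d": "v1"}
# Hypothetical health check: pod-c never becomes ready on the new version.
ok = rolling_update(pods, "v2", batch_size=1, is_healthy=lambda name, version: name != "pod-c")
print(ok, pods)
```

The halted run leaves `pod-d` on `v1` next to three `v2` pods, which is the mixed state listed under cons.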
Blue/green
Run two identical environments. Blue is live; deploy to green. Smoke test. Flip the LB / DNS to green.
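Pros: instant rollback (flip back); test in a prod-like environment before exposing users. Cons: 2× resource cost during deploy; data migration is tricky (either one DB serves both environments, so the schema must work with both versions, or two DBs must be kept in sync).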
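A minimal sketch of the flip, assuming a toy router in place of a real LB or DNS record. `BlueGreenRouter` and its methods are hypothetical names used for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class BlueGreenRouter:
    """Toy load balancer that sends all traffic to one of two environments."""

    versions: Dict[str, str] = field(default_factory=lambda: {"blue": "v1", "green": "v1"})
    live: str = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy(self, version: str, smoke_test: Callable[[str], bool]) -> bool:
        """Deploy to the idle environment, smoke test it, then flip traffic."""
        target = self.idle
        self.versions[target] = version
        if not smoke_test(version):
            return False  # the live environment was never touched
        self.live = target  # the flip is a single pointer change
        return True

    def rollback(self) -> None:
        """Flip back; the previous environment still runs the old version."""
        self.live = self.idle

    def serving(self) -> str:
        return self.versions[self.live]


router = BlueGreenRouter()
router.deploy("v2", smoke_test=lambda version: True)
after_deploy = (router.live, router.serving())
router.rollback()
after_rollback = (router.live, router.serving())
print(after_deploy, after_rollback)
```

Rollback is cheap here only because the old environment is left running and untouched until the new one has proven itself.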
Canary
Deploy to a small subset (1%, 5%, 25%, 100% over hours/days). Watch metrics. Promote or roll back.
Pros: detect issues early; small blast radius. Cons: needs traffic-shifting infrastructure; per-tenant skew possible; complex.
Tools: Argo Rollouts, Flagger, Istio, AWS Lambda traffic shifting.
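The promote-or-roll-back decision can be sketched as a loop over traffic stages. This is an illustration, not the logic of any of the tools above: `run_canary`, the `error_rate_at` callback, and the 1% error threshold are assumptions.

```python
from typing import Callable, Iterable, Tuple


def run_canary(
    stages: Iterable[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> Tuple[str, int]:
    """Walk traffic percentages; roll back at the first stage over the threshold.

    Returns ("promoted", <last stage>) or ("rolled_back", <stage that failed>).
    """
    reached = 0
    for percent in stages:
        reached = percent
        # In practice: shift traffic, soak for a while, then query metrics.
        observed = error_rate_at(percent)
        if observed > max_error_rate:
            return ("rolled_back", percent)
    return ("promoted", reached)


# Hypothetical metrics: the bug only shows up once 25% of traffic hits it.
result = run_canary(
    stages=(1, 5, 25, 100),
    error_rate_at=lambda percent: 0.001 if percent < 25 else 0.05,
    max_error_rate=0.01,
)
print(result)
```

In this run the rollout stops at 25%, so three quarters of users never see the faulty version.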
A/B testing
Like canary but for behavior comparison: 50/50 split between variant A and B; measure outcomes (CTR, conversion). Often paired with feature flags.
Shadow / mirror traffic
Send a copy of real traffic to the new version; ignore responses. Catch performance/correctness issues without user impact.
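A minimal sketch of mirroring, assuming both versions are plain callables. Real mirroring usually happens asynchronously at the proxy or mesh layer, and the shadow must not cause side effects such as writes or emails. `handle_with_shadow` and the `mismatches` log are hypothetical.

```python
from typing import Callable, List, Tuple


def handle_with_shadow(
    request: str,
    primary: Callable[[str], str],
    shadow: Callable[[str], str],
    mismatches: List[Tuple[str, str, str]],
) -> str:
    """Serve from `primary`; also send the request to `shadow` and record differences."""
    response = primary(request)
    try:
        shadow_response = shadow(request)
        if shadow_response != response:
            mismatches.append((request, response, shadow_response))
    except Exception as exc:  # a shadow failure must never reach the user
        mismatches.append((request, response, f"error: {exc}"))
    return response  # the shadow response is discarded


def new_version(request: str) -> str:
    """Hypothetical new build with a bug on one input."""
    if request == "bug":
        raise ValueError("boom")
    return request.upper()


mismatches: List[Tuple[str, str, str]] = []
replies = [handle_with_shadow(r, str.upper, new_version, mismatches) for r in ("a", "bug")]
print(replies, mismatches)
```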
29.3 Feature flags (toggles)
Decouple deploy from release. Code ships dark; turn on per-tenant, per-user, per-percentage.
Types
- Release toggle: temporary, for rollout (delete after stable).
- Experiment toggle: A/B test.
- Ops toggle: kill switch for a feature under load.
- Permission toggle: gate by plan/role (long-lived).
Tools
- LaunchDarkly, Statsig, Optimizely, ConfigCat, Unleash.
- GrowthBook (open source).
- DIY: a config service with rules + cache.
Best practices
- Keep flags short-lived: dead flags are debt. Burn them down quarterly.
- Document each flag: owner, purpose, expected removal date.
- Test both branches.
- Fail safe: if flag service is down, default to safe (usually "off").
- Cache with short TTL: don't hammer the flag service per request.
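The last two practices fit in one small client. This is a sketch under stated assumptions: `FlagClient` and the `fetch` callback are hypothetical stand-ins for a vendor SDK, and the 30-second TTL is arbitrary.

```python
import time
from typing import Callable, Dict, Tuple


class FlagClient:
    """Caches flag values for `ttl_seconds` and falls back to a safe default on errors."""

    def __init__(
        self,
        fetch: Callable[[str], bool],
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch  # hypothetical call to the flag service
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[bool, float]] = {}

    def is_enabled(self, name: str, default: bool = False) -> bool:
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # fresh cache hit: no call to the flag service
        try:
            value = self._fetch(name)
        except Exception:
            return default  # flag service is down: fail safe
        self._cache[name] = (value, now)
        return value
```

The injectable `clock` exists only to make TTL expiry testable. A production client might also serve the last known value instead of the default when the service is down; that is a design choice, not something this sketch settles.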
Example progressive rollout
- Deploy code with flag off.
- Enable for internal users.
- Enable for 1% of customers.
- Watch metrics.
- 5%, 25%, 50%, 100%.
- Remove flag in next release.
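Percentage steps like these are commonly implemented by hashing each user into a stable bucket, so a user who is in at 1% stays in at 5% and beyond. The sketch below shows one way to do it; the function names and the flag name are hypothetical.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """True if this user falls inside the first `percent` buckets."""
    return bucket(flag, user_id) < percent


users = [f"user-{i}" for i in range(10_000)]
enabled_at_5 = {u for u in users if in_rollout("new-checkout", u, 5)}
enabled_at_25 = {u for u in users if in_rollout("new-checkout", u, 25)}
print(len(enabled_at_5), len(enabled_at_25), enabled_at_5 <= enabled_at_25)
```

Including the flag name in the hash keeps the same users from always being the first to receive every new feature.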
29.4 Database migrations
The hardest deploy. Code and schema must be compatible across the rollout window.
Expand / migrate / contract
- Expand: add new column nullable; deploy code that writes both old and new.
- Migrate: backfill old rows.
- Contract: deploy code that reads and writes only the new column; once no running version touches the old column, drop it.
Each step backwards-compatible with the running version of the code.
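The three phases can be walked through with an in-memory SQLite database standing in for the production DB. The `users` table, the `name` and `full_name` columns, and the batch size are hypothetical; a real backfill would also throttle between batches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("Ada Lovelace",), ("Alan Turing",)])

# Expand: add the new column as nullable, so old code that ignores it keeps working.
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")


def create_user(name: str) -> None:
    """Dual-write code deployed during the expand phase."""
    conn.execute("INSERT INTO users (name, full_name) VALUES (?, ?)", (name, name))


create_user("Grace Hopper")


def backfill(batch_size: int = 1) -> int:
    """Migrate: copy old values into the new column in small batches."""
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET full_name = name WHERE id IN ("
            "SELECT id FROM users WHERE full_name IS NULL AND name IS NOT NULL LIMIT ?)",
            (batch_size,),
        )
        if cur.rowcount == 0:
            return total
        total += cur.rowcount


backfilled = backfill()

# Contract: new code reads only the new column; dropping `name` comes later.
names = [row[0] for row in conn.execute("SELECT full_name FROM users ORDER BY id")]
print(backfilled, names)
```

Only the two rows written before the expand step need backfilling; the dual-write path already populated the third.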
Online migration tools
For huge tables:
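- gh-ost (GitHub, for MySQL): triggerless; copies rows in chunks to a shadow table and applies concurrent changes by tailing the binlog.
- pt-online-schema-change (Percona, for MySQL): copies rows to a new table and keeps it in sync with triggers.
- pg_repack (Postgres): rebuilds bloated tables and indexes online, like VACUUM FULL or CLUSTER but without holding a long exclusive lock.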
- Postgres logical replication for major version upgrades.
Locking gotchas
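In Postgres, ALTER TABLE ... ADD COLUMN without a default is fast (metadata only). ADD COLUMN ... DEFAULT with a constant default is also metadata-only since PG 11; a volatile default (for example, clock_timestamp()) still rewrites the table.
CREATE INDEX CONCURRENTLY doesn't block writes; a plain CREATE INDEX does. Always use CONCURRENTLY in prod. It cannot run inside a transaction block, and a failed build leaves an invalid index that must be dropped.
ALTER TABLE needs an ACCESS EXCLUSIVE lock, so it waits for transactions already using the table. While it waits, every later query on that table queues behind it, so one long-running query can stall all traffic to the table. Set lock_timeout (a few seconds) so the migration aborts rather than blocks.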
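One way to apply the lock_timeout advice is a small retry wrapper around the DDL statement. This is a sketch: `execute` is a hypothetical callable that runs one SQL statement on an open Postgres connection, and `LockTimeout` stands in for whatever error the driver raises when the timeout expires. The retry count and backoff are arbitrary.

```python
import time
from typing import Callable


class LockTimeout(Exception):
    """Stand-in for the driver error raised when lock_timeout expires."""


def run_ddl_with_lock_timeout(
    execute: Callable[[str], None],
    ddl: str,
    lock_timeout: str = "3s",
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `ddl` under a lock_timeout, retrying if the lock is not acquired in time.

    Returns the number of attempts used; re-raises after `max_attempts` failures.
    """
    for attempt in range(1, max_attempts + 1):
        execute(f"SET lock_timeout = '{lock_timeout}'")
        try:
            execute(ddl)
            return attempt
        except LockTimeout:
            if attempt == max_attempts:
                raise
            sleep(2.0 * attempt)  # back off so the long-running query can finish
    raise ValueError("max_attempts must be at least 1")
```

Each failed attempt gives up the lock queue position, so ordinary queries are delayed by at most the timeout rather than by the full length of the long-running query.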
29.5 CI/CD pipeline
CI (Continuous Integration)
- Every PR: lint, type check, unit tests, security scans.
- Build artifacts (container images, binaries).
- Push to registry.
CD (Continuous Delivery / Deployment)
- Delivery = artifacts ready to deploy; manual gate.
- Deployment = automatic to prod after CI passes.
Modern stacks: GitHub Actions, GitLab CI, CircleCI, Buildkite, Jenkins, Tekton.
Pipelines
- Build → test → security scan → push image → deploy to staging → integration tests → canary → full rollout.
- Approval gates: production usually requires explicit human "yes."
GitOps
The desired cluster state lives in git; an operator (Argo CD, Flux) reconciles cluster to git. Deploys = git pushes. Audit trail for free.
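The reconcile step is a diff between desired and actual state followed by applying that diff. The sketch below models state as a plain dict of resource name to image tag; it illustrates the idea only and is not how Argo CD or Flux are implemented. All names are hypothetical.

```python
from typing import Dict, List, Tuple

State = Dict[str, str]  # resource name -> image tag


def plan(desired: State, actual: State) -> List[Tuple[str, str]]:
    """Return the (action, resource) steps needed to make `actual` match `desired`."""
    steps: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != desired[name]:
            steps.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            steps.append(("delete", name))
    return steps


def reconcile(desired: State, actual: State) -> State:
    """Apply the plan to a copy of `actual`; the result equals `desired`."""
    result = dict(actual)
    for action, name in plan(desired, actual):
        if action == "delete":
            del result[name]
        else:
            result[name] = desired[name]
    return result


git_state = {"api": "api:1.4.0", "worker": "worker:2.0.1"}
cluster_state = {"api": "api:1.3.9", "legacy-cron": "cron:0.9.0"}
steps = plan(git_state, cluster_state)
print(steps)
```

Because the loop converges on whatever git says, reverting a commit is also how a deploy gets rolled back.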
29.6 Rollback
Required: any deploy you can't roll back in <5 min is too risky.
- Code: previous container image is still in registry; redeploy.
- DB schema: backwards-compatible; old code still works.
- Data: backwards migrations are rare; usually you tolerate the new state, fix forward.
- Feature flag: instant rollback for code already deployed.
Rollback hierarchy:
- Toggle flag (seconds).
- Re-deploy previous version (minutes).
- Restore from backup (hours; data loss possible).
29.7 Zero-downtime rollouts
Requires:
- Backwards-compatible API and DB: old and new versions coexist briefly.
- Health checks (readiness): only ready pods get traffic.
- Connection draining: in-flight requests finish.
- Surge capacity: extra pods during rollout (configurable in K8s Deployment).
For long-lived connections (WebSockets):
- New connections to new pods; old connections persist until natural close.
- Or proactive client reconnect on a signal.
- Or sticky session affinity through deploy.
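Readiness and connection draining can be modeled with a counter of in-flight requests. This is a sketch, not a real server: `DrainingServer` is a hypothetical class, and in practice `shutdown` would be triggered by SIGTERM with a timeout shorter than the pod's termination grace period.

```python
import threading


class DrainingServer:
    """Tracks in-flight requests so shutdown can wait for them to finish."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._accepting = True  # what a readiness probe would report

    def ready(self) -> bool:
        with self._cond:
            return self._accepting

    def begin_request(self) -> bool:
        with self._cond:
            if not self._accepting:
                return False  # caller should route to another pod
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def shutdown(self, timeout: float) -> bool:
        """Stop accepting, then wait up to `timeout` seconds for in-flight work.

        Returns True if everything drained, False if the timeout expired first.
        """
        with self._cond:
            self._accepting = False
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout)
```

Flipping readiness before waiting is the important ordering: new traffic stops arriving first, then existing requests are given time to complete.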
29.8 Twelve-factor app (relevant principles)
- Config in env, not code.
- Stateless processes.
- Disposability: fast startup, graceful shutdown.
- Logs as event streams (don't write to local files).
- Same image, multiple environments via env config.
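"Config in env" usually means one typed settings object built at startup. In this sketch the variable names (`DATABASE_URL`, `LOG_LEVEL`, `REQUEST_TIMEOUT_S`) and the defaults are assumptions chosen for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    log_level: str
    request_timeout_s: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables; fail fast on a missing required value."""
    try:
        database_url = env["DATABASE_URL"]
    except KeyError:
        raise RuntimeError("DATABASE_URL is required") from None
    return Settings(
        database_url=database_url,
        log_level=env.get("LOG_LEVEL", "INFO"),
        request_timeout_s=float(env.get("REQUEST_TIMEOUT_S", "5")),
    )


# The same image runs everywhere; only the environment differs.
staging = load_settings({"DATABASE_URL": "postgres://staging-db/app", "LOG_LEVEL": "DEBUG"})
prod = load_settings({"DATABASE_URL": "postgres://prod-db/app", "REQUEST_TIMEOUT_S": "2.5"})
print(staging, prod)
```

Failing at startup on a missing value turns a config mistake into a failed readiness check instead of a runtime error under traffic.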
29.9 Configuration management
- Static config in env vars or files baked into deploy.
- Dynamic config (feature flags, runtime tuning) in a config service.
- Secrets via KMS / Secrets Manager (chapter 27).
- Versioned and reviewed: config changes through PRs, just like code.
Config errors cause many outages (Cloudflare's BGP outage, Facebook's 2021 outage). Treat config like production code.
29.10 Disaster Recovery (DR)
What happens when an entire region dies?
RTO and RPO
- RTO (Recovery Time Objective): how fast you must be back up. Hours? Minutes?
- RPO (Recovery Point Objective): how much data loss tolerable. Minutes? Zero?
Strategies (cost ascending)
- Backup & restore: backup elsewhere; restore on disaster. RTO hours, RPO hours.
- Pilot light: minimal warm copy in another region. RTO minutes, RPO minutes.
- Warm standby: scaled-down full env. RTO minutes, RPO seconds.
- Hot active-active: full capacity in both. RTO seconds, RPO seconds. Most expensive.
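Because the list is ordered by cost, choosing a strategy amounts to taking the first entry that meets both objectives. The sketch below encodes the rough orders of magnitude from the list; the coarse `seconds` / `minutes` / `hours` scale and the function name are simplifications for illustration.

```python
from typing import List, Tuple

ORDER = {"seconds": 0, "minutes": 1, "hours": 2}

# (name, RTO, RPO), cheapest first, as listed above.
STRATEGIES: List[Tuple[str, str, str]] = [
    ("backup & restore", "hours", "hours"),
    ("pilot light", "minutes", "minutes"),
    ("warm standby", "minutes", "seconds"),
    ("hot active-active", "seconds", "seconds"),
]


def cheapest_strategy(max_rto: str, max_rpo: str) -> str:
    """Return the cheapest strategy whose RTO and RPO both meet the requirement."""
    for name, rto, rpo in STRATEGIES:
        if ORDER[rto] <= ORDER[max_rto] and ORDER[rpo] <= ORDER[max_rpo]:
            return name
    raise AssertionError("unreachable: hot active-active meets every requirement")


print(cheapest_strategy("minutes", "seconds"))
```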
Test your DR
Untested DR is fiction. Run drills quarterly.
29.11 Multi-region deploy
Independent regions means independent failures. Design for it.
Read-only replica + write region
Reads local; writes cross-region. DR = promote replica.
Active-active with conflict resolution
Multi-leader replication (chapter 10) with a conflict-resolution strategy such as last-writer-wins, which silently discards the losing write. Hard.
Cell-based architecture
Many independent "cells" (each a full stack with its own DB). User assigned to a cell. Failure of one cell affects only its users. Used by AWS internally and at very large scale.
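Cell assignment can be sketched as a stable hash of the user ID. This is a simplification: modulo hashing reshuffles users whenever the number of cells changes, so real systems often store the assignment in a lookup table instead. The cell names and functions are hypothetical.

```python
import hashlib
from typing import Iterable, List, Sequence


def assign_cell(user_id: str, cells: Sequence[str]) -> str:
    """Deterministically map a user to one cell."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return cells[int.from_bytes(digest[:8], "big") % len(cells)]


def affected_users(user_ids: Iterable[str], cells: Sequence[str], failed_cell: str) -> List[str]:
    """Users whose requests fail when `failed_cell` is down: the blast radius."""
    return [u for u in user_ids if assign_cell(u, cells) == failed_cell]


cells = ["cell-1", "cell-2", "cell-3", "cell-4"]
users = [f"user-{i}" for i in range(1000)]
hit = affected_users(users, cells, "cell-2")
print(f"{len(hit)} of {len(users)} users affected")
```

With four cells, losing one affects roughly a quarter of users and leaves the rest untouched.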
29.12 Observability of deployments
Each deploy should be tagged in:
- Logs: deploy_id, git_sha.
- Metrics: deploy markers on dashboards.
- Traces: version label.
When latency spikes at 14:32, you ask "what deployed at 14:30?" and find it instantly.
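With deploys recorded as structured events, answering that question is a time-window lookup. In this sketch the `Deploy` record, the sample data, and the 15-minute window are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, NamedTuple


class Deploy(NamedTuple):
    deploy_id: str
    git_sha: str
    finished_at: datetime


def deploys_before(
    deploys: Iterable[Deploy],
    incident_at: datetime,
    window: timedelta = timedelta(minutes=15),
) -> List[Deploy]:
    """Deploys that finished within `window` before the incident, newest first."""
    recent = [d for d in deploys if incident_at - window <= d.finished_at <= incident_at]
    return sorted(recent, key=lambda d: d.finished_at, reverse=True)


history = [
    Deploy("d-101", "a1b2c3d", datetime(2026, 1, 1, 9, 5)),
    Deploy("d-102", "e4f5a6b", datetime(2026, 1, 1, 14, 30)),
    Deploy("d-103", "c7d8e9f", datetime(2026, 1, 1, 16, 0)),
]
suspects = deploys_before(history, incident_at=datetime(2026, 1, 1, 14, 32))
print([d.deploy_id for d in suspects])
```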
29.13 Postmortems
After every incident: blameless postmortem. What happened, why, what we'll change.
- Action items with owners and dates.
- Track to completion.
- Share widely; learnings compound.
Google's SRE book is the canonical reference for postmortem culture.
29.14 What an interviewer wants
- "Rolling update with health checks for routine; canary for risky changes; blue/green when state allows."
- Feature flags for risky features and progressive rollout.
- Decouple deploy from release.
- Schema migration strategy: expand/migrate/contract.
- Rollback plan + observability of the deploy.
Key takeaways
- Smaller, more frequent deploys are safer than big bangs.
- Rolling = default. Canary = risky changes. Blue/green = when state (a shared DB) allows it.
- Feature flags decouple deploy from release; kill switch in seconds.
- DB schema migrations: expand → migrate → contract for backwards compatibility.
- Always have a rollback plan; always test DR; always do blameless postmortems.