26. Observability: Logs, Metrics, Traces, SLOs
~7 min read·updated 5/29/2026
You can't operate what you can't see. Observability = the ability to ask arbitrary questions about your system's behavior in production. The "three pillars" are logs, metrics, traces — but the goal is unified investigation, not three separate tools.
26.1 Monitoring vs observability
- Monitoring: predefined dashboards, alerts on known failure modes.
- Observability: ability to investigate new failure modes ad hoc — by correlating signals you didn't predict needing.
You need both. Monitoring catches known unknowns; observability handles unknown unknowns.
26.2 Pillar 1: Logs
Time-stamped records of events.
Log levels
- ERROR: actionable failures.
- WARN: anomalies; worth attention if they form a pattern.
- INFO: significant events (request started, deploy done).
- DEBUG: detailed flow; usually off in prod.
- TRACE: very detailed; rare.
Structured logging
JSON, not free text. Fields: timestamp, level, service, trace_id, user_id, request_id, event, plus arbitrary structured data.
{"ts":"2026-05-04T10:00:00Z","level":"info","service":"api","trace_id":"abc","event":"user.login","user_id":42,"latency_ms":120}
Reasons:
- Searchable / filterable in any logging system.
- Aggregatable (count, group, sum).
- Stable schema beats grep regex.
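As a minimal sketch of the idea, the standard-library `logging` module can be made to emit one JSON object per line. The `JsonFormatter` class, the `build_logger` helper, and the convention of passing structured data under `extra={"fields": ...}` are illustrative assumptions, not an established API.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "event": record.getMessage(),
        }
        # Structured data passed via extra={"fields": {...}} is merged in as-is.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("api")
log.info("user.login", extra={"fields": {"trace_id": "abc", "user_id": 42, "latency_ms": 120}})
```

Because every line parses as JSON with the same top-level keys, a log backend can filter on `trace_id` or aggregate on `event` without regex parsing.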
Log aggregation
Centralize. Local files don't scale.
- ELK / Elastic Stack: Logstash → Elasticsearch → Kibana.
- Loki (Grafana): logs indexed by labels only; cheap.
- Splunk: enterprise; expensive.
- Datadog, Honeycomb, New Relic: SaaS.
- CloudWatch Logs, Google Cloud Logging (formerly Stackdriver): cloud-native.
Log volume problem
A 1000-server fleet where each server handles 10K req/sec and logs 1 KB per request writes 10 MB/sec per server, or 10 GB/sec across the fleet (about 864 TB/day). That's $$$$$ in storage. Mitigations:
- Sampling: log 1% of successful requests, all errors.
- Drop debug in prod.
- Tiered storage: hot 7 days, warm 30 days, cold archive 1 year.
- Aggregation at source: don't log per-request what could be a metric.
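The volume arithmetic and the sampling mitigation can be sketched as follows. The function names are hypothetical, 1 KB is taken as 1000 bytes, and treating only 5xx statuses as errors is an assumption; a real service might also keep 4xx responses.

```python
import hashlib

SUCCESS_SAMPLE_RATE = 0.01  # keep 1% of successful requests


def fleet_log_bytes_per_sec(servers: int, req_per_sec_per_server: int, bytes_per_req: int) -> int:
    """Raw log volume for the whole fleet, before any sampling."""
    return servers * req_per_sec_per_server * bytes_per_req


def should_log(request_id: str, status: int, sample_rate: float = SUCCESS_SAMPLE_RATE) -> bool:
    """Keep every error; keep a deterministic fraction of successes."""
    if status >= 500:  # assumption: 5xx counts as an error
        return True
    # Hash the request ID so the same request always gets the same decision.
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


total = fleet_log_bytes_per_sec(servers=1000, req_per_sec_per_server=10_000, bytes_per_req=1_000)
print(f"{total / 1e9:.0f} GB/sec unsampled")
```

Hashing the request ID rather than calling a random number generator means a retried or replayed request is sampled consistently, which keeps related log lines together.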
When to log
- Request boundaries (start, end, status).
- Errors and exceptions (with stack traces).
- State transitions in workflows.
- Significant business events (signup, purchase).
When NOT to log
- Anything that can be a metric (counts, latencies). Use a counter, not a log.
- Sensitive data (PII, secrets, tokens). Redact at source.
- Inside loops: one log line per iteration turns a single request into N+1 log lines.
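Redacting at source can be as simple as masking known-sensitive keys before the record reaches the logger. This is a sketch under the assumption of a fixed deny-list; `SENSITIVE_KEYS` and `redact` are hypothetical names, and key-based masking will not catch secrets embedded inside free-text values.

```python
from typing import Any

# Hypothetical deny-list; a real service would derive this from its own data model.
SENSITIVE_KEYS = {"password", "token", "authorization", "api_key", "ssn", "email"}
REDACTED = "[REDACTED]"


def redact(value: Any) -> Any:
    """Return a copy of `value` with sensitive dict keys masked, recursing into dicts and lists."""
    if isinstance(value, dict):
        return {
            key: REDACTED if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


fields = {"user_id": 42, "token": "s3cr3t", "profile": {"Email": "a@example.com"}}
print(redact(fields))
```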
26.3 Pillar 2: Metrics
Numerical measurements over time. Cheap, aggregatable, alertable.
Types (Prometheus model)
- Counter: monotonic; only goes up. Example: `requests_total`.
- Gauge: arbitrary value. Examples: `queue_depth`, `cpu_usage`.
- Histogram: distribution of values; bucketed. Example: `request_duration_seconds`.
- Summary: client-side computed quantiles; harder to aggregate.
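To make the semantics of the first three types concrete, here is a toy in-memory model. It is not the Prometheus client library; the class names mirror the concepts only, and the histogram follows the "less than or equal" cumulative bucket convention.

```python
import bisect
from typing import List, Sequence


class Counter:
    """Monotonic: can only be incremented."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


class Gauge:
    """Arbitrary value that can be set to anything at any time."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations per bucket; the final bucket is +Inf."""

    def __init__(self, upper_bounds: Sequence[float]) -> None:
        self.bounds = sorted(upper_bounds)
        self.counts = [0] * (len(self.bounds) + 1)
        self.sum = 0.0

    def observe(self, value: float) -> None:
        # bisect_left finds the first bound >= value, i.e. the "le" bucket.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value

    def cumulative(self) -> List[int]:
        """Cumulative counts, as a Prometheus-style exposition would report them."""
        totals, running = [], 0
        for count in self.counts:
            running += count
            totals.append(running)
        return totals


request_duration_seconds = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.1, 0.3, 2.0):
    request_duration_seconds.observe(seconds)
print(request_duration_seconds.cumulative())
```

Because bucket counts are plain integers, histograms from many instances can be summed bucket by bucket, which is why they aggregate well and client-side quantiles do not.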
Cardinality
Each unique label combination = a separate time series. Cardinality explodes:
- `requests_total{status, route, user_id}` with 1M users → 1M+ series. Death.
- Keep label cardinality low. user_id is not a metric label; it's a log field.
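A quick worst-case estimate shows why one unbounded label dominates everything else. The label counts below (5 statuses, 50 routes) are made-up illustrative numbers, and the product is an upper bound since not every combination occurs in practice.

```python
from math import prod
from typing import Dict


def max_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of distinct values per label."""
    return prod(label_cardinalities.values())


bounded = {"status": 5, "route": 50}
unbounded = {**bounded, "user_id": 1_000_000}

print(max_series(bounded))    # bounded labels only
print(max_series(unbounded))  # same metric with user_id added
```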
Tools
- Prometheus: pull-based; scrapes `/metrics` endpoints. Industry standard.
- Graphite: push-based; older.
- InfluxDB / TimescaleDB: time-series DBs.
- Datadog / New Relic / Honeycomb: SaaS.
- OpenTelemetry: vendor-neutral instrumentation; emits to any backend.
What to measure (RED method)
- Rate: req/sec.
- Errors: error rate.
- Duration: latency histogram.
USE method (for resources)
- Utilization: % of capacity used.
- Saturation: queue depth, wait.
- Errors: error count.
Together, RED + USE covers most operational concerns.
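As a rough illustration of RED, the three numbers can be computed from raw request records for one time window. Production systems usually derive them from counters and histogram buckets instead; `RequestRecord`, `red_summary`, and the 5xx error rule are assumptions for the sketch.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RequestRecord:
    duration_s: float
    status: int


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests: Sequence[RequestRecord], window_s: float) -> dict:
    """Rate, Errors, Duration for the requests seen in one window."""
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = error
    p99: Optional[float] = (
        percentile([r.duration_s for r in requests], 99) if requests else None
    )
    return {
        "rate_per_s": len(requests) / window_s,
        "error_ratio": errors / len(requests) if requests else 0.0,
        "p99_s": p99,
    }


sample = [RequestRecord(i / 100, 500 if i <= 5 else 200) for i in range(1, 101)]
print(red_summary(sample, window_s=10))
```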
26.4 Pillar 3: Traces
A trace = the journey of one request through the system.
Anatomy
- Trace: a tree of spans, identified by `trace_id`.
- Span: one unit of work (RPC call, DB query). Has `start`, `end`, `name`, `parent_span_id`, attributes.
Propagation
The trace_id flows through every service via headers (W3C Trace Context: traceparent, tracestate).
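To show what actually travels in the header, here is a small sketch that parses a version-00 `traceparent` value and builds the header for an outgoing call. The helper names are hypothetical, `tracestate` is ignored, and in practice an OpenTelemetry SDK handles this rather than hand-written code.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the context carried by the header, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the spec.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def outgoing_traceparent(ctx: TraceContext, current_span_id: str) -> str:
    """Header for a downstream call: same trace_id, our span as the new parent."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{current_span_id}-{flags}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
if incoming is not None:
    my_span_id = secrets.token_hex(8)  # 16 hex chars
    print(outgoing_traceparent(incoming, my_span_id))
```

The trace ID stays constant across hops while the span ID changes at each one, which is what lets a backend rebuild the span tree.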
Standards
- OpenTelemetry (OTel): the standard. Vendor-neutral SDKs and APIs.
- Predecessors: OpenTracing, OpenCensus (now merged into OTel).
Tools
- Jaeger: open-source, started at Uber.
- Zipkin: older, simpler.
- Tempo (Grafana): cheap, scales.
- Datadog APM, Honeycomb, New Relic, Lightstep: SaaS.
Sampling
You can't store every trace. Sample:
- Head sampling: decide at request start; deterministic by trace ID.
- Tail sampling: decide after request finishes; keep slow / error traces. Better but stateful.
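The two strategies can be sketched as pure decision functions. Both are illustrative: `head_sample` assumes a 32-character hex trace ID, and the 500 ms slow threshold in `tail_sample` is an arbitrary example value.

```python
def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at request start, using only the trace ID.

    Every service that applies the same rule to the same trace_id reaches
    the same decision, so sampled traces are complete.
    """
    low_64_bits = int(trace_id[-16:], 16)
    return low_64_bits < rate * 2**64


def tail_sample(duration_ms: float, had_error: bool, slow_threshold_ms: float = 500.0) -> bool:
    """Decide after the request finishes: keep anything slow or failed.

    This needs all spans of a trace buffered in one place until the root
    span ends, which is the stateful part.
    """
    return had_error or duration_ms >= slow_threshold_ms


print(head_sample("4bf92f3577b34da6a3ce929d0e0e4736", rate=0.10))
print(tail_sample(duration_ms=820, had_error=False))
```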
When traces shine
- Latency debugging: which span took 800 ms?
- Cross-service flow visualization.
- Finding orphan retries, duplicate calls.
- Detecting "tail amplification" (slow downstream affects p99).
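For the latency-debugging case, a useful number is each span's self time: its own duration minus the time spent in its direct children. The sketch below assumes children run sequentially without overlapping; with parallel children the subtraction over-counts. The `Span` fields follow the anatomy listed earlier, with millisecond timestamps as an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_span_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_times(spans: List[Span]) -> Dict[str, float]:
    """Map span_id -> time spent in the span itself, excluding direct children."""
    child_total: Dict[str, float] = {}
    for span in spans:
        if span.parent_span_id is not None:
            child_total[span.parent_span_id] = (
                child_total.get(span.parent_span_id, 0.0) + span.duration_ms
            )
    return {s.span_id: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


def slowest_span(spans: List[Span]) -> Span:
    """The span with the largest self time: the first place to look."""
    times = self_times(spans)
    return max(spans, key=lambda s: times[s.span_id])


trace = [
    Span("a", None, "GET /checkout", 0, 1000),
    Span("b", "a", "db.query orders", 100, 900),
    Span("c", "a", "cache.get cart", 900, 950),
]
print(slowest_span(trace).name)
```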
26.5 Correlation: the unified view
A request hits service A → emits log + metric + span. Same trace_id everywhere → you can pivot:
- Saw a slow request in tracing → jump to logs at the same trace_id → see business context.
- Saw an error spike in metrics → drill into traces with that error.
This correlation is what makes observability "observability" instead of "three disconnected tools."
26.6 SLIs, SLOs, SLAs (recap from chapter 1)
- SLI: indicator (a measurement). "Fraction of HTTP requests with status 200 in last 5 min."
- SLO: objective (target). "SLI ≥ 99.9% over 30 days."
- SLA: agreement (contract). External, with consequences.
Error budget
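If your SLO is 99.9%, you have a 0.1% "budget" for failure: about 43 minutes of full downtime over 30 days. Burn rate measures how fast you're consuming it; a burn rate of 1 uses up the budget exactly at the end of the SLO window.
- Slow burn (will exhaust in about 30 days): time to fix things.
- Fast burn (will exhaust in hours to a couple of days): page on-call.
Multi-window, multi-burn-rate alerts: alert on (1h burn rate > 14.4 OR 6h burn rate > 6), with each condition also required to hold over a short window (5 min and 30 min respectively) so the alert clears quickly once the problem stops → high precision, fast detection. For a 30-day SLO, a burn rate of 14.4 consumes 2% of the budget in 1 hour, and 6 consumes 5% in 6 hours.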
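The budget and burn-rate arithmetic can be worked through in a few lines. The function names are hypothetical. The 1h and 6h thresholds are the ones stated above; pairing each with a short confirmation window (5m and 30m) follows the pattern in Google's SRE Workbook and is an assumption here.

```python
from typing import Dict, List, Tuple

SLO = 0.999
WINDOW_DAYS = 30


def error_budget_minutes(slo: float = SLO, window_days: int = WINDOW_DAYS) -> float:
    """Minutes of total outage the SLO tolerates over the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(error_ratio: float, slo: float = SLO) -> float:
    """1.0 means the budget runs out exactly at the end of the SLO window."""
    return error_ratio / (1 - slo)


def hours_to_exhaustion(rate: float, window_days: int = WINDOW_DAYS) -> float:
    """How long a full budget lasts at a constant burn rate."""
    return window_days * 24 / rate


# (long window, short confirmation window, burn-rate threshold)
PAGE_RULES: List[Tuple[str, str, float]] = [("1h", "5m", 14.4), ("6h", "30m", 6.0)]


def should_page(burn_by_window: Dict[str, float]) -> bool:
    """Page if any rule's long AND short windows both exceed its threshold."""
    return any(
        burn_by_window[long_w] > threshold and burn_by_window[short_w] > threshold
        for long_w, short_w, threshold in PAGE_RULES
    )


print(f"budget: {error_budget_minutes():.1f} min over {WINDOW_DAYS} days")
print(f"1.44% errors -> burn rate {burn_rate(0.0144):.1f}")
print(f"at 14.4x the budget lasts {hours_to_exhaustion(14.4):.0f} h")
```

The short window is what stops the page from lingering: once errors stop, the 5-minute burn rate drops below the threshold long before the 1-hour average does.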
Picking SLOs
- Start with what users care about: availability, latency, correctness.
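- Don't chase 100%; users can't tell the difference between very high and perfect reliability. Set the target just above the level where users would start to notice degradation.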
- Different services have different SLOs.
26.7 Alerting
Alert on symptoms (user pain), not causes (specific failure modes).
Bad: "CPU > 80%" (maybe nothing's wrong).
Good: "p95 latency > 500 ms for 5 min" (the user is suffering).
Alert routing
- Page on-call for SLO burn / customer impact.
- Tickets for non-urgent issues.
- Auto-remediation for known patterns (restart pod, scale up).
Alert fatigue
The #1 killer of effective on-call. Every false positive trains the team to ignore alerts.
- Tune ruthlessly.
- Auto-close stale alerts.
- Quarterly review of every alert: did it lead to action?
26.8 Distributed tracing in microservices (deeper)
Instrument:
- HTTP clients/servers (auto via libraries).
- DB calls.
- Queue produce/consume.
- Cache hits/misses.
- External APIs.
Add custom spans for business-significant operations.
Watch for:
- Tail latency (p99 of a chain is much worse than p99 of any one link).
- Retries that cascaded.
- N+1 patterns within a request.
- Async fan-out that doesn't propagate trace context (fix: pass trace_id through queues).
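The tail-latency point has simple arithmetic behind it. If each call independently has a 1% chance of exceeding its own p99, the chance that a request touching n calls hits at least one slow call is 1 - 0.99^n. Independence is an assumption; correlated slowness (a shared overloaded database, for example) changes the numbers.

```python
def p_any_slow(n_calls: int, per_call_quantile: float = 0.99) -> float:
    """Probability that at least one of n independent calls exceeds its own quantile latency."""
    return 1 - per_call_quantile ** n_calls


for n in (1, 10, 100):
    print(f"{n:>3} calls: {p_any_slow(n):.1%} of requests see at least one p99-slow call")
```

With 10 downstream calls roughly one request in ten is affected, and with 100 calls it is about 63%, so the per-link p99 becomes the typical experience for the chain.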
26.9 Profiling
Beyond traces: continuous profiling in prod.
- Sample CPU stacks, memory, mutex contention.
- Tools: Pyroscope, Parca, Datadog Profiler, gprofiler.
- Helps find hot functions you didn't know existed.
26.10 Logs vs metrics vs traces — when to reach for which
| Question | Tool |
|---|---|
| Is the service up? | metric (uptime) |
| What's the error rate? | metric |
| What's the p99 latency? | metric (histogram) |
| Why did this request fail? | log + trace |
| Where in the chain did latency come from? | trace |
| What did the user see? | log |
| How many of X happened today? | metric |
| What was the input that caused this bug? | log |
Don't reach for logs to count things; counters are orders of magnitude cheaper to store and query.
26.11 Health checks (recap)
/healthz (liveness): is the process alive?
/readyz (readiness): is the process able to serve traffic?
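Kubernetes uses both. Liveness fail → restart the container. Readiness fail → remove the pod from Service endpoints, so load balancers stop sending it traffic.
Deep health: check critical dependencies (DB, cache) in readiness rather than liveness, so a dependency outage doesn't cause restart loops. Don't make health checks too expensive; they run every few seconds per pod.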
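A minimal sketch of the two endpoints using only the standard library is shown below. The `DEPENDENCY_CHECKS` probes are placeholders that always succeed; real ones would ping the database or cache with a short timeout. Port 8080 and the helper names are assumptions.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical dependency probes; each should be cheap and time-bounded.
DEPENDENCY_CHECKS: Dict[str, Callable[[], bool]] = {
    "db": lambda: True,
    "cache": lambda: True,
}


def liveness() -> Tuple[int, str]:
    """The process is up and able to answer; no dependency checks here."""
    return 200, "ok"


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Ready only if every critical dependency probe succeeds."""
    failed = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            healthy = False  # a probe that raises counts as a failure
        if not healthy:
            failed.append(name)
    if failed:
        return 503, "not ready: " + ", ".join(failed)
    return 200, "ready"


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/healthz":
            status, body = liveness()
        elif self.path == "/readyz":
            status, body = readiness(DEPENDENCY_CHECKS)
        else:
            status, body = 404, "not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())


def serve(port: int = 8080) -> None:
    """Blocks forever; call from the service's entry point."""
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Keeping dependency probes out of `liveness` means a database outage marks pods unready without restarting them.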
26.12 Synthetic monitoring
Bots that periodically perform real user flows (login, checkout). Catch issues before users do. Tools: Pingdom, Datadog Synthetics, Checkly, custom scripts.
26.13 RUM (Real User Monitoring)
JS in the browser reports real user performance: page load time, Core Web Vitals (LCP, INP, CLS; INP replaced FID in March 2024), API latency from the user's network.
Critical: backend p99 of 50 ms doesn't help users on bad networks seeing 5 sec total page load.
26.14 Chaos engineering
Inject failure on purpose, prove the system tolerates it.
- Kill random pods (Chaos Monkey).
- Inject network latency (Toxiproxy).
- Force partition (Gremlin).
- Trip circuit breakers manually.
Run in staging first; in prod once you trust your monitoring.
26.15 What an interviewer wants
- Distinguish logs / metrics / traces and when each is right.
- Discuss SLIs/SLOs/error budgets.
- Mention OpenTelemetry as the modern standard.
- Articulate why low-cardinality labels matter for metrics.
- Know how to find a slow request in a microservices system (trace_id correlation).
Key takeaways
- Three pillars: logs (events), metrics (counts/distributions), traces (request flow).
- Correlate via trace_id; observability = pivoting across pillars.
- Alert on symptoms, not causes. Alert fatigue kills.
- SLOs > SLAs > vague availability promises. Error budget guides risk.
- OpenTelemetry is the vendor-neutral instrumentation standard.