system-design/observability.md

~7 min read·updated 5/29/2026

26. Observability: Logs, Metrics, Traces, SLOs

You can't operate what you can't see. Observability = the ability to ask arbitrary questions about your system's behavior in production. The "three pillars" are logs, metrics, traces — but the goal is unified investigation, not three separate tools.

26.1 Monitoring vs observability

  • Monitoring: predefined dashboards, alerts on known failure modes.
  • Observability: ability to investigate new failure modes ad hoc — by correlating signals you didn't predict needing.

You need both. Monitoring catches known unknowns; observability handles unknown unknowns.

26.2 Pillar 1: Logs

Time-stamped records of events.

Log levels

  • ERROR: actionable failures.
  • WARN: anomalies; worth attention if they form a pattern.
  • INFO: significant events (request started, deploy done).
  • DEBUG: detailed flow; usually off in prod.
  • TRACE: very detailed; rare.

Structured logging

JSON, not free text. Fields: timestamp, level, service, trace_id, user_id, request_id, event, plus arbitrary structured data.

{"ts":"2026-05-04T10:00:00Z","level":"info","service":"api","trace_id":"abc","event":"user.login","user_id":42,"latency_ms":120}

Reasons:

  • Searchable / filterable in any logging system.
  • Aggregatable (count, group, sum).
  • Stable schema beats grep regex.
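A minimal sketch of structured logging using only Python's standard `logging` module. The `JsonFormatter` class, the `fields` key passed through `extra=`, and the service name `api` are illustrative assumptions, not part of any particular logging library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # hypothetical service name attached to every line

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "event": record.getMessage(),
        }
        # Anything passed as extra={"fields": {...}} becomes an attribute on the record.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(service: str = "api") -> logging.Logger:
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


logger = make_logger()
logger.info("user.login", extra={"fields": {"trace_id": "abc", "user_id": 42, "latency_ms": 120}})
```

Because every line is a JSON object with the same keys, a query such as "count `user.login` events grouped by `service`" needs no regex.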

Log aggregation

Centralize. Local files don't scale.

  • ELK / Elastic Stack: Logstash → Elasticsearch → Kibana.
  • Loki (Grafana): logs indexed by labels only; cheap.
  • Splunk: enterprise; expensive.
  • Datadog, Honeycomb, New Relic: SaaS.
  • CloudWatch Logs, Google Cloud Logging (formerly Stackdriver): cloud-native.

Log volume problem

A 1000-server fleet where each server handles 10K req/sec and logs 1 KB per request produces 10 MB/sec per server, or 10 GB/sec in total (about 864 TB/day). Storing and indexing that gets expensive fast. Mitigations:

  • Sampling: log 1% of successful requests, all errors.
  • Drop debug in prod.
  • Tiered storage: hot 7 days, warm 30 days, cold archive 1 year.
  • Aggregation at source: don't log per-request what could be a metric.
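The volume arithmetic and the sampling rule above can be sketched as follows. The fleet numbers come from the text; the 0.1% error fraction in the usage note, the "5xx means error" rule, and the function names are assumptions for illustration.

```python
import random

BYTES_PER_REQUEST = 1_000          # 1 KB of log per request
REQ_PER_SEC_PER_SERVER = 10_000
SERVERS = 1_000


def fleet_bytes_per_sec(success_sample_rate: float, error_fraction: float) -> float:
    """Fleet-wide log volume when all errors and a sampled share of successes are kept."""
    kept = error_fraction + (1.0 - error_fraction) * success_sample_rate
    return BYTES_PER_REQUEST * REQ_PER_SEC_PER_SERVER * SERVERS * kept


def should_log(status: int, success_sample_rate: float = 0.01, rng=random.random) -> bool:
    """Keep every server error (assumed: status >= 500); sample everything else."""
    if status >= 500:
        return True
    return rng() < success_sample_rate


unsampled = fleet_bytes_per_sec(success_sample_rate=1.0, error_fraction=0.0)
sampled = fleet_bytes_per_sec(success_sample_rate=0.01, error_fraction=0.001)
```

With these assumed inputs, `unsampled` is 10 GB/sec and `sampled` is roughly 110 MB/sec, so sampling 1% of successes cuts volume by about two orders of magnitude while keeping every error.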

When to log

  • Request boundaries (start, end, status).
  • Errors and exceptions (with stack traces).
  • State transitions in workflows.
  • Significant business events (signup, purchase).

When NOT to log

  • Anything that can be a metric (counts, latencies). Use a counter, not a log.
  • Sensitive data (PII, secrets, tokens). Redact at source.
  • Inside loops, where one line per iteration turns a single request into N+1 log lines.
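One way to redact at source is to mask known sensitive keys before the event reaches the logger. This is a sketch: the key list and the `[REDACTED]` marker are hypothetical, and a real deployment would need its own list plus handling for values embedded in free text.

```python
SENSITIVE_KEYS = {"password", "token", "authorization", "email", "ssn"}  # hypothetical list


def redact(event: dict) -> dict:
    """Return a copy of the event with sensitive values masked, recursing into nested dicts."""
    clean = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


event = {"event": "user.login", "user_id": 42, "headers": {"Authorization": "Bearer abc"}}
safe_event = redact(event)
```

The function returns a new dict instead of mutating its input, so the original event can still be used by the request handler.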

26.3 Pillar 2: Metrics

Numerical measurements over time. Cheap, aggregatable, alertable.

Types (Prometheus model)

  • Counter: monotonic; only goes up. requests_total.
  • Gauge: arbitrary value. queue_depth, cpu_usage.
  • Histogram: distribution of values; bucketed. request_duration_seconds.
  • Summary: client-side computed quantiles; harder to aggregate.
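To make the first three types concrete, here is a toy in-memory sketch. It is not the Prometheus client API; the class and method names are chosen for illustration. The histogram follows the Prometheus convention of cumulative buckets with inclusive upper bounds plus a final +Inf bucket.

```python
import bisect


class Counter:
    """Monotonic: the value can only increase."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount


class Gauge:
    """Arbitrary value that can move in either direction."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Bucketed distribution of observed values."""

    def __init__(self, bounds) -> None:
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.sum = 0.0

    def observe(self, value: float) -> None:
        # bisect_left finds the first bound >= value, so upper bounds are inclusive.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value

    def cumulative(self) -> list:
        """Number of observations <= each bound, ending with the grand total."""
        running, out = 0, []
        for count in self.counts:
            running += count
            out.append(running)
        return out


request_duration_seconds = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.5, 0.7, 2.0):
    request_duration_seconds.observe(seconds)
```

Because buckets are plain counts, histograms from many servers can be summed before computing a quantile, which is why they aggregate more easily than client-side summaries.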

Cardinality

Each unique label combination = a separate time series. Cardinality explodes:

  • requests_total{status, route, user_id} with 1M users → 1M+ series. Death.
  • Keep label cardinality low. user_id is not a metric label; it's a log field.
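A quick way to see the explosion: the worst-case series count for one metric is the product of the number of distinct values each label can take. The label sizes below (5 statuses, 50 routes) are assumed for illustration; only combinations that actually occur create series, so this is an upper bound.

```python
from math import prod


def max_series(label_cardinalities: dict) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


bounded = max_series({"status": 5, "route": 50})
exploded = max_series({"status": 5, "route": 50, "user_id": 1_000_000})
```

Here `bounded` is 250 and `exploded` is 250,000,000: adding one unbounded label multiplies everything that was already there.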

Tools

  • Prometheus: pull-based; scrapes /metrics endpoints. Industry standard.
  • Graphite: push-based; older.
  • InfluxDB / TimescaleDB: time-series DBs.
  • Datadog / NewRelic / Honeycomb: SaaS.
  • OpenTelemetry: vendor-neutral instrumentation; emits to any backend.

What to measure (RED method)

  • Rate: req/sec.
  • Errors: error rate.
  • Duration: latency histogram.

USE method (for resources)

  • Utilization: % of capacity used.
  • Saturation: queue depth, wait.
  • Errors: error count.

Together, RED + USE covers most operational concerns.
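As a rough sketch, the three RED numbers can be derived from a window of request records. The `Request` shape, the "5xx counts as an error" rule, and the nearest-rank p99 are assumptions made for this example; in practice a metrics backend computes these from counters and histograms.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    status: int
    duration_s: float


def red_summary(requests: List[Request], window_s: float) -> dict:
    """Rate, error ratio, and nearest-rank p99 duration for one time window."""
    n = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    p99: Optional[float] = None
    if n:
        rank = -(-99 * n // 100)  # integer ceil(0.99 * n), avoids float rounding
        p99 = durations[rank - 1]
    return {
        "rate_per_s": n / window_s,
        "error_ratio": errors / n if n else 0.0,
        "p99_s": p99,
    }


# 100 requests over a 10-second window; the two slowest ones failed.
sample = [Request(500 if i > 98 else 200, i / 100) for i in range(1, 101)]
summary = red_summary(sample, window_s=10.0)
```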

26.4 Pillar 3: Traces

A trace = the journey of one request through the system.

Anatomy

  • Trace: a tree of spans, identified by trace_id.
  • Span: one unit of work (RPC call, DB query). Has start, end, name, parent_span_id, attributes.
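A sketch of the span structure described above, with a helper that answers "where did the time go?" by subtracting each span's direct children from its own duration. The field names mirror the list; the millisecond timestamps, the example spans, and the assumption that children run sequentially are illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_span_id: Optional[str]  # None for the root span
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_times(spans: List[Span]) -> Dict[str, int]:
    """Time spent in each span excluding its direct children (assumes sequential children)."""
    child_total: Dict[str, int] = {}
    for span in spans:
        if span.parent_span_id is not None:
            child_total[span.parent_span_id] = (
                child_total.get(span.parent_span_id, 0) + span.duration_ms
            )
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0) for s in spans}


trace = [
    Span("a", None, "GET /checkout", 0, 1000),
    Span("b", "a", "auth.verify", 10, 60),
    Span("c", "a", "db.query", 100, 900),
]
times = self_times(trace)
slowest = max(times, key=times.get)
```

In this example the root span lasts 1000 ms, but only 150 ms is its own work; `slowest` points at `db.query`.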

Propagation

The trace_id flows through every service via headers (W3C Trace Context: traceparent, tracestate).
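The `traceparent` header has four dash-separated fields: version, a 32-hex-digit trace ID, a 16-hex-digit parent span ID, and 2 hex digits of flags (bit 0 means "sampled"). The sketch below parses version `00` only and builds the header a service would send downstream. The function names and the choice to sample new traces by default are assumptions; in practice an OpenTelemetry SDK handles this.

```python
import re
import secrets
from typing import Optional

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[dict]:
    """Parse a version-00 traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C Trace Context spec.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming: Optional[str]) -> str:
    """Header for an outgoing call: same trace_id, fresh span ID for this hop."""
    ctx = parse_traceparent(incoming) if incoming else None
    trace_id = ctx["trace_id"] if ctx else secrets.token_hex(16)
    sampled = ctx["sampled"] if ctx else True  # assumption: sample new traces
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```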

Standards

  • OpenTelemetry (OTel): the standard. Vendor-neutral SDKs and APIs.
  • Predecessors: OpenTracing, OpenCensus (now merged into OTel).

Tools

  • Jaeger: open-source, started at Uber.
  • Zipkin: older, simpler.
  • Tempo (Grafana): cheap, scales.
  • Datadog APM, Honeycomb, NewRelic, Lightstep: SaaS.

Sampling

You can't store every trace. Sample:

  • Head sampling: decide at request start; deterministic by trace ID.
  • Tail sampling: decide after request finishes; keep slow / error traces. Better but stateful.
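The two strategies can be sketched as a pair of decision functions. This assumes trace IDs are random hex strings and that spans are dicts with `start_ms`, `end_ms`, and an optional `error` flag; the 500 ms threshold is likewise an assumption.

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide at request start. Deterministic: every service that sees this
    trace_id reaches the same answer without coordination."""
    # Treat the low 64 bits of the hex trace ID as a uniform number in [0, 2^64).
    bucket = int(trace_id[-16:], 16)
    return bucket < ratio * 2**64


def tail_sample(spans: list, slow_ms: int = 500) -> bool:
    """Decide after the trace is complete: keep anything slow or failed."""
    total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    has_error = any(s.get("error", False) for s in spans)
    return has_error or total_ms >= slow_ms
```

The statefulness of tail sampling shows up in the signature: `tail_sample` needs all spans of a trace, so something has to buffer them until the request finishes.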

When traces shine

  • Latency debugging: which span took 800 ms?
  • Cross-service flow visualization.
  • Finding orphan retries, duplicate calls.
  • Detecting "tail amplification" (slow downstream affects p99).

26.5 Correlation: the unified view

A request hits service A → emits log + metric + span. Same trace_id everywhere → you can pivot:

  • Saw a slow request in tracing → jump to logs at the same trace_id → see business context.
  • Saw an error spike in metrics → drill into traces with that error.

This correlation is what makes observability "observability" instead of "three disconnected tools."

26.6 SLIs, SLOs, SLAs (recap from chapter 1)

  • SLI: indicator (a measurement). "Fraction of HTTP requests with status 200 in last 5 min."
  • SLO: objective (target). "SLI ≥ 99.9% over 30 days."
  • SLA: agreement (contract). External, with consequences.

Error budget

If your SLO is 99.9%, you have 0.1% "budget" for downtime. Burn rate measures how fast you're consuming it.

  • Slow burn (will exhaust in 30 days): time to fix things.
  • Fast burn (will exhaust in 1 hour): page on-call.

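Multi-window, multi-burn-rate alerts: alert on (1h burn rate > 14.4 OR 6h burn rate > 6) → high precision, fast detection. Against a 30-day window, those thresholds correspond to spending 2% of the budget in 1 hour or 5% in 6 hours.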
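Burn rate is the observed error ratio divided by the error budget (1 minus the SLO). A burn rate of 1 spends the budget exactly over the SLO window; higher rates spend it proportionally faster. A minimal sketch, assuming a 99.9% SLO over 30 days and the thresholds quoted above (function names are illustrative):

```python
SLO = 0.999
WINDOW_HOURS = 30 * 24  # 30-day SLO window


def burn_rate(error_ratio: float, slo: float = SLO) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_hours: float = WINDOW_HOURS) -> float:
    """Time until the whole budget is gone if this burn rate persists."""
    return float("inf") if rate <= 0 else window_hours / rate


def budget_spent(rate: float, hours: float, window_hours: float = WINDOW_HOURS) -> float:
    """Fraction of the total budget consumed by burning at `rate` for `hours`."""
    return rate * hours / window_hours


def should_page(rate_1h: float, rate_6h: float) -> bool:
    """The alert condition from the text: either window over its threshold."""
    return rate_1h > 14.4 or rate_6h > 6.0
```

For example, a sustained 1.44% error ratio gives a burn rate of about 14.4, which would exhaust a 30-day budget in roughly 50 hours.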

Picking SLOs

  • Start with what users care about: availability, latency, correctness.
  • Don't chase 100%; users can't tell the difference. Set the target just above the level where users would start to notice failures.
  • Different services have different SLOs.

26.7 Alerting

Alert on symptoms (user pain), not causes (specific failure modes).

  • Bad: "CPU > 80%". Maybe nothing's wrong.
  • Good: "p95 latency > 500 ms for 5 min". The user is suffering.
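The "for 5 min" part matters: it keeps a single noisy sample from paging anyone. A sketch of that rule, assuming one p95 sample per minute (the function name and sampling interval are illustrative):

```python
from typing import Sequence


def sustained_breach(
    samples_ms: Sequence[float], threshold_ms: float = 500.0, for_points: int = 5
) -> bool:
    """True only if the most recent `for_points` samples all exceed the threshold."""
    if for_points < 1:
        raise ValueError("for_points must be at least 1")
    if len(samples_ms) < for_points:
        return False  # not enough data yet to call it sustained
    return all(value > threshold_ms for value in samples_ms[-for_points:])


one_spike = sustained_breach([120, 130, 900, 125, 118, 122])
real_problem = sustained_breach([120, 510, 640, 700, 820, 760])
```

Here `one_spike` is false and `real_problem` is true, even though the first series contains the single worst sample.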

Alert routing

  • Page on-call for SLO burn / customer impact.
  • Tickets for non-urgent.
  • Auto-remediation for known patterns (restart pod, scale up).

Alert fatigue

The #1 killer of effective on-call. Every false positive trains the team to ignore alerts.

  • Tune ruthlessly.
  • Auto-close stale alerts.
  • Quarterly review of every alert: did it lead to action?

26.8 Distributed tracing in microservices (deeper)

Instrument:

  • HTTP clients/servers (auto via libraries).
  • DB calls.
  • Queue produce/consume.
  • Cache hits/misses.
  • External APIs.

Add custom spans for business-significant operations.

Watch for:

  • Tail latency (p99 of a chain is much worse than p99 of any one link).
  • Retries that cascaded.
  • N+1 patterns within a request.
  • Async fan-out that doesn't propagate trace context (fix: pass trace_id through queues).
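The tail-latency point follows from simple probability. If each call independently has a 1% chance of landing in its own slowest percentile, the chance that a request making N such calls hits at least one slow call is 1 - 0.99^N. The independence assumption is a simplification; correlated slowness changes the numbers.

```python
def p_any_slow(n_calls: int, per_call_quantile: float = 0.99) -> float:
    """Probability that at least one of n independent calls exceeds its own quantile."""
    return 1.0 - per_call_quantile ** n_calls


one_call = p_any_slow(1)
ten_calls = p_any_slow(10)
hundred_calls = p_any_slow(100)
```

Under these assumptions a request touching 10 services sees a "p99-slow" hop about 9.6% of the time, and one touching 100 sees it about 63% of the time, which is why a slow tail in one dependency dominates end-to-end latency.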

26.9 Profiling

Beyond traces: continuous profiling in prod.

  • Sample CPU stacks, memory, mutex contention.
  • Tools: Pyroscope, Parca, Datadog Profiler, gprofiler.
  • Helps find hot functions you didn't know existed.

26.10 Logs vs metrics vs traces — when to reach for which

Question | Tool
--- | ---
Is the service up? | metric (uptime)
What's the error rate? | metric
What's the p99 latency? | metric (histogram)
Why did this request fail? | log + trace
Where in the chain did latency come from? | trace
What did the user see? | log
How many of X happened today? | metric
What was the input that caused this bug? | log

Don't reach for logs to count things; a counter is orders of magnitude cheaper to store and query than one log line per event.

26.11 Health checks (recap)

  • /healthz (liveness): is the process alive?
  • /readyz (readiness): is the process able to serve traffic?

Kubernetes uses both. Liveness fail → restart pod. Readiness fail → remove from load balancer.

Deep health: check critical dependencies (DB, cache). Don't make health checks too expensive — they run every few seconds per pod.
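A sketch of the two endpoints as plain functions returning an HTTP status and a body, independent of any web framework. The dependency names and the check callables are hypothetical; a real readiness check would also put a short timeout on each dependency call.

```python
from typing import Callable, Dict, Tuple


def liveness() -> Tuple[int, dict]:
    """/healthz: answer cheaply; reaching this code means the process is alive."""
    return 200, {"status": "ok"}


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """/readyz: 200 only if every critical dependency check passes."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a failing check means "not ready", not a crash
    status = 200 if all(results.values()) else 503
    return status, results


def broken_cache() -> bool:
    raise ConnectionError("cache unreachable")  # simulated outage


ready_status, ready_body = readiness({"db": lambda: True, "cache": broken_cache})
```

In this example the readiness check returns 503 with `{"db": True, "cache": False}`, while liveness still returns 200: the pod is taken out of rotation but not restarted.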

26.12 Synthetic monitoring

Bots that periodically perform real user flows (login, checkout). Catch issues before users do. Tools: Pingdom, Datadog Synthetics, Checkly, custom scripts.

26.13 RUM (Real User Monitoring)

JS in the browser reports real user performance: page load time, Core Web Vitals (LCP, INP, CLS; INP replaced FID in March 2024), API latency from the user's network.

Critical: backend p99 of 50 ms doesn't help users on bad networks seeing 5 sec total page load.

26.14 Chaos engineering

Inject failure on purpose, prove the system tolerates it.

  • Kill random pods (Chaos Monkey).
  • Inject network latency (Toxiproxy).
  • Force partition (Gremlin).
  • Trip circuit breakers manually.

Run in staging first; in prod once you trust your monitoring.

26.15 What an interviewer wants

  • Distinguish logs / metrics / traces and when each is right.
  • Discuss SLIs/SLOs/error budgets.
  • Mention OpenTelemetry as the modern standard.
  • Articulate why low-cardinality labels matter for metrics.
  • Know how to find a slow request in a microservices system (trace_id correlation).

Key takeaways

  • Three pillars: logs (events), metrics (counts/distributions), traces (request flow).
  • Correlate via trace_id; observability = pivoting across pillars.
  • Alert on symptoms, not causes. Alert fatigue kills.
  • SLOs > SLAs > vague availability promises. Error budget guides risk.
  • OpenTelemetry is the vendor-neutral instrumentation standard.
