26. Observability: Logs, Metrics, Traces, SLOs

You can't operate what you can't see. Observability = the ability to ask arbitrary questions about your system's behavior in production. The "three pillars" are logs, metrics, traces — but the goal is unified investigation, not three separate tools.

26.1 Monitoring vs observability

Monitoring: predefined dashboards, alerts on known failure modes.
Observability: ability to investigate new failure modes ad hoc — by correlating signals you didn't predict needing.

You need both. Monitoring catches known unknowns; observability handles unknown unknowns.

26.2 Pillar 1: Logs

Time-stamped records of events.

Log levels

ERROR: actionable failures.
WARN: anomalies; worth attention if they form a pattern.
INFO: significant events (request started, deploy done).
DEBUG: detailed flow; usually off in prod.
TRACE: very detailed; rare.

Structured logging

JSON, not free text. Fields: timestamp, level, service, trace_id, user_id, request_id, event, plus arbitrary structured data.

{"ts":"2026-05-04T10:00:00Z","level":"info","service":"api","trace_id":"abc","event":"user.login","user_id":42,"latency_ms":120}

Reasons:

Searchable / filterable in any logging system.
Aggregatable (count, group, sum).
Stable schema beats grep regex.
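
A minimal sketch of structured logging with Python's standard logging module, assuming one JSON object per line. The `JsonFormatter` class, the `make_logger` helper, and the `extra={"fields": ...}` convention are illustrative choices for this sketch, not part of any library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # hypothetical service name, e.g. "api"

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "event": record.getMessage(),
        }
        # Structured fields arrive via extra={"fields": {...}}; this is a
        # convention of the sketch, not a requirement of the logging module.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(service: str) -> logging.Logger:
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger


log = make_logger("api")
log.info("user.login", extra={"fields": {"trace_id": "abc", "user_id": 42, "latency_ms": 120}})
```

Because every line has the same keys, a query like "count events where event = user.login grouped by level" needs no regex.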

Log aggregation

Centralize. Local files don't scale.

ELK / Elastic Stack: Logstash → Elasticsearch → Kibana.
Loki (Grafana): logs indexed by labels only; cheap.
Splunk: enterprise; expensive.
Datadog, Honeycomb, New Relic: SaaS.
CloudWatch Logs, Google Cloud Logging (formerly Stackdriver): cloud-native.

Log volume problem

A 1000-server fleet where each server handles 10K req/sec and logs 1 KB per request produces 10 MB/sec per server = 10 GB/sec total, roughly 864 TB/day. That's $$$$$ in storage. Mitigations:

Sampling: log 1% of successful requests, all errors.
Drop debug in prod.
Tiered storage: hot 7 days, warm 30 days, cold archive 1 year.
Aggregation at source: don't log per-request what could be a metric.
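
The volume estimate and the sampling mitigation can be sketched as below. The constants restate the numbers above; treating status >= 500 as "error" and the 0.1% error fraction in the usage note are assumptions for illustration.

```python
import random

BYTES_PER_REQUEST = 1_000          # about 1 KB of log output per request
REQ_PER_SEC_PER_SERVER = 10_000
SERVERS = 1_000


def fleet_bytes_per_sec(success_sample_rate: float, error_fraction: float) -> float:
    """Fleet-wide log bytes/sec when all errors and a sample of successes are kept."""
    kept_fraction = error_fraction + (1.0 - error_fraction) * success_sample_rate
    return BYTES_PER_REQUEST * REQ_PER_SEC_PER_SERVER * SERVERS * kept_fraction


def should_log(status: int, success_sample_rate: float = 0.01, rng=random.random) -> bool:
    """Keep every error (status >= 500 in this sketch); sample everything else."""
    if status >= 500:
        return True
    return rng() < success_sample_rate
```

With no sampling, `fleet_bytes_per_sec(1.0, 0.0)` gives 1e10 bytes/sec (10 GB/sec). With 1% sampling of successes and an assumed 0.1% error fraction, the result is about 110 MB/sec, roughly a 90× reduction.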

When to log

Request boundaries (start, end, status).
Errors and exceptions (with stack traces).
State transitions in workflows.
Significant business events (signup, purchase).

When NOT to log

Anything that can be a metric (counts, latencies). Use a counter, not a log.
Sensitive data (PII, secrets, tokens). Redact at source.
Inside loops (one log line per iteration = N+1 logs). Log a single summary after the loop instead.
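
"Redact at source" can be as simple as scrubbing known-sensitive keys before the record reaches the logger. A sketch, assuming log fields are plain dicts; the key list and the `redact` helper are hypothetical and would need to match your own schema.

```python
# Illustrative key list; a real one should come from your data classification.
SENSITIVE_KEYS = {"password", "token", "authorization", "ssn", "email"}


def redact(fields: dict) -> dict:
    """Return a copy of `fields` with sensitive values masked, recursing into nested dicts."""
    clean = {}
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean


safe = redact({"user_id": 42, "token": "s3cr3t", "profile": {"email": "a@example.com"}})
```

Key-based redaction only catches fields you anticipated; secrets embedded in free-text messages need a different approach.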

26.3 Pillar 2: Metrics

Numerical measurements over time. Cheap, aggregatable, alertable.

Types (Prometheus model)

Counter: monotonic; only goes up (resets to zero on process restart). E.g. requests_total.
Gauge: a value that can go up or down. E.g. queue_depth, cpu_usage.
Histogram: distribution of values; bucketed. request_duration_seconds.
Summary: client-side computed quantiles; harder to aggregate.

Cardinality

Each unique label combination = a separate time series. Cardinality explodes:

requests_total{status, route, user_id} with 1M users → 1M+ series. Death.
Keep label cardinality low. user_id is not a metric label; it's a log field.
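
The worst-case series count for one metric is the product of the number of distinct values per label. A quick sketch; the counts for status and route are assumed for illustration.

```python
from math import prod
from typing import Mapping


def series_count(label_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


low = series_count({"status": 5, "route": 50})                           # 250
high = series_count({"status": 5, "route": 50, "user_id": 1_000_000})    # 250,000,000
```

In practice not every combination occurs, but one unbounded label is enough to multiply the total by its full cardinality.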

Tools

Prometheus: pull-based; scrapes /metrics endpoints. Industry standard.
Graphite: push-based; older.
InfluxDB / TimescaleDB: time-series DBs.
Datadog / New Relic / Honeycomb: SaaS.
OpenTelemetry: vendor-neutral instrumentation; emits to any backend.

What to measure (RED method)

Rate: req/sec.
Errors: error rate.
Duration: latency histogram.

USE method (for resources)

Utilization: % of capacity used.
Saturation: queue depth, wait.
Errors: error count.

Together, RED + USE covers most operational concerns.
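
A sketch of RED bookkeeping for one endpoint in pure Python, assuming two counters and a bucketed histogram. The bucket bounds and the `RedMetrics` class are hypothetical. Unlike Prometheus, which stores cumulative buckets, this sketch stores per-bucket counts to keep the logic short.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import List

# Bucket upper bounds in seconds; the choice of bounds is an assumption.
BUCKETS = [0.05, 0.1, 0.25, 0.5, 1.0, float("inf")]


@dataclass
class RedMetrics:
    requests_total: int = 0   # counter behind Rate
    errors_total: int = 0     # counter behind Errors
    duration_buckets: List[int] = field(default_factory=lambda: [0] * len(BUCKETS))

    def observe(self, duration_s: float, status: int) -> None:
        self.requests_total += 1
        if status >= 500:
            self.errors_total += 1
        # Index of the first bucket whose upper bound is >= the observation.
        self.duration_buckets[bisect_left(BUCKETS, duration_s)] += 1

    def error_rate(self) -> float:
        return self.errors_total / self.requests_total if self.requests_total else 0.0

    def quantile_upper_bound(self, q: float) -> float:
        """Upper bound of the bucket that contains the q-th quantile."""
        target = q * self.requests_total
        seen = 0
        for bound, count in zip(BUCKETS, self.duration_buckets):
            seen += count
            if seen >= target:
                return bound
        return BUCKETS[-1]


def rate_per_sec(prev_total: int, curr_total: int, interval_s: float) -> float:
    """Rate is the counter delta between two scrapes divided by the interval."""
    return (curr_total - prev_total) / interval_s
```

The histogram only knows which bucket a quantile falls in, so the reported p99 is as precise as the bucket boundaries allow.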

26.4 Pillar 3: Traces

A trace = the journey of one request through the system.

Anatomy

Trace: a tree of spans, identified by trace_id.
Span: one unit of work (RPC call, DB query). Has span_id, parent_span_id, name, start, end, attributes.
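
A sketch of the trace/span structure and the "which span was slow?" question, using hypothetical span data. Self time here is duration minus direct children, which assumes the children do not overlap in time.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None for the root span
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def children_by_parent(spans: List[Span]) -> Dict[Optional[str], List[Span]]:
    """Index spans by parent_span_id so the tree can be walked top-down."""
    tree: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        tree.setdefault(span.parent_span_id, []).append(span)
    return tree


def self_time_ms(span: Span, tree: Dict[Optional[str], List[Span]]) -> int:
    """Time spent in the span itself, excluding its direct (non-overlapping) children."""
    return span.duration_ms - sum(c.duration_ms for c in tree.get(span.span_id, []))


# Hypothetical trace: one root span with two sequential children.
spans = [
    Span("abc", "s1", None, "GET /checkout", 0, 900),
    Span("abc", "s2", "s1", "auth.verify", 10, 60),
    Span("abc", "s3", "s1", "db.query orders", 70, 870),
]
tree = children_by_parent(spans)
slowest = max(spans, key=lambda s: self_time_ms(s, tree))
```

Here the root span lasts 900 ms but only 50 ms is its own work; the 800 ms DB query is the span to investigate.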

Propagation

The trace_id flows through every service via headers (W3C Trace Context: traceparent, tracestate).
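
A sketch of parsing and forwarding the traceparent header, which has the shape version-traceid-parentid-flags. It handles only the version 00 layout and ignores tracestate; the helper names are hypothetical. In real services an OpenTelemetry SDK normally does this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the trace context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid per the W3C Trace Context spec.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Header for an outgoing call: same trace_id, fresh span id, same sampled flag."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes = 16 hex characters
    return f"00-{ctx.trace_id}-{new_span_id}-{'01' if ctx.sampled else '00'}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = child_traceparent(incoming)
```

The trace_id stays constant across hops while the parent-id field changes at every hop; that is what lets the backend rebuild the span tree.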

Standards

OpenTelemetry (OTel): the standard. Vendor-neutral SDKs and APIs.
Predecessors: OpenTracing, OpenCensus (now merged into OTel).

Tools

Jaeger: open-source, started at Uber.
Zipkin: older, simpler.
Tempo (Grafana): cheap, scales.
Datadog APM, Honeycomb, New Relic, Lightstep: SaaS.

Sampling

You can't store every trace. Sample:

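Head sampling: decide at request start, typically as a deterministic function of the trace ID so every service makes the same keep/drop decision. Cheap, but blind to how the request turns out.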
Tail sampling: decide after request finishes; keep slow / error traces. Better but stateful.
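
Both strategies in sketch form. The head sampler assumes a hex trace_id with uniformly random low bits; the 500 ms slow threshold in the tail sampler is an arbitrary illustration, and a real tail sampler must also buffer spans until the trace completes.

```python
def head_sample(trace_id: str, rate: float) -> bool:
    """Deterministic head sampling: same trace_id gives the same decision in every service."""
    # Compare the low 64 bits of the trace_id against a threshold scaled to 2**64.
    return int(trace_id[-16:], 16) < int(rate * 2**64)


def tail_sample(duration_ms: float, had_error: bool, slow_threshold_ms: float = 500.0) -> bool:
    """Tail sampling: decide after the trace finishes; keep slow or failed traces."""
    return had_error or duration_ms >= slow_threshold_ms
```

Because `head_sample` depends only on the trace_id, no coordination between services is needed: either all spans of a trace are kept or none are.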

When traces shine

Latency debugging: which span took 800 ms?
Cross-service flow visualization.
Finding orphan retries, duplicate calls.
Detecting "tail amplification" (slow downstream affects p99).

26.5 Correlation: the unified view

A request hits service A → emits log + metric + span. Same trace_id everywhere → you can pivot:

Saw a slow request in tracing → jump to logs at the same trace_id → see business context.
Saw an error spike in metrics → drill into traces with that error.

This correlation is what makes observability "observability" instead of "three disconnected tools."
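
The pivot is just a join on trace_id. A toy sketch with hypothetical in-memory records; in practice the log and trace backends run this query for you.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical records standing in for what the log and trace stores return.
logs = [
    {"trace_id": "abc", "event": "cart.loaded", "items": 3},
    {"trace_id": "abc", "event": "payment.declined", "reason": "card_expired"},
    {"trace_id": "def", "event": "user.login"},
]
spans = [
    {"trace_id": "abc", "name": "POST /checkout", "duration_ms": 1200},
    {"trace_id": "def", "name": "POST /login", "duration_ms": 80},
]


def events_for_slow_traces(spans: List[dict], logs: List[dict], threshold_ms: int) -> Dict[str, List[str]]:
    """For each slow trace, collect the log events that share its trace_id."""
    events_by_trace = defaultdict(list)
    for entry in logs:
        events_by_trace[entry["trace_id"]].append(entry["event"])
    return {
        span["trace_id"]: events_by_trace[span["trace_id"]]
        for span in spans
        if span["duration_ms"] >= threshold_ms
    }


slow = events_for_slow_traces(spans, logs, threshold_ms=500)
```

The join only works if every log line carries the trace_id, which is why it belongs in the structured-log schema from the start.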

26.6 SLIs, SLOs, SLAs (recap from chapter 1)

SLI: indicator (a measurement). "Fraction of HTTP requests answered with a non-5xx status in the last 5 min."
SLO: objective (target). "SLI ≥ 99.9% over 30 days."
SLA: agreement (contract). External, with consequences.

Error budget

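If your SLO is 99.9%, you have a 0.1% "budget" for failures: about 43 minutes of full downtime over 30 days. Burn rate measures how fast you're consuming it; a burn rate of 1 spends the budget exactly over the SLO window.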

Slow burn (burn rate near 1; budget lasts roughly the full 30 days): file a ticket; there is time to fix things.
Fast burn (budget gone within hours to a couple of days): page on-call.

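Multi-window, multi-burn-rate alerts: alert on (1h burn rate > 14.4 OR 6h burn rate > 6) → high precision, fast detection. A burn rate of 14.4 spends 2% of a 30-day budget in 1 hour; 6 spends 5% in 6 hours. Each condition is usually paired with a shorter confirmation window (e.g. 5 min and 30 min) so the alert clears quickly once the burn stops.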
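
The arithmetic behind error budgets and burn rates, as a sketch assuming a 30-day SLO window. The function names are hypothetical; `should_page` encodes only the long-window condition quoted above.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of total outage the SLO tolerates over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the allowed one; 1.0 means exactly on budget."""
    return error_ratio / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """How long a full budget lasts if the given burn rate is sustained."""
    return window_days * 24 / rate


def should_page(burn_1h: float, burn_6h: float) -> bool:
    """Page if either the 1h or the 6h burn rate exceeds its threshold."""
    return burn_1h > 14.4 or burn_6h > 6
```

For a 99.9% SLO, `error_budget_minutes(0.999)` is about 43.2 minutes. A sustained 1.44% error ratio gives a burn rate of about 14.4, which would exhaust the whole budget in about 50 hours.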

Picking SLOs

Start with what users care about: availability, latency, correctness.
Don't chase 100%; users can't tell the difference and every extra nine costs more. Aim for the level below which users would start to notice.
Different services have different SLOs.

26.7 Alerting

Alert on symptoms (user pain), not causes (specific failure modes).

Bad: "CPU > 80%". Maybe nothing's wrong.
Good: "p95 latency > 500 ms for 5 min". The user is suffering.

Alert routing

Page on-call for SLO burn / customer impact.
Tickets for non-urgent issues.
Auto-remediation for known patterns (restart pod, scale up).

Alert fatigue

The #1 killer of effective on-call. Every false positive trains the team to ignore alerts.

Tune ruthlessly.
Auto-close stale alerts.
Quarterly review of every alert: did it lead to action?

26.8 Distributed tracing in microservices (deeper)

Instrument:

HTTP clients/servers (auto via libraries).
DB calls.
Queue produce/consume.
Cache hits/misses.
External APIs.

Add custom spans for business-significant operations.

Watch for:

Tail latency (p99 of a chain is much worse than p99 of any one link).
Retries that cascaded.
N+1 patterns within a request.
Async fan-out that doesn't propagate trace context (fix: pass trace_id through queues).
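
Why chain tail latency is worse than any single link: if a request touches n services and each one independently exceeds its own p99 1% of the time, the chance that at least one is slow is 1 - 0.99^n. A sketch, with independence as the key assumption.

```python
def p_any_slow(links: int, per_link_quantile: float = 0.99) -> float:
    """Probability that at least one of `links` independent calls exceeds its own p99."""
    return 1.0 - per_link_quantile ** links


for n in (1, 10, 100):
    print(f"{n:>3} links: {p_any_slow(n):.1%} of requests hit at least one slow call")
```

With 10 links, roughly 9.6% of requests hit at least one p99-slow call; with 100 links it is about 63%, so a per-service p99 becomes close to the typical case for the whole chain.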

26.9 Profiling

Beyond traces: continuous profiling in prod.

Sample CPU stacks, memory, mutex contention.
Tools: Pyroscope, Parca, Datadog Profiler, gprofiler.
Helps find hot functions you didn't know existed.

26.10 Logs vs metrics vs traces — when to reach for which

Question	Tool
Is the service up?	metric (uptime)
What's the error rate?	metric
What's the p99 latency?	metric (histogram)
Why did this request fail?	log + trace
Where in the chain did latency come from?	trace
What did the user see?	log
How many of X happened today?	metric
What was the input that caused this bug?	log

Don't reach for logs to count things; a counter is orders of magnitude cheaper to store and much faster to query.

26.11 Health checks (recap)

/healthz (liveness): is the process alive?
/readyz (readiness): is the process able to serve traffic?

Kubernetes uses both. Liveness fail → restart pod. Readiness fail → remove from load balancer.

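Deep health: check critical dependencies (DB, cache). Put these checks on readiness, not liveness; otherwise a dependency outage restart-loops every pod instead of just draining traffic. Don't make health checks too expensive; they run every few seconds per pod.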
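
A sketch of the two endpoints using only the standard library. The dependency checks are passed in as callables; `check_health`, `make_handler`, and the `db_ping` name in the final comment are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


def check_health(path: str, dependency_checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Map a request path to (HTTP status, body)."""
    if path == "/healthz":
        # Liveness: reaching this code means the process is up and serving HTTP.
        return 200, "ok"
    if path == "/readyz":
        # Readiness: every critical dependency must currently pass its check.
        failed = [name for name, check in dependency_checks.items() if not check()]
        if failed:
            return 503, "not ready: " + ", ".join(failed)
        return 200, "ready"
    return 404, "not found"


def make_handler(dependency_checks: Dict[str, Callable[[], bool]]):
    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            status, body = check_health(self.path, dependency_checks)
            self.send_response(status)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(body.encode())

    return HealthHandler


# To serve (db_ping is a placeholder for your own cheap dependency check):
# HTTPServer(("0.0.0.0", 8080), make_handler({"db": db_ping})).serve_forever()
```

Keeping the decision in a plain function makes it testable without a socket, and keeps liveness free of dependency checks.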

26.12 Synthetic monitoring

Bots that periodically perform real user flows (login, checkout). Catch issues before users do. Tools: Pingdom, Datadog Synthetics, Checkly, custom scripts.

26.13 RUM (Real User Monitoring)

JS in the browser reports real user performance: page load time, Core Web Vitals (LCP, INP, CLS; INP replaced FID in March 2024), API latency from the user's network.

Critical: a backend p99 of 50 ms doesn't help users on bad networks who see a 5 sec total page load.

26.14 Chaos engineering

Inject failure on purpose, prove the system tolerates it.

Kill random pods (Chaos Monkey).
Inject network latency (Toxiproxy).
Force partition (Gremlin).
Trip circuit breakers manually.

Run in staging first; in prod once you trust your monitoring.

26.15 What an interviewer wants

Distinguish logs / metrics / traces and when each is right.
Discuss SLIs/SLOs/error budgets.
Mention OpenTelemetry as the modern standard.
Articulate why low-cardinality labels matter for metrics.
Know how to find a slow request in a microservices system (trace_id correlation).

Key takeaways

Three pillars: logs (events), metrics (counts/distributions), traces (request flow).
Correlate via trace_id; observability = pivoting across pillars.
Alert on symptoms, not causes. Alert fatigue kills.
SLOs > SLAs > vague availability promises. Error budget guides risk.
OpenTelemetry is the vendor-neutral instrumentation standard.