28. Containers, Kubernetes, Borg

Containers won the deployment war; Kubernetes won the orchestration war. Knowing how they work is now table stakes for any system design discussion.

~6 min read·updated 5/29/2026

28.1 Containers vs VMs

Virtual Machine

Hardware virtualization (KVM, VMware). Each VM runs its own kernel. Heavy: GBs of disk, slow boot (tens of seconds), strong isolation.

Container

OS-level virtualization. Many containers share the host kernel; isolated via Linux namespaces (PID, network, mount, user, IPC, UTS) + cgroups (resource limits).

Light: MBs of disk, fast boot (~1s), weaker isolation than a VM (a kernel exploit can lead to container breakout).
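
To make the cgroups half of that concrete, here is a minimal sketch of how a process could interpret its own CPU and memory limits. It assumes cgroup v2, where `cpu.max` holds "<quota> <period>" in microseconds (or "max" for no limit) and `memory.max` holds a byte count or "max". The function names are hypothetical, and the functions take the file contents as strings rather than reading `/sys/fs/cgroup` themselves.

```python
from typing import Optional


def cpu_limit_from_cpu_max(contents: str) -> Optional[float]:
    """Convert the text of a cgroup v2 `cpu.max` file into a CPU count.

    The file holds "<quota> <period>" in microseconds, or "max <period>"
    when no limit is set. Returns None for the unlimited case.
    """
    quota, period = contents.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def memory_limit_from_memory_max(contents: str) -> Optional[int]:
    """Convert the text of a cgroup v2 `memory.max` file into bytes."""
    value = contents.strip()
    return None if value == "max" else int(value)
```

A container limited to half a CPU would typically see `50000 100000` in `cpu.max`, which this sketch maps to 0.5.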

When to use what

  • Containers: stateless services, microservices, CI builds, dev environments. Default.
  • VMs: when you need isolation across mutually-untrusted tenants (cloud provider, Lambda host), kernel-version control, GPU passthrough.
  • microVMs (Firecracker): VM-grade isolation with container-grade speed. AWS Lambda runs on these.

28.2 Docker

Made containers usable.

Components

  • Dockerfile: declarative recipe to build an image.
  • Image: immutable filesystem snapshot + metadata. Layered (each filesystem-changing instruction adds a layer; cache-friendly).
  • Container: a running instance of an image.
  • Registry: stores images (Docker Hub, ECR, GCR, Artifact Registry).

Key concepts

  • Layered FS: copy-on-write; pulls only changed layers.
  • COPY and RUN: each is a layer.
  • Multi-stage builds: build in fat image, copy artifacts to slim runtime image. Final image is small.
  • .dockerignore: exclude from build context.

Best practices

  • Pin base image versions (node:20.10 not node:latest).
  • Run as non-root user.
  • Use distroless or Alpine for runtime.
  • Layer order: most-stable to least-stable (cache-friendly).
  • One process per container (add a minimal init process if the app needs child reaping or signal forwarding).
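
The layer-order advice follows from how build caching chains layers together. The sketch below is a simplified model, not Docker's actual algorithm: each layer's cache key hashes its parent's key plus the instruction text, and a bracketed tag such as `[src-v1]` stands in for the fingerprint of copied files (an assumption of this example). Function names are hypothetical.

```python
import hashlib
from typing import List


def layer_cache_keys(instructions: List[str]) -> List[str]:
    """Return one cache key per instruction.

    Each key hashes the parent key together with the instruction text, so a
    change at step i alters the keys of step i and every step after it.
    """
    keys = []
    parent = ""
    for instruction in instructions:
        digest = hashlib.sha256((parent + "\n" + instruction).encode()).hexdigest()
        keys.append(digest)
        parent = digest
    return keys


def reused_layers(old: List[str], new: List[str]) -> int:
    """Count how many leading layers of `new` match the cached keys of `old`."""
    count = 0
    for old_key, new_key in zip(layer_cache_keys(old), layer_cache_keys(new)):
        if old_key != new_key:
            break
        count += 1
    return count
```

With dependencies copied and installed before the source tree, a source-only change reuses every layer up to the final COPY. Copying everything first invalidates the install step on every edit.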

OCI standards

The Open Container Initiative standardized the image format and the runtime interface, so an image built with one tool runs on any compliant runtime or engine (containerd, CRI-O, Podman).

28.3 Why orchestration

When you have 1000 containers across 50 hosts:

  • Where should each container run?
  • What if a host dies?
  • How do they find each other?
  • How do you roll out a new version?
  • How do you scale up/down based on load?

Manually unworkable. Orchestrators handle this.

28.4 Kubernetes

The dominant orchestrator. Originated at Google (heavily inspired by Borg).

Core concepts

  • Pod: one or more co-located containers, sharing network and storage. Smallest unit.
  • Node: a worker machine (VM or physical). Runs kubelet (agent) + container runtime.
  • Cluster: control plane + nodes.
  • Namespace: virtual cluster division for multi-tenancy.

Workload types

  • Deployment: stateless replicas; rolling updates; auto-restart.
  • StatefulSet: stable identity, ordered deploys, persistent volumes; for DBs, Kafka.
  • DaemonSet: one pod per node; for log shippers, network proxies.
  • Job / CronJob: batch and scheduled.

Service

Stable virtual IP and DNS name fronting a set of pods (selected by labels). kube-proxy programs iptables (or IPVS) rules on each node to load-balance across the ready pods.

Types:

  • ClusterIP (default): internal only.
  • NodePort: expose port on every node.
  • LoadBalancer: provisions cloud LB.
  • ExternalName: DNS CNAME.
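
The label selection behind a Service can be sketched in a few lines. This is an illustrative model only: the pod dictionaries and function names are hypothetical, and the random backend choice mimics what iptables-mode load balancing does in spirit rather than reproducing kube-proxy.

```python
import random
from typing import Dict, List


def select_endpoints(selector: Dict[str, str], pods: List[dict]) -> List[str]:
    """Return the IPs of ready pods whose labels contain every selector pair."""
    return [
        pod["ip"]
        for pod in pods
        if pod["ready"]
        and all(pod["labels"].get(key) == value for key, value in selector.items())
    ]


def pick_backend(endpoints: List[str], rng: random.Random) -> str:
    """Pick one backend at random for a new connection."""
    return rng.choice(endpoints)


PODS = [
    {"ip": "10.0.0.1", "ready": True, "labels": {"app": "web", "tier": "front"}},
    {"ip": "10.0.0.2", "ready": False, "labels": {"app": "web", "tier": "front"}},
    {"ip": "10.0.0.3", "ready": True, "labels": {"app": "api"}},
    {"ip": "10.0.0.4", "ready": True, "labels": {"app": "web", "tier": "front"}},
]
```

Because the selector matches on labels rather than pod names, replacing a pod changes the endpoint list but not the Service's virtual IP or DNS name.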

Ingress

L7 routing into the cluster (host/path rules → services). Backed by an Ingress Controller (Nginx, Traefik, Envoy/Contour, HAProxy).

ConfigMap, Secret

Decouple config from images. Mount as files or env vars.

Persistent storage

  • PV (PersistentVolume): cluster-scoped storage abstraction.
  • PVC (PersistentVolumeClaim): pod requests storage.
  • StorageClass: dynamic provisioning.

Scaling

  • HPA (Horizontal Pod Autoscaler): scale replicas by CPU / memory / custom metric.
  • VPA (Vertical Pod Autoscaler): adjust pod resource requests.
  • Cluster Autoscaler: add/remove nodes.
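
The HPA's core calculation is a ratio: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a small tolerance of 1. The sketch below assumes a tolerance of 0.1 and hypothetical min/max bounds. It leaves out stabilization windows, multiple metrics, and not-ready pods.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Compute the replica count an HPA-style controller would aim for."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: avoid flapping by doing nothing.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas averaging 90% CPU against a 60% target scale to 6, while the same 4 replicas at 62% stay put.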

Health probes

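  • Liveness: restart the container if the probe fails.
  • Readiness: remove the pod from Service endpoints (no traffic) while the probe fails.
  • Startup: hold off liveness and readiness checks until a slow-starting app is up.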
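
The difference between liveness and readiness is easiest to see by replaying a sequence of probe results. This sketch assumes a failure threshold of 3 consecutive failures and a success threshold of 1. The function name and action strings are hypothetical, and startup probes are left out.

```python
from typing import Iterable, List


def probe_actions(results: Iterable[bool], kind: str, failure_threshold: int = 3) -> List[str]:
    """Replay probe results and report the action taken after each one."""
    if kind not in ("liveness", "readiness"):
        raise ValueError("kind must be 'liveness' or 'readiness'")
    actions = []
    consecutive_failures = 0
    for ok in results:
        if ok:
            consecutive_failures = 0
            actions.append("ok")
            continue
        consecutive_failures += 1
        if consecutive_failures < failure_threshold:
            actions.append("wait")
        elif kind == "liveness":
            # The container is restarted, so counting starts over.
            actions.append("restart-container")
            consecutive_failures = 0
        else:
            # The pod keeps running but receives no Service traffic.
            actions.append("remove-from-endpoints")
    return actions
```

A failing readiness probe never kills the container. It only keeps traffic away until the probe succeeds again.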

Rollouts

Default Deployment update is rolling: bring up new pods, then terminate old ones as the new ones become ready. The pace is set by maxSurge and maxUnavailable.
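
The surge and max-unavailable settings (`maxSurge` and `maxUnavailable` in the Deployment spec) translate into two bounds that hold throughout the rollout. The sketch below assumes the defaults of 25% each, with percentages rounded up for surge and down for unavailable. The function name and return shape are hypothetical.

```python
import math
from typing import Dict, Union

Setting = Union[int, str]


def rolling_update_bounds(
    replicas: int, max_surge: Setting = "25%", max_unavailable: Setting = "25%"
) -> Dict[str, int]:
    """Return the pod-count bounds a rolling update must stay within."""

    def resolve(value: Setting, round_up: bool) -> int:
        if isinstance(value, str) and value.endswith("%"):
            exact = replicas * int(value[:-1]) / 100
            return math.ceil(exact) if round_up else math.floor(exact)
        return int(value)

    surge = resolve(max_surge, round_up=True)
    unavailable = resolve(max_unavailable, round_up=False)
    if surge == 0 and unavailable == 0:
        # With no headroom in either direction the rollout could never progress.
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }
```

With 10 replicas and the defaults, a rollout may run up to 13 pods at once and must keep at least 8 available. Setting `max_surge=1, max_unavailable=0` gives a slower rollout that never drops below full capacity.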

Other strategies (with helpers like Argo Rollouts, Flagger):

  • Blue/green: run both; flip.
  • Canary: small fraction to new; ramp.

Secrets

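Base64-encoded in the API and, by default, stored unencrypted in etcd; enable encryption at rest (optionally backed by a KMS) to protect them. Mounted as files or env vars. Rotate with a reloader (e.g., the Reloader operator) that restarts pods when a Secret changes.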
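
One detail worth internalizing: the `data` field of a Secret manifest is base64-encoded, which is an encoding and not encryption. The sketch below shows the round trip with hypothetical helper names. Anyone who can read the object can decode it, which is why access control and encryption at rest matter.

```python
import base64
from typing import Dict


def encode_secret_data(plain: Dict[str, str]) -> Dict[str, str]:
    """Encode values the way they appear under `data` in a Secret manifest."""
    return {key: base64.b64encode(value.encode()).decode() for key, value in plain.items()}


def decode_secret_data(data: Dict[str, str]) -> Dict[str, str]:
    """Reverse the encoding; no key is needed."""
    return {key: base64.b64decode(value).decode() for key, value in data.items()}
```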

RBAC

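Roles and ClusterRoles define permissions; RoleBindings and ClusterRoleBindings grant them to users, groups, or service accounts (the identity pods use). Follow the principle of least privilege.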

Networking model

Every pod gets its own IP. No NAT between pods. Implementations: Calico, Cilium, Flannel, AWS VPC CNI.

CRDs and Operators

Extend Kubernetes with custom resources. An operator is a controller that reconciles desired state for a custom resource (e.g., KafkaCluster, PostgresCluster). Pattern: desired state lives in etcd (reached through the API server); controllers watch it and reconcile.
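
The reconcile pattern is a diff between desired and observed state that yields corrective actions, run repeatedly until nothing is left to do. A minimal sketch, assuming state is just a replica count per resource name (the data shapes and function names are hypothetical, and a real controller would talk to the API server instead of plain dicts):

```python
from typing import Dict, List, Tuple

Action = Tuple[str, str, int]


def reconcile(desired: Dict[str, int], actual: Dict[str, int]) -> List[Action]:
    """Compare desired and observed replica counts and return corrective actions."""
    actions: List[Action] = []
    for name in sorted(set(desired) | set(actual)):
        want = desired.get(name, 0)
        have = actual.get(name, 0)
        if have < want:
            actions.append(("create", name, want - have))
        elif have > want:
            actions.append(("delete", name, have - want))
    return actions


def apply_actions(actual: Dict[str, int], actions: List[Action]) -> Dict[str, int]:
    """Return the observed state after the actions have taken effect."""
    result = dict(actual)
    for verb, name, count in actions:
        result[name] = result.get(name, 0) + (count if verb == "create" else -count)
        if result[name] == 0:
            del result[name]
    return result
```

The loop is level-triggered: it acts on the current difference, not on individual events, so running it again after convergence produces no actions.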

28.5 Kubernetes architecture

Control plane

  • API server: REST API, gateway to everything.
  • etcd: distributed KV (Raft-backed) — source of truth.
  • Scheduler: assigns pods to nodes.
  • Controller manager: runs reconciliation loops (deployments, replicasets, jobs).
  • Cloud controller manager: integrates with cloud (LBs, volumes).

Node

  • kubelet: agent; talks to control plane.
  • Container runtime: containerd / CRI-O.
  • kube-proxy: networking.
  • CNI plugin: pod networking.
  • CSI plugin: storage.

How a pod is born

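  1. kubectl apply → API server.
  2. API server validates the object and stores it in etcd.
  3. Scheduler picks a node and binds the pod to it.
  4. kubelet on that node sees the pod spec and asks the container runtime to set up the pod; the CNI plugin assigns the pod IP.
  5. The runtime pulls the image and starts the containers.
  6. Probes pass → pod marked ready and added to the Service's endpoints.
  7. kube-proxy programs iptables for service routing → traffic flows.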
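
The scheduling step above works in two phases: filter out nodes that cannot fit the pod, then score the rest and pick the best. The sketch below models only CPU and memory, and its scoring rule (prefer the node left with the most free capacity) is an assumption chosen for illustration, not the scheduler's actual default. All names are hypothetical.

```python
from typing import List, Optional


def schedule(pod: dict, nodes: List[dict]) -> Optional[str]:
    """Return the name of the chosen node, or None if the pod fits nowhere."""
    # Phase 1: filter. Keep only nodes with enough free CPU and memory.
    feasible = [
        node
        for node in nodes
        if node["free_cpu"] >= pod["cpu"] and node["free_mem"] >= pod["mem"]
    ]
    if not feasible:
        return None  # The pod stays pending.

    # Phase 2: score. Higher is better: fraction of capacity left after placement.
    def score(node: dict) -> float:
        cpu_left = (node["free_cpu"] - pod["cpu"]) / node["total_cpu"]
        mem_left = (node["free_mem"] - pod["mem"]) / node["total_mem"]
        return cpu_left + mem_left

    ranked = sorted(feasible, key=lambda node: (-score(node), node["name"]))
    return ranked[0]["name"]


NODES = [
    {"name": "a", "total_cpu": 4, "free_cpu": 1, "total_mem": 8, "free_mem": 4},
    {"name": "b", "total_cpu": 4, "free_cpu": 3, "total_mem": 8, "free_mem": 6},
    {"name": "c", "total_cpu": 8, "free_cpu": 8, "total_mem": 16, "free_mem": 1},
]
```

Swapping the scoring rule to prefer the fullest feasible node would turn the same skeleton into a bin-packing scheduler that optimizes for utilization.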

28.6 Borg (the predecessor)

Google's internal cluster manager, in use since roughly 2003. Google described it publicly in a 2015 paper.

Key ideas (most made it into Kubernetes):

  • Declarative job specs.
  • Tasks share machines to maximize utilization.
  • Priority + preemption so high-priority work runs even when machines are full.
  • Quotas enforce fairness across teams.
  • Allocs are resource reservations within which tasks run.
  • Borgmaster (control plane) + Borglet (node agent).
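
Priority with preemption can be sketched as: if a new task does not fit, evict strictly lower-priority tasks, lowest first, until it does. This is a toy single-machine, CPU-only model with hypothetical names, not Borg's actual algorithm.

```python
from typing import List, Optional


def place_with_preemption(new_task: dict, running: List[dict], capacity: int) -> Optional[List[str]]:
    """Return the names of tasks to evict so `new_task` fits, or None if it cannot."""
    free = capacity - sum(task["cpu"] for task in running)
    # Only strictly lower-priority tasks are candidates; lowest priority goes first.
    candidates = sorted(
        (task for task in running if task["priority"] < new_task["priority"]),
        key=lambda task: (task["priority"], task["name"]),
    )
    victims: List[str] = []
    for task in candidates:
        if free >= new_task["cpu"]:
            break
        victims.append(task["name"])
        free += task["cpu"]
    return victims if free >= new_task["cpu"] else None


RUNNING = [
    {"name": "batch-a", "priority": 1, "cpu": 3},
    {"name": "batch-b", "priority": 1, "cpu": 3},
    {"name": "web", "priority": 10, "cpu": 2},
]
```

An empty list means the task fits without evicting anything. `None` means even evicting every lower-priority task would not free enough room.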

Differences from K8s:

  • Job-centric (not pod-centric).
  • Centralized scheduler optimized for utilization.
  • Priority bands (broadly production vs. batch): batch work fills idle resources and is preempted first.

Borg powers most of Google. Knowing this is a Google interview signal.

28.7 Kubernetes pain points

  • Operational complexity: even managed K8s (GKE, EKS, AKS) demands expertise.
  • YAML hell: raw manifests applied with kubectl, then templating and overlays (Helm, Kustomize), then GitOps (ArgoCD/Flux) on top; each layer is one more tool to learn.
  • Networking complexity: CNI variants, ingress, mesh, NetworkPolicies — each a learning curve.
  • Stateful workloads are harder; managed DBs usually win.
  • Cost overhead: control plane + node OS overhead.
  • Right-sizing: requests vs limits vs actuals; hard to tune.

For small teams: managed PaaS (Render, Fly, Railway, Heroku, Vercel) often a better fit than running K8s.

28.8 Service mesh (recap)

(Chapter 20.) Sidecar proxy per pod gives mTLS, L7 routing, retries, telemetry. Istio, Linkerd, Cilium Service Mesh.

With 5-10 services, skipping the mesh is simpler. Beyond that, the mesh tax pays back in observability and security.

28.9 Serverless

The other end of the spectrum: no servers, just functions.

  • AWS Lambda, Cloud Functions, Cloud Run, Azure Functions: scale to zero; pay per invocation.
  • Cold starts: first invocation has init delay (~100ms - several seconds depending on runtime).
  • Stateless: state goes to external stores.
  • Time/memory limits: e.g., 15min Lambda max.

Use cases:

  • Sporadic workloads.
  • Event-driven (S3 trigger, queue trigger).
  • Glue code, integrations.
  • Edge compute (Cloudflare Workers).

When not to use:

  • Steady high traffic (cheaper on containers).
  • Long-running jobs.
  • Latency-sensitive paths where the cold-start penalty is too heavy.
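
The "steady high traffic" point comes down to a break-even calculation: per-invocation pricing grows linearly with traffic, while an always-on container costs roughly the same regardless. A rough sketch, where every price is a parameter you must supply (none of these are real provider rates) and the function names are hypothetical:

```python
def serverless_monthly_cost(
    requests: float,
    seconds_per_request: float,
    price_per_request: float,
    price_per_second: float,
) -> float:
    """Cost of serving `requests` invocations under per-use pricing."""
    return requests * (price_per_request + seconds_per_request * price_per_second)


def break_even_requests(
    container_monthly_cost: float,
    seconds_per_request: float,
    price_per_request: float,
    price_per_second: float,
) -> float:
    """Monthly request count above which the always-on container is cheaper."""
    cost_per_request = price_per_request + seconds_per_request * price_per_second
    return container_monthly_cost / cost_per_request
```

Below the break-even volume, scale-to-zero wins. Above it, the flat container cost does. The model ignores free tiers, memory sizing, and the engineering cost of running the container platform.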

28.10 Choosing a deployment model

| Need | Pick |
| --- | --- |
| One small service | Heroku / Render / Fly |
| Microservices, dedicated team | Managed K8s (GKE, EKS) |
| Functions + event-driven | Lambda / Cloud Run / Cloud Functions |
| Strong ML / GPU workloads | K8s + GPU nodes, or specialized (SageMaker) |
| Stateful (DBs) | Managed services (RDS, Aurora, Spanner, BigTable) |
| Edge | CDN edge functions (Cloudflare Workers, Vercel Edge) |

28.11 Common interview points

  • Differentiate containers vs VMs.
  • Explain Kubernetes core: Pod, Deployment, Service, Ingress.
  • Discuss rolling vs blue/green vs canary.
  • Mention Borg as Google's predecessor (Google interviews).
  • Acknowledge K8s complexity; pick simpler tools when possible.

Key takeaways

  • Containers (Docker) package apps with their dependencies.
  • Kubernetes orchestrates containers across many hosts; descended from Google's Borg.
  • Pod is the deployment unit; Service is stable virtual IP; Ingress is L7 routing.
  • Operators extend K8s for stateful workloads (DBs, queues).
  • Serverless = scale to zero, pay per use; great for sporadic or event-driven work.
