◐ system-design/data-models.md
4. Data Models: Relational, Document, Graph
The data model is the *single most important architectural decision*. It shapes how you think, what queries are easy, what queries are impossible, and what your scale ceiling looks like.
~5 min read·updated 5/29/2026
4. Data Models: Relational, Document, Graph
The data model is the single most important architectural decision. It shapes how you think, what queries are easy, what queries are impossible, and what your scale ceiling looks like.
4.1 Relational (SQL)
Origins: Codd, 1970. The dominant model for ~40 years; still dominant for OLTP.
Idea: data lives in relations (tables), each tuple (row) is an unordered set of values. Foreign keys connect tables. A query optimizer chooses the execution plan.
Why it won
- Schema enforcement. The database guarantees structure.
- Joins are cheap and arbitrary. You don't pre-decide access patterns.
- Transactions (ACID). Multi-row consistency.
- Mature ecosystem. Indexes, query planners, replication, tooling.
Examples
PostgreSQL, MySQL, SQL Server, Oracle, CockroachDB, Spanner (Google).
When relational hurts
- Object-relational impedance mismatch: nested object hierarchies → multiple joins (N+1 query problem in ORMs).
- Schema migrations on huge tables are painful (locking, slow, rolling deployments).
- Horizontal scaling historically required sharding manually. Modern distributed SQL (Spanner, CockroachDB, TiDB, YugabyteDB) solves this with consensus + range partitioning, at cost of write latency.
4.2 Document (NoSQL)
Each row is a self-contained document, usually JSON or BSON. No schema enforced (schema-on-read). Common access pattern: load whole document by ID.
Why it took off (~2009-2015)
- "Schemaless" — easier early-stage iteration.
- One document = one object → no joins for the common case.
- Easy horizontal scaling (sharding by document ID).
Examples
MongoDB, Couchbase, DynamoDB (item model), Firestore.
When document hurts
- Joins are awkward. You either denormalize (data duplication, update anomalies) or do app-side joins.
- Many-to-many relationships (tags, friend graphs) → repeated work.
- Consistency across documents is usually not transactional (some, like Mongo 4.0+, support multi-doc txns; with cost).
Schema-on-read vs schema-on-write
"Schemaless" doesn't mean no schema; it means the schema lives in the application code. Every document still has structure — you just don't get database-side enforcement. When the app evolves, old documents in old shape stay in the DB; new app code must read both. (See chapter 8 on schema evolution.)
Document vs relational: it's not either/or
Modern Postgres has JSONB columns with GIN indexes, supporting most "document" use cases inside a relational database. The boundary is blurry. Pick relational + JSONB by default; reach for a pure document DB only when the whole workload is doc-shaped and you need horizontal scale.
4.3 Key-value
The simplest model. get(key), put(key, value). Value is opaque (bytes, string, JSON).
Examples
Redis, Memcached, DynamoDB (key model), Riak, RocksDB (embedded), LevelDB.
When KV is right
- Caching layer (Redis, Memcached).
- Session store.
- Sharded by key naturally — every key has a clear home.
- Embedded storage engine for higher-level systems (RocksDB inside Kafka, CockroachDB).
Limits
- No range queries unless the KV store explicitly supports them (Redis sorted sets, DynamoDB sort keys, RocksDB ordered keys).
- No secondary indexes without bolt-ons or app-level work.
4.4 Wide-column / column-family
A "row" has a key, and a flexible set of columns. Columns can be sparse — rows can have different columns. Designed for huge volumes, distributed by key range.
Examples
Bigtable (Google), Cassandra, HBase, ScyllaDB.
Mental model
Think of it as a sorted map: (row_key, column_family, column, timestamp) → value. Cassandra and HBase descend from Bigtable's design (Google paper 2006).
When it shines
- Time-series / event logs (write-heavy, append-mostly).
- Sparse data with millions of possible columns per row.
- Massive horizontal scale (PB+).
Pain points
- Modeling is query-driven. You design tables per query — denormalize aggressively. No general-purpose joins.
- Eventual consistency by default (Cassandra). HBase and Bigtable provide strong row-level consistency.
4.5 Graph
Nodes (entities) and edges (relationships), both with properties. First-class traversals.
When the data is the graph
- Social networks (friends, follows, mutual)
- Knowledge graphs (Wikidata, Google's knowledge graph)
- Recommendation engines
- Fraud detection (find unusual paths through transactions)
- Dependency analysis (build systems, package managers)
Examples
Neo4j (most popular), Neptune (AWS), JanusGraph (on top of Cassandra/HBase), Dgraph.
Query languages
- Cypher (Neo4j)
- Gremlin (Apache TinkerPop, runs across many graph DBs)
- SPARQL (RDF / triple stores)
When to skip a graph DB
If your "graph" has at most 2-3 hops and is small enough to fit, model it relationally. Joins handle 2-3 hops fine. Graph DBs win for deep traversals, variable-length paths, frequent subgraph queries.
4.6 Time-series
Append-mostly, timestamped data. Metrics, IoT, logs.
Examples
InfluxDB, TimescaleDB (Postgres extension), Prometheus (in-memory + WAL), Druid, ClickHouse (often used here), Bigtable / Cassandra (when you DIY).
Special properties
- Recent data hot, old data cold (lifecycle / TTL).
- Aggregations over time windows are the dominant read pattern.
- Compression is huge (delta-of-delta encoding for timestamps, gorilla compression for floats).
4.7 Search-oriented
Inverted index over text. Full-text search, faceting, ranking.
Examples: Elasticsearch, OpenSearch, Solr, Vespa, Algolia (managed), Meilisearch.
(See chapter 25 — Search Systems.)
4.8 Vector / embedding
For similarity search over high-dimensional vectors (ML embeddings). Approximate nearest neighbor (ANN).
Examples: Pinecone, Weaviate, Milvus, Qdrant, pgvector (Postgres extension), Vespa.
When you need it: RAG, semantic search, image similarity, recommendation. Pure SQL DBs will fall over at scale because brute-force similarity is O(N×D) per query.
4.9 Polyglot persistence
Real systems use multiple stores:
- Postgres for OLTP source of truth
- Redis for cache / session / rate-limit counters
- Elasticsearch for full-text search
- S3 / GCS for blobs (images, video)
- Kafka for event log / async pipelines
- Snowflake / BigQuery for analytics
- Pinecone / pgvector for embeddings
- Neo4j for graph queries (rare; only when truly needed)
The challenge is keeping them in sync — covered later under change-data-capture (CDC) and event sourcing.
4.10 How to pick
Decision flow for an interview:
- Default to Postgres unless you have a strong reason. It does relational, JSONB documents, JSON queries, full-text search (decent), and pgvector. One system, ACID, well-understood.
- Object storage (S3) for any blob >1 MB. Never put images in your DB.
- Add Redis when you need cache / counters / fast ephemeral state.
- Add Elasticsearch when full-text search becomes a first-class user feature.
- Move to a wide-column store (Cassandra, Bigtable) when single-region Postgres can't hold your write volume even with sharding.
- Use a graph DB only when traversal depth is unbounded.
For Google interviews specifically: be ready to discuss Spanner (globally consistent SQL with TrueTime) and Bigtable (sparse wide-column at petabyte scale).
Key takeaways
- Data model = the most consequential decision. Pick by access pattern, not by hype.
- Relational is the default. Document/wide-column win for specific workloads.
- Polyglot persistence is normal. Postgres + Redis + S3 covers most early-stage needs.
- For Google: know Spanner (relational+global), Bigtable (wide-column), and the trade-offs of each.
// 1 view