4. Data Models: Relational, Document, Graph

The data model is the single most important architectural decision. It shapes how you think, what queries are easy, what queries are impossible, and what your scale ceiling looks like.

4.1 Relational (SQL)

Origins: Codd, 1970. The dominant model for ~40 years; still dominant for OLTP.

Idea: data lives in relations (tables), each tuple (row) is an unordered set of values. Foreign keys connect tables. A query optimizer chooses the execution plan.

Why it won

Schema enforcement. The database guarantees structure.
Joins are cheap and arbitrary. You don't pre-decide access patterns.
Transactions (ACID). Multi-row consistency.
Mature ecosystem. Indexes, query planners, replication, tooling.

Examples

PostgreSQL, MySQL, SQL Server, Oracle, CockroachDB, Spanner (Google).

When relational hurts

Object-relational impedance mismatch: nested object hierarchies → multiple joins (N+1 query problem in ORMs).
Schema migrations on huge tables are painful (locking, slow, rolling deployments).
Horizontal scaling historically required sharding manually. Modern distributed SQL (Spanner, CockroachDB, TiDB, YugabyteDB) solves this with consensus + range partitioning, at cost of write latency.

4.2 Document (NoSQL)

Each row is a self-contained document, usually JSON or BSON. No schema enforced (schema-on-read). Common access pattern: load whole document by ID.

Why it took off (~2009-2015)

"Schemaless" — easier early-stage iteration.
One document = one object → no joins for the common case.
Easy horizontal scaling (sharding by document ID).

Examples

MongoDB, Couchbase, DynamoDB (item model), Firestore.

When document hurts

Joins are awkward. You either denormalize (data duplication, update anomalies) or do app-side joins.
Many-to-many relationships (tags, friend graphs) → repeated work.
Consistency across documents is usually not transactional (some, like Mongo 4.0+, support multi-doc txns; with cost).

Schema-on-read vs schema-on-write

"Schemaless" doesn't mean no schema; it means the schema lives in the application code. Every document still has structure — you just don't get database-side enforcement. When the app evolves, old documents in old shape stay in the DB; new app code must read both. (See chapter 8 on schema evolution.)

Document vs relational: it's not either/or

Modern Postgres has JSONB columns with GIN indexes, supporting most "document" use cases inside a relational database. The boundary is blurry. Pick relational + JSONB by default; reach for a pure document DB only when the whole workload is doc-shaped and you need horizontal scale.

4.3 Key-value

The simplest model. get(key), put(key, value). Value is opaque (bytes, string, JSON).

Examples

Redis, Memcached, DynamoDB (key model), Riak, RocksDB (embedded), LevelDB.

When KV is right

Caching layer (Redis, Memcached).
Session store.
Sharded by key naturally — every key has a clear home.
Embedded storage engine for higher-level systems (RocksDB inside Kafka, CockroachDB).

Limits

No range queries unless the KV store explicitly supports them (Redis sorted sets, DynamoDB sort keys, RocksDB ordered keys).
No secondary indexes without bolt-ons or app-level work.

4.4 Wide-column / column-family

A "row" has a key, and a flexible set of columns. Columns can be sparse — rows can have different columns. Designed for huge volumes, distributed by key range.

Examples

Bigtable (Google), Cassandra, HBase, ScyllaDB.

Mental model

Think of it as a sorted map: (row_key, column_family, column, timestamp) → value. Cassandra and HBase descend from Bigtable's design (Google paper 2006).

When it shines

Time-series / event logs (write-heavy, append-mostly).
Sparse data with millions of possible columns per row.
Massive horizontal scale (PB+).

Pain points

Modeling is query-driven. You design tables per query — denormalize aggressively. No general-purpose joins.
Eventual consistency by default (Cassandra). HBase and Bigtable provide strong row-level consistency.

4.5 Graph

Nodes (entities) and edges (relationships), both with properties. First-class traversals.

When the data is the graph

Social networks (friends, follows, mutual)
Knowledge graphs (Wikidata, Google's knowledge graph)
Recommendation engines
Fraud detection (find unusual paths through transactions)
Dependency analysis (build systems, package managers)

Examples

Neo4j (most popular), Neptune (AWS), JanusGraph (on top of Cassandra/HBase), Dgraph.

Query languages

Cypher (Neo4j)
Gremlin (Apache TinkerPop, runs across many graph DBs)
SPARQL (RDF / triple stores)

When to skip a graph DB

If your "graph" has at most 2-3 hops and is small enough to fit, model it relationally. Joins handle 2-3 hops fine. Graph DBs win for deep traversals, variable-length paths, frequent subgraph queries.

4.6 Time-series

Append-mostly, timestamped data. Metrics, IoT, logs.

Examples

InfluxDB, TimescaleDB (Postgres extension), Prometheus (in-memory + WAL), Druid, ClickHouse (often used here), Bigtable / Cassandra (when you DIY).

Special properties

Recent data hot, old data cold (lifecycle / TTL).
Aggregations over time windows are the dominant read pattern.
Compression is huge (delta-of-delta encoding for timestamps, gorilla compression for floats).

4.7 Search-oriented

Inverted index over text. Full-text search, faceting, ranking.

Examples: Elasticsearch, OpenSearch, Solr, Vespa, Algolia (managed), Meilisearch.

(See chapter 25 — Search Systems.)

4.8 Vector / embedding

For similarity search over high-dimensional vectors (ML embeddings). Approximate nearest neighbor (ANN).

Examples: Pinecone, Weaviate, Milvus, Qdrant, pgvector (Postgres extension), Vespa.

When you need it: RAG, semantic search, image similarity, recommendation. Pure SQL DBs will fall over at scale because brute-force similarity is O(N×D) per query.

4.9 Polyglot persistence

Real systems use multiple stores:

Postgres for OLTP source of truth
Redis for cache / session / rate-limit counters
Elasticsearch for full-text search
S3 / GCS for blobs (images, video)
Kafka for event log / async pipelines
Snowflake / BigQuery for analytics
Pinecone / pgvector for embeddings
Neo4j for graph queries (rare; only when truly needed)

The challenge is keeping them in sync — covered later under change-data-capture (CDC) and event sourcing.

4.10 How to pick

Decision flow for an interview:

Default to Postgres unless you have a strong reason. It does relational, JSONB documents, JSON queries, full-text search (decent), and pgvector. One system, ACID, well-understood.
Object storage (S3) for any blob >1 MB. Never put images in your DB.
Add Redis when you need cache / counters / fast ephemeral state.
Add Elasticsearch when full-text search becomes a first-class user feature.
Move to a wide-column store (Cassandra, Bigtable) when single-region Postgres can't hold your write volume even with sharding.
Use a graph DB only when traversal depth is unbounded.

For Google interviews specifically: be ready to discuss Spanner (globally consistent SQL with TrueTime) and Bigtable (sparse wide-column at petabyte scale).

Key takeaways

Data model = the most consequential decision. Pick by access pattern, not by hype.
Relational is the default. Document/wide-column win for specific workloads.
Polyglot persistence is normal. Postgres + Redis + S3 covers most early-stage needs.
For Google: know Spanner (relational+global), Bigtable (wide-column), and the trade-offs of each.