8. Encoding & Schema Evolution: JSON, Protobuf, Avro

Programs run on objects in memory. Networks and disks need bytes. Encoding is the bridge — and it's where forward/backward compatibility lives or dies.

8.1 The two worlds

In-memory: pointers, references, language-specific objects. Encoding: a self-contained byte sequence.

Process:

Encode / serialize / marshal: object → bytes
Decode / deserialize / unmarshal: bytes → object

You'll cross this boundary every time you write to disk, send over network, or hand off to another process.
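
A minimal sketch of the round trip in Python, using JSON as the encoding. The `user` dict is made-up illustration data.

```python
import json

# In-memory object: a dict holding references to other objects (illustrative data).
user = {"id": 42, "name": "Ada", "emails": ["ada@example.com"]}

# Encode: object -> bytes (self-contained, safe to write to disk or send).
encoded = json.dumps(user).encode("utf-8")

# Decode: bytes -> a new, equal object (not the same object in memory).
decoded = json.loads(encoded.decode("utf-8"))
```

The decoded value compares equal to the original but is a fresh object: nothing about pointers or identity crosses the boundary.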

8.2 Language-specific encodings

pickle (Python), Marshal (Ruby), Java Serializable, .NET BinaryFormatter.

Don't use them for anything beyond throwaway scripts:

Tied to one language → other services can't read.
Versioning is a footgun (Java serialVersionUID).
Security disasters: deserializing untrusted data has been the source of countless RCE bugs.
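
To make the security point concrete, here is a harmless sketch of why unpickling untrusted bytes is dangerous. The `Payload` class is hypothetical; the expression it evaluates is benign, but an attacker could substitute any callable and arguments.

```python
import pickle

class Payload:
    # pickle calls __reduce__ to learn how to rebuild the object.
    # Whatever callable it returns is invoked at load time.
    def __reduce__(self):
        return (eval, ("6 * 7",))

blob = pickle.dumps(Payload())

# Loading does not construct a Payload; it calls eval("6 * 7").
result = pickle.loads(blob)
```

The bytes alone decide which function runs on load, which is why these formats should only ever be fed data you produced yourself.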

8.3 Text formats

JSON

Ubiquitous, human-readable. Schemaless (or schema lives in app code, or in a separate JSON Schema doc). Wins by being default everywhere.

Limits:

No distinction between integer and float (1 vs 1.0 are both Number).
64-bit integers don't survive (JavaScript Number is float64; precision lost above 2^53). Twitter learned this when tweet IDs blew past 2^53.
No native binary type — must base64-encode (33% bloat).
Verbose: field names repeated in every object.
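
A small sketch of two of these limits, with Python's `float` standing in for a JavaScript Number (both are float64). The `id_str` field name is just one common workaround, shown here as an assumption rather than a standard.

```python
import base64
import json

# 2^53 + 1 has no exact float64 representation.
big_id = 2**53 + 1
as_float64 = float(big_id)                  # what a float64-based parser would hold
precision_lost = int(as_float64) != big_id  # True: the value was rounded

# Workaround: also send large IDs as strings.
safe = json.dumps({"id": big_id, "id_str": str(big_id)})

# Binary must be base64-encoded: every 3 bytes become 4 characters.
raw = bytes(300)
b64 = base64.b64encode(raw)
overhead = len(b64) / len(raw)              # 400 / 300
```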

XML

Heavier, more capable (namespaces, schemas via XSD). Mostly legacy now (SOAP, configuration files in old enterprise stacks).

YAML

Human-friendly superset of JSON. Used for configs (Kubernetes, GitHub Actions). Famous gotchas: unquoted no parses as false in YAML 1.1 parsers (the "Norway problem"); 2.0 and 2 are different types (float vs. int); whitespace-sensitive. Use a strict parser.

CSV

Tabular. Looks simple, infinite escaping bugs. Don't roll your own parser.
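
A short sketch of why a naive comma split fails, using Python's standard `csv` module. The rows are made-up data.

```python
import csv
import io

rows = [
    ["id", "comment"],
    ["1", 'She said "hi", then left'],   # embedded quotes and a comma
    ["2", "line one\nline two"],         # embedded newline
]

buf = io.StringIO(newline="")
csv.writer(buf).writerows(rows)

# A naive split on commas miscounts the fields of the second line.
naive_fields = buf.getvalue().splitlines()[1].split(",")

# A real parser understands quoting and recovers the original rows.
buf.seek(0)
parsed = list(csv.reader(buf))
```

The naive split sees three fields where there are two; the library round-trips the data intact, including the embedded newline.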

8.4 Binary formats

When size and speed matter, you encode binary.

MessagePack, BSON, CBOR

Schemaless binary. Smaller than JSON, no schema management. Used internally (BSON in MongoDB, MessagePack in some RPC).

Protobuf (Google), Thrift (Facebook), Avro (Hadoop)

Schema-first binary. You write a .proto (or .thrift or Avro schema), generate code, encode/decode against the schema.

The schema delivers:

Compactness: field tags (small ints) instead of names.
Speed: parse without dictionary lookups.
Type safety: field types are checked by generated code; some formats (proto2, Thrift) also support required fields.
Schema evolution: explicit rules for forward/backward compatibility.

8.5 Protobuf: the deep dive

A .proto file:

syntax = "proto3";
message Person {
  int64 id = 1;
  string name = 2;
  repeated string emails = 3;
  optional int32 age = 4;
}

Each field has a tag number (the small integer). On the wire, each field is encoded as (tag, type, value) — names are never on the wire.

Wire format

Varint encoding for integers (small ints take 1 byte).
Length-prefixed for strings and submessages.
Repeated fields are multiple copies of the same tag; in proto3, repeated numeric scalars are instead packed into a single length-prefixed block by default.
Field order doesn't matter on decode.
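
The wire format above can be sketched by hand in a few lines. This is an illustrative encoder for the Person message only, not the protobuf library: it handles non-negative integers and strings, skips the optional age field, and (unlike real proto3 encoders) does not omit default values. The function names are my own.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a base-128 varint, low 7 bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

VARINT, LEN = 0, 2  # the two wire types used in this sketch

def key(field_number: int, wire_type: int) -> bytes:
    """The field key packs the tag number and the wire type into one varint."""
    return encode_varint((field_number << 3) | wire_type)

def encode_person(id_: int, name: str, emails: list[str]) -> bytes:
    out = key(1, VARINT) + encode_varint(id_)
    data = name.encode("utf-8")
    out += key(2, LEN) + encode_varint(len(data)) + data
    for email in emails:  # one copy of tag 3 per element
        data = email.encode("utf-8")
        out += key(3, LEN) + encode_varint(len(data)) + data
    return out

wire = encode_person(150, "Ada", ["a@x.io", "b@x.io"])
```

The id 150 encodes as the bytes 08 96 01: one byte of key, two bytes of varint. The field names never appear in `wire`.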

Compatibility rules

Adding a field: safe. Old code ignores unknown tags.
Removing a field: keep the tag reserved (reserved 4;) so it's never reused.
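Renaming a field: safe on the binary wire (name not on wire), but it breaks the JSON mapping and any source code that uses the generated accessors.
Changing a field type: dangerous. Some int conversions are wire-compatible (int32 ↔ int64), though values that don't fit the narrower type are truncated; most others are not.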
Changing tag number: never safe.
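
A sketch of why adding a field is safe: a decoder can parse a field it has never heard of, because the wire type tells it how many bytes to skip. This toy `decode` handles only varint and length-delimited fields and simply drops unknown tags (many real implementations retain them for re-serialization). The tag 5 "level" field is hypothetical.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Return (value, next_pos) for the varint starting at pos."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7

def decode(buf: bytes, known: dict[int, str]) -> dict:
    """Decode varint (0) and length-delimited (2) fields; skip unknown tags."""
    out, pos = {}, 0
    while pos < len(buf):
        k, pos = read_varint(buf, pos)
        field_number, wire_type = k >> 3, k & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if field_number in known:  # unknown tags are parsed, then dropped
            out[known[field_number]] = value
    return out

# Written by new code: id = 150 (tag 1), name = "Ada" (tag 2), level = 7 (tag 5).
new_message = bytes.fromhex("08 96 01 12 03 41 64 61 28 07")

old_view = decode(new_message, {1: "id", 2: "name"})
new_view = decode(new_message, {1: "id", 2: "name", 5: "level"})
```

The old reader gets id and name and never notices tag 5. If tag 5 were later reused for a different type, the same old bytes would be misread, which is the reason for `reserved`.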

proto3 quirks

Default values are not distinguishable from "field absent" (no null for primitives). optional was reintroduced to signal presence.
Required fields removed (proto2 had them; they turned out to be a footgun for evolution).

Why Google uses Protobuf

Internal systems all speak Protobuf over Stubby (Google's internal RPC system; gRPC is its open-source successor). Tooling, monitoring, security, all built around the schemas. Schemas live in a monorepo so cross-team changes are coordinated.

8.6 Apache Avro

Schema-first like Protobuf, but the schema is embedded with the data (or a schema ID references a registry).

Designed for Hadoop / Kafka pipelines.
Schema registry pattern: producers register schemas; consumers fetch by ID.
Compatibility checks: registry can refuse a schema that breaks consumers.
More dynamic than Protobuf: writers and readers each have a schema; the framework reconciles.

When to pick: data lakes, Kafka topics, anywhere you want to evolve schemas independently and check at write time.
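
The writer/reader reconciliation can be sketched in plain Python. This is a toy model of the resolution rules, not the Avro library: schemas are reduced to field lists and defaults, and `resolve` and `NO_DEFAULT` are names invented for the example.

```python
NO_DEFAULT = object()  # sentinel: the reader field has no default

def resolve(record: dict, writer_fields: list, reader_schema: dict) -> dict:
    """Project a record written with writer_fields onto reader_schema.

    reader_schema maps field name -> default value, or NO_DEFAULT.
    """
    out = {}
    for name, default in reader_schema.items():
        if name in writer_fields:
            out[name] = record[name]      # both sides know the field
        elif default is not NO_DEFAULT:
            out[name] = default           # reader-only field: use its default
        else:
            raise ValueError(f"no value and no default for {name!r}")
    return out                            # writer-only fields are dropped

reader_v1 = {"id": NO_DEFAULT, "name": NO_DEFAULT}
reader_v2 = {"id": NO_DEFAULT, "name": NO_DEFAULT, "age": None}

# Old data, new reader: the missing field is filled from the default.
old_to_new = resolve({"id": 1, "name": "Ada"}, ["id", "name"], reader_v2)

# New data, old reader: the extra field is ignored.
new_to_old = resolve({"id": 1, "name": "Ada", "age": 36},
                     ["id", "name", "age"], reader_v1)
```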

8.7 Schema evolution: the compatibility matrix

Two directions of compatibility matter:

Backward compatibility: new code reads old data. (Critical when you deploy new code but DB still has rows in old format.)
Forward compatibility: old code reads new data. (Critical when one service deploys before another, or in a multi-version mobile fleet.)

Strategies that maintain both:

Add only optional / nullable fields with defaults.
Never reuse field tags / column IDs.
Keep old fields readable for a deprecation window.
Generate code; never hand-edit.
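
The two directions reduce to one question asked twice: can this reader handle what that writer produced? A toy checker, assuming a schema is just a map from field name to an optional default (a hypothetical simplification of what a registry checks):

```python
Schema = dict  # field name -> {"default": value}, or {} if the field is required

def can_read(reader: Schema, writer: Schema) -> bool:
    """True if every reader field is either written or has a default."""
    return all(name in writer or "default" in spec
               for name, spec in reader.items())

def backward_compatible(old: Schema, new: Schema) -> bool:
    return can_read(reader=new, writer=old)   # new code reads old data

def forward_compatible(old: Schema, new: Schema) -> bool:
    return can_read(reader=old, writer=new)   # old code reads new data

v1 = {"id": {}, "name": {}}
v2_optional = {"id": {}, "name": {}, "age": {"default": None}}
v2_required = {"id": {}, "name": {}, "age": {}}
v2_removed = {"id": {}}
```

Adding an optional field with a default passes both checks. Adding a required field breaks backward compatibility; removing a required field breaks forward compatibility.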

8.8 Versioning APIs

Endpoint contracts evolve. Four styles:

URL versioning

/v1/users, /v2/users. Easy to route, easy to deprecate. Stripe keeps a /v1 prefix in its URLs but layers date-based versions on top (more on this below).

Header versioning

API-Version: 2024-10-01. Hides version from URL; harder to debug from logs.

Date-based / continuous versioning

Stripe's approach: every breaking change has a date; clients pin to a date. The server runs all versions. Internal "compatibility transformers" map old request/response shapes to current internals. Lets old clients keep working forever; lets internals refactor freely.
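
A rough sketch of the transformer idea, not Stripe's actual implementation. The change dates, field names, and `render` function are all invented for illustration: internals produce only the current shape, and a chain of downgrade functions walks it back to whatever version the client pinned.

```python
def _split_name_to_single(resp: dict) -> dict:
    """Undo a (hypothetical) change that split name into first/last."""
    resp = dict(resp)
    resp["name"] = f'{resp.pop("first_name")} {resp.pop("last_name")}'
    return resp

def _cents_to_dollars(resp: dict) -> dict:
    """Undo a (hypothetical) change that switched amounts to integer cents."""
    resp = dict(resp)
    resp["amount"] = resp.pop("amount_cents") / 100
    return resp

# Breaking changes, newest first: (date shipped, downgrade function).
CHANGES = [
    ("2024-10-01", _split_name_to_single),
    ("2024-03-15", _cents_to_dollars),
]

def render(current_response: dict, pinned_version: str) -> dict:
    """Walk back from the current shape to the shape the client pinned."""
    resp = current_response
    for change_date, downgrade in CHANGES:
        if pinned_version < change_date:  # ISO dates compare correctly as strings
            resp = downgrade(resp)
    return resp

current = {"first_name": "Ada", "last_name": "Lovelace", "amount_cents": 1250}
latest = render(current, "2024-10-01")
mid_2024 = render(current, "2024-06-01")
oldest = render(current, "2024-01-01")
```

Each breaking change costs one small function at the edge; the internals only ever deal with the current shape.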

Schema-driven (gRPC / Protobuf)

The schema is the contract. Compatibility rules above keep things working.

8.9 Database schema evolution

(See chapter 6.) The high-level pattern:

Add nullable column.
Deploy code that writes both old and new.
Backfill historic rows.
Deploy code that reads new only.
Drop old column (later, after full rollout + recovery window).

This is the "expand/migrate/contract" pattern. Required because deploys are gradual and DBs are shared.
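
The steps can be walked through against an in-memory SQLite database. Table and column names are made up, and a production backfill would run in batches rather than as one UPDATE.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
db.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")  # old-format row

# 1. Expand: add a nullable column; existing rows get NULL.
db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# 2. Dual write: new code fills both columns.
db.execute("INSERT INTO users (fullname, display_name) VALUES (?, ?)",
           ("Grace Hopper", "Grace Hopper"))

# 3. Backfill rows written before step 2.
db.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# 4. Read new only: safe now that no row has a NULL display_name.
names = [row[0] for row in
         db.execute("SELECT display_name FROM users ORDER BY id")]

# 5. Contract (later): ALTER TABLE users DROP COLUMN fullname
#    Left as a comment here; it is the only step that cannot be rolled back.
```

At every point between steps, both the old and the new version of the code can run against the table, which is what makes a gradual deploy safe.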

8.10 Encoding for storage vs network

Storage lives forever. Pick a format with strong evolution guarantees (Protobuf with reserved tags, Avro with schema registry).
Network (RPC) lives one request. Looser; even JSON works.
Cache lives short. JSON or MessagePack with a version prefix is fine.
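
A minimal sketch of the version-prefix idea for cache values. The prefix format and function names are assumptions for illustration: entries written in an older shape are treated as misses rather than decoded wrongly.

```python
import json

CACHE_VERSION = b"v2:"  # bump whenever the cached shape changes (hypothetical)

def cache_encode(value: dict) -> bytes:
    return CACHE_VERSION + json.dumps(value).encode("utf-8")

def cache_decode(blob: bytes):
    """Return the value, or None to signal a miss for other versions."""
    if not blob.startswith(CACHE_VERSION):
        return None  # stale format: treat as a miss and recompute
    return json.loads(blob[len(CACHE_VERSION):])

fresh = cache_decode(cache_encode({"user": 1, "plan": "pro"}))
stale = cache_decode(b'v1:{"user": 1}')
```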

If you serialize Python pickles to disk, they will outlive your Python version, and you will weep. Choose for the long tail.

8.11 gRPC

Google's RPC framework, on top of Protobuf and HTTP/2.

Streaming (uni and bi-directional) supported.
Compact wire format.
Code generation in many languages.
Strong typing, inspection via reflection.
Less browser-friendly (use grpc-web or REST shim).

When to pick gRPC over REST:

Internal service-to-service.
High throughput, low latency.
Polyglot codebases that all share one schema source.

When to stick with REST/JSON:

Public APIs (browser/mobile/3rd-party clients).
Discoverability matters more than throughput.
Simple CRUD with minimal contract complexity.

8.12 Compression

Independent of encoding. Apply if your data is repetitive or large.

gzip: ubiquitous, good ratio, slow.
zstd: similar ratio, much faster than gzip. Default for new systems.
Snappy: Google's, optimized for speed > ratio. Used inside many systems (Cassandra, LevelDB).
LZ4: even faster, lower ratio.
Brotli: best ratio for HTTP, all major browsers support it. Use for static text on web.
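
A quick way to see the "repetitive or large" condition, using `zlib` from the standard library (the same DEFLATE algorithm gzip uses). zstd, Snappy, LZ4, and Brotli generally need third-party packages, so they are left out of this sketch. The sample payload is invented.

```python
import os
import zlib

repetitive = b'{"status": "ok", "region": "eu-west-1"}\n' * 1000
random_like = os.urandom(len(repetitive))  # already high-entropy

fast = zlib.compress(repetitive, level=1)   # favors speed
small = zlib.compress(repetitive, level=9)  # favors ratio
noise = zlib.compress(random_like, level=9)

ratio_repetitive = len(repetitive) / len(small)
ratio_random = len(random_like) / len(noise)
```

Repetitive data shrinks dramatically; random (or already compressed) data does not shrink at all and usually grows slightly, so compressing it wastes CPU.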

For columnar storage (Parquet, ORC), per-column compression with dictionary encoding + run-length is enormously effective.
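
A toy illustration of dictionary encoding followed by run-length encoding on a single column. Real Parquet and ORC writers use more elaborate, bit-packed variants; the column values here are made up.

```python
from itertools import groupby

column = ["DE", "DE", "DE", "FR", "FR", "DE", "US", "US", "US", "US"]

# Dictionary encoding: replace each distinct value with a small integer code.
dictionary = {value: code for code, value in enumerate(dict.fromkeys(column))}
codes = [dictionary[v] for v in column]

# Run-length encoding: store (code, run length) instead of repeating codes.
runs = [(code, len(list(group))) for code, group in groupby(codes)]

# Decoding reverses both steps.
reverse = {code: value for value, code in dictionary.items()}
decoded = [reverse[code] for code, n in runs for _ in range(n)]
```

Ten strings collapse to a three-entry dictionary plus four (code, length) pairs. Sorted or low-cardinality columns compress far better than the same values interleaved row by row.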

Key takeaways

JSON is the universal choice for human/external interfaces. Binary (Protobuf, Avro) for internal/storage.
Schema-first formats give you compatibility, code generation, and tooling.
Never reuse field tags; always plan for forward and backward compatibility.
Versioning style: URL is simple; date-based (Stripe) is the most user-friendly long term.
gRPC for internal, REST for external — usually.