CAP, PACELC & Beyond: The Theorems That Shape Distributed Systems
Every distributed database, message queue, and microservice architecture is constrained by a handful of fundamental theorems. Understanding them prevents you from chasing impossible designs and helps you make explicit trade-offs.
The CAP Theorem
Formulated by Eric Brewer (2000), proven by Gilbert & Lynch (2002).
A distributed data store can provide at most two of three guarantees simultaneously:
- Consistency — every read receives the most recent write (linearizability)
- Availability — every request to a non-failing node receives a non-error response (though not necessarily the most recent write)
- Partition tolerance — the system continues operating despite network partitions
Why "pick two" is misleading
Network partitions will happen. You don't choose whether to tolerate them — you must. So the real choice during a partition is:
| During a partition… | You get | You lose | Example |
|---|---|---|---|
| CP — reject requests until partition heals | Consistency | Availability | HBase, MongoDB (default), etcd, Spanner |
| AP — serve possibly stale data | Availability | Consistency | Cassandra, DynamoDB (eventual), CouchDB |
When there is no partition (the normal case), you can have both C and A. CAP only constrains behavior during a failure.
Normal operation During partition
┌───────────┐ ┌───────────┐
│ C + A │ │ C or A │
│ (both!) │ │ (pick one)│
└───────────┘ └───────────┘
Common misconceptions
| Misconception | Reality |
|---|---|
| "MongoDB is CP, so it's always consistent" | MongoDB sacrifices availability during primary election, but replica reads can be stale |
| "Cassandra is AP, so data is always wrong" | Cassandra supports tunable consistency (QUORUM reads + QUORUM writes overlap, giving strong consistency for that key) |
| "SQL = CP, NoSQL = AP" | PostgreSQL with async replicas is effectively AP; CockroachDB is CP and uses SQL |
| "CAP means you lose one forever" | CAP only applies during partitions — normally you get all three |
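The tunable-consistency point above has a simple arithmetic core: in a Dynamo-style system with N replicas, a read quorum of R and a write quorum of W are guaranteed to intersect on at least one replica whenever R + W > N. A minimal sketch:

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Dynamo-style quorum rule: reads are guaranteed to see the
    latest write when the read and write quorums must intersect,
    i.e. when R + W > N."""
    return r + w > n

# Cassandra QUORUM on a 3-replica keyspace: R = W = 2
assert quorums_overlap(3, 2, 2)       # strongly consistent for that query
# ONE/ONE: fastest, but read and write sets may never meet
assert not quorums_overlap(3, 1, 1)   # eventual consistency only
```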
The PACELC Theorem
Daniel Abadi (2012) extended CAP to cover the normal (no partition) case:
If there is a Partition, choose between Availability and Consistency. Else (normal operation), choose between Latency and Consistency.
This is more practical because most of the time there is no partition, and the real trade-off is latency vs consistency.
| System | During Partition (PAC) | Else (ELC) | Classification |
|---|---|---|---|
| PostgreSQL (sync replicas) | PC | EC | PC/EC |
| PostgreSQL (async replicas) | PA | EL | PA/EL |
| Cassandra (QUORUM) | PC | EC | PC/EC |
| Cassandra (ONE) | PA | EL | PA/EL |
| DynamoDB (eventual) | PA | EL | PA/EL |
| DynamoDB (strong) | PC | EC | PC/EC |
| Spanner | PC | EC | PC/EC (with TrueTime for low L) |
| CockroachDB | PC | EC | PC/EC |
| MongoDB (default) | PA | EC | PA/EC (Abadi's classification; majority write concern shifts it toward PC/EC) |
| Redis (replicated) | PA | EL | PA/EL |
The key insight
Most databases let you tune where you sit on this spectrum per-query or per-table. DynamoDB has both eventually consistent reads (fast, cheap) and strongly consistent reads (slower, 2x cost). Cassandra's consistency level can be set per query.
Consistency Models
"Consistency" means different things depending on context. Here's the hierarchy from strongest to weakest:
Strict / Linearizability
The gold standard. Every operation appears to execute atomically at some point between its invocation and response, and all processes see the same order.
Cost: Requires coordination (consensus protocols). High latency. Who provides it: Spanner (TrueTime), CockroachDB, etcd, ZooKeeper.
Sequential Consistency
All processes see operations in the same order, and each process's operations appear in program order. But the global order might not match real-time order.
Cost: Cheaper than linearizability, still requires ordering.
Causal Consistency
If operation A causally precedes operation B (A happened-before B), then all processes see A before B. Concurrent operations (no causal link) can be seen in any order.
Cost: Moderate. Can be implemented without global coordination. Who provides it: MongoDB (causal sessions), some CRDT-based systems.
Eventual Consistency
If no new updates are made, eventually all replicas converge to the same value. No ordering guarantees during convergence.
Cost: Cheapest. Maximum availability and performance. Who provides it: Cassandra (ONE), DynamoDB (default), DNS, CDN caches.
Read-your-writes Consistency
A client always sees its own writes. Others might not see them immediately.
Cost: Low. Can be achieved with session stickiness. Who provides it: Most databases can be configured for this.
Strongest ────────────────────────────────── Weakest
Linearizable → Sequential → Causal → Read-your-writes → Eventual
Most coordination ──────────────── Least coordination
Highest latency ────────────────── Lowest latency
Consensus Protocols
How do distributed nodes agree on a value? This is the consensus problem.
The FLP Impossibility Result
Fischer, Lynch, Paterson (1985): In an asynchronous system, no deterministic consensus protocol can guarantee termination if even one process may crash. This is why all practical consensus protocols use timeouts (partial synchrony assumption).
Paxos
Leslie Lamport (written 1989, published 1998). The foundational consensus protocol:
- Prepare — proposer sends proposal number to acceptors
- Promise — acceptors promise not to accept older proposals
- Accept — proposer sends value, acceptors accept if no newer proposal seen
- Learn — value is chosen when a majority of acceptors agree
Paxos is correct but notoriously hard to implement. Multi-Paxos extends it for a log of values.
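The four phases above can be sketched as a single-decree acceptor. This is an illustrative model only: real Paxos also needs proposers that retry with higher numbers, durable storage of the promised/accepted state, and Multi-Paxos for a log.

```python
class Acceptor:
    """Single-decree Paxos acceptor (sketch, not production code)."""
    def __init__(self):
        self.promised = -1      # highest proposal number promised
        self.accepted_n = -1    # proposal number of the accepted value
        self.accepted_v = None  # accepted value, if any

    def prepare(self, n):
        # Phase 1 (Prepare/Promise): promise to ignore proposals older
        # than n, and report any value already accepted.
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject", None, None)

    def accept(self, n, v):
        # Phase 2 (Accept): accept unless a newer prepare was promised.
        if n >= self.promised:
            self.promised, self.accepted_n, self.accepted_v = n, n, v
            return "accepted"
        return "rejected"

# A value is chosen once a majority of acceptors accept it (Learn).
acceptors = [Acceptor() for _ in range(3)]
promises = [a.prepare(1) for a in acceptors]
if sum(p[0] == "promise" for p in promises) > len(acceptors) // 2:
    acks = [a.accept(1, "x=42") for a in acceptors]
    chosen = sum(ack == "accepted" for ack in acks) > len(acceptors) // 2
```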
Used by: Google Chubby, original Spanner.
Raft
Diego Ongaro & John Ousterhout (2014). Designed to be understandable — same guarantees as Multi-Paxos but with a clearer structure:
- Leader election — nodes vote, one becomes leader
- Log replication — leader appends entries, replicates to followers
- Safety — committed entries are never lost
┌──────────┐
┌────────►│ Follower │
│ └──────────┘
┌──────┴───┐ ┌──────────┐
│ Leader │────►│ Follower │
└──────┬───┘ └──────────┘
│ ┌──────────┐
└────────►│ Follower │
└──────────┘
Leader replicates log entries to all followers.
Commit when majority acknowledges.
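The commit rule above reduces to a median computation: the leader advances its commit index to the highest log index stored on a majority of nodes, itself included. A minimal sketch (ignoring the current-term restriction Raft adds for safety):

```python
def commit_index(leader_last_index, follower_match_indices):
    """Raft commit rule (sketch): an entry is committed once it is
    replicated on a majority of the cluster, leader included."""
    indices = sorted([leader_last_index] + follower_match_indices,
                     reverse=True)
    majority = len(indices) // 2 + 1
    # The majority-th highest index is present on at least `majority` nodes.
    return indices[majority - 1]

# 5-node cluster: leader at index 7, followers at 7, 5, 5, 3
print(commit_index(7, [7, 5, 5, 3]))  # → 5: indices ≤ 5 are on 4 of 5 nodes
```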
Used by: etcd, CockroachDB, TiKV, Consul, RabbitMQ Quorum Queues.
Viewstamped Replication (VR)
Oki & Liskov (1988). Similar to Raft (which it predates), used by some systems internally. Same leader-based model.
Byzantine Fault Tolerance (BFT)
Handles nodes that actively lie or behave maliciously (not just crash). PBFT (Practical BFT) requires 3f+1 nodes to tolerate f faulty nodes. Much more expensive — mainly used in blockchain and high-security systems.
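The 3f+1 bound is quorum arithmetic: with quorums of size 2f+1, any two quorums intersect in at least f+1 nodes, so even if f of those are Byzantine, one honest node bridges every pair of decisions. A quick check:

```python
def bft_sizes(f: int):
    """Minimum cluster and quorum sizes to tolerate f Byzantine
    nodes, per PBFT."""
    n = 3 * f + 1        # total replicas required
    quorum = 2 * f + 1   # matching replies needed before acting
    # Two quorums overlap in at least 2*quorum - n = f + 1 nodes:
    # even if f of those lie, one honest node is in the intersection.
    assert 2 * quorum - n == f + 1
    return n, quorum

print(bft_sizes(1))  # → (4, 3): 4 nodes tolerate 1 traitor
```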
The Two Generals Problem
The simplest illustration of distributed agreement being impossible with unreliable communication:
Two generals must agree on an attack time. They communicate by messenger through enemy territory (messages can be lost). No protocol with a finite number of messages can guarantee agreement.
This is why TCP uses a three-way handshake — it doesn't solve the problem, it just makes failure unlikely enough for practical use.
Vector Clocks & Happened-Before
Leslie Lamport's "happened-before" relation (1978) defines causal ordering in distributed systems:
- If A and B are in the same process and A occurs before B, then A → B
- If A is a message send and B is its receipt, then A → B
- If A → B and B → C, then A → C (transitivity)
Vector clocks track this relation:
Process P1: [2, 0, 0] — P1 has done 2 events
Process P2: [1, 3, 0] — P2 has seen P1's event 1 and done 3 of its own
Process P3: [1, 2, 1] — P3 has seen P1:1, P2:2, and done 1 of its own
If vector A ≤ vector B component-wise (and A ≠ B) → A happened before B. If neither A ≤ B nor B ≤ A → events are concurrent (no causal relationship).
Used for: Conflict detection in Dynamo-style databases, CRDTs, version tracking.
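The comparison and merge rules are short to implement. A minimal sketch using the three processes from the example above:

```python
def happened_before(a, b):
    """A → B iff A ≤ B component-wise and A ≠ B."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def concurrent(a, b):
    """Concurrent iff neither clock dominates the other."""
    return not happened_before(a, b) and not happened_before(b, a)

def merge(a, b):
    """On message receipt: component-wise max combines causal histories."""
    return [max(x, y) for x, y in zip(a, b)]

p1, p2, p3 = [2, 0, 0], [1, 3, 0], [1, 2, 1]
assert happened_before([1, 0, 0], p2)  # P1's first event precedes P2's state
assert concurrent(p1, p2)              # neither dominates → concurrent
assert merge(p1, p2) == [2, 3, 0]      # union of both causal histories
```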
CRDTs — Conflict-Free Replicated Data Types
Data structures that can be replicated across nodes and merged without conflicts, regardless of the order updates are received.
| CRDT | Type | Example |
|---|---|---|
| G-Counter | Grow-only counter | Each node maintains its own count; merge = sum |
| PN-Counter | Counter (inc/dec) | Two G-Counters: one for increments, one for decrements |
| G-Set | Grow-only set | Union of all elements across replicas |
| OR-Set | Set with remove | Add-wins observed-remove set |
| LWW-Register | Last-writer-wins | Timestamp-based conflict resolution |
| LWW-Map | Map with LWW per key | Key-value store with per-key timestamps |
Used in: Redis CRDT (Redis Enterprise), Riak, Automerge (collaborative editing), Figma, Apple Notes.
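The first two rows of the table translate directly into code. A minimal G-Counter/PN-Counter sketch (node IDs and method names here are illustrative, not any particular library's API):

```python
class GCounter:
    """Grow-only counter: each node increments only its own slot;
    merge is component-wise max, so merges commute and converge."""
    def __init__(self, n_nodes):
        self.counts = [0] * n_nodes

    def increment(self, node_id, amount=1):
        self.counts[node_id] += amount

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

class PNCounter:
    """Counter with decrement: two G-Counters, value = inc − dec."""
    def __init__(self, n_nodes):
        self.inc, self.dec = GCounter(n_nodes), GCounter(n_nodes)

    def increment(self, node_id): self.inc.increment(node_id)
    def decrement(self, node_id): self.dec.increment(node_id)
    def value(self): return self.inc.value() - self.dec.value()
    def merge(self, other):
        self.inc.merge(other.inc)
        self.dec.merge(other.dec)

# Two replicas diverge, then merge — the result is order-independent.
a, b = PNCounter(2), PNCounter(2)
a.increment(0); a.increment(0)   # replica A: +2
b.increment(1); b.decrement(1)   # replica B: +1, −1
a.merge(b)
assert a.value() == 2
```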
When CRDTs shine
- Offline-first apps that sync later
- Multi-region databases with local writes
- Collaborative editing (text, design tools)
- Edge computing where coordination is expensive
The CALM Theorem
Consistency As Logical Monotonicity (Hellerstein, 2010):
A program can be made eventually consistent (coordination-free) if and only if it is monotonic — it only adds information, never retracts.
Monotonic: Counters, sets (add-only), logs, accumulators. Non-monotonic: Garbage collection, global constraints, "exactly once" delivery.
This tells you which parts of your system can be AP (anything monotonic) and which must involve coordination (anything that requires negation or retraction).
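The monotonic/non-monotonic split can be seen directly: set union is insensitive to delivery order (safe without coordination), while a removal retracts information and makes the outcome order-dependent:

```python
from functools import reduce

# Monotonic: add-only set. Any delivery order converges to one state.
updates = [{"a"}, {"b"}, {"a", "c"}]
forward  = reduce(set.union, updates, set())
backward = reduce(set.union, reversed(updates), set())
assert forward == backward == {"a", "b", "c"}  # coordination-free

# Non-monotonic: interleave a removal and order suddenly matters.
#   add "a" then remove "a" → {}      remove "a" then add "a" → {"a"}
# Convergence now needs coordination — or a CRDT like the OR-Set.
```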
Practical Decision Framework
Choosing consistency for your service
| Service Type | Suggested Model | Why |
|---|---|---|
| Payment / Banking | Linearizable (CP) | Money must not appear/disappear. Prefer errors over incorrect balances |
| Inventory | Serializable or strong | Overselling is costly; slight latency is acceptable |
| User profile | Read-your-writes | User must see their own changes; others can lag |
| Social feed | Eventual | Seeing a post 2 seconds late is fine |
| Analytics / Metrics | Eventual | Approximate counts are acceptable |
| Chat / Messaging | Causal | Messages must appear in order within a conversation |
| DNS / CDN | Eventual with TTL | Stale for seconds/minutes is acceptable |
| Distributed locks | Linearizable | A lock that two processes hold simultaneously is useless |
Per-operation tuning
Many systems let you choose consistency per request:
```python
# DynamoDB (boto3) — strong vs eventual per read
table.get_item(Key={"id": "123"}, ConsistentRead=True)   # strong
table.get_item(Key={"id": "123"}, ConsistentRead=False)  # eventual (default, half the read cost)
```

```python
# Cassandra (Python driver) — consistency is set per statement
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

session.execute(SimpleStatement(query, consistency_level=ConsistencyLevel.QUORUM))  # strong
session.execute(SimpleStatement(query, consistency_level=ConsistencyLevel.ONE))     # fast
```
Resources
- Designing Data-Intensive Applications — Chapters 5, 7, 8, 9 cover all of this
- The Raft Consensus Algorithm — Interactive visualization
- Jepsen.io — Kyle Kingsbury's distributed systems testing (reality checks for every database)
- CAP Twelve Years Later — Eric Brewer's own revisitation
- Consistency Models — Jepsen's consistency hierarchy diagram
- CRDTs: The Hard Parts (Martin Kleppmann)
:::