
CAP, PACELC & Beyond: The Theorems That Shape Distributed Systems

#databases#distributed-systems#architecture#cloud#reliability

Every distributed database, message queue, and microservice architecture is constrained by a handful of fundamental theorems. Understanding them prevents you from chasing impossible designs and helps you make explicit trade-offs.

The CAP Theorem

Formulated by Eric Brewer (2000), proven by Gilbert & Lynch (2002).

A distributed data store can provide at most two of three guarantees simultaneously:

  • Consistency — every read receives the most recent write (linearizability)
  • Availability — every request to a non-failing node receives a non-error response (not necessarily the latest data)
  • Partition tolerance — the system continues operating despite network partitions

Why "pick two" is misleading

Network partitions will happen. You don't choose whether to tolerate them — you must. So the real choice during a partition is:

| During a partition… | You get | You lose | Examples |
|---|---|---|---|
| CP — reject requests until the partition heals | Consistency | Availability | HBase, MongoDB (default), etcd, Spanner |
| AP — serve possibly stale data | Availability | Consistency | Cassandra, DynamoDB (eventual), CouchDB |

When there is no partition (the normal case), you can have both C and A. CAP only constrains behavior during a failure.

        Normal operation          During partition
        ┌───────────┐            ┌───────────┐
        │  C + A    │            │  C or A   │
        │  (both!)  │            │ (pick one)│
        └───────────┘            └───────────┘
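The split above can be made concrete with a toy illustration (not any real database's API): a single replica deciding how to answer a read during a partition, under each policy.

```python
# Toy sketch: how a CP vs AP replica answers a read during a partition.
# All names here are illustrative, not a real database's API.
class Replica:
    def __init__(self, policy, cluster_size):
        self.policy = policy            # "CP" or "AP"
        self.cluster_size = cluster_size
        self.reachable = cluster_size   # peers we can currently contact
        self.value = None               # local (possibly stale) copy

    def read(self):
        majority = self.cluster_size // 2 + 1
        if self.reachable >= majority:
            return self.value           # no partition on our side: C and A
        if self.policy == "CP":
            # Consistency first: refuse rather than risk a stale answer.
            raise TimeoutError("minority partition: refusing possibly stale read")
        return self.value               # AP: always answer, possibly stale
```

When the node can reach a majority, both policies behave identically, which is exactly the "normal operation" box in the diagram.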

Common misconceptions

| Misconception | Reality |
|---|---|
| "MongoDB is CP, so it's always consistent" | MongoDB sacrifices availability during primary election, and secondary reads can still be stale |
| "Cassandra is AP, so data is always wrong" | Cassandra offers tunable consistency (QUORUM reads plus QUORUM writes give strong consistency for that query) |
| "SQL = CP, NoSQL = AP" | PostgreSQL with async replicas is effectively AP; CockroachDB is CP and speaks SQL |
| "CAP means you lose one forever" | CAP only constrains behavior during partitions; normally you get all three |

The PACELC Theorem

Daniel Abadi (2012) extended CAP to cover the normal (no partition) case:

If there is a Partition, choose between Availability and Consistency. Else (normal operation), choose between Latency and Consistency.

This is more practical because most of the time there is no partition, and the real trade-off is latency vs consistency.

| System | During Partition (PAC) | Else (ELC) | Classification |
|---|---|---|---|
| PostgreSQL (sync replicas) | PC | EC | PC/EC |
| PostgreSQL (async replicas) | PA | EL | PA/EL |
| Cassandra (QUORUM) | PC | EC | PC/EC |
| Cassandra (ONE) | PA | EL | PA/EL |
| DynamoDB (eventual) | PA | EL | PA/EL |
| DynamoDB (strong) | PC | EC | PC/EC |
| Spanner | PC | EC | PC/EC (with TrueTime for low L) |
| CockroachDB | PC | EC | PC/EC |
| MongoDB (default) | PC | EL | PC/EL |
| Redis (replicated) | PA | EL | PA/EL |

The key insight

Most databases let you tune where you sit on this spectrum per-query or per-table. DynamoDB has both eventually consistent reads (fast, cheap) and strongly consistent reads (slower, 2x cost). Cassandra's consistency level can be set per query.
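The arithmetic behind this tuning is simple quorum overlap: with N replicas, R contacted per read and W per write, a read is guaranteed to see the latest write exactly when every read quorum intersects every write quorum. A small sketch:

```python
# Quorum arithmetic behind "tunable consistency": N replicas,
# R replicas contacted per read, W acknowledgements per write.
def read_sees_latest_write(n, r, w):
    # If R + W > N, any read quorum overlaps any write quorum,
    # so at least one contacted replica holds the newest value.
    return r + w > n

# N=3: QUORUM (2) reads + QUORUM (2) writes overlap; ONE + ONE may not.
print(read_sees_latest_write(3, 2, 2))  # → True
print(read_sees_latest_write(3, 1, 1))  # → False
```

This is why Cassandra's QUORUM/QUORUM combination behaves strongly while ONE/ONE trades that guarantee for latency.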

Consistency Models

"Consistency" means different things depending on context. Here's the hierarchy from strongest to weakest:

Strict / Linearizability

The gold standard. Every operation appears to execute atomically at some point between its invocation and response, and all processes see the same order.

Cost: Requires coordination (consensus protocols). High latency. Who provides it: Spanner (TrueTime), CockroachDB, etcd, ZooKeeper.

Sequential Consistency

All processes see operations in the same order, and each process's operations appear in program order. But the global order might not match real-time order.

Cost: Cheaper than linearizability, still requires ordering.

Causal Consistency

If operation A causally precedes operation B (A happened-before B), then all processes see A before B. Concurrent operations (no causal link) can be seen in any order.

Cost: Moderate. Can be implemented without global coordination. Who provides it: MongoDB (causal sessions), some CRDT-based systems.

Eventual Consistency

If no new updates are made, eventually all replicas converge to the same value. No ordering guarantees during convergence.

Cost: Cheapest. Maximum availability and performance. Who provides it: Cassandra (ONE), DynamoDB (default), DNS, CDN caches.

Read-your-writes Consistency

A client always sees its own writes. Others might not see them immediately.

Cost: Low. Can be achieved with session stickiness. Who provides it: Most databases can be configured for this.

Strongest ────────────────────────────────── Weakest

Linearizable → Sequential → Causal → Read-your-writes → Eventual

Most coordination ──────────────── Least coordination
Highest latency ────────────────── Lowest latency

Consensus Protocols

How do distributed nodes agree on a value? This is the consensus problem.

The FLP Impossibility Result

Fischer, Lynch, Paterson (1985): In an asynchronous system, no deterministic consensus protocol can guarantee termination if even one process may crash. This is why all practical consensus protocols use timeouts (partial synchrony assumption).

Paxos

Leslie Lamport (written 1989, published 1998). The foundational consensus protocol:

  1. Prepare — proposer sends proposal number to acceptors
  2. Promise — acceptors promise not to accept older proposals
  3. Accept — proposer sends value, acceptors accept if no newer proposal seen
  4. Learn — value is chosen when a majority of acceptors agree

Paxos is correct but notoriously hard to implement. Multi-Paxos extends it for a log of values.

Used by: Google Chubby, original Spanner.
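The acceptor's side of the two phases above can be sketched in a few lines. This is a simplified single-decree acceptor, assuming globally unique integer proposal numbers (e.g. round*N + node_id); real implementations add persistence and messaging.

```python
# Sketch of a single-decree Paxos acceptor (simplified; in-memory only).
class Acceptor:
    def __init__(self):
        self.promised = -1       # highest proposal number promised
        self.accepted_n = -1     # proposal number of the accepted value
        self.accepted_v = None   # accepted value, if any

    def on_prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            # Return any previously accepted value so the new proposer
            # must adopt it instead of its own (this preserves safety).
            return ("promise", self.accepted_n, self.accepted_v)
        return ("reject", self.promised, None)

    def on_accept(self, n, value):
        """Phase 2: accept unless a higher-numbered prepare was promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n, self.accepted_v = n, value
            return "accepted"
        return "rejected"
```

A value is chosen once a majority of acceptors have accepted the same proposal number; the proposer and learner sides are omitted here.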

Raft

Diego Ongaro & John Ousterhout (2014). Designed to be understandable — same guarantees as Multi-Paxos but with a clearer structure:

  1. Leader election — nodes vote, one becomes leader
  2. Log replication — leader appends entries, replicates to followers
  3. Safety — committed entries are never lost
                 ┌──────────┐
       ┌────────►│ Follower │
       │         └──────────┘
┌──────┴───┐     ┌──────────┐
│  Leader  │────►│ Follower │
└──────┬───┘     └──────────┘
       │         ┌──────────┐
       └────────►│ Follower │
                 └──────────┘
Leader replicates log entries to all followers.
Commit when majority acknowledges.

Used by: etcd, CockroachDB, TiKV, Consul, RabbitMQ Quorum Queues.
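The "commit when majority acknowledges" rule can be sketched as a pure function over the leader's match indices (a hypothetical helper; real Raft additionally only commits entries from the leader's current term):

```python
# Sketch of Raft's commit rule: an entry is committed once a majority
# of servers have replicated it. Real Raft adds one more condition:
# the leader only advances commitIndex for entries of its current term.
def commit_index(match_index, current_commit):
    """match_index[i] = highest log index known replicated on server i."""
    majority = len(match_index) // 2 + 1
    # The majority-th largest match index is stored on at least a majority.
    candidate = sorted(match_index, reverse=True)[majority - 1]
    return max(candidate, current_commit)  # commitIndex never moves backward
```

For five servers with match indices [5, 4, 3, 2, 1], the majority-replicated prefix ends at index 3, so that is where the commit index lands.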

Viewstamped Replication (VR)

Similar to Raft (predates it), used by some systems internally. Same leader-based model.

Byzantine Fault Tolerance (BFT)

Handles nodes that actively lie or behave maliciously (not just crash). PBFT (Practical BFT) requires 3f+1 nodes to tolerate f faulty nodes. Much more expensive — mainly used in blockchain and high-security systems.

The Two Generals Problem

The simplest illustration of distributed agreement being impossible with unreliable communication:

Two generals must agree on an attack time. They communicate by messenger through enemy territory (messages can be lost). No protocol with a finite number of messages can guarantee agreement.

This is why TCP uses a three-way handshake — it doesn't solve the problem, it just makes failure unlikely enough for practical use.

Vector Clocks & Happened-Before

Leslie Lamport's "happened-before" relation (1978) defines causal ordering in distributed systems:

  • If A and B are in the same process and A occurs before B, then A → B
  • If A is a message send and B is its receipt, then A → B
  • If A → B and B → C, then A → C (transitivity)

Vector clocks track this relation:

Process P1: [2, 0, 0]  — P1 has done 2 events
Process P2: [1, 3, 0]  — P2 has seen P1's event 1 and done 3 of its own
Process P3: [1, 2, 1]  — P3 has seen P1:1, P2:2, and done 1 of its own

If vector A ≤ vector B component-wise and A ≠ B → A happened before B. If neither direction holds → the events are concurrent (no causal relationship).

Used for: Conflict detection in Dynamo-style databases, CRDTs, version tracking.
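These comparison and merge rules fit in a few lines, assuming a fixed number of processes addressed by index (as in the example above):

```python
# Minimal vector clock operations over fixed-length lists of counters.
def happened_before(a, b):
    """True if the event with clock a causally precedes the one with clock b."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def concurrent(a, b):
    """True if neither event causally precedes the other."""
    return not happened_before(a, b) and not happened_before(b, a)

def merge(a, b):
    """On message receipt: component-wise max of sender's and local clock."""
    return [max(x, y) for x, y in zip(a, b)]

print(happened_before([1, 0, 0], [1, 3, 0]))  # → True
print(concurrent([2, 0, 0], [1, 3, 0]))       # → True (no causal link)
```

Dynamo-style stores use exactly this concurrency test to detect siblings that need conflict resolution.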

CRDTs — Conflict-Free Replicated Data Types

Data structures that can be replicated across nodes and merged without conflicts, regardless of the order updates are received.

| CRDT | Type | Example |
|---|---|---|
| G-Counter | Grow-only counter | Each node increments its own slot; merge = element-wise max, value = sum of slots |
| PN-Counter | Counter (inc/dec) | Two G-Counters: one for increments, one for decrements |
| G-Set | Grow-only set | Union of all elements across replicas |
| OR-Set | Set with remove | Add-wins observed-remove set |
| LWW-Register | Last-writer-wins register | Timestamp-based conflict resolution |
| LWW-Map | Map with LWW per key | Key-value store with per-key timestamps |

Used in: Redis CRDT (Redis Enterprise), Riak, Automerge (collaborative editing), Figma, Apple Notes.
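The G-Counter is small enough to sketch in full (class and names are illustrative). Each replica increments only its own slot; merge takes the element-wise max, since summing on merge would double-count, and the counter's value is the sum of all slots:

```python
# Minimal G-Counter sketch for a fixed-size cluster of n_nodes replicas.
class GCounter:
    def __init__(self, node_id, n_nodes):
        self.node_id = node_id
        self.counts = [0] * n_nodes

    def increment(self):
        # Each replica only ever touches its own slot.
        self.counts[self.node_id] += 1

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        """Element-wise max: commutative, associative, idempotent,
        so replicas can exchange state in any order, any number of times."""
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]
```

Because merge is idempotent, re-delivering the same state is harmless, which is what makes the structure safe under unreliable gossip.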

When CRDTs shine

  • Offline-first apps that sync later
  • Multi-region databases with local writes
  • Collaborative editing (text, design tools)
  • Edge computing where coordination is expensive

The CALM Theorem

Consistency As Logical Monotonicity (Hellerstein, 2010):

A program can be made eventually consistent (coordination-free) if and only if it is monotonic — it only adds information, never retracts.

Monotonic: Counters, sets (add-only), logs, accumulators. Non-monotonic: Garbage collection, global constraints, "exactly once" delivery.

This tells you which parts of your system can be AP (anything monotonic) and which must involve coordination (anything that requires negation or retraction).
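Monotonicity in miniature: merging add-only sets commutes and associates, so replicas converge regardless of delivery order; a deletion, by contrast, can be undone by a late-arriving add. A sketch:

```python
# Add-only (monotonic) updates converge no matter the delivery order.
from functools import reduce

updates_a = [{"x"}, {"y"}]   # updates originating at replica A
updates_b = [{"y"}, {"z"}]   # updates originating at replica B

merge = lambda acc, s: acc | s
order1 = reduce(merge, updates_a + updates_b, set())
order2 = reduce(merge, updates_b + updates_a, set())
assert order1 == order2 == {"x", "y", "z"}  # same result either way
```

Add a remove operation and the two orders can diverge, which is exactly where CALM says coordination (or a cleverer structure like an OR-Set) becomes necessary.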

Practical Decision Framework

Choosing consistency for your service

| Service Type | Suggested Model | Why |
|---|---|---|
| Payment / Banking | Linearizable (CP) | Money must not appear or disappear; prefer errors over incorrect balances |
| Inventory | Serializable or strong | Overselling is costly; slight latency is acceptable |
| User profile | Read-your-writes | Users must see their own changes; others can lag |
| Social feed | Eventual | Seeing a post 2 seconds late is fine |
| Analytics / Metrics | Eventual | Approximate counts are acceptable |
| Chat / Messaging | Causal | Messages must appear in order within a conversation |
| DNS / CDN | Eventual with TTL | Stale for seconds or minutes is acceptable |
| Distributed locks | Linearizable | A lock that two processes hold simultaneously is useless |

Per-operation tuning

Many systems let you choose consistency per request:

# DynamoDB (boto3) — strong vs eventual per read
table.get_item(Key={"id": "123"}, ConsistentRead=True)   # strong (2x read cost)
table.get_item(Key={"id": "123"}, ConsistentRead=False)  # eventual (default)

# Cassandra — consistency is set per statement, not on execute()
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement
session.execute(SimpleStatement(query, consistency_level=ConsistencyLevel.QUORUM))  # strong
session.execute(SimpleStatement(query, consistency_level=ConsistencyLevel.ONE))     # fast
