SQL vs NoSQL: Data Models, Trade-offs & When to Use What
Choosing a database is not about picking "the best one"; it's about matching a data model and consistency guarantee to your access patterns. This guide walks through the major database families, their internal structures, and practical selection criteria.
The Fundamental Divide
| Dimension | SQL (Relational) | NoSQL |
|---|---|---|
| Data model | Tables with rows and columns, strict schema | Varies: documents, key-value, wide-column, graph |
| Schema | Schema-on-write (enforced at insert) | Schema-on-read (flexible, often schemaless) |
| Query language | SQL (standardized) | Varies per engine (some support SQL-like syntax) |
| Consistency | Strong (ACID transactions by default) | Tunable — from eventual to strong |
| Scaling model | Vertical first, sharding is complex | Horizontal first, designed for distribution |
| Joins | Native, multi-table joins are first-class | Typically avoided; denormalization preferred |
The CAP theorem states that a distributed system cannot simultaneously guarantee all three of Consistency, Availability, and Partition tolerance. Since network partitions are unavoidable in practice, the real choice is between consistency and availability during a partition.
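Many NoSQL systems surface this trade-off directly as quorum tuning. A minimal sketch of the arithmetic (Dynamo/Cassandra-style replication; the function name is mine, for illustration): with N replicas, a write acknowledged by W nodes and a read that consults R nodes must overlap in at least one node whenever R + W > N, so the read is guaranteed to see the latest acknowledged write.

```python
# Quorum overlap check: with N replicas, a write acknowledged by W nodes
# and a read consulting R nodes are guaranteed to intersect when R + W > N,
# so the read observes the latest acknowledged write.
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    return r + w > n

# Cassandra-style examples with N=3 replicas:
print(is_strongly_consistent(3, 2, 2))  # True  (QUORUM writes + QUORUM reads)
print(is_strongly_consistent(3, 1, 1))  # False (ONE/ONE: eventual consistency)
```

Dropping R or W buys availability and latency during a partition at the cost of possibly stale reads.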
Relational Databases (SQL)
How they work
Data is stored in tables (relations) with a fixed schema. Each row has the same columns. Relationships between tables are expressed through foreign keys and resolved through JOINs at query time.
-- Normalized relational model
CREATE TABLE customers (
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
email TEXT UNIQUE NOT NULL
);
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
customer_id INT REFERENCES customers(id),
total DECIMAL(10,2),
created_at TIMESTAMPTZ DEFAULT now()
);
-- Join at query time
SELECT c.name, o.total
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.created_at > '2026-01-01';
Storage internals
Most relational databases use a B-tree index for primary keys and secondary indexes. Data is stored in pages (typically 8 KB in PostgreSQL, 16 KB in MySQL/InnoDB). Row-oriented storage means each page contains complete rows — efficient for OLTP, less so for analytical scans.
Write-Ahead Log (WAL): All changes are first written to a sequential log before being applied to data pages. This guarantees durability and crash recovery.
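The WAL discipline — log first, fsync, only then touch the data — can be sketched in a few lines. This is a toy model (class and key names are mine), not how PostgreSQL actually formats its WAL records:

```python
import os
import tempfile

# Minimal write-ahead log sketch: every change is appended and fsync'd to a
# sequential log *before* the in-memory "data page" is updated, so a crash
# after the fsync is recoverable by replaying the log from the start.
class TinyWAL:
    def __init__(self, path):
        self.log = open(path, "a+")
        self.pages = {}  # stands in for on-disk data pages

    def put(self, key, value):
        self.log.write(f"SET {key} {value}\n")
        self.log.flush()
        os.fsync(self.log.fileno())  # durability point: the change is on disk
        self.pages[key] = value      # only now update the "page"

    def recover(self):
        self.log.seek(0)             # replay the whole log after a crash
        for line in self.log:
            _, key, value = line.split(" ", 2)
            self.pages[key] = value.rstrip("\n")

path = os.path.join(tempfile.mkdtemp(), "wal.log")
TinyWAL(path).put("balance:42", "100")   # pretend we crash right after this
restarted = TinyWAL(path)                # fresh "process", empty pages
restarted.recover()
print(restarted.pages)  # {'balance:42': '100'} — rebuilt from the log
```

Because the log is append-only, writes are sequential I/O even when the data pages they describe are scattered across the disk.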
ACID properties
| Property | Guarantee |
|---|---|
| Atomicity | All operations in a transaction succeed or none do |
| Consistency | Constraints (foreign keys, checks, unique) are always enforced |
| Isolation | Concurrent transactions don't interfere (configurable levels) |
| Durability | Committed data survives crashes |
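Atomicity and rollback are easy to demonstrate with SQLite from the Python standard library. The table and amounts below are invented for the demo; the transaction behavior is real:

```python
import sqlite3

# Atomicity sketch: both sides of a transfer commit together or not at all.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 150 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 150 WHERE id = 2")
        row = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()
        if row[0] < 0:
            raise ValueError("insufficient funds")  # triggers the rollback
except ValueError:
    pass  # the half-applied transfer left no trace

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {1: 100, 2: 0}
```

The failed transfer never becomes visible: both UPDATEs vanish together, which is exactly the guarantee application code gets to rely on.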
When to choose SQL
- Complex relationships between entities (users, orders, products, invoices)
- Strong consistency is non-negotiable (financial transactions, inventory)
- Ad-hoc queries — SQL's expressiveness lets analysts explore freely
- Mature ecosystem — decades of tooling, ORMs, backup strategies
Key engines
| Engine | Strengths | Cloud managed |
|---|---|---|
| PostgreSQL | Extensions (PostGIS, pgvector, TimescaleDB), JSONB, CTEs | RDS, AlloyDB, Azure, Neon |
| MySQL | Web-scale proven, replication maturity | RDS, Cloud SQL, PlanetScale |
| SQL Server | Enterprise features, .NET integration | Azure SQL |
| CockroachDB | Distributed SQL, serializable by default | CockroachDB Cloud |
| SQLite | Embedded, zero-config, serverless-friendly | Turso (distributed SQLite) |
Document Databases
How they work
Data is stored as documents — typically JSON (or BSON). Each document is self-contained: it carries all its data without requiring joins. Documents in the same collection can have different structures.
{
"_id": "order_42",
"customer": {
"name": "Alice Martin",
"email": "alice@example.com"
},
"items": [
{ "sku": "WIDGET-01", "qty": 3, "price": 29.99 },
{ "sku": "GADGET-07", "qty": 1, "price": 149.00 }
],
"total": 238.97,
"status": "shipped",
"created_at": "2026-03-15T10:30:00Z"
}
Storage internals
MongoDB uses a B-tree index on _id and supports secondary indexes on any field, including nested paths and arrays. Documents are stored as BSON; the default WiredTiger storage engine provides compression and document-level concurrency control.
When to choose document DBs
- Varying schemas — product catalogs where each item has different attributes
- Hierarchical/nested data — CMS content, user profiles with embedded preferences
- Rapid prototyping — no migrations, add fields freely
- Read-heavy with known access patterns — embed related data to avoid joins
When to avoid
- Heavy cross-collection joins (you'll end up reimplementing SQL joins in application code)
- Transactions spanning many documents (MongoDB supports multi-doc transactions, but they're expensive)
Key engines
| Engine | Model | Notes |
|---|---|---|
| MongoDB | BSON documents | Market leader, Atlas cloud, aggregation pipeline |
| Couchbase | JSON documents | Built-in caching layer, SQL++ query language |
| Amazon DocumentDB | MongoDB-compatible | AWS-managed, not open-source MongoDB |
| Firestore | JSON documents | Serverless, real-time sync, mobile SDKs |
| CouchDB | JSON documents | Multi-master replication, offline-first |
Key-Value Stores
How they work
The simplest data model: a key maps to a value. The value is opaque to the store; it's just bytes. Lookups by key are typically O(1).
SET session:abc123 '{"user_id": 42, "role": "admin"}' EX 3600
GET session:abc123
Storage internals
Redis stores everything in memory with optional persistence (RDB snapshots, AOF append log). Data structures go beyond simple strings: lists, sets, sorted sets, hashes, streams, and HyperLogLog.
DynamoDB uses a partition key (hash) to distribute data across nodes and an optional sort key for range queries within a partition. Storage is SSD-backed with single-digit millisecond latency.
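The SET ... EX pattern above is easy to model in a few lines. A toy in-memory store (class name mine) that mimics Redis-style per-key TTL with lazy expiration:

```python
import time

# Minimal key-value store with per-key TTL, mimicking Redis's SET ... EX.
# The value is opaque to the store; only the key and expiry matter.
class TTLStore:
    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ex=None):
        expires = time.monotonic() + ex if ex is not None else None
        self._data[key] = (value, expires)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires = item
        if expires is not None and time.monotonic() >= expires:
            del self._data[key]  # lazy expiration, as Redis does on access
            return None
        return value

store = TTLStore()
store.set("session:abc123", '{"user_id": 42}', ex=3600)
print(store.get("session:abc123"))  # '{"user_id": 42}'
```

Real engines add background expiry sweeps and eviction policies on top, but the access pattern — opaque value, O(1) lookup, expiry as metadata — is the whole model.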
When to choose key-value
- Caching — session data, API responses, computed results
- High-throughput, low-latency — shopping carts, rate limiting, leaderboards
- Simple access patterns — lookup by ID only, no complex queries needed
Key engines
| Engine | Type | Notes |
|---|---|---|
| Redis / Valkey | In-memory | Data structures, pub/sub, Lua scripting |
| DynamoDB | Managed, SSD | Auto-scaling, single-digit ms, global tables |
| Memcached | In-memory | Pure cache, multi-threaded, simpler than Redis |
| etcd | Distributed KV | Used by Kubernetes, strong consistency via Raft |
Wide-Column (Column-Family) Stores
How they work
Data is organized into rows identified by a row key, but columns are grouped into column families. Unlike relational tables, each row can have different columns, and columns are stored together on disk by family — enabling efficient scans of specific column groups.
Row key: user#42
Column family "profile":
name = "Alice"
email = "alice@example.com"
Column family "activity":
last_login = "2026-03-15"
login_count = 142
last_page = "/dashboard"
Storage internals
Based on the LSM-tree (Log-Structured Merge-tree) architecture:
- Writes go to an in-memory memtable
- When full, flushed to disk as an immutable SSTable (Sorted String Table)
- Background compaction merges SSTables to reduce read amplification
This makes writes extremely fast (sequential I/O) at the cost of read amplification (may need to check multiple SSTables).
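The three steps above fit in a toy sketch (class name mine; real engines add bloom filters, block indexes, and leveled compaction):

```python
# LSM-tree sketch: writes land in an in-memory memtable; when full it is
# flushed as an immutable sorted run (an "SSTable"); reads check the
# memtable first, then runs newest-first — that scan is read amplification.
class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []  # list of sorted runs, newest appended last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.sstables.append(sorted(self.memtable.items()))  # flush
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):  # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        # merge all runs, letting newer values shadow older ones,
        # leaving a single run — fewer places a read must look
        merged = {}
        for run in self.sstables:
            merged.update(run)
        self.sstables = [sorted(merged.items())]

db = TinyLSM()
for k, v in [("a", "1"), ("b", "2"), ("a", "3")]:
    db.put(k, v)
print(db.get("a"), db.get("b"))  # 3 2
```

Note that `put` never touches existing data on disk — flushed runs are immutable, which is why LSM writes are pure sequential I/O.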
When to choose wide-column
- Time-series data — IoT sensor readings, metrics, event logs
- Write-heavy workloads — billions of events per day
- Wide rows with sparse columns — user activity tracking
- Geographic distribution — multi-region replication
Key engines
| Engine | Notes |
|---|---|
| Apache Cassandra | Peer-to-peer, tunable consistency, CQL query language |
| ScyllaDB | Cassandra-compatible, C++ rewrite, lower latency |
| HBase | Hadoop ecosystem, strong consistency, HDFS-backed |
| Google Bigtable | Fully managed, powers Google Search/Maps/Gmail |
| Amazon Keyspaces | Managed Cassandra-compatible on AWS |
Graph Databases
How they work
Data is modeled as nodes (entities) and edges (relationships). Both can carry properties. Graph databases excel at traversing relationships — something that requires expensive recursive JOINs in SQL.
// Cypher (Neo4j) — find friends of friends who like Python
MATCH (me:Person {name: "Alice"})-[:FRIENDS_WITH]->(friend)-[:FRIENDS_WITH]->(fof)
WHERE (fof)-[:LIKES]->(:Topic {name: "Python"})
AND NOT (me)-[:FRIENDS_WITH]->(fof)
RETURN DISTINCT fof.name
Storage internals
Native graph storage (Neo4j): Nodes and relationships are stored as linked records with direct physical pointers. Traversing a relationship is a pointer hop — O(1) regardless of graph size. This is called index-free adjacency.
Non-native graph (JanusGraph, Amazon Neptune): Uses an underlying storage engine (Cassandra, DynamoDB) with an index layer for graph traversals. More scalable horizontally but slower per-hop.
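The Cypher query above can be mirrored with plain adjacency sets to show what the traversal actually does. The people and topics are made up; each relationship hop is a dict/set lookup — the in-memory analogue of index-free adjacency's pointer hop:

```python
# Friends-of-friends of Alice who like Python, excluding her direct
# friends and Alice herself — the same shape as the Cypher query above.
friends = {
    "Alice": {"Bob", "Carol"},
    "Bob": {"Dave"},
    "Carol": {"Erin", "Bob"},
    "Dave": set(),
    "Erin": set(),
}
likes = {"Dave": {"Python"}, "Erin": {"Go"}}

def fof_who_like(person, topic):
    direct = friends[person]
    fof = {f2 for f1 in direct for f2 in friends[f1]}  # two hops out
    candidates = fof - direct - {person}               # the NOT clause
    return {p for p in candidates if topic in likes.get(p, set())}

print(fof_who_like("Alice", "Python"))  # {'Dave'}
```

In SQL the same two-hop query needs a self-join on the friendship table per hop; a graph engine just follows edges, so cost scales with the neighborhood size, not the table size.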
When to choose graph
- Social networks — friends, followers, recommendations
- Knowledge graphs — entity relationships, ontologies
- Fraud detection — finding suspicious relationship patterns
- Network topology — infrastructure dependencies, routing
- Recommendation engines — "users who bought X also bought Y"
Key engines
| Engine | Query Language | Notes |
|---|---|---|
| Neo4j | Cypher | Market leader, native graph storage, ACID |
| Amazon Neptune | Gremlin, SPARQL, openCypher | Managed, supports property graph + RDF |
| ArangoDB | AQL | Multi-model (document + graph + key-value) |
| JanusGraph | Gremlin | Open-source, pluggable storage backend |
| Dgraph | GraphQL, DQL | Distributed, native GraphQL support |
Time-Series Databases
How they work
Optimized for append-heavy, time-stamped data where recent data is accessed more frequently. Most use columnar storage with aggressive compression (delta encoding, run-length encoding) since consecutive timestamps and values are often similar.
-- InfluxQL
SELECT mean("cpu_usage")
FROM "server_metrics"
WHERE time > now() - 1h
GROUP BY time(5m), "host"
When to choose time-series
- Infrastructure monitoring — CPU, memory, disk, network metrics
- IoT — sensor data at scale
- Financial data — tick data, OHLCV candles
- Application metrics — request latency, error rates, throughput
Key engines
| Engine | Notes |
|---|---|
| InfluxDB | Purpose-built TSDB, Flux query language |
| TimescaleDB | PostgreSQL extension — full SQL on time-series |
| Prometheus | Pull-based metrics, de facto for Kubernetes |
| Amazon Timestream | Managed, serverless TSDB on AWS |
| ClickHouse | Column-oriented OLAP that excels at time-series |
| QuestDB | High-performance TSDB with SQL support |
Vector Databases
How they work
Store high-dimensional embedding vectors and support approximate nearest neighbor (ANN) search. Essential for AI/ML applications where similarity search replaces exact matching.
# Typical vector search flow
embedding = model.encode("How do I reset my password?")
results = collection.query(
query_embeddings=[embedding],
n_results=5 # top-5 most similar
)
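What the query returns is the top-k most similar vectors. A brute-force exact baseline (doc IDs and 2-d vectors invented for the demo) shows the computation that ANN index structures approximate without scanning every vector:

```python
import math

# Exact nearest-neighbor search by cosine similarity: score the query
# against every stored vector, sort, take the top k. O(n) per query —
# the cost that HNSW, IVF, and friends exist to avoid at scale.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query, vectors, k=2):
    scored = sorted(vectors.items(), key=lambda kv: cosine(query, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

docs = {
    "reset-password":   [0.9, 0.1],
    "billing-faq":      [0.1, 0.9],
    "account-recovery": [0.8, 0.3],
}
print(top_k([1.0, 0.2], docs, k=2))  # ['reset-password', 'account-recovery']
```

ANN trades a little recall (it may miss the true nearest neighbor) for sub-linear query time, which is the trade-off the index table below quantifies per algorithm.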
Index structures
| Algorithm | Type | Trade-off |
|---|---|---|
| HNSW | Graph-based | High recall, more memory |
| IVF | Inverted file | Faster index build, lower recall |
| PQ | Product quantization | Compressed vectors, lower accuracy |
| ScaNN | Hybrid | Google's optimized ANN |
Key engines
| Engine | Notes |
|---|---|
| Pinecone | Managed, serverless, metadata filtering |
| Weaviate | Open-source, multi-modal, GraphQL API |
| Milvus / Zilliz | Open-source, GPU-accelerated, distributed |
| Qdrant | Rust-based, filtering + payload support |
| pgvector | PostgreSQL extension — vectors in your existing DB |
| ChromaDB | Lightweight, Python-native, popular for prototyping |
Decision Framework
By access pattern
| Access Pattern | Best Fit |
|---|---|
| Complex joins, ad-hoc queries, transactions | Relational (SQL) |
| Lookup by key, caching, sessions | Key-Value |
| Nested/hierarchical objects, flexible schema | Document |
| Time-stamped append-only streams | Time-Series |
| Relationship traversal, path finding | Graph |
| High-volume writes, wide sparse rows | Wide-Column |
| Similarity search, embeddings | Vector |
By consistency requirement
| Requirement | Options |
|---|---|
| Strong ACID needed | PostgreSQL, MySQL, CockroachDB, Spanner |
| Tunable consistency OK | Cassandra, DynamoDB, MongoDB |
| Eventual consistency fine | Redis (replication), Couchbase, Cassandra |
The polyglot persistence reality
Most production systems use multiple databases for different workloads:
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│  PostgreSQL  │   │    Redis     │   │Elasticsearch │
│  (orders,    │   │  (sessions,  │   │  (full-text  │
│  inventory)  │   │   cache)     │   │   search)    │
└──────┬───────┘   └──────┬───────┘   └──────┬───────┘
       │                  │                  │
       └──────────────────┼──────────────────┘
                          │
                  ┌───────┴───────┐
                  │  Application  │
                  └───────┬───────┘
                          │
                  ┌───────┴───────┐
                  │    Kafka /    │
                  │   Event Bus   │
                  └───────┬───────┘
                          │
          ┌───────────────┼───────────────┐
          │               │               │
   ┌──────┴──────┐     ┌──┴───┐    ┌──────┴──────┐
   │ ClickHouse  │     │  S3  │    │    Neo4j    │
   │ (analytics) │     │(lake)│    │ (recs graph)│
   └─────────────┘     └──────┘    └─────────────┘
The key is to let events flow between systems via a message bus rather than trying to synchronize databases directly.
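The event-flow pattern above can be sketched with an in-process bus (class and topic names are mine, for illustration): the application publishes a domain event once, and each store-specific consumer updates its own system independently, instead of the app writing to every database in one distributed transaction.

```python
from collections import defaultdict

# Toy publish/subscribe bus: handlers per topic, fan-out on publish.
class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:
            handler(event)

analytics_rows, search_index = [], {}
bus = EventBus()
bus.subscribe("order.created", analytics_rows.append)  # stand-in: ClickHouse
bus.subscribe("order.created",
              lambda e: search_index.update({e["id"]: e}))  # stand-in: search

bus.publish("order.created", {"id": "order_42", "total": 238.97})
print(len(analytics_rows), "order_42" in search_index)  # 1 True
```

A real broker such as Kafka adds durability, ordering per partition, and consumer offsets, so a lagging or crashed consumer can replay events instead of losing them — the property that makes this safer than direct cross-database writes.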
Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Using MongoDB for heavily relational data | Painful manual joins, data duplication | Use PostgreSQL with JSONB for semi-structured parts |
| Using SQL for simple key lookups at scale | Unnecessary overhead | Add a Redis cache or move to DynamoDB |
| Choosing a DB based on hype | Operational burden, mismatched patterns | Match the DB to your actual access patterns |
| Ignoring operational costs | "Free" open-source still needs backups, upgrades, monitoring | Factor in managed vs self-hosted total cost |
| Premature denormalization | Data inconsistency, update anomalies | Normalize first, denormalize when you have proof of need |
Resources
- Designing Data-Intensive Applications — Martin Kleppmann's essential book
- Database Internals — Alex Petrov's deep dive into storage engines
- DB-Engines Ranking — Popularity tracking for 400+ database engines
- Use The Index, Luke — SQL indexing and tuning guide
- CMU Database Group lectures — Andy Pavlo's database systems course