tadata
Back to home

SQL vs NoSQL: Data Models, Trade-offs & When to Use What

#databases#sql#nosql#data-engineering#cloud

Choosing a database is not about picking "the best one" — it's about matching a data model and consistency guarantee to your access patterns. This guide walks through every major database family, their internal structures, and practical selection criteria.

The Fundamental Divide

DimensionSQL (Relational)NoSQL
Data modelTables with rows and columns, strict schemaVaries: documents, key-value, wide-column, graph
SchemaSchema-on-write (enforced at insert)Schema-on-read (flexible, often schemaless)
Query languageSQL (standardized)Varies per engine (some support SQL-like syntax)
ConsistencyStrong (ACID transactions by default)Tunable — from eventual to strong
Scaling modelVertical first, sharding is complexHorizontal first, designed for distribution
JoinsNative, multi-table joins are first-classTypically avoided; denormalization preferred

The CAP theorem states that a distributed system can guarantee at most two of three: Consistency, Availability, Partition tolerance. Since network partitions are unavoidable, the real choice is between consistency and availability during a partition.

Relational Databases (SQL)

How they work

Data is stored in tables (relations) with a fixed schema. Each row has the same columns. Relationships between tables are expressed through foreign keys and resolved through JOINs at query time.

-- Normalized relational model
CREATE TABLE customers (
    id       SERIAL PRIMARY KEY,
    name     TEXT NOT NULL,
    email    TEXT UNIQUE NOT NULL
);

CREATE TABLE orders (
    id          SERIAL PRIMARY KEY,
    customer_id INT REFERENCES customers(id),
    total       DECIMAL(10,2),
    created_at  TIMESTAMPTZ DEFAULT now()
);

-- Join at query time
SELECT c.name, o.total
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.created_at > '2026-01-01';

Storage internals

Most relational databases use a B-tree index for primary keys and secondary indexes. Data is stored in pages (typically 8 KB in PostgreSQL, 16 KB in MySQL/InnoDB). Row-oriented storage means each page contains complete rows — efficient for OLTP, less so for analytical scans.

Write-Ahead Log (WAL): All changes are first written to a sequential log before being applied to data pages. This guarantees durability and crash recovery.

ACID properties

PropertyGuarantee
AtomicityAll operations in a transaction succeed or none do
ConsistencyConstraints (foreign keys, checks, unique) are always enforced
IsolationConcurrent transactions don't interfere (configurable levels)
DurabilityCommitted data survives crashes

When to choose SQL

  • Complex relationships between entities (users, orders, products, invoices)
  • Strong consistency is non-negotiable (financial transactions, inventory)
  • Ad-hoc queries — SQL's expressiveness lets analysts explore freely
  • Mature ecosystem — decades of tooling, ORMs, backup strategies

Key engines

EngineStrengthsCloud managed
PostgreSQLExtensions (PostGIS, pgvector, TimescaleDB), JSONB, CTEsRDS, AlloyDB, Azure, Neon
MySQLWeb-scale proven, replication maturityRDS, Cloud SQL, PlanetScale
SQL ServerEnterprise features, .NET integrationAzure SQL
CockroachDBDistributed SQL, serializable by defaultCockroachDB Cloud
SQLiteEmbedded, zero-config, serverless-friendlyTurso (distributed SQLite)

Document Databases

How they work

Data is stored as documents — typically JSON (or BSON). Each document is self-contained: it carries all its data without requiring joins. Documents in the same collection can have different structures.

{
  "_id": "order_42",
  "customer": {
    "name": "Alice Martin",
    "email": "alice@example.com"
  },
  "items": [
    { "sku": "WIDGET-01", "qty": 3, "price": 29.99 },
    { "sku": "GADGET-07", "qty": 1, "price": 149.00 }
  ],
  "total": 238.97,
  "status": "shipped",
  "created_at": "2026-03-15T10:30:00Z"
}

Storage internals

MongoDB uses a B-tree index on _id and supports secondary indexes on any field, including nested paths and arrays. Documents are stored in BSON format with an internal WiredTiger storage engine that provides compression and document-level concurrency control.

When to choose document DBs

  • Varying schemas — product catalogs where each item has different attributes
  • Hierarchical/nested data — CMS content, user profiles with embedded preferences
  • Rapid prototyping — no migrations, add fields freely
  • Read-heavy with known access patterns — embed related data to avoid joins

When to avoid

  • Heavy cross-collection joins (you'll end up building SQL badly)
  • Transactions spanning many documents (MongoDB supports multi-doc transactions, but they're expensive)

Key engines

EngineModelNotes
MongoDBBSON documentsMarket leader, Atlas cloud, aggregation pipeline
CouchbaseJSON documentsBuilt-in caching layer, SQL++ query language
Amazon DocumentDBMongoDB-compatibleAWS-managed, not open-source MongoDB
FirestoreJSON documentsServerless, real-time sync, mobile SDKs
CouchDBJSON documentsMulti-master replication, offline-first

Key-Value Stores

How they work

The simplest data model: a key maps to a value. The store is opaque about the value — it's just bytes. Lookups are O(1) by key.

SET session:abc123 '{"user_id": 42, "role": "admin"}' EX 3600
GET session:abc123

Storage internals

Redis stores everything in memory with optional persistence (RDB snapshots, AOF append log). Data structures go beyond simple strings: lists, sets, sorted sets, hashes, streams, and HyperLogLog.

DynamoDB uses a partition key (hash) to distribute data across nodes and an optional sort key for range queries within a partition. Storage is SSD-backed with single-digit millisecond latency.

When to choose key-value

  • Caching — session data, API responses, computed results
  • High-throughput, low-latency — shopping carts, rate limiting, leaderboards
  • Simple access patterns — lookup by ID only, no complex queries needed

Key engines

EngineTypeNotes
Redis / ValkeyIn-memoryData structures, pub/sub, Lua scripting
DynamoDBManaged, SSDAuto-scaling, single-digit ms, global tables
MemcachedIn-memoryPure cache, multi-threaded, simpler than Redis
etcdDistributed KVUsed by Kubernetes, strong consistency via Raft

Wide-Column (Column-Family) Stores

How they work

Data is organized into rows identified by a row key, but columns are grouped into column families. Unlike relational tables, each row can have different columns, and columns are stored together on disk by family — enabling efficient scans of specific column groups.

Row key: user#42
  Column family "profile":
    name = "Alice"
    email = "alice@example.com"
  Column family "activity":
    last_login = "2026-03-15"
    login_count = 142
    last_page = "/dashboard"

Storage internals

Based on the LSM-tree (Log-Structured Merge-tree) architecture:

  1. Writes go to an in-memory memtable
  2. When full, flushed to disk as an immutable SSTable (Sorted String Table)
  3. Background compaction merges SSTables to reduce read amplification

This makes writes extremely fast (sequential I/O) at the cost of read amplification (may need to check multiple SSTables).

When to choose wide-column

  • Time-series data — IoT sensor readings, metrics, event logs
  • Write-heavy workloads — billions of events per day
  • Wide rows with sparse columns — user activity tracking
  • Geographic distribution — multi-region replication

Key engines

EngineNotes
Apache CassandraPeer-to-peer, tunable consistency, CQL query language
ScyllaDBCassandra-compatible, C++ rewrite, lower latency
HBaseHadoop ecosystem, strong consistency, HDFS-backed
Google BigtableFully managed, powers Google Search/Maps/Gmail
Amazon KeyspacesManaged Cassandra-compatible on AWS

Graph Databases

How they work

Data is modeled as nodes (entities) and edges (relationships). Both can carry properties. Graph databases excel at traversing relationships — something that requires expensive recursive JOINs in SQL.

// Cypher (Neo4j) — find friends of friends who like Python
MATCH (me:Person {name: "Alice"})-[:FRIENDS_WITH]->(friend)-[:FRIENDS_WITH]->(fof)
WHERE (fof)-[:LIKES]->(:Topic {name: "Python"})
  AND NOT (me)-[:FRIENDS_WITH]->(fof)
RETURN DISTINCT fof.name

Storage internals

Native graph storage (Neo4j): Nodes and relationships are stored as linked records with direct physical pointers. Traversing a relationship is a pointer hop — O(1) regardless of graph size. This is called index-free adjacency.

Non-native graph (JanusGraph, Amazon Neptune): Uses an underlying storage engine (Cassandra, DynamoDB) with an index layer for graph traversals. More scalable horizontally but slower per-hop.

When to choose graph

  • Social networks — friends, followers, recommendations
  • Knowledge graphs — entity relationships, ontologies
  • Fraud detection — finding suspicious relationship patterns
  • Network topology — infrastructure dependencies, routing
  • Recommendation engines — "users who bought X also bought Y"

Key engines

EngineQuery LanguageNotes
Neo4jCypherMarket leader, native graph storage, ACID
Amazon NeptuneGremlin, SPARQL, openCypherManaged, supports property graph + RDF
ArangoDBAQLMulti-model (document + graph + key-value)
JanusGraphGremlinOpen-source, pluggable storage backend
DgraphGraphQL, DQLDistributed, native GraphQL support

Time-Series Databases

How they work

Optimized for append-heavy, time-stamped data where recent data is accessed more frequently. Most use columnar storage with aggressive compression (delta encoding, run-length encoding) since consecutive timestamps and values are often similar.

-- InfluxQL
SELECT mean("cpu_usage")
FROM "server_metrics"
WHERE time > now() - 1h
GROUP BY time(5m), "host"

When to choose time-series

  • Infrastructure monitoring — CPU, memory, disk, network metrics
  • IoT — sensor data at scale
  • Financial data — tick data, OHLCV candles
  • Application metrics — request latency, error rates, throughput

Key engines

EngineNotes
InfluxDBPurpose-built TSDB, Flux query language
TimescaleDBPostgreSQL extension — full SQL on time-series
PrometheusPull-based metrics, de facto for Kubernetes
Amazon TimestreamManaged, serverless TSDB on AWS
ClickHouseColumn-oriented OLAP that excels at time-series
QuestDBHigh-performance TSDB with SQL support

Vector Databases

How they work

Store high-dimensional embedding vectors and support approximate nearest neighbor (ANN) search. Essential for AI/ML applications where similarity search replaces exact matching.

# Typical vector search flow
embedding = model.encode("How do I reset my password?")
results = collection.query(
    query_embeddings=[embedding],
    n_results=5  # top-5 most similar
)

Index structures

AlgorithmTypeTrade-off
HNSWGraph-basedHigh recall, more memory
IVFInverted fileFaster index build, lower recall
PQProduct quantizationCompressed vectors, lower accuracy
ScaNNHybridGoogle's optimized ANN

Key engines

EngineNotes
PineconeManaged, serverless, metadata filtering
WeaviateOpen-source, multi-modal, GraphQL API
Milvus / ZillizOpen-source, GPU-accelerated, distributed
QdrantRust-based, filtering + payload support
pgvectorPostgreSQL extension — vectors in your existing DB
ChromaDBLightweight, Python-native, popular for prototyping

Decision Framework

By access pattern

Access PatternBest Fit
Complex joins, ad-hoc queries, transactionsRelational (SQL)
Lookup by key, caching, sessionsKey-Value
Nested/hierarchical objects, flexible schemaDocument
Time-stamped append-only streamsTime-Series
Relationship traversal, path findingGraph
High-volume writes, wide sparse rowsWide-Column
Similarity search, embeddingsVector

By consistency requirement

RequirementOptions
Strong ACID neededPostgreSQL, MySQL, CockroachDB, Spanner
Tunable consistency OKCassandra, DynamoDB, MongoDB
Eventual consistency fineRedis (replication), Couchbase, Cassandra

The polyglot persistence reality

Most production systems use multiple databases for different workloads:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ PostgreSQL   │     │   Redis     │     │ Elasticsearch│
│ (orders,     │     │ (sessions,  │     │ (full-text   │
│  inventory)  │     │  cache)     │     │  search)     │
└──────┬───────┘     └──────┬──────┘     └──────┬───────┘
       │                    │                    │
       └────────────────────┼────────────────────┘
                            │
                    ┌───────┴───────┐
                    │  Application  │
                    └───────┬───────┘
                            │
                    ┌───────┴───────┐
                    │   Kafka /     │
                    │   Event Bus   │
                    └───────┬───────┘
                            │
               ┌────────────┼────────────┐
               │            │            │
        ┌──────┴──────┐ ┌──┴───┐ ┌──────┴──────┐
        │ ClickHouse  │ │ S3   │ │ Neo4j       │
        │ (analytics) │ │(lake)│ │ (recs graph)│
        └─────────────┘ └──────┘ └─────────────┘

The key is to let events flow between systems via a message bus rather than trying to synchronize databases directly.

Common Mistakes

MistakeImpactFix
Using MongoDB for heavily relational dataPainful manual joins, data duplicationUse PostgreSQL with JSONB for semi-structured parts
Using SQL for simple key lookups at scaleUnnecessary overheadAdd a Redis cache or move to DynamoDB
Choosing a DB based on hypeOperational burden, mismatched patternsMatch the DB to your actual access patterns
Ignoring operational costs"Free" open-source still needs backups, upgrades, monitoringFactor in managed vs self-hosted total cost
Premature denormalizationData inconsistency, update anomaliesNormalize first, denormalize when you have proof of need

Resources

:::