tadata
Back to home

SQL Engines & Database Landscape: Choosing the Right Tool

#sql#databases#data#data-engineering

SQL remains the lingua franca of data, but the database landscape has diversified enormously. From OLTP to OLAP, from managed services to serverless, choosing the right database engine depends on your workload patterns, scale, and operational requirements.

Transactional Databases (OLTP)

For operational workloads requiring strong consistency and low-latency reads/writes:

AWS offers RDS (managed PostgreSQL, MySQL, MariaDB, Oracle, SQL Server) and Aurora — a MySQL/PostgreSQL-compatible engine with up to 5x throughput improvement. Aurora Serverless v2 provides auto-scaling for variable workloads.

GCP provides Cloud SQL (managed PostgreSQL, MySQL, SQL Server) and AlloyDB — a PostgreSQL-compatible database optimized for transactional and analytical workloads with columnar engine capabilities.

Azure offers Azure SQL Database (managed SQL Server), Azure Database for PostgreSQL, and Cosmos DB for globally distributed multi-model scenarios.

Open source: PostgreSQL is the clear leader for new projects, with its extensibility (PostGIS, pgvector, TimescaleDB), robust JSONB support, and active community. MySQL remains dominant in web applications. CockroachDB and TiDB provide distributed SQL for global-scale transactional workloads. Neon offers serverless PostgreSQL with branching capabilities ideal for development workflows.

Analytical Databases (OLAP)

For analytical and data warehouse workloads:

AWS Redshift is a mature columnar data warehouse with Serverless and Spectrum (querying S3 directly) options. Amazon Athena provides serverless SQL on S3 using Trino/Presto under the hood.

GCP BigQuery is arguably the most innovative cloud data warehouse — serverless, separation of storage and compute, built-in ML (BQML), and native support for semi-structured data. Its slot-based pricing model can be very cost-effective.

Azure Synapse Analytics provides dedicated and serverless SQL pools, integrating data warehousing with big data analytics.

Open source: ClickHouse has emerged as the fastest open-source OLAP database, excelling at real-time analytics on large datasets. DuckDB provides an embeddable analytical database ideal for local analysis and data science workflows. Apache Druid serves real-time analytics use cases. StarRocks offers a high-performance alternative to ClickHouse with MySQL compatibility.

NewSQL & Specialized Databases

Beyond traditional categories:

  • Vector databases for AI/ML: Pinecone, Weaviate, Qdrant, Milvus, and pgvector (PostgreSQL extension) — essential for RAG applications and semantic search
  • Time-series databases: InfluxDB, TimescaleDB (PostgreSQL extension), and cloud options like Amazon Timestream and GCP's time-series capabilities in BigQuery
  • Graph databases: Neo4j remains the leader, with Amazon Neptune and GCP's Spanner (which now supports graph queries) as managed alternatives
  • Document databases: MongoDB (and its managed Atlas), AWS DocumentDB, GCP Firestore, and Azure Cosmos DB

Query Engines & Federation

Querying across multiple data sources without moving data:

  • Trino (formerly PrestoSQL) is the leading open-source distributed query engine, federating queries across data lakes, databases, and APIs
  • Apache Spark SQL provides SQL on top of Spark's distributed compute engine
  • dbt (data build tool) has become the standard for SQL-based transformation workflows with testing, documentation, and lineage built in

Key Considerations

  • PostgreSQL first: For most new transactional workloads, PostgreSQL is the default choice — extensible, well-supported, and cloud-portable
  • Separate OLTP from OLAP: Don't run analytical queries on your transactional database — use a dedicated analytical engine
  • Evaluate serverless options: BigQuery, Athena, Aurora Serverless, and Neon eliminate capacity planning for variable workloads
  • Consider DuckDB for local analytics: It's fast, embeddable, and eliminates the need to spin up infrastructure for exploratory analysis
  • Vector search is table stakes: If you're building AI applications, you need vector search — pgvector is often sufficient before reaching for a dedicated vector database