Data Engineering in 2026: Tools, Platforms & Roadmap
Taxonomy inspired by the MAD 2025 Landscape by Matt Turck / FirstMark.
Data engineering has evolved from batch ETL jobs into a discipline that spans real-time streaming, lakehouse architectures, and declarative transformation layers. Choosing the right stack depends on scale, team expertise, and cloud strategy.
At a Glance
| Category | AWS | GCP | Azure | Open Source |
|---|---|---|---|---|
| Orchestration | Step Functions, MWAA | Cloud Composer, Workflows | Data Factory, Logic Apps | Airflow, Dagster, Prefect |
| Processing | Glue, EMR, Athena | Dataflow, BigQuery | Synapse, Databricks | Spark, dbt, Flink, DuckDB |
| Storage / Lakehouse | S3 + Lake Formation, Redshift | BigQuery, BigLake | OneLake / Fabric | Iceberg, Delta Lake, Hudi |
| Data Quality | Glue Data Quality | Dataplex | Purview | Great Expectations, Soda |
| Catalog / Discovery | Glue Catalog, DataZone | Dataplex Catalog | Purview | OpenMetadata, DataHub |
Data Orchestration
Orchestration is the backbone of any data platform. It defines how, when, and in what order data jobs execute.
AWS offers Step Functions for serverless workflow orchestration and Managed Workflows for Apache Airflow (MWAA) for teams already invested in the Airflow ecosystem.
GCP provides Cloud Composer (managed Apache Airflow) and Cloud Workflows for lightweight event-driven pipelines.
Azure centers on Azure Data Factory (ADF), which combines orchestration and data movement in a single visual interface, plus Logic Apps for event-driven triggers.
Open source remains dominated by Apache Airflow, but newer entrants like Dagster and Prefect have gained significant traction. Dagster's software-defined assets model brings a declarative approach to pipeline design, while Prefect emphasizes developer experience with a Python-native API. Mage is another rising contender focused on simplicity.
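To make "how, when, and in what order" concrete, here is a minimal sketch of dependency-ordered execution using only the Python standard library. It is not the API of Airflow, Dagster, or Prefect; the job names, `PIPELINE`, `TASKS`, and both helper functions are hypothetical and exist only for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a job, each value is the set of jobs
# it depends on. Real orchestrators add scheduling, retries, and state.
PIPELINE = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_orders_customers": {"extract_orders", "extract_customers"},
    "publish_report": {"join_orders_customers"},
}

# Illustrative task bodies. Each receives the results of the jobs run so far.
TASKS = {
    "extract_orders": lambda r: [{"order_id": 1, "customer_id": 10}],
    "extract_customers": lambda r: [{"customer_id": 10, "name": "Ada"}],
    "join_orders_customers": lambda r: [
        {**order, **customer}
        for order in r["extract_orders"]
        for customer in r["extract_customers"]
        if order["customer_id"] == customer["customer_id"]
    ],
    "publish_report": lambda r: len(r["join_orders_customers"]),
}


def execution_order(dependencies):
    """Return one valid run order in which every job follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


def run(dependencies, tasks):
    """Run each task in dependency order and collect results by job name."""
    results = {}
    for job in execution_order(dependencies):
        results[job] = tasks[job](results)
    return results


if __name__ == "__main__":
    print(execution_order(PIPELINE))
    print(run(PIPELINE, TASKS)["publish_report"])
```

The dependency graph is declared as data and the run order is derived from it, which is the core idea the orchestrators above build on.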
Data Processing & Transformation
The processing layer is where raw data becomes analytics-ready.
AWS Glue provides serverless Spark with a built-in data catalog, while Amazon EMR offers more control over Spark, Hive, and Presto clusters. Amazon Athena handles ad-hoc SQL queries directly on S3.
GCP Dataflow is a fully managed Apache Beam runner supporting both batch and streaming. BigQuery serves as both warehouse and processing engine through its SQL layer.
Azure Synapse Analytics unifies data warehousing, big data, and data integration. Azure Databricks (a first-party Azure service developed jointly by Microsoft and Databricks) brings the lakehouse paradigm to Azure.
Open source: Apache Spark remains the standard for large-scale processing. dbt (data build tool) has become essential for SQL-based transformations with built-in testing and documentation. Apache Flink leads in stream processing, while DuckDB has emerged as a fast, embeddable analytical engine for local and medium-scale workloads.
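The pattern of SQL-based transformation with built-in testing can be sketched as follows. This example uses the standard-library `sqlite3` module as a stand-in engine so it runs without extra dependencies; it is not dbt's API. The table names, sample rows, and `build` function are hypothetical. The test follows the convention of a query that returns offending rows, where zero rows means the check passes.

```python
import sqlite3

# Hypothetical raw data: (order_id, order_date, amount, status).
RAW_ORDERS = [
    (1, "2026-01-03", 120.0, "paid"),
    (2, "2026-01-03", 80.0, "refunded"),
    (3, "2026-01-04", 45.5, "paid"),
]

# The "model": a SELECT statement that defines the analytics-ready table.
DAILY_REVENUE_SQL = """
    SELECT order_date, SUM(amount) AS revenue, COUNT(*) AS paid_orders
    FROM raw_orders
    WHERE status = 'paid'
    GROUP BY order_date
"""

# The "test": returns rows that violate the rule; an empty result is a pass.
NOT_NULL_TEST_SQL = "SELECT * FROM daily_revenue WHERE revenue IS NULL"


def build(conn):
    """Load raw rows, materialize the model, run the test, return the result."""
    conn.execute(
        "CREATE TABLE raw_orders "
        "(order_id INTEGER, order_date TEXT, amount REAL, status TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", RAW_ORDERS)
    conn.execute(f"CREATE TABLE daily_revenue AS {DAILY_REVENUE_SQL}")

    failures = conn.execute(NOT_NULL_TEST_SQL).fetchall()
    if failures:
        raise ValueError(f"not-null test failed for {len(failures)} row(s)")

    return conn.execute(
        "SELECT order_date, revenue, paid_orders "
        "FROM daily_revenue ORDER BY order_date"
    ).fetchall()


if __name__ == "__main__":
    with sqlite3.connect(":memory:") as conn:
        for row in build(conn):
            print(row)
```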
Data Storage & Lakehouse
The lakehouse pattern — combining the flexibility of data lakes with the reliability of data warehouses — is now the default architecture.
AWS relies on S3 as the foundation, with Lake Formation for governance and Redshift Spectrum for querying lake data.
GCP BigQuery natively supports the lakehouse model with external tables on GCS, and BigLake provides unified governance.
Azure offers OneLake through Microsoft Fabric, unifying data across the entire analytics stack.
Open source table formats are the real enablers: Apache Iceberg (originally developed at Netflix and backed by Apple, among others), Delta Lake (from Databricks), and Apache Hudi. Iceberg has emerged as the leading format, with broad ecosystem support across all three clouds.
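One way to picture how a table format adds warehouse-style reliability on top of plain files is a log of immutable snapshots. The toy model below is an illustration under simplifying assumptions, not the Iceberg, Delta Lake, or Hudi specification; `Snapshot`, `ToyTable`, and the file names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable view of the table: an id plus the data files it contains."""
    snapshot_id: int
    data_files: tuple


@dataclass
class ToyTable:
    """Toy table whose state is a list of snapshots, starting from an empty one."""
    snapshots: list = field(default_factory=lambda: [Snapshot(0, ())])

    @property
    def current(self):
        return self.snapshots[-1]

    def commit_append(self, new_files):
        """Create a new snapshot that adds files; older snapshots stay untouched."""
        snapshot = Snapshot(
            self.current.snapshot_id + 1,
            self.current.data_files + tuple(new_files),
        )
        # A single list append stands in for the atomic metadata update
        # that a real table format performs on commit.
        self.snapshots.append(snapshot)
        return snapshot

    def files_as_of(self, snapshot_id):
        """Return the files visible at an earlier snapshot (time travel)."""
        for snapshot in self.snapshots:
            if snapshot.snapshot_id == snapshot_id:
                return snapshot.data_files
        raise KeyError(snapshot_id)


if __name__ == "__main__":
    table = ToyTable()
    table.commit_append(["part-0001.parquet"])
    table.commit_append(["part-0002.parquet"])
    print(table.files_as_of(1))
    print(table.current.data_files)
```

Readers that pin a snapshot id keep seeing a consistent set of files even while new commits land, which is the property the lakehouse pattern relies on.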
Data Quality & Observability
Ensuring data reliability is no longer optional.
AWS integrates Glue Data Quality for rule-based validation. GCP offers Dataplex data quality tasks. Azure provides Microsoft Purview data quality features.
Open source: Great Expectations remains one of the most widely adopted frameworks for data validation. Soda provides a more accessible YAML-based approach. On the commercial side, Monte Carlo and Bigeye lead the observability space, while open-source alternatives like Elementary (for dbt) and OpenMetadata provide lineage and quality monitoring.
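Rule-based validation, the common idea behind these tools, can be sketched in plain Python. This is not the API of Great Expectations, Soda, or any cloud service; the `Rule` class, the rule names, and the sample rows are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """A named check applied to one column of every row."""
    name: str
    column: str
    check: Callable[[object], bool]


# Hypothetical rules for an orders dataset.
RULES = [
    Rule("order_id_not_null", "order_id", lambda v: v is not None),
    Rule(
        "amount_non_negative",
        "amount",
        lambda v: isinstance(v, (int, float)) and v >= 0,
    ),
    Rule("status_in_set", "status", lambda v: v in {"paid", "refunded", "pending"}),
]


def validate(rows, rules):
    """Return {rule name: [indexes of failing rows]} for rules with failures."""
    failures = {}
    for index, row in enumerate(rows):
        for rule in rules:
            if not rule.check(row.get(rule.column)):
                failures.setdefault(rule.name, []).append(index)
    return failures


if __name__ == "__main__":
    sample = [
        {"order_id": 1, "amount": 120.0, "status": "paid"},
        {"order_id": None, "amount": -5, "status": "paid"},
        {"order_id": 3, "amount": 10, "status": "unknown"},
    ]
    print(validate(sample, RULES))
```

A pipeline would typically run such checks right after ingestion and fail, or quarantine the offending rows, when the result is non-empty.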
Roadmap Considerations
When building a data engineering stack in 2026, consider:
- Start with orchestration: Choose between Airflow (proven, large ecosystem) and Dagster (modern, asset-centric) as your foundation
- Adopt a lakehouse format early: Apache Iceberg is the safest bet for long-term interoperability
- Invest in data quality from day one: Retrofitting quality checks is significantly harder than building them in
- Evaluate managed vs. self-hosted: Managed services reduce operational burden but increase cloud lock-in
- Plan for real-time: Even if you start with batch, design your architecture to accommodate streaming as needs evolve
The trend is clear: the modern data stack is consolidating around open formats, declarative transformation, and unified governance. Cloud providers are competing on managed services, but open-source foundations ensure portability.
References
- MAD 2025 Landscape — Matt Turck / FirstMark: comprehensive map of the Machine Learning, AI & Data ecosystem
- CNCF Landscape — Cloud Native Computing Foundation interactive landscape
- AWS Analytics Services — AWS data lake and analytics service overview
- GCP Data Analytics — Google Cloud data analytics portfolio
- Azure Analytics — Microsoft Azure analytics services
- dbt Developer Hub — dbt transformation framework documentation
- Apache Iceberg — open table format specification and ecosystem
- Dagster — modern data orchestration platform
- Great Expectations — open-source data quality framework
Pricing Comparison
Compute — General Purpose
| Provider | Service / SKU | Specs | Price | Unit | Region |
|---|---|---|---|---|---|
| Scaleway | DEV1-M | vcpu: 3 · memory: 4 GiB | €0.022 | /hour | PAR1 (Paris, FR) |
| OVHcloud | b3-8 | vcpu: 2 · memory: 8 GiB | €0.038 | /hour | GRA (Gravelines, FR) |
| OVHcloud | b3-16 | vcpu: 4 · memory: 16 GiB | €0.077 | /hour | GRA (Gravelines, FR) |
| Scaleway | GP1-S | vcpu: 8 · memory: 32 GiB | €0.084 | /hour | PAR1 (Paris, FR) |
| GCP | n2-standard-4 | vcpu: 4 · memory: 16 GiB | $0.194 | /hour | europe-west1 |
| AWS | m7i.xlarge | vcpu: 4 · memory: 16 GiB | $0.202 | /hour | eu-west-3 |
| Azure | Standard_D4s_v5 | vcpu: 4 · memory: 16 GiB | $0.230 | /hour | westeurope |
| GCP | n2-standard-8 | vcpu: 8 · memory: 32 GiB | $0.389 | /hour | europe-west1 |
| AWS | m7i.2xlarge | vcpu: 8 · memory: 32 GiB | $0.403 | /hour | eu-west-3 |
| Azure | Standard_D4s_v5 | vcpu: 4 · memory: 16 GiB | $0.414 | /hour | westeurope |
| Azure | Standard_D8s_v5 | vcpu: 8 · memory: 32 GiB | $0.460 | /hour | westeurope |
| Azure | Standard_D8s_v5 | vcpu: 8 · memory: 32 GiB | $0.828 | /hour | westeurope |
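To compare the hourly rates in this table on a monthly or per-vCPU basis, a small calculation helps. The sketch below assumes 730 billable hours per month (8,760 hours per year divided by 12) and an always-on instance; prices are copied from the table and kept in their listed currency, with no conversion. The function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours per year / 12

# (provider, SKU, vCPUs, hourly price, currency), copied from the table.
# Azure appears twice per SKU in the table; the lower-priced row is used here.
INSTANCES = [
    ("OVHcloud", "b3-16", 4, 0.077, "EUR"),
    ("GCP", "n2-standard-4", 4, 0.194, "USD"),
    ("AWS", "m7i.xlarge", 4, 0.202, "USD"),
    ("Azure", "Standard_D4s_v5", 4, 0.230, "USD"),
]


def monthly_cost(hourly_price, hours=HOURS_PER_MONTH):
    """Cost of running one instance continuously for a month."""
    return round(hourly_price * hours, 2)


def price_per_vcpu_hour(hourly_price, vcpus):
    """Hourly price normalized by vCPU count, for comparing unequal sizes."""
    return round(hourly_price / vcpus, 4)


if __name__ == "__main__":
    for provider, sku, vcpus, hourly, currency in INSTANCES:
        print(
            provider,
            sku,
            monthly_cost(hourly),
            price_per_vcpu_hour(hourly, vcpus),
            currency,
        )
```

Normalizing per vCPU is a rough comparison only, since memory, CPU generation, and included storage differ across providers.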
Object Storage
| Provider | Service / SKU | Specs | Price | Unit | Region |
|---|---|---|---|---|---|
| Scaleway | Standard | tier: Standard · redundancy: 3x replication | €0.010 | /GB/month | PAR (Paris, FR) |
| OVHcloud | Standard | tier: Standard · redundancy: 3x replication | €0.011 | /GB/month | GRA (Gravelines, FR) |
| Azure | Hot LRS | tier: Hot · redundancy: LRS | $0.019 | /GB/month | westeurope |
| Azure | Hot LRS | tier: Hot · redundancy: LRS | $0.020 | /GB/month | westeurope |
| GCP | Standard | tier: Standard · redundancy: Multi-region available | $0.020 | /GiB/month | europe-west1 |
| AWS | S3-Standard | tier: Standard · redundancy: 3 AZ | $0.023 | /GB/month | eu-west-3 |
Managed PostgreSQL
| Provider | Service / SKU | Specs | Price | Unit | Region |
|---|---|---|---|---|---|
| Scaleway | DB-DEV-M | vcpu: 2 · memory: 4 GiB · engine: PostgreSQL | €0.069 | /hour | PAR (Paris, FR) |
| OVHcloud | db2-7 | vcpu: 2 · memory: 7 GiB · engine: PostgreSQL | €0.105 | /hour | GRA (Gravelines, FR) |
| GCP | db-custom-4-16384 | vcpu: 4 · memory: 16 GiB · engine: PostgreSQL | $0.348 | /hour | europe-west1 |
| AWS | db.m7g.xlarge | vcpu: 4 · memory: 16 GiB · engine: PostgreSQL | $0.371 | /hour | eu-west-3 |
| Azure | Standard_D4ds_v5 | vcpu: 4 · memory: 16 GiB · engine: PostgreSQL Flexible | $0.424 | /hour | westeurope |
Network Egress
| Provider | Service / SKU | Specs | Price | Unit | Region |
|---|---|---|---|---|---|
| OVHcloud | Egress-Included | direction: Egress · tier: Included (generous free tier) | Free | /GB/month | GRA (Gravelines, FR) |
| Scaleway | Egress-Standard | direction: Egress · tier: 75 GB free, then per GB | €0.010 | /GB/month | PAR (Paris, FR) |
| GCP | Internet-Egress-EU | direction: Egress · tier: First 10 TB | $0.085 | /GiB | europe-west1 |
| AWS | Data-Out-Internet | direction: Egress · tier: First 10 TB | $0.090 | /GB | eu-west-3 |
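Egress pricing with a free allowance is easy to misread, so here is a rough sketch of the arithmetic. It assumes monthly usage stays within the first pricing tier shown in the table, treats GB and GiB as equivalent for simplicity, and applies only the free allowance the table lists (75 GB for Scaleway). The `egress_cost` function is hypothetical, and results stay in each provider's listed currency.

```python
def egress_cost(gb_out, price_per_gb, free_gb=0.0):
    """Monthly egress cost: only traffic above the free allowance is billed."""
    billable_gb = max(0.0, gb_out - free_gb)
    return round(billable_gb * price_per_gb, 2)


if __name__ == "__main__":
    monthly_gb = 500  # hypothetical monthly outbound traffic

    # Rates copied from the table; currencies are not converted.
    print("Scaleway (EUR):", egress_cost(monthly_gb, 0.010, free_gb=75))
    print("GCP (USD):", egress_cost(monthly_gb, 0.085))
    print("AWS (USD):", egress_cost(monthly_gb, 0.090))
```

At higher volumes the per-GB rate changes by tier, so a fuller estimate would need each provider's complete tier schedule.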
Last updated: April 2, 2026 · Indicative on-demand prices, excl. tax. Check official sites for current rates.