
Data Engineering in 2026: Tools, Platforms & Roadmap

#data-engineering #aws #gcp #azure #open-source #pipelines

Taxonomy inspired by the MAD 2025 Landscape by Matt Turck / FirstMark.

Data engineering has evolved from batch ETL jobs into a discipline that spans real-time streaming, lakehouse architectures, and declarative transformation layers. Choosing the right stack depends on scale, team expertise, and cloud strategy.

At a Glance

| Category | AWS | GCP | Azure | Open Source |
|---|---|---|---|---|
| Orchestration | Step Functions, MWAA | Cloud Composer, Workflows | Data Factory, Logic Apps | Airflow, Dagster, Prefect |
| Processing | Glue, EMR, Athena | Dataflow, BigQuery | Synapse, Databricks | Spark, dbt, Flink, DuckDB |
| Storage / Lakehouse | S3 + Lake Formation, Redshift | BigQuery, BigLake | OneLake / Fabric | Iceberg, Delta Lake, Hudi |
| Data Quality | Glue Data Quality | Dataplex | Purview | Great Expectations, Soda |
| Catalog / Discovery | Glue Catalog, DataZone | Dataplex Catalog | Purview | OpenMetadata, DataHub |

Data Orchestration

Orchestration is the backbone of any data platform. It defines how, when, and in what order data jobs execute.

AWS offers Step Functions for serverless workflow orchestration and Managed Workflows for Apache Airflow (MWAA) for teams already invested in the Airflow ecosystem.

GCP provides Cloud Composer (also Airflow-managed) and Cloud Workflows for lightweight event-driven pipelines.

Azure centers on Azure Data Factory (ADF), which combines orchestration and data movement in a single visual interface, plus Logic Apps for event-driven triggers.

Open source remains dominated by Apache Airflow, but newer entrants like Dagster and Prefect have gained significant traction. Dagster's software-defined assets model brings a declarative approach to pipeline design, while Prefect emphasizes developer experience with a Python-native API. Mage is another rising contender focused on simplicity.
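At their core, all of these orchestrators resolve a dependency graph into an execution order. The idea can be sketched with the standard library alone — this is a toy illustration, not the API of Airflow, Dagster, or Prefect; the task names are invented:

```python
from graphlib import TopologicalSorter

# Toy pipeline: each task lists its upstream dependencies, mirroring
# how an orchestrator derives execution order from a DAG definition.
pipeline = {
    "extract": [],
    "transform": ["extract"],
    "quality_check": ["transform"],
    "load": ["quality_check"],
}

def run(task: str) -> None:
    print(f"running {task}")

# static_order() yields each task only after all of its dependencies.
order = list(TopologicalSorter(pipeline).static_order())
for task in order:
    run(task)
```

Real orchestrators layer scheduling, retries, and state tracking on top of exactly this kind of topological execution.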

Data Processing & Transformation

The processing layer is where raw data becomes analytics-ready.

AWS Glue provides serverless Spark with a built-in data catalog, while Amazon EMR offers more control over Spark, Hive, and Presto clusters. Amazon Athena handles ad-hoc SQL queries directly on S3.

GCP Dataflow is a fully managed Apache Beam runner supporting both batch and streaming. BigQuery doubles as both warehouse and processing engine with its powerful SQL layer.

Azure Synapse Analytics unifies data warehousing, big data, and data integration. Azure Databricks (jointly developed with Databricks) brings the lakehouse paradigm to Azure.

Open source: Apache Spark remains the standard for large-scale processing. dbt (data build tool) has become essential for SQL-based transformations with built-in testing and documentation. Apache Flink leads in stream processing, while DuckDB has emerged as a fast, embeddable analytical engine for local and medium-scale workloads.
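The dbt approach — transformations expressed as SELECT statements materialized into tables — can be illustrated with stdlib sqlite3 (a sketch of the pattern, not dbt's actual project structure; table and column names are invented):

```python
import sqlite3

# Raw layer: untyped, unfiltered source data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 10.0, "paid"), (2, 25.0, "paid"), (3, 5.0, "refunded")],
)

# A "model" in the dbt sense: a SELECT materialized as an
# analytics-ready table downstream of the raw layer.
conn.execute("""
    CREATE TABLE orders_summary AS
    SELECT status, COUNT(*) AS n, SUM(amount) AS total
    FROM raw_orders
    GROUP BY status
""")

rows = conn.execute("SELECT * FROM orders_summary ORDER BY status").fetchall()
print(rows)
```

dbt adds dependency resolution between models, testing, and documentation on top of this materialize-a-SELECT primitive.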

Data Storage & Lakehouse

The lakehouse pattern — combining the flexibility of data lakes with the reliability of data warehouses — is now the default architecture.

AWS relies on S3 as the foundation, with Lake Formation for governance and Redshift Spectrum for querying lake data.

GCP BigQuery natively supports the lakehouse model with external tables on GCS, and BigLake provides unified governance.

Azure offers OneLake through Microsoft Fabric, unifying data across the entire analytics stack.

Open source table formats are the real enablers: Apache Iceberg (created at Netflix, with major contributions from Apple), Delta Lake (from Databricks), and Apache Hudi (from Uber). Iceberg has emerged as the industry-standard format, with broad ecosystem support across all three clouds.
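What these formats share is snapshot-based metadata: every commit produces a new immutable snapshot (a list of data files), which is what enables atomic swaps and time travel. A toy sketch of that idea — real formats persist this as manifest and metadata files on object storage, so this is illustration only:

```python
# Toy model of snapshot-based table metadata (Iceberg/Delta/Hudi style).
class ToyTable:
    def __init__(self):
        self.snapshots = [[]]  # snapshot 0: empty table

    def commit(self, added_files):
        # A commit never mutates an old snapshot; it appends a new one.
        new_snapshot = self.snapshots[-1] + list(added_files)
        self.snapshots.append(new_snapshot)
        return len(self.snapshots) - 1  # snapshot id

    def scan(self, snapshot_id=-1):
        # Readers pin a snapshot, so concurrent commits never affect them.
        return self.snapshots[snapshot_id]

t = ToyTable()
s1 = t.commit(["data/part-000.parquet"])
s2 = t.commit(["data/part-001.parquet"])
print(t.scan())    # latest snapshot: both files
print(t.scan(s1))  # time travel: state as of the first commit
```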

Data Quality & Observability

Ensuring data reliability is no longer optional.

AWS integrates Glue Data Quality for rule-based validation. GCP offers Dataplex data quality tasks. Azure provides Microsoft Purview data quality features.

Open source: Great Expectations remains the most adopted framework for data validation. Soda provides a more accessible YAML-based approach. Monte Carlo and Bigeye lead the commercial observability space, while open-source alternatives like Elementary (for dbt) and OpenMetadata provide lineage and quality monitoring.
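The common core of these tools is rule-based validation: declare expectations over a dataset, evaluate them, and surface failures. Sketched in plain Python (not the actual API of Great Expectations or Soda; the check names and data are invented):

```python
# Minimal rule-based data validation in the spirit of
# Great Expectations / Soda checks.
def check_not_null(rows, column):
    return all(r.get(column) is not None for r in rows)

def check_range(rows, column, lo, hi):
    return all(lo <= r[column] <= hi for r in rows)

rows = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 25.0},
]

# Each named rule evaluates to pass/fail over the whole dataset.
results = {
    "id_not_null": check_not_null(rows, "id"),
    "amount_in_range": check_range(rows, "amount", 0, 1000),
}
failed = [name for name, ok in results.items() if not ok]
print(results)
print(failed)
```

In practice these checks run inside the orchestrator after each transformation step, failing the pipeline before bad data propagates downstream.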

Roadmap Considerations

When building a data engineering stack in 2026, consider:

  • Start with orchestration: Choose between Airflow (proven, large ecosystem) or Dagster (modern, asset-centric) as your foundation
  • Adopt a lakehouse format early: Apache Iceberg is the safest bet for long-term interoperability
  • Invest in data quality from day one: Retrofitting quality checks is significantly harder than building them in
  • Evaluate managed vs. self-hosted: Managed services reduce operational burden but increase cloud lock-in
  • Plan for real-time: Even if you start with batch, design your architecture to accommodate streaming as needs evolve

The trend is clear: the modern data stack is consolidating around open formats, declarative transformation, and unified governance. Cloud providers are competing on managed services, but open-source foundations ensure portability.


Pricing Comparison

Compute — General Purpose

| Provider | Service / SKU | Specs | Price | Region |
|---|---|---|---|---|
| Scaleway | DEV1-M | 3 vCPU · 4 GiB | €0.022 / hour | PAR1 (Paris, FR) |
| OVHcloud | b3-8 | 2 vCPU · 8 GiB | €0.038 / hour | GRA (Gravelines, FR) |
| OVHcloud | b3-16 | 4 vCPU · 16 GiB | €0.077 / hour | GRA (Gravelines, FR) |
| Scaleway | GP1-S | 8 vCPU · 32 GiB | €0.084 / hour | PAR1 (Paris, FR) |
| GCP | n2-standard-4 | 4 vCPU · 16 GiB | $0.194 / hour | europe-west1 |
| AWS | m7i.xlarge | 4 vCPU · 16 GiB | $0.202 / hour | eu-west-3 |
| Azure | Standard_D4s_v5 | 4 vCPU · 16 GiB | $0.230 / hour | westeurope |
| GCP | n2-standard-8 | 8 vCPU · 32 GiB | $0.389 / hour | europe-west1 |
| AWS | m7i.2xlarge | 8 vCPU · 32 GiB | $0.403 / hour | eu-west-3 |
| Azure | Standard_D4s_v5 | 4 vCPU · 16 GiB | $0.414 / hour | westeurope |
| Azure | Standard_D8s_v5 | 8 vCPU · 32 GiB | $0.460 / hour | westeurope |
| Azure | Standard_D8s_v5 | 8 vCPU · 32 GiB | $0.828 / hour | westeurope |
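Hourly rates are easiest to compare as indicative monthly costs (730 hours/month is the usual convention). A quick sketch using a few of the rates above — note the mixed currencies, which are compared here only rate-for-rate:

```python
# Convert on-demand hourly rates to an indicative monthly cost
# for always-on instances (730 hours/month convention).
HOURS_PER_MONTH = 730

hourly_rates = {
    "Scaleway GP1-S": 0.084,         # EUR, 8 vCPU / 32 GiB
    "AWS m7i.xlarge": 0.202,         # USD, 4 vCPU / 16 GiB
    "Azure Standard_D4s_v5": 0.230,  # USD, 4 vCPU / 16 GiB
}

monthly = {sku: round(rate * HOURS_PER_MONTH, 2)
           for sku, rate in hourly_rates.items()}
print(monthly)
```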

Object Storage

| Provider | Service / SKU | Specs | Price | Region |
|---|---|---|---|---|
| Scaleway | Standard | Standard tier · 3x replication | €0.010 / GB / month | PAR (Paris, FR) |
| OVHcloud | Standard | Standard tier · 3x replication | €0.011 / GB / month | GRA (Gravelines, FR) |
| Azure | Hot LRS | Hot tier · LRS | $0.019 / GB / month | westeurope |
| Azure | Hot LRS | Hot tier · LRS | $0.020 / GB / month | westeurope |
| GCP | Standard | Standard tier · multi-region available | $0.020 / GiB / month | europe-west1 |
| AWS | S3 Standard | Standard tier · 3 AZ | $0.023 / GB / month | eu-west-3 |

Managed PostgreSQL

| Provider | Service / SKU | Specs | Price | Region |
|---|---|---|---|---|
| Scaleway | DB-DEV-M | 2 vCPU · 4 GiB · PostgreSQL | €0.069 / hour | PAR (Paris, FR) |
| OVHcloud | db2-7 | 2 vCPU · 7 GiB · PostgreSQL | €0.105 / hour | GRA (Gravelines, FR) |
| GCP | db-custom-4-16384 | 4 vCPU · 16 GiB · PostgreSQL | $0.348 / hour | europe-west1 |
| AWS | db.m7g.xlarge | 4 vCPU · 16 GiB · PostgreSQL | $0.371 / hour | eu-west-3 |
| Azure | Standard_D4ds_v5 | 4 vCPU · 16 GiB · PostgreSQL Flexible | $0.424 / hour | westeurope |

Network Egress

| Provider | Service / SKU | Tier | Price | Region |
|---|---|---|---|---|
| OVHcloud | Egress-Included | Included (generous free tier) | Free | GRA (Gravelines, FR) |
| Scaleway | Egress-Standard | 75 GB free, then per GB | €0.010 / GB | PAR (Paris, FR) |
| GCP | Internet-Egress-EU | First 10 TB | $0.085 / GiB | europe-west1 |
| AWS | Data-Out-Internet | First 10 TB | $0.090 / GB | eu-west-3 |
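Free allowances change the effective egress cost substantially. A small sketch of the arithmetic, using the Scaleway-style tier above (first 75 GB free, then a flat per-GB rate) as the example:

```python
# Egress cost with a free allowance: only traffic beyond the
# free tier is billed at the per-GB rate.
def egress_cost(gb_transferred, free_gb, rate_per_gb):
    billable = max(gb_transferred - free_gb, 0)
    return billable * rate_per_gb

# 500 GB out: 425 billable GB at €0.010/GB.
print(egress_cost(500, 75, 0.010))
# 50 GB out: fully within the free tier.
print(egress_cost(50, 75, 0.010))
```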

Last updated: April 2, 2026 · Indicative on-demand prices, excl. tax. Check official sites for current rates.