
Data Orchestration: Scheduling, Dependencies & the Tool Landscape

#data-engineering #orchestration #airflow #dagster #prefect

Data orchestration is the conductor of your data platform — it decides what runs, when, in what order, and what happens when things fail. Choosing the right orchestrator shapes your team's productivity for years.

What Orchestration Solves

| Problem | How orchestration helps |
| --- | --- |
| Dependency management | Run job B only after job A succeeds |
| Scheduling | Trigger pipelines on cron, events, or data arrival |
| Retry & alerting | Automatically retry failures, notify on-call |
| Observability | Centralized view of all pipeline runs and statuses |
| Backfilling | Reprocess historical data with the same logic |
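To make the first two rows concrete, here is a minimal, stdlib-only sketch of the core loop every orchestrator implements: run tasks in dependency order, retry failures, and skip downstream work when an upstream task fails. The names (`run_pipeline`, the status strings) are hypothetical, for illustration only, and not any real tool's API.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def run_pipeline(tasks, deps, max_retries=2):
    """tasks: {name: callable}; deps: {name: set of upstream names}.

    Returns {name: "success" | "failed" | "skipped"}.
    """
    status = {}
    # TopologicalSorter guarantees every task runs after its dependencies.
    for name in TopologicalSorter(deps).static_order():
        # "Run job B only after job A succeeds": skip if any upstream failed.
        if any(status.get(up) != "success" for up in deps.get(name, ())):
            status[name] = "skipped"
            continue
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                status[name] = "success"
                break
            except Exception:
                status[name] = "failed"  # retried until max_retries is exhausted
    return status
```

A real orchestrator adds persistence, distribution, and a UI on top of this loop, but the dependency-ordering and retry semantics are the same.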

Tool Comparison

| Tool | Philosophy | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Apache Airflow | DAGs as Python code | Massive ecosystem, battle-tested, large community | Complex setup, scheduler limitations, weak task isolation |
| Dagster | Software-defined assets | Type-safe, asset-centric, great local dev experience | Smaller community, learning curve |
| Prefect | Workflows as Python functions | Simple API, hybrid execution model, cloud-native | Smaller ecosystem than Airflow |
| Mage | Modern data pipeline tool | Interactive notebooks, built-in streaming | Newer, smaller community |
| Kestra | Declarative YAML workflows | Language-agnostic, event-driven, scalable | Less flexible for complex logic |
| dbt Cloud | SQL transform orchestration | Native dbt integration, fully managed | Limited to dbt jobs |

Managed vs Self-Hosted

| Managed | Self-Hosted |
| --- | --- |
| MWAA (AWS), Cloud Composer (GCP), Astronomer | Airflow on Kubernetes, Dagster OSS, Prefect Server |
| Lower ops burden, higher cost | Full control, lower cost, more complexity |
| Best for: teams without dedicated platform engineers | Best for: teams with DevOps/platform capacity |

Key Architecture Decisions

  • Task-centric vs asset-centric: Airflow thinks in tasks ("run this script"). Dagster thinks in assets ("this table should exist and be fresh"). Asset-centric is gaining favor for analytics.
  • Centralized vs decentralized: One orchestrator for all teams, or per-domain orchestrators? Centralized is simpler; decentralized aligns with data mesh.
  • Event-driven vs schedule-driven: Cron schedules are simple but wasteful if data arrives irregularly. Event triggers (new file in S3, Kafka message) are more efficient.
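The task-centric vs asset-centric distinction can be sketched in a few lines of plain Python. This is a conceptual illustration, not Airflow's or Dagster's actual API; all names (`run_ingest_task`, `materialize_if_stale`, `ASSET_CATALOG`) are hypothetical.

```python
import time

# Task-centric: the unit of work is "run this step". The orchestrator
# runs it whenever the schedule fires, whether or not the output is stale.
def run_ingest_task():
    return "ran ingest script"

# Asset-centric: the unit is "this table should exist and be fresh".
# The orchestrator only recomputes the asset when it is stale.
ASSET_CATALOG = {}  # asset name -> timestamp of last materialization

def materialize_if_stale(name, build, max_age_seconds=3600, now=None):
    now = time.time() if now is None else now
    last = ASSET_CATALOG.get(name)
    if last is not None and now - last < max_age_seconds:
        return "fresh, skipped"
    build()  # recompute the asset (e.g. rebuild the table)
    ASSET_CATALOG[name] = now
    return "materialized"
```

The asset-centric framing makes freshness a first-class property the scheduler can reason about, which is why it is gaining favor for analytics workloads.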
