
Data Lakehouse: The Best of Both Worlds?

#data-architecture #data-engineering #cloud #analytics

The Historical Split

For two decades, organizations chose between two paradigms:

  • Data Warehouse: structured, schema-on-write, fast SQL, expensive storage, governed
  • Data Lake: semi/unstructured, schema-on-read, cheap storage, flexible, often chaotic

The lakehouse pattern emerged to unify these: bring warehouse-grade structure and performance to lake-scale storage.

What Makes a Lakehouse

A lakehouse adds a metadata and transaction layer on top of object storage (S3, GCS, ADLS). The key capabilities:

| Capability | How It Works |
| --- | --- |
| ACID Transactions | Metadata logs track commits, enabling rollback and consistency |
| Schema Enforcement | Schema-on-write is enforced at the table level |
| Schema Evolution | Add/rename columns without rewriting data |
| Time Travel | Query historical snapshots of any table |
| Unified Batch & Streaming | Same tables support both batch writes and streaming upserts |
| Open Formats | Data stored in Parquet/ORC, readable by any engine |
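To make the transaction-log idea concrete, here is a minimal in-memory sketch of how a commit log enables both atomic writes and time travel. Everything here (the `ToyTable` class, file naming, log layout) is a hypothetical simplification; real formats like Delta, Iceberg, and Hudi persist the log as files on object storage with far more metadata.

```python
import json

class ToyTable:
    """Toy sketch of a lakehouse table: a commit log over an 'object store'."""

    def __init__(self):
        self.log = []    # ordered commit log: one JSON entry per version
        self.files = {}  # simulated object store: path -> list of rows

    def commit(self, added_files):
        """Atomically add data files; readers only see committed versions."""
        version = len(self.log)
        paths = []
        for i, rows in enumerate(added_files):
            path = f"part-{version}-{i}.parquet"
            self.files[path] = rows
            paths.append(path)
        # the write becomes visible only once the log entry exists
        self.log.append(json.dumps({"version": version, "add": paths}))
        return version

    def snapshot(self, version=None):
        """Time travel: read the table as of a given committed version."""
        if version is None:
            version = len(self.log) - 1
        rows = []
        for entry in self.log[: version + 1]:
            for path in json.loads(entry)["add"]:
                rows.extend(self.files[path])
        return rows

t = ToyTable()
t.commit([[{"id": 1}]])
t.commit([[{"id": 2}]])
print(len(t.snapshot(0)), len(t.snapshot()))  # 1 2
```

Note that a reader holding version 0 is unaffected by the second commit: consistency falls out of the log being append-only, which is the core trick behind ACID on immutable object storage.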

The Three Table Formats

| Feature | Delta Lake | Apache Iceberg | Apache Hudi |
| --- | --- | --- | --- |
| Origin | Databricks | Netflix | Uber |
| Governance | Linux Foundation | Apache Foundation | Apache Foundation |
| Metadata | Transaction log (JSON/Parquet) | Manifest files + metadata tree | Timeline + metadata |
| Partition evolution | Limited | Native (hidden partitions) | Limited |
| Engine support | Strong Spark, growing others | Broadest multi-engine | Strong Spark/Flink |
| Time travel | Yes | Yes | Yes |
| Copy-on-write / merge-on-read | Both | Both | Both (core strength) |
| Community momentum (2026) | Very high | Very high | Moderate |

Lake vs Warehouse vs Lakehouse

| Dimension | Data Lake | Data Warehouse | Lakehouse |
| --- | --- | --- | --- |
| Storage cost | Low (object storage) | High (proprietary) | Low (object storage) |
| Query performance | Variable | High | High (with optimization) |
| Schema enforcement | None (schema-on-read) | Strict | Configurable |
| Data types | All (structured, semi, unstructured) | Structured only | All |
| ACID transactions | No | Yes | Yes |
| Governance maturity | Low | High | Medium-High |
| Vendor lock-in | Low | High | Low-Medium |
| ML/AI workloads | Native | Requires export | Native |

When to Choose What

Choose a warehouse when:

  • Your workloads are purely SQL analytics
  • You need maximum query performance on structured data
  • Your team is SQL-first with limited engineering capacity

Choose a lakehouse when:

  • You need both analytics and ML on the same data
  • Cost at scale matters (petabyte range)
  • You want open formats and engine flexibility
  • You need to support streaming and batch

Keep a data lake when:

  • You primarily store raw files for archival or ML training
  • Schema is genuinely unknown at write time
  • Cost is the primary constraint

Architecture Patterns

A typical lakehouse architecture follows a medallion pattern:

  • Bronze -- raw ingestion, append-only, minimal transformation
  • Silver -- cleaned, deduplicated, conformed, business keys applied
  • Gold -- aggregated, business-ready, optimized for consumption

Each layer is a set of tables in the chosen format (Delta, Iceberg, Hudi) on object storage.
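The medallion flow can be sketched with plain Python functions standing in for the three layers. The field names (`user_id`, `country`) and the cleaning rules are illustrative assumptions; in practice each function would write a Delta/Iceberg/Hudi table rather than return a list.

```python
def bronze(raw_events):
    """Bronze: raw ingestion, append-only, no transformation."""
    return list(raw_events)

def silver(bronze_rows):
    """Silver: drop malformed rows and deduplicate on a business key."""
    seen, out = set(), []
    for row in bronze_rows:
        if row.get("user_id") is not None and row["user_id"] not in seen:
            seen.add(row["user_id"])
            out.append(row)
    return out

def gold(silver_rows):
    """Gold: aggregated and business-ready (here: users per country)."""
    counts = {}
    for row in silver_rows:
        counts[row["country"]] = counts.get(row["country"], 0) + 1
    return counts

raw = [
    {"user_id": 1, "country": "DE"},
    {"user_id": 1, "country": "DE"},    # duplicate, dropped in silver
    {"user_id": 2, "country": "US"},
    {"user_id": None, "country": "US"}, # malformed, dropped in silver
]
print(gold(silver(bronze(raw))))  # {'DE': 1, 'US': 1}
```

The point of keeping bronze append-only is reprocessability: if a silver or gold rule changes, the downstream tables can be rebuilt from the untouched raw layer.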

Key Decisions

  • Table format: Iceberg has the broadest engine support; Delta has the deepest Spark integration
  • Compute engine: Spark, Trino, DuckDB, Snowflake (with Iceberg), Athena
  • Catalog: AWS Glue, Hive Metastore, Nessie, Unity Catalog, Polaris
  • Compaction strategy: automated vs manual, frequency vs cost
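On the last point, compaction matters because streaming and frequent small writes leave tables fragmented into many small files, which hurts scan performance. The sketch below shows the underlying idea as a greedy bin-packing plan; the 128 MB target and file names are illustrative assumptions, not any engine's actual algorithm.

```python
TARGET_BYTES = 128 * 1024 * 1024  # a commonly used target file size

def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Greedily group small files into compaction batches up to `target` bytes."""
    groups, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1]):
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    # only groups with more than one file are worth rewriting
    return [g for g in groups if len(g) > 1]

files = {"a": 10 * 2**20, "b": 20 * 2**20, "c": 120 * 2**20}
print(plan_compaction(files))  # [['a', 'b']]
```

The automated-vs-manual trade-off is exactly this loop run on a schedule: more frequent compaction means faster reads but more rewrite cost on object storage.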

Resources