
Data Lake Formats: From Files to Table Abstractions

#data-engineering #data-lake #parquet #iceberg #delta-lake

The modern lakehouse rests on two layers of format choices: the file format that stores bytes on disk, and the table format that adds transactional semantics on top. Getting this right determines query speed, storage cost, schema flexibility, and ecosystem compatibility.

File Format Comparison

| Feature | Parquet | ORC | Avro | CSV / JSON |
|---|---|---|---|---|
| Storage model | Columnar | Columnar | Row-based | Row-based |
| Compression | Excellent (Snappy, Zstd, LZ4) | Excellent (Zlib, Snappy) | Good (Deflate, Snappy) | Poor |
| Schema evolution | Limited (append columns) | Limited | Strong (full compatibility) | None |
| Read performance (analytics) | Excellent (column pruning) | Excellent | Poor (reads full row) | Very poor |
| Write performance | Good | Good | Excellent (fast serialization) | Excellent |
| Ecosystem support | Universal (Spark, Flink, Trino, DuckDB) | Hive/Spark-centric | Kafka, Avro-native tools | Universal but slow |
| Nested data | Strong (repeated/group types) | Strong | Strong | JSON only |
| Ideal use case | Analytics, lakehouse storage | Hive workloads | Streaming, schema registry | Data exchange, debugging |

Parquet has become the de facto standard for analytical storage. Its columnar layout enables predicate pushdown and column pruning, which can reduce I/O by 90%+ on wide tables.
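
The benefit of column pruning is easy to see with a toy example: a row layout must touch every field of every record to answer a single-column query, while a columnar layout reads only the requested column. This is a pure-Python illustration of the idea, not a real Parquet reader:

```python
# Toy illustration of row-based vs. columnar access (not a real Parquet reader).
rows = [{"id": i, "name": f"user{i}", "score": i * 2, "bio": "x" * 100}
        for i in range(1000)]

# Row layout: answering "sum of score" touches every field of every row.
row_bytes_touched = sum(len(str(v)) for r in rows for v in r.values())

# Columnar layout: the same query reads only the 'score' column.
columns = {k: [r[k] for r in rows] for k in rows[0]}
col_bytes_touched = sum(len(str(v)) for v in columns["score"])

print(f"row scan: {row_bytes_touched} bytes, column scan: {col_bytes_touched} bytes")
```

With the wide `bio` field dominating each row, the columnar scan touches a small fraction of the bytes — the same effect predicate pushdown and column pruning deliver on real Parquet files.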

Table Format Comparison

| Feature | Apache Iceberg | Delta Lake | Apache Hudi |
|---|---|---|---|
| Governance | Apache Software Foundation | Linux Foundation (Databricks origin) | Apache Software Foundation |
| ACID transactions | Yes | Yes | Yes |
| Time travel | Yes (snapshot-based) | Yes (version log) | Yes (timeline-based) |
| Schema evolution | Full (add, drop, rename, reorder) | Add/rename columns | Add columns |
| Partition evolution | Yes (hidden partitioning) | No (requires rewrite) | Limited |
| Row-level deletes | Copy-on-write + merge-on-read | Deletion vectors | Merge-on-read native |
| Query engines | Spark, Flink, Trino, DuckDB, Athena, BigQuery | Spark, Flink, Trino (via connectors) | Spark, Flink, Trino |
| Cloud catalogs | AWS Glue, Nessie, REST Catalog, Polaris | Unity Catalog, Glue | Hive Metastore |
| Branching / tagging | Yes (via Nessie) | Shallow clone | No |
| Community momentum (2026) | Very high (industry convergence) | High (Databricks ecosystem) | Moderate (Uber/AWS) |
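
Snapshot-based time travel, as in the table above, boils down to an append-only metadata log: each commit records the set of data files visible at that version, and reading an old snapshot just means looking up an earlier log entry. A minimal sketch (the `SnapshotLog` class is hypothetical, loosely modeled on Iceberg/Delta metadata, not any real API):

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    version: int
    files: frozenset  # data files visible at this version

class SnapshotLog:
    """Append-only commit log; each commit yields an immutable snapshot."""
    def __init__(self):
        self.snapshots: list[Snapshot] = []

    def commit(self, added=(), removed=()):
        current = self.snapshots[-1].files if self.snapshots else frozenset()
        new = (current - frozenset(removed)) | frozenset(added)
        self.snapshots.append(Snapshot(len(self.snapshots), new))

    def files_as_of(self, version: int) -> frozenset:
        # Time travel = read an old log entry; no data files are touched.
        return self.snapshots[version].files

log = SnapshotLog()
log.commit(added={"a.parquet"})
log.commit(added={"b.parquet"})
log.commit(removed={"a.parquet"}, added={"c.parquet"})  # delete + rewrite

assert log.files_as_of(0) == {"a.parquet"}
assert log.files_as_of(2) == {"b.parquet", "c.parquet"}
```

This also shows why expiring old snapshots matters: `a.parquet` cannot be physically deleted while version 0 is still reachable.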

Performance Benchmarks (Indicative)

| Operation | Iceberg | Delta Lake | Hudi | Raw Parquet |
|---|---|---|---|---|
| Point lookup (1 row) | ~50ms | ~60ms | ~80ms | Full scan |
| Scan 1% of partitions | ~2s | ~2s | ~3s | ~2s (if partitioned) |
| Upsert 1M rows into 1B-row table | ~30s | ~35s | ~25s | N/A (immutable) |
| Time travel (read old snapshot) | ~1s overhead | ~1s overhead | ~2s overhead | N/A |
| Schema evolution (add column) | Metadata only | Metadata only | Metadata only | Rewrite required |
| Partition evolution | Metadata only | Full rewrite | Full rewrite | Full rewrite |

Benchmarks are order-of-magnitude estimates on comparable hardware. Actual performance depends on data size, cluster configuration, and engine.
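
The upsert numbers above reflect two strategies: copy-on-write rewrites the affected data files at write time, while merge-on-read appends a small log of changes and reconciles them when the table is read. A toy sketch of merge-on-read (hypothetical names, not Hudi's actual file layout):

```python
# Merge-on-read sketch: an immutable base file plus an upsert log,
# reconciled at read time. Writes stay cheap; reads pay the merge cost.
base = {1: "alice", 2: "bob", 3: "carol"}   # immutable base file (key -> row)
log = [(2, "bobby"), (4, "dave")]           # appended upsert log

def read_merged(base, log):
    # The reader applies log records over the base in commit order.
    merged = dict(base)
    for key, value in log:
        merged[key] = value
    return merged

assert read_merged(base, log) == {1: "alice", 2: "bobby", 3: "carol", 4: "dave"}
```

The merge cost is paid on every read until compaction folds the log back into the base files — which is why merge-on-read tables need regular compaction.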

Evolution Timeline

2009  ──── Apache Avro 1.0 released
2013  ──── Apache Parquet announced (inspired by Google's Dremel paper)
2013  ──── ORC format introduced (Hive project)
2016  ──── Apache Hudi created at Uber
2017  ──── Delta Lake created at Databricks
2017  ──── Iceberg created at Netflix (donated to Apache in 2018)
2019  ──── Delta Lake open-sourced
2022  ──── Iceberg adopted by AWS, Snowflake, Dremio
2023  ──── Delta Lake UniForm writes Iceberg-compatible metadata
2024  ──── Databricks acquires Tabular (founded by Iceberg's creators)
2025  ──── Iceberg REST Catalog becomes the standard catalog API
2026  ──── Industry converges: Iceberg as interchange format

Decision Framework

Do you need ACID transactions on your lake?
├── No  → Raw Parquet/ORC files are sufficient
└── Yes
    ├── Databricks-centric stack? → Delta Lake (native integration)
    ├── Multi-engine / multi-cloud? → Apache Iceberg
    └── Heavy upsert workload (CDC)? → Apache Hudi or Iceberg MoR
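
The same decision tree can be captured as a small helper (a hypothetical function that simply mirrors the branches above):

```python
def choose_table_format(needs_acid: bool,
                        databricks_centric: bool = False,
                        heavy_upserts: bool = False) -> str:
    """Encode the decision tree above; returns a recommendation string."""
    if not needs_acid:
        return "raw Parquet/ORC"
    if databricks_centric:
        return "Delta Lake"
    if heavy_upserts:
        return "Apache Hudi or Iceberg (merge-on-read)"
    return "Apache Iceberg"

assert choose_table_format(needs_acid=False) == "raw Parquet/ORC"
assert choose_table_format(True, databricks_centric=True) == "Delta Lake"
assert choose_table_format(True) == "Apache Iceberg"
```

Note the branch order matters: a Databricks-centric shop with heavy upserts still lands on Delta Lake here, matching the tree's top-down reading.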

Storage Layout Best Practices

| Practice | Impact | Details |
|---|---|---|
| Target 128-512 MB file sizes | Read performance | Small files cause metadata overhead; oversized files limit read parallelism |
| Use hidden partitioning (Iceberg) | Query speed + flexibility | Partition by transform (day, bucket) without exposing partition columns to users |
| Enable column statistics | Predicate pushdown | Min/max stats allow skipping irrelevant files |
| Compact regularly | Performance maintenance | Merge the small files produced by streaming writes |
| Expire old snapshots | Storage cost | Time-travel history grows unbounded without cleanup |
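
The file-sizing and compaction advice above amounts to bin-packing small files up to the target size. A simplified planner sketch (hypothetical; a real compactor would also respect partition boundaries and sort order):

```python
TARGET = 512 * 1024 * 1024  # upper bound of the 128-512 MB range above

def plan_compaction(file_sizes, target=TARGET):
    """Greedily group files into compaction bins no larger than `target` bytes."""
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        if current and current_size + size > target:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

mb = 1024 * 1024
small_files = [8 * mb] * 100 + [300 * mb, 200 * mb]  # e.g. streaming writer output
plan = plan_compaction(small_files)
assert all(sum(b) <= TARGET for b in plan)
assert len(plan) < len(small_files)  # far fewer, larger files after compaction
```

Here 102 small files collapse into a handful of near-target bins, which is exactly the metadata-overhead reduction the table describes.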
