# Data Lake Formats: From Files to Table Abstractions
#data-engineering #data-lake #parquet #iceberg #delta-lake
The modern lakehouse rests on two layers of format choices: the file format that stores bytes on disk, and the table format that adds transactional semantics on top. Getting this right determines query speed, storage cost, schema flexibility, and ecosystem compatibility.
## File Format Comparison
| Feature | Parquet | ORC | Avro | CSV / JSON |
|---|---|---|---|---|
| Storage model | Columnar | Columnar | Row-based | Row-based |
| Compression | Excellent (Snappy, Zstd, LZ4) | Excellent (Zlib, Snappy) | Good (Deflate, Snappy) | Poor |
| Schema evolution | Limited (append columns) | Limited | Strong (full compatibility) | None |
| Read performance (analytics) | Excellent — column pruning | Excellent | Poor — reads full row | Very poor |
| Write performance | Good | Good | Excellent — fast serialization | Excellent |
| Ecosystem support | Universal (Spark, Flink, Trino, DuckDB) | Hive/Spark-centric | Kafka, Avro-native tools | Universal but slow |
| Nested data | Strong (repeated/group types) | Strong | Strong | JSON only |
| Ideal use case | Analytics, lakehouse storage | Hive workloads | Streaming, schema registry | Data exchange, debugging |
Parquet has become the de facto standard for analytical storage. Its columnar layout enables predicate pushdown and column pruning, which can reduce I/O by 90%+ on wide tables.
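The effect of column pruning and predicate pushdown can be sketched with a toy model. This is plain Python for illustration only; a real Parquet reader works on encoded column chunks and row-group footer statistics, not JSON strings:

```python
# Toy model of why columnar layout plus footer statistics cut I/O.
# Illustrative only; not how a real Parquet reader is implemented.
import json

rows = [{"user_id": i, "country": "DE" if i % 2 else "US",
         "revenue": float(i), "payload": "x" * 50} for i in range(1000)]

# Row-based scan: every byte of every row is touched,
# even though the query needs only two of the four fields.
row_bytes = sum(len(json.dumps(r)) for r in rows)

# Columnar scan with row groups of 250 rows: read only the columns
# the query references, and skip groups whose min/max statistics
# prove the predicate (revenue > 900) cannot match.
group_size = 250
cols_needed = ["user_id", "revenue"]
col_bytes = 0
for start in range(0, len(rows), group_size):
    group = rows[start:start + group_size]
    group_max = max(r["revenue"] for r in group)
    if group_max <= 900:             # predicate pushdown: skip group
        continue
    for col in cols_needed:          # column pruning: skip other columns
        col_bytes += sum(len(json.dumps(r[col])) for r in group)

print(f"row-based bytes: {row_bytes}, columnar bytes: {col_bytes}")
```

With these numbers, only one of four row groups survives the min/max check and only two of four columns are read, which is where the order-of-magnitude I/O reduction on wide tables comes from.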
## Table Format Comparison
| Feature | Apache Iceberg | Delta Lake | Apache Hudi |
|---|---|---|---|
| Governance | Apache Software Foundation | Linux Foundation (Databricks origin) | Apache Software Foundation |
| ACID transactions | Yes | Yes | Yes |
| Time travel | Yes — snapshot-based | Yes — version log | Yes — timeline-based |
| Schema evolution | Full (add, drop, rename, reorder) | Add, rename, drop (with column mapping) | Add columns, type promotion |
| Partition evolution | Yes — hidden partitioning | No — requires rewrite | Limited |
| Row-level deletes | Copy-on-write + merge-on-read | Deletion vectors | Merge-on-read native |
| Query engines | Spark, Flink, Trino, DuckDB, Athena, BigQuery | Spark, Flink, Trino (via connectors) | Spark, Flink, Trino |
| Cloud catalog | AWS Glue, Nessie, REST Catalog, Polaris | Unity Catalog, Glue | Hive Metastore |
| Branching / tagging | Yes (native branch/tag refs; also via Nessie) | Shallow Clone | No |
| Community momentum (2026) | Very high — industry convergence | High — Databricks ecosystem | Moderate — Uber/AWS |
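The snapshot mechanism behind Iceberg-style time travel and ACID commits can be sketched in a few lines. The class and field names below are illustrative, not the actual Iceberg metadata layout:

```python
# Minimal sketch of snapshot-based table metadata (Iceberg-style).
# Each commit writes a new immutable snapshot listing the data files
# visible at that version; old snapshots stay readable until expired.
# Names are illustrative, not the real Iceberg metadata format.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    data_files: tuple            # files visible in this version

@dataclass
class Table:
    snapshots: list = field(default_factory=list)

    def commit(self, timestamp_ms, data_files):
        # An atomic commit appends one new snapshot; readers see
        # either the old version or the new one, never a partial state.
        sid = len(self.snapshots) + 1
        self.snapshots.append(Snapshot(sid, timestamp_ms, tuple(data_files)))
        return sid

    def current(self):
        return self.snapshots[-1]

    def as_of(self, timestamp_ms):
        # Time travel: latest snapshot at or before the given timestamp.
        eligible = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        return eligible[-1] if eligible else None

t = Table()
t.commit(1000, ["a.parquet"])
t.commit(2000, ["a.parquet", "b.parquet"])
t.commit(3000, ["a.parquet", "b.parquet", "c.parquet"])
print(t.as_of(2500).data_files)   # the table as it looked at t=2500
```

This also shows why expiring old snapshots matters for storage cost: every data file referenced by any retained snapshot must be kept on disk.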
## Performance Benchmarks (Indicative)
| Operation | Iceberg | Delta Lake | Hudi | Raw Parquet |
|---|---|---|---|---|
| Point lookup (1 row) | ~50ms | ~60ms | ~80ms | Full scan |
| Scan 1% of partitions | ~2s | ~2s | ~3s | ~2s (if partitioned) |
| Upsert 1M rows into 1B table | ~30s | ~35s | ~25s | N/A (immutable) |
| Time travel (read old snapshot) | ~1s overhead | ~1s overhead | ~2s overhead | N/A |
| Schema evolution (add column) | Metadata only | Metadata only | Metadata only | Rewrite required |
| Partition evolution | Metadata only | Full rewrite | Full rewrite | Full rewrite |
Benchmarks are order-of-magnitude estimates on comparable hardware. Actual performance depends on data size, cluster configuration, and engine.
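Why partition evolution can be metadata-only is easiest to see in miniature: each data file records the partition spec it was written with, so changing the spec in table metadata never forces a rewrite of old files. The helper names below are hypothetical, not the real Iceberg API:

```python
# Sketch of metadata-only partition evolution. Each data file keeps
# the spec_id it was written under; evolving the spec only affects
# files written afterwards. Hypothetical helpers, not a real API.
from datetime import datetime

def spec_v1(ts):   # original spec: partition by month
    return f"month={ts:%Y-%m}"

def spec_v2(ts):   # evolved spec: partition by day
    return f"day={ts:%Y-%m-%d}"

files = []

def write(ts, path, spec, spec_id):
    # The file remembers its own spec; nothing old is touched.
    files.append({"path": path, "partition": spec(ts), "spec_id": spec_id})

write(datetime(2026, 1, 5), "f1.parquet", spec_v1, 1)
# spec evolves in table metadata; existing files stay as written
write(datetime(2026, 2, 3), "f2.parquet", spec_v2, 2)

print([f["partition"] for f in files])
```

A query planner then prunes each file using the spec it was written with, which is why the "Partition evolution" row above reads "Metadata only" for Iceberg but "Full rewrite" for formats that bake one spec into the directory layout.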
## Evolution Timeline
```
2009 ──── Apache Avro 1.0 released
2010 ──── Google publishes the Dremel paper (Parquet's inspiration)
2013 ──── Apache Parquet (Twitter/Cloudera) and ORC (Hive project) introduced
2016 ──── Apache Hudi created at Uber
2017 ──── Delta Lake created at Databricks
2018 ──── Iceberg (created at Netflix, 2017) enters the Apache Incubator
2019 ──── Delta Lake open-sourced
2022 ──── Iceberg adopted by AWS, Snowflake, Dremio
2023 ──── Delta Lake UniForm exposes Delta tables as Iceberg
2024 ──── Databricks acquires Tabular (founded by Iceberg's creators)
2025 ──── Iceberg REST Catalog becomes standard API
2026 ──── Industry converges: Iceberg as interchange format
```
## Decision Framework
```
Do you need ACID transactions on your lake?
├── No → Raw Parquet/ORC files are sufficient
└── Yes
    ├── Databricks-centric stack? → Delta Lake (native integration)
    ├── Multi-engine / multi-cloud? → Apache Iceberg
    └── Heavy upsert workload (CDC)? → Apache Hudi or Iceberg MoR
```
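The tree above can be encoded directly. The boolean flags are simplifications of the questions in the tree, and the return values are the recommendations from this article, not an exhaustive rule:

```python
# The decision tree as a function. Flags are deliberately coarse;
# real format selection also weighs team skills and existing tooling.
def choose_format(need_acid, databricks_centric=False, heavy_upserts=False):
    if not need_acid:
        return "Raw Parquet/ORC"
    if databricks_centric:
        return "Delta Lake"                    # native integration
    if heavy_upserts:
        return "Apache Hudi or Iceberg MoR"    # CDC-heavy workloads
    return "Apache Iceberg"                    # multi-engine default

print(choose_format(need_acid=True, heavy_upserts=True))
```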
## Storage Layout Best Practices
| Practice | Impact | Details |
|---|---|---|
| Target 128-512 MB file sizes | Read performance | Small files cause metadata overhead; large files waste I/O |
| Use hidden partitioning (Iceberg) | Query speed + flexibility | Partition by transform (day, bucket) without user-facing columns |
| Enable column statistics | Predicate pushdown | Min/max stats allow skipping irrelevant files |
| Compact regularly | Performance maintenance | Merge small files produced by streaming writes |
| Expire old snapshots | Storage cost | Time travel history grows unbounded without cleanup |
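The file-size and compaction rows above can be sketched as a greedy bin-packing plan. This is illustrative only; real compaction (for example Iceberg's `rewrite_data_files` procedure) also considers partitions, delete files, and sort order:

```python
# Greedy compaction planning against the 128-512 MB file-size target:
# pack small files into bins until each bin reaches the target size,
# then rewrite each bin as one output file. Illustrative sketch only.
TARGET = 256 * 1024 * 1024           # aim for the middle of the range

def plan_compaction(file_sizes, target=TARGET):
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        current.append(size)
        current_size += size
        if current_size >= target:
            bins.append(current)
            current, current_size = [], 0
    if current:
        bins.append(current)
    return bins                      # each bin becomes one output file

# 1000 one-MB files from a streaming writer collapse into a few
# target-sized files, shrinking per-file metadata and open overhead.
small = [1 * 1024 * 1024] * 1000
print(len(plan_compaction(small)))
```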
## Resources
- Apache Iceberg Documentation
- Delta Lake Documentation
- Parquet Format Specification
- Jack Vanlightly — Lakehouse Format Comparison
- Tabular — Why Apache Iceberg