What Is a Lakehouse?
The data lakehouse blends the scalability and low cost of data lakes with the reliability and structure of data warehouses. In 2025, lakehouses are powered by ACID table formats such as Apache Iceberg, Delta Lake, and Apache Hudi, layered on object storage (S3, GCS, ADLS). They unlock time travel, schema evolution, and upserts while keeping compute engines (Spark, Trino, Flink, Snowflake external tables, BigQuery) decoupled from storage.
Lakehouse Reference Model (Medallion)
The Medallion Architecture (Bronze → Silver → Gold) organizes data by refinement level, improving trust and performance.
Why ACID Table Formats Matter
- Atomic Writes: No partial files or inconsistent reads for streaming/batch concurrency.
- Schema Evolution: Safely add/rename columns and manage compatibility.
- Time Travel: Audit, reproduce, and roll back tables to known points (see the sketch after this list).
- Partition + Clustering: Prune data efficiently for fast scans.
- Compaction: Merge small files to improve query performance and cost.
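As a concrete illustration of the schema-evolution and time-travel bullets above, here is a minimal Spark SQL sketch against a hypothetical Iceberg table `catalog.db.orders`; exact syntax varies slightly by format and engine version.
-- Schema evolution: add a column without rewriting existing data (metadata-only change in Iceberg)
ALTER TABLE catalog.db.orders ADD COLUMN discount DECIMAL(18,2);
-- Time travel: query an earlier state by snapshot id or timestamp (Spark 3.3+ syntax)
SELECT count(*) FROM catalog.db.orders VERSION AS OF 123456789; -- snapshot id taken from the table history
SELECT count(*) FROM catalog.db.orders TIMESTAMP AS OF '2024-01-01 00:00:00';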
Iceberg vs Delta vs Hudi
| Capability | Iceberg | Delta | Hudi |
|---|---|---|---|
| Upserts/Merge | Yes (MERGE INTO) | Yes (MERGE INTO; native in Databricks) | Yes (Copy-on-Write / Merge-on-Read) |
| Streaming Ingest | Spark, Flink | Spark, Structured Streaming | Spark, Flink (strong) |
| Engine Interop | Spark, Flink, Trino, Snowflake ext. | Spark, Trino, DBX, others | Spark, Flink, Presto |
| Time Travel | Yes | Yes | Yes |
| Governance Tooling | Catalogs (Glue/Hive/REST), RBAC via engines | Unity Catalog (Databricks), engine RBAC | Glue/Hive, engine-based controls |
Performance Tuning Cheat Sheet
- File Size: Target 256–1024 MB Parquet files; avoid tiny files.
- Compression: Prefer ZSTD or Snappy depending on CPU budget and data entropy.
- Partitioning: Avoid identity partitioning on high-cardinality columns; prefer lower-cardinality partition keys, bucketing, or clustering.
- Compaction: Schedule regular compaction and manifest rewrites, and monitor small-file counts (see the sketch after this list).
- Predicate Pushdown: Model queries to maximize pruning; avoid wrapping partition columns in functions.
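If you run Iceberg on Spark, the compaction item above maps to built-in maintenance procedures; a minimal sketch, assuming a catalog named `catalog` and a table `db.sales` (Delta users would reach for OPTIMIZE instead, shown later).
-- Compact small data files toward ~512 MB targets (Iceberg Spark procedure)
CALL catalog.system.rewrite_data_files(
  table => 'db.sales',
  options => map('target-file-size-bytes', '536870912')
);
-- Rewrite manifests so query planning stays fast after heavy ingest
CALL catalog.system.rewrite_manifests('db.sales');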
Governance & Security
Lakehouses must balance agility and compliance:
- Row/Column Security: Enforce via engine policies (Trino/Spark), catalogs (Unity/Glue), or policy engines.
- Data Contracts: Schemas and SLAs published per table; break-glass procedures for changes.
- Lineage: Track source-to-gold lineage in OpenLineage-compatible tools.
- PII Handling: Tokenize, hash, or encrypt sensitive fields before they reach gold zones (see the sketch after this list).
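One hedged way to implement the PII item above is to publish gold through a view that hashes direct identifiers; a minimal sketch, assuming a Spark SQL engine and hypothetical `silver.customers` and `gold` schemas.
-- Hash direct identifiers before data lands in the gold zone (names are illustrative)
CREATE OR REPLACE VIEW gold.customers_v AS
SELECT
  customer_id,
  sha2(email, 256)        AS email_hash,  -- one-way hash still usable for joins and dedup
  sha2(phone_number, 256) AS phone_hash,
  country,
  signup_date
FROM silver.customers;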
Migration Playbook (From Lake → Lakehouse)
- Inventory: Profile Parquet/CSV footprints, access patterns, and consumer dependencies.
- Pick a Table Format: Choose Iceberg/Delta/Hudi based on engine preferences and upsert needs.
- Bootstrap Bronze: Register existing Parquet paths as managed tables and validate snapshots (see the sketch after this list).
- Build Silver: Conform data and enforce tests (dbt) with incremental models.
- Curate Gold: Publish marts and semantic layers; add row/column-level controls.
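For the Bronze bootstrap step, both Delta and Iceberg can adopt existing Parquet in place rather than copying it; a sketch with illustrative paths and table names.
-- Delta Lake: convert a Parquet directory in place (path and partition column are illustrative)
CONVERT TO DELTA parquet.`s3://bucket/bronze/sales` PARTITIONED BY (order_date DATE);
-- Apache Iceberg: register existing Parquet files into a pre-created table with a matching schema
CALL catalog.system.add_files(
  table => 'db.bronze_sales',
  source_table => '`parquet`.`s3://bucket/bronze/sales`'
);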
Practical SQL Examples
-- Apache Iceberg: Create + Partition (Spark SQL; assumes an Iceberg catalog named `catalog`)
CREATE TABLE catalog.db.sales_iceberg (
  order_id   BIGINT,
  order_date DATE,
  country    STRING,
  amount     DECIMAL(18,2)
) USING iceberg
PARTITIONED BY (months(order_date));
-- Delta Lake: Optimize + Z-Order
OPTIMIZE db.sales_delta ZORDER BY (country, order_date);
-- Apache Hudi: Upsert via Spark SQL
MERGE INTO hudi_sales h
USING staged_updates s
ON h.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET amount = s.amount
WHEN NOT MATCHED THEN INSERT *;
Engines & Interoperability
Pair the table format with the right engines:
- Spark/Flink: Heavy lifting for batch + streaming transforms and compaction.
- Trino/Presto: Low-latency federated queries across Iceberg/Delta/Hudi and JDBC sources.
- DuckDB: Developer productivity for local analysis; reads Parquet directly and table formats via extensions (see the sketch after this list).
- Snowflake/BigQuery: External tables on lakehouse storage for hybrid architectures.
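For the DuckDB workflow above, a minimal sketch assuming the `httpfs` and `iceberg` extensions are installed and credentials are configured; paths are illustrative.
-- Read raw Parquet straight from object storage
SELECT count(*) FROM read_parquet('s3://bucket/bronze/sales/*.parquet');
-- Scan an Iceberg table via the iceberg extension
INSTALL iceberg;
LOAD iceberg;
SELECT * FROM iceberg_scan('s3://bucket/warehouse/db/sales') LIMIT 10;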
Cost Model
| Area | Lever | Typical Savings |
|---|---|---|
| Storage | Compression, tiering, deleting old snapshots | 15–40% |
| Compute | Autoscaling, spot instances, cache warmups | 20–50% |
| Catalog | Reduce snapshot churn, compact manifests | 10–20% |
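The "deleting old snapshots" lever maps to retention commands in each format; a sketch with illustrative cutoffs (keep them aligned with your time-travel and audit requirements).
-- Apache Iceberg: expire snapshots older than a cutoff so unreferenced files can be removed
CALL catalog.system.expire_snapshots(table => 'db.sales', older_than => TIMESTAMP '2024-01-01 00:00:00');
-- Delta Lake: remove data files no longer referenced by the table (default retention is 7 days)
VACUUM db.sales_delta RETAIN 168 HOURS;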
Quality & Observability
Embedding tests and metadata into the lifecycle prevents data drift:
- dbt Tests: Uniqueness, accepted values, freshness, and custom assertions on silver/gold.
- OpenLineage: End-to-end lineage across jobs and tables, integrated with orchestrators.
- Monitors: File counts, small-file ratios, average partition size, and table snapshot age (see the query sketch after this list).
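For the monitoring item above, Iceberg's metadata tables make small-file ratios directly queryable; a sketch assuming Spark SQL and a table `catalog.db.sales` (the 64 MB threshold is illustrative).
-- Small-file ratio and average file size from the Iceberg `files` metadata table
SELECT
  count(*) AS total_files,
  sum(CASE WHEN file_size_in_bytes < 64 * 1024 * 1024 THEN 1 ELSE 0 END) AS small_files,
  round(avg(file_size_in_bytes) / (1024 * 1024), 1) AS avg_file_mb
FROM catalog.db.sales.files;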
Checklist Before Production
- Partitioning and clustering strategy validated against top queries
- Compaction cadence automated with backpressure safeguards
- Time-travel retention aligned with audit requirements
- Access policies and masking tested across engines
- Cost dashboards tracking storage, compute, and catalog churn
Conclusion
The lakehouse has matured into the default choice for scalable analytics. Whether you choose Iceberg, Delta, or Hudi, success depends on table hygiene (partitioning, compaction), governance (contracts, lineage), and engineering discipline (tests, cost controls). Start with one domain, nail the medallion flow, and scale with confidence.
Need an expert review? Eficsy can assess your lakehouse design, tune performance, and reduce costs. Book a consultation →