Apache Iceberg Deep Dive: Building a Production Data Lakehouse from Catalog to Query Engine

Introduction: Why Iceberg Matters for Production Data Lakehouses

For decades, data lakes were write-once repositories: fast to ingest, slow to query, with no ACID guarantees. A schema change meant reprocessing terabytes. Concurrent writes could corrupt data. Auditors asked “where’s my snapshot from March 15th?” and engineers had no answer.

Apache Iceberg changes this fundamental equation. Created at Netflix in 2017, open-sourced to the Apache Incubator in 2018, and a top-level Apache project since 2020, Iceberg is a table format that adds relational semantics to object storage. Think of it as the bridge between Hadoop’s distributed-filesystem assumptions, modern cloud object storage (S3, Azure Data Lake, GCS), and the data governance demands of 2026.

This post dissects Iceberg’s architecture layer-by-layer: from REST Catalogs down to data files, through ACID transactions, schema evolution, and the 2026 streaming lakehouse pattern where Flink and RisingWave write directly to Iceberg tables. You’ll learn why thousands of organizations are migrating from Hive and Delta, how to handle the small-files problem in production, and how to query historical snapshots without maintaining separate backups.


Part 1: The Three-Layer Metadata Architecture

Layer 1: The Catalog — Your Single Source of Truth

The Catalog is the entry point into every Iceberg table. It is not a metadata store; it is a thin pointer registry that holds exactly one piece of information: the path to the current metadata file.

Think of the catalog as your phone’s contact list. You don’t store a friend’s entire call history in the contact entry; you store one phone number, and dialing it connects you to everything else. The catalog stores one URL (or S3 path) pointing to the latest metadata file (e.g., metadata-v3.json). Nothing more.

Why this design? Object storage (S3) has no atomic multi-object writes. You cannot atomically write a metadata file AND manifest AND data files in one operation. Iceberg solves this with compare-and-swap (CAS): update the catalog pointer to point to a new metadata file only if it hasn’t changed. This is a single-object atomic operation that any object storage API supports.

Modern catalogs implement Iceberg’s REST API:

  • Apache Polaris (graduated to top-level Apache project in February 2026): Open-source REST catalog with credential vending and RBAC built-in. Perfect for multi-tenant governance.
  • Unity Catalog (Databricks): Proprietary REST catalog deeply integrated with Databricks workspaces.
  • Nessie: Git-like branching and tagging for Iceberg tables. Powerful for testing and data lineage.
  • AWS Glue: AWS’s managed catalog. Simplest for AWS-native stacks.

The catalog itself holds almost no state. If it is lost or corrupted, the table can be recovered by re-registering the latest metadata file from object storage; only the pointer needs rebuilding. The true state lives in the metadata files.

Layer 2: Metadata Files — The Complete Table Definition

A metadata file is a single JSON document stored in object storage. It contains:

  1. Schema: Column names, IDs, types, nullability, and doc strings. Column IDs are immutable, enabling schema evolution without data rewrites.
  2. Partition spec: How data is partitioned (e.g., month(created_at), bucket(100, user_id)).
  3. Sort order: How files within a partition are physically sorted (enables Z-order indexing for skipping).
  4. Snapshot list: IDs of all snapshots, including the current and previous snapshots for rollback.
  5. Properties: Table configuration like “write.metadata.delete-after-commit.enabled”.
A simplified, abbreviated metadata file looks like this:

{
  "format-version": 2,
  "table-uuid": "abc123...",
  "location": "s3://bucket/tbl",
  "last-sequence-number": 2,
  "last-updated-ms": 1713360000000,
  "current-schema-id": 0,
  "schemas": [
    {
      "type": "struct",
      "schema-id": 0,
      "fields": [
        {"id": 1, "name": "id", "type": "long", "required": true},
        {"id": 2, "name": "email", "type": "string", "required": true},
        {"id": 3, "name": "created_at", "type": "timestamp", "required": true}
      ]
    }
  ],
  "snapshots": [
    {
      "snapshot-id": 1,
      "timestamp-ms": 1713360000000,
      "summary": {"operation": "append", "added-data-files": "10"},
      "manifest-list": "s3://bucket/tbl/metadata/snap-1-manifests.avro"
    }
  ],
  "current-snapshot-id": 1
}

(Note that the metadata file’s own location is what the catalog stores; it is not a field inside the file itself.)
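To make the pointer-chasing concrete, here is a minimal Python sketch of how an engine resolves the current snapshot from a parsed metadata document. The dict below is a hand-abbreviated version of the example above, and current_manifest_list is an illustrative helper, not an Iceberg API:

```python
metadata = {  # abbreviated metadata document, as parsed from JSON
    "snapshots": [
        {"snapshot-id": 1,
         "manifest-list": "s3://bucket/tbl/metadata/snap-1-manifests.avro"},
    ],
    "current-snapshot-id": 1,
}

def current_manifest_list(metadata):
    """Follow current-snapshot-id to that snapshot's manifest-list path."""
    snap_id = metadata["current-snapshot-id"]
    snapshot = next(s for s in metadata["snapshots"]
                    if s["snapshot-id"] == snap_id)
    return snapshot["manifest-list"]

assert current_manifest_list(metadata) == "s3://bucket/tbl/metadata/snap-1-manifests.avro"
```

From here, a real engine would fetch that Avro manifest list to begin planning the scan.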

Every write operation (INSERT, UPDATE, DELETE, MERGE) creates a new metadata file. The old one is never modified. This immutability is key: queries reading metadata-v2.json will always see consistent results, even if metadata-v3.json is being written concurrently.

Layer 3: Manifests — File Tracking with Statistics

Below metadata sit manifest lists and manifest files—both Avro-encoded, not JSON. They are the workhorse of Iceberg’s file-level pruning.

Each snapshot has one manifest list (e.g., snap-1-manifests.avro), which is a list of manifest files plus partition-level statistics:

  • Partition tuple (day=2026-04-17, region=us-west-2)
  • Record count
  • Lower and upper bounds for each partition field (column-level stats live one layer down, in the manifests)

The manifest list answers the question: “Which manifest files might contain my data?” Query engines use partition stats to skip entire manifest files without opening them.

Each manifest file (e.g., manifest-001.avro) tracks a set of data files:

File: s3://bucket/tbl/data/00001-S0-abc123.parquet
  Record count: 50,000
  File min values: {id: 1, email: "a@x.com", created_at: "2026-01-01"}
  File max values: {id: 50000, email: "z@x.com", created_at: "2026-04-17"}
  File nulls: {id: 0, email: 12, created_at: 0}

Query engines use these statistics to skip files without scanning them. If you query WHERE id > 60000, Iceberg knows this file’s max ID is 50000, so it skips it entirely. For a 10-terabyte table with 100,000 small files, this pruning can reduce I/O by 1000x.
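The skip decision itself is simple interval logic over the per-file min/max statistics. A hedged sketch, where can_skip and file_stats are illustrative names rather than Iceberg APIs:

```python
def can_skip(file_stats, column, op, value):
    """Return True if a data file can be skipped for a simple predicate,
    based on per-file min/max statistics (as stored in Iceberg manifests)."""
    lo = file_stats["min"][column]
    hi = file_stats["max"][column]
    if op == ">":
        return hi <= value       # every value in the file is <= value
    if op == "<":
        return lo >= value       # every value in the file is >= value
    if op == "==":
        return value < lo or value > hi
    return False                 # unknown operator: must scan the file

stats = {"min": {"id": 1}, "max": {"id": 50000}}
assert can_skip(stats, "id", ">", 60000)      # max id is 50000: skip the file
assert not can_skip(stats, "id", "==", 42)    # 42 is inside [1, 50000]: must scan
```

Real engines apply the same idea twice: once against partition bounds in the manifest list, then again against column bounds in the manifests.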

Iceberg three-layer metadata hierarchy: Catalog pointer → Metadata JSON → Manifest lists → Manifest files → Data files


Part 2: ACID Transactions via Optimistic Concurrency

Traditional databases use pessimistic locking: lock the row before modifying it. Distributed systems can’t do this at scale across cloud regions. Iceberg instead uses optimistic concurrency control: assume no conflict, prepare changes, and only at commit time check if the underlying table changed.

The Optimistic Concurrency Model

Imagine two jobs—Writer A (a Spark batch) and Writer B (a Flink stream)—both wanting to append data:

  1. Read phase: Both read the current metadata (e.g., metadata-v3.json). They see the current snapshot ID and partition layout.
  2. Prepare phase: Both independently write new data files and prepare a new metadata file pointing to those files.
  3. Commit phase: Both attempt an atomic CAS (compare-and-swap) on the catalog pointer. Only one can win.

The winner moves the catalog from metadata-v3.json to its new metadata-v4.json. The loser’s CAS fails—the pointer hasn’t moved because the expected value (metadata-v3) no longer matches the actual value (metadata-v4).

The loser then retries: read the latest metadata (now v4), rebase its changes on top (add its new data files to v4’s manifest list, not v3’s), and attempt CAS again. If Writer B’s new files don’t conflict with Writer A’s (different partitions, different row IDs), this rebase is trivial and usually succeeds on the first retry.

Writer A: Read v3 → Prepare v4 → CAS v3→v4 → SUCCESS
Writer B: Read v3 → Prepare v4b → CAS v3→v4b → FAIL (saw v4)
Writer B: Read v4 → Rebase to v5b → CAS v4→v5b → SUCCESS
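The commit loop above can be simulated in a few lines. This is a toy sketch: InMemoryCatalog stands in for a real catalog’s atomic pointer swap, and the “metadata files” are just strings:

```python
import threading

class InMemoryCatalog:
    """Toy stand-in for a real catalog: one pointer, one atomic CAS."""
    def __init__(self, pointer):
        self._pointer = pointer
        self._lock = threading.Lock()

    def current(self):
        return self._pointer

    def compare_and_swap(self, expected, new):
        """Atomically move the pointer only if it still equals `expected`."""
        with self._lock:
            if self._pointer != expected:
                return False  # another writer committed first
            self._pointer = new
            return True

def commit(catalog, writer_name, max_retries=3):
    """Optimistic commit: read, prepare, CAS; on failure, rebase and retry."""
    for _ in range(max_retries):
        base = catalog.current()                  # read phase
        new = f"{base}+{writer_name}"             # prepare phase (new metadata)
        if catalog.compare_and_swap(base, new):   # commit phase
            return new
        # CAS failed: loop rereads the new base and rebases on top of it
    raise RuntimeError("commit failed after retries")

catalog = InMemoryCatalog("v3")
base_seen_by_b = catalog.current()      # Writer B reads v3
commit(catalog, "A")                    # Writer A commits first: pointer moves
assert not catalog.compare_and_swap(base_seen_by_b, "v3+B")  # B's first CAS fails
result = commit(catalog, "B")           # B rereads, rebases, and succeeds
assert result == "v3+A+B"
```

The retry-and-rebase step is where a real client would re-attach its new data files to the latest manifest list before trying the CAS again.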

This is fundamentally different from Delta Lake’s transaction log: Iceberg has no central log. Every snapshot is self-contained in its metadata file. This is why Iceberg scales to multi-region writes without a metadata bottleneck.

What Happens on Real Conflicts?

If Writer B’s changes conflict with Writer A’s (e.g., both DELETE the same row), the rebase cannot proceed. Iceberg’s client library detects this and throws a commit exception. The application decides whether to retry, abort, or escalate: conflicts fail loudly rather than being silently resolved with the last-write-wins semantics of older data lakes.

Isolation Level: Iceberg provides snapshot isolation. Each transaction reads from a snapshot (an immutable metadata version) and commits only if no conflicting changes have been committed since it started. This prevents dirty reads and lost updates.

Optimistic concurrency: Multiple writers prepare independently, then compete at commit time via atomic CAS


Part 3: Schema Evolution Without Data Rewrites

In Hive-style data lakes, renaming a column meant rewriting every parquet file—a multi-hour downtime operation. In Iceberg, it’s a metadata-only change: milliseconds, zero downtime, no broken pipelines.

The Column ID Mechanism

Iceberg assigns every column an immutable ID when the table is created:

Original Schema:
  Column ID 1 → name (string)
  Column ID 2 → email (string)
  Column ID 3 → created_at (timestamp)

When you rename column 3 from created_at to created_timestamp, the schema metadata updates:

New Schema:
  Column ID 1 → name (string)
  Column ID 2 → email (string)
  Column ID 3 → created_timestamp (timestamp)  [renamed, ID unchanged]

Existing data files still reference column ID 3. When a query engine reads those files, it looks up ID 3 in the new schema and sees created_timestamp—the rename is transparent.
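The ID-based resolution can be sketched as a dictionary projection, with column IDs as keys. project is an illustrative helper, not an Iceberg API:

```python
# Data files store values keyed (conceptually) by column ID; readers project
# the file's IDs through the table's *current* schema.
old_file_row = {1: "alice", 2: "a@x.com", 3: "2026-01-01T00:00:00"}

current_schema = {  # after a rename of column 3 and a later add of column 4
    1: "name",
    2: "email",
    3: "created_timestamp",  # renamed; ID unchanged
    4: "country",            # added later; absent from old files
}

def project(row_by_id, schema_by_id):
    """Resolve a stored row against the current schema by column ID."""
    return {name: row_by_id.get(col_id) for col_id, name in schema_by_id.items()}

row = project(old_file_row, current_schema)
assert row["created_timestamp"] == "2026-01-01T00:00:00"  # rename is transparent
assert row["country"] is None                             # new column reads NULL
```

Old files never change; only the name attached to each ID does.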

Supported Operations

  1. Add column: New column gets a new ID. Old data files don’t have values for this ID, so they read NULL (or a default).
  2. Drop column: Old files still have data for that ID, but it’s ignored by queries. Dropped columns can be reclaimed with a compaction pass.
  3. Rename column: ID unchanged. Metadata-only. Instant.
  4. Reorder columns: Schemas are ordered lists, but queries match columns by ID, not position. Reordering is metadata-only.
  5. Type promotion: Widen a column from int to long. Old files with int values are automatically upcast during reads. No rewrite.

All of these are metadata-only changes. No data files are rewritten. No downtime. This is why Iceberg is ideal for analytics: schema naturally evolves as business logic changes.

Nested Struct Evolution

Iceberg’s column ID system works in nested structures too. If you have a column address (a struct with fields street, city, zip), you can add, drop, or rename fields inside it without rewriting the containing column.

Original: address: struct(street: string, city: string)
Evolved:  address: struct(street: string, city: string, zip: string)

New writes populate zip. Old writes have NULL for zip. Queries handle both.

Schema evolution via column IDs: Add, drop, rename, reorder—all metadata-only, zero downtime


Part 4: Hidden Partitioning and Partition Evolution

Hive partitions are explicit: you write SELECT * FROM orders WHERE ds='2026-04-17' and the partition column ds appears in your schema. Queries must know the partition scheme.

Iceberg uses hidden partitioning: you query raw values (WHERE order_date = '2026-04-17'), and Iceberg automatically applies a transform (extract month, bucket by hash, truncate to day, etc.). The partition column doesn’t appear in the schema.

Why Hidden Partitioning Matters

  1. Query simplicity: No special partition-aware logic. Write WHERE created_at = '2026-04-17' and Iceberg handles partitioning.
  2. Partition evolution: Change your partition scheme without query rewrites.
  3. Pushdown efficiency: Iceberg knows which transforms apply to which column and prunes files automatically.

A partition spec might look like:

{
  "spec-id": 0,
  "fields": [
    {"source-id": 3, "field-id": 1000, "transform": "month", "name": "created_at_month"},
    {"source-id": 4, "field-id": 1001, "transform": "bucket[100]", "name": "user_id_bucket"}
  ]
}

New writes use this spec. Old writes used a different spec (e.g., day instead of month). Iceberg tracks multiple partition specs and applies the right filter at query time.
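The transforms themselves are small pure functions. Here is a sketch of two of them: the month transform follows the spec’s months-since-epoch definition, while the real bucket transform hashes with murmur3_x86_32, for which Python’s built-in hash is only a stand-in:

```python
from datetime import date

def month_transform(d: date) -> int:
    """Iceberg's month transform: months since the Unix epoch (1970-01)."""
    return (d.year - 1970) * 12 + (d.month - 1)

def bucket_transform(value, n: int) -> int:
    """Shape of Iceberg's bucket transform. The spec mandates murmur3_x86_32;
    Python's hash() is a non-portable stand-in for illustration only."""
    return hash(value) % n

# Hidden partitioning: the engine, not the user, applies the transform.
assert month_transform(date(1970, 1, 15)) == 0    # first epoch month
assert month_transform(date(2026, 4, 17)) == 675  # (2026-1970)*12 + 3
assert 0 <= bucket_transform(42, 100) < 100
```

Because a query predicate like WHERE created_at = '2026-04-17' maps deterministically through month_transform, the planner can compute which partition values to scan without the user ever naming a partition column.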

Partition Evolution in Action

-- Table created with daily partitions (Spark SQL with Iceberg extensions)
CREATE TABLE orders (
  order_id BIGINT,
  order_date TIMESTAMP,
  amount DECIMAL(10, 2)
) USING iceberg
PARTITIONED BY (day(order_date));

-- Three months later, you realize daily is too fine-grained
ALTER TABLE orders REPLACE PARTITION FIELD day(order_date) WITH month(order_date);

This is a metadata operation. No files are rewritten. New writes use the month transform. Old writes still have day partitions. Iceberg knows which files came from which spec and prunes accordingly.

This flexibility is enormous for production systems: you can tune partition granularity as query patterns change, without downtime or rewrites.


Part 5: REST Catalogs and Multi-Engine Governance

Iceberg’s REST API decouples catalog access from Java SDK dependencies. Any language (Python, Go, Rust, Node.js) and any engine (Spark, Trino, Flink, DuckDB) can read from the same Iceberg table by speaking HTTP to a REST catalog.

Apache Polaris: The Open-Source Standard (2026)

Apache Polaris graduated to a top-level Apache project in February 2026, marking a watershed moment for open data lakehouses. Polaris is a fully-featured REST catalog implementing the Iceberg REST API, with:

  1. HTTP-based metadata access: Engines don’t need Java. A Python script can read an Iceberg table via HTTP.
  2. Credential vending: Instead of distributing long-lived S3 keys, Polaris issues temporary STS credentials scoped to specific tables and operations. Your Spark job requests SELECT access to orders, and Polaris returns a credential valid for 1 hour reading only that table’s data files.
  3. RBAC (Role-Based Access Control): Define roles (analyst, engineer, data-owner) and bind them to namespaces and tables.
  4. Multi-tenant: Each workspace is isolated. Multiple teams can share infrastructure without seeing each other’s data.

REST API Example

# List tables in namespace
curl https://polaris.example.com/v1/namespaces/sales/tables \
  -H "Authorization: Bearer $TOKEN"

# Create table
curl -X POST https://polaris.example.com/v1/namespaces/sales/tables \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "name": "orders",
    "schema": {...},
    "partition-spec": {...}
  }'

# Get current metadata location
curl https://polaris.example.com/v1/namespaces/sales/tables/orders \
  -H "Authorization: Bearer $TOKEN"
# Returns: "metadata-location": "s3://bucket/tbl/metadata/metadata-v3.json"

Query engines (Spark, Trino, Flink) call these endpoints to discover tables, fetch metadata, and commit writes. All without importing a 500MB Java SDK.

Multi-Engine Access

Because the REST API is the only interface, multiple engines can share the same tables:

  • Apache Spark: Batch transformations. Ingests 1M orders, enriches them, upserts to customer_aggregate.
  • Trino: Interactive SQL. Analysts query SELECT * FROM customer_aggregate via a Trino JDBC driver.
  • Apache Flink: Streaming. Real-time order events write directly to Iceberg via the Flink Dynamic Iceberg Sink.
  • DuckDB: Local analytics. A data scientist pulls a snapshot via SELECT * FROM iceberg.catalog.namespace.table and works offline.

All read the same metadata, see the same snapshots, benefit from the same ACID guarantees.

Governance: Credential Vending

Here’s where Polaris shines. Instead of storing S3 credentials in each engine’s config:

spark.hadoop.fs.s3a.access.key = AKIAIOSFODNN7EXAMPLE  # Bad: hardcoded, long-lived
spark.hadoop.fs.s3a.secret.key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Polaris vends credentials on-demand:

Spark → Polaris REST: "I need to read table sales.orders"
Polaris → Polaris RBAC: Is Spark's identity (service account) allowed? Yes, analyst role.
Polaris → AWS STS: Issue a session token valid for 1 hour, scoped to sales/orders, read-only
Polaris → Spark: "Here's your token"
Spark → S3: [Uses token to read data files]

This is fundamentally more secure: credentials are ephemeral, scoped, and auditable. If a Spark cluster is compromised, the attacker gets a token valid for 1 hour reading one table, not a permanent key to the entire S3 bucket.

REST Catalog + Multi-Engine: All engines read the same table via HTTP, with RBAC + credential vending


Part 6: Time Travel and Snapshot Queries

Every write in Iceberg creates a new snapshot—a new metadata file with a new snapshot ID. Snapshots are immutable. You can query any snapshot from any point in time without maintaining separate backups.

Snapshot-Based Versioning

-- Current state
SELECT COUNT(*) FROM orders;  -- 1,000,000 rows

-- Snapshot from 3 days ago (by ID)
SELECT COUNT(*) FROM orders VERSION AS OF 42857;  -- 987,654 rows

-- Snapshot from a specific timestamp (Spark syntax; Trino uses FOR TIMESTAMP AS OF)
SELECT COUNT(*) FROM orders TIMESTAMP AS OF '2026-04-14 15:30:00';  -- 993,210 rows

-- What changed in the last snapshot?
SELECT * FROM orders EXCEPT
SELECT * FROM orders VERSION AS OF 42856;  -- Rows added in snapshot 42857

Query engines resolve the snapshot ID or timestamp to a specific metadata file, then scan only the manifest and data files belonging to that snapshot. Old data files are never deleted (unless you explicitly expire snapshots), so this query is storage-efficient: no duplication.
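Timestamp resolution is a lookup over the snapshot log: pick the latest snapshot committed at or before the requested time. A minimal sketch, where snapshot_as_of is an illustrative name:

```python
snapshots = [  # (snapshot-id, timestamp-ms) in commit order
    {"snapshot-id": 42856, "timestamp-ms": 1713100000000},
    {"snapshot-id": 42857, "timestamp-ms": 1713200000000},
]

def snapshot_as_of(snapshots, ts_ms):
    """Resolve TIMESTAMP AS OF: latest snapshot committed at or before ts_ms."""
    eligible = [s for s in snapshots if s["timestamp-ms"] <= ts_ms]
    if not eligible:
        raise ValueError("no snapshot exists at or before the requested time")
    return max(eligible, key=lambda s: s["timestamp-ms"])["snapshot-id"]

assert snapshot_as_of(snapshots, 1713150000000) == 42856  # between the two commits
assert snapshot_as_of(snapshots, 1713200000000) == 42857  # exactly at a commit
```

Once the snapshot ID is known, the query proceeds exactly like a normal read against that snapshot’s manifest list.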

Use Cases

  1. Data audits: “What data was in the table on March 15th?” Restore from a snapshot without backup infrastructure.
  2. Error recovery: Bad ETL logic corrupted data. Revert the table to the last good snapshot via CALL catalog.system.rollback_to_snapshot('orders', snapshot_id).
  3. Compliance: “Show me all changes to this table in the last 90 days.” Query multiple snapshots.
  4. Testing: Clone a table from a snapshot, run tests, roll back if they fail.

Snapshot Expiration

Snapshots accumulate over time. Iceberg provides expireSnapshots to prune old ones:

CALL catalog.system.expire_snapshots(
  table => 'orders',
  older_than => TIMESTAMP '2026-01-17 00:00:00',
  retain_last => 7
)

This deletes snapshots older than January 17, 2026, but keeps the last 7. Metadata files and unused manifest files are deleted. Data files are kept (unless you also run removeOrphanFiles).
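The retention rule combines both knobs: a snapshot survives if it is newer than the cutoff or among the retain_last most recent commits. A sketch of that logic (snapshots_to_expire is illustrative, not the procedure’s actual internals):

```python
def snapshots_to_expire(snapshots, older_than_ms, retain_last):
    """Mirror expire_snapshots semantics: a snapshot is removed only if it is
    older than the cutoff AND not among the retain_last most recent ones."""
    ordered = sorted(snapshots, key=lambda s: s["timestamp-ms"])
    protected = {s["snapshot-id"] for s in ordered[-retain_last:]}
    return [s["snapshot-id"] for s in ordered
            if s["timestamp-ms"] < older_than_ms
            and s["snapshot-id"] not in protected]

snaps = [{"snapshot-id": i, "timestamp-ms": i * 1000} for i in range(1, 11)]
# Cutoff would remove snapshots 1-8, but the last 7 (ids 4-10) are protected:
assert snapshots_to_expire(snaps, older_than_ms=9000, retain_last=7) == [1, 2, 3]
```

Note that retain_last wins over older_than: even very old snapshots survive if fewer than retain_last newer ones exist.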


Part 7: Operational Challenges and Solutions

Iceberg in production is not free. It requires maintenance—not as much as Hive, but more than a managed data warehouse. Here are the sharp edges.

The Small Files Problem

Streaming writes create many small files. A Kafka consumer batches 10,000 events every 10 seconds and writes a parquet file. After one hour, you have 360 files. After one day, 8,640 files. Within a month, you have 250,000+ files.

This creates three problems:

  1. Metadata bloat: 250,000 files means 250,000 entries across manifest files. Manifest files themselves grow to 100+ MB. Query planning becomes slow.
  2. File open overhead: Reading 250,000 files (even via S3 Select) costs 250,000 API calls. Cloud storage APIs charge per-call.
  3. Inefficient caching: Query engines cache file metadata and statistics. Large metadata is expensive to cache.

Compaction: The Solution

Iceberg’s rewriteDataFiles action rewrites small files into larger ones:

CALL catalog.system.rewrite_data_files(
  table => 'orders',
  min_file_size_bytes => 100000000,  -- 100 MB min
  max_file_size_bytes => 500000000   -- 500 MB max
)

This Spark job reads the undersized files and rewrites them into roughly 500 MB chunks. A table with 250,000 files of 100 KB each (about 25 GB of data) compacts down to roughly 50 files. Manifest sizes shrink. Query planning is fast.

Timing: Streaming tables should compact every 1-4 hours. Batch tables can compact once per day after load completes.
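Conceptually, compaction planning is a bin-packing problem: gather undersized files into rewrite tasks of roughly the target size. A greedy sketch, where plan_compaction is an illustrative helper (the real action also respects partitions, sort order, and delete files):

```python
def plan_compaction(file_sizes, target=500_000_000, min_size=100_000_000):
    """Greedy bin-packing in the spirit of rewrite_data_files: group files
    smaller than min_size into rewrite tasks of roughly target bytes each."""
    small = [s for s in file_sizes if s < min_size]
    groups, current, current_bytes = [], [], 0
    for size in small:
        current.append(size)
        current_bytes += size
        if current_bytes >= target:   # task is full: emit it
            groups.append(current)
            current, current_bytes = [], 0
    if current:                        # leftover files form a final task
        groups.append(current)
    return groups

# 250,000 files of 100 KB (~25 GB) pack into ~50 rewrite tasks of ~500 MB each
groups = plan_compaction([100_000] * 250_000)
assert len(groups) == 50
```

Each group becomes one rewrite task, and each task produces one (or a few) large output files.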

Manifest Rewriting

Even with compaction, manifest files accumulate. Every write creates new manifests. After 1,000 writes, you have 1,000 manifest files spread across dozens of snapshots.

rewriteManifests consolidates manifests:

CALL catalog.system.rewrite_manifests(table => 'orders')

This clusters data files into fewer manifests, reducing query planning time by 50%+.

Orphaned Files

Writes sometimes fail midway. A Spark job dies after writing data files but before committing metadata. These data files are orphaned—not referenced by any snapshot.

removeOrphanFiles finds and deletes them:

CALL catalog.system.remove_orphan_files(
  table => 'orders',
  older_than => TIMESTAMP '2026-01-01 00:00:00'
)

This scans all files in the table directory, compares them to files referenced by any snapshot, and deletes files older than January 1st not referenced by any snapshot. Savings can be 10-30% of table size.
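The core of orphan detection is a set difference with an age guard, so that files from in-flight writes are not deleted. A sketch with illustrative names:

```python
def find_orphans(all_files_in_storage, referenced_by_snapshots, cutoff_ms, mtimes):
    """Orphans: files present in the table directory, referenced by no
    snapshot, and older than the cutoff (sparing in-flight writes)."""
    referenced = set(referenced_by_snapshots)
    return sorted(f for f in all_files_in_storage
                  if f not in referenced and mtimes[f] < cutoff_ms)

storage = ["data/a.parquet", "data/b.parquet", "data/c.parquet"]
referenced = ["data/a.parquet"]   # only a.parquet appears in a snapshot
mtimes = {"data/a.parquet": 100, "data/b.parquet": 100, "data/c.parquet": 999}

# b.parquet is old and unreferenced (orphan); c.parquet is unreferenced but
# recent, so it is spared in case a write is still committing it.
assert find_orphans(storage, referenced, cutoff_ms=500, mtimes=mtimes) == ["data/b.parquet"]
```

This is why the older_than cutoff matters: set it well past your longest-running write job.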

The Maintenance Sequence

Run maintenance in this order:

-- 1. Compact small files
CALL catalog.system.rewrite_data_files(table => 'orders', ...);

-- 2. Expire old snapshots
CALL catalog.system.expire_snapshots(table => 'orders', older_than => ..., retain_last => 7);

-- 3. Remove orphaned data
CALL catalog.system.remove_orphan_files(table => 'orders', older_than => ...);

-- 4. Rewrite manifests
CALL catalog.system.rewrite_manifests(table => 'orders');

Running them out of order can leave orphans that removeOrphanFiles can’t see or cause issues with snapshot expiration.

Maintenance workflow: Compact → Expire → Remove orphans → Rewrite manifests


Part 8: The 2026 Streaming Lakehouse Pattern

In 2024-2025, the separation between streaming and batch was hard: Kafka fed real-time systems; S3 held batch. In 2026, Apache Iceberg unifies them.

The Modern Architecture

Layer 1: Ingestion — Kafka or Pulsar topics ingest raw events.

Layer 2: Stream Processing — Two parallel paths:

  1. Apache Flink: Stateful transformations, enrichments, complex joins. The Flink Dynamic Iceberg Sink commits micro-batches to Iceberg tables every 10-60 seconds.
  2. RisingWave: Streaming database maintaining continuous materialized views. A view like “customer_lifetime_value” is a table that updates in real-time as events arrive. The RisingWave Iceberg sink (new in 2026) publishes these views to Iceberg with automatic compaction.

Layer 3: Lakehouse — Apache Iceberg tables serve as the source of truth:
– Flink outputs to raw_orders, enriched_orders, customer_aggregate.
– RisingWave publishes top_products_by_region, anomalies_detected.
– Both engines write simultaneously, with ACID guarantees preventing conflicts.

Layer 4: Analytics — Any engine queries the same Iceberg tables:
– Spark for batch transformation and machine learning.
– Trino for interactive SQL dashboards.
– DuckDB for local data science notebooks.

Example: Real-Time Order Analytics

Kafka: orders_topic (order_id, user_id, amount, timestamp)
        ↓
Flink Job:
  1. Parse JSON
  2. Enrich with user profile (lookup table)
  3. Aggregate by region and hour
  4. Write to Iceberg: enriched_orders, region_hourly_summary
        ↓
RisingWave Job:
  1. Consume orders_topic
  2. Maintain view: top_products_by_region = SELECT region, product, SUM(amount) ... GROUP BY region, product
  3. Maintain view: order_anomalies = SELECT * WHERE amount > (SELECT percentile(amount, 0.99) ...)
  4. Publish views to Iceberg: top_products, anomalies
        ↓
Analytics:
  Spark: SELECT * FROM enriched_orders GROUP BY region, day
  Trino: SELECT * FROM top_products WHERE amount > 10000
  DuckDB: SELECT * FROM anomalies UNION SELECT * FROM anomalies VERSION AS OF 1 hour ago

Both write paths (the Flink job and the RisingWave sink) commit to Iceberg concurrently, with optimistic concurrency handling conflicts. All three analytics engines query the same tables.

Why This Works

  1. Unified metadata: One Iceberg catalog governs all tables. No desync between Flink and Spark versions of orders.
  2. ACID semantics: Both streaming and batch writers use optimistic concurrency. Conflicts are rare and handled gracefully.
  3. Governance: Polaris RBAC applies to all engines. Analysts can query Flink output via Trino without access to raw Kafka.
  4. Cost: No dual ingestion (Kafka → warehouse + Kafka → batch). Flink writes directly to Iceberg.

RisingWave introduced Iceberg sink with automatic compaction in Q1 2026. This solves the small-files problem for streaming: RisingWave batches materialized view updates and commits to Iceberg every 10 seconds, and the sink automatically compacts files to 500 MB chunks.

Flink’s Dynamic Iceberg Sink (released in November 2025) allows a single Flink job to write to multiple Iceberg tables dynamically, determined by data. A single job can partition events into raw_orders, enriched_orders, customer_aggregate—and all respect schema evolution.

2026 streaming lakehouse: Flink + RisingWave write to Iceberg; Spark, Trino, DuckDB query the same tables


Part 9: Choosing Iceberg: Trade-offs and Alternatives

Iceberg vs Delta Lake

Delta Lake (Databricks) is a competitor offering similar ACID semantics and metadata. Key differences:

Aspect               Iceberg                                   Delta Lake
Metadata             Immutable, per-snapshot                   Mutable transaction log
Multi-region writes  Native (CAS on pointer)                   Requires external coordination
Open source          Yes (Apache)                              Yes (Linux Foundation); Databricks proprietary features
Catalog options      Polaris, Unity, Nessie, Glue              Mostly Unity, some open options
Schema evolution     Column IDs, zero downtime                 Additive only, some limitations
Streaming maturity   Flink/RisingWave sinks (2026)             Less mature ecosystem
Query engines        Spark, Trino, Flink, DuckDB, ClickHouse   Mainly Spark; Trino support emerging

Iceberg’s immutable metadata and open catalog ecosystem make it the natural choice for multi-engine, multi-cloud environments. Delta Lake is a better choice if you’re Databricks-centric.

Iceberg vs Apache Hudi

Hudi (Uber) targets incremental processing: copy-on-write (COW) and merge-on-read (MOR) tables. Hudi excels at frequent updates (100K updates per second). Iceberg’s strength is ACID guarantees and time travel at any scale.

Hudi’s upsert semantics are lower-level; Iceberg’s are higher-level (SQL MERGE). For warehouse-style analytics, Iceberg is simpler.

When NOT to Use Iceberg

  1. Tiny datasets (<100 GB): Overhead of metadata management doesn’t justify the cost.
  2. Single-engine lock-in: If you’re Spark-only, Delta Lake or Hudi might be simpler.
  3. Ultra-high-volume upserts: If you have 1M updates/second on the same keys, Hudi MOR might be faster.
  4. Legacy Hive clusters: Cost of migration may exceed benefits.

Part 10: Advanced Topics and 2026 Frontier

Clustered Sort Orders and Z-Order

Iceberg’s sort order goes beyond partitioning. You can define:

{
  "order-id": 0,
  "fields": [
    {"source-id": 3, "direction": "asc", "null-placement": "nulls-last"},
    {"source-id": 4, "direction": "asc", "null-placement": "nulls-last"}
  ]
}

New data files are written sorted by (column 3, column 4). Because each file’s min/max statistics then cover narrow, largely non-overlapping ranges, queries filtering on these columns can skip most files and row groups. Spark’s Z-Order integration brings multi-dimensional clustering.

Incremental Materialized Views

Iceberg + streaming unlocks incremental MVs: instead of recomputing SELECT SUM(amount) FROM orders GROUP BY customer_id from scratch daily, compute only changes since the last snapshot and upsert to a summary table.

CREATE MATERIALIZED VIEW customer_summary AS
SELECT customer_id, SUM(amount) as total FROM orders GROUP BY customer_id;

-- When the orders table receives new writes (illustrative syntax;
-- incremental refresh support varies by engine)
REFRESH MATERIALIZED VIEW customer_summary INCREMENTAL;

Iceberg’s snapshots and manifest list make incremental reads trivial: read the old manifest list, read the new manifest list, diff them, and recompute only for changed partitions.
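The snapshot diff at the heart of incremental reads is a set difference over file lists. A minimal sketch (added_files is an illustrative name; handling deletes and rewritten files takes more care in practice):

```python
def added_files(old_snapshot_files, new_snapshot_files):
    """Incremental read: files in the new snapshot but not the old one.
    These are the only files an incremental MV refresh must process."""
    return sorted(set(new_snapshot_files) - set(old_snapshot_files))

v1 = ["f1.parquet", "f2.parquet"]
v2 = ["f1.parquet", "f2.parquet", "f3.parquet"]
assert added_files(v1, v2) == ["f3.parquet"]  # only the new file is re-read
```

Because both snapshots are immutable, this diff is stable and repeatable: refreshing twice from the same pair of snapshots always processes the same files.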

Iceberg on Diverse Storage

While Iceberg began on S3, 2026 sees maturity on:

  • Azure Data Lake Storage Gen2: RBAC via Entra ID.
  • Google Cloud Storage: Integrated with BigLake (Google’s lakehouse).
  • MinIO: On-premises S3-compatible object storage.
  • Hybrid cloud: Multi-cloud Iceberg tables, e.g., Polaris in Kubernetes pointing to S3 + GCS.

Edge Cases and Gotchas

  1. Metadata consistency: Iceberg assumes the catalog’s atomic CAS works correctly. AWS Glue CAS has rare race conditions; Polaris is more robust.
  2. Manifest file bloat: Frequent writes to small partitions accumulate manifests. Mitigate with rewriteManifests every day.
  3. Snapshots as backup: Expired snapshots are gone. Don’t rely on Iceberg for backup; use S3 versioning or external snapshots.
  4. Query engine bugs: Each engine implements Iceberg spec independently. A Trino bug might not affect Spark. Test cross-engine.

Conclusion: The Iceberg Ecosystem in 2026

Apache Iceberg is no longer niche. It is the table format for multi-engine, multi-cloud data lakehouses. The 2026 ecosystem is mature:

  • Catalogs: Polaris (open, just graduated to top-level Apache), Unity (Databricks), Nessie (open, Git-like), cloud-native options.
  • Engines: Spark, Trino, Flink (sinks), DuckDB, ClickHouse, Snowflake, Redshift.
  • Governance: Polaris RBAC + credential vending, Unity Catalog policy enforcement, data lineage in Collibra/Atlan.
  • Streaming: Flink Dynamic Iceberg Sink, RisingWave Iceberg sink with auto-compaction, Kafka Connect sinks.
  • Operations: Maintenance procedures are well-documented; compaction, snapshot expiration, orphan cleanup are standard.

The path forward is clear: if you’re building a data platform in 2026, start with Iceberg. Its metadata architecture is proven, its ecosystem is thriving, and its open-source foundation means no vendor lock-in.



Publication Date: April 17, 2026
