Iceberg vs Delta vs Hudi for Industrial Lakehouse: 2026 ADR
If you are designing an industrial lakehouse in mid-2026, you no longer have the luxury of “we will pick the table format later.” The choice among Iceberg, Delta, and Hudi is now the single most consequential architectural decision in your data platform, and the three formats have diverged enough that a deferred decision is itself a decision, usually the wrong one. AWS shipped S3 Tables as an Iceberg-native managed service. Databricks made Unity Catalog its commercial centre of gravity and pulled Delta Lake along with it. Snowflake’s Open Catalog (Polaris) graduated and now backs production Iceberg deployments at scale. Hudi quietly remains the deepest answer to streaming upserts. Meanwhile your historian is straining, your unified namespace is shoving Sparkplug B payloads at Kafka at rates the warehouse team did not sign up for, and someone on the AI team wants a feature store yesterday. This is an architecture decision record. We will treat it like one.
The format you pick determines who can read your data five years from now, how much pain a schema change costs, whether your streaming pipeline is one job or three, and whether your governance plane is a product feature or a hand-rolled mess. This ADR walks through the context, the decision drivers, the three options with their internals, the consequences of each, the industrial reference patterns that work in 2026, and a defensible recommendation with a decision tree you can use directly.
Context: Industrial Lakehouse in 2026
Answer-first summary: An industrial lakehouse in 2026 is the convergence layer where unified-namespace telemetry, historian data, MES events, and quality/maintenance records land in one open, queryable substrate — cheap object storage underneath, transactional table formats on top, multiple compute engines on the side. The choice of table format (Iceberg, Delta, Hudi) controls portability, governance, streaming behaviour, and total cost of ownership across the next five to ten years.
The industrial lakehouse is not a marketing rebrand of “Hadoop in the cloud.” It is a specific architectural answer to a specific operational pain: historians do not scale horizontally, time-series databases do not join cleanly to relational reference data, and warehouses balk at semi-structured payloads. By 2026, three things have hardened. First, S3, ADLS Gen2, and GCS are the durable storage substrate; nobody is building a serious greenfield system on HDFS. Second, Parquet plus an open table format on top of object storage has displaced both column-store warehouses and time-series-only stores for the bulk-historical analytical workload. Third, the catalog has emerged as the actual control plane — the table format alone is plumbing, but the catalog is where governance, multi-engine access, lineage, and sharing live.
For an industrial site or a process-industry enterprise, the table format choice has unusually long teeth. The data volumes are large (a single line of OPC UA tags running at 100 ms can produce billions of events per year), the retention horizons are long (regulatory and warranty requirements often demand seven to ten years), and the consumers are heterogeneous (control engineers, process engineers, data scientists, BI analysts, and increasingly LLM-grounded copilots). The wrong format choice does not break the system; it makes every interaction with it slightly more expensive, forever.
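To make the volume claim concrete, here is a back-of-envelope sketch. The 500-tag line is an assumed illustration, not a figure from any specific plant.

```python
# Back-of-envelope event volume for fixed-rate tag sampling.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (ignores leap years)


def events_per_year(tag_count: int, sample_interval_ms: int) -> int:
    """Events produced per year if every tag emits one sample per interval."""
    samples_per_second_per_tag = 1000 / sample_interval_ms
    return int(tag_count * samples_per_second_per_tag * SECONDS_PER_YEAR)


one_tag = events_per_year(tag_count=1, sample_interval_ms=100)
line = events_per_year(tag_count=500, sample_interval_ms=100)  # assumed tag count

print(f"{one_tag:,} events/year for one tag")
print(f"{line:,} events/year for a 500-tag line")
```

A single 100 ms tag already yields roughly 315 million events per year, so a handful of tags crosses the billion mark, and a seven-to-ten-year retention horizon multiplies whatever the annual figure is.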
Decision Drivers
Answer-first summary: Five drivers dominate this decision: the shape of your time-series ingest (append-only vs upsert-heavy), how often your schema evolves and how forgiving the format must be, the transactional semantics you need for concurrent writers and readers, the catalog and governance posture you can commit to, and the breadth of the engine ecosystem you need to support. Iceberg leads on portability and catalog openness, Delta on Spark integration and managed governance, Hudi on streaming upserts and record-level indexing.

Take the drivers one at a time.
Time-series shape. Most industrial telemetry is append-dominant — events arrive in time order, never to be modified. But “append-dominant” is not “append-only.” Late-arriving sensor data, edge buffer flushes after a network partition, and corrected values from the historian’s after-the-fact reconciliation all introduce out-of-order writes and, occasionally, in-place updates. If your dominant pattern is appends with the occasional snapshot-level rewrite, all three formats handle it. If your dominant pattern is genuine record-level upserts at high frequency — for example, a CDC stream from your MES that constantly revises work-order rows — Hudi’s record-level indexing and Merge-on-Read tables are in a class of their own. Iceberg and Delta both support UPDATE and MERGE, but neither was designed for upserts as a primary workload, and the performance gap shows up under load.
Schema evolution. Industrial schemas drift constantly. A new sensor goes in. A tag is renamed when a unit is reclassified. A measurement field promotes from integer to float when a higher-resolution transmitter is installed. The table format has to handle these without rewriting petabytes. Iceberg’s schema evolution is the most rigorous of the three: every field has a stable ID, so adds, drops, and renames are metadata-only, and partition specs can themselves evolve. One caveat: the Iceberg spec allows only a fixed set of widening type promotions (int to long, float to double, wider decimal precision), so an integer-to-float change is usually modelled as a new column rather than an in-place promotion. Delta supports column mapping (rename without rewrite) and added a number of evolution features in 2024–2025, but column identity there is an opt-in table feature rather than the foundation of the spec, as it is in Iceberg. Hudi supports schema-on-read with reconciliation rules; it works, but it is less ergonomic.
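The reason a rename can be metadata-only is easier to see in a toy model. The sketch below is not Iceberg code; the `Field` class, the column names, and the dictionary standing in for a data file are all hypothetical. It only illustrates the idea that stored values are resolved by a stable field ID, so the name is free to change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    field_id: int  # stable for the life of the column
    name: str      # free to change without touching data files
    type: str


# Stand-in for a row in an already-written data file: values keyed by field ID.
data_file_row = {1: "2026-05-01T00:00:00Z", 2: "FT-101", 3: 42}

schema_v1 = [
    Field(1, "event_ts", "timestamp"),
    Field(2, "tag", "string"),
    Field(3, "flow", "int"),
]

# v2 renames "flow" (same ID) and adds a new column (new ID). No data rewrite.
schema_v2 = [
    Field(1, "event_ts", "timestamp"),
    Field(2, "tag", "string"),
    Field(3, "flow_m3h", "int"),
    Field(4, "quality", "string"),
]


def read_row(row_by_id: dict, schema: list) -> dict:
    """Project a stored row through a schema; IDs absent from the file read as None."""
    return {f.name: row_by_id.get(f.field_id) for f in schema}


print(read_row(data_file_row, schema_v1))
print(read_row(data_file_row, schema_v2))
```

Old files remain readable under the new schema: the renamed column still resolves to the stored value, and the added column reads as null for rows written before it existed.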
Transactional semantics. Multiple writers, multiple readers, and atomic visibility are non-negotiable in a serious lakehouse. Iceberg uses snapshot isolation with optimistic concurrency control at the catalog level — a write commits a new metadata file pointer, and conflicting writes are detected at commit time. Delta uses an ordered transaction log (_delta_log) with checkpoints; commits are serializable. Hudi uses a timeline-based MVCC approach with explicit instants. All three give you ACID; they differ in how concurrent writers behave under contention and how cleanly you can roll back. For industrial pipelines with a single dominant writer (usually Spark or Flink) and many readers, the difference is small. For multi-writer scenarios — for example, a streaming ingest and a batch backfill operating on the same table — the operational ergonomics diverge.
Catalog and governance. This is the axis that has moved the most between 2024 and 2026. The Iceberg REST catalog specification is now the de facto open standard, implemented by Snowflake’s Polaris (donated to Apache), the open-source Unity Catalog, Lakekeeper, Nessie, and AWS Glue. Delta’s first-class governance home is Unity Catalog (the Databricks-anchored version), with Delta Sharing as the cross-org access protocol. Hudi’s governance story is the thinnest of the three — typically Hive Metastore plus AWS Glue. If your organization has any sovereignty, regulatory, or multi-cloud requirement, the catalog openness gap is the single biggest reason Iceberg has pulled ahead.
Ecosystem reach. Iceberg has the broadest engine support: Spark, Flink, Trino, Presto, Dremio, DuckDB, Snowflake, BigQuery (read), Daft, and most of the AI training frameworks. Delta is Spark-first and Databricks-anchored, with Delta UniForm as the compatibility bridge that lets non-Databricks engines read it as Iceberg. Hudi is strong on Spark and Flink, with growing Trino and Presto support but lighter coverage elsewhere. If you can commit to Spark/Databricks for compute, Delta is excellent. If you need any other engine to be a first-class citizen — and most industrial shops do, because Trino is often the BI substrate and DuckDB is the new edge analytics darling — Iceberg wins on portability.
These five axes are not equally weighted. In our experience auditing 2026 industrial deployments, catalog openness and ecosystem reach are the drivers that age the worst when you ignore them. Time-series shape and transactional semantics matter, but they are fixable later. Catalog choice is the one you cannot easily reverse.
Options Compared: Iceberg, Delta, Hudi
Answer-first summary: Iceberg, Delta, and Hudi all give you ACID transactions, schema evolution, and time-travel on top of object-stored Parquet. They differ structurally in how they record writes (manifest tree vs ordered log vs timeline), how they handle compaction and clustering, and how their catalog plane is governed. The three are not interchangeable below the surface — picking one is picking a write path, a compaction model, and a governance posture.

Apache Iceberg
Iceberg’s central data structure is a tree of metadata files. A write produces new data files (Parquet), a new manifest file listing those data files with column-level statistics, a new manifest list pointing to all current manifests, and a new metadata JSON containing the current snapshot, schema, and partition spec. The catalog stores a single pointer — the location of the current metadata JSON. A commit, at its core, is an atomic update of that pointer.
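The pointer-swap commit can be sketched as a compare-and-swap. This is a toy in-memory model under stated assumptions: `ToyCatalog` and the metadata file names are hypothetical, and a real catalog performs the swap in a database or service rather than behind a thread lock.

```python
import threading


class ToyCatalog:
    """Holds one pointer for a table: the location of its current metadata file."""

    def __init__(self, initial: str):
        self._pointer = initial
        self._lock = threading.Lock()

    def current(self) -> str:
        return self._pointer

    def compare_and_swap(self, expected: str, new: str) -> bool:
        """Move the pointer only if nobody else committed since `expected` was read."""
        with self._lock:
            if self._pointer != expected:
                return False
            self._pointer = new
            return True


catalog = ToyCatalog("metadata/v1.json")

# Two writers both plan their commit on top of v1.
base_a = catalog.current()
base_b = catalog.current()

ok_a = catalog.compare_and_swap(base_a, "metadata/v2-writer-a.json")
ok_b = catalog.compare_and_swap(base_b, "metadata/v2-writer-b.json")  # stale base

# The losing writer refreshes, re-validates its changes, and retries.
if not ok_b:
    base_b = catalog.current()
    ok_b_retry = catalog.compare_and_swap(base_b, "metadata/v3-writer-b.json")

print(ok_a, ok_b, catalog.current())
```

The data and manifest files written by the losing writer are not wasted; in the retry it writes new metadata that layers its changes on top of the winner's snapshot, provided the two changes do not conflict.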
That design has several consequences. Reads scale to large tables because the manifest tree carries partition values and per-column statistics (value bounds, null counts) for every data file, so engines can prune whole manifests and individual files during planning, before any data file is opened. Snapshots are first-class: every commit creates one, time-travel is SELECT ... FOR VERSION AS OF ... (the exact syntax varies slightly by engine), and rollback is metadata-only. Partition evolution is supported because the partition spec is itself versioned, which is impossible in older systems where the partition layout is baked into the directory structure. Hidden partitioning lets you specify transforms like bucket(16, sensor_id) or hour(event_ts) without forcing query authors to write boilerplate filters.
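A minimal sketch of the two ideas in that paragraph, assuming a drastically simplified manifest: each entry keeps only a path and timestamp bounds. The `DataFileEntry` class and file names are hypothetical. The `hour_partition` helper follows the Iceberg definition of the hour transform (whole hours since the Unix epoch).

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class DataFileEntry:
    path: str
    min_ts: datetime  # lower bound of event_ts in this file
    max_ts: datetime  # upper bound of event_ts in this file


def ts(text: str) -> datetime:
    """Parse an ISO timestamp and treat it as UTC."""
    return datetime.fromisoformat(text).replace(tzinfo=timezone.utc)


def hour_partition(t: datetime) -> int:
    """Hour transform: whole hours since 1970-01-01T00:00:00Z."""
    return int(t.timestamp()) // 3600


manifest = [
    DataFileEntry("f1.parquet", ts("2026-05-01T00:00:00"), ts("2026-05-01T00:59:59")),
    DataFileEntry("f2.parquet", ts("2026-05-01T01:00:00"), ts("2026-05-01T01:59:59")),
    DataFileEntry("f3.parquet", ts("2026-05-01T02:00:00"), ts("2026-05-01T02:59:59")),
]


def plan_scan(entries, start: datetime, end: datetime) -> list:
    """Keep only files whose [min_ts, max_ts] range overlaps [start, end)."""
    return [e.path for e in entries if e.max_ts >= start and e.min_ts < end]


# The query author filters on event_ts only; no partition column is mentioned.
files = plan_scan(manifest, ts("2026-05-01T01:15:00"), ts("2026-05-01T01:45:00"))
print(files)
```

Only the file whose bounds overlap the filter survives planning, which is the whole point: the decision is made from metadata, and the other Parquet files are never opened.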
Compaction in Iceberg is explicit. You run rewrite_data_files to compact small files into larger ones, rewrite_manifests to keep the manifest tree healthy, and expire_snapshots to garbage-collect old snapshots and their files. Iceberg gives you the procedures; you (or your managed catalog provider) decide when to run them. AWS S3 Tables and Snowflake Open Catalog will run them for you on a schedule.
The catalog is where Iceberg has surged in 2026. The REST Catalog specification has been adopted by Snowflake’s Polaris (now an Apache project), the open-source Unity Catalog (Databricks contributed an OSS version), Lakekeeper, Nessie, and AWS Glue. That means a single Iceberg table can be governed by, say, Polaris in dev and Glue in prod, and any compliant engine can read it through either catalog with the same client code. This level of catalog portability does not exist for the other two formats.
Delta Lake
Delta’s central data structure is an ordered transaction log of JSON files in a _delta_log/ directory, with periodic Parquet checkpoints to keep replay fast. A write produces new data files and a new JSON commit recording the added/removed file list, schema changes, and operation metadata. Readers replay the log (starting from the most recent checkpoint) to compute the current set of files.
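Log replay is simple enough to model in a few lines. The sketch below uses a heavily simplified action format (real Delta actions carry sizes, statistics, partition values, and more) and an in-memory dictionary instead of files under _delta_log/; the file names and the `live_files` helper are hypothetical.

```python
import json

# Toy log: version number -> JSON action lines for that commit (simplified).
log = {
    0: ['{"add": {"path": "a.parquet"}}', '{"add": {"path": "b.parquet"}}'],
    1: ['{"add": {"path": "c.parquet"}}'],
    2: ['{"remove": {"path": "a.parquet"}}', '{"add": {"path": "d.parquet"}}'],
}

# Toy checkpoint: the full live-file set as of version 1.
checkpoint = {"version": 1, "files": {"a.parquet", "b.parquet", "c.parquet"}}


def live_files(log, checkpoint=None, as_of=None):
    """Compute the set of live data files at a version by replaying the log."""
    target = max(log) if as_of is None else as_of
    if checkpoint is not None and checkpoint["version"] <= target:
        # Start from the checkpoint and replay only the commits after it.
        files, start = set(checkpoint["files"]), checkpoint["version"] + 1
    else:
        files, start = set(), 0
    for version in range(start, target + 1):
        for line in log[version]:
            action = json.loads(line)
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files


print(sorted(live_files(log, checkpoint)))   # current table state
print(sorted(live_files(log, as_of=0)))      # time-travel to version 0
```

The checkpoint changes only how much of the log has to be replayed, not the answer, and time-travel falls out of the same loop by stopping at an earlier version.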
This design has its own virtues. Transaction semantics are clean — serializable isolation falls out of the ordered log. Time-travel is VERSION AS OF n or TIMESTAMP AS OF t. The MERGE operation is excellent, with deep optimization in the Databricks runtime; for SCD2-style updates and upsert workloads, Delta on Databricks is genuinely fast. Liquid clustering (Databricks 2024+, with Delta 3.x bringing it to open-source) removes the need to choose partition columns up front — you declare clustering keys, and Delta keeps data laid out for those keys without forcing partition decisions.
The trade-off is gravity. Delta is open-source and its spec is public, but the centre of mass is the Databricks runtime. Features tend to land there first (liquid clustering, Photon-optimized writes, Auto Loader, deletion vectors), and non-Databricks engines tend to lag. Delta UniForm — the feature that exposes a Delta table’s data with an Iceberg metadata view — is a real bridge for read interoperability, but it is a bridge, not parity. If you are happy on Databricks (with Unity Catalog for governance and Delta Sharing for cross-org access), Delta is excellent. If you need to make Trino, Flink, DuckDB, and Snowflake all first-class writers, Iceberg is the safer bet.
For Unity Catalog specifically, the trajectory in 2026 is interesting. Databricks open-sourced the Unity Catalog API and reference implementation in 2024, and the OSS Unity Catalog now supports Iceberg tables alongside Delta. So even Delta’s flagship governance plane is moving toward format-pluralism. That blurs the historical “Delta means Databricks lock-in” critique, but only in the catalog direction — the runtime gravity remains.
Apache Hudi
Hudi’s central data structures are the timeline (an ordered log of instants under .hoodie/) and the file groups (logical groupings of base Parquet files and delta log files for the same record set). Each instant records an action — commit, deltacommit, compaction, clean — with state transitions (requested, inflight, completed). Hudi supports two table types: Copy-on-Write (CoW), where every write produces new base Parquet files, and Merge-on-Read (MoR), where writes append to row-based log files that are periodically merged into base files.
This is the format that takes upserts seriously. Hudi maintains an index that maps record keys to file groups; the options include Bloom, simple, HBase-backed, and the newer Record-Level Index stored in Hudi’s metadata table. When a record arrives, Hudi looks it up, knows which file group owns it, and writes either a base-file update (CoW) or a log-file entry (MoR). That is fundamentally cheaper than the join-based approach the other two use for upserts, where MERGE first has to scan candidate files to find the matching keys. For CDC pipelines and streaming ingest where the same primary key arrives many times across hours, Hudi’s record-level indexing can be the difference between hours and minutes.
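The routing step can be illustrated with a toy key-to-file-group map. This is a conceptual sketch, not Hudi's implementation: the work-order keys, file group names, and the `upsert` helper are hypothetical, and real file groups are Parquet and log files rather than dictionaries.

```python
# Toy index: record key -> file group that owns the record.
record_index = {"WO-1001": "fg-0", "WO-1002": "fg-0", "WO-2001": "fg-1"}

# Toy file groups: file group ID -> {record key: row}.
file_groups = {
    "fg-0": {"WO-1001": {"status": "OPEN"}, "WO-1002": {"status": "OPEN"}},
    "fg-1": {"WO-2001": {"status": "OPEN"}},
}


def upsert(batch: dict, index: dict, groups: dict, new_group: str = "fg-new") -> set:
    """Route each record to its owning file group; return the groups touched."""
    touched = set()
    for key, row in batch.items():
        group_id = index.get(key)
        if group_id is None:
            # Unknown key: treat as an insert and register it in the index.
            group_id = new_group
            index[key] = group_id
            groups.setdefault(group_id, {})
        groups[group_id][key] = row
        touched.add(group_id)
    return touched


touched = upsert(
    {"WO-1002": {"status": "DONE"}, "WO-3001": {"status": "NEW"}},
    record_index,
    file_groups,
)
print(sorted(touched))
```

The update never looks at fg-1 at all. With an index, write cost tracks the number of file groups the batch touches rather than the size of the table.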
Hudi also has the strongest built-in support for incremental queries — a reader can ask “give me all changes since instant X” and get only the affected records, which is gold for downstream change-propagation. Flink integration is first-class. The MoR table type, with asynchronous compaction, delivers low write amplification and near-real-time read latency in the same table.
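An incremental query is, conceptually, a filter over the timeline. The sketch below assumes instants are sortable timestamp-like strings and that each completed commit lists the records it changed; the `timeline` structure and `changes_since` helper are hypothetical simplifications, not Hudi's reader API.

```python
# Toy timeline: completed commits in instant order, each with the records it changed.
timeline = [
    ("20260501080000", {"WO-1001": {"status": "OPEN"}}),
    ("20260501090000", {"WO-1002": {"status": "OPEN"}}),
    ("20260501100000", {"WO-1001": {"status": "DONE"}}),
]


def changes_since(timeline, begin_instant: str) -> dict:
    """Latest version of every record changed strictly after begin_instant."""
    changed = {}
    for instant, records in timeline:
        if instant > begin_instant:
            changed.update(records)  # later commits overwrite earlier ones
    return changed


delta = changes_since(timeline, "20260501080000")
print(delta)
```

A downstream job that remembers the last instant it processed can poll with that value and receive only the changed records, instead of diffing full snapshots.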
The trade-offs are real. Hudi’s community is smaller than Iceberg’s or Delta’s. Query-engine support outside Spark and Flink is thinner — Trino support exists but lags Iceberg’s. Operational complexity is higher: you must reason about timeline cleaning, compaction scheduling, and index choice. The catalog ecosystem is centred on Hive Metastore and AWS Glue; there is no equivalent of the Iceberg REST Catalog spec.
For a pure streaming-upsert workload, Hudi is the technically best answer. For a mixed workload where streaming upserts are one of several patterns, the operational overhead pushes most 2026 teams toward Iceberg with Flink CDC patterns, or toward a hybrid (Hudi for ingest, Iceberg for the curated layer).
Consequences
Answer-first summary: Each format locks you into a different set of long-term properties. Iceberg locks you into a portable, catalog-centric architecture with explicit compaction operations and the responsibility to pick a catalog implementation. Delta locks you into Spark-shaped compute and (in practice) deeper integration with Databricks tooling in exchange for the smoothest single-engine experience. Hudi locks you into a richer operational surface for streaming upserts and a thinner ecosystem beyond Spark and Flink.

The consequences are not symmetric. Each format optimizes for a different priority, and the long-term cost of fighting that priority is high.
If you choose Iceberg, you gain genuine engine portability — Spark, Flink, Trino, Snowflake, Daft, DuckDB, BigQuery all read your tables, and managed offerings like AWS S3 Tables and Snowflake Open Catalog handle maintenance for you. You commit to making a catalog decision (Polaris vs Nessie vs Glue vs Unity OSS) that the format alone does not dictate, and you accept that high-frequency record-level upserts will need careful Flink CDC patterns rather than coming for free.
If you choose Delta, you gain the smoothest Spark experience available — MERGE is fast, liquid clustering removes partition guesswork, Unity Catalog is a coherent governance product, Delta Sharing solves cross-organization data exchange cleanly. You accept gravity toward Databricks runtime, and you accept that non-Spark engines will lag features. Delta UniForm narrows but does not close the interoperability gap.
If you choose Hudi, you gain best-in-class record-level upserts, native incremental queries, low-latency Merge-on-Read tables, and a mature streaming story. You accept higher operational complexity — timeline management, compaction tuning, index choice — and a thinner non-Spark/Flink engine ecosystem. The catalog story remains Hive-Metastore-anchored.
A pattern worth naming: in 2026, the consequence that matters most for industrial customers is catalog openness. Industrial data has a habit of outliving the platform team that built it. Catalog portability — the ability to swap your governance plane without re-platforming your data — is the architectural property that ages best. That is the single biggest reason Iceberg has become the default recommendation for new industrial lakehouse deployments unless there is a specific reason to choose otherwise.
There is also a “stuck in the middle” consequence to flag. Teams that try to hedge — using Delta for some pipelines and Iceberg for others, or using UniForm as a permanent fence-sitting solution — usually end up with the disadvantages of both. The pipelines that write Delta produce subtly different optimization patterns than the ones that write Iceberg, the governance planes diverge, and the operational runbook doubles. If you genuinely need multi-format support (for example, because you are migrating), make it a time-bounded migration program, not a steady state.
Industrial Patterns: UNS Sink, Historian Migration, ML Feature Store
Answer-first summary: Three industrial use cases dominate 2026 lakehouse deployments: a sink for the unified namespace (UNS → Kafka → table format), a migration target for legacy historian data (PI, AspenTech IP.21, Wonderware), and a feature store for plant-level ML models. All three are well served by Iceberg-on-REST-catalog, with Flink as the streaming writer and Trino, Spark, and Daft as the readers. The decision is rarely about the format itself; it is about how disciplined your bronze/silver/gold layering and your catalog governance are.

The reference architecture in the diagram above is the one we see most frequently in successful 2026 deployments. Start at the left edge.
Plant edge. PLCs and DCS controllers expose tags via OPC UA, and Sparkplug B edge nodes publish to an MQTT broker that anchors the unified namespace architecture. The UNS gives you a single, hierarchically organized stream of plant state that follows ISA-95 or ISA-88 conventions. For the field-level integration of newer cells, OPC UA FX is increasingly the substrate beneath the UNS.
Streaming plane. Kafka (or Redpanda) ingests the UNS topics, and Flink jobs do the window-aggregate-enrich work — deduplication of edge retransmissions, joining tag IDs to asset metadata, computing simple aggregates like one-minute means. Flink writes directly into Iceberg tables; the Flink-Iceberg connector is mature in 2026 and handles exactly-once semantics through Iceberg’s snapshot-commit model.
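The logic of those Flink jobs can be sketched in plain Python, which is useful for unit-testing the rules before they are ported to a streaming job. This is not Flink code. The event tuple layout, and in particular the per-tag sequence number used as the deduplication key, is an assumption for illustration.

```python
from collections import defaultdict

# (tag, epoch_seconds, sequence_number, value); the sequence field is hypothetical.
events = [
    ("FT-101", 60, 1, 10.0),
    ("FT-101", 75, 2, 14.0),
    ("FT-101", 75, 2, 14.0),   # edge retransmission of the same sample
    ("FT-101", 130, 3, 20.0),
]


def dedupe(events):
    """Drop repeats of the same (tag, sequence_number), keeping the first seen."""
    seen, out = set(), []
    for tag, ts, seq, value in events:
        if (tag, seq) not in seen:
            seen.add((tag, seq))
            out.append((tag, ts, seq, value))
    return out


def one_minute_means(events):
    """Mean value per (tag, start of the one-minute tumbling window)."""
    buckets = defaultdict(list)
    for tag, ts, _seq, value in events:
        buckets[(tag, ts - ts % 60)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


means = one_minute_means(dedupe(events))
print(means)
```

Order matters: aggregating before deduplicating would count the retransmitted sample twice and skew the first window's mean.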
Bronze, silver, gold. The bronze layer holds raw events, partitioned by hour (or by hour(event_ts) if you use Iceberg’s hidden partitioning), with minimal schema enforcement. Silver tables are cleaned, joined to dimensions (asset tag → unit → site), and deduplicated. Gold tables are aggregated KPIs — OEE, energy intensity, yield per batch — at whatever cadence the business consumers need.
Catalog and governance. A single Iceberg REST catalog instance (Polaris, Lakekeeper, or open-source Unity Catalog) registers every table and enforces RBAC at the namespace level. Lineage is usually emitted by the engines rather than by the catalog itself (Spark and Flink both have OpenLineage integrations) and is then tied back to the table names the catalog registers. For sensitive customer data, the catalog enforces tenant isolation; for SCADA-derived data, it enforces site-level access boundaries.
Query and ML plane. Trino or Starburst is the BI substrate — fast SQL, broad connector ecosystem, well-understood operations. Spark or Databricks handles heavy batch transforms (the silver-to-gold and feature-engineering passes). DuckDB and Daft are increasingly the choice for edge BI, where an analyst on a workstation wants to slice a hundred-million-row Iceberg table without round-tripping to a cluster. Ray or SageMaker pulls features for model training. Power BI or Superset closes the loop with dashboards.
A specific pattern worth highlighting: the ML feature store on Iceberg. Feast and Tecton both support Iceberg as a backing store, which means a feature defined for a manufacturing AI model lives in the same table format as the analytical data — same schema, same lineage, same governance. This collapses what used to be a separate data plane (feature store) into the lakehouse, which is a significant simplification when you are running a dozen plant-level models that need consistent features.
For observability of this pipeline — UNS ingestion lag, Flink job health, Iceberg commit conflicts — the modern pattern is eBPF-based observability at the cluster level paired with the catalog’s own audit logs at the data-plane level. That combination gives you operational visibility into both the compute and the data simultaneously.
The historian migration pattern is a variant of the same architecture. Instead of (or in addition to) the live UNS, a batch job reads from the legacy historian (PI, IP.21, Wonderware) and writes into the bronze layer. The migration is rarely a one-shot — it is a parallel-running period where the lakehouse catches up with years of history while live ingestion runs in parallel. Iceberg’s snapshot model is especially helpful here because the migration job can write to a side branch, validate, and atomically promote.
Recommendation and Decision
Answer-first summary: For a new industrial lakehouse in 2026, the default recommendation is Apache Iceberg with a REST Catalog (Polaris, Unity OSS, or Lakekeeper) and Flink for streaming ingest. Choose Delta Lake when the entire compute stack is Databricks and Unity Catalog is already your governance home. Choose Hudi when streaming upserts dominate and you have the operational depth to run it. For mixed workloads, consider a hybrid where Hudi is the ingest layer and Iceberg is the curated layer, but make this a deliberate two-format design, not an accident.

The decision tree above codifies the recommendation. Walk through it.
If you need multi-engine reads across Spark, Trino, Flink, and DuckDB — which is the modal industrial requirement — you are in Iceberg territory unless something else dominates. If sovereignty and open-catalog posture are required (a real concern for European industrial deployments under EU data governance, and for any deployment with a sovereign-cloud requirement), Iceberg with an open REST catalog is essentially the only answer. If a managed cloud table layer is preferred, AWS S3 Tables, Snowflake Open Catalog, and the various cloud-managed Polaris offerings all give you Iceberg with operational toil handled.
If your entire compute stack is on Databricks and Unity Catalog (the proprietary version) is already your governance home, Delta Lake with liquid clustering is the right answer. You will get the fastest Spark performance, the cleanest MERGE story, and an integrated governance product. The “lock-in” framing is overstated when this is genuinely your environment; you are buying a coherent product. The trap is choosing Delta because you think it is the path of least resistance and then discovering eighteen months later that the ML team needs Trino and the European subsidiary needs an open catalog.
If streaming CDC with frequent record-level upserts is your dominant pattern — really dominant, not just present — Hudi with MoR tables, Flink writers, and the record-level index is technically the strongest. Many 2026 teams choose a hybrid pattern instead: Hudi for the streaming ingest layer where its upsert performance matters, then Spark or Flink jobs that convert (or replicate) into Iceberg for the downstream curated layer that the rest of the organization queries. This hybrid pattern is more operational work but combines the best of both formats for genuine streaming-upsert-dominant workloads.
The ADR’s decision: for the modal 2026 industrial lakehouse, choose Iceberg. Choose Delta only when the entire stack is Databricks-anchored and you have decided Unity Catalog is your governance plane. Choose Hudi only when streaming upserts are the dominant workload and you have the operational capacity to manage its complexity. Hybrid Hudi-into-Iceberg is a legitimate pattern for the streaming-heavy edge of industrial workloads, but treat it as a two-format design with clear ownership, not an accidental drift.
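The decision can also be written down as a small function, which makes it reviewable in a pull request. The field names are hypothetical, and the order in which the checks are applied when several conditions hold at once is an assumption of this sketch rather than something the ADR prescribes.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    all_compute_on_databricks: bool
    unity_catalog_is_governance_home: bool
    upserts_dominate: bool               # really dominant, not just present
    has_ops_depth_for_hudi: bool
    needs_multi_engine_or_open_catalog: bool


def recommend(r: Requirements) -> str:
    """Encode the ADR's recommendation; Iceberg is the default outcome."""
    if r.upserts_dominate and r.has_ops_depth_for_hudi:
        if r.needs_multi_engine_or_open_catalog:
            return "hybrid: Hudi ingest, Iceberg curated"
        return "Hudi"
    if (
        r.all_compute_on_databricks
        and r.unity_catalog_is_governance_home
        and not r.needs_multi_engine_or_open_catalog
    ):
        return "Delta"
    return "Iceberg"


modal_plant = Requirements(False, False, False, False, True)
databricks_shop = Requirements(True, True, False, False, False)
cdc_heavy = Requirements(False, False, True, True, True)

for case in (modal_plant, databricks_shop, cdc_heavy):
    print(recommend(case))
```

Note that a team with dominant upserts but no operational depth for Hudi falls through to Iceberg, which matches the guidance to run Flink CDC patterns on Iceberg rather than take on a format you cannot operate.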
Trade-offs, Gotchas, and What Goes Wrong
Answer-first summary: The failure modes are not about the format itself — they are about catalog drift, small-file storms, partition-spec mistakes, compaction lag, and the slow rot of governance discipline. Every format has the same failure surface; the prevention pattern is operational, not architectural.
The most common 2026 failure mode is catalog sprawl. A team starts with one catalog, then someone spins up another for a project, then the BI team has their own. Soon you have three catalogs holding overlapping pointers to the same tables, RBAC is inconsistent, and a renamed column in one catalog is invisible in another. The mitigation is policy: one production catalog, one staging catalog, period. If you need multi-environment access, federate at the engine level, not by duplicating catalogs.
Small-file storms kill more lakehouse pipelines than any other technical issue. Streaming writes from Flink or Spark Structured Streaming produce many small files; without active compaction, planning time and scan cost climb with the file count, because every query has to track, open, and read footers for far more files than the data volume warrants. All three formats have compaction procedures; the failure is not running them on a schedule. Iceberg’s rewrite_data_files, Delta’s OPTIMIZE, and Hudi’s compaction and clustering jobs all need to be cron-driven or managed-service-handled. Managed offerings (S3 Tables, Snowflake Open Catalog, Databricks Auto-Optimize) handle this for you, which is a real reason to consider them.
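What a compaction pass decides can be sketched as a bin-packing plan. This is a simplified illustration of the idea, not the algorithm any of the three formats actually uses; the 512 MB target, the 128 MB small-file threshold, and the file sizes are assumed parameters.

```python
def plan_compaction(file_sizes_mb, target_mb=512, small_threshold_mb=128):
    """Greedily group small files into rewrite tasks of at most target_mb each."""
    small = sorted(size for size in file_sizes_mb if size < small_threshold_mb)
    groups, current, current_size = [], [], 0
    for size in small:
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if len(current) > 1:  # rewriting a lone file gains nothing
        groups.append(current)
    return groups


# One hour of streaming output: 100 files of 8 MB, plus two healthy large files.
file_sizes_mb = [8] * 100 + [600, 700]
groups = plan_compaction(file_sizes_mb)

print([len(g) for g in groups])
print(f"{len(file_sizes_mb)} files before, {len(groups) + 2} after")
```

In this example 102 files collapse to 4, and the two large files are left alone. The same plan run a day late has 24 times as many inputs, which is why the schedule matters more than the algorithm.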
Partition-spec mistakes were the leading bug source in 2023–2024 Iceberg deployments. Choosing day(event_ts) for a high-frequency telemetry table produces partitions that are too few and too large to prune usefully; choosing event_ts directly (an identity partition on the raw timestamp) produces a near-unbounded number of tiny partitions. The 2026 best practice is hour(event_ts) for high-frequency telemetry plus a bucket(N, asset_id) for the asset dimension, with N sized to the read concurrency. Iceberg’s partition evolution lets you fix this later, which is a feature the other formats lack at the same level of rigor.
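A quick way to sanity-check a candidate spec is to estimate how many partitions it creates per day and how much data lands in each. The 240 GB/day ingest figure and the bucket count of 16 below are assumed values for illustration only.

```python
def partitions_per_day(time_grain: str, buckets: int = 1) -> int:
    """Partitions created per day for a time transform combined with N buckets."""
    per_day = {"day": 1, "hour": 24}[time_grain]
    return per_day * buckets


def avg_partition_gb(daily_gb: float, time_grain: str, buckets: int = 1) -> float:
    """Average data volume per partition, assuming an even spread."""
    return daily_gb / partitions_per_day(time_grain, buckets)


DAILY_GB = 240  # assumed ingest volume

for grain, buckets in (("day", 1), ("hour", 1), ("hour", 16)):
    count = partitions_per_day(grain, buckets)
    size = avg_partition_gb(DAILY_GB, grain, buckets)
    print(f"{grain:>4} x bucket({buckets:>2}): {count:>3} partitions/day, {size:.3f} GB each")
```

Under these assumptions a day-level spec leaves a one-hour query scanning a 240 GB partition, while hour plus 16 buckets brings each partition well under 1 GB. If the per-partition size drops far below your target file size, the spec is too fine and will feed the small-file problem instead.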
Compaction lag is the slow killer. Compaction jobs that run late, fail silently, or are scheduled less frequently than ingest produce an ever-growing tail of small files and an ever-growing manifest tree. Alert on it. Treat compaction job health as a first-class SRE concern, not an afterthought.
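A compaction health check does not need to be elaborate. The sketch below assumes you can already collect, per table, the time of the last successful compaction and a count of small files; the table names, thresholds, and the `compaction_alerts` helper are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def compaction_alerts(tables, now, max_age, max_small_files):
    """Return (table, reason) pairs for tables whose compaction is falling behind."""
    alerts = []
    for table in tables:
        if now - table["last_compaction"] > max_age:
            alerts.append((table["name"], "compaction overdue"))
        if table["small_file_count"] > max_small_files:
            alerts.append((table["name"], "small-file backlog"))
    return alerts


now = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
tables = [
    {
        "name": "bronze.telemetry",
        "last_compaction": now - timedelta(hours=30),
        "small_file_count": 12_000,
    },
    {
        "name": "gold.oee_daily",
        "last_compaction": now - timedelta(hours=6),
        "small_file_count": 40,
    },
]

alerts = compaction_alerts(
    tables, now, max_age=timedelta(hours=24), max_small_files=5_000
)
for name, reason in alerts:
    print(f"ALERT {name}: {reason}")
```

Checking both signals catches the two distinct failure shapes: a job that stopped running, and a job that runs on time but cannot keep up with ingest.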
Governance rot — the slow drift of who can read what, which tables are tagged sensitive, which lineage edges are missing — is the failure mode that hurts most in regulated industries. The mitigation is catalog-as-code: define your table namespaces, RBAC grants, and tagging in a Git-tracked policy file and apply it through CI, not through ad-hoc clicks.
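Catalog-as-code reduces to a diff between the grants declared in Git and the grants the catalog reports. The sketch below shows only that diff step; the namespaces, principals, privilege names, and the `plan_grants` helper are hypothetical, and fetching or applying grants would go through whatever API your chosen catalog exposes.

```python
# Desired state, as it would be parsed from a Git-tracked policy file.
desired = {
    ("plant_a.bronze", "ingest_svc", "WRITE"),
    ("plant_a.gold", "bi_analysts", "READ"),
}

# Actual state, as it would be read back from the catalog.
actual = {
    ("plant_a.gold", "bi_analysts", "READ"),
    ("plant_a.bronze", "contractor_x", "READ"),  # ad-hoc grant nobody reviewed
}


def plan_grants(desired: set, actual: set) -> dict:
    """Compute the changes needed to make the catalog match the policy file."""
    return {
        "grant": sorted(desired - actual),
        "revoke": sorted(actual - desired),
    }


plan = plan_grants(desired, actual)
for action, items in plan.items():
    for namespace, principal, privilege in items:
        print(f"{action.upper()} {privilege} ON {namespace} FOR {principal}")
```

Running the diff in CI on every merge, and on a schedule in between, turns governance drift from something discovered in an audit into a failing check.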
Practical Recommendations
The pattern that works in production looks like this. Start with one production catalog and one staging catalog, both Iceberg REST. Pick Polaris, Lakekeeper, or open-source Unity Catalog based on which one your platform team can operate. Run Flink as the streaming writer from Kafka into the bronze layer with hidden partitioning by hour. Use Spark for batch silver-to-gold transforms and ML feature engineering. Use Trino as the SQL-for-everyone substrate. Use DuckDB or Daft for edge analytics that does not need a cluster.
Schedule compaction nightly for high-volume tables and weekly for low-volume ones; alert on compaction-job failures the same way you would alert on a missed payroll batch. Define your partition specs explicitly per table — there is no universal answer. Treat catalog configuration as code; review schema changes in pull requests. Keep tenant isolation in the catalog, not in the application layer. For Databricks-anchored shops, do Delta with liquid clustering and accept the gravity for what it is. For streaming-upsert-dominant shops, do Hudi MoR with the record-level index, and have a real operations team.
The mistake to avoid above all is treating the table format as a decision made once at design time and then ignored. The format choice is the bottom of an operational stack — compaction policy, catalog policy, partition policy, governance policy — that you have to actively run. Picking Iceberg does not save you from that work; it gives you a substrate that ages well while you do the work.
FAQ
Which open table format is best for industrial IoT in 2026?
For most industrial IoT lakehouse deployments in 2026, Apache Iceberg with a REST catalog (Polaris, Lakekeeper, or open-source Unity Catalog) is the default recommendation. It gives you the broadest engine portability — Spark, Flink, Trino, DuckDB, Snowflake, BigQuery — and a vendor-neutral governance plane that ages well as your platform team and your cloud strategy evolve. Choose Delta Lake when your entire compute stack is on Databricks with Unity Catalog, or Hudi when streaming upserts dominate your write pattern.
Can I use Iceberg, Delta, and Hudi together?
You can, but you usually should not. Each format has its own optimization patterns, governance plane, and operational runbook. Running multiple formats in parallel as a steady state typically gives you the disadvantages of all three with the cleanest experience of none. The legitimate multi-format pattern is a time-bounded migration (Delta to Iceberg, for example) or a deliberate two-tier design (Hudi for streaming ingest, Iceberg for the curated downstream layer with explicit conversion). Delta UniForm narrows the read interoperability gap with Iceberg but should not be treated as a permanent fence-sitting solution.
How does Iceberg REST catalog compare to Hive Metastore?
The Iceberg REST catalog is a versioned, HTTP-based, transactional catalog API designed for the multi-engine cloud era. Hive Metastore is the older Thrift-based catalog inherited from the Hadoop era. The REST catalog supports atomic commits, multi-table transactions in some implementations, vendor-neutral interoperability across implementations (Polaris, Lakekeeper, Glue, Unity OSS), and modern governance features like fine-grained RBAC. Hive Metastore still works for basic cases, but it is increasingly treated as a legacy choice for new industrial lakehouse deployments.
Does AWS S3 Tables replace running my own Iceberg catalog?
S3 Tables is AWS’s managed Iceberg-native storage offering — it gives you Iceberg tables with automated maintenance (compaction, snapshot expiry) and a managed catalog endpoint. If your data is on AWS and your engines (EMR, Athena, Glue, Trino, Snowflake) can talk to S3 Tables through its Iceberg REST endpoint, it is a reasonable way to outsource the operational toil of running compaction and catalog infrastructure. The trade-off is the usual managed-service trade-off: less control, AWS-shaped integration, and pricing that varies with table volume. For multi-cloud or sovereignty-sensitive deployments, a self-hosted Polaris or Lakekeeper is often preferable.
Is Delta Lake locked to Databricks?
Delta Lake the project is open-source and the protocol is public. The reference Spark implementation is open. However, the centre of mass — feature velocity, performance optimization, governance integration (Unity Catalog), and the broader ecosystem (Delta Sharing, Delta Live Tables) — is on Databricks. Non-Databricks engines can read and write Delta tables, but they tend to lag features. Delta UniForm exposes a Delta table with Iceberg metadata for cross-format reads. In practice, Delta is a great choice if Databricks is your home; it is a less good choice if you need Trino, Flink, Snowflake, or DuckDB to be equal first-class writers.
Further Reading and References
Internal:
- Cloud and DevOps for Industrial IoT (pillar) — broader context for industrial cloud architecture.
- Unified Namespace Architecture with HiveMQ and Sparkplug B (2026) — the UNS substrate that feeds the lakehouse.
- OPC UA FX Field-Level Communications Analysis (2026) — the field-level layer beneath the UNS.
- eBPF Observability with Pixie and Cilium (2026) — observability for the lakehouse compute plane.
External:
- Apache Iceberg project documentation (iceberg.apache.org) — spec, REST Catalog specification, Flink and Spark integration guides.
- Delta Lake project documentation (delta.io) — protocol spec, liquid clustering, Delta UniForm.
- Apache Hudi project documentation (hudi.apache.org) — table types, record-level index, timeline model.
- Snowflake Open Catalog (Polaris) — Apache project page and Snowflake documentation.
- Databricks Unity Catalog — proprietary and open-source distributions, Delta Sharing protocol.
- AWS S3 Tables announcement and documentation — Iceberg-native managed storage on AWS.
- OpenLineage project — lineage emission standard supported across the three formats.
