Iceberg Catalogs: Polaris vs Nessie vs Unity vs Glue
Apache Iceberg settled the table-format question. Then 2024 posed a harder one: which catalog should own the metadata? Snowflake open-sourced Polaris (June 2024, Apache 2.0, now incubating at the ASF), Databricks released Unity Catalog under Linux Foundation governance (also June 2024), and Dremio’s Project Nessie matured beyond alpha with full Iceberg REST Catalog API support. Meanwhile AWS Glue, already managing billions of table pointers, added REST endpoints to compete. Choosing an Iceberg catalog has become the central architectural decision of 2025-2026 lakehouse deployments.
This is not merely a feature checklist. The catalog you choose determines whether your Trino cluster can read Snowflake’s tables without copying data, whether you can version tables like git commits, whether your row-level access policies survive a team transition, and whether you’re locked into a vendor’s Python SDK or free to swap engines. It’s the boundary layer between table format and data governance—the control plane that every query engine must trust.
Why the Catalog Layer Suddenly Matters
The genius of Apache Iceberg was separation of concerns. The table format (Parquet snapshots, manifest lists, metadata versioning) stayed in object storage. Query engines (Spark, Trino, Flink) learned to read Iceberg files directly. But Iceberg’s original design left catalogs alone: a table pointer resolver was just an interface that implementations would fill in.
For years, a JDBC catalog or a Hive Metastore sufficed. Teams ran a single engine (Spark) against one catalog (Hive) and called it done. Iceberg worked fine in that world. By 2023, that world broke. A single team needed Trino for ad hoc queries (MPP performance), Spark for ML pipelines (PySpark plus the Python ML ecosystem), Snowflake for business analytics (SQL IDE + row-level security), and Flink for streaming ingest (millisecond latency). All reading the same table, with the same row-level policies, and the same concurrency guarantees. Each engine has its own Iceberg client. Each client needs to know which snapshots are valid, which branches exist, and who can read what. That information has to live somewhere centralized and be queryable via a standard protocol. Enter the REST Catalog spec.
The Iceberg REST Catalog specification (v1.0, open at iceberg.apache.org) standardizes the HTTP verbs and JSON payloads that engines use to load table metadata, commit new snapshots, list and manage namespaces, enforce authentication via OAuth2 and AWS SigV4, and delegate authorization to pluggable policy layers. That spec is what made it possible to swap out Hive MetaStore for Polaris, or run Nessie, or point Spark at Glue’s new REST endpoint. The table format stays in S3; the catalog becomes interchangeable. And that interchangeability matters because catalogs do more than resolve pointers. They enforce access control, track lineage, detect conflicts, and manage snapshots.
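That swap is mostly configuration. As a sketch, here is how the standard Iceberg Spark catalog properties would point an engine at a REST catalog. The catalog name `lakehouse`, the endpoint URI, warehouse path, and token are hypothetical placeholders; the property keys themselves are the stock Iceberg Spark settings.

```python
# Spark configuration properties for an Iceberg REST catalog.
# The catalog name, URI, warehouse, and token are placeholders;
# the keys are standard Iceberg Spark settings.
def rest_catalog_conf(name: str, uri: str, warehouse: str, token: str) -> dict:
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",             # use the REST catalog client
        f"{prefix}.uri": uri,                 # catalog endpoint (Polaris, Nessie, Glue, ...)
        f"{prefix}.warehouse": warehouse,     # default table location in object storage
        f"{prefix}.token": token,             # OAuth2 bearer token for authentication
    }

conf = rest_catalog_conf(
    "lakehouse",
    "https://catalog.example.com/api/catalog",  # hypothetical endpoint
    "s3://example-bucket/warehouse",
    "example-oauth2-token",
)
# Each entry would be passed to SparkSession.builder.config(key, value).
```

Swapping Hive Metastore for Polaris, Nessie, or Glue’s REST endpoint changes only the `uri` and the credentials, which is the whole point of the spec.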
Reference Architecture: Catalog as the Lakehouse Control Plane

On the left: five query engines (Trino, Spark, Snowflake, Flink, DuckDB). Each has an Iceberg client library. When a query starts, the client doesn’t go directly to object storage. Instead, it hits the REST Catalog API with a loadTable(namespace, table_name) call.
The catalog backend—Polaris, Nessie, Unity, or Glue—resolves that request. It authenticates the caller (OAuth2 token or SigV4 signature), checks permissions (RBAC, row-level policies, column masking), fetches the current metadata pointer (which snapshot is live), and returns the metadata JSON, snapshot ID, and manifest file locations. The engine then reads Parquet files directly from S3/ADLS/GCS. No data flows through the catalog; it’s purely a metadata orchestrator.
When the engine commits (writes a new snapshot), it sends the metadata back to the catalog with a commitTable call. The catalog validates optimistic concurrency (Is the table version I read still the current version?), namespace-level isolation (Does this caller own this table?), and schema evolution (Are the new columns valid?). Only then does it update the metadata pointer. Competing writes to the same table within milliseconds will conflict; one succeeds, one fails. The loser retries with a fresh loadTable. This design scales horizontally because each catalog backend can be load-balanced, and metadata is stateless JSON.
The Iceberg REST Catalog Spec: Engine Contracts and Optimism
The REST Catalog spec defines five categories of endpoints:
– Namespace management: GET/POST /v1/namespaces, DELETE /v1/namespaces/{namespace}
– Table operations: create via POST and list via GET on /v1/namespaces/{namespace}/tables; load, update, and drop individual tables via GET, POST, and DELETE
– Views: optional; as of 2026, most catalogs skip them
– Snapshots and advanced operations: commit history, branch/tag management (Nessie)
– Metrics: row counts, file statistics
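A minimal sketch of how a client forms one of these requests, using only the Python standard library. The host name and token are placeholders, and real engines do this inside their Iceberg client libraries:

```python
from urllib.request import Request

BASE = "https://catalog.example.com/v1"   # hypothetical catalog host

def load_table_request(namespace: str, table: str, token: str) -> Request:
    """Build (without sending) the HTTP request an Iceberg client issues
    to load a table, following the REST Catalog spec's path layout."""
    url = f"{BASE}/namespaces/{namespace}/tables/{table}"
    return Request(url, method="GET",
                   headers={"Authorization": f"Bearer {token}"})  # OAuth2 bearer auth

req = load_table_request("analytics", "customers", "example-token")
```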
The authentication layer wraps this. Every request must carry credentials: OAuth2 Bearer Token (standard for SaaS catalogs like Polaris Cloud and Unity Catalog), AWS SigV4 (Glue, self-hosted on EC2), or Custom JWT/API key (self-hosted Polaris). The authorization layer is where catalogs diverge. The REST spec says “throw a 403 if unauthorized” but doesn’t define RBAC or row-level security. That’s a catalog vendor choice.

Here’s the critical sequence for writes: (1) Engine calls loadTable → receives TableMetadata JSON + current_version_id = 5. (2) Engine reads data, computes, writes new Parquet files. (3) Engine builds a new manifest and metadata file, then calls commitTable(new_metadata, expected_version_id=5). (4) The catalog server checks: is current_version_id still 5? If yes, it writes version 6 metadata to the persistent store and returns success. If no, it returns 409 Conflict and the engine retries. This optimistic locking model works because metadata is tiny (JSON + Avro, typically under 1 MB per commit), catalog stores are fast (usually a key-value or relational DB), and concurrent writes to the same table are rare in practice (different tables, different teams, no contention). The gotcha: if a batch job makes 100 writes to the same table in 30 seconds, and 5 competing jobs do the same, you’ll see a storm of version conflicts, each adding a retry loop. Catalogs handle this via exponential backoff and write-ahead queueing, but it’s visible as latency spikes.
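The four-step sequence can be simulated with a toy in-memory catalog. Everything here is illustrative (no real client library exposes these exact names); the point is the compare-and-swap on the version id and the backoff-and-retry loop on conflict:

```python
import time

class ConflictError(Exception):
    """Stands in for an HTTP 409 Conflict from the catalog."""

class InMemoryCatalog:
    """Toy stand-in for a catalog backend: one version counter per table,
    updated only via compare-and-swap, mirroring commitTable semantics."""
    def __init__(self):
        self.versions = {}                        # table name -> current version id

    def load_table(self, table: str) -> int:
        return self.versions.setdefault(table, 0)

    def commit_table(self, table: str, expected_version: int) -> int:
        if self.versions[table] != expected_version:
            raise ConflictError                   # someone else committed first
        self.versions[table] = expected_version + 1
        return self.versions[table]

def commit_with_retry(catalog, table, max_attempts=5):
    """Engine-side loop: load, write data files (elided), commit; on conflict,
    re-load the fresh version and retry with exponential backoff."""
    for attempt in range(max_attempts):
        expected = catalog.load_table(table)
        try:
            return catalog.commit_table(table, expected)
        except ConflictError:
            time.sleep((2 ** attempt) * 0.001)    # 1 ms, 2 ms, 4 ms, ...
    raise RuntimeError("gave up after repeated conflicts")
```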
Catalog-by-Catalog Deep Dive
Apache Polaris: The Vendor-Neutral REST Catalog
Polaris is Snowflake’s gift to the Iceberg ecosystem. In June 2024, Snowflake donated its internal catalog to the Apache Software Foundation. It’s written in Java and Spring Boot, runs as a stateless REST service, and implements the full Iceberg REST Catalog spec plus role-based access control (RBAC).
Key strengths:
– Vendor-neutral: Polaris doesn’t care if you run Spark, Snowflake, Trino, or DuckDB. All it sees is REST requests.
– RBAC out of the box: Polaris models permissions as Principal → Principal Role → Catalog Role. A principal is a user or service account. A principal role is a group membership (e.g., “data_engineers”). A catalog role is a permission set per namespace (e.g., “data_engineers can SELECT but not ALTER on prod_analytics”). Column-level privileges are supported in some contexts.
– Multi-tenancy: Polaris supports creating multiple “catalogs” (in Polaris terminology, not Iceberg terminology) within a single Polaris deployment. Each tenant is isolated.
– Self-hosted: Polaris runs on Kubernetes, Docker, or even bare VMs. No vendor lock-in on operations.
– Apache 2.0 license (as of June 2024 incubation).
Weaknesses:
– Young ecosystem: Incubating at Apache, not yet stable. Production use exists but expect backward-compatibility breaks in 0.x releases.
– OAuth2 only for hosted instances (good), but self-hosted deployments need your own OAuth2 provider or JWT handling.
– RBAC is coarse: Polaris doesn’t do row-level security or column masking. You get “user can read table or user cannot.” For fine-grained access, you layer an engine with its own policy support (e.g., Trino’s row filters and column masks) on top.
– Metadata storage is pluggable but defaults to PostgreSQL: You bring your own Postgres HA.
Polaris fits if: You want a standard-compliant Iceberg catalog that any engine can use, you have multiple teams (tenants) and need light-touch RBAC (roles + namespaces), you’re okay with self-hosting or want to evaluate before committing to a SaaS vendor, or you’re building a data platform on Kubernetes.
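The Principal → Principal Role → Catalog Role chain described above can be sketched as three lookup tables. The role names and the `is_authorized` helper are hypothetical, not part of the Polaris API; real Polaris manages these grants through its management endpoints:

```python
# Toy model of the Polaris grant chain:
#   principal -> principal roles -> catalog roles -> (namespace, privilege) grants.
principal_roles = {"alice": {"data_engineers"}}            # user/service -> group memberships
catalog_roles = {"data_engineers": {"analyst_prod"}}       # group -> permission sets
grants = {"analyst_prod": {("prod_analytics", "SELECT")}}  # permission set -> namespace grants

def is_authorized(principal: str, namespace: str, privilege: str) -> bool:
    """Walk the chain: any catalog role reachable from the principal
    that carries the (namespace, privilege) pair authorizes the call."""
    for prole in principal_roles.get(principal, ()):
        for crole in catalog_roles.get(prole, ()):
            if (namespace, privilege) in grants.get(crole, ()):
                return True
    return False
```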
Project Nessie: The Version Control Catalog
Nessie, led by Dremio, is the wild card. It brings git-style branching, tagging, and merging to Iceberg table metadata. Instead of one “current” snapshot per table, you have branches: main, dev, feature-xyz, each with its own snapshot history.
Key strengths:
– Branching: Write to feature/new_ml_pipeline, test in isolation, merge to main. No risk of breaking production queries. This is huge for data engineering teams that run nightly builds or CI/CD pipelines for data transformations.
– Lineage tracking: Nessie tracks who committed what, when, and why. Every commit is tagged with a message. The audit trail is explicit.
– Merging logic: Nessie can detect conflicts when merging branches (e.g., both branches added a column with the same name). It asks you to resolve before merge, not after.
– REST Catalog API support: Nessie implements the full v1.0 spec, so engines don’t need custom code.
– Apache 2.0, Dremio-led but community-driven.
Weaknesses:
– Merging semantics are tricky: Git’s three-way merge assumes text. Nessie extends it to table schemas, but edge cases exist. If branch A drops column X and branch B alters column X, the merge result is ambiguous. Nessie detects this but the resolution is manual.
– No fine-grained RBAC: Nessie is about versioning, not access control. You get the catalog; you don’t get who-can-read-column-X. You layer Polaris or Unity on top, or use Trino’s built-in column masking.
– Operational complexity: Running Nessie + a persistent store (Postgres, DynamoDB, etc.) adds moving parts. Nessie isn’t as operationally boring as Glue.
– Smaller adoption in 2026: Less battle-tested than Polaris or Unity in very large enterprises.
Nessie fits if: You treat data as code (dbt, Terraform, CI/CD), you want version control on table schemas and snapshots, you have separate dev/staging/prod environments for data, or you’re willing to manage the operational overhead for the branching benefit.
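Nessie’s branch/merge model can be sketched as a map of branch pointers plus a record of each branch’s fork point. The `VersionStore` class and its conflict rule are a simplified illustration, not Nessie’s actual algorithm:

```python
class VersionStore:
    """Toy git-for-tables: each branch maps table -> snapshot id; a merge
    conflicts when both sides changed the same table since the fork point."""
    def __init__(self):
        self.branches = {"main": {}}          # branch -> {table: snapshot_id}
        self.forked_from = {}                 # branch -> source state at fork time

    def create_branch(self, name, source="main"):
        self.branches[name] = dict(self.branches[source])
        self.forked_from[name] = dict(self.branches[source])

    def commit(self, branch, table, snapshot_id):
        self.branches[branch][table] = snapshot_id

    def merge(self, source, target="main"):
        base = self.forked_from[source]
        conflicts = [t for t, snap in self.branches[source].items()
                     if base.get(t) != snap                            # changed on source
                     and self.branches[target].get(t) != base.get(t)]  # and on target
        if conflicts:
            raise ValueError(f"manual resolution needed: {conflicts}")
        self.branches[target].update(self.branches[source])
```

A dev branch that diverges cleanly merges back; if main moved the same table in the meantime, the merge stops and asks for a decision, which is the discipline the section above describes.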
Unity Catalog: The Governance Juggernaut
Unity Catalog is Databricks’ governance engine, open-sourced in June 2024 under Linux Foundation stewardship. It’s the most feature-rich catalog on this list, but also the most opinionated.
Key strengths:
– Fine-grained access control: Row-level security (predicates on rows), column masking, and column-level READ/WRITE privileges. Unity Catalog integrates with your Databricks workspace or runs standalone.
– Data discovery and lineage: Unity Catalog ships with a UI where data engineers browse tables, see who owns what, and trace data flows from source to downstream consumers. This lineage is built-in, not bolted-on.
– UniForm (Universal Format): Unity Catalog can manage both Iceberg and Delta tables in the same namespace. If you’re migrating from Delta to Iceberg (or hedging your bets), Unity Catalog handles both.
– Linux Foundation: Not a single vendor, so better governance prospects.
Weaknesses:
– Databricks-optimized: While Unity Catalog runs standalone, Databricks is the first-class citizen. The UI assumes you’re in Databricks. Integration with other engines (Trino, Spark on EC2) works but feels secondary.
– No branching: Unlike Nessie, Unity Catalog doesn’t support dev/prod branches. You get one view of the table at a time.
– Complex authorization model: Fine-grained RBAC is powerful but requires detailed planning. You must define roles, privileges, and inheritance carefully. Misconfiguration can lock you out of tables.
– Deployment model confusion: As of 2026, Unity Catalog on Kubernetes (community-managed) lacks some features available on Databricks. You must choose: run on Databricks (locked in) or run standalone (less polished).
Unity Catalog fits if: You have sensitive data and need row-level security or column masking, you want integrated data lineage and governance UIs, you run Databricks or you’re okay with community-managed Kubernetes deployments, or you’re migrating from Delta to Iceberg and want a smooth path.
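The two fine-grained controls, row filters and column masks, can be modeled as functions applied at read time. In Unity Catalog these are declared as SQL functions attached to a table; the Python below is a hypothetical stand-in to show the semantics, with made-up data:

```python
# Illustrative table with one PII column.
rows = [
    {"customer": "acme", "region": "EU", "ssn": "123-45-6789"},
    {"customer": "bolt", "region": "US", "ssn": "987-65-4321"},
]

def row_filter(row, user_region):
    """Row-level security predicate: users see only their region's rows."""
    return row["region"] == user_region

def mask_ssn(value, can_see_pii):
    """Column mask: redact all but the last four digits for most users."""
    return value if can_see_pii else "***-**-" + value[-4:]

def read_table(user_region, can_see_pii):
    """Apply filter and mask on the read path, as the catalog's policy layer would."""
    return [dict(r, ssn=mask_ssn(r["ssn"], can_see_pii))
            for r in rows if row_filter(r, user_region)]
```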
AWS Glue Data Catalog: The Managed Shortcut
Glue is the oldest player in this story. AWS announced it in 2016 as a fully-managed metadata store for EMR, Redshift, and S3. Iceberg support arrived in 2023, and REST endpoints in 2024.
Key strengths:
– Fully managed: No infrastructure to run. AWS handles HA, backups, and scaling.
– Lake Formation integration: If you use Lake Formation for access control, Glue is the natural choice. IAM policies map to table and column permissions automatically.
– Native Iceberg support: Glue understands Iceberg formats and snapshots natively. No extra configuration.
– Crawlers: Glue Crawlers can auto-detect Iceberg table schemas and update metadata. Useful for low-touch, high-latency pipelines.
– Low operational burden: You’re not running anything; AWS is.
Weaknesses:
– AWS-locked: If you use Trino on-premise, or Snowflake’s independent deployment, Glue is accessible only via REST (and requires cross-account IAM setup). Multi-cloud is painful.
– Limited branching/versioning: Glue doesn’t support table branches or versioning like Nessie. You get one “current” table per name.
– RBAC is Lake Formation-centric: Fine-grained access works well if you’re deep in the AWS ecosystem. If you’re multi-cloud, Lake Formation is hard to replicate.
– REST spec compliance: Glue’s REST Catalog implementation (added 2024) is still catching up to v1.0. Some engines report compatibility issues.
Glue fits if: Your data stack is AWS-native (EC2, EMR, Redshift, Lambda, Step Functions), you want minimal operational overhead, single-engine (Spark on EMR) or Snowflake is your primary compute, or you’re not multi-cloud.
Multi-Engine Federation: One Table, Five Engines
Here’s the real-world scenario that catalogs enable:

You have a production table analytics.customers in Polaris. It’s stored as Iceberg in S3 with 500M rows, 50 GB compressed. You update it nightly via a Spark job that runs on EC2. Ad hoc analysts query it via Trino from their laptop. The data science team has a dbt pipeline that runs on Databricks and reads the same table. An ML model scores predictions from Flink in real-time. The CFO wants it in Snowflake for her BI tool.
With a shared Iceberg catalog: (1) Spark writes a new snapshot (snapshot ID 42) with yesterday’s new rows. (2) Trino reads the latest snapshot (42) and filters to recent customers. (3) Databricks reads snapshot 42 and joins with another table for ML features. (4) Flink reads snapshot 42 but caches it; it won’t re-read until a new snapshot exists. (5) Snowflake syncs via REST, gets snapshot 42, and materializes in its own table store.
All five engines see the same logical table with identical schemas and access policies. No data copies between engines (Snowflake is an exception; it materializes a copy, but the Iceberg metadata tells it when to sync). The catalog is the single source of truth. The concurrency here is automatic: Spark and Databricks write to different partitions or use Iceberg’s optimistic concurrency to check for conflicts. Trino and Flink are read-only, so they never conflict.
The gotcha: if Spark, Databricks, and a third job all try to write to the same table in rapid succession, conflicts pile up. Spark commits version 43 against parent 42. Databricks, which also read version 42, gets a 409, re-loads, and commits version 44. The third job conflicts twice before landing as version 45. This is correct behavior but surfaces as latency.
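That cascade can be replayed in a few lines; the job names and the single shared version counter are illustrative:

```python
# Replay of the commit cascade: three jobs all read version 42, then commit in turn.
current = {"version": 42}
log = []

def try_commit(job, expected):
    """Compare-and-swap commit; False stands in for an HTTP 409 Conflict."""
    if current["version"] != expected:
        return False
    current["version"] += 1
    log.append((job, current["version"]))
    return True

assert try_commit("spark", 42)            # first writer wins: version 43
assert not try_commit("databricks", 42)   # stale parent -> 409
assert not try_commit("third_job", 42)    # stale parent -> 409
# Losers re-load the table and retry against the fresh version.
assert try_commit("databricks", 43)       # version 44
assert try_commit("third_job", 44)        # version 45
```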
Trade-offs and Failure Modes
No catalog is perfect. Here are the real gotchas:
Metadata-File Explosion: Iceberg stores table history as a linked list of metadata files, each JSON plus Avro-serialized manifests. On a busy table with 100 commits per hour, 365 days per year, that’s 876,000 metadata files. Each is hundreds of KB. Your metadata store (Polaris’s Postgres, Nessie’s DynamoDB, etc.) becomes a bottleneck. Workaround: implement metadata expiration (Iceberg GC) to clean up old snapshots. Most catalogs support this but it’s not automatic.
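A retention pass analogous to Iceberg’s expire_snapshots maintenance action can be sketched as follows. The data layout is a simplification (real expiration also deletes the orphaned manifest and data files, not just the history entries):

```python
import time

def expire_snapshots(snapshots, current_id, retention_seconds, now=None):
    """Keep the current snapshot plus anything committed inside the
    retention window. snapshots: list of (snapshot_id, committed_at_epoch)."""
    now = now if now is not None else time.time()
    cutoff = now - retention_seconds
    return [(sid, ts) for sid, ts in snapshots
            if sid == current_id or ts >= cutoff]

history = [(1, 1000), (2, 2000), (3, 3000), (4, 4000)]
kept = expire_snapshots(history, current_id=4, retention_seconds=1500, now=4100)
# keeps snapshot 4 (current) and snapshot 3 (inside the window)
```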
REST Catalog Hot-Spotting: All five engines hit the same loadTable endpoint for the same table. If your catalog isn’t load-balanced, it becomes a choke point. With caching (Trino caches metadata for 5 minutes by default), this is fine. Without caching, you’re in trouble. Most production deployments run 3+ catalog replicas behind a load balancer.
Nessie Merge Surprises: Git’s merge works because text is line-based. When two branches both add column X, merge fails. Nessie extends this to schemas, but the logic is complex. If branch A drops a column and branch B alters it, what should merge do? Nessie asks you to resolve manually. This is correct but requires discipline.
Polaris RBAC Isn’t Granular: You can grant “SELECT on namespace” but not “SELECT on table where date > ‘2026-01-01’”. If you need row-level security, you add Trino on top and push predicates down. This works but adds latency and operational complexity.
Unity Catalog Deployment Variance: The standalone Kubernetes build still trails the Databricks-hosted service in features. You either accept Databricks lock-in or the less polished community deployment; there’s no middle ground yet.
Concurrency Contention: If your write rate to a single table exceeds roughly 100 commits per second, even optimistic concurrency struggles; you’ll see exponential backoff and client-side retries. Workaround: partition writes by logical dataset (different tables, different schemas), or batch many small writes into fewer, larger commits.
Recommendations: Picking the Right Catalog
Use this decision tree to narrow your choice:

Greenfield AWS-native, single team: AWS Glue Data Catalog. Managed, zero ops overhead, works great with EMR, Redshift, Glue Studio. No multi-engine complexity, no branching needed. Cost: Glue charges per 100,000 objects stored + REST requests. Usually under $1K/month for small-to-medium deployments.
Greenfield, multi-engine (Trino + Spark), good RBAC sufficient: Apache Polaris. Standard REST spec, any engine works. Lightweight role-based access. Self-hosted, full control over infrastructure. Avoid Polaris if you need row-level security or column masking (layer Trino on top, or use Unity).
Mature data platform, fine-grained governance required: Unity Catalog. Row-level security, column masking out of the box. Data lineage UI + discovery. If on Databricks: first-class support. If standalone: community support (improving). Cost: Free (open-source) but operational complexity if self-hosting.
CI/CD for data, branching is non-negotiable: Project Nessie. Git-style version control on table metadata. Isolate dev/feature branches from production. Requires operational investment (Nessie server + persistent store). Pair with Polaris or Unity for access control.
Hybrid, on-premise plus cloud, data sovereignty: Polaris (self-hosted) + Nessie (optional). Polaris runs anywhere (Kubernetes, VM, bare metal). Nessie adds branching if you want it. No reliance on managed services outside your control. Higher operational burden but full sovereignty.
Frequently Asked Questions
Q: Is Polaris really an Apache project?
Yes, as of June 2024. Snowflake donated it to the ASF under Apache 2.0. It’s in the incubating phase (not yet a stable release), so expect breaking changes in 0.x versions. A stable 1.0 was targeted for late 2025 or early 2026.
Q: Can I migrate from Glue to Polaris without rewriting my tables?
Yes. Both understand Iceberg metadata and snapshots. You point your engines at Polaris instead of Glue, and they read the same S3 tables with the same metadata. The only catch: Glue stores credentials in Lake Formation; Polaris uses OAuth2 or SigV4. You’ll need to update authentication logic in your engines.
Q: What does the REST Catalog spec actually standardize?
HTTP verbs and JSON schemas for load, commit, list, and delete operations. It does NOT standardize authorization, branching, or lineage. That’s why catalogs differ. Polaris adds role-based access, Nessie adds branches, Unity adds fine-grained policies. All conform to REST but extend differently.
Q: Unity Catalog vs Polaris — which one wins?
Different competitions. Fine-grained access control and lineage: Unity wins decisively. Simplicity and vendor-neutrality: Polaris wins. Maturity: both were open-sourced in June 2024; Unity carries years of internal Databricks hardening, Polaris relies on community vetting. Operational overhead: Glue wins (managed); Polaris and Unity are roughly equivalent.
Q: Does Snowflake natively support reading Polaris tables?
Not directly. Snowflake has its own catalog (Polaris was donated to Apache, not kept for Snowflake). However, you can read Polaris-managed Iceberg tables via Snowflake’s Iceberg integration if you give Snowflake the S3 credentials and the metadata file path. It’s a workaround, not native.
Q: How do I handle catalog HA and DR?
Glue: AWS handles it (multi-region replication available). Polaris: Run 3+ replicas behind a load balancer. Metadata is stored in Postgres; use Postgres replication for DR. Nessie: Similar to Polaris; Nessie’s state lives in a persistent store (DynamoDB, Postgres, etc.). Replicate that store for HA/DR. Unity on Databricks: Managed by Databricks. On Kubernetes: use persistent volume replication.
Further Reading
Internal articles on iotdigitaltwinplm.com:
– Trino vs Presto vs Apache Spark: Lakehouse Query Engines (2026) — deep dive on query layer, complements this catalog guide
– Cloud DevOps category — more architecture and infrastructure posts
– TimescaleDB Hypertables, Chunks, Compression and Continuous Aggregates — time-series storage, often paired with Iceberg catalogs for observability data
External references:
– Apache Iceberg REST Catalog Specification — official spec, open-api/rest-catalog-open-api.yaml on GitHub
– Apache Polaris Documentation — quick start, RBAC, self-hosting guide
– Project Nessie — branching, merging, API reference
– Unity Catalog — governance, row-level security, deployment options
– AWS Glue Data Catalog User Guide — Iceberg support, Lake Formation integration
By: Claude | Published: 2026-04-24
