Cilium Mesh vs Istio Ambient: 2026 ADR for Sidecarless Meshes

The sidecar service mesh era ended quietly between 2024 and 2026. Cilium Service Mesh shipped GA L7 features in v1.14 (Aug 2023) and matured its mesh story through v1.16 and v1.17. Istio Ambient reached beta in v1.22 (May 2024) and went GA in v1.24 (Nov 2024). Both eliminate the per-pod Envoy that defined Istio 1.x. The Cilium Service Mesh vs Istio Ambient choice is now a real architectural decision, not a hype comparison, and the wrong call burns a quarter of platform-team time on rip-and-replace.

Architecture at a glance

[Figure: architecture diagram, Cilium Mesh vs Istio Ambient]

This post is an ADR (architecture decision record) for sidecarless service mesh selection in 2026. It compares data paths, mTLS identity, L7 policy surface, observability, and migration cost. The thesis: Cilium wins when you already need a CNI rewrite or kernel-level policy. Istio Ambient wins when you have existing Istio CRD investment or multi-cluster federation. Every other angle is noise at the scale most teams operate at. The post ends with an ADR template and three context-specific decisions you can paste into your repo.

Context: why the sidecar died

The sidecar mesh died because the per-pod Envoy footprint stopped paying its rent. A 2023 Solo.io benchmark put a single Envoy sidecar at 40-80 MB resident memory and 0.1-0.5 vCPU idle. On a cluster with 5,000 pods that is 200-400 GB of Envoy memory before a single request flows. Restart blast radius is the other killer: rolling an Istio control plane upgrade rolls every workload.

The pain manifested in four ways most platform teams remember from 2022-2024:

Memory cost on large clusters. The Envoy sidecar baseline at Istio 1.18 with default xDS push was ~50 MB. At 10,000 pods that is 500 GB of mesh memory before workloads run.
mTLS certificate rotation storms. SDS pushes rotate SVIDs every 24 hours by default. With 10,000 sidecars and a synchronized clock, the control plane saw correlated load spikes.
Restart coupling. A pod restart for an application bug also rebuilt the sidecar’s xDS state, doubling cold-start time. Sidecar inject errors made deploys flaky.
Day-2 upgrade pain. Upgrading Istio meant rolling every workload. Teams missed quarterly CVE windows.

Cilium and Istio responded differently. Cilium leaned into eBPF: move L4 and identity into the kernel; only spin up Envoy when an L7 policy demands it. Istio kept Envoy but moved it out of pods: a per-node L4 proxy called ztunnel (Rust) handles transport and mTLS, and a per-namespace or per-service waypoint proxy (Envoy) handles L7. Both eliminate the sidecar. They disagree on where the L7 work lives and how mTLS identity is wrapped.

The two architectures at a glance

Cilium Service Mesh and Istio Ambient are both sidecarless meshes, but they differ in three structural choices: where L4 lives (Cilium kernel eBPF vs Istio ztunnel userspace Rust), where L7 lives (Cilium per-node Envoy via CiliumEnvoyConfig vs Istio per-namespace waypoint Envoy), and how mTLS is framed (Cilium native TLS or WireGuard vs Istio HBONE tunneling). These three choices drive every downstream trade-off.

The reference architectures look superficially similar. Both have a control plane that pushes config to per-node data-plane agents. Both terminate L4 close to the workload and only invoke a full L7 proxy when policy requires it. The differences emerge in the implementation primitives.

Cilium runs a cilium-agent DaemonSet that programs eBPF into the kernel — tc and XDP hooks at the network device, cgroup/connect4 hooks at the socket layer for service routing, and sockops/sk_msg for sidecar-less TCP redirect. The L4 path never leaves kernel context. L7 happens by configuring an Envoy that the agent embeds (or runs as a node-local DaemonSet pod), selected via CiliumEnvoyConfig CRDs and CiliumNetworkPolicy with toPorts.rules.http rules. Encryption is either native TLS to the destination identity, or transparent WireGuard or IPsec at the node level.

Istio Ambient runs istiod as the control plane, plus an istio-cni DaemonSet that installs iptables or nftables rules redirecting pod traffic to the local ztunnel. The ztunnel is a Rust process (DaemonSet, one per node) that terminates mTLS and tunnels L4 traffic over HBONE — HTTP/2 CONNECT over mTLS, defined in the Istio HBONE specification. When namespace-level or service-level L7 policy is needed, traffic is forwarded from the ztunnel to a waypoint proxy (Envoy) running as a regular Deployment. The waypoint applies AuthorizationPolicy, VirtualService, and RequestAuthentication resources, then forwards back through HBONE to the destination ztunnel.

Data path deep dive

The Cilium data path executes in three or four eBPF programs and one optional userspace hop. The Istio Ambient data path executes in two iptables redirects, two ztunnel processes, and zero or one waypoint hop. The Cilium L4 path saves roughly one userspace context switch and one TCP segment reassembly relative to Ambient. At L7, both paths are dominated by Envoy.

Cilium L4 path

A TCP connect() from a pod hits a cgroup/connect4 eBPF hook. The hook does service load balancing in the kernel, replacing the service ClusterIP with a backend pod IP, using a BPF_MAP_TYPE_HASH keyed on (service IP, port). No netfilter conntrack, no kube-proxy iptables, no userspace. On the egress NIC, a tc hook applies network policy (L3/L4 match), optionally encrypts via WireGuard, and emits the packet. The entire L4 data plane amounts to a few thousand BPF instructions and a hash map lookup, all resident in the kernel.

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: orders-allow-frontend
  namespace: shop
spec:
  endpointSelector:
    matchLabels:
      app: orders
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP

This policy compiles into eBPF map entries and never touches Envoy. There is no per-connection latency cost beyond a hash lookup. For comparison, the equivalent in Istio Ambient is an AuthorizationPolicy resource that ztunnel evaluates in userspace.
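The connect-time translation described above can be modeled in a few lines. This is only a userspace sketch of the idea (a hash lookup keyed on service IP and port), not Cilium's implementation: the `SERVICES` table, the addresses in it, and the random backend choice are all assumptions made for illustration.

```python
import random

# Hypothetical service table: (ClusterIP, port) -> list of backend (pod IP, port).
# In Cilium this state lives in eBPF maps; a dict stands in for it here.
SERVICES = {
    ("10.96.0.10", 8080): [("10.0.1.5", 8080), ("10.0.2.7", 8080)],
}


def translate_connect(dst_ip, dst_port, services=SERVICES, rng=random):
    """Return the address a connect() call should actually dial."""
    backends = services.get((dst_ip, dst_port))
    if not backends:
        # Not a known service address: leave the destination untouched.
        return (dst_ip, dst_port)
    # Backend selection is simplified to a uniform random pick.
    return rng.choice(backends)
```

Calling `translate_connect("10.96.0.10", 8080)` returns one of the two backend tuples, while any address that is not in the table passes through unchanged.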

Conntrack and connection state

Cilium maintains its own BPF conntrack table separate from Linux netfilter’s. The BPF table is per-endpoint and supports up to ~512 k entries per node by default (tunable). Entries time out per the same rules as netfilter conntrack but the table lives in eBPF map memory and the lookups are O(1). On a node sustaining 50 k concurrent flows the conntrack memory footprint is ~80 MB.

Istio Ambient delegates conntrack to the Linux kernel because traffic still flows through iptables redirects to ztunnel. ztunnel itself maintains application-level connection state (HBONE streams) which is independent. This means Ambient inherits whatever netfilter conntrack tuning your nodes ship with. The kernel derives the default nf_conntrack_max from available memory, and a value around 256 k is common, which a busy mesh can exhaust. Check and tune it before going to production.
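A quick way to see how close a node is to the netfilter limit is to compare the live entry count with the configured maximum. The sketch below reads the two standard procfs files; the helper names and the 0.8 warning threshold are assumptions chosen for illustration, and the files exist only when the nf_conntrack module is loaded.

```python
from pathlib import Path

PROC = Path("/proc/sys/net/netfilter")


def conntrack_utilization(count: int, maximum: int) -> float:
    """Fraction of the netfilter conntrack table currently in use."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / maximum


def read_conntrack(proc_dir: Path = PROC) -> tuple[int, int]:
    """Read (nf_conntrack_count, nf_conntrack_max) from procfs."""
    count = int((proc_dir / "nf_conntrack_count").read_text())
    maximum = int((proc_dir / "nf_conntrack_max").read_text())
    return count, maximum


def conntrack_report(proc_dir: Path = PROC, warn_at: float = 0.8) -> str:
    """One-line summary; warn_at is an arbitrary illustrative threshold."""
    count, maximum = read_conntrack(proc_dir)
    used = conntrack_utilization(count, maximum)
    line = f"conntrack: {count}/{maximum} ({used:.0%})"
    if used >= warn_at:
        line += " WARNING: consider raising net.netfilter.nf_conntrack_max"
    return line
```

Run `print(conntrack_report())` on a node (or in a privileged debug pod) during peak traffic; a figure near the threshold is a signal to raise the limit before enabling the mesh.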

Cilium L7 path

When an HTTP-aware rule is declared, the agent provisions a node-local Envoy listener and inserts a bpf_sk_assign redirect to it. Traffic is delivered to Envoy via a Unix domain socket, Envoy applies the HTTP filter chain, then writes back to a target socket that hits eBPF again on the way out. This is one userspace hop, against three on the Istio Ambient L7 path (client ztunnel, waypoint, server ztunnel).

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: orders-http
spec:
  endpointSelector:
    matchLabels:
      app: orders
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: "GET"
                path: "/api/v1/orders/.*"
              - method: "POST"
                path: "/api/v1/orders"
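To reason about what the orders-http rules above admit, it helps to evaluate them by hand. The sketch below assumes each `method` and `path` value is a regular expression that must match the whole string (full-match semantics); that anchoring is an assumption here, and Python's `re` stands in for the POSIX-style regex engine, which behaves the same for these simple patterns.

```python
import re

# Rules copied from the orders-http policy above.
RULES = [
    {"method": "GET", "path": "/api/v1/orders/.*"},
    {"method": "POST", "path": "/api/v1/orders"},
]


def allowed(method: str, path: str, rules=RULES) -> bool:
    """True if any rule matches both the method and the path in full."""
    return any(
        re.fullmatch(rule["method"], method) and re.fullmatch(rule["path"], path)
        for rule in rules
    )


print(allowed("GET", "/api/v1/orders/42"))   # GET on a sub-path
print(allowed("POST", "/api/v1/orders"))     # POST on the collection
print(allowed("GET", "/api/v1/orders"))      # GET on the collection itself
```

Under the full-match assumption the third call is rejected, because the GET pattern requires a trailing slash after `orders`. If listing orders via GET on the collection is intended, the pattern needs to allow it explicitly.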

Istio Ambient L4 path

A pod connect() is matched by iptables rules installed by istio-cni. The connection is redirected to the node-local ztunnel, which listens on a socket inside the pod’s network namespace (port 15001 for outbound traffic). ztunnel looks up the destination workload, opens an HBONE tunnel (HTTP/2 CONNECT framed over mTLS) to the destination node’s ztunnel, and pipes bytes. There is no L7 parsing. ztunnel is written in Rust, uses tokio, and per Solo.io’s published benchmarks adds ~0.4 ms p50 over no-mesh at moderate RPS.

Istio Ambient L7 path

When a waypoint is attached to a service or namespace (via the istio.io/use-waypoint label and a Gateway resource), ztunnel routes traffic through the waypoint instead of directly to the destination ztunnel. The waypoint is just an Envoy Deployment. Traffic flows: client pod to client ztunnel to waypoint Envoy (HBONE in, HBONE out) to server ztunnel to server pod. That is two HBONE tunnels, three proxy traversals, and one pass through Envoy.

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: waypoint
  namespace: shop
  labels:
    istio.io/waypoint-for: service
spec:
  gatewayClassName: istio-waypoint
  listeners:
    - name: mesh
      port: 15008
      protocol: HBONE
---
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: orders-from-frontend
  namespace: shop
spec:
  targetRefs:
    - kind: Service
      group: ""
      name: orders
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/shop/sa/frontend"]
      to:
        - operation:
            methods: ["GET", "POST"]
            paths: ["/api/v1/orders*"]
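The string fields in the AuthorizationPolicy above are not regular expressions. Istio documents four match forms for them: exact, prefix (`abc*`), suffix (`*abc`), and presence (`*`). The sketch below models that matching for the single rule in the policy; the function names are hypothetical and the evaluation is simplified to one ALLOW rule.

```python
def istio_match(pattern: str, value: str) -> bool:
    """Exact, prefix ("abc*"), suffix ("*abc"), or presence ("*") match."""
    if pattern == "*":
        return value != ""
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    if pattern.startswith("*"):
        return value.endswith(pattern[1:])
    return pattern == value


# Rule copied from the orders-from-frontend policy above.
RULE = {
    "principals": ["cluster.local/ns/shop/sa/frontend"],
    "methods": ["GET", "POST"],
    "paths": ["/api/v1/orders*"],
}


def rule_allows(principal: str, method: str, path: str, rule=RULE) -> bool:
    """All three fields must match; values within a field are OR-ed."""
    return (
        any(istio_match(p, principal) for p in rule["principals"])
        and any(istio_match(m, method) for m in rule["methods"])
        and any(istio_match(p, path) for p in rule["paths"])
    )
```

One consequence worth noticing: the prefix form `/api/v1/orders*` also admits a path such as `/api/v1/orders-export`, since only the leading characters are compared. Tighten the pattern if sibling endpoints share that prefix.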

mTLS identity: SPIFFE on both, framing differs

Both meshes use SPIFFE SVID identity (X.509 SVIDs, trust domain in the URI SAN), but they frame mTLS differently. Cilium uses native TLS to the destination identity or transparent WireGuard. Istio Ambient uses HBONE — HTTP/2 CONNECT over mTLS — which lets ztunnel multiplex many connections over one TLS session and pass identity headers cleanly to waypoints.

Cilium identity is sourced from CiliumIdentity objects (computed from pod labels) or, in newer versions, integrated with SPIRE for full SPIFFE SVID compliance. The SPIFFE ID looks like spiffe://cluster.local/identity/12345 where the integer maps to a label-set hash. Cilium can issue X.509 SVIDs via SPIRE or use its native CRD-based identity.

Istio identity is sourced from Kubernetes ServiceAccount via istiod, with SPIFFE URIs like spiffe://cluster.local/ns/shop/sa/frontend. The mTLS handshake is wrapped in HBONE: the outer TLS uses the service identity, and the inner HTTP/2 CONNECT request carries metadata headers (baggage, identity hints). HBONE is documented in the Istio Ambient architecture docs and uses ALPN value istio to negotiate.
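The mapping from ServiceAccount to identity is mechanical, which makes it easy to generate or validate principals in tooling. A minimal sketch, with hypothetical helper names, that builds the SPIFFE URI in the format shown above and derives the scheme-less form used in the `principals` field of the earlier AuthorizationPolicy:

```python
SCHEME = "spiffe://"


def istio_spiffe_id(namespace: str, service_account: str,
                    trust_domain: str = "cluster.local") -> str:
    """Build the SPIFFE URI for a Kubernetes ServiceAccount."""
    return f"{SCHEME}{trust_domain}/ns/{namespace}/sa/{service_account}"


def to_principal(spiffe_id: str) -> str:
    """Drop the scheme: AuthorizationPolicy principals omit 'spiffe://'."""
    if not spiffe_id.startswith(SCHEME):
        raise ValueError(f"not a SPIFFE URI: {spiffe_id!r}")
    return spiffe_id[len(SCHEME):]


uri = istio_spiffe_id("shop", "frontend")
print(uri)                 # spiffe://cluster.local/ns/shop/sa/frontend
print(to_principal(uri))   # cluster.local/ns/shop/sa/frontend
```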

The practical difference: HBONE makes connection pooling between two nodes trivial — one persistent mTLS session carries all traffic between any workloads on those two nodes. Cilium’s native TLS opens per-flow sessions unless you use WireGuard at the node level, in which case the kernel handles tunneling but you lose per-workload identity in the wire frame (it is reconstructed from BPF metadata). HBONE is more debugger-friendly. WireGuard is more performant. Pick based on whether you reach for tcpdump more often than iperf3.

What HBONE actually carries

HBONE is misunderstood as just CONNECT over TLS. It is, but the framing matters. The outer mTLS handshake authenticates the two ztunnel endpoints — node-pair identity. The inner HTTP/2 CONNECT request contains a :authority pseudo-header with the destination workload’s SPIFFE identity, plus optional baggage and traceparent headers per W3C trace-context. The waypoint, when present, terminates the inner stream, applies L7 policy, and re-frames outbound. This double-framing is what enables a single TLS session to multiplex 10 k+ flows between two nodes without per-flow handshake cost.

Cilium’s equivalent is operationally different. Without WireGuard, each workload-to-workload flow opens its own TLS session using the source workload’s SPIFFE SVID directly — there is no proxy in the middle to multiplex on. With WireGuard enabled, the kernel maintains one tunnel per node-pair and tags packets with eBPF metadata that reconstructs workload identity at the destination. Both approaches preserve end-to-end identity attestation; only the wire format differs.

Trust domain and rotation

Both default to a 24-hour SVID rotation window. Cilium delegates to SPIRE when configured, which lets you reuse the same trust domain across non-Kubernetes workloads (VMs, lambdas). Istio Ambient supports a shorter rotation cadence via istiod flags and integrates with SPIRE federation for cross-trust-domain scenarios. Both implement RFC 8446 TLS 1.3 by default.

L7 policy surface

Cilium expresses L7 policy via CiliumNetworkPolicy (with http block) and CiliumEnvoyConfig (for raw Envoy filter chains). Istio Ambient uses the existing VirtualService, DestinationRule, AuthorizationPolicy, and the Kubernetes Gateway API targetRefs. Istio has a larger policy surface and more deployed knowledge; Cilium has tighter coupling to network identity.

The Cilium policy model fuses L3/L4 and L7 into one CRD. The same CiliumNetworkPolicy can have both toPorts.ports (L4) and toPorts.rules.http (L7) blocks. When only L4 is declared, the policy is fully kernel-resident with zero Envoy involvement. When L7 is declared, the matching flow is steered to Envoy on first packet. This minimizes Envoy load — you pay only for the flows that need L7.

The Istio Ambient policy model is split. AuthorizationPolicy evaluated at L4 (without waypoint) supports source identity and port/protocol but not HTTP method or path. AuthorizationPolicy evaluated at L7 (with waypoint attached) supports the full HTTP semantics — methods, paths, headers, JWT claims. The split is logical but confusing during migration. Teams hit the gotcha where an AuthorizationPolicy works in sidecar mode but silently does nothing in ambient mode because no waypoint is attached.
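The silent no-op is easy to catch with a lint step before migration. The sketch below takes an AuthorizationPolicy already parsed into a dict (for example by a YAML loader) and reports which fields would need a waypoint. The field lists are an assumption drawn from the split described above and are not exhaustive, and the function names are hypothetical.

```python
# Fields assumed to require HTTP parsing, and therefore a waypoint.
L7_OPERATION_FIELDS = {"methods", "notMethods", "paths", "notPaths", "hosts", "notHosts"}
L7_SOURCE_FIELDS = {"requestPrincipals", "notRequestPrincipals"}


def l7_fields(policy: dict) -> list[str]:
    """List the L7-only fields used anywhere in the policy's rules."""
    found = []
    for rule in policy.get("spec", {}).get("rules", []):
        for to in rule.get("to", []):
            found += sorted(L7_OPERATION_FIELDS.intersection(to.get("operation", {})))
        for frm in rule.get("from", []):
            found += sorted(L7_SOURCE_FIELDS.intersection(frm.get("source", {})))
        for cond in rule.get("when", []):
            if cond.get("key", "").startswith("request."):
                found.append(cond["key"])
    return found


def needs_waypoint(policy: dict) -> bool:
    """True if the policy cannot be enforced by ztunnel alone."""
    return bool(l7_fields(policy))
```

Running this across every AuthorizationPolicy in a namespace gives a first-pass list of services that need a waypoint attached before the namespace is switched to ambient mode.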

Operational ergonomics of policy

The policy CRDs have second-order effects on team workflow. A Cilium policy review reads like a network ACL — destination workload, source workloads, ports, optional HTTP path. The author and reviewer are both thinking in network terms. An Istio AuthorizationPolicy review reads like an application access control rule — principals, operations, conditions. The author and reviewer are thinking in identity-and-method terms. Neither framing is wrong, but the team’s existing mental model determines which feels natural.

Teams with a strong NetSecOps function gravitate to Cilium because the policy surface matches what they used to express in firewall rules. Teams with a strong AppSec or API gateway function gravitate to Istio because the policy surface matches OAuth scopes and OpenAPI operation IDs. The right answer is the framing your reviewers can audit without a learning curve.

JWT and request authentication

Both meshes implement JWT validation via Envoy filter, so the capability set is identical in steady state. In Cilium you express it via CiliumEnvoyConfig referencing the envoy.filters.http.jwt_authn filter directly. In Istio Ambient you use RequestAuthentication (with jwtRules) and AuthorizationPolicy. The Istio abstraction is friendlier; the Cilium escape hatch is more powerful when you need a custom filter that Istio’s abstractions don’t expose. For details on collecting the resulting auth events, see our companion piece on the OpenTelemetry collector architecture and pipeline patterns.

Observability: Hubble vs Kiali plus OTel

Cilium ships Hubble — a flow-level observability stack that taps eBPF directly to emit L3/L4 and L7 metadata without sampling. Istio Ambient relies on Envoy access logs from waypoints plus the Kiali UI, and integrates with OpenTelemetry. For raw network visibility, Hubble is unmatched. For request-level distributed tracing, Istio plus OTel is more mature.

Hubble emits a flow record for every connection the kernel sees, with no sampling required because the work is BPF map updates. The flow includes the SPIFFE identities of both ends, the verdict (allowed or denied) and the matching policy. At 100 k RPS in production, Hubble has been observed to add under 3% CPU overhead in published Isovalent benchmarks. The data shape supports drill-down from "show me the policy denials in the last 5 minutes" to a specific flow.

Istio Ambient relies on Envoy at the waypoint emitting access logs and metrics. For traffic that bypasses the waypoint (pure L4), ztunnel emits its own metrics but with less fidelity than Envoy. The standard approach is to ship Envoy stats to Prometheus and traces to an OpenTelemetry Collector, which then routes to Jaeger or Tempo. The flow record style of Hubble is not natively present.

For deeper coverage of how eBPF reshapes the observability stack, see the eBPF Kubernetes observability ADR which covers replacing legacy APM with kernel-level telemetry.

Performance: numbers that matter

Both meshes target sub-millisecond p50 overhead at moderate RPS, but the comparison depends heavily on whether L7 is on the path. The Cilium L4-only path is the fastest path either system offers because it stays in the kernel. The Istio Ambient L7 path with waypoint has roughly the same overhead as a single sidecar Envoy hop because the waypoint is an Envoy.

Published 2024-2025 benchmarks are noisy and instance-class-dependent, so treat any single number with suspicion. The directional findings that hold across multiple sources:

L4-only, Cilium kernel path: ~0.05-0.15 ms p50 overhead vs no mesh, 0.5-1.5% CPU at 10k RPS per node.
L4-only, Istio Ambient ztunnel: ~0.3-0.6 ms p50 overhead, 2-4% CPU at 10k RPS per node. Per-node memory ~50-80 MB resident.
L7 with Envoy (both): ~0.5-1.5 ms p50 overhead beyond L4, +5-15% CPU at 10k RPS. Both pay roughly the same Envoy tax.
mTLS handshake amortized: HBONE pools connections per node-pair, so amortized cost is ~5-10 µs per stream after warmup. Cilium native TLS without pooling is ~50-100 µs per new flow.

If you need a more rigorous treatment, the USENIX NSDI 2023 proceedings on mesh and zero-copy data planes and the ACM SoCC 2022 paper on sidecar overhead frame the right benchmark methodology. Run your own at your workload’s actual request size distribution and connection churn.

Throughput and tail latency

Throughput at line rate is rarely the limiting factor in modern Kubernetes meshes; tail latency is. At 50 k RPS per node both meshes can sustain near-line-rate on a 25 GbE NIC, but p99 latency diverges. Cilium’s kernel L4 path holds p99 within roughly 1.5x p50 because there is no userspace scheduler in the loop. Istio Ambient’s ztunnel, being a Rust tokio process, sees p99 inflate to roughly 2-3x p50 under contention because tokio task scheduling and TLS record assembly are subject to OS scheduler jitter.

The waypoint path adds another scheduler hop. If your waypoint Deployment is colocated on a busy node, p99 can spike to 5-10 ms even when p50 stays under 1 ms. The mitigation is dedicated node pools for waypoint Deployments, which adds a node-pool taint and toleration discipline most platform teams find acceptable.

Memory amortization at scale

The headline saving of sidecarless meshes is memory. A cluster with 10,000 pods on sidecar Istio at 50 MB per sidecar carries 500 GB of mesh memory. The same cluster on Istio Ambient runs ~150 ztunnel instances (one per node on a 150-node cluster) at ~80 MB each, plus ~20 waypoint instances at ~150 MB each — totaling roughly 15 GB, a 33x reduction. Cilium Service Mesh on the same cluster runs ~150 cilium-agent instances at ~120 MB each (the agent does CNI plus mesh), totaling ~18 GB. Both are within the same order of magnitude. The sidecar savings dominate any agent-vs-agent comparison.
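The arithmetic behind those totals is worth keeping in a script so you can substitute your own pod, node, and waypoint counts. The per-instance figures below are the ones quoted in the paragraph above; decimal units (1 GB = 1000 MB) are assumed to match its round numbers.

```python
MB_PER_GB = 1000  # decimal units, matching the round numbers in the text


def total_gb(instances: int, mb_each: float) -> float:
    """Aggregate resident memory for a fleet of identical proxies or agents."""
    return instances * mb_each / MB_PER_GB


sidecar = total_gb(10_000, 50)                   # one Envoy per pod
ambient = total_gb(150, 80) + total_gb(20, 150)  # ztunnels plus waypoints
cilium = total_gb(150, 120)                      # one cilium-agent per node

print(f"sidecar Istio: {sidecar:.0f} GB")
print(f"Istio Ambient: {ambient:.0f} GB ({sidecar / ambient:.0f}x less)")
print(f"Cilium:        {cilium:.0f} GB ({sidecar / cilium:.0f}x less)")
```

With these inputs the script reports 500 GB, 15 GB, and 18 GB, which reproduces the 33x reduction for Ambient and gives roughly 28x for Cilium.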

CPU at high RPS

The CPU story is more workload-dependent than the memory story. Pure L4 throughput at 100 k RPS per node burns roughly 4-6 vCPU on Istio Ambient ztunnel and roughly 1-2 vCPU on Cilium’s kernel path. That is a 3-4x difference, but on a 32-vCPU node it is roughly 12-19% vs 3-6%, both within reasonable headroom. The L7 case erases the difference because Envoy dominates either way. The right question is not "which mesh uses less CPU" but "how often do my flows actually need L7 inspection". If 80% of inter-service traffic is L4-only, Cilium has a structural advantage. If 80% needs HTTP-aware policy, you are paying Envoy either way.
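The vCPU ranges quoted above convert to node shares as follows. This is plain arithmetic on the figures in the text, with the 32-vCPU node size taken from the same paragraph.

```python
def share(vcpu_used: float, node_vcpu: int = 32) -> float:
    """Fraction of a node's vCPUs consumed by the mesh data plane."""
    return vcpu_used / node_vcpu


ztunnel = (share(4), share(6))  # low and high end of the ztunnel range
cilium = (share(1), share(2))   # low and high end of the kernel-path range

print(f"ztunnel: {ztunnel[0]:.1%} to {ztunnel[1]:.1%} of the node")
print(f"cilium:  {cilium[0]:.1%} to {cilium[1]:.1%} of the node")
```

The output is 12.5% to 18.8% for ztunnel and 3.1% to 6.2% for Cilium, so the gap is real but neither figure crowds out workloads on a node of this size.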

Trade-offs and failure modes

Neither mesh is a free lunch. Cilium’s failure modes are kernel-version sensitive and policy-debugging hostile. Istio Ambient’s failure modes cluster around the L4-vs-L7 dual policy model and waypoint scaling. Both have non-trivial migration cost from a deployed sidecar Istio.

Cilium failure modes

Kernel-version coupling is the first one. Cilium Service Mesh L7 needs kernel 5.10+, ideally 5.15+, and some features require 6.1+. Managed Kubernetes nodes pinned to older LTS kernels lose capabilities silently — features fall back to userspace or get disabled.

Policy debugging is the second one. When a CiliumNetworkPolicy denies a flow, the diagnostic is a Hubble flow record with a verdict label. There is no kubectl describe output that says "this policy on this CRD denied the flow because rule N matched". Cilium has improved this with cilium policy trace and hubble observe --verdict DROPPED, but in 2026 it is still less ergonomic than istioctl analyze.

Per-node Envoy contention is the third. If many workloads on a node hit L7 policies, the single embedded Envoy can become a bottleneck. Running Envoy as a separate node-local DaemonSet pod instead of embedding it in the agent helps but adds operational surface.

Encryption choice forces a trade. WireGuard transparent encryption is fast but obscures per-workload identity on the wire — packet captures show node IPs, not pod identities. Native TLS preserves identity but loses connection pooling benefits. Pick one based on whether your auditor or your benchmark suite is louder.

Istio Ambient failure modes

Dual policy surface confuses teams during migration. An AuthorizationPolicy with an HTTP method match silently does nothing if no waypoint is attached to the target service. The remediation — attaching a waypoint via Gateway API — is a separate workflow most teams have not internalized.

A waypoint is a Deployment, not a DaemonSet, and that changes how it scales. If you attach a waypoint to a high-traffic namespace, it becomes a critical-path Envoy with all of Envoy’s tuning needs (worker thread count, connection buffer sizes, HPA on CPU). Operators who came from sidecar Istio assume the data plane scales horizontally per pod; with waypoints, it scales per waypoint Deployment.

ztunnel as a single point per node. The ztunnel DaemonSet is critical-path for all pod-to-pod traffic on that node. A ztunnel crash blackholes the node’s pod traffic until restart (typically <2 s in published Solo.io tests, but worst case during xDS push storms could be longer). PodDisruptionBudget on the ztunnel DaemonSet and tested upgrade rollouts are mandatory.

CNI compatibility is a constraint. Istio Ambient installs iptables rules via istio-cni. If you run another CNI plugin that owns iptables (Calico chained mode, some custom firewall agents) you can get rule-ordering bugs. Test the combination before committing.

Migration cost from sidecar Istio

Migrating from a sidecar Istio deployment to Ambient is the cheap path. You keep VirtualService, DestinationRule, AuthorizationPolicy, PeerAuthentication, and RequestAuthentication. You drop the sidecar injection label from namespaces, label them istio.io/dataplane-mode: ambient, and optionally add waypoints to L7-policy namespaces. The CRD investment is preserved.

Migrating from a sidecar Istio deployment to Cilium is the expensive path. You translate every VirtualService to a CiliumEnvoyConfig or CiliumNetworkPolicy with HTTP rules. You re-implement AuthorizationPolicy as Cilium L7 rules. You change your CNI (Cilium needs to be the CNI). Plan a quarter of platform-engineering time for a cluster with a non-trivial Istio policy footprint. For a comprehensive view of the alternative — staying within the Istio ecosystem — see our companion ADR on Istio Ambient mesh versus Linkerd in 2026.

Migration cost from no-mesh

The greenfield path is the cheapest in absolute terms but the most ambiguous in scope. Without an existing mesh you have no CRD inventory to translate, but you also have no policies that capture today’s intended service-to-service behavior. The first three months of a sidecarless-mesh rollout from no-mesh are policy archaeology — observing traffic with Hubble (Cilium) or ztunnel access logs (Ambient), drafting policies in dry-run mode, then enforcing them.

Cilium’s dry-run is more ergonomic here: CiliumNetworkPolicy supports an enableDefaultDeny: false posture combined with Hubble’s policy-verdict labels to identify flows that would be denied if you flipped the default. Istio Ambient lacks an equivalent first-class dry-run; teams typically deploy AuthorizationPolicy with action: AUDIT (preview feature in v1.25) which logs but does not enforce, then promote to action: ALLOW after a quiet period. Audit mode is functional but the tooling around it is less mature than Hubble’s flow explorer.

Practical recommendations and ADR template

Pick Cilium Service Mesh if you already need to replace your CNI, want kernel-resident L4 policy, and have an eBPF-fluent platform team. Pick Istio Ambient if you have existing Istio CRD investment, need multi-cluster mesh federation, or value the larger Envoy debugging ecosystem. Avoid both if a no-mesh deployment with Kubernetes NetworkPolicy and application-level mTLS meets your needs — that path is still valid in 2026.

Pre-decision checklist

Before writing the ADR, answer these in writing:

What is the current CNI, and is it staying? Cilium-the-mesh requires Cilium-the-CNI.
How many AuthorizationPolicy, VirtualService, DestinationRule resources exist? If more than 50, Ambient migration is cheaper than Cilium.
What is the kernel version on production nodes? Cilium L7 wants 5.15+.
Is the platform team stronger on Linux kernel and BPF, or on Envoy and L7?
Multi-cluster federation: required now or within 12 months?
Compliance audit posture: is per-workload identity on the wire (HBONE) more defensible than node-level WireGuard?
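The checklist answers can be captured as data and run through a small function so the reasoning is reproducible in review. The sketch below is a hypothetical illustration, not a validated scoring model: the `Inputs` fields, the one-vote-per-input tally, and the tie handling are all assumptions. Only the hard CNI constraint, the 50-resource threshold, and the 5.15 kernel floor come from the checklist itself.

```python
from dataclasses import dataclass


@dataclass
class Inputs:
    can_run_cilium_cni: bool   # is Cilium the CNI, or can it become the CNI?
    istio_crd_count: int       # AuthorizationPolicy + VirtualService + DestinationRule
    kernel: tuple[int, int]    # (major, minor) on production nodes
    team_strength: str         # "ebpf" or "envoy"
    needs_federation: bool     # required now or within 12 months


def recommend(i: Inputs) -> tuple[str, list[str]]:
    """Return (recommendation, reasons). Each input casts one vote."""
    if not i.can_run_cilium_cni:
        # Hard constraint: the mesh features need Cilium as the CNI.
        return "Istio Ambient", ["Cilium Service Mesh requires Cilium as the CNI"]
    ambient, cilium = [], []
    if i.istio_crd_count > 50:
        ambient.append(f"{i.istio_crd_count} Istio resources to translate")
    if i.needs_federation:
        ambient.append("multi-cluster federation required")
    if i.kernel < (5, 15):
        ambient.append("kernel older than 5.15 limits Cilium L7")
    else:
        cilium.append("kernel supports Cilium L7")
    side = cilium if i.team_strength == "ebpf" else ambient
    side.append(f"team strength: {i.team_strength}")
    if len(ambient) == len(cilium):
        return "undecided", ambient + cilium
    if len(ambient) > len(cilium):
        return "Istio Ambient", ambient
    return "Cilium Service Mesh", cilium
```

For example, `recommend(Inputs(True, 120, (5, 15), "envoy", False))` returns Istio Ambient with the CRD count and team strength as reasons. Treat the output as a prompt for the ADR rationale section, not as the decision.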

Decision matrix

ADR template

Copy this into docs/adr/0007-sidecarless-mesh.md:

# ADR-0007: Sidecarless Service Mesh Selection

## Status
Proposed | Accepted | Superseded by ADR-NNNN

## Context
We operate N production clusters with M pods. We currently run
[sidecar Istio | no mesh | Linkerd | other]. Pain points: memory cost,
upgrade pain, mTLS rotation. Constraints: kernel version K,
existing CNI C, compliance regime R.

## Decision
We will adopt [Cilium Service Mesh | Istio Ambient].

## Rationale
- CNI position: we already run or do not run Cilium as CNI.
- Existing Istio CRD count: P resources.
- Kernel version on production nodes: K.
- Multi-cluster federation: required or not required within 12 months.
- Team skills: kernel/eBPF or Envoy/L7 strong.

## Consequences
Positive: ...
Negative: ...
Reversal cost: estimated X engineer-quarters.

## Alternatives considered
- Sidecar Istio (rejected: memory cost).
- Linkerd (rejected: smaller policy surface).
- No mesh (rejected: compliance R requires per-flow mTLS).

Three concrete decision contexts

Context A — Greenfield platform, kernel 6.x, no existing mesh, 200 services. Decision: Cilium Service Mesh as both CNI and mesh. Rationale: no migration cost on either side, eBPF gives observability for free via Hubble, kernel-resident L4 is the lowest-overhead path. Reversal cost: high (would have to swap CNI). Trade-off accepted: the team must learn eBPF debugging, but the operational simplicity of one product owning CNI plus mesh plus observability pays back inside two quarters.

Context B — Existing sidecar Istio, 1,200 pods, 80 AuthorizationPolicy and 40 VirtualService resources. Decision: Istio Ambient. Rationale: CRDs port over unchanged; team already knows istioctl; migration is mostly relabel namespaces and attach waypoints. Reversal cost: low (can roll back to sidecars per-namespace). Trade-off accepted: dual policy surface confuses junior engineers, mitigated by a one-page internal cheat sheet on when to attach a waypoint.

Context C — Multi-cluster federation across 6 regions, mixed kernels, Kubernetes managed by three cloud providers. Decision: Istio Ambient. Rationale: Cilium’s cluster-mesh is solid but Istio’s east-west gateway pattern is more battle-tested for multi-provider, ztunnel handles heterogeneous kernel versions gracefully, and HBONE multiplexing reduces inter-cluster connection count. Reversal cost: medium. Trade-off accepted: ztunnel single-point-per-node failure mode mitigated by PDB and tested rolling-upgrade procedure.

What the ADR should not say

Three claims show up in vendor-influenced ADRs and weaken the document. First, "X is faster than Y" without a concrete benchmark scenario: both meshes are fast enough that microsecond differences rarely show up in service SLOs. Second, "X has better observability": Hubble and Kiali plus OTel are different shapes of observability, not better-or-worse. Third, "X is more secure": both implement SPIFFE mTLS over TLS 1.3 and pass the same compliance posture for the auditors who matter. Keep the ADR rationale grounded in CNI position, CRD inventory, kernel version, team skills, and federation requirements. Those five inputs determine the right answer in over 90% of cases.

Decision review cadence

Sidecarless mesh choices made in 2026 should be reviewed at the next major version of either project — likely Istio 1.27 (late 2026) or Cilium 1.18 (mid-2026). The technology surface moves fast. ztunnel may absorb selective L7 work that today requires a waypoint; Cilium may ship native multi-cluster federation parity with Istio. Annotate the ADR with a review-by date and a list of trigger events (new GA release, internal incident, kernel-version refresh) that would force re-evaluation. ADRs are durable but not eternal.

FAQ

Is Cilium Service Mesh production-ready in 2026?

Yes. Cilium has been GA since 2018 as a CNI; the service mesh layer reached GA in v1.14 (Aug 2023) and matured through v1.16 (2024) and v1.17 (2025). Major adopters include Google (GKE Dataplane V2), AWS (EKS Anywhere), and many CNCF end-user case studies. The eBPF dependency is no longer the bleeding edge it was in 2022 — kernel 5.15 is widely available across managed Kubernetes. Operational maturity is on par with sidecar Istio, with the caveat that policy-debugging tooling is still catching up to istioctl.

Does Istio Ambient replace Istio sidecars completely?

Yes for new clusters, and yes incrementally for existing ones. You can run sidecar and ambient modes in the same cluster during migration by labeling namespaces with istio.io/dataplane-mode: ambient while leaving others on sidecar injection. Ambient went GA in Istio v1.24 (Nov 2024) and is the recommended default for v1.25+. Sidecar mode remains supported and is still the right choice for some scenarios — large StatefulSet workloads where the per-pod blast radius matters less, for example.

Can I run Cilium Service Mesh on top of another CNI?

No. Cilium-the-mesh requires Cilium-the-CNI because the data path is eBPF programs attached to the pod network. You can chain Cilium with another CNI in some configurations (Cilium chaining mode for AWS VPC CNI or Azure CNI), but the mesh features assume Cilium owns the kernel hooks. If you cannot replace your CNI, choose Istio Ambient instead. The CNI position is the single most predictive input to the decision.

How do I migrate observability tools when switching to a sidecarless mesh?

The telemetry shape changes. Sidecar Istio emitted Envoy access logs from every pod; ambient emits them only from waypoints. Cilium emits Hubble flow records, which is a different shape from access logs. Re-derive your SLI dashboards: where you queried istio_requests_total from Prometheus, you now query waypoint Envoy metrics with different labels. Plan a parallel-run period of 2-4 weeks where both old and new sources feed dashboards, then deprecate the old ones once SLO continuity is verified.

What about WebAssembly extensions like Wasm filters?

Both meshes support Envoy Wasm filters where Envoy is on the path. In Cilium, declare them via CiliumEnvoyConfig referencing the Wasm filter chain. In Istio Ambient, attach Wasm filters to waypoints via EnvoyFilter or WasmPlugin resources. Neither runs Wasm in the kernel-resident path (Cilium L4) or in ztunnel — Wasm is an Envoy feature, and the L4 fast paths skip Envoy. If you depend on Wasm for auth or telemetry, ensure your hot paths actually traverse Envoy.

Is sidecarless the same as no mesh?

No. Sidecarless meshes still impose a data-plane proxy per node (ztunnel) or kernel eBPF programs (Cilium). They still need a control plane, certificate rotation, and observability pipelines. The cost of the proxy moves from per-pod to per-node, which is a 50-500x reduction in mesh component count, but it is not zero. The no-mesh choice — vanilla Kubernetes NetworkPolicy plus application-level mTLS — remains a valid 2026 architecture for teams whose compliance does not require per-flow identity attestation.

How does sidecarless mesh interact with Knative or KEDA scale-to-zero?

Both Cilium and Istio Ambient handle scale-to-zero better than sidecar Istio did. With sidecars, every cold start paid the sidecar init container cost — typically 1.5-3 seconds for Envoy to receive xDS and become healthy. Sidecarless meshes have no per-pod proxy to warm; the ztunnel or eBPF path is already loaded on the node. Cold start drops to whatever the application itself needs. The one caveat: Istio Ambient waypoints are themselves Deployments and need to be running before the first request hits a waypoint-attached service.

What is the gotcha with hostNetwork pods and DaemonSets?

Pods running with hostNetwork: true bypass the pod network and therefore bypass the mesh. In Cilium, hostNetwork pods still get policies applied at the host network namespace via CiliumClusterwideNetworkPolicy with node selectors, but L7 features do not engage. In Istio Ambient, hostNetwork pods are entirely outside the mesh — ztunnel only intercepts pod network traffic. Privileged DaemonSets (monitoring agents, log collectors) typically run on host network and are unaffected. Plan their authentication independently.