Cilium 1.17 Service Mesh: End-to-End Production Tutorial (2026)

If you have been running a sidecar-based service mesh for the last three years, you already know what this guide is going to argue: the sidecar tax — 100–300 MB of extra memory per pod, a second proxy hop per request, and three CRDs to debug every outage — is no longer worth paying when an eBPF-native option does the same job from the kernel. This Cilium 1.17 service mesh tutorial walks through the full production path: architecture, a working install with real Helm values, mTLS over SPIFFE, an enforced L7 policy, Hubble-based observability, and a step-by-step migration from Istio sidecar mode. By the end you will have a sidecarless mesh you can defend in an architecture review.

Context: how we got to the sidecarless mesh

The service mesh story compresses into three eras. Era one (2018–2021) was the sidecar era — Istio and Linkerd injected a per-pod Envoy or linkerd2-proxy and routed every connection through it. The model was clean but expensive: every pod paid an extra container’s worth of memory, every request paid two userspace hops, and every cluster operator paid the cognitive cost of three control planes (Istiod, Envoy, Kubernetes).

Era two (2022–2024) was the ambient era — Istio’s ambient mode and Linkerd’s per-node proxy variants moved L4 out of the pod into a per-node ztunnel, keeping L7 in an opt-in waypoint proxy. This cut the sidecar tax dramatically but kept Envoy on the data path for any pod that needed HTTP-level rules.

Era three (2025–2026) is the eBPF era. Cilium reached CNCF Graduated status in October 2023, the first project to do so with a service mesh story built on eBPF rather than userspace proxies. By Cilium 1.17, the mesh feature set covers L3/L4 policy entirely in the kernel, L7 policy via a per-node Envoy DaemonSet (one Envoy per node, not per pod), mTLS via SPIFFE/SPIRE, Gateway API support, and Cluster Mesh for multi-cluster L7 routing. The 1.17 release shipped in early 2025 with hardened SPIRE-based mTLS as a default-on option, Hubble OpenTelemetry export out of the box, and Gateway API conformance for the full HTTPRoute spec.

The practical consequence: a Cilium-meshed cluster runs roughly one Envoy DaemonSet pod per node and zero per-pod proxies. For a 50-node cluster running 2,000 pods, that’s 50 Envoys instead of 2,000 — a 40x reduction in proxy footprint, plus the elimination of pod-startup ordering bugs caused by sidecar injection.
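The arithmetic behind that comparison is easy to rerun for your own cluster. The snippet below is a minimal sketch: the function name is made up for illustration, and the 100 MB per proxy is simply the low end of the range quoted in the introduction. It also assumes a per-node Envoy uses about as much memory as a sidecar, which a shared proxy handling more traffic may well exceed.

```shell
#!/bin/sh
# Rough proxy-footprint comparison: one sidecar per pod vs one Envoy per node.
# Usage: proxy_footprint NODES PODS MB_PER_PROXY   (hypothetical helper)
proxy_footprint() {
  nodes=$1; pods=$2; mb=$3
  echo "sidecar proxies:  $pods (~$((pods * mb)) MB)"
  echo "per-node proxies: $nodes (~$((nodes * mb)) MB)"
  echo "reduction:        $((pods / nodes))x"   # integer division
}

# The 50-node, 2,000-pod example from the text.
proxy_footprint 50 2000 100
```

Swap in your own node and pod counts; the ratio matters more than the absolute memory figures.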

Architecture Overview

Before installing anything, it helps to see how the pieces fit. The reference architecture has four moving parts: the Cilium agent (DaemonSet, one per node), the Cilium operator (Deployment, cluster-wide), an Envoy DaemonSet for L7 work, and the Hubble observability stack (Relay + UI + metrics).

Data path: eBPF in the kernel

The data path is where Cilium earns its reputation. When a pod sends a packet, it does not traverse a userspace proxy for L3/L4 decisions. Instead, eBPF programs attached at three hook points — the socket layer (for fast pod-to-pod shortcuts), the traffic-control (tc) layer (for veth-pair traffic), and XDP (for ingress at the NIC) — make the policy and routing decision in-kernel. This eliminates the two context-switches per packet that sidecar meshes incur and is the primary reason Cilium typically shows 30–60% lower p99 latency than Istio sidecar mode in published benchmarks (Isovalent’s 2024 mesh benchmark report, Cilium 1.14 vs Istio 1.20, identical workloads).

For pods on the same node, Cilium can short-circuit the network stack entirely via socket-layer eBPF (SOCK_OPS and sk_msg programs), forwarding bytes between sockets without going near the network interface. This is invisible to the application but cuts intra-node latency by roughly half.

Control plane: CRDs all the way down

Cilium’s control plane is a set of Kubernetes Custom Resources, watched by the operator and pushed to each agent. The main ones you will touch are:

CiliumNetworkPolicy (namespaced) and CiliumClusterwideNetworkPolicy — L3/L4/L7 policy with selectors richer than upstream NetworkPolicy.
CiliumIdentity — the cluster-wide identity of a workload, derived from labels. This is what eBPF maps key off.
CiliumEndpoint — one per pod, the bridge between Kubernetes pods and Cilium identities.
CiliumEgressGatewayPolicy — for SNAT’ing pod traffic through a fixed egress IP.
CiliumEnvoyConfig and CiliumClusterwideEnvoyConfig — direct Envoy xDS config for advanced L7 cases.

The operator handles IPAM, identity allocation, CRD garbage collection, and (in 1.17) the SPIRE Server lifecycle when mutual auth is enabled.

Where Envoy fits

This is the most misunderstood part of Cilium’s mesh story. Cilium does use Envoy — but only for L7 work, and only as a per-node DaemonSet. When you write a CiliumNetworkPolicy with HTTP rules (path matching, method filtering, header inspection), the agent generates Envoy xDS config and the per-node Envoy enforces it. For pure L3/L4 policy (CIDR blocks, port allowlists, label-based selection), Envoy is never on the path — eBPF handles everything in the kernel.

The Envoy DaemonSet is also what serves Gateway API ingress and any TLS termination workloads. In Cilium 1.17, the Envoy version is pinned to a security-patched build and updated independently of the agent, so you can patch Envoy CVEs by rolling only the Envoy DaemonSet and leaving the agent DaemonSet untouched.

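The net result: most clusters never need more than one Envoy per node, and pure L4 workloads can run with no Envoy involvement at all. Note that envoy.enabled=false only moves Envoy from its own DaemonSet back into the agent pod. To run without an L7 proxy, set l7Proxy=false in Helm, and your mesh runs entirely on eBPF.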

Step-by-step Tutorial

This section is the working install. Copy-paste-ready commands, real YAML, no pseudocode.

Cluster prerequisites

You need a Kubernetes cluster on a kernel that supports the eBPF features Cilium 1.17 uses. The practical requirement is Linux kernel 5.10 or newer for the full feature set including socket-LB, bandwidth manager, and the BPF host routing fast path. Some features (notably WireGuard transparent encryption) work best on 5.15+, and optimizations built on the bpf_loop helper need 5.17 or newer, where that helper first appeared. Older kernels lose features one by one and may fall below the minimum the release supports, so check the system requirements and feature matrix at docs.cilium.io before planning around them.

Cloud distributions you can use as-is:

EKS — Bottlerocket or Amazon Linux 2023 nodes run kernel 5.15+. Install Cilium in CNI chaining mode or as the primary CNI (set eni.enabled=true and routingMode=native).
GKE — Dataplane V2 is a forked Cilium, so for vanilla Cilium use a standard GKE cluster with Dataplane V2 disabled. The Cilium agent then takes over CNI duties.
AKS — use the BYO CNI option and install Cilium per the AKS-specific Helm overlay.
kubeadm/self-managed — any node OS with kernel 5.10+ and iptables (or nftables) works.

Confirm the kernel version on every node before proceeding:

# One output format only: a second -o flag would override "-o wide".
kubectl get nodes \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kernelVersion}{"\n"}{end}'
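On a large cluster it is easier to let a script flag the outliers than to read the list by eye. The helper below is a sketch, not part of any Cilium tooling: check_kernels is a hypothetical name, it expects the tab-separated name and kernel version lines produced by the command above, and it relies on sort -V (GNU coreutils and recent BSDs) for version ordering.

```shell
#!/bin/sh
# Flag nodes whose kernel is older than a minimum version.
# Reads "name<TAB>kernelVersion" lines on stdin.
# Usage: check_kernels MIN_VERSION   (hypothetical helper)
check_kernels() {
  min=$1
  while IFS="$(printf '\t')" read -r node kernel; do
    [ -n "$node" ] || continue
    # Keep only the leading numeric part, e.g. 5.15.0-1051-aws -> 5.15.0
    ver=$(printf '%s\n' "$kernel" | sed 's/[^0-9.].*$//')
    # If the minimum sorts first (or ties), the node is new enough.
    lowest=$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n 1)
    if [ "$lowest" = "$min" ]; then
      echo "OK   $node $kernel"
    else
      echo "OLD  $node $kernel"
    fi
  done
}
```

Pipe the kubectl output into check_kernels 5.10 and look for lines starting with OLD.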

Helm install

Add the Helm repo and install Cilium 1.17. The values below are a sensible production baseline — kube-proxy replacement on, Hubble on, mesh features on, SPIRE-based mTLS on.

helm repo add cilium https://helm.cilium.io/
helm repo update

Save this as values.yaml:

# values.yaml — Cilium 1.17 production mesh baseline
kubeProxyReplacement: true
k8sServiceHost: "kube-apiserver.example.internal"
k8sServicePort: 6443

routingMode: native
ipv4NativeRoutingCIDR: 10.0.0.0/8
autoDirectNodeRoutes: true

bpf:
  masquerade: true
  hostLegacyRouting: false

# Mesh features
envoy:
  enabled: true
  rollOutPods: true

l7Proxy: true

# Mutual authentication via SPIRE
authentication:
  mutual:
    spire:
      enabled: true
      install:
        enabled: true
        server:
          dataStorage:
            size: 1Gi

# Hubble
hubble:
  enabled: true
  relay:
    enabled: true
  ui:
    enabled: true
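  metrics:
    enableOpenMetrics: true   # needed for the exemplars=true option below
    serviceMonitor:
      enabled: true           # lets Prometheus Operator scrape Hubble flow metrics
    enabled: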
      - dns
      - drop
      - tcp
      - flow
      - port-distribution
      - icmp
      - "httpV2:exemplars=true;labelsContext=source_ip,source_namespace,source_workload,destination_ip,destination_namespace,destination_workload,traffic_direction"
  export:
    static:
      enabled: true
      filePath: /var/run/cilium/hubble/events.log

operator:
  replicas: 2

prometheus:
  enabled: true
  serviceMonitor:
    enabled: true

Install:

helm install cilium cilium/cilium \
  --version 1.17.2 \
  --namespace kube-system \
  --values values.yaml

Verify the install

The cilium CLI is the fastest way to confirm the install. Install it from github.com/cilium/cilium-cli, then:

cilium status --wait
cilium connectivity test --test-namespace cilium-test

cilium status should print OK for the agent DaemonSet, operator, Hubble Relay, and Envoy DaemonSet. The connectivity test deploys client/server pods across nodes and runs ~80 scenarios including pod-to-pod, pod-to-service, host-to-pod, NodePort, and policy enforcement. Expect 5–10 minutes on a fresh cluster.

Enable mTLS

With authentication.mutual.spire.enabled=true in the Helm values, Cilium provisions a SPIRE Server and SPIRE Agent DaemonSet automatically. To require mTLS for traffic between two workloads, add the authentication stanza to a CiliumNetworkPolicy:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: payments-require-mtls
  namespace: prod
spec:
  endpointSelector:
    matchLabels:
      app: payments-api
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: checkout-frontend
      authentication:
        mode: "required"
      toPorts:
        - ports:
            - port: "8443"
              protocol: TCP

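When this policy is applied, traffic from checkout-frontend to payments-api is forwarded only after the two workloads' SPIFFE identities have been mutually authenticated. The handshake does not run in the pod or in a sidecar: the Cilium agents on the source and destination nodes perform it out of band using SVIDs issued by SPIRE, and the eBPF data path drops packets until authentication has succeeded. Mutual authentication on its own does not encrypt the payload, so pair it with WireGuard or IPsec transparent encryption if you need confidentiality on the wire. See diagram 3 for the full sequence.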
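To check that SPIRE has registered an identity for a workload, it helps to know what the SPIFFE ID should look like. The sketch below assumes the convention described in Cilium's mutual authentication guide as I understand it: the default trust domain spiffe.cilium, and one ID per numeric Cilium identity. The helper name, the cilium-spire namespace, and the spire-server-0 pod in the comments are assumptions to verify against your install.

```shell
#!/bin/sh
# Build the SPIFFE ID expected for a numeric Cilium identity.
# Assumes the default trust domain "spiffe.cilium"; override via TRUST_DOMAIN.
# spiffe_id_for is a hypothetical helper, not a Cilium command.
spiffe_id_for() {
  echo "spiffe://${TRUST_DOMAIN:-spiffe.cilium}/identity/$1"
}

# Against a live cluster (not executed here; names are assumptions):
#   id=$(kubectl -n prod get cep -l app=checkout-frontend \
#          -o jsonpath='{.items[0].status.identity.id}')
#   kubectl -n cilium-spire exec spire-server-0 -c spire-server -- \
#     /opt/spire/bin/spire-server entry show -spiffeID "$(spiffe_id_for "$id")"
```

If the entry is missing, the policy will drop traffic for that workload, which usually points at the SPIRE agent or the operator rather than at the policy itself.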

L7 policy example

L7 rules are where Envoy enters the picture. The policy below allows the checkout-frontend to call only GET /api/v1/orders/* and POST /api/v1/orders on the orders-api, with a required X-Tenant-ID header on the POST:

apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: orders-l7
  namespace: prod
spec:
  endpointSelector:
    matchLabels:
      app: orders-api
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: checkout-frontend
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: "GET"
                path: "/api/v1/orders/.*"
              - method: "POST"
                path: "/api/v1/orders"
                headers:
                  # Presence check only; "Name: value" entries match the value literally, not as a regex.
                  - "X-Tenant-ID"

Verdicts (allowed / denied) show up in Hubble flows in real time — hubble observe --type l7 --verdict DROPPED is your first debug command.
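Before applying a rule like this, it is worth sanity-checking which requests the path patterns actually cover. The function below is a rough local model of the orders-l7 policy, not Cilium's matcher: it assumes the path regex must match the whole path (hence grep -x), it ignores the header requirement, and l7_verdict is a made-up name.

```shell
#!/bin/sh
# Approximate the orders-l7 policy decision for a method and path.
# Assumption: the regex is matched against the entire path.
l7_verdict() {
  method=$1; path=$2
  case "$method" in
    GET)  pattern='/api/v1/orders/.*' ;;
    POST) pattern='/api/v1/orders' ;;
    *)    echo "DROPPED"; return ;;
  esac
  if printf '%s\n' "$path" | grep -Eqx "$pattern"; then
    echo "FORWARDED"
  else
    echo "DROPPED"
  fi
}

l7_verdict GET    /api/v1/orders/42   # FORWARDED
l7_verdict GET    /api/v1/orders      # DROPPED: pattern requires a trailing "/"
l7_verdict POST   /api/v1/orders      # FORWARDED (header check not modelled)
l7_verdict DELETE /api/v1/orders/42   # DROPPED: method not listed
```

The second case is the one that tends to surprise people: listing all orders with a bare GET /api/v1/orders is not covered by the GET rule as written.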

Hubble Observability

Hubble is the observability layer baked into Cilium. It taps the same eBPF programs the data path uses, so every flow is observable without adding any agent or sidecar.

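Hubble is already enabled by the Helm values above, so the first command below is only needed on clusters installed without them: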

cilium hubble enable --ui
cilium hubble port-forward &
hubble status
hubble observe --namespace prod --follow

Browse the UI:

cilium hubble ui

The UI shows a live service map with L3/L4/L7 flow annotations, policy verdicts overlaid in red/green, and a per-flow timeline. For long-term storage, the OpenTelemetry export is enabled by default in 1.17 — point it at any OTel-compatible backend (Tempo, Jaeger, Honeycomb, Datadog) by adding a Collector config to your cluster. Flow metrics (counts, drops, p50/p95/p99 latency, HTTP status codes) flow through Prometheus via the ServiceMonitor enabled in the Helm values above.

A practical workflow: keep hubble observe --verdict DROPPED running in one terminal while you roll out a policy change. Any DROPPED flow gives you the source/destination identity, the dropped port, and the policy that denied it — usually enough to diagnose a misconfigured selector in under a minute.

Migration from Istio Sidecar

Most production teams arriving at Cilium are migrating from Istio. The good news: Istio and Cilium can run side-by-side during the migration, because Istio operates on sidecars and Cilium operates on the CNI. The bad news: you have to translate Istio CRDs to Cilium ones, and some Istio features (most notably outlier detection and complex traffic splitting) need different approaches.

A safe migration sequence:

Audit. Export every VirtualService, DestinationRule, PeerAuthentication, AuthorizationPolicy from your cluster (kubectl get vs,dr,pa,authorizationpolicy -A -o yaml > istio-audit.yaml). Categorize: which rules are L4 only (cleanly translatable), which are L7 (translatable but verify), which use Istio-specific features like outlier detection or fault injection (need a Gateway API or external tool).
Install Cilium alongside Istio. Use cni.exclusive=false so Cilium does not eject the Istio CNI plugin. Both can run.
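Pick a pilot namespace. Choose one with low blast radius and well-understood traffic. Remove the istio-injection=enabled label, redeploy the workloads (sidecars disappear), and mark the namespace as migrated, for example with a cilium.io/managed=true label. Treat that label as a bookkeeping convention for your own tooling; Cilium does not need it to manage the pods.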
Translate policies. Istio PeerAuthentication with mtls.mode=STRICT becomes Cilium authentication.mode=required. AuthorizationPolicy with HTTP rules becomes a CiliumNetworkPolicy with rules.http. VirtualService traffic splitting becomes Gateway API HTTPRoute (Cilium 1.17 supports the full HTTPRoute spec).
Verify with Hubble. Watch the service map and hubble observe --verdict DROPPED --namespace <namespace> for the pilot namespace. Any unexpected DROPPED flow is a missing rule.
Iterate. Move one namespace at a time. After every namespace cuts over cleanly for 48 hours, move to the next.
Uninstall Istio. Once every namespace is on Cilium, istioctl uninstall --purge removes Istiod and the leftover CRDs.
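For the "Translate policies" step, a weighted VirtualService route maps onto weighted backendRefs in an HTTPRoute. The snippet below is a sketch using standard Gateway API fields; every name in it (orders-split, prod-gateway, orders-api-v1, orders-api-v2) is hypothetical, and it assumes a Gateway called prod-gateway already exists in the namespace. Routes for in-mesh traffic may need a different parentRefs target, so check the Cilium Gateway API docs for your case.

```shell
#!/bin/sh
# Emit a Gateway API HTTPRoute that splits traffic 90/10 between two Services.
# All resource names are placeholders for illustration.
render_orders_split() {
  cat <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: orders-split
  namespace: prod
spec:
  parentRefs:
    - name: prod-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api/v1/orders
      backendRefs:
        - name: orders-api-v1
          port: 8080
          weight: 90
        - name: orders-api-v2
          port: 8080
          weight: 10
EOF
}
```

Review the output, then pipe it to kubectl apply -f - once the names match your cluster.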

Plan on 1–2 weeks per major namespace for a careful migration. Teams that have published their migrations (Datadog moved off Istio sidecars to Cilium in 2024, documented in their engineering blog) report the policy-translation step as the longest part.
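Since policy translation dominates the schedule, it is worth sizing it up front from the audit export. The helper below is a quick sketch (count_istio_kinds is a made-up name) that counts resources per kind in istio-audit.yaml. It relies on the plain "kind: ..." lines kubectl writes for each item, so treat the numbers as an estimate rather than an inventory.

```shell
#!/bin/sh
# Count Istio resources by kind in the audit export.
# Reads YAML on stdin and prints "<count> kind: <Kind>", largest first.
count_istio_kinds() {
  grep -Eo 'kind: (VirtualService|DestinationRule|PeerAuthentication|AuthorizationPolicy)$' \
    | sort | uniq -c | sort -rn
}
```

Run it as count_istio_kinds < istio-audit.yaml. A high AuthorizationPolicy count usually means more L7 rules to verify by hand.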

Trade-offs and Gotchas

No technology is free. The honest list:

Kernel version. Cilium 1.17 needs 5.10+ for the full feature set. Clusters on older kernels (4.19-era enterprise and LTS distributions, for example) lose socket-LB, bandwidth manager, and some encryption modes, and may fall below the supported minimum altogether. Plan a node OS upgrade before adopting.
GKE Dataplane V2. GKE’s Dataplane V2 is a Google-managed Cilium fork. You cannot install vanilla Cilium on top — you would have two Ciliums fighting for eBPF maps. Either disable Dataplane V2 (only available on standard clusters, not Autopilot) or live with Google’s slightly older Cilium build.
Cluster Mesh edge cases. Cilium’s Cluster Mesh is rock-solid for L4 multi-cluster service discovery, but L7 cross-cluster (HTTP path-based routing across clusters) is newer and has rough edges with specific CNI configurations. Test the exact topology before committing.
L7 still goes through Envoy. “Sidecarless” does not mean “no proxy” — L7 policy still terminates at the per-node Envoy. The win is one Envoy per node instead of one per pod, not zero proxies.
Debugging eBPF. When something goes wrong at the kernel level, bpftool and cilium-dbg are your friends. Plan for an on-call engineer who knows eBPF basics, or budget time to grow that skill.

Practical Recommendations + checklist

Adopting Cilium 1.17 as your service mesh, ranked by impact:

Start with L4 only. Get pure eBPF L3/L4 policies and Hubble observability working before turning on L7 rules or mTLS. This is 80% of the value and 20% of the complexity.
Use the Helm chart, not the operator. The Helm chart is the supported install path. The Cilium operator is for advanced lifecycle automation, not first-time install.
Pin the version. Cilium minor versions ship every ~3 months; treat upgrades as planned events with a canary and rollback plan.
Run cilium connectivity test after every upgrade. It catches policy and connectivity regressions before users do.
Wire Hubble to your existing observability stack. OTel export to your current backend (Tempo, Datadog, Honeycomb) gives you a single pane of glass.
Document your identity model. SPIFFE IDs are derived from K8s labels — write down which labels are policy-relevant so engineers don’t accidentally break authentication by renaming a label.

Pre-production checklist:

[ ] All nodes on kernel 5.10+
[ ] kube-proxy decision made (replace vs keep)
[ ] cilium connectivity test passing
[ ] Hubble UI reachable for SREs
[ ] At least one CiliumNetworkPolicy deny-by-default in a non-critical namespace
[ ] mTLS verified: hubble observe --verdict DROPPED shows no unexpected "Authentication required" drops
[ ] Prometheus scraping Hubble metrics
[ ] OTel exporter wired to long-term backend
[ ] Runbook for “policy DROPPED a real request” written
[ ] Upgrade rehearsal in staging completed
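For the deny-by-default item, the usual pattern is a policy that selects every pod in the namespace and carries an empty ingress rule, which puts those endpoints into ingress default-deny. The sketch below renders one; the helper name and the sandbox namespace are placeholders, and egress is deliberately left open.

```shell
#!/bin/sh
# Emit a CiliumNetworkPolicy that puts every pod in a namespace into
# ingress default-deny. Usage: render_default_deny [NAMESPACE]
# (hypothetical helper; "sandbox" is a placeholder namespace)
render_default_deny() {
  ns=${1:-sandbox}
  cat <<EOF
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: ${ns}
spec:
  endpointSelector: {}
  ingress:
    - {}
EOF
}
```

Apply it with render_default_deny my-namespace | kubectl apply -f - and keep hubble observe --verdict DROPPED open while you add the allow rules back.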

FAQ

How is Cilium different from Istio Ambient? Both are sidecarless. Istio Ambient uses a userspace ztunnel for L4 and an opt-in waypoint Envoy for L7. Cilium uses eBPF for L4 (kernel-space, no proxy) and a per-node Envoy DaemonSet for L7. Cilium typically has lower p99 latency for L4 traffic; Istio Ambient has tighter integration with Istio’s existing CRD ecosystem. See our deeper comparison in the further reading.

Does Cilium replace kube-proxy? Yes, optionally. With kubeProxyReplacement: true, Cilium implements Service, NodePort, LoadBalancer, and ClusterIP semantics directly in eBPF. This eliminates iptables churn on large clusters and is one of the bigger latency wins. You can also leave kube-proxy in place and run Cilium in CNI-only mode.

What kernel version do I need? 5.10 minimum for the full Cilium 1.17 feature set. 5.15+ is recommended. Some advanced features (e.g., bpf_loop-based optimizations) prefer 6.x. Older kernels boot but lose features — check the Cilium docs feature matrix.

Cluster Mesh vs Submariner for multi-cluster? Cluster Mesh is Cilium-native, supports L4 and (in 1.17) most L7 use cases, and uses your existing Cilium control plane. Submariner is CNI-agnostic and uses IPsec or WireGuard tunnels. Pick Cluster Mesh if you are Cilium-only; pick Submariner if you have heterogeneous CNIs across clusters.

Can Cilium do TLS termination? Yes — via the per-node Envoy DaemonSet and Gateway API HTTPRoute with TLS configuration. The cert lifecycle is managed via standard Kubernetes Secret objects or cert-manager. For workload-to-workload mTLS, SPIFFE SVIDs are managed by SPIRE and rotated automatically (default 1-hour SVID TTL).

Will Cilium work with my existing NetworkPolicy? Yes. Cilium honors upstream NetworkPolicy resources alongside CiliumNetworkPolicy. You can adopt Cilium gradually and migrate policies later.

What is the upgrade path? Cilium supports rolling upgrades within a minor version (1.17.0 → 1.17.x) and from the previous minor to the next (1.16.x → 1.17.x). Skipping a minor version is not supported, so a cluster on 1.15 goes through 1.16 first. Plan for a 30–60 minute rolling restart of the agent DaemonSet on a 50-node cluster.
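The rule in that answer, same minor or the next minor up (for example 1.16.x to 1.17.x), is simple enough to encode as a pre-flight check in an upgrade pipeline. This is a sketch with a hypothetical function name; it compares only the major and minor fields and ignores patch levels.

```shell
#!/bin/sh
# Check that an upgrade stays within the same minor or moves up exactly one.
# Versions are MAJOR.MINOR.PATCH, e.g. 1.16.7. Patch level is not compared.
upgrade_ok() {
  from_major=${1%%.*}; to_major=${2%%.*}
  from_rest=${1#*.};   to_rest=${2#*.}
  from_minor=${from_rest%%.*}; to_minor=${to_rest%%.*}
  step=$((to_minor - from_minor))
  if [ "$from_major" = "$to_major" ] && [ "$step" -ge 0 ] && [ "$step" -le 1 ]; then
    echo "supported"
  else
    echo "unsupported"
  fi
}

upgrade_ok 1.17.0 1.17.2   # supported
upgrade_ok 1.16.7 1.17.2   # supported
upgrade_ok 1.15.9 1.17.2   # unsupported: skips 1.16
```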
