ADR-001: Adopt eBPF-Based Observability for Kubernetes Clusters
Status: Accepted (March 2026)
Primary Keyword: eBPF Kubernetes observability
Archetype: Architecture Decision Record (ADR)
Target Audience: Platform engineers, SREs, DevOps architects evaluating observability stacks
Executive Summary
This Architecture Decision Record documents our organizational decision to standardize on eBPF-based observability as the primary monitoring and instrumentation approach for Kubernetes clusters, superseding traditional agent-based APM platforms (Datadog, New Relic, Dynatrace).
Key findings:
– 90% overhead reduction versus traditional APM agents (CPU and memory)
– Zero instrumentation effort — no SDK integration, no code changes, no deployment delays
– 300% year-over-year adoption growth across CNCF member organizations (2024-2026)
– 67% of large-scale Kubernetes operators already using at least one eBPF observability tool
– Industry validation: Splunk’s OpenTelemetry eBPF Instrumentation (OBI) announcement at KubeCon EU 2026
This decision aligns with industry momentum, operational efficiency requirements, and the maturation of open-source tools (Cilium, Hubble, Pixie, Tetragon) that have achieved production-grade stability.
Problem Statement
Traditional APM approaches impose significant operational and resource constraints on Kubernetes environments:
- Agent overhead: In-process or sidecar agents consume 2-4% CPU and 100-300 MB memory per pod, creating operational burden on high-density clusters
- Instrumentation friction: SDK integration requires code changes, vendor-specific APIs, and multi-team coordination across polyglot services
- Data granularity trade-offs: Agent-based approaches rely on sampling and selective instrumentation, missing tail behavior and low-frequency phenomena
- Onboarding latency: Weeks of engineering effort to instrument microservices across an organization
- Vendor lock-in: Proprietary instrumentation formats and APIs create switching costs
eBPF fundamentally inverts this model: instead of adding instrumentation to applications, eBPF programs run directly in the Linux kernel, observing application behavior transparently and efficiently.
What is eBPF? Foundational Concepts
eBPF as a Kernel Virtual Machine
eBPF (extended Berkeley Packet Filter) is a sandboxed virtual machine that runs programs inside the Linux kernel. Unlike traditional kernel modules, eBPF programs:
- Are loaded dynamically without recompiling the kernel
- Run with kernel privilege but sandboxed execution constraints
- Cannot block or iterate indefinitely
- Have verifiable memory safety guarantees
Think of eBPF as analogous to the JavaScript engine in a web browser—both are runtime sandboxes that execute untrusted code safely. The web browser’s JavaScript engine ensures that malicious scripts cannot access the host filesystem or corrupt browser memory. Similarly, the eBPF verifier ensures that malicious or buggy kernel programs cannot crash the kernel, leak memory, or access unauthorized memory regions.
The Verifier: Static Analysis as a Safety Mechanism
Before any eBPF program executes, it passes through the eBPF verifier—a static analyzer that simulates all possible program paths to guarantee safety.

The verifier performs three critical validation stages:
1. Control Flow Graph (CFG) Validation
The verifier constructs a graph of the program’s control flow and ensures:
– No unbounded loops exist (eBPF programs must provably terminate; bounded loops have been permitted since kernel 5.3)
– No unreachable instructions follow conditional branches
– All jumps land on valid instruction boundaries
2. Execution Path Simulation
The verifier acts as an abstract interpreter, simulating every possible execution path through the program:
– Tracks register state (initialized, uninitialized, scalar, pointer)
– Verifies memory accesses occur within bounds
– Ensures pointers derive from valid kernel data structures
– Validates that stack operations don’t overflow
3. Memory Safety Enforcement
The verifier tracks the “register state” of each processor register throughout all program paths. A register’s state encodes:
– Type: Is it uninitialized, a scalar value, or a pointer to kernel memory?
– Bounds: For pointers, what is the valid memory range this pointer can access?
– Initialization: Has this register been written to before being read?
Example: If your program attempts to dereference a pointer without verifying it’s within a valid range, the verifier rejects it. If a register is read before being initialized, the verifier rejects it.
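The register-state tracking described above can be sketched as a toy abstract interpreter. Everything here — the `RegState` class, the instruction names, the ten-register model — is illustrative and enormously simplified; the real verifier lives in the kernel and tracks far richer state across every branch:

```python
# Toy model of eBPF verifier register-state tracking (illustrative only).
from dataclasses import dataclass

@dataclass
class RegState:
    kind: str                 # "uninit", "scalar", or "ptr"
    bounds: tuple = (0, 0)    # valid (lo, hi) byte offsets, for pointers

def verify(program):
    """Walk one straight-line instruction list, tracking register state."""
    regs = {i: RegState("uninit") for i in range(10)}   # r0-r9
    for insn in program:
        op = insn[0]
        if op == "mov_imm":                # r[dst] = constant
            regs[insn[1]] = RegState("scalar")
        elif op == "load_map_ptr":         # r[dst] = ptr to 8-byte map value
            regs[insn[1]] = RegState("ptr", (0, 8))
        elif op == "deref":                # read *(r[src] + off)
            src, off = insn[1], insn[2]
            r = regs[src]
            if r.kind == "uninit":
                return f"rejected: r{src} read before initialization"
            if r.kind != "ptr":
                return f"rejected: r{src} is not a pointer"
            if not (r.bounds[0] <= off < r.bounds[1]):
                return f"rejected: offset {off} outside {r.bounds}"
    return "accepted"
```

An in-bounds dereference of a map-value pointer is accepted; reading an uninitialized register, dereferencing a scalar, or stepping past the pointer's bounds is rejected — the same three failure modes listed above.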
Post-Verification Hardening
Upon successful verification, the kernel applies two hardening steps:
1. Just-in-Time (JIT) Compilation: The eBPF bytecode is compiled to native machine instructions for the host CPU architecture (x86-64, ARM, RISC-V), eliminating interpreter overhead and delivering near-native performance.
2. Read-only Memory Protection: The kernel memory page holding the compiled eBPF program is marked read-only. Any attempt to modify the program after loading triggers a kernel fault rather than silent corruption.
This combination ensures that an eBPF program, once loaded and JIT-compiled, is as efficient and trustworthy as natively compiled kernel code—but without the audit burden and deployment friction of kernel modules.
eBPF Maps: Kernel Data Persistence
eBPF programs operate in the kernel but need a mechanism to store state and communicate results to userspace. This is where eBPF maps come in.
What Are eBPF Maps?
eBPF maps are in-kernel data structures that persist across eBPF program invocations and serve as the bridge between kernel and userspace. A map is allocated in kernel memory (via the bpf() syscall) and is accessible to both:
– eBPF programs (reading and writing via helper functions like bpf_map_lookup_elem())
– Userspace processes (via bpf() syscall or memory-mapped file handles)
Maps are analogous to a shared queue or mailbox: the eBPF program writes observations (network flows, function calls, system events) into the map, and a userspace collector reads and processes them.

Map Types and Use Cases
BPF_MAP_TYPE_HASH and BPF_MAP_TYPE_ARRAY
General-purpose key-value storage for maintaining state (connection tracking, per-service metrics, request counters). Hash maps provide O(1) lookup by key; array maps use fixed integer indices.
BPF_MAP_TYPE_PERCPU_HASH and BPF_MAP_TYPE_PERCPU_ARRAY
Per-CPU variants where each logical CPU in the system has its own copy of the map. This eliminates race conditions without locks—multiple CPUs can write simultaneously to their own memory regions without contention. Essential for high-throughput metrics collection.
BPF_MAP_TYPE_RINGBUF
A modern, lock-free circular buffer for streaming events from kernel to userspace with zero-copy semantics. Events are written to the ring buffer, and userspace memory-maps it and consumes events in place, without an extra copy. Critical for high-cardinality event streams (HTTP requests, syscalls).
BPF_MAP_TYPE_PERF_EVENT_ARRAY
An earlier event streaming mechanism (predecessor to the ring buffer) that uses per-CPU perf buffers. Slightly less efficient than the ring buffer but still acceptable for many observability workloads.
Memory Safety in Maps
eBPF programs cannot access arbitrary kernel memory—they can only manipulate their context (registers, stack) and kernel data structures exposed through maps. The verifier enforces this:
– Pointer dereferences must be explicitly validated
– Bounds checks are mandatory for array access
– Memory accesses use kernel-controlled copy functions (bpf_probe_read_kernel(), bpf_probe_read_user())
This design prevents eBPF programs from corrupting kernel state while allowing deep observability into system behavior.
Entry Points: How eBPF Hooks Into Kernel Events
eBPF doesn’t exist in a vacuum—it needs attachment points where programs are triggered. The Linux kernel provides multiple mechanisms:

kprobes and uprobes: Function Instrumentation
kprobes attach eBPF programs to arbitrary kernel functions (system calls, device drivers). When the kernel executes the target function, the kprobe fires, executes the attached eBPF program, and continues.
uprobes are the userspace equivalent—they attach eBPF programs to functions within userspace binaries. For example, you can attach a uprobe to the do_HTTP_read() function in your Go application, and every time that function executes, the eBPF program runs with access to the function’s registers and stack.
Overhead: Zero when no probe is attached (the target instruction is left unpatched); once attached, each invocation incurs a small cost for the breakpoint or trampoline that redirects execution into the eBPF program.
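As a loose userspace analogy — not how uprobes are actually implemented, since real uprobes patch the instruction stream of the running binary — here is a hook that fires on function entry, records metadata, and lets the function continue untouched:

```python
# Illustrative analogy for a uprobe: observe a function's entry without
# editing its body. Real uprobes do this at the machine-code level.
import functools

events = []  # stands in for an eBPF map read by a collector

def probe(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        events.append({"fn": fn.__name__, "args": args})  # observe entry
        return fn(*args, **kwargs)                        # continue as normal
    return wrapper

@probe
def handle_request(path):
    return f"200 OK {path}"

handle_request("/api/items")
```

The application code (`handle_request`) is unchanged; observation happens entirely in the wrapper — which is exactly the property that makes uprobe-based tooling "zero instrumentation" from the application's point of view.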
Tracepoints: Predefined Kernel Events
Kernel tracepoints are static instrumentation points compiled into the kernel—they mark important events (process creation, file open, network packet transmission). eBPF programs attach to these predefined tracepoints without modifying kernel code.
Tracepoints are more efficient than kprobes because they’re already present in the kernel’s execution path; kprobes require a CPU exception handler to be invoked.
XDP: Packet Processing at Driver Level
XDP (eXpress Data Path) is an early packet processing framework where eBPF programs attach at the network driver level—before the packet enters the kernel’s network stack. XDP programs can:
– Drop packets early (DDoS mitigation)
– Redirect packets between interfaces
– Parse and modify packet headers
– Collect packet statistics
XDP runs in the critical path of packet processing, so eBPF programs must be extremely efficient (sub-microsecond execution).
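The early-drop use case can be sketched as a verdict function: inspect a few header fields, decide drop or pass before the kernel stack ever sees the packet. XDP_DROP and XDP_PASS are real verdict names, but the numeric values, packet dictionaries, and blocked-port policy below are illustrative — a real XDP program is C operating on raw packet bytes:

```python
# Toy XDP-style verdict: drop packets to blocked ports at the "driver".
XDP_DROP, XDP_PASS = 1, 2

BLOCKED_PORTS = {23, 445}          # assumed policy: e.g. telnet, SMB

def xdp_verdict(pkt):
    if pkt["dst_port"] in BLOCKED_PORTS:
        return XDP_DROP            # early drop: the stack never sees it
    return XDP_PASS                # hand the packet to the network stack

packets = [
    {"src": "10.0.0.5", "dst_port": 443},
    {"src": "10.0.0.9", "dst_port": 23},
]
verdicts = [xdp_verdict(p) for p in packets]
```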
tc eBPF: Traffic Control and Egress/Ingress Filtering
The traffic control (tc) subsystem uses eBPF programs to implement packet scheduling, queuing, and filtering at the kernel’s egress and ingress points. Cilium (Kubernetes CNI) heavily relies on tc eBPF for policy enforcement.
Kubernetes-Centric eBPF Observability Stack
The production-ready eBPF stack for Kubernetes in 2026 is built from four complementary CNCF projects:

Cilium + Hubble: Network Observability and CNI
Cilium is a Kubernetes CNI (Container Network Interface) that replaces the default networking layer with eBPF-powered networking. Instead of iptables and userspace proxies, Cilium uses eBPF programs to enforce network policies, load-balance traffic, and provide L3-L7 service routing—all in kernel space.
Hubble is Cilium’s observability component. While Cilium’s eBPF programs enforce policies, Hubble’s eBPF programs observe network flows, generating a stream of network events:
– Flow metadata: Source/destination IP, port, protocol, DNS names, service names
– L7 parsing: HTTP/HTTPS request paths, gRPC service names, Kafka topics
– Latency metrics: Per-flow p50/p95/p99 latencies
– Traffic direction: Ingress, egress, and intra-cluster flows
Hubble outputs flow events to a ring buffer, which a userspace collector aggregates into:
– Service dependency maps (who calls whom)
– Network latency heatmaps
– Policy violation alerts
Cost: 1-2% CPU overhead per node (for both Cilium and Hubble combined), plus ~20 MB memory. Scales sublinearly with cluster size.
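The flow-to-service-map aggregation described above can be sketched in a few lines. The event fields here are illustrative — real Hubble flows carry far richer metadata (DNS names, verdicts, L7 attributes):

```python
# Fold Hubble-style flow events into a "who calls whom" dependency map.
from collections import defaultdict

flows = [
    {"src": "frontend", "dst": "cart",    "latency_ms": 4.2},
    {"src": "frontend", "dst": "catalog", "latency_ms": 2.8},
    {"src": "cart",     "dst": "redis",   "latency_ms": 0.6},
    {"src": "frontend", "dst": "cart",    "latency_ms": 5.1},
]

deps = defaultdict(list)
for f in flows:
    deps[(f["src"], f["dst"])].append(f["latency_ms"])   # group by edge

service_map = {
    edge: {"calls": len(lat), "avg_ms": round(sum(lat) / len(lat), 2)}
    for edge, lat in deps.items()
}
```

Each key of `service_map` is one edge in the dependency graph, with call counts and latency aggregates attached — the raw material for the service maps and latency heatmaps listed above.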
Pixie: Application-Level Observability and APM
Pixie provides zero-instrumentation APM for Kubernetes. Unlike traditional APM agents that require SDK integration, Pixie uses eBPF uprobes to intercept function calls in application binaries.
What Pixie Captures:
– HTTP/1.1, HTTP/2, gRPC requests: Full request/response bodies and headers
– Database queries: SQL queries sent by your application
– Redis commands: Commands issued to in-memory caches
– Message queues: Kafka produce/consume calls
Pixie captures 100% of requests (not sampled) by storing them in a rolling in-memory buffer (8 GB default) on each node. After 60 seconds, the buffer rolls, and older entries are discarded. For long-term retention, aggregated metrics (latencies, error rates) are exported.
Data Collection Mechanism:
Pixie’s eBPF programs attach uprobes to runtime symbols in language runtimes (Go, Python, Java, Node.js, .NET). When a function like Go’s net/http.ReadRequest executes, Pixie’s eBPF code runs in-kernel and captures request metadata directly from memory—without requiring code instrumentation.
Cost: 1-3% CPU overhead per node, 50-100 MB memory. The memory cost scales with request volume and buffer size (user-tunable).
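The "capture everything, keep a short window" model above can be sketched as a time-based expiry plus an aggregation step that runs before entries roll off. The 60-second window mirrors the behavior described; function names and data shapes are illustrative:

```python
# Rolling-window capture: every request is recorded, old entries expire,
# and only aggregates survive for long-term retention.
WINDOW_SECONDS = 60

def expire(buffer, now):
    """Drop entries older than the rolling window."""
    return [(ts, ms) for ts, ms in buffer if now - ts < WINDOW_SECONDS]

def aggregate(buffer):
    """Reduce raw requests to exportable metrics before they roll off."""
    latencies = sorted(ms for _, ms in buffer)
    return {
        "count": len(latencies),
        "p50_ms": latencies[len(latencies) // 2],
        "max_ms": latencies[-1],
    }

# (timestamp_seconds, latency_ms) pairs captured on one node
buffer = [(0, 12.0), (30, 8.0), (55, 20.0), (90, 9.0)]
buffer = expire(buffer, now=100)   # entries at ts=0 and ts=30 roll off
```

Note the trade-off this models: raw per-request detail is only available inside the window, so anything you want to keep longer must be aggregated before expiry.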
Tetragon: Runtime Security and Syscall Observability
Tetragon is Cilium’s security observability engine. It uses eBPF to monitor and optionally enforce syscall-level behavior:
Observability:
– Track file access (open, unlink, chmod)
– Monitor process execution and parent-child relationships
– Log network connections and bind events
– Detect and record container escapes
Enforcement:
– Block unauthorized system calls
– Deny file access based on policy
– Prevent process execution
– Audit trails for compliance (PCI-DSS, HIPAA, SOC2)
Tetragon is particularly powerful for detecting runtime anomalies: if a containerized application suddenly starts executing curl or wget, Tetragon can alert on or block the execution.
Cost: 0.5-1% CPU overhead per node, 20-50 MB memory.
Splunk OBI: Standardized eBPF Instrumentation (OpenTelemetry)
At KubeCon EU 2026, Splunk announced the beta launch of OpenTelemetry eBPF Instrumentation (OBI), co-developed with Grafana Labs (who donated Beyla to the OpenTelemetry project).
OBI standardizes eBPF instrumentation around OpenTelemetry, the CNCF’s vendor-neutral observability standard. Instead of each tool (Pixie, Cilium, Tetragon) exporting telemetry in its own format, OBI ensures all eBPF observability converges on OTLP (OpenTelemetry Protocol).
What OBI Provides:
– Automatic distributed tracing via eBPF (no SDK required)
– RED metrics (Rate, Errors, Duration) per service
– Support for all major languages (Go, Java, Python, Node.js, .NET, Ruby, C/C++, Rust)
– Vendor-agnostic export via OTLP
Why It Matters: OBI decouples eBPF observability from any single vendor. You deploy OBI, get observability data in standard OTLP format, and choose any backend (Grafana, Prometheus, Loki, Splunk, New Relic) to consume the data.
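Deriving RED metrics from auto-captured spans is simple aggregation; a sketch of the shape of that computation follows. The span fields and window size here are illustrative, not the OTLP schema:

```python
# Compute RED metrics (Rate, Errors, Duration) per service from spans.
spans = [
    {"service": "checkout", "status": 200, "duration_ms": 35},
    {"service": "checkout", "status": 500, "duration_ms": 120},
    {"service": "checkout", "status": 200, "duration_ms": 41},
    {"service": "catalog",  "status": 200, "duration_ms": 12},
]

WINDOW_S = 60  # assumed metrics window

def red_metrics(spans, service):
    mine = [s for s in spans if s["service"] == service]
    errors = [s for s in mine if s["status"] >= 500]
    return {
        "rate_per_s": len(mine) / WINDOW_S,                  # Rate
        "error_ratio": len(errors) / len(mine),              # Errors
        "avg_duration_ms":                                   # Duration
            sum(s["duration_ms"] for s in mine) / len(mine),
    }
```

In an OBI deployment these aggregates would be emitted as OTLP metrics, leaving the backend (Grafana, Prometheus, Splunk, etc.) free to store and visualize them however it likes.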
Architectural Comparison: eBPF vs. Traditional APM
The performance and operational differences between eBPF-based and traditional APM approaches are substantial:

Resource Overhead
Traditional APM (Agent-based):
– CPU: 2-4% per pod (Datadog, New Relic agents)
– Memory: 100-300 MB per pod
– Network bandwidth: Continuous agent-to-backend communication
– Scales linearly with pod count
Example: A 100-node cluster with 50 pods per node requires 5,000 running agent processes, each consuming resources.
eBPF-based Approach:
– CPU: 0.5-1% per node (for all observability combined: Cilium, Hubble, Pixie, Tetragon)
– Memory: 10-50 MB per node for eBPF kernel programs
– Network bandwidth: Aggregated, filtered telemetry only
– Scales sublinearly with pod count
The same 100-node cluster requires only 100 collector agents (one per node), each loading a small set of eBPF programs that filter and aggregate in kernel space.
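The back-of-envelope arithmetic for the cluster above, using the per-pod and per-node figures quoted in this section (midpoints of the stated ranges are an assumption):

```python
# Agent count and memory footprint: per-pod agents vs per-node collectors.
nodes, pods_per_node = 100, 50

# Agent-based APM: one agent per pod, ~200 MB each (midpoint of 100-300).
apm_agents = nodes * pods_per_node
apm_memory_gb = apm_agents * 200 / 1024

# eBPF-based: one collector per node, ~30 MB each (midpoint of 10-50).
ebpf_collectors = nodes
ebpf_memory_gb = ebpf_collectors * 30 / 1024

print(f"agents: {apm_agents} vs collectors: {ebpf_collectors}")
print(f"memory: {apm_memory_gb:.0f} GB vs {ebpf_memory_gb:.1f} GB")
```

Roughly a gigabyte of agent memory per two nodes versus a few gigabytes cluster-wide — the source of the "scales sublinearly" claim above.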
Performance Benchmarks (groundcover Flora study, 2025):
– Flora (eBPF): +9% CPU, +0% memory overhead
– Datadog agent: +249% CPU, +227% memory overhead
– OpenTelemetry auto-instrumentation: +59% CPU, +27% memory overhead
– Pixie agent: +32% CPU, +9% memory overhead
Bottom line: eBPF achieves 90% less overhead than traditional APM agents.
Instrumentation Effort
Traditional APM:
– SDK integration per language: weeks of engineering
– Code review and testing cycles
– Dependency management and versioning
– Vendor-specific APIs to learn
– Application restarts required
eBPF:
– Deploy eBPF programs and collectors
– Zero application code changes
– Deploy in minutes, not weeks
– Vendor-agnostic (OTLP standard)
– No application restart needed
Data Granularity
Traditional APM:
– Sampling: Often 10-50% of requests captured
– SDK-limited: Only what the SDK explicitly instruments
– Lower cardinality: Missing tail behavior, rare error paths
eBPF:
– 100% request capture (for 60-second rolling windows)
– Kernel-level visibility: All function calls, syscalls, network operations
– Higher cardinality: Tail latencies, rare error conditions visible
Why eBPF Is Winning: Adoption Metrics and Momentum
CNCF Survey Data (2024-2026)
- 67% adoption: Of organizations running Kubernetes at large scale, 67% use at least one eBPF observability tool
- 300% YoY growth: eBPF observability adoption grew 300% year-over-year from 2024 to 2026
- Vendor adoption: Datadog, New Relic, and Dynatrace all announced eBPF-native observability modes in 2025-2026
Industry Momentum
KubeCon EU 2026 (Amsterdam):
– Splunk announced OBI (OpenTelemetry eBPF Instrumentation) in beta
– CiliumCon (co-located event) focused on eBPF production deployments
– OpenSSF (Open Source Security Foundation) featured Tetragon in security talks
Public Cloud Defaults:
– Google GKE: eBPF mode enabled by default
– Amazon EKS: eBPF support GA (generally available)
– Microsoft AKS: eBPF networking available
Open Source Maturity
- Cilium: CNCF graduated project (highest maturity level)
- Hubble: Stable observability platform
- Pixie: CNCF incubation, production deployments at scale
- Tetragon: CNCF incubation, runtime security workloads
- OpenTelemetry eBPF (Beyla): Donated to CNCF, vendor-backed
Decision Matrix and Trade-off Analysis
The following table summarizes key trade-offs:
| Criterion | Traditional APM | eBPF-based |
|---|---|---|
| CPU Overhead | 2-4% per pod | 0.5-1% per node |
| Memory Overhead | 100-300 MB per pod | 10-50 MB per node |
| Instrumentation Effort | Weeks (SDKs, code changes) | Minutes (deploy collectors) |
| Data Granularity | Sampled (10-50%) | 100% capture (windowed) |
| Vendor Lock-in | High (proprietary APIs) | Low (OTLP standard) |
| Language Support | Limited (SDK per language) | Universal (kernel-level) |
| Polyglot Support | Difficult (per-language SDKs) | Automatic (kernel intercepts all) |
| Learning Curve | Vendor-specific | Vendor-agnostic (OTLP) |
| Production Readiness | Mature | GA (2026) |
| Scaling Efficiency | Linear | Sublinear |
Recommendation: For any Kubernetes cluster with >20 nodes or >500 pods, eBPF-based observability delivers superior operational efficiency and lower TCO.
Implementation Architecture

Recommended Deployment Model (Four Layers)
Layer 1: Network Observability (Cilium + Hubble)
– Deploy Cilium as the CNI
– Enable Hubble for flow visibility
– Output: Service maps, network latency metrics, policy audit logs
Layer 2: Application Observability (Pixie)
– Deploy Pixie agents on all nodes
– Attach uprobes to application runtimes
– Output: Distributed traces, request latencies, service dependencies
Layer 3: Security Observability (Tetragon)
– Deploy Tetragon for runtime monitoring
– Define policies for syscall enforcement
– Output: Audit trails, security events, compliance logs
Layer 4: Standardized Telemetry (Splunk OBI / OpenTelemetry)
– Use OBI or vendor-backed eBPF instrumentation
– Export all telemetry via OTLP
– Decouple backend choice (Grafana, Prometheus, Splunk, etc.)
Data Flow
- Kernel: eBPF programs execute, generating events
- Maps: Events accumulate in ring buffers and hash maps
- Collectors: Userspace daemons read from maps via bpf() syscalls
- Aggregation: Events are filtered, sampled, and aggregated in-kernel (via eBPF) or in-process
- Export: Aggregated telemetry shipped to backends via OTLP or vendor protocols
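Steps 3-5 of the data flow can be sketched in miniature: the collector drains raw events, filters noise, and aggregates before export, which is why only a fraction of raw event volume ever leaves the node. Event shapes and the health-check filter below are illustrative:

```python
# Collector-side filter + aggregate: raw events in, compact payload out.
from collections import Counter

raw_events = [  # what the kernel-side eBPF programs produced
    {"verb": "GET",  "path": "/healthz",      "status": 200},
    {"verb": "GET",  "path": "/api/v1/items", "status": 200},
    {"verb": "POST", "path": "/api/v1/items", "status": 201},
    {"verb": "GET",  "path": "/healthz",      "status": 200},
]

# Filter: drop health-check noise before it leaves the node.
interesting = [e for e in raw_events if e["path"] != "/healthz"]

# Aggregate: export counts per (verb, status) instead of raw events.
export_payload = Counter((e["verb"], e["status"]) for e in interesting)
```

Four raw events become two exported counters; at production request rates the same pattern shrinks export bandwidth by orders of magnitude.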
Operational Considerations
Deployment:
– eBPF collectors ship as DaemonSets; the agent pod on each node loads the eBPF programs at startup
– No kernel module recompilation or system reboots needed
– Rollback is instantaneous (unload eBPF program)
Security:
– Verifier prevents malicious programs from loading
– RBAC controls who can load eBPF programs
– Audit trails (Tetragon) track all program loads
Performance Monitoring:
– Monitor eBPF program execution latency via kernel metrics
– Track map memory usage and collision rates
– Alert on verifier rejections (often an incompatible kernel version or a missing kernel feature)
Known Limitations and Edge Cases
Kernel Version Dependency
eBPF features depend on Linux kernel version:
– Kernel 4.x: foundational eBPF (maps, kprobes, tracepoints), added incrementally across the 4.x series
– Kernel 5.8+: Ring buffers (zero-copy event streaming)
– Kernel 5.10+: BPF CO-RE (portable eBPF bytecode across kernel versions)
Mitigation: Use BPF CO-RE to ensure eBPF programs work across kernel versions without recompilation.
Verifier Complexity
As eBPF programs grow in complexity, the verifier may reject valid programs due to state explosion. Example: deeply nested branches or complex pointer arithmetic may be rejected even when the program is actually safe.
Mitigation: Use eBPF helper functions (kernel-provided utilities) instead of complex in-program logic. Keep programs focused and modular.
Language-Specific Challenges
Not all languages expose userspace symbols equally:
– Go: Goroutines have complex memory layouts; uprobes sometimes capture inconsistent state
– Python: Interpreted; function calls don’t always map to memory boundaries
– Java: JIT compilation means function addresses change dynamically
Mitigation: Newer tools like OBI use language-specific plugins to handle these edge cases.
Sampling and Tail Loss
Pixie and ring buffer implementations use rolling windows. Under extreme load, older events are discarded before userspace reads them.
Mitigation: Increase buffer sizes (tunable at deployment time) or use kernel-side filtering to reduce event volume.
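The tail-loss failure mode is easy to demonstrate: when the producer outpaces the consumer in a fixed-size buffer, the oldest events are silently discarded. Here `collections.deque(maxlen=...)` stands in for the kernel ring buffer (a simplification — the real buffer holds variable-size records and signals the reader):

```python
# Tail loss under burst load: 10 events into a 4-slot buffer, no reader.
from collections import deque

BUFFER_SLOTS = 4                      # tunable at deployment time
ring = deque(maxlen=BUFFER_SLOTS)

dropped = 0
for event_id in range(10):            # burst of 10 events, reader stalled
    if len(ring) == ring.maxlen:
        dropped += 1                  # oldest entry about to be evicted
    ring.append(event_id)

# Only the newest BUFFER_SLOTS events survive; the rest were lost.
print(list(ring), "dropped:", dropped)   # → [6, 7, 8, 9] dropped: 6
```

Doubling `BUFFER_SLOTS` halves the loss for the same burst — which is exactly the buffer-size tuning the mitigation above describes, traded against per-node memory.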
Conclusion and Recommendation
We recommend adopting an eBPF-first observability strategy for all new Kubernetes deployments and migrating existing workloads over 12 months.
Key Rationale
- Operational Efficiency: 90% less overhead than traditional APM
- Zero Instrumentation: No SDK integration, no code changes, no deployment delays
- Industry Validation: 300% YoY adoption, CNCF standardization, vendor consensus
- Production Readiness: Cilium is CNCF graduated; Pixie and Tetragon are mature; OBI announced at KubeCon EU 2026
- Vendor Flexibility: OpenTelemetry standard decouples eBPF collection from backend choice
- Scalability: Sublinear resource growth with cluster size and pod count
Migration Path
Phase 1 (Months 1-3): Pilot eBPF observability on non-critical cluster using Cilium + Hubble for network observability.
Phase 2 (Months 4-6): Add Pixie for application tracing. Validate data quality and export pipeline.
Phase 3 (Months 7-9): Deploy Tetragon for security observability. Train security teams on policy enforcement.
Phase 4 (Months 10-12): Migrate backend choice to vendor-neutral OTLP. Deprecate legacy APM agents.
Success Metrics
- Resource utilization: 90% reduction in agent CPU/memory overhead
- Instrumentation time: Reduce onboarding time for new services from weeks to minutes
- Data completeness: Achieve 100% request capture (for rolling windows)
- Adoption: 100% of new Kubernetes deployments use eBPF observability
References and Further Reading
CNCF and Industry Reports
- CNCF Observability TAG: eBPF Adoption Survey (2025)
- Splunk Introduces OpenTelemetry eBPF Instrumentation and Kubernetes Operator at KubeCon EU 2026
- KubeCon EU 2026: Kubernetes Matures – BSD, eBPF, and mTLS
Technical Documentation
- eBPF Official Documentation
- eBPF Verifier Reference
- eBPF Maps Documentation
- Cilium Project Documentation
- Hubble: Network & Security Observability
- Pixie Observability Platform
- Tetragon: Security Observability
- OpenTelemetry eBPF Instrumentation (Beyla)
Academic and Deep-Technical References
- The eBPF Runtime in the Linux Kernel (arXiv)
- End-to-end Mechanized Proof of an eBPF Virtual Machine
- Tigera eBPF Guides
- Red Hat Introduction to eBPF
Comparisons and Case Studies
- eBPF-Based Kubernetes Observability Guide: From Cilium Hubble to Tetragon
- O’Reilly: Observability Engineering with Cilium
- Flora: The new eBPF Observability Sensor from groundcover (2025)
- Building a Production eBPF Observability & Security Stack for Kubernetes (2026)
Appendix: eBPF Glossary
Bytecode: Machine-independent instruction format; eBPF programs are compiled to bytecode, then JIT-compiled to native instructions.
Kprobe: Kernel probe; attaches eBPF programs to arbitrary kernel functions.
Uprobe: Userspace probe; attaches eBPF programs to functions in userspace binaries.
Tracepoint: Static instrumentation point in kernel code; predefined and more efficient than kprobes.
XDP (eXpress Data Path): Early packet processing framework where eBPF programs run at the network driver level.
Ring Buffer: Lock-free circular buffer for streaming events from kernel to userspace with zero-copy semantics.
JIT Compilation: Just-in-time compilation of eBPF bytecode to native machine instructions for the CPU architecture.
Verifier: Static analyzer that ensures eBPF programs are safe to execute before loading into the kernel.
Helper Functions: Kernel-provided utilities callable from eBPF programs (e.g., bpf_map_lookup_elem(), bpf_get_current_pid_tgid()).
BPF CO-RE (Compile Once, Run Everywhere): Technology enabling eBPF programs to run across multiple kernel versions without recompilation.
OTLP (OpenTelemetry Protocol): Standard protocol for exporting observability data (traces, metrics, logs) in a vendor-agnostic format.
Document Version: 1.0
Last Updated: April 17, 2026
Status: Accepted
Next Review Date: October 2026
