Continuous Profiling with eBPF: Flamegraphs in Prod

The bug that costs you the most money is rarely the one that pages you. It is the function quietly burning 14% of every CPU across a thousand pods, the allocation pattern that doubled your node count after a routine release, the lock contention that only appears at peak. Metrics tell you the box is hot. Traces tell you which request was slow. Neither tells you which line of code is the culprit. Continuous profiling with eBPF closes that gap: it samples stack traces from every process on every node, all the time, in production, and turns them into flamegraphs you can query like any other signal.

The old objection was overhead. You profiled in staging, captured a 60-second snapshot, and hoped the workload looked like prod. eBPF removes the excuse: the kernel does the sampling, no code changes are required, and the cost lands near 1% of CPU.

What this covers: how always-on profiling works at the kernel level, the pprof format and the new OpenTelemetry profiling signal, the trade-offs of frame pointers versus DWARF unwinding, and how to deploy Grafana Pyroscope and Parca on Kubernetes with real config.

Context and Background

Profiling is not new. perf, gprof, Java Flight Recorder, Go’s net/http/pprof, and py-spy have existed for years. What is new is making profiling continuous and fleet-wide rather than a thing you reach for after an incident. The pattern was popularized by Google’s internal Google-Wide Profiling, described in a 2010 paper showing that always-on, low-overhead sampling across a datacenter pays for itself many times over in capacity saved.

The incumbents split into two camps. Language-runtime profilers (Go pprof, JFR, async-profiler for the JVM) are accurate and rich but require per-language integration and often per-process configuration. Kernel-level profilers built on eBPF — Parca Agent, Grafana Pyroscope’s eBPF mode, Polar Signals, Elastic Universal Profiling, and the OpenTelemetry eBPF profiler — sample every process on a node from one agent, regardless of language, with no recompilation.

The economic argument behind the second camp is the part most teams miss. The Google-Wide Profiling paper made the case that even a sub-1% optimization, applied across a fleet of tens of thousands of machines, justifies the entire profiling infrastructure many times over. At cloud scale, where compute is a line item you can read on an invoice, the same logic holds: if continuous profiling helps you find and remove 10% of wasted CPU across a large deployment, the savings dwarf the storage and agent cost by a wide margin. This reframes profiling from a developer convenience into an infrastructure investment with a measurable return, which is why platform and FinOps teams — not just individual engineers debugging a slow endpoint — are the ones driving adoption in 2026.

It helps to put real numbers on that argument, because the ROI is easy to assert and harder to feel. Take a fleet of 500 m6i.4xlarge nodes (16 vCPU each, 8,000 vCPU total) running steadily at on-demand pricing of roughly $0.77/hour per node. At about 730 hours in a month, that is roughly $281,000 a month before any committed-use discount. Continuous profiling on those nodes costs you the agent overhead (around 1% of CPU, so roughly five nodes’ worth, about $2,800/month) plus object storage and query infrastructure for the profile archive (a few hundred dollars a month for days of retention). Now suppose a quarter of profiling-driven review turns up a serialization hot path, a needless JSON re-parse, and an over-eager retry loop that together account for 12% of fleet CPU. Removing them lets you shrink the fleet by 60 nodes, which saves about $34,000/month against a profiling bill under $4,000. That is roughly a 10x return in the first quarter, and the agent overhead you were nervous about is a rounding error inside the win. The point is not that every quarter yields a 12% find; it is that even a single 2-3% find a year more than pays for always-on profiling forever.

eBPF (extended Berkeley Packet Filter) is the enabling technology. It lets you load small, verified programs into the Linux kernel that run on events — here, a hardware or software timer firing on each CPU. The kernel verifier statically proves each program terminates and touches only memory it is allowed to, which is why running arbitrary logic in kernel context is safe enough for production. When the timer fires, your program walks the current stack and records it. Because the work happens in kernel context on an interrupt, there is no agent polling each process and no instrumentation in the application.

The verifier deserves more than a passing mention, because its constraints shape every profiler built on eBPF. It is a static analyzer that explores all reachable paths through your program before the kernel will load it, proving three things: that the program always terminates (no unbounded loops), that every memory access is within bounds the verifier can track, and that the program respects the helper-call and context rules for the program type. Historically the verifier capped programs at a few thousand instructions and rejected unbounded loops outright, which is exactly why a stack unwinder could not just while (frame) { ... } its way up the call chain — it had to be written as a bounded loop with a hard frame ceiling. Bounded loops and a one-million-instruction complexity budget on modern kernels relaxed this somewhat, but the core discipline remains: a profiler’s in-kernel code path is a small, provably-terminating program that does the cheapest possible work and defers everything else. That is not a limitation engineers route around; it is the reason the whole approach is safe enough to run in production at all.

If you are already running eBPF for networking or security observability, profiling is a natural next layer; see our ADR on eBPF for Kubernetes observability for how it displaces traditional APM agents. The CNCF ecosystem has converged hard on this approach, and the OpenTelemetry profiling signal — now in public alpha as of March 2026 — makes profiles the fourth observability signal alongside metrics, logs, and traces. The original thesis of this article is simple but underappreciated: the value of profiling is not in the snapshot, it is in the continuity. A single flamegraph tells you where time goes right now; a continuous archive lets you answer questions you did not know to ask — which release regressed, which tenant is expensive, which hot path crept up over a quarter. That archival property, not the flamegraph itself, is what turns profiling from a debugging tool into a capacity-planning and FinOps instrument.

How eBPF Continuous Profiling Works

Continuous profiling with eBPF works by attaching a small eBPF program to a per-CPU perf_event timer that fires at a fixed frequency (commonly 19–100 Hz). On each tick the program captures the running thread’s user-space and kernel stack, hashes it into a BPF map, and a userspace agent periodically drains those counts and emits them as pprof-formatted profiles. No application changes, no recompilation.

Figure 1: The end-to-end path of a continuous profiling eBPF pipeline — a kernel timer drives an eBPF program that walks stacks, which are symbolized, encoded as pprof, stored, and rendered as flamegraphs.

The diagram traces one sample’s journey. An unmodified production binary is running. The kernel’s perf_event subsystem fires a timer interrupt on a CPU. The attached eBPF program executes in kernel context, walks the stack, and records it. The raw addresses are later mapped to function names (symbolization), encoded into the pprof format, shipped to a profile store backed by object storage, and finally rendered in Grafana as flamegraphs and differential views. Each stage has real engineering subtlety, and the next three subsections unpack the parts that matter most.

It helps to contrast this with how a traditional APM agent works. The APM model injects a library or bytecode agent into each process, which instruments specific functions and emits spans — accurate for the paths it covers, but blind to everything it was not told to watch, and carrying a per-call cost that grows with how much you instrument. The eBPF model inverts that: it watches everything at a fixed, traffic-independent cost, because the sampling rate is set by a timer, not by how many functions execute. A service handling ten times the requests does not generate ten times the profiling overhead; it generates the same number of timer ticks per second per core. That decoupling of overhead from workload volume is the structural reason eBPF profiling stays cheap under load exactly when you most want visibility.

Sampling, not instrumentation

A profiler answers “where is CPU time spent?” with statistics, not exhaustive tracing. If you sample the call stack 19 times a second across thousands of cores, the functions that appear most often are, by the law of large numbers, the functions consuming the most CPU. This is the central trick that makes the overhead so low: you are not recording every function entry and exit (that is tracing, and it is expensive). You are taking a periodic snapshot.

Parca Agent, for example, observes user-space and kernel-space stack traces 19 times per second per core and builds pprof profiles from the aggregated data. Pyroscope’s eBPF mode defaults to a similar low frequency. The sampling rate is a knob: higher frequency means finer resolution for short-lived functions but more overhead. For fleet-wide always-on use, low frequencies (around 19–100 Hz) are the sweet spot, because you accumulate statistical confidence over time and across many cores rather than within a single short window.

The choice of 19 Hz rather than a round 20 Hz is a deliberate anti-aliasing trick worth understanding. Many workloads have periodic behavior — a 50 ms tick loop, a 100 ms batch flush, a garbage-collection cycle that runs on a regular cadence. If your sampling frequency shares a common factor with a periodic workload, your samples can lock in phase with it and systematically over- or under-count the periodic function, the same way a strobe light can freeze a spinning wheel. Picking a prime number like 19 (or 97, or 101 at higher rates) makes it far less likely your sampler resonates with any workload period, so the samples spread evenly across each periodic phase. It is a small detail that quietly improves the fairness of every flamegraph the tool produces.
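The aliasing effect is easy to reproduce with a toy model. The sketch below is not profiler code: it assumes a hypothetical workload that is busy for the first 10 ms of every 50 ms period (a true on-CPU share of 20%) and compares what a 20 Hz sampler and a 19 Hz sampler would report. The function name and its parameters are illustrative.

```python
def estimated_busy_fraction(sample_hz, period_ms=50, busy_ms=10,
                            seconds=60, phase_ms=0):
    """Fraction of samples that land in the busy part of a periodic workload.

    All times are kept as integers by multiplying them by sample_hz, so the
    sample instant i / sample_hz seconds never has to be a float.
    """
    n = sample_hz * seconds
    hits = 0
    for i in range(n):
        # Sample time in milliseconds, multiplied by sample_hz.
        t_scaled = i * 1000 + phase_ms * sample_hz
        # Position inside the current workload period, same scaling.
        if t_scaled % (period_ms * sample_hz) < busy_ms * sample_hz:
            hits += 1
    return hits / n


if __name__ == "__main__":
    print(estimated_busy_fraction(20))               # 1.0: locked onto the busy phase
    print(estimated_busy_fraction(20, phase_ms=25))  # 0.0: locked onto the idle phase
    print(round(estimated_busy_fraction(19), 3))     # 0.211: close to the true 0.2
```

At 20 Hz the sampler's 50 ms interval equals the workload period, so every sample lands at the same phase and the answer depends entirely on where the first sample fell. At 19 Hz the samples step through 19 distinct phases of the period and the estimate settles near the true share.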

The arithmetic is worth internalizing. At 19 Hz on a 64-core node, you collect roughly 1,216 samples per second, about 73,000 per minute, per node. A function that occupies 5% of CPU will, on average, appear in 5% of those samples — and with tens of thousands of samples, the statistical error on that estimate is tiny. This is why a low rate is not a compromise: the confidence comes from volume, and continuous collection delivers volume for free. A 60-second perf record on one machine cannot match the statistical power of a week of fleet-wide sampling, even though its instantaneous rate is far higher. The mental model to carry is that you are building a probability distribution over your code’s CPU consumption, and every additional sample sharpens it.

Make that error bar concrete. Sampling is a binomial process: each sample either lands in your function (probability p) or does not. The standard error on the estimated fraction is roughly sqrt(p(1-p)/N). For a function at p = 0.05 observed over one minute on that 64-core node (N ≈ 73,000 samples), the standard error is sqrt(0.05 × 0.95 / 73,000) ≈ 0.0008, or about 0.08 percentage points, so your 5% estimate is solid to within a tenth of a percent after just sixty seconds. Now flip it around for a rare function. A path that is genuinely 0.1% of CPU has p = 0.001, and over a single minute on a single node you expect only about 73 hits. That is enough to know it exists, but the relative error is large: for a rare function it is roughly 1/sqrt(hits), so 73 hits gives about ±12%. To pin a 0.1% function to within ±1% relative error you need on the order of 10,000 hits, which is about 10 million samples in total. That means more than two hours on one node, or a minute or two across a hundred such nodes. This is the precise mathematical reason continuous, fleet-wide collection is not merely convenient: it is what gives you the sample count to resolve the small-but-pervasive functions that, summed across thousands of cores, are where the real money hides.
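The figures above can be checked with a few lines of arithmetic. This is a minimal sketch using the binomial standard error quoted in the text; the function names are illustrative, and the inputs (19 Hz, 64 cores, p = 0.05 and p = 0.001) are the ones from the worked example.

```python
import math


def samples_per_minute(hz, cores):
    """Total stack samples collected on one node in sixty seconds."""
    return hz * cores * 60


def standard_error(p, n):
    """Binomial standard error of an estimated fraction p from n samples."""
    return math.sqrt(p * (1 - p) / n)


n = samples_per_minute(19, 64)
se_hot = standard_error(0.05, n)                  # absolute error on a 5% function
expected_rare_hits = 0.001 * n                    # hits for a 0.1% function
rel_err_rare = standard_error(0.001, n) / 0.001   # error relative to p itself

if __name__ == "__main__":
    print(n)                          # 72960
    print(round(se_hot, 5))           # 0.00081
    print(round(expected_rare_hits))  # 73
    print(round(rel_err_rare, 3))     # 0.117
```

The last number is what a large relative error looks like in practice: after one minute on one node, the estimate for a 0.1% function is only good to within roughly 12% of its own value.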

There is a subtlety in what you sample. CPU profiling driven by a software timer (PERF_COUNT_SW_CPU_CLOCK) samples wall-clock-on-CPU time. You can instead drive sampling from hardware performance counters — cache misses, branch mispredictions, retired instructions — to answer different questions. Polar Signals’ work on hardware timers and eBPF explores exactly this: a profile keyed on last-level-cache misses points you at memory-bound code that a CPU-time profile would not flag. Most teams start with CPU-time sampling and only reach for hardware-counter profiling when chasing a specific microarchitectural bottleneck.

Walking the stack from the kernel

The hard part is the stack walk. When the eBPF program runs, it has the CPU registers, including the instruction pointer and stack pointer. To build a full call stack it must “unwind” — find each caller’s return address up the chain. There are two mechanisms, and the choice has large consequences, covered in the next section. The eBPF helper bpf_get_stackid() can do a fast frame-pointer-based walk and store the result in a stack map, deduplicating identical stacks by hash so the map stays small.
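To see why deduplication keeps the map small, here is a userspace model of the idea in Python. It is a sketch of the data structure only, not kernel code and not the real helper: the class and function names are hypothetical, and a dictionary stands in for the hashed stack map (the real map hashes into fixed buckets and has to handle collisions, which this model ignores).

```python
from collections import Counter


class StackMap:
    """Toy stand-in for a stack map: identical stacks share one ID."""

    def __init__(self):
        self._ids = {}      # tuple of addresses -> stack ID
        self._stacks = []   # stack ID -> tuple of addresses

    def get_stackid(self, addresses):
        key = tuple(addresses)
        if key not in self._ids:
            self._ids[key] = len(self._stacks)
            self._stacks.append(key)
        return self._ids[key]

    def resolve(self, stack_id):
        return self._stacks[stack_id]


def on_timer_tick(stack_map, counts, addresses):
    """Per-sample work: look up the stack's ID and bump one counter."""
    counts[stack_map.get_stackid(addresses)] += 1


def drain(stack_map, counts):
    """What the agent does periodically: read the counts, then reset them."""
    snapshot = [(stack_map.resolve(sid), n) for sid, n in counts.items()]
    counts.clear()
    return snapshot


if __name__ == "__main__":
    stacks, counts = StackMap(), Counter()
    hot = [0x401A10, 0x401F30, 0x402200]    # made-up addresses, leaf first
    cold = [0x403000, 0x402200]
    for sample in (hot, hot, cold, hot):
        on_timer_tick(stacks, counts, sample)
    print(drain(stacks, counts))            # two distinct stacks, counts 3 and 1
```

Four samples produce two map entries. The per-tick cost is one lookup and one increment, and the amount of data the agent has to drain grows with the number of distinct stacks, not with the number of samples.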

There are real constraints on what the kernel will let you do here. eBPF programs run under a verifier that bounds loop counts and stack depth, so an unwinder cannot simply recurse without limit — it walks up to a fixed maximum number of frames and stops. Deep stacks can therefore be truncated, which is why you occasionally see flamegraphs that bottom out abruptly. The program also runs in interrupt context with tight time and memory budgets, so the design favors recording compact stack IDs in-kernel and doing the expensive symbolization later in userspace. Understanding this split — cheap capture in the kernel, heavy interpretation outside it — explains most of why production profilers are architected the way they are, and why a misconfigured maximum-stack-depth setting can quietly hide the very frames you are hunting for.

The stack-depth ceiling is not arbitrary, and it bites in predictable places. bpf_get_stackid() and its successor bpf_get_stack() cap the number of frames they will return (PERF_MAX_STACK_DEPTH defaults to 127), and a custom DWARF unwinder must declare its own bounded loop ceiling that the verifier can check. Most workloads are well under that, but three patterns blow past it: deeply recursive code (a recursive parser or tree walk), languages with heavy framework indirection (a Spring or Rails request can stack dozens of framework frames before reaching your handler), and async runtimes that chain continuations. When the walk truncates, you do not get an error. The unwinder starts at the interrupted instruction and walks outward through the callers, so the leaf frames survive and the outermost frames (main, the thread entry point, the request dispatcher) are the ones that get cut off. In a flamegraph this shows up as stacks that no longer merge under a common root: detached towers that all have exactly the maximum depth and begin at an arbitrary mid-stack frame. If you see that, raise the depth limit (where the tool exposes it) and re-check, because without the root-side frames you cannot tell which entry point or request path a hot leaf belongs to.

From raw addresses to readable functions

A stack walk yields a list of memory addresses, not function names. Symbolization maps each address back to a function, file, and line using the binary’s symbol table and DWARF debug info. This can happen on the node (the agent reads the binary) or be deferred to the backend. Deferred symbolization keeps the agent light and handles stripped binaries by looking up debuginfo separately, but it requires shipping unsymbolized addresses plus build IDs. Getting symbolization right for stripped, statically linked, and JIT-compiled binaries is where most of the real complexity lives.

Two details make symbolization robust in practice. First, addresses must be normalized: a process is loaded at a randomized base because of Address Space Layout Randomization (ASLR), so the agent subtracts the load bias to recover the static virtual address that matches the on-disk binary. Second, the build ID — a hash the linker embeds in the ELF .note.gnu.build-id section — uniquely identifies the exact binary, so the backend can fetch the matching debuginfo even months later from a debuginfod server. This is why deferred symbolization scales: the agent ships a tiny tuple of build ID plus normalized address, and a central service does the heavy mapping once and caches it. Skip the build ID and you are guessing which version of a binary produced a stack — a guess that fails the moment you run two releases at once during a rollout.

Walk through the normalization step concretely, because it is where silent corruption creeps in. A profiler sees an instruction pointer of, say, 0x5612a4c3f120 captured at runtime. That number is meaningless on its own — it includes the ASLR slide chosen for that process at exec time. The agent reads /proc/<pid>/maps to find the executable mapping for that address, learns the mapping started at virtual address 0x5612a4c30000 and corresponds to file offset 0x0 in the binary’s first executable segment, and subtracts to recover the static address 0xf120 that the on-disk ELF actually describes. Only that static address can be looked up in the binary’s .symtab or DWARF line table. Get the mapping arithmetic wrong — for instance by ignoring that a PIE binary’s segments can have non-zero file offsets, or by symbolizing against a different build than the one running — and every frame resolves to the wrong function with total confidence, which is worse than [unknown] because it looks correct. The build ID is the guardrail: the backend refuses to symbolize against debuginfo whose build ID does not match, so a mismatched binary fails loudly instead of producing a plausible lie.
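The subtraction described above can be sketched in Python. This is a simplified illustration, not any agent's actual code: the names are hypothetical, the sample maps line is made up to match the addresses in the text, and the sketch stops at the file offset. It assumes, as in the example, that the file offset equals the ELF virtual address; in general a real symbolizer also consults the ELF program headers to convert one to the other.

```python
from typing import List, NamedTuple, Optional, Tuple


class Mapping(NamedTuple):
    start: int
    end: int
    perms: str
    file_offset: int
    path: str


def parse_maps_line(line: str) -> Mapping:
    """Parse one line of /proc/<pid>/maps: range perms offset dev inode path."""
    fields = line.split(None, 5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""
    return Mapping(start, end, fields[1], int(fields[2], 16), path)


def to_file_offset(addr: int, mappings: List[Mapping]) -> Optional[Tuple[str, int]]:
    """Strip the ASLR slide: runtime address -> (binary path, offset in file)."""
    for m in mappings:
        if m.start <= addr < m.end and "x" in m.perms:
            return m.path, addr - m.start + m.file_offset
    return None  # address is not inside any executable file mapping


if __name__ == "__main__":
    # Hypothetical maps entry consistent with the example in the text.
    line = "5612a4c30000-5612a4c50000 r-xp 00000000 08:01 1311   /usr/bin/app"
    path, offset = to_file_offset(0x5612A4C3F120, [parse_maps_line(line)])
    print(path, hex(offset))   # /usr/bin/app 0xf120
```

Adding m.file_offset is the step that is easy to forget. It is zero in this example, but for a mapping that starts partway into the file, leaving it out shifts every address by a constant and produces exactly the confident, wrong symbols the paragraph warns about.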

The debuginfod flow is what makes this practical at scale. debuginfod is a federated HTTP service (the upstream protocol most distributions now run) that serves debug information keyed by build ID: you GET /buildid/<hex-build-id>/debuginfo and receive the separated .debug file for exactly that binary. A profiling backend that has only a build ID and a normalized address can therefore fetch the matching debuginfo on demand — from your distribution’s public debuginfod for OS packages, or from an internal debuginfod you populate from your CI artifact store for your own services — symbolize once, and cache the result keyed by build ID forever. The architectural payoff is that your production nodes never need debug symbols installed (they ship only stripped binaries plus the embedded build ID), keeping images small, while the backend still recovers full file-and-line detail. The failure mode to plan around is the binary whose build ID is not registered anywhere: it symbolizes to hex, and the only fix is to make sure every artifact your fleet runs is uploaded to a debuginfod the backend can reach.
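As a rough illustration of the lookup-once-and-cache pattern, the sketch below builds the request path named above and caches results by build ID. It makes no network calls itself: the fetch function is injected by the caller, and the class name, function names, and server URL are all hypothetical.

```python
from typing import Callable, Dict


def debuginfo_url(server: str, build_id: str) -> str:
    """Build the debuginfod URL for one build ID (hex string)."""
    build_id = build_id.lower()
    if not build_id or any(c not in "0123456789abcdef" for c in build_id):
        raise ValueError("build ID must be a non-empty hex string")
    return f"{server.rstrip('/')}/buildid/{build_id}/debuginfo"


class DebuginfoCache:
    """Fetch debuginfo at most once per build ID."""

    def __init__(self, server: str, fetch: Callable[[str], bytes]):
        self._server = server
        self._fetch = fetch                      # e.g. an HTTP GET supplied by the caller
        self._by_build_id: Dict[str, bytes] = {}

    def get(self, build_id: str) -> bytes:
        key = build_id.lower()
        if key not in self._by_build_id:
            url = debuginfo_url(self._server, key)
            self._by_build_id[key] = self._fetch(url)
        return self._by_build_id[key]


if __name__ == "__main__":
    requested = []

    def fake_fetch(url: str) -> bytes:           # stand-in for a real HTTP client
        requested.append(url)
        return b"debug-data"

    cache = DebuginfoCache("https://debuginfod.example.com/", fake_fetch)
    cache.get("ABCDEF0123")
    cache.get("abcdef0123")
    print(requested)   # one URL: https://debuginfod.example.com/buildid/abcdef0123/debuginfo
```

Because a build ID identifies one exact binary, the cache never needs invalidation: the same key always maps to the same debuginfo.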

A Deeper Walk-through: Unwinding, pprof, and Deployment

Three things separate a toy profiler from a production one: how it unwinds stacks when frame pointers are missing, how it represents and ships data, and how it deploys at fleet scale. Let us go through each, with runnable configuration.

Figure 2: How an eBPF profiler decides between fast frame-pointer unwinding and DWARF-based unwinding on each timer tick, then aggregates stacks into pprof.

Frame pointers versus DWARF unwinding

A frame pointer is a CPU register (RBP on x86-64) that, by convention, points to the base of the current stack frame. If every function preserves it, walking the stack is a trivial pointer chase: follow RBP to the saved RBP of the caller, repeat. It is fast enough to do inside an eBPF program on every tick.

The problem: compilers omit frame pointers by default to free up a register and shave a few instructions. For decades -fomit-frame-pointer was the norm, which broke naive unwinding. Two things changed. First, distributions began re-enabling frame pointers — Fedora, Ubuntu, and others now compile their default packages with -fno-omit-frame-pointer precisely because the profiling benefit outweighs the tiny cost. Second, profilers like Parca implemented DWARF-based stack walking inside eBPF. DWARF unwind tables (the .eh_frame section) describe how to compute the caller’s frame at any instruction, even without a frame pointer. Polar Signals’ engineers showed that modern kernels are capable enough to run a DWARF unwinder in eBPF, so the profiler works correctly even on binaries built with frame pointers omitted.

It is worth understanding why DWARF unwinding is hard to do in the kernel, because it explains the architecture of every serious eBPF profiler. The .eh_frame section is not a table you can index — it is a bytecode program (the DWARF Call Frame Information, or CFI, virtual machine) that, for any given instruction pointer, tells you how to compute the Canonical Frame Address and where the saved return address and registers live relative to it. Interpreting that bytecode inside an eBPF program is exactly the kind of unbounded work the verifier forbids. So the practical design, pioneered by Parca, is to pre-process .eh_frame in userspace into a flat, sorted unwind table — essentially rows of (instruction-pointer-range, CFA-rule, RBP-rule) — and ship that table into a BPF map. The in-kernel unwinder then does a bounded binary search into the table for each frame and applies the simple arithmetic rule it finds, never interpreting DWARF bytecode at runtime. That offline-compile-to-a-lookup-table pattern is the only way to get DWARF correctness under the verifier’s bounded-execution rules, and it is why DWARF mode uses noticeably more agent memory: those unwind tables for libc, your runtime, and every shared library have to live resident in BPF maps.
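The lookup-table pattern can be modeled in userspace. The sketch below is a deliberately simplified illustration in Python, not Parca's implementation: the row layout, names, and register handling are assumptions, it covers only x86-64 rules of the form "CFA = register + offset", and it omits details a real unwinder needs (for example, looking up caller frames at return address minus one, and rows that mark the end of a function's range).

```python
import bisect
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

MAX_FRAMES = 127  # hard ceiling, standing in for the verifier-checked loop bound


@dataclass(frozen=True)
class UnwindRow:
    start_pc: int               # first instruction address this row covers
    cfa_reg: str                # "rsp" or "rbp"
    cfa_offset: int             # CFA = regs[cfa_reg] + cfa_offset
    rbp_offset: Optional[int]   # caller's RBP saved at CFA + rbp_offset, if any


class UnwindTable:
    """Flat, sorted table precomputed from .eh_frame (here: built by hand)."""

    def __init__(self, rows: List[UnwindRow]):
        self.rows = sorted(rows, key=lambda r: r.start_pc)
        self.starts = [r.start_pc for r in self.rows]

    def find(self, pc: int) -> Optional[UnwindRow]:
        i = bisect.bisect_right(self.starts, pc) - 1   # binary search
        return self.rows[i] if i >= 0 else None


def unwind(table: UnwindTable, regs: Dict[str, int],
           read_u64: Callable[[int], int]) -> List[int]:
    """Return the program counters of the stack, leaf first."""
    regs = dict(regs)
    pcs = []
    for _ in range(MAX_FRAMES):             # bounded loop, never "while True"
        pcs.append(regs["pc"])
        row = table.find(regs["pc"])
        if row is None:
            break
        cfa = regs[row.cfa_reg] + row.cfa_offset
        return_addr = read_u64(cfa - 8)     # x86-64: return address sits just below the CFA
        if return_addr == 0:
            break                            # reached the outermost frame
        if row.rbp_offset is not None:
            regs["rbp"] = read_u64(cfa + row.rbp_offset)
        regs["rsp"] = cfa
        regs["pc"] = return_addr
    return pcs
```

Each frame costs one binary search plus a couple of additions and memory reads, which is work a verifier can bound. The memory cost mentioned above is also visible here: one row per change in the unwind rule, for every function in every mapped binary.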

The trade-off is real. Frame-pointer unwinding is cheap but needs cooperation from the whole stack (your code, your libraries, libc). DWARF unwinding works everywhere but is computationally heavier and requires shipping or accessing unwind tables. Production profilers use frame pointers when present and fall back to DWARF when not — exactly the branch in Figure 2.

# Check whether a binary was built with frame pointers preserved.
# If function prologues start with push %rbp / mov %rsp,%rbp, frame-pointer
# unwinding will work; otherwise the profiler needs DWARF, which relies on the
# .eh_frame section that readelf lists. (objdump needs symbols to find <main>.)
objdump -d /usr/bin/your-binary | grep -A2 '<main>:' | head
readelf -S /usr/bin/your-binary | grep -E 'eh_frame|debug'

# Build Go with frame pointers (Go keeps them on amd64/arm64 by default).
# Build one main package; -o with a file name fails if ./... matches several.
go build -o app .

# Build C/C++ with frame pointers preserved for cleaner stacks.
gcc -fno-omit-frame-pointer -g -O2 -o app app.c

The pprof format and the OpenTelemetry profiling signal

Profilers need a wire format. The de facto standard is pprof, originally Google’s profiling format, a gzip-compressed protocol buffer that encodes samples (a stack plus one or more values such as CPU nanoseconds or bytes allocated), a location table, a function table, and a string table. Its key design choice is deduplication: every distinct stack frame and string appears once and is referenced by index, so a profile with millions of samples stays compact. Both Pyroscope and Parca speak pprof natively.

The internal structure rewards a closer look, because it is the same indexed-table design that the OpenTelemetry format inherits. A pprof Profile is built from a handful of repeated message tables that reference each other by integer index. The string_table is a single array of all strings (function names, file paths, label keys) where index 0 is always the empty string; nothing else stores a string inline — everything stores an index. A Function records a name and filename as indices into that string table. A Location represents one program address and holds one or more Line entries pointing at functions (more than one when the address is inlined). A Sample is then just a list of location indices (the stack, leaf-first) plus a parallel list of values — for a CPU profile that value is sample count or nanoseconds; for a heap profile it is bytes or object count. The reason a profile with millions of samples stays small is that the genuinely repetitive data — the same runtime.mallocgc frame appearing in a hundred thousand stacks — is stored exactly once as a Location, and every stack that contains it stores a four-byte index. Knowing this layout demystifies a lot of profiler behavior: merging two profiles means concatenating sample lists and de-duplicating the location and string tables; and a profile bloated to hundreds of megabytes almost always means an exploded number of distinct stacks (often from spurious labels or truncation artifacts), not a lot of samples, because samples themselves are nearly free.
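The indexed-table idea is easier to see in miniature. The following Python sketch models the layout described above with plain lists; it is not the pprof protobuf schema or any library's API, and the class and method names are hypothetical. IDs start at 1 because pprof treats 0 as "no ID".

```python
class ProfileBuilder:
    """Toy model of pprof-style interning: strings, functions, locations."""

    def __init__(self):
        self.string_table = [""]      # index 0 is always the empty string
        self._strings = {"": 0}
        self.functions = []           # (function_id, name_idx, filename_idx)
        self._functions = {}
        self.locations = []           # (location_id, address, function_id)
        self._locations = {}
        self.samples = []             # (location_ids leaf-first, value)

    def _intern(self, s):
        if s not in self._strings:
            self._strings[s] = len(self.string_table)
            self.string_table.append(s)
        return self._strings[s]

    def _function(self, name, filename):
        key = (self._intern(name), self._intern(filename))
        if key not in self._functions:
            self._functions[key] = len(self.functions) + 1
            self.functions.append((self._functions[key], *key))
        return self._functions[key]

    def _location(self, address, name, filename):
        key = (address, self._function(name, filename))
        if key not in self._locations:
            self._locations[key] = len(self.locations) + 1
            self.locations.append((self._locations[key], *key))
        return self._locations[key]

    def add_sample(self, frames, value):
        """frames: list of (address, function name, filename), leaf first."""
        ids = [self._location(*frame) for frame in frames]
        self.samples.append((ids, value))


if __name__ == "__main__":
    p = ProfileBuilder()
    malloc = (0x41000, "runtime.mallocgc", "malloc.go")
    p.add_sample([malloc, (0x52000, "encodeJSON", "json.go")], 7)
    p.add_sample([malloc, (0x63000, "parseRequest", "http.go")], 3)
    print(len(p.locations), len(p.samples))   # 3 2
    print(p.samples)                          # [([1, 2], 7), ([1, 3], 3)]
```

The shared runtime.mallocgc frame is stored once and both samples refer to it as location 1. Adding a third sample through the same frames would grow only the sample list, which is why sample volume is cheap and distinct-stack count is what drives profile size.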

The bigger 2026 development is the OpenTelemetry profiling signal. Profiles formally entered public alpha in March 2026, making them OpenTelemetry’s fourth signal. The OTLP Profiles format was inspired by pprof and developed with pprof maintainers; crucially, data round-trips between pprof and OTLP Profiles with no loss, and a native translator ships to guarantee interoperability. The OpenTelemetry Collector now has a profiles receiver, and there is a dedicated OpenTelemetry eBPF profiler that emits the signal. Practically, this means the profiler you deploy today and the backend you query do not have to come from the same vendor — a meaningful shift from the lock-in of proprietary APM.

The OTLP Profiles model also pushes deduplication further than pprof did, sharing the resource and attribute model with the other signals so that profiling data references the same service, host, and Kubernetes metadata your traces and metrics already carry. Concretely, OTLP Profiles hoists the dictionaries that pprof kept per-profile up to the message level — a shared string_table, plus shared tables for mappings, locations, functions, link, and attributes — so a batch of profiles from many processes deduplicates across all of them at once rather than each profile carrying its own copies. Attributes are the same key-value AnyValue type used by spans and metrics, attached to samples or locations by index, which is what lets a frame carry the identical service.name, k8s.pod.name, and host.name semantic-convention attributes your traces already use. There is also a link table that can associate a sample with a trace ID and span ID. That shared model is the quiet payoff of making profiles a first-class OpenTelemetry signal: a stack frame in a flamegraph can be correlated with the trace that exercised it and the metric that flagged it, because all three describe the same resource, and in the strongest case you can jump from a slow span directly to the exact stack that was on-CPU while it ran. Being alpha, the spec and tooling are still moving, so pin versions and expect churn — but the direction is clear, and building on the standard now means you inherit that cross-signal correlation as the ecosystem matures rather than bolting it on later.

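# OpenTelemetry Collector: receive OTLP profiles and forward them.
# Profiles are alpha as of mid-2026; pin versions and expect churn.
# Collector builds that gate the profiles pipeline need the feature gate
# enabled (service.profilesSupport at the time of writing; verify for your
# version), and every processor listed must support the profiles signal.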
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch: {}
exporters:
  otlphttp/profiles:
    endpoint: https://your-backend.example.com
service:
  pipelines:
    profiles:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/profiles]

Deploying Pyroscope and Parca on Kubernetes

Both tools follow the same shape: an agent (a DaemonSet, one pod per node) collects profiles via eBPF and pushes them to a backend that stores blocks in object storage and serves a query API to Grafana. Figure 3 shows the topology.

Figure 3: A fleet-wide eBPF profiling deployment on Kubernetes — a per-node agent enriches profiles with pod labels and pushes them to a backend that stores blocks in object storage and serves Grafana.

The modern Grafana path uses Grafana Alloy (the open-source agent that replaced Grafana Agent) with its pyroscope.ebpf component, pushing to a Pyroscope backend. Pyroscope 2.0 made its v2 architecture the default, writing profiles directly to object storage and removing the in-memory ingesters and local disks the older design needed — simpler ops and lower resource use at scale. That architectural shift matters operationally: the older design coupled write throughput to the memory and disk of stateful ingesters, which were the thing that fell over first under a profiling spike. Writing straight to object storage decouples ingestion from storage, so the system scales horizontally with cheap stateless components and your durability story becomes “whatever S3-compatible bucket you trust.” It is the same pattern that played out in metrics and logs, arriving in profiling a few years later.

The relabeling step in the Alloy config is not boilerplate — it is where you control cost and usefulness at once. By copying Kubernetes metadata into stable labels (namespace, pod, service, version) you make profiles queryable the same way you query metrics, and by not copying volatile fields you keep cardinality bounded. Get this wrong and you either cannot slice your data by service, or you drown the backend in series. Treat the relabel rules as the contract between your fleet and your profiling bill.

// Grafana Alloy: discover Kubernetes pods and profile them via eBPF.
// Save as config.alloy and run with the Alloy DaemonSet.

discovery.kubernetes "pods" {
  role = "pod"
}

discovery.relabel "pods" {
  targets = discovery.kubernetes.pods.targets
  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
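  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  // Assumption: use the container name as service_name so that queries such
  // as {service_name="checkout"} match; adjust to your own naming scheme.
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "service_name"
  }
}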

pyroscope.ebpf "default" {
  forward_to = [pyroscope.write.backend.receiver]
  targets    = discovery.relabel.pods.output
}

pyroscope.write "backend" {
  endpoint {
    url = "http://pyroscope.monitoring.svc.cluster.local:4040"
  }
}

Parca takes a comparable approach with Parca Agent, an always-on eBPF profiler that auto-discovers targets (Kubernetes containers and systemd units) with zero code changes or restarts. It builds pprof profiles and pushes them to a Parca server.

# Parca Agent as a DaemonSet (abridged). Needs privileged access for eBPF.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: parca-agent
  namespace: parca
spec:
  selector:
    matchLabels: { app.kubernetes.io/name: parca-agent }
  template:
    metadata:
      labels: { app.kubernetes.io/name: parca-agent }
    spec:
      hostPID: true
      containers:
        - name: parca-agent
          image: ghcr.io/parca-dev/parca-agent:latest
          args:
            - /bin/parca-agent
            - --remote-store-address=parca.parca.svc.cluster.local:7070
            - --remote-store-insecure
            - --node=$(NODE_NAME)
          securityContext:
            privileged: true
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef: { fieldPath: spec.nodeName }

Once profiles flow, you query them. In Pyroscope you select a profile type and a label set, and you get a flamegraph for the time range:

# Pyroscope query: CPU profile for one service, last hour, in Grafana.
process_cpu:cpu:nanoseconds:cpu:nanoseconds{service_name="checkout"}

CPU, memory, and off-CPU profiling

CPU profiling answers “what is burning cycles.” But two other axes matter. Memory (allocation) profiling samples allocation sites to show what is generating garbage or leaking — invaluable for finding the allocation that doubled your node count. Off-CPU profiling captures time not spent on CPU: blocked on I/O, locks, channel sends, or the scheduler. A service can be slow while its CPU profile looks idle, because it is waiting; off-CPU profiling is the only view that surfaces that. eBPF can capture all three by attaching to different events — timers for CPU, allocation hooks or uprobes for memory, and scheduler tracepoints (sched_switch) for off-CPU. Coverage varies by tool and language, so verify before you rely on it.

The mechanism differs meaningfully between the three. CPU profiling is event-driven by time — it fires on a clock and asks “what is running now.” Off-CPU profiling is event-driven by transitions — it hooks sched_switch, records a timestamp and stack when a thread leaves the CPU, and measures the gap until it returns, attributing that blocked duration to the stack that was waiting. Memory profiling is the trickiest: a true allocation profile needs to intercept the allocator, which for native code means uprobes on malloc/free (expensive if called millions of times a second, so it is sampled) and for managed runtimes means hooking the language’s own allocator. Because of this cost asymmetry, many teams run CPU profiling continuously and enable memory or off-CPU profiling selectively, on the services where they suspect a problem, rather than fleet-wide. Knowing which question each profile type answers — “what is hot,” “what is waiting,” “what is allocating” — is the difference between staring at a flamegraph and actually diagnosing a regression.

Off-CPU profiling deserves a worked example, because its accounting is unintuitive and easy to misread. The sched_switch tracepoint fires every time the kernel scheduler context-switches a CPU from one task to another, and it carries both the outgoing and incoming task. An off-CPU profiler arms two halves of a state machine on this single event: when a thread is switched out, it captures that thread’s stack and stamps a timestamp keyed by thread ID into a BPF map; when that same thread is later switched back in, it looks up the stored timestamp, computes now - stored as the off-CPU duration, and attributes that nanosecond delta to the stack it saved at switch-out. Picture a request handler that calls into a database client, which blocks on a socket read for 40 ms. At switch-out the captured stack is handler -> db.Query -> net.Read -> [blocked], and 40 ms later when the socket has data and the scheduler resumes the thread, the profiler records 40 ms against that exact stack. Sum that across thousands of requests and the off-CPU flamegraph shows net.Read as a fat bar — the smoking gun for an I/O-bound slowdown that the CPU flamegraph, where that thread contributed nothing because it was off-CPU the whole time, would render completely invisible. The gotcha to internalize: off-CPU time includes all reasons a thread is not running — genuine I/O waits, lock contention, and also plain scheduler queueing when the node is CPU-saturated — so a node that is simply oversubscribed will show inflated off-CPU time everywhere, and you must rule out CPU saturation before reading a single fat off-CPU frame as an I/O or locking problem.
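
The two-halves accounting above can be sketched in plain Python. This is a simulation only, not eBPF code: the event tuple layout, the thread IDs, and the `off_cpu_profile` helper are all assumptions made for illustration, and a dictionary stands in for the BPF map.

```python
from collections import defaultdict

def off_cpu_profile(events):
    """Aggregate off-CPU nanoseconds per stack from simulated sched_switch events.

    Each event is (timestamp_ns, prev_tid, prev_stack, next_tid), mirroring the
    outgoing and incoming task carried by one context switch.
    """
    switched_out = {}            # tid -> (timestamp_ns, stack); stands in for the BPF map
    totals = defaultdict(int)    # stack -> total off-CPU nanoseconds

    for ts, prev_tid, prev_stack, next_tid in events:
        # Half 1: the outgoing thread leaves the CPU; save its stack and the time.
        switched_out[prev_tid] = (ts, prev_stack)
        # Half 2: the incoming thread resumes; charge the gap to its saved stack.
        saved = switched_out.pop(next_tid, None)
        if saved is not None:
            start_ts, stack = saved
            totals[stack] += ts - start_ts
    return dict(totals)

MS = 1_000_000
blocked = ("handler", "db.Query", "net.Read")
events = [
    (0,       101, blocked, 202),                  # thread 101 blocks on the socket read
    (40 * MS, 202, ("worker", "idle_wait"), 101),  # 40 ms later thread 101 resumes
]
profile = off_cpu_profile(events)
print(profile[blocked] / MS)  # 40.0
```

Note that the sketch charges the whole gap to the saved stack regardless of why the thread was off CPU, which is exactly why scheduler queueing on a saturated node inflates every frame.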

Reading a differential flamegraph

A flamegraph shows where time goes; a differential flamegraph shows what changed. You pick a baseline window (before a deploy) and a candidate window (after), align stacks by function, and color the delta — frames that grew are red, frames that shrank are blue. This is the single most useful artifact continuous profiling produces, because it turns “the p99 got worse after Tuesday’s release” into “this function now accounts for 8% more CPU.” Figure 4 shows the workflow as a release gate.

To read one correctly, normalize first. Two windows almost never have the same total sample count, so comparing raw counts is misleading — a frame can look bigger simply because the second window had more traffic. Good differential tooling compares percentages of each window, so the delta reflects a genuine shift in where time is spent rather than a change in volume. Watch for the classic trap of attributing a regression to a frame that merely moved: if a caller was inlined or a wrapper was added, time can appear to relocate from one function to a neighbor without any real slowdown. The reliable signal is a new hot path or a frame whose share grew well beyond the noise floor of your sampling. Differential flamegraphs also shine outside deploys: compare a fast tenant to a slow one, a healthy node to a struggling one, or peak traffic to a quiet hour, and the delta points straight at the divergent code path.
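
A minimal sketch of the share-based comparison, assuming profiles are available as folded stacks (a semicolon-joined stack mapped to a sample count). The function names and the sample data are made up, and frames are aligned by function name only; real tools also take the position in the call tree into account.

```python
from collections import Counter

def frame_shares(folded):
    """Return each function's inclusive share of a window's samples.

    `folded` maps a semicolon-joined stack ("main;handler;serialize") to its
    sample count.
    """
    total = sum(folded.values())
    inclusive = Counter()
    for stack, count in folded.items():
        # Count a function once per stack, even if recursion repeats it.
        for frame in set(stack.split(";")):
            inclusive[frame] += count
    return {frame: n / total for frame, n in inclusive.items()}

def diff_shares(baseline, candidate):
    """Percentage-point change in share per function (positive means it grew)."""
    base, cand = frame_shares(baseline), frame_shares(candidate)
    return {f: 100 * (cand.get(f, 0.0) - base.get(f, 0.0))
            for f in set(base) | set(cand)}

# Hypothetical windows: the candidate has twice the traffic and a new leaf.
baseline = {"main;handler;parse": 600, "main;handler;serialize": 400}
candidate = {"main;handler;parse": 900, "main;handler;serialize": 900,
             "main;handler;serialize;escape": 200}
delta = diff_shares(baseline, candidate)
for frame, points in sorted(delta.items(), key=lambda kv: -kv[1]):
    print(f"{frame:10s} {points:+.1f}")
```

In this made-up data the candidate window has double the samples, yet `main` and `handler` come out at zero because their share did not move; only `serialize`, `parse`, and the new `escape` leaf show a delta.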

The normalization pitfall is worth making numeric, because the wrong subtraction produces convincing nonsense. Suppose the baseline window collected 100,000 samples and the candidate window — taken during a busier hour — collected 150,000. A function serialize() appears in 8,000 baseline samples and 13,500 candidate samples. Subtract raw counts and you “discover” a 5,500-sample regression and a red frame that looks alarming. But normalize to share-of-window: 8,000/100,000 = 8.0% versus 13,500/150,000 = 9.0%, a real but modest +1.0 percentage-point shift — most of the raw increase was simply more total traffic, not a code regression. The opposite error is just as common: a function whose raw count rose can have a falling share if the rest of the workload grew faster, which a count-based diff would paint blue and a share-based diff would correctly leave near-neutral. There is a subtler trap still: a single percentage point of CPU is sometimes the entire signal you are hunting (that is a real node or two at scale), so do not dismiss small share deltas as noise without checking them against the sampling error bar from earlier — if your standard error on a frame is 0.1 points, a 1.0-point shift is ten sigma and absolutely real. The discipline, then, is always-normalize-then-compare-against-the-noise-floor, and the reliable regression signature remains a genuinely new leaf appearing in the candidate that was absent from the baseline.
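
The serialize() numbers above can be reproduced in a few lines. The noise-floor part is an assumption: it treats samples as independent draws (a binomial model), which real stack samples only approximate, so read the resulting z value as a rough guide rather than a precise significance test.

```python
import math

def share_delta(base_hits, base_total, cand_hits, cand_total):
    """Compare one frame across two windows by share, with a rough noise floor."""
    p_base = base_hits / base_total
    p_cand = cand_hits / cand_total
    # Binomial standard error of each share, combined for the difference.
    se = math.sqrt(p_base * (1 - p_base) / base_total
                   + p_cand * (1 - p_cand) / cand_total)
    return {
        "raw_delta": cand_hits - base_hits,
        "share_delta_points": 100 * (p_cand - p_base),
        "std_error_points": 100 * se,
        "z": (p_cand - p_base) / se,
    }

result = share_delta(8_000, 100_000, 13_500, 150_000)
print(result["raw_delta"])                     # 5500
print(round(result["share_delta_points"], 2))  # 1.0
print(round(result["std_error_points"], 3))
print(round(result["z"], 1))
```

The raw delta is the alarming 5,500 samples, while the share delta is the modest one percentage point, and under the independence assumption that one point still sits several standard errors above the noise.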

Figure 4: Using a differential flamegraph as a release gate — compare a baseline window to a candidate window, render the delta, and roll back or promote based on the regression.

Trade-offs, Gotchas, and What Goes Wrong

The “near-zero overhead” claim is real but conditional. eBPF profiling at low frequency typically costs around 1% of CPU, and vendor and community measurements cluster in that range. But overhead scales with sampling frequency, stack depth, and unwinding method: crank the frequency to 1000 Hz with deep DWARF unwinds on every tick and you will notice it. Measure your own overhead with a controlled experiment before trusting a marketing number.

Symbolization is the most common failure. Stripped binaries, missing debuginfo, statically linked Go binaries with unusual layouts, and containers without their original build artifacts all produce flamegraphs full of hex addresses or [unknown] frames. The fix is shipping debuginfo or using build-ID-based symbol lookup, but it is operational work you must plan for.
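
To make build-ID-based lookup concrete, here is a small sketch of how a build ID maps to a debuginfo location. It follows the common GNU convention for separate debug files and the debuginfod URL layout; the build ID and server name below are invented, and how your profiler actually resolves symbols may differ.

```python
from pathlib import PurePosixPath

def debug_file_path(build_id, root="/usr/lib/debug"):
    """Path where the GNU build-ID convention places separate debuginfo."""
    build_id = build_id.lower()
    if len(build_id) < 3 or any(c not in "0123456789abcdef" for c in build_id):
        raise ValueError(f"not a hex build ID: {build_id!r}")
    # The first byte (two hex digits) is the directory, the rest is the file name.
    return PurePosixPath(root) / ".build-id" / build_id[:2] / f"{build_id[2:]}.debug"

def debuginfod_url(build_id, server):
    """URL a debuginfod-style server exposes for the same build ID."""
    return f"{server.rstrip('/')}/buildid/{build_id.lower()}/debuginfo"

example_id = "4f7a0e2b9c1d3e5f60718293a4b5c6d7e8f90a1b"  # made-up build ID
print(debug_file_path(example_id))
print(debuginfod_url(example_id, "https://debuginfod.example.internal"))
```

Because the key is the build ID rather than the file path, the lookup still works when the container image has been stripped, as long as the matching debuginfo was published somewhere the symbolizer can reach.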

Interpreted and JIT languages are the hard case. eBPF sees native frames beautifully — Go, Rust, C, C++ unwind cleanly. But Python, Ruby, Node.js, and the JVM run a bytecode interpreter or JIT, so the native stack shows interpreter internals, not your functions. Profilers need language-specific unwinders (reading interpreter frame structures) to recover meaningful stacks, and coverage varies. Pyroscope and Parca have invested heavily here, but do not assume a kernel-level profiler gives you Python line numbers out of the box; verify for your runtime.

The JIT case adds a second twist beyond unwinding: symbolization for code that does not exist in any file. A JVM or V8 engine generates machine code at runtime, so there is no on-disk binary to look up. Profilers handle this by reading a perf map — a file the runtime can be asked to emit that maps JIT-generated address ranges to method names — or by consuming the runtime’s own profiling interface. If neither is wired up, those frames render as raw addresses no matter how good your native symbolization is. The practical upshot: getting clean stacks for a JIT language is a per-runtime integration project, not a flip of the profiler’s switch, and it is the single most common reason a team’s first eBPF profiling rollout produces disappointing flamegraphs for their Java or Node services.

The perf-map mechanism is worth knowing in detail because it is brittle in ways that surprise people. When a JVM is run with -XX:+PreserveFramePointer and an agent like perf-map-agent (or the modern async-profiler’s map emitter), the runtime writes a file at /tmp/perf-<pid>.map containing lines of the form <hex-start-address> <hex-size> <method-name>, one row per JIT-compiled method. A profiler symbolizing a JIT frame does a range lookup into that file: find the row whose address range contains the captured instruction pointer, and the method name is yours. The brittleness is everywhere. The file is keyed by PID and lives in the process’s view of /tmp, so a profiler agent in a different mount namespace (the normal case in Kubernetes) has to resolve /tmp inside the target container’s namespace to even find it. The map is append-mostly and reflects the JIT’s current state, so a method that was recompiled or deoptimized can leave a stale or missing entry, producing intermittent [unknown] frames for code that symbolized fine a minute earlier. And nothing emits the file unless the runtime is explicitly configured to, which is why a Java service profiled with no extra flags shows a wall of hex. The takeaway is that JIT symbolization is a live, stateful contract between the runtime and the profiler, not a static lookup, and it has to be set up per workload and verified under real load, not just at startup.
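
The range lookup itself is simple, which is part of why the surrounding plumbing is where things break. A minimal sketch, assuming the three-column line format described above; the method names and addresses are invented, and a real symbolizer would also have to handle stale or overlapping entries.

```python
import bisect

def parse_perf_map(text):
    """Parse perf-map lines '<hex-start> <hex-size> <name>' into sorted ranges."""
    entries = []
    for line in text.splitlines():
        parts = line.strip().split(" ", 2)   # method names may contain spaces
        if len(parts) != 3:
            continue                         # skip blank or malformed rows
        start, size, name = int(parts[0], 16), int(parts[1], 16), parts[2]
        entries.append((start, start + size, name))
    entries.sort()
    return entries

def symbolize(entries, ip):
    """Return the method whose [start, end) range contains ip, else None."""
    starts = [e[0] for e in entries]
    i = bisect.bisect_right(starts, ip) - 1
    if i >= 0 and entries[i][0] <= ip < entries[i][1]:
        return entries[i][2]
    return None

# Hypothetical contents of a /tmp/perf-<pid>.map file.
sample = """\
7f3a1c000000 40 Lcom/shop/Cart;::total
7f3a1c000040 120 Lcom/shop/Checkout;::submit
"""
entries = parse_perf_map(sample)
print(symbolize(entries, 0x7f3a1c000050))  # falls inside submit's range
print(symbolize(entries, 0x7f3a1c000200))  # no matching range: None
```

An instruction pointer that falls outside every range returns `None`, which is what surfaces as a raw address or [unknown] frame when the map is missing or stale.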

Security and kernel requirements bite. eBPF profiling needs a recent kernel (BTF and modern eBPF features help enormously), and the agent typically runs privileged or with CAP_BPF, CAP_PERFMON, and hostPID. Locked-down clusters, some managed Kubernetes offerings, and gVisor or Firecracker sandboxes may restrict or block the required capabilities. The hostPID: true requirement in particular is a common sticking point with security teams, because it lets the agent see every process on the node — which is precisely the point, but also a privilege escalation surface you must justify and audit.

Cardinality is the failure mode that creeps up on you. Profiles are small per-sample, but storage and query costs scale with the number of distinct label combinations, exactly as they do in Prometheus. Attach a high-cardinality label (a request ID, a pod UID, a customer ID) to your profiles and you can explode the series count by orders of magnitude, blowing up both your object-storage bill and your query latency. The discipline is the same one you apply to metrics: keep labels low-cardinality (service, namespace, version), and never tag a profile with anything unbounded. Retention is the other lever; most teams keep high-resolution profiles for days and downsample or drop after that, because the value of a flamegraph decays quickly once the relevant release is gone. The same cost-control mindset you would apply to GPU and node right-sizing applies directly here.

Put numbers on the cardinality blowup, because it is the one mistake that turns a cheap signal into a budget line. Cardinality is the product of the distinct values of each label. A sane label set — say 40 services × 8 namespaces × 5 live versions × 3 profile types — yields about 4,800 distinct series, which any profiling backend handles comfortably. Now add a single unbounded label. Tag profiles with pod name and, with 200 pods churning across a week of rollouts producing perhaps 3,000 distinct pod names, you have multiplied your series count to roughly 14 million; tag them with a per-request trace_id and the count is effectively unbounded — every profile is its own series, deduplication collapses, and both ingestion and every query degrade because the index can no longer group anything. The storage math compounds it: each distinct series carries index overhead and its own block metadata independent of how few samples it holds, so a million near-empty series can cost more than a thousand fat ones. The rule that falls out of the arithmetic is blunt — every label you add multiplies, so only add labels whose distinct-value count is small and bounded, and push anything per-request or per-pod-instance into the trace/span correlation link instead of into a profile label.
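
The arithmetic above is easy to rerun for your own label set. This sketch computes the worst-case product using the figures from the paragraph; it is an upper bound, since not every combination of label values occurs in practice, and the `series_count` helper is only an illustration.

```python
from math import prod

def series_count(label_values):
    """Upper bound on distinct series: the product of per-label value counts."""
    return prod(label_values.values())

sane = {"service": 40, "namespace": 8, "version": 5, "profile_type": 3}
print(series_count(sane))       # 4800

with_pod = {**sane, "pod": 3_000}
print(series_count(with_pod))   # 14400000
```

Plugging in the distinct-value count of any label you are considering shows its multiplier before you ship the relabel rule rather than after the bill arrives.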

One more anti-pattern: treating the flamegraph as proof rather than a hypothesis. A flamegraph shows correlation — this function consumed time — not causation. A wide frame might be doing exactly the work it should; the question is always whether the work is necessary, not merely whether it is large. Pair the profile with a clear performance question before you start optimizing, or you will spend a sprint shaving a function that was never the bottleneck.

Practical Recommendations

Start by deploying one agent in a non-critical namespace and confirm two things: profiles arrive, and symbolization works for your primary language. Do not roll out fleet-wide until you have seen a clean flamegraph of your own service, because fixing symbolization across a thousand nodes is far worse than fixing it on one. Pick the OpenTelemetry-aligned path where you can; the profiling signal is alpha but the round-trip guarantee with pprof means you are not betting on a single vendor.

Wire profiling into your release process. The differential flamegraph as a deploy gate (Figure 4) catches performance regressions that no metric will show until your bill arrives. Treat profiling cost like any other observability cost: cap retention, control label cardinality, and measure the agent’s own CPU and memory before scaling out. For the broader build-versus-buy reasoning on replacing agent-based APM with eBPF, the eBPF observability ADR lays out the decision; and if you are profiling AI workloads, pair this with inference cost optimization to turn flamegraph findings into real savings.

Resist the urge to optimize the first wide frame you see. The highest-leverage use of continuous profiling is not heroic point fixes but a steady feedback loop: each sprint, look at the top CPU and allocation consumers across your most expensive services, ask whether each is doing necessary work, and remove the cheapest large waste first. Tie the wins to dollars — a function that drops from 8% to 3% of fleet CPU is a node-count reduction you can put on a slide — and the practice funds itself. Link profiling to your incident reviews too: after a latency regression, the differential flamegraph between the healthy and degraded windows is often the fastest path to root cause, and capturing it becomes a standard postmortem artifact rather than a thing one engineer happened to check.

A short checklist:

  • [ ] Confirm kernel version and eBPF capabilities (BTF, CAP_BPF/CAP_PERFMON) on every node type.
  • [ ] Verify symbolization for each language in your fleet before fleet-wide rollout.
  • [ ] Enable frame pointers where you control the build; rely on DWARF where you do not.
  • [ ] Choose a pprof- or OTLP-Profiles-compatible backend to avoid lock-in.
  • [ ] Set sampling frequency low (19–100 Hz) for always-on; measure actual overhead.
  • [ ] Add a differential-flamegraph check to your deploy pipeline.
  • [ ] Budget object-storage retention and control label cardinality.

Frequently Asked Questions

What is continuous profiling and how is it different from regular profiling?

Regular profiling is an ad-hoc, short snapshot you capture when investigating a specific problem, usually in staging. Continuous profiling runs always-on across your whole fleet in production, sampling stack traces at low frequency and storing them as a queryable signal. The difference matters because performance problems often appear only under real production load and traffic patterns. With continuous data you can look back at any time window, compare releases, and find slow regressions that an occasional snapshot would miss entirely.

How much overhead does eBPF continuous profiling actually add?

In typical always-on configurations the overhead is around 1% of CPU, which is why eBPF profiling is considered safe for production. That figure depends on sampling frequency, average stack depth, and whether the profiler uses cheap frame-pointer unwinding or heavier DWARF unwinding. Low frequencies (19–100 Hz) keep the cost minimal, and they are still sufficient because statistical confidence accumulates over time and across cores. If you raise the frequency aggressively or profile very deep stacks with DWARF on every tick, overhead rises, so measure it in your own environment rather than trusting a single published number.

Do I need to change my application code to use eBPF profiling?

No. The defining advantage of eBPF-based profiling is that it requires zero application changes, no recompilation, and no restarts. The kernel samples stacks of unmodified processes, and a single node agent profiles every workload on that node regardless of language. You may need to enable frame pointers or ship debug information to get clean symbols, and interpreted languages need runtime-aware unwinders, but the application binary itself stays untouched. This is the key difference from runtime profilers like Go pprof or Java Flight Recorder, which require per-process integration.

What is the difference between frame pointer and DWARF stack unwinding?

Frame-pointer unwinding relies on a dedicated register that points at the current stack frame, where the caller’s frame pointer is saved next to the return address. Walking the stack is then a fast pointer chase from frame to frame, cheap enough to run on every timer tick. The catch is that compilers often omit frame pointers as an optimization, which breaks the walk. DWARF unwinding instead reads compiler-generated unwind tables to reconstruct the call chain even without frame pointers, so it works on any binary, but it is computationally heavier and needs access to those tables. Production eBPF profilers use frame pointers when available and fall back to DWARF when they are missing, getting the best of both.
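
The pointer chase can be illustrated with a toy model of stack memory. This is a simulation under an assumed x86-64 style layout (saved frame pointer at the frame address, return address eight bytes above it); the addresses are invented and no real process memory is read.

```python
def walk_frame_pointers(memory, fp, max_depth=64):
    """Walk a frame-pointer chain in simulated 64-bit stack memory.

    Layout assumed: memory[fp] holds the caller's saved frame pointer and
    memory[fp + 8] holds the return address into the caller.
    """
    return_addresses = []
    for _ in range(max_depth):              # bound the walk, as a real unwinder must
        if fp == 0 or fp not in memory or fp + 8 not in memory:
            break                           # null or unreadable frame ends the chain
        return_addresses.append(memory[fp + 8])
        next_fp = memory[fp]
        if next_fp != 0 and next_fp <= fp:  # stack grows down, so callers sit higher
            break
        fp = next_fp
    return return_addresses

stack = {
    0x7000: 0x7040, 0x7008: 0x401200,   # innermost frame
    0x7040: 0x7080, 0x7048: 0x401100,   # its caller
    0x7080: 0x0,    0x7088: 0x401000,   # outermost frame, end of chain
}
print([hex(a) for a in walk_frame_pointers(stack, 0x7000)])
```

Each step is two memory reads and a comparison, which is the whole reason this method is cheap; if a function in the chain was compiled without a frame pointer, the value read at that step is garbage and the walk stops or goes astray.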

Can eBPF profile Python, Java, and other interpreted languages?

It can, but not automatically. eBPF sees native stacks cleanly, so compiled languages like Go, Rust, and C unwind well. Interpreted and JIT-compiled languages run a bytecode interpreter or a JIT compiler, so the native stack shows runtime internals or unsymbolized generated code rather than your functions. Profilers need language-specific unwinders that understand the runtime’s frame structures to recover meaningful Python, Ruby, Node.js, or JVM stacks. Pyroscope and Parca support several of these, but coverage and quality vary by runtime and version, so always verify that you get real function names for your language before relying on it in production.

What is a differential flamegraph and when should I use one?

A differential flamegraph compares two profiling windows and colors each frame by how much its share of time changed — typically red for frames that grew and blue for those that shrank. It is the fastest way to answer “what changed?” after a deploy, a configuration shift, or a traffic pattern change. Use it as a release gate by comparing the window before a rollout to the window after, and use it during incidents by comparing a healthy period to the degraded one. The key caveat is to normalize by percentage rather than raw sample counts, so a busier window does not masquerade as a regression, and to treat a growing frame as a hypothesis to investigate rather than proof of a bug.

Should I use Pyroscope or Parca?

Both are strong open-source, eBPF-based continuous profilers, and the choice often comes down to ecosystem. Grafana Pyroscope integrates tightly with Grafana dashboards and Alloy, and its 2.0 architecture writes profiles directly to object storage for simpler operations at scale. Parca, from Polar Signals, pioneered DWARF-based eBPF unwinding and has a clean Kubernetes-native agent. If you already run Grafana, Pyroscope is the path of least resistance. Either way, prefer the pprof- and OTLP-Profiles-compatible setup so you can switch backends later without re-instrumenting.

By Riju — about