OpenTelemetry Collector Architecture: Pipelines, Processors, Exporters

OpenTelemetry Collector architecture is the single most misunderstood layer of the modern observability stack. Teams adopt the OTel SDKs in their services, point them at a vendor backend, and then discover six months later that they cannot rotate vendors, cannot tail-sample at scale, and have no place to enforce attribute hygiene before bills explode. The Collector is the answer to all three problems, and as of the v0.119+ release stream in 2026 it has finally matured into the seam where every signal (traces, metrics, logs, profiles) gets shaped before it hits storage. This post walks the Collector from the inside out: receivers, processors, exporters, the pipeline graph, the agent-vs-gateway deployment decision, the contrib-vs-core split, tail sampling math, multi-tenant fanout config, and the high-cardinality survival tactics that keep a busy gateway from OOM-ing at 2 a.m. It includes working YAML for every claim, plus five diagrams you can lift into your own design docs.

Architecture at a glance

[Figure: architecture diagram of the OpenTelemetry Collector (pipelines, processors, exporters)]

Why the OpenTelemetry Collector exists

The OpenTelemetry Collector exists because SDK-to-backend direct export couples your services to one vendor and one schema forever. The Collector inserts a vendor-neutral, in-network process that can re-encode, enrich, sample, drop, and fan-out telemetry without any application redeploy. It is the only place you can change observability strategy without touching service code.

Direct export from SDKs looks fine on day one. By month six you discover the limits. Service owners hardcode endpoint URLs. Attribute schemas drift between teams. The vendor agent runs out-of-process anyway and adds latency. Your bill scales with every cardinality mistake an intern makes in a new label. Switching vendors requires editing every service. The CNCF OpenTelemetry Collector documentation frames the Collector as the “vendor-agnostic implementation of how to receive, process and export telemetry data” — but the deeper reason it exists is operational sovereignty.

The Collector replaces vendor-specific agents (Datadog Agent, Splunk UF, New Relic Infra) with one process speaking OTLP. It runs on every host as an agent and at the cluster edge as a gateway. It accepts any input format your fleet still emits — Prometheus scrape, statsd, Jaeger, Zipkin, syslog, Kafka — and emits any output your storage accepts. Crucially, it owns the cardinality budget. If a developer pushes a label with 50 million unique values, the Collector is where you catch it, not your bill.

There is also a less-discussed reason: the Collector decouples release cycles. OTel SDKs ship inside service binaries; bumping them means re-deploying every service. Collector upgrades happen out-of-band. New processors, new sampling policies, new exporters all roll out by editing one YAML file and restarting one Deployment. That decoupling is why mature platform teams treat the Collector as part of the platform, not part of the application.

Core reference architecture: receivers, processors, exporters

The Collector is a directed graph of three node types. Receivers ingest data in some wire format. Processors transform, enrich, sample, or drop. Exporters serialize and send to a destination. A pipeline binds one or more receivers to a chain of processors to one or more exporters, scoped to a single signal: traces, metrics, logs, or profiles. The pipeline design itself is documented in the core opentelemetry-collector repository; most of the components named below live in opentelemetry-collector-contrib.

Three rules govern the graph:

A pipeline cannot mix signals — a traces pipeline cannot also handle metrics.
Processors execute in declared order, top to bottom, with no implicit reordering.
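Receivers and exporters referenced by the same name in several pipelines are shared as a single instance; processors are not, and each pipeline gets its own instance built from the same configuration. Do not define a second receiver on the same endpoint or you will get a port conflict.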
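
A minimal sketch of the third rule, assuming a hypothetical second OTLP listener named otlp/internal on port 14317 (neither the name nor the port comes from a real deployment): two receivers of the same type can coexist only if each has its own name and its own endpoint.

```yaml
receivers:
  # Referenced as "otlp" from any number of pipelines; still one listener.
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
  # Hypothetical second instance: type/name syntax, distinct port.
  otlp/internal:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14317
```

Binding both instances to 0.0.0.0:4317 would fail at startup because the address is already in use.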

Receivers — the ingest surface

The contrib distribution ships 90+ receivers. The list you will actually use in 2026 is short:

otlp — gRPC on 4317 and HTTP on 4318. The default for everything.
prometheus — scrapes Prometheus exposition format from a configured target list. The 2025 rewrite (prometheusreceiver) now supports OTLP-native metric translation including exemplars.
k8s_cluster — emits cluster-state metrics (pod phase, container restarts) from the Kubernetes API.
kubeletstats — pulls per-pod CPU/memory/network/IO from the kubelet summary API.
filelog — tails files, supports multiline parsing, container log paths, and stanza-style operators.
kafka — consumes traces, metrics, or logs serialized as OTLP from a Kafka topic. Critical for queue-buffered pipelines.
jaeger, zipkin, statsd — legacy ingest for migration windows.
hostmetrics — host-level CPU, memory, disk, network, processes. Required on agent deployments.

A minimal receivers block looks like this:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'k8s-pods'
          kubernetes_sd_configs:
            - role: pod
  hostmetrics:
    collection_interval: 30s
    scrapers:
      cpu: {}
      memory: {}
      disk: {}
      filesystem: {}
      network: {}

Processors — the value layer

Processors are where the Collector earns its keep. The processors you should know cold:

memory_limiter — enforces a soft and hard memory ceiling, dropping data when over. Mandatory in every production pipeline. Place it first.
batch — coalesces telemetry into export-sized batches. Place it last, immediately before the exporter.
attributes — add, update, hash, or delete attributes by key.
resource — same as attributes but scoped to resource-level attributes (host, service, k8s pod).
k8sattributes — enriches every span/metric/log with pod, namespace, node, and workload labels by watching the K8s API.
resourcedetection — fills in cloud-provider attributes (EC2 instance-id, GCP zone, Azure VM SKU) at startup.
tail_sampling — buffers complete traces and decides keep/drop based on rules across all spans.
transform — full OTTL (OpenTelemetry Transformation Language) for arbitrary attribute logic. The 2025-era replacement for the older metricstransform processor. (The old spanmetrics processor was superseded by the spanmetrics connector, not by transform.)
filter — drops telemetry matching an OTTL condition. Use for noisy health-check spans.
probabilistic_sampler — head sampler. Cheap but blind to span content.
cumulativetodelta — converts cumulative metrics to delta, required by some backends (Datadog).
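
As a rough illustration of how a few of these look in config, here is a sketch of a processors block. The instance names after the slash (healthchecks, env), the /healthz route, and every numeric value are assumptions chosen for the example, not tuned recommendations.

```yaml
processors:
  # Coalesce telemetry before export; sizes are illustrative only.
  batch:
    send_batch_size: 8192
    timeout: 5s
  # filter drops whatever matches the condition, here health-check spans.
  filter/healthchecks:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
  # Head sampling: keep roughly 10% of traces.
  probabilistic_sampler:
    sampling_percentage: 10
  # Stamp a resource-level attribute on everything passing through.
  resource/env:
    attributes:
      - key: deployment.environment
        value: prod
        action: upsert
  # Convert cumulative sums to deltas for backends that require it.
  cumulativetodelta: {}
```

A processor defined here does nothing until a pipeline lists it under service::pipelines.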

Exporters — the egress surface

Exporters serialize and ship. The shortlist:

otlp and otlphttp — the default. Targets Tempo, Mimir, Loki, Jaeger, any OTLP-compatible backend.
prometheusremotewrite — for Mimir, Cortex, Thanos, or a self-hosted Prometheus with remote_write enabled.
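loki — direct push to Loki for logs. Deprecated upstream in favor of otlphttp against Loki's native OTLP endpoint, which is what the logs pipeline below uses.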
datadog, splunk_hec, awsxray, googlecloud — vendor-native exporters.
kafka — re-publish OTLP to a topic. Required for fan-out and replay.
file — write to disk. Useful for debugging and disaster-recovery snapshots.

Most exporters that send over the network support the standard sending_queue (an in-memory FIFO that can be persisted to disk through a storage extension such as file_storage), retry_on_failure (exponential backoff), and timeout blocks. The combined behavior of queue, retry, and timeout is what makes the Collector a reliable buffer in front of flaky backends.
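
A sketch of those three blocks on one exporter, assuming a file_storage extension instance named file_storage/queue and a writable directory at /var/lib/otelcol/queue (both hypothetical). The intervals and queue size are placeholders to adapt, not recommendations.

```yaml
extensions:
  # Backs the sending queue with disk so it survives a restart.
  file_storage/queue:
    directory: /var/lib/otelcol/queue

exporters:
  otlp/tempo:
    endpoint: tempo.observability.svc:4317
    tls:
      insecure: true
    timeout: 10s              # per-request deadline
    retry_on_failure:
      enabled: true
      initial_interval: 5s    # first backoff step
      max_interval: 30s       # backoff ceiling
      max_elapsed_time: 300s  # give up on a batch after this long
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
      storage: file_storage/queue

service:
  # The extension must also be enabled here, not only defined above.
  extensions: [file_storage/queue]
```

Leaving out the storage key keeps the queue purely in memory, so queued data is lost on restart.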

Pipelines — the binding layer

Pipelines live under service::pipelines. Each signal gets its own pipeline. You can have N pipelines per signal.

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus, hostmetrics]
      processors: [memory_limiter, k8sattributes, resourcedetection, batch]
      exporters: [prometheusremotewrite/mimir]
    logs:
      receivers: [otlp, filelog]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlphttp/loki]

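Three pipelines, three signals, shared receivers and processors. The Collector instantiates each receiver and exporter once and shares it across the pipelines that reference it; a processor listed in several pipelines shares its configuration, but each pipeline gets its own instance.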

Agent vs gateway vs hybrid: choosing your topology

Pick agent-only for small clusters under 50 nodes; gateway-only for multi-cluster ingest; hybrid for everything in between. The hybrid pattern — DaemonSet agents that ship to a central Deployment gateway — is the 2026 default because it solves three problems at once: per-host enrichment, central policy enforcement, and tail-sampling correctness.

Agent pattern

Run a Collector on every host (DaemonSet in Kubernetes, systemd unit on VMs). Services inside the host export to localhost:4317. Pros: zero network hop, automatic host enrichment via hostmetrics and kubeletstats, no upstream single-point-of-failure during a network blip. Cons: cannot tail-sample (only sees fragments of a trace), pushes vendor credentials to every host, hard to enforce schema globally.

Use agent-only when the cluster is small, you only need head sampling, and you trust every node with vendor credentials.

Gateway pattern

Run a central Deployment of Collectors behind a service. Every SDK and every agent ships to the gateway. Pros: tail sampling works (gateway sees whole traces), vendor credentials live in one Secret, schema enforced in one place, easy to fan out to multiple backends. Cons: extra network hop, gateway is a critical-path dependency, must be scaled horizontally with care to avoid trace-fragmentation across replicas.

Use gateway-only when you must tail sample, when you fan out to more than one backend, or when compliance demands central PII scrubbing.

Hybrid (the production default)

Agents on every host do cheap local work — host metrics, log tailing, k8s enrichment, probabilistic head sampling. Gateways do expensive central work — tail sampling, vendor fan-out, cumulative-to-delta conversion, OTTL transformations. This is the topology the OpenTelemetry Operator ships templates for, and it is what every CNCF case study from 2024–2025 documents.

Trace-fragmentation is the gotcha. If the gateway has N replicas and a single trace’s spans hit different replicas, no replica sees the whole trace and tail sampling makes the wrong decision. The fix is loadbalancingexporter on the agents (or a dedicated front-end Collector layer) that hashes by trace-id so all spans for a trace land on the same gateway replica.

exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-gateway.observability.svc.cluster.local
        port: 4317

Deep dive: tail sampling, k8s enrichment, multi-tenant fanout

Tail sampling — the math

Head sampling decides keep/drop when the first span is created. It is blind: a 1% sample rate drops 99% of all traces including the rare error trace you needed. Tail sampling buffers complete traces (typically 30s window) and decides based on the full trace: keep all errors, keep all slow traces, keep 1% of healthy traces.

The tail_sampling processor in contrib supports composite policies:

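processors:
  tail_sampling:
    decision_wait: 30s                 # how long a trace is buffered before the decision
    num_traces: 100000                 # maximum traces held in memory at once
    expected_new_traces_per_sec: 1000  # sizing hint for internal data structures
    # Policies are OR-ed: a trace is kept if any one policy votes to sample it.
    policies:
      # Keep every trace that contains an error span.
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Keep every trace slower than one second.
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 1000}
      # Keep every trace from premium tenants.
      - name: keep-vip-tenant
        type: string_attribute
        string_attribute:
          key: tenant.tier
          values: [enterprise, platinum]
      # Keep 1% of everything else as a healthy baseline.
      - name: baseline-probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 1.0}
      # Both sub-policies must agree. Because policies are OR-ed, this keeps up
      # to 10 spans/s of health-check traffic on top of what the policies above
      # keep; it does not cap what they keep.
      - name: rate-limit-noisy
        type: and
        and:
          and_sub_policy:
            - name: noisy-route
              type: string_attribute
              string_attribute:
                key: http.route
                values: [/healthz, /metrics]
            - name: rate-limit
              type: rate_limiting
              rate_limiting: {spans_per_second: 10}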

The math: with num_traces: 100000 and 30s decision_wait, the in-memory buffer holds up to 100k traces. At 1000 traces/sec, the buffer turns over every 100 seconds — well above the 30s wait. Sizing wrong here is the #1 cause of gateway OOMs. Rule of thumb: num_traces >= expected_new_traces_per_sec * decision_wait * 2.
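
To make the rule of thumb concrete, here is the same sizing applied to a hypothetical gateway that sees 2,500 new traces per second (an assumed figure, not a measurement). The single policy is only there to keep the fragment complete.

```yaml
processors:
  tail_sampling:
    decision_wait: 30s
    expected_new_traces_per_sec: 2500
    # Rule of thumb: rate * wait * 2 = 2500 * 30 * 2 = 150000
    num_traces: 150000
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
```

At that rate the original num_traces of 100000 would fall short of the rule of thumb, which asks for at least 150000.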

Tail sampling forces gateway deployment and, as soon as the gateway runs more than one replica, trace-id-hashed load balancing. There is no way around it.

Resource detection + k8s enrichment

Two processors, two concerns. resourcedetection runs at startup and adds cloud-provider attributes from instance metadata services (IMDS). k8sattributes runs continuously and adds pod-level attributes by watching the Kubernetes API and matching by source pod IP.

processors:
  resourcedetection:
    detectors: [env, eks, ec2, k8snode]
    timeout: 2s
    override: false
  k8sattributes:
    auth_type: serviceAccount
    passthrough: false
    extract:
      metadata:
        - k8s.pod.name
        - k8s.pod.uid
        - k8s.deployment.name
        - k8s.namespace.name
        - k8s.node.name
        - k8s.pod.start_time
      labels:
        - tag_name: app.kubernetes.io/component
          key: app.kubernetes.io/component
          from: pod
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
      - sources:
          - from: connection

The pod_association block is the subtle part. The Collector identifies which pod sent a given OTLP message by either an attribute the SDK set (k8s.pod.ip) or by the source IP of the gRPC connection. The latter only works when the Collector receives the connection directly from the source pod, with no other Collector or proxy in between; in practice that means the node-local agent, which is another reason hybrid topologies dominate.
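
One way to split this work across a hybrid topology, sketched here as an assumption rather than the only layout: the agent runs k8sattributes in passthrough mode, which only tags each record with the sender's pod IP, and the gateway does the full metadata lookup keyed on that forwarded attribute.

```yaml
# Agent (DaemonSet): record the pod IP, skip the metadata lookup.
processors:
  k8sattributes:
    passthrough: true
---
# Gateway: full lookup, associated through the forwarded k8s.pod.ip.
processors:
  k8sattributes:
    passthrough: false
    pod_association:
      - sources:
          - from: resource_attribute
            name: k8s.pod.ip
```

The two documents belong in two separate Collector configs; they are shown together only for comparison.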

Multi-tenant fanout

Real platforms ship to more than one backend. You might send everything to a self-hosted Tempo + Mimir + Loki stack for engineers, mirror error traces to Datadog for the on-call SRE team, and stream a sampled copy to a Kafka topic for the data team’s analytics warehouse.

exporters:
  otlp/tempo:
    endpoint: tempo.observability.svc:4317
    tls: {insecure: true}
  datadog:
    api:
      site: datadoghq.com
      key: ${env:DD_API_KEY}
    traces:
      compute_stats_by_span_kind: true
  kafka/analytics:
    brokers: [kafka-1:9092, kafka-2:9092]
    topic: otel-traces-sampled
    encoding: otlp_proto

processors:
  filter/errors_only:
    error_mode: ignore
    traces:
      span:
        - 'status.code != STATUS_CODE_ERROR'

service:
  pipelines:
    traces/full:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [otlp/tempo]
    traces/datadog_errors:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, filter/errors_only, batch]
      exporters: [datadog]
    traces/analytics:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, probabilistic_sampler, batch]
      exporters: [kafka/analytics]

Three pipelines, one receiver, three exporters — the receiver fans out automatically. This is the pattern that lets you migrate vendors without service redeploys: add the new exporter, run both in parallel, cut over when you trust the data.

The filter/errors_only processor uses OTTL, the OpenTelemetry Transformation Language, which became the canonical way to express attribute and signal logic in 2024 and is now stable in v0.119. The transform processor, which runs arbitrary OTTL statements, is the single most powerful processor in the Collector. A few patterns worth memorizing:

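processors:
  transform/sanitize:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # Scrub credentials captured from request headers.
          - delete_key(attributes, "http.request.header.authorization")
          - delete_key(attributes, "http.request.header.cookie")
          # Truncate only when needed: Substring errors if the range exceeds the string length.
          - set(attributes["http.url"], Substring(attributes["http.url"], 0, 256)) where Len(attributes["http.url"]) > 256
          # Strip the query string.
          - replace_pattern(attributes["http.url"], "\\?.*$", "")
    metric_statements:
      - context: datapoint
        statements:
          # Backfill a default env label when none is set.
          - set(attributes["env"], "prod") where attributes["env"] == nil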

That single processor scrubs auth headers, truncates URLs to 256 bytes (cardinality kill switch), strips query strings, and backfills missing env labels. Every one of those four behaviors maps to an outage I have seen in production. OTTL replaced a half-dozen single-purpose processors and is worth learning end-to-end.

For deeper observability patterns at the cluster edge, see the companion deep-dive on OpenTelemetry instrumentation for industrial IoT observability and the eBPF observability stack with Pixie and Cilium tutorial.

Contrib vs Core, and the Operator

The Collector ships as two distributions. otelcol (Core) contains a minimal vetted set — OTLP, batch, memory_limiter, attributes, a handful of exporters. otelcol-contrib (Contrib) contains everything: 90+ receivers, 40+ processors, 50+ exporters. Production almost always uses Contrib or a custom distribution built with ocb (OpenTelemetry Collector Builder) that pulls in only the components you actually use. Smaller binary, smaller attack surface.
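
A sketch of what an ocb manifest for the hybrid topology in this post might look like. The distribution name, output path, and the v0.119.0 version pins are assumptions; pin whichever release you actually run, and keep core and contrib modules on the same version.

```yaml
dist:
  name: otelcol-custom
  description: Custom Collector with only the components in use
  output_path: ./dist

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.119.0

processors:
  - gomod: go.opentelemetry.io/collector/processor/memorylimiterprocessor v0.119.0
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.119.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/k8sattributesprocessor v0.119.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor v0.119.0

exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlpexporter v0.119.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/loadbalancingexporter v0.119.0
```

Saved under a name of your choosing (for example builder-config.yaml), the manifest is passed to the builder with ocb --config builder-config.yaml.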

The OpenTelemetry Operator (v0.110+ in 2026) manages Collector and instrumentation lifecycle in Kubernetes via two CRDs:

OpenTelemetryCollector — declarative Collector deployments with modes daemonset, deployment, sidecar, or statefulset.
Instrumentation — auto-injects language-specific SDK agents (Java, Python, Node.js, .NET, Go) into pods annotated with instrumentation.opentelemetry.io/inject-java: "true" (or the matching annotation for another language).

The auto-injection pattern is the closest thing to “magic” the project has. Annotate a pod, and the Operator’s mutating admission webhook mounts the agent jar and sets JAVA_TOOL_OPTIONS. Zero code changes.
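
A minimal sketch of the two objects involved. The shop namespace, the checkout workload, the image reference, the otel-agent endpoint, and the sampler ratio are all hypothetical. The endpoint uses port 4318 on the assumption that the injected Java agent exports OTLP over HTTP by default.

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
  namespace: shop
spec:
  exporter:
    endpoint: http://otel-agent.observability.svc:4318   # assumed agent Service
  propagators: [tracecontext, baggage]
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
  namespace: shop
spec:
  selector:
    matchLabels: {app: checkout}
  template:
    metadata:
      labels: {app: checkout}
      annotations:
        # Goes on the pod template, not on the Deployment's own metadata.
        instrumentation.opentelemetry.io/inject-java: "true"
    spec:
      containers:
        - name: checkout
          image: registry.example.com/checkout:1.0.0
```

The annotation is read at pod admission, so existing pods pick it up only after they are recreated.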

OTel Profiles — the new fourth signal

OpenTelemetry Profiles reached general availability in late 2025 as the fourth core signal after traces, metrics, logs. The signal is continuous profiling — CPU and memory profiles collected at low overhead via eBPF, encoded in pprof-compatible format, shipped over OTLP. The Collector v0.118+ supports a profiles pipeline alongside traces/metrics/logs.

receivers:
  otlp:
    protocols:
      grpc: {endpoint: 0.0.0.0:4317}
exporters:
  otlp/pyroscope:
    endpoint: pyroscope.observability.svc:4040
service:
  pipelines:
    profiles:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/pyroscope]

The eBPF profiler (Elastic’s contributed profiler is the upstream choice) runs as a single privileged DaemonSet pod per node and profiles every process without per-app instrumentation. For teams already running eBPF-based observability (see the architectural decision record on replacing APM with eBPF on Kubernetes), profiles slot in naturally.

Trade-offs and failure modes

The Collector is not a free upgrade. Six failure modes to plan for.

1. The OOM cascade. Tail sampling buffers traces in memory. A traffic spike, an unbounded label, or a misconfigured num_traces and the gateway OOMs. The pod restarts, drops its buffer, and the load balancer routes traffic to the remaining replicas which then OOM in turn. memory_limiter is the only thing standing between you and this cascade. Configure it with check_interval: 1s, limit_percentage: 80, spike_limit_percentage: 25 and accept that under pressure the Collector drops data rather than dies.

2. Trace fragmentation. Without trace-id-aware load balancing, tail sampling across N gateway replicas gets the keep/drop decision wrong for every distributed trace. The symptom is “errors disappear randomly.” The fix is loadbalancingexporter on a front-end Collector layer.

3. Cardinality explosion at the exporter. A new attribute with high cardinality (user-id, request-id, URL with query string) blows up Prometheus or Mimir indexes. The Collector cannot detect this for you. Add an attributes/cardinality processor that deletes known dangerous keys before the metrics exporter. Hashing a key hides its value but leaves the number of distinct values unchanged, so it helps with privacy, not with cardinality.

4. OTLP version drift. SDK and Collector OTLP versions are independent. v1.0 changed the protobuf for logs; v1.3 added profiles. Pin your contrib version and your SDK versions together, upgrade quarterly. Skipping versions is fine; running mismatched majors is not.

5. Pipeline misordering. batch before tail_sampling defeats tail sampling because the batch processor breaks trace coherence. memory_limiter after tail_sampling does nothing — the tail buffer has already consumed memory. There is no validator for this. Read your YAML carefully.

6. Vendor exporter coupling. Some vendor exporters (Datadog, Splunk HEC) do significant in-process transformation that breaks if you swap the upstream signal shape. Test exporter changes against the actual vendor backend before rolling them to production. Use the file exporter to capture golden outputs.
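
Failure modes 1, 3, and 5 above are the ones that config alone can address. A sketch follows, assuming hypothetical attribute names (request.id, user.id) and assuming the otlp receiver and the prometheusremotewrite/mimir exporter are defined elsewhere in the same file.

```yaml
processors:
  # Failure mode 1: shed load under memory pressure instead of crashing.
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  # Failure mode 3: remove known high-cardinality keys before export.
  attributes/cardinality:
    actions:
      - key: request.id   # hypothetical attribute name
        action: delete
      - key: user.id      # hypothetical attribute name
        action: delete
  batch: {}

service:
  pipelines:
    metrics:
      receivers: [otlp]
      # Failure mode 5: memory_limiter first, batch last.
      processors: [memory_limiter, attributes/cardinality, batch]
      exporters: [prometheusremotewrite/mimir]
```

The list of keys has to be maintained by hand, so treat it as a denylist that grows with each incident.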

Sizing the gateway — concrete numbers

There is no substitute for load-testing your own workload, but the 2024–2025 community benchmarks give you a starting point. A single Collector replica on 2 vCPU and 4 GiB of memory running the standard pipeline (memory_limiter, k8sattributes, batch, otlp exporter) handles roughly 50,000 spans per second of OTLP-gRPC ingress with sub-100 ms P99 latency through the pipeline. Add tail_sampling with a 30 s decision window and the same hardware drops to about 15,000 spans per second because the per-trace state machine becomes the bottleneck. Add transform with five OTTL statements and you lose another 20%.

Memory scales linearly with num_traces in the tail buffer. A trace averaging 20 spans at 300 bytes each costs 6 KB. At num_traces: 100000 that is roughly 600 MB just for the buffer, before processor overhead. Plan for 2x that as headroom and let memory_limiter enforce the ceiling.

Network is rarely the bottleneck for traces but can be for metrics. prometheusremotewrite with compression sends roughly 1 byte per sample including labels; a million active series scraped at a 15 s interval is about 67,000 samples per second, so about 67 KB/s sustained, trivial for any cluster network. Without compression it is 5–10x higher and you will see it in your CNI metrics.

Practical recommendations

Build your Collector deployment in this order:

Start with the OpenTelemetry Operator on a non-prod cluster. Deploy a daemonset mode Collector first, get host metrics flowing.
Add a deployment mode gateway. Point agents at it via otlp exporter.
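Enable resourcedetection on the agents, since it describes the host the Collector itself runs on. Run k8sattributes on the agents too, or on the gateway if the agents forward k8s.pod.ip (for example with passthrough: true) so that pod association still works there.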
Add loadbalancingexporter between agents and gateway as soon as you scale the gateway past one replica.
Add tail_sampling only after the gateway is stable for a week under real load.
Add a second exporter for vendor fan-out before you commit to any vendor.
Add memory_limiter as the first processor in every pipeline. Non-negotiable.
Run otelcol-contrib validate --config <your-config-file> in CI on every config change.
Monitor the Collector with itself — the otelcol_* self-metrics tell you receive rate, drop rate, queue depth, sampling rate.
Build a custom distribution with ocb once you have stabilized the component list. Cuts the binary 60% and removes unused attack surface.

FAQ

What is the difference between OTel agent and gateway?

An OTel agent runs as a DaemonSet or sidecar next to your applications, doing per-host work like log tailing, host metrics, and local enrichment. An OTel gateway runs as a central Deployment that receives data from agents and SDKs, performs expensive central operations like tail sampling and vendor fan-out, and ships to backends. Most production deployments run both in a hybrid topology, with agents feeding a gateway behind a load balancer.

When should I use tail sampling versus head sampling?

Use head sampling when you need predictable cost and do not care about preserving rare events — a 1% probabilistic sampler is simple and cheap. Use tail sampling when you must keep all error traces, all slow traces, or all traces from premium tenants. Tail sampling forces gateway deployment and trace-id-aware load balancing because the decision needs the complete trace. Most teams run both: head sample at the agent for baseline reduction, tail sample at the gateway for intelligent retention.

Do I need the Contrib distribution or is Core enough?

Core ships only a vetted minimum: OTLP, batch, memory_limiter, attributes, and a few exporters. Production teams almost always need Contrib for k8sattributes, tail_sampling, prometheus receiver, resourcedetection, transform (OTTL), and vendor-specific exporters like Datadog or Splunk HEC. The best practice in 2026 is to build a custom distribution with the OpenTelemetry Collector Builder that pulls only the components you actually deploy, giving you a smaller binary and a smaller attack surface than Contrib.

How do I prevent the Collector from running out of memory?

Always place memory_limiter as the first processor in every pipeline with check_interval: 1s, limit_percentage: 80, spike_limit_percentage: 25. Size tail_sampling’s num_traces to at l
