Edge Observability ADR: OpenTelemetry vs Prometheus+Loki

Choosing an edge observability architecture is one of the highest-leverage, hardest-to-reverse decisions a platform team makes. At the edge you have constrained CPU, scarce memory, intermittent links, and thousands of identical nodes. Get the telemetry stack wrong and you either go blind during the exact incidents you deployed edge compute to handle, or you drown your backhaul and your bill in low-value data. This post is written as an Architecture Decision Record (ADR): two real options, the criteria that separate them, a default recommendation, and the consequences you sign up for either way.

The two contenders are an OpenTelemetry-centric stack (an OTel Collector running at the edge) and a Prometheus + Loki stack (Prometheus for metrics, Loki for logs, with an agent on each node). Both are CNCF projects. Both are production-grade. They are not equivalent.

What this covers: context and constraints, decision drivers, both architectures in detail, a context-dependent recommendation, the gotchas that bite at 3 a.m., and a checklist you can act on.

TL;DR

  • OpenTelemetry-centric: One Collector, one protocol (OTLP), all three signals. Pipeline-at-the-source filtering, vendor-neutral export, native traces. Costs you a pipeline to operate and a faster release cadence to track.
  • Prometheus + Loki: Two mature pipelines your team likely already knows. Lean agent-mode metrics with WAL buffering, cheap label-only log indexing. No native tracing; add Tempo or Jaeger for that.
  • Signal coverage is the sharpest divide. OpenTelemetry is signal-agnostic. Prometheus + Loki is metrics and logs only.
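  • Buffering is the most under-tested criterion. Both stacks buffer to disk and both are bounded, so a long enough outage loses data in either: Prometheus truncates its oldest WAL segments, while the Collector’s queue by default rejects new data once it is full. Size for the worst outage, not the average.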
  • Cardinality is the cost lever. Drop and aggregate at the edge in either stack before bytes cross a metered link.
  • Default recommendation: a hybrid, with an OpenTelemetry Collector at a per-site gateway exporting to Prometheus-compatible metrics storage and Loki. Deviate to plain Prometheus + Loki for metrics-and-logs-only fleets on very constrained leaves.

Terminology primer

A few terms recur below and are worth grounding before the architecture sections.

OTLP (OpenTelemetry Protocol): the vendor-neutral wire format OpenTelemetry uses to carry metrics, logs, and traces. One protocol for all three signals is the property that makes the Collector flexible.

Collector pipeline: the receiver-processor-exporter chain inside an OpenTelemetry Collector. Receivers ingest, processors transform, exporters ship. The pipeline is where edge filtering and buffering live.

remote_write: the Prometheus mechanism for forwarding scraped samples to a remote backend, buffered by a write-ahead log so a link or backend outage does not immediately lose data.

WAL (write-ahead log): an on-disk buffer that persists data before it is forwarded, so a crash or outage can replay rather than drop. Both Prometheus remote_write and the Collector’s persistent queue rely on disk-backed buffering.

Cardinality: the number of unique label or attribute combinations in your telemetry. High cardinality is the dominant scaling and cost hazard for metrics systems.

Tail-based sampling: deciding whether to keep a trace after seeing all its spans, so you can keep the slow or errored traces. It requires buffering the whole trace, which is memory-hungry.
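
To make the cardinality definition above concrete, here is a minimal sketch that estimates an upper bound on series for one metric as the product of distinct values per label. The label names and counts are invented for illustration; they are not measurements from any fleet.

```python
from math import prod

def series_count(label_values: dict[str, int]) -> int:
    """Upper bound on unique series for one metric: product of distinct values per label."""
    return prod(label_values.values())

# Hypothetical fleet; every number here is an assumption.
bounded = {"site": 200, "device": 10, "firmware": 4}
unbounded = {**bounded, "request_id": 10_000}   # one unbounded label added

print(series_count(bounded))     # 8000
print(series_count(unbounded))   # 80000000
```

A single unbounded label turns a manageable metric into tens of millions of potential series, which is why the sections below keep returning to dropping such labels at the source.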

Context and problem statement

Edge fleets break the assumptions that cloud observability quietly relies on. A datacenter agent runs on a fat host with a reliable, low-latency link to its backend. An edge agent runs on a gateway, an industrial PC, or an ARM box with maybe 1-2 vCPU and a gigabyte of RAM, sharing that hardware with the workload it is supposed to observe. It connects over cellular, satellite, or a flaky site uplink that drops for minutes or hours. Multiply that node by a few thousand and three problems dominate.

First, footprint. Every megabyte of RAM and every percent of CPU your telemetry agent consumes is stolen from the workload. The agent must be lean and bounded, not best-effort.

Second, connectivity. The link will fail. When it does, telemetry has to buffer locally and replay on reconnect without losing the data that explains the outage. An architecture that assumes a live connection to its backend is disqualified before you start.

Third, cardinality and cost. Edge fleets generate enormous label cardinality: per-device, per-sensor, per-firmware-version series multiplied across the fleet. Unbounded cardinality is the classic way to melt a Prometheus server or run up a vendor bill. The Prometheus documentation is explicit that high-cardinality labels are a primary scaling hazard, and warns against labels like user IDs or unbounded identifiers (Prometheus instrumentation docs). The OpenTelemetry project frames the same problem as a pipeline concern, with the Collector positioned as the place to filter, aggregate, and drop before data ever crosses the WAN (OpenTelemetry Collector docs).

There is a fourth, quieter constraint: scale of sameness. A thousand near-identical nodes means a config change is a fleet operation. A processor you add, a buffer you resize, or a label you drop has to roll out like firmware, with staged rollout and rollback. This favors architectures whose behavior is configuration-driven and versionable over those tuned per node by hand. It also raises the stakes on getting the defaults right, because a bad default is now a bad default times a thousand.

The decision, then, is not “which tool is better.” It is “which architecture survives constrained hardware, hostile networks, runaway cardinality, and fleet-wide rollout while still answering the questions on-call asks.” Figure 1 maps those constraints onto the edge-to-cloud path.

Figure 1: The edge observability problem space. Constrained nodes, intermittent backhaul, high cardinality, and egress cost all squeeze the telemetry path between the fleet and the backend.

Decision drivers

An ADR lives or dies on its criteria. We weight seven, and we state them up front so the recommendation is auditable rather than aesthetic. Figure 2 lays them out as the scorecard we apply to both options. The point of writing them down is that six months from now, when someone questions the decision, the argument is reconstructable. You can see which axis was weighted heavily, and whether the context that justified that weighting still holds. A recommendation without explicit criteria is just a preference, and preferences do not survive a fleet migration debate.

1. Footprint at the edge. Resident memory, CPU under load, and binary size on constrained nodes. A 50 MB ceiling on a gateway is a different world from a 500 MB allowance on a fat host. We care about steady-state and about behavior under backpressure, when buffers fill.

2. Offline buffering. What happens when the uplink drops for an hour. Does telemetry persist to disk, bound itself, and replay in order on reconnect? Or does it drop silently or balloon until the process is OOM-killed? This is the single most important edge criterion and the one most often ignored in datacenter-shaped comparisons.

3. Signal coverage. Metrics, logs, and traces. This is where the two options diverge most sharply. OpenTelemetry is signal-agnostic by design: one Collector, one wire protocol (OTLP), all three signals. Prometheus plus Loki covers metrics and logs; distributed tracing needs a third component such as Grafana Tempo or Jaeger. If you need traces at the edge, that asymmetry matters.

4. Cardinality control. Where and how you can drop, aggregate, or relabel high-cardinality data before it crosses the WAN. Doing this at the edge protects both the link and the backend.

5. Vendor neutrality. Can you change backends later without re-instrumenting every device. OTLP is a vendor-neutral protocol with broad backend support; Prometheus remote_write is also widely supported. Both score well, but the lock-in surfaces differ.

6. Operational burden. How many moving parts, how the config is shaped, how upgrades roll across a fleet, and how much your team already knows. Familiarity is a real cost, not a footnote.

7. Cost. Egress bytes over metered links plus backend ingest and storage. At fleet scale, the bytes you drop at the edge are the line item you actually control.

Figure 2: The seven decision drivers. Each option is judged against the same scorecard; no single axis decides the outcome.

These drivers are deliberately ordered by how hard they are to fix later. Footprint and buffering are architectural; cost and ops burden can be tuned. Weight them for your context, not ours.

To make the comparison legible at a glance, here is how the two options tend to land against each driver. Treat it as directional, not absolute, because your hardware budget and team skills move the needle.

Driver              | OpenTelemetry-centric           | Prometheus + Loki
Footprint           | Good (custom build) to moderate | Lean in agent mode
Offline buffering   | Strong (persistent queue)       | Strong for metrics (WAL); separate for logs
Signal coverage     | Metrics, logs, traces           | Metrics and logs; traces need Tempo or Jaeger
Cardinality control | At source in pipeline           | At source via relabeling
Vendor neutrality   | Strong (OTLP)                   | Strong (remote write)
Ops burden          | New pipeline to learn           | Familiar to most teams
Cost                | Controlled at source            | Controlled at source

The table hides the most important fact, which is that the rows are not independent. Choosing OpenTelemetry for signal coverage drags ops burden up. Choosing Prometheus for familiarity locks tracing out unless you add a component. An ADR exists precisely to surface those couplings rather than scoring each axis in isolation.

Option A: OpenTelemetry-centric (Collector at the edge)

The OpenTelemetry-centric design puts an OTel Collector on each edge node, or on a per-site gateway aggregating several nodes. Applications and the host emit metrics, logs, and traces in OTLP. The Collector receives them, processes them through a pipeline, and exports them upstream over OTLP or to any supported backend. Figure 3 shows the topology.

The Collector pipeline is the heart of the design: receivers ingest data, processors transform it, and exporters ship it out. For the edge, three processors carry the weight. The batch processor groups telemetry to cut request overhead, which matters when every byte over a metered link costs money. The memory_limiter processor caps RAM and applies backpressure before the process is OOM-killed, which is essential on a 1 GB node. Filtering and attribute processors drop or aggregate high-cardinality data at the source, addressing driver 4 before bytes leave the box. For durability under driver 2, exporters can be wrapped with a persistent queue (a file-backed sending queue) so telemetry survives a link outage and replays on reconnect rather than evaporating.

There is a deliberate tiering choice inside this design. The Collector can run as a thin agent on the leaf and as a heavier gateway at the site, and the two run the same binary with different configs. Leaf agents do the cheap, local work: receive, batch, head-sample, drop the obviously noisy attributes, and forward. The gateway does the expensive, fleet-aware work: tail-based sampling that needs the whole trace in memory, cross-node aggregation, and the largest persistent buffers. This split keeps the per-node footprint honest while still giving you smart processing somewhere in the path. It also means a single config repository, versioned and rolled out like firmware, governs the whole fleet’s behavior.

Pros. One agent for all three signals, so you instrument once and route everywhere. OTLP is vendor-neutral, so swapping backends is a Collector config change, not a fleet re-instrumentation. Cardinality control, tail processing, and redaction all live in a pipeline you can version and roll out. The Collector ships in a slim distribution and supports building a custom binary with only the components you need, trimming footprint and attack surface.

Cons. It is a pipeline you now operate. Mis-sized memory_limiter or batch settings cause drops or backpressure that are non-obvious to debug. Tail-based sampling for traces needs the full trace in memory, which is awkward on constrained nodes and usually belongs at a gateway tier, not the leaf. The ecosystem moves fast; some receivers and processors are less battle-tested than Prometheus internals that have run for a decade. And a single agent for everything is a single point of failure if you do not supervise it well.

A small configuration sketch makes the edge shape concrete. The pipeline below caps memory, batches, drops a noisy high-cardinality attribute, and wraps the exporter in a persistent queue.

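extensions:
  file_storage:                           # backs the persistent queue (contrib distribution)
    directory: /var/lib/otelcol/queue     # assumed path; must exist and be writable

receivers:
  otlp:
    protocols:
      grpc:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 200
  batch:
    send_batch_size: 512
  attributes/drop:
    actions:
      - key: request_id
        action: delete

exporters:
  otlp:
    endpoint: gateway.internal:4317
    sending_queue:
      enabled: true
      storage: file_storage   # disk-backed, survives restarts

service:
  extensions: [file_storage]              # the queue's storage extension must be enabled here
  pipelines:
    traces:                               # repeat for metrics and logs as needed
      receivers: [otlp]
      processors: [memory_limiter, attributes/drop, batch]   # memory_limiter goes first
      exporters: [otlp]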

The limit_mib ceiling is the line between a graceful agent and one that takes the workload down with it. The file_storage-backed sending_queue is what lets telemetry survive a multi-hour link outage and replay on reconnect. Drop request_id and similar unbounded attributes at the source, and you cut both backend cardinality and egress in one move.

The OpenTelemetry Collector is positioned by the project as the recommended way to receive, process, and export telemetry without locking into one vendor (OpenTelemetry Collector docs). At the edge, that pipeline-at-the-source property is exactly what you want. One more edge-specific note: the Collector supports a probabilistic head sampler on leaf nodes and a tail sampler at the gateway, so you keep cheap, fast sampling near the device and reserve the memory-hungry, smarter sampling for where there is RAM to spare.
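
The idea behind cheap head sampling can be sketched in a few lines. This is an illustration of deterministic, hash-based sampling in general, not the Collector's own algorithm; the function name keep_trace and the trace ID format are hypothetical.

```python
import hashlib

def keep_trace(trace_id: str, percentage: float) -> bool:
    """Deterministic head sampling: hash the trace ID into a bucket in [0, 10000).

    Any node that sees the same trace_id reaches the same decision, so a trace
    is kept or dropped as a whole without buffering its spans.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percentage * 100

# Hypothetical trace IDs; sample roughly 10 percent of them.
ids = [f"trace-{i}" for i in range(10_000)]
kept = sum(keep_trace(t, 10.0) for t in ids)
print(f"kept {kept} of {len(ids)} traces")
```

The decision needs only the trace ID, so memory use stays constant no matter how many traces are in flight. That is the property that makes head sampling suitable for a leaf node.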

Figure 3: OpenTelemetry-centric topology. A Collector per node or per site runs a receiver-processor-exporter pipeline with a persistent queue, exporting OTLP upstream.

Option B: Prometheus + Loki (plus agent)

The Prometheus + Loki design treats metrics and logs as separate, mature pipelines. For metrics, a lightweight agent runs on or near each node. Prometheus in agent mode scrapes local targets and forwards via remote_write to a central or regional Prometheus-compatible backend; it does not store or query locally, which keeps the edge footprint small. For logs, a collector such as Grafana Alloy or Promtail tails files and journals and pushes to a central Loki instance. Figure 4 shows the split.

This is the model most ops teams already know. Prometheus scrapes targets on an interval and exposes a battle-hardened metrics model; remote_write carries samples upstream with a write-ahead log (WAL) buffering on disk so a backend or link outage does not immediately lose data. Loki indexes only labels, not log contents, which keeps its index cheap and its storage object-store-friendly. For metrics-and-logs fleets, the combination is lean, well-understood, and inexpensive to run.

The agent-mode detail is what makes this viable at the edge. A full Prometheus server keeps a local time-series database and serves queries, which is too heavy for a constrained gateway. Agent mode strips that out: it scrapes, buffers to the WAL, and forwards via remote_write, doing no local storage or querying. The result is a small, predictable footprint that still gives you the entire Prometheus exporter ecosystem and PromQL on the backend side. For logs, Grafana Alloy or Promtail tails files and the systemd journal, attaches a small set of labels, and pushes to Loki. The labels are the only thing Loki indexes, so cardinality discipline applies to logs too: a label per request ID will hurt Loki the same way it hurts Prometheus.

Pros. Maturity and ubiquity: your team likely knows PromQL, the exporters exist for nearly everything, and the failure modes are documented to death. Agent mode plus remote_write is a genuinely good edge metrics path, with the WAL providing on-disk buffering across outages. Loki’s label-only index keeps log cost low. Cardinality limits are first-class in Prometheus, and remote_write supports relabeling to drop series before they leave the node.

Cons. Two pipelines, two agents, two mental models. No native distributed tracing; if you need traces you add Grafana Tempo or Jaeger, which is the asymmetry against OpenTelemetry. The buffering story is metrics-centric: the WAL protects remote_write, but log buffering depends on your log agent’s own queue and disk settings, which you must size separately. And cardinality still bites: Prometheus scaling guidance is explicit that high-cardinality labels are the dominant failure mode, so you must enforce limits and relabeling discipline at the edge (Prometheus instrumentation docs).

A concrete edge configuration makes the metrics path tangible. Prometheus agent mode forwards via remote_write with relabeling that drops noisy series before they leave the node.

remote_write:
  - url: https://central.example/api/v1/write
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "go_gc_.*"        # drop noisy runtime series at the edge
        action: drop
    queue_config:
      capacity: 10000
      max_shards: 4

The queue_config governs how remote_write buffers and shards under backpressure, and the WAL underneath it persists samples across an outage. The write_relabel_configs block is your cardinality and egress control: drop or rewrite series at the source, before the metered link.
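
A quick way to reason about what a drop rule will match is to simulate it. The sketch below mimics the rule above in Python, assuming the documented Prometheus behavior that relabel regexes are anchored at both ends. Python's re module is not RE2, so treat it as an approximation; the metric names are examples.

```python
import re

def apply_drop_rule(metric_names, pattern):
    """Mimic a write_relabel_configs rule with action: drop on __name__.

    Relabel regexes are fully anchored, so fullmatch is the closest analogue.
    """
    rx = re.compile(pattern)
    return [name for name in metric_names if not rx.fullmatch(name)]

names = [
    "go_gc_duration_seconds",
    "go_goroutines",
    "node_cpu_seconds_total",
    "my_go_gc_custom",
]
print(apply_drop_rule(names, "go_gc_.*"))
# ['go_goroutines', 'node_cpu_seconds_total', 'my_go_gc_custom']
```

Because of the anchoring, my_go_gc_custom survives: the pattern must match the whole metric name, not a substring of it.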

Note one subtlety. Prometheus can ingest OTLP metrics, and Grafana Alloy can both receive and export OTLP, so the line between the two options blurs in practice. You can run a Prometheus-shaped metrics path and still receive OTLP. That seam is exactly what the recommendation below exploits, letting you adopt OpenTelemetry as the collection plane while keeping Prometheus and Loki as backends. The choice is therefore less a fork in the road than a question of which plane you standardize on first.

Figure 4: Prometheus plus Loki topology. Prometheus agent mode forwards metrics via remote write with WAL buffering; a log agent ships to Loki. Tracing requires a separate component.

Decision and consequences

Decision. For a general-purpose edge fleet in 2026, default to an OpenTelemetry Collector at a per-site gateway tier, with leaf nodes emitting OTLP and the gateway running the heavyweight processing (filtering, aggregation, tail-sampling, persistent queue). Export metrics to a Prometheus-compatible backend and logs to Loki or any OTLP-native store. This is a hybrid: OpenTelemetry as the collection and routing plane, the Prometheus and Loki ecosystem as backends you already trust. Figure 5 shows the reference topology.

Why this default. It satisfies the criteria that are hardest to reverse. Signal coverage is solved once: one wire protocol carries metrics, logs, and traces, so adding tracing later is a config change rather than a new agent fleet. Vendor neutrality is structural, because OTLP decouples instrumentation from backend. Cardinality control and a persistent sending queue live in a pipeline you version and roll out. Putting the Collector at a gateway tier keeps leaf-node footprint minimal while concentrating memory-hungry work, like tail-based sampling, where there is RAM to spare.

When to deviate. Choose plain Prometheus + Loki under three conditions. Your fleet is metrics-and-logs only with no near-term tracing need. Your team already operates Prometheus fluently. And your leaf nodes are too constrained to run anything but the thinnest agent. In that world, Prometheus agent mode plus a log shipper is leaner to operate, and you pay for nothing you do not use. Adding OpenTelemetry there is complexity without payoff. Conversely, if you need traces from day one, or you expect to switch backends, lead with the Collector.

Positive consequences. One instrumentation contract across the fleet. Backend portability. Cardinality and cost controlled at the source. A clean path to add traces.

Negative consequences. You now operate a Collector pipeline and must master its tuning. There is a learning curve for teams steeped in Prometheus. A gateway tier is another component to make highly available. And the Collector’s pace of change means you will track releases more actively than you track Prometheus.

Status and scope. This ADR records a default for a general-purpose fleet, not a mandate. It is reversible at the backend layer, because OTLP decouples instrumentation from storage, but expensive to reverse at the collection layer, because that touches every node. We therefore commit to the collection plane deliberately and keep the backend choice loose. Revisit this decision when fleet hardware shifts materially, when a tracing requirement appears or disappears, or when the team’s operational fluency changes. Record the revisit as a superseding ADR rather than an edit, so the history of why stays intact.

One implementation note keeps the migration sane. Because the Collector can export to Prometheus-compatible backends and Grafana Alloy can export OTLP, you can adopt the Collector gateway first and leave existing edge agents in place: Alloy instances can forward OTLP to the gateway, while Prometheus agents keep writing to the same backend over remote_write until you retire them. That lets you stand up the OpenTelemetry plane without a flag-day rewrite of every node, then migrate leaves to native OTLP emission on your own schedule. The gateway becomes the stable seam that absorbs the transition.

Figure 5: Recommended hybrid. Leaf nodes emit OTLP to a per-site Collector gateway that filters, samples, buffers, and exports to Prometheus-compatible metrics storage and Loki logs.

The cost model nobody puts in the diagram

Architecture diagrams show data flowing; they do not show the invoice. At fleet scale the cost of edge observability has three components, and only one of them is the backend storage line that finance usually scrutinizes. The first is egress over metered links: cellular and satellite uplinks charge per byte, so every metric, log, and span you forward is a recurring cost multiplied across the fleet. The second is backend ingest and cardinality: most managed metrics backends price on active series, and unbounded labels make that line grow multiplicatively, because every new label value multiplies the number of series. The third is storage and retention: logs and traces are voluminous, and naive full retention is rarely worth it.

The architectural lever for all three is the same: drop, aggregate, and sample at the edge, before the WAN. A metric you summarize at the gateway into a histogram costs a fraction of a metric you ship raw. A trace pipeline that tail-samples to keep only the slow and errored traces can cut span volume substantially, in proportion to how rare those traces are in your workload. A log line you drop at the source never enters Loki’s index. This is why both options earn a good score on cost in the table above: each gives you source-side controls. The difference is where those controls live and how uniformly you can apply them. The Collector pipeline applies one filtering language to all three signals, while the Prometheus path uses relabeling for metrics and a separate config for logs. Neither is wrong, but the single pipeline is easier to reason about when the bill arrives.
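
A back-of-the-envelope model shows how much the edge-side lever is worth. This is a minimal sketch: the fleet size, byte rate, keep fraction, and per-GB price are all assumptions chosen for illustration, so substitute your own measurements.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600

def monthly_egress_gb(nodes: int, bytes_per_sec_per_node: float,
                      keep_fraction: float = 1.0) -> float:
    """Estimated monthly egress in GB after edge-side dropping and aggregation."""
    total_bytes = nodes * bytes_per_sec_per_node * SECONDS_PER_MONTH * keep_fraction
    return total_bytes / 1e9

PRICE_PER_GB = 5.0  # assumed metered-link price, in whatever currency you pay

raw = monthly_egress_gb(nodes=2000, bytes_per_sec_per_node=500)
reduced = monthly_egress_gb(nodes=2000, bytes_per_sec_per_node=500, keep_fraction=0.2)

print(f"raw: {raw:.0f} GB, reduced: {reduced:.0f} GB")
print(f"monthly saving: {(raw - reduced) * PRICE_PER_GB:.0f}")
```

The model is linear, so the saving scales directly with the fraction of bytes you drop before the metered link.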

Trade-offs, gotchas, and what goes wrong

The architecture diagram is the easy part. These are the failure modes that turn up in production.

Collector resource use. A Collector with unbounded queues and no memory_limiter will OOM the node under a telemetry spike, taking your workload with it. Always set memory_limiter, cap queue sizes, and load-test under the spike, not the average. The agent must fail gracefully, not catastrophically.

WAL and buffer sizing. Prometheus remote_write buffers to a WAL, and the Collector’s persistent queue buffers to disk, but both are bounded. A long outage on a fleet with high sample rates can fill the disk or hit the buffer cap, and then data is lost: Prometheus truncates its oldest WAL segments, which often hold the start of the incident, while the Collector’s queue by default rejects new data once it is full. Size buffers for your worst realistic outage, and alert on buffer-fill percentage, not just on disk-full.
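
Sizing for the worst realistic outage is simple arithmetic once you pick the inputs. The sketch below assumes a flat per-sample on-disk size and a fixed safety factor. Both are placeholders, since real on-disk size depends on compression and on the agent you run.

```python
def buffer_bytes_needed(samples_per_sec: float, bytes_per_sample: float,
                        outage_hours: float, safety_factor: float = 1.5) -> int:
    """Disk needed to hold every sample produced during an outage, with headroom."""
    return int(samples_per_sec * bytes_per_sample * outage_hours * 3600 * safety_factor)

def fill_percent(used_bytes: int, capacity_bytes: int) -> float:
    """Buffer fill as a percentage; alert on this rather than on disk-full."""
    return 100.0 * used_bytes / capacity_bytes

# Assumed inputs: 2000 samples/s, 16 bytes each on disk, six-hour outage.
needed = buffer_bytes_needed(samples_per_sec=2000, bytes_per_sample=16, outage_hours=6)
print(f"{needed / 1e6:.0f} MB")                       # 1037 MB
print(fill_percent(960_000_000, 1_200_000_000))       # 80.0
```

If the result does not fit on the node's disk, that is the signal to cut the sample rate or drop series at the source rather than to hope the outage is short.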

Clock skew. Edge nodes drift. NTP may be unreachable during the very outage you care about. Skewed timestamps scramble correlation across metrics, logs, and traces and break time-bounded queries. Discipline your NTP and consider stamping ingest time at the gateway as a cross-check.
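
The gateway-side cross-check can be as simple as stamping receive time and comparing it to the device timestamp. This is a sketch under assumptions: the record shape and field names (device_time, ingest_time, clock_ahead, late) are hypothetical and not part of any OTLP or Loki schema.

```python
from datetime import datetime, timedelta, timezone

def stamp_ingest(record: dict, received_at: datetime, tolerance: timedelta) -> dict:
    """Attach gateway receive time and flag disagreement with the device clock."""
    delta = received_at - record["device_time"]
    return {
        **record,
        "ingest_time": received_at,
        "skew_seconds": delta.total_seconds(),
        # Device timestamp is in the gateway's future: the device clock runs ahead.
        "clock_ahead": delta < -tolerance,
        # Device timestamp is old: buffered replay or a slow device clock.
        "late": delta > tolerance,
    }

now = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
rec = {"device_time": now + timedelta(minutes=5), "body": "link up"}
out = stamp_ingest(rec, now, timedelta(seconds=30))
print(out["skew_seconds"], out["clock_ahead"], out["late"])   # -300.0 True False
```

Only the clock_ahead case is unambiguous. A late record may simply be buffered data replaying after an outage, so treat that flag as a hint to check NTP rather than as proof of drift.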

Backpressure. When the backend or link is slow, pressure propagates back to the agent. Without bounded queues and a clear drop policy, backpressure becomes an OOM. Decide explicitly what you drop first, and make it the low-value, high-volume data.
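
An explicit drop policy can be sketched as a bounded queue that sheds its least valuable item when full. This illustrates the policy only; neither the Collector nor Prometheus implements priorities this way, and the class name and priority scheme are hypothetical.

```python
from collections import deque

class BoundedTelemetryQueue:
    """Bounded in-memory queue that sheds the lowest-priority item when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = deque()      # (priority, item) pairs in arrival order
        self.dropped = 0          # export this counter so drops are visible

    def put(self, item: str, priority: int) -> bool:
        """Higher priority means more valuable. Returns False if the new item was shed."""
        if len(self.items) < self.capacity:
            self.items.append((priority, item))
            return True
        # Queue is full: find the lowest-priority queued item (oldest wins ties).
        idx, (low_pri, _) = min(enumerate(self.items), key=lambda pair: pair[1][0])
        self.dropped += 1
        if low_pri < priority:
            del self.items[idx]   # evict the less valuable item to make room
            self.items.append((priority, item))
            return True
        return False              # the new item is the least valuable; shed it

q = BoundedTelemetryQueue(capacity=2)
q.put("debug log", 0)
q.put("error log", 2)
q.put("latency metric", 1)
print([item for _, item in q.items], q.dropped)   # ['error log', 'latency metric'] 1
```

Memory stays bounded whatever the link does, and the dropped counter makes the shedding observable instead of silent.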

Tail-sampling at the edge. Tail-based sampling needs the whole trace buffered before it decides. On a leaf node that memory rarely exists. Do tail-sampling at the gateway tier, where you have headroom, and keep leaf nodes to head-sampling or no sampling.
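
The memory cost is easy to see in a toy version. The sketch below holds every span of a trace until the trace is declared finished and only then decides. It is a simplification: real tail samplers decide on a timer rather than on an explicit finish signal, and the span fields used here are assumptions.

```python
from collections import defaultdict

class TailSampler:
    """Buffer spans per trace and decide only after the whole trace is seen."""

    def __init__(self, latency_threshold_ms: float):
        self.latency_threshold_ms = latency_threshold_ms
        self.buffer = defaultdict(list)   # trace_id -> spans held in memory

    def add_span(self, span: dict) -> None:
        self.buffer[span["trace_id"]].append(span)

    def finish_trace(self, trace_id: str) -> list:
        """Keep the trace if any span errored or exceeded the latency threshold."""
        spans = self.buffer.pop(trace_id, [])
        keep = any(s["error"] or s["duration_ms"] > self.latency_threshold_ms
                   for s in spans)
        return spans if keep else []

sampler = TailSampler(latency_threshold_ms=500)
sampler.add_span({"trace_id": "a", "duration_ms": 20, "error": False})
sampler.add_span({"trace_id": "a", "duration_ms": 900, "error": False})
sampler.add_span({"trace_id": "b", "duration_ms": 15, "error": False})
print(len(sampler.finish_trace("a")), len(sampler.finish_trace("b")))   # 2 0
```

The buffer grows with the number of in-flight traces times their span count, which is the footprint a constrained leaf node cannot afford.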

Scrape interval versus push. Prometheus pulls on an interval, which means a target that is unreachable at scrape time leaves a gap rather than a buffered burst; only samples a local agent has already scraped sit in the WAL for replay. The push model in OTLP buffers at the sender and replays, which behaves differently across the same outage. Neither is wrong, but they produce different shapes of missing data, and your alerting must understand which model you run. A staleness alert tuned for pull will misfire against a push pipeline, and vice versa.

Config drift across the fleet. A thousand nodes mean a thousand chances for a partial rollout to leave half the fleet on an old buffer size or sampling rate. Treat telemetry config as versioned artifacts with staged rollout and health checks, the same way you treat application deploys. A silent config drift is how one site quietly stops reporting while the dashboard looks green elsewhere.
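
Drift is cheap to detect if every node reports a fingerprint of the config it is actually running. This is a minimal sketch, assuming you can collect each node's config text; the site names and config contents are made up.

```python
import hashlib

def config_fingerprint(config_text: str) -> str:
    """Short, stable fingerprint of a config file's exact contents."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()[:12]

def find_drift(node_configs: dict[str, str], expected: str) -> list[str]:
    """Return the nodes whose running config differs from the expected one."""
    want = config_fingerprint(expected)
    return sorted(node for node, text in node_configs.items()
                  if config_fingerprint(text) != want)

expected = "limit_mib: 200\n"
fleet = {
    "site-a": expected,
    "site-b": "limit_mib: 512\n",   # stale value left behind by a partial rollout
    "site-c": expected,
}
print(find_drift(fleet, expected))   # ['site-b']
```

Exporting the fingerprint as a label on the agent's own health metric would let the same check run as a dashboard query instead of a script.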

Practical recommendations

Treat the edge agent as a workload with a hard resource budget, not a free rider. Pin its CPU and memory, and verify it under spike load before fleet rollout. Decide your buffering posture explicitly: persistent, disk-backed, bounded, with an alert on fill percentage. Push cardinality control to the source so the bytes you pay to move are the bytes worth moving. Put memory-hungry processing at a gateway tier and keep leaves thin. And choose the architecture for the signals you will need in twelve months, not just today, because re-instrumenting a fleet is the expensive path.

Two practices separate teams that sleep at night from teams that firefight. The first is observing the observer. Export the agent’s own health metrics, queue depth, dropped-sample counts, and memory usage, and alert on them. An edge telemetry pipeline that fails silently is worse than no pipeline, because you trust a dashboard that has quietly gone blind. The second is staged rollout for config. A buffer resize or a new sampling rule should canary on a handful of sites, bake, and then roll forward with an automatic rollback trigger. Telemetry config is part of your blast radius, and a bad rule pushed fleet-wide can take down monitoring across every site at once. Tie both into the same progressive-delivery tooling you use for application config, so the edge observability layer is governed with the same rigor as the workloads it watches.
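
The staged rollout described above can be planned mechanically. This sketch splits a site list into a canary wave followed by waves that grow geometrically. The wave sizes and growth factor are assumptions, and the bake and rollback checks between waves are left to your delivery tooling.

```python
def rollout_waves(sites: list[str], canary_count: int, wave_factor: int = 2) -> list[list[str]]:
    """Split sites into a canary wave followed by waves that grow by wave_factor."""
    if canary_count < 1 or wave_factor < 1:
        raise ValueError("canary_count and wave_factor must be at least 1")
    waves, start, size = [], 0, canary_count
    while start < len(sites):
        waves.append(sites[start:start + size])
        start += size
        size *= wave_factor
    return waves

sites = [f"site-{i}" for i in range(10)]   # hypothetical site names
waves = rollout_waves(sites, canary_count=1)
print([len(w) for w in waves])   # [1, 2, 4, 3]
```

Each wave should go out only after the previous one has baked with healthy queue-depth and dropped-sample metrics.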

Checklist before you commit:

  • Set a hard RAM and CPU budget per node and load-test at spike, not average.
  • Enable memory_limiter (OTel) or bound the WAL and queues (Prometheus) with an explicit drop policy.
  • Make buffering persistent, disk-backed, and bounded; alert on buffer-fill percentage.
  • Drop and aggregate high-cardinality data at the edge before the WAN.
  • Confirm signal coverage: if traces are on the roadmap, lead with OpenTelemetry.
  • Run tail-based sampling at a gateway tier, never on constrained leaves.
  • Discipline NTP and consider gateway-side ingest timestamps for correlation.
  • Version your pipeline config and roll it out like fleet firmware.

FAQ

Is OpenTelemetry a replacement for Prometheus?
Not exactly. OpenTelemetry is a collection and routing layer that handles metrics, logs, and traces and exports over OTLP. Prometheus is a metrics system with storage and a query language. They overlap on metrics collection, but most teams use OpenTelemetry to collect and route, then store metrics in a Prometheus-compatible backend. They are complementary more than competing.

Can Prometheus and Loki handle distributed tracing?
No. Prometheus handles metrics and Loki handles logs. Distributed tracing needs a separate component such as Grafana Tempo or Jaeger. This is a key asymmetry versus OpenTelemetry, which is signal-agnostic and carries traces natively over OTLP. If tracing is on your roadmap, factor that gap into the decision early.

How do I handle telemetry during an edge network outage?
Buffer locally and replay on reconnect. The OpenTelemetry Collector supports a persistent, disk-backed sending queue; Prometheus remote_write buffers to a write-ahead log. Both are bounded, so size them for your worst realistic outage and alert on buffer-fill percentage. Without persistent buffering, an outage silently drops the exact data that explains the incident.

What is the smallest-footprint edge observability option?
For metrics-and-logs-only fleets on very constrained nodes, Prometheus agent mode plus a thin log shipper is typically the leanest path, since agent mode does no local storage or querying. A custom-built OpenTelemetry Collector with only needed components is competitive and gives you traces and one wire protocol for the cost of a slightly larger agent.

Why does cardinality matter so much at the edge?
Edge fleets multiply labels across thousands of nodes: per-device, per-sensor, per-firmware-version. Unbounded cardinality melts metrics backends and inflates egress and storage cost. Both stacks let you relabel and drop at the source. Doing it at the edge protects the metered link and the backend simultaneously, which is why it is a primary decision driver.

Should the Collector run on every node or at a gateway?
For most fleets, run a thin emitter on leaves and a heavier Collector at a per-site gateway. The gateway concentrates memory-hungry work like tail-based sampling and persistent buffering where there is RAM, while leaves stay lean. Per-node Collectors make sense when nodes are well-resourced or sites have a single device.

Riju is the editor of iotdigitaltwinplm.com, writing on industrial IoT, digital twins, and the cloud-native platforms that run them.