Video Analytics Edge Architecture: Low-Latency Design (2026)

Last Updated: 2026-05-18

The camera-to-cloud-to-AI loop, in the form that ruled the surveillance and IIoT vendor decks of 2018-2022, is broken in 2026. A modern video analytics edge architecture does not ship every frame to a cloud GPU and wait for a verdict. It cannot — the bandwidth bill is absurd, the round-trip latency overshoots the action you want to react to, and the legal and privacy posture of pushing raw faces and licence plates over the public internet is no longer tenable in most jurisdictions. The architecture that actually wins, across factory floors, retail estates, transport hubs, smart cities, and warehouse robotics, runs heavy inference on the edge, sends only structured events and short clips to the cloud, and reserves cloud GPUs for the forensic and long-tail work that genuinely benefits from a vision-language model. This post is the reference architecture for that pattern: every layer of the pipeline, the DeepStream and Triton internals that make it cheap on Jetson and on x86 GPUs, the model partitioning patterns that decide what runs where, the retention and replay topology, and a defensible cost model you can take to a CFO. By the end you should be able to draw the diagram, defend the trade-offs, and size a deployment without hand-waving.

Why Edge Video Analytics Is the Default in 2026

An edge video AI pipeline is the right default in 2026 because of three numbers and one regulatory shift. The numbers: a single 4K H.265 camera at 15 fps and reasonable quality produces roughly 6-12 Mbps sustained; a sensible mid-size deployment runs 50-200 cameras; cloud GPU inference on raw frames costs roughly 30-100x what equivalent inference on a Jetson Orin or an on-prem L4 card costs once you account for ingress, egress, decode, and idle. Multiply those and a pure cloud architecture for a 100-camera site lands somewhere between five and twenty thousand US dollars per month before storage. Move the same inference to the edge and that figure collapses by an order of magnitude.
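To make the first two numbers concrete, the sketch below converts a sustained per-camera bitrate into site uplink load and monthly data volume. It assumes a 30-day month and decimal terabytes; the function name is illustrative and not taken from any vendor tool.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day month


def site_video_load(cameras: int, mbps_per_camera: float) -> tuple[float, float]:
    """Return (sustained uplink in Mbps, monthly volume in decimal TB)."""
    uplink_mbps = cameras * mbps_per_camera
    bytes_per_month = uplink_mbps * 1_000_000 / 8 * SECONDS_PER_MONTH
    return uplink_mbps, bytes_per_month / 1e12


low = site_video_load(100, 6)    # 100 cameras at the low end of 6-12 Mbps
high = site_video_load(100, 12)  # same site at the high end
print(f"{low[0]:.0f}-{high[0]:.0f} Mbps sustained, {low[1]:.0f}-{high[1]:.0f} TB per month")
```

Under those assumptions a 100-camera 4K site pushes 600-1200 Mbps continuously, roughly 194-389 TB per month, before a single inference call is billed.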

The regulatory shift is the one nobody likes to talk about at vendor events. The EU AI Act phased compliance regime, the UK biometric guidance updates that landed through 2024 and 2025, India’s DPDP rules, and most US-state biometric laws now treat raw video carrying identifiable faces or plates as a sensitive category that you should not be moving across networks without justification. The cleanest way to comply, by a wide margin, is to do the inference where the camera is, push only the structured event (“vehicle entered zone B at 14:02:11.4, plate matched watch-list, confidence 0.91, thumbnail attached”), and keep the underlying video locally except where a downstream investigator explicitly pulls it.

There is a third, less appreciated reason: latency. Most production decisions that video analytics drives — opening a gate, alerting a worker, dispatching a forklift, halting a robot — need to happen within a few hundred milliseconds of the triggering frame. Round-tripping the frame to a public cloud, even one in the same region, eats 80-200 ms before inference starts. Edge inference, on a Jetson Orin NX or a small x86 box with an L4, runs the same model in 8-20 ms and stays comfortably inside human-perception thresholds.

The pattern that emerged is uncontroversial across the major edge AI stacks — NVIDIA Metropolis with DeepStream and Triton on Jetson, AWS Panorama, Azure Percept (now Azure Edge Video Service), Google’s Coral and Vertex AI Edge — and across the open-source side built on GStreamer, ONNX Runtime, and OpenVINO. We will walk it.

Reference Architecture

A complete camera-to-cloud architecture in 2026 has five concerns laid out left to right: the camera layer, the edge node, the event router, the cloud or regional data centre, and the operator and investigator UI. Four cross-cutting concerns (time synchronization, identity and access, model management, and observability) touch all of them.

Camera layer and ingest

IP cameras still dominate. ONVIF Profile S, T, and M cover the live stream, event, and metadata profiles you actually use; RTSP carrying H.264 or, increasingly, H.265 (HEVC) is the live transport. A growing fraction of deployments are moving to AV1 on the camera side where the silicon supports it, because the codec savings reduce edge decode load proportionally. For non-IP cameras — CSI on Jetson, USB on industrial PCs, GMSL on automotive-grade boxes — v4l2src or nvarguscamerasrc in GStreamer terms feeds the same pipeline.

Time synchronization is the part everybody underestimates. PTP (IEEE 1588) or at minimum NTP with a local stratum-2 source must keep every camera’s frame clock within a few milliseconds of every other camera and of the edge node. Without that, multi-camera tracking (“the same person walked from camera 3 to camera 7”) becomes unreliable, because hand-offs between cameras can no longer be ordered in time, and forensic timelines drift over a shift.

Edge node

The edge node decodes the RTSP stream on-chip via NVDEC (NVIDIA), Quick Sync through VA-API (Intel), or VideoToolbox-class accelerators (on Apple Silicon edges, which we see more of in retail kiosks now), runs a pre-processing chain of scale, crop, region of interest, and colour-space conversion, and hands a batched tensor to the inference component. The inference component runs the primary detector, a tracker that maintains identity across frames, optional secondary inference (classification, pose, action recognition), and post-processing analytics like line crossings, dwell time, and zone occupancy. The output of the edge node is two streams: a metadata stream (JSON or Protobuf events) that goes upstream, and an optional annotated video stream that the operator UI consumes.
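As an illustration of what one record on the metadata stream might look like, here is a minimal sketch modelled on the example event quoted earlier in this post. The field names are assumptions made for the example; they are not the DeepStream message schema or any standard.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EdgeEvent:
    """One structured event sent upstream; field names are illustrative."""
    event_type: str
    camera_id: str
    zone: str
    timestamp_utc: str                   # ISO 8601, stamped by the edge node's clock
    track_id: int
    confidence: float
    watchlist_match: bool = False
    thumbnail_ref: Optional[str] = None  # pointer to a local thumbnail, not the pixels

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


event = EdgeEvent(
    event_type="vehicle_entered_zone",
    camera_id="cam-07",
    zone="B",
    timestamp_utc="2026-05-18T14:02:11.400Z",
    track_id=4312,
    confidence=0.91,
    watchlist_match=True,
    thumbnail_ref="thumbs/cam-07/4312.jpg",
)
payload = event.to_json()
print(payload)
```

The payload is a few hundred bytes and carries a reference to the thumbnail rather than the image itself, which is what keeps the upstream link small.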

On a Jetson AGX Orin 64 GB the practical envelope for a single node is 12-24 cameras at 1080p 15-30 fps with a YOLOv9-class detector plus a tracker plus a light classifier, or roughly 4-8 cameras at 4K with the same workload. An x86 server with one L4 GPU lands in the same neighbourhood with more headroom for secondary inference. The exact number depends on codec mix, ROI cropping, and how aggressively you quantize.

Event router and local store

Between the edge inference and the network is a thin component most architectures get wrong on the first attempt: an event router. Its job is to apply rules (“only emit if person is in zone B and dwell > 5 s”), deduplicate (“the same vehicle generated 300 frames of detection; emit one event with start, end, and a representative thumbnail”), and back-pressure (“the network is congested; buffer events for up to N minutes”). Without an event router you flood your message bus with frame-rate-cardinality events that no downstream system wants. With one, your cloud and storage costs drop by another order of magnitude and your operator dashboards stay sane.
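The three jobs of the router can be sketched in a few dozen lines. This is a simplified model under stated assumptions, not a production component: the class and field names are hypothetical, dwell is measured from the first detection inside the zone, and back-pressure is modelled as a bounded in-memory buffer rather than a time-limited one.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Detection:
    track_id: int
    zone: str
    ts: float  # seconds on the edge node's clock


class EventRouter:
    def __init__(self, zone: str, min_dwell_s: float, max_buffered: int = 1000):
        self.zone = zone
        self.min_dwell_s = min_dwell_s
        self.first_seen: dict[int, float] = {}           # track_id -> zone entry time
        self.emitted: set[int] = set()                   # tracks that already fired
        self.outbox: deque = deque(maxlen=max_buffered)  # bounded buffer for outages

    def ingest(self, det: Detection) -> None:
        if det.zone != self.zone:
            # Leaving the zone resets dwell, so a later re-entry can emit again.
            self.first_seen.pop(det.track_id, None)
            self.emitted.discard(det.track_id)
            return
        start = self.first_seen.setdefault(det.track_id, det.ts)
        if det.ts - start > self.min_dwell_s and det.track_id not in self.emitted:
            self.emitted.add(det.track_id)               # deduplicate per track
            self.outbox.append({"track_id": det.track_id, "zone": det.zone,
                                "start": start, "triggered_at": det.ts})

    def drain(self, network_up: bool) -> list[dict]:
        """Hand buffered events to the uplink, or hold them while it is down."""
        if not network_up:
            return []
        events = list(self.outbox)
        self.outbox.clear()
        return events


router = EventRouter(zone="B", min_dwell_s=5.0)
for frame in range(300):                       # 10 s of one person at 30 fps
    router.ingest(Detection(track_id=1, zone="B", ts=frame / 30))
held = router.drain(network_up=False)          # congested: nothing leaves the node
sent = router.drain(network_up=True)           # recovered: one event for 300 detections
print(len(held), len(sent))
```

A fuller version would also close the event with an end time and a representative thumbnail when the track leaves the zone; the sketch only shows the rule, the deduplication, and the buffering.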

The local store is a ring buffer on NVMe — usually 2-7 days of raw video and metadata, replayable on demand. This is what lets investigators “pull the last 30 seconds before the alert” without your cloud paying egress for video that nobody will ever watch.

Cloud and regional DC

The cloud or regional data centre receives the structured event stream over MQTT, Kafka, Kinesis, or Pub/Sub. It indexes events in a metadata store (OpenSearch and ClickHouse are the two we see most often in 2026), runs heavy or forensic inference on the small subset of clips that need it — re-identification across non-overlapping cameras, large vision-language models for natural-language search (“find clips where a forklift was carrying a yellow pallet”), and forensic upscaling — and stores video and clips in an object store with a lifecycle policy.

UI layer

Operators want live, low-latency video; investigators want random-access replay. Those are two different protocols. Live preview rides on WebRTC because it is the only widely-supported web transport that holds glass-to-glass latency under 500 ms; replay rides on HLS or DASH because they handle scrubbing, seeking, and CDN delivery efficiently. RTSP is fine on a wall-mounted monitor inside the facility but never the right transport to a browser.

The narrow contract at every interface is what keeps this architecture debuggable when it grows from one site to a hundred. Cameras emit framed bytes. The edge node emits structured events and clips. The cloud emits indexed, searchable history. Mix those concerns at any boundary and the system rots within a year.

Edge Inference Stack: DeepStream, Triton, GStreamer

The single most consequential decision in an edge video AI pipeline is the inference framework. In 2026 the practical choices on NVIDIA hardware are NVIDIA DeepStream (which is a GStreamer-based meta-framework wrapping CUDA, TensorRT, and Triton) and Triton Inference Server running standalone behind a GStreamer or custom pipeline. On Intel hardware the equivalent is OpenVINO behind GStreamer’s gst-va and vaapi elements; on Apple Silicon edges it is Core ML behind AVFoundation; on Qualcomm and Rockchip silicon it is the SNPE or RKNN runtimes behind their own meta-pipelines. The architectural shape is the same across all of them; we will walk the NVIDIA DeepStream version because it is the most fully featured and the reference everybody else implicitly benchmarks against.

A canonical DeepStream pipeline starts with uridecodebin consuming RTSP, multiplexes N camera streams via nvstreammux into a batched tensor, runs a primary inference engine (nvinfer in PGIE mode) with a detector — YOLOv9, PeopleNet, TrafficCamNet, or a custom TensorRT engine — and feeds the bounding boxes into nvtracker. nvtracker ships three trackers in the NVIDIA reference: IOU (cheap, position-only), NvDCF (correlation filter, identity-aware, the practical default), and a DeepSORT variant for cross-frame re-identification. From there you chain one or more secondary inferences (nvinfer in SGIE mode) that classify each tracked object — vehicle make and model, pose estimation, action recognition — and then run nvdsanalytics to apply spatial rules: line crossings, ROI containment, direction filters, dwell. The annotated frames go through nvdsosd to overlay bounding boxes and labels, then a tee splits the stream into a metadata branch (via nvmsgconv + nvmsgbroker to Kafka, MQTT, AMQP, or Azure IoT Hub) and a video branch (via nvv4l2h264enc or nvv4l2h265enc into HLS or WebRTC).
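The front half of that chain can be written down as a gst-launch-1.0 description. The helper below only assembles the string; it is a sketch that assumes the legacy nvstreammux properties (batch-size, width, height) and the usual nvinfer, nvtracker, and nvdsanalytics property names, which are worth confirming with gst-inspect-1.0 on your DeepStream version. The URIs and config file paths are placeholders.

```python
def build_pipeline(rtsp_uris, pgie_config, tracker_lib, analytics_config,
                   width=1920, height=1080):
    """Assemble a gst-launch-1.0 description: mux -> detect -> track -> analytics."""
    chain = " ! ".join([
        # One batch slot per camera.
        f"nvstreammux name=mux batch-size={len(rtsp_uris)} width={width} height={height}",
        f"nvinfer config-file-path={pgie_config}",        # primary detector (PGIE)
        f"nvtracker ll-lib-file={tracker_lib}",           # low-level tracker library
        f"nvdsanalytics config-file={analytics_config}",  # lines, ROIs, dwell
        "fakesink sync=false",                            # stand-in for OSD, encode, broker
    ])
    # Each camera gets its own decoder, linked to a numbered mux request pad.
    sources = " ".join(
        f"uridecodebin uri={uri} ! mux.sink_{i}" for i, uri in enumerate(rtsp_uris)
    )
    return f"{chain} {sources}"


description = build_pipeline(
    ["rtsp://192.0.2.10/stream1", "rtsp://192.0.2.11/stream1"],  # placeholder cameras
    pgie_config="configs/pgie.txt",                # hypothetical path
    tracker_lib="/path/to/tracker_library.so",     # hypothetical path
    analytics_config="configs/analytics.txt",      # hypothetical path
)
print(description)
```

The secondary inference, on-screen display, tee, message broker, and encoder branches are left out to keep the sketch short; they attach where the fakesink sits.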

The Triton piece is where DeepStream stopped being a closed framework and became something teams actually want to standardize on. Modern DeepStream pipelines can target nvinferserver instead of nvinfer, which routes inference through a local Triton Inference Server over gRPC or shared memory. That means the same Triton model repository — versioned, A/B-able, hot-swappable — serves both the edge pipeline and any cloud microservice you build. The DeepStream team’s own guidance, and NVIDIA’s documentation, treats this as the recommended path for non-trivial deployments. Triton’s ensemble feature lets you express a multi-stage model graph (preprocess → detect → crop → classify) as a single inference call, which is essential for keeping latency tight when you have more than one model in the chain.

A few practitioner notes that the docs do not say directly. First, the nvstreammux batch size is the single biggest performance knob. Set it equal to the number of cameras for the smallest latency variance; set it higher than the camera count when you can tolerate a few extra milliseconds of buffering in exchange for better GPU utilization. Second, INT8 quantization with a representative calibration dataset typically gives you a 2.5-3.5x throughput win over FP16 on Orin and Ada-class GPUs with a 1-2% mAP loss that is invisible at typical operating thresholds — always measure on your data, never on COCO. Third, the GStreamer queue element with leaky=downstream is your friend on any path that touches the network; without it a single slow consumer back-pressures the entire pipeline and the camera frames pile up in the decoder.
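The third note is easier to reason about with a toy model. In GStreamer a queue with leaky=downstream discards its oldest buffered items when it is full, so the producer never blocks. The class below imitates that behaviour in plain Python; it is an illustration of the semantics, not GStreamer code.

```python
from collections import deque


class LeakyQueue:
    """Bounded FIFO that drops the oldest item when full, so push never blocks."""

    def __init__(self, max_items: int):
        self._items = deque(maxlen=max_items)
        self.dropped = 0

    def push(self, item) -> None:
        if len(self._items) == self._items.maxlen:
            self.dropped += 1          # the oldest item is about to be discarded
        self._items.append(item)

    def pop(self):
        return self._items.popleft() if self._items else None


q = LeakyQueue(max_items=3)
for frame_number in range(10):         # producer runs ahead of a stalled consumer
    q.push(frame_number)
oldest_kept = q.pop()
print(q.dropped, oldest_kept)
```

When the consumer finally catches up it sees the most recent frames rather than a backlog, which is the property you want on a live path.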

The GStreamer base matters because it is the only realistic substrate for this kind of plugin DAG. The DeepStream plugins are GStreamer plugins; OpenVINO’s video pipeline is GStreamer plugins; AWS Panorama at the bottom of the stack is GStreamer. If your team is going to be productive on edge video, GStreamer fluency (gst-launch-1.0, gst-inspect-1.0, debugging with GST_DEBUG=3) is not optional.

Model Partitioning: Edge vs Cloud Split

The naive position on edge AI is “run everything on the edge.” That is wrong for any deployment that needs forensic search, cross-camera re-identification, or natural-language query. The right position is to partition the model graph: run the latency-critical and cardinality-heavy parts on the edge, and route the small, high-value subset of frames to a heavier cloud or regional model.

Three partition patterns recur. The first is early-exit: a tiny edge detector runs on every frame at 30 fps, and exits the pipeline immediately when the frame is uninteresting. “Uninteresting” is defined as “no detection above threshold” or “no motion in any ROI.” The vast majority of frames in any real deployment exit here, which is what makes the rest of the architecture tractable. The second is cascade: when the tiny detector fires, a larger edge model runs to confirm and classify; when that model’s confidence is low or the situation is genuinely ambiguous, the frame or clip is uploaded to a cloud forensic model. The third is multi-modal late fusion: edge models run on visual features, audio features (gunshot detection, glass break, language), and metadata (door contact sensors, RFID), and a small cloud rule engine or learned policy fuses them into a single decision.
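The first two patterns combine into a small routing function. The sketch below is one possible shape, with made-up thresholds; the larger model is passed as a callable so that it only runs when the tiny detector fires.

```python
from enum import Enum
from typing import Callable


class Route(Enum):
    DROP = "drop"                  # early exit: nothing of interest
    EDGE_EVENT = "edge_event"      # confirmed on the edge
    CLOUD_REVIEW = "cloud_review"  # ambiguous: upload the frame or clip


def route_frame(tiny_score: float, confirm: Callable[[], float],
                detect_threshold: float = 0.30,
                confident_threshold: float = 0.80) -> Route:
    """Thresholds are illustrative assumptions; tune them on your own data."""
    if tiny_score < detect_threshold:
        return Route.DROP
    if confirm() >= confident_threshold:   # larger edge model, invoked lazily
        return Route.EDGE_EVENT
    return Route.CLOUD_REVIEW


calls = []


def big_model() -> float:
    calls.append(1)                        # count how often the large model runs
    return 0.55


decisions = [
    route_frame(0.05, big_model),          # uninteresting frame
    route_frame(0.60, big_model),          # detection, but low confirmation
    route_frame(0.60, lambda: 0.93),       # detection, confident confirmation
]
print([d.value for d in decisions], len(calls))
```

The call counter shows the point of the early exit: the large model ran once across the three frames, and never for the frame that the tiny detector rejected.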

Sizing the split is straightforward once you accept that frame value is heavy-tailed. In a typical retail or facility deployment, fewer than 1% of frames contain an event of interest, and fewer than 0.05% need forensic inference. That gives you a clear sizing rule: if your cloud inference fraction is above 1-2%, your edge gates are too loose; if it is below 0.01%, you are probably missing events that the cloud model would catch and you should loosen them.
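That sizing rule translates directly into a check you can run against daily counters. The sketch below takes the lower end of the 1-2% band as the upper limit; both limits are parameters, and the function name is illustrative.

```python
def gate_health(frames_seen: int, frames_sent_to_cloud: int,
                too_loose_above: float = 0.01,
                too_tight_below: float = 0.0001) -> str:
    """Classify the cloud inference fraction against the sizing rule."""
    if frames_seen <= 0:
        raise ValueError("frames_seen must be positive")
    fraction = frames_sent_to_cloud / frames_seen
    if fraction > too_loose_above:
        return "too loose"
    if fraction < too_tight_below:
        return "too tight"
    return "ok"


frames_per_day = 100 * 15 * 86_400             # 100 cameras at 15 fps
typical = gate_health(frames_per_day, 64_800)  # 0.05% of frames go to the cloud
loose = gate_health(frames_per_day, 3_000_000)
tight = gate_health(frames_per_day, 5_000)
print(typical, loose, tight)
```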

What goes in each tier is also constrained. On the edge you want models small enough that a single GPU runs them at full camera rate: YOLOv9-n through YOLOv9-m for detection, MobileNetV4 or EfficientNet-B0 through B3 for classification, X3D-XS or X3D-S for action recognition, a small NvDCF or DeepSORT tracker. In the cloud you can afford much heavier models: vision-language models like InternVL2, Qwen2-VL, or LLaVA-NeXT for open-vocabulary search and reasoning; large re-identification models for cross-camera matching; super-resolution and forensic-enhancement networks; and increasingly, video diffusion models for synthetic gap-filling. The bridge between the two is the structured event stream, not the raw frames.

There is a related decision about model life cycle. The model registry — MLflow, Weights & Biases, or a homegrown S3-backed store — is the single source of truth. The edge nodes pull TensorRT or ONNX engines from it; the cloud Triton servers pull the same models from the same registry; A/B tests are driven by routing a percentage of cameras to a candidate model and watching the precision and recall against a labelled shadow set. Without this discipline you end up with seventeen versions of the detector across forty edge nodes and no way to roll back when one of them regresses on a long-tail scene.
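Routing a percentage of cameras to a candidate model works best when the assignment is stable, so that a camera does not flip between models from one deployment to the next. One common way to get that, sketched here with hypothetical model names, is to hash the camera ID into a bucket from 0 to 99.

```python
import hashlib


def model_for_camera(camera_id: str, candidate_percent: int,
                     stable: str = "detector-stable",
                     candidate: str = "detector-candidate") -> str:
    """Deterministically assign a camera to the stable or candidate model."""
    digest = hashlib.sha256(camera_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100   # 0..99, fixed per camera
    return candidate if bucket < candidate_percent else stable


cameras = [f"cam-{n:03d}" for n in range(200)]
at_10 = {c for c in cameras if model_for_camera(c, 10) == "detector-candidate"}
at_20 = {c for c in cameras if model_for_camera(c, 20) == "detector-candidate"}
print(len(at_10), len(at_20), at_10 <= at_20)
```

Because the bucket depends only on the camera ID, widening the rollout from 10% to 20% keeps every camera that was already on the candidate, and rolling back is a matter of setting the percentage to zero.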

The split inference architecture is also where compliance gets cleanest. The cloud only ever sees frames the edge already decided are interesting, which is precisely the subset for which a documented retention and access policy exists. The rest never leaves the building.

Storage, Streaming, and Replay

Real deployments forget that low latency in video analytics is not just about inference latency. It is also about the time from event to operator action, and a large part of that is how fast you can get the relevant video onto a screen. Three retention tiers and three streaming protocols, mapped to the right consumers, is the layout that actually works.

The hot tier is the edge NVMe ring buffer, sized for 2-7 days of full-fidelity recording at the camera bitrate plus the analytics metadata. This is what lets you replay “the 30 seconds before the alert” without paying egress. The most reliable layout in 2026 is a GStreamer hlssink2 writing 6-second HLS segments plus an mpegtsmux-based raw archive, indexed by event ID and timestamp. The ring buffer policy (“the oldest segment is deleted first”) is enforced by a small daemon that watches free space and prunes. Never rely on the application to free its own segments, because under load it will not.
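The pruning step of such a daemon can be as small as the function below. It is a sketch under simplifying assumptions: segments are .ts files in one directory, age is taken from the file modification time, and the limit is a byte budget rather than a free-space watermark (shutil.disk_usage would be the natural source for the latter).

```python
import os
import tempfile
from pathlib import Path


def prune_oldest(segment_dir: Path, max_bytes: int) -> list[str]:
    """Delete the oldest .ts segments until the directory fits in max_bytes."""
    segments = sorted(segment_dir.glob("*.ts"), key=lambda p: p.stat().st_mtime)
    sizes = {p: p.stat().st_size for p in segments}
    total = sum(sizes.values())
    removed = []
    for seg in segments:                         # oldest first
        if total <= max_bytes:
            break
        seg.unlink()
        total -= sizes[seg]
        removed.append(seg.name)
    return removed


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for i in range(5):
        seg = root / f"seg{i}.ts"
        seg.write_bytes(b"\0" * 100)
        os.utime(seg, (1_000 + i, 1_000 + i))    # make seg0 the oldest
    removed = prune_oldest(root, max_bytes=250)
    remaining = sorted(p.name for p in root.glob("*.ts"))
print(removed, remaining)
```

Running this on a timer from a separate process keeps the retention policy independent of the recording pipeline, which is the point of the warning above.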

The warm tier is a regional VMS or object store: S3, GCS, or Azure Blob, fronted by a CDN for replay traffic, with a metadata index in OpenSearch or ClickHouse. Lifecycle policies move segments from edge to warm on a fixed cadence (every 3 days, every 7 days, or only event-tagged clips). The warm tier is what investigators and analysts hit for the 7-30 day window: complex queries (“all line crossings by vehicles wider than 2 m between 14:00 and 18:00 last Tuesday”) are served from the index, and the linked clip is streamed via HLS or DASH.

The cold tier is glacier-class archive (S3 Glacier Deep Archive, GCS Archive, or Azure Blob archive tier) for the legal-hold and long-tail retention window. Cold storage is cheap per byte but expensive per retrieval, so the cold tier should only ever hold event-tagged clips of 30-60 seconds, not raw continuous footage. If you find yourself archiving continuous video for compliance, push back on the requirement before you build it, because the cost curve is brutal at 90-day-plus retention on raw 4K.

Now the protocols. Live operator video must ride on WebRTC because it is the only widely-supported web transport that holds glass-to-glass latency under 500 ms across the public internet. The architecture that scales is a media-server fanout (Janus, mediasoup, or Amazon Kinesis Video WebRTC) that takes the RTSP stream from the edge once, transcodes once, and fans out to many operators. Direct camera-to-browser WebRTC works for one camera and one operator; it does not work for fifty operators looking at the same camera, and it never works through corporate firewalls without a TURN server.

RTSP stays on-prem, behind firewalls, where it belongs. It is the camera-to-edge transport and the wall-monitor transport, and that is it.

HLS and MPEG-DASH are the right transports for replay and for non-time-critical live (a manager checking a feed from their phone, where 6-15 seconds of latency is acceptable). They are CDN-friendly, play in any modern browser via hls.js and dash.js (Safari also plays HLS natively), and they handle the seek, scrub, and variable-bitrate cases that WebRTC handles poorly. Most production deployments serve all three: WebRTC for the live control room, HLS for everything else live, HLS or DASH for replay.
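The protocol mapping in this section fits in a lookup table. The consumer labels below are names invented for the example; the assignments follow the text.

```python
TRANSPORTS = {
    "camera_to_edge": "RTSP",
    "wall_monitor": "RTSP",       # on-prem only, behind the firewall
    "live_operator": "WebRTC",    # control room, sub-500 ms glass to glass
    "live_overview": "HLS",       # managers and dashboards, 6-15 s is acceptable
    "replay": "HLS or DASH",      # seek and scrub, CDN-fronted
}


def pick_transport(consumer: str) -> str:
    """Return the transport for a consumer type, or fail loudly on unknown ones."""
    try:
        return TRANSPORTS[consumer]
    except KeyError:
        raise ValueError(f"unknown consumer: {consumer!r}") from None


print(pick_transport("live_operator"), pick_transport("replay"))
```

Failing on unknown consumer types is deliberate: a new client that silently defaults to RTSP or WebRTC is how the wrong transport ends up crossing the site boundary.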

The cross-cutting concern that bites everybody at least once is camera-to-edge clock skew during replay. If your edge timestamps and your camera timestamps diverge — and they will — your replay slider does not match your event timestamps and operators stop trusting the system. The fix is to stamp every segment and every event with the edge node’s PTP-disciplined clock, and to display the offset to the camera’s wall clock as a clearly-labelled “camera-reported” field, never as the source of truth.

Cost Model and Sizing

A useful video analytics edge architecture cost model has to track six things: edge compute, network egress, cloud inference, warm storage, cold storage, and the human-cost layer (model training, ops, integration). The waterfall below is a representative shape for a single-site deployment as it scales from a baseline through to full retention.

The baseline that everybody compares to is something like ten 1080p cameras at 15 fps, edge-only inference on a single Jetson Orin NX or AGX Orin, no cloud inference, and 72 hours of local retention. In 2026 dollars that is roughly one Orin (USD 1,500-2,500 capex amortized over three years), a small UPS, and a couple of NVMe drives: call it USD 100-150 per month all-in. We will call that 1.0x.

Upgrade the same cameras to 4K and the costs do not simply double; they rise to roughly 2.2x baseline. Decode load scales with resolution, inference load scales with the input tensor area, and you usually have to step up either to a bigger Orin or to a small x86 box with an L4. The compute jump dominates.

Scale to fifty 4K cameras and you hit 6-7x baseline. You will need three to five Orins or one well-sized server with one or two L4s, plus a redundant network path, plus an edge UPS or two. NVDEC saturation is the constraint that bites here — each Orin has a fixed number of decode streams it can sustain at 4K, and you partition cameras across nodes accordingly.

Add a cloud forensic layer that processes the top 1-5% of clips through a vision-language model on burstable L40S or H100 capacity, and you are at roughly 8.4x baseline. This is the line item that scales with sensitivity tuning more than with camera count: if you let too much through your edge gates, this number explodes. The sizing rule we have come to trust: cloud inference cost should land at 25-40% of edge inference cost; significantly above that and your gates are wrong.

Layer in 30-day warm retention in S3, GCS, or Azure Blob with CDN-fronted HLS delivery and a metadata index, and you reach roughly 9.7x. The breakdown is storage (cheap), CDN egress to the operator UI (the dominant component, depends entirely on how often clips are watched), and the index (a small fraction unless you index every detection — do not).

Finally, a one-year cold archive in Glacier-class storage with appropriate lifecycle policies lands you at roughly 10.3x baseline. Cold storage is almost free per byte; the cost shows up at retrieval, which is why you cap cold to event-tagged clips and refuse the “archive all the raw video forever” requirement.
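The waterfall is easier to compare as a table of multiples, step sizes, and monthly dollar ranges. The sketch below uses the figures from this section, with one assumption: the 6-7x stage is entered at its midpoint of 6.5x so that each step has a single value.

```python
BASELINE_USD_PER_MONTH = (100.0, 150.0)    # the 1.0x range quoted above

STAGES = [
    ("ten 1080p cameras, edge only, 72 h local", 1.0),
    ("same cameras at 4K", 2.2),
    ("fifty 4K cameras", 6.5),             # assumption: midpoint of 6-7x
    ("plus cloud forensic layer", 8.4),
    ("plus 30-day warm retention", 9.7),
    ("plus one-year cold archive", 10.3),
]


def waterfall(stages, baseline):
    """Return one row per stage with its step over the previous stage."""
    rows, previous = [], 0.0
    for label, multiple in stages:
        rows.append({
            "stage": label,
            "multiple": multiple,
            "step": round(multiple - previous, 2),
            "usd_low": round(baseline[0] * multiple),
            "usd_high": round(baseline[1] * multiple),
        })
        previous = multiple
    return rows


rows = waterfall(STAGES, BASELINE_USD_PER_MONTH)
for row in rows:
    print(f"{row['multiple']:>5.1f}x  +{row['step']:<4}  "
          f"USD {row['usd_low']}-{row['usd_high']}  {row['stage']}")
```

At the quoted baseline the full stack lands at roughly USD 1,030-1,545 per month, and the per-step column makes it easy to see which layer each increment belongs to.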

Two structural observations. First, the cost curve is sublinear in cameras (because decode and inference batch) and superlinear in retention (because storage and egress compound). The decision that matters most is not how many cameras you deploy — it is how long you keep the video and at what fidelity. Second, the cloud inference fraction is the single most controllable line item once the system is running. Tune the edge gates first, optimize cloud GPU class second, optimize storage tier third.

Trade-offs, Gotchas, and What Goes Wrong

The same five failure modes recur across every deployment we have seen reviewed in 2026. The honest reference architecture names them.

Decoder saturation is the silent killer. A Jetson Orin NX can advertise “12 streams of 1080p”, and it can — once. Add a tracker, a secondary classifier, a transcode for WebRTC, and you find the decoder is already at 90% before the inference engine sees a frame. The first thing to measure on a new pipeline is nvidia-smi dmon or tegrastats decoder utilization, not GPU utilization.

Camera firmware variance is the second. ONVIF Profile S compliance is a polite suggestion across the industry, and two cameras from the same vendor with the same model number can ship different RTSP behaviour after a firmware update. Reconnect logic, codec renegotiation, and tolerant timestamp handling on the edge are not edge cases; they are the steady state. Build them in from day one.

Model drift in production is the third. The detector that hit 92% recall in the lab loses two or three points a quarter as lighting changes, camera lenses age, scenes change layout, and the operating-design domain drifts. Without a labelled shadow set and a continuous evaluation loop, you discover the drift only when an investigator complains, by which time you have months of bad data. The shadow-mode discipline that production AV programs use, which we discuss in our autonomous vehicle stack reference architecture coverage, applies one-for-one to edge video.

WebRTC firewall reality is the fourth. Most enterprise networks do not pass UDP cleanly to the public internet, and most do not allow inbound STUN. You will need a TURN server (Coturn is the standard) deployed in the customer’s DMZ or the regional cloud, and you will need to actually test it from inside the customer’s network during the pilot, not from your laptop. We have lost more pilot weeks to this than to any inference problem.

Cloud egress cost blowing up during a retrospective investigation is the fifth. The architecture above defends against it by keeping raw video local, but a poorly written investigator UI that streams full-resolution video to a hundred concurrent analysts during a major incident can outrun a month’s egress budget in an afternoon. Rate-limit, cap concurrent streams, and prefer thumbnail-strip scrubbing over full-fidelity scrubbing in the investigator UI. The fix is design, not bigger bandwidth.

The honest reference architecture does not eliminate these failure modes. It keeps them attributable to the right layer, which is the difference between a system you can ship and a system you cannot.

Practical Recommendations

If you are designing or refactoring an edge video AI pipeline in 2026, pre-commit to the architectural decisions that consistently separate working deployments from troubled ones.

Start from DeepStream plus Triton on NVIDIA hardware, OpenVINO behind GStreamer on Intel, or RKNN behind GStreamer on Rockchip. Do not write your own meta-pipeline. The plugin ecosystems exist for a reason and you will not out-build them.
Decide the model partition before you pick the silicon. The edge gates determine how much cloud inference you commission; the cloud inference fraction determines how often you can afford to call a VLM; both determine the silicon you actually need.
Pick three retention tiers and three protocols. Hot edge for 2-7 days; warm regional for 7-30 days; cold archive only for event-tagged clips. WebRTC for live operator, HLS for live overview and replay, RTSP only on-prem.
Treat the event router as a first-class component. Without it, your message bus, your storage, and your operator UI all collapse under cardinality. With it, the rest of the stack is tractable.
Wire the safety, privacy, and audit case into the architecture on day one. Edge inference is the cleanest privacy posture available; do not undo it by lazily uploading raw video on a different code path.
Invest in observability earlier than you think you should. Decoder utilization, inference batch fill, dropped frames per camera, event router queue depth, and end-to-end latency from frame timestamp to event emit — those five metrics catch nine out of ten production incidents before customers do.
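The five metrics in the last recommendation can be checked together in one snapshot per camera or per node. The sketch below shows one way to structure that; every threshold in it is a placeholder assumption that you would replace with limits measured on your own pipeline.

```python
from dataclasses import dataclass


@dataclass
class PipelineSnapshot:
    decoder_util_pct: float
    batch_fill_pct: float
    dropped_frames_per_min: float
    router_queue_depth: int
    p95_latency_ms: float          # frame timestamp to event emit


# Placeholder limits, assumed for illustration only.
LIMITS = {
    "decoder_util_pct": ("above", 85.0),
    "batch_fill_pct": ("below", 50.0),
    "dropped_frames_per_min": ("above", 30.0),
    "router_queue_depth": ("above", 500),
    "p95_latency_ms": ("above", 300.0),
}


def alerts(snapshot: PipelineSnapshot) -> list[str]:
    """Return the names of metrics that breach their limit."""
    breached = []
    for name, (direction, limit) in LIMITS.items():
        value = getattr(snapshot, name)
        if (direction == "above" and value > limit) or \
           (direction == "below" and value < limit):
            breached.append(name)
    return breached


healthy = alerts(PipelineSnapshot(62.0, 88.0, 0.0, 12, 140.0))
saturated = alerts(PipelineSnapshot(93.0, 88.0, 45.0, 12, 410.0))
print(healthy, saturated)
```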

These are not glamorous decisions. They are the ones that separate the systems that ship from the systems that do not.

FAQ

What is a video analytics edge architecture?
A video analytics edge architecture runs the latency-critical and cardinality-heavy parts of a video AI pipeline on hardware located near the cameras — typically a Jetson Orin, an x86 server with an L4 GPU, or an equivalent Intel or Qualcomm edge box — and reserves the cloud for forensic, long-tail, and cross-site inference. The edge node handles decode, primary inference, tracking, and event routing; the cloud handles indexing, vision-language model search, re-identification, and warm or cold retention. The contract between layers is structured events and short clips, not raw video.

Why not just run video analytics in the cloud?
Three reasons. Bandwidth, because streaming raw 4K video from dozens of cameras to a public cloud costs more than the inference itself and saturates most site uplinks. Latency, because the 80-200 ms a frame spends reaching a cloud GPU, before inference and the return trip are counted, breaks any decision that has to be acted on within a few hundred milliseconds. Compliance, because in 2026 most jurisdictions treat raw video carrying faces or licence plates as sensitive data and edge inference is the cleanest way to keep it local.

What is the difference between DeepStream and Triton?
DeepStream is NVIDIA’s GStreamer-based meta-framework for video analytics pipelines — it provides plugins for decode, multiplex, inference, tracking, analytics, on-screen display, and message brokering. Triton Inference Server is a model-serving runtime that hosts inference models behind a gRPC or HTTP API. In modern DeepStream deployments the nvinferserver plugin routes inference to a local Triton instance, which means the same model repository can serve both edge and cloud workloads. They are complementary: DeepStream owns the pipeline, Triton owns the model serving.

What hardware do I need for a 50-camera deployment?
For fifty 1080p cameras at 15 fps with a YOLO-class detector plus a tracker and a light classifier, you can plan on three to four Jetson AGX Orin 64 GB units or one well-sized x86 server with one or two NVIDIA L4 cards. At 4K the same workload typically needs five to six Orins or an L40S-class card. Decoder saturation, not GPU saturation, is the usual ceiling — measure decoder utilization, not just GPU utilization, when sizing.

Should I use RTSP, WebRTC, or HLS for video streaming?
Use RTSP from camera to edge node, always. Use WebRTC from edge or media server to the live operator UI when you need sub-500-millisecond glass-to-glass latency, which the operations control room usually does. Use HLS or MPEG-DASH for replay, for non-time-critical live (managers, dashboards, mobile), and for any CDN-fronted delivery. Most production stacks serve all three concurrently and route consumers based on use case.
