LLM Gateway Architecture: The Control Plane for AI Apps

A good LLM gateway architecture is the single most underrated piece of production AI infrastructure. The first AI feature ships with one provider SDK wired straight into application code. The fifth feature ships with three SDKs, two API keys pasted into environment variables, no shared rate limiting, no cost visibility, and a retry loop copied between four services. When the primary provider has a bad hour, every one of those services fails independently. That sprawl is the problem an LLM gateway exists to solve.

A gateway is a proxy that sits between your applications and every model provider. It exposes one unified, usually OpenAI-compatible, API and centralizes the cross-cutting concerns: routing, failover, caching, rate limits, budgets, key management, guardrails, and observability. It turns per-app SDK chaos into one governed control plane.

What this covers: the responsibilities of an LLM gateway, the data-plane versus control-plane split, the reference architecture, how semantic caching and routing actually work, the latency and correctness trade-offs, and how LiteLLM, Portkey, Cloudflare, Kong, and Envoy compare.

Context and Background

Two years ago, “calling an LLM” meant one vendor and one endpoint. In 2026 a typical AI product talks to several: a frontier model for hard reasoning, a cheap small model for classification, an embedding model for retrieval, and often a self-hosted open-weight model for high-volume or sensitive traffic. Each provider has its own SDK, authentication scheme, error semantics, rate limits, and pricing. Wiring all of that into application code directly is the AI-era equivalent of letting every microservice open its own raw database connection with no pooling.

The incumbent answer is the AI gateway pattern, and the ecosystem has matured fast. LiteLLM is the dominant open-source option, exposing 100-plus providers behind one OpenAI-format proxy with cost tracking, load balancing, and guardrails. Portkey is a managed and open-source gateway built specifically for GenAI workloads, advertising routing to 1,600-plus models and 50-plus guardrails. Cloudflare, Kong, and the Envoy community ship gateways that extend existing edge or API-gateway infrastructure into the LLM domain.

The shift mirrors what happened to web services a decade ago. We moved cross-cutting concerns — TLS termination, authentication, rate limiting, observability — out of every service and into a shared ingress and service mesh. The LLM gateway is that same consolidation, applied to a new class of backend whose calls are slow, expensive, non-deterministic, and metered by the token. If you already run an API gateway, the mental model transfers almost directly; the differences are in caching semantics and cost accounting, which I cover below. For a related platform pattern in industrial systems, see our writeup on unified namespace architecture, which solves the same “stop point-to-point sprawl” problem for IoT data.

The Reference Architecture

An LLM gateway architecture splits into two halves: a data plane that proxies every request in the hot path, and a control plane that holds configuration, keys, budgets, and routing policy out of the hot path. The data plane does authentication, caching, routing, guardrails, and provider calls per request; the control plane is updated rarely and read often. Keeping them separate is what lets the proxy stay fast while policy stays centrally governed.

Figure 1: The layered LLM gateway architecture. Client apps hit one OpenAI-compatible API. The data plane handles auth, caching, routing, guardrails, and provider adapters in the request path; the control plane holds keys, quotas, and config; both feed a shared observability sink.

In Figure 1, every client — web app, mobile backend, or autonomous agent — speaks one API instead of five SDKs. The request enters the data plane, passes authentication and rate limiting, checks the cache, gets routed to a provider through a guardrails stage, and returns with usage metadata attached. The control plane never sits in the request path; it publishes configuration that the data plane reads. The observability sink receives a structured record of every call. This separation is the architectural spine everything else hangs from.

Why a gateway beats per-app SDK sprawl

The direct-SDK approach scatters policy across your codebase. Rate limits live in one service’s retry loop, cost tracking lives nowhere, and a provider key rotation means a deploy of every service that holds it. There is no single place to answer “how much did team X spend on GPT-class models last week” or “redact PII before it ever leaves our network.” A gateway makes those concerns first-class and global. Add a provider once, in config, and every app can use it. Rotate a key once, and every app picks it up.

The cost of consolidation is a network hop and a shared dependency. That trade is almost always worth it past two or three AI features, but it is a real trade, and I quantify the latency cost later. The gateway also becomes a critical-path dependency: if it is down, all AI features are down. That is why production gateways run as horizontally scaled, stateless data-plane pods backed by a fast shared store, never as a single instance.

Consider the concrete before-and-after. Before a gateway, adding Anthropic alongside your existing OpenAI integration means importing a second SDK, learning its Messages API, writing a second retry wrapper, threading a second key through your secrets pipeline, and adding a second cost-tracking path — repeated in every service that needs it. After a gateway, you add Anthropic once in the gateway’s config, point a logical alias at it, and every service can call it through the API it already uses. The marginal cost of the Nth provider drops from “a project per service” to “a config edit.” That asymmetry is the entire economic argument for the pattern, and it compounds as both your provider count and your service count grow.

The unified, OpenAI-compatible API

The unifying contract is almost always the OpenAI Chat Completions schema (and increasingly the newer Responses API shape). Applications send the same request body regardless of which model answers; the gateway’s provider adapters translate to Anthropic’s Messages API, Google’s Gemini API, Amazon Bedrock, or a self-hosted vLLM endpoint, then translate the response back. This is what makes a model swap a one-line config change instead of a code change.
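To make the adapter idea concrete, here is a minimal sketch of one translation direction, from an OpenAI-style chat request to an Anthropic Messages-style body. The function name `openai_to_anthropic` and the `default_max_tokens` value are my own assumptions, the model name is passed through unchanged, and only plain-text messages are handled; a real adapter also has to cover tools, images, streaming, and the response path.

```python
from typing import Any


def openai_to_anthropic(request: dict[str, Any], default_max_tokens: int = 1024) -> dict[str, Any]:
    """Translate an OpenAI-style chat request into an Anthropic Messages-style body.

    Illustrative only: plain-text messages, no tools, images, or streaming.
    """
    system_parts = []
    messages = []
    for msg in request["messages"]:
        if msg["role"] == "system":
            # The Messages API takes the system prompt as a top-level field.
            system_parts.append(msg["content"])
        else:
            messages.append({"role": msg["role"], "content": msg["content"]})

    body: dict[str, Any] = {
        "model": request["model"],
        "messages": messages,
        # max_tokens is required by the Messages API but optional in the OpenAI schema.
        "max_tokens": request.get("max_tokens", default_max_tokens),
    }
    if system_parts:
        body["system"] = "\n\n".join(system_parts)
    if "temperature" in request:
        body["temperature"] = request["temperature"]
    return body
```

Even this small mapping shows where the schemas disagree: a field that is optional on one side is required on the other, and the system prompt moves from the message list to its own field.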

The catch is that the OpenAI schema is a lowest common denominator. Provider-specific features such as Anthropic’s prompt caching blocks, Gemini’s safety settings, and structured-output modes either get mapped imperfectly or are exposed through pass-through extensions. A good gateway documents exactly which features are normalized and which are pass-through, because silent feature loss is a common and frustrating failure mode.

Core responsibilities at a glance

A production gateway owns, at minimum: provider routing and automatic failover; load balancing across keys and deployments; exact and semantic caching; rate limiting and per-key quotas; cost budgets and chargeback; centralized key and secret management; guardrails including PII (Personally Identifiable Information) redaction; retries with backoff; and end-to-end observability with distributed tracing. Each of these is a feature you would otherwise reimplement, badly, in every application. The next section walks through the two that have the most subtle engineering: routing with failover, and caching.

Deeper Analysis: Routing, Failover, and Caching

The two hardest things a gateway does well are deciding where to send a request and deciding when not to send it at all. Routing and failover keep you available when providers misbehave; caching keeps you cheap and fast. Both have correctness traps that look fine in a demo and bite in production.

Routing and automatic failover

Routing decides, per request, which deployment should answer. The simplest policy is a static primary with an ordered fallback list. More sophisticated policies do weighted load balancing across multiple keys or regions, latency-based routing to the fastest healthy deployment, or cost-based routing to the cheapest model that meets a quality bar. LiteLLM lets you define weighted deployments and configure fallbacks, retries, and routing at the key and team level. Portkey supports latency- and cost-based routing, fallbacks, canary deployments, and circuit breakers.

Figure 2: A request that survives a provider failure. The gateway authenticates, checks quota, misses the cache, forwards to the primary provider, gets a 429 rate-limit error, retries on a fallback provider, stores the successful response, and returns it with usage headers.

Figure 2 shows the failover path that makes a gateway worth running. The primary provider returns a 429 (rate limited), and instead of surfacing that error to the user, the gateway retries on a configured fallback. The critical engineering detail is which errors are retryable. A 429 or a 503 (service unavailable) is a clean retry-and-failover candidate. A 400 (bad request) is not — retrying a malformed prompt on another provider just wastes money and latency. A naive “retry everything” policy turns one bad request into N bad requests across every provider you have.
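The retryable-versus-not distinction can be sketched in a few lines. The set of status codes below is an assumed policy (the text above only commits to 429 and 503 as retryable and 400 as not), and `ProviderError` and `call_with_failover` are hypothetical names, not any gateway's API.

```python
from typing import Callable

# Assumed policy: rate limits and transient upstream failures are worth trying elsewhere.
RETRYABLE_STATUS = {429, 502, 503, 504}


class ProviderError(Exception):
    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status


def call_with_failover(providers: list[Callable[[dict], dict]], request: dict) -> dict:
    """Try providers in order; fail over only on retryable status codes."""
    last_error = None
    for call in providers:
        try:
            return call(request)
        except ProviderError as err:
            if err.status not in RETRYABLE_STATUS:
                raise  # e.g. 400: the request is bad, another provider will not help
            last_error = err
    if last_error is None:
        raise ValueError("no providers configured")
    raise last_error
```

A 400 surfaces immediately after one attempt, while a 429 moves on to the next provider in the list.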

Failover also has to respect idempotency and cost. Streaming responses complicate retries: if the primary streamed 300 tokens before failing, you cannot cleanly resume on a fallback, so most gateways restart the generation, and you pay for the discarded tokens. Circuit breakers help here — once a provider trips a failure threshold, the gateway stops sending it traffic for a cooldown window instead of paying the timeout tax on every request.
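A circuit breaker of the kind described here fits in a small class. This is a sketch under simple assumptions: consecutive-failure counting, a single cooldown, and a "half-open" trial request once the cooldown elapses. The threshold and cooldown defaults are placeholders, and the clock is injectable so the behavior can be tested without sleeping.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after `threshold` consecutive failures; allow a trial after `cooldown` seconds."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, let a trial request through (half-open state).
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            # Open (or re-open) the circuit and restart the cooldown.
            self.opened_at = self.clock()
```

The router checks `allow()` before each provider call and reports the outcome afterwards; a failed trial request re-opens the circuit for another cooldown window.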

There is also a subtle quality dimension to failover that pure availability thinking misses. Falling over from a frontier reasoning model to a cheaper fallback keeps the request alive but may silently degrade the answer. For a customer-facing summarizer that is fine; for a code-generation agent whose output runs unattended, a quiet downgrade to a weaker model can produce confidently wrong results that pass surface checks. The mature pattern is to tag each route with a minimum quality tier and only fail over within that tier, accepting a hard error rather than a dangerous downgrade when no in-tier capacity is available. This is the kind of policy that belongs in the control plane precisely because it is a business decision, not a transport detail.
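The in-tier failover rule is easy to express as a filter over candidate routes. The `Route` type, the integer tier scale, and the route names in the usage note are all hypothetical; the point is only that a missing in-tier route raises instead of quietly downgrading.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str
    tier: int      # higher means more capable; the scale itself is an assumption
    healthy: bool


class NoInTierCapacity(Exception):
    pass


def pick_route(candidates: list[Route], min_tier: int) -> Route:
    """Return the first healthy candidate at or above min_tier, in preference order."""
    for route in candidates:
        if route.healthy and route.tier >= min_tier:
            return route
    raise NoInTierCapacity(f"no healthy route at tier >= {min_tier}")
```

With an unhealthy tier-3 primary and a healthy tier-1 fallback, a summarizer configured with `min_tier=1` still gets an answer, while a code-generation agent configured with `min_tier=3` gets a hard error it can handle explicitly.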

Load balancing deserves its own note. When you hold multiple keys for the same provider — common, because per-key rate limits are the real ceiling at scale — the gateway spreads traffic across them to multiply your effective throughput. The naive approach is round-robin, but round-robin ignores that keys can have different remaining quotas and that some deployments are slower than others. Weighted and least-busy strategies do better: route proportionally to each deployment’s headroom, and shed load from a key the moment it starts returning 429s rather than continuing to hammer it. Done well, this turns a fleet of individually rate-limited keys into one large, smooth capacity pool.
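One way to route "proportionally to headroom" is a weighted random pick over the remaining quota of each key. This sketch assumes the gateway already tracks a headroom number per key (how it does so is out of scope), and `pick_key` is a hypothetical helper rather than any gateway's API.

```python
import random


def pick_key(headroom: dict[str, int], rng: random.Random) -> str:
    """Pick a key with probability proportional to its remaining quota."""
    # Keys with no headroom (for example, ones currently returning 429s) are skipped.
    available = {key: h for key, h in headroom.items() if h > 0}
    if not available:
        raise RuntimeError("all keys are exhausted")
    keys = list(available)
    weights = [available[key] for key in keys]
    return rng.choices(keys, weights=weights, k=1)[0]
```

Setting a key's headroom to zero when it starts returning 429s sheds load from it immediately, and restoring the number brings it back into rotation.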

Exact and semantic caching

Caching is where a gateway earns its latency and cost savings, and where it most easily becomes wrong. There are two layers. Exact caching keys on a normalized hash of the request — same model, same parameters, same prompt — and returns a stored response on a hit. Cloudflare reports identical-request caching can cut latency by up to 90 percent and avoid the provider call entirely. Semantic caching goes further: it embeds the prompt into a vector and returns a stored answer when a previous prompt is similar enough, above a configured threshold. LiteLLM implements this via Redis or Qdrant vector search; Portkey supports both simple and semantic caching.
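The "normalized hash of the request" for the exact layer might look like the sketch below. Which fields belong in the key, the whitespace stripping, and the default temperature of 1.0 are assumptions on my part; a real gateway has to include every parameter that can change the response.

```python
import hashlib
import json


def exact_cache_key(request: dict) -> str:
    """Hash the fields that determine the response into a stable cache key."""
    normalized = {
        "model": request["model"],
        "messages": [
            {"role": m["role"], "content": m["content"].strip()}
            for m in request["messages"]
        ],
        "temperature": request.get("temperature", 1.0),
        "max_tokens": request.get("max_tokens"),
    }
    # sort_keys and fixed separators make the serialization independent of dict order.
    canonical = json.dumps(normalized, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two requests that differ only in key order or trailing whitespace hash to the same key; changing the model or any listed parameter produces a different one.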

Figure 3: The two-stage cache lookup. The gateway normalizes and hashes the prompt for an exact match first; on a miss it embeds the prompt and checks vector similarity against the threshold; only a full miss reaches the provider, and the fresh response populates both layers.

The trap in Figure 3 is the similarity threshold. Set it too loose and the cache returns an answer to a different question — “What is the refund policy for orders over $100?” served from a cached “What is the refund policy?” is a correctness incident, not a cache hit. Set it too tight and the hit rate collapses to roughly exact-match levels and you have paid for embeddings and a vector lookup for nothing. The right threshold is workload-specific and must be measured, not guessed, and it should never be applied to prompts whose answers depend on parameters the embedding ignores (user identity, account state, timestamps).
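Here is a toy version of the semantic layer that makes the threshold and the scoping rule visible. It is a linear scan over in-memory vectors, where a production gateway would use a vector index, and the `scope` string is my own device for keeping answers that depend on user or tenant identity from leaking across callers. The embeddings are passed in; how they are produced is outside the sketch.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float):
        self.threshold = threshold
        # Each entry is (scope, embedding, response).
        self.entries: list[tuple[str, list[float], str]] = []

    def put(self, scope: str, embedding: list[float], response: str) -> None:
        self.entries.append((scope, embedding, response))

    def get(self, scope: str, embedding: list[float]):
        """Return the most similar response in this scope, or None below the threshold."""
        best_score, best_response = 0.0, None
        for entry_scope, entry_embedding, response in self.entries:
            if entry_scope != scope:
                continue  # never serve one scope's answer to another
            score = cosine(embedding, entry_embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

The same pair of prompts can be a hit at a threshold of 0.5 and a miss at 0.95, which is why the number has to come from measuring real traffic per route.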

Caching also interacts badly with non-determinism and freshness. Any request with temperature above zero is, by design, supposed to vary — caching it silently removes that variation. Any request whose correct answer changes over time (today’s date, current inventory, a tool-call result) must either skip the cache or carry a short TTL (time to live). The safe default is to cache aggressively for deterministic, stateless, read-only prompts and to make caching opt-in for everything else.
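That "safe default" can be written as a small policy function. The TTL values, the `route_opt_in` flag, and the assumed default temperature of 1.0 are illustrative choices, not settings from any particular gateway.

```python
from typing import Optional


def cache_ttl_seconds(request: dict, route_opt_in: bool = False) -> Optional[int]:
    """Return a TTL if the request may be cached, or None to bypass the cache."""
    if request.get("tools"):
        return None  # tool-call results can change between calls
    if request.get("temperature", 1.0) > 0:
        # Sampling is meant to vary: cache only if the route opted in, and only briefly.
        return 60 if route_opt_in else None
    return 3600  # deterministic, stateless prompt: cache with a longer TTL
```

Anything the function cannot positively identify as deterministic and stateless skips the cache unless the route owner has explicitly opted in.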

A worked example makes the economics concrete. Suppose an FAQ-style support assistant serves one million requests a month, each averaging 1,500 input tokens and 400 output tokens against a model priced around $3 per million input tokens and $15 per million output tokens. Uncached, that is roughly $4,500 in input and $6,000 in output, or about $10,500 a month in round numbers. Now assume a 40 percent combined cache hit rate, plausible for a workload with a long tail of repeated questions. Those 400,000 served-from-cache requests cost only the cache lookup (and, for the semantic layer, an embedding call at a fraction of a cent), cutting the provider bill by roughly 40 percent to about $6,300. The savings are real, but note the asymmetry: you save the full provider cost on hits and pay a small fixed overhead on every miss. The break-even hit rate for semantic caching, below which the embedding and vector-search overhead is not worth it, is workload-dependent, which is exactly why it has to be measured rather than assumed. These figures are illustrative; plug in your own token counts and current provider pricing before committing to a number.
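The arithmetic is simple enough to keep as a small function you can rerun with your own numbers. Like the prose estimate, this sketch treats cache hits as free and ignores the lookup and embedding overhead.

```python
def monthly_cost(requests: int, in_tokens: int, out_tokens: int,
                 in_price_per_m: float, out_price_per_m: float,
                 hit_rate: float = 0.0) -> float:
    """Provider spend in dollars; cache hits are treated as free."""
    billed = requests * (1.0 - hit_rate)
    input_cost = billed * in_tokens / 1_000_000 * in_price_per_m
    output_cost = billed * out_tokens / 1_000_000 * out_price_per_m
    return input_cost + output_cost


uncached = monthly_cost(1_000_000, 1_500, 400, 3.0, 15.0)
cached = monthly_cost(1_000_000, 1_500, 400, 3.0, 15.0, hit_rate=0.40)
print(round(uncached), round(cached))  # prints: 10500 6300
```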

Rate limits, budgets, and key management

Beyond routing and caching, the gateway is the only sane place to enforce limits and track spend. Per-key and per-team rate limits protect both your providers’ quotas and your wallet from a runaway agent loop. Budgets and chargeback turn raw token usage into per-project cost, which is what makes inference cost optimization actually enforceable rather than aspirational — you cannot optimize what you cannot attribute. Key management is the quiet win: provider keys live in the gateway’s secret store and never in application code or client bundles. Applications hold a virtual gateway key, scoped and revocable, and the real provider credentials stay in one rotatable place.

Rate limiting itself has more nuance than a single requests-per-minute number. Token-based limits matter more than request counts, because a single 100,000-token request costs more and stresses a provider more than a hundred tiny ones. Cloudflare AI Gateway, for instance, supports sliding-window and fixed-window techniques, and the sliding window avoids the burst-at-the-boundary artifact where a fixed window lets a client send a full window’s worth of traffic in the last second of one window and the first second of the next. For agent workloads, a per-key concurrency cap is often more useful than a rate-per-minute, because a misbehaving agent’s failure mode is fan-out — spawning dozens of parallel calls — not steady high throughput.
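To illustrate both points, a token-denominated budget and a rolling window, here is a minimal in-memory limiter for a single key. It is a sketch, not Cloudflare's implementation: a real gateway would keep this state in a shared store so every replica sees the same counts, and the clock is injectable only to make the behavior testable.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowTokenLimiter:
    """Allow at most `max_tokens` per rolling `window` seconds for one key."""

    def __init__(self, max_tokens: int, window: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_tokens = max_tokens
        self.window = window
        self.clock = clock
        self.events: deque = deque()  # (timestamp, tokens) in arrival order
        self.used = 0

    def try_acquire(self, tokens: int) -> bool:
        now = self.clock()
        # Drop events that have aged out of the rolling window.
        while self.events and now - self.events[0][0] >= self.window:
            _, old_tokens = self.events.popleft()
            self.used -= old_tokens
        if self.used + tokens > self.max_tokens:
            return False
        self.events.append((now, tokens))
        self.used += tokens
        return True
```

A client that spends its whole budget at second 59 is still refused at second 61, which is exactly the boundary burst a fixed 60-second window would let through.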

Guardrails, PII redaction, and observability

The last two responsibilities are guardrails and observability, and they are what separate a toy proxy from a control plane. Guardrails inspect inputs and outputs against policy: PII redaction, jailbreak and prompt-injection detection, profanity or topic filters, and schema validation on structured outputs. Portkey ships 40-plus pre-built guardrails and built-in PII detection and redaction, and a guardrail failure can itself trigger a fallback rather than a hard error. The architectural point is that guardrails belong at the chokepoint, not scattered across apps, so a new policy applies everywhere at once.
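As a rough illustration of an input guardrail, the sketch below redacts email addresses and card-like numbers before a prompt leaves the gateway. The regular expressions are deliberately simple and are my own, not Portkey's detectors; the Luhn checksum is included because a bare "long run of digits" rule flags order numbers and SKUs as cards.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Luhn checksum, used to cut false positives on card-like digit runs."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> str:
    def card_sub(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[CARD]" if luhn_valid(digits) else match.group()

    text = EMAIL_RE.sub("[EMAIL]", text)
    return CARD_RE.sub(card_sub, text)
```

Pattern-based redaction like this misses plenty of real PII and still produces some false positives, so treat it as a starting point whose hit and miss rates need measuring.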

Observability is the responsibility most teams underrate until an incident. Because every call flows through the gateway, it is the natural place to emit a structured record per request: which app and key made it, which model answered, input and output token counts, latency split into queue and provider time, cache hit or miss, and computed cost. Feed that into a trace backend and you get end-to-end visibility no per-app SDK can match — you can answer “why did p95 latency spike at 3pm” with the model, region, and cache-hit-rate breakdown in one query. Without it, you are flying a multi-provider AI platform blind.
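A per-request record along these lines is enough to answer most incident questions. The field names, the sample values, and the per-million-token price inputs are assumptions for illustration; the shape matters more than the exact schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CallRecord:
    app: str
    virtual_key_id: str
    model: str
    input_tokens: int
    output_tokens: int
    queue_ms: float
    provider_ms: float
    cache: str        # "hit" or "miss"
    cost_usd: float


def compute_cost(input_tokens: int, output_tokens: int,
                 in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost of one call given per-million-token prices."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000


record = CallRecord(
    app="support-bot", virtual_key_id="vk_team_a", model="reasoning-model",
    input_tokens=1500, output_tokens=400, queue_ms=4.0, provider_ms=820.0,
    cache="miss", cost_usd=compute_cost(1500, 400, 3.0, 15.0),
)
print(json.dumps(asdict(record)))  # one structured line per request
```

Emitting one flat JSON line per call keeps the record easy to ship to whatever trace or log backend you already run.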

Comparing the leading gateways

There is no single best gateway; the right choice depends on whether you want open-source control, a managed control plane, or an extension of edge infrastructure you already run. The matrix below summarizes the trade-offs across the five most common options as of mid-2026. Treat feature columns as directional — every project ships fast, so verify specifics against current docs before you commit.

| Gateway | Deployment model | Routing & fallback | Semantic caching | Built-in guardrails | Best fit |
| --- | --- | --- | --- | --- | --- |
| LiteLLM | Open source, self-host | Weighted, fallbacks, retries per key/team | Yes (Redis or Qdrant) | Via integrations | Teams wanting full control and a fast start |
| Portkey | Managed + open source | Latency/cost routing, canary, circuit breakers | Yes (simple + semantic) | 40-plus built-in, PII redaction | Enterprises wanting governance out of the box |
| Cloudflare AI Gateway | Managed (edge) | Fallback, retries | Exact today, semantic on roadmap | Limited | Apps already on Cloudflare wanting edge caching |
| Kong AI Gateway | Self-host/managed (plugins) | Plugin-based routing | Via plugins | Generic API rules, fewer LLM-native | Shops already standardized on Kong |
| Envoy / Gloo AI Gateway | Self-host (Envoy) | Envoy routing | Layered on, not native | Limited LLM-native | Platform teams running Envoy/service mesh |

The pattern in the matrix is consistent: the purpose-built LLM gateways (LiteLLM, Portkey) lead on LLM-native features like semantic caching, token analytics, and guardrails, while the extended API gateways (Kong, Envoy/Gloo) win when you already operate that infrastructure and want one control plane over both ordinary services and model traffic. Cloudflare occupies a middle ground: excellent edge caching and analytics with a lighter LLM-native feature set, ideal if your traffic already lives on its network. None of these is wrong; the deciding factor is your existing platform and how much governance you need at the proxy versus in the app.

Trade-offs, Gotchas, and What Goes Wrong

The gateway is a proxy, and every proxy adds latency. The honest number for a well-run gateway is single-digit to low-tens of milliseconds of overhead per request when the cache misses — auth, a cache lookup, routing logic, and the provider hop. Against an LLM call that takes hundreds of milliseconds to many seconds, that overhead is usually noise. But semantic caching adds an embedding call plus a vector search on every miss, which can add tens of milliseconds of its own; if your hit rate is low, you are paying that tax for little return. Measure the miss-path overhead, not just the hit-path savings.

The single biggest gotcha is that the gateway is a new single point of failure. Every AI feature now depends on it. The mitigations are non-negotiable: run the data plane as multiple stateless replicas behind a load balancer, keep the control-plane store (keys, config) highly available, and make sure a control-plane outage degrades gracefully — the data plane should keep serving with its last-known config rather than failing closed. A gateway that hard-fails when its config database blinks is worse than no gateway.
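"Serve the last-known config" comes down to never overwriting a good snapshot with a failure. A minimal sketch, assuming a `fetch` callable that pulls the current config from the control plane and raises when the control plane is unreachable:

```python
from typing import Callable, Optional


class ConfigCache:
    """Hold the last successfully fetched config and keep serving it if a refresh fails."""

    def __init__(self, fetch: Callable[[], dict]):
        self.fetch = fetch
        self.current: Optional[dict] = None

    def refresh(self) -> bool:
        try:
            self.current = self.fetch()
            return True
        except Exception:
            # Control plane unreachable: keep the previous snapshot instead of failing closed.
            return False

    def get(self) -> dict:
        if self.current is None:
            raise RuntimeError("no config has ever been loaded")
        return self.current
```

The one case this cannot cover is a cold start during an outage, so it is worth deciding up front whether a new replica should refuse traffic or load a snapshot from disk.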

Figure 4: Multi-region deployment. Each region runs stateless gateway pods with a local Redis for cache and keys; an anycast load balancer routes users to the nearest region; a global config store pushes routes, budgets, and keys; a central sink aggregates spend and traces.

Figure 4 raises the multi-region trade-off. Regional caches cut latency but fragment the cache — a hit in one region is a miss in another — and replicating cache state across regions costs bandwidth for data that is cheap to recompute. Most teams keep caches regional and accept the lower cross-region hit rate. Budget enforcement is harder still: if each region counts spend locally, a global budget can be overshot during the reconciliation window. The fix is a central spend sink with near-real-time aggregation, and accepting that hard budget caps are eventually-consistent, not instantaneous.

Data residency complicates the picture further. If EU traffic must stay in the EU for compliance, the gateway has to route EU requests only to providers and regions that satisfy that constraint — and a global fallback policy that quietly fails an EU request over to a US provider is now a compliance violation, not just a latency event. This is another argument for putting routing policy in the control plane where it can encode jurisdiction rules, rather than in scattered application code that no auditor can review. The same applies to logging: if prompts can contain regulated data, the observability sink itself becomes a residency and retention question, and redaction has to happen before the log is written, not after.

There is also a more philosophical gotcha worth naming. A gateway that does its job invisibly can lull teams into forgetting which model actually answered a given request. When routing, fallback, and caching all operate silently, the same application prompt might be served by a frontier model, a cheaper fallback, or a months-old cache entry on three consecutive calls — and if a user reports a bad answer, reproducing it requires the trace that records exactly what happened. This is why the observability discipline above is not optional polish: in a system designed to abstract away which model answers, the trace is the only ground truth you have.

A quieter failure mode is the gateway becoming an organizational bottleneck. Once one team owns the proxy that every other team’s AI traffic flows through, every “please add this provider” or “raise my rate limit” becomes a ticket. If the gateway’s config changes require a code review and a deploy by the platform team, you have recreated the central-IT chokepoint that microservices were supposed to kill. The fix is self-service: give teams scoped control over their own keys, budgets, and routes through the control plane, with guardrails on what they can change, so the gateway accelerates teams instead of gating them.

Versioning is another trap. Provider APIs and model names change — a model gets deprecated, a new snapshot ships, pricing shifts. If your apps hard-code model names that the gateway passes straight through, a provider deprecation breaks every app at once. Better to expose stable logical model aliases (your fast-model, your reasoning-model) that the gateway maps to concrete provider models, so a migration is a one-line config change in the control plane rather than a coordinated redeploy across every consumer.
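Alias resolution is little more than a lookup table, which is the point. The alias names and the concrete model identifiers below are hypothetical placeholders; rejecting unknown names is one possible policy, chosen here so that applications cannot bypass the alias table by hard-coding a provider model.

```python
# Hypothetical alias table; in practice this lives in control-plane config.
ALIASES = {
    "fast-model": "provider-a/small-model-v2",
    "reasoning-model": "provider-b/frontier-model-v5",
}


def resolve_model(requested: str, aliases: dict[str, str]) -> str:
    """Map a logical alias to a concrete model; reject names outside the alias table."""
    try:
        return aliases[requested]
    except KeyError:
        raise ValueError(f"unknown model alias: {requested!r}") from None
```

When a provider deprecates a snapshot, the migration is an edit to one entry in the table, and callers keep asking for `reasoning-model` as before.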

Other recurring failures: silent provider feature loss through the lowest-common-denominator API; retrying non-retryable 4xx errors and multiplying cost; semantic-cache thresholds tuned on a demo dataset that drift in production; and treating guardrails as a checkbox rather than a measured filter. Guardrails and PII redaction are also a latency and false-positive trade — an over-eager PII filter that redacts a product SKU it mistook for a credit-card number breaks the feature as surely as a leak does. And because the gateway sees every prompt, it is itself a high-value attack surface; the same prompt-injection and agentic-AI security concerns that apply to your apps apply doubly to the chokepoint they all flow through.

Practical Recommendations

Start with the smallest gateway that removes the worst pain. If your problem is provider sprawl and no cost visibility, a self-hosted LiteLLM proxy gets you a unified API, virtual keys, spend tracking, and fallbacks in an afternoon. If you need managed guardrails, governance, and a polished dashboard without running the infrastructure, a managed gateway like Portkey or Cloudflare AI Gateway trades some control for operational simplicity. If you already run Kong or Envoy at the edge, extending it into AI routing keeps one control plane — at the cost of weaker LLM-native features like token analytics and built-in guardrails.

Treat caching as opt-in, not default. Enable exact caching freely for deterministic prompts; gate semantic caching behind a measured per-route threshold and never apply it to personalized or time-sensitive requests. Make retry policy explicit about which status codes are retryable, and put a circuit breaker in front of every provider. Above all, instrument first: a gateway whose value you cannot measure in saved dollars and avoided incidents is hard to justify and harder to tune.

Checklist before you call a gateway production-ready:

  • [ ] Data plane runs as 2-plus stateless replicas behind a load balancer.
  • [ ] Control-plane outage degrades gracefully (serves last-known config).
  • [ ] Provider keys live only in the gateway secret store; apps hold virtual keys.
  • [ ] Retry policy lists retryable status codes explicitly; 4xx is not retried.
  • [ ] Circuit breaker configured per provider with a cooldown window.
  • [ ] Semantic-cache threshold measured per route; disabled for personalized prompts.
  • [ ] Per-key and per-team rate limits and budgets enforced.
  • [ ] Every request emits a trace with provider, latency, tokens, and cost.
  • [ ] Guardrails and PII redaction measured for false-positive rate, not just enabled.

Frequently Asked Questions

What is an LLM gateway and how is it different from an API gateway?

An LLM gateway is a proxy specialized for model traffic: one OpenAI-compatible API in front of many providers, with routing, failover, token-aware caching, per-token cost tracking, and guardrails. A general API gateway handles auth, rate limiting, and routing for ordinary HTTP services but lacks LLM-native features — semantic caching, model-level observability, and token or cost analytics. You can extend an API gateway like Kong or Envoy into the LLM role, but you bolt those capabilities on rather than getting them by default.

Does an LLM gateway add too much latency?

For most workloads, no. A well-run gateway adds single-digit to low-tens of milliseconds of overhead on a cache miss — auth, a cache check, routing, and the provider hop — which is negligible against an LLM call measured in hundreds of milliseconds or seconds. The exception is semantic caching, which adds an embedding call and a vector lookup on every miss. If your hit rate is low, measure that miss-path cost; it can erase the savings. Cache hits, by contrast, return in milliseconds.

Is semantic caching safe for production?

It is safe when scoped carefully and dangerous when applied blindly. The risk is the similarity threshold returning a stored answer to a subtly different question, which is a correctness incident rather than a cache hit. Use semantic caching only for stateless, deterministic, read-only prompts, measure the threshold per route on real traffic, and never apply it to prompts whose answer depends on user identity, account state, or time. For everything else, prefer exact caching or no caching.

Should I build an LLM gateway or use an existing tool?

Almost always use an existing tool. LiteLLM (open source), Portkey (managed and open source), and Cloudflare AI Gateway cover routing, caching, budgets, and observability out of the box, and the LLM gateway architecture is well-trodden enough that building your own rarely pays off. Build only if you have requirements no tool meets — an unusual compliance boundary, a proprietary routing policy, or extreme latency constraints — and even then, fork an open-source gateway rather than starting from a blank file.

How does a gateway handle multiple model providers?

Through provider adapters behind one unified schema. Applications send an OpenAI-format request; the gateway translates it to each provider’s native API — Anthropic Messages, Gemini, Bedrock, or a self-hosted vLLM endpoint — and translates the response back. Adding a provider is a config change, and swapping models is a one-line edit. The limitation is that provider-specific features outside the common schema are either normalized imperfectly or exposed as pass-through extensions, so check which features your gateway preserves before relying on them.

What happens when the gateway itself goes down?

It takes every AI feature with it, which is why deployment matters more than features. Run the data plane as multiple stateless replicas behind a load balancer so no single instance is critical, keep the control-plane store highly available, and design for graceful degradation: a config-store outage should let the data plane keep serving with its last-known configuration rather than failing closed. A single-instance gateway is a liability; a properly replicated one with failover configured can be more available than any single provider behind it.

By Riju
