Context Engineering for Production LLM Agents (2026)

Context Engineering for LLM Agents: Production Patterns for 2026

Context engineering for LLM agents is now the load-bearing skill in production AI work, and it has quietly displaced prompt engineering as the thing that separates a demo from a system that survives contact with real workloads. The shift is structural, not fashionable. A single clever prompt no longer decides outcomes; what decides them is what tokens occupy the context window at each step of a multi-turn, multi-tool agent loop, how those tokens got there, and how aggressively the stale ones get evicted. The context window is a scarce, shared, and increasingly contested resource. Every retrieved document, every tool result, every line of conversation history competes for the same budget, and every token you spend has both a latency cost and an accuracy cost. Treating that budget as something to be engineered, measured, and defended is what this discipline is about.

This post is a working catalog, not a manifesto. It covers why context engineering replaced prompt engineering as the core skill, a deep pattern catalog you can lift into a codebase, the failure modes that bite in production, how to measure context quality with evals and budgets, and a concrete checklist plus FAQ.

What is context engineering?

Context engineering is the practice of deciding which information enters an LLM’s context window at each step of an agent’s run, in what order, at what fidelity, and for how long, so the model has exactly what it needs and little else. It treats the context window as a managed resource with budgets, eviction policies, and assembly logic rather than a free-text box.

That definition sounds modest until you watch a long-running agent fall apart. Prompt engineering optimized a mostly static instruction string. Context engineering optimizes a moving target: a window whose contents change every turn as tools fire, documents arrive, and history accumulates. The job is no longer “write good wording.” It is “build a deterministic system that decides, turn after turn, what the model sees.” Wording still matters, but it is now one layer inside a pipeline.

Three forces drove the shift. First, agents loop. A coding agent or research agent can run dozens or hundreds of steps, and each step appends tool output that, left unmanaged, balloons the window. Second, longer context windows did not eliminate the problem; they relocated it. Bigger windows let you stuff more in, but models do not attend to all positions equally, and quality degrades as windows fill. Third, cost and latency scale with tokens. A bloated window is slower and more expensive on every call, and an agent makes many calls. So the optimization target moved from “the prompt” to “the entire flow of context across a session.”

A useful mental model is to treat the context window like a CPU cache or a working set in an operating system, not like a document you compose once. The window is small and fast; the external stores are large and slow. The art is keeping the working set of tokens, the ones the model actually needs for the current decision, resident in the window while everything else lives outside and is paged in only when required. Seen this way, most of the patterns below are just cache management: eviction policies, prefetching, write-back to durable storage, and keeping the hot path lean. Engineers who already think in those terms tend to pick up context engineering fast, because the failure modes rhyme with cache thrashing, working-set blowups, and stale reads.

Figure 1: Context assembly pipeline. Multiple sources are gated, pruned, and assembled deterministically before a token-budget check decides whether compaction runs.

The pattern catalog

This is the meat of the post. Each pattern is a reusable technique you can compose. None is exotic; the skill is in combining them with discipline and in knowing which one a given symptom calls for.

System and instruction layering

Structure the context as explicit layers, ordered from most stable to most volatile. A typical layering, top to bottom: immutable system identity and safety rules, then durable task instructions and tool definitions, then retrieved knowledge, then conversation and tool history, then the current user turn. The ordering is not cosmetic. It serves two goals at once: it puts the highest-priority, least-changing material in a fixed, predictable position at the top of every call, and it keeps the prefix stable so prompt caching and KV cache reuse stay valid.

The cache point matters enough to repeat. Caching mechanisms reward a stable prefix: if the first N tokens of your context are byte-identical to the previous call, the provider can often reuse computed state and you pay less, both in money and latency. The moment you mutate something early in the context, everything after the mutation point must be recomputed. Layering with stable-first ordering is therefore not just for the model’s benefit; it is a cost-control pattern. Put volatile content (the latest tool result, the new user message) at the end, never woven into the middle of otherwise-stable text.
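
One way to see the effect is to compare two consecutive calls and measure how much leading text they share. The sketch below is a character-level proxy only: real providers cache at token or block granularity under their own rules, and `cacheable_prefix_chars` and the sample strings are hypothetical names made up for the illustration.

```python
from os.path import commonprefix  # compares strings character by character


def cacheable_prefix_chars(previous_context: str, current_context: str) -> int:
    """Length of the leading run of characters shared by two serialized contexts."""
    return len(commonprefix([previous_context, current_context]))


stable = "SYSTEM: You are a build agent.\nTOOLS: read_file, run_tests\n"
call_1 = stable + "USER: run the tests\n"

# Volatile content appended at the end: the whole previous call is still a prefix.
call_2_appended = call_1 + "TOOL: 3 failures\n"

# Volatile content injected at the top: the shared prefix collapses to nothing.
call_2_mutated = "NOW: 12:01\n" + call_1 + "TOOL: 3 failures\n"

shared_when_appended = cacheable_prefix_chars(call_1, call_2_appended)
shared_when_mutated = cacheable_prefix_chars(call_1, call_2_mutated)
```

In the appended case the shared prefix equals the full length of the previous call; in the mutated case it is zero, which is the pattern that tends to show up as a collapsing cache hit rate.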

Retrieval into context with relevance gating

Retrieval-augmented generation is a context-engineering pattern, not a separate system. The naive version dumps top-k chunks into the window and hopes. The production version gates. After vector or hybrid search returns candidates, rerank them, then apply a relevance threshold and drop anything below it, even if that means injecting three chunks instead of ten. Irrelevant-but-plausible chunks are not free filler; they actively degrade reasoning (see distractor poisoning below). The gate is the difference between RAG that helps and RAG that quietly poisons.

A robust retrieval pattern looks like: rewrite the query for recall, run hybrid search (dense plus lexical), rerank the candidate set with a cross-encoder or model-based scorer, gate on the rerank score, deduplicate near-identical passages, order the survivors, and fit them to a retrieval-specific token budget so retrieval can never crowd out instructions or recent history. For the deeper version of this, see our agentic RAG architecture patterns.
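
The gate, deduplication, and budget-fitting steps can be sketched in a few lines. This is a minimal illustration under stated assumptions: candidates arrive already reranked with a score in [0, 1], tokens are approximated by whitespace-separated words, and deduplication only catches passages that are identical after case and whitespace normalization (true near-duplicate detection would need something stronger). `Chunk`, `approx_tokens`, and `gate_and_fit` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    score: float  # rerank score, assumed to lie in [0, 1]


def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def gate_and_fit(chunks: list[Chunk], min_score: float, budget_tokens: int) -> list[Chunk]:
    kept: list[Chunk] = []
    seen: set[str] = set()
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        if chunk.score < min_score:
            break  # sorted by score, so everything after this also fails the gate
        key = normalize(chunk.text)
        if key in seen:
            continue  # duplicate of a higher-scored chunk
        cost = approx_tokens(chunk.text)
        if used + cost > budget_tokens:
            continue  # too large for the remaining budget; a smaller chunk may still fit
        seen.add(key)
        kept.append(chunk)
        used += cost
    return kept
```

The function returns fewer chunks than it was given whenever the gate, the duplicate check, or the budget says so, which is the point: the retrieval layer cannot grow past its allotment regardless of what search returns.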

Figure 4: Retrieval with relevance gating. Candidates are reranked and gated on score before deduplication and budget-fitting, so low-value chunks never reach the model.

Tool-result pruning and summarization

Tool outputs are the single biggest source of context bloat in agentic systems. A file read, an API response, a search result, a SQL dump: each can be thousands of tokens, and most of those tokens are irrelevant to the next decision. Prune at the boundary. When a tool returns, immediately decide what to keep. Three tactics, in increasing aggression: (1) truncate with a marker so the model knows content was elided; (2) extract only the fields the agent asked about and discard the rest; (3) summarize the result into a few lines and stash the full payload in an external store keyed by an ID the agent can re-fetch.

The principle is that the raw tool result should rarely persist in context past the turn that consumed it. Once the agent has acted on a file’s contents, the next twenty turns do not need the full file; they need a one-line note that the file was read and what was found. Keep the note, evict the payload.

A subtlety worth flagging: prune after the agent has used the result, not before, and never prune so aggressively that you destroy information the agent has not yet had a chance to act on. A common bug is summarizing a tool result on the same turn it returns, before the model has read it, which strips out exactly the detail the next reasoning step needed. The safe sequence is: return the full result, let the agent reason over it, then evict at the following turn boundary, once the useful conclusions have been written to the scratchpad or memory. Treat tool exhaust the way a build system treats intermediate artifacts: necessary while building, deletable once the output exists.
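
The three tactics can be sketched as small helpers. Everything here is illustrative: `PAYLOAD_STORE` is an in-memory dict standing in for whatever external store you use, the marker and reference formats are arbitrary, and the one-line note passed to `stash_and_note` is assumed to come from the agent or a summarizer.

```python
import uuid

PAYLOAD_STORE: dict[str, str] = {}  # stand-in for an external store (assumption)


def truncate_with_marker(text: str, max_chars: int) -> str:
    """Tactic 1: cut the text and say so, so the model knows content was elided."""
    if len(text) <= max_chars:
        return text
    elided = len(text) - max_chars
    return f"{text[:max_chars]}\n[... {elided} characters elided ...]"


def extract_fields(payload: dict, wanted: list[str]) -> dict:
    """Tactic 2: keep only the fields the agent asked about."""
    return {key: payload[key] for key in wanted if key in payload}


def stash_and_note(payload: str, note: str) -> str:
    """Tactic 3: persist the full payload and keep a short note plus a re-fetch ID."""
    ref = uuid.uuid4().hex
    PAYLOAD_STORE[ref] = payload
    return f"{note} [full result: ref={ref}]"
```

A typical use would call `stash_and_note` at the turn boundary after the agent has acted on the result, replacing the raw payload in history with the returned line.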

Context compaction and compression at thresholds

Compaction is the scheduled garbage collection of the context window. You set a token threshold below the model’s hard limit, and when a turn pushes the window past it, a compaction step runs before the next inference. Compaction preserves pinned material (system layer, task goal, key decisions), summarizes the compactible spans (old tool results, resolved sub-threads), persists the detail to an external store, and replaces each compacted span with a short summary plus a pointer to retrieve the full version if needed.

The trigger and the policy are separate concerns. The trigger is mechanical: count tokens, compare to threshold. The policy is the judgment: what is safe to summarize, what must stay verbatim, how terse the summary can be. Get the policy wrong and you summarize away the one constraint the user stated forty turns ago. A safe default pins anything the user explicitly flagged as a requirement and anything the agent recorded as a decision, and compacts only observations and tool exhaust.

# PSEUDOCODE: compaction trigger and policy
THRESHOLD = 0.75 * model_context_limit

def maybe_compact(ctx):
    if count_tokens(ctx) < THRESHOLD:
        return ctx                      # nothing to do

    pinned   = ctx.segments_where(pinned=True)        # system, goals, decisions
    volatile = ctx.segments_where(pinned=False)       # old tool results, sub-threads

    for seg in volatile.oldest_first():
        if count_tokens(ctx) < THRESHOLD:
            break
        store_id = external_store.put(seg.full_text)  # persist detail
        summary  = summarize(seg, max_tokens=120)     # terse summary
        ctx.replace(seg, summary + ref(store_id))     # summary + pointer

    return ctx.rebuild_stable_prefix()                # keep cache valid

Figure 3: Compaction flow. When tokens cross the threshold, pinned content is preserved, old tool results are summarized and persisted, and the stable prefix is rebuilt to keep the cache valid.
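
For readers who want something executable, here is one possible concrete version of the trigger-and-policy loop. It is a sketch, not a reference implementation: tokens are approximated by whitespace-separated words, `summarize` is a placeholder that keeps the first few words where a real system would call a model, the store is a plain dict, and all names (`Segment`, `compact`, the `seg-N` reference format) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    pinned: bool = False  # system layer, goals, decisions


def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def total_tokens(segments: list[Segment]) -> int:
    return sum(approx_tokens(s.text) for s in segments)


def summarize(text: str, max_words: int = 8) -> str:
    """Placeholder summarizer: keeps the first few words. A real one would call a model."""
    return " ".join(text.split()[:max_words])


def compact(segments: list[Segment], threshold: int, store: dict[str, str]) -> list[Segment]:
    """Summarize unpinned segments, oldest first, until the window is under threshold."""
    result = list(segments)  # the caller's list is left untouched
    for i, seg in enumerate(result):  # list order is assumed to be oldest first
        if total_tokens(result) < threshold:
            break
        if seg.pinned:
            continue  # pinned material stays verbatim
        ref = f"seg-{len(store)}"
        summary = f"{summarize(seg.text)} [ref={ref}]"
        if approx_tokens(summary) >= approx_tokens(seg.text):
            continue  # summarizing would not save anything
        store[ref] = seg.text  # persist the detail before replacing it
        result[i] = Segment(summary)
    return result
```

Note the guard that skips a segment when the summary plus pointer would be no shorter than the original; without it, compaction of many short segments can make the window larger.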

Memory tiers

Production agents need more than the context window can hold and more durability than a single session provides, so organize memory into tiers. Working memory is the current turn’s active state, fully in-context. Short-term memory is recent turns, also in-context but the first thing compacted. Episodic memory is an external store of past sessions and events, retrieved when relevant. Long-term memory is consolidated facts, user profiles, and stable preferences in a durable store.

Information flows down the tiers as it ages (working overflows to short-term, short-term compacts to episodic, episodic consolidates to long-term) and back up on demand (retrieval pulls episodic and long-term entries into working memory when the current goal needs them). The discipline is to keep only the active tier in context and to treat the rest as retrievable, not resident. Memory is its own deep topic; we cover the storage and consolidation mechanics in LLM agent memory architecture for production.

Figure 2: Memory tiers. Working and short-term memory live in the context window; episodic and long-term memory live in external stores, flowing down as they age and up on retrieval.

Structured scratchpads and note-taking

Give the agent an explicit, structured place to write things down, and the context window stops being the only memory. A scratchpad is a small, persistent region (a notes file, a structured state object, a to-do list) that the agent reads and updates each turn. It externalizes intermediate reasoning, running plans, and partial results so they survive compaction without bloating the main history.

The structured form matters. Free-form notes drift; a schema (goal, sub-tasks with status, key findings, open questions, decisions) keeps the agent’s working state legible and lets you compact aggressively around it because the scratchpad already holds the distilled state. Many durable coding and research agents owe their stamina to a disciplined scratchpad more than to any single model capability.
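
A schema like the one described could look as follows. This is only a sketch: the field names mirror the list above, while the status values and the rendered text layout are arbitrary choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SubTask:
    description: str
    status: str = "pending"  # "pending", "in_progress", or "done" (illustrative values)


@dataclass
class Scratchpad:
    goal: str
    sub_tasks: list[SubTask] = field(default_factory=list)
    key_findings: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Serialize the scratchpad into a compact block for the context window."""
        lines = [f"GOAL: {self.goal}", "SUB-TASKS:"]
        lines += [f"  [{task.status}] {task.description}" for task in self.sub_tasks]
        sections = (
            ("KEY FINDINGS", self.key_findings),
            ("OPEN QUESTIONS", self.open_questions),
            ("DECISIONS", self.decisions),
        )
        for title, items in sections:
            lines.append(f"{title}:")
            lines += [f"  - {item}" for item in items]
        return "\n".join(lines)
```

Because `render` always emits the same sections in the same order, the scratchpad occupies a predictable slot in the assembled context and its size is easy to budget.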

Sub-agent context isolation

When a task has separable parts, spawn sub-agents with their own clean context windows instead of cramming everything into one. A sub-agent gets a narrow brief, does its work in an isolated window, and returns only a compact result to the parent. The parent never sees the sub-agent’s tool exhaust or intermediate steps, just the answer.

Isolation buys two things. It keeps each window focused, which improves quality and cost on both sides, and it contains failures: a sub-agent that goes off the rails pollutes only its own window, not the parent’s. The cost is coordination overhead and the risk of losing context the sub-agent needed but did not receive, so the parent’s brief to the sub-agent is itself a context-engineering artifact worth designing carefully. This composes naturally with modern tool-use loops, which we detail in Claude agent tool-use patterns.
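
The isolation boundary is easy to show with a toy loop. In this sketch the `step` callable stands in for one model-plus-tools turn of the sub-agent, and the `FINAL:` prefix is an invented convention for signalling completion; neither reflects any particular framework's API.

```python
from typing import Callable


def run_subagent(brief: str, step: Callable[[list[str]], str], max_steps: int = 5) -> str:
    """Run a sub-agent in its own window and return only its final answer."""
    window = [brief]  # fresh window: the brief is all the sub-agent starts with
    for _ in range(max_steps):
        output = step(window)
        window.append(output)  # tool exhaust accumulates here, not in the parent
        if output.startswith("FINAL:"):
            return output.removeprefix("FINAL:").strip()
    return "no result within step limit"


def fake_step(window: list[str]) -> str:
    """Stand-in for a real model turn: two tool calls, then an answer."""
    if len(window) >= 3:
        return "FINAL: 2 failing tests in auth/"
    return f"TOOL: scanned directory {len(window)}"


parent_window = ["SYSTEM: orchestrator", "USER: why is CI red?"]
result = run_subagent("Find the failing tests and report where they live.", fake_step)
parent_window.append(f"SUBAGENT RESULT: {result}")
```

After the call, `parent_window` has grown by exactly one line, and none of the sub-agent's intermediate `TOOL:` outputs appear in it.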

Just-in-time versus pre-loaded context

There are two philosophies for getting information into context, and mature systems blend them. Pre-loaded context is assembled up front: load the relevant docs, history, and state before the agent starts reasoning. It is simple and predictable but risks loading things the agent never needs. Just-in-time context is fetched on demand: give the agent retrieval and file tools and let it pull information exactly when a step requires it, keeping the baseline window lean.

Just-in-time wins on token efficiency and scales to large corpora the agent could never pre-load, at the cost of extra round-trips and the chance the agent fails to fetch something it should. Pre-loading wins on latency and determinism for well-scoped tasks. A common production blend pre-loads the small, always-relevant core (system, task, user profile, a handful of pinned facts) and leaves everything else to just-in-time retrieval.

Deterministic context assembly

Tie the whole catalog together with one rule: context assembly must be deterministic and testable. The function that builds the context for a given turn should be a pure-ish function of explicit inputs (state, retrieved set, history, budget), producing the same window for the same inputs. That determinism is what lets you write evals, reproduce failures, and reason about cache stability.

# PSEUDOCODE: deterministic context assembly for one turn
def assemble_context(state, user_turn, budget):
    ctx = Context()

    # 1. Stable prefix first (cache-friendly, highest priority)
    ctx.add(SYSTEM_LAYER,        pinned=True)
    ctx.add(TASK_INSTRUCTIONS,   pinned=True)
    ctx.add(TOOL_DEFINITIONS,    pinned=True)

    # 2. Retrieved knowledge, gated and budgeted
    chunks = retrieve(user_turn, k=20)
    chunks = rerank_and_gate(chunks, min_score=0.55)
    ctx.add(fit_to_budget(chunks, budget.retrieval))

    # 3. Memory and scratchpad (distilled, not raw)
    ctx.add(state.scratchpad)
    ctx.add(fit_to_budget(state.recent_history, budget.history))

    # 4. Volatile content LAST (never in the middle)
    ctx.add(user_turn)

    assert count_tokens(ctx) <= budget.total
    return ctx

Note where the volatile content lands: at the end, after everything stable, so the prefix stays cacheable and the model’s most recent and most relevant material sits at a strong attention position.

Failure modes

Knowing the patterns is half the job; the other half is recognizing the symptoms when context goes wrong. These are the failure modes that show up in production agents, with the pattern that addresses each.

Context rot / long-context degradation. As a window fills, model quality tends to drift down even when the relevant facts are technically present. The model gets distracted, repeats itself, or loses the thread. This is the empirical reality behind “bigger context windows did not solve everything.” The fix is to keep the effective context small through aggressive pruning and compaction, regardless of how large the available window is. A 200k-token window is a ceiling, not a target.

Lost in the middle. Models attend more reliably to information at the beginning and end of a long context than to material buried in the middle; this is a documented long-context phenomenon, where retrieval-style accuracy sags for facts placed mid-window. The mitigation is positional: place the most critical instructions and facts at the edges, keep the truly load-bearing material out of the deep middle, and shorten the window so there is less middle to get lost in.
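
One simple positional heuristic, sketched here under the assumption that items arrive ranked most-important-first, is to alternate them between the front and the back of the block so the weakest items end up in the middle. `edges_first` is a hypothetical helper, and whether this ordering helps a given model is something to confirm with the evals described later rather than take on faith.

```python
def edges_first(ranked: list[str]) -> list[str]:
    """Reorder most-important-first items so the top ones sit at the start and end."""
    front: list[str] = []
    back: list[str] = []
    for index, item in enumerate(ranked):
        # Even ranks fill from the front, odd ranks fill from the back.
        (front if index % 2 == 0 else back).append(item)
    return front + back[::-1]


ordered = edges_first(["A", "B", "C", "D", "E"])  # A is most important, E least
```

With five items ranked A through E, the result is A, C, E, D, B: the two strongest items hold the edges and the weakest sits in the center.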

Distractor poisoning. Plausible-but-irrelevant content (a near-miss retrieved chunk, a stale tool result, an off-topic memory) does not just waste tokens; it pulls the model toward wrong answers. The relevance gate and reranking exist precisely to keep distractors out. When debugging a confidently wrong agent, audit what irrelevant material was in its window.

Stale memory. External memory that is never invalidated will eventually feed the agent outdated facts (a changed requirement, a resolved bug, a superseded decision). Attach time-to-live or versioning to memory entries, refresh on access, and prefer recency in retrieval ordering when facts conflict.
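
A minimal sketch of TTL plus recency-wins conflict resolution follows. The entry shape, the field names, and the choice to pass `now` explicitly (which keeps the function deterministic and testable) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryEntry:
    key: str
    value: str
    written_at: float   # timestamp in seconds
    ttl_seconds: float  # how long the entry may be trusted


def live_entries(entries: list[MemoryEntry], now: float) -> list[MemoryEntry]:
    """Drop entries whose time-to-live has expired."""
    return [e for e in entries if now - e.written_at <= e.ttl_seconds]


def resolve(entries: list[MemoryEntry], key: str, now: float) -> Optional[str]:
    """Return the most recently written live value for a key, or None if all are stale."""
    candidates = [e for e in live_entries(entries, now) if e.key == key]
    if not candidates:
        return None
    return max(candidates, key=lambda e: e.written_at).value
```

Returning `None` for an expired key is deliberate: it forces the agent to re-fetch or ask instead of acting on a fact that may no longer hold.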

Token budget blowups. Without per-source budgets, one greedy source (a huge tool result, an over-eager retriever) consumes the window and starves everything else. Enforce explicit budgets per layer (retrieval gets X, history gets Y, scratchpad gets Z) and assert the total before every call. The assertion in the assembly function above is not decoration.
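
Per-layer budgets with a hard total check can be sketched as below. The assumptions: tokens are approximated by whitespace-separated words, each layer's items are already sorted by priority (so truncation drops the least important), and the total check raises an exception rather than using `assert`, since assertions are stripped when Python runs with `-O`. All names are hypothetical.

```python
def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def fit_to_budget(items: list[str], budget: int) -> list[str]:
    """Keep items in priority order until the next one would exceed the budget."""
    kept: list[str] = []
    used = 0
    for item in items:
        cost = approx_tokens(item)
        if used + cost > budget:
            break
        kept.append(item)
        used += cost
    return kept


def assemble(layers: dict[str, list[str]], budgets: dict[str, int], total_budget: int) -> list[str]:
    """Build the window layer by layer, then enforce the overall ceiling."""
    window: list[str] = []
    for name, items in layers.items():  # dicts preserve insertion order
        window += fit_to_budget(items, budgets[name])
    used = sum(approx_tokens(text) for text in window)
    if used > total_budget:
        raise ValueError(f"context uses {used} tokens, total budget is {total_budget}")
    return window
```

Each layer is capped independently, so an over-eager retriever can only hurt itself; the final check catches the case where the per-layer budgets were configured to sum past the total.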

Cache invalidation. Mutating early context, reordering stable segments, or injecting volatile content into the prefix silently invalidates prompt and KV caching, spiking cost and latency. Keep the prefix byte-stable and append volatile content only at the end. Treat the stable prefix as an interface contract.

Figure 5: Failure modes and their fixes. Each common context failure maps to a specific pattern from the catalog.

Measuring it: evals, budgets, and cost

You cannot engineer what you do not measure, and context quality is measurable if you instrument for it.

Context-quality evals. Build a held-out set of agent tasks with known-good outcomes and measure end-to-end success as you vary context policy. More diagnostic are targeted evals: a needle-in-a-haystack-style probe that inserts a known fact at varying positions and depths to expose lost-in-the-middle behavior in your assembled context; a distractor test that injects plausible-but-wrong chunks and checks whether the gate held; a compaction-fidelity test that runs a task to completion, forces compaction, and verifies no load-bearing fact was summarized away.
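
A position-sweep probe of this kind can be sketched in a few lines. Here `ask_model` is a hypothetical callable that takes the assembled context and returns the model's answer, and recall is scored by a plain substring check, which is cruder than what a production eval would use. The fake model in the example only reads the first and last three lines, purely to make the sag in the middle visible.

```python
from typing import Callable


def build_probe(filler: list[str], needle: str, depth: float) -> list[str]:
    """Insert the needle at a relative depth: 0.0 is the start, 1.0 is the end."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = round(depth * len(filler))
    return filler[:index] + [needle] + filler[index:]


def recall_by_depth(
    filler: list[str],
    needle: str,
    expected: str,
    depths: list[float],
    ask_model: Callable[[str], str],
) -> dict[float, bool]:
    """For each depth, record whether the model's answer contains the expected fact."""
    results: dict[float, bool] = {}
    for depth in depths:
        context = "\n".join(build_probe(filler, needle, depth))
        results[depth] = expected in ask_model(context)
    return results


def edge_only_model(context: str) -> str:
    """Fake model for the demo: it only 'sees' the first and last three lines."""
    lines = context.split("\n")
    return " ".join(lines[:3] + lines[-3:])


filler = [f"filler line {i}" for i in range(20)]
results = recall_by_depth(
    filler, "The deploy key is TANGERINE.", "TANGERINE", [0.0, 0.5, 1.0], edge_only_model
)
```

Swapping `edge_only_model` for a call into your real assembly-plus-inference path turns this into a regression test you can run whenever the context policy changes.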

Token budgets as a first-class metric. Track tokens per call, broken down by layer (system, retrieval, history, tools, scratchpad). The breakdown tells you where bloat lives. A retrieval layer creeping from 2k to 12k tokens over a session is a leak worth a ticket. Set budget alarms the way you set latency alarms.

Cost and latency. Tokens map almost linearly to cost and strongly to latency, so context efficiency is a direct lever on both. Measure cost per completed task, not cost per call; a leaner context that needs one more retrieval round-trip can still win on total task cost. Track cache hit rate explicitly: a falling hit rate usually means something is mutating the prefix.
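
The cost-per-task point is easiest to see with arithmetic. The numbers below are invented for illustration (the price, the call counts, and the window sizes are not measurements), and the sketch counts input tokens only, ignoring output tokens and any cached-token discount.

```python
def cost_per_task(calls: int, input_tokens_per_call: int, price_per_million_tokens: float) -> float:
    """Input-token cost of one completed task, in the same currency as the price."""
    return calls * input_tokens_per_call * price_per_million_tokens / 1_000_000


PRICE = 3.0  # hypothetical price per million input tokens

# A bloated agent: 10 calls at 40k tokens each.
bloated = cost_per_task(calls=10, input_tokens_per_call=40_000, price_per_million_tokens=PRICE)

# A lean agent: one extra retrieval round-trip, but 12k tokens per call.
lean = cost_per_task(calls=11, input_tokens_per_call=12_000, price_per_million_tokens=PRICE)
```

Under these made-up numbers the lean agent makes more calls yet costs roughly a third as much per task, which is why cost per call alone is a misleading metric.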

Trace everything. Log the assembled context for every call (or a hash plus the segment manifest if full logging is too heavy). When an agent fails, the assembled window at the failing step is the primary evidence. Without it you are debugging blind.
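
The lighter-weight option, a hash plus a segment manifest, could look like this. The manifest shape is an assumption; the sketch uses only `hashlib` and `json` from the standard library, and it records per-segment character counts rather than token counts to stay tokenizer-independent.

```python
import hashlib
import json


def context_manifest(segments: list[tuple[str, str]]) -> dict:
    """Build a loggable manifest from (layer_name, text) pairs in assembly order."""
    entries = [
        {
            "layer": layer,
            "chars": len(text),
            "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        }
        for layer, text in segments
    ]
    # Hash the ordered manifest so reordering or editing any segment changes the ID.
    serialized = json.dumps(entries, sort_keys=True).encode("utf-8")
    return {"context_hash": hashlib.sha256(serialized).hexdigest(), "segments": entries}
```

Two calls with identical manifests saw identical context, so when a failure reproduces only sometimes, diffing the per-segment hashes shows which layer changed between the good and bad runs.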

A practical way to operationalize these metrics is a per-session context report: total tokens over time, the layer breakdown at the peak, the number of compaction events and what each one summarized away, the cache hit rate trend, and the count of retrieval round-trips. Run that report on a sample of real sessions weekly. The patterns it surfaces (a retriever that quietly doubled its budget after a config change, a compaction policy that fires too late and lets the window hit the rot zone, a prefix mutation that tanked cache hits) are the kind of slow regressions that never trip a single-call test but steadily raise cost and lower quality across a fleet of agents. Context engineering, like performance engineering, is mostly won in this kind of continuous measurement rather than in any one clever trick.

Practical recommendations and checklist

Pull the catalog into a short operating discipline:

  • Order stable-to-volatile. Fixed system and task layers first, latest content last. Never inject volatile data into the prefix.
  • Gate retrieval, do not dump it. Rerank, threshold, deduplicate, and budget. Three relevant chunks beat ten noisy ones.
  • Evict tool exhaust. Summarize or extract tool results at the turn boundary; keep a note, drop the payload.
  • Compact on a threshold, with a pinning policy. Set the trigger mechanically; design the policy deliberately. Pin requirements and decisions.
  • Tier your memory. Keep only the active tier in context; treat episodic and long-term memory as retrievable, not resident. Add TTL or versioning.
  • Use a structured scratchpad. Give the agent a schema’d place to externalize plans and findings so you can compact around it.
  • Isolate sub-agents. Separable work gets its own clean window; return compact results, not exhaust.
  • Blend just-in-time and pre-loaded. Pre-load the small always-relevant core; fetch the rest on demand.
  • Make assembly deterministic and budgeted. One pure function, per-layer budgets, a hard total assertion before every call.
  • Measure context quality, tokens, cache hit rate, and cost-per-task. Trace the assembled window so failures are debuggable.

Start with deterministic assembly and per-layer budgets; they make every other pattern observable. Add compaction and tool-result eviction next, since they prevent the most common form of slow degradation. Layer in tiered memory and sub-agent isolation as the agent’s scope grows.

FAQ

What is the difference between context engineering and prompt engineering?
Prompt engineering optimizes a largely static instruction string for a single call. Context engineering manages the entire, changing set of tokens in the window across a multi-turn, multi-tool agent run: what gets retrieved, pruned, compacted, remembered, and in what order. Prompt wording is now one layer inside a larger context pipeline.

Do larger context windows make context engineering unnecessary?
No. Larger windows raise the ceiling but do not remove the problems. Quality still degrades as windows fill (context rot), important facts still get lost in the middle, and every extra token still costs money and latency on every call. A big window is a budget to defend, not a reason to stop defending it.

When should an agent compact its context?
When token count crosses a threshold set below the model’s hard limit (a common default is around three-quarters of the limit), compact before the next inference. Pin the system layer, task goal, and recorded decisions; summarize old tool results and resolved sub-threads, persisting the detail to an external store with a retrievable pointer.

What causes the “lost in the middle” problem?
Models attend more reliably to the start and end of a long context than to the middle, so facts placed deep in a long window are recalled less accurately. It is a documented long-context behavior. Mitigate by placing critical material at the edges and by keeping the window short enough that there is little fragile middle.

How do I stop irrelevant retrieved chunks from hurting answers?
Gate retrieval. After search, rerank candidates and apply a relevance-score threshold, dropping anything below it even if that shrinks the injected set. Deduplicate near-identical passages and enforce a retrieval token budget. Plausible-but-irrelevant chunks (distractors) actively degrade reasoning, so fewer high-relevance chunks beat a larger noisy set.

How does context layering affect cost?
Prompt and KV caching reward a stable prefix: identical leading tokens let the provider reuse computed state, lowering cost and latency. Layering stable content first and appending volatile content last keeps the prefix cacheable. Mutating early context invalidates the cache and forces recomputation, so layering is a cost-control pattern as much as a quality one.
