AI Agent Trajectory Evaluation: 2026 Patterns

Your agent passed every task-success check and still shipped a regression nobody caught. It reached the right final state, but it called a deletion tool three times, retried a paginated API in a loop, and burned four extra model turns getting there. AI agent trajectory evaluation exists because the final answer is only half the signal: in 2026, the path an agent takes is where cost, safety, and latency regressions actually hide. As agents move from demos into governed production systems, teams are discovering that a green outcome dashboard can sit on top of a quietly degrading decision process. This post gives you a reusable pattern to measure both.

What this covers: the outcome-versus-trajectory split, reference-based and reference-free step scoring, exact-match versus LLM-as-judge trade-offs, the well-documented judge biases, how to capture trajectories with OpenTelemetry, building golden sets, CI regression gates, and a harness you can copy.

Context and Background

Single-shot LLM evaluation is close to a solved problem. You have a prompt, a response, and a scoring function: exact match, BLEU-style overlap, or an LLM grader. Multi-step agents broke that model. An agent plans, calls tools, reads results, replans, and recovers from errors across many turns. The unit of work is no longer a string; it is a trajectory: an ordered sequence of thoughts, tool calls, arguments, observations, and the final state they produce.

Most teams started where the tooling was easiest — checking only the final outcome. Frameworks like LangGraph, OpenAI’s Agents SDK, and Google’s ADK all expose run results you can assert against, and the temptation is to stop there. But outcome-only evaluation is blind to how the agent got the answer. Two runs can both succeed while one is twice as expensive and one step away from a destructive mistake. The deeper analysis of this gap appears in our companion piece on the LLM-as-judge evaluation pipeline, which this post extends from single responses to multi-step paths.

The industry response in 2025–2026 has two threads. First, observability standardized: the OpenTelemetry GenAI semantic conventions defined spans and attributes for model calls and agent steps, so trajectories became capturable as structured traces rather than ad-hoc logs. Second, the research community formalized step-level grading. Work such as Google’s Trajectory Evaluation research and the paper Agent-as-a-Judge: Evaluate Agents with Agents argued that judging intermediate reasoning, not just outputs, is necessary for agentic systems. Those two threads, standard capture plus principled scoring, are the foundation of the pattern below.

The framing this post adopts is that AI agent trajectory evaluation is not a new tool you buy but a discipline you assemble from observability you already run and scoring functions you already understand. Most existing write-ups treat it as either a pure LLM-as-judge problem or a pure metrics problem. It is both, layered, and the layering is the whole craft: deterministic checks where the world is deterministic, judged checks where it is not, and a hard gate that forces the two to agree before code ships.

The Reference Harness: Outcome Plus Trajectory, Gated in CI

The thesis is this: outcome and trajectory are different measurements with different failure modes, and a serious agent eval harness must run both through one pipeline that ends in a hard CI gate. Outcome evaluation answers “did it work?” Trajectory evaluation answers “would I trust how it worked, at scale, on a bad day?” You need both numbers, scored independently, then aggregated against thresholds that block a merge.

AI agent trajectory evaluation is the practice of scoring the full ordered sequence of an agent’s steps — its tool selections, arguments, intermediate observations, and error recovery — rather than only its final answer. It measures whether the path was correct, efficient, and safe, capturing cost and risk regressions that outcome-only checks miss entirely.

Figure 1: The reference harness — instrument the agent, capture spans, score outcome and trajectory against a golden set, aggregate to metrics, and gate the pipeline.

The harness has five stages. The agent under test is instrumented to emit OpenTelemetry GenAI spans for every model call and tool invocation. Those spans land in a trajectory store. Two scorers read the same trace: an outcome scorer that checks the final state, and a trajectory scorer that grades the path. Both reference a golden trajectory set for the task. An aggregator rolls per-run scores into metrics, compares them to thresholds, and emits a pass/fail verdict plus a regression report. The gate is the point: evaluation that does not block anything is decoration.

Outcome scoring is assertions, not vibes

Outcome scoring should be as deterministic as you can make it. For a database agent, assert the final row count and the specific records. For a coding agent, run the test suite the change was supposed to fix. For a research agent, check that required facts appear in the output. The output of outcome scoring is ideally a binary task-success signal per run, aggregated into a task success rate across the suite. Resist the urge to use an LLM judge here when a real assertion exists — a SELECT query is cheaper and never drifts.

The trap is “soft” outcome checks that quietly become judge calls. If you find yourself asking an LLM “did the agent answer correctly?” for a task where you could instead assert a concrete fact, you have traded a deterministic signal for a noisy one to save twenty minutes of test authoring. Pay the twenty minutes. Reserve judged outcomes for genuinely open-ended deliverables — a written summary, a plan, a generated document — where no single assertion captures correctness. Even there, anchor the judge with a rubric and a reference answer rather than asking for a bare thumbs-up, because the same biases that plague trajectory judging plague outcome judging.
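As a minimal sketch of what a deterministic outcome check can look like for a database-backed task, the function below asserts the final state with a plain query and returns a success bit. The `refunds` table, its columns, and the function name are assumptions made for illustration, not part of any particular agent framework.

```python
import sqlite3


def assert_refund_outcome(conn: sqlite3.Connection, order_id: str,
                          expected_amount: float) -> int:
    """Return 1 if exactly one refund of the expected amount exists, else 0."""
    rows = conn.execute(
        "SELECT amount FROM refunds WHERE order_id = ?", (order_id,)
    ).fetchall()
    # Exactly one row: a missing refund and a double refund both fail.
    return int(len(rows) == 1 and abs(rows[0][0] - expected_amount) < 0.005)


# Usage: an in-memory database stands in for the state the agent left behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE refunds (order_id TEXT, amount REAL)")
conn.execute("INSERT INTO refunds VALUES ('A-100', 25.0)")
outcome = assert_refund_outcome(conn, "A-100", 25.0)
```

Requiring exactly one matching row is a deliberate choice here: it turns a duplicated refund into an outcome failure rather than leaving it for the trajectory scorer alone.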

Trajectory scoring grades the path

Trajectory scoring is where the interesting work lives. You are evaluating four largely separable properties of the path: tool selection (did it pick the right tool), argument correctness (were the arguments well-formed and right), redundancy (did it repeat or loop unnecessarily), and recovery (when a tool failed, did it adapt instead of flailing). Each maps to a sub-score, and together they form a path-quality score that is independent of whether the final answer happened to be correct.

Keep those four dimensions separate in storage, not just in scoring. When a regression fires, you want to know which dimension moved. A drop in tool-selection accuracy points at a prompt or tool-description change; a drop in argument correctness usually points at a schema or formatting issue; a redundancy spike means the agent is looping; a recovery decline means it stopped adapting to errors. A single blended path-quality number hides all of that, and the whole reason to invest in LLM agent trajectory eval is to get diagnostic signal a final-answer check cannot give you. Treat the four sub-scores as first-class metrics with their own thresholds.

Figure 2: Outcome eval reduces the final state to a success bit; trajectory eval decomposes the step sequence into tool selection, argument correctness, redundancy, and recovery, then combines both into one verdict.

The two halves can disagree, and the disagreements are the most valuable output of the whole system. A run that succeeds with a poor trajectory is a latent regression or a future incident. A run that fails with a good trajectory is often an environment or tool problem, not an agent problem — and you debug it completely differently. Collapsing both into a single number throws away exactly the signal you built the harness to capture.

Consider a concrete disagreement. A support agent is asked to refund an order. It succeeds: the refund posts and the outcome scorer returns 1. But the trajectory shows it called lookup_customer four times with slightly different arguments before finding the record, then issued the refund without the policy check the golden path requires. Outcome is green; trajectory is alarming. Two weeks later that missing policy check becomes an incident, and the only reason you can see it coming is that the harness captured the path, not just the result. This is the entire argument for trajectory eval in one example: the failure was always there, and outcome-only scoring was structurally incapable of surfacing it.
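One way to keep these disagreements visible is to label each run by its quadrant instead of blending the two scores. The sketch below is illustrative only: the label names and the 0.8 path-quality threshold are assumptions you would tune for your own suite.

```python
def classify_run(outcome: int, path_quality: float,
                 path_threshold: float = 0.8) -> str:
    """Label a run by how its outcome and trajectory scores agree or disagree."""
    good_path = path_quality >= path_threshold
    if outcome == 1 and good_path:
        return "healthy"
    if outcome == 1:
        # Succeeded, but by a path you would not trust at scale.
        return "latent_regression"
    if good_path:
        # Sensible path, wrong result: look at tools and environment first.
        return "suspect_environment"
    return "agent_failure"


# Usage: the refund run above succeeds with a poor path.
label = classify_run(outcome=1, path_quality=0.45)
```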

Reference-based versus reference-free scoring

There are two families of trajectory scorers. Reference-based scoring compares the observed path to a golden trajectory: the sequence of tool calls a competent operator would make. This gives you crisp, cheap metrics like tool-call F1 (the harmonic mean of precision over the observed tool calls and recall over the expected ones) and step accuracy. It is the gold standard when you can author golden paths, but it is brittle when many valid paths exist. Reference-free scoring uses a rubric to grade the path on its own merits, no golden reference required. It tolerates path diversity but is more expensive and noisier because it leans on an LLM judge. Mature harnesses use reference-based scoring where golden paths are stable and reference-free rubric scoring everywhere else.
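To make the tool-call F1 metric concrete, here is a small set-based sketch. It assumes tool calls are compared by name only; the function name and the convention of returning 1.0 when both sides are empty are choices made for this example.

```python
def tool_call_f1(observed: list[str], expected: list[str]) -> float:
    """Set-based F1 between observed and expected tool names."""
    obs, exp = set(observed), set(expected)
    if not obs and not exp:
        return 1.0  # nothing expected, nothing called
    hits = len(obs & exp)
    if hits == 0:
        return 0.0
    precision = hits / len(obs)  # share of observed tools that were expected
    recall = hits / len(exp)     # share of expected tools that were observed
    return 2 * precision * recall / (precision + recall)


# Usage: the agent skipped the policy check and repeated a lookup.
observed = ["lookup_customer", "lookup_customer", "issue_refund"]
expected = ["lookup_customer", "check_policy", "issue_refund"]
score = tool_call_f1(observed, expected)  # precision 1.0, recall 2/3
```

Because the comparison is over sets, the repeated `lookup_customer` call does not lower this score. That is intentional in this sketch: repetition is the redundancy metric's job, and keeping the two apart preserves the diagnostic split.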

The choice is not binary per harness; it is per task. A “look up an order and issue a refund” task has essentially one correct tool sequence, so reference-based scoring is exact and cheap there. A “research this vendor and summarize the risk” task has dozens of valid paths, so forcing it onto a golden reference produces false regressions every time the agent picks a different-but-fine route. The practical rule: author golden paths for tasks with low path entropy and grade the high-entropy tasks with rubrics. Tag each task in the suite with its scoring family so the aggregator knows which metrics are even meaningful for it. Mixing tool-call F1 across both families and averaging gives you a number that means nothing.
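A sketch of how the family tag can keep the aggregator honest follows. The family names, the metric names, and the result dictionary shape are all assumptions for illustration; the idea is only that a metric is averaged over the tasks where it is meaningful and reported as absent otherwise.

```python
from statistics import mean
from typing import Optional

# Which metrics are meaningful for which scoring family (illustrative).
MEANINGFUL = {
    "reference_based": {"tool_f1", "step_accuracy", "arg_acc"},
    "reference_free": {"rubric_score"},
}


def aggregate(results: list[dict], metric: str) -> Optional[float]:
    """Average a metric only over tasks whose family makes it meaningful."""
    values = [
        r["scores"][metric]
        for r in results
        if metric in MEANINGFUL[r["family"]] and metric in r["scores"]
    ]
    return mean(values) if values else None


# Usage: the vendor-research task carries a stray tool_f1 that must be ignored.
results = [
    {"task": "refund", "family": "reference_based", "scores": {"tool_f1": 0.8}},
    {"task": "vendor_risk", "family": "reference_free",
     "scores": {"tool_f1": 0.2, "rubric_score": 4.0}},
]
suite_tool_f1 = aggregate(results, "tool_f1")
```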

There is also a hybrid worth knowing. You can use reference-based scoring on the subset of tool calls that must always appear — a mandatory authorization check, a required logging call — while grading the surrounding path with a rubric. This catches the safety-critical steps deterministically without forcing the entire trajectory onto a rigid template. Many production agent evaluation patterns settle on exactly this split: hard assertions on the steps that matter for compliance, soft rubric scoring on the rest.
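The deterministic half of that hybrid can be very small. The sketch below checks only that a guarded tool is never called before its prerequisite; the tool names and the `required_before` mapping are hypothetical, and the surrounding path would still go to a rubric.

```python
def mandatory_steps_ok(observed: list[str],
                       required_before: dict[str, str]) -> bool:
    """True if every call to a guarded tool is preceded by its prerequisite."""
    for guarded, prerequisite in required_before.items():
        for i, tool in enumerate(observed):
            if tool == guarded and prerequisite not in observed[:i]:
                return False
    return True


# Usage: a refund must never be issued before the policy check.
required = {"issue_refund": "check_policy"}
bad_path = ["lookup_customer", "issue_refund"]
good_path = ["lookup_customer", "check_policy", "issue_refund"]
bad_ok = mandatory_steps_ok(bad_path, required)
good_ok = mandatory_steps_ok(good_path, required)
```

A path that never calls the guarded tool passes this check; whether the guarded tool should have been called at all is a question for the outcome assertion.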

Capturing, Scoring, and Judging Steps

Capture comes before scoring. You cannot run AI agent trajectory evaluation on a trajectory you did not record faithfully, and 2026’s answer is to record it as OpenTelemetry spans. Each model call becomes a span carrying gen_ai.operation.name, the request model, token counts, and latency; each tool call becomes a child span with the tool name and arguments. The result is a trace you can replay deterministically into the scorers. This is the same telemetry backbone described in our guide to long-running governed AI agents, so trajectory eval reuses production observability rather than bolting on a parallel logging path.

Capturing trajectories as standard spans buys you three things beyond evaluation. First, the same traces drive production debugging, so the eval harness and the on-call dashboard read the same data. Second, you can run AI agent trajectory evaluation offline by replaying recorded production traces against new agent versions — a powerful regression check that needs no live tool calls. Third, the semantic conventions are vendor-neutral, so you are not locked to one tracing backend. The one discipline that matters: record arguments and observations as span attributes or events, not just names. A span that says “called search_orders” without the arguments is useless for argument-correctness scoring, and argument correctness is where a surprising share of agent failures actually live.
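To show why arguments and observations need to be on the span, here is a sketch that turns exported spans into scorer-ready steps. It assumes spans have already been exported as plain dictionaries with `start_time` and `attributes` keys, which is a simplification. `gen_ai.operation.name`, `gen_ai.tool.name`, and `gen_ai.request.model` follow the GenAI semantic conventions as I understand them, so check them against the version you run. The `app.tool.arguments` and `app.tool.observation` keys are hypothetical placeholders for wherever your instrumentation records those values.

```python
import json
from dataclasses import dataclass


@dataclass
class Step:
    kind: str          # "model" or "tool"
    name: str          # model name or tool name
    args: dict         # tool arguments, empty for model calls
    observation: str   # tool result, empty for model calls


def extract_steps(spans: list[dict]) -> list[Step]:
    """Order exported spans by start time and map them to steps."""
    steps = []
    for span in sorted(spans, key=lambda s: s["start_time"]):
        attrs = span["attributes"]
        if attrs.get("gen_ai.operation.name") == "execute_tool":
            steps.append(Step(
                kind="tool",
                name=attrs.get("gen_ai.tool.name", ""),
                # Hypothetical keys: adapt to your own instrumentation.
                args=json.loads(attrs.get("app.tool.arguments", "{}")),
                observation=attrs.get("app.tool.observation", ""),
            ))
        else:
            steps.append(Step("model", attrs.get("gen_ai.request.model", ""), {}, ""))
    return steps


# Usage: spans arrive out of order and are replayed in start-time order.
spans = [
    {"start_time": 2, "attributes": {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": "search_orders",
        "app.tool.arguments": '{"customer_id": "C-7"}',
        "app.tool.observation": "2 orders"}},
    {"start_time": 1, "attributes": {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": "example-model"}},
]
steps = extract_steps(spans)
```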

Once you have steps, decide per step how to score them. The pipeline below routes each step to the cheapest defensible scorer.

Figure 3: The step-level judge pipeline. Deterministically checkable steps go to exact match or schema validation; the rest go to a rubric-grounded LLM judge with position and verbosity controls, then get calibrated against human labels before aggregation.

Exact match where you can, judge where you must

For a tool call, you often can check correctness deterministically. Did the agent call search_orders when the golden path expected search_orders? Did the arguments validate against the tool schema and match expected values within tolerance? That is exact-match or schema-check territory: fast, free, and stable. Reach for an LLM judge only when the step’s quality is genuinely subjective — was this clarifying question reasonable, was this summarization faithful, was the replan sensible given the error. The discipline is to push as many steps as possible onto deterministic checks and reserve judging for the irreducibly fuzzy ones.

Argument matching deserves nuance. Strict equality is too brittle — a timestamp, a generated ID, or a paraphrased query will differ run to run without being wrong. Define per-argument comparators: exact for enums and IDs that must match, set-membership for categories, numeric tolerance for thresholds, and semantic similarity (itself a small judged check) only for free-text arguments. This keeps argument correctness deterministic where the world is deterministic and judged only where text genuinely varies. The same comparator config doubles as documentation of what “correct arguments” even means for each tool, which is useful when a new teammate asks why a run failed.
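A comparator config of that kind might look like the sketch below. The tool arguments (`customer_id`, `status`, `min_total`) and the comparator helper names are invented for the example, and the semantic-similarity comparator for free text is left out because it needs a judged check.

```python
import math
from typing import Any, Callable

Comparator = Callable[[Any, Any], bool]


def exact(observed: Any, expected: Any) -> bool:
    return observed == expected


def one_of(allowed: set) -> Comparator:
    # Set membership: any value in the allowed category passes.
    return lambda observed, _expected: observed in allowed


def within(tolerance: float) -> Comparator:
    return lambda observed, expected: math.isclose(
        observed, expected, abs_tol=tolerance)


# Hypothetical comparator config for a search_orders tool.
SEARCH_ORDERS = {
    "customer_id": exact,
    "status": one_of({"open", "shipped"}),
    "min_total": within(0.01),
}


def args_match(observed: dict, expected: dict,
               comparators: dict[str, Comparator]) -> bool:
    """Compare only configured arguments; unlisted ones (e.g. IDs) are ignored."""
    for name, compare in comparators.items():
        if name not in observed or name not in expected:
            return False
        if not compare(observed[name], expected[name]):
            return False
    return True


# Usage: a generated request_id differs run to run and is not compared.
expected = {"customer_id": "C-7", "status": "open", "min_total": 100.0}
observed = {"customer_id": "C-7", "status": "shipped",
            "min_total": 100.005, "request_id": "r-123"}
ok = args_match(observed, expected, SEARCH_ORDERS)
```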

Here is a compact trajectory scorer that encodes the routing and the outcome/trajectory split:

from statistics import mean

# extract_steps, f1, align, args_match, count_redundant, weighted_mean, and
# assert_final_state are harness helpers assumed to be defined elsewhere.


def score_trajectory(trace, golden, judge, rubric):
    steps = extract_steps(trace)            # ordered steps from OTel spans
    expected = golden.tool_calls            # list of (tool, args)
    observed = [(s.tool, s.args) for s in steps if s.is_tool_call]

    # Reference-based, deterministic metrics
    tool_f1 = f1({t for t, _ in observed}, {t for t, _ in expected})
    pairs = list(align(observed, expected))
    # Schema + value check per aligned pair; no aligned pairs scores 0.0
    arg_acc = (sum(bool(args_match(o, e)) for o, e in pairs) / len(pairs)
               if pairs else 0.0)
    redundancy = count_redundant(observed) / max(len(observed), 1)

    # Reference-free, judged steps (only the fuzzy ones)
    judged = [judge.score(step=s, rubric=rubric,
                          anchors=rubric.few_shot,
                          controls=["shuffle_positions",
                                    "length_normalize"])
              for s in steps if not s.deterministic]
    # With nothing to judge, recovery defaults to a perfect score
    recovery = mean(j.recovery for j in judged) if judged else 1.0

    path_quality = weighted_mean({
        "tool_f1": tool_f1, "arg_acc": arg_acc,
        "redundancy": 1 - redundancy, "recovery": recovery,
    }, weights=rubric.weights)

    outcome = assert_final_state(trace, golden.assertions)  # 0 or 1
    # Return all four sub-scores so a regression can be traced to a dimension
    return {"outcome": outcome, "path_quality": path_quality,
            "tool_f1": tool_f1, "arg_acc": arg_acc,
            "redundancy": redundancy, "recovery": recovery,
            "turns": len(steps), "cost": trace.total_cost}

The shape matters more than the syntax: deterministic metrics computed directly, fuzzy steps routed to a grounded judge, and outcome kept as a separate field so the aggregator can detect outcome/trajectory disagreement.

A note on where this lives. Many teams reach for a managed eval product, and several good ones exist; the point of describing the pattern rather than naming a vendor is that an agent eval harness worth trusting in 2026 is mostly a set of contracts (capture format, scorer interface, gate semantics), not a specific SaaS. Whether you assemble it from open tracing plus a few hundred lines of scoring code or buy a platform, the harness must let you swap judges, version golden sets, and inspect any failing trajectory down to the span. If a tool hides the raw trace, you cannot debug multi-step agent failures, and you will end up reverse-engineering scores from a dashboard. Own the trajectory data even if you rent the scoring UI.

Aggregation is its own design decision. Per-run, this scorer emits a dictionary; across runs and across the suite you must decide how to combine. Do not average outcome and path quality into one headline number — keep task success rate and median path quality as separate gates, and add explicit gates on cost and turn count. A clean AI agent trajectory evaluation report shows, per task family: success rate, the four trajectory sub-scores, p50 and p95 turn count, and cost per task, each with a delta against the previous baseline. The gate fails if any guarded metric regresses beyond its threshold. This is what turns scoring into an enforceable contract rather than a chart.
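The gate described above can be sketched in a few lines. The metric names, tolerances, and run dictionary shape are assumptions for illustration; p95 turn count and the per-dimension sub-scores are omitted to keep the example short but would be gated the same way.

```python
from statistics import median

# metric -> (direction that counts as better, maximum allowed regression)
GATES = {
    "success_rate": ("higher", 0.02),
    "path_quality_p50": ("higher", 0.05),
    "turns_p50": ("lower", 1.0),
    "cost_mean": ("lower", 0.10),
}


def summarize(runs: list[dict]) -> dict:
    """Roll per-run scorer output into suite-level metrics, kept separate."""
    return {
        "success_rate": sum(r["outcome"] for r in runs) / len(runs),
        "path_quality_p50": median(r["path_quality"] for r in runs),
        "turns_p50": median(r["turns"] for r in runs),
        "cost_mean": sum(r["cost"] for r in runs) / len(runs),
    }


def gate(current: dict, baseline: dict, gates: dict = GATES) -> list[str]:
    """Return one message per guarded metric that regressed past tolerance."""
    failures = []
    for metric, (direction, tolerance) in gates.items():
        delta = current[metric] - baseline[metric]
        regression = -delta if direction == "higher" else delta
        if regression > tolerance:
            failures.append(
                f"{metric}: {baseline[metric]:.3f} -> {current[metric]:.3f}")
    return failures


# Usage: every task succeeds, yet the gate still fails on path, turns, and cost.
baseline = {"success_rate": 0.95, "path_quality_p50": 0.90,
            "turns_p50": 6, "cost_mean": 0.40}
runs = [
    {"outcome": 1, "path_quality": 0.70, "turns": 9, "cost": 0.60},
    {"outcome": 1, "path_quality": 0.80, "turns": 8, "cost": 0.50},
]
failures = gate(summarize(runs), baseline)
```

In a CI job, a non-empty `failures` list would translate into a non-zero exit code so the merge is blocked, with the messages forming the regression report.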

A metric reference for the harness

Before grounding the judge, it helps to fix the metric vocabulary the aggregator will gate on. The table below maps each metric to what it measures, how it is computed, and the failure it catches.

Metric	Measures	How computed	Catches
Task success rate	Outcome correctness	Deterministic assertion, share passing	Wrong final result
Tool-call F1	Right tools used	Precision and recall vs golden tool set	Missing or extra tool calls
Step accuracy	Per-step correctness	Matched steps over total steps	Subtle wrong-step drift
Argument correctness	Well-formed inputs	Per-argument comparators	Malformed or wrong arguments
Redundancy	Wasted work	Redundant calls over total calls	Loops and repeated calls
Recovery score	Error adaptation	Judged on failed-call handling	Flailing after tool errors
Turn count	Path length	Count of agent steps	Inefficient long paths
Cost per task	Spend	Summed token and tool cost	Quietly rising spend

Each row is a candidate gate. You will not gate on all of them for every task family — but naming them all forces an explicit decision about which ones matter where, and that decision is the substance of a real evaluation strategy rather than a default that happened to ship.
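As one possible reading of the redundancy row, the sketch below counts a call as redundant when the same tool has already been called with identical arguments. That definition is an assumption: it will not flag near-duplicate calls with slightly different arguments, so a harness worried about that pattern might key on the tool name alone or add a similarity check.

```python
import json


def count_redundant(calls: list[tuple[str, dict]]) -> int:
    """Count calls that exactly repeat an earlier (tool, arguments) pair."""
    seen = set()
    redundant = 0
    for tool, args in calls:
        # Canonical JSON makes argument dicts hashable and order-insensitive.
        key = (tool, json.dumps(args, sort_keys=True))
        if key in seen:
            redundant += 1
        else:
            seen.add(key)
    return redundant


def redundancy(calls: list[tuple[str, dict]]) -> float:
    """Redundant calls over total calls, 0.0 for an empty trajectory."""
    return count_redundant(calls) / max(len(calls), 1)


# Usage: one of three calls is an exact repeat.
calls = [
    ("search_orders", {"customer_id": "C-7", "status": "open"}),
    ("search_orders", {"status": "open", "customer_id": "C-7"}),
    ("issue_refund", {"order_id": "A-100"}),
]
ratio = redundancy(calls)
```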

Grounding the judge with rubrics and anchors

An ungrounded LLM judge is a random number generator with good manners. Grounding means three things: an explicit rubric with score definitions, few-shot anchors showing a 1, a 3, and a 5 example, and a forced rationale before the score. A workable rubric for the recovery dimension:

RECOVERY (1-5): When a tool call failed or returned an error,
did the agent adapt sensibly?
5 = Diagnosed the error, chose a correct alternative, made progress.
3 = Retried the same call once, then adapted.
1 = Looped on the failing call or ignored the error and proceeded.
Output: {"rationale": "...", "score": <int>}
Anchors: [error->reauth->retry = 5] [retry x3 same args = 1]

The rationale-first ordering is not cosmetic: it forces the judge to commit to reasoning before a number, which in practice tends to reduce arbitrary scores.
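If you rely on that ordering, it is worth enforcing when you parse the judge's reply. The sketch below assumes the judge returns the JSON object from the rubric above as a string; the function name and the decision to reject malformed output outright, rather than repair it, are choices made for this example.

```python
import json


def parse_judge_output(raw: str) -> tuple[str, int]:
    """Validate a judge reply: rationale first, then an integer score in 1-5."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("judge output must be a JSON object")
    # json.loads preserves key order, so the ordering can be checked directly.
    if list(data)[:2] != ["rationale", "score"]:
        raise ValueError("expected 'rationale' before 'score'")
    rationale, score = data["rationale"], data["score"]
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("rationale must be a non-empty string")
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError("score must be an integer from 1 to 5")
    return rationale, score


# Usage: a well-formed reply for the recovery rubric.
reply = '{"rationale": "Re-authenticated after the 401, then retried.", "score": 5}'
rationale, score = parse_judge_output(reply)
```

Rejected replies can be retried or counted as judge failures; either way they stay out of the recovery average.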

Few-shot anchors do heavy lifting here. A rubric in prose tells the judge what 5 and 1 mean in words; an anchor shows it with a real trajectory snippet and its assigned score. Without anchors, two judge models — or the same model two provider versions apart — will interpret “made sensible progress” differently and your scores will not be comparable over time. With three or four anchors spanning the score range, you pin the scale. Authoring good anchors is the most underra
