Agentic RAG Architecture Patterns: When Plain RAG Is Not Enough

You’ve built a retrieval-augmented generation system. It works beautifully when questions align with your corpus structure—single-hop lookups, straightforward document matching. Then comes the query that breaks it: “Find all unresolved manufacturing defects reported in Q1 across three suppliers, correlate them with equipment downtime in our logs, and explain the root cause with citations.”

Your vanilla RAG system retrieves three document chunks, hallucinates connections between them, and produces a confident answer. The citations don’t trace back correctly. The aggregation is incomplete. The temporal reasoning is missing.

Vanilla RAG is a one-shot function. Query in, ranked documents back, answer out. No feedback loop. No way to route queries to specialized retrievers. No way to verify that retrieved evidence actually supports the conclusion. When queries demand multi-step reasoning, domain-aware routing, or explicit validation, vanilla RAG fails predictably. Agentic RAG adds an internal control loop—planning, routing, graph traversal, or self-critique—to decompose queries, route them intelligently, walk relationships, or verify evidence.

This post examines four concrete patterns: planner-retriever, router/classifier, graph-RAG agent, and reflective RAG. Each solves different failure modes. None is universally best. Picking wrong drowns you in latency or token costs. Picking right means retrieving with surgical precision.


TL;DR

Pattern | When It Wins | Cost | Latency | Failure Risk
Vanilla RAG | Single-hop, single-domain queries | Low | <200ms | Multi-hop, temporal, reasoning failures
Planner-Retriever | Queries decomposable into steps | Medium (2–4 LLM calls) | 500ms–2s | Planner hallucination, error propagation
Router/Classifier | Multi-domain, query-type routing | Low | <300ms | Misclassification, no fallback
Graph-RAG Agent | Relationship-heavy (supply chain, incidents) | High (extraction + iteration) | 1–5s | Graph brittleness, combinatorial fan-out
Reflective RAG | High-stakes, hallucination-sensitive | High (2x LLM calls) | 1–2s | Self-rationalization, double latency

Terminology Primer

Before diving into patterns, let’s establish shared language. If you’ve worked with RAG before, skim this section. Otherwise, read closely.

Retriever: A component that returns documents or chunks matching a query. It can be vector-based (semantic similarity via embeddings), keyword-based (BM25, full-text search), structured (SQL over metadata), or graph-based (KG traversal). A retriever is stateless: same query, same results, every time.

Generator: An LLM that takes retrieved context and produces an answer. It does not itself retrieve or refine; it consumes what the retriever gives it. In vanilla RAG, the generator is the only “intelligent” component; retrieval is mechanical.

Agent / Agentic Loop: A system where an LLM makes decisions (not just generate), executes tools (retrieve, filter, rank), observes results, and loops. An agent has state within a query: it can issue multiple retrievals, decide which data matters, and adjust strategy based on what it sees.

Tool-use: An agent’s ability to call external functions (retrieve, query a database, filter results) and receive structured outputs. The LLM decides when and how to invoke tools based on the query and previous results.

Multi-hop reasoning: Answering a question that requires facts spread across multiple documents and logical connections. Example: “Which supplier’s equipment has the longest mean-time-to-repair (MTTR), and is that supplier also the one with the highest defect rate?” You need two retrievals (MTTR data, defect data) plus logical comparison.

Planner / Decomposition: An LLM that breaks a complex query into a sequence of simpler sub-queries. Example: “Find all Q1 defects” → “List Q1 defects from Supplier A” + “List Q1 defects from Supplier B” + “List Q1 defects from Supplier C” → “Correlate with downtime logs”.

Knowledge Graph (KG): A structured representation of entities (equipment, suppliers, protocols) and their relationships (equipment sourced from supplier, protocol version 2.1 released in Q3). Graph-RAG agents traverse KGs to discover transitive connections. KGs are built offline, typically via entity extraction and relationship inference.

Subgraph / Neighborhood: A small subset of a KG retrieved in response to a query. Example: given entity “supplier X,” return all equipment sourced from X, all defects affecting that equipment, and all root-cause analyses mentioning those defects.

Confidence / Stopping Condition: In iterative patterns (graph-RAG, reflective RAG), a threshold or criterion signaling that retrieval and reasoning are sufficient. Example: “If three retrieved documents independently cite the same root cause, stop iterating.” Without explicit stopping, agentic loops can run indefinitely.


Pattern 0: Vanilla RAG vs. Agentic RAG (Top-Level Comparison)

Let me establish the baseline. Here’s what vanilla RAG and agentic RAG fundamentally do differently:

Architecture diagram 1

What’s happening here: Vanilla RAG embeds the query once, ranks documents once, and generates once. No feedback. Agentic RAG creates a loop: plan what to retrieve, retrieve, observe results, decide whether to loop or finalize. This loop is the core difference.

Setup: The distinction matters because vanilla RAG optimizes for latency and simplicity. It works when all the information you need is co-retrieved by a single dense ranking pass. Agentic RAG trades latency for precision and multi-step reasoning.

Why this matters: If 90% of your queries are single-hop, vanilla RAG is probably sufficient. If 40% require multi-hop reasoning, agentic patterns become justified. The cost is 2–5x latency increase; the benefit is 30–40% higher accuracy on complex queries.

Failure modes: Vanilla RAG fails on multi-hop, temporal, and reasoning-intensive queries. Agentic RAG fails on trivial queries (planner over-decomposes, router misclassifies) and when feedback signals are weak (reflective RAG talks itself into bad answers).


Pattern 1: Planner-Retriever (Decomposition-Driven Retrieval)

The planner-retriever pattern decouples query understanding from retrieval execution. An LLM-based planner reads the user query and outputs a linear sequence of simpler sub-queries, each with metadata (filters, temporal ranges, sort orders, constraints). A retriever engine then executes each sub-query in order, returning results that the final generator consumes.

Architecture diagram 2

Setup: The query “Find all unresolved defects in Q1 across suppliers A, B, C and correlate with equipment downtime” enters the planner. The planner outputs:

[
  {"subquery": "defects reported in Q1 from supplier A where status='unresolved'", "filter": "time_range=[2026-01-01, 2026-03-31]"},
  {"subquery": "defects reported in Q1 from supplier B where status='unresolved'", "filter": "time_range=[2026-01-01, 2026-03-31]"},
  {"subquery": "defects reported in Q1 from supplier C where status='unresolved'", "filter": "time_range=[2026-01-01, 2026-03-31]"},
  {"subquery": "equipment downtime logs in Q1", "filter": "equipment_source in [A,B,C]"}
]

The retriever then executes these four queries in parallel, collects results, and the generator synthesizes them into a single answer with multi-document citations.

What vanilla RAG gets wrong: A single query embedding and vector search cannot reliably separate “defects from supplier A” from “defects from supplier B” without explicit filtering. The retriever collapses all three suppliers into one ranked list. The generator sees a mess and hallucinates connections.

What planner-retriever fixes: By decomposing the query explicitly, you convert a hard multi-hop problem into multiple single-hop problems. Each sub-query is narrower, retrieval is more precise, and the aggregator guarantees completeness (did we hit all three suppliers? Yes, by design).

Cost: Each sub-query triggers one retrieval pass and (typically) one LLM call for reasoning on results. A complex query decomposing into 4–5 sub-queries means 4–5x retrieval latency and 2–3 additional LLM calls (planner + aggregator reasoning). Total latency: 500ms–2s depending on retriever speed.

Failure modes:
Planner hallucination: The planner invents sub-queries that don’t match your corpus. Example: “Find all defects by root-cause code XYZ-999” if your corpus never uses that taxonomy.
Error propagation: If sub-query 1 retrieves nothing, and sub-query 2 depends on results from sub-query 1, the entire chain fails. No feedback loop corrects the planner.
Over-decomposition: A planner can shatter a simple query into 10 sub-queries when 2 would suffice, ballooning latency and cost.

Alternatives rejected:
Manual query rewriting: Expensive, non-scalable, requires human annotations.
Few-shot prompting in vanilla RAG: Improves retrieval by ~10–15%, but doesn’t solve multi-hop. Planner-retriever solves it by design.

When to use: Complex queries that naturally decompose into steps. Examples: “Compare Q1 and Q2 performance” (decompose by quarter), “List all X by supplier, then filter by Y” (decompose by supplier, then filter), “For each of these five APIs, find latency and error rates” (decompose by API).

Implementation:
– Use Claude or GPT-4 for planning; reasoning quality matters. Cheaper models (Haiku, GPT-3.5) can work for well-defined domains.
– Output sub-queries as structured JSON with metadata (filters, sort order, constraints).
– Parallelize retrieval (all sub-queries fire at once) to hide latency.
– Aggregate results deterministically (e.g., union, deduplicate, sort by relevance) before feeding to generator.
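The execution side of these bullets can be sketched in a few lines. Everything here is illustrative: the in-memory corpus, the keyword-matching `retrieve` stub (standing in for a real vector store or structured query), and the document ids are all assumptions, not a real API.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Stub corpus keyed by supplier; a real retriever would hit a vector store
# or a structured database with the planner's filters applied.
CORPUS = {
    "supplier a": [{"id": "d1", "text": "Unresolved valve defect (Supplier A, Q1)."}],
    "supplier b": [{"id": "d2", "text": "Unresolved housing defect (Supplier B, Q1)."}],
    "supplier c": [],  # zero hits: surface emptiness rather than hallucinate
}

def retrieve(subquery: str) -> list:
    """Return documents for the supplier named in the sub-query (stub)."""
    for key, docs in CORPUS.items():
        if key in subquery.lower():
            return docs
    return []

def execute_plan(plan: list) -> list:
    """Fire all sub-queries in parallel, then deduplicate results by doc id."""
    with ThreadPoolExecutor() as pool:
        per_subquery = list(pool.map(lambda step: retrieve(step["subquery"]), plan))
    seen, merged = set(), []
    for docs in per_subquery:
        for doc in docs:
            if doc["id"] not in seen:
                seen.add(doc["id"])
                merged.append(doc)
    return merged

# The planner's structured-JSON output (shortened) drives execution directly.
plan = json.loads("""[
  {"subquery": "Q1 unresolved defects from Supplier A"},
  {"subquery": "Q1 unresolved defects from Supplier B"},
  {"subquery": "Q1 unresolved defects from Supplier C"}
]""")
print([doc["id"] for doc in execute_plan(plan)])  # ['d1', 'd2']
```

Note that Supplier C yields nothing and the aggregator reports that honestly; the generator then sees an explicit gap instead of a plausible-looking merged list.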


Pattern 2: Router / Query Classifier (Domain-Aware Retrieval Strategy Selection)

The router pattern routes each query to one of several specialized retrieval strategies based on query intent, domain, or metadata. A lightweight classifier examines the query and selects the appropriate retriever. Think of it as a switchboard: boolean queries → structured DB; semantic questions → vector store; time-series trends → API.

Architecture diagram 3

Setup: A query arrives: “What is the IP address of gateway-01?” The classifier (rule-based or lightweight LLM) detects metadata lookup and routes to the structured database. Another query: “Explain the relationship between process variance and downtime” routes to the vector store. A third: “Show me equipment failures over the past 90 days” routes to the timeseries API.

Each retriever is specialized. The structured DB is fast for point lookups. The vector store excels at semantic similarity. The timeseries API handles temporal aggregations that relational queries would struggle with. The KG traversal finds multi-hop relationships.

What vanilla RAG gets wrong: It treats all queries as semantic similarity problems. A boolean query like “Is equipment X certified for environment Y?” should hit a structured database in <50ms. Instead, it gets embedded, ranked against all documents, and the generator struggles to extract a yes/no answer from a wall of irrelevant semantic matches.

What router fixes: By routing queries to specialized retrievers, you match problem structure to solution structure. Boolean queries get deterministic answers. Semantic queries get dense similarity ranking. Temporal queries get aggregations. This specialization typically yields 20–40% accuracy improvement on multi-domain datasets.

Cost: Classification latency is negligible (<50ms typically). Overhead is minimal. Trade-off is engineering: you must maintain multiple retrievers and their integration points. That’s complexity, not cost.

Failure modes:
Misclassification: A query intended for structured data gets routed to the vector store. Result: no answer, or a vague semantic match when a deterministic lookup would have worked.
No fallback: If the chosen retriever returns nothing, there’s no recovery path. Some patterns add fallback routing (if structured DB fails, try vector store), but that’s manual engineering.
Fuzzy boundaries: Edge-case queries consistently misclassify. Example: “Has supplier X reduced defect rates?” Is this temporal (trending) or semantic (relationship)? Router misclassifies 30% of the time, requiring manual tuning.

Alternatives rejected:
Single vector store for everything: Simplest, but loses the precision of structured data and temporal queries.
Ensemble retrieval (query all, rank together): Accurate but expensive. Multiple retriever calls for every query. Slower than router classification.

When to use: Multi-domain systems where query types are distinct. Examples: a unified query interface over operational data (metadata lookups + sensor timeseries + documentation), a compliance Q&A system (exact rules + policy interpretation + historical precedent), a supply-chain analytics platform (inventory tables + semantic relationships + trend analysis).

Implementation:
– Classification can be rule-based (if query contains “how many” → count query) or LLM-based (heavier, more flexible).
– Keep classifiers simple and fast. Use a small model or hardcoded heuristics.
– Log misclassifications; retrain rules monthly.
– Add fallback routing or hybrid retrieval for edge cases.
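A minimal rule-based router following these bullets might look like this. The regex rules, backend names, and the vector-store-as-fallback choice are assumptions for illustration; each stub would wrap a real data source in production.

```python
import re

# Hypothetical backends; in production each wraps a real data source.
def structured_lookup(q: str) -> str: return "structured:" + q
def vector_search(q: str) -> str: return "vector:" + q
def timeseries_query(q: str) -> str: return "timeseries:" + q

# Ordered rules: first match wins. Anything unmatched falls through to the
# vector store, which doubles as the fallback path for fuzzy queries.
RULES = [
    (re.compile(r"\b(ip address|serial number|how many|certified)\b", re.I),
     structured_lookup),
    (re.compile(r"\b(past \d+ days|over the past|trend|last quarter)\b", re.I),
     timeseries_query),
]

def route(query: str) -> str:
    for pattern, retriever in RULES:
        if pattern.search(query):
            return retriever(query)
    return vector_search(query)

print(route("What is the IP address of gateway-01?"))                        # structured
print(route("Show me equipment failures over the past 90 days"))             # timeseries
print(route("Explain the relationship between process variance and downtime"))  # vector
```

Keeping the rules in an ordered list makes misclassification logging and monthly retuning trivial: log the query, the matched rule, and downstream success, then adjust the patterns.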


Pattern 3: Graph-RAG Agent (Relationship-Driven Iterative Retrieval)

Graph-RAG agents embed entities and relationships as a knowledge graph (KG), then use an agentic loop to traverse the graph and refine retrieval iteratively. When a query arrives, the agent extracts entities, retrieves KG neighborhoods (subgraphs), runs inference, and if confidence is low, expands the subgraph or reformulates the query.

Architecture diagram 4

Setup: A query arrives: “What are the root causes of defects in equipment sourced from Supplier X?” The agent:

  1. Extract entities: Supplier X, Defect (entity types).
  2. Retrieve subgraph: Fetch all equipment sourced from Supplier X, all defects affecting that equipment, all root-cause analyses mentioning those defects. The subgraph is a small slice of the full KG.
  3. Infer on subgraph: Run reasoning over the subgraph. Example: traverse edges to find common root causes across multiple defect instances.
  4. Check confidence: Are there at least three independent root-cause analyses pointing to the same cause? If yes, high confidence. If no, the subgraph is incomplete.
  5. Expand if needed: If confidence is low, expand the subgraph (e.g., include transitive relationships: suppliers of Supplier X, related equipment types) or reformulate the query.
  6. Loop or finalize: If iterations remain and confidence is still low, loop. Otherwise, generate the answer.

What vanilla RAG gets wrong: Relationship-heavy domains (supply chains, incident forensics, organizational hierarchies) require transitive reasoning. “What is the root cause of defects in equipment from Supplier X?” needs the system to know that X supplies Y, Y supplies Z, Z had a defect, and the root cause relates back to X. A vector search won’t infer these chains. It might find documents about X and documents about defects, but won’t reliably connect them.

What graph-RAG fixes: By making relationships explicit as graph edges, you enable traversal. The agent can walk from Supplier X to equipment to defects to root causes, following edges rather than guessing semantic similarity. This approach naturally encodes transitive properties and enables precise multi-hop reasoning.

Cost: High, in three ways:
Graph construction: Offline entity extraction and relationship inference (expensive once, amortized across many queries).
Per-query retrieval: Subgraph retrieval is cheap, but iteration can loop 2–3 times, each time expanding the subgraph.
Latency: Typically 1–5 seconds per query due to iteration.

Failure modes:
Graph brittleness: Extraction errors poison the KG. If the entity extractor misses a relationship (Supplier X supplies Equipment Y), that connection is lost forever.
Combinatorial fan-out: A central hub entity (e.g., a major supplier with thousands of sourcing relationships) can explode the subgraph to millions of edges. Retrieving a full neighborhood becomes intractable.
Infinite loops: Without crisp stopping conditions, iterative refinement loops indefinitely. The agent keeps expanding the subgraph and reaching the same low-confidence conclusion.
False negatives in extraction: If the KG construction misses relationships, the agent has no way to discover them later.

Alternatives rejected:
Semantic search over documents: No explicit relationships; can’t reliably infer transitive chains.
Planner-retriever on relationship queries: Can work, but doesn’t leverage the structure of relationships. Graph-RAG is structurally aligned to the problem.

When to use: Domains with rich relationship semantics. Examples:
Supply chain: Suppliers → components → products → customers. Transitive sourcing questions.
Incident forensics: Events → symptoms → root causes → fixes. Causal chain discovery.
Organizational Q&A: Teams → responsibilities → stakeholders. Accountability chains.
Equipment genealogy: Manufacturers → models → instances → deployments. Provenance tracking.

Implementation:
– Use extraction models (e.g., LlamaIndex’s knowledge graph extractors, Microsoft GraphRAG, or fine-tuned NER models) to build the KG.
– Store KG in a graph database (Neo4j, ArangoDB) or vector store with relationship metadata.
– Define entity types and relationship types upfront (supply relationship, part-of relationship, causes, etc.).
– Set explicit iteration budgets (max 3 rounds) and confidence thresholds (at least 2 independent sources agree).
– Monitor subgraph sizes; cap them to prevent fan-out explosion.
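The iterate-expand-check loop from the setup steps can be sketched against a toy adjacency map. The graph contents, entity names, and the confidence rule (two independent defects citing the same root cause) are all illustrative stand-ins, not a real KG backend.

```python
# Toy knowledge graph as an adjacency map (node -> outgoing neighbors).
GRAPH = {
    "Supplier X": ["Equipment 1", "Equipment 2"],
    "Equipment 1": ["Defect A"],
    "Equipment 2": ["Defect B"],
    "Defect A": ["RootCause: seal wear"],
    "Defect B": ["RootCause: seal wear"],
}

def expand(subgraph: set) -> set:
    """One traversal hop: include all neighbors of nodes already retrieved."""
    return subgraph | {n for node in subgraph for n in GRAPH.get(node, [])}

def confident(subgraph: set) -> bool:
    """Stopping condition: >=2 in-subgraph nodes cite the same root cause."""
    citations = {}
    for node in subgraph:
        for neigh in GRAPH.get(node, []):
            if neigh.startswith("RootCause") and neigh in subgraph:
                citations[neigh] = citations.get(neigh, 0) + 1
    return any(count >= 2 for count in citations.values())

def root_causes(entities: set, max_rounds: int = 3):
    """Expand the subgraph until confident or the iteration budget runs out."""
    subgraph = set(entities)
    for _ in range(max_rounds):  # hard cap prevents infinite refinement loops
        subgraph = expand(subgraph)
        if confident(subgraph):
            return sorted(n for n in subgraph if n.startswith("RootCause"))
    return None  # budget exhausted: report low confidence, don't guess

print(root_causes({"Supplier X"}))  # ['RootCause: seal wear']
```

The explicit `max_rounds` budget and the `confident` threshold are exactly the two knobs the implementation bullets call out; returning `None` on budget exhaustion keeps the failure visible.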


Pattern 4: Reflective RAG (Self-Critique and Corrective Retrieval)

Reflective RAG generates an initial answer, then explicitly critiques its own retrieval and reasoning, issuing corrective queries if needed. The reflection step is itself a language model call; it asks: “Does the retrieved context actually support this answer? What am I missing?” If the critique finds gaps, the system re-retrieves and regenerates.

Architecture diagram 5

Setup: A query arrives: “Does our manufacturing process meet ISO 9001 standards?” The system:

  1. Retrieve: Fetch documents about ISO 9001 and your process description.
  2. Generate: LLM produces an initial answer: “Yes, section 4.1 covers control points that align with ISO 9001 clause 8.1.”
  3. Critique: A second LLM call reviews the answer: “Does the retrieval actually support this mapping? Let me check… The retrieved documents mention clause 8.1 as ‘process control,’ but do they specifically discuss control points in your process? No explicit mapping found.”
  4. Decide: The critique flags the answer as unsupported. The system issues a corrective query: “Find documents that explicitly map our manufacturing control points to ISO 9001 clauses.”
  5. Re-retrieve: Fetch more specific documents.
  6. Loop or finalize: If iteration budget remains and confidence is still low, loop. Otherwise, finalize.

What vanilla RAG gets wrong: It generates one answer from the first retrieval and outputs it, with no verification that the retrieved documents actually support the answer. In high-stakes domains (compliance, medical, financial), this fail-silent default is dangerous. An LLM can confidently cite a document that says the opposite, and vanilla RAG has no internal alarm.

What reflective RAG fixes: By adding an explicit critique step, you surface mismatches between retrieval and reasoning. The system can detect that an answer is not supported by the evidence and issue corrective queries. This reduces hallucination on high-stakes queries by 30–50%, depending on the domain.

Cost: High. Each query generates an initial answer (1 LLM call) and a critique (1 LLM call), doubling latency. Iterations amplify the cost. Per-query cost typically 2–4x a vanilla RAG baseline.

Failure modes:
Self-rationalization: The critique LLM is prone to defending the initial answer rather than genuinely critiquing it. Example: “The answer says X, and the retrieved document mentions X, so it’s correct” even if the context is misleading or partial.
Critique unreliability: If the model is not strong (e.g., a smaller, cheaper model used for cost), the critique adds noise rather than signal.
Double latency: Two LLM passes per query. For latency-sensitive applications (<500ms SLA), reflective RAG is infeasible.
Repeated retrieval failures: If the initial retrieval is poor, the critique detects it but the corrective query fetches the same low-quality documents again, creating a loop.

Alternatives rejected:
Beam search over answers: Generate multiple candidate answers, rank by entailment. More expensive than critique (N candidates × 2 passes each), and doesn’t improve precision on unsupported claims.
Explicit answer grading: Use a rubric (e.g., “answer must cite at least 2 sources”). Works, but is rigid and doesn’t adapt to query specificity.

When to use: High-stakes, hallucination-sensitive domains:
Compliance and legal: Citation accuracy is mandatory. Regulators audit reasoning.
Medical/safety-critical: Wrong answers have consequences. Verification is non-negotiable.
Financial advisory: Misleading claims violate fiduciary duty.
Public-facing systems: Reputational risk from confident falsehoods is high.

Lower stakes (customer support FAQ, product documentation) don’t justify the 2x latency cost.

Implementation:
– Use a strong critique model (same scale as generation, or larger).
– Pair critique with explicit answer ranking (score candidate answers against a rubric of supporting evidence).
– Add fallback heuristics: if critique finds zero supporting evidence, reject the answer outright rather than looping.
– Log critique feedback; use it to retrain generation or retrieval components monthly.
– Set low iteration budgets (max 1–2 rounds) to avoid latency explosion.
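The generate-critique-correct control flow can be sketched as follows. This is a sketch only: `generate` and `critique` stand in for LLM calls, the rubric (at least two supporting documents) and the broaden-to-full-corpus corrective step are assumptions illustrating the bullets above.

```python
def retrieve(query: str, corpus: list) -> list:
    """Keyword stub standing in for a real retriever."""
    terms = query.lower().split()
    return [doc for doc in corpus if any(t in doc.lower() for t in terms)]

def generate(query: str, docs: list) -> dict:
    """Stub generator: a real one is an LLM call; the answer carries its evidence."""
    return {"answer": f"draft answer to: {query}", "citations": docs}

def critique(candidate: dict) -> bool:
    """Rubric-based critique: demand >=2 supporting documents so the critic
    cannot rationalize a single weak match into 'supported'."""
    return len(candidate["citations"]) >= 2

def reflective_answer(query: str, corpus: list, max_rounds: int = 2):
    docs = retrieve(query, corpus)
    for _ in range(max_rounds):  # low budget avoids latency explosion
        candidate = generate(query, docs)
        if critique(candidate):
            return candidate
        docs = corpus[:]  # corrective retrieval (stub: broaden the search)
    return None  # zero-support case: reject outright instead of looping

corpus = [
    "ISO 9001 clause 8.1 defines process control requirements.",
    "Our manufacturing control points are mapped to ISO 9001 clause 8.1.",
]
result = reflective_answer("requirements", corpus)
print(len(result["citations"]))  # 2: first pass failed the rubric, corrective pass succeeded
```

The structure mirrors the setup steps: the first retrieval matches only one document, the critique rejects it, and the corrective pass supplies enough evidence; with no evidence at all, the function rejects rather than loops.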


Decision Tree: Which Pattern, When?

Selecting among these patterns requires understanding your constraints and failure modes. Here’s a diagnostic workflow:

Step 1: Baseline vanilla RAG. Run vanilla RAG on a held-out test set of 50–100 queries representing production load. Categorize failures:

  • Retrieval failures (~20% of errors): Your retriever ranked the right document low or didn’t retrieve it. Example: query “equipment from supplier X” retrieves supplier Y by mistake.
  • Try: Router pattern (if domain-dependent) or planner-retriever (if query is decomposable).

  • Multi-document failures (~35% of errors): The answer requires facts from multiple documents that vanilla RAG never co-retrieved. Example: “which supplier’s equipment has the longest MTTR” needs MTTR data and supplier sourcing data in the same retrieval.

  • Try: Planner-retriever or graph-RAG.

  • Reasoning failures (~25% of errors): Documents were retrieved, but the generator misinterpreted them or hallucinated. Example: cited a document that says the opposite.

  • Try: Reflective RAG or better prompting (cheaper first).

  • Domain misroute (~20% of errors): The query semantically matches the wrong knowledge domain. Example: “show me latest incident” in operations queries should go to incident logs, not sensor data.

  • Try: Router pattern (mandatory).

Architecture diagram 6

Step 2: Assess constraints:

Constraint | Decision
Latency budget <500ms | Router (fast). Avoid planner-retriever (multiple LLM calls), graph-RAG (iteration), and reflective RAG (double passes).
Latency budget 500ms–2s | Planner-retriever OK. Router good. Graph-RAG acceptable if shallow (≤2 hops). Reflective risky.
Latency budget >2s | Any pattern works. Optimize for accuracy, not speed.
Query distribution: 80% simple, 20% complex | Hybrid: vanilla RAG for the 80%, route complex queries to an agentic pattern.
High hallucination cost (medical, legal, compliance) | Reflective RAG mandatory, or pair agentic patterns with explicit validation.
Corpus >500K documents | Router and graph-RAG gain efficiency. Planner-retriever still viable.
Relationship-heavy domain (supply chain, incidents) | Graph-RAG is structurally aligned. Plan B: planner-retriever with explicit relationship decomposition.
Multi-domain system | Router is foundational. Pair with planner-retriever or graph-RAG for complex queries within each domain.

Step 3: Pilot and measure:

  1. Implement one pattern (pick the one with highest ROI from your diagnostic).
  2. Measure on 100 held-out queries: accuracy, latency, cost, hallucination rate.
  3. A/B test against vanilla RAG baseline for 1–2 weeks.
  4. If accuracy improves >10% and latency is acceptable, roll out incrementally.
  5. Hybrid approach wins: route easy queries to vanilla, hard queries to agentic pattern.

Common Failure Modes Across All Agentic RAG

Even with the right pattern chosen, agentic RAG systems fail in predictable ways:

Hallucination under tool errors. When a planner invents a sub-query or a graph-RAG agent expands the subgraph, it can retrieve irrelevant or contradictory documents. The generator then rationalizes these documents into a confident answer. Mitigation: explicit validation of tool outputs. If a retrieval returns zero results, surface that (don’t hallucinate documents). If a tool call is malformed, re-prompt the LLM or fall back to a heuristic.

Latency blow-up. Planner-retriever decomposes a query into 5 sub-queries; each takes 200ms. Suddenly you’re at 1s latency (or 5s if sequential). Graph-RAG with 3 iteration rounds becomes even slower. Mitigation: set hard decomposition budgets (max 3 sub-queries), parallelize retrieval, set iteration caps (max 2 rounds).

Cost explosion. Multiple LLM calls per query. If you’re using expensive models (GPT-4 level) and have 10K QPS, token costs compound quickly. A query costing 1 cent becomes 5 cents (5x). Mitigation: Use cheaper models for planning and critique (Haiku, GPT-3.5). Reserve larger models (Sonnet, GPT-4) for final generation. Or batch similar queries and reuse planning/routing decisions.

Evaluation drift. You evaluate your agentic RAG on a fixed test set and tune for BLEU or ROUGE scores. In production, user queries shift. The planner makes different decompositions. Reflective RAG’s critiques become stale. Accuracy drifts. Mitigation: continuous online evaluation. Log every retrieval-generation pair. Monthly retraining on recent queries.

Tool-use reliability. Modern LLMs have ~95% accuracy on tool calls. At scale, that 5% error rate becomes systematic. A planner issues a malformed retrieval query; a graph-RAG agent calls a non-existent relationship. Mitigation: explicit tool-call validation. If a call is malformed, re-prompt the LLM or fall back to a heuristic. Never silently pass broken calls downstream.

Context-window abuse. Reflective RAG and graph-RAG agents can retrieve very large subgraphs (thousands of tokens). If your token budget is tight (4K context window), you lose signal-to-noise. The LLM becomes a needle-in-haystack task. Mitigation: aggressive summarization of retrieved subgraphs. Hierarchical retrieval (retrieve top-10, re-rank, then pass to generator). Use sliding-window context management.
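The re-rank-then-truncate mitigation can be sketched as a budget guard. The overlap-count scorer is a deliberately crude stand-in for a real re-ranker, and the whitespace token estimate is an assumption; only the control flow is the point.

```python
def score(query: str, chunk: str) -> int:
    """Stub relevance score: shared-word count between query and chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def fit_to_budget(query: str, chunks: list, budget_tokens: int = 50) -> list:
    """Re-rank retrieved chunks, then keep only what fits the context budget."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    kept, used = [], 0
    for chunk in ranked:
        cost = len(chunk.split())  # crude whitespace token estimate
        if used + cost > budget_tokens:
            break  # drop lower-ranked chunks instead of overflowing context
        kept.append(chunk)
        used += cost
    return kept

chunks = [
    "unrelated text about weather patterns today",
    "supplier x defect seal wear",
]
print(fit_to_budget("supplier x defect", chunks, budget_tokens=5))
```

Because ranking happens before truncation, the budget cuts from the least relevant end, which is exactly the signal-to-noise property the mitigation is after.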

Infinite loops in iterative refinement. A graph-RAG agent with weak stopping conditions can loop indefinitely: expand subgraph, reach low-confidence conclusion, expand again, repeat. Mitigation: fixed iteration budgets (max 3 rounds). Explicit stopping criteria: if confidence hasn’t improved by 10% in the last round, stop. Use information gain metrics: stop if each new retrieval adds <1 new fact.


Real-World Implications

Picking the wrong pattern is expensive. Here are three case studies:

Case 1: Manufacturing Defect Analysis

A manufacturer has quarterly defect reports (10K documents) and equipment logs (500K records). Queries: “Which suppliers have unresolved defects in Q1?” + “Correlate with equipment downtime.”

  • Vanilla RAG: Multi-hop failure. Accuracy ~50%.
  • Planner-retriever: Decomposes into [find Q1 defects] + [find downtime] + [correlate]. Accuracy ~85%, latency 800ms. Cost: 4 LLM calls/query.
  • Router: Insufficient (metadata + sensor data are complementary, not alternative).
  • Graph-RAG: Overkill for this domain; extraction overhead not justified. Similar accuracy to planner at higher complexity.
  • Winner: Planner-retriever. Simple, effective, justified cost.

Case 2: Supply Chain Provenance Queries

A supplier network has 1M+ relationships (raw materials → components → products → customers). Queries: “What is the full sourcing chain for product X?” + “Which suppliers can be replaced without redesign?”

  • Vanilla RAG: Relationship-heavy; fails on transitive reasoning. Accuracy ~40%.
  • Planner-retriever: Cannot encode supplier-of-supplier relationships without extracting them explicitly.
  • Router: Insufficient (all queries are relationship-heavy).
  • Graph-RAG: Designed for this. Extract entities once (offline), traverse graph at query time. Accuracy ~90%, latency 2–3s.
  • Winner: Graph-RAG. Structurally aligned.

Case 3: Compliance Document Q&A

A financial services firm has 500 regulatory documents. Queries: “Are we compliant with SEC Rule 10b-5?” + “Cite the specific controls.”

  • Vanilla RAG: Hallucinates citations. Accuracy ~60%, but confidence is high (false confidence).
  • Planner-retriever: Helps decompose, but doesn’t validate citations. Accuracy ~70%.
  • Reflective RAG: Critiques answers against retrieved documents. Accuracy ~88%, latency 1.2s.
  • Winner: Reflective RAG. High-stakes domain demands citation validation.

Recommendations: A Phased Migration Strategy

Phase 1: Baseline and diagnose (Week 1–2).
– Run vanilla RAG on 100 representative queries.
– Log failures: multi-hop, domain, retrieval quality, reasoning.
– Compute error rate by category.
– Identify your top 3 failure modes.

Phase 2: Choose and pilot (Week 3–6).
– Implement one agentic pattern targeting your top failure mode.
– Measure accuracy, latency, cost on the same 100 queries.
– A/B test for 2 weeks with 10% of production traffic.
– If accuracy improves >10% and latency/cost is acceptable, proceed.

Phase 3: Instrument and iterate (Week 7–12).
– Deploy the pattern to 50% of production traffic.
– Log every retrieval, generation, and latency measurement.
– Set up automated evaluation: compute NDCG (retrieval quality), ROUGE (generation quality), latency percentiles.
– Monthly retraining on recent queries.

Phase 4: Hybrid and optimize (Month 4+).
– Classify queries by complexity and route accordingly.
– Vanilla RAG for simple queries (<100ms latency target).
– Agentic pattern for complex queries (<2s latency target).
– Establish SLAs: 95% of simple queries <200ms, 95% of complex queries <1.5s.

Best practice: Hybrid is the default. Don’t fully replace vanilla RAG. Route queries by estimated complexity (ask the planner to estimate decomposition depth; if >3 sub-queries, it’s complex). Use vanilla for simple, agentic for complex. This balances cost and accuracy.
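A cheap heuristic version of that complexity estimate (a stand-in for asking the planner itself) can be sketched as marker counting. The marker list and the >3 threshold are illustrative assumptions.

```python
# Multi-step markers used as a proxy for planner decomposition depth.
COMPLEX_MARKERS = ("correlate", "compare", "for each", "across", "then")

def estimated_subqueries(query: str) -> int:
    """Crude depth estimate: 1 + multi-step markers + conjunctions."""
    q = query.lower()
    return 1 + sum(q.count(m) for m in COMPLEX_MARKERS) + q.count(" and ")

def route_by_complexity(query: str) -> str:
    # >3 estimated sub-queries -> complex -> agentic path; else vanilla RAG
    return "agentic" if estimated_subqueries(query) > 3 else "vanilla"

print(route_by_complexity("What is the IP address of gateway-01?"))  # vanilla
print(route_by_complexity(
    "Find Q1 defects across suppliers and correlate with downtime, then explain root cause"
))  # agentic
```

In production you would periodically compare this estimate against actual planner output depth and recalibrate the threshold, since a cheap heuristic that misroutes complex queries silently erodes the accuracy gains.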


Failure Modes in Detail: When Agentic RAG Breaks

Planner-Retriever Specific

  • Over-decomposition. A complex query decomposes into 15 sub-queries. Latency explodes; cost balloons. Planner is trying too hard. Mitigation: Prompt the planner to aim for 2–3 sub-queries. Penalize decomposition depth in the scoring function.
  • Orphaned sub-queries. A planner invents a sub-query that has no supporting documents. Retriever returns empty. Generator hallucinates. Mitigation: If retrieval returns <3 documents per sub-query, flag as low-confidence. Optionally re-retrieve with a broader query or fall back to vanilla RAG for that sub-query.

Router Specific

  • Boundary fuzz. A query sits on the boundary between domains (semantic + temporal). Classifier misroutes 40% of the time. Mitigation: Add a confidence score to the classifier. If confidence <0.7, use fallback routing (query both retrievers, ensemble results). Or add a clarification step (ask the user which domain).

Graph-RAG Specific

  • Graph poisoning. Entity extraction errors (say, extracting “Supplier A” as “A Supplier”) create duplicate or missing nodes. Traversal fails. Mitigation: Canonical entity normalization (map all variations to a single entity). Regular graph audits (sample nodes, verify correctness).
  • Combinatorial explosion. A central node (e.g., “defect”) has 100K outgoing edges (all defects ever). Subgraph retrieval returns millions of edges. Context window explodes. Mitigation: Top-K filtering on edges by relevance. Hierarchical retrieval (retrieve top-100 edges, re-rank, then expand).

Reflective RAG Specific

  • Critique defending bad answers. The critique model rationalizes poor answers rather than critiquing them. Example: “The answer says X, and the document mentions X, so the answer is supported” even if out of context. Mitigation: Use explicit rubrics (e.g., “the answer must cite the document with at least 10 words of verbatim text”). Pair with automatic entailment scoring (does the answer logically follow from the document?).

Further Reading

  • GraphRAG: Microsoft Research, “From Local to Global: A Graph RAG Approach to Query-Focused Summarization” (2024). Seminal work on entity-relationship-aware retrieval and iterative refinement.

  • Self-RAG: Asai et al., “Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection” (2023). Foundational paper on reflective patterns and self-critique in language models.

  • HyDE (Hypothetical Document Embeddings): Gao et al., “Precise Zero-Shot Dense Retrieval without Relevance Labels” (2022). Technique for improving retrieval with LLM-generated pseudo-documents.

  • LangChain Agents: https://python.langchain.com/docs/modules/agents/ — Reference implementations of planner-retriever and reactive agent patterns. Good examples of tool-use orchestration.

  • LlamaIndex Graph Integrations: https://docs.llamaindex.ai/en/stable/module_guides/indexing/knowledge_graph/ — Practical guide to building and querying graph-RAG systems.

  • Anthropic Prompt Engineering: https://www.anthropic.com/research/prompt-engineering — Best practices for reliable tool use and structured outputs in agentic patterns.

  • RAG Evaluation Frameworks: RAGAS (Es et al., “RAGAS: Automated Evaluation of Retrieval Augmented Generation,” 2023) and TruLens provide open-source evaluation for retrieval precision, recall, and generation quality.

  • Constitutional AI: Bai et al., “Constitutional AI: Harmlessness from AI Feedback” (2022). Methods for training LLMs to critique and improve their own outputs, foundational to reflective RAG.


Summary: Choosing Your Pattern

Your Situation | Best Pattern | Cost | Latency | Accuracy Gain
Multi-hop queries, decomposable into steps | Planner-Retriever | Medium (2–4 LLM calls) | 500ms–2s | +25–35% on multi-hop
Multi-domain queries, distinct retrieval strategies | Router/Classifier | Low (<50ms overhead) | <300ms | +20–30% on routed queries
Relationship-heavy domain (supply chain, incidents, org charts) | Graph-RAG Agent | High (extraction + iteration) | 1–5s | +30–45% on relationship queries
High-stakes, hallucination-critical (compliance, medical) | Reflective RAG | High (2x LLM calls) | 1–2s | +25–50% hallucination reduction
Simple, single-hop, high-volume | Vanilla RAG | Low | <200ms | Baseline
Mixed query complexity | Hybrid (Router + Vanilla, or Planner + Vanilla) | Medium | Adaptive: <200ms simple, 500ms–2s complex | Balanced

Start with diagnosis. Implement the highest-ROI pattern. Iterate monthly. Hybrid is the default.
