RAG Evaluation Architecture: Faithfulness, Context Precision, and RAGAS-Style Metrics (2026)
A retrieval-augmented generation system almost never fails loudly. It returns a fluent, confident, well-formatted paragraph that happens to be wrong, and no exception is thrown, no HTTP 500 fires, no log line turns red. This is why RAG evaluation metrics matter more than the retrieval stack you spent three weeks tuning: without them, you are shipping a system whose failure mode is plausible fabrication, and you have no instrument that can see it. A demo that answers ten curated questions correctly tells you almost nothing about the thousandth real question. The only defense is a measurement layer that decomposes quality into retrieval and generation, scores each with metrics that survive contact with production, and gates deploys on regressions. This post gives you that layer: the metric taxonomy, the exact formulas, how RAGAS-style LLM-graded scores are computed, where LLM judges lie to you, and how to wire it all into a pipeline that runs offline in CI and online against live traffic.
What this covers: the retrieval-vs-generation decomposition, retrieval metrics (context precision, recall, hit rate, MRR, nDCG), generation metrics (faithfulness, answer relevancy, answer correctness), how RAGAS computes claim-level faithfulness via NLI, LLM-as-judge biases, golden sets, and CI regression gating.
Context and Background
RAG became the default pattern for grounding large language models in private or fresh data, and its adoption outran its instrumentation. Teams shipped retrievers and prompt chains, watched a handful of demo queries succeed, and declared victory — then discovered in production that the system confidently invented policy numbers, cited the wrong contract clause, or answered a question the retrieved documents never addressed. RAG breaks silently because both of its halves can fail independently and the surface output looks identical in every case: fluent prose. A retriever can pull irrelevant chunks while the generator smooths over the gap by hallucinating; or the retriever can nail the right passage while the generator ignores it and pattern-matches from pretraining. You cannot tell which happened by reading the answer.
The instinct is to reach for classic NLP metrics — BLEU, ROUGE, exact match — because they are cheap and deterministic. They fail here for a structural reason: they measure n-gram overlap against a reference string, but a RAG answer can be entirely correct while sharing almost no surface tokens with your reference, and entirely wrong while sharing most of them. ROUGE rewards a fluent paraphrase of the wrong document and punishes a correct answer phrased differently. Neither metric can tell whether a claim is supported by the retrieved context, which is the only property that actually defines a good RAG answer. Embedding-based similarity (BERTScore and friends) is a partial upgrade — it catches paraphrase — but it still cannot distinguish “semantically close to the reference” from “factually entailed by the retrieved evidence,” and factual entailment is the property that matters. Evaluation had to move from surface overlap to semantic, claim-level, grounding-aware measurement — which is what the RAGAS, TruLens, and DeepEval generation of tools deliver.
If you are still choosing a retrieval strategy, evaluation should run alongside it from day one; the corrective-RAG and self-RAG architecture patterns both assume you can measure retrieval quality to trigger their correction loops. For the metric definitions this post formalizes, the RAGAS documentation is the canonical reference.
A RAG Evaluation Architecture
A RAG evaluation architecture separates every quality signal into two axes — retrieval quality and generation quality — scores each axis with metrics computed from the same execution trace, aggregates those scores per metric, and feeds the result to a dashboard for humans and a threshold gate for CI. The decomposition is the whole point: a single end-to-end “correctness” number tells you the system is broken but never why, and you cannot fix what you cannot localize.

Figure 1: The evaluation pipeline. A fixed query set is run through the RAG system; each execution emits a trace containing the query, the retrieved context, and the generated answer. Retrieval scorers and generation scorers read the same traces, an aggregator computes per-metric means, and the results fan out to a dashboard and a CI regression gate.
The trace is the unit of evaluation. For every query you capture four fields: the question, the list of retrieved chunks (with their IDs and scores), the final answer, and, when you have one, the ground-truth reference. Reference-free retrieval scorers need only the question and retrieved chunks; context recall additionally needs the reference or the labeled ideal chunk IDs; generation scorers need the answer and context; correctness scorers need the reference. Because all scorers read from one trace, you can attribute a low end-to-end score to a specific quadrant: bad context with a good answer means the model guessed; good context with a bad answer means the generator failed. That attribution is what turns a red dashboard into an actionable ticket.
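A minimal sketch of the trace record described above, assuming plain dataclasses. The class and field names (`Trace`, `RetrievedChunk`, `scorable_metrics`) are illustrative and not taken from any particular framework.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RetrievedChunk:
    chunk_id: str
    text: str
    score: float  # retriever score for this chunk


@dataclass(frozen=True)
class Trace:
    question: str
    retrieved: Tuple[RetrievedChunk, ...]
    answer: str
    reference: Optional[str] = None  # ground truth, only for labeled queries

    def scorable_metrics(self) -> list:
        """Return the metrics this trace can feed, given which fields are present."""
        metrics = ["context_precision", "faithfulness", "answer_relevancy"]
        if self.reference is not None:
            # Reference-based metrics only make sense on labeled traces.
            metrics += ["context_recall", "answer_correctness"]
        return metrics


live = Trace("What is the refund window?", (RetrievedChunk("c1", "30 days", 0.82),), "30 days.")
print(live.scorable_metrics())  # ['context_precision', 'faithfulness', 'answer_relevancy']
```

Keeping the record immutable makes it safe to hash and re-score later without worrying that a scorer mutated it.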
There is a deeper reason to persist the trace rather than compute scores inline and discard the intermediates: metrics evolve. You will change your faithfulness prompt, swap judge models, or add a new metric six months from now, and if you kept only the final numbers you cannot recompute history to compare. A stored trace corpus lets you re-score the entire back-catalog against a new metric definition and produce an apples-to-apples trend. Treat traces as the source of truth and scores as derived, versioned artifacts — every score row should record which metric version and which judge model produced it, or your longitudinal charts silently mix incompatible measurements.
Retrieval metrics: precision, recall, MRR, nDCG
Retrieval is a ranking problem, so its metrics come from information retrieval. Context precision asks: of the chunks we retrieved, how many are actually relevant? If you retrieve k chunks and r of them are relevant, precision@k = r / k. Retrieving four chunks of which two matter gives precision 0.5. Low precision means noise in the context window — wasted tokens and a higher chance the generator latches onto an irrelevant passage.
Context recall asks the complementary question: of all the chunks that should have been retrieved to answer this question, how many did we get? If the ground-truth answer requires information spread across three passages and you retrieved two of them, recall = 2/3 = 0.67. Low recall means the generator is working with incomplete evidence and will either omit facts or fabricate the missing ones. Precision and recall trade off: retrieving more chunks raises recall but usually lowers precision. Chunking granularity moves both at once — tiny chunks raise precision but fragment multi-sentence facts and hurt recall; large chunks raise recall but dilute precision and burn context tokens — which is exactly why you evaluate the retriever as a unit rather than tuning chunk size on intuition. Watch the pair together: rising recall with collapsing precision usually means you simply widened k, not that retrieval got smarter.
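The two ratios above can be sketched in a few lines, assuming relevance is known as a set of chunk IDs (the IDs below are made up for illustration).

```python
def precision_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for cid in top_k if cid in relevant_ids)
    # Divides by the number actually retrieved, which equals k when k chunks came back.
    return hits / len(top_k)


def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the chunks that should have been retrieved that appear in the top k."""
    if not relevant_ids:
        return 0.0
    found = relevant_ids.intersection(retrieved_ids[:k])
    return len(found) / len(relevant_ids)


retrieved = ["a", "b", "c", "d"]
print(precision_at_k(retrieved, {"a", "c"}, k=4))              # 0.5
print(round(recall_at_k(retrieved, {"a", "c", "z"}, k=4), 2))  # 0.67
```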
Rank-sensitive metrics reward putting the right chunk near the top, which matters because LLMs attend unevenly across a long context — the well-documented “lost in the middle” effect means a relevant chunk buried at position seven of ten is worth far less to the generator than the same chunk at position one. Hit rate@k is binary per query — 1 if any relevant chunk appears in the top k, else 0 — and you average it across the query set. It answers “did we retrieve anything useful at all” and is the coarsest retrieval health check. Mean Reciprocal Rank (MRR) rewards the position of the first relevant result: for a query where the first relevant chunk sits at rank p, the reciprocal rank is 1/p, and MRR is the mean of those reciprocals over all queries. First-position hits score 1.0; a relevant chunk at rank 4 contributes only 0.25.
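Hit rate and MRR as defined above, sketched over a small, made-up query set where each entry pairs a ranked list of retrieved chunk IDs with the set of relevant IDs.

```python
def hit_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """1.0 if any relevant chunk appears in the top k, else 0.0."""
    return 1.0 if any(cid in relevant_ids for cid in retrieved_ids[:k]) else 0.0


def reciprocal_rank(retrieved_ids: list, relevant_ids: set) -> float:
    """1/p where p is the 1-based rank of the first relevant chunk, 0.0 if none."""
    for rank, cid in enumerate(retrieved_ids, start=1):
        if cid in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean(values: list) -> float:
    return sum(values) / len(values) if values else 0.0


runs = [
    (["a", "b"], {"a"}),            # first relevant at rank 1
    (["x", "y", "z", "b"], {"b"}),  # first relevant at rank 4
    (["x", "y"], {"q"}),            # nothing relevant retrieved
]
print(round(mean([reciprocal_rank(r, rel) for r, rel in runs]), 3))  # 0.417
print(round(mean([hit_at_k(r, rel, k=3) for r, rel in runs]), 3))    # 0.333
```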
Normalized Discounted Cumulative Gain (nDCG) generalizes this to graded relevance: DCG sums each result’s relevance discounted by the log of its rank, DCG = Σ relᵢ / log₂(i+1), and nDCG divides that by the ideal ordering’s DCG so the score lands in [0, 1]. Use nDCG when relevance is not binary, that is, when some chunks are highly on-topic and others only tangentially useful. Worked example: suppose you retrieve five chunks with graded relevance [3, 2, 0, 1, 0] where 3 is highly relevant and 0 is noise. The DCG is 3/log₂(2) + 2/log₂(3) + 0/log₂(4) + 1/log₂(5) + 0/log₂(6) = 3.0 + 1.26 + 0 + 0.43 + 0 = 4.69. The ideal ordering [3, 2, 1, 0, 0] gives IDCG = 3.0 + 1.26 + 0.5 + 0 + 0 = 4.76, so nDCG = 4.69 / 4.76 = 0.985: near-perfect ranking despite the noise chunks, because the strong results are already at the top. Now swap the relevance-3 chunk with the relevance-1 chunk at position four, giving [1, 2, 0, 3, 0]: DCG falls to 1.0 + 1.26 + 0 + 1.29 + 0 = 3.55 and nDCG drops to 3.55 / 4.76 ≈ 0.75. The set of retrieved chunks is unchanged, so hit rate and precision do not move at all; only the rank-aware metric sees the degradation.
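The DCG formula can be checked directly. The sketch below uses the linear-gain form given above (relevance divided by log₂ of rank plus one) and reproduces the first worked example.

```python
import math


def dcg(relevances: list) -> float:
    """Discounted cumulative gain: sum of rel_i / log2(i + 1) over 1-based ranks."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))


def ndcg(relevances: list) -> float:
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


print(round(dcg([3, 2, 0, 1, 0]), 2))   # 4.69
print(round(ndcg([3, 2, 0, 1, 0]), 3))  # 0.985
```

Passing any reordering of the same grades to `ndcg` shows how much a ranking change costs while the retrieved set stays fixed.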
Generation metrics: faithfulness, relevancy, correctness
Generation metrics score the answer given the context. Faithfulness (also called groundedness) is the single most important RAG metric because it measures hallucination directly. Its definition is a ratio: faithfulness = (number of answer claims supported by the retrieved context) / (total number of claims in the answer). An answer that makes five factual claims of which four are entailed by the context scores 0.8; the unsupported fifth claim is a hallucination the metric caught. Faithfulness deliberately ignores whether the answer is correct in absolute terms — it only asks whether the answer is grounded in what was retrieved, which is exactly the property a RAG system is contractually obligated to satisfy.
Answer relevancy measures whether the answer actually addresses the question, independent of grounding. The RAGAS technique is elegant: prompt an LLM to generate several candidate questions that the given answer would be a good response to, embed those synthetic questions and the original question, and take the mean cosine similarity between them. relevancy = mean(cos(qᵢ, q_original)). An answer that rambles or dodges produces synthetic questions that drift away from the original, lowering similarity. A tight, on-topic answer produces synthetic questions that cluster around the original, scoring near 1.0. The technique also penalizes non-committal answers: if you generate three reverse-questions from the answer and their cosine similarities to the original are 0.91, 0.88, and 0.34 — that third one drifting because the answer hedged — the mean drops to 0.71, flagging a partially evasive response that a human would also find unsatisfying.
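The scoring half of that technique is just a mean of cosine similarities. The sketch below assumes the reverse-questions have already been generated and embedded by a model of your choice; the two-dimensional vectors are toy stand-ins for real embeddings.

```python
import math


def cosine(u: list, v: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def answer_relevancy(question_vec: list, synthetic_question_vecs: list) -> float:
    """Mean cosine similarity between the original question and each reverse-question."""
    if not synthetic_question_vecs:
        return 0.0
    sims = [cosine(question_vec, v) for v in synthetic_question_vecs]
    return sum(sims) / len(sims)


# Toy embeddings: one reverse-question matches the original, one is orthogonal to it.
print(answer_relevancy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # 0.5
```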
Answer correctness compares the answer to a ground-truth reference and blends factual overlap (which claims match) with semantic similarity, so it needs a labeled golden set. In RAGAS it is computed as a weighted combination of an F1 over claim-level true-positives, false-positives, and false-negatives (claims present in both, only in the answer, only in the reference) and an embedding similarity to the reference. Faithfulness plus relevancy can be computed without references; correctness cannot. This split governs everything downstream: the reference-free pair runs cheaply and continuously on live traffic, while correctness runs only against your golden set in CI, because you only have ground truth for the queries you deliberately labeled.
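A sketch of that weighted blend, assuming the claim-level counts and the embedding similarity have already been produced upstream. The 0.75/0.25 weights are an illustrative assumption; check the defaults of whatever library you use.

```python
def claim_f1(tp: int, fp: int, fn: int) -> float:
    """F1 over claims: tp in both, fp only in the answer, fn only in the reference."""
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom else 0.0


def answer_correctness(
    tp: int,
    fp: int,
    fn: int,
    semantic_similarity: float,
    w_factual: float = 0.75,   # assumed weight, not a verified library default
    w_semantic: float = 0.25,  # assumed weight, not a verified library default
) -> float:
    """Weighted blend of claim-level F1 and embedding similarity to the reference."""
    return w_factual * claim_f1(tp, fp, fn) + w_semantic * semantic_similarity


# One claim matches, one is only in the answer, one is only in the reference.
print(round(answer_correctness(tp=1, fp=1, fn=1, semantic_similarity=0.8), 3))  # 0.575
```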
Wiring the scoring pipeline
The scorers themselves are the easy part; the wiring around them is what makes the pipeline usable. A robust scoring pipeline has four stages. First, a runner that executes the query set against a pinned version of the RAG system and writes traces to durable storage — never score against a moving target, always against a snapshot you can reproduce. Second, a scorer fan-out that reads traces and dispatches each to the relevant metric functions; because judge calls are network-bound and rate-limited, this stage batches, retries with backoff, and caches by trace hash so a re-run does not re-pay for unchanged traces. Third, an aggregator that computes per-metric means and per-slice breakdowns, attaching confidence intervals and the metric/judge versions. Fourth, a sink that publishes to a dashboard and returns a pass/fail verdict to CI.
The two properties that separate a toy harness from a production one are idempotency and provenance. Idempotency — caching scores by (trace hash, metric version, judge version) — is what makes evaluating a large set affordable, because most traces are unchanged between runs and re-scoring them is pure waste. Provenance — stamping every score with the exact prompts, model versions, and metric definitions that produced it — is what lets you trust a trend line months later. Skip either and the pipeline works in the demo and betrays you at scale: the un-cached version times out on the real query set, and the un-stamped version produces charts nobody can interpret after the first judge upgrade.
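Both properties fit in a small cache keyed by (trace hash, metric version, judge version). This is a sketch with an in-memory dict; the `ScoreCache` class and the version strings are hypothetical, and a real pipeline would back it with durable storage.

```python
import hashlib
import json


def trace_hash(trace: dict) -> str:
    """Stable hash of a JSON-serializable trace (key order does not matter)."""
    canonical = json.dumps(trace, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ScoreCache:
    def __init__(self) -> None:
        self._store = {}
        self.judge_calls = 0  # counts how often the expensive scorer actually ran

    def get_or_score(self, trace: dict, metric_version: str, judge_version: str, scorer) -> dict:
        key = (trace_hash(trace), metric_version, judge_version)
        if key not in self._store:
            self.judge_calls += 1
            # Provenance travels with the score so later charts can be interpreted.
            self._store[key] = {
                "score": scorer(trace),
                "metric_version": metric_version,
                "judge_version": judge_version,
            }
        return self._store[key]


cache = ScoreCache()
trace = {"question": "q", "answer": "a", "chunks": ["c1"]}
cache.get_or_score(trace, "faithfulness-v1", "judge-2026-01", lambda t: 0.8)
cache.get_or_score(trace, "faithfulness-v1", "judge-2026-01", lambda t: 0.8)
print(cache.judge_calls)  # 1
```

Changing either version string produces a new key, so a judge upgrade forces a re-score instead of silently reusing stale numbers.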
Deeper Analysis: Metrics and the Judge
The metrics above are definitions; RAGAS-style tools make them computable by using an LLM as the measuring instrument. Understanding how they compute — and how the instrument itself can be biased — is the difference between trusting your dashboard and being fooled by it.
Faithfulness is the clearest worked example. RAGAS does not ask a model “is this answer faithful, yes or no” — that would collapse a structured judgment into one opaque token. Instead it runs a two-stage pipeline. Stage one is claim decomposition: an LLM breaks the answer into atomic, self-contained statements. “The plan costs $40 per month and includes 10 seats” becomes two claims: “the plan costs $40 per month” and “the plan includes 10 seats.” Stage two is natural language inference (NLI): for each claim, a model judges whether the retrieved context entails it — the same entailment/contradiction/neutral judgment from textual-entailment research. Faithfulness is then the fraction of claims labeled entailed. Decomposition matters because a paragraph-level judgment hides partial hallucination; a single fabricated number inside four true sentences would slip past a holistic score but gets isolated and caught at the claim level.

Figure 2: Faithfulness computation. The answer is decomposed into atomic claims; each claim is checked against the retrieved context by an NLI judge for entailment; the score is the ratio of supported claims to total claims.
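The two-stage shape can be sketched with the LLM calls injected as plain functions. The `toy_decompose` and `toy_entails` helpers below are deliberately naive stand-ins (string splitting and substring matching), not how any library implements decomposition or NLI; in practice both would be judge-model calls.

```python
from typing import Callable, List


def faithfulness(
    answer: str,
    context: str,
    decompose: Callable[[str], List[str]],
    entails: Callable[[str, str], bool],
) -> float:
    """Supported claims divided by total claims."""
    claims = decompose(answer)
    if not claims:
        return 0.0
    supported = sum(1 for claim in claims if entails(context, claim))
    return supported / len(claims)


def toy_decompose(answer: str) -> List[str]:
    # Stand-in for an LLM claim-decomposition call.
    return [part.strip() for part in answer.split(" and ") if part.strip()]


def toy_entails(context: str, claim: str) -> bool:
    # Stand-in for an NLI judgment; real entailment is not substring matching.
    return claim.lower() in context.lower()


answer = "the plan costs $40 per month and the plan includes 10 seats"
context = "Pricing: the plan costs $40 per month."
print(faithfulness(answer, context, toy_decompose, toy_entails))  # 0.5
```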
Context precision and recall get the same treatment. RAGAS-style context recall decomposes the ground-truth answer into claims and asks, for each, whether it can be attributed to the retrieved context — recall is the fraction that can. Context precision checks whether relevant chunks are ranked above irrelevant ones by asking, per chunk, whether it was useful for arriving at the ground truth, then computing a rank-weighted precision. In every case the LLM is doing a bounded, local classification (entail vs not, useful vs not) rather than an open-ended quality vibe — bounded judgments are far more reliable and reproducible than holistic scores.
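One common way to make precision rank-weighted is to average precision@k over the ranks where a useful chunk appears. The sketch below takes that reading as an assumption; the per-chunk useful/not-useful verdicts would come from the judge.

```python
def rank_weighted_context_precision(useful: list) -> float:
    """Mean of precision@k taken at each rank k that holds a useful chunk."""
    hits = 0
    weighted = 0.0
    for rank, is_useful in enumerate(useful, start=1):
        if is_useful:
            hits += 1
            weighted += hits / rank  # precision@rank, counted only at useful ranks
    return weighted / hits if hits else 0.0


# Same two useful chunks out of four, so plain precision is 0.5 in both cases.
print(rank_weighted_context_precision([True, True, False, False]))            # 1.0
print(round(rank_weighted_context_precision([False, False, True, True]), 3))  # 0.417
```

The two calls differ only in ordering, which is the signal an unweighted precision@k cannot see.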
Walk one query end to end to see how the numbers combine. Question: “What is the refund window and does it cover digital goods?” The retriever pulls four chunks; two describe the 30-day refund window, one covers shipping, one is unrelated FAQ boilerplate — so context precision@4 = 2/4 = 0.5. The ground-truth answer has two claims (“30-day window”, “digital goods excluded”); the retrieved chunks support the window but never mention digital goods, so context recall = 1/2 = 0.5 — a retrieval gap. The generator answers: “You have 30 days to request a refund, and digital goods are fully covered.” Claim decomposition yields two claims; NLI finds the first entailed by context and the second not entailed (the context is silent on digital goods), so faithfulness = 1/2 = 0.5 — the model fabricated the digital-goods coverage to fill the recall gap. Answer relevancy is high (~0.95, the answer is on-topic) but answer correctness is low, because the fabricated claim contradicts the reference. This is the canonical silent failure: fluent, relevant, half-hallucinated, and only the decomposed metrics expose it. The fix the quadrant points to is retrieval, not the prompt — raise recall so the generator has the digital-goods passage instead of guessing.
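The quadrant logic in that walkthrough can be written as a small classifier. The floors of 0.7 and 0.8 are hypothetical placeholders; set them from your own labeled data.

```python
def diagnose(
    context_recall: float,
    faithfulness: float,
    retrieval_floor: float = 0.7,   # hypothetical threshold
    generation_floor: float = 0.8,  # hypothetical threshold
) -> str:
    """Map a (retrieval, generation) score pair to the quadrant to investigate."""
    good_retrieval = context_recall >= retrieval_floor
    good_generation = faithfulness >= generation_floor
    if good_retrieval and good_generation:
        return "healthy"
    if good_retrieval:
        return "good context, bad answer: look at the prompt or model"
    if good_generation:
        return "bad context, good answer: the model guessed, fix the retriever"
    return "both bad: fix retrieval first (index and reranking)"


# The refund example: context recall 0.5 and faithfulness 0.5.
print(diagnose(context_recall=0.5, faithfulness=0.5))
# both bad: fix retrieval first (index and reranking)
```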
The major frameworks implement these same primitives with different ergonomics. RAGAS is the reference library for the metric math itself. DeepEval wraps the metrics in a pytest-style assertion API so evals read like unit tests and slot into CI naturally. TruLens centers on “feedback functions” and instrumentation for tracing chains as they run. Arize Phoenix and LangSmith lean into tracing, dataset management, and dashboards for the online/offline lifecycle. They overlap heavily on the metrics and differ mostly in where they sit — library, test runner, or observability platform — so the choice is about your workflow, not the definitions.
LLM-as-judge and its biases
The moment an LLM scores outputs, its own failure modes become your measurement error. Three biases dominate. Position bias: when a judge compares two answers A and B, it systematically favors whichever appears first (or, for some models, second) regardless of quality. Verbosity bias: judges reward longer answers, mistaking length for thoroughness, so a padded answer beats a crisp correct one. Self-preference bias: a judge model rates outputs from its own family higher — GPT-graded evals flatter GPT outputs, Claude-graded evals flatter Claude — which quietly corrupts any leaderboard where the judge and a contestant share lineage.

Figure 3: The four-quadrant diagnostic. Cross retrieval quality with generation quality to localize failure: good context and bad answer points at the prompt or model; bad context and good answer means the model guessed and the retriever needs work; both bad calls for rebuilding the index and reranking.
Two more biases deserve a mention because they quietly inflate scores. Format bias: judges reward answers that look like the format they expect — bulleted, markdown-headed, confident — over equally correct prose, so a formatting change can move your scores without any change in substance. Anchoring / leniency drift: a judge shown a rubric with generous examples grades leniently thereafter, and single-answer scoring (grade this 1–5) tends to cluster around the middle and drift upward over a long batch. The defense against both is a tight, example-anchored rubric with explicit criteria and, for pairwise setups, forcing a decision rather than allowing ties.
Mitigations are concrete, not aspirational. For position bias, evaluate both orderings and average, or randomize order across the eval set. For verbosity, either normalize for length or use a rubric that scores specific dimensions rather than overall preference. For self-preference, use a judge from a different model family than the system under test, and periodically validate the judge against human labels on a held-out slice. The most important discipline is calibration: measure your judge’s agreement with human raters (Cohen’s kappa or simple agreement rate) on a few hundred examples before you trust it on a hundred thousand. A judge that agrees with humans 70% of the time is a noisy but usable instrument; one that agrees 55% of the time is a random number generator with good grammar. Recompute this agreement whenever you change the judge model or its prompt — a “harmless” prompt tweak can shift the judge’s operating point and invalidate every score that follows it. Where the stakes justify it, use a panel of judges from different families and take the majority or mean; disagreement among judges is itself a useful signal that a query is genuinely ambiguous and may need a human.
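For the calibration step, Cohen's kappa corrects the raw agreement rate for agreement expected by chance. A self-contained sketch with made-up labels (1 = judge or human marked the claim as supported):

```python
from collections import Counter


def cohens_kappa(judge_labels: list, human_labels: list) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Probability both raters pick the same label if they labeled independently.
    expected = sum(judge_counts[lab] * human_counts[lab] for lab in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


judge = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
human = [1, 1, 0, 0, 1, 0, 0, 1, 1, 0]
print(round(cohens_kappa(judge, human), 2))  # 0.6
```

Here raw agreement is 80% but kappa is about 0.6, because half of that agreement would be expected by chance with balanced labels.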
Sample size, aggregation, and reading the numbers
A metric mean is only as trustworthy as the sample behind it, and this is where many teams quietly fool themselves. If you evaluate 30 queries and faithfulness comes back 0.83, the 95% confidence interval on that mean is roughly ±0.13 for a proportion-like score — so a “regression” from 0.83 to 0.78 on the next run is statistically indistinguishable from noise. As a rough rule, the standard error of a mean shrinks with the square root of the sample size, so quadrupling your query set only halves your uncertainty. For a gate you actually trust, aim for a few hundred queries per evaluated slice, and report the confidence interval next to every mean rather than the bare number. A dashboard that shows 0.83 with no interval invites people to over-interpret three-decimal-place jitter.
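The interval arithmetic above, as a sketch using the normal approximation (z = 1.96 for 95%). It is a rough guide for small samples, not an exact interval.

```python
import math


def proportion_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation CI for a proportion-like mean."""
    return z * math.sqrt(p * (1.0 - p) / n)


def mean_with_ci(scores: list, z: float = 1.96) -> tuple:
    """Sample mean and CI half-width from per-query scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate variance")
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, z * math.sqrt(variance / n)


print(round(proportion_half_width(0.83, 30), 2))   # 0.13
print(round(proportion_half_width(0.83, 120), 2))  # 0.07
```

Going from 30 to 120 queries quadruples the sample and halves the half-width, which is the square-root relationship in action.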
How you aggregate matters as much as how many. A single average across all queries hides catastrophic failure on a small but critical slice (legal questions, say, or a specific product line) behind a healthy overall mean. Segment your metrics by query type, document source, and difficulty, and watch the worst slice, not the average. A system at 0.90 overall faithfulness that sits at 0.55 on compliance questions is not a 0.90 system for the people who ask compliance questions. Per-slice reporting is also what makes the four-quadrant diagnostic actionable: you localize failures not just to retrieval-vs-generation but to which kind of query falls into each quadrant, which is the difference between “retrieval is weak” and “retrieval is weak specifically on multi-hop policy questions, fix the reranker there.”
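Per-slice reporting is a group-by over score rows. A sketch, assuming each row is a (slice name, score) pair; the slice names and scores are invented for illustration.

```python
from collections import defaultdict


def per_slice_means(rows: list) -> dict:
    """Mean score for each slice name."""
    buckets = defaultdict(list)
    for slice_name, score in rows:
        buckets[slice_name].append(score)
    return {name: sum(vals) / len(vals) for name, vals in buckets.items()}


def worst_slice(rows: list) -> tuple:
    """Name and mean of the lowest-scoring slice."""
    means = per_slice_means(rows)
    name = min(means, key=means.get)
    return name, means[name]


rows = [
    ("billing", 0.95), ("billing", 0.90),
    ("compliance", 0.60), ("compliance", 0.50),
]
overall = sum(score for _, score in rows) / len(rows)
print(round(overall, 2))  # 0.74
print(worst_slice(rows)[0])  # compliance
```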
Golden sets and synthetic test generation
Reference-based metrics need a golden set: query, ideal answer, and the chunks that should be retrieved. Hand-curating this is expensive but irreplaceable — a few hundred carefully labeled examples, drawn from real user questions and spanning your actual document distribution, anchor every reference-based number you report. Deliberately include the hard cases: multi-hop questions that require stitching two passages together, ambiguous questions with more than one defensible answer, and — most importantly — out-of-scope questions whose only correct answer is “the documents do not cover this.” That last class is where hallucination lives, and a golden set without it will report healthy faithfulness while the system quietly invents answers to questions the corpus cannot support.
Constructing the ideal answer is subtler than it looks. Write the reference as the answer a careful human expert would give using only your corpus, not the answer the open web would give — otherwise correctness will penalize a perfectly grounded response for omitting facts that were never retrievable. Record, alongside each reference, the set of chunk IDs that a perfect retriever should have surfaced; those IDs are what context recall scores against, and without them recall degenerates into another LLM guess. Capture provenance too: who wrote the reference, from which document version, on what date. A golden set is a living dataset with its own schema and version history, not a one-off spreadsheet someone exported and forgot.
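One possible shape for a golden-set record, with ID-based context recall computed against the stored chunk IDs. Every field name and value here is hypothetical and only illustrates the schema idea.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class GoldenExample:
    query: str
    reference_answer: str
    ideal_chunk_ids: frozenset  # what a perfect retriever should surface
    author: str                 # provenance: who wrote the reference
    doc_version: str            # provenance: which corpus version it reflects
    labeled_on: date
    tags: tuple = ()


def context_recall_from_ids(example: GoldenExample, retrieved_ids: list) -> Optional[float]:
    """Share of ideal chunk IDs that were retrieved; None when no chunks are expected."""
    if not example.ideal_chunk_ids:
        return None  # out-of-scope question: recall is undefined
    found = example.ideal_chunk_ids.intersection(retrieved_ids)
    return len(found) / len(example.ideal_chunk_ids)


example = GoldenExample(
    query="What is the refund window and does it cover digital goods?",
    reference_answer="Refunds are accepted within 30 days; digital goods are excluded.",
    ideal_chunk_ids=frozenset({"policy-12", "policy-31"}),  # made-up IDs
    author="labeler-a",
    doc_version="v3",
    labeled_on=date(2026, 1, 15),
    tags=("multi-part",),
)
print(context_recall_from_ids(example, ["policy-12", "faq-02"]))  # 0.5
```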
To scale beyond hand-labeling, RAGAS and DeepEval can generate synthetic test sets: sample passages from your corpus, prompt an LLM to write the questions those passages answer plus the reference answers, and vary the difficulty (single-hop, multi-hop, reasoning, ambiguous). RAGAS in particular builds an “evolution” of questions — starting from a simple query and progressively rewriting it into multi-context or reasoning variants — to stress the retriever the way real users do rather than lobbing softballs. Synthetic sets buy breadth cheaply, but every batch must be spot-checked by a human: an LLM-generated question that the same model can trivially answer from parametric memory measures nothing about your retriever. The strongest golden sets are hybrid — a synthetic backbone for coverage, a human-curated core for the queries that actually matter to the business, and a continuously growing tail of real production failures curated back in. Arize Phoenix and TruLens both provide dataset and tracing tooling to manage exactly this lifecycle, while DeepEval and LangSmith add the CI integration and dataset versioning that keep the golden set under source control alongside your code.
One discipline separates a golden set that ages well from one that rots: treat it like production data. Give it a schema, put it in version control, gate changes through review, and tag each example with the metrics it is meant to exercise. When you split it, hold out a slice that is never used for prompt tuning so you retain an honest measure of generalization — a golden set you have optimized against silently becomes a training set, and its numbers stop predicting production behavior.
Trade-offs, Gotchas, and What Goes Wrong
The dominant cost is the judge itself. Every RAGAS-style metric is several LLM calls per query — claim decomposition, then one NLI call per claim — so evaluating a thousand-query set can mean tens of thousands of model calls and a bill that scales with your ambition. This is what forces sampling in production: you cannot afford to LLM-grade every live request, so you score a representative sample and accept statistical uncertainty on the tail, prioritizing the low-confidence and high-stakes traffic where a hallucination is most expensive.
Nondeterminism is the second trap. LLM judges are stochastic; the same answer can score 0.8 on one run and 0.9 on the next. Set temperature to 0 where the API allows it, run multiple samples and average for high-stakes gates, and never treat a single-run score as ground truth. A regression gate that trips on a 0.02 delta will page you at 3 a.m. for noise. The disciplined move is to characterize your own noise floor first: run the same golden set through the same unchanged system five times, measure the standard deviation of each metric’s mean across those runs, and set the gate tolerance to two or three of those standard deviations. Now a trip means a real regression rather than the judge’s coin-flips.
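The noise-floor procedure reduces to a few lines with the standard library. The five baseline means below are invented; `sigmas` is the two-or-three multiplier discussed above.

```python
import statistics


def gate_threshold(baseline_run_means: list, sigmas: float = 3.0) -> float:
    """Lowest acceptable metric mean: baseline minus a multiple of run-to-run noise."""
    baseline = statistics.mean(baseline_run_means)
    noise = statistics.stdev(baseline_run_means)  # sample std dev across repeated runs
    return baseline - sigmas * noise


def passes_gate(candidate_mean: float, baseline_run_means: list, sigmas: float = 3.0) -> bool:
    return candidate_mean >= gate_threshold(baseline_run_means, sigmas)


# Faithfulness means from five runs of the same golden set on an unchanged system.
baseline_runs = [0.84, 0.82, 0.83, 0.85, 0.81]
print(round(gate_threshold(baseline_runs), 3))  # 0.783
print(passes_gate(0.81, baseline_runs))         # True
print(passes_gate(0.75, baseline_runs))         # False
```

A drop to 0.81 sits inside the measured jitter and passes; a drop to 0.75 is well outside it and fails the build.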
A quieter failure is golden-set staleness. The corpus changes — documents are added, policies updated, products renamed — but the golden set is frozen, so its references slowly diverge from ground truth and correctness scores decay for reasons that have nothing to do with your retriever or model. Version the golden set alongside the corpus, review it on a cadence, and expire references that no longer reflect the live documents. An eval suite that measures a world that no longer exists is worse than none, because it manufactures false confidence.
Metric gaming is subtler and more dangerous. Optimize hard enough for a single metric and you get a system that maximizes the metric while degrading real quality. Push faithfulness alone and the model learns to answer “the context does not specify” to everything — perfectly grounded, perfectly useless. Faithfulness must always be read alongside answer relevancy and correctness; a Goodhart’d metric is worse than no metric because it looks green while the product rots. Threshold-setting compounds this: thresholds are business decisions, not statistical constants. A 0.9 faithfulness floor is right for medical answers and absurdly strict for brainstorming. Set them from your own labeled data and the cost of each failure class, and revisit them as the corpus drifts. Treat the judge as a dependency with its own drift — pin its version, log which judge produced each score, and cross-check critical gates against a second model family, the same way you would monitor any other production model in your LLM observability and LLMOps stack.
Practical Recommendations
Start with the decomposition, not the tooling. Before you install anything, commit to always reporting retrieval and generation quality separately — one blended number is a trap. Then stand up faithfulness and answer relevancy first, because they need no references and catch the two failures that hurt most: hallucination and evasion. Add a small human-curated golden set (200–500 examples) for correctness and context recall once the reference-free metrics are stable.
Wire evaluation into CI as a blocking gate. On every change to the retriever, chunking, prompt, or model, run the golden set offline and fail the build if any core metric drops beyond a tolerance you set from observed run-to-run variance — not zero, because the judge is noisy. In production, sample a slice of live traffic, run the reference-free metrics (faithfulness, relevancy) online, and alert on drift. Curate the failures that sampling surfaces back into the golden set so the offline suite keeps pace with reality.

Figure 4: The two-loop evaluation lifecycle. The golden set gates deploys offline in CI; production sampling runs a judge online and raises drift and regression alerts; curated production failures flow back into the golden set, closing the loop so the offline suite keeps pace with live traffic.
Checklist:
- [ ] Report retrieval and generation metrics separately, never one blended score.
- [ ] Stand up faithfulness + answer relevancy first (no references needed).
- [ ] Build a golden set of 200–500 human-curated examples, including out-of-scope questions.
- [ ] Store the ideal chunk IDs with every reference so context recall has ground truth.
- [ ] Use a judge from a different model family than the system under test; pin its version.
- [ ] Calibrate the judge against human labels before trusting it at scale.
- [ ] Set the CI gate tolerance from run-to-run variance, not zero.
- [ ] Run online sampling with drift alerts; curate failures back into the golden set.
- [ ] Always read faithfulness alongside relevancy to prevent metric gaming.
Frequently Asked Questions
What is the difference between faithfulness and answer correctness?
Faithfulness measures whether the answer is grounded in the retrieved context — the fraction of the answer’s claims the context supports — and needs no reference answer. Answer correctness measures whether the answer matches an external ground truth, blending factual overlap with semantic similarity, and requires a labeled golden set. An answer can be perfectly faithful yet incorrect if the retrieved chunks themselves were wrong or incomplete. You need both: faithfulness catches hallucination, correctness catches retrieval and knowledge gaps that faithfulness alone cannot see.
How does RAGAS compute faithfulness without a reference answer?
RAGAS uses a two-stage LLM pipeline. First it decomposes the generated answer into atomic claims — self-contained factual statements. Then, for each claim, an NLI-style judge checks whether the retrieved context entails it. Faithfulness is the fraction of claims that are entailed. Because it only compares the answer against the context it was given, no ground-truth reference is required, which makes faithfulness cheap enough to run continuously on live production traffic where you rarely have labels.
Why not just use BLEU or ROUGE to evaluate RAG?
BLEU and ROUGE measure n-gram overlap against a reference string. A correct RAG answer can be phrased entirely differently from your reference and score near zero, while a fluent paraphrase of the wrong document scores high. Neither metric can tell whether a claim is supported by the retrieved context — the one property that defines RAG quality. They were built for machine-translation and summarization overlap, not for grounding-aware, claim-level factuality, so they systematically mislead on RAG.
Which LLM-as-judge biases should I worry about most?
Three dominate: position bias (favoring whichever answer appears first in a pairwise comparison), verbosity bias (rewarding longer answers as if length equals quality), and self-preference bias (a judge rating its own model family higher). Mitigate by averaging over both orderings, scoring against an example-anchored rubric instead of overall preference, and choosing a judge from a different family than the system under test. Always calibrate the judge’s agreement against human labels before trusting its scores at scale.
How large should my golden set be?
Start with 200–500 human-curated examples that span your real document distribution and question types (single-hop, multi-hop, ambiguous, out-of-scope). That is enough to produce stable per-metric means and catch meaningful regressions. Scale coverage with synthetically generated tests, but keep a human-curated core for the queries that matter and continuously fold in real production failures. Quality and distribution coverage matter far more than raw size — a thousand near-duplicate questions measure less than three hundred diverse ones.
Can I run RAG evaluation entirely online in production?
Partly. Reference-free metrics — faithfulness and answer relevancy — run online because they need only the trace, so you can sample live traffic and score it continuously. Reference-based metrics (answer correctness, context recall) need ground truth, so they run offline against your golden set in CI. The mature pattern is both loops: offline gates deploys, online sampling catches drift the golden set never anticipated, and online failures are curated back into the offline set. Online scoring costs money per request, so sample rather than grade everything.
How do I set a regression threshold that doesn’t page me for noise?
Characterize your noise floor first. Run the identical golden set through the unchanged system several times and measure the standard deviation of each metric’s mean across runs; with the system and the queries held fixed, that spread is pure judge and sampling noise. Set the gate threshold two to three standard deviations below the baseline mean, so only a move larger than the system’s own jitter trips it. Report a confidence interval next to every mean, and expect it to be wider when your query set is small. A gate tuned to real variance fires on regressions, not coin-flips.
Further Reading
- Corrective RAG and Self-RAG architecture patterns — retrieval-correction loops that depend on the measurable retrieval quality this post defines.
- LLM observability and LLMOps architecture — where evaluation fits inside the broader tracing and monitoring stack.
- GraphRAG knowledge-graph retrieval-augmented generation — a retrieval architecture whose multi-hop nature makes context recall the metric to watch.
- RAGAS documentation — canonical definitions and reference implementations of the metrics formalized here.
- Evaluating Retrieval-Augmented Generation (arXiv) — a survey of RAG evaluation frameworks, metrics, and benchmarks, including RAGAS and ARES.
By Riju
