RAG-specific evaluation

A RAG demo that “looks smart” on three PDFs is not evidence of production quality. Production RAG fails in two independent places: retrieval (the wrong chunk lands in context) and generation (the model ignores the context or hallucinates beyond it). End-to-end accuracy alone hides which layer regressed when you change chunk size, hybrid weights, or the answer prompt. This guide is a production-first methodology for measuring all three layers (retrieval-only, generation-only, and full pipeline) with classical IR metrics, faithfulness scorers, framework comparisons, and the dashboard patterns teams use before widening rollout.

After reading, you should be able to: split RAG eval into retrieval, generation, and end-to-end dimensions; compute Precision@K, Recall@K, NDCG, and MRR on labelled chunk IDs; score faithfulness, grounding, and citation accuracy; choose between RAGAS, TruLens, DeepEval, and UpTrain for your stack; build QCA test sets from manual and synthetic sources; and wire a RAG eval dashboard into CI and nightly regression.

Tags: developer, platform, architect, Track 5, RAGAS, DeepEval, Recall@K, Faithfulness, LangChain 0.3+, Spring AI 1.0+

Three evaluation dimensions

RAG quality is not one number. Treat retrieval, generation, and end-to-end as separate experiments with separate gold labels, separate baselines, and separate regression gates. When end-to-end accuracy drops, you need to know whether the index, the reranker, or the answer prompt moved—not guess from a single “accuracy” column.

The mental model mirrors traditional search plus summarization: first ask “did we fetch the right evidence?”, then ask “did the model stay inside that evidence?”, and finally ask “would a user accept this answer?” Skipping straight to the third question wastes LLM judge tokens on cases where retrieval already failed.

Dimension map

| Dimension | What you measure | Gold labels needed | Typical failure → fix |
| --- | --- | --- | --- |
| Retrieval-only | Ranked chunk list vs labelled relevant IDs | gold_chunk_ids per question | Chunking, hybrid weights, filters, reranker |
| Generation-only | Answer faithfulness given fixed context | Question + frozen context + reference answer or claims | Prompt, temperature, citation format, model swap |
| End-to-end | Full pipeline: question → retrieve → answer | Question + expected facts / acceptable refusals | Both layers + routing + UX (citations shown?) |

Evaluation flow (offline)

  1. Build or import a QCA test set — questions with chunk IDs and optional reference answers (see Test sets).
  2. Run retrieval eval — log ranked chunk IDs; compute Recall@K, MRR, NDCG without calling the answer LLM.
  3. Run generation eval — inject gold context (or top-K from step 2) into a frozen prompt; score faithfulness.
  4. Run end-to-end eval — full pipeline; compare fact-match rate and user-rubric scores to baselines.
  5. Attribute regressions — if retrieval metrics stable but E2E drops, blame generation; if retrieval drops, fix index before tuning prompts.
Three-layer eval orchestrator
# recall_at_k and mrr are the helpers from "Retrieval metrics implementation" later in this guide
from dataclasses import dataclass
from typing import Callable

@dataclass
class RagEvalCase:
    id: str
    question: str
    gold_chunk_ids: list[str]
    gold_answer_contains: list[str] | None = None
    frozen_context: str | None = None  # for generation-only

@dataclass
class LayerScores:
    retrieval: dict
    generation: dict | None
    end_to_end: dict

def run_three_layer_eval(
    cases: list[RagEvalCase],
    retrieve_fn: Callable,
    answer_fn: Callable,
    faithfulness_fn: Callable,
    k: int = 5,
) -> dict:
    ret_recalls, ret_mrrs = [], []
    gen_faith, e2e_ok = [], []

    for case in cases:
        hits = retrieve_fn(case.question, k=k)
        hit_ids = [h["id"] for h in hits]
        gold = set(case.gold_chunk_ids)  # the metric helpers intersect sets, so convert the list
        ret_recalls.append(recall_at_k(hit_ids, gold, k))
        ret_mrrs.append(mrr(hit_ids, gold))

        ctx = case.frozen_context or "\n\n".join(h["text"] for h in hits[:k])
        answer = answer_fn(case.question, ctx)
        gen_faith.append(faithfulness_fn(answer, ctx))

        must = case.gold_answer_contains or []
        e2e_ok.append(all(s.lower() in answer.lower() for s in must) if must else True)

    n = len(cases)
    return {
        "retrieval": {"recall@%d" % k: sum(ret_recalls) / n, "mrr": sum(ret_mrrs) / n},
        "generation": {"faithfulness_mean": sum(gen_faith) / n},
        "end_to_end": {"fact_match_rate": sum(e2e_ok) / n},
    }

// Java equivalent of the Python orchestrator above
public record RagEvalCase(
    String id, String question, List<String> goldChunkIds,
    List<String> goldAnswerContains, String frozenContext) {}

public record LayerScores(Map<String, Double> retrieval,
                          Map<String, Double> generation,
                          Map<String, Double> endToEnd) {}

public LayerScores runThreeLayerEval(
    List<RagEvalCase> cases,
    BiFunction<String, Integer, List<RetrievedChunk>> retrieveFn,
    BiFunction<String, String, String> answerFn,
    BiFunction<String, String, Double> faithfulnessFn,
    int k) {

    List<Double> recalls = new ArrayList<>();
    List<Double> mrrs = new ArrayList<>();
    List<Double> faith = new ArrayList<>();
    int e2ePass = 0;

    for (RagEvalCase c : cases) {
        List<RetrievedChunk> hits = retrieveFn.apply(c.question(), k);
        List<String> ids = hits.stream().map(RetrievedChunk::id).toList();
        Set<String> gold = Set.copyOf(c.goldChunkIds());  // RetrievalMetrics helpers take a Set
        recalls.add(RetrievalMetrics.recallAtK(ids, gold, k));
        mrrs.add(RetrievalMetrics.mrr(ids, gold));

        String ctx = c.frozenContext() != null ? c.frozenContext()
            : hits.stream().limit(k).map(RetrievedChunk::text)
                .collect(Collectors.joining("\n\n"));
        String answer = answerFn.apply(c.question(), ctx);
        faith.add(faithfulnessFn.apply(answer, ctx));

        if (c.goldAnswerContains() == null || c.goldAnswerContains().isEmpty()) {
            e2ePass++;
        } else {
            String lower = answer.toLowerCase();
            boolean ok = c.goldAnswerContains().stream()
                .allMatch(s -> lower.contains(s.toLowerCase()));
            if (ok) e2ePass++;
        }
    }
    int n = cases.size();
    return new LayerScores(
        Map.of("recall@" + k, recalls.stream().mapToDouble(d -> d).average().orElse(0),
               "mrr", mrrs.stream().mapToDouble(d -> d).average().orElse(0)),
        Map.of("faithfulness_mean", faith.stream().mapToDouble(d -> d).average().orElse(0)),
        Map.of("fact_match_rate", (double) e2ePass / n));
}
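
Step 5 of the evaluation flow (attributing a regression to a layer) can be sketched as a small comparison against stored baselines. This is a minimal illustration, not part of the orchestrator above: the flat metric keys (`recall`, `faithfulness`, `fact_match`) and the 0.02 tolerance are assumptions chosen for the example.

```python
def attribute_regression(baseline: dict, current: dict, tol: float = 0.02) -> str:
    """Name the layer to investigate first, given flat metric dicts.

    Keys and the default tolerance are illustrative assumptions.
    """
    def dropped(key: str) -> bool:
        return baseline[key] - current[key] > tol

    if dropped("recall"):
        return "retrieval"      # fix the index before tuning prompts
    if dropped("faithfulness") or dropped("fact_match"):
        return "generation"     # retrieval is stable, answers got worse
    return "none"

base = {"recall": 0.90, "faithfulness": 0.88, "fact_match": 0.85}
gen_drop = {"recall": 0.90, "faithfulness": 0.80, "fact_match": 0.78}
ret_drop = {"recall": 0.82, "faithfulness": 0.80, "fact_match": 0.78}

print(attribute_regression(base, gen_drop))  # generation
print(attribute_regression(base, ret_drop))  # retrieval
```

Checking retrieval first mirrors the rule in the list: when retrieval drops, downstream scores are not trustworthy until the index is fixed.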

When to run which layer

| Change under test | Run first | Optional |
| --- | --- | --- |
| Chunk size / overlap | Retrieval-only | E2E smoke (10 rows) |
| Embedding model migration | Retrieval-only + re-embed index | NDCG compare old vs new |
| Answer system prompt | Generation-only (frozen context) | E2E |
| Reranker swap | Retrieval-only @ same K | Latency benchmark |
| Full pipeline release | All three | Online shadow traffic |

⚠️ Pitfall

Reporting only end-to-end “accuracy” on 20 demo questions. When hybrid search regresses recall@5 by 8 points but the LLM compensates with parametric knowledge, E2E may look flat until production users ask questions outside the model's training distribution and hallucinations spike. Always log retrieval metrics alongside answer scores.

🔬 Under the Hood

Retrieval eval is cheap: embed + search, no generation tokens. Generation-only eval fixes context so you isolate the LLM. That separation is how teams A/B answer prompts without re-indexing, and re-index without burning judge budget on every chunk experiment.

⚖️ Trade-off

Three layers mean three CI jobs and three dashboards. For small teams, start with retrieval + E2E fact checks; add generation-only when prompt iteration outpaces index changes. The generation layer pays off when legal/compliance requires proving answers cite provided context—not training data.

Retrieval metrics — Precision@K, Recall@K, NDCG, MRR

Retrieval evaluation treats your vector + sparse + rerank pipeline as a ranked list problem. Given a question and a set of labelled relevant chunk IDs, you score how high those chunks appear in top-K results. These metrics predate LLMs; they remain the fastest signal for chunking and index experiments.

Label format: each test row has question and gold_chunk_ids—one or more chunk IDs that contain the answer. Run your retriever, record ordered IDs, compute metrics. No LLM generation required. For multi-hop questions, gold may include two chunks; Recall@K counts how many gold IDs appear in top-K (set overlap).

Metric definitions

| Metric | Formula intuition | Range | Best when |
| --- | --- | --- | --- |
| Precision@K | (# relevant in top K) / K | 0–1 | Context window is tight; false positives hurt (long chunks) |
| Recall@K | (# gold IDs found in top K) / (total gold IDs) | 0–1 | Multi-chunk answers; you must not miss a required passage |
| MRR (Mean Reciprocal Rank) | Average of 1/rank of first relevant hit | 0–1 | Single best chunk matters; FAQ-style Q&A |
| NDCG@K | Discounted cumulative gain vs ideal ranking | 0–1 | Graded relevance (highly vs partially relevant chunks) |

Worked example

Gold: ["policy-refunds-2", "policy-refunds-5"]. Retrieved top 5: ["faq-shipping-1", "policy-refunds-5", "policy-refunds-2", "policy-refunds-7", "billing-3"].

  • Precision@5 = 2 relevant / 5 = 0.40 (only refunds-2 and refunds-5 are gold; if refunds-7 were also labelled relevant, it would be 3/5 = 0.60)
  • Recall@5 = 2 gold found / 2 gold total = 1.0
  • MRR = 1/2 = 0.50 (first gold at rank 2)
  • NDCG@5 — requires graded gains; if both gold chunks are equally relevant, NDCG rewards putting both high
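
The worked example can be checked in a few lines, including a concrete NDCG@5 value. This sketch assumes binary gains (each gold chunk has relevance 1.0), which is one reasonable reading of “equally relevant”.

```python
import math

gold = {"policy-refunds-2", "policy-refunds-5"}
retrieved = ["faq-shipping-1", "policy-refunds-5", "policy-refunds-2",
             "policy-refunds-7", "billing-3"]
k = 5

hits = [1 if cid in gold else 0 for cid in retrieved[:k]]   # [0, 1, 1, 0, 0]
precision = sum(hits) / k
recall = sum(hits) / len(gold)
first = next((rank for rank, h in enumerate(hits, start=1) if h), None)
rr = 1 / first if first else 0.0

# DCG discounts each hit by log2(rank + 1); the ideal ranking puts all gold first
dcg = sum(h / math.log2(rank + 1) for rank, h in enumerate(hits, start=1))
idcg = sum(1 / math.log2(rank + 1) for rank in range(1, min(len(gold), k) + 1))
ndcg = dcg / idcg

print(round(precision, 2), round(recall, 2), round(rr, 2), round(ndcg, 3))
# 0.4 1.0 0.5 0.693
```

NDCG@5 lands below 1.0 even though recall is perfect, because both gold chunks sit one rank lower than the ideal ordering.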
Retrieval metrics implementation
import math

def precision_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    top = retrieved[:k]
    if not top:
        return 0.0
    hits = sum(1 for cid in top if cid in gold)
    return hits / k

def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    if not gold:
        return 1.0 if not retrieved[:k] else 0.0
    top = set(retrieved[:k])
    return len(top & gold) / len(gold)

def mrr(retrieved: list[str], gold: set[str]) -> float:
    for rank, cid in enumerate(retrieved, start=1):
        if cid in gold:
            return 1.0 / rank
    return 0.0

def dcg_at_k(relevances: list[float], k: int) -> float:
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(retrieved: list[str], graded: dict[str, float], k: int) -> float:
    gains = [graded.get(cid, 0.0) for cid in retrieved[:k]]
    ideal = sorted(graded.values(), reverse=True)[:k]
    idcg = dcg_at_k(ideal, k)
    if idcg == 0:
        return 0.0
    return dcg_at_k(gains, k) / idcg

def eval_retrieval_suite(rows, search_fn, k: int = 5) -> dict:
    p, r, m, n = [], [], [], []
    for row in rows:
        gold = set(row["gold_chunk_ids"])
        ids = [h["id"] for h in search_fn(row["question"], k=k)]
        p.append(precision_at_k(ids, gold, k))
        r.append(recall_at_k(ids, gold, k))
        m.append(mrr(ids, gold))
        graded = {cid: 1.0 for cid in gold}
        n.append(ndcg_at_k(ids, graded, k))
    cnt = len(rows)
    return {
        "precision@%d" % k: sum(p) / cnt,
        "recall@%d" % k: sum(r) / cnt,
        "mrr": sum(m) / cnt,
        "ndcg@%d" % k: sum(n) / cnt,
    }

// Java equivalent of the Python retrieval metrics above
public final class RetrievalMetrics {

    public static double precisionAtK(List<String> retrieved, Set<String> gold, int k) {
        List<String> top = retrieved.subList(0, Math.min(k, retrieved.size()));
        if (top.isEmpty()) return 0.0;
        long hits = top.stream().filter(gold::contains).count();
        return (double) hits / k;
    }

    public static double recallAtK(List<String> retrieved, Set<String> gold, int k) {
        if (gold.isEmpty()) return retrieved.isEmpty() ? 1.0 : 0.0;
        Set<String> top = new HashSet<>(retrieved.subList(0, Math.min(k, retrieved.size())));
        top.retainAll(gold);
        return (double) top.size() / gold.size();
    }

    public static double mrr(List<String> retrieved, Set<String> gold) {
        for (int i = 0; i < retrieved.size(); i++) {
            if (gold.contains(retrieved.get(i))) return 1.0 / (i + 1);
        }
        return 0.0;
    }

    public static double ndcgAtK(List<String> retrieved, Map<String, Double> graded, int k) {
        double dcg = 0, idcg = 0;
        List<Double> ideal = graded.values().stream()
            .sorted(Comparator.reverseOrder()).limit(k).toList();
        for (int i = 0; i < k && i < retrieved.size(); i++) {
            dcg += graded.getOrDefault(retrieved.get(i), 0.0) / (Math.log(i + 2) / Math.log(2));
        }
        for (int i = 0; i < ideal.size(); i++) {
            idcg += ideal.get(i) / (Math.log(i + 2) / Math.log(2));
        }
        return idcg == 0 ? 0.0 : dcg / idcg;
    }
}

Choosing K and gates

| Setting | Typical K | Gate example |
| --- | --- | --- |
| Stuff-all-context (8k window) | K = 10–20 | Recall@10 ≥ baseline − 0.02 |
| Rerank 50 → keep 5 | Measure @ 50 and @ 5 | MRR@50 for coarse; Precision@5 post-rerank |
| Citation UI shows 3 sources | K = 3 | Recall@3 ≥ 0.85 on gold set |
| Agentic multi-retrieve | Per-hop K + union | Per-hop recall + union recall |

⚙️ Config

Store baseline metrics in git: eval/baselines/retrieval_v3.json with commit SHA, index version, embed model, and metric snapshot. CI compares PR branch to that file; fail if recall@5 drops more than your agreed delta (often 2–3 points on small sets, tighter on 500+ rows).
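
A baseline comparison of this kind might look like the sketch below. The JSON layout (`commit`, `index_version`, `embed_model`, `metrics`) is an assumed shape for illustration; the guide only specifies what the file should record, not its exact field names.

```python
import json
import tempfile
from pathlib import Path

def check_against_baseline(baseline_path: str, current: dict, max_drop: float = 0.02) -> list[str]:
    """Return the metrics that fell more than max_drop below the stored baseline."""
    baseline = json.loads(Path(baseline_path).read_text())["metrics"]
    return [
        name for name, base_value in baseline.items()
        if name in current and base_value - current[name] > max_drop
    ]

# Demo with a throwaway baseline file; all values are made up for the example
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "retrieval_v3.json"
    path.write_text(json.dumps({
        "commit": "abc1234", "index_version": "v3", "embed_model": "example-embed",
        "metrics": {"recall@5": 0.90, "mrr": 0.72},
    }))
    failed = check_against_baseline(str(path), {"recall@5": 0.86, "mrr": 0.73})
    print(failed)  # ['recall@5']
```

In CI, a non-empty list would translate into a non-zero exit code so the PR check fails.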

💰 Cost

Retrieval-only eval on 500 questions against a warm index costs embedding API fees only—roughly $0.01–0.05 per 1k queries with small embed models—versus $5–50 for the same count with LLM judges. Run retrieval eval on every index PR; reserve judges for generation layers.

📦 Real World

Notion and internal KB teams often maintain a “retrieval leaderboard” per collection: dense vs hybrid vs +rerank, updated nightly. Product picks the Pareto point (recall vs p95 latency) from that table—not from demo screenshots.

Generation metrics — faithfulness, grounding, citation accuracy

Once context is fixed (gold retrieval or logged production chunks), measure whether the answer stays inside that context. Faithfulness penalizes unsupported claims. Grounding checks entailment from context to answer. Citation accuracy verifies that cited chunk IDs actually support the attached sentences.

Generation metrics catch a different failure mode than retrieval: the right chunk is in the prompt, but the model summarizes incorrectly, merges two policies, or cites chunk B while stating facts from chunk A. Regulated domains (finance, healthcare, internal legal) often gate releases on faithfulness scores even when BLEU/ROUGE look fine, because those n-gram metrics reward surface overlap with a reference answer, not factual support in the context.

Metric comparison

| Metric | Question it answers | Implementation tiers |
| --- | --- | --- |
| Faithfulness | Is every claim in the answer supported by context? | Claim decomposition + NLI; LLM judge; RAGAS faithfulness |
| Answer relevance | Does the answer address the question (without extra fluff)? | Embedding similarity Q↔A; LLM rubric |
| Grounding score | Fraction of answer sentences entailed by context | NLI model (e.g. deberta-v3-base-mnli); TruLens groundedness |
| Citation accuracy | Do [1][2] pointers match supporting spans? | Parse citations → retrieve span → NLI or string overlap |
| Hallucination rate | % answers with ≥1 unsupported atomic claim | Inverse of faithfulness on claim set |

Cheap faithfulness before LLM judges

  1. Split answer into sentences (or bullet claims).
  2. For each claim, check if any context sentence contains overlapping entities + predicate (keyword heuristic).
  3. Escalate low-confidence claims to NLI or LLM judge.
  4. Score = supported claims / total claims.
Faithfulness + citation accuracy
import re

CITE_PATTERN = re.compile(r"\[(\d+)\]")

def split_claims(answer: str) -> list[str]:
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if len(p.split()) > 3]

def keyword_supported(claim: str, context: str) -> bool:
    tokens = {t.lower() for t in re.findall(r"\w{4,}", claim)}
    ctx = context.lower()
    return sum(1 for t in tokens if t in ctx) >= max(1, len(tokens) // 2)

def faithfulness_score(answer: str, context: str) -> float:
    claims = split_claims(answer)
    if not claims:
        return 1.0
    supported = sum(1 for c in claims if keyword_supported(c, context))
    return supported / len(claims)

def citation_accuracy(answer: str, chunks: list[dict]) -> float:
    """chunks: [{id, text}] indexed 1-based in citations [1], [2]."""
    total = ok = 0
    # Score each citation against the sentence it is attached to, not the whole answer.
    # Assumes markers sit before the sentence-ending punctuation, e.g. "... 14 days [1]."
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        refs = CITE_PATTERN.findall(sentence)
        claim = CITE_PATTERN.sub("", sentence).strip()
        for ref in refs:
            total += 1
            idx = int(ref) - 1
            # Out-of-range references count as failures (broken chunk pointer)
            if 0 <= idx < len(chunks) and keyword_supported(claim, chunks[idx]["text"]):
                ok += 1
    return ok / total if total else 1.0  # no citations: nothing to verify

// Java equivalent of the Python generation metrics above
public final class GenerationMetrics {
    private static final Pattern CITE = Pattern.compile("\\[(\\d+)\\]");

    public static double faithfulnessScore(String answer, String context) {
        List<String> claims = splitClaims(answer);
        if (claims.isEmpty()) return 1.0;
        long supported = claims.stream()
            .filter(c -> keywordSupported(c, context)).count();
        return (double) supported / claims.size();
    }

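    public static double citationAccuracy(String answer, List<Chunk> chunks) {
        int total = 0, ok = 0;
        // Score each citation against the sentence it is attached to, not the whole answer
        for (String sentence : answer.trim().split("(?<=[.!?])\\s+")) {
            String claim = CITE.matcher(sentence).replaceAll("").trim();
            Matcher m = CITE.matcher(sentence);
            while (m.find()) {
                total++;
                int idx = Integer.parseInt(m.group(1)) - 1;
                if (idx >= 0 && idx < chunks.size()
                        && keywordSupported(claim, chunks.get(idx).text())) {
                    ok++;
                }
            }
        }
        return total == 0 ? 1.0 : (double) ok / total;
    }

    private static boolean keywordSupported(String claim, String context) {
        Set<String> tokens = tokenize(claim, 4);
        String ctx = context.toLowerCase();
        long hits = tokens.stream().filter(ctx::contains).count();
        return hits >= Math.max(1, tokens.size() / 2);
    }

    // Sentences with more than three words count as claims (mirrors split_claims in Python)
    private static List<String> splitClaims(String answer) {
        return Arrays.stream(answer.trim().split("(?<=[.!?])\\s+"))
            .filter(s -> s.trim().split("\\s+").length > 3).toList();
    }

    // Lower-cased word tokens of at least minLen characters
    private static Set<String> tokenize(String text, int minLen) {
        Set<String> tokens = new HashSet<>();
        Matcher m = Pattern.compile("\\w{" + minLen + ",}").matcher(text.toLowerCase());
        while (m.find()) tokens.add(m.group());
        return tokens;
    }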
}
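
The heuristic code above covers claim splitting, keyword checks, and scoring, but not step 3 of the cheap-faithfulness recipe (escalating low-confidence claims). One way to sketch that tiering is below. The `judge_fn` callback and the 0.8 / 0.2 thresholds are assumptions for illustration; in practice `judge_fn` would wrap an NLI model or an LLM judge.

```python
import re
from typing import Callable

def overlap_ratio(claim: str, context: str) -> float:
    """Share of the claim's 4+ character tokens that appear in the context."""
    tokens = {t.lower() for t in re.findall(r"\w{4,}", claim)}
    if not tokens:
        return 0.0
    ctx = context.lower()
    return sum(1 for t in tokens if t in ctx) / len(tokens)

def tiered_faithfulness(
    claims: list[str],
    context: str,
    judge_fn: Callable[[str, str], bool],   # hypothetical stronger checker
    accept: float = 0.8,
    reject: float = 0.2,
) -> tuple[float, int]:
    """Return (score, number of claims escalated to judge_fn)."""
    if not claims:
        return 1.0, 0
    supported = escalated = 0
    for claim in claims:
        ratio = overlap_ratio(claim, context)
        if ratio >= accept:
            supported += 1                   # confident keyword match
        elif ratio > reject:
            escalated += 1                   # ambiguous: pay for the stronger check
            supported += bool(judge_fn(claim, context))
        # ratio <= reject: counted as unsupported without escalation
    return supported / len(claims), escalated

context = "Refunds are issued within 14 days of purchase for annual plans."
claims = [
    "Refunds are issued within 14 days.",
    "Annual plans include priority support.",
    "Shipping takes three weeks overseas.",
]
score, escalated = tiered_faithfulness(claims, context, judge_fn=lambda c, ctx: False)
print(round(score, 2), escalated)  # 0.33 1
```

Only the middle claim reaches the judge here, which is the point of the tiering: clear matches and clear misses never spend judge tokens.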

RAGAS-style generation scores (conceptual)

RAGAS popularized LLM-assisted metrics: faithfulness (claims ⊆ context), answer_relevancy (question ↔ answer), context_precision (relevant chunks ranked high), context_recall (gold facts in retrieved context). You can implement the same ideas without the library—see Eval frameworks for RAGAS integration sketches.

🔒 Security

Faithfulness eval prompts include full retrieved context—which may contain PII or secrets from your corpus. Run offline eval in isolated environments; redact production logs before they enter judge prompts. A “faithful” answer can still leak context if the eval set includes restricted documents without ACL simulation.

🎯 Interview Tip

“How do you eval RAG?” — Open with three layers; give Recall@5 for retrieval; faithfulness for generation; mention citation accuracy for UI with footnotes; close with CI gates and human calibration quarterly—not “we eyeball it.”

💡 Pro Tip

Log context_hash + prompt_version + chunk_ids on every production answer. Generation-only replay becomes a one-liner: fetch logged context, re-run answer prompt, diff faithfulness scores.
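
A log record along these lines could be built as follows. This is a sketch: the field names and the choice of SHA-256 are assumptions, and a real system would also store the context itself (or a pointer to it) so the replay can fetch it by hash.

```python
import hashlib
from datetime import datetime, timezone

def build_answer_log(question: str, chunks: list[dict], prompt_version: str, answer: str) -> dict:
    """One log record per production answer; field names are illustrative."""
    context = "\n\n".join(c["text"] for c in chunks)
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "question_hash": hashlib.sha256(question.encode("utf-8")).hexdigest(),
        "chunk_ids": [c["id"] for c in chunks],
        # Same chunks in the same order always produce the same hash
        "context_hash": hashlib.sha256(context.encode("utf-8")).hexdigest(),
        "prompt_version": prompt_version,
        "answer": answer,
    }

record = build_answer_log(
    "Can I refund a monthly plan after 14 days?",
    [{"id": "refund-policy.txt-0", "text": "Partial months are not refunded after 14 days."}],
    prompt_version="answer-v7",
    answer="No, partial months are not refunded after 14 days.",
)
print(record["chunk_ids"], record["context_hash"][:12])
```

Hashing the question rather than storing it verbatim keeps raw user text out of the eval log, at the cost of needing a separate, access-controlled store for replay.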

Eval frameworks — RAGAS, TruLens, DeepEval, UpTrain

You can implement metrics by hand (previous sections). Frameworks bundle dataset schemas, LLM judges, tracing, and CI hooks. Pick based on language stack, observability needs, and whether you want batteries-included RAG metrics or a general eval harness.

Framework comparison

| Framework | Strengths | RAG metrics | Tracing / prod | Java / JVM | Typical fit |
| --- | --- | --- | --- | --- | --- |
| RAGAS | RAG-native metrics; research-backed; LangChain ecosystem | Faithfulness, answer relevancy, context precision/recall | Offline batch; export to CSV | Python-first; JVM via HTTP wrapper | Teams standardizing RAG metric definitions |
| TruLens | Instrument apps; feedback functions on traces | Groundedness, context relevance, composite | Strong: trace every app call | Python SDK; Java via OpenTelemetry bridge | Apps already using TruLens for LLM tracing |
| DeepEval | Pytest-style tests; 14+ metrics; CI integration | Hallucination, faithfulness, contextual recall/precision | Confident AI cloud dashboard optional | Python-first | Engineering teams wanting deepeval test run in CI |
| UpTrain | Open-source + managed; multimodal; guardrails | Context relevance, factual accuracy, response completeness | Online evals + alerting | REST API from any language | Platform teams needing hosted dashboards fast |

RAGAS batch eval sketch

RAGAS evaluation run
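# pip install ragas datasets langchain-openai
# Column names below follow the ragas 0.1-style schema; newer releases rename them, so check your pinned version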
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

def build_ragas_dataset(rows, rag_pipeline) -> Dataset:
    records = []
    for row in rows:
        out = rag_pipeline(row["question"])
        records.append({
            "question": row["question"],
            "answer": out["answer"],
            "contexts": out["context_texts"],  # list[str]
            "ground_truth": row.get("reference_answer", ""),
        })
    return Dataset.from_list(records)

def run_ragas_eval(rows, rag_pipeline) -> dict:
    ds = build_ragas_dataset(rows, rag_pipeline)
    result = evaluate(
        ds,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    return result.to_pandas().mean(numeric_only=True).to_dict()

# CI gate example
if __name__ == "__main__":
    import json
    from pathlib import Path
    rows = [json.loads(l) for l in Path("eval/qca.jsonl").read_text().splitlines() if l.strip()]
    # my_rag_pipeline: your own callable returning {"answer": str, "context_texts": list[str]}
    scores = run_ragas_eval(rows, my_rag_pipeline)
    assert scores["faithfulness"] >= 0.85, scores
    print(scores)
// Spring service calling RAGAS via sidecar HTTP or internal Python worker
@Service
public class RagasEvalClient {
    private final RestClient rest;

    public RagasEvalClient(RestClient.Builder builder) {
        this.rest = builder.baseUrl("http://ragas-worker:8080").build();
    }

    public Map<String, Double> evaluateBatch(List<QcaRow> rows, RagPipeline pipeline) {
        List<RagasRecord> records = rows.stream().map(row -> {
            RagResult out = pipeline.answer(row.question());
            return new RagasRecord(
                row.question(), out.answer(), out.contextTexts(),
                row.referenceAnswer() != null ? row.referenceAnswer() : "");
        }).toList();

        return rest.post()
            .uri("/v1/evaluate")
            .body(new RagasEvalRequest(records, List.of(
                "faithfulness", "answer_relevancy",
                "context_precision", "context_recall")))
            .retrieve()
            .body(new ParameterizedTypeReference<Map<String, Double>>() {});
    }

    public void assertFaithfulnessGate(List<QcaRow> rows, RagPipeline pipeline, double min) {
        Map<String, Double> scores = evaluateBatch(rows, pipeline);
        if (scores.getOrDefault("faithfulness", 0.0) < min) {
            throw new EvalRegressionException("faithfulness below gate: " + scores);
        }
    }
}

DeepEval pytest-style sketch

DeepEval RAG test cases
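# pip install deepeval
# load_qca_rows and the rag_pipeline fixture are project helpers you supply; they are not part of deepeval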
import pytest
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric, ContextualRecallMetric
from deepeval.test_case import LLMTestCase

@pytest.mark.parametrize("row", load_qca_rows("eval/qca.jsonl"))
def test_rag_faithfulness(row, rag_pipeline):
    out = rag_pipeline(row["question"])
    test_case = LLMTestCase(
        input=row["question"],
        actual_output=out["answer"],
        retrieval_context=out["context_texts"],
        expected_output=row.get("reference_answer"),
    )
    faith = FaithfulnessMetric(threshold=0.8)
    recall = ContextualRecallMetric(threshold=0.75)
    assert_test(test_case, [faith, recall])

# Run: deepeval test run tests/test_rag_eval.py
// DeepEval is Python-native; JVM teams mirror the pattern with JUnit + HTTP judge
@Test
void ragFaithfulnessGate() throws Exception {
    List<QcaRow> smoke = qcaLoader.load("eval/qca_smoke.jsonl");
    List<EvalFailure> failures = new ArrayList<>();

    for (QcaRow row : smoke) {
        RagResult out = ragPipeline.answer(row.question());
        JudgeVerdict verdict = deepEvalClient.judge(new JudgeRequest(
            row.question(), out.answer(), out.contextTexts(),
            row.referenceAnswer(), List.of("faithfulness", "contextual_recall"),
            0.8, 0.75));
        if (!verdict.passed()) failures.add(new EvalFailure(row.id(), verdict));
    }
    assertTrue(failures.isEmpty(), "RAG eval failures: " + failures);
}

Selection guide

  • Already on LangChain/LlamaIndex → RAGAS integrates cleanly; start with faithfulness + context_recall.
  • Need production tracing + eval on same spans → TruLens feedback functions on retrieved contexts.
  • Want pytest in CI tomorrow → DeepEval with threshold metrics and deepeval test run.
  • Python-averse platform team → UpTrain REST or custom Java judge service; keep metric definitions in one OpenAPI spec.
☸️ OpenShift / K8s

Run RAGAS/DeepEval as a nightly CronJob with a frozen index snapshot mounted from S3. PR pipelines call a slim retrieval-only job; the full RAGAS suite needs no GPU but does consume LLM judge API quota, so schedule it off-peak. Store results as Prometheus metrics or push them to your eval DB.

🏗️ IaC

Version eval dependencies in lockfiles (ragas==0.2.x, judge model ID in Terraform/Helm values). Changing judge model without re-baseline invalidates historical dashboards—tag every eval run with judge_model and embed_model.

⚖️ Trade-off

Framework metrics correlate with human judgment but are not ground truth. Teams that skip human calibration over-trust RAGAS faithfulness and ship regressions when the judge model drifts. Budget 50–100 human-labelled rows per quarter to recalibrate thresholds.
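
Recalibration can start as simply as measuring how often the judge's pass/fail verdict matches the human label, then picking the threshold that agrees best. The sketch below assumes judge scores in [0, 1] and boolean human labels (True = faithful); the candidate thresholds and the sample data are invented for the example.

```python
def judge_agreement(judge_scores: list[float], human_labels: list[bool], threshold: float) -> float:
    """Fraction of rows where (judge score >= threshold) matches the human verdict."""
    if len(judge_scores) != len(human_labels) or not human_labels:
        raise ValueError("need equal-length, non-empty inputs")
    matches = sum((s >= threshold) == h for s, h in zip(judge_scores, human_labels))
    return matches / len(human_labels)

def best_threshold(judge_scores, human_labels, candidates=(0.6, 0.7, 0.8, 0.9)) -> float:
    """Candidate with the highest agreement; ties resolve to the lowest threshold."""
    return max(candidates, key=lambda t: judge_agreement(judge_scores, human_labels, t))

scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.90]
humans = [True, True, True, False, False, True]

print(best_threshold(scores, humans))                    # 0.7
print(round(judge_agreement(scores, humans, 0.8), 3))    # 0.833
```

Raw agreement is easy to explain to reviewers; on imbalanced sets, a chance-corrected statistic such as Cohen's kappa is a more conservative choice.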

Test sets — QCA triples, manual vs synthetic

Metrics are only as good as labels. RAG eval sets center on QCA triples: Question, Context (gold chunk IDs or passages), Answer (reference or required fact spans). Build from SME-authored rows first; augment with synthetic generation once the schema is stable.

A minimal JSONL row for retrieval + E2E:

eval/qca.jsonl schema
{"id":"qca-001","question":"Can I refund a monthly plan after 14 days?","gold_chunk_ids":["refund-policy.txt-0"],"reference_answer":"Partial months are not refunded after 14 days.","gold_answer_contains":["partial months","not refund"],"tags":["billing","policy"],"source":"manual"}
{"id":"qca-002","question":"Lookup order ORD-1042 status","gold_chunk_ids":["orders.txt-12"],"gold_answer_contains":["ORD-1042"],"tags":["keyword","lookup"],"source":"manual"}
{"id":"qca-003","question":"What is the Mars colony return policy?","gold_chunk_ids":[],"gold_answer_contains":["don't know","support@","not find"],"tags":["negative","ood"],"source":"synthetic"}
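
Loading this file deserves a small validation pass so that a malformed row fails the eval job early rather than skewing metrics. A minimal sketch, assuming only `id`, `question`, and `gold_chunk_ids` are mandatory (the other fields in the rows above are treated as optional):

```python
import json

REQUIRED = {"id", "question", "gold_chunk_ids"}

def parse_qca_lines(lines: list[str]) -> list[dict]:
    """Parse JSONL rows and fail fast on schema problems."""
    rows, seen = [], set()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue                                  # tolerate blank lines
        row = json.loads(line)
        if not isinstance(row, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        missing = REQUIRED - set(row)
        if missing:
            raise ValueError(f"line {lineno}: missing {sorted(missing)}")
        if not isinstance(row["gold_chunk_ids"], list):
            raise ValueError(f"line {lineno}: gold_chunk_ids must be a list")
        if row["id"] in seen:
            raise ValueError(f"line {lineno}: duplicate id {row['id']}")
        seen.add(row["id"])
        rows.append(row)
    return rows

sample = [
    '{"id":"qca-002","question":"Lookup order ORD-1042 status","gold_chunk_ids":["orders.txt-12"]}',
    '',
    '{"id":"qca-003","question":"What is the Mars colony return policy?","gold_chunk_ids":[]}',
]
rows = parse_qca_lines(sample)
print(len(rows), [r["id"] for r in rows])  # 2 ['qca-002', 'qca-003']
```

An empty `gold_chunk_ids` list is accepted on purpose: it marks a negative (unanswerable) case.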

Manual vs synthetic

| Source | Pros | Cons | When to use |
| --- | --- | --- | --- |
| Manual (SME) | Ground truth trusted; captures domain nuance | Slow; expensive; coverage gaps | Initial 50–150 rows; compliance-critical paths |
| Production sample | Real user phrasing; thumbs-down relabel | PII scrubbing; biased toward frequent queries | Weekly harvest with redaction pipeline |
| Synthetic (LLM) | Scales coverage; paraphrase variants | May not match real failures; label noise | Augment after manual seed; stratified by tag |
| Synthetic (template) | Deterministic ID lookup rows | Shallow linguistic variety | SKU/order/account keyword tests |

Synthetic QCA generation sketch

Generate questions from corpus chunks
import json

SYNTH_PROMPT = """Given this document chunk, write ONE factual question answerable ONLY from the chunk.
Return JSON: {{"question": "...", "reference_answer": "..."}}

Chunk:
{chunk}
"""

def synth_qca_from_chunks(chunks: list[dict], client, max_per_chunk: int = 2) -> list[dict]:
    rows = []
    for chunk in chunks:
        for _ in range(max_per_chunk):
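            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                temperature=0.3,
                # JSON mode keeps the reply parseable by json.loads (no code fences around it)
                response_format={"type": "json_object"},
                messages=[{"role": "user", "content": SYNTH_PROMPT.format(chunk=chunk["text"][:2000])}],
            )
            data = json.loads(resp.choices[0].message.content)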
            rows.append({
                "id": f"synth-{chunk['id']}-{len(rows)}",
                "question": data["question"],
                "gold_chunk_ids": [chunk["id"]],
                "reference_answer": data["reference_answer"],
                "source": "synthetic",
                "tags": ["synthetic", chunk.get("collection", "default")],
            })
    return rows

// Java equivalent. Assumes the Spring AI 1.0 GA fluent API and a Java SYNTH_PROMPT that uses %s as the chunk placeholder
public List<QcaRow> synthQcaFromChunks(List<Chunk> chunks, ChatClient client, int maxPerChunk)
        throws JsonProcessingException {
    List<QcaRow> rows = new ArrayList<>();
    for (Chunk chunk : chunks) {
        for (int i = 0; i < maxPerChunk; i++) {
            String text = chunk.text().substring(0, Math.min(2000, chunk.text().length()));
            String json = client.prompt()
                .user(SYNTH_PROMPT.formatted(text))
                .options(ChatOptions.builder().model("gpt-4o-mini").temperature(0.3).build())
                .call().content();
            SynthPayload data = objectMapper.readValue(json, SynthPayload.class);
            rows.add(new QcaRow(
                "synth-" + chunk.id() + "-" + rows.size(),
                data.question(), List.of(chunk.id()), data.referenceAnswer(),
                List.of(), List.of("synthetic", chunk.collection()), "synthetic"));
        }
    }
    return rows;
}

Stratification and coverage

  • Tags: billing, shipping, negative (unanswerable), keyword, multi-hop, ACL-bound.
  • Smoke vs full: 15–25 stratified rows for PR; 200–2000 for nightly.
  • Negative cases: questions with empty gold_chunk_ids; expect refusal or escalation, not fabrication.
  • Versioning: qca_manifest.json with schema version, corpus snapshot ID, labeler sign-off.
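
Drawing the PR smoke set from the full QCA file can be done with a deterministic stratified sample. The sketch below buckets each row by its first tag, which is an assumption made to keep the example short; rows with several tags may deserve a smarter assignment.

```python
import random
from collections import defaultdict

def stratified_smoke(rows: list[dict], per_tag: int = 3, seed: int = 0) -> list[dict]:
    """Pick up to per_tag rows per bucket, where the bucket is the row's first tag."""
    by_tag = defaultdict(list)
    for row in rows:
        tags = row.get("tags") or ["untagged"]
        by_tag[tags[0]].append(row)
    rng = random.Random(seed)            # fixed seed keeps the PR smoke set stable
    picked = []
    for tag in sorted(by_tag):           # sorted for a reproducible order
        bucket = by_tag[tag]
        picked.extend(rng.sample(bucket, min(per_tag, len(bucket))))
    return picked

rows = (
    [{"id": f"b{i}", "tags": ["billing"]} for i in range(5)]
    + [{"id": f"n{i}", "tags": ["negative"]} for i in range(2)]
    + [{"id": "k0", "tags": ["keyword"]}]
)
smoke = stratified_smoke(rows, per_tag=2)
print(len(smoke))  # 5
```

A fixed seed matters for CI: if the smoke set changed on every run, a failing PR could pass on retry without any code change.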
⚠️ Pitfall

Synthetic-only eval sets inflate scores: the same LLM family generated questions and judges answers. Always anchor with manual SME rows and production failures; use synthetic for coverage expansion, not as the sole gate.

📦 Real World

Support bots at scale maintain a “failure inbox”: every thumbs-down with logged chunk IDs becomes a candidate QCA row after SME review. That loop beats one-time synthetic generation for catching retrieval drift on new product launches.

Production — RAG eval dashboard & operating loop

Offline QCA scores prove a build candidate. Production needs a RAG eval dashboard that ties retrieval logs, faithfulness samples, user feedback, and CI baselines into one view—so on-call can answer “did last night’s index deploy hurt refunds?” without re-running ad hoc notebooks.

Dashboard panels

| Panel | Metrics | Source | Alert if |
| --- | --- | --- | --- |
| Retrieval health | Recall@5, MRR (nightly on QCA); empty-result rate live | Offline job + retrieval logs | Recall drops > 2 pts vs 7d baseline |
| Generation quality | Faithfulness p50/p95; hallucination rate | Sampled online judge + offline RAGAS | Faithfulness p50 < gate |
| Citations | Citation accuracy; broken chunk ID rate | Answer post-processor logs | Accuracy < 90% on sampled rows |
| User signals | Thumbs down, “wrong source”, escalation | Product analytics | Spike vs 7d moving average |
| Cost / latency | Retrieve p95, rerank p95, tokens per answer | APM traces | p95 retrieve > SLO |
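
The “drops > 2 pts vs 7d baseline” rule from the retrieval panel might be implemented along these lines. The sketch assumes one nightly value per day and uses the plain mean of the last seven runs as the baseline; other baselines (median, exponentially weighted) are equally valid.

```python
from statistics import mean

def recall_alert(history: list[float], today: float, max_drop: float = 0.02) -> bool:
    """True when today's value is more than max_drop below the trailing 7-run mean."""
    window = history[-7:]
    if not window:
        return False                     # no baseline yet, nothing to compare
    return mean(window) - today > max_drop

nightly = [0.90, 0.91, 0.89, 0.90, 0.92, 0.90, 0.91]   # made-up recall@5 history
print(recall_alert(nightly, 0.87))    # True  (about 3.4 points below the mean)
print(recall_alert(nightly, 0.895))   # False (within tolerance)
```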

CI + nightly pipeline

  1. PR (smoke) — 20 stratified QCA rows: retrieval recall@5 + fact_match E2E; no live judge.
  2. Merge to main — rebuild index on staging snapshot; full retrieval suite (~500 rows).
  3. Nightly — RAGAS/DeepEval on full QCA + 5% production sample (redacted).
  4. Pre-prod promote — compare dashboard baselines; human sign-off on diff report.
  5. Post-deploy — 24h canary: online faithfulness sample + thumbs-down watch.
Emit eval metrics to observability backend
import json
from datetime import datetime, timezone
from pathlib import Path

def publish_eval_report(scores: dict, meta: dict, out_dir: str = "reports/rag") -> str:
    payload = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "embed_model": meta["embed_model"],
        "index_version": meta["index_version"],
        "judge_model": meta.get("judge_model"),
        "scores": scores,
        "gates": {
            "recall@5": scores.get("recall@5", 0) >= meta["min_recall@5"],
            "faithfulness": scores.get("faithfulness", 0) >= meta["min_faithfulness"],
        },
    }
    path = f"{out_dir}/latest.json"
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    Path(path).write_text(json.dumps(payload, indent=2))
    # Optional: push to Datadog/Grafana via HTTP API
    # datadog_api.Metric.send(type="gauge", metric="rag.recall_at_5", points=[(time.time(), scores["recall@5"])])
    return path

// Java equivalent of the Python reporter above
@Service
public class RagEvalReporter {
    private final MeterRegistry metrics;
    private final ObjectMapper mapper;

    public Path publishEvalReport(LayerScores scores, EvalMeta meta) throws IOException {
        Map<String, Object> payload = Map.of(
            "timestamp", Instant.now().toString(),
            "embedModel", meta.embedModel(),
            "indexVersion", meta.indexVersion(),
            "scores", scores,
            "gates", Map.of(
                "recall@5", scores.retrieval().get("recall@5") >= meta.minRecallAt5(),
                "faithfulness", scores.generation().get("faithfulness_mean") >= meta.minFaithfulness()
            ));
        Path path = Path.of("reports/rag/latest.json");
        Files.createDirectories(path.getParent());
        mapper.writerWithDefaultPrettyPrinter().writeValue(path.toFile(), payload);

        metrics.gauge("rag.recall_at_5", scores.retrieval().get("recall@5"));
        metrics.gauge("rag.faithfulness_mean", scores.generation().get("faithfulness_mean"));
        return path;
    }
}

Operating loop (weekly)

  • Review dashboard regressions and top thumbs-down cluster themes.
  • Promote 5–10 production failures into QCA after SME label.
  • Re-calibrate judge vs human on 30-row sample if judge model changed.
  • Archive eval report JSON alongside index version for audit trail.

Production readiness checklist

  • QCA JSONL with manual seed + synthetic augment + negative cases
  • Three-layer eval scripts (retrieval, generation, E2E) in repo
  • Baseline metrics file per index + embed + judge version
  • PR smoke gate (< 5 min) and nightly full suite
  • Dashboard: recall, faithfulness, citation accuracy, user feedback
  • Production logs: question hash, chunk IDs, context hash, prompt version
  • Runbook: “recall OK but faithfulness down” vs “both down”
🎯 Interview Tip

“Design RAG eval for production.” — QCA sets, separate retrieval/generation metrics, RAGAS or custom judges, CI smoke + nightly full, dashboard with user feedback loop, canary after index deploy. Mention you never promote on demo accuracy alone.

💡 Pro Tip

Auto-generate a single “eval scorecard” (PDF or HTML) from reports/rag/latest.json for PM/legal sign-off: same numbers as CI, zero manual spreadsheet drift.

💰 Cost

Online faithfulness on 2% of traffic with gpt-4o-mini judge ≈ $0.002–0.008 per judged answer (context-dependent). Cap daily judge spend; fall back to keyword faithfulness for the long tail. Full RAGAS nightly on 1k rows ≈ $10–40 depending on context size.
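
Capping judge spend with a keyword fallback could be sketched as a small router. Everything here is an assumption for illustration: the class name, the hash-based sampling, and the per-call cost, which in practice would be derived from real token counts rather than a constant.

```python
import hashlib

class JudgeBudget:
    """Sample a fixed share of traffic for the LLM judge until a daily cap is hit."""

    def __init__(self, sample_rate: float, daily_cap_usd: float, cost_per_call_usd: float):
        self.sample_rate = sample_rate
        self.daily_cap_usd = daily_cap_usd
        self.cost_per_call_usd = cost_per_call_usd
        self.spent_usd = 0.0             # reset by the caller at the start of each day

    def route(self, request_id: str) -> str:
        """Return "judge", "keyword", or "skip" for one production answer."""
        # Hash-based sampling is deterministic per request ID
        bucket = int(hashlib.sha256(request_id.encode("utf-8")).hexdigest(), 16) % 10_000
        if bucket >= self.sample_rate * 10_000:
            return "skip"
        if self.spent_usd + self.cost_per_call_usd > self.daily_cap_usd:
            return "keyword"             # cap reached: fall back to the cheap heuristic
        self.spent_usd += self.cost_per_call_usd
        return "judge"

# sample_rate=1.0 so every request is sampled and the cap behaviour is visible
budget = JudgeBudget(sample_rate=1.0, daily_cap_usd=0.01, cost_per_call_usd=0.004)
routes = [budget.route(f"req-{i}") for i in range(4)]
print(routes)  # ['judge', 'judge', 'keyword', 'keyword']
```

With `sample_rate=0.02` the same router judges roughly 2% of traffic, and the sampled answers that arrive after the cap still get a keyword faithfulness score instead of none.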

Related: Retrieval strategies — improve recall before you tune faithfulness gates; Hands-on JSONL walkthrough in ~/hello-rag (see the eval harness section above).