RAG-specific evaluation

Three evaluation dimensions

RAG quality is not one number. Treat retrieval, generation, and end-to-end as separate experiments with separate gold labels, separate baselines, and separate regression gates. When end-to-end accuracy drops, you need to know whether the index, the reranker, or the answer prompt moved—not guess from a single “accuracy” column.

The mental model mirrors traditional search plus summarization. First ask “did we fetch the right evidence?”, then “did the model stay inside that evidence?”, and finally “would a user accept this answer?” Skipping straight to the third question wastes LLM judge tokens on cases where retrieval already failed.

Dimension map

Dimension	What you measure	Gold labels needed	Typical failure → fix
Retrieval-only	Ranked chunk list vs labelled relevant IDs	`gold_chunk_ids` per question	Chunking, hybrid weights, filters, reranker
Generation-only	Answer faithfulness given fixed context	Question + frozen context + reference answer or claims	Prompt, temperature, citation format, model swap
End-to-end	Full pipeline: question → retrieve → answer	Question + expected facts / acceptable refusals	Both layers + routing + UX (citations shown?)

Evaluation flow (offline)

1. Build or import a QCA test set: questions with chunk IDs and optional reference answers (see Test sets).
2. Run retrieval eval: log ranked chunk IDs; compute Recall@K, MRR, NDCG without calling the answer LLM.
3. Run generation eval: inject gold context (or top-K from step 2) into a frozen prompt; score faithfulness.
4. Run end-to-end eval: full pipeline; compare fact-match rate and user-rubric scores to baselines.
5. Attribute regressions: if retrieval metrics are stable but E2E drops, blame generation; if retrieval drops, fix the index before tuning prompts.

from dataclasses import dataclass
from typing import Callable

# recall_at_k and mrr are the helpers defined in the retrieval metrics section below.

@dataclass
class RagEvalCase:
    id: str
    question: str
    gold_chunk_ids: list[str]
    gold_answer_contains: list[str] | None = None
    frozen_context: str | None = None  # for generation-only

@dataclass
class LayerScores:
    retrieval: dict
    generation: dict | None
    end_to_end: dict

def run_three_layer_eval(
    cases: list[RagEvalCase],
    retrieve_fn: Callable,
    answer_fn: Callable,
    faithfulness_fn: Callable,
    k: int = 5,
) -> dict:
    ret_recalls, ret_mrrs = [], []
    gen_faith, e2e_ok = [], []

    for case in cases:
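        hits = retrieve_fn(case.question, k=k)
        hit_ids = [h["id"] for h in hits]
        gold = set(case.gold_chunk_ids)  # metric helpers use set operations
        ret_recalls.append(recall_at_k(hit_ids, gold, k))
        ret_mrrs.append(mrr(hit_ids, gold))

        # With frozen_context set, the answer below isolates generation; the
        # fact-match score for that row is then not a true end-to-end result.
        ctx = case.frozen_context or "\n\n".join(h["text"] for h in hits[:k])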
        answer = answer_fn(case.question, ctx)
        gen_faith.append(faithfulness_fn(answer, ctx))

        must = case.gold_answer_contains or []
        e2e_ok.append(all(s.lower() in answer.lower() for s in must) if must else True)

    n = len(cases)
    return {
        "retrieval": {"recall@%d" % k: sum(ret_recalls) / n, "mrr": sum(ret_mrrs) / n},
        "generation": {"faithfulness_mean": sum(gen_faith) / n},
        "end_to_end": {"fact_match_rate": sum(e2e_ok) / n},
    }

public record RagEvalCase(
    String id, String question, List<String> goldChunkIds,
    List<String> goldAnswerContains, String frozenContext) {}

public record LayerScores(Map<String, Double> retrieval,
                          Map<String, Double> generation,
                          Map<String, Double> endToEnd) {}

public LayerScores runThreeLayerEval(
    List<RagEvalCase> cases,
    BiFunction<String, Integer, List<RetrievedChunk>> retrieveFn,
    BiFunction<String, String, String> answerFn,
    BiFunction<String, String, Double> faithfulnessFn,
    int k) {

    List<Double> recalls = new ArrayList<>();
    List<Double> mrrs = new ArrayList<>();
    List<Double> faith = new ArrayList<>();
    int e2ePass = 0;

    for (RagEvalCase c : cases) {
        List<RetrievedChunk> hits = retrieveFn.apply(c.question(), k);
        List<String> ids = hits.stream().map(RetrievedChunk::id).toList();
        Set<String> gold = Set.copyOf(c.goldChunkIds());  // metric helpers take a Set
        recalls.add(RetrievalMetrics.recallAtK(ids, gold, k));
        mrrs.add(RetrievalMetrics.mrr(ids, gold));

        String ctx = c.frozenContext() != null ? c.frozenContext()
            : hits.stream().limit(k).map(RetrievedChunk::text)
                .collect(Collectors.joining("\n\n"));
        String answer = answerFn.apply(c.question(), ctx);
        faith.add(faithfulnessFn.apply(answer, ctx));

        if (c.goldAnswerContains() == null || c.goldAnswerContains().isEmpty()) {
            e2ePass++;
        } else {
            String lower = answer.toLowerCase();
            boolean ok = c.goldAnswerContains().stream()
                .allMatch(s -> lower.contains(s.toLowerCase()));
            if (ok) e2ePass++;
        }
    }
    int n = cases.size();
    return new LayerScores(
        Map.of("recall@" + k, recalls.stream().mapToDouble(d -> d).average().orElse(0),
               "mrr", mrrs.stream().mapToDouble(d -> d).average().orElse(0)),
        Map.of("faithfulness_mean", faith.stream().mapToDouble(d -> d).average().orElse(0)),
        Map.of("fact_match_rate", (double) e2ePass / n));
}
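
The attribution rule from the evaluation flow (retrieval stable but E2E down points at generation; retrieval down points at the index) can be sketched as a small helper. This is a minimal illustration, not a prescribed implementation: the metric keys `recall`, `faithfulness`, and `fact_match` and the 0.02 tolerance are assumptions chosen for the example.

```python
def attribute_regression(baseline: dict, current: dict, tol: float = 0.02) -> str:
    """Return which layer to investigate first.

    Keys ("recall", "faithfulness", "fact_match") and the 0.02 tolerance
    are illustrative assumptions, not fixed conventions.
    """
    def dropped(key: str) -> bool:
        return baseline[key] - current[key] > tol

    if dropped("recall"):
        return "retrieval"   # fix the index before tuning prompts
    if dropped("faithfulness") or dropped("fact_match"):
        return "generation"  # retrieval is stable, answers got worse
    return "none"


base = {"recall": 0.90, "faithfulness": 0.90, "fact_match": 0.80}
print(attribute_regression(base, {**base, "fact_match": 0.70}))  # generation
```

Checking retrieval first matters: a recall drop usually drags answer scores down with it, so a generation verdict is only meaningful once retrieval is known to be stable.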

When to run which layer

Change under test	Run first	Optional
Chunk size / overlap	Retrieval-only	E2E smoke (10 rows)
Embedding model migration	Retrieval-only + re-embed index	NDCG compare old vs new
Answer system prompt	Generation-only (frozen context)	E2E
Reranker swap	Retrieval-only @ same K	Latency benchmark
Full pipeline release	All three	Online shadow traffic

⚠️ Pitfall

Reporting only end-to-end “accuracy” on 20 demo questions. When hybrid search regresses recall@5 by 8 points but the LLM compensates with parametric knowledge, E2E may look flat—until production users hit questions outside training distribution and hallucination spikes. Always log retrieval metrics alongside answer scores.

🔬 Under the Hood

Retrieval eval is cheap: embed + search, no generation tokens. Generation-only eval fixes context so you isolate the LLM. That separation is how teams A/B answer prompts without re-indexing, and re-index without burning judge budget on every chunk experiment.

⚖️ Trade-off

Three layers mean three CI jobs and three dashboards. For small teams, start with retrieval + E2E fact checks; add generation-only when prompt iteration outpaces index changes. The generation layer pays off when legal/compliance requires proving answers cite provided context—not training data.

Retrieval metrics — Precision@K, Recall@K, NDCG, MRR

Retrieval evaluation treats your vector + sparse + rerank pipeline as a ranked list problem. Given a question and a set of labelled relevant chunk IDs, you score how high those chunks appear in top-K results. These metrics predate LLMs; they remain the fastest signal for chunking and index experiments.

Label format: each test row has question and gold_chunk_ids (one or more chunk IDs that contain the answer). Run your retriever, record the ordered IDs, and compute metrics. No LLM generation is required. For multi-hop questions, gold may include two or more chunks; Recall@K is the fraction of gold IDs that appear in the top K (set overlap).

Metric definitions

Metric	Formula intuition	Range	Best when
Precision@K	(# relevant in top K) / K	0–1	Context window is tight; false positives hurt (long chunks)
Recall@K	(# gold IDs found in top K) / (total gold IDs)	0–1	Multi-chunk answers; you must not miss a required passage
MRR (Mean Reciprocal Rank)	Average of 1/rank of first relevant hit	0–1	Single best chunk matters; FAQ-style Q&A
NDCG@K	Discounted cumulative gain vs ideal ranking	0–1	Graded relevance (highly vs partially relevant chunks)

Worked example

Gold: ["policy-refunds-2", "policy-refunds-5"]. Retrieved top 5: ["faq-shipping-1", "policy-refunds-5", "policy-refunds-2", "policy-refunds-7", "billing-3"].

Precision@5 = 2 relevant / 5 = 0.40 (only refunds-2 and refunds-5 are gold; refunds-7 is retrieved but not labelled relevant, so it counts as a miss)
Recall@5 = 2 gold found / 2 gold total = 1.0
MRR = 1/2 = 0.50 (first gold at rank 2)
NDCG@5 ≈ 0.69 with binary gains (the two gold chunks sit at ranks 2 and 3 instead of 1 and 2); with graded gains the score also rewards placing the more relevant chunk first

import math

def precision_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    top = retrieved[:k]
    if not top:
        return 0.0
    hits = sum(1 for cid in top if cid in gold)
    return hits / k

def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
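    if not gold:
        # Negative rows (no gold): only an empty result counts as correct.
        # Most retrievers always return K hits, so consider excluding these rows from averages.
        return 1.0 if not retrieved[:k] else 0.0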
    top = set(retrieved[:k])
    return len(top & gold) / len(gold)

def mrr(retrieved: list[str], gold: set[str]) -> float:
    for rank, cid in enumerate(retrieved, start=1):
        if cid in gold:
            return 1.0 / rank
    return 0.0

def dcg_at_k(relevances: list[float], k: int) -> float:
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(retrieved: list[str], graded: dict[str, float], k: int) -> float:
    gains = [graded.get(cid, 0.0) for cid in retrieved[:k]]
    ideal = sorted(graded.values(), reverse=True)[:k]
    idcg = dcg_at_k(ideal, k)
    if idcg == 0:
        return 0.0
    return dcg_at_k(gains, k) / idcg

def eval_retrieval_suite(rows, search_fn, k: int = 5) -> dict:
    p, r, m, n = [], [], [], []
    for row in rows:
        gold = set(row["gold_chunk_ids"])
        ids = [h["id"] for h in search_fn(row["question"], k=k)]
        p.append(precision_at_k(ids, gold, k))
        r.append(recall_at_k(ids, gold, k))
        m.append(mrr(ids, gold))
        graded = {cid: 1.0 for cid in gold}
        n.append(ndcg_at_k(ids, graded, k))
    cnt = len(rows)
    return {
        "precision@%d" % k: sum(p) / cnt,
        "recall@%d" % k: sum(r) / cnt,
        "mrr": sum(m) / cnt,
        "ndcg@%d" % k: sum(n) / cnt,
    }

public final class RetrievalMetrics {

    public static double precisionAtK(List<String> retrieved, Set<String> gold, int k) {
        List<String> top = retrieved.subList(0, Math.min(k, retrieved.size()));
        if (top.isEmpty()) return 0.0;
        long hits = top.stream().filter(gold::contains).count();
        return (double) hits / k;
    }

    public static double recallAtK(List<String> retrieved, Set<String> gold, int k) {
        if (gold.isEmpty()) return retrieved.isEmpty() ? 1.0 : 0.0;
        Set<String> top = new HashSet<>(retrieved.subList(0, Math.min(k, retrieved.size())));
        top.retainAll(gold);
        return (double) top.size() / gold.size();
    }

    public static double mrr(List<String> retrieved, Set<String> gold) {
        for (int i = 0; i < retrieved.size(); i++) {
            if (gold.contains(retrieved.get(i))) return 1.0 / (i + 1);
        }
        return 0.0;
    }

    public static double ndcgAtK(List<String> retrieved, Map<String, Double> graded, int k) {
        double dcg = 0, idcg = 0;
        List<Double> ideal = graded.values().stream()
            .sorted(Comparator.reverseOrder()).limit(k).toList();
        for (int i = 0; i < k && i < retrieved.size(); i++) {
            dcg += graded.getOrDefault(retrieved.get(i), 0.0) / (Math.log(i + 2) / Math.log(2));
        }
        for (int i = 0; i < ideal.size(); i++) {
            idcg += ideal.get(i) / (Math.log(i + 2) / Math.log(2));
        }
        return idcg == 0 ? 0.0 : dcg / idcg;
    }
}
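
To check the refunds worked example above numerically, the sketch below recomputes all four metrics with binary gains (1.0 for a gold chunk, 0.0 otherwise). The helper names are local to this snippet and `reciprocal_rank` is the per-question term that MRR averages; treat it as a sanity check under those assumptions rather than a replacement for the suite code.

```python
import math


def precision_at_k(retrieved, gold, k):
    return sum(1 for cid in retrieved[:k] if cid in gold) / k


def recall_at_k(retrieved, gold, k):
    return len(set(retrieved[:k]) & gold) / len(gold)


def reciprocal_rank(retrieved, gold):
    for rank, cid in enumerate(retrieved, start=1):
        if cid in gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, gold, k):
    # Binary gains: each gold chunk contributes 1 / log2(rank + 1).
    dcg = sum(1.0 / math.log2(i + 2)
              for i, cid in enumerate(retrieved[:k]) if cid in gold)
    # Ideal ranking puts every gold chunk at the top.
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(gold))))
    return dcg / idcg if idcg else 0.0


gold = {"policy-refunds-2", "policy-refunds-5"}
retrieved = ["faq-shipping-1", "policy-refunds-5", "policy-refunds-2",
             "policy-refunds-7", "billing-3"]

print(precision_at_k(retrieved, gold, 5))       # 0.4
print(recall_at_k(retrieved, gold, 5))          # 1.0
print(reciprocal_rank(retrieved, gold))         # 0.5
print(round(ndcg_at_k(retrieved, gold, 5), 3))  # 0.693
```

NDCG lands below 1.0 even though recall is perfect, because the gold chunks occupy ranks 2 and 3 rather than 1 and 2.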

Choosing K and gates

Setting	Typical K	Gate example
Stuff-all-context (8k window)	K = 10–20	Recall@10 ≥ baseline − 0.02
Rerank 50 → keep 5	Measure @ 50 and @ 5	Recall@50 for the coarse stage; Precision@5 post-rerank
Citation UI shows 3 sources	K = 3	Recall@3 ≥ 0.85 on gold set
Agentic multi-retrieve	Per-hop K + union	Per-hop recall + union recall
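
For the agentic multi-retrieve row, per-hop recall and union recall can diverge: each hop may find only part of the gold set while the union covers all of it. A minimal sketch, assuming each hop returns an ordered list of chunk IDs (the function name and return shape are illustrative):

```python
def per_hop_and_union_recall(hops: list[list[str]], gold: set[str], k: int) -> dict:
    """hops: one ranked ID list per retrieval call made for a single question."""
    per_hop = []
    union: set[str] = set()
    for ids in hops:
        top = set(ids[:k])
        per_hop.append(len(top & gold) / len(gold) if gold else 0.0)
        union |= top
    union_recall = len(union & gold) / len(gold) if gold else 0.0
    return {"per_hop": per_hop, "union": union_recall}


scores = per_hop_and_union_recall(
    hops=[["a", "x"], ["b", "y"]], gold={"a", "b"}, k=2)
print(scores)  # {'per_hop': [0.5, 0.5], 'union': 1.0}
```

Gating on union recall alone can hide a hop that contributes nothing, which is why the table lists both.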

⚙️ Config

Store baseline metrics in git: eval/baselines/retrieval_v3.json with commit SHA, index version, embed model, and metric snapshot. CI compares PR branch to that file; fail if recall@5 drops more than your agreed delta (often 2–3 points on small sets, tighter on 500+ rows).
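
A CI comparison against the stored baseline might look like the sketch below. The JSON layout (`commit`, `index_version`, `embed_model`, `metrics`) and the example values are assumptions for illustration; only the idea of failing on a drop larger than the agreed delta comes from the text above.

```python
import json


def check_gate(baseline: dict, current: dict, max_drop: float = 0.02) -> list[str]:
    """Return one message per metric that regressed by more than max_drop."""
    failures = []
    for name, base_value in baseline["metrics"].items():
        new_value = current.get(name)
        if new_value is None:
            failures.append(f"{name}: missing from current run")
        elif base_value - new_value > max_drop:
            failures.append(f"{name}: {base_value:.3f} -> {new_value:.3f}")
    return failures


# Hypothetical contents of eval/baselines/retrieval_v3.json
baseline = json.loads("""{
  "commit": "abc1234", "index_version": "v3", "embed_model": "example-embed",
  "metrics": {"recall@5": 0.88, "mrr": 0.71}
}""")

print(check_gate(baseline, {"recall@5": 0.84, "mrr": 0.72}))
# ['recall@5: 0.880 -> 0.840']
```

In a pipeline, a non-empty list would translate into a non-zero exit code so the PR check fails.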

💰 Cost

Retrieval-only eval on 500 questions against a warm index costs embedding API fees only—roughly $0.01–0.05 per 1k queries with small embed models—versus $5–50 for the same count with LLM judges. Run retrieval eval on every index PR; reserve judges for generation layers.

📦 Real World

Notion and internal KB teams often maintain a “retrieval leaderboard” per collection: dense vs hybrid vs +rerank, updated nightly. Product picks the Pareto point (recall vs p95 latency) from that table—not from demo screenshots.

Generation metrics — faithfulness, grounding, citation accuracy

Once context is fixed (gold retrieval or logged production chunks), measure whether the answer stays inside that context. Faithfulness penalizes unsupported claims. Grounding checks entailment from context to answer. Citation accuracy verifies that cited chunk IDs actually support the attached sentences.

Generation metrics catch a different failure mode than retrieval: the right chunk is in the prompt, but the model summarizes incorrectly, merges two policies, or cites chunk B while stating facts from chunk A. Regulated domains (finance, healthcare, internal legal) often gate releases on faithfulness scores even when BLEU/ROUGE look fine—those n-gram metrics reward fluency, not truth.

Metric comparison

Metric	Question it answers	Implementation tiers
Faithfulness	Is every claim in the answer supported by context?	Claim decomposition + NLI; LLM judge; RAGAS faithfulness
Answer relevance	Does the answer address the question (without extra fluff)?	Embedding similarity Q↔A; LLM rubric
Grounding score	Fraction of answer sentences entailed by context	NLI model (e.g. deberta-v3-base-mnli); TruLens groundedness
Citation accuracy	Do [1][2] pointers match supporting spans?	Parse citations → retrieve span → NLI or string overlap
Hallucination rate	% answers with ≥1 unsupported atomic claim	Share of answers whose claim-level faithfulness is below 1.0 (not simply 1 − mean faithfulness)
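
Mean faithfulness and hallucination rate are computed from the same claim labels but answer different questions, so they can move apart. The sketch below assumes claim decomposition and support labelling have already happened upstream; the input shape (one list of booleans per answer) is an assumption for the example.

```python
def summarize_claim_labels(per_answer: list[list[bool]]) -> dict:
    """per_answer[i] holds one supported/unsupported flag per atomic claim."""
    scored = [flags for flags in per_answer if flags]  # skip answers with no claims
    if not scored:
        return {"faithfulness_mean": 1.0, "hallucination_rate": 0.0}
    faith = [sum(flags) / len(flags) for flags in scored]
    # An answer counts as hallucinated if at least one claim is unsupported.
    halluc = sum(1 for flags in scored if not all(flags)) / len(scored)
    return {
        "faithfulness_mean": sum(faith) / len(faith),
        "hallucination_rate": halluc,
    }


summary = summarize_claim_labels([
    [True, True],
    [True, False],
    [True, True, True, False],
])
print(summary["faithfulness_mean"])  # 0.75
```

Here mean faithfulness is 0.75, yet two of the three answers contain an unsupported claim, so the hallucination rate is about 0.67.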

Cheap faithfulness before LLM judges

Split answer into sentences (or bullet claims).
For each claim, check if any context sentence contains overlapping entities + predicate (keyword heuristic).
Escalate low-confidence claims to NLI or LLM judge.
Score = supported claims / total claims.

import re

CITE_PATTERN = re.compile(r"\[(\d+)\]")

def split_claims(answer: str) -> list[str]:
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if len(p.split()) > 3]

def keyword_supported(claim: str, context: str) -> bool:
    tokens = {t.lower() for t in re.findall(r"\w{4,}", claim)}
    ctx = context.lower()
    return sum(1 for t in tokens if t in ctx) >= max(1, len(tokens) // 2)

def faithfulness_score(answer: str, context: str) -> float:
    claims = split_claims(answer)
    if not claims:
        return 1.0
    supported = sum(1 for c in claims if keyword_supported(c, context))
    return supported / len(claims)

def citation_accuracy(answer: str, chunks: list[dict]) -> float:
    """chunks: [{id, text}] indexed 1-based in citations [1], [2]."""
    total = ok = 0
    # Score each citation against the sentence it is attached to, not the
    # whole answer; assumes markers precede the sentence's final punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        cites = CITE_PATTERN.findall(sentence)
        claim = CITE_PATTERN.sub("", sentence).strip()
        for ref in cites:
            total += 1
            idx = int(ref) - 1
            # Out-of-range references count as failures.
            if 0 <= idx < len(chunks) and keyword_supported(claim, chunks[idx]["text"]):
                ok += 1
    return ok / total if total else 1.0  # no citations: nothing to verify

public final class GenerationMetrics {
    private static final Pattern CITE = Pattern.compile("\\[(\\d+)\\]");

    public static double faithfulnessScore(String answer, String context) {
        List<String> claims = splitClaims(answer);
        if (claims.isEmpty()) return 1.0;
        long supported = claims.stream()
            .filter(c -> keywordSupported(c, context)).count();
        return (double) supported / claims.size();
    }

    public static double citationAccuracy(String answer, List<Chunk> chunks) {
        int total = 0, ok = 0;
        // Score each citation against the sentence it is attached to, not the whole answer;
        // assumes markers precede the sentence's final punctuation.
        for (String sentence : answer.trim().split("(?<=[.!?])\\s+")) {
            String claim = CITE.matcher(sentence).replaceAll("").trim();
            Matcher m = CITE.matcher(sentence);
            while (m.find()) {
                total++;
                int idx = Integer.parseInt(m.group(1)) - 1;
                // Out-of-range references count as failures.
                if (idx >= 0 && idx < chunks.size()
                        && keywordSupported(claim, chunks.get(idx).text())) ok++;
            }
        }
        return total == 0 ? 1.0 : (double) ok / total;  // no citations: nothing to verify
    }

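    private static boolean keywordSupported(String claim, String context) {
        Set<String> tokens = tokenize(claim, 4);
        String ctx = context.toLowerCase();
        long hits = tokens.stream().filter(ctx::contains).count();
        return hits >= Math.max(1, tokens.size() / 2);
    }

    // Helpers mirroring the Python version; need java.util.Arrays, HashSet, List, Set
    // and java.util.regex.Matcher, Pattern.
    // Sentence-level claims; fragments of three words or fewer are dropped.
    private static List<String> splitClaims(String answer) {
        return Arrays.stream(answer.trim().split("(?<=[.!?])\\s+"))
            .filter(s -> s.trim().split("\\s+").length > 3)
            .toList();
    }

    // Lower-cased word tokens of at least minLen characters.
    private static Set<String> tokenize(String text, int minLen) {
        Set<String> out = new HashSet<>();
        Matcher m = Pattern.compile("\\w{" + minLen + ",}").matcher(text);
        while (m.find()) out.add(m.group().toLowerCase());
        return out;
    }
}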

RAGAS-style generation scores (conceptual)

RAGAS popularized LLM-assisted metrics: faithfulness (claims ⊆ context), answer_relevancy (question ↔ answer), context_precision (relevant chunks ranked high), context_recall (gold facts in retrieved context). You can implement the same ideas without the library—see Eval frameworks for RAGAS integration sketches.

🔒 Security

Faithfulness eval prompts include full retrieved context—which may contain PII or secrets from your corpus. Run offline eval in isolated environments; redact production logs before they enter judge prompts. A “faithful” answer can still leak context if the eval set includes restricted documents without ACL simulation.

🎯 Interview Tip

“How do you eval RAG?” — Open with three layers; give Recall@5 for retrieval; faithfulness for generation; mention citation accuracy for UI with footnotes; close with CI gates and human calibration quarterly—not “we eyeball it.”

💡 Pro Tip

Log context_hash + prompt_version + chunk_ids on every production answer. Generation-only replay becomes a one-liner: fetch logged context, re-run answer prompt, diff faithfulness scores.
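
One way to build such a log record is sketched below. The field names follow the tip and the production readiness checklist; joining chunk texts with blank lines before hashing is an assumption, and whatever scheme you pick must match how the answer prompt assembles context, or replayed hashes will not line up.

```python
import hashlib


def build_answer_log(question: str, chunks: list[dict], prompt_version: str) -> dict:
    """chunks: [{"id": ..., "text": ...}] in the order they were sent to the model."""
    context = "\n\n".join(c["text"] for c in chunks)
    return {
        # Hash rather than store raw text, so logs stay free of user content.
        "question_hash": hashlib.sha256(question.encode("utf-8")).hexdigest(),
        "chunk_ids": [c["id"] for c in chunks],
        "context_hash": hashlib.sha256(context.encode("utf-8")).hexdigest(),
        "prompt_version": prompt_version,
    }


record = build_answer_log(
    "Can I refund a monthly plan after 14 days?",
    [{"id": "refund-policy.txt-0", "text": "Partial months are not refunded after 14 days."}],
    prompt_version="answer-v7",  # hypothetical version label
)
print(record["chunk_ids"])  # ['refund-policy.txt-0']
```

During replay, a mismatch between the stored `context_hash` and the hash of the re-fetched chunks signals that the index changed underneath the logged answer.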

Eval frameworks — RAGAS, TruLens, DeepEval, UpTrain

You can implement metrics by hand (previous sections). Frameworks bundle dataset schemas, LLM judges, tracing, and CI hooks. Pick based on language stack, observability needs, and whether you want batteries-included RAG metrics or a general eval harness.

Framework comparison

Framework	Strengths	RAG metrics	Tracing / prod	Java / JVM	Typical fit
RAGAS	RAG-native metrics; research-backed; LangChain ecosystem	Faithfulness, answer relevancy, context precision/recall	Offline batch; export to CSV	Python-first; JVM via HTTP wrapper	Teams standardizing RAG metric definitions
TruLens	Instrument apps; feedback functions on traces	Groundedness, context relevance, composite	Strong — trace every app call	Python SDK; Java via OpenTelemetry bridge	Apps already using TruLens for LLM tracing
DeepEval	Pytest-style tests; 14+ metrics; CI integration	Hallucination, faithfulness, contextual recall/precision	Confident AI cloud dashboard optional	Python-first	Engineering teams wanting `deepeval test run` in CI
UpTrain	Open-source + managed; multimodal; guardrails	Context relevance, factual accuracy, response completeness	Online evals + alerting	REST API from any language	Platform teams needing hosted dashboards fast

RAGAS batch eval sketch

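# pip install ragas datasets langchain-openai
# Column names below follow the ragas 0.1.x dataset schema; later releases renamed
# them, so check the schema of the version you pin.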
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

def build_ragas_dataset(rows, rag_pipeline) -> Dataset:
    records = []
    for row in rows:
        out = rag_pipeline(row["question"])
        records.append({
            "question": row["question"],
            "answer": out["answer"],
            "contexts": out["context_texts"],  # list[str]
            "ground_truth": row.get("reference_answer", ""),
        })
    return Dataset.from_list(records)

def run_ragas_eval(rows, rag_pipeline) -> dict:
    ds = build_ragas_dataset(rows, rag_pipeline)
    result = evaluate(
        ds,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    return result.to_pandas().mean(numeric_only=True).to_dict()

# CI gate example (my_rag_pipeline is a placeholder for your own pipeline callable)
if __name__ == "__main__":
    import json
    from pathlib import Path
    rows = [json.loads(l) for l in Path("eval/qca.jsonl").read_text().splitlines() if l.strip()]
    scores = run_ragas_eval(rows, my_rag_pipeline)
    assert scores["faithfulness"] >= 0.85, scores
    print(scores)

// Spring service calling RAGAS via sidecar HTTP or internal Python worker
@Service
public class RagasEvalClient {
    private final RestClient rest;

    public RagasEvalClient(RestClient.Builder builder) {
        this.rest = builder.baseUrl("http://ragas-worker:8080").build();
    }

    public Map<String, Double> evaluateBatch(List<QcaRow> rows, RagPipeline pipeline) {
        List<RagasRecord> records = rows.stream().map(row -> {
            RagResult out = pipeline.answer(row.question());
            return new RagasRecord(
                row.question(), out.answer(), out.contextTexts(),
                row.referenceAnswer() != null ? row.referenceAnswer() : "");
        }).toList();

        return rest.post()
            .uri("/v1/evaluate")
            .body(new RagasEvalRequest(records, List.of(
                "faithfulness", "answer_relevancy",
                "context_precision", "context_recall")))
            .retrieve()
            .body(new ParameterizedTypeReference<Map<String, Double>>() {});
    }

    public void assertFaithfulnessGate(List<QcaRow> rows, RagPipeline pipeline, double min) {
        Map<String, Double> scores = evaluateBatch(rows, pipeline);
        if (scores.getOrDefault("faithfulness", 0.0) < min) {
            throw new EvalRegressionException("faithfulness below gate: " + scores);
        }
    }
}

DeepEval pytest-style sketch

# pip install deepeval
# load_qca_rows and the rag_pipeline fixture are project helpers you supply (e.g. in conftest.py).
import pytest
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric, ContextualRecallMetric
from deepeval.test_case import LLMTestCase

@pytest.mark.parametrize("row", load_qca_rows("eval/qca.jsonl"))
def test_rag_faithfulness(row, rag_pipeline):
    out = rag_pipeline(row["question"])
    test_case = LLMTestCase(
        input=row["question"],
        actual_output=out["answer"],
        retrieval_context=out["context_texts"],
        expected_output=row.get("reference_answer"),
    )
    faith = FaithfulnessMetric(threshold=0.8)
    recall = ContextualRecallMetric(threshold=0.75)
    assert_test(test_case, [faith, recall])

# Run: deepeval test run tests/test_rag_eval.py

// DeepEval is Python-native; JVM teams mirror the pattern with JUnit + HTTP judge
@Test
void ragFaithfulnessGate() throws Exception {
    List<QcaRow> smoke = qcaLoader.load("eval/qca_smoke.jsonl");
    List<EvalFailure> failures = new ArrayList<>();

    for (QcaRow row : smoke) {
        RagResult out = ragPipeline.answer(row.question());
        JudgeVerdict verdict = deepEvalClient.judge(new JudgeRequest(
            row.question(), out.answer(), out.contextTexts(),
            row.referenceAnswer(), List.of("faithfulness", "contextual_recall"),
            0.8, 0.75));
        if (!verdict.passed()) failures.add(new EvalFailure(row.id(), verdict));
    }
    assertTrue(failures.isEmpty(), "RAG eval failures: " + failures);
}

Selection guide

Already on LangChain/LlamaIndex → RAGAS integrates cleanly; start with faithfulness + context_recall.
Need production tracing + eval on same spans → TruLens feedback functions on retrieved contexts.
Want pytest in CI tomorrow → DeepEval with threshold metrics and deepeval test run.
Python-averse platform team → UpTrain REST or custom Java judge service; keep metric definitions in one OpenAPI spec.

☸️ OpenShift / K8s

Run RAGAS/DeepEval as a nightly CronJob with a frozen index snapshot mounted from S3. PR pipelines call a slim retrieval-only job; the full RAGAS suite needs no GPU but does consume LLM judge API quota, so schedule it off-peak. Store results as Prometheus metrics or push them to your eval DB.

🏗️ IaC

Version eval dependencies in lockfiles (ragas==0.2.x, judge model ID in Terraform/Helm values). Changing judge model without re-baseline invalidates historical dashboards—tag every eval run with judge_model and embed_model.

⚖️ Trade-off

Framework metrics correlate with human judgment but are not ground truth. Teams that skip human calibration over-trust RAGAS faithfulness and ship regressions when the judge model drifts. Budget 50–100 human-labelled rows per quarter to recalibrate thresholds.

Test sets — QCA triples, manual vs synthetic

Metrics are only as good as labels. RAG eval sets center on QCA triples: Question, Context (gold chunk IDs or passages), Answer (reference or required fact spans). Build from SME-authored rows first; augment with synthetic generation once the schema is stable.

A minimal JSONL row for retrieval + E2E:

{"id":"qca-001","question":"Can I refund a monthly plan after 14 days?","gold_chunk_ids":["refund-policy.txt-0"],"reference_answer":"Partial months are not refunded after 14 days.","gold_answer_contains":["partial months","not refund"],"tags":["billing","policy"],"source":"manual"}
{"id":"qca-002","question":"Lookup order ORD-1042 status","gold_chunk_ids":["orders.txt-12"],"gold_answer_contains":["ORD-1042"],"tags":["keyword","lookup"],"source":"manual"}
{"id":"qca-003","question":"What is the Mars colony return policy?","gold_chunk_ids":[],"gold_answer_contains":["don't know","support@","not find"],"tags":["negative","ood"],"source":"synthetic"}
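
Rows like qca-003 list alternative refusal phrasings, so requiring every string to appear would fail a perfectly good refusal. A possible scoring rule, sketched under the assumption that rows tagged `negative` use any-of matching while all other rows use all-of matching (that reading of the schema is an assumption, not something the JSONL format enforces):

```python
REFUSAL_TAG = "negative"


def fact_match(answer: str, row: dict) -> bool:
    must = [s.lower() for s in row.get("gold_answer_contains") or []]
    if not must:
        return True  # nothing to check for this row
    text = answer.lower()
    if REFUSAL_TAG in row.get("tags", []):
        return any(s in text for s in must)  # alternative refusal phrasings
    return all(s in text for s in must)      # every required fact span


negative_row = {
    "gold_answer_contains": ["don't know", "support@", "not find"],
    "tags": ["negative", "ood"],
}
print(fact_match("I don't know; please contact our help desk.", negative_row))  # True
```

Plain substring checks stay cheap enough for a PR smoke gate; rows that need paraphrase tolerance are better routed to an LLM judge.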

Manual vs synthetic

Source	Pros	Cons	When to use
Manual (SME)	Ground truth trusted; captures domain nuance	Slow; expensive; coverage gaps	Initial 50–150 rows; compliance-critical paths
Production sample	Real user phrasing; thumbs-down relabel	PII scrubbing; biased toward frequent queries	Weekly harvest with redaction pipeline
Synthetic (LLM)	Scales coverage; paraphrase variants	May not match real failures; label noise	Augment after manual seed; stratified by tag
Synthetic (template)	Deterministic ID lookup rows	Shallow linguistic variety	SKU/order/account keyword tests

Synthetic QCA generation sketch

import json

SYNTH_PROMPT = """Given this document chunk, write ONE factual question answerable ONLY from the chunk.
Return JSON: {{"question": "...", "reference_answer": "..."}}

Chunk:
{chunk}
"""

def synth_qca_from_chunks(chunks: list[dict], client, max_per_chunk: int = 2) -> list[dict]:
    rows = []
    for chunk in chunks:
        for _ in range(max_per_chunk):
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                temperature=0.3,
                messages=[{"role": "user", "content": SYNTH_PROMPT.format(chunk=chunk["text"][:2000])}],
            )
            data = json.loads(resp.choices[0].message.content)
            rows.append({
                "id": f"synth-{chunk['id']}-{len(rows)}",
                "question": data["question"],
                "gold_chunk_ids": [chunk["id"]],
                "reference_answer": data["reference_answer"],
                "source": "synthetic",
                "tags": ["synthetic", chunk.get("collection", "default")],
            })
    return rows

public List<QcaRow> synthQcaFromChunks(List<Chunk> chunks, ChatClient client, int maxPerChunk) {
    List<QcaRow> rows = new ArrayList<>();
    for (Chunk chunk : chunks) {
        for (int i = 0; i < maxPerChunk; i++) {
            String json = client.prompt()
                .user(SYNTH_PROMPT.formatted(chunk.text().substring(0, Math.min(2000, chunk.text().length()))))
                .options(ChatOptionsBuilder.builder().withModel("gpt-4o-mini").withTemperature(0.3).build())
                .call().getResult().getOutput().getContent();
            SynthPayload data = objectMapper.readValue(json, SynthPayload.class);
            rows.add(new QcaRow(
                "synth-" + chunk.id() + "-" + rows.size(),
                data.question(), List.of(chunk.id()), data.referenceAnswer(),
                List.of(), List.of("synthetic", chunk.collection()), "synthetic"));
        }
    }
    return rows;
}

Stratification and coverage

Tags — billing, shipping, negative (unanswerable), keyword, multi-hop, ACL-bound.
Smoke vs full — 15–25 stratified rows for PR; 200–2000 for nightly.
Negative cases — questions with empty gold_chunk_ids; expect refusal or escalation, not fabrication.
Versioning — qca_manifest.json with schema version, corpus snapshot ID, labeler sign-off.
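
Drawing the PR smoke set from the full QCA file can be done with a seeded, per-tag sample so the same rows are picked on every run. A minimal sketch, assuming the first entry in `tags` is the stratum and that a fixed number of rows per stratum is acceptable (both are illustrative choices):

```python
import random
from collections import defaultdict


def stratified_smoke(rows: list[dict], per_tag: int = 3, seed: int = 0) -> list[dict]:
    by_tag = defaultdict(list)
    for row in rows:
        primary = row["tags"][0] if row.get("tags") else "untagged"
        by_tag[primary].append(row)
    rng = random.Random(seed)  # fixed seed keeps the smoke set stable across runs
    picked = []
    for tag in sorted(by_tag):  # sorted so iteration order is deterministic
        group = by_tag[tag]
        picked.extend(rng.sample(group, min(per_tag, len(group))))
    return picked


rows = ([{"id": f"b{i}", "tags": ["billing"]} for i in range(10)]
        + [{"id": f"n{i}", "tags": ["negative"]} for i in range(2)])
smoke = stratified_smoke(rows, per_tag=3)
print(len(smoke))  # 5
```

Small strata such as negative cases are kept whole, so rare categories are never sampled out of the gate.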

⚠️ Pitfall

Synthetic-only eval sets inflate scores: the same LLM family generated questions and judges answers. Always anchor with manual SME rows and production failures; use synthetic for coverage expansion, not as the sole gate.

📦 Real World

Support bots at scale maintain a “failure inbox”: every thumbs-down with logged chunk IDs becomes a candidate QCA row after SME review. That loop beats one-time synthetic generation for catching retrieval drift on new product launches.

Production — RAG eval dashboard & operating loop

Offline QCA scores prove a build candidate. Production needs a RAG eval dashboard that ties retrieval logs, faithfulness samples, user feedback, and CI baselines into one view—so on-call can answer “did last night’s index deploy hurt refunds?” without re-running ad hoc notebooks.

Dashboard panels

Panel	Metrics	Source	Alert if
Retrieval health	Recall@5, MRR (nightly on QCA); empty-result rate live	Offline job + retrieval logs	Recall drops > 2 pts vs 7d baseline
Generation quality	Faithfulness p50/p95; hallucination rate	Sampled online judge + offline RAGAS	Faithfulness p50 < gate
Citations	Citation accuracy; broken chunk ID rate	Answer post-processor logs	Accuracy < 90% on sampled rows
User signals	Thumbs down, “wrong source”, escalation	Product analytics	Spike vs 7d moving average
Cost / latency	Retrieve p95, rerank p95, tokens per answer	APM traces	p95 retrieve > SLO
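
The retrieval health alert ("Recall drops > 2 pts vs 7d baseline") reduces to a few lines once nightly scores are stored. The sketch below assumes the baseline is the plain mean of the previous seven nightly Recall@5 values on a 0–1 scale; a median or an exponentially weighted average would be equally defensible.

```python
def recall_alert(history: list[float], today: float, max_drop_pts: float = 2.0) -> bool:
    """history: previous nightly Recall@5 values (0-1 scale), ideally seven of them."""
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = sum(history) / len(history)
    # Convert the 0-1 difference into percentage points before comparing.
    return (baseline - today) * 100 > max_drop_pts


last_week = [0.88, 0.88, 0.88, 0.88, 0.88, 0.88, 0.88]
print(recall_alert(last_week, today=0.85))  # True
print(recall_alert(last_week, today=0.87))  # False
```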

CI + nightly pipeline

PR (smoke) — 20 stratified QCA rows: retrieval recall@5 + fact_match E2E; no live judge.
Merge to main — rebuild index on staging snapshot; full retrieval suite (~500 rows).
Nightly — RAGAS/DeepEval on full QCA + 5% production sample (redacted).
Pre-prod promote — compare dashboard baselines; human sign-off on diff report.
Post-deploy — 24h canary: online faithfulness sample + thumbs-down watch.

import json
from datetime import datetime, timezone
from pathlib import Path

def publish_eval_report(scores: dict, meta: dict, out_dir: str = "reports/rag") -> str:
    payload = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "embed_model": meta["embed_model"],
        "index_version": meta["index_version"],
        "judge_model": meta.get("judge_model"),
        "scores": scores,
        "gates": {
            "recall@5": scores.get("recall@5", 0) >= meta["min_recall@5"],
            "faithfulness": scores.get("faithfulness", 0) >= meta["min_faithfulness"],
        },
    }
    path = f"{out_dir}/latest.json"
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    Path(path).write_text(json.dumps(payload, indent=2))
    # Optional: push to Datadog/Grafana via HTTP API
    # datadog_api.Metric.send(type="gauge", metric="rag.recall_at_5", points=[(time.time(), scores["recall@5"])])
    return path

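@Service
public class RagEvalReporter {
    private final MeterRegistry metrics;
    private final ObjectMapper mapper;

    // Constructor injection; final fields must be assigned for the class to compile.
    public RagEvalReporter(MeterRegistry metrics, ObjectMapper mapper) {
        this.metrics = metrics;
        this.mapper = mapper;
    }
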
    public Path publishEvalReport(LayerScores scores, EvalMeta meta) throws IOException {
        Map<String, Object> payload = Map.of(
            "timestamp", Instant.now().toString(),
            "embedModel", meta.embedModel(),
            "indexVersion", meta.indexVersion(),
            "scores", scores,
            "gates", Map.of(
                "recall@5", scores.retrieval().get("recall@5") >= meta.minRecallAt5(),
                "faithfulness", scores.generation().get("faithfulness_mean") >= meta.minFaithfulness()
            ));
        Path path = Path.of("reports/rag/latest.json");
        Files.createDirectories(path.getParent());
        mapper.writerWithDefaultPrettyPrinter().writeValue(path.toFile(), payload);

        // Micrometer holds gauge objects weakly; in real code keep these values in long-lived fields.
        metrics.gauge("rag.recall_at_5", scores.retrieval().get("recall@5"));
        metrics.gauge("rag.faithfulness_mean", scores.generation().get("faithfulness_mean"));
        return path;
    }
}

Operating loop (weekly)

Review dashboard regressions and top thumbs-down cluster themes.
Promote 5–10 production failures into QCA after SME label.
Re-calibrate judge vs human on 30-row sample if judge model changed.
Archive eval report JSON alongside index version for audit trail.
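
For the judge recalibration step, raw agreement can look high simply because most answers pass. Cohen's kappa corrects for that chance agreement. The sketch below assumes binary pass/fail labels from both the human and the judge on the same rows; what kappa level is acceptable is a team decision, not something this snippet defines.

```python
def cohen_kappa(human: list[bool], judge: list[bool]) -> float:
    n = len(human)
    if n == 0 or n != len(judge):
        raise ValueError("need two equal-length, non-empty label lists")
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_h, p_j = sum(human) / n, sum(judge) / n
    # Agreement expected if both raters labelled independently at their own rates.
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


human = [True, True, False, False]
judge = [True, False, False, False]
print(cohen_kappa(human, judge))  # 0.5
```

In this example raw agreement is 0.75 but kappa is 0.5, which is the gap the weekly review should watch after a judge model change.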

Production readiness checklist

QCA JSONL with manual seed + synthetic augment + negative cases
Three-layer eval scripts (retrieval, generation, E2E) in repo
Baseline metrics file per index + embed + judge version
PR smoke gate (< 5 min) and nightly full suite
Dashboard: recall, faithfulness, citation accuracy, user feedback
Production logs: question hash, chunk IDs, context hash, prompt version
Runbook: “recall OK but faithfulness down” vs “both down”

🎯 Interview Tip

“Design RAG eval for production.” — QCA sets, separate retrieval/generation metrics, RAGAS or custom judges, CI smoke + nightly full, dashboard with user feedback loop, canary after index deploy. Mention you never promote on demo accuracy alone.

💡 Pro Tip

Auto-generate a single “eval scorecard” (PDF or HTML) from reports/rag/latest.json for PM/legal sign-off: same numbers as CI, zero manual spreadsheet drift.

💰 Cost

Online faithfulness on 2% of traffic with gpt-4o-mini judge ≈ $0.002–0.008 per judged answer (context-dependent). Cap daily judge spend; fall back to keyword faithfulness for the long tail. Full RAGAS nightly on 1k rows ≈ $10–40 depending on context size.
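
The online judge arithmetic is easy to sanity-check before setting a spend cap. The sketch below uses the per-answer range quoted above ($0.002–0.008); the traffic figure and the $20 daily cap are hypothetical inputs for illustration.

```python
def judge_spend_range(answers_per_day: int, sample_rate: float,
                      low: float = 0.002, high: float = 0.008,
                      daily_cap: float = 20.0) -> dict:
    """Estimate daily judge spend in USD for a sampled share of traffic."""
    judged = round(answers_per_day * sample_rate)
    return {
        "judged": judged,
        "low_usd": judged * low,
        "high_usd": judged * high,
        # Compare the pessimistic estimate against the cap.
        "within_cap": judged * high <= daily_cap,
    }


est = judge_spend_range(answers_per_day=100_000, sample_rate=0.02)
print(est["judged"], est["within_cap"])  # 2000 True
```

At 100,000 answers per day and a 2% sample, that is 2,000 judged answers costing roughly $4 to $16; ten times the traffic would exceed the assumed cap and push the remainder to the keyword fallback.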

Related: Retrieval strategies — improve recall before you tune faithfulness gates; Hands-on JSONL walkthrough in ~/hello-rag (see the eval harness section above).