Evaluation explained
Evaluation is automated proof that your LLM pipeline still behaves on a frozen set of tickets—not a vibe check after changing one prompt line. Unlike traditional software, outputs are stochastic; evals use scorers that tolerate paraphrase while catching regressions. Golden cases live in git, reports land in CI artifacts, and non-zero exit codes block merges when quality drops.
After reading, you should be able to: explain why eval is often the hardest part of LLM engineering; name the vibe-check trap; map unit, integration, and human eval layers; choose metrics over subjective iteration; run baseline→change→compare→ship; distinguish reference-based, LLM-as-judge, human, and task-based eval types; and apply a production eval readiness checklist.
Why evaluation — the hardest part of LLM engineering
Calling an API is easy. Writing a prompt that works on three demo questions is easy. Knowing whether your system still works after the tenth prompt change, a model upgrade, or a new document corpus—that is hard. Evaluation is the discipline that turns LLM features from demos into software you can ship and maintain.
The demo vs production gap
Every LLM product passes a hallway test: three beautiful answers, everyone applauds. Production bots fail on the fourth ticket—monthly refund edge cases, jailbreaks, missing KB topics, or wrong RAG chunks. Without evals, you learn from angry users, not CI.
Traditional software: same input → same bytes. LLMs sample from a distribution—paraphrase is normal. Evals use scorers—rules, embeddings, judges—that tolerate wording while catching policy violations.
The vibe-check trap
The vibe check is the most common failure mode: after every prompt edit, someone opens a chat UI, asks five questions, squints, declares “looks good.” That workflow is non-reproducible, non-comparable, biased toward recent edits, and invisible in CI.
Escape with a golden dataset: 30–200 real questions as JSONL in git with must_contain, forbid_phrases, refusal flags, citation requirements. Run on every PR. When a case fails, fix the pipeline or update gold deliberately.
| Signal | Vibe check | Structured eval |
|---|---|---|
| Repeatability | Ad hoc questions | Same JSONL every run |
| Regression detection | Often missed | Pass rate delta blocks merge |
| Attribution | “Feels off” | Case ID + scorer + prompt version |
| Onboarding | Tribal knowledge | make eval in README |
flowchart LR
subgraph trap["Vibe-check trap"]
E1[Edit prompt] --> M1[Manual chat test]
M1 --> S1[Ship on feeling]
S1 --> E1
end
subgraph eval["Eval discipline"]
E2[Edit prompt] --> R2[Run gold JSONL]
R2 --> C2{Pass rate OK?}
C2 -->|yes| S2[Merge]
C2 -->|no| F2[Fix or update gold]
F2 --> E2
end
Why eval is harder than building the feature
- Ground truth is fuzzy — many valid answers; scorers encode policy.
- Multi-component failures — bad answers may be retrieval, prompt, model, or tools.
- Eval drift — gold sets stale when policy changes; judges need calibration.
- Cost vs coverage — live CI calls are expensive; mocks miss real failures.
- Agent trajectories — multi-step paths need trajectory assertions, not answer-only checks.
Shipping after playground testing. Playground sessions skip your RAG index, production prompt version, tool allowlist, and model route. Eval the deployed pipeline, not an ad-hoc chat tab.
Support teams at scale maintain hundreds of golden scenarios. Prompt merges require pass-rate thresholds; production failures become new gold rows within days.
“What is LLM evaluation?” — Frozen cases + automated scorers + CI regression gates, separating offline suites from online monitoring. Not manual ChatGPT testing.
import json
from dataclasses import dataclass
@dataclass
class EvalCase:
id: str
question: str
must_contain: list[str]
forbid_phrases: list[str]
expect_refusal: bool = False
expect_citation: bool = False
tags: list[str] | None = None
def load_cases(path: str) -> list[EvalCase]:
cases = []
with open(path) as f:
for line in f:
row = json.loads(line)
cases.append(EvalCase(**{k: row[k] for k in row if k != "prompt_version"}))
return cases
public record EvalCase(
String id,
String question,
List mustContain,
List forbidPhrases,
boolean expectRefusal,
boolean expectCitation,
List tags
) {
public static EvalCase fromJson(JsonNode n) {
return new EvalCase(
n.get("id").asText(),
n.get("question").asText(),
jsonStrings(n.get("must_contain")),
jsonStrings(n.get("forbid_phrases")),
n.path("expect_refusal").asBoolean(false),
n.path("expect_citation").asBoolean(false),
jsonStrings(n.path("tags"))
);
}
}
Harnesses treat the pipeline as answer = pipeline(question, config). Scorers inspect outputs only—architecture can evolve from completion to RAG to agents without rewriting every test.
Start with 20 failures from real tickets—not hypothetical questions. The best gold rows come from production bugs you already paid for.
Offline regression vs online monitoring
Offline eval runs against frozen gold before deploy—your safety net in CI. Online eval samples live traffic after deploy—catches distribution shift, new user phrasing, and doc updates not yet re-indexed. You need both; offline alone misses production-only failures, online alone ships broken prompts to the first thousand users.
| Mode | Input | When | Purpose |
|---|---|---|---|
| Offline regression | JSONL gold in git | PR, nightly, pre-release | Block regressions before merge |
| Online sampling | Production traces (1–5%) | Continuous post-deploy | Detect drift, new failure modes |
| Shadow eval | Live queries, new pipeline in parallel | Before flag flip | Compare v1 vs v2 on real traffic safely |
| Red-team batch | Adversarial prompts | Weekly / release gate | Safety regressions (Guide 4) |
Track 3 introduced prompt golden inputs; Track 2 introduced recall@k. Track 5 unifies them into a product-level eval strategy—PR smoke, nightly live runs, and human review of production samples. See Reliable output formatting for schema-level unit tests that complement answer-level evals.
Large gold sets increase confidence but require maintenance when product copy changes. Review gold quarterly; archive obsolete cases instead of letting pass rates lie.
The eval pyramid — unit, integration, human evals
Borrow the test pyramid from traditional QA: many fast cheap checks at the bottom, fewer expensive checks at the top. LLM evals stack the same way—unit scorers on strings, integration runs through the full pipeline, human review for nuance and safety.
Three layers
| Layer | What you test | When | Cost | Example |
|---|---|---|---|---|
| Unit | Scorers, parsers, schema validators | Every commit | Free | must_contain scorer on fixture text |
| Integration | Full pipeline: retrieve → prompt → model | PR + nightly | API tokens (live) or mock | 50-case JSONL smoke on PR; 500 nightly |
| Human | Rubric quality, tone, edge judgment | Weekly / on sample | People time | Review 1% production traces |
flowchart TB H[Human eval sampling] I[Integration pipeline evals] U[Unit scorers and schema tests] H --> I --> U
Unit layer — scorers and fixtures
Unit tests never call the LLM. They assert scorers behave: given answer text and gold expectations, return pass/fail. Also test JSONL loaders, report writers, and exit-code logic—same as testing pytest helpers.
Integration layer — mock vs live runners
Mock runners return deterministic fake answers—free CI on every PR, catches scorer and wiring regressions.Live runners call real RAG/agents—nightly or pre-release—catch model and retrieval drift mocks hide.
Human layer — when automation stops
Judges and rules miss empathy, subtle policy, and novel failure modes. Sample production traces into a review queue; label pass/fail with reasons; promote failures to gold JSONL. Human eval is not “skip automation”—it feeds the automation.
Smoke vs full suites
| Suite | Cases | Runner | Gate |
|---|---|---|---|
| support_smoke.jsonl | 8–15 | Mock (PR) + live (optional) | Blocks merge |
| support_full.jsonl | 100–500 | Live API | Nightly alert |
| safety_redteam.jsonl | 50+ adversarial | Live | Release gate |
| rag_recall.jsonl | Question + expected chunk IDs | Retrieval only | Index change gate |
Agent evals add a fourth suite type: agent_trajectory.jsonl with expected tool names and order—see Agents explained for why final-answer-only checks miss wrong tool paths.
def score_must_contain(answer: str, phrases: list[str]) -> tuple[bool, str]:
lower = answer.lower()
missing = [p for p in phrases if p.lower() not in lower]
if missing:
return False, f"missing phrases: {missing}"
return True, "ok"
def score_forbid(answer: str, forbidden: list[str]) -> tuple[bool, str]:
hits = [p for p in forbidden if p.lower() in answer.lower()]
if hits:
return False, f"forbidden phrases: {hits}"
return True, "ok"
public final class RuleScorers {
public static ScoreResult mustContain(String answer, List phrases) {
String lower = answer.toLowerCase();
List missing = phrases.stream()
.filter(p -> !lower.contains(p.toLowerCase()))
.toList();
return missing.isEmpty()
? ScoreResult.pass("ok")
: ScoreResult.fail("missing: " + missing);
}
}
Run mock integration on every PR (~seconds). Reserve live LLM integration for nightly jobs with budget caps—surprise $3k/month eval bills get CI disabled.
More gold cases improve coverage but slow CI. Tag cases: smoke (8 rows, PR gate) vs full (500 rows, nightly).
Pin model, prompt_version, and git_sha in report metadata so score deltas are attributable.
Gold files may contain realistic PII-shaped fixtures—keep in private repos; redact in public docs; never commit production customer transcripts verbatim.
Metrics vs vibes — quantitative iteration
Subjective iteration—“this answer feels better”—does not scale across engineers, models, or time zones. Metrics turn quality into numbers you can plot, diff in PR comments, and gate in CI: pass rate, per-tag breakdown, latency p95, cost per case, judge score distributions.
Core metrics for LLM apps
| Metric | Formula / meaning | Use when |
|---|---|---|
| Pass rate | passed_cases / total_cases | Primary CI gate |
| Delta vs baseline | pass_rate_branch − pass_rate_main | Block merge if < −2% |
| Per-tag pass rate | Pass rate grouped by billing, safety, RAG | Find regressions hidden in aggregate |
| Refusal rate | Cases where model refused / expect_refusal cases | Safety and policy evals |
| Latency p95 | 95th percentile pipeline ms | Performance regressions |
| Cost per eval run | Sum input+output tokens × price | Budget CI and nightly jobs |
From vibes to numbers — workflow
- Establish baseline pass rate on main with pinned config.
- Change one variable (prompt, chunk size, model)—not three at once.
- Re-run suite; compute delta and per-tag breakdown.
- Inspect failed case IDs—not only the aggregate score.
- Accept, revert, or update gold with documented rationale.
Plot pass rate over time in Grafana or your observability stack. A slow drift across prompt edits is invisible in vibes but obvious on a chart.
Leading vs lagging indicators
- Leading — offline pass rate delta, parse success rate, retrieval recall@5 on gold queries (predicts user pain before deploy).
- Lagging — support escalation rate, thumbs-down on bot replies, average handle time (confirms eval gaps after deploy).
When leading and lagging diverge—eval pass rate high but escalations rising—your gold set no longer represents real traffic. Refresh gold from production samples (Guide 5) before tuning prompts again.
Store thresholds in repo config: max_regression: 0.02, min_safety_pass: 1.0, min_billing_pass: 0.95. Avoid magic numbers scattered in shell scripts.
from dataclasses import dataclass, asdict
import json
@dataclass
class EvalReport:
suite: str
pass_rate: float
passed: int
failed: int
baseline_pass_rate: float
delta: float
failed_ids: list[str]
model: str
prompt_version: str
git_sha: str
def write_report(path: str, report: EvalReport) -> None:
with open(path, "w") as f:
json.dump(asdict(report), f, indent=2)
def should_fail_ci(report: EvalReport, max_regression: float = 0.02) -> bool:
return report.delta < -max_regression
public record EvalReport(
String suite,
double passRate,
int passed,
int failed,
double baselinePassRate,
double delta,
List failedIds,
String model,
String promptVersion,
String gitSha
) {
public boolean shouldFailCi(double maxRegression) {
return delta < -maxRegression;
}
}
Optimizing pass rate alone. A prompt that refuses everything passes safety cases and fails helpfulness—track multiple tags and weight business-critical rows higher.
Teams attach eval summary comments to PRs: “pass 94.2% (−0.3% vs main), failed: refund-monthly, jailbreak-07”. Reviewers scan deltas, not full transcripts.
Keep a baseline.json on main updated nightly. PR jobs compare against that artifact, not a stale local file.
“How do you know a prompt change helped?” — Golden set, before/after pass rate with pinned model, inspect failed IDs, per-tag breakdown—not subjective comparison.
Eval-driven loop — baseline → change → compare → ship
Treat eval like TDD for nondeterministic systems: capture expected behavior first (or after first bug), change one knob, measure, then ship only when metrics hold. The loop is baseline → change → compare → ship—same rhythm as performance benchmarks in backend engineering.
flowchart LR
B[Baseline run on main]
C[Change prompt RAG model]
R[Re-run gold suite]
D{Delta within threshold?}
S[Ship merge deploy]
F[Fix or revert]
B --> C --> R --> D
D -->|yes| S
D -->|no| F --> C
Step 1 — Baseline
Run full suite on main with production config. Store reports/baseline.json: pass rate, failed IDs, latency, token cost. This is your contract for “currently acceptable.”
Step 2 — Change
One change per experiment: new system prompt version, chunk size 512→768, swap gpt-4o-mini for Claude, add a tool. Log the hypothesis in the PR description.
Step 3 — Compare
Re-run the same suite. Compute delta. Open failed cases side-by-side: old answer vs new. Distinguish improvements (new pass), regressions (new fail), and gold drift (policy changed—update JSONL).
Step 4 — Ship
Merge when delta ≥ threshold (e.g. no worse than −1% overall, no safety tag below 100%). Tag prompt registry version; deploy with feature flag; monitor online metrics (Guide 5).
Promote production failures into gold
When support escalates a bad bot answer, capture question + expected behavior + actual failure mode. Add JSONL row with same id pattern. That closes the loop—eval suites grow from real pain.
Example PR workflow
- Engineer bumps support_system_v3 → v4 in prompt registry.
- CI runs support_smoke.jsonl with mock runner (free) — wiring OK.
- Optional: CI runs 8-case live smoke if EVAL_LIVE=1 on protected branches.
- Nightly job runs full suite; posts pass rate to Slack if delta < −1% vs baseline.
- On merge, deploy behind flag; online sampler watches thumbs-down for 24h before full rollout.
RAG pipeline changes follow the same loop but add retrieval metrics first—if recall@5 drops, fix chunking or hybrid search before rewriting the generation prompt. See RAG explained eval basics and Guide 3 in this track for RAGAS-style decomposition.
import sys
from eval.lib.load_cases import load_cases
from eval.scorers import score_case
from runners.live_support import answer as pipeline
def run_suite(gold_path: str, baseline_rate: float) -> int:
cases = load_cases(gold_path)
passed, failed_ids = 0, []
for case in cases:
ans = pipeline(case.question)
ok, _ = score_case(ans, case)
if ok:
passed += 1
else:
failed_ids.append(case.id)
rate = passed / len(cases)
delta = rate - baseline_rate
print(f"pass_rate={rate:.3f} delta={delta:+.3f} failed={failed_ids}")
return 1 if delta < -0.02 else 0
if __name__ == "__main__":
sys.exit(run_suite("eval/gold/support_smoke.jsonl", baseline_rate=0.94))
@Service
public class EvalSuiteRunner {
public int runAndExitCode(Path goldPath, double baselineRate) {
List cases = caseLoader.load(goldPath);
List failed = new ArrayList<>();
int passed = 0;
for (EvalCase c : cases) {
String ans = pipeline.answer(c.question());
if (scorer.score(ans, c).passed()) passed++;
else failed.add(c.id());
}
double rate = (double) passed / cases.size();
double delta = rate - baselineRate;
log.info("pass_rate={} delta={} failed={}", rate, delta, failed);
return delta < -0.02 ? 1 : 0;
}
}
Strict CI gates slow iteration early; loose gates ship regressions. Start with smoke-only blocking; tighten thresholds as gold coverage matures.
Eval-driven development does not require perfect gold on day one. Start with 10 known-bad production tickets; expand weekly—coverage is a product, not a one-time project.
Batch baseline + compare in one nightly job when API cost matters; PRs run mock runner only, comparing mock pass rates for wiring regressions.
Prompt registries link each version to eval scores at promote time—v4 shipped at 96.1%, v5 rejected at 93.8% on billing tag before any user saw v5.
Eval types — reference-based, LLM-as-judge, human, task-based
No single scorer fits every question. Production stacks combine four families—each with strengths, costs, and failure modes. Pick scorers by what you need to prove: exact policy phrases, semantic equivalence, human judgment, or end-to-end task success.
| Type | Mechanism | Best for | Weakness |
|---|---|---|---|
| Reference-based | Rules, regex, must_contain, exact JSON schema | Policy phrases, format, refusals | Misses valid paraphrase |
| Embedding similarity | Cosine vs reference answer or chunk | Soft equivalence when wording varies | False pass on wrong facts, similar prose |
| LLM-as-judge | Rubric prompt scores 1–5 or pass/fail | Helpfulness, tone, multi-criteria | Judge bias, cost, inconsistency |
| Human eval | Reviewers with rubric | Gold standard, calibrating judges | Slow, expensive |
| Task-based | Did the agent complete the task? | Tool use, ticket resolved, SQL row updated | Needs sandbox or mock APIs |
Reference-based evals
Fast, deterministic, CI-friendly. Encode legal/compliance language literally: “partial months,” “30-day window.” Combine with forbid_phrases for hallucinated guarantees. Track 3 formatting guides pair with schema validators for JSON outputs.
LLM-as-judge
When rules cannot enumerate good answers, a separate model scores against a rubric: “Answer cites policy, refuses harmful requests, does not invent refund amounts.” Calibrate judges against human labels; use chain-of-thought rubrics with structured JSON output. Deep dive in LLM-as-judge.
Human eval
Sample 50–200 cases for blind A/B comparison between prompt versions. Human scores become labels to tune judges and detect judge drift. Never skip human review for high-stakes launches—automate the long tail only.
Task-based evals
For agents: assert tool sequence contains lookup_order before search_policy; mock CRM returns expected state; ticket marked resolved in sandbox. Final answer quality is insufficient when wrong tools ran.
Mapping eval types to Tracks 1–4
| Product layer | Primary eval types | Track reference |
|---|---|---|
| Chat / completion | Reference rules, LLM judge | LLMs explained |
| RAG copilot | Recall@k, faithfulness, reference rules | RAG explained |
| Prompted workflows | Schema validation, golden inputs | Core prompting |
| Agents | Task-based trajectories, tool selection | Tool calling |
Composite pipelines stack scorers: retrieval recall (unit/integration on index), generation faithfulness (judge), policy phrases (rules), safety (red-team reference + judge). Failures in composite stacks need failure attribution—log which stage failed so you do not “fix” prompts when retrieval broke.
flowchart TD Q[What must you prove?] Q -->|Exact policy text| R[Reference rules] Q -->|Paraphrase OK| E[Embedding similarity] Q -->|Subjective quality| J[LLM-as-judge] Q -->|High stakes launch| H[Human rubric] Q -->|Agent completed action| T[Task-based sandbox]
def score_case(answer: str, case: EvalCase, use_judge: bool = False) -> tuple[bool, str]:
ok, msg = score_must_contain(answer, case.must_contain)
if not ok:
return ok, msg
ok, msg = score_forbid(answer, case.forbid_phrases)
if not ok:
return ok, msg
if case.expect_refusal and not looks_like_refusal(answer):
return False, "expected refusal"
if use_judge and case.tags and "subjective" in case.tags:
return llm_judge(answer, case.question, rubric=JUDGE_RUBRIC)
return True, "ok"
public ScoreResult scoreCase(String answer, EvalCase c, boolean useJudge) {
ScoreResult r = RuleScorers.mustContain(answer, c.mustContain());
if (!r.passed()) return r;
r = RuleScorers.forbid(answer, c.forbidPhrases());
if (!r.passed()) return r;
if (c.expectRefusal() && !RefusalDetector.isRefusal(answer))
return ScoreResult.fail("expected refusal");
if (useJudge && c.tags() != null && c.tags().contains("subjective"))
return llmJudge.score(answer, c.question(), JUDGE_RUBRIC);
return ScoreResult.pass("ok");
}
LLM-as-judge for everything on every PR—cost explodes and judges correlate with the model under test. Rules first; judges for nightly subjective tags.
Task-based evals in sandboxes still need auth—do not point agents at production CRM with eval credentials that bypass audit.
“Reference vs LLM judge?” — Rules for compliance and format; judges for helpfulness when paraphrase varies; human labels calibrate judges; task-based for agents.
Tag gold rows with required scorer: "scorers": ["rules"] vs ["rules", "judge"] so PR jobs skip expensive judges.
What this looks like in production
A notebook that scores five answers is a spike. Production eval is a platform: versioned gold sets, scheduled runners, report artifacts, CI gates on prompt/RAG/agent repos, online sampling, and a process to promote failures into regression cases—before and after launch.
Architecture sketch
flowchart LR GIT[Gold JSONL in git] CI[CI smoke runner] NIGHT[Nightly live runner] RPT[Report artifact] OBS[Online trace sampling] GOLD[Promote to gold] GIT --> CI --> RPT GIT --> NIGHT --> RPT OBS --> GOLD --> GIT
Eval readiness checklist
- Gold JSONL in git with unique id per case; schema validated in unit tests
- Smoke suite (≤15 cases) runs on every PR—mock or live with budget cap
- Full suite nightly with pinned model and prompt_version
- CI fails when pass rate drops more than agreed threshold vs baseline
- Report artifact uploaded: pass rate, delta, failed IDs, latency, token cost
- Per-tag breakdown for billing, safety, RAG, agent trajectories
- Rule scorers for policy phrases; judges scheduled—not on every PR by default
- Human review queue for 1–5% production samples
- Process to add production failures as new gold rows within SLA
- Prompt/registry version linked to eval score at deploy time
- Feature flag rollback when online metrics diverge from offline eval
- RAG evals separate retrieval recall from generation faithfulness (Guide 3)
Minimum viable eval platform (week one)
You do not need LangSmith or Braintrust on day one. Minimum stack:
- eval/gold/*.jsonl — 20 cases from real tickets
- eval/scorers.py — rule scorers + unit tests
- runners/mock_*.py — deterministic pipeline stub
- eval/run_suite.py — exit code 0/1
- .github/workflows/eval-smoke.yml — runs on PR
- reports/latest.json — CI artifact
Upgrade to live runners, judges, and observability exporters as coverage and traffic grow—Guide 6 walks CI wiring end-to-end.
Logging every eval run
| Field | Why |
|---|---|
| run_id, git_sha | Reproduce any report |
| suite, runner (mock/live) | Explain mock vs live deltas |
| pass_rate, delta | CI gate input |
| failed_case_ids[] | Jump to debugging |
| model, prompt_version | Attribute behavior changes |
| total_tokens, cost_usd | FinOps for eval spend |
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel
app = FastAPI()
ADMIN_TOKEN = "rotate-me"
class EvalTrigger(BaseModel):
suite: str = "support_smoke"
runner: str = "mock"
@app.post("/internal/eval/run")
def trigger_eval(body: EvalTrigger, x_admin_token: str = Header(...)):
if x_admin_token != ADMIN_TOKEN:
raise HTTPException(403)
report = run_suite(body.suite, runner=body.runner)
return {"run_id": report.run_id, "pass_rate": report.pass_rate}
@RestController
@RequestMapping("/internal/eval")
public class EvalController {
@PostMapping("/run")
public EvalRunResponse trigger(@RequestBody EvalTrigger body,
@RequestHeader("X-Admin-Token") String token) {
if (!adminToken.equals(token)) throw new ResponseStatusException(HttpStatus.FORBIDDEN);
EvalReport report = suiteRunner.run(body.suite(), body.runner());
return new EvalRunResponse(report.runId(), report.passRate());
}
}
Nightly eval as CronJob; mount gold ConfigMap or clone repo; upload reports to S3; Slack webhook on regression vs baseline.
Terraform CI budget alarms on eval token spend; separate AWS/GCP project for eval API keys with hard quotas.
Cap nightly live eval at N cases × M tokens; overflow queue to weekend batch—unbounded eval jobs have bankrupted side projects.
Mature teams treat eval regressions like P1 bugs—billing tag drop from 100% to 97% blocks release even if aggregate pass rate still looks fine.
Mature teams treat eval regressions like P1 bugs—billing tag drop from 100% to 97% blocks release even if aggregate pass rate still looks fine.
“How do you ship LLM features safely?” — Golden sets, CI gates, offline + online eval, promote production failures to gold, separate RAG retrieval from generation metrics, feature-flag rollback tied to eval drift.
Publish eval pass-rate badges in internal docs next to each prompt version—makes “which template is production-safe?” a one-glance question for PMs and support leads.