Evaluation explained

Why evaluation — the hardest part of LLM engineering

Calling an API is easy. Writing a prompt that works on three demo questions is easy. Knowing whether your system still works after the tenth prompt change, a model upgrade, or a new document corpus—that is hard. Evaluation is the discipline that turns LLM features from demos into software you can ship and maintain.

The demo vs production gap

Every LLM product passes a hallway test: three beautiful answers, everyone applauds. Production bots fail on the fourth ticket—monthly refund edge cases, jailbreaks, missing KB topics, or wrong RAG chunks. Without evals, you learn from angry users, not CI.

Traditional software: same input → same bytes. LLMs sample from a distribution—paraphrase is normal. Evals use scorers—rules, embeddings, judges—that tolerate wording while catching policy violations.

The vibe-check trap

The vibe check is the most common failure mode: after every prompt edit, someone opens a chat UI, asks five questions, squints, declares “looks good.” That workflow is non-reproducible, non-comparable, biased toward recent edits, and invisible in CI.

Escape with a golden dataset: 30–200 real questions as JSONL in git with must_contain, forbid_phrases, refusal flags, citation requirements. Run on every PR. When a case fails, fix the pipeline or update gold deliberately.

Signal	Vibe check	Structured eval
Repeatability	Ad hoc questions	Same JSONL every run
Regression detection	Often missed	Pass rate delta blocks merge
Attribution	“Feels off”	Case ID + scorer + prompt version
Onboarding	Tribal knowledge	make eval in README

flowchart LR
  subgraph trap["Vibe-check trap"]
    E1[Edit prompt] --> M1[Manual chat test]
    M1 --> S1[Ship on feeling]
    S1 --> E1
  end
  subgraph eval["Eval discipline"]
    E2[Edit prompt] --> R2[Run gold JSONL]
    R2 --> C2{Pass rate OK?}
    C2 -->|yes| S2[Merge]
    C2 -->|no| F2[Fix or update gold]
    F2 --> E2
  end

Why eval is harder than building the feature

Ground truth is fuzzy — many valid answers; scorers encode policy.
Multi-component failures — bad answers may be retrieval, prompt, model, or tools.
Eval drift — gold sets stale when policy changes; judges need calibration.
Cost vs coverage — live CI calls are expensive; mocks miss real failures.
Agent trajectories — multi-step paths need trajectory assertions, not answer-only checks.

⚠️ Pitfall

Shipping after playground testing. Playground sessions skip your RAG index, production prompt version, tool allowlist, and model route. Eval the deployed pipeline, not an ad-hoc chat tab.

📦 Real World

Support teams at scale maintain hundreds of golden scenarios. Prompt merges require pass-rate thresholds; production failures become new gold rows within days.

🎯 Interview Tip

“What is LLM evaluation?” — Frozen cases + automated scorers + CI regression gates, separating offline suites from online monitoring. Not manual ChatGPT testing.

import json
from dataclasses import dataclass

@dataclass
class EvalCase:
    id: str
    question: str
    must_contain: list[str]
    forbid_phrases: list[str]
    expect_refusal: bool = False
    expect_citation: bool = False
    tags: list[str] | None = None

def load_cases(path: str) -> list[EvalCase]:
    cases = []
    with open(path) as f:
        for line in f:
            row = json.loads(line)
            cases.append(EvalCase(**{k: row[k] for k in row if k != "prompt_version"}))
    return cases

public record EvalCase(
    String id,
    String question,
    List mustContain,
    List forbidPhrases,
    boolean expectRefusal,
    boolean expectCitation,
    List tags
) {
  public static EvalCase fromJson(JsonNode n) {
    return new EvalCase(
        n.get("id").asText(),
        n.get("question").asText(),
        jsonStrings(n.get("must_contain")),
        jsonStrings(n.get("forbid_phrases")),
        n.path("expect_refusal").asBoolean(false),
        n.path("expect_citation").asBoolean(false),
        jsonStrings(n.path("tags"))
    );
  }
}

🔬 Under the Hood

Harnesses treat the pipeline as answer = pipeline(question, config). Scorers inspect outputs only—architecture can evolve from completion to RAG to agents without rewriting every test.

💡 Pro Tip

Start with 20 failures from real tickets—not hypothetical questions. The best gold rows come from production bugs you already paid for.

Offline regression vs online monitoring

Offline eval runs against frozen gold before deploy—your safety net in CI. Online eval samples live traffic after deploy—catches distribution shift, new user phrasing, and doc updates not yet re-indexed. You need both; offline alone misses production-only failures, online alone ships broken prompts to the first thousand users.

Mode	Input	When	Purpose
Offline regression	JSONL gold in git	PR, nightly, pre-release	Block regressions before merge
Online sampling	Production traces (1–5%)	Continuous post-deploy	Detect drift, new failure modes
Shadow eval	Live queries, new pipeline in parallel	Before flag flip	Compare v1 vs v2 on real traffic safely
Red-team batch	Adversarial prompts	Weekly / release gate	Safety regressions (Guide 4)

Track 3 introduced prompt golden inputs; Track 2 introduced recall@k. Track 5 unifies them into a product-level eval strategy—PR smoke, nightly live runs, and human review of production samples. See Reliable output formatting for schema-level unit tests that complement answer-level evals.

⚖️ Trade-off

Large gold sets increase confidence but require maintenance when product copy changes. Review gold quarterly; archive obsolete cases instead of letting pass rates lie.

The eval pyramid — unit, integration, human evals

Borrow the test pyramid from traditional QA: many fast cheap checks at the bottom, fewer expensive checks at the top. LLM evals stack the same way—unit scorers on strings, integration runs through the full pipeline, human review for nuance and safety.

Three layers

Layer	What you test	When	Cost	Example
Unit	Scorers, parsers, schema validators	Every commit	Free	must_contain scorer on fixture text
Integration	Full pipeline: retrieve → prompt → model	PR + nightly	API tokens (live) or mock	50-case JSONL smoke on PR; 500 nightly
Human	Rubric quality, tone, edge judgment	Weekly / on sample	People time	Review 1% production traces

flowchart TB
  H[Human eval sampling]
  I[Integration pipeline evals]
  U[Unit scorers and schema tests]
  H --> I --> U

Unit layer — scorers and fixtures

Unit tests never call the LLM. They assert scorers behave: given answer text and gold expectations, return pass/fail. Also test JSONL loaders, report writers, and exit-code logic—same as testing pytest helpers.

Integration layer — mock vs live runners

Mock runners return deterministic fake answers—free CI on every PR, catches scorer and wiring regressions.Live runners call real RAG/agents—nightly or pre-release—catch model and retrieval drift mocks hide.

Human layer — when automation stops

Judges and rules miss empathy, subtle policy, and novel failure modes. Sample production traces into a review queue; label pass/fail with reasons; promote failures to gold JSONL. Human eval is not “skip automation”—it feeds the automation.

Smoke vs full suites

Suite	Cases	Runner	Gate
support_smoke.jsonl	8–15	Mock (PR) + live (optional)	Blocks merge
support_full.jsonl	100–500	Live API	Nightly alert
safety_redteam.jsonl	50+ adversarial	Live	Release gate
rag_recall.jsonl	Question + expected chunk IDs	Retrieval only	Index change gate

Agent evals add a fourth suite type: agent_trajectory.jsonl with expected tool names and order—see Agents explained for why final-answer-only checks miss wrong tool paths.

def score_must_contain(answer: str, phrases: list[str]) -> tuple[bool, str]:
    lower = answer.lower()
    missing = [p for p in phrases if p.lower() not in lower]
    if missing:
        return False, f"missing phrases: {missing}"
    return True, "ok"

def score_forbid(answer: str, forbidden: list[str]) -> tuple[bool, str]:
    hits = [p for p in forbidden if p.lower() in answer.lower()]
    if hits:
        return False, f"forbidden phrases: {hits}"
    return True, "ok"

public final class RuleScorers {
  public static ScoreResult mustContain(String answer, List phrases) {
    String lower = answer.toLowerCase();
    List missing = phrases.stream()
        .filter(p -> !lower.contains(p.toLowerCase()))
        .toList();
    return missing.isEmpty()
        ? ScoreResult.pass("ok")
        : ScoreResult.fail("missing: " + missing);
  }
}

💰 Cost

Run mock integration on every PR (~seconds). Reserve live LLM integration for nightly jobs with budget caps—surprise $3k/month eval bills get CI disabled.

⚖️ Trade-off

More gold cases improve coverage but slow CI. Tag cases: smoke (8 rows, PR gate) vs full (500 rows, nightly).

⚙️ Config

Pin model, prompt_version, and git_sha in report metadata so score deltas are attributable.

🔒 Security

Gold files may contain realistic PII-shaped fixtures—keep in private repos; redact in public docs; never commit production customer transcripts verbatim.

Metrics vs vibes — quantitative iteration

Subjective iteration—“this answer feels better”—does not scale across engineers, models, or time zones. Metrics turn quality into numbers you can plot, diff in PR comments, and gate in CI: pass rate, per-tag breakdown, latency p95, cost per case, judge score distributions.

Core metrics for LLM apps

Metric	Formula / meaning	Use when
Pass rate	passed_cases / total_cases	Primary CI gate
Delta vs baseline	pass_rate_branch − pass_rate_main	Block merge if < −2%
Per-tag pass rate	Pass rate grouped by billing, safety, RAG	Find regressions hidden in aggregate
Refusal rate	Cases where model refused / expect_refusal cases	Safety and policy evals
Latency p95	95th percentile pipeline ms	Performance regressions
Cost per eval run	Sum input+output tokens × price	Budget CI and nightly jobs

From vibes to numbers — workflow

Establish baseline pass rate on main with pinned config.
Change one variable (prompt, chunk size, model)—not three at once.
Re-run suite; compute delta and per-tag breakdown.
Inspect failed case IDs—not only the aggregate score.
Accept, revert, or update gold with documented rationale.

Plot pass rate over time in Grafana or your observability stack. A slow drift across prompt edits is invisible in vibes but obvious on a chart.

Leading vs lagging indicators

Leading — offline pass rate delta, parse success rate, retrieval recall@5 on gold queries (predicts user pain before deploy).
Lagging — support escalation rate, thumbs-down on bot replies, average handle time (confirms eval gaps after deploy).

When leading and lagging diverge—eval pass rate high but escalations rising—your gold set no longer represents real traffic. Refresh gold from production samples (Guide 5) before tuning prompts again.

⚙️ Config

Store thresholds in repo config: max_regression: 0.02, min_safety_pass: 1.0, min_billing_pass: 0.95. Avoid magic numbers scattered in shell scripts.

from dataclasses import dataclass, asdict
import json

@dataclass
class EvalReport:
    suite: str
    pass_rate: float
    passed: int
    failed: int
    baseline_pass_rate: float
    delta: float
    failed_ids: list[str]
    model: str
    prompt_version: str
    git_sha: str

def write_report(path: str, report: EvalReport) -> None:
    with open(path, "w") as f:
        json.dump(asdict(report), f, indent=2)

def should_fail_ci(report: EvalReport, max_regression: float = 0.02) -> bool:
    return report.delta < -max_regression

public record EvalReport(
    String suite,
    double passRate,
    int passed,
    int failed,
    double baselinePassRate,
    double delta,
    List failedIds,
    String model,
    String promptVersion,
    String gitSha
) {
  public boolean shouldFailCi(double maxRegression) {
    return delta < -maxRegression;
  }
}

⚠️ Pitfall

Optimizing pass rate alone. A prompt that refuses everything passes safety cases and fails helpfulness—track multiple tags and weight business-critical rows higher.

📦 Real World

Teams attach eval summary comments to PRs: “pass 94.2% (−0.3% vs main), failed: refund-monthly, jailbreak-07”. Reviewers scan deltas, not full transcripts.

💡 Pro Tip

Keep a baseline.json on main updated nightly. PR jobs compare against that artifact, not a stale local file.

🎯 Interview Tip

“How do you know a prompt change helped?” — Golden set, before/after pass rate with pinned model, inspect failed IDs, per-tag breakdown—not subjective comparison.

Eval-driven loop — baseline → change → compare → ship

Treat eval like TDD for nondeterministic systems: capture expected behavior first (or after first bug), change one knob, measure, then ship only when metrics hold. The loop is baseline → change → compare → ship—same rhythm as performance benchmarks in backend engineering.

flowchart LR
  B[Baseline run on main]
  C[Change prompt RAG model]
  R[Re-run gold suite]
  D{Delta within threshold?}
  S[Ship merge deploy]
  F[Fix or revert]
  B --> C --> R --> D
  D -->|yes| S
  D -->|no| F --> C

Step 1 — Baseline

Run full suite on main with production config. Store reports/baseline.json: pass rate, failed IDs, latency, token cost. This is your contract for “currently acceptable.”

Step 2 — Change

One change per experiment: new system prompt version, chunk size 512→768, swap gpt-4o-mini for Claude, add a tool. Log the hypothesis in the PR description.

Step 3 — Compare

Re-run the same suite. Compute delta. Open failed cases side-by-side: old answer vs new. Distinguish improvements (new pass), regressions (new fail), and gold drift (policy changed—update JSONL).

Step 4 — Ship

Merge when delta ≥ threshold (e.g. no worse than −1% overall, no safety tag below 100%). Tag prompt registry version; deploy with feature flag; monitor online metrics (Guide 5).

Promote production failures into gold

When support escalates a bad bot answer, capture question + expected behavior + actual failure mode. Add JSONL row with same id pattern. That closes the loop—eval suites grow from real pain.

Example PR workflow

Engineer bumps support_system_v3 → v4 in prompt registry.
CI runs support_smoke.jsonl with mock runner (free) — wiring OK.
Optional: CI runs 8-case live smoke if EVAL_LIVE=1 on protected branches.
Nightly job runs full suite; posts pass rate to Slack if delta < −1% vs baseline.
On merge, deploy behind flag; online sampler watches thumbs-down for 24h before full rollout.

RAG pipeline changes follow the same loop but add retrieval metrics first—if recall@5 drops, fix chunking or hybrid search before rewriting the generation prompt. See RAG explained eval basics and Guide 3 in this track for RAGAS-style decomposition.

import sys
from eval.lib.load_cases import load_cases
from eval.scorers import score_case
from runners.live_support import answer as pipeline

def run_suite(gold_path: str, baseline_rate: float) -> int:
    cases = load_cases(gold_path)
    passed, failed_ids = 0, []
    for case in cases:
        ans = pipeline(case.question)
        ok, _ = score_case(ans, case)
        if ok:
            passed += 1
        else:
            failed_ids.append(case.id)
    rate = passed / len(cases)
    delta = rate - baseline_rate
    print(f"pass_rate={rate:.3f} delta={delta:+.3f} failed={failed_ids}")
    return 1 if delta < -0.02 else 0

if __name__ == "__main__":
    sys.exit(run_suite("eval/gold/support_smoke.jsonl", baseline_rate=0.94))

@Service
public class EvalSuiteRunner {
  public int runAndExitCode(Path goldPath, double baselineRate) {
    List cases = caseLoader.load(goldPath);
    List failed = new ArrayList<>();
    int passed = 0;
    for (EvalCase c : cases) {
      String ans = pipeline.answer(c.question());
      if (scorer.score(ans, c).passed()) passed++;
      else failed.add(c.id());
    }
    double rate = (double) passed / cases.size();
    double delta = rate - baselineRate;
    log.info("pass_rate={} delta={} failed={}", rate, delta, failed);
    return delta < -0.02 ? 1 : 0;
  }
}

⚖️ Trade-off

Strict CI gates slow iteration early; loose gates ship regressions. Start with smoke-only blocking; tighten thresholds as gold coverage matures.

🔬 Under the Hood

Eval-driven development does not require perfect gold on day one. Start with 10 known-bad production tickets; expand weekly—coverage is a product, not a one-time project.

💰 Cost

Batch baseline + compare in one nightly job when API cost matters; PRs run mock runner only, comparing mock pass rates for wiring regressions.

📦 Real World

Prompt registries link each version to eval scores at promote time—v4 shipped at 96.1%, v5 rejected at 93.8% on billing tag before any user saw v5.

Eval types — reference-based, LLM-as-judge, human, task-based

No single scorer fits every question. Production stacks combine four families—each with strengths, costs, and failure modes. Pick scorers by what you need to prove: exact policy phrases, semantic equivalence, human judgment, or end-to-end task success.

Type	Mechanism	Best for	Weakness
Reference-based	Rules, regex, must_contain, exact JSON schema	Policy phrases, format, refusals	Misses valid paraphrase
Embedding similarity	Cosine vs reference answer or chunk	Soft equivalence when wording varies	False pass on wrong facts, similar prose
LLM-as-judge	Rubric prompt scores 1–5 or pass/fail	Helpfulness, tone, multi-criteria	Judge bias, cost, inconsistency
Human eval	Reviewers with rubric	Gold standard, calibrating judges	Slow, expensive
Task-based	Did the agent complete the task?	Tool use, ticket resolved, SQL row updated	Needs sandbox or mock APIs

Reference-based evals

Fast, deterministic, CI-friendly. Encode legal/compliance language literally: “partial months,” “30-day window.” Combine with forbid_phrases for hallucinated guarantees. Track 3 formatting guides pair with schema validators for JSON outputs.

LLM-as-judge

When rules cannot enumerate good answers, a separate model scores against a rubric: “Answer cites policy, refuses harmful requests, does not invent refund amounts.” Calibrate judges against human labels; use chain-of-thought rubrics with structured JSON output. Deep dive in LLM-as-judge.

Human eval

Sample 50–200 cases for blind A/B comparison between prompt versions. Human scores become labels to tune judges and detect judge drift. Never skip human review for high-stakes launches—automate the long tail only.

Task-based evals

For agents: assert tool sequence contains lookup_order before search_policy; mock CRM returns expected state; ticket marked resolved in sandbox. Final answer quality is insufficient when wrong tools ran.

Mapping eval types to Tracks 1–4

Product layer	Primary eval types	Track reference
Chat / completion	Reference rules, LLM judge	LLMs explained
RAG copilot	Recall@k, faithfulness, reference rules	RAG explained
Prompted workflows	Schema validation, golden inputs	Core prompting
Agents	Task-based trajectories, tool selection	Tool calling

Composite pipelines stack scorers: retrieval recall (unit/integration on index), generation faithfulness (judge), policy phrases (rules), safety (red-team reference + judge). Failures in composite stacks need failure attribution—log which stage failed so you do not “fix” prompts when retrieval broke.

flowchart TD
  Q[What must you prove?]
  Q -->|Exact policy text| R[Reference rules]
  Q -->|Paraphrase OK| E[Embedding similarity]
  Q -->|Subjective quality| J[LLM-as-judge]
  Q -->|High stakes launch| H[Human rubric]
  Q -->|Agent completed action| T[Task-based sandbox]

def score_case(answer: str, case: EvalCase, use_judge: bool = False) -> tuple[bool, str]:
    ok, msg = score_must_contain(answer, case.must_contain)
    if not ok:
        return ok, msg
    ok, msg = score_forbid(answer, case.forbid_phrases)
    if not ok:
        return ok, msg
    if case.expect_refusal and not looks_like_refusal(answer):
        return False, "expected refusal"
    if use_judge and case.tags and "subjective" in case.tags:
        return llm_judge(answer, case.question, rubric=JUDGE_RUBRIC)
    return True, "ok"

public ScoreResult scoreCase(String answer, EvalCase c, boolean useJudge) {
  ScoreResult r = RuleScorers.mustContain(answer, c.mustContain());
  if (!r.passed()) return r;
  r = RuleScorers.forbid(answer, c.forbidPhrases());
  if (!r.passed()) return r;
  if (c.expectRefusal() && !RefusalDetector.isRefusal(answer))
    return ScoreResult.fail("expected refusal");
  if (useJudge && c.tags() != null && c.tags().contains("subjective"))
    return llmJudge.score(answer, c.question(), JUDGE_RUBRIC);
  return ScoreResult.pass("ok");
}

⚠️ Pitfall

LLM-as-judge for everything on every PR—cost explodes and judges correlate with the model under test. Rules first; judges for nightly subjective tags.

🔒 Security

Task-based evals in sandboxes still need auth—do not point agents at production CRM with eval credentials that bypass audit.

🎯 Interview Tip

“Reference vs LLM judge?” — Rules for compliance and format; judges for helpfulness when paraphrase varies; human labels calibrate judges; task-based for agents.

💡 Pro Tip

Tag gold rows with required scorer: "scorers": ["rules"] vs ["rules", "judge"] so PR jobs skip expensive judges.

What this looks like in production

A notebook that scores five answers is a spike. Production eval is a platform: versioned gold sets, scheduled runners, report artifacts, CI gates on prompt/RAG/agent repos, online sampling, and a process to promote failures into regression cases—before and after launch.

Architecture sketch

flowchart LR
  GIT[Gold JSONL in git]
  CI[CI smoke runner]
  NIGHT[Nightly live runner]
  RPT[Report artifact]
  OBS[Online trace sampling]
  GOLD[Promote to gold]
  GIT --> CI --> RPT
  GIT --> NIGHT --> RPT
  OBS --> GOLD --> GIT

Eval readiness checklist

Gold JSONL in git with unique id per case; schema validated in unit tests
Smoke suite (≤15 cases) runs on every PR—mock or live with budget cap
Full suite nightly with pinned model and prompt_version
CI fails when pass rate drops more than agreed threshold vs baseline
Report artifact uploaded: pass rate, delta, failed IDs, latency, token cost
Per-tag breakdown for billing, safety, RAG, agent trajectories
Rule scorers for policy phrases; judges scheduled—not on every PR by default
Human review queue for 1–5% production samples
Process to add production failures as new gold rows within SLA
Prompt/registry version linked to eval score at deploy time
Feature flag rollback when online metrics diverge from offline eval
RAG evals separate retrieval recall from generation faithfulness (Guide 3)

Minimum viable eval platform (week one)

You do not need LangSmith or Braintrust on day one. Minimum stack:

eval/gold/*.jsonl — 20 cases from real tickets
eval/scorers.py — rule scorers + unit tests
runners/mock_*.py — deterministic pipeline stub
eval/run_suite.py — exit code 0/1
.github/workflows/eval-smoke.yml — runs on PR
reports/latest.json — CI artifact

Upgrade to live runners, judges, and observability exporters as coverage and traffic grow—Guide 6 walks CI wiring end-to-end.

Logging every eval run

Field	Why
run_id, git_sha	Reproduce any report
suite, runner (mock/live)	Explain mock vs live deltas
pass_rate, delta	CI gate input
failed_case_ids[]	Jump to debugging
model, prompt_version	Attribute behavior changes
total_tokens, cost_usd	FinOps for eval spend

from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()
ADMIN_TOKEN = "rotate-me"

class EvalTrigger(BaseModel):
    suite: str = "support_smoke"
    runner: str = "mock"

@app.post("/internal/eval/run")
def trigger_eval(body: EvalTrigger, x_admin_token: str = Header(...)):
    if x_admin_token != ADMIN_TOKEN:
        raise HTTPException(403)
    report = run_suite(body.suite, runner=body.runner)
    return {"run_id": report.run_id, "pass_rate": report.pass_rate}

@RestController
@RequestMapping("/internal/eval")
public class EvalController {
  @PostMapping("/run")
  public EvalRunResponse trigger(@RequestBody EvalTrigger body,
      @RequestHeader("X-Admin-Token") String token) {
    if (!adminToken.equals(token)) throw new ResponseStatusException(HttpStatus.FORBIDDEN);
    EvalReport report = suiteRunner.run(body.suite(), body.runner());
    return new EvalRunResponse(report.runId(), report.passRate());
  }
}

☸️ OpenShift / K8s

Nightly eval as CronJob; mount gold ConfigMap or clone repo; upload reports to S3; Slack webhook on regression vs baseline.

🏗️ IaC

Terraform CI budget alarms on eval token spend; separate AWS/GCP project for eval API keys with hard quotas.

💰 Cost

Cap nightly live eval at N cases × M tokens; overflow queue to weekend batch—unbounded eval jobs have bankrupted side projects.

📦 Real World

Mature teams treat eval regressions like P1 bugs—billing tag drop from 100% to 97% blocks release even if aggregate pass rate still looks fine.

📦 Real World

Mature teams treat eval regressions like P1 bugs—billing tag drop from 100% to 97% blocks release even if aggregate pass rate still looks fine.

🎯 Interview Tip

“How do you ship LLM features safely?” — Golden sets, CI gates, offline + online eval, promote production failures to gold, separate RAG retrieval from generation metrics, feature-flag rollback tied to eval drift.

💡 Pro Tip

Publish eval pass-rate badges in internal docs next to each prompt version—makes “which template is production-safe?” a one-glance question for PMs and support leads.