Evaluation explained

Evaluation is automated proof that your LLM pipeline still behaves on a frozen set of tickets—not a vibe check after changing one prompt line. Unlike traditional software, outputs are stochastic; evals use scorers that tolerate paraphrase while catching regressions. Golden cases live in git, reports land in CI artifacts, and non-zero exit codes block merges when quality drops.

After reading, you should be able to: explain why eval is often the hardest part of LLM engineering; name the vibe-check trap; map unit, integration, and human eval layers; choose metrics over subjective iteration; run baseline→change→compare→ship; distinguish reference-based, LLM-as-judge, human, and task-based eval types; and apply a production eval readiness checklist.

Why evaluation — the hardest part of LLM engineering

Calling an API is easy. Writing a prompt that works on three demo questions is easy. Knowing whether your system still works after the tenth prompt change, a model upgrade, or a new document corpus—that is hard. Evaluation is the discipline that turns LLM features from demos into software you can ship and maintain.

The demo vs production gap

Every LLM product passes a hallway test: three beautiful answers, everyone applauds. Production bots fail on the fourth ticket—monthly refund edge cases, jailbreaks, missing KB topics, or wrong RAG chunks. Without evals, you learn from angry users, not CI.

Traditional software: same input → same bytes. LLMs sample from a distribution—paraphrase is normal. Evals use scorers—rules, embeddings, judges—that tolerate wording while catching policy violations.

The vibe-check trap

The vibe check is the most common failure mode: after every prompt edit, someone opens a chat UI, asks five questions, squints, declares “looks good.” That workflow is non-reproducible, non-comparable, biased toward recent edits, and invisible in CI.

Escape with a golden dataset: 30–200 real questions as JSONL in git with must_contain, forbid_phrases, refusal flags, citation requirements. Run on every PR. When a case fails, fix the pipeline or update gold deliberately.
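
As a rough illustration, one gold row could be built and serialized as below. The field names follow the ones listed above; the id, question text, phrases, tag values, and the `to_jsonl_line` helper are hypothetical placeholders rather than real policy or an established API.

```python
import json

# One hypothetical gold case; field names mirror the ones named in the text,
# the values are illustrative placeholders.
gold_row = {
    "id": "refund-monthly-01",
    "question": "Can I get a refund for a partial month?",
    "must_contain": ["partial months"],
    "forbid_phrases": ["guaranteed refund"],
    "expect_refusal": False,
    "expect_citation": True,
    "tags": ["billing", "smoke"],
}

# Assumed minimum set of keys every row needs before it is committed.
REQUIRED_KEYS = {"id", "question", "must_contain", "forbid_phrases"}

def to_jsonl_line(row: dict) -> str:
    """Serialize one case as a single JSONL line, rejecting incomplete rows."""
    missing = REQUIRED_KEYS - row.keys()
    if missing:
        raise ValueError(f"gold row missing keys: {sorted(missing)}")
    return json.dumps(row, ensure_ascii=False)

line = to_jsonl_line(gold_row)
print(line)
```

Validating rows at write time keeps a malformed case from silently shrinking the suite later.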

| Signal | Vibe check | Structured eval |
| --- | --- | --- |
| Repeatability | Ad hoc questions | Same JSONL every run |
| Regression detection | Often missed | Pass rate delta blocks merge |
| Attribution | “Feels off” | Case ID + scorer + prompt version |
| Onboarding | Tribal knowledge | make eval in README |

flowchart LR
  subgraph trap["Vibe-check trap"]
    E1[Edit prompt] --> M1[Manual chat test]
    M1 --> S1[Ship on feeling]
    S1 --> E1
  end
  subgraph eval["Eval discipline"]
    E2[Edit prompt] --> R2[Run gold JSONL]
    R2 --> C2{Pass rate OK?}
    C2 -->|yes| S2[Merge]
    C2 -->|no| F2[Fix or update gold]
    F2 --> E2
  end

Why eval is harder than building the feature

  • Ground truth is fuzzy — many valid answers; scorers encode policy.
  • Multi-component failures — bad answers may be retrieval, prompt, model, or tools.
  • Eval drift — gold sets stale when policy changes; judges need calibration.
  • Cost vs coverage — live CI calls are expensive; mocks miss real failures.
  • Agent trajectories — multi-step paths need trajectory assertions, not answer-only checks.
⚠️ Pitfall

Shipping after playground testing. Playground sessions skip your RAG index, production prompt version, tool allowlist, and model route. Eval the deployed pipeline, not an ad-hoc chat tab.

📦 Real World

Support teams at scale maintain hundreds of golden scenarios. Prompt merges require pass-rate thresholds; production failures become new gold rows within days.

🎯 Interview Tip

“What is LLM evaluation?” — Frozen cases + automated scorers + CI regression gates, separating offline suites from online monitoring. Not manual ChatGPT testing.

Golden case — one JSONL row in git
import json
from dataclasses import dataclass

@dataclass
class EvalCase:
    id: str
    question: str
    must_contain: list[str]
    forbid_phrases: list[str]
    expect_refusal: bool = False
    expect_citation: bool = False
    tags: list[str] | None = None

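def load_cases(path: str) -> list[EvalCase]:
    """Load one EvalCase per non-empty JSONL line."""
    cases = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines and a trailing newline
            row = json.loads(line)
            # prompt_version is report metadata, not an EvalCase field
            cases.append(EvalCase(**{k: v for k, v in row.items() if k != "prompt_version"}))
    return cases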
public record EvalCase(
    String id,
    String question,
    List<String> mustContain,
    List<String> forbidPhrases,
    boolean expectRefusal,
    boolean expectCitation,
    List<String> tags
) {
  public static EvalCase fromJson(JsonNode n) {
    return new EvalCase(
        n.get("id").asText(),
        n.get("question").asText(),
        jsonStrings(n.get("must_contain")),
        jsonStrings(n.get("forbid_phrases")),
        n.path("expect_refusal").asBoolean(false),
        n.path("expect_citation").asBoolean(false),
        jsonStrings(n.path("tags"))
    );
  }
}
🔬 Under the Hood

Harnesses treat the pipeline as answer = pipeline(question, config). Scorers inspect outputs only—architecture can evolve from completion to RAG to agents without rewriting every test.

💡 Pro Tip

Start with 20 failures from real tickets—not hypothetical questions. The best gold rows come from production bugs you already paid for.

Offline regression vs online monitoring

Offline eval runs against frozen gold before deploy—your safety net in CI. Online eval samples live traffic after deploy—catches distribution shift, new user phrasing, and doc updates not yet re-indexed. You need both; offline alone misses production-only failures, online alone ships broken prompts to the first thousand users.

| Mode | Input | When | Purpose |
| --- | --- | --- | --- |
| Offline regression | JSONL gold in git | PR, nightly, pre-release | Block regressions before merge |
| Online sampling | Production traces (1–5%) | Continuous post-deploy | Detect drift, new failure modes |
| Shadow eval | Live queries, new pipeline in parallel | Before flag flip | Compare v1 vs v2 on real traffic safely |
| Red-team batch | Adversarial prompts | Weekly / release gate | Safety regressions (Guide 4) |
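
One way to sample a small, stable share of production traces is to hash the trace ID. The sketch below assumes each trace carries a string ID; the `should_sample` name and the 2% default are illustrative choices, not a prescribed method.

```python
import hashlib

def should_sample(trace_id: str, rate: float = 0.02) -> bool:
    """Deterministically select about `rate` of traces by hashing the trace ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64

# Hypothetical trace IDs, only to show the approximate sampling share.
sampled = [t for t in (f"trace-{i}" for i in range(10_000)) if should_sample(t)]
print(f"sampled {len(sampled)} of 10000 traces")
```

Hashing instead of calling a random generator means the same trace is always in or out of the sample, which makes a review queue reproducible across reruns.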

Track 3 introduced prompt golden inputs; Track 2 introduced recall@k. Track 5 unifies them into a product-level eval strategy—PR smoke, nightly live runs, and human review of production samples. See Reliable output formatting for schema-level unit tests that complement answer-level evals.

⚖️ Trade-off

Large gold sets increase confidence but require maintenance when product copy changes. Review gold quarterly; archive obsolete cases instead of letting pass rates lie.

The eval pyramid — unit, integration, human evals

Borrow the test pyramid from traditional QA: many fast cheap checks at the bottom, fewer expensive checks at the top. LLM evals stack the same way—unit scorers on strings, integration runs through the full pipeline, human review for nuance and safety.

Three layers

| Layer | What you test | When | Cost | Example |
| --- | --- | --- | --- | --- |
| Unit | Scorers, parsers, schema validators | Every commit | Free | must_contain scorer on fixture text |
| Integration | Full pipeline: retrieve → prompt → model | PR + nightly | API tokens (live) or mock | 50-case JSONL smoke on PR; 500 nightly |
| Human | Rubric quality, tone, edge judgment | Weekly / on sample | People time | Review 1% production traces |

flowchart TB
  H[Human eval sampling]
  I[Integration pipeline evals]
  U[Unit scorers and schema tests]
  H --> I --> U

Unit layer — scorers and fixtures

Unit tests never call the LLM. They assert scorers behave: given answer text and gold expectations, return pass/fail. Also test JSONL loaders, report writers, and exit-code logic—same as testing pytest helpers.
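
A minimal sketch of such a unit test follows. It restates the `score_must_contain` rule scorer from this guide so the block runs on its own; the fixture answers and test names are hypothetical.

```python
def score_must_contain(answer: str, phrases: list[str]) -> tuple[bool, str]:
    """Case-insensitive phrase check, restated here so the block runs alone."""
    lower = answer.lower()
    missing = [p for p in phrases if p.lower() not in lower]
    if missing:
        return False, f"missing phrases: {missing}"
    return True, "ok"

def test_passes_when_phrase_present_in_any_case():
    answer = "We do not refund Partial Months, per the billing policy."
    assert score_must_contain(answer, ["partial months"]) == (True, "ok")

def test_fails_and_names_missing_phrase():
    ok, msg = score_must_contain("Refunds are always possible.", ["30-day window"])
    assert not ok
    assert "30-day window" in msg

def test_empty_expectations_pass():
    assert score_must_contain("anything", []) == (True, "ok")

if __name__ == "__main__":
    test_passes_when_phrase_present_in_any_case()
    test_fails_and_names_missing_phrase()
    test_empty_expectations_pass()
    print("all scorer tests passed")
```

The functions follow the `test_*` naming that pytest collects by default, so the same file can run under pytest or as a plain script.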

Integration layer — mock vs live runners

Mock runners return deterministic fake answers: free CI on every PR that catches scorer and wiring regressions. Live runners call the real RAG pipeline or agents, nightly or pre-release, and catch the model and retrieval drift that mocks hide.
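
A mock runner can be as small as a lookup table. The sketch below is one possible shape, assuming the pipeline is a callable from question to answer; `CANNED_ANSWERS`, `mock_answer`, and `make_runner` are hypothetical names.

```python
from typing import Callable, Optional

# Hypothetical canned answers keyed by question; a real stub would cover the smoke suite.
CANNED_ANSWERS = {
    "Can I get a refund for a partial month?":
        "Refunds are not issued for partial months.",
}
DEFAULT_ANSWER = "I don't have information on that."

def mock_answer(question: str) -> str:
    """Deterministic stand-in for the pipeline: no network, no sampling."""
    return CANNED_ANSWERS.get(question, DEFAULT_ANSWER)

def make_runner(
    mode: str, live: Optional[Callable[[str], str]] = None
) -> Callable[[str], str]:
    """Pick the runner by mode; the live callable is injected by the caller."""
    if mode == "mock":
        return mock_answer
    if mode == "live":
        if live is None:
            raise ValueError("live mode needs a pipeline callable")
        return live
    raise ValueError(f"unknown runner mode: {mode}")

runner = make_runner("mock")
print(runner("Can I get a refund for a partial month?"))
```

Because both runners share one signature, the suite code does not need to know which one it is calling.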

Human layer — when automation stops

Judges and rules miss empathy, subtle policy, and novel failure modes. Sample production traces into a review queue; label pass/fail with reasons; promote failures to gold JSONL. Human eval is not “skip automation”—it feeds the automation.

Smoke vs full suites

| Suite | Cases | Runner | Gate |
| --- | --- | --- | --- |
| support_smoke.jsonl | 8–15 | Mock (PR) + live (optional) | Blocks merge |
| support_full.jsonl | 100–500 | Live API | Nightly alert |
| safety_redteam.jsonl | 50+ adversarial | Live | Release gate |
| rag_recall.jsonl | Question + expected chunk IDs | Retrieval only | Index change gate |

Agent evals add a fifth suite type: agent_trajectory.jsonl with expected tool names and order. See Agents explained for why final-answer-only checks miss wrong tool paths.

Rule scorer — unit-testable without LLM
def score_must_contain(answer: str, phrases: list[str]) -> tuple[bool, str]:
    lower = answer.lower()
    missing = [p for p in phrases if p.lower() not in lower]
    if missing:
        return False, f"missing phrases: {missing}"
    return True, "ok"

def score_forbid(answer: str, forbidden: list[str]) -> tuple[bool, str]:
    hits = [p for p in forbidden if p.lower() in answer.lower()]
    if hits:
        return False, f"forbidden phrases: {hits}"
    return True, "ok"
public final class RuleScorers {
  public static ScoreResult mustContain(String answer, List<String> phrases) {
    String lower = answer.toLowerCase();
    List<String> missing = phrases.stream()
        .filter(p -> !lower.contains(p.toLowerCase()))
        .toList();
    return missing.isEmpty()
        ? ScoreResult.pass("ok")
        : ScoreResult.fail("missing: " + missing);
  }
}
💰 Cost

Run mock integration on every PR (~seconds). Reserve live LLM integration for nightly jobs with budget caps—surprise $3k/month eval bills get CI disabled.

⚖️ Trade-off

More gold cases improve coverage but slow CI. Tag cases: smoke (8 rows, PR gate) vs full (500 rows, nightly).

⚙️ Config

Pin model, prompt_version, and git_sha in report metadata so score deltas are attributable.

🔒 Security

Gold files may contain realistic PII-shaped fixtures—keep in private repos; redact in public docs; never commit production customer transcripts verbatim.

Metrics vs vibes — quantitative iteration

Subjective iteration—“this answer feels better”—does not scale across engineers, models, or time zones. Metrics turn quality into numbers you can plot, diff in PR comments, and gate in CI: pass rate, per-tag breakdown, latency p95, cost per case, judge score distributions.

Core metrics for LLM apps

| Metric | Formula / meaning | Use when |
| --- | --- | --- |
| Pass rate | passed_cases / total_cases | Primary CI gate |
| Delta vs baseline | pass_rate_branch − pass_rate_main | Block merge if < −2% |
| Per-tag pass rate | Pass rate grouped by billing, safety, RAG | Find regressions hidden in aggregate |
| Refusal rate | Cases where model refused / expect_refusal cases | Safety and policy evals |
| Latency p95 | 95th percentile pipeline ms | Performance regressions |
| Cost per eval run | Sum input+output tokens × price | Budget CI and nightly jobs |
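
To see why the per-tag breakdown matters, the sketch below computes it over a hypothetical result list in which the aggregate looks healthy while one tag has regressed. The result-row shape (`id`, `tags`, `passed`) is an assumption for illustration.

```python
from collections import defaultdict

def per_tag_pass_rate(results: list[dict]) -> dict[str, float]:
    """Group pass/fail results by tag; a case with several tags counts in each."""
    totals: dict[str, int] = defaultdict(int)
    passes: dict[str, int] = defaultdict(int)
    for r in results:
        for tag in r["tags"]:
            totals[tag] += 1
            passes[tag] += int(r["passed"])
    return {tag: passes[tag] / totals[tag] for tag in totals}

# Hypothetical results: 16 passing RAG cases, 2 of 4 billing cases failing.
results = (
    [{"id": f"rag-{i}", "tags": ["rag"], "passed": True} for i in range(16)]
    + [
        {"id": "billing-1", "tags": ["billing"], "passed": True},
        {"id": "billing-2", "tags": ["billing"], "passed": False},
        {"id": "billing-3", "tags": ["billing"], "passed": False},
        {"id": "billing-4", "tags": ["billing"], "passed": True},
    ]
)
overall = sum(r["passed"] for r in results) / len(results)
by_tag = per_tag_pass_rate(results)
print(f"overall={overall:.2f} by_tag={by_tag}")
```

Here the overall pass rate is 90% while billing sits at 50%, which an aggregate-only gate would not surface.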

From vibes to numbers — workflow

  1. Establish baseline pass rate on main with pinned config.
  2. Change one variable (prompt, chunk size, model)—not three at once.
  3. Re-run suite; compute delta and per-tag breakdown.
  4. Inspect failed case IDs—not only the aggregate score.
  5. Accept, revert, or update gold with documented rationale.

Plot pass rate over time in Grafana or your observability stack. A slow drift across prompt edits is invisible in vibes but obvious on a chart.

Leading vs lagging indicators

  • Leading — offline pass rate delta, parse success rate, retrieval recall@5 on gold queries (predicts user pain before deploy).
  • Lagging — support escalation rate, thumbs-down on bot replies, average handle time (confirms eval gaps after deploy).

When leading and lagging diverge—eval pass rate high but escalations rising—your gold set no longer represents real traffic. Refresh gold from production samples (Guide 5) before tuning prompts again.

⚙️ Config

Store thresholds in repo config: max_regression: 0.02, min_safety_pass: 1.0, min_billing_pass: 0.95. Avoid magic numbers scattered in shell scripts.
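
A gate built on those thresholds might look like the sketch below. The key names come from the text above; holding them in a plain dict and the `gate_violations` helper are assumptions, and a real setup would likely load them from a config file.

```python
# Threshold names from the text; keeping them in a dict is an assumption.
THRESHOLDS = {"max_regression": 0.02, "min_safety_pass": 1.0, "min_billing_pass": 0.95}

def gate_violations(
    delta: float, by_tag: dict[str, float], cfg: dict[str, float]
) -> list[str]:
    """Return human-readable reasons to fail CI; an empty list means pass."""
    reasons = []
    if delta < -cfg["max_regression"]:
        reasons.append(f"overall delta {delta:+.3f} exceeds max_regression")
    for tag in ("safety", "billing"):
        floor = cfg[f"min_{tag}_pass"]
        rate = by_tag.get(tag)  # tags absent from this suite are skipped
        if rate is not None and rate < floor:
            reasons.append(f"{tag} pass rate {rate:.3f} below {floor:.2f}")
    return reasons

# Hypothetical run: small overall dip, but billing is under its floor.
reasons = gate_violations(-0.003, {"safety": 1.0, "billing": 0.90}, THRESHOLDS)
exit_code = 1 if reasons else 0
print(exit_code, reasons)
```

Returning the reasons rather than a bare boolean lets the CI log say which threshold tripped.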

Report JSON — store in CI artifacts
from dataclasses import dataclass, asdict
import json

@dataclass
class EvalReport:
    suite: str
    pass_rate: float
    passed: int
    failed: int
    baseline_pass_rate: float
    delta: float
    failed_ids: list[str]
    model: str
    prompt_version: str
    git_sha: str

def write_report(path: str, report: EvalReport) -> None:
    with open(path, "w") as f:
        json.dump(asdict(report), f, indent=2)

def should_fail_ci(report: EvalReport, max_regression: float = 0.02) -> bool:
    return report.delta < -max_regression
public record EvalReport(
    String suite,
    double passRate,
    int passed,
    int failed,
    double baselinePassRate,
    double delta,
    List<String> failedIds,
    String model,
    String promptVersion,
    String gitSha
) {
  public boolean shouldFailCi(double maxRegression) {
    return delta < -maxRegression;
  }
}
⚠️ Pitfall

Optimizing pass rate alone. A prompt that refuses everything passes safety cases and fails helpfulness—track multiple tags and weight business-critical rows higher.

📦 Real World

Teams attach eval summary comments to PRs: “pass 94.2% (−0.3% vs main), failed: refund-monthly, jailbreak-07”. Reviewers scan deltas, not full transcripts.

💡 Pro Tip

Keep a baseline.json on main updated nightly. PR jobs compare against that artifact, not a stale local file.

🎯 Interview Tip

“How do you know a prompt change helped?” — Golden set, before/after pass rate with pinned model, inspect failed IDs, per-tag breakdown—not subjective comparison.

Eval-driven loop — baseline → change → compare → ship

Treat eval like TDD for nondeterministic systems: capture expected behavior first (or after first bug), change one knob, measure, then ship only when metrics hold. The loop is baseline → change → compare → ship—same rhythm as performance benchmarks in backend engineering.

flowchart LR
  B[Baseline run on main]
  C[Change prompt RAG model]
  R[Re-run gold suite]
  D{Delta within threshold?}
  S[Ship merge deploy]
  F[Fix or revert]
  B --> C --> R --> D
  D -->|yes| S
  D -->|no| F --> C

Step 1 — Baseline

Run full suite on main with production config. Store reports/baseline.json: pass rate, failed IDs, latency, token cost. This is your contract for “currently acceptable.”

Step 2 — Change

One change per experiment: new system prompt version, chunk size 512→768, swap gpt-4o-mini for Claude, add a tool. Log the hypothesis in the PR description.

Step 3 — Compare

Re-run the same suite. Compute delta. Open failed cases side-by-side: old answer vs new. Distinguish improvements (new pass), regressions (new fail), and gold drift (policy changed—update JSONL).
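
The comparison can start from the failed-ID lists in the two reports. The sketch below assumes both runs used the same suite and case IDs; the IDs and the `classify_changes` helper are hypothetical. Deciding whether a regression is really gold drift still takes a human look at the case.

```python
def classify_changes(
    baseline_failed: set[str], branch_failed: set[str]
) -> dict[str, list[str]]:
    """Split case IDs into regressions, improvements, and still-failing cases."""
    return {
        "regressions": sorted(branch_failed - baseline_failed),   # passed before, fail now
        "improvements": sorted(baseline_failed - branch_failed),  # failed before, pass now
        "still_failing": sorted(baseline_failed & branch_failed),
    }

# Hypothetical failed-ID sets taken from two report files.
baseline_failed = {"refund-monthly", "jailbreak-07"}
branch_failed = {"jailbreak-07", "kb-missing-03"}
diff = classify_changes(baseline_failed, branch_failed)
print(diff)
```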

Step 4 — Ship

Merge when delta ≥ threshold (e.g. no worse than −1% overall, no safety tag below 100%). Tag prompt registry version; deploy with feature flag; monitor online metrics (Guide 5).

Promote production failures into gold

When support escalates a bad bot answer, capture question + expected behavior + actual failure mode. Add JSONL row with same id pattern. That closes the loop—eval suites grow from real pain.
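
Promotion can be a small append step with a duplicate-ID guard. The sketch below is one possible helper, not an established tool; `promote_to_gold`, the file name, and the row values are hypothetical, and it assumes the gold file ends with a newline.

```python
import json
import tempfile
from pathlib import Path

def promote_to_gold(gold_path: Path, row: dict) -> None:
    """Append a new case to a JSONL gold file, refusing duplicate IDs."""
    existing_ids = set()
    if gold_path.exists():
        with gold_path.open(encoding="utf-8") as f:
            existing_ids = {json.loads(line)["id"] for line in f if line.strip()}
    if row["id"] in existing_ids:
        raise ValueError(f"duplicate gold id: {row['id']}")
    with gold_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Usage against a temporary file so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    gold = Path(tmp) / "support_full.jsonl"
    promote_to_gold(gold, {
        "id": "refund-monthly-02",
        "question": "I cancelled mid-month, do I get money back?",
        "must_contain": ["partial months"],
        "forbid_phrases": ["guaranteed refund"],
    })
    print(gold.read_text(encoding="utf-8"))
```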

Example PR workflow

  1. Engineer bumps support_system_v3 → v4 in prompt registry.
  2. CI runs support_smoke.jsonl with mock runner (free) — wiring OK.
  3. Optional: CI runs 8-case live smoke if EVAL_LIVE=1 on protected branches.
  4. Nightly job runs full suite; posts pass rate to Slack if delta < −1% vs baseline.
  5. On merge, deploy behind flag; online sampler watches thumbs-down for 24h before full rollout.

RAG pipeline changes follow the same loop but add retrieval metrics first—if recall@5 drops, fix chunking or hybrid search before rewriting the generation prompt. See RAG explained eval basics and Guide 3 in this track for RAGAS-style decomposition.

run_suite.py — compare branch to baseline
import sys
from eval.lib.load_cases import load_cases
from eval.scorers import score_case
from runners.live_support import answer as pipeline

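def run_suite(gold_path: str, baseline_rate: float, max_regression: float = 0.02) -> int:
    cases = load_cases(gold_path)
    if not cases:
        print(f"no cases loaded from {gold_path}")
        return 1  # an empty suite must not pass the gate
    passed, failed_ids = 0, []
    for case in cases:
        ans = pipeline(case.question)
        ok, _ = score_case(ans, case)
        if ok:
            passed += 1
        else:
            failed_ids.append(case.id)
    rate = passed / len(cases)
    delta = rate - baseline_rate
    print(f"pass_rate={rate:.3f} delta={delta:+.3f} failed={failed_ids}")
    # Non-zero exit code blocks the merge when the drop exceeds the threshold.
    return 1 if delta < -max_regression else 0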

if __name__ == "__main__":
    sys.exit(run_suite("eval/gold/support_smoke.jsonl", baseline_rate=0.94))
@Service
public class EvalSuiteRunner {
  public int runAndExitCode(Path goldPath, double baselineRate) {
    List<EvalCase> cases = caseLoader.load(goldPath);
    List<String> failed = new ArrayList<>();
    int passed = 0;
    for (EvalCase c : cases) {
      String ans = pipeline.answer(c.question());
      if (scorer.score(ans, c).passed()) passed++;
      else failed.add(c.id());
    }
    double rate = (double) passed / cases.size();
    double delta = rate - baselineRate;
    log.info("pass_rate={} delta={} failed={}", rate, delta, failed);
    return delta < -0.02 ? 1 : 0;
  }
}
⚖️ Trade-off

Strict CI gates slow iteration early; loose gates ship regressions. Start with smoke-only blocking; tighten thresholds as gold coverage matures.

🔬 Under the Hood

Eval-driven development does not require perfect gold on day one. Start with 10 known-bad production tickets; expand weekly—coverage is a product, not a one-time project.

💰 Cost

Batch baseline + compare in one nightly job when API cost matters; PRs run mock runner only, comparing mock pass rates for wiring regressions.

📦 Real World

Prompt registries link each version to eval scores at promote time—v4 shipped at 96.1%, v5 rejected at 93.8% on billing tag before any user saw v5.

Eval types — reference-based, LLM-as-judge, human, task-based

No single scorer fits every question. Production stacks combine four families, plus embedding similarity as a softer form of reference comparison; each has its own strengths, costs, and failure modes. Pick scorers by what you need to prove: exact policy phrases, semantic equivalence, human judgment, or end-to-end task success.

| Type | Mechanism | Best for | Weakness |
| --- | --- | --- | --- |
| Reference-based | Rules, regex, must_contain, exact JSON schema | Policy phrases, format, refusals | Misses valid paraphrase |
| Embedding similarity | Cosine vs reference answer or chunk | Soft equivalence when wording varies | False pass on wrong facts, similar prose |
| LLM-as-judge | Rubric prompt scores 1–5 or pass/fail | Helpfulness, tone, multi-criteria | Judge bias, cost, inconsistency |
| Human eval | Reviewers with rubric | Gold standard, calibrating judges | Slow, expensive |
| Task-based | Did the agent complete the task? | Tool use, ticket resolved, SQL row updated | Needs sandbox or mock APIs |

Reference-based evals

Fast, deterministic, CI-friendly. Encode legal/compliance language literally: “partial months,” “30-day window.” Combine with forbid_phrases for hallucinated guarantees. Track 3 formatting guides pair with schema validators for JSON outputs.

LLM-as-judge

When rules cannot enumerate good answers, a separate model scores against a rubric: “Answer cites policy, refuses harmful requests, does not invent refund amounts.” Calibrate judges against human labels; use chain-of-thought rubrics with structured JSON output. Deep dive in LLM-as-judge.
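
Calibration can begin with a plain agreement rate between judge verdicts and human labels on the same cases. The sketch below uses made-up labels and hypothetical helper names; raw agreement is only a first signal, and a chance-corrected statistic may suit skewed label sets better.

```python
def agreement_rate(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of cases where the judge's pass/fail matches the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not human_labels:
        raise ValueError("need at least one labeled case")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

def disagreements(
    ids: list[str], judge_labels: list[bool], human_labels: list[bool]
) -> list[str]:
    """Case IDs worth reading when refining the judge rubric."""
    return [i for i, j, h in zip(ids, judge_labels, human_labels) if j != h]

# Hypothetical labels for five cases.
ids = ["c1", "c2", "c3", "c4", "c5"]
judge = [True, True, False, True, False]
human = [True, False, False, True, True]
rate = agreement_rate(judge, human)
to_review = disagreements(ids, judge, human)
print(f"agreement={rate:.2f} review={to_review}")
```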

Human eval

Sample 50–200 cases for blind A/B comparison between prompt versions. Human scores become labels to tune judges and detect judge drift. Never skip human review for high-stakes launches—automate the long tail only.

Task-based evals

For agents: assert tool sequence contains lookup_order before search_policy; mock CRM returns expected state; ticket marked resolved in sandbox. Final answer quality is insufficient when wrong tools ran.
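
A trajectory assertion for the ordering above can be sketched as follows, assuming the harness records tool calls as an ordered list of names. The `tool_called_before` helper and the `reply` step are hypothetical.

```python
def tool_called_before(
    trajectory: list[str], first: str, second: str
) -> tuple[bool, str]:
    """Pass when both tools ran and `first` was called before `second`."""
    if first not in trajectory:
        return False, f"{first} never called"
    if second not in trajectory:
        return False, f"{second} never called"
    # Compare the positions of the first call to each tool.
    if trajectory.index(first) > trajectory.index(second):
        return False, f"{first} called after {second}"
    return True, "ok"

# Hypothetical recorded trajectories.
good = ["lookup_order", "search_policy", "reply"]
bad = ["search_policy", "lookup_order", "reply"]
print(tool_called_before(good, "lookup_order", "search_policy"))
print(tool_called_before(bad, "lookup_order", "search_policy"))
```

The scorer returns the same `(passed, message)` pair as the rule scorers, so it can sit in the same report.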

Mapping eval types to Tracks 1–4

| Product layer | Primary eval types | Track reference |
| --- | --- | --- |
| Chat / completion | Reference rules, LLM judge | LLMs explained |
| RAG copilot | Recall@k, faithfulness, reference rules | RAG explained |
| Prompted workflows | Schema validation, golden inputs | Core prompting |
| Agents | Task-based trajectories, tool selection | Tool calling |

Composite pipelines stack scorers: retrieval recall (unit/integration on index), generation faithfulness (judge), policy phrases (rules), safety (red-team reference + judge). Failures in composite stacks need failure attribution—log which stage failed so you do not “fix” prompts when retrieval broke.

flowchart TD
  Q[What must you prove?]
  Q -->|Exact policy text| R[Reference rules]
  Q -->|Paraphrase OK| E[Embedding similarity]
  Q -->|Subjective quality| J[LLM-as-judge]
  Q -->|High stakes launch| H[Human rubric]
  Q -->|Agent completed action| T[Task-based sandbox]
Composite scorer — chain reference then judge
def score_case(answer: str, case: EvalCase, use_judge: bool = False) -> tuple[bool, str]:
    ok, msg = score_must_contain(answer, case.must_contain)
    if not ok:
        return ok, msg
    ok, msg = score_forbid(answer, case.forbid_phrases)
    if not ok:
        return ok, msg
    if case.expect_refusal and not looks_like_refusal(answer):
        return False, "expected refusal"
    if use_judge and case.tags and "subjective" in case.tags:
        return llm_judge(answer, case.question, rubric=JUDGE_RUBRIC)
    return True, "ok"
public ScoreResult scoreCase(String answer, EvalCase c, boolean useJudge) {
  ScoreResult r = RuleScorers.mustContain(answer, c.mustContain());
  if (!r.passed()) return r;
  r = RuleScorers.forbid(answer, c.forbidPhrases());
  if (!r.passed()) return r;
  if (c.expectRefusal() && !RefusalDetector.isRefusal(answer))
    return ScoreResult.fail("expected refusal");
  if (useJudge && c.tags() != null && c.tags().contains("subjective"))
    return llmJudge.score(answer, c.question(), JUDGE_RUBRIC);
  return ScoreResult.pass("ok");
}
⚠️ Pitfall

LLM-as-judge for everything on every PR—cost explodes and judges correlate with the model under test. Rules first; judges for nightly subjective tags.

🔒 Security

Task-based evals in sandboxes still need auth—do not point agents at production CRM with eval credentials that bypass audit.

🎯 Interview Tip

“Reference vs LLM judge?” — Rules for compliance and format; judges for helpfulness when paraphrase varies; human labels calibrate judges; task-based for agents.

💡 Pro Tip

Tag gold rows with required scorer: "scorers": ["rules"] vs ["rules", "judge"] so PR jobs skip expensive judges.
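
Reading that field could look like the sketch below. The `scorers` key comes from the tip above; the `scorers_for` helper and the default of rules-only for rows without the field are assumptions.

```python
def scorers_for(row: dict, allow_judge: bool) -> list[str]:
    """Pick which scorers run for a row; rows without the field default to rules only."""
    requested = row.get("scorers", ["rules"])
    if allow_judge:
        return list(requested)
    # PR jobs drop the expensive judge and keep everything else.
    return [s for s in requested if s != "judge"]

# Hypothetical rows.
subjective_row = {"id": "tone-01", "scorers": ["rules", "judge"]}
plain_row = {"id": "refund-01"}

print(scorers_for(subjective_row, allow_judge=False))  # PR job
print(scorers_for(subjective_row, allow_judge=True))   # nightly job
print(scorers_for(plain_row, allow_judge=True))
```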

What this looks like in production

A notebook that scores five answers is a spike. Production eval is a platform: versioned gold sets, scheduled runners, report artifacts, CI gates on prompt/RAG/agent repos, online sampling, and a process to promote failures into regression cases—before and after launch.

Architecture sketch

flowchart LR
  GIT[Gold JSONL in git]
  CI[CI smoke runner]
  NIGHT[Nightly live runner]
  RPT[Report artifact]
  OBS[Online trace sampling]
  GOLD[Promote to gold]
  GIT --> CI --> RPT
  GIT --> NIGHT --> RPT
  OBS --> GOLD --> GIT

Eval readiness checklist

  • Gold JSONL in git with unique id per case; schema validated in unit tests
  • Smoke suite (≤15 cases) runs on every PR—mock or live with budget cap
  • Full suite nightly with pinned model and prompt_version
  • CI fails when pass rate drops more than agreed threshold vs baseline
  • Report artifact uploaded: pass rate, delta, failed IDs, latency, token cost
  • Per-tag breakdown for billing, safety, RAG, agent trajectories
  • Rule scorers for policy phrases; judges scheduled—not on every PR by default
  • Human review queue for 1–5% production samples
  • Process to add production failures as new gold rows within SLA
  • Prompt/registry version linked to eval score at deploy time
  • Feature flag rollback when online metrics diverge from offline eval
  • RAG evals separate retrieval recall from generation faithfulness (Guide 3)

Minimum viable eval platform (week one)

You do not need LangSmith or Braintrust on day one. Minimum stack:

  • eval/gold/*.jsonl — 20 cases from real tickets
  • eval/scorers.py — rule scorers + unit tests
  • runners/mock_*.py — deterministic pipeline stub
  • eval/run_suite.py — exit code 0/1
  • .github/workflows/eval-smoke.yml — runs on PR
  • reports/latest.json — CI artifact

Upgrade to live runners, judges, and observability exporters as coverage and traffic grow—Guide 6 walks CI wiring end-to-end.

Logging every eval run

| Field | Why |
| --- | --- |
| run_id, git_sha | Reproduce any report |
| suite, runner (mock/live) | Explain mock vs live deltas |
| pass_rate, delta | CI gate input |
| failed_case_ids[] | Jump to debugging |
| model, prompt_version | Attribute behavior changes |
| total_tokens, cost_usd | FinOps for eval spend |

FastAPI eval endpoint — internal only
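import hmac
import os

from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

app = FastAPI()
# Read the token from the environment instead of hard-coding it in source
# (the variable name EVAL_ADMIN_TOKEN is an assumption).
ADMIN_TOKEN = os.environ["EVAL_ADMIN_TOKEN"]

class EvalTrigger(BaseModel):
    suite: str = "support_smoke"
    runner: str = "mock"

@app.post("/internal/eval/run")
def trigger_eval(body: EvalTrigger, x_admin_token: str = Header(...)):
    # Constant-time comparison avoids leaking the token through timing.
    if not hmac.compare_digest(x_admin_token.encode(), ADMIN_TOKEN.encode()):
        raise HTTPException(status_code=403, detail="forbidden")
    # run_suite is the project's report-returning runner (not shown here).
    report = run_suite(body.suite, runner=body.runner)
    return {"run_id": report.run_id, "pass_rate": report.pass_rate}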
@RestController
@RequestMapping("/internal/eval")
public class EvalController {
  @PostMapping("/run")
  public EvalRunResponse trigger(@RequestBody EvalTrigger body,
      @RequestHeader("X-Admin-Token") String token) {
    if (!adminToken.equals(token)) throw new ResponseStatusException(HttpStatus.FORBIDDEN);
    EvalReport report = suiteRunner.run(body.suite(), body.runner());
    return new EvalRunResponse(report.runId(), report.passRate());
  }
}
☸️ OpenShift / K8s

Nightly eval as CronJob; mount gold ConfigMap or clone repo; upload reports to S3; Slack webhook on regression vs baseline.

🏗️ IaC

Terraform CI budget alarms on eval token spend; separate AWS/GCP project for eval API keys with hard quotas.

💰 Cost

Cap nightly live eval at N cases × M tokens; overflow queue to weekend batch—unbounded eval jobs have bankrupted side projects.

📦 Real World

Mature teams treat eval regressions like P1 bugs—billing tag drop from 100% to 97% blocks release even if aggregate pass rate still looks fine.

🎯 Interview Tip

“How do you ship LLM features safely?” — Golden sets, CI gates, offline + online eval, promote production failures to gold, separate RAG retrieval from generation metrics, feature-flag rollback tied to eval drift.

💡 Pro Tip

Publish eval pass-rate badges in internal docs next to each prompt version—makes “which template is production-safe?” a one-glance question for PMs and support leads.