Continuous evaluation in CI/CD

Eval-as-code: golden sets, scorers, and suite config in git

Eval-as-code stores datasets, scorers, thresholds, and runner config in the repo—versioned, reviewed, and executed in CI like any other migration or test suite.

Eval-as-code means golden cases, scorers, thresholds, and runner config live in git beside application code—reviewed in PRs, executed in CI, versioned with prompts. Ad-hoc spreadsheets and one engineer’s laptop notebook do not block merges. Treat eval suites like database migrations: immutable history, explicit rollback, and ownership in CODEOWNERS.

Repository layout

Path	Contents	Change frequency
evals/datasets/*.jsonl	Golden rows (append-only)	Weekly from incidents
evals/scorers/	Python/Java scorer modules	Monthly
evals/suites/*.yaml	Suite composition + thresholds	Per feature launch
prompts/registry.yaml	Prompt semver → template path	Every prompt PR
evals/ci.config.yaml	Smoke vs full matrix	Rare

Golden row schema

Each row: id, input, optional context, expected or rubric, tags[] (intent, locale, risk), and source (incident_id, user_report, synthetic).
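
To make the schema concrete, here is a minimal sketch of a row validator that a lint step could run over a golden file. The field names come from the list above; the function names, the rule that at least one of expected or rubric must be present, and the idea that source is an object keyed by one of the listed kinds are assumptions made for illustration.

```python
import json

REQUIRED = ("id", "input", "tags", "source")
SOURCE_KINDS = {"incident_id", "user_report", "synthetic"}  # assumed keys of the source object

def validate_row(row: dict) -> list[str]:
    """Return a list of problems; an empty list means the row looks valid."""
    errors = [f"missing field: {k}" for k in REQUIRED if k not in row]
    if "expected" not in row and "rubric" not in row:
        errors.append("need expected or rubric")
    if not isinstance(row.get("tags", []), list):
        errors.append("tags must be a list")
    source = row.get("source")
    if source is not None and not (isinstance(source, dict) and SOURCE_KINDS.intersection(source)):
        errors.append("source needs one of: " + ", ".join(sorted(SOURCE_KINDS)))
    return errors

def validate_jsonl(lines) -> list[str]:
    """Validate every non-blank line and reject duplicate ids."""
    errors, seen = [], set()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        row = json.loads(line)
        errors += [f"line {lineno}: {e}" for e in validate_row(row)]
        rid = row.get("id")
        if rid in seen:
            errors.append(f"line {lineno}: duplicate id {rid}")
        seen.add(rid)
    return errors

example = {
    "id": "refund_001",
    "input": "Can I get a refund after 45 days?",
    "expected": {"policy_id": "SEC-12"},
    "tags": ["refund", "en", "high_risk"],
    "source": {"incident_id": "INC-1"},
}
print(validate_jsonl([json.dumps(example)]))
```

Running this check in the lint stage keeps malformed or duplicate rows out of the dataset before any model call is made.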

Scorers as pure functions

Scorers return Score(name, value, passed, metadata)—no network in unit scorers; LLM judges isolated behind interfaces for mocking. Compose with AND for blocking gates, OR for diagnostic-only signals.

# evals/suites/refund_smoke.yaml
name: refund_smoke
dataset: evals/datasets/refund_golden.jsonl
sample: 50  # PR smoke: first 50 rows
scorers:
  - name: exact_policy_id
    fn: scorers.policy_id_match
    threshold: 1.0
  - name: faithfulness
    fn: scorers.nli_faithfulness
    threshold: 0.85
blocking: true
max_cost_usd: 2.00

# evals/suites/refund_smoke.yaml — same schema consumed by Java runner
name: refund_smoke
dataset: evals/datasets/refund_golden.jsonl
sample: 50
scorers:
  - name: exact_policy_id
    fn: com.acme.evals.PolicyIdMatch
    threshold: 1.0

from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class Score:
    name: str
    value: float
    passed: bool
    metadata: dict[str, Any]

class Scorer(Protocol):
    def __call__(self, row: dict, output: str) -> Score: ...

def policy_id_match(row: dict, output: str) -> Score:
    expected = row["expected"]["policy_id"]
    found = extract_policy_id(output)
    value = 1.0 if found == expected else 0.0
    return Score("policy_id", value, value == 1.0, {"expected": expected, "found": found})
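
The AND/OR composition mentioned earlier can be sketched as two small combinators. This is one possible design, not the guide's runner: the names all_of and any_of, the use of min/max for the combined value, and the two toy scorers are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Score:
    name: str
    value: float
    passed: bool
    metadata: dict[str, Any]

Scorer = Callable[[dict, str], Score]

def all_of(name: str, *scorers: Scorer) -> Scorer:
    """AND composition: passes only if every part passes (blocking gate)."""
    def combined(row: dict, output: str) -> Score:
        parts = [s(row, output) for s in scorers]
        failed = [p.name for p in parts if not p.passed]
        return Score(name, min(p.value for p in parts), not failed, {"failed": failed})
    return combined

def any_of(name: str, *scorers: Scorer) -> Scorer:
    """OR composition: passes if at least one part passes (diagnostic signal)."""
    def combined(row: dict, output: str) -> Score:
        parts = [s(row, output) for s in scorers]
        passed = [p.name for p in parts if p.passed]
        return Score(name, max(p.value for p in parts), bool(passed), {"passed": passed})
    return combined

# Toy scorers, hypothetical and pure (no network).
def mentions_policy(row: dict, output: str) -> Score:
    ok = "SEC-" in output
    return Score("mentions_policy", float(ok), ok, {})

def short_enough(row: dict, output: str) -> Score:
    ok = len(output) <= 200
    return Score("short_enough", float(ok), ok, {})

gate = all_of("refund_gate", mentions_policy, short_enough)
print(gate({}, "No refund.").metadata)
```

Because the combinators return ordinary scorers, a composed gate can be listed in a suite like any single scorer.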

public record Score(String name, double value, boolean passed, Map<String, Object> metadata) {}

public interface Scorer {
  Score score(EvalRow row, String output);
}

public class PolicyIdMatch implements Scorer {
  public Score score(EvalRow row, String output) {
    String expected = row.expected().get("policy_id");
    String found = PolicyIdExtractor.extract(output);
    double value = expected.equals(found) ? 1.0 : 0.0;
    return new Score("policy_id", value, value == 1.0, Map.of("expected", expected));
  }
}

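import os

import yaml
from runner import run_suite

def main():
    # Load the suite definition; the context manager closes the file handle
    with open("evals/suites/refund_smoke.yaml") as f:
        cfg = yaml.safe_load(f)
    # PROMPT_VERSION must be pinned by CI; a missing variable raises KeyError
    report = run_suite(cfg, prompt_version=os.environ["PROMPT_VERSION"])
    report.to_json("eval-report.json")
    if report.blocking_failed:
        raise SystemExit(1)  # non-zero exit fails the CI job

if __name__ == "__main__":
    main()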

public static void main(String[] args) {
  SuiteConfig cfg = SuiteConfig.load("evals/suites/refund_smoke.yaml");
  EvalReport report = Runner.run(cfg, System.getenv("PROMPT_VERSION"));
  report.writeJson("eval-report.json");
  if (report.blockingFailed()) System.exit(1);
}

💡 Tip

Pin PROMPT_VERSION and MODEL env vars in CI from the PR diff—never “latest” in merge gates.

⚠️ Pitfall

Checking golden files into git without row IDs—merge conflicts become unmergeable JSONL blobs. Use stable id per line and append-only policy.

🔬 Under the Hood

LLM eval CI is embarrassingly parallel—shard JSONL by hash(row.id) across workers; aggregate with weighted pass rates, not naive average of shard rates.
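
A minimal sketch of stable sharding and weighted aggregation, assuming each row carries an id and each shard report carries passed and total counts. The function names are hypothetical. hashlib is used because Python's built-in hash() for strings is salted per process, so it would assign rows to different shards on different workers.

```python
import hashlib

def shard_of(row_id: str, shard_count: int) -> int:
    """Map a row id to a shard index that is stable across processes and machines."""
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

def shard_rows_by_id(rows: list[dict], shard_index: int, shard_count: int) -> list[dict]:
    return [r for r in rows if shard_of(r["id"], shard_count) == shard_index]

def merge_pass_rate(shard_reports: list[dict]) -> float:
    """Weight by row count: sum the counts first, then divide once."""
    total = sum(r["total"] for r in shard_reports)
    passed = sum(r["passed"] for r in shard_reports)
    return passed / total if total else 0.0

# Two uneven shards: 9/10 and 45/90.
reports = [{"passed": 9, "total": 10}, {"passed": 45, "total": 90}]
naive = sum(r["passed"] / r["total"] for r in reports) / len(reports)
print(naive, merge_pass_rate(reports))
```

With these two shards the naive average of shard rates is 0.70, while the true pass rate over all 100 rows is 0.54. Hash-based shards are rarely the same size, so the gap matters in practice.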

FAQ: How big should smoke be?

50–100 rows covering top intents and one adversarial tag each—finish in <3 min and <$3 LLM spend per PR.

CI pipeline: PR smoke, merge full eval, PR comments

Design CI in three layers: PR fast eval (smoke subset, minutes, blocks merge), merge full eval (complete golden + adversarial, blocks release), and PR comment artifacts with pass-rate delta and failure IDs.

CI pipelines for LLM apps usually split into three layers: PR fast eval (smoke subset, cheap model, parallel shards), merge full eval (nightly or post-merge, full golden + adversarial), and PR comment artifacts (pass-rate delta, cost, slowest rows, trace links). Block merges on smoke; block releases on full.

PR fast eval

Trigger: pull_request on paths prompts/**, evals/**, app code affecting LLM path
Runtime budget: <5 minutes wall, <$5 API spend
Dataset: smoke sample stratified by tag
Model: same as prod or cheaper surrogate with calibration doc
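
A smoke sample stratified by tag can be drawn deterministically with a round-robin over tag groups. The sketch below assumes the first tag of each row is the stratum (for example the intent); the function name and that convention are illustrative, not part of the guide's runner.

```python
from collections import defaultdict

def stratified_sample(rows: list[dict], size: int, key=lambda r: r["tags"][0]) -> list[dict]:
    """Pick rows round-robin across strata so small buckets are not crowded out."""
    strata = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["id"]):  # sort by id for a reproducible sample
        strata[key(row)].append(row)
    groups = [strata[k] for k in sorted(strata)]
    picked, depth = [], 0
    while len(picked) < size and any(depth < len(g) for g in groups):
        for g in groups:
            if depth < len(g) and len(picked) < size:
                picked.append(g[depth])
        depth += 1
    return picked

rows = (
    [{"id": f"r{i:03d}", "tags": ["refund"]} for i in range(90)]
    + [{"id": f"e{i:03d}", "tags": ["escalation"]} for i in range(5)]
    + [{"id": f"t{i:03d}", "tags": ["tax_form"]} for i in range(5)]
)
sample = stratified_sample(rows, 12)
print([r["id"] for r in sample])
```

Taking the first 12 rows of this file would yield only refund rows; the round-robin gives each of the three tags four rows.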

Merge / nightly full eval

Trigger: push to main, cron 02:00 UTC
Full golden + adversarial + safety suites from Guide 4
Publish report artifact retained 90 days
Block release tag if regression vs baseline > ε

PR comments

Post a structured summary: overall pass rate, Δ vs base branch, top 3 failing row IDs with diff snippets, estimated eval cost, and a link to the HTML report. Engineers can then fix failures without digging through CI logs.

Stage	When	Blocks merge?	Typical duration
Lint / schema	PR	Yes	30s
Smoke eval	PR	Yes	2–4 min
Unit scorers	PR	Yes	20s
Full eval	Post-merge	Release	20–60 min
Shadow prod sample	Daily	No (alert)	Continuous

name: llm-eval-smoke
on:
  pull_request:
    paths: ['prompts/**', 'evals/**', 'src/**']
jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r evals/requirements.txt
      - env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          PROMPT_VERSION: ${{ github.sha }}
        run: python -m evals.runner --suite refund_smoke --shard 4
      - uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: eval-report.json

// GitHub Actions YAML identical; Java step:
// run: ./gradlew :evals:runSmoke -PpromptVersion=${{ github.sha }}

import json, subprocess

def post_pr_comment(report_path: str):
    report = json.load(open(report_path))
    body = f"""## LLM eval smoke
| Metric | Value |
|--------|-------|
| Pass rate | {report['pass_rate']:.1%} |
| Δ vs base | {report['delta_vs_base']:+.1%} |
| Cost | ${report['cost_usd']:.2f} |
| Failures | {', '.join(report['failed_ids'][:5])} |
"""
    subprocess.run(["gh", "pr", "comment", "--body", body], check=True)

void postPrComment(Path reportPath) throws IOException {
  EvalReport report = EvalReport.readJson(reportPath);
  String body = String.format("Pass rate: %.1f%%", report.passRate() * 100);
  new ProcessBuilder("gh", "pr", "comment", "--body", body).inheritIO().start();
}

def shard_rows(rows: list, shard_index: int, shard_count: int) -> list:
    return [r for i, r in enumerate(rows) if i % shard_count == shard_index]

def merge_reports(reports: list[dict]) -> dict:
    total = sum(r["total"] for r in reports)
    passed = sum(r["passed"] for r in reports)
    return {"pass_rate": passed / total, "cost_usd": sum(r["cost_usd"] for r in reports)}

<T> List<T> shardRows(List<T> rows, int shardIndex, int shardCount) {
  List<T> out = new ArrayList<>();
  for (int i = 0; i < rows.size(); i++) if (i % shardCount == shardIndex) out.add(rows.get(i));
  return out;
}

📦 Real World

Teams gate merge on smoke but require on-call ack for full-eval failure within 24h—balances velocity vs safety when nightly budget is 45 minutes.

💰 Cost

Cap PR smoke with max_cost_usd in suite YAML—runner aborts early instead of burning $200 on a bad infinite loop prompt.
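
One way a runner might enforce max_cost_usd is to track spend per call and abort as soon as the cap is crossed. This is a sketch under assumptions: call_model is a hypothetical callable that returns the output text and the cost of that call in USD, and BudgetExceeded is an illustrative exception name.

```python
from typing import Callable, Iterable

class BudgetExceeded(RuntimeError):
    """Raised when cumulative eval spend passes the suite's cap."""

def run_with_budget(
    rows: Iterable,
    call_model: Callable[[object], tuple[str, float]],  # hypothetical: returns (text, cost_usd)
    max_cost_usd: float,
) -> list[str]:
    spent = 0.0
    outputs = []
    for row in rows:
        text, cost = call_model(row)
        spent += cost
        if spent > max_cost_usd:
            # Stop immediately; the remaining rows are never sent to the model.
            raise BudgetExceeded(f"spent ${spent:.2f} > cap ${max_cost_usd:.2f}")
        outputs.append(text)
    return outputs
```

The check runs after each call, so the overshoot is bounded by the cost of a single request rather than the whole suite.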

⚖️ Trade-off

Cheap surrogate model in PR saves money but misses GPT-4-only failures—recalibrate monthly with 20-row bridge set run on both models.

FAQ: Re-run full eval on PR manually?

Expose workflow_dispatch with label full-eval for risky prompt rewrites—still non-blocking for merge if smoke passed.

Regression analysis: ε thresholds, t-tests, bucketed slices

Noisy evals need regression thresholds (ε), Welch t-tests on continuous judge scores, and bucketed analysis by tag—aggregate pass rate alone hides localized disasters.

LLM evals are noisy—a 92% pass rate Monday and 89% Wednesday may be normal variance or a real regression. Regression analysis combines pass-rate thresholds (ε), t-tests on continuous scores, and bucketed analysis by tag so you do not miss a 40pt drop in “refund_denial” while aggregate stays green.

Threshold ε (pass rate)

Define the baseline pass rate p₀ from the last green main build. Block if p < p₀ - ε. Typical ε: 4pp for smoke (n≈50), 1pp for full (n>500). Zero tolerance for safety tags (PII, injection) from Guide 4.

Suite	n	ε	Notes
Smoke	50	0.04	Wide band; high variance
Full golden	800	0.01	Tight band
Safety adversarial	200	0.00	Any fail blocks
RAG faithfulness	300	0.02 mean score	Continuous metric

T-test on continuous scores

For judge scores and NLI faithfulness, compare distributions with Welch’s t-test between base and PR. Require p < 0.01 and mean drop > 0.03 to block—avoid blocking on statistically significant but tiny 0.005 mean shifts.

Bucketed analysis

Always slice by tags: intent, locale, risk, prompt_variant. The aggregate pass rate can hide a catastrophic failure in a low-volume bucket such as “escalation”. Fail CI if the failure rate of any bucket with n≥10 rises above 2× its baseline failure rate.

def pass_rate_gate(current: float, baseline: float, epsilon: float, n: int) -> bool:
    if current < baseline - epsilon:
        return False  # block
    # optional: Wilson interval lower bound > baseline - epsilon
    return True

# Example: baseline 0.94, current 0.905, epsilon 0.02 → block

boolean passRateGate(double current, double baseline, double epsilon) {
  return current >= baseline - epsilon;
}

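from statistics import mean

from scipy import stats

def regression_ttest(base_scores: list[float], pr_scores: list[float]) -> dict:
    # Welch's t-test (equal_var=False) does not assume equal variances
    t, p = stats.ttest_ind(base_scores, pr_scores, equal_var=False)
    # scipy.stats has no mean(); use the standard library instead
    delta = mean(pr_scores) - mean(base_scores)
    # Block only when the drop is both significant and large enough to matter
    block = bool(p < 0.01 and delta < -0.03)
    return {"t": t, "p": p, "delta_mean": delta, "block": block}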

// Apache Commons Math TTest
double[] base = ...;
double[] pr = ...;
TTest tTest = new TTest();
double p = tTest.tTest(base, pr);
double delta = StatUtils.mean(pr) - StatUtils.mean(base);
boolean block = p < 0.01 && delta < -0.03;

from collections import defaultdict

def bucket_pass_rates(rows: list, scores: list) -> dict:
    buckets = defaultdict(lambda: {"pass": 0, "total": 0})
    for row, score in zip(rows, scores):
        for tag in row.get("tags", ["default"]):
            buckets[tag]["total"] += 1
            buckets[tag]["pass"] += int(score.passed)
    return {k: v["pass"] / v["total"] for k, v in buckets.items() if v["total"] >= 10}

def any_bucket_regression(current: dict, baseline: dict, min_drop: float = 0.05) -> list[str]:
    failed = []
    for tag, base_rate in baseline.items():
        if current.get(tag, 1.0) < base_rate - min_drop:
            failed.append(tag)
    return failed
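
The bucket rule stated in the text (fail when a bucket's failure rate more than doubles) is relative, whereas the function above uses an absolute drop. A sketch of the relative variant follows. The function name, the floor that keeps a perfect baseline from flagging on a single noisy row, and the choice to skip buckets missing from the current run are all assumptions.

```python
def buckets_with_doubled_failures(
    current: dict[str, float],
    baseline: dict[str, float],
    factor: float = 2.0,
    floor: float = 0.01,  # assumed minimum baseline failure rate
) -> list[str]:
    """Both dicts map tag -> pass rate for buckets that already meet n >= 10."""
    flagged = []
    for tag, base_rate in baseline.items():
        if tag not in current:
            continue  # bucket absent from this run; handle separately if that matters
        base_fail = max(1.0 - base_rate, floor)
        cur_fail = 1.0 - current[tag]
        if cur_fail > factor * base_fail:
            flagged.append(tag)
    return sorted(flagged)

baseline = {"refund": 0.95, "escalation": 0.90, "tax_form": 1.00}
current = {"refund": 0.93, "escalation": 0.70, "tax_form": 0.97}
print(buckets_with_doubled_failures(current, baseline))
```

Here refund moves from 5% to 7% failures and stays green, while escalation (10% to 30%) and tax_form (0% to 3%) are flagged.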

Map bucketPassRates(List rows, List scores) {
  Map counts = new HashMap<>();
  // aggregate pass/total per tag; return rate where total >= 10
  return rates;
}

⚠️ Pitfall

Using fixed ε=0.01 on n=30 smoke—statistical noise triggers false blocks. Scale ε with 1/sqrt(n) or use Wilson intervals.
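
A Wilson score interval makes the sample-size effect explicit. The sketch below computes the interval and uses one possible gate policy, which is an assumption rather than the guide's rule: block only when even the upper bound of the current pass rate is below baseline minus ε, so small samples do not block on noise alone.

```python
import math

def wilson_interval(passed: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passed / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def should_block(passed: int, n: int, baseline: float, epsilon: float) -> bool:
    """Assumed policy: block only if the whole interval sits below baseline - epsilon."""
    _, upper = wilson_interval(passed, n)
    return upper < baseline - epsilon

low, high = wilson_interval(47, 50)
print(round(low, 3), round(high, 3))
print(should_block(45, 50, baseline=0.94, epsilon=0.02))
print(should_block(30, 50, baseline=0.94, epsilon=0.02))
```

For 47 of 50 passing, the interval spans roughly 0.84 to 0.98, which shows how little a 50-row smoke run can say about a 1-point change. A stricter policy, such as requiring the lower bound to clear the threshold, would need far more rows to ever pass.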

🎯 Interview Tip

"How do you detect LLM regressions in CI?" — Baseline pass rate + ε, t-test on judge scores, bucket by intent/locale, zero tolerance safety slice, human review queue for borderline p-values.

🔒 Security

Never waive safety bucket failures for velocity—injection/PII rows must be 100% pass with audit trail if override ever required (exec + security sign-off).

FAQ: Flaky LLM judge?

Run judge with temperature 0; cache judge prompts; require 2-of-3 agreement on borderline rows; exclude known-flaky rows into quarantine tag until fixed.
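
The 2-of-3 rule can be sketched as follows. It assumes the judge is a callable returning a score in [0, 1], that 0.5 is the pass threshold, and that a row counts as borderline when the first score is within 0.1 of that threshold; all three are illustrative choices.

```python
from typing import Callable

def two_of_three(
    judge: Callable[[dict, str], float],  # hypothetical judge returning a score in [0, 1]
    row: dict,
    output: str,
    threshold: float = 0.5,
    band: float = 0.1,
) -> bool:
    first = judge(row, output)
    if abs(first - threshold) > band:
        return first >= threshold  # clear-cut: one judge call is enough
    # Borderline: take two more votes and require a majority.
    votes = [first >= threshold] + [judge(row, output) >= threshold for _ in range(2)]
    return sum(votes) >= 2
```

Only borderline rows pay for the extra two judge calls, which keeps the added cost proportional to how noisy the judge actually is.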

Dataset management: versioning, append-only, provenance

Golden datasets need append-only rows, explicit version pins, manifest checksums, and provenance from incidents or traces—not editable spreadsheets.

Golden datasets rot without governance. Adopt append-only JSONL (never edit rows in place), version pins in suite YAML (dataset_version: 2026-04-01), and provenance linking rows to incident IDs or user reports. Retire rows to archive/ instead of delete—reproducibility beats tidiness.

Versioning strategies

Strategy	Pros	Cons
Git SHA only	Simple	Hard to reference in dashboards
Date semver 2026.04.01	Human readable	Needs changelog
DVC / LFS blob	Large datasets	Extra infra
Langfuse/Braintrust export	Trace → row loop	Vendor lock-in

Row lifecycle

Intake — incident or eval failure → draft row in PR
Review — ML + domain owner approve tags and expected
Active — included in full eval
Quarantine — flaky judge; excluded from blocking until fixed
Archive — obsolete policy; kept for historical replay

Synthetic vs production-sourced

Balance: 60% synthetic/hand-crafted, 40% de-identified production failures (from Guide 5 trace exemplars). Synthetic covers edge cases; production rows catch real phrasing and retrieval gaps.

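import json
import sys

def load_rows(path: str) -> dict[str, dict]:
    # Map id -> parsed row, skipping blank lines
    with open(path) as f:
        return {row["id"]: row for row in (json.loads(l) for l in f if l.strip())}

def lint_append_only(old_path: str, new_path: str) -> None:
    old_rows, new_rows = load_rows(old_path), load_rows(new_path)
    for rid, old_row in old_rows.items():
        # Compare content directly instead of trusting a _modified flag on the row.
        # Rows that were moved out (e.g. to archive/) are not flagged here.
        if rid in new_rows and new_rows[rid] != old_row:
            print(f"ERROR: in-place edit forbidden for id={rid}")
            sys.exit(1)
    print("OK: append-only respected")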

void lintAppendOnly(Path oldPath, Path newPath) throws IOException {
  Set<String> oldIds = readIds(oldPath);
  for (String line : Files.readAllLines(newPath)) {
    EvalRow row = EvalRow.parse(line);
    if (oldIds.contains(row.id()) && row.modified()) throw new IllegalStateException("in-place edit: " + row.id());
  }
}

def trace_to_golden(trace: dict) -> dict:
    return {
        "id": f"inc_{trace['incident_id']}",
        "input": trace["user_query_redacted"],
        "context": trace.get("chunk_ids", []),
        "expected": {"rubric": "Must cite policy SEC-12 and refuse unauthorized refund"},
        "tags": ["refund", "from_trace", trace.get("locale", "en")],
        "source": {"incident_id": trace["incident_id"], "trace_id": trace["trace_id"]},
    }

EvalRow traceToGolden(Trace trace) {
  return EvalRow.builder()
      .id("inc_" + trace.incidentId())
      .input(trace.userQueryRedacted())
      .tag("from_trace")
      .source(Map.of("trace_id", trace.traceId()))
      .build();
}

# evals/datasets/manifest.yaml
datasets:
  refund_golden:
    version: "2026.04.01"
    path: refund_golden.jsonl
    sha256: a1b2c3...
    row_count: 842

# manifest.yaml consumed by Java runner for integrity check
datasets:
  refund_golden:
    version: "2026.04.01"
    sha256: a1b2c3

💡 Tip

Require source.incident_id or source.author on every new row—anonymous golden rows become unowned debt.

📦 Real World

Support org added 200 rows from peak season without tags—spring eval pass rate looked fine while “tax_form” bucket silently failed until April traffic hit.

🔬 Under the Hood

JSONL line hashes in manifest detect accidental git reformatting—CI fails if sha256 mismatch even when row content ‘looks’ identical.
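
A minimal sketch of the integrity check, assuming a manifest entry with the sha256 and row_count fields shown in the manifest example. It hashes the whole file rather than individual lines, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path

def check_dataset_bytes(data: bytes, entry: dict) -> list[str]:
    """Compare raw file bytes against a manifest entry; return a list of problems."""
    problems = []
    actual = hashlib.sha256(data).hexdigest()
    if actual != entry["sha256"]:
        problems.append(f"sha256 mismatch: manifest {entry['sha256'][:12]}, file {actual[:12]}")
    rows = sum(1 for line in data.splitlines() if line.strip())
    if rows != entry["row_count"]:
        problems.append(f"row_count mismatch: manifest {entry['row_count']}, file {rows}")
    return problems

def check_dataset_file(path: str, entry: dict) -> list[str]:
    # Read bytes, not text, so newline or encoding conversions cannot hide a change.
    return check_dataset_bytes(Path(path).read_bytes(), entry)

original = b'{"id": "a"}\n{"id": "b"}\n'
entry = {"sha256": hashlib.sha256(original).hexdigest(), "row_count": 2}
reformatted = b'{"id":"a"}\n{"id":"b"}\n'  # same JSON content, different bytes
print(check_dataset_bytes(reformatted, entry))
```

The reformatted file parses to identical rows but fails the hash check, which is the behavior the callout describes.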

FAQ: Edit wrong expected answer?

Add a superseding row whose id is the old id with a -v2 suffix, and list the old id under retired_ids in the manifest.

Prompt evaluation gates: registry, A/B, and shadow

Every prompt PR should pass registry lint, smoke eval against baseline, token budget checks, and—before prod—shadow evaluation on redacted prod queries or synthetic equivalents.

Prompt changes are the highest-frequency cause of LLM regressions. Prompt evaluation gates run on every prompt registry PR: diff highlights system message changes, smoke eval against pinned model, optional A/B against production prompt, and shadow eval on sampled prod queries in staging before flip.

Prompt registry pattern

Store templates as files; registry maps refund_v3.2.1 → path + checksum. CI resolves prompt from PR SHA, not floating “main”.

Gate	Trigger	Blocks?
Template lint (Jinja vars)	PR	Yes
Smoke eval vs baseline prompt	PR	Yes
Token count budget	PR	Yes if >+15%
Shadow replay 1k prod queries	Pre-prod deploy	Warn
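
The token budget gate in the table reduces to a relative-growth check. The sketch below assumes token counts are computed elsewhere with the tokenizer of the target model, and treats max_tokens_budget from the registry as an optional hard cap; the function name is hypothetical.

```python
from typing import Optional

def token_budget_gate(
    base_tokens: int,
    candidate_tokens: int,
    max_increase: float = 0.15,      # block above +15%, per the gate table
    hard_cap: Optional[int] = None,  # e.g. max_tokens_budget from the registry
) -> tuple[bool, float]:
    """Return (ok, relative growth) for a prompt change."""
    if base_tokens <= 0:
        raise ValueError("base_tokens must be positive")
    growth = (candidate_tokens - base_tokens) / base_tokens
    within_cap = hard_cap is None or candidate_tokens <= hard_cap
    return growth <= max_increase and within_cap, growth

print(token_budget_gate(1000, 1100))
print(token_budget_gate(1000, 1200))
print(token_budget_gate(1100, 1250, hard_cap=1200))
```

The third call shows why both checks are useful: the change is under 15% growth but still exceeds the absolute cap.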

A/B and shadow evaluation

A/B: same golden rows, two prompt versions, compare pass rate and cost. Shadow: staging receives copy of prod traffic (redacted), runs candidate prompt, no user impact. Promote when candidate wins on quality with ≤5% cost increase.
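
The promotion rule can be written as a small predicate over an A/B result. This sketch assumes A is the production prompt and B the candidate, reads the pass_a, pass_b, cost_a, and cost_b keys used by the A/B helper in this section, and treats "wins on quality" as a strictly higher pass rate, which is an assumption.

```python
def should_promote(result: dict, max_cost_increase: float = 0.05) -> bool:
    """Promote candidate B only if it beats A on quality within the cost allowance."""
    quality_win = result["pass_b"] > result["pass_a"]
    cost_ok = result["cost_b"] <= result["cost_a"] * (1 + max_cost_increase)
    return quality_win and cost_ok

print(should_promote({"pass_a": 0.90, "pass_b": 0.93, "cost_a": 10.0, "cost_b": 10.4}))
print(should_promote({"pass_a": 0.90, "pass_b": 0.93, "cost_a": 10.0, "cost_b": 11.0}))
```

A tie on quality does not promote under this rule, even if the candidate is cheaper; teams that want cost-only wins would need a separate branch.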

Human review queue

Rows where judge score ∈ [0.4, 0.6] or diff >200 chars go to human review before merge on high-risk prompts (refunds, legal, medical disclaimers).
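
The routing rule above is simple enough to express directly. In this sketch, diff_chars is assumed to be the character length of the output diff for the row, and high_risk is assumed to come from the prompt's risk classification; neither is defined by the guide.

```python
def needs_human_review(
    judge_score: float,
    diff_chars: int,
    high_risk: bool,
    band: tuple[float, float] = (0.4, 0.6),
    max_diff_chars: int = 200,
) -> bool:
    """Route to the human queue only for high-risk prompts with an uncertain or large change."""
    if not high_risk:
        return False
    borderline = band[0] <= judge_score <= band[1]
    return borderline or diff_chars > max_diff_chars

print(needs_human_review(0.55, 20, high_risk=True))
print(needs_human_review(0.95, 350, high_risk=True))
print(needs_human_review(0.55, 350, high_risk=False))
```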

prompts:
  refund_system:
    version: 3.2.1
    path: prompts/refund_system.jinja2
    sha256: deadbeef...
    max_tokens_budget: 1200
  refund_user:
    version: 1.0.0
    path: prompts/refund_user.jinja2

prompts:
  refund_system:
    version: 3.2.1
    path: prompts/refund_system.jinja2

from statistics import mean

def ab_eval(rows: list, prompt_a: str, prompt_b: str, model: str) -> dict:
    scores_a = [run_row(row, prompt_a, model) for row in rows]
    scores_b = [run_row(row, prompt_b, model) for row in rows]
    return {
        "pass_a": mean(s.passed for s in scores_a),
        "pass_b": mean(s.passed for s in scores_b),
        "cost_a": sum(s.cost for s in scores_a),
        "cost_b": sum(s.cost for s in scores_b),
        "winner": "B" if mean(s.value for s in scores_b) > mean(s.value for s in scores_a) else "A",
    }

AbResult abEval(List<EvalRow> rows, String promptA, String promptB, String model) {
  var scoresA = rows.stream().map(r -> runRow(r, promptA, model)).toList();
  var scoresB = rows.stream().map(r -> runRow(r, promptB, model)).toList();
  double passA = scoresA.stream().filter(Score::passed).count() / (double) rows.size();
  double passB = scoresB.stream().filter(Score::passed).count() / (double) rows.size();
  return new AbResult(passA, passB, passA >= passB ? "A" : "B");
}

import asyncio

async def shadow_eval_loop():
    async for event in prod_query_stream(sample_rate=0.05):
        if event.env != "staging_shadow":
            continue
        candidate = render_prompt(registry.get("refund_system", version=CANDIDATE_VER), event)
        prod = render_prompt(registry.get("refund_system", version=PROD_VER), event)
        score_c, score_p = await asyncio.gather(judge(candidate), judge(prod))
        metrics.gauge("shadow.delta", score_c - score_p)

void shadowEvalLoop() {
  prodQueryStream(0.05).forEach(event -> {
    String candidate = renderPrompt(CANDIDATE_VER, event);
    String prod = renderPrompt(PROD_VER, event);
    double delta = judge(candidate) - judge(prod);
    metrics.gauge("shadow.delta", delta);
  });
}

⚖️ Trade-off

Shadow eval catches real phrasing but needs strict PII scrub—prefer synthetic shadow if legal blocks prod query copy.

💰 Cost

A/B doubles eval API spend—run on stratified 100-row subset, not full 800-row golden, unless release-tier promotion.

🎯 Interview Tip

"Prompt change process?" — Registry semver, PR smoke vs baseline, token budget gate, shadow on staging, rollback via prompt_version pin.

FAQ: Few-shot examples in prompt?

Version examples as separate YAML; CI fails if example policy IDs drift from current policy database snapshot.

CI tools: DeepEval, PromptFoo, Braintrust, LangSmith

Compare DeepEval, PromptFoo, Braintrust, and LangSmith for CI integration, self-host options, PR comments, and regression UX.

CI eval tooling spans dedicated frameworks (DeepEval, PromptFoo) and platform suites (Braintrust, LangSmith). Pick based on self-host needs, language support, PR comment UX, and whether you already pay for trace storage on the same vendor.

CI tools comparison

Tool	CI focus	Self-host	Strength	Gap
DeepEval	Pytest-native metrics	OSS core	Fast unit-style LLM tests in CI	Less PR UX polish
PromptFoo	CLI matrix + red team	OSS	Prompt A/B grids, YAML configs, CI comments	Java secondary
Braintrust	Cloud eval + diff	Cloud	Human review, regression diffs, datasets	Cost at scale
LangSmith	LangChain eval runs	Cloud	Trace→dataset, eval on runs	Best inside LC ecosystem

DeepEval

Wraps metrics (faithfulness, toxicity, G-Eval) as pytest cases—familiar to Python teams. Good for scorer unit tests and smoke in GitHub Actions. Pair with custom JSONL loader for full golden suites.

PromptFoo

Declarative YAML defines prompt matrix, providers, assertions (contains, javascript, llm-rubric). Built-in red-team plugins for injection probes. Excellent PR comment output via GitHub Action.

Braintrust CI

Push eval results to cloud; UI diffs failures across commits. Strong when PMs and ML share review duty. CI action fails build on regression policy defined in project settings.

LangSmith CI

Run evaluate() in CI against LangSmith datasets; ties to traces from staging. Lowest friction if already on LangChain for app and observability (Guide 5).

Selection matrix

If you…	Prefer
Want pytest in CI tomorrow	DeepEval
Need prompt A/B grids + red team	PromptFoo
Need human review on failures	Braintrust
All-in on LangChain	LangSmith

import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric

@pytest.mark.parametrize("row", load_smoke_rows())
def test_refund_faithfulness(row):
    output = run_pipeline(row["input"])
    test_case = LLMTestCase(input=row["input"], actual_output=output, retrieval_context=row["context"])
    assert_test(test_case, [FaithfulnessMetric(threshold=0.85)])

@ParameterizedTest
@MethodSource("smokeRows")
void testRefundFaithfulness(EvalRow row) {
  String output = pipeline.run(row.input());
  assert faithfulnessMetric.score(output, row.context()) >= 0.85;
}

# promptfoo.yaml
prompts: [file://prompts/refund_system.jinja2]
providers: [openai:gpt-4o]
tests: file://evals/promptfoo/refund_cases.yaml
defaultTest:
  assert:
    - type: llm-rubric
      value: Must cite official policy ID
    - type: not-contains
      value: SSN

# promptfoo.yaml — same format for JavaScript/Java via CLI wrapper

import os

import braintrust

def ci_eval():
    experiment = braintrust.init("refund-copilot", experiment=os.environ["GITHUB_SHA"])
    for row in load_golden():
        out = run(row)
        experiment.log(input=row["input"], output=out, scores={"faithfulness": score(out, row)})
    summary = experiment.summarize()
    # NOTE: the shape of the summary object depends on the SDK version;
    # confirm where regression counts live before relying on this check.
    if summary.regressions:
        raise SystemExit(1)

void ciEval() {
  Experiment exp = Braintrust.init("refund-copilot", System.getenv("GITHUB_SHA"));
  for (EvalRow row : loadGolden()) {
    String out = run(row);
    exp.log(row.input(), out, Map.of("faithfulness", score(out, row)));
  }
  if (exp.summarize().hasRegressions()) System.exit(1);
}

📦 Real World

Python app team: DeepEval in unit CI + PromptFoo weekly red-team; LangSmith only for trace→dataset loop—not duplicate scorers in both.

⚖️ Trade-off

PromptFoo YAML assertions are fast but brittle on paraphrase—pair contains with llm-rubric for high-stakes rows.

💰 Cost

Braintrust/LangSmith CI billed per eval row—cache model responses in CI when prompt unchanged (key = hash(input+prompt_version)).
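
A sketch of the response cache keyed on input and prompt version. Including the model name in the key is an added assumption (a model change should also invalidate cached outputs), and the class and function names are hypothetical. The key is built from a JSON payload rather than plain string concatenation so that different field combinations cannot collide.

```python
import hashlib
import json
from typing import Callable

def cache_key(row_input: str, prompt_version: str, model: str = "") -> str:
    payload = json.dumps(
        {"input": row_input, "prompt_version": prompt_version, "model": model},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class ResponseCache:
    """In-memory cache; a CI setup would persist this between runs."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def get_or_call(self, key: str, call: Callable[[], str]) -> str:
        if key not in self._store:
            self._store[key] = call()  # only pay for the model call on a miss
        return self._store[key]

cache = ResponseCache()
key = cache_key("Can I get a refund?", "3.2.1", "gpt-4o")
print(cache.get_or_call(key, lambda: "model output"))
```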

FAQ: Multiple tools?

One blocking runner in merge path; others advisory (red team weekly). Duplicate scorers diverge.

Track 5 capstone checklist

Complete the Track 5 capstone checklist—eval-as-code CI, observability handoff, safety gates, and incident→golden loop—before starting Track 6 Shipping.

Track 5 capstone: offline eval gates + online observability + safety rails form a closed loop. This checklist confirms CI eval, dataset governance, prompt gates, and observability handoffs before Track 6 — Shipping.

Track 5 capstone checklist

Capability	Guide	Done when
Eval foundations + judges	1–2	Scorers documented
RAG eval gold	3	RAGAS nightly green
Safety adversarial CI	4	Zero waiver PII/injection
Production traces + alerts	5	TTFT/cost dashboards live
Eval-as-code in CI	6	Smoke blocks PR
Full eval blocks release	6	Nightly artifact retained
Incident → golden loop	5+6	Weekly promotion PRs

Release gate summary

Gate	Threshold	Owner
PR smoke pass rate	≥ baseline − ε	ML eng
Safety suite	100%	Security
Full eval faithfulness mean	≥ baseline − 0.02	ML eng
Observability schema tests	Pass	Platform
Shadow eval 24h	No quality red flag	Product

GATES = {
    "smoke_ci": github.check_required("llm-eval-smoke"),
    "full_eval_last_green": artifact.pass_rate("full-eval") >= 0.93,
    "observability_checklist": open("docs/obs_checklist.done").read().strip() == "yes",
    "safety_zero_fail": artifact.pass_rate("safety-adversarial") == 1.0,
}

def track5_ready() -> bool:
    return all(GATES.values())

Map<String, Supplier<Boolean>> gates = Map.of(
    "smoke_ci", () -> github.checkRequired("llm-eval-smoke"),
    "safety_zero_fail", () -> artifacts.passRate("safety-adversarial") == 1.0
);
boolean track5Ready() { return gates.values().stream().allMatch(Supplier::get); }

def weekly_golden_promotion():
    traces = fetch_failed_traces(min_faithfulness=0.5, days=7)
    rows = [trace_to_golden(t) for t in traces[:20]]
    if not rows:
        return
    open_pr(f"evals: promote {len(rows)} incident rows", write_jsonl(rows))
    # After merge, nightly full eval automatically includes new ids

void weeklyGoldenPromotion() {
  var traces = fetchFailedTraces(0.5, 7);
  var rows = traces.stream().limit(20).map(Evals::traceToGolden).toList();
  if (!rows.isEmpty()) openPr("evals: promote incident rows", writeJsonl(rows));
}

📦 End of Track 5

You built eval foundations, RAG and safety gates, observability, and CI regression. Track 6 — Shipping covers launch checklists, feature flags, rollback, and operating LLM products in production.

🔒 Security

Capstone gate: safety adversarial suite at 100% with no override in CI config—waivers only via signed release exception file in repo.

📦 Real World

Teams shipping Track 6 without capstone green typically revert within 72h—missing CI eval means prompt hotfixes ship without smoke.

🎯 Interview Tip

"End-to-end LLM quality?" — Golden eval-as-code, CI regression + buckets, safety zero-tolerance, prod traces with sampled judges, incident→dataset loop.

💰 Cost

Budget full nightly eval separately from prod inference—typical $20–80/night for 1k-row golden at GPT-4o; cache when prompts unchanged.

🔬 Under the Hood

Regression baselines stored as CI artifacts from last green main—PR compares to downloaded baseline, not live re-run of main (saves API cost).

⚖️ Trade-off

Blocking every PR on full eval adds 45 min latency—smoke on PR + full on main is the industry compromise.

⚠️ Pitfall

Celebrating Track 5 complete with CI green once—datasets drift, models update, indexes rot. Capstone is a living program with weekly review.

Track 5 → Track 6 handoff

Feature flag prompt_version rollout doc
Rollback runbook tied to eval baseline SHA
On-call rotation includes ML eval owner
Launch dashboard: TTFT, cost, faithfulness, safety blocks

FAQ: Skip full eval for hotfix?

Only with exec + ML sign-off and automatic post-merge full eval within 4h—document in incident ticket.