Continuous evaluation in CI/CD
A green unit test suite does not stop a prompt edit from breaking refunds. Eval-as-code treats golden cases, scorers, and regression thresholds like migrations: versioned in git, executed in CI, blocking merge on slip. This is Guide 6—the Track 5 capstone—covering PR fast eval, nightly full suites, statistical regression analysis, dataset management, prompt change gates, and CI tool selection before Track 6 shipping.
After reading, you should be able to: store eval suites in-repo and run them in CI; design PR smoke vs merge full-eval pipelines with PR comment artifacts; apply regression thresholds, t-tests, and bucketed analysis; version datasets append-only and tie rows to prompt/model versions; enforce prompt-eval gates with A/B and shadow evaluation; compare DeepEval, PromptFoo, Braintrust, and LangSmith CI integrations; and complete the Track 5 capstone checklist.
Eval-as-code: golden sets, scorers, and suite config in git
Eval-as-code stores datasets, scorers, thresholds, and runner config in the repo—versioned, reviewed, and executed in CI like any other migration or test suite.
Eval-as-code means golden cases, scorers, thresholds, and runner config live in git beside application code—reviewed in PRs, executed in CI, versioned with prompts. Ad-hoc spreadsheets and one engineer’s laptop notebook do not block merges. Treat eval suites like database migrations: immutable history, explicit rollback, and ownership in CODEOWNERS.
Repository layout
| Path | Contents | Change frequency |
|---|---|---|
| evals/datasets/*.jsonl | Golden rows (append-only) | Weekly from incidents |
| evals/scorers/ | Python/Java scorer modules | Monthly |
| evals/suites/*.yaml | Suite composition + thresholds | Per feature launch |
| prompts/registry.yaml | Prompt semver → template path | Every prompt PR |
| evals/ci.config.yaml | Smoke vs full matrix | Rare |
Golden row schema
Each row: id, input, optional context, expected or rubric, tags[] (intent, locale, risk), and source (incident_id, user_report, synthetic).
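The schema above can be checked mechanically before a row lands in a dataset. The sketch below is one possible check, not the guide's runner: `validate_row`, its error strings, and the sample row values are hypothetical; only the field names come from the schema description.

```python
import json

# Fields the schema lists as always present; `context` is optional.
REQUIRED_FIELDS = ("id", "input", "tags", "source")


def validate_row(row: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the row looks valid."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in row]
    # Every row needs something to score against: `expected` or `rubric`.
    if "expected" not in row and "rubric" not in row:
        errors.append("need expected or rubric")
    if not isinstance(row.get("tags", []), list):
        errors.append("tags must be a list")
    return errors


# Hypothetical JSONL line, shaped like the schema above.
line = (
    '{"id": "refund-001", "input": "Can I get a refund after 45 days?", '
    '"expected": {"policy_id": "SEC-12"}, "tags": ["refund", "en", "high_risk"], '
    '"source": {"incident_id": "INC-1"}}'
)
print(validate_row(json.loads(line)))
```

Running a check like this in the lint stage keeps malformed rows out of the smoke and full suites.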
Scorers as pure functions
Scorers return Score(name, value, passed, metadata)—no network in unit scorers; LLM judges isolated behind interfaces for mocking. Compose with AND for blocking gates, OR for diagnostic-only signals.
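The AND/OR composition mentioned above could look roughly like the following. This is a sketch under assumptions: `all_of` and `any_of` are hypothetical helper names, and `Score` is re-declared here only so the snippet runs on its own.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Score:
    name: str
    value: float
    passed: bool
    metadata: dict[str, Any] = field(default_factory=dict)


ScorerFn = Callable[[dict, str], Score]


def all_of(name: str, *scorers: ScorerFn) -> ScorerFn:
    """AND composition for blocking gates: passes only if every child passes."""
    def combined(row: dict, output: str) -> Score:
        children = [s(row, output) for s in scorers]
        return Score(name, min(c.value for c in children),
                     all(c.passed for c in children),
                     {"children": [c.name for c in children]})
    return combined


def any_of(name: str, *scorers: ScorerFn) -> ScorerFn:
    """OR composition for diagnostic signals: passes if any child passes."""
    def combined(row: dict, output: str) -> Score:
        children = [s(row, output) for s in scorers]
        return Score(name, max(c.value for c in children),
                     any(c.passed for c in children),
                     {"children": [c.name for c in children]})
    return combined
```

Taking the minimum child value for AND and the maximum for OR is one convention among several; the point is that the combined scorer still returns a single `Score`.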
# evals/suites/refund_smoke.yaml
name: refund_smoke
dataset: evals/datasets/refund_golden.jsonl
sample: 50  # PR smoke: first 50 rows
scorers:
  - name: exact_policy_id
    fn: scorers.policy_id_match
    threshold: 1.0
  - name: faithfulness
    fn: scorers.nli_faithfulness
    threshold: 0.85
blocking: true
max_cost_usd: 2.00

# evals/suites/refund_smoke.yaml (same schema, consumed by the Java runner)
name: refund_smoke
dataset: evals/datasets/refund_golden.jsonl
sample: 50
scorers:
  - name: exact_policy_id
    fn: com.acme.evals.PolicyIdMatch
    threshold: 1.0
from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class Score:
    name: str
    value: float
    passed: bool
    metadata: dict[str, Any]

class Scorer(Protocol):
    def __call__(self, row: dict, output: str) -> Score: ...

def policy_id_match(row: dict, output: str) -> Score:
    expected = row["expected"]["policy_id"]
    found = extract_policy_id(output)  # extraction helper defined elsewhere
    value = 1.0 if found == expected else 0.0
    return Score("policy_id", value, value == 1.0, {"expected": expected, "found": found})

public record Score(String name, double value, boolean passed, Map<String, Object> metadata) {}

public interface Scorer {
    Score score(EvalRow row, String output);
}

public class PolicyIdMatch implements Scorer {
    @Override
    public Score score(EvalRow row, String output) {
        String expected = row.expected().get("policy_id");
        String found = PolicyIdExtractor.extract(output);
        double value = expected.equals(found) ? 1.0 : 0.0;
        return new Score("policy_id", value, value == 1.0, Map.of("expected", expected));
    }
}
import os

import yaml
from runner import run_suite

def main():
    with open("evals/suites/refund_smoke.yaml") as f:
        cfg = yaml.safe_load(f)
    report = run_suite(cfg, prompt_version=os.environ["PROMPT_VERSION"])
    report.to_json("eval-report.json")
    if report.blocking_failed:
        raise SystemExit(1)  # non-zero exit fails the CI job

if __name__ == "__main__":
    main()

public static void main(String[] args) {
    SuiteConfig cfg = SuiteConfig.load("evals/suites/refund_smoke.yaml");
    EvalReport report = Runner.run(cfg, System.getenv("PROMPT_VERSION"));
    report.writeJson("eval-report.json");
    if (report.blockingFailed()) System.exit(1);
}
Pin PROMPT_VERSION and MODEL env vars in CI from the PR diff—never “latest” in merge gates.
Checking golden files into git without row IDs turns merge conflicts into unmergeable JSONL blobs. Use a stable id per line and an append-only policy.
LLM eval CI is embarrassingly parallel—shard JSONL by hash(row.id) across workers; aggregate with weighted pass rates, not naive average of shard rates.
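Sharding by row id, as described above, needs a hash that is stable across processes; Python's built-in `hash()` on strings is salted per process, so the sketch below uses SHA-256 instead. The function names are hypothetical, and the report dicts are assumed to carry `total` and `passed` counts.

```python
import hashlib


def shard_for(row_id: str, shard_count: int) -> int:
    """Map a row id to a shard index that is identical on every worker."""
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def shard_rows_by_id(rows: list[dict], shard_index: int, shard_count: int) -> list[dict]:
    """Select the rows that belong to one shard."""
    return [r for r in rows if shard_for(r["id"], shard_count) == shard_index]


def merge_pass_rate(shard_reports: list[dict]) -> float:
    """Weighted pass rate: sum the counts across shards, then divide once."""
    total = sum(r["total"] for r in shard_reports)
    passed = sum(r["passed"] for r in shard_reports)
    return passed / total if total else 0.0


# Two uneven shards: averaging the shard rates (1.0 and 0.5) would give 0.75,
# while the weighted rate reflects that most rows sit in the weaker shard.
print(merge_pass_rate([{"total": 10, "passed": 10}, {"total": 90, "passed": 45}]))
```

Because the shard index depends only on the id, appending rows never moves existing rows to a different worker.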
FAQ: How big should smoke be?
50–100 rows covering top intents and one adversarial tag each—finish in <3 min and <$3 LLM spend per PR.
CI pipeline: PR smoke, merge full eval, PR comments
Design CI in three layers: PR fast eval (smoke subset, minutes, blocks merge), merge full eval (complete golden + adversarial, blocks release), and PR comment artifacts with pass-rate delta and failure IDs.
CI pipelines for LLM apps usually split into PR fast eval (smoke subset, cheap model, parallel shards), merge full eval (nightly or post-merge, full golden + adversarial), and PR comment artifacts (pass rate delta, cost, slowest rows, trace links). Block merges on smoke; block releases on full.
PR fast eval
- Trigger: pull_request on paths prompts/**, evals/**, app code affecting LLM path
- Runtime budget: <5 minutes wall, <$5 API spend
- Dataset: smoke sample stratified by tag
- Model: same as prod or cheaper surrogate with calibration doc
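A smoke sample stratified by tag, as listed above, could be drawn along these lines. This is a sketch with assumptions: it groups each row by its first tag only, and `stratified_sample` is a hypothetical helper, not part of any runner shown in this guide.

```python
import random
from collections import defaultdict


def stratified_sample(rows: list[dict], n: int, seed: int = 0) -> list[dict]:
    """Pick up to n rows, cycling through tag groups so small tags are not crowded out."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        key = row["tags"][0] if row.get("tags") else "untagged"
        groups[key].append(row)
    rng = random.Random(seed)  # fixed seed keeps the smoke set stable between runs
    for members in groups.values():
        rng.shuffle(members)
    picked: list[dict] = []
    # Round-robin over tags in sorted order until n rows are picked or all groups are empty.
    while len(picked) < n and any(groups.values()):
        for key in sorted(groups):
            if groups[key] and len(picked) < n:
                picked.append(groups[key].pop())
    return picked
```

Round-robin selection over-represents rare tags relative to their share of the dataset, which is usually the intent for a smoke set that must touch every intent at least once.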
Merge / nightly full eval
- Trigger: push to main, cron 02:00 UTC
- Full golden + adversarial + safety suites from Guide 4
- Publish report artifact retained 90 days
- Block release tag if regression vs baseline > ε
PR comments
Post a structured summary: overall pass rate, Δ vs base branch, top 3 failing row IDs with diff snippets, estimated eval cost, and a link to the HTML report. Engineers can then fix failures without digging through CI logs.
| Stage | When | Blocks merge? | Typical duration |
|---|---|---|---|
| Lint / schema | PR | Yes | 30s |
| Smoke eval | PR | Yes | 2–4 min |
| Unit scorers | PR | Yes | 20s |
| Full eval | Post-merge | Release | 20–60 min |
| Shadow prod sample | Daily | No (alert) | Continuous |
name: llm-eval-smoke
on:
  pull_request:
    paths: ['prompts/**', 'evals/**', 'src/**']
jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r evals/requirements.txt
      - env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          PROMPT_VERSION: ${{ github.sha }}
        run: python -m evals.runner --suite refund_smoke --shard 4
      - uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: eval-report.json
// GitHub Actions YAML identical; Java step:
// run: ./gradlew :evals:runSmoke -PpromptVersion=${{ github.sha }}
import json
import subprocess

def post_pr_comment(report_path: str):
    with open(report_path) as f:
        report = json.load(f)
    body = f"""## LLM eval smoke
| Metric | Value |
|--------|-------|
| Pass rate | {report['pass_rate']:.1%} |
| Δ vs base | {report['delta_vs_base']:+.1%} |
| Cost | ${report['cost_usd']:.2f} |
| Failures | {', '.join(report['failed_ids'][:5])} |
"""
    # requires an authenticated gh CLI (e.g. GH_TOKEN set in the CI job)
    subprocess.run(["gh", "pr", "comment", "--body", body], check=True)

void postPrComment(Path reportPath) throws IOException {
    EvalReport report = EvalReport.readJson(reportPath);
    String body = String.format("Pass rate: %.1f%%", report.passRate() * 100);
    new ProcessBuilder("gh", "pr", "comment", "--body", body).inheritIO().start();
}
def shard_rows(rows: list, shard_index: int, shard_count: int) -> list:
    return [r for i, r in enumerate(rows) if i % shard_count == shard_index]

def merge_reports(reports: list[dict]) -> dict:
    # sum counts across shards before dividing, so the pass rate is weighted by shard size
    total = sum(r["total"] for r in reports)
    passed = sum(r["passed"] for r in reports)
    return {"pass_rate": passed / total, "cost_usd": sum(r["cost_usd"] for r in reports)}

List<EvalRow> shardRows(List<EvalRow> rows, int shardIndex, int shardCount) {
    List<EvalRow> out = new ArrayList<>();
    for (int i = 0; i < rows.size(); i++) {
        if (i % shardCount == shardIndex) out.add(rows.get(i));
    }
    return out;
}
Teams gate merge on smoke but require on-call ack for full-eval failure within 24h—balances velocity vs safety when nightly budget is 45 minutes.
Cap PR smoke with max_cost_usd in suite YAML—runner aborts early instead of burning $200 on a bad infinite loop prompt.
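An early abort on `max_cost_usd` could be implemented roughly as below. This is a sketch: `run_with_budget`, `BudgetExceeded`, and the assumption that each row run reports its own cost as `(result, cost_usd)` are all hypothetical.

```python
from typing import Any, Callable


class BudgetExceeded(RuntimeError):
    """Raised when cumulative eval spend passes the suite's cap."""


def run_with_budget(
    rows: list,
    run_row: Callable[[Any], tuple[Any, float]],
    max_cost_usd: float,
) -> tuple[list, float]:
    """Run rows in order and stop as soon as spend exceeds the cap."""
    spent = 0.0
    results = []
    for row in rows:
        result, cost = run_row(row)  # assumed to return (result, cost in USD)
        spent += cost
        results.append(result)
        if spent > max_cost_usd:
            raise BudgetExceeded(
                f"spent ${spent:.2f} > cap ${max_cost_usd:.2f} after {len(results)} rows"
            )
    return results, spent
```

The check runs after each row, so the overshoot is bounded by the cost of a single row rather than the whole suite.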
Cheap surrogate model in PR saves money but misses GPT-4-only failures—recalibrate monthly with 20-row bridge set run on both models.
FAQ: Re-run full eval on PR manually?
Expose a workflow_dispatch trigger (or a full-eval PR label) for risky prompt rewrites. The run stays non-blocking for merge if smoke passed.
Regression analysis: ε thresholds, t-tests, bucketed slices
Noisy evals need regression thresholds (ε), Welch t-tests on continuous judge scores, and bucketed analysis by tag—aggregate pass rate alone hides localized disasters.
LLM evals are noisy—a 92% pass rate Monday and 89% Wednesday may be normal variance or a real regression. Regression analysis combines pass-rate thresholds (ε), t-tests on continuous scores, and bucketed analysis by tag so you do not miss a 40pt drop in “refund_denial” while aggregate stays green.
Threshold ε (pass rate)
Define baseline pass rate p₀ from the last green main build. Block if p < p₀ - ε. Typical ε: 4pp for smoke (n≈50), 1pp for full (n>500), as in the table below. Zero tolerance for safety tags (PII, injection) from Guide 4.
| Suite | n | ε | Notes |
|---|---|---|---|
| Smoke | 50 | 0.04 | Wide band; high variance |
| Full golden | 800 | 0.01 | Tight band |
| Safety adversarial | 200 | 0.00 | Any fail blocks |
| RAG faithfulness | 300 | 0.02 mean score | Continuous metric |
T-test on continuous scores
For judge scores and NLI faithfulness, compare distributions with Welch’s t-test between base and PR. Require p < 0.01 and mean drop > 0.03 to block—avoid blocking on statistically significant but tiny 0.005 mean shifts.
Bucketed analysis
Always slice by tags: intent, locale, risk, prompt_variant. An aggregate pass rate can hide catastrophic failure in a low-volume bucket such as “escalation”. CI fails if any bucket with n≥10 shows a failure rate more than 2× its baseline failure rate.
def pass_rate_gate(current: float, baseline: float, epsilon: float, n: int) -> bool:
    """Return True when the build may proceed, False to block."""
    if current < baseline - epsilon:
        return False  # block
    # optional: Wilson interval lower bound > baseline - epsilon (this is where n would be used)
    return True

# Example: baseline 0.94, current 0.905, epsilon 0.02 → block

boolean passRateGate(double current, double baseline, double epsilon) {
    return current >= baseline - epsilon;
}
from statistics import mean

from scipy import stats

def regression_ttest(base_scores: list[float], pr_scores: list[float]) -> dict:
    # equal_var=False selects Welch's t-test (unequal variances)
    t, p = stats.ttest_ind(base_scores, pr_scores, equal_var=False)
    # scipy.stats has no mean(); use statistics.mean for the effect size
    delta = mean(pr_scores) - mean(base_scores)
    block = p < 0.01 and delta < -0.03
    return {"t": t, "p": p, "delta_mean": delta, "block": block}

// Apache Commons Math TTest
double[] base = ...;
double[] pr = ...;
TTest tTest = new TTest();
double p = tTest.tTest(base, pr);
double delta = StatUtils.mean(pr) - StatUtils.mean(base);
boolean block = p < 0.01 && delta < -0.03;
from collections import defaultdict

def bucket_pass_rates(rows: list, scores: list) -> dict:
    buckets = defaultdict(lambda: {"pass": 0, "total": 0})
    for row, score in zip(rows, scores):
        for tag in row.get("tags", ["default"]):
            buckets[tag]["total"] += 1
            buckets[tag]["pass"] += int(score.passed)
    # only report buckets with at least 10 rows
    return {k: v["pass"] / v["total"] for k, v in buckets.items() if v["total"] >= 10}

def any_bucket_regression(current: dict, baseline: dict, min_drop: float = 0.05) -> list[str]:
    failed = []
    for tag, base_rate in baseline.items():
        # a tag missing from the current run defaults to 1.0 and is not flagged
        if current.get(tag, 1.0) < base_rate - min_drop:
            failed.append(tag)
    return failed
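Map<String, Double> bucketPassRates(List<EvalRow> rows, List<Score> scores) {
    Map<String, int[]> counts = new HashMap<>();  // tag -> {passed, total}
    // aggregate pass/total per tag (elided); return rate where total >= 10
    Map<String, Double> rates = new HashMap<>();
    counts.forEach((tag, c) -> { if (c[1] >= 10) rates.put(tag, c[0] / (double) c[1]); });
    return rates;
}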
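The bucket rule stated in the text (block when a bucket with n≥10 has more than 2× its baseline failure rate) is relative, whereas `any_bucket_regression` above uses an absolute `min_drop`. A sketch of the relative rule follows; the function name and the separate `counts` argument are assumptions.

```python
def buckets_over_failure_budget(
    current: dict[str, float],
    baseline: dict[str, float],
    counts: dict[str, int],
    min_n: int = 10,
    factor: float = 2.0,
) -> list[str]:
    """Return tags whose failure rate exceeds `factor` times the baseline failure rate.

    current/baseline map tag -> pass rate; counts maps tag -> rows in the current run.
    """
    flagged = []
    for tag, base_rate in baseline.items():
        if tag not in current or counts.get(tag, 0) < min_n:
            continue  # too little data to judge this bucket
        base_fail = 1.0 - base_rate
        cur_fail = 1.0 - current[tag]
        # With a perfect baseline (base_fail == 0) any failure is flagged.
        if cur_fail > factor * base_fail:
            flagged.append(tag)
    return sorted(flagged)
```

A relative rule is stricter on buckets that were nearly perfect and more lenient on buckets that were already weak, so some teams run it alongside the absolute drop check rather than instead of it.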
Using a fixed ε=0.01 on an n=30 smoke suite lets statistical noise trigger false blocks. Scale ε with 1/sqrt(n) or use Wilson intervals.
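Both remedies can be written in a few lines. The sketch below computes the standard Wilson score interval for a pass rate and scales a reference ε in proportion to 1/sqrt(n); the helper names and the choice of a reference suite are assumptions.

```python
import math


def wilson_interval(passed: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passed / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def scaled_epsilon(eps_ref: float, n_ref: int, n: int) -> float:
    """Scale a reference ε by sqrt(n_ref / n), i.e. proportional to 1/sqrt(n)."""
    return eps_ref * math.sqrt(n_ref / n)


low, high = wilson_interval(47, 50)   # 94% observed on a 50-row smoke suite
print(f"smoke interval: [{low:.3f}, {high:.3f}]")
print(f"scaled epsilon for n=50: {scaled_epsilon(0.01, 800, 50):.2f}")
```

Starting from ε=0.01 at n=800, the 1/sqrt(n) scaling gives ε=0.04 at n=50, since sqrt(800/50) = 4.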
"How do you detect LLM regressions in CI?" — Baseline pass rate + ε, t-test on judge scores, bucket by intent/locale, zero tolerance safety slice, human review queue for borderline p-values.
Never waive safety bucket failures for velocity—injection/PII rows must be 100% pass with audit trail if override ever required (exec + security sign-off).
FAQ: Flaky LLM judge?
Run judge with temperature 0; cache judge prompts; require 2-of-3 agreement on borderline rows; exclude known-flaky rows into quarantine tag until fixed.
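The 2-of-3 agreement rule for borderline rows might be sketched as follows. Assumptions: `judge` is a hypothetical callable returning a score in [0, 1], the borderline band of 0.4 to 0.6 and the 0.5 pass threshold are illustrative, and repeat calls bypass any response cache so the votes are independent.

```python
from typing import Callable


def judge_with_agreement(
    judge: Callable[[dict, str], float],
    row: dict,
    output: str,
    low: float = 0.4,
    high: float = 0.6,
    threshold: float = 0.5,
    votes: int = 3,
) -> bool:
    """Accept a clear first verdict; re-judge borderline rows and take the majority."""
    first = judge(row, output)
    if not (low <= first <= high):
        return first >= threshold  # clear-cut score: one call is enough
    scores = [first] + [judge(row, output) for _ in range(votes - 1)]
    passes = sum(s >= threshold for s in scores)
    return passes * 2 > votes  # strict majority, i.e. 2 of 3 by default
```

Only borderline rows pay for the extra judge calls, which keeps the added cost proportional to how noisy the suite actually is.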
Dataset management: versioning, append-only, provenance
Golden datasets need append-only rows, explicit version pins, manifest checksums, and provenance from incidents or traces—not editable spreadsheets.
Golden datasets rot without governance. Adopt append-only JSONL (never edit rows in place), version pins in suite YAML (dataset_version: 2026-04-01), and provenance linking rows to incident IDs or user reports. Retire rows to archive/ instead of delete—reproducibility beats tidiness.
Versioning strategies
| Strategy | Pros | Cons |
|---|---|---|
| Git SHA only | Simple | Hard to reference in dashboards |
| Date semver 2026.04.01 | Human readable | Needs changelog |
| DVC / LFS blob | Large datasets | Extra infra |
| Langfuse/Braintrust export | Trace → row loop | Vendor lock-in |
Row lifecycle
- Intake — incident or eval failure → draft row in PR
- Review — ML + domain owner approve tags and expected
- Active — included in full eval
- Quarantine — flaky judge; excluded from blocking until fixed
- Archive — obsolete policy; kept for historical replay
Synthetic vs production-sourced
Balance: 60% synthetic/hand-crafted, 40% de-identified production failures (from Guide 5 trace exemplars). Synthetic covers edge cases; production rows catch real phrasing and retrieval gaps.
import json
import sys

def lint_append_only(old_path: str, new_path: str) -> None:
    with open(old_path) as f:
        old_ids = {json.loads(line)["id"] for line in f}
    with open(new_path) as f:
        for line in f:
            row = json.loads(line)
            rid = row["id"]
            # assumes edited rows carry a _modified flag
            if rid in old_ids and row.get("_modified"):
                print(f"ERROR: in-place edit forbidden for id={rid}")
                sys.exit(1)
    print("OK: append-only respected")
void lintAppendOnly(Path oldPath, Path newPath) throws IOException {
    Set<String> oldIds = readIds(oldPath);
    for (String line : Files.readAllLines(newPath)) {
        EvalRow row = EvalRow.parse(line);
        if (oldIds.contains(row.id()) && row.modified()) {
            throw new IllegalStateException("in-place edit: " + row.id());
        }
    }
}
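The lint above depends on a `_modified` flag being present on edited rows. A stricter variant, sketched here as an assumption rather than the guide's method, compares the parsed content of every existing id between the base and PR versions of the file. `append_only_violations` is a hypothetical name.

```python
import json


def _index_by_id(lines: list[str]) -> dict[str, dict]:
    """Parse non-empty JSONL lines into an id -> row mapping."""
    rows = {}
    for line in lines:
        if line.strip():
            row = json.loads(line)
            rows[row["id"]] = row
    return rows


def append_only_violations(old_lines: list[str], new_lines: list[str]) -> list[str]:
    """List removed or edited rows; new ids are always allowed."""
    old = _index_by_id(old_lines)
    new = _index_by_id(new_lines)
    problems = [f"removed: {rid}" for rid in old if rid not in new]
    problems += [
        f"edited in place: {rid}"
        for rid, row in old.items()
        if rid in new and new[rid] != row
    ]
    return problems
```

Comparing parsed dicts rather than raw text means key reordering or whitespace changes do not count as edits; rows moved to archive/ would need an explicit allowlist.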
def trace_to_golden(trace: dict) -> dict:
    return {
        "id": f"inc_{trace['incident_id']}",
        "input": trace["user_query_redacted"],
        "context": trace.get("chunk_ids", []),
        "expected": {"rubric": "Must cite policy SEC-12 and refuse unauthorized refund"},
        "tags": ["refund", "from_trace", trace.get("locale", "en")],
        "source": {"incident_id": trace["incident_id"], "trace_id": trace["trace_id"]},
    }

EvalRow traceToGolden(Trace trace) {
    return EvalRow.builder()
        .id("inc_" + trace.incidentId())
        .input(trace.userQueryRedacted())
        .tag("from_trace")
        .source(Map.of("trace_id", trace.traceId()))
        .build();
}
# evals/datasets/manifest.yaml
datasets:
  refund_golden:
    version: "2026.04.01"
    path: refund_golden.jsonl
    sha256: a1b2c3...
    row_count: 842

# manifest.yaml consumed by Java runner for integrity check
datasets:
  refund_golden:
    version: "2026.04.01"
    sha256: a1b2c3
Require source.incident_id or source.author on every new row—anonymous golden rows become unowned debt.
Support org added 200 rows from peak season without tags—spring eval pass rate looked fine while “tax_form” bucket silently failed until April traffic hit.
JSONL line hashes in manifest detect accidental git reformatting—CI fails if sha256 mismatch even when row content ‘looks’ identical.
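A file-level version of this integrity check is sketched below. It assumes the expected `sha256` and `row_count` have already been read from the manifest (parsing the YAML is left out to avoid extra dependencies); `verify_dataset` is a hypothetical helper, and a per-line hash variant would follow the same pattern.

```python
import hashlib


def file_sha256(path: str) -> str:
    """Hash the file bytes exactly as stored, so any reformatting changes the digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def count_rows(path: str) -> int:
    """Count non-empty lines, i.e. JSONL rows."""
    with open(path, "rb") as f:
        return sum(1 for line in f if line.strip())


def verify_dataset(path: str, expected_sha256: str, expected_rows: int) -> list[str]:
    """Return mismatches between a dataset file and its manifest entry."""
    problems = []
    actual_sha = file_sha256(path)
    if actual_sha != expected_sha256:
        problems.append(f"sha256 mismatch: {actual_sha}")
    actual_rows = count_rows(path)
    if actual_rows != expected_rows:
        problems.append(f"row_count mismatch: {actual_rows} != {expected_rows}")
    return problems
```

Hashing raw bytes is what makes a line-ending or whitespace rewrite visible even when the parsed rows are unchanged.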
FAQ: Edit wrong expected answer?
Add a superseding row whose id is the old id plus a -v2 suffix, and deprecate the old id under retired_ids in the manifest.
Prompt evaluation gates: registry, A/B, and shadow
Every prompt PR should pass registry lint, smoke eval against baseline, token budget checks, and—before prod—shadow evaluation on redacted prod queries or synthetic equivalents.
Prompt changes are the highest-frequency cause of LLM regressions. Prompt evaluation gates run on every prompt registry PR: diff highlights system message changes, smoke eval against pinned model, optional A/B against production prompt, and shadow eval on sampled prod queries in staging before flip.
Prompt registry pattern
Store templates as files; registry maps refund_v3.2.1 → path + checksum. CI resolves prompt from PR SHA, not floating “main”.
| Gate | Trigger | Blocks? |
|---|---|---|
| Template lint (Jinja vars) | PR | Yes |
| Smoke eval vs baseline prompt | PR | Yes |
| Token count budget | PR | Yes if >+15% |
| Shadow replay 1k prod queries | Pre-prod deploy | Warn |
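The token budget gate in the table (block when a prompt grows by more than 15%) reduces to a small comparison. In this sketch `rough_token_count` is a crude whitespace proxy used only for illustration; a real gate would count with the target model's tokenizer. Both function names are hypothetical.

```python
def rough_token_count(text: str) -> int:
    """Whitespace-split proxy for token count; not a real tokenizer."""
    return len(text.split())


def token_budget_ok(base_tokens: int, new_tokens: int, max_increase: float = 0.15) -> bool:
    """True when the prompt grew by at most max_increase relative to the baseline."""
    if base_tokens <= 0:
        raise ValueError("base_tokens must be positive")
    return (new_tokens - base_tokens) / base_tokens <= max_increase


base = rough_token_count("You are a refund assistant. Cite the policy ID.")
new = rough_token_count("You are a refund assistant. Cite the policy ID. Be brief.")
print(base, new, token_budget_ok(base, new))
```

Shrinking prompts always pass this check; the gate only guards against growth.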
A/B and shadow evaluation
A/B: same golden rows, two prompt versions, compare pass rate and cost. Shadow: staging receives copy of prod traffic (redacted), runs candidate prompt, no user impact. Promote when candidate wins on quality with ≤5% cost increase.
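The promotion rule above (candidate wins on quality with at most a 5% cost increase) can be expressed as a small predicate. This sketch assumes a result dict with `pass_a`, `pass_b`, `cost_a`, and `cost_b` keys, where A is the production prompt and B the candidate; `should_promote` is a hypothetical name.

```python
def should_promote(ab: dict, max_cost_increase: float = 0.05) -> bool:
    """Promote B only if it beats A on pass rate and stays within the cost allowance."""
    wins_quality = ab["pass_b"] > ab["pass_a"]
    cost_ok = ab["cost_b"] <= ab["cost_a"] * (1 + max_cost_increase)
    return wins_quality and cost_ok


print(should_promote({"pass_a": 0.90, "pass_b": 0.93, "cost_a": 10.0, "cost_b": 10.4}))
```

A tie on quality does not promote here; whether ties should favor the cheaper prompt is a policy choice left open.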
Human review queue
Rows where judge score ∈ [0.4, 0.6] or diff >200 chars go to human review before merge on high-risk prompts (refunds, legal, medical disclaimers).
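The routing rule above could be sketched like this. It assumes "diff" means the character-level difference between the baseline and candidate outputs for a row, measured with `difflib`; that reading, and the helper names, are assumptions.

```python
import difflib


def changed_chars(old: str, new: str) -> int:
    """Approximate diff size: characters covered by non-equal opcodes."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return sum(
        max(i2 - i1, j2 - j1)
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    )


def needs_human_review(
    judge_score: float,
    baseline_output: str,
    candidate_output: str,
    high_risk: bool,
    band: tuple[float, float] = (0.4, 0.6),
    max_diff_chars: int = 200,
) -> bool:
    """Queue high-risk rows with a borderline judge score or a large output diff."""
    if not high_risk:
        return False
    borderline = band[0] <= judge_score <= band[1]
    return borderline or changed_chars(baseline_output, candidate_output) > max_diff_chars
```

Rows that fall outside both conditions skip the queue, which keeps reviewer load tied to genuinely ambiguous cases.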
prompts:
  refund_system:
    version: 3.2.1
    path: prompts/refund_system.jinja2
    sha256: deadbeef...
    max_tokens_budget: 1200
  refund_user:
    version: 1.0.0
    path: prompts/refund_user.jinja2

prompts:
  refund_system:
    version: 3.2.1
    path: prompts/refund_system.jinja2
from statistics import mean

def ab_eval(rows: list, prompt_a: str, prompt_b: str, model: str) -> dict:
    scores_a = [run_row(row, prompt_a, model) for row in rows]
    scores_b = [run_row(row, prompt_b, model) for row in rows]
    return {
        "pass_a": mean(s.passed for s in scores_a),
        "pass_b": mean(s.passed for s in scores_b),
        "cost_a": sum(s.cost for s in scores_a),
        "cost_b": sum(s.cost for s in scores_b),
        # winner is decided on mean score value, not on pass rate
        "winner": "B" if mean(s.value for s in scores_b) > mean(s.value for s in scores_a) else "A",
    }

AbResult abEval(List<EvalRow> rows, String promptA, String promptB, String model) {
    var scoresA = rows.stream().map(r -> runRow(r, promptA, model)).toList();
    var scoresB = rows.stream().map(r -> runRow(r, promptB, model)).toList();
    double passA = scoresA.stream().filter(Score::passed).count() / (double) rows.size();
    double passB = scoresB.stream().filter(Score::passed).count() / (double) rows.size();
    return new AbResult(passA, passB, passA >= passB ? "A" : "B");
}
import asyncio

async def shadow_eval_loop():
    async for event in prod_query_stream(sample_rate=0.05):
        if event.env != "staging_shadow":
            continue
        candidate = render_prompt(registry.get("refund_system", version=CANDIDATE_VER), event)
        prod = render_prompt(registry.get("refund_system", version=PROD_VER), event)
        # judge() is assumed to run the model on the rendered prompt and score the response
        score_c, score_p = await asyncio.gather(judge(candidate), judge(prod))
        metrics.gauge("shadow.delta", score_c - score_p)

void shadowEvalLoop() {
    prodQueryStream(0.05).forEach(event -> {
        String candidate = renderPrompt(CANDIDATE_VER, event);
        String prod = renderPrompt(PROD_VER, event);
        double delta = judge(candidate) - judge(prod);
        metrics.gauge("shadow.delta", delta);
    });
}
Shadow eval catches real phrasing but needs strict PII scrub—prefer synthetic shadow if legal blocks prod query copy.
A/B doubles eval API spend—run on stratified 100-row subset, not full 800-row golden, unless release-tier promotion.
"Prompt change process?" — Registry semver, PR smoke vs baseline, token budget gate, shadow on staging, rollback via prompt_version pin.
FAQ: Few-shot examples in prompt?
Version examples as separate YAML; CI fails if example policy IDs drift from current policy database snapshot.
CI tools: DeepEval, PromptFoo, Braintrust, LangSmith
Compare DeepEval, PromptFoo, Braintrust, and LangSmith for CI integration, self-host options, PR comments, and regression UX.
CI eval tooling spans dedicated frameworks (DeepEval, PromptFoo) and platform suites (Braintrust, LangSmith). Pick based on self-host needs, language support, PR comment UX, and whether you already pay for trace storage on the same vendor.
CI tools comparison
| Tool | CI focus | Self-host | Strength | Gap |
|---|---|---|---|---|
| DeepEval | Pytest-native metrics | OSS core | Fast unit-style LLM tests in CI | Less PR UX polish |
| PromptFoo | CLI matrix + red team | OSS | Prompt A/B grids, YAML configs, CI comments | Java secondary |
| Braintrust | Cloud eval + diff | Cloud | Human review, regression diffs, datasets | Cost at scale |
| LangSmith | LangChain eval runs | Cloud | Trace→dataset, eval on runs | Best inside LC ecosystem |
DeepEval
Wraps metrics (faithfulness, toxicity, G-Eval) as pytest cases—familiar to Python teams. Good for scorer unit tests and smoke in GitHub Actions. Pair with custom JSONL loader for full golden suites.
PromptFoo
Declarative YAML defines prompt matrix, providers, assertions (contains, javascript, llm-rubric). Built-in red-team plugins for injection probes. Excellent PR comment output via GitHub Action.
Braintrust CI
Push eval results to cloud; UI diffs failures across commits. Strong when PMs and ML share review duty. CI action fails build on regression policy defined in project settings.
LangSmith CI
Run evaluate() in CI against LangSmith datasets; ties to traces from staging. Lowest friction if already on LangChain for app and observability (Guide 5).
Selection matrix
| If you… | Prefer |
|---|---|
| Want pytest in CI tomorrow | DeepEval |
| Need prompt A/B grids + red team | PromptFoo |
| Need human review on failures | Braintrust |
| All-in on LangChain | LangSmith |
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric

@pytest.mark.parametrize("row", load_smoke_rows())
def test_refund_faithfulness(row):
    output = run_pipeline(row["input"])
    test_case = LLMTestCase(input=row["input"], actual_output=output, retrieval_context=row["context"])
    assert_test(test_case, [FaithfulnessMetric(threshold=0.85)])

@ParameterizedTest
@MethodSource("smokeRows")
void testRefundFaithfulness(EvalRow row) {
    String output = pipeline.run(row.input());
    // JUnit assertion: the bare `assert` keyword does nothing unless the JVM runs with -ea
    assertTrue(faithfulnessMetric.score(output, row.context()) >= 0.85);
}
# promptfoo.yaml
prompts: [file://prompts/refund_system.jinja2]
providers: [openai:gpt-4o]
tests: file://evals/promptfoo/refund_cases.yaml
defaultTest:
  assert:
    - type: llm-rubric
      value: Must cite official policy ID
    - type: not-contains
      value: SSN
# promptfoo.yaml — same format for JavaScript/Java via CLI wrapper
import os

import braintrust

def ci_eval():
    experiment = braintrust.init("refund-copilot", experiment=os.environ["GITHUB_SHA"])
    for row in load_golden():
        out = run(row)
        experiment.log(input=row["input"], output=out, scores={"faithfulness": score(out, row)})
    summary = experiment.summarize()
    if summary.regressions:
        raise SystemExit(1)

void ciEval() {
    Experiment exp = Braintrust.init("refund-copilot", System.getenv("GITHUB_SHA"));
    for (EvalRow row : loadGolden()) {
        String out = run(row);
        exp.log(row.input(), out, Map.of("faithfulness", score(out, row)));
    }
    if (exp.summarize().hasRegressions()) System.exit(1);
}
Python app team: DeepEval in unit CI + PromptFoo weekly red-team; LangSmith only for trace→dataset loop—not duplicate scorers in both.
PromptFoo YAML assertions are fast but brittle on paraphrase—pair contains with llm-rubric for high-stakes rows.
Braintrust/LangSmith CI billed per eval row—cache model responses in CI when prompt unchanged (key = hash(input+prompt_version)).
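The cache key described above could be built as follows. This is a sketch: including the model name in the key goes beyond the text's `hash(input+prompt_version)` and is an assumption, as are the `cache_key` and `ResponseCache` names and the in-memory store.

```python
import hashlib
import json
from typing import Callable


def cache_key(row_input: str, prompt_version: str, model: str) -> str:
    """Deterministic key: same input, prompt version, and model give the same hash."""
    payload = json.dumps(
        {"input": row_input, "prompt_version": prompt_version, "model": model},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """In-memory cache; a CI setup would more likely persist this between runs."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    def get_or_call(self, key: str, call: Callable[[], str]) -> str:
        """Return the cached response, invoking the model only on a miss."""
        if key not in self._store:
            self._store[key] = call()
        return self._store[key]
```

Serializing the key fields as JSON avoids collisions that plain string concatenation can produce when one field's suffix matches another's prefix.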
FAQ: Multiple tools?
One blocking runner in merge path; others advisory (red team weekly). Duplicate scorers diverge.
Track 5 capstone checklist
Complete the Track 5 capstone checklist—eval-as-code CI, observability handoff, safety gates, and incident→golden loop—before starting Track 6 Shipping.
Track 5 capstone: offline eval gates + online observability + safety rails form a closed loop. This checklist confirms CI eval, dataset governance, prompt gates, and observability handoffs before Track 6 — Shipping.
Track 5 capstone checklist
| Capability | Guide | Done when |
|---|---|---|
| Eval foundations + judges | 1–2 | Scorers documented |
| RAG eval gold | 3 | RAGAS nightly green |
| Safety adversarial CI | 4 | Zero waiver PII/injection |
| Production traces + alerts | 5 | TTFT/cost dashboards live |
| Eval-as-code in CI | 6 | Smoke blocks PR |
| Full eval blocks release | 6 | Nightly artifact retained |
| Incident → golden loop | 5+6 | Weekly promotion PRs |
Release gate summary
| Gate | Threshold | Owner |
|---|---|---|
| PR smoke pass rate | ≥ baseline − ε | ML eng |
| Safety suite | 100% | Security |
| Full eval faithfulness mean | ≥ baseline − 0.02 | ML eng |
| Observability schema tests | Pass | Platform |
| Shadow eval 24h | No quality red flag | Product |
GATES = {
    "smoke_ci": github.check_required("llm-eval-smoke"),
    "full_eval_last_green": artifact.pass_rate("full-eval") >= 0.93,
    "observability_checklist": open("docs/obs_checklist.done").read().strip() == "yes",
    "safety_zero_fail": artifact.pass_rate("safety-adversarial") == 1.0,
}

def track5_ready() -> bool:
    return all(GATES.values())

Map<String, Supplier<Boolean>> gates = Map.of(
    "smoke_ci", () -> github.checkRequired("llm-eval-smoke"),
    "safety_zero_fail", () -> artifacts.passRate("safety-adversarial") == 1.0
);

boolean track5Ready() { return gates.values().stream().allMatch(Supplier::get); }
def weekly_golden_promotion():
    traces = fetch_failed_traces(min_faithfulness=0.5, days=7)
    rows = [trace_to_golden(t) for t in traces[:20]]  # cap at 20 rows per week
    if not rows:
        return
    open_pr(f"evals: promote {len(rows)} incident rows", write_jsonl(rows))
    # After merge, nightly full eval automatically includes new ids

void weeklyGoldenPromotion() {
    var traces = fetchFailedTraces(0.5, 7);
    var rows = traces.stream().limit(20).map(Evals::traceToGolden).toList();
    if (!rows.isEmpty()) openPr("evals: promote incident rows", writeJsonl(rows));
}
You built eval foundations, RAG and safety gates, observability, and CI regression. Track 6 — Shipping covers launch checklists, feature flags, rollback, and operating LLM products in production.
Capstone gate: safety adversarial suite at 100% with no override in CI config—waivers only via signed release exception file in repo.
Teams shipping Track 6 without capstone green typically revert within 72h—missing CI eval means prompt hotfixes ship without smoke.
"End-to-end LLM quality?" — Golden eval-as-code, CI regression + buckets, safety zero-tolerance, prod traces with sampled judges, incident→dataset loop.
Budget full nightly eval separately from prod inference—typical $20–80/night for 1k-row golden at GPT-4o; cache when prompts unchanged.
Regression baselines stored as CI artifacts from last green main—PR compares to downloaded baseline, not live re-run of main (saves API cost).
Blocking every PR on full eval adds 45 min latency—smoke on PR + full on main is the industry compromise.
Do not treat Track 5 as complete after a single green CI run: datasets drift, models update, and indexes rot. The capstone is a living program with a weekly review.
Track 5 → Track 6 handoff
- Feature flag prompt_version rollout doc
- Rollback runbook tied to eval baseline SHA
- On-call rotation includes ML eval owner
- Launch dashboard: TTFT, cost, faithfulness, safety blocks
FAQ: Skip full eval for hotfix?
Only with exec + ML sign-off and automatic post-merge full eval within 4h—document in incident ticket.