Continuous evaluation in CI/CD

A green unit test suite does not stop a prompt edit from breaking refunds. Eval-as-code treats golden cases, scorers, and regression thresholds like migrations: versioned in git, executed in CI, blocking merge on slip. This is Guide 6—the Track 5 capstone—covering PR fast eval, nightly full suites, statistical regression analysis, dataset management, prompt change gates, and CI tool selection before Track 6 shipping.

After reading, you should be able to: store eval suites in-repo and run them in CI; design PR smoke vs merge full-eval pipelines with PR comment artifacts; apply regression thresholds, t-tests, and bucketed analysis; version datasets append-only and tie rows to prompt/model versions; enforce prompt-eval gates with A/B and shadow evaluation; compare DeepEval, PromptFoo, Braintrust, and LangSmith CI integrations; and complete the Track 5 capstone checklist.

Eval-as-code: golden sets, scorers, and suite config in git

Eval-as-code stores datasets, scorers, thresholds, and runner config in the repo—versioned, reviewed, and executed in CI like any other migration or test suite.

Eval-as-code means golden cases, scorers, thresholds, and runner config live in git beside application code—reviewed in PRs, executed in CI, versioned with prompts. Ad-hoc spreadsheets and one engineer’s laptop notebook do not block merges. Treat eval suites like database migrations: immutable history, explicit rollback, and ownership in CODEOWNERS.

Repository layout

| Path | Contents | Change frequency |
|------|----------|------------------|
| evals/datasets/*.jsonl | Golden rows (append-only) | Weekly from incidents |
| evals/scorers/ | Python/Java scorer modules | Monthly |
| evals/suites/*.yaml | Suite composition + thresholds | Per feature launch |
| prompts/registry.yaml | Prompt semver → template path | Every prompt PR |
| evals/ci.config.yaml | Smoke vs full matrix | Rare |

Golden row schema

Each row: id, input, optional context, expected or rubric, tags[] (intent, locale, risk), and source (incident_id, user_report, synthetic).
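The schema above can be enforced with a small lint step before rows reach CI. The following is a minimal sketch, not the guide's own validator: it assumes `source` is an object carrying one of the listed keys, and the row id, policy id, and chunk id in the example are hypothetical.

```python
import json
from typing import Any

REQUIRED_KEYS = {"id", "input", "tags", "source"}
# Assumption: source is a dict that names where the row came from.
ALLOWED_SOURCE_KEYS = {"incident_id", "user_report", "synthetic"}

def validate_row(row: dict[str, Any]) -> list[str]:
    """Return a list of schema problems; an empty list means the row is valid."""
    problems = []
    missing = REQUIRED_KEYS - set(row)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    # Every row needs something to score against: a literal answer or a rubric.
    if "expected" not in row and "rubric" not in row:
        problems.append("need 'expected' or 'rubric'")
    tags = row.get("tags")
    if not isinstance(tags, list) or not tags:
        problems.append("tags must be a non-empty list")
    source = row.get("source")
    if not isinstance(source, dict) or not (ALLOWED_SOURCE_KEYS & set(source)):
        problems.append("source must name incident_id, user_report, or synthetic")
    return problems

# One JSONL line with hypothetical values.
example_line = json.dumps({
    "id": "refund-0001",
    "input": "Can I get a refund after 45 days?",
    "context": ["policy_chunk_17"],
    "expected": {"policy_id": "REF-30"},
    "tags": ["refund", "en", "low_risk"],
    "source": {"synthetic": True},
})
print(validate_row(json.loads(example_line)))  # prints []
```

Running a check like this in the lint stage keeps malformed rows from failing later inside a scorer, where the error is harder to trace back to the dataset.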

Scorers as pure functions

Scorers return Score(name, value, passed, metadata)—no network in unit scorers; LLM judges isolated behind interfaces for mocking. Compose with AND for blocking gates, OR for diagnostic-only signals.
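AND/OR composition can be expressed as higher-order functions over the same scorer signature. This is a sketch under assumptions: `all_of`, `any_of`, and the two toy scorers are hypothetical names, and taking the min (AND) or max (OR) of child values is one reasonable convention rather than a fixed rule.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Score:
    name: str
    value: float
    passed: bool
    metadata: dict[str, Any]

ScorerFn = Callable[[dict, str], Score]

def all_of(name: str, *scorers: ScorerFn) -> ScorerFn:
    """AND: passes only when every child passes; value is the weakest child."""
    def combined(row: dict, output: str) -> Score:
        children = [s(row, output) for s in scorers]
        failed = [c.name for c in children if not c.passed]
        return Score(name, min(c.value for c in children), not failed, {"failed": failed})
    return combined

def any_of(name: str, *scorers: ScorerFn) -> ScorerFn:
    """OR: passes when at least one child passes; value is the strongest child."""
    def combined(row: dict, output: str) -> Score:
        children = [s(row, output) for s in scorers]
        passed = [c.name for c in children if c.passed]
        return Score(name, max(c.value for c in children), bool(passed), {"passed": passed})
    return combined

# Two toy scorers (hypothetical, no network access).
def mentions_policy(row: dict, output: str) -> Score:
    ok = row["expected"]["policy_id"] in output
    return Score("mentions_policy", float(ok), ok, {})

def is_concise(row: dict, output: str) -> Score:
    ok = len(output) <= 200
    return Score("is_concise", float(ok), ok, {})

gate = all_of("refund_gate", mentions_policy, is_concise)      # blocking
signal = any_of("refund_signal", mentions_policy, is_concise)  # diagnostic only
row = {"expected": {"policy_id": "REF-30"}}
```

Because the composed scorer records which children failed in `metadata`, a blocking failure still points at the specific check that tripped.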

Golden row + suite YAML
# evals/suites/refund_smoke.yaml
name: refund_smoke
dataset: evals/datasets/refund_golden.jsonl
sample: 50  # PR smoke: first 50 rows
scorers:
  - name: exact_policy_id
    fn: scorers.policy_id_match
    threshold: 1.0
  - name: faithfulness
    fn: scorers.nli_faithfulness
    threshold: 0.85
blocking: true
max_cost_usd: 2.00
# evals/suites/refund_smoke.yaml — same schema consumed by Java runner
name: refund_smoke
dataset: evals/datasets/refund_golden.jsonl
sample: 50
scorers:
  - name: exact_policy_id
    fn: com.acme.evals.PolicyIdMatch
    threshold: 1.0
Scorer interface (Python)
from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class Score:
    name: str
    value: float
    passed: bool
    metadata: dict[str, Any]

class Scorer(Protocol):
    def __call__(self, row: dict, output: str) -> Score: ...

def policy_id_match(row: dict, output: str) -> Score:
    expected = row["expected"]["policy_id"]
    found = extract_policy_id(output)
    value = 1.0 if found == expected else 0.0
    return Score("policy_id", value, value == 1.0, {"expected": expected, "found": found})
public record Score(String name, double value, boolean passed, Map<String, Object> metadata) {}

public interface Scorer {
  Score score(EvalRow row, String output);
}

public class PolicyIdMatch implements Scorer {
  public Score score(EvalRow row, String output) {
    String expected = row.expected().get("policy_id");
    String found = PolicyIdExtractor.extract(output);
    double value = expected.equals(found) ? 1.0 : 0.0;
    return new Score("policy_id", value, value == 1.0, Map.of("expected", expected));
  }
}
Eval runner entrypoint
import os

import yaml
from runner import run_suite

def main():
    # Load the suite definition; PROMPT_VERSION must be set by CI
    with open("evals/suites/refund_smoke.yaml") as f:
        cfg = yaml.safe_load(f)
    report = run_suite(cfg, prompt_version=os.environ["PROMPT_VERSION"])
    report.to_json("eval-report.json")
    if report.blocking_failed:
        raise SystemExit(1)

if __name__ == "__main__":
    main()
public static void main(String[] args) {
  SuiteConfig cfg = SuiteConfig.load("evals/suites/refund_smoke.yaml");
  EvalReport report = Runner.run(cfg, System.getenv("PROMPT_VERSION"));
  report.writeJson("eval-report.json");
  if (report.blockingFailed()) System.exit(1);
}
💡 Tip

Pin PROMPT_VERSION and MODEL env vars in CI from the PR diff—never “latest” in merge gates.

⚠️ Pitfall

Checking golden files into git without row IDs—merge conflicts become unmergeable JSONL blobs. Use stable id per line and append-only policy.

🔬 Under the Hood

LLM eval CI is embarrassingly parallel—shard JSONL by hash(row.id) across workers; aggregate with weighted pass rates, not naive average of shard rates.
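A sketch of both halves of that note, with hypothetical function names. It uses SHA-256 instead of Python's built-in `hash()`, which is salted per process for strings and would assign rows to different shards on different workers. The example numbers show why averaging shard rates misleads when shards differ in size.

```python
import hashlib

def shard_of(row_id: str, shard_count: int) -> int:
    """Stable shard assignment that is identical on every worker."""
    digest = hashlib.sha256(row_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

def weighted_pass_rate(shard_reports: list[dict]) -> float:
    """Pool passed/total counts across shards before dividing."""
    total = sum(r["total"] for r in shard_reports)
    passed = sum(r["passed"] for r in shard_reports)
    return passed / total if total else 0.0

def naive_pass_rate(shard_reports: list[dict]) -> float:
    """Average of per-shard rates; biased when shard sizes differ."""
    rates = [r["passed"] / r["total"] for r in shard_reports]
    return sum(rates) / len(rates)

reports = [{"passed": 10, "total": 10}, {"passed": 45, "total": 90}]
print(weighted_pass_rate(reports))  # prints 0.55
print(naive_pass_rate(reports))     # prints 0.75
```

The small shard with a perfect score pulls the naive average up by 20 points, which is enough to hide a failing gate.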

FAQ: How big should smoke be?

50–100 rows covering top intents and one adversarial tag each—finish in <3 min and <$3 LLM spend per PR.

CI pipeline: PR smoke, merge full eval, PR comments

Design CI in three layers: PR fast eval (smoke subset, minutes, blocks merge), merge full eval (complete golden + adversarial, blocks release), and PR comment artifacts with pass-rate delta and failure IDs.

CI pipelines for LLM apps usually split into PR fast eval (smoke subset, cheap model, parallel shards), merge full eval (nightly or post-merge, full golden + adversarial), and PR comment artifacts (pass rate delta, cost, slowest rows, trace links). Block merges on smoke; block releases on full.

PR fast eval

  • Trigger: pull_request on paths prompts/**, evals/**, app code affecting LLM path
  • Runtime budget: <5 minutes wall, <$5 API spend
  • Dataset: smoke sample stratified by tag
  • Model: same as prod or cheaper surrogate with calibration doc

Merge / nightly full eval

  • Trigger: push to main, cron 02:00 UTC
  • Full golden + adversarial + safety suites from Guide 4
  • Publish report artifact retained 90 days
  • Block release tag if regression vs baseline > ε

PR comments

Post a structured summary: overall pass rate, Δ vs base branch, top 3 failing row IDs with diff snippets, estimated eval cost, and a link to the HTML report. Engineers can then fix failures without digging through CI logs.

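| Stage | When | Blocks merge? | Typical duration |
|-------|------|---------------|------------------|
| Lint / schema | PR | Yes | 30s |
| Smoke eval | PR | Yes | 2–4 min |
| Unit scorers | PR | Yes | 20s |
| Full eval | Post-merge | Release | 20–60 min |
| Shadow prod sample | Daily | No (alert) | Continuous |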
GitHub Actions PR smoke job
name: llm-eval-smoke
on:
  pull_request:
    paths: ['prompts/**', 'evals/**', 'src/**']
jobs:
  smoke:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r evals/requirements.txt
      - env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          PROMPT_VERSION: ${{ github.sha }}
        run: python -m evals.runner --suite refund_smoke --shard 4
      - uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: eval-report.json
// GitHub Actions YAML identical; Java step:
// run: ./gradlew :evals:runSmoke -PpromptVersion=${{ github.sha }}
PR comment from eval-report.json
import json, subprocess

def post_pr_comment(report_path: str):
    report = json.load(open(report_path))
    body = f"""## LLM eval smoke
| Metric | Value |
|--------|-------|
| Pass rate | {report['pass_rate']:.1%} |
| Δ vs base | {report['delta_vs_base']:+.1%} |
| Cost | ${report['cost_usd']:.2f} |
| Failures | {', '.join(report['failed_ids'][:5])} |
"""
    subprocess.run(["gh", "pr", "comment", "--body", body], check=True)
void postPrComment(Path reportPath) throws IOException {
  EvalReport report = EvalReport.readJson(reportPath);
  String body = String.format("Pass rate: %.1f%%", report.passRate() * 100);
  new ProcessBuilder("gh", "pr", "comment", "--body", body).inheritIO().start();
}
Parallel sharding in runner
def shard_rows(rows: list, shard_index: int, shard_count: int) -> list:
    return [r for i, r in enumerate(rows) if i % shard_count == shard_index]

def merge_reports(reports: list[dict]) -> dict:
    total = sum(r["total"] for r in reports)
    passed = sum(r["passed"] for r in reports)
    return {"pass_rate": passed / total, "cost_usd": sum(r["cost_usd"] for r in reports)}
List<EvalRow> shardRows(List<EvalRow> rows, int shardIndex, int shardCount) {
  List<EvalRow> out = new ArrayList<>();
  for (int i = 0; i < rows.size(); i++) if (i % shardCount == shardIndex) out.add(rows.get(i));
  return out;
}
📦 Real World

Teams gate merge on smoke but require on-call ack for full-eval failure within 24h—balances velocity vs safety when nightly budget is 45 minutes.

💰 Cost

Cap PR smoke with max_cost_usd in suite YAML—runner aborts early instead of burning $200 on a bad infinite loop prompt.
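One way a runner might honor `max_cost_usd` is to track spend row by row and stop as soon as the cap is crossed. This is a sketch only: `run_with_budget`, `BudgetExceeded`, and the convention that `run_row` returns the cost of one row in USD are assumptions for illustration, not the guide's runner API.

```python
from typing import Callable, Iterable

class BudgetExceeded(RuntimeError):
    """Raised when cumulative eval spend crosses the suite's cap."""

def run_with_budget(
    rows: Iterable[dict],
    run_row: Callable[[dict], float],  # assumed to return the row's cost in USD
    max_cost_usd: float,
) -> float:
    """Run rows in order and abort as soon as spend exceeds the cap."""
    spent = 0.0
    for done, row in enumerate(rows, start=1):
        spent += run_row(row)
        if spent > max_cost_usd:
            raise BudgetExceeded(
                f"spent ${spent:.2f} after {done} rows; cap is ${max_cost_usd:.2f}"
            )
    return spent
```

Checking after every row bounds the overshoot to the cost of a single row, which matters when one runaway generation can cost more than the rest of the suite.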

⚖️ Trade-off

Cheap surrogate model in PR saves money but misses GPT-4-only failures—recalibrate monthly with 20-row bridge set run on both models.

FAQ: Re-run full eval on PR manually?

Expose workflow_dispatch with label full-eval for risky prompt rewrites—still non-blocking for merge if smoke passed.

Regression analysis: ε thresholds, t-tests, bucketed slices

Noisy evals need regression thresholds (ε), Welch t-tests on continuous judge scores, and bucketed analysis by tag—aggregate pass rate alone hides localized disasters.

LLM evals are noisy—a 92% pass rate Monday and 89% Wednesday may be normal variance or a real regression. Regression analysis combines pass-rate thresholds (ε), t-tests on continuous scores, and bucketed analysis by tag so you do not miss a 40pt drop in “refund_denial” while aggregate stays green.

Threshold ε (pass rate)

Define baseline pass rate p₀ from the last green main build. Block if p < p₀ - ε. Typical ε: 4pp for smoke (n≈50), 1pp for full (n>500). Zero tolerance for safety tags (PII, injection) from Guide 4.

| Suite | n | ε | Notes |
|-------|---|---|-------|
| Smoke | 50 | 0.04 | Wide band; high variance |
| Full golden | 800 | 0.01 | Tight band |
| Safety adversarial | 200 | 0.00 | Any fail blocks |
| RAG faithfulness | 300 | 0.02 mean score | Continuous metric |

T-test on continuous scores

For judge scores and NLI faithfulness, compare distributions with Welch’s t-test between base and PR. Require p < 0.01 and mean drop > 0.03 to block—avoid blocking on statistically significant but tiny 0.005 mean shifts.

Bucketed analysis

Always slice by tags: intent, locale, risk, prompt_variant. An aggregate pass rate can hide a catastrophic failure in a low-volume bucket such as “escalation”. CI fails if the pass rate of any bucket with n≥10 drops by more than 2× that bucket's baseline failure rate.
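One reading of the bucket rule is: flag a bucket when its pass rate has dropped by more than twice its baseline failure rate. The sketch below implements that reading; the function name and the input shapes (current counts as `(passed, total)` tuples, baseline as pass rates) are assumptions.

```python
def regressed_buckets(
    current: dict[str, tuple[int, int]],   # tag -> (passed, total) on this run
    baseline_pass_rate: dict[str, float],  # tag -> pass rate on last green main
    min_n: int = 10,
    factor: float = 2.0,
) -> list[str]:
    """Return tags whose pass-rate drop exceeds factor x baseline failure rate."""
    flagged = []
    for tag, (passed, total) in current.items():
        if total < min_n or tag not in baseline_pass_rate:
            continue  # too few rows to judge, or no baseline yet
        base = baseline_pass_rate[tag]
        drop = base - passed / total
        # With a perfect baseline the allowed drop is zero, so any failure flags.
        if drop > factor * (1.0 - base):
            flagged.append(tag)
    return flagged

current = {"escalation": (6, 10), "refund": (95, 100), "tiny": (0, 3)}
baseline = {"escalation": 0.90, "refund": 0.96, "tiny": 1.0}
print(regressed_buckets(current, baseline))  # prints ['escalation']
```

Here the aggregate moves only a few points, while the "escalation" bucket falls from 90% to 60% and is the only one flagged; "tiny" is skipped because it has fewer than 10 rows.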

Pass-rate gate with ε
def pass_rate_gate(current: float, baseline: float, epsilon: float, n: int) -> bool:
    if current < baseline - epsilon:
        return False  # block
    # optional: Wilson interval lower bound > baseline - epsilon
    return True

# Example: baseline 0.94, current 0.905, epsilon 0.02 → block
boolean passRateGate(double current, double baseline, double epsilon) {
  return current >= baseline - epsilon;
}
Welch t-test on faithfulness scores
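from statistics import mean

from scipy import stats

def regression_ttest(base_scores: list[float], pr_scores: list[float]) -> dict:
    # equal_var=False selects Welch's t-test (no equal-variance assumption)
    t, p = stats.ttest_ind(base_scores, pr_scores, equal_var=False)
    # scipy.stats has no mean(); use the standard library instead
    delta = mean(pr_scores) - mean(base_scores)
    # Block only when the drop is both significant and large enough to matter
    block = bool(p < 0.01 and delta < -0.03)
    return {"t": t, "p": p, "delta_mean": delta, "block": block}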
// Apache Commons Math TTest
double[] base = ...;
double[] pr = ...;
TTest tTest = new TTest();
double p = tTest.tTest(base, pr);
double delta = StatUtils.mean(pr) - StatUtils.mean(base);
boolean block = p < 0.01 && delta < -0.03;
Bucketed regression report
from collections import defaultdict

def bucket_pass_rates(rows: list, scores: list) -> dict:
    buckets = defaultdict(lambda: {"pass": 0, "total": 0})
    for row, score in zip(rows, scores):
        for tag in row.get("tags", ["default"]):
            buckets[tag]["total"] += 1
            buckets[tag]["pass"] += int(score.passed)
    return {k: v["pass"] / v["total"] for k, v in buckets.items() if v["total"] >= 10}

def any_bucket_regression(current: dict, baseline: dict, min_drop: float = 0.05) -> list[str]:
    failed = []
    for tag, base_rate in baseline.items():
        if current.get(tag, 1.0) < base_rate - min_drop:
            failed.append(tag)
    return failed
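Map<String, Double> bucketPassRates(List<EvalRow> rows, List<Score> scores) {
  Map<String, int[]> counts = new HashMap<>();  // tag -> {passed, total}; assumes EvalRow exposes tags()
  for (int i = 0; i < rows.size(); i++) for (String tag : rows.get(i).tags()) {
    int[] c = counts.computeIfAbsent(tag, k -> new int[2]);
    c[0] += scores.get(i).passed() ? 1 : 0; c[1]++;
  }
  Map<String, Double> rates = new HashMap<>();
  counts.forEach((tag, c) -> { if (c[1] >= 10) rates.put(tag, c[0] / (double) c[1]); });
  return rates;
}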
⚠️ Pitfall

Using fixed ε=0.01 on n=30 smoke—statistical noise triggers false blocks. Scale ε with 1/sqrt(n) or use Wilson intervals.
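Both remedies fit in a few lines. The sketch below assumes a reference point of ε=0.01 at n=800 for the scaling rule and uses the standard Wilson score interval with z=1.96 (roughly 95% coverage); the function names are hypothetical.

```python
import math

def scaled_epsilon(n: int, eps_ref: float = 0.01, n_ref: int = 800) -> float:
    """Scale the tolerance with 1/sqrt(n), anchored so n_ref rows get eps_ref."""
    return eps_ref * math.sqrt(n_ref / n)

def wilson_interval(passed: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate of passed/n."""
    if n == 0:
        return (0.0, 1.0)
    p = passed / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (center - margin, center + margin)

print(round(scaled_epsilon(50), 4))   # prints 0.04
print(round(scaled_epsilon(800), 4))  # prints 0.01

low, high = wilson_interval(47, 50)   # 94% observed on a 50-row smoke set
print(f"{low:.3f} to {high:.3f}")
```

With 47 of 50 rows passing, the interval spans roughly 0.84 to 0.98, so a one- or two-point move on a smoke set is well inside the noise. One conservative policy is to block only when the upper bound of the PR's interval falls below the baseline.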

🎯 Interview Tip

"How do you detect LLM regressions in CI?" — Baseline pass rate + ε, t-test on judge scores, bucket by intent/locale, zero tolerance safety slice, human review queue for borderline p-values.

🔒 Security

Never waive safety bucket failures for velocity—injection/PII rows must be 100% pass with audit trail if override ever required (exec + security sign-off).

FAQ: Flaky LLM judge?

Run the judge at temperature 0; cache judge prompts; require 2-of-3 agreement on borderline rows; move known-flaky rows to a quarantine tag until they are fixed.
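The 2-of-3 rule can be sketched as follows. The borderline band of 0.4 to 0.6, the pass threshold of 0.5, and the `judge` callable that returns a score for an output are assumptions chosen for illustration.

```python
from typing import Callable

def majority_verdict(
    judge: Callable[[str], float],  # hypothetical: returns a score in [0, 1]
    output: str,
    threshold: float = 0.5,
    band: tuple[float, float] = (0.4, 0.6),
    votes: int = 3,
) -> bool:
    """Trust a clear first score; re-judge borderline rows and take the majority."""
    first = judge(output)
    if not (band[0] <= first <= band[1]):
        return first >= threshold  # clear pass or clear fail: one call is enough
    scores = [first] + [judge(output) for _ in range(votes - 1)]
    passes = sum(s >= threshold for s in scores)
    return passes * 2 > votes
```

Only borderline rows pay for the extra judge calls, so the added cost scales with how many rows sit near the threshold rather than with the size of the suite.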

Dataset management: versioning, append-only, provenance

Golden datasets need append-only rows, explicit version pins, manifest checksums, and provenance from incidents or traces—not editable spreadsheets.

Golden datasets rot without governance. Adopt append-only JSONL (never edit rows in place), version pins in suite YAML (dataset_version: 2026.04.01), and provenance linking rows to incident IDs or user reports. Retire rows to archive/ instead of deleting them: reproducibility beats tidiness.

Versioning strategies

| Strategy | Pros | Cons |
|----------|------|------|
| Git SHA only | Simple | Hard to reference in dashboards |
| Date-based version (CalVer) 2026.04.01 | Human readable | Needs changelog |
| DVC / LFS blob | Large datasets | Extra infra |
| Langfuse/Braintrust export | Trace → row loop | Vendor lock-in |

Row lifecycle

  1. Intake — incident or eval failure → draft row in PR
  2. Review — ML + domain owner approve tags and expected
  3. Active — included in full eval
  4. Quarantine — flaky judge; excluded from blocking until fixed
  5. Archive — obsolete policy; kept for historical replay

Synthetic vs production-sourced

Balance: 60% synthetic/hand-crafted, 40% de-identified production failures (from Guide 5 trace exemplars). Synthetic covers edge cases; production rows catch real phrasing and retrieval gaps.

Append-only dataset PR check
import json
import sys

def lint_append_only(old_path: str, new_path: str) -> None:
    old_ids = {json.loads(l)["id"] for l in open(old_path)}
    for line in open(new_path):
        row = json.loads(line)
        rid = row["id"]
        if rid in old_ids and row.get("_modified"):
            print(f"ERROR: in-place edit forbidden for id={rid}")
            sys.exit(1)
    print("OK: append-only respected")
void lintAppendOnly(Path oldPath, Path newPath) throws IOException {
  Set<String> oldIds = readIds(oldPath);
  for (String line : Files.readAllLines(newPath)) {
    EvalRow row = EvalRow.parse(line);
    if (oldIds.contains(row.id()) && row.modified()) throw new IllegalStateException("in-place edit: " + row.id());
  }
}
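The checks above trust a `_modified` flag set by whoever edits the file. A stricter variant, sketched here under the assumption that both versions of the dataset are available as lists of JSONL lines (for example from `git show` on the base and head commits), compares row content directly. The function name is hypothetical.

```python
import json

def append_only_violations(old_lines: list[str], new_lines: list[str]) -> list[str]:
    """Compare parsed rows by id; report rows that were edited or removed."""
    def index(lines: list[str]) -> dict[str, dict]:
        rows = (json.loads(line) for line in lines if line.strip())
        return {row["id"]: row for row in rows}

    old, new = index(old_lines), index(new_lines)
    violations = []
    for rid, old_row in old.items():
        if rid not in new:
            violations.append(f"removed: {rid}")
        elif new[rid] != old_row:
            violations.append(f"edited in place: {rid}")
    return violations

old = ['{"id": "r1", "input": "a"}', '{"id": "r2", "input": "b"}']
appended = old + ['{"id": "r3", "input": "c"}']
print(append_only_violations(old, appended))  # prints []
```

Comparing parsed rows rather than raw text means key order and whitespace changes do not count as edits. A row moved to archive/ would be reported as removed here, so whether that should fail the build is a policy choice.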
Promote trace to golden row
def trace_to_golden(trace: dict) -> dict:
    return {
        "id": f"inc_{trace['incident_id']}",
        "input": trace["user_query_redacted"],
        "context": trace.get("chunk_ids", []),
        "expected": {"rubric": "Must cite policy SEC-12 and refuse unauthorized refund"},
        "tags": ["refund", "from_trace", trace.get("locale", "en")],
        "source": {"incident_id": trace["incident_id"], "trace_id": trace["trace_id"]},
    }
EvalRow traceToGolden(Trace trace) {
  return EvalRow.builder()
      .id("inc_" + trace.incidentId())
      .input(trace.userQueryRedacted())
      .tag("from_trace")
      .source(Map.of("trace_id", trace.traceId()))
      .build();
}
Dataset manifest for suite pin
# evals/datasets/manifest.yaml
datasets:
  refund_golden:
    version: "2026.04.01"
    path: refund_golden.jsonl
    sha256: a1b2c3...
    row_count: 842
# manifest.yaml consumed by Java runner for integrity check
datasets:
  refund_golden:
    version: "2026.04.01"
    sha256: a1b2c3
💡 Tip

Require source.incident_id or source.author on every new row—anonymous golden rows become unowned debt.

📦 Real World

Support org added 200 rows from peak season without tags—spring eval pass rate looked fine while “tax_form” bucket silently failed until April traffic hit.

🔬 Under the Hood

JSONL line hashes in manifest detect accidental git reformatting—CI fails if sha256 mismatch even when row content ‘looks’ identical.
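A minimal sketch of such an integrity check, assuming a file-level `sha256` and `row_count` as in the manifest excerpt; the function names are hypothetical. It works on raw bytes so that any reformatting, even one that leaves the parsed JSON identical, changes the digest.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_dataset(data: bytes, expected_sha256: str, expected_rows: int) -> list[str]:
    """Return a list of mismatches between the dataset bytes and its manifest entry."""
    problems = []
    actual = sha256_hex(data)
    if actual != expected_sha256:
        problems.append(f"sha256 mismatch: manifest {expected_sha256[:12]}, file {actual[:12]}")
    rows = sum(1 for line in data.splitlines() if line.strip())
    if rows != expected_rows:
        problems.append(f"row_count mismatch: manifest {expected_rows}, file {rows}")
    return problems

original = b'{"id": "a"}\n{"id": "b"}\n'
pinned = sha256_hex(original)

# Same JSON content, different whitespace: the digest no longer matches.
reformatted = original.replace(b'": "', b'":"')
print(verify_dataset(original, pinned, 2))          # prints []
print(len(verify_dataset(reformatted, pinned, 2)))  # prints 1
```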

FAQ: Edit wrong expected answer?

Do not edit the row in place. Add a superseding row whose id is the original id with a -v2 suffix, and list the old id under retired_ids in the manifest.

Prompt evaluation gates: registry, A/B, and shadow

Every prompt PR should pass registry lint, smoke eval against baseline, token budget checks, and—before prod—shadow evaluation on redacted prod queries or synthetic equivalents.

Prompt changes are the highest-frequency cause of LLM regressions. Prompt evaluation gates run on every prompt registry PR: diff highlights system message changes, smoke eval against pinned model, optional A/B against production prompt, and shadow eval on sampled prod queries in staging before flip.

Prompt registry pattern

Store templates as files; registry maps refund_v3.2.1 → path + checksum. CI resolves prompt from PR SHA, not floating “main”.

| Gate | Trigger | Blocks? |
|------|---------|---------|
| Template lint (Jinja vars) | PR | Yes |
| Smoke eval vs baseline prompt | PR | Yes |
| Token count budget | PR | Yes if >+15% |
| Shadow replay 1k prod queries | Pre-prod deploy | Warn |
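The token budget gate in the table can be sketched as a growth check against the baseline template. The whitespace-based counter below is a crude stand-in and an assumption; a real gate would pass in the tokenizer for the target model. Function names are hypothetical.

```python
from typing import Callable

def whitespace_tokens(text: str) -> int:
    """Crude stand-in for a model tokenizer: counts whitespace-separated words."""
    return len(text.split())

def token_budget_gate(
    base_template: str,
    new_template: str,
    count_tokens: Callable[[str], int] = whitespace_tokens,
    max_growth: float = 0.15,
) -> tuple[bool, float]:
    """Return (ok, growth); ok is False when the template grew by more than max_growth."""
    base = count_tokens(base_template)
    new = count_tokens(new_template)
    growth = (new - base) / base if base else float("inf")
    return growth <= max_growth, growth

base = "word " * 100
ok, growth = token_budget_gate(base, "word " * 110)
print(ok, round(growth, 2))  # prints True 0.1
```

Returning the measured growth alongside the verdict lets the PR comment show how close a passing change is to the limit.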

A/B and shadow evaluation

A/B: same golden rows, two prompt versions, compare pass rate and cost. Shadow: staging receives copy of prod traffic (redacted), runs candidate prompt, no user impact. Promote when candidate wins on quality with ≤5% cost increase.
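The promotion rule can be written as a small predicate. This sketch treats A as the production prompt and B as the candidate, and takes "wins on quality" to mean a strictly higher pass rate; both are assumptions, as is the function name.

```python
def should_promote(
    pass_a: float, pass_b: float,
    cost_a: float, cost_b: float,
    max_cost_increase: float = 0.05,
) -> bool:
    """Promote candidate B only if it beats A on pass rate within the cost budget."""
    wins_quality = pass_b > pass_a
    cost_increase = (cost_b - cost_a) / cost_a if cost_a else 0.0
    return wins_quality and cost_increase <= max_cost_increase

print(should_promote(0.90, 0.93, cost_a=10.0, cost_b=10.4))  # prints True
print(should_promote(0.90, 0.93, cost_a=10.0, cost_b=11.0))  # prints False
```

On noisy suites, a strict comparison of pass rates can promote on chance differences, so it may be worth pairing this with the regression thresholds used for merge gates.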

Human review queue

Rows where judge score ∈ [0.4, 0.6] or diff >200 chars go to human review before merge on high-risk prompts (refunds, legal, medical disclaimers).
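A sketch of the routing rule, under assumptions: "diff" is taken to mean the number of changed characters between the baseline output and the candidate output, measured with the standard library's `difflib`, and the function names are hypothetical.

```python
import difflib

def changed_chars(old: str, new: str) -> int:
    """Count characters touched by non-equal edit operations."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return sum(
        max(i2 - i1, j2 - j1)
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    )

def needs_human_review(
    judge_score: float,
    base_output: str,
    new_output: str,
    high_risk: bool,
    band: tuple[float, float] = (0.4, 0.6),
    max_diff_chars: int = 200,
) -> bool:
    """Route to a reviewer when the prompt is high-risk and the row is borderline or heavily changed."""
    if not high_risk:
        return False
    borderline = band[0] <= judge_score <= band[1]
    big_diff = changed_chars(base_output, new_output) > max_diff_chars
    return borderline or big_diff
```

Keeping the band and the diff limit as parameters makes it easy to tighten them for refund or legal prompts without touching the routing logic.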

Prompt registry YAML
prompts:
  refund_system:
    version: 3.2.1
    path: prompts/refund_system.jinja2
    sha256: deadbeef...
    max_tokens_budget: 1200
  refund_user:
    version: 1.0.0
    path: prompts/refund_user.jinja2
prompts:
  refund_system:
    version: 3.2.1
    path: prompts/refund_system.jinja2
A/B eval two prompt versions
from statistics import mean

def ab_eval(rows: list, prompt_a: str, prompt_b: str, model: str) -> dict:
    scores_a = [run_row(row, prompt_a, model) for row in rows]
    scores_b = [run_row(row, prompt_b, model) for row in rows]
    return {
        "pass_a": mean(s.passed for s in scores_a),
        "pass_b": mean(s.passed for s in scores_b),
        "cost_a": sum(s.cost for s in scores_a),
        "cost_b": sum(s.cost for s in scores_b),
        "winner": "B" if mean(s.value for s in scores_b) > mean(s.value for s in scores_a) else "A",
    }
AbResult abEval(List<EvalRow> rows, String promptA, String promptB, String model) {
  var scoresA = rows.stream().map(r -> runRow(r, promptA, model)).toList();
  var scoresB = rows.stream().map(r -> runRow(r, promptB, model)).toList();
  double passA = scoresA.stream().filter(Score::passed).count() / (double) rows.size();
  double passB = scoresB.stream().filter(Score::passed).count() / (double) rows.size();
  return new AbResult(passA, passB, passA >= passB ? "A" : "B");
}
Shadow eval worker (staging)
import asyncio

async def shadow_eval_loop():
    async for event in prod_query_stream(sample_rate=0.05):
        if event.env != "staging_shadow":
            continue
        candidate = render_prompt(registry.get("refund_system", version=CANDIDATE_VER), event)
        prod = render_prompt(registry.get("refund_system", version=PROD_VER), event)
        score_c, score_p = await asyncio.gather(judge(candidate), judge(prod))
        metrics.gauge("shadow.delta", score_c - score_p)
void shadowEvalLoop() {
  prodQueryStream(0.05).forEach(event -> {
    String candidate = renderPrompt(CANDIDATE_VER, event);
    String prod = renderPrompt(PROD_VER, event);
    double delta = judge(candidate) - judge(prod);
    metrics.gauge("shadow.delta", delta);
  });
}
⚖️ Trade-off

Shadow eval catches real phrasing but needs strict PII scrub—prefer synthetic shadow if legal blocks prod query copy.

💰 Cost

A/B doubles eval API spend—run on stratified 100-row subset, not full 800-row golden, unless release-tier promotion.

🎯 Interview Tip

"Prompt change process?" — Registry semver, PR smoke vs baseline, token budget gate, shadow on staging, rollback via prompt_version pin.

FAQ: Few-shot examples in prompt?

Version examples as separate YAML; CI fails if example policy IDs drift from current policy database snapshot.

CI tools: DeepEval, PromptFoo, Braintrust, LangSmith

Compare DeepEval, PromptFoo, Braintrust, and LangSmith for CI integration, self-host options, PR comments, and regression UX.

CI eval tooling spans dedicated frameworks (DeepEval, PromptFoo) and platform suites (Braintrust, LangSmith). Pick based on self-host needs, language support, PR comment UX, and whether you already pay for trace storage on the same vendor.

CI tools comparison

| Tool | CI focus | Self-host | Strength | Gap |
|------|----------|-----------|----------|-----|
| DeepEval | Pytest-native metrics | OSS core | Fast unit-style LLM tests in CI | Less PR UX polish |
| PromptFoo | CLI matrix + red team | OSS | Prompt A/B grids, YAML configs, CI comments | Java secondary |
| Braintrust | Cloud eval + diff | Cloud | Human review, regression diffs, datasets | Cost at scale |
| LangSmith | LangChain eval runs | Cloud | Trace→dataset, eval on runs | Best inside LC ecosystem |

DeepEval

Wraps metrics (faithfulness, toxicity, G-Eval) as pytest cases—familiar to Python teams. Good for scorer unit tests and smoke in GitHub Actions. Pair with custom JSONL loader for full golden suites.

PromptFoo

Declarative YAML defines prompt matrix, providers, assertions (contains, javascript, llm-rubric). Built-in red-team plugins for injection probes. Excellent PR comment output via GitHub Action.

Braintrust CI

Push eval results to cloud; UI diffs failures across commits. Strong when PMs and ML share review duty. CI action fails build on regression policy defined in project settings.

LangSmith CI

Run evaluate() in CI against LangSmith datasets; ties to traces from staging. Lowest friction if already on LangChain for app and observability (Guide 5).

Selection matrix

| If you… | Prefer |
|---------|--------|
| Want pytest in CI tomorrow | DeepEval |
| Need prompt A/B grids + red team | PromptFoo |
| Need human review on failures | Braintrust |
| All-in on LangChain | LangSmith |
DeepEval pytest smoke
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric

@pytest.mark.parametrize("row", load_smoke_rows())
def test_refund_faithfulness(row):
    output = run_pipeline(row["input"])
    test_case = LLMTestCase(input=row["input"], actual_output=output, retrieval_context=row["context"])
    assert_test(test_case, [FaithfulnessMetric(threshold=0.85)])
@ParameterizedTest
@MethodSource("smokeRows")
void testRefundFaithfulness(EvalRow row) {
  String output = pipeline.run(row.input());
  assert faithfulnessMetric.score(output, row.context()) >= 0.85;
}
PromptFoo CI config excerpt
# promptfoo.yaml
prompts: [file://prompts/refund_system.jinja2]
providers: [openai:gpt-4o]
tests: file://evals/promptfoo/refund_cases.yaml
defaultTest:
  assert:
    - type: llm-rubric
      value: Must cite official policy ID
    - type: not-contains
      value: SSN
# promptfoo.yaml — same format for JavaScript/Java via CLI wrapper
Braintrust CI regression
import os

import braintrust

def ci_eval():
    experiment = braintrust.init("refund-copilot", experiment=os.environ["GITHUB_SHA"])
    for row in load_golden():
        out = run(row)
        experiment.log(input=row["input"], output=out, scores={"faithfulness": score(out, row)})
    summary = experiment.summarize()
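    # Assumption: summary.scores maps each score name to a summary with a regressions count
    if any(s.regressions for s in summary.scores.values()):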
        raise SystemExit(1)
void ciEval() {
  Experiment exp = Braintrust.init("refund-copilot", System.getenv("GITHUB_SHA"));
  for (EvalRow row : loadGolden()) {
    String out = run(row);
    exp.log(row.input(), out, Map.of("faithfulness", score(out, row)));
  }
  if (exp.summarize().hasRegressions()) System.exit(1);
}
📦 Real World

Python app team: DeepEval in unit CI + PromptFoo weekly red-team; LangSmith only for trace→dataset loop—not duplicate scorers in both.

⚖️ Trade-off

PromptFoo YAML assertions are fast but brittle on paraphrase—pair contains with llm-rubric for high-stakes rows.

💰 Cost

Braintrust/LangSmith CI billed per eval row—cache model responses in CI when prompt unchanged (key = hash(input+prompt_version)).
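A sketch of that cache, with hypothetical names. It extends the key with the model name, on the assumption that a model change should also invalidate cached responses, and it hashes a JSON list rather than a plain concatenation so that different splits of the same characters cannot collide.

```python
import hashlib
import json
from typing import Callable

def cache_key(input_text: str, prompt_version: str, model: str = "") -> str:
    """Stable key over (input, prompt_version, model)."""
    payload = json.dumps([input_text, prompt_version, model], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

class ResponseCache:
    """In-memory cache; a CI setup would persist this between runs."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.misses = 0

    def get_or_call(
        self,
        input_text: str,
        prompt_version: str,
        call_model: Callable[[str], str],  # hypothetical: sends the input to the model
        model: str = "",
    ) -> str:
        key = cache_key(input_text, prompt_version, model)
        if key not in self._store:
            self.misses += 1
            self._store[key] = call_model(input_text)
        return self._store[key]
```

Only the second and later runs with an unchanged prompt version avoid the model call, so the saving shows up on PRs that touch scorers or datasets but not prompts.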

FAQ: Multiple tools?

One blocking runner in merge path; others advisory (red team weekly). Duplicate scorers diverge.

Track 5 capstone checklist

Complete the Track 5 capstone checklist—eval-as-code CI, observability handoff, safety gates, and incident→golden loop—before starting Track 6 Shipping.

Track 5 capstone: offline eval gates + online observability + safety rails form a closed loop. This checklist confirms CI eval, dataset governance, prompt gates, and observability handoffs before Track 6 — Shipping.

Track 5 capstone checklist

| Capability | Guide | Done when |
|------------|-------|-----------|
| Eval foundations + judges | 1–2 | Scorers documented |
| RAG eval gold | 3 | RAGAS nightly green |
| Safety adversarial CI | 4 | Zero waiver PII/injection |
| Production traces + alerts | 5 | TTFT/cost dashboards live |
| Eval-as-code in CI | 6 | Smoke blocks PR |
| Full eval blocks release | 6 | Nightly artifact retained |
| Incident → golden loop | 5+6 | Weekly promotion PRs |

Release gate summary

| Gate | Threshold | Owner |
|------|-----------|-------|
| PR smoke pass rate | ≥ baseline − ε | ML eng |
| Safety suite | 100% | Security |
| Full eval faithfulness mean | ≥ baseline − 0.02 | ML eng |
| Observability schema tests | Pass | Platform |
| Shadow eval 24h | No quality red flag | Product |
Track 5 capstone verification script
GATES = {
    "smoke_ci": github.check_required("llm-eval-smoke"),
    "full_eval_last_green": artifact.pass_rate("full-eval") >= 0.93,
    "observability_checklist": open("docs/obs_checklist.done").read().strip() == "yes",
    "safety_zero_fail": artifact.pass_rate("safety-adversarial") == 1.0,
}

def track5_ready() -> bool:
    return all(GATES.values())
Map<String, Supplier<Boolean>> gates = Map.of(
    "smoke_ci", () -> github.checkRequired("llm-eval-smoke"),
    "safety_zero_fail", () -> artifacts.passRate("safety-adversarial") == 1.0
);
boolean track5Ready() { return gates.values().stream().allMatch(Supplier::get); }
Incident → golden → CI closed loop
def weekly_golden_promotion():
    traces = fetch_failed_traces(min_faithfulness=0.5, days=7)
    rows = [trace_to_golden(t) for t in traces[:20]]
    if not rows:
        return
    # Title reflects the actual count, which may be fewer than 20
    open_pr(f"evals: promote {len(rows)} incident rows", write_jsonl(rows))
    # After merge, nightly full eval automatically includes new ids
void weeklyGoldenPromotion() {
  var traces = fetchFailedTraces(0.5, 7);
  var rows = traces.stream().limit(20).map(Evals::traceToGolden).toList();
  if (!rows.isEmpty()) openPr("evals: promote incident rows", writeJsonl(rows));
}
📦 End of Track 5

You built eval foundations, RAG and safety gates, observability, and CI regression. Track 6 — Shipping covers launch checklists, feature flags, rollback, and operating LLM products in production.

🔒 Security

Capstone gate: safety adversarial suite at 100% with no override in CI config—waivers only via signed release exception file in repo.

📦 Real World

Teams shipping Track 6 without capstone green typically revert within 72h—missing CI eval means prompt hotfixes ship without smoke.

🎯 Interview Tip

"End-to-end LLM quality?" — Golden eval-as-code, CI regression + buckets, safety zero-tolerance, prod traces with sampled judges, incident→dataset loop.

💰 Cost

Budget full nightly eval separately from prod inference—typical $20–80/night for 1k-row golden at GPT-4o; cache when prompts unchanged.

🔬 Under the Hood

Regression baselines stored as CI artifacts from last green main—PR compares to downloaded baseline, not live re-run of main (saves API cost).

⚖️ Trade-off

Blocking every PR on full eval adds 45 min latency—smoke on PR + full on main is the industry compromise.

⚠️ Pitfall

Celebrating Track 5 complete with CI green once—datasets drift, models update, indexes rot. Capstone is a living program with weekly review.

Track 5 → Track 6 handoff

  • Feature flag prompt_version rollout doc
  • Rollback runbook tied to eval baseline SHA
  • On-call rotation includes ML eval owner
  • Launch dashboard: TTFT, cost, faithfulness, safety blocks

FAQ: Skip full eval for hotfix?

Only with exec + ML sign-off and automatic post-merge full eval within 4h—document in incident ticket.