Track 5 · Evaluation · Coral
Evaluation, safety & quality
A demo that nails three questions is not evidence; it is a trailer. Production fails on the fourth ticket, when someone asks about a monthly refund, attempts a jailbreak, or raises a topic that never appeared in the docs. Evaluation is how you catch that before customers do: frozen golden cases in git, automatic scorers, JSON reports, and non-zero exit codes that block the merge, the same discipline as unit tests. This track covers LLM-as-judge rubrics, RAG-specific metrics, safety guardrails, production observability, and CI regression gates. It assumes Tracks 1–4: you can call an LLM, run RAG, version prompts, and ship bounded agents.
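To make that loop concrete, here is a minimal sketch of the pattern the paragraph describes: golden cases, an automatic scorer, a JSON report, and an exit code a CI job could act on. Everything named here is an assumption for illustration: `GoldenCase`, `keyword_scorer`, `run_suite`, and `fake_model` are hypothetical, the cases are inlined rather than loaded from a versioned file, and the substring scorer stands in for whatever scorer a real suite would use.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    """One frozen test case (hypothetical schema for illustration)."""
    case_id: str
    prompt: str
    must_contain: tuple[str, ...]  # substrings the answer must include


# Hypothetical cases; in practice these would live in a file committed to git.
GOLDEN_CASES = [
    GoldenCase("refund-monthly", "Can I get a refund on a monthly plan?", ("refund",)),
    GoldenCase("out-of-scope", "What is the capital of France?", ("can't help",)),
]


def keyword_scorer(answer: str, case: GoldenCase) -> float:
    """Return the fraction of required substrings found, case-insensitively."""
    if not case.must_contain:
        return 1.0
    text = answer.lower()
    hits = sum(1 for needle in case.must_contain if needle.lower() in text)
    return hits / len(case.must_contain)


def run_suite(model: Callable[[str], str], cases, threshold: float = 1.0) -> dict:
    """Score every case and build a JSON-serializable report."""
    results = []
    for case in cases:
        score = keyword_scorer(model(case.prompt), case)
        results.append({"id": case.case_id, "score": score, "passed": score >= threshold})
    passed = sum(1 for r in results if r["passed"])
    return {
        "total": len(results),
        "passed": passed,
        "failed": len(results) - passed,
        "results": results,
    }


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    if "refund" in prompt.lower():
        return "Monthly plans are eligible for a refund."
    return "Sorry, I can't help with that topic."


def main() -> int:
    """Print the report and return 0 on success, 1 if any case failed."""
    report = run_suite(fake_model, GOLDEN_CASES)
    print(json.dumps(report, indent=2))
    return 0 if report["failed"] == 0 else 1


# A CI entry point would end with: raise SystemExit(main())
```

Returning the exit code from `main()` instead of exiting inside it keeps the suite importable from tests, while a non-zero process status is what lets a CI job fail the merge.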
Guides in this track
Six deep-dive chapters. All guides are live; read them in order.
Reading order: Evaluation explained → LLM-as-judge → RAG evaluation → Safety guardrails → Observability & monitoring → CI evaluation
- 01 · Evaluation explained
  Why eval is the hardest part of LLM engineering, the vibe-check trap, eval pyramid, metrics vs vibes, eval-driven loops, and eval types.
- 02 · LLM-as-judge
  Rubric design, pairwise vs pointwise scoring, judge calibration, bias mitigation, and when automated judges beat rule scorers.
- 03 · RAG evaluation
  Recall@k, faithfulness, answer relevance, RAGAS-style metrics, chunk attribution, and separating retrieval from generation failures.
- 04 · Safety guardrails
  Prompt injection defenses, refusal policies, PII redaction, content filters, red-team suites, and eval-driven safety regression.
- 05 · Observability & monitoring
  Trace logging, online eval sampling, drift detection, cost/latency SLOs, and promoting production failures into golden cases.
- 06 · CI evaluation
  Smoke vs full suites, mock vs live runners, GitHub Actions gates, report artifacts, and blocking deploy on score regression.