Track 5 · Evaluation

Evaluation, safety & quality

A demo that nails three questions is not evidence; it is a trailer. Production fails on the fourth ticket, when someone asks for a monthly refund, attempts a jailbreak, or raises a topic that never appeared in the docs. Evaluation is how you catch that before customers do: frozen golden cases in git, automatic scorers, JSON reports, and non-zero exit codes that block the merge, the same discipline as unit tests. This track covers LLM-as-judge rubrics, RAG-specific metrics, safety guardrails, production observability, and CI regression gates. It assumes Tracks 1–4: you can call an LLM, run RAG, version prompts, and ship bounded agents.
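To make the loop concrete, here is a minimal sketch of that discipline in Python: golden cases, a rule scorer, a JSON report, and an exit code. Everything named here is hypothetical and chosen for illustration: the case fields (`id`, `input`, `must_contain`), the `fake_model` stand-in for a real LLM call, and the substring scorer are assumptions, not the track's actual harness.

```python
import json

# Hypothetical golden cases. In a real repo these would be a frozen file in git.
GOLDEN_CASES = [
    {"id": "refund-monthly",
     "input": "Can I get a refund on my monthly plan?",
     "must_contain": "refund"},
    {"id": "out-of-scope",
     "input": "What is the weather on Mars?",
     "must_contain": "don't know"},
]


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption: swap in your own client)."""
    if "refund" in prompt.lower():
        return "Monthly plans are eligible for a refund within 14 days."
    return "I don't know; that is not covered in the docs."


def score_case(case: dict, answer: str) -> float:
    """Rule scorer: 1.0 if the required substring appears, else 0.0."""
    return 1.0 if case["must_contain"].lower() in answer.lower() else 0.0


def run_suite(cases, model, threshold: float = 1.0) -> dict:
    """Run every golden case and build a JSON-serializable report."""
    results = []
    for case in cases:
        answer = model(case["input"])
        results.append({"id": case["id"], "score": score_case(case, answer)})
    mean = sum(r["score"] for r in results) / len(results)
    return {"mean_score": mean, "passed": mean >= threshold, "results": results}


def main() -> int:
    """Print the report and return a process exit code (0 = pass, 1 = fail)."""
    report = run_suite(GOLDEN_CASES, fake_model)
    print(json.dumps(report, indent=2))
    return 0 if report["passed"] else 1

# A CI entry point would end with: sys.exit(main())
```

Passing the return value of `main()` to `sys.exit` is what lets a CI job block the merge: any non-zero exit code fails the step. The substring scorer is deliberately crude; the later guides cover judges and RAG metrics that would replace it.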

Guides in this track

Six deep-dive chapters. All guides are live—read in order.

Reading order: Evaluation explained → LLM-as-judge → RAG evaluation → Safety guardrails → Observability & monitoring → CI evaluation

01

Evaluation explained

Why eval is the hardest part of LLM engineering, the vibe-check trap, eval pyramid, metrics vs vibes, eval-driven loops, and eval types.

Foundation · Live
02

LLM-as-judge

Rubric design, pairwise vs pointwise scoring, judge calibration, bias mitigation, and when automated judges beat rule scorers.

Guide · Live
03

RAG evaluation

Recall@k, faithfulness, answer relevance, RAGAS-style metrics, chunk attribution, and separating retrieval from generation failures.

Guide · Live
04

Safety guardrails

Prompt injection defenses, refusal policies, PII redaction, content filters, red-team suites, and eval-driven safety regression.

Guide · Live
05

Observability & monitoring

Trace logging, online eval sampling, drift detection, cost/latency SLOs, and promoting production failures into golden cases.

Guide · Live
06

CI evaluation

Smoke vs full suites, mock vs live runners, GitHub Actions gates, report artifacts, and blocking deploy on score regression.

Guide · Live