Track 5 · Evaluation · Coral

Evaluation, safety & quality

A demo that nails three questions is not evidence; it is a trailer. Production fails on the fourth ticket, when someone asks for a monthly refund, tries a jailbreak, or raises a topic that never appeared in the docs. Evaluation is how you catch that before customers do: frozen golden cases in git, automatic scorers, JSON reports, and non-zero exit codes that block the merge, the same discipline as unit tests. This track covers LLM-as-judge rubrics, RAG-specific metrics, safety guardrails, production observability, and CI regression gates. It assumes Tracks 1–4: you can call an LLM, run RAG, version prompts, and ship bounded agents.
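
The loop described above (golden cases, an automatic scorer, a JSON report, an exit code) can be sketched in a few lines of Python. This is a minimal illustration, not the track's actual harness: `GoldenCase`, `run_suite`, `exit_code`, and `fake_generate` are hypothetical names, the canned answers are invented for the example, and the substring check stands in for whatever scorer a real suite would use.

```python
import json
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class GoldenCase:
    """One frozen test case; in practice these would live in a file in git."""
    case_id: str
    prompt: str
    must_contain: str  # substring the answer is required to include


def score_case(case: GoldenCase, answer: str) -> bool:
    """Rule scorer: pass if the required substring appears (case-insensitive)."""
    return case.must_contain.lower() in answer.lower()


def run_suite(
    cases: Sequence[GoldenCase],
    generate: Callable[[str], str],
    min_pass_rate: float = 1.0,
) -> dict:
    """Score every case and return a JSON-serializable report."""
    results = [
        {"case_id": c.case_id, "passed": score_case(c, generate(c.prompt))}
        for c in cases
    ]
    passed = sum(r["passed"] for r in results)
    pass_rate = passed / len(results) if results else 0.0
    return {"pass_rate": pass_rate, "ok": pass_rate >= min_pass_rate, "results": results}


def exit_code(report: dict) -> int:
    """0 lets the merge through; 1 blocks it."""
    return 0 if report["ok"] else 1


def fake_generate(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical canned answers)."""
    canned = {
        "Can I get a refund on a monthly plan?": "Monthly plans are refundable within 14 days.",
    }
    return canned.get(prompt, "I don't know.")


cases = [
    GoldenCase("refund-monthly", "Can I get a refund on a monthly plan?", "14 days"),
    GoldenCase("off-topic", "What is the weather on Mars?", "don't know"),
]
report = run_suite(cases, fake_generate)
print(json.dumps(report, indent=2))
```

In a CI script, finishing with `raise SystemExit(exit_code(report))` would turn a failing report into a non-zero exit status, which is what lets the pipeline block the merge.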

Guides in this track

Six deep-dive chapters. All guides are live; read them in order.

Reading order: Evaluation explained → LLM-as-judge → RAG evaluation → Safety guardrails → Observability & monitoring → CI evaluation

  1. Evaluation explained

    Why eval is the hardest part of LLM engineering, the vibe-check trap, eval pyramid, metrics vs vibes, eval-driven loops, and eval types.

    Foundation · Live

  2. LLM-as-judge

    Rubric design, pairwise vs pointwise scoring, judge calibration, bias mitigation, and when automated judges beat rule scorers.

    Guide · Live

  3. RAG evaluation

    Recall@k, faithfulness, answer relevance, RAGAS-style metrics, chunk attribution, and separating retrieval from generation failures.

    Guide · Live

  4. Safety guardrails

    Prompt injection defenses, refusal policies, PII redaction, content filters, red-team suites, and eval-driven safety regression.

    Guide · Live

  5. Observability & monitoring

    Trace logging, online eval sampling, drift detection, cost/latency SLOs, and promoting production failures into golden cases.

    Guide · Live

  6. CI evaluation

    Smoke vs full suites, mock vs live runners, GitHub Actions gates, report artifacts, and blocking deploy on score regression.

    Guide · Live
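
For a concrete taste of the metrics named in the list above, here is a hedged sketch of Recall@k from the RAG evaluation guide: the fraction of known-relevant documents that appear in the top k retrieved results. The function name, the document IDs, and the convention of returning 0.0 when no relevant documents are labeled are assumptions made for this example, not definitions taken from the guides.

```python
from typing import Iterable, Sequence


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Iterable[str], k: int) -> float:
    """Fraction of relevant documents found among the top-k retrieved IDs."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    relevant = set(relevant_ids)
    if not relevant:
        # Assumed convention: no labeled relevant docs means recall is 0.0.
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical ranking from a retriever, best match first.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = ["d1", "d2"]

print(recall_at_k(retrieved, relevant, k=2))  # 0.5: only d1 is in the top 2
print(recall_at_k(retrieved, relevant, k=4))  # 1.0: both d1 and d2 are in the top 4
```

Scoring retrieval on its own like this is one way to separate a retrieval failure (the right chunk never surfaced) from a generation failure (the chunk was there and the answer still went wrong).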