Lang
API

Reliable output formatting

Prompts that say “return JSON” fail in production. Models wrap JSON in markdown fences, omit required fields, hallucinate enums, and truncate mid-object when they hit max_tokens. This guide is a production-first contract for output shape: provider-native structured output, prompt-level fallbacks, parsers that fail closed, and regression tests that catch format drift before deploy.

After reading, you should be able to: choose JSON mode vs strict schema vs Instructor/Pydantic vs Spring AI converters; specify Markdown and XML tag contracts models actually follow; control length with explicit sentence counts and token caps; ground answers with numbered citations tied to retrieved chunks; pipe multi-step outputs through strict parsers; and ship formatting regression tests in CI.

developer platform architect Track 3 OpenAI json_schema Instructor 1.x Pydantic v2 Spring AI 1.0+

JSON output — schemas, not vibes

Structured extraction is the backbone of LLM features that touch databases, billing, and compliance. Treat the model as a probabilistic serializer: your job is to constrain the output space with provider APIs first, validate with typed models second, and retry once with the validation error third.

Three layers of JSON reliability

LayerMechanismWhen to use
1. Prompt-only “Return valid JSON only, no markdown” Prototypes only—breaks under edge cases and model upgrades
2. JSON mode response_format: {"type":"json_object"} Valid JSON guaranteed; schema shape still your problem
3. Strict schema json_schema with strict: true Production extraction—required fields, enums, no extra keys

OpenAI’s strict JSON schema mode (see also APIs guide — structured output) constrains generation at decode time. Pair it with temperature=0 and explicit max_tokens sized for your largest valid object.

OpenAI strict schema

Invoice extraction — strict schema
from openai import OpenAI
from pydantic import BaseModel, ValidationError
import json

class InvoiceLine(BaseModel):
    vendor: str
    amount_usd: float
    category: str  # enum in schema below

client = OpenAI()
schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "amount_usd": {"type": "number"},
        "category": {"type": "string", "enum": ["saas", "travel", "other"]},
    },
    "required": ["vendor", "amount_usd", "category"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=256,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice_line", "strict": True, "schema": schema},
    },
    messages=[{"role": "user", "content": raw_invoice_text}],
)
raw = resp.choices[0].message.content
try:
    line = InvoiceLine.model_validate_json(raw)
except ValidationError as e:
    raise ValueError(f"schema drift: {e}") from e
record InvoiceLine(String vendor, double amountUsd, String category) {}

BeanOutputConverter<InvoiceLine> converter =
    new BeanOutputConverter<>(InvoiceLine.class);

InvoiceLine line = chatClient.prompt()
    .options(ChatOptionsBuilder.builder()
        .withModel("gpt-4o-mini")
        .withTemperature(0.0)
        .withMaxTokens(256)
        .build())
    .user(u -> u.text(rawInvoiceText).param("format", converter.getFormat()))
    .call()
    .entity(InvoiceLine.class);
# Anthropic: tool_use with input_schema, or prompt + extract JSON block
import anthropic, json, re

client = anthropic.Anthropic()
msg = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    temperature=0,
    system="Return ONLY a JSON object matching the schema. No markdown.",
    tools=[{
        "name": "extract_invoice",
        "description": "Structured invoice line",
        "input_schema": schema,  # same JSON Schema as OpenAI
    }],
    tool_choice={"type": "tool", "name": "extract_invoice"},
    messages=[{"role": "user", "content": raw_invoice_text}],
)
for block in msg.content:
    if block.type == "tool_use":
        data = block.input  # already parsed dict
        line = InvoiceLine.model_validate(data)
// Spring AI AnthropicChatModel + ToolCallbacks or BeanOutputConverter
// Prefer tool_use with input_schema for Anthropic-native structured output
var tool = FunctionToolCallback.builder("extract_invoice", InvoiceLine.class)
    .description("Structured invoice line")
    .inputSchema(schemaJson)
    .build();
# Bedrock Converse — toolConfig with JSON schema (Claude on Bedrock)
import boto3, json

client = boto3.client("bedrock-runtime")
resp = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    toolConfig={
        "tools": [{
            "toolSpec": {
                "name": "extract_invoice",
                "inputSchema": {"json": schema},
            }
        }]
    },
    messages=[{"role": "user", "content": [{"text": raw_invoice_text}]}],
)
// BedrockConverseChatModel with tool definitions mirroring JSON Schema
ConverseResponse response = bedrockClient.converse(request -> request
    .modelId("anthropic.claude-3-5-sonnet-20241022-v2:0")
    .toolConfig(toolsWithSchema(schema))
    .messages(userMessage(rawInvoiceText)));

Instructor + Pydantic (repair loop built in)

Instructor wraps the OpenAI/Anthropic client, sends your Pydantic model as schema, and retries on validation failure with the error message injected— the pattern you would otherwise hand-roll.

Instructor — typed response with max_retries
import instructor
from pydantic import BaseModel, Field
from openai import OpenAI

class RefundDecision(BaseModel):
    eligible: bool
    reason: str = Field(max_length=200)
    policy_section: str

client = instructor.from_openai(OpenAI())
decision = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=RefundDecision,
    max_retries=2,
    messages=[
        {"role": "system", "content": "Extract refund eligibility from policy text."},
        {"role": "user", "content": f"Policy:\n{policy}\n\nTicket: {ticket}"},
    ],
)
# decision is a validated RefundDecision instance
// Java: Spring AI BeanOutputConverter + manual retry on JsonMappingException
for (int attempt = 0; attempt < 3; attempt++) {
    try {
        return chatClient.prompt()
            .user(buildExtractionPrompt(policy, ticket))
            .call()
            .entity(RefundDecision.class);
    } catch (NonTransientAiException e) {
        if (attempt == 2) throw e;
        // append validation error to next user message
    }
}

Spring AI structured output

Spring AI’s .entity(Class) and BeanOutputConverter embed format instructions in the prompt and parse the response. For OpenAI, prefer native response_format when available; use converter mode for providers without strict schema support or when migrating legacy prompts.

  • Map output.entity(Map.class) for dynamic keys (fragile).
  • Record/POJO — compile-time types; Jackson validates after generation.
  • StructuredOutputChatOptions — wire OpenAI json_schema through Spring AI 1.0 options bean.
⚠️ Pitfall

Calling json.loads on raw text without schema validation. Models return "amount_usd": "49.99" (string), omit nullable fields, or add confidence you never defined. Always validate with Pydantic/Jackson; never pass unvalidated dicts to SQL or payment APIs.

⚖️ Trade-off

Strict schema reduces parse errors but increases latency and can refuse fuzzy-but-correct extractions from messy OCR. For noisy inputs, a two-step pipeline (free-text extract → normalize with strict schema) often beats one-shot strict mode.

💡 Pro Tip

Mirror your Pydantic model to JSON Schema in CI (model_json_schema()) and diff against the schema you send to the API—drift between code and provider schema is a common silent regression.

Markdown output — human-readable contracts

Not every response should be JSON. User-facing assistants, internal runbooks, and email drafts benefit from Markdown the model already knows how to emit—but only if you specify structure explicitly. Vague “use markdown” produces inconsistent heading levels, broken lists, and code fences that break your renderer.

When Markdown is the right format

Use MarkdownUse JSON / schema instead
Chat answers shown in a rich-text UI Fields consumed by code (APIs, workflows, DB writes)
Docs, summaries, comparison tables for humans Tool arguments, agent state, eval scoring
Code snippets with syntax highlighting When you need guaranteed key presence

Prompt template for stable Markdown shape

Pin heading levels and section order in the system prompt—models follow templates better than adjectives.

System prompt excerpt — support reply
Respond in Markdown using EXACTLY this structure:

## Summary
One paragraph (2–4 sentences).

## Steps
Numbered list; each step one actionable sentence.

## Code (optional)
If code is required, one fenced block with language tag, e.g. ```python

Do NOT use H1 (#). Do NOT wrap the entire response in a single code fence.
Do NOT add sections not listed above.

Headings and hierarchy

  • Fix levels — If your UI renders ## as section titles, forbid # in the prompt.
  • Limit depth — “Use at most ## and ###” prevents deep nesting that breaks mobile layouts.
  • Named sections — Required headings (“Summary”, “Evidence”, “Next steps”) enable regex or AST-based post-checks.

Code blocks

Models often emit triple-backtick fences without language tags or nest fences incorrectly. Specify: “Exactly one fenced code block per snippet; always include language tag (python, bash, json).” Sanitize before render—never pass model Markdown to dangerouslySetInnerHTML without a trusted parser.

Post-validate required Markdown sections
import re

REQUIRED = ["## Summary", "## Steps"]

def validate_markdown_shape(text: str) -> list[str]:
    errors = []
    for heading in REQUIRED:
        if heading not in text:
            errors.append(f"missing section: {heading}")
    if text.strip().startswith("```") and text.strip().endswith("```"):
        errors.append("entire response wrapped in code fence")
    if re.search(r"^# [^#]", text, re.MULTILINE):
        errors.append("H1 heading used — forbidden")
    return errors
List<String> validateMarkdownShape(String text) {
    List<String> errors = new ArrayList<>();
    if (!text.contains("## Summary")) errors.add("missing Summary");
    if (!text.contains("## Steps")) errors.add("missing Steps");
    if (text.matches("(?s)^```.*```$"))
        errors.add("entire response in code fence");
    return errors;
}
📦 Real World

Notion and Slack-style assistants render Markdown client-side. Teams store raw model Markdown in the DB, run a lightweight validator before save, and re-prompt once if sections are missing—cheaper than letting broken layout reach users and support.

🔬 Under the Hood

Markdown is token-efficient for prose but ambiguous for machines—parsers differ on list indentation and HTML passthrough. For mixed human+machine consumers, use Markdown for display and a hidden JSON block in an HTML comment only in internal pipelines (not user-visible).

XML tags — delimiters Claude follows well

Anthropic’s documentation and Claude’s training heavily use XML-tagged sections. Tags like <thinking> and <answer> create clear boundaries for parsing, hiding chain-of-thought from users, and separating evidence from synthesis.

Common tag patterns

TagPurposeShow to user?
<thinking>Private reasoning, scratch workNo — strip before display
<answer>Final user-facing textYes — extract and render
<context>Retrieved chunks you injectNo — input only
<citation>Evidence spans with idsParse for footnotes

System prompt with thinking / answer split

Tagged output contract
SYSTEM = """You must respond using exactly these tags:

<thinking>...your private reasoning...</thinking>
<answer>...user-facing answer only...</answer>

Rules:
- <thinking> may reference context; <answer> must not reveal tag names.
- If evidence is insufficient, say so inside <answer> only."""

msg = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=SYSTEM,
    messages=[{"role": "user", "content": user_question}],
)
raw = "".join(b.text for b in msg.content if b.type == "text")
answer = extract_tag(raw, "answer")  # see parser below
String system = """
You must respond using exactly these tags:
<thinking>...</thinking>
<answer>...</answer>
""";

ChatResponse r = anthropicChat.prompt()
    .system(system)
    .user(userQuestion)
    .call();
String raw = r.getResult().getOutput().getContent();
String answer = XmlTagParser.extract(raw, "answer");
# OpenAI models also follow XML tags when instructed — same parser
resp = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user_question},
    ],
)
raw = resp.choices[0].message.content
answer = extract_tag(raw, "answer")
chatClient.prompt().system(system).user(userQuestion).call();
// Same XmlTagParser — provider-agnostic extraction
# Bedrock Claude — identical tag contract in system field
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "system": SYSTEM,
    "messages": [{"role": "user", "content": user_question}],
}
bedrockChat.prompt().system(system).user(userQuestion).call();

Robust tag parser

Regex alone fails on nested tags and unclosed tags. Use a non-greedy match for well-formed output, fail closed when tags are missing, and log raw text for eval—not for users.

extract_tag — fail closed
import re

TAG_RE = re.compile(r"<(\\w+)>(.*?)</\\1>", re.DOTALL)

def extract_tag(text: str, name: str) -> str:
    for tag, body in TAG_RE.findall(text):
        if tag == name:
            return body.strip()
    raise FormatError(f"missing <{name}>...</{name}>")

def split_thinking_answer(text: str) -> tuple[str, str]:
    thinking = extract_tag(text, "thinking")  # optional: catch FormatError
    answer = extract_tag(text, "answer")
    return thinking, answer
public static String extract(String text, String name) {
    Pattern p = Pattern.compile(
        "<" + name + ">(.*?)</" + name + ">", Pattern.DOTALL);
    Matcher m = p.matcher(text);
    if (!m.find()) throw new FormatException("missing tag: " + name);
    return m.group(1).trim();
}
⚠️ Pitfall

Showing <thinking> to end users—leaks reasoning, policy hints, and retrieved PII. Always strip before SSE to the browser. Log thinking server-side with redaction if needed for debug.

🎯 Interview Tip

“How do you hide chain-of-thought?” — Mention tagged sections, server-side extraction, never relying on model self-censorship; optional distillation step that removes thinking before storage.

Length control — three levers that actually work

“Be concise” is not a length policy. Production systems combine explicit structure limits in the prompt, max_tokens as a hard ceiling, and post-generation validation—with different behavior when finish_reason is length.

Compare the levers

TechniqueEffectReliability
“Be concise” / “briefly” Vague shortening Low — variance across models and temperatures
“Answer in exactly 3 sentences” Structural cap on prose Medium — good for summaries; count in validator
max_tokens Hard token ceiling on completion High — can truncate mid-JSON; always check finish_reason
Schema maxLength / Pydantic Field(max_length=) Per-field bounds on structured output High for JSON — combine with max_tokens headroom

Explicit sentence limits in prompt

Prompt + validator pattern
SYSTEM = """Answer in EXACTLY 3 sentences. Number them 1. 2. 3.
Do not include a preamble or closing remark."""

def count_sentences(text: str) -> int:
    # naive split — replace with nltk/spacy for production locales
    parts = [p.strip() for p in re.split(r"[.!?]+", text) if p.strip()]
    return len(parts)

def validate_length(text: str, expected: int = 3) -> None:
    n = count_sentences(text)
    if n != expected:
        raise FormatError(f"expected {expected} sentences, got {n}")

max_tokens sizing

Size max_tokens from measured p95 completion length on golden evals, not guesses. For JSON, add ~20% headroom above largest valid object. When finish_reason == "length", do not parse partial JSON—retry with higher cap or return 422.

Handle truncation — all providers
choice = resp.choices[0]
if choice.finish_reason == "length":
    raise TruncatedOutputError(
        "completion hit max_tokens — output may be invalid JSON or cut mid-sentence"
    )
if ("length".equals(response.getResult().getMetadata().getFinishReason())) {
    throw new TruncatedOutputException("max_tokens exceeded");
}
if msg.stop_reason == "max_tokens":
    raise TruncatedOutputError("Anthropic max_tokens — increase cap or shorten prompt")
// Anthropic: stopReason == StopReason.MAX_TOKENS
# Converse: stopReason == "max_tokens"
if resp["stopReason"] == "max_tokens":
    raise TruncatedOutputError("Bedrock truncation")
if (response.stopReason() == StopReason.MAX_TOKENS) { ... }
💰 Cost

Setting max_tokens too high “just in case” bills full completion length on some providers. Tune per route: ticket summaries 150 tokens, JSON extraction 256, long reports 1024—with metrics proving p99 fit.

⚖️ Trade-off

Sentence-count prompts improve readability but fight models that add bullet lists or numbered steps. Prefer “3 sentences OR a 3-item numbered list” if your UI accepts both—or enforce one format in the validator.

Citations — [1][2] grounded to retrieved context

RAG answers without citations are un-auditable. Numbered references ([1], [2]) tie each claim to a specific chunk you injected—so support can open the source doc and compliance can trace the answer.

Citation contract in prompt

Pre-number chunks in the context block. Instruct the model to cite inline and forbid citations to ids not in the list. See Retrieval strategies for how chunks arrive; here we focus on format.

Context assembly with numbered sources
def format_chunks_for_prompt(chunks: list[dict]) -> str:
    lines = []
    for i, c in enumerate(chunks, start=1):
        lines.append(f"[{i}] (doc={c['doc_id']}, page={c.get('page')})\n{c['text']}")
    return "\n\n".join(lines)

CITATION_RULES = """
Use inline citations [1], [2], ... matching the numbered sources above.
Every factual claim must have at least one citation.
Do NOT cite numbers outside 1..N. If sources are insufficient, say "I don't have evidence" without inventing citations.
"""

Parse and validate citations

Extract [n] references and validate range
import re

CITE_RE = re.compile(r"\\[(\\d+)\\]")

def parse_citations(text: str, max_source: int) -> set[int]:
    ids = {int(m) for m in CITE_RE.findall(text)}
    bad = ids - set(range(1, max_source + 1))
    if bad:
        raise FormatError(f"invalid citation ids: {bad}")
    return ids

def map_citations_to_chunks(ids: set[int], chunks: list[dict]) -> list[dict]:
    return [chunks[i - 1] for i in sorted(ids)]
Pattern CITE = Pattern.compile("\\\\[(\\\\d+)\\\\]");

Set<Integer> parseCitations(String text, int maxSource) {
    Set<Integer> ids = new TreeSet<>();
    Matcher m = CITE.matcher(text);
    while (m.find()) {
        int id = Integer.parseInt(m.group(1));
        if (id < 1 || id > maxSource)
            throw new FormatException("invalid citation: " + id);
        ids.add(id);
    }
    return ids;
}

UI mapping

  • Render [1] as superscript links to chunk metadata (title, page, URL).
  • Store citation_ids alongside the answer in your DB for ticket audit.
  • Eval: golden answers include expected citation set—flag answers with correct prose but wrong sources.
⚠️ Pitfall

Models cite confidently without matching text—[1] on a claim no chunk supports. Automated checks: embedding similarity between cited span and chunk, or LLM-as-judge on citation faithfulness (Track 5).

💡 Pro Tip

Duplicate chunk text under different numbers confuses models. Deduplicate by chunk_id before numbering; re-use the same [n] if the same passage appears twice in retrieval results.

Chain output — pipe through strict parsers

Multi-step LLM pipelines (classify → retrieve → answer → format) fail silently when step 2 accepts garbage from step 1. Treat each step’s output as an untrusted string until validated; never pass raw model text to the next prompt without parsing into typed objects.

Anti-pattern vs production pattern

Anti-patternProduction pattern
step2_prompt = f"Intent: {step1_raw}\\n..." intent = Intent.model_validate_json(step1_raw)
Retry entire chain on any error Retry failed step only; cap total latency
Log full intermediate text to users on failure Fail closed with generic error; log details server-side

Three-step chain with strict gates

classify → extract → respond
from pydantic import BaseModel
from enum import Enum

class Intent(str, Enum):
    refund = "refund"
    billing = "billing"
    other = "other"

class IntentResult(BaseModel):
    intent: Intent
    confidence: float

class ExtractedFacts(BaseModel):
    order_id: str | None
    amount_usd: float | None

def run_chain(ticket: str) -> str:
    # Step 1 — strict JSON
    r1 = call_json(prompt_classify(ticket), IntentResult)
    if r1.confidence < 0.7:
        return "Escalating to human — low confidence."

    # Step 2 — only runs with validated intent
    r2 = call_json(prompt_extract(ticket, r1.intent), ExtractedFacts)

    # Step 3 — prose with citations; validate markdown shape
    raw = call_text(prompt_answer(ticket, r1, r2, chunks))
    errs = validate_markdown_shape(raw)
    if errs:
        raise FormatError(errs)
    return raw
public String runChain(String ticket) {
    IntentResult r1 = callJson(classifyPrompt(ticket), IntentResult.class);
    if (r1.confidence() < 0.7) return "Escalating to human — low confidence.";

    ExtractedFacts r2 = callJson(extractPrompt(ticket, r1), ExtractedFacts.class);

    String raw = callText(answerPrompt(ticket, r1, r2, chunks));
    List<String> errs = validateMarkdownShape(raw);
    if (!errs.isEmpty()) throw new FormatException(errs.toString());
    return raw;
}

Prevent error propagation

  1. Typed boundaries — Pydantic/Jackson at every handoff; no stringly-typed pipelines.
  2. Idempotent step keys — cache step 1 result by ticket hash so retries do not re-classify differently.
  3. Circuit breaker — after N format failures on a route, fall back to template response and alert.
  4. Version prompts per stepclassify_v3, answer_v7 in logs for blame.
📦 Real World

Support automation at scale uses 4–7 LLM steps. Teams that skipped parsers saw “refund” intents drift into billing prompts after a model upgrade—fixed by JSON schema on step 1 and blocking the chain when validation fails.

🔬 Under the Hood

LangGraph / Temporal workflows encode the same idea: each node output is a typed state object. LLM nodes append to state only after parse success—graph execution stops on FormatError instead of polluting downstream nodes.

Production — formatting regression tests

Output format is a contract—test it like an API schema. Golden fixtures assert parse success, required fields, citation ids, and Markdown section presence; CI fails when a prompt or model change drops parse rate below threshold.

What to measure

MetricDefinitionAlert threshold (example)
Parse rate% responses passing schema/tag/markdown validator< 98% on golden set
Truncation rate% with finish_reason length> 2% on any route
Citation validity% answers with in-range [n] only< 95% for RAG routes
Repair retry rate% needing Instructor/second attemptSpike > 2× baseline

Golden formatting regression harness

eval/format_regression.py — run in CI on prompts/ changes
"""Golden cases: input + expected schema / sections / citation ids."""
import json
from pathlib import Path

GOLDEN = Path("eval/golden/formatting_cases.jsonl")

def run_case(case: dict) -> dict:
    raw = invoke_route(case["route"], case["input"])
    result = {"id": case["id"], "ok": True, "errors": []}

    if case.get("expect_json"):
        try:
            obj = InvoiceLine.model_validate_json(raw)
            if case.get("required_fields"):
                for f in case["required_fields"]:
                    if getattr(obj, f) is None:
                        result["errors"].append(f"missing {f}")
        except Exception as e:
            result["ok"] = False
            result["errors"].append(str(e))

    if case.get("markdown_sections"):
        result["errors"].extend(validate_markdown_shape(raw))
        result["ok"] = result["ok"] and not result["errors"]

    if case.get("max_citation"):
        try:
            parse_citations(raw, case["max_citation"])
        except FormatError as e:
            result["ok"] = False
            result["errors"].append(str(e))

    return result

def main():
    cases = [json.loads(l) for l in GOLDEN.read_text().splitlines() if l.strip()]
    results = [run_case(c) for c in cases]
    parse_rate = sum(r["ok"] for r in results) / len(results)
    assert parse_rate >= 0.98, f"parse rate {parse_rate:.2%} below gate"
    print(f"format regression OK — {parse_rate:.2%} parse rate on {len(results)} cases")

if __name__ == "__main__":
    main()
// JUnit + golden JSONL — same gates as Python harness
@Test
void formattingRegressionGate() {
    List<GoldenCase> cases = loadJsonl("eval/golden/formatting_cases.jsonl");
    long ok = cases.stream().filter(this::runCase).count();
    double rate = (double) ok / cases.size();
    assertTrue(rate >= 0.98, "parse rate " + rate);
}

CI integration checklist

  • Trigger on changes under prompts/, eval/golden/, or model pin config.
  • Smoke subset (< 30s) on every PR; full golden nightly or pre-release.
  • Store raw failures as artifacts (redacted) for prompt engineers—compare before/after on regressions.
  • Block merge if parse rate drops > 1% vs main baseline (relative gate, not only absolute).
  • Tag releases with prompt_bundle_version and schema hash in observability.

Production readiness checklist

  • Every structured route uses provider strict schema or tool input_schema
  • Pydantic/Jackson validation before any side effect
  • finish_reason == length handled — no partial JSON parse
  • Markdown/XML validators on human-facing routes; thinking stripped from user stream
  • RAG routes validate citation ids against injected chunk count
  • Multi-step chains use typed handoffs — no raw string pipe
  • Formatting golden suite in CI with documented parse-rate gate
  • Dashboards: parse rate, truncation rate, repair retries by route and model
🎯 Interview Tip

“How do you prevent prompt changes from breaking production?” — Golden eval with parse-rate gates in CI, versioned prompt bundles, schema validation fail-closed, and canary routes with comparison to baseline metrics.

💰 Cost

Full golden runs on every PR get expensive. Cache embeddings for fixed inputs, use smaller models for format-only smoke tests, and reserve frontier models for weekly regression or changed routes only.