Reliable output formatting

JSON output — schemas, not vibes

Structured extraction is the backbone of LLM features that touch databases, billing, and compliance. Treat the model as a probabilistic serializer: your job is to constrain the output space with provider APIs first, validate with typed models second, and retry once with the validation error third.

Three layers of JSON reliability

Layer	Mechanism	When to use
1. Prompt-only	“Return valid JSON only, no markdown”	Prototypes only—breaks under edge cases and model upgrades
2. JSON mode	response_format: {"type":"json_object"}	Valid JSON guaranteed; schema shape still your problem
3. Strict schema	json_schema with strict: true	Production extraction—required fields, enums, no extra keys

OpenAI’s strict JSON schema mode (see also APIs guide — structured output) constrains generation at decode time. Pair it with temperature=0 and explicit max_tokens sized for your largest valid object.

OpenAI strict schema

from openai import OpenAI
from pydantic import BaseModel, ValidationError
import json

class InvoiceLine(BaseModel):
    vendor: str
    amount_usd: float
    category: str  # enum in schema below

client = OpenAI()
schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "amount_usd": {"type": "number"},
        "category": {"type": "string", "enum": ["saas", "travel", "other"]},
    },
    "required": ["vendor", "amount_usd", "category"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=256,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice_line", "strict": True, "schema": schema},
    },
    messages=[{"role": "user", "content": raw_invoice_text}],
)
raw = resp.choices[0].message.content
try:
    line = InvoiceLine.model_validate_json(raw)
except ValidationError as e:
    raise ValueError(f"schema drift: {e}") from e

record InvoiceLine(String vendor, double amountUsd, String category) {}

BeanOutputConverter<InvoiceLine> converter =
    new BeanOutputConverter<>(InvoiceLine.class);

InvoiceLine line = chatClient.prompt()
    .options(ChatOptionsBuilder.builder()
        .withModel("gpt-4o-mini")
        .withTemperature(0.0)
        .withMaxTokens(256)
        .build())
    .user(u -> u.text(rawInvoiceText).param("format", converter.getFormat()))
    .call()
    .entity(InvoiceLine.class);

# Anthropic: tool_use with input_schema, or prompt + extract JSON block
import anthropic, json, re

client = anthropic.Anthropic()
msg = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    temperature=0,
    system="Return ONLY a JSON object matching the schema. No markdown.",
    tools=[{
        "name": "extract_invoice",
        "description": "Structured invoice line",
        "input_schema": schema,  # same JSON Schema as OpenAI
    }],
    tool_choice={"type": "tool", "name": "extract_invoice"},
    messages=[{"role": "user", "content": raw_invoice_text}],
)
for block in msg.content:
    if block.type == "tool_use":
        data = block.input  # already parsed dict
        line = InvoiceLine.model_validate(data)

// Spring AI AnthropicChatModel + ToolCallbacks or BeanOutputConverter
// Prefer tool_use with input_schema for Anthropic-native structured output
var tool = FunctionToolCallback.builder("extract_invoice", InvoiceLine.class)
    .description("Structured invoice line")
    .inputSchema(schemaJson)
    .build();

# Bedrock Converse — toolConfig with JSON schema (Claude on Bedrock)
import boto3, json

client = boto3.client("bedrock-runtime")
resp = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    toolConfig={
        "tools": [{
            "toolSpec": {
                "name": "extract_invoice",
                "inputSchema": {"json": schema},
            }
        }]
    },
    messages=[{"role": "user", "content": [{"text": raw_invoice_text}]}],
)

// BedrockConverseChatModel with tool definitions mirroring JSON Schema
ConverseResponse response = bedrockClient.converse(request -> request
    .modelId("anthropic.claude-3-5-sonnet-20241022-v2:0")
    .toolConfig(toolsWithSchema(schema))
    .messages(userMessage(rawInvoiceText)));

Instructor + Pydantic (repair loop built in)

Instructor wraps the OpenAI/Anthropic client, sends your Pydantic model as schema, and retries on validation failure with the error message injected— the pattern you would otherwise hand-roll.

import instructor
from pydantic import BaseModel, Field
from openai import OpenAI

class RefundDecision(BaseModel):
    eligible: bool
    reason: str = Field(max_length=200)
    policy_section: str

client = instructor.from_openai(OpenAI())
decision = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=RefundDecision,
    max_retries=2,
    messages=[
        {"role": "system", "content": "Extract refund eligibility from policy text."},
        {"role": "user", "content": f"Policy:\n{policy}\n\nTicket: {ticket}"},
    ],
)
# decision is a validated RefundDecision instance

// Java: Spring AI BeanOutputConverter + manual retry on JsonMappingException
for (int attempt = 0; attempt < 3; attempt++) {
    try {
        return chatClient.prompt()
            .user(buildExtractionPrompt(policy, ticket))
            .call()
            .entity(RefundDecision.class);
    } catch (NonTransientAiException e) {
        if (attempt == 2) throw e;
        // append validation error to next user message
    }
}

Spring AI structured output

Spring AI’s .entity(Class) and BeanOutputConverter embed format instructions in the prompt and parse the response. For OpenAI, prefer native response_format when available; use converter mode for providers without strict schema support or when migrating legacy prompts.

Map output — .entity(Map.class) for dynamic keys (fragile).
Record/POJO — compile-time types; Jackson validates after generation.
StructuredOutputChatOptions — wire OpenAI json_schema through Spring AI 1.0 options bean.

⚠️ Pitfall

Calling json.loads on raw text without schema validation. Models return "amount_usd": "49.99" (string), omit nullable fields, or add confidence you never defined. Always validate with Pydantic/Jackson; never pass unvalidated dicts to SQL or payment APIs.

⚖️ Trade-off

Strict schema reduces parse errors but increases latency and can refuse fuzzy-but-correct extractions from messy OCR. For noisy inputs, a two-step pipeline (free-text extract → normalize with strict schema) often beats one-shot strict mode.

💡 Pro Tip

Mirror your Pydantic model to JSON Schema in CI (model_json_schema()) and diff against the schema you send to the API—drift between code and provider schema is a common silent regression.

Markdown output — human-readable contracts

Not every response should be JSON. User-facing assistants, internal runbooks, and email drafts benefit from Markdown the model already knows how to emit—but only if you specify structure explicitly. Vague “use markdown” produces inconsistent heading levels, broken lists, and code fences that break your renderer.

When Markdown is the right format

Use Markdown	Use JSON / schema instead
Chat answers shown in a rich-text UI	Fields consumed by code (APIs, workflows, DB writes)
Docs, summaries, comparison tables for humans	Tool arguments, agent state, eval scoring
Code snippets with syntax highlighting	When you need guaranteed key presence

Prompt template for stable Markdown shape

Pin heading levels and section order in the system prompt—models follow templates better than adjectives.

Respond in Markdown using EXACTLY this structure:

## Summary
One paragraph (2–4 sentences).

## Steps
Numbered list; each step one actionable sentence.

## Code (optional)
If code is required, one fenced block with language tag, e.g. ```python

Do NOT use H1 (#). Do NOT wrap the entire response in a single code fence.
Do NOT add sections not listed above.

Headings and hierarchy

Fix levels — If your UI renders ## as section titles, forbid # in the prompt.
Limit depth — “Use at most ## and ###” prevents deep nesting that breaks mobile layouts.
Named sections — Required headings (“Summary”, “Evidence”, “Next steps”) enable regex or AST-based post-checks.

Code blocks

Models often emit triple-backtick fences without language tags or nest fences incorrectly. Specify: “Exactly one fenced code block per snippet; always include language tag (python, bash, json).” Sanitize before render—never pass model Markdown to dangerouslySetInnerHTML without a trusted parser.

import re

REQUIRED = ["## Summary", "## Steps"]

def validate_markdown_shape(text: str) -> list[str]:
    errors = []
    for heading in REQUIRED:
        if heading not in text:
            errors.append(f"missing section: {heading}")
    if text.strip().startswith("```") and text.strip().endswith("```"):
        errors.append("entire response wrapped in code fence")
    if re.search(r"^# [^#]", text, re.MULTILINE):
        errors.append("H1 heading used — forbidden")
    return errors

List<String> validateMarkdownShape(String text) {
    List<String> errors = new ArrayList<>();
    if (!text.contains("## Summary")) errors.add("missing Summary");
    if (!text.contains("## Steps")) errors.add("missing Steps");
    if (text.matches("(?s)^```.*```$"))
        errors.add("entire response in code fence");
    return errors;
}

📦 Real World

Notion and Slack-style assistants render Markdown client-side. Teams store raw model Markdown in the DB, run a lightweight validator before save, and re-prompt once if sections are missing—cheaper than letting broken layout reach users and support.

🔬 Under the Hood

Markdown is token-efficient for prose but ambiguous for machines—parsers differ on list indentation and HTML passthrough. For mixed human+machine consumers, use Markdown for display and a hidden JSON block in an HTML comment only in internal pipelines (not user-visible).

XML tags — delimiters Claude follows well

Anthropic’s documentation and Claude’s training heavily use XML-tagged sections. Tags like <thinking> and <answer> create clear boundaries for parsing, hiding chain-of-thought from users, and separating evidence from synthesis.

Common tag patterns

Tag	Purpose	Show to user?
<thinking>	Private reasoning, scratch work	No — strip before display
<answer>	Final user-facing text	Yes — extract and render
<context>	Retrieved chunks you inject	No — input only
<citation>	Evidence spans with ids	Parse for footnotes

System prompt with thinking / answer split

SYSTEM = """You must respond using exactly these tags:

<thinking>...your private reasoning...</thinking>
<answer>...user-facing answer only...</answer>

Rules:
- <thinking> may reference context; <answer> must not reveal tag names.
- If evidence is insufficient, say so inside <answer> only."""

msg = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=SYSTEM,
    messages=[{"role": "user", "content": user_question}],
)
raw = "".join(b.text for b in msg.content if b.type == "text")
answer = extract_tag(raw, "answer")  # see parser below

String system = """
You must respond using exactly these tags:
<thinking>...</thinking>
<answer>...</answer>
""";

ChatResponse r = anthropicChat.prompt()
    .system(system)
    .user(userQuestion)
    .call();
String raw = r.getResult().getOutput().getContent();
String answer = XmlTagParser.extract(raw, "answer");

# OpenAI models also follow XML tags when instructed — same parser
resp = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": user_question},
    ],
)
raw = resp.choices[0].message.content
answer = extract_tag(raw, "answer")

chatClient.prompt().system(system).user(userQuestion).call();
// Same XmlTagParser — provider-agnostic extraction

# Bedrock Claude — identical tag contract in system field
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "system": SYSTEM,
    "messages": [{"role": "user", "content": user_question}],
}

bedrockChat.prompt().system(system).user(userQuestion).call();

Robust tag parser

Regex alone fails on nested tags and unclosed tags. Use a non-greedy match for well-formed output, fail closed when tags are missing, and log raw text for eval—not for users.

import re

TAG_RE = re.compile(r"<(\\w+)>(.*?)</\\1>", re.DOTALL)

def extract_tag(text: str, name: str) -> str:
    for tag, body in TAG_RE.findall(text):
        if tag == name:
            return body.strip()
    raise FormatError(f"missing <{name}>...</{name}>")

def split_thinking_answer(text: str) -> tuple[str, str]:
    thinking = extract_tag(text, "thinking")  # optional: catch FormatError
    answer = extract_tag(text, "answer")
    return thinking, answer

public static String extract(String text, String name) {
    Pattern p = Pattern.compile(
        "<" + name + ">(.*?)</" + name + ">", Pattern.DOTALL);
    Matcher m = p.matcher(text);
    if (!m.find()) throw new FormatException("missing tag: " + name);
    return m.group(1).trim();
}

⚠️ Pitfall

Showing <thinking> to end users—leaks reasoning, policy hints, and retrieved PII. Always strip before SSE to the browser. Log thinking server-side with redaction if needed for debug.

🎯 Interview Tip

“How do you hide chain-of-thought?” — Mention tagged sections, server-side extraction, never relying on model self-censorship; optional distillation step that removes thinking before storage.

Length control — three levers that actually work

“Be concise” is not a length policy. Production systems combine explicit structure limits in the prompt, max_tokens as a hard ceiling, and post-generation validation—with different behavior when finish_reason is length.

Compare the levers

Technique	Effect	Reliability
“Be concise” / “briefly”	Vague shortening	Low — variance across models and temperatures
“Answer in exactly 3 sentences”	Structural cap on prose	Medium — good for summaries; count in validator
max_tokens	Hard token ceiling on completion	High — can truncate mid-JSON; always check finish_reason
Schema maxLength / Pydantic Field(max_length=)	Per-field bounds on structured output	High for JSON — combine with max_tokens headroom

Explicit sentence limits in prompt

SYSTEM = """Answer in EXACTLY 3 sentences. Number them 1. 2. 3.
Do not include a preamble or closing remark."""

def count_sentences(text: str) -> int:
    # naive split — replace with nltk/spacy for production locales
    parts = [p.strip() for p in re.split(r"[.!?]+", text) if p.strip()]
    return len(parts)

def validate_length(text: str, expected: int = 3) -> None:
    n = count_sentences(text)
    if n != expected:
        raise FormatError(f"expected {expected} sentences, got {n}")

max_tokens sizing

Size max_tokens from measured p95 completion length on golden evals, not guesses. For JSON, add ~20% headroom above largest valid object. When finish_reason == "length", do not parse partial JSON—retry with higher cap or return 422.

choice = resp.choices[0]
if choice.finish_reason == "length":
    raise TruncatedOutputError(
        "completion hit max_tokens — output may be invalid JSON or cut mid-sentence"
    )

if ("length".equals(response.getResult().getMetadata().getFinishReason())) {
    throw new TruncatedOutputException("max_tokens exceeded");
}

if msg.stop_reason == "max_tokens":
    raise TruncatedOutputError("Anthropic max_tokens — increase cap or shorten prompt")

// Anthropic: stopReason == StopReason.MAX_TOKENS

# Converse: stopReason == "max_tokens"
if resp["stopReason"] == "max_tokens":
    raise TruncatedOutputError("Bedrock truncation")

if (response.stopReason() == StopReason.MAX_TOKENS) { ... }

💰 Cost

Setting max_tokens too high “just in case” bills full completion length on some providers. Tune per route: ticket summaries 150 tokens, JSON extraction 256, long reports 1024—with metrics proving p99 fit.

⚖️ Trade-off

Sentence-count prompts improve readability but fight models that add bullet lists or numbered steps. Prefer “3 sentences OR a 3-item numbered list” if your UI accepts both—or enforce one format in the validator.

Citations — [1][2] grounded to retrieved context

RAG answers without citations are un-auditable. Numbered references ([1], [2]) tie each claim to a specific chunk you injected—so support can open the source doc and compliance can trace the answer.

Citation contract in prompt

Pre-number chunks in the context block. Instruct the model to cite inline and forbid citations to ids not in the list. See Retrieval strategies for how chunks arrive; here we focus on format.

def format_chunks_for_prompt(chunks: list[dict]) -> str:
    lines = []
    for i, c in enumerate(chunks, start=1):
        lines.append(f"[{i}] (doc={c['doc_id']}, page={c.get('page')})\n{c['text']}")
    return "\n\n".join(lines)

CITATION_RULES = """
Use inline citations [1], [2], ... matching the numbered sources above.
Every factual claim must have at least one citation.
Do NOT cite numbers outside 1..N. If sources are insufficient, say "I don't have evidence" without inventing citations.
"""

Parse and validate citations

import re

CITE_RE = re.compile(r"\\[(\\d+)\\]")

def parse_citations(text: str, max_source: int) -> set[int]:
    ids = {int(m) for m in CITE_RE.findall(text)}
    bad = ids - set(range(1, max_source + 1))
    if bad:
        raise FormatError(f"invalid citation ids: {bad}")
    return ids

def map_citations_to_chunks(ids: set[int], chunks: list[dict]) -> list[dict]:
    return [chunks[i - 1] for i in sorted(ids)]

Pattern CITE = Pattern.compile("\\\\[(\\\\d+)\\\\]");

Set<Integer> parseCitations(String text, int maxSource) {
    Set<Integer> ids = new TreeSet<>();
    Matcher m = CITE.matcher(text);
    while (m.find()) {
        int id = Integer.parseInt(m.group(1));
        if (id < 1 || id > maxSource)
            throw new FormatException("invalid citation: " + id);
        ids.add(id);
    }
    return ids;
}

UI mapping

Render [1] as superscript links to chunk metadata (title, page, URL).
Store citation_ids alongside the answer in your DB for ticket audit.
Eval: golden answers include expected citation set—flag answers with correct prose but wrong sources.

⚠️ Pitfall

Models cite confidently without matching text—[1] on a claim no chunk supports. Automated checks: embedding similarity between cited span and chunk, or LLM-as-judge on citation faithfulness (Track 5).

💡 Pro Tip

Duplicate chunk text under different numbers confuses models. Deduplicate by chunk_id before numbering; re-use the same [n] if the same passage appears twice in retrieval results.

Chain output — pipe through strict parsers

Multi-step LLM pipelines (classify → retrieve → answer → format) fail silently when step 2 accepts garbage from step 1. Treat each step’s output as an untrusted string until validated; never pass raw model text to the next prompt without parsing into typed objects.

Anti-pattern vs production pattern

Anti-pattern	Production pattern
step2_prompt = f"Intent: {step1_raw}\\n..."	intent = Intent.model_validate_json(step1_raw)
Retry entire chain on any error	Retry failed step only; cap total latency
Log full intermediate text to users on failure	Fail closed with generic error; log details server-side

Three-step chain with strict gates

from pydantic import BaseModel
from enum import Enum

class Intent(str, Enum):
    refund = "refund"
    billing = "billing"
    other = "other"

class IntentResult(BaseModel):
    intent: Intent
    confidence: float

class ExtractedFacts(BaseModel):
    order_id: str | None
    amount_usd: float | None

def run_chain(ticket: str) -> str:
    # Step 1 — strict JSON
    r1 = call_json(prompt_classify(ticket), IntentResult)
    if r1.confidence < 0.7:
        return "Escalating to human — low confidence."

    # Step 2 — only runs with validated intent
    r2 = call_json(prompt_extract(ticket, r1.intent), ExtractedFacts)

    # Step 3 — prose with citations; validate markdown shape
    raw = call_text(prompt_answer(ticket, r1, r2, chunks))
    errs = validate_markdown_shape(raw)
    if errs:
        raise FormatError(errs)
    return raw

public String runChain(String ticket) {
    IntentResult r1 = callJson(classifyPrompt(ticket), IntentResult.class);
    if (r1.confidence() < 0.7) return "Escalating to human — low confidence.";

    ExtractedFacts r2 = callJson(extractPrompt(ticket, r1), ExtractedFacts.class);

    String raw = callText(answerPrompt(ticket, r1, r2, chunks));
    List<String> errs = validateMarkdownShape(raw);
    if (!errs.isEmpty()) throw new FormatException(errs.toString());
    return raw;
}

Prevent error propagation

Typed boundaries — Pydantic/Jackson at every handoff; no stringly-typed pipelines.
Idempotent step keys — cache step 1 result by ticket hash so retries do not re-classify differently.
Circuit breaker — after N format failures on a route, fall back to template response and alert.
Version prompts per step — classify_v3, answer_v7 in logs for blame.

📦 Real World

Support automation at scale uses 4–7 LLM steps. Teams that skipped parsers saw “refund” intents drift into billing prompts after a model upgrade—fixed by JSON schema on step 1 and blocking the chain when validation fails.

🔬 Under the Hood

LangGraph / Temporal workflows encode the same idea: each node output is a typed state object. LLM nodes append to state only after parse success—graph execution stops on FormatError instead of polluting downstream nodes.

Production — formatting regression tests

Output format is a contract—test it like an API schema. Golden fixtures assert parse success, required fields, citation ids, and Markdown section presence; CI fails when a prompt or model change drops parse rate below threshold.

What to measure

Metric	Definition	Alert threshold (example)
Parse rate	% responses passing schema/tag/markdown validator	< 98% on golden set
Truncation rate	% with finish_reason length	> 2% on any route
Citation validity	% answers with in-range [n] only	< 95% for RAG routes
Repair retry rate	% needing Instructor/second attempt	Spike > 2× baseline

Golden formatting regression harness

"""Golden cases: input + expected schema / sections / citation ids."""
import json
from pathlib import Path

GOLDEN = Path("eval/golden/formatting_cases.jsonl")

def run_case(case: dict) -> dict:
    raw = invoke_route(case["route"], case["input"])
    result = {"id": case["id"], "ok": True, "errors": []}

    if case.get("expect_json"):
        try:
            obj = InvoiceLine.model_validate_json(raw)
            if case.get("required_fields"):
                for f in case["required_fields"]:
                    if getattr(obj, f) is None:
                        result["errors"].append(f"missing {f}")
        except Exception as e:
            result["ok"] = False
            result["errors"].append(str(e))

    if case.get("markdown_sections"):
        result["errors"].extend(validate_markdown_shape(raw))
        result["ok"] = result["ok"] and not result["errors"]

    if case.get("max_citation"):
        try:
            parse_citations(raw, case["max_citation"])
        except FormatError as e:
            result["ok"] = False
            result["errors"].append(str(e))

    return result

def main():
    cases = [json.loads(l) for l in GOLDEN.read_text().splitlines() if l.strip()]
    results = [run_case(c) for c in cases]
    parse_rate = sum(r["ok"] for r in results) / len(results)
    assert parse_rate >= 0.98, f"parse rate {parse_rate:.2%} below gate"
    print(f"format regression OK — {parse_rate:.2%} parse rate on {len(results)} cases")

if __name__ == "__main__":
    main()

// JUnit + golden JSONL — same gates as Python harness
@Test
void formattingRegressionGate() {
    List<GoldenCase> cases = loadJsonl("eval/golden/formatting_cases.jsonl");
    long ok = cases.stream().filter(this::runCase).count();
    double rate = (double) ok / cases.size();
    assertTrue(rate >= 0.98, "parse rate " + rate);
}

CI integration checklist

Trigger on changes under prompts/, eval/golden/, or model pin config.
Smoke subset (< 30s) on every PR; full golden nightly or pre-release.
Store raw failures as artifacts (redacted) for prompt engineers—compare before/after on regressions.
Block merge if parse rate drops > 1% vs main baseline (relative gate, not only absolute).
Tag releases with prompt_bundle_version and schema hash in observability.

Production readiness checklist

Every structured route uses provider strict schema or tool input_schema
Pydantic/Jackson validation before any side effect
finish_reason == length handled — no partial JSON parse
Markdown/XML validators on human-facing routes; thinking stripped from user stream
RAG routes validate citation ids against injected chunk count
Multi-step chains use typed handoffs — no raw string pipe
Formatting golden suite in CI with documented parse-rate gate
Dashboards: parse rate, truncation rate, repair retries by route and model

🎯 Interview Tip

“How do you prevent prompt changes from breaking production?” — Golden eval with parse-rate gates in CI, versioned prompt bundles, schema validation fail-closed, and canary routes with comparison to baseline metrics.

💰 Cost

Full golden runs on every PR get expensive. Cache embeddings for fixed inputs, use smaller models for format-only smoke tests, and reserve frontier models for weekly regression or changed routes only.