APIs, tokens & context

OpenAI API deep dive

OpenAI’s primary surface for text generation is POST /v1/chat/completions. The request body is an ordered messages array; the response includes choices[0].message and a usage block you should log on every call.

Messages array and roles

Each message has a role and content. Roles define who “spoke” in the conversation the model continues:

Role	Purpose	Production notes
system	High-level instructions, policies, output format	Pin version in config; treat as code—review in PRs. Some teams use a single system message; others split policy vs persona.
user	End-user or upstream agent input	Sanitize and length-limit before append. Never trust raw HTML/markdown from browsers without escaping in your UI.
assistant	Prior model turns in multi-turn chat	Store exactly what the API returned (including tool calls). Replaying edited summaries breaks tool loops.
tool	Results returned from function/tool execution	Pair with tool_call_id on the preceding assistant message. Covered in structured output below.

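OpenAI accepts more than one system message, but a single system message is the common pattern. The Chat Completions API is stateless: every request must carry whatever history you want the model to see. For chat products, keep the system prompt stable across turns and append only new user/assistant pairs to your stored history. Re-sending the full history on every call is correct but expensive; context management (later section) trims what you resend.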

Core request fields

Field	What it does	Typical production value
model	Which weights to run	Pin string per environment; log on every response
messages	Prompt context	System + sliding history + current user
temperature	Sampling randomness	0 for extraction; 0.3–0.7 for chat
max_tokens	Cap on completion length	Set explicitly—default can be large and costly
stream	Token-by-token SSE response	true for user-facing chat UIs
response_format	JSON mode / schema	{"type":"json_schema",...} for strict extraction

Non-streaming completion

from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=512,
    messages=[
        {"role": "system", "content": "Answer in one short paragraph. No markdown."},
        {"role": "user", "content": "What is exponential backoff?"},
    ],
)
choice = resp.choices[0]
usage = resp.usage
log = {
    "model": resp.model,
    "finish_reason": choice.finish_reason,
    "prompt_tokens": usage.prompt_tokens,
    "completion_tokens": usage.completion_tokens,
    "total_tokens": usage.total_tokens,
}
print(choice.message.content)
print(log)

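// Builder and getter names vary across Spring AI versions; adjust to the version you pin
ChatResponse response = chatClient.prompt()
    .options(ChatOptionsBuilder.builder()
        .withModel("gpt-4o-mini")
        .withTemperature(0.0)
        .withMaxTokens(512)
        .build())
    .system("Answer in one short paragraph. No markdown.")
    .user("What is exponential backoff?")
    .call()
    .chatResponse();  // call() alone returns a response spec, not a ChatResponse

// Spring AI exposes usage via metadata when provider supports it
var usage = response.getMetadata().getUsage();
System.out.println(response.getResult().getOutput().getContent());
System.out.println(usage.getPromptTokens() + " prompt / " + usage.getTotalTokens() + " total tokens");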

Streaming basics

With stream=True, OpenAI returns Server-Sent Events (SSE) chunks. Each chunk may contain delta.content—append until finish_reason is set (stop, length, tool_calls, etc.).

Time to first token (TTFT) — measure from request start to first non-empty delta; this is what users feel.
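Final usage — OpenAI sends a last chunk with usage only when you set stream_options={"include_usage": True}; that chunk has an empty choices list, so guard before indexing choices[0]. If usage is missing, count tokens client-side or accept approximate billing from provider dashboards.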
Partial failures — network drop mid-stream may leave half an answer in the UI; persist incrementally and offer retry.

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    stream=True,
    messages=[{"role": "user", "content": "List three retry strategies."}],
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    if delta:
        print(delta, end="", flush=True)

Flux<ChatResponse> flux = chatClient.prompt()
    .user("List three retry strategies.")
    .stream()
    .chatResponse();  // stream() alone returns a spec; chatResponse() yields the Flux

flux.map(r -> r.getResult().getOutput().getContent())
    .subscribe(token -> System.out.print(token));
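The TTFT and partial-failure points above can be sketched without any provider SDK. This is a minimal illustration, assuming your adapter already yields plain text deltas; consume_stream, on_delta, and the injectable clock are hypothetical names, not part of any SDK.

```python
import time
from typing import Callable, Iterable, List, Optional, Tuple


def consume_stream(
    deltas: Iterable[str],
    on_delta: Callable[[str], None],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[str, Optional[float]]:
    """Collect streamed text and measure time to the first non-empty delta."""
    started = clock()
    ttft: Optional[float] = None
    parts: List[str] = []
    for delta in deltas:
        if not delta:
            continue  # skip role-only or empty chunks
        if ttft is None:
            ttft = clock() - started  # first visible token: this is what users feel
        parts.append(delta)
        on_delta(delta)  # persist incrementally so a dropped stream keeps partial text
    return "".join(parts), ttft


saved: List[str] = []
text, ttft = consume_stream(["", "Retry ", "with ", "backoff."], saved.append)
print(text)
```

Injecting the clock keeps the measurement testable; in production you would pass nothing and let time.monotonic do the work.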

Usage fields and billing

Non-streaming responses include usage:

prompt_tokens — all input tokens (system + history + user)
completion_tokens — generated output
total_tokens — sum of the two; for cost estimates, price prompt_tokens and completion_tokens separately because input and output rates differ
prompt_tokens_details.cached_tokens — when prompt caching applies (long repeated system prompts)

⚠️ Pitfall

Ignoring finish_reason: "length". The model hit max_tokens mid-sentence— your JSON may be truncated and unparseable. Raise a user-visible error or retry with a higher cap; never silently treat partial output as success.
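One way to enforce this is to gate JSON parsing on finish_reason. The sketch below assumes you pass in the finish_reason and content you read from the response; TruncatedOutputError and parse_json_completion are hypothetical names.

```python
import json


class TruncatedOutputError(Exception):
    """Raised when the model stopped because it hit the completion cap."""


def parse_json_completion(finish_reason: str, content: str):
    """Parse model output as JSON only if the completion finished normally."""
    if finish_reason == "length":
        raise TruncatedOutputError("completion hit max_tokens; output is likely truncated")
    if finish_reason != "stop":
        raise ValueError(f"unexpected finish_reason: {finish_reason!r}")
    return json.loads(content)


result = parse_json_completion("stop", '{"vendor": "Acme"}')
print(result)
```

The caller decides whether TruncatedOutputError becomes a user-visible error or a retry with a higher cap.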

💡 Pro Tip

Pass a hashed end-user or tenant ID in the request’s user field (or attach it at your gateway), and store OpenAI’s x-request-id response header in logs. Support tickets without it are painful.

Anthropic Messages API

Anthropic’s POST /v1/messages endpoint is the canonical surface for Claude. The biggest shape difference from OpenAI: system is a top-level string (or array of blocks), not a message role.

Request shape

Field	OpenAI equivalent	Notes
system	role: system message	String or content blocks; supports prompt caching on long system text
messages	messages (user/assistant only)	Alternating user/assistant turns; first must be user
max_tokens	max_tokens	Required on Anthropic—no implicit default
temperature	same	0–1 range on most models
tools	tools + tool_choice	JSON schema per tool; model returns tool_use blocks

Response content is an array of blocks, not a single string: {"type":"text","text":"..."} or {"type":"tool_use",...}. Your parser must iterate blocks—assuming content[0].text breaks when tools fire.

import anthropic

client = anthropic.Anthropic()
msg = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    temperature=0,
    system="You are a billing support agent. Never invent refund amounts.",
    messages=[
        {"role": "user", "content": "Was my March invoice prorated?"},
    ],
)
for block in msg.content:
    if block.type == "text":
        print(block.text)
print("input:", msg.usage.input_tokens, "output:", msg.usage.output_tokens)

// Assumes anthropicChatClient is a Spring AI ChatClient built on an Anthropic chat model
// (API key supplied via configuration, e.g. the ANTHROPIC_API_KEY environment variable)
String answer = anthropicChatClient.prompt()
    .options(ChatOptionsBuilder.builder()
        .withModel("claude-3-5-sonnet-20241022")
        .withMaxTokens(1024)
        .withTemperature(0.0)
        .build())
    .system("You are a billing support agent. Never invent refund amounts.")
    .user("Was my March invoice prorated?")
    .call()
    .content();
System.out.println(answer);

# OpenAI equivalent — system inside messages array
client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a billing support agent..."},
        {"role": "user", "content": "Was my March invoice prorated?"},
    ],
)

chatClient.prompt()
    .system("You are a billing support agent...")
    .user("Was my March invoice prorated?")
    .call();

# Bedrock Claude — same Anthropic message schema in invoke body
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "system": "You are a billing support agent...",
    "messages": [{"role": "user", "content": "Was my March invoice prorated?"}],
}

// BedrockAnthropicChatModel maps to same message schema
bedrockChat.call(new Prompt("Was my March invoice prorated?"));

Beta headers and feature flags

Anthropic ships preview capabilities behind anthropic-beta request headers (e.g. prompt caching, computer use, extended context). In production:

Gate beta features behind config flags—provider can change or remove betas without a semver bump you control.
Log which betas were enabled per request; regressions often trace to a header added in staging only.
SDKs expose betas via default_headers or client options—centralize in your gateway.

Tool use overview

Define tools with name, description, and input_schema (JSON Schema). Claude may respond with tool_use blocks containing id, name, and input. You execute the tool, then send a user message with tool_result content blocks referencing the same tool_use_id.
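The echo step can be sketched with plain dicts shaped like the content blocks described above. This is an illustration under assumptions: build_tool_result_message and the get_invoice handler are hypothetical, and the blocks are hand-written rather than taken from a live response.

```python
from typing import Any, Callable, Dict, List


def build_tool_result_message(
    assistant_content: List[Dict[str, Any]],
    handlers: Dict[str, Callable[..., str]],
) -> Dict[str, Any]:
    """Run each requested tool and build the follow-up user message."""
    results = []
    for block in assistant_content:
        if block.get("type") != "tool_use":
            continue  # text blocks need no reply
        handler = handlers.get(block["name"])
        if handler is None:
            # Report unknown tools back to the model instead of crashing the loop
            results.append({
                "type": "tool_result",
                "tool_use_id": block["id"],
                "content": f"unknown tool: {block['name']}",
                "is_error": True,
            })
            continue
        results.append({
            "type": "tool_result",
            "tool_use_id": block["id"],  # must match the id of the tool_use block
            "content": handler(**block["input"]),
        })
    return {"role": "user", "content": results}


assistant_content = [
    {"type": "text", "text": "Let me look that up."},
    {"type": "tool_use", "id": "toolu_01", "name": "get_invoice", "input": {"month": "2024-03"}},
]
handlers = {"get_invoice": lambda month: f"Invoice for {month}: prorated"}
reply = build_tool_result_message(assistant_content, handlers)
print(reply)
```

Append the original assistant message and then this reply to the conversation before calling the model again.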

🔬 Under the Hood

Anthropic’s prompt caching marks blocks with cache_control. Cached prefix tokens bill at a reduced rate on hits—but you pay a write premium on first send. Worth it for 10K+ token system prompts reused across thousands of requests; not for one-off calls.
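A rough break-even check makes the write premium concrete. The multipliers and the per-token price below are illustrative assumptions, not quoted rates; check the current price list. The sketch also assumes every request after the first is a cache hit.

```python
def cached_prefix_cost(requests: int, prefix_tokens: int, price_per_token: float,
                       write_multiplier: float = 1.25,
                       read_multiplier: float = 0.10) -> float:
    """Cost of the shared prefix when the first request writes the cache and the rest hit it."""
    if requests <= 0:
        return 0.0
    base = prefix_tokens * price_per_token
    return base * write_multiplier + (requests - 1) * base * read_multiplier


def uncached_prefix_cost(requests: int, prefix_tokens: int, price_per_token: float) -> float:
    """Cost of resending the same prefix at the normal input rate every time."""
    return requests * prefix_tokens * price_per_token


PRICE = 3e-6  # hypothetical USD per input token
for n in (1, 2, 1000):
    print(n, cached_prefix_cost(n, 10_000, PRICE), uncached_prefix_cost(n, 10_000, PRICE))
```

Under these assumed multipliers a one-off call costs more with caching, while any reuse within the cache lifetime already comes out ahead.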

⚖️ Trade-off

Claude excels at long-context instruction following and XML-tagged prompts—but OpenAI’s ecosystem (JSON schema mode, wider third-party support) is often faster to integrate. Many teams standardize on one primary provider and keep a second for failover, not feature parity on day one.

🎯 Interview Tip

“How is Anthropic’s API different from OpenAI’s?” — System prompt placement, required max_tokens, content as typed blocks, separate input/output token fields. Mention tool_result echo pattern—shows you have shipped agents, not just chat.

AWS Bedrock: InvokeModel vs Converse API

Bedrock runs foundation models inside your AWS account boundary. You choose between model-specific InvokeModel (raw JSON bodies per vendor) and the unified Converse API (messages-shaped, closer to Anthropic/OpenAI ergonomics).

When to use which

API	Best for	Friction
InvokeModel / InvokeModelWithResponseStream	Exact vendor payload, legacy integrations, models only exposed via native schema	Different JSON per model family (Anthropic vs Meta vs Cohere)
Converse / ConverseStream	New app code, multi-model routing, consistent tool config	Not every Bedrock model supports Converse yet—check model card

Model IDs and inference profiles

Bedrock identifies models with IDs like anthropic.claude-3-5-sonnet-20241022-v2:0 or amazon.nova-pro-v1:0. Availability is region-specific, and some models can only be invoked through an inference profile ID rather than the bare model ID, so copy the ID from the Bedrock console for your target region. Cross-region inference profiles (where available) improve availability at the cost of routing transparency; log the profile ARN you used.

IAM auth (no API keys in env)

Production Bedrock calls use the AWS SDK credential chain: IAM role on ECS/EKS/Lambda, not long-lived access keys in .env. Minimum permissions typically include bedrock:InvokeModel (which also authorizes Converse) and bedrock:InvokeModelWithResponseStream for streaming calls, scoped to specific model ARNs.
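A scoped policy might look like the sketch below. It assumes foundation-model ARNs only; bedrock_invoke_policy and the Sid are hypothetical names, and calls routed through inference profiles would need their ARNs added as well. As far as the AWS documentation describes, Converse is authorized by bedrock:InvokeModel and ConverseStream by bedrock:InvokeModelWithResponseStream.

```python
import json
from typing import Dict, List


def bedrock_invoke_policy(region: str, model_ids: List[str]) -> Dict:
    """Build an IAM policy document allowing invocation of an allowlist of models."""
    resources = [f"arn:aws:bedrock:{region}::foundation-model/{m}" for m in model_ids]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "InvokeAllowlistedModels",
                "Effect": "Allow",
                "Action": [
                    "bedrock:InvokeModel",
                    "bedrock:InvokeModelWithResponseStream",
                ],
                "Resource": resources,
            }
        ],
    }


policy = bedrock_invoke_policy("us-east-1", ["anthropic.claude-3-5-sonnet-20241022-v2:0"])
print(json.dumps(policy, indent=2))
```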

import boto3

brt = boto3.client("bedrock-runtime", region_name="us-east-1")

# Converse API — unified messages shape
resp = brt.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    system=[{"text": "You are a concise assistant."}],
    messages=[
        {"role": "user", "content": [{"text": "Summarize VPC peering in two sentences."}]},
    ],
    inferenceConfig={"maxTokens": 256, "temperature": 0},
)
text = resp["output"]["message"]["content"][0]["text"]
usage = resp["usage"]
print(text, usage)

ConverseResponse response = bedrockRuntimeClient.converse(
    ConverseRequest.builder()
        .modelId("anthropic.claude-3-5-sonnet-20241022-v2:0")
        .system(SystemContentBlock.builder().text("You are a concise assistant.").build())
        .messages(Message.builder()
            .role(ConversationRole.USER)
            .content(ContentBlock.builder().text("Summarize VPC peering...").build())
            .build())
        .inferenceConfig(InferenceConfiguration.builder()
            .maxTokens(256)
            .temperature(0.0f)
            .build())
        .build());

# InvokeModel: Anthropic-native JSON body (older integrations)
import json

body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Summarize VPC peering..."}],
})
raw = brt.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=body,
    contentType="application/json",
)
# The response body is a stream; the parsed payload uses Anthropic's content-block format
result = json.loads(raw["body"].read())
print(result["content"][0]["text"])

# OpenAI — API key auth, not IAM
from openai import OpenAI
OpenAI().chat.completions.create(model="gpt-4o", messages=[...])

# Direct Anthropic — API key header
anthropic.Anthropic().messages.create(model="claude-3-5-sonnet-20241022", ...)

// OpenAI via Spring AI — API key in config
@Bean ChatClient openAiChatClient(OpenAiChatModel model) { ... }

Region considerations

Model availability — Sonnet may launch in us-east-1 weeks before eu-west-1.
Data residency — prompts stay in-region; cross-region inference is opt-in—read the model card.
Latency — run Bedrock in the same region as your app tier; cross-AZ is fine, cross-region adds RTT.
Quotas — Service Quotas console; request increases before launch traffic, not after 429 storms.

📦 Real World

Enterprises on AWS often mandate Bedrock so prompts never leave their compliance boundary and billing rides the existing AWS invoice. The tradeoff is model lag vs direct APIs and more IAM/CloudTrail plumbing—platform teams wrap Bedrock behind an internal gateway with per-team model allowlists.

⚠️ Pitfall

Hardcoding model IDs from a blog post. Bedrock IDs include version suffixes that change; deploy failures show ValidationException. Store IDs in config per environment and validate against ListFoundationModels in CI smoke tests.

💰 Cost

Bedrock pricing mirrors vendor list rates plus AWS markup on some models—but you also pay for VPC endpoints, CloudWatch logs, and cross-AZ data transfer. Total cost of ownership is infra + tokens, not tokens alone.

Google Gemini: GenerateContent

Gemini’s client surface is generateContent (REST or Google AI / Vertex SDK). Content is built from Part objects inside Content turns labeled user or model—not assistant.

Part types

Part type	Use	Production note
text	Plain prompt text	Most chat and extraction paths
inline_data	Base64 image/audio bytes	Watch payload size; prefer GCS URIs at scale
file_data	Reference uploaded file by URI	Vertex file API; lifecycle and TTL policies
function_call / function_response	Tool loop	Parallel function calls supported on newer models

System instruction

Pass system_instruction at the model config level (not as a user turn). On Vertex AI, the same concept appears in systemInstruction on GenerativeModel. Keep system text stable for implicit caching behavior on long prompts.

import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel(
    "gemini-1.5-flash",
    system_instruction="Reply in bullet points. Max 5 bullets.",
)
response = model.generate_content(
    "What should I log on every LLM request?",
    generation_config={"temperature": 0, "max_output_tokens": 512},
)
print(response.text)
print(response.usage_metadata)  # prompt_token_count, candidates_token_count

var generativeModel = new GenerativeModel(
    "gemini-1.5-flash",
    vertexAi);  // or Google AI client

var response = generativeModel.generateContent(
    ContentMaker.fromMultiModalData(
        "What should I log on every LLM request?"));

System.out.println(ResponseHandler.getText(response));

Google AI Studio vs Vertex AI

Surface	Auth	When
Google AI (AI Studio API key)	API key	Prototypes, side projects, fast spikes
Vertex AI	GCP IAM / service account	Production on GCP, VPC-SC, enterprise billing

Gemini 1.5 Pro’s million-token context is attractive for whole-document ingest—verify quality and latency on your documents. Long context does not replace RAG for freshness; it changes how you chunk, not whether you need retrieval.

⚖️ Trade-off

Gemini Flash is among the cheapest frontier-class options for high volume—but multi-cloud teams may skip it to avoid a fourth SDK/auth model. Standardize on a gateway (LiteLLM, Portkey, in-house) if you truly need four providers.

💡 Pro Tip

Gemini returns finish_reason enums like MAX_TOKENS and safety blocks—map them to user-visible errors. A silent empty text often means safety filter, not success.
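A small mapping keeps that decision in one place. The sketch below assumes you have already extracted the finish reason name and the text from the response; classify_finish and the message wording are hypothetical.

```python
from typing import Tuple

# User-facing messages for non-success finish reasons (wording is illustrative)
FINISH_REASON_MESSAGES = {
    "MAX_TOKENS": "The answer was cut off. Try a shorter request.",
    "SAFETY": "The response was blocked by a safety filter.",
    "RECITATION": "The response was blocked because it resembled existing content.",
}


def classify_finish(finish_reason: str, text: str) -> Tuple[str, str]:
    """Return ('ok', text) for a usable answer, otherwise ('error', message)."""
    if finish_reason == "STOP" and text.strip():
        return "ok", text
    if finish_reason in FINISH_REASON_MESSAGES:
        return "error", FINISH_REASON_MESSAGES[finish_reason]
    if not text.strip():
        return "error", "The model returned an empty response."
    return "error", f"Unexpected finish reason: {finish_reason}"


print(classify_finish("STOP", "Log model, tokens, latency."))
print(classify_finish("SAFETY", ""))
```

Treating an empty STOP as an error is a deliberate choice here: it is safer to surface it than to render a blank answer.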

Token counting before you send

Context exceeded errors are preventable. Count tokens before the API call, reserve output budget, and reject or trim upstream—do not discover overflow from a 400 in production traffic.

Why count locally?

Pre-flight validation — return 413/422 to your client with a clear message instead of burning provider quota.
Cost estimation — show “this paste is ~12K tokens” in admin UIs before users submit.
Dynamic model routing — if prompt > 100K, route to Gemini Pro; else Haiku.
Chunking for RAG — size chunks to target token counts, not character counts.

OpenAI: tiktoken (Python) and jtokkit (Java)

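import tiktoken

def count_openai_messages(model: str, messages: list[dict]) -> int:
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Model name unknown to this tiktoken version; fall back to the GPT-4o encoding
        enc = tiktoken.get_encoding("o200k_base")
    # Per OpenAI cookbook: add per-message overhead (~3-4 tokens each).
    # Approximate: tool definitions and images are not counted here.
    tokens = 0
    for m in messages:
        tokens += 4
        tokens += len(enc.encode(m.get("content") or ""))
    tokens += 2  # reply priming
    return tokens

# Example payload so the check below runs as written
messages = [
    {"role": "system", "content": "Answer in one short paragraph. No markdown."},
    {"role": "user", "content": "What is exponential backoff?"},
]
MODEL_CTX = 128_000
MAX_OUT = 1_024
prompt_tokens = count_openai_messages("gpt-4o", messages)
if prompt_tokens + MAX_OUT > MODEL_CTX:
    raise ValueError(f"Context overflow: {prompt_tokens} + {MAX_OUT} > {MODEL_CTX}")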

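EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
// getEncodingForModel(String) returns an Optional; fail loudly on unknown model names
Encoding enc = registry.getEncodingForModel("gpt-4o")
    .orElseThrow(() -> new IllegalArgumentException("Unknown model: gpt-4o"));

int tokens = 0;
for (Message m : messages) {
    tokens += 4;  // approximate per-message overhead
    tokens += enc.countTokens(m.getContent());
}
tokens += 2;  // reply priming

int modelCtx = 128_000;
int maxOut = 1_024;
if (tokens + maxOut > modelCtx) {
    throw new IllegalArgumentException("Context overflow: " + tokens);
}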

Anthropic counting API

Anthropic exposes a count_tokens endpoint (and SDK helpers) that mirrors their tokenizer— prefer this over tiktoken when calling Claude, especially with tool definitions and multimodal blocks in the payload.

import anthropic

client = anthropic.Anthropic()
user_text = "Was my March invoice prorated?"  # example input
count = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",
    system="Long system prompt...",
    messages=[{"role": "user", "content": user_text}],
)
print(count.input_tokens)

// Call Anthropic count API via HTTP from gateway, or use SDK when available
// Do not use CL100K tokenizer for Claude production gates

Pre-flight validation pattern

Check	Action if fail
prompt_tokens + max_tokens ≤ model_limit	422 with “shorten input” or trigger summarization
single user message ≤ per-field cap	Reject upload; prevent abuse
tool schema size ≤ budget	Trim tool list or split agent capabilities
estimated_cost ≤ tenant daily cap	429 from your API, not the provider’s
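The table can be turned into a single gate that runs before any provider call. This is a sketch under assumptions: PreflightInput and preflight are hypothetical names, the token counts come from whichever counter matches your provider, and the 413 for an oversized single message is one reasonable choice rather than something the table prescribes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PreflightInput:
    prompt_tokens: int
    max_tokens: int
    model_limit: int
    largest_user_message_tokens: int
    per_message_cap: int
    tool_schema_tokens: int
    tool_schema_budget: int
    estimated_cost_usd: float
    tenant_remaining_usd: float


def preflight(req: PreflightInput) -> Optional[Tuple[int, str]]:
    """Return (http_status, reason) for the first failed check, or None if all pass."""
    if req.prompt_tokens + req.max_tokens > req.model_limit:
        return 422, "shorten input or trigger summarization"
    if req.largest_user_message_tokens > req.per_message_cap:
        return 413, "single message exceeds per-field cap"
    if req.tool_schema_tokens > req.tool_schema_budget:
        return 422, "tool schemas exceed budget; trim the tool list"
    if req.estimated_cost_usd > req.tenant_remaining_usd:
        return 429, "tenant daily cost cap reached"
    return None


ok = PreflightInput(
    prompt_tokens=9_000, max_tokens=1_024, model_limit=128_000,
    largest_user_message_tokens=2_000, per_message_cap=8_000,
    tool_schema_tokens=1_200, tool_schema_budget=4_000,
    estimated_cost_usd=0.03, tenant_remaining_usd=5.00,
)
print(preflight(ok))
```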

🔬 Under the Hood

Tokenizers differ per vendor—tiktoken for GPT will mis-estimate Claude by several percent. That is fine for UI hints; it is not fine for hard gates on Claude without Anthropic’s counter. Unified gateways often maintain a provider → counter registry.

💰 Cost

Counting API calls are cheap relative to generation—but avoid calling count on every chat keystroke. Debounce client-side estimates; server-side exact count once per submit.

🎯 Interview Tip

“How do you prevent context length errors?” — Pre-flight token count, reserve max_tokens, sliding window on history, summarization fallback. Mention you never rely on char/4 in production for code or JSON payloads.

Context management

The context window is shared between instructions, history, retrieved docs, tools, and the answer. Production systems budget each bucket and evict deliberately—not by hope.

Token budget formula

Treat the window as a ledger:

available_for_prompt = model_limit − reserved_output − safety_margin

Always reserve max_tokens (your completion cap) before stuffing RAG chunks. A 128K window with 4K reserved output and 2K safety margin leaves 122K for everything else—in practice cap lower for latency.
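The ledger arithmetic is small enough to encode directly. A minimal sketch, with available_for_prompt as a hypothetical helper name:

```python
def available_for_prompt(model_limit: int, reserved_output: int, safety_margin: int) -> int:
    """Tokens left for system prompt, history, RAG chunks, and tool schemas."""
    available = model_limit - reserved_output - safety_margin
    if available <= 0:
        raise ValueError("reserved output and safety margin consume the whole window")
    return available


# The example from the text: 128K window, 4K reserved output, 2K safety margin
print(available_for_prompt(128_000, 4_000, 2_000))  # 122000
```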

Bucket	Eviction strategy	Keep
System prompt	Never evict in-session	Policies, format, tool rules
RAG chunks	Drop lowest rerank score	Top-K until budget full
Chat history	Sliding window or summary	Last N turns or rolling summary
Tool outputs	Truncate or summarize large JSON	IDs and fields the model needs next step

Sliding window

Keep the system prompt plus the last K user/assistant pairs. Simple and predictable; downside: user references something from 20 turns ago and the model forgets. Pair with explicit “session summary” injection when detecting topic shift.
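A minimal sliding window, assuming history is a list of alternating user/assistant message dicts; sliding_window is a hypothetical helper, and topic-shift detection is left out.

```python
from typing import Dict, List


def sliding_window(system: str, history: List[Dict[str, str]], k_pairs: int) -> List[Dict[str, str]]:
    """Keep the system prompt plus the last k_pairs user/assistant pairs."""
    recent = history[-2 * k_pairs:] if k_pairs > 0 else []
    # Never start the window on an assistant turn whose user message was cut off
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return [{"role": "system", "content": system}] + recent


history = [
    {"role": "user", "content": "u1"}, {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "u2"}, {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "u3"}, {"role": "assistant", "content": "a3"},
]
window = sliding_window("You are a support agent.", history, k_pairs=2)
print([m["content"] for m in window])
```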

Summarization

When history exceeds budget, call a cheap model to compress older turns into a bullet summary stored as a synthetic system or user preamble: “Earlier in this session: user asked about refunds; agent offered prorated credit.” Log summary model and prompt version—summaries can drop nuance and cause wrong answers if too aggressive.
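The compaction step can be sketched with the summarizer injected as a function, so the cheap-model call stays swappable. compact_history and the stub summarizer below are hypothetical; a real summarizer would call a model and should be logged with its prompt version.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def compact_history(history: List[Message], keep_last: int,
                    summarize: Callable[[List[Message]], str]) -> List[Message]:
    """Replace everything except the last keep_last messages with one summary preamble."""
    if len(history) <= keep_last:
        return list(history)  # nothing to compress
    older = history[:-keep_last] if keep_last > 0 else list(history)
    recent = history[-keep_last:] if keep_last > 0 else []
    preamble = {"role": "system", "content": f"Earlier in this session: {summarize(older)}"}
    return [preamble] + recent


history = [
    {"role": "user", "content": "Can I get a refund?"},
    {"role": "assistant", "content": "I can offer a prorated credit."},
    {"role": "user", "content": "How long does it take?"},
    {"role": "assistant", "content": "Three to five business days."},
]
# Stub summarizer for the example only
compacted = compact_history(history, keep_last=2, summarize=lambda turns: f"{len(turns)} earlier messages")
print(compacted[0]["content"])
```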

Selective memory and RAG preview

Long-term memory (user preferences, account facts) should not live as infinite chat history. Store structured facts in your DB; inject only relevant rows per turn—this is a preview of Track 2 RAG. Same for documents: retrieve top chunks per query instead of pasting whole PDFs because the window allows it.

def build_messages(system: str, history: list, rag_chunks: list[str],
                   user: str, budget: int, count_fn) -> list[dict]:
    msgs = [{"role": "system", "content": system}]
    # Reserve room for the current user message up front so it is never the part that overflows
    user_msg = {"role": "user", "content": user}
    if count_fn(msgs + [user_msg]) > budget:
        raise ValueError("system prompt + current user message exceed budget")

    # RAG: add highest-priority chunks (assumed pre-sorted) until budget
    for chunk in rag_chunks:
        trial = msgs + [{"role": "user", "content": f"Context:\n{chunk}"}]
        if count_fn(trial + [user_msg]) > budget:
            break
        msgs = trial

    # History: newest first until budget
    trimmed = []
    for turn in reversed(history):
        if count_fn(msgs + [turn] + trimmed + [user_msg]) > budget:
            break
        trimmed.insert(0, turn)

    return msgs + trimmed + [user_msg]

// Same algorithm in your MessageAssemblyService:
// fixed system → greedy RAG chunks → newest history → current user
List<Message> buildMessages(String system, Deque<Message> history,
    List<String> ragChunks, String user, int budget) { ... }

⚠️ Pitfall

Setting max_tokens to the model maximum while sending a 120K prompt. Some providers validate prompt tokens plus max_tokens against the context limit and reject the request outright; others accept it and cut the answer off when the window fills. Cap output to what the UX needs (256–2K for most chat; more for codegen with review).

📦 Real World

Support copilots typically run 8K–16K total context with aggressive history trimming—not full 200K windows. Long context is for batch doc analysis pipelines, not every ticket reply. Measure p95 prompt size in week one.

Structured output and tools

Free-form prose is fine for chat; downstream code needs JSON, enums, and tool calls you can parse without prayer. Combine provider modes with schema validation in your runtime.

JSON mode vs strict JSON schema

Mode	Provider	Guarantee
response_format: json_object	OpenAI	Valid JSON, not schema
response_format: json_schema (strict)	OpenAI	Conforms to schema; best for extraction
Tool use with single forced tool	Anthropic / OpenAI	Arguments shaped by tool schema
response_mime_type: application/json	Gemini	JSON output; combine with schema where supported

Pydantic and Instructor (Python)

import instructor
from openai import OpenAI
from pydantic import BaseModel

class Invoice(BaseModel):
    vendor: str
    amount_usd: float
    due_date: str

client = instructor.from_openai(OpenAI())
invoice = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Invoice,
    messages=[{"role": "user", "content": raw_invoice_text}],
)
# invoice is a validated Invoice instance — not raw JSON string

// Spring AI structured output — map to record/class
record Invoice(String vendor, double amountUsd, String dueDate) {}

Invoice invoice = chatClient.prompt()
    .user(rawInvoiceText)
    .call()
    .entity(Invoice.class);

OpenAI strict JSON schema

schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "amount_usd": {"type": "number"},
    },
    "required": ["vendor", "amount_usd"],
    "additionalProperties": False,
}
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice", "strict": True, "schema": schema},
    },
    messages=[{"role": "user", "content": raw_invoice_text}],
)

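BeanOutputConverter<Invoice> converter = new BeanOutputConverter<>(Invoice.class);
String format = converter.getFormat();

// The {format} placeholder must appear in the template for param("format", ...) to take effect
String json = chatClient.prompt()
    .user(u -> u.text("{invoice}\n\n{format}")
        .param("invoice", rawInvoiceText)
        .param("format", format))
    .call()
    .content();
Invoice invoice = converter.convert(json);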

Function / tool calling loop

Send tools array with schemas.
Model returns tool call(s) with arguments.
Execute tools server-side (with authz!).
Append tool results; call model again until text answer or max iterations.

Cap iterations (e.g. 5) and wall-clock time—runaway tool loops are a common outage mode. Track 3 goes deeper on agents; here, treat tools as typed RPC with schema validation on arguments.
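The four steps and both caps fit in a short loop. The sketch below uses a simplified, provider-neutral reply shape (a dict with content and tool_calls) that is an assumption for illustration, not any SDK's response type; run_tool_loop and fake_model are hypothetical, and authorization checks are omitted.

```python
import time
from typing import Any, Callable, Dict, List


def run_tool_loop(call_model: Callable[[List[Dict[str, Any]]], Dict[str, Any]],
                  tools: Dict[str, Callable[..., Any]],
                  messages: List[Dict[str, Any]],
                  max_iterations: int = 5,
                  deadline_s: float = 30.0,
                  clock: Callable[[], float] = time.monotonic) -> str:
    """Call the model, execute requested tools, and repeat until a text answer or a cap."""
    started = clock()
    messages = list(messages)  # do not mutate the caller's list
    for _ in range(max_iterations):
        if clock() - started > deadline_s:
            raise TimeoutError("tool loop exceeded wall-clock budget")
        reply = call_model(messages)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply["content"]  # plain text answer ends the loop
        messages.append({"role": "assistant", "content": reply.get("content"), "tool_calls": calls})
        for call in calls:
            fn = tools.get(call["name"])
            result = f"error: unknown tool {call['name']}" if fn is None else str(fn(**call["arguments"]))
            messages.append({"role": "tool", "tool_call_id": call["id"], "content": result})
    raise RuntimeError("tool loop hit max_iterations without a final answer")


def fake_model(messages):
    """Stand-in model: asks for one tool call, then answers from the tool result."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"content": f"Invoice total is {last['content']}.", "tool_calls": []}
    return {"content": None,
            "tool_calls": [{"id": "call_1", "name": "get_total", "arguments": {"invoice_id": "INV-7"}}]}


answer = run_tool_loop(fake_model, {"get_total": lambda invoice_id: 42.5},
                       [{"role": "user", "content": "Total for INV-7?"}])
print(answer)
```

Raising on the iteration cap, rather than returning the last partial state, makes runaway loops visible in error metrics.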

⚠️ Pitfall

Parsing JSON with json.loads and no schema validation. Models omit fields, use strings for numbers, or hallucinate enums. Always validate with Pydantic/Zod/Jackson and retry once with the validation error in the prompt—then fail closed.
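The validate, repair once, then fail closed sequence can be sketched with the standard library alone. validate_invoice is a hand-rolled stand-in for a Pydantic/Zod/Jackson model, and generate is a hypothetical callable that wraps your model call and returns the raw text.

```python
import json
from typing import Any, Callable, Dict


def validate_invoice(data: Any) -> Dict[str, Any]:
    """Check types strictly; strings are not silently coerced to numbers."""
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    vendor, amount = data.get("vendor"), data.get("amount_usd")
    if not isinstance(vendor, str) or not vendor:
        raise ValueError("vendor must be a non-empty string")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        raise ValueError("amount_usd must be a number")
    return {"vendor": vendor, "amount_usd": float(amount)}


def extract_with_repair(generate: Callable[[str], str], prompt: str) -> Dict[str, Any]:
    """Validate the output; on failure retry once with the error, then let the error propagate."""
    try:
        return validate_invoice(json.loads(generate(prompt)))
    except ValueError as first_error:  # json.JSONDecodeError is a ValueError subclass
        repair_prompt = (f"{prompt}\n\nYour previous output was invalid: {first_error}\n"
                         "Return corrected JSON only.")
        return validate_invoice(json.loads(generate(repair_prompt)))  # second failure is final


# Stubbed model replies: the first uses a string for a number, the second is valid
replies = iter(['{"vendor": "Acme", "amount_usd": "12.50"}', '{"vendor": "Acme", "amount_usd": 12.5}'])
invoice = extract_with_repair(lambda p: next(replies), "Extract the invoice as JSON.")
print(invoice)
```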

⚖️ Trade-off

Strict JSON schema reduces parse errors but increases latency and can refuse valid-but-fuzzy extractions. For messy OCR text, a two-step pipeline (free extract → normalize) sometimes beats one strict shot.

Error handling and retries

LLM APIs fail in predictable categories. Map HTTP status and provider error codes to retry, trim and retry, or fail fast—never blanket retry everything.

Error taxonomy

Status / code	Meaning	Strategy
429 rate limit	Too many requests or tokens/min	Exponential backoff + full jitter; honor Retry-After; reduce concurrency
400 context exceeded	Prompt + max_tokens > limit	Do not retry same payload—trim, summarize, or return 422 to client
400 invalid request	Bad schema, missing max_tokens (Anthropic)	Fix code; alert—non-retryable
401 / 403	Auth or IAM	Rotate credentials; non-retryable until fixed
500 / 502 / 503	Provider or network blip	Limited retries (2–3) with backoff; circuit breaker
Timeout (client)	No response in deadline	See streaming vs non-streaming below

429: exponential backoff with jitter

import random, time
from openai import RateLimitError, APIStatusError

# The OpenAI SDK also retries on its own (max_retries); set it to 0 if this wrapper owns retries
def call_with_retry(fn, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError as e:
            if attempt == max_attempts - 1:
                raise
            try:
                retry_after = float(e.response.headers.get("retry-after", 0))
            except ValueError:
                retry_after = 0.0  # header may be an HTTP date rather than seconds
            # Honor Retry-After as a floor, then add full jitter over the backoff cap
            time.sleep(retry_after + random.uniform(0, 2 ** attempt))
        except APIStatusError as e:
            if e.status_code in (500, 502, 503) and attempt < max_attempts - 1:
                time.sleep(random.uniform(0, 2 ** attempt))  # full jitter
            else:
                raise

// Resilience4j Retry: retryOnException(RateLimitException.class)
// exponentialBackoff with randomization; maxAttempts=5
// Do NOT retry ContextLengthExceededException

400 context exceeded

Retrying the same oversized prompt will never succeed. Pipeline:

Catch provider-specific error type / message substring.
Drop oldest history or lowest-score RAG chunks.
Optionally run emergency summarization.
Single retry with reduced payload—not an infinite loop.
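The pipeline above can be sketched as a wrapper that trims once and retries once. ContextLengthExceeded is a hypothetical exception your provider adapter would raise after matching the provider's error, and dropping the oldest half of non-system history is one simple trimming policy among several.

```python
from typing import Any, Callable, Dict, List


class ContextLengthExceeded(Exception):
    """Hypothetical error raised by your adapter for context-overflow 400s."""


def call_with_trim(send: Callable[[List[Dict[str, str]]], Any],
                   messages: List[Dict[str, str]],
                   min_history: int = 1) -> Any:
    """Send once; on overflow keep the newest half of non-system messages and retry exactly once."""
    try:
        return send(messages)
    except ContextLengthExceeded:
        system = [m for m in messages if m["role"] == "system"]
        rest = [m for m in messages if m["role"] != "system"]
        keep = max(min_history, len(rest) // 2)
        trimmed = system + rest[-keep:]
        if len(trimmed) == len(messages):
            raise  # nothing left to trim; surface the error
        return send(trimmed)  # a second overflow propagates to the caller


def fake_send(msgs):
    """Stand-in provider call that overflows above four messages."""
    if len(msgs) > 4:
        raise ContextLengthExceeded()
    return f"ok:{len(msgs)}"


conversation = [{"role": "system", "content": "s"}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"m{i}"} for i in range(6)
]
print(call_with_trim(fake_send, conversation))
```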

Timeout strategies: streaming vs non-streaming

Mode	Timeout behavior	Recommendation
Non-streaming	One deadline for entire completion	Set client timeout > provider SLA; return 504 with retry id
Streaming	Idle timeout between chunks + overall cap	Reset idle timer on each delta; kill if no byte for 30s

Streaming improves TTFT but complicates retries—you may have partially rendered UI text. Persist stream state; on failure offer “continue” or “retry from scratch” explicitly.

🔬 Under the Hood

Full jitter (sleep(random(0, cap))) prevents synchronized retry storms when many pods hit 429 at once. Without jitter, backoff waves amplify the outage you are trying to survive.

📦 Real World

Load tests that only mock 200 responses ship blind. Record production error mixes (mostly 429 and timeouts) and replay them in CI so retry budgets and circuit breakers are tuned before launch day traffic.

💡 Pro Tip

Expose a request_id to clients on failure—it maps to provider headers and your logs. Support can trace one bad session without turning on prompt logging globally.

What this looks like in production

Provider APIs are building blocks. Production is a gateway, pre-flight token gates, structured observability, and a checklist you can run before enabling the feature flag for 10% of users.

Gateway pattern (preview)

Client apps call your /v1/chat—not OpenAI directly. The gateway holds provider credentials, normalizes messages across vendors, enforces per-tenant rate limits, runs token counting, attaches trace IDs, and emits metrics. One place to rotate keys, swap models, and block prompt injection patterns.
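Message normalization is the least glamorous gateway job. As one illustration, the sketch below converts an OpenAI-style messages list into the Anthropic request shape described earlier. to_anthropic_request is a hypothetical helper; it handles plain text turns only and ignores tool messages.

```python
from typing import Any, Dict, List


def to_anthropic_request(messages: List[Dict[str, str]], max_tokens: int) -> Dict[str, Any]:
    """Lift system messages to the top-level system field and keep user/assistant turns."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    turns = [{"role": m["role"], "content": m["content"]}
             for m in messages if m["role"] in ("user", "assistant")]
    if not turns or turns[0]["role"] != "user":
        raise ValueError("the first non-system message must come from the user")
    request: Dict[str, Any] = {"max_tokens": max_tokens, "messages": turns}
    if system_parts:
        request["system"] = "\n\n".join(system_parts)
    return request


openai_style = [
    {"role": "system", "content": "You are a billing support agent."},
    {"role": "user", "content": "Was my March invoice prorated?"},
]
print(to_anthropic_request(openai_style, max_tokens=1024))
```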

Logging fields (every call)

Field	Why
request_id, provider request id	Support and vendor tickets
provider, model, model version	Debug behavior drift
input_tokens, output_tokens, cached_tokens	Cost per feature / tenant
ttft_ms, total_latency_ms	SLO dashboards
finish_reason, http_status, error_code	Alerting on anomaly rates
feature, tenant_id (hashed), prompt_version	Blame regressions on the right deploy
stream: true/false, retry_count	Tune timeout and backoff policies
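The fields in the table map naturally onto one structured record per call. A minimal sketch: LlmCallLog, hash_tenant, and the example values are all hypothetical, and the salted SHA-256 truncation is one simple way to avoid logging raw tenant IDs.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class LlmCallLog:
    request_id: str
    provider_request_id: Optional[str]
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    ttft_ms: Optional[int]
    total_latency_ms: int
    finish_reason: Optional[str]
    http_status: int
    error_code: Optional[str]
    feature: str
    tenant_hash: str
    prompt_version: str
    stream: bool
    retry_count: int


def hash_tenant(tenant_id: str, salt: str) -> str:
    """Stable, non-reversible tenant key for logs."""
    return hashlib.sha256((salt + tenant_id).encode("utf-8")).hexdigest()[:16]


record = LlmCallLog(
    request_id="req-123", provider_request_id="prov-abc", provider="openai",
    model="gpt-4o-mini", input_tokens=812, output_tokens=96, cached_tokens=0,
    ttft_ms=240, total_latency_ms=820, finish_reason="stop", http_status=200,
    error_code=None, feature="support-reply", tenant_hash=hash_tenant("tenant-42", "example-salt"),
    prompt_version="v7", stream=True, retry_count=0,
)
line = json.dumps(asdict(record), sort_keys=True)  # one JSON object per log line
print(line)
```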

Production readiness checklist

Pre-flight token count rejects or trims before provider call
max_tokens set per route; never unbounded
429/503 retry with jitter; 400 context errors trim—not blind retry
Streaming idle timeout + overall deadline tested under load
Structured output validated; one repair retry then fail closed
Secrets in vault / IAM roles—not keys in images
Provider and model pinned per environment; logged on every response
Golden eval suite in CI (Track 5) for prompt changes
Per-tenant cost caps with alerts at 80% daily budget

💰 Cost

Ship a dashboard on day one: cost = Σ(input_tokens × price_in + output_tokens × price_out) by feature. Teams that discover spend from the vendor invoice weeks later over-build then panic-cut context.
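The formula translates into a short aggregation over logged calls. The sketch assumes prices are expressed per million tokens; the model name and the prices in the example are placeholders, not real rates.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def cost_by_feature(calls: Iterable[Dict], prices: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Sum input and output token cost per feature.

    prices maps model -> (USD per 1M input tokens, USD per 1M output tokens).
    """
    totals: Dict[str, float] = defaultdict(float)
    for call in calls:
        price_in, price_out = prices[call["model"]]
        totals[call["feature"]] += (
            call["input_tokens"] * price_in + call["output_tokens"] * price_out
        ) / 1_000_000
    return dict(totals)


prices = {"model-a": (1.0, 2.0)}  # placeholder prices
calls = [
    {"feature": "support-reply", "model": "model-a", "input_tokens": 1_000_000, "output_tokens": 500_000},
    {"feature": "summaries", "model": "model-a", "input_tokens": 250_000, "output_tokens": 0},
]
print(cost_by_feature(calls, prices))
```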

🎯 Interview Tip

“Design an LLM API for your product.” — Start with gateway, auth, token budget, idempotent retry policy, observability fields, and provider abstraction. Mention you would not let mobile clients hold API keys—shows production instincts.