APIs, tokens & context
Production LLM features live behind HTTP APIs—not notebooks. This guide is a production-first tour of calling OpenAI, Anthropic, AWS Bedrock, and Google Gemini: message shapes, token budgets, structured output, and the error handling you need before users hit rate limits at 2 a.m.
After reading, you should be able to: call each provider’s chat API with correct message roles and streaming; count tokens before send and reserve max_tokens; manage context with sliding windows and summarization; return validated JSON and tool calls; implement 429/400/503 retry policies; and ship with a logging checklist and gateway sketch.
OpenAI API deep dive
OpenAI’s primary surface for text generation is POST /v1/chat/completions. The request body is an ordered messages array; the response includes choices[0].message and a usage block you should log on every call.
Messages array and roles
Each message has a role and content. Roles define who “spoke” in the conversation the model continues:
| Role | Purpose | Production notes |
|---|---|---|
| system | High-level instructions, policies, output format | Pin version in config; treat as code—review in PRs. Some teams use a single system message; others split policy vs persona. |
| user | End-user or upstream agent input | Sanitize and length-limit before append. Never trust raw HTML/markdown from browsers without escaping in your UI. |
| assistant | Prior model turns in multi-turn chat | Store exactly what you sent to the API (including tool calls). Replaying edited summaries breaks tool loops. |
| tool | Results returned from function/tool execution | Pair with tool_call_id on the preceding assistant message. Covered in structured output below. |
OpenAI accepts multiple system messages in practice, but most SDKs collapse to one. For chat products, keep the system prompt stable across turns and append only new user/assistant pairs. Re-sending the full history is correct but expensive; context management (later section) trims what you resend.
Core request fields
| Field | What it does | Typical production value |
|---|---|---|
| model | Which weights to run | Pin string per environment; log on every response |
| messages | Prompt context | System + sliding history + current user |
| temperature | Sampling randomness | 0 for extraction; 0.3–0.7 for chat |
| max_tokens | Cap on completion length | Set explicitly—default can be large and costly |
| stream | Token-by-token SSE response | true for user-facing chat UIs |
| response_format | JSON mode / schema | {"type":"json_schema",...} for strict extraction |
Non-streaming completion
from openai import OpenAI
client = OpenAI()
resp = client.chat.completions.create(
model="gpt-4o-mini",
temperature=0,
max_tokens=512,
messages=[
{"role": "system", "content": "Answer in one short paragraph. No markdown."},
{"role": "user", "content": "What is exponential backoff?"},
],
)
choice = resp.choices[0]
usage = resp.usage
log = {
"model": resp.model,
"finish_reason": choice.finish_reason,
"prompt_tokens": usage.prompt_tokens,
"completion_tokens": usage.completion_tokens,
"total_tokens": usage.total_tokens,
}
print(choice.message.content)
print(log)
ChatResponse response = chatClient.prompt()
.options(ChatOptionsBuilder.builder()
.withModel("gpt-4o-mini")
.withTemperature(0.0)
.withMaxTokens(512)
.build())
.system("Answer in one short paragraph. No markdown.")
.user("What is exponential backoff?")
.call();
// Spring AI exposes usage via metadata when provider supports it
var meta = response.getMetadata();
System.out.println(response.getResult().getOutput().getContent());
Streaming basics
With stream=True, OpenAI returns Server-Sent Events (SSE) chunks. Each chunk may contain delta.content—append until finish_reason is set (stop, length, tool_calls, etc.).
- Time to first token (TTFT) — measure from request start to first non-empty delta; this is what users feel.
- Final usage — OpenAI sends a last chunk with usage only when you set stream_options={"include_usage": True}; otherwise count tokens client-side or accept approximate billing from provider dashboards.
- Partial failures — network drop mid-stream may leave half an answer in the UI; persist incrementally and offer retry.
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    stream=True,
    messages=[{"role": "user", "content": "List three retry strategies."}],
)
for chunk in stream:
    if not chunk.choices:  # e.g. a usage-only final chunk
        continue
    delta = chunk.choices[0].delta.content or ""
    if delta:
        print(delta, end="", flush=True)
Flux<ChatResponse> flux = chatClient.prompt()
.user("List three retry strategies.")
.stream();
flux.map(r -> r.getResult().getOutput().getContent())
.subscribe(token -> System.out.print(token));
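The streaming notes above (TTFT, finish reason, partial failures) can be sketched without any SDK. This is a minimal sketch, assuming each chunk has already been reduced to a `(delta_text, finish_reason)` pair; `consume_stream` and the injectable `clock` are hypothetical names, not part of any provider library.

```python
import time
from typing import Callable, Iterable, Optional, Tuple

Chunk = Tuple[str, Optional[str]]  # (delta text, finish_reason or None)


def consume_stream(chunks: Iterable[Chunk],
                   clock: Callable[[], float] = time.monotonic) -> dict:
    """Accumulate deltas, record time to first token, keep the finish reason."""
    start = clock()
    ttft = None
    parts = []
    finish_reason = None
    for delta, reason in chunks:
        if delta:
            if ttft is None:
                ttft = clock() - start  # first non-empty delta
            parts.append(delta)
        if reason is not None:
            finish_reason = reason
    return {
        "text": "".join(parts),
        "ttft_s": ttft,
        "finish_reason": finish_reason,
        # No finish_reason means the stream ended early (for example a dropped connection).
        "complete": finish_reason is not None,
    }
```

A result with `complete` set to false is the case where the UI holds half an answer and should offer a retry.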
Usage fields and billing
Non-streaming responses include usage:
- prompt_tokens — all input tokens (system + history + user)
- completion_tokens — generated output
- total_tokens — sum; use for per-request cost estimate
- prompt_tokens_details.cached_tokens — when prompt caching applies (long repeated system prompts)
Ignoring finish_reason: "length". The model hit max_tokens mid-sentence, so your JSON may be truncated and unparseable. Raise a user-visible error or retry with a higher cap; never silently treat partial output as success.
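One way to make that rule hard to forget is to check the finish reason before parsing. The sketch below assumes the caller has already pulled `content` and `finish_reason` out of the response; `TruncatedOutputError` and `parse_json_completion` are hypothetical names.

```python
import json


class TruncatedOutputError(Exception):
    """Raised when the model stopped because it hit the completion cap."""


def parse_json_completion(content: str, finish_reason: str) -> dict:
    # Check the finish reason first: a "length" stop usually means the
    # JSON was cut off mid-object, so parsing it would only hide the cause.
    if finish_reason == "length":
        raise TruncatedOutputError("completion hit max_tokens; retry with a higher cap")
    return json.loads(content)
```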
Pass a hashed end-user or tenant ID in the request’s user field (set by your gateway, not the client), and store OpenAI’s x-request-id response header in logs; support tickets without it are painful.
Anthropic Messages API
Anthropic’s POST /v1/messages endpoint is the canonical surface for Claude. The biggest shape difference from OpenAI: system is a top-level string (or array of blocks), not a message role.
Request shape
| Field | OpenAI equivalent | Notes |
|---|---|---|
| system | role: system message | String or content blocks; supports prompt caching on long system text |
| messages | messages (user/assistant only) | Alternating user/assistant turns; first must be user |
| max_tokens | max_tokens | Required on Anthropic—no implicit default |
| temperature | same | 0–1 range on most models |
| tools | tools + tool_choice | JSON schema per tool; model returns tool_use blocks |
Response content is an array of blocks, not a single string: {"type":"text","text":"..."} or {"type":"tool_use",...}. Your parser must iterate blocks—assuming content[0].text breaks when tools fire.
import anthropic
client = anthropic.Anthropic()
msg = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
temperature=0,
system="You are a billing support agent. Never invent refund amounts.",
messages=[
{"role": "user", "content": "Was my March invoice prorated?"},
],
)
for block in msg.content:
    if block.type == "text":
        print(block.text)
print("input:", msg.usage.input_tokens, "output:", msg.usage.output_tokens)
var model = AnthropicChatModel.builder()
.apiKey(System.getenv("ANTHROPIC_API_KEY"))
.modelName("claude-3-5-sonnet-20241022")
.defaultSystem("You are a billing support agent. Never invent refund amounts.")
.maxTokens(1024)
.temperature(0.0)
.build();
ChatResponse r = model.call(new Prompt("Was my March invoice prorated?"));
System.out.println(r.getResult().getOutput().getContent());
# OpenAI equivalent — system inside messages array
client.chat.completions.create(
model="gpt-4o",
messages=[
{"role": "system", "content": "You are a billing support agent..."},
{"role": "user", "content": "Was my March invoice prorated?"},
],
)
chatClient.prompt()
.system("You are a billing support agent...")
.user("Was my March invoice prorated?")
.call();
# Bedrock Claude — same Anthropic message schema in invoke body
body = {
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 1024,
"system": "You are a billing support agent...",
"messages": [{"role": "user", "content": "Was my March invoice prorated?"}],
}
// BedrockAnthropicChatModel maps to same message schema
bedrockChat.call(new Prompt("Was my March invoice prorated?"));
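The system-prompt difference shown in these snippets is mechanical enough to automate in a gateway. The following is a rough sketch, assuming plain-text messages only; `to_anthropic_request` is a hypothetical helper, and tool or multimodal messages would need their own mapping.

```python
def to_anthropic_request(messages: list[dict], max_tokens: int) -> dict:
    """Split OpenAI-style messages into Anthropic's top-level system + messages."""
    system_parts, turns = [], []
    for m in messages:
        if m["role"] == "system":
            system_parts.append(m["content"])
        elif m["role"] in ("user", "assistant"):
            turns.append({"role": m["role"], "content": m["content"]})
        else:
            raise ValueError(f"unsupported role for this sketch: {m['role']!r}")
    if not turns or turns[0]["role"] != "user":
        raise ValueError("the first non-system message must come from the user")
    # max_tokens is always set because Anthropic requires it.
    request = {"max_tokens": max_tokens, "messages": turns}
    if system_parts:
        request["system"] = "\n\n".join(system_parts)
    return request
```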
Beta headers and feature flags
Anthropic ships preview capabilities behind anthropic-beta request headers (e.g. prompt caching, computer use, extended context). In production:
- Gate beta features behind config flags—provider can change or remove betas without a semver bump you control.
- Log which betas were enabled per request; regressions often trace to a header added in staging only.
- SDKs expose betas via default_headers or client options—centralize in your gateway.
Tool use overview
Define tools with name, description, and input_schema (JSON Schema). Claude may respond with tool_use blocks containing id, name, and input. You execute the tool, then send a user message with tool_result content blocks referencing the same tool_use_id.
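The echo step can be sketched with plain dictionaries shaped like the blocks described above. This is an illustration under assumptions: `build_tool_results` and the `handlers` mapping are hypothetical, and real handlers need authorization checks before they run.

```python
import json
from typing import Callable


def build_tool_results(content_blocks: list[dict],
                       handlers: dict[str, Callable[..., object]]) -> dict:
    """Run each tool_use block and return the follow-up user message."""
    results = []
    for block in content_blocks:
        if block.get("type") != "tool_use":
            continue  # text blocks need no reply
        handler = handlers.get(block["name"])
        if handler is None:
            results.append({"type": "tool_result", "tool_use_id": block["id"],
                            "content": f"unknown tool: {block['name']}",
                            "is_error": True})
            continue
        output = handler(**block["input"])
        # Every result carries the id of the tool_use block it answers.
        results.append({"type": "tool_result", "tool_use_id": block["id"],
                        "content": json.dumps(output)})
    return {"role": "user", "content": results}
```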
Anthropic’s prompt caching marks blocks with cache_control. Cached prefix tokens bill at a reduced rate on hits—but you pay a write premium on first send. Worth it for 10K+ token system prompts reused across thousands of requests; not for one-off calls.
Claude excels at long-context instruction following and XML-tagged prompts—but OpenAI’s ecosystem (JSON schema mode, wider third-party support) is often faster to integrate. Many teams standardize on one primary provider and keep a second for failover, not feature parity on day one.
“How is Anthropic’s API different from OpenAI’s?” — System prompt placement, required max_tokens, content as typed blocks, separate input/output token fields. Mention tool_result echo pattern—shows you have shipped agents, not just chat.
AWS Bedrock: InvokeModel vs Converse API
Bedrock runs foundation models inside your AWS account boundary. You choose between model-specific InvokeModel (raw JSON bodies per vendor) and the unified Converse API (messages-shaped, closer to Anthropic/OpenAI ergonomics).
When to use which
| API | Best for | Friction |
|---|---|---|
| InvokeModel / InvokeModelWithResponseStream | Exact vendor payload, legacy integrations, models only exposed via native schema | Different JSON per model family (Anthropic vs Meta vs Cohere) |
| Converse / ConverseStream | New app code, multi-model routing, consistent tool config | Not every Bedrock model supports Converse yet—check model card |
Model IDs and inference profiles
Bedrock identifies models with IDs like anthropic.claude-3-5-sonnet-20241022-v2:0 or amazon.nova-pro-v1:0. Availability is region-specific, and cross-region inference profiles carry their own region-prefixed IDs, so copy the ID from the Bedrock console for your target region. Cross-region inference profiles (where available) improve availability at the cost of routing transparency; log the profile ARN you used.
IAM auth (no API keys in env)
Production Bedrock calls use the AWS SDK credential chain: IAM role on ECS/EKS/Lambda, not long-lived access keys in .env. Minimum permissions typically include bedrock:InvokeModel (which also authorizes Converse) and bedrock:InvokeModelWithResponseStream for the streaming variants, scoped to specific model ARNs.
import boto3
brt = boto3.client("bedrock-runtime", region_name="us-east-1")
# Converse API — unified messages shape
resp = brt.converse(
modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
system=[{"text": "You are a concise assistant."}],
messages=[
{"role": "user", "content": [{"text": "Summarize VPC peering in two sentences."}]},
],
inferenceConfig={"maxTokens": 256, "temperature": 0},
)
text = resp["output"]["message"]["content"][0]["text"]
usage = resp["usage"]
print(text, usage)
ConverseResponse response = bedrockRuntimeClient.converse(
ConverseRequest.builder()
.modelId("anthropic.claude-3-5-sonnet-20241022-v2:0")
.system(SystemContentBlock.builder().text("You are a concise assistant.").build())
.messages(Message.builder()
.role(ConversationRole.USER)
.content(ContentBlock.builder().text("Summarize VPC peering...").build())
.build())
.inferenceConfig(InferenceConfiguration.builder()
.maxTokens(256)
.temperature(0.0f)
.build())
.build());
# InvokeModel — Anthropic-native JSON body (older integrations)
import json
body = json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 256,
"messages": [{"role": "user", "content": "Summarize VPC peering..."}],
})
raw = brt.invoke_model(
modelId="anthropic.claude-3-haiku-20240307-v1:0",
body=body,
contentType="application/json",
)
# OpenAI — API key auth, not IAM
from openai import OpenAI
OpenAI().chat.completions.create(model="gpt-4o", messages=[...])
# Direct Anthropic — API key header
anthropic.Anthropic().messages.create(model="claude-3-5-sonnet-20241022", ...)
// OpenAI via Spring AI — API key in config
@Bean ChatClient openAiChatClient(OpenAiChatModel model) { ... }
Region considerations
- Model availability — Sonnet may launch in us-east-1 weeks before eu-west-1.
- Data residency — prompts stay in-region; cross-region inference is opt-in—read the model card.
- Latency — run Bedrock in the same region as your app tier; cross-AZ is fine, cross-region adds RTT.
- Quotas — Service Quotas console; request increases before launch traffic, not after 429 storms.
Enterprises on AWS often mandate Bedrock so prompts never leave their compliance boundary and billing rides the existing AWS invoice. The tradeoff is model lag vs direct APIs and more IAM/CloudTrail plumbing—platform teams wrap Bedrock behind an internal gateway with per-team model allowlists.
Hardcoding model IDs from a blog post. Bedrock IDs include version suffixes that change; deploy failures show ValidationException. Store IDs in config per environment and validate against ListFoundationModels in CI smoke tests.
Bedrock pricing mirrors vendor list rates plus AWS markup on some models—but you also pay for VPC endpoints, CloudWatch logs, and cross-AZ data transfer. Total cost of ownership is infra + tokens, not tokens alone.
Google Gemini: GenerateContent
Gemini’s client surface is generateContent (REST or Google AI / Vertex SDK). Content is built from Part objects inside Content turns labeled user or model—not assistant.
Part types
| Part type | Use | Production note |
|---|---|---|
| text | Plain prompt text | Most chat and extraction paths |
| inline_data | Base64 image/audio bytes | Watch payload size; prefer GCS URIs at scale |
| file_data | Reference uploaded file by URI | Vertex file API; lifecycle and TTL policies |
| function_call / function_response | Tool loop | Parallel function calls supported on newer models |
System instruction
Pass system_instruction at the model config level (not as a user turn). On Vertex AI, the same concept appears in systemInstruction on GenerativeModel. Keep system text stable for implicit caching behavior on long prompts.
import os
import google.generativeai as genai
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel(
"gemini-1.5-flash",
system_instruction="Reply in bullet points. Max 5 bullets.",
)
response = model.generate_content(
"What should I log on every LLM request?",
generation_config={"temperature": 0, "max_output_tokens": 512},
)
print(response.text)
print(response.usage_metadata) # prompt_token_count, candidates_token_count
var generativeModel = new GenerativeModel(
"gemini-1.5-flash",
vertexAi); // or Google AI client
var response = generativeModel.generateContent(
ContentMaker.fromMultiModalData(
"What should I log on every LLM request?"));
System.out.println(ResponseHandler.getText(response));
Google AI Studio vs Vertex AI
| Surface | Auth | When |
|---|---|---|
| Google AI (AI Studio API key) | API key | Prototypes, side projects, fast spikes |
| Vertex AI | GCP IAM / service account | Production on GCP, VPC-SC, enterprise billing |
Gemini 1.5 Pro’s million-token context is attractive for whole-document ingest—verify quality and latency on your documents. Long context does not replace RAG for freshness; it changes how you chunk, not whether you need retrieval.
Gemini Flash is among the cheapest frontier-class options for high volume—but multi-cloud teams may skip it to avoid a fourth SDK/auth model. Standardize on a gateway (LiteLLM, Portkey, in-house) if you truly need four providers.
Gemini returns finish_reason enums like MAX_TOKENS and safety blocks—map them to user-visible errors. A silent empty text often means safety filter, not success.
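A small mapping keeps that translation in one place. The sketch below assumes the finish reason has been converted to its enum name as a string; the message wording and the `check_gemini_result` helper are hypothetical.

```python
# Finish reasons named in the text, plus RECITATION from the same enum.
USER_MESSAGES = {
    "MAX_TOKENS": "The answer was cut off. Try a shorter request.",
    "SAFETY": "The response was blocked by a safety filter.",
    "RECITATION": "The response was blocked because it repeated source material.",
}


def check_gemini_result(finish_reason: str, text: str) -> str:
    """Return text for a clean stop; raise with a user-facing message otherwise."""
    if finish_reason == "STOP" and text.strip():
        return text
    # Empty text with any finish reason is treated as a failure, never as success.
    message = USER_MESSAGES.get(finish_reason, "The model returned no usable answer.")
    raise RuntimeError(message)
```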
Token counting before you send
Context exceeded errors are preventable. Count tokens before the API call, reserve output budget, and reject or trim upstream—do not discover overflow from a 400 in production traffic.
Why count locally?
- Pre-flight validation — return 413/422 to your client with a clear message instead of burning provider quota.
- Cost estimation — show “this paste is ~12K tokens” in admin UIs before users submit.
- Dynamic model routing — if prompt > 100K, route to Gemini Pro; else Haiku.
- Chunking for RAG — size chunks to target token counts, not character counts.
OpenAI: tiktoken (Python) and jtokkit (Java)
import tiktoken

def count_openai_messages(model: str, messages: list[dict]) -> int:
    """Estimate prompt tokens for plain-text chat messages."""
    enc = tiktoken.encoding_for_model(model)
    # Per OpenAI cookbook: add per-message overhead (~3-4 tokens each,
    # covering message framing and the role name)
    tokens = 0
    for m in messages:
        tokens += 4
        tokens += len(enc.encode(m.get("content") or ""))
    tokens += 2  # reply priming
    return tokens

MODEL_CTX = 128_000
MAX_OUT = 1_024
messages = [{"role": "user", "content": "What is exponential backoff?"}]
prompt_tokens = count_openai_messages("gpt-4o", messages)
if prompt_tokens + MAX_OUT > MODEL_CTX:
    raise ValueError(f"Context overflow: {prompt_tokens} + {MAX_OUT} > {MODEL_CTX}")
EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
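// getEncodingForModel(String) returns an Optional in jtokkit; fail fast on unknown models
Encoding enc = registry.getEncodingForModel("gpt-4o").orElseThrow();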
int tokens = 0;
for (Message m : messages) {
tokens += 4;
tokens += enc.encode(m.getContent()).size();
}
tokens += 2;
int modelCtx = 128_000;
int maxOut = 1_024;
if (tokens + maxOut > modelCtx) {
throw new IllegalArgumentException("Context overflow: " + tokens);
}
Anthropic counting API
Anthropic exposes a count_tokens endpoint (and SDK helpers) that mirrors their tokenizer. Prefer this over tiktoken when calling Claude, especially with tool definitions and multimodal blocks in the payload.
import anthropic
client = anthropic.Anthropic()
count = client.messages.count_tokens(
model="claude-3-5-sonnet-20241022",
system="Long system prompt...",
messages=[{"role": "user", "content": user_text}],
)
print(count.input_tokens)
// Call Anthropic count API via HTTP from gateway, or use SDK when available
// Do not use CL100K tokenizer for Claude production gates
Pre-flight validation pattern
| Check | Action if fail |
|---|---|
| prompt_tokens + max_tokens ≤ model_limit | 422 with “shorten input” or trigger summarization |
| single user message ≤ per-field cap | Reject upload; prevent abuse |
| tool schema size ≤ budget | Trim tool list or split agent capabilities |
| estimated_cost ≤ tenant daily cap | 429 from your API, not the provider’s |
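The checks in this table can be expressed as one function that returns the first failure. This is a sketch under assumptions: the `Preflight` fields, the check order, and the 413 status for oversized fields are illustrative choices, and all counts are expected to come from a provider-appropriate counter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Preflight:
    prompt_tokens: int
    max_tokens: int
    model_limit: int
    user_message_tokens: int
    per_field_cap: int
    tool_schema_tokens: int
    tool_budget: int
    estimated_cost: float
    remaining_daily_budget: float


def preflight(p: Preflight) -> Optional[tuple[int, str]]:
    """Return (http_status, reason) for the first failed check, or None if all pass."""
    if p.user_message_tokens > p.per_field_cap:
        return 413, "message too large"
    if p.tool_schema_tokens > p.tool_budget:
        return 422, "tool list too large; trim tools"
    if p.prompt_tokens + p.max_tokens > p.model_limit:
        return 422, "shorten input or summarize history"
    if p.estimated_cost > p.remaining_daily_budget:
        return 429, "tenant daily cap reached"
    return None
```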
Tokenizers differ per vendor—tiktoken for GPT will mis-estimate Claude by several percent. That is fine for UI hints; it is not fine for hard gates on Claude without Anthropic’s counter. Unified gateways often maintain a provider → counter registry.
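A provider-to-counter registry can be very small. In this sketch the counters are injected callables (for example a tiktoken wrapper and a wrapper around Anthropic's counting endpoint); `make_registry` is a hypothetical name, and the only design decision shown is failing closed for an unregistered provider.

```python
from typing import Callable

Counter = Callable[[str], int]


def make_registry(counters: dict[str, Counter]) -> Callable[[str, str], int]:
    """Build a count(provider, text) function from per-provider counters."""
    def count(provider: str, text: str) -> int:
        if provider not in counters:
            # Fail closed: never silently fall back to another vendor's tokenizer.
            raise KeyError(f"no token counter registered for {provider!r}")
        return counters[provider](text)
    return count
```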
Counting API calls are cheap relative to generation—but avoid calling count on every chat keystroke. Debounce client-side estimates; server-side exact count once per submit.
“How do you prevent context length errors?” — Pre-flight token count, reserve max_tokens, sliding window on history, summarization fallback. Mention you never rely on char/4 in production for code or JSON payloads.
Context management
The context window is shared between instructions, history, retrieved docs, tools, and the answer. Production systems budget each bucket and evict deliberately—not by hope.
Token budget formula
Treat the window as a ledger:
available_for_prompt = model_limit − reserved_output − safety_margin
Always reserve max_tokens (your completion cap) before stuffing RAG chunks. A 128K window with 4K reserved output and 2K safety margin leaves 122K for everything else—in practice cap lower for latency.
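The ledger arithmetic is simple enough to encode directly, which also gives a place to reject impossible configurations. A minimal sketch, with `prompt_budget` as a hypothetical helper name:

```python
def prompt_budget(model_limit: int, reserved_output: int, safety_margin: int) -> int:
    """available_for_prompt = model_limit - reserved_output - safety_margin"""
    available = model_limit - reserved_output - safety_margin
    if available <= 0:
        raise ValueError("reserved output and margin exceed the model limit")
    return available


# The example from the text: 128K window, 4K reserved output, 2K margin.
print(prompt_budget(128_000, 4_000, 2_000))  # prints 122000
```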
| Bucket | Eviction strategy | Keep |
|---|---|---|
| System prompt | Never evict in-session | Policies, format, tool rules |
| RAG chunks | Drop lowest rerank score | Top-K until budget full |
| Chat history | Sliding window or summary | Last N turns or rolling summary |
| Tool outputs | Truncate or summarize large JSON | IDs and fields the model needs next step |
Sliding window
Keep the system prompt plus the last K user/assistant pairs. Simple and predictable; downside: user references something from 20 turns ago and the model forgets. Pair with explicit “session summary” injection when detecting topic shift.
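A pair-based window might look like the sketch below. It assumes history is a flat list of alternating user and assistant messages; `sliding_window` is a hypothetical helper, and it does not handle tool messages, which must stay attached to the assistant turn that requested them.

```python
def sliding_window(system: dict, history: list[dict], k_pairs: int) -> list[dict]:
    """Keep the system message plus the last k user/assistant pairs."""
    if k_pairs <= 0:
        return [system]
    recent = history[-2 * k_pairs:]
    # Drop a leading assistant turn so the window starts on a user message.
    if recent and recent[0]["role"] == "assistant":
        recent = recent[1:]
    return [system] + recent
```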
Summarization
When history exceeds budget, call a cheap model to compress older turns into a bullet summary stored as a synthetic system or user preamble: “Earlier in this session: user asked about refunds; agent offered prorated credit.” Log summary model and prompt version—summaries can drop nuance and cause wrong answers if too aggressive.
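The compression step can be sketched with the summarizer injected as a callable, so the cheap-model call stays outside the function. `compress_history`, `count_fn`, `summarize`, and `keep_recent` are hypothetical names; the summary is stored as a user preamble, one of the two options mentioned above.

```python
from typing import Callable


def compress_history(history: list[dict], budget: int,
                     count_fn: Callable[[list[dict]], int],
                     summarize: Callable[[list[dict]], str],
                     keep_recent: int = 4) -> list[dict]:
    """Replace older turns with one summary message when history exceeds budget."""
    if count_fn(history) <= budget or len(history) <= keep_recent:
        return history
    split = len(history) - keep_recent
    older, recent = history[:split], history[split:]
    summary = {"role": "user",
               "content": "Earlier in this session: " + summarize(older)}
    return [summary] + recent
```

In a real system `summarize` would call the cheap model and log its model name and prompt version alongside the result.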
Selective memory and RAG preview
Long-term memory (user preferences, account facts) should not live as infinite chat history. Store structured facts in your DB; inject only relevant rows per turn—this is a preview of Track 2 RAG. Same for documents: retrieve top chunks per query instead of pasting whole PDFs because the window allows it.
def build_messages(system: str, history: list, rag_chunks: list[str],
                   user: str, budget: int, count_fn) -> list[dict]:
    msgs = [{"role": "system", "content": system}]
    # The current user message is always sent, so count it in every check
    current = [{"role": "user", "content": user}]
    # RAG: add highest-priority chunks until budget
    for chunk in rag_chunks:
        trial = msgs + [{"role": "user", "content": f"Context:\n{chunk}"}]
        if count_fn(trial + current) > budget:
            break
        msgs = trial
    # History: newest first until budget
    trimmed = []
    for turn in reversed(history):
        if count_fn(msgs + [turn] + trimmed + current) > budget:
            break
        trimmed.insert(0, turn)
    return msgs + trimmed + current
// Same algorithm in your MessageAssemblyService:
// fixed system → greedy RAG chunks → newest history → current user
List<Message> buildMessages(String system, Deque<Message> history,
List<String> ragChunks, String user, int budget) { ... }
Setting max_tokens to the model maximum while sending a 120K prompt. Some providers count the reserved max_tokens in request validation and rate limiting even if the model never uses it; worse, you leave no room in the window and get truncated answers. Cap output to what the UX needs (256–2K for most chat; more for codegen with review).
Support copilots typically run 8K–16K total context with aggressive history trimming—not full 200K windows. Long context is for batch doc analysis pipelines, not every ticket reply. Measure p95 prompt size in week one.
Structured output and tools
Free-form prose is fine for chat; downstream code needs JSON, enums, and tool calls you can parse without prayer. Combine provider modes with schema validation in your runtime.
JSON mode vs strict JSON schema
| Mode | Provider | Guarantee |
|---|---|---|
| response_format: json_object | OpenAI | Valid JSON, not schema |
| response_format: json_schema (strict) | OpenAI | Conforms to schema; best for extraction |
| Tool use with single forced tool | Anthropic / OpenAI | Arguments shaped by tool schema |
| response_mime_type: application/json | Gemini | JSON output; combine with schema where supported |
Pydantic and Instructor (Python)
import instructor
from openai import OpenAI
from pydantic import BaseModel
class Invoice(BaseModel):
    vendor: str
    amount_usd: float
    due_date: str
client = instructor.from_openai(OpenAI())
invoice = client.chat.completions.create(
model="gpt-4o-mini",
response_model=Invoice,
messages=[{"role": "user", "content": raw_invoice_text}],
)
# invoice is a validated Invoice instance — not raw JSON string
// Spring AI structured output — map to record/class
record Invoice(String vendor, double amountUsd, String dueDate) {}
Invoice invoice = chatClient.prompt()
.user(rawInvoiceText)
.call()
.entity(Invoice.class);
OpenAI strict JSON schema
schema = {
"type": "object",
"properties": {
"vendor": {"type": "string"},
"amount_usd": {"type": "number"},
},
"required": ["vendor", "amount_usd"],
"additionalProperties": False,
}
resp = client.chat.completions.create(
model="gpt-4o-mini",
temperature=0,
response_format={
"type": "json_schema",
"json_schema": {"name": "invoice", "strict": True, "schema": schema},
},
messages=[{"role": "user", "content": raw_invoice_text}],
)
BeanOutputConverter<Invoice> converter = new BeanOutputConverter<>(Invoice.class);
String format = converter.getFormat();
chatClient.prompt()
.user(u -> u.text(rawInvoiceText).param("format", format))
.call()
.entity(Invoice.class);
Function / tool calling loop
- Send tools array with schemas.
- Model returns tool call(s) with arguments.
- Execute tools server-side (with authz!).
- Append tool results; call model again until text answer or max iterations.
Cap iterations (e.g. 5) and wall-clock time—runaway tool loops are a common outage mode. Track 3 goes deeper on agents; here, treat tools as typed RPC with schema validation on arguments.
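The loop and its two caps can be sketched independently of any provider. Everything here is an assumption for illustration: `call_model` is an injected callable returning a normalized reply (`{"tool_call": ...}` or `{"tool_call": None, "text": ...}`), and the message shapes are not any vendor's wire format.

```python
import time
from typing import Callable


def run_tool_loop(call_model: Callable[[list[dict]], dict],
                  tools: dict[str, Callable[..., str]],
                  messages: list[dict],
                  max_iterations: int = 5,
                  deadline_s: float = 30.0) -> str:
    """Call the model until it returns text, a cap is hit, or time runs out."""
    started = time.monotonic()
    for _ in range(max_iterations):
        if time.monotonic() - started > deadline_s:
            raise TimeoutError("tool loop exceeded wall-clock deadline")
        reply = call_model(messages)
        call = reply.get("tool_call")
        if call is None:
            return reply["text"]
        if call["name"] not in tools:
            raise ValueError(f"model requested unknown tool {call['name']!r}")
        # Authorization and argument validation belong here in real code.
        result = tools[call["name"]](**call["arguments"])
        messages = messages + [
            {"role": "assistant", "tool_call": call},
            {"role": "tool", "name": call["name"], "content": result},
        ]
    raise RuntimeError("tool loop hit max_iterations without a final answer")
```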
Parsing JSON with json.loads and no schema validation. Models omit fields, use strings for numbers, or hallucinate enums. Always validate with Pydantic/Zod/Jackson and retry once with the validation error in the prompt—then fail closed.
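The validate, repair once, then fail closed pattern can be sketched with the standard library alone. Here `validate_invoice` is a hand-written stand-in for a Pydantic model, and `call_model` is a hypothetical callable that takes a prompt and returns the raw reply string.

```python
import json
from typing import Callable


def validate_invoice(data: object) -> dict:
    """Stand-in for a schema library: check required fields and types."""
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    if not isinstance(data.get("vendor"), str):
        raise ValueError("vendor must be a string")
    amount = data.get("amount_usd")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        raise ValueError("amount_usd must be a number")
    return {"vendor": data["vendor"], "amount_usd": float(amount)}


def extract_with_repair(call_model: Callable[[str], str], prompt: str) -> dict:
    """Validate the reply; retry once with the error in the prompt, then fail closed."""
    try:
        return validate_invoice(json.loads(call_model(prompt)))
    except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
        repair_prompt = (f"{prompt}\n\nYour previous reply was invalid: {err}. "
                         "Return corrected JSON only.")
        # A second failure propagates to the caller: no further retries.
        return validate_invoice(json.loads(call_model(repair_prompt)))
```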
Strict JSON schema reduces parse errors but increases latency and can refuse valid-but-fuzzy extractions. For messy OCR text, a two-step pipeline (free extract → normalize) sometimes beats one strict shot.
Error handling and retries
LLM APIs fail in predictable categories. Map HTTP status and provider error codes to retry, trim and retry, or fail fast—never blanket retry everything.
Error taxonomy
| Status / code | Meaning | Strategy |
|---|---|---|
| 429 rate limit | Too many requests or tokens/min | Exponential backoff + full jitter; honor Retry-After; reduce concurrency |
| 400 context exceeded | Prompt + max_tokens > limit | Do not retry same payload—trim, summarize, or return 422 to client |
| 400 invalid request | Bad schema, missing max_tokens (Anthropic) | Fix code; alert—non-retryable |
| 401 / 403 | Auth or IAM | Rotate credentials; non-retryable until fixed |
| 500 / 502 / 503 | Provider or network blip | Limited retries (2–3) with backoff; circuit breaker |
| Timeout (client) | No response in deadline | See streaming vs non-streaming below |
429: exponential backoff with jitter
import random, time
from openai import RateLimitError, APIStatusError

def call_with_retry(fn, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError as e:
            if attempt == max_attempts - 1:
                raise
            retry_after = float(e.response.headers.get("retry-after", 0))
            jittered = random.uniform(0, 2 ** attempt)  # full jitter on cap
            # Honor Retry-After: never wake up before the server asked us to
            time.sleep(max(retry_after, jittered))
        except APIStatusError as e:
            if e.status_code in (500, 502, 503) and attempt < max_attempts - 1:
                time.sleep(2 ** attempt + random.random())
            else:
                raise
// Resilience4j Retry: retryOnException(RateLimitException.class)
// exponentialBackoff with randomization; maxAttempts=5
// Do NOT retry ContextLengthExceededException
400 context exceeded
Retrying the same oversized prompt will never succeed. Pipeline:
- Catch provider-specific error type / message substring.
- Drop oldest history or lowest-score RAG chunks.
- Optionally run emergency summarization.
- Single retry with reduced payload—not an infinite loop.
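The pipeline above can be sketched as a wrapper that trims and retries exactly once. `ContextExceeded` is a hypothetical normalized exception that a gateway would raise after matching the provider's error, and `send` is an injected callable that performs the actual request.

```python
from typing import Callable


class ContextExceeded(Exception):
    """Hypothetical normalized error for a provider's context-length 400."""


def call_with_trim(send: Callable[[list[dict]], str],
                   system: dict, history: list[dict], user: dict,
                   drop_oldest: int = 2) -> str:
    """Try once; on context overflow drop the oldest turns and retry a single time."""
    try:
        return send([system] + history + [user])
    except ContextExceeded:
        trimmed = history[drop_oldest:]
        # A second ContextExceeded propagates: the caller should return 422.
        return send([system] + trimmed + [user])
```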
Timeout strategies: streaming vs non-streaming
| Mode | Timeout behavior | Recommendation |
|---|---|---|
| Non-streaming | One deadline for entire completion | Set client timeout > provider SLA; return 504 with retry id |
| Streaming | Idle timeout between chunks + overall cap | Reset idle timer on each delta; kill if no byte for 30s |
Streaming improves TTFT but complicates retries—you may have partially rendered UI text. Persist stream state; on failure offer “continue” or “retry from scratch” explicitly.
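The idle timer plus overall cap from the table can be sketched with asyncio. This assumes the stream is exposed as an async iterator of text chunks; `read_with_idle_timeout` is a hypothetical helper, and a production version would also persist the partial text before raising.

```python
import asyncio
from typing import AsyncIterator


async def read_with_idle_timeout(stream: AsyncIterator[str],
                                 idle_timeout_s: float,
                                 overall_timeout_s: float) -> str:
    """Collect chunks; fail if one chunk is late or the whole stream runs long."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + overall_timeout_s
    parts = []
    iterator = stream.__aiter__()
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise TimeoutError("overall deadline exceeded")
        try:
            # The idle timer restarts for every chunk.
            chunk = await asyncio.wait_for(iterator.__anext__(),
                                           timeout=min(idle_timeout_s, remaining))
        except StopAsyncIteration:
            return "".join(parts)
        except asyncio.TimeoutError:
            raise TimeoutError("stream stalled or overall deadline exceeded") from None
        parts.append(chunk)
```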
Full jitter (sleep(random(0, cap))) prevents synchronized retry storms when many pods hit 429 at once. Without jitter, backoff waves amplify the outage you are trying to survive.
Load tests that only mock 200 responses ship blind. Record production error mixes (mostly 429 and timeouts) and replay them in CI so retry budgets and circuit breakers are tuned before launch day traffic.
Expose a request_id to clients on failure—it maps to provider headers and your logs. Support can trace one bad session without turning on prompt logging globally.
What this looks like in production
Provider APIs are building blocks. Production is a gateway, pre-flight token gates, structured observability, and a checklist you can run before enabling the feature flag for 10% of users.
Gateway pattern (preview)
Client apps call your /v1/chat—not OpenAI directly. The gateway holds provider credentials, normalizes messages across vendors, enforces per-tenant rate limits, runs token counting, attaches trace IDs, and emits metrics. One place to rotate keys, swap models, and block prompt injection patterns.
Logging fields (every call)
| Field | Why |
|---|---|
| request_id, provider request id | Support and vendor tickets |
| provider, model, model version | Debug behavior drift |
| input_tokens, output_tokens, cached_tokens | Cost per feature / tenant |
| ttft_ms, total_latency_ms | SLO dashboards |
| finish_reason, http_status, error_code | Alerting on anomaly rates |
| feature, tenant_id (hashed), prompt_version | Blame regressions on the right deploy |
| stream: true/false, retry_count | Tune timeout and backoff policies |
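These fields fit naturally into one structured record per call. The sketch below assumes JSON-lines logging; the `LlmCallLog` class and its exact field names are illustrative, chosen to mirror the table.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class LlmCallLog:
    request_id: str
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    ttft_ms: Optional[float]
    total_latency_ms: float
    finish_reason: Optional[str]
    http_status: int
    feature: str
    tenant_id_hash: str
    prompt_version: str
    stream: bool
    retry_count: int = 0
    cached_tokens: int = 0
    provider_request_id: Optional[str] = None
    error_code: Optional[str] = None


def to_log_line(record: LlmCallLog) -> str:
    """Serialize one call as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```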
Production readiness checklist
- Pre-flight token count rejects or trims before provider call
- max_tokens set per route; never unbounded
- 429/503 retry with jitter; 400 context errors trim—not blind retry
- Streaming idle timeout + overall deadline tested under load
- Structured output validated; one repair retry then fail closed
- Secrets in vault / IAM roles—not keys in images
- Provider and model pinned per environment; logged on every response
- Golden eval suite in CI (Track 5) for prompt changes
- Per-tenant cost caps with alerts at 80% daily budget
Ship a dashboard on day one: cost = Σ(input_tokens × price_in + output_tokens × price_out) by feature. Teams that discover spend from the vendor invoice weeks later over-build then panic-cut context.
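The cost formula translates directly into an aggregation over call logs. In this sketch the per-token prices are placeholders supplied by the caller, not real vendor rates, and `cost_by_feature` is a hypothetical helper.

```python
from collections import defaultdict


def cost_by_feature(calls: list[dict],
                    prices: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Sum input_tokens * price_in + output_tokens * price_out per feature.

    prices maps model name to (price_in, price_out), both per token.
    """
    totals: dict[str, float] = defaultdict(float)
    for call in calls:
        price_in, price_out = prices[call["model"]]
        totals[call["feature"]] += (call["input_tokens"] * price_in
                                    + call["output_tokens"] * price_out)
    return dict(totals)
```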
“Design an LLM API for your product.” — Start with gateway, auth, token budget, idempotent retry policy, observability fields, and provider abstraction. Mention you would not let mobile clients hold API keys—shows production instincts.