
APIs, tokens & context

Production LLM features live behind HTTP APIs—not notebooks. This guide is a production-first tour of calling OpenAI, Anthropic, AWS Bedrock, and Google Gemini: message shapes, token budgets, structured output, and the error handling you need before users hit rate limits at 2 a.m.

After reading, you should be able to: call each provider’s chat API with correct message roles and streaming; count tokens before send and reserve max_tokens; manage context with sliding windows and summarization; return validated JSON and tool calls; implement 429/400/503 retry policies; and ship with a logging checklist and gateway sketch.

developer · platform · architect · Track 1 · OpenAI SDK 1.x · Anthropic SDK 0.40+ · Bedrock Converse · Gemini 1.5

OpenAI API deep dive

OpenAI’s primary surface for text generation is POST /v1/chat/completions. The request body is an ordered messages array; the response includes choices[0].message and a usage block you should log on every call.

Messages array and roles

Each message has a role and content. Roles define who “spoke” in the conversation the model continues:

| Role | Purpose | Production notes |
| --- | --- | --- |
| system | High-level instructions, policies, output format | Pin version in config; treat as code and review in PRs. Some teams use a single system message; others split policy vs persona. |
| user | End-user or upstream agent input | Sanitize and length-limit before append. Never trust raw HTML/markdown from browsers without escaping in your UI. |
| assistant | Prior model turns in multi-turn chat | Store exactly what you sent to the API (including tool calls). Replaying edited summaries breaks tool loops. |
| tool | Results returned from function/tool execution | Pair with tool_call_id on the preceding assistant message. Covered in structured output below. |

OpenAI accepts multiple system messages, but many frameworks collapse them into one. For chat products, keep the system prompt stable across turns and append only new user/assistant pairs. The API is stateless, so each request must carry all the history the model should see; re-sending the full history is correct but expensive, and context management (later section) trims what you resend.

Core request fields

| Field | What it does | Typical production value |
| --- | --- | --- |
| model | Which weights to run | Pin string per environment; log on every response |
| messages | Prompt context | System + sliding history + current user |
| temperature | Sampling randomness | 0 for extraction; 0.3–0.7 for chat |
| max_tokens | Cap on completion length | Set explicitly; the default can be large and costly |
| stream | Token-by-token SSE response | true for user-facing chat UIs |
| response_format | JSON mode / schema | {"type":"json_schema",...} for strict extraction |

Non-streaming completion

Chat completion with usage logging
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=512,
    messages=[
        {"role": "system", "content": "Answer in one short paragraph. No markdown."},
        {"role": "user", "content": "What is exponential backoff?"},
    ],
)
choice = resp.choices[0]
usage = resp.usage
log = {
    "model": resp.model,
    "finish_reason": choice.finish_reason,
    "prompt_tokens": usage.prompt_tokens,
    "completion_tokens": usage.completion_tokens,
    "total_tokens": usage.total_tokens,
}
print(choice.message.content)
print(log)
// Java (Spring AI) equivalent
ChatResponse response = chatClient.prompt()
    .options(ChatOptionsBuilder.builder()
        .withModel("gpt-4o-mini")
        .withTemperature(0.0)
        .withMaxTokens(512)
        .build())
    .system("Answer in one short paragraph. No markdown.")
    .user("What is exponential backoff?")
    .call()
    .chatResponse();  // call() returns a response spec; chatResponse() yields the ChatResponse

// Spring AI exposes usage via metadata when provider supports it
var meta = response.getMetadata();
System.out.println(response.getResult().getOutput().getContent());

Streaming basics

With stream=True, OpenAI returns Server-Sent Events (SSE) chunks. Each chunk may contain delta.content—append until finish_reason is set (stop, length, tool_calls, etc.).

  • Time to first token (TTFT): measure from request start to the first non-empty delta; this is what users feel.
  • Final usage: OpenAI sends usage in a final chunk only when you set stream_options={"include_usage": True}, and that chunk has an empty choices list. Without it, count tokens client-side or accept approximate billing from provider dashboards.
  • Partial failures: a network drop mid-stream may leave half an answer in the UI; persist incrementally and offer retry.
Streaming to stdout (proxy pattern for SSE to browser)
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    stream=True,
    messages=[{"role": "user", "content": "List three retry strategies."}],
)
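for chunk in stream:
    if not chunk.choices:
        continue  # usage-only chunks carry no choices
    delta = chunk.choices[0].delta.content or ""
    if delta:
        print(delta, end="", flush=True)
// Java (Spring AI) equivalent
Flux<ChatResponse> flux = chatClient.prompt()
    .user("List three retry strategies.")
    .stream()
    .chatResponse();  // stream() returns a response spec; chatResponse() yields the Flux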

flux.map(r -> r.getResult().getOutput().getContent())
    .subscribe(token -> System.out.print(token));
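
The TTFT bullet above can be made concrete with a small wrapper. This is a minimal sketch, not part of any SDK: timed_stream, started_at and on_done are hypothetical names, and it assumes you have already reduced provider chunks to plain text deltas and recorded the clock just before issuing the request.

```python
import time
from typing import Callable, Iterable, Iterator


def timed_stream(deltas: Iterable[str], started_at: float,
                 on_done: Callable[[dict], None],
                 clock: Callable[[], float] = time.monotonic) -> Iterator[str]:
    """Yield text deltas unchanged while recording TTFT and total latency."""
    ttft_ms = None
    for delta in deltas:
        if delta and ttft_ms is None:
            # First non-empty delta: the latency the user actually perceives.
            ttft_ms = (clock() - started_at) * 1000
        yield delta
    # Runs only when the stream ends normally, not on a mid-stream failure.
    on_done({"ttft_ms": ttft_ms, "total_latency_ms": (clock() - started_at) * 1000})


# Usage with a fake stream; in real code pass the deltas read from the provider.
metrics = []
t0 = time.monotonic()  # take this just before sending the request
text = "".join(timed_stream(["", "Retry ", "later"], t0, metrics.append))
```

The clock is injectable so tests can drive it deterministically; an empty first delta does not count toward TTFT.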

Usage fields and billing

Non-streaming responses include usage:

  • prompt_tokens — all input tokens (system + history + user)
  • completion_tokens — generated output
  • total_tokens — sum; use for per-request cost estimate
  • prompt_tokens_details.cached_tokens — when prompt caching applies (long repeated system prompts)
⚠️ Pitfall

Ignoring finish_reason: "length". The model hit max_tokens mid-sentence— your JSON may be truncated and unparseable. Raise a user-visible error or retry with a higher cap; never silently treat partial output as success.
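
One way to enforce this is a guard that every response passes through before parsing. The sketch below is illustrative: TruncatedOutputError and require_complete are hypothetical names, and it assumes you pass in the finish_reason and content read from choices[0].

```python
class TruncatedOutputError(RuntimeError):
    """Raised when the model stopped because it hit the completion cap."""


def require_complete(finish_reason: str, content: str) -> str:
    """Return content only if the model was not cut off by max_tokens."""
    if finish_reason == "length":
        raise TruncatedOutputError(
            f"output truncated after {len(content)} characters; "
            "raise max_tokens or shorten the prompt"
        )
    return content
```

Calling it before json.loads turns a confusing parse error into an explicit, loggable failure.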

💡 Pro Tip

Pass a hashed end-user or tenant ID with each request via your gateway (OpenAI accepts one in the user field), and store OpenAI’s x-request-id response header in logs; support tickets without it are painful.

Anthropic Messages API

Anthropic’s POST /v1/messages endpoint is the canonical surface for Claude. The biggest shape difference from OpenAI: system is a top-level string (or array of blocks), not a message role.

Request shape

| Field | OpenAI equivalent | Notes |
| --- | --- | --- |
| system | role: system message | String or content blocks; supports prompt caching on long system text |
| messages | messages (user/assistant only) | Alternating user/assistant turns; first must be user |
| max_tokens | max_tokens | Required on Anthropic; no implicit default |
| temperature | same | 0–1 range on most models |
| tools | tools + tool_choice | JSON schema per tool; model returns tool_use blocks |

Response content is an array of blocks, not a single string: {"type":"text","text":"..."} or {"type":"tool_use",...}. Your parser must iterate blocks—assuming content[0].text breaks when tools fire.

import anthropic

client = anthropic.Anthropic()
msg = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    temperature=0,
    system="You are a billing support agent. Never invent refund amounts.",
    messages=[
        {"role": "user", "content": "Was my March invoice prorated?"},
    ],
)
for block in msg.content:
    if block.type == "text":
        print(block.text)
print("input:", msg.usage.input_tokens, "output:", msg.usage.output_tokens)
var model = AnthropicChatModel.builder()
    .apiKey(System.getenv("ANTHROPIC_API_KEY"))
    .modelName("claude-3-5-sonnet-20241022")
    .defaultSystem("You are a billing support agent. Never invent refund amounts.")
    .maxTokens(1024)
    .temperature(0.0)
    .build();

ChatResponse r = model.call(new Prompt("Was my March invoice prorated?"));
System.out.println(r.getResult().getOutput().getContent());
# OpenAI equivalent — system inside messages array
client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a billing support agent..."},
        {"role": "user", "content": "Was my March invoice prorated?"},
    ],
)
chatClient.prompt()
    .system("You are a billing support agent...")
    .user("Was my March invoice prorated?")
    .call();
# Bedrock Claude — same Anthropic message schema in invoke body
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 1024,
    "system": "You are a billing support agent...",
    "messages": [{"role": "user", "content": "Was my March invoice prorated?"}],
}
// BedrockAnthropicChatModel maps to same message schema
bedrockChat.call(new Prompt("Was my March invoice prorated?"));

Beta headers and feature flags

Anthropic ships preview capabilities behind anthropic-beta request headers (e.g. prompt caching, computer use, extended context). In production:

  • Gate beta features behind config flags—provider can change or remove betas without a semver bump you control.
  • Log which betas were enabled per request; regressions often trace to a header added in staging only.
  • SDKs expose betas via default_headers or client options—centralize in your gateway.

Tool use overview

Define tools with name, description, and input_schema (JSON Schema). Claude may respond with tool_use blocks containing id, name, and input. You execute the tool, then send a user message with tool_result content blocks referencing the same tool_use_id.
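
The echo step can be sketched as follows. This is a hedged illustration that works on plain dicts shaped like the blocks described above (the SDK returns objects with the same field names); run_tools and the registry mapping are hypothetical, and is_error is set so the model can see a failed call rather than the loop crashing.

```python
from typing import Any, Callable


def run_tools(content_blocks: list[dict],
              registry: dict[str, Callable[..., Any]]) -> dict:
    """Build the follow-up user message answering every tool_use block."""
    results = []
    for block in content_blocks:
        if block.get("type") != "tool_use":
            continue  # text blocks need no reply
        result = {"type": "tool_result", "tool_use_id": block["id"]}
        try:
            result["content"] = str(registry[block["name"]](**block["input"]))
        except Exception as exc:  # unknown tool or tool failure: tell the model
            result["content"] = f"{type(exc).__name__}: {exc}"
            result["is_error"] = True
        results.append(result)
    return {"role": "user", "content": results}


# Usage with a fake assistant turn and one registered tool.
blocks = [
    {"type": "text", "text": "Let me check."},
    {"type": "tool_use", "id": "toolu_01", "name": "get_invoice",
     "input": {"month": "2024-03"}},
]
reply = run_tools(blocks, {"get_invoice": lambda month: f"{month}: prorated"})
```

Append the assistant turn and then this reply to messages before calling the API again.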

🔬 Under the Hood

Anthropic’s prompt caching marks blocks with cache_control. Cached prefix tokens bill at a reduced rate on hits—but you pay a write premium on first send. Worth it for 10K+ token system prompts reused across thousands of requests; not for one-off calls.
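
A quick break-even calculation shows why reuse matters. The sketch below is arithmetic only; the write and read multipliers (1.25× and 0.10× of the base input price) are illustrative assumptions, so check current pricing before relying on them. It also assumes every request after the first hits the cache.

```python
def uncached_prefix_cost(prefix_tokens: int, requests: int,
                         price_per_token: float) -> float:
    """Cost of sending the shared prefix at full price on every request."""
    return prefix_tokens * price_per_token * requests


def cached_prefix_cost(prefix_tokens: int, requests: int, price_per_token: float,
                       write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Cost with caching: one premium write, then discounted reads.

    The multipliers are assumptions for illustration, not quoted prices.
    """
    write = prefix_tokens * price_per_token * write_mult
    reads = prefix_tokens * price_per_token * read_mult * (requests - 1)
    return write + reads
```

Under these assumed multipliers a single call costs more with caching than without, while the second request already tips the balance the other way.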

⚖️ Trade-off

Claude excels at long-context instruction following and XML-tagged prompts—but OpenAI’s ecosystem (JSON schema mode, wider third-party support) is often faster to integrate. Many teams standardize on one primary provider and keep a second for failover, not feature parity on day one.

🎯 Interview Tip

“How is Anthropic’s API different from OpenAI’s?” — System prompt placement, required max_tokens, content as typed blocks, separate input/output token fields. Mention tool_result echo pattern—shows you have shipped agents, not just chat.

AWS Bedrock: InvokeModel vs Converse API

Bedrock runs foundation models inside your AWS account boundary. You choose between model-specific InvokeModel (raw JSON bodies per vendor) and the unified Converse API (messages-shaped, closer to Anthropic/OpenAI ergonomics).

When to use which

| API | Best for | Friction |
| --- | --- | --- |
| InvokeModel / InvokeModelWithResponseStream | Exact vendor payload, legacy integrations, models only exposed via native schema | Different JSON per model family (Anthropic vs Meta vs Cohere) |
| Converse / ConverseStream | New app code, multi-model routing, consistent tool config | Not every Bedrock model supports Converse yet; check the model card |

Model IDs and inference profiles

Bedrock identifies models with IDs like anthropic.claude-3-5-sonnet-20241022-v2:0 or amazon.nova-pro-v1:0. Availability is region-specific: a model offered in one region may be missing in another, so confirm the ID in the Bedrock console for your target region. Cross-region inference profiles (where available) improve availability at the cost of routing transparency; log the profile ARN you used.

IAM auth (no API keys in env)

Production Bedrock calls use the AWS SDK credential chain: an IAM role on ECS/EKS/Lambda, not long-lived access keys in .env. Converse is authorized by the bedrock:InvokeModel action and ConverseStream by bedrock:InvokeModelWithResponseStream, so minimum permissions typically include those actions scoped to specific model ARNs.

import boto3

brt = boto3.client("bedrock-runtime", region_name="us-east-1")

# Converse API — unified messages shape
resp = brt.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    system=[{"text": "You are a concise assistant."}],
    messages=[
        {"role": "user", "content": [{"text": "Summarize VPC peering in two sentences."}]},
    ],
    inferenceConfig={"maxTokens": 256, "temperature": 0},
)
text = resp["output"]["message"]["content"][0]["text"]
usage = resp["usage"]
print(text, usage)
ConverseResponse response = bedrockRuntimeClient.converse(
    ConverseRequest.builder()
        .modelId("anthropic.claude-3-5-sonnet-20241022-v2:0")
        .system(SystemContentBlock.builder().text("You are a concise assistant.").build())
        .messages(Message.builder()
            .role(ConversationRole.USER)
            .content(ContentBlock.builder().text("Summarize VPC peering...").build())
            .build())
        .inferenceConfig(InferenceConfiguration.builder()
            .maxTokens(256)
            .temperature(0.0f)
            .build())
        .build());
# InvokeModel — Anthropic-native JSON body (older integrations)
import json
body = json.dumps({
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Summarize VPC peering..."}],
})
raw = brt.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=body,
    contentType="application/json",
)
# OpenAI — API key auth, not IAM
from openai import OpenAI
OpenAI().chat.completions.create(model="gpt-4o", messages=[...])
# Direct Anthropic — API key header
anthropic.Anthropic().messages.create(model="claude-3-5-sonnet-20241022", ...)
// OpenAI via Spring AI — API key in config
@Bean ChatClient openAiChatClient(OpenAiChatModel model) { ... }

Region considerations

  • Model availability — Sonnet may launch in us-east-1 weeks before eu-west-1.
  • Data residency — prompts stay in-region; cross-region inference is opt-in—read the model card.
  • Latency — run Bedrock in the same region as your app tier; cross-AZ is fine, cross-region adds RTT.
  • Quotas — Service Quotas console; request increases before launch traffic, not after 429 storms.
📦 Real World

Enterprises on AWS often mandate Bedrock so prompts never leave their compliance boundary and billing rides the existing AWS invoice. The tradeoff is model lag vs direct APIs and more IAM/CloudTrail plumbing—platform teams wrap Bedrock behind an internal gateway with per-team model allowlists.

⚠️ Pitfall

Hardcoding model IDs from a blog post. Bedrock IDs include version suffixes that change; deploy failures show ValidationException. Store IDs in config per environment and validate against ListFoundationModels in CI smoke tests.
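
The CI check can be a pure function over the ListFoundationModels result. This is a sketch under assumptions: missing_model_ids and the per-environment config dict are hypothetical, and the summaries are expected to come from boto3.client("bedrock").list_foundation_models()["modelSummaries"] (the control-plane client, not bedrock-runtime).

```python
def missing_model_ids(configured: dict[str, str],
                      summaries: list[dict]) -> dict[str, str]:
    """Return config entries whose model ID the region does not list."""
    available = {s["modelId"] for s in summaries}
    return {env: model_id for env, model_id in configured.items()
            if model_id not in available}


# Usage with a stubbed response; in CI, fetch summaries for the target region.
summaries = [{"modelId": "anthropic.claude-3-haiku-20240307-v1:0"}]
config = {
    "staging": "anthropic.claude-3-haiku-20240307-v1:0",
    "prod": "anthropic.claude-3-haiku-20240307-v9:0",  # deliberately wrong suffix
}
problems = missing_model_ids(config, summaries)
```

Fail the smoke test when problems is non-empty. Inference profile IDs may need a separate lookup, since this check only covers foundation model IDs.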

💰 Cost

Bedrock pricing mirrors vendor list rates plus AWS markup on some models—but you also pay for VPC endpoints, CloudWatch logs, and cross-AZ data transfer. Total cost of ownership is infra + tokens, not tokens alone.

Google Gemini: GenerateContent

Gemini’s client surface is generateContent (REST or Google AI / Vertex SDK). Content is built from Part objects inside Content turns labeled user or model—not assistant.

Part types

| Part type | Use | Production note |
| --- | --- | --- |
| text | Plain prompt text | Most chat and extraction paths |
| inline_data | Base64 image/audio bytes | Watch payload size; prefer GCS URIs at scale |
| file_data | Reference uploaded file by URI | Vertex file API; lifecycle and TTL policies |
| function_call / function_response | Tool loop | Parallel function calls supported on newer models |

System instruction

Pass system_instruction at the model config level (not as a user turn). On Vertex AI, the same concept appears in systemInstruction on GenerativeModel. Keep system text stable for implicit caching behavior on long prompts.

Gemini 1.5 Flash via google-generativeai
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel(
    "gemini-1.5-flash",
    system_instruction="Reply in bullet points. Max 5 bullets.",
)
response = model.generate_content(
    "What should I log on every LLM request?",
    generation_config={"temperature": 0, "max_output_tokens": 512},
)
print(response.text)
print(response.usage_metadata)  # prompt_token_count, candidates_token_count
var generativeModel = new GenerativeModel(
    "gemini-1.5-flash",
    vertexAi);  // or Google AI client

var response = generativeModel.generateContent(
    ContentMaker.fromMultiModalData(
        "What should I log on every LLM request?"));

System.out.println(ResponseHandler.getText(response));

Google AI Studio vs Vertex AI

| Surface | Auth | When |
| --- | --- | --- |
| Google AI (AI Studio API key) | API key | Prototypes, side projects, fast spikes |
| Vertex AI | GCP IAM / service account | Production on GCP, VPC-SC, enterprise billing |

Gemini 1.5 Pro’s million-token context is attractive for whole-document ingest—verify quality and latency on your documents. Long context does not replace RAG for freshness; it changes how you chunk, not whether you need retrieval.

⚖️ Trade-off

Gemini Flash is among the cheapest frontier-class options for high volume—but multi-cloud teams may skip it to avoid a fourth SDK/auth model. Standardize on a gateway (LiteLLM, Portkey, in-house) if you truly need four providers.

💡 Pro Tip

Gemini returns finish_reason enums like MAX_TOKENS and safety blocks—map them to user-visible errors. A silent empty text often means safety filter, not success.
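
A mapping like the one below keeps that translation in one place. It is a minimal sketch: check_gemini_finish and the message wording are hypothetical, and it assumes you pass the finish reason as its enum name (for example "STOP" or "MAX_TOKENS") along with whatever text you managed to extract.

```python
_FINISH_MESSAGES = {
    "MAX_TOKENS": "The answer was cut off at the output limit. Try a shorter request.",
    "SAFETY": "The response was blocked by a safety filter.",
    "RECITATION": "The response was withheld because it repeated source material.",
}


def check_gemini_finish(finish_reason: str, text: str) -> str:
    """Return text on success; raise with a user-visible message otherwise."""
    if finish_reason == "STOP" and text.strip():
        return text
    # Covers mapped reasons, unknown reasons, and STOP with empty text.
    message = _FINISH_MESSAGES.get(finish_reason, "The model returned no usable answer.")
    raise RuntimeError(f"{finish_reason}: {message}")
```

Treating an empty STOP as a failure is deliberate: it prevents a blank answer from being rendered as success.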

Token counting before you send

Context exceeded errors are preventable. Count tokens before the API call, reserve output budget, and reject or trim upstream—do not discover overflow from a 400 in production traffic.

Why count locally?

  • Pre-flight validation — return 413/422 to your client with a clear message instead of burning provider quota.
  • Cost estimation — show “this paste is ~12K tokens” in admin UIs before users submit.
  • Dynamic model routing — if prompt > 100K, route to Gemini Pro; else Haiku.
  • Chunking for RAG — size chunks to target token counts, not character counts.

OpenAI: tiktoken (Python) and jtokkit (Java)

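Local token count with tiktoken (o200k_base for gpt-4o; cl100k_base for gpt-4 and gpt-3.5-turbo)
import tiktoken

def count_openai_messages(model: str, messages: list[dict]) -> int:
    # Resolves the encoding for the model; raises KeyError for unknown names
    enc = tiktoken.encoding_for_model(model)
    # Approximation after the OpenAI cookbook: about 4 tokens of overhead per
    # message (3 framing tokens plus roughly 1 for the role)
    tokens = 0
    for m in messages:
        tokens += 4
        tokens += len(enc.encode(m.get("content") or ""))
    tokens += 3  # reply priming
    return tokens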

MODEL_CTX = 128_000
MAX_OUT = 1_024
prompt_tokens = count_openai_messages("gpt-4o", messages)
if prompt_tokens + MAX_OUT > MODEL_CTX:
    raise ValueError(f"Context overflow: {prompt_tokens} + {MAX_OUT} > {MODEL_CTX}")
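// Java (jtokkit) equivalent
EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
// The String overload returns Optional<Encoding>; fail fast on unknown models
Encoding enc = registry.getEncodingForModel("gpt-4o").orElseThrow();

int tokens = 0;
for (Message m : messages) {
    tokens += 4;  // per-message overhead
    tokens += enc.encode(m.getContent()).size();
}
tokens += 3;  // reply priming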

int modelCtx = 128_000;
int maxOut = 1_024;
if (tokens + maxOut > modelCtx) {
    throw new IllegalArgumentException("Context overflow: " + tokens);
}

Anthropic counting API

Anthropic exposes a count_tokens endpoint (and SDK helpers) that mirrors their tokenizer— prefer this over tiktoken when calling Claude, especially with tool definitions and multimodal blocks in the payload.

Anthropic official count
import anthropic

client = anthropic.Anthropic()
count = client.messages.count_tokens(
    model="claude-3-5-sonnet-20241022",
    system="Long system prompt...",
    messages=[{"role": "user", "content": user_text}],
)
print(count.input_tokens)
// Call Anthropic count API via HTTP from gateway, or use SDK when available
// Do not use CL100K tokenizer for Claude production gates

Pre-flight validation pattern

| Check | Action if fail |
| --- | --- |
| prompt_tokens + max_tokens ≤ model_limit | 422 with “shorten input” or trigger summarization |
| single user message ≤ per-field cap | Reject upload; prevent abuse |
| tool schema size ≤ budget | Trim tool list or split agent capabilities |
| estimated_cost ≤ tenant daily cap | 429 from your API, not the provider’s |
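
The four checks can be sketched as a single function that returns the status your API should answer with. Everything here is hypothetical scaffolding (RequestStats, Limits, preflight, and the choice of 413 for the per-field cap); the token counts are assumed to come from the provider-appropriate counter.

```python
from dataclasses import dataclass


@dataclass
class RequestStats:
    prompt_tokens: int
    max_tokens: int
    largest_user_message_tokens: int
    tool_schema_tokens: int
    estimated_cost_usd: float


@dataclass
class Limits:
    model_limit: int
    per_field_cap: int
    tool_budget: int
    remaining_daily_budget_usd: float


def preflight(stats: RequestStats, limits: Limits) -> tuple[int, str]:
    """Return (http_status, reason); 200 means the request may go upstream."""
    if stats.largest_user_message_tokens > limits.per_field_cap:
        return 413, "single message exceeds the per-field cap"
    if stats.tool_schema_tokens > limits.tool_budget:
        return 422, "tool schemas exceed their budget; trim the tool list"
    if stats.prompt_tokens + stats.max_tokens > limits.model_limit:
        return 422, "shorten input or trigger summarization"
    if stats.estimated_cost_usd > limits.remaining_daily_budget_usd:
        return 429, "tenant daily cost cap reached"
    return 200, "ok"
```

Checks run cheapest-first, so an abusive upload is rejected before any cost estimate is computed.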
🔬 Under the Hood

Tokenizers differ per vendor—tiktoken for GPT will mis-estimate Claude by several percent. That is fine for UI hints; it is not fine for hard gates on Claude without Anthropic’s counter. Unified gateways often maintain a provider → counter registry.

💰 Cost

Counting API calls are cheap relative to generation—but avoid calling count on every chat keystroke. Debounce client-side estimates; server-side exact count once per submit.

🎯 Interview Tip

“How do you prevent context length errors?” — Pre-flight token count, reserve max_tokens, sliding window on history, summarization fallback. Mention you never rely on char/4 in production for code or JSON payloads.

Context management

The context window is shared between instructions, history, retrieved docs, tools, and the answer. Production systems budget each bucket and evict deliberately—not by hope.

Token budget formula

Treat the window as a ledger:

available_for_prompt = model_limit − reserved_output − safety_margin

Always reserve max_tokens (your completion cap) before stuffing RAG chunks. A 128K window with 4K reserved output and 2K safety margin leaves 122K for everything else—in practice cap lower for latency.
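
The ledger is simple enough to encode directly. This small sketch mirrors the formula above, using the same round numbers as the example (K taken as 1,000); available_for_prompt is a hypothetical helper name.

```python
def available_for_prompt(model_limit: int, reserved_output: int,
                         safety_margin: int) -> int:
    """Tokens left for system prompt, history, RAG chunks and tool schemas."""
    available = model_limit - reserved_output - safety_margin
    if available <= 0:
        raise ValueError("output reservation and margin consume the whole window")
    return available


# 128K window, 4K reserved output, 2K safety margin
budget = available_for_prompt(128_000, 4_000, 2_000)  # 122_000
```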

| Bucket | Eviction strategy | Keep |
| --- | --- | --- |
| System prompt | Never evict in-session | Policies, format, tool rules |
| RAG chunks | Drop lowest rerank score | Top-K until budget full |
| Chat history | Sliding window or summary | Last N turns or rolling summary |
| Tool outputs | Truncate or summarize large JSON | IDs and fields the model needs next step |

Sliding window

Keep the system prompt plus the last K user/assistant pairs. Simple and predictable; downside: user references something from 20 turns ago and the model forgets. Pair with explicit “session summary” injection when detecting topic shift.
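
A minimal sketch of the window, assuming history holds only user and assistant messages in order (tool turns would need their call/result pairs kept together, which this does not handle). sliding_window is a hypothetical helper.

```python
def sliding_window(system: dict, history: list[dict], k_pairs: int) -> list[dict]:
    """Keep the system prompt plus the last k user/assistant pairs."""
    recent = history[-2 * k_pairs:] if k_pairs > 0 else []
    # Start on a user turn so the window never opens with a dangling reply.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return [system] + recent


# Usage: three full turns, keep the last two.
system = {"role": "system", "content": "Be brief."}
history = [
    {"role": "user", "content": "u1"}, {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "u2"}, {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "u3"}, {"role": "assistant", "content": "a3"},
]
window = sliding_window(system, history, k_pairs=2)
```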

Summarization

When history exceeds budget, call a cheap model to compress older turns into a bullet summary stored as a synthetic system or user preamble: “Earlier in this session: user asked about refunds; agent offered prorated credit.” Log summary model and prompt version—summaries can drop nuance and cause wrong answers if too aggressive.
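
The compaction step might look like the sketch below. The summarize callable stands in for the cheap-model call and is an assumption, as are the compact_history name and the choice of a user-role preamble; real code would also log the summary model and prompt version.

```python
from typing import Callable


def compact_history(history: list[dict], keep_last: int,
                    summarize: Callable[[list[dict]], str]) -> list[dict]:
    """Replace all but the last keep_last messages with one summary preamble."""
    if keep_last <= 0:
        raise ValueError("keep_last must be positive")
    if len(history) <= keep_last:
        return list(history)  # nothing old enough to compress
    older, recent = history[:-keep_last], history[-keep_last:]
    preamble = {"role": "user",
                "content": f"Earlier in this session: {summarize(older)}"}
    return [preamble] + recent


# Usage with a fake summarizer; swap in a call to a small model.
turns = [{"role": "user", "content": f"message {i}"} for i in range(5)]
compacted = compact_history(turns, keep_last=2,
                            summarize=lambda msgs: f"{len(msgs)} earlier messages")
```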

Selective memory and RAG preview

Long-term memory (user preferences, account facts) should not live as infinite chat history. Store structured facts in your DB; inject only relevant rows per turn—this is a preview of Track 2 RAG. Same for documents: retrieve top chunks per query instead of pasting whole PDFs because the window allows it.

Assemble messages under a token ceiling
def build_messages(system: str, history: list, rag_chunks: list[str],
                   user: str, budget: int, count_fn) -> list[dict]:
    head = [{"role": "system", "content": system}]
    # The current user turn is always sent, so count it in every trial
    tail = [{"role": "user", "content": user}]
    if count_fn(head + tail) > budget:
        raise ValueError("system prompt and user message alone exceed the budget")

    # RAG: add highest-priority chunks until budget
    for chunk in rag_chunks:
        trial = head + [{"role": "user", "content": f"Context:\n{chunk}"}]
        if count_fn(trial + tail) > budget:
            break
        head = trial

    # History: newest first until budget
    trimmed = []
    for turn in reversed(history):
        if count_fn(head + [turn] + trimmed + tail) > budget:
            break
        trimmed.insert(0, turn)

    return head + trimmed + tail
// Same algorithm in your MessageAssemblyService:
// fixed system → greedy RAG chunks → newest history → current user
List<Message> buildMessages(String system, Deque<Message> history,
    List<String> ragChunks, String user, int budget) { ... }
⚠️ Pitfall

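Setting max_tokens to the model maximum while sending a 120K prompt. Some providers validate prompt tokens plus max_tokens against the context window and reject the request outright, and rate limiters may count the reserved output against your quota; where the request is accepted, there is little room left and answers get truncated. Cap output to what the UX needs (256–2K for most chat; more for codegen with review).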

📦 Real World

Support copilots typically run 8K–16K total context with aggressive history trimming—not full 200K windows. Long context is for batch doc analysis pipelines, not every ticket reply. Measure p95 prompt size in week one.

Structured output and tools

Free-form prose is fine for chat; downstream code needs JSON, enums, and tool calls you can parse without prayer. Combine provider modes with schema validation in your runtime.

JSON mode vs strict JSON schema

| Mode | Provider | Guarantee |
| --- | --- | --- |
| response_format: json_object | OpenAI | Valid JSON, not schema |
| response_format: json_schema (strict) | OpenAI | Conforms to schema; best for extraction |
| Tool use with single forced tool | Anthropic / OpenAI | Arguments shaped by tool schema |
| response_mime_type: application/json | Gemini | JSON output; combine with schema where supported |

Pydantic and Instructor (Python)

Validated extraction with Instructor
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Invoice(BaseModel):
    vendor: str
    amount_usd: float
    due_date: str

client = instructor.from_openai(OpenAI())
invoice = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Invoice,
    messages=[{"role": "user", "content": raw_invoice_text}],
)
# invoice is a validated Invoice instance — not raw JSON string
// Spring AI structured output — map to record/class
record Invoice(String vendor, double amountUsd, String dueDate) {}

Invoice invoice = chatClient.prompt()
    .user(rawInvoiceText)
    .call()
    .entity(Invoice.class);

OpenAI strict JSON schema

Strict schema — all fields required unless nullable
schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "amount_usd": {"type": "number"},
    },
    "required": ["vendor", "amount_usd"],
    "additionalProperties": False,
}
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice", "strict": True, "schema": schema},
    },
    messages=[{"role": "user", "content": raw_invoice_text}],
)
BeanOutputConverter<Invoice> converter = new BeanOutputConverter<>(Invoice.class);
String format = converter.getFormat();

chatClient.prompt()
    .user(u -> u.text(rawInvoiceText).param("format", format))
    .call()
    .entity(Invoice.class);

Function / tool calling loop

  1. Send tools array with schemas.
  2. Model returns tool call(s) with arguments.
  3. Execute tools server-side (with authz!).
  4. Append tool results; call model again until text answer or max iterations.

Cap iterations (e.g. 5) and wall-clock time—runaway tool loops are a common outage mode. Track 3 goes deeper on agents; here, treat tools as typed RPC with schema validation on arguments.
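
The four steps and both caps can be sketched provider-neutrally. The shapes below are simplified assumptions, not any vendor's wire format: call_model is a hypothetical callable returning {"text", "tool_calls"}, and execute_tool is where authorization and argument validation would live.

```python
import time
from typing import Any, Callable


class ToolLoopExceeded(RuntimeError):
    """Raised when the loop hits its iteration or wall-clock cap."""


def run_tool_loop(call_model: Callable[[list[dict]], dict],
                  execute_tool: Callable[[str, dict], Any],
                  messages: list[dict], max_iterations: int = 5,
                  deadline_s: float = 30.0,
                  clock: Callable[[], float] = time.monotonic) -> str:
    started = clock()
    messages = list(messages)  # do not mutate the caller's history
    for _ in range(max_iterations):
        if clock() - started > deadline_s:
            raise ToolLoopExceeded("wall-clock deadline exceeded")
        reply = call_model(messages)
        if not reply["tool_calls"]:
            return reply["text"]  # plain text answer ends the loop
        messages.append({"role": "assistant", "tool_calls": reply["tool_calls"]})
        for call in reply["tool_calls"]:
            result = execute_tool(call["name"], call["arguments"])
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": str(result)})
    raise ToolLoopExceeded(f"no final answer after {max_iterations} iterations")
```

Raising instead of returning a partial answer makes runaway loops visible in error metrics.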

⚠️ Pitfall

Parsing JSON with json.loads and no schema validation. Models omit fields, use strings for numbers, or hallucinate enums. Always validate with Pydantic/Zod/Jackson and retry once with the validation error in the prompt—then fail closed.
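
The validate, repair once, then fail closed pattern can be sketched without dependencies. Here a hand-written validate_invoice stands in for a Pydantic model to keep the example self-contained; call_model is a hypothetical callable that takes a prompt and returns the raw model text.

```python
import json
from typing import Callable


class ExtractionFailed(ValueError):
    """Raised when the output is still invalid after one repair attempt."""


def validate_invoice(data: object) -> dict:
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("vendor"), str):
        raise ValueError("vendor must be a string")
    amount = data.get("amount_usd")
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        raise ValueError("amount_usd must be a number, not a string")
    return {"vendor": data["vendor"], "amount_usd": float(amount)}


def extract_with_repair(call_model: Callable[[str], str], prompt: str) -> dict:
    try:
        return validate_invoice(json.loads(call_model(prompt)))
    except ValueError as first_error:  # JSONDecodeError is a ValueError too
        repair = (f"{prompt}\n\nYour previous reply was rejected: {first_error}. "
                  "Reply with corrected JSON only.")
    try:
        return validate_invoice(json.loads(call_model(repair)))
    except ValueError as second_error:
        raise ExtractionFailed(str(second_error)) from second_error
```

Feeding the validation error back gives the model something specific to fix; a second failure is surfaced rather than retried again.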

⚖️ Trade-off

Strict JSON schema reduces parse errors but increases latency and can refuse valid-but-fuzzy extractions. For messy OCR text, a two-step pipeline (free extract → normalize) sometimes beats one strict shot.

Error handling and retries

LLM APIs fail in predictable categories. Map HTTP status and provider error codes to retry, trim and retry, or fail fast—never blanket retry everything.

Error taxonomy

| Status / code | Meaning | Strategy |
| --- | --- | --- |
| 429 rate limit | Too many requests or tokens/min | Exponential backoff + full jitter; honor Retry-After; reduce concurrency |
| 400 context exceeded | Prompt + max_tokens > limit | Do not retry same payload; trim, summarize, or return 422 to client |
| 400 invalid request | Bad schema, missing max_tokens (Anthropic) | Fix code; alert; non-retryable |
| 401 / 403 | Auth or IAM | Rotate credentials; non-retryable until fixed |
| 500 / 502 / 503 | Provider or network blip | Limited retries (2–3) with backoff; circuit breaker |
| Timeout (client) | No response in deadline | See streaming vs non-streaming below |

429: exponential backoff with jitter

Retry wrapper for rate limits and 5xx only
import random, time
from openai import RateLimitError, APIStatusError

# The OpenAI SDK also retries on its own (max_retries defaults to 2);
# construct the client with max_retries=0 if this wrapper owns the policy.
def call_with_retry(fn, max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError as e:
            if attempt == max_attempts - 1:
                raise
            retry_after = float(e.response.headers.get("retry-after") or 0)
            # Wait at least Retry-After, then add full jitter over the backoff window
            time.sleep(retry_after + random.uniform(0, 2 ** attempt))
        except APIStatusError as e:
            if e.status_code in (500, 502, 503) and attempt < max_attempts - 1:
                time.sleep(random.uniform(0, 2 ** attempt))
            else:
                raise
// Resilience4j Retry: retryOnException(RateLimitException.class)
// exponentialBackoff with randomization; maxAttempts=5
// Do NOT retry ContextLengthExceededException

400 context exceeded

Retrying the same oversized prompt will never succeed. Pipeline:

  1. Catch provider-specific error type / message substring.
  2. Drop oldest history or lowest-score RAG chunks.
  3. Optionally run emergency summarization.
  4. Single retry with reduced payload—not an infinite loop.
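
The pipeline above can be sketched as a wrapper that trims once and retries once. ContextExceeded is a hypothetical exception your gateway would raise after recognizing the provider's error; send and trim are injected callables, and drop_oldest_turns is one possible trim policy.

```python
from typing import Callable


class ContextExceeded(Exception):
    """Gateway-level signal that the provider rejected the prompt as too long."""


def drop_oldest_turns(messages: list[dict], keep_last: int = 4) -> list[dict]:
    """Keep system messages plus the newest keep_last other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last:]


def call_with_trim(send: Callable[[list[dict]], str], messages: list[dict],
                   trim: Callable[[list[dict]], list[dict]] = drop_oldest_turns) -> str:
    try:
        return send(messages)
    except ContextExceeded:
        reduced = trim(messages)
        if len(reduced) >= len(messages):
            raise  # nothing left to drop; surface the error to the client
        return send(reduced)  # single retry; a second failure propagates
```

The length comparison is what keeps this from becoming an infinite loop when trimming cannot shrink the payload.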

Timeout strategies: streaming vs non-streaming

| Mode | Timeout behavior | Recommendation |
| --- | --- | --- |
| Non-streaming | One deadline for entire completion | Set client timeout > provider SLA; return 504 with retry id |
| Streaming | Idle timeout between chunks + overall cap | Reset idle timer on each delta; kill if no byte for 30s |

Streaming improves TTFT but complicates retries—you may have partially rendered UI text. Persist stream state; on failure offer “continue” or “retry from scratch” explicitly.
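
An idle timer plus an overall cap can be sketched with asyncio. This assumes the provider stream is exposed as an async iterator of text chunks; guarded and StreamTimeout are hypothetical names. Each wait is a fresh timeout, which is what resets the idle timer on every chunk.

```python
import asyncio
from typing import AsyncIterator


class StreamTimeout(Exception):
    """Raised when the stream stalls or exceeds its overall deadline."""


async def guarded(stream: AsyncIterator[str], idle_s: float,
                  total_s: float) -> AsyncIterator[str]:
    loop = asyncio.get_running_loop()
    deadline = loop.time() + total_s
    iterator = stream.__aiter__()
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise StreamTimeout("overall deadline exceeded")
        try:
            # wait_for cancels the pending read if the timeout fires.
            chunk = await asyncio.wait_for(iterator.__anext__(),
                                           timeout=min(idle_s, remaining))
        except StopAsyncIteration:
            return  # stream finished normally
        except asyncio.TimeoutError:
            raise StreamTimeout("stream stalled or deadline exceeded") from None
        yield chunk
```

Consume it with async for and persist each chunk as it arrives, so a StreamTimeout still leaves the partial text available for a "continue" or "retry" choice.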

🔬 Under the Hood

Full jitter (sleep(random(0, cap))) prevents synchronized retry storms when many pods hit 429 at once. Without jitter, backoff waves amplify the outage you are trying to survive.

📦 Real World

Load tests that only mock 200 responses ship blind. Record production error mixes (mostly 429 and timeouts) and replay them in CI so retry budgets and circuit breakers are tuned before launch day traffic.

💡 Pro Tip

Expose a request_id to clients on failure—it maps to provider headers and your logs. Support can trace one bad session without turning on prompt logging globally.

What this looks like in production

Provider APIs are building blocks. Production is a gateway, pre-flight token gates, structured observability, and a checklist you can run before enabling the feature flag for 10% of users.

Gateway pattern (preview)

Client apps call your /v1/chat—not OpenAI directly. The gateway holds provider credentials, normalizes messages across vendors, enforces per-tenant rate limits, runs token counting, attaches trace IDs, and emits metrics. One place to rotate keys, swap models, and block prompt injection patterns.
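
Message normalization is the piece most easily shown in code. The sketch below converts an OpenAI-style message list into the shapes described earlier for Anthropic (top-level system) and Gemini (model role, parts). It handles text-only turns and ignores tool messages; to_anthropic and to_gemini are hypothetical gateway helpers.

```python
def _split(messages: list[dict]) -> tuple[str, list[dict]]:
    """Separate system text from user/assistant turns."""
    system = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
    turns = [m for m in messages if m["role"] in ("user", "assistant")]
    return system, turns


def to_anthropic(messages: list[dict]) -> dict:
    system, turns = _split(messages)
    payload = {"messages": [{"role": m["role"], "content": m["content"]} for m in turns]}
    if system:
        payload["system"] = system  # top-level field, not a message role
    return payload


def to_gemini(messages: list[dict]) -> dict:
    system, turns = _split(messages)
    payload = {"contents": [
        {"role": "model" if m["role"] == "assistant" else "user",
         "parts": [{"text": m["content"]}]}
        for m in turns
    ]}
    if system:
        payload["system_instruction"] = {"parts": [{"text": system}]}
    return payload
```

Keeping one canonical internal shape and converting at the edge lets the rest of the gateway (token gates, logging, retries) stay provider-agnostic.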

Logging fields (every call)

| Field | Why |
| --- | --- |
| request_id, provider request id | Support and vendor tickets |
| provider, model, model version | Debug behavior drift |
| input_tokens, output_tokens, cached_tokens | Cost per feature / tenant |
| ttft_ms, total_latency_ms | SLO dashboards |
| finish_reason, http_status, error_code | Alerting on anomaly rates |
| feature, tenant_id (hashed), prompt_version | Blame regressions on the right deploy |
| stream: true/false, retry_count | Tune timeout and backoff policies |
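
One way to keep these fields consistent is a single record type emitted as one JSON line per call. The sketch below is illustrative: LLMCallLog, hash_tenant and to_log_line are hypothetical names, the field list mirrors the table above, and the truncated SHA-256 is just one possible hashing choice.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class LLMCallLog:
    request_id: str
    provider_request_id: Optional[str]
    provider: str
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    ttft_ms: Optional[float]
    total_latency_ms: float
    finish_reason: Optional[str]
    http_status: int
    error_code: Optional[str]
    feature: str
    tenant_id_hash: str
    prompt_version: str
    stream: bool
    retry_count: int


def hash_tenant(tenant_id: str) -> str:
    """Stable pseudonym so logs never carry the raw tenant ID."""
    return hashlib.sha256(tenant_id.encode("utf-8")).hexdigest()[:16]


def to_log_line(record: LLMCallLog) -> str:
    return json.dumps(asdict(record), sort_keys=True)
```

Making every field a required constructor argument means a call site that forgets one fails at development time rather than producing sparse logs.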

Production readiness checklist

  • Pre-flight token count rejects or trims before provider call
  • max_tokens set per route; never unbounded
  • 429/503 retry with jitter; 400 context errors trim—not blind retry
  • Streaming idle timeout + overall deadline tested under load
  • Structured output validated; one repair retry then fail closed
  • Secrets in vault / IAM roles—not keys in images
  • Provider and model pinned per environment; logged on every response
  • Golden eval suite in CI (Track 5) for prompt changes
  • Per-tenant cost caps with alerts at 80% daily budget
💰 Cost

Ship a dashboard on day one: cost = Σ(input_tokens × price_in + output_tokens × price_out) by feature. Teams that discover spend from the vendor invoice weeks later over-build then panic-cut context.
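
The formula translates directly into an aggregation over logged calls. In this sketch the price table is a hypothetical placeholder keyed by model with (input, output) prices per million tokens; substitute the rates from your provider's current price list.

```python
from collections import defaultdict


def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """input_tokens x price_in + output_tokens x price_out, prices per 1M tokens."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000


def cost_by_feature(records: list[dict],
                    prices: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Sum request costs per feature from logged token counts."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        price_in, price_out = prices[r["model"]]
        totals[r["feature"]] += request_cost(
            r["input_tokens"], r["output_tokens"], price_in, price_out)
    return dict(totals)
```

Feeding it the same records you log per call keeps the dashboard and the logs from drifting apart.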

🎯 Interview Tip

“Design an LLM API for your product.” — Start with gateway, auth, token budget, idempotent retry policy, observability fields, and provider abstraction. Mention you would not let mobile clients hold API keys—shows production instincts.