Hello Ship — API to production
Track 1 taught you to call an API. Track 5 taught you to block bad merges with eval CI. Hello Ship is the bridge: a real HTTP service with health probes, structured logs, streaming responses, fallback when the provider blinks, and secrets that never hit git. This guide defines what “shipped” means in production, walks a full readiness checklist, implements streaming from OpenAI/Anthropic through SSE to the browser (and Spring WebFlux Flux), covers model and provider fallbacks including LiteLLM, and closes with a minimal FastAPI + Spring Boot sketch you can containerize today.
After reading, you should be able to: articulate the four pillars of a shipped LLM feature (observable, cost-controlled, gracefully degrading, evaluatable); run through a production readiness checklist before launch; stream tokens safely with error handling mid-stream; configure model and provider fallbacks; and deploy a minimal chat service with /health, /ready, and structured logging.
What “shipped” means
A Jupyter notebook that calls chat.completions.create once is not shipped. Shipped means paying customers—or internal teams with SLAs—depend on your feature daily, and you can answer four questions without opening Slack: Is it healthy? What did it cost? What happens when OpenAI 503s? Did last week’s prompt change break refunds?
Production LLM features fail differently from CRUD APIs. Correctness is probabilistic, latency has a long tail (time-to-first-token plus decode), and cost scales with usage—not with CPU cores. Teams that treat the LLM as “just another REST dependency” learn on incident day one. Teams that ship deliberately treat inference as an unreliable, metered backend with guardrails, eval gates, and fallbacks already wired.
Four pillars of a shipped LLM feature
| Pillar | Question it answers | Minimum bar | Track reference |
|---|---|---|---|
| Observable | What happened on request req_abc? | Structured logs, trace IDs, token/cost fields, latency histograms | Track 5 — observability |
| Cost-controlled | Will a bug loop bankrupt us? | Per-tenant budgets, max tokens, rate limits, model routing | Guide 2 — gateway |
| Gracefully degrading | What do users see when the model is down? | Fallback model, cached answer, canned response—not infinite spinner | This guide § fallback |
| Evaluatable | Did we regress quality? | Golden cases in CI, smoke on deploy, online sampling | Track 5 — CI eval |
Demo vs production (honest comparison)
| Dimension | Demo | Shipped |
|---|---|---|
| API key | In notebook or frontend env | Server-side only; rotated via secrets manager |
| Errors | Print stack trace | Retry with backoff; user-safe message + incident id |
| Latency | Wait for full response | Stream TTFT; timeout per stage |
| Quality | “Looks good to me” | Eval suite ≥ threshold; block deploy on regression |
| Safety | System prompt only | Input/output guardrails + adversarial CI |
| Cost | Unbounded | Token caps, per-user daily budget, cheaper model routing |
flowchart LR
    Client["Browser / API client"]
    Edge["API gateway\nauth · rate limit"]
    Svc["Chat service\nhealth · ready"]
    Guard["Guardrails\ninput · output"]
    LLM["LLM provider\nstream · retry"]
    Obs["Observability\ntraces · cost"]
    Client --> Edge --> Svc --> Guard --> LLM
    Svc --> Obs
    Guard --> Obs
    LLM --> Obs
The diagram is intentionally boring—boring is good. Every arrow should emit a span or log line with request_id, tenant_id, model, and prompt_version. If you cannot replay a failed request from logs alone, you are not observable yet.
from dataclasses import dataclass, asdict
import logging

logger = logging.getLogger("chat.shipped")

@dataclass
class ShippedRequestContext:
    request_id: str
    tenant_id: str
    feature: str
    model: str
    prompt_version: str
    stream: bool

def log_request_start(ctx: ShippedRequestContext, input_tokens_est: int) -> None:
    logger.info("llm_request_start", extra={
        **asdict(ctx),
        "input_tokens_est": input_tokens_est,
        "pillar": "observable",
    })
public record ShippedRequestContext(
    String requestId, String tenantId, String feature,
    String model, String promptVersion, boolean stream) {}

public void logRequestStart(ShippedRequestContext ctx, int inputTokensEst) {
    structuredLog.info("llm_request_start", Map.of(
        "request_id", ctx.requestId(),
        "tenant_id", ctx.tenantId(),
        "model", ctx.model(),
        "prompt_version", ctx.promptVersion(),
        "input_tokens_est", inputTokensEst));
}
A common trap is calling a feature “shipped” because the flag is on for 5% of users while eval CI, cost caps, and on-call runbooks are still missing. A soft launch is valid, but without the four pillars you have a demo with traffic.
“How do you know an LLM feature is production-ready?” — Name the four pillars with concrete artifacts: Datadog dashboards, token budgets, fallback chain, golden eval in GitHub Actions.
Notion’s internal checklist for new AI surfaces requires observability schema approval and eval smoke green before any external beta—feature code is the last gate, not the first.
“Evaluatable” does not mean perfect accuracy—it means measurable accuracy on a frozen dataset with a defined regression policy. Vibes are not a metric.
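A regression policy can be as small as a function that compares the pass rate on the frozen dataset with a stored baseline. The sketch below is illustrative only: `EvalResult`, `gate`, and the `min_pass_rate` and `max_drop` thresholds are hypothetical names and values, not part of any Track 5 tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    case_id: str
    passed: bool


def pass_rate(results: list) -> float:
    """Fraction of golden cases that passed; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def gate(results: list, baseline_rate: float,
         min_pass_rate: float = 0.9, max_drop: float = 0.02) -> tuple:
    """Return (ok, reason). Fails on an absolute floor or on a drop versus baseline."""
    rate = pass_rate(results)
    if rate < min_pass_rate:
        return False, f"pass rate {rate:.2%} below floor {min_pass_rate:.2%}"
    if baseline_rate - rate > max_drop:
        return False, f"pass rate dropped {baseline_rate - rate:.2%} vs baseline"
    return True, f"pass rate {rate:.2%} ok"
```

A CI job would call `gate(...)` after the eval run and exit non-zero when the first element is `False`, which is what turns "measurable" into "blocking".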
Production readiness checklist
Use this checklist before flipping a feature to production—or before your architect signs off. Each item maps to an incident class teams actually page on. Skipping “secrets” because “we’ll fix later” is how keys end up in browser DevTools.
Checklist by category
| Category | Item | Verify |
|---|---|---|
| Logging | Structured JSON logs (not print) | request_id on every line |
| Logging | No raw PII in logs | Presidio or hash before sink |
| Logging | Prompt/chunk versioning | prompt_version, not full prompt in prod |
| Cost | Max tokens per request | Enforced server-side |
| Cost | Per-tenant daily budget | 429 when exceeded |
| Retry | Exponential backoff on 429/503 | Max 3 attempts; jitter |
| Retry | Idempotency for side-effect tools | Tool calls gated |
| Rate limits | Per-user and per-tenant RPM | At API gateway |
| Rate limits | Provider quota headroom | Alert at 80% TPM |
| SLO | p95 TTFT < target (e.g. 800ms) | Dashboard + alert |
| SLO | Error budget for 5xx | 0.1% monthly |
| Guardrails | Input moderation before LLM | Track 5 Guide 4 |
| Guardrails | Output scan before client | Fail closed on PII |
| CI eval | Smoke suite on PR | Non-zero exit on fail |
| CI eval | Full suite nightly | Artifact JSON report |
| Streaming | SSE heartbeat / timeout | Client reconnect policy |
| Streaming | Mid-stream error frame | User sees partial + error event |
| Fallback | Secondary model configured | Tested weekly |
| Fallback | Degraded mode copy | No silent empty responses |
| Secrets | Keys in vault / K8s secret | Never in repo or frontend |
| Secrets | Rotation runbook | <15 min blast radius |
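The "per-tenant daily budget, 429 when exceeded" row can be sketched as a small counter keyed by tenant and day. This is a minimal in-memory illustration under assumptions: `TenantBudget` and `try_spend` are hypothetical names, and a real deployment would likely keep the counters in a shared store so that all replicas see the same totals.

```python
from collections import defaultdict
from datetime import date
from typing import Optional


class TenantBudget:
    """In-memory daily token budget per tenant (illustrative; not shared across pods)."""

    def __init__(self, daily_limit_tokens: int):
        self.daily_limit = daily_limit_tokens
        self._used = defaultdict(int)  # (tenant_id, day) -> tokens spent

    def try_spend(self, tenant_id: str, tokens: int, today: Optional[date] = None) -> bool:
        """Reserve tokens; False means the caller should answer HTTP 429."""
        key = (tenant_id, today or date.today())
        if self._used[key] + tokens > self.daily_limit:
            return False
        self._used[key] += tokens
        return True

    def remaining(self, tenant_id: str, today: Optional[date] = None) -> int:
        """Tokens left for this tenant on the given day."""
        return self.daily_limit - self._used[(tenant_id, today or date.today())]
```

Keying on the calendar day means the budget resets without a cleanup job, at the cost of old keys lingering in memory until the process restarts.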
Pre-launch gate matrix
| Gate | Owner | Blocking? |
|---|---|---|
| eval/smoke.jsonl pass 100% | ML / product | Yes |
| Load test 2× expected QPS | Platform | Yes for external |
| On-call runbook linked in service README | Eng lead | Yes |
| Cost projection vs budget | FinOps | Warn |
| Shadow traffic 72h with full rails | Platform | Recommended |
import random
import time
from openai import APIStatusError, RateLimitError

RETRYABLE = {429, 500, 502, 503, 504}

def call_with_retry(fn, max_attempts: int = 3, base_delay: float = 0.5):
    """Call fn(), retrying retryable provider errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
        except APIStatusError as e:
            if e.status_code not in RETRYABLE or attempt == max_attempts - 1:
                raise
        # Backoff doubles each attempt (0.5s, 1s, 2s, ...) plus up to 250 ms of jitter.
        delay = base_delay * (2 ** attempt) + random.uniform(0, 0.25)
        time.sleep(delay)
public <T> T callWithRetry(Supplier<T> fn, int maxAttempts) {
    int attempt = 0;
    while (true) {
        try {
            return fn.get();
        } catch (RateLimitException | ServiceUnavailableException e) {
            if (++attempt >= maxAttempts) throw e;
            long delayMs = (long) (500 * Math.pow(2, attempt - 1) + Math.random() * 250);
            try {
                Thread.sleep(delayMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw e;
            }
        }
    }
}
Secrets checklist: no OPENAI_API_KEY in frontend bundles, CI logs, or error reports. Use short-lived tokens where possible; scope Bedrock IAM to inference-only actions.
Externalize all limits: MAX_OUTPUT_TOKENS, RPM_PER_TENANT, FALLBACK_MODEL. Hardcoding in source guarantees a Friday deploy surprise.
Pre-launch: estimate monthly spend = avg tokens/request × requests/day × 30 × (price per 1M tokens ÷ 1,000,000). Then budget 3× that figure as headroom for prompt growth and retry duplication.
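The estimate is easy to get wrong by a factor of a million, so it helps to write it down once as code. The sketch below splits input and output tokens because providers usually price them differently; that split, the function name, and the prices in the usage note are assumptions for illustration, not current vendor pricing.

```python
def monthly_cost_usd(
    input_tokens_per_req: float,
    output_tokens_per_req: float,
    requests_per_day: float,
    input_price_per_1m: float,
    output_price_per_1m: float,
    days: int = 30,
    headroom: float = 3.0,
) -> float:
    """Projected monthly spend in USD, including a headroom multiplier."""
    # Prices are quoted per 1M tokens, so divide by 1,000,000 to get cost per request.
    per_request = (
        input_tokens_per_req * input_price_per_1m
        + output_tokens_per_req * output_price_per_1m
    ) / 1_000_000
    return per_request * requests_per_day * days * headroom
```

With placeholder prices of 0.15 and 0.60 USD per 1M tokens, 1,500 input and 500 output tokens per request, and 10,000 requests per day, the raw estimate is 157.50 USD per month and the 3× budget is 472.50 USD.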
Store checklist status in a launch ticket—each checkbox links to evidence (dashboard URL, CI run, runbook). Auditors and future-you will thank you.
Insisting on 100% checklist compliance delays launch, so tier the requirements: internal tools can skip the external load test; customer-facing chat cannot skip guardrails or CI eval.
Streaming
Users perceive latency as time to first token (TTFT), not total generation time. Streaming improves UX and lets you cancel expensive decode early—but adds failure modes: mid-stream provider errors, proxy timeouts, and guardrails that need buffered text. This section covers provider streaming APIs, SSE to the browser, Spring WebFlux Flux, and error handling when the stream breaks halfway.
Provider streaming APIs
| Provider | Flag | Event shape | Notes |
|---|---|---|---|
| OpenAI | stream=True | SSE chunks; delta.content | stream_options.include_usage for billing |
| Anthropic | stream=True | SSE content_block_delta | Message start/delta/stop events |
| Bedrock | invoke_model_with_response_stream | Binary event stream | Converse API unified in 2024+ |
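from collections.abc import Generator
from openai import OpenAI

def stream_openai(messages: list[dict]) -> Generator[str, None, str]:
    """Yield text deltas as they arrive; the joined text is the generator's return value."""
    client = OpenAI()
    full = []
    # Assumes a recent openai SDK; older 1.x releases expose this helper
    # as client.beta.chat.completions.stream.
    with client.chat.completions.stream(
        model="gpt-4o-mini",
        messages=messages,
        max_tokens=1024,
    ) as stream:
        for event in stream:
            if event.type == "content.delta" and event.delta:
                chunk = event.delta
                full.append(chunk)
                yield chunk
    return "".join(full)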
// Spring AI ChatClient: stream helper
public Flux<String> streamOpenAi(List<Message> messages) {
    return chatClient.prompt(new Prompt(messages))
        .options(OpenAiChatOptions.builder().model("gpt-4o-mini").maxTokens(1024).build())
        .stream()
        .content()  // Flux<String> of text deltas
        // Drop only empty chunks; whitespace-only tokens carry real spaces and newlines.
        .filter(t -> !t.isEmpty());
}
import anthropic

def stream_anthropic(messages: list[dict]):
    client = anthropic.Anthropic()
    with client.messages.stream(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=messages,
    ) as stream:
        for text in stream.text_stream:
            yield text
// AnthropicChatModel streaming via Spring AI
public Flux<String> streamAnthropic(List<Message> messages) {
    return anthropicChatModel.stream(new Prompt(messages))
        // Chunks without text would make map() throw on null; mapNotNull skips them.
        .mapNotNull(r -> r.getResult() == null ? null : r.getResult().getOutput().getText());
}
import boto3

def stream_bedrock_converse(model_id: str, messages: list[dict]):
    client = boto3.client("bedrock-runtime")
    resp = client.converse_stream(modelId=model_id, messages=messages)
    for event in resp["stream"]:
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"]["delta"].get("text", "")
            if delta:
                yield delta
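// BedrockRuntimeAsyncClient: ConverseStreamResponseHandler built from a Visitor
var handler = ConverseStreamResponseHandler.builder()
    .subscriber(ConverseStreamResponseHandler.Visitor.builder()
        .onContentBlockDelta(d -> sink.next(d.delta().text()))
        .build())
    .onError(sink::error)        // surface provider failures to the Flux
    .onComplete(sink::complete)  // close the Flux when the stream ends
    .build();
bedrockAsync.converseStream(request, handler);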
SSE to the browser
Server-Sent Events (text/event-stream) is the standard pattern for browser chat UIs. Each token (or sentence buffer) becomes an SSE data: frame. Send a terminal event: done or JSON {"type":"done"} so clients know to close. Proxies (nginx, ALB) need idle timeout > max generation time—often 60–300s for long answers.
import json
import logging
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
logger = logging.getLogger("chat.stream")

def sse_chat_generator(user: str):
    # Plain (sync) generator: Starlette iterates it in a threadpool, so the
    # blocking provider stream does not stall the event loop.
    try:
        for token in stream_openai([{"role": "user", "content": user}]):
            yield f"data: {json.dumps({'token': token})}\n\n"
        yield f"data: {json.dumps({'type': 'done'})}\n\n"
    except Exception:
        # Keep exception details in server logs; the client gets a user-safe frame.
        logger.exception("stream_failed")
        yield f"event: error\ndata: {json.dumps({'message': 'stream_failed'})}\n\n"

@app.post("/chat/stream")
async def chat_stream(body: dict):
    return StreamingResponse(
        sse_chat_generator(body["message"]),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )
@PostMapping(value = "/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<ServerSentEvent<String>> chatStream(@RequestBody ChatRequest body) {
    return streamOpenAi(List.of(new UserMessage(body.message())))
        .map(token -> ServerSentEvent.builder(token).build())
        .concatWith(Flux.just(ServerSentEvent.builder("{\"type\":\"done\"}").event("done").build()))
        .onErrorResume(e -> Flux.just(
            ServerSentEvent.builder("{\"message\":\"stream_failed\"}").event("error").build()));
}
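The wire format behind these endpoints is small enough to show end to end. The following sketch formats and parses SSE frames the way the token, done, and error events in this section are shaped; `format_sse` and `parse_sse` are hypothetical helper names, and the parser handles only complete frames with JSON payloads, which is a simplification of the full SSE specification.

```python
import json
from typing import Optional


def format_sse(data: dict, event: Optional[str] = None) -> str:
    """Serialize one frame: optional 'event:' line, one 'data:' line, blank-line terminator."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


def parse_sse(stream_text: str) -> list:
    """Parse complete frames into (event_name, payload); ':' comment lines are skipped."""
    frames = []
    for block in stream_text.split("\n\n"):
        event, data_lines = "message", []
        for line in block.split("\n"):
            if not line or line.startswith(":"):
                continue  # blank line or heartbeat comment
            field, _, value = line.partition(":")
            value = value[1:] if value.startswith(" ") else value
            if field == "event":
                event = value
            elif field == "data":
                data_lines.append(value)
        if data_lines:
            frames.append((event, json.loads("\n".join(data_lines))))
    return frames
```

Frames without an `event:` line arrive under the default event name `message`, which is why a client can listen for plain tokens and named `error` events separately.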
Stream error handling
| Failure | User experience | Server action |
|---|---|---|
| Provider 503 mid-stream | Partial answer + error banner | Emit SSE error event; log tokens_emitted |
| Client disconnect | User navigated away | Cancel provider stream; stop billing if supported |
| Output guardrail hit | Replace with safe message | Do not forward raw toxic partial |
| Proxy timeout | Spinner forever | Heartbeat comment frames every 15s |
import asyncio
import json

async def sse_with_heartbeat(token_gen, heartbeat_sec: float = 15.0):
    # Caveat: the clock is only checked when a token arrives, so a fully
    # stalled upstream produces no heartbeat frames from this helper.
    loop = asyncio.get_running_loop()
    last = loop.time()
    async for token in token_gen:
        now = loop.time()
        if now - last > heartbeat_sec:
            yield ": heartbeat\n\n"
            last = now
        yield f"data: {json.dumps({'token': token})}\n\n"
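// Flux.merge with interval heartbeat
// publish().refCount(2) shares one upstream subscription; without it the
// provider stream would be opened twice (once by merge, once by takeUntilOther).
Flux<ServerSentEvent<String>> tokens =
    streamOpenAi(messages).map(this::dataEvent).publish().refCount(2);
Flux<ServerSentEvent<String>> heartbeats = Flux.interval(Duration.ofSeconds(15))
    .map(i -> ServerSentEvent.<String>builder().comment("heartbeat").build());
return Flux.merge(tokens, heartbeats).takeUntilOther(tokens.ignoreElements());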
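The Python heartbeat helper in this section checks the clock only when a token arrives, so it stays silent during exactly the stall a proxy would time out on. One possible way to cover that case is to wait on the next token with a timeout while leaving the pending read untouched. The function name below is hypothetical, and the sketch assumes `token_gen` is an async iterator of strings.

```python
import asyncio
import json


async def heartbeat_frames(token_gen, heartbeat_sec: float = 15.0):
    """Yield SSE data frames, plus comment frames whenever the upstream is idle."""
    it = token_gen.__aiter__()
    pending = asyncio.ensure_future(it.__anext__())
    try:
        while True:
            # asyncio.wait does not cancel the task on timeout, so no token is lost.
            done, _ = await asyncio.wait({pending}, timeout=heartbeat_sec)
            if not done:
                yield ": heartbeat\n\n"
                continue
            try:
                token = pending.result()
            except StopAsyncIteration:
                return
            yield f"data: {json.dumps({'token': token})}\n\n"
            pending = asyncio.ensure_future(it.__anext__())
    finally:
        # Covers client disconnects: stop reading from the upstream generator.
        pending.cancel()
```

`asyncio.wait_for` would be shorter, but it cancels the awaited read on timeout, which can break the upstream generator mid-token; keeping one long-lived task per read avoids that.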
A common mistake is running output moderation only after the full stream completes: by then users have already seen the PII or toxic text for several seconds. Buffer to sentence boundaries or scan chunks in parallel.
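Buffering to sentence boundaries can be sketched as a generator that releases text only after a scan passes. Everything named here is an assumption for illustration: `moderated_stream` is hypothetical, `is_safe` stands in for whatever output guardrail the service actually uses, and the sentence splitter is a deliberately naive regex.

```python
import re
from typing import Callable, Iterable, Iterator

# Naive boundary: ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"[.!?]\s")


def moderated_stream(
    tokens: Iterable[str],
    is_safe: Callable[[str], bool],
    safe_message: str = "[response withheld]",
) -> Iterator[str]:
    """Buffer tokens into sentences and scan each one before releasing it."""
    buffer = ""
    for token in tokens:
        buffer += token
        while (m := _SENTENCE_END.search(buffer)):
            sentence, buffer = buffer[: m.end()], buffer[m.end():]
            if not is_safe(sentence):
                yield safe_message
                return  # fail closed: nothing after the first violation is sent
            yield sentence
    if buffer:
        # Flush the trailing partial sentence through the same check.
        yield buffer if is_safe(buffer) else safe_message
```

The trade-off is latency: time to first visible text becomes time to first full sentence rather than time to first token.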
OpenAI streams are SSE over HTTPS from their edge. Your server holds one long-lived upstream connection per in-flight stream, so N concurrent chats mean N upstream connections; watch connection pool limits.
“How would you implement streaming chat?” — TTFT metrics, SSE vs WebSockets (SSE simpler for one-way), cancel on disconnect, error events, nginx buffering off.
ChatGPT-style UIs batch tokens every 16–32ms before DOM update to reduce layout thrash—server still streams per token; client debounces render.
Fallback strategies
Providers have outages. Models have bad days. Fallback is not optional for customer-facing chat: it is how you deliver pillar three, graceful degradation. The layers are cheaper/smaller model fallback, alternate provider, a feature degraded to search-only or canned FAQ, and unified routing via LiteLLM or an internal gateway (Guide 2).
Fallback layers
| Layer | Trigger | Behavior | Quality impact |
|---|---|---|---|
| Model fallback | Primary timeout / 5xx | gpt-4o → gpt-4o-mini | Moderate |
| Provider fallback | Regional outage | OpenAI → Anthropic → Bedrock | Variable (eval per pair) |
| Graceful degradation | All models fail | Static FAQ + “try again” | High UX hit; no hallucination |
| Cached semantic | Near-duplicate query | Return prior answer | Stale risk; great for FAQs |
LiteLLM intro
LiteLLM provides an OpenAI-compatible proxy across 100+ providers. Configure fallbacks in YAML, centralize keys, and emit unified logging for cost tracking. Platform teams often deploy LiteLLM as a sidecar before product services call any vendor directly—see Guide 2 for full gateway patterns.
from openai import OpenAI
import anthropic

# Ordered route chain: the primary first, then cheaper or alternate routes.
PRIMARY = ("openai", "gpt-4o")
FALLBACKS = [
    ("openai", "gpt-4o-mini"),
    ("anthropic", "claude-3-5-haiku-20241022"),
]

def chat_with_fallback(messages: list[dict]) -> tuple[str, str]:
    """Try each route in order; return (text, route_label) from the first that succeeds."""
    errors = []
    for provider, model in [PRIMARY] + FALLBACKS:
        try:
            if provider == "openai":
                c = OpenAI()
                r = c.chat.completions.create(model=model, messages=messages, timeout=30.0)
                return r.choices[0].message.content or "", f"{provider}/{model}"
            if provider == "anthropic":
                c = anthropic.Anthropic()
                r = c.messages.create(model=model, max_tokens=1024, messages=messages, timeout=30.0)
                text = "".join(b.text for b in r.content if b.type == "text")
                return text, f"{provider}/{model}"
        except Exception as e:
            # Record the failure and move on to the next route.
            errors.append(f"{provider}/{model}: {e}")
    raise RuntimeError("all fallbacks exhausted: " + "; ".join(errors))
public record ChatResult(String text, String routeUsed) {}

public ChatResult chatWithFallback(List<Message> messages) {
    List<Route> chain = List.of(
        new Route("openai", "gpt-4o-mini"),
        new Route("anthropic", "claude-3-5-haiku-20241022"));
    List<String> errors = new ArrayList<>();
    for (Route route : chain) {
        try {
            return new ChatResult(router.invoke(route, messages), route.label());
        } catch (Exception e) {
            errors.add(route.label() + ": " + e.getMessage());
        }
    }
    throw new ServiceUnavailableException("all fallbacks exhausted: " + errors);
}
# litellm_config.yaml: model fallbacks at proxy layer
model_list:
  - model_name: primary-chat
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: fallback-chat
    litellm_params:
      model: claude-3-5-haiku-20241022
      api_key: os.environ/ANTHROPIC_API_KEY
router_settings:
  fallbacks: [{"primary-chat": ["fallback-chat"]}]
  num_retries: 2
  timeout: 30
Graceful degradation copy
When all inference fails, return a deterministic response—never an empty 200 or infinite retry loop:
- Include incident_id for support correlation.
- Link to status page or FAQ search.
- Log degradation event as fallback_tier=static for SLO exclusion policy.
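The three bullets above can be sketched as one function that builds both the client payload and the matching log record. The function name, field names other than `incident_id` and `fallback_tier`, and the status URL are placeholders chosen for illustration.

```python
import uuid
from datetime import datetime, timezone


def degraded_response(errors: list, status_url: str = "https://status.example.com") -> tuple:
    """Return (client_payload, log_record) for the static fallback tier."""
    incident_id = f"inc_{uuid.uuid4().hex[:12]}"
    payload = {
        # Deterministic copy: no model output and no provider error text.
        "answer": "We can't generate an answer right now. Please try again shortly.",
        "degraded": True,
        "incident_id": incident_id,
        "status_url": status_url,
    }
    log_record = {
        "event": "llm_degraded",
        "fallback_tier": "static",
        "incident_id": incident_id,
        "error_count": len(errors),
        "ts": datetime.now(timezone.utc).isoformat(),
    }
    return payload, log_record
```

Sharing one `incident_id` between the payload and the log line is what lets support go from a user's screenshot to the failing request.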
Provider fallback doubles eval surface—golden cases must pass on primary and fallback routes, or you ship silent quality cliffs during outages.
Fallback to a larger model during outage can 10× spend for hours. Cap fallback RPM and alert when >5% traffic hits secondary for >15 minutes.
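One way to approximate the ">5% of traffic on secondary for >15 minutes" alert is to track the fallback share over a trailing 15-minute window. This is a simplification of a sustained-duration alert, and `FallbackMonitor` with its defaults is a hypothetical sketch rather than a feature of any metrics product.

```python
from collections import deque


class FallbackMonitor:
    """Tracks the share of requests served by a non-primary route in a sliding window."""

    def __init__(self, window_sec: float = 900.0, threshold: float = 0.05):
        self.window_sec = window_sec
        self.threshold = threshold
        self._events = deque()  # (timestamp_sec, used_fallback)

    def record(self, ts: float, used_fallback: bool) -> None:
        """Add one request and evict entries that fell out of the window."""
        self._events.append((ts, used_fallback))
        while self._events and self._events[0][0] <= ts - self.window_sec:
            self._events.popleft()

    def fallback_share(self) -> float:
        if not self._events:
            return 0.0
        return sum(1 for _, fb in self._events if fb) / len(self._events)

    def should_alert(self) -> bool:
        return self.fallback_share() > self.threshold
```

In practice the same ratio is usually computed in the metrics backend from the `model_route` log field; the class only makes the arithmetic explicit.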
Weekly “fallback drill”: block primary model in staging via feature flag; assert p95 latency and eval smoke still green on secondary.
Each provider in the fallback chain needs the same guardrail pipeline—do not skip output moderation on Anthropic because OpenAI path had it.
Minimal service sketch
Ship a thin chat API you can dockerize and deploy behind your existing platform. It includes /health (process up) and /ready (can reach LLM or gateway), non-streaming /chat, structured JSON logging, and a Dockerfile: enough for K8s probes and a first production deploy.
Endpoint contract
| Route | Purpose | K8s probe |
|---|---|---|
| GET /health | Process alive | livenessProbe |
| GET /ready | Deps OK (API key present, gateway ping) | readinessProbe |
| POST /chat | Sync chat JSON in/out | — |
| POST /chat/stream | SSE stream | — |
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
import logging, uuid, os

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("chat.api")
app = FastAPI(title="hello-ship")

class ChatRequest(BaseModel):
    message: str = Field(max_length=8000)
    tenant_id: str = "default"

class ChatResponse(BaseModel):
    answer: str
    request_id: str
    model_route: str

@app.get("/health")
def health():
    return {"status": "ok"}

@app.get("/ready")
def ready():
    if not os.environ.get("OPENAI_API_KEY"):
        raise HTTPException(503, "missing OPENAI_API_KEY")
    return {"status": "ready"}

@app.post("/chat", response_model=ChatResponse)
def chat(req: ChatRequest):
    request_id = f"req_{uuid.uuid4().hex[:12]}"
    logger.info({"event": "chat_start", "request_id": request_id, "tenant_id": req.tenant_id})
    try:
        text, route = chat_with_fallback([{"role": "user", "content": req.message}])
    except RuntimeError:
        # Every route failed: answer with a clear 503 instead of an unhandled 500.
        logger.exception({"event": "chat_failed", "request_id": request_id})
        raise HTTPException(503, f"temporarily unavailable (request_id={request_id})")
    logger.info({"event": "chat_done", "request_id": request_id, "model_route": route})
    return ChatResponse(answer=text, request_id=request_id, model_route=route)
@RestController
@RequestMapping("/")
public class ChatController {
    @GetMapping("/health")
    public Map<String, String> health() { return Map.of("status", "ok"); }

    @GetMapping("/ready")
    public Map<String, String> ready() {
        // Assumes ServiceUnavailableException is mapped to HTTP 503 (e.g. via @ResponseStatus).
        if (System.getenv("OPENAI_API_KEY") == null) throw new ServiceUnavailableException("missing key");
        return Map.of("status", "ready");
    }

    @PostMapping("/chat")
    public ChatResponse chat(@RequestBody ChatRequest req) {
        // Strip hyphens first so the id is 12 hex characters, matching the Python service.
        String requestId = "req_" + UUID.randomUUID().toString().replace("-", "").substring(0, 12);
        log.info("chat_start request_id={} tenant_id={}", requestId, req.tenantId());
        var result = chatService.chatWithFallback(List.of(new UserMessage(req.message())));
        log.info("chat_done request_id={} route={}", requestId, result.routeUsed());
        return new ChatResponse(result.text(), requestId, result.routeUsed());
    }
}
# Same FastAPI shell — swap chat_with_fallback primary to Anthropic in config
MODEL_PRIMARY = os.environ.get("PRIMARY_MODEL", "claude-3-5-haiku-20241022")
PROVIDER = os.environ.get("LLM_PROVIDER", "anthropic")
// application.yml — spring.ai.model.chat=anthropic
// ChatService injects AnthropicChatModel when LLM_PROVIDER=anthropic
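# ready() check: verify AWS credentials resolve (bedrock-runtime reachability is not tested here)
import boto3

def bedrock_ready() -> bool:
    try:
        boto3.client("sts").get_caller_identity()
    except Exception:
        return False  # report "not ready" instead of raising inside the probe
    return True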
// Bedrock readiness — BedrockRuntimeClient.invokeModel with minimal prompt in /ready (cache 30s)
Structured logging
Use JSON logs ingestible by Datadog, CloudWatch, or ELK. Always log request_id, latency_ms, model_route, input_tokens, output_tokens on completion—never log full user text in production.
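import json, logging, sys

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {"level": record.levelname}
        if isinstance(record.msg, dict):
            # Dict messages become top-level JSON fields instead of a stringified repr.
            payload.update(record.msg)
        else:
            payload["msg"] = record.getMessage()
        return json.dumps(payload, default=str)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
api_logger = logging.getLogger("chat.api")
api_logger.handlers = [handler]
api_logger.propagate = False  # avoid a duplicate plain-text line from the root handler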
// logback-spring.xml — LogstashEncoder or custom JSON layout
// MDC.put("request_id", requestId) in filter; clear in finally
Dockerfile (Python)
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV PORT=8080
EXPOSE 8080
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
Mount secrets at runtime (OPENAI_API_KEY from K8s Secret or AWS Secrets Manager). Set resource limits—LLM apps are I/O bound waiting on providers; 512Mi–1Gi RAM and 0.5–1 CPU per pod is a typical starting point.
K8s: liveness → /health; readiness → /ready. Do not call the LLM in liveness—provider blip should not restart pods.
Teams ship v1 as a single FastAPI service behind existing API gateway auth; extract LLM gateway (Guide 2) when the second product team copies your OpenAI key.
/ready should be cheap—env var check or cached gateway ping every 30s—not a full chat completion on every kubelet probe.
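A cached readiness check along those lines might look like the sketch below. `CachedReadiness` is a hypothetical helper; the `check` callable stands in for whatever dependency probe the service uses (a gateway ping, for instance), and the injectable `clock` exists only to make the caching testable.

```python
import time
from typing import Callable


class CachedReadiness:
    """Caches the result of an expensive dependency check for ttl_sec seconds."""

    def __init__(
        self,
        check: Callable[[], bool],
        ttl_sec: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._check = check
        self._ttl = ttl_sec
        self._clock = clock
        self._last_at = None  # time of the last real check, None before the first
        self._last_ok = False

    def is_ready(self) -> bool:
        now = self._clock()
        if self._last_at is None or now - self._last_at >= self._ttl:
            try:
                self._last_ok = bool(self._check())
            except Exception:
                self._last_ok = False  # a failing probe means "not ready", not a crash
            self._last_at = now
        return self._last_ok
```

The `/ready` handler would then return 200 or 503 based on `is_ready()`, so kubelet probes hit the cache and the dependency sees at most one check per TTL.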
Production: Track 6 Guide 1 checklist
Close Guide 1 with a shipping checklist you can paste into a launch ticket—then continue to LLM gateway & infrastructure when multiple services need shared routing, caching, and tenant keys.
Track 6 Guide 1 production checklist
- Four pillars documented for your feature: observable, cost-controlled, gracefully degrading, evaluatable.
- Readiness checklist complete: logging, cost caps, retry, rate limits, SLO targets, guardrails, CI eval, streaming errors, fallback chain, secrets.
- /health and /ready wired to orchestrator probes.
- Structured JSON logs with request_id, model_route, token counts.
- Streaming endpoint emits error SSE events; heartbeat for long generations.
- Model and provider fallback tested in staging drill.
- API keys server-side only; rotation runbook linked.
- eval/smoke.jsonl green on deploy; link to Track 5 CI gates.
- On-call runbook: provider outage, cost spike, stream failure.
- Container image scanned; non-root user where possible.
| Risk | Guide 1 control | Next guide |
|---|---|---|
| Key sprawl across services | Single service first; plan gateway | Guide 2 |
| Uncapped token spend | Max tokens + tenant budget | Guide 2 — gateway quotas |
| Silent outage UX | Fallback + static degradation | Guide 6 — architecture |
| Quality drift | CI eval from Track 5 | All guides — eval in CI |
import httpx

def post_deploy_smoke(base_url: str) -> None:
    r = httpx.get(f"{base_url}/ready", timeout=5.0)
    r.raise_for_status()
    r = httpx.post(f"{base_url}/chat", json={"message": "ping", "tenant_id": "smoke"}, timeout=60.0)
    r.raise_for_status()
    assert r.json().get("answer")
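public void postDeploySmoke(String baseUrl) {
    rest.get().uri(baseUrl + "/ready").retrieve().toBodilessEntity();
    var resp = rest.post().uri(baseUrl + "/chat")
        .body(new ChatRequest("ping", "smoke"))
        .retrieve().body(ChatResponse.class);
    // Explicit check: Java assert statements are skipped unless the JVM runs with -ea.
    if (resp == null || resp.answer() == null || resp.answer().isBlank()) {
        throw new IllegalStateException("smoke check failed: empty answer from /chat");
    }
}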
You defined shipped, ran the readiness checklist, implemented streaming with error handling, configured fallbacks, and sketched a minimal deployable service. Guide 2 adds the LLM gateway—shared routing, semantic cache, and multi-tenant keys when one service becomes many.
“Walk me from notebook to production.” — Health/ready probes, structured logs, streaming SSE, retry/fallback, secrets manager, eval CI from Track 5, guardrails on path.
Minimal service v1 cost is mostly inference—infra for a single small pod is noise until you forget token caps. Ship caps before you ship features.
Track 6 guide sequence
| Guide | Topic | You are here |
|---|---|---|
| Guide 1 | Hello Ship — API to production | ✓ |
| Guide 2 | LLM gateway & infrastructure | next → |
| Guides 3–6 | Fine-tuning, multimodal, extraction, architecture | — |
FAQ: Sync or stream first?
Ship sync /chat for integrations and eval runners; add /chat/stream for UI—same backend, shared fallback and logging.
FAQ: When to skip Docker?
Never skip container definition for production—even if you deploy to Lambda, ship a reproducible artifact. Dockerfile or equivalent is the contract platform teams need.