Hello Ship — API to production

What “shipped” means

A Jupyter notebook that calls chat.completions.create once is not shipped. Shipped means paying customers—or internal teams with SLAs—depend on your feature daily, and you can answer four questions without opening Slack: Is it healthy? What did it cost? What happens when OpenAI 503s? Did last week’s prompt change break refunds?

Production LLM features fail differently from CRUD APIs. Correctness is probabilistic, latency has a long tail (time-to-first-token plus decode), and cost scales with usage, not with CPU cores. Teams that treat the LLM as “just another REST dependency” discover the difference during their first incident. Teams that ship deliberately treat inference as an unreliable, metered backend with guardrails, eval gates, and fallbacks already wired.

Four pillars of a shipped LLM feature

Pillar	Question it answers	Minimum bar	Track reference
Observable	What happened on request req_abc?	Structured logs, trace IDs, token/cost fields, latency histograms	Track 5 — observability
Cost-controlled	Will a bug loop bankrupt us?	Per-tenant budgets, max tokens, rate limits, model routing	Guide 2 — gateway
Gracefully degrading	What do users see when the model is down?	Fallback model, cached answer, canned response—not infinite spinner	This guide § fallback
Evaluatable	Did we regress quality?	Golden cases in CI, smoke on deploy, online sampling	Track 5 — CI eval

Demo vs production (honest comparison)

Dimension	Demo	Shipped
API key	In notebook or frontend env	Server-side only; rotated via secrets manager
Errors	Print stack trace	Retry with backoff; user-safe message + incident id
Latency	Wait for full response	Stream TTFT; timeout per stage
Quality	“Looks good to me”	Eval suite ≥ threshold; block deploy on regression
Safety	System prompt only	Input/output guardrails + adversarial CI
Cost	Unbounded	Token caps, per-user daily budget, cheaper model routing

flowchart LR
  Client["Browser / API client"]
  Edge["API gateway\nauth · rate limit"]
  Svc["Chat service\nhealth · ready"]
  Guard["Guardrails\ninput · output"]
  LLM["LLM provider\nstream · retry"]
  Obs["Observability\ntraces · cost"]
  Client --> Edge --> Svc --> Guard --> LLM
  Svc --> Obs
  Guard --> Obs
  LLM --> Obs

The diagram is intentionally boring—boring is good. Every arrow should emit a span or log line with request_id, tenant_id, model, and prompt_version. If you cannot replay a failed request from logs alone, you are not observable yet.

from dataclasses import dataclass, asdict
import logging

logger = logging.getLogger("chat.shipped")

@dataclass
class ShippedRequestContext:
    request_id: str
    tenant_id: str
    feature: str
    model: str
    prompt_version: str
    stream: bool

def log_request_start(ctx: ShippedRequestContext, input_tokens_est: int) -> None:
    logger.info("llm_request_start", extra={
        **asdict(ctx),
        "input_tokens_est": input_tokens_est,
        "pillar": "observable",
    })

public record ShippedRequestContext(
    String requestId, String tenantId, String feature,
    String model, String promptVersion, boolean stream) {}

public void logRequestStart(ShippedRequestContext ctx, int inputTokensEst) {
  structuredLog.info("llm_request_start", Map.of(
      "request_id", ctx.requestId(),
      "tenant_id", ctx.tenantId(),
      "model", ctx.model(),
      "prompt_version", ctx.promptVersion(),
      "input_tokens_est", inputTokensEst));
}

⚠️ Pitfall

Calling “shipped” because the feature flag is on for 5% of users without eval CI, cost caps, or on-call runbooks. Soft launch is valid—but without the four pillars you have a demo with traffic.

🎯 Interview Tip

“How do you know an LLM feature is production-ready?” — Name the four pillars with concrete artifacts: Datadog dashboards, token budgets, fallback chain, golden eval in GitHub Actions.

📦 Real World

Notion’s internal checklist for new AI surfaces requires observability schema approval and eval smoke green before any external beta—feature code is the last gate, not the first.

🔬 Under the Hood

“Evaluatable” does not mean perfect accuracy—it means measurable accuracy on a frozen dataset with a defined regression policy. Vibes are not a metric.
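
One way to make a "defined regression policy" concrete is a small gate that compares the current pass rate on the frozen dataset against a stored baseline. The sketch below is illustrative only: `RegressionPolicy`, `gate`, and the threshold values (90% floor, 2-point allowed drop) are assumptions, not values from this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegressionPolicy:
    min_pass_rate: float = 0.90  # absolute floor (assumed value)
    max_drop: float = 0.02       # allowed drop versus baseline (assumed value)


def pass_rate(results: list[bool]) -> float:
    """Fraction of golden cases that passed; an empty suite counts as 0.0."""
    return sum(results) / len(results) if results else 0.0


def gate(results: list[bool], baseline: float,
         policy: RegressionPolicy = RegressionPolicy()) -> tuple[bool, str]:
    """Return (ok, reason). Fails on the floor first, then on drop versus baseline."""
    rate = pass_rate(results)
    if rate < policy.min_pass_rate:
        return False, f"pass rate {rate:.2%} is below floor {policy.min_pass_rate:.2%}"
    if baseline - rate > policy.max_drop:
        return False, f"pass rate dropped {baseline - rate:.2%} versus baseline"
    return True, f"pass rate {rate:.2%} ok"
```

In CI, a failed gate would typically translate into a non-zero exit code so the deploy is blocked.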

Production readiness checklist

Use this checklist before flipping a feature to production—or before your architect signs off. Each item maps to an incident class teams actually page on. Skipping “secrets” because “we’ll fix later” is how keys end up in browser DevTools.

Checklist by category

Category	Item	Verify
Logging	Structured JSON logs (not print)	request_id on every line
Logging	No raw PII in logs	Presidio or hash before sink
Logging	Prompt/chunk versioning	prompt_version, not full prompt in prod
Cost	Max tokens per request	Enforced server-side
Cost	Per-tenant daily budget	429 when exceeded
Retry	Exponential backoff on 429/503	Max 3 attempts; jitter
Retry	Idempotency for side-effect tools	Tool calls gated
Rate limits	Per-user and per-tenant RPM	At API gateway
Rate limits	Provider quota headroom	Alert at 80% TPM
SLO	p95 TTFT < target (e.g. 800ms)	Dashboard + alert
SLO	Error budget for 5xx	0.1% monthly
Guardrails	Input moderation before LLM	Track 5 Guide 4
Guardrails	Output scan before client	Fail closed on PII
CI eval	Smoke suite on PR	Non-zero exit on fail
CI eval	Full suite nightly	Artifact JSON report
Streaming	SSE heartbeat / timeout	Client reconnect policy
Streaming	Mid-stream error frame	User sees partial + error event
Fallback	Secondary model configured	Tested weekly
Fallback	Degraded mode copy	No silent empty responses
Secrets	Keys in vault / K8s secret	Never in repo or frontend
Secrets	Rotation runbook	Rotation completes in <15 min to limit blast radius
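
The "per-tenant daily budget, 429 when exceeded" row can be sketched as a small counter keyed by tenant and day. This is an in-memory illustration under stated assumptions: `TenantBudget` and `try_spend` are hypothetical names, and a multi-pod deployment would likely need a shared store (for example Redis or the gateway) instead of a process-local dict.

```python
from collections import defaultdict
from datetime import date
from typing import Optional


class TenantBudget:
    """In-memory daily token budget per tenant (illustrative, not shared across pods)."""

    def __init__(self, daily_limit_tokens: int):
        self.daily_limit = daily_limit_tokens
        self._used: dict[tuple[str, date], int] = defaultdict(int)

    def try_spend(self, tenant_id: str, tokens: int, today: Optional[date] = None) -> bool:
        """Reserve tokens for today; False means the caller should answer HTTP 429."""
        key = (tenant_id, today or date.today())
        if self._used[key] + tokens > self.daily_limit:
            return False
        self._used[key] += tokens
        return True
```

A request handler would call `try_spend` with the estimated input plus maximum output tokens before calling the provider, and return 429 when it yields `False`.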

Pre-launch gate matrix

Gate	Owner	Blocking?
eval/smoke.jsonl pass 100%	ML / product	Yes
Load test 2× expected QPS	Platform	Yes for external
On-call runbook linked in service README	Eng lead	Yes
Cost projection vs budget	FinOps	Warn
Shadow traffic 72h with full rails	Platform	Recommended
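
The gate matrix above distinguishes blocking gates from warnings; a launch script could encode that distinction roughly as follows. `GateResult`, `Severity`, and `launch_decision` are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    BLOCK = "block"  # a failure stops the launch
    WARN = "warn"    # a failure is reported but does not stop the launch


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool
    severity: Severity


def launch_decision(results: list[GateResult]) -> tuple[bool, list[str], list[str]]:
    """Return (can_launch, failed_blockers, warnings)."""
    blockers = [r.name for r in results if not r.passed and r.severity is Severity.BLOCK]
    warnings = [r.name for r in results if not r.passed and r.severity is Severity.WARN]
    return (not blockers), blockers, warnings
```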

import random
import time
from openai import OpenAI, RateLimitError, APIStatusError

RETRYABLE = {429, 500, 502, 503, 504}

def call_with_retry(fn, max_attempts: int = 3, base_delay: float = 0.5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
        except APIStatusError as e:
            if e.status_code not in RETRYABLE or attempt == max_attempts - 1:
                raise
        delay = base_delay * (2 ** attempt) + random.uniform(0, 0.25)
        time.sleep(delay)

public <T> T callWithRetry(Supplier<T> fn, int maxAttempts) {
  int attempt = 0;
  while (true) {
    try {
      return fn.get();
    } catch (RateLimitException | ServiceUnavailableException e) {
      if (++attempt >= maxAttempts) throw e;
      long delayMs = (long) (500 * Math.pow(2, attempt - 1) + Math.random() * 250);
      try { Thread.sleep(delayMs); } catch (InterruptedException ie) { Thread.currentThread().interrupt(); throw e; }
    }
  }
}

🔒 Security

Secrets checklist: no OPENAI_API_KEY in frontend bundles, CI logs, or error reports. Use short-lived tokens where possible; scope Bedrock IAM to inference-only actions.

⚙️ Config

Externalize all limits: MAX_OUTPUT_TOKENS, RPM_PER_TENANT, FALLBACK_MODEL. Hardcoding in source guarantees a Friday deploy surprise.
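
A minimal sketch of reading those limits from the environment, assuming the three variable names above. The default values and the `Limits` / `load_limits` names are illustrative assumptions, not recommendations from this guide.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Limits:
    max_output_tokens: int
    rpm_per_tenant: int
    fallback_model: str


def load_limits(env: Optional[Mapping[str, str]] = None) -> Limits:
    """Read limits from env vars; defaults here are placeholders for local runs."""
    env = os.environ if env is None else env
    limits = Limits(
        max_output_tokens=int(env.get("MAX_OUTPUT_TOKENS", "1024")),
        rpm_per_tenant=int(env.get("RPM_PER_TENANT", "60")),
        fallback_model=env.get("FALLBACK_MODEL", "gpt-4o-mini"),
    )
    # Fail fast at startup rather than on the first request.
    if limits.max_output_tokens <= 0 or limits.rpm_per_tenant <= 0:
        raise ValueError("MAX_OUTPUT_TOKENS and RPM_PER_TENANT must be positive")
    return limits
```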

💰 Cost

Pre-launch: estimate monthly spend = (avg tokens/request × requests/day × 30 ÷ 1,000,000) × price per 1M tokens. Add 3× headroom for prompt growth and retry duplication. Price input and output tokens separately when their rates differ.
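
The estimate can be checked with a few lines of arithmetic. The function name and the example price below are assumptions for illustration; substitute the provider's current rates.

```python
def monthly_spend_usd(avg_tokens_per_request: float, requests_per_day: float,
                      price_per_million_usd: float, days: int = 30,
                      headroom: float = 3.0) -> tuple[float, float]:
    """Return (base_estimate, estimate_with_headroom) in USD."""
    total_tokens = avg_tokens_per_request * requests_per_day * days
    base = total_tokens / 1_000_000 * price_per_million_usd
    return base, base * headroom


# Hypothetical workload: 1,500 tokens/request, 10,000 requests/day, $0.60 per 1M tokens.
base, with_headroom = monthly_spend_usd(1500, 10_000, 0.60)
```

For this hypothetical workload the month totals 450M tokens, which works out to about $270 base and $810 with 3× headroom.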

💡 Tip

Store checklist status in a launch ticket—each checkbox links to evidence (dashboard URL, CI run, runbook). Auditors and future-you will thank you.

⚖️ Trade-off

100% checklist compliance delays launch. Tier: internal tools can skip external load test; customer-facing chat cannot skip guardrails or CI eval.

Streaming

Users perceive latency as time to first token (TTFT), not total generation time. Streaming improves UX and lets you cancel expensive decode early—but adds failure modes: mid-stream provider errors, proxy timeouts, and guardrails that need buffered text. This section covers provider streaming APIs, SSE to the browser, Spring WebFlux Flux, and error handling when the stream breaks halfway.
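
Because TTFT is the number users feel, it helps to measure it at the point where tokens leave your service. The wrapper below is a sketch: `measure_ttft` and the `on_metrics` callback are hypothetical names, and the injectable `clock` exists only to make the example testable.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_ttft(tokens: Iterable[str], on_metrics: Callable[[dict], None],
                 clock: Callable[[], float] = time.monotonic) -> Iterator[str]:
    """Pass tokens through unchanged and report TTFT and total time when the stream ends."""
    start = clock()
    first = None
    count = 0
    try:
        for tok in tokens:
            if first is None:
                first = clock()  # time of the first chunk
            count += 1
            yield tok
    finally:
        # Runs on normal completion, on error, and when the consumer closes the stream.
        end = clock()
        on_metrics({
            "ttft_ms": None if first is None else (first - start) * 1000,
            "total_ms": (end - start) * 1000,
            "chunks": count,
        })
```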

Provider streaming APIs

Provider	Flag	Event shape	Notes
OpenAI	stream=True	SSE chunks; delta.content	stream_options.include_usage for billing
Anthropic	stream=True	SSE content_block_delta	Message start/delta/stop events
Bedrock	invoke_model_with_response_stream	Binary event stream	Converse API (converse_stream, 2024) unifies the message format across models

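from collections.abc import Generator
from openai import OpenAI

# Generator: yields text chunks; the joined text is the generator's return value.
def stream_openai(messages: list[dict]) -> Generator[str, None, str]: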
    client = OpenAI()
    full = []
    with client.chat.completions.stream(
        model="gpt-4o-mini",
        messages=messages,
        max_tokens=1024,
    ) as stream:
        for event in stream:
            if event.type == "content.delta" and event.delta:
                chunk = event.delta
                full.append(chunk)
                yield chunk
    return "".join(full)

// Spring AI ChatClient — stream helper
public Flux<String> streamOpenAi(List<Message> messages) {
  return chatClient.prompt(new Prompt(messages))
      .options(OpenAiChatOptions.builder().model("gpt-4o-mini").maxTokens(1024).build())
      .stream()
      .content()  // Flux<String> of text deltas
      // Drop only empty chunks; whitespace-only tokens carry the spaces and newlines.
      .filter(t -> t != null && !t.isEmpty());
}

import anthropic

def stream_anthropic(messages: list[dict]):
    client = anthropic.Anthropic()
    with client.messages.stream(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=messages,
    ) as stream:
        for text in stream.text_stream:
            yield text

// AnthropicMessageChatModel streaming via Spring AI
public Flux<String> streamAnthropic(List<Message> messages) {
  return anthropicChatModel.stream(new Prompt(messages))
      .map(r -> r.getResult().getOutput().getText());
}

import boto3, json

def stream_bedrock_converse(model_id: str, messages: list[dict]):
    client = boto3.client("bedrock-runtime")
    resp = client.converse_stream(modelId=model_id, messages=messages)
    for event in resp["stream"]:
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"]["delta"].get("text", "")
            if delta:
                yield delta

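// BedrockRuntimeAsyncClient with ConverseStreamResponseHandler (sink is an assumed FluxSink<String>)
ConverseStreamResponseHandler handler = ConverseStreamResponseHandler.builder()
    .subscriber(ConverseStreamResponseHandler.Visitor.builder()
        .onContentBlockDelta(d -> sink.next(d.delta().text()))
        .build())
    .onError(sink::error)
    .onComplete(sink::complete)
    .build();
bedrockAsync.converseStream(request, handler);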

SSE to the browser

Server-Sent Events (text/event-stream) is the standard pattern for browser chat UIs. Each token (or sentence buffer) becomes an SSE data: frame. Send a terminal event: done or JSON {"type":"done"} so clients know to close. Proxies (nginx, ALB) need idle timeout > max generation time—often 60–300s for long answers.
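
The wire format described above is small enough to sketch directly. `sse_frame` and `sse_comment` are hypothetical helper names; the framing itself (`event:` and `data:` fields, a blank line terminating each event, lines starting with `:` treated as comments) follows the SSE format.

```python
import json
from typing import Any, Optional


def sse_frame(data: Any, event: Optional[str] = None) -> str:
    """Serialize one SSE event. json.dumps escapes newlines, so one data line is enough."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"  # blank line terminates the event


def sse_comment(text: str) -> str:
    """Comment frame: ignored by EventSource clients, useful as a keep-alive."""
    return f": {text}\n\n"
```

A token frame would be `sse_frame({"token": "Hi"})`, and the terminal frame `sse_frame({"type": "done"}, event="done")`.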

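import json
import logging
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

# Sync generator on purpose: stream_openai blocks, and StreamingResponse runs
# sync iterators in a threadpool instead of stalling the event loop.
def sse_chat_generator(user: str):
    try:
        for token in stream_openai([{"role": "user", "content": user}]):
            yield f"data: {json.dumps({'token': token})}\n\n"
        yield f"data: {json.dumps({'type': 'done'})}\n\n"
    except Exception:
        # Log details server-side; send the client a user-safe error event only.
        logging.getLogger("chat.stream").exception("stream_failed")
        yield f"event: error\ndata: {json.dumps({'message': 'stream_failed'})}\n\n"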

@app.post("/chat/stream")
async def chat_stream(body: dict):
    return StreamingResponse(
        sse_chat_generator(body["message"]),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )

@PostMapping(value = "/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<ServerSentEvent<String>> chatStream(@RequestBody ChatRequest body) {
  return streamOpenAi(List.of(new UserMessage(body.message())))
      .map(token -> ServerSentEvent.builder(token).build())
      .concatWith(Flux.just(ServerSentEvent.builder("{\"type\":\"done\"}").event("done").build()))
      .onErrorResume(e -> Flux.just(
          ServerSentEvent.builder("{\"message\":\"stream_failed\"}").event("error").build()));
}

Stream error handling

Failure	User experience	Server action
Provider 503 mid-stream	Partial answer + error banner	Emit SSE error event; log tokens_emitted
Client disconnect	User navigated away	Cancel provider stream; stop billing if supported
Output guardrail hit	Replace with safe message	Do not forward raw toxic partial
Proxy timeout	Spinner forever	Heartbeat comment frames every 15s

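import asyncio, json

async def sse_with_heartbeat(token_gen, heartbeat_sec: float = 15.0):
    # Wait on the next token with a timeout so heartbeats fire while the provider is silent.
    it = token_gen.__aiter__()
    pending = asyncio.ensure_future(it.__anext__())
    try:
        while True:
            done, _ = await asyncio.wait({pending}, timeout=heartbeat_sec)
            if not done:
                yield ": heartbeat\n\n"
                continue
            try:
                token = pending.result()
            except StopAsyncIteration:
                break
            yield f"data: {json.dumps({'token': token})}\n\n"
            pending = asyncio.ensure_future(it.__anext__())
    finally:
        pending.cancel()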

// Flux.merge with interval heartbeat
Flux<ServerSentEvent<String>> tokens = streamOpenAi(messages).map(this::dataEvent);
Flux<ServerSentEvent<String>> heartbeats = Flux.interval(Duration.ofSeconds(15))
    .map(i -> ServerSentEvent.builder("").comment("heartbeat").build());
// Subscribing to tokens twice would call the provider twice; publish() shares one subscription.
return tokens.publish(shared ->
    Flux.merge(shared, heartbeats.takeUntilOther(shared.ignoreElements())));
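
The "Provider 503 mid-stream" row asks for two things: an SSE error event for the client and a `tokens_emitted` field in the server log. A minimal sketch of both, assuming a synchronous token iterator; `sse_with_error_frame` and the `partial` field are hypothetical names.

```python
import json
import logging
from typing import Iterable, Iterator

logger = logging.getLogger("chat.stream")


def sse_with_error_frame(tokens: Iterable[str], request_id: str) -> Iterator[str]:
    """Forward tokens as SSE frames; on failure, log the progress and emit one error event."""
    emitted = 0
    try:
        for tok in tokens:
            emitted += 1
            yield f"data: {json.dumps({'token': tok})}\n\n"
    except Exception:
        # Provider details stay in the server log; the client gets a user-safe payload.
        logger.exception("stream_failed",
                         extra={"request_id": request_id, "tokens_emitted": emitted})
        body = {"message": "stream_failed", "request_id": request_id, "partial": emitted > 0}
        yield f"event: error\ndata: {json.dumps(body)}\n\n"
    else:
        yield f"data: {json.dumps({'type': 'done'})}\n\n"
```

The `partial` flag lets the UI decide whether to keep the text already rendered and show an error banner next to it.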

⚠️ Pitfall

Running output moderation only after the full stream completes—users already saw PII or toxic text for 3 seconds. Buffer to sentence boundaries or run parallel chunk scans.
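
Buffering to sentence boundaries could look like the sketch below. It assumes a caller-supplied `is_safe` predicate (a stand-in for whatever moderation check you run) and a naive sentence splitter on `.`, `!`, `?`; both are illustrative, and the splitter normalizes whitespace between sentences to a single space.

```python
import re
from typing import Callable, Iterable, Iterator

# Split after sentence-ending punctuation followed by whitespace (naive heuristic).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def moderated_stream(tokens: Iterable[str], is_safe: Callable[[str], bool],
                     safe_message: str = "[response withheld]") -> Iterator[str]:
    """Release text one complete sentence at a time; stop at the first unsafe sentence."""
    buffer = ""
    for tok in tokens:
        buffer += tok
        parts = _SENTENCE_END.split(buffer)
        for sentence in parts[:-1]:  # every part except the last is a complete sentence
            if not is_safe(sentence):
                yield safe_message
                return
            yield sentence + " "
        buffer = parts[-1]  # keep the unfinished tail for the next token
    if buffer:
        yield buffer if is_safe(buffer) else safe_message
```

The cost is latency: the first visible text arrives after the first full sentence rather than the first token.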

🔬 Under the Hood

OpenAI streams are SSE over HTTPS from their edge. Each downstream client stream holds its own upstream streaming request open for the whole generation, so N concurrent users means N long-lived upstream requests; watch connection pool limits.

🎯 Interview Tip

“How would you implement streaming chat?” — TTFT metrics, SSE vs WebSockets (SSE simpler for one-way), cancel on disconnect, error events, nginx buffering off.

📦 Real World

ChatGPT-style UIs batch tokens every 16–32ms before DOM update to reduce layout thrash—server still streams per token; client debounces render.

Fallback strategies

Providers have outages. Models have bad days. Fallback is not optional for customer-facing chat: it is how you deliver pillar three, graceful degradation. The layers are a cheaper or smaller model, an alternate provider, a degraded feature (search-only or canned FAQ), and unified routing via LiteLLM or an internal gateway (Guide 2).

Fallback layers

Layer	Trigger	Behavior	Quality impact
Model fallback	Primary timeout / 5xx	gpt-4o → gpt-4o-mini	Moderate
Provider fallback	Regional outage	OpenAI → Anthropic → Bedrock	Variable (eval per pair)
Graceful degradation	All models fail	Static FAQ + “try again”	High UX hit; no hallucination
Cached semantic	Near-duplicate query	Return prior answer	Stale risk; great for FAQs

LiteLLM intro

LiteLLM provides an OpenAI-compatible proxy across 100+ providers. Configure fallbacks in YAML, centralize keys, and emit unified logging for cost tracking. Platform teams often deploy LiteLLM as a sidecar before product services call any vendor directly—see Guide 2 for full gateway patterns.

from openai import OpenAI
import anthropic

# Primary first, then model fallback, then provider fallback (no duplicate routes).
PRIMARY = ("openai", "gpt-4o")
FALLBACKS = [
    ("openai", "gpt-4o-mini"),
    ("anthropic", "claude-3-5-haiku-20241022"),
]

def chat_with_fallback(messages: list[dict]) -> tuple[str, str]:
    errors = []
    for provider, model in [PRIMARY] + FALLBACKS:
        try:
            if provider == "openai":
                c = OpenAI()
                r = c.chat.completions.create(model=model, messages=messages, timeout=30.0)
                return r.choices[0].message.content or "", f"{provider}/{model}"
            if provider == "anthropic":
                c = anthropic.Anthropic()
                r = c.messages.create(model=model, max_tokens=1024, messages=messages)
                text = "".join(b.text for b in r.content if b.type == "text")
                return text, f"{provider}/{model}"
        except Exception as e:
            errors.append(f"{provider}/{model}: {e}")
    raise RuntimeError("all fallbacks exhausted: " + "; ".join(errors))

public record ChatResult(String text, String routeUsed) {}

public ChatResult chatWithFallback(List<Message> messages) {
  List<Route> chain = List.of(
      new Route("openai", "gpt-4o-mini"),
      new Route("anthropic", "claude-3-5-haiku-20241022"));
  List<String> errors = new ArrayList<>();
  for (Route route : chain) {
    try {
      return new ChatResult(router.invoke(route, messages), route.label());
    } catch (Exception e) {
      errors.add(route.label() + ": " + e.getMessage());
    }
  }
  throw new ServiceUnavailableException("all fallbacks exhausted: " + errors);
}

# litellm_config.yaml — model fallbacks at proxy layer
model_list:
  - model_name: primary-chat
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: fallback-chat
    litellm_params:
      model: claude-3-5-haiku-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  fallbacks: [{"primary-chat": ["fallback-chat"]}]
  num_retries: 2
  timeout: 30

Graceful degradation copy

When all inference fails, return a deterministic response—never an empty 200 or infinite retry loop:

Include incident_id for support correlation.
Link to status page or FAQ search.
Log degradation event as fallback_tier=static for SLO exclusion policy.
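
The three points above could be combined into one helper that the chat handler calls when every route has failed. This is a sketch: `degraded_response`, the response field names, the user-facing copy, and the status URL are all placeholders.

```python
import logging
import uuid

logger = logging.getLogger("chat.fallback")

STATUS_URL = "https://status.example.com"  # hypothetical status page


def degraded_response(request_id: str) -> dict:
    """Deterministic payload for the all-models-failed case."""
    incident_id = f"inc_{uuid.uuid4().hex[:10]}"
    # fallback_tier=static lets SLO queries include or exclude these requests by policy.
    logger.warning("llm_degraded", extra={
        "request_id": request_id,
        "incident_id": incident_id,
        "fallback_tier": "static",
    })
    return {
        "answer": "The assistant is temporarily unavailable. "
                  "Please try again in a few minutes or search the FAQ.",
        "incident_id": incident_id,
        "status_url": STATUS_URL,
        "model_route": "static",
    }
```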

⚖️ Trade-off

Provider fallback doubles eval surface—golden cases must pass on primary and fallback routes, or you ship silent quality cliffs during outages.

💰 Cost

Fallback to a larger model during outage can 10× spend for hours. Cap fallback RPM and alert when >5% traffic hits secondary for >15 minutes.
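
The alert condition can be approximated with a trailing-window counter. Note the simplification: this sketch alerts when the fallback share over the last 15 minutes exceeds 5%, which is not quite the same as "above 5% continuously for 15 minutes". `FallbackMonitor` is a hypothetical name, and a production setup would more likely express this as a metrics query.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class FallbackMonitor:
    """Tracks the share of requests served by a fallback route over a trailing window."""

    def __init__(self, window_sec: float = 900.0, threshold: float = 0.05,
                 clock: Callable[[], float] = time.monotonic):
        self.window_sec = window_sec
        self.threshold = threshold
        self._clock = clock
        self._events: Deque[Tuple[float, bool]] = deque()

    def _trim(self, now: float) -> None:
        # Drop events that fell out of the window.
        while self._events and now - self._events[0][0] > self.window_sec:
            self._events.popleft()

    def record(self, used_fallback: bool) -> None:
        now = self._clock()
        self._events.append((now, used_fallback))
        self._trim(now)

    def fallback_rate(self) -> float:
        self._trim(self._clock())
        if not self._events:
            return 0.0
        return sum(1 for _, fb in self._events if fb) / len(self._events)

    def should_alert(self) -> bool:
        return self.fallback_rate() > self.threshold
```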

💡 Tip

Weekly “fallback drill”: block primary model in staging via feature flag; assert p95 latency and eval smoke still green on secondary.

🔒 Security

Each provider in the fallback chain needs the same guardrail pipeline—do not skip output moderation on Anthropic because OpenAI path had it.

Minimal service sketch

Ship a thin chat API you can dockerize and deploy behind your existing platform. Includes /health (process up) and /ready (can reach LLM or gateway), non-streaming /chat, structured JSON logging, and a Dockerfile mention—enough for K8s probes and a first production deploy.

Endpoint contract

Route	Purpose	K8s probe
GET /health	Process alive	livenessProbe
GET /ready	Deps OK (API key present, gateway ping)	readinessProbe
POST /chat	Sync chat JSON in/out	—
POST /chat/stream	SSE stream	—

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
import logging, uuid, os

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("chat.api")
app = FastAPI(title="hello-ship")

class ChatRequest(BaseModel):
    message: str = Field(max_length=8000)
    tenant_id: str = "default"

class ChatResponse(BaseModel):
    answer: str
    request_id: str
    model_route: str

@app.get("/health")
def health():
    return {"status": "ok"}

@app.get("/ready")
def ready():
    if not os.environ.get("OPENAI_API_KEY"):
        raise HTTPException(503, "missing OPENAI_API_KEY")
    return {"status": "ready"}

@app.post("/chat", response_model=ChatResponse)
def chat(req: ChatRequest):
    request_id = f"req_{uuid.uuid4().hex[:12]}"
    logger.info({"event": "chat_start", "request_id": request_id, "tenant_id": req.tenant_id})
    text, route = chat_with_fallback([{"role": "user", "content": req.message}])
    logger.info({"event": "chat_done", "request_id": request_id, "model_route": route})
    return ChatResponse(answer=text, request_id=request_id, model_route=route)

@RestController
@RequestMapping("/")
public class ChatController {
  @GetMapping("/health")
  public Map<String, String> health() { return Map.of("status", "ok"); }

  @GetMapping("/ready")
  public Map<String, String> ready() {
    if (System.getenv("OPENAI_API_KEY") == null) throw new ServiceUnavailableException("missing key");
    return Map.of("status", "ready");
  }

  @PostMapping("/chat")
  public ChatResponse chat(@RequestBody ChatRequest req) {
    String requestId = "req_" + UUID.randomUUID().toString().replace("-", "").substring(0, 12);
    log.info("chat_start request_id={} tenant_id={}", requestId, req.tenantId());
    var result = chatService.chatWithFallback(List.of(new UserMessage(req.message())));
    log.info("chat_done request_id={} route={}", requestId, result.routeUsed());
    return new ChatResponse(result.text(), requestId, result.routeUsed());
  }
}

# Same FastAPI shell — swap chat_with_fallback primary to Anthropic in config
MODEL_PRIMARY = os.environ.get("PRIMARY_MODEL", "claude-3-5-haiku-20241022")
PROVIDER = os.environ.get("LLM_PROVIDER", "anthropic")

// application.yml — spring.ai.model.chat=anthropic
// ChatService injects AnthropicChatModel when LLM_PROVIDER=anthropic

# ready() check: verifies AWS credentials via STS only; it does not probe bedrock-runtime reachability
import boto3
def bedrock_ready() -> bool:
    sts = boto3.client("sts")
    sts.get_caller_identity()
    return True

// Bedrock readiness — BedrockRuntimeClient.invokeModel with minimal prompt in /ready (cache 30s)

Structured logging

Use JSON logs ingestible by Datadog, CloudWatch, or ELK. Always log request_id, latency_ms, model_route, input_tokens, output_tokens on completion—never log full user text in production.

import json, logging, sys

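class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {"level": record.levelname}
        if isinstance(record.msg, dict):
            payload.update(record.msg)  # structured event: merge its fields as-is
        else:
            payload["msg"] = record.getMessage()
        return json.dumps(payload, default=str)  # default=str avoids crashes on odd values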

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.getLogger("chat.api").handlers = [handler]

// logback-spring.xml — LogstashEncoder or custom JSON layout
// MDC.put("request_id", requestId) in filter; clear in finally
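
To log the completion fields named above without storing user text, one option is to record a short hash and the length instead. A sketch under stated assumptions: `completion_log`, the `user_text_sha256` and `user_text_chars` fields, and the 16-character truncation are illustrative choices. A truncated hash helps correlate repeated inputs but is not an anonymization guarantee.

```python
import hashlib
import json
import time
from typing import Optional


def completion_log(request_id: str, model_route: str, started_at: float,
                   input_tokens: int, output_tokens: int, user_text: str,
                   now: Optional[float] = None) -> str:
    """Build one JSON log line for a finished request; raw user text is never included."""
    now = time.monotonic() if now is None else now
    return json.dumps({
        "event": "chat_done",
        "request_id": request_id,
        "model_route": model_route,
        "latency_ms": round((now - started_at) * 1000),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "user_text_sha256": hashlib.sha256(user_text.encode("utf-8")).hexdigest()[:16],
        "user_text_chars": len(user_text),
    })
```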

Dockerfile (Python)

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV PORT=8080
EXPOSE 8080
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]

Mount secrets at runtime (OPENAI_API_KEY from K8s Secret or AWS Secrets Manager). Set resource limits—LLM apps are I/O bound waiting on providers; 512Mi–1Gi RAM and 0.5–1 CPU per pod is a typical starting point.

⚙️ Config

K8s: liveness → /health; readiness → /ready. Do not call the LLM in liveness—provider blip should not restart pods.

📦 Real World

Teams ship v1 as a single FastAPI service behind existing API gateway auth; extract LLM gateway (Guide 2) when the second product team copies your OpenAI key.

🔬 Under the Hood

/ready should be cheap—env var check or cached gateway ping every 30s—not a full chat completion on every kubelet probe.
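
A cached probe keeps `/ready` cheap no matter how often the kubelet calls it. The sketch below assumes a `probe` callable that returns `True` when the gateway answers; `CachedCheck` is a hypothetical name and the 30-second TTL mirrors the figure above.

```python
import time
from typing import Callable


class CachedCheck:
    """Runs probe() at most once per ttl_sec and serves the cached result in between."""

    def __init__(self, probe: Callable[[], bool], ttl_sec: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._probe = probe
        self._ttl = ttl_sec
        self._clock = clock
        self._expires = float("-inf")  # forces a probe on the first call
        self._last = False

    def __call__(self) -> bool:
        now = self._clock()
        if now >= self._expires:
            try:
                self._last = bool(self._probe())
            except Exception:
                self._last = False  # a failing probe means not ready, not a crash
            self._expires = now + self._ttl
        return self._last
```

The `/ready` handler would return 503 when the check yields `False`, without ever issuing a chat completion.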

Production: Track 6 Guide 1 checklist

Close Guide 1 with a shipping checklist you can paste into a launch ticket—then continue to LLM gateway & infrastructure when multiple services need shared routing, caching, and tenant keys.

Track 6 Guide 1 production checklist

Four pillars documented for your feature: observable, cost-controlled, gracefully degrading, evaluatable.
Readiness checklist complete: logging, cost caps, retry, rate limits, SLO targets, guardrails, CI eval, streaming errors, fallback chain, secrets.
/health and /ready wired to orchestrator probes.
Structured JSON logs with request_id, model_route, token counts.
Streaming endpoint emits error SSE events; heartbeat for long generations.
Model and provider fallback tested in staging drill.
API keys server-side only; rotation runbook linked.
eval/smoke.jsonl green on deploy; link to Track 5 CI gates.
On-call runbook: provider outage, cost spike, stream failure.
Container image scanned; non-root user where possible.

Risk	Guide 1 control	Next guide
Key sprawl across services	Single service first; plan gateway	Guide 2
Uncapped token spend	Max tokens + tenant budget	Guide 2 — gateway quotas
Silent outage UX	Fallback + static degradation	Guide 6 — architecture
Quality drift	CI eval from Track 5	All guides — eval in CI

import httpx

def post_deploy_smoke(base_url: str) -> None:
    r = httpx.get(f"{base_url}/ready", timeout=5.0)
    r.raise_for_status()
    r = httpx.post(f"{base_url}/chat", json={"message": "ping", "tenant_id": "smoke"}, timeout=60.0)
    r.raise_for_status()
    assert r.json().get("answer")

public void postDeploySmoke(String baseUrl) {
  rest.get().uri(baseUrl + "/ready").retrieve().toBodilessEntity();
  var resp = rest.post().uri(baseUrl + "/chat")
      .body(new ChatRequest("ping", "smoke"))
      .retrieve().body(ChatResponse.class);
  assert resp != null && resp.answer() != null && !resp.answer().isBlank();
}

📦 End of Guide 1

You defined shipped, ran the readiness checklist, implemented streaming with error handling, configured fallbacks, and sketched a minimal deployable service. Guide 2 adds the LLM gateway—shared routing, semantic cache, and multi-tenant keys when one service becomes many.

🎯 Interview Tip

“Walk me from notebook to production.” — Health/ready probes, structured logs, streaming SSE, retry/fallback, secrets manager, eval CI from Track 5, guardrails on path.

💰 Cost

Minimal service v1 cost is mostly inference—infra for a single small pod is noise until you forget token caps. Ship caps before you ship features.

Track 6 guide sequence

Guide	Topic	You are here
Guide 1	Hello Ship — API to production	✓
Guide 2	LLM gateway & infrastructure	next →
Guides 3–6	Fine-tuning, multimodal, extraction, architecture	—

FAQ: Sync or stream first?

Ship sync /chat for integrations and eval runners; add /chat/stream for UI—same backend, shared fallback and logging.

FAQ: When to skip Docker?

Never skip container definition for production—even if you deploy to Lambda, ship a reproducible artifact. Dockerfile or equivalent is the contract platform teams need.