
Hello Ship — API to production

Track 1 taught you to call an API. Track 5 taught you to block bad merges with eval CI. Hello Ship is the bridge: a real HTTP service with health probes, structured logs, streaming responses, fallback when the provider blinks, and secrets that never hit git. This guide defines what “shipped” means in production, walks a full readiness checklist, implements streaming from OpenAI/Anthropic through SSE to the browser (and Spring WebFlux Flux), covers model and provider fallbacks including LiteLLM, and closes with a minimal FastAPI + Spring Boot sketch you can containerize today.

After reading, you should be able to: articulate the four pillars of a shipped LLM feature (observable, cost-controlled, gracefully degrading, evaluatable); run through a production readiness checklist before launch; stream tokens safely with error handling mid-stream; configure model and provider fallbacks; and deploy a minimal chat service with /health, /ready, and structured logging.


What “shipped” means

A Jupyter notebook that calls chat.completions.create once is not shipped. Shipped means paying customers—or internal teams with SLAs—depend on your feature daily, and you can answer four questions without opening Slack: Is it healthy? What did it cost? What happens when OpenAI 503s? Did last week’s prompt change break refunds?

Production LLM features fail differently from CRUD APIs. Correctness is probabilistic, latency has a long tail (time-to-first-token plus decode), and cost scales with usage, not with CPU cores. Teams that treat the LLM as “just another REST dependency” find this out during their first incident. Teams that ship deliberately treat inference as an unreliable, metered backend with guardrails, eval gates, and fallbacks already wired.

Four pillars of a shipped LLM feature

Pillar | Question it answers | Minimum bar | Track reference
Observable | What happened on request req_abc? | Structured logs, trace IDs, token/cost fields, latency histograms | Track 5: observability
Cost-controlled | Will a bug loop bankrupt us? | Per-tenant budgets, max tokens, rate limits, model routing | Guide 2: gateway
Gracefully degrading | What do users see when the model is down? | Fallback model, cached answer, canned response, not an infinite spinner | This guide § fallback
Evaluatable | Did we regress quality? | Golden cases in CI, smoke on deploy, online sampling | Track 5: CI eval

Demo vs production (honest comparison)

Dimension | Demo | Shipped
API key | In notebook or frontend env | Server-side only; rotated via secrets manager
Errors | Print stack trace | Retry with backoff; user-safe message + incident id
Latency | Wait for full response | Stream TTFT; timeout per stage
Quality | “Looks good to me” | Eval suite ≥ threshold; block deploy on regression
Safety | System prompt only | Input/output guardrails + adversarial CI
Cost | Unbounded | Token caps, per-user daily budget, cheaper model routing

flowchart LR
  Client["Browser / API client"]
  Edge["API gateway\nauth · rate limit"]
  Svc["Chat service\nhealth · ready"]
  Guard["Guardrails\ninput · output"]
  LLM["LLM provider\nstream · retry"]
  Obs["Observability\ntraces · cost"]
  Client --> Edge --> Svc --> Guard --> LLM
  Svc --> Obs
  Guard --> Obs
  LLM --> Obs

The diagram is intentionally boring—boring is good. Every arrow should emit a span or log line with request_id, tenant_id, model, and prompt_version. If you cannot replay a failed request from logs alone, you are not observable yet.

Shipped request envelope (structured log fields)
from dataclasses import dataclass, asdict
import logging

logger = logging.getLogger("chat.shipped")

@dataclass
class ShippedRequestContext:
    request_id: str
    tenant_id: str
    feature: str
    model: str
    prompt_version: str
    stream: bool

def log_request_start(ctx: ShippedRequestContext, input_tokens_est: int) -> None:
    logger.info("llm_request_start", extra={
        **asdict(ctx),
        "input_tokens_est": input_tokens_est,
        "pillar": "observable",
    })
public record ShippedRequestContext(
    String requestId, String tenantId, String feature,
    String model, String promptVersion, boolean stream) {}

public void logRequestStart(ShippedRequestContext ctx, int inputTokensEst) {
  structuredLog.info("llm_request_start", Map.of(
      "request_id", ctx.requestId(),
      "tenant_id", ctx.tenantId(),
      "model", ctx.model(),
      "prompt_version", ctx.promptVersion(),
      "input_tokens_est", inputTokensEst));
}
⚠️ Pitfall

Calling “shipped” because the feature flag is on for 5% of users without eval CI, cost caps, or on-call runbooks. Soft launch is valid—but without the four pillars you have a demo with traffic.

🎯 Interview Tip

“How do you know an LLM feature is production-ready?” — Name the four pillars with concrete artifacts: Datadog dashboards, token budgets, fallback chain, golden eval in GitHub Actions.

📦 Real World

Notion’s internal checklist for new AI surfaces requires observability schema approval and eval smoke green before any external beta—feature code is the last gate, not the first.

🔬 Under the Hood

“Evaluatable” does not mean perfect accuracy—it means measurable accuracy on a frozen dataset with a defined regression policy. Vibes are not a metric.
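
A regression policy of this kind fits in a few lines. The sketch below is illustrative only: `GoldenResult`, `accuracy`, `passes_regression_policy`, and the 2-point allowed drop are hypothetical names and values, not part of any Track 5 tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GoldenResult:
    """Outcome of one frozen golden case (hypothetical shape)."""
    case_id: str
    passed: bool


def accuracy(results: List[GoldenResult]) -> float:
    """Fraction of golden cases that passed; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def passes_regression_policy(
    baseline: List[GoldenResult],
    candidate: List[GoldenResult],
    max_drop: float = 0.02,  # assumed policy: tolerate at most a 2-point drop
) -> bool:
    """True when both runs cover the same cases and accuracy fell by no more than max_drop."""
    same_cases = {r.case_id for r in baseline} == {r.case_id for r in candidate}
    return same_cases and accuracy(candidate) >= accuracy(baseline) - max_drop
```

Requiring identical case IDs is what makes the dataset "frozen": a candidate run that silently drops hard cases fails the gate instead of looking better.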

Production readiness checklist

Use this checklist before flipping a feature to production, or before your architect signs off. Each item maps to an incident class teams actually get paged for. Skipping “secrets” because “we’ll fix it later” is how keys end up in browser DevTools.

Checklist by category

Category | Item | Verify
Logging | Structured JSON logs (not print) | request_id on every line
Logging | No raw PII in logs | Presidio or hash before sink
Logging | Prompt/chunk versioning | prompt_version, not full prompt in prod
Cost | Max tokens per request | Enforced server-side
Cost | Per-tenant daily budget | 429 when exceeded
Retry | Exponential backoff on 429/503 | Max 3 attempts; jitter
Retry | Idempotency for side-effect tools | Tool calls gated
Rate limits | Per-user and per-tenant RPM | At API gateway
Rate limits | Provider quota headroom | Alert at 80% TPM
SLO | p95 TTFT < target (e.g. 800ms) | Dashboard + alert
SLO | Error budget for 5xx | 0.1% monthly
Guardrails | Input moderation before LLM | Track 5 Guide 4
Guardrails | Output scan before client | Fail closed on PII
CI eval | Smoke suite on PR | Non-zero exit on fail
CI eval | Full suite nightly | Artifact JSON report
Streaming | SSE heartbeat / timeout | Client reconnect policy
Streaming | Mid-stream error frame | User sees partial + error event
Fallback | Secondary model configured | Tested weekly
Fallback | Degraded mode copy | No silent empty responses
Secrets | Keys in vault / K8s secret | Never in repo or frontend
Secrets | Rotation runbook | <15 min blast radius
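
The per-tenant daily budget item can be sketched as a small counter that the chat handler consults before calling the provider. This is a minimal, single-process illustration: `TenantBudget` and `try_spend` are hypothetical names, and a real deployment would likely keep the counters in a shared store such as the gateway from Guide 2.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Optional, Tuple


class TenantBudget:
    """In-memory per-tenant daily token budget (one process only)."""

    def __init__(self, daily_token_limit: int) -> None:
        self.daily_token_limit = daily_token_limit
        self._used: Dict[Tuple[str, date], int] = defaultdict(int)

    def try_spend(self, tenant_id: str, tokens: int, today: Optional[date] = None) -> bool:
        """Reserve tokens for a tenant; False means the caller should answer 429."""
        key = (tenant_id, today or date.today())
        if self._used[key] + tokens > self.daily_token_limit:
            return False
        self._used[key] += tokens
        return True
```

Keying the counter on `(tenant_id, date)` resets the budget at the day boundary without a scheduled job; stale keys would still need pruning in a long-running process.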

Pre-launch gate matrix

Gate | Owner | Blocking?
eval/smoke.jsonl pass 100% | ML / product | Yes
Load test 2× expected QPS | Platform | Yes for external
On-call runbook linked in service README | Eng lead | Yes
Cost projection vs budget | FinOps | Warn
Shadow traffic 72h with full rails | Platform | Recommended

Retry with exponential backoff (provider 429/503)
import random
import time
from openai import OpenAI, RateLimitError, APIStatusError

RETRYABLE = {429, 500, 502, 503, 504}

def call_with_retry(fn, max_attempts: int = 3, base_delay: float = 0.5):
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise
        except APIStatusError as e:
            if e.status_code not in RETRYABLE or attempt == max_attempts - 1:
                raise
        delay = base_delay * (2 ** attempt) + random.uniform(0, 0.25)
        time.sleep(delay)
public <T> T callWithRetry(Supplier<T> fn, int maxAttempts) {
  int attempt = 0;
  while (true) {
    try {
      return fn.get();
    } catch (RateLimitException | ServiceUnavailableException e) {
      if (++attempt >= maxAttempts) throw e;
      long delayMs = (long) (500 * Math.pow(2, attempt - 1) + Math.random() * 250);
      try { Thread.sleep(delayMs); } catch (InterruptedException ie) { Thread.currentThread().interrupt(); throw e; }
    }
  }
}
🔒 Security

Secrets checklist: no OPENAI_API_KEY in frontend bundles, CI logs, or error reports. Use short-lived tokens where possible; scope Bedrock IAM to inference-only actions.

⚙️ Config

Externalize all limits: MAX_OUTPUT_TOKENS, RPM_PER_TENANT, FALLBACK_MODEL. Hardcoding in source guarantees a Friday deploy surprise.
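
One way to externalize these limits is to read them once at startup and fail fast on bad values. The sketch below assumes the three variable names mentioned above; the `Limits` dataclass, `load_limits`, and every default value are placeholders chosen for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Limits:
    max_output_tokens: int
    rpm_per_tenant: int
    fallback_model: str


def load_limits(env: Mapping[str, str] = os.environ) -> Limits:
    """Read limits from the environment; the defaults are placeholder assumptions."""
    limits = Limits(
        max_output_tokens=int(env.get("MAX_OUTPUT_TOKENS", "1024")),
        rpm_per_tenant=int(env.get("RPM_PER_TENANT", "60")),
        fallback_model=env.get("FALLBACK_MODEL", "gpt-4o-mini"),
    )
    # Fail at startup rather than on the first request that hits a bad limit.
    if limits.max_output_tokens <= 0 or limits.rpm_per_tenant <= 0:
        raise ValueError("limits must be positive integers")
    return limits
```

Passing a plain dict as `env` keeps the loader testable without touching the process environment.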

💰 Cost

Pre-launch: estimated monthly spend = avg tokens/request × requests/day × 30 × (price per 1M tokens ÷ 1,000,000). Add 3× headroom for prompt growth and retry duplication.
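
The estimate is easy to get wrong by a factor of a million, so a small helper is worth keeping next to the launch ticket. The function name and parameters below are hypothetical, and the sketch assumes a single blended price per 1M tokens rather than separate input and output rates.

```python
def estimate_monthly_spend_usd(
    avg_tokens_per_request: float,
    requests_per_day: float,
    price_per_million_tokens_usd: float,
    headroom: float = 3.0,  # multiplier for prompt growth and retry duplication
    days: int = 30,
) -> float:
    """Projected monthly spend in USD, including the headroom multiplier."""
    monthly_tokens = avg_tokens_per_request * requests_per_day * days
    return monthly_tokens / 1_000_000 * price_per_million_tokens_usd * headroom


if __name__ == "__main__":
    # Illustrative inputs only; the $0.60 blended price is not a quoted rate.
    print(round(estimate_monthly_spend_usd(1500, 10_000, 0.60), 2))
```

With those illustrative inputs the base volume is 1,500 × 10,000 × 30 = 450M tokens, which is $270 per month at the assumed price and $810 with 3× headroom.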

💡 Tip

Store checklist status in a launch ticket—each checkbox links to evidence (dashboard URL, CI run, runbook). Auditors and future-you will thank you.

⚖️ Trade-off

100% checklist compliance delays launch, so tier the requirements: internal tools can skip the external load test; customer-facing chat cannot skip guardrails or CI eval.

Streaming

Users perceive latency as time to first token (TTFT), not total generation time. Streaming improves UX and lets you cancel expensive decode early—but adds failure modes: mid-stream provider errors, proxy timeouts, and guardrails that need buffered text. This section covers provider streaming APIs, SSE to the browser, Spring WebFlux Flux, and error handling when the stream breaks halfway.
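
Since TTFT is the number users feel, it helps to measure it at the point where tokens leave your service. The wrapper below is a sketch with hypothetical names (`measure_ttft`, `on_first_token`); it works with any iterable of text chunks and assumes you forward the measured value to your own metrics client.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_ttft(
    tokens: Iterable[str],
    on_first_token: Callable[[float], None],
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Pass tokens through unchanged and report seconds until the first one arrives."""
    # The clock starts when the consumer first pulls from this generator.
    start = clock()
    first = True
    for token in tokens:
        if first:
            on_first_token(clock() - start)
            first = False
        yield token
```

The injectable `clock` is only there to make the wrapper testable; in a service you would leave the default and pass a histogram's observe function as `on_first_token`.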

Provider streaming APIs

Provider | Flag | Event shape | Notes
OpenAI | stream=True | SSE chunks; delta.content | stream_options.include_usage for billing
Anthropic | stream=True | SSE content_block_delta | Message start/delta/stop events
Bedrock | invoke_model_with_response_stream | Binary event stream | Converse API unified in 2024+

Provider streaming — iterate token deltas
from collections.abc import Iterator
from openai import OpenAI

def stream_openai(messages: list[dict]) -> Iterator[str]:
    """Yield text deltas as they arrive; callers join them if they need the full answer."""
    client = OpenAI()
    # Streaming helper; some SDK releases expose it under client.beta.chat.completions
    with client.chat.completions.stream(
        model="gpt-4o-mini",
        messages=messages,
        max_tokens=1024,
    ) as stream:
        for event in stream:
            if event.type == "content.delta" and event.delta:
                yield event.delta
// Spring AI ChatClient — stream helper
public Flux<String> streamOpenAi(List<Message> messages) {
  return chatClient.prompt(new Prompt(messages))
      .options(OpenAiChatOptions.builder().model("gpt-4o-mini").maxTokens(1024).build())
      .stream()
      .map(r -> r.getResult().getOutput().getText())
      .filter(t -> t != null && !t.isBlank());
}
import anthropic

def stream_anthropic(messages: list[dict]):
    client = anthropic.Anthropic()
    with client.messages.stream(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=messages,
    ) as stream:
        for text in stream.text_stream:
            yield text
// AnthropicMessageChatModel streaming via Spring AI
public Flux<String> streamAnthropic(List<Message> messages) {
  return anthropicChatModel.stream(new Prompt(messages))
      .map(r -> r.getResult().getOutput().getText());
}
import boto3, json

def stream_bedrock_converse(model_id: str, messages: list[dict]):
    client = boto3.client("bedrock-runtime")
    resp = client.converse_stream(modelId=model_id, messages=messages)
    for event in resp["stream"]:
        if "contentBlockDelta" in event:
            delta = event["contentBlockDelta"]["delta"].get("text", "")
            if delta:
                yield delta
// BedrockRuntimeAsyncClient — ConverseStreamResponseHandler
bedrockAsync.converseStream(request, new ConverseStreamResponseHandler() {
  public void onEvent(ConverseStreamOutput event) {
    event.contentBlockDelta().ifPresent(d -> sink.next(d.delta().text()));
  }
});

SSE to the browser

Server-Sent Events (text/event-stream) is the standard pattern for browser chat UIs. Each token (or sentence buffer) becomes an SSE data: frame. Send a terminal event: done or JSON {"type":"done"} so clients know to close. Proxies (nginx, ALB) need idle timeout > max generation time—often 60–300s for long answers.
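
The wire format is simple enough to centralize in one helper so every endpoint frames events the same way. The names `sse_frame` and `sse_comment` below are hypothetical; the framing itself (optional `event:` line, `data:` line, blank line terminator, colon-prefixed comments) follows the SSE format described above.

```python
import json
from typing import Any, Optional


def sse_frame(data: Any, event: Optional[str] = None) -> str:
    """Serialize one SSE frame; JSON encoding keeps the payload on a single data: line."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"


def sse_comment(text: str = "heartbeat") -> str:
    """Lines starting with a colon are comments that EventSource clients ignore."""
    return f": {text}\n\n"
```

JSON-encoding the payload matters because a raw newline inside a token would otherwise split the frame; `json.dumps` escapes it as `\n` inside the string.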

FastAPI SSE endpoint + Spring WebFlux Flux
import json
import logging
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
logger = logging.getLogger("chat.stream")

def sse_chat_generator(user: str):
    # Plain (sync) generator: stream_openai blocks, and StreamingResponse
    # iterates sync generators in a worker thread instead of the event loop.
    try:
        for token in stream_openai([{"role": "user", "content": user}]):
            yield f"data: {json.dumps({'token': token})}\n\n"
        yield f"data: {json.dumps({'type': 'done'})}\n\n"
    except Exception:
        # Log details server-side; send the client only a fixed, safe message.
        logger.exception("stream_failed")
        yield f"event: error\ndata: {json.dumps({'message': 'stream_failed'})}\n\n"

@app.post("/chat/stream")
async def chat_stream(body: dict):
    return StreamingResponse(
        sse_chat_generator(body["message"]),
        media_type="text/event-stream",
        headers={"Cache-Control": "no-cache", "X-Accel-Buffering": "no"},
    )
@PostMapping(value = "/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<ServerSentEvent<String>> chatStream(@RequestBody ChatRequest body) {
  return streamOpenAi(List.of(new UserMessage(body.message())))
      .map(token -> ServerSentEvent.builder(token).build())
      .concatWith(Flux.just(ServerSentEvent.builder("{\"type\":\"done\"}").event("done").build()))
      .onErrorResume(e -> Flux.just(
          ServerSentEvent.builder("{\"message\":\"stream_failed\"}").event("error").build()));
}

Stream error handling

Failure | User experience | Server action
Provider 503 mid-stream | Partial answer + error banner | Emit SSE error event; log tokens_emitted
Client disconnect | User navigated away | Cancel provider stream; stop billing if supported
Output guardrail hit | Replace with safe message | Do not forward raw toxic partial
Proxy timeout | Spinner forever | Heartbeat comment frames every 15s

SSE heartbeat + mid-stream error frame
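import asyncio, json

async def sse_with_heartbeat(token_gen, heartbeat_sec: float = 15.0):
    it = token_gen.__aiter__()
    pending = asyncio.ensure_future(it.__anext__())
    try:
        while True:
            # asyncio.wait leaves the read running on timeout, so no token is lost
            done, _ = await asyncio.wait({pending}, timeout=heartbeat_sec)
            if not done:
                yield ": heartbeat\n\n"  # comment frame while the provider is silent
                continue
            try:
                token = pending.result()
            except StopAsyncIteration:
                break
            yield f"data: {json.dumps({'token': token})}\n\n"
            pending = asyncio.ensure_future(it.__anext__())
    finally:
        pending.cancel()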
// Flux.merge with interval heartbeat; publish() shares one upstream subscription
// so the provider stream is not opened twice
Flux<ServerSentEvent<String>> heartbeats = Flux.interval(Duration.ofSeconds(15))
    .map(i -> ServerSentEvent.builder("").comment("heartbeat").build());
return streamOpenAi(messages).map(this::dataEvent)
    .publish(tokens -> Flux.merge(tokens, heartbeats.takeUntilOther(tokens.ignoreElements())));
⚠️ Pitfall

Running output moderation only after the full stream completes—users already saw PII or toxic text for 3 seconds. Buffer to sentence boundaries or run parallel chunk scans.
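
Buffering to sentence boundaries can be sketched as a generator that sits between the provider stream and the SSE writer. Everything named here is hypothetical (`moderated_sentences`, the `is_safe` callback, the withheld-message text), and the regex is a deliberately naive sentence splitter; a real guardrail would plug in the moderation check from Track 5.

```python
import re
from typing import Callable, Iterable, Iterator

# Naive boundary: sentence punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"[.!?]\s")


def moderated_sentences(
    tokens: Iterable[str],
    is_safe: Callable[[str], bool],
    safe_message: str = "[response withheld]",
) -> Iterator[str]:
    """Release text one sentence at a time; stop at the first sentence that fails the check."""
    buffer = ""
    for token in tokens:
        buffer += token
        while (match := _SENTENCE_END.search(buffer)):
            sentence, buffer = buffer[: match.end()], buffer[match.end():]
            if not is_safe(sentence):
                yield safe_message
                return
            yield sentence
    if buffer:
        # The trailing partial sentence goes through the same check.
        yield buffer if is_safe(buffer) else safe_message
```

The cost is latency: the user sees text a sentence later than with raw token streaming, which is the trade the pitfall above asks you to make on purpose.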

🔬 Under the Hood

OpenAI streams are SSE over HTTPS from their edge. Your server holds one long-lived upstream connection for each in-flight streamed request, so N concurrent chats mean N open upstream connections. Watch connection pool limits.

🎯 Interview Tip

“How would you implement streaming chat?” — TTFT metrics, SSE vs WebSockets (SSE simpler for one-way), cancel on disconnect, error events, nginx buffering off.

📦 Real World

ChatGPT-style UIs batch tokens every 16–32ms before DOM update to reduce layout thrash—server still streams per token; client debounces render.

Fallback strategies

Providers have outages. Models have bad days. Fallback is not optional for customer-facing chat: it is how you deliver pillar three, graceful degradation. The layers are a cheaper or smaller fallback model, an alternate provider, a feature degraded to search-only or a canned FAQ, and unified routing via LiteLLM or an internal gateway (Guide 2).

Fallback layers

Layer | Trigger | Behavior | Quality impact
Model fallback | Primary timeout / 5xx | gpt-4o → gpt-4o-mini | Moderate
Provider fallback | Regional outage | OpenAI → Anthropic → Bedrock | Variable (eval per pair)
Graceful degradation | All models fail | Static FAQ + “try again” | High UX hit; no hallucination
Cached semantic | Near-duplicate query | Return prior answer | Stale risk; great for FAQs

LiteLLM intro

LiteLLM provides an OpenAI-compatible proxy across 100+ providers. Configure fallbacks in YAML, centralize keys, and emit unified logging for cost tracking. Platform teams often deploy LiteLLM as a sidecar before product services call any vendor directly—see Guide 2 for full gateway patterns.

Model + provider fallback chain
from openai import OpenAI
import anthropic

# Primary differs from the first fallback so a failed route is not retried verbatim
PRIMARY = ("openai", "gpt-4o")
FALLBACKS = [
    ("openai", "gpt-4o-mini"),
    ("anthropic", "claude-3-5-haiku-20241022"),
]

def chat_with_fallback(messages: list[dict]) -> tuple[str, str]:
    errors = []
    for provider, model in [PRIMARY] + FALLBACKS:
        try:
            if provider == "openai":
                c = OpenAI()
                r = c.chat.completions.create(model=model, messages=messages, timeout=30.0)
                return r.choices[0].message.content or "", f"{provider}/{model}"
            if provider == "anthropic":
                c = anthropic.Anthropic()
                r = c.messages.create(model=model, max_tokens=1024, messages=messages)
                text = "".join(b.text for b in r.content if b.type == "text")
                return text, f"{provider}/{model}"
        except Exception as e:
            errors.append(f"{provider}/{model}: {e}")
    raise RuntimeError("all fallbacks exhausted: " + "; ".join(errors))
public record ChatResult(String text, String routeUsed) {}

public ChatResult chatWithFallback(List<Message> messages) {
  List<Route> chain = List.of(
      new Route("openai", "gpt-4o-mini"),
      new Route("anthropic", "claude-3-5-haiku-20241022"));
  List<String> errors = new ArrayList<>();
  for (Route route : chain) {
    try {
      return new ChatResult(router.invoke(route, messages), route.label());
    } catch (Exception e) {
      errors.add(route.label() + ": " + e.getMessage());
    }
  }
  throw new ServiceUnavailableException("all fallbacks exhausted: " + errors);
}
# litellm_config.yaml — model fallbacks at proxy layer
model_list:
  - model_name: primary-chat
    litellm_params:
      model: gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: fallback-chat
    litellm_params:
      model: claude-3-5-haiku-20241022
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  fallbacks: [{"primary-chat": ["fallback-chat"]}]
  num_retries: 2
  timeout: 30

Graceful degradation copy

When all inference fails, return a deterministic response—never an empty 200 or infinite retry loop:

  • Include incident_id for support correlation.
  • Link to status page or FAQ search.
  • Log degradation event as fallback_tier=static for SLO exclusion policy.
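
The three points above can be sketched as one function that the chat handler calls when the fallback chain is exhausted. The names, the response shape, and the status URL are all placeholders for illustration, not part of the service contract defined later in this guide.

```python
import logging
import uuid

logger = logging.getLogger("chat.degraded")

STATUS_URL = "https://status.example.com"  # placeholder, not a real endpoint


def degraded_response(request_id: str) -> dict:
    """Build the deterministic reply used when every inference route has failed."""
    incident_id = f"inc_{uuid.uuid4().hex[:12]}"
    # fallback_tier=static lets SLO queries exclude or count degraded replies.
    logger.warning(
        "llm_degraded",
        extra={"request_id": request_id, "incident_id": incident_id, "fallback_tier": "static"},
    )
    return {
        "answer": "The assistant is temporarily unavailable. Please try again shortly.",
        "incident_id": incident_id,
        "status_url": STATUS_URL,
        "degraded": True,
    }
```

Returning an explicit `degraded` flag lets the UI render a banner instead of presenting the canned text as a model answer.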
⚖️ Trade-off

Provider fallback doubles eval surface—golden cases must pass on primary and fallback routes, or you ship silent quality cliffs during outages.

💰 Cost

Fallback to a larger model during outage can 10× spend for hours. Cap fallback RPM and alert when >5% traffic hits secondary for >15 minutes.
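
The alert rule above (more than 5% of traffic on the secondary route for more than 15 minutes) can be sketched as a small in-process monitor. `FallbackShareMonitor` is a hypothetical name, and the 5-minute rolling window used to compute the share is an assumption; in practice this rule would more likely live in your metrics system as an alert query.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class FallbackShareMonitor:
    """Signal an alert when the secondary route serves too much traffic for too long."""

    def __init__(self, threshold: float = 0.05, sustain_sec: float = 900.0,
                 window_sec: float = 300.0) -> None:
        self.threshold = threshold
        self.sustain_sec = sustain_sec
        self.window_sec = window_sec
        self._events: Deque[Tuple[float, bool]] = deque()
        self._breach_since: Optional[float] = None

    def record(self, now: float, used_fallback: bool) -> bool:
        """Record one request at time `now` (seconds); return True when an alert should fire."""
        self._events.append((now, used_fallback))
        # Drop requests that fell out of the rolling window.
        while self._events and self._events[0][0] < now - self.window_sec:
            self._events.popleft()
        share = sum(fb for _, fb in self._events) / len(self._events)
        if share <= self.threshold:
            self._breach_since = None
            return False
        if self._breach_since is None:
            self._breach_since = now
        return now - self._breach_since > self.sustain_sec
```

Tracking when the breach started, rather than alerting on the first bad window, keeps short provider blips from paging anyone.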

💡 Tip

Weekly “fallback drill”: block primary model in staging via feature flag; assert p95 latency and eval smoke still green on secondary.

🔒 Security

Each provider in the fallback chain needs the same guardrail pipeline—do not skip output moderation on Anthropic because OpenAI path had it.

Minimal service sketch

Ship a thin chat API you can dockerize and deploy behind your existing platform. It includes /health (process up) and /ready (can reach LLM or gateway), a non-streaming /chat, structured JSON logging, and a Dockerfile, which is enough for K8s probes and a first production deploy.

Endpoint contract

Route | Purpose | K8s probe
GET /health | Process alive | livenessProbe
GET /ready | Deps OK (API key present, gateway ping) | readinessProbe
POST /chat | Sync chat JSON in/out | (none)
POST /chat/stream | SSE stream | (none)

POST /chat — sync completion behind provider abstraction
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
import logging, uuid, os

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("chat.api")
app = FastAPI(title="hello-ship")

class ChatRequest(BaseModel):
    message: str = Field(max_length=8000)
    tenant_id: str = "default"

class ChatResponse(BaseModel):
    answer: str
    request_id: str
    model_route: str

@app.get("/health")
def health():
    return {"status": "ok"}

@app.get("/ready")
def ready():
    if not os.environ.get("OPENAI_API_KEY"):
        raise HTTPException(503, "missing OPENAI_API_KEY")
    return {"status": "ready"}

@app.post("/chat", response_model=ChatResponse)
def chat(req: ChatRequest):
    request_id = f"req_{uuid.uuid4().hex[:12]}"
    logger.info({"event": "chat_start", "request_id": request_id, "tenant_id": req.tenant_id})
    text, route = chat_with_fallback([{"role": "user", "content": req.message}])
    logger.info({"event": "chat_done", "request_id": request_id, "model_route": route})
    return ChatResponse(answer=text, request_id=request_id, model_route=route)
@RestController
@RequestMapping("/")
public class ChatController {
  @GetMapping("/health")
  public Map<String, String> health() { return Map.of("status", "ok"); }

  @GetMapping("/ready")
  public Map<String, String> ready() {
    if (System.getenv("OPENAI_API_KEY") == null) throw new ServiceUnavailableException("missing key");
    return Map.of("status", "ready");
  }

  @PostMapping("/chat")
  public ChatResponse chat(@RequestBody ChatRequest req) {
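    // Strip hyphens so the id is 12 hex characters, matching the Python service
    String requestId = "req_" + UUID.randomUUID().toString().replace("-", "").substring(0, 12);
    log.info("chat_start request_id={} tenant_id={}", requestId, req.tenantId());
    var result = chatService.chatWithFallback(List.of(new UserMessage(req.message())));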
    log.info("chat_done request_id={} route={}", requestId, result.routeUsed());
    return new ChatResponse(result.text(), requestId, result.routeUsed());
  }
}
# Same FastAPI shell — swap chat_with_fallback primary to Anthropic in config
MODEL_PRIMARY = os.environ.get("PRIMARY_MODEL", "claude-3-5-haiku-20241022")
PROVIDER = os.environ.get("LLM_PROVIDER", "anthropic")
// application.yml — spring.ai.model.chat=anthropic
// ChatService injects AnthropicChatModel when LLM_PROVIDER=anthropic
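# ready() check: verifies AWS credentials via STS (it does not call bedrock-runtime itself)
import boto3
from botocore.exceptions import BotoCoreError, ClientError

def bedrock_ready() -> bool:
    try:
        boto3.client("sts").get_caller_identity()
        return True
    except (BotoCoreError, ClientError):
        return False  # report "not ready" instead of raising inside the probe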
// Bedrock readiness — BedrockRuntimeClient.invokeModel with minimal prompt in /ready (cache 30s)

Structured logging

Use JSON logs ingestible by Datadog, CloudWatch, or ELK. Always log request_id, latency_ms, model_route, input_tokens, output_tokens on completion—never log full user text in production.

JSON log formatter
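import json, logging, sys

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {"level": record.levelname, "logger": record.name}
        if isinstance(record.msg, dict):
            payload.update(record.msg)  # dict messages become top-level JSON fields
        else:
            payload["msg"] = record.getMessage()
        return json.dumps(payload, default=str)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
api_logger = logging.getLogger("chat.api")
api_logger.handlers = [handler]
api_logger.propagate = False  # avoid a second, unformatted copy via the root logger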
// logback-spring.xml — LogstashEncoder or custom JSON layout
// MDC.put("request_id", requestId) in filter; clear in finally

Dockerfile (Python)

FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENV PORT=8080
EXPOSE 8080
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]

Mount secrets at runtime (OPENAI_API_KEY from K8s Secret or AWS Secrets Manager). Set resource limits—LLM apps are I/O bound waiting on providers; 512Mi–1Gi RAM and 0.5–1 CPU per pod is a typical starting point.

⚙️ Config

K8s: liveness → /health; readiness → /ready. Do not call the LLM in liveness—provider blip should not restart pods.

📦 Real World

Teams ship v1 as a single FastAPI service behind existing API gateway auth; extract LLM gateway (Guide 2) when the second product team copies your OpenAI key.

🔬 Under the Hood

/ready should be cheap—env var check or cached gateway ping every 30s—not a full chat completion on every kubelet probe.
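
A cached readiness check can be sketched as a small wrapper around whatever dependency probe you already have. `CachedReadiness` is a hypothetical helper, not a FastAPI or Kubernetes API, and this version is not thread-safe; it assumes `check` is something like a gateway ping that returns a boolean or raises.

```python
import time
from typing import Callable, Optional


class CachedReadiness:
    """Cache a dependency check so probes do not hit the provider on every call."""

    def __init__(self, check: Callable[[], bool], ttl_sec: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check
        self._ttl_sec = ttl_sec
        self._clock = clock
        self._last_at: Optional[float] = None
        self._last_ok = False

    def is_ready(self) -> bool:
        now = self._clock()
        if self._last_at is None or now - self._last_at >= self._ttl_sec:
            try:
                self._last_ok = bool(self._check())
            except Exception:
                # A failing check means "not ready"; it must never crash the probe.
                self._last_ok = False
            self._last_at = now
        return self._last_ok
```

In the FastAPI sketch above, `/ready` would call `is_ready()` and raise a 503 when it returns False, so the real dependency is touched at most once per TTL however often the kubelet probes.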

Production: Track 6 Guide 1 checklist

Close Guide 1 with a shipping checklist you can paste into a launch ticket—then continue to LLM gateway & infrastructure when multiple services need shared routing, caching, and tenant keys.

Track 6 Guide 1 production checklist

  • Four pillars documented for your feature: observable, cost-controlled, gracefully degrading, evaluatable.
  • Readiness checklist complete: logging, cost caps, retry, rate limits, SLO targets, guardrails, CI eval, streaming errors, fallback chain, secrets.
  • /health and /ready wired to orchestrator probes.
  • Structured JSON logs with request_id, model_route, token counts.
  • Streaming endpoint emits error SSE events; heartbeat for long generations.
  • Model and provider fallback tested in staging drill.
  • API keys server-side only; rotation runbook linked.
  • eval/smoke.jsonl green on deploy; link to Track 5 CI gates.
  • On-call runbook: provider outage, cost spike, stream failure.
  • Container image scanned; non-root user where possible.

Risk | Guide 1 control | Next guide
Key sprawl across services | Single service first; plan gateway | Guide 2
Uncapped token spend | Max tokens + tenant budget | Guide 2: gateway quotas
Silent outage UX | Fallback + static degradation | Guide 6: architecture
Quality drift | CI eval from Track 5 | All guides: eval in CI

Deploy smoke test (post-deploy hook)
import httpx

def post_deploy_smoke(base_url: str) -> None:
    r = httpx.get(f"{base_url}/ready", timeout=5.0)
    r.raise_for_status()
    r = httpx.post(f"{base_url}/chat", json={"message": "ping", "tenant_id": "smoke"}, timeout=60.0)
    r.raise_for_status()
    assert r.json().get("answer")
public void postDeploySmoke(String baseUrl) {
  rest.get().uri(baseUrl + "/ready").retrieve().toBodilessEntity();
  var resp = rest.post().uri(baseUrl + "/chat")
      .body(new ChatRequest("ping", "smoke"))
      .retrieve().body(ChatResponse.class);
  assert resp != null && resp.answer() != null && !resp.answer().isBlank();
}
📦 End of Guide 1

You defined shipped, ran the readiness checklist, implemented streaming with error handling, configured fallbacks, and sketched a minimal deployable service. Guide 2 adds the LLM gateway—shared routing, semantic cache, and multi-tenant keys when one service becomes many.

🎯 Interview Tip

“Walk me from notebook to production.” — Health/ready probes, structured logs, streaming SSE, retry/fallback, secrets manager, eval CI from Track 5, guardrails on path.

💰 Cost

Minimal service v1 cost is mostly inference—infra for a single small pod is noise until you forget token caps. Ship caps before you ship features.

Track 6 guide sequence

Guide | Topic | You are here
Guide 1 | Hello Ship: API to production | you are here
Guide 2 | LLM gateway & infrastructure | next →
Guides 3–6 | Fine-tuning, multimodal, extraction, architecture |

FAQ: Sync or stream first?

Ship sync /chat for integrations and eval runners; add /chat/stream for UI—same backend, shared fallback and logging.

FAQ: When to skip Docker?

Never skip container definition for production—even if you deploy to Lambda, ship a reproducible artifact. Dockerfile or equivalent is the contract platform teams need.