Production AI architecture patterns

Track 1 taught inference. Track 2 taught RAG. Track 3 taught prompts. Track 4 taught agents. Track 5 taught eval and guardrails. Track 6 Guides 1–5 taught shipping mechanics—gateways, fine-tuning, multimodal ingest, structured extraction. Guide 6 is the capstone: how those pieces compose into a production AI platform you can defend in an architecture review. This guide maps the full stack from browser to vector DB, walks a 10-step RAG request lifecycle, designs multi-tenant isolation with EU/US residency, optimizes cost at scale, hardens reliability with circuit breakers and bulkheads, rolls out features safely with flags, and closes with the end-of-series checklist for Applied AI Core.

After reading, you should be able to: whiteboard an AI-native reference architecture; trace a RAG query through guardrails, cache, retrieval, and async logging; design tenant namespaces with pgvector RLS and regional data planes; apply semantic cache, prompt caching, model tiering, and Batch API for cost; implement circuit breaker, bulkhead, timeout+retry+jitter, and offline cached mode; configure gradual rollout and kill switches; and articulate how all six tracks connect in a shippable system.

developer platform architect Track 6 · Guide 6 Capstone RAG Multi-tenant

AI-native architecture: the full stack

Production AI is not “FastAPI + OpenAI.” It is a layered platform: clients hit an API gateway, business logic lives in an AI service, inference flows through an LLM gateway, retrieval touches vector DB + knowledge base, hot paths use Redis cache, and every hop emits telemetry to observability. The diagram below is the reference architecture teams at scale converge on—whether they build it or buy Portkey + Pinecone + Datadog.

The mistake in v1 is collapsing layers. When the chat endpoint calls OpenAI, embeds inline, and queries Postgres in one 400-line file, you ship fast—and inherit a monolith that cannot isolate tenant blast radius, swap embedding models without redeploying UI, or fail over providers without editing application code. The AI-native split keeps product orchestration (RAG assembly, tool routing, session state) in the AI service while vendor concerns (keys, retries, semantic cache, cost ledger) stay in the LLM gateway from Guide 2.

Layer responsibilities

Layer | Owns | Does not own | Track reference
Client | UX, streaming SSE, session tokens | API keys, retrieval logic | Guide 1: Hello Ship
API Gateway | TLS, WAF, JWT/OAuth, rate limits, request IDs | Prompt assembly, model selection | Platform / Kong / AWS API GW
AI Service | RAG orchestration, guardrails hook, tool/agent loops | Raw vendor SDK keys | Track 4: agentic RAG
LLM Gateway | Model routing, semantic cache, failover, cost ledger | Business rules per tenant SKU | Guide 2: gateway
Providers | Inference, embeddings APIs | Your SLA to end users | Track 1: APIs
Vector DB + KB | Chunk storage, hybrid search, ingest metadata | Chat session history (usually) | Track 2: vector DBs
Redis cache | Semantic cache, session scratch, rate-limit counters | Durable audit logs | Guide 2: semantic cache
Observability | Traces, metrics, eval sampling, cost dashboards | Blocking bad prompts (that's guardrails) | Track 5: observability

Full-stack reference diagram

flowchart TB
  subgraph edge["Edge & clients"]
    WEB["Web app / mobile"]
    B2B["Partner API clients"]
    INT["Internal tools / agents"]
  end

  subgraph apigw["API Gateway"]
    TLS["TLS termination + WAF"]
    AUTH["JWT / OAuth + tenant claims"]
    RL_EDGE["Edge rate limits"]
    RID["Request ID injection"]
  end

  subgraph aisvc["AI Service — product orchestration"]
    IN_G["Input guardrails"]
    RAG_O["RAG orchestrator"]
    AGT["Agent / tool router"]
    OUT_G["Output guardrails"]
    ASM["Context assembler"]
  end

  subgraph llgw["LLM Gateway"]
    RT["Model router + aliases"]
    SC["Semantic cache Redis"]
    PC["Prompt cache headers"]
    FB["Retry + circuit breaker"]
    LEDGER["Cost ledger"]
  end

  subgraph data["Data plane"]
    VDB["Vector DB pgvector / Pinecone"]
    KB["Knowledge base S3 + Postgres"]
    RDIS["Redis — sessions + cache"]
    PG["Postgres — metadata + RLS"]
  end

  subgraph providers["Model providers"]
    OAI["OpenAI"]
    ANT["Anthropic"]
    BR["AWS Bedrock"]
    LOCAL["Local Llama — edge tasks"]
  end

  subgraph obs["Observability"]
    TRACE["Distributed tracing"]
    MET["Metrics + SLO dashboards"]
    LOG["Structured logs"]
    EVAL["Online eval sampling"]
    ALERT["PagerDuty / on-call"]
  end

  WEB --> TLS
  B2B --> TLS
  INT --> TLS
  TLS --> AUTH
  AUTH --> RL_EDGE
  RL_EDGE --> RID
  RID --> IN_G
  IN_G --> RAG_O
  IN_G --> AGT
  RAG_O --> ASM
  ASM --> VDB
  ASM --> KB
  RAG_O --> SC
  AGT --> RT
  ASM --> RT
  SC -->|miss| RT
  SC -->|hit| OUT_G
  RT --> FB
  FB --> OAI
  FB --> ANT
  FB --> BR
  FB --> LOCAL
  FB --> LEDGER
  RT --> PC
  OUT_G --> WEB
  VDB --> PG
  KB --> PG
  aisvc --> RDIS
  llgw --> RDIS
  aisvc --> TRACE
  llgw --> TRACE
  TRACE --> MET
  TRACE --> LOG
  LOG --> EVAL
  MET --> ALERT

Read the diagram top-to-bottom as request flow: edge clients at the top, data stores, providers, and observability at the bottom. The AI service never holds vendor API keys; it calls the gateway with X-Tenant-Id, X-Feature-Id, and a logical model alias (support-quality). Retrieval hits the vector DB with the same tenant namespace enforced at query time. Observability receives correlated spans from gateway and service via a shared trace_id injected at the API gateway.

Cross-track capability map

Architecture concern | Primary owner layer | Applied AI Core guide
Token budgeting & model aliases | LLM Gateway | Track 1: model selection
Chunking & ingestion | KB + async workers | Track 2: ingestion
System prompt & context assembly | AI Service | Track 3: system prompts
Tool calling & agent loops | AI Service | Track 4: tool calling
Input/output guardrails | AI Service (+ gateway policies) | Track 5: guardrails
CI eval regression gates | Deploy pipeline | Track 5: CI eval
Structured extraction pipelines | AI Service + workers | Guide 5: extraction
Multimodal document ingest | KB workers | Guide 4: multimodal

💡 Pro Tip

Draw this diagram in every architecture review before debating Pinecone vs pgvector. Stakeholders align on layers first; technology choices second.

🔬 Under the Hood

The API gateway injects traceparent (W3C Trace Context) or X-Request-Id before the request reaches the AI service. Never accept trace IDs generated by the browser: client-controlled headers can carry PII or spoofed identifiers straight into your logs.
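
As a minimal sketch of that rule, the snippet below strips client-supplied trace and tenant headers and injects gateway-owned values. The function names and the blocked-header list are illustrative assumptions, not part of any specific gateway product; only the traceparent layout (version, 32-hex trace id, 16-hex span id, flags) comes from the W3C Trace Context format.

```python
import re
import secrets

# Shape of a version-00 W3C traceparent header value.
TRACEPARENT_RE = re.compile(r"^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")

def new_traceparent(sampled: bool = True) -> str:
    """Build a traceparent value: version-traceid-spanid-flags."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def gateway_headers(incoming: dict[str, str], tenant_id: str) -> dict[str, str]:
    """Drop client-controlled trace/tenant headers, then inject trusted ones.

    tenant_id is assumed to come from the verified JWT, never from the request.
    """
    blocked = {"traceparent", "tracestate", "x-request-id", "x-tenant-id"}
    clean = {k: v for k, v in incoming.items() if k.lower() not in blocked}
    clean["traceparent"] = new_traceparent()
    clean["X-Tenant-Id"] = tenant_id
    return clean
```

The same hook is a natural place to attach X-Feature-Id if the gateway knows which product feature the route belongs to.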

⚠️ Pitfall

Embedding the vector DB client inside the LLM gateway couples retrieval latency to inference scaling. Keep retrieval in the AI service; gateway stays stateless for chat completions and embeddings only.

🎯 Interview Tip

“Design a production RAG system.” — Start with this layered diagram, then drill into tenant isolation, cache hit rates, and what happens when OpenAI 503s. Mention eval sampling in observability—interviewers notice.

RAG request lifecycle: ten production steps

A demo RAG pipeline is retrieve → prompt → answer. Production expands that into ten steps: authentication with tenant context, input guardrails, semantic cache lookup, query rewrite, hybrid retrieval, context budgeting, streamed generation, output guardrails, citation attachment, and async logging with eval sampling. Skip a step and you discover the gap on incident day. Knowing which step failed is the difference between a 5-minute rollback and a week-long postmortem.

Ten-step lifecycle overview

Step | Action | Sync / async | Track tie-in
1 | Authenticate + attach tenant context | Sync | Guide 1
2 | Input guardrails (PII, injection, policy) | Sync | Track 5: guardrails
3 | Semantic cache lookup (Redis) | Sync | Guide 2: cache
4 | Query rewrite / HyDE / decomposition | Sync (optional LLM) | Track 2: retrieval
5 | Hybrid retrieve (vector + BM25 + filters) | Sync | Track 2: advanced RAG
6 | Context assemble + token budget trim | Sync | Track 3: context
7 | LLM stream via gateway | Sync (stream) | Track 1: streaming
8 | Output guardrails + schema validation | Sync | Track 3: formatting
9 | Attach citations + cache write-back | Sync | Track 2: RAG explained
10 | Async logging, cost ledger, eval sample | Async | Track 5: observability
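
Step 6 (context assembly with a token budget) is the step most often left implicit. The sketch below shows one possible greedy trim: keep the highest-scoring chunks that fit. The Chunk type, the function names, and the 4-characters-per-token estimate are assumptions for illustration; a real service would count tokens with the target model's tokenizer.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float

def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)

def trim_to_budget(chunks: list[Chunk], budget_tokens: int) -> list[Chunk]:
    """Keep the highest-scoring chunks whose combined estimate fits the budget."""
    kept: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = estimate_tokens(chunk.text)
        if used + cost > budget_tokens:
            continue  # too large for the remaining budget; a smaller chunk may still fit
        kept.append(chunk)
        used += cost
    return kept
```

Skipping an oversized chunk rather than stopping at it trades strict score ordering for better use of the budget; either choice is defensible as long as it is deliberate.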

Sequence diagram: happy path + cache miss

sequenceDiagram
  autonumber
  participant C as Client
  participant GW as API Gateway
  participant AI as AI Service
  participant GR as Guardrails
  participant RC as Redis cache
  participant VDB as Vector DB
  participant LG as LLM Gateway
  participant LLM as Provider
  participant OBS as Observability

  C->>GW: POST /v1/chat (query, session)
  GW->>AI: Forward + tenant_id + trace_id
  AI->>GR: Input guard (step 2)
  alt blocked
    GR-->>C: 400 policy violation
  end
  AI->>RC: Semantic cache lookup (step 3)
  alt cache hit
    RC-->>AI: Cached answer + citations
    AI->>GR: Output guard (step 8)
    AI-->>C: SSE stream cached tokens
    AI--)OBS: Async log hit=true (step 10)
  else cache miss
    AI->>AI: Query rewrite (step 4)
    AI->>VDB: Hybrid retrieve top-k (step 5)
    VDB-->>AI: Chunks + scores
    AI->>AI: Assemble context trim (step 6)
    AI->>LG: Chat completion stream (step 7)
    LG->>LLM: Stream request
    loop token stream
      LLM-->>LG: delta
      LG-->>AI: delta
      AI-->>C: SSE event
    end
    AI->>GR: Output guard (step 8)
    AI->>RC: Cache write-back (step 9)
    AI--)OBS: Async log + eval sample (step 10)
  end

Orchestrator sketch

RAG lifecycle orchestrator (simplified)
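import asyncio
from typing import AsyncIterator

# Collaborators (input_guard, semantic_cache, retriever, llm_gateway, output_guard)
# and helper functions are assumed to be defined elsewhere in the AI service.
async def handle_rag_query(ctx: RequestContext, query: str) -> AsyncIterator[str]: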
    # Step 2 — input guardrails
    guard_in = await input_guard.check(query, tenant_id=ctx.tenant_id)
    if guard_in.blocked:
        raise PolicyViolation(guard_in.reason)

    # Step 3 — semantic cache
    cached = await semantic_cache.get(ctx.tenant_id, query, embedding_fn=embed)
    if cached:
        async for tok in stream_cached(cached):
            yield tok
        asyncio.create_task(log_async(ctx, query, cached, cache_hit=True))
        return

    # Steps 4–6 — rewrite, retrieve, assemble
    rewritten = await rewrite_query(query, ctx)
    chunks = await retriever.hybrid_search(
        tenant_id=ctx.tenant_id,
        query=rewritten,
        top_k=12,
        filters=ctx.doc_filters,
    )
    messages = assemble_context(system=ctx.system_prompt, chunks=chunks, query=query)

    # Steps 7–8 — stream via gateway + output guard
    buffer: list[str] = []
    async for delta in llm_gateway.stream(messages, alias="support-quality", ctx=ctx):
        buffer.append(delta)
        yield delta

    answer = "".join(buffer)
    out = await output_guard.check(answer, citations=chunks)
    if out.redacted:
        answer = out.text

    # Step 9 — citations + cache write-back
    await semantic_cache.put(ctx.tenant_id, query, answer, chunks)
    yield format_citations(chunks)

    # Step 10 — async logging (non-blocking)
    asyncio.create_task(log_async(ctx, query, answer, chunks=chunks, cache_hit=False))

// Java (Project Reactor) equivalent
public Flux<String> handleRagQuery(RequestContext ctx, String query) {
  return inputGuard.check(query, ctx.tenantId())
      .flatMapMany(in -> {
        if (in.blocked()) return Flux.error(new PolicyViolation(in.reason()));
        return semanticCache.get(ctx.tenantId(), query)
            .flatMapMany(cached -> streamCached(cached)
                .doOnComplete(() -> logAsync(ctx, query, cached, true)))
            .switchIfEmpty(Flux.defer(() -> {
              var rewritten = rewriteQuery(query, ctx);
              var chunks = retriever.hybridSearch(ctx.tenantId(), rewritten, 12, ctx.docFilters());
              var messages = assembleContext(ctx.systemPrompt(), chunks, query);
              var buffer = new StringBuilder();
              return llmGateway.stream(messages, "support-quality", ctx)
                  .doOnNext(buffer::append)
                  .concatWith(Flux.defer(() -> {
                    var answer = outputGuard.check(buffer.toString(), chunks);
                    semanticCache.put(ctx.tenantId(), query, answer.text(), chunks);
                    logAsync(ctx, query, answer.text(), false);
                    return formatCitations(chunks);
                  }));
            }));
      });
}

💡 Pro Tip

Run step 10 (async logging) on a dedicated queue with backpressure—never block the SSE stream on Splunk or Langfuse ingest. Users feel logging latency as “stuck” final tokens.

💰 Cost

Query rewrite (step 4) costs one extra LLM call—enable only for conversational queries or when retrieval confidence < threshold. FAQ-style queries should skip rewrite and hit cache step 3 first.

⚠️ Pitfall

Output guardrails (step 8) after streaming means you may have already sent toxic tokens to the client. For high-risk domains, buffer the full response server-side or use a two-pass “draft + verify” pattern from Track 5.
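
A minimal sketch of the "buffer server-side" option, assuming a hypothetical is_safe callback that stands in for the real output guard. Nothing reaches the client until the full answer has passed the check, which costs time-to-first-token in exchange for never leaking an unsafe partial response.

```python
import asyncio
from typing import AsyncIterator, Callable

async def guarded_stream(
    deltas: AsyncIterator[str],
    is_safe: Callable[[str], bool],
    fallback: str = "Sorry, I can't share that response.",
) -> AsyncIterator[str]:
    """Buffer the whole response, check it once, then release or replace it."""
    buffer: list[str] = []
    async for delta in deltas:
        buffer.append(delta)  # held back; the client has seen nothing yet
    answer = "".join(buffer)
    if is_safe(answer):
        for delta in buffer:
            yield delta  # replay the original chunks unchanged
    else:
        yield fallback
```

For lower-risk features, the original stream-then-check flow keeps latency low; this wrapper is meant for routes where a leaked token is worse than a slower first byte.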

🔬 Under the Hood

Cache keys should include tenant_id, model_alias, kb_version, and embedding model id—otherwise a re-ingest serves stale answers from yesterday's corpus.
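
One way to encode that rule is to derive a cache namespace from all four fields, so a re-ingest (new kb_version) or an embedding model change automatically lands in a fresh keyspace. The function name and the key layout below are illustrative assumptions.

```python
import hashlib

def cache_namespace(tenant_id: str, model_alias: str, kb_version: str, embedding_model: str) -> str:
    """Build a Redis key prefix that scopes semantic-cache entries.

    Any change to model alias, KB version, or embedding model yields a new prefix,
    so stale entries are simply never looked up again.
    """
    raw = "|".join([tenant_id, model_alias, kb_version, embedding_model])
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
    return f"semcache:{tenant_id}:{digest}"
```

Keeping tenant_id readable in the prefix makes per-tenant purges and debugging easier; old namespaces can expire through ordinary TTLs.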

Multi-tenant RAG: isolation without duplicate stacks

SaaS AI products face a tension: customers demand data isolation and regional residency (EU vs US), while finance demands shared infrastructure so margin does not collapse at 500 tenants. The production pattern is logical namespaces on shared vector and object stores, enforced with pgvector RLS, gateway tenant context, and regional data planes—not 500 separate Pinecone indexes unless contractually required.

Isolation models compared

Model | Isolation strength | Ops cost | When to use
Shared index + namespace filter | Medium (app-enforced) | Low | Early SaaS, trusted internal tenants
Shared pgvector + RLS | High (DB-enforced) | Medium | Default for B2B SaaS
Index per tenant | Very high | High | Enterprise contracts, <50 large tenants
Cell per region (EU/US) | Regulatory | Medium-high | GDPR, data localization clauses

Regional data plane architecture

flowchart LR
  subgraph global["Global control plane"]
    CP["Tenant registry"]
    FF["Feature flags"]
    RT_CFG["Routing config"]
  end

  subgraph eu["EU data plane — Frankfurt"]
    EU_GW["API Gateway EU"]
    EU_AI["AI Service EU"]
    EU_PG["Postgres + pgvector RLS"]
    EU_S3["S3 EU bucket KB"]
    EU_REDIS["Redis EU"]
  end

  subgraph us["US data plane — Virginia"]
    US_GW["API Gateway US"]
    US_AI["AI Service US"]
    US_PG["Postgres + pgvector RLS"]
    US_S3["S3 US bucket KB"]
    US_REDIS["Redis US"]
  end

  CP --> EU_GW
  CP --> US_GW
  FF --> EU_AI
  FF --> US_AI

  EU_AI --> EU_PG
  EU_AI --> EU_S3
  EU_AI --> EU_REDIS
  US_AI --> US_PG
  US_AI --> US_S3
  US_AI --> US_REDIS

  EU_PG -->|"SET app.tenant_id"| RLS_EU["RLS policy"]
  US_PG -->|"SET app.tenant_id"| RLS_US["RLS policy"]

Tenant registry stores home_region (eu-central-1 | us-east-1), data_class, and kb_namespace. The API gateway routes to the correct regional cell based on JWT claims; never trust client-supplied region headers. Cross-region inference (EU docs, US LLM endpoint) breaks residency commitments unless the customer's DPA (or a BAA for HIPAA workloads) explicitly allows it.
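
The routing rule can be sketched as a pure function. The hostnames, the shape of the registry dict, and the function name are hypothetical; the point is that the region is looked up from the tenant in the verified JWT and that request headers are deliberately ignored.

```python
# Hypothetical regional endpoints; real hostnames would come from routing config.
REGION_ENDPOINTS = {
    "eu-central-1": "https://ai.eu.example.com",
    "us-east-1": "https://ai.us.example.com",
}

class UnknownRegionError(Exception):
    """Raised when a tenant's home_region has no configured cell."""

def resolve_cell(jwt_claims: dict, tenant_registry: dict, request_headers: dict) -> str:
    """Pick the regional cell from verified claims plus the tenant registry."""
    tenant_id = jwt_claims["tenant_id"]
    home_region = tenant_registry[tenant_id]["home_region"]
    # request_headers is accepted only to make the point explicit:
    # a client-supplied region hint is never consulted.
    try:
        return REGION_ENDPOINTS[home_region]
    except KeyError:
        raise UnknownRegionError(home_region) from None
```

Failing closed on an unknown region is intentional: falling back to a default cell would silently move data across a residency boundary.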

pgvector RLS pattern

Tenant-scoped retrieval with RLS
# SQL migration (run once)
RLS_SQL = """
ALTER TABLE document_chunks ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON document_chunks
  USING (tenant_id = current_setting('app.tenant_id')::uuid);
CREATE INDEX ON document_chunks USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);
"""

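async def search_chunks(pool, tenant_id: str, embedding: list[float], top_k: int = 10):
    # pgvector accepts the text form '[0.1,0.2,...]'; alternatively register the
    # pgvector codec for asyncpg and pass the list directly.
    vec = "[" + ",".join(str(x) for x in embedding) + "]"
    async with pool.acquire() as conn:
        # Connect as a non-owner role: table owners bypass RLS unless FORCE ROW LEVEL SECURITY is set.
        # set_config(..., true) is transaction-local, so it must share a transaction with the query.
        async with conn.transaction():
            await conn.execute("SELECT set_config('app.tenant_id', $1, true)", tenant_id)
            rows = await conn.fetch(
                """
                SELECT id, content, metadata,
                       1 - (embedding <=> $1::vector) AS score
                FROM document_chunks
                ORDER BY embedding <=> $1::vector
                LIMIT $2
                """,
                vec, top_k,
            )
    return rows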

// Java (JDBC) equivalent
public List<ChunkHit> searchChunks(DataSource ds, UUID tenantId, float[] embedding, int topK)
    throws SQLException {
  try (var conn = ds.getConnection()) {
    // set_config(..., true) is transaction-local, so autocommit must be off
    conn.setAutoCommit(false);
    // Bind the tenant id as a parameter instead of concatenating it into SQL
    try (var st = conn.prepareStatement("SELECT set_config('app.tenant_id', ?, true)")) {
      st.setString(1, tenantId.toString());
      st.execute();
    }
    try (var ps = conn.prepareStatement("""
        SELECT id, content, metadata,
               1 - (embedding <=> ?::vector) AS score
        FROM document_chunks
        ORDER BY embedding <=> ?::vector
        LIMIT ?
        """)) {
      ps.setObject(1, toPgVector(embedding));
      ps.setObject(2, toPgVector(embedding));
      ps.setInt(3, topK);
      // RLS filters rows even if WHERE tenant_id is forgotten
      var hits = mapRows(ps.executeQuery());
      conn.commit();
      return hits;
    }
  }
}

Namespace metadata contract

Field | Scope | Example | Enforced at
tenant_id | Row isolation | tnt_acme_corp | Postgres RLS + gateway JWT
kb_namespace | Logical KB partition | acme/hr-policies | Retriever filter + S3 prefix
home_region | Data residency | eu-central-1 | DNS / gateway routing
embedding_model | Vector compatibility | text-embedding-3-large | Ingest + query must match
kb_version | Cache invalidation | v2026-06-07T14:00Z | Semantic cache key

⚠️ Pitfall

Application-level WHERE tenant_id = ? without RLS fails the moment an intern ships a raw SQL debug endpoint. RLS is defense in depth—assume application bugs happen.

💡 Pro Tip

Integration tests should attempt cross-tenant retrieval with wrong JWTs and expect zero rows—not 403 from the app layer only. RLS tests belong in CI.

🎯 Interview Tip

“How do you multi-tenant a vector DB?” — Namespace + RLS on pgvector for most cases; per-index for regulated enterprise; regional cells for GDPR. Mention embedding model pinning per tenant.

Cost at scale: five levers that actually move the needle

At 10K queries/day, cost is noise. At 10M, the same architecture bankrupts you. Production teams combine semantic cache (30–40% hit rates on support FAQ workloads), provider prompt caching (explicit cache breakpoints on Anthropic, automatic prefix caching on OpenAI), model tiering (Haiku for classify, Opus for synthesize), Batch API for offline jobs, and local Llama for simple, well-bounded tasks such as classification, PII regex assist, and query routing.

Cost lever matrix

Lever | Typical savings | Latency impact | Guide reference
Semantic cache (Redis) | 30–40% $ on FAQ/support | −200–800 ms on hit | Guide 2: semantic cache
Anthropic prompt caching | ~90% on cached input tokens | Neutral | Track 1: tokens
Model tiering | 50–80% vs all-frontier | +/- by route | Track 1: model selection
OpenAI Batch API | 50% on batch-eligible jobs | Hours (async) | Guide 5: extraction
Local Llama 3.x 8B | ~$0 marginal per 1M tokens | +50–200 ms on GPU | Guide 3: adaptation

Routing decision tree

flowchart TD
  Q["Incoming request"] --> CACHE{"Semantic cache hit?"}
  CACHE -->|yes| RET["Return cached — $0 LLM"]
  CACHE -->|no| CLASS["Classify intent — local Llama / Haiku"]
  CLASS --> SIMPLE{"Simple FAQ / routing?"}
  SIMPLE -->|yes| SMALL["Small model — Haiku / GPT-4o-mini"]
  SIMPLE -->|no| RAG{"Needs retrieval?"}
  RAG -->|yes| MED["Medium + RAG — Sonnet"]
  RAG -->|no| COMPLEX{"Multi-step reasoning?"}
  COMPLEX -->|yes| BIG["Frontier — Opus / o3"]
  COMPLEX -->|no| MED
  SMALL --> BATCH{"Offline acceptable?"}
  MED --> BATCH
  BIG --> BATCH
  BATCH -->|yes| BATCHAPI["Batch API — 50% discount"]
  BATCH -->|no| LIVE["Live gateway stream"]

Semantic cache implementation notes

Target 0.92–0.95 cosine similarity threshold for cache hits—lower thresholds save money until wrong answers spike support tickets. Include tenant id and KB version in the cache key. Log cache_hit, similarity_score, and saved_tokens for finance dashboards. Teams with 30–40% hit rates on tier-1 support typically see 25–35% total inference spend reduction after gateway overhead.
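
To make the threshold concrete, here is an in-memory sketch of the lookup: compare the query embedding against stored entries and accept the best match only if it clears the threshold. The entry layout and function names are assumptions; a production cache would use a vector index in Redis rather than a linear scan.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def lookup(entries: list[tuple[list[float], str]], query_vec: list[float], threshold: float = 0.92):
    """Return (answer, similarity) for the best entry at or above threshold, else None."""
    best = None
    for vec, answer in entries:
        sim = cosine(vec, query_vec)
        if sim >= threshold and (best is None or sim > best[1]):
            best = (answer, sim)
    return best
```

Returning the similarity alongside the answer lets the caller log similarity_score for the finance dashboard without a second computation.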

Anthropic prompt caching on system block
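import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# LONG_SYSTEM_PROMPT (~8K tokens) and user_query are assumed to be defined by the caller.
# The cache breakpoint saves ~90% on cached input tokens for repeat calls.
response = client.messages.create(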
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)
usage = response.usage
# usage.cache_creation_input_tokens, usage.cache_read_input_tokens

// Java equivalent (class and builder names follow the Anthropic Java SDK; verify against your SDK version)
var systemBlock = TextBlockParam.builder()
    .text(LONG_SYSTEM_PROMPT)
    .cacheControl(CacheControlEphemeral.builder().build())
    .build();

var response = client.messages().create(MessageCreateParams.builder()
    .model("claude-sonnet-4-20250514")
    .maxTokens(1024L)
    .systemOfTextBlockParams(List.of(systemBlock))
    .addUserMessage(userQuery)
    .build());
// response.usage().cacheReadInputTokens()

💰 Cost

Prompt caching shines when system prompts exceed 2K tokens and traffic is repetitive—exactly RAG with a fat policy doc. Pair with semantic cache: prompt cache saves input tokens; semantic cache skips the LLM entirely.

💡 Pro Tip

Run a weekly “cost attribution” report: top 10 tenants by spend, top 10 features by p95 tokens, cache hit rate by feature. Finance trusts graphs; engineering trusts row-level ledger from Guide 2.

⚠️ Pitfall

Local Llama for classification sounds free until you provision GPU nodes 24/7. Use serverless GPU or batch classify offline unless QPS justifies reserved capacity.

🔬 Under the Hood

Batch API jobs need idempotency keys per input row: when a job is retried after its 24h completion window expires, downstream actions triggered by extraction results must not fire twice.
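
A minimal sketch of per-row idempotency, assuming the key is derived from the job name plus the canonical JSON of the row. The helper names and the in-memory seen set are illustrative; a real pipeline would persist processed keys in a durable store and could also send the key as the per-request custom_id in the batch input file.

```python
import hashlib
import json
from typing import Callable

def idempotency_key(job_name: str, row: dict) -> str:
    """Stable key for one input row, independent of dict key order."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{job_name}:{canonical}".encode("utf-8")).hexdigest()[:32]

def apply_once(key: str, seen: set[str], action: Callable[[], None]) -> bool:
    """Run the downstream action only if this key has not been processed yet."""
    if key in seen:
        return False  # a retry of an already-processed row: skip the side effect
    action()
    seen.add(key)
    return True
```

Marking the key as seen only after the action succeeds means a crash mid-action leads to a retry rather than a silently dropped row.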

Reliability: survive provider outages gracefully

LLM providers are unreliable dependencies with opaque incidents. Production architecture treats them like payment processors: circuit breakers stop retry storms, bulkheads isolate chat from batch extraction, timeout + retry + jitter handles transient 429/503, a provider health dashboard drives automatic failover, and offline cached mode returns stale-but-safe answers when everything upstream is red.

Resilience pattern catalog

Pattern | Problem solved | Typical config | Owner
Circuit breaker | Retry storm during outage | Open after 5 failures / 30s; half-open probe | LLM Gateway
Bulkhead | Batch jobs starve interactive chat | Separate connection pools + quotas | AI Service pools
Timeout + retry + jitter | Hung streams, thundering herd | 60s stream timeout; 3 retries; full jitter | LLM Gateway
Provider health dashboard | Manual failover lag | Rolling error rate > 5% → deprioritize route | Platform SRE
Offline cached mode | Total provider failure | Semantic cache + static FAQ fallback | AI Service feature flag

Failover state machine

stateDiagram-v2
  [*] --> Closed
  Closed --> Open: error_rate > threshold\nOR 5 consecutive 5xx
  Open --> HalfOpen: cooldown 60s
  HalfOpen --> Closed: probe success
  HalfOpen --> Open: probe failure
  Closed --> Degraded: latency_p95 > SLO
  Degraded --> Closed: metrics recover
  Degraded --> OfflineCache: all routes unhealthy
  OfflineCache --> HalfOpen: any route healthy

Gateway retry with jitter

Retry with full jitter + circuit breaker
import random
import asyncio

RETRYABLE = {429, 500, 502, 503, 529}

async def call_with_resilience(route: str, fn, max_attempts: int = 4):
    # `circuit` and CircuitOpenError are assumed to be provided by the gateway.
    if circuit.is_open(route):
        raise CircuitOpenError(route)
    for attempt in range(max_attempts):
        try:
            result = await asyncio.wait_for(fn(), timeout=60.0)
        except Exception as e:
            # Timeouts and errors without a status code are treated as 500 (retryable)
            status = getattr(e, "status_code", 500)
            if status not in RETRYABLE or attempt == max_attempts - 1:
                circuit.record_failure(route)
                raise
            base = min(2 ** attempt, 32)
            delay = random.uniform(0, base)  # full jitter
            await asyncio.sleep(delay)
        else:
            # Record success before returning; a call placed after the loop would never run
            circuit.record_success(route)
            return result

// Java equivalent
public <T> T callWithResilience(String route, Supplier<CompletableFuture<T>> fn) throws Exception {
  if (circuit.isOpen(route)) throw new CircuitOpenException(route);
  for (int attempt = 0; attempt < 4; attempt++) {
    try {
      var result = fn.get().get(60, TimeUnit.SECONDS);
      circuit.recordSuccess(route);
      return result;
    } catch (Exception e) {
      int status = extractStatus(e);
      if (!RETRYABLE.contains(status) || attempt == 3) {
        circuit.recordFailure(route);
        throw e;
      }
      // Full jitter: sleep a random duration in [0, min(2^attempt, 32)) seconds
      double delay = ThreadLocalRandom.current().nextDouble(0, Math.min(1 << attempt, 32));
      Thread.sleep((long) (delay * 1000));
    }
  }
  throw new IllegalStateException("unreachable");
}

Offline cached mode UX contract

When circuit state is OfflineCache, return answers from semantic cache even below normal similarity threshold (e.g. 0.88), prepend a banner: “AI temporarily limited—answer may be outdated,” and disable tool calling. Never hallucinate fresh policy in offline mode—static FAQ JSON beats confident wrong answers. Wire mode to a feature flag (next section) for instant kill-switch rollback.

💡 Pro Tip

Provider health dashboard should aggregate your error rates per route—not vendor status pages alone. Vendor green + your 40% 429 means key pool exhaustion, not “OpenAI is fine.”

⚠️ Pitfall

Circuit breakers inside the client SDK of every microservice create split-brain: one service's breaker is open for OpenAI while another's is still closed. Centralize breaker state in the LLM gateway, backed by Redis.

🔬 Under the Hood

Bulkheads are as much about max_concurrent_streams per tenant as separate thread pools—one tenant's agent loop should not exhaust the shared streaming semaphore.
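
A per-tenant stream cap can be sketched with plain counters. The class name and the fail-fast behaviour are assumptions; the check-and-increment is safe inside a single asyncio event loop because no await sits between the two steps, while a multi-process gateway would need a shared counter (for example in Redis).

```python
from collections import defaultdict

class TenantBulkhead:
    """Caps concurrent streams per tenant so one tenant cannot exhaust shared capacity."""

    def __init__(self, max_concurrent_streams: int = 2):
        self._limit = max_concurrent_streams
        self._active: dict[str, int] = defaultdict(int)

    def try_acquire(self, tenant_id: str) -> bool:
        if self._active[tenant_id] >= self._limit:
            return False  # reject fast instead of queueing behind a busy tenant
        self._active[tenant_id] += 1
        return True

    def release(self, tenant_id: str) -> None:
        # Clamp at zero so a double release cannot grant extra capacity.
        self._active[tenant_id] = max(0, self._active[tenant_id] - 1)
```

Callers would typically wrap the stream in try/finally so release runs even when the client disconnects mid-stream.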

🎯 Interview Tip

“OpenAI is down—what happens?” — Walk circuit breaker → failover route (Bedrock/Azure) → offline cache → static degradation. Mention user-visible banner and disabled tools.

Feature flags: ship AI without betting the company

AI features are high-risk merges: a prompt change is a behavior change. Production teams wrap every new model route, RAG pipeline version, and agent tool behind feature flags—gradual rollout 5% → 20% → 100%, model A/B with eval metrics, and a kill switch that disables inference in one click. LaunchDarkly and Unleash are common; the pattern matters more than the vendor.

Flag taxonomy for AI platforms

Flag type | Example key | Rollout strategy | Kill criteria
Release | rag-v2-hybrid-search | 5% → 20% → 100% over 2 weeks | RAG faithfulness < baseline −5%
Model A/B | model-route-sonnet-vs-gpt4o | 50/50 by tenant_id hash | Cost + quality pareto worse
Ops kill switch | ai-inference-enabled | Boolean global | Provider outage / cost spike
Tenant override | tenant-acme-beta-agent | Allow-list tenants | Contract pilot ends
Offline mode | offline-cache-only | Manual + auto trigger | Linked to circuit breaker

Gradual rollout timeline

gantt
  title RAG v2 rollout with eval gates
  dateFormat YYYY-MM-DD
  section Canary
  5% internal + friendly tenants    :a1, 2026-06-01, 3d
  section Expand
  20% production traffic             :a2, after a1, 4d
  section GA
  100% with kill switch armed        :a3, after a2, 7d
  section Eval gates
  CI golden set                      :milestone, 2026-06-01, 0d
  Online faithfulness sample         :milestone, 2026-06-05, 0d
  Cost per resolved ticket check     :milestone, 2026-06-10, 0d

Flag evaluation in AI service

LaunchDarkly / Unleash-style evaluation
from dataclasses import dataclass

@dataclass
class FlagContext:
    tenant_id: str
    user_id: str
    feature_id: str

def resolve_model_alias(flags, ctx: FlagContext) -> str:
    if not flags.bool_variation("ai-inference-enabled", ctx, default=True):
        raise InferenceDisabledError("Kill switch active")

    if flags.bool_variation("offline-cache-only", ctx, default=False):
        return "offline-cache"

    ab = flags.string_variation("model-route-ab", ctx, default="sonnet")
    if flags.bool_variation("rag-v2-hybrid-search", ctx, default=False):
        return f"rag-v2-{ab}"
    return f"rag-v1-{ab}"

// Java equivalent
public String resolveModelAlias(FeatureFlags flags, FlagContext ctx) {
  if (!flags.boolVariation("ai-inference-enabled", ctx, true)) {
    throw new InferenceDisabledException("Kill switch active");
  }
  if (flags.boolVariation("offline-cache-only", ctx, false)) {
    return "offline-cache";
  }
  var ab = flags.stringVariation("model-route-ab", ctx, "sonnet");
  if (flags.boolVariation("rag-v2-hybrid-search", ctx, false)) {
    return "rag-v2-" + ab;
  }
  return "rag-v1-" + ab;
}

Model A/B must log variant on every trace and join to eval scores—otherwise you cannot prove Sonnet beat GPT-4o on your retrieval corpus. Tie flag changes to deploy tickets; LaunchDarkly audit log is part of SOC2 evidence.

💡 Pro Tip

Hash tenant_id for rollout percentage, not user_id—B2B users within one tenant should see consistent behavior during partial rollout.
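
Flag vendors implement bucketing internally, but the idea is easy to sketch: hash the flag key together with tenant_id into a stable bucket from 0 to 99 and compare it with the rollout percentage. The function names are hypothetical and the exact hashing scheme is an assumption, not any vendor's algorithm.

```python
import hashlib

def rollout_bucket(flag_key: str, tenant_id: str) -> int:
    """Map (flag, tenant) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag_key}:{tenant_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100

def is_enabled(flag_key: str, tenant_id: str, rollout_percent: int) -> bool:
    """A tenant enabled at 5% stays enabled at 20% and 100%."""
    return rollout_bucket(flag_key, tenant_id) < rollout_percent
```

Including the flag key in the hash keeps the same tenants from always landing in the first 5% of every experiment.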

⚠️ Pitfall

Feature flags without TTL become permanent if/else spaghetti. Every AI flag gets an owner, expiry date, and cleanup ticket when GA reaches 100%.

💰 Cost

Model A/B at 50/50 doubles frontier spend during experiments—cap experiment duration and sample size; use Track 5 offline eval before online A/B when possible.

Production: Applied AI Core capstone checklist

You have crossed six tracks and thirty-six guides. This checklist is what “production AI platform ready” means when an architect signs off—not demo complete. Paste it into your platform epic, then celebrate: Applied AI Core is complete.

🎉 Series complete

Congratulations—you finished Applied AI Core. Six tracks: LLMs, RAG, Prompts, Agents, Evaluation, and Shipping. Return to the Applied AI Core hub anytime.

End-to-end platform checklist

  • Reference architecture documented: Client → API GW → AI Service → LLM Gateway → providers + vector DB + KB + Redis + observability.
  • RAG lifecycle implements all 10 steps with async logging and eval sampling on step 10.
  • Multi-tenant namespaces with pgvector RLS (or contractual per-tenant index) and EU/US regional cells where required.
  • Semantic cache targeting 30–40% hit on repetitive workloads; prompt caching on fat system prompts.
  • Model tiering + Batch API + local Llama routing for simple classification tasks.
  • Circuit breaker + bulkhead + timeout/retry/jitter centralized in LLM gateway.
  • Provider health dashboard drives automatic route deprioritization.
  • Offline cached mode with user-visible degradation banner.
  • Feature flags: 5→20→100% rollout, model A/B with traced variants, global kill switch tested quarterly.
  • CI eval gates from Track 5 block prompt/model merges; on-call runbook links provider outage playbook.
  • Cost ledger per tenant/feature; weekly attribution report to finance.
  • Cross-track traceability table maintained—every layer maps to a guide owner.

All six tracks — quick reference

Track | Focus | Start guide | Capstone tie-in
1: LLMs | APIs, tokens, embeddings, cost | LLMs explained | Gateway provider layer, streaming step 7
2: RAG | Chunking, retrieval, ingestion | RAG explained | Vector DB + KB, lifecycle steps 4–6
3: Prompts | System prompts, context, formatting | Prompt explained | Context assembly step 6, output format step 8
4: Agents | Tools, memory, MCP, agentic RAG | Agents explained | AI service agent router, bulkhead pools
5: Evals | Judge, RAG eval, guardrails, CI | Evaluation explained | Guardrails steps 2/8, async eval step 10
6: Shipping | Hello Ship → architecture | Hello Ship | This guide: full platform composition

Track 6 guide sequence — complete

Guide | Topic | Status
Guide 1 | Hello Ship: API to production |
Guide 2 | LLM gateway & infrastructure |
Guide 3 | Fine-tuning & adaptation |
Guide 4 | Multimodal ingest & processing |
Guide 5 | Structured data extraction & agents |
Guide 6 | Production AI architecture patterns | you are here

Full curriculum map (36 guides)

Track | Guides
Track 1 | LLMs explained · APIs & tokens · Embeddings · Multimodal · Model selection
Track 2 | RAG explained · Chunking · Vector DBs · Retrieval · Advanced RAG · Ingestion
Track 3 | Prompt explained · Techniques · System prompts · Formatting · Context · Production prompts
Track 4 | Agents explained · Tool calling · Memory · Multi-agent · MCP · Agentic RAG
Track 5 | Evaluation explained · LLM-as-judge · RAG eval · Guardrails · Observability · CI eval
Track 6 | Hello Ship · Gateway · Fine-tuning · Multimodal · Extraction · Production architecture

Platform smoke test: post-deploy
import httpx

def platform_smoke(base: str, tenant: str) -> None:
    """Validates capstone path: readiness → RAG with citations → cache hit (a 200 also implies the kill switch is off)."""
    httpx.get(f"{base}/ready", timeout=5.0).raise_for_status()
    q = "What is our refund policy for EU customers?"
    r1 = httpx.post(f"{base}/v1/chat", json={"query": q, "tenant_id": tenant}, timeout=90.0)
    r1.raise_for_status()
    assert r1.json().get("citations")
    r2 = httpx.post(f"{base}/v1/chat", json={"query": q, "tenant_id": tenant}, timeout=30.0)
    r2.raise_for_status()
    assert r2.json().get("cache_hit") is True

// Java equivalent
public void platformSmoke(String base, String tenant) {
  rest.get().uri(base + "/ready").retrieve().toBodilessEntity();
  var q = "What is our refund policy for EU customers?";
  var r1 = rest.post().uri(base + "/v1/chat")
      .body(new ChatRequest(q, tenant)).retrieve().body(ChatResponse.class);
  assert r1 != null && r1.citations() != null && !r1.citations().isEmpty();
  var r2 = rest.post().uri(base + "/v1/chat")
      .body(new ChatRequest(q, tenant)).retrieve().body(ChatResponse.class);
  assert r2 != null && Boolean.TRUE.equals(r2.cacheHit());
}

🎯 Interview Tip

“You finished our AI curriculum—summarize it.” — Six tracks from inference to shipping; capstone is layered architecture with RAG lifecycle, tenant isolation, cost levers, resilience, and flags. Offer to whiteboard the mermaid stack from memory.

🔬 Under the Hood

The platform smoke test asserts cache_hit on the second identical query. That single assertion forces retrieval, cache write-back, and tenant-scoped cache keys to work together; it catches more regressions than /health alone.

💰 Cost

Capstone complete does not mean spend-capped. Before GA, confirm per-tenant budgets, kill switch drill, and finance dashboard—the architecture enables control; flags and gateway quotas enforce it.

Where to go next

FAQ: Build vs buy the platform?

Start with managed gateway (Portkey/LiteLLM) + pgvector + your AI service. Extract to full DIY when spend exceeds vendor fees or compliance demands cell isolation you cannot configure. The layers stay the same—only ownership changes.

FAQ: Did we cover fine-tuning?

Yes—Guide 3 covers when adaptation beats prompting. The capstone assumes most traffic routes through RAG + tiered models; fine-tuned adapters plug in at the LLM gateway alias layer.

FAQ: What's the minimum viable capstone?

Hello Ship service + LLM gateway + pgvector RLS + semantic cache + input guard + CI eval smoke + kill switch. Add regional cells and agent routers when contracts and complexity demand them—not on day one.