Production AI architecture patterns

AI-native architecture: the full stack

Production AI is not “FastAPI + OpenAI.” It is a layered platform: clients hit an API gateway, business logic lives in an AI service, inference flows through an LLM gateway, retrieval touches vector DB + knowledge base, hot paths use Redis cache, and every hop emits telemetry to observability. The diagram below is the reference architecture teams at scale converge on—whether they build it or buy Portkey + Pinecone + Datadog.

The mistake in v1 is collapsing layers. When the chat endpoint calls OpenAI, embeds inline, and queries Postgres in one 400-line file, you ship fast—and inherit a monolith that cannot isolate tenant blast radius, swap embedding models without redeploying UI, or fail over providers without editing application code. The AI-native split keeps product orchestration (RAG assembly, tool routing, session state) in the AI service while vendor concerns (keys, retries, semantic cache, cost ledger) stay in the LLM gateway from Guide 2.

Layer responsibilities

Layer	Owns	Does not own	Track reference
Client	UX, streaming SSE, session tokens	API keys, retrieval logic	Guide 1 — Hello Ship
API Gateway	TLS, WAF, JWT/OAuth, rate limits, request IDs	Prompt assembly, model selection	Platform / Kong / AWS API GW
AI Service	RAG orchestration, guardrails hook, tool/agent loops	Raw vendor SDK keys	Track 4 — agentic RAG
LLM Gateway	Model routing, semantic cache, failover, cost ledger	Business rules per tenant SKU	Guide 2 — gateway
Providers	Inference, embeddings APIs	Your SLA to end users	Track 1 — APIs
Vector DB + KB	Chunk storage, hybrid search, ingest metadata	Chat session history (usually)	Track 2 — vector DBs
Redis cache	Semantic cache, session scratch, rate-limit counters	Durable audit logs	Guide 2 — semantic cache
Observability	Traces, metrics, eval sampling, cost dashboards	Blocking bad prompts (that's guardrails)	Track 5 — observability

Full-stack reference diagram

flowchart TB
  subgraph edge["Edge & clients"]
    WEB["Web app / mobile"]
    B2B["Partner API clients"]
    INT["Internal tools / agents"]
  end

  subgraph apigw["API Gateway"]
    TLS["TLS termination + WAF"]
    AUTH["JWT / OAuth + tenant claims"]
    RL_EDGE["Edge rate limits"]
    RID["Request ID injection"]
  end

  subgraph aisvc["AI Service — product orchestration"]
    IN_G["Input guardrails"]
    RAG_O["RAG orchestrator"]
    AGT["Agent / tool router"]
    OUT_G["Output guardrails"]
    ASM["Context assembler"]
  end

  subgraph llgw["LLM Gateway"]
    RT["Model router + aliases"]
    SC["Semantic cache Redis"]
    PC["Prompt cache headers"]
    FB["Retry + circuit breaker"]
    LEDGER["Cost ledger"]
  end

  subgraph data["Data plane"]
    VDB["Vector DB pgvector / Pinecone"]
    KB["Knowledge base S3 + Postgres"]
    RDIS["Redis — sessions + cache"]
    PG["Postgres — metadata + RLS"]
  end

  subgraph providers["Model providers"]
    OAI["OpenAI"]
    ANT["Anthropic"]
    BR["AWS Bedrock"]
    LOCAL["Local Llama — edge tasks"]
  end

  subgraph obs["Observability"]
    TRACE["Distributed tracing"]
    MET["Metrics + SLO dashboards"]
    LOG["Structured logs"]
    EVAL["Online eval sampling"]
    ALERT["PagerDuty / on-call"]
  end

  WEB --> TLS
  B2B --> TLS
  INT --> TLS
  TLS --> AUTH
  AUTH --> RL_EDGE
  RL_EDGE --> RID
  RID --> IN_G
  IN_G --> RAG_O
  IN_G --> AGT
  RAG_O --> ASM
  ASM --> VDB
  ASM --> KB
  RAG_O --> SC
  AGT --> RT
  ASM --> RT
  SC -->|miss| RT
  SC -->|hit| OUT_G
  RT --> FB
  FB --> OAI
  FB --> ANT
  FB --> BR
  FB --> LOCAL
  FB --> LEDGER
  RT --> PC
  OUT_G --> WEB
  VDB --> PG
  KB --> PG
  aisvc --> RDIS
  llgw --> RDIS
  aisvc --> TRACE
  llgw --> TRACE
  TRACE --> MET
  TRACE --> LOG
  LOG --> EVAL
  MET --> ALERT

Read the diagram top-to-bottom: it is drawn as a top-down flowchart, so request flow and dependency depth both run from clients at the top to providers and the data plane below. The AI service never holds vendor API keys; it calls the gateway with X-Tenant-Id, X-Feature-Id, and a logical model alias (support-quality). Retrieval hits the vector DB with the same tenant namespace enforced at query time. Observability receives correlated spans from gateway and service via a shared trace_id that the API gateway generates and propagates.
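
To make the key-free contract concrete, here is a minimal sketch of what the AI service might hand to the LLM gateway. The `GatewayCall` type, the `build_gateway_call` helper, and the `X-Trace-Id` header name are illustrative assumptions; only X-Tenant-Id, X-Feature-Id, and the support-quality alias come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatewayCall:
    """Everything the AI service sends to the LLM gateway. No vendor key appears here."""
    headers: dict
    body: dict


def build_gateway_call(tenant_id: str, feature_id: str, trace_id: str,
                       messages: list, alias: str = "support-quality") -> GatewayCall:
    headers = {
        "X-Tenant-Id": tenant_id,    # gateway uses this for quotas and the cost ledger
        "X-Feature-Id": feature_id,  # lets finance attribute spend per feature
        "X-Trace-Id": trace_id,      # hypothetical header name for the shared trace_id
    }
    # The body names a logical alias; the gateway maps it to a concrete provider model.
    body = {"model": alias, "messages": messages, "stream": True}
    return GatewayCall(headers=headers, body=body)
```

Because the alias is resolved inside the gateway, swapping the model behind support-quality needs no change to this call site.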

Cross-track capability map

Architecture concern	Primary owner layer	Applied AI Core guide
Token budgeting & model aliases	LLM Gateway	Track 1 — model selection
Chunking & ingestion	KB + async workers	Track 2 — ingestion
System prompt & context assembly	AI Service	Track 3 — system prompts
Tool calling & agent loops	AI Service	Track 4 — tool calling
Input/output guardrails	AI Service (+ gateway policies)	Track 5 — guardrails
CI eval regression gates	Deploy pipeline	Track 5 — CI eval
Structured extraction pipelines	AI Service + workers	Guide 5 — extraction
Multimodal document ingest	KB workers	Guide 4 — multimodal

💡 Pro Tip

Draw this diagram in every architecture review before debating Pinecone vs pgvector. Stakeholders align on layers first; technology choices second.

🔬 Under the Hood

The API gateway injects traceparent (W3C) or X-Request-Id before the request reaches the AI service. Never let the browser generate trace IDs: client-controlled headers can carry PII or spoofed IDs straight into your logs.
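
A minimal sketch of that injection step, assuming the gateway can run a Python hook. The function names and the list of stripped headers are assumptions; the header layout (version, 32-hex trace id, 16-hex parent id, 2-hex flags) follows the W3C Trace Context format.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(r"^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")


def new_traceparent(sampled: bool = True) -> str:
    """Build a W3C traceparent value: version-traceid-parentid-flags."""
    trace_id = secrets.token_hex(16)   # 32 hex characters
    parent_id = secrets.token_hex(8)   # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"


def inject_trace_headers(incoming: dict) -> dict:
    """Drop client-supplied trace headers, then attach a gateway-generated one."""
    blocked = {"traceparent", "tracestate", "x-request-id"}
    clean = {k: v for k, v in incoming.items() if k.lower() not in blocked}
    clean["traceparent"] = new_traceparent()
    return clean
```

Stripping before injecting matters: simply adding a header would leave a differently cased client copy in place.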

⚠️ Pitfall

Embedding the vector DB client inside the LLM gateway couples retrieval latency to inference scaling. Keep retrieval in the AI service; gateway stays stateless for chat completions and embeddings only.

🎯 Interview Tip

“Design a production RAG system.” — Start with this layered diagram, then drill into tenant isolation, cache hit rates, and what happens when OpenAI 503s. Mention eval sampling in observability—interviewers notice.

RAG request lifecycle: ten production steps

A demo RAG pipeline is retrieve → prompt → answer. Production stretches that into ten steps: authentication with tenant context, input guardrails, semantic cache lookup, query rewrite, hybrid retrieval, context budgeting, streamed generation, output guardrails, citation attachment with cache write-back, and async logging with eval sampling. Skip a step and you learn about it on incident day. Knowing which step failed is the difference between a 5-minute rollback and a week-long postmortem.

Ten-step lifecycle overview

Step	Action	Sync / async	Track tie-in
1	Authenticate + attach tenant context	Sync	Guide 1
2	Input guardrails (PII, injection, policy)	Sync	Track 5 — guardrails
3	Semantic cache lookup (Redis)	Sync	Guide 2 — cache
4	Query rewrite / HyDE / decomposition	Sync (optional LLM)	Track 2 — retrieval
5	Hybrid retrieve (vector + BM25 + filters)	Sync	Track 2 — advanced RAG
6	Context assemble + token budget trim	Sync	Track 3 — context
7	LLM stream via gateway	Sync (stream)	Track 1 — streaming
8	Output guardrails + schema validation	Sync	Track 3 — formatting
9	Attach citations + cache write-back	Sync	Track 2 — RAG explained
10	Async logging, cost ledger, eval sample	Async	Track 5 — observability

Sequence diagram: happy path + cache miss

sequenceDiagram
  autonumber
  participant C as Client
  participant GW as API Gateway
  participant AI as AI Service
  participant GR as Guardrails
  participant RC as Redis cache
  participant VDB as Vector DB
  participant LG as LLM Gateway
  participant LLM as Provider
  participant OBS as Observability

  C->>GW: POST /v1/chat (query, session)
  GW->>AI: Forward + tenant_id + trace_id
  AI->>GR: Input guard (step 2)
  alt blocked
    GR-->>C: 400 policy violation
  end
  AI->>RC: Semantic cache lookup (step 3)
  alt cache hit
    RC-->>AI: Cached answer + citations
    AI->>GR: Output guard (step 8)
    AI-->>C: SSE stream cached tokens
    AI--)OBS: Async log hit=true (step 10)
  else cache miss
    AI->>AI: Query rewrite (step 4)
    AI->>VDB: Hybrid retrieve top-k (step 5)
    VDB-->>AI: Chunks + scores
    AI->>AI: Assemble context trim (step 6)
    AI->>LG: Chat completion stream (step 7)
    LG->>LLM: Stream request
    loop token stream
      LLM-->>LG: delta
      LG-->>AI: delta
      AI-->>C: SSE event
    end
    AI->>GR: Output guard (step 8)
    AI->>RC: Cache write-back (step 9)
    AI--)OBS: Async log + eval sample (step 10)
  end

Orchestrator sketch

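import asyncio
from typing import AsyncIterator

# Sketch only: RequestContext, the guards, semantic_cache, retriever, and llm_gateway are app-specific.
async def handle_rag_query(ctx: RequestContext, query: str) -> AsyncIterator[str]: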
    # Step 2 — input guardrails
    guard_in = await input_guard.check(query, tenant_id=ctx.tenant_id)
    if guard_in.blocked:
        raise PolicyViolation(guard_in.reason)

    # Step 3 — semantic cache
    cached = await semantic_cache.get(ctx.tenant_id, query, embedding_fn=embed)
    if cached:
        async for tok in stream_cached(cached):
            yield tok
        asyncio.create_task(log_async(ctx, query, cached, cache_hit=True))
        return

    # Steps 4–6 — rewrite, retrieve, assemble
    rewritten = await rewrite_query(query, ctx)
    chunks = await retriever.hybrid_search(
        tenant_id=ctx.tenant_id,
        query=rewritten,
        top_k=12,
        filters=ctx.doc_filters,
    )
    messages = assemble_context(system=ctx.system_prompt, chunks=chunks, query=query)

    # Steps 7–8 — stream via gateway + output guard
    buffer: list[str] = []
    async for delta in llm_gateway.stream(messages, alias="support-quality", ctx=ctx):
        buffer.append(delta)
        yield delta

    answer = "".join(buffer)
    out = await output_guard.check(answer, citations=chunks)
    if out.redacted:
        answer = out.text

    # Step 9 — citations + cache write-back
    await semantic_cache.put(ctx.tenant_id, query, answer, chunks)
    yield format_citations(chunks)

    # Step 10 — async logging (non-blocking)
    asyncio.create_task(log_async(ctx, query, answer, chunks=chunks, cache_hit=False))

public Flux<String> handleRagQuery(RequestContext ctx, String query) {
  return inputGuard.check(query, ctx.tenantId())
      .flatMapMany(in -> {
        if (in.blocked()) return Flux.error(new PolicyViolation(in.reason()));
        return semanticCache.get(ctx.tenantId(), query)
            .flatMapMany(cached -> streamCached(cached)
                .doOnComplete(() -> logAsync(ctx, query, cached, true)))
            .switchIfEmpty(Flux.defer(() -> {
              var rewritten = rewriteQuery(query, ctx);
              var chunks = retriever.hybridSearch(ctx.tenantId(), rewritten, 12, ctx.docFilters());
              var messages = assembleContext(ctx.systemPrompt(), chunks, query);
              var buffer = new StringBuilder();
              return llmGateway.stream(messages, "support-quality", ctx)
                  .doOnNext(buffer::append)
                  .concatWith(Flux.defer(() -> {
                    var answer = outputGuard.check(buffer.toString(), chunks);
                    semanticCache.put(ctx.tenantId(), query, answer.text(), chunks);
                    logAsync(ctx, query, answer.text(), false);
                    return formatCitations(chunks);
                  }));
            }));
      });
}

💡 Pro Tip

Run step 10 (async logging) on a dedicated queue with backpressure—never block the SSE stream on Splunk or Langfuse ingest. Users feel logging latency as “stuck” final tokens.
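
One way to get that behavior is a bounded in-process queue that the request path never waits on. This is a sketch under stated assumptions: `LogQueue`, its `submit`/`drain` methods, and the drop-on-full policy are illustrative choices, not a prescribed design.

```python
import asyncio


class LogQueue:
    """Bounded queue: the request path enqueues without waiting; a worker drains it."""

    def __init__(self, maxsize: int = 1000) -> None:
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self.dropped = 0

    def submit(self, event: dict) -> bool:
        """Never blocks. Returns False and counts a drop when the queue is full."""
        try:
            self._queue.put_nowait(event)
            return True
        except asyncio.QueueFull:
            self.dropped += 1
            return False

    async def drain(self, sink, batch_size: int = 50) -> int:
        """Send up to batch_size queued events to the sink; returns how many were sent."""
        sent = 0
        while sent < batch_size and not self._queue.empty():
            await sink(self._queue.get_nowait())
            sent += 1
        return sent
```

Dropping on overflow trades log completeness for stream latency; export the `dropped` counter as a metric so the loss is visible.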

💰 Cost

Query rewrite (step 4) costs one extra LLM call—enable only for conversational queries or when retrieval confidence < threshold. FAQ-style queries should skip rewrite and hit cache step 3 first.

⚠️ Pitfall

Output guardrails (step 8) after streaming means you may have already sent toxic tokens to the client. For high-risk domains, buffer the full response server-side or use a two-pass “draft + verify” pattern from Track 5.
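
The server-side buffering option can be sketched as a wrapper around the token stream. `guarded_stream`, the `is_safe` callback, and the refusal text are hypothetical names; the point is only that nothing is yielded until the guard has seen the full answer.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def guarded_stream(
    deltas: AsyncIterator[str],
    is_safe: Callable[[str], Awaitable[bool]],
    refusal: str = "This response was withheld by policy.",
) -> AsyncIterator[str]:
    """Collect the whole answer, run the output guard once, then emit."""
    buffer: list = []
    async for delta in deltas:
        buffer.append(delta)       # nothing reaches the client yet
    answer = "".join(buffer)
    if await is_safe(answer):
        for delta in buffer:       # replay the original chunks after approval
            yield delta
    else:
        yield refusal
```

The cost is time-to-first-token: the user waits for the full generation plus the guard call, which is why this fits high-risk domains rather than every route.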

🔬 Under the Hood

Cache keys should include tenant_id, model_alias, kb_version, and embedding model id—otherwise a re-ingest serves stale answers from yesterday's corpus.
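
A small sketch of a key prefix built from those four fields. The `cache_scope` name and the `semcache:` prefix are assumptions; any stable hash of the same inputs would do.

```python
import hashlib


def cache_scope(tenant_id: str, model_alias: str, kb_version: str, embedding_model: str) -> str:
    """Key prefix that partitions the semantic cache by everything that can invalidate it."""
    # The unit separator keeps ("a", "bc") and ("ab", "c") from colliding.
    raw = "\x1f".join([tenant_id, model_alias, kb_version, embedding_model])
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
    return f"semcache:{tenant_id}:{digest}"
```

Bumping kb_version on re-ingest changes the prefix, so old entries are simply never matched again and can expire on their own TTL.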

Multi-tenant RAG: isolation without duplicate stacks

SaaS AI products face a tension: customers demand data isolation and regional residency (EU vs US), while finance demands shared infrastructure so margin does not collapse at 500 tenants. The production pattern is logical namespaces on shared vector and object stores, enforced with pgvector RLS, gateway tenant context, and regional data planes—not 500 separate Pinecone indexes unless contractually required.

Isolation models compared

Model	Isolation strength	Ops cost	When to use
Shared index + namespace filter	Medium (app-enforced)	Low	Early SaaS, trusted internal tenants
Shared pgvector + RLS	High (DB-enforced)	Medium	Default for B2B SaaS
Index per tenant	Very high	High	Enterprise contracts, <50 large tenants
Cell per region (EU/US)	Regulatory	Medium-high	GDPR, data localization clauses

Regional data plane architecture

flowchart LR
  subgraph global["Global control plane"]
    CP["Tenant registry"]
    FF["Feature flags"]
    RT_CFG["Routing config"]
  end

  subgraph eu["EU data plane — Frankfurt"]
    EU_GW["API Gateway EU"]
    EU_AI["AI Service EU"]
    EU_PG["Postgres + pgvector RLS"]
    EU_S3["S3 EU bucket KB"]
    EU_REDIS["Redis EU"]
  end

  subgraph us["US data plane — Virginia"]
    US_GW["API Gateway US"]
    US_AI["AI Service US"]
    US_PG["Postgres + pgvector RLS"]
    US_S3["S3 US bucket KB"]
    US_REDIS["Redis US"]
  end

  CP --> EU_GW
  CP --> US_GW
  FF --> EU_AI
  FF --> US_AI

  EU_AI --> EU_PG
  EU_AI --> EU_S3
  EU_AI --> EU_REDIS
  US_AI --> US_PG
  US_AI --> US_S3
  US_AI --> US_REDIS

  EU_PG -->|"SET app.tenant_id"| RLS_EU["RLS policy"]
  US_PG -->|"SET app.tenant_id"| RLS_US["RLS policy"]

Tenant registry stores home_region (eu-central-1 | us-east-1), data_class, and kb_namespace. The API gateway routes to the correct regional cell based on verified JWT claims; never trust client-supplied region headers. Cross-region inference (EU docs, US LLM endpoint) is likely a compliance violation unless the customer's DPA (or BAA, for health data) explicitly allows it.
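
The routing rule can be sketched as a lookup keyed only by the verified tenant claim. The `TenantRecord` fields mirror the registry described above; the endpoint URLs, `RoutingError`, and `resolve_cell` are hypothetical, and the claims are assumed to have passed JWT signature validation already.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TenantRecord:
    tenant_id: str
    home_region: str    # "eu-central-1" or "us-east-1"
    kb_namespace: str


# Hypothetical cell endpoints, one per regional data plane.
CELL_ENDPOINTS = {
    "eu-central-1": "https://eu.ai.example.internal",
    "us-east-1": "https://us.ai.example.internal",
}


class RoutingError(Exception):
    """Raised when a request cannot be mapped to exactly one regional cell."""


def resolve_cell(claims: Mapping[str, str], registry: Mapping[str, TenantRecord],
                 headers: Optional[Mapping[str, str]] = None) -> str:
    # `headers` is accepted only to make the point: client region hints are ignored.
    tenant_id = claims.get("tenant_id")
    record = registry.get(tenant_id) if tenant_id else None
    if record is None:
        raise RoutingError("unknown tenant")
    endpoint = CELL_ENDPOINTS.get(record.home_region)
    if endpoint is None:
        raise RoutingError(f"no cell for region {record.home_region}")
    return endpoint
```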

pgvector RLS pattern

# SQL migration (run once)
RLS_SQL = """
ALTER TABLE document_chunks ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON document_chunks
  USING (tenant_id = current_setting('app.tenant_id')::uuid);
CREATE INDEX ON document_chunks USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);
"""

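async def search_chunks(pool, tenant_id: str, embedding: list[float], top_k: int = 10):
    # Assumes the pool registers pgvector's asyncpg codec (pgvector.asyncpg.register_vector)
    # so a list[float] can bind to a vector parameter.
    async with pool.acquire() as conn:
        # set_config(..., true) is transaction-local, so the setting and the query
        # must share one transaction or the RLS policy cannot see the tenant.
        async with conn.transaction():
            await conn.execute("SELECT set_config('app.tenant_id', $1, true)", tenant_id)
            rows = await conn.fetch(
                """
                SELECT id, content, metadata,
                       1 - (embedding <=> $1::vector) AS score
                FROM document_chunks
                ORDER BY embedding <=> $1::vector
                LIMIT $2
                """,
                embedding, top_k,  # tenant_id is not a query parameter; RLS applies it
            )
    return rows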

public List<ChunkHit> searchChunks(DataSource ds, UUID tenantId, float[] embedding, int topK)
    throws SQLException {
  try (var conn = ds.getConnection()) {
    // set_config(..., true) is transaction-local: disable autocommit so the
    // setting and the query run in the same transaction.
    conn.setAutoCommit(false);
    try (var st = conn.prepareStatement("SELECT set_config('app.tenant_id', ?, true)")) {
      st.setString(1, tenantId.toString());  // bind instead of concatenating into SQL
      st.execute();
    }
    try (var ps = conn.prepareStatement("""
        SELECT id, content, metadata,
               1 - (embedding <=> ?::vector) AS score
        FROM document_chunks
        ORDER BY embedding <=> ?::vector
        LIMIT ?
        """)) {
      ps.setObject(1, toPgVector(embedding));
      ps.setObject(2, toPgVector(embedding));
      ps.setInt(3, topK);
      // RLS filters rows, even if WHERE tenant_id is forgotten
      var hits = mapRows(ps.executeQuery());
      conn.commit();
      return hits;
    }
  }
}

Namespace metadata contract

Field	Scope	Example	Enforced at
tenant_id	Row isolation	tnt_acme_corp	Postgres RLS + gateway JWT
kb_namespace	Logical KB partition	acme/hr-policies	Retriever filter + S3 prefix
home_region	Data residency	eu-central-1	DNS / gateway routing
embedding_model	Vector compatibility	text-embedding-3-large	Ingest + query must match
kb_version	Cache invalidation	v2026-06-07T14:00Z	Semantic cache key

⚠️ Pitfall

Application-level WHERE tenant_id = ? without RLS fails the moment an intern ships a raw SQL debug endpoint. RLS is defense in depth—assume application bugs happen.

💡 Pro Tip

Integration tests should attempt cross-tenant retrieval with wrong JWTs and expect zero rows—not 403 from the app layer only. RLS tests belong in CI.

🎯 Interview Tip

“How do you multi-tenant a vector DB?” — Namespace + RLS on pgvector for most cases; per-index for regulated enterprise; regional cells for GDPR. Mention embedding model pinning per tenant.

Cost at scale: five levers that actually move the needle

At 10K queries/day, cost is noise. At 10M, the same architecture bankrupts you. Production teams combine semantic cache (30–40% hit rates on support FAQ workloads), provider prompt caching (Anthropic/OpenAI cache breakpoints on system prompts), model tiering (Haiku for classify, Opus for synthesize), Batch API for offline jobs, and local Llama for deterministic simple tasks—classification, PII regex assist, query routing.

Cost lever matrix

Lever	Typical savings	Latency impact	Guide reference
Semantic cache (Redis)	30–40% $ on FAQ/support	Saves 200–800 ms on hit	Guide 2 — semantic cache
Anthropic prompt caching	~90% on cached input tokens	Neutral	Track 1 — tokens
Model tiering	50–80% vs all-frontier	+/- by route	Track 1 — model selection
OpenAI Batch API	50% on batch-eligible jobs	Hours (async)	Guide 5 — extraction
Local Llama 3.x 8B	~$0 marginal per 1M tokens	+50–200 ms on GPU	Guide 3 — adaptation

Routing decision tree

flowchart TD
  Q["Incoming request"] --> CACHE{"Semantic cache hit?"}
  CACHE -->|yes| RET["Return cached — $0 LLM"]
  CACHE -->|no| CLASS["Classify intent — local Llama / Haiku"]
  CLASS --> SIMPLE{"Simple FAQ / routing?"}
  SIMPLE -->|yes| SMALL["Small model — Haiku / GPT-4o-mini"]
  SIMPLE -->|no| RAG{"Needs retrieval?"}
  RAG -->|yes| MED["Medium + RAG — Sonnet"]
  RAG -->|no| COMPLEX{"Multi-step reasoning?"}
  COMPLEX -->|yes| BIG["Frontier — Opus / o3"]
  COMPLEX -->|no| MED
  SMALL --> BATCH{"Offline acceptable?"}
  MED --> BATCH
  BIG --> BATCH
  BATCH -->|yes| BATCHAPI["Batch API — 50% discount"]
  BATCH -->|no| LIVE["Live gateway stream"]
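
The decision tree above can be read as a small pure function. This is a sketch that assumes the upstream classifier has already produced the boolean signals; the `Route` type and tier names are illustrative, not an existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    tier: str   # "cached", "small", "medium", or "frontier"
    mode: str   # "none" (no LLM call), "batch", or "live"


def choose_route(cache_hit: bool, simple: bool, needs_retrieval: bool,
                 multi_step: bool, offline_ok: bool) -> Route:
    if cache_hit:
        return Route(tier="cached", mode="none")   # no LLM spend at all
    if simple:
        tier = "small"
    elif needs_retrieval:
        tier = "medium"                            # medium model plus RAG
    elif multi_step:
        tier = "frontier"
    else:
        tier = "medium"
    # Every non-cached tier then picks batch or live delivery.
    return Route(tier=tier, mode="batch" if offline_ok else "live")
```

Keeping the tree as a pure function makes it easy to unit test and to log the chosen tier next to each request's cost.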

Semantic cache implementation notes

Target 0.92–0.95 cosine similarity threshold for cache hits—lower thresholds save money until wrong answers spike support tickets. Include tenant id and KB version in the cache key. Log cache_hit, similarity_score, and saved_tokens for finance dashboards. Teams with 30–40% hit rates on tier-1 support typically see 25–35% total inference spend reduction after gateway overhead.
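
To illustrate the threshold, here is a minimal in-memory lookup using plain cosine similarity. It is a sketch only: a real deployment would use a vector index in Redis rather than a linear scan, and the `lookup` signature is an assumption.

```python
import math
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lookup(query_vec: Sequence[float], entries: list, threshold: float = 0.92):
    """entries holds (vector, answer) pairs already scoped to one tenant and KB version.

    Returns (answer_or_None, best_similarity) so the score can be logged either way.
    """
    best_answer: Optional[str] = None
    best_score = 0.0
    for vector, answer in entries:
        score = cosine(query_vec, vector)
        if score > best_score:
            best_answer, best_score = answer, score
    if best_score >= threshold:
        return best_answer, best_score
    return None, best_score
```

Returning the score on a miss is deliberate: near-miss scores are the data you need when tuning the threshold.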

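import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# System prompt ~8K tokens: the cache breakpoint cuts cached input-token cost ~90% on repeat calls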
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)
usage = response.usage
# usage.cache_creation_input_tokens, usage.cache_read_input_tokens

var systemBlock = SystemContentBlock.builder()
    .text(LONG_SYSTEM_PROMPT)
    .cacheControl(CacheControlEphemeral.builder().build())
    .build();

var response = client.messages().create(MessageCreateParams.builder()
    .model("claude-sonnet-4-20250514")
    .maxTokens(1024L)
    .systemOf(systemBlock)
    .addUserMessage(userQuery)
    .build());
// response.usage().cacheReadInputTokens()

💰 Cost

Prompt caching shines when system prompts exceed 2K tokens and traffic is repetitive—exactly RAG with a fat policy doc. Pair with semantic cache: prompt cache saves input tokens; semantic cache skips the LLM entirely.

💡 Pro Tip

Run a weekly “cost attribution” report: top 10 tenants by spend, top 10 features by p95 tokens, cache hit rate by feature. Finance trusts graphs; engineering trusts row-level ledger from Guide 2.
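
A sketch of the aggregation behind such a report, assuming ledger rows are dicts with `tenant_id`, `feature_id`, `cost_usd`, and `cache_hit` fields. Those field names are assumptions; the ledger schema from Guide 2 is not shown here.

```python
from collections import defaultdict


def top_spenders(rows: list, key: str, n: int = 10) -> list:
    """Sum cost_usd per tenant_id or feature_id and return the n largest."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["cost_usd"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


def cache_hit_rate_by_feature(rows: list) -> dict:
    """Fraction of requests per feature that were served from the semantic cache."""
    hits = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        total[row["feature_id"]] += 1
        hits[row["feature_id"]] += 1 if row["cache_hit"] else 0
    return {feature: hits[feature] / total[feature] for feature in total}
```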

⚠️ Pitfall

Local Llama for classification sounds free until you provision GPU nodes 24/7. Use serverless GPU or batch classify offline unless QPS justifies reserved capacity.

🔬 Under the Hood

Batch API jobs need idempotency keys per input row—retries after 24h window expiry should not double-charge downstream actions triggered by extraction results.
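
One way to derive such keys is to hash the job name together with a canonical form of the input row. This is a sketch: `idempotency_key` and `apply_once` are hypothetical helpers, and the in-memory set stands in for a durable store.

```python
import hashlib
import json


def idempotency_key(job_name: str, row: dict) -> str:
    """Same job and same row content always give the same key, whatever the dict order."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{job_name}:{canonical}".encode("utf-8")).hexdigest()


def apply_once(key: str, seen: set, action) -> bool:
    """Run the downstream action only the first time a key is observed."""
    if key in seen:
        return False
    action()
    seen.add(key)   # recorded after success, so a failed action can be retried
    return True
```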

Reliability: survive provider outages gracefully

LLM providers are unreliable dependencies with opaque incidents. Production architecture treats them like payment processors: circuit breakers stop retry storms, bulkheads isolate chat from batch extraction, timeout + retry + jitter handles transient 429/503, a provider health dashboard drives automatic failover, and offline cached mode returns stale-but-safe answers when everything upstream is red.

Resilience pattern catalog

Pattern	Problem solved	Typical config	Owner
Circuit breaker	Retry storm during outage	Open after 5 failures / 30s; half-open probe	LLM Gateway
Bulkhead	Batch jobs starve interactive chat	Separate connection pools + quotas	AI Service pools
Timeout + retry + jitter	Hung streams, thundering herd	60s stream timeout; 3 retries; full jitter	LLM Gateway
Provider health dashboard	Manual failover lag	Error rate > 5% over a rolling window → deprioritize route	Platform SRE
Offline cached mode	Total provider failure	Semantic cache + static FAQ fallback	AI Service feature flag

Failover state machine

stateDiagram-v2
  [*] --> Closed
  Closed --> Open: error_rate > threshold\nOR 5 consecutive 5xx
  Open --> HalfOpen: cooldown 60s
  HalfOpen --> Closed: probe success
  HalfOpen --> Open: probe failure
  Closed --> Degraded: latency_p95 > SLO
  Degraded --> Closed: metrics recover
  Degraded --> OfflineCache: all routes unhealthy
  OfflineCache --> HalfOpen: any route healthy
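
The core of the state machine above (Closed, Open, HalfOpen) fits in a small class. This sketch leaves out the Degraded and OfflineCache states and trips on consecutive failures only; the class shape and the injectable clock are assumptions made so the behavior can be tested without sleeping.

```python
import time
from typing import Callable


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self.state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    def allow(self) -> bool:
        """True if a call may proceed. Moves open -> half_open once the cooldown passes."""
        if self.state == "open" and self._clock() - self._opened_at >= self.cooldown_s:
            self.state = "half_open"
        return self.state != "open"

    def record_success(self) -> None:
        self._failures = 0
        self.state = "closed"        # a successful probe closes the circuit

    def record_failure(self) -> None:
        if self.state == "half_open":
            self._trip()             # a failed probe reopens immediately
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._trip()

    def _trip(self) -> None:
        self.state = "open"
        self._opened_at = self._clock()
        self._failures = 0
```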

Gateway retry with jitter

import random
import asyncio

RETRYABLE = {429, 500, 502, 503, 529}

async def call_with_resilience(route: str, fn, max_attempts: int = 4):
    if circuit.is_open(route):
        raise CircuitOpenError(route)
    for attempt in range(max_attempts):
        try:
            result = await asyncio.wait_for(fn(), timeout=60.0)
        except Exception as e:
            # Timeouts carry no status_code, so they default to 500 and are retried.
            status = getattr(e, "status_code", 500)
            if status not in RETRYABLE or attempt == max_attempts - 1:
                circuit.record_failure(route)
                raise
            base = min(2 ** attempt, 32)
            delay = random.uniform(0, base)  # full jitter
            await asyncio.sleep(delay)
        else:
            # Record success before returning; code after the loop is never reached.
            circuit.record_success(route)
            return result

public <T> T callWithResilience(String route, Supplier<CompletableFuture<T>> fn) throws Exception {
  if (circuit.isOpen(route)) throw new CircuitOpenException(route);
  for (int attempt = 0; attempt < 4; attempt++) {
    try {
      var result = fn.get().get(60, TimeUnit.SECONDS);
      circuit.recordSuccess(route);
      return result;
    } catch (Exception e) {
      int status = extractStatus(e);
      if (!RETRYABLE.contains(status) || attempt == 3) {
        circuit.recordFailure(route);
        throw e;
      }
      double delay = ThreadLocalRandom.current().nextDouble(0, Math.min(1 << attempt, 32));
      Thread.sleep((long) (delay * 1000));
    }
  }
  throw new IllegalStateException("unreachable");
}

Offline cached mode UX contract

When circuit state is OfflineCache, return answers from semantic cache even below normal similarity threshold (e.g. 0.88), prepend a banner: “AI temporarily limited—answer may be outdated,” and disable tool calling. Never hallucinate fresh policy in offline mode—static FAQ JSON beats confident wrong answers. Wire mode to a feature flag (next section) for instant kill-switch rollback.
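
The contract can be sketched as one function that never calls a model. `offline_answer`, its return shape, and the "unavailable" fallback text are assumptions; the 0.88 threshold, the banner, and the disabled tools come from the paragraph above.

```python
from typing import Optional, Tuple

BANNER = "AI temporarily limited: answer may be outdated."


def offline_answer(faq_key: str, cached: Optional[Tuple[str, float]],
                   static_faq: dict, offline_threshold: float = 0.88) -> dict:
    """cached is (answer, similarity) for the best semantic-cache match, or None."""
    if cached is not None and cached[1] >= offline_threshold:
        text, source = cached[0], "semantic_cache"
    elif faq_key in static_faq:
        text, source = static_faq[faq_key], "static_faq"
    else:
        # No model is called in offline mode, so there is nothing to hallucinate.
        text, source = "This assistant is temporarily unavailable.", "unavailable"
    return {"banner": BANNER, "text": text, "source": source, "tools_enabled": False}
```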

💡 Pro Tip

Provider health dashboard should aggregate your error rates per route—not vendor status pages alone. Vendor green + your 40% 429 means key pool exhaustion, not “OpenAI is fine.”

⚠️ Pitfall

Circuit breakers embedded in the client SDK of every microservice create split-brain: one service's breaker for OpenAI is open while another's is still closed. Centralize breaker state in the LLM gateway, backed by its Redis.

🔬 Under the Hood

Bulkheads are as much about max_concurrent_streams per tenant as separate thread pools—one tenant's agent loop should not exhaust the shared streaming semaphore.
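
A per-tenant cap can be sketched with a counter guarded by an async context manager. `TenantBulkhead` and `TenantSaturated` are hypothetical names, and rejecting instead of queueing is a design assumption; a single-process counter like this also only protects one replica.

```python
import asyncio
from collections import defaultdict
from contextlib import asynccontextmanager


class TenantSaturated(Exception):
    """Raised when a tenant already holds its maximum number of streams."""


class TenantBulkhead:
    def __init__(self, max_concurrent_streams: int = 4) -> None:
        self._limit = max_concurrent_streams
        self._active = defaultdict(int)

    @asynccontextmanager
    async def slot(self, tenant_id: str):
        # No await between the check and the increment, so this is safe on one event loop.
        if self._active[tenant_id] >= self._limit:
            raise TenantSaturated(tenant_id)
        self._active[tenant_id] += 1
        try:
            yield
        finally:
            self._active[tenant_id] -= 1   # always release, even if the stream fails
```

Wrap each streamed completion in `async with bulkhead.slot(ctx.tenant_id):` and map `TenantSaturated` to a 429 for that tenant only.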

🎯 Interview Tip

“OpenAI is down—what happens?” — Walk circuit breaker → failover route (Bedrock/Azure) → offline cache → static degradation. Mention user-visible banner and disabled tools.

Feature flags: ship AI without betting the company

AI features are high-risk merges: a prompt change is a behavior change. Production teams wrap every new model route, RAG pipeline version, and agent tool behind feature flags—gradual rollout 5% → 20% → 100%, model A/B with eval metrics, and a kill switch that disables inference in one click. LaunchDarkly and Unleash are common; the pattern matters more than the vendor.

Flag taxonomy for AI platforms

Flag type	Example key	Rollout strategy	Kill criteria
Release	rag-v2-hybrid-search	5% → 20% → 100% over 2 weeks	RAG faithfulness < baseline −5%
Model A/B	model-route-ab	50/50 by tenant_id hash	Cost + quality pareto worse
Ops kill switch	ai-inference-enabled	Boolean global	Provider outage / cost spike
Tenant override	tenant-acme-beta-agent	Allow-list tenants	Contract pilot ends
Offline mode	offline-cache-only	Manual + auto trigger	Linked to circuit breaker

Gradual rollout timeline

gantt
  title RAG v2 rollout with eval gates
  dateFormat YYYY-MM-DD
  section Canary
  5% internal + friendly tenants    :a1, 2026-06-01, 3d
  section Expand
  20% production traffic             :a2, after a1, 4d
  section GA
  100% with kill switch armed        :a3, after a2, 7d
  section Eval gates
  CI golden set                      :milestone, 2026-06-01, 0d
  Online faithfulness sample         :milestone, 2026-06-05, 0d
  Cost per resolved ticket check     :milestone, 2026-06-10, 0d

Flag evaluation in AI service

from dataclasses import dataclass

@dataclass
class FlagContext:
    tenant_id: str
    user_id: str
    feature_id: str

def resolve_model_alias(flags, ctx: FlagContext) -> str:
    if not flags.bool_variation("ai-inference-enabled", ctx, default=True):
        raise InferenceDisabledError("Kill switch active")

    if flags.bool_variation("offline-cache-only", ctx, default=False):
        return "offline-cache"

    ab = flags.string_variation("model-route-ab", ctx, default="sonnet")
    if flags.bool_variation("rag-v2-hybrid-search", ctx, default=False):
        return f"rag-v2-{ab}"
    return f"rag-v1-{ab}"

public String resolveModelAlias(FeatureFlags flags, FlagContext ctx) {
  if (!flags.boolVariation("ai-inference-enabled", ctx, true)) {
    throw new InferenceDisabledException("Kill switch active");
  }
  if (flags.boolVariation("offline-cache-only", ctx, false)) {
    return "offline-cache";
  }
  var ab = flags.stringVariation("model-route-ab", ctx, "sonnet");
  if (flags.boolVariation("rag-v2-hybrid-search", ctx, false)) {
    return "rag-v2-" + ab;
  }
  return "rag-v1-" + ab;
}

Model A/B must log the variant on every trace and join it to eval scores; otherwise you cannot prove Sonnet beat GPT-4o on your retrieval corpus. Tie flag changes to deploy tickets; the LaunchDarkly audit log can then serve as SOC 2 change-management evidence.

💡 Pro Tip

Hash tenant_id for rollout percentage, not user_id—B2B users within one tenant should see consistent behavior during partial rollout.
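
A minimal sketch of tenant-level bucketing, assuming you evaluate the percentage yourself instead of delegating to the flag vendor. The function names are illustrative; salting with the flag key keeps one tenant from always landing in the first cohort of every rollout.

```python
import hashlib


def rollout_bucket(flag_key: str, tenant_id: str) -> int:
    """Stable bucket in [0, 100) derived from the flag key and tenant id."""
    digest = hashlib.sha256(f"{flag_key}:{tenant_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def in_rollout(flag_key: str, tenant_id: str, percent: int) -> bool:
    """True when the tenant falls inside the first `percent` buckets."""
    return rollout_bucket(flag_key, tenant_id) < percent
```

Because the bucket is fixed per tenant, moving from 5% to 20% only adds tenants; nobody who already had the feature loses it mid-rollout.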

⚠️ Pitfall

Feature flags without TTL become permanent if/else spaghetti. Every AI flag gets an owner, expiry date, and cleanup ticket when GA reaches 100%.

💰 Cost

Model A/B at 50/50 sends half of production traffic to the challenger for the whole experiment; when that arm is a frontier model, spend climbs fast. Cap experiment duration and sample size, and use Track 5 offline eval before online A/B when possible.

Production: Applied AI Core capstone checklist

You have crossed six tracks and thirty-six guides. This checklist is what “production AI platform ready” means when an architect signs off—not demo complete. Paste it into your platform epic, then celebrate: Applied AI Core is complete.

🎉 Series complete

Congratulations—you finished Applied AI Core. Six tracks: LLMs, RAG, Prompts, Agents, Evaluation, and Shipping. Return to the Applied AI Core hub anytime.

End-to-end platform checklist

Reference architecture documented: Client → API GW → AI Service → LLM Gateway → providers + vector DB + KB + Redis + observability.
RAG lifecycle implements all 10 steps with async logging and eval sampling on step 10.
Multi-tenant namespaces with pgvector RLS (or contractual per-tenant index) and EU/US regional cells where required.
Semantic cache targeting 30–40% hit on repetitive workloads; prompt caching on fat system prompts.
Model tiering + Batch API + local Llama routing for simple classification tasks.
Circuit breaker + bulkhead + timeout/retry/jitter centralized in LLM gateway.
Provider health dashboard drives automatic route deprioritization.
Offline cached mode with user-visible degradation banner.
Feature flags: 5→20→100% rollout, model A/B with traced variants, global kill switch tested quarterly.
CI eval gates from Track 5 block prompt/model merges; on-call runbook links provider outage playbook.
Cost ledger per tenant/feature; weekly attribution report to finance.
Cross-track traceability table maintained—every layer maps to a guide owner.

All six tracks — quick reference

Track	Focus	Start guide	Capstone tie-in
1 — LLMs	APIs, tokens, embeddings, cost	LLMs explained	Gateway provider layer, streaming step 7
2 — RAG	Chunking, retrieval, ingestion	RAG explained	Vector DB + KB, lifecycle steps 4–6
3 — Prompts	System prompts, context, formatting	Prompt explained	Context assembly step 6, output format step 8
4 — Agents	Tools, memory, MCP, agentic RAG	Agents explained	AI service agent router, bulkhead pools
5 — Evals	Judge, RAG eval, guardrails, CI	Evaluation explained	Guardrails steps 2/8, async eval step 10
6 — Shipping	Hello Ship → architecture	Hello Ship	This guide — full platform composition

Track 6 guide sequence — complete

Guide	Topic	Status
Guide 1	Hello Ship — API to production	✓
Guide 2	LLM gateway & infrastructure	✓
Guide 3	Fine-tuning & adaptation	✓
Guide 4	Multimodal ingest & processing	✓
Guide 5	Structured data extraction & agents	✓
Guide 6	Production AI architecture patterns	✓ you are here

Full curriculum map (36 guides)

Track	Guides
Track 1	LLMs explained · APIs & tokens · Embeddings · Multimodal · Model selection
Track 2	RAG explained · Chunking · Vector DBs · Retrieval · Advanced RAG · Ingestion
Track 3	Prompt explained · Techniques · System prompts · Formatting · Context · Production prompts
Track 4	Agents explained · Tool calling · Memory · Multi-agent · MCP · Agentic RAG
Track 5	Evaluation explained · LLM-as-judge · RAG eval · Guardrails · Observability · CI eval
Track 6	Hello Ship · Gateway · Fine-tuning · Multimodal · Extraction · Production architecture

import httpx

def platform_smoke(base: str, tenant: str) -> None:
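    """Validates capstone path: readiness → RAG answer with citations → cache hit on repeat.

    Assumes /v1/chat can return a non-streaming JSON body; the kill switch needs its own drill.
    """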
    httpx.get(f"{base}/ready", timeout=5.0).raise_for_status()
    q = "What is our refund policy for EU customers?"
    r1 = httpx.post(f"{base}/v1/chat", json={"query": q, "tenant_id": tenant}, timeout=90.0)
    r1.raise_for_status()
    assert r1.json().get("citations")
    r2 = httpx.post(f"{base}/v1/chat", json={"query": q, "tenant_id": tenant}, timeout=30.0)
    r2.raise_for_status()
    assert r2.json().get("cache_hit") is True

public void platformSmoke(String base, String tenant) {
  rest.get().uri(base + "/ready").retrieve().toBodilessEntity();
  var q = "What is our refund policy for EU customers?";
  var r1 = rest.post().uri(base + "/v1/chat")
      .body(new ChatRequest(q, tenant)).retrieve().body(ChatResponse.class);
  assert r1 != null && r1.citations() != null && !r1.citations().isEmpty();
  var r2 = rest.post().uri(base + "/v1/chat")
      .body(new ChatRequest(q, tenant)).retrieve().body(ChatResponse.class);
  assert r2 != null && Boolean.TRUE.equals(r2.cacheHit());
}

🎯 Interview Tip

“You finished our AI curriculum—summarize it.” — Six tracks from inference to shipping; capstone is layered architecture with RAG lifecycle, tenant isolation, cost levers, resilience, and flags. Offer to whiteboard the mermaid stack from memory.

🔬 Under the Hood

The platform smoke test asserts cache_hit on the second identical query—that single assertion forces semantic cache, tenant scoping, and logging to work together; it catches more regressions than /health alone.

💰 Cost

Capstone complete does not mean spend-capped. Before GA, confirm per-tenant budgets, kill switch drill, and finance dashboard—the architecture enables control; flags and gateway quotas enforce it.

Where to go next

Applied AI Core hub — revisit any track or share the curriculum with your team.
sharpbyte.dev Hub — system design guides, interview prep, and future series.
Interview-ready guide — whiteboard system design with the patterns from this capstone.

FAQ: Build vs buy the platform?

Start with an off-the-shelf gateway (Portkey or LiteLLM) + pgvector + your AI service. Move to full DIY when vendor fees outgrow what running it yourself would cost, or when compliance demands cell isolation you cannot configure. The layers stay the same; only ownership changes.

FAQ: Did we cover fine-tuning?

Yes—Guide 3 covers when adaptation beats prompting. The capstone assumes most traffic routes through RAG + tiered models; fine-tuned adapters plug in at the LLM gateway alias layer.

FAQ: What's the minimum viable capstone?

Hello Ship service + LLM gateway + pgvector RLS + semantic cache + input guard + CI eval smoke + kill switch. Add regional cells and agent routers when contracts and complexity demand them—not on day one.