Production AI architecture patterns
Track 1 taught inference. Track 2 taught RAG. Track 3 taught prompts. Track 4 taught agents. Track 5 taught eval and guardrails. Track 6 Guides 1–5 taught shipping mechanics—gateways, fine-tuning, multimodal ingest, structured extraction. Guide 6 is the capstone: how those pieces compose into a production AI platform you can defend in an architecture review. This guide maps the full stack from browser to vector DB, walks a 10-step RAG request lifecycle, designs multi-tenant isolation with EU/US residency, optimizes cost at scale, hardens reliability with circuit breakers and bulkheads, rolls out features safely with flags, and closes with the end-of-series checklist for Applied AI Core.
After reading, you should be able to: whiteboard an AI-native reference architecture; trace a RAG query through guardrails, cache, retrieval, and async logging; design tenant namespaces with pgvector RLS and regional data planes; apply semantic cache, prompt caching, model tiering, and Batch API for cost; implement circuit breaker, bulkhead, timeout+retry+jitter, and offline cached mode; configure gradual rollout and kill switches; and articulate how all six tracks connect in a shippable system.
AI-native architecture: the full stack
Production AI is not “FastAPI + OpenAI.” It is a layered platform: clients hit an API gateway, business logic lives in an AI service, inference flows through an LLM gateway, retrieval touches vector DB + knowledge base, hot paths use Redis cache, and every hop emits telemetry to observability. The diagram below is the reference architecture teams at scale converge on—whether they build it or buy Portkey + Pinecone + Datadog.
The mistake in v1 is collapsing layers. When the chat endpoint calls OpenAI, embeds inline, and queries Postgres in one 400-line file, you ship fast—and inherit a monolith that cannot isolate tenant blast radius, swap embedding models without redeploying UI, or fail over providers without editing application code. The AI-native split keeps product orchestration (RAG assembly, tool routing, session state) in the AI service while vendor concerns (keys, retries, semantic cache, cost ledger) stay in the LLM gateway from Guide 2.
Layer responsibilities
| Layer | Owns | Does not own | Track reference |
|---|---|---|---|
| Client | UX, streaming SSE, session tokens | API keys, retrieval logic | Guide 1 — Hello Ship |
| API Gateway | TLS, WAF, JWT/OAuth, rate limits, request IDs | Prompt assembly, model selection | Platform / Kong / AWS API GW |
| AI Service | RAG orchestration, guardrails hook, tool/agent loops | Raw vendor SDK keys | Track 4 — agentic RAG |
| LLM Gateway | Model routing, semantic cache, failover, cost ledger | Business rules per tenant SKU | Guide 2 — gateway |
| Providers | Inference, embeddings APIs | Your SLA to end users | Track 1 — APIs |
| Vector DB + KB | Chunk storage, hybrid search, ingest metadata | Chat session history (usually) | Track 2 — vector DBs |
| Redis cache | Semantic cache, session scratch, rate-limit counters | Durable audit logs | Guide 2 — semantic cache |
| Observability | Traces, metrics, eval sampling, cost dashboards | Blocking bad prompts (that's guardrails) | Track 5 — observability |
Full-stack reference diagram
flowchart TB
subgraph edge["Edge & clients"]
WEB["Web app / mobile"]
B2B["Partner API clients"]
INT["Internal tools / agents"]
end
subgraph apigw["API Gateway"]
TLS["TLS termination + WAF"]
AUTH["JWT / OAuth + tenant claims"]
RL_EDGE["Edge rate limits"]
RID["Request ID injection"]
end
subgraph aisvc["AI Service — product orchestration"]
IN_G["Input guardrails"]
RAG_O["RAG orchestrator"]
AGT["Agent / tool router"]
OUT_G["Output guardrails"]
ASM["Context assembler"]
end
subgraph llgw["LLM Gateway"]
RT["Model router + aliases"]
SC["Semantic cache Redis"]
PC["Prompt cache headers"]
FB["Retry + circuit breaker"]
LEDGER["Cost ledger"]
end
subgraph data["Data plane"]
VDB["Vector DB pgvector / Pinecone"]
KB["Knowledge base S3 + Postgres"]
RDIS["Redis — sessions + cache"]
PG["Postgres — metadata + RLS"]
end
subgraph providers["Model providers"]
OAI["OpenAI"]
ANT["Anthropic"]
BR["AWS Bedrock"]
LOCAL["Local Llama — edge tasks"]
end
subgraph obs["Observability"]
TRACE["Distributed tracing"]
MET["Metrics + SLO dashboards"]
LOG["Structured logs"]
EVAL["Online eval sampling"]
ALERT["PagerDuty / on-call"]
end
WEB --> TLS
B2B --> TLS
INT --> TLS
TLS --> AUTH
AUTH --> RL_EDGE
RL_EDGE --> RID
RID --> IN_G
IN_G --> RAG_O
IN_G --> AGT
RAG_O --> ASM
ASM --> VDB
ASM --> KB
RAG_O --> SC
AGT --> RT
ASM --> RT
SC -->|miss| RT
SC -->|hit| OUT_G
RT --> FB
FB --> OAI
FB --> ANT
FB --> BR
FB --> LOCAL
FB --> LEDGER
RT --> PC
OUT_G --> WEB
VDB --> PG
KB --> PG
aisvc --> RDIS
llgw --> RDIS
aisvc --> TRACE
llgw --> TRACE
TRACE --> MET
TRACE --> LOG
LOG --> EVAL
MET --> ALERT
Read the diagram top to bottom as request flow: each subgraph is one layer, and the edges show which layer calls which. The AI service never holds vendor API keys; it calls the gateway with X-Tenant-Id, X-Feature-Id, and a logical model alias (support-quality). Retrieval hits the vector DB with the same tenant namespace enforced at query time. Observability receives correlated spans from gateway and service via a shared trace_id propagated at the API gateway.
Cross-track capability map
| Architecture concern | Primary owner layer | Applied AI Core guide |
|---|---|---|
| Token budgeting & model aliases | LLM Gateway | Track 1 — model selection |
| Chunking & ingestion | KB + async workers | Track 2 — ingestion |
| System prompt & context assembly | AI Service | Track 3 — system prompts |
| Tool calling & agent loops | AI Service | Track 4 — tool calling |
| Input/output guardrails | AI Service (+ gateway policies) | Track 5 — guardrails |
| CI eval regression gates | Deploy pipeline | Track 5 — CI eval |
| Structured extraction pipelines | AI Service + workers | Guide 5 — extraction |
| Multimodal document ingest | KB workers | Guide 4 — multimodal |
Draw this diagram in every architecture review before debating Pinecone vs pgvector. Stakeholders align on layers first; technology choices second.
The API gateway injects traceparent (W3C) or X-Request-Id before the request reaches the AI service. Never let the browser generate trace IDs: client-controlled headers can leak PII into logs.
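A minimal sketch of that gateway-side injection, assuming a Python edge hook. `inject_trace_headers` and the set of stripped header names are hypothetical; the only real convention used is the W3C `traceparent` layout (version, 16-byte trace id, 8-byte parent id, and flags, all hex).

```python
import secrets


def new_traceparent(sampled: bool = True) -> str:
    """Build a W3C traceparent value: version-traceid-parentid-flags."""
    trace_id = secrets.token_hex(16)   # 32 hex characters
    parent_id = secrets.token_hex(8)   # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"


def inject_trace_headers(incoming: dict[str, str]) -> dict[str, str]:
    """Hypothetical gateway hook: drop client-supplied trace headers, mint fresh ones."""
    blocked = {"traceparent", "tracestate", "x-request-id"}
    clean = {k: v for k, v in incoming.items() if k.lower() not in blocked}
    clean["traceparent"] = new_traceparent()
    clean["X-Request-Id"] = secrets.token_hex(16)
    return clean
```

Stripping before minting means nothing the browser sent in those headers can reach downstream logs.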
Embedding the vector DB client inside the LLM gateway couples retrieval latency to inference scaling. Keep retrieval in the AI service; gateway stays stateless for chat completions and embeddings only.
“Design a production RAG system.” — Start with this layered diagram, then drill into tenant isolation, cache hit rates, and what happens when OpenAI 503s. Mention eval sampling in observability—interviewers notice.
RAG request lifecycle: ten production steps
A demo RAG pipeline is retrieve → prompt → answer. Production expands that into ten steps: authentication with tenant context, input guardrails, semantic cache lookup, query rewrite, hybrid retrieval, context budgeting, streamed generation, output guardrails, citation attachment with cache write-back, and async logging with eval sampling. Miss any step and you find out on incident day; knowing which step failed is the difference between a 5-minute rollback and a week-long postmortem.
Ten-step lifecycle overview
| Step | Action | Sync / async | Track tie-in |
|---|---|---|---|
| 1 | Authenticate + attach tenant context | Sync | Guide 1 |
| 2 | Input guardrails (PII, injection, policy) | Sync | Track 5 — guardrails |
| 3 | Semantic cache lookup (Redis) | Sync | Guide 2 — cache |
| 4 | Query rewrite / HyDE / decomposition | Sync (optional LLM) | Track 2 — retrieval |
| 5 | Hybrid retrieve (vector + BM25 + filters) | Sync | Track 2 — advanced RAG |
| 6 | Context assemble + token budget trim | Sync | Track 3 — context |
| 7 | LLM stream via gateway | Sync (stream) | Track 1 — streaming |
| 8 | Output guardrails + schema validation | Sync | Track 3 — formatting |
| 9 | Attach citations + cache write-back | Sync | Track 2 — RAG explained |
| 10 | Async logging, cost ledger, eval sample | Async | Track 5 — observability |
Sequence diagram: happy path + cache miss
sequenceDiagram
autonumber
participant C as Client
participant GW as API Gateway
participant AI as AI Service
participant GR as Guardrails
participant RC as Redis cache
participant VDB as Vector DB
participant LG as LLM Gateway
participant LLM as Provider
participant OBS as Observability
C->>GW: POST /v1/chat (query, session)
GW->>AI: Forward + tenant_id + trace_id
AI->>GR: Input guard (step 2)
alt blocked
GR-->>C: 400 policy violation
end
AI->>RC: Semantic cache lookup (step 3)
alt cache hit
RC-->>AI: Cached answer + citations
AI->>GR: Output guard (step 8)
AI-->>C: SSE stream cached tokens
AI--)OBS: Async log hit=true (step 10)
else cache miss
AI->>AI: Query rewrite (step 4)
AI->>VDB: Hybrid retrieve top-k (step 5)
VDB-->>AI: Chunks + scores
AI->>AI: Assemble context trim (step 6)
AI->>LG: Chat completion stream (step 7)
LG->>LLM: Stream request
loop token stream
LLM-->>LG: delta
LG-->>AI: delta
AI-->>C: SSE event
end
AI->>GR: Output guard (step 8)
AI->>RC: Cache write-back (step 9)
AI--)OBS: Async log + eval sample (step 10)
end
Orchestrator sketch
import asyncio
from typing import AsyncIterator

# Keep references to background tasks so they are not garbage-collected before they finish.
_background_tasks: set[asyncio.Task] = set()


def _spawn(coro) -> None:
    task = asyncio.create_task(coro)
    _background_tasks.add(task)
    task.add_done_callback(_background_tasks.discard)


async def handle_rag_query(ctx: RequestContext, query: str) -> AsyncIterator[str]:
    # Step 2: input guardrails
    guard_in = await input_guard.check(query, tenant_id=ctx.tenant_id)
    if guard_in.blocked:
        raise PolicyViolation(guard_in.reason)

    # Step 3: semantic cache
    cached = await semantic_cache.get(ctx.tenant_id, query, embedding_fn=embed)
    if cached:
        async for tok in stream_cached(cached):
            yield tok
        _spawn(log_async(ctx, query, cached, cache_hit=True))
        return

    # Steps 4–6: rewrite, retrieve, assemble
    rewritten = await rewrite_query(query, ctx)
    chunks = await retriever.hybrid_search(
        tenant_id=ctx.tenant_id,
        query=rewritten,
        top_k=12,
        filters=ctx.doc_filters,
    )
    messages = assemble_context(system=ctx.system_prompt, chunks=chunks, query=query)

    # Steps 7–8: stream via gateway, then output guard
    buffer: list[str] = []
    async for delta in llm_gateway.stream(messages, alias="support-quality", ctx=ctx):
        buffer.append(delta)
        yield delta
    answer = "".join(buffer)
    out = await output_guard.check(answer, citations=chunks)
    if out.redacted:
        # Tokens were already streamed; redaction here only protects the cache and logs.
        answer = out.text

    # Step 9: citations + cache write-back
    await semantic_cache.put(ctx.tenant_id, query, answer, chunks)
    yield format_citations(chunks)

    # Step 10: async logging (non-blocking)
    _spawn(log_async(ctx, query, answer, chunks=chunks, cache_hit=False))
public Flux<String> handleRagQuery(RequestContext ctx, String query) {
    return inputGuard.check(query, ctx.tenantId())
        .flatMapMany(in -> {
            if (in.blocked()) return Flux.error(new PolicyViolation(in.reason()));
            return semanticCache.get(ctx.tenantId(), query)
                .flatMapMany(cached -> streamCached(cached)
                    .doOnComplete(() -> logAsync(ctx, query, cached, true)))
                .switchIfEmpty(Flux.defer(() -> {
                    var rewritten = rewriteQuery(query, ctx);
                    var chunks = retriever.hybridSearch(ctx.tenantId(), rewritten, 12, ctx.docFilters());
                    var messages = assembleContext(ctx.systemPrompt(), chunks, query);
                    var buffer = new StringBuilder();
                    return llmGateway.stream(messages, "support-quality", ctx)
                        .doOnNext(buffer::append)
                        .concatWith(Flux.defer(() -> {
                            var answer = outputGuard.check(buffer.toString(), chunks);
                            semanticCache.put(ctx.tenantId(), query, answer.text(), chunks);
                            logAsync(ctx, query, answer.text(), false);
                            return formatCitations(chunks);
                        }));
                }));
        });
}
Run step 10 (async logging) on a dedicated queue with backpressure—never block the SSE stream on Splunk or Langfuse ingest. Users feel logging latency as “stuck” final tokens.
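One way to keep logging off the hot path is a bounded in-process queue that sheds load instead of making the stream wait. The sketch below assumes a single asyncio event loop; `LogQueue`, its `submit` and `drain` methods, and the `sink` callable are hypothetical names, not part of any logging vendor's SDK.

```python
import asyncio


class LogQueue:
    """Bounded queue: producers never wait; overflow is counted and dropped."""

    def __init__(self, maxsize: int = 1000) -> None:
        self._q: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self.dropped = 0

    def submit(self, event: dict) -> bool:
        """Called from the request path. Returns False if the event was shed."""
        try:
            self._q.put_nowait(event)
            return True
        except asyncio.QueueFull:
            self.dropped += 1
            return False

    async def drain(self, sink, batch_size: int = 50) -> int:
        """Send up to batch_size queued events to `sink` (an async callable)."""
        batch = []
        while len(batch) < batch_size and not self._q.empty():
            batch.append(self._q.get_nowait())
        if batch:
            await sink(batch)
        return len(batch)
```

A background worker would call `drain` in a loop. Exporting `dropped` as a metric makes shed events visible rather than silently lost.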
Query rewrite (step 4) costs one extra LLM call—enable only for conversational queries or when retrieval confidence < threshold. FAQ-style queries should skip rewrite and hit cache step 3 first.
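The gating rule can be written as a small predicate. This is a sketch under two assumptions: "conversational" is approximated by the session having prior turns, and retrieval confidence is the top score from a cheap first-pass retrieval. The function name and the 0.5 default are illustrative.

```python
from typing import Optional


def should_rewrite(history_turns: int,
                   first_pass_score: Optional[float],
                   min_score: float = 0.5) -> bool:
    """Decide whether to spend an extra LLM call on query rewrite."""
    if history_turns > 0:
        # Follow-up questions often depend on earlier turns ("what about EU?").
        return True
    if first_pass_score is not None and first_pass_score < min_score:
        # Retrieval looked weak, so a rewrite may recover better chunks.
        return True
    return False
```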
Output guardrails (step 8) after streaming means you may have already sent toxic tokens to the client. For high-risk domains, buffer the full response server-side or use a two-pass “draft + verify” pattern from Track 5.
Cache keys should include tenant_id, model_alias, kb_version, and embedding model id—otherwise a re-ingest serves stale answers from yesterday's corpus.
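A minimal sketch of such a key prefix, assuming string identifiers for all four fields. The `semcache:` prefix and the function name are hypothetical.

```python
import hashlib


def cache_namespace(tenant_id: str, model_alias: str,
                    kb_version: str, embedding_model: str) -> str:
    """Derive a cache key prefix that changes whenever any input changes."""
    # Join with a unit separator so "a" + "bc" cannot collide with "ab" + "c".
    raw = "\x1f".join([tenant_id, model_alias, kb_version, embedding_model])
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
    return f"semcache:{tenant_id}:{digest}"
```

Because kb_version is part of the digest, a re-ingest that bumps the version lands in a fresh namespace and old entries can simply expire.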
Multi-tenant RAG: isolation without duplicate stacks
SaaS AI products face a tension: customers demand data isolation and regional residency (EU vs US), while finance demands shared infrastructure so margin does not collapse at 500 tenants. The production pattern is logical namespaces on shared vector and object stores, enforced with pgvector RLS, gateway tenant context, and regional data planes—not 500 separate Pinecone indexes unless contractually required.
Isolation models compared
| Model | Isolation strength | Ops cost | When to use |
|---|---|---|---|
| Shared index + namespace filter | Medium (app-enforced) | Low | Early SaaS, trusted internal tenants |
| Shared pgvector + RLS | High (DB-enforced) | Medium | Default for B2B SaaS |
| Index per tenant | Very high | High | Enterprise contracts, <50 large tenants |
| Cell per region (EU/US) | Regulatory | Medium-high | GDPR, data localization clauses |
Regional data plane architecture
flowchart LR
subgraph global["Global control plane"]
CP["Tenant registry"]
FF["Feature flags"]
RT_CFG["Routing config"]
end
subgraph eu["EU data plane — Frankfurt"]
EU_GW["API Gateway EU"]
EU_AI["AI Service EU"]
EU_PG["Postgres + pgvector RLS"]
EU_S3["S3 EU bucket KB"]
EU_REDIS["Redis EU"]
end
subgraph us["US data plane — Virginia"]
US_GW["API Gateway US"]
US_AI["AI Service US"]
US_PG["Postgres + pgvector RLS"]
US_S3["S3 US bucket KB"]
US_REDIS["Redis US"]
end
CP --> EU_GW
CP --> US_GW
FF --> EU_AI
FF --> US_AI
EU_AI --> EU_PG
EU_AI --> EU_S3
EU_AI --> EU_REDIS
US_AI --> US_PG
US_AI --> US_S3
US_AI --> US_REDIS
EU_PG -->|"SET app.tenant_id"| RLS_EU["RLS policy"]
US_PG -->|"SET app.tenant_id"| RLS_US["RLS policy"]
Tenant registry stores home_region (eu-central-1 | us-east-1), data_class, and kb_namespace. The API gateway routes to the correct regional cell based on JWT claims; never trust client-supplied region headers. Treat cross-region inference (EU docs, US LLM endpoint) as a compliance violation unless the customer's DPA (or BAA, for regulated health data) explicitly allows it.
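The routing rule can be sketched as a lookup keyed only on verified claims. Everything named here is an assumption for illustration: the `TenantRecord` shape, the `CELLS` URLs, and the `resolve_cell` helper. The JWT is assumed to be verified before this function runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantRecord:
    tenant_id: str
    home_region: str
    kb_namespace: str


# Hypothetical regional cell endpoints.
CELLS = {
    "eu-central-1": "https://eu.ai.example.internal",
    "us-east-1": "https://us.ai.example.internal",
}


def resolve_cell(claims: dict, registry: dict[str, TenantRecord],
                 client_headers: dict[str, str]) -> str:
    """Pick the regional cell from the registry; client_headers is never read."""
    tenant = registry.get(claims.get("tenant_id", ""))
    if tenant is None:
        raise PermissionError("unknown tenant")
    if tenant.home_region not in CELLS:
        raise LookupError(f"no cell for region {tenant.home_region}")
    return CELLS[tenant.home_region]
```

Accepting `client_headers` and ignoring it makes the rule explicit in the signature: a spoofed region header has no code path that could honor it.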
pgvector RLS pattern
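# SQL migration (run once)
RLS_SQL = """
ALTER TABLE document_chunks ENABLE ROW LEVEL SECURITY;
-- Table owners bypass RLS unless it is forced.
ALTER TABLE document_chunks FORCE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON document_chunks
    USING (tenant_id = current_setting('app.tenant_id')::uuid);
CREATE INDEX ON document_chunks USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
"""

async def search_chunks(pool, tenant_id: str, embedding: list[float], top_k: int = 10):
    # Assumes an asyncpg pool with a pgvector codec registered on each connection
    # (for example via pgvector.asyncpg.register_vector) so the list binds to $1::vector.
    async with pool.acquire() as conn:
        # set_config(..., true) is transaction-local, so the setting and the
        # query must run inside the same transaction.
        async with conn.transaction():
            await conn.execute("SELECT set_config('app.tenant_id', $1, true)", tenant_id)
            rows = await conn.fetch(
                """
                SELECT id, content, metadata,
                       1 - (embedding <=> $1::vector) AS score
                FROM document_chunks
                ORDER BY embedding <=> $1::vector
                LIMIT $2
                """,
                embedding, top_k,
            )
    return rows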
public List<ChunkHit> searchChunks(DataSource ds, UUID tenantId, float[] embedding, int topK)
        throws SQLException {
    try (var conn = ds.getConnection()) {
        // set_config(..., true) is transaction-local: disable autocommit so the
        // setting and the query run in the same transaction.
        conn.setAutoCommit(false);
        try (var st = conn.prepareStatement("SELECT set_config('app.tenant_id', ?, true)")) {
            st.setString(1, tenantId.toString());
            st.execute();
        }
        try (var ps = conn.prepareStatement("""
                SELECT id, content, metadata,
                       1 - (embedding <=> ?::vector) AS score
                FROM document_chunks
                ORDER BY embedding <=> ?::vector
                LIMIT ?
                """)) {
            ps.setObject(1, toPgVector(embedding));
            ps.setObject(2, toPgVector(embedding));
            ps.setInt(3, topK);
            // RLS filters rows even if WHERE tenant_id is forgotten
            var hits = mapRows(ps.executeQuery());
            conn.commit();
            return hits;
        }
    }
}
Namespace metadata contract
| Field | Scope | Example | Enforced at |
|---|---|---|---|
| tenant_id | Row isolation | tnt_acme_corp | Postgres RLS + gateway JWT |
| kb_namespace | Logical KB partition | acme/hr-policies | Retriever filter + S3 prefix |
| home_region | Data residency | eu-central-1 | DNS / gateway routing |
| embedding_model | Vector compatibility | text-embedding-3-large | Ingest + query must match |
| kb_version | Cache invalidation | v2026-06-07T14:00Z | Semantic cache key |
Application-level WHERE tenant_id = ? without RLS fails the moment an intern ships a raw SQL debug endpoint. RLS is defense in depth—assume application bugs happen.
Integration tests should attempt cross-tenant retrieval with wrong JWTs and expect zero rows—not 403 from the app layer only. RLS tests belong in CI.
“How do you multi-tenant a vector DB?” — Namespace + RLS on pgvector for most cases; per-index for regulated enterprise; regional cells for GDPR. Mention embedding model pinning per tenant.
Cost at scale: four levers that actually move the needle
At 10K queries/day, cost is noise. At 10M, the same architecture bankrupts you. Production teams combine four levers: semantic cache (30–40% hit rates on support FAQ workloads), provider prompt caching (Anthropic/OpenAI cache breakpoints on system prompts), model tiering (Haiku for classify, Opus for synthesize), and Batch API for offline jobs. Local Llama extends the tiering lever downward for deterministic simple tasks: classification, PII regex assist, query routing.
Cost lever matrix
| Lever | Typical savings | Latency impact | Guide reference |
|---|---|---|---|
| Semantic cache (Redis) | 30–40% $ on FAQ/support | −200–800 ms on hit | Guide 2 — semantic cache |
| Anthropic prompt caching | ~90% on cached input tokens | Neutral | Track 1 — tokens |
| Model tiering | 50–80% vs all-frontier | +/- by route | Track 1 — model selection |
| OpenAI Batch API | 50% on batch-eligible jobs | Hours (async) | Guide 5 — extraction |
| Local Llama 3.x 8B | ~$0 marginal per 1M tokens | +50–200 ms on GPU | Guide 3 — adaptation |
Routing decision tree
flowchart TD
Q["Incoming request"] --> CACHE{"Semantic cache hit?"}
CACHE -->|yes| RET["Return cached — $0 LLM"]
CACHE -->|no| CLASS["Classify intent — local Llama / Haiku"]
CLASS --> SIMPLE{"Simple FAQ / routing?"}
SIMPLE -->|yes| SMALL["Small model — Haiku / GPT-4o-mini"]
SIMPLE -->|no| RAG{"Needs retrieval?"}
RAG -->|yes| MED["Medium + RAG — Sonnet"]
RAG -->|no| COMPLEX{"Multi-step reasoning?"}
COMPLEX -->|yes| BIG["Frontier — Opus / o3"]
COMPLEX -->|no| MED
SMALL --> BATCH{"Offline acceptable?"}
MED --> BATCH
BIG --> BATCH
BATCH -->|yes| BATCHAPI["Batch API — 50% discount"]
BATCH -->|no| LIVE["Live gateway stream"]
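The decision tree above can be sketched as a pure function. The boolean inputs are assumed to come from the cache lookup and the intent classifier; `RouteInput`, `route`, and the tier labels are hypothetical names. Both the "needs retrieval" branch and the fall-through branch land on the medium tier, as in the diagram.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteInput:
    cache_hit: bool
    simple: bool            # simple FAQ / routing intent
    needs_retrieval: bool
    multi_step: bool        # multi-step reasoning required
    offline_ok: bool        # caller can wait for a batch result


def route(req: RouteInput) -> tuple[str, str]:
    """Return (model tier, delivery mode) following the diagram's branches."""
    if req.cache_hit:
        return ("cache", "none")
    if req.simple:
        tier = "small"
    elif req.needs_retrieval:
        tier = "medium"
    elif req.multi_step:
        tier = "frontier"
    else:
        tier = "medium"
    return (tier, "batch" if req.offline_ok else "live")
```

Keeping the router free of I/O makes it easy to unit-test every branch and to replay production traffic through a proposed change before shipping it.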
Semantic cache implementation notes
Target 0.92–0.95 cosine similarity threshold for cache hits—lower thresholds save money until wrong answers spike support tickets. Include tenant id and KB version in the cache key. Log cache_hit, similarity_score, and saved_tokens for finance dashboards. Teams with 30–40% hit rates on tier-1 support typically see 25–35% total inference spend reduction after gateway overhead.
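To make the threshold concrete, here is a minimal in-memory sketch of the hit/miss decision. A real deployment would use a vector index in Redis rather than a linear scan; `lookup` and its return shape are assumptions for illustration.

```python
import math
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def lookup(query_vec: Sequence[float],
           entries: list[tuple[Sequence[float], str]],
           threshold: float = 0.93) -> tuple[Optional[str], float]:
    """Return (answer, score) on a hit, or (None, best_score) on a miss."""
    best_answer, best_score = None, -1.0
    for vec, answer in entries:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_answer, best_score = answer, score
    if best_score >= threshold:
        return best_answer, best_score
    return None, best_score
```

Returning the best score even on a miss gives the similarity_score log field a value for every request, which helps when tuning the threshold.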
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# System prompt ~8K tokens: the cache breakpoint saves ~90% on cached input tokens for repeat calls
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)
usage = response.usage
# usage.cache_creation_input_tokens, usage.cache_read_input_tokens
var systemBlock = SystemContentBlock.builder()
    .text(LONG_SYSTEM_PROMPT)
    .cacheControl(CacheControlEphemeral.builder().build())
    .build();
var response = client.messages().create(MessageCreateParams.builder()
    .model("claude-sonnet-4-20250514")
    .maxTokens(1024L)
    .systemOf(systemBlock)
    .addUserMessage(userQuery)
    .build());
// response.usage().cacheReadInputTokens()
Prompt caching shines when system prompts exceed 2K tokens and traffic is repetitive—exactly RAG with a fat policy doc. Pair with semantic cache: prompt cache saves input tokens; semantic cache skips the LLM entirely.
Run a weekly “cost attribution” report: top 10 tenants by spend, top 10 features by p95 tokens, cache hit rate by feature. Finance trusts graphs; engineering trusts row-level ledger from Guide 2.
Local Llama for classification sounds free until you provision GPU nodes 24/7. Use serverless GPU or batch classify offline unless QPS justifies reserved capacity.
Batch API jobs need idempotency keys per input row—retries after 24h window expiry should not double-charge downstream actions triggered by extraction results.
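A sketch of per-row idempotency, assuming each input row is JSON-serializable. `idempotency_key` and `ActionLedger` are hypothetical; the in-memory set stands in for a durable store such as a Postgres table with a unique constraint.

```python
import hashlib
import json
from typing import Callable


def idempotency_key(job_id: str, row: dict) -> str:
    """Stable key for one input row: same content gives the same key."""
    canonical = json.dumps(row, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{job_id}:{canonical}".encode("utf-8")).hexdigest()


class ActionLedger:
    """Runs a downstream action at most once per key (in-memory stand-in)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def run_once(self, key: str, action: Callable[[], None]) -> bool:
        if key in self._seen:
            return False  # already applied; a retried batch row is a no-op
        action()
        self._seen.add(key)
        return True
```

Marking the key only after the action succeeds means a crash mid-action is retried rather than skipped; the action itself should tolerate that retry.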
Reliability: survive provider outages gracefully
LLM providers are unreliable dependencies with opaque incidents. Production architecture treats them like payment processors: circuit breakers stop retry storms, bulkheads isolate chat from batch extraction, timeout + retry + jitter handles transient 429/503, a provider health dashboard drives automatic failover, and offline cached mode returns stale-but-safe answers when everything upstream is red.
Resilience pattern catalog
| Pattern | Problem solved | Typical config | Owner |
|---|---|---|---|
| Circuit breaker | Retry storm during outage | Open after 5 failures / 30s; half-open probe | LLM Gateway |
| Bulkhead | Batch jobs starve interactive chat | Separate connection pools + quotas | AI Service pools |
| Timeout + retry + jitter | Hung streams, thundering herd | 60s stream timeout; 3 retries; full jitter | LLM Gateway |
| Provider health dashboard | Manual failover lag | Rolling error rate > 5% → deprioritize route | Platform SRE |
| Offline cached mode | Total provider failure | Semantic cache + static FAQ fallback | AI Service feature flag |
Failover state machine
stateDiagram-v2
[*] --> Closed
Closed --> Open: error_rate > threshold\nOR 5 consecutive 5xx
Open --> HalfOpen: cooldown 60s
HalfOpen --> Closed: probe success
HalfOpen --> Open: probe failure
Closed --> Degraded: latency_p95 > SLO
Degraded --> Closed: metrics recover
Degraded --> OfflineCache: all routes unhealthy
OfflineCache --> HalfOpen: any route healthy
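The Closed, Open, and HalfOpen transitions can be sketched as a small per-route breaker. This is an in-process illustration only: it omits the Degraded and OfflineCache states, allows any number of probes while half-open, and keeps state in local dicts rather than shared storage. The class and method names are assumptions, and the clock is injectable so the cooldown can be tested without sleeping.

```python
import time
from typing import Callable


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures: dict[str, int] = {}
        self._opened_at: dict[str, float] = {}

    def state(self, route: str) -> str:
        opened = self._opened_at.get(route)
        if opened is None:
            return "closed"
        if self._clock() - opened >= self.cooldown_s:
            return "half_open"  # cooldown elapsed: let a probe through
        return "open"

    def is_open(self, route: str) -> bool:
        return self.state(route) == "open"

    def record_failure(self, route: str) -> None:
        if self.state(route) == "half_open":
            self._opened_at[route] = self._clock()  # probe failed: reopen
            return
        count = self._failures.get(route, 0) + 1
        self._failures[route] = count
        if count >= self.failure_threshold:
            self._opened_at[route] = self._clock()

    def record_success(self, route: str) -> None:
        self._failures[route] = 0
        self._opened_at.pop(route, None)  # probe or normal call succeeded: close
```

Moving the two dicts into a shared store is what turns this from a per-process breaker into the centralized one the gateway needs.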
Gateway retry with jitter
import asyncio
import random

RETRYABLE = {429, 500, 502, 503, 529}

async def call_with_resilience(route: str, fn, max_attempts: int = 4):
    if circuit.is_open(route):
        raise CircuitOpenError(route)
    for attempt in range(max_attempts):
        try:
            result = await asyncio.wait_for(fn(), timeout=60.0)
        except Exception as e:
            # Timeouts and errors without a status code are treated as 500.
            status = getattr(e, "status_code", 500)
            if status not in RETRYABLE or attempt == max_attempts - 1:
                circuit.record_failure(route)
                raise
            base = min(2 ** attempt, 32)
            delay = random.uniform(0, base)  # full jitter
            await asyncio.sleep(delay)
        else:
            # Record success before returning; code after the loop is never reached.
            circuit.record_success(route)
            return result
public <T> T callWithResilience(String route, Supplier<CompletableFuture<T>> fn) throws Exception {
    if (circuit.isOpen(route)) throw new CircuitOpenException(route);
    for (int attempt = 0; attempt < 4; attempt++) {
        try {
            var result = fn.get().get(60, TimeUnit.SECONDS);
            circuit.recordSuccess(route);
            return result;
        } catch (Exception e) {
            int status = extractStatus(e);
            if (!RETRYABLE.contains(status) || attempt == 3) {
                circuit.recordFailure(route);
                throw e;
            }
            // Full jitter: sleep a random duration in [0, min(2^attempt, 32)) seconds
            double delay = ThreadLocalRandom.current().nextDouble(0, Math.min(1 << attempt, 32));
            Thread.sleep((long) (delay * 1000));
        }
    }
    throw new IllegalStateException("unreachable");
}
Offline cached mode UX contract
When circuit state is OfflineCache, return answers from semantic cache even below normal similarity threshold (e.g. 0.88), prepend a banner: “AI temporarily limited—answer may be outdated,” and disable tool calling. Never hallucinate fresh policy in offline mode—static FAQ JSON beats confident wrong answers. Wire mode to a feature flag (next section) for instant kill-switch rollback.
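The contract can be sketched as one function that assembles the degraded response. The thresholds follow the figures in the text; the response shape, the `offline_response` name, and the exact banner wording are assumptions.

```python
from typing import Optional

OFFLINE_THRESHOLD = 0.88  # relaxed from the normal 0.92–0.95 range
OFFLINE_BANNER = "AI temporarily limited: answer may be outdated."


def offline_response(cached: Optional[tuple[str, float]],
                     static_faq: Optional[str]) -> dict:
    """cached is (answer, similarity) from the semantic cache, or None."""
    if cached is not None and cached[1] >= OFFLINE_THRESHOLD:
        text, source = cached[0], "semantic_cache"
    elif static_faq:
        text, source = static_faq, "static_faq"
    else:
        # No safe answer available: say so instead of generating one.
        text, source = "This assistant is temporarily unavailable.", "none"
    return {
        "banner": OFFLINE_BANNER,
        "text": text,
        "source": source,
        "tools_enabled": False,
    }
```

No branch calls a model, so offline mode cannot produce fresh, unverified content.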
Provider health dashboard should aggregate your error rates per route—not vendor status pages alone. Vendor green + your 40% 429 means key pool exhaustion, not “OpenAI is fine.”
Circuit breakers in the client SDK of every microservice create split-brain: one service sees the OpenAI circuit as open while another sees it as closed. Centralize breaker state in the LLM gateway, backed by Redis.
Bulkheads are as much about max_concurrent_streams per tenant as separate thread pools—one tenant's agent loop should not exhaust the shared streaming semaphore.
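A per-tenant stream cap can be sketched with a counter and an async context manager. This version fails fast when a tenant is at its limit rather than queueing, and it assumes a single event loop (no awaits occur between the check and the increment). `TenantBulkhead` and `BulkheadFull` are hypothetical names.

```python
import asyncio
from collections import defaultdict
from contextlib import asynccontextmanager


class BulkheadFull(Exception):
    """Raised when a tenant already holds its maximum number of streams."""


class TenantBulkhead:
    def __init__(self, max_concurrent_streams: int = 4) -> None:
        self._limit = max_concurrent_streams
        self._active: dict[str, int] = defaultdict(int)

    @asynccontextmanager
    async def slot(self, tenant_id: str):
        if self._active[tenant_id] >= self._limit:
            raise BulkheadFull(tenant_id)
        self._active[tenant_id] += 1
        try:
            yield
        finally:
            self._active[tenant_id] -= 1  # release even if the stream errors
```

Wrapping each streaming call in `async with bulkhead.slot(ctx.tenant_id):` lets the service turn `BulkheadFull` into a 429 for that tenant while other tenants keep streaming.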
“OpenAI is down—what happens?” — Walk circuit breaker → failover route (Bedrock/Azure) → offline cache → static degradation. Mention user-visible banner and disabled tools.
Feature flags: ship AI without betting the company
AI features are high-risk merges: a prompt change is a behavior change. Production teams wrap every new model route, RAG pipeline version, and agent tool behind feature flags—gradual rollout 5% → 20% → 100%, model A/B with eval metrics, and a kill switch that disables inference in one click. LaunchDarkly and Unleash are common; the pattern matters more than the vendor.
Flag taxonomy for AI platforms
| Flag type | Example key | Rollout strategy | Kill criteria |
|---|---|---|---|
| Release | rag-v2-hybrid-search | 5% → 20% → 100% over 2 weeks | RAG faithfulness < baseline −5% |
| Model A/B | model-route-sonnet-vs-gpt4o | 50/50 by tenant_id hash | Cost + quality pareto worse |
| Ops kill switch | ai-inference-enabled | Boolean global | Provider outage / cost spike |
| Tenant override | tenant-acme-beta-agent | Allow-list tenants | Contract pilot ends |
| Offline mode | offline-cache-only | Manual + auto trigger | Linked to circuit breaker |
Gradual rollout timeline
gantt
title RAG v2 rollout with eval gates
dateFormat YYYY-MM-DD
section Canary
5% internal + friendly tenants :a1, 2026-06-01, 3d
section Expand
20% production traffic :a2, after a1, 4d
section GA
100% with kill switch armed :a3, after a2, 7d
section Eval gates
CI golden set :milestone, 2026-06-01, 0d
Online faithfulness sample :milestone, 2026-06-05, 0d
Cost per resolved ticket check :milestone, 2026-06-10, 0d
Flag evaluation in AI service
from dataclasses import dataclass

@dataclass
class FlagContext:
    tenant_id: str
    user_id: str
    feature_id: str

def resolve_model_alias(flags, ctx: FlagContext) -> str:
    if not flags.bool_variation("ai-inference-enabled", ctx, default=True):
        raise InferenceDisabledError("Kill switch active")
    if flags.bool_variation("offline-cache-only", ctx, default=False):
        return "offline-cache"
    ab = flags.string_variation("model-route-ab", ctx, default="sonnet")
    if flags.bool_variation("rag-v2-hybrid-search", ctx, default=False):
        return f"rag-v2-{ab}"
    return f"rag-v1-{ab}"
public String resolveModelAlias(FeatureFlags flags, FlagContext ctx) {
    if (!flags.boolVariation("ai-inference-enabled", ctx, true)) {
        throw new InferenceDisabledException("Kill switch active");
    }
    if (flags.boolVariation("offline-cache-only", ctx, false)) {
        return "offline-cache";
    }
    var ab = flags.stringVariation("model-route-ab", ctx, "sonnet");
    if (flags.boolVariation("rag-v2-hybrid-search", ctx, false)) {
        return "rag-v2-" + ab;
    }
    return "rag-v1-" + ab;
}
Model A/B must log variant on every trace and join to eval scores—otherwise you cannot prove Sonnet beat GPT-4o on your retrieval corpus. Tie flag changes to deploy tickets; LaunchDarkly audit log is part of SOC2 evidence.
Hash tenant_id for rollout percentage, not user_id—B2B users within one tenant should see consistent behavior during partial rollout.
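A minimal sketch of tenant-level bucketing, assuming you compute the percentage yourself rather than relying on a flag vendor's built-in targeting. The function names are hypothetical.

```python
import hashlib


def rollout_bucket(flag_key: str, tenant_id: str) -> int:
    """Map a (flag, tenant) pair to a stable bucket in 0..99."""
    digest = hashlib.sha256(f"{flag_key}:{tenant_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def in_rollout(flag_key: str, tenant_id: str, percent: int) -> bool:
    """True if this tenant falls inside the first `percent` buckets."""
    return rollout_bucket(flag_key, tenant_id) < percent
```

Including the flag key in the hash gives each flag its own canary cohort, and because the bucket is fixed per tenant, anyone enabled at 5% stays enabled at 20% and 100%.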
Feature flags without TTL become permanent if/else spaghetti. Every AI flag gets an owner, expiry date, and cleanup ticket when GA reaches 100%.
Model A/B at 50/50 doubles frontier spend during experiments—cap experiment duration and sample size; use Track 5 offline eval before online A/B when possible.
Production: Applied AI Core capstone checklist
You have crossed six tracks and thirty-six guides. This checklist is what “production AI platform ready” means when an architect signs off—not demo complete. Paste it into your platform epic, then celebrate: Applied AI Core is complete.
Congratulations—you finished Applied AI Core. Six tracks: LLMs, RAG, Prompts, Agents, Evaluation, and Shipping. Return to the Applied AI Core hub anytime.
End-to-end platform checklist
- Reference architecture documented: Client → API GW → AI Service → LLM Gateway → providers + vector DB + KB + Redis + observability.
- RAG lifecycle implements all 10 steps with async logging and eval sampling on step 10.
- Multi-tenant namespaces with pgvector RLS (or contractual per-tenant index) and EU/US regional cells where required.
- Semantic cache targeting 30–40% hit on repetitive workloads; prompt caching on fat system prompts.
- Model tiering + Batch API + local Llama routing for simple classification tasks.
- Circuit breaker + bulkhead + timeout/retry/jitter centralized in LLM gateway.
- Provider health dashboard drives automatic route deprioritization.
- Offline cached mode with user-visible degradation banner.
- Feature flags: 5→20→100% rollout, model A/B with traced variants, global kill switch tested quarterly.
- CI eval gates from Track 5 block prompt/model merges; on-call runbook links provider outage playbook.
- Cost ledger per tenant/feature; weekly attribution report to finance.
- Cross-track traceability table maintained—every layer maps to a guide owner.
All six tracks — quick reference
| Track | Focus | Start guide | Capstone tie-in |
|---|---|---|---|
| 1 — LLMs | APIs, tokens, embeddings, cost | LLMs explained | Gateway provider layer, streaming step 7 |
| 2 — RAG | Chunking, retrieval, ingestion | RAG explained | Vector DB + KB, lifecycle steps 4–6 |
| 3 — Prompts | System prompts, context, formatting | Prompt explained | Context assembly step 6, output format step 8 |
| 4 — Agents | Tools, memory, MCP, agentic RAG | Agents explained | AI service agent router, bulkhead pools |
| 5 — Evals | Judge, RAG eval, guardrails, CI | Evaluation explained | Guardrails steps 2/8, async eval step 10 |
| 6 — Shipping | Hello Ship → architecture | Hello Ship | This guide — full platform composition |
Track 6 guide sequence — complete
| Guide | Topic | Status |
|---|---|---|
| Guide 1 | Hello Ship — API to production | ✓ |
| Guide 2 | LLM gateway & infrastructure | ✓ |
| Guide 3 | Fine-tuning & adaptation | ✓ |
| Guide 4 | Multimodal ingest & processing | ✓ |
| Guide 5 | Structured data extraction & agents | ✓ |
| Guide 6 | Production AI architecture patterns | ✓ you are here |
Full curriculum map (36 guides)
| Track | Guides |
|---|---|
| Track 1 | LLMs explained · APIs & tokens · Embeddings · Multimodal · Model selection |
| Track 2 | RAG explained · Chunking · Vector DBs · Retrieval · Advanced RAG · Ingestion |
| Track 3 | Prompt explained · Techniques · System prompts · Formatting · Context · Production prompts |
| Track 4 | Agents explained · Tool calling · Memory · Multi-agent · MCP · Agentic RAG |
| Track 5 | Evaluation explained · LLM-as-judge · RAG eval · Guardrails · Observability · CI eval |
| Track 6 | Hello Ship · Gateway · Fine-tuning · Multimodal · Extraction · Production architecture |
import httpx

def platform_smoke(base: str, tenant: str) -> None:
    """Validates capstone path: health → RAG → cache hit → kill switch off."""
    httpx.get(f"{base}/ready", timeout=5.0).raise_for_status()
    q = "What is our refund policy for EU customers?"
    r1 = httpx.post(f"{base}/v1/chat", json={"query": q, "tenant_id": tenant}, timeout=90.0)
    r1.raise_for_status()
    assert r1.json().get("citations")
    r2 = httpx.post(f"{base}/v1/chat", json={"query": q, "tenant_id": tenant}, timeout=30.0)
    r2.raise_for_status()
    assert r2.json().get("cache_hit") is True
public void platformSmoke(String base, String tenant) {
    rest.get().uri(base + "/ready").retrieve().toBodilessEntity();
    var q = "What is our refund policy for EU customers?";
    var r1 = rest.post().uri(base + "/v1/chat")
        .body(new ChatRequest(q, tenant)).retrieve().body(ChatResponse.class);
    assert r1 != null && r1.citations() != null && !r1.citations().isEmpty();
    var r2 = rest.post().uri(base + "/v1/chat")
        .body(new ChatRequest(q, tenant)).retrieve().body(ChatResponse.class);
    assert r2 != null && Boolean.TRUE.equals(r2.cacheHit());
}
“You finished our AI curriculum—summarize it.” — Six tracks from inference to shipping; capstone is layered architecture with RAG lifecycle, tenant isolation, cost levers, resilience, and flags. Offer to whiteboard the mermaid stack from memory.
The platform smoke test asserts cache_hit on the second identical query—that single assertion forces semantic cache, tenant scoping, and logging to work together; it catches more regressions than /health alone.
Capstone complete does not mean spend-capped. Before GA, confirm per-tenant budgets, kill switch drill, and finance dashboard—the architecture enables control; flags and gateway quotas enforce it.
Where to go next
- Applied AI Core hub — revisit any track or share the curriculum with your team.
- sharpbyte.dev Hub — system design guides, interview prep, and future series.
- Interview-ready guide — whiteboard system design with the patterns from this capstone.
FAQ: Build vs buy the platform?
Start with managed gateway (Portkey/LiteLLM) + pgvector + your AI service. Extract to full DIY when spend exceeds vendor fees or compliance demands cell isolation you cannot configure. The layers stay the same—only ownership changes.
FAQ: Did we cover fine-tuning?
Yes—Guide 3 covers when adaptation beats prompting. The capstone assumes most traffic routes through RAG + tiered models; fine-tuned adapters plug in at the LLM gateway alias layer.
FAQ: What's the minimum viable capstone?
Hello Ship service + LLM gateway + pgvector RLS + semantic cache + input guard + CI eval smoke + kill switch. Add regional cells and agent routers when contracts and complexity demand them—not on day one.