LLM gateway & infrastructure
Track 5 taught you to eval, guardrail, and observe. Track 6 asks: how do you ship? The first production move is an LLM gateway—one front door that owns API keys, model routing, rate limits, semantic caching, cost ledgers, and failover. Without it, every team embeds OPENAI_API_KEY in env vars, retries 429s without jitter, and finance discovers a six-figure bill three weeks late. This guide maps gateway responsibilities, compares LiteLLM Proxy, Portkey, OpenRouter, and DIY Traefik stacks, covers self-hosted vs managed tradeoffs, multi-tenant isolation on ~/hello-ship, and closes with the Track 6 gateway production checklist.
After reading, you should be able to: explain why keys must never scatter across microservices; enumerate seven gateway responsibilities (auth, routing, rate limits, semantic cache, cost tracking, retry/fallback, observability); configure LiteLLM Proxy with model lists and budgets; implement per-tenant rate-limit middleware; sketch a Redis semantic cache; define a cost ledger JSON schema; compare open-source gateways; choose self-hosted LiteLLM vs managed Bedrock/Azure OpenAI; design multi-tenant namespaces with cost allocation; and ship with the Track 6 gateway checklist.
Why an LLM gateway: stop scattering keys
An LLM gateway is the single network hop between your apps and every model vendor. It centralizes auth, routing, rate limits, cost tracking, and caching—so product teams call one stable API instead of importing vendor SDKs in twenty repos.
Picture ~/hello-ship after Guide 1: FastAPI serves /v1/chat, workers call OpenAI directly, and a cron job hits Anthropic for summarization. Keys live in three Kubernetes Secrets. When OpenAI returns 429, the API retries immediately while the worker backs off; both hammer the same exhausted key. Finance asks “which team spent $40K on GPT-4o last month?” and nobody can answer because logs lack tenant_id and model_id. That is the scatter-keys anti-pattern.
The fix is not “use LiteLLM in one service.” It is a platform contract: every inference request—chat, embeddings, structured extraction, agent tool loops—flows through the gateway. Apps pass X-Tenant-Id, X-Feature-Id, and a logical model alias (support-fast, support-quality). The gateway resolves aliases to pinned vendor model IDs, enforces budgets, writes a cost ledger row, and returns normalized usage metadata. Vendor keys never leave the gateway VPC.
Without vs with a gateway
| Concern | Scattered keys (anti-pattern) | Centralized gateway |
|---|---|---|
| Auth | Keys in 12 repos; one intern commits .env | Keys in vault; apps use gateway API keys or mTLS |
| Routing | Hard-coded gpt-4o in each service | Alias map in git; change model without redeploying apps |
| Rate limits | Per-service retry storms on 429 | Token bucket per tenant + per vendor key pool |
| Cost | Billing export ≠ app logs | Per-request ledger with tokens and $ estimate |
| Cache | Duplicate identical FAQ calls | Semantic cache at gateway; apps stay unaware |
| Failover | Outage = every team pages separately | Gateway fails over OpenAI → Azure OpenAI → Bedrock |
| Observability | Inconsistent log shapes | One trace schema → Langfuse / Datadog / Splunk |
Gateway in the hello-ship architecture
flowchart TB
subgraph clients["Application tier"]
API["hello-ship FastAPI"]
WK["Async workers"]
AG["Agent orchestrator"]
end
subgraph gateway["LLM gateway — single front door"]
AUTH["Auth + tenant context"]
RL["Rate limiter"]
RT["Model router"]
SC["Semantic cache Redis"]
CB["Cost ledger"]
FB["Retry + fallback"]
OBS["Trace + metrics"]
end
subgraph vendors["Model providers"]
OAI["OpenAI"]
ANT["Anthropic"]
BR["AWS Bedrock"]
AZ["Azure OpenAI"]
end
API --> AUTH
WK --> AUTH
AG --> AUTH
AUTH --> RL
RL --> SC
SC -->|miss| RT
SC -->|hit| CB
RT --> FB
FB --> OAI
FB --> ANT
FB --> BR
FB --> AZ
FB --> CB
CB --> OBS
In production, the gateway often sits behind your existing API gateway (Kong, Envoy, AWS API Gateway) for TLS termination and WAF. hello-ship’s FastAPI should call http://llm-gateway.internal:4000/v1/chat/completions with an OpenAI-compatible payload (the same shape you already use in Track 1), so swapping direct OpenAI for the gateway is a base URL change plus the tenant headers.
Vendor API keys belong in the gateway’s secret store only. Application pods receive gateway credentials scoped per tenant—not sk-proj-…. Rotate vendor keys without touching app deployments.
“We use LangChain, so we don’t need a gateway.” LangChain is an app framework; it does not centralize keys, tenant budgets, or org-wide failover. Wrap LangChain’s LLM client to point at the gateway base URL.
Stripe, Notion, and Perplexity-style platforms expose an internal model router service. Product teams request access via ticket; platform provisions tenant namespace, daily token cap, and allowed model aliases—same pattern at smaller scale on hello-ship.
“How would you integrate multiple LLM providers?” — Draw one gateway box. List auth, routing, rate limits, cost ledger, cache, failover. Emphasize stable app contract vs interchangeable vendors.
When you can defer the gateway
A single developer prototype with one API key and <1K requests/day can call OpenAI directly—until you add a second service or a second model. Deferral ends when any of these trigger: second team consumes LLMs, compliance asks for key audit trail, finance needs per-feature cost, or CI eval gates require consistent model pinning across staging and prod.
FAQ: Gateway vs API gateway?
Your Kong/Envoy layer handles HTTP routing, JWT auth, and WAF. The LLM gateway understands chat payloads, token usage, model aliases, and semantic cache keys. Often both exist: Kong → LLM gateway → OpenAI. Do not stuff LLM-specific logic into generic API gateway plugins unless you enjoy maintaining Lua.
Gateway responsibilities: seven cross-cutting concerns
Treat the gateway as a platform microservice with seven owned concerns—each maps to code you ship once and every app inherits.
Responsibility matrix
| # | Responsibility | Gateway does | App does NOT |
|---|---|---|---|
| 1 | Auth | Validate gateway API key / mTLS; attach tenant_id | Store OpenAI keys |
| 2 | Routing | Resolve support-fast → gpt-4o-mini; cascade tiers | Hard-code vendor model strings |
| 3 | Rate limiting | Token bucket per tenant, per key, global TPM | Spin on 429 without backoff |
| 4 | Semantic cache | Embedding similarity lookup before vendor call | Build per-feature Redis clients |
| 5 | Cost tracking | Ledger row: tokens × price table → estimated USD | Guess cost from response length |
| 6 | Retry / fallback | Jittered retry; failover chain per model alias | Ad-hoc try/catch per SDK |
| 7 | Observability | Structured trace: request_id, latency, model, cache_hit | Inconsistent log formats |
1 — Auth and request context
Gateway auth is two-layer: who is calling (service API key or workload identity) and on behalf of whom (tenant_id, user_id hash). hello-ship passes tenant from JWT claims; workers pass tenant from job metadata. Reject requests missing tenant on paid tiers—anonymous abuse burns shared quota.
2 — Model routing and aliases
Apps request logical aliases defined in gateway_models.yaml. Routing rules can include: time-of-day downgrade, tenant plan (free → mini only), content length (long docs → frontier), and cascade escalation when confidence is low.
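The alias rules above can be sketched in a few lines. This is a minimal illustration, not the gateway's actual router: the alias map, the pinned model IDs, the 8,000-token threshold, and the `RouteRequest` type are all assumptions made for the example, and only the tenant-plan and content-length rules are shown.

```python
from dataclasses import dataclass

# Hypothetical alias map; in the guide this would be loaded from gateway_models.yaml.
ALIASES = {
    "support-fast": "gpt-4o-mini-2024-07-18",
    "support-quality": "gpt-4o-2024-08-06",
}
LONG_DOC_TOKENS = 8_000  # assumed cutoff for "long docs go to frontier"


@dataclass(frozen=True)
class RouteRequest:
    alias: str
    tenant_plan: str        # "free", "enterprise", "internal"
    est_input_tokens: int


def resolve_alias(req: RouteRequest) -> str:
    """Return the pinned vendor model ID for a logical alias."""
    if req.alias not in ALIASES:
        raise KeyError(f"unknown model alias: {req.alias}")
    alias = req.alias
    if req.tenant_plan == "free":
        # Tenant plan rule: free tenants are always pinned to the mini tier.
        alias = "support-fast"
    elif req.est_input_tokens > LONG_DOC_TOKENS:
        # Content length rule: long documents escalate to the frontier tier.
        alias = "support-quality"
    return ALIASES[alias]
```

Because apps only ever send the alias, changing a pinned ID in the map changes routing for every caller without an app redeploy.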
3 — Rate limiting
Implement three limiters: global gateway TPM (protect org), per-tenant daily token budget (protect finance), and per-vendor-key RPM (protect vendor relationship). Return 429 with Retry-After—never pass vendor 429 straight through without recording which limit tripped.
"""FastAPI middleware — token bucket per tenant in Redis."""
import time

from fastapi import Request
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.responses import JSONResponse


class TenantRateLimitMiddleware(BaseHTTPMiddleware):
    """Fixed one-minute window counter plus daily token cap (a simple stand-in for a token bucket)."""

    def __init__(self, app, redis, rpm: int = 120, daily_token_cap: int = 500_000):
        super().__init__(app)
        self.redis = redis  # redis.asyncio client
        self.rpm = rpm
        self.daily_token_cap = daily_token_cap

    async def dispatch(self, request: Request, call_next):
        # Return responses directly: an HTTPException raised here would not reach
        # FastAPI's exception handlers, because middleware runs outside them.
        tenant = request.headers.get("X-Tenant-Id")
        if not tenant:
            return JSONResponse({"detail": "X-Tenant-Id required"}, status_code=401)

        now = int(time.time())
        minute_key = f"rl:{tenant}:rpm:{now // 60}"
        count = await self.redis.incr(minute_key)
        if count == 1:
            await self.redis.expire(minute_key, 120)
        if count > self.rpm:
            return JSONResponse(
                {"detail": "Tenant RPM exceeded"},
                status_code=429,
                headers={"Retry-After": str(60 - now % 60)},
            )

        # Day buckets use UTC so every gateway replica agrees on the key.
        day_key = f"rl:{tenant}:tokens:{time.strftime('%Y%m%d', time.gmtime(now))}"
        used = int(await self.redis.get(day_key) or 0)
        est = int(request.headers.get("X-Est-Tokens", "2000"))
        if used + est > self.daily_token_cap:
            return JSONResponse(
                {"detail": "Daily token budget exceeded"},
                status_code=429,
                headers={"Retry-After": str(86_400 - now % 86_400)},
            )

        response = await call_next(request)
        usage = int(response.headers.get("X-Usage-Total-Tokens", est))
        await self.redis.incrby(day_key, usage)
        await self.redis.expire(day_key, 86_400)
        return response
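import java.time.Duration;

import org.springframework.data.redis.core.ReactiveStringRedisTemplate;
import org.springframework.http.HttpStatus;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import org.springframework.web.server.WebFilter;
import org.springframework.web.server.WebFilterChain;
import reactor.core.publisher.Mono;

/** Spring WebFlux filter: per-tenant RPM limit in Redis. The daily token cap is declared but not enforced in this excerpt. */
@Component
public class TenantRateLimitFilter implements WebFilter {
    private final ReactiveStringRedisTemplate redis;
    private final int rpm = 120;
    private final long dailyTokenCap = 500_000L;

    public TenantRateLimitFilter(ReactiveStringRedisTemplate redis) {
        this.redis = redis;
    }

    @Override
    public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
        String tenant = exchange.getRequest().getHeaders().getFirst("X-Tenant-Id");
        if (tenant == null || tenant.isBlank()) {
            exchange.getResponse().setStatusCode(HttpStatus.UNAUTHORIZED);
            return exchange.getResponse().setComplete();
        }
        // Fixed one-minute window: one counter key per tenant per minute.
        long minute = System.currentTimeMillis() / 60_000;
        String minuteKey = "rl:" + tenant + ":rpm:" + minute;
        return redis.opsForValue().increment(minuteKey)
            .flatMap(count -> {
                // First hit in this window: set a TTL so the key cleans itself up.
                if (count == 1) return redis.expire(minuteKey, Duration.ofMinutes(2)).thenReturn(count);
                return Mono.just(count);
            })
            .flatMap(count -> {
                if (count > rpm) {
                    exchange.getResponse().setStatusCode(HttpStatus.TOO_MANY_REQUESTS);
                    exchange.getResponse().getHeaders().add("Retry-After", "60");
                    return exchange.getResponse().setComplete();
                }
                return chain.filter(exchange);
            });
    }
}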
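The middleware above counts requests in fixed one-minute windows, which is simpler than the token bucket named in the responsibility matrix. For comparison, here is a minimal in-memory token bucket sketch. It is illustrative only: the class name, the injectable `clock` parameter, and the per-tenant dictionary are assumptions, and a real gateway with several replicas would keep this state in Redis rather than in process memory.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows bursts up to `capacity`, refilled continuously at `refill_per_sec`."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = capacity          # start full
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.refill_per_sec)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per tenant (hypothetical helper): 120 RPM sustained, burst of 20.
_buckets: Dict[str, TokenBucket] = {}


def allow_request(tenant_id: str) -> bool:
    bucket = _buckets.setdefault(tenant_id, TokenBucket(capacity=20, refill_per_sec=2.0))
    return bucket.try_acquire()
```

Unlike a fixed window, the bucket does not let a tenant send a full minute's quota at second 59 and again at second 60.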
4 — Semantic cache
An exact-match cache keys on hash(prompt + model + temperature) and only hits on identical requests. A semantic cache embeds the user message, searches a Redis vector index for an entry with cosine similarity ≥ 0.95, and returns the cached completion on a hit. Skip the cache for tool-calling turns, streaming responses you cannot replay, or tenant tiers that disable caching for compliance.
"""Semantic cache: embed query, KNN in Redis, return cached completion."""
import hashlib
import json

import numpy as np
from redis.commands.search.query import Query

SIM_THRESHOLD = 0.95
CACHE_TTL_SEC = 3600


def _to_bytes(vec) -> bytes:
    """Pack an embedding as raw FLOAT32 bytes, the format Redis vector fields expect."""
    return np.asarray(vec, dtype=np.float32).tobytes()


async def semantic_cache_lookup(redis, embed_fn, tenant: str, model: str, user_msg: str):
    vec = await embed_fn(user_msg)  # e.g. text-embedding-3-small
    key_prefix = f"sc:{tenant}:{model}"
    # Assumes an FT index "<key_prefix>:idx" already exists over this prefix with a
    # FLOAT32 COSINE vector field named "embedding" (index creation not shown).
    query = (
        Query("*=>[KNN 3 @embedding $vec AS score]")
        .sort_by("score")
        .return_fields("score", "payload")
        .dialect(2)
    )
    hits = await redis.ft(f"{key_prefix}:idx").search(query, query_params={"vec": _to_bytes(vec)})
    # Redis returns cosine *distance* (lower is closer); similarity = 1 - distance.
    if hits.docs and 1.0 - float(hits.docs[0].score) >= SIM_THRESHOLD:
        return json.loads(hits.docs[0].payload)
    return None


async def semantic_cache_store(redis, embed_fn, tenant: str, model: str, user_msg: str, completion: dict):
    vec = await embed_fn(user_msg)
    doc_id = hashlib.sha256(f"{tenant}:{model}:{user_msg}".encode()).hexdigest()[:16]
    key = f"sc:{tenant}:{model}:{doc_id}"
    await redis.hset(key, mapping={"embedding": _to_bytes(vec), "payload": json.dumps(completion)})
    await redis.expire(key, CACHE_TTL_SEC)
/** Semantic cache service: Redis vector search + embedding client. */
// embedder, redisVector, serialize/deserialize and sha256 are application-defined helpers.
// VectorHit.score() is assumed to return cosine similarity (higher is closer).
public Optional<CachedCompletion> lookup(String tenant, String model, String userMsg) {
    float[] vec = embedder.embed(userMsg);
    String index = "sc:" + tenant + ":" + model + ":idx";
    List<VectorHit> hits = redisVector.knn(index, vec, 3);
    if (hits.isEmpty() || hits.get(0).score() < 0.95f) return Optional.empty();
    return Optional.of(deserialize(hits.get(0).payload()));
}

public void store(String tenant, String model, String userMsg, CachedCompletion completion) {
    float[] vec = embedder.embed(userMsg);
    String docId = sha256(tenant + model + userMsg).substring(0, 16);
    String key = "sc:" + tenant + ":" + model + ":" + docId;
    redis.opsForHash().putAll(key,
        Map.of("embedding", vec, "payload", serialize(completion)));
    redis.expire(key, Duration.ofHours(1));
}
5 — Cost tracking and ledger
Every gateway response appends a cost ledger event—async to Kafka, S3, or Postgres. Finance reconciles against vendor invoices using vendor_request_id when available. Price tables live in git, versioned weekly.
{
  "$schema": "https://sharpbyte.dev/schemas/cost-ledger-v1.json",
  "event_id": "550e8400-e29b-41d4-a716-446655440000",
  "timestamp": "2026-06-07T14:32:01.123Z",
  "request_id": "req_8f3a2b1c",
  "trace_id": "trace_abc123",
  "tenant_id": "acme-corp",
  "feature_id": "support-chat",
  "user_id_hash": "sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
  "model_alias": "support-fast",
  "model_resolved": "gpt-4o-mini-2024-07-18",
  "provider": "openai",
  "region": "us-east-1",
  "cache_hit": false,
  "usage": {
    "input_tokens": 1842,
    "output_tokens": 312,
    "total_tokens": 2154,
    "embedding_tokens": 0
  },
  "cost_usd": {
    "input": 0.000276,
    "output": 0.000187,
    "total": 0.000463,
    "price_table_version": "2026-06-01"
  },
  "latency_ms": {
    "gateway": 12,
    "cache_lookup": 3,
    "provider": 840,
    "total": 855
  },
  "status": "success",
  "fallback_chain": ["openai/gpt-4o-mini", "azure/gpt-4o-mini"],
  "fallback_used": false,
  "rate_limit_tier": "standard",
  "prompt_version": "support_v3.2",
  "guardrail_blocked": false
}
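The cost_usd block in the ledger event follows from tokens × price table. A minimal sketch of that arithmetic is below. The per-million-token prices are illustrative assumptions chosen to reproduce the example event, and `PRICE_TABLE` and `estimate_cost` are hypothetical names; in practice the prices would come from the versioned price table in git.

```python
from decimal import Decimal

# Assumed USD prices per 1M tokens; load the real values from the versioned price table.
PRICE_TABLE = {
    "version": "2026-06-01",
    "models": {
        "gpt-4o-mini-2024-07-18": {"input": Decimal("0.15"), "output": Decimal("0.60")},
    },
}
PER_MILLION = Decimal(1_000_000)
MICRO_USD = Decimal("0.000001")  # ledger stores six decimal places


def estimate_cost(model_resolved: str, input_tokens: int, output_tokens: int) -> dict:
    """Build the cost_usd object for one ledger event."""
    price = PRICE_TABLE["models"][model_resolved]
    cost_in = (Decimal(input_tokens) * price["input"] / PER_MILLION).quantize(MICRO_USD)
    cost_out = (Decimal(output_tokens) * price["output"] / PER_MILLION).quantize(MICRO_USD)
    return {
        "input": float(cost_in),
        "output": float(cost_out),
        # Sum the rounded parts so input + output always equals total in reports.
        "total": float(cost_in + cost_out),
        "price_table_version": PRICE_TABLE["version"],
    }
```

Decimal arithmetic avoids float drift when millions of sub-cent rows are later summed for reconciliation.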
6 — Retry and fallback
Classify errors: transient (429, 502, 503, timeout) → retry with full jitter, max 3 attempts; permanent (400, 401, invalid model) → fail fast, no retry. Fallback chain per alias: primary OpenAI → secondary Azure OpenAI (same model family) → Bedrock Claude. Log every fallback—silent failover hides primary degradation.
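The classification and fallback rules above might look like the following sketch. It is not a specific library's API: `ProviderError`, `call_with_fallback`, and the backoff constants are assumptions for illustration, and each provider is modeled as a plain zero-argument callable.

```python
import random
import time
from typing import Callable, List, Tuple

TRANSIENT_STATUS = {429, 502, 503}


class ProviderError(Exception):
    """Hypothetical error type carrying the upstream HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status


def is_transient(err: Exception) -> bool:
    if isinstance(err, TimeoutError):
        return True
    return isinstance(err, ProviderError) and err.status in TRANSIENT_STATUS


def full_jitter_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    # Full jitter: pick uniformly between 0 and the capped exponential backoff.
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_fallback(providers: List[Tuple[str, Callable[[], object]]],
                       max_attempts: int = 3,
                       sleep: Callable[[float], None] = time.sleep):
    """Return (result, provider_name, fallback_used)."""
    if not providers:
        raise ValueError("fallback chain is empty")
    last_error = None
    for index, (name, call) in enumerate(providers):
        for attempt in range(max_attempts):
            try:
                return call(), name, index > 0
            except (ProviderError, TimeoutError) as err:
                if not is_transient(err):
                    raise  # permanent error: fail fast, no retry, no fallback
                last_error = err
                if attempt < max_attempts - 1:
                    sleep(full_jitter_delay(attempt))
        # Retries exhausted on this provider; move to the next one in the chain.
    raise last_error
```

The `fallback_used` flag in the return value is what the ledger and the fallback-rate alert would consume, so failover is never silent.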
7 — Observability
Emit OpenTelemetry spans: gateway.auth, gateway.cache, gateway.provider. Propagate trace_id to hello-ship logs per Track 5 observability. Dashboard SLIs: p95 latency, error rate, cache hit rate, $/1K requests, 429 rate per tenant.
Semantic cache hit rates of 15–35% on FAQ-heavy support bots can cut inference spend by 20% or more toward the upper end of that range. Log cache_hit and saved_usd per request so FinOps can attribute savings.
Semantic cache improves cost and latency but can return stale policy answers after KB updates. Tie cache TTL to content version (kb_revision in cache key) or invalidate on ingest webhook.
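One way to tie cache entries to content version is to put the KB revision into the key, so an ingest that bumps the revision makes old entries unreachable and lets them age out by TTL. A minimal sketch for the exact-match key follows; the `kb{revision}` segment and the function name are assumptions layered on the sc:{tenant_id}:{model_alias}:{hash} convention.

```python
import hashlib
import json


def exact_cache_key(tenant_id: str, model_alias: str, messages: list,
                    temperature: float, kb_revision: int) -> str:
    """Deterministic cache key namespaced by tenant, alias, and KB revision."""
    # Canonical JSON (sorted keys, fixed separators) so equal requests hash equally.
    canonical = json.dumps(
        {"model": model_alias, "messages": messages, "temperature": temperature},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:32]
    return f"sc:{tenant_id}:{model_alias}:kb{kb_revision}:{digest}"
```

The same revision segment can be added to the semantic cache's index prefix, at the cost of a cold cache after every KB publish.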
LiteLLM Proxy implements many of these responsibilities through its config.yaml (budget manager, router, cooldowns). Custom middleware is still needed for tenant-specific ACLs and your own cost ledger schema.
Return X-Gateway-Model-Resolved and X-Usage-Total-Tokens response headers so hello-ship can log them without parsing JSON bodies.
Open-source gateways: compare and configure
You rarely need to build a gateway from scratch on day one. These four options cover most teams—from batteries-included proxy to fully DIY.
Gateway comparison
| Gateway | Hosting | Strengths | Gaps | Best for |
|---|---|---|---|---|
| LiteLLM Proxy | Self-hosted (Docker/K8s) | 100+ providers, OpenAI-compatible API, budgets, routing, admin UI | Ops burden; semantic cache needs Redis add-on | Platform teams wanting OSS control |
| Portkey | Managed + self-host option | Gateway + observability, guardrails, prompt management | Vendor lock-in on managed tier pricing | Teams wanting gateway + traces in one SKU |
| OpenRouter | Managed SaaS | One API key, many models, automatic failover | Data leaves your VPC; limited custom tenant logic | Startups / fast prototypes |
| Traefik DIY | Self-hosted | Full control, mesh with existing ingress | You build routing, cache, ledger, SDK adapters | Mature platform eng with gateway charter |
LiteLLM Proxy — hello-ship config
LiteLLM exposes an OpenAI-compatible /v1/chat/completions. Point hello-ship’s LLM_BASE_URL at the proxy. Model list, budgets, and fallbacks live in litellm_config.yaml.
# litellm_config.yaml: hello-ship gateway
model_list:
  - model_name: support-fast
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: support-quality
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: support-quality
    litellm_params:
      model: azure/gpt-4o
      api_base: https://acme.openai.azure.com
      api_key: os.environ/AZURE_OPENAI_KEY
      api_version: "2024-02-15-preview"

router_settings:
  routing_strategy: simple-shuffle
  num_retries: 2
  timeout: 60
  allowed_fails: 3
  cooldown_time: 30

litellm_settings:
  drop_params: true
  set_verbose: false
  cache: true
  cache_params:
    type: redis
    host: redis.llm-gateway.svc
    port: 6379

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL  # spend logs
# hello-ship: call gateway instead of OpenAI directly
import os

from openai import OpenAI

# tenant_id and messages come from the calling request handler.
gateway = OpenAI(
    base_url="http://llm-gateway.internal:4000/v1",
    api_key=os.environ["HELLO_SHIP_GATEWAY_KEY"],
    default_headers={
        "X-Tenant-Id": tenant_id,
        "X-Feature-Id": "support-chat",
    },
)
resp = gateway.chat.completions.create(
    model="support-fast",  # alias, not gpt-4o-mini
    messages=messages,
    temperature=0,
)
usage = resp.usage  # logged by gateway + returned here
// hello-ship: OpenAI-compatible Java client pointed at LiteLLM Proxy
OpenAIClient gateway = OpenAIOkHttpClient.builder()
    .baseUrl("http://llm-gateway.internal:4000/v1")
    .apiKey(System.getenv("HELLO_SHIP_GATEWAY_KEY"))
    .build();

ChatCompletionCreateParams params = ChatCompletionCreateParams.builder()
    .model("support-fast")
    .addUserMessage(userMessage)
    .temperature(0.0)
    .putAdditionalHeader("X-Tenant-Id", tenantId)
    .putAdditionalHeader("X-Feature-Id", "support-chat")
    .build();

ChatCompletion completion = gateway.chat().completions().create(params);
long totalTokens = completion.usage().map(u -> u.totalTokens()).orElse(0L);
Portkey — gateway headers
Portkey uses a managed gateway URL with config and metadata in headers—useful when you want traces and guardrails without running LiteLLM yourself.
curl https://api.portkey.ai/v1/chat/completions \
-H "x-portkey-api-key: $PORTKEY_API_KEY" \
-H "x-portkey-provider: openai" \
-H "x-portkey-config: pc-support-fast-abc123" \
-H "x-portkey-metadata: {\"tenant_id\":\"acme-corp\",\"feature\":\"support-chat\"}" \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"refund policy?"}]}'
OpenRouter — single key, many models
# OpenRouter: OpenAI SDK compatible; model string selects vendor
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
resp = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",  # or openai/gpt-4o-mini
    messages=messages,  # supplied by the caller
    extra_headers={
        "HTTP-Referer": "https://hello-ship.internal",
        "X-Title": "hello-ship support",
    },
)
Traefik DIY — ingress + custom upstream
Roll your own when LiteLLM abstractions fight your compliance model. Traefik terminates TLS, routes /v1/* to a small FastAPI/Spring gateway service you own—that service implements adapters, ledger, and cache. Higher build cost, zero framework magic.
# traefik/dynamic/llm-gateway.yml
http:
  routers:
    llm-gateway:
      rule: "Host(`llm-gateway.internal`) && PathPrefix(`/v1`)"
      service: llm-gateway-svc
      middlewares:
        - rate-limit-tenant
        - strip-internal-headers  # must be defined under middlewares as well (not shown)
  services:
    llm-gateway-svc:
      loadBalancer:
        servers:
          - url: "http://llm-gateway-app.llm.svc.cluster.local:8080"
  middlewares:
    rate-limit-tenant:
      rateLimit:
        average: 100
        burst: 200
        # Without sourceCriterion, Traefik buckets by client IP rather than by tenant.
        sourceCriterion:
          requestHeaderName: X-Tenant-Id
LiteLLM ships fastest; Portkey adds managed observability; OpenRouter minimizes ops but sends prompts to a third party; DIY maximizes control at 3–6 month build cost.
Running LiteLLM Proxy without master_key auth on a cluster-internal URL still risks lateral movement after a pod compromise. Require mTLS or signed service tokens even “inside” the VPC.
Many Series B companies start on OpenRouter for velocity, migrate to self-hosted LiteLLM at ~$15K/month inference spend when finance demands per-tenant attribution and VPC boundaries.
Keep hello-ship’s integration test on a mock gateway that returns fixture JSON—same OpenAI response shape. Point staging at real LiteLLM; prod at HA LiteLLM behind load balancer.
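A mock gateway for integration tests can be very small. The sketch below uses only the Python standard library and returns one fixed fixture in the OpenAI chat-completion shape; the fixture contents, `start_mock_gateway`, and `chat` are hypothetical names for illustration, and a real test suite would likely vary the fixture per test.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import ProxyHandler, Request, build_opener

FIXTURE = {
    "id": "chatcmpl-mock",
    "object": "chat.completion",
    "model": "mock-model",
    "choices": [{"index": 0, "finish_reason": "stop",
                 "message": {"role": "assistant", "content": "pong"}}],
    "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
}


class MockGateway(BaseHTTPRequestHandler):
    def do_POST(self):
        # Drain the request body so the connection stays in a clean state.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        if self.path != "/v1/chat/completions":
            self.send_error(404)
            return
        body = json.dumps(FIXTURE).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep test output quiet


def start_mock_gateway():
    """Start on a free local port; returns (server, base_url)."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), MockGateway)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://127.0.0.1:{server.server_port}/v1"


def chat(base_url: str, payload: dict) -> dict:
    req = Request(f"{base_url}/chat/completions", data=json.dumps(payload).encode(),
                  headers={"Content-Type": "application/json"}, method="POST")
    opener = build_opener(ProxyHandler({}))  # ignore any proxy env vars in CI
    with opener.open(req, timeout=5) as resp:
        return json.load(resp)
```

In a test, point hello-ship's base URL at the value returned by `start_mock_gateway()` and call `server.shutdown()` in teardown.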
Feature matrix (detail)
| Feature | LiteLLM | Portkey | OpenRouter | Traefik DIY |
|---|---|---|---|---|
| OpenAI-compatible API | ✓ | ✓ | ✓ | You build |
| Per-tenant budgets | ✓ (DB) | ✓ | Limited | You build |
| Semantic cache | Redis plugin | ✓ managed | ✗ | You build |
| Fallback routing | ✓ | ✓ | ✓ auto | You build |
| Spend dashboard | Admin UI | ✓ | Basic | You build |
| VPC-only deploy | ✓ | Self-host | ✗ | ✓ |
| Guardrails integration | Hooks | Native | ✗ | Your pipeline |
Self-hosted vs managed: where keys and data live
The gateway decision overlaps with provider choice. Self-hosted LiteLLM in your VPC vs calling managed AWS Bedrock or Azure OpenAI directly are different trust and ops tradeoffs—not mutually exclusive.
Teams often run LiteLLM Proxy in EKS with Bedrock and Azure as upstreams—self-hosted gateway, managed inference. Alternatively, Bedrock alone can be your “gateway” if you only need AWS models and IAM auth, accepting vendor lock-in on routing and cache.
Three deployment patterns
| Pattern | Gateway | Inference | Keys live in | Typical buyer |
|---|---|---|---|---|
| A — LiteLLM + multi-vendor | Self-hosted LiteLLM | OpenAI, Anthropic, Bedrock, Azure | Gateway vault | Multi-cloud product teams |
| B — Bedrock-native | Thin wrapper / SDK | Bedrock only | IAM roles | AWS-all-in enterprises |
| C — Azure OpenAI-native | APIM + Azure OAI | Azure OpenAI | Azure Key Vault | Microsoft EA customers |
LiteLLM self-hosted — pros and cons
- Pros: One OpenAI-compatible surface; swap vendors without app changes; full custom ledger schema; semantic cache in your Redis; air-gapped deploy possible.
- Cons: You operate HA proxy, Postgres for spend logs, Redis, upgrades; on-call for gateway outages blocks all LLM features.
- Cost: Infra ~$200–800/month + engineer time; inference billed per vendor separately.
AWS Bedrock — pros and cons
- Pros: IAM auth (no API keys in pods); Claude, Llama, Titan in one AWS bill; private VPC endpoints; enterprise procurement already on AWS.
- Cons: No OpenAI API shape without adapter; model lag vs OpenAI day-zero; cross-region failover is your design; limited semantic cache story.
- Cost: Per-token Bedrock pricing; no gateway SKU—but you still build routing/ledger or use LiteLLM in front.
Azure OpenAI — pros and cons
- Pros: GPT-4o under enterprise DPA; regional deploy for data residency; integrates with Azure APIM for rate limits and keys.
- Cons: Quota approval latency; not a multi-vendor router; Anthropic/Google require separate paths.
- Cost: Similar per-token to OpenAI; APIM adds per-call charge.
Decision rubric
| Requirement | Prefer | Why |
|---|---|---|
| Multi-vendor failover | LiteLLM (pattern A) | Bedrock/Azure alone are single-vendor |
| AWS-only, IAM-only auth | Bedrock (pattern B) | No API keys; Converse API unified |
| EU data residency + GPT | Azure OpenAI (pattern C) | Regional Azure OAI endpoints |
| Fastest hello-ship curriculum | LiteLLM + OpenAI mock | OpenAI-compatible; matches Track 1 code |
| FinOps per-tenant ledger | LiteLLM + custom middleware | Full schema control |
| Zero gateway ops | OpenRouter or Portkey managed | Trade VPC control for speed |
Bedrock IAM roles beat long-lived API keys for lateral-movement risk. If you use LiteLLM with OpenAI keys, store in AWS Secrets Manager / HashiCorp Vault with automatic rotation—never Kubernetes Secret literals in git.
Managed Azure OpenAI + APIM can add 5–15% cost overhead per call vs direct OpenAI, but compliance and consolidated billing often justify it. Routing eligible traffic to mini tiers via LiteLLM typically saves more than the gateway infrastructure costs.
“Bedrock vs OpenAI?” — Not either/or. Position Bedrock for AWS-native IAM and Claude/Llama; OpenAI for latest GPT and ecosystem; LiteLLM gateway unifies both behind one app contract.
Azure OpenAI uses the same model weights as OpenAI but different endpoint, api-version query param, and deployment name instead of raw model ID—LiteLLM’s azure/ prefix normalizes this for apps.
Hybrid hello-ship recommendation
For the curriculum: deploy LiteLLM Proxy in docker-compose beside hello-ship; upstream support-fast → mock server in CI, OpenAI in dev, Bedrock optional via an AWS profile (LiteLLM's aws_profile_name parameter) in staging. One client config, three environments, which mirrors how platform teams ship.
Multi-tenant gateway: namespaces, budgets, ACLs
Once two product lines share the gateway, tenant isolation is not optional. Namespace every limit, cache key, ledger row, and model permission by tenant_id.
Tenant namespace model
A tenant is a billing and policy boundary—customer org, internal business unit, or environment (staging vs prod). hello-ship uses acme-corp, acme-staging, internal-tools. Every gateway subsystem prefixes keys:
- Rate limit: rl:{tenant_id}:rpm:{minute}
- Semantic cache: sc:{tenant_id}:{model_alias}:{hash}
- Cost ledger: partition by tenant_id in Postgres or S3 prefix
- Traces: tenant_id tag on every span
Cost allocation
Finance wants showback/chargeback by tenant and feature. Aggregate ledger events daily:
| Dimension | Source field | Report use |
|---|---|---|
| Tenant | tenant_id | Per-customer invoice / internal budget |
| Feature | feature_id | Which product surface spent tokens |
| Model | model_resolved | FinOps model mix optimization |
| Environment | tenant_id suffix or tag | Exclude staging from customer bills |
| Cache savings | cache_hit + counterfactual cost | Prove cache ROI |
"""Aggregate cost ledger JSONL → reports/cost/{tenant}/daily.json"""
import json
from collections import defaultdict
from pathlib import Path


def rollup(path: Path, day: str) -> dict:
    """Sum tokens, USD, and request count per (tenant, feature) for one day (YYYY-MM-DD)."""
    totals = defaultdict(lambda: {"tokens": 0, "usd": 0.0, "requests": 0})
    for line in path.read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        ev = json.loads(line)
        if not ev["timestamp"].startswith(day):
            continue
        key = (ev["tenant_id"], ev.get("feature_id", "unknown"))
        totals[key]["tokens"] += ev["usage"]["total_tokens"]
        totals[key]["usd"] += ev["cost_usd"]["total"]
        totals[key]["requests"] += 1
    return {f"{t}:{f}": v for (t, f), v in totals.items()}
/** Roll up cost ledger JSONL by tenant + feature for a given day. */
public Map<String, TenantCostRollup> rollup(Path ledgerFile, String dayPrefix) throws IOException {
    Map<String, TenantCostRollup> totals = new HashMap<>();
    try (var lines = Files.lines(ledgerFile)) {
        for (String line : (Iterable<String>) lines::iterator) {
            CostLedgerEvent ev = mapper.readValue(line, CostLedgerEvent.class);
            if (!ev.timestamp().startsWith(dayPrefix)) continue;
            String key = ev.tenantId() + ":" + ev.featureId();
            totals.computeIfAbsent(key, k -> new TenantCostRollup())
                .add(ev.usage().totalTokens(), ev.costUsd().total());
        }
    }
    return totals;
}
Model access control
Not every tenant gets frontier models. Define tenant_policy.yaml:
tenants:
  acme-corp:
    plan: enterprise
    allowed_models: [support-fast, support-quality, embed-small]
    daily_token_cap: 2_000_000
    rpm: 300
  acme-free-trial:
    plan: free
    allowed_models: [support-fast]
    daily_token_cap: 50_000
    rpm: 30
  internal-tools:
    plan: internal
    allowed_models: ["*"]
    daily_token_cap: 10_000_000
    rpm: 600
Gateway rejects support-quality for free-trial with 403 and structured error model_not_allowed—prevents surprise frontier spend from sandbox accounts.
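The ACL check itself is a small lookup. A minimal sketch follows, with the policy inlined as a Python dict that mirrors tenant_policy.yaml. The error body shape and the `unknown_tenant` code are assumptions; only the 403 status and the model_not_allowed code come from the text above.

```python
from typing import Optional, Tuple

# Mirrors tenant_policy.yaml; a real gateway would load and cache the file.
TENANT_POLICY = {
    "acme-corp": {"allowed_models": ["support-fast", "support-quality", "embed-small"]},
    "acme-free-trial": {"allowed_models": ["support-fast"]},
    "internal-tools": {"allowed_models": ["*"]},
}


def check_model_access(tenant_id: str, model_alias: str,
                       policy: dict = TENANT_POLICY) -> Tuple[int, Optional[dict]]:
    """Return (http_status, error_body); error_body is None when access is allowed."""
    tenant = policy.get(tenant_id)
    if tenant is None:
        # Assumed behavior: unknown tenants are denied rather than given defaults.
        return 403, {"error": {"code": "unknown_tenant", "tenant_id": tenant_id}}
    allowed = tenant["allowed_models"]
    if "*" in allowed or model_alias in allowed:
        return 200, None
    return 403, {"error": {"code": "model_not_allowed",
                           "tenant_id": tenant_id, "model": model_alias}}
```

Running this check before the rate limiter and cache keeps denied requests from consuming quota or touching cached answers.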
Usage dashboards
Minimum viable dashboard panels for platform on-call:
| Panel | Query | Alert threshold |
|---|---|---|
| Spend today by tenant | sum(cost_usd.total) by tenant_id | > 120% of 7d avg |
| p95 gateway latency | histogram(latency_ms.total) | > 3s for 5m |
| 429 rate | rate(status=429) | > 5% of traffic |
| Cache hit rate | cache_hit=true / total | < 5% (misconfigured) |
| Fallback rate | fallback_used=true | > 10% (primary unhealthy) |
| Error rate by provider | status=5xx by provider | > 1% any provider |
Shared semantic cache across tenants leaks answers—a cached response for tenant A’s policy query must never return to tenant B. Always include tenant_id in cache index namespace.
SaaS AI platforms expose a “Usage” tab per workspace fed by the same ledger schema—customers self-serve before finance tickets. Internal tenants get Grafana; external get Metronome/Stripe metering integration.
Wire daily_token_cap to a soft warning at 80% (Slack) and hard block at 100% (429). Lets tenant admins upgrade before hard outage.
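The soft-warning and hard-block thresholds reduce to one comparison each. A minimal sketch, assuming the caller already knows the tenant's token usage for the day; the function name and the string return values are illustrative.

```python
def budget_action(used_tokens: int, daily_token_cap: int, warn_ratio: float = 0.8) -> str:
    """Return "allow", "warn" (notify tenant admins), or "block" (respond 429)."""
    if daily_token_cap <= 0:
        raise ValueError("daily_token_cap must be positive")
    if used_tokens >= daily_token_cap:
        return "block"
    if used_tokens >= warn_ratio * daily_token_cap:
        return "warn"
    return "allow"
```

With the free-trial cap of 50,000 tokens, the warning fires at 40,000 and the block at 50,000. To avoid repeated Slack messages, a gateway would typically record that the warning was already sent for the day.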
Strict model ACLs reduce bill shock but slow innovation—give internal-tools wildcard access and prod tenants explicit allowlists with quarterly review.
FAQ: Tenant vs API key?
One gateway API key per service (hello-ship API, worker fleet); X-Tenant-Id identifies the end customer. Avoid minting a gateway key per customer: once you have thousands of tenants, a policy table keyed by tenant_id scales better than thousands of keys.
Production: Track 6 gateway checklist
Close Guide 2 with a gateway shipping checklist—auth, routing, limits, cache, ledger, failover, observability—then continue to fine-tuning & adaptation for when routing alone is not enough.
Track 6 gateway production checklist
- No vendor API keys in application repos or hello-ship env—gateway vault only.
- All inference traffic routes through gateway; a CI grep blocks direct api.openai.com references in app code.
- Model aliases in git (gateway_models.yaml); pinned vendor IDs per environment.
- Per-tenant rate limits: RPM + daily token cap with 429 + Retry-After.
- Semantic cache namespaced by tenant_id; TTL tied to KB revision where applicable.
- Cost ledger schema v1 emitted on every request; daily rollup job per tenant.
- Fallback chain configured and tested; alerts on fallback_used > 10%.
- Structured traces with trace_id, tenant_id, model_resolved, cache_hit.
- Gateway HA: ≥2 replicas, health probe, graceful shutdown draining in-flight streams.
- Model ACL policy enforced; free-tier cannot invoke frontier aliases.
- Dashboards: spend, p95 latency, 429 rate, cache hit rate, provider errors.
- Incident runbook: gateway down → enable read-only mode / cached responses / status page.
- Pre-ship load test: 2× expected peak RPM without tripping vendor limits.
- Reconcile ledger vs vendor invoice weekly; drift > 5% triggers investigation.
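The weekly reconciliation item in the checklist comes down to a relative-difference check. A minimal sketch, assuming both totals are already aggregated in USD for the same period; the function names are illustrative.

```python
def invoice_drift(ledger_usd: float, invoice_usd: float) -> float:
    """Relative gap between the gateway ledger total and the vendor invoice."""
    if invoice_usd <= 0:
        raise ValueError("invoice_usd must be positive")
    return abs(ledger_usd - invoice_usd) / invoice_usd


def needs_investigation(ledger_usd: float, invoice_usd: float, threshold: float = 0.05) -> bool:
    # Checklist rule: drift above 5% triggers an investigation.
    return invoice_drift(ledger_usd, invoice_usd) > threshold
```

For example, a ledger total of $9,400 against a $10,000 invoice is 6% drift and would be flagged, while $9,600 (4%) would not. Common causes worth checking first are a stale price table version and traffic that bypassed the gateway.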
| Risk | Gateway control | Signal | Owner |
|---|---|---|---|
| Runaway token spend | Daily cap + model ACL | tenant 429 + spend alert | Platform / FinOps |
| Key leak in git | Gateway-only keys | secret scanner CI | Security |
| Vendor outage | Fallback chain | fallback_used rate | Platform on-call |
| Cross-tenant cache leak | Namespace isolation | security audit + unit tests | Platform |
| Stale cached policy | KB revision in cache key | faithfulness eval regression | ML + Product |
| Retry storm on 429 | Jittered backoff + key pool | provider 429 rate drop | Platform |
Pre-ship gate
| Gate | Threshold | Blocking? |
|---|---|---|
| gateway_smoke.sh (mock upstream) | 100% pass | Yes |
| No direct vendor SDK in app grep | 0 matches | Yes |
| Ledger row on every integration test call | 100% present | Yes |
| Fallback drill in staging | Primary blocked → secondary succeeds | Yes for release |
| Load test 2× peak RPM | p95 < 3s, error < 0.5% | Warn then block |
| Eval smoke via gateway | ≥ baseline pass rate | Yes |
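The "no direct vendor" gate can be a small script run in CI. The sketch below scans source files for vendor API hostnames; the hostname list, the file suffixes, and the function name are assumptions to adapt to your repos. It deliberately matches hostnames rather than SDK imports, since hello-ship still uses the OpenAI SDK pointed at the gateway.

```python
import re
from pathlib import Path
from typing import List, Tuple

# Assumed list of vendor hosts that app code must never call directly.
FORBIDDEN = re.compile(r"api\.openai\.com|api\.anthropic\.com")


def find_direct_vendor_calls(root: Path,
                             suffixes: Tuple[str, ...] = (".py", ".java", ".ts")) -> List[str]:
    """Return "path:line: text" entries for every forbidden hostname found."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        lines = path.read_text(errors="ignore").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if FORBIDDEN.search(line):
                hits.append(f"{path}:{lineno}: {line.strip()}")
    return hits
```

A CI step would call `find_direct_vendor_calls(Path("src"))` and exit non-zero when the list is not empty, which matches the gate's threshold of 0 matches.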
#!/usr/bin/env bash
# gateway_smoke.sh: run after deploy
set -euo pipefail
GW="${LLM_GATEWAY_URL:-http://localhost:4000}"
KEY="${HELLO_SHIP_GATEWAY_KEY:?}"
# Health path and response body are gateway-specific; adjust the path and the grep
# to match yours (LiteLLM Proxy, for example, also serves /health/liveliness).
curl -sf "$GW/health" | grep -q ok
# jq -e exits non-zero if the completion has no message content.
curl -sf "$GW/v1/chat/completions" \
  -H "Authorization: Bearer $KEY" \
  -H "X-Tenant-Id: smoke-test" \
  -H "Content-Type: application/json" \
  -d '{"model":"support-fast","messages":[{"role":"user","content":"ping"}]}' \
  | jq -e '.choices[0].message.content'
echo "gateway_smoke ok"
You can justify a centralized gateway, implement seven responsibilities, choose among LiteLLM/Portkey/OpenRouter/DIY, design multi-tenant isolation, and ship with the checklist above. Guide 3 covers fine-tuning & adaptation—when gateway routing and RAG are not enough to hit quality SLOs.
Gateway logs contain prompts—apply the same redaction pipeline as Track 5 guardrails before logs ship to third-party observability vendors.
Whiteboard the gateway box between apps and clouds. List seven responsibilities without naming a vendor first—shows platform thinking; then mention LiteLLM as implementation detail.
Guide 3’s fine-tuning path still calls inference through this gateway—training jobs submit to a separate fine-tune API, but serving the adapted model uses the same model_alias map.
Track 6 guide sequence
| Guide | Topic | You are here |
|---|---|---|
| Guide 1 | hello-ship explained | ← prev |
| Guide 2 | LLM gateway & infrastructure | ✓ |
| Guide 3 | Fine-tuning & adaptation | next → |
| Guide 4+ | Cost control, flags, deploy, multimodal | — |
Handoff to fine-tuning
Export these gateway fields before Guide 3 wiring:
- model_alias and model_resolved per request
- cost_usd.total aggregated per feature
- tenant_id for adaptation path experiments
- cache_hit to avoid tuning on cached traffic
- Baseline eval pass rate via gateway (same path as prod)
FAQ: Build or buy gateway?
Buy (Portkey/OpenRouter) until multi-tenant ledger + VPC + custom ACLs block you—usually $10–30K/month inference or compliance review. Build (LiteLLM self-hosted) when platform team exists and two product lines share models.
FAQ: Gateway before or after hello-ship?
Guide 1 ships hello-ship calling a mock. Guide 2 inserts the gateway—minimal hello-ship diff is base URL + headers. Do not rewrite hello-ship; reroute it.