
LLM gateway & infrastructure

Track 5 taught you to eval, guardrail, and observe. Track 6 asks: how do you ship? The first production move is an LLM gateway—one front door that owns API keys, model routing, rate limits, semantic caching, cost ledgers, and failover. Without it, every team embeds OPENAI_API_KEY in env vars, retries 429s without jitter, and finance discovers a six-figure bill three weeks late. This guide maps gateway responsibilities, compares LiteLLM Proxy, Portkey, OpenRouter, and DIY Traefik stacks, covers self-hosted vs managed tradeoffs, multi-tenant isolation on ~/hello-ship, and closes with the Track 6 gateway production checklist.

After reading, you should be able to: explain why keys must never scatter across microservices; enumerate seven gateway responsibilities (auth, routing, rate limits, semantic cache, cost tracking, retry/fallback, observability); configure LiteLLM Proxy with model lists and budgets; implement per-tenant rate-limit middleware; sketch a Redis semantic cache; define a cost ledger JSON schema; compare open-source gateways; choose self-hosted LiteLLM vs managed Bedrock/Azure OpenAI; design multi-tenant namespaces with cost allocation; and ship with the Track 6 gateway checklist.


Why an LLM gateway: stop scattering keys

An LLM gateway is the single network hop between your apps and every model vendor. It centralizes auth, routing, rate limits, cost tracking, and caching—so product teams call one stable API instead of importing vendor SDKs in twenty repos.

Picture ~/hello-ship after Guide 1: FastAPI serves /v1/chat, workers call OpenAI directly, and a cron job hits Anthropic for summarization. Keys live in three Kubernetes Secrets. When OpenAI returns 429, the API retries immediately while the worker backs off—both hammer the same exhausted key. Finance asks “which team spent $40K on GPT-4o last month?” and nobody can answer because logs lack tenant_id and model_id. That is the scatter-keys anti-pattern.

The fix is not “use LiteLLM in one service.” It is a platform contract: every inference request—chat, embeddings, structured extraction, agent tool loops—flows through the gateway. Apps pass X-Tenant-Id, X-Feature-Id, and a logical model alias (support-fast, support-quality). The gateway resolves aliases to pinned vendor model IDs, enforces budgets, writes a cost ledger row, and returns normalized usage metadata. Vendor keys never leave the gateway VPC.

Without vs with a gateway

| Concern | Scattered keys (anti-pattern) | Centralized gateway |
|---|---|---|
| Auth | Keys in 12 repos; one intern commits .env | Keys in vault; apps use gateway API keys or mTLS |
| Routing | Hard-coded gpt-4o in each service | Alias map in git; change model without redeploying apps |
| Rate limits | Per-service retry storms on 429 | Token bucket per tenant + per vendor key pool |
| Cost | Billing export ≠ app logs | Per-request ledger with tokens and $ estimate |
| Cache | Duplicate identical FAQ calls | Semantic cache at gateway; apps stay unaware |
| Failover | Outage = every team pages separately | Gateway fails over OpenAI → Azure OpenAI → Bedrock |
| Observability | Inconsistent log shapes | One trace schema → Langfuse / Datadog / Splunk |

Gateway in the hello-ship architecture

flowchart TB
  subgraph clients["Application tier"]
    API["hello-ship FastAPI"]
    WK["Async workers"]
    AG["Agent orchestrator"]
  end

  subgraph gateway["LLM gateway — single front door"]
    AUTH["Auth + tenant context"]
    RL["Rate limiter"]
    RT["Model router"]
    SC["Semantic cache Redis"]
    CB["Cost ledger"]
    FB["Retry + fallback"]
    OBS["Trace + metrics"]
  end

  subgraph vendors["Model providers"]
    OAI["OpenAI"]
    ANT["Anthropic"]
    BR["AWS Bedrock"]
    AZ["Azure OpenAI"]
  end

  API --> AUTH
  WK --> AUTH
  AG --> AUTH
  AUTH --> RL
  RL --> SC
  SC -->|miss| RT
  SC -->|hit| CB
  RT --> FB
  FB --> OAI
  FB --> ANT
  FB --> BR
  FB --> AZ
  FB --> CB
  CB --> OBS

In production, the gateway often sits behind your existing API gateway (Kong, Envoy, AWS API Gateway) for TLS termination and WAF. hello-ship’s FastAPI should call http://llm-gateway.internal/v1/chat/completions with an OpenAI-compatible payload—the same shape you already use in Track 1—so swapping direct OpenAI for the gateway is a one-line base URL change.

🔒 Security

Vendor API keys belong in the gateway’s secret store only. Application pods receive gateway credentials scoped per tenant—not sk-proj-…. Rotate vendor keys without touching app deployments.

⚠️ Pitfall

“We use LangChain, so we don’t need a gateway.” LangChain is an app framework; it does not centralize keys, tenant budgets, or org-wide failover. Wrap LangChain’s LLM client to point at the gateway base URL.

🏢 Real World

Stripe, Notion, and Perplexity-style platforms expose an internal model router service. Product teams request access via ticket; platform provisions tenant namespace, daily token cap, and allowed model aliases—same pattern at smaller scale on hello-ship.

🎯 Interview Tip

“How would you integrate multiple LLM providers?” — Draw one gateway box. List auth, routing, rate limits, cost ledger, cache, failover. Emphasize stable app contract vs interchangeable vendors.

When you can defer the gateway

A single developer prototype with one API key and <1K requests/day can call OpenAI directly—until you add a second service or a second model. Deferral ends when any of these trigger: second team consumes LLMs, compliance asks for key audit trail, finance needs per-feature cost, or CI eval gates require consistent model pinning across staging and prod.

FAQ: Gateway vs API gateway?

Your Kong/Envoy layer handles HTTP routing, JWT auth, and WAF. The LLM gateway understands chat payloads, token usage, model aliases, and semantic cache keys. Often both exist: Kong → LLM gateway → OpenAI. Do not stuff LLM-specific logic into generic API gateway plugins unless you enjoy maintaining Lua.

Gateway responsibilities: seven cross-cutting concerns

Treat the gateway as a platform microservice with seven owned concerns—each maps to code you ship once and every app inherits.

Responsibility matrix

| # | Responsibility | Gateway does | App does NOT |
|---|---|---|---|
| 1 | Auth | Validate gateway API key / mTLS; attach tenant_id | Store OpenAI keys |
| 2 | Routing | Resolve support-fast → gpt-4o-mini; cascade tiers | Hard-code vendor model strings |
| 3 | Rate limiting | Token bucket per tenant, per key, global TPM | Spin on 429 without backoff |
| 4 | Semantic cache | Embedding similarity lookup before vendor call | Build per-feature Redis clients |
| 5 | Cost tracking | Ledger row: tokens × price table → estimated USD | Guess cost from response length |
| 6 | Retry / fallback | Jittered retry; failover chain per model alias | Ad-hoc try/catch per SDK |
| 7 | Observability | Structured trace: request_id, latency, model, cache_hit | Inconsistent log formats |

1 — Auth and request context

Gateway auth is two-layer: who is calling (service API key or workload identity) and on behalf of whom (tenant_id, user_id hash). hello-ship passes tenant from JWT claims; workers pass tenant from job metadata. Reject requests missing tenant on paid tiers—anonymous abuse burns shared quota.

2 — Model routing and aliases

Apps request logical aliases defined in gateway_models.yaml. Routing rules can include: time-of-day downgrade, tenant plan (free → mini only), content length (long docs → frontier), and cascade escalation when confidence is low.
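The alias resolution described above can be sketched as a small lookup plus two of the routing rules. This is a minimal sketch, not LiteLLM's router: the inlined alias table stands in for gateway_models.yaml, and `RouteRequest`, `resolve_model`, and the 8,000-token cutoff are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical alias table; a real gateway would load this from gateway_models.yaml.
ALIASES = {
    "support-fast": "gpt-4o-mini-2024-07-18",
    "support-quality": "gpt-4o",
}

LONG_DOC_TOKENS = 8_000  # assumed cutoff for the "long docs -> frontier" rule


@dataclass(frozen=True)
class RouteRequest:
    alias: str
    tenant_plan: str       # e.g. "free", "enterprise"
    est_input_tokens: int


def resolve_model(req: RouteRequest) -> str:
    """Map a logical alias to a pinned vendor model ID, applying routing rules."""
    if req.alias not in ALIASES:
        raise KeyError(f"unknown model alias: {req.alias}")
    alias = req.alias
    # Content-length rule: long inputs escalate to the quality tier.
    if req.est_input_tokens > LONG_DOC_TOKENS:
        alias = "support-quality"
    # Tenant-plan rule is applied last: free tenants always stay on the mini tier.
    if req.tenant_plan == "free":
        alias = "support-fast"
    return ALIASES[alias]
```

Applying the plan rule last means a free tenant can never be escalated by another rule, which keeps the budget guarantee simple to reason about.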

3 — Rate limiting

Implement three limiters: global gateway TPM (protect org), per-tenant daily token budget (protect finance), and per-vendor-key RPM (protect vendor relationship). Return 429 with Retry-After—never pass vendor 429 straight through without recording which limit tripped.

Per-tenant rate limit middleware (hello-ship gateway)
"""FastAPI middleware: per-tenant fixed-window RPM and daily token cap in Redis."""
import time
from fastapi import Request
from fastapi.responses import JSONResponse
from starlette.middleware.base import BaseHTTPMiddleware

def _reject(status: int, detail: str, retry_after: int = 0) -> JSONResponse:
    # Exceptions raised inside BaseHTTPMiddleware bypass FastAPI's HTTPException
    # handlers and surface as 500s, so build the error response directly.
    headers = {"Retry-After": str(retry_after)} if retry_after else None
    return JSONResponse({"detail": detail}, status_code=status, headers=headers)

class TenantRateLimitMiddleware(BaseHTTPMiddleware):
    def __init__(self, app, redis, rpm: int = 120, daily_token_cap: int = 500_000):
        super().__init__(app)
        self.redis = redis
        self.rpm = rpm
        self.daily_token_cap = daily_token_cap

    async def dispatch(self, request: Request, call_next):
        tenant = request.headers.get("X-Tenant-Id")
        if not tenant:
            return _reject(401, "X-Tenant-Id required")

        # Fixed one-minute window: one counter per tenant per minute.
        now = int(time.time())
        minute_key = f"rl:{tenant}:rpm:{now // 60}"
        count = await self.redis.incr(minute_key)
        if count == 1:
            await self.redis.expire(minute_key, 120)
        if count > self.rpm:
            return _reject(429, "Tenant RPM exceeded", retry_after=60 - now % 60)

        # Daily budget keyed on the UTC date so every replica agrees on the rollover.
        day_key = f"rl:{tenant}:tokens:{time.strftime('%Y%m%d', time.gmtime(now))}"
        used = int(await self.redis.get(day_key) or 0)
        est = int(request.headers.get("X-Est-Tokens", "2000"))
        if used + est > self.daily_token_cap:
            return _reject(429, "Daily token budget exceeded", retry_after=86_400 - now % 86_400)

        response = await call_next(request)
        # Charge actual usage when the handler reports it, otherwise the estimate.
        usage = int(response.headers.get("X-Usage-Total-Tokens", est))
        await self.redis.incrby(day_key, usage)
        await self.redis.expire(day_key, 86_400)
        return response

/** Spring WebFlux filter: per-tenant RPM limit in Redis. The daily token cap check mirrors the Python version and is omitted here. */
@Component
public class TenantRateLimitFilter implements WebFilter {
  private final ReactiveStringRedisTemplate redis;
  private final int rpm = 120;
  private final long dailyTokenCap = 500_000L;

  // Constructor injection: the final redis field must be assigned.
  public TenantRateLimitFilter(ReactiveStringRedisTemplate redis) {
    this.redis = redis;
  }

  @Override
  public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
    String tenant = exchange.getRequest().getHeaders().getFirst("X-Tenant-Id");
    if (tenant == null || tenant.isBlank()) {
      exchange.getResponse().setStatusCode(HttpStatus.UNAUTHORIZED);
      return exchange.getResponse().setComplete();
    }
    long minute = System.currentTimeMillis() / 60_000;
    String minuteKey = "rl:" + tenant + ":rpm:" + minute;
    return redis.opsForValue().increment(minuteKey)
      .flatMap(count -> {
        if (count == 1) return redis.expire(minuteKey, Duration.ofMinutes(2)).thenReturn(count);
        return Mono.just(count);
      })
      .flatMap(count -> {
        if (count > rpm) {
          exchange.getResponse().setStatusCode(HttpStatus.TOO_MANY_REQUESTS);
          exchange.getResponse().getHeaders().add("Retry-After", "60");
          return exchange.getResponse().setComplete();
        }
        return chain.filter(exchange);
      });
  }
}

4 — Semantic cache

An exact cache keys on hash(prompt + model + temperature). A semantic cache embeds the user message, searches a Redis vector index for entries with cosine similarity ≥ 0.95, and returns the cached completion on a hit. Skip the cache for tool-calling turns, streaming responses you cannot replay, or tenant tiers that disable caching for compliance.
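The exact-match key can be sketched with the standard library alone. This is one possible construction, assuming the request is serialized as canonical JSON before hashing; the `exact_cache_key` name and the `ec:` prefix are hypothetical, and the tenant prefix follows the namespacing rule used throughout this guide.

```python
import hashlib
import json


def exact_cache_key(tenant: str, model: str, messages: list[dict], temperature: float) -> str:
    """Build a deterministic key for the exact-match cache."""
    # Canonical JSON (sorted keys, fixed separators) so equal requests hash equally
    # regardless of dict ordering in the caller.
    canonical = json.dumps(
        {"model": model, "messages": messages, "temperature": temperature},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Tenant prefix keeps one tenant's entries out of another tenant's lookups.
    return f"ec:{tenant}:{digest}"
```

Checking this key first is cheap: a hit avoids both the embedding call and the vector search that the semantic path needs.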

Semantic cache sketch — Redis + embeddings
"""Semantic cache: embed query, KNN in Redis, return cached completion."""
import hashlib
import json

import numpy as np
from redis.commands.search.query import Query

SIM_THRESHOLD = 0.95  # minimum cosine similarity for a hit
CACHE_TTL_SEC = 3600

def _to_bytes(vec) -> bytes:
    # Redis vector fields store raw float32 bytes, not Python lists.
    return np.asarray(vec, dtype=np.float32).tobytes()

async def semantic_cache_lookup(redis, embed_fn, tenant: str, model: str, user_msg: str):
    vec = await embed_fn(user_msg)  # e.g. text-embedding-3-small
    key_prefix = f"sc:{tenant}:{model}"
    # Sketch via redis-py search commands; assumes a COSINE vector index
    # named "<prefix>:idx" over hashes stored under the same prefix.
    query = (
        Query("*=>[KNN 3 @embedding $vec AS dist]")
        .sort_by("dist")
        .return_fields("payload", "dist")
        .dialect(2)
    )
    hits = await redis.ft(f"{key_prefix}:idx").search(query, query_params={"vec": _to_bytes(vec)})
    # KNN yields cosine distance (0 = identical), so similarity = 1 - distance.
    if hits.docs and 1.0 - float(hits.docs[0].dist) >= SIM_THRESHOLD:
        return json.loads(hits.docs[0].payload)
    return None

async def semantic_cache_store(redis, embed_fn, tenant: str, model: str, user_msg: str, completion: dict):
    vec = await embed_fn(user_msg)
    doc_id = hashlib.sha256(f"{tenant}:{model}:{user_msg}".encode()).hexdigest()[:16]
    key = f"sc:{tenant}:{model}:{doc_id}"
    await redis.hset(key, mapping={"embedding": _to_bytes(vec), "payload": json.dumps(completion)})
    await redis.expire(key, CACHE_TTL_SEC)

/** Semantic cache service: Redis vector search + embedding client. */
public Optional<CachedCompletion> lookup(String tenant, String model, String userMsg) {
  float[] vec = embedder.embed(userMsg);
  String index = "sc:" + tenant + ":" + model + ":idx";
  List<VectorHit> hits = redisVector.knn(index, vec, 3);
  if (hits.isEmpty() || hits.get(0).score() < 0.95f) return Optional.empty();
  return Optional.of(deserialize(hits.get(0).payload()));
}

public void store(String tenant, String model, String userMsg, CachedCompletion completion) {
  float[] vec = embedder.embed(userMsg);
  String docId = sha256(tenant + model + userMsg).substring(0, 16);
  redis.opsForHash().putAll("sc:" + tenant + ":" + model + ":" + docId,
      Map.of("embedding", vec, "payload", serialize(completion)));
  redis.expire("sc:" + tenant + ":" + model + ":" + docId, Duration.ofHours(1));
}

5 — Cost tracking and ledger

Every gateway response appends a cost ledger event—async to Kafka, S3, or Postgres. Finance reconciles against vendor invoices using vendor_request_id when available. Price tables live in git, versioned weekly.

Cost ledger JSON schema (per inference request)
{
  "$schema": "https://sharpbyte.dev/schemas/cost-ledger-v1.json",
  "event_id": "550e8400-e29b-41d4-a716-446655440000",
  "timestamp": "2026-06-07T14:32:01.123Z",
  "request_id": "req_8f3a2b1c",
  "trace_id": "trace_abc123",
  "tenant_id": "acme-corp",
  "feature_id": "support-chat",
  "user_id_hash": "sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
  "model_alias": "support-fast",
  "model_resolved": "gpt-4o-mini-2024-07-18",
  "provider": "openai",
  "region": "us-east-1",
  "cache_hit": false,
  "usage": {
    "input_tokens": 1842,
    "output_tokens": 312,
    "total_tokens": 2154,
    "embedding_tokens": 0
  },
  "cost_usd": {
    "input": 0.000276,
    "output": 0.000187,
    "total": 0.000463,
    "price_table_version": "2026-06-01"
  },
  "latency_ms": {
    "gateway": 12,
    "cache_lookup": 3,
    "provider": 840,
    "total": 855
  },
  "status": "success",
  "fallback_chain": ["openai/gpt-4o-mini", "azure/gpt-4o-mini"],
  "fallback_used": false,
  "rate_limit_tier": "standard",
  "prompt_version": "support_v3.2",
  "guardrail_blocked": false
}
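The cost_usd object in the sample row is just tokens × price table. The sketch below shows that arithmetic with `Decimal` to avoid float drift. The rates in `PRICE_TABLE` are illustrative values chosen so the result reproduces the sample row; real rates belong in the versioned price table in git, and `estimate_cost` is a hypothetical helper name.

```python
from decimal import Decimal, ROUND_HALF_UP

# Illustrative price table in USD per 1M tokens (assumption, not a live price list).
PRICE_TABLE = {
    "version": "2026-06-01",
    "gpt-4o-mini-2024-07-18": {"input": Decimal("0.15"), "output": Decimal("0.60")},
}

MICRO_USD = Decimal("0.000001")   # ledger precision: six decimal places
PER_MILLION = Decimal(1_000_000)


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> dict:
    """Return the cost_usd object for one ledger row."""
    rates = PRICE_TABLE[model]
    cost_in = (Decimal(input_tokens) * rates["input"] / PER_MILLION).quantize(
        MICRO_USD, rounding=ROUND_HALF_UP
    )
    cost_out = (Decimal(output_tokens) * rates["output"] / PER_MILLION).quantize(
        MICRO_USD, rounding=ROUND_HALF_UP
    )
    return {
        "input": float(cost_in),
        "output": float(cost_out),
        # Total is the sum of the rounded parts, so every row adds up exactly.
        "total": float(cost_in + cost_out),
        "price_table_version": PRICE_TABLE["version"],
    }
```

Calling `estimate_cost("gpt-4o-mini-2024-07-18", 1842, 312)` yields input 0.000276, output 0.000187, and total 0.000463, matching the sample row. Stamping `price_table_version` on each row lets finance recompute historical spend when rates change.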

6 — Retry and fallback

Classify errors: transient (429, 502, 503, timeout) → retry with full jitter, max 3 attempts; permanent (400, 401, invalid model) → fail fast, no retry. Fallback chain per alias: primary OpenAI → secondary Azure OpenAI (same model family) → Bedrock Claude. Log every fallback—silent failover hides primary degradation.
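The classification and fallback rules above can be sketched as one loop. This is a simplified illustration, not LiteLLM's router: `ProviderError` and `call_with_fallback` are hypothetical names, each provider is modeled as a zero-argument callable, and 504 is included in the transient set as a stand-in for an upstream timeout.

```python
import random
import time
from typing import Callable, Sequence

TRANSIENT = {429, 502, 503, 504}  # 504 stands in for an upstream timeout (assumption)


class ProviderError(Exception):
    """Hypothetical error type carrying the upstream HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status


def full_jitter_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Full jitter: uniform random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_fallback(
    chain: Sequence[tuple[str, Callable[[], dict]]],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[dict, list[str]]:
    """Try providers in order; retry transient errors, fail fast on permanent ones.

    Returns the response plus the providers tried, so a fallback is never silent.
    """
    tried: list[str] = []
    for name, call in chain:
        tried.append(name)
        for attempt in range(max_attempts):
            try:
                return call(), tried
            except ProviderError as err:
                if err.status not in TRANSIENT:
                    raise  # permanent (400, 401, invalid model): no retry, no fallback
                if attempt < max_attempts - 1:
                    sleep(full_jitter_delay(attempt))
        # Retries exhausted on this provider: move to the next link in the chain.
    raise ProviderError(503)
```

The returned `tried` list maps directly onto the ledger: more than one entry means `fallback_used` is true. Injecting `sleep` keeps the function testable without real delays.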

7 — Observability

Emit OpenTelemetry spans: gateway.auth, gateway.cache, gateway.provider. Propagate trace_id to hello-ship logs per Track 5 observability. Dashboard SLIs: p95 latency, error rate, cache hit rate, $/1K requests, 429 rate per tenant.

💰 Cost

Semantic cache hit rates of 15–35% on FAQ-heavy support bots can cut inference spend by 20% or more. Log cache_hit and saved_usd per request so FinOps can attribute savings.

⚖️ Trade-off

Semantic cache improves cost and latency but can return stale policy answers after KB updates. Tie cache TTL to content version (kb_revision in cache key) or invalidate on ingest webhook.

🔬 Under the Hood

LiteLLM Proxy implements many of these responsibilities through config.yaml settings and hooks: budget manager, router, cooldowns. Custom middleware is still needed for tenant-specific ACLs and your own cost ledger schema.

💡 Pro Tip

Return X-Gateway-Model-Resolved and X-Usage-Total-Tokens response headers so hello-ship can log them without parsing JSON bodies.

Open-source gateways: compare and configure

You rarely need to build a gateway from scratch on day one. These four options cover most teams—from batteries-included proxy to fully DIY.

Gateway comparison

| Gateway | Hosting | Strengths | Gaps | Best for |
|---|---|---|---|---|
| LiteLLM Proxy | Self-hosted (Docker/K8s) | 100+ providers, OpenAI-compatible API, budgets, routing, admin UI | Ops burden; semantic cache needs Redis add-on | Platform teams wanting OSS control |
| Portkey | Managed + self-host option | Gateway + observability, guardrails, prompt management | Vendor lock-in on managed tier pricing | Teams wanting gateway + traces in one SKU |
| OpenRouter | Managed SaaS | One API key, many models, automatic failover | Data leaves your VPC; limited custom tenant logic | Startups / fast prototypes |
| Traefik DIY | Self-hosted | Full control, mesh with existing ingress | You build routing, cache, ledger, SDK adapters | Mature platform eng with gateway charter |

LiteLLM Proxy — hello-ship config

LiteLLM exposes an OpenAI-compatible /v1/chat/completions. Point hello-ship’s LLM_BASE_URL at the proxy. Model list, budgets, and fallbacks live in litellm_config.yaml.

LiteLLM Proxy — config + hello-ship client
# litellm_config.yaml — hello-ship gateway
model_list:
  - model_name: support-fast
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: support-quality
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: support-quality
    litellm_params:
      model: azure/gpt-4o
      api_base: https://acme.openai.azure.com
      api_key: os.environ/AZURE_OPENAI_KEY
      api_version: "2024-02-15-preview"

router_settings:
  routing_strategy: simple-shuffle
  num_retries: 2
  timeout: 60
  allowed_fails: 3
  cooldown_time: 30

litellm_settings:
  drop_params: true
  set_verbose: false
  cache: true
  cache_params:
    type: redis
    host: redis.llm-gateway.svc
    port: 6379

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL  # spend logs

# hello-ship: call the gateway instead of OpenAI directly
import os
from openai import OpenAI

gateway = OpenAI(
    base_url="http://llm-gateway.internal:4000/v1",
    api_key=os.environ["HELLO_SHIP_GATEWAY_KEY"],
)

# tenant_id and messages come from the surrounding request handler.
resp = gateway.chat.completions.create(
    model="support-fast",  # alias, not gpt-4o-mini
    messages=messages,
    temperature=0,
    # Per-request headers, so one shared client can serve every tenant.
    extra_headers={
        "X-Tenant-Id": tenant_id,
        "X-Feature-Id": "support-chat",
    },
)
usage = resp.usage  # logged by gateway + returned here

// hello-ship: OpenAI-compatible Java client pointed at the same LiteLLM Proxy (shared litellm_config.yaml)
OpenAIClient gateway = OpenAIOkHttpClient.builder()
    .baseUrl("http://llm-gateway.internal:4000/v1")
    .apiKey(System.getenv("HELLO_SHIP_GATEWAY_KEY"))
    .build();

ChatCompletionCreateParams params = ChatCompletionCreateParams.builder()
    .model("support-fast")
    .addUserMessage(userMessage)
    .temperature(0.0)
    .putAdditionalHeader("X-Tenant-Id", tenantId)
    .putAdditionalHeader("X-Feature-Id", "support-chat")
    .build();

ChatCompletion completion = gateway.chat().completions().create(params);
long totalTokens = completion.usage().map(u -> u.totalTokens()).orElse(0L);

Portkey — gateway headers

Portkey uses a managed gateway URL with config and metadata in headers—useful when you want traces and guardrails without running LiteLLM yourself.

Portkey — curl through gateway
curl https://api.portkey.ai/v1/chat/completions \
  -H "x-portkey-api-key: $PORTKEY_API_KEY" \
  -H "x-portkey-provider: openai" \
  -H "x-portkey-config: pc-support-fast-abc123" \
  -H "x-portkey-metadata: {\"tenant_id\":\"acme-corp\",\"feature\":\"support-chat\"}" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"refund policy?"}]}'

OpenRouter — single key, many models

OpenRouter — model string routing
# OpenRouter: OpenAI SDK compatible; the model string selects the vendor
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
resp = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",  # or openai/gpt-4o-mini
    messages=messages,
    extra_headers={
        "HTTP-Referer": "https://hello-ship.internal",
        "X-Title": "hello-ship support",
    },
)

Traefik DIY — ingress + custom upstream

Roll your own when LiteLLM abstractions fight your compliance model. Traefik terminates TLS, routes /v1/* to a small FastAPI/Spring gateway service you own—that service implements adapters, ledger, and cache. Higher build cost, zero framework magic.

Traefik — route to custom LLM gateway service
# traefik/dynamic/llm-gateway.yml
http:
  routers:
    llm-gateway:
      rule: "Host(`llm-gateway.internal`) && PathPrefix(`/v1`)"
      service: llm-gateway-svc
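      middlewares:
        - rate-limit-tenant
        - strip-internal-headers  # must be defined elsewhere, e.g. as a headers middleware
  services:
    llm-gateway-svc:
      loadBalancer:
        servers:
          - url: "http://llm-gateway-app.llm.svc.cluster.local:8080"
  middlewares:
    rate-limit-tenant:
      rateLimit:
        average: 100
        burst: 200
        # Without sourceCriterion Traefik buckets by client IP, not by tenant.
        sourceCriterion:
          requestHeaderName: X-Tenant-Id

⚖️ Trade-off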

LiteLLM ships fastest; Portkey adds managed observability; OpenRouter minimizes ops but sends prompts to a third party; DIY maximizes control at 3–6 month build cost.

⚠️ Pitfall

Running LiteLLM Proxy without master_key auth on a cluster-internal URL still risks lateral movement after a pod compromise. Require mTLS or signed service tokens even “inside” the VPC.

🏢 Real World

Many Series B companies start on OpenRouter for velocity, migrate to self-hosted LiteLLM at ~$15K/month inference spend when finance demands per-tenant attribution and VPC boundaries.

💡 Pro Tip

Keep hello-ship’s integration test on a mock gateway that returns fixture JSON—same OpenAI response shape. Point staging at real LiteLLM; prod at HA LiteLLM behind load balancer.
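A mock gateway for integration tests can be built with the standard library. The sketch below is one way to do it under stated assumptions: the fixture values are made up, only POST /v1/chat/completions is handled, and `build_response`, `MockGatewayHandler`, and `start_mock_gateway` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Fixture shaped like an OpenAI chat completion; the values are invented for tests.
FIXTURE = {
    "id": "chatcmpl-mock",
    "object": "chat.completion",
    "model": "mock-model",
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "pong"}, "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 5, "completion_tokens": 1, "total_tokens": 6},
}


def build_response(request: dict) -> dict:
    """Return the fixture, echoing back whichever model alias the caller asked for."""
    return {**FIXTURE, "model": request.get("model", FIXTURE["model"])}


class MockGatewayHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/v1/chat/completions":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", "0"))
        request = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(build_response(request)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        # Silence per-request logging so test output stays readable.
        pass


def start_mock_gateway(port: int = 0) -> ThreadingHTTPServer:
    """Bind the mock on localhost; port 0 lets the OS pick a free port."""
    return ThreadingHTTPServer(("127.0.0.1", port), MockGatewayHandler)
```

In a test, run `start_mock_gateway().serve_forever()` in a daemon thread and point the client's base URL at the bound port; the client code under test does not change.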

Feature matrix (detail)

FeatureLiteLLMPortkeyOpenRouterTraefik DIY
OpenAI-compatible APIYou build
Per-tenant budgets✓ (DB)LimitedYou build
Semantic cacheRedis plugin✓ managedYou build
Fallback routing✓ autoYou build
Spend dashboardAdmin UIBasicYou build
VPC-only deploySelf-host
Guardrails integrationHooksNativeYour pipeline

Self-hosted vs managed: where keys and data live

The gateway decision overlaps with provider choice. Self-hosted LiteLLM in your VPC vs calling managed AWS Bedrock or Azure OpenAI directly are different trust and ops tradeoffs—not mutually exclusive.

Teams often run LiteLLM Proxy in EKS with Bedrock and Azure as upstreams—self-hosted gateway, managed inference. Alternatively, Bedrock alone can be your “gateway” if you only need AWS models and IAM auth, accepting vendor lock-in on routing and cache.

Three deployment patterns

| Pattern | Gateway | Inference | Keys live in | Typical buyer |
|---|---|---|---|---|
| A: LiteLLM + multi-vendor | Self-hosted LiteLLM | OpenAI, Anthropic, Bedrock, Azure | Gateway vault | Multi-cloud product teams |
| B: Bedrock-native | Thin wrapper / SDK | Bedrock only | IAM roles | AWS-all-in enterprises |
| C: Azure OpenAI-native | APIM + Azure OAI | Azure OpenAI | Azure Key Vault | Microsoft EA customers |

LiteLLM self-hosted — pros and cons

  • Pros: One OpenAI-compatible surface; swap vendors without app changes; full custom ledger schema; semantic cache in your Redis; air-gapped deploy possible.
  • Cons: You operate HA proxy, Postgres for spend logs, Redis, upgrades; on-call for gateway outages blocks all LLM features.
  • Cost: Infra ~$200–800/month + engineer time; inference billed per vendor separately.

AWS Bedrock — pros and cons

  • Pros: IAM auth (no API keys in pods); Claude, Llama, Titan in one AWS bill; private VPC endpoints; enterprise procurement already on AWS.
  • Cons: No OpenAI API shape without adapter; model lag vs OpenAI day-zero; cross-region failover is your design; limited semantic cache story.
  • Cost: Per-token Bedrock pricing; no gateway SKU—but you still build routing/ledger or use LiteLLM in front.

Azure OpenAI — pros and cons

  • Pros: GPT-4o under enterprise DPA; regional deploy for data residency; integrates with Azure APIM for rate limits and keys.
  • Cons: Quota approval latency; not a multi-vendor router; Anthropic/Google require separate paths.
  • Cost: Similar per-token to OpenAI; APIM adds per-call charge.

Decision rubric

| Requirement | Prefer | Why |
|---|---|---|
| Multi-vendor failover | LiteLLM (pattern A) | Bedrock/Azure alone are single-vendor |
| AWS-only, IAM-only auth | Bedrock (pattern B) | No API keys; Converse API unified |
| EU data residency + GPT | Azure OpenAI (pattern C) | Regional Azure OAI endpoints |
| Fastest hello-ship curriculum | LiteLLM + OpenAI mock | OpenAI-compatible; matches Track 1 code |
| FinOps per-tenant ledger | LiteLLM + custom middleware | Full schema control |
| Zero gateway ops | OpenRouter or Portkey managed | Trade VPC control for speed |

🔒 Security

Bedrock IAM roles beat long-lived API keys for lateral-movement risk. If you use LiteLLM with OpenAI keys, store in AWS Secrets Manager / HashiCorp Vault with automatic rotation—never Kubernetes Secret literals in git.

💰 Cost

Managed Azure OpenAI + APIM can add 5–15% overhead per call vs direct OpenAI—but compliance and consolidated billing often justify it. Model routing to mini tiers via LiteLLM saves more than gateway infra costs.

🎯 Interview Tip

“Bedrock vs OpenAI?” — Not either/or. Position Bedrock for AWS-native IAM and Claude/Llama; OpenAI for latest GPT and ecosystem; LiteLLM gateway unifies both behind one app contract.

🔬 Under the Hood

Azure OpenAI uses the same model weights as OpenAI but different endpoint, api-version query param, and deployment name instead of raw model ID—LiteLLM’s azure/ prefix normalizes this for apps.

Hybrid hello-ship recommendation

For the curriculum: deploy LiteLLM Proxy in docker-compose beside hello-ship; upstream support-fast → mock server in CI, OpenAI in dev, Bedrock optional via aws_profile in staging. One client config, three environments—mirrors how platform teams ship.

Multi-tenant gateway: namespaces, budgets, ACLs

Once two product lines share the gateway, tenant isolation is not optional. Namespace every limit, cache key, ledger row, and model permission by tenant_id.

Tenant namespace model

A tenant is a billing and policy boundary—customer org, internal business unit, or environment (staging vs prod). hello-ship uses acme-corp, acme-staging, internal-tools. Every gateway subsystem prefixes keys:

  • Rate limit: rl:{tenant_id}:rpm:{minute}
  • Semantic cache: sc:{tenant_id}:{model_alias}:{hash}
  • Cost ledger: partition by tenant_id in Postgres or S3 prefix
  • Traces: tenant_id tag on every span
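The key prefixes listed above are easiest to keep consistent when every subsystem builds them through one helper. A minimal sketch follows; the tenant ID pattern is an assumption (lowercase letters, digits, hyphens), and the helper names are hypothetical.

```python
import re

# Assumed tenant ID format: lowercase alphanumerics and hyphens, at most 63 chars.
_TENANT_RE = re.compile(r"^[a-z0-9][a-z0-9-]{0,62}$")


def _check(tenant_id: str) -> str:
    # Reject IDs containing ":" or wildcards so one tenant cannot address another's keys.
    if not _TENANT_RE.match(tenant_id):
        raise ValueError(f"invalid tenant_id: {tenant_id!r}")
    return tenant_id


def rate_limit_key(tenant_id: str, epoch_seconds: int) -> str:
    """Key for the per-minute request counter."""
    return f"rl:{_check(tenant_id)}:rpm:{epoch_seconds // 60}"


def cache_key(tenant_id: str, model_alias: str, digest: str) -> str:
    """Key for one semantic cache entry."""
    return f"sc:{_check(tenant_id)}:{model_alias}:{digest}"
```

Validating the tenant ID at the point where keys are built matters because the ID arrives in a request header: an unchecked value such as `acme:corp` would shift the key's segments.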

Cost allocation

Finance wants showback/chargeback by tenant and feature. Aggregate ledger events daily:

| Dimension | Source field | Report use |
|---|---|---|
| Tenant | tenant_id | Per-customer invoice / internal budget |
| Feature | feature_id | Which product surface spent tokens |
| Model | model_resolved | FinOps model mix optimization |
| Environment | tenant_id suffix or tag | Exclude staging from customer bills |
| Cache savings | cache_hit + counterfactual cost | Prove cache ROI |

Daily cost rollup per tenant
"""Aggregate cost ledger JSONL; the caller writes the result to reports/cost/{tenant}/daily.json."""
import json
from collections import defaultdict
from pathlib import Path

def rollup(path: Path, day: str) -> dict:
    """Sum tokens, USD, and request count for events whose timestamp starts with day (YYYY-MM-DD)."""
    totals = defaultdict(lambda: {"tokens": 0, "usd": 0.0, "requests": 0})
    with path.open() as ledger:
        for line in ledger:  # stream line by line: ledgers can be large
            if not line.strip():
                continue  # tolerate blank lines
            ev = json.loads(line)
            if not ev["timestamp"].startswith(day):
                continue
            key = (ev["tenant_id"], ev.get("feature_id", "unknown"))
            totals[key]["tokens"] += ev["usage"]["total_tokens"]
            totals[key]["usd"] += ev["cost_usd"]["total"]
            totals[key]["requests"] += 1
    return {f"{t}:{f}": v for (t, f), v in totals.items()}

/** Roll up cost ledger JSONL by tenant + feature for a given day. */
public Map<String, TenantCostRollup> rollup(Path ledgerFile, String dayPrefix) throws IOException {
  Map<String, TenantCostRollup> totals = new HashMap<>();
  try (var lines = Files.lines(ledgerFile)) {
    for (String line : (Iterable<String>) lines::iterator) {
      CostLedgerEvent ev = mapper.readValue(line, CostLedgerEvent.class);
      if (!ev.timestamp().startsWith(dayPrefix)) continue;
      String key = ev.tenantId() + ":" + ev.featureId();
      totals.computeIfAbsent(key, k -> new TenantCostRollup())
        .add(ev.usage().totalTokens(), ev.costUsd().total());
    }
  }
  return totals;
}

Model access control

Not every tenant gets frontier models. Define tenant_policy.yaml:

tenant_policy.yaml — model ACLs
tenants:
  acme-corp:
    plan: enterprise
    allowed_models: [support-fast, support-quality, embed-small]
    daily_token_cap: 2_000_000
    rpm: 300
  acme-free-trial:
    plan: free
    allowed_models: [support-fast]
    daily_token_cap: 50_000
    rpm: 30
  internal-tools:
    plan: internal
    allowed_models: ["*"]
    daily_token_cap: 10_000_000
    rpm: 600

Gateway rejects support-quality for free-trial with 403 and structured error model_not_allowed—prevents surprise frontier spend from sandbox accounts.
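The allowlist check can be sketched as a single function. The two policies below are inlined from tenant_policy.yaml for self-containment; a real gateway would parse the file. `TenantPolicy`, `ModelNotAllowed`, `authorize_model`, and the exact shape of the error body are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantPolicy:
    plan: str
    allowed_models: tuple[str, ...]


class ModelNotAllowed(Exception):
    """Raised when a tenant requests an alias outside its allowlist."""

    def __init__(self, tenant_id: str, alias: str):
        super().__init__(f"{tenant_id} may not use {alias}")
        self.status_code = 403
        # Hypothetical structured error body for the 403 response.
        self.body = {"error": {"code": "model_not_allowed", "tenant_id": tenant_id, "model": alias}}


# Inlined copy of two tenants from tenant_policy.yaml.
POLICIES = {
    "acme-free-trial": TenantPolicy("free", ("support-fast",)),
    "internal-tools": TenantPolicy("internal", ("*",)),
}


def authorize_model(tenant_id: str, alias: str) -> None:
    """Return silently when the alias is allowed; raise ModelNotAllowed otherwise."""
    policy = POLICIES.get(tenant_id)
    # Unknown tenants are denied by default.
    if policy is None or not ("*" in policy.allowed_models or alias in policy.allowed_models):
        raise ModelNotAllowed(tenant_id, alias)
```

Running this check before the rate limiter and cache means a disallowed request never consumes quota or returns another tier's cached answer.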

Usage dashboards

Minimum viable dashboard panels for platform on-call:

| Panel | Query | Alert threshold |
|---|---|---|
| Spend today by tenant | sum(cost_usd.total) by tenant_id | > 120% of 7d avg |
| p95 gateway latency | histogram(latency_ms.total) | > 3s for 5m |
| 429 rate | rate(status=429) | > 5% of traffic |
| Cache hit rate | cache_hit=true / total | < 5% (misconfigured) |
| Fallback rate | fallback_used=true | > 10% (primary unhealthy) |
| Error rate by provider | status=5xx by provider | > 1% any provider |

⚠️ Pitfall

Shared semantic cache across tenants leaks answers—a cached response for tenant A’s policy query must never return to tenant B. Always include tenant_id in cache index namespace.

🏢 Real World

SaaS AI platforms expose a “Usage” tab per workspace fed by the same ledger schema—customers self-serve before finance tickets. Internal tenants get Grafana; external get Metronome/Stripe metering integration.

💡 Pro Tip

Wire daily_token_cap to a soft warning at 80% (Slack) and hard block at 100% (429). Lets tenant admins upgrade before hard outage.
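The 80% / 100% thresholds reduce to a three-state check. A minimal sketch, assuming the gateway already knows the tenant's token usage for the day; `BudgetState` and `budget_state` are hypothetical names.

```python
from enum import Enum


class BudgetState(Enum):
    OK = "ok"
    WARN = "warn"    # soft warning, e.g. notify the tenant admin in Slack
    BLOCK = "block"  # hard stop, gateway answers 429


def budget_state(used_tokens: int, daily_token_cap: int, warn_ratio: float = 0.8) -> BudgetState:
    """Classify a tenant's daily usage against its cap."""
    if daily_token_cap <= 0:
        raise ValueError("daily_token_cap must be positive")
    if used_tokens >= daily_token_cap:
        return BudgetState.BLOCK
    if used_tokens >= warn_ratio * daily_token_cap:
        return BudgetState.WARN
    return BudgetState.OK
```

For the acme-free-trial cap of 50,000 tokens, usage of 40,000 moves the tenant to WARN and 50,000 to BLOCK. Send the warning only on the transition from OK to WARN, otherwise every request past 80% produces a notification.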

⚖️ Trade-off

Strict model ACLs reduce bill shock but slow innovation—give internal-tools wildcard access and prod tenants explicit allowlists with quarterly review.

FAQ: Tenant vs API key?

One gateway API key per service (hello-ship API, worker fleet); X-Tenant-Id identifies the end customer. Do not mint per-customer gateway keys unless you have thousands of tenants—policy tables scale better.

Production: Track 6 gateway checklist

Close Guide 2 with a gateway shipping checklist—auth, routing, limits, cache, ledger, failover, observability—then continue to fine-tuning & adaptation for when routing alone is not enough.

Track 6 gateway production checklist

  • No vendor API keys in application repos or hello-ship env—gateway vault only.
  • All inference traffic routes through the gateway; a CI grep blocks direct references to vendor endpoints such as api.openai.com in app code.
  • Model aliases in git (gateway_models.yaml); pinned vendor IDs per environment.
  • Per-tenant rate limits: RPM + daily token cap with 429 + Retry-After.
  • Semantic cache namespaced by tenant_id; TTL tied to KB revision where applicable.
  • Cost ledger schema v1 emitted on every request; daily rollup job per tenant.
  • Fallback chain configured and tested; alerts on fallback_used > 10%.
  • Structured traces with trace_id, tenant_id, model_resolved, cache_hit.
  • Gateway HA: ≥2 replicas, health probe, graceful shutdown draining in-flight streams.
  • Model ACL policy enforced; free-tier cannot invoke frontier aliases.
  • Dashboards: spend, p95 latency, 429 rate, cache hit rate, provider errors.
  • Incident runbook: gateway down → enable read-only mode / cached responses / status page.
  • Pre-ship load test: 2× expected peak RPM without tripping vendor limits.
  • Reconcile ledger vs vendor invoice weekly; drift > 5% triggers investigation.
| Risk | Gateway control | Signal | Owner |
|---|---|---|---|
| Runaway token spend | Daily cap + model ACL | tenant 429 + spend alert | Platform / FinOps |
| Key leak in git | Gateway-only keys | secret scanner CI | Security |
| Vendor outage | Fallback chain | fallback_used rate | Platform on-call |
| Cross-tenant cache leak | Namespace isolation | security audit + unit tests | Platform |
| Stale cached policy | KB revision in cache key | faithfulness eval regression | ML + Product |
| Retry storm on 429 | Jittered backoff + key pool | provider 429 rate drop | Platform |
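The weekly reconciliation item in the checklist comes down to one ratio. A minimal sketch, assuming drift is measured relative to the vendor invoice; `invoice_drift` and `needs_investigation` are hypothetical helper names.

```python
def invoice_drift(ledger_usd: float, invoice_usd: float) -> float:
    """Relative gap between the ledger estimate and the vendor invoice."""
    if invoice_usd <= 0:
        raise ValueError("invoice_usd must be positive")
    return abs(ledger_usd - invoice_usd) / invoice_usd


def needs_investigation(ledger_usd: float, invoice_usd: float, threshold: float = 0.05) -> bool:
    """True when drift exceeds the checklist's 5% threshold."""
    return invoice_drift(ledger_usd, invoice_usd) > threshold
```

A ledger total of $9,400 against a $10,000 invoice is 6% drift and trips the check; $9,700 is 3% and does not. Persistent drift in one direction usually points at a stale price table or at traffic that bypasses the gateway.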

Pre-ship gate

| Gate | Threshold | Blocking? |
|---|---|---|
| gateway_smoke.sh (mock upstream) | 100% pass | Yes |
| No direct vendor endpoints (e.g. api.openai.com) in app code (grep) | 0 matches | Yes |
| Ledger row on every integration test call | 100% present | Yes |
| Fallback drill in staging | Primary blocked → secondary succeeds | Yes for release |
| Load test 2× peak RPM | p95 < 3s, error < 0.5% | Warn then block |
| Eval smoke via gateway | ≥ baseline pass rate | Yes |

gateway_smoke.sh helper — verify hello-ship → gateway path
#!/usr/bin/env bash
# gateway_smoke.sh — run after deploy
set -euo pipefail
GW="${LLM_GATEWAY_URL:-http://localhost:4000}"
KEY="${HELLO_SHIP_GATEWAY_KEY:?}"
curl -sf "$GW/health" | grep -q ok
curl -sf "$GW/v1/chat/completions" \
  -H "Authorization: Bearer $KEY" \
  -H "X-Tenant-Id: smoke-test" \
  -H "Content-Type: application/json" \
  -d '{"model":"support-fast","messages":[{"role":"user","content":"ping"}]}' \
  | jq -e '.choices[0].message.content'
echo "gateway_smoke ok"

📦 End of Guide 2

You can justify a centralized gateway, implement seven responsibilities, choose among LiteLLM/Portkey/OpenRouter/DIY, design multi-tenant isolation, and ship with the checklist above. Guide 3 covers fine-tuning & adaptation—when gateway routing and RAG are not enough to hit quality SLOs.

🔒 Security

Gateway logs contain prompts—apply the same redaction pipeline as Track 5 guardrails before logs ship to third-party observability vendors.

🎯 Interview Tip

Whiteboard the gateway box between apps and clouds. List seven responsibilities without naming a vendor first—shows platform thinking; then mention LiteLLM as implementation detail.

🔬 Under the Hood

Guide 3’s fine-tuning path still calls inference through this gateway—training jobs submit to a separate fine-tune API, but serving the adapted model uses the same model_alias map.

Track 6 guide sequence

| Guide | Topic | You are here |
|---|---|---|
| Guide 1 | hello-ship explained | ← prev |
| Guide 2 | LLM gateway & infrastructure | this guide |
| Guide 3 | Fine-tuning & adaptation | next → |
| Guide 4+ | Cost control, flags, deploy, multimodal | |

Handoff to fine-tuning

Export these gateway fields before Guide 3 wiring:

  • model_alias and model_resolved per request
  • cost_usd.total aggregated per feature
  • tenant_id for adaptation path experiments
  • cache_hit to avoid tuning on cached traffic
  • Baseline eval pass rate via gateway (same path as prod)

FAQ: Build or buy gateway?

Buy (Portkey/OpenRouter) until multi-tenant ledger + VPC + custom ACLs block you—usually $10–30K/month inference or compliance review. Build (LiteLLM self-hosted) when platform team exists and two product lines share models.

FAQ: Gateway before or after hello-ship?

Guide 1 ships hello-ship calling a mock. Guide 2 inserts the gateway—minimal hello-ship diff is base URL + headers. Do not rewrite hello-ship; reroute it.