LLM gateway & infrastructure

Why an LLM gateway: stop scattering keys

An LLM gateway is the single network hop between your apps and every model vendor. It centralizes auth, routing, rate limits, cost tracking, and caching—so product teams call one stable API instead of importing vendor SDKs in twenty repos.

Picture ~/hello-ship after Guide 1: FastAPI serves /v1/chat, workers call OpenAI directly, and a cron job hits Anthropic for summarization. Keys live in three Kubernetes Secrets. When OpenAI returns 429, the API retries immediately while the worker backs off—both hammer the same exhausted key. Finance asks “which team spent $40K on GPT-4o last month?” and nobody can answer because logs lack tenant_id and model_id. That is the scatter-keys anti-pattern.

The fix is not “use LiteLLM in one service.” It is a platform contract: every inference request—chat, embeddings, structured extraction, agent tool loops—flows through the gateway. Apps pass X-Tenant-Id, X-Feature-Id, and a logical model alias (support-fast, support-quality). The gateway resolves aliases to pinned vendor model IDs, enforces budgets, writes a cost ledger row, and returns normalized usage metadata. Vendor keys never leave the gateway VPC.

Without vs with a gateway

Concern	Scattered keys (anti-pattern)	Centralized gateway
Auth	Keys in 12 repos; one intern commits .env	Keys in vault; apps use gateway API keys or mTLS
Routing	Hard-coded gpt-4o in each service	Alias map in git; change model without redeploying apps
Rate limits	Per-service retry storms on 429	Token bucket per tenant + per vendor key pool
Cost	Billing export ≠ app logs	Per-request ledger with tokens and $ estimate
Cache	Duplicate identical FAQ calls	Semantic cache at gateway; apps stay unaware
Failover	Outage = every team pages separately	Gateway fails over OpenAI → Azure OpenAI → Bedrock
Observability	Inconsistent log shapes	One trace schema → Langfuse / Datadog / Splunk

Gateway in the hello-ship architecture

flowchart TB
  subgraph clients["Application tier"]
    API["hello-ship FastAPI"]
    WK["Async workers"]
    AG["Agent orchestrator"]
  end

  subgraph gateway["LLM gateway — single front door"]
    AUTH["Auth + tenant context"]
    RL["Rate limiter"]
    RT["Model router"]
    SC["Semantic cache Redis"]
    CB["Cost ledger"]
    FB["Retry + fallback"]
    OBS["Trace + metrics"]
  end

  subgraph vendors["Model providers"]
    OAI["OpenAI"]
    ANT["Anthropic"]
    BR["AWS Bedrock"]
    AZ["Azure OpenAI"]
  end

  API --> AUTH
  WK --> AUTH
  AG --> AUTH
  AUTH --> RL
  RL --> SC
  SC -->|miss| RT
  SC -->|hit| CB
  RT --> FB
  FB --> OAI
  FB --> ANT
  FB --> BR
  FB --> AZ
  FB --> CB
  CB --> OBS

In production, the gateway often sits behind your existing API gateway (Kong, Envoy, AWS API Gateway) for TLS termination and WAF. hello-ship’s FastAPI should call http://llm-gateway.internal/v1/chat/completions with an OpenAI-compatible payload—the same shape you already use in Track 1—so swapping direct OpenAI for the gateway is a one-line base URL change.

🔒 Security

Vendor API keys belong in the gateway’s secret store only. Application pods receive gateway credentials scoped per tenant—not sk-proj-…. Rotate vendor keys without touching app deployments.

⚠️ Pitfall

“We use LangChain, so we don’t need a gateway.” LangChain is an app framework; it does not centralize keys, tenant budgets, or org-wide failover. Wrap LangChain’s LLM client to point at the gateway base URL.

🏢 Real World

Large product platforms (think Stripe-, Notion-, or Perplexity-scale) commonly run an internal model router service. Product teams request access via ticket; the platform team provisions a tenant namespace, a daily token cap, and allowed model aliases. hello-ship follows the same pattern at smaller scale.

🎯 Interview Tip

“How would you integrate multiple LLM providers?” — Draw one gateway box. List auth, routing, rate limits, cost ledger, cache, failover. Emphasize stable app contract vs interchangeable vendors.

When you can defer the gateway

A single developer prototype with one API key and <1K requests/day can call OpenAI directly—until you add a second service or a second model. Deferral ends when any of these trigger: second team consumes LLMs, compliance asks for key audit trail, finance needs per-feature cost, or CI eval gates require consistent model pinning across staging and prod.

FAQ: Gateway vs API gateway?

Your Kong/Envoy layer handles HTTP routing, JWT auth, and WAF. The LLM gateway understands chat payloads, token usage, model aliases, and semantic cache keys. Often both exist: Kong → LLM gateway → OpenAI. Do not stuff LLM-specific logic into generic API gateway plugins unless you enjoy maintaining Lua.

Gateway responsibilities: seven cross-cutting concerns

Treat the gateway as a platform microservice with seven owned concerns—each maps to code you ship once and every app inherits.

Responsibility matrix

#	Responsibility	Gateway does	App does NOT
1	Auth	Validate gateway API key / mTLS; attach tenant_id	Store OpenAI keys
2	Routing	Resolve support-fast → gpt-4o-mini; cascade tiers	Hard-code vendor model strings
3	Rate limiting	Token bucket per tenant, per key, global TPM	Spin on 429 without backoff
4	Semantic cache	Embedding similarity lookup before vendor call	Build per-feature Redis clients
5	Cost tracking	Ledger row: tokens × price table → estimated USD	Guess cost from response length
6	Retry / fallback	Jittered retry; failover chain per model alias	Ad-hoc try/catch per SDK
7	Observability	Structured trace: request_id, latency, model, cache_hit	Inconsistent log formats

1 — Auth and request context

Gateway auth is two-layer: who is calling (service API key or workload identity) and on behalf of whom (tenant_id, user_id hash). hello-ship passes tenant from JWT claims; workers pass tenant from job metadata. Reject requests missing tenant on paid tiers—anonymous abuse burns shared quota.
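
A minimal sketch of that two-layer check, assuming plain header dictionaries. `PAID_PLANS`, `RequestContext`, `build_context`, and the `X-User-Id` header are illustrative names chosen for this example, not part of any library.

```python
import hashlib
from dataclasses import dataclass
from typing import Mapping, Optional

PAID_PLANS = {"standard", "enterprise"}  # assumption: illustrative plan names


class MissingTenantError(Exception):
    """Raised when a paid-tier call arrives without tenant context."""


@dataclass(frozen=True)
class RequestContext:
    service: str                  # who is calling (resolved from the gateway API key)
    tenant_id: Optional[str]      # on behalf of whom
    user_id_hash: Optional[str]   # hashed so raw user IDs never reach logs


def build_context(service: str, plan: str, headers: Mapping[str, str]) -> RequestContext:
    tenant = headers.get("X-Tenant-Id") or None
    if tenant is None and plan in PAID_PLANS:
        raise MissingTenantError("X-Tenant-Id required on paid tiers")
    user = headers.get("X-User-Id")  # hypothetical header carrying the end-user ID
    user_hash = ("sha256:" + hashlib.sha256(user.encode()).hexdigest()) if user else None
    return RequestContext(service, tenant, user_hash)
```

Hashing the user ID at the gateway edge keeps the raw identifier out of ledger rows and traces while still allowing per-user aggregation.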

2 — Model routing and aliases

Apps request logical aliases defined in gateway_models.yaml. Routing rules can include: time-of-day downgrade, tenant plan (free → mini only), content length (long docs → frontier), and cascade escalation when confidence is low.
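
The routing rules above could be sketched roughly as follows. The alias table stands in for whatever gateway_models.yaml contains, and `LONG_DOC_TOKENS`, `RouteRequest`, and `resolve_model` are hypothetical names; time-of-day and cascade rules are left out to keep the example short.

```python
from dataclasses import dataclass

# Assumption: a hard-coded table standing in for gateway_models.yaml.
ALIASES = {
    "support-fast": "openai/gpt-4o-mini",
    "support-quality": "openai/gpt-4o",
}
LONG_DOC_TOKENS = 8_000  # hypothetical cutoff for "long docs"


@dataclass(frozen=True)
class RouteRequest:
    alias: str
    plan: str               # "free", "enterprise", ...
    est_input_tokens: int


def resolve_model(req: RouteRequest) -> str:
    """Validate the alias, apply plan and length rules, return a vendor model ID."""
    if req.alias not in ALIASES:
        raise KeyError(f"unknown model alias: {req.alias}")
    alias = req.alias
    if req.plan == "free":
        alias = "support-fast"        # free tenants stay on the mini tier
    elif req.est_input_tokens > LONG_DOC_TOKENS:
        alias = "support-quality"     # long documents go to the frontier tier
    return ALIASES[alias]
```

Because apps only ever send the alias, changing the right-hand side of `ALIASES` swaps the vendor model without touching app code.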

3 — Rate limiting

Implement three limiters: global gateway TPM (protect org), per-tenant daily token budget (protect finance), and per-vendor-key RPM (protect vendor relationship). Return 429 with Retry-After, and never pass a vendor 429 straight through without recording which limit tripped.

"""FastAPI middleware: per-tenant fixed-window RPM counter and daily token budget in Redis."""
import time
from fastapi import Request
from fastapi.responses import JSONResponse
from starlette.middleware.base import BaseHTTPMiddleware


def _reject(status, detail, retry_after=None):
    # An HTTPException raised inside BaseHTTPMiddleware bypasses FastAPI's
    # exception handlers and surfaces as a 500, so return the response directly.
    headers = {"Retry-After": str(retry_after)} if retry_after else None
    return JSONResponse({"detail": detail}, status_code=status, headers=headers)


class TenantRateLimitMiddleware(BaseHTTPMiddleware):
    def __init__(self, app, redis, rpm: int = 120, daily_token_cap: int = 500_000):
        super().__init__(app)
        self.redis = redis  # async client, e.g. redis.asyncio.Redis
        self.rpm = rpm
        self.daily_token_cap = daily_token_cap

    async def dispatch(self, request: Request, call_next):
        tenant = request.headers.get("X-Tenant-Id")
        if not tenant:
            return _reject(401, "X-Tenant-Id required")

        # Fixed one-minute window: INCR the counter for the current minute.
        now = int(time.time())
        minute_key = f"rl:{tenant}:rpm:{now // 60}"
        count = await self.redis.incr(minute_key)
        if count == 1:
            await self.redis.expire(minute_key, 120)
        if count > self.rpm:
            return _reject(429, "Tenant RPM exceeded", retry_after=60 - now % 60)

        # Daily budget keyed by UTC date so every replica agrees on the day boundary.
        day_key = f"rl:{tenant}:tokens:{time.strftime('%Y%m%d', time.gmtime(now))}"
        used = int(await self.redis.get(day_key) or 0)
        est = int(request.headers.get("X-Est-Tokens", "2000"))
        if used + est > self.daily_token_cap:
            return _reject(429, "Daily token budget exceeded", retry_after=86_400 - now % 86_400)

        response = await call_next(request)
        # Prefer real usage reported by the downstream handler over the estimate.
        usage = int(response.headers.get("X-Usage-Total-Tokens", est))
        await self.redis.incrby(day_key, usage)
        await self.redis.expire(day_key, 2 * 86_400)
        return response

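import java.time.Duration;
import org.springframework.data.redis.core.ReactiveStringRedisTemplate;
import org.springframework.http.HttpStatus;
import org.springframework.stereotype.Component;
import org.springframework.web.server.ServerWebExchange;
import org.springframework.web.server.WebFilter;
import org.springframework.web.server.WebFilterChain;
import reactor.core.publisher.Mono;

/** Spring WebFlux filter: per-tenant fixed-window RPM limit in Redis. The daily token cap is omitted here. */
@Component
public class TenantRateLimitFilter implements WebFilter {
  private final ReactiveStringRedisTemplate redis;
  private final int rpm = 120;

  // Constructor injection; without it the final field is never initialized.
  public TenantRateLimitFilter(ReactiveStringRedisTemplate redis) {
    this.redis = redis;
  }

  @Override
  public Mono<Void> filter(ServerWebExchange exchange, WebFilterChain chain) {
    String tenant = exchange.getRequest().getHeaders().getFirst("X-Tenant-Id");
    if (tenant == null || tenant.isBlank()) {
      exchange.getResponse().setStatusCode(HttpStatus.UNAUTHORIZED);
      return exchange.getResponse().setComplete();
    }
    // One counter key per tenant per minute; the key expires after two minutes.
    long minute = System.currentTimeMillis() / 60_000;
    String minuteKey = "rl:" + tenant + ":rpm:" + minute;
    return redis.opsForValue().increment(minuteKey)
      .flatMap(count -> {
        if (count == 1) return redis.expire(minuteKey, Duration.ofMinutes(2)).thenReturn(count);
        return Mono.just(count);
      })
      .flatMap(count -> {
        if (count > rpm) {
          exchange.getResponse().setStatusCode(HttpStatus.TOO_MANY_REQUESTS);
          exchange.getResponse().getHeaders().add("Retry-After", "60");
          return exchange.getResponse().setComplete();
        }
        return chain.filter(exchange);
      });
  }
}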
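
The filters above count requests in fixed one-minute windows, which is simple but allows a burst at each window boundary. For comparison, here is a minimal in-process token bucket, offered as a sketch only: `TokenBucket` and its injectable `clock` are illustrative, and a shared Redis-backed version would need an atomic script, which is not shown.

```python
import time


class TokenBucket:
    """Refills continuously at refill_per_sec up to capacity; each call spends tokens."""

    def __init__(self, capacity: float, refill_per_sec: float, clock=time.monotonic):
        self.capacity = float(capacity)
        self.refill_per_sec = float(refill_per_sec)
        self.clock = clock              # injectable so tests can control time
        self.tokens = float(capacity)   # start full so an idle tenant can burst
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = max(0.0, now - self.updated)
        # Credit tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With `capacity=120` and `refill_per_sec=2.0` this approximates a 120 RPM limit while smoothing traffic across the minute instead of resetting at each boundary.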

4 — Semantic cache

An exact cache keys on hash(prompt + model + temperature). A semantic cache embeds the user message, searches a Redis vector index for entries with cosine similarity ≥ 0.95, and returns the cached completion on a hit. Skip the cache for tool-calling turns, streaming responses you cannot replay, or tenant tiers that disable cache for compliance.

"""Semantic cache: embed query, KNN in Redis, return cached completion."""
import hashlib, json
import numpy as np
from redis.commands.search.query import Query

SIM_THRESHOLD = 0.95
CACHE_TTL_SEC = 3600

def _to_bytes(vec) -> bytes:
    # Redis vector fields of type FLOAT32 are stored and queried as raw bytes.
    return np.asarray(vec, dtype=np.float32).tobytes()

async def semantic_cache_lookup(redis, embed_fn, tenant: str, model: str, user_msg: str):
    vec = await embed_fn(user_msg)  # e.g. text-embedding-3-small
    key_prefix = f"sc:{tenant}:{model}"
    # Assumes an index named "<prefix>:idx" already exists over hashes with this
    # prefix, with a FLOAT32 COSINE vector field called "embedding".
    query = (
        Query("*=>[KNN 3 @embedding $vec AS distance]")
        .sort_by("distance")
        .return_fields("distance", "payload")
        .dialect(2)
    )
    hits = await redis.ft(f"{key_prefix}:idx").search(query, query_params={"vec": _to_bytes(vec)})
    if not hits.docs:
        return None
    # KNN returns cosine distance (1 - similarity), so convert before comparing.
    similarity = 1.0 - float(hits.docs[0].distance)
    if similarity >= SIM_THRESHOLD:
        return json.loads(hits.docs[0].payload)
    return None

async def semantic_cache_store(redis, embed_fn, tenant: str, model: str, user_msg: str, completion: dict):
    vec = await embed_fn(user_msg)
    doc_id = hashlib.sha256(f"{tenant}:{model}:{user_msg}".encode()).hexdigest()[:16]
    key = f"sc:{tenant}:{model}:{doc_id}"
    # Store the embedding as bytes; a Python list is not a valid hash value.
    await redis.hset(key, mapping={"embedding": _to_bytes(vec), "payload": json.dumps(completion)})
    await redis.expire(key, CACHE_TTL_SEC)

/** Semantic cache service — Redis vector search + embedding client. */
public Optional<CachedCompletion> lookup(String tenant, String model, String userMsg) {
  float[] vec = embedder.embed(userMsg);
  String index = "sc:" + tenant + ":" + model + ":idx";
  List<VectorHit> hits = redisVector.knn(index, vec, 3);
  if (hits.isEmpty() || hits.get(0).score() < 0.95f) return Optional.empty();
  return Optional.of(deserialize(hits.get(0).payload()));
}

public void store(String tenant, String model, String userMsg, CachedCompletion completion) {
  float[] vec = embedder.embed(userMsg);
  // Delimit the parts so ("ab", "c") and ("a", "bc") cannot hash to the same ID.
  String docId = sha256(tenant + ":" + model + ":" + userMsg).substring(0, 16);
  redis.opsForHash().putAll("sc:" + tenant + ":" + model + ":" + docId,
      Map.of("embedding", vec, "payload", serialize(completion)));
  redis.expire("sc:" + tenant + ":" + model + ":" + docId, Duration.ofHours(1));
}
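
To see the threshold logic without a Redis dependency, here is a toy in-memory stand-in. It is a sketch under stated assumptions: `InMemorySemanticCache` is a hypothetical class that scans linearly, which is only reasonable for tests and tiny caches, and embeddings are passed in rather than computed.

```python
import math
from typing import Optional, Sequence

SIM_THRESHOLD = 0.95


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySemanticCache:
    """Toy vector index: one list of (embedding, completion) per (tenant, model)."""

    def __init__(self):
        self._entries = {}

    def store(self, tenant: str, model: str, vec: Sequence[float], completion: dict) -> None:
        self._entries.setdefault((tenant, model), []).append((list(vec), completion))

    def lookup(self, tenant: str, model: str, vec: Sequence[float]) -> Optional[dict]:
        best, best_sim = None, 0.0
        # Only this tenant's entries are scanned, so answers never cross tenants.
        for cached_vec, completion in self._entries.get((tenant, model), []):
            sim = cosine_similarity(vec, cached_vec)
            if sim > best_sim:
                best, best_sim = completion, sim
        return best if best_sim >= SIM_THRESHOLD else None
```

The same class works as a test double for the gateway's cache interface, since it enforces the same similarity threshold and tenant scoping.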

5 — Cost tracking and ledger

Every gateway response appends a cost ledger event—async to Kafka, S3, or Postgres. Finance reconciles against vendor invoices using vendor_request_id when available. Price tables live in git, versioned weekly.

{
  "$schema": "https://sharpbyte.dev/schemas/cost-ledger-v1.json",
  "event_id": "550e8400-e29b-41d4-a716-446655440000",
  "timestamp": "2026-06-07T14:32:01.123Z",
  "request_id": "req_8f3a2b1c",
  "trace_id": "trace_abc123",
  "tenant_id": "acme-corp",
  "feature_id": "support-chat",
  "user_id_hash": "sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
  "model_alias": "support-fast",
  "model_resolved": "gpt-4o-mini-2024-07-18",
  "provider": "openai",
  "region": "us-east-1",
  "cache_hit": false,
  "usage": {
    "input_tokens": 1842,
    "output_tokens": 312,
    "total_tokens": 2154,
    "embedding_tokens": 0
  },
  "cost_usd": {
    "input": 0.000276,
    "output": 0.000187,
    "total": 0.000463,
    "price_table_version": "2026-06-01"
  },
  "latency_ms": {
    "gateway": 12,
    "cache_lookup": 3,
    "provider": 840,
    "total": 855
  },
  "status": "success",
  "fallback_chain": ["openai/gpt-4o-mini", "azure/gpt-4o-mini"],
  "fallback_used": false,
  "rate_limit_tier": "standard",
  "prompt_version": "support_v3.2",
  "guardrail_blocked": false
}
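
The cost_usd block in that event can be derived from the usage block and a price table. The sketch below assumes illustrative per-million-token prices (0.15 USD input, 0.60 USD output) chosen because they reproduce the sample row; treat them as placeholders and load real prices from your versioned price table. `PRICE_TABLE` and `estimate_cost` are hypothetical names.

```python
from decimal import Decimal, ROUND_HALF_UP

# Assumption: illustrative USD prices per 1M tokens, not an authoritative price list.
PRICE_TABLE = {
    "version": "2026-06-01",
    "models": {
        "gpt-4o-mini-2024-07-18": {"input": Decimal("0.15"), "output": Decimal("0.60")},
    },
}
MICRO_USD = Decimal("0.000001")   # ledger precision: six decimal places
PER_MILLION = Decimal(1_000_000)


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> dict:
    """tokens x price per token, rounded per component, then summed."""
    prices = PRICE_TABLE["models"][model]
    cost_in = (Decimal(input_tokens) * prices["input"] / PER_MILLION).quantize(
        MICRO_USD, rounding=ROUND_HALF_UP)
    cost_out = (Decimal(output_tokens) * prices["output"] / PER_MILLION).quantize(
        MICRO_USD, rounding=ROUND_HALF_UP)
    return {
        "input": float(cost_in),
        "output": float(cost_out),
        "total": float(cost_in + cost_out),
        "price_table_version": PRICE_TABLE["version"],
    }
```

Using Decimal and summing the rounded components keeps total equal to input plus output in every row, which makes invoice reconciliation easier to audit.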

6 — Retry and fallback

Classify errors: transient (429, 502, 503, timeout) → retry with full jitter, max 3 attempts; permanent (400, 401, invalid model) → fail fast, no retry. Fallback chain per alias: primary OpenAI → secondary Azure OpenAI (same model family) → Bedrock Claude. Log every fallback—silent failover hides primary degradation.
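
One way to sketch that classification and fallback loop, assuming each provider is wrapped in a zero-argument callable. `ProviderError`, `call_with_fallback`, and the delay constants are illustrative names and values, not a library API.

```python
import logging
import random
import time

log = logging.getLogger("gateway.fallback")
TRANSIENT_STATUS = {429, 502, 503}   # timeouts are modeled as TimeoutError
MAX_ATTEMPTS = 3


class ProviderError(Exception):
    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status


def full_jitter_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Full jitter: uniform random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_fallback(chain, sleep=time.sleep):
    """chain: list of (provider_name, callable). Returns (result, provider_name, fallback_used)."""
    last_error = None
    for index, (name, call) in enumerate(chain):
        for attempt in range(MAX_ATTEMPTS):
            try:
                return call(), name, index > 0
            except TimeoutError as exc:
                last_error = exc
            except ProviderError as exc:
                if exc.status not in TRANSIENT_STATUS:
                    raise                     # permanent error: fail fast
                last_error = exc
            if attempt < MAX_ATTEMPTS - 1:
                sleep(full_jitter_delay(attempt))
        # Never fail over silently: record that this provider was exhausted.
        log.warning("provider %s exhausted after %d attempts", name, MAX_ATTEMPTS)
    if last_error is None:
        raise ValueError("fallback chain is empty")
    raise last_error
```

The returned `fallback_used` flag maps directly onto the ledger field of the same name, so dashboards can alert on primary degradation.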

7 — Observability

Emit OpenTelemetry spans: gateway.auth, gateway.cache, gateway.provider. Propagate trace_id to hello-ship logs per Track 5 observability. Dashboard SLIs: p95 latency, error rate, cache hit rate, $/1K requests, 429 rate per tenant.

💰 Cost

A semantic cache hit rate of 15–35% on an FAQ-heavy support bot cuts vendor calls by the same fraction, so inference spend drops by roughly that share minus the embedding cost of each lookup. Log cache_hit and saved_usd per request so FinOps can attribute savings.

⚖️ Trade-off

Semantic cache improves cost and latency but can return stale policy answers after KB updates. Tie cache TTL to content version (kb_revision in cache key) or invalidate on ingest webhook.
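
A small sketch of the versioned-key idea, assuming the gateway knows the current KB revision for each tenant. `exact_cache_key` and the `kb{revision}` key segment are hypothetical conventions for this example.

```python
import hashlib


def exact_cache_key(tenant_id: str, model_alias: str, kb_revision: int,
                    prompt: str, temperature: float = 0.0) -> str:
    """Exact-match key; bumping kb_revision makes older entries unreachable."""
    # Join with a separator that cannot appear in normal text to avoid ambiguity.
    material = "\x1f".join([model_alias, f"{temperature:.2f}", prompt])
    digest = hashlib.sha256(material.encode()).hexdigest()[:16]
    return f"sc:{tenant_id}:{model_alias}:kb{kb_revision}:{digest}"
```

Entries written under an old revision are never read again and simply age out via TTL, so no explicit purge is needed after a KB update.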

🔬 Under the Hood

LiteLLM Proxy implements many responsibilities via config.yaml hooks—budget manager, router, cooldowns. Custom middleware still needed for tenant-specific ACLs and your cost ledger schema.

💡 Pro Tip

Return X-Gateway-Model-Resolved and X-Usage-Total-Tokens response headers so hello-ship can log them without parsing JSON bodies.

Open-source gateways: compare and configure

You rarely need to build a gateway from scratch on day one. These four options cover most teams—from batteries-included proxy to fully DIY.

Gateway comparison

Gateway	Hosting	Strengths	Gaps	Best for
LiteLLM Proxy	Self-hosted (Docker/K8s)	100+ providers, OpenAI-compatible API, budgets, routing, admin UI	Ops burden; semantic cache needs Redis add-on	Platform teams wanting OSS control
Portkey	Managed + self-host option	Gateway + observability, guardrails, prompt management	Vendor lock-in on managed tier pricing	Teams wanting gateway + traces in one SKU
OpenRouter	Managed SaaS	One API key, many models, automatic failover	Data leaves your VPC; limited custom tenant logic	Startups / fast prototypes
Traefik DIY	Self-hosted	Full control, mesh with existing ingress	You build routing, cache, ledger, SDK adapters	Mature platform eng with gateway charter

LiteLLM Proxy — hello-ship config

LiteLLM exposes an OpenAI-compatible /v1/chat/completions. Point hello-ship’s LLM_BASE_URL at the proxy. Model list, budgets, and fallbacks live in litellm_config.yaml.

# litellm_config.yaml — hello-ship gateway
model_list:
  - model_name: support-fast
    litellm_params:
      model: openai/gpt-4o-mini
      api_key: os.environ/OPENAI_API_KEY
  - model_name: support-quality
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: support-quality
    litellm_params:
      model: azure/gpt-4o
      api_base: https://acme.openai.azure.com
      api_key: os.environ/AZURE_OPENAI_KEY
      api_version: "2024-02-15-preview"

router_settings:
  routing_strategy: simple-shuffle
  num_retries: 2
  timeout: 60
  allowed_fails: 3
  cooldown_time: 30

litellm_settings:
  drop_params: true
  set_verbose: false
  cache: true
  cache_params:
    type: redis  # exact-match response cache, not semantic matching
    host: redis.llm-gateway.svc
    port: 6379

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  database_url: os.environ/DATABASE_URL  # spend logs

# hello-ship: call the gateway instead of OpenAI directly
import os
from openai import OpenAI

# messages and tenant_id come from the surrounding request handler.
gateway = OpenAI(
    base_url="http://llm-gateway.internal:4000/v1",
    api_key=os.environ["HELLO_SHIP_GATEWAY_KEY"],
)

resp = gateway.chat.completions.create(
    model="support-fast",  # alias, not gpt-4o-mini
    messages=messages,
    temperature=0,
    # Per-request headers let one shared client serve every tenant.
    extra_headers={
        "X-Tenant-Id": tenant_id,
        "X-Feature-Id": "support-chat",
    },
)
usage = resp.usage  # logged by gateway + returned here

// hello-ship — OpenAI-compatible Java client pointed at LiteLLM Proxy
OpenAIClient gateway = OpenAIOkHttpClient.builder()
    .baseUrl("http://llm-gateway.internal:4000/v1")
    .apiKey(System.getenv("HELLO_SHIP_GATEWAY_KEY"))
    .build();

ChatCompletionCreateParams params = ChatCompletionCreateParams.builder()
    .model("support-fast")
    .addUserMessage(userMessage)
    .temperature(0.0)
    .putAdditionalHeader("X-Tenant-Id", tenantId)
    .putAdditionalHeader("X-Feature-Id", "support-chat")
    .build();

ChatCompletion completion = gateway.chat().completions().create(params);
long totalTokens = completion.usage().map(u -> u.totalTokens()).orElse(0L);

Portkey — gateway headers

Portkey uses a managed gateway URL with config and metadata in headers—useful when you want traces and guardrails without running LiteLLM yourself.

curl https://api.portkey.ai/v1/chat/completions \
  -H "x-portkey-api-key: $PORTKEY_API_KEY" \
  -H "x-portkey-provider: openai" \
  -H "x-portkey-config: pc-support-fast-abc123" \
  -H "x-portkey-metadata: {\"tenant_id\":\"acme-corp\",\"feature\":\"support-chat\"}" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"refund policy?"}]}'

OpenRouter — single key, many models

# OpenRouter: OpenAI SDK compatible; the model string selects the vendor
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)
resp = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",  # or openai/gpt-4o-mini
    messages=messages,
    extra_headers={
        "HTTP-Referer": "https://hello-ship.internal",
        "X-Title": "hello-ship support",
    },
)

Traefik DIY — ingress + custom upstream

Roll your own when LiteLLM abstractions fight your compliance model. Traefik terminates TLS, routes /v1/* to a small FastAPI/Spring gateway service you own—that service implements adapters, ledger, and cache. Higher build cost, zero framework magic.

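# traefik/dynamic/llm-gateway.yml
http:
  routers:
    llm-gateway:
      rule: "Host(`llm-gateway.internal`) && PathPrefix(`/v1`)"
      service: llm-gateway-svc
      middlewares:
        - rate-limit-tenant
        - strip-internal-headers
  services:
    llm-gateway-svc:
      loadBalancer:
        servers:
          - url: "http://llm-gateway-app.llm.svc.cluster.local:8080"
  middlewares:
    rate-limit-tenant:
      rateLimit:
        average: 100
        burst: 200
        # Without sourceCriterion Traefik buckets by client IP, not by tenant.
        sourceCriterion:
          requestHeaderName: X-Tenant-Id
    # Every middleware a router references must also be defined.
    strip-internal-headers:
      headers:
        customRequestHeaders:
          X-Internal-Debug: ""  # hypothetical header name; an empty value removes it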
⚖️ Trade-off

LiteLLM ships fastest; Portkey adds managed observability; OpenRouter minimizes ops but sends prompts to a third party; DIY maximizes control at 3–6 month build cost.

⚠️ Pitfall

Running LiteLLM Proxy without master_key auth on a cluster-internal URL still risks lateral movement after a pod compromise. Require mTLS or signed service tokens even “inside” the VPC.

🏢 Real World

Many Series B companies start on OpenRouter for velocity, migrate to self-hosted LiteLLM at ~$15K/month inference spend when finance demands per-tenant attribution and VPC boundaries.

💡 Pro Tip

Keep hello-ship’s integration test on a mock gateway that returns fixture JSON—same OpenAI response shape. Point staging at real LiteLLM; prod at HA LiteLLM behind load balancer.
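
A mock gateway for integration tests can be as small as a standard-library HTTP server that returns one fixture. The sketch below is an assumption about how such a mock might look, not hello-ship's actual test harness: `FIXTURE`, `MockGatewayHandler`, `start_mock_gateway`, and `chat` are hypothetical names, and the fixture only imitates the OpenAI chat completion shape.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

FIXTURE = {
    "id": "chatcmpl-mock",
    "object": "chat.completion",
    "model": "mock-model",
    "choices": [{"index": 0, "finish_reason": "stop",
                 "message": {"role": "assistant", "content": "pong"}}],
    "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2},
}


class MockGatewayHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)                     # drain the request body
        if self.path != "/v1/chat/completions":
            self.send_error(404)
            return
        body = json.dumps(FIXTURE).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):                   # keep test output quiet
        pass


def start_mock_gateway() -> HTTPServer:
    server = HTTPServer(("127.0.0.1", 0), MockGatewayHandler)  # port 0 picks a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Ignore any HTTP proxy configured in the environment for local calls.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def chat(base_url: str, model: str, content: str) -> dict:
    payload = {"model": model, "messages": [{"role": "user", "content": content}]}
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "X-Tenant-Id": "smoke-test"},
    )
    with _opener.open(req, timeout=5) as resp:
        return json.load(resp)
```

Pointing the app's base URL at `http://127.0.0.1:<port>` in tests exercises the same client code path as staging and prod, with no vendor traffic.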

Feature matrix (detail)

Feature	LiteLLM	Portkey	OpenRouter	Traefik DIY
OpenAI-compatible API	✓	✓	✓	You build
Per-tenant budgets	✓ (DB)	✓	Limited	You build
Semantic cache	Redis plugin	✓ managed	✗	You build
Fallback routing	✓	✓	✓ auto	You build
Spend dashboard	Admin UI	✓	Basic	You build
VPC-only deploy	✓	Self-host	✗	✓
Guardrails integration	Hooks	Native	✗	Your pipeline

Self-hosted vs managed: where keys and data live

The gateway decision overlaps with provider choice. Running self-hosted LiteLLM in your VPC and calling managed AWS Bedrock or Azure OpenAI directly involve different trust and ops trade-offs, and the two are not mutually exclusive.

Teams often run LiteLLM Proxy in EKS with Bedrock and Azure as upstreams—self-hosted gateway, managed inference. Alternatively, Bedrock alone can be your “gateway” if you only need AWS models and IAM auth, accepting vendor lock-in on routing and cache.

Three deployment patterns

Pattern	Gateway	Inference	Keys live in	Typical buyer
A — LiteLLM + multi-vendor	Self-hosted LiteLLM	OpenAI, Anthropic, Bedrock, Azure	Gateway vault	Multi-cloud product teams
B — Bedrock-native	Thin wrapper / SDK	Bedrock only	IAM roles	AWS-all-in enterprises
C — Azure OpenAI-native	APIM + Azure OAI	Azure OpenAI	Azure Key Vault	Microsoft EA customers

LiteLLM self-hosted — pros and cons

Pros: One OpenAI-compatible surface; swap vendors without app changes; full custom ledger schema; semantic cache in your Redis; air-gapped deploy possible.
Cons: You operate HA proxy, Postgres for spend logs, Redis, upgrades; on-call for gateway outages blocks all LLM features.
Cost: Infra ~$200–800/month + engineer time; inference billed per vendor separately.

AWS Bedrock — pros and cons

Pros: IAM auth (no API keys in pods); Claude, Llama, Titan in one AWS bill; private VPC endpoints; enterprise procurement already on AWS.
Cons: No OpenAI API shape without adapter; model lag vs OpenAI day-zero; cross-region failover is your design; limited semantic cache story.
Cost: Per-token Bedrock pricing; no gateway SKU—but you still build routing/ledger or use LiteLLM in front.

Azure OpenAI — pros and cons

Pros: GPT-4o under enterprise DPA; regional deploy for data residency; integrates with Azure APIM for rate limits and keys.
Cons: Quota approval latency; not a multi-vendor router; Anthropic/Google require separate paths.
Cost: Similar per-token to OpenAI; APIM adds per-call charge.

Decision rubric

Requirement	Prefer	Why
Multi-vendor failover	LiteLLM (pattern A)	Bedrock/Azure alone are single-vendor
AWS-only, IAM-only auth	Bedrock (pattern B)	No API keys; Converse API unified
EU data residency + GPT	Azure OpenAI (pattern C)	Regional Azure OAI endpoints
Fastest hello-ship curriculum	LiteLLM + OpenAI mock	OpenAI-compatible; matches Track 1 code
FinOps per-tenant ledger	LiteLLM + custom middleware	Full schema control
Zero gateway ops	OpenRouter or Portkey managed	Trade VPC control for speed

🔒 Security

Bedrock IAM roles beat long-lived API keys for lateral-movement risk. If you use LiteLLM with OpenAI keys, store in AWS Secrets Manager / HashiCorp Vault with automatic rotation—never Kubernetes Secret literals in git.

💰 Cost

Managed Azure OpenAI + APIM can add 5–15% overhead per call vs direct OpenAI—but compliance and consolidated billing often justify it. Model routing to mini tiers via LiteLLM saves more than gateway infra costs.

🎯 Interview Tip

“Bedrock vs OpenAI?” — Not either/or. Position Bedrock for AWS-native IAM and Claude/Llama; OpenAI for latest GPT and ecosystem; LiteLLM gateway unifies both behind one app contract.

🔬 Under the Hood

Azure OpenAI uses the same model weights as OpenAI but different endpoint, api-version query param, and deployment name instead of raw model ID—LiteLLM’s azure/ prefix normalizes this for apps.

Hybrid hello-ship recommendation

For the curriculum: deploy LiteLLM Proxy in docker-compose beside hello-ship, with support-fast pointing upstream at a mock server in CI, at OpenAI in dev, and optionally at Bedrock in staging (for example via LiteLLM's aws_profile_name parameter). One client config, three environments: this mirrors how platform teams ship.

Multi-tenant gateway: namespaces, budgets, ACLs

Once two product lines share the gateway, tenant isolation is not optional. Namespace every limit, cache key, ledger row, and model permission by tenant_id.

Tenant namespace model

A tenant is a billing and policy boundary—customer org, internal business unit, or environment (staging vs prod). hello-ship uses acme-corp, acme-staging, internal-tools. Every gateway subsystem prefixes keys:

Rate limit: rl:{tenant_id}:rpm:{minute}
Semantic cache: sc:{tenant_id}:{model_alias}:{hash}
Cost ledger: partition by tenant_id in Postgres or S3 prefix
Traces: tenant_id tag on every span
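
Centralizing key construction in a few helpers keeps those prefixes consistent across subsystems. This is a sketch with assumptions: the slug rule in `TENANT_RE` and the S3-style layout in `ledger_prefix` are illustrative choices, not part of the hello-ship spec.

```python
import re

# Assumption: tenant IDs are lowercase slugs. Rejecting ":" and "/" keeps one
# tenant from forging a key inside another tenant's namespace.
TENANT_RE = re.compile(r"[a-z0-9][a-z0-9-]{0,62}")


def _checked(tenant_id: str) -> str:
    if not TENANT_RE.fullmatch(tenant_id):
        raise ValueError(f"invalid tenant_id: {tenant_id!r}")
    return tenant_id


def rate_limit_key(tenant_id: str, epoch_seconds: int) -> str:
    return f"rl:{_checked(tenant_id)}:rpm:{epoch_seconds // 60}"


def cache_key(tenant_id: str, model_alias: str, digest: str) -> str:
    return f"sc:{_checked(tenant_id)}:{model_alias}:{digest}"


def ledger_prefix(tenant_id: str, day: str) -> str:
    # Hypothetical S3-style layout; adapt to your own partitioning scheme.
    return f"ledger/tenant_id={_checked(tenant_id)}/dt={day}/"
```

Validating the tenant ID in one place means a malformed or hostile header value fails before it reaches Redis or the ledger.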

Cost allocation

Finance wants showback/chargeback by tenant and feature. Aggregate ledger events daily:

Dimension	Source field	Report use
Tenant	tenant_id	Per-customer invoice / internal budget
Feature	feature_id	Which product surface spent tokens
Model	model_resolved	FinOps model mix optimization
Environment	tenant_id suffix or tag	Exclude staging from customer bills
Cache savings	cache_hit + counterfactual cost	Prove cache ROI

"""Aggregate cost ledger JSONL → reports/cost/{tenant}/daily.json"""
import json
from collections import defaultdict
from pathlib import Path

def rollup(path: Path, day: str) -> dict:
    totals = defaultdict(lambda: {"tokens": 0, "usd": 0.0, "requests": 0})
    for line in path.read_text().splitlines():
        ev = json.loads(line)
        if not ev["timestamp"].startswith(day):
            continue
        key = (ev["tenant_id"], ev.get("feature_id", "unknown"))
        totals[key]["tokens"] += ev["usage"]["total_tokens"]
        totals[key]["usd"] += ev["cost_usd"]["total"]
        totals[key]["requests"] += 1
    return {f"{t}:{f}": v for (t, f), v in totals.items()}

/** Roll up cost ledger JSONL by tenant + feature for a given day. */
public Map<String, TenantCostRollup> rollup(Path ledgerFile, String dayPrefix) throws IOException {
  Map<String, TenantCostRollup> totals = new HashMap<>();
  try (var lines = Files.lines(ledgerFile)) {
    for (String line : (Iterable<String>) lines::iterator) {
      CostLedgerEvent ev = mapper.readValue(line, CostLedgerEvent.class);
      if (!ev.timestamp().startsWith(dayPrefix)) continue;
      String key = ev.tenantId() + ":" + ev.featureId();
      totals.computeIfAbsent(key, k -> new TenantCostRollup())
        .add(ev.usage().totalTokens(), ev.costUsd().total());
    }
  }
  return totals;
}

Model access control

Not every tenant gets frontier models. Define tenant_policy.yaml:

tenants:
  acme-corp:
    plan: enterprise
    allowed_models: [support-fast, support-quality, embed-small]
    daily_token_cap: 2_000_000
    rpm: 300
  acme-free-trial:
    plan: free
    allowed_models: [support-fast]
    daily_token_cap: 50_000
    rpm: 30
  internal-tools:
    plan: internal
    allowed_models: ["*"]
    daily_token_cap: 10_000_000
    rpm: 600

The gateway rejects support-quality requests from acme-free-trial with a 403 and a structured model_not_allowed error, which prevents surprise frontier spend from sandbox accounts.
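
A minimal sketch of that ACL check, assuming the policy file has already been parsed into a dictionary. `TENANT_POLICY` mirrors the YAML above; `check_model_access` and the error body layout are illustrative.

```python
from typing import Optional, Tuple

# Assumption: the parsed content of tenant_policy.yaml, reduced to the ACL field.
TENANT_POLICY = {
    "acme-corp": {"allowed_models": ["support-fast", "support-quality", "embed-small"]},
    "acme-free-trial": {"allowed_models": ["support-fast"]},
    "internal-tools": {"allowed_models": ["*"]},
}


def check_model_access(tenant_id: str, model_alias: str) -> Tuple[int, Optional[dict]]:
    """Return (HTTP status, structured error body or None when allowed)."""
    policy = TENANT_POLICY.get(tenant_id)
    if policy is None:
        return 403, {"error": {"code": "unknown_tenant", "tenant_id": tenant_id}}
    allowed = policy["allowed_models"]
    if "*" in allowed or model_alias in allowed:
        return 200, None
    return 403, {"error": {"code": "model_not_allowed",
                           "tenant_id": tenant_id,
                           "model": model_alias}}
```

Run this check before rate limiting and cache lookup so a disallowed alias never consumes quota or returns a cached frontier answer.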

Usage dashboards

Minimum viable dashboard panels for platform on-call:

Panel	Query	Alert threshold
Spend today by tenant	sum(cost_usd.total) by tenant_id	> 120% of 7d avg
p95 gateway latency	histogram(latency_ms.total)	> 3s for 5m
429 rate	rate(status=429)	> 5% of traffic
Cache hit rate	cache_hit=true / total	< 5% (misconfigured)
Fallback rate	fallback_used=true	> 10% (primary unhealthy)
Error rate by provider	status=5xx by provider	> 1% any provider
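
Several of these panels can be computed straight from ledger events. The sketch below assumes events shaped like the cost ledger schema; `p95`, `panel_metrics`, and `spend_alert` are hypothetical helpers, and a real dashboard would run the equivalent query in its metrics backend.

```python
from typing import Sequence


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile, using integer math for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n)
    return ordered[rank - 1]


def panel_metrics(events: Sequence[dict]) -> dict:
    if not events:
        raise ValueError("no ledger events in window")
    total = len(events)
    return {
        "cache_hit_rate": sum(1 for e in events if e["cache_hit"]) / total,
        "fallback_rate": sum(1 for e in events if e["fallback_used"]) / total,
        "p95_latency_ms": p95([e["latency_ms"]["total"] for e in events]),
    }


def spend_alert(today_usd: float, last_7_days_usd: Sequence[float]) -> bool:
    """True when today's spend exceeds 120% of the trailing 7-day average."""
    average = sum(last_7_days_usd) / len(last_7_days_usd)
    return today_usd > 1.2 * average
```

Computing the panels from the same ledger rows that finance uses keeps on-call dashboards and cost reports from drifting apart.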

⚠️ Pitfall

Shared semantic cache across tenants leaks answers—a cached response for tenant A’s policy query must never return to tenant B. Always include tenant_id in cache index namespace.

🏢 Real World

SaaS AI platforms expose a “Usage” tab per workspace fed by the same ledger schema—customers self-serve before finance tickets. Internal tenants get Grafana; external get Metronome/Stripe metering integration.

💡 Pro Tip

Wire daily_token_cap to a soft warning at 80% (Slack) and hard block at 100% (429). Lets tenant admins upgrade before hard outage.
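
The two thresholds reduce to a small state function. This sketch uses integer math to avoid float edge cases; `budget_state` and its return labels are illustrative, and the Slack notification itself is left out.

```python
def budget_state(used_tokens: int, daily_token_cap: int, warn_percent: int = 80) -> str:
    """Return "ok", "warn" (soft threshold reached), or "block" (cap reached)."""
    if used_tokens >= daily_token_cap:
        return "block"   # gateway answers 429 from here on
    if used_tokens * 100 >= warn_percent * daily_token_cap:
        return "warn"    # notify tenant admins, keep serving
    return "ok"
```

Emit the warning only on the transition from "ok" to "warn" so a busy tenant does not flood the channel with one message per request.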

⚖️ Trade-off

Strict model ACLs reduce bill shock but slow innovation—give internal-tools wildcard access and prod tenants explicit allowlists with quarterly review.

FAQ: Tenant vs API key?

One gateway API key per service (hello-ship API, worker fleet); X-Tenant-Id identifies the end customer. Avoid minting a gateway key per customer; a policy table keyed by tenant_id scales better, especially once tenants number in the thousands.

Production: Track 6 gateway checklist

Close Guide 2 with a gateway shipping checklist—auth, routing, limits, cache, ledger, failover, observability—then continue to fine-tuning & adaptation for when routing alone is not enough.

Track 6 gateway production checklist

No vendor API keys in application repos or hello-ship env—gateway vault only.
All inference traffic routes through the gateway; a CI grep blocks direct vendor endpoints (for example api.openai.com) and vendor keys in app code.
Model aliases in git (gateway_models.yaml); pinned vendor IDs per environment.
Per-tenant rate limits: RPM + daily token cap with 429 + Retry-After.
Semantic cache namespaced by tenant_id; TTL tied to KB revision where applicable.
Cost ledger schema v1 emitted on every request; daily rollup job per tenant.
Fallback chain configured and tested; alerts on fallback_used > 10%.
Structured traces with trace_id, tenant_id, model_resolved, cache_hit.
Gateway HA: ≥2 replicas, health probe, graceful shutdown draining in-flight streams.
Model ACL policy enforced; free-tier cannot invoke frontier aliases.
Dashboards: spend, p95 latency, 429 rate, cache hit rate, provider errors.
Incident runbook: gateway down → enable read-only mode / cached responses / status page.
Pre-ship load test: 2× expected peak RPM without tripping vendor limits.
Reconcile ledger vs vendor invoice weekly; drift > 5% triggers investigation.
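
The weekly reconciliation in the last item comes down to one ratio. A sketch, assuming both totals cover the same period and provider; `invoice_drift` and `needs_investigation` are hypothetical helper names.

```python
def invoice_drift(ledger_usd: float, invoice_usd: float) -> float:
    """Relative gap between the gateway ledger total and the vendor invoice."""
    if invoice_usd <= 0:
        raise ValueError("invoice total must be positive")
    return abs(ledger_usd - invoice_usd) / invoice_usd


def needs_investigation(ledger_usd: float, invoice_usd: float, threshold: float = 0.05) -> bool:
    # Strictly greater than the threshold, matching "drift > 5%".
    return invoice_drift(ledger_usd, invoice_usd) > threshold
```

A ledger that runs consistently below the invoice usually points at traffic bypassing the gateway or a stale price table, so check both before blaming the vendor.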

Risk	Gateway control	Signal	Owner
Runaway token spend	Daily cap + model ACL	tenant 429 + spend alert	Platform / FinOps
Key leak in git	Gateway-only keys	secret scanner CI	Security
Vendor outage	Fallback chain	fallback_used rate	Platform on-call
Cross-tenant cache leak	Namespace isolation	security audit + unit tests	Platform
Stale cached policy	KB revision in cache key	faithfulness eval regression	ML + Product
Retry storm on 429	Jittered backoff + key pool	provider 429 rate drop	Platform

Pre-ship gate

Gate	Threshold	Blocking?
gateway_smoke.sh (mock upstream)	100% pass	Yes
No direct vendor endpoint or key in app code (grep)	0 matches	Yes
Ledger row on every integration test call	100% present	Yes
Fallback drill in staging	Primary blocked → secondary succeeds	Yes for release
Load test 2× peak RPM	p95 < 3s, error < 0.5%	Warn then block
Eval smoke via gateway	≥ baseline pass rate	Yes

#!/usr/bin/env bash
# gateway_smoke.sh — run after deploy
set -euo pipefail
GW="${LLM_GATEWAY_URL:-http://localhost:4000}"
KEY="${HELLO_SHIP_GATEWAY_KEY:?}"
curl -sf "$GW/health" | grep -q ok
curl -sf "$GW/v1/chat/completions" \
  -H "Authorization: Bearer $KEY" \
  -H "X-Tenant-Id: smoke-test" \
  -H "Content-Type: application/json" \
  -d '{"model":"support-fast","messages":[{"role":"user","content":"ping"}]}' \
  | jq -e '.choices[0].message.content'
echo "gateway_smoke ok"

📦 End of Guide 2

You can justify a centralized gateway, implement seven responsibilities, choose among LiteLLM/Portkey/OpenRouter/DIY, design multi-tenant isolation, and ship with the checklist above. Guide 3 covers fine-tuning & adaptation—when gateway routing and RAG are not enough to hit quality SLOs.

🔒 Security

Gateway logs contain prompts—apply the same redaction pipeline as Track 5 guardrails before logs ship to third-party observability vendors.

🎯 Interview Tip

Whiteboard the gateway box between apps and clouds. List seven responsibilities without naming a vendor first—shows platform thinking; then mention LiteLLM as implementation detail.

🔬 Under the Hood

Guide 3’s fine-tuning path still calls inference through this gateway—training jobs submit to a separate fine-tune API, but serving the adapted model uses the same model_alias map.

Track 6 guide sequence

Guide	Topic	You are here
Guide 1	hello-ship explained	← prev
Guide 2	LLM gateway & infrastructure	✓
Guide 3	Fine-tuning & adaptation	next →
Guide 4+	Cost control, flags, deploy, multimodal	—

Handoff to fine-tuning

Export these gateway fields before Guide 3 wiring:

model_alias and model_resolved per request
cost_usd.total aggregated per feature
tenant_id for adaptation path experiments
cache_hit to avoid tuning on cached traffic
Baseline eval pass rate via gateway (same path as prod)

FAQ: Build or buy gateway?

Buy (Portkey/OpenRouter) until multi-tenant ledger + VPC + custom ACLs block you—usually $10–30K/month inference or compliance review. Build (LiteLLM self-hosted) when platform team exists and two product lines share models.

FAQ: Gateway before or after hello-ship?

Guide 1 ships hello-ship calling a mock. Guide 2 inserts the gateway—minimal hello-ship diff is base URL + headers. Do not rewrite hello-ship; reroute it.