Model selection & cost optimization
Frontier models are priced per token, not per user. The difference between GPT-4o and GPT-4o-mini on a high-volume feature is often 10–30× on the monthly bill—not a rounding error. This guide covers pricing baselines, cascade routing, caching, batch APIs, local inference trade-offs, and the FinOps controls that keep spend predictable in production.
After reading, you should be able to: read a provider price sheet and compute monthly cost; design a Haiku→Sonnet→Opus cascade with confidence escalation; apply prompt and semantic caching where ROI is clear; choose batch vs real-time paths; estimate when self-hosting beats hosted APIs; build a model selection matrix for your product; wire budget alerts and per-tenant quotas.
Pricing baseline: cost per million tokens
Every hosted LLM API bills in two buckets: input tokens (everything you send) and output tokens (everything the model generates). Prices are quoted per 1 million tokens and change frequently—treat the table below as order-of-magnitude planning numbers, not a contract.
Output is almost always more expensive than input because generation runs autoregressively, one forward pass per generated token, whereas the prompt is processed in a single parallel pass. A chat feature that returns 800 tokens when 200 would suffice is burning budget through a loose max_tokens cap and prompt design, not through model choice alone.
| Model | Input / 1M tokens | Output / 1M tokens | Typical role |
|---|---|---|---|
| GPT-4o | ~$2.50 – $5 | ~$10 – $15 | Complex reasoning, agents, high-stakes extraction |
| GPT-4o-mini | ~$0.15 | ~$0.60 | Classification, routing, bulk summarization |
| Claude 3.5 Sonnet | ~$3 | ~$15 | Long-context RAG, coding, instruction following |
| Claude 3 Haiku | ~$0.25 | ~$1.25 | First-pass answers, triage, high-volume chat |
| Gemini 1.5 Flash | ~$0.075 | ~$0.30 | Very cheap first pass, multimodal ingest |
Prices change. OpenAI, Anthropic, and Google adjust list rates when new model generations ship; enterprise contracts (Bedrock, Azure OpenAI, Vertex) may differ from public API pricing. Before you commit to a margin model or customer pricing, pull current rates from provider docs and re-run your spreadsheet monthly.
Input vs output: why the split matters
Consider a support bot averaging 2,000 input tokens (system prompt + RAG chunks + user message) and 400 output tokens, at 50,000 requests per day on GPT-4o at ~$5 / $15 per million:
- Input cost/day: 2,000 × 50,000 × ($5 / 1,000,000) = $500
- Output cost/day: 400 × 50,000 × ($15 / 1,000,000) = $300
- Total: ~$800/day (~$24K/month) — before caching, routing, or failures/retries
The same traffic on GPT-4o-mini at ~$0.15 / $0.60 drops to roughly $27/day ($15 input + $12 output), a ~30× swing driven entirely by model ID. That is why routing and caps belong in architecture, not as a post-launch optimization.
Quick per-request estimate: cost = (input_tokens × price_in + output_tokens × price_out) / 1_000_000. Log both token counts on every API response—OpenAI exposes usage.prompt_tokens and usage.completion_tokens; Anthropic uses input_tokens / output_tokens.
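As a quick sanity check, the per-request formula can be scaled to a daily figure. The sketch below is illustrative only: `daily_cost` is a hypothetical helper, and the prices are the planning numbers from the table above, not current list rates.

```python
def daily_cost(input_tokens: int, output_tokens: int, requests_per_day: int,
               price_in: float, price_out: float) -> float:
    """Return USD per day; prices are quoted in USD per 1M tokens."""
    total_in = input_tokens * requests_per_day
    total_out = output_tokens * requests_per_day
    return (total_in * price_in + total_out * price_out) / 1_000_000


# Support-bot traffic from the example: 2,000 in / 400 out, 50,000 requests/day.
gpt4o = daily_cost(2_000, 400, 50_000, price_in=5.0, price_out=15.0)
mini = daily_cost(2_000, 400, 50_000, price_in=0.15, price_out=0.60)
print(f"GPT-4o: ${gpt4o:,.2f}/day, GPT-4o-mini: ${mini:,.2f}/day")
```

Multiplying token totals before dividing by one million keeps the arithmetic easy to audit against a billing export.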
A common pitfall is budgeting with the list price for GPT-4o while shipping 4o-mini in code, or the reverse. Pin model IDs in config, log them per request, and reconcile billing exports against application logs weekly. Silent env drift (staging model ≠ prod model) can cause six-figure surprises at scale.
Add a “cost preview” column to your feature spec template: avg input, avg output, daily volume, model ID, estimated $/day. If the number exceeds a threshold (e.g. $500/day), the spec must include routing or caching strategy before eng kickoff.
Model routing: cascades and confidence escalation
Routing sends each request to the cheapest model that can satisfy quality SLOs. A cascade tries a small model first and escalates only when confidence is low—Haiku → Sonnet → Opus on Anthropic, or 4o-mini → 4o on OpenAI.
Most production traffic is “easy”: FAQ lookup, intent classification, short summarization, formatting JSON. Reserve frontier models for the long tail where small models fail evals. At 70–90% easy-case rates, cascades routinely cut spend 60–80% with negligible quality loss if escalation triggers are tuned on real data.
Cascade pattern: Haiku → Sonnet → Opus
- Tier 0 (Haiku / 4o-mini / Flash) — answer or classify with temperature 0 and a strict output schema.
- Confidence check — logprobs threshold, JSON schema validation, self-consistency vote, or a dedicated “needs escalation” classifier.
- Tier 1 (Sonnet / GPT-4o) — retry with richer model if check fails; pass prior attempt as context (“your draft was rejected because…”).
- Tier 2 (Opus / o1-class) — only for flagged high-value or high-complexity paths (legal review queue, executive summary).
Confidence-based escalation signals
| Signal | How it works | Trade-off |
|---|---|---|
| Logprobs / token probability | Low average logprob on answer tokens → escalate | Not all APIs expose logprobs; calibration varies by model |
| Schema validation | JSON / Pydantic parse fails → escalate | Deterministic; misses semantically wrong but valid JSON |
| Self-check prompt | Second call: “Is this answer supported by context? yes/no” | Extra latency and cost on every request unless batched |
| Routing classifier | Small model or embedding k-NN picks tier before main call | Requires labeled routing dataset and periodic retrain |
| Human queue | Escalate to review UI instead of bigger model | Best for compliance; not a latency win |
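The logprob signal in the table can be sketched as a pure function over per-token log probabilities. This is a minimal illustration, not a calibrated policy: `should_escalate`, the 0.80 threshold, and the sample values are assumptions. With OpenAI Chat Completions, the inputs would come from `choices[0].logprobs.content[i].logprob` when the request sets `logprobs=True`.

```python
import math


def mean_token_probability(token_logprobs: list[float]) -> float:
    """Geometric-mean probability of the answer tokens (exp of the mean logprob)."""
    if not token_logprobs:
        return 0.0  # no tokens means no evidence, so treat it as zero confidence
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_escalate(token_logprobs: list[float], min_probability: float = 0.80) -> bool:
    """Escalate when the cheap model was, on average, unsure of its own tokens."""
    return mean_token_probability(token_logprobs) < min_probability


confident = [-0.01, -0.02, -0.05]  # hypothetical logprobs from a confident answer
hesitant = [-0.9, -1.4, -0.3]      # hypothetical logprobs from a shaky answer
print(should_escalate(confident), should_escalate(hesitant))
```

The threshold has to be tuned per model on labeled traffic, because calibration differs between models.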
Routing classifier with a small model
Train or prompt a tiny model to predict tier=cheap|standard|frontier from the user query and metadata (tenant plan, document length, detected language). Keep the classifier prompt under 500 tokens; run at temperature 0. Log predicted tier vs actual escalation rate—if you escalate >40% of “cheap” routes, the classifier needs more labels or features.
```python
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Prices are USD per 1M tokens, kept next to the model ID for cost logging.
TIERS = [
    {"name": "cheap", "model": "gpt-4o-mini", "price_in": 0.15, "price_out": 0.60},
    {"name": "standard", "model": "gpt-4o", "price_in": 5.0, "price_out": 15.0},
]


@dataclass
class RouteResult:
    text: str
    model: str
    input_tokens: int
    output_tokens: int
    escalated: bool


def call_model(model: str, messages: list) -> RouteResult:
    # JSON mode requires the prompt itself to ask for JSON output.
    resp = client.chat.completions.create(
        model=model, temperature=0, max_tokens=512, messages=messages,
        response_format={"type": "json_object"},
    )
    u = resp.usage
    return RouteResult(
        text=resp.choices[0].message.content or "",
        model=model,
        input_tokens=u.prompt_tokens,
        output_tokens=u.completion_tokens,
        escalated=False,
    )


def needs_escalation(draft_json: str) -> bool:
    # Replace with Pydantic validation, logprobs, or classifier score
    return '"confidence"' not in draft_json or '"low"' in draft_json


def cascade_chat(messages: list) -> RouteResult:
    # Tier 0: try the cheap model first and keep its answer if it passes the check.
    cheap = call_model(TIERS[0]["model"], messages)
    if not needs_escalation(cheap.text):
        return cheap
    # Tier 1: retry on the larger model, passing the rejected draft as context.
    standard = call_model(TIERS[1]["model"], messages + [
        {"role": "assistant", "content": cheap.text},
        {"role": "user", "content": "Previous answer failed validation. Try again."},
    ])
    standard.escalated = True
    return standard
```
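```java
import java.util.ArrayList;
import java.util.List;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.messages.AssistantMessage;
import org.springframework.ai.chat.messages.Message;
import org.springframework.ai.chat.messages.UserMessage;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.prompt.ChatOptions;

// Sketch assuming the Spring AI 1.0 API; builder and accessor names differ in earlier milestones.
record RouteResult(String text, String model, int inputTokens,
                   int outputTokens, boolean escalated) {}

public class CascadeRouter {
    private static final String CHEAP = "gpt-4o-mini";
    private static final String STANDARD = "gpt-4o";
    private final ChatClient cheapClient;
    private final ChatClient standardClient;

    public CascadeRouter(ChatClient.Builder builder) {
        this.cheapClient = builder.clone().defaultOptions(
            ChatOptions.builder().model(CHEAP).temperature(0.0).build()).build();
        this.standardClient = builder.clone().defaultOptions(
            ChatOptions.builder().model(STANDARD).temperature(0.0).build()).build();
    }

    public RouteResult cascade(List<Message> messages) {
        var cheap = invoke(cheapClient, CHEAP, messages, false);
        if (!needsEscalation(cheap.text())) return cheap;
        // Pass the rejected draft back so the larger model can correct it.
        var retryMessages = new ArrayList<>(messages);
        retryMessages.add(new AssistantMessage(cheap.text()));
        retryMessages.add(new UserMessage("Previous answer failed validation. Try again."));
        return invoke(standardClient, STANDARD, retryMessages, true);
    }

    // Runs one call and captures token usage for cost logging.
    private RouteResult invoke(ChatClient client, String model,
                               List<Message> messages, boolean escalated) {
        ChatResponse resp = client.prompt().messages(messages).call().chatResponse();
        var usage = resp.getMetadata().getUsage();
        return new RouteResult(resp.getResult().getOutput().getText(), model,
            usage.getPromptTokens(), usage.getCompletionTokens(), escalated);
    }

    // Placeholder check; replace with schema validation or a classifier score.
    private boolean needsEscalation(String json) {
        return !json.contains("\"confidence\"") || json.contains("\"low\"");
    }
}
```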
Cascades add tail latency when you escalate (two sequential API calls). Calling both tiers in parallel hides that latency, but every request then pays for the larger model even when the cheap answer is accepted, so it is usually not worth it. Better: tighten cheap-tier prompts so the escalation rate stays under 15–20% on production traffic.
Customer support platforms route “where is my order?” to mini-class models and escalate billing disputes to Sonnet-class with full account history. The classifier is re-evaluated weekly against human-labeled tickets—routing drift is an ops task, not a one-time ML project.
“How would you cut LLM cost 50% without hurting quality?” — Answer with cascade + caching + max_tokens caps + per-feature dashboards. Mention measuring escalation rate and running A/B evals before changing tiers—not blind downgrades.
Prompt caching and semantic caching
Long static prefixes—system prompts, tool definitions, RAG policy blocks—are re-sent on every request. Prompt caching lets providers reuse computed attention states for identical prefixes. Semantic caching stores prior Q→A pairs and returns a hit when a new question is embedding-similar.
Anthropic prefix caching (~90% input discount on cache hits)
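Mark cacheable blocks with cache_control: {type: "ephemeral"} on system content or tool definitions. The first request pays a premium over the normal input price to write the cache (about 25% at the time of writing); subsequent requests with the same prefix read cached tokens at roughly 10% of normal input cost (verify current Anthropic pricing, since rates evolve). The default TTL is about five minutes and is refreshed on each hit, with a longer TTL available at a higher write price. Prefixes shorter than a model-specific minimum (on the order of 1,024 tokens) are not cached. This makes it ideal for stable system prompts at high QPS.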
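To make the marker placement concrete, here is a minimal sketch of a Messages API request body with a cacheable system block. It only builds the payload and sends nothing; `build_cached_request` is a hypothetical helper, and the model ID and prompt text are placeholders to replace with your own.

```python
import json

# Placeholder: in practice this is the long, static policy or tool text.
CACHEABLE_SYSTEM_PROMPT = "You are a support assistant. <long policy text here>"


def build_cached_request(user_message: str,
                         model: str = "claude-3-5-sonnet-latest") -> dict:
    """Request body with the static system block marked as cacheable."""
    return {
        "model": model,  # placeholder model ID; pin the one you actually deploy
        "max_tokens": 512,
        "system": [
            {
                "type": "text",
                "text": CACHEABLE_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Variable content stays outside the cached prefix.
        "messages": [{"role": "user", "content": user_message}],
    }


payload = build_cached_request("Where is my order?")
print(json.dumps(payload, indent=2))
```

On the response, `usage.cache_creation_input_tokens` and `usage.cache_read_input_tokens` indicate whether a request wrote or read the cache, which is the number to chart when estimating hit rate.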
OpenAI automatic prefix cache
OpenAI applies automatic caching when the prompt prefix matches a recent request (often 1,024+ token prefixes). You do not opt in explicitly—check usage.prompt_tokens_details.cached_tokens on responses. Structure prompts so static content comes first and variable user/RAG content comes last to maximize hit rate.
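The ordering advice above can be sketched as follows. Both helpers are hypothetical: `build_messages` keeps the static system prompt first, and `cached_fraction` reads the cached-token count from a usage payload that is assumed to have been converted to a plain dict.

```python
def build_messages(static_system: str, rag_chunks: list[str],
                   user_message: str) -> list[dict]:
    """Static content first, variable content last, so the shared prefix stays identical."""
    context = "\n\n".join(rag_chunks)
    return [
        {"role": "system", "content": static_system},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_message}"},
    ]


def cached_fraction(usage: dict) -> float:
    """Share of prompt tokens served from cache, from a usage payload as a dict."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    return cached / prompt_tokens if prompt_tokens else 0.0


messages = build_messages("You are a support assistant.", ["chunk A", "chunk B"], "Where is my order?")
sample_usage = {"prompt_tokens": 2000, "prompt_tokens_details": {"cached_tokens": 1024}}
print(f"cached share: {cached_fraction(sample_usage):.1%}")
```

Logging this fraction per feature shows quickly whether a prompt change has pushed variable content ahead of the static prefix.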
Semantic caching (GPTCache, Redis + embeddings)
Application-level cache flow:
- Embed user query (small embedding model—cheap).
- Vector search Redis / pgvector for neighbors above similarity threshold (e.g. cosine > 0.92).
- On hit, return stored answer; on miss, call LLM and write (embedding, answer) with TTL.
Libraries like GPTCache wrap this pattern. Works best for repetitive FAQ, internal policy bots, and developer doc assistants—not for personalized or time-sensitive answers unless you partition cache keys by tenant + doc version.
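The flow above can be sketched in memory. This is an illustration under stated assumptions, not how GPTCache is implemented: `SemanticCache` is a hypothetical class using a linear scan instead of Redis or pgvector, and `toy_embed` is a keyword counter standing in for a real embedding model.

```python
import math
import time
from typing import Callable, Optional


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """In-memory stand-in for a vector index; linear scan, so demo-sized only."""

    def __init__(self, embed: Callable[[str], list[float]],
                 threshold: float = 0.92, ttl_seconds: float = 3600.0):
        self.embed = embed
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        # namespace -> list of (embedding, answer, expiry timestamp)
        self._entries: dict[str, list[tuple[list[float], str, float]]] = {}

    def get(self, namespace: str, query: str) -> Optional[str]:
        now = time.time()
        vector = self.embed(query)
        best_answer, best_score = None, self.threshold
        for stored_vector, answer, expires_at in self._entries.get(namespace, []):
            if expires_at < now:
                continue  # expired entries are ignored
            score = cosine(vector, stored_vector)
            if score > best_score:
                best_answer, best_score = answer, score
        return best_answer

    def put(self, namespace: str, query: str, answer: str) -> None:
        entry = (self.embed(query), answer, time.time() + self.ttl_seconds)
        self._entries.setdefault(namespace, []).append(entry)


def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model: counts a few keywords."""
    vocab = ["order", "refund", "where", "password", "reset"]
    words = text.lower().replace("?", "").split()
    return [float(words.count(term)) for term in vocab]


cache = SemanticCache(toy_embed)
cache.put("tenant-a:kb-v3", "Where is my order?", "Check the tracking page.")
print(cache.get("tenant-a:kb-v3", "where is my order"))  # same keywords, same tenant
print(cache.get("tenant-b:kb-v3", "where is my order"))  # different tenant namespace
print(cache.get("tenant-a:kb-v3", "reset my password"))  # unrelated question
```

The namespace string carries the tenant and document version, so bumping the version on a knowledge-base update effectively invalidates old entries.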
| Cache type | Match key | Best for | Risk |
|---|---|---|---|
| Provider prefix cache | Exact byte prefix | Long static system prompts, tool schemas | Low—deterministic |
| Exact hash cache | SHA256(full prompt) | Idempotent batch jobs | Stale if upstream data changes |
| Semantic cache | Embedding similarity | FAQ, support macros | Wrong answer if threshold too loose |
Prefix caching saves prefill compute on the provider side—they pass savings as discounted input tokens. Semantic caching saves entire model calls but introduces a false-hit risk; always log cache hit/miss and sample human review on hits until precision is proven above 99% for your domain.
Never cache responses that include user-specific PII across tenants. Namespace cache keys by tenant_id + prompt_version + model_id. Invalidate semantic cache on policy or knowledge-base updates—stale compliance answers are a liability, not a cost win.
Example: 8K-token system prompt, 100 req/s, $3/M input on Sonnet. Uncached, the prompt alone costs 8,000 × 100 × 3,600 × $3 / 1,000,000 ≈ $8,640/hour. With 80% prefix cache hits at a 90% discount on the cached portion, input cost drops roughly 70% (0.8 × 0.9 = 72%), which is often worth more than switching models if your prompt is large and static.
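The same estimate can be written as a small function. This is a simplified sketch: `hourly_input_cost` is a hypothetical helper that covers only the static prefix and ignores any cache-write premium.

```python
def hourly_input_cost(prompt_tokens: int, requests_per_second: float,
                      price_in_per_m: float, hit_rate: float = 0.0,
                      cached_discount: float = 0.0) -> float:
    """USD per hour for the static prefix; ignores any cache-write premium."""
    tokens_per_hour = prompt_tokens * requests_per_second * 3_600
    uncached_cost = tokens_per_hour * price_in_per_m / 1_000_000
    # Only the hit fraction gets the discount; misses pay full price.
    return uncached_cost * (1 - hit_rate * cached_discount)


uncached = hourly_input_cost(8_000, 100, 3.0)
cached = hourly_input_cost(8_000, 100, 3.0, hit_rate=0.80, cached_discount=0.90)
print(f"uncached: ${uncached:,.0f}/h, cached: ${cached:,.0f}/h, "
      f"saving: {1 - cached / uncached:.0%}")
```

Sweeping `hit_rate` from 0.5 to 0.95 is a quick way to see how sensitive the saving is to prompt churn.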
Batch API: 50% discount, 24-hour turnaround
OpenAI’s Batch API (and similar async endpoints from other vendors) processes large job files within ~24 hours at roughly 50% off standard rates. You upload JSONL requests, poll for completion, download results—no streaming, no SLA for minutes-level latency.
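A batch input file is one JSON request per line. The sketch below builds such a file in memory for a summarization job; `batch_line` and `build_batch_jsonl` are hypothetical helpers, and the prompt and document IDs are made up for illustration.

```python
import json


def batch_line(custom_id: str, model: str, system_prompt: str, user_text: str,
               max_tokens: int = 256) -> str:
    """One JSONL line in the request shape used by the OpenAI Batch API."""
    request = {
        "custom_id": custom_id,  # echoed back in the results for matching
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "max_tokens": max_tokens,
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_text},
            ],
        },
    }
    return json.dumps(request)


def build_batch_jsonl(documents: dict[str, str], model: str = "gpt-4o-mini") -> str:
    """Join one request per document; keys double as stable custom IDs."""
    lines = [
        batch_line(doc_id, model, "Summarize the document in two sentences.", text)
        for doc_id, text in sorted(documents.items())
    ]
    return "\n".join(lines) + "\n"


jsonl = build_batch_jsonl({"doc-002": "Second document.", "doc-001": "First document."})
print(jsonl)
```

With the official Python SDK, the file would then be uploaded with `client.files.create(..., purpose="batch")` and submitted with `client.batches.create(...)` using a `completion_window` of `"24h"`. Results arrive in a separate output file and are not guaranteed to preserve input order, which is why the `custom_id` matters.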
When batch fits
- Nightly eval runs over golden datasets (Track 5 regression suites)
- Bulk document summarization, classification, or enrichment pipelines
- Backfill after schema or prompt version changes
- Embedding or caption generation for offline indexes
- Monthly report generation, invoice parsing, CRM hygiene
When real-time wins
- User-facing chat, copilot, or search-with-answer
- Agent tool loops that need sub-second tool selection
- Streaming UX (SSE to browser)
- Any workflow blocked on LLM output to continue the transaction
Hybrid architectures are common: a real-time path for interactive features, and a batch path for analytics and evals at half price. The prompt templates are the same; only the transport differs. Give each batch line a stable custom_id so results can be matched to inputs and a retried upload can skip items that already completed, rather than paying for them twice.
Run your CI eval suite via Batch on every main merge instead of synchronous API calls: you cut CI LLM spend roughly in half and avoid rate-limit flakes. Because results can take hours to arrive, gate release promotion on the downloaded results rather than on live latency.
Batch jobs share provider capacity; during incidents completion can slip past 24h. Design batch consumers to be restartable and idempotent. Do not use batch for safety-critical same-day workflows unless you have a real-time fallback path.
Content platforms batch-translate or tag millions of catalog rows overnight while live product pages use Flash-class models for on-demand “explain this spec” buttons. Product managers see one feature; infra sees two cost profiles.
Local inference: Ollama, vLLM, llama.cpp
Self-hosting moves spend from per-token OPEX to GPU CAPEX + ops. Tools like Ollama (dev and small prod), vLLM (high-throughput serving), and llama.cpp (CPU/edge) expose OpenAI-compatible HTTP APIs over open weights (Llama, Mistral, Phi).
| Runtime | Strength | Typical deployment |
|---|---|---|
| Ollama | Fast local dev, simple model pull, low ops | Laptop, single GPU box, internal tools |
| vLLM | Continuous batching, high QPS on A100/H100 | K8s GPU pool, dedicated inference fleet |
| llama.cpp | CPU inference, quantized models, edge | On-prem air-gap, embedded, cost-sensitive batch |
When local wins
- Volume — millions of tokens/day where GPU amortization beats API list price (model size dependent).
- Privacy — data cannot leave VPC; regulated workloads (health, finance, gov).
- Fine-tuned weights — custom LoRA adapters on Llama; no hosted equivalent.
- Offline / edge — ships without reliable cloud connectivity.
Ops cost trade-off
A single L4 or A10G instance (24 GB of VRAM) running 8B to roughly 30B quantized models might cost $500–2,000/month reserved; a 70B model needs roughly 35–40 GB even at 4-bit, so it calls for a larger or multi-GPU instance. Either way, you own utilization risk, model updates, security patches, and eval regressions when you bump weights. Staff time often exceeds GPU rent for teams without an existing ML platform. Rule of thumb: prove ROI with hosted APIs and traffic metrics first; self-host when marginal token cost × volume exceeds fully loaded infra cost for 6+ months.
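The rule of thumb can be turned into a rough break-even estimate. Everything numeric below is a hypothetical input (GPU rent, staff cost, output share), and the sketch does not check whether the hardware can actually serve the resulting volume.

```python
def blended_price_per_m(price_in: float, price_out: float, output_share: float) -> float:
    """Average API price per 1M tokens, given the fraction of tokens that are output."""
    return price_in * (1 - output_share) + price_out * output_share


def breakeven_tokens_per_month(monthly_infra_usd: float, monthly_staff_usd: float,
                               api_price_per_m: float) -> float:
    """Monthly token volume above which self-hosting is cheaper than the API."""
    fully_loaded = monthly_infra_usd + monthly_staff_usd
    return fully_loaded / api_price_per_m * 1_000_000


# Hypothetical inputs: $1,500 GPU rent, $4,000 of engineering time, 15% output tokens.
mini_price = blended_price_per_m(0.15, 0.60, output_share=0.15)
frontier_price = blended_price_per_m(5.0, 15.0, output_share=0.15)
vs_mini = breakeven_tokens_per_month(1_500, 4_000, mini_price)
vs_frontier = breakeven_tokens_per_month(1_500, 4_000, frontier_price)
print(f"break-even vs mini-class API: {vs_mini / 1e9:.1f}B tokens/month")
print(f"break-even vs frontier API:   {vs_frontier / 1e9:.2f}B tokens/month")
```

Under these assumptions the break-even against mini-class pricing is about 25 billion tokens per month, versus under 1 billion against frontier pricing, so the comparison only makes sense against the hosted tier the open-weights model would really replace.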
“Free” Ollama on a laptop does not predict production cost. Throughput, batching, failover, and observability on vLLM at scale are different engineering problems. Pilot on one GPU with production prompts before committing to a fleet.
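vLLM uses PagedAttention to store the KV cache in fixed-size blocks, which reduces memory fragmentation and, together with continuous batching, lets it serve many variable-length sequences at once; that is why it beats naive Hugging Face serving on QPS. Quantization (AWQ, GPTQ) cuts VRAM so larger models fit smaller GPUs at a slight quality cost; measure on your eval set.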
Cost estimation: templates and worked examples
Ship features with a spreadsheet—or code—that turns token averages and traffic into dollars. The canonical formula is simple; the hard part is honest token counts from production-shaped prompts.
Template
monthly_cost = (avg_input × price_in + avg_output × price_out) / 1_000_000 × requests_per_day × 30
Extend with cache hit rate, escalation rate, and retry multiplier: effective_cost = base_cost × (1 - cache_savings) × (1 + escalation_rate × tier_multiplier) × retry_factor.
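The extended formula translates directly into code. In this sketch `effective_cost` is a hypothetical helper, `tier_multiplier` is read as the cost of an escalated call relative to a base call, and the sample inputs are invented for illustration.

```python
def effective_cost(base_cost: float, cache_savings: float = 0.0,
                   escalation_rate: float = 0.0, tier_multiplier: float = 0.0,
                   retry_factor: float = 1.0) -> float:
    """Apply cache savings, escalation overhead, and retries to a base cost."""
    after_cache = base_cost * (1 - cache_savings)
    # Every request pays the base tier; the escalated share pays the larger tier on top.
    after_escalation = after_cache * (1 + escalation_rate * tier_multiplier)
    return after_escalation * retry_factor


# Hypothetical feature: $1,000/month base, 20% cache savings,
# 15% of requests escalate to a tier costing 20x a base call, 5% retries.
estimate = effective_cost(1_000, cache_savings=0.20, escalation_rate=0.15,
                          tier_multiplier=20, retry_factor=1.05)
print(f"effective monthly cost: ${estimate:,.2f}")
```

Escalation dominates here: a modest escalation rate into a much pricier tier quadruples the bill, which is why escalation rate deserves its own dashboard line.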
Worked example: RAG support feature
| Parameter | Value |
|---|---|
| Model | GPT-4o-mini |
| Price in / out (per 1M) | $0.15 / $0.60 |
| Avg input tokens | 3,500 (system + 4 RAG chunks + history) |
| Avg output tokens | 350 |
| Requests / day | 20,000 |
| Per-request cost | (3500×0.15 + 350×0.60) / 1e6 ≈ $0.000735 |
| Monthly (30 days) | 0.000735 × 20,000 × 30 ≈ $441 |
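Same volume on GPT-4o at $5 / $15: per-request ≈ $0.0228, monthly ≈ $13,650, about 31× higher. An 85% mini / 15% 4o split blends to roughly $2,400/month (0.85 × $441 + 0.15 × $13,650). In a true cascade the escalated 15% also pay for the failed mini attempt, and if the retry doubles token use on the second tier the total is closer to $4,500/month.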
Per-feature dashboards
Tag every LLM call with feature, tenant_id, model, and computed cost_usd. Aggregate in your metrics stack (Datadog, Grafana, CloudWatch) or warehouse (BigQuery, Snowflake):
- Daily spend by feature — catches runaway cron jobs
- p50 / p95 tokens in and out — informs prompt compression work
- Escalation rate by route — tunes cascade thresholds
- Cost per successful task — ties dollars to business outcome, not raw calls
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    name: str
    price_in_per_m: float
    price_out_per_m: float


PRICES = {
    "gpt-4o-mini": ModelPrice("gpt-4o-mini", 0.15, 0.60),
    "gpt-4o": ModelPrice("gpt-4o", 5.0, 15.0),
    "claude-haiku": ModelPrice("claude-haiku", 0.25, 1.25),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p.price_in_per_m + output_tokens * p.price_out_per_m) / 1_000_000


def monthly_cost(model: str, avg_in: int, avg_out: int, req_per_day: int, days: int = 30) -> float:
    per_req = request_cost(model, avg_in, avg_out)
    return per_req * req_per_day * days


# RAG support example
print(f"mini/month: ${monthly_cost('gpt-4o-mini', 3500, 350, 20_000):,.2f}")
print(f"4o/month: ${monthly_cost('gpt-4o', 3500, 350, 20_000):,.2f}")
```
```java
import java.util.Map;

record ModelPrice(String name, double priceInPerM, double priceOutPerM) {}

public final class LlmCostCalculator {
    // USD per 1M tokens; planning numbers, not live rates.
    private static final Map<String, ModelPrice> PRICES = Map.of(
        "gpt-4o-mini", new ModelPrice("gpt-4o-mini", 0.15, 0.60),
        "gpt-4o", new ModelPrice("gpt-4o", 5.0, 15.0)
    );

    private LlmCostCalculator() {}

    public static double requestCost(String model, int inputTokens, int outputTokens) {
        var p = PRICES.get(model);
        if (p == null) {
            // Fail loudly instead of throwing a NullPointerException on an unpriced model.
            throw new IllegalArgumentException("No price configured for model: " + model);
        }
        return (inputTokens * p.priceInPerM() + outputTokens * p.priceOutPerM()) / 1_000_000.0;
    }

    public static double monthlyCost(String model, int avgIn, int avgOut, int reqPerDay, int days) {
        return requestCost(model, avgIn, avgOut) * reqPerDay * days;
    }
}
```
Whiteboard: “10M requests/month, 2K in / 300 out, which model and what budget?” Walk through formula, cite mini vs 4o, mention cascade and caching—not just picking the biggest model.
Selection framework: complexity, latency, cost, privacy
Model choice is a multi-objective decision—not “always use the best benchmark score.” Use a decision matrix per feature class; validate with your eval set, using public benchmarks only to shortlist.
Decision dimensions
- Task complexity — extraction vs multi-hop reasoning vs open-ended writing
- Latency SLO — TTFT and total ms budget for UX tier (chat vs async email)
- Cost envelope — target $/successful task or $/tenant/month
- Privacy / residency — data leaves region? BAA required? air-gap?
- Context needs — max document size without truncation
- Tooling — JSON schema, function calling, vision, audio
Benchmark awareness (MMLU, HumanEval, etc.)
MMLU measures broad knowledge multiple-choice; HumanEval measures code synthesis pass@1. Leaderboard gains of 2–3 points rarely predict your invoice-parser accuracy. Use benchmarks to eliminate clearly weak candidates, then run golden-set evals (Track 5) on production prompts. Log regression when providers silently upgrade model snapshots.
| Feature profile | Complexity | Latency | Recommended tier | Notes |
|---|---|---|---|---|
| Intent classifier | Low | <200ms | Haiku / 4o-mini / Flash | Temperature 0; few-shot or fine-tune if needed |
| RAG Q&A (internal docs) | Medium | <3s TTFT streaming | Sonnet / 4o-mini cascade | Prefix-cache long system prompt |
| Code review agent | High | <30s | Sonnet / GPT-4o | HumanEval-like internal suite |
| Executive brief from 50 PDFs | High | Async OK | Batch + Gemini Pro / Sonnet | Long context; batch discount |
| PHI summarization | Medium | <5s | Bedrock / VPC-hosted Llama | Privacy overrides cost |
| Multimodal OCR + Q&A | Medium–High | <5s | GPT-4o / Gemini Flash | See multimodal guide |
Optimizing only for MMLU pushes you to frontier models everywhere. Optimizing only for cost pushes you to mini models that fail on edge cases users remember. The matrix forces explicit trade-offs per feature—not one global default.
Document the chosen tier in an ADR: eval scores, cost projection, fallback model, and review date. Revisit when traffic 10×’s or a new model generation ships—whichever comes first.
Production: FinOps checklist and guardrails
Cost optimization without controls is a one-time spreadsheet exercise. Production needs budgets, alerts, quotas, and ownership—the same discipline you apply to cloud compute and egress.
FinOps checklist
- Central LLM gateway — one place for keys, routing, rate limits, and cost attribution
- Per-request cost log — model, tokens, feature, tenant, USD estimate
- Budget alerts — daily/weekly thresholds at org and feature level (PagerDuty / Slack)
- Per-tenant quotas — hard caps on tokens/day; graceful degradation message
- max_tokens enforced server-side — never trust client-supplied limits alone
- Prompt version registry — tie spend spikes to prompt deploys
- Monthly FinOps review — top 5 expensive features, escalation rates, cache hit rates
- Provider invoice reconciliation — billing CSV vs application logs within 5%
Per-tenant quotas pattern
Store rolling token usage in Redis keyed by tenant_id:yyyy-mm-dd. Before each call, increment estimated tokens; reject with 429 and upgrade CTA when over plan limit. Enterprise tenants get higher caps and optional burst allowance with finance approval.
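The quota pattern can be sketched with an in-memory counter standing in for Redis. `TenantQuota` and its limits are hypothetical; in a real deployment the counter would typically be a Redis key incremented with INCRBY and given an expiry, and the check-then-increment below would need to be made atomic across processes.

```python
from collections import defaultdict
from datetime import date, datetime, timezone
from typing import Optional


class TenantQuota:
    """In-memory stand-in for Redis counters keyed by tenant_id:yyyy-mm-dd."""

    def __init__(self, daily_limits: dict[str, int], default_limit: int = 100_000):
        self.daily_limits = daily_limits
        self.default_limit = default_limit
        self._used: dict[str, int] = defaultdict(int)

    @staticmethod
    def _key(tenant_id: str, day: date) -> str:
        return f"{tenant_id}:{day.isoformat()}"

    def try_reserve(self, tenant_id: str, estimated_tokens: int,
                    day: Optional[date] = None) -> bool:
        """Reserve tokens before the call; False means the caller should return 429."""
        day = day or datetime.now(timezone.utc).date()
        key = self._key(tenant_id, day)
        limit = self.daily_limits.get(tenant_id, self.default_limit)
        if self._used[key] + estimated_tokens > limit:
            return False
        self._used[key] += estimated_tokens
        return True


quota = TenantQuota({"acme": 1_000})
today = date(2025, 1, 1)
print(quota.try_reserve("acme", 600, today))  # fits within the daily limit
print(quota.try_reserve("acme", 600, today))  # would exceed the limit, so rejected
```

Reserving an estimate up front and reconciling with actual usage after the response keeps a burst of concurrent requests from overshooting the cap.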
| Control | Implementation | Alert when |
|---|---|---|
| Daily org budget | Sum cost_usd in metrics DB | >80% of budget by 3pm UTC |
| Feature anomaly | Compare to 7-day rolling median | 2× median for 1 hour |
| Tenant abuse | Per-tenant token counter | Quota exceeded or 10× usual |
| Model drift | Eval pass rate in CI | >5% drop vs baseline |
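Two of the alert rules in the table reduce to simple comparisons. The function names and sample figures below are hypothetical; the thresholds (2× the rolling median, 80% of budget) come from the table.

```python
from statistics import median


def feature_spend_alert(hourly_history: list[float], current_hour_spend: float,
                        multiplier: float = 2.0) -> bool:
    """True when this hour's spend exceeds a multiple of the rolling median."""
    if not hourly_history:
        return False  # no baseline yet, so do not alert
    baseline = median(hourly_history)
    return baseline > 0 and current_hour_spend > multiplier * baseline


def org_budget_alert(spend_so_far: float, daily_budget: float,
                     fraction: float = 0.80) -> bool:
    """True when spend so far today has crossed the given share of the daily budget."""
    return spend_so_far > fraction * daily_budget


history = [10.0, 12.0, 11.0, 9.0, 10.0]  # hypothetical hourly spend in USD
print(feature_spend_alert(history, 25.0))  # well above twice the median
print(org_budget_alert(850.0, 1_000.0))    # past 80% of the daily budget
```

Using the median rather than the mean keeps a single earlier spike from raising the baseline and masking the next one.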
Track 1 wrap-up
You now have the full LLM foundation stack: how models work, APIs and context (coming guides), embeddings, multimodal inputs, and this cost playbook. Multimodal features (PDF, image, audio) multiply token counts—size those paths with the same calculator and routing patterns before enabling them for all tenants.
Quotas mitigate cost abuse but not prompt injection. Pair spend controls with input validation, output filtering, and RBAC on which features can call which models (e.g. only backend services may invoke Opus-tier).
SaaS AI add-ons often ship with “AI credits” derived from token budgets per seat. Engineering exposes remaining_credits from the same counters used for FinOps—product and infra share one source of truth.
Track 1 complete when you can answer: “What did we spend yesterday by feature and tenant?” and “What happens when we hit budget?” If the answer is unknown, wire logging and quotas before scaling traffic.