Model selection & cost optimization

Pricing baseline: cost per million tokens

Every hosted LLM API bills in two buckets: input tokens (everything you send) and output tokens (everything the model generates). Prices are quoted per 1 million tokens and change frequently—treat the table below as order-of-magnitude planning numbers, not a contract.

Output is almost always more expensive than input because generation runs autoregressively, with one forward pass per generated token, whereas the input is processed in a single parallel prefill pass. A chat feature that returns 800 tokens when 200 would suffice is wasting budget through a loose max_tokens cap and prompt design; switching models alone will not fix that.

Model	Input / 1M tokens	Output / 1M tokens	Typical role
GPT-4o	~$2.50 – $5	~$10 – $15	Complex reasoning, agents, high-stakes extraction
GPT-4o-mini	~$0.15	~$0.60	Classification, routing, bulk summarization
Claude 3.5 Sonnet	~$3	~$15	Long-context RAG, coding, instruction following
Claude 3 Haiku	~$0.25	~$1.25	First-pass answers, triage, high-volume chat
Gemini 1.5 Flash	~$0.075	~$0.30	Very cheap first pass, multimodal ingest

Prices change. OpenAI, Anthropic, and Google adjust list rates when new model generations ship; enterprise contracts (Bedrock, Azure OpenAI, Vertex) may differ from public API pricing. Before you commit to a margin model or customer pricing, pull current rates from provider docs and re-run your spreadsheet monthly.

Input vs output: why the split matters

Consider a support bot averaging 2,000 input tokens (system prompt + RAG chunks + user message) and 400 output tokens, at 50,000 requests per day on GPT-4o at ~$5 / $15 per million:

Input cost/day: 2,000 × 50,000 × ($5 / 1,000,000) = $500
Output cost/day: 400 × 50,000 × ($15 / 1,000,000) = $300
Total: ~$800/day (~$24K/month) — before caching, routing, or failures/retries

The same traffic on GPT-4o-mini at ~$0.15 / $0.60 drops to roughly $27/day ($15 input + $12 output), about a 30× swing driven entirely by model ID. That is why routing and caps belong in the architecture, not in a post-launch optimization pass.
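
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the PRICES table and the daily_cost helper are illustrative names, and the prices are the planning numbers from the table, not current quotes.

```python
# USD per 1M tokens (input, output); planning numbers only, assumed from the table above.
PRICES = {
    "gpt-4o": (5.00, 15.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def daily_cost(model: str, input_tokens: int, output_tokens: int,
               requests_per_day: int) -> float:
    """Estimated USD per day for one feature on one model."""
    price_in, price_out = PRICES[model]
    per_request = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
    return per_request * requests_per_day

# Support bot from the example: 2,000 input tokens, 400 output tokens, 50,000 requests/day.
for model in PRICES:
    print(f"{model}: ${daily_cost(model, 2_000, 400, 50_000):,.2f}/day")
```

Swapping in your own measured token averages is usually more informative than changing the prices, since token counts tend to be the less certain input.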

💰 Cost

Quick per-request estimate: cost = (input_tokens × price_in + output_tokens × price_out) / 1_000_000. Log both token counts on every API response—OpenAI exposes usage.prompt_tokens and usage.completion_tokens; Anthropic uses input_tokens / output_tokens.

⚠️ Pitfall

Budgeting with list price for GPT-4o while shipping 4o-mini in code—or the reverse. Pin model IDs in config, log them per request, and reconcile billing exports against application logs weekly. Silent env drift (staging model ≠ prod model) has caused six-figure surprises at scale.

💡 Pro Tip

Add a “cost preview” column to your feature spec template: avg input, avg output, daily volume, model ID, estimated $/day. If the number exceeds a threshold (e.g. $500/day), the spec must include routing or caching strategy before eng kickoff.

Model routing: cascades and confidence escalation

Routing sends each request to the cheapest model that can satisfy quality SLOs. A cascade tries a small model first and escalates only when confidence is low—Haiku → Sonnet → Opus on Anthropic, or 4o-mini → 4o on OpenAI.

Most production traffic is “easy”: FAQ lookup, intent classification, short summarization, formatting JSON. Reserve frontier models for the long tail where small models fail evals. At 70–90% easy-case rates, cascades routinely cut spend 60–80% with negligible quality loss if escalation triggers are tuned on real data.
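
To see how the escalation rate drives savings, here is a rough sketch that compares a two-tier cascade against sending everything to the larger model. The function names, default token counts, and prices are assumptions for illustration, and the model ignores the extra tokens an escalated call carries when it includes the rejected draft.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in: float, price_out: float) -> float:
    """USD for one request, with prices quoted per 1M tokens."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

def cascade_savings(escalation_rate: float,
                    input_tokens: int = 2_000, output_tokens: int = 400,
                    cheap: tuple = (0.15, 0.60),
                    frontier: tuple = (5.00, 15.00)) -> float:
    """Fraction of spend saved versus routing every request to the frontier tier."""
    cheap_cost = request_cost(input_tokens, output_tokens, *cheap)
    frontier_cost = request_cost(input_tokens, output_tokens, *frontier)
    # Every request pays for the cheap attempt; escalated ones also pay for the frontier call.
    cascade_cost = cheap_cost + escalation_rate * frontier_cost
    return 1 - cascade_cost / frontier_cost

for rate in (0.10, 0.20, 0.30):
    print(f"escalation {rate:.0%}: savings {cascade_savings(rate):.1%}")
```

Under these assumed prices, escalation rates of 10–30% land at roughly 67–87% savings, which is in line with the range quoted above. Quality loss is not modeled here and still has to be measured on real data.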

Cascade pattern: Haiku → Sonnet → Opus

Tier 0 (Haiku / 4o-mini / Flash) — answer or classify with temperature 0 and a strict output schema.
Confidence check — logprobs threshold, JSON schema validation, self-consistency vote, or a dedicated “needs escalation” classifier.
Tier 1 (Sonnet / GPT-4o) — retry with richer model if check fails; pass prior attempt as context (“your draft was rejected because…”).
Tier 2 (Opus / o1-class) — only for flagged high-value or high-complexity paths (legal review queue, executive summary).

Confidence-based escalation signals

Signal	How it works	Trade-off
Logprobs / token probability	Low average logprob on answer tokens → escalate	Not all APIs expose logprobs; calibration varies by model
Schema validation	JSON / Pydantic parse fails → escalate	Deterministic; misses semantically wrong but valid JSON
Self-check prompt	Second call: “Is this answer supported by context? yes/no”	Extra latency and cost on every request unless batched
Routing classifier	Small model or embedding k-NN picks tier before main call	Requires labeled routing dataset and periodic retrain
Human queue	Escalate to review UI instead of bigger model	Best for compliance; not a latency win
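
Two of the signals in the table, schema validation and a logprob threshold, can be combined in a single check. The sketch below uses only the standard library; REQUIRED_FIELDS, the -0.5 threshold, and the function names are hypothetical choices, and token_logprobs is assumed to be a list of per-token log probabilities that your provider may or may not expose.

```python
import json

REQUIRED_FIELDS = {"answer", "confidence"}  # hypothetical output schema

def schema_ok(draft_json: str) -> bool:
    """True when the draft parses as a JSON object containing the required fields."""
    try:
        data = json.loads(draft_json)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_FIELDS.issubset(data)

def avg_logprob(token_logprobs: list) -> float:
    """Mean log probability; an empty list counts as maximally uncertain."""
    if not token_logprobs:
        return float("-inf")
    return sum(token_logprobs) / len(token_logprobs)

def needs_escalation(draft_json: str, token_logprobs: list,
                     min_avg_logprob: float = -0.5) -> bool:
    if not schema_ok(draft_json):
        return True  # deterministic signal: output is unusable as-is
    if json.loads(draft_json)["confidence"] == "low":
        return True  # the model flagged its own answer
    return avg_logprob(token_logprobs) < min_avg_logprob  # threshold needs calibration
```

The threshold is not portable across models, so it is worth tuning against labeled escalation outcomes rather than copying a value.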

Routing classifier with a small model

Train or prompt a tiny model to predict tier=cheap|standard|frontier from the user query and metadata (tenant plan, document length, detected language). Keep the classifier prompt under 500 tokens; run at temperature 0. Log predicted tier vs actual escalation rate—if you escalate >40% of “cheap” routes, the classifier needs more labels or features.

from dataclasses import dataclass
from openai import OpenAI

client = OpenAI()

TIERS = [
    {"name": "cheap", "model": "gpt-4o-mini", "price_in": 0.15, "price_out": 0.60},
    {"name": "standard", "model": "gpt-4o", "price_in": 5.0, "price_out": 15.0},
]

@dataclass
class RouteResult:
    text: str
    model: str
    input_tokens: int
    output_tokens: int
    escalated: bool

def call_model(model: str, messages: list) -> RouteResult:
    # JSON mode: the OpenAI API rejects the request unless some message mentions "JSON".
    resp = client.chat.completions.create(
        model=model, temperature=0, max_tokens=512, messages=messages,
        response_format={"type": "json_object"},
    )
    u = resp.usage
    return RouteResult(
        text=resp.choices[0].message.content or "",
        model=model,
        input_tokens=u.prompt_tokens,
        output_tokens=u.completion_tokens,
        escalated=False,
    )

def needs_escalation(draft_json: str) -> bool:
    # Replace with Pydantic validation, logprobs, or classifier score
    return '"confidence"' not in draft_json or '"low"' in draft_json

def cascade_chat(messages: list) -> RouteResult:
    cheap = call_model(TIERS[0]["model"], messages)
    if not needs_escalation(cheap.text):
        return cheap
    standard = call_model(TIERS[1]["model"], messages + [
        {"role": "assistant", "content": cheap.text},
        {"role": "user", "content": "Previous answer failed validation. Try again."},
    ])
    standard.escalated = True
    return standard

public record RouteResult(String text, String model, int inputTokens,
                          int outputTokens, boolean escalated) {}

public class CascadeRouter {
  private final ChatClient cheapClient;
  private final ChatClient standardClient;

  public CascadeRouter(ChatClient.Builder builder) {
    this.cheapClient = builder.clone().defaultOptions(
        ChatOptionsBuilder.builder().withModel("gpt-4o-mini").withTemperature(0.0).build()
    ).build();
    this.standardClient = builder.clone().defaultOptions(
        ChatOptionsBuilder.builder().withModel("gpt-4o").withTemperature(0.0).build()
    ).build();
  }

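  public RouteResult cascade(List<Message> messages) {
    // invoke(...) is not shown in the source; it is assumed to call the client and
    // map the response text and token usage into a RouteResult.
    var cheap = invoke(cheapClient, messages);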
    if (!needsEscalation(cheap.text())) return cheap;
    var retryMessages = new ArrayList<>(messages);
    retryMessages.add(new AssistantMessage(cheap.text()));
    retryMessages.add(new UserMessage("Previous answer failed validation. Try again."));
    var standard = invoke(standardClient, retryMessages);
    return new RouteResult(standard.text(), "gpt-4o", standard.inputTokens(),
        standard.outputTokens(), true);
  }

  private boolean needsEscalation(String json) {
    return !json.contains("\"confidence\"") || json.contains("\"low\"");
  }
}

⚖️ Trade-off

Cascades add tail latency when you escalate (two sequential API calls). You can hide that latency by calling both tiers in parallel and discarding the larger model's answer when the cheap one passes, but then you pay for the larger model on every request, which usually defeats the purpose. Better: tighten cheap-tier prompts so the escalation rate stays under 15–20% on production traffic.

📦 Real World

Customer support platforms route “where is my order?” to mini-class models and escalate billing disputes to Sonnet-class with full account history. The classifier is re-evaluated weekly against human-labeled tickets—routing drift is an ops task, not a one-time ML project.

🎯 Interview Tip

“How would you cut LLM cost 50% without hurting quality?” — Answer with cascade + caching + max_tokens caps + per-feature dashboards. Mention measuring escalation rate and running A/B evals before changing tiers—not blind downgrades.

Prompt caching and semantic caching

Long static prefixes—system prompts, tool definitions, RAG policy blocks—are re-sent on every request. Prompt caching lets providers reuse computed attention states for identical prefixes. Semantic caching stores prior Q→A pairs and returns a hit when a new question is embedding-similar.

Anthropic prefix caching (~90% input discount on cache hits)

Mark cacheable blocks with cache_control: {type: "ephemeral"} on system content or tool definitions. The first request writes the cache and pays at least the full input price (Anthropic has charged a premium of roughly 25% on cache writes); subsequent requests with the same prefix read cached tokens at roughly 10% of normal input cost (verify current Anthropic pricing, since rates evolve). The TTL is short (about 5 minutes by default, refreshed on each hit, with a longer option at extra cost), so caching pays off mainly for stable system prompts under high QPS.
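
As a sketch of what the cache_control marker looks like in practice, the helper below builds the keyword arguments for an Anthropic Messages API call without sending anything. build_cached_request is a hypothetical helper, and the model alias and max_tokens value are placeholders to replace with whatever you pin in config.

```python
def build_cached_request(system_prompt: str, user_message: str,
                         model: str = "claude-3-5-sonnet-latest") -> dict:
    """Keyword arguments for a Messages API call with a cacheable system prefix."""
    return {
        "model": model,  # placeholder alias; pin an explicit model ID in production
        "max_tokens": 512,
        "system": [
            {
                "type": "text",
                "text": system_prompt,  # long, static content goes here
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Variable content stays outside the cached prefix.
        "messages": [{"role": "user", "content": user_message}],
    }

request = build_cached_request("You are a support assistant. <policy text>", "Where is my order?")
print(request["system"][0]["cache_control"])
```

With the official SDK this dict would typically be passed as client.messages.create(**request), and the response's usage object reports cache_creation_input_tokens and cache_read_input_tokens so you can confirm hits. Prompts shorter than the provider's minimum cacheable length are not cached, so check the current documentation for that limit.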

OpenAI automatic prefix cache

OpenAI applies automatic caching when the prompt prefix matches a recent request (often 1,024+ token prefixes). You do not opt in explicitly—check usage.prompt_tokens_details.cached_tokens on responses. Structure prompts so static content comes first and variable user/RAG content comes last to maximize hit rate.

Semantic caching (GPTCache, Redis + embeddings)

Application-level cache flow:

Embed user query (small embedding model—cheap).
Vector search Redis / pgvector for neighbors above similarity threshold (e.g. cosine > 0.92).
On hit, return stored answer; on miss, call LLM and write (embedding, answer) with TTL.

Libraries like GPTCache wrap this pattern. Works best for repetitive FAQ, internal policy bots, and developer doc assistants—not for personalized or time-sensitive answers unless you partition cache keys by tenant + doc version.
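
The three-step flow above can be sketched with an in-memory store. This is an illustration only: SemanticCache and toy_embed are hypothetical names, the linear scan stands in for a Redis or pgvector similarity search, and toy_embed (letter counts) is a stand-in for a real embedding model.

```python
import math
import time
from typing import Callable, Optional

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, embed: Callable[[str], list],
                 threshold: float = 0.92, ttl_seconds: float = 3600.0):
        self.embed = embed
        self.threshold = threshold
        self.ttl_seconds = ttl_seconds
        self.entries = []  # list of (embedding, answer, expires_at)

    def get(self, query: str) -> Optional[str]:
        now = time.time()
        self.entries = [e for e in self.entries if e[2] > now]  # drop expired entries
        vec = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) > self.threshold:
            return best[1]  # hit: skip the LLM call
        return None  # miss: caller invokes the LLM, then calls put()

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer, time.time() + self.ttl_seconds))

def toy_embed(text: str) -> list:
    """Stand-in embedding (letter counts); replace with a real embedding model."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts

cache = SemanticCache(toy_embed)
cache.put("where is my order", "Check the tracking page.")
print(cache.get("Where is my order?"))
print(cache.get("cancel subscription"))
```

In a multi-tenant system the cache instance or key space would also be partitioned by tenant and document version, as the paragraph above notes.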

Cache type	Match key	Best for	Risk
Provider prefix cache	Exact byte prefix	Long static system prompts, tool schemas	Low—deterministic
Exact hash cache	SHA256(full prompt)	Idempotent batch jobs	Stale if upstream data changes
Semantic cache	Embedding similarity	FAQ, support macros	Wrong answer if threshold too loose

🔬 Under the Hood

Prefix caching saves prefill compute on the provider side—they pass savings as discounted input tokens. Semantic caching saves entire model calls but introduces a false-hit risk; always log cache hit/miss and sample human review on hits until precision is proven above 99% for your domain.

🔒 Security

Never cache responses that include user-specific PII across tenants. Namespace cache keys by tenant_id + prompt_version + model_id. Invalidate semantic cache on policy or knowledge-base updates—stale compliance answers are a liability, not a cost win.

💰 Cost

Example: 8K-token system prompt, 100 req/s, $3/M input on Sonnet. Uncached input for the system prompt alone ≈ 8,000 × 100 × 3,600 × $3 / 1,000,000 = $8,640/hour. With 80% prefix cache hits at a 90% discount on the cached portion, that input cost drops by about 72% (0.8 × 0.9), which is often worth more than switching models if your prompt is large and static.
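
The same estimate in code, as a minimal sketch. The helper names are illustrative, and the model deliberately ignores any premium charged for cache writes, so treat the result as an upper bound on savings.

```python
def hourly_input_cost(prompt_tokens: int, requests_per_second: float,
                      price_in_per_m: float) -> float:
    """USD per hour spent on a fixed prompt prefix with no caching."""
    return prompt_tokens * requests_per_second * 3600 * price_in_per_m / 1_000_000

def cached_cost_factor(hit_rate: float, cached_discount: float) -> float:
    """Fraction of the uncached bill that remains after prefix caching."""
    return (1 - hit_rate) + hit_rate * (1 - cached_discount)

uncached = hourly_input_cost(8_000, 100, 3.00)
factor = cached_cost_factor(hit_rate=0.80, cached_discount=0.90)
print(f"uncached: ${uncached:,.0f}/hour")
print(f"with caching: ${uncached * factor:,.0f}/hour ({1 - factor:.0%} saved)")
```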

Batch API: 50% discount, 24-hour turnaround

OpenAI’s Batch API (and similar async endpoints from other vendors) processes large job files within ~24 hours at roughly 50% off standard rates. You upload JSONL requests, poll for completion, download results—no streaming, no SLA for minutes-level latency.

When batch fits

Nightly eval runs over golden datasets (Track 5 regression suites)
Bulk document summarization, classification, or enrichment pipelines
Backfill after schema or prompt version changes
Embedding or caption generation for offline indexes
Monthly report generation, invoice parsing, CRM hygiene

When real-time wins

User-facing chat, copilot, or search-with-answer
Agent tool loops that need sub-second tool selection
Streaming UX (SSE to browser)
Any workflow blocked on LLM output to continue the transaction

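Hybrid architectures are common: a real-time path for interactive features, and a batch path for analytics and evals at half price. Same prompt templates, different transport. Give every batch line item a stable unique ID (custom_id in OpenAI's JSONL format) so that you can match results to requests and skip items that already completed when you retry an upload, rather than paying for them twice.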
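
For reference, a batch input file is one JSON object per line. The sketch below builds such a file body in memory for the chat completions endpoint; build_batch_jsonl is a hypothetical helper, and the model and max_tokens defaults are placeholders. Uploading the file and creating the batch job are left out.

```python
import json

def build_batch_jsonl(prompts: dict, model: str = "gpt-4o-mini",
                      max_tokens: int = 256) -> str:
    """Return JSONL text with one chat-completion request per line.

    `prompts` maps a stable custom_id (for example a document ID) to the prompt text.
    """
    lines = []
    for custom_id, prompt in prompts.items():
        lines.append(json.dumps({
            "custom_id": custom_id,  # used later to match results to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return "\n".join(lines)

print(build_batch_jsonl({"doc-1": "Summarize document A", "doc-2": "Summarize document B"}))
```

Deriving custom_id from the source record rather than from a counter makes reruns restartable: completed IDs can be filtered out before the next upload.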

💡 Pro Tip

Run your CI eval suite via Batch on every main merge instead of synchronous API calls—you cut CI LLM spend in half and avoid rate-limit flakes. Gate merges on downloaded results, not on live latency.

⚖️ Trade-off

Batch jobs share provider capacity; during incidents completion can slip past 24h. Design batch consumers to be restartable and idempotent. Do not use batch for safety-critical same-day workflows unless you have a real-time fallback path.

📦 Real World

Content platforms batch-translate or tag millions of catalog rows overnight while live product pages use Flash-class models for on-demand “explain this spec” buttons. Product managers see one feature; infra sees two cost profiles.

Local inference: Ollama, vLLM, llama.cpp

Self-hosting moves spend from per-token OPEX to GPU CAPEX + ops. Tools like Ollama (dev and small prod), vLLM (high-throughput serving), and llama.cpp (CPU/edge) expose OpenAI-compatible HTTP APIs over open weights (Llama, Mistral, Phi).

Runtime	Strength	Typical deployment
Ollama	Fast local dev, simple model pull, low ops	Laptop, single GPU box, internal tools
vLLM	Continuous batching, high QPS on A100/H100	K8s GPU pool, dedicated inference fleet
llama.cpp	CPU inference, quantized models, edge	On-prem air-gap, embedded, cost-sensitive batch

When local wins

Volume — millions of tokens/day where GPU amortization beats API list price (model size dependent).
Privacy — data cannot leave VPC; regulated workloads (health, finance, gov).
Fine-tuned weights — custom LoRA adapters on Llama; no hosted equivalent.
Offline / edge — ships without reliable cloud connectivity.

Ops cost trade-off

A single L4 or A10G instance (24 GB of VRAM) running quantized models in the 8B to roughly 30B range might cost $500–2,000/month reserved; a 70B model needs about 35 GB for 4-bit weights alone, so it calls for a larger or multi-GPU instance. Either way, you own utilization risk, model updates, security patches, and eval regressions when you bump weights. Staff time often exceeds GPU rent for teams without an existing ML platform. Rule of thumb: prove ROI with hosted APIs and traffic metrics first; self-host when marginal token cost × volume exceeds fully loaded infra cost for 6+ months.
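
The rule of thumb can be turned into a break-even estimate. This is a rough sketch with hypothetical helper names; the fully loaded infra figure is something you have to supply yourself (GPU rent plus staff time and overhead), and the comparison assumes the self-hosted model meets the same quality bar.

```python
def blended_price_per_m(price_in: float, price_out: float, output_share: float) -> float:
    """Average API price per 1M tokens given the share of tokens that are output."""
    return price_in * (1 - output_share) + price_out * output_share

def breakeven_tokens_per_month(monthly_infra_usd: float, api_price_per_m: float) -> float:
    """Monthly token volume at which API spend equals fully loaded self-hosting cost."""
    return monthly_infra_usd / api_price_per_m * 1_000_000

# Hypothetical inputs: $10,000/month fully loaded, GPT-4o-class prices, 20% output tokens.
price = blended_price_per_m(5.00, 15.00, output_share=0.20)
tokens = breakeven_tokens_per_month(10_000, price)
print(f"blended ${price:.2f}/M, break-even ≈ {tokens / 1e9:.2f}B tokens/month")
```

With these assumed inputs the break-even sits around 1.4 billion tokens per month; against mini-class prices it moves out by more than an order of magnitude.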

⚠️ Pitfall

“Free” Ollama on a laptop does not predict production cost. Throughput, batching, failover, and observability on vLLM at scale are different engineering problems. Pilot on one GPU with production prompts before committing to a fleet.

🔬 Under the Hood

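vLLM uses PagedAttention, which stores the KV cache in fixed-size blocks so memory is not wasted on padding or fragmentation, and combines it with continuous batching of variable-length sequences. That is why it beats naive Hugging Face serving on QPS. Quantization (AWQ, GPTQ) cuts VRAM so larger models fit on smaller GPUs at a slight quality cost; measure on your eval set.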

Cost estimation: templates and worked examples

Ship features with a spreadsheet—or code—that turns token averages and traffic into dollars. The canonical formula is simple; the hard part is honest token counts from production-shaped prompts.

Template

monthly_cost = (avg_input × price_in + avg_output × price_out) / 1_000_000 × requests_per_day × 30

Extend with cache hit rate, escalation rate, and retry multiplier: effective_cost = base_cost × (1 - cache_savings) × (1 + escalation_rate × tier_multiplier) × retry_factor.
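
Both formulas translate directly into code. A minimal sketch follows; the function and parameter names mirror the formulas above, and the numbers in the usage lines are illustrative inputs rather than measurements.

```python
def base_monthly_cost(avg_input: float, avg_output: float,
                      price_in: float, price_out: float,
                      requests_per_day: int, days: int = 30) -> float:
    """monthly_cost from the template; prices are USD per 1M tokens."""
    per_request = (avg_input * price_in + avg_output * price_out) / 1_000_000
    return per_request * requests_per_day * days

def effective_monthly_cost(base_cost: float, cache_savings: float = 0.0,
                           escalation_rate: float = 0.0, tier_multiplier: float = 0.0,
                           retry_factor: float = 1.0) -> float:
    """Apply cache savings, escalation overhead, and retries to the base cost."""
    return (base_cost * (1 - cache_savings)
            * (1 + escalation_rate * tier_multiplier) * retry_factor)

base = base_monthly_cost(3_500, 350, 0.15, 0.60, requests_per_day=20_000)
adjusted = effective_monthly_cost(base, cache_savings=0.30, escalation_rate=0.15,
                                  tier_multiplier=31.0, retry_factor=1.05)
print(f"base ${base:,.2f}/month, effective ${adjusted:,.2f}/month")
```

Here tier_multiplier is read as the cost of an escalated call relative to a cheap-tier call, so a value near 31 corresponds to escalating from mini-class to 4o-class prices.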

Worked example: RAG support feature

Parameter	Value
Model	GPT-4o-mini
Price in / out (per 1M)	$0.15 / $0.60
Avg input tokens	3,500 (system + 4 RAG chunks + history)
Avg output tokens	350
Requests / day	20,000
Per-request cost	(3500×0.15 + 350×0.60) / 1e6 ≈ $0.000735
Monthly (30 days)	0.000735 × 20,000 × 30 ≈ $441

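Same volume on GPT-4o at $5 / $15: per-request ≈ $0.0228, monthly ≈ $13,650, about 31× higher. A cascade that answers 85% of requests on mini and escalates 15% to 4o blends to roughly $2,500/month ($441 for the mini pass on every request plus 0.15 × $13,650 for escalations); if escalated calls double their token use by carrying the rejected draft, the total is closer to $4,500/month.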

Per-feature dashboards

Tag every LLM call with feature, tenant_id, model, and computed cost_usd. Aggregate in your metrics stack (Datadog, Grafana, CloudWatch) or warehouse (BigQuery, Snowflake):

Daily spend by feature — catches runaway cron jobs
p50 / p95 tokens in and out — informs prompt compression work
Escalation rate by route — tunes cascade thresholds
Cost per successful task — ties dollars to business outcome, not raw calls

from dataclasses import dataclass

@dataclass(frozen=True)
class ModelPrice:
    name: str
    price_in_per_m: float
    price_out_per_m: float

PRICES = {
    "gpt-4o-mini": ModelPrice("gpt-4o-mini", 0.15, 0.60),
    "gpt-4o": ModelPrice("gpt-4o", 5.0, 15.0),
    "claude-haiku": ModelPrice("claude-haiku", 0.25, 1.25),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p = PRICES[model]
    return (input_tokens * p.price_in_per_m + output_tokens * p.price_out_per_m) / 1_000_000

def monthly_cost(model: str, avg_in: int, avg_out: int, req_per_day: int, days: int = 30) -> float:
    per_req = request_cost(model, avg_in, avg_out)
    return per_req * req_per_day * days

# RAG support example
print(f"mini/month: ${monthly_cost('gpt-4o-mini', 3500, 350, 20_000):,.2f}")
print(f"4o/month:   ${monthly_cost('gpt-4o', 3500, 350, 20_000):,.2f}")

import java.util.Map;

public record ModelPrice(String name, double priceInPerM, double priceOutPerM) {}

public final class LlmCostCalculator {
  private static final Map<String, ModelPrice> PRICES = Map.of(
      "gpt-4o-mini", new ModelPrice("gpt-4o-mini", 0.15, 0.60),
      "gpt-4o", new ModelPrice("gpt-4o", 5.0, 15.0)
  );

  public static double requestCost(String model, int inputTokens, int outputTokens) {
    var p = PRICES.get(model);
    return (inputTokens * p.priceInPerM() + outputTokens * p.priceOutPerM()) / 1_000_000.0;
  }

  public static double monthlyCost(String model, int avgIn, int avgOut, int reqPerDay, int days) {
    return requestCost(model, avgIn, avgOut) * reqPerDay * days;
  }
}
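
The dashboard metrics listed earlier can be prototyped on raw call logs before wiring up a metrics stack. This sketch assumes each log record is a dict with feature, cost_usd, and success fields (hypothetical names), and uses the standard library only.

```python
from collections import defaultdict
from statistics import quantiles

def spend_by_feature(calls: list) -> dict:
    """Total USD per feature tag."""
    totals = defaultdict(float)
    for call in calls:
        totals[call["feature"]] += call["cost_usd"]
    return dict(totals)

def token_percentiles(token_counts: list) -> tuple:
    """Return (p50, p95); needs at least two samples."""
    cuts = quantiles(token_counts, n=100)  # 99 cut points
    return cuts[49], cuts[94]

def cost_per_successful_task(calls: list) -> float:
    """Total spend divided by the number of calls marked successful."""
    successes = sum(1 for c in calls if c["success"])
    total = sum(c["cost_usd"] for c in calls)
    return total / successes if successes else float("inf")

calls = [
    {"feature": "support_chat", "model": "gpt-4o-mini", "cost_usd": 0.000735, "success": True},
    {"feature": "support_chat", "model": "gpt-4o", "cost_usd": 0.02275, "success": True},
    {"feature": "doc_summary", "model": "gpt-4o-mini", "cost_usd": 0.0012, "success": False},
]
print(spend_by_feature(calls))
print(cost_per_successful_task(calls))
```

Failed calls still count toward spend in cost_per_successful_task, which is the point of the metric: retries and rejected drafts show up as a higher cost per outcome.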

🎯 Interview Tip

Whiteboard: “10M requests/month, 2K in / 300 out, which model and what budget?” Walk through formula, cite mini vs 4o, mention cascade and caching—not just picking the biggest model.

Selection framework: complexity, latency, cost, privacy

Model choice is a multi-objective decision—not “always use the best benchmark score.” Use a decision matrix per feature class; validate with your eval set, using public benchmarks only to shortlist.

Decision dimensions

Task complexity — extraction vs multi-hop reasoning vs open-ended writing
Latency SLO — TTFT and total ms budget for UX tier (chat vs async email)
Cost envelope — target $/successful task or $/tenant/month
Privacy / residency — data leaves region? BAA required? air-gap?
Context needs — max document size without truncation
Tooling — JSON schema, function calling, vision, audio

Benchmark awareness (MMLU, HumanEval, etc.)

MMLU measures broad knowledge multiple-choice; HumanEval measures code synthesis pass@1. Leaderboard gains of 2–3 points rarely predict your invoice-parser accuracy. Use benchmarks to eliminate clearly weak candidates, then run golden-set evals (Track 5) on production prompts. Log regression when providers silently upgrade model snapshots.

Feature profile	Complexity	Latency	Recommended tier	Notes
Intent classifier	Low	<200ms	Haiku / 4o-mini / Flash	Temperature 0; few-shot or fine-tune if needed
RAG Q&A (internal docs)	Medium	<3s TTFT streaming	Sonnet / 4o-mini cascade	Prefix-cache long system prompt
Code review agent	High	<30s	Sonnet / GPT-4o	HumanEval-like internal suite
Executive brief from 50 PDFs	High	Async OK	Batch + Gemini Pro / Sonnet	Long context; batch discount
PHI summarization	Medium	<5s	Bedrock / VPC-hosted Llama	Privacy overrides cost
Multimodal OCR + Q&A	Medium–High	<5s	GPT-4o / Gemini Flash	See multimodal guide

⚖️ Trade-off

Optimizing only for MMLU pushes you to frontier models everywhere. Optimizing only for cost pushes you to mini models that fail on edge cases users remember. The matrix forces explicit trade-offs per feature—not one global default.

💡 Pro Tip

Document the chosen tier in an ADR: eval scores, cost projection, fallback model, and review date. Revisit when traffic 10×’s or a new model generation ships—whichever comes first.

Production: FinOps checklist and guardrails

Cost optimization without controls is a one-time spreadsheet exercise. Production needs budgets, alerts, quotas, and ownership—the same discipline you apply to cloud compute and egress.

FinOps checklist

Central LLM gateway — one place for keys, routing, rate limits, and cost attribution
Per-request cost log — model, tokens, feature, tenant, USD estimate
Budget alerts — daily/weekly thresholds at org and feature level (PagerDuty / Slack)
Per-tenant quotas — hard caps on tokens/day; graceful degradation message
max_tokens enforced server-side — never trust client-supplied limits alone
Prompt version registry — tie spend spikes to prompt deploys
Monthly FinOps review — top 5 expensive features, escalation rates, cache hit rates
Provider invoice reconciliation — billing CSV vs application logs within 5%

Per-tenant quotas pattern

Store each tenant's token usage in Redis under a daily key such as tenant_id:yyyy-mm-dd (a fixed daily window rather than a true rolling one). Before each call, increment the counter by the estimated tokens; when the total exceeds the plan limit, reject with HTTP 429 and an upgrade CTA. Enterprise tenants get higher caps and an optional burst allowance with finance approval.
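
A minimal in-memory sketch of this pattern is shown below. TenantQuota is a hypothetical class, and the plain dict stands in for Redis; a real deployment would use an atomic increment with a key expiry so that concurrent requests cannot both slip under the limit.

```python
from datetime import date, datetime, timezone
from typing import Optional

class TenantQuota:
    def __init__(self, daily_limits: dict):
        self.daily_limits = daily_limits  # tenant_id -> max tokens per day
        self.usage = {}                   # "tenant_id:yyyy-mm-dd" -> tokens used (Redis stand-in)

    @staticmethod
    def _key(tenant_id: str, day: date) -> str:
        return f"{tenant_id}:{day.isoformat()}"

    def try_reserve(self, tenant_id: str, estimated_tokens: int,
                    day: Optional[date] = None) -> bool:
        """Reserve tokens for a call; False means the caller should return HTTP 429."""
        day = day or datetime.now(timezone.utc).date()
        key = self._key(tenant_id, day)
        used = self.usage.get(key, 0)
        if used + estimated_tokens > self.daily_limits.get(tenant_id, 0):
            return False
        self.usage[key] = used + estimated_tokens
        return True

quota = TenantQuota({"acme": 1_000})
print(quota.try_reserve("acme", 600, date(2025, 1, 1)))  # fits within the limit
print(quota.try_reserve("acme", 600, date(2025, 1, 1)))  # would exceed the limit
```

Unknown tenants default to a limit of zero here, which fails closed. After the call completes, the estimate can be corrected with the actual token counts from the response.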

Control	Implementation	Alert when
Daily org budget	Sum cost_usd in metrics DB	>80% of budget by 3pm UTC
Feature anomaly	Compare to 7-day rolling median	2× median for 1 hour
Tenant abuse	Per-tenant token counter	Quota exceeded or 10× usual
Model drift	Eval pass rate in CI	>5% drop vs baseline
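
The first two alert rules in the table reduce to simple comparisons. The sketch below is one possible reading of them, with hypothetical function names; the 80% threshold, 3pm UTC cutoff, and 2× multiplier come from the table, and the inputs are assumed to be pulled from your metrics store.

```python
from statistics import median

def budget_alert(spend_so_far: float, daily_budget: float, hour_utc: int,
                 threshold: float = 0.80, cutoff_hour: int = 15) -> bool:
    """Alert when spend passes the threshold share of budget before the cutoff hour."""
    return hour_utc < cutoff_hour and spend_so_far > threshold * daily_budget

def feature_anomaly(hourly_costs_last_7_days: list, current_hour_cost: float,
                    multiplier: float = 2.0) -> bool:
    """Alert when the latest hour costs more than `multiplier` times the 7-day median hour."""
    return current_hour_cost > multiplier * median(hourly_costs_last_7_days)

print(budget_alert(spend_so_far=850.0, daily_budget=1_000.0, hour_utc=13))
print(feature_anomaly([1.0, 2.0, 3.0], current_hour_cost=7.0))
```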

Track 1 wrap-up

You now have the full LLM foundation stack: how models work, APIs and context (coming guides), embeddings, multimodal inputs, and this cost playbook. Multimodal features (PDF, image, audio) multiply token counts—size those paths with the same calculator and routing patterns before enabling them for all tenants.

🔒 Security

Quotas mitigate cost abuse but not prompt injection. Pair spend controls with input validation, output filtering, and RBAC on which features can call which models (e.g. only backend services may invoke Opus-tier).

📦 Real World

SaaS AI add-ons often ship with “AI credits” derived from token budgets per seat. Engineering exposes remaining_credits from the same counters used for FinOps—product and infra share one source of truth.

📦 Production checklist

Track 1 complete when you can answer: “What did we spend yesterday by feature and tenant?” and “What happens when we hit budget?” If the answer is unknown, wire logging and quotas before scaling traffic.