What is an LLM?

A large language model (LLM) is a neural network—almost always a transformer—trained on vast text to assign probabilities to the next token given everything that came before. Generation is that same prediction repeated: sample a token, append it, predict again until you hit a stop condition.

The model does not “look up” answers in a database at inference time (unless you add retrieval or tools). It completes patterns it saw during training—grammar, APIs, refund policies, code idioms—weighted by the prompt you send. That is why the same model can be brilliant on familiar tasks and confidently wrong on niche facts: it is optimizing for plausible continuation, not guaranteed truth.

Foundation model is the industry term for the base weights (GPT-4 class, Claude, Llama). Application is your system prompt, RAG chunks, tool results, guardrails, and UI—everything that turns raw completion into a product feature.

⚠️ Pitfall

Treating the LLM as a search engine or database. Users ask factual questions; the model invents plausible answers. For knowledge-heavy features, plan for RAG or tool calls (covered in Track 2) rather than relying on parametric memory alone.

The inference loop (what happens on every API call)

Each generation step follows the same pattern:

  1. Encode prompt — text → token IDs via the model’s tokenizer.
  2. Forward pass — transformer layers output a vector of logits (scores) over the vocabulary.
  3. Softmax — logits become a probability distribution over all possible next tokens.
  4. Sample — pick one token (argmax at temperature 0, or stochastic with temperature / top-p / top-k).
  5. Append & repeat — add token to context; stop at max_tokens, stop sequence, or EOS token.

Output length directly drives latency and cost: steps 2–4 run once per generated token. A 500-token answer is roughly 500 forward passes (batched efficiently on GPU, but still proportional).
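The five steps above can be sketched as a toy loop. This is a minimal illustration, not a real model: `toy_logits` is a hypothetical stand-in for the transformer forward pass, and the five-word vocabulary is invented for the example.

```python
import math
import random

VOCAB = ["<eos>", "the", "cat", "sat", "down"]  # toy vocabulary; ID 0 is EOS

def toy_logits(context):
    """Hypothetical stand-in for the forward pass: favors the next vocab ID in order."""
    nxt = (context[-1] + 1) % len(VOCAB)
    return [4.0 if i == nxt else 0.0 for i in range(len(VOCAB))]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt_ids, max_tokens=8, temperature=0.0, rng=None):
    rng = rng or random.Random(0)
    context = list(prompt_ids)          # step 1 (encoding) is assumed already done
    generated = []
    for _ in range(max_tokens):         # one forward pass per generated token
        logits = toy_logits(context)    # step 2: forward pass
        if temperature == 0:
            token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        else:
            probs = softmax([x / temperature for x in logits])       # step 3: softmax
            token = rng.choices(range(len(probs)), weights=probs, k=1)[0]  # step 4: sample
        if token == 0:                  # step 5: stop at EOS
            break
        context.append(token)           # step 5: append and repeat
        generated.append(token)
    return generated

ids = generate([1])  # prompt is the single token "the"
print(" ".join(VOCAB[i] for i in ids))  # prints "cat sat down"
```

Note that the loop body runs once per output token, which is why `max_tokens` caps both latency and cost.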

🔬 Under the Hood

Autoregressive means token t only sees tokens 1…t−1 (causal mask in attention). There is no parallel generation of the full answer inside one forward pass—providers hide this with streaming, which sends each token as it is sampled so time-to-first-token (TTFT) feels fast even when total time is unchanged.

Hosted API vs self-hosted inference

You almost always integrate via HTTP API or a local OpenAI-compatible server—not by loading weights yourself unless you operate ML infra.

| Option | Pros | Cons | When to use |
| --- | --- | --- | --- |
| Hosted frontier (OpenAI, Anthropic, Google) | Best quality, tools, long context, no GPU ops | Per-token cost, data leaves VPC, rate limits | Default for product features, agents, RAG answers |
| Cloud enterprise (Bedrock, Azure OpenAI, Vertex) | VPC, IAM, existing cloud billing, compliance | Model lag vs direct API, higher friction | Regulated industries, existing AWS/Azure/GCP spend |
| Local / self-host (Ollama, vLLM, llama.cpp) | Zero marginal token cost, data stays on box | Ops burden, weaker models, scaling GPUs | High volume, sensitive data, offline, fine-tuned weights |

📦 Real World

Stripe and many SaaS products route classification and extraction to smaller hosted or self-hosted models, and reserve frontier models for complex reasoning. GitHub Copilot runs specialized models on dedicated inference infra—not a raw public API call per keystroke without caching and routing.

⚖️ Trade-off

Self-hosting Llama 3.1 8B looks cheap on paper until you count GPU availability, model updates, and eval regressions when you change weights. Most teams start hosted, measure cost, then self-host only proven high-volume paths.

Transformer architecture (intuition, no math)

You do not need to implement attention to ship LLM features—but you need the mental model to debug context limits, “lost in the middle,” and why longer prompts cost more.

A decoder-only transformer (GPT-style) stacks repeated blocks. Each block roughly does:

  • Self-attention — every token gathers information from prior tokens.
  • Feed-forward MLP — per-token non-linear transform.
  • Residual + layer norm — stabilizes deep stacks (dozens to 100+ layers on frontier models).

Deeper stacks tend to capture more abstract patterns (multi-step reasoning, code structure) at the cost of compute and memory. Model size (parameters) and context length are separate knobs: an 8B model can still have a 128K context window.

Tokenization: text → token IDs

Raw Unicode text is split into tokens using subword algorithms: BPE (GPT, Llama), WordPiece (older BERT lineage), or unigram models. SentencePiece, common in multilingual models, is a toolkit that implements BPE and unigram rather than a separate algorithm. Each token maps to an integer ID in a fixed vocabulary (often 32K–200K entries).
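To make "text → token IDs" concrete, here is a minimal sketch of subword encoding. It uses greedy longest-match against a tiny hand-written vocabulary, which is closer to WordPiece-style lookup than to real BPE merges. `TOY_VOCAB` and its IDs are invented for illustration.

```python
# Hypothetical vocabulary; real tokenizers learn tens of thousands of entries.
TOY_VOCAB = {"un": 0, "believ": 1, "able": 2, "refund": 3, "s": 4}

def encode(text, vocab=TOY_VOCAB):
    """Greedy longest-match: at each position take the longest piece in the vocab."""
    ids = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest candidate first
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # Real tokenizers fall back to bytes or an <unk> token instead of failing.
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return ids

print(encode("unbelievable"))  # [0, 1, 2] -> un + believ + able
print(encode("refunds"))       # [3, 4]    -> refund + s
```

One word becomes several IDs, and unfamiliar strings split into more pieces, which is where the per-character token cost differences come from.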

Token ≠ word. English averages ~4 characters per token, but:

  • unbelievable → often 3 tokens (un, believ, able)
  • One emoji → 1–4 tokens depending on tokenizer
  • Code, JSON, and base64 → more tokens per character than prose
  • CJK text → often one token per character or small groups
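
Count tokens before calling the API

# tiktoken: OpenAI's tokenizer library for Python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # resolves to the o200k_base encoding
text = "Refund policy: partial months are not refundable on annual plans."
ids = enc.encode(text)
print("tokens:", len(ids))
print("ids:", ids[:12], "...")

// jtokkit: OpenAI-compatible token counting in JVM services
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;

EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
// CL100K_BASE matches GPT-4 / GPT-3.5; use the encoding that matches the model you call
Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);
String text = "Refund policy: partial months are not refundable on annual plans.";
// countTokens returns the count directly, independent of encode()'s return type
int tokens = enc.countTokens(text);
System.out.println("tokens: " + tokens);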

Embeddings and attention

Embedding layer: each token ID → dense vector (e.g. 4096 dimensions). Similar tokens sit near each other in space; the model learns these vectors during training.

Attention: for each token, the model computes how much to “look at” every previous token and builds a weighted mix. Classic example: “The animal didn't cross the street because it was tired.” — attention links it to animal, not street. Without that mechanism, pronoun resolution and long-range dependencies fail.

Output head: final hidden state → logits over vocabulary → softmax → probabilities → sample next token.
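The attention step can be sketched in a few lines of plain Python. This is a simplified single-head version under stated assumptions: the token vectors are used directly as queries, keys, and values (real models apply learned projections first), and the three 2-dimensional vectors are made up for the example.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    """Scaled dot-product attention where token t only sees positions 0..t."""
    d = len(queries[0])
    outputs, all_weights = [], []
    for t, q in enumerate(queries):
        # Causal mask: score only the current and earlier positions.
        scores = [
            sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)           # how much to "look at" each prior token
        mixed = [
            sum(weights[s] * values[s][i] for s in range(t + 1))
            for i in range(len(values[0]))
        ]                                   # weighted mix of value vectors
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # toy embeddings; tokens 0 and 2 are identical
out, w = causal_attention(x, x, x)
for t, row in enumerate(w):
    print(t, [round(p, 3) for p in row])
```

The third token puts more weight on the first token (its twin) than on the second, which is the same mechanism that lets "it" attend to "animal" in the example sentence.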

⚠️ Pitfall

Budgeting with len(text) / 4 for production prompts. Code, JSON, and legal text routinely break that rule by 2×. Always count with the same tokenizer as the model you call—or use the provider’s counting API before send.

🎯 Interview Tip

“Explain attention in one minute.” — Each token produces query/key/value vectors; softmax weights determine which prior tokens matter; stacked layers compose local syntax into document-level reasoning. You do not need matrix formulas—show you understand why context length and position matter.

Temperature and sampling

After softmax, the decoder either takes the single highest-probability token (greedy decoding) or samples from the distribution. The sampling settings control creativity, diversity, and reproducibility.

Temperature

Temperature scales logits before softmax. Lower → sharper distribution (more deterministic); higher → flatter (more random).

| Temperature | Effect | Typical use |
| --- | --- | --- |
| 0 (or omitted with greedy) | Argmax: same input → usually the same output (modulo provider updates; hosted APIs are not guaranteed bit-for-bit deterministic) | JSON extraction, classification, code gen with tests |
| 0.2 – 0.5 | Low variance, still natural language | Support bots, internal copilots, policy Q&A |
| ~1.0 | Balanced, close to model’s raw distribution | General chat, brainstorming with review |
| 1.5 – 2.0+ | High chaos: rare words, unstable structure | Creative writing diversity; bad for facts |
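A small sketch shows the sharpening and flattening effect of temperature numerically. The three logits are arbitrary example values, and temperature 0 is rejected here because it would divide by zero; treat it as argmax instead.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; handle 0 as argmax instead")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # arbitrary example scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}:", [round(p, 3) for p in probs])
```

At low temperature nearly all probability mass lands on the top token; at high temperature the three candidates move closer together.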

Top-p (nucleus sampling)

Sort tokens by probability; take the smallest set whose cumulative mass ≥ p (e.g. 0.9); sample only within that set. Adapts to context—narrow when one token dominates, wide when many are plausible. Most production apps use temperature 0–0.7 + top_p 0.9–0.95.
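The nucleus rule described above can be sketched as a filter over a token-to-probability mapping. The example distributions are invented; a real decoder applies this to the full vocabulary at every step.

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest high-probability set with cumulative mass >= p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        mass += prob
        if mass >= p:
            break
    return {token: prob / mass for token, prob in kept}

flat = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
peaked = {"yes": 0.95, "no": 0.05}
print(top_p_filter(flat))    # keeps a, b, c (cumulative 0.95), drops d
print(top_p_filter(peaked))  # keeps only "yes"
```

The same `p` yields three candidates in the flat case and one in the peaked case, which is the adaptive behavior that fixed-size cutoffs lack.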

Top-k

Sample only from the top k tokens (e.g. 40). Simpler but less adaptive than top-p. Some APIs expose both; if in doubt, prefer top-p.
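For comparison, a top-k sampler is only a truncation plus a weighted draw. This is an illustrative sketch with a made-up distribution; the `rng` parameter exists only so the draw can be seeded in tests.

```python
import random

def top_k_sample(probs, k=40, rng=None):
    """Sample one token from the k most probable candidates."""
    rng = rng or random.Random()
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens = [token for token, _ in ranked]
    weights = [weight for _, weight in ranked]   # choices() normalizes the weights
    return rng.choices(tokens, weights=weights, k=1)[0]

probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(top_k_sample(probs, k=1))   # always "a": k=1 is greedy decoding
print(top_k_sample(probs, k=2))   # "a" or "b"
```

Unlike top-p, the candidate count stays at `k` whether the distribution is peaked or flat.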

Deterministic extraction vs creative draft

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Extraction: temperature 0 + JSON mode (OpenAI strict mode in next guide)
# JSON mode requires the word "JSON" to appear somewhere in the messages.
extract = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    response_format={"type": "json_object"},
    messages=[{"role": "user", "content": 'Return JSON with one key, "vendor". Extract vendor name from: Acme Corp invoice #8821'}],
)

# Marketing draft: higher temperature, human review before send
draft = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.9,
    top_p=0.95,
    messages=[{"role": "user", "content": "Write three subject lines for a product launch email."}],
)

// Spring AI: pin temperature per use case
// (options builder names differ across Spring AI versions; adjust to the one you run)
ChatResponse extraction = chatClient.prompt()
    .options(ChatOptionsBuilder.builder().withTemperature(0.0).build())
    .user("Extract vendor name from: Acme Corp invoice #8821")
    .call()
    .chatResponse();  // call() returns a response spec; chatResponse() yields the ChatResponse

ChatResponse draft = chatClient.prompt()
    .options(ChatOptionsBuilder.builder().withTemperature(0.9).withTopP(0.95).build())
    .user("Write three subject lines for a product launch email.")
    .call()
    .chatResponse();

⚖️ Trade-off

Lower temperature improves consistency but can cause repetitive phrasing in long outputs. Support bots: 0–0.3 with a fixed system prompt. Marketing copy: 0.7–1.0 with human review—not temperature 0.

🎯 Interview Tip

“Why did JSON extraction return different fields on retry?” — Non-zero temperature, unpinned model version, or no schema validation. Answer: temperature 0, structured output / Pydantic validation, eval suite in CI, log model ID per request.
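The schema validation part of that answer can be sketched with the standard library alone. The field names in `REQUIRED_FIELDS` are hypothetical, and a library such as Pydantic would replace the manual checks in a real service.

```python
import json

# Hypothetical schema for the invoice example: field name -> expected type.
REQUIRED_FIELDS = {"vendor": str, "invoice_number": str}

def parse_extraction(raw):
    """Parse model output and reject anything that does not match the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    # Drop unexpected extra keys so downstream code sees a stable shape.
    return {field: data[field] for field in REQUIRED_FIELDS}

good = '{"vendor": "Acme Corp", "invoice_number": "8821", "note": "extra"}'
print(parse_extraction(good))
```

Rejecting malformed output at this boundary turns "different fields on retry" from a silent data bug into a countable error you can alert on.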

Context window: the model's working memory

The context window is the maximum tokens the model can attend to in one request—your prompt (system + history + RAG chunks) plus the completion. Exceed it and providers truncate or error.

Representative limits (check provider docs—changes often)

| Model | Context (approx.) | Notes |
| --- | --- | --- |
| GPT-4o | 128K tokens | Strong general capability; higher input cost on long prompts |
| Claude 3.5 Sonnet | 200K tokens | Strong instruction following; prompt caching on long system prompts |
| Gemini 1.5 Pro | 1M tokens | Long-doc ingest; verify quality at extreme length |
| GPT-4o-mini / Haiku / Flash | 128K+ / 200K / 1M | Cheap first pass; escalate on hard cases |

Tokens are not characters

English prose ≈ 4 characters per token. Code, tables, and non-Latin text diverge widely. A 50-page PDF pasted as text can blow a budget you sized for “50 pages × 500 words × 4 chars.”

What happens at the limit

  • Hard error — API returns 400 context length exceeded; you must truncate or summarize.
  • Silent truncation — some clients drop the oldest messages first, so critical instructions at the start can be cut while recent user text survives.
  • Quality degradation — even within limit, lost in the middle: models recall start and end of long context better than the middle. Put critical RAG chunks and rules at the beginning or end—not buried mid-prompt.
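One way to avoid both the hard error and the silent loss of instructions is to trim history explicitly before the call. The sketch below is an assumption about how such a helper might look: it always keeps system messages, then keeps the newest turns that fit. `count_tokens` is a caller-supplied function (for example a wrapper around the model's tokenizer); the word-count lambda in the demo is only a stand-in.

```python
def trim_history(messages, budget, count_tokens):
    """Keep all system messages, then the most recent turns that fit in the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(count_tokens(m["content"]) for m in system)
    if used > budget:
        raise ValueError("system prompt alone exceeds the token budget")
    kept = []
    for message in reversed(rest):          # walk from newest to oldest
        cost = count_tokens(message["content"])
        if used + cost > budget:
            break                           # everything older is dropped
        kept.append(message)
        used += cost
    return system + list(reversed(kept))    # restore chronological order

history = [
    {"role": "system", "content": "be brief"},
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user", "content": "six"},
]
word_count = lambda text: len(text.split())  # stand-in tokenizer for the demo
for m in trim_history(history, budget=5, count_tokens=word_count):
    print(m["role"], "->", m["content"])
```

Making the drop policy explicit means you decide what is lost, rather than discovering it from a client library's default.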

Long context ≠ free

Billing is per input token. Attention compute inside the model scales roughly O(n²) with sequence length during prefill, and providers price and rate-limit accordingly. Stuffing 500K tokens “because it fits” can cost dollars per request and add seconds of latency.

Token budget template (single request)

| Bucket | Example budget | Purpose |
| --- | --- | --- |
| System prompt | 500 tokens | Role, rules, output format |
| Retrieved context (RAG) | 2,000 tokens | Top-K chunks after rerank |
| Conversation history | 1,000 tokens | Sliding window or summary |
| User message | 500 tokens | Current question |
| Response buffer | 1,000 (max_tokens) | Reserved for completion |
| Total | ~5,000 | Well within 128K, with room to grow deliberately |

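The template can be enforced in code before each request. This sketch hard-codes the example budgets from the table; the bucket names and the `check_budget` helper are illustrative choices, not part of any SDK.

```python
CONTEXT_WINDOW = 128_000

# Per-bucket limits copied from the template above.
BUDGET = {
    "system_prompt": 500,
    "retrieved_context": 2_000,
    "conversation_history": 1_000,
    "user_message": 500,
    "response_buffer": 1_000,
}

def check_budget(actual, budget=BUDGET, window=CONTEXT_WINDOW):
    """Compare measured token counts per bucket against the planned limits."""
    overruns = {
        name: actual[name] - limit
        for name, limit in budget.items()
        if actual.get(name, 0) > limit
    }
    total = sum(actual.values())
    return {"total": total, "fits_window": total <= window, "overruns": overruns}

measured = {"system_prompt": 700, "retrieved_context": 1_800, "user_message": 120}
print(sum(BUDGET.values()))     # 5000, the planned total
print(check_budget(measured))   # flags system_prompt as 200 tokens over
```

A per-bucket report points at the component that grew, which a single "total tokens" number cannot do.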
💰 Cost

monthly_cost ≈ (avg_input_tokens × price_in + avg_output_tokens × price_out) × requests_per_day × 30, where the averages are per request and prices are per token. Output tokens are often priced at 3–5× the input rate on frontier models, so cap max_tokens on verbose tasks.
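As a worked example of the monthly estimate, the sketch below computes per-request cost first and then multiplies by volume. The traffic numbers are assumptions for illustration; the prices are the approximate GPT-4o-mini rates quoted in this guide and should be checked against current provider pricing.

```python
def monthly_cost(avg_input_tokens, avg_output_tokens,
                 price_in_per_million, price_out_per_million,
                 requests_per_day, days=30):
    """Estimate monthly spend in dollars from average per-request token counts."""
    per_request = (
        avg_input_tokens * price_in_per_million
        + avg_output_tokens * price_out_per_million
    ) / 1_000_000
    return per_request * requests_per_day * days

# Assumed workload: 4,000 input + 500 output tokens per request, 10,000 requests/day.
estimate = monthly_cost(4_000, 500, 0.15, 0.60, 10_000)
print(f"${estimate:,.2f} per month")  # about $270
```

Per request this is (4,000 × 0.15 + 500 × 0.60) / 1,000,000 = $0.0009, and the volume multiplier does the rest.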

💡 Pro Tip

Before RAG: measure baseline prompt size with real system prompt + longest expected user message. If you are already above 8K tokens without retrieval, fix prompt bloat first—adding chunks makes recall worse, not better.

Training pipeline intuition

Frontier models go through multiple stages. Your application sits after all of them—you steer behavior with prompts, RAG, tools, and optionally fine-tuning—not by re-running pre-training.

1. Pre-training

Train on massive unlabeled text (web, books, code) with next-token prediction. The model learns grammar, facts (statistically), reasoning patterns, and languages—embedded in weights, static until retrained. Cost: millions of GPU-hours; only foundation labs do this routinely.

2. Supervised fine-tuning (SFT)

Curated instruction → response pairs teach “assistant” behavior: follow questions, refuse unsafe requests, use markdown, etc. This is why base Llama behaves differently from Llama-Instruct—you want instruct/chat variants for products.

3. RLHF (reinforcement learning from human feedback)

Humans rank multiple model answers; a reward model learns preferences; policy tuning nudges outputs toward helpful/harmless. Explains alignment quirks—models refuse more than raw pre-training would, and style varies by vendor.

4. Constitutional AI (Anthropic)

AI-generated critiques replace some human ranking for scale—same goal as RLHF with different data pipeline. As an app builder, you still add your policies via system prompt and guardrails.

| Technique | Updates weights? | Best for |
| --- | --- | --- |
| Prompt engineering | No | Behavior, format, tone (try first) |
| RAG | No | Fresh factual knowledge, citations |
| Fine-tuning (LoRA) | Yes (adapters) | Consistent style, domain format, smaller prompts |
| Full pre-training | Yes (all weights) | Not for product teams; foundation scale only |

⚠️ Pitfall

Fine-tuning to “teach the model your docs.” New documents require retraining; knowledge drifts. Use RAG for knowledge, fine-tuning for behavior and format—Track 6 covers when fine-tuning actually pays off.

Model families and when to use each

Benchmarks (MMLU, HumanEval, MATH, GPQA) indicate broad capability—they do not predict your task. Always run evals on your data; use benchmarks to shortlist, not to decide.

| Family | Models | Strengths | Typical use |
| --- | --- | --- | --- |
| OpenAI | GPT-4o, GPT-4o-mini | Tools, JSON schema mode, broad ecosystem | General product features, agents, structured extraction |
| Anthropic | Claude 3.5 Sonnet, Haiku, Opus | Long context, instruction following, XML-friendly prompts | RAG over docs, coding assistants, careful refusals |
| Google | Gemini 1.5 Pro, Flash | Very long context, multimodal, competitive pricing | Large doc ingest, video/frame pipelines |
| Open weights | Llama 3.1, Mistral, Phi-3 | Self-host, fine-tune, no per-token vendor lock-in | Privacy, high volume, custom adapters |

Reference pricing (order of magnitude — verify current rates)

| Model | Input / 1M tokens | Output / 1M tokens |
| --- | --- | --- |
| GPT-4o | ~$5 | ~$15 |
| GPT-4o-mini | ~$0.15 | ~$0.60 |
| Claude 3.5 Sonnet | ~$3 | ~$15 |
| Claude 3 Haiku | ~$0.25 | ~$1.25 |
| Gemini 1.5 Flash | ~$0.075 | ~$0.30 |

Routing pattern: cheap first, escalate on uncertainty

Classify or answer with Haiku / 4o-mini; if confidence low (logprobs, self-check, or classifier), retry with Sonnet / GPT-4o. Cascade routing saves 60–80% at scale when most tickets are simple—see the cost guide in this track.
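The cascade can be sketched with two pluggable callables. Everything model-specific here is a stub: `cheap_model` and `strong_model` are hypothetical placeholders that return an answer plus a confidence score, and the 0.8 threshold is an arbitrary starting point to tune against your evals.

```python
def cascade(question, cheap, strong, threshold=0.8):
    """Answer with the cheap model; escalate when its confidence is below threshold."""
    answer, confidence = cheap(question)
    if confidence >= threshold:
        return answer, "cheap"
    answer, _ = strong(question)
    return answer, "strong"

def cheap_model(question):
    # Hypothetical stub: pretend the small model is confident only on short questions.
    confidence = 0.95 if len(question.split()) <= 6 else 0.4
    return f"[cheap] {question}", confidence

def strong_model(question):
    # Hypothetical stub for the frontier model.
    return f"[strong] {question}", 0.99

print(cascade("Where is my refund?", cheap_model, strong_model))
print(cascade("Compare the proration rules across all three of our annual plan tiers",
              cheap_model, strong_model))
```

Returning the tier alongside the answer lets you log the escalation rate, which is the number that determines the actual savings.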

💡 Pro Tip

Pin model and log it on every request. Silent provider upgrades can change behavior overnight, so your eval suite (Track 5) should alert when output distribution shifts without a deploy on your side.

📦 Real World

Perplexity routes queries across models and search infra; Cursor uses multiple models by task type (chat vs apply vs embed). Neither exposes a single “always GPT-4” path—routing is the product moat.

Your first chat completion

Chat APIs accept an ordered messages array: system (instructions), user (human input), assistant (prior model turns). Never commit API keys—use env vars locally and a secrets manager in production.

Minimal call (non-streaming)

Log usage.total_tokens (OpenAI) or input/output token fields on every response—cost attribution starts here.

from openai import OpenAI

client = OpenAI()  # OPENAI_API_KEY from environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=256,
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain what a token is in one sentence."},
    ],
)
print(response.choices[0].message.content)
print("prompt:", response.usage.prompt_tokens, "completion:", response.usage.completion_tokens)

# Anthropic Python SDK
import anthropic

client = anthropic.Anthropic()  # ANTHROPIC_API_KEY
message = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=256,
    temperature=0,
    system="You are a concise technical assistant.",
    messages=[{"role": "user", "content": "Explain what a token is in one sentence."}],
)
print(message.content[0].text)
print("in:", message.usage.input_tokens, "out:", message.usage.output_tokens)

# AWS Bedrock (Anthropic model via boto3)
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "temperature": 0,
    "system": "You are a concise technical assistant.",
    "messages": [{"role": "user", "content": "Explain what a token is in one sentence."}],
}
resp = bedrock.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps(body),
)
payload = json.loads(resp["body"].read())
print(payload["content"][0]["text"])
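
// Spring AI: REST controller backed by ChatClient
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/v1")
public class ChatController {
  private final ChatClient chatClient;

  public ChatController(ChatClient.Builder builder) {
    this.chatClient = builder
        .defaultSystem("You are a concise technical assistant.")
        .build();
  }

  @GetMapping("/chat")
  public String chat(@RequestParam String question) {
    return chatClient.prompt()
        .user(question)
        .call()
        .content();
  }
}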
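
// LangChain4j: Anthropic chat model
import dev.langchain4j.model.anthropic.AnthropicChatModel;

AnthropicChatModel model = AnthropicChatModel.builder()
    .apiKey(System.getenv("ANTHROPIC_API_KEY"))
    .modelName("claude-3-5-haiku-20241022")
    .temperature(0.0)
    .maxTokens(256)
    .build();

// generate(String) returns the reply text directly in LangChain4j 0.x (chat(String) in 1.x)
String answer = model.generate("Explain what a token is in one sentence.");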
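
// Spring AI: Bedrock Anthropic chat model (milestone-era API).
// Constructor arguments and option names changed between Spring AI versions;
// verify this snippet against the version you run.
var chat = new BedrockAnthropicChatModel(
    BedrockAnthropicChatOptions.builder()
        .withModel("anthropic.claude-3-haiku-20240307-v1:0")
        .withTemperature(0.0f)
        .withMaxTokens(256)
        .build());

String text = chat.call(new Prompt("Explain what a token is in one sentence."))
    .getResult().getOutput().getContent();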

Streaming (production default for user-facing chat)

Set stream=True (OpenAI) or use the streaming API on Anthropic/Bedrock. Forward tokens to the client over SSE so users see progress; measure TTFT separately from total duration. The next guide in this track covers provider-specific streaming and error handling mid-stream.
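Measuring TTFT separately from total duration can be sketched independently of any provider. The helper below consumes any iterable of text chunks; in a real service the iterable would be the SDK's stream object, and `fake_stream` is a stand-in that simulates network delay.

```python
import time

def consume_stream(chunks, clock=time.perf_counter):
    """Collect streamed text and record time-to-first-token and total duration."""
    start = clock()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = clock() - start      # first chunk arrived
        parts.append(chunk)             # a real handler would also forward this over SSE
    total = clock() - start
    return "".join(parts), ttft, total

def fake_stream():
    # Stand-in for a provider stream: short wait, then chunks.
    time.sleep(0.05)
    yield "Hel"
    time.sleep(0.05)
    yield "lo"

text, ttft, total = consume_stream(fake_stream())
print(text, f"ttft={ttft * 1000:.0f}ms", f"total={total * 1000:.0f}ms")
```

The injectable `clock` keeps the timing logic testable without real sleeps.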

🔒 Security

Log metadata (latency, tokens, model ID, request ID)—not raw prompts if they contain PII. Hash or truncate user content in observability backends; restrict log access like production DB credentials.

What this looks like in production

A script that prints one completion is a spike. Production is a service with SLOs, cost controls, and proof that prompt changes do not regress behavior.

Architecture sketch

Client → your API → (optional RAG/tools) → LLM gateway → provider. The gateway owns API keys, model routing, rate limits, retries, caching, and per-team cost accounting; do not scatter provider keys across microservices.

Resilience

  • 429 rate limits — exponential backoff with jitter; respect Retry-After.
  • 503 / timeouts — retry idempotent reads; cap total deadline below client timeout.
  • Model fallback — GPT-4o → 4o-mini → cached/static response for non-critical paths.
  • Provider fallback — OpenAI down → Anthropic via unified client (LiteLLM, Portkey, or in-house).
  • Circuit breaker — stop calling a failing provider after N errors; alert on-call.
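The first bullet, backoff with jitter that respects Retry-After, might look like the sketch below. `RateLimited` is a hypothetical exception standing in for whatever your SDK raises on HTTP 429, and the delays are arbitrary defaults. `sleep` and `rng` are injectable so the policy can be tested without waiting.

```python
import random
import time

class RateLimited(Exception):
    """Hypothetical stand-in for an SDK's 429 error; retry_after is in seconds."""
    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after

def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep, rng=random.random):
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise                           # out of attempts: surface the error
            if exc.retry_after is not None:
                delay = exc.retry_after         # the server told us how long to wait
            else:
                # Full jitter: random wait up to the capped exponential delay.
                delay = rng() * min(max_delay, base_delay * 2 ** attempt)
            sleep(delay)

attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited()
    return "ok"

print(call_with_backoff(flaky_call, sleep=lambda s: None))  # prints "ok" on the third attempt
```

In production you would also cap the total time spent retrying so it stays below the client's own timeout, as the second bullet notes.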

Observability (log every call)

| Field | Why |
| --- | --- |
| model, provider, version | Debug behavior changes and cost spikes |
| input_tokens, output_tokens, cached_tokens | Cost allocation per feature / tenant |
| TTFT, total_latency_ms | SLO dashboards; separate infra vs model slowness |
| feature, user_id (hashed), session_id | Trace bad sessions; sample for human review |
| error_type, http_status | Alert on error rate > 1% |
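A per-call log record with these fields could be assembled as follows. This is a sketch: the field names mirror the table, the keyed hash uses HMAC-SHA256 with a placeholder key that would come from your secrets manager, and no prompt text is included.

```python
import hashlib
import hmac
import json

def build_log_record(*, model, provider, input_tokens, output_tokens,
                     ttft_ms, total_latency_ms, feature, user_id,
                     http_status, error_type=None, hash_key=b"placeholder-key"):
    """Return a JSON-serializable record with metadata only, never raw prompts."""
    hashed_user = hmac.new(hash_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
    return {
        "model": model,
        "provider": provider,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "ttft_ms": ttft_ms,
        "total_latency_ms": total_latency_ms,
        "feature": feature,
        "user_id_hash": hashed_user,    # stable per user, not reversible without the key
        "http_status": http_status,
        "error_type": error_type,
    }

record = build_log_record(
    model="gpt-4o-mini", provider="openai", input_tokens=812, output_tokens=97,
    ttft_ms=240, total_latency_ms=1310, feature="support_bot",
    user_id="alice@example.com", http_status=200,
)
print(json.dumps(record))
```

Emitting one flat JSON object per call keeps the records easy to aggregate by feature or tenant later.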

Production readiness checklist

  • Structured logging per LLM call with token counts and cost estimate
  • max_tokens and input length validated before the API call
  • Temperature 0 (or pinned) on deterministic code paths
  • Rate limit and timeout handling tested under load
  • Streaming to client for chat UX; graceful partial response on stream errors
  • Secrets in vault—not env files in the container image
  • Golden eval suite in CI (Track 5)—block merge on regression
  • Human review path for high-stakes outputs (medical, legal, money movement)
📦 Production checklist

Ship when you can answer: “What did we spend yesterday per feature?” and “Which prompt version is in prod?” If both are unknown, you are still in demo mode—wire logging and evals before marketing the AI feature.