What is an LLM?
A large language model (LLM) is a neural network, almost always a transformer, trained on vast amounts of text to assign probabilities to the next token given everything that came before. Generation is that same prediction repeated: sample a token, append it, and predict again until you hit a stop condition.
The model does not “look up” answers in a database at inference time (unless you add retrieval or tools). It completes patterns it saw during training—grammar, APIs, refund policies, code idioms—weighted by the prompt you send. That is why the same model can be brilliant on familiar tasks and confidently wrong on niche facts: it is optimizing for plausible continuation, not guaranteed truth.
Foundation model is the industry term for the base weights (GPT-4 class, Claude, Llama). Application is your system prompt, RAG chunks, tool results, guardrails, and UI—everything that turns raw completion into a product feature.
Common pitfall: treating the LLM as a search engine or database. Users ask factual questions, and the model invents plausible answers. For knowledge-heavy features, plan RAG or tool calls (Track 2) rather than relying on parametric memory alone.
The inference loop (what happens on every API call)
Each generation step follows the same pattern:
- Encode prompt — text → token IDs via the model’s tokenizer.
- Forward pass — transformer layers output a vector of logits (scores) over the vocabulary.
- Softmax — logits become a probability distribution over all possible next tokens.
- Sample — pick one token (argmax at temperature 0, or stochastic with temperature / top-p / top-k).
- Append & repeat — add token to context; stop at max_tokens, stop sequence, or EOS token.
Output length directly drives latency and cost: the forward pass, softmax, and sample steps run once per generated token. A 500-token answer takes roughly 500 forward passes. GPUs batch that work efficiently, but the cost is still proportional to output length.
Autoregressive means token t only sees tokens 1…t−1 (causal mask in attention). There is no parallel generation of the full answer inside one forward pass—providers hide this with streaming, which sends each token as it is sampled so time-to-first-token (TTFT) feels fast even when total time is unchanged.
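The loop above can be sketched with a toy stand-in for the model. This is a minimal illustration, not how any provider implements inference: `TOY_LOGITS` is a hypothetical hand-written table keyed by the previous token only, whereas a real transformer computes logits from the whole context.

```python
import math
import random

# Hypothetical toy "model": logits depend only on the previous token.
# A real LLM computes them from the full context with a transformer.
TOY_LOGITS = {
    "<bos>": {"the": 2.0, "a": 1.0, "<eos>": -1.0},
    "the": {"cat": 2.0, "dog": 1.5, "<eos>": -1.0},
    "a": {"cat": 1.0, "dog": 1.0, "<eos>": -1.0},
    "cat": {"sat": 2.0, "<eos>": 1.0},
    "dog": {"sat": 1.0, "<eos>": 2.0},
    "sat": {"<eos>": 3.0},
}


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def generate(max_tokens=10, greedy=True, seed=0):
    rng = random.Random(seed)
    context = ["<bos>"]
    for _ in range(max_tokens):               # stop condition: max_tokens
        logits = TOY_LOGITS[context[-1]]      # "forward pass" (toy lookup)
        probs = softmax(logits)
        if greedy:
            token = max(probs, key=probs.get)  # argmax, like temperature 0
        else:
            token = rng.choices(list(probs), weights=list(probs.values()))[0]
        if token == "<eos>":                  # stop condition: EOS token
            break
        context.append(token)                 # append and repeat
    return context[1:]


print(generate())  # ['the', 'cat', 'sat']
```

Each pass through the `for` loop produces exactly one token, which is why output length maps so directly onto latency and cost.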
Hosted API vs self-hosted inference
You almost always integrate via HTTP API or a local OpenAI-compatible server—not by loading weights yourself unless you operate ML infra.
| Option | Pros | Cons | When to use |
|---|---|---|---|
| Hosted frontier (OpenAI, Anthropic, Google) | Best quality, tools, long context, no GPU ops | Per-token cost, data leaves VPC, rate limits | Default for product features, agents, RAG answers |
| Cloud enterprise (Bedrock, Azure OpenAI, Vertex) | VPC, IAM, existing cloud billing, compliance | Model lag vs direct API, higher friction | Regulated industries, existing AWS/Azure/GCP spend |
| Local / self-host (Ollama, vLLM, llama.cpp) | Zero marginal token cost, data stays on box | Ops burden, weaker models, scaling GPUs | High volume, sensitive data, offline, fine-tuned weights |
Stripe and many SaaS products route classification and extraction to smaller hosted or self-hosted models, and reserve frontier models for complex reasoning. GitHub Copilot runs specialized models on dedicated inference infra—not a raw public API call per keystroke without caching and routing.
Self-hosting Llama 3.1 8B looks cheap on paper until you count GPU availability, model updates, and eval regressions when you change weights. Most teams start hosted, measure cost, then self-host only proven high-volume paths.
Transformer architecture (intuition, no math)
You do not need to implement attention to ship LLM features—but you need the mental model to debug context limits, “lost in the middle,” and why longer prompts cost more.
A decoder-only transformer (GPT-style) stacks repeated blocks. Each block roughly does:
- Self-attention — every token gathers information from prior tokens.
- Feed-forward MLP — per-token non-linear transform.
- Residual + layer norm — stabilizes deep stacks (dozens to 100+ layers on frontier models).
Deeper stacks tend to capture more abstract patterns (multi-step reasoning, code structure) at the cost of compute and memory. Model size (parameters) and context length are separate knobs: an 8B model can still have a 128K context window.
Tokenization: text → token IDs
Raw Unicode text is split into tokens using subword algorithms: BPE (GPT, Llama), WordPiece (older BERT lineage), SentencePiece (many multilingual models). Each token maps to an integer ID in a fixed vocabulary (often 32K–200K entries).
Token ≠ word. English averages ~4 characters per token, but:
- unbelievable → often 3 tokens (un, believ, able)
- One emoji → 1–4 tokens depending on tokenizer
- Code, JSON, and base64 → more tokens per character than prose
- CJK text → often one token per character or small groups
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
text = "Refund policy: partial months are not refundable on annual plans."
ids = enc.encode(text)
print("tokens:", len(ids))
print("ids:", ids[:12], "...")
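// jtokkit: OpenAI-compatible token counting in JVM services
import com.knuddels.jtokkit.Encodings;
import com.knuddels.jtokkit.api.Encoding;
import com.knuddels.jtokkit.api.EncodingRegistry;
import com.knuddels.jtokkit.api.EncodingType;

EncodingRegistry registry = Encodings.newDefaultEncodingRegistry();
// CL100K_BASE matches GPT-4 / GPT-3.5. GPT-4o uses o200k_base, so counts can differ
// from the Python example above; use that encoding if your jtokkit version provides it.
Encoding enc = registry.getEncoding(EncodingType.CL100K_BASE);
String text = "Refund policy: partial months are not refundable on annual plans.";
System.out.println("tokens: " + enc.countTokens(text));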
Embeddings and attention
Embedding layer: each token ID → dense vector (e.g. 4096 dimensions). Similar tokens sit near each other in space; the model learns these vectors during training.
Attention: for each token, the model computes how much to “look at” every previous token and builds a weighted mix. Classic example: “The animal didn't cross the street because it was tired.” — attention links it to animal, not street. Without that mechanism, pronoun resolution and long-range dependencies fail.
Output head: final hidden state → logits over vocabulary → softmax → probabilities → sample next token.
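To make the "weighted mix of previous tokens" concrete, here is a minimal single-head causal attention sketch in plain Python. It assumes the query, key, and value vectors are already given (in a real model they come from learned projections of the embeddings), and it leaves out multiple heads, batching, and positional encoding.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of vectors, one per token position.
    Position t only mixes values from positions 0..t.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for t, q in enumerate(queries):
        # Score the current query against every key up to and including t.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Weighted mix of the visible value vectors.
        mixed = [
            sum(w * values[s][d] for s, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Toy vectors: tokens 0 and 2 are "similar", token 1 is different.
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
_, weights = causal_attention(vecs, vecs, vecs)
for t, row in enumerate(weights):
    print(t, [round(w, 3) for w in row])
```

The first token can only attend to itself, and the last token puts more weight on the similar token 0 than on token 1. The nested loops also show where the quadratic cost in sequence length comes from.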
Common pitfall: budgeting with len(text) / 4 for production prompts. Code, JSON, and legal text can use about twice as many tokens as that estimate. Always count with the same tokenizer as the model you call, or use the provider’s counting API before you send.
“Explain attention in one minute.” — Each token produces query/key/value vectors; softmax weights determine which prior tokens matter; stacked layers compose local syntax into document-level reasoning. You do not need matrix formulas—show you understand why context length and position matter.
Temperature and sampling
After softmax, you rarely take the single highest-probability token for product work—you sample from the distribution. Sampling controls creativity, diversity, and reproducibility.
Temperature
Temperature scales logits before softmax. Lower → sharper distribution (more deterministic); higher → flatter (more random).
| Temperature | Effect | Typical use |
|---|---|---|
| 0 (or omitted with greedy) | Argmax — same input → same output (modulo provider updates) | JSON extraction, classification, code gen with tests |
| 0.2 – 0.5 | Low variance, still natural language | Support bots, internal copilots, policy Q&A |
| ~1.0 | Balanced — close to model’s raw distribution | General chat, brainstorming with review |
| 1.5 – 2.0+ | High chaos — rare words, unstable structure | Creative writing diversity; bad for facts |
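A small sketch of the scaling described above: divide the logits by the temperature, then apply softmax. The logits here are made-up numbers for three candidate tokens, and temperature 0 is treated as a separate argmax case because dividing by zero is undefined.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize to probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as argmax instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The top token's share shrinks as temperature rises: at 0.2 it takes almost all of the mass, while at 2.0 the three options are much closer together.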
Top-p (nucleus sampling)
Sort tokens by probability; take the smallest set whose cumulative mass ≥ p (e.g. 0.9); sample only within that set. Adapts to context—narrow when one token dominates, wide when many are plausible. Most production apps use temperature 0–0.7 + top_p 0.9–0.95.
Top-k
Sample only from the top k tokens (e.g. 40). Simpler but less adaptive than top-p. Some APIs expose both; if in doubt, prefer top-p.
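The two filters can be sketched as follows, assuming you already have a probability for each candidate token. The distributions are invented for illustration; hosted APIs apply these filters server-side, so this only shows the mechanics.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative mass is >= p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


peaked = {"yes": 0.92, "no": 0.05, "maybe": 0.03}    # one token dominates
flat = {"a": 0.3, "b": 0.3, "c": 0.2, "d": 0.2}      # many plausible tokens

print(sorted(top_p_filter(peaked, 0.9)))  # ['yes']
print(sorted(top_p_filter(flat, 0.9)))    # ['a', 'b', 'c', 'd']
print(sorted(top_k_filter(flat, 2)))      # ['a', 'b']
```

With the same p, the nucleus shrinks to one token on the peaked distribution and keeps all four on the flat one. A fixed k cannot adapt that way.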
# Extraction: temperature 0 + JSON mode (OpenAI strict schema mode in next guide)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

extract = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    response_format={"type": "json_object"},
    # JSON mode requires the word "JSON" to appear somewhere in the messages.
    messages=[{"role": "user", "content": "Extract the vendor name from: Acme Corp invoice #8821. Reply as JSON with a single key 'vendor'."}],
)
# Marketing draft: higher temperature, human review before send
draft = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.9,
    top_p=0.95,
    messages=[{"role": "user", "content": "Write three subject lines for a product launch email."}],
)
// Spring AI: pin temperature per use case.
// ChatOptions.builder() is the Spring AI 1.0 API; earlier milestones used ChatOptionsBuilder.
ChatResponse extraction = chatClient.prompt()
    .options(ChatOptions.builder().temperature(0.0).build())
    .user("Extract vendor name from: Acme Corp invoice #8821")
    .call()
    .chatResponse();  // call() returns a response spec, not a ChatResponse
ChatResponse draft = chatClient.prompt()
    .options(ChatOptions.builder().temperature(0.9).topP(0.95).build())
    .user("Write three subject lines for a product launch email.")
    .call()
    .chatResponse();
Lower temperature improves consistency but can cause repetitive phrasing in long outputs. Support bots: 0–0.3 with a fixed system prompt. Marketing copy: 0.7–1.0 with human review—not temperature 0.
“Why did JSON extraction return different fields on retry?” — Non-zero temperature, unpinned model version, or no schema validation. Answer: temperature 0, structured output / Pydantic validation, eval suite in CI, log model ID per request.
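As one way to picture the schema-validation step, here is a stdlib-only sketch that rejects malformed extraction output before it reaches downstream code. The field names in `REQUIRED_FIELDS` are hypothetical, and a library such as Pydantic would normally replace the hand-written checks.

```python
import json

# Hypothetical schema for the invoice example: field name -> expected type.
REQUIRED_FIELDS = {"vendor": str, "invoice_number": str}


def parse_extraction(raw):
    """Return the parsed dict, or raise ValueError so the caller can retry or escalate."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"wrong type for field: {field}")
    return data


print(parse_extraction('{"vendor": "Acme Corp", "invoice_number": "8821"}'))
```

Failing loudly on a missing or mistyped field turns "different fields on retry" into a countable error rather than a silent data-quality problem.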
Context window: the model's working memory
The context window is the maximum tokens the model can attend to in one request—your prompt (system + history + RAG chunks) plus the completion. Exceed it and providers truncate or error.
Representative limits (check provider docs—changes often)
| Model | Context (approx.) | Notes |
|---|---|---|
| GPT-4o | 128K tokens | Strong general capability; higher input cost on long prompts |
| Claude 3.5 Sonnet | 200K tokens | Strong instruction following; prompt caching on long system prompts |
| Gemini 1.5 Pro | 1M tokens | Long-doc ingest; verify quality at extreme length |
| GPT-4o-mini / Haiku / Flash | 128K+ / 200K / 1M | Cheap first pass; escalate on hard cases |
Tokens are not characters
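English prose averages about 4 characters per token. Code, tables, and non-Latin text diverge widely. A 50-page PDF pasted as text can blow a budget sized from prose averages (50 pages × 500 words ≈ 25,000 words, or roughly 33K tokens at ~0.75 words per token), because extracted tables, code, and layout debris tokenize less efficiently than clean prose.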
What happens at the limit
- Hard error: the API returns a 400 "context length exceeded" response; you must truncate or summarize.
- Silent truncation: some clients drop the oldest messages first, so critical instructions at the start can disappear while recent user text survives.
- Quality degradation: even within the limit, models suffer from "lost in the middle" and recall the start and end of a long context better than the middle. Put critical RAG chunks and rules at the beginning or end, not buried mid-prompt.
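One way to avoid both the hard error and uncontrolled truncation is to trim history yourself before the call. The sketch below drops the oldest non-system turns until the conversation fits. `trim_history` and its `count_tokens` parameter are illustrative names; in practice you would pass a counter built on the tokenizer for your model rather than the whitespace counter used in the demo.

```python
def trim_history(messages, max_tokens, count_tokens):
    """Drop the oldest non-system turns until the conversation fits.

    messages: list of {"role": ..., "content": ...} dicts, oldest first.
    count_tokens: callable returning the token count for a string.
    """
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(count_tokens(m["content"]) for m in msgs)

    # Always keep the newest turn, even if it alone exceeds the budget;
    # the caller should reject oversized input before this point.
    while len(turns) > 1 and total(system + turns) > max_tokens:
        turns.pop(0)
    return system + turns


def rough_count(text):
    """Stand-in counter for the demo only: one token per whitespace-separated word."""
    return len(text.split())


history = [
    {"role": "system", "content": "be brief"},
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user", "content": "six seven"},
]
for message in trim_history(history, max_tokens=6, count_tokens=rough_count):
    print(message["role"], "->", message["content"])
```

The system message always survives, so the rules at the start of the prompt are never the part that gets cut.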
Long context ≠ free
Billing is per input token. Attention compute inside the model scales roughly O(n²) with sequence length during prefill— providers price and rate-limit accordingly. Stuffing 500K tokens “because it fits” can cost dollars per request and add seconds of latency.
Token budget template (single request)
| Bucket | Example budget | Purpose |
|---|---|---|
| System prompt | 500 tokens | Role, rules, output format |
| Retrieved context (RAG) | 2,000 tokens | Top-K chunks after rerank |
| Conversation history | 1,000 tokens | Sliding window or summary |
| User message | 500 tokens | Current question |
| Response buffer | 1,000 (max_tokens) | Reserved for completion |
| Total | ~5,000 | Well within 128K—room to grow deliberately |
monthly_cost ≈ (avg_input_tokens × price_in + avg_output_tokens × price_out) × requests_per_day × 30, with prices expressed per token. Output tokens often cost 3–5× as much as input tokens on frontier models, so cap max_tokens on verbose tasks.
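A worked version of the cost estimate, using the budget template above (4,000 input tokens, 1,000 output tokens). The prices and the 10,000 requests per day are illustrative placeholders; substitute your provider's current rates and your own traffic.

```python
def monthly_cost(avg_input_tokens, avg_output_tokens,
                 price_in_per_million, price_out_per_million,
                 requests_per_day, days=30):
    """Estimate monthly spend in dollars from per-request token averages."""
    per_request = (avg_input_tokens * price_in_per_million
                   + avg_output_tokens * price_out_per_million) / 1_000_000
    return per_request * requests_per_day * days


# 4,000 input + 1,000 output tokens per request, 10,000 requests per day,
# at placeholder prices of $0.15 (input) and $0.60 (output) per 1M tokens.
estimate = monthly_cost(4_000, 1_000, 0.15, 0.60, 10_000)
print(f"${estimate:,.2f} per month")  # $360.00 per month
```

Both token terms are summed per request before multiplying by volume. In this example output is a fifth of the tokens but half of the bill.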
Before RAG: measure baseline prompt size with real system prompt + longest expected user message. If you are already above 8K tokens without retrieval, fix prompt bloat first—adding chunks makes recall worse, not better.
Training pipeline intuition
Frontier models go through multiple stages. Your application sits after all of them—you steer behavior with prompts, RAG, tools, and optionally fine-tuning—not by re-running pre-training.
1. Pre-training
Train on massive unlabeled text (web, books, code) with next-token prediction. The model learns grammar, facts (statistically), reasoning patterns, and languages—embedded in weights, static until retrained. Cost: millions of GPU-hours; only foundation labs do this routinely.
2. Supervised fine-tuning (SFT)
Curated instruction → response pairs teach “assistant” behavior: follow questions, refuse unsafe requests, use markdown, etc. This is why base Llama behaves differently from Llama-Instruct—you want instruct/chat variants for products.
3. RLHF (reinforcement learning from human feedback)
Humans rank multiple model answers; a reward model learns preferences; policy tuning nudges outputs toward helpful/harmless. Explains alignment quirks—models refuse more than raw pre-training would, and style varies by vendor.
4. Constitutional AI (Anthropic)
AI-generated critiques replace some human ranking for scale—same goal as RLHF with different data pipeline. As an app builder, you still add your policies via system prompt and guardrails.
| Technique | Updates weights? | Best for |
|---|---|---|
| Prompt engineering | No | Behavior, format, tone—try first |
| RAG | No | Fresh factual knowledge, citations |
| Fine-tuning (LoRA) | Yes (adapters) | Consistent style, domain format, smaller prompts |
| Full pre-training | Yes (all weights) | Not for product teams—foundation scale only |
Common pitfall: fine-tuning to “teach the model your docs.” New documents require retraining, and the model's knowledge drifts out of date between runs. Use RAG for knowledge and fine-tuning for behavior and format. Track 6 covers when fine-tuning actually pays off.
Model families and when to use each
Benchmarks (MMLU, HumanEval, MATH, GPQA) indicate broad capability—they do not predict your task. Always run evals on your data; use benchmarks to shortlist, not to decide.
| Family | Models | Strengths | Typical use |
|---|---|---|---|
| OpenAI | GPT-4o, GPT-4o-mini | Tools, JSON schema mode, broad ecosystem | General product features, agents, structured extraction |
| Anthropic | Claude 3.5 Sonnet, Haiku, Opus | Long context, instruction following, XML-friendly prompts | RAG over docs, coding assistants, careful refusals |
| Google | Gemini 1.5 Pro, Flash | Very long context, multimodal, competitive pricing | Large doc ingest, video/frame pipelines |
| Open weights | Llama 3.1, Mistral, Phi-3 | Self-host, fine-tune, no per-token vendor lock-in | Privacy, high volume, custom adapters |
Reference pricing (order of magnitude — verify current rates)
| Model | Input / 1M tokens | Output / 1M tokens |
|---|---|---|
| GPT-4o | ~$5 | ~$15 |
| GPT-4o-mini | ~$0.15 | ~$0.60 |
| Claude 3.5 Sonnet | ~$3 | ~$15 |
| Claude 3 Haiku | ~$0.25 | ~$1.25 |
| Gemini 1.5 Flash | ~$0.075 | ~$0.30 |
Routing pattern: cheap first, escalate on uncertainty
Classify or answer with Haiku / 4o-mini; if confidence is low (logprobs, self-check, or classifier), retry with Sonnet / GPT-4o. Cascade routing can save 60–80% at scale when most tickets are simple; see the cost guide in this track.
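The cascade can be sketched as a small function. Everything here is hypothetical scaffolding: `cheap_model` and `strong_model` stand in for real client calls, and how you derive the confidence score (logprobs, a self-check prompt, or a separate classifier) is left to your system.

```python
def cascade(question, cheap_model, strong_model, threshold=0.8):
    """Try the cheap model first; escalate when its confidence is low.

    cheap_model and strong_model are hypothetical callables returning
    (answer, confidence) with confidence in [0, 1].
    """
    answer, confidence = cheap_model(question)
    if confidence >= threshold:
        return {"answer": answer, "tier": "cheap", "confidence": confidence}
    answer, confidence = strong_model(question)
    return {"answer": answer, "tier": "strong", "confidence": confidence}


# Stub models for demonstration only.
def stub_cheap(question):
    # Pretend the small model is confident only on short, simple questions.
    return ("cheap answer", 0.95 if len(question) < 40 else 0.4)


def stub_strong(question):
    return ("strong answer", 0.9)


print(cascade("Reset my password", stub_cheap, stub_strong)["tier"])
print(cascade("Explain the proration rules for annual plan downgrades",
              stub_cheap, stub_strong)["tier"])
```

Log which tier answered each request. The escalation rate is the number that tells you whether the threshold and the savings estimate hold on your traffic.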
Pin model and log it on every request. Silent provider upgrades change behavior overnight— your eval suite (Track 5) should alert when output distribution shifts without a deploy on your side.
Perplexity routes queries across models and search infra; Cursor uses multiple models by task type (chat vs apply vs embed). Neither exposes a single “always GPT-4” path—routing is the product moat.
Your first chat completion
Chat APIs accept an ordered messages array: system (instructions), user (human input), assistant (prior model turns). Never commit API keys—use env vars locally and a secrets manager in production.
Minimal call (non-streaming)
Log usage.total_tokens (OpenAI) or input/output token fields on every response—cost attribution starts here.
# OpenAI Python SDK
from openai import OpenAI

client = OpenAI()  # OPENAI_API_KEY from environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=256,
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain what a token is in one sentence."},
    ],
)
print(response.choices[0].message.content)
print("prompt:", response.usage.prompt_tokens, "completion:", response.usage.completion_tokens)

# Anthropic Python SDK
import anthropic

client = anthropic.Anthropic()  # ANTHROPIC_API_KEY
message = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=256,
    temperature=0,
    system="You are a concise technical assistant.",
    messages=[{"role": "user", "content": "Explain what a token is in one sentence."}],
)
print(message.content[0].text)
print("in:", message.usage.input_tokens, "out:", message.usage.output_tokens)

# AWS Bedrock (boto3)
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 256,
    "temperature": 0,
    "system": "You are a concise technical assistant.",
    "messages": [{"role": "user", "content": "Explain what a token is in one sentence."}],
}
resp = bedrock.invoke_model(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    body=json.dumps(body),
)
payload = json.loads(resp["body"].read())
print(payload["content"][0]["text"])
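// Spring AI: ChatClient behind a REST endpoint
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
@RequestMapping("/v1")
public class ChatController {
    private final ChatClient chatClient;

    public ChatController(ChatClient.Builder builder) {
        this.chatClient = builder
            .defaultSystem("You are a concise technical assistant.")
            .build();
    }

    @GetMapping("/chat")
    public String chat(@RequestParam String question) {
        return chatClient.prompt()
            .user(question)
            .call()
            .content();
    }
}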
// LangChain4j
AnthropicChatModel model = AnthropicChatModel.builder()
    .apiKey(System.getenv("ANTHROPIC_API_KEY"))
    .modelName("claude-3-5-haiku-20241022")
    .temperature(0.0)
    .maxTokens(256)
    .build();
// generate(String) returns the reply text directly in LangChain4j 0.x;
// 1.x renames it to chat(String).
String answer = model.generate(
    "Explain what a token is in one sentence."
);
// Spring AI on Bedrock. Class and option names vary across Spring AI versions;
// check the Bedrock module docs for the release you use.
var chat = new BedrockAnthropicChatModel(
    BedrockAnthropicChatOptions.builder()
        .withModel("anthropic.claude-3-haiku-20240307-v1:0")
        .withTemperature(0.0f)
        .withMaxTokens(256)
        .build());
String text = chat.call(new Prompt("Explain what a token is in one sentence."))
    .getResult().getOutput().getContent();
Streaming (production default for user-facing chat)
Set stream=True (OpenAI) or use the streaming API on Anthropic/Bedrock. Forward tokens to the client over SSE so users see progress, and measure TTFT separately from total duration. The next guide in this track covers provider-specific streaming and error handling mid-stream.
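Measuring TTFT separately from total duration can be sketched without tying it to a provider. The helper below accepts any iterable of text chunks; `consume_stream`, `on_token`, and `fake_stream` are illustrative names, and with a real SDK you would first map each streamed event to its text delta.

```python
import time


def consume_stream(chunks, on_token, clock=time.perf_counter):
    """Forward chunks as they arrive; report TTFT and total duration in seconds."""
    start = clock()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = clock() - start   # time to first token
        parts.append(chunk)
        on_token(chunk)              # e.g. write an SSE event to the client
    return {"text": "".join(parts), "ttft_s": ttft, "total_s": clock() - start}


def fake_stream():
    """Stand-in for a provider stream: yields text pieces with small delays."""
    for piece in ["A token ", "is a ", "chunk of text."]:
        time.sleep(0.01)
        yield piece


result = consume_stream(fake_stream(), on_token=lambda piece: print(piece, end="", flush=True))
print()
print("ttft_s:", round(result["ttft_s"], 3), "total_s:", round(result["total_s"], 3))
```

If the stream produces nothing, `ttft_s` stays `None`. Counting that case separately helps distinguish empty responses from slow ones.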
Log metadata (latency, tokens, model ID, request ID)—not raw prompts if they contain PII. Hash or truncate user content in observability backends; restrict log access like production DB credentials.
What this looks like in production
A script that prints one completion is a spike. Production is a service with SLOs, cost controls, and proof that prompt changes do not regress behavior.
Architecture sketch
Client → your API → (optional RAG/tools) → LLM gateway → provider. Gateway owns API keys, model routing, rate limits, retries, caching, and per-team cost accounting— do not scatter provider keys across microservices.
Resilience
- 429 rate limits — exponential backoff with jitter; respect Retry-After.
- 503 / timeouts — retry idempotent reads; cap total deadline below client timeout.
- Model fallback — GPT-4o → 4o-mini → cached/static response for non-critical paths.
- Provider fallback — OpenAI down → Anthropic via unified client (LiteLLM, Portkey, or in-house).
- Circuit breaker — stop calling a failing provider after N errors; alert on-call.
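The first item in the list, backoff with jitter that respects Retry-After, might look like the sketch below. `RetryableError` is a hypothetical exception your client wrapper would raise for 429/503 responses; it is not part of any provider SDK. Model fallback and circuit breaking would wrap around this helper.

```python
import random
import time


class RetryableError(Exception):
    """Hypothetical error type for 429/503 responses."""

    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after  # seconds, parsed from Retry-After if present


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=8.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying RetryableError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RetryableError as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: random delay in [0, min(max_delay, base * 2^attempt)).
            delay = rng() * min(max_delay, base_delay * 2 ** attempt)
            if err.retry_after is not None:
                delay = max(delay, err.retry_after)  # respect Retry-After
            sleep(delay)


# Demo: fail twice, then succeed. Delays are recorded instead of slept.
calls = {"n": 0}
recorded = []


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RetryableError("429 Too Many Requests")
    return "ok"


print(call_with_backoff(flaky, sleep=recorded.append), "after", len(recorded), "retries")
```

Injecting `sleep` and `rng` keeps the retry policy testable without real waiting. In production, also cap the total time spent so the retries finish before the client's own timeout.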
Observability (log every call)
| Field | Why |
|---|---|
| model, provider, version | Debug behavior changes and cost spikes |
| input_tokens, output_tokens, cached_tokens | Cost allocation per feature / tenant |
| TTFT, total_latency_ms | SLO dashboards; separate infra vs model slowness |
| feature, user_id (hashed), session_id | Trace bad sessions; sample for human review |
| error_type, http_status | Alert on error rate > 1% |
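A per-call log record covering the fields in the table could be built like this. The field names, the salt handling, and the prices are assumptions for illustration; adapt them to your logging schema and load the salt from your secrets store.

```python
import hashlib
import json
import time


def hash_user_id(user_id, salt="replace-me"):
    """One-way hash so logs can group by user without storing the raw ID."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()[:16]


def build_llm_log(*, feature, user_id, model, provider, input_tokens,
                  output_tokens, ttft_ms, total_latency_ms, http_status,
                  price_in_per_million, price_out_per_million,
                  error_type=None):
    """Assemble one structured log record for a single LLM call."""
    cost = (input_tokens * price_in_per_million
            + output_tokens * price_out_per_million) / 1_000_000
    return {
        "ts": time.time(),
        "feature": feature,
        "user_id": hash_user_id(user_id),
        "model": model,
        "provider": provider,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "ttft_ms": ttft_ms,
        "total_latency_ms": total_latency_ms,
        "http_status": http_status,
        "error_type": error_type,
        "cost_usd_estimate": round(cost, 6),
    }


record = build_llm_log(
    feature="support_chat", user_id="user-123", model="gpt-4o-mini",
    provider="openai", input_tokens=1200, output_tokens=300,
    ttft_ms=420, total_latency_ms=1850, http_status=200,
    price_in_per_million=0.15, price_out_per_million=0.60,  # placeholder prices
)
print(json.dumps(record))
```

The record holds metadata only: no prompt or completion text, and the user ID is hashed before it leaves the service.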
Production readiness checklist
- Structured logging per LLM call with token counts and cost estimate
- max_tokens and input length validated before the API call
- Temperature 0 (or pinned) on deterministic code paths
- Rate limit and timeout handling tested under load
- Streaming to client for chat UX; graceful partial response on stream errors
- Secrets in vault—not env files in the container image
- Golden eval suite in CI (Track 5)—block merge on regression
- Human review path for high-stakes outputs (medical, legal, money movement)
Ship when you can answer: “What did we spend yesterday per feature?” and “Which prompt version is in prod?” If both are unknown, you are still in demo mode—wire logging and evals before marketing the AI feature.