RAG explained
Retrieval-augmented generation (RAG) gives an LLM just-in-time knowledge: search your corpus at query time, inject the best-matching passages into the prompt, and instruct the model to answer from that evidence. Production RAG is not “vector DB + ChatGPT”—it is chunking, recall, reranking, citations, ACLs, and evals wired into a service with SLOs. This guide establishes the mental model before you tune chunk sizes or hybrid search.
After reading, you should be able to: explain RAG vs fine-tuning vs long context; draw the ingest→retrieve→generate pipeline; name the top failure modes; implement a minimal top-K RAG loop in Python and Java; sketch recall@k evals and a production checklist with logging and citations.
What is RAG?
Retrieval-augmented generation combines a search step with an LLM generation step. Instead of hoping the model “remembered” your refund policy from pre-training, you retrieve the relevant policy chunk at question time and augment the prompt with it before the model writes an answer.
Think of RAG as an open-book exam for the LLM. The model’s parametric memory (weights frozen at training) holds language, reasoning patterns, and broad public facts—but not your private wiki, ticket exports, or yesterday’s pricing sheet. RAG supplies just-in-time knowledge: fresh, attributable, and revocable when documents change.
The pattern was popularized by Lewis et al. (2020) but the production shape is simpler than the paper: index offline (chunk → embed → store), retrieve online (embed query → nearest neighbors), generate (stuff context + question → LLM). Your application owns chunk boundaries, metadata, ACL filters, and the system prompt that says “answer only from context; cite sources.”
When RAG is the right default
- Private or frequently updated knowledge — HR policies, API docs, support macros, legal playbooks.
- Answers must be attributable — users need links or doc IDs, not plausible prose.
- Corpus too large for one context window — even 1M-token models cannot afford to paste everything per query.
- Multiple tenants or ACLs — filter retrieval by tenant_id or role before generation.
When RAG is not enough alone
- Precise structured queries — “sum of invoices last quarter” needs SQL or an API tool, not semantic search over PDFs.
- Behavior and tone — consistent brand voice or JSON shape is prompt engineering or fine-tuning, not retrieval.
- Multi-hop reasoning across many systems — agents with tools (Track 4) often wrap RAG as one capability among many.
A common mistake is calling any chatbot that reads uploaded files “RAG.” Upload-to-context (pasting an entire PDF into the prompt) is not retrieval: it breaks on large documents, its cost grows linearly with the amount of text pasted, and it provides no citation granularity. Real RAG retrieves top-K chunks, not whole files, unless you explicitly use parent-document expansion (Guide 4).
Notion AI, Intercom Fin, and internal bank copilots share the same skeleton: ingest docs → index → retrieve on question → generate with citations → log chunk IDs for feedback. The moat is ingest quality and evals, not which vector DB logo appears in a slide deck.
Start with 20–50 golden questions from real users before tuning chunk size. If recall@5 is below 80% on those questions, no amount of prompt polish fixes wrong retrieval—fix search first (Guides 2–4), then generation prompts.
RAG vs fine-tuning vs long context
Teams debate three ways to “teach the model your stuff.” They solve different problems. RAG injects facts at query time. Fine-tuning updates weights for behavior and format. Long context pastes more text into one prompt—useful, but not a substitute for search at scale.
| Approach | Updates weights? | Knowledge freshness | Best for | Weakness |
|---|---|---|---|---|
| RAG | No | Minutes (re-index on doc change) | Facts, policies, docs, citations | Retrieval must work; latency + infra |
| Fine-tuning (LoRA) | Yes (adapters) | Stale until retrain | Tone, format, domain phrasing, classification | Expensive to refresh knowledge; hallucination on new facts |
| Long context paste | No | Static per request | Small corpus, one-shot analysis | Cost O(n) per query; lost-in-the-middle; no ACL per chunk |
| Tools / SQL / APIs | No | Live from source systems | Structured data, actions, calculations | Requires schema design and auth—not prose search |
RAG for facts, fine-tuning for behavior
If marketing updates the refund policy weekly, RAG re-indexes the PDF—fine-tuning would require a new training run every time. If support replies must always use a specific JSON schema and empathetic tone, LoRA or heavy system prompting may beat stuffing examples into every RAG prompt.
Production systems often combine all three: RAG for knowledge, fine-tuned small model for routing/format, tools for billing lookups, and a frontier model for hard reasoning—with evals proving each layer still works after changes.
Fine-tuning on your entire doc corpus “so we don’t need RAG” fails the day a policy changes and the model still cites the old version confidently. Use fine-tuning when the task is how to answer; use RAG when the task is what is true.
RAG per-query cost ≈ embedding(query) + vector search + LLM(input chunks + question). Fine-tuning upfront: GPU hours + eval cycles; marginal inference may be cheaper if you shrink prompts. Run the spreadsheet: if 90% of value is factual freshness, RAG wins on TCO—Track 1’s cost guide helps size token budgets.
“When would you fine-tune instead of RAG?” — Strong answer: fine-tune for consistent output structure, classification, or domain language; RAG for mutable knowledge and citations; never fine-tune to memorize facts that change weekly. Mention eval harness and rollback for both paths.
Fine-tuning adjusts the model’s weights using training examples; RAG changes the input tokens the model conditions on. At inference, a RAG call runs the same transformer with the same weights, but the attention layers see your retrieved chunks instead of relying on weights that encode last month’s policy. That is why RAG is easier to audit: you can log exactly which chunks were in context.
The basic RAG pipeline
Every production RAG system decomposes into an offline indexing path and an online query path. Debug by asking which stage failed—bad chunks, bad embeddings, bad search, or bad generation.
Offline: build the index
- Document — load PDF, HTML, Markdown, Confluence export, ticket JSON.
- Chunk — split into passages (~200–800 tokens) with optional overlap and metadata (source, page, section).
- Embed — run each chunk through an embedding model → fixed-size vector (e.g. 1536 dims).
- Store — upsert vectors + metadata into a vector database or pgvector table with ANN index.
Online: answer a question
- Query — user question (optionally rewritten or expanded—Guide 4).
- Embed query — same embedding model as indexing (critical—model mismatch kills recall).
- Retrieve — nearest-neighbor search for top-K chunks; apply metadata filters (tenant, ACL, date).
- Stuff — assemble prompt: system rules + numbered context blocks + user question.
- Generate — LLM produces answer; ideally with inline citations mapping to chunk IDs.
flowchart LR
    subgraph offline["Offline indexing"]
        D[Document] --> C[Chunk]
        C --> E[Embed]
        E --> S[(Vector store)]
    end
    subgraph online["Online query"]
        Q[User query] --> EQ[Embed query]
        EQ --> R[Retrieve top-K]
        S --> R
        R --> ST[Stuff prompt]
        ST --> G[Generate answer]
    end
Stuffing strategies (how context lands in the prompt)
Stuffing is assembling retrieved text into the LLM prompt. Three common patterns:
- Simple concat — numbered blocks separated by blank lines; cheapest; works for support Q&A.
- XML / tagged blocks — <document id="..."> wrappers; Claude and some enterprise templates parse these reliably.
- Map-reduce — retrieve many chunks, summarize each in parallel, then answer from summaries; for corpora where K×chunk_size exceeds budget (Guide 5).
Put the user question after context so the model’s last strong signal is what to answer. Duplicate critical instructions at the top (system) and a one-line reminder after context (“Use only the above.”).
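The tagged-block pattern and the ordering advice above can be sketched as a small prompt builder. This is a minimal illustration, not a canonical template: the function name `build_prompt`, the chunk dict keys (`chunk_id`, `text`), and the exact wording of the rules are assumptions made for the example.

```python
from xml.sax.saxutils import escape, quoteattr


def build_prompt(question: str, chunks: list[dict]) -> list[dict]:
    """Assemble chat messages: rules first, tagged context, question last.

    Each chunk is assumed to be a dict with "chunk_id" and "text" keys.
    """
    system = (
        "Answer using ONLY the documents below. Cite document ids. "
        "If the documents are insufficient, say you don't know."
    )
    # Escape chunk text so stray "<" or "&" cannot break the tag structure.
    blocks = [
        f"<document id={quoteattr(c['chunk_id'])}>\n{escape(c['text'])}\n</document>"
        for c in chunks
    ]
    user = (
        "\n\n".join(blocks)
        + "\n\nUse only the documents above.\n"  # one-line reminder after context
        + f"Question: {question}"  # the question is the last thing the model reads
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```

The returned list has the same shape as the `messages` argument of a chat-completions call, so it can replace hand-built f-strings without changing the rest of the query path.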
Metadata you should store on every chunk
| Field | Example | Used for |
|---|---|---|
| chunk_id | doc_8821_p3_c2 | Logging, citations, feedback loops |
| source_uri | s3://kb/refund-policy.pdf | Link-back in UI |
| title, section | “Refund policy §2.1” | Prompt headers, debugging |
| updated_at | ISO timestamp | Recency bias, stale answer warnings |
| acl / tenant_id | role:internal, tenant:acme | Filter before retrieve—never after |
Bi-encoder retrieval embeds query and chunks independently, then ranks by cosine similarity. It is fast at scale because it works with ANN indexes. Cross-encoder reranking (Guide 4) scores query–chunk pairs jointly; it is slower but typically much more accurate on the top-20 candidates. Production stacks commonly bi-encode first and rerank second.
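The two-stage control flow can be sketched in plain Python. This shows only the shape of retrieve-then-rerank: the first stage is an exact cosine scan over precomputed vectors (a real system would use an ANN index), and `pair_scorer` is a hypothetical callable standing in for whatever cross-encoder you plug in. All names here are assumptions for illustration.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query: str,
    query_vec: list[float],
    chunks: list[dict],
    pair_scorer: Callable[[str, str], float],
    candidates: int = 20,
    keep: int = 5,
) -> list[dict]:
    """Stage 1: cheap vector ranking. Stage 2: joint query-chunk scoring."""
    # Stage 1: rank every chunk by similarity of independently computed vectors.
    first_pass = sorted(
        chunks, key=lambda c: cosine(query_vec, c["vector"]), reverse=True
    )[:candidates]
    # Stage 2: rescore only the shortlist with the slower pairwise scorer.
    return sorted(
        first_pass, key=lambda c: pair_scorer(query, c["text"]), reverse=True
    )[:keep]
```

Note that the reranker can only reorder what stage 1 returned, so `candidates` bounds the recall of the whole pipeline.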
Log the retrieval set (chunk IDs + scores) separately from the LLM response. When users report a wrong answer, you can see in one glance whether retrieval missed the right doc or the model ignored good context—two different fixes.
Stripe’s support AI and many fintech copilots run nightly ingest jobs that diff wiki changes, re-chunk only affected pages, and version indexes so rollback is “point alias to index v42.” Treat the index as a deploy artifact with its own version number.
Failure modes (what breaks in production)
Most “the AI lied” incidents are retrieval or assembly failures—not model IQ. Learn to classify failures before swapping to GPT-4.5.
1. Wrong chunks retrieved (low recall)
The answer existed in your corpus but never appeared in top-K. Causes: chunk split mid-sentence, embedding model mismatch, synonym gap (“billing” vs “invoicing”), stale index, or ACL filter too aggressive. Fix: hybrid BM25 + dense search, query rewriting, better chunking (Guide 2), reranker (Guide 4).
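One common way to combine a BM25 ranking with a dense ranking is reciprocal rank fusion (RRF). The text above does not prescribe a fusion method, so treat this as one option among several; the function name and the example chunk IDs are hypothetical, and `k=60` is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) per chunk, with rank starting at 1,
    so chunks that rank well in several lists rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


# Hypothetical rankings for the question "how does invoicing work?"
bm25_ranking = ["invoice_faq", "billing_1", "refund_0"]
dense_ranking = ["billing_1", "pricing_2", "invoice_faq"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

RRF uses only ranks, not raw scores, which sidesteps the problem that BM25 scores and cosine similarities live on different scales.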
2. Chunk size too small
128-token chunks lose surrounding context—a table header lands in one chunk, rows in another. The model sees “Column A” without knowing it is “Annual renewal fee.” Fix: larger chunks, parent-child retrieval, or structure-aware splitting for tables.
3. Chunk size too large
2,000-token chunks dilute relevance signals; one relevant sentence sits in a sea of boilerplate. Embedding averages semantics across the whole chunk—precision drops. Fix: smaller chunks + reranker, or semantic chunking at paragraph boundaries.
4. LLM ignores retrieved context
Retrieval worked; the model still answers from parametric memory or invents details. Common when system prompt is weak, context is buried mid-prompt, or chunks contradict each other. Fix: “Answer ONLY from context below; if insufficient say I don’t know”; put rules at start and end; reduce K; use citation-required formats; lower temperature for Q&A.
5. Hallucination despite good retrieval
Model paraphrases incorrectly or merges two policies. Especially on numbers, dates, and legal language. Fix: require verbatim quotes for critical fields; structured output; human review for high-stakes; evals that check numeric equality not BLEU score.
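A numeric-equality check can be sketched as follows: every number in the answer must also appear in some retrieved chunk. This is a deliberately crude illustration. It compares number strings, so "30" and "30.0" would not match, and the helper names are assumptions for the example.

```python
import re

NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
CITATION = re.compile(r"\[\d+\]")


def numbers_in(text: str) -> set[str]:
    """Extract number strings, dropping thousands separators ("1,000" -> "1000")."""
    return {m.group().replace(",", "") for m in NUMBER.finditer(text)}


def unsupported_numbers(answer: str, chunks: list[str]) -> set[str]:
    """Numbers stated in the answer that appear in none of the retrieved chunks."""
    supported: set[str] = set()
    for chunk in chunks:
        supported |= numbers_in(chunk)
    # Strip citation markers like [1] so they are not mistaken for facts.
    return numbers_in(CITATION.sub("", answer)) - supported
```

An empty result means every figure in the answer is grounded in context; a non-empty result is a candidate hallucination worth failing an eval on or routing to review.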
6. Latency and cost blowups
Sequential embed + retrieve + large context + slow model → multi-second p95. Embedding every query, retrieving K=20, stuffing 15K tokens, then calling GPT-4o adds up. Fix: cache query embeddings; cap context budget; rerank top-20 → keep top-5; cascade to smaller models; parallel retrieve + LLM prefill optimizations (Guide 6).
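Capping the context budget can be sketched as a greedy selection over scored chunks. The token counter here defaults to a rough four-characters-per-token heuristic, which is an assumption for illustration; swap in your model's tokenizer for real budgets. The function name and dict keys are also hypothetical.

```python
from typing import Callable


def cap_context(
    chunks: list[dict],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s) // 4,
) -> list[dict]:
    """Keep the highest-scoring chunks that fit within max_tokens.

    Each chunk is assumed to be a dict with "text" and "score" keys.
    """
    kept, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        cost = count_tokens(chunk["text"])
        if used + cost > max_tokens:
            continue  # too big for the remaining budget; a smaller chunk may still fit
        kept.append(chunk)
        used += cost
    return kept
```

Running this after the reranker means the budget is spent on the chunks the reranker trusts most, rather than on whatever the first-stage search happened to return.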
| Symptom | Likely stage | First debug step |
|---|---|---|
| “That policy doesn’t exist” (but it does in wiki) | Retrieve / chunk | Search index manually for question keywords |
| Answer cites doc but details wrong | Generate | Compare answer to retrieved chunk text |
| Works in English, fails in Spanish | Embed / chunk | Multilingual embedding model; language metadata filter |
| Random user sees confidential doc | Store / ACL | Audit metadata filters at query time |
| p95 > 8s | Whole pipeline | Trace waterfall: embed ms, ANN ms, LLM TTFT |
A common mistake is raising K from 5 to 50 “so we don’t miss anything.” The extra noise crowds the context window, increases cost, and triggers lost-in-the-middle effects: recall may improve slightly while answer quality drops. Prefer reranking over raw K inflation.
“Our RAG bot gives wrong refund answers. How do you debug?” Strong answer: walk the pipeline. Take a golden question → log the retrieved chunks → have a human check whether the correct doc is present → if yes, it is a prompt/generation issue; if no, it is a chunking/search issue. Mention the eval set and index versioning.
Aggressive “I don’t know” when context is thin reduces hallucination but increases deflection rate. Product must choose: support bot may prefer safe deflection; sales copilot may prefer best-effort with disclaimer.
Minimal RAG implementation
Below is the smallest production-shaped loop: embed the query, retrieve top-K from a vector store, stuff context into a template, and generate with citation instructions. Guide 3 replaces Chroma with pgvector or Pinecone; the control flow stays the same.
Sample corpus (three support docs)
Use the same fictional Acme Corp snippets across Track 2 guides so evals stay comparable.
- refund-policy.txt — 30-day annual refunds; no partial months on monthly plans.
- shipping.txt — 3–5 business days standard US; express $12.
- password-reset.txt — Settings → Security; links expire 24h.
End-to-end query path
from openai import OpenAI
import chromadb

client = OpenAI()  # reads OPENAI_API_KEY from the environment
# Cosine space makes `1 - distance` below a cosine similarity. Chroma's default
# space is squared L2, and the space is fixed when the collection is first created.
collection = chromadb.PersistentClient(path="./chroma").get_or_create_collection(
    "acme_kb", metadata={"hnsw:space": "cosine"}
)

def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts; index and query must use the same model."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def retrieve(question: str, k: int = 5) -> list[dict]:
    """Return the top-k chunks with their metadata and a similarity score."""
    q_vec = embed([question])[0]
    hits = collection.query(
        query_embeddings=[q_vec],
        n_results=k,
        include=["documents", "metadatas", "distances"],
    )
    return [
        {"text": doc, "meta": meta, "score": 1 - dist}
        for doc, meta, dist in zip(hits["documents"][0], hits["metadatas"][0], hits["distances"][0])
    ]

RAG_SYSTEM = """Answer using ONLY the numbered context below.
Cite sources as [1], [2]. If context is insufficient, say you don't know."""

def answer(question: str) -> str:
    """Stuff retrieved chunks into a numbered context block, then generate."""
    chunks = retrieve(question)
    context = "\n\n".join(
        f"[{i + 1}] ({c['meta']['source']})\n{c['text']}" for i, c in enumerate(chunks)
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": RAG_SYSTEM},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("Can I get a refund on a monthly plan after two weeks?"))
// Spring AI: VectorStore + ChatClient (index built at startup or via batch job)
// Import paths assume Spring AI 1.0 package names.
import java.util.List;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

@Service
public class RagService {
    private final VectorStore vectorStore;
    private final ChatClient chatClient;

    public RagService(VectorStore vectorStore, ChatClient.Builder builder) {
        this.vectorStore = vectorStore;
        this.chatClient = builder
            .defaultSystem("""
                Answer using ONLY the numbered context below.
                Cite sources as [1], [2]. If context is insufficient, say you don't know.""")
            .build();
    }

    public String answer(String question) {
        // Retrieve the top-5 chunks; the VectorStore embeds the query internally.
        List<Document> hits = vectorStore.similaritySearch(
            SearchRequest.builder().query(question).topK(5).build());
        // Build numbered context blocks so the model can cite [1], [2], ...
        StringBuilder context = new StringBuilder();
        for (int i = 0; i < hits.size(); i++) {
            Document d = hits.get(i);
            context.append("[").append(i + 1).append("] (")
                .append(d.getMetadata().get("source")).append(")\n")
                .append(d.getText()).append("\n\n");
        }
        return chatClient.prompt()
            .user("Context:\n" + context + "\nQuestion: " + question)
            .call()
            .content();
    }
}
Indexing side (run once or on schedule)
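from pathlib import Path

# Reuses embed() and collection from the query-path code above.
def chunk_file(path: Path, size: int = 400, overlap: int = 80) -> list[dict]:
    """Split a file into overlapping windows; size and overlap are in characters, not tokens."""
    text = path.read_text(encoding="utf-8")
    chunks, start, i = [], 0, 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append({"id": f"{path.stem}_{i}", "text": text[start:end], "source": path.name})
        # Step back by `overlap` so neighboring chunks share context, except at end of file.
        start = end - overlap if end < len(text) else end
        i += 1
    return chunks

all_chunks = [c for p in Path("docs").glob("*.txt") for c in chunk_file(p)]
vectors = embed([c["text"] for c in all_chunks])
collection.upsert(
    ids=[c["id"] for c in all_chunks],
    embeddings=vectors,
    documents=[c["text"] for c in all_chunks],
    # Store chunk_id in metadata too, so retrieved chunks can be matched against golden sets.
    metadatas=[{"source": c["source"], "chunk_id": c["id"]} for c in all_chunks],
)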
// Spring AI DocumentReader + batch embed at startup (simplified)
// chunk(text, size, overlap) is a helper not shown here; it returns List<String>.
@Bean
CommandLineRunner indexKb(VectorStore store, ResourcePatternResolver resolver) {
    return args -> {
        for (Resource r : resolver.getResources("classpath:docs/*.txt")) {
            String text = r.getContentAsString(StandardCharsets.UTF_8);
            List<Document> docs = chunk(text, 400, 80).stream()
                .map(t -> new Document(t, Map.of("source", r.getFilename())))
                .toList();
            store.add(docs); // EmbeddingModel invoked by VectorStore implementation
        }
    };
}
Indexing 10K chunks with text-embedding-3-small ≈ 10K × avg 200 tokens × $0.02/1M ≈ $0.04 one-time. Per query: ~50 tokens embed + ~2K context + ~300 output on gpt-4o-mini—fractions of a cent at moderate volume. Dominant cost is often re-embedding on every doc edit if you fail to incrementalize (Guide 6).
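The arithmetic above can be written out as a small calculator. The embedding price is the one quoted in the text; the gpt-4o-mini input and output prices are assumptions for illustration and should be checked against current pricing before you rely on the result.

```python
PRICE_PER_TOKEN = {
    "embed": 0.02 / 1_000_000,       # text-embedding-3-small, as quoted in the text
    "llm_input": 0.15 / 1_000_000,   # ASSUMED gpt-4o-mini input price
    "llm_output": 0.60 / 1_000_000,  # ASSUMED gpt-4o-mini output price
}


def indexing_cost(num_chunks: int, avg_tokens_per_chunk: int) -> float:
    """One-time cost in dollars to embed the whole corpus."""
    return num_chunks * avg_tokens_per_chunk * PRICE_PER_TOKEN["embed"]


def query_cost(query_tokens: int, context_tokens: int, output_tokens: int) -> float:
    """Per-query cost in dollars: embed the query, then pay for LLM input and output."""
    embed = query_tokens * PRICE_PER_TOKEN["embed"]
    llm_in = (context_tokens + query_tokens) * PRICE_PER_TOKEN["llm_input"]
    llm_out = output_tokens * PRICE_PER_TOKEN["llm_output"]
    return embed + llm_in + llm_out


one_time = indexing_cost(10_000, 200)  # 2M tokens at $0.02 per 1M
per_query = query_cost(50, 2_000, 300)
```

Under these assumed prices the LLM input tokens dominate the per-query cost and the query embedding is negligible, which is why capping context size matters more than optimizing the embed call.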
A common mistake is using different embedding models for index and query, or upgrading the embedding model without a full re-index. The vectors then live in incompatible spaces and similarity scores become meaningless. Version your embedding model in index metadata and block queries against mismatched indexes.
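A guard against model mismatch can be as small as one check at query time. In this sketch, `index_metadata` is a hypothetical dict stored alongside the index at ingest time (for example in a collection's metadata or a sidecar table), and the exception and function names are assumptions for illustration.

```python
class EmbeddingModelMismatch(RuntimeError):
    """Raised when the query-time model differs from the model used to build the index."""


def assert_same_embedding_model(index_metadata: dict, query_model: str) -> None:
    """Refuse to query an index that was built with a different embedding model."""
    indexed_with = index_metadata.get("embedding_model")
    if indexed_with != query_model:
        raise EmbeddingModelMismatch(
            f"index built with {indexed_with!r}, query uses {query_model!r}; "
            "re-index before serving queries"
        )
```

Failing loudly here is the point: a mismatched index still returns results, just meaningless ones, so without the guard the failure would only show up as quietly degraded answers.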
Wrap retrieve() and answer() in separate functions with unit tests. Feed retrieve() a golden question and assert the expected chunk_id is in top-5 before you ever tune prompts.
Eval basics — prove retrieval works
RAG quality splits into retrieval metrics (did we fetch the right chunks?) and generation metrics (did the answer match ground truth given those chunks?). Measure retrieval first—generation evals lie when context is wrong.
Golden questions
Collect 30–100 real user questions with expected answer snippets and source document IDs. Store as JSON in git; run nightly against staging index. Each row:
- question — user text
- expected_chunk_ids — at least one must appear in top-K
- expected_answer_contains — substring checks for key facts
- tags — refund, shipping, edge-case
Recall@K
Recall@K = fraction of golden questions where at least one expected chunk appears in the top-K retrieved set. If Recall@5 = 0.72, 28% of questions never get the right context—fix chunking/search before prompt tuning.
recall_at_k = hits / total_golden where a hit means expected_chunk_ids ∩ retrieved_ids ≠ ∅.
GOLDEN = [
    {"q": "monthly plan refund?", "expected_ids": {"refund-policy_0"}},
    {"q": "express shipping cost?", "expected_ids": {"shipping_0"}},
]

def recall_at_k(golden, k=5) -> float:
    """Fraction of golden questions with at least one expected chunk in the top-k."""
    hits = 0
    for row in golden:
        # Assumes each chunk's metadata carries chunk_id, stored at ingest time.
        retrieved_ids = {c["meta"]["chunk_id"] for c in retrieve(row["q"], k=k)}
        if row["expected_ids"] & retrieved_ids:
            hits += 1
    return hits / len(golden)

assert recall_at_k(GOLDEN, k=5) >= 0.8, "Fix retrieval before shipping"
record GoldenRow(String question, Set<String> expectedChunkIds) {}

// ragService.retrieveIds(question, k) is a helper not shown above; it is assumed
// to return the chunk IDs of the top-k retrieved documents.
double recallAtK(List<GoldenRow> golden, int k) {
    long hits = golden.stream()
        .filter(row -> {
            Set<String> retrieved = ragService.retrieveIds(row.question(), k);
            return row.expectedChunkIds().stream().anyMatch(retrieved::contains);
        })
        .count();
    return (double) hits / golden.size();
}
MRR and precision (optional retrieval metrics)
Mean reciprocal rank (MRR) rewards ranking the first correct chunk higher: MRR = average(1 / rank_of_first_hit). If the right chunk is always retrieved but at position 8, Recall@10 looks fine while stuffing quality suffers—MRR catches that.
Precision@K measures noise: what fraction of retrieved chunks are relevant? Low precision means you are paying tokens to confuse the model—tighten K or add a reranker threshold.
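Both metrics can be sketched in a few lines. The function names and the input shape (a ranked list of retrieved chunk IDs plus a set of relevant IDs per question) are assumptions for illustration; precision here divides by K even when fewer than K chunks come back, which is one common convention.

```python
def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k slots filled with relevant chunks."""
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / k
```

Reporting MRR next to Recall@K makes the "retrieved, but at position 8" case visible: recall stays at 1.0 while the reciprocal rank drops to 0.125.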
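Vector search returns approximate nearest neighbors (HNSW, IVF), not a guaranteed global optimum: depending on index parameters, the true nearest chunk can be missing from the result list entirely. Separately, embedding similarity is only a proxy for relevance, so a loosely related chunk at cosine 0.82 can outrank the true answer at 0.79 even under exact search. That is why hybrid retrieval (keyword + vector) and reranking are standard, not optional luxuries.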
Generation checks (lightweight)
- Contains check — answer includes “partial months” for monthly refund question.
- Citation check — answer references [1] and retrieved chunk [1] is refund-policy.
- Refusal check — out-of-corpus questions should trigger “don’t know,” not fabrication.
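The three checks above can be sketched as deterministic functions suitable for CI. The function names, the refusal marker phrases, and the chunk dict shape (a `source` key per chunk, in the same order as the numbered context) are assumptions for illustration.

```python
import re


def contains_check(answer: str, phrases: list[str]) -> bool:
    """Every expected phrase appears in the answer (case-insensitive)."""
    lowered = answer.lower()
    return all(p.lower() in lowered for p in phrases)


def citation_check(answer: str, chunks: list[dict], expected_source: str) -> bool:
    """At least one [n] in the answer points at a chunk from the expected source."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return any(
        1 <= n <= len(chunks) and chunks[n - 1]["source"] == expected_source
        for n in cited
    )


def refusal_check(answer: str, markers=("don't know", "do not know")) -> bool:
    """The answer contains a refusal phrase; curly apostrophes are normalized first."""
    normalized = answer.lower().replace("\u2019", "'")
    return any(m in normalized for m in markers)
```

Because these are plain string checks, they are cheap enough to run on every golden row on every deploy, leaving judge-based scoring for spot checks.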
Track 5 (Evaluation, safety & quality) covers RAGAS, LLM-as-judge, CI gates, and regression alerts when embedding model or chunk strategy changes. Wire recall@5 in CI now—even a 10-row golden set catches catastrophic index regressions.
“How do you evaluate RAG?” — Separate retrieval vs generation; report Recall@K and MRR on labeled chunk IDs; use answer correctness only after retrieval passes; run evals on every index deploy; sample production queries with chunk IDs logged for human review.
Teams that ship RAG without golden sets discover regressions from user tweets, not dashboards. Perplexity-style products invest heavily in offline eval + online thumbs-down tied to retrieved URLs—copy that feedback loop early.
LLM-as-judge evals are fast to author but can favor verbose answers and correlate with the judge model’s biases. Use deterministic checks (recall, substring, JSON schema) for CI; use judges for exploratory analysis and spot checks.
What this looks like in production
A notebook that answers three questions is a spike. Production RAG is a service with versioned indexes, ACL-aware retrieval, structured logs, citations users can verify, and eval gates on every deploy.
Architecture sketch
Client → RAG API → (optional query rewrite) → embed → vector store with metadata filters → rerank → prompt builder → LLM gateway → response with citations. Ingest runs asynchronously: webhook or cron → parse → chunk → embed → upsert → alias flip to new index version.
Logging (every query)
| Field | Why |
|---|---|
| request_id, user_id (hashed) | Trace bad sessions; support replay |
| index_version, embedding_model | Debug regressions after deploy |
| retrieved_chunk_ids, scores | Retrieval debugging without storing full docs |
| prompt_tokens, completion_tokens | Cost allocation per tenant/feature |
| latency_ms embed / retrieve / llm | SLO dashboards; find bottlenecks |
| citation_map | UI link-backs; eval citation accuracy |
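The fields in the table can be emitted as one JSON line per query. This is a sketch under assumptions: the function and parameter names are hypothetical, and the unsalted SHA-256 of the user ID is only a placeholder for whatever pseudonymization your privacy policy requires.

```python
import hashlib
import json
import time
from contextlib import contextmanager


@contextmanager
def timed(latency_ms: dict, stage: str):
    """Record wall-clock milliseconds for one pipeline stage into latency_ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        latency_ms[stage] = round((time.perf_counter() - start) * 1000, 2)


def query_log_line(request_id, user_id, index_version, embedding_model,
                   hits, usage, latency_ms, citation_map) -> str:
    """Serialize one query's trace; stores chunk IDs and scores, not document text."""
    record = {
        "request_id": request_id,
        "user_id_hash": hashlib.sha256(user_id.encode("utf-8")).hexdigest(),
        "index_version": index_version,
        "embedding_model": embedding_model,
        "retrieved_chunk_ids": [h["chunk_id"] for h in hits],
        "scores": [round(h["score"], 4) for h in hits],
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "latency_ms": latency_ms,
        "citation_map": citation_map,
    }
    return json.dumps(record, sort_keys=True)
```

Wrapping each stage in `with timed(latency, "embed"):`, `with timed(latency, "retrieve"):`, and `with timed(latency, "llm"):` yields the per-stage waterfall needed to find the bottleneck.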
Citations in the UI
Return structured citations alongside markdown: { "chunk_id", "title", "source_uri", "snippet" }. Users trust answers they can click—legal and support teams require it. Never claim a citation that was not in the retrieved set (validate in post-processing).
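The post-processing validation can be sketched as follows: parse the `[n]` markers from the answer, map the valid ones to structured citations, and report any that point outside the retrieved set. The function name, the 160-character snippet length, and the retrieved-chunk dict keys are assumptions for illustration.

```python
import re


def build_citations(answer: str, retrieved: list[dict]) -> tuple[list[dict], list[int]]:
    """Return (structured citations, invalid citation numbers) for an answer.

    `retrieved` is the ordered list of chunks shown to the model as [1], [2], ...
    """
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    valid, invalid = [], []
    for n in cited:
        if 1 <= n <= len(retrieved):
            chunk = retrieved[n - 1]
            valid.append({
                "chunk_id": chunk["chunk_id"],
                "title": chunk["title"],
                "source_uri": chunk["source_uri"],
                "snippet": chunk["text"][:160],
            })
        else:
            invalid.append(n)  # the model cited a block that was never in context
    return valid, invalid
```

What to do with a non-empty `invalid` list is a product decision: strip the marker, regenerate, or flag the response for review.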
ACL preview
Apply tenant and role filters inside the vector query, not by retrieving everything and filtering in app code. Store allowed_roles or tenant_id on each chunk at ingest; pass the caller’s claims into where clauses (pgvector JSONB, Pinecone metadata filters, etc.). Guide 6 covers propagating ACL changes when source docs move or permissions change.
A common mistake is post-filtering top-K results after ANN search. If forbidden docs rank in the top-K and get stripped, you may return fewer than K chunks, or zero, and the model hallucinates from thin context. Over-fetching (say, top 50) and then ACL-filtering only works if enough allowed chunks remain; it is better to push filters into the index query.
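The difference between filtering before and after ranking can be shown with a toy in-memory search. This is an exact brute-force scan, not an ANN index, and all names and data are hypothetical; in a real vector store the same effect comes from passing the filter into the query itself (for example, the `where` argument of Chroma's `collection.query`).

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search(query_vec, chunks, k, tenant_id=None):
    """Top-k by dot product; when tenant_id is given, filter BEFORE ranking."""
    pool = [c for c in chunks if tenant_id is None or c["tenant_id"] == tenant_id]
    return sorted(pool, key=lambda c: dot(query_vec, c["vector"]), reverse=True)[:k]


def post_filtered(query_vec, chunks, k, tenant_id):
    """Anti-pattern: rank across all tenants, then strip forbidden chunks."""
    return [c for c in search(query_vec, chunks, k) if c["tenant_id"] == tenant_id]


CHUNKS = [
    {"id": "other-1", "tenant_id": "globex", "vector": [1.0, 0.0]},
    {"id": "other-2", "tenant_id": "globex", "vector": [0.9, 0.1]},
    {"id": "acme-1", "tenant_id": "acme", "vector": [0.5, 0.5]},
    {"id": "acme-2", "tenant_id": "acme", "vector": [0.2, 0.8]},
]
query = [1.0, 0.0]
pre = search(query, CHUNKS, k=2, tenant_id="acme")
post = post_filtered(query, CHUNKS, k=2, tenant_id="acme")
```

Here the other tenant's chunks occupy both top-2 slots, so the post-filtered result is empty while the pre-filtered query still returns two allowed chunks.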
Production readiness checklist
- Golden question set with Recall@5 ≥ target (e.g. 0.85) in CI
- Index versioning + rollback (blue/green alias swap)
- Embedding model version pinned; full re-index runbook documented
- ACL metadata on every chunk; filters tested with forbidden-doc cases
- Citations returned to client; spot-check citation ↔ chunk alignment weekly
- Structured logs with chunk IDs; PII policy for question text
- Context token budget capped; rerank before stuff
- “Don’t know” path when retrieval scores below threshold
- Rate limits and timeouts on embed + LLM; fallback message on failure
- Human escalation for low-confidence or high-stakes categories
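The “don’t know” path from the checklist can be sketched as a gate in front of generation. The threshold value, the fallback wording, and the `retrieve` and `generate` callables are all assumptions for illustration; a real threshold should be tuned against your golden set, since score scales differ between embedding models and rerankers.

```python
from typing import Callable

FALLBACK = "I don't know based on the available documents."


def answer_or_deflect(
    question: str,
    retrieve: Callable[[str], list[dict]],
    generate: Callable[[str, list[dict]], str],
    min_score: float = 0.35,  # ASSUMED value; tune on golden questions
) -> dict:
    """Skip the LLM call entirely when no retrieved chunk clears the threshold."""
    hits = retrieve(question)
    confident = [h for h in hits if h["score"] >= min_score]
    if not confident:
        return {"answer": FALLBACK, "chunk_ids": [], "deflected": True}
    return {
        "answer": generate(question, confident),
        "chunk_ids": [h["chunk_id"] for h in confident],
        "deflected": False,
    }
```

Returning a `deflected` flag alongside the answer lets you track deflection rate as a product metric and route deflected questions to human escalation.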
Budget RAG as: (queries × embed_cost) + (queries × avg_context_tokens × llm_input_price) + output. Re-index cost spikes when you re-embed whole corpus on small doc edits—incremental ingest (Guide 6) pays for itself quickly at >10K docs.
Ship an internal “retrieval debugger” page: paste question → see ranked chunks with scores and highlighted spans. Support engineers fix bad answers in minutes instead of filing vague “AI wrong” tickets.
Enterprise copilots gate launches on recall metrics and legal review of citation UX, not on demo sparkle. Plan two weeks of eval hardening after the first working prototype; that is normal, not a delay.