RAG explained

What is RAG?

Retrieval-augmented generation combines a search step with an LLM generation step. Instead of hoping the model “remembered” your refund policy from pre-training, you retrieve the relevant policy chunk at question time and augment the prompt with it before the model writes an answer.

Think of RAG as an open-book exam for the LLM. The model’s parametric memory (weights frozen at training) holds language, reasoning patterns, and broad public facts—but not your private wiki, ticket exports, or yesterday’s pricing sheet. RAG supplies just-in-time knowledge: fresh, attributable, and revocable when documents change.

The pattern was popularized by Lewis et al. (2020) but the production shape is simpler than the paper: index offline (chunk → embed → store), retrieve online (embed query → nearest neighbors), generate (stuff context + question → LLM). Your application owns chunk boundaries, metadata, ACL filters, and the system prompt that says “answer only from context; cite sources.”

When RAG is the right default

Private or frequently updated knowledge — HR policies, API docs, support macros, legal playbooks.
Answers must be attributable — users need links or doc IDs, not plausible prose.
Corpus too large for one context window — even with a 1M-token window, pasting the whole corpus into every query is too slow and too expensive.
Multiple tenants or ACLs — filter retrieval by tenant_id or role before generation.

When RAG is not enough alone

Precise structured queries — “sum of invoices last quarter” needs SQL or an API tool, not semantic search over PDFs.
Behavior and tone — consistent brand voice or JSON shape is prompt engineering or fine-tuning, not retrieval.
Multi-hop reasoning across many systems — agents with tools (Track 4) often wrap RAG as one capability among many.

⚠️ Pitfall

Calling any chatbot that reads uploaded files “RAG.” Upload-to-context (pasting an entire PDF into the prompt) is not retrieval: it breaks on large documents, its cost grows linearly with the amount of text pasted, and it gives no citation granularity. Real RAG retrieves top-K chunks, not whole files, unless you explicitly use parent-document expansion (Guide 4).

📦 Real World

Notion AI, Intercom Fin, and internal bank copilots share the same skeleton: ingest docs → index → retrieve on question → generate with citations → log chunk IDs for feedback. The moat is ingest quality and evals, not which vector DB logo appears in a slide deck.

💡 Pro Tip

Start with 20–50 golden questions from real users before tuning chunk size. If recall@5 is below 80% on those questions, no amount of prompt polish fixes wrong retrieval—fix search first (Guides 2–4), then generation prompts.

RAG vs fine-tuning vs long context

Teams debate three ways to “teach the model your stuff.” They solve different problems. RAG injects facts at query time. Fine-tuning updates weights for behavior and format. Long context pastes more text into one prompt—useful, but not a substitute for search at scale.

Approach	Updates weights?	Knowledge freshness	Best for	Weakness
RAG	No	Minutes (re-index on doc change)	Facts, policies, docs, citations	Retrieval must work; latency + infra
Fine-tuning (LoRA)	Yes (adapters)	Stale until retrain	Tone, format, domain phrasing, classification	Expensive to refresh knowledge; hallucination on new facts
Long context paste	No	Static per request	Small corpus, one-shot analysis	Cost O(n) per query; lost-in-the-middle; no ACL per chunk
Tools / SQL / APIs	No	Live from source systems	Structured data, actions, calculations	Requires schema design and auth—not prose search

RAG for facts, fine-tuning for behavior

If marketing updates the refund policy weekly, RAG re-indexes the PDF—fine-tuning would require a new training run every time. If support replies must always use a specific JSON schema and empathetic tone, LoRA or heavy system prompting may beat stuffing examples into every RAG prompt.

Production systems often combine these approaches: RAG for knowledge, a fine-tuned small model for routing and format, tools for billing lookups, and a frontier model for hard reasoning, with evals proving each layer still works after changes.

⚖️ Trade-off

Fine-tuning on your entire doc corpus “so we don’t need RAG” fails the day a policy changes and the model still cites the old version confidently. Use fine-tuning when the task is how to answer; use RAG when the task is what is true.

💰 Cost

RAG per-query cost ≈ embedding(query) + vector search + LLM(input chunks + question). Fine-tuning upfront: GPU hours + eval cycles; marginal inference may be cheaper if you shrink prompts. Run the spreadsheet: if 90% of value is factual freshness, RAG wins on TCO—Track 1’s cost guide helps size token budgets.

🎯 Interview Tip

“When would you fine-tune instead of RAG?” — Strong answer: fine-tune for consistent output structure, classification, or domain language; RAG for mutable knowledge and citations; never fine-tune to memorize facts that change weekly. Mention eval harness and rollback for both paths.

🔬 Under the Hood

Fine-tuning changes the weights (or adapter weights) through gradient updates on training examples; RAG changes only the input tokens the model conditions on. At inference, a RAG call runs the same transformer with the same weights, but the attention layers attend over your retrieved chunks instead of relying on weights that encode last month’s policy. That is why RAG is easier to audit: log which chunks were in context.

The basic RAG pipeline

Every production RAG system decomposes into an offline indexing path and an online query path. Debug by asking which stage failed—bad chunks, bad embeddings, bad search, or bad generation.

Offline: build the index

Document — load PDF, HTML, Markdown, Confluence export, ticket JSON.
Chunk — split into passages (~200–800 tokens) with optional overlap and metadata (source, page, section).
Embed — run each chunk through an embedding model → fixed-size vector (e.g. 1536 dims).
Store — upsert vectors + metadata into a vector database or pgvector table with ANN index.

Online: answer a question

Query — user question (optionally rewritten or expanded—Guide 4).
Embed query — same embedding model as indexing (critical—model mismatch kills recall).
Retrieve — nearest-neighbor search for top-K chunks; apply metadata filters (tenant, ACL, date).
Stuff — assemble prompt: system rules + numbered context blocks + user question.
Generate — LLM produces answer; ideally with inline citations mapping to chunk IDs.

flowchart LR
  subgraph offline["Offline indexing"]
    D[Document] --> C[Chunk]
    C --> E[Embed]
    E --> S[(Vector store)]
  end
  subgraph online["Online query"]
    Q[User query] --> EQ[Embed query]
    EQ --> R[Retrieve top-K]
    S --> R
    R --> ST[Stuff prompt]
    ST --> G[Generate answer]
  end

Stuffing strategies (how context lands in the prompt)

Stuffing is assembling retrieved text into the LLM prompt. Three common patterns:

Simple concat — numbered blocks separated by blank lines; cheapest; works for support Q&A.
XML / tagged blocks — <document id="..."> wrappers; Claude and some enterprise templates parse these reliably.
Map-reduce — retrieve many chunks, summarize each in parallel, then answer from summaries; for corpora where K×chunk_size exceeds budget (Guide 5).

Put the user question after context so the model’s last strong signal is what to answer. Duplicate critical instructions at the top (system) and a one-line reminder after context (“Use only the above.”).
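The tagged-block pattern and the question-after-context ordering can be sketched as follows. This is a minimal illustration, not a fixed template: the `build_prompt` helper, the `index` and `id` attribute names, and the sample chunks are assumptions introduced here.

```python
from xml.sax.saxutils import escape, quoteattr

REMINDER = "Use only the above."

def build_prompt(chunks: list[dict], question: str) -> str:
    """Wrap each chunk in a <document> tag, then append the reminder and the question."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        # quoteattr adds the surrounding quotes and escapes the attribute value.
        blocks.append(
            f"<document index={quoteattr(str(i))} id={quoteattr(chunk['id'])}>\n"
            f"{escape(chunk['text'])}\n"
            "</document>"
        )
    context = "\n\n".join(blocks)
    # Order: context first, one-line reminder next, user question last.
    return f"{context}\n\n{REMINDER}\n\nQuestion: {question}"

chunks = [
    {"id": "refund-policy_0", "text": "No partial months on monthly plans."},
    {"id": "shipping_0", "text": "Express shipping costs $12."},
]
prompt = build_prompt(chunks, "Can I get a refund on a monthly plan?")
print(prompt)
```

The system-level rules would still go in the system message; this helper only builds the user turn.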

Metadata you should store on every chunk

Field	Example	Used for
chunk_id	doc_8821_p3_c2	Logging, citations, feedback loops
source_uri	s3://kb/refund-policy.pdf	Link-back in UI
title, section	“Refund policy §2.1”	Prompt headers, debugging
updated_at	ISO timestamp	Recency bias, stale answer warnings
acl / tenant_id	role:internal, tenant:acme	Filter before retrieve—never after

🔬 Under the Hood

Bi-encoder retrieval embeds query and chunks independently, then ranks by cosine similarity. It is fast at scale because chunk vectors are precomputed and searched with ANN indexes. Cross-encoder reranking (Guide 4) scores each query–chunk pair jointly, which is slower but typically more accurate, so it is applied only to a short candidate list such as the top 20. Production stacks typically bi-encode first and rerank second.
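A toy version of the two-stage flow is sketched below. The vectors are hand-written, and `overlap_score` is only a stand-in for a trained cross-encoder (it counts shared tokens); both are assumptions made so the example runs without any model.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def first_stage(query_vec: list[float], index: list[dict], n: int) -> list[dict]:
    """Bi-encoder stage: chunk vectors were computed offline, independently of the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return ranked[:n]

def overlap_score(query: str, text: str) -> float:
    """Placeholder for a cross-encoder: scores the (query, chunk) pair jointly."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0

def rerank(query: str, candidates: list[dict], keep: int) -> list[dict]:
    ranked = sorted(candidates, key=lambda item: overlap_score(query, item["text"]), reverse=True)
    return ranked[:keep]

index = [
    {"id": "a", "vec": [0.9, 0.1], "text": "shipping takes 3 to 5 business days"},
    {"id": "b", "vec": [0.8, 0.3], "text": "monthly plan refund rules"},
    {"id": "c", "vec": [0.0, 1.0], "text": "password reset links expire"},
]
query, query_vec = "refund on monthly plan", [1.0, 0.2]
candidates = first_stage(query_vec, index, n=2)   # cheap, wide
top = rerank(query, candidates, keep=1)           # expensive, narrow
print([c["id"] for c in candidates], [c["id"] for c in top])
```

In this toy data the first stage ranks chunk `a` above `b`, and the pairwise scorer then promotes `b`, which is the kind of correction a reranker is meant to make.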

💡 Pro Tip

Log the retrieval set (chunk IDs + scores) separately from the LLM response. When users report a wrong answer, you can see in one glance whether retrieval missed the right doc or the model ignored good context—two different fixes.

📦 Real World

Stripe’s support AI and many fintech copilots run nightly ingest jobs that diff wiki changes, re-chunk only affected pages, and version indexes so rollback is “point alias to index v42.” Treat the index as a deploy artifact with its own version number.

Failure modes (what breaks in production)

Most “the AI lied” incidents are retrieval or assembly failures—not model IQ. Learn to classify failures before swapping to GPT-4.5.

1. Wrong chunks retrieved (low recall)

The answer existed in your corpus but never appeared in top-K. Causes: chunk split mid-sentence, embedding model mismatch, synonym gap (“billing” vs “invoicing”), stale index, or ACL filter too aggressive. Fix: hybrid BM25 + dense search, query rewriting, better chunking (Guide 2), reranker (Guide 4).
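One way to combine a BM25 ranking with a dense ranking is reciprocal rank fusion (RRF). The text above does not name a fusion method, so RRF is an assumption chosen for illustration; `k=60` is the commonly used constant, and the chunk IDs in the two ranked lists are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk IDs: score(id) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

bm25_ranking = ["invoice_faq_2", "refund-policy_0", "shipping_0"]
dense_ranking = ["refund-policy_0", "billing_guide_1", "invoice_faq_2"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

RRF needs only ranks, not comparable scores, which is convenient because BM25 scores and cosine similarities live on different scales.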

2. Chunk size too small

128-token chunks lose surrounding context—a table header lands in one chunk, rows in another. The model sees “Column A” without knowing it is “Annual renewal fee.” Fix: larger chunks, parent-child retrieval, or structure-aware splitting for tables.

3. Chunk size too large

2,000-token chunks dilute relevance signals; one relevant sentence sits in a sea of boilerplate. Embedding averages semantics across the whole chunk—precision drops. Fix: smaller chunks + reranker, or semantic chunking at paragraph boundaries.

4. LLM ignores retrieved context

Retrieval worked; the model still answers from parametric memory or invents details. Common when system prompt is weak, context is buried mid-prompt, or chunks contradict each other. Fix: “Answer ONLY from context below; if insufficient say I don’t know”; put rules at start and end; reduce K; use citation-required formats; lower temperature for Q&A.

5. Hallucination despite good retrieval

Model paraphrases incorrectly or merges two policies. Especially on numbers, dates, and legal language. Fix: require verbatim quotes for critical fields; structured output; human review for high-stakes; evals that check numeric equality not BLEU score.

6. Latency and cost blowups

Sequential embed + retrieve + large context + slow model → multi-second p95. Embedding every query, retrieving K=20, stuffing 15K tokens, then calling GPT-4o adds up. Fix: cache query embeddings; cap context budget; rerank top-20 → keep top-5; cascade to smaller models; parallel retrieve + LLM prefill optimizations (Guide 6).
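Capping the context budget can be as simple as the sketch below. It assumes the chunks arrive already sorted best-first (for example from a reranker) and uses a whitespace split as a rough token estimate; `fit_to_budget` and `count_tokens` are hypothetical helpers, and a real tokenizer would replace the estimate.

```python
def count_tokens(text: str) -> int:
    """Rough token estimate (whitespace split); swap in a real tokenizer in practice."""
    return len(text.split())

def fit_to_budget(ranked_chunks: list[dict], max_chunks: int, max_tokens: int) -> list[dict]:
    """Keep best-ranked chunks until the chunk cap is reached, skipping any that overflow the budget."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        if len(kept) == max_chunks:
            break
        cost = count_tokens(chunk["text"])
        if used + cost > max_tokens:
            continue  # too large for the remaining budget; a shorter chunk may still fit
        kept.append(chunk)
        used += cost
    return kept

ranked = [
    {"id": "c1", "text": "a b c d"},
    {"id": "c2", "text": "one two three four five six seven eight nine ten"},
    {"id": "c3", "text": "x y z"},
]
selected = fit_to_budget(ranked, max_chunks=2, max_tokens=8)
print([c["id"] for c in selected])
```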

Symptom	Likely stage	First debug step
“That policy doesn’t exist” (but it does in wiki)	Retrieve / chunk	Search index manually for question keywords
Answer cites doc but details wrong	Generate	Compare answer to retrieved chunk text
Works in English, fails in Spanish	Embed / chunk	Multilingual embedding model; language metadata filter
Random user sees confidential doc	Store / ACL	Audit metadata filters at query time
p95 > 8s	Whole pipeline	Trace waterfall: embed ms, ANN ms, LLM TTFT

⚠️ Pitfall

Raising K from 5 to 50 “so we don’t miss anything.” More noise overwhelms the context window, increases cost, and triggers lost-in-the-middle—recall may improve slightly while answer quality drops. Prefer reranking over raw K inflation.

🎯 Interview Tip

“Our RAG bot gives wrong refund answers—how do you debug?” — Walk the pipeline: golden question → log retrieved chunks → human labels correct doc present? → if yes, prompt/generation issue; if no, chunking/search issue. Mention eval set and index versioning.
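The triage walk can be expressed as a small function. This is a sketch under simple assumptions: a labeled golden row supplies the expected chunk IDs and one expected substring, and `classify_failure` is a hypothetical helper name.

```python
def classify_failure(expected_ids: set[str], retrieved_ids: list[str],
                     answer: str, expected_substring: str) -> str:
    """Return which stage to investigate for one golden question."""
    if not expected_ids & set(retrieved_ids):
        return "retrieval"   # the right chunk never reached the prompt: fix chunking or search
    if expected_substring.lower() not in answer.lower():
        return "generation"  # the right chunk was in context but the answer missed the fact
    return "ok"

print(classify_failure({"refund-policy_0"}, ["shipping_0"], "Refunds take 5 days.", "partial months"))
print(classify_failure({"refund-policy_0"}, ["refund-policy_0"], "Yes, full refund.", "partial months"))
```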

⚖️ Trade-off

Aggressive “I don’t know” when context is thin reduces hallucination but increases deflection rate. Product must choose: support bot may prefer safe deflection; sales copilot may prefer best-effort with disclaimer.

Minimal RAG implementation

Below is the smallest production-shaped loop: embed the query, retrieve top-K from a vector store, stuff context into a template, generate with citation instructions. Guide 3 replaces Chroma with pgvector or Pinecone; the control flow stays the same.

Sample corpus (three support docs)

Use the same fictional Acme Corp snippets across Track 2 guides so evals stay comparable.

refund-policy.txt — 30-day annual refunds; no partial months on monthly plans.
shipping.txt — 3–5 business days standard US; express $12.
password-reset.txt — Settings → Security; links expire 24h.

End-to-end query path

from openai import OpenAI
import chromadb

client = OpenAI()
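# Cosine space makes `1 - distance` in retrieve() a cosine similarity. Chroma's
# default space is squared L2, where `1 - distance` is not a meaningful score.
# The space is fixed when the collection is first created.
collection = chromadb.PersistentClient(path="./chroma").get_or_create_collection(
    "acme_kb", metadata={"hnsw:space": "cosine"}
)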

def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def retrieve(question: str, k: int = 5) -> list[dict]:
    q_vec = embed([question])[0]
    hits = collection.query(query_embeddings=[q_vec], n_results=k, include=["documents", "metadatas", "distances"])
    return [
        {"text": doc, "meta": meta, "score": 1 - dist}
        for doc, meta, dist in zip(hits["documents"][0], hits["metadatas"][0], hits["distances"][0])
    ]

RAG_SYSTEM = """Answer using ONLY the numbered context below.
Cite sources as [1], [2]. If context is insufficient, say you don't know."""

def answer(question: str) -> str:
    chunks = retrieve(question)
    context = "\n\n".join(f"[{i+1}] ({c['meta']['source']})\n{c['text']}" for i, c in enumerate(chunks))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": RAG_SYSTEM},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("Can I get a refund on a monthly plan after two weeks?"))

// Spring AI — VectorStore + ChatClient (index built at startup or via batch job)
@Service
public class RagService {
  private final VectorStore vectorStore;
  private final ChatClient chatClient;

  public RagService(VectorStore vectorStore, ChatClient.Builder builder) {
    this.vectorStore = vectorStore;
    this.chatClient = builder
        .defaultSystem("""
            Answer using ONLY the numbered context below.
            Cite sources as [1], [2]. If context is insufficient, say you don't know.""")
        .build();
  }

  public String answer(String question) {
    List<Document> hits = vectorStore.similaritySearch(
        SearchRequest.builder().query(question).topK(5).build());

    StringBuilder context = new StringBuilder();
    for (int i = 0; i < hits.size(); i++) {
      Document d = hits.get(i);
      context.append("[").append(i + 1).append("] (")
          .append(d.getMetadata().get("source")).append(")\n")
          .append(d.getText()).append("\n\n");
    }

    return chatClient.prompt()
        .user("Context:\n" + context + "\nQuestion: " + question)
        .call()
        .content();
  }
}

Indexing side (run once or on schedule)

from pathlib import Path

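def chunk_file(path: Path, size: int = 400, overlap: int = 80) -> list[dict]:
    # size and overlap are measured in characters here, not tokens.
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    text = path.read_text(encoding="utf-8")
    chunks, start, i = [], 0, 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append({"id": f"{path.stem}_{i}", "text": text[start:end], "source": path.name})
        # Step back by `overlap` so neighbouring chunks share context; stop at end of text.
        start = end - overlap if end < len(text) else end
        i += 1
    return chunks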

all_chunks = [c for p in Path("docs").glob("*.txt") for c in chunk_file(p)]
vectors = embed([c["text"] for c in all_chunks])
collection.upsert(
    ids=[c["id"] for c in all_chunks],
    embeddings=vectors,
    documents=[c["text"] for c in all_chunks],
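    # chunk_id is stored in metadata so evals and logs can read it from retrieve() results.
    metadatas=[{"source": c["source"], "chunk_id": c["id"]} for c in all_chunks],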
)

// Spring AI DocumentReader + batch embed at startup (simplified)
@Bean
CommandLineRunner indexKb(VectorStore store, ResourcePatternResolver resolver) {
  return args -> {
    for (Resource r : resolver.getResources("classpath:docs/*.txt")) {
      String text = r.getContentAsString(StandardCharsets.UTF_8);
      List<Document> docs = chunk(text, 400, 80).stream()
          .map(t -> new Document(t, Map.of("source", r.getFilename())))
          .toList();
      store.add(docs);  // EmbeddingModel invoked by VectorStore implementation
    }
  };
}

💰 Cost

Indexing 10K chunks with text-embedding-3-small ≈ 10K × avg 200 tokens × $0.02/1M ≈ $0.04 one-time. Per query: ~50 tokens embed + ~2K context + ~300 output on gpt-4o-mini—fractions of a cent at moderate volume. Dominant cost is often re-embedding on every doc edit if you fail to incrementalize (Guide 6).
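The arithmetic above can be checked with a few lines. The embedding price is the $0.02 per 1M tokens quoted in the text; the LLM input and output prices below are illustrative placeholders, not quoted rates, so substitute your provider's current pricing.

```python
def embedding_cost(num_chunks: int, avg_tokens: int, price_per_million: float) -> float:
    """One-time indexing cost in dollars."""
    return num_chunks * avg_tokens * price_per_million / 1_000_000

def query_cost(embed_tokens: int, context_tokens: int, output_tokens: int,
               embed_price: float, input_price: float, output_price: float) -> float:
    """Per-query cost in dollars; all prices are per 1M tokens."""
    return (
        embed_tokens * embed_price
        + (embed_tokens + context_tokens) * input_price   # question + stuffed context
        + output_tokens * output_price
    ) / 1_000_000

index_cost = embedding_cost(num_chunks=10_000, avg_tokens=200, price_per_million=0.02)
per_query = query_cost(
    embed_tokens=50, context_tokens=2_000, output_tokens=300,
    embed_price=0.02,
    input_price=0.15,    # placeholder LLM input price per 1M tokens
    output_price=0.60,   # placeholder LLM output price per 1M tokens
)
print(f"indexing: ${index_cost:.2f}, per query: ${per_query:.6f}")
```

With these placeholder prices a query stays well under a tenth of a cent, and the LLM input tokens dominate the embedding call by orders of magnitude.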

⚠️ Pitfall

Using different embedding models for index vs query—or upgrading the embedding model without full re-index. Vectors live in incompatible spaces; similarity scores become meaningless. Version your embedding model in index metadata and block queries against mismatched indexes.
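A guard along these lines can run before every query. It is a sketch that assumes the index stores the name of the model it was built with under an `embedding_model` key; that key and the exception class are hypothetical.

```python
class EmbeddingModelMismatch(RuntimeError):
    """Raised when the query-time model differs from the one used to build the index."""

def assert_same_embedding_model(index_metadata: dict, query_model: str) -> None:
    indexed_with = index_metadata.get("embedding_model")
    if indexed_with != query_model:
        raise EmbeddingModelMismatch(
            f"index built with {indexed_with!r}, query uses {query_model!r}; re-index before querying"
        )

index_metadata = {"index_version": "v42", "embedding_model": "text-embedding-3-small"}
assert_same_embedding_model(index_metadata, "text-embedding-3-small")  # passes silently
```

Failing loudly here is cheaper than debugging a silent recall collapse after a model upgrade.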

💡 Pro Tip

Wrap retrieve() and answer() in separate functions with unit tests. Feed retrieve() a golden question and assert the expected chunk_id is in top-5 before you ever tune prompts.

Eval basics — prove retrieval works

RAG quality splits into retrieval metrics (did we fetch the right chunks?) and generation metrics (did the answer match ground truth given those chunks?). Measure retrieval first—generation evals lie when context is wrong.

Golden questions

Collect 30–100 real user questions with expected answer snippets and source document IDs. Store as JSON in git; run nightly against staging index. Each row:

question — user text
expected_chunk_ids — at least one must appear in top-K
expected_answer_contains — substring checks for key facts
tags — refund, shipping, edge-case

Recall@K

Recall@K = fraction of golden questions where at least one expected chunk appears in the top-K retrieved set. If Recall@5 = 0.72, 28% of questions never get the right context—fix chunking/search before prompt tuning.

recall_at_k = hits / total_golden where a hit means expected_chunk_ids ∩ retrieved_ids ≠ ∅.

GOLDEN = [
    {"q": "monthly plan refund?", "expected_ids": {"refund-policy_0"}},
    {"q": "express shipping cost?", "expected_ids": {"shipping_0"}},
]

def recall_at_k(golden, k=5) -> float:
    hits = 0
    for row in golden:
        retrieved_ids = {c["meta"]["chunk_id"] for c in retrieve(row["q"], k=k)}
        if row["expected_ids"] & retrieved_ids:
            hits += 1
    return hits / len(golden)

assert recall_at_k(GOLDEN, k=5) >= 0.8, "Fix retrieval before shipping"

record GoldenRow(String question, Set<String> expectedChunkIds) {}

double recallAtK(List<GoldenRow> golden, int k) {
  long hits = golden.stream()
      .filter(row -> {
        Set<String> retrieved = ragService.retrieveIds(row.question(), k);
        return row.expectedChunkIds().stream().anyMatch(retrieved::contains);
      })
      .count();
  return (double) hits / golden.size();
}

MRR and precision (optional retrieval metrics)

Mean reciprocal rank (MRR) rewards ranking the first correct chunk higher: MRR = average(1 / rank_of_first_hit). If the right chunk is always retrieved but at position 8, Recall@10 looks fine while stuffing quality suffers—MRR catches that.

Precision@K measures noise: what fraction of retrieved chunks are relevant? Low precision means you are paying tokens to confuse the model—tighten K or add a reranker threshold.
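Both metrics fit in a few lines. The sketch below assumes each run is a pair of (ranked retrieved IDs, set of expected IDs); the sample IDs are invented for the example.

```python
def reciprocal_rank(retrieved_ids: list[str], expected_ids: set[str]) -> float:
    """1 / rank of the first expected chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in expected_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    if not runs:
        raise ValueError("need at least one golden question")
    return sum(reciprocal_rank(retrieved, expected) for retrieved, expected in runs) / len(runs)

def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k slots filled with relevant chunks."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for chunk_id in retrieved_ids[:k] if chunk_id in relevant_ids) / k

runs = [
    (["a", "b", "c"], {"a"}),   # first hit at rank 1 -> 1.0
    (["x", "y", "z"], {"z"}),   # first hit at rank 3 -> 1/3
]
mrr = mean_reciprocal_rank(runs)
p_at_4 = precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=4)
print(round(mrr, 4), p_at_4)
```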

🔬 Under the Hood

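Vector search returns approximate nearest neighbors (HNSW, IVF), not a guaranteed exact top-K: depending on index parameters (for example HNSW ef_search or IVF nprobe), the index can skip a chunk that exhaustive search would have returned. Separately, embedding similarity is only a proxy for relevance, so a chunk at cosine 0.82 can outrank the true answer at 0.79. That is why hybrid retrieval (keyword + vector) and reranking are standard, not optional luxuries.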

Generation checks (lightweight)

Contains check — answer includes “partial months” for monthly refund question.
Citation check — answer references [1] and retrieved chunk [1] is refund-policy.
Refusal check — out-of-corpus questions should trigger “don’t know,” not fabrication.
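The three checks above can be written as deterministic functions. This is a minimal sketch: the refusal phrases, the helper names, and the chunk shape (a dict with a `source` key) are assumptions, and real refusal wording should be matched to your own system prompt.

```python
import re

REFUSAL_MARKERS = ("don't know", "do not know")

def contains_check(answer: str, required: list[str]) -> bool:
    """Every required fact substring appears in the answer (case-insensitive)."""
    return all(s.lower() in answer.lower() for s in required)

def citation_check(answer: str, chunks: list[dict], expected_source: str) -> bool:
    """At least one cited [n] points at a retrieved chunk from the expected source."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return any(
        1 <= n <= len(chunks) and chunks[n - 1]["source"] == expected_source
        for n in cited
    )

def refusal_check(answer: str) -> bool:
    """The answer declines instead of fabricating (straight or curly apostrophe)."""
    normalized = answer.lower().replace("\u2019", "'")
    return any(marker in normalized for marker in REFUSAL_MARKERS)

chunks = [{"source": "refund-policy.txt"}, {"source": "shipping.txt"}]
answer = "Monthly plans do not get refunds for partial months [1]."
print(contains_check(answer, ["partial months"]),
      citation_check(answer, chunks, "refund-policy.txt"),
      refusal_check(answer))
```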

Track 5 (Evaluation, safety & quality) covers RAGAS, LLM-as-judge, CI gates, and regression alerts when embedding model or chunk strategy changes. Wire recall@5 in CI now—even a 10-row golden set catches catastrophic index regressions.

🎯 Interview Tip

“How do you evaluate RAG?” — Separate retrieval vs generation; report Recall@K and MRR on labeled chunk IDs; use answer correctness only after retrieval passes; run evals on every index deploy; sample production queries with chunk IDs logged for human review.

📦 Real World

Teams that ship RAG without golden sets discover regressions from user tweets, not dashboards. Perplexity-style products invest heavily in offline eval + online thumbs-down tied to retrieved URLs—copy that feedback loop early.

⚖️ Trade-off

LLM-as-judge evals are fast to author but can favor verbose answers and correlate with the judge model’s biases. Use deterministic checks (recall, substring, JSON schema) for CI; use judges for exploratory analysis and spot checks.

What this looks like in production

A notebook that answers three questions is a spike. Production RAG is a service with versioned indexes, ACL-aware retrieval, structured logs, citations users can verify, and eval gates on every deploy.

Architecture sketch

Client → RAG API → (optional query rewrite) → embed → vector store with metadata filters → rerank → prompt builder → LLM gateway → response with citations. Ingest runs asynchronously: webhook or cron → parse → chunk → embed → upsert → alias flip to new index version.

Logging (every query)

Field	Why
request_id, user_id (hashed)	Trace bad sessions; support replay
index_version, embedding_model	Debug regressions after deploy
retrieved_chunk_ids, scores	Retrieval debugging without storing full docs
prompt_tokens, completion_tokens	Cost allocation per tenant/feature
latency_ms embed / retrieve / llm	SLO dashboards; find bottlenecks
citation_map	UI link-backs; eval citation accuracy
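One possible shape for the per-query log line is sketched below, using the fields from the table. The function name, the 16-character truncated SHA-256 for `user_id`, and the sample values are assumptions; a keyed hash may be preferable where user IDs are guessable.

```python
import hashlib
import json

def hash_user_id(user_id: str) -> str:
    """One-way hash so logs can group sessions without storing the raw ID."""
    return hashlib.sha256(user_id.encode("utf-8")).hexdigest()[:16]

def build_query_log(request_id: str, user_id: str, index_version: str,
                    embedding_model: str, retrieved: list[dict], usage: dict,
                    latency_ms: dict, citation_map: dict) -> str:
    """Serialize one query's trace as a single JSON line."""
    record = {
        "request_id": request_id,
        "user_id": hash_user_id(user_id),
        "index_version": index_version,
        "embedding_model": embedding_model,
        "retrieved_chunk_ids": [c["chunk_id"] for c in retrieved],
        "scores": [round(c["score"], 4) for c in retrieved],
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "latency_ms": latency_ms,          # per-stage: embed / retrieve / llm
        "citation_map": citation_map,      # citation marker -> chunk_id
    }
    return json.dumps(record, sort_keys=True)

line = build_query_log(
    request_id="req-001",
    user_id="user-123",
    index_version="v42",
    embedding_model="text-embedding-3-small",
    retrieved=[{"chunk_id": "refund-policy_0", "score": 0.8123}],
    usage={"prompt_tokens": 2050, "completion_tokens": 300},
    latency_ms={"embed": 35, "retrieve": 12, "llm": 900},
    citation_map={"1": "refund-policy_0"},
)
print(line)
```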

Citations in the UI

Return structured citations alongside markdown: { "chunk_id", "title", "source_uri", "snippet" }. Users trust answers they can click—legal and support teams require it. Never claim a citation that was not in the retrieved set (validate in post-processing).
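The post-processing validation might look like the sketch below. It assumes answers cite with `[n]` markers numbered in retrieval order and that each retrieved chunk carries the four fields named above; `validate_citations` and the 160-character snippet length are hypothetical choices.

```python
import re

def validate_citations(answer: str, retrieved: list[dict]) -> tuple[str, list[dict]]:
    """Strip [n] markers that do not map to a retrieved chunk; return structured citations."""
    valid: dict[int, dict] = {}

    def keep_or_drop(match: re.Match) -> str:
        n = int(match.group(1))
        if 1 <= n <= len(retrieved):
            valid[n] = retrieved[n - 1]
            return match.group(0)
        return ""  # cited something that was never in context

    cleaned = re.sub(r"\[(\d+)\]", keep_or_drop, answer)
    citations = [
        {
            "marker": n,
            "chunk_id": chunk["chunk_id"],
            "title": chunk["title"],
            "source_uri": chunk["source_uri"],
            "snippet": chunk["text"][:160],
        }
        for n, chunk in sorted(valid.items())
    ]
    return cleaned, citations

retrieved = [{
    "chunk_id": "refund-policy_0",
    "title": "Refund policy",
    "source_uri": "s3://kb/refund-policy.pdf",
    "text": "No partial months on monthly plans.",
}]
cleaned, citations = validate_citations("No partial months on monthly plans [1][3].", retrieved)
print(cleaned, citations)
```

Dropping an invalid marker leaves its sentence uncited, so a stricter policy could flag the whole answer for review instead.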

ACL preview

Apply tenant and role filters inside the vector query, not by retrieving everything and filtering in app code. Store allowed_roles or tenant_id on each chunk at ingest; pass the caller’s claims into where clauses (pgvector JSONB, Pinecone metadata filters, etc.). Guide 6 covers propagating ACL changes when source docs move or permissions change.

⚠️ Pitfall

Post-filtering top-K results after ANN search. If forbidden docs rank in top-K but get stripped, you may return fewer than K chunks—or zero—and the model hallucinates from thin context. Over-fetch (top 50) then ACL-filter only works if enough allowed chunks remain; better to push filters into the index query.
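The difference is easy to see on a toy in-memory list. This sketch stands in for a vector store: the scores are invented, `globex` is a hypothetical second tenant, and in a real system the pre-filter would be expressed as a metadata filter inside the store's query.

```python
def top_k(chunks: list[dict], k: int) -> list[dict]:
    return sorted(chunks, key=lambda c: c["score"], reverse=True)[:k]

def post_filter(chunks: list[dict], tenant: str, k: int) -> list[dict]:
    """Search first, then strip forbidden chunks: may return fewer than k, or none."""
    return [c for c in top_k(chunks, k) if c["tenant_id"] == tenant]

def pre_filter(chunks: list[dict], tenant: str, k: int) -> list[dict]:
    """Restrict to allowed chunks first, then take the top k among them."""
    return top_k([c for c in chunks if c["tenant_id"] == tenant], k)

chunks = [
    {"id": "globex_1", "tenant_id": "globex", "score": 0.91},
    {"id": "globex_2", "tenant_id": "globex", "score": 0.88},
    {"id": "acme_1", "tenant_id": "acme", "score": 0.84},
    {"id": "acme_2", "tenant_id": "acme", "score": 0.80},
    {"id": "acme_3", "tenant_id": "acme", "score": 0.77},
]
after_search = post_filter(chunks, "acme", k=2)
inside_query = pre_filter(chunks, "acme", k=2)
print([c["id"] for c in after_search], [c["id"] for c in inside_query])
```

Here the post-filter leaves the `acme` caller with no context at all, while the pre-filter still returns two allowed chunks.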

Production readiness checklist

Golden question set with Recall@5 ≥ target (e.g. 0.85) in CI
Index versioning + rollback (blue/green alias swap)
Embedding model version pinned; full re-index runbook documented
ACL metadata on every chunk; filters tested with forbidden-doc cases
Citations returned to client; spot-check citation ↔ chunk alignment weekly
Structured logs with chunk IDs; PII policy for question text
Context token budget capped; rerank before stuff
“Don’t know” path when retrieval scores below threshold
Rate limits and timeouts on embed + LLM; fallback message on failure
Human escalation for low-confidence or high-stakes categories

💰 Cost

Budget RAG as: (queries × embed_cost) + (queries × avg_context_tokens × llm_input_price) + (queries × avg_output_tokens × llm_output_price). Re-index cost spikes when you re-embed the whole corpus on small doc edits; incremental ingest (Guide 6) pays for itself quickly at >10K docs.

💡 Pro Tip

Ship an internal “retrieval debugger” page: paste question → see ranked chunks with scores and highlighted spans. Support engineers fix bad answers in minutes instead of filing vague “AI wrong” tickets.

📦 Real World

Enterprise copilots gate launches on recall metrics and legal review of citation UX—not on demo sparkle. Plan two weeks of eval hardening after the first working prototype; that is normal, not delay.