Advanced RAG patterns
Naive retrieve-then-generate works for demos. Production knowledge systems need architecture choices: modular pipelines, graph-aware retrieval, self-critique loops, corrective fallbacks, and cache-augmented generation. This guide maps the pattern landscape—from GraphRAG community summaries to Self-RAG reflection tokens—so you pick complexity only where evals prove ROI.
After reading, you should be able to: compare naive, advanced, modular, and agentic RAG; explain Microsoft GraphRAG indexing and when graph structure beats flat chunks; implement Self-RAG retrieve/no-retrieve and critique steps; describe RAPTOR hierarchical retrieval; wire Corrective RAG with web search fallback; reason about long-context vs retrieval trade-offs; apply prompt caching to static knowledge prefixes; and select patterns from a production matrix tied to latency, cost, and accuracy SLOs.
Architecture variants: naive, advanced, modular, agentic
Naive RAG embeds chunks, retrieves top-k, stuffs context, generates once. Advanced RAG adds query transformation, hybrid search, reranking, and compression. Modular RAG composes swappable stages (index, retrieve, post-process, generate). Agentic RAG lets an LLM decide when and how to retrieve across multiple tools and iterations.
Most teams start with naive RAG because it ships in a weekend: chunk documents, embed, store in pgvector or Pinecone, call similarity_search(query, k=5), concatenate, prompt GPT-4o. That baseline fails on multi-hop questions, tables spread across pages, and queries that need synthesis across ten sources—not one paragraph.
Naive RAG
Flow: query → embed → ANN search → top-k chunks → LLM. Strengths: simple, debuggable, cheap at small scale. Weaknesses: no query understanding, no rerank, brittle to chunk boundaries, hallucinates when retrieval misses.
| Stage | Typical implementation | Failure mode |
|---|---|---|
| Chunking | Fixed 512 tokens, 10% overlap | Splits tables, code blocks, legal clauses |
| Retrieval | Cosine on single embedding model | Lexical mismatch ("401k" vs "retirement plan") |
| Generation | Single-shot answer | Confident wrong when context empty or irrelevant |
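The "Fixed 512 tokens, 10% overlap" row can be sketched as a sliding window over a token list. This is a minimal illustration, not the guide's chunker: `chunk_tokens` is a hypothetical name, and whitespace splitting stands in for a real tokenizer.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap: int = 51) -> list[list[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Assumption: whitespace splitting replaces a model-specific tokenizer.
words = "a b c d e f g h i j".split()
chunks = chunk_tokens(words, size=4, overlap=1)
```

Because the window ignores document structure, a table or clause that straddles a boundary ends up split across two chunks, which is the failure mode the table names.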
Advanced RAG
Adds pre-retrieval (HyDE, multi-query, step-back prompting), hybrid BM25 + dense, cross-encoder rerank, and context compression (LLMLingua, selective sentence extraction). Latency rises 200–800ms; accuracy on hard evals often jumps 15–40 points versus naive on the same corpus.
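The text does not say how the BM25 and dense rankings are merged. One common option is reciprocal rank fusion (RRF), sketched below under that assumption. The function name and document ids are illustrative, and `k = 60` is a frequently used default rather than a value taken from this guide.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked id lists; each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id to keep the output deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["d1", "d2", "d3"]   # lexical ranking (hypothetical ids)
dense_hits = ["d3", "d1", "d4"]  # embedding ranking (hypothetical ids)
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

RRF only needs ranks, so it avoids normalizing BM25 scores against cosine similarities. The fused list would then feed the cross-encoder reranker.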
Modular RAG
Treat each stage as an interface: Retriever, Reranker, ContextBuilder, Generator. Swap hybrid for ColBERT without rewriting generation. LangChain LCEL and LlamaIndex query pipelines are modular RAG in practice—production teams wrap them in their own config-driven DAG.
Agentic RAG
An agent loop: plan → retrieve from KB / web / SQL → read → decide if enough evidence → maybe retrieve again → answer. Best for open-ended research, multi-document synthesis, and tools beyond vectors. Costs more tokens and needs guardrails (max iterations, budget caps) to prevent runaway loops.
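The loop and its guardrails can be sketched as follows. This is a simplified illustration, assuming the retrieval, sufficiency check, query rewrite, and generation steps are supplied as plain callables; all names here are hypothetical, and a real agent would also track token spend.

```python
from typing import Callable


def agentic_answer(
    query: str,
    retrieve: Callable[[str], list[str]],
    enough: Callable[[str, list[str]], bool],
    rewrite: Callable[[str, list[str]], str],
    generate: Callable[[str, list[str]], str],
    max_iterations: int = 4,
    max_passages: int = 20,
) -> dict:
    """Retrieve, check evidence, and retry with a rewritten query, within hard caps."""
    evidence: list[str] = []
    current = query
    steps = 0
    for steps in range(1, max_iterations + 1):
        for passage in retrieve(current):
            if passage not in evidence:
                evidence.append(passage)
        evidence = evidence[:max_passages]  # budget cap on context size
        if enough(query, evidence):
            break
        current = rewrite(query, evidence)  # plan the next retrieval
    return {"answer": generate(query, evidence), "iterations": steps, "evidence": evidence}
```

The iteration cap guarantees termination even if the sufficiency check never passes, which is the runaway-loop case the paragraph warns about.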
Architecture comparison at a glance
| Dimension | Naive | Advanced | Modular | Agentic |
|---|---|---|---|---|
| LLM calls / query | 1 | 1–2 | 1–3 | 3–15 |
| Latency p95 | Lowest | Medium | Medium | Highest |
| Engineering effort | Days | Weeks | Weeks | Months |
| Debuggability | High | Medium | High (stage logs) | Low without tracing |
| Multi-hop queries | Poor | Fair | Good | Best |
Evolution path most teams follow: ship naive with metrics → add hybrid + rerank (advanced) → refactor into modular interfaces when swapping embed models or stores → introduce agentic only for a premium tier or internal research mode after evals prove chunk RAG plateaus on your hardest question set.
from dataclasses import dataclass
from typing import Callable, Protocol


class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> list[dict]: ...


class Reranker(Protocol):
    def rerank(self, query: str, docs: list[dict], top_n: int) -> list[dict]: ...


@dataclass
class RagPipeline:
    retriever: Retriever
    reranker: Reranker | None
    generator: Callable[[str, str], str]  # (query, context) -> answer

    def answer(self, query: str, k: int = 20, top_n: int = 5) -> str:
        # Over-retrieve k candidates, then narrow to top_n for the prompt.
        hits = self.retriever.retrieve(query, k=k)
        if self.reranker:
            hits = self.reranker.rerank(query, hits, top_n=top_n)
        else:
            hits = hits[:top_n]
        context = "\n\n".join(h["text"] for h in hits)
        return self.generator(query, context)
// Assumes java.util.List, java.util.function.BiFunction, java.util.stream.Collectors,
// and a Document record with a text() accessor.
public interface Retriever {
    List<Document> retrieve(String query, int k);
}

public interface Reranker {
    List<Document> rerank(String query, List<Document> docs, int topN);
}

public record RagPipeline(Retriever retriever, Reranker reranker,
                          BiFunction<String, String, String> generator) {
    public String answer(String query, int k, int topN) {
        List<Document> hits = retriever.retrieve(query, k);
        // The reranker is optional; without one, keep the first topN hits.
        hits = (reranker != null)
                ? reranker.rerank(query, hits, topN)
                : hits.subList(0, Math.min(topN, hits.size()));
        String context = hits.stream().map(Document::text).collect(Collectors.joining("\n\n"));
        return generator.apply(query, context);
    }
}
Ship naive RAG with logging first. Measure empty-hit rate, MRR@k, and faithfulness before adding GraphRAG or agents—otherwise you optimize architecture without knowing the baseline failure mode.
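Two of the baseline metrics can be computed from logged rankings and a small labeled set. The sketch below is hedged: the guide does not define "empty-hit rate", so it is assumed here to mean the share of queries with no relevant document in the top k. Function names and sample data are illustrative.

```python
def mrr_at_k(results: list[list[str]], relevant: list[set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant id within the top k."""
    total = 0.0
    for ranked, gold in zip(results, relevant):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break
    return total / len(results) if results else 0.0


def empty_hit_rate(results: list[list[str]], relevant: list[set[str]], k: int = 10) -> float:
    """Assumed definition: fraction of queries whose top k holds no relevant id."""
    misses = sum(1 for ranked, gold in zip(results, relevant)
                 if not gold.intersection(ranked[:k]))
    return misses / len(results) if results else 0.0


# Three logged queries with their labeled relevant ids (made-up data).
results = [["d1", "d2"], ["d3", "d4"], ["d5"]]
relevant = [{"d2"}, {"d3"}, {"d9"}]
mrr = mrr_at_k(results, relevant)
empty = empty_hit_rate(results, relevant)
```

Faithfulness needs an LLM or human judge and is not covered by this sketch.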
Agentic RAG improves hard questions but adds 3–10× token cost and non-deterministic latency. Gate it behind a router: easy FAQ stays on advanced modular path; only escalate when classifier confidence is low.
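A router of this kind could look like the sketch below. The classifier, both answer paths, and the 0.7 threshold are all assumptions for illustration; in practice the threshold would be tuned on an eval set.

```python
from typing import Callable


def route_query(
    query: str,
    classify: Callable[[str], tuple[str, float]],
    fast_path: Callable[[str], str],
    agent_path: Callable[[str], str],
    min_confidence: float = 0.7,
) -> dict:
    """Stay on the cheap path when the classifier is confident; escalate otherwise."""
    label, confidence = classify(query)
    if confidence >= min_confidence:
        return {"path": "fast", "label": label, "answer": fast_path(query)}
    return {"path": "agent", "label": label, "answer": agent_path(query)}
```

Logging the returned `path` per query makes it possible to measure how much traffic actually pays the agentic cost.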
Internal copilots at large SaaS vendors often run two tiers: Tier A is hybrid + rerank for 90% of queries; Tier B is a LangGraph agent with SQL + KB tools for power users who opt in to slower, deeper answers.
“Compare RAG architectures.” — Start with naive limitations, then advanced (hybrid + rerank), modular (swappable stages), agentic (multi-step tool use). Mention eval-driven selection, not buzzword stacking.
GraphRAG: entities, communities, and global questions
Microsoft GraphRAG builds a knowledge graph from your corpus: extract entities and relationships, cluster into communities, summarize each community, and retrieve at community or entity level for questions that need global understanding—not just the nearest chunk.
Flat chunk RAG excels at “What is our PTO policy for contractors?” It struggles at “What are the main themes across 10,000 support tickets this quarter?” or “How do product areas relate to recurring escalations?” GraphRAG precomputes structure so global queries hit community summaries instead of random fragments.
Indexing pipeline
- Chunk & extract — LLM extracts entities (people, products, concepts) and edges (relates_to, owns, blocks) per chunk.
- Graph merge — Deduplicate entities across chunks; build adjacency lists or property graph.
- Community detection — Leiden/Louvain clustering on the graph; each community is a thematic cluster.
- Community reports — LLM summarizes each community (executive-level narrative + key entities).
- Dual index — Embed raw chunks (local) + community reports (global) in vector store.
| Query type | Retrieval target | Example |
|---|---|---|
| Local | Entity-linked chunks + neighborhood | “Who owns the billing API?” |
| Global | Top community reports | “Top risks mentioned in Q3 incident docs?” |
| Hybrid | Both paths merged + rerank | “How does Team X’s work affect compliance theme Y?” |
When GraphRAG wins
- Corpus is large, interconnected, and narrative (research, legal, incident postmortems, news archives).
- Users ask thematic / aggregate questions, not only fact lookup.
- You can afford offline indexing cost (multiple LLM passes per document at ingest).
When to skip
- Small FAQ (<500 pages) where hybrid search + rerank is enough.
- Highly structured data better served by SQL or OLAP, not LLM entity extraction.
- Strict freshness requirements—graph rebuild lag may be unacceptable without incremental graph updates.
import json

EXTRACTION_PROMPT = '''Extract entities and relationships as JSON.
Entities: {{name, type}}. Relationships: {{source, target, type, description}}.
Text: {chunk}'''


def extract_graph_elements(llm, chunk: str) -> dict:
    # llm is assumed to be an OpenAI-style client; input is truncated to 6,000 characters.
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(chunk=chunk[:6000])}],
    )
    return json.loads(resp.choices[0].message.content)


def merge_entities(store: dict, batch: dict) -> None:
    # Deduplicate on (lower-cased name, type); the first occurrence wins.
    for ent in batch.get("entities", []):
        key = (ent["name"].lower(), ent["type"])
        store.setdefault(key, ent)
public record Entity(String name, String type) {}
public record Edge(String source, String target, String relType) {}

public class GraphExtractor {
    private final ChatClient client;

    public GraphExtractor(ChatClient client) {
        this.client = client;
    }

    public GraphBatch extract(String chunk) {
        // Truncate to 6,000 characters, matching the Python version.
        var prompt = "Extract entities and relationships as JSON. Text: "
                + chunk.substring(0, Math.min(6000, chunk.length()));
        var json = client.prompt().user(prompt).call().content();
        return parseGraphBatch(json); // JSON-to-GraphBatch helper, not shown
    }
}
Community detection (Leiden algorithm) optimizes modularity—densely connected entity groups become communities. Summaries at community level are essentially precomputed map-reduce over your corpus, paid at index time instead of query time.
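To make "optimizes modularity" concrete, the score being maximized can be computed directly for a given partition. The sketch below evaluates modularity only, it is not the Leiden algorithm, and it assumes an undirected, unweighted graph without self-loops. The function name and toy graph are illustrative.

```python
from collections import defaultdict


def modularity(edges: list[tuple[str, str]], community: dict[str, int]) -> float:
    """Q = sum over communities of (internal_edges / m) - (degree_sum / 2m) ** 2."""
    m = len(edges)
    if m == 0:
        return 0.0
    internal: dict[int, int] = defaultdict(int)
    degree: dict[int, int] = defaultdict(int)
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by a single bridge edge (c-d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}
q_split = modularity(edges, split)    # one community per triangle
q_merged = modularity(edges, merged)  # everything in one community
```

Splitting along the bridge scores higher than lumping all entities together, which is why densely connected entity groups end up as separate communities.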
GraphRAG indexing can cost 10–50× naive embed-only ingest: entity extraction + community reports are LLM-heavy. Amortize over query volume; run full rebuilds weekly, incremental entity merge daily.
Entity extraction hallucinates edges on noisy PDFs. Validate graph quality with spot checks and prune low-confidence relationships before community detection—or themes will be garbage.
Graph structure can leak cross-tenant relationships if you merge graphs globally. Partition graphs by tenant or sensitivity label; never embed community reports that mix restricted and public entities.
Self-RAG: retrieve, reflect, and critique
Self-RAG trains or prompts models to emit reflection tokens that decide whether to retrieve, judge passage relevance, and critique whether the answer is supported, contradicted, or lacking evidence. This tightens faithfulness without running a separate judge model on every path.
Standard RAG always retrieves—even when the model already knows the answer or when retrieval adds noise. Self-RAG inserts a retrieval decision step and a generation critique step using special tokens or structured outputs the model was fine-tuned (or prompted) to produce.
Reflection token flow
- [Retrieve] vs [No retrieve] — model chooses whether KB lookup helps.
- If retrieve: fetch passages; score [Relevant] / [Irrelevant] per passage.
- Generate candidate answer conditioned on relevant passages only.
- [Supported] / [Contradicted] / [No evidence] — self-critique vs sources.
- If contradicted or weak: retrieve again with rewritten query or abstain.
| Token / signal | Meaning | Action |
|---|---|---|
| Retrieve | External KB likely helps | Run vector + hybrid search |
| No retrieve | Parametric knowledge sufficient | Skip retrieval; still log for audit |
| Relevant / Irrelevant | Passage quality | Drop irrelevant before context assembly |
| Supported | Answer grounded in passages | Return to user |
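| Contradicted | Answer conflicts with sources | Regenerate or escalate |
| No evidence | Passages neither support nor contradict the answer | Retrieve again with a rewritten query, or abstain |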
import json

# Single braces: this prompt is concatenated, never passed through str.format().
SELF_RAG_PROMPT = '''Step 1: Output JSON {"retrieve": true|false, "reason": "..."}.
If retrieve, I will provide passages. Then output {"relevance": [...], "answer": "...", "critique": "supported|contradicted|no_evidence"}.'''


def self_rag_answer(llm, query: str, retriever) -> dict:
    # call_json, call_text and format_passages are helpers defined elsewhere.
    plan = json.loads(call_json(llm, SELF_RAG_PROMPT + "\nQuery: " + query))
    if not plan["retrieve"]:
        return {"answer": call_text(llm, query), "retrieved": False}
    passages = retriever.retrieve(query, k=8)
    critique = json.loads(call_json(llm, format_passages(query, passages)))
    if critique["critique"] == "contradicted":
        # One retry: expand the query with the start of the rejected answer.
        passages = retriever.retrieve(query + " " + critique["answer"][:80], k=8)
        critique = json.loads(call_json(llm, format_passages(query, passages)))
    return {"answer": critique["answer"], "critique": critique["critique"], "retrieved": True}
public record SelfRagPlan(boolean retrieve, String reason) {}
public record SelfRagResult(String answer, String critique, boolean retrieved) {}

public SelfRagResult selfRag(ChatClient client, String query, Retriever retriever) {
    var plan = parsePlan(client.prompt().user(SELF_RAG_PROMPT + "\nQuery: " + query).call().content());
    if (!plan.retrieve()) {
        return new SelfRagResult(client.prompt().user(query).call().content(), "no_retrieve", false);
    }
    var passages = retriever.retrieve(query, 8);
    var critique = parseCritique(client, query, passages);
    if ("contradicted".equals(critique.critique())) {
        // Guard the substring: answers shorter than 80 characters would otherwise throw.
        var answer = critique.answer();
        var hint = answer.substring(0, Math.min(80, answer.length()));
        passages = retriever.retrieve(query + " " + hint, 8);
        critique = parseCritique(client, query, passages);
    }
    return new SelfRagResult(critique.answer(), critique.critique(), true);
}
Fine-tuned Self-RAG models need training data with labeled reflection tokens. Prompt-only approximations work but add 2–3 LLM calls per query—justify with faithfulness SLOs on regulated content.
Log retrieve vs no-retrieve decisions. If >80% are “no retrieve” on a KB product, your users may not need RAG at all—or queries are too generic for your chunk index.
“How do you reduce hallucination in RAG?” — Mention Self-RAG critique tokens, citation-required prompts, and reranking—plus offline evals (RAGAS faithfulness), not only prompting.
RAPTOR: recursive clustering and tree retrieval
RAPTOR (Recursive Abstractive Processing for Tree Organized Retrieval) clusters chunks, summarizes clusters recursively, and builds a tree—retrieve at leaf level for specifics or higher levels for broader context spanning many documents.
Fixed-size chunks lose hierarchy: a product spec’s “Security” section may scatter across forty chunks. RAPTOR bottom-up clusters semantically similar chunks, summarizes each cluster, clusters summaries again, until a root summary represents the whole corpus subtree—enabling multi-level retrieval.
Build phase
- Embed all leaf chunks.
- Cluster with GMM or k-means on embeddings (soft assignment optional).
- Summarize each cluster with an abstractive LLM call → parent nodes.
- Repeat on parent embeddings until tree depth cap or single root.
- Index all tree nodes (leaves + summaries) in vector store with level metadata.
Query phase
Retrieve from multiple levels: e.g. top-3 summary nodes + top-5 leaf nodes, dedupe, rerank, generate. Questions needing wide context (“Summarize onboarding docs”) hit high-level nodes; detail questions hit leaves.
| Level | Content | Best for |
|---|---|---|
| 0 (leaves) | Original chunks | Exact policy clauses, API params |
| 1–2 | Section summaries | Multi-chunk reasoning within a topic |
| Root-adjacent | Corpus-wide abstractive summaries | Overview, onboarding, exec briefings |
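The query phase (top summary nodes plus top leaf nodes, then dedupe) can be sketched with plain cosine similarity. This is an illustration under assumptions: nodes are dicts with `id`, `level`, `text`, and `embedding` keys, the toy embeddings are hand-written, and reranking and generation are left out.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_multilevel(query_emb: list[float], nodes: list[dict],
                        top_summaries: int = 3, top_leaves: int = 5) -> list[dict]:
    """Take the best summary nodes (level > 0) and best leaves (level 0), then dedupe."""
    def top(candidates: list[dict], n: int) -> list[dict]:
        ranked = sorted(candidates, key=lambda nd: cosine(query_emb, nd["embedding"]),
                        reverse=True)
        return ranked[:n]

    leaves = [n for n in nodes if n["level"] == 0]
    summaries = [n for n in nodes if n["level"] > 0]
    seen, unique = set(), []
    for node in top(summaries, top_summaries) + top(leaves, top_leaves):
        if node["text"] not in seen:  # drop verbatim duplicates across levels
            seen.add(node["text"])
            unique.append(node)
    return unique


nodes = [
    {"id": "leaf-a", "level": 0, "text": "A", "embedding": [1.0, 0.0]},
    {"id": "leaf-b", "level": 0, "text": "B", "embedding": [0.0, 1.0]},
    {"id": "sum-1", "level": 1, "text": "S", "embedding": [0.7, 0.7]},
]
hits = retrieve_multilevel([1.0, 0.1], nodes, top_summaries=1, top_leaves=1)
```

Shifting the two counts by question type is one way to let level metadata control retrieval depth.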
from dataclasses import dataclass


@dataclass
class TreeNode:
    id: str
    text: str
    level: int
    parent_id: str | None
    embedding: list[float]


def build_level(nodes: list[TreeNode], llm, embed_fn) -> list[TreeNode]:
    # Cluster the current level's embeddings and summarize each cluster into a parent node.
    # cluster_embeddings is assumed to return lists of indices into nodes.
    clusters = cluster_embeddings([n.embedding for n in nodes], k=min(8, len(nodes)))
    level = nodes[0].level + 1
    parents = []
    for i, cluster in enumerate(clusters):
        summary = llm_summarize(llm, [nodes[j].text for j in cluster])
        parent = TreeNode(id=f"L{level}-{i}", text=summary, level=level,
                          parent_id=None, embedding=embed_fn(summary))
        for j in cluster:
            nodes[j].parent_id = parent.id  # link children to their new parent
        parents.append(parent)
    return parents
public record TreeNode(String id, String text, int level, String parentId, float[] embedding) {}

public List<TreeNode> buildLevel(List<TreeNode> nodes, ChatClient llm, EmbeddingClient embed) {
    // clusterByEmbedding is assumed to return List<List<TreeNode>>.
    var clusters = clusterByEmbedding(nodes, Math.min(8, nodes.size()));
    int level = nodes.get(0).level() + 1;
    var parents = new ArrayList<TreeNode>();
    for (int i = 0; i < clusters.size(); i++) {
        var texts = clusters.get(i).stream().map(TreeNode::text).toList();
        var summary = llm.prompt().user("Summarize:\n" + String.join("\n", texts)).call().content();
        parents.add(new TreeNode("L" + level + "-" + i, summary, level, null, embed.embed(summary)));
    }
    return parents;
}
Each tree level adds summarization LLM calls, so budget for an index-time cost comparable to a lighter-weight GraphRAG. When documents change, rebuild only the affected subtrees rather than the full tree where possible.
Handbook and policy assistants use RAPTOR-like hierarchies so “overview” chat and “cite exact clause” chat share one index—level metadata routes retrieval depth.
Abstractive summaries drop nuance (numbers, dates, negation). Always include leaf chunks in the final context when the answer requires precise citations.
Corrective RAG: grade retrieval, then correct
Corrective RAG (CRAG) evaluates retrieved documents with a lightweight grader. Passages graded correct proceed to generation; ambiguous grades trigger a query rewrite and re-retrieval; incorrect grades trigger web search or a fallback to an alternate corpus.
Bad retrieval poisons generation—models confidently synthesize wrong answers from irrelevant chunks. CRAG inserts an explicit retrieval quality gate before context assembly, and a correction path when the KB is stale or incomplete.
Document grading
For each retrieved chunk, a grader (small LLM or cross-encoder) outputs:
- Correct — directly relevant; keep.
- Incorrect — off-topic or contradictory; discard.
- Ambiguous — partial match; may need rewrite or more retrieval.
Correction actions
| Aggregate grade | Action | Example tool |
|---|---|---|
| Mostly correct | Generate with filtered chunks | Internal KB only |
| Mixed / ambiguous | Query decomposition + re-retrieve | HyDE, multi-query |
| Mostly incorrect | Web search fallback | Tavily, Bing, Google CSE |
| Still empty | Abstain with explanation | “No verified source found” |
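The table can be read as a small decision function over the per-passage grades. The sketch below is one possible reading, with assumptions stated: "mostly" is taken to mean more than half, an empty result set is treated like a failed retrieval, and the action names are illustrative labels.

```python
from collections import Counter


def choose_action(grades: list[str], fallback_tried: bool = False,
                  majority: float = 0.5) -> str:
    """Map per-passage grades to one of the correction actions in the table."""
    counts = Counter(grades)
    total = len(grades)
    if total and counts["correct"] / total > majority:
        return "generate"              # mostly correct: answer from filtered chunks
    if total and counts["incorrect"] / total <= majority:
        return "rewrite_and_retrieve"  # mixed or ambiguous: decompose and retry
    # Mostly incorrect, or nothing was retrieved at all.
    return "abstain" if fallback_tried else "web_search"
```

For example, eight incorrect grades out of ten map to `web_search`. If the fallback also comes back empty, calling again with `fallback_tried=True` yields `abstain`.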
GRADE_PROMPT = 'Rate relevance: correct|incorrect|ambiguous. Query: {q}\nPassage: {p}'


def grade_passages(llm, query: str, docs: list[dict]) -> list[tuple[dict, str]]:
    scored = []
    for d in docs:
        label = llm.chat.completions.create(
            model="gpt-4o-mini", temperature=0, max_tokens=8,
            messages=[{"role": "user", "content": GRADE_PROMPT.format(q=query, p=d["text"][:1500])}],
        ).choices[0].message.content.strip().lower()
        scored.append((d, label))
    return scored


def crag_answer(llm, query, retriever, web_search):
    # generate is a helper defined elsewhere.
    docs = retriever.retrieve(query, k=10)
    graded = grade_passages(llm, query, docs)
    correct = [d for d, g in graded if g == "correct"]
    if len(correct) >= 2:
        return generate(llm, query, correct)
    # Fall back to web search only when at least half of the 10 hits were graded incorrect.
    incorrect = sum(1 for _, g in graded if g == "incorrect")
    web_docs = web_search(query) if incorrect >= 5 else []
    context = correct + web_docs
    if not context:
        return "No verified source found."  # abstain instead of generating from nothing
    return generate(llm, query, context)
public enum Grade { CORRECT, INCORRECT, AMBIGUOUS }

public String cragAnswer(ChatClient grader, String query, Retriever kb, WebSearchClient web) {
    var docs = kb.retrieve(query, 10);
    var graded = docs.stream().map(d -> new GradedDoc(d, grade(grader, query, d))).toList();
    var correct = graded.stream().filter(g -> g.grade() == Grade.CORRECT).map(GradedDoc::doc).toList();
    if (correct.size() >= 2) return generate(grader, query, correct);
    // Fall back to web search only when at least half of the 10 hits were graded incorrect.
    long incorrect = graded.stream().filter(g -> g.grade() == Grade.INCORRECT).count();
    var context = new ArrayList<>(correct);
    if (incorrect >= 5) context.addAll(web.search(query));
    // Abstain instead of generating from an empty context.
    if (context.isEmpty()) return "No verified source found.";
    return generate(grader, query, context);
}
Web search fallback exfiltrates user queries to third parties and may return untrusted content. Sanitize queries, allowlist domains, and mark web-sourced answers clearly in UI.
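The domain allowlist and the "mark web-sourced answers" advice can be sketched as a filter over search results. The result shape (dicts with a `url` key), the `source` tag, and the example domains are assumptions for illustration; query sanitization is not covered.

```python
from urllib.parse import urlparse


def filter_allowed(results: list[dict], allowed_domains: set[str]) -> list[dict]:
    """Keep results whose host is an allowed domain or one of its subdomains."""
    kept = []
    for item in results:
        host = (urlparse(item["url"]).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in allowed_domains):
            kept.append({**item, "source": "web"})  # tag so the UI can label it
    return kept


results = [
    {"url": "https://docs.example.com/a"},
    {"url": "https://evil-example.com/b"},
    {"url": "https://example.com.attacker.io/c"},
]
safe = filter_allowed(results, {"example.com"})
```

Matching on the parsed hostname with a leading dot avoids accepting look-alike hosts that merely contain the allowed name.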
Developer doc assistants use CRAG internally: if SDK docs score “incorrect” (old API version), fall back to live docs crawl + cache—reducing “answer from deprecated chunk” incidents.
Cache web fallback results with TTL per query hash. Repeated questions shouldn’t hit search API every time.
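A TTL cache keyed by query hash might look like this in-memory sketch. The class name, the normalization step (lower-casing and collapsing whitespace), and the injectable clock are assumptions; a production system would more likely use a shared store.

```python
import hashlib
import time


class TtlCache:
    """Cache fetch results per normalized-query hash for ttl_seconds."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: dict[str, tuple[float, list]] = {}

    @staticmethod
    def key(query: str) -> str:
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_fetch(self, query: str, fetch) -> list:
        k = self.key(query)
        now = self.clock()
        entry = self._entries.get(k)
        if entry is not None and now - entry[0] < self.ttl:
            return entry[1]  # fresh hit: skip the search API
        value = fetch(query)
        self._entries[k] = (now, value)
        return value
```

Usage would wrap the fallback call, for example `cache.get_or_fetch(query, web_search)`, so repeated questions within the TTL reuse the stored results.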
Long-context RAG: stuff vs retrieve
Models with 128K–1M token windows tempt teams to skip retrieval and paste entire knowledge bases. Long-context RAG hybridizes: retrieve candidates, then stuff many chunks—or whole docs—when window budget allows. Trade-offs: cost, latency, lost-in-the-middle, freshness.
Claude 3.5, Gemini 1.5, and GPT-4o-class models advertise huge context windows. For a 2,000-page handbook (~3M tokens), you still cannot stuff everything—but for a single product’s docs (<200K tokens), full-doc stuffing plus a smart reranker sometimes beats naive top-5 chunk RAG on synthesis tasks.
When stuffing wins
- Corpus fits comfortably in context after compression (with 20–40% output reserve).
- Tasks need cross-document references spread across many sections.
- Retrieval repeatedly misses because queries use non-standard phrasing.
When retrieval still wins
- Corpus >> context window (millions of chunks).
- Per-query cost must stay low—100K input tokens × high QPS is expensive.
- Strict ACL filtering—only inject authorized subsets per user.
- Freshness: index updates faster than resending full docs every query.
| Approach | Input tokens / query | Latency | Risk |
|---|---|---|---|
| Top-5 chunk RAG | ~2K–8K | Low | Missed context |
| Retrieve 50 + rerank 10 | ~10K–25K | Medium | Still boundary errors |
| Stuff full doc set | 50K–128K+ | High | Lost-in-the-middle, cost |
| RAG + long-context merge | Variable | Medium–high | Complexity |
Lost-in-the-middle
Models attend unevenly to very long prompts—facts buried in the middle of 80K tokens get ignored more often than content at the start or end. Mitigations: put highest-scored chunks at edges, use map-reduce summarization, or hierarchical RAPTOR/GraphRAG instead of one giant paste.
def assemble_long_context(query: str, retriever, max_tokens: int = 100_000) -> str:
    # rerank_cross_encoder and count_tokens are helpers defined elsewhere.
    candidates = retriever.retrieve(query, k=200)
    ranked = rerank_cross_encoder(query, candidates)[:80]
    # Spend the token budget in rank order first, so the cut-off drops the weakest chunks.
    selected, used = [], 0
    for doc in ranked:
        t = count_tokens(doc["text"])
        if used + t > max_tokens:
            break
        selected.append(doc)
        used += t
    # Place best chunks at start and end (lost-in-the-middle mitigation).
    head, tail, middle = selected[:20], selected[20:40], selected[40:]
    ordered = head + middle + tail
    return "\n\n---\n\n".join(doc["text"] for doc in ordered)
public String assembleLongContext(String query, Retriever retriever, int maxTokens) {
    var candidates = retriever.retrieve(query, 200);
    var reranked = reranker.rerank(query, candidates);
    var ranked = reranked.subList(0, Math.min(80, reranked.size()));
    // Spend the token budget in rank order first, so the cut-off drops the weakest chunks.
    var selected = new ArrayList<Document>();
    int used = 0;
    for (var doc : ranked) {
        int t = tokenizer.count(doc.text());
        if (used + t > maxTokens) break;
        selected.add(doc);
        used += t;
    }
    // Best chunks at the start, next-best at the end, the rest in the middle.
    int headEnd = Math.min(20, selected.size());
    int tailEnd = Math.min(40, selected.size());
    var ordered = new ArrayList<>(selected.subList(0, headEnd));
    ordered.addAll(selected.subList(tailEnd, selected.size()));
    ordered.addAll(selected.subList(headEnd, tailEnd));
    return ordered.stream().map(Document::text).collect(Collectors.joining("\n\n---\n\n"));
}
128K input tokens at $3 per 1M tokens comes to roughly $0.38–0.39 per request before output. At 10K QPS that is unsustainable, so long-context RAG is a precision tool for premium tiers, not the default path.
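The arithmetic behind that figure, worked through as a short sketch. It assumes the stated $3 per 1M input tokens and a constant 10K QPS sustained for a full day; the function name is illustrative. Reading 128K as 128,000 tokens gives $0.384 per request, while 131,072 tokens gives about $0.393.

```python
def input_cost_usd(input_tokens: int, usd_per_million: float) -> float:
    """Input-side cost of one request, before any output tokens."""
    return input_tokens * usd_per_million / 1_000_000


per_request = input_cost_usd(128_000, 3.0)   # about 0.384 USD
per_request_binary = input_cost_usd(131_072, 3.0)  # about 0.393 USD
per_second = per_request * 10_000            # about 3,840 USD at 10K QPS
per_day = per_second * 86_400                # about 331.8 million USD
```

At that rate the input bill alone runs to hundreds of millions of dollars per day, which is why the stuffing path only makes sense at low volume or with cached prefixes.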
Stuffing simplifies architecture (no vector DB) but gives up incremental updates: any document change means rebuilding and resending the full stuffed prompt, so version prompts carefully.
Some providers use sparse attention or prompt caching on long prefixes—identical doc bundles across users benefit from KV cache hits when doc content is a stable prefix.
“Would you use a 1M context model instead of RAG?” — No by default: cost, ACL, freshness. Yes for bounded corpora or as second pass after retrieval expands candidate set.
Cache-augmented RAG: prompt caching for static KB
Cache-augmented generation (CAG) preloads stable knowledge into the model’s prompt cache (provider KV reuse) instead of retrieving dynamically every query—ideal when the KB changes rarely and fits cache-eligible prefix size.
Anthropic and OpenAI cache attention states for long identical prefixes. If your product’s “knowledge bundle” (policy PDFs, API reference) is static for hours or days, load it once as a cacheable system block; per-user queries append as suffix-only tokens at reduced input cost on hits.
CAG vs RAG
| Dimension | RAG (dynamic retrieve) | CAG (cached prefix) |
|---|---|---|
| Update latency | Minutes (re-embed chunks) | Cache TTL + prefix rebuild |
| Query cost | Embed + small context | Large prefix once, cheap suffix |
| Best corpus size | Unbounded | Fits cacheable window (~100K tokens practical) |
| Personalization | Metadata filters per user | Harder—need per-tenant cache keys |
Hybrid pattern
Cache stable core docs (brand guidelines, core API) as prefix; RAG retrieve tenant-specific or fresh content as suffix. Maximizes cache hit rate while keeping personalization.
# Stable, versioned knowledge bundle; user_query is the incoming question.
with open("kb/core_policies.md", encoding="utf-8") as f:
    SYSTEM_KB = f.read()

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": SYSTEM_KB,
                "cache_control": {"type": "ephemeral"},
            },
            {"type": "text", "text": f"User question: {user_query}"},
        ],
    }
]
# First request populates cache; subsequent identical prefix → discounted input tokens
var systemKb = Files.readString(Path.of("kb/core_policies.md"));
var request = ChatRequest.builder()
    .messages(List.of(
        Message.builder()
            .role(Role.USER)
            .content(List.of(
                ContentBlock.text(systemKb).cacheControl(CacheControl.ephemeral()),
                ContentBlock.text("User question: " + userQuery)
            ))
            .build()))
    .build();
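The hybrid pattern (cached core prefix, retrieved content as suffix) can be sketched by extending the same message shape. The function name, the "Tenant context" label, and the placeholder text are assumptions; the `cache_control` block mirrors the snippet in this section.

```python
def build_hybrid_messages(core_kb: str, retrieved: list[str], user_query: str) -> list[dict]:
    """Stable core docs go first and are marked cacheable; dynamic content follows."""
    dynamic = "\n\n".join(retrieved) if retrieved else "(no tenant-specific passages)"
    return [{
        "role": "user",
        "content": [
            # Identical bytes on every request, so the provider can reuse the cache.
            {"type": "text", "text": core_kb, "cache_control": {"type": "ephemeral"}},
            # Changes per tenant and per query, so it stays outside the cached block.
            {"type": "text", "text": f"Tenant context:\n{dynamic}"},
            {"type": "text", "text": f"User question: {user_query}"},
        ],
    }]
```

Keeping everything dynamic after the cached block matters because any change to the prefix bytes results in a cache miss.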
Version cache keys: log kb_v{hash} in telemetry. When the cache hit rate drops after a deploy, you likely changed the prefix bytes silently.
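One way to derive such a key is to hash the exact prefix bytes. The helper name and the 12-character truncation below are assumptions, not an established convention.

```python
import hashlib


def kb_cache_key(prefix_text: str, length: int = 12) -> str:
    """Derive a short version tag from the exact bytes sent as the cached prefix."""
    digest = hashlib.sha256(prefix_text.encode("utf-8")).hexdigest()
    return f"kb_v{digest[:length]}"


key_a = kb_cache_key("Refunds are processed within 14 days.")
key_b = kb_cache_key("Refunds are processed within 14 days. ")  # trailing space added
```

Even a whitespace-only change produces a different key, so a jump in distinct keys in telemetry points to a silently modified prefix.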
Support bots with 40-page static playbooks use CAG on Anthropic for sub-100ms TTFT after warm cache—dynamic RAG only for account-specific ticket history.
Shared cache prefixes must not contain tenant A’s data served to tenant B. One cache block per security boundary.
Production: pattern selection matrix
Pick patterns from eval evidence, not hype. This matrix maps common product shapes to recommended architectures and links forward to ingestion pipelines—because advanced retrieval fails on bad indexes.
Pattern selection matrix
| Product shape | Recommended pattern | Avoid |
|---|---|---|
| FAQ <500 pages | Advanced modular (hybrid + rerank) | GraphRAG, agent loops |
| Multi-hop internal wiki | Modular + query decomposition | Naive top-3 only |
| Executive thematic Q&A | GraphRAG global communities | Leaf-only chunk RAG |
| Regulated citations required | Self-RAG + CRAG grading | Stuff-only long context |
| Stale KB + live web | CRAG + web fallback | KB-only blind trust |
| Stable handbook <100K tokens | CAG prefix + optional RAG suffix | Full re-retrieve every turn |
| Research copilot | Agentic RAG multi-tool | Single-shot retrieve |
Decision checklist
- Baseline naive RAG eval logged (faithfulness, answer relevance, latency p95).
- Pattern change tied to A/B or shadow traffic—not architecture for its own sake.
- Max LLM calls per user query capped (budget + iteration limit).
- Fallback path defined: abstain, human handoff, or web with disclaimers.
- Ingestion freshness SLO documented (see next guide).
Advanced RAG is production-ready when you can name which pattern each product surface uses, what it costs per query, and which eval metric regressed last time you changed chunk size or embedding model.
“Design RAG for our company handbook + tickets.” — Modular hybrid baseline, CRAG for stale KB, optional agent for analytics questions, GraphRAG only if evals show global query gap.