Advanced RAG patterns

Architecture variants: naive, advanced, modular, agentic

Naive RAG embeds chunks, retrieves top-k, stuffs context, generates once. Advanced RAG adds query transformation, hybrid search, reranking, and compression. Modular RAG composes swappable stages (index, retrieve, post-process, generate). Agentic RAG lets an LLM decide when and how to retrieve across multiple tools and iterations.

Most teams start with naive RAG because it ships in a weekend: chunk documents, embed, store in pgvector or Pinecone, call similarity_search(query, k=5), concatenate, prompt GPT-4o. That baseline fails on multi-hop questions, tables spread across pages, and queries that need synthesis across ten sources—not one paragraph.

Naive RAG

Flow: query → embed → ANN search → top-k chunks → LLM. Strengths: simple, debuggable, cheap at small scale. Weaknesses: no query understanding, no rerank, brittle to chunk boundaries, hallucinates when retrieval misses.

Stage	Typical implementation	Failure mode
Chunking	Fixed 512 tokens, 10% overlap	Splits tables, code blocks, legal clauses
Retrieval	Cosine on single embedding model	Lexical mismatch ("401k" vs "retirement plan")
Generation	Single-shot answer	Confident wrong when context empty or irrelevant
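The chunking row above can be made concrete with a minimal sketch of fixed-size windows with overlap. The function name `chunk_tokens` and the pre-tokenized input are assumptions for illustration; a real pipeline would use the tokenizer of its embedding model.

```python
def chunk_tokens(tokens: list[str], size: int = 512, overlap_ratio: float = 0.10) -> list[list[str]]:
    """Split a token list into fixed-size windows that overlap by overlap_ratio."""
    if size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("size must be positive and overlap_ratio in [0, 1)")
    # With size=512 and 10% overlap, each window starts 461 tokens after the previous one.
    step = max(1, size - int(size * overlap_ratio))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks
```

Window boundaries fall wherever the token count lands, which is exactly why tables, code blocks, and clauses get split: the splitter has no notion of document structure.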

Advanced RAG

Adds pre-retrieval (HyDE, multi-query, step-back prompting), hybrid BM25 + dense, cross-encoder rerank, and context compression (LLMLingua, selective sentence extraction). Latency rises 200–800ms; accuracy on hard evals often jumps 15–40 points versus naive on the same corpus.
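One common way to combine a BM25 ranking with a dense ranking is reciprocal rank fusion (RRF). The text above does not name a fusion method, so treat RRF, the function name `rrf_fuse`, and the constant `k=60` as assumptions in this sketch.

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[str]:
    """Fuse several ranked lists of document ids with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document earns more credit the higher it appears in each list.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]

bm25_ids = ["a", "b", "c"]    # lexical ranking
dense_ids = ["b", "d", "a"]   # embedding ranking
fused = rrf_fuse([bm25_ids, dense_ids])
```

RRF only needs ranks, not scores, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales. The fused list would then go to the cross-encoder reranker.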

Modular RAG

Treat each stage as an interface: Retriever, Reranker, ContextBuilder, Generator. Swap hybrid for ColBERT without rewriting generation. LangChain LCEL and LlamaIndex query pipelines are modular RAG in practice—production teams wrap them in their own config-driven DAG.

Agentic RAG

An agent loop: plan → retrieve from KB / web / SQL → read → decide if enough evidence → maybe retrieve again → answer. Best for open-ended research, multi-document synthesis, and tools beyond vectors. Costs more tokens and needs guardrails (max iterations, budget caps) to prevent runaway loops.
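The loop and its guardrails can be sketched as follows. The `plan`, `tools`, and `generate` callables are hypothetical stand-ins for LLM and tool calls, and the caps are illustrative defaults, not recommended values.

```python
from typing import Callable, Optional

def agentic_answer(
    query: str,
    tools: dict[str, Callable[[str], list[str]]],
    plan: Callable[[str, list[str]], Optional[tuple[str, str]]],
    generate: Callable[[str, list[str]], str],
    max_iterations: int = 4,
    max_evidence: int = 20,
) -> str:
    """Plan, retrieve, and re-plan until the planner stops or a cap is hit."""
    evidence: list[str] = []
    for _ in range(max_iterations):          # iteration cap prevents runaway loops
        step = plan(query, evidence)         # None means "enough evidence"
        if step is None:
            break
        tool_name, tool_query = step
        if tool_name not in tools:           # unknown tool: stop rather than guess
            break
        evidence.extend(tools[tool_name](tool_query))
        if len(evidence) >= max_evidence:    # budget cap on context size
            evidence = evidence[:max_evidence]
            break
    return generate(query, evidence)
```

Both caps fail closed: when either is hit, the loop stops retrieving and answers from what it has, so cost per query stays bounded.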

Architecture comparison at a glance

Dimension	Naive	Advanced	Modular	Agentic
LLM calls / query	1	1–2	1–3	3–15
Latency p95	Lowest	Medium	Medium	Highest
Engineering effort	Days	Weeks	Weeks	Months
Debuggability	High	Medium	High (stage logs)	Low without tracing
Multi-hop queries	Poor	Fair	Good	Best

Evolution path most teams follow: ship naive with metrics → add hybrid + rerank (advanced) → refactor into modular interfaces when swapping embed models or stores → introduce agentic only for a premium tier or internal research mode after evals prove chunk RAG plateaus on your hardest question set.

from dataclasses import dataclass
from typing import Callable, Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> list[dict]: ...

class Reranker(Protocol):
    def rerank(self, query: str, docs: list[dict], top_n: int) -> list[dict]: ...

@dataclass
class RagPipeline:
    retriever: Retriever
    reranker: Reranker | None
    generator: Callable[[str, str], str]  # (query, context) -> answer

    def answer(self, query: str, k: int = 20, top_n: int = 5) -> str:
        # Over-retrieve k candidates, then narrow to top_n for the prompt.
        hits = self.retriever.retrieve(query, k=k)
        if self.reranker:
            hits = self.reranker.rerank(query, hits, top_n=top_n)
        else:
            hits = hits[:top_n]
        context = "\n\n".join(h["text"] for h in hits)
        return self.generator(query, context)

import java.util.List;
import java.util.function.BiFunction;
import java.util.stream.Collectors;

// Document is assumed to be a record exposing text(), e.g. record Document(String text) {}
public interface Retriever {
  List<Document> retrieve(String query, int k);
}

public interface Reranker {
  List<Document> rerank(String query, List<Document> docs, int topN);
}

public record RagPipeline(Retriever retriever, Reranker reranker,
                          BiFunction<String, String, String> generator) {
  public String answer(String query, int k, int topN) {
    List<Document> hits = retriever.retrieve(query, k);
    // reranker may be null; fall back to plain truncation
    if (reranker != null) hits = reranker.rerank(query, hits, topN);
    else hits = hits.subList(0, Math.min(topN, hits.size()));
    var context = hits.stream().map(Document::text).collect(Collectors.joining("\n\n"));
    return generator.apply(query, context);
  }
}

💡 Pro Tip

Ship naive RAG with logging first. Measure empty-hit rate, MRR@k, and faithfulness before adding GraphRAG or agents—otherwise you optimize architecture without knowing the baseline failure mode.
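Two of the baseline metrics named above are cheap to compute from retrieval logs. This sketch assumes each log entry is a ranked list of document ids and that a labeled set of relevant ids exists per query; the function names are illustrative.

```python
def empty_hit_rate(results: list[list[str]]) -> float:
    """Fraction of queries for which retrieval returned nothing."""
    if not results:
        return 0.0
    return sum(1 for hits in results if not hits) / len(results)

def mrr_at_k(results: list[list[str]], relevant: list[set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant hit within the top k."""
    if not results:
        return 0.0
    total = 0.0
    for hits, gold in zip(results, relevant):
        for rank, doc_id in enumerate(hits[:k], start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(results)

results = [["a", "b"], [], ["c", "d", "e"]]
relevant = [{"b"}, {"x"}, {"c"}]
baseline = {"empty_hit_rate": empty_hit_rate(results), "mrr@10": mrr_at_k(results, relevant)}
```

Faithfulness needs an LLM or human judge and is not covered here; these two numbers only describe the retrieval stage.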

⚖️ Trade-off

Agentic RAG improves hard questions but adds 3–10× token cost and non-deterministic latency. Gate it behind a router: easy FAQ stays on advanced modular path; only escalate when classifier confidence is low.
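A router gate of this kind can be very small. In this sketch `faq_confidence` is a hypothetical classifier returning the probability that a query is an easy FAQ lookup, and the threshold of 0.7 is an arbitrary placeholder to be tuned on eval data.

```python
from typing import Callable

def route(query: str, faq_confidence: Callable[[str], float], threshold: float = 0.7) -> str:
    """Return which pipeline should handle the query."""
    confidence = faq_confidence(query)
    # High confidence: stay on the cheaper advanced path. Low confidence: escalate.
    return "advanced" if confidence >= threshold else "agentic"
```

Logging the confidence next to the chosen route makes it possible to check later whether escalated queries actually benefited from the agentic path.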

📦 Real World

Internal copilots at large SaaS vendors often run two tiers: Tier A is hybrid + rerank for 90% of queries; Tier B is a LangGraph agent with SQL + KB tools for power users who opt in to slower, deeper answers.

🎯 Interview Tip

“Compare RAG architectures.” — Start with naive limitations, then advanced (hybrid + rerank), modular (swappable stages), agentic (multi-step tool use). Mention eval-driven selection, not buzzword stacking.

GraphRAG: entities, communities, and global questions

Microsoft GraphRAG builds a knowledge graph from your corpus: extract entities and relationships, cluster into communities, summarize each community, and retrieve at community or entity level for questions that need global understanding—not just the nearest chunk.

Flat chunk RAG excels at “What is our PTO policy for contractors?” It struggles at “What are the main themes across 10,000 support tickets this quarter?” or “How do product areas relate to recurring escalations?” GraphRAG precomputes structure so global queries hit community summaries instead of random fragments.

Indexing pipeline

Chunk & extract — LLM extracts entities (people, products, concepts) and edges (relates_to, owns, blocks) per chunk.
Graph merge — Deduplicate entities across chunks; build adjacency lists or property graph.
Community detection — Leiden/Louvain clustering on the graph; each community is a thematic cluster.
Community reports — LLM summarizes each community (executive-level narrative + key entities).
Dual index — Embed raw chunks (local) + community reports (global) in vector store.

Query type	Retrieval target	Example
Local	Entity-linked chunks + neighborhood	“Who owns the billing API?”
Global	Top community reports	“Top risks mentioned in Q3 incident docs?”
Hybrid	Both paths merged + rerank	“How does Team X’s work affect compliance theme Y?”

When GraphRAG wins

Corpus is large, interconnected, and narrative (research, legal, incident postmortems, news archives).
Users ask thematic / aggregate questions, not only fact lookup.
You can afford offline indexing cost (multiple LLM passes per document at ingest).

When to skip

Small FAQ (<500 pages) where hybrid search + rerank is enough.
Highly structured data better served by SQL or OLAP, not LLM entity extraction.
Strict freshness requirements—graph rebuild lag may be unacceptable without incremental graph updates.

import json

EXTRACTION_PROMPT = '''Extract entities and relationships as JSON.
Entities: {{name, type}}. Relationships: {{source, target, type, description}}.
Text: {chunk}'''

def extract_graph_elements(llm, chunk: str) -> dict:
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(chunk=chunk[:6000])}],
    )
    return json.loads(resp.choices[0].message.content)

def merge_entities(store: dict, batch: dict) -> None:
    for ent in batch.get("entities", []):
        key = (ent["name"].lower(), ent["type"])
        store.setdefault(key, ent)

public record Entity(String name, String type) {}
public record Edge(String source, String target, String relType) {}

public class GraphExtractor {
  private final ChatClient client;

  public GraphExtractor(ChatClient client) { this.client = client; }

  public GraphBatch extract(String chunk) {
    var prompt = "Extract entities and relationships as JSON. Text: " + chunk.substring(0, Math.min(6000, chunk.length()));
    var json = client.prompt().user(prompt).call().content();
    return parseGraphBatch(json);
  }
}

🔬 Under the Hood

Community detection (Leiden algorithm) optimizes modularity—densely connected entity groups become communities. Summaries at community level are essentially precomputed map-reduce over your corpus, paid at index time instead of query time.
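To make "optimizes modularity" concrete, here is a sketch that computes the modularity score of a given partition on an unweighted, undirected graph. It evaluates the objective only; it is not the Leiden algorithm, which searches for a partition that scores well. Names are illustrative.

```python
from collections import defaultdict

def modularity(edges: list[tuple[str, str]], community: dict[str, int]) -> float:
    """Q = sum over communities of (internal_edges / m) - (degree_sum / 2m)^2."""
    m = len(edges)
    if m == 0:
        return 0.0
    internal: dict[int, int] = defaultdict(int)  # edges with both ends in the community
    degree: dict[int, int] = defaultdict(int)    # total degree of the community's nodes
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two triangles joined by a single bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {n: 0 for n in split}
```

Here `modularity(edges, split)` is 5/14 (about 0.357) while `modularity(edges, merged)` is 0, so splitting the graph at the bridge scores higher than lumping everything together.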

💰 Cost

GraphRAG indexing can cost 10–50× naive embed-only ingest: entity extraction + community reports are LLM-heavy. Amortize over query volume; run full rebuilds weekly, incremental entity merge daily.

⚠️ Pitfall

Entity extraction hallucinates edges on noisy PDFs. Validate graph quality with spot checks and prune low-confidence relationships before community detection—or themes will be garbage.
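A pruning pass before community detection might look like the sketch below. It assumes each extracted edge carries a `confidence` field, which is not part of the extraction schema shown earlier and would have to be added to the prompt; the thresholds are placeholders.

```python
from collections import defaultdict

def prune_edges(edges: list[dict], min_confidence: float = 0.6, min_support: int = 2) -> list[dict]:
    """Merge duplicate edges, then keep those that are confident or seen repeatedly."""
    grouped: dict[tuple[str, str, str], list[float]] = defaultdict(list)
    for e in edges:
        key = (e["source"].lower(), e["target"].lower(), e["type"])
        grouped[key].append(e.get("confidence", 0.0))  # missing confidence counts as 0
    kept = []
    for (src, dst, rel), confs in grouped.items():
        # Keep an edge if any extraction was confident, or if several chunks agree on it.
        if max(confs) >= min_confidence or len(confs) >= min_support:
            kept.append({"source": src, "target": dst, "type": rel,
                         "confidence": max(confs), "support": len(confs)})
    return kept
```

Counting support across chunks is a cheap proxy for reliability: an edge hallucinated from one noisy page rarely reappears elsewhere. Spot checks are still needed, since a systematic extraction error can repeat across chunks.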

🔒 Security

Graph structure can leak cross-tenant relationships if you merge graphs globally. Partition graphs by tenant or sensitivity label; never embed community reports that mix restricted and public entities.

Self-RAG: retrieve, reflect, and critique

Self-RAG trains or prompts models to emit reflection tokens that decide whether to retrieve, judge passage relevance, and critique whether the answer is supported, contradicted, or lacking evidence. This tightens faithfulness without a separate judge model on every path.

Standard RAG always retrieves—even when the model already knows the answer or when retrieval adds noise. Self-RAG inserts a retrieval decision step and a generation critique step using special tokens or structured outputs the model was fine-tuned (or prompted) to produce.

Reflection token flow

[Retrieve] vs [No retrieve] — model chooses whether KB lookup helps.
If retrieve: fetch passages; score [Relevant] / [Irrelevant] per passage.
Generate candidate answer conditioned on relevant passages only.
[Supported] / [Contradicted] / [No evidence] — self-critique vs sources.
If contradicted or weak: retrieve again with rewritten query or abstain.

Token / signal	Meaning	Action
Retrieve	External KB likely helps	Run vector + hybrid search
No retrieve	Parametric knowledge sufficient	Skip retrieval; still log for audit
Relevant / Irrelevant	Passage quality	Drop irrelevant before context assembly
Supported	Answer grounded in passages	Return to user
Contradicted	Answer conflicts with sources	Regenerate or escalate
No evidence	Passages do not cover the answer	Re-retrieve with a rewritten query, or abstain

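import json

# call_json, call_text, and format_passages are helper functions assumed to exist elsewhere.
# The prompt is concatenated, not passed through str.format, so braces are written singly.
SELF_RAG_PROMPT = '''Step 1: Output JSON {"retrieve": true|false, "reason": "..."}.
If retrieve, I will provide passages. Then output {"relevance": [...], "answer": "...", "critique": "supported|contradicted|no_evidence"}.'''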

def self_rag_answer(llm, query: str, retriever) -> dict:
    plan = json.loads(call_json(llm, SELF_RAG_PROMPT + "\nQuery: " + query))
    passages = retriever.retrieve(query, k=8) if plan["retrieve"] else []
    if not plan["retrieve"]:
        return {"answer": call_text(llm, query), "retrieved": False}
    critique = json.loads(call_json(llm, format_passages(query, passages)))
    if critique["critique"] == "contradicted":
        passages = retriever.retrieve(query + " " + critique["answer"][:80], k=8)
        critique = json.loads(call_json(llm, format_passages(query, passages)))
    return {"answer": critique["answer"], "critique": critique["critique"], "retrieved": True}

public record SelfRagPlan(boolean retrieve, String reason) {}
public record SelfRagResult(String answer, String critique, boolean retrieved) {}

public SelfRagResult selfRag(ChatClient client, String query, Retriever retriever) {
  var plan = parsePlan(client.prompt().user(SELF_RAG_PROMPT + "\nQuery: " + query).call().content());
  if (!plan.retrieve()) {
    return new SelfRagResult(client.prompt().user(query).call().content(), "no_retrieve", false);
  }
  var passages = retriever.retrieve(query, 8);
  var critique = parseCritique(client, query, passages);
  if ("contradicted".equals(critique.critique())) {
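    // Guard the substring: answers shorter than 80 characters would otherwise throw.
    passages = retriever.retrieve(query + " " + critique.answer().substring(0, Math.min(80, critique.answer().length())), 8);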
    critique = parseCritique(client, query, passages);
  }
  return new SelfRagResult(critique.answer(), critique.critique(), true);
}

⚖️ Trade-off

Fine-tuned Self-RAG models need training data with labeled reflection tokens. Prompt-only approximations work but add 2–3 LLM calls per query—justify with faithfulness SLOs on regulated content.

💡 Pro Tip

Log retrieve vs no-retrieve decisions. If >80% are “no retrieve” on a KB product, your users may not need RAG at all—or queries are too generic for your chunk index.

🎯 Interview Tip

“How do you reduce hallucination in RAG?” — Mention Self-RAG critique tokens, citation-required prompts, and reranking—plus offline evals (RAGAS faithfulness), not only prompting.

RAPTOR: recursive clustering and tree retrieval

RAPTOR (Recursive Abstractive Processing for Tree Organized Retrieval) clusters chunks, summarizes clusters recursively, and builds a tree—retrieve at leaf level for specifics or higher levels for broader context spanning many documents.

Fixed-size chunks lose hierarchy: a product spec’s “Security” section may scatter across forty chunks. RAPTOR works bottom-up: it clusters semantically similar chunks, summarizes each cluster, then clusters the summaries again, repeating until a single root summary represents the whole corpus or subtree. The resulting tree supports multi-level retrieval.

Build phase

Embed all leaf chunks.
Cluster with GMM or k-means on embeddings (soft assignment optional).
Summarize each cluster with an abstractive LLM call → parent nodes.
Repeat on parent embeddings until tree depth cap or single root.
Index all tree nodes (leaves + summaries) in vector store with level metadata.

Query phase

Retrieve from multiple levels: e.g. top-3 summary nodes + top-5 leaf nodes, dedupe, rerank, generate. Questions needing wide context (“Summarize onboarding docs”) hit high-level nodes; detail questions hit leaves.
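A minimal sketch of the multi-level query step, assuming tree nodes are stored as dicts with `id`, `text`, `level`, and `embedding` keys and searched by brute-force cosine similarity. A real deployment would push the level filter into the vector store and add the rerank step afterwards.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_multilevel(query_emb: list[float], nodes: list[dict],
                        n_summary: int = 3, n_leaf: int = 5) -> list[dict]:
    """Take the best summary nodes and the best leaves, then drop duplicate texts."""
    def top(pool: list[dict], n: int) -> list[dict]:
        return sorted(pool, key=lambda nd: cosine(query_emb, nd["embedding"]), reverse=True)[:n]

    leaves = [nd for nd in nodes if nd["level"] == 0]
    summaries = [nd for nd in nodes if nd["level"] > 0]
    picked = top(summaries, n_summary) + top(leaves, n_leaf)
    seen, out = set(), []
    for nd in picked:
        if nd["text"] not in seen:  # dedupe on text; ids differ across levels
            seen.add(nd["text"])
            out.append(nd)
    return out
```

Keeping separate quotas per level guarantees that leaves always reach the context, even when a broad summary scores highest for the query.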

Level	Content	Best for
0 (leaves)	Original chunks	Exact policy clauses, API params
1–2	Section summaries	Multi-chunk reasoning within a topic
Root-adjacent	Corpus-wide abstractive summaries	Overview, onboarding, exec briefings

from dataclasses import dataclass

@dataclass
class TreeNode:
    id: str
    text: str
    level: int
    parent_id: str | None
    embedding: list[float]

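def build_level(children: list[TreeNode], llm, embed_fn) -> list[TreeNode]:
    # Cluster one level's nodes by embedding, then summarize each cluster into a parent node.
    # cluster_embeddings and llm_summarize are helpers assumed to exist elsewhere.
    if not children:
        return []
    level = children[0].level + 1
    clusters = cluster_embeddings([c.embedding for c in children], k=min(8, len(children)))
    parents = []
    for i, cluster in enumerate(clusters):  # each cluster is a list of indices into children
        summary = llm_summarize(llm, [children[j].text for j in cluster])
        parent = TreeNode(id=f"L{level}-{i}", text=summary, level=level,
                          parent_id=None, embedding=embed_fn(summary))
        for j in cluster:
            children[j].parent_id = parent.id  # link each child to its new parent
        parents.append(parent)
    return parents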

public record TreeNode(String id, String text, int level, String parentId, float[] embedding) {}

public List<TreeNode> buildLevel(List<TreeNode> nodes, ChatClient llm, EmbeddingClient embed) {
  if (nodes.isEmpty()) return List.of();
  int level = nodes.get(0).level() + 1;
  // clusterByEmbedding is assumed to return List<List<TreeNode>>
  List<List<TreeNode>> clusters = clusterByEmbedding(nodes, Math.min(8, nodes.size()));
  var parents = new ArrayList<TreeNode>();
  for (int i = 0; i < clusters.size(); i++) {
    var texts = clusters.get(i).stream().map(TreeNode::text).toList();
    var summary = llm.prompt().user("Summarize:\n" + String.join("\n", texts)).call().content();
    parents.add(new TreeNode("L" + level + "-" + i, summary, level, null, embed.embed(summary)));
  }
  return parents;
}

💰 Cost

Each tree level adds summarization LLM calls, so budget index-time cost as a lighter version of GraphRAG indexing. When documents change, rebuild only the affected subtrees rather than the full tree where possible.

📦 Real World

Handbook and policy assistants use RAPTOR-like hierarchies so “overview” chat and “cite exact clause” chat share one index—level metadata routes retrieval depth.

⚠️ Pitfall

Abstractive summaries drop nuance (numbers, dates, negation). Always include leaf chunks in final context when answer requires precise citations.

Corrective RAG: grade retrieval, then correct

Corrective RAG (CRAG) evaluates retrieved documents with a lightweight grader. Passages graded correct proceed to generation; ambiguous grades trigger a query rewrite and re-retrieval; incorrect grades trigger web search or a fallback to an alternate corpus.

Bad retrieval poisons generation—models confidently synthesize wrong answers from irrelevant chunks. CRAG inserts an explicit retrieval quality gate before context assembly, and a correction path when the KB is stale or incomplete.

Document grading

For each retrieved chunk, a grader (small LLM or cross-encoder) outputs:

Correct — directly relevant; keep.
Incorrect — off-topic or contradictory; discard.
Ambiguous — partial match; may need rewrite or more retrieval.

Correction actions

Aggregate grade	Action	Example tool
Mostly correct	Generate with filtered chunks	Internal KB only
Mixed / ambiguous	Query decomposition + re-retrieve	HyDE, multi-query
Mostly incorrect	Web search fallback	Tavily, Bing, Google CSE
Still empty	Abstain with explanation	“No verified source found”
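The table above can be read as a small decision function over the per-passage grades. The majority threshold of 0.5 and the action names are assumptions for this sketch; "mostly" is not defined more precisely in the table.

```python
from collections import Counter

def choose_action(grades: list[str], majority: float = 0.5) -> str:
    """Map per-passage grades to the next pipeline action."""
    if not grades:
        return "web_search"              # nothing retrieved: go straight to fallback
    counts = Counter(grades)
    n = len(grades)
    if counts["correct"] / n > majority:
        return "generate"                # mostly correct: answer from filtered chunks
    if counts["incorrect"] / n > majority:
        return "web_search"              # mostly incorrect: the KB is not helping
    return "rewrite_and_retrieve"        # mixed or ambiguous: try a better query

def after_fallback(context: list[str]) -> str:
    """Final gate: abstain if every path came back empty."""
    return "generate" if context else "abstain"
```

Using ratios rather than absolute counts keeps the behavior stable when `k` changes.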

GRADE_PROMPT = 'Rate relevance: correct|incorrect|ambiguous. Query: {q}\nPassage: {p}'

def grade_passages(llm, query: str, docs: list[dict]) -> list[tuple[dict, str]]:
    scored = []
    for d in docs:
        label = llm.chat.completions.create(
            model="gpt-4o-mini", temperature=0, max_tokens=8,
            messages=[{"role": "user", "content": GRADE_PROMPT.format(q=query, p=d["text"][:1500])}],
        ).choices[0].message.content.strip().lower().rstrip(".")  # tolerate a trailing period in the label
        scored.append((d, label))
    return scored

def crag_answer(llm, query, retriever, web_search):
    docs = retriever.retrieve(query, k=10)
    graded = grade_passages(llm, query, docs)
    correct = [d for d, g in graded if g == "correct"]
    if len(correct) >= 2:
        return generate(llm, query, correct)
    # Mostly incorrect: fall back to web search.
    incorrect = sum(1 for _, g in graded if g == "incorrect")
    web_docs = web_search(query) if incorrect >= 5 else []
    context = correct + web_docs
    if not context:
        return "No verified source found."  # abstain instead of generating from empty context
    return generate(llm, query, context)

public enum Grade { CORRECT, INCORRECT, AMBIGUOUS }

public String cragAnswer(ChatClient grader, String query, Retriever kb, WebSearchClient web) {
  var docs = kb.retrieve(query, 10);
  var graded = docs.stream().map(d -> new GradedDoc(d, grade(grader, query, d))).toList();
  var correct = graded.stream().filter(g -> g.grade() == Grade.CORRECT).map(GradedDoc::doc).toList();
  if (correct.size() >= 2) return generate(grader, query, correct);
  long incorrect = graded.stream().filter(g -> g.grade() == Grade.INCORRECT).count();
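  var context = new ArrayList<>(correct);
  if (incorrect >= 5) context.addAll(web.search(query));
  if (context.isEmpty()) return "No verified source found.";  // abstain rather than generate from nothing
  return generate(grader, query, context);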
}

🔒 Security

Web search fallback exfiltrates user queries to third parties and may return untrusted content. Sanitize queries, allowlist domains, and mark web-sourced answers clearly in UI.

📦 Real World

Developer doc assistants use CRAG internally: if SDK docs score “incorrect” (old API version), fall back to live docs crawl + cache—reducing “answer from deprecated chunk” incidents.

💡 Pro Tip

Cache web fallback results with TTL per query hash. Repeated questions shouldn’t hit search API every time.
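A minimal in-process version of such a cache is sketched below. The class name, the normalization rule, and the injectable `clock` are assumptions; a production setup would more likely use a shared store such as Redis with native key expiry.

```python
import hashlib
import time
from typing import Callable

class TtlSearchCache:
    """Cache web search results per normalized query hash, expiring after ttl_seconds."""

    def __init__(self, search: Callable[[str], list[str]], ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic):
        self._search = search
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, list[str]]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Lowercase and collapse whitespace so trivially different queries share an entry.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, query: str) -> list[str]:
        key, now = self._key(query), self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]                      # fresh hit: no search API call
        results = self._search(query)            # miss or expired: call the search API
        self._store[key] = (now, results)
        return results
```

Hashing the normalized query also keeps raw user text out of cache keys and logs.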

Long-context RAG: stuff vs retrieve

Models with 128K–1M token windows tempt teams to skip retrieval and paste entire knowledge bases. Long-context RAG hybridizes: retrieve candidates, then stuff many chunks—or whole docs—when window budget allows. Trade-offs: cost, latency, lost-in-the-middle, freshness.

Claude 3.5, Gemini 1.5, and GPT-4o-class models advertise huge context windows. For a 2,000-page handbook (~3M tokens), you still cannot stuff everything—but for a single product’s docs (<200K tokens), full-doc stuffing plus a smart reranker sometimes beats naive top-5 chunk RAG on synthesis tasks.

When stuffing wins

Corpus fits comfortably in context after compression (with 20–40% output reserve).
Tasks need cross-document references spread across many sections.
Retrieval repeatedly misses because queries use non-standard phrasing.
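The first condition is simple arithmetic. This sketch uses integer math to avoid floating-point rounding; the function names are illustrative and the 30% default is just the midpoint of the 20–40% reserve mentioned above.

```python
def input_budget(window_tokens: int, reserve_pct: int = 30) -> int:
    """Tokens available for input after reserving a share of the window for output."""
    if not 0 <= reserve_pct < 100:
        raise ValueError("reserve_pct must be in [0, 100)")
    return window_tokens * (100 - reserve_pct) // 100

def can_stuff(corpus_tokens: int, window_tokens: int, reserve_pct: int = 30) -> bool:
    """True when the whole corpus fits inside the input budget."""
    return corpus_tokens <= input_budget(window_tokens, reserve_pct)
```

With a 128,000-token window and a 30% reserve the input budget is 89,600 tokens, so an 80,000-token doc set fits while a 100,000-token one does not.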

When retrieval still wins

Corpus >> context window (millions of chunks).
Per-query cost must stay low—100K input tokens × high QPS is expensive.
Strict ACL filtering—only inject authorized subsets per user.
Freshness: index updates faster than resending full docs every query.

Approach	Input tokens / query	Latency	Risk
Top-5 chunk RAG	~2K–8K	Low	Missed context
Retrieve 50 + rerank 10	~10K–25K	Medium	Still boundary errors
Stuff full doc set	50K–128K+	High	Lost-in-the-middle, cost
RAG + long-context merge	Variable	Medium–high	Complexity

Lost-in-the-middle

Models attend unevenly to very long prompts—facts buried in the middle of 80K tokens get ignored more often than content at the start or end. Mitigations: put highest-scored chunks at edges, use map-reduce summarization, or hierarchical RAPTOR/GraphRAG instead of one giant paste.

def assemble_long_context(query: str, retriever, max_tokens: int = 100_000) -> str:
    candidates = retriever.retrieve(query, k=200)
    ranked = rerank_cross_encoder(query, candidates)[:80]
    # Spend the token budget in rank order first, so truncation drops the weakest chunks.
    kept, used = [], 0
    for doc in ranked:
        t = count_tokens(doc["text"])
        if used + t > max_tokens:
            break
        kept.append(doc)
        used += t
    # Then place the best chunks at the start and end (lost-in-the-middle mitigation).
    head, tail, middle = kept[:20], kept[20:40], kept[40:]
    ordered = head + middle + tail
    return "\n\n---\n\n".join(doc["text"] for doc in ordered)

// reranker and tokenizer are assumed to be fields of the enclosing class
public String assembleLongContext(String query, Retriever retriever, int maxTokens) {
  List<Document> candidates = retriever.retrieve(query, 200);
  List<Document> ranked = reranker.rerank(query, candidates, 80);
  // Spend the token budget in rank order so truncation drops the weakest chunks
  var kept = new ArrayList<Document>();
  int used = 0;
  for (var doc : ranked) {
    int t = tokenizer.count(doc.text());
    if (used + t > maxTokens) break;
    kept.add(doc);
    used += t;
  }
  // Best chunks at the start and end, weakest in the middle
  var head = kept.subList(0, Math.min(20, kept.size()));
  var tail = kept.subList(Math.min(20, kept.size()), Math.min(40, kept.size()));
  var ordered = new ArrayList<>(head);
  if (kept.size() > 40) ordered.addAll(kept.subList(40, kept.size()));
  ordered.addAll(tail);
  return ordered.stream().map(Document::text).collect(Collectors.joining("\n\n---\n\n"));
}

💰 Cost

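128K input tokens at $3 per 1M tokens comes to roughly $0.38 per request before output (about $0.39 if 128K means 131,072 tokens). At 10K QPS that is several thousand dollars per second, which is unsustainable; long-context RAG is a precision tool for premium tiers, not the default path.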
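The arithmetic is worth keeping as a small helper so it can be rerun when prices or prompt sizes change. The $3 per 1M input tokens figure comes from the text above; the function names are illustrative.

```python
def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Input-token cost of one request."""
    return tokens * usd_per_million / 1_000_000

def spend_per_second_usd(tokens: int, usd_per_million: float, qps: float) -> float:
    """Input-token spend per second at a sustained request rate."""
    return input_cost_usd(tokens, usd_per_million) * qps

per_request = input_cost_usd(128_000, 3.0)                  # about 0.384 USD
per_second = spend_per_second_usd(128_000, 3.0, 10_000)     # about 3,840 USD per second
```

Output tokens and any prompt-caching discount are not included, so this is an upper-bound estimate for input spend only.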

⚖️ Trade-off

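Stuffing simplifies architecture (no vector DB) but gives up incremental updates: a vector index re-embeds only the changed chunks, whereas any doc change alters the stuffed prompt, so the full corpus is resent and cached prefixes are invalidated unless you version the prompt bundle carefully.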

🔬 Under the Hood

Some providers use sparse attention or prompt caching on long prefixes—identical doc bundles across users benefit from KV cache hits when doc content is a stable prefix.

🎯 Interview Tip

“Would you use a 1M context model instead of RAG?” — No by default: cost, ACL, freshness. Yes for bounded corpora or as second pass after retrieval expands candidate set.

Cache-augmented RAG: prompt caching for static KB

Cache-augmented generation (CAG) preloads stable knowledge into the model’s prompt cache (provider KV reuse) instead of retrieving dynamically every query—ideal when the KB changes rarely and fits cache-eligible prefix size.

Anthropic and OpenAI cache attention states for long identical prefixes. If your product’s “knowledge bundle” (policy PDFs, API reference) is static for hours or days, load it once as a cacheable system block; per-user queries append as suffix-only tokens at reduced input cost on hits.

CAG vs RAG

Dimension	RAG (dynamic retrieve)	CAG (cached prefix)
Update latency	Minutes (re-embed chunks)	Cache TTL + prefix rebuild
Query cost	Embed + small context	Large prefix once, cheap suffix
Best corpus size	Unbounded	Fits cacheable window (~100K tokens practical)
Personalization	Metadata filters per user	Harder—need per-tenant cache keys

Hybrid pattern

Cache stable core docs (brand guidelines, core API) as prefix; RAG retrieve tenant-specific or fresh content as suffix. Maximizes cache hit rate while keeping personalization.
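A sketch of the hybrid message layout, using the Anthropic-style `cache_control` content block as the assumed request format. The function name and the wording of the suffix are illustrative; only the stable block is marked cacheable.

```python
def build_hybrid_messages(core_kb: str, retrieved: list[str], user_query: str) -> list[dict]:
    """Stable KB goes first as a cacheable block; per-request content follows as the suffix."""
    suffix_context = "\n\n".join(retrieved)
    return [
        {
            "role": "user",
            "content": [
                # Identical bytes across requests, so the provider can reuse the cached prefix.
                {"type": "text", "text": core_kb, "cache_control": {"type": "ephemeral"}},
                # Changes on every request, so it must come after the cached block.
                {
                    "type": "text",
                    "text": f"Tenant-specific context:\n{suffix_context}\n\nUser question: {user_query}",
                },
            ],
        }
    ]
```

Order matters: anything that varies per tenant or per query has to sit after the cached block, because a change earlier in the prompt breaks the prefix match.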

SYSTEM_KB = open("kb/core_policies.md").read()  # stable, versioned

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": SYSTEM_KB,
                "cache_control": {"type": "ephemeral"},
            },
            {"type": "text", "text": f"User question: {user_query}"},
        ],
    }
]
# First request populates cache; subsequent identical prefix → discounted input tokens

var systemKb = Files.readString(Path.of("kb/core_policies.md"));

var request = ChatRequest.builder()
    .messages(List.of(
        Message.builder()
            .role(Role.USER)
            .content(List.of(
                ContentBlock.text(systemKb).cacheControl(CacheControl.ephemeral()),
                ContentBlock.text("User question: " + userQuery)
            ))
            .build()))
    .build();

💡 Pro Tip

Version cache keys: kb_v{hash} in telemetry. When cache hit rate drops after deploy, you likely changed prefix bytes silently.
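One way to derive such a key is to hash the exact prefix bytes that are sent to the model. The function name and the 12-character truncation are assumptions for this sketch.

```python
import hashlib

def kb_cache_key(kb_text: str, length: int = 12) -> str:
    """Return a short, stable identifier for the exact KB prefix content."""
    digest = hashlib.sha256(kb_text.encode("utf-8")).hexdigest()
    return f"kb_v{digest[:length]}"
```

Because the hash covers every byte, even a whitespace-only change to the KB file yields a new key, which makes silent prefix changes visible in telemetry next to the cache hit rate.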

📦 Real World

Support bots with 40-page static playbooks use CAG on Anthropic for sub-100ms TTFT after warm cache—dynamic RAG only for account-specific ticket history.

🔒 Security

Shared cache prefixes must not contain tenant A’s data served to tenant B. One cache block per security boundary.

Production: pattern selection matrix

Pick patterns from eval evidence, not hype. This matrix maps common product shapes to recommended architectures and links forward to ingestion pipelines—because advanced retrieval fails on bad indexes.

Pattern selection matrix

Product shape	Recommended pattern	Avoid
FAQ <500 pages	Advanced modular (hybrid + rerank)	GraphRAG, agent loops
Multi-hop internal wiki	Modular + query decomposition	Naive top-3 only
Executive thematic Q&A	GraphRAG global communities	Leaf-only chunk RAG
Regulated citations required	Self-RAG + CRAG grading	Stuff-only long context
Stale KB + live web	CRAG + web fallback	KB-only blind trust
Stable handbook <100K tokens	CAG prefix + optional RAG suffix	Full re-retrieve every turn
Research copilot	Agentic RAG multi-tool	Single-shot retrieve

Decision checklist

Baseline naive RAG eval logged (faithfulness, answer relevance, latency p95).
Pattern change tied to A/B or shadow traffic—not architecture for its own sake.
Max LLM calls per user query capped (budget + iteration limit).
Fallback path defined: abstain, human handoff, or web with disclaimers.
Ingestion freshness SLO documented (see next guide).

📦 Production checklist

Advanced RAG is production-ready when you can name which pattern each product surface uses, what it costs per query, and which eval metric regressed last time you changed chunk size or embedding model.

🎯 Interview Tip

“Design RAG for our company handbook + tickets.” — Modular hybrid baseline, CRAG for stale KB, optional agent for analytics questions, GraphRAG only if evals show global query gap.