
Advanced RAG patterns

Naive retrieve-then-generate works for demos. Production knowledge systems need architecture choices: modular pipelines, graph-aware retrieval, self-critique loops, corrective fallbacks, and cache-augmented generation. This guide maps the pattern landscape—from GraphRAG community summaries to Self-RAG reflection tokens—so you pick complexity only where evals prove ROI.

After reading, you should be able to: compare naive, advanced, modular, and agentic RAG; explain Microsoft GraphRAG indexing and when graph structure beats flat chunks; implement Self-RAG retrieve/no-retrieve and critique steps; describe RAPTOR hierarchical retrieval; wire Corrective RAG with web search fallback; reason about long-context vs retrieval trade-offs; apply prompt caching to static knowledge prefixes; and select patterns from a production matrix tied to latency, cost, and accuracy SLOs.


Architecture variants: naive, advanced, modular, agentic

Naive RAG embeds chunks, retrieves top-k, stuffs context, generates once. Advanced RAG adds query transformation, hybrid search, reranking, and compression. Modular RAG composes swappable stages (index, retrieve, post-process, generate). Agentic RAG lets an LLM decide when and how to retrieve across multiple tools and iterations.

Most teams start with naive RAG because it ships in a weekend: chunk documents, embed, store in pgvector or Pinecone, call similarity_search(query, k=5), concatenate, prompt GPT-4o. That baseline fails on multi-hop questions, tables spread across pages, and queries that need synthesis across ten sources—not one paragraph.

Naive RAG

Flow: query → embed → ANN search → top-k chunks → LLM. Strengths: simple, debuggable, cheap at small scale. Weaknesses: no query understanding, no rerank, brittle to chunk boundaries, hallucinates when retrieval misses.

| Stage | Typical implementation | Failure mode |
| --- | --- | --- |
| Chunking | Fixed 512 tokens, 10% overlap | Splits tables, code blocks, legal clauses |
| Retrieval | Cosine on single embedding model | Lexical mismatch ("401k" vs "retirement plan") |
| Generation | Single-shot answer | Confident wrong when context empty or irrelevant |
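
The naive flow above can be sketched without any external service. This is a minimal illustration, assuming a toy bag-of-words `embed` function in place of a real embedding model and brute-force cosine search in place of an ANN index; `embed`, `top_k`, and `build_prompt` are hypothetical names, not part of any library.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 5) -> list[str]:
    """Brute-force similarity search; a vector store would use ANN instead."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, chunks: list[str], k: int = 5) -> str:
    """Stuff the top-k chunks into a single-shot prompt."""
    context = "\n\n".join(top_k(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


chunks = [
    "Contractors accrue no paid time off.",
    "The billing API is owned by the payments team.",
    "Employees accrue paid time off monthly.",
]
hits = top_k("paid time off for contractors", chunks, k=2)
```

Because the toy embedding matches on exact words only, it also reproduces the lexical-mismatch failure from the table: a query phrased as "PTO" would score zero against every chunk here.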

Advanced RAG

Adds pre-retrieval (HyDE, multi-query, step-back prompting), hybrid BM25 + dense, cross-encoder rerank, and context compression (LLMLingua, selective sentence extraction). Latency rises 200–800ms; accuracy on hard evals often jumps 15–40 points versus naive on the same corpus.
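
One common way to combine a BM25 ranking with a dense ranking is reciprocal rank fusion (RRF). The source does not say which fusion method to use, so treat RRF here as an assumption; the constant `k=60` is a conventional default, and the document IDs are made up for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["doc_a", "doc_b", "doc_c"]    # lexical retriever output (hypothetical)
dense_ranking = ["doc_c", "doc_a", "doc_d"]   # embedding retriever output (hypothetical)
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

RRF only needs ranks, not comparable scores, which is why it is a convenient first choice before tuning weighted score blends. The fused list would then go to the cross-encoder reranker.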

Modular RAG

Treat each stage as an interface: Retriever, Reranker, ContextBuilder, Generator. Swap hybrid for ColBERT without rewriting generation. LangChain LCEL and LlamaIndex query pipelines are modular RAG in practice—production teams wrap them in their own config-driven DAG.

Agentic RAG

An agent loop: plan → retrieve from KB / web / SQL → read → decide if enough evidence → maybe retrieve again → answer. Best for open-ended research, multi-document synthesis, and tools beyond vectors. Costs more tokens and needs guardrails (max iterations, budget caps) to prevent runaway loops.
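
The loop and its guardrails can be sketched as follows. This is an illustration under stated assumptions: `plan`, `enough`, and `answer` stand in for LLM calls, the tools are stubs, and the word-count token estimate is deliberately crude. `Budget` and `agentic_answer` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Budget:
    max_iterations: int = 4      # hard cap on retrieve/read cycles
    max_tokens: int = 20_000     # rough cap on evidence tokens read
    tokens_used: int = 0

    def charge(self, text: str) -> None:
        # Whitespace word count as a crude token estimate (assumption).
        self.tokens_used += len(text.split())


def agentic_answer(
    query: str,
    tools: dict[str, Callable[[str], list[str]]],
    plan: Callable[[str, list[str]], tuple[str, str]],
    enough: Callable[[str, list[str]], bool],
    answer: Callable[[str, list[str]], str],
    budget: Budget,
) -> dict:
    evidence: list[str] = []
    steps = 0
    stopped = "max_iterations"
    while steps < budget.max_iterations:
        tool_name, tool_query = plan(query, evidence)   # plan
        results = tools[tool_name](tool_query)          # retrieve
        steps += 1
        for r in results:                               # read
            budget.charge(r)
        evidence.extend(results)
        if enough(query, evidence):                     # decide
            stopped = "enough_evidence"
            break
        if budget.tokens_used >= budget.max_tokens:     # budget cap
            stopped = "token_budget"
            break
    return {"answer": answer(query, evidence), "steps": steps, "stopped": stopped}


tools = {
    "kb": lambda q: ["KB: refunds are processed within 5 business days"],
    "sql": lambda q: ["SQL: 42 refund tickets are currently open"],
}
result = agentic_answer(
    "refund status",
    tools,
    plan=lambda q, ev: ("kb", q) if not ev else ("sql", q),
    enough=lambda q, ev: len(ev) >= 2,
    answer=lambda q, ev: " | ".join(ev),
    budget=Budget(),
)
```

Returning the stop reason alongside the answer makes it easy to alert on queries that hit a cap instead of converging.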

Architecture comparison at a glance

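| Dimension | Naive | Advanced | Modular | Agentic |
| --- | --- | --- | --- | --- |
| LLM calls / query | 1 | 1–2 | 1–3 | 3–15 |
| Latency p95 | Lowest | Medium | Medium | Highest |
| Engineering effort | Days | Weeks | Weeks | Months |
| Debuggability | High | Medium | High (stage logs) | Low without tracing |
| Multi-hop queries | Poor | Fair | Good | Best |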

Evolution path most teams follow: ship naive with metrics → add hybrid + rerank (advanced) → refactor into modular interfaces when swapping embed models or stores → introduce agentic only for a premium tier or internal research mode after evals prove chunk RAG plateaus on your hardest question set.

Modular RAG stage registry
from dataclasses import dataclass
from typing import Callable, Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> list[dict]: ...

class Reranker(Protocol):
    def rerank(self, query: str, docs: list[dict], top_n: int) -> list[dict]: ...

@dataclass
class RagPipeline:
    retriever: Retriever
    reranker: Reranker | None
    generator: Callable[[str, str], str]  # (query, context) -> answer

    def answer(self, query: str, k: int = 20, top_n: int = 5) -> str:
        hits = self.retriever.retrieve(query, k=k)
        if self.reranker:
            hits = self.reranker.rerank(query, hits, top_n=top_n)
        else:
            hits = hits[:top_n]
        context = "\n\n".join(h["text"] for h in hits)
        return self.generator(query, context)

import java.util.List;
import java.util.function.BiFunction;
import java.util.stream.Collectors;

// Document is assumed to be a record exposing a text() accessor.
public interface Retriever {
  List<Document> retrieve(String query, int k);
}

public interface Reranker {
  List<Document> rerank(String query, List<Document> docs, int topN);
}

public record RagPipeline(Retriever retriever, Reranker reranker,
                          BiFunction<String, String, String> generator) {
  public String answer(String query, int k, int topN) {
    List<Document> hits = retriever.retrieve(query, k);
    // The reranker is optional; without one, fall back to plain truncation.
    hits = (reranker != null)
        ? reranker.rerank(query, hits, topN)
        : hits.subList(0, Math.min(topN, hits.size()));
    var context = hits.stream().map(Document::text).collect(Collectors.joining("\n\n"));
    return generator.apply(query, context);
  }
}
💡 Pro Tip

Ship naive RAG with logging first. Measure empty-hit rate, MRR@k, and faithfulness before adding GraphRAG or agents—otherwise you optimize architecture without knowing the baseline failure mode.
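
Two of these baseline metrics need no LLM judge and can be computed directly from retrieval logs. A minimal sketch, assuming each logged query has a ranked list of retrieved document IDs and a set of known-relevant IDs (the sample data is made up); faithfulness is not covered here because it needs a grader.

```python
def mrr_at_k(results: list[list[str]], relevant: list[set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant hit within the top k."""
    total = 0.0
    for ranked, gold in zip(results, relevant):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break
    return total / len(results) if results else 0.0


def empty_hit_rate(results: list[list[str]]) -> float:
    """Fraction of queries for which retrieval returned nothing."""
    return sum(1 for r in results if not r) / len(results) if results else 0.0


logged_results = [["a", "b"], ["c", "d"], []]   # retrieved IDs per query
logged_relevant = [{"b"}, {"c"}, {"x"}]         # labeled relevant IDs per query
mrr = mrr_at_k(logged_results, logged_relevant, k=10)
empty_rate = empty_hit_rate(logged_results)
```

Here the first relevant hits sit at ranks 2 and 1, and the third query has none, so MRR is (1/2 + 1 + 0) / 3.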

⚖️ Trade-off

Agentic RAG improves hard questions but adds 3–10× token cost and non-deterministic latency. Gate it behind a router: easy FAQ stays on advanced modular path; only escalate when classifier confidence is low.
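
A router gate of this kind can be very small. The sketch below assumes a hypothetical `classify` callable that returns an intent label and a confidence score; the `"faq"` label, the tier names, and the 0.7 threshold are illustrative placeholders to tune against your own evals.

```python
from typing import Callable


def route(query: str, classify: Callable[[str], tuple[str, float]], threshold: float = 0.7) -> str:
    """Return the pipeline tier for a query based on classifier confidence."""
    label, confidence = classify(query)
    if label == "faq" and confidence >= threshold:
        return "advanced_modular"   # cheap path: hybrid + rerank
    return "agentic"                # escalate only when the cheap path is a poor fit
```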

📦 Real World

Internal copilots at large SaaS vendors often run two tiers: Tier A is hybrid + rerank for 90% of queries; Tier B is LangGraph agent with SQL + KB tools for power users who opt in to slower, deeper answers.

🎯 Interview Tip

“Compare RAG architectures.” — Start with naive limitations, then advanced (hybrid + rerank), modular (swappable stages), agentic (multi-step tool use). Mention eval-driven selection, not buzzword stacking.

GraphRAG: entities, communities, and global questions

Microsoft GraphRAG builds a knowledge graph from your corpus: extract entities and relationships, cluster into communities, summarize each community, and retrieve at community or entity level for questions that need global understanding—not just the nearest chunk.

Flat chunk RAG excels at “What is our PTO policy for contractors?” It struggles at “What are the main themes across 10,000 support tickets this quarter?” or “How do product areas relate to recurring escalations?” GraphRAG precomputes structure so global queries hit community summaries instead of random fragments.

Indexing pipeline

  1. Chunk & extract — LLM extracts entities (people, products, concepts) and edges (relates_to, owns, blocks) per chunk.
  2. Graph merge — Deduplicate entities across chunks; build adjacency lists or property graph.
  3. Community detection — Leiden/Louvain clustering on the graph; each community is a thematic cluster.
  4. Community reports — LLM summarizes each community (executive-level narrative + key entities).
  5. Dual index — Embed raw chunks (local) + community reports (global) in vector store.

| Query type | Retrieval target | Example |
| --- | --- | --- |
| Local | Entity-linked chunks + neighborhood | “Who owns the billing API?” |
| Global | Top community reports | “Top risks mentioned in Q3 incident docs?” |
| Hybrid | Both paths merged + rerank | “How does Team X’s work affect compliance theme Y?” |
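
Steps 2 and 3 of the indexing pipeline (graph merge, then grouping entities into communities) can be sketched with the standard library. Note the swap: this uses connected components as a crude stand-in for Leiden/Louvain, which would further split large components by density. Entity names and edges are made up, and deduplication here is only lowercase normalization.

```python
from collections import defaultdict, deque


def build_adjacency(edges: list[tuple[str, str]]) -> dict[str, set[str]]:
    """Merge extracted edges into an undirected adjacency map, deduping by normalized name."""
    adj: dict[str, set[str]] = defaultdict(set)
    for source, target in edges:
        s, t = source.strip().lower(), target.strip().lower()
        adj[s].add(t)
        adj[t].add(s)
    return adj


def connected_components(adj: dict[str, set[str]]) -> list[list[str]]:
    """Group entities reachable from each other (stand-in for community detection)."""
    seen: set[str] = set()
    communities: list[list[str]] = []
    for start in sorted(adj):
        if start in seen:
            continue
        seen.add(start)
        queue, members = deque([start]), []
        while queue:
            node = queue.popleft()
            members.append(node)
            for neighbor in sorted(adj[node]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        communities.append(sorted(members))
    return communities


edges = [
    ("Billing API", "Payments Team"),
    ("payments team", "Invoice Service"),
    ("Auth Service", "SSO"),
]
communities = connected_components(build_adjacency(edges))
```

Each resulting member list would be the input to one community report in step 4.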

When GraphRAG wins

  • Corpus is large, interconnected, and narrative (research, legal, incident postmortems, news archives).
  • Users ask thematic / aggregate questions, not only fact lookup.
  • You can afford offline indexing cost (multiple LLM passes per document at ingest).

When to skip

  • Small FAQ (<500 pages) where hybrid search + rerank is enough.
  • Highly structured data better served by SQL or OLAP, not LLM entity extraction.
  • Strict freshness requirements—graph rebuild lag may be unacceptable without incremental graph updates.
Entity extraction sketch (GraphRAG-style)
import json

EXTRACTION_PROMPT = '''Extract entities and relationships as JSON.
Entities: {{name, type}}. Relationships: {{source, target, type, description}}.
Text: {chunk}'''

def extract_graph_elements(llm, chunk: str) -> dict:
    resp = llm.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": EXTRACTION_PROMPT.format(chunk=chunk[:6000])}],
    )
    return json.loads(resp.choices[0].message.content)

def merge_entities(store: dict, batch: dict) -> None:
    for ent in batch.get("entities", []):
        key = (ent["name"].lower(), ent["type"])
        store.setdefault(key, ent)
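
import java.util.List;

public record Entity(String name, String type) {}
public record Edge(String source, String target, String relType) {}
public record GraphBatch(List<Entity> entities, List<Edge> edges) {}

public class GraphExtractor {
  private final ChatClient client;

  public GraphExtractor(ChatClient client) {
    this.client = client;
  }

  public GraphBatch extract(String chunk) {
    // Truncate long chunks so the prompt stays within budget.
    var prompt = "Extract entities and relationships as JSON. Text: " + chunk.substring(0, Math.min(6000, chunk.length()));
    var json = client.prompt().user(prompt).call().content();
    // parseGraphBatch is an assumed helper that maps the JSON onto GraphBatch.
    return parseGraphBatch(json);
  }
}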
🔬 Under the Hood

Community detection (Leiden algorithm) optimizes modularity—densely connected entity groups become communities. Summaries at community level are essentially precomputed map-reduce over your corpus, paid at index time instead of query time.
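
For reference, the standard modularity score for a partition of an undirected graph is shown below. Leiden implementations often optimize a variant (for example with a resolution parameter), so take this as the textbook form rather than the exact objective of any particular GraphRAG build.

```latex
Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
```

Here $A_{ij}$ is the adjacency matrix entry (edge weight) between entities $i$ and $j$, $k_i$ is the degree of entity $i$, $m$ is the total number of edges, and $\delta(c_i, c_j)$ is 1 when both entities are assigned to the same community and 0 otherwise. A higher $Q$ means more edges fall inside communities than a random graph with the same degrees would predict.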

💰 Cost

GraphRAG indexing can cost 10–50× naive embed-only ingest: entity extraction + community reports are LLM-heavy. Amortize over query volume; run full rebuilds weekly, incremental entity merge daily.

⚠️ Pitfall

Entity extraction hallucinates edges on noisy PDFs. Validate graph quality with spot checks and prune low-confidence relationships before community detection—or themes will be garbage.

🔒 Security

Graph structure can leak cross-tenant relationships if you merge graphs globally. Partition graphs by tenant or sensitivity label; never embed community reports that mix restricted and public entities.

Self-RAG: retrieve, reflect, and critique

Self-RAG trains or prompts models to emit reflection tokens that decide whether to retrieve, judge passage relevance, and critique whether the answer is supported, contradicted, or lacking evidence. This tightens faithfulness without a separate judge model on every path.

Standard RAG always retrieves—even when the model already knows the answer or when retrieval adds noise. Self-RAG inserts a retrieval decision step and a generation critique step using special tokens or structured outputs the model was fine-tuned (or prompted) to produce.

Reflection token flow

  1. [Retrieve] vs [No retrieve] — model chooses whether KB lookup helps.
  2. If retrieve: fetch passages; score [Relevant] / [Irrelevant] per passage.
  3. Generate candidate answer conditioned on relevant passages only.
  4. [Supported] / [Contradicted] / [No evidence] — self-critique vs sources.
  5. If contradicted or weak: retrieve again with rewritten query or abstain.

| Token / signal | Meaning | Action |
| --- | --- | --- |
| Retrieve | External KB likely helps | Run vector + hybrid search |
| No retrieve | Parametric knowledge sufficient | Skip retrieval; still log for audit |
| Relevant / Irrelevant | Passage quality | Drop irrelevant before context assembly |
| Supported | Answer grounded in passages | Return to user |
| Contradicted | Answer conflicts with sources | Regenerate or escalate |

Self-RAG-style prompted loop
import json

# Single braces: this prompt is concatenated, not passed through str.format.
SELF_RAG_PROMPT = '''Step 1: Output JSON {"retrieve": true|false, "reason": "..."}.
If retrieve, I will provide passages. Then output {"relevance": [...], "answer": "...", "critique": "supported|contradicted|no_evidence"}.'''

# call_json, call_text, and format_passages are assumed helpers around your LLM client.
def self_rag_answer(llm, query: str, retriever) -> dict:
    plan = json.loads(call_json(llm, SELF_RAG_PROMPT + "\nQuery: " + query))
    if not plan["retrieve"]:
        # Model judged parametric knowledge sufficient; skip retrieval entirely.
        return {"answer": call_text(llm, query), "retrieved": False}
    passages = retriever.retrieve(query, k=8)
    critique = json.loads(call_json(llm, format_passages(query, passages)))
    if critique["critique"] == "contradicted":
        # One retry: expand the query with the start of the draft answer.
        passages = retriever.retrieve(query + " " + critique["answer"][:80], k=8)
        critique = json.loads(call_json(llm, format_passages(query, passages)))
    return {"answer": critique["answer"], "critique": critique["critique"], "retrieved": True}

public record SelfRagPlan(boolean retrieve, String reason) {}
public record SelfRagResult(String answer, String critique, boolean retrieved) {}

public SelfRagResult selfRag(ChatClient client, String query, Retriever retriever) {
  var plan = parsePlan(client.prompt().user(SELF_RAG_PROMPT + query).call().content());
  if (!plan.retrieve()) {
    return new SelfRagResult(client.prompt().user(query).call().content(), "no_retrieve", false);
  }
  var passages = retriever.retrieve(query, 8);
  var critique = parseCritique(client, query, passages);
  if ("contradicted".equals(critique.critique())) {
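    // Guard the substring: answers shorter than 80 chars would otherwise throw.
    var draft = critique.answer();
    passages = retriever.retrieve(query + " " + draft.substring(0, Math.min(80, draft.length())), 8);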
    critique = parseCritique(client, query, passages);
  }
  return new SelfRagResult(critique.answer(), critique.critique(), true);
}
⚖️ Trade-off

Fine-tuned Self-RAG models need training data with labeled reflection tokens. Prompt-only approximations work but add 2–3 LLM calls per query—justify with faithfulness SLOs on regulated content.

💡 Pro Tip

Log retrieve vs no-retrieve decisions. If >80% are “no retrieve” on a KB product, your users may not need RAG at all—or queries are too generic for your chunk index.

🎯 Interview Tip

“How do you reduce hallucination in RAG?” — Mention Self-RAG critique tokens, citation-required prompts, and reranking—plus offline evals (RAGAS faithfulness), not only prompting.

RAPTOR: recursive clustering and tree retrieval

RAPTOR (Recursive Abstractive Processing for Tree Organized Retrieval) clusters chunks, summarizes clusters recursively, and builds a tree—retrieve at leaf level for specifics or higher levels for broader context spanning many documents.

Fixed-size chunks lose hierarchy: a product spec’s “Security” section may scatter across forty chunks. RAPTOR bottom-up clusters semantically similar chunks, summarizes each cluster, clusters summaries again, until a root summary represents the whole corpus subtree—enabling multi-level retrieval.

Build phase

  1. Embed all leaf chunks.
  2. Cluster with GMM or k-means on embeddings (soft assignment optional).
  3. Summarize each cluster with an abstractive LLM call → parent nodes.
  4. Repeat on parent embeddings until tree depth cap or single root.
  5. Index all tree nodes (leaves + summaries) in vector store with level metadata.

Query phase

Retrieve from multiple levels: e.g. top-3 summary nodes + top-5 leaf nodes, dedupe, rerank, generate. Questions needing wide context (“Summarize onboarding docs”) hit high-level nodes; detail questions hit leaves.

| Level | Content | Best for |
| --- | --- | --- |
| 0 (leaves) | Original chunks | Exact policy clauses, API params |
| 1–2 | Section summaries | Multi-chunk reasoning within a topic |
| Root-adjacent | Corpus-wide abstractive summaries | Overview, onboarding, exec briefings |
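
The query phase described above (a few summary nodes plus a few leaves, deduplicated) can be sketched as below. It assumes nodes are plain dicts with `id`, `level`, `text`, and `embedding` keys and uses brute-force cosine scoring; the two-dimensional embeddings are toy values, and reranking is left out.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_multilevel(query_emb: list[float], nodes: list[dict],
                        n_summary: int = 3, n_leaf: int = 5) -> list[dict]:
    """Take the best summary nodes (level > 0) and best leaves (level 0), then dedupe by text."""
    scored = sorted(nodes, key=lambda n: cosine(query_emb, n["embedding"]), reverse=True)
    summaries = [n for n in scored if n["level"] > 0][:n_summary]
    leaves = [n for n in scored if n["level"] == 0][:n_leaf]
    seen, merged = set(), []
    for node in summaries + leaves:
        if node["text"] not in seen:
            seen.add(node["text"])
            merged.append(node)
    return merged


nodes = [
    {"id": "leaf-1", "level": 0, "text": "VPN setup steps", "embedding": [1.0, 0.0]},
    {"id": "leaf-2", "level": 0, "text": "Laptop return policy", "embedding": [0.0, 1.0]},
    {"id": "L1-0", "level": 1, "text": "Onboarding: accounts, VPN, hardware", "embedding": [0.8, 0.6]},
]
picked = retrieve_multilevel([1.0, 0.2], nodes, n_summary=1, n_leaf=1)
picked_ids = [n["id"] for n in picked]
```

Keeping separate quotas per level means a broad summary cannot crowd out the leaf chunk needed for an exact citation.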
RAPTOR-style tree node upsert
from dataclasses import dataclass

@dataclass
class TreeNode:
    id: str
    text: str
    level: int
    parent_id: str | None
    embedding: list[float]

def build_level(children: list[TreeNode], llm, embed_fn) -> list[TreeNode]:
    """Cluster one level's nodes and summarize each cluster into a parent node."""
    if not children:
        return []
    next_level = children[0].level + 1
    # cluster_embeddings and llm_summarize are assumed helpers; clusters hold child indices.
    clusters = cluster_embeddings([c.embedding for c in children], k=min(8, len(children)))
    parents = []
    for i, cluster in enumerate(clusters):
        summary = llm_summarize(llm, [children[j].text for j in cluster])
        parent = TreeNode(id=f"L{next_level}-{i}", text=summary, level=next_level,
                          parent_id=None, embedding=embed_fn(summary))
        for j in cluster:
            children[j].parent_id = parent.id  # link each child to its summary node
        parents.append(parent)
    return parents

public record TreeNode(String id, String text, int level, String parentId, float[] embedding) {}

// EmbeddingModel is the Spring AI 1.0+ name for the embedding interface (formerly EmbeddingClient).
// clusterByEmbedding is an assumed helper returning one list of nodes per cluster.
public List<TreeNode> buildLevel(List<TreeNode> children, ChatClient llm, EmbeddingModel embed) {
  if (children.isEmpty()) return List.of();
  int nextLevel = children.get(0).level() + 1;
  var clusters = clusterByEmbedding(children, Math.min(8, children.size()));
  var parents = new ArrayList<TreeNode>();
  for (int i = 0; i < clusters.size(); i++) {
    var texts = clusters.get(i).stream().map(TreeNode::text).toList();
    var summary = llm.prompt().user("Summarize:\n" + String.join("\n", texts)).call().content();
    parents.add(new TreeNode("L" + nextLevel + "-" + i, summary, nextLevel, null, embed.embed(summary)));
  }
  return parents;
}
💰 Cost

Each tree level adds summarization LLM calls, so budget for index-time cost on the order of a lightweight GraphRAG build. When documents change, rebuild only the affected subtrees rather than the full tree where possible.

📦 Real World

Handbook and policy assistants use RAPTOR-like hierarchies so “overview” chat and “cite exact clause” chat share one index—level metadata routes retrieval depth.

⚠️ Pitfall

Abstractive summaries drop nuance (numbers, dates, negation). Always include leaf chunks in final context when answer requires precise citations.

Corrective RAG: grade retrieval, then correct

Corrective RAG (CRAG) evaluates retrieved documents with a lightweight grader. Correct passages proceed; ambiguous triggers query rewrite + re-retrieve; incorrect triggers web search or alternate corpus fallback.

Bad retrieval poisons generation—models confidently synthesize wrong answers from irrelevant chunks. CRAG inserts an explicit retrieval quality gate before context assembly, and a correction path when the KB is stale or incomplete.

Document grading

For each retrieved chunk, a grader (small LLM or cross-encoder) outputs:

  • Correct — directly relevant; keep.
  • Incorrect — off-topic or contradictory; discard.
  • Ambiguous — partial match; may need rewrite or more retrieval.

Correction actions

| Aggregate grade | Action | Example tool |
| --- | --- | --- |
| Mostly correct | Generate with filtered chunks | Internal KB only |
| Mixed / ambiguous | Query decomposition + re-retrieve | HyDE, multi-query |
| Mostly incorrect | Web search fallback | Tavily, Bing, Google CSE |
| Still empty | Abstain with explanation | “No verified source found” |
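
The table can be read as a small decision function over per-chunk grades. In this sketch, "mostly" is interpreted as a strict majority, which is an assumption to tune; `choose_action` and the action names are hypothetical.

```python
from collections import Counter


def choose_action(grades: list[str], majority: float = 0.5) -> str:
    """Map per-chunk grades ('correct' | 'incorrect' | 'ambiguous') to a correction action."""
    if not grades:
        return "abstain"                 # nothing retrieved at all
    counts = Counter(grades)
    total = len(grades)
    if counts["correct"] / total > majority:
        return "generate"                # mostly correct: answer from filtered chunks
    if counts["incorrect"] / total > majority:
        return "web_search"              # mostly incorrect: fall back to web or alternate corpus
    return "rewrite_and_retrieve"        # mixed or ambiguous: rewrite the query and retry
```

Separating the decision from the execution keeps the thresholds testable without calling a grader or a search API.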
CRAG grader + web fallback
GRADE_PROMPT = 'Rate relevance: correct|incorrect|ambiguous. Query: {q}\nPassage: {p}'

def grade_passages(llm, query: str, docs: list[dict]) -> list[tuple[dict, str]]:
    scored = []
    for d in docs:
        label = llm.chat.completions.create(
            model="gpt-4o-mini", temperature=0, max_tokens=8,
            messages=[{"role": "user", "content": GRADE_PROMPT.format(q=query, p=d["text"][:1500])}],
        ).choices[0].message.content.strip().lower()
        scored.append((d, label))
    return scored

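def crag_answer(llm, query, retriever, web_search):
    docs = retriever.retrieve(query, k=10)
    graded = grade_passages(llm, query, docs)
    correct = [d for d, g in graded if g == "correct"]
    if len(correct) >= 2:
        return generate(llm, query, correct)
    # Fall back to web search only when the KB results are mostly off-topic.
    incorrect = sum(1 for _, g in graded if g == "incorrect")
    web_docs = web_search(query) if incorrect >= 5 else []
    context = correct + web_docs
    if not context:
        # Abstain instead of generating from an empty context.
        return "No verified source found"
    return generate(llm, query, context)
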
public enum Grade { CORRECT, INCORRECT, AMBIGUOUS }
public record GradedDoc(Document doc, Grade grade) {}

public String cragAnswer(ChatClient grader, String query, Retriever kb, WebSearchClient web) {
  var docs = kb.retrieve(query, 10);
  var graded = docs.stream().map(d -> new GradedDoc(d, grade(grader, query, d))).toList();
  var correct = graded.stream().filter(g -> g.grade() == Grade.CORRECT).map(GradedDoc::doc).toList();
  if (correct.size() >= 2) return generate(grader, query, correct);
  // Fall back to web search only when the KB results are mostly off-topic.
  long incorrect = graded.stream().filter(g -> g.grade() == Grade.INCORRECT).count();
  var context = new ArrayList<>(correct);
  if (incorrect >= 5) context.addAll(web.search(query));
  // Abstain instead of generating from an empty context.
  if (context.isEmpty()) return "No verified source found";
  return generate(grader, query, context);
}
🔒 Security

Web search fallback exfiltrates user queries to third parties and may return untrusted content. Sanitize queries, allowlist domains, and mark web-sourced answers clearly in UI.

📦 Real World

Developer doc assistants use CRAG internally: if SDK docs score “incorrect” (old API version), fall back to live docs crawl + cache—reducing “answer from deprecated chunk” incidents.

💡 Pro Tip

Cache web fallback results with TTL per query hash. Repeated questions shouldn’t hit search API every time.
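
A minimal in-process version of that cache might look like this. It assumes queries are normalized by lowercasing and collapsing whitespace before hashing, and it takes the search function as a parameter; `WebFallbackCache` is a hypothetical name, and a shared store such as Redis would replace the dict in a multi-instance deployment.

```python
import hashlib
import time
from typing import Callable


class WebFallbackCache:
    """TTL cache for web-search fallback results, keyed by a hash of the normalized query."""

    def __init__(self, ttl_seconds: float = 3600, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, list[str]]] = {}

    @staticmethod
    def key(query: str) -> str:
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_fetch(self, query: str, fetch: Callable[[str], list[str]]) -> list[str]:
        cache_key = self.key(query)
        now = self._clock()
        entry = self._entries.get(cache_key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]              # fresh hit: skip the search API
        results = fetch(query)           # miss or expired: call the search API once
        self._entries[cache_key] = (now, results)
        return results
```

The injectable `clock` exists only to make expiry testable without sleeping.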

Long-context RAG: stuff vs retrieve

Models with 128K–1M token windows tempt teams to skip retrieval and paste entire knowledge bases. Long-context RAG hybridizes: retrieve candidates, then stuff many chunks—or whole docs—when window budget allows. Trade-offs: cost, latency, lost-in-the-middle, freshness.

Claude 3.5, Gemini 1.5, and GPT-4o-class models advertise huge context windows. For a 2,000-page handbook (~3M tokens), you still cannot stuff everything—but for a single product’s docs (<200K tokens), full-doc stuffing plus a smart reranker sometimes beats naive top-5 chunk RAG on synthesis tasks.

When stuffing wins

  • Corpus fits comfortably in context after compression (with 20–40% output reserve).
  • Tasks need cross-document references spread across many sections.
  • Retrieval repeatedly misses because queries use non-standard phrasing.

When retrieval still wins

  • Corpus >> context window (millions of chunks).
  • Per-query cost must stay low—100K input tokens × high QPS is expensive.
  • Strict ACL filtering—only inject authorized subsets per user.
  • Freshness: index updates faster than resending full docs every query.

| Approach | Input tokens / query | Latency | Risk |
| --- | --- | --- | --- |
| Top-5 chunk RAG | ~2K–8K | Low | Missed context |
| Retrieve 50 + rerank 10 | ~10K–25K | Medium | Still boundary errors |
| Stuff full doc set | 50K–128K+ | High | Lost-in-the-middle, cost |
| RAG + long-context merge | Variable | Medium–high | Complexity |

Lost-in-the-middle

Models attend unevenly to very long prompts—facts buried in the middle of 80K tokens get ignored more often than content at the start or end. Mitigations: put highest-scored chunks at edges, use map-reduce summarization, or hierarchical RAPTOR/GraphRAG instead of one giant paste.

Long-context merge after retrieval
def assemble_long_context(query: str, retriever, max_tokens: int = 100_000) -> str:
    candidates = retriever.retrieve(query, k=200)
    ranked = rerank_cross_encoder(query, candidates)[:80]
    # Apply the token budget in rank order so the weakest chunks are dropped first.
    kept, used = [], 0
    for doc in ranked:
        t = count_tokens(doc["text"])
        if used + t > max_tokens:
            break
        kept.append(doc)
        used += t
    # Place best chunks at start and end (lost-in-the-middle mitigation):
    # ranks 1-20 open the prompt, ranks 21-40 close it, the rest sit in the middle.
    head, tail, middle = kept[:20], kept[20:40], kept[40:]
    ordered = head + middle + tail
    return "\n\n---\n\n".join(doc["text"] for doc in ordered)

// reranker and tokenizer are assumed to be fields of the enclosing class.
public String assembleLongContext(String query, Retriever retriever, int maxTokens) {
  var candidates = retriever.retrieve(query, 200);
  var reranked = reranker.rerank(query, candidates, 80);
  // Apply the token budget in rank order so the weakest chunks are dropped first.
  var kept = new ArrayList<Document>();
  int used = 0;
  for (var doc : reranked) {
    int t = tokenizer.count(doc.text());
    if (used + t > maxTokens) break;
    kept.add(doc);
    used += t;
  }
  // Best chunks at the edges: ranks 1-20 first, ranks 21-40 last, the rest in the middle.
  int n = kept.size();
  var ordered = new ArrayList<>(kept.subList(0, Math.min(20, n)));
  if (n > 40) ordered.addAll(kept.subList(40, n));
  if (n > 20) ordered.addAll(kept.subList(20, Math.min(40, n)));
  return ordered.stream().map(Document::text).collect(Collectors.joining("\n\n---\n\n"));
}
💰 Cost

128K input tokens at $3 per 1M tokens is about $0.38 per request before output. At 10K QPS that is roughly $3,800 per second, which is unsustainable. Long-context RAG is a precision tool for premium tiers, not the default path.
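
The arithmetic is easy to rerun for your own prices and traffic. The $3 per 1M input tokens rate and the 10K QPS figure come from the text above; 128K is taken as 128,000 tokens, and the function name is illustrative.

```python
def input_cost_usd(input_tokens: int, usd_per_million_tokens: float) -> float:
    """Input-side cost of one request, ignoring output tokens and cache discounts."""
    return input_tokens * usd_per_million_tokens / 1_000_000


per_request = input_cost_usd(128_000, 3.0)        # about 0.384 USD
per_second_at_10k_qps = per_request * 10_000      # about 3,840 USD per second
top5_chunk_request = input_cost_usd(8_000, 3.0)   # about 0.024 USD for an 8K-token RAG prompt
```

At the same price, the 8K-token chunk-RAG prompt is 16 times cheaper per request than the 128K stuffed prompt, simply because input cost scales linearly with tokens.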

⚖️ Trade-off

Stuffing simplifies architecture (no vector DB) but removes the benefit of incremental updates: any doc change alters the stuffed prompt, so you resend and re-version the full corpus instead of re-embedding only the changed chunks.

🔬 Under the Hood

Some providers use sparse attention or prompt caching on long prefixes—identical doc bundles across users benefit from KV cache hits when doc content is a stable prefix.

🎯 Interview Tip

“Would you use a 1M context model instead of RAG?” — No by default: cost, ACL, freshness. Yes for bounded corpora or as second pass after retrieval expands candidate set.

Cache-augmented RAG: prompt caching for static KB

Cache-augmented generation (CAG) preloads stable knowledge into the model’s prompt cache (provider KV reuse) instead of retrieving dynamically every query—ideal when the KB changes rarely and fits cache-eligible prefix size.

Anthropic and OpenAI cache attention states for long identical prefixes. If your product’s “knowledge bundle” (policy PDFs, API reference) is static for hours or days, load it once as a cacheable system block; per-user queries append as suffix-only tokens at reduced input cost on hits.

CAG vs RAG

| Dimension | RAG (dynamic retrieve) | CAG (cached prefix) |
| --- | --- | --- |
| Update latency | Minutes (re-embed chunks) | Cache TTL + prefix rebuild |
| Query cost | Embed + small context | Large prefix once, cheap suffix |
| Best corpus size | Unbounded | Fits cacheable window (~100K tokens practical) |
| Personalization | Metadata filters per user | Harder: need per-tenant cache keys |

Hybrid pattern

Cache stable core docs (brand guidelines, core API) as prefix; RAG retrieve tenant-specific or fresh content as suffix. Maximizes cache hit rate while keeping personalization.
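
The hybrid pattern comes down to message assembly: the stable block goes first and carries the cache marker, and everything that varies per tenant or per query comes after it. This sketch only builds the payload, mirroring the content-block shape used in the prefix example in this section; `build_hybrid_messages` and the suffix wording are assumptions, and no API call is made.

```python
def build_hybrid_messages(core_kb: str, retrieved: list[str], user_query: str) -> list[dict]:
    """Cacheable stable prefix first, then retrieved tenant context and the question as suffix."""
    suffix_context = "\n\n".join(retrieved)
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": core_kb,                         # byte-identical across requests
                    "cache_control": {"type": "ephemeral"},
                },
                {
                    "type": "text",
                    "text": f"Tenant-specific context:\n{suffix_context}\n\nUser question: {user_query}",
                },
            ],
        }
    ]
```

Anything placed before or inside the first block, such as a timestamp or a tenant name, changes the prefix and would defeat the cache, so keep per-request data strictly in the suffix block.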

Anthropic-style cacheable KB prefix
SYSTEM_KB = open("kb/core_policies.md").read()  # stable, versioned

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": SYSTEM_KB,
                "cache_control": {"type": "ephemeral"},
            },
            {"type": "text", "text": f"User question: {user_query}"},
        ],
    }
]
# First request populates cache; subsequent identical prefix → discounted input tokens

// Illustrative builder API: type and method names depend on your client library.
var systemKb = Files.readString(Path.of("kb/core_policies.md"));

var request = ChatRequest.builder()
    .messages(List.of(
        Message.builder()
            .role(Role.USER)
            .content(List.of(
                ContentBlock.text(systemKb).cacheControl(CacheControl.ephemeral()),
                ContentBlock.text("User question: " + userQuery)
            ))
            .build()))
    .build();
💡 Pro Tip

Version cache keys: log kb_v{hash} in telemetry. When cache hit rate drops after a deploy, you likely changed prefix bytes silently.
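
One simple way to derive such a key is to hash the exact prefix text that is sent. A minimal sketch, assuming SHA-256 and a 12-character truncation (both arbitrary choices); `kb_cache_version` is a hypothetical helper.

```python
import hashlib


def kb_cache_version(prefix_text: str) -> str:
    """Derive a telemetry label from the exact prefix bytes sent to the provider."""
    digest = hashlib.sha256(prefix_text.encode("utf-8")).hexdigest()
    return f"kb_v{digest[:12]}"
```

Because the hash covers every byte, even a trailing space or a changed line ending produces a new version label, which is exactly the kind of silent change that lowers hit rate.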

📦 Real World

Support bots with 40-page static playbooks use CAG on Anthropic for sub-100ms TTFT after warm cache—dynamic RAG only for account-specific ticket history.

🔒 Security

Shared cache prefixes must not contain tenant A’s data served to tenant B. One cache block per security boundary.

Production: pattern selection matrix

Pick patterns from eval evidence, not hype. This matrix maps common product shapes to recommended architectures and links forward to ingestion pipelines—because advanced retrieval fails on bad indexes.

Pattern selection matrix

| Product shape | Recommended pattern | Avoid |
| --- | --- | --- |
| FAQ <500 pages | Advanced modular (hybrid + rerank) | GraphRAG, agent loops |
| Multi-hop internal wiki | Modular + query decomposition | Naive top-3 only |
| Executive thematic Q&A | GraphRAG global communities | Leaf-only chunk RAG |
| Regulated citations required | Self-RAG + CRAG grading | Stuff-only long context |
| Stale KB + live web | CRAG + web fallback | KB-only blind trust |
| Stable handbook <100K tokens | CAG prefix + optional RAG suffix | Full re-retrieve every turn |
| Research copilot | Agentic RAG multi-tool | Single-shot retrieve |

Decision checklist

  • Baseline naive RAG eval logged (faithfulness, answer relevance, latency p95).
  • Pattern change tied to A/B or shadow traffic—not architecture for its own sake.
  • Max LLM calls per user query capped (budget + iteration limit).
  • Fallback path defined: abstain, human handoff, or web with disclaimers.
  • Ingestion freshness SLO documented (see next guide).
📦 Production checklist

Advanced RAG is production-ready when you can name which pattern each product surface uses, what it costs per query, and which eval metric regressed last time you changed chunk size or embedding model.

🎯 Interview Tip

“Design RAG for our company handbook + tickets.” — Modular hybrid baseline, CRAG for stale KB, optional agent for analytics questions, GraphRAG only if evals show global query gap.