Agentic RAG & research agents

Agentic RAG — when and how to retrieve iteratively

Agentic RAG gives the LLM control over retrieval: it decides whether to search, which index or tool to use, and when it has enough evidence—often across multiple loops. This beats single-shot RAG on multi-hop and research questions, at the cost of latency, token spend, and eval complexity.

Grading retrieved evidence before synthesis

Agentic pipelines often add a grader node after each retrieval batch: a lightweight LLM or cross-encoder scores whether the chunks are relevant and sufficient. Low scores trigger query reformulation; high scores skip further retrieval. This pattern, used in Self-RAG and in LangGraph tutorials, guards against both under- and over-retrieval.

Grade	Action
Relevant + sufficient	Proceed to synthesis
Relevant + insufficient	Decompose further; new sub-query
Irrelevant	Rewrite query; switch index (BM25 vs vector)
Contradictory	Fetch tie-breaker; flag in answer

import json

# Literal JSON braces are doubled so str.format() only fills {question} and {chunks}.
GRADE_PROMPT = """Score retrieval for the question.
Return JSON: {{"relevant": bool, "sufficient": bool, "follow_up_query": str|null}}

Question: {question}
Chunks: {chunks}"""

def grade_retrieval(question: str, chunks: list[dict], llm) -> dict:
    # Grade only the top 5 chunks to keep the grader prompt small.
    raw = llm.complete(GRADE_PROMPT.format(question=question, chunks=chunks[:5]))
    grade = json.loads(raw)
    if grade.get("relevant") and grade.get("sufficient"):
        return {"action": "synthesize", "chunks": chunks}
    if grade.get("follow_up_query"):
        return {"action": "retrieve_again", "query": grade["follow_up_query"]}
    return {"action": "rewrite", "query": question}

public GradeDecision gradeRetrieval(String question, List<Document> chunks) {
    RetrievalGrade grade = graderLlm.grade(question, chunks.subList(0, Math.min(5, chunks.size())));
    if (grade.relevant() && grade.sufficient()) {
        return GradeDecision.synthesize(chunks);
    }
    if (grade.followUpQuery() != null) {
        return GradeDecision.retrieveAgain(grade.followUpQuery());
    }
    return GradeDecision.rewrite(question);
}

Single-shot vs agentic retrieval

Pattern	Flow	Best for	Risk
Naive RAG	embed → search once → answer	FAQ, simple policy lookup	Misses multi-hop evidence
Advanced RAG	rewrite → hybrid → rerank → answer	Enterprise KB with tuned pipelines	Fixed strategy regardless of question
Agentic RAG	plan → retrieve → read → maybe retrieve again	Research, analytics, compound questions	Runaway loops, higher cost

When to retrieve iteratively

Multi-hop — answer needs entity A before you can query for relationship B.
Unknown corpus scope — user asks "what changed in Q3 releases?" without naming products.
Tool selection — hybrid of vector DB, SQL, web search, and calculator.
Self-correction — first retrieval set lacks supporting evidence; agent reformulates query.

Query decomposition

Decomposition splits a compound question into ordered or parallel sub-queries. Unlike multi-query expansion (same intent, different phrasing), decomposition assumes different information needs: "Compare our DPA with Vendor X's SOC 2 scope" → (1) fetch DPA section, (2) fetch Vendor X SOC 2 summary, (3) synthesize comparison.

import json

MAX_RETRIEVE_STEPS = 4

DECOMPOSE_PROMPT = """Break this question into 1-4 search sub-queries needed to answer it.
Return JSON: {"sub_queries": ["...", "..."]}"""

def agentic_retrieve(question: str, retriever, llm) -> list[dict]:
    plan = json.loads(llm.complete(f"{DECOMPOSE_PROMPT}\n\nQuestion: {question}"))
    # Fall back to the raw question if the plan is empty; cap the plan at the step budget.
    sub_queries = (plan.get("sub_queries") or [question])[:MAX_RETRIEVE_STEPS]
    evidence: list[dict] = []
    seen_ids: set[str] = set()

    def add_hits(query: str, k: int) -> None:
        # Deduplicate by chunk ID and remember which query produced each hit.
        for h in retriever.search(query, k=k):
            if h["id"] not in seen_ids:
                seen_ids.add(h["id"])
                evidence.append({**h, "from_query": query})

    for sub_q in sub_queries:
        if len(evidence) >= 20:  # evidence budget
            break
        add_hits(sub_q, k=5)

    # Optional critic step: one follow-up query if the model asks for more and a step remains.
    critique = llm.complete(f"Evidence count {len(evidence)}. Need more retrieval for: {question}? yes/no")
    if critique.strip().lower().startswith("yes") and len(sub_queries) < MAX_RETRIEVE_STEPS:
        follow_up = llm.complete(f"One follow-up search query for: {question}")
        add_hits(follow_up.strip(), k=3)
    return evidence

public List<Document> agenticRetrieve(String question) {
    DecompositionPlan plan = decomposer.decompose(question);
    Set<String> seen = new HashSet<>();
    List<Document> evidence = new ArrayList<>();
    for (String subQ : plan.subQueries()) {
        for (Document d : retriever.search(subQ, 5)) {
            if (seen.add(d.getId())) evidence.add(d);
        }
    }
    if (critic.needsMore(question, evidence)) {
        String followUp = decomposer.followUp(question, evidence);
        retriever.search(followUp, 3).forEach(d -> { if (seen.add(d.getId())) evidence.add(d); });
    }
    return evidence;
}

⚖️ Trade-off

Agentic RAG can cost 3–10× more in retrieval than an advanced static pipeline. Cap steps, cache sub-query results, and route easy FAQ traffic to the single-shot path.
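One way to cache sub-query results is a thin wrapper around the search call, keyed on a normalized query string. The sketch below is illustrative only: `CachedRetriever`, `fake_search`, and the 300-second TTL are assumptions, not part of any particular retrieval library.

```python
import time
from typing import Callable


class CachedRetriever:
    """Wraps a search function and caches results per normalized sub-query."""

    def __init__(self, search_fn: Callable[[str, int], list[dict]], ttl_seconds: float = 300.0):
        self._search_fn = search_fn
        self._ttl = ttl_seconds  # assumed TTL; tune to how often the index changes
        self._cache: dict[tuple[str, int], tuple[float, list[dict]]] = {}
        self.misses = 0

    @staticmethod
    def _normalize(query: str) -> str:
        # Collapse case and whitespace so trivially different phrasings share an entry.
        return " ".join(query.lower().split())

    def search(self, query: str, k: int = 5) -> list[dict]:
        key = (self._normalize(query), k)
        now = time.monotonic()
        cached = self._cache.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        self.misses += 1
        hits = self._search_fn(query, k)
        self._cache[key] = (now, hits)
        return hits


def fake_search(query: str, k: int) -> list[dict]:
    # Hypothetical stand-in for a real vector or BM25 search call.
    return [{"id": f"doc-{i}", "text": f"{query} result {i}"} for i in range(k)]


retriever = CachedRetriever(fake_search)
first = retriever.search("Vendor X  SOC 2 scope", k=3)
second = retriever.search("vendor x soc 2 scope", k=3)  # served from the cache
```

Exact-match caching only helps when the planner repeats itself; a semantic cache keyed on embeddings would catch paraphrases at the cost of occasional wrong hits.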

⚠️ Pitfall

Without step limits, a model can loop indefinitely: retrieve, skim the results, retrieve again. Hard-cap tool calls and require an explicit "sufficient evidence" field in structured output.

🔬 Under the Hood

LangGraph, AutoGen, and Spring AI graphs model agentic RAG as state machines: nodes for plan, retrieve, grade, generate—with conditional edges on retrieval quality scores.
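The state-machine idea can be shown without any framework. The following is a minimal plain-Python sketch, not the LangGraph or AutoGen API: each node mutates a shared state dict and returns the name of the next node, and the `grade` node plays the role of the conditional edge. The evidence-count check is a placeholder for a real retrieval quality score.

```python
from typing import Callable

State = dict  # shared mutable state passed between nodes


def plan(state: State) -> str:
    state["queries"] = [state["question"]]
    return "retrieve"


def retrieve(state: State) -> str:
    state["steps"] += 1
    query = state["queries"][-1]
    state["evidence"].extend(state["search"](query))
    return "grade"


def grade(state: State) -> str:
    # Conditional edge: loop back until evidence is sufficient or the step cap is hit.
    if len(state["evidence"]) >= state["min_evidence"] or state["steps"] >= state["max_steps"]:
        return "generate"
    state["queries"].append(f"{state['question']} (reformulated {state['steps']})")
    return "retrieve"


def generate(state: State) -> str:
    state["answer"] = f"Answer based on {len(state['evidence'])} chunks"
    return "end"


NODES: dict[str, Callable[[State], str]] = {
    "plan": plan, "retrieve": retrieve, "grade": grade, "generate": generate,
}


def run_graph(state: State, start: str = "plan") -> State:
    node = start
    while node != "end":
        node = NODES[node](state)
    return state


result = run_graph({
    "question": "What changed in Q3 releases?",
    "search": lambda q: [{"id": q, "text": "stub"}],  # hypothetical search: one hit per query
    "evidence": [], "steps": 0, "min_evidence": 2, "max_steps": 4,
})
```

Because the step cap lives in the `grade` node, the graph always terminates even if the quality check never passes.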

💡 Pro Tip

Log each sub-query and hit IDs. When answers fail, you see whether decomposition or retrieval—not synthesis—was wrong.

Research agent — 7-step pattern with citations

A production research agent follows a repeatable loop: clarify scope, plan sources, retrieve, extract claims, cross-check, synthesize with citations, and surface confidence gaps. This pattern powers internal briefings, competitive intel, and support escalation summaries.

Citation format standards

Pick one citation style per product and enforce it in post-processing. Common patterns are inline numeric [1] markers with footnotes that link to a chunk viewer, or inline markdown links to signed URLs that expire in 24h. Store chunk_id, index_version, and char_span so citations survive a reindex even if chunk boundaries shift slightly.

Style	Pros	Cons
Numeric [n]	Compact in chat UI	Requires citation panel
Markdown link	Clickable in exports	Long URLs in prose
Footnote doc id	Stable across reindex	Less readable inline
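As a rough illustration of the numeric style, the sketch below stores the three fields named above and assigns [n] markers in first-use order. The `Citation` class, `render_numeric`, and the sample IDs and version strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    chunk_id: str
    index_version: str
    char_span: tuple[int, int]  # (start, end) offsets within the source document


def render_numeric(sentences: list[tuple[str, Citation]]) -> tuple[str, dict[int, Citation]]:
    """Assign [n] markers in first-use order; return the text plus the citation map."""
    numbers: dict[Citation, int] = {}
    parts = []
    for text, cite in sentences:
        # Reuse the existing number when the same source is cited again.
        n = numbers.setdefault(cite, len(numbers) + 1)
        parts.append(f"{text} [{n}]")
    citation_map = {n: cite for cite, n in numbers.items()}
    return " ".join(parts), citation_map


dpa = Citation("dpa-12", "2024-06-01", (120, 340))   # made-up sample values
soc = Citation("soc2-3", "2024-06-01", (0, 210))
answer, cmap = render_numeric([
    ("The DPA covers EU data.", dpa),
    ("Vendor X's SOC 2 report excludes billing.", soc),
    ("Sub-processors are listed in an annex.", dpa),
])
```

Keeping the map separate from the rendered text lets the same answer be re-rendered as markdown links or footnotes without re-running synthesis.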

Seven-step research pattern

Clarify — restate question; identify required facets (time range, geography, product line).
Plan sources — pick indexes (KB, web, CRM); define forbidden sources (unapproved wikis).
Retrieve — run decomposed queries; prefer primary docs over summaries.
Extract — pull atomic claims with source span IDs (chunk_id, page, URL).
Cross-check — flag contradictions; retrieve tie-breaker evidence if needed.
Synthesize — write answer with inline citations [1][2]; separate facts vs inference.
Confidence — list gaps ("SOC 2 scope for Vendor X not in corpus"); suggest next searches.

Step	Output artifact	Quality gate
Clarify	Structured brief JSON	Ambiguity score < threshold or ask user
Extract	List of {claim, source_id, quote}	Every claim has source_id
Synthesize	Markdown + citation map	No citation → sentence removed
Confidence	Gap list	User sees what was not found
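The Synthesize gate ("No citation → sentence removed") could be enforced with a small post-processing pass. This is a sketch under simplifying assumptions: sentences are split naively on ., ! and ?, markers follow the sentence they support, and `enforce_citations` is a hypothetical helper name.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def enforce_citations(answer: str, citation_map: dict[int, str]) -> tuple[str, list[str]]:
    """Keep sentences whose [n] markers all resolve in the map; return (kept text, removed)."""
    # Naive split: a sentence ends at ., ! or ? plus any trailing [n] markers.
    sentences = re.findall(r"[^.!?]+[.!?](?:\s*\[\d+\])*", answer)
    kept, removed = [], []
    for sentence in sentences:
        sentence = sentence.strip()
        markers = [int(n) for n in CITATION.findall(sentence)]
        if markers and all(n in citation_map for n in markers):
            kept.append(sentence)
        else:
            removed.append(sentence)  # uncited, or cites a source missing from the map
    return " ".join(kept), removed


draft = "Refunds take 14 days. [1] Enterprise tiers are exempt. Chargebacks are logged. [7]"
clean, dropped = enforce_citations(draft, {1: "chunk-a"})
```

Returning the removed sentences makes the gate auditable: they can be logged or surfaced in the gap list instead of silently disappearing. A production version would need a proper sentence splitter, since decimals and abbreviations break the naive regex.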

from dataclasses import asdict, dataclass

@dataclass
class Claim:
    text: str
    source_id: str
    quote: str

def research_agent(question: str, retriever, llm) -> dict:
    # CLARIFY_PROMPT, EXTRACT_PROMPT, SYNTHESIZE_PROMPT, and GAP_PROMPT are prompt templates defined elsewhere.
    brief = llm.complete_json(CLARIFY_PROMPT, question)
    evidence = agentic_retrieve(brief["restated_question"], retriever, llm)

    claims: list[Claim] = []
    for chunk in evidence:
        extracted = llm.complete_json(EXTRACT_PROMPT, chunk["text"])
        for c in extracted.get("claims", []):
            claims.append(Claim(c["text"], chunk["id"], c["quote"]))

    synthesis = llm.complete(SYNTHESIZE_PROMPT, {
        "question": question,
        "claims": [asdict(c) for c in claims],
    })
    # Group quotes by source so several claims from one chunk do not overwrite each other.
    citations: dict[str, list[str]] = {}
    for c in claims:
        citations.setdefault(c.source_id, []).append(c.quote)
    return {
        "answer": synthesis,
        "citations": citations,
        "gaps": llm.complete(GAP_PROMPT, {"question": question, "claims": len(claims)}),
    }

public ResearchResult researchAgent(String question) {
    ResearchBrief brief = clarifier.clarify(question);
    List<Document> evidence = agenticRetrieve(brief.restatement());
    List<Claim> claims = new ArrayList<>(claimExtractor.extract(evidence));
    // Extract claims from tie-breaker evidence too; otherwise it never reaches synthesis.
    contradictions.check(claims).forEach(c -> {
        List<Document> tieBreakers = retriever.search(c.tieBreakerQuery(), 2);
        evidence.addAll(tieBreakers);
        claims.addAll(claimExtractor.extract(tieBreakers));
    });
    String answer = synthesizer.withCitations(question, claims);
    List<String> gaps = gapAnalyzer.analyze(question, claims);
    return new ResearchResult(answer, claims, gaps);
}

📦 Real World

Legal and compliance teams require citation maps stored with each briefing PDF export—auditors click [2] to open the exact policy chunk version indexed at answer time.

🔒 Security

Research agents can exfiltrate data if they are allowed to search unrestricted web or email indexes. Scope retriever tools to each user's RBAC permissions before the plan step.

🎯 Interview Tip

Describe research agents as orchestration + citation discipline—not "GPT browses the web." Mention claim extraction, contradiction handling, and gap reporting.

💰 Cost

Seven-step research on 10+ chunks may consume 20–50k tokens. Offer "quick answer" (single-shot) vs "deep research" (full loop) UX tiers.

Deep research — Perplexity / ChatGPT pattern

Deep research products (Perplexity Deep Research, ChatGPT deep research, Gemini Deep Research) expand a topic tree: breadth-first exploration, deduplication of sources, progressive grounding checks, and a final long-form report. Understanding this pattern helps you build, or compete with, consumer-grade research UX on private corpora.

Topic tree expansion

Start from the user question as the root node. The planner generates 3–7 subtopics (breadth). Each subtopic triggers retrieval; salient findings spawn child nodes (depth) until a budget (max nodes, max tokens, max wall time) stops expansion. This is a BFS/DFS hybrid with LLM-scored priority.

Phase	Action	Stop condition
Expand	Generate subtopics from frontier queue	Max depth or novelty < ε
Retrieve	Search per subtopic; fetch full pages if web	Top-K per node capped
Dedup	Canonical URL / chunk hash / embedding similarity	Skip near-duplicates > 0.92 cosine
Ground	Verify claims against source text	Drop unsupported bullets
Report	Outline → section drafts → merge	Token budget for output

Deduplication strategies

URL canonicalization — strip tracking params; normalize www.
Content hash — SHA-256 of extracted main text.
Embedding cluster — merge chunks with cosine > 0.92 from same domain.
Temporal prefer — keep newest when sources repeat news wire stories.
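The first three strategies above can be chained in a single pass, cheapest check first. The sketch below is one possible arrangement, with assumptions worth flagging: the tracking-parameter list is illustrative, the 0.92 threshold is applied across all domains rather than per domain, each document carries a precomputed `embedding`, and temporal preference is left out.

```python
import hashlib
import math
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)        # illustrative, not exhaustive
TRACKING_KEYS = {"fbclid", "gclid"}


def canonical_url(url: str) -> str:
    parts = urlsplit(url)
    host = parts.netloc.lower().removeprefix("www.")
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if not k.startswith(TRACKING_PREFIXES) and k not in TRACKING_KEYS]
    # Drop the fragment and any trailing slash so equivalent URLs compare equal.
    return urlunsplit((parts.scheme.lower(), host, parts.path.rstrip("/"), urlencode(query), ""))


def content_hash(text: str) -> str:
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedupe(docs: list[dict], threshold: float = 0.92) -> list[dict]:
    seen_urls, seen_hashes, kept = set(), set(), []
    for doc in docs:
        url, digest = canonical_url(doc["url"]), content_hash(doc["text"])
        if url in seen_urls or digest in seen_hashes:
            continue  # exact duplicate by URL or by text
        if any(cosine(doc["embedding"], k["embedding"]) > threshold for k in kept):
            continue  # near-duplicate by embedding similarity
        seen_urls.add(url)
        seen_hashes.add(digest)
        kept.append(doc)
    return kept


docs = [
    {"url": "https://www.example.com/a/?utm_source=x", "text": "Hello  World", "embedding": [1.0, 0.0]},
    {"url": "https://example.com/a", "text": "different", "embedding": [0.0, 1.0]},
    {"url": "https://other.org/b", "text": "hello world", "embedding": [0.0, 1.0]},
    {"url": "https://other.org/c", "text": "near copy", "embedding": [0.99, 0.05]},
    {"url": "https://other.org/d", "text": "novel", "embedding": [0.0, 1.0]},
]
unique = dedupe(docs)
```

The pairwise cosine check is quadratic in the number of kept documents, which is fine for a few hundred sources; larger pools would want an approximate nearest-neighbor index.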

Grounding checks

After bullets are drafted, a grounding model (often the same LLM with an NLI-style prompt) labels each sentence SUPPORTED / CONTRADICTED / NOT_ENOUGH_INFO against the retrieved spans. Sentences not labeled SUPPORTED are removed or rewritten. This reduces hallucination in long reports, where synthesis drift accumulates.
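The filtering half of this step is simple once a labeler exists. In the sketch below the labeler is pluggable; `overlap_labeler` is a toy token-overlap stand-in for an NLI model or LLM prompt (it cannot detect contradictions), and both function names are assumptions.

```python
from typing import Callable

Label = str  # "SUPPORTED" | "CONTRADICTED" | "NOT_ENOUGH_INFO"


def filter_grounded(
    sentences: list[str],
    spans: list[str],
    label_fn: Callable[[str, list[str]], Label],
) -> dict[str, list[str]]:
    """Keep only sentences the labeler marks SUPPORTED; report the rest."""
    kept, dropped = [], []
    for sentence in sentences:
        label = label_fn(sentence, spans)
        (kept if label == "SUPPORTED" else dropped).append(sentence)
    return {"kept": kept, "dropped": dropped}


def overlap_labeler(sentence: str, spans: list[str]) -> Label:
    # Toy heuristic: SUPPORTED when at least 80% of the sentence tokens appear in one span.
    tokens = {t.strip(".,").lower() for t in sentence.split()}
    for span in spans:
        span_tokens = {t.strip(".,").lower() for t in span.split()}
        if tokens and len(tokens & span_tokens) / len(tokens) >= 0.8:
            return "SUPPORTED"
    return "NOT_ENOUGH_INFO"


spans = ["Refunds are issued within 14 days of purchase."]
draft = ["Refunds are issued within 14 days.", "Refunds include shipping costs."]
result = filter_grounded(draft, spans, overlap_labeler)
```

Swapping `overlap_labeler` for a model-backed function leaves the filter unchanged, which keeps the drop-or-rewrite policy testable without any LLM calls.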

from collections import deque
from dataclasses import dataclass

@dataclass
class TopicNode:
    topic: str
    depth: int
    parent: str | None

def deep_research(root_question: str, retriever, llm, max_nodes: int = 24) -> dict:
    # dedupe_hits, grounding_filter, and the *_PROMPT templates are helpers defined elsewhere.
    frontier: deque[TopicNode] = deque([TopicNode(root_question, 0, None)])
    visited_topics: set[str] = set()
    source_pool: dict[str, dict] = {}  # id -> chunk

    while frontier and len(visited_topics) < max_nodes:
        node = frontier.popleft()  # FIFO order gives breadth-first expansion
        key = node.topic.lower().strip()
        if key in visited_topics:
            continue
        visited_topics.add(key)

        hits = retriever.search(node.topic, k=6)
        for h in dedupe_hits(hits, source_pool):
            source_pool[h["id"]] = h

        if node.depth < 3:  # depth cap
            subtopics = llm.complete_json(EXPAND_PROMPT, node.topic).get("subtopics", [])
            for st in subtopics[:4]:  # breadth cap per node
                frontier.append(TopicNode(st, node.depth + 1, key))

    outline = llm.complete(OUTLINE_PROMPT, list(source_pool.values()))
    draft = llm.complete(REPORT_PROMPT, {"outline": outline, "sources": list(source_pool.values())})
    grounded = grounding_filter(draft, source_pool, llm)
    return {"report": grounded, "sources": list(source_pool.values()), "nodes_explored": len(visited_topics)}

public DeepResearchResult deepResearch(String rootQuestion) {
    Queue<TopicNode> frontier = new ArrayDeque<>(List.of(new TopicNode(rootQuestion, 0, null)));
    Set<String> visited = new HashSet<>();
    Map<String, Document> sourcePool = new LinkedHashMap<>();

    while (!frontier.isEmpty() && visited.size() < maxNodes) {
        TopicNode node = frontier.poll();
        if (!visited.add(normalize(node.topic()))) continue;
        List<Document> hits = deduper.merge(sourcePool, retriever.search(node.topic(), 6));
        hits.forEach(d -> sourcePool.putIfAbsent(d.getId(), d));
        if (node.depth() < 3) {
            expander.expand(node.topic()).stream()
                .limit(4)
                .map(st -> new TopicNode(st, node.depth() + 1, node.topic()))
                .forEach(frontier::add);
        }
    }
    String draft = reporter.draft(sourcePool.values());
    return new DeepResearchResult(groundingFilter.apply(draft, sourcePool), sourcePool.values());
}

🔬 Under the Hood

Commercial deep research runs many parallel retrievals with async workers—similar to your async job queue pattern. Wall time is dominated by fetch + LLM, not vector search alone.
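A bounded fan-out is enough to illustrate the parallel retrieval idea. This sketch uses `asyncio` with a semaphore to cap concurrency; `fetch_all`, `fake_search`, and the limit of 4 are assumptions, and a real worker would call a search or fetch API instead of sleeping.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_all(
    topics: list[str],
    search: Callable[[str], Awaitable[list[str]]],
    max_concurrency: int = 4,
) -> dict[str, list[str]]:
    """Run one search per topic concurrently, never more than max_concurrency at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(topic: str) -> tuple[str, list[str]]:
        async with semaphore:
            return topic, await search(topic)

    # gather() returns results in input order, regardless of completion order.
    pairs = await asyncio.gather(*(bounded(t) for t in topics))
    return dict(pairs)


async def fake_search(topic: str) -> list[str]:
    await asyncio.sleep(0.01)  # stand-in for network latency
    return [f"{topic}: hit {i}" for i in range(2)]


results = asyncio.run(fetch_all(["pricing", "security", "roadmap"], fake_search))
```

The semaphore matters more than the gather: without a cap, a wide topic tree can exhaust rate limits on the search backend.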

⚖️ Trade-off

Deep research reports take 2–15 minutes. Run as background jobs with email/Slack delivery; never block synchronous chat on full tree expansion.

⚠️ Pitfall

URL-only dedup misses text that is copied or paraphrased across domains. Add embedding-similarity dedup to catch paragraph-level repeats.

💡 Pro Tip

Show users the topic tree as a progress UI—it sets expectations and increases trust during long runs.

Code agent — generate, execute, fix loop

A code agent writes Python (or SQL, shell) to answer quantitative questions, transform data, or produce charts. The loop is: generate → execute in sandbox → read stderr/stdout → fix until tests pass or step cap. E2B, Docker, Modal, and vendor code interpreters implement the sandbox layer.

Code agent loop

Stage	Input	Output
Generate	Question + data schema + prior errors	Python cell
Execute	Code + sandbox limits	stdout, stderr, artifacts
Fix	Error trace	Revised code
Answer	Successful run output	Natural language + optional chart

Sandbox requirements

No network egress (or allowlist only)
CPU/memory/time limits per execution
Ephemeral filesystem; no host mounts
Preinstalled pandas/numpy; no arbitrary pip install in prod
Capture matplotlib/plotly bytes as artifacts
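For teams using plain Docker as the sandbox layer, most of the requirements above map onto standard `docker run` flags. The sketch below only builds the argument list; it does not start a container. The image name `analysis-sandbox:latest` is hypothetical (assumed to have pandas and numpy preinstalled), and the limits are illustrative defaults.

```python
def docker_sandbox_command(
    image: str,
    *,
    memory: str = "512m",
    cpus: str = "1.0",
    pids: int = 64,
) -> list[str]:
    """Build a docker run command that reads the generated program from stdin."""
    return [
        "docker", "run", "--rm", "--interactive",
        "--network", "none",                  # no network egress
        "--memory", memory,                   # memory limit
        "--cpus", cpus,                       # CPU limit
        "--pids-limit", str(pids),            # blunt fork bombs
        "--read-only",                        # immutable root filesystem
        "--tmpfs", "/tmp:rw,size=64m",        # ephemeral scratch space, no host mounts
        "--cap-drop", "ALL",
        "--security-opt", "no-new-privileges",
        image, "python", "-",                 # "python -" executes code piped on stdin
    ]


cmd = docker_sandbox_command("analysis-sandbox:latest")
```

A caller would pass this list to `subprocess.run(cmd, input=code, text=True, timeout=...)` so the wall-clock limit is enforced from outside the container. Containers share the host kernel, so this is weaker isolation than a microVM.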

from e2b_code_interpreter import Sandbox

MAX_FIX_ATTEMPTS = 5

def run_code_agent(question: str, data_csv: str, llm) -> dict:
    # The first 500 characters of the CSV (header plus a few rows) stand in for a schema.
    code = llm.complete(CODEGEN_PROMPT, {"question": question, "schema": data_csv[:500]})
    errors: list[str] = []

    with Sandbox(timeout=120) as sbx:
        sbx.files.write("/data/input.csv", data_csv)
        for attempt in range(MAX_FIX_ATTEMPTS):
            execution = sbx.run_code(code)
            if execution.error is None:
                stdout = "".join(execution.logs.stdout)  # stdout arrives as a list of chunks
                return {
                    "answer": llm.complete(SUMMARIZE_PROMPT, stdout),
                    "code": code,
                    "artifacts": execution.results,
                }
            errors.append(str(execution.error))
            if attempt < MAX_FIX_ATTEMPTS - 1:  # skip the fix call when no run remains
                code = llm.complete(FIX_PROMPT, {"code": code, "errors": errors})
    return {"error": "max_attempts", "last_code": code, "errors": errors}

@Service
public class CodeAgentService {
    private static final int MAX_FIX_ATTEMPTS = 5;

    private final ChatClient llm;
    private final SandboxClient sandbox;

    public CodeAgentService(ChatClient llm, SandboxClient sandbox) {
        this.llm = llm;
        this.sandbox = sandbox;
    }

    // codegenTemplate, fixTemplate, and summarizeTemplate are prompt-building helpers defined elsewhere.
    public CodeAgentResult run(String question, String dataCsv) {
        String code = llm.prompt().user(codegenTemplate(question, dataCsv)).call().content();
        List<String> errors = new ArrayList<>();
        for (int i = 0; i < MAX_FIX_ATTEMPTS; i++) {
            ExecutionResult result = sandbox.runPython(code, dataCsv, Duration.ofSeconds(120));
            if (result.success()) {
                String summary = llm.prompt().user(summarizeTemplate(result.stdout())).call().content();
                return new CodeAgentResult(summary, code, result.artifacts());
            }
            errors.add(result.stderr());
            code = llm.prompt().user(fixTemplate(code, errors)).call().content();
        }
        throw new CodeAgentException("max attempts exceeded");
    }
}

🔒 Security

Never execute model-generated code on the host OS. Sandboxes must block filesystem escape, network, and fork bombs. Treat stdout as untrusted before showing to users.

💰 Cost

Each fix attempt costs an LLM call plus sandbox time. Pre-validate generated code with static analysis (for example bandit) before execution, and wrap each run in a timeout.
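A cheap pre-check can reject obviously bad code before it costs a sandbox run. The sketch below uses the standard-library `ast` module rather than bandit; the allowlist and blocklist contents are assumptions, and this is a cost filter, not a security boundary (the sandbox remains the real control).

```python
import ast

ALLOWED_IMPORTS = {"pandas", "numpy", "math", "statistics", "matplotlib"}  # assumed allowlist
BLOCKED_CALLS = {"eval", "exec", "__import__"}


def precheck(code: str) -> list[str]:
    """Return a list of problems; an empty list means the code may be sent to the sandbox."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    problems = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            modules = []
        for module in modules:
            # Compare the top-level package, so "matplotlib.pyplot" passes via "matplotlib".
            if module.split(".")[0] not in ALLOWED_IMPORTS:
                problems.append(f"import not allowed: {module}")
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in BLOCKED_CALLS):
            problems.append(f"call not allowed: {node.func.id}")
    return problems


ok = precheck("import pandas as pd\nprint(pd.__version__)")
bad_import = precheck("import os\nos.system('ls')")
bad_call = precheck("eval('1 + 1')")
bad_syntax = precheck("def f(:")
```

Feeding the problem list straight into the fix prompt saves one sandbox round trip per rejected draft.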

📦 Real World

OpenAI Code Interpreter and Anthropic's analysis tool run in vendor-managed sandboxes; self-hosted teams often use E2B (Firecracker microVMs) or gVisor-wrapped containers.

🎯 Interview Tip

Code agents combine tool loops with deterministic execution—mention step caps, sandbox isolation, and validating numeric answers against retrieved facts when possible.

Agent evaluation — completion, efficiency, tools, faithfulness

Agent quality is multi-dimensional: did it finish the task, how many steps did it waste, were tools called correctly, and is the final answer grounded? Track 4 agents need offline eval harnesses before production—full methodology lives in Track 5 — Evaluation, but the metrics start here.

Core agent metrics

Metric	Definition	How to measure
Task completion	User goal achieved end-to-end	LLM judge + human spot check on golden tasks
Step efficiency	Tool calls vs optimal path	Compare trace length to expert trajectory
Tool accuracy	Correct tool + valid args	Parse trace; match expected tool sequence
Faithfulness	Answer supported by retrieved evidence	NLI-style claim check per sentence
Retrieval recall	Gold doc in evidence set	Label chunk IDs in eval set
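Several of these metrics reduce to small ratio functions once traces are labeled. The formulas below are one plausible reading of the table, not a standard: step efficiency is taken as optimal steps over actual steps, and tool accuracy checks tool names only (argument validity would need a separate schema check).

```python
def step_efficiency(actual_steps: int, optimal_steps: int) -> float:
    """1.0 means the agent matched the expert path; lower means wasted steps."""
    if actual_steps <= 0:
        return 0.0
    return min(1.0, optimal_steps / actual_steps)


def tool_accuracy(expected: list[str], called: list[str]) -> float:
    """Fraction of expected tools that were called, ignoring order and repeats."""
    if not expected:
        return 1.0
    return len(set(expected) & set(called)) / len(set(expected))


def faithfulness(labels: list[str]) -> float:
    """Share of answer sentences labeled SUPPORTED by the claim checker."""
    if not labels:
        return 0.0
    return sum(label == "SUPPORTED" for label in labels) / len(labels)


def recall_at_k(gold_ids: set[str], retrieved_ids: list[str], k: int) -> float:
    """Fraction of gold chunk IDs that appear in the top-k retrieved IDs."""
    if not gold_ids:
        return 1.0
    return len(gold_ids & set(retrieved_ids[:k])) / len(gold_ids)


efficiency = step_efficiency(actual_steps=8, optimal_steps=4)
accuracy = tool_accuracy(["search_kb", "fetch_chunk"], ["search_kb", "search_kb"])
faith = faithfulness(["SUPPORTED", "SUPPORTED", "NOT_ENOUGH_INFO", "SUPPORTED"])
recall = recall_at_k({"a", "b"}, ["a", "c", "b"], k=2)
```

Task completion is deliberately absent here because it needs a judge or a human label rather than arithmetic on the trace.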

Golden task fixtures

Build 30–100 scripted tasks per agent persona: "Research refund policy for EU enterprise tier," "Plot Q3 revenue by region from attached CSV." Each fixture specifies expected tools (order may vary), required citation sources, and forbidden actions (no web, no delete). Run in CI on prompt/graph changes.

Trace logging schema

trace_id, agent_version, user_tenant
Ordered steps: {type: tool|llm|retrieve, name, latency_ms, input_hash, output_hash}
Final answer + citation map + retrieval snapshot IDs
User feedback signal linked to trace
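The schema above could be captured with two dataclasses. In this sketch the field names follow the list, while `stable_hash`, the 16-character digest length, and the `record` helper are assumptions added for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


def stable_hash(payload) -> str:
    # Hash a canonical JSON form so raw inputs need not be stored in the trace.
    blob = json.dumps(payload, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]


@dataclass
class Step:
    type: str  # "tool" | "llm" | "retrieve"
    name: str
    latency_ms: int
    input_hash: str
    output_hash: str


@dataclass
class Trace:
    trace_id: str
    agent_version: str
    user_tenant: str
    steps: list[Step] = field(default_factory=list)
    answer: str = ""
    citation_map: dict[str, str] = field(default_factory=dict)
    retrieval_snapshot_ids: list[str] = field(default_factory=list)
    feedback: Optional[str] = None  # user feedback signal, linked after the fact

    def record(self, type_: str, name: str, latency_ms: int, inputs, outputs) -> None:
        self.steps.append(Step(type_, name, latency_ms, stable_hash(inputs), stable_hash(outputs)))


trace = Trace("t-1", "v3", "acme")
trace.record("retrieve", "search_kb", 42, {"query": "refund policy"}, ["c1", "c2"])
serialized = json.dumps(asdict(trace))
```

Hashing inputs and outputs keeps traces small and avoids copying tenant data into logs, while still letting two runs be compared step by step.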

from dataclasses import dataclass
from statistics import mean

@dataclass
class GoldenTask:
    id: str
    question: str
    expected_tools: set[str]
    required_source_ids: set[str]
    rubric: str  # grading criteria passed to the judge

def eval_agent(agent_fn, tasks: list[GoldenTask], judge_llm) -> dict:
    results = []
    for task in tasks:
        trace = agent_fn.run_with_trace(task.question)
        tools_used = {s["name"] for s in trace.steps if s["type"] == "tool"}
        sources = set(trace.retrieval_ids)

        # Both ratios measure coverage of the expected set, so tool order does not matter.
        tool_score = len(task.expected_tools & tools_used) / max(len(task.expected_tools), 1)
        recall = len(task.required_source_ids & sources) / max(len(task.required_source_ids), 1)
        completion = judge_llm.score(task.question, trace.answer, task.rubric)

        results.append({
            "task_id": task.id,
            "completion": completion,
            "tool_accuracy": tool_score,
            "retrieval_recall": recall,
            "steps": len(trace.steps),
        })
    avg_completion = mean(r["completion"] for r in results) if results else 0.0
    return {"tasks": results, "avg_completion": avg_completion}

public AgentEvalReport evalAgent(AgentRunner runner, List<GoldenTask> tasks) {
    List<TaskScore> scores = new ArrayList<>();
    for (GoldenTask task : tasks) {
        AgentTrace trace = runner.runWithTrace(task.question());
        Set<String> toolsUsed = trace.toolNames();
        double toolAcc = overlap(task.expectedTools(), toolsUsed);
        double recall = overlap(task.requiredSourceIds(), trace.retrievalIds());
        double completion = judge.score(task, trace.answer());
        scores.add(new TaskScore(task.id(), completion, toolAcc, recall, trace.stepCount()));
    }
    return AgentEvalReport.from(scores);
}

💡 Pro Tip

Optimize step efficiency only after task completion plateaus; shaving steps while breaking edge cases is a common regression.

⚠️ Pitfall

LLM-as-judge scoring tends to favor verbose answers. Calibrate judges against human labels monthly; use rubric checkpoints, not vibes.
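One common way to quantify judge calibration is chance-corrected agreement between judge and human labels on the same tasks. The sketch below computes Cohen's kappa; choosing kappa is an assumption here, and plain agreement rate or correlation on numeric scores would also work.

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Agreement between two label lists, corrected for agreement expected by chance."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(judge_counts[label] * human_counts[label] for label in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


judge_labels = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
human_labels = ["pass", "fail", "fail", "pass", "fail", "pass", "fail", "fail"]
kappa = cohens_kappa(judge_labels, human_labels)
```

In this sample the judge passes two answers the human failed, which is the pattern a verbosity bias would produce; tracking kappa over time shows whether prompt or rubric changes are closing that gap.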

🎯 Interview Tip

Name four metrics: completion, efficiency, tool accuracy, faithfulness. Mention golden traces and linking evals to Track 5 CI gates.

Deep dive: Track 5 — Evaluation covers golden datasets, LLM judges, regression gates, and online evals.

What this looks like in production

Shipping agentic RAG means routing traffic between fast single-shot and slow research paths, enforcing budgets, and operating observability across retrieval, tools, and synthesis. This is the Track 4 capstone before Track 5 — Evaluation.

Routing: fast vs agentic path

Signal	Route	Target latency
FAQ match > 0.9 confidence	Single-shot RAG	< 3 s
Compound / research keywords	Agentic retrieve loop	5–30 s
"Deep research" explicit UX	Topic tree async job	2–15 min
CSV / numeric analysis attached	Code agent sandbox	10–60 s

Track 4 production checklist

Router with logged decision reason per request
Hard caps: max tool calls, max retrieve steps, max tokens per trace
Citation map stored with every research output
Sandbox isolation for code agents
MCP allowlists for external tools (see MCP guide)
Golden agent tasks in CI (30+ per persona)
Async queue for deep research with progress notifications
Cost attribution per tenant: retrieval + LLM + sandbox minutes
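The hard caps in the checklist can live in one small object that every tool call, retrieval, and LLM call charges against. This is a sketch: `TraceBudget`, `BudgetExceeded`, and the default limits are illustrative, not taken from a specific framework.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a trace goes over one of its hard caps."""


@dataclass
class TraceBudget:
    max_tool_calls: int = 8
    max_retrieve_steps: int = 4
    max_tokens: int = 50_000
    tool_calls: int = 0
    retrieve_steps: int = 0
    tokens: int = 0

    def charge(self, *, tool_calls: int = 0, retrieve_steps: int = 0, tokens: int = 0) -> None:
        """Add usage to the running totals and fail fast if any cap is exceeded."""
        self.tool_calls += tool_calls
        self.retrieve_steps += retrieve_steps
        self.tokens += tokens
        for name in ("tool_calls", "retrieve_steps", "tokens"):
            used, cap = getattr(self, name), getattr(self, f"max_{name}")
            if used > cap:
                raise BudgetExceeded(f"{name} cap exceeded: {used} > {cap}")


budget = TraceBudget(max_retrieve_steps=2)
budget.charge(retrieve_steps=1, tokens=1_200)
budget.charge(retrieve_steps=1, tool_calls=1)
try:
    budget.charge(retrieve_steps=1)  # third retrieval goes over the cap of 2
    outcome = "continued"
except BudgetExceeded as exc:
    outcome = str(exc)
```

The running totals double as the raw numbers for per-tenant cost attribution, so enforcement and billing read from the same counters.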

Observability dashboard

p95 latency by route (fast / agentic / deep / code)
Tool error rate by server name
Avg steps per successful task
Faithfulness score sample (offline daily job)
Retrieval recall@K on labeled slice

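import hashlib
from pathlib import Path

def route_and_run(question: str, attachments: list[str], cfg) -> dict:
    # classifier, retriever, llm, log, and the handler functions are module-level dependencies.
    # Attachments are assumed to be local file paths.
    csv_files = [a for a in attachments if a.lower().endswith(".csv")]
    route = "fast"
    if cfg.deep_research_phrase in question.lower():
        route = "deep"
    elif csv_files:
        route = "code"
    elif classifier.needs_agentic_rag(question):
        route = "agentic"

    # hashlib gives a digest that is stable across processes; built-in hash() is salted per run.
    question_hash = hashlib.sha256(question.encode("utf-8")).hexdigest()[:16]
    log.info("route_selected", route=route, question_hash=question_hash)
    handlers = {
        "fast": lambda: single_shot_rag(question),
        "agentic": lambda: research_agent(question, retriever, llm),
        "deep": lambda: enqueue_deep_research(question),
        # run_code_agent expects CSV text, so read the first CSV attachment from disk.
        "code": lambda: run_code_agent(question, Path(csv_files[0]).read_text(), llm),
    }
    return {"route": route, "result": handlers[route]()}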

public AgentResponse routeAndRun(AgentRequest request) {
    Route route = router.classify(request);
    metrics.counter("agent.route", "type", route.name()).increment();
    return switch (route) {
        case FAST -> fastRag.answer(request);
        case AGENTIC -> researchAgent.run(request);
        case DEEP -> deepResearchQueue.enqueue(request);
        case CODE -> codeAgent.run(request);
    };
}

📦 Real World

Perplexity-style products pick the route implicitly; enterprise copilots label the modes ("Quick", "Research", "Analyze data") so users understand latency and cost.

💰 Cost

Agentic traffic may be 5–20% of queries but 40–70% of LLM spend. Monitor spend by route and adjust router thresholds quarterly.

🔒 Security

Deep research and code agents carry high exfiltration risk. Enforce RBAC on retriever scopes and sandbox network policy per tenant tier.

Combining agentic RAG with MCP tools

Retrieval does not have to live in-process. Expose search_kb, fetch_chunk, and search_web as MCP tools so research agents discover them alongside CRM and calendar servers. The orchestrator merges allowlists; the model picks retrieval tools per step. See MCP explained for server patterns.

Agent mode	Typical tools	Max steps
Research	search_kb, fetch_chunk, cite_source	6–10
Deep research	search_web, fetch_url, dedupe_check	20–40 (async)
Code analysis	run_python, load_csv, plot	5–8
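Merging allowlists amounts to a set intersection: a tool is offered to the model only if the agent mode permits it, the user is granted it, and some connected server exposes it. The sketch below uses plain dictionaries rather than an MCP SDK; `MODE_TOOLS` mirrors the table, while `allowed_tools` and the server names are hypothetical.

```python
MODE_TOOLS = {
    "research": {"search_kb", "fetch_chunk", "cite_source"},
    "deep_research": {"search_web", "fetch_url", "dedupe_check"},
    "code_analysis": {"run_python", "load_csv", "plot"},
}


def allowed_tools(
    mode: str,
    user_grants: set[str],
    server_tools: dict[str, set[str]],
) -> dict[str, str]:
    """Map each permitted tool to the server that provides it."""
    permitted = MODE_TOOLS.get(mode, set()) & user_grants
    resolved: dict[str, str] = {}
    # Sorted iteration makes the choice deterministic when two servers expose the same tool.
    for server in sorted(server_tools):
        for tool in sorted(server_tools[server] & permitted):
            resolved.setdefault(tool, server)
    return resolved


servers = {
    "kb-server": {"search_kb", "fetch_chunk"},
    "web-server": {"search_web", "fetch_url"},
}
research_tools = allowed_tools("research", {"search_kb", "search_web"}, servers)
unknown_mode = allowed_tools("admin", {"search_kb"}, servers)
```

An unknown mode resolves to no tools at all, which keeps the default closed rather than open.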

Failure modes and mitigations

Failure	Symptom	Mitigation
Retrieve loop	Same query repeated 5+ times	Query dedup set; force synthesize after cap
Citation drift	[3] points to wrong chunk	Stable chunk IDs; validate map before render
Over-research	30 sources, shallow summary	Max sources; require outline before report
Code hallucination	Prints answer without running code	Require execution artifact for numeric claims

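import json

def agent_loop(question: str, tools, llm, max_steps: int = 8):
    # Assumes an OpenAI-style chat schema: assistant turns carry tool_calls, and
    # every tool call is answered by a "tool" message with the matching id.
    messages = [{"role": "user", "content": question}]
    queries_seen: set[str] = set()
    for step in range(max_steps):
        resp = llm.chat(messages, tools=tools)
        if not resp.tool_calls:
            return resp.content
        messages.append({"role": "assistant", "content": resp.content, "tool_calls": resp.tool_calls})
        for call in resp.tool_calls:
            if call.name == "search_kb":
                # Normalize case and whitespace so near-identical queries count as repeats.
                q = " ".join(json.loads(call.arguments)["query"].lower().split())
                if q in queries_seen:
                    # Answer the duplicate call instead of running it, so the history stays valid.
                    messages.append({"role": "tool", "tool_call_id": call.id,
                                     "content": "Duplicate query. Synthesize now with existing evidence."})
                    continue
                queries_seen.add(q)
            messages.append(run_tool(call))
    return llm.chat(messages + [{"role": "user", "content": "Step limit reached. Answer with citations and list gaps."}]).content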

public String agentLoop(String question, int maxSteps) {
    List<Message> messages = new ArrayList<>(List.of(new UserMessage(question)));
    Set<String> queriesSeen = new HashSet<>();
    for (int step = 0; step < maxSteps; step++) {
        ChatResponse resp = chatClient.prompt().messages(messages).tools(tools).call();
        if (!resp.hasToolCalls()) return resp.getResult().getOutput().getContent();
        for (ToolCall call : resp.getToolCalls()) {
            if ("search_kb".equals(call.name())) {
                String q = parseQuery(call.arguments());
                if (!queriesSeen.add(q)) {
                    messages.add(new UserMessage("Duplicate query. Synthesize with existing evidence."));
                    continue;
                }
            }
            messages.add(ToolResponseMessage.from(call.id(), execute(call)));
        }
    }
    messages.add(new UserMessage("Step limit reached. Answer with citations and list gaps."));
    return chatClient.prompt().messages(messages).call().content();
}

Cross-track references

Retrieval strategies — static pipelines agentic RAG extends
Advanced RAG patterns — Self-RAG and corrective retrieval
MCP explained — external retrieval and action tools
Track 5 — Evaluation — golden tasks and faithfulness scoring

💡 Pro Tip

Start agentic RAG on the internal KB only (no web) until citation accuracy exceeds 90% on the golden set, then add web tools with stricter grounding.

Rollout phases

Phase 1 — Agentic retrieve on internal KB for internal power users; log traces only.
Phase 2 — Enable research mode with citations in UI; golden eval gate in CI.
Phase 3 — Deep research async jobs for analyst persona; cost caps per tenant.
Phase 4 — Code agent for CSV attachments; sandbox in isolated VPC.

Each phase has explicit exit criteria tied to Track 5 metrics before the next phase ships to all tenants.

Pair this guide with Advanced RAG patterns for Self-RAG grading concepts and with Multi-agent orchestration when research sub-agents run in parallel.