Agentic RAG & research agents

Single-shot RAG fails on multi-hop questions, open-ended research, and quantitative analysis over user data. Agentic RAG puts the model in control of when and how to retrieve—often in loops—with query decomposition, citation discipline, and optional code execution. This guide covers iterative retrieval, the 7-step research pattern, deep-research topic trees, sandboxed code agents, evaluation metrics, and production routing.

After reading, you should be able to: decide when agentic retrieval beats static pipelines; implement query decomposition and research loops with citations; build deep-research topic expansion with dedup and grounding; run generate→execute→fix code agent loops in sandboxes; measure task completion, step efficiency, tool accuracy, and faithfulness; and route production traffic across fast, agentic, deep, and code paths.

Agentic RAG — when and how to retrieve iteratively

Agentic RAG gives the LLM control over retrieval: it decides whether to search, which index or tool to use, and when it has enough evidence—often across multiple loops. This beats single-shot RAG on multi-hop and research questions, at the cost of latency, token spend, and eval complexity.

Grading retrieved evidence before synthesis

Agentic pipelines often add a grader node after each retrieval batch: a lightweight LLM or cross-encoder scores whether the chunks are relevant and sufficient. Low scores trigger reformulation; high scores skip further retrieval. This pattern, used in Self-RAG and LangGraph tutorials, guards against both under- and over-retrieval.

| Grade | Action |
|---|---|
| Relevant + sufficient | Proceed to synthesis |
| Relevant + insufficient | Decompose further; new sub-query |
| Irrelevant | Rewrite query; switch index (BM25 vs vector) |
| Contradictory | Fetch tie-breaker; flag in answer |

Retrieval grader node
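import json

# Literal JSON braces are doubled so str.format() only fills {question} and {chunks}.
GRADE_PROMPT = """Score retrieval for the question.
Return JSON: {{"relevant": bool, "sufficient": bool, "follow_up_query": str|null}}

Question: {question}
Chunks: {chunks}"""

def grade_retrieval(question: str, chunks: list[dict], llm) -> dict:
    # Grade only the top 5 chunks to keep the grader call cheap.
    raw = llm.complete(GRADE_PROMPT.format(question=question, chunks=chunks[:5]))
    try:
        grade = json.loads(raw)
    except json.JSONDecodeError:
        # Unparseable grader output: fall back to rewriting the query.
        return {"action": "rewrite", "query": question}
    if grade.get("relevant") and grade.get("sufficient"):
        return {"action": "synthesize", "chunks": chunks}
    if grade.get("follow_up_query"):
        return {"action": "retrieve_again", "query": grade["follow_up_query"]}
    return {"action": "rewrite", "query": question}
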
public GradeDecision gradeRetrieval(String question, List<Document> chunks) {
    RetrievalGrade grade = graderLlm.grade(question, chunks.subList(0, Math.min(5, chunks.size())));
    if (grade.relevant() && grade.sufficient()) {
        return GradeDecision.synthesize(chunks);
    }
    if (grade.followUpQuery() != null) {
        return GradeDecision.retrieveAgain(grade.followUpQuery());
    }
    return GradeDecision.rewrite(question);
}

Single-shot vs agentic retrieval

| Pattern | Flow | Best for | Risk |
|---|---|---|---|
| Naive RAG | embed → search once → answer | FAQ, simple policy lookup | Misses multi-hop evidence |
| Advanced RAG | rewrite → hybrid → rerank → answer | Enterprise KB with tuned pipelines | Fixed strategy regardless of question |
| Agentic RAG | plan → retrieve → read → maybe retrieve again | Research, analytics, compound questions | Runaway loops, higher cost |

When to retrieve iteratively

  • Multi-hop — answer needs entity A before you can query for relationship B.
  • Unknown corpus scope — user asks "what changed in Q3 releases?" without naming products.
  • Tool selection — hybrid of vector DB, SQL, web search, and calculator.
  • Self-correction — first retrieval set lacks supporting evidence; agent reformulates query.

Query decomposition

Decomposition splits a compound question into ordered or parallel sub-queries. Unlike multi-query expansion (same intent, different phrasing), decomposition assumes different information needs: "Compare our DPA with Vendor X's SOC 2 scope" → (1) fetch DPA section, (2) fetch Vendor X SOC 2 summary, (3) synthesize comparison.

Agentic retrieve loop with decomposition
import json

MAX_RETRIEVE_STEPS = 4

DECOMPOSE_PROMPT = """Break this question into 1-4 search sub-queries needed to answer it.
Return JSON: {"sub_queries": ["...", "..."]}

Question: """

def agentic_retrieve(question: str, retriever, llm) -> list[dict]:
    plan = json.loads(llm.complete(DECOMPOSE_PROMPT + question))
    # Fall back to the original question if the plan is empty; cap the number of sub-queries.
    sub_queries = (plan.get("sub_queries") or [question])[:MAX_RETRIEVE_STEPS]
    evidence: list[dict] = []
    seen_ids: set[str] = set()

    def add_hits(hits: list[dict], query: str) -> None:
        # Deduplicate by chunk ID and remember which query found each chunk.
        for h in hits:
            if h["id"] not in seen_ids:
                seen_ids.add(h["id"])
                evidence.append({**h, "from_query": query})

    for sub_q in sub_queries:
        if len(evidence) >= 20:
            break
        add_hits(retriever.search(sub_q, k=5), sub_q)

    # Optional critic step: one follow-up retrieval if the step budget allows.
    critique = llm.complete(f"Evidence count {len(evidence)}. Need more retrieval for: {question}? yes/no")
    if critique.strip().lower().startswith("yes") and len(sub_queries) < MAX_RETRIEVE_STEPS:
        follow_up = llm.complete(f"One follow-up search query for: {question}").strip()
        add_hits(retriever.search(follow_up, k=3), follow_up)
    return evidence

public List<Document> agenticRetrieve(String question) {
    DecompositionPlan plan = decomposer.decompose(question);
    Set<String> seen = new HashSet<>();  // document IDs are assumed to be strings
    List<Document> evidence = new ArrayList<>();
    for (String subQ : plan.subQueries()) {
        for (Document d : retriever.search(subQ, 5)) {
            if (seen.add(d.getId())) evidence.add(d);
        }
    }
    if (critic.needsMore(question, evidence)) {
        String followUp = decomposer.followUp(question, evidence);
        retriever.search(followUp, 3).forEach(d -> { if (seen.add(d.getId())) evidence.add(d); });
    }
    return evidence;
}

⚖️ Trade-off

Agentic RAG can multiply retrieval cost 3–10× compared with advanced static pipelines. Cap steps, cache sub-query results, and route easy FAQ traffic to the single-shot path.
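
One way to act on the caching advice is a small TTL cache in front of the retriever. This is a minimal sketch, not a reference implementation: `CachedRetriever`, its `ttl_seconds` and `clock` parameters, and the normalization rule are all assumptions, and the only thing it expects from the wrapped retriever is a `search(query, k)` method like the one used in the loop above.

```python
import time
from typing import Any, Callable


class CachedRetriever:
    """Hypothetical wrapper that caches search results per (normalized query, k)."""

    def __init__(self, retriever: Any, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._retriever = retriever
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: dict[tuple[str, int], tuple[float, list[dict]]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _normalize(query: str) -> str:
        # Collapse whitespace and case so trivially different sub-queries share an entry.
        return " ".join(query.lower().split())

    def search(self, query: str, k: int = 5) -> list[dict]:
        key = (self._normalize(query), k)
        now = self._clock()
        cached = self._cache.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            self.hits += 1
            return list(cached[1])  # copy so callers cannot mutate the cached list
        self.misses += 1
        results = self._retriever.search(query, k=k)
        self._cache[key] = (now, list(results))
        return results
```

Because the wrapper keeps the same `search` signature, it can be passed wherever the plain retriever is expected. The hit and miss counters give a rough signal of how much repeated sub-query traffic the cache absorbs.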

⚠️ Pitfall

Without step limits, models can loop retrieve → vague read → retrieve indefinitely. Hard-cap tool calls and require an explicit "sufficient evidence" field in structured output.

🔬 Under the Hood

LangGraph, AutoGen, and Spring AI graphs model agentic RAG as state machines: nodes for plan, retrieve, grade, generate—with conditional edges on retrieval quality scores.

💡 Pro Tip

Log each sub-query and hit IDs. When answers fail, you see whether decomposition or retrieval—not synthesis—was wrong.
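
The logging described here can be sketched with the standard library alone. The example assumes each evidence item carries `id` and `from_query` keys, as in the retrieve loop earlier; the function names and the logger name are illustrative.

```python
import json
import logging
from collections import defaultdict

logger = logging.getLogger("agentic_rag.retrieval")


def summarize_retrieval(question: str, evidence: list[dict]) -> list[dict]:
    """Group hit IDs by the sub-query that produced them."""
    by_query: dict[str, list[str]] = defaultdict(list)
    for item in evidence:
        by_query[item.get("from_query", "<unknown>")].append(item["id"])
    return [
        {"question": question, "sub_query": q, "hit_ids": ids, "hit_count": len(ids)}
        for q, ids in by_query.items()
    ]


def log_retrieval(question: str, evidence: list[dict]) -> None:
    # One JSON line per sub-query keeps the log greppable and easy to load later.
    for record in summarize_retrieval(question, evidence):
        logger.info(json.dumps(record, sort_keys=True))
```

A sub-query with zero hits never appears in `evidence`, so logging the decomposition plan itself alongside these records makes empty retrievals visible too.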

Research agent — 7-step pattern with citations

A production research agent follows a repeatable loop: clarify scope, plan sources, retrieve, extract claims, cross-check, synthesize with citations, and surface confidence gaps. This pattern powers internal briefings, competitive intel, and support escalation summaries.

Citation format standards

Pick one citation style per product and enforce in post-processing. Common patterns: inline numeric [1] with footnotes linking to chunk viewer; or inline markdown links to signed URLs expiring in 24h. Store chunk_id, index_version, and char_span so citations survive reindex if chunk boundaries shift slightly.

| Style | Pros | Cons |
|---|---|---|
| Numeric [n] | Compact in chat UI | Requires citation panel |
| Markdown link | Clickable in exports | Long URLs in prose |
| Footnote doc id | Stable across reindex | Less readable inline |
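
To make the numeric style concrete, here is a small sketch of a citation record holding `chunk_id`, `index_version`, and `char_span`, plus a post-processing check for `[n]` markers. The `Citation` class and both helper functions are hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    chunk_id: str
    index_version: str
    char_span: tuple[int, int]  # (start, end) offsets inside the chunk text


CITATION_MARKER = re.compile(r"\[(\d+)\]")


def validate_numeric_citations(answer: str, citations: dict[int, Citation]) -> dict:
    """Report markers with no citation entry and entries that are never cited."""
    used = {int(n) for n in CITATION_MARKER.findall(answer)}
    known = set(citations)
    return {
        "dangling": sorted(used - known),  # [n] in the text with nothing to link to
        "unused": sorted(known - used),    # stored but never referenced
    }


def render_footnotes(citations: dict[int, Citation]) -> str:
    lines = []
    for n in sorted(citations):
        c = citations[n]
        start, end = c.char_span
        lines.append(f"[{n}] chunk {c.chunk_id} (index {c.index_version}, chars {start}-{end})")
    return "\n".join(lines)
```

Running the validator before rendering is one way to catch a marker that points at nothing; whether a dangling marker should block the response or only be logged is a product decision.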

Seven-step research pattern

  1. Clarify — restate question; identify required facets (time range, geography, product line).
  2. Plan sources — pick indexes (KB, web, CRM); define forbidden sources (unapproved wikis).
  3. Retrieve — run decomposed queries; prefer primary docs over summaries.
  4. Extract — pull atomic claims with source span IDs (chunk_id, page, URL).
  5. Cross-check — flag contradictions; retrieve tie-breaker evidence if needed.
  6. Synthesize — write answer with inline citations [1][2]; separate facts vs inference.
  7. Confidence — list gaps ("SOC 2 scope for Vendor X not in corpus"); suggest next searches.

| Step | Output artifact | Quality gate |
|---|---|---|
| Clarify | Structured brief JSON | Ambiguity score < threshold or ask user |
| Extract | List of {claim, source_id, quote} | Every claim has source_id |
| Synthesize | Markdown + citation map | No citation → sentence removed |
| Confidence | Gap list | User sees what was not found |

Research agent with citation map
from dataclasses import asdict, dataclass

@dataclass
class Claim:
    text: str
    source_id: str
    quote: str

def research_agent(question: str, retriever, llm) -> dict:
    # CLARIFY_PROMPT, EXTRACT_PROMPT, SYNTHESIZE_PROMPT, and GAP_PROMPT are prompt templates defined elsewhere.
    brief = llm.complete_json(CLARIFY_PROMPT, question)
    evidence = agentic_retrieve(brief["restated_question"], retriever, llm)

    claims: list[Claim] = []
    for chunk in evidence:
        extracted = llm.complete_json(EXTRACT_PROMPT, chunk["text"])
        for c in extracted.get("claims", []):
            claims.append(Claim(c["text"], chunk["id"], c["quote"]))

    synthesis = llm.complete(SYNTHESIZE_PROMPT, {
        "question": question,
        "claims": [asdict(c) for c in claims],
    })

    # Keep every supporting quote per source; keying a dict by source_id alone would keep only the last quote.
    citations: dict[str, list[str]] = {}
    for c in claims:
        citations.setdefault(c.source_id, []).append(c.quote)

    return {
        "answer": synthesis,
        "citations": citations,
        "gaps": llm.complete(GAP_PROMPT, {"question": question, "claims": len(claims)}),
    }

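public ResearchResult researchAgent(String question) {
    // Element types (Document, Claim, Contradiction) are assumed from context.
    ResearchBrief brief = clarifier.clarify(question);
    List<Document> evidence = agenticRetrieve(brief.restatement());
    List<Claim> claims = new ArrayList<>(claimExtractor.extract(evidence));
    // Cross-check: fetch tie-breaker evidence for each contradiction and extract claims from it too.
    for (Contradiction c : contradictions.check(claims)) {
        List<Document> tieBreakers = retriever.search(c.tieBreakerQuery(), 2);
        evidence.addAll(tieBreakers);
        claims.addAll(claimExtractor.extract(tieBreakers));
    }
    String answer = synthesizer.withCitations(question, claims);
    List<String> gaps = gapAnalyzer.analyze(question, claims);
    return new ResearchResult(answer, claims, gaps);
}
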
📦 Real World

Legal and compliance teams require citation maps stored with each briefing PDF export—auditors click [2] to open the exact policy chunk version indexed at answer time.

🔒 Security

Research agents can exfiltrate data if they can search unrestricted web or email indexes. Scope retriever tools per user RBAC before the plan step.

🎯 Interview Tip

Describe research agents as orchestration + citation discipline—not "GPT browses the web." Mention claim extraction, contradiction handling, and gap reporting.

💰 Cost

Seven-step research on 10+ chunks may consume 20–50k tokens. Offer "quick answer" (single-shot) vs "deep research" (full loop) UX tiers.

Deep research — Perplexity / ChatGPT pattern

Deep research products (Perplexity Deep Research, ChatGPT deep research, Gemini Deep Research) expand a topic tree: breadth-first exploration, deduplication of sources, progressive grounding checks, and a final long-form report. Understanding this pattern helps you build, or compete with, consumer-grade research UX on private corpora.

Topic tree expansion

Start from the user question as the root node. The planner generates 3–7 subtopics (breadth). Each subtopic triggers retrieval; salient findings spawn child nodes (depth) until a budget (max nodes, max tokens, max wall time) stops expansion. This is a BFS/DFS hybrid with LLM-scored priority.

| Phase | Action | Stop condition |
|---|---|---|
| Expand | Generate subtopics from frontier queue | Max depth or novelty < ε |
| Retrieve | Search per subtopic; fetch full pages if web | Top-K per node capped |
| Dedup | Canonical URL / chunk hash / embedding similarity | Skip near-duplicates > 0.92 cosine |
| Ground | Verify claims against source text | Drop unsupported bullets |
| Report | Outline → section drafts → merge | Token budget for output |

Deduplication strategies

  • URL canonicalization — strip tracking params; normalize www.
  • Content hash — SHA-256 of extracted main text.
  • Embedding cluster — merge chunks with cosine > 0.92 from same domain.
  • Temporal prefer — keep newest when sources repeat news wire stories.
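
The first three strategies can be combined in one pass. The sketch below is one possible shape for a `dedupe_hits` helper; it assumes each hit is a dict with `text` and optional `url` and `embedding` keys, uses an illustrative list of tracking parameters, and for brevity applies the cosine check across all domains rather than only within one.

```python
import hashlib
import math
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"gclid", "fbclid", "ref"}


def canonical_url(url: str) -> str:
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    kept = [
        (k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_KEYS and not k.startswith(TRACKING_PREFIXES)
    ]
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment; sort params so ordering differences do not matter.
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(sorted(kept)), ""))


def content_hash(text: str) -> str:
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cosine(a: list[float], b: list[float]) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def dedupe_hits(hits: list[dict], pool: dict[str, dict], threshold: float = 0.92) -> list[dict]:
    """Return hits that are new relative to `pool` and to each other."""
    kept: list[dict] = []
    seen_urls = {canonical_url(h["url"]) for h in pool.values() if h.get("url")}
    seen_hashes = {content_hash(h["text"]) for h in pool.values()}
    vectors = [h["embedding"] for h in pool.values() if h.get("embedding")]
    for h in hits:
        url = canonical_url(h["url"]) if h.get("url") else None
        digest = content_hash(h["text"])
        emb = h.get("embedding")
        near_dup = emb is not None and any(cosine(emb, v) > threshold for v in vectors)
        if (url and url in seen_urls) or digest in seen_hashes or near_dup:
            continue
        kept.append(h)
        if url:
            seen_urls.add(url)
        seen_hashes.add(digest)
        if emb is not None:
            vectors.append(emb)
    return kept
```

The pairwise cosine loop is quadratic in the number of pooled vectors, which is fine for a few hundred sources; larger pools would normally use the vector index itself for the near-duplicate lookup.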

Grounding checks

After draft bullets are written, a grounding model (often the same LLM with an NLI-style prompt) labels each sentence SUPPORTED / CONTRADICTED / NOT_ENOUGH_INFO against the retrieved spans. Unsupported sentences are removed or rewritten. This reduces hallucination in long reports, where synthesis drift accumulates.
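
A sentence-level grounding pass might look like the following sketch. The labeler is injected as a plain callable so the example stays independent of any LLM client; `filter_ungrounded_sentences`, `split_sentences`, and the `Labeler` type are hypothetical names, and the regex splitter is deliberately naive.

```python
import re
from typing import Callable

LABELS = {"SUPPORTED", "CONTRADICTED", "NOT_ENOUGH_INFO"}
Labeler = Callable[[str, list[str]], str]  # (sentence, source spans) -> label


def split_sentences(text: str) -> list[str]:
    # Naive split on ., !, or ? followed by whitespace; real reports need a proper tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def filter_ungrounded_sentences(draft: str, spans: list[str], label_sentence: Labeler) -> dict:
    kept, dropped = [], []
    for sentence in split_sentences(draft):
        label = label_sentence(sentence, spans)
        if label not in LABELS:
            label = "NOT_ENOUGH_INFO"  # treat malformed labeler output as unsupported
        if label == "SUPPORTED":
            kept.append(sentence)
        else:
            dropped.append({"sentence": sentence, "label": label})
    return {"report": " ".join(kept), "dropped": dropped}
```

Returning the dropped sentences with their labels, rather than discarding them silently, lets a later step rewrite CONTRADICTED sentences or surface NOT_ENOUGH_INFO items as gaps.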

Topic tree deep research loop
from collections import deque
from dataclasses import dataclass

@dataclass
class TopicNode:
    topic: str
    depth: int
    parent: str | None

def deep_research(root_question: str, retriever, llm, max_nodes: int = 24) -> dict:
    # dedupe_hits, grounding_filter, and the *_PROMPT templates are helpers defined elsewhere.
    frontier: deque[TopicNode] = deque([TopicNode(root_question, 0, None)])
    visited_topics: set[str] = set()
    source_pool: dict[str, dict] = {}  # id -> chunk

    while frontier and len(visited_topics) < max_nodes:
        node = frontier.popleft()  # FIFO order gives breadth-first expansion
        key = node.topic.lower().strip()
        if key in visited_topics:
            continue
        visited_topics.add(key)

        hits = retriever.search(node.topic, k=6)
        for h in dedupe_hits(hits, source_pool):
            source_pool[h["id"]] = h

        # Depth cap: nodes at depth 3 are searched but not expanded further.
        if node.depth < 3:
            subtopics = llm.complete_json(EXPAND_PROMPT, node.topic).get("subtopics", [])
            for st in subtopics[:4]:
                frontier.append(TopicNode(st, node.depth + 1, key))

    outline = llm.complete(OUTLINE_PROMPT, list(source_pool.values()))
    draft = llm.complete(REPORT_PROMPT, {"outline": outline, "sources": list(source_pool.values())})
    grounded = grounding_filter(draft, source_pool, llm)
    return {"report": grounded, "sources": list(source_pool.values()), "nodes_explored": len(visited_topics)}

public DeepResearchResult deepResearch(String rootQuestion) {
    // maxNodes is a configured field; element types are assumed from context.
    Queue<TopicNode> frontier = new ArrayDeque<>(List.of(new TopicNode(rootQuestion, 0, null)));
    Set<String> visited = new HashSet<>();
    Map<String, Document> sourcePool = new LinkedHashMap<>();

    while (!frontier.isEmpty() && visited.size() < maxNodes) {
        TopicNode node = frontier.poll();
        if (!visited.add(normalize(node.topic()))) continue;
        List<Document> hits = deduper.merge(sourcePool, retriever.search(node.topic(), 6));
        hits.forEach(d -> sourcePool.putIfAbsent(d.getId(), d));
        if (node.depth() < 3) {
            expander.expand(node.topic()).stream()
                .limit(4)
                .map(st -> new TopicNode(st, node.depth() + 1, node.topic()))
                .forEach(frontier::add);
        }
    }
    String draft = reporter.draft(sourcePool.values());
    return new DeepResearchResult(groundingFilter.apply(draft, sourcePool), sourcePool.values());
}

🔬 Under the Hood

Commercial deep research runs many parallel retrievals with async workers—similar to your async job queue pattern. Wall time is dominated by fetch + LLM, not vector search alone.

⚖️ Trade-off

Deep research reports take 2–15 minutes. Run as background jobs with email/Slack delivery; never block synchronous chat on full tree expansion.

⚠️ Pitfall

Deduplicating by URL alone misses paraphrased copies across domains. Add embedding-similarity dedup for paragraph-level repeats.

💡 Pro Tip

Show users the topic tree as a progress UI—it sets expectations and increases trust during long runs.

Code agent — generate, execute, fix loop

A code agent writes Python (or SQL, shell) to answer quantitative questions, transform data, or produce charts. The loop is: generate → execute in sandbox → read stderr/stdout → fix, repeated until tests pass or the step cap is reached. E2B, Docker, Modal, and vendor code interpreters implement the sandbox layer.

Code agent loop

| Stage | Input | Output |
|---|---|---|
| Generate | Question + data schema + prior errors | Python cell |
| Execute | Code + sandbox limits | stdout, stderr, artifacts |
| Fix | Error trace | Revised code |
| Answer | Successful run output | Natural language + optional chart |

Sandbox requirements

  • No network egress (or allowlist only)
  • CPU/memory/time limits per execution
  • Ephemeral filesystem; no host mounts
  • Preinstalled pandas/numpy; no arbitrary pip install in prod
  • Capture matplotlib/plotly bytes as artifacts
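
Code agent with E2B sandbox
from e2b_code_interpreter import Sandbox

MAX_FIX_ATTEMPTS = 5
DATA_PATH = "/data/input.csv"

def run_code_agent(question: str, data_csv: str, llm) -> dict:
    # CODEGEN_PROMPT, SUMMARIZE_PROMPT, and FIX_PROMPT are prompt templates defined elsewhere.
    # The first 500 characters of the CSV stand in for a schema (header plus sample rows).
    code = llm.complete(CODEGEN_PROMPT, {"question": question, "schema": data_csv[:500], "data_path": DATA_PATH})
    errors: list[str] = []

    with Sandbox(timeout=120) as sbx:
        sbx.files.write(DATA_PATH, data_csv)
        for attempt in range(MAX_FIX_ATTEMPTS):
            execution = sbx.run_code(code)
            if execution.error is None:
                return {
                    # logs.stdout is a list of output chunks; join before summarizing.
                    "answer": llm.complete(SUMMARIZE_PROMPT, "".join(execution.logs.stdout)),
                    "code": code,
                    "artifacts": execution.results,
                }
            errors.append(str(execution.error))
            if attempt < MAX_FIX_ATTEMPTS - 1:
                # Only pay for a fix when another execution attempt remains.
                code = llm.complete(FIX_PROMPT, {"code": code, "errors": errors})
    return {"error": "max_attempts", "last_code": code, "errors": errors}
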
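@Service
public class CodeAgentService {
    private static final int MAX_FIX_ATTEMPTS = 5;

    private final ChatClient llm;
    private final SandboxClient sandbox;

    // Constructor injection; SandboxClient is an application-defined wrapper, not a Spring AI type.
    public CodeAgentService(ChatClient llm, SandboxClient sandbox) {
        this.llm = llm;
        this.sandbox = sandbox;
    }

    public CodeAgentResult run(String question, String dataCsv) {
        String code = llm.prompt().user(codegenTemplate(question, dataCsv)).call().content();
        List<String> errors = new ArrayList<>();
        for (int i = 0; i < MAX_FIX_ATTEMPTS; i++) {
            ExecutionResult result = sandbox.runPython(code, dataCsv, Duration.ofSeconds(120));
            if (result.success()) {
                // Summarize through the same prompt().user().call() chain used for generation.
                String summary = llm.prompt().user(summarizeTemplate(result.stdout())).call().content();
                return new CodeAgentResult(summary, code, result.artifacts());
            }
            errors.add(result.stderr());
            code = llm.prompt().user(fixTemplate(code, errors)).call().content();
        }
        throw new CodeAgentException("max attempts exceeded");
    }
}
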
🔒 Security

Never execute model-generated code on the host OS. Sandboxes must block filesystem escape, network, and fork bombs. Treat stdout as untrusted before showing to users.

💰 Cost

Each fix attempt costs an LLM call plus sandbox time. Before execution, pre-validate generated code with static analysis (for example bandit) and enforce timeouts around each run.
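
A cheap pre-execution check can be sketched with the standard `ast` module. This is not bandit and not a security boundary (the sandbox remains mandatory); it only rejects obviously unwanted code before paying for a sandbox run. The allowlist and blocklist contents below are assumptions for illustration.

```python
import ast

ALLOWED_IMPORTS = {"pandas", "numpy", "math", "statistics", "matplotlib", "json", "datetime"}
BLOCKED_CALLS = {"eval", "exec", "compile", "__import__", "open"}


def precheck(code: str) -> list[str]:
    """Return a list of problems; an empty list means the code may be sent to the sandbox."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error on line {exc.lineno}: {exc.msg}"]

    problems: list[str] = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            modules = []
        for module in modules:
            if module.split(".")[0] not in ALLOWED_IMPORTS:
                problems.append(f"line {node.lineno}: import of '{module}' is not allowlisted")
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) and node.func.id in BLOCKED_CALLS:
            problems.append(f"line {node.lineno}: call to '{node.func.id}' is blocked")
    return problems
```

Problems found here can be fed back to the model as if they were execution errors, which saves a sandbox round trip for syntax mistakes and disallowed imports.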

📦 Real World

OpenAI Code Interpreter and Anthropic analysis tool use vendor sandboxes; self-hosted teams often standardize on E2B (Firecracker microVMs) or gVisor-wrapped containers.

🎯 Interview Tip

Code agents combine tool loops with deterministic execution—mention step caps, sandbox isolation, and validating numeric answers against retrieved facts when possible.

Agent evaluation — completion, efficiency, tools, faithfulness

Agent quality is multi-dimensional: did it finish the task, how many steps did it waste, were tools called correctly, and is the final answer grounded? Track 4 agents need offline eval harnesses before production—full methodology lives in Track 5 — Evaluation, but the metrics start here.

Core agent metrics

| Metric | Definition | How to measure |
|---|---|---|
| Task completion | User goal achieved end-to-end | LLM judge + human spot check on golden tasks |
| Step efficiency | Tool calls vs optimal path | Compare trace length to expert trajectory |
| Tool accuracy | Correct tool + valid args | Parse trace; match expected tool sequence |
| Faithfulness | Answer supported by retrieved evidence | NLI-style claim check per sentence |
| Retrieval recall | Gold doc in evidence set | Label chunk IDs in eval set |
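
Two of these metrics reduce to simple arithmetic once traces and expert trajectories are available. The formulas below are one reasonable choice, not a standard: step efficiency as the ratio of expert steps to actual steps (capped at 1.0), and tool accuracy as the share of expected tools found in order within the trace.

```python
def step_efficiency(actual_steps: int, expert_steps: int) -> float:
    """Ratio of expert trajectory length to actual trace length, capped at 1.0."""
    if actual_steps <= 0 or expert_steps <= 0:
        raise ValueError("step counts must be positive")
    return min(1.0, expert_steps / actual_steps)


def tool_sequence_match(expected: list[str], actual: list[str]) -> float:
    """Fraction of expected tool calls that appear in the trace in the expected order."""
    if not expected:
        return 1.0
    matched, position = 0, 0
    for tool in expected:
        try:
            position = actual.index(tool, position) + 1
            matched += 1
        except ValueError:
            continue  # tool missing after the current position; keep scanning for the rest
    return matched / len(expected)
```

For example, an agent that needs 8 steps where the expert trajectory needs 4 scores 0.5 on step efficiency. If a fixture allows tools in any order, a set overlap is the better fit than the ordered match shown here.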

Golden task fixtures

Build 30–100 scripted tasks per agent persona: "Research refund policy for EU enterprise tier," "Plot Q3 revenue by region from attached CSV." Each fixture specifies expected tools (order may vary), required citation sources, and forbidden actions (no web, no delete). Run in CI on prompt/graph changes.

Trace logging schema

  • trace_id, agent_version, user_tenant
  • Ordered steps: {type: tool|llm|retrieve, name, latency_ms, input_hash, output_hash}
  • Final answer + citation map + retrieval snapshot IDs
  • User feedback signal linked to trace

Agent eval harness (offline)
from dataclasses import dataclass
from statistics import mean

@dataclass
class GoldenTask:
    id: str
    question: str
    expected_tools: set[str]
    required_source_ids: set[str]
    rubric: str  # grading criteria passed to the judge

def eval_agent(agent_fn, tasks: list[GoldenTask], judge_llm) -> dict:
    results = []
    for task in tasks:
        trace = agent_fn.run_with_trace(task.question)
        tools_used = {s["name"] for s in trace.steps if s["type"] == "tool"}
        sources = set(trace.retrieval_ids)

        # Overlap scores; max(..., 1) avoids division by zero for tasks with no expectations.
        tool_score = len(task.expected_tools & tools_used) / max(len(task.expected_tools), 1)
        recall = len(task.required_source_ids & sources) / max(len(task.required_source_ids), 1)
        completion = judge_llm.score(task.question, trace.answer, task.rubric)

        results.append({
            "task_id": task.id,
            "completion": completion,
            "tool_accuracy": tool_score,
            "retrieval_recall": recall,
            "steps": len(trace.steps),
        })
    # mean() raises on an empty sequence, so guard the no-task case.
    avg = mean(r["completion"] for r in results) if results else 0.0
    return {"tasks": results, "avg_completion": avg}

public AgentEvalReport evalAgent(AgentRunner runner, List<GoldenTask> tasks) {
    List<TaskScore> scores = new ArrayList<>();
    for (GoldenTask task : tasks) {
        AgentTrace trace = runner.runWithTrace(task.question());
        Set<String> toolsUsed = trace.toolNames();
        double toolAcc = overlap(task.expectedTools(), toolsUsed);
        double recall = overlap(task.requiredSourceIds(), trace.retrievalIds());
        double completion = judge.score(task, trace.answer());
        scores.add(new TaskScore(task.id(), completion, toolAcc, recall, trace.stepCount()));
    }
    return AgentEvalReport.from(scores);
}

💡 Pro Tip

Optimize step efficiency only after task completion plateaus; shaving steps while breaking edge cases is a common regression.

⚠️ Pitfall

LLM-as-judge scoring tends to favor verbose answers. Calibrate judges against human labels monthly; use rubric checkpoints, not vibes.
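
One way to quantify calibration is Cohen's kappa between judge labels and human labels on the same sample. The sketch below assumes categorical labels such as pass/fail; the function name is illustrative.

```python
from collections import Counter


def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    judge_counts, human_counts = Counter(judge), Counter(human)
    expected = sum(judge_counts[label] * human_counts[label] for label in judge_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1.0 - expected)
```

With judge labels pass, pass, fail, fail and human labels pass, fail, fail, fail, observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5. Raw agreement alone would overstate how well the judge tracks humans when one label dominates.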

🎯 Interview Tip

Name four metrics: completion, efficiency, tool accuracy, faithfulness. Mention golden traces and linking evals to Track 5 CI gates.

Deep dive: Track 5 — Evaluation covers golden datasets, LLM judges, regression gates, and online evals.

What this looks like in production

Shipping agentic RAG means routing traffic between fast single-shot and slow research paths, enforcing budgets, and operating observability across retrieval, tools, and synthesis. This is the Track 4 capstone before Track 5 — Evaluation.

Routing: fast vs agentic path

| Signal | Route | Target latency |
|---|---|---|
| FAQ match > 0.9 confidence | Single-shot RAG | < 3 s |
| Compound / research keywords | Agentic retrieve loop | 5–30 s |
| "Deep research" explicit UX | Topic tree async job | 2–15 min |
| CSV / numeric analysis attached | Code agent sandbox | 10–60 s |

Track 4 production checklist

  • Router with logged decision reason per request
  • Hard caps: max tool calls, max retrieve steps, max tokens per trace
  • Citation map stored with every research output
  • Sandbox isolation for code agents
  • MCP allowlists for external tools (see MCP guide)
  • Golden agent tasks in CI (30+ per persona)
  • Async queue for deep research with progress notifications
  • Cost attribution per tenant: retrieval + LLM + sandbox minutes
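
The hard-caps item in the checklist can be enforced with a small per-trace budget object. This is a sketch with assumed names (`TraceBudget`, `BudgetExceeded`) and illustrative default limits; real limits would come from tenant configuration.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a trace would exceed one of its hard caps."""


@dataclass
class TraceBudget:
    max_tool_calls: int = 8
    max_retrieve_steps: int = 4
    max_tokens: int = 50_000
    tool_calls: int = 0
    retrieve_steps: int = 0
    tokens: int = 0

    def charge(self, *, tool_calls: int = 0, retrieve_steps: int = 0, tokens: int = 0) -> None:
        # Check before committing so counters never overshoot the cap.
        checks = [
            ("tool_calls", self.tool_calls + tool_calls, self.max_tool_calls),
            ("retrieve_steps", self.retrieve_steps + retrieve_steps, self.max_retrieve_steps),
            ("tokens", self.tokens + tokens, self.max_tokens),
        ]
        for name, proposed, cap in checks:
            if proposed > cap:
                raise BudgetExceeded(f"{name}: {proposed} > {cap}")
        self.tool_calls += tool_calls
        self.retrieve_steps += retrieve_steps
        self.tokens += tokens
```

Calling `charge` before each tool or LLM call turns a runaway loop into a catchable exception, at which point the agent can be forced to synthesize from the evidence it already has.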

Observability dashboard

  • p95 latency by route (fast / agentic / deep / code)
  • Tool error rate by server name
  • Avg steps per successful task
  • Faithfulness score sample (offline daily job)
  • Retrieval recall@K on labeled slice
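
Production agent router
import hashlib
from pathlib import Path

def route_and_run(question: str, attachments: list[str], cfg) -> dict:
    # classifier, retriever, llm, log, and the handler functions are module-level dependencies.
    route = "fast"
    if cfg.deep_research_phrase in question.lower():
        route = "deep"
    elif any(a.lower().endswith(".csv") for a in attachments):
        route = "code"
    elif classifier.needs_agentic_rag(question):
        route = "agentic"

    # Built-in hash() is salted per process; SHA-256 gives a stable key for joining logs.
    question_hash = hashlib.sha256(question.encode("utf-8")).hexdigest()[:16]
    log.info("route_selected", route=route, question_hash=question_hash)

    def run_code() -> dict:
        # Attachments are assumed to be local file paths; the code agent expects CSV text.
        csv_path = next(a for a in attachments if a.lower().endswith(".csv"))
        return run_code_agent(question, Path(csv_path).read_text(), llm)

    handlers = {
        "fast": lambda: single_shot_rag(question),
        "agentic": lambda: research_agent(question, retriever, llm),
        "deep": lambda: enqueue_deep_research(question),
        "code": run_code,
    }
    return {"route": route, "result": handlers[route]()}
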
public AgentResponse routeAndRun(AgentRequest request) {
    Route route = router.classify(request);
    metrics.counter("agent.route", "type", route.name()).increment();
    return switch (route) {
        case FAST -> fastRag.answer(request);
        case AGENTIC -> researchAgent.run(request);
        case DEEP -> deepResearchQueue.enqueue(request);
        case CODE -> codeAgent.run(request);
    };
}
📦 Real World

Perplexity-style products choose the route implicitly; enterprise copilots label modes: "Quick" vs "Research" vs "Analyze data" so users understand latency and cost.

💰 Cost

Agentic traffic may be 5–20% of queries but 40–70% of LLM spend. Monitor spend by route and adjust router thresholds quarterly.

🔒 Security

Deep research and code agents carry high exfiltration risk. Enforce RBAC on retriever scopes and sandbox network policy per tenant tier.

Combining agentic RAG with MCP tools

Retrieval does not have to live in-process. Expose search_kb, fetch_chunk, and search_web as MCP tools so research agents discover them alongside CRM and calendar servers. The orchestrator merges allowlists; the model picks retrieval tools per step. See MCP explained for server patterns.

| Agent mode | Typical tools | Max steps |
|---|---|---|
| Research | search_kb, fetch_chunk, cite_source | 6–10 |
| Deep research | search_web, fetch_url, dedupe_check | 20–40 (async) |
| Code analysis | run_python, load_csv, plot | 5–8 |
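
The allowlist merge can be sketched as a set intersection: a tool is offered to the model only if the mode permits it, the tenant permits it, and some server actually exposes it. Reading "merge" as intersection is an assumption, as are the function names; the tool names and upper step bounds come from the table above.

```python
MODE_POLICY = {
    # Tool names and the upper step bounds are taken from the table above.
    "research": {"tools": {"search_kb", "fetch_chunk", "cite_source"}, "max_steps": 10},
    "deep_research": {"tools": {"search_web", "fetch_url", "dedupe_check"}, "max_steps": 40},
    "code_analysis": {"tools": {"run_python", "load_csv", "plot"}, "max_steps": 8},
}


def allowed_tools(mode: str, tenant_allowlist: set[str], discovered: set[str]) -> set[str]:
    """Intersect what the mode permits, what the tenant permits, and what servers expose."""
    policy = MODE_POLICY[mode]
    return policy["tools"] & tenant_allowlist & discovered


def check_call(mode: str, tool: str, step: int, permitted: set[str]) -> None:
    if tool not in permitted:
        raise PermissionError(f"tool '{tool}' is not allowed in mode '{mode}'")
    if step >= MODE_POLICY[mode]["max_steps"]:
        raise RuntimeError(f"step cap reached for mode '{mode}'")
```

Checking every call against the permitted set, rather than trusting the tool list sent to the model, guards against a model that names a tool it was never offered.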

Failure modes and mitigations

| Failure | Symptom | Mitigation |
|---|---|---|
| Retrieve loop | Same query repeated 5+ times | Query dedup set; force synthesize after cap |
| Citation drift | [3] points to wrong chunk | Stable chunk IDs; validate map before render |
| Over-research | 30 sources, shallow summary | Max sources; require outline before report |
| Code hallucination | Prints answer without running code | Require execution artifact for numeric claims |

Step cap with forced synthesis
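import json

def agent_loop(question: str, tools, llm, max_steps: int = 8):
    # llm.chat and run_tool are assumed helpers; run_tool returns a tool-role message for the call.
    messages = [{"role": "user", "content": question}]
    queries_seen: set[str] = set()
    for step in range(max_steps):
        resp = llm.chat(messages, tools=tools)
        if not resp.tool_calls:
            return resp.content
        # Record the assistant turn so each tool result below has a matching tool call.
        messages.append({"role": "assistant", "content": resp.content, "tool_calls": resp.tool_calls})
        for call in resp.tool_calls:
            if call.name == "search_kb":
                q = json.loads(call.arguments)["query"].strip().lower()
                if q in queries_seen:
                    # Answer the duplicate call instead of skipping it, so no tool call is left unanswered.
                    messages.append({"role": "tool", "tool_call_id": call.id,
                                     "content": "Duplicate query. Synthesize now with existing evidence."})
                    continue
                queries_seen.add(q)
            messages.append(run_tool(call))
    return llm.chat(messages + [{"role": "user", "content": "Step limit reached. Answer with citations and list gaps."}]).content
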
public String agentLoop(String question, int maxSteps) {
    List<Message> messages = new ArrayList<>(List.of(new UserMessage(question)));
    Set<String> queriesSeen = new HashSet<>();
    for (int step = 0; step < maxSteps; step++) {
        ChatResponse resp = chatClient.prompt().messages(messages).tools(tools).call();
        if (!resp.hasToolCalls()) return resp.getResult().getOutput().getContent();
        for (ToolCall call : resp.getToolCalls()) {
            if ("search_kb".equals(call.name())) {
                String q = parseQuery(call.arguments());
                if (!queriesSeen.add(q)) {
                    messages.add(new UserMessage("Duplicate query. Synthesize with existing evidence."));
                    continue;
                }
            }
            messages.add(ToolResponseMessage.from(call.id(), execute(call)));
        }
    }
    messages.add(new UserMessage("Step limit reached. Answer with citations and list gaps."));
    return chatClient.prompt().messages(messages).call().content();
}

Cross-track references

💡 Pro Tip

Start agentic RAG on the internal KB only (no web) until citation accuracy exceeds 90% on the golden set, then add web tools with stricter grounding.

Rollout phases

  1. Phase 1 — Agentic retrieve on internal KB for internal power users; log traces only.
  2. Phase 2 — Enable research mode with citations in UI; golden eval gate in CI.
  3. Phase 3 — Deep research async jobs for analyst persona; cost caps per tenant.
  4. Phase 4 — Code agent for CSV attachments; sandbox in isolated VPC.

Each phase has explicit exit criteria tied to Track 5 metrics before the next phase ships to all tenants.

Pair this guide with Advanced RAG patterns for Self-RAG grading concepts and with Multi-agent orchestration when research sub-agents run in parallel.