Agentic RAG & research agents
Single-shot RAG fails on multi-hop questions, open-ended research, and quantitative analysis over user data. Agentic RAG puts the model in control of when and how to retrieve—often in loops—with query decomposition, citation discipline, and optional code execution. This guide covers iterative retrieval, the 7-step research pattern, deep-research topic trees, sandboxed code agents, evaluation metrics, and production routing.
After reading, you should be able to: decide when agentic retrieval beats static pipelines; implement query decomposition and research loops with citations; build deep-research topic expansion with dedup and grounding; run generate→execute→fix code agent loops in sandboxes; measure task completion, step efficiency, tool accuracy, and faithfulness; and route production traffic across fast, agentic, deep, and code paths.
Agentic RAG — when and how to retrieve iteratively
Agentic RAG gives the LLM control over retrieval: it decides whether to search, which index or tool to use, and when it has enough evidence—often across multiple loops. This beats single-shot RAG on multi-hop and research questions, at the cost of latency, token spend, and eval complexity.
Grading retrieved evidence before synthesis
Agentic pipelines often add a grader node after each retrieval batch: a lightweight LLM or cross-encoder scores whether chunks are relevant and sufficient. Low scores trigger reformulation; high scores skip further retrieval. This pattern—used in Self-RAG and LangGraph tutorials—prevents both under- and over-retrieval.
| Grade | Action |
|---|---|
| Relevant + sufficient | Proceed to synthesis |
| Relevant + insufficient | Decompose further; new sub-query |
| Irrelevant | Rewrite query; switch index (BM25 vs vector) |
| Contradictory | Fetch tie-breaker; flag in answer |
```python
import json

# Literal JSON braces are doubled so str.format() only fills {question} and {chunks}.
GRADE_PROMPT = """Score retrieval for the question.
Return JSON: {{"relevant": bool, "sufficient": bool, "follow_up_query": str|null}}
Question: {question}
Chunks: {chunks}"""


def grade_retrieval(question: str, chunks: list[dict], llm) -> dict:
    # Grade only the top 5 chunks to keep the grader call cheap.
    raw = llm.complete(GRADE_PROMPT.format(question=question, chunks=chunks[:5]))
    grade = json.loads(raw)
    if grade.get("relevant") and grade.get("sufficient"):
        return {"action": "synthesize", "chunks": chunks}
    if grade.get("follow_up_query"):
        return {"action": "retrieve_again", "query": grade["follow_up_query"]}
    return {"action": "rewrite", "query": question}
```
```java
public GradeDecision gradeRetrieval(String question, List<Document> chunks) {
    // Grade only the top 5 chunks to keep the grader call cheap.
    RetrievalGrade grade = graderLlm.grade(question, chunks.subList(0, Math.min(5, chunks.size())));
    if (grade.relevant() && grade.sufficient()) {
        return GradeDecision.synthesize(chunks);
    }
    if (grade.followUpQuery() != null) {
        return GradeDecision.retrieveAgain(grade.followUpQuery());
    }
    return GradeDecision.rewrite(question);
}
```
Single-shot vs agentic retrieval
| Pattern | Flow | Best for | Risk |
|---|---|---|---|
| Naive RAG | embed → search once → answer | FAQ, simple policy lookup | Misses multi-hop evidence |
| Advanced RAG | rewrite → hybrid → rerank → answer | Enterprise KB with tuned pipelines | Fixed strategy regardless of question |
| Agentic RAG | plan → retrieve → read → maybe retrieve again | Research, analytics, compound questions | Runaway loops, higher cost |
When to retrieve iteratively
- Multi-hop — answer needs entity A before you can query for relationship B.
- Unknown corpus scope — user asks "what changed in Q3 releases?" without naming products.
- Tool selection — hybrid of vector DB, SQL, web search, and calculator.
- Self-correction — first retrieval set lacks supporting evidence; agent reformulates query.
Query decomposition
Decomposition splits a compound question into ordered or parallel sub-queries. Unlike multi-query expansion (same intent, different phrasing), decomposition assumes different information needs: "Compare our DPA with Vendor X's SOC 2 scope" → (1) fetch DPA section, (2) fetch Vendor X SOC 2 summary, (3) synthesize comparison.
```python
import json

MAX_RETRIEVE_STEPS = 4
MAX_EVIDENCE = 20

DECOMPOSE_PROMPT = """Break this question into 1-4 search sub-queries needed to answer it.
Return JSON: {"sub_queries": ["...", "..."]}
Question: """


def agentic_retrieve(question: str, retriever, llm) -> list[dict]:
    plan = json.loads(llm.complete(DECOMPOSE_PROMPT + question))
    # Fall back to the raw question if the planner returns nothing; cap the plan length.
    sub_queries = (plan.get("sub_queries") or [question])[:MAX_RETRIEVE_STEPS]
    evidence: list[dict] = []
    seen_ids: set[str] = set()

    def collect(query: str, k: int) -> None:
        # Add only unseen hits and remember which query produced each one.
        for h in retriever.search(query, k=k):
            if h["id"] not in seen_ids:
                seen_ids.add(h["id"])
                evidence.append({**h, "from_query": query})

    for sub_q in sub_queries:
        if len(evidence) >= MAX_EVIDENCE:
            break
        collect(sub_q, k=5)

    # Optional critic step: one follow-up retrieval if a step remains in the budget.
    if len(sub_queries) < MAX_RETRIEVE_STEPS:
        critique = llm.complete(
            f"Evidence count {len(evidence)}. Need more retrieval for: {question}? yes/no"
        )
        if critique.strip().lower().startswith("yes"):
            follow_up = llm.complete(f"One follow-up search query for: {question}")
            collect(follow_up.strip(), k=3)
    return evidence
```
```java
public List<Document> agenticRetrieve(String question) {
    DecompositionPlan plan = decomposer.decompose(question);
    Set<String> seen = new HashSet<>();
    List<Document> evidence = new ArrayList<>();
    for (String subQ : plan.subQueries()) {
        for (Document d : retriever.search(subQ, 5)) {
            if (seen.add(d.getId())) evidence.add(d);  // Set.add returns false for duplicates
        }
    }
    if (critic.needsMore(question, evidence)) {
        String followUp = decomposer.followUp(question, evidence);
        retriever.search(followUp, 3).forEach(d -> { if (seen.add(d.getId())) evidence.add(d); });
    }
    return evidence;
}
```
Agentic RAG can cost 3–10× as much in retrieval as an advanced static pipeline. Cap steps, cache sub-query results, and route easy FAQ traffic to the single-shot path.
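The cap-and-cache advice can be sketched as a thin wrapper around the retriever. This is a minimal illustration, assuming a retriever object that exposes `search(query, k)`; `CachedRetriever` and `FakeRetriever` are hypothetical names used only for this example.

```python
class CachedRetriever:
    """Wraps a retriever; caches results per normalized (query, k) and caps backend calls."""

    def __init__(self, retriever, max_calls: int = 4):
        self.retriever = retriever
        self.max_calls = max_calls
        self.calls = 0
        self._cache: dict[tuple[str, int], list[dict]] = {}

    def search(self, query: str, k: int = 5) -> list[dict]:
        # Normalize case and whitespace so trivially different phrasings share a cache entry.
        key = (" ".join(query.lower().split()), k)
        if key in self._cache:
            return self._cache[key]      # cache hit: no backend call
        if self.calls >= self.max_calls:
            return []                    # budget exhausted: caller should synthesize
        self.calls += 1
        hits = self.retriever.search(query, k=k)
        self._cache[key] = hits
        return hits


class FakeRetriever:
    """Stand-in backend for the example (hypothetical); returns one synthetic hit."""

    def search(self, query: str, k: int = 5) -> list[dict]:
        return [{"id": query, "text": query}]


cached = CachedRetriever(FakeRetriever(), max_calls=2)
cached.search("refund policy EU")
cached.search("Refund   policy eu")   # normalized duplicate, served from cache
cached.search("SOC 2 scope")
over_budget = cached.search("third distinct query")
```

Returning an empty list once the cap is reached is one possible policy; raising an exception that forces synthesis would work equally well.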
Without step limits, models loop retrieve→vague read→retrieve forever. Hard cap tool calls and require explicit "sufficient evidence" structured output.
LangGraph, AutoGen, and Spring AI graphs model agentic RAG as state machines: nodes for plan, retrieve, grade, generate—with conditional edges on retrieval quality scores.
Log each sub-query and hit IDs. When answers fail, you see whether decomposition or retrieval—not synthesis—was wrong.
Research agent — 7-step pattern with citations
A production research agent follows a repeatable loop: clarify scope, plan sources, retrieve, extract claims, cross-check, synthesize with citations, and surface confidence gaps. This pattern powers internal briefings, competitive intel, and support escalation summaries.
Citation format standards
Pick one citation style per product and enforce in post-processing. Common patterns: inline numeric [1] with footnotes linking to chunk viewer; or inline markdown links to signed URLs expiring in 24h. Store chunk_id, index_version, and char_span so citations survive reindex if chunk boundaries shift slightly.
| Style | Pros | Cons |
|---|---|---|
| Numeric [n] | Compact in chat UI | Requires citation panel |
| Markdown link | Clickable in exports | Long URLs in prose |
| Footnote doc id | Stable across reindex | Less readable inline |
Seven-step research pattern
- Clarify — restate question; identify required facets (time range, geography, product line).
- Plan sources — pick indexes (KB, web, CRM); define forbidden sources (unapproved wikis).
- Retrieve — run decomposed queries; prefer primary docs over summaries.
- Extract — pull atomic claims with source span IDs (chunk_id, page, URL).
- Cross-check — flag contradictions; retrieve tie-breaker evidence if needed.
- Synthesize — write answer with inline citations [1][2]; separate facts vs inference.
- Confidence — list gaps ("SOC 2 scope for Vendor X not in corpus"); suggest next searches.
| Step | Output artifact | Quality gate |
|---|---|---|
| Clarify | Structured brief JSON | Ambiguity score < threshold or ask user |
| Extract | List of {claim, source_id, quote} | Every claim has source_id |
| Synthesize | Markdown + citation map | No citation → sentence removed |
| Confidence | Gap list | User sees what was not found |
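The Synthesize gate ("no citation, sentence removed") can be approximated in post-processing. The sketch below assumes numeric `[n]` markers placed before the sentence's final punctuation and uses a naive regex sentence splitter, which will mis-split abbreviations such as "e.g."; `enforce_citations` is a hypothetical helper name.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def enforce_citations(answer: str, citation_map: dict[int, str]) -> tuple[str, list[str]]:
    """Keep sentences whose citations all resolve; return (kept text, dropped sentences)."""
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    kept, dropped = [], []
    for s in sentences:
        cited = [int(n) for n in CITATION.findall(s)]
        if cited and all(n in citation_map for n in cited):
            kept.append(s)
        else:
            dropped.append(s)   # uncited, or cites a number missing from the map
    return " ".join(kept), dropped


answer = "Refunds take 30 days [1]. Pricing changed in Q3. Support is 24/7 [9]."
clean, dropped = enforce_citations(answer, {1: "chunk-17"})
```

Logging the dropped sentences alongside the trace makes it easier to tell whether the synthesizer or the extractor lost the source.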
```python
from dataclasses import asdict, dataclass


@dataclass
class Claim:
    text: str
    source_id: str
    quote: str


def research_agent(question: str, retriever, llm) -> dict:
    # CLARIFY_PROMPT, EXTRACT_PROMPT, SYNTHESIZE_PROMPT, and GAP_PROMPT are defined elsewhere.
    brief = llm.complete_json(CLARIFY_PROMPT, question)
    evidence = agentic_retrieve(brief["restated_question"], retriever, llm)

    # Extract atomic claims, each tied to the chunk it came from.
    claims: list[Claim] = []
    for chunk in evidence:
        extracted = llm.complete_json(EXTRACT_PROMPT, chunk["text"])
        for c in extracted.get("claims", []):
            claims.append(Claim(c["text"], chunk["id"], c["quote"]))

    synthesis = llm.complete(SYNTHESIZE_PROMPT, {
        "question": question,
        "claims": [asdict(c) for c in claims],
    })

    # Group quotes per source so several claims from one chunk are all kept.
    citations: dict[str, list[str]] = {}
    for c in claims:
        citations.setdefault(c.source_id, []).append(c.quote)

    return {
        "answer": synthesis,
        "citations": citations,
        "gaps": llm.complete(GAP_PROMPT, {"question": question, "claims": len(claims)}),
    }
```
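```java
public ResearchResult researchAgent(String question) {
    ResearchBrief brief = clarifier.clarify(question);
    List<Document> evidence = agenticRetrieve(brief.restatement());
    List<Claim> claims = new ArrayList<>(claimExtractor.extract(evidence));
    // Fetch tie-breaker evidence for each contradiction and extract claims from it too,
    // so the synthesizer actually sees the tie-breakers.
    List<Document> tieBreakers = new ArrayList<>();
    contradictions.check(claims)
        .forEach(c -> tieBreakers.addAll(retriever.search(c.tieBreakerQuery(), 2)));
    claims.addAll(claimExtractor.extract(tieBreakers));
    String answer = synthesizer.withCitations(question, claims);
    List<String> gaps = gapAnalyzer.analyze(question, claims);
    return new ResearchResult(answer, claims, gaps);
}
```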
Legal and compliance teams require citation maps stored with each briefing PDF export—auditors click [2] to open the exact policy chunk version indexed at answer time.
Research agents can exfiltrate data if they are allowed to search unrestricted web or email indexes. Scope retriever tools to each user's RBAC permissions before the plan step.
Describe research agents as orchestration + citation discipline—not "GPT browses the web." Mention claim extraction, contradiction handling, and gap reporting.
Seven-step research on 10+ chunks may consume 20–50k tokens. Offer "quick answer" (single-shot) vs "deep research" (full loop) UX tiers.
Deep research — Perplexity / ChatGPT pattern
Deep research products (Perplexity Deep Research, ChatGPT deep research, Gemini Deep Research) expand a topic tree: breadth-first exploration, deduplication of sources, progressive grounding checks, and a final long-form report. Understanding this pattern helps you build (or compete with) consumer-grade research UX on private corpora.
Topic tree expansion
Start from the user question as the root node. The planner generates 3–7 subtopics (breadth). Each subtopic triggers retrieval; salient findings spawn child nodes (depth) until a budget (max nodes, max tokens, max wall time) stops expansion. This is a BFS/DFS hybrid with LLM-scored priority.
| Phase | Action | Stop condition |
|---|---|---|
| Expand | Generate subtopics from frontier queue | Max depth or novelty < ε |
| Retrieve | Search per subtopic; fetch full pages if web | Top-K per node capped |
| Dedup | Canonical URL / chunk hash / embedding similarity | Skip near-duplicates > 0.92 cosine |
| Ground | Verify claims against source text | Drop unsupported bullets |
| Report | Outline → section drafts → merge | Token budget for output |
Deduplication strategies
- URL canonicalization — strip tracking params; normalize www.
- Content hash — SHA-256 of extracted main text.
- Embedding cluster — merge chunks with cosine > 0.92 from same domain.
- Temporal prefer — keep newest when sources repeat news wire stories.
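The first two strategies in the list need only the standard library. The sketch below is a minimal illustration: the tracking-parameter list is an assumption to be extended per corpus, and `canonical_url`, `content_hash`, and `dedupe` are hypothetical helper names. Embedding-based and temporal dedup are not shown.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"fbclid", "gclid"}   # assumed list; extend for your sources


def canonical_url(url: str) -> str:
    """Lowercase scheme/host, drop 'www.', tracking params, fragment, and trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    query = [(k, v) for k, v in parse_qsl(parts.query)
             if k not in TRACKING_KEYS and not k.startswith(TRACKING_PREFIXES)]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), host, path, urlencode(sorted(query)), ""))


def content_hash(text: str) -> str:
    """SHA-256 of whitespace-normalized, lowercased main text."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(docs: list[dict]) -> list[dict]:
    """Keep the first document for each canonical URL and each content hash."""
    seen_urls, seen_hashes, unique = set(), set(), []
    for d in docs:
        u, h = canonical_url(d["url"]), content_hash(d["text"])
        if u in seen_urls or h in seen_hashes:
            continue
        seen_urls.add(u)
        seen_hashes.add(h)
        unique.append(d)
    return unique


docs = [
    {"url": "https://www.Example.com/news/?utm_source=x&id=7", "text": "Rates rise again."},
    {"url": "https://example.com/news?id=7", "text": "Rates rise again (updated)."},
    {"url": "https://mirror.example.org/a", "text": "Rates  rise\nagain."},
    {"url": "https://example.com/other", "text": "Unrelated story."},
]
unique_docs = dedupe(docs)
```

The second document is dropped by URL and the third by content hash, so only the first and last survive.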
Grounding checks
After draft bullets, a grounding model (often same LLM with NLI-style prompt) labels each sentence SUPPORTED / CONTRADICTED / NOT_ENOUGH_INFO against retrieved spans. Unsupported sentences are removed or rewritten. This reduces hallucination in long reports where synthesis drift accumulates.
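A minimal sketch of the filtering step, with the labeler passed in as a function so an LLM or NLI model can be plugged in. The keep/drop policy (keep a sentence only if some span supports it and none contradicts it) is an assumption, and `toy_label` is a word-overlap stand-in for a real grounding model, included only so the example runs.

```python
from typing import Callable


def filter_grounded(
    sentences: list[str],
    spans: list[str],
    label_fn: Callable[[str, str], str],
) -> tuple[list[str], list[tuple[str, str]]]:
    """Return (kept sentences, [(removed sentence, reason)])."""
    kept, removed = [], []
    for sentence in sentences:
        labels = {label_fn(sentence, span) for span in spans}
        if "CONTRADICTED" in labels:
            removed.append((sentence, "CONTRADICTED"))
        elif "SUPPORTED" in labels:
            kept.append(sentence)
        else:
            removed.append((sentence, "NOT_ENOUGH_INFO"))
    return kept, removed


def toy_label(sentence: str, span: str) -> str:
    """Hypothetical stand-in: SUPPORTED only if every word of the sentence is in the span."""
    words = set(sentence.lower().rstrip(".").split())
    span_words = set(span.lower().rstrip(".").split())
    return "SUPPORTED" if words <= span_words else "NOT_ENOUGH_INFO"


spans = ["the retention period is 90 days for audit logs"]
draft = ["The retention period is 90 days.", "Logs are encrypted at rest."]
kept, removed = filter_grounded(draft, spans, toy_label)
```

Returning the removal reason lets the report writer decide whether to rewrite a sentence or list it as a gap.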
```python
from collections import deque
from dataclasses import dataclass


@dataclass
class TopicNode:
    topic: str
    depth: int
    parent: str | None


def deep_research(root_question: str, retriever, llm, max_nodes: int = 24) -> dict:
    # dedupe_hits, grounding_filter, and the *_PROMPT constants are defined elsewhere.
    frontier: deque[TopicNode] = deque([TopicNode(root_question, 0, None)])
    visited_topics: set[str] = set()
    source_pool: dict[str, dict] = {}  # id -> chunk

    while frontier and len(visited_topics) < max_nodes:
        node = frontier.popleft()  # FIFO order gives breadth-first expansion
        key = node.topic.lower().strip()
        if key in visited_topics:
            continue
        visited_topics.add(key)

        hits = retriever.search(node.topic, k=6)
        for h in dedupe_hits(hits, source_pool):
            source_pool[h["id"]] = h

        # Expand at most 4 subtopics per node, down to depth 3.
        if node.depth < 3:
            subtopics = llm.complete_json(EXPAND_PROMPT, node.topic).get("subtopics", [])
            for st in subtopics[:4]:
                frontier.append(TopicNode(st, node.depth + 1, key))

    sources = list(source_pool.values())
    outline = llm.complete(OUTLINE_PROMPT, sources)
    draft = llm.complete(REPORT_PROMPT, {"outline": outline, "sources": sources})
    grounded = grounding_filter(draft, source_pool, llm)
    return {"report": grounded, "sources": sources, "nodes_explored": len(visited_topics)}
```
```java
public DeepResearchResult deepResearch(String rootQuestion) {
    Queue<TopicNode> frontier = new ArrayDeque<>(List.of(new TopicNode(rootQuestion, 0, null)));
    Set<String> visited = new HashSet<>();
    Map<String, Document> sourcePool = new LinkedHashMap<>();
    while (!frontier.isEmpty() && visited.size() < maxNodes) {
        TopicNode node = frontier.poll();
        if (!visited.add(normalize(node.topic()))) continue;  // skip already-explored topics
        List<Document> hits = deduper.merge(sourcePool, retriever.search(node.topic(), 6));
        hits.forEach(d -> sourcePool.putIfAbsent(d.getId(), d));
        // Expand at most 4 subtopics per node, down to depth 3.
        if (node.depth() < 3) {
            expander.expand(node.topic()).stream()
                .limit(4)
                .map(st -> new TopicNode(st, node.depth() + 1, node.topic()))
                .forEach(frontier::add);
        }
    }
    String draft = reporter.draft(sourcePool.values());
    return new DeepResearchResult(groundingFilter.apply(draft, sourcePool), sourcePool.values());
}
```
Commercial deep research runs many parallel retrievals with async workers—similar to your async job queue pattern. Wall time is dominated by fetch + LLM, not vector search alone.
Deep research reports take 2–15 minutes. Run as background jobs with email/Slack delivery; never block synchronous chat on full tree expansion.
Dedup only by URL misses paraphrased plagiarism across domains. Add embedding similarity dedup for paragraph-level repeats.
Show users the topic tree as a progress UI—it sets expectations and increases trust during long runs.
Code agent — generate, execute, fix loop
A code agent writes Python (or SQL, shell) to answer quantitative questions, transform data, or produce charts. The loop is: generate → execute in sandbox → read stderr/stdout → fix, repeating until the code runs cleanly or the step cap is reached. E2B, Docker, Modal, and vendor code interpreters implement the sandbox layer.
Code agent loop
| Stage | Input | Output |
|---|---|---|
| Generate | Question + data schema + prior errors | Python cell |
| Execute | Code + sandbox limits | stdout, stderr, artifacts |
| Fix | Error trace | Revised code |
| Answer | Successful run output | Natural language + optional chart |
Sandbox requirements
- No network egress (or allowlist only)
- CPU/memory/time limits per execution
- Ephemeral filesystem; no host mounts
- Preinstalled pandas/numpy; no arbitrary pip install in prod
- Capture matplotlib/plotly bytes as artifacts
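```python
from e2b_code_interpreter import Sandbox

MAX_FIX_ATTEMPTS = 5


def run_code_agent(question: str, data_csv: str, llm) -> dict:
    # CODEGEN_PROMPT, FIX_PROMPT, and SUMMARIZE_PROMPT are defined elsewhere.
    # Only the first 500 characters (header plus a few rows) go to the model as a schema hint.
    code = llm.complete(CODEGEN_PROMPT, {"question": question, "schema": data_csv[:500]})
    errors: list[str] = []
    with Sandbox(timeout=120) as sbx:
        sbx.files.write("/data/input.csv", data_csv)
        for attempt in range(MAX_FIX_ATTEMPTS):
            execution = sbx.run_code(code)
            if execution.error is None:
                # logs.stdout is a list of output chunks; join before summarizing.
                stdout = "".join(execution.logs.stdout)
                return {
                    "answer": llm.complete(SUMMARIZE_PROMPT, stdout),
                    "code": code,
                    "artifacts": execution.results,
                }
            # Feed the full error history back so the model does not repeat a failed fix.
            errors.append(str(execution.error))
            code = llm.complete(FIX_PROMPT, {"code": code, "errors": errors})
    return {"error": "max_attempts", "last_code": code, "errors": errors}
```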
```java
@Service
public class CodeAgentService {
    private static final int MAX_FIX_ATTEMPTS = 5;

    private final ChatClient llm;
    private final SandboxClient sandbox;

    public CodeAgentService(ChatClient llm, SandboxClient sandbox) {
        this.llm = llm;
        this.sandbox = sandbox;
    }

    public CodeAgentResult run(String question, String dataCsv) {
        String code = llm.prompt().user(codegenTemplate(question, dataCsv)).call().content();
        List<String> errors = new ArrayList<>();
        for (int i = 0; i < MAX_FIX_ATTEMPTS; i++) {
            ExecutionResult result = sandbox.runPython(code, dataCsv, Duration.ofSeconds(120));
            if (result.success()) {
                // summarizeTemplate is a prompt helper like codegenTemplate and fixTemplate.
                String answer = llm.prompt().user(summarizeTemplate(result.stdout())).call().content();
                return new CodeAgentResult(answer, code, result.artifacts());
            }
            errors.add(result.stderr());
            code = llm.prompt().user(fixTemplate(code, errors)).call().content();
        }
        throw new CodeAgentException("max attempts exceeded");
    }
}
```
Never execute model-generated code on the host OS. Sandboxes must block filesystem escape, network, and fork bombs. Treat stdout as untrusted before showing to users.
Each fix attempt is an LLM call + sandbox minute. Pre-validate with static analysis (bandit, timeout wrappers) before execution.
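A cheap pre-check can reject obviously bad code before it costs a sandbox run. The sketch below uses the standard-library `ast` module rather than bandit, and the blocklists are illustrative assumptions; it is a cost filter, not a security boundary, and does not replace sandbox isolation. `precheck` is a hypothetical helper name.

```python
import ast

BLOCKED_MODULES = {"os", "subprocess", "socket", "shutil", "ctypes"}   # assumed list
BLOCKED_CALLS = {"eval", "exec", "__import__"}                         # assumed list


def precheck(code: str) -> list[str]:
    """Return a list of problems; an empty list means the code may go to the sandbox."""
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return [f"syntax error: {e.msg} (line {e.lineno})"]

    problems: list[str] = []
    for node in ast.walk(tree):
        # Collect top-level module names from both import forms.
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            names = []
        problems += [f"blocked import: {n}" for n in names if n in BLOCKED_MODULES]

        # Flag direct calls to dynamic-execution builtins.
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in BLOCKED_CALLS):
            problems.append(f"blocked call: {node.func.id}")
    return problems


ok = precheck("import pandas as pd\nprint(pd.__name__)")
bad = precheck("import os\nos.system('ls')")
broken = precheck("def f(:")
```

Problems found here can be fed to the fix prompt directly, which saves a sandbox execution for each syntax error.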
OpenAI Code Interpreter and Anthropic analysis tool use vendor sandboxes; self-hosted teams standardize on E2B Firecracker microVMs or gVisor-wrapped containers.
Code agents combine tool loops with deterministic execution—mention step caps, sandbox isolation, and validating numeric answers against retrieved facts when possible.
Agent evaluation — completion, efficiency, tools, faithfulness
Agent quality is multi-dimensional: did it finish the task, how many steps did it waste, were tools called correctly, and is the final answer grounded? Track 4 agents need offline eval harnesses before production—full methodology lives in Track 5 — Evaluation, but the metrics start here.
Core agent metrics
| Metric | Definition | How to measure |
|---|---|---|
| Task completion | User goal achieved end-to-end | LLM judge + human spot check on golden tasks |
| Step efficiency | Tool calls vs optimal path | Compare trace length to expert trajectory |
| Tool accuracy | Correct tool + valid args | Parse trace; match expected tool sequence |
| Faithfulness | Answer supported by retrieved evidence | NLI-style claim check per sentence |
| Retrieval recall | Gold doc in evidence set | Label chunk IDs in eval set |
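Three of the metrics in the table reduce to small ratios once traces are labeled. The formulas below are one reasonable reading of the definitions, not a standard: step efficiency is taken as expert steps divided by actual steps (capped at 1.0), and faithfulness as the share of sentences a claim checker labels SUPPORTED. The function names are hypothetical.

```python
def step_efficiency(actual_steps: int, optimal_steps: int) -> float:
    """1.0 means the agent matched the expert trajectory length; lower means wasted steps."""
    if actual_steps <= 0:
        return 0.0
    return min(1.0, optimal_steps / actual_steps)


def faithfulness(labels: list[str]) -> float:
    """Share of answer sentences labeled SUPPORTED by a per-sentence claim check."""
    if not labels:
        return 0.0
    return sum(label == "SUPPORTED" for label in labels) / len(labels)


def recall_at_k(gold_ids: set[str], retrieved_ids: list[str], k: int) -> float:
    """Fraction of gold chunk IDs present in the top-k retrieved IDs."""
    if not gold_ids:
        return 1.0
    return len(gold_ids & set(retrieved_ids[:k])) / len(gold_ids)


eff = step_efficiency(actual_steps=8, optimal_steps=4)
faith = faithfulness(["SUPPORTED", "SUPPORTED", "NOT_ENOUGH_INFO", "SUPPORTED"])
rec = recall_at_k({"a", "b"}, ["a", "x", "b"], k=2)
```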
Golden task fixtures
Build 30–100 scripted tasks per agent persona: "Research refund policy for EU enterprise tier," "Plot Q3 revenue by region from attached CSV." Each fixture specifies expected tools (order may vary), required citation sources, and forbidden actions (no web, no delete). Run in CI on prompt/graph changes.
Trace logging schema
- trace_id, agent_version, user_tenant
- Ordered steps: {type: tool|llm|retrieve, name, latency_ms, input_hash, output_hash}
- Final answer + citation map + retrieval snapshot IDs
- User feedback signal linked to trace
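The schema above can be sketched with dataclasses. Hashing inputs and outputs rather than storing them keeps traces small and avoids logging raw user content. `Trace`, `Step`, and `stable_hash` are illustrative names; the fields mirror the list above, with the final answer, citation map, and feedback link omitted for brevity.

```python
import hashlib
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


def stable_hash(payload) -> str:
    """Deterministic short digest of any JSON-serializable payload."""
    blob = json.dumps(payload, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]


@dataclass
class Step:
    type: str          # "tool" | "llm" | "retrieve"
    name: str
    latency_ms: int
    input_hash: str
    output_hash: str


@dataclass
class Trace:
    agent_version: str
    user_tenant: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    steps: list[Step] = field(default_factory=list)

    def record(self, step_type: str, name: str, fn, payload):
        """Run fn(payload), time it, and append a Step with hashed input and output."""
        start = time.perf_counter()
        output = fn(payload)
        latency_ms = int((time.perf_counter() - start) * 1000)
        self.steps.append(Step(step_type, name, latency_ms,
                               stable_hash(payload), stable_hash(output)))
        return output


trace = Trace(agent_version="v1.3", user_tenant="tenant-a")
hits = trace.record("retrieve", "search_kb",
                    lambda p: [{"id": "c1"}], {"query": "refund policy"})
serialized = json.dumps(asdict(trace))
```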
```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class GoldenTask:
    id: str
    question: str
    expected_tools: set[str]
    required_source_ids: set[str]
    rubric: str  # grading criteria passed to the judge


def eval_agent(agent_fn, tasks: list[GoldenTask], judge_llm) -> dict:
    results = []
    for task in tasks:
        trace = agent_fn.run_with_trace(task.question)
        tools_used = {s["name"] for s in trace.steps if s["type"] == "tool"}
        sources = set(trace.retrieval_ids)
        # Share of expected tools that were called; order and arguments are not checked here.
        tool_score = len(task.expected_tools & tools_used) / max(len(task.expected_tools), 1)
        recall = len(task.required_source_ids & sources) / max(len(task.required_source_ids), 1)
        completion = judge_llm.score(task.question, trace.answer, task.rubric)
        results.append({
            "task_id": task.id,
            "completion": completion,
            "tool_accuracy": tool_score,
            "retrieval_recall": recall,
            "steps": len(trace.steps),
        })
    avg = mean(r["completion"] for r in results) if results else 0.0
    return {"tasks": results, "avg_completion": avg}
```
```java
public AgentEvalReport evalAgent(AgentRunner runner, List<GoldenTask> tasks) {
    List<TaskScore> scores = new ArrayList<>();
    for (GoldenTask task : tasks) {
        AgentTrace trace = runner.runWithTrace(task.question());
        Set<String> toolsUsed = trace.toolNames();
        double toolAcc = overlap(task.expectedTools(), toolsUsed);
        double recall = overlap(task.requiredSourceIds(), trace.retrievalIds());
        double completion = judge.score(task, trace.answer());
        scores.add(new TaskScore(task.id(), completion, toolAcc, recall, trace.stepCount()));
    }
    return AgentEvalReport.from(scores);
}
```
Optimize step efficiency only after task completion plateaus; shaving steps while breaking edge cases is a common regression.
LLM-as-judge favors verbose answers. Calibrate judges against human labels monthly; use rubric checkpoints, not vibes.
Name four metrics: completion, efficiency, tool accuracy, faithfulness. Mention golden traces and linking evals to Track 5 CI gates.
Deep dive: Track 5 — Evaluation covers golden datasets, LLM judges, regression gates, and online evals.
What this looks like in production
Shipping agentic RAG means routing traffic between fast single-shot and slow research paths, enforcing budgets, and operating observability across retrieval, tools, and synthesis. This is the Track 4 capstone before Track 5 — Evaluation.
Routing: fast vs agentic path
| Signal | Route | Target latency |
|---|---|---|
| FAQ match > 0.9 confidence | Single-shot RAG | < 3 s |
| Compound / research keywords | Agentic retrieve loop | 5–30 s |
| "Deep research" explicit UX | Topic tree async job | 2–15 min |
| CSV / numeric analysis attached | Code agent sandbox | 10–60 s |
Track 4 production checklist
- Router with logged decision reason per request
- Hard caps: max tool calls, max retrieve steps, max tokens per trace
- Citation map stored with every research output
- Sandbox isolation for code agents
- MCP allowlists for external tools (see MCP guide)
- Golden agent tasks in CI (30+ per persona)
- Async queue for deep research with progress notifications
- Cost attribution per tenant: retrieval + LLM + sandbox minutes
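The hard-caps item can be enforced with a small per-trace budget object that every tool call, retrieval, and LLM call charges against. This is a sketch under assumed default limits; `TraceBudget` and `BudgetExceeded` are hypothetical names.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a charge would push a trace past one of its hard caps."""


@dataclass
class TraceBudget:
    max_tool_calls: int = 8
    max_retrieve_steps: int = 4
    max_tokens: int = 50_000
    tool_calls: int = 0
    retrieve_steps: int = 0
    tokens: int = 0

    def charge(self, *, tool_calls: int = 0, retrieve_steps: int = 0, tokens: int = 0) -> None:
        """Check every cap first, then commit, so a rejected charge changes nothing."""
        proposed = {
            "tool_calls": self.tool_calls + tool_calls,
            "retrieve_steps": self.retrieve_steps + retrieve_steps,
            "tokens": self.tokens + tokens,
        }
        for name, value in proposed.items():
            cap = getattr(self, f"max_{name}")
            if value > cap:
                raise BudgetExceeded(f"{name} cap reached: {value} > {cap}")
        for name, value in proposed.items():
            setattr(self, name, value)


budget = TraceBudget(max_tool_calls=2)
budget.charge(tool_calls=1, tokens=1_200)
budget.charge(tool_calls=1, tokens=900)
try:
    budget.charge(tool_calls=1)
    stopped = False
except BudgetExceeded:
    stopped = True   # the agent loop should force synthesis at this point
```

The final counters can be written to the trace, which also gives the per-tenant cost attribution a starting point.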
Observability dashboard
- p95 latency by route (fast / agentic / deep / code)
- Tool error rate by server name
- Avg steps per successful task
- Faithfulness score sample (offline daily job)
- Retrieval recall@K on labeled slice
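```python
import hashlib
from pathlib import Path


def route_and_run(question: str, attachments: list[str], cfg) -> dict:
    # classifier, retriever, llm, and log are module-level dependencies defined elsewhere.
    csv_files = [a for a in attachments if a.lower().endswith(".csv")]
    route = "fast"
    if cfg.deep_research_phrase in question.lower():
        route = "deep"
    elif csv_files:
        route = "code"
    elif classifier.needs_agentic_rag(question):
        route = "agentic"

    # Built-in hash() is salted per process; use a stable digest for log correlation.
    question_hash = hashlib.sha256(question.encode("utf-8")).hexdigest()[:16]
    log.info("route_selected", route=route, question_hash=question_hash)

    handlers = {
        "fast": lambda: single_shot_rag(question),
        "agentic": lambda: research_agent(question, retriever, llm),
        "deep": lambda: enqueue_deep_research(question),
        # run_code_agent expects CSV text; this assumes attachments are local file paths.
        "code": lambda: run_code_agent(
            question, Path(csv_files[0]).read_text(encoding="utf-8"), llm
        ),
    }
    return {"route": route, "result": handlers[route]()}
```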
```java
public AgentResponse routeAndRun(AgentRequest request) {
    Route route = router.classify(request);
    metrics.counter("agent.route", "type", route.name()).increment();
    return switch (route) {
        case FAST -> fastRag.answer(request);
        case AGENTIC -> researchAgent.run(request);
        case DEEP -> deepResearchQueue.enqueue(request);
        case CODE -> codeAgent.run(request);
    };
}
```
Perplexity-style products expose route implicitly; enterprise copilots label modes: "Quick" vs "Research" vs "Analyze data" so users understand latency and cost.
Agentic traffic may be 5–20% of queries but 40–70% of LLM spend. Monitor spend by route and adjust router thresholds quarterly.
Deep research and code agents carry high exfiltration risk. Enforce RBAC on retriever scopes and sandbox network policy per tenant tier.
Combining agentic RAG with MCP tools
Retrieval does not have to live in-process. Expose search_kb, fetch_chunk, and search_web as MCP tools so research agents discover them alongside CRM and calendar servers. The orchestrator merges allowlists; the model picks retrieval tools per step. See MCP explained for server patterns.
| Agent mode | Typical tools | Max steps |
|---|---|---|
| Research | search_kb, fetch_chunk, cite_source | 6–10 |
| Deep research | search_web, fetch_url, dedupe_check | 20–40 (async) |
| Code analysis | run_python, load_csv, plot | 5–8 |
Failure modes and mitigations
| Failure | Symptom | Mitigation |
|---|---|---|
| Retrieve loop | Same query repeated 5+ times | Query dedup set; force synthesize after cap |
| Citation drift | [3] points to wrong chunk | Stable chunk IDs; validate map before render |
| Over-research | 30 sources, shallow summary | Max sources; require outline before report |
| Code hallucination | Prints answer without running code | Require execution artifact for numeric claims |
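The citation-drift mitigation ("validate map before render") can be sketched as a structural check run just before the answer is shown. It catches dangling and unused numbers and unknown chunk IDs; it cannot tell whether a resolvable citation actually supports its sentence, which still needs a grounding check. `validate_citation_map` is a hypothetical helper name.

```python
import re


def validate_citation_map(
    answer: str,
    citation_map: dict[int, str],
    known_chunk_ids: set[str],
) -> list[str]:
    """Return structural problems with the [n] markers; an empty list means safe to render."""
    problems: list[str] = []
    used = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    for n in sorted(used):
        if n not in citation_map:
            problems.append(f"[{n}] has no entry in the citation map")
        elif citation_map[n] not in known_chunk_ids:
            problems.append(f"[{n}] points to unknown chunk {citation_map[n]}")
    for n in sorted(set(citation_map) - used):
        problems.append(f"[{n}] is in the map but never cited")
    return problems


answer = "Refunds take 30 days [1]. EU tiers differ [3]."
citation_map = {1: "chunk-17", 2: "chunk-03", 3: "chunk-99"}
problems = validate_citation_map(answer, citation_map, known_chunk_ids={"chunk-17", "chunk-03"})
```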
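```python
import json


def agent_loop(question: str, tools, llm, max_steps: int = 8):
    # Assumes an OpenAI-style chat format: each tool call must be answered by a tool message.
    messages = [{"role": "user", "content": question}]
    queries_seen: set[str] = set()
    for step in range(max_steps):
        resp = llm.chat(messages, tools=tools)
        if not resp.tool_calls:
            return resp.content
        # Record the assistant turn so each tool result has a matching tool call.
        messages.append({"role": "assistant", "content": resp.content,
                         "tool_calls": resp.tool_calls})
        for call in resp.tool_calls:
            if call.name == "search_kb":
                q = json.loads(call.arguments)["query"].strip().lower()
                if q in queries_seen:
                    # Answer the duplicate call with a tool result instead of re-running it.
                    messages.append({"role": "tool", "tool_call_id": call.id,
                                     "content": "Duplicate query. Synthesize now with existing evidence."})
                    continue
                queries_seen.add(q)
            messages.append(run_tool(call))
    # Step cap reached: force a final answer without tools.
    final_prompt = {"role": "user",
                    "content": "Step limit reached. Answer with citations and list gaps."}
    return llm.chat(messages + [final_prompt]).content
```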
```java
public String agentLoop(String question, int maxSteps) {
    List<Message> messages = new ArrayList<>(List.of(new UserMessage(question)));
    Set<String> queriesSeen = new HashSet<>();
    for (int step = 0; step < maxSteps; step++) {
        ChatResponse resp = chatClient.prompt().messages(messages).tools(tools).call();
        if (!resp.hasToolCalls()) return resp.getResult().getOutput().getContent();
        for (ToolCall call : resp.getToolCalls()) {
            if ("search_kb".equals(call.name())) {
                String q = parseQuery(call.arguments());
                if (!queriesSeen.add(q)) {
                    messages.add(new UserMessage("Duplicate query. Synthesize with existing evidence."));
                    continue;
                }
            }
            messages.add(ToolResponseMessage.from(call.id(), execute(call)));
        }
    }
    messages.add(new UserMessage("Step limit reached. Answer with citations and list gaps."));
    return chatClient.prompt().messages(messages).call().content();
}
```
Cross-track references
- Retrieval strategies — static pipelines agentic RAG extends
- Advanced RAG patterns — Self-RAG and corrective retrieval
- MCP explained — external retrieval and action tools
- Track 5 — Evaluation — golden tasks and faithfulness scoring
Start agentic RAG on the internal KB only (no web) until citation accuracy exceeds 90% on the golden set, then add web tools with stricter grounding.
Rollout phases
- Phase 1 — Agentic retrieve on internal KB for internal power users; log traces only.
- Phase 2 — Enable research mode with citations in UI; golden eval gate in CI.
- Phase 3 — Deep research async jobs for analyst persona; cost caps per tenant.
- Phase 4 — Code agent for CSV attachments; sandbox in isolated VPC.
Each phase has explicit exit criteria tied to Track 5 metrics before the next phase ships to all tenants.
Pair this guide with Advanced RAG patterns for Self-RAG grading concepts and with Multi-agent orchestration when research sub-agents run in parallel.