Multi-agent systems & orchestration
One agent with twenty tools in one prompt eventually hits context limits, skill dilution, and un-debuggable loops. Multi-agent systems split work across specialists coordinated by a graph: supervisor routing, parallel fan-out, critic loops, and framework-native handoffs. This guide maps when teams earn their complexity—and how LangGraph, AutoGen, CrewAI, and OpenAI Swarm implement the same patterns with different ergonomics.
After reading, you should be able to: justify multi-agent vs single-agent; compare supervisor, peer-to-peer, pipeline, parallel, and critic/judge patterns; build a LangGraph state machine with conditional edges and checkpointing; sketch AutoGen GroupChat and CrewAI role crews; explain OpenAI Swarm handoffs; and ship production controls—bounded scope, observability, circuit breakers, and cost caps.
Why multi-agent systems
A single LLM agent can plan, retrieve, call APIs, and write prose—but stuffing every capability into one system prompt creates context pressure, tool sprawl, and reasoning overload. Multi-agent design trades coordination overhead for narrower prompts, scoped tool allowlists, and parallel execution where subtasks are independent.
The default architecture for most products should remain one agent with a well-designed tool router. You escalate to multiple agents when evals show a measurable gap: one role consistently pollutes another's context (policy facts mixed with creative marketing copy), subtasks are embarrassingly parallel (parse ten PDFs independently), or governance requires hard separation (research agent has read-only DB; writer agent has no SQL tools at all).
Context and skill limits
Every agent turn pays for the full transcript plus tool definitions. A monolithic support agent might carry: system instructions (2K tokens), twelve tool schemas (4K), conversation history (8K), retrieved policy chunks (6K), and scratch notes from prior tool rounds (3K)—before the model reasons about the user's latest message. Specialists shrink each slice: the researcher sees order APIs and compact facts; the writer sees facts + tone guide, not raw JSON from five failed tool attempts.
| Pressure | Single-agent symptom | Multi-agent mitigation |
|---|---|---|
| Context window | History + tools + RAG exceed budget; middle gets lost | Shared state slots; pass summaries not full chat replay |
| Skill dilution | Model "forgets" refund rules while drafting empathetic tone | Policy agent with narrow system prompt + tools |
| Tool confusion | Calls issue_refund when user asked for status | Writer has zero write tools; executor is separate node |
| Debuggability | One blob log; unclear which step hallucinated | Per-agent spans: supervisor → researcher → writer |
Parallelization wins
When subtasks do not depend on each other's outputs, fan-out beats serial ReAct. Example: competitive intelligence brief—Agent A scrapes pricing pages, Agent B summarizes SEC filings, Agent C extracts features from docs; an aggregator merges structured JSON. Latency becomes max(subtask) plus merge—not sum. The orchestrator must define a merge contract (schema, conflict rules) so parallel outputs compose deterministically.
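The latency claim can be checked with a small sketch. It assumes each agent is an async call; `run_subtask` and its sleep durations are hypothetical stand-ins for real LLM or tool latency, not part of any framework.

```python
import asyncio
import time


async def run_subtask(name: str, seconds: float) -> dict:
    # Hypothetical stand-in for an agent call; sleep simulates LLM/tool latency.
    await asyncio.sleep(seconds)
    return {"shard_id": name, "status": "ok", "payload": f"{name} summary"}


async def fan_out() -> tuple[list[dict], float]:
    start = time.perf_counter()
    # gather runs the three subtasks concurrently and preserves argument order.
    results = await asyncio.gather(
        run_subtask("pricing", 0.10),
        run_subtask("filings", 0.20),
        run_subtask("features", 0.05),
    )
    return list(results), time.perf_counter() - start


results, elapsed = asyncio.run(fan_out())
```

Here `elapsed` should land near the slowest subtask (about 0.20s) rather than the 0.35s sum, and the ordered `results` list is what the aggregator would receive.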
When one agent is enough
- Linear workflows with <5 tool types and predictable routing.
- Latency SLO <3s—multi-agent handoffs add 1–3 LLM calls minimum.
- Team lacks tracing infrastructure to debug inter-agent message buses.
- Eval shows single-agent with better prompts matches specialist accuracy.
Decision rubric
| Signal | Lean single-agent | Lean multi-agent |
|---|---|---|
| Tool count | <8 well-grouped tools | >15 tools across domains |
| Compliance | Uniform RBAC on gateway | Role-based tool ACLs per agent |
| Parallelism | None needed | Independent subtasks >2 |
| Iteration style | Fixed pipeline | Dynamic routing / debate loops |
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TeamState:
    request_id: str
    user_message: str
    order_facts: dict[str, Any] = field(default_factory=dict)
    policy_facts: dict[str, Any] = field(default_factory=dict)
    draft_answer: str = ""
    active_agent: str = "supervisor"
    handoff_count: int = 0
    max_handoffs: int = 8


def researcher_prompt(state: TeamState) -> str:
    # The researcher sees only the user message and known order facts.
    return f"""Look up order details. User: {state.user_message}
Known facts: {state.order_facts or 'none'}"""


def writer_prompt(state: TeamState) -> str:
    # The writer sees structured facts only, never raw tool output.
    return f"""Write customer reply using ONLY these facts:
order={state.order_facts}
policy={state.policy_facts}"""
import java.util.Map;

public record TeamState(
    String requestId,
    String userMessage,
    Map<String, Object> orderFacts,
    Map<String, Object> policyFacts,
    String draftAnswer,
    String activeAgent,
    int handoffCount,
    int maxHandoffs
) {
    public static String researcherPrompt(TeamState s) {
        return "Look up order. User: " + s.userMessage()
            + "\nKnown: " + s.orderFacts();
    }

    public static String writerPrompt(TeamState s) {
        return "Write reply using order=" + s.orderFacts()
            + " policy=" + s.policyFacts();
    }
}
Start with a single agent and a structured TeamState object even before splitting prompts—if you cannot name what each slot stores, multi-agent will duplicate work.
Multi-agent adds 2–5× LLM calls per user request. Parallel fan-out saves wall-clock time, but peak token spend scales with the number of concurrent agents unless you cap fan-out width.
Spinning up five agents because the demo looked impressive—without eval proof—usually increases cost and failure modes while accuracy stays flat.
"When would you use multi-agent?" — Default single agent + tools. Multi-agent when tool ACLs, parallel subtasks, or context isolation measurably improve evals. Mention bounded handoffs and shared state schema.
Legal-tech copilots split clause retrieval (RAG agent, read-only) from drafting (no DB tools) after audits showed combined agents issued citations from wrong tenant's corpus twice in shadow traffic.
Parallel three-agent research with 8K context each costs ~3× single-agent sequential—but wall-clock drops from 45s to 18s. Price the premium tier accordingly; batch offline jobs tolerate higher spend than chat.
Specialization works because smaller tool schemas reduce mistaken function calls: tool-selection accuracy tends to degrade as a single request carries more tools (a common rule of thumb puts the knee at roughly 10–15), and splitting tools across agents restores precision.
Bridge from agent memory
The prior guide (Agent memory) covered what to persist in the loop vs trim each round. Multi-agent adds a rule: write to shared state slots, not broadcast full transcripts. Working memory lives in the TeamState fact slots (order_facts, policy_facts); long-term memory stays in external stores keyed by user_id. Agents read summaries injected at handoff, not each other's raw tool JSON.
Orchestration patterns
Production multi-agent systems reuse a small set of topology patterns. Pick the one that minimizes coordination friction: agents should know when to act, when to wait, and what schema they hand off—not broadcast unstructured chat forever.
Supervisor + workers
A supervisor (router node or LLM) reads shared state and delegates to specialists: researcher, coder, reviewer. Workers return structured updates to state; the supervisor decides the next hop or termination. This mirrors a manager assigning tickets: predictable, auditable, and easy to cap with max_handoffs. Risk: the supervisor becomes a bottleneck and may route incorrectly; mitigate with deterministic pre-routing rules before LLM routing.
Peer-to-peer (network)
Agents message each other without a central boss—debate, negotiate, or request clarifications. Powerful for open-ended research but hard to terminate: conversations drift, duplicate retrieval, or ping-pong blame. Use only with strict turn limits, stagnation detection (same content hash twice), and a global budget clock.
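The turn limit and stagnation check can be sketched as a small guard object. `DebateGuard` and `content_hash` are hypothetical names, and the whitespace and case normalization is an assumption about what counts as "the same content"; a global budget clock would sit alongside this.

```python
import hashlib
from typing import Optional


def content_hash(text: str) -> str:
    # Normalize whitespace and case so trivially reworded repeats still match.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DebateGuard:
    def __init__(self, max_turns: int = 6):
        self.max_turns = max_turns
        self.turns = 0
        self.seen: set[str] = set()

    def should_stop(self, message: str) -> Optional[str]:
        """Return a stop reason, or None if the conversation may continue."""
        self.turns += 1
        digest = content_hash(message)
        if digest in self.seen:
            return "stagnation"  # same content hash seen twice
        self.seen.add(digest)
        if self.turns >= self.max_turns:
            return "turn_limit"
        return None
```

Calling `should_stop` on every inter-agent message gives the orchestrator one place to end a drifting conversation and to log why it ended.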
Pipeline (sequential)
Fixed DAG: ingest → extract → validate → summarize → publish. Each stage is an agent or tool node; output of stage N is input to N+1. Best for ETL-style workflows where order never changes. Easy to test per stage; failure handling is retry or dead-letter per node.
Parallel fan-out / gather
Map phase: spawn N workers on partitioned inputs. Reduce phase: aggregator merges with schema validation and deduplication. LangGraph Send API and Celery task groups implement the same idea. Watch for partial failures—define whether merge proceeds with missing shards or fails closed.
Critic / judge loop
Generator produces draft; critic scores against rubric (citations present, policy compliant, tone); conditional edge loops back or exits. Same pattern as Self-RAG reflection—applied to multi-agent teams. Cap iterations (typically 2–3); critics should output structured pass/fail + fix instructions, not prose essays that bloat context.
| Pattern | Best for | Watch out for |
|---|---|---|
| Supervisor + workers | Support triage, research teams, tool ACL separation | Supervisor mis-routing; over-handoff |
| Peer-to-peer | Brainstorming, adversarial review | Non-termination, duplicated work |
| Pipeline | Doc processing, codegen → review → deploy | Rigid; hard to skip stages dynamically |
| Parallel | Batch analysis, multi-source gather | Merge conflicts, partial failures |
| Critic / judge | Regulated outputs, citation-heavy answers | Runaway refine loops, cost explosion |
Pattern selection flow
- Can one agent with scoped tools pass eval? If yes, stop here.
- Are subtasks independent? → Parallel + aggregator.
- Is order fixed? → Pipeline.
- Need dynamic delegation? → Supervisor + workers.
- Need quality gate before ship? → Add critic loop on final stage only.
Critic / judge implementation sketch
The critic node returns structured JSON—never free-form "looks good." Pass/fail drives the conditional edge; fail appends fix_instructions to state for the generator's next pass.
| Critic check | Pass criteria | On fail |
|---|---|---|
| Citation coverage | Every claim has source_id | Loop to researcher |
| Policy compliance | No banned refund phrases | Loop to writer with fixes |
| Format | Valid JSON schema | Loop to formatter node |
| Confidence | Self-score ≥ 0.85 | Abstain or HITL |
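One way to enforce the structured verdict is to parse it strictly and fail closed. This is a sketch under assumptions: the JSON field names `passed` and `fix_instructions`, and the node names returned by `next_step`, are illustrative rather than taken from any framework.

```python
import json
from dataclasses import dataclass, field


@dataclass
class CriticVerdict:
    passed: bool
    fix_instructions: list[str] = field(default_factory=list)


def parse_verdict(raw: str) -> CriticVerdict:
    # Fail closed: anything that is not well-typed JSON counts as a failed check.
    try:
        data = json.loads(raw)
        if not isinstance(data, dict) or not isinstance(data.get("passed"), bool):
            raise ValueError("missing boolean 'passed'")
        fixes = [str(item) for item in (data.get("fix_instructions") or [])]
        return CriticVerdict(data["passed"], fixes)
    except (ValueError, TypeError):
        return CriticVerdict(False, ["critic output was not valid structured JSON"])


def next_step(verdict: CriticVerdict, iteration: int, max_iterations: int = 3) -> str:
    if verdict.passed:
        return "end"
    if iteration >= max_iterations:
        return "human"  # cap reached: escalate instead of refining again
    return "writer"
```

Free-form replies such as "looks good" parse as a failure here, so the generator only exits the loop on an explicit structured pass.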
Parallel gather merge contract
Define aggregator input schema upfront: {shard_id, status, payload, error}. Merge rules: dedupe by primary key, prefer newest timestamp on conflict, surface partial failure count in final answer metadata. Without this contract, parallel agents produce irreconcilable prose blobs.
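A minimal sketch of those merge rules follows. It assumes each shard's `payload` is a list of records carrying an `id` primary key and an ISO-8601 `updated_at` string; both field names are hypothetical.

```python
def merge_shards(shards: list[dict]) -> dict:
    """Merge {shard_id, status, payload, error} results from parallel workers."""
    merged: dict[str, dict] = {}
    failed: list[str] = []
    for shard in shards:
        if shard["status"] != "ok":
            failed.append(shard["shard_id"])
            continue
        for record in shard["payload"]:
            current = merged.get(record["id"])
            # Dedupe by primary key; on conflict prefer the newest timestamp.
            if current is None or record["updated_at"] > current["updated_at"]:
                merged[record["id"]] = record
    return {
        "records": sorted(merged.values(), key=lambda r: r["id"]),
        "partial_failures": len(failed),
        "failed_shards": failed,
    }
```

This variant proceeds with missing shards and reports them; a fail-closed variant would raise as soon as `failed` is non-empty.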
def route_supervisor(state: TeamState) -> str:
    if state.handoff_count >= state.max_handoffs:
        return "writer"  # force finish
    if not state.order_facts and "ORD-" in state.user_message.upper():
        return "researcher"
    if not state.policy_facts and any(k in state.user_message.lower() for k in ("refund", "policy")):
        return "policy"
    if state.order_facts and state.policy_facts:
        return "writer"
    return "researcher"

# LangGraph: add_conditional_edges("supervisor", route_supervisor, {...})
Function<TeamState, String> supervisorRouter = s -> {
    if (s.handoffCount() >= s.maxHandoffs()) return "writer"; // force finish
    if (s.orderFacts().isEmpty() && s.userMessage().contains("ORD-"))
        return "researcher";
    if (s.policyFacts().isEmpty() && s.userMessage().toLowerCase().contains("refund"))
        return "policy";
    if (!s.orderFacts().isEmpty() && !s.policyFacts().isEmpty())
        return "writer";
    return "researcher";
};
Tier-2 support at a fintech runs supervisor → order lookup → policy RAG → writer, with deterministic routing for known intents and LLM supervisor only on ambiguous tickets—cuts mis-routes 40% vs pure LLM routing.
Supervisor routing can be hybrid: regex and intent classifier first (free, fast), LLM supervisor only when confidence <0.7—same pattern as model cascades in Track 1.
Peer-to-peer debate with four agents × five rounds = 20+ LLM calls per user question. Reserve for offline batch, not synchronous chat.
Peer networks amplify prompt injection—one agent can pass malicious "observations" to another. Validate and sanitize all inter-agent payloads like external tool results.
LangGraph: nodes, edges, state, and checkpoints
LangGraph models agent workflows as a directed graph: nodes mutate shared state, edges define transitions, and conditional routing enables cycles (tool loop, critic refine) with explicit termination. Checkpointing and human-in-the-loop interrupts make long-running multi-agent flows production-durable.
Core concepts
- State — TypedDict / Pydantic model / Java record holding messages, facts, flags. Reducers (e.g. operator.add on messages) define merge semantics.
- Nodes — Pure functions state → partial state update. Each node should do one job: plan, call tools, summarize, route.
- Edges — Fixed transitions (researcher → supervisor) or conditional (route(state) returns next node name).
- Cycles — Deliberate loops for ReAct; must exit via iteration counter, success flag, or critic pass.
Conditional routing
add_conditional_edges("supervisor", route_fn, path_map) is the control plane. Route functions should be small, testable, and logged—avoid hiding routing inside LLM prose. Pattern: LLM proposes next agent as structured JSON; router validates against allowlist before transition.
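The "LLM proposes, router validates" pattern can be sketched as a pure function. The `next` field name, the allowlist contents, and the `human` fallback are assumptions for illustration.

```python
import json

# Hypothetical allowlist of nodes the supervisor may transition to.
ALLOWED_NEXT = {"researcher", "policy", "writer", "human"}


def validated_route(llm_output: str, fallback: str = "human") -> str:
    """Accept the LLM's proposed next agent only if it is on the allowlist."""
    try:
        proposal = json.loads(llm_output)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(proposal, dict):
        return fallback
    next_agent = proposal.get("next")
    if isinstance(next_agent, str) and next_agent in ALLOWED_NEXT:
        return next_agent
    return fallback
```

Because the function never calls a model, it can be logged and unit-tested on its own, and an unexpected or injected node name degrades to the fallback instead of reaching the graph.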
Human-in-the-loop (HITL)
interrupt_before=["human"] pauses graph before executing the human node—state persists in checkpointer. UI shows pending tool write; operator approves; graph.invoke(None, config) resumes. Use for refunds, deploys, PII exports—not for every turn.
Checkpointing
MemorySaver for dev; Postgres / Redis checkpointers for prod. Thread ID = conversation or ticket ID. Enables: crash recovery, time-travel debug ("what was state before bad tool call?"), and audit replay for compliance.
| Feature | Dev default | Production |
|---|---|---|
| Checkpointer | MemorySaver | PostgresSaver with TTL |
| Thread key | UUID per session | tenant_id + ticket_id |
| Streaming | graph.stream | SSE to UI per node completion |
| Parallel sends | Manual threads | LangGraph Send API |
from typing import Annotated, Literal, TypedDict
import operator

from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver


class AgentState(TypedDict):
    messages: Annotated[list, operator.add]  # reducer: node updates are appended
    facts: dict
    iteration: int
    needs_human: bool


def researcher(state: AgentState) -> dict:
    # Call tools, append messages, update facts (returns a partial state update).
    return {"messages": [{"role": "assistant", "content": "..."}],
            "facts": {**state["facts"], "order_id": "ORD-42"}}


def supervisor(state: AgentState) -> dict:
    return {"iteration": state["iteration"] + 1}


def route(state: AgentState) -> Literal["researcher", "writer", "human", "__end__"]:
    if state.get("needs_human"):
        return "human"
    if state["iteration"] >= 6:
        return "__end__"
    if "order_id" not in state.get("facts", {}):
        return "researcher"
    return "writer"


builder = StateGraph(AgentState)
builder.add_node("supervisor", supervisor)
builder.add_node("researcher", researcher)
builder.add_node("writer", lambda s: {"messages": [{"role": "assistant", "content": "done"}]})
# HITL interrupt target. Return only the cleared flag: returning the full state
# would re-append every message through the operator.add reducer.
builder.add_node("human", lambda s: {"needs_human": False})
builder.add_edge(START, "supervisor")
builder.add_conditional_edges("supervisor", route)
builder.add_edge("researcher", "supervisor")
builder.add_edge("writer", END)
builder.add_edge("human", "supervisor")

checkpointer = MemorySaver()
graph = builder.compile(checkpointer=checkpointer, interrupt_before=["human"])

config = {"configurable": {"thread_id": "ticket-991"}}
initial = {"messages": [], "facts": {}, "iteration": 0, "needs_human": False}
for event in graph.stream(initial, config):
    print(event)
# Resume after human approval: graph.invoke(None, config)
// Spring AI + custom graph runner (LangGraph4j-style sketch)
record AgentState(List<Map<String, String>> messages, Map<String, Object> facts,
                  int iteration, boolean needsHuman) {}

enum Route { RESEARCHER, WRITER, HUMAN, END }

Route route(AgentState state) {
    if (state.needsHuman()) return Route.HUMAN;
    if (state.iteration() >= 6) return Route.END;
    if (!state.facts().containsKey("order_id")) return Route.RESEARCHER;
    return Route.WRITER;
}

AgentState researcher(AgentState state) {
    var facts = new HashMap<>(state.facts());
    facts.put("order_id", "ORD-42");
    return new AgentState(state.messages(), facts, state.iteration(), false);
}

// GraphRunner: Map<String, Function<AgentState, AgentState>> nodes
// CheckpointStore.save(threadId, state) before HITL interrupt
Unit-test route(state) with table-driven cases before testing full LLM nodes—routing bugs cause 80% of "agent went rogue" incidents.
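A table-driven check for the graph example's routing logic might look like the sketch below. The `route` function is restated over plain dicts so the test needs no LangGraph install or LLM; `CASES` and `run_cases` are hypothetical names.

```python
def route(state: dict) -> str:
    # Same decision order as the LangGraph example's route function.
    if state.get("needs_human"):
        return "human"
    if state["iteration"] >= 6:
        return "__end__"
    if "order_id" not in state.get("facts", {}):
        return "researcher"
    return "writer"


# Each row: (input state, expected next node).
CASES = [
    ({"needs_human": True, "iteration": 0, "facts": {}}, "human"),
    ({"needs_human": True, "iteration": 9, "facts": {}}, "human"),
    ({"iteration": 6, "facts": {}}, "__end__"),
    ({"iteration": 1, "facts": {}}, "researcher"),
    ({"iteration": 1, "facts": {"order_id": "ORD-42"}}, "writer"),
]


def run_cases() -> list[str]:
    failures = []
    for state, expected in CASES:
        actual = route(state)
        if actual != expected:
            failures.append(f"{state} -> {actual}, expected {expected}")
    return failures
```

An empty list from `run_cases()` means every row routed as expected; adding a row per production incident keeps the table honest over time.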
LangGraph compiles to a Pregel-style superstep runner: each node reads latest state snapshot, writes partial update, reducer merges—enables parallel node execution when graph topology allows.
Set recursion_limit in config (default 25) lower in prod—e.g. 12—for multi-agent graphs with cycles. Pair with application-level max_handoffs.
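Storing full message history in state without trimming makes token usage balloon across cycles: every turn re-sends a longer transcript, so cumulative cost grows roughly quadratically with the number of rounds. Persist structured facts in the facts dict instead of replaying entire tool transcripts to every agent.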
"Explain LangGraph vs a while loop." — Graph gives named nodes, conditional edges, checkpointing, HITL interrupts, and visual debug. While loop is fine for demos; graphs scale observability and resume.
Support platforms store LangGraph checkpoints in Postgres keyed by tenant_id:ticket_id—agents resume after worker pod restart without re-running expensive research nodes.
Checkpoint writes add DB I/O but save re-execution after HITL—one avoided re-research pass pays for thousands of checkpoint rows.
Checkpoint stores hold full state including tool outputs—encrypt at rest, scope by tenant, TTL expired threads, never replay prod checkpoints into dev environments with live keys.
Parallel fan-out with Send
LangGraph's Send API dispatches the same node with different state slices—map over document IDs, run researcher on each, reducer merges into facts["shards"]. Combine with supervisor pattern: supervisor emits Send list; gather node validates all shards before writer runs.
Mapping patterns to LangGraph primitives
| Pattern | LangGraph building blocks |
|---|---|
| Supervisor + workers | Conditional edges from supervisor node; workers return to supervisor |
| Pipeline | Linear edges A → B → C → END |
| Parallel | Send API + gather reducer node |
| Critic loop | Cycle writer → critic → conditional → writer or END |
| HITL | interrupt_before + checkpoint resume |
AutoGen: AssistantAgent, UserProxyAgent, GroupChat
Microsoft AutoGen (and AutoGen AgentChat) centers on conversable agents that exchange messages until a termination condition. AssistantAgent calls the LLM; UserProxyAgent stands in for the human or executes code; GroupChat rotates speakers via a manager—peer-style orchestration with less explicit graph wiring than LangGraph.
AssistantAgent
Wraps LLM config, system message, and optional tools. Receives messages, returns assistant messages or tool calls. You scope capabilities per assistant: one for Python coding, one for SQL read-only, one for summarization.
UserProxyAgent
Can proxy human input (human_input_mode="ALWAYS" for HITL) or auto-approve (NEVER) for batch jobs. Often paired with AssistantAgent in two-agent coding loops: assistant writes code, user proxy executes in sandbox.
GroupChat + GroupChatManager
Register multiple assistants; manager selects next speaker (round-robin, LLM-picked, or custom). Good for prototyping debate and research teams. Production requires: max_round, speaker allowlists, and message filtering so agents do not leak secrets across roles.
Code execution sandbox
AutoGen popularized LLM-generated code → Docker/local executor. Treat as high risk: network-isolated container, no prod credentials, CPU/time limits, stdout size caps. Prefer structured tool APIs over arbitrary code when possible.
from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="coder",
    llm_config={"model": "gpt-4o-mini", "temperature": 0},
    system_message="Write Python. Wrap code in ```python blocks.",
)

user_proxy = UserProxyAgent(
    name="executor",
    human_input_mode="NEVER",  # auto-reply; no human prompt between turns
    max_consecutive_auto_reply=3,
    code_execution_config={"work_dir": "sandbox", "use_docker": True},
)

user_proxy.initiate_chat(
    assistant,
    message="Plot histogram of sample.csv column 'amount'",
)
// AutoGen Java / Spring AI: use Assistant + ToolExecutor pattern
// UserProxy ≈ HumanApprovalGateway or SandboxedScriptRunner
// Illustrative builder API only; these types are a sketch, not an official AutoGen Java SDK.
var coder = AssistantAgent.builder()
    .name("coder")
    .systemMessage("Write Java tests only.")
    .model("gpt-4o-mini")
    .build();

var executor = UserProxyAgent.builder()
    .humanInputMode(HumanInputMode.NEVER)
    .maxConsecutiveAutoReply(3)
    .codeExecution(SandboxConfig.docker("work-dir"))
    .build();

executor.initiateChat(coder, "Generate unit test for RefundService");
GroupChat feels fast to demo but speaker-selection LLM adds hidden cost every round. LangGraph explicit edges are easier to cap and audit.
Data science teams use AutoGen UserProxy + Docker for notebook-style exploration internally; customer-facing paths use fixed LangGraph pipelines with no arbitrary code exec.
Never run UserProxy with use_docker: False on a server that hosts secrets. Sandbox egress should be deny-by-default.
Coding loops: assistant draft + executor error feedback + assistant fix = 3–8 calls per task. Set max_consecutive_auto_reply aggressively in prod.
GroupChat manager LLM picking speakers unpredictably makes reproducing bugs impossible—log speaker selection rationale and freeze manager prompt version in eval fixtures.
"AssistantAgent vs UserProxyAgent?" — Assistant reasons and proposes; UserProxy executes code or represents human approval. Separation keeps codegen out of the same turn as arbitrary shell access without a gate.
Framework fit summary
AutoGen shines when conversation-shaped collaboration is the product requirement—research dialogues, coding pairs, internal experimentation. It is weaker when you need deterministic DAGs, strict SLAs, or fine-grained checkpoint audit—reach for LangGraph instead.
CrewAI: roles, tasks, and crews
CrewAI optimizes for readable role definitions: each agent has a role, goal, and backstory; tasks bind agents to outputs; a Crew runs tasks sequential or hierarchical (manager delegates). Ideal when non-engineers need to edit agent personas in config files.
Researcher / writer / reviewer crew
Classic content pipeline: Researcher gathers facts with search tools; Writer drafts from research output only; Reviewer checks citations and tone, sends back fixes. CrewAI sequential process enforces order without writing graph boilerplate—under the hood it is still a pipeline with shared context passed forward.
Task outputs as contracts
Each Task should declare expected_output and optionally output_json / Pydantic model. Downstream tasks reference prior task outputs via context=[research_task]—this is your schema blackboard.
Process modes
| Process | Behavior | Maps to pattern |
|---|---|---|
| Sequential | Task1 → Task2 → Task3 | Pipeline |
| Hierarchical | Manager agent assigns tasks | Supervisor + workers |
| Consensual (custom) | Debate until agreement | Peer + critic (use caps) |
Tools per agent
Bind tools at agent construction—not globally. Researcher gets Serper + scrape; Writer gets none; Reviewer gets policy lookup only. Matches tool ACL pattern from Track 4 tool-calling guide.
from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool, ScrapeWebsiteTool

# Tools are bound per agent below. SerperDevTool reads SERPER_API_KEY from the environment.
search_tool = SerperDevTool()
scrape_tool = ScrapeWebsiteTool()

researcher = Agent(
    role="Senior Research Analyst",
    goal="Find verified facts about {topic}",
    backstory="You cite primary sources only.",
    tools=[search_tool, scrape_tool],
    verbose=True,
)
writer = Agent(
    role="Content Writer",
    goal="Draft article from research brief",
    backstory="No new facts; only provided research.",
)
reviewer = Agent(
    role="Compliance Reviewer",
    goal="Ensure citations and banned phrase check",
    backstory="Reject uncited claims.",
)

# Every Task declares expected_output; context=[...] passes prior outputs forward.
research = Task(description="Research {topic}", agent=researcher,
                expected_output="Bullet list with URLs")
draft = Task(description="Write 800 words", agent=writer, context=[research],
             expected_output="800-word draft citing the research URLs")
review = Task(description="Review draft", agent=reviewer, context=[draft],
              expected_output="Pass/fail verdict with a list of required fixes")

crew = Crew(agents=[researcher, writer, reviewer], tasks=[research, draft, review],
            process=Process.sequential)
result = crew.kickoff(inputs={"topic": "EU AI Act summary"})
// CrewAI is Python-first; Java teams mirror with Spring AI + explicit Task DTOs
record TaskSpec(String id, String agentRole, String description, List<String> dependsOn) {}

List<TaskSpec> pipeline = List.of(
    new TaskSpec("research", "researcher", "Gather facts", List.of()),
    new TaskSpec("draft", "writer", "Write article", List.of("research")),
    new TaskSpec("review", "reviewer", "Compliance pass", List.of("draft"))
);
// Orchestrator runs in topological order, passing TaskOutput maps
Keep backstories short—every token repeats each task. Put stable rules in shared expected_output templates, not duplicated persona prose.
Hierarchical process with manager LLM picking tasks duplicates supervisor routing costs—use sequential when order is known.
CrewAI agents run ReAct internally when tools attached—your crew-level sequential process wraps inner tool loops; count total LLM calls = tasks × avg tool rounds.
"CrewAI vs LangGraph?" — CrewAI: rapid role/task DSL for pipelines. LangGraph: arbitrary graphs, cycles, HITL, checkpointing. Many teams prototype in CrewAI, port critical paths to LangGraph.
Marketing teams edit CrewAI YAML roles without touching Python—platform engineers wrap crew.kickoff() in a FastAPI job with cost caps and S3 artifact storage for drafts.
Researcher tools (web search) can return injected content—sanitize before Writer sees it; Reviewer should not fetch URLs directly unless allowlisted.
CrewAI readability trades explicit graph visibility—you may not see cycles until runtime. Add explicit task timeouts and output schema validation at crew boundaries.
OpenAI Swarm: handoffs with context preserved
Swarm (OpenAI's experimental, educational SDK, since succeeded by the OpenAI Agents SDK) is a minimal orchestration layer: agents expose functions that return other agents (handoffs). The client runtime switches the active agent while preserving conversation context, which gives lightweight triage without a full graph framework.
Handoff mechanics
Agent A handles general intake. Its tool transfer_to_billing returns Agent B instance (or name). Runtime appends handoff metadata; Agent B continues with prior messages visible—user does not repeat context. Contrast with explicit state slots: Swarm keeps transcript continuity; you still should trim tool noise for token control.
When Swarm fits
- Fast triage prototypes: sales vs support vs technical.
- Small agent count (<5) with clear handoff points.
- Teams already on OpenAI SDK wanting minimal deps.
When to graduate to LangGraph
- Need checkpoint/resume, parallel fan-out, or critic cycles.
- Complex RBAC on which agent may hand off to whom.
- Multi-provider routing (Bedrock + OpenAI) in one workflow.
Context preservation caveats
Preserved context includes prior tool failures and PII from earlier agents—implement handoff filters that redact or summarize before sensitive agents (billing) receive transcript. Treat handoff like passing audit log, not raw memory dump.
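A handoff filter along these lines could run just before the receiving agent is activated. It is a sketch under assumptions: messages are dicts with `role` and `content`, only email addresses are redacted, and `filter_for_handoff` is a hypothetical helper rather than a Swarm API. Real deployments would cover more PII types and might summarize instead of truncating.

```python
import re

# Simple email pattern for illustration; not a complete PII detector.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def filter_for_handoff(messages: list[dict], keep_last: int = 4) -> list[dict]:
    """Drop tool noise, redact emails, and keep only the most recent turns."""
    filtered = []
    for msg in messages:
        if msg["role"] == "tool":
            continue  # raw tool output, including failed attempts, is not passed on
        content = EMAIL.sub("[REDACTED_EMAIL]", msg.get("content") or "")
        filtered.append({"role": msg["role"], "content": content})
    return filtered[-keep_last:]
```

The receiving agent then starts from a short, scrubbed transcript, while the unfiltered history stays in the checkpoint store for audit.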
from openai import OpenAI

client = OpenAI()

triage = {
    "name": "Triage",
    "instructions": "Route billing or technical issues.",
    "tools": [{
        "type": "function",
        "function": {
            "name": "transfer_to_billing",
            "description": "Hand off billing questions",
            "parameters": {"type": "object", "properties": {}},
        },
    }],
}


def handle_tool(name: str, active_agent: str):
    # Returns (next active agent, handoff metadata).
    if name == "transfer_to_billing":
        return "Billing", {"handoff": True, "preserve_context": True}
    return active_agent, {}

# Agents SDK: handoff updates active agent; messages array continues
// Java: model handoff as enum + ConversationState.activeAgent
enum AgentId { TRIAGE, BILLING, TECHNICAL }

HandoffResult transferToBilling(ConversationState state) {
    var summary = summarizer.trimForHandoff(state.messages());
    return new HandoffResult(AgentId.BILLING, summary, true);
}
SaaS onboarding bots use Swarm-style triage: general FAQ agent hands to provisioning agent when user mentions "setup API keys"—one continuous chat thread, no user-visible "you were transferred".
Context preservation simplifies UX but increases token load—billing agent reads full triage tool spam. Insert summarization handoff payload for long threads.
Handoffs must not bypass auth: billing agent should re-verify account scope even if triage agent saw user email in context.
Each handoff often triggers a fresh LLM call with full history—monitor tokens per agent segment in traces.
Production: bounded scope, observability, and cost controls
Multi-agent systems fail in production from unbounded loops, untraceable handoffs, and spend spikes—not from wrong framework choice. Ship with explicit budgets, circuit breakers, structured telemetry per agent, and a fallback to single-agent or human.
Bounded scope
- max_handoffs / max_iterations / recursion_limit — pick one app-level knob operators understand.
- Tool allowlists per agent; deny inter-agent message tools unless explicitly designed.
- Timeout per node (30s tool, 120s LLM); global request deadline enforced at gateway.
- Stagnation detection: same state hash or empty delta twice → escalate or abort.
Observability
Trace shape: trace_id → supervisor_span → researcher_span → tool_span. Log structured handoffs (from, to, reason, state snapshot hash)—not full prompts in prod logs. Export metrics: handoffs_per_request, tokens_per_agent, critic_loop_count, human_interrupt_rate.
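A structured handoff record with a state hash might be produced as follows. This is a sketch: the field names mirror the (from, to, reason, state snapshot hash) tuple in the text, while `state_hash`, `handoff_record`, and the 16-character truncation are assumptions.

```python
import hashlib
import json


def state_hash(state: dict) -> str:
    # Canonical JSON (sorted keys) so equal states hash equally regardless of key order.
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def handoff_record(trace_id: str, from_agent: str, to_agent: str,
                   reason: str, state: dict) -> str:
    """Return one JSONL line describing a handoff, without prompt or state contents."""
    return json.dumps({
        "trace_id": trace_id,
        "from": from_agent,
        "to": to_agent,
        "reason": reason,
        "state_hash": state_hash(state),
    }, sort_keys=True)
```

Two consecutive records with the same `state_hash` are also a cheap signal for stagnation detection.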
Circuit breaker
When error rate or latency for an agent node exceeds threshold, skip to fallback: cached answer, simpler single-agent path, or human queue. Per-agent breakers prevent one broken tool from stalling entire crew. Reset breaker after cooldown + synthetic health check.
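A per-agent breaker can be quite small. The sketch below counts consecutive failures only (no latency threshold) and treats the first call after cooldown as the health probe; the class shape, defaults, and injectable `clock` are assumptions, not a specific library's API.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Open: block until the cooldown has elapsed, then let a probe through.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()
```

Keeping one instance per agent node means a failing researcher tool trips only its own breaker, and the orchestrator can route around it to the fallback path.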
Cost cap
Pre-request estimate: max_tokens × price × expected_agents. Mid-flight abort when spend > cap. Attribute cost by agent_id for FinOps—researcher embedding calls vs writer pure completion.
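The estimate and the mid-flight check can be sketched together. The per-1K-token price unit and the `SpendTracker` name are assumptions; real prices differ per model and between input and output tokens.

```python
def estimate_max_cost_usd(max_tokens: int, usd_per_1k_tokens: float,
                          expected_agents: int) -> float:
    # Worst-case pre-request estimate: max_tokens x price x expected_agents.
    return max_tokens / 1000 * usd_per_1k_tokens * expected_agents


class SpendTracker:
    """Accumulates spend per agent_id and reports when the request cap is exceeded."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.by_agent: dict[str, float] = {}

    def add(self, agent_id: str, usd: float) -> None:
        self.by_agent[agent_id] = self.by_agent.get(agent_id, 0.0) + usd

    @property
    def total(self) -> float:
        return sum(self.by_agent.values())

    def over_cap(self) -> bool:
        return self.total > self.cap_usd
```

Rejecting a request whose estimate already exceeds the cap avoids starting work that would be aborted, and `by_agent` gives FinOps the attribution described above.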
Production readiness matrix
| Control | Minimum | Mature |
|---|---|---|
| Iteration limits | Fixed max_handoffs=8 | Per-tenant configurable + alerts |
| Tracing | JSONL handoff log | OpenTelemetry + LangSmith/Arize |
| HITL | Manual approve writes | Role-based approval queues |
| Evals | Golden multi-agent scenarios | CI regression on routing + output |
| Fallback | Human queue | Degrade to single-agent FAQ |
Pre-ship checklist
- Document topology diagram (which pattern: supervisor, pipeline, parallel).
- Every agent has named owner, tool list, and token budget.
- Routing function has unit tests independent of LLM.
- Checkpoint store sized with retention policy.
- Incident runbook: how to disable one agent without killing fleet.
Golden scenario evals for multi-agent
Single-turn answer evals miss routing bugs. Maintain scenarios that assert: correct agent sequence, handoff count ≤ N, tools called by authorized agent only, and final JSON schema. Run in CI on prompt or graph version bumps—same discipline as Track 3 PromptFoo gates.
| Scenario | Assert |
|---|---|
| Refund + order lookup | researcher → policy → writer; no write tools in writer |
| Ambiguous greeting | supervisor terminates in ≤2 hops without tool spam |
| Parallel doc ingest | 3 shards merged; partial failure surfaced |
| Policy violation draft | critic loops once then HITL flag |
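The scenario assertions can be expressed as a checker over a recorded run. This is a sketch: `RunTrace`, `TOOL_ACL`, and the tool names `get_order` and `search_policy` are hypothetical, and a real harness would build the trace from exported spans.

```python
from dataclasses import dataclass, field


@dataclass
class RunTrace:
    agent_sequence: list[str]
    tool_calls: list[tuple[str, str]] = field(default_factory=list)  # (agent, tool)


# Hypothetical per-agent tool allowlist; the writer has no tools at all.
TOOL_ACL = {"researcher": {"get_order"}, "policy": {"search_policy"}, "writer": set()}


def check_scenario(trace: RunTrace, expected_sequence: list[str],
                   max_handoffs: int) -> list[str]:
    """Return a list of violations; an empty list means the scenario passed."""
    problems = []
    if trace.agent_sequence != expected_sequence:
        problems.append(f"sequence {trace.agent_sequence} != {expected_sequence}")
    handoffs = max(len(trace.agent_sequence) - 1, 0)
    if handoffs > max_handoffs:
        problems.append(f"{handoffs} handoffs exceeds limit {max_handoffs}")
    for agent, tool in trace.tool_calls:
        if tool not in TOOL_ACL.get(agent, set()):
            problems.append(f"{agent} called unauthorized tool {tool}")
    return problems
```

Returning all violations at once, rather than failing on the first, makes CI output more useful when a prompt change breaks both routing and tool ACLs.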
Degrade paths
- Agent node failure — circuit breaker → single-agent FAQ or cached macro response.
- Cost cap hit — return partial answer + "continue in human queue" token.
- Handoff limit — force writer with best-effort facts; log incomplete state for review.
- Model outage — fallback model per agent tier (mini for router, full for writer).
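class CostCapExceeded(Exception):
    pass


class HandoffLimitExceeded(Exception):
    pass


class AgentGateway:
    # Sketch: CircuitBreaker, CB_CLOSED, and fallback_response are assumed to be
    # defined elsewhere, and each streamed event is assumed to carry
    # "usage_usd", "handoff", and "node" keys.
    def __init__(self, max_usd: float = 0.50, max_handoffs: int = 8):
        self.max_usd = max_usd
        self.max_handoffs = max_handoffs
        self.breakers: dict[str, CircuitBreaker] = {}

    async def run(self, graph, state, thread_id: str):
        spend = 0.0
        handoffs = 0
        last_event = None  # avoids a NameError when the stream yields nothing
        config = {"configurable": {"thread_id": thread_id}}
        async for event in graph.astream(state, config):
            spend += event.get("usage_usd", 0)
            handoffs += event.get("handoff", 0)
            if spend > self.max_usd:
                raise CostCapExceeded(spend)
            if handoffs > self.max_handoffs:
                raise HandoffLimitExceeded(handoffs)
            node = event.get("node")
            if node and not self.breakers.get(node, CB_CLOSED).allow():
                return fallback_response(state)
            last_event = event
        return last_event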
public class AgentGateway {
    private final double maxUsd;
    private final int maxHandoffs;
    private final Map breakers = new ConcurrentHashMap<>();

    public Mono run(StateGraph graph, AgentState state, String threadId) {
        AtomicDouble spend = new AtomicDouble();
        AtomicInteger handoffs = new AtomicInteger();
        return graph.stream(state, threadId)
            .takeWhile(ev -> spend.addAndGet(ev.usageUsd()) <= maxUsd
                && handoffs.addAndGet(ev.handoffs()) <= maxHandoffs)
            .last()
            .onErrorResume(CostCapExceeded.class, e -> Mono.just(fallback(state)));
    }
}
📦 Production checklist: you can diagram the team, name max cost per request, replay a checkpoint, and prove evals cover routing—not just final answer quality.
Run agent workers as Knative or Deployment with KEDA scale-from-queue; isolate sandbox executors in separate namespace with NetworkPolicy deny-all egress.
Terraform agent gateway: API Gateway + Lambda/ECS task per graph version; SSM params for model keys; CloudWatch alarms on handoffs_p99 and cost_per_request_p95.
"How do you prevent runaway multi-agent loops?" — Hard caps (handoffs, tokens, time), stagnation detection, circuit breakers, structured routing tests, HITL on writes, cost attribution per agent.
Enterprise agent platforms expose per-tenant max_daily_spend_usd and email ops when any single request exceeds 3× p95—catching runaway critic loops before finance notices.
Treat inter-agent messages as untrusted input—one compromised or jailbroken worker must not pass instructions that escalate privileges to another agent's tool set.
Tag OpenTelemetry spans with agent_id and graph_version—FinOps dashboards per team surface which crew iteration blew the monthly inference budget.
Strict circuit breakers improve availability but may hide degrading tool quality—pair breakers with synthetic canary requests every five minutes per agent node.
Logging full prompts for "debuggability" in multi-agent flows leaks PII across retention systems—log state hashes, agent transitions, and token counts; fetch full replay only from encrypted checkpoint store on incident.
Gateway cost caps work best as streaming accumulators—update spend after each node completes using provider usage headers, not post-hoc billing exports that arrive hours late.