Multi-agent systems & orchestration

Why multi-agent systems

A single LLM agent can plan, retrieve, call APIs, and write prose—but stuffing every capability into one system prompt creates context pressure, tool sprawl, and reasoning overload. Multi-agent design trades coordination overhead for narrower prompts, scoped tool allowlists, and parallel execution where subtasks are independent.

The default architecture for most products should remain one agent with a well-designed tool router. You escalate to multiple agents when evals show a measurable gap: one role consistently pollutes another's context (policy facts mixed with creative marketing copy), subtasks are embarrassingly parallel (parse ten PDFs independently), or governance requires hard separation (research agent has read-only DB; writer agent has no SQL tools at all).

Context and skill limits

Every agent turn pays for the full transcript plus tool definitions. A monolithic support agent might carry: system instructions (2K tokens), twelve tool schemas (4K), conversation history (8K), retrieved policy chunks (6K), and scratch notes from prior tool rounds (3K)—before the model reasons about the user's latest message. Specialists shrink each slice: the researcher sees order APIs and compact facts; the writer sees facts + tone guide, not raw JSON from five failed tool attempts.

Pressure	Single-agent symptom	Multi-agent mitigation
Context window	History + tools + RAG exceed budget; middle gets lost	Shared state slots; pass summaries not full chat replay
Skill dilution	Model "forgets" refund rules while drafting empathetic tone	Policy agent with narrow system prompt + tools
Tool confusion	Calls issue_refund when user asked for status	Writer has zero write tools; executor is separate node
Debuggability	One blob log; unclear which step hallucinated	Per-agent spans: supervisor → researcher → writer

Parallelization wins

When subtasks do not depend on each other's outputs, fan-out beats serial ReAct. Example: competitive intelligence brief—Agent A scrapes pricing pages, Agent B summarizes SEC filings, Agent C extracts features from docs; an aggregator merges structured JSON. Latency becomes max(subtask) plus merge—not sum. The orchestrator must define a merge contract (schema, conflict rules) so parallel outputs compose deterministically.
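
The latency claim above can be illustrated with a minimal asyncio sketch. The subtask names and sleep durations are placeholders standing in for real agent calls, and the merge step is a simple keyed dictionary rather than a full merge contract.

```python
import asyncio
import time

async def run_subtask(name: str, seconds: float) -> dict:
    """Stand-in for one specialist agent; sleeps instead of calling an LLM."""
    await asyncio.sleep(seconds)
    return {"agent": name, "status": "ok", "payload": {"summary": f"{name} done"}}

async def fan_out(subtasks: dict) -> tuple:
    """Run all subtasks concurrently and merge their payloads by agent name."""
    started = time.perf_counter()
    results = await asyncio.gather(*(run_subtask(n, s) for n, s in subtasks.items()))
    merged = {r["agent"]: r["payload"] for r in results}
    return merged, time.perf_counter() - started

merged, elapsed = asyncio.run(fan_out({"pricing": 0.1, "filings": 0.1, "features": 0.1}))
print(sorted(merged))  # ['features', 'filings', 'pricing']
print(f"{elapsed:.2f}s elapsed (the three sleeps sum to 0.30s)")
```

Because the three coroutines overlap, elapsed time should sit near the slowest subtask rather than the sum of all three.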

When one agent is enough

Linear workflows with <5 tool types and predictable routing.
Latency SLO <3s—multi-agent handoffs add 1–3 LLM calls minimum.
Team lacks tracing infrastructure to debug inter-agent message buses.
Eval shows single-agent with better prompts matches specialist accuracy.

Decision rubric

Signal	Lean single-agent	Lean multi-agent
Tool count	<8 well-grouped tools	>15 tools across domains
Compliance	Uniform RBAC on gateway	Role-based tool ACLs per agent
Parallelism	None needed	Independent subtasks >2
Iteration style	Fixed pipeline	Dynamic routing / debate loops

from dataclasses import dataclass, field
from typing import Any

@dataclass
class TeamState:
    request_id: str
    user_message: str
    order_facts: dict[str, Any] = field(default_factory=dict)
    policy_facts: dict[str, Any] = field(default_factory=dict)
    draft_answer: str = ""
    active_agent: str = "supervisor"
    handoff_count: int = 0
    max_handoffs: int = 8

def researcher_prompt(state: TeamState) -> str:
    return f"""Look up order details. User: {state.user_message}
Known facts: {state.order_facts or 'none'}"""

def writer_prompt(state: TeamState) -> str:
    return f"""Write customer reply using ONLY these facts:
order={state.order_facts}
policy={state.policy_facts}"""

import java.util.Map;

public record TeamState(
    String requestId,
    String userMessage,
    Map<String, Object> orderFacts,
    Map<String, Object> policyFacts,
    String draftAnswer,
    String activeAgent,
    int handoffCount,
    int maxHandoffs
) {
  public static String researcherPrompt(TeamState s) {
    return "Look up order. User: " + s.userMessage()
        + "\nKnown: " + s.orderFacts();
  }
  public static String writerPrompt(TeamState s) {
    return "Write reply using order=" + s.orderFacts()
        + " policy=" + s.policyFacts();
  }
}

💡 Pro Tip

Start with a single agent and a structured TeamState object even before splitting prompts—if you cannot name what each slot stores, multi-agent will duplicate work.

⚖️ Trade-off

Multi-agent adds 2–5× LLM calls per user request. Parallel fan-out saves wall-clock time but multiplies peak token spend by the number of agents running at once, unless you cap concurrent agents.

⚠️ Pitfall

Spinning up five agents because the demo looked impressive—without eval proof—usually increases cost and failure modes while accuracy stays flat.

🎯 Interview Tip

"When would you use multi-agent?" — Default single agent + tools. Multi-agent when tool ACLs, parallel subtasks, or context isolation measurably improve evals. Mention bounded handoffs and shared state schema.

📦 Real World

Legal-tech copilots split clause retrieval (RAG agent, read-only) from drafting (no DB tools) after audits showed combined agents issued citations from wrong tenant's corpus twice in shadow traffic.

💰 Cost

Parallel three-agent research with 8K context each costs ~3× single-agent sequential—but wall-clock drops from 45s to 18s. Price the premium tier accordingly; batch offline jobs tolerate higher spend than chat.

🔬 Under the Hood

Specialization works because smaller tool schemas reduce mistaken function calls. Tool-selection accuracy tends to degrade once a single request carries more than roughly 10–15 tools (treat this as a rule of thumb and confirm it on your own evals); splitting tools across agents restores precision.

Bridge from agent memory

The prior guide (Agent memory) covered what to persist in the loop vs trim each round. Multi-agent adds a rule: write to shared state slots rather than broadcasting full transcripts. Working memory lives in the TeamState fact slots (order_facts, policy_facts); long-term memory stays in external stores keyed by user_id. Agents read summaries injected at handoff, not each other's raw tool JSON.

Orchestration patterns

Production multi-agent systems reuse a small set of topology patterns. Pick the one that minimizes coordination friction: agents should know when to act, when to wait, and what schema they hand off—not broadcast unstructured chat forever.

Supervisor + workers

A supervisor (router node or LLM) reads shared state and delegates to specialists: researcher, coder, reviewer. Workers return structured updates to state; the supervisor decides the next hop or terminates. This mirrors a manager assigning tickets: predictable, auditable, and easy to cap with max_handoffs. Risk: the supervisor becomes a bottleneck and may route incorrectly; mitigate with deterministic pre-routing rules before LLM routing.

Peer-to-peer (network)

Agents message each other without a central boss—debate, negotiate, or request clarifications. Powerful for open-ended research but hard to terminate: conversations drift, duplicate retrieval, or ping-pong blame. Use only with strict turn limits, stagnation detection (same content hash twice), and a global budget clock.
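
The three safeguards named above (turn limit, repeated-content hash, global budget clock) could be combined in one small guard object. This is a sketch only: the class name ConversationGuard, its defaults, and the stop-reason strings are illustrative assumptions, not part of any framework.

```python
import hashlib
import time
from typing import Optional

class ConversationGuard:
    """Stops a peer-to-peer exchange on repeated content, turn limit, or deadline."""

    def __init__(self, max_turns: int = 6, budget_seconds: float = 60.0):
        self.max_turns = max_turns
        self.deadline = time.monotonic() + budget_seconds
        self.turns = 0
        self.seen = set()

    def check(self, message: str) -> Optional[str]:
        """Return a stop reason, or None if the conversation may continue."""
        self.turns += 1
        # Normalise case and whitespace so those differences alone do not hide a repeat.
        normalised = " ".join(message.lower().split())
        digest = hashlib.sha256(normalised.encode()).hexdigest()
        if digest in self.seen:
            return "stagnation"
        self.seen.add(digest)
        if self.turns >= self.max_turns:
            return "turn_limit"
        if time.monotonic() > self.deadline:
            return "budget_exceeded"
        return None

guard = ConversationGuard(max_turns=4)
print(guard.check("Agent A: pricing is $10"))   # None
print(guard.check("Agent B: source?"))          # None
print(guard.check("agent a:  pricing is $10"))  # stagnation
```

An exact-hash check only catches verbatim repeats; paraphrased loops would need a similarity measure, which this sketch does not attempt.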

Pipeline (sequential)

Fixed DAG: ingest → extract → validate → summarize → publish. Each stage is an agent or tool node; output of stage N is input to N+1. Best for ETL-style workflows where order never changes. Easy to test per stage; failure handling is retry or dead-letter per node.

Parallel fan-out / gather

Map phase: spawn N workers on partitioned inputs. Reduce phase: aggregator merges with schema validation and deduplication. LangGraph Send API and Celery task groups implement the same idea. Watch for partial failures—define whether merge proceeds with missing shards or fails closed.

Critic / judge loop

Generator produces draft; critic scores against rubric (citations present, policy compliant, tone); conditional edge loops back or exits. Same pattern as Self-RAG reflection—applied to multi-agent teams. Cap iterations (typically 2–3); critics should output structured pass/fail + fix instructions, not prose essays that bloat context.

Pattern	Best for	Watch out for
Supervisor + workers	Support triage, research teams, tool ACL separation	Supervisor mis-routing; over-handoff
Peer-to-peer	Brainstorming, adversarial review	Non-termination, duplicated work
Pipeline	Doc processing, codegen → review → deploy	Rigid; hard to skip stages dynamically
Parallel	Batch analysis, multi-source gather	Merge conflicts, partial failures
Critic / judge	Regulated outputs, citation-heavy answers	Runaway refine loops, cost explosion

Pattern selection flow

Can one agent with scoped tools pass eval? If yes, stop here.
Are subtasks independent? → Parallel + aggregator.
Is order fixed? → Pipeline.
Need dynamic delegation? → Supervisor + workers.
Need quality gate before ship? → Add critic loop on final stage only.

Critic / judge implementation sketch

The critic node returns structured JSON—never free-form "looks good." Pass/fail drives the conditional edge; fail appends fix_instructions to state for the generator's next pass.

Critic check	Pass criteria	On fail
Citation coverage	Every claim has source_id	Loop to researcher
Policy compliance	No banned refund phrases	Loop to writer with fixes
Format	Valid JSON schema	Loop to formatter node
Confidence	Self-score ≥ 0.85	Abstain or HITL
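
A minimal sketch of the loop described above, assuming a toy rubric (a source marker plus one banned phrase) and a stand-in generator in place of a real writer model. The names Verdict, critic, and refine are illustrative.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Verdict:
    passed: bool
    fix_instructions: List[str] = field(default_factory=list)

def critic(draft: str) -> Verdict:
    """Toy rubric: require a [source:...] marker and ban one phrase."""
    fixes = []
    if "[source:" not in draft:
        fixes.append("Add a source_id to every claim.")
    if "guaranteed refund" in draft.lower():
        fixes.append("Remove the phrase 'guaranteed refund'.")
    return Verdict(passed=not fixes, fix_instructions=fixes)

def refine(generate: Callable[[List[str]], str], max_iterations: int = 3) -> Tuple[str, Verdict, int]:
    """Loop generator -> critic until the critic passes or the cap is reached."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    fixes: List[str] = []
    for attempt in range(1, max_iterations + 1):
        draft = generate(fixes)
        verdict = critic(draft)
        if verdict.passed:
            return draft, verdict, attempt
        fixes = verdict.fix_instructions  # fed back to the generator's next pass
    return draft, verdict, max_iterations

def fake_generator(fixes: List[str]) -> str:
    # Stand-in for the writer LLM: it only cites a source once told to.
    if fixes:
        return "Your order shipped [source:order-api]."
    return "Your order shipped."

draft, verdict, attempts = refine(fake_generator)
print(attempts, verdict.passed)  # 2 True
```

When the cap is reached without a pass, the caller still receives the last verdict and can route to abstain or HITL instead of shipping the draft.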

Parallel gather merge contract

Define aggregator input schema upfront: {shard_id, status, payload, error}. Merge rules: dedupe by primary key, prefer newest timestamp on conflict, surface partial failure count in final answer metadata. Without this contract, parallel agents produce irreconcilable prose blobs.
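
One way to express these merge rules in code is sketched below. It assumes each payload is a list of records carrying a primary key field named id and an updated_at timestamp in ISO 8601 form (so string comparison orders them correctly); both field names are assumptions made for illustration.

```python
from typing import Any, Dict, List

def merge_shards(shards: List[Dict[str, Any]], key: str = "id") -> Dict[str, Any]:
    """Merge shard results shaped {shard_id, status, payload, error}."""
    records: Dict[Any, Dict[str, Any]] = {}
    failed: List[str] = []
    for shard in shards:
        if shard["status"] != "ok":
            failed.append(shard["shard_id"])
            continue
        for rec in shard["payload"]:
            current = records.get(rec[key])
            # Dedupe by primary key; on conflict keep the newest timestamp.
            if current is None or rec["updated_at"] > current["updated_at"]:
                records[rec[key]] = rec
    return {
        "records": sorted(records.values(), key=lambda r: r[key]),
        "meta": {"shards_total": len(shards), "shards_failed": failed},
    }

shards = [
    {"shard_id": "a", "status": "ok", "error": None, "payload": [
        {"id": "sku-1", "price": 10, "updated_at": "2024-05-01T00:00:00Z"}]},
    {"shard_id": "b", "status": "ok", "error": None, "payload": [
        {"id": "sku-1", "price": 12, "updated_at": "2024-06-01T00:00:00Z"}]},
    {"shard_id": "c", "status": "error", "error": "timeout", "payload": None},
]
result = merge_shards(shards)
print(result["records"][0]["price"], result["meta"]["shards_failed"])  # 12 ['c']
```

This version proceeds with missing shards and reports them in meta; a fail-closed variant would raise as soon as the failed list is non-empty.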

def route_supervisor(state: TeamState) -> str:
    if state.handoff_count >= state.max_handoffs:
        return "writer"  # force finish
    if not state.order_facts and "ORD-" in state.user_message.upper():
        return "researcher"
    if not state.policy_facts and any(k in state.user_message.lower() for k in ("refund", "policy")):
        return "policy"
    if state.order_facts and state.policy_facts:
        return "writer"
    return "researcher"

# LangGraph: add_conditional_edges("supervisor", route_supervisor, {...})

// Requires: import java.util.function.Function;
Function<TeamState, String> supervisorRouter = s -> {
    if (s.handoffCount() >= s.maxHandoffs()) return "writer";  // force finish
    var msg = s.userMessage();
    // Case-insensitive match, mirroring the Python router
    if (s.orderFacts().isEmpty() && msg.toUpperCase().contains("ORD-"))
        return "researcher";
    var lower = msg.toLowerCase();
    if (s.policyFacts().isEmpty() && (lower.contains("refund") || lower.contains("policy")))
        return "policy";
    if (!s.orderFacts().isEmpty() && !s.policyFacts().isEmpty())
        return "writer";
    return "researcher";
};

📦 Real World

Tier-2 support at a fintech runs supervisor → order lookup → policy RAG → writer, with deterministic routing for known intents and LLM supervisor only on ambiguous tickets—cuts mis-routes 40% vs pure LLM routing.

🔬 Under the Hood

Supervisor routing can be hybrid: regex and intent classifier first (free, fast), LLM supervisor only when confidence <0.7—same pattern as model cascades in Track 1.
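
The cascade could look roughly like the sketch below. The classifier and LLM router are passed in as plain callables so the example stays self-contained; their signatures, the order-ID regex, and the agent names are assumptions for illustration.

```python
import re
from typing import Callable, Tuple

ORDER_ID = re.compile(r"\bORD-\d+\b", re.IGNORECASE)

def hybrid_route(
    message: str,
    classify: Callable[[str], Tuple[str, float]],
    llm_route: Callable[[str], str],
    threshold: float = 0.7,
) -> Tuple[str, str]:
    """Return (next_agent, decided_by); cheap checks run before the LLM."""
    if ORDER_ID.search(message):
        return "researcher", "regex"
    label, confidence = classify(message)
    if confidence >= threshold:
        return label, "classifier"
    return llm_route(message), "llm"

def keyword_classifier(message: str) -> Tuple[str, float]:
    # Stand-in for a trained intent model returning (label, confidence).
    if "refund" in message.lower():
        return "policy", 0.9
    return "writer", 0.4

def stub_llm(message: str) -> str:
    # Stand-in for the LLM supervisor, reached only on low confidence.
    return "writer"

print(hybrid_route("Where is ORD-42?", keyword_classifier, stub_llm))  # ('researcher', 'regex')
print(hybrid_route("I want a refund", keyword_classifier, stub_llm))   # ('policy', 'classifier')
print(hybrid_route("hmm", keyword_classifier, stub_llm))               # ('writer', 'llm')
```

Returning the decided_by tag alongside the route makes it straightforward to measure how often the LLM tier is actually reached.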

💰 Cost

Peer-to-peer debate with four agents × five rounds = 20+ LLM calls per user question. Reserve for offline batch, not synchronous chat.

🔒 Security

Peer networks amplify prompt injection—one agent can pass malicious "observations" to another. Validate and sanitize all inter-agent payloads like external tool results.

LangGraph: nodes, edges, state, and checkpoints

LangGraph models agent workflows as a directed graph: nodes mutate shared state, edges define transitions, and conditional routing enables cycles (tool loop, critic refine) with explicit termination. Checkpointing and human-in-the-loop interrupts make long-running multi-agent flows production-durable.

Core concepts

State — TypedDict / Pydantic model / Java record holding messages, facts, flags. Reducers (e.g. operator.add on messages) define merge semantics.
Nodes — Pure functions state → partial state update. Each node should do one job: plan, call tools, summarize, route.
Edges — Fixed transitions (researcher → supervisor) or conditional (route(state) returns next node name).
Cycles — Deliberate loops for ReAct; must exit via iteration counter, success flag, or critic pass.

Conditional routing

add_conditional_edges("supervisor", route_fn, path_map) is the control plane. Route functions should be small, testable, and logged—avoid hiding routing inside LLM prose. Pattern: LLM proposes next agent as structured JSON; router validates against allowlist before transition.
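
The validate-before-transition step might be sketched as follows. The allowlist contents, the JSON field name next, and the choice of human as the fallback are assumptions made for this example.

```python
import json

ALLOWED_NEXT = {"researcher", "policy", "writer", "human"}

def validated_route(llm_output: str, fallback: str = "human") -> str:
    """Parse the LLM's routing proposal and accept it only if allowlisted."""
    try:
        proposal = json.loads(llm_output)
    except json.JSONDecodeError:
        return fallback
    next_agent = proposal.get("next") if isinstance(proposal, dict) else None
    if isinstance(next_agent, str) and next_agent in ALLOWED_NEXT:
        return next_agent
    return fallback

print(validated_route('{"next": "writer"}'))        # writer
print(validated_route('{"next": "issue_refund"}'))  # human
print(validated_route("go to the writer please"))   # human
```

Falling back to a safe node on malformed or unlisted output keeps a prompt-injected agent name from ever becoming a graph transition.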

Human-in-the-loop (HITL)

interrupt_before=["human"] pauses graph before executing the human node—state persists in checkpointer. UI shows pending tool write; operator approves; graph.invoke(None, config) resumes. Use for refunds, deploys, PII exports—not for every turn.

Checkpointing

MemorySaver for dev; Postgres / Redis checkpointers for prod. Thread ID = conversation or ticket ID. Enables: crash recovery, time-travel debug ("what was state before bad tool call?"), and audit replay for compliance.

Feature	Dev default	Production
Checkpointer	MemorySaver	PostgresSaver with TTL
Thread key	UUID per session	tenant_id + ticket_id
Streaming	graph.stream	SSE to UI per node completion
Parallel sends	Manual threads	LangGraph Send API

from typing import Annotated, Literal, TypedDict
import operator
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.memory import MemorySaver

class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    facts: dict
    iteration: int
    needs_human: bool

def researcher(state: AgentState) -> AgentState:
    # call tools, append messages, update facts
    return {"messages": [{"role": "assistant", "content": "..."}],
            "facts": {**state["facts"], "order_id": "ORD-42"}}

def supervisor(state: AgentState) -> AgentState:
    return {"iteration": state["iteration"] + 1}

def route(state: AgentState) -> Literal["researcher", "writer", "human", "__end__"]:
    if state.get("needs_human"):
        return "human"
    if state["iteration"] >= 6:
        return "__end__"
    if "order_id" not in state.get("facts", {}):
        return "researcher"
    return "writer"

builder = StateGraph(AgentState)
builder.add_node("supervisor", supervisor)
builder.add_node("researcher", researcher)
builder.add_node("writer", lambda s: {"messages": [{"role": "assistant", "content": "done"}]})
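# HITL interrupt target. Return only the cleared flag: echoing the whole state
# would re-append messages through the operator.add reducer and leave
# needs_human set, sending the graph straight back to this node.
builder.add_node("human", lambda s: {"needs_human": False})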
builder.add_edge(START, "supervisor")
builder.add_conditional_edges("supervisor", route)
builder.add_edge("researcher", "supervisor")
builder.add_edge("writer", END)
builder.add_edge("human", "supervisor")

checkpointer = MemorySaver()
graph = builder.compile(checkpointer=checkpointer, interrupt_before=["human"])

config = {"configurable": {"thread_id": "ticket-991"}}
initial_state = {"messages": [], "facts": {}, "iteration": 0, "needs_human": False}
for event in graph.stream(initial_state, config):
    print(event)
# Resume after human approval: graph.invoke(None, config)

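// Spring AI + custom graph runner (LangGraph4j-style sketch)
// Requires: import java.util.HashMap; import java.util.List; import java.util.Map;
record AgentState(List<String> messages, Map<String, Object> facts,
                  int iteration, boolean needsHuman) {}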

enum Route { RESEARCHER, WRITER, HUMAN, END }

Route route(AgentState state) {
  if (state.needsHuman()) return Route.HUMAN;
  if (state.iteration() >= 6) return Route.END;
  if (!state.facts().containsKey("order_id")) return Route.RESEARCHER;
  return Route.WRITER;
}

AgentState researcher(AgentState state) {
  var facts = new HashMap<>(state.facts());
  facts.put("order_id", "ORD-42");
  return new AgentState(state.messages(), facts, state.iteration(), false);
}

// GraphRunner: Map<String, Function<AgentState, AgentState>> nodes
// CheckpointStore.save(threadId, state) before HITL interrupt

💡 Pro Tip

Unit-test route(state) with table-driven cases before testing full LLM nodes—routing bugs cause 80% of "agent went rogue" incidents.
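
A table-driven test for the route function shown earlier in this section could look like the sketch below. The function is restated over a plain dict so the example runs without LangGraph installed; the case descriptions are illustrative.

```python
def route(state: dict) -> str:
    """Same branching as the LangGraph route function, on a plain dict."""
    if state.get("needs_human"):
        return "human"
    if state["iteration"] >= 6:
        return "__end__"
    if "order_id" not in state.get("facts", {}):
        return "researcher"
    return "writer"

CASES = [
    # (description, state, expected next node)
    ("human flag wins over the cap", {"needs_human": True, "iteration": 9, "facts": {}}, "human"),
    ("iteration cap terminates", {"iteration": 6, "facts": {}}, "__end__"),
    ("missing order id goes to researcher", {"iteration": 1, "facts": {}}, "researcher"),
    ("complete facts go to writer", {"iteration": 1, "facts": {"order_id": "ORD-42"}}, "writer"),
]

def run_cases() -> list:
    """Return descriptions of failing cases; an empty list means all passed."""
    return [desc for desc, state, expected in CASES if route(state) != expected]

print(run_cases())  # []
```

Each new routing branch should add a row to CASES, so precedence between rules (for example, the human flag versus the iteration cap) stays pinned down.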

🔬 Under the Hood

LangGraph compiles to a Pregel-style superstep runner: each node reads latest state snapshot, writes partial update, reducer merges—enables parallel node execution when graph topology allows.

⚙️ Config

Set recursion_limit in config (default 25) lower in prod—e.g. 12—for multi-agent graphs with cycles. Pair with application-level max_handoffs.

⚠️ Pitfall

Storing full message history in state without trimming means every cycle re-sends a longer transcript, so cumulative token spend grows roughly quadratically with the number of cycles. Persist structured facts in the facts dict instead of replaying entire tool transcripts to every agent.

🎯 Interview Tip

"Explain LangGraph vs a while loop." — Graph gives named nodes, conditional edges, checkpointing, HITL interrupts, and visual debug. While loop is fine for demos; graphs scale observability and resume.

📦 Real World

Support platforms store LangGraph checkpoints in Postgres keyed by tenant_id:ticket_id—agents resume after worker pod restart without re-running expensive research nodes.

💰 Cost

Checkpoint writes add DB I/O but save re-execution after HITL—one avoided re-research pass pays for thousands of checkpoint rows.

🔒 Security

Checkpoint stores hold full state including tool outputs—encrypt at rest, scope by tenant, TTL expired threads, never replay prod checkpoints into dev environments with live keys.

Parallel fan-out with Send

LangGraph's Send API dispatches the same node with different state slices—map over document IDs, run researcher on each, reducer merges into facts["shards"]. Combine with supervisor pattern: supervisor emits Send list; gather node validates all shards before writer runs.

Mapping patterns to LangGraph primitives

Pattern	LangGraph building blocks
Supervisor + workers	Conditional edges from supervisor node; workers return to supervisor
Pipeline	Linear edges A → B → C → END
Parallel	Send API + gather reducer node
Critic loop	Cycle writer → critic → conditional → writer or END
HITL	interrupt_before + checkpoint resume

AutoGen: AssistantAgent, UserProxyAgent, GroupChat

Microsoft AutoGen (and AutoGen AgentChat) centers on conversable agents that exchange messages until a termination condition. AssistantAgent calls the LLM; UserProxyAgent stands in for the human or executes code; GroupChat rotates speakers via a manager—peer-style orchestration with less explicit graph wiring than LangGraph.

AssistantAgent

Wraps LLM config, system message, and optional tools. Receives messages, returns assistant messages or tool calls. You scope capabilities per assistant: one for Python coding, one for SQL read-only, one for summarization.

UserProxyAgent

Can proxy human input (human_input_mode="ALWAYS" for HITL) or auto-approve (NEVER) for batch jobs. Often paired with AssistantAgent in two-agent coding loops: assistant writes code, user proxy executes in sandbox.

GroupChat + GroupChatManager

Register multiple assistants; manager selects next speaker (round-robin, LLM-picked, or custom). Good for prototyping debate and research teams. Production requires: max_round, speaker allowlists, and message filtering so agents do not leak secrets across roles.

Code execution sandbox

AutoGen popularized LLM-generated code → Docker/local executor. Treat as high risk: network-isolated container, no prod credentials, CPU/time limits, stdout size caps. Prefer structured tool APIs over arbitrary code when possible.

from autogen import AssistantAgent, UserProxyAgent

assistant = AssistantAgent(
    name="coder",
    llm_config={"model": "gpt-4o-mini", "temperature": 0},
    system_message="Write Python. Wrap code in ```python blocks.",
)
user_proxy = UserProxyAgent(
    name="executor",
    human_input_mode="NEVER",
    max_consecutive_auto_reply=3,
    code_execution_config={"work_dir": "sandbox", "use_docker": True},
)

user_proxy.initiate_chat(
    assistant,
    message="Plot histogram of sample.csv column 'amount'",
)

// Illustrative Java mirror of the Assistant + ToolExecutor pattern (hypothetical builder API,
// e.g. built on Spring AI; not an official AutoGen SDK)
// UserProxy ≈ HumanApprovalGateway or SandboxedScriptRunner
var coder = AssistantAgent.builder()
    .name("coder")
    .systemMessage("Write Java tests only.")
    .model("gpt-4o-mini")
    .build();
var executor = UserProxyAgent.builder()
    .humanInputMode(HumanInputMode.NEVER)
    .maxConsecutiveAutoReply(3)
    .codeExecution(SandboxConfig.docker("work-dir"))
    .build();
executor.initiateChat(coder, "Generate unit test for RefundService");

⚖️ Trade-off

GroupChat feels fast to demo but speaker-selection LLM adds hidden cost every round. LangGraph explicit edges are easier to cap and audit.

📦 Real World

Data science teams use AutoGen UserProxy + Docker for notebook-style exploration internally; customer-facing paths use fixed LangGraph pipelines with no arbitrary code exec.

🔒 Security

Never run UserProxy with use_docker: False on server hosting secrets. Sandbox egress should be deny-by-default.

💰 Cost

Coding loops: assistant draft + executor error feedback + assistant fix = 3–8 calls per task. Set max_consecutive_auto_reply aggressively in prod.

⚠️ Pitfall

GroupChat manager LLM picking speakers unpredictably makes reproducing bugs impossible—log speaker selection rationale and freeze manager prompt version in eval fixtures.

🎯 Interview Tip

"AssistantAgent vs UserProxyAgent?" — Assistant reasons and proposes; UserProxy executes code or represents human approval. Separation keeps codegen out of the same turn as arbitrary shell access without a gate.

Framework fit summary

AutoGen shines when conversation-shaped collaboration is the product requirement—research dialogues, coding pairs, internal experimentation. It is weaker when you need deterministic DAGs, strict SLAs, or fine-grained checkpoint audit—reach for LangGraph instead.

CrewAI: roles, tasks, and crews

CrewAI optimizes for readable role definitions: each agent has a role, goal, and backstory; tasks bind agents to outputs; a Crew runs tasks sequentially or hierarchically (a manager delegates). Ideal when non-engineers need to edit agent personas in config files.

Researcher / writer / reviewer crew

Classic content pipeline: Researcher gathers facts with search tools; Writer drafts from research output only; Reviewer checks citations and tone, sends back fixes. CrewAI sequential process enforces order without writing graph boilerplate—under the hood it is still a pipeline with shared context passed forward.

Task outputs as contracts

Each Task should declare expected_output and optionally output_json / Pydantic model. Downstream tasks reference prior task outputs via context=[research_task]—this is your schema blackboard.

Process modes

Process	Behavior	Maps to pattern
Sequential	Task1 → Task2 → Task3	Pipeline
Hierarchical	Manager agent assigns tasks	Supervisor + workers
Consensual (custom)	Debate until agreement	Peer + critic (use caps)

Tools per agent

Bind tools at agent construction—not globally. Researcher gets Serper + scrape; Writer gets none; Reviewer gets policy lookup only. Matches tool ACL pattern from Track 4 tool-calling guide.

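from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool, ScrapeWebsiteTool  # assumes the crewai-tools package is installed

search_tool = SerperDevTool()      # reads SERPER_API_KEY from the environment
scrape_tool = ScrapeWebsiteTool()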

researcher = Agent(
    role="Senior Research Analyst",
    goal="Find verified facts about {topic}",
    backstory="You cite primary sources only.",
    tools=[search_tool, scrape_tool],
    verbose=True,
)
writer = Agent(
    role="Content Writer",
    goal="Draft article from research brief",
    backstory="No new facts—only provided research.",
)
reviewer = Agent(
    role="Compliance Reviewer",
    goal="Ensure citations and banned phrase check",
    backstory="Reject uncited claims.",
)

research = Task(description="Research {topic}", agent=researcher,
                expected_output="Bullet list with URLs")
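draft = Task(description="Write 800 words", agent=writer, context=[research],
             expected_output="800-word article that uses only the research brief")
review = Task(description="Review draft", agent=reviewer, context=[draft],
              expected_output="Pass/fail verdict with a list of required fixes")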

crew = Crew(agents=[researcher, writer, reviewer], tasks=[research, draft, review],
            process=Process.sequential)
result = crew.kickoff(inputs={"topic": "EU AI Act summary"})

// CrewAI is Python-first; Java teams mirror with Spring AI + explicit Task DTOs
record TaskSpec(String id, String agentRole, String description, List<String> dependsOn) {}

List<TaskSpec> pipeline = List.of(
    new TaskSpec("research", "researcher", "Gather facts", List.of()),
    new TaskSpec("draft", "writer", "Write article", List.of("research")),
    new TaskSpec("review", "reviewer", "Compliance pass", List.of("draft"))
);
// Orchestrator runs in topological order, passing TaskOutput maps

💡 Pro Tip

Keep backstories short: every backstory token is re-sent with each task. Put stable rules in shared expected_output templates, not duplicated persona prose.

⚠️ Pitfall

Hierarchical process with manager LLM picking tasks duplicates supervisor routing costs—use sequential when order is known.

🔬 Under the Hood

CrewAI agents run ReAct internally when tools attached—your crew-level sequential process wraps inner tool loops; count total LLM calls = tasks × avg tool rounds.

🎯 Interview Tip

"CrewAI vs LangGraph?" — CrewAI: rapid role/task DSL for pipelines. LangGraph: arbitrary graphs, cycles, HITL, checkpointing. Many teams prototype in CrewAI, port critical paths to LangGraph.

📦 Real World

Marketing teams edit CrewAI YAML roles without touching Python—platform engineers wrap crew.kickoff() in a FastAPI job with cost caps and S3 artifact storage for drafts.

🔒 Security

Researcher tools (web search) can return injected content—sanitize before Writer sees it; Reviewer should not fetch URLs directly unless allowlisted.

⚖️ Trade-off

CrewAI readability trades explicit graph visibility—you may not see cycles until runtime. Add explicit task timeouts and output schema validation at crew boundaries.

OpenAI Swarm: handoffs with context preserved

Swarm (OpenAI experimental SDK) is a minimal orchestration layer: agents expose functions that return other agents (handoffs). The client runtime switches active agent while preserving conversation context—lightweight triage without a full graph framework.

Handoff mechanics

Agent A handles general intake. Its tool transfer_to_billing returns Agent B instance (or name). Runtime appends handoff metadata; Agent B continues with prior messages visible—user does not repeat context. Contrast with explicit state slots: Swarm keeps transcript continuity; you still should trim tool noise for token control.

When Swarm fits

Fast triage prototypes: sales vs support vs technical.
Small agent count (<5) with clear handoff points.
Teams already on OpenAI SDK wanting minimal deps.

When to graduate to LangGraph

Need checkpoint/resume, parallel fan-out, or critic cycles.
Complex RBAC on which agent may hand off to whom.
Multi-provider routing (Bedrock + OpenAI) in one workflow.

Context preservation caveats

Preserved context includes prior tool failures and PII from earlier agents—implement handoff filters that redact or summarize before sensitive agents (billing) receive transcript. Treat handoff like passing audit log, not raw memory dump.
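
A handoff filter along these lines could be sketched as below. The two regexes are deliberately simple placeholders (real PII detection needs far more than this), and the function name and the keep_last default are assumptions.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
LONG_DIGITS = re.compile(r"\b\d{13,16}\b")  # crude stand-in for card-like numbers

def filter_for_handoff(messages: list, keep_last: int = 4) -> list:
    """Drop tool noise, keep only recent turns, and redact obvious PII."""
    conversational = [m for m in messages if m["role"] in ("user", "assistant")]
    filtered = []
    for m in conversational[-keep_last:]:
        text = EMAIL.sub("[email]", m["content"])
        text = LONG_DIGITS.sub("[number]", text)
        filtered.append({"role": m["role"], "content": text})
    return filtered

transcript = [
    {"role": "user", "content": "I'm ana@example.com and was charged twice"},
    {"role": "tool", "content": '{"error": "timeout"}'},
    {"role": "assistant", "content": "Transferring you to billing."},
]
for message in filter_for_handoff(transcript):
    print(message)
```

The receiving agent still sees a continuous conversation, but failed tool payloads and raw identifiers stay behind with the agent that produced them.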

from openai import OpenAI

client = OpenAI()

triage = {
    "name": "Triage",
    "instructions": "Route billing or technical issues.",
    "tools": [{
        "type": "function",
        "function": {
            "name": "transfer_to_billing",
            "description": "Hand off billing questions",
            "parameters": {"type": "object", "properties": {}},
        },
    }],
}

def handle_tool(name: str, active_agent: str):
    if name == "transfer_to_billing":
        return "Billing", {"handoff": True, "preserve_context": True}
    return active_agent, {}

# Agents SDK: handoff updates active agent; messages array continues

// Java: model handoff as enum + ConversationState.activeAgent
enum AgentId { TRIAGE, BILLING, TECHNICAL }

HandoffResult transferToBilling(ConversationState state) {
  var summary = summarizer.trimForHandoff(state.messages());
  return new HandoffResult(AgentId.BILLING, summary, true);
}

📦 Real World

SaaS onboarding bots use Swarm-style triage: general FAQ agent hands to provisioning agent when user mentions "setup API keys"—one continuous chat thread, no user-visible "you were transferred".

⚖️ Trade-off

Context preservation simplifies UX but increases token load—billing agent reads full triage tool spam. Insert summarization handoff payload for long threads.

🔒 Security

Handoffs must not bypass auth: billing agent should re-verify account scope even if triage agent saw user email in context.

💰 Cost

Each handoff often triggers a fresh LLM call with full history—monitor tokens per agent segment in traces.

Production: bounded scope, observability, and cost controls

Multi-agent systems fail in production from unbounded loops, untraceable handoffs, and spend spikes—not from wrong framework choice. Ship with explicit budgets, circuit breakers, structured telemetry per agent, and a fallback to single-agent or human.

Bounded scope

max_handoffs / max_iterations / recursion_limit — pick one app-level knob operators understand.
Tool allowlists per agent; deny inter-agent message tools unless explicitly designed.
Timeout per node (30s tool, 120s LLM); global request deadline enforced at gateway.
Stagnation detection: same state hash or empty delta twice → escalate or abort.

Observability

Trace shape: trace_id → supervisor_span → researcher_span → tool_span. Log structured handoffs (from, to, reason, state snapshot hash)—not full prompts in prod logs. Export metrics: handoffs_per_request, tokens_per_agent, critic_loop_count, human_interrupt_rate.
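
A structured handoff record with a state hash might be built as in this sketch. The field names follow the list above; the 16-character hash truncation and the function names are arbitrary choices for illustration.

```python
import hashlib
import json
import time

def state_hash(state: dict) -> str:
    """Stable short hash of the state; sort_keys makes key order irrelevant."""
    canonical = json.dumps(state, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def handoff_record(trace_id: str, src: str, dst: str, reason: str, state: dict) -> str:
    """Build one JSONL line describing a handoff, without prompts or raw state."""
    return json.dumps({
        "ts": time.time(),
        "trace_id": trace_id,
        "from": src,
        "to": dst,
        "reason": reason,
        "state_hash": state_hash(state),
    })

line = handoff_record("t-1", "supervisor", "researcher", "order id present",
                      {"order_facts": {"order_id": "ORD-42"}})
print(line)
```

Only the hash leaves the process, so the log line can be retained broadly while the full state stays in the checkpoint store.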

Circuit breaker

When error rate or latency for an agent node exceeds threshold, skip to fallback: cached answer, simpler single-agent path, or human queue. Per-agent breakers prevent one broken tool from stalling entire crew. Reset breaker after cooldown + synthetic health check.
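
A per-agent breaker can be quite small. The sketch below counts consecutive failures and, once the cooldown has elapsed, lets a probe call through; that probe plays the role of the health check mentioned above. The class shape, thresholds, and injectable clock are illustrative assumptions.

```python
import time
from typing import Callable, Optional

class CircuitBreaker:
    """Opens after consecutive failures; allows a probe once the cooldown elapses."""

    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if a call may proceed (breaker closed, or cooldown elapsed)."""
        if self.opened_at is None:
            return True
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        # A successful call (including a probe) closes the breaker.
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        # A failed probe re-opens the breaker and restarts the cooldown.
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

now = [0.0]  # fake clock so the example does not need to sleep
breaker = CircuitBreaker(failure_threshold=2, cooldown_seconds=10, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
print(breaker.allow())  # False
now[0] = 11.0
print(breaker.allow())  # True
```

Keeping one breaker instance per agent node means a failing researcher tool trips only the researcher path.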

Cost cap

Pre-request estimate: max_tokens × price × expected_agents. Mid-flight abort when spend > cap. Attribute cost by agent_id for FinOps—researcher embedding calls vs writer pure completion.
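
The estimate and the mid-flight abort could be sketched as follows. Prices here are made-up placeholders, not real provider rates, and SpendMeter and CostCapExceeded are hypothetical names for this example.

```python
class CostCapExceeded(RuntimeError):
    """Raised when accumulated spend crosses the per-request cap."""

def estimate_usd(max_tokens: int, usd_per_1k_tokens: float, expected_agents: int) -> float:
    """Worst-case estimate: every expected agent spends its full token budget."""
    return max_tokens / 1000 * usd_per_1k_tokens * expected_agents

class SpendMeter:
    """Accumulates spend per agent and aborts once the request cap is crossed."""

    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.by_agent = {}

    @property
    def total(self) -> float:
        return sum(self.by_agent.values())

    def add(self, agent_id: str, tokens: int, usd_per_1k_tokens: float) -> None:
        cost = tokens / 1000 * usd_per_1k_tokens
        self.by_agent[agent_id] = self.by_agent.get(agent_id, 0.0) + cost
        if self.total > self.cap_usd:
            raise CostCapExceeded(f"spent ${self.total:.4f}, cap ${self.cap_usd:.4f}")

print(round(estimate_usd(4000, 0.002, 3), 4))  # 0.024

meter = SpendMeter(cap_usd=0.01)
meter.add("researcher", 2000, 0.002)
meter.add("writer", 2000, 0.002)
try:
    meter.add("critic", 2000, 0.002)
except CostCapExceeded as exc:
    print("aborted:", exc)
```

The by_agent dictionary doubles as the attribution record, so the same object serves both the abort check and FinOps reporting.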

Production readiness matrix

Control	Minimum	Mature
Iteration limits	Fixed max_handoffs=8	Per-tenant configurable + alerts
Tracing	JSONL handoff log	OpenTelemetry + LangSmith/Arize
HITL	Manual approve writes	Role-based approval queues
Evals	Golden multi-agent scenarios	CI regression on routing + output
Fallback	Human queue	Degrade to single-agent FAQ

Pre-ship checklist

Document topology diagram (which pattern: supervisor, pipeline, parallel).
Every agent has named owner, tool list, and token budget.
Routing function has unit tests independent of LLM.
Checkpoint store sized with retention policy.
Incident runbook: how to disable one agent without killing fleet.

Golden scenario evals for multi-agent

Single-turn answer evals miss routing bugs. Maintain scenarios that assert: correct agent sequence, handoff count ≤ N, tools called by authorized agent only, and final JSON schema. Run in CI on prompt or graph version bumps—same discipline as Track 3 PromptFoo gates.

Scenario	Assert
Refund + order lookup	researcher → policy → writer; no write tools in writer
Ambiguous greeting	supervisor terminates in ≤2 hops without tool spam
Parallel doc ingest	3 shards merged; partial failure surfaced
Policy violation draft	critic loops once then HITL flag
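
Scenario assertions like these can be checked against a recorded trace. The sketch below assumes a trace is an ordered list of steps, each naming the agent and the tools it called; the TraceStep type, the tool names, and the TOOL_ACL contents are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TraceStep:
    agent: str
    tools_called: List[str]

# Hypothetical per-agent tool allowlist.
TOOL_ACL = {
    "researcher": {"get_order"},
    "policy": {"search_policy"},
    "writer": set(),
}

def check_scenario(trace: List[TraceStep], expected_sequence: List[str], max_handoffs: int) -> List[str]:
    """Return a list of violations; an empty list means the scenario passes."""
    violations = []
    sequence = [step.agent for step in trace]
    if sequence != expected_sequence:
        violations.append(f"sequence {sequence} != {expected_sequence}")
    handoffs = max(len(trace) - 1, 0)
    if handoffs > max_handoffs:
        violations.append(f"{handoffs} handoffs exceeds {max_handoffs}")
    for step in trace:
        unauthorized = set(step.tools_called) - TOOL_ACL.get(step.agent, set())
        if unauthorized:
            violations.append(f"{step.agent} called unauthorized tools {sorted(unauthorized)}")
    return violations

good = [TraceStep("researcher", ["get_order"]), TraceStep("policy", ["search_policy"]),
        TraceStep("writer", [])]
print(check_scenario(good, ["researcher", "policy", "writer"], max_handoffs=3))  # []
```

Because the checker works on recorded traces rather than live model calls, it can run in CI against stored fixtures without spending tokens.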

Degrade paths

Agent node failure — circuit breaker → single-agent FAQ or cached macro response.
Cost cap hit — return partial answer + "continue in human queue" token.
Handoff limit — force writer with best-effort facts; log incomplete state for review.
Model outage — fallback model per agent tier (mini for router, full for writer).

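class CostCapExceeded(RuntimeError):
    """Raised when accumulated spend crosses the per-request cap."""

class HandoffLimitExceeded(RuntimeError):
    """Raised when the graph hands off more often than allowed."""

class AgentGateway:
    # Assumes each streamed event is a dict with optional "node", "usage_usd",
    # and "handoff" keys (emitted by your nodes or an adapter), and that
    # CircuitBreaker and fallback_response are defined by the application.
    def __init__(self, max_usd: float = 0.50, max_handoffs: int = 8):
        self.max_usd = max_usd
        self.max_handoffs = max_handoffs
        self.breakers: dict[str, "CircuitBreaker"] = {}

    async def run(self, graph, state, thread_id: str):
        spend = 0.0
        handoffs = 0
        last_event = None  # stays None if the stream yields nothing
        config = {"configurable": {"thread_id": thread_id}}
        async for event in graph.astream(state, config):
            last_event = event
            spend += event.get("usage_usd", 0)
            handoffs += event.get("handoff", 0)
            if spend > self.max_usd:
                raise CostCapExceeded(spend)
            if handoffs > self.max_handoffs:
                raise HandoffLimitExceeded(handoffs)
            # A node without a registered breaker is treated as healthy.
            breaker = self.breakers.get(event.get("node"))
            if breaker is not None and not breaker.allow():
                return fallback_response(state)
        return last_event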

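// Sketch: StateGraph.stream, AgentEvent, CircuitBreaker, fallback(...), and the two
// unchecked exceptions are application-defined; AtomicDouble comes from Guava.
public class AgentGateway {
  private final double maxUsd;
  private final int maxHandoffs;
  private final Map<String, CircuitBreaker> breakers = new ConcurrentHashMap<>();

  public AgentGateway(double maxUsd, int maxHandoffs) {
    this.maxUsd = maxUsd;
    this.maxHandoffs = maxHandoffs;
  }

  public Mono<AgentEvent> run(StateGraph graph, AgentState state, String threadId) {
    AtomicDouble spend = new AtomicDouble();
    AtomicInteger handoffs = new AtomicInteger();
    return graph.stream(state, threadId)
        // Throw once a cap is crossed so the onErrorResume below can actually fire.
        .doOnNext(ev -> {
          if (spend.addAndGet(ev.usageUsd()) > maxUsd) throw new CostCapExceeded(spend.get());
          if (handoffs.addAndGet(ev.handoffs()) > maxHandoffs) throw new HandoffLimitExceeded(handoffs.get());
        })
        .last()
        .onErrorResume(CostCapExceeded.class, e -> Mono.just(fallback(state)));
  }
}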

💡 Pro Tip

Production checklist: you can diagram the team, name the max cost per request, replay a checkpoint, and prove evals cover routing, not just final answer quality.

☸️ OpenShift

Run agent workers as Knative or Deployment with KEDA scale-from-queue; isolate sandbox executors in separate namespace with NetworkPolicy deny-all egress.

🏗️ IaC

Terraform agent gateway: API Gateway + Lambda/ECS task per graph version; SSM params for model keys; CloudWatch alarms on handoffs_p99 and cost_per_request_p95.

🎯 Interview Tip

"How do you prevent runaway multi-agent loops?" — Hard caps (handoffs, tokens, time), stagnation detection, circuit breakers, structured routing tests, HITL on writes, cost attribution per agent.

📦 Real World

Enterprise agent platforms expose per-tenant max_daily_spend_usd and email ops when any single request exceeds 3× p95—catching runaway critic loops before finance notices.

🔒 Security

Treat inter-agent messages as untrusted input—one compromised or jailbroken worker must not pass instructions that escalate privileges to another agent's tool set.

💰 Cost

Tag OpenTelemetry spans with agent_id and graph_version—FinOps dashboards per team surface which crew iteration blew the monthly inference budget.

⚖️ Trade-off

Strict circuit breakers improve availability but may hide degrading tool quality—pair breakers with synthetic canary requests every five minutes per agent node.

⚠️ Pitfall

Logging full prompts for "debuggability" in multi-agent flows leaks PII across retention systems—log state hashes, agent transitions, and token counts; fetch full replay only from encrypted checkpoint store on incident.

🔬 Under the Hood

Gateway cost caps work best as streaming accumulators—update spend after each node completes using provider usage headers, not post-hoc billing exports that arrive hours late.