
Tool calling & function calling

Agents do not guess database rows—they request typed RPC through the model API. Tool calling (OpenAI) and tool use (Anthropic) let the LLM emit structured function invocations; your application validates arguments, executes handlers, and injects results back into the conversation until a final answer emerges. This guide is Guide 2 in Track 4: the contract layer every agent loop depends on before memory, graphs, or MCP.

After reading, you should be able to: diagram the schema→call→execute→result→response loop; configure OpenAI tools and tool_choice; map Anthropic tool_use / tool_result blocks; design single-purpose tools with typed params; classify read vs write tools with confirmation gates; run parallel read-only calls; inject results per provider; pause for human approval on high-stakes writes; and ship a tool registry with structured logging.


Architecture: schema → call → execute → result → response

Tool calling (also function calling) lets an LLM request typed side effects instead of hallucinating facts. Your application owns execution; the model only proposes structured calls against JSON Schema you register. Every production agent loop follows the same five-step rhythm—whether OpenAI tool_calls, Anthropic tool_use, or Bedrock Converse toolUse.

Think of tools as an internal RPC surface exposed to the model. You declare names, descriptions, and parameter schemas; the model returns arguments matching those schemas; your router validates, authorizes, executes, and feeds results back as new messages. The loop continues until the model emits a final natural-language answer or you hit iteration/time limits. This is the bridge from Track 1 (APIs) and Track 3 (prompting) into Track 4 (agents).

The five-step loop

  1. Schema registration — attach tool definitions (name, description, JSON Schema parameters) to the API request.
  2. Model call — send messages; model may return text, tool call(s), or both.
  3. Execute — your code parses arguments, checks allowlists, runs the handler (DB query, HTTP, compute).
  4. Result injection — append provider-specific tool-result messages tied to each call ID.
  5. Response — call the model again with updated history until finish_reason is not tool-related or max rounds reached.

Sequence diagram (conceptual)

User ──► App ──► LLM (tools registered)
              ◄── assistant: tool_call lookup_order(order_id="ORD-8821")
App ──► execute lookup_order ──► {status, items, total}
App ──► LLM (messages + tool result)
              ◄── assistant: "Your order shipped yesterday…"
App ──► User

Responsibility split

| Layer | Owns | Must not |
| --- | --- | --- |
| Model | Which tool(s) to call; argument values from context | Direct DB/network access; secrets |
| Tool router | Validation, authZ, timeouts, idempotency, audit | Trust model args without schema check |
| Handlers | Business logic; return JSON-serializable results | Return unbounded blobs or PII dumps |
| Agent runner | Iteration cap, parallel dispatch, HITL gates | Infinite loops on repeated tool errors |

Minimal loop (all providers, same shape)

End-to-end agent loop — register tools, execute, inject results
from openai import OpenAI
import json

TOOLS = [{
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Fetch order status by order ID (ORD- prefix).",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
            "additionalProperties": False,
        },
    },
}]

def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped", "eta_days": 1}

HANDLERS = {"lookup_order": lookup_order}

def run_agent(user: str, max_rounds: int = 5) -> str:
    client = OpenAI()
    messages = [{"role": "user", "content": user}]
    for _ in range(max_rounds):
        resp = client.chat.completions.create(
            model="gpt-4o-mini", tools=TOOLS, tool_choice="auto", messages=messages,
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content or ""
        messages.append(msg.model_dump(exclude_none=True))
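        for call in msg.tool_calls:
            handler = HANDLERS.get(call.function.name)  # allowlist: registry keys only
            try:
                args = json.loads(call.function.arguments)
                out = handler(**args) if handler else {"error": "unknown_tool"}
            except (json.JSONDecodeError, TypeError) as exc:
                # Malformed JSON or wrong argument names: report back instead of crashing.
                out = {"error": "invalid_arguments", "details": str(exc)}
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(out),
            })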
    raise RuntimeError("max tool rounds exceeded")
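// Spring AI ChatClient: ToolCallback registry maps name → handler.
// Assumes Spring AI 1.0-style APIs with internal tool execution disabled in the
// chat options, so tool calls come back to this loop; verify against your version.
public String runAgent(String user, int maxRounds) {
  List<Message> messages = new ArrayList<>(List.of(new UserMessage(user)));
  for (int i = 0; i < maxRounds; i++) {
    ChatResponse resp = chatClient.prompt(new Prompt(messages))
        .tools(orderTools).call().chatResponse();
    AssistantMessage assistant = resp.getResult().getOutput();
    if (!assistant.hasToolCalls()) return assistant.getText();
    messages.add(assistant);
    List<ToolResponseMessage.ToolResponse> responses = new ArrayList<>();
    for (AssistantMessage.ToolCall call : assistant.getToolCalls()) {
      String output = toolRouter.invoke(call.name(), call.arguments());
      responses.add(new ToolResponseMessage.ToolResponse(call.id(), call.name(), output));
    }
    messages.add(new ToolResponseMessage(responses));  // one message, one response per call ID
  }
  throw new IllegalStateException("max tool rounds exceeded");
}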
import anthropic, json

client = anthropic.Anthropic()
tools = [{
    "name": "lookup_order",
    "description": "Fetch order status by order ID.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

def run_agent(user: str, max_rounds: int = 5) -> str:
    messages = [{"role": "user", "content": user}]
    for _ in range(max_rounds):  # iteration cap, same as the OpenAI loop
        resp = client.messages.create(
            model="claude-3-5-sonnet-20241022", max_tokens=1024,
            tools=tools, messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})
        tool_uses = [b for b in resp.content if b.type == "tool_use"]
        if not tool_uses:
            return "".join(b.text for b in resp.content if b.type == "text")
        results = []
        for tu in tool_uses:
            out = lookup_order(**tu.input)  # handler defined in the OpenAI example
            results.append({
                "type": "tool_result",
                "tool_use_id": tu.id,
                "content": json.dumps(out),
            })
        # All tool_result blocks for this turn go into ONE user message.
        messages.append({"role": "user", "content": results})
    raise RuntimeError("max tool rounds exceeded")
// AnthropicMessageChatModel + ToolCallback
// tool_use blocks in assistant; tool_result in following user message
import boto3, json

client = boto3.client("bedrock-runtime")
tool_config = {"tools": [{"toolSpec": {
    "name": "lookup_order",
    "description": "Fetch order status",
    "inputSchema": {"json": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    }},
}}]}

def run_agent(user: str, max_rounds: int = 5) -> str:
    messages = [{"role": "user", "content": [{"text": user}]}]
    for _ in range(max_rounds):  # iteration cap, same as the OpenAI loop
        resp = client.converse(
            modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
            messages=messages, toolConfig=tool_config,
        )
        out_msg = resp["output"]["message"]
        messages.append(out_msg)
        tool_uses = [b for b in out_msg["content"] if "toolUse" in b]
        if not tool_uses:
            return "".join(b.get("text", "") for b in out_msg["content"])
        results = []
        for tu in tool_uses:
            use = tu["toolUse"]
            out = lookup_order(**use["input"])  # handler defined in the OpenAI example
            results.append({"toolResult": {
                "toolUseId": use["toolUseId"],
                "content": [{"json": out}],
            }})
        messages.append({"role": "user", "content": results})
    raise RuntimeError("max tool rounds exceeded")
// BedrockRuntimeClient.converse — toolConfig + toolResult blocks
// Spring AI BedrockConverseChatModel wraps same API
🔬 Under the Hood

The model never runs your Python—it emits a structured intent. Execution happens in your process (or sandbox). That separation is why authZ on the router matters more than prompt pleading.

⚠️ Pitfall

Storing edited summaries of tool results in message history. Replay exact API payloads—especially tool_call_id / tool_use_id pairing—or the next model turn will hallucinate continuity.

💡 Pro Tip

Log the full message array (redacted) per round. When a user reports a wrong refund, you need round 2 tool args—not only the final assistant text.
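
The per-round logging described above might look like the following minimal sketch. The `redact` rule (mask any run of eight or more digits), the logger name, and the `request_id` / `round` field names are all assumptions for illustration, not part of any provider SDK.

```python
import json
import logging
import re

logger = logging.getLogger("agent.rounds")

# Hypothetical redaction rule: mask any run of 8+ digits (card/account numbers).
_LONG_DIGITS = re.compile(r"\d{8,}")

def redact(value):
    """Recursively mask long digit runs in strings nested inside dicts/lists."""
    if isinstance(value, str):
        return _LONG_DIGITS.sub("[REDACTED]", value)
    if isinstance(value, dict):
        return {k: redact(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value

def log_round(request_id: str, round_num: int, messages: list) -> str:
    """Emit one JSON log line holding the redacted message array for this round."""
    line = json.dumps({
        "request_id": request_id,
        "round": round_num,
        "messages": redact(messages),
    })
    logger.info(line)
    return line
```

Call `log_round` once per model call with plain-dict messages; short IDs such as ORD-8821 survive the mask, so round 2 tool arguments stay searchable.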

🎯 Interview Tip

Whiteboard the five steps and mark where authZ lives. Interviewers want execution outside the model with explicit iteration caps.

OpenAI: tools array, tool_choice, parallel calls

The OpenAI Chat Completions API exposes tools as a top-level tools array of function definitions. The assistant message may include tool_calls with stable IDs; you reply with role: tool messages. Modern models can emit multiple tool calls in one assistant turn; the parallel_tool_calls request parameter controls this and defaults to true.

OpenAI popularized the name "function calling"; the current API uses "tools" with type: function. The assistant message is stored with a tool_calls array—each entry has id, function.name, and function.arguments (JSON string). Your runner must append the full assistant message before tool results so the API can correlate IDs on the next turn.

tools array shape

Each entry is {"type": "function", "function": { name, description, parameters }}. Parameters must be JSON Schema. Set additionalProperties: false on object schemas to reduce spurious keys in arguments.

| Field | Purpose | Guidance |
| --- | --- | --- |
| name | Stable identifier for router | snake_case; match handler registry key exactly |
| description | When to use this tool | Include negative scope: "Do not use for refunds" |
| parameters | Argument schema | Enums for categorical fields; regex patterns for IDs |

tool_choice modes

| Value | Behavior | Use when |
| --- | --- | --- |
| "auto" (default) | Model may call zero or more tools | General agents; support bots |
| "none" | Disable tools for this turn | Final summarization; policy-only response |
| {"type":"function","function":{"name":"X"}} | Force specific tool | Structured extraction pipelines; eval harnesses |
| "required" (supported models) | Must call at least one tool | Router step that always delegates to backend |

Parallel tool_calls in one assistant message

When the user asks "What's the status of ORD-8821 and ORD-8822?", gpt-4o family often returns two tool_calls in a single assistant message. If both are read-only, execute concurrently, append two role: tool messages (each with matching tool_call_id), then call the model once.

OpenAI tools request — tool_choice and parallel_tool_calls
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=TOOLS,
    tool_choice="auto",  # "none" | {"type":"function","function":{"name":"lookup_order"}}
    parallel_tool_calls=True,
)
msg = resp.choices[0].message
finish = resp.choices[0].finish_reason  # "tool_calls" | "stop"

if msg.tool_calls:
    messages.append(msg.model_dump(exclude_none=True))
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        out = HANDLERS[call.function.name](**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(out),
        })
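// toolChoice and parallelToolCalls are OpenAI-specific, so they live on
// OpenAiChatOptions rather than the generic ChatOptions (Spring AI 1.0-style API).
ChatResponse resp = chatClient.prompt(new Prompt(messages))
    .options(OpenAiChatOptions.builder()
        .toolChoice("auto")
        .parallelToolCalls(true)
        .build())
    .tools(toolCallbacks)
    .call()
    .chatResponse();
AssistantMessage assistant = resp.getResult().getOutput();
if (assistant.hasToolCalls()) {
  messages.add(assistant);
  List<ToolResponseMessage.ToolResponse> responses = new ArrayList<>();
  for (AssistantMessage.ToolCall call : assistant.getToolCalls()) {
    String json = toolRegistry.invoke(call.name(), call.arguments());
    responses.add(new ToolResponseMessage.ToolResponse(call.id(), call.name(), json));
  }
  messages.add(new ToolResponseMessage(responses));
}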
# Anthropic: tools + tool_choice={"type":"auto"|"any"|"tool",...}
# Parallel tool_use blocks in one assistant content array
// AnthropicChatOptions.toolChoice(ToolChoiceBuilder.AUTO)
# Bedrock Converse: toolConfig={"tools":[...]} — multiple toolUse per turn
// ConverseRequest.toolConfig(ToolConfiguration.builder()...)

finish_reason and empty tool_calls

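When finish_reason == "tool_calls", expect at least one call. If generation hits the token limit, finish_reason is "length" instead, and any partially emitted arguments may be truncated, invalid JSON; treat that turn as failed rather than executing it. When the model answers directly, tool_calls is null and finish_reason == "stop". Do not assume every turn needs tools; over-registering tools increases mistaken invocations.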
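
A defensive way to read arguments given those finish reasons is sketched below. The function name and the error codes it returns are hypothetical; only the `finish_reason` values ("tool_calls", "stop", "length") come from the Chat Completions API.

```python
import json

def parse_tool_arguments(finish_reason: str, raw_arguments: str) -> dict:
    """Return {"ok": True, "args": {...}} or {"ok": False, "error": "..."}."""
    if finish_reason == "length":
        # Output was cut off by the token limit; arguments may be truncated JSON.
        return {"ok": False, "error": "truncated_output"}
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return {"ok": False, "error": "invalid_json"}
    if not isinstance(args, dict):
        # Tool arguments are expected to be a JSON object matching the schema.
        return {"ok": False, "error": "arguments_not_object"}
    return {"ok": True, "args": args}
```

Feeding the error back as the tool result (rather than raising) gives the model one chance to re-emit a well-formed call within the iteration cap.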

⚖️ Trade-off

Forcing a tool via tool_choice improves reliability but removes the model's ability to ask clarifying questions in text. Use forced tools for batch ETL; use auto for conversational agents.

⚙️ Config

Feature flag per endpoint: tool_choice: auto in prod chat, none on summarization-only routes. Log effective choice and tool count per request.

📦 Real World

Support agents register 8–12 tools max. Beyond ~15, wrong-tool rate rises—split into specialist sub-agents or dynamic tool retrieval (later guides).

🔒 Security

Never pass user-controlled strings into tool names. Allowlist call.function.name against registry keys before dispatch—prompt injection loves inventing admin_override.
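
The allowlist check can be as small as an exact-match dictionary lookup. This is a sketch under the assumption that the registry is a plain dict of name to callable; `safe_dispatch` and its error payloads are illustrative names.

```python
def safe_dispatch(name, args: dict, registry: dict) -> dict:
    """Dispatch only to handlers registered under exactly this name."""
    # Exact-match lookup: no prefix matching, no getattr/eval on the model's string.
    handler = registry.get(name) if isinstance(name, str) else None
    if handler is None:
        return {"error": "unknown_tool", "retryable": False}
    try:
        return handler(**args)
    except TypeError as exc:
        # Unexpected or missing keyword arguments from the model.
        return {"error": "invalid_arguments", "details": str(exc), "retryable": False}
```

An invented name such as admin_override never reaches application code; it comes back to the model as an ordinary structured error.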

Anthropic: tool_use and tool_result blocks

Anthropic Messages API represents tools as tools with input_schema (JSON Schema). Assistant replies are content block arrays: interleaved text and tool_use blocks. Results return in a subsequent user message as tool_result blocks referencing tool_use_id.

Unlike OpenAI's dedicated role: tool message, Anthropic nests results inside a user turn. The assistant message before results must be preserved verbatim—including any leading text blocks where the model explains its plan. Bedrock Converse copies this block model for Claude models; adapters in multi-provider gateways must not flatten these into generic "tool" roles without translation.

Block flow

  1. Request includes tools=[{name, description, input_schema}] and optional tool_choice.
  2. Assistant content may include {"type":"tool_use","id":"toolu_…","name":"…","input":{…}}.
  3. You execute handlers and append a user message whose content is an array of tool_result blocks.
  4. Model continues with more text and/or additional tool_use blocks until stop.

tool_choice on Anthropic

| tool_choice | Effect | Typical use |
| --- | --- | --- |
| {"type":"auto"} | Model decides whether to use tools | Default agent behavior |
| {"type":"any"} | Must invoke at least one tool | Forced delegation steps |
| {"type":"tool","name":"lookup_order"} | Must call named tool | Eval fixtures; structured extract |

Parallel tool_use blocks

One assistant turn may contain multiple tool_use blocks (e.g. two order lookups). Batch all corresponding tool_result entries in the next single user message—order of results need not match order of calls, but each tool_use_id must be unique and correctly paired.

Anthropic tool_use → execute → tool_result user message
# OpenAI: role "tool" + tool_call_id (not user+tool_result blocks)
// ToolResponseMessage per OpenAI tool_call_id
import anthropic, json

client = anthropic.Anthropic()
# Defined once so the first call and the follow-up register the same tools.
tools = [{
    "name": "search_policy",
    "description": "Search internal policy KB. Query max 200 chars.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string", "maxLength": 200}},
        "required": ["query"],
    },
}]

def run_search_policy(query: str) -> dict:
    # Placeholder handler; replace with a real policy KB search.
    return {"query": query, "matches": []}

question = "Can I return an opened laptop after 45 days?"
resp = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    tool_choice={"type": "auto"},
    messages=[{"role": "user", "content": question}],
)
messages = [
    {"role": "user", "content": question},
    {"role": "assistant", "content": resp.content},  # preserved verbatim
]
tool_results = []
for block in resp.content:
    if block.type == "tool_use":
        data = run_search_policy(**block.input)
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": json.dumps(data),
        })
if tool_results:  # only follow up when the model actually requested a tool
    messages.append({"role": "user", "content": tool_results})
    follow_up = client.messages.create(
        model="claude-3-5-sonnet-20241022", max_tokens=1024, tools=tools, messages=messages,
    )
// Spring AI: AnthropicChatModel with ToolDefinition inputSchema
// Map ToolResponse to Anthropic tool_result content blocks in UserMessage
# Bedrock Converse mirrors Anthropic block shapes:
# assistant content: [{"toolUse": {"toolUseId", "name", "input"}}]
# user content: [{"toolResult": {"toolUseId", "content": [{"json": ...}]}}]
// BedrockConverseChatModel — same toolUse / toolResult semantics

is_error on tool_result

Anthropic supports "is_error": true on tool_result when execution failed. Use this for validation failures and timeouts so the model distinguishes user-fixable errors from successful empty results.

⚠️ Pitfall

Omitting the full assistant message (including text blocks) before tool_result. Anthropic expects the exact prior assistant content array preserved in history.

🔬 Under the Hood

tool_use IDs are provider-generated (e.g. toolu_01…). Never reuse IDs across turns; each invocation gets a fresh identifier tied to one result block.

💰 Cost

Interleaved text + tool_use in one turn still bills output tokens for both. Trim verbose pre-tool narration in the system prompt if latency-sensitive.

🎯 Interview Tip

Contrast OpenAI role:tool vs Anthropic user+tool_result. Shows you build multi-provider gateways without losing required semantics at the adapter layer.

Tool design: single responsibility, descriptions, typed params

Tools are UX for the model. Poor schemas cause wrong-tool selection, invalid args, and retry storms. Design each tool like a public API: one job, explicit description, strict types, predictable errors, idempotent writes.

The model chooses tools from names and descriptions alone—it cannot read your handler source code. A vague description guarantees misuse; an oversized god-tool guarantees wrong enum branches. Invest in schema quality before tuning temperature or swapping models.

Design principles

| Principle | Practice | Anti-pattern |
| --- | --- | --- |
| Single responsibility | lookup_order not manage_order | God tool with action enum covering 12 operations |
| Descriptions | When to use + when NOT + example ID format | "Handles orders" (useless to the model) |
| Typed params | JSON Schema types, enums, min/max, patterns | Single payload: string blob |
| Error-safe | Return {"error":"…","retryable":true} | Stack traces or HTML error pages as tool content |
| Idempotent writes | Require idempotency_key on mutating tools | Double refund on duplicate tool_call retry |

Description template

Structure: one-line purpose → input constraints → output shape → negative scope.

Fetch order details by ID. Input: order_id must match ORD-[digits]. Returns status, line items, ship date. Do not use for refunds—use issue_refund.
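
One way to keep every description in that four-part shape is to assemble it from named parts. The helper below is a hypothetical convenience, not a provider feature; it only joins the pieces and normalizes trailing periods.

```python
def build_description(purpose: str, input_rule: str, output_shape: str, negative_scope: str) -> str:
    """Compose: one-line purpose, input constraints, output shape, negative scope."""
    parts = [purpose, f"Input: {input_rule}", f"Returns {output_shape}", negative_scope]
    # Strip stray whitespace and any trailing period, then end each part with one period.
    return " ".join(p.strip().rstrip(".") + "." for p in parts)

DESCRIPTION = build_description(
    "Fetch order details by ID",
    "order_id must match ORD-[digits]",
    "status, line items, ship date",
    "Do not use for refunds; use issue_refund",
)
```

Keeping the parts separate also makes it easier to review or A/B test one part (for example the negative scope) without touching the rest.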

Server-side validation before execute

Validate args with Pydantic / Bean Validation — fail closed
from pydantic import BaseModel, Field, ValidationError

class LookupOrderArgs(BaseModel):
    order_id: str = Field(pattern=r"^ORD-\d+$")

class IssueRefundArgs(BaseModel):
    order_id: str
    amount_usd: float = Field(gt=0, le=10_000)
    reason: str = Field(max_length=200)
    idempotency_key: str

def dispatch(name: str, raw: dict) -> dict:
    try:
        if name == "lookup_order":
            args = LookupOrderArgs.model_validate(raw)
            return lookup_order(args.order_id)
        if name == "issue_refund":
            args = IssueRefundArgs.model_validate(raw)
            return issue_refund(**args.model_dump())
    except ValidationError as e:
        return {"error": "invalid_arguments", "details": e.errors(), "retryable": False}
    return {"error": "unknown_tool", "retryable": False}
public record LookupOrderArgs(
    @JsonProperty("order_id") @NotNull @Pattern(regexp = "ORD-\\d+") String orderId) {}

// validator is a jakarta.validation.Validator; Jackson alone does not enforce @Pattern.
public String dispatch(String name, String argsJson) {
  try {
    if ("lookup_order".equals(name)) {
      LookupOrderArgs args = objectMapper.readValue(argsJson, LookupOrderArgs.class);
      if (!validator.validate(args).isEmpty()) {
        return "{\"error\":\"invalid_arguments\",\"retryable\":false}";
      }
      return objectMapper.writeValueAsString(lookupOrder(args.orderId()));
    }
  } catch (JsonProcessingException e) {
    return "{\"error\":\"invalid_arguments\",\"retryable\":false}";
  }
  return "{\"error\":\"unknown_tool\",\"retryable\":false}";
}

Returning errors the model can recover from

  • order_not_found — model asks user to verify ID
  • rate_limited with retryable: true — runner may backoff and retry
  • policy_denied — model explains constraint without retry loop
💡 Pro Tip

Return structured errors the model can read. A tool result {"error":"order_not_found"} beats an exception string the model misinterprets as success.
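
The three recoverable error shapes listed above could share one small helper so the runner and the model see consistent fields. The `ERROR_POLICY` table, `tool_error`, and `next_action` are illustrative names, and the retry limit of 3 is an assumption.

```python
# Hypothetical error codes mapped to whether the runner may retry automatically.
ERROR_POLICY = {
    "order_not_found": False,  # model should ask the user to verify the ID
    "rate_limited": True,      # runner may back off and retry
    "policy_denied": False,    # model explains the constraint; no retry loop
}

def tool_error(code: str, message: str = "") -> dict:
    """Build a structured error payload to send back as the tool result."""
    return {"error": code, "message": message, "retryable": ERROR_POLICY.get(code, False)}

def next_action(result: dict, attempt: int, max_attempts: int = 3) -> str:
    """Return 'retry' for retryable errors under the cap, else 'return_to_model'."""
    if result.get("error") and result.get("retryable") and attempt < max_attempts:
        return "retry"
    return "return_to_model"
```

Unknown codes default to non-retryable, so a new error type cannot start a retry storm by accident.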

⚖️ Trade-off

Granular tools (10+) vs coarse tools (3). Granular improves selection accuracy but inflates schema tokens in every request. Measure wrong-tool rate on a golden set.

📦 Real World

Version tool descriptions in a registry (lookup_order@v3). A/B description changes like any API contract—with eval gates before full rollout.

🔒 Security

Validate argument length and rate-limit per tool. search_policy with a 10KB query is a DoS vector against your vector DB.
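
A sketch of both guards follows: a length check plus a per-tool fixed-window rate limiter. The class name, the limits, and the error codes are assumptions; a multi-process deployment would need a shared store (for example Redis) instead of this in-memory dict.

```python
import time
from typing import Optional

class ToolRateLimiter:
    """Fixed-window call counter per tool name (in-memory, single process)."""

    def __init__(self, max_calls: int, window_sec: float, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_sec = window_sec
        self.clock = clock  # injectable so the window can be tested without sleeping
        self._windows = {}  # tool name -> (window start, calls in window)

    def allow(self, tool: str) -> bool:
        now = self.clock()
        start, count = self._windows.get(tool, (now, 0))
        if now - start >= self.window_sec:
            start, count = now, 0  # window expired: start a fresh one
        if count >= self.max_calls:
            self._windows[tool] = (start, count)
            return False
        self._windows[tool] = (start, count + 1)
        return True

MAX_QUERY_CHARS = 200  # mirrors the maxLength in the search_policy schema

def guard_search_policy(query: str, limiter: ToolRateLimiter) -> Optional[dict]:
    """Return an error payload if the call should be refused, else None."""
    if len(query) > MAX_QUERY_CHARS:
        return {"error": "query_too_long", "retryable": False}
    if not limiter.allow("search_policy"):
        return {"error": "rate_limited", "retryable": True}
    return None
```

The length check runs first so oversized queries are rejected without consuming rate-limit budget.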

Tool categories: read / write / compute + confirmation for writes

Classify every tool by side-effect profile. Read tools query state; write tools mutate external systems; compute tools run deterministic logic with no external IO. Apply different guardrails: parallelize reads; serialize writes; require human confirmation before high-stakes mutations.

Category matrix

| Category | Examples | Parallel? | HITL? | Idempotency |
| --- | --- | --- | --- | --- |
| Read | lookup_order, search_policy, get_balance | Yes | No | Optional (cache keys) |
| Write | issue_refund, send_email, update_address | No (or queued) | Yes for high impact | Required |
| Compute | calculate_shipping, format_table, validate_iban | Yes | No | N/A (pure function) |

Confirmation pattern for writes

  1. Model calls write tool → router intercepts before execute.
  2. Present pending action to user or approver queue: tool name, args, human-readable diff.
  3. On approve: execute with idempotency key tied to tool_call_id or tool_use_id.
  4. On reject: inject tool result {"status":"denied_by_user"} so the model apologizes and offers alternatives.
  5. On timeout: inject {"status":"approval_expired"}; model asks user to restate intent.

Registry metadata drives policy

Tool kind metadata — parallel and approval policy
from enum import Enum

class ToolKind(str, Enum):
    READ = "read"
    WRITE = "write"
    COMPUTE = "compute"

TOOL_META = {
    "lookup_order": {"kind": ToolKind.READ, "timeout_sec": 5},
    "search_policy": {"kind": ToolKind.READ, "timeout_sec": 5},
    "issue_refund": {"kind": ToolKind.WRITE, "requires_approval": True, "risk_tier": "high"},
    "calculate_shipping": {"kind": ToolKind.COMPUTE, "timeout_sec": 2},
}

WRITE_TOOLS = {n for n, m in TOOL_META.items() if m["kind"] == ToolKind.WRITE}
# READ_TOOLS covers everything safe to run in parallel: read and compute tools.
READ_TOOLS = {n for n, m in TOOL_META.items() if m["kind"] in (ToolKind.READ, ToolKind.COMPUTE)}
public enum ToolKind { READ, WRITE, COMPUTE }

public record ToolMeta(ToolKind kind, boolean requiresApproval, int timeoutSec, String riskTier) {}

Map<String, ToolMeta> registry = Map.of(
    "lookup_order", new ToolMeta(ToolKind.READ, false, 5, "low"),
    "issue_refund", new ToolMeta(ToolKind.WRITE, true, 30, "high")
);

Gray-area tools

Some tools blur lines: create_support_ticket writes to Zendesk but feels "safe." Rule of thumb: if a customer or finance system observes the effect without another human step, treat as write. Log-only internal annotations may be medium tier—auto-execute with audit.

⚠️ Pitfall

Classifying send_email_to_customer as read because it "only reads a template." If the recipient receives mail, it is a write—confirm before send.

⚙️ Config

WRITE_TOOLS=issue_refund,send_email in env; router refuses writes when HITL_MODE=required and no approval token is present on the request context.
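
Reading that configuration might look like the sketch below. The function names are hypothetical, and the check only tests that a token is present; verifying the token itself is a separate step.

```python
import os

def load_write_tools(env=os.environ) -> set:
    """Parse the comma-separated WRITE_TOOLS variable into a set of tool names."""
    raw = env.get("WRITE_TOOLS", "")
    return {name.strip() for name in raw.split(",") if name.strip()}

def may_execute(tool: str, approval_token, env=os.environ) -> bool:
    """Refuse write tools when HITL_MODE=required and no approval token is present."""
    if tool not in load_write_tools(env):
        return True  # reads and compute tools are not gated here
    if env.get("HITL_MODE") == "required":
        return bool(approval_token)
    return True
```

Passing `env` explicitly keeps the policy testable without mutating the process environment.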

🎯 Interview Tip

Explain read/write split for parallel execution policy. Shows operational maturity beyond demo agent loops that treat every tool the same.

💰 Cost

Write tools hit payment and email APIs—duplicate execution cost dwarfs LLM tokens. Idempotency keys keyed to provider call IDs are non-negotiable on refunds.
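
Keying idempotency to the provider call ID can be sketched as a small executor that remembers the first result per key. This version is in-memory and single-process by assumption; a real deployment would persist keys in a durable store and usually also pass the key to the payment API.

```python
import threading

class IdempotentExecutor:
    """Run a write at most once per key; replays return the stored result."""

    def __init__(self):
        self._results = {}
        self._lock = threading.Lock()

    def run(self, key: str, fn, *args, **kwargs):
        # Holding the lock during fn also serializes writes, matching the
        # "serialize writes" policy; a failed fn stores nothing and can be retried.
        with self._lock:
            if key in self._results:
                return self._results[key]
            result = fn(*args, **kwargs)
            self._results[key] = result
            return result
```

Usage: `executor.run(call.id, issue_refund, **args)`, where `call.id` is the tool_call_id or tool_use_id, so a retried round replays the stored result instead of refunding twice.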

Parallel calling: N simultaneous tools

When the model returns N read-only tool calls in one turn, execute them concurrently to minimize wall-clock latency. Never parallelize writes unless your backend supports isolated transactions and you enforce per-resource locking.

OpenAI's parallel_tool_calls and Anthropic multi-tool_use turns are only wins if your executor cooperates. A sequential loop over three lookup_order calls triples latency; a thread pool with sane limits cuts it toward the slowest single dependency.

When to parallelize

  • All calls classified as READ or COMPUTE per registry metadata
  • Handlers are thread-safe or pure async without shared mutable state
  • Downstream APIs have headroom (bulkhead / circuit breaker not open)
  • Total fan-out ≤ configured max (commonly 8)

When to refuse parallel fan-out

  • Any write tool in the batch—split reads parallel, writes sequential
  • Tools hitting the same row (two updates to one order)—serialize or merge
  • Global rate limit on a shared API key—use semaphore per dependency
Parallel read tools — split bucket, ThreadPoolExecutor
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

def execute_tool_calls(request_id: str, round_num: int, tool_calls: list) -> list[tuple[str, str, str]]:
    """Returns (tool_call_id, name, output_json)."""
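    read_calls = [c for c in tool_calls if c.function.name in READ_TOOLS]
    # Everything else (writes and unknown names) runs sequentially, so every
    # tool_call_id still receives a result; dispatch returns unknown_tool errors.
    write_calls = [c for c in tool_calls if c.function.name not in READ_TOOLS]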
    results: list[tuple[str, str, str]] = []

    def _one(call):
        name = call.function.name
        args = json.loads(call.function.arguments)
        out = dispatch(name, args)
        audit_log(request_id, round_num, name, args, out)
        return call.id, name, json.dumps(out)

    max_workers = min(8, len(read_calls) or 1)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(_one, c) for c in read_calls]
        for fut in as_completed(futures):
            results.append(fut.result())

    for call in write_calls:
        results.append(_one(call))  # sequential; HITL inside dispatch for writes
    return results
public List<ToolResult> executeToolCalls(String requestId, int round, List<ToolCall> calls) {
  List<ToolCall> reads = calls.stream().filter(c -> READ_TOOLS.contains(c.getName())).toList();
  // Writes plus any unknown names run sequentially, so no call is silently dropped.
  List<ToolCall> writes = calls.stream().filter(c -> !READ_TOOLS.contains(c.getName())).toList();
  List<ToolResult> results = reads.parallelStream()
      .map(c -> invokeAudited(requestId, round, c))
      .collect(Collectors.toCollection(ArrayList::new));
  writes.forEach(c -> results.add(invokeWithApproval(requestId, round, c)));
  return results;
}

Async Python variant

For async handlers (HTTP/2 clients), prefer asyncio.gather with a concurrency semaphore instead of threads—same policy gates on read vs write buckets apply.
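
A minimal sketch of that asyncio variant, assuming handlers are `async def` functions and calls arrive as `(call_id, name, args)` tuples (both shapes are illustrative):

```python
import asyncio

async def run_reads(calls: list, handlers: dict, max_concurrency: int = 8) -> dict:
    """Run read-only calls concurrently; returns {call_id: result}."""
    sem = asyncio.Semaphore(max_concurrency)  # caps in-flight handler calls

    async def _one(call_id, name, args):
        async with sem:
            try:
                return call_id, await handlers[name](**args)
            except Exception as exc:
                # Convert failures (including unknown names) into a structured
                # result so every call ID still gets an answer.
                return call_id, {"error": type(exc).__name__, "retryable": False}

    pairs = await asyncio.gather(*(_one(*c) for c in calls))
    return dict(pairs)
```

Because each coroutine catches its own exception, one slow or failing lookup does not cancel the others; write calls should still be awaited one at a time outside this helper.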

⚖️ Trade-off

Parallel reads cut user-visible latency but multiply load on fragile backends. Use semaphores per dependency (CRM, vector DB) and backpressure when error rate spikes.

🔬 Under the Hood

OpenAI parallel_tool_calls is a generation-time feature; your executor must still respect downstream rate limits—the model does not know your CRM's 429 threshold.

📦 Real World

Cap parallel fan-out at 8 even if the model requests 12 lookups—execute first 8, return partial with truncated: true in metadata, let model ask to continue.
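
The cap can be sketched as a split into calls to run and calls to answer with a truncation marker. The dict shape for a call (`{"id": ...}`) and the `fan_out_limit` error code are assumptions; the skipped calls still get a result because providers expect one result per call ID.

```python
def cap_fan_out(calls: list, max_fan_out: int = 8):
    """Split calls into (to_run, skipped_results) at the fan-out limit."""
    to_run = calls[:max_fan_out]
    skipped_results = [
        {
            "call_id": call["id"],
            # Marked retryable so the model can request these again next round.
            "result": {"error": "fan_out_limit", "truncated": True, "retryable": True},
        }
        for call in calls[max_fan_out:]
    ]
    return to_run, skipped_results
```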

💡 Pro Tip

as_completed returns results out of order—that is fine for message injection. Each tool_call_id pairing is independent of append order.

Result injection: message format per provider

Each provider requires a specific message shape when feeding tool outputs back into the conversation. Gateways that normalize history must preserve provider-native ID pairing—or translation bugs surface on turn 3+ when the model "forgets" which call failed.

Format comparison

| Provider | Assistant carries call | Result message | ID field |
| --- | --- | --- | --- |
| OpenAI | assistant.tool_calls[] | role: "tool" | tool_call_id |
| Anthropic | content[].type == "tool_use" | user.content[].tool_result | tool_use_id |
| Bedrock Converse | content[].toolUse | user.content[].toolResult | toolUseId |

Content encoding rules

  • OpenAI: tool message content is a string; serialize structured handler output to JSON (json.dumps in Python, JSON.stringify in JavaScript).
  • Anthropic: tool_result.content may be a string or content blocks; string JSON works for most handlers; set is_error on failures.
  • Bedrock: prefer content: [{"json": {...}}] for structured success; [{"text": "..."}] for errors with status: "error".

Multi-provider injection helpers

Inject tool results — provider-specific message shapes
def inject_openai(messages: list, call_id: str, result: dict) -> None:
    messages.append({
        "role": "tool",
        "tool_call_id": call_id,
        "content": json.dumps(result),
    })

# One tool message per tool_call_id — never combine
messages.add(new ToolResponseMessage(
    ToolResponse.builder(callId, toolName).responseData(json).build()
));
def inject_anthropic_batch(results: list[tuple[str, dict]]) -> dict:
    """Batch multiple (tool_use_id, result) pairs as tool_result blocks in ONE user message."""
    blocks = []
    for call_id, result in results:
        block = {"type": "tool_result", "tool_use_id": call_id, "content": json.dumps(result)}
        if result.get("error"):
            block["is_error"] = True
        blocks.append(block)
    return {"role": "user", "content": blocks}
// UserMessage with List<AnthropicToolResultContent> batched per turn
def inject_bedrock(tool_use_id: str, result: dict) -> dict:
    if "error" in result:
        return {"toolResult": {
            "toolUseId": tool_use_id,
            "content": [{"text": json.dumps(result)}],
            "status": "error",
        }}
    return {"toolResult": {
        "toolUseId": tool_use_id,
        "content": [{"json": result}],
    }}
// ContentBlock.fromToolResult(ToolResultBlock.builder()...)

Canonical internal model (gateway pattern)

Store conversation as provider-neutral records: {turn, assistant_tool_calls[], tool_results[]}. Adapters translate to OpenAI or Anthropic shapes at API boundary only. Never persist provider-specific messages without also storing the neutral form—you will migrate models.
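
As a sketch of the adapter idea, the two functions below translate one neutral turn record into OpenAI and Anthropic message shapes. The neutral field names beyond `assistant_tool_calls` and `tool_results` (`text`, `id`, `name`, `args`, `output`) are assumptions chosen for illustration.

```python
import json

def to_openai(turn: dict) -> list:
    """Neutral turn -> [assistant message with tool_calls, one tool message per result]."""
    assistant = {
        "role": "assistant",
        "content": turn.get("text"),
        "tool_calls": [
            {"id": c["id"], "type": "function",
             "function": {"name": c["name"], "arguments": json.dumps(c["args"])}}
            for c in turn["assistant_tool_calls"]
        ],
    }
    tool_msgs = [
        {"role": "tool", "tool_call_id": r["id"], "content": json.dumps(r["output"])}
        for r in turn["tool_results"]
    ]
    return [assistant, *tool_msgs]

def to_anthropic(turn: dict) -> list:
    """Neutral turn -> [assistant content blocks, ONE user message of tool_result blocks]."""
    blocks = []
    if turn.get("text"):
        blocks.append({"type": "text", "text": turn["text"]})
    blocks += [
        {"type": "tool_use", "id": c["id"], "name": c["name"], "input": c["args"]}
        for c in turn["assistant_tool_calls"]
    ]
    results = [
        {"type": "tool_result", "tool_use_id": r["id"], "content": json.dumps(r["output"])}
        for r in turn["tool_results"]
    ]
    return [{"role": "assistant", "content": blocks}, {"role": "user", "content": results}]
```

The call ID is carried through unchanged in both directions, which is the property the adapters must never normalize away.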

⚠️ Pitfall

Mismatched IDs—injecting a result for call_abc when the model emitted call_xyz. The model hallucinates success. Assert every ID exists in the immediately prior assistant turn.
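
That assertion can be sketched for OpenAI-shaped messages as follows; the function name is hypothetical, and the same check applies to tool_use_id / toolUseId with the field names swapped.

```python
def assert_ids_paired(assistant_msg: dict, tool_msgs: list) -> None:
    """Raise ValueError unless tool results match the prior assistant turn's call IDs 1:1."""
    expected = {c["id"] for c in assistant_msg.get("tool_calls") or []}
    seen = [m["tool_call_id"] for m in tool_msgs]
    unknown = set(seen) - expected          # results for calls the model never made
    missing = expected - set(seen)          # calls left without a result
    duplicates = {i for i in seen if seen.count(i) > 1}
    if unknown or missing or duplicates:
        raise ValueError(
            f"tool result ID mismatch: unknown={sorted(unknown)} "
            f"missing={sorted(missing)} duplicate={sorted(duplicates)}"
        )
```

Running this just before each model call turns a silent pairing bug into an immediate, loggable failure.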

🔒 Security

Scrub PII from tool results before re-prompt unless policy allows. Logs and model context share the same leakage surface—hash or truncate account numbers in results.

🎯 Interview Tip

Ask how you'd build a provider-agnostic message store. Strong answer: canonical internal event log + thin adapters; never normalize away ID semantics at persistence layer.

Human-in-the-loop: pause for approval on high-stakes

Models misread policy edge cases. For refunds, account closures, wire transfers, and outbound email to customers, pause execution until a human approves, edits, or rejects the pending tool call. The agent loop stays intact—you inject a synthetic tool result reflecting the human decision.

Human-in-the-loop (HITL) is not "no agents"—it is gated side effects. The model may still plan, gather reads, and draft the refund rationale; only the mutating RPC waits on approval. Sync UIs show a pending card; async workflows enqueue to Slack/email and resume via webhook when decided.

Approval flow

  1. Detect write tool in router → create PendingAction (args, user, request_id, provider call ID).
  2. Return holding response to client or continue session with "awaiting approval" tool result if blocking is unacceptable.
  3. Approver UI shows diff: "Refund $149.99 to ORD-8821 — reason: duplicate charge".
  4. On approve → execute with idempotency key = provider call ID; inject success result; resume loop.
  5. On reject → inject {"status":"denied_by_user"}; model explains to end user.

Risk tiers

| Tier | Examples | Gate |
|---|---|---|
| Low | Read, compute | Auto-execute |
| Medium | Update nickname, internal note | Auto + audit log |
| High | Refund > $50, email external party | Human approval (single approver) |
| Critical | Wire transfer, delete account | Two-person rule + ticket ID |
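
The tiers can be expressed as a small routing function, sketched here. The tool names, the `Gate` enum, and the rule that refunds of $50 or less fall into the medium tier are assumptions made for illustration; the table only states that refunds above $50 need a human.

```python
from enum import Enum


class Gate(str, Enum):
    AUTO = "auto_execute"
    AUTO_AUDIT = "auto_with_audit_log"
    HUMAN = "human_approval"
    TWO_PERSON = "two_person_rule"


# Hypothetical tool names; substitute the ones in your registry.
CRITICAL_TOOLS = {"wire_transfer", "delete_account"}
MEDIUM_TOOLS = {"update_nickname", "add_internal_note"}
REFUND_THRESHOLD = 50.00


def gate_for(tool_name: str, kind: str, args: dict) -> Gate:
    """Map a pending tool call to the approval gate it must pass."""
    if kind in ("read", "compute"):
        return Gate.AUTO
    if tool_name in CRITICAL_TOOLS:
        return Gate.TWO_PERSON
    if tool_name == "issue_refund":
        amount = args.get("amount")
        # A missing amount is treated as high risk rather than as zero.
        if amount is None or float(amount) > REFUND_THRESHOLD:
            return Gate.HUMAN
        return Gate.AUTO_AUDIT
    if tool_name in MEDIUM_TOOLS:
        return Gate.AUTO_AUDIT
    # Unclassified writes fail closed to human approval.
    return Gate.HUMAN
```

Defaulting unknown write tools to human approval means a newly registered tool cannot reach production side effects before someone has classified it.
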
Write tool gate — pending approval injects status into tool result
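import json
from dataclasses import dataclass
from enum import Enum

class ApprovalState(str, Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"

@dataclass
class PendingAction:
    id: str
    tool_call_id: str
    tool_name: str
    arguments: dict
    state: ApprovalState = ApprovalState.PENDING

# approval_store, notify_approver, and issue_refund are supplied by the application.
def run_write_tool(name: str, args: dict, *, approval_id: str) -> str:
    pending = approval_store.get(approval_id)
    # Fail closed: an unknown approval, or one created for another tool, never executes.
    if pending is None or pending.tool_name != name:
        return json.dumps({"status": "approval_not_found", "approval_id": approval_id})
    if pending.state == ApprovalState.PENDING:
        notify_approver(pending)
        return json.dumps({"status": "awaiting_approval", "approval_id": approval_id})
    if pending.state == ApprovalState.REJECTED:
        return json.dumps({"status": "denied_by_user"})
    # Execute the arguments the approver reviewed, not whatever the caller passes now.
    return json.dumps(issue_refund(**pending.arguments, idempotency_key=pending.tool_call_id))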
public record PendingAction(String id, String toolCallId, String toolName, JsonNode args, ApprovalState state) {}

public String runWriteTool(String name, String argsJson, String approvalId) {
  PendingAction p = store.get(approvalId);
  if (p.state() == ApprovalState.PENDING) {
    notifier.send(p);
    return "{\"status\":\"awaiting_approval\",\"approval_id\":\"" + approvalId + "\"}";
  }
  if (p.state() == ApprovalState.REJECTED) {
    return "{\"status\":\"denied_by_user\"}";
  }
  return invokeRefund(argsJson, p.toolCallId());
}

Preventing self-approval via prompt injection

  • Approvers authenticate on a separate channel (admin SSO), not the chat session
  • Approval tokens are single-use, bound to approval_id + approver identity
  • Chat user cannot pass approved=true in tool args—decision lives only in approval store
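
The second bullet can be sketched with an HMAC over the approval ID, the approver identity, and a random nonce. `issue_token`, `redeem_token`, the in-memory `_used` set, and the per-process `SECRET` are all illustrative assumptions; a real deployment would load the key from a secret manager and track redemption in the approval store.

```python
import hashlib
import hmac
import secrets

SECRET = secrets.token_bytes(32)  # assumption: per-process key, for the sketch only
_used = set()                     # assumption: in-memory; use the approval store in production


def _sign(approval_id: str, approver_id: str, nonce: str) -> str:
    msg = f"{approval_id}|{approver_id}|{nonce}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()


def issue_token(approval_id: str, approver_id: str) -> str:
    """Create a token that is only valid for this approval and this approver."""
    nonce = secrets.token_hex(8)
    return f"{nonce}.{_sign(approval_id, approver_id, nonce)}"


def redeem_token(token: str, approval_id: str, approver_id: str) -> bool:
    """Return True exactly once for a valid token; False for forged or reused ones."""
    nonce, _, sig = token.partition(".")
    expected = _sign(approval_id, approver_id, nonce)
    # Constant-time comparison; bytes avoid errors on non-ASCII input.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    if token in _used:
        return False
    _used.add(token)
    return True
```

Because the approver identity is part of the signed message, a token lifted from one approver's deep link does not validate for anyone else, and nothing the chat user types can produce a valid signature.
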
📦 Real World

Slack interactive buttons with 15-minute TTL. Expired pending actions inject approval_expired so the model asks the user to restate intent—not loop forever on stale pending state.
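
A small sketch of that expiry check, assuming each pending action records a timezone-aware `created_at`. The function name and the injectable `now` parameter are illustrative; the `approval_expired` and `awaiting_approval` statuses match the ones used above.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Optional

APPROVAL_TTL = timedelta(minutes=15)


def pending_result(created_at: datetime, approval_id: str,
                   now: Optional[datetime] = None) -> str:
    """Build the tool result for an undecided approval, expiring it after the TTL."""
    if now is None:
        now = datetime.now(timezone.utc)
    if now - created_at > APPROVAL_TTL:
        status = "approval_expired"   # model should ask the user to restate intent
    else:
        status = "awaiting_approval"
    return json.dumps({"status": status, "approval_id": approval_id})
```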

⚖️ Trade-off

Sync HITL blocks the HTTP request—poor chat UX. Prefer async: persist message history, notify approver, resume agent via webhook + stored checkpoint.

🔒 Security

Separate approver identity from chat user. Prompt injection must not forge approval headers or pre-signed URLs.

💡 Pro Tip

Include provider call ID in approval deep links. Replay attacks reuse old approvals, so execution must be idempotent: a duplicate key returns the stored outcome of the first run and never triggers a second one.
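
A minimal sketch of that behavior, assuming outcomes are cached by idempotency key. `IdempotentExecutor` is a hypothetical class and its dictionary is in-memory only; a production version needs an atomic insert in a durable store so two concurrent replays cannot both run.

```python
class IdempotentExecutor:
    """Run a write at most once per key and replay the stored outcome afterwards."""

    def __init__(self):
        self._outcomes = {}

    def execute(self, key: str, action, **kwargs) -> dict:
        if key in self._outcomes:
            # Duplicate key: same outcome as the first run, no second side effect.
            return self._outcomes[key]
        outcome = action(**kwargs)
        self._outcomes[key] = outcome
        return outcome
```

Using the provider call ID as `key` means a replayed approval link maps to the same entry as the original execution.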

Production: tool registry, logging, and checklist

Shipping tool calling means more than a demo loop: centralized registry, schema versioning, structured audit logs, per-tool SLOs, and eval suites that assert correct tool selection—not just final answer quality.

Tool registry responsibilities

  • Single source of truth: name → handler, schema, kind (read/write/compute), owning team
  • Version schemas; deprecate with dual-register period and migration notes
  • Feature flags per tool (issue_refund_enabled: false in staging)
  • Deploy smoke test: invoke each tool with golden args before traffic shift
  • Document SLAs: p99 latency budget per downstream dependency

Structured logging fields

| Field | Why |
|---|---|
| request_id | Trace full agent session across rounds |
| round | Which iteration of the tool loop |
| tool_name | Aggregate error rates per tool |
| latency_ms | SLO dashboards; detect slow CRM |
| status | ok \| validation_error \| timeout \| denied \| unknown_tool |
| args_hash | Debug without logging raw PII args |
| provider | openai \| anthropic \| bedrock (adapter debugging) |

Registry + JSONL audit

Tool registry with audit log — hash args, record latency
import hashlib, json, time
from pathlib import Path

LOG = Path("logs/tool_audit.jsonl")

class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, dict] = {}

    def register(self, name: str, schema: dict, handler, *, kind: str, version: str = "v1"):
        self._tools[name] = {"schema": schema, "handler": handler, "kind": kind, "version": version}

    def invoke(self, request_id: str, round_num: int, name: str, args: dict) -> dict:
        meta = self._tools.get(name)
        start = time.perf_counter()
        if meta is None:
            # Unknown tools are rejected at dispatch but still written to the audit log.
            out = {"error": "unknown_tool", "retryable": False}
            status = "unknown_tool"
        else:
            try:
                out = meta["handler"](**args)
                status = "ok"
            except TimeoutError:
                out = {"error": "timeout", "retryable": True}
                status = "timeout"
            except Exception as e:
                out = {"error": str(e), "retryable": True}
                status = "error"
        row = {
            "request_id": request_id,
            "round": round_num,
            "tool_name": name,  # same key as the logging-fields table
            "tool_version": meta["version"] if meta else None,
            "kind": meta["kind"] if meta else None,
            "status": status,
            "latency_ms": round((time.perf_counter() - start) * 1000, 2),
            # Hash instead of raw args so the log carries no PII.
            "args_hash": hashlib.sha256(json.dumps(args, sort_keys=True).encode()).hexdigest()[:16],
        }
        LOG.parent.mkdir(parents=True, exist_ok=True)
        with LOG.open("a", encoding="utf-8") as f:
            f.write(json.dumps(row) + "\n")
        return out
public class ToolRegistry {
  private final Map<String, ToolDefinition> tools = new ConcurrentHashMap<>();
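  private final ToolAuditLogger auditLog;

  // auditLog is final, so it has to be injected at construction time.
  public ToolRegistry(ToolAuditLogger auditLog) { this.auditLog = auditLog; }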

  public void register(String name, ToolDefinition def) { tools.put(name, def); }

  public JsonNode invoke(String requestId, int round, String name, JsonNode args) {
    long start = System.nanoTime();
    ToolDefinition def = tools.get(name);
    if (def == null) return error("unknown_tool");
    JsonNode out = def.handler().apply(args);
    auditLog.log(requestId, round, name, def.version(), statusOk, start, hashArgs(args));
    return out;
  }
}

Eval beyond final answer text

Golden traces should assert:

  • Given user query X, expect tool Y (not Z) on round 1
  • Args contain required fields with valid formats
  • Write tools trigger HITL when amount > threshold
  • After injected error result, model recovers without infinite retry
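
A sketch of a trace checker covering three of these assertions (tool selection on round 1, required args, and bounded retries); the HITL assertion depends on your approval store and is left out. The trace format, a list of `{"tool": ..., "args": ...}` dictionaries in call order, and the `check_trace` name are assumptions.

```python
import json
from collections import Counter


def check_trace(trace: list, expected_tool: str, required_args: set,
                max_repeats: int = 2) -> list:
    """Return a list of failure messages; an empty list means the trace passes."""
    if not trace:
        return ["no tool calls recorded"]
    failures = []
    first = trace[0]
    if first["tool"] != expected_tool:
        failures.append(f"round 1: expected {expected_tool}, got {first['tool']}")
    missing = set(required_args) - set(first["args"])
    if missing:
        failures.append(f"round 1: missing args {sorted(missing)}")
    # Identical (tool, args) pairs repeated too often suggest a retry loop.
    repeats = Counter(
        (call["tool"], json.dumps(call["args"], sort_keys=True)) for call in trace
    )
    for (tool, _), count in repeats.items():
        if count > max_repeats:
            failures.append(f"{tool} called {count} times with identical args")
    return failures
```

Returning messages instead of raising lets one eval run report every regression in a trace rather than stopping at the first.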

Production checklist

  • Allowlist tool names in router—reject unknown at dispatch
  • Max rounds (5) and max wall time (30s) enforced independently
  • Read/write classification drives parallel and HITL policy
  • Idempotency keys on all writes keyed to provider call ID
  • Provider-correct result injection in durable message store
  • Golden evals for tool selection + args; block deploy on regression
  • Alert on tool error rate > baseline; on zero-tool loops burning max rounds
  • Runbook: validation_error spike → schema deploy mismatch; timeout spike → dependency outage
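
The "enforced independently" item can be sketched as a budget object the agent loop consults before every round. `LoopBudget` and `BudgetExceeded` are hypothetical names, and the injectable `clock` exists only to make the sketch testable.

```python
import time


class BudgetExceeded(RuntimeError):
    pass


class LoopBudget:
    """Tracks round count and wall time as two separate limits."""

    def __init__(self, max_rounds: int = 5, max_seconds: float = 30.0,
                 clock=time.monotonic):
        self.max_rounds = max_rounds
        self.max_seconds = max_seconds
        self._clock = clock
        self._start = clock()
        self.rounds = 0

    def next_round(self) -> None:
        """Call before each model request; raises when either limit is hit."""
        if self.rounds >= self.max_rounds:
            raise BudgetExceeded("max_rounds")
        if self._clock() - self._start >= self.max_seconds:
            raise BudgetExceeded("max_wall_time")
        self.rounds += 1
```

Keeping the two checks separate matters because a single slow dependency can exhaust the wall-time limit in two rounds, while a fast but confused model can burn all five rounds in a few seconds.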

Observability dashboard (starter panels)

| Panel | Signal |
|---|---|
| Tool calls / minute by name | Traffic mix; detect runaway search_policy |
| p95 latency by tool | Downstream degradation |
| Wrong-tool rate (eval sample) | Schema/description drift |
| Approval pending queue depth | HITL staffing |
| Rounds to completion | Loop efficiency; prompt regressions |

⚙️ Config

Example deploy manifest: tool_registry_version: 2026-06-07 · max_tool_rounds: 5 · parallel_read_max: 8 · hitl_tier_high: issue_refund,send_email

💰 Cost

Wrong-tool calls waste tokens and downstream quota. Track cost per successful task, not per LLM call—tool-heavy turns dominate CRM and payment API bills.
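
The metric is simple arithmetic, sketched below: total spend across all tasks, failed ones included, divided by the number that succeeded. The record fields (`llm_usd`, `tool_usd`, `success`) are assumed names for whatever your billing export provides.

```python
def cost_per_successful_task(tasks: list) -> float:
    """Total LLM plus tool spend over all tasks, divided by the successful ones."""
    total = sum(t["llm_usd"] + t["tool_usd"] for t in tasks)
    successes = sum(1 for t in tasks if t["success"])
    if successes == 0:
        raise ValueError("no successful tasks in sample")
    return total / successes
```

Failed tasks stay in the numerator on purpose: a wrong-tool call that burns CRM quota and then fails still raises the price of every task that did succeed.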

📦 Real World

On-call: spike in validation_error after deploy → schema mismatch between registry and OpenAI tools array. Roll back registry version, not the model.

🎯 Interview Tip

Close with eval strategy: tool-selection accuracy on labeled traces beats BLEU on final answers for agent features. Mention hash-based arg logging for GDPR-friendly audit.