Tool calling & function calling

Architecture: schema → call → execute → result → response

Tool calling (also function calling) lets an LLM request typed side effects instead of hallucinating facts. Your application owns execution; the model only proposes structured calls against JSON Schema you register. Every production agent loop follows the same five-step rhythm—whether OpenAI tool_calls, Anthropic tool_use, or Bedrock Converse toolUse.

Think of tools as an internal RPC surface exposed to the model. You declare names, descriptions, and parameter schemas; the model returns arguments matching those schemas; your router validates, authorizes, executes, and feeds results back as new messages. The loop continues until the model emits a final natural-language answer or you hit iteration/time limits. This is the bridge from Track 1 (APIs) and Track 3 (prompting) into Track 4 (agents).

The five-step loop

Schema registration — attach tool definitions (name, description, JSON Schema parameters) to the API request.
Model call — send messages; model may return text, tool call(s), or both.
Execute — your code parses arguments, checks allowlists, runs the handler (DB query, HTTP, compute).
Result injection — append provider-specific tool-result messages tied to each call ID.
Response — call the model again with updated history until finish_reason is not tool-related or max rounds reached.
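
The five steps can be traced end to end without any provider SDK. The sketch below is illustrative only: fake_model is a hypothetical stand-in for a real API call, and the message dicts are simplified rather than an exact provider wire format.

```python
import json

# Step 1: schema registration (simplified tool definition).
TOOLS = [{
    "name": "lookup_order",
    "description": "Fetch order status by order ID.",
    "parameters": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}

HANDLERS = {"lookup_order": lookup_order}

def fake_model(messages: list, tools: list) -> dict:
    """Hypothetical stand-in for a provider call: one lookup, then a text answer."""
    if not any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": None, "tool_calls": [
            {"id": "call_1", "name": "lookup_order",
             "arguments": json.dumps({"order_id": "ORD-8821"})}]}
    status = json.loads(messages[-1]["content"])["status"]
    return {"role": "assistant", "content": f"Your order is {status}.", "tool_calls": []}

def run_loop(user: str, max_rounds: int = 5) -> str:
    messages = [{"role": "user", "content": user}]
    for _ in range(max_rounds):
        reply = fake_model(messages, TOOLS)        # Step 2: model call
        messages.append(reply)
        if not reply["tool_calls"]:
            return reply["content"]                 # Step 5: final response
        for call in reply["tool_calls"]:
            handler = HANDLERS.get(call["name"])    # Step 3: execute (allowlisted)
            if handler is None:
                out = {"error": "unknown_tool"}
            else:
                out = handler(**json.loads(call["arguments"]))
            messages.append({                       # Step 4: result injection
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(out),
            })
    raise RuntimeError("max tool rounds exceeded")
```

Swapping fake_model for a real client call is the only change needed to turn this into one of the provider loops shown later; the iteration cap and ID pairing stay the same.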

Sequence diagram (conceptual)

User ──► App ──► LLM (tools registered)
              ◄── assistant: tool_call lookup_order(order_id="ORD-8821")
App ──► execute lookup_order ──► {status, items, total}
App ──► LLM (messages + tool result)
              ◄── assistant: "Your order shipped yesterday…"
App ──► User

Responsibility split

Layer	Owns	Must not
Model	Which tool(s) to call; argument values from context	Direct DB/network access; secrets
Tool router	Validation, authZ, timeouts, idempotency, audit	Trust model args without schema check
Handlers	Business logic; return JSON-serializable results	Return unbounded blobs or PII dumps
Agent runner	Iteration cap, parallel dispatch, HITL gates	Infinite loops on repeated tool errors

Minimal loop (all providers, same shape)

from openai import OpenAI
import json

TOOLS = [{
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Fetch order status by order ID (ORD- prefix).",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
            "additionalProperties": False,
        },
    },
}]

def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped", "eta_days": 1}

HANDLERS = {"lookup_order": lookup_order}

def run_agent(user: str, max_rounds: int = 5) -> str:
    client = OpenAI()
    messages = [{"role": "user", "content": user}]
    for _ in range(max_rounds):
        resp = client.chat.completions.create(
            model="gpt-4o-mini", tools=TOOLS, tool_choice="auto", messages=messages,
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content or ""
        messages.append(msg.model_dump(exclude_none=True))
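        for call in msg.tool_calls:
            handler = HANDLERS.get(call.function.name)  # allowlist: registry keys only
            try:
                args = json.loads(call.function.arguments)
                out = handler(**args) if handler else {"error": "unknown_tool"}
            except (json.JSONDecodeError, TypeError):
                out = {"error": "invalid_arguments"}
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(out),
            })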
    raise RuntimeError("max tool rounds exceeded")

// Spring AI ChatClient — ToolCallback registry maps name → handler
public String runAgent(String user, int maxRounds) {
  List<Message> messages = new ArrayList<>(List.of(new UserMessage(user)));
  for (int i = 0; i < maxRounds; i++) {
    ChatResponse resp = chatClient.prompt(new Prompt(messages)).tools(orderTools).call();
    AssistantMessage assistant = resp.getResult().getOutput();
    if (!assistant.hasToolCalls()) return assistant.getText();
    messages.add(assistant);
    for (ToolCall call : assistant.getToolCalls()) {
      String output = toolRouter.invoke(call.getName(), call.getArguments());
      messages.add(new ToolResponseMessage(call.getId(), output));
    }
  }
  throw new IllegalStateException("max tool rounds exceeded");
}

import anthropic, json

client = anthropic.Anthropic()
tools = [{
    "name": "lookup_order",
    "description": "Fetch order status by order ID.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

def run_agent(user: str, max_rounds: int = 5) -> str:
    messages = [{"role": "user", "content": user}]
    for _ in range(max_rounds):  # iteration cap, matching the OpenAI loop
        resp = client.messages.create(
            model="claude-3-5-sonnet-20241022", max_tokens=1024,
            tools=tools, messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})
        tool_uses = [b for b in resp.content if b.type == "tool_use"]
        if not tool_uses:
            return "".join(b.text for b in resp.content if b.type == "text")
        results = []
        for tu in tool_uses:
            # lookup_order is the handler from the first listing; only one tool is registered here
            out = lookup_order(**tu.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": tu.id,
                "content": json.dumps(out),
            })
        messages.append({"role": "user", "content": results})
    raise RuntimeError("max tool rounds exceeded")

// AnthropicMessageChatModel + ToolCallback
// tool_use blocks in assistant; tool_result in following user message

import boto3, json

client = boto3.client("bedrock-runtime")
tool_config = {"tools": [{"toolSpec": {
    "name": "lookup_order",
    "description": "Fetch order status",
    "inputSchema": {"json": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    }},
}}]}

def run_agent(user: str, max_rounds: int = 5) -> str:
    messages = [{"role": "user", "content": [{"text": user}]}]
    for _ in range(max_rounds):  # iteration cap, matching the OpenAI loop
        resp = client.converse(
            modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
            messages=messages, toolConfig=tool_config,
        )
        out_msg = resp["output"]["message"]
        messages.append(out_msg)
        tool_uses = [b for b in out_msg["content"] if "toolUse" in b]
        if not tool_uses:
            return "".join(b.get("text", "") for b in out_msg["content"])
        results = []
        for tu in tool_uses:
            use = tu["toolUse"]
            # lookup_order is the handler from the first listing; only one tool is registered here
            out = lookup_order(**use["input"])
            results.append({"toolResult": {
                "toolUseId": use["toolUseId"],
                "content": [{"json": out}],
            }})
        messages.append({"role": "user", "content": results})
    raise RuntimeError("max tool rounds exceeded")

// BedrockRuntimeClient.converse — toolConfig + toolResult blocks
// Spring AI BedrockConverseChatModel wraps same API

🔬 Under the Hood

The model never runs your Python—it emits a structured intent. Execution happens in your process (or sandbox). That separation is why authZ on the router matters more than prompt pleading.

⚠️ Pitfall

Storing edited summaries of tool results in message history. Replay exact API payloads—especially tool_call_id / tool_use_id pairing—or the next model turn will hallucinate continuity.

💡 Pro Tip

Log the full message array (redacted) per round. When a user reports a wrong refund, you need round 2 tool args—not only the final assistant text.

🎯 Interview Tip

Whiteboard the five steps and mark where authZ lives. Interviewers want execution outside the model with explicit iteration caps.

OpenAI: tools array, tool_choice, parallel calls

OpenAI Chat Completions expose tools as a top-level tools array of function definitions. The assistant message may include tool_calls with stable IDs; you reply with role: tool messages. Models that support it can emit multiple tool calls in one assistant turn; parallel_tool_calls defaults to true and can be set to false to limit the model to at most one call per turn.

OpenAI popularized the name "function calling"; the current API uses "tools" with type: function. The assistant message is stored with a tool_calls array—each entry has id, function.name, and function.arguments (JSON string). Your runner must append the full assistant message before tool results so the API can correlate IDs on the next turn.

tools array shape

Each entry is {"type": "function", "function": { name, description, parameters }}. Parameters must be JSON Schema. Set additionalProperties: false on object schemas to reduce spurious keys in arguments.

Field	Purpose	Guidance
name	Stable identifier for router	snake_case; match handler registry key exactly
description	When to use this tool	Include negative scope: "Do not use for refunds"
parameters	Argument schema	Enums for categorical fields; regex patterns for IDs

tool_choice modes

Value	Behavior	Use when
"auto" (default)	Model may call zero or more tools	General agents; support bots
"none"	Disable tools for this turn	Final summarization; policy-only response
{"type":"function","function":{"name":"X"}}	Force specific tool	Structured extraction pipelines; eval harnesses
"required" (supported models)	Must call at least one tool	Router step that always delegates to backend

Parallel tool_calls in one assistant message

When the user asks "What's the status of ORD-8821 and ORD-8822?", gpt-4o family often returns two tool_calls in a single assistant message. If both are read-only, execute concurrently, append two role: tool messages (each with matching tool_call_id), then call the model once.

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=TOOLS,
    tool_choice="auto",  # "none" | {"type":"function","function":{"name":"lookup_order"}}
    parallel_tool_calls=True,
)
msg = resp.choices[0].message
finish = resp.choices[0].finish_reason  # "tool_calls" | "stop" | "length" | "content_filter"

if msg.tool_calls:
    messages.append(msg.model_dump(exclude_none=True))
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        out = HANDLERS[call.function.name](**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(out),
        })

ChatResponse resp = chatClient.prompt(new Prompt(messages))
    .options(ChatOptions.builder()
        .toolChoice("auto")
        .parallelToolCalls(true)
        .build())
    .tools(toolCallbacks)
    .call();
AssistantMessage assistant = resp.getResult().getOutput();
if (assistant.hasToolCalls()) {
  messages.add(assistant);
  for (ToolCall call : assistant.getToolCalls()) {
    String json = toolRegistry.invoke(call.getName(), call.getArguments());
    messages.add(new ToolResponseMessage(call.getId(), json));
  }
}

# Anthropic: tools + tool_choice={"type":"auto"|"any"|"tool",...}
# Parallel tool_use blocks in one assistant content array

// AnthropicChatOptions.toolChoice(ToolChoiceBuilder.AUTO)

# Bedrock Converse: toolConfig={"tools":[...]} — multiple toolUse per turn

// ConverseRequest.toolConfig(ToolConfiguration.builder()...)

finish_reason and empty tool_calls

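When finish_reason == "tool_calls", expect at least one call. If generation hits the token limit instead, finish_reason is "length" and any partially emitted arguments may be truncated, invalid JSON, so treat that turn as failed rather than executing it. When the model answers directly, tool_calls is null and finish_reason == "stop". Do not assume every turn needs tools; over-registering tools increases mistaken invocations.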
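
One way a runner might branch on these cases is sketched below. It assumes tool calls have already been flattened into plain dicts with id, name, and arguments keys (a simplification of the SDK objects), and the action labels are hypothetical names for this example.

```python
import json
from typing import Optional

def classify_turn(finish_reason: str, tool_calls: Optional[list]) -> tuple:
    """Return (action, parsed_calls); action is 'execute', 'answer', or 'retry'."""
    if finish_reason == "length":
        # Output was cut off by the token limit; arguments may be incomplete JSON.
        return "retry", []
    if not tool_calls:
        return "answer", []
    parsed = []
    for call in tool_calls:
        try:
            args = json.loads(call["arguments"])
        except json.JSONDecodeError:
            # Never execute a call whose arguments do not parse.
            return "retry", []
        parsed.append((call["id"], call["name"], args))
    return "execute", parsed
```

Whether "retry" means re-asking with a larger token budget or surfacing an error to the user is a product decision the sketch leaves open.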

⚖️ Trade-off

Forcing a tool via tool_choice improves reliability but removes the model's ability to ask clarifying questions in text. Use forced tools for batch ETL; use auto for conversational agents.

⚙️ Config

Feature flag per endpoint: tool_choice: auto in prod chat, none on summarization-only routes. Log effective choice and tool count per request.

📦 Real World

Support agents register 8–12 tools max. Beyond ~15, wrong-tool rate rises—split into specialist sub-agents or dynamic tool retrieval (later guides).

🔒 Security

Never pass user-controlled strings into tool names. Allowlist call.function.name against registry keys before dispatch—prompt injection loves inventing admin_override.
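
A minimal sketch of that allowlist check, assuming a HANDLERS registry like the one in the minimal loop; safe_dispatch is a hypothetical helper name, and the error payloads follow the structured-error style used elsewhere in this guide.

```python
import json

def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}

# Registry keys are the only tool names that may ever be executed.
HANDLERS = {"lookup_order": lookup_order}

def safe_dispatch(name: str, raw_arguments: str) -> dict:
    handler = HANDLERS.get(name)
    if handler is None:
        # The model (or injected text) asked for a tool that was never registered.
        return {"error": "unknown_tool", "retryable": False}
    try:
        args = json.loads(raw_arguments)
        if not isinstance(args, dict):
            raise TypeError("arguments must be a JSON object")
        return handler(**args)
    except (json.JSONDecodeError, TypeError):
        # Covers malformed JSON, non-object payloads, and unexpected keys.
        return {"error": "invalid_arguments", "retryable": False}
```

The lookup goes through the registry dict rather than getattr or eval, so an invented name such as admin_override can only ever produce an error result.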

Anthropic: tool_use and tool_result blocks

Anthropic Messages API represents tools as tools with input_schema (JSON Schema). Assistant replies are content block arrays: interleaved text and tool_use blocks. Results return in a subsequent user message as tool_result blocks referencing tool_use_id.

Unlike OpenAI's dedicated role: tool message, Anthropic nests results inside a user turn. The assistant message before results must be preserved verbatim, including any leading text blocks where the model explains its plan. Bedrock Converse uses the same block pattern with its own field names (toolUse / toolResult); adapters in multi-provider gateways must not flatten these into generic "tool" roles without translation.

Block flow

Request includes tools=[{name, description, input_schema}] and optional tool_choice.
Assistant content may include {"type":"tool_use","id":"toolu_…","name":"…","input":{…}}.
You execute handlers and append a user message whose content is an array of tool_result blocks.
Model continues with more text and/or additional tool_use blocks until stop.

tool_choice on Anthropic

tool_choice	Effect	Typical use
{"type":"auto"}	Model decides whether to use tools	Default agent behavior
{"type":"any"}	Must invoke at least one tool	Forced delegation steps
{"type":"tool","name":"lookup_order"}	Must call named tool	Eval fixtures; structured extract
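
For a multi-provider gateway, the OpenAI-style values from the earlier table map onto these shapes fairly directly. The helper below is a sketch: to_anthropic_tool_choice is a hypothetical name, and returning None for "none" (meaning the caller leaves tools off that request) is an assumption of this example, not documented provider behavior.

```python
def to_anthropic_tool_choice(openai_choice):
    """Translate an OpenAI-style tool_choice into the Anthropic dict form."""
    if openai_choice == "auto":
        return {"type": "auto"}
    if openai_choice == "required":
        return {"type": "any"}
    if isinstance(openai_choice, dict) and openai_choice.get("type") == "function":
        return {"type": "tool", "name": openai_choice["function"]["name"]}
    if openai_choice == "none":
        # Assumption: the caller omits tools for this turn instead of sending a choice.
        return None
    raise ValueError(f"unsupported tool_choice: {openai_choice!r}")
```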

Parallel tool_use blocks

One assistant turn may contain multiple tool_use blocks (e.g. two order lookups). Batch all corresponding tool_result entries in the next single user message—order of results need not match order of calls, but each tool_use_id must be unique and correctly paired.

# OpenAI: role "tool" + tool_call_id (not user+tool_result blocks)

// ToolResponseMessage per OpenAI tool_call_id

import anthropic, json

client = anthropic.Anthropic()
question = "Can I return an opened laptop after 45 days?"
tools = [{
    "name": "search_policy",
    "description": "Search internal policy KB. Query max 200 chars.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string", "maxLength": 200}},
        "required": ["query"],
    },
}]
resp = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    tool_choice={"type": "auto"},
    messages=[{"role": "user", "content": question}],
)
messages = [
    {"role": "user", "content": question},
    {"role": "assistant", "content": resp.content},  # keep text and tool_use blocks verbatim
]
tool_results = []
for block in resp.content:
    if block.type == "tool_use":
        data = run_search_policy(**block.input)  # your handler; not defined in this snippet
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": json.dumps(data),
        })
if tool_results:  # with tool_choice auto the model may answer without calling a tool
    messages.append({"role": "user", "content": tool_results})
    follow_up = client.messages.create(
        model="claude-3-5-sonnet-20241022", max_tokens=1024, tools=tools, messages=messages,
    )

// Spring AI: AnthropicChatModel with ToolDefinition inputSchema
// Map ToolResponse to Anthropic tool_result content blocks in UserMessage

# Bedrock Converse mirrors Anthropic block shapes:
# assistant content: [{"toolUse": {"toolUseId", "name", "input"}}]
# user content: [{"toolResult": {"toolUseId", "content": [{"json": ...}]}}]

// BedrockConverseChatModel — same toolUse / toolResult semantics

is_error on tool_result

Anthropic supports "is_error": true on tool_result when execution failed. Use this for validation failures and timeouts so the model can tell a failed call apart from a successful call that returned nothing.

⚠️ Pitfall

Omitting the full assistant message (including text blocks) before tool_result. Anthropic expects the exact prior assistant content array preserved in history.

🔬 Under the Hood

tool_use IDs are provider-generated (e.g. toolu_01…). Never reuse IDs across turns; each invocation gets a fresh identifier tied to one result block.

💰 Cost

Interleaved text + tool_use in one turn still bills output tokens for both. If latency-sensitive, use the system prompt to tell the model to skip verbose narration before tool calls.

🎯 Interview Tip

Contrast OpenAI role:tool vs Anthropic user+tool_result. Shows you build multi-provider gateways without losing required semantics at the adapter layer.

Tool design: single responsibility, descriptions, typed params

Tools are UX for the model. Poor schemas cause wrong-tool selection, invalid args, and retry storms. Design each tool like a public API: one job, explicit description, strict types, predictable errors, idempotent writes.

The model chooses tools from names and descriptions alone—it cannot read your handler source code. A vague description guarantees misuse; an oversized god-tool guarantees wrong enum branches. Invest in schema quality before tuning temperature or swapping models.

Design principles

Principle	Practice	Anti-pattern
Single responsibility	lookup_order not manage_order	God tool with action enum covering 12 operations
Descriptions	When to use + when NOT + example ID format	"Handles orders" (useless to the model)
Typed params	JSON Schema types, enums, min/max, patterns	Single payload: string blob
Error-safe	Return {"error":"…","retryable":true}	Stack traces or HTML error pages as tool content
Idempotent writes	Require idempotency_key on mutating tools	Double refund on duplicate tool_call retry

Description template

Structure: one-line purpose → input constraints → output shape → negative scope.

Fetch order details by ID. Input: order_id must match ORD-[digits]. Returns status, line items, ship date. Do not use for refunds—use issue_refund.

Server-side validation before execute

from pydantic import BaseModel, Field, ValidationError

class LookupOrderArgs(BaseModel):
    order_id: str = Field(pattern=r"^ORD-\d+$")

class IssueRefundArgs(BaseModel):
    order_id: str
    amount_usd: float = Field(gt=0, le=10_000)
    reason: str = Field(max_length=200)
    idempotency_key: str

def dispatch(name: str, raw: dict) -> dict:
    try:
        if name == "lookup_order":
            args = LookupOrderArgs.model_validate(raw)
            return lookup_order(args.order_id)
        if name == "issue_refund":
            args = IssueRefundArgs.model_validate(raw)
            return issue_refund(**args.model_dump())
    except ValidationError as e:
        return {"error": "invalid_arguments", "details": e.errors(), "retryable": False}
    return {"error": "unknown_tool", "retryable": False}

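// order_id in the tool schema maps to orderId here; validator is an injected jakarta.validation.Validator
public record LookupOrderArgs(
    @JsonProperty("order_id") @Pattern(regexp = "ORD-\\d+") String orderId) {}

public String dispatch(String name, String argsJson) {
  try {
    if ("lookup_order".equals(name)) {
      LookupOrderArgs args = objectMapper.readValue(argsJson, LookupOrderArgs.class);
      // Jackson does not run Bean Validation, so check the @Pattern constraint explicitly
      if (!validator.validate(args).isEmpty()) {
        return "{\"error\":\"invalid_arguments\",\"retryable\":false}";
      }
      return objectMapper.writeValueAsString(lookupOrder(args.orderId()));
    }
  } catch (JsonProcessingException e) {
    return "{\"error\":\"invalid_arguments\",\"retryable\":false}";
  }
  return "{\"error\":\"unknown_tool\",\"retryable\":false}";
}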

Returning errors the model can recover from

order_not_found — model asks user to verify ID
rate_limited with retryable: true — runner may backoff and retry
policy_denied — model explains constraint without retry loop
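
On the runner side, only the retryable case needs special handling; everything else goes back to the model unchanged. A sketch, assuming handlers return dicts with optional error and retryable keys; call_with_retry and its parameters are hypothetical names for this example.

```python
import time
from typing import Callable

def call_with_retry(
    handler: Callable[..., dict],
    args: dict,
    *,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Retry only results marked retryable, with exponential backoff between attempts."""
    result: dict = {"error": "not_attempted", "retryable": False}
    for attempt in range(max_attempts):
        result = handler(**args)
        if not (result.get("error") and result.get("retryable")):
            return result  # success, or an error the model should see as-is
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return result  # still failing: hand the structured error to the model
```

Errors such as order_not_found or policy_denied fall through on the first attempt, so the model can ask the user to verify the ID or explain the constraint instead of the runner looping.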

💡 Pro Tip

Return structured errors the model can read. A tool result {"error":"order_not_found"} beats an exception string the model misinterprets as success.

⚖️ Trade-off

Granular tools (10+) vs coarse tools (3). Granular improves selection accuracy but inflates schema tokens in every request. Measure wrong-tool rate on a golden set.

📦 Real World

Version tool descriptions in a registry (lookup_order@v3). A/B description changes like any API contract—with eval gates before full rollout.

🔒 Security

Validate argument length and rate-limit per tool. search_policy with a 10KB query is a DoS vector against your vector DB.
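
Both checks fit in a small admission gate that runs before the handler. The sketch below uses an in-memory sliding window; the limits, the admit name, and the error codes are assumptions for illustration, and a multi-process deployment would need a shared store instead of a module-level dict.

```python
import time
from collections import defaultdict, deque
from typing import Optional

MAX_ARG_CHARS = {"search_policy": 200}       # hypothetical per-tool string limit
RATE_LIMITS = {"search_policy": (5, 60.0)}   # hypothetical (max calls, window seconds)

_recent_calls = defaultdict(deque)  # tool name -> timestamps of admitted calls

def admit(tool: str, args: dict, now: Optional[float] = None) -> Optional[dict]:
    """Return None when the call may proceed, otherwise a structured error."""
    limit = MAX_ARG_CHARS.get(tool)
    if limit is not None:
        for key, value in args.items():
            if isinstance(value, str) and len(value) > limit:
                return {"error": "argument_too_long", "field": key, "retryable": False}
    if tool in RATE_LIMITS:
        max_calls, window = RATE_LIMITS[tool]
        now = time.monotonic() if now is None else now
        calls = _recent_calls[tool]
        while calls and now - calls[0] >= window:
            calls.popleft()  # drop timestamps that left the window
        if len(calls) >= max_calls:
            return {"error": "rate_limited", "retryable": True}
        calls.append(now)
    return None
```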

Tool categories: read / write / compute + confirmation for writes

Classify every tool by side-effect profile. Read tools query state; write tools mutate external systems; compute tools run deterministic logic with no external IO. Apply different guardrails: parallelize reads; serialize writes; require human confirmation before high-stakes mutations.

Category matrix

Category	Examples	Parallel?	HITL?	Idempotency
Read	lookup_order, search_policy, get_balance	Yes	No	Optional (cache keys)
Write	issue_refund, send_email, update_address	No (or queued)	Yes for high impact	Required
Compute	calculate_shipping, format_table, validate_iban	Yes	No	N/A (pure function)

Confirmation pattern for writes

Model calls write tool → router intercepts before execute.
Present pending action to user or approver queue: tool name, args, human-readable diff.
On approve: execute with idempotency key tied to tool_call_id or tool_use_id.
On reject: inject tool result {"status":"denied_by_user"} so the model apologizes and offers alternatives.
On timeout: inject {"status":"approval_expired"}; model asks user to restate intent.
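
The branches above reduce to a small function that turns an approval decision into the tool result the model will see. This is a sketch: resolve_write and the decision strings are hypothetical, awaiting_approval mirrors the status used in the HITL listing later in this guide, and execute is whatever handler the router would normally run.

```python
from typing import Callable, Optional

WRITE_TOOLS = {"issue_refund"}  # hypothetical slice of the registry

def resolve_write(
    call_id: str,
    name: str,
    args: dict,
    decision: Optional[str],
    execute: Callable[..., dict],
) -> dict:
    """decision is 'approved', 'rejected', 'expired', or None while still pending."""
    if name not in WRITE_TOOLS:
        return execute(**args)  # reads and compute tools run without a gate
    if decision == "approved":
        # Tie the idempotency key to the provider call ID so a retry cannot double-execute.
        return execute(**args, idempotency_key=call_id)
    if decision == "rejected":
        return {"status": "denied_by_user"}
    if decision == "expired":
        return {"status": "approval_expired"}
    return {"status": "awaiting_approval"}
```

Every branch returns a result for the same call ID, so the message history stays valid whichever way the approval goes.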

Registry metadata drives policy

from enum import Enum

class ToolKind(str, Enum):
    READ = "read"
    WRITE = "write"
    COMPUTE = "compute"

TOOL_META = {
    "lookup_order": {"kind": ToolKind.READ, "timeout_sec": 5},
    "search_policy": {"kind": ToolKind.READ, "timeout_sec": 5},
    "issue_refund": {"kind": ToolKind.WRITE, "requires_approval": True, "risk_tier": "high"},
    "calculate_shipping": {"kind": ToolKind.COMPUTE, "timeout_sec": 2},
}

WRITE_TOOLS = {n for n, m in TOOL_META.items() if m["kind"] == ToolKind.WRITE}
READ_TOOLS = {n for n, m in TOOL_META.items() if m["kind"] in (ToolKind.READ, ToolKind.COMPUTE)}

public enum ToolKind { READ, WRITE, COMPUTE }

public record ToolMeta(ToolKind kind, boolean requiresApproval, int timeoutSec, String riskTier) {}

Map<String, ToolMeta> registry = Map.of(
    "lookup_order", new ToolMeta(ToolKind.READ, false, 5, "low"),
    "issue_refund", new ToolMeta(ToolKind.WRITE, true, 30, "high")
);

Gray-area tools

Some tools blur lines: create_support_ticket writes to Zendesk but feels "safe." Rule of thumb: if a customer or finance system observes the effect without another human step, treat as write. Log-only internal annotations may be medium tier—auto-execute with audit.

⚠️ Pitfall

Classifying send_email_to_customer as read because it "only reads a template." If the recipient receives mail, it is a write—confirm before send.

⚙️ Config

WRITE_TOOLS=issue_refund,send_email in env; router refuses writes when HITL_MODE=required and no approval token is present on the request context.

🎯 Interview Tip

Explain read/write split for parallel execution policy. Shows operational maturity beyond demo agent loops that treat every tool the same.

💰 Cost

Write tools hit payment and email APIs—duplicate execution cost dwarfs LLM tokens. Idempotency keys keyed to provider call IDs are non-negotiable on refunds.

Parallel calling: N simultaneous tools

When the model returns N read-only tool calls in one turn, execute them concurrently to minimize wall-clock latency. Never parallelize writes unless your backend supports isolated transactions and you enforce per-resource locking.

OpenAI's parallel_tool_calls and Anthropic multi-tool_use turns are only wins if your executor cooperates. A sequential loop over three lookup_order calls triples latency; a thread pool with sane limits cuts it toward the slowest single dependency.

When to parallelize

All calls classified as READ or COMPUTE per registry metadata
Handlers are thread-safe or pure async without shared mutable state
Downstream APIs have headroom (bulkhead / circuit breaker not open)
Total fan-out ≤ configured max (commonly 8)

When to refuse parallel fan-out

Any write tool in the batch—split reads parallel, writes sequential
Tools hitting the same row (two updates to one order)—serialize or merge
Global rate limit on a shared API key—use semaphore per dependency

import json
from concurrent.futures import ThreadPoolExecutor, as_completed

def execute_tool_calls(request_id: str, round_num: int, tool_calls: list) -> list[tuple[str, str, str]]:
    """Returns (tool_call_id, name, output_json)."""
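    read_calls = [c for c in tool_calls if c.function.name in READ_TOOLS]
    # Everything else (writes and unregistered names) runs sequentially so no call is
    # left without a result; dispatch returns unknown_tool for unregistered names.
    write_calls = [c for c in tool_calls if c.function.name not in READ_TOOLS]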
    results: list[tuple[str, str, str]] = []

    def _one(call):
        name = call.function.name
        args = json.loads(call.function.arguments)
        out = dispatch(name, args)
        audit_log(request_id, round_num, name, args, out)
        return call.id, name, json.dumps(out)

    max_workers = min(8, len(read_calls) or 1)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(_one, c) for c in read_calls]
        for fut in as_completed(futures):
            results.append(fut.result())

    for call in write_calls:
        results.append(_one(call))  # sequential; HITL inside dispatch for writes
    return results

public List<ToolResult> executeToolCalls(String requestId, int round, List<ToolCall> calls) {
  List<ToolCall> reads = calls.stream().filter(c -> READ_TOOLS.contains(c.getName())).toList();
  // Anything not allowlisted as a read (writes and unknown names) takes the sequential, gated path
  List<ToolCall> writes = calls.stream().filter(c -> !READ_TOOLS.contains(c.getName())).toList();
  List<ToolResult> results = reads.parallelStream()
      .map(c -> invokeAudited(requestId, round, c))
      .collect(Collectors.toCollection(ArrayList::new));
  writes.forEach(c -> results.add(invokeWithApproval(requestId, round, c)));
  return results;
}

Async Python variant

For async handlers (HTTP/2 clients), prefer asyncio.gather with a concurrency semaphore instead of threads—same policy gates on read vs write buckets apply.
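
A minimal sketch of that variant, assuming each read call has been reduced to a (call_id, args) pair and that handler is an async function; run_reads is a hypothetical name.

```python
import asyncio

async def run_reads(calls: list, handler, max_concurrency: int = 8) -> list:
    """Run read-only calls concurrently, never more than max_concurrency at once."""
    sem = asyncio.Semaphore(max_concurrency)

    async def _one(call_id: str, args: dict):
        async with sem:  # wait here when the concurrency budget is used up
            return call_id, await handler(**args)

    # gather preserves input order, so results line up with the original calls
    return list(await asyncio.gather(*(_one(cid, args) for cid, args in calls)))
```

Write calls would still be awaited one at a time after this returns; the semaphore only bounds the read bucket.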

⚖️ Trade-off

Parallel reads cut user-visible latency but multiply load on fragile backends. Use semaphores per dependency (CRM, vector DB) and backpressure when error rate spikes.

🔬 Under the Hood

OpenAI parallel_tool_calls is a generation-time feature; your executor must still respect downstream rate limits—the model does not know your CRM's 429 threshold.

📦 Real World

Cap parallel fan-out at 8 even if the model requests 12 lookups—execute first 8, return partial with truncated: true in metadata, let model ask to continue.
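
One way to apply the cap is sketched below. Because each emitted call ID still needs a matching result message, the overflow calls get a synthetic result instead of being dropped. The fan_out_limit error code and the exact result shape are assumptions of this example.

```python
MAX_FAN_OUT = 8

def split_fan_out(calls: list, limit: int = MAX_FAN_OUT) -> tuple:
    """Return (calls to execute now, synthetic results for calls beyond the cap)."""
    to_run = calls[:limit]
    deferred = [
        {
            "tool_call_id": call["id"],
            # retryable tells the model it may request these lookups again next turn
            "result": {"error": "fan_out_limit", "retryable": True, "truncated": True},
        }
        for call in calls[limit:]
    ]
    return to_run, deferred
```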

💡 Pro Tip

as_completed returns results out of order—that is fine for message injection. Each tool_call_id pairing is independent of append order.

Result injection: message format per provider

Each provider requires a specific message shape when feeding tool outputs back into the conversation. Gateways that normalize history must preserve provider-native ID pairing—or translation bugs surface on turn 3+ when the model "forgets" which call failed.

Format comparison

Provider	Assistant carries call	Result message	ID field
OpenAI	assistant.tool_calls[]	role: "tool"	tool_call_id
Anthropic	content[].type == "tool_use"	user.content[].tool_result	tool_use_id
Bedrock Converse	content[].toolUse	user.content[].toolResult	toolUseId

Content encoding rules

OpenAI: tool message content is a string; serializing structured handler output with json.dumps (JSON.stringify in JavaScript) is standard.
Anthropic: tool_result.content may be a string or content blocks; a JSON string works for most handlers; set is_error on failures.
Bedrock: prefer content: [{"json": {...}}] for structured success; [{"text": "..."}] with status: "error" for failures.

Multi-provider injection helpers

def inject_openai(messages: list, call_id: str, result: dict) -> None:
    messages.append({
        "role": "tool",
        "tool_call_id": call_id,
        "content": json.dumps(result),
    })

# One tool message per tool_call_id — never combine

messages.add(new ToolResponseMessage(
    ToolResponse.builder(callId, toolName).responseData(json).build()
));

def inject_anthropic_batch(results: list[tuple[str, dict]]) -> dict:
    """Batch multiple tool_result blocks in ONE user message."""
    blocks = []
    for call_id, result in results:
        block = {"type": "tool_result", "tool_use_id": call_id, "content": json.dumps(result)}
        if result.get("error"):
            block["is_error"] = True
        blocks.append(block)
    return {"role": "user", "content": blocks}

// UserMessage with List<AnthropicToolResultContent> batched per turn

def inject_bedrock(tool_use_id: str, result: dict) -> dict:
    if "error" in result:
        return {"toolResult": {
            "toolUseId": tool_use_id,
            "content": [{"text": json.dumps(result)}],
            "status": "error",
        }}
    return {"toolResult": {
        "toolUseId": tool_use_id,
        "content": [{"json": result}],
    }}

// ContentBlock.fromToolResult(ToolResultBlock.builder()...)

Canonical internal model (gateway pattern)

Store conversation as provider-neutral records: {turn, assistant_tool_calls[], tool_results[]}. Adapters translate to OpenAI or Anthropic shapes at API boundary only. Never persist provider-specific messages without also storing the neutral form—you will migrate models.
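
A sketch of what such a neutral record and its adapters could look like. The dataclass and function names are hypothetical; the output shapes follow the OpenAI and Anthropic message formats described earlier in this guide.

```python
import json
from dataclasses import dataclass, field

@dataclass
class ToolCallRecord:
    call_id: str
    name: str
    arguments: dict

@dataclass
class ToolResultRecord:
    call_id: str
    output: dict
    is_error: bool = False

@dataclass
class Turn:
    text: str = ""
    calls: list = field(default_factory=list)
    results: list = field(default_factory=list)

def to_openai(turn: Turn) -> list:
    assistant = {"role": "assistant", "content": turn.text or None}
    if turn.calls:
        assistant["tool_calls"] = [
            {"id": c.call_id, "type": "function",
             "function": {"name": c.name, "arguments": json.dumps(c.arguments)}}
            for c in turn.calls
        ]
    # One role: tool message per call ID
    tool_msgs = [
        {"role": "tool", "tool_call_id": r.call_id, "content": json.dumps(r.output)}
        for r in turn.results
    ]
    return [assistant, *tool_msgs]

def to_anthropic(turn: Turn) -> list:
    content = [{"type": "text", "text": turn.text}] if turn.text else []
    content += [
        {"type": "tool_use", "id": c.call_id, "name": c.name, "input": c.arguments}
        for c in turn.calls
    ]
    messages = [{"role": "assistant", "content": content}]
    if turn.results:
        blocks = []
        for r in turn.results:
            block = {"type": "tool_result", "tool_use_id": r.call_id,
                     "content": json.dumps(r.output)}
            if r.is_error:
                block["is_error"] = True
            blocks.append(block)
        messages.append({"role": "user", "content": blocks})  # all results in one user turn
    return messages
```

The call ID is carried through untouched in both directions, which is the property the adapters must never lose.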

⚠️ Pitfall

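Mismatched IDs: injecting a result for call_abc when the model emitted call_xyz. Providers generally reject the request with a validation error; where a mismatch slips through (for example after lossy gateway translation), the model may attribute the result to the wrong call and report success. Assert every ID exists in the immediately prior assistant turn.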
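
That assertion can be a few lines run just before the next model call. A sketch, with check_result_ids as a hypothetical helper that takes the IDs from the prior assistant turn and the IDs of the results about to be injected:

```python
def check_result_ids(expected_ids: list, result_ids: list) -> None:
    """Raise ValueError unless results pair one-to-one with the prior turn's calls."""
    expected, got = set(expected_ids), set(result_ids)
    if len(result_ids) != len(got):
        raise ValueError("duplicate tool result IDs")
    missing, unexpected = expected - got, got - expected
    if missing or unexpected:
        raise ValueError(
            f"tool result ID mismatch: missing={sorted(missing)} "
            f"unexpected={sorted(unexpected)}"
        )
```

Order is deliberately ignored, so results collected out of order from a thread pool still pass.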

🔒 Security

Scrub PII from tool results before re-prompt unless policy allows. Logs and model context share the same leakage surface—hash or truncate account numbers in results.
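
A minimal redaction pass over a tool result might look like the following. The pattern (any run of 8 to 16 digits treated as an account number) is an assumption for illustration only; real policies need field-aware rules, and numeric (non-string) values are left untouched here.

```python
import re

# Assumption: account numbers appear as 8 to 16 consecutive digits inside strings.
ACCOUNT_RE = re.compile(r"\b\d{8,16}\b")

def _mask(match: re.Match) -> str:
    digits = match.group()
    return "*" * (len(digits) - 4) + digits[-4:]  # keep only the last four digits

def scrub(value):
    """Recursively mask account-like numbers in strings, dicts, and lists."""
    if isinstance(value, str):
        return ACCOUNT_RE.sub(_mask, value)
    if isinstance(value, dict):
        return {key: scrub(item) for key, item in value.items()}
    if isinstance(value, list):
        return [scrub(item) for item in value]
    return value
```

Running the same function over both the tool result and the log record keeps the two leakage surfaces consistent.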

🎯 Interview Tip

Expect to be asked how you'd build a provider-agnostic message store. Strong answer: canonical internal event log + thin adapters; never normalize away ID semantics at persistence layer.

Human-in-the-loop: pause for approval on high-stakes

Models misread policy edge cases. For refunds, account closures, wire transfers, and outbound email to customers, pause execution until a human approves, edits, or rejects the pending tool call. The agent loop stays intact—you inject a synthetic tool result reflecting the human decision.

Human-in-the-loop (HITL) is not "no agents"—it is gated side effects. The model may still plan, gather reads, and draft the refund rationale; only the mutating RPC waits on approval. Sync UIs show a pending card; async workflows enqueue to Slack/email and resume via webhook when decided.

Approval flow

Detect write tool in router → create PendingAction (args, user, request_id, provider call ID).
Return holding response to client or continue session with "awaiting approval" tool result if blocking is unacceptable.
Approver UI shows diff: "Refund $149.99 to ORD-8821 — reason: duplicate charge".
On approve → execute with idempotency key = provider call ID; inject success result; resume loop.
On reject → inject {"status":"denied_by_user"}; model explains to end user.

Risk tiers

Tier	Examples	Gate
Low	Read, compute	Auto-execute
Medium	Update nickname, internal note	Auto + audit log
High	Refund > $50, email external party	Human approval (single approver)
Critical	Wire transfer, delete account	Two-person rule + ticket ID
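
The tier table can be encoded as a small policy function that the router consults before executing. This is a sketch under stated assumptions: the tool names come from the examples in this guide, treating refunds of $50 or less as medium and unknown tools as high (fail closed) are choices made for this example, and the gate labels are hypothetical.

```python
GATES = {
    "low": "auto",
    "medium": "auto_with_audit",
    "high": "single_approver",
    "critical": "two_person",
}

LOW_RISK = {"lookup_order", "search_policy", "calculate_shipping"}
CRITICAL = {"wire_transfer", "delete_account"}

def risk_tier(tool: str, args: dict) -> str:
    if tool in CRITICAL:
        return "critical"
    if tool in LOW_RISK:
        return "low"
    if tool == "issue_refund":
        # Assumption: refunds at or below $50 are medium; the table only defines > $50.
        return "high" if args.get("amount_usd", 0) > 50 else "medium"
    return "high"  # assumption: unknown tools fail closed to human approval

def gate_for(tool: str, args: dict) -> str:
    return GATES[risk_tier(tool, args)]
```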

import json
from dataclasses import dataclass
from enum import Enum

class ApprovalState(str, Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"

@dataclass
class PendingAction:
    id: str
    tool_call_id: str
    tool_name: str
    arguments: dict
    state: ApprovalState = ApprovalState.PENDING

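def run_write_tool(name: str, args: dict, *, approval_id: str) -> str:
    pending = approval_store.get(approval_id)
    if pending is None or pending.tool_name != name:
        return json.dumps({"error": "unknown_approval", "retryable": False})
    if pending.state == ApprovalState.PENDING:
        notify_approver(pending)
        return json.dumps({"status": "awaiting_approval", "approval_id": approval_id})
    if pending.state == ApprovalState.REJECTED:
        return json.dumps({"status": "denied_by_user"})
    # Execute the arguments the approver saw (pending.arguments), not the args on this call
    return json.dumps(issue_refund(**pending.arguments, idempotency_key=pending.tool_call_id))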

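public record PendingAction(String id, String toolCallId, String toolName, JsonNode args, ApprovalState state) {}

public String runWriteTool(String name, String argsJson, String approvalId) {
  PendingAction p = store.get(approvalId);
  // Fail closed if the record is missing or belongs to a different tool.
  if (p == null || !p.toolName().equals(name)) {
    return "{\"status\":\"approval_not_found\"}";
  }
  if (p.state() == ApprovalState.PENDING) {
    notifier.send(p);
    return "{\"status\":\"awaiting_approval\"}";
  }
  if (p.state() == ApprovalState.REJECTED) {
    return "{\"status\":\"denied_by_user\"}";
  }
  // Execute the stored, approved args, not the caller-supplied argsJson.
  return invokeRefund(p.args().toString(), p.toolCallId());
}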

Preventing self-approval via prompt injection

Approvers authenticate on a separate channel (admin SSO), not the chat session
Approval tokens are single-use, bound to approval_id + approver identity
Chat user cannot pass approved=true in tool args—decision lives only in approval store
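
One way to make approval tokens single-use and bound to both the action and the approver is to sign the pair with a server-side key and remember redeemed tokens. The sketch below uses the standard hmac module. The in-memory key and used-token set are assumptions for illustration; a real deployment would load the key from a secret manager and track redemption in a shared store.

```python
import hashlib
import hmac
import json
import secrets

SECRET_KEY = secrets.token_bytes(32)  # assumption: per-process key, for illustration only
_used_tokens: set[str] = set()        # assumption: in-memory; use a shared store in production

def mint_token(approval_id: str, approver_id: str) -> str:
    """Sign the (approval_id, approver_id) pair with the server-side key."""
    # JSON encoding keeps the two fields unambiguous even if they contain separators.
    msg = json.dumps([approval_id, approver_id]).encode()
    return hmac.new(SECRET_KEY, msg, hashlib.sha256).hexdigest()

def redeem_token(token: str, approval_id: str, approver_id: str) -> bool:
    """Accept a token once, and only for the action and approver it was minted for."""
    expected = mint_token(approval_id, approver_id)
    if not hmac.compare_digest(token.encode(), expected.encode()):
        return False
    if token in _used_tokens:
        return False  # already redeemed: single-use
    _used_tokens.add(token)
    return True
```

Because the signing key never enters the chat session, text injected into the conversation cannot produce a valid token.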

📦 Real World

Slack interactive buttons with 15-minute TTL. Expired pending actions inject approval_expired so the model asks the user to restate intent—not loop forever on stale pending state.
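
The expiry rule can be sketched as a check made whenever the loop looks at an undecided action. TimedPending and pending_result are hypothetical names; the 15-minute TTL and the approval_expired status come from the example above.

```python
import json
import time
from dataclasses import dataclass, field
from typing import Optional

APPROVAL_TTL_SECONDS = 15 * 60  # 15-minute TTL from the example above

@dataclass
class TimedPending:
    approval_id: str
    tool_call_id: str
    created_at: float = field(default_factory=time.time)

def pending_result(p: TimedPending, now: Optional[float] = None) -> str:
    """Build the tool-result payload for an action that has not been decided yet."""
    now = time.time() if now is None else now
    if now - p.created_at >= APPROVAL_TTL_SECONDS:
        # Expired: tell the model so it asks the user to restate intent.
        return json.dumps({"status": "approval_expired", "approval_id": p.approval_id})
    return json.dumps({"status": "awaiting_approval", "approval_id": p.approval_id})
```

Passing now explicitly keeps the function easy to test without waiting for a real clock.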

⚖️ Trade-off

Sync HITL blocks the HTTP request—poor chat UX. Prefer async: persist message history, notify approver, resume agent via webhook + stored checkpoint.

🔒 Security

Separate approver identity from chat user. Prompt injection must not forge approval headers or pre-signed URLs.

💡 Pro Tip

Include the provider call ID in approval deep links. Replay attacks reuse old approvals, so execution must be idempotent: a duplicate key returns the stored outcome of the first run instead of re-running the action.
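
A minimal sketch of idempotent execution keyed on the provider call ID. The in-memory dict is an assumption and is not safe under concurrency; a production version would typically rely on a database table with a unique constraint on the key.

```python
from typing import Callable

# Assumption: in-memory outcome store, for illustration only.
_outcomes: dict[str, dict] = {}

def execute_once(idempotency_key: str, action: Callable[[], dict]) -> dict:
    """Run `action` at most once per key; replays receive the stored outcome."""
    if idempotency_key in _outcomes:
        return _outcomes[idempotency_key]
    outcome = action()
    _outcomes[idempotency_key] = outcome
    return outcome
```

A replayed approval for the same call ID then gets the original refund result back and never triggers a second refund.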

Production: tool registry, logging, and checklist

Shipping tool calling means more than a demo loop: centralized registry, schema versioning, structured audit logs, per-tool SLOs, and eval suites that assert correct tool selection—not just final answer quality.

Tool registry responsibilities

Single source of truth: name → handler, schema, kind (read/write/compute), owning team
Version schemas; deprecate with dual-register period and migration notes
Feature flags per tool (issue_refund_enabled: false in staging)
Deploy smoke test: invoke each tool with golden args before traffic shift
Document SLAs: p99 latency budget per downstream dependency
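
The feature-flag and smoke-test items can be combined into one pre-deploy check. This is a sketch under assumptions: handlers are plain callables, flags follow the <tool>_enabled naming used above, and golden args for write tools point at a sandbox rather than production data.

```python
from typing import Callable

def smoke_test(tools: dict[str, Callable[..., dict]],
               golden_args: dict[str, dict],
               flags: dict[str, bool]) -> dict[str, str]:
    """Invoke every enabled tool once with known-good args; report per-tool status."""
    report: dict[str, str] = {}
    for name, handler in tools.items():
        if not flags.get(f"{name}_enabled", True):
            report[name] = "skipped_disabled"
            continue
        if name not in golden_args:
            report[name] = "missing_golden_args"
            continue
        try:
            handler(**golden_args[name])
            report[name] = "ok"
        except Exception as exc:  # any failure should block the traffic shift
            report[name] = f"failed: {type(exc).__name__}"
    return report
```

A deploy script could refuse to shift traffic unless every entry in the report is ok or skipped_disabled.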

Structured logging fields

Field	Why
request_id	Trace full agent session across rounds
round	Which iteration of the tool loop
tool_name	Aggregate error rates per tool
latency_ms	SLO dashboards; detect slow CRM
status	ok | validation_error | timeout | denied | unknown_tool
args_hash	Debug without logging raw PII args
provider	openai | anthropic | bedrock — adapter debugging

Registry + JSONL audit

import hashlib, json, time
from pathlib import Path

LOG = Path("logs/tool_audit.jsonl")

class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, dict] = {}

    def register(self, name: str, schema: dict, handler, *, kind: str, version: str = "v1"):
        self._tools[name] = {"schema": schema, "handler": handler, "kind": kind, "version": version}

    def invoke(self, request_id: str, round_num: int, name: str, args: dict) -> dict:
        meta = self._tools.get(name)
        start = time.perf_counter()
        if meta is None:
            # Unknown tools are rejected at dispatch but still audited.
            out = {"error": "unknown_tool", "retryable": False}
            status = "unknown_tool"
        else:
            try:
                # Validate args against meta["schema"] before this call in production.
                out = meta["handler"](**args)
                status = "ok"
            except TimeoutError:
                out = {"error": "timeout", "retryable": True}
                status = "timeout"
            except Exception as e:
                # Return only the exception type: messages can leak internals,
                # and unexpected failures should not be retried blindly.
                out = {"error": type(e).__name__, "retryable": False}
                status = "error"
        row = {
            "request_id": request_id,
            "round": round_num,
            "tool": name,
            "tool_version": meta["version"] if meta else None,
            "kind": meta["kind"] if meta else None,
            "status": status,
            "latency_ms": round((time.perf_counter() - start) * 1000, 2),
            # default=str keeps hashing from failing on non-JSON-native values.
            "args_hash": hashlib.sha256(json.dumps(args, sort_keys=True, default=str).encode()).hexdigest()[:16],
        }
        LOG.parent.mkdir(parents=True, exist_ok=True)
        with LOG.open("a", encoding="utf-8") as f:
            f.write(json.dumps(row) + "\n")
        return out

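public class ToolRegistry {
  private final Map<String, ToolDefinition> tools = new ConcurrentHashMap<>();
  private final ToolAuditLogger auditLog;

  public ToolRegistry(ToolAuditLogger auditLog) { this.auditLog = auditLog; }

  public void register(String name, ToolDefinition def) { tools.put(name, def); }

  // error(...) and hashArgs(...) are helpers assumed to exist elsewhere in this class;
  // the audit logger is assumed to accept the status as a String.
  public JsonNode invoke(String requestId, int round, String name, JsonNode args) {
    long start = System.nanoTime();
    ToolDefinition def = tools.get(name);
    if (def == null) {
      // Unknown tools are rejected at dispatch but still audited.
      auditLog.log(requestId, round, name, null, "unknown_tool", start, hashArgs(args));
      return error("unknown_tool");
    }
    String status = "ok";
    JsonNode out;
    try {
      out = def.handler().apply(args);
    } catch (RuntimeException e) {
      status = "error";
      out = error(e.getClass().getSimpleName());
    }
    auditLog.log(requestId, round, name, def.version(), status, start, hashArgs(args));
    return out;
  }
}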

Eval beyond final answer text

Golden traces should assert:

Given user query X, expect tool Y (not Z) on round 1
Args contain required fields with valid formats
Write tools trigger HITL when amount > threshold
After injected error result, model recovers without infinite retry
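
The first two assertions can be expressed as a small checker over recorded traces. GoldenCase and check_first_call are hypothetical names, and the name/arguments shape of a recorded call is an assumption about how traces are stored after provider normalization.

```python
from dataclasses import dataclass

@dataclass
class GoldenCase:
    query: str
    expected_tool: str
    required_args: tuple[str, ...]

def check_first_call(case: GoldenCase, first_call: dict) -> list[str]:
    """Compare the model's round-1 tool call to the golden case; return failures."""
    failures = []
    if first_call.get("name") != case.expected_tool:
        failures.append(f"expected tool {case.expected_tool}, got {first_call.get('name')}")
    args = first_call.get("arguments", {})
    for key in case.required_args:
        if key not in args:
            failures.append(f"missing required arg {key}")
    return failures
```

An empty list means the trace passes; a CI job can fail the deploy when any golden case returns failures.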

Production checklist

Allowlist tool names in router—reject unknown at dispatch
Max rounds (5) and max wall time (30s) enforced independently
Read/write classification drives parallel and HITL policy
Idempotency keys on all writes keyed to provider call ID
Provider-correct result injection in durable message store
Golden evals for tool selection + args; block deploy on regression
Alert on tool error rate > baseline; on zero-tool loops burning max rounds
Runbook: validation_error spike → schema deploy mismatch; timeout spike → dependency outage
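
Enforcing the round and wall-time limits independently might look like the sketch below. The step callable is a stand-in for one model call plus tool execution and is an assumption, as is the injectable clock used for testing. The deadline is only checked between rounds, so individual model and tool calls still need their own timeouts.

```python
import time
from typing import Callable, Optional

MAX_ROUNDS = 5
MAX_WALL_SECONDS = 30.0

def run_loop(step: Callable[[int], Optional[str]],
             clock: Callable[[], float] = time.monotonic) -> dict:
    """Run the tool loop until a final answer, the round cap, or the deadline.

    `step(round_num)` returns the final answer, or None if the model
    requested more tool calls.
    """
    deadline = clock() + MAX_WALL_SECONDS
    for round_num in range(1, MAX_ROUNDS + 1):
        if clock() >= deadline:  # wall-time budget, checked before each round
            return {"status": "wall_time_exceeded", "rounds": round_num - 1}
        answer = step(round_num)
        if answer is not None:
            return {"status": "ok", "rounds": round_num, "answer": answer}
    return {"status": "max_rounds_exceeded", "rounds": MAX_ROUNDS}
```

Returning distinct statuses for the two limits lets alerts separate slow dependencies from loops that never converge.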

Observability dashboard (starter panels)

Panel	Signal
Tool calls / minute by name	Traffic mix; detect runaway search_policy
p95 latency by tool	Downstream degradation
Wrong-tool rate (eval sample)	Schema/description drift
Approval pending queue depth	HITL staffing
Rounds to completion	Loop efficiency; prompt regressions
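
As an illustration of the latency panel, p95 per tool can be computed directly from audit rows. The sketch assumes rows shaped like the JSONL audit entries earlier in this section (tool and latency_ms keys) and uses the nearest-rank percentile definition; a metrics backend would normally do this aggregation instead.

```python
from collections import defaultdict

def p95_by_tool(rows: list[dict]) -> dict[str, float]:
    """Nearest-rank 95th percentile of latency_ms for each tool."""
    by_tool: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        by_tool[row["tool"]].append(row["latency_ms"])
    result: dict[str, float] = {}
    for tool, latencies in by_tool.items():
        latencies.sort()
        # 1-based nearest rank = ceil(0.95 * n), computed in integers.
        rank = (95 * len(latencies) + 99) // 100
        result[tool] = latencies[rank - 1]
    return result
```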

⚙️ Config

Example deploy manifest: tool_registry_version: 2026-06-07 · max_tool_rounds: 5 · parallel_read_max: 8 · hitl_tier_high: issue_refund,send_email

💰 Cost

Wrong-tool calls waste tokens and downstream quota. Track cost per successful task, not per LLM call—tool-heavy turns dominate CRM and payment API bills.
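
A small sketch of the metric, assuming each session record carries hypothetical llm_cost_usd, tool_cost_usd, and success fields. Failed sessions still add to the numerator, which is what makes this figure more honest than cost per LLM call.

```python
def cost_per_successful_task(sessions: list[dict]) -> float:
    """Total spend across all sessions divided by the number that succeeded."""
    total = sum(s["llm_cost_usd"] + s["tool_cost_usd"] for s in sessions)
    successes = sum(1 for s in sessions if s["success"])
    if successes == 0:
        raise ValueError("no successful sessions in sample")
    return total / successes
```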

📦 Real World

On-call: spike in validation_error after deploy → schema mismatch between registry and OpenAI tools array. Roll back registry version, not the model.

🎯 Interview Tip

Close with eval strategy: tool-selection accuracy on labeled traces beats BLEU on final answers for agent features. Mention hash-based arg logging for GDPR-friendly audit.