Tool calling & function calling
Agents do not guess database rows—they request typed RPC through the model API. Tool calling (OpenAI) and tool use (Anthropic) let the LLM emit structured function invocations; your application validates arguments, executes handlers, and injects results back into the conversation until a final answer emerges. This guide is Guide 2 in Track 4: the contract layer every agent loop depends on before memory, graphs, or MCP.
After reading, you should be able to: diagram the schema→call→execute→result→response loop; configure OpenAI tools and tool_choice; map Anthropic tool_use / tool_result blocks; design single-purpose tools with typed params; classify read vs write tools with confirmation gates; run parallel read-only calls; inject results per provider; pause for human approval on high-stakes writes; and ship a tool registry with structured logging.
Architecture: schema → call → execute → result → response
Tool calling (also function calling) lets an LLM request typed side effects instead of hallucinating facts. Your application owns execution; the model only proposes structured calls against JSON Schema you register. Every production agent loop follows the same five-step rhythm—whether OpenAI tool_calls, Anthropic tool_use, or Bedrock Converse toolUse.
Think of tools as an internal RPC surface exposed to the model. You declare names, descriptions, and parameter schemas; the model returns arguments matching those schemas; your router validates, authorizes, executes, and feeds results back as new messages. The loop continues until the model emits a final natural-language answer or you hit iteration/time limits. This is the bridge from Track 1 (APIs) and Track 3 (prompting) into Track 4 (agents).
The five-step loop
- Schema registration — attach tool definitions (name, description, JSON Schema parameters) to the API request.
- Model call — send messages; model may return text, tool call(s), or both.
- Execute — your code parses arguments, checks allowlists, runs the handler (DB query, HTTP, compute).
- Result injection — append provider-specific tool-result messages tied to each call ID.
- Response — call the model again with updated history until finish_reason is not tool-related or max rounds reached.
Sequence diagram (conceptual)
User ──► App ──► LLM (tools registered)
◄── assistant: tool_call lookup_order(order_id="ORD-8821")
App ──► execute lookup_order ──► {status, items, total}
App ──► LLM (messages + tool result)
◄── assistant: "Your order shipped yesterday…"
App ──► User
Responsibility split
| Layer | Owns | Must not |
|---|---|---|
| Model | Which tool(s) to call; argument values from context | Direct DB/network access; secrets |
| Tool router | Validation, authZ, timeouts, idempotency, audit | Trust model args without schema check |
| Handlers | Business logic; return JSON-serializable results | Return unbounded blobs or PII dumps |
| Agent runner | Iteration cap, parallel dispatch, HITL gates | Infinite loops on repeated tool errors |
Minimal loop (all providers, same shape)
from openai import OpenAI
import json

TOOLS = [{
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": "Fetch order status by order ID (ORD- prefix).",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
            "additionalProperties": False,
        },
    },
}]

def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped", "eta_days": 1}

HANDLERS = {"lookup_order": lookup_order}

def run_agent(user: str, max_rounds: int = 5) -> str:
    client = OpenAI()
    messages = [{"role": "user", "content": user}]
    for _ in range(max_rounds):
        resp = client.chat.completions.create(
            model="gpt-4o-mini", tools=TOOLS, tool_choice="auto", messages=messages,
        )
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content or ""
        messages.append(msg.model_dump(exclude_none=True))
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            out = HANDLERS[call.function.name](**args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(out),
            })
    raise RuntimeError("max tool rounds exceeded")
// Spring AI ChatClient: ToolCallback registry maps name → handler
public String runAgent(String user, int maxRounds) {
    List<Message> messages = new ArrayList<>(List.of(new UserMessage(user)));
    for (int i = 0; i < maxRounds; i++) {
        ChatResponse resp = chatClient.prompt(new Prompt(messages)).tools(orderTools).call();
        AssistantMessage assistant = resp.getResult().getOutput();
        if (!assistant.hasToolCalls()) return assistant.getText();
        messages.add(assistant);
        for (ToolCall call : assistant.getToolCalls()) {
            String output = toolRouter.invoke(call.getName(), call.getArguments());
            messages.add(new ToolResponseMessage(call.getId(), output));
        }
    }
    throw new IllegalStateException("max tool rounds exceeded");
}
import anthropic, json

client = anthropic.Anthropic()
tools = [{
    "name": "lookup_order",
    "description": "Fetch order status by order ID.",
    "input_schema": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
}]

# lookup_order is the same handler defined in the OpenAI example.
def run_agent(user: str, max_rounds: int = 5) -> str:
    messages = [{"role": "user", "content": user}]
    for _ in range(max_rounds):  # explicit iteration cap instead of while True
        resp = client.messages.create(
            model="claude-3-5-sonnet-20241022", max_tokens=1024,
            tools=tools, messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})
        tool_uses = [b for b in resp.content if b.type == "tool_use"]
        if not tool_uses:
            return "".join(b.text for b in resp.content if hasattr(b, "text"))
        results = []
        for tu in tool_uses:
            out = lookup_order(**tu.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": tu.id,
                "content": json.dumps(out),
            })
        messages.append({"role": "user", "content": results})
    raise RuntimeError("max tool rounds exceeded")
// AnthropicMessageChatModel + ToolCallback
// tool_use blocks in assistant; tool_result in following user message
import boto3, json

client = boto3.client("bedrock-runtime")
tool_config = {"tools": [{"toolSpec": {
    "name": "lookup_order",
    "description": "Fetch order status",
    "inputSchema": {"json": {
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    }},
}}]}

# lookup_order is the same handler defined in the OpenAI example.
def run_agent(user: str, max_rounds: int = 5) -> str:
    messages = [{"role": "user", "content": [{"text": user}]}]
    for _ in range(max_rounds):  # explicit iteration cap instead of while True
        resp = client.converse(
            modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
            messages=messages, toolConfig=tool_config,
        )
        out_msg = resp["output"]["message"]
        messages.append(out_msg)
        tool_uses = [b for b in out_msg["content"] if "toolUse" in b]
        if not tool_uses:
            return "".join(b.get("text", "") for b in out_msg["content"])
        results = []
        for tu in tool_uses:
            use = tu["toolUse"]
            out = lookup_order(**use["input"])
            results.append({"toolResult": {
                "toolUseId": use["toolUseId"],
                "content": [{"json": out}],
            }})
        messages.append({"role": "user", "content": results})
    raise RuntimeError("max tool rounds exceeded")
// BedrockRuntimeClient.converse — toolConfig + toolResult blocks
// Spring AI BedrockConverseChatModel wraps same API
The model never runs your Python—it emits a structured intent. Execution happens in your process (or sandbox). That separation is why authZ on the router matters more than prompt pleading.
A common mistake is storing edited summaries of tool results in message history. Replay the exact API payloads, especially the tool_call_id / tool_use_id pairing, or the next model turn will hallucinate continuity.
Log the full message array (redacted) per round. When a user reports a wrong refund, you need round 2 tool args—not only the final assistant text.
Whiteboard the five steps and mark where authZ lives. Interviewers want execution outside the model with explicit iteration caps.
OpenAI: tools array, tool_choice, parallel calls
OpenAI Chat Completions exposes tools as a top-level tools array of function definitions. The assistant message may include tool_calls with stable IDs; you reply with role: tool messages. Modern models can emit multiple parallel tool calls in one assistant turn; parallel_tool_calls defaults to true, and setting it to false limits the model to at most one call per turn.
OpenAI popularized the name "function calling"; the current API uses "tools" with type: function. The assistant message is stored with a tool_calls array—each entry has id, function.name, and function.arguments (JSON string). Your runner must append the full assistant message before tool results so the API can correlate IDs on the next turn.
tools array shape
Each entry is {"type": "function", "function": { name, description, parameters }}. Parameters must be JSON Schema. Set additionalProperties: false on object schemas to reduce spurious keys in arguments.
| Field | Purpose | Guidance |
|---|---|---|
| name | Stable identifier for router | snake_case; match handler registry key exactly |
| description | When to use this tool | Include negative scope: "Do not use for refunds" |
| parameters | Argument schema | Enums for categorical fields; regex patterns for IDs |
tool_choice modes
| Value | Behavior | Use when |
|---|---|---|
| "auto" (default) | Model may call zero or more tools | General agents; support bots |
| "none" | Disable tools for this turn | Final summarization; policy-only response |
| {"type":"function","function":{"name":"X"}} | Force specific tool | Structured extraction pipelines; eval harnesses |
| "required" (supported models) | Must call at least one tool | Router step that always delegates to backend |
Parallel tool_calls in one assistant message
When the user asks "What's the status of ORD-8821 and ORD-8822?", gpt-4o family often returns two tool_calls in a single assistant message. If both are read-only, execute concurrently, append two role: tool messages (each with matching tool_call_id), then call the model once.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    tools=TOOLS,
    tool_choice="auto",  # "none" | {"type":"function","function":{"name":"lookup_order"}}
    parallel_tool_calls=True,
)
msg = resp.choices[0].message
finish = resp.choices[0].finish_reason  # "tool_calls" | "stop"
if msg.tool_calls:
    messages.append(msg.model_dump(exclude_none=True))
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        out = HANDLERS[call.function.name](**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(out),
        })
ChatResponse resp = chatClient.prompt(new Prompt(messages))
    .options(ChatOptions.builder()
        .toolChoice("auto")
        .parallelToolCalls(true)
        .build())
    .tools(toolCallbacks)
    .call();
AssistantMessage assistant = resp.getResult().getOutput();
if (assistant.hasToolCalls()) {
    messages.add(assistant);
    for (ToolCall call : assistant.getToolCalls()) {
        String json = toolRegistry.invoke(call.getName(), call.getArguments());
        messages.add(new ToolResponseMessage(call.getId(), json));
    }
}
# Anthropic: tools + tool_choice={"type":"auto"|"any"|"tool",...}
# Parallel tool_use blocks in one assistant content array
// AnthropicChatOptions.toolChoice(ToolChoiceBuilder.AUTO)
# Bedrock Converse: toolConfig={"tools":[...]} — multiple toolUse per turn
// ConverseRequest.toolConfig(ToolConfiguration.builder()...)
finish_reason and empty tool_calls
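When finish_reason == "tool_calls", the assistant message carries at least one call. If generation hits the token limit partway through a call, finish_reason is "length" instead and function.arguments may be truncated, invalid JSON, so guard the json.loads step. When the model answers directly, tool_calls is null and finish_reason == "stop". Do not assume every turn needs tools; over-registering tools increases mistaken invocations.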
Forcing a tool via tool_choice improves reliability but removes the model's ability to ask clarifying questions in text. Use forced tools for batch ETL; use auto for conversational agents.
Feature flag per endpoint: tool_choice: auto in prod chat, none on summarization-only routes. Log effective choice and tool count per request.
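The per-endpoint flag described above could be sketched as a small lookup that builds the request kwargs and a log record together. The route names, the ROUTE_TOOL_CHOICE table, and build_tool_kwargs are hypothetical names for illustration, not part of any SDK; the only API assumption is that tool_choice should be omitted when no tools are sent.

```python
# Hypothetical per-route policy table; route names are assumptions for this sketch.
ROUTE_TOOL_CHOICE = {"chat": "auto", "summarize": "none"}

def build_tool_kwargs(route: str, tools: list) -> tuple:
    """Return (request kwargs, log record) for one model request."""
    # Unknown routes fall back to "none" so a new endpoint cannot call tools by accident.
    choice = ROUTE_TOOL_CHOICE.get(route, "none")
    log_record = {"route": route, "tool_choice": choice, "tool_count": len(tools)}
    if not tools:
        # With no tools registered, send neither tools nor tool_choice.
        log_record["tool_choice"] = None
        return {}, log_record
    return {"tools": tools, "tool_choice": choice}, log_record

kwargs, record = build_tool_kwargs("summarize", [{"type": "function"}])
print(record)
```

The returned kwargs can be splatted into the request call (for example `client.chat.completions.create(model=..., messages=messages, **kwargs)`), and the record goes to whatever structured logger you already use.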
Support agents register 8–12 tools max. Beyond ~15, wrong-tool rate rises—split into specialist sub-agents or dynamic tool retrieval (later guides).
Never pass user-controlled strings into tool names. Allowlist call.function.name against registry keys before dispatch—prompt injection loves inventing admin_override.
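A minimal sketch of that allowlist check, assuming a HANDLERS registry like the one in the minimal loop; safe_dispatch is a hypothetical helper name. Unknown names and malformed arguments come back as structured errors rather than exceptions, so the loop can still inject a result for the call ID.

```python
import json

def lookup_order(order_id: str) -> dict:
    # Stub handler standing in for a real order lookup.
    return {"order_id": order_id, "status": "shipped"}

HANDLERS = {"lookup_order": lookup_order}

def safe_dispatch(name: str, raw_arguments: str) -> dict:
    """Dispatch only to registered handlers; never resolve names dynamically."""
    handler = HANDLERS.get(name)
    if handler is None:
        return {"error": "unknown_tool", "retryable": False}
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return {"error": "invalid_arguments", "retryable": False}
    if not isinstance(args, dict):
        return {"error": "invalid_arguments", "retryable": False}
    try:
        return handler(**args)
    except TypeError:
        # Missing or unexpected keyword arguments for this handler.
        return {"error": "invalid_arguments", "retryable": False}

print(safe_dispatch("admin_override", "{}"))
```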
Anthropic: tool_use and tool_result blocks
Anthropic Messages API represents tools as tools with input_schema (JSON Schema). Assistant replies are content block arrays: interleaved text and tool_use blocks. Results return in a subsequent user message as tool_result blocks referencing tool_use_id.
Unlike OpenAI's dedicated role: tool message, Anthropic nests results inside a user turn. The assistant message before results must be preserved verbatim—including any leading text blocks where the model explains its plan. Bedrock Converse copies this block model for Claude models; adapters in multi-provider gateways must not flatten these into generic "tool" roles without translation.
Block flow
- Request includes tools=[{name, description, input_schema}] and optional tool_choice.
- Assistant content may include {"type":"tool_use","id":"toolu_…","name":"…","input":{…}}.
- You execute handlers and append a user message whose content is an array of tool_result blocks.
- Model continues with more text and/or additional tool_use blocks until stop.
tool_choice on Anthropic
| tool_choice | Effect | Typical use |
|---|---|---|
| {"type":"auto"} | Model decides whether to use tools | Default agent behavior |
| {"type":"any"} | Must invoke at least one tool | Forced delegation steps |
| {"type":"tool","name":"lookup_order"} | Must call named tool | Eval fixtures; structured extract |
Parallel tool_use blocks
One assistant turn may contain multiple tool_use blocks (e.g. two order lookups). Batch all corresponding tool_result entries in the next single user message—order of results need not match order of calls, but each tool_use_id must be unique and correctly paired.
# OpenAI: role "tool" + tool_call_id (not user+tool_result blocks)
// ToolResponseMessage per OpenAI tool_call_id
import anthropic, json

client = anthropic.Anthropic()
question = "Can I return an opened laptop after 45 days?"

# Define tools once so the follow-up request can reuse the same list.
tools = [{
    "name": "search_policy",
    "description": "Search internal policy KB. Query max 200 chars.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string", "maxLength": 200}},
        "required": ["query"],
    },
}]

resp = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    tools=tools,
    tool_choice={"type": "auto"},
    messages=[{"role": "user", "content": question}],
)
messages = [
    {"role": "user", "content": question},
    {"role": "assistant", "content": resp.content},
]
tool_results = []
for block in resp.content:
    if block.type == "tool_use":
        data = run_search_policy(**block.input)  # your KB search handler (not shown)
        tool_results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": json.dumps(data),
        })
if tool_results:  # only continue the loop when the model actually called a tool
    messages.append({"role": "user", "content": tool_results})
    follow_up = client.messages.create(
        model="claude-3-5-sonnet-20241022", max_tokens=1024, tools=tools, messages=messages,
    )
// Spring AI: AnthropicChatModel with ToolDefinition inputSchema
// Map ToolResponse to Anthropic tool_result content blocks in UserMessage
# Bedrock Converse mirrors Anthropic block shapes:
# assistant content: [{"toolUse": {"toolUseId", "name", "input"}}]
# user content: [{"toolResult": {"toolUseId", "content": [{"json": ...}]}}]
// BedrockConverseChatModel — same toolUse / toolResult semantics
is_error on tool_result
Anthropic supports "is_error": true on tool_result when execution failed. Use this for validation failures and timeouts so the model distinguishes user-fixable errors from successful empty results.
A common mistake is omitting the full assistant message (including text blocks) before tool_result. Anthropic expects the exact prior assistant content array preserved in history.
tool_use IDs are provider-generated (e.g. toolu_01…). Never reuse IDs across turns; each invocation gets a fresh identifier tied to one result block.
Interleaved text + tool_use in one turn still bills output tokens for both. Trim verbose pre-tool narration in the system prompt if latency-sensitive.
Contrast OpenAI role:tool with Anthropic user+tool_result. Doing so shows you can build multi-provider gateways without losing required semantics at the adapter layer.
Tool design: single responsibility, descriptions, typed params
Tools are UX for the model. Poor schemas cause wrong-tool selection, invalid args, and retry storms. Design each tool like a public API: one job, explicit description, strict types, predictable errors, idempotent writes.
The model chooses tools from names and descriptions alone—it cannot read your handler source code. A vague description guarantees misuse; an oversized god-tool guarantees wrong enum branches. Invest in schema quality before tuning temperature or swapping models.
Design principles
| Principle | Practice | Anti-pattern |
|---|---|---|
| Single responsibility | lookup_order not manage_order | God tool with action enum covering 12 operations |
| Descriptions | When to use + when NOT + example ID format | "Handles orders" (useless to the model) |
| Typed params | JSON Schema types, enums, min/max, patterns | Single payload: string blob |
| Error-safe | Return {"error":"…","retryable":true} | Stack traces or HTML error pages as tool content |
| Idempotent writes | Require idempotency_key on mutating tools | Double refund on duplicate tool_call retry |
Description template
Structure: one-line purpose → input constraints → output shape → negative scope.
Fetch order details by ID. Input: order_id must match ORD-[digits]. Returns status, line items, ship date. Do not use for refunds—use issue_refund.
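The four-part template can be kept consistent across tools by assembling descriptions from named parts. This is only a sketch: describe and LOOKUP_ORDER_TOOL are hypothetical names, and the schema repeats the ORD-[digits] constraint as a JSON Schema pattern so the description and the validator agree.

```python
import re

def describe(purpose: str, inputs: str, returns: str, not_for: str) -> str:
    """Join purpose, input constraints, output shape, and negative scope in a fixed order."""
    return " ".join([purpose, f"Input: {inputs}", f"Returns {returns}", not_for])

LOOKUP_ORDER_TOOL = {
    "type": "function",
    "function": {
        "name": "lookup_order",
        "description": describe(
            "Fetch order details by ID.",
            "order_id must match ORD-[digits].",
            "status, line items, ship date.",
            "Do not use for refunds; use issue_refund.",
        ),
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string", "pattern": "^ORD-\\d+$"}},
            "required": ["order_id"],
            "additionalProperties": False,
        },
    },
}

print(LOOKUP_ORDER_TOOL["function"]["description"])
```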
Server-side validation before execute
from pydantic import BaseModel, Field, ValidationError

class LookupOrderArgs(BaseModel):
    order_id: str = Field(pattern=r"^ORD-\d+$")

class IssueRefundArgs(BaseModel):
    order_id: str
    amount_usd: float = Field(gt=0, le=10_000)
    reason: str = Field(max_length=200)
    idempotency_key: str

def dispatch(name: str, raw: dict) -> dict:
    try:
        if name == "lookup_order":
            args = LookupOrderArgs.model_validate(raw)
            return lookup_order(args.order_id)
        if name == "issue_refund":
            args = IssueRefundArgs.model_validate(raw)
            return issue_refund(**args.model_dump())
    except ValidationError as e:
        return {"error": "invalid_arguments", "details": e.errors(), "retryable": False}
    return {"error": "unknown_tool", "retryable": False}
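// @JsonProperty maps the snake_case argument name the model emits.
public record LookupOrderArgs(
        @JsonProperty("order_id") @Pattern(regexp = "ORD-\\d+") String orderId) {}

public String dispatch(String name, String argsJson) {
    try {
        if ("lookup_order".equals(name)) {
            LookupOrderArgs args = objectMapper.readValue(argsJson, LookupOrderArgs.class);
            // Jackson does not enforce @Pattern, so run Bean Validation explicitly.
            // `validator` is assumed to be an injected jakarta.validation.Validator.
            Set<ConstraintViolation<LookupOrderArgs>> violations = validator.validate(args);
            if (!violations.isEmpty()) throw new ConstraintViolationException(violations);
            return objectMapper.writeValueAsString(lookupOrder(args.orderId()));
        }
    } catch (ValidationException | JsonProcessingException e) {
        return "{\"error\":\"invalid_arguments\",\"retryable\":false}";
    }
    return "{\"error\":\"unknown_tool\",\"retryable\":false}";
}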
Returning errors the model can recover from
- order_not_found — model asks user to verify ID
- rate_limited with retryable: true — runner may backoff and retry
- policy_denied — model explains constraint without retry loop
Return structured errors the model can read. A tool result {"error":"order_not_found"} beats an exception string the model misinterprets as success.
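One way to guarantee that shape is to wrap every handler call so exceptions never reach the model as raw strings. The sketch below assumes a hypothetical ToolError exception that handlers raise for expected failures; anything else collapses to a generic internal_error with no stack trace.

```python
class ToolError(Exception):
    """Expected, model-readable failure raised by a handler (hypothetical type)."""
    def __init__(self, code: str, *, retryable: bool = False):
        super().__init__(code)
        self.code = code
        self.retryable = retryable

def call_handler(handler, **kwargs) -> dict:
    """Run a handler and convert any failure into a structured tool result."""
    try:
        return handler(**kwargs)
    except ToolError as e:
        return {"error": e.code, "retryable": e.retryable}
    except Exception:
        # Unexpected failure: log details server-side, return nothing sensitive.
        return {"error": "internal_error", "retryable": False}

def lookup_order(order_id: str) -> dict:
    # Stub: only one order exists in this example.
    if order_id != "ORD-8821":
        raise ToolError("order_not_found")
    return {"order_id": order_id, "status": "shipped"}

print(call_handler(lookup_order, order_id="ORD-0000"))
```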
Granular tools (10+) vs coarse tools (3). Granular improves selection accuracy but inflates schema tokens in every request. Measure wrong-tool rate on a golden set.
Version tool descriptions in a registry (lookup_order@v3). A/B description changes like any API contract—with eval gates before full rollout.
Validate argument length and rate-limit per tool. search_policy with a 10KB query is a DoS vector against your vector DB.
Tool categories: read / write / compute + confirmation for writes
Classify every tool by side-effect profile. Read tools query state; write tools mutate external systems; compute tools run deterministic logic with no external IO. Apply different guardrails: parallelize reads; serialize writes; require human confirmation before high-stakes mutations.
Category matrix
| Category | Examples | Parallel? | HITL? | Idempotency |
|---|---|---|---|---|
| Read | lookup_order, search_policy, get_balance | Yes | No | Optional (cache keys) |
| Write | issue_refund, send_email, update_address | No (or queued) | Yes for high impact | Required |
| Compute | calculate_shipping, format_table, validate_iban | Yes | No | N/A (pure function) |
Confirmation pattern for writes
- Model calls write tool → router intercepts before execute.
- Present pending action to user or approver queue: tool name, args, human-readable diff.
- On approve: execute with idempotency key tied to tool_call_id or tool_use_id.
- On reject: inject tool result {"status":"denied_by_user"} so the model apologizes and offers alternatives.
- On timeout: inject {"status":"approval_expired"}; model asks user to restate intent.
Registry metadata drives policy
from enum import Enum

class ToolKind(str, Enum):
    READ = "read"
    WRITE = "write"
    COMPUTE = "compute"

TOOL_META = {
    "lookup_order": {"kind": ToolKind.READ, "timeout_sec": 5},
    "search_policy": {"kind": ToolKind.READ, "timeout_sec": 5},
    "issue_refund": {"kind": ToolKind.WRITE, "requires_approval": True, "risk_tier": "high"},
    "calculate_shipping": {"kind": ToolKind.COMPUTE, "timeout_sec": 2},
}
WRITE_TOOLS = {n for n, m in TOOL_META.items() if m["kind"] == ToolKind.WRITE}
READ_TOOLS = {n for n, m in TOOL_META.items() if m["kind"] in (ToolKind.READ, ToolKind.COMPUTE)}
public enum ToolKind { READ, WRITE, COMPUTE }

public record ToolMeta(ToolKind kind, boolean requiresApproval, int timeoutSec, String riskTier) {}

Map<String, ToolMeta> registry = Map.of(
    "lookup_order", new ToolMeta(ToolKind.READ, false, 5, "low"),
    "issue_refund", new ToolMeta(ToolKind.WRITE, true, 30, "high")
);
Gray-area tools
Some tools blur lines: create_support_ticket writes to Zendesk but feels "safe." Rule of thumb: if a customer or finance system observes the effect without another human step, treat as write. Log-only internal annotations may be medium tier—auto-execute with audit.
A common mistake is classifying send_email_to_customer as read because it "only reads a template." If the recipient receives mail, it is a write, so confirm before sending.
Set WRITE_TOOLS=issue_refund,send_email in the environment; the router refuses writes when HITL_MODE=required and no approval token is present on the request context.
Explain read/write split for parallel execution policy. Shows operational maturity beyond demo agent loops that treat every tool the same.
Write tools hit payment and email APIs—duplicate execution cost dwarfs LLM tokens. Idempotency keys keyed to provider call IDs are non-negotiable on refunds.
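A minimal sketch of that idempotency guard, keyed on the provider call ID. execute_once is a hypothetical helper and the store is an in-process dict guarded by a lock; a production system would more likely rely on a database unique constraint so the guarantee survives restarts and multiple workers.

```python
import threading
from typing import Callable

_lock = threading.Lock()
_outcomes = {}  # idempotency key -> stored outcome of the first execution

def execute_once(idempotency_key: str, action: Callable[[], dict]) -> dict:
    """Run action at most once per key; later calls return the stored outcome."""
    # Holding the lock while the action runs serializes writes, which is
    # acceptable for a sketch but too coarse for high-throughput systems.
    with _lock:
        if idempotency_key in _outcomes:
            return _outcomes[idempotency_key]
        outcome = action()
        _outcomes[idempotency_key] = outcome
        return outcome

refund_calls = []

def refund() -> dict:
    refund_calls.append(1)  # stands in for the payment API call
    return {"status": "refunded", "amount_usd": 149.99}

first = execute_once("call_abc", refund)
second = execute_once("call_abc", refund)  # duplicate retry of the same tool call
print(len(refund_calls), first == second)
```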
Parallel calling: N simultaneous tools
When the model returns N read-only tool calls in one turn, execute them concurrently to minimize wall-clock latency. Never parallelize writes unless your backend supports isolated transactions and you enforce per-resource locking.
OpenAI's parallel_tool_calls and Anthropic multi-tool_use turns are only wins if your executor cooperates. A sequential loop over three lookup_order calls triples latency; a thread pool with sane limits cuts it toward the slowest single dependency.
When to parallelize
- All calls classified as READ or COMPUTE per registry metadata
- Handlers are thread-safe or pure async without shared mutable state
- Downstream APIs have headroom (bulkhead / circuit breaker not open)
- Total fan-out ≤ configured max (commonly 8)
When to refuse parallel fan-out
- Any write tool in the batch—split reads parallel, writes sequential
- Tools hitting the same row (two updates to one order)—serialize or merge
- Global rate limit on a shared API key—use semaphore per dependency
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

def execute_tool_calls(request_id: str, round_num: int, tool_calls: list) -> list[tuple[str, str, str]]:
    """Returns (tool_call_id, name, output_json)."""
    read_calls = [c for c in tool_calls if c.function.name in READ_TOOLS]
    # Everything that is not a known read (writes and unrecognized names) runs
    # sequentially, so every tool_call_id still receives a result; dispatch
    # returns unknown_tool for names outside the registry.
    write_calls = [c for c in tool_calls if c.function.name not in READ_TOOLS]
    results: list[tuple[str, str, str]] = []

    def _one(call):
        name = call.function.name
        args = json.loads(call.function.arguments)
        out = dispatch(name, args)
        audit_log(request_id, round_num, name, args, out)
        return call.id, name, json.dumps(out)

    max_workers = min(8, len(read_calls) or 1)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(_one, c) for c in read_calls]
        for fut in as_completed(futures):
            results.append(fut.result())
    for call in write_calls:
        results.append(_one(call))  # sequential; HITL inside dispatch for writes
    return results
public List<ToolResult> executeToolCalls(String requestId, int round, List<ToolCall> calls) {
    List<ToolCall> reads = calls.stream().filter(c -> READ_TOOLS.contains(c.getName())).toList();
    List<ToolCall> writes = calls.stream().filter(c -> WRITE_TOOLS.contains(c.getName())).toList();
    List<ToolResult> results = reads.parallelStream()
        .map(c -> invokeAudited(requestId, round, c))
        .collect(Collectors.toCollection(ArrayList::new));
    writes.forEach(c -> results.add(invokeWithApproval(requestId, round, c)));
    return results;
}
Async Python variant
For async handlers (HTTP/2 clients), prefer asyncio.gather with a concurrency semaphore instead of threads—same policy gates on read vs write buckets apply.
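A sketch of that async variant, assuming the calls have already been filtered to the read bucket and that handlers are coroutines. run_reads and fake_lookup are hypothetical names; asyncio.gather returns results in the order the calls were submitted, while the semaphore bounds how many run at once.

```python
import asyncio

async def run_reads(calls, handler, max_concurrency: int = 8):
    """calls: list of (call_id, args dict). Returns [(call_id, result), ...] in input order."""
    sem = asyncio.Semaphore(max_concurrency)

    async def _one(call_id, args):
        async with sem:  # at most max_concurrency handlers in flight
            return call_id, await handler(**args)

    return await asyncio.gather(*(_one(cid, args) for cid, args in calls))

async def fake_lookup(order_id: str) -> dict:
    await asyncio.sleep(0.01)  # stands in for an HTTP round trip
    return {"order_id": order_id, "status": "shipped"}

calls = [(f"call_{i}", {"order_id": f"ORD-{i}"}) for i in range(3)]
print(asyncio.run(run_reads(calls, fake_lookup, max_concurrency=2)))
```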
Parallel reads cut user-visible latency but multiply load on fragile backends. Use semaphores per dependency (CRM, vector DB) and backpressure when error rate spikes.
OpenAI parallel_tool_calls is a generation-time feature; your executor must still respect downstream rate limits—the model does not know your CRM's 429 threshold.
Cap parallel fan-out at 8 even if the model requests 12 lookups—execute first 8, return partial with truncated: true in metadata, let model ask to continue.
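The cap could be sketched as a planning step that splits call IDs into a batch to run and a set of synthetic results for the rest. plan_fan_out and the fan_out_limit error code are hypothetical; the one firm constraint is that every call ID in the assistant turn still needs a result message, so deferred calls get a structured placeholder instead of being dropped.

```python
MAX_FAN_OUT = 8

def plan_fan_out(call_ids: list, limit: int = MAX_FAN_OUT) -> tuple:
    """Split call IDs into (run_now, deferred results keyed by call ID)."""
    run_now = call_ids[:limit]
    deferred = {
        cid: {"error": "fan_out_limit", "truncated": True, "retryable": True}
        for cid in call_ids[limit:]
    }
    return run_now, deferred

ids = [f"call_{i}" for i in range(12)]
run_now, deferred = plan_fan_out(ids)
print(len(run_now), len(deferred))
```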
as_completed returns results out of order—that is fine for message injection. Each tool_call_id pairing is independent of append order.
Result injection: message format per provider
Each provider requires a specific message shape when feeding tool outputs back into the conversation. Gateways that normalize history must preserve provider-native ID pairing—or translation bugs surface on turn 3+ when the model "forgets" which call failed.
Format comparison
| Provider | Assistant carries call | Result message | ID field |
|---|---|---|---|
| OpenAI | assistant.tool_calls[] | role: "tool" | tool_call_id |
| Anthropic | content[].type == "tool_use" | user.content[].tool_result | tool_use_id |
| Bedrock Converse | content[].toolUse | user.content[].toolResult | toolUseId |
Content encoding rules
- OpenAI: tool message content is a string; serialize structured handler output to JSON (json.dumps in Python, JSON.stringify in JavaScript).
- Anthropic: tool_result.content may be a string or content blocks; string JSON works for most handlers; set is_error on failures.
- Bedrock: prefer content: [{"json": {...}}] for structured success; use [{"text": "..."}] with status: "error" for failures.
Multi-provider injection helpers
def inject_openai(messages: list, call_id: str, result: dict) -> None:
    messages.append({
        "role": "tool",
        "tool_call_id": call_id,
        "content": json.dumps(result),
    })
# One tool message per tool_call_id — never combine
messages.add(new ToolResponseMessage(
    ToolResponse.builder(callId, toolName).responseData(json).build()
));
def inject_anthropic_batch(results: list[tuple[str, dict]]) -> dict:
    """Batch multiple tool_result blocks in ONE user message.

    results is a list of (tool_use_id, result) pairs.
    """
    blocks = []
    for call_id, result in results:
        block = {"type": "tool_result", "tool_use_id": call_id, "content": json.dumps(result)}
        if result.get("error"):
            block["is_error"] = True
        blocks.append(block)
    return {"role": "user", "content": blocks}
// UserMessage with List<AnthropicToolResultContent> batched per turn
def inject_bedrock(tool_use_id: str, result: dict) -> dict:
    if "error" in result:
        return {"toolResult": {
            "toolUseId": tool_use_id,
            "content": [{"text": json.dumps(result)}],
            "status": "error",
        }}
    return {"toolResult": {
        "toolUseId": tool_use_id,
        "content": [{"json": result}],
    }}
// ContentBlock.fromToolResult(ToolResultBlock.builder()...)
Canonical internal model (gateway pattern)
Store conversation as provider-neutral records: {turn, assistant_tool_calls[], tool_results[]}. Adapters translate to OpenAI or Anthropic shapes at API boundary only. Never persist provider-specific messages without also storing the neutral form—you will migrate models.
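As a rough illustration of the gateway pattern, the neutral records can be two small dataclasses with one adapter per provider. ToolCallRecord, ToolResultRecord, to_openai, and to_anthropic are hypothetical names; the output shapes follow the format comparison table above. Replaying a call ID minted by one provider against another is an assumption of this sketch and may need ID remapping in practice.

```python
import json
from dataclasses import dataclass

@dataclass
class ToolCallRecord:
    call_id: str
    name: str
    arguments: dict

@dataclass
class ToolResultRecord:
    call_id: str
    output: dict
    is_error: bool = False

def to_openai(calls, results) -> list:
    """Assistant message with tool_calls, then one role: tool message per result."""
    assistant = {"role": "assistant", "content": None, "tool_calls": [
        {"id": c.call_id, "type": "function",
         "function": {"name": c.name, "arguments": json.dumps(c.arguments)}}
        for c in calls
    ]}
    tool_msgs = [{"role": "tool", "tool_call_id": r.call_id, "content": json.dumps(r.output)}
                 for r in results]
    return [assistant, *tool_msgs]

def to_anthropic(calls, results) -> list:
    """Assistant tool_use blocks, then ONE user message batching all tool_result blocks."""
    assistant = {"role": "assistant", "content": [
        {"type": "tool_use", "id": c.call_id, "name": c.name, "input": c.arguments}
        for c in calls
    ]}
    blocks = []
    for r in results:
        block = {"type": "tool_result", "tool_use_id": r.call_id, "content": json.dumps(r.output)}
        if r.is_error:
            block["is_error"] = True
        blocks.append(block)
    return [assistant, {"role": "user", "content": blocks}]

calls = [ToolCallRecord("call_1", "lookup_order", {"order_id": "ORD-8821"})]
results = [ToolResultRecord("call_1", {"status": "shipped"})]
print(to_openai(calls, results)[1])
```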
A common failure is mismatched IDs: injecting a result for call_abc when the model emitted call_xyz. The model hallucinates success. Assert every ID exists in the immediately prior assistant turn.
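That assertion can be a few lines run just before the results are appended to history. check_result_ids is a hypothetical helper; it takes the call IDs from the prior assistant turn and the IDs on the results about to be injected, and raises on duplicates, missing results, or results for calls the model never made.

```python
def check_result_ids(call_ids: list, result_ids: list) -> None:
    """Raise ValueError unless results pair one-to-one with the prior turn's calls."""
    if len(set(result_ids)) != len(result_ids):
        raise ValueError("duplicate tool result ID")
    missing = set(call_ids) - set(result_ids)
    unexpected = set(result_ids) - set(call_ids)
    if missing or unexpected:
        raise ValueError(f"missing={sorted(missing)} unexpected={sorted(unexpected)}")

# Order does not matter; only the pairing does.
check_result_ids(["call_abc", "call_xyz"], ["call_xyz", "call_abc"])
print("ids ok")
```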
Scrub PII from tool results before re-prompt unless policy allows. Logs and model context share the same leakage surface—hash or truncate account numbers in results.
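A small sketch of result scrubbing, assuming account numbers appear as runs of 8 to 16 digits; that regex is a stand-in for whatever detection your policy requires, and scrub is a hypothetical helper. It walks nested dicts and lists and masks everything but the last four digits before the result is logged or sent back to the model.

```python
import re

# Assumption: account-like values are standalone runs of 8 to 16 digits.
ACCOUNT_RE = re.compile(r"\b\d{8,16}\b")

def _mask(match: re.Match) -> str:
    digits = match.group()
    return "*" * (len(digits) - 4) + digits[-4:]

def scrub(value):
    """Recursively mask account-like digit runs in strings, dicts, and lists."""
    if isinstance(value, str):
        return ACCOUNT_RE.sub(_mask, value)
    if isinstance(value, dict):
        return {k: scrub(v) for k, v in value.items()}
    if isinstance(value, list):
        return [scrub(v) for v in value]
    return value

print(scrub({"account": "1234567890123456", "note": "refund for ORD-8821"}))
```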
Ask how you'd build a provider-agnostic message store. Strong answer: canonical internal event log + thin adapters; never normalize away ID semantics at persistence layer.
Human-in-the-loop: pause for approval on high-stakes
Models misread policy edge cases. For refunds, account closures, wire transfers, and outbound email to customers, pause execution until a human approves, edits, or rejects the pending tool call. The agent loop stays intact—you inject a synthetic tool result reflecting the human decision.
Human-in-the-loop (HITL) is not "no agents"—it is gated side effects. The model may still plan, gather reads, and draft the refund rationale; only the mutating RPC waits on approval. Sync UIs show a pending card; async workflows enqueue to Slack/email and resume via webhook when decided.
Approval flow
- Detect write tool in router → create PendingAction (args, user, request_id, provider call ID).
- Return holding response to client or continue session with "awaiting approval" tool result if blocking is unacceptable.
- Approver UI shows diff: "Refund $149.99 to ORD-8821 — reason: duplicate charge".
- On approve → execute with idempotency key = provider call ID; inject success result; resume loop.
- On reject → inject {"status":"denied_by_user"}; model explains to end user.
Risk tiers
| Tier | Examples | Gate |
|---|---|---|
| Low | Read, compute | Auto-execute |
| Medium | Update nickname, internal note | Auto + audit log |
| High | Refund > $50, email external party | Human approval (single approver) |
| Critical | Wire transfer, delete account | Two-person rule + ticket ID |
import json
from dataclasses import dataclass
from enum import Enum

class ApprovalState(str, Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"

@dataclass
class PendingAction:
    id: str
    tool_call_id: str
    tool_name: str
    arguments: dict
    state: ApprovalState = ApprovalState.PENDING

def run_write_tool(name: str, args: dict, *, approval_id: str) -> str:
    pending = approval_store.get(approval_id)
    if pending.state == ApprovalState.PENDING:
        notify_approver(pending)
        return json.dumps({"status": "awaiting_approval", "approval_id": approval_id})
    if pending.state == ApprovalState.REJECTED:
        return json.dumps({"status": "denied_by_user"})
    # Approved: execute with the provider call ID as the idempotency key.
    return json.dumps(issue_refund(**args, idempotency_key=pending.tool_call_id))
public record PendingAction(String id, String toolCallId, String toolName, JsonNode args, ApprovalState state) {}

public String runWriteTool(String name, String argsJson, String approvalId) {
    PendingAction p = store.get(approvalId);
    if (p.state() == ApprovalState.PENDING) {
        notifier.send(p);
        return "{\"status\":\"awaiting_approval\"}";
    }
    if (p.state() == ApprovalState.REJECTED) {
        return "{\"status\":\"denied_by_user\"}";
    }
    return invokeRefund(argsJson, p.toolCallId());
}
Preventing self-approval via prompt injection
- Approvers authenticate on a separate channel (admin SSO), not the chat session
- Approval tokens are single-use, bound to approval_id + approver identity
- Chat user cannot pass approved=true in tool args—decision lives only in approval store
Use Slack interactive buttons with a 15-minute TTL. Expired pending actions inject approval_expired so the model asks the user to restate intent instead of looping forever on stale pending state.
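The TTL rule could be sketched as a small decision function the runner calls whenever it resumes a session with a pending action. next_step and SYNTHETIC_RESULTS are hypothetical names, and the states are plain strings mirroring the pending / approved / rejected values used earlier; only the 15-minute TTL and the two synthetic statuses come from the text.

```python
import time
from typing import Optional

APPROVAL_TTL_SEC = 15 * 60

# Synthetic tool results to inject for the two non-executing outcomes.
SYNTHETIC_RESULTS = {
    "deny": {"status": "denied_by_user"},
    "expire": {"status": "approval_expired"},
}

def next_step(state: str, created_at: float, now: Optional[float] = None) -> str:
    """Return 'execute', 'deny', 'expire', or 'wait' for a pending action."""
    now = time.time() if now is None else now
    if state == "approved":
        return "execute"
    if state == "rejected":
        return "deny"
    if now - created_at >= APPROVAL_TTL_SEC:
        return "expire"  # stale pending action: stop waiting
    return "wait"

step = next_step("pending", created_at=0.0, now=16 * 60)
print(step, SYNTHETIC_RESULTS.get(step))
```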
Sync HITL blocks the HTTP request—poor chat UX. Prefer async: persist message history, notify approver, resume agent via webhook + stored checkpoint.
Separate approver identity from chat user. Prompt injection must not forge approval headers or pre-signed URLs.
Include the provider call ID in approval deep links. Replay attacks reuse old approvals, so the execute step must be idempotent: a duplicate key returns the stored outcome of the first run rather than re-running the write.
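The duplicate-key behavior can be sketched as a small wrapper. The `execute_once` helper and the `_outcomes` dict are hypothetical; a production version would presumably use a durable store with an atomic insert, since this in-memory version is not safe under concurrent calls.

```python
from typing import Callable

# In-memory stand-in for a durable outcome store keyed by idempotency key.
_outcomes: dict = {}


def execute_once(idempotency_key: str, action: Callable[[], dict]) -> dict:
    """Run action at most once per key; replays get the stored outcome back."""
    if idempotency_key in _outcomes:
        return _outcomes[idempotency_key]  # duplicate: same outcome, no re-run
    outcome = action()
    _outcomes[idempotency_key] = outcome
    return outcome
```

Keying on the provider call ID means a replayed approval link resolves to the original result instead of issuing a second refund.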
Production: tool registry, logging, and checklist
Shipping tool calling means more than a demo loop: centralized registry, schema versioning, structured audit logs, per-tool SLOs, and eval suites that assert correct tool selection—not just final answer quality.
Tool registry responsibilities
- Single source of truth: name → handler, schema, kind (read/write/compute), owning team
- Version schemas; deprecate with dual-register period and migration notes
- Feature flags per tool (issue_refund_enabled: false in staging)
- Deploy smoke test: invoke each tool with golden args before traffic shift
- Document SLAs: p99 latency budget per downstream dependency
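The feature-flag and smoke-test items above might combine into a pre-deploy check along these lines. This is a sketch only: `smoke_test`, its arguments, and the report strings are hypothetical, and it assumes handlers are pointed at staging dependencies so that write tools cause no real side effects.

```python
from typing import Callable


def smoke_test(tools: dict, golden_args: dict, disabled: set) -> dict:
    """Invoke each enabled tool with its golden args and report a status per tool.

    tools: name -> handler callable; golden_args: name -> kwargs dict;
    disabled: names turned off by feature flag.
    """
    report = {}
    for name, handler in tools.items():
        if name in disabled:
            report[name] = "skipped_disabled"
        elif name not in golden_args:
            report[name] = "missing_golden_args"
        else:
            try:
                handler(**golden_args[name])
                report[name] = "ok"
            except Exception as exc:  # any failure blocks the traffic shift
                report[name] = f"failed: {type(exc).__name__}"
    return report
```

A deploy gate could then refuse to shift traffic unless every entry is `ok` or `skipped_disabled`.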
Structured logging fields
| Field | Why |
|---|---|
| request_id | Trace full agent session across rounds |
| round | Which iteration of the tool loop |
| tool_name | Aggregate error rates per tool |
| latency_ms | SLO dashboards; detect slow CRM |
| status | One of: ok, validation_error, timeout, denied, unknown_tool |
| args_hash | Debug without logging raw PII args |
| provider | openai, anthropic, or bedrock; used for adapter debugging |
Registry + JSONL audit
import hashlib
import json
import time
from pathlib import Path

LOG = Path("logs/tool_audit.jsonl")


class ToolRegistry:
    def __init__(self):
        self._tools: dict[str, dict] = {}

    def register(self, name: str, schema: dict, handler, *, kind: str, version: str = "v1"):
        self._tools[name] = {"schema": schema, "handler": handler, "kind": kind, "version": version}

    def invoke(self, request_id: str, round_num: int, name: str, args: dict) -> dict:
        if name not in self._tools:
            return {"error": "unknown_tool", "retryable": False}
        meta = self._tools[name]
        start = time.perf_counter()
        try:
            out = meta["handler"](**args)
            status = "ok"
        except TimeoutError:
            out = {"error": "timeout", "retryable": True}
            status = "timeout"
        except Exception as e:
            out = {"error": str(e), "retryable": True}
            status = "error"
        # Hash the arguments instead of logging them, so raw PII stays out of the audit file.
        # default=str keeps hashing from failing on values JSON cannot serialize.
        args_json = json.dumps(args, sort_keys=True, default=str)
        row = {
            "request_id": request_id,
            "round": round_num,
            "tool_name": name,  # matches the tool_name field in the logging table
            "tool_version": meta["version"],
            "kind": meta["kind"],
            "status": status,
            "latency_ms": round((time.perf_counter() - start) * 1000, 2),
            "args_hash": hashlib.sha256(args_json.encode()).hexdigest()[:16],
        }
        # Append one JSON object per line (JSONL).
        LOG.parent.mkdir(parents=True, exist_ok=True)
        with LOG.open("a", encoding="utf-8") as f:
            f.write(json.dumps(row) + "\n")
        return out
public class ToolRegistry {
    private final Map<String, ToolDefinition> tools = new ConcurrentHashMap<>();
    private final ToolAuditLogger auditLog;

    public ToolRegistry(ToolAuditLogger auditLog) {
        this.auditLog = auditLog;
    }

    public void register(String name, ToolDefinition def) { tools.put(name, def); }

    public JsonNode invoke(String requestId, int round, String name, JsonNode args) {
        long start = System.nanoTime();
        ToolDefinition def = tools.get(name);
        // error(...) and hashArgs(...) are helpers defined elsewhere in this class.
        if (def == null) return error("unknown_tool");
        JsonNode out = def.handler().apply(args);
        auditLog.log(requestId, round, name, def.version(), "ok", start, hashArgs(args));
        return out;
    }
}
Eval beyond final answer text
Golden traces should assert:
- Given user query X, expect tool Y (not Z) on round 1
- Args contain required fields with valid formats
- Write tools trigger HITL when amount > threshold
- After injected error result, model recovers without infinite retry
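The first two assertions could be checked with a small helper like the one below. The trace shape (`expect_tool`, `required_args`, and calls as `{"name", "args"}` dicts) is an assumption made for illustration; real traces would follow whatever format your eval harness records.

```python
def check_trace(golden: dict, actual_calls: list) -> list:
    """Compare the round-1 tool call against a golden expectation.

    Returns a list of failure messages; an empty list means the trace passes.
    """
    if not actual_calls:
        return ["no tool call on round 1"]
    failures = []
    first = actual_calls[0]
    # Tool selection: the model must pick the expected tool, not a neighbor.
    if first["name"] != golden["expect_tool"]:
        failures.append(f"expected {golden['expect_tool']}, got {first['name']}")
    # Arguments: every required field must be present.
    missing = [k for k in golden["required_args"] if k not in first["args"]]
    if missing:
        failures.append(f"missing args: {missing}")
    return failures
```

Format validation, HITL triggering, and error recovery would need additional checks on top of this.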
Production checklist
- Allowlist tool names in router—reject unknown at dispatch
- Max rounds (5) and max wall time (30s) enforced independently
- Read/write classification drives parallel and HITL policy
- Idempotency keys on all writes keyed to provider call ID
- Provider-correct result injection in durable message store
- Golden evals for tool selection + args; block deploy on regression
- Alert when tool error rate exceeds baseline, and when loops burn max rounds without calling any tool
- Runbook: validation_error spike → schema deploy mismatch; timeout spike → dependency outage
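Enforcing the round limit and the wall-time limit independently might look like the sketch below. `run_loop`, the `step` callback, and the injectable `clock` are hypothetical names; the limits mirror the checklist values. Note that this version checks the clock only between rounds, so a single slow step still needs its own timeout.

```python
import time
from typing import Callable, Optional

MAX_ROUNDS = 5            # from the checklist
MAX_WALL_SECONDS = 30.0   # from the checklist


def run_loop(step: Callable[[int], Optional[str]],
             clock: Callable[[], float] = time.monotonic) -> dict:
    """Drive the tool loop; step(round) returns a final answer, or None to continue."""
    start = clock()
    for round_num in range(1, MAX_ROUNDS + 1):
        # Wall-time guard: checked separately from the round counter.
        if clock() - start > MAX_WALL_SECONDS:
            return {"status": "wall_time_exceeded", "rounds": round_num - 1}
        answer = step(round_num)
        if answer is not None:
            return {"status": "ok", "answer": answer, "rounds": round_num}
    # Round guard: the loop ran out of iterations without a final answer.
    return {"status": "max_rounds_exceeded", "rounds": MAX_ROUNDS}
```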
Observability dashboard (starter panels)
| Panel | Signal |
|---|---|
| Tool calls / minute by name | Traffic mix; detect runaway search_policy |
| p95 latency by tool | Downstream degradation |
| Wrong-tool rate (eval sample) | Schema/description drift |
| Approval pending queue depth | HITL staffing |
| Rounds to completion | Loop efficiency; prompt regressions |
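As a starting point for the latency panel, p95 per tool can be computed directly from audit rows. This sketch assumes rows carry the `tool_name` and `latency_ms` fields from the logging table and uses the nearest-rank percentile method; `p95_latency_by_tool` is an illustrative helper, and a real dashboard would more likely compute this in the metrics backend.

```python
from collections import defaultdict


def p95_latency_by_tool(rows: list) -> dict:
    """Nearest-rank p95 of latency_ms, grouped by tool_name."""
    by_tool = defaultdict(list)
    for row in rows:
        by_tool[row["tool_name"]].append(row["latency_ms"])
    result = {}
    for name, values in by_tool.items():
        values.sort()
        # Nearest rank: ceil(0.95 * n), done in integer arithmetic, then 0-based.
        rank = (95 * len(values) + 99) // 100
        result[name] = values[rank - 1]
    return result
```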
Example deploy manifest: tool_registry_version: 2026-06-07 · max_tool_rounds: 5 · parallel_read_max: 8 · hitl_tier_high: issue_refund,send_email
Wrong-tool calls waste tokens and downstream quota. Track cost per successful task, not per LLM call—tool-heavy turns dominate CRM and payment API bills.
On-call: spike in validation_error after deploy → schema mismatch between registry and OpenAI tools array. Roll back registry version, not the model.
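A drift check between the registry and the tools array sent to the model could catch this before on-call does. The sketch assumes the registry maps tool names to JSON Schema dicts and that the request uses the Chat Completions style `{"type": "function", "function": {"name", "parameters"}}` entries; `schema_fingerprint` and `find_schema_drift` are hypothetical helpers.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Stable short hash of a JSON Schema (key order does not matter)."""
    canonical = json.dumps(schema, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


def find_schema_drift(registry: dict, request_tools: list) -> list:
    """List tools whose registered schema differs from what is sent to the model."""
    sent = {t["function"]["name"]: t["function"]["parameters"] for t in request_tools}
    drift = []
    for name in sorted(set(registry) | set(sent)):
        if name not in sent:
            drift.append(f"{name}: missing from tools array")
        elif name not in registry:
            drift.append(f"{name}: not in registry")
        elif schema_fingerprint(registry[name]) != schema_fingerprint(sent[name]):
            drift.append(f"{name}: schema differs")
    return drift
```

Running this in the deploy smoke test gives a concrete list of mismatched tools to roll back.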
Eval strategy: for agent features, tool-selection accuracy on labeled traces is a more useful signal than BLEU on final answers. Hashing arguments before logging keeps the audit trail debuggable without storing raw PII, which makes GDPR compliance easier.