Definition
LLM tool call debugging is the process of diagnosing failures that occur when an AI agent invokes external tools — APIs, databases, code execution, or MCP servers — during multi-step workflows. Unlike traditional API debugging where inputs are deterministic, tool call failures in AI systems involve a probabilistic layer: the LLM constructs the tool call parameters based on its interpretation of the task, available tool schemas, and conversation context. Failures can originate from incorrect parameter construction, schema mismatches, missing context, or the LLM selecting the wrong tool entirely.
Architecture
A systematic debugging approach for LLM tool calls follows a five-step process that narrows the failure from "the tool call failed" to a specific, fixable root cause.
Step 1: Capture the Full Call Chain
Before you can debug a tool call, you need the complete context:
```typescript
interface ToolCallDebugContext {
  // What the LLM received
  systemPrompt: string;
  toolDefinitions: ToolDefinition[];
  conversationHistory: Message[];

  // What the LLM produced
  selectedTool: string;
  constructedArgs: Record<string, unknown>;
  llmReasoning?: string;

  // What happened
  toolResponse: unknown;
  errorType?: string;
  errorMessage?: string;
}
```
Most debugging failures happen because one of these pieces is missing. If you do not have the tool definitions that were presented to the LLM, you cannot determine whether the schema was ambiguous. If you do not have the conversation history, you cannot understand why the LLM constructed specific parameters.
Step 2: Classify the Failure
Tool call failures fall into four categories:
Wrong tool selection. The LLM chose a tool that is not appropriate for the task. This typically happens when tool descriptions are ambiguous or when the task description overlaps with multiple tools. A query about "recent activity" might invoke a search_web tool when query_database was intended.
Incorrect parameters. The LLM chose the correct tool but constructed parameters that do not match the expected format or contain incorrect values. A date field formatted as "March 5, 2026" when the API expects "2026-03-05". A search query that is too broad or too specific for the data source.
Schema mismatch. The tool definition (JSON Schema) does not accurately describe what the tool actually accepts. The schema says a field is optional, but the tool fails without it. The schema defines a string type, but the tool expects a specific enum value.
Tool-side failure. The tool received valid input but failed due to an external issue — API rate limits, network timeouts, authentication expiration, or data not found. These are the easiest to diagnose because the tool returns an error message.
Step 3: Isolate the LLM Layer
The most common debugging mistake is assuming the tool is broken when the LLM is constructing bad parameters. To isolate the LLM layer:
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Replay the exact prompt with tool definitions
# and inspect what the LLM produces
debug_response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=captured_system_prompt,
    tools=captured_tool_definitions,
    messages=captured_conversation_history,
)

# Compare: did the LLM produce the same tool call?
# If yes: the issue is deterministic (schema or tool-side)
# If no: the issue is probabilistic (prompt or context)
```
If the replayed call produces the same incorrect parameters, the problem is in the tool schema or the system prompt: it is deterministic and fixable. If the replayed call produces different parameters, the problem is context-dependent and requires examining the specific conversation state that triggered the failure.
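The comparison itself can be done programmatically. A sketch, assuming the captured and replayed calls have been normalized into dicts with `tool` and `args` keys (the shape is illustrative, not a specific SDK's):

```python
def compare_tool_calls(captured: dict, replayed: dict) -> dict:
    """Diff two tool calls. An empty result means the call was reproduced
    exactly (a deterministic cause); any difference points to prompt or
    context sensitivity."""
    diffs = {}
    if captured["tool"] != replayed["tool"]:
        diffs["tool"] = (captured["tool"], replayed["tool"])
    # Compare the union of argument keys so additions and omissions
    # both show up in the diff.
    all_keys = set(captured["args"]) | set(replayed["args"])
    for key in sorted(all_keys):
        a, b = captured["args"].get(key), replayed["args"].get(key)
        if a != b:
            diffs[f"args.{key}"] = (a, b)
    return diffs
```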
Step 4: Fix the Root Cause
Each failure category has a different fix:
Wrong tool selection → Improve tool descriptions. Make the description field specific about when the tool should and should not be used. Add negative examples: "Do NOT use this tool for historical data lookups."
Incorrect parameters → Improve the JSON Schema. Add description fields to every parameter. Use enum types for constrained values. Add examples in the parameter description. Tighten types — use integer instead of number when decimals are not valid.
Schema mismatch → Align the schema with reality. If the tool requires a field, mark it required. If the tool only accepts specific values, use enum. Test the schema against the actual tool implementation.
Tool-side failure → Add error handling in the tool. Return structured errors that the LLM can interpret and retry with modified parameters. Include retry guidance in the error response.
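As an illustration of the parameter fixes above, here is a loose schema next to a tightened version of the same tool input. The field names are hypothetical; the point is the pattern of descriptions, enums, examples, and tightened types:

```python
# A loose schema that invites bad parameters: no descriptions, no enums,
# "number" where only whole values are valid, nothing required.
loose_schema = {
    "type": "object",
    "properties": {
        "date": {"type": "string"},
        "limit": {"type": "number"},
        "sort": {"type": "string"},
    },
}

# The tightened version: a description on every parameter, an example in
# the description, enum for constrained values, integer where decimals
# are not valid, and required fields marked explicitly.
tight_schema = {
    "type": "object",
    "properties": {
        "date": {
            "type": "string",
            "description": "ISO 8601 date, e.g. '2026-03-05'. "
                           "Do NOT use natural-language dates.",
        },
        "limit": {
            "type": "integer",
            "description": "Maximum number of results to return (1-100).",
        },
        "sort": {
            "type": "string",
            "enum": ["newest", "oldest", "relevance"],
            "description": "Sort order for results.",
        },
    },
    "required": ["date"],
}
```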
Step 5: Prevent Recurrence
After fixing the immediate issue, add prevention:
```typescript
// Validate tool call parameters before execution
function validateToolCall(
  toolName: string,
  args: Record<string, unknown>,
  schema: ToolDefinition
): ValidationResult {
  // JSON Schema validation
  const schemaErrors = validateJsonSchema(args, schema.inputSchema);

  // Business logic validation
  const logicErrors = validateBusinessRules(toolName, args);

  return {
    valid: schemaErrors.length === 0 && logicErrors.length === 0,
    errors: [...schemaErrors, ...logicErrors],
  };
}
```
Validation between the LLM output and tool execution catches parameter issues before they reach the tool. Log validation failures as trace events so you can track which tools and parameters cause the most issues.
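For a Python stack, a minimal version of the same guard can be written with the standard library alone. This sketch checks only required fields, primitive types, and enums; a real implementation would use a full JSON Schema validator library:

```python
# Map JSON Schema type names to Python runtime types.
TYPE_MAP = {"string": str, "integer": int, "number": (int, float),
            "boolean": bool, "array": list, "object": dict}


def validate_tool_call(args: dict, input_schema: dict) -> list[str]:
    """Check required fields, primitive types, and enums before the tool
    runs. Returns a list of error strings; empty means the call may proceed."""
    errors = []
    props = input_schema.get("properties", {})
    for name in input_schema.get("required", []):
        if name not in args:
            errors.append(f"missing required field: {name}")
    for name, value in args.items():
        if name not in props:
            errors.append(f"unexpected field: {name}")
            continue
        expected = TYPE_MAP.get(props[name].get("type"))
        if expected and not isinstance(value, expected):
            errors.append(f"{name}: expected {props[name]['type']}")
        if "enum" in props[name] and value not in props[name]["enum"]:
            errors.append(f"{name}: must be one of {props[name]['enum']}")
    return errors
```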
Example
A real debugging session from a production agent system:
Symptom: A research agent was returning "No results found" for company lookups that should have returned data.
Step 1 — Capture context. Pulled the trace for a failing request. The agent was calling a search_companies tool with the company name as the query parameter.
Step 2 — Classify. The tool selection was correct. The parameters looked reasonable. But the tool was returning empty results.
Step 3 — Isolate. Replayed the tool call directly (bypassing the LLM). The tool returned results when called with {"query": "Acme Corp"} but the LLM was sending {"query": "Acme Corp AI observability platform"} — appending context from the user's request to the company name.
Root cause: The tool's query parameter description said "Search query for finding companies." The LLM interpreted this as a free-text search and added relevant context. The tool actually expected an exact or near-exact company name match.
Fix: Changed the parameter description to "Company name to search for. Use only the company name, not a description or context." Additionally added examples: ["Acme Corp", "OpenAI", "Stripe"] to the schema.
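The before-and-after parameter definition might look like this (a sketch reconstructing the fix described above):

```python
# Before: an ambiguous description that invited free-text queries.
before = {"query": {
    "type": "string",
    "description": "Search query for finding companies.",
}}

# After: the description states exactly what the parameter should
# contain, and examples anchor the expected format.
after = {"query": {
    "type": "string",
    "description": "Company name to search for. Use only the company "
                   "name, not a description or context.",
    "examples": ["Acme Corp", "OpenAI", "Stripe"],
}}
```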
Prevention: Added a trace event that logs the raw query sent to the tool alongside the user's original request, making the parameter construction visible without replaying the full trace.
Failure Modes
Silent Failures
The most dangerous tool call failures are silent ones — the tool returns a result, but the result is wrong. A search tool that returns results for the wrong entity. A database query that returns stale data because the timestamp parameter was constructed incorrectly. The agent incorporates the wrong data confidently.
Prevention: add result validation where possible. If a tool searches for "Acme Corp" and returns results for "Acme Corporation Ltd," flag the mismatch. If a database query returns zero rows when the entity is known to exist, treat it as an error rather than an empty result.
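Entity-name mismatches like the "Acme Corp" / "Acme Corporation Ltd" case can be flagged with a simple similarity check. A sketch using Python's stdlib `difflib`; the 0.8 threshold is an arbitrary illustration to be tuned per data source:

```python
from difflib import SequenceMatcher


def flag_entity_mismatch(requested: str, returned: str,
                         threshold: float = 0.8) -> bool:
    """Return True when the returned entity name looks too different from
    the requested one, so the result is flagged for review rather than
    silently incorporated by the agent."""
    ratio = SequenceMatcher(None, requested.lower(), returned.lower()).ratio()
    return ratio < threshold
```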
Cascading Parameter Errors
In multi-step workflows, a tool call error in step 1 can cascade. The agent uses incorrect results from step 1 as input to step 2, which fails or produces a different incorrect result. By step 3, the original error is invisible.
Prevention: validate intermediate results at each step. If a subagent synthesizes results from multiple tool calls, the synthesis step should flag inconsistencies — results that contradict each other or that do not match the original query.
Schema Drift
Tool implementations change over time — new required fields, renamed parameters, deprecated endpoints. If the tool definition (JSON Schema) is not updated to match, the LLM constructs calls against a stale schema. These failures are intermittent because the LLM sometimes constructs valid calls by coincidence.
Prevention: generate tool definitions from the tool implementation rather than maintaining them separately. If the tool has an OpenAPI spec or MCP server, derive the schema from the source of truth.
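One way to tie the schema to the implementation is to derive it from the tool function's signature. A simplified sketch using Python's `inspect` module; real systems would derive from the OpenAPI spec or MCP server, and `search_companies` here is a hypothetical tool:

```python
import inspect

# Map Python annotations to JSON Schema type names.
PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}


def schema_from_function(fn) -> dict:
    """Derive a JSON-Schema-style tool definition from a function
    signature, so the schema cannot drift away from the implementation."""
    sig = inspect.signature(fn)
    props, required = {}, []
    for name, param in sig.parameters.items():
        props[name] = {"type": PY_TO_JSON.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "input_schema": {"type": "object",
                         "properties": props,
                         "required": required},
    }


def search_companies(query: str, limit: int = 10) -> list:
    """Look up companies by exact or near-exact name."""
    ...
```

Regenerating the definition at startup (or in CI) means a renamed parameter or newly required field shows up immediately instead of as an intermittent production failure.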
Temperature-Dependent Failures
Some tool call failures only occur at higher temperature settings. At temperature 0, the LLM consistently constructs valid parameters. At temperature 0.7, it occasionally constructs creative but invalid parameters — adding extra fields, using synonym values, or interpreting ambiguous schemas differently.
Prevention: use temperature 0 for tool-calling steps. Creative variation is useful for response generation but counterproductive for structured tool invocation.
Summary
What is LLM tool call debugging?
LLM tool call debugging is the systematic process of diagnosing why an AI agent's tool invocations fail. Failures originate from four categories: wrong tool selection, incorrect parameter construction, schema mismatches between the tool definition and implementation, and tool-side errors. Effective debugging requires capturing the full context — system prompt, tool definitions, conversation history, and the LLM's constructed parameters — and systematically isolating whether the failure is in the LLM layer, the schema layer, or the tool layer.
Key Idea
Most tool call failures are schema and description problems, not model problems. When an LLM constructs bad tool parameters, the first place to look is the tool definition — ambiguous descriptions, missing examples, loose types, and stale schemas cause more production failures than model limitations. Fix the schema before blaming the model.