Distributed tracing for multi-agent AI systems — following a request from user input through orchestration, tool calls, and response synthesis.
Definition
AI agent tracing is the application of distributed tracing principles to multi-agent AI systems. A trace follows a single user request through every stage of processing: orchestrator routing, subagent dispatch, skill execution, tool invocation, LLM calls, and response synthesis. Each stage produces a span with timing, inputs, outputs, and metadata. The complete trace provides a timeline of exactly what happened for any given request.
Significance
In a multi-agent system, a single user query can trigger dozens of LLM calls, tool invocations, and inter-agent messages across multiple Lambda functions or services. Without tracing, understanding the path a request took — and where it went wrong — requires correlating logs across services manually. Tracing makes the execution path explicit and queryable.
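Making the path explicit across service boundaries requires propagating trace context with each inter-agent message. A minimal sketch, assuming a JSON message envelope (the `inject_context`/`extract_context` names and envelope shape are illustrative, not from any particular queue SDK):

```python
import json

def inject_context(payload: dict, trace_id: str, parent_span_id: str) -> str:
    """Wrap an outgoing inter-agent message with the sender's trace context."""
    envelope = {
        "trace": {"trace_id": trace_id, "parent_span_id": parent_span_id},
        "payload": payload,
    }
    return json.dumps(envelope)

def extract_context(message_body: str) -> tuple[dict, dict]:
    """On the receiving agent, recover the context so new spans join the same trace."""
    envelope = json.loads(message_body)
    return envelope["trace"], envelope["payload"]
```

Because the receiver reuses the sender's `trace_id` and records the sender's span as its parent, spans emitted by separate Lambda functions or services stitch into one trace without any manual log correlation.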
Architecture
Trace: user-query-abc123
│
├── Span: orchestrator
│   ├── llm_call: route (model: claude-3, tokens: 450)
│   ├── dispatch: subagent-research (queue: sqs)
│   ├── dispatch: subagent-analysis (queue: sqs)
│   │
│   ├── Span: subagent-research
│   │   ├── Span: skill-web-search
│   │   │   ├── tool: search_api (latency: 230ms)
│   │   │   └── tool: scrape_url (latency: 890ms)
│   │   └── llm_call: synthesize (tokens: 1200)
│   │
│   ├── Span: subagent-analysis
│   │   ├── Span: skill-data-analysis
│   │   │   └── tool: query_db (latency: 45ms)
│   │   └── llm_call: synthesize (tokens: 800)
│   │
│   └── llm_call: final_synthesis (tokens: 600)
Each span records: tier, operation type, duration, token usage, model, and any errors.
Examples
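A tree like the one above can be produced by an in-process tracer that opens spans as context managers, inferring parent/child nesting from a stack. This is a minimal sketch under stated assumptions: the `Tracer` class and its field names are illustrative, not a real tracing library's API.

```python
import time
import uuid
from contextlib import contextmanager

class Tracer:
    """Minimal in-process tracer: nesting of `span(...)` blocks becomes the trace tree."""
    def __init__(self, trace_id: str):
        self.trace_id = trace_id
        self.finished = []   # completed spans, appended in finish order
        self._stack = []     # span_ids of currently open spans

    @contextmanager
    def span(self, name: str, **attributes):
        span = {"trace_id": self.trace_id, "span_id": uuid.uuid4().hex,
                "parent_id": self._stack[-1] if self._stack else None,
                "name": name, "attributes": attributes,
                "start": time.time(), "error": None}
        self._stack.append(span["span_id"])
        try:
            yield span
        except Exception as exc:
            span["error"] = repr(exc)  # record the error on the span, then re-raise
            raise
        finally:
            span["end"] = time.time()
            self._stack.pop()
            self.finished.append(span)

# Reproduce a fragment of the architecture diagram above.
tracer = Tracer(trace_id="user-query-abc123")
with tracer.span("orchestrator"):
    with tracer.span("subagent-research"):
        with tracer.span("tool: search_api", latency_ms=230):
            pass
```

After the blocks exit, `tracer.finished` holds three spans whose `parent_id` links encode the orchestrator → subagent → tool hierarchy, ready to export to a tracing backend.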
Failure Modes
Related
- Monitoring, tracing, and understanding AI agent behavior in production — from token usage to decision quality.
- Systematic approaches to diagnosing and resolving failures in AI systems — from hallucinations to tool call failures.
- Coordinating multi-step AI workflows — from single-agent task execution to multi-agent fan-out with parallel tool calls.
- Engineering practices for deploying and operating AI systems in production — beyond prototypes and demos.