AI Agent Tracing

Distributed tracing for multi-agent AI systems — following a request from user input through orchestration, tool calls, and response synthesis.

Definition

What Is AI Agent Tracing?

AI agent tracing is the application of distributed tracing principles to multi-agent AI systems. A trace follows a single user request through every stage of processing: orchestrator routing, subagent dispatch, skill execution, tool invocation, LLM calls, and response synthesis. Each stage produces a span with timing, inputs, outputs, and metadata. The complete trace provides a timeline of exactly what happened for any given request.
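The span record described above can be sketched as a simple data structure. This is a stdlib-only illustration; the field names are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One stage of processing within a trace."""
    trace_id: str             # shared by every span in the same request
    span_id: str
    parent_id: Optional[str]  # None for the root span
    operation: str            # e.g. "orchestrator", "tool:search_api"
    start_ms: float
    end_ms: float
    attributes: dict = field(default_factory=dict)  # tokens, model, errors, ...

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

root = Span("user-query-abc123", "s1", None, "orchestrator", 0.0, 1250.0,
            {"model": "claude-3", "tokens": 450})
```

Every span in a trace carries the same `trace_id`; the `parent_id` links encode the tree that the timeline is rebuilt from.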

Why It Matters

In a multi-agent system, a single user query can trigger dozens of LLM calls, tool invocations, and inter-agent messages across multiple Lambda functions or services. Without tracing, understanding the path a request took — and where it went wrong — requires correlating logs across services manually. Tracing makes the execution path explicit and queryable.

How It Works

AI agent traces extend OpenTelemetry-style distributed tracing with AI-specific attributes:
Trace: user-query-abc123
│
├── Span: orchestrator
│   ├── llm_call: route (model: claude-3, tokens: 450)
│   ├── dispatch: subagent-research (queue: sqs)
│   ├── dispatch: subagent-analysis (queue: sqs)
│   │
│   ├── Span: subagent-research
│   │   ├── Span: skill-web-search
│   │   │   ├── tool: search_api (latency: 230ms)
│   │   │   └── tool: scrape_url (latency: 890ms)
│   │   └── llm_call: synthesize (tokens: 1200)
│   │
│   ├── Span: subagent-analysis
│   │   ├── Span: skill-data-analysis
│   │   │   └── tool: query_db (latency: 45ms)
│   │   └── llm_call: synthesize (tokens: 800)
│   │
│   └── llm_call: final_synthesis (tokens: 600)

Each span records: tier, operation type, duration, token usage, model, and any errors.
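A trace like the one above is typically stored as a flat list of spans linked by parent IDs; reconstructing the tree for display is a simple walk. A stdlib-only sketch, not tied to any particular tracing backend:

```python
from collections import defaultdict

def render_trace(spans):
    """Rebuild the span tree from (span_id, parent_id, label) records
    and return an indented timeline, root first."""
    children = defaultdict(list)
    for span_id, parent_id, label in spans:
        children[parent_id].append((span_id, label))

    lines = []
    def walk(parent_id, depth):
        for span_id, label in children[parent_id]:
            lines.append("  " * depth + label)
            walk(span_id, depth + 1)
    walk(None, 0)
    return "\n".join(lines)

spans = [
    ("s1", None, "orchestrator"),
    ("s2", "s1", "subagent-research"),
    ("s3", "s2", "skill-web-search"),
    ("s4", "s1", "subagent-analysis"),
]
print(render_trace(spans))  # prints the indented tree, root first
```

Real backends store timing and attributes alongside each record, but the tree reconstruction works the same way.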

Real-World Examples

  • Tracing a slow agent response to a tool call that timed out after 30 seconds, causing the entire subagent to stall
  • Identifying that 80% of token cost for a research agent came from a single skill that was retrieving excessive context
  • Debugging a missing response by tracing the countdown latch and finding that one subagent completed but its result was not recorded
  • Measuring end-to-end latency breakdown to discover that LLM routing decisions took longer than the actual tool executions
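Queries like the token-cost example above reduce to aggregating span attributes. A sketch with made-up span data, assuming each span carries an `operation` name and a `tokens` count:

```python
def token_cost_by_operation(spans):
    """Sum token usage per operation and report each one's share of the total."""
    totals = {}
    for span in spans:
        op = span["operation"]
        totals[op] = totals.get(op, 0) + span.get("tokens", 0)
    grand_total = sum(totals.values())
    return {op: (tokens, tokens / grand_total) for op, tokens in totals.items()}

spans = [
    {"operation": "route", "tokens": 450},
    {"operation": "skill-web-search", "tokens": 9200},
    {"operation": "synthesize", "tokens": 2000},
]
shares = token_cost_by_operation(spans)
# skill-web-search dominates: roughly 79% of total tokens
```

Running this aggregation per trace, then across traces, is how a single skill gets attributed the bulk of an agent's token spend.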

Common Failure Modes

  • Trace ID loss across async boundaries — when messages cross SQS queues, the trace_id must be propagated in the message envelope or the trace breaks
  • Sampling too aggressively — sampling 1% of traces in a low-volume system means most incidents have no trace data
  • Storage growth — full traces with LLM inputs and outputs consume significant storage; retention policies must balance debuggability with cost
  • Clock skew — spans from different Lambda instances may have skewed timestamps, making the timeline misleading
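The first failure mode is usually fixed by carrying trace context in the message envelope itself: the producer injects it before enqueueing, and the consumer extracts it so child spans attach to the original trace instead of starting a new one. A sketch of that inject/extract pattern; the envelope field names are illustrative, not an SQS or OpenTelemetry API:

```python
import json

def wrap_message(payload, trace_id, parent_span_id):
    """Inject trace context into the envelope before enqueueing."""
    return json.dumps({
        "trace": {"trace_id": trace_id, "parent_span_id": parent_span_id},
        "payload": payload,
    })

def unwrap_message(body):
    """Extract trace context on the consumer side; new spans should use
    this trace_id and parent the extracted span rather than start fresh."""
    envelope = json.loads(body)
    return envelope["trace"], envelope["payload"]

body = wrap_message({"task": "research"}, "user-query-abc123", "s1")
ctx, payload = unwrap_message(body)
```

The same pattern applies to any async boundary (queues, event buses, scheduled invocations): if the context is not in the message, the trace breaks there.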