Definition
AI agent tracing is distributed tracing applied to multi-agent systems where a single user request triggers multiple LLM calls, tool invocations, and inter-agent messages. The trace captures the complete execution path — from the initial request through orchestration, agent dispatch, skill execution, and response synthesis — with AI-specific signals at each step: token usage, model selection, prompt construction, and decision reasoning.
Traditional distributed tracing (OpenTelemetry, Jaeger, Zipkin) tracks HTTP requests across microservices. AI agent tracing extends this with signals that are unique to LLM-powered systems: which tools the model considered versus selected, how many tokens each reasoning step consumed, whether retrieved context was relevant, and why the orchestrator routed to specific agents.
Architecture
A practical AI agent tracing architecture has three components: trace propagation, span recording, and a query layer for debugging.
Trace Propagation
Every user request gets a trace_id that flows through every message boundary. In an event-driven system using SQS, the trace context rides in the message envelope:
```typescript
interface AgentMessage {
  type: "new_task" | "completion";
  traceId: string;
  parentSpanId: string;
  payload: unknown;
}
```

When an orchestrator dispatches a subagent via SQS, the message includes the traceId and the orchestrator's spanId as the parent. The subagent creates its own span linked to the parent. This chain continues through every tier — subagent to skill to tool — preserving the causal relationship between all operations.
The critical detail is async boundaries. In a synchronous HTTP-based system, trace context propagates via headers. In an async system with queues, the context must be serialized into the message body. If a single message boundary drops the trace context, the trace breaks and the downstream operations become orphaned spans with no connection to the originating request.
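The producer and consumer sides of one queue hop can be sketched as below. The AgentMessage shape matches the envelope above; buildEnvelope and childContext are illustrative names, and the actual SQS client call is omitted:

```typescript
interface TraceContext {
  traceId: string;
  spanId: string;
}

interface AgentMessage {
  type: "new_task" | "completion";
  traceId: string;
  parentSpanId: string;
  payload: unknown;
}

// Producer side: serialize the current span's context into the
// message body, since no HTTP headers exist across a queue boundary.
function buildEnvelope(
  type: AgentMessage["type"],
  payload: unknown,
  ctx: TraceContext
): AgentMessage {
  return { type, traceId: ctx.traceId, parentSpanId: ctx.spanId, payload };
}

// Consumer side: mint a new spanId for the child span and link it
// to the sender via parentSpanId, keeping the causal chain intact.
function childContext(
  msg: AgentMessage,
  newSpanId: string
): TraceContext & { parentSpanId: string } {
  return {
    traceId: msg.traceId,
    spanId: newSpanId,
    parentSpanId: msg.parentSpanId,
  };
}
```

The key property is that the traceId passes through unchanged while each hop adds one parent-child link.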
What to Capture
Each span in an AI agent trace should record:
```typescript
interface AgentSpan {
  traceId: string;
  spanId: string;
  parentSpanId: string | null;
  tier: "orchestrator" | "subagent" | "skill" | "tool";
  operation: string;
  startTime: number;
  duration: number;
  // AI-specific fields
  llmCalls: {
    model: string;
    provider: string;
    inputTokens: number;
    outputTokens: number;
    latency: number;
    toolsAvailable: string[];
    toolsSelected: string[];
  }[];
  toolCalls: {
    name: string;
    latency: number;
    success: boolean;
    errorType?: string;
  }[];
  metadata: Record<string, unknown>;
}
```

The AI-specific fields are what separate agent tracing from generic distributed tracing. Knowing that a span took 3 seconds is useful. Knowing that 2.5 seconds of that was an LLM call that consumed 1,200 tokens and selected 2 of 5 available tools — that is actionable.
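As a sketch of why these fields pay off, a trace-level rollup can be computed directly from recorded spans. The types below are a pared-down subset of the AgentSpan shape, and traceTotals is a hypothetical helper:

```typescript
// Pared-down subset of the AgentSpan shape for this sketch.
interface LlmCall {
  model: string;
  inputTokens: number;
  outputTokens: number;
  latency: number;
}

interface AgentSpan {
  spanId: string;
  duration: number;
  llmCalls: LlmCall[];
}

// Roll up AI-specific signals across every span in a trace:
// total tokens consumed, total time spent inside model calls,
// and the number of LLM invocations.
function traceTotals(spans: AgentSpan[]) {
  let tokens = 0;
  let llmLatency = 0;
  let calls = 0;
  for (const span of spans) {
    for (const call of span.llmCalls) {
      tokens += call.inputTokens + call.outputTokens;
      llmLatency += call.latency;
      calls += 1;
    }
  }
  return { tokens, llmLatency, calls };
}
```

With generic tracing you could only sum durations; the AI-specific fields make token and model-latency attribution possible at all.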
Storage and Querying
Trace events should be append-only. Never update or delete a trace event — the immutability is what makes traces trustworthy for debugging and auditing.
For storage, the choice depends on query patterns. If you primarily query by trace_id (show me everything that happened for this request), a time-series database or even a simple append-only table works:
```sql
CREATE TABLE trace_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    trace_id UUID NOT NULL,
    span_id UUID NOT NULL,
    parent_span_id UUID,
    tier TEXT NOT NULL,
    operation TEXT NOT NULL,
    started_at TIMESTAMPTZ NOT NULL,
    duration_ms INTEGER NOT NULL,
    payload JSONB NOT NULL,
    created_at TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX idx_trace_events_trace ON trace_events(trace_id);
CREATE INDEX idx_trace_events_time ON trace_events(created_at);
```

The payload JSONB column holds the AI-specific fields — LLM calls, tool invocations, token counts. This keeps the schema stable while allowing the trace content to evolve.
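Once rows for one trace_id have been fetched, reconstructing the span tree is a small amount of application code. A sketch in TypeScript, where field names follow the schema above and buildSpanTree is a hypothetical helper (timestamps assumed already parsed to epoch milliseconds):

```typescript
interface TraceEventRow {
  span_id: string;
  parent_span_id: string | null;
  operation: string;
  started_at: number; // epoch ms, parsed from TIMESTAMPTZ
}

interface SpanNode extends TraceEventRow {
  children: SpanNode[];
}

// Rebuild the parent-child span tree from flat trace_events rows.
// Spans whose parent is missing surface as extra roots, which is
// exactly how a broken trace (orphaned spans) becomes visible.
function buildSpanTree(rows: TraceEventRow[]): SpanNode[] {
  const nodes = new Map<string, SpanNode>();
  for (const row of rows) nodes.set(row.span_id, { ...row, children: [] });

  const roots: SpanNode[] = [];
  for (const node of nodes.values()) {
    const parent = node.parent_span_id ? nodes.get(node.parent_span_id) : undefined;
    if (parent) parent.children.push(node);
    else roots.push(node);
  }

  // Order siblings by start time for display.
  const sortRec = (ns: SpanNode[]) => {
    ns.sort((a, b) => a.started_at - b.started_at);
    ns.forEach((n) => sortRec(n.children));
  };
  sortRec(roots);
  return roots;
}
```

A healthy trace yields exactly one root; more than one root is a signal that context was dropped at some boundary.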
Example
Consider a user query: "Research the competitive landscape for AI observability tools."
The trace for this request might look like:
```
Trace: abc-123
│
├─ Span: orchestrator (450ms)
│  ├─ llm_call: route (model: claude-sonnet, tokens: 380)
│  │    → decided: dispatch research-subagent
│  ├─ dispatch: research-subagent via SQS
│  │
│  ├─ Span: research-subagent (8200ms)
│  │  ├─ Turn 1:
│  │  │  ├─ llm_call: plan (tokens: 520)
│  │  │  │    → decided: invoke web-search, market-analysis
│  │  │  ├─ Span: skill-web-search (3100ms)
│  │  │  │  ├─ tool: search_api (230ms, success)
│  │  │  │  ├─ tool: scrape_url x3 (890ms avg, 2 success, 1 timeout)
│  │  │  │  └─ llm_call: synthesize (tokens: 1100)
│  │  │  └─ Span: skill-market-analysis (2800ms)
│  │  │     ├─ tool: query_db (45ms, success)
│  │  │     └─ llm_call: synthesize (tokens: 950)
│  │  └─ Turn 2:
│  │     ├─ llm_call: evaluate (tokens: 600)
│  │     │    → decided: sufficient, synthesize final result
│  │     └─ llm_call: synthesize (tokens: 1400)
│  │
│  └─ llm_call: final_synthesis (tokens: 800)
│
Total: 8650ms, 5750 tokens, 7 LLM calls, 5 tool calls
```
From this trace, you can immediately see:
- The orchestrator routing decision took 450ms and correctly dispatched a single subagent
- The subagent ran two turns — the first dispatched two skills in parallel, the second decided results were sufficient
- One scrape_url tool call timed out, but the skill continued with partial results
- Token usage was dominated by synthesis calls; the web-search skill's synthesis step alone consumed 1,100 tokens
- End-to-end latency was 8.65 seconds, with the subagent accounting for 95% of it
Failure Modes
Trace Context Loss
The most common tracing failure is losing the trace_id at an async boundary. When a Lambda function sends a message to SQS without including the trace context, the downstream span has no parent. The trace appears incomplete — you see the orchestrator dispatch but not what happened next.
Prevention: validate trace context presence in the message envelope schema. If a message is missing traceId, reject it at the handler level rather than processing it without tracing.
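One way to enforce this is a guard at the top of the message handler that rejects envelopes without trace context before any work happens. A sketch, where validateTraceContext is an illustrative name:

```typescript
interface TracedEnvelope {
  traceId: string;
  parentSpanId: string;
}

// Reject the message outright rather than processing it untraced.
// With SQS, throwing here sends the message back to the queue (and
// eventually a DLQ) instead of silently creating an orphaned span.
function validateTraceContext(raw: unknown): TracedEnvelope {
  const msg = raw as Partial<TracedEnvelope> | null;
  if (typeof msg?.traceId !== "string" || msg.traceId.length === 0) {
    throw new Error("envelope missing traceId: rejecting message");
  }
  if (typeof msg.parentSpanId !== "string" || msg.parentSpanId.length === 0) {
    throw new Error("envelope missing parentSpanId: rejecting message");
  }
  return { traceId: msg.traceId, parentSpanId: msg.parentSpanId };
}
```

Failing loudly at the boundary turns a silent observability gap into an alertable error.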
Over-Collection
Recording every token of every LLM prompt and response creates storage costs that scale linearly with token usage. A system processing 10,000 agent sessions per month with full prompt/response capture can generate terabytes of trace data.
Prevention: capture metadata (token counts, model, latency) for every call, but capture full prompts and responses only for sampled traces or flagged sessions. Use a sampling rate that ensures enough traces for debugging (10-25% is typical) without storing everything.
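One way to implement the sampling decision, sketched below, is to hash the trace_id rather than roll a random number per span, so every span in a trace shares the same capture decision. fnv1a and shouldCaptureFull are illustrative names:

```typescript
// FNV-1a hash: cheap, deterministic, adequate spread for sampling.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Decide once per trace whether to store full prompts/responses.
// Metadata (tokens, model, latency) is always captured regardless.
function shouldCaptureFull(traceId: string, sampleRate: number): boolean {
  // Map the 32-bit hash into [0, 1) and compare against the rate.
  return fnv1a(traceId) / 0x100000000 < sampleRate;
}
```

Hashing the trace_id avoids the trap of per-span random sampling, where a trace ends up with full payloads for some spans and metadata-only for others.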
Missing Baselines
Traces are most useful when compared against baselines. If you do not know that the average token usage for a research query is 4,000 tokens, you cannot detect when a query consumes 40,000 tokens. Without baselines, anomalies are invisible.
Prevention: compute rolling averages for key metrics (tokens per request, latency per tier, tool success rates) and alert on significant deviations. Even simple z-score alerting catches most anomalies.
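The z-score check fits in a few lines. A sketch, where history is a rolling window of recent values for the metric and zScore / isAnomalous are illustrative names:

```typescript
// Standard z-score: how many standard deviations a value sits
// from the mean of the recent history.
function zScore(value: number, history: number[]): number {
  const mean = history.reduce((a, b) => a + b, 0) / history.length;
  const variance =
    history.reduce((a, b) => a + (b - mean) ** 2, 0) / history.length;
  const std = Math.sqrt(variance);
  return std === 0 ? 0 : (value - mean) / std;
}

// Alert when a metric deviates by more than `threshold` standard
// deviations, e.g. tokens per request or latency per tier.
function isAnomalous(value: number, history: number[], threshold = 3): boolean {
  return Math.abs(zScore(value, history)) > threshold;
}
```

With a baseline of roughly 4,000 tokens per research query, a 40,000-token run scores far beyond any reasonable threshold and fires immediately.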
Clock Skew
In a serverless system where each span runs on a different Lambda instance, clock differences between instances can cause spans to appear out of order in the timeline. A child span that appears to start before its parent span creates confusion during debugging.
Prevention: use monotonic timestamps within a single Lambda invocation and rely on causal ordering (parent-child relationships) rather than wall clock times for span sequencing.
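Causal ordering can be derived purely from the parent-child links, with timestamps used only to break ties among siblings. A sketch, where causalOrder is an illustrative name:

```typescript
interface Span {
  spanId: string;
  parentSpanId: string | null;
  startTime: number; // wall clock; may be skewed across instances
}

// Order spans so a parent always precedes its children, even when
// clock skew makes a child's startTime earlier than its parent's.
function causalOrder(spans: Span[]): Span[] {
  const byParent = new Map<string | null, Span[]>();
  for (const s of spans) {
    const list = byParent.get(s.parentSpanId) ?? [];
    list.push(s);
    byParent.set(s.parentSpanId, list);
  }

  const out: Span[] = [];
  const visit = (parent: string | null) => {
    // Wall clock only decides sibling order, never parent/child order.
    const children = (byParent.get(parent) ?? []).sort(
      (a, b) => a.startTime - b.startTime
    );
    for (const c of children) {
      out.push(c);
      visit(c.spanId);
    }
  };
  visit(null);
  return out;
}
```

Because the traversal follows parentSpanId links, a skewed child clock can never reorder a span ahead of the dispatch that caused it.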
Summary
What is AI agent tracing?
AI agent tracing is distributed tracing for multi-agent AI systems. It follows a user request through every stage of processing — orchestration, agent dispatch, skill execution, tool invocation, and response synthesis — capturing AI-specific signals like token usage, model selection, and decision reasoning at each step. The trace provides a complete, queryable timeline of exactly what happened for any given request.
Key Idea
Effective AI agent tracing requires three things: propagating trace context across every async boundary without exception, capturing AI-specific signals (not just timing and status) at each span, and storing traces immutably for both debugging and auditing. The investment in tracing infrastructure pays for itself on the first production incident where you need to understand why an agent produced a specific result.