Building a Serverless AI Agent Platform on AWS

Mike Coon · 16 min read

Claude Code, Cursor, and similar AI coding agents run on your local machine. They have access to your filesystem, your shell, and your context. That model works for developer tooling. It does not work when you need agents that run autonomously, serve multiple tenants, respond to events from Slack or Telegram or email, and scale to zero when idle.

We built a serverless agent platform on AWS that replicates the hierarchical agent architecture — orchestrators that plan, subagents that coordinate multi-step workflows, skills that execute domain logic, and tools that call APIs — entirely on Lambda with SQS as the coordination layer. No idle compute. No long-running servers. Pay per invocation.

This post covers the architecture in enough detail to build your own.

Why Serverless Agents

Think of this as a serverless equivalent to frameworks like OpenClaw, Claude Code, or any hierarchical agent system — orchestrators dispatching subagents dispatching skills dispatching tools. The difference is the execution model.

Traditional agent platforms run on long-lived servers. A server sits idle between user messages, burning compute while it waits. If your agent handles 50 messages a day, the server is idle for 99% of its uptime. You are paying for that idle time.

In a serverless architecture, idle time costs nothing. Each Lambda runs only when there is a message to process. Between messages — whether that gap is 5 seconds or 5 hours — no compute is running, no cost is accruing. The system scales to zero automatically.

The second advantage is isolation. In a server-based multi-tenant system, conversations from different tenants share the same process, the same memory space, and often the same cached state. Preventing cross-tenant data leakage requires careful discipline across every code path. In a serverless model, each conversation turn executes in its own Lambda invocation with its own memory space. When the invocation completes, the execution context is gone. There is no shared mutable state between tenants by construction, not by convention. Isolation is a property of the infrastructure, not something you have to enforce in application code.

These two properties — zero idle cost and structural tenant isolation — are why serverless is not just a deployment choice but an architectural advantage for agent platforms.

The Four-Tier Hierarchy

The core design decision is separating intelligence into four tiers with distinct responsibilities:

Orchestrator → [1..N] Subagents → [1..N] Skills → [1..N] Tools

Tier 1: Orchestrator

The orchestrator is the entry point. It receives a user message from any channel, loads conversation history, and asks the LLM a single question: given this user's request and these available subagents, which should I dispatch?

The orchestrator does not know how subagents work internally. It knows their names, descriptions, and input schemas — the same way an LLM sees tool definitions. The LLM decides which subagents to invoke (one or many), and the orchestrator fans them out in parallel.

When all subagents complete, the orchestrator calls the LLM a second time: here are the results from each subagent, synthesize a response for the user. Two LLM calls per request — one to plan, one to synthesize. The orchestrator stays generic.

Tier 2: Subagents

Subagents are mini-orchestrators that own a domain of work and can run multiple turns of skills before returning a result. A subagent receives a task from the orchestrator, but unlike a single-shot skill, it can reason across multiple rounds — dispatching skills, reviewing their results, deciding whether more information is needed, and dispatching again.

Subagent Turn 1:
    → Invoke: web-research skill, market-analysis skill
    ← Receive results
Subagent Turn 2:
    → LLM decides: "I need deeper analysis on one finding"
    → Invoke: technical-signals skill
    ← Receive results
Subagent Turn 3:
    → LLM decides: "I have enough. Here's my synthesis."
    ← Return final result to orchestrator

This multi-turn capability is what separates subagents from skills. A skill runs one round of tool calls and returns. A subagent can iterate — running skills, evaluating results, and deciding whether to continue or wrap up. The subagent maintains its own conversation state across turns, persisted per turn in the database.

Subagents are the coordination layer for complex workflows. "Research this company" is not a single skill — it is a subagent that might invoke lead discovery, website analysis, technical signal detection, and competitive analysis skills across multiple turns, synthesizing findings as it goes.

Tier 3: Skills

Skills are where domain expertise lives. A skill has a definition (a detailed prompt explaining what it does and how to approach the task), a set of tools it can access, and its own LLM interaction. When invoked by a subagent, a skill:

  1. Loads its definition and available tools
  2. Asks the LLM: given this task and these tools, which tools should I call?
  3. Fans out tool invocations in parallel
  4. Collects results
  5. Asks the LLM to synthesize the findings into a structured result
  6. Returns the result to the parent subagent
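
The six steps above can be sketched as a single function, with the LLM and the tool transport abstracted behind function types. This is a simplified, in-process sketch: in the real system step 3 dispatches each tool call as an SQS message rather than a goroutine, and the names here are illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

// ToolCall is one tool invocation planned by the LLM.
type ToolCall struct {
	Name  string
	Input string
}

// PlanFn asks the LLM which tools to call; SynthesizeFn turns tool
// results into a structured result. Both stand in for real LLM calls.
type PlanFn func(task string, tools []string) []ToolCall
type SynthesizeFn func(task string, results []string) string

// RunSkill executes the skill loop: plan, fan out, collect, synthesize.
func RunSkill(task string, tools []string, plan PlanFn,
	invoke func(ToolCall) string, synth SynthesizeFn) string {
	calls := plan(task, tools) // step 2: LLM picks tools

	// Steps 3-4: fan out in parallel, collect results in order.
	results := make([]string, len(calls))
	var wg sync.WaitGroup
	for i, c := range calls {
		wg.Add(1)
		go func(i int, c ToolCall) {
			defer wg.Done()
			results[i] = invoke(c)
		}(i, c)
	}
	wg.Wait()

	return synth(task, results) // step 5: synthesize findings
}

func main() {
	out := RunSkill("quote AAPL", []string{"stock-quote"},
		func(task string, tools []string) []ToolCall {
			return []ToolCall{{Name: tools[0], Input: "AAPL"}}
		},
		func(c ToolCall) string { return c.Name + "(" + c.Input + ")" },
		func(task string, results []string) string { return fmt.Sprintf("%v", results) },
	)
	fmt.Println(out)
}
```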

Skills are the customization layer. Building a new capability means writing a new skill definition and wiring it to the right tools. You do not need to modify the orchestrator or subagents for most new features.

An important constraint: a skill's available tools are the intersection of the tools assigned to the skill and the tools assigned to the agent. This means an agent can restrict which tools a shared skill can access — the same "web-research" skill might have access to different data sources depending on which agent is running it.
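
The intersection itself is a few lines; this sketch assumes tool assignments are plain string lists, and the function name is illustrative:

```go
package main

import "fmt"

// EffectiveTools returns the tools a skill may actually use: the
// intersection of the skill's tool list and the agent's tool list.
func EffectiveTools(skillTools, agentTools []string) []string {
	allowed := make(map[string]bool, len(agentTools))
	for _, t := range agentTools {
		allowed[t] = true
	}
	var out []string
	for _, t := range skillTools {
		if allowed[t] {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	// The agent grants web-search but not apollo, so apollo is filtered out.
	fmt.Println(EffectiveTools(
		[]string{"web-search", "apollo"},
		[]string{"web-search", "finnhub"},
	))
}
```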

Tier 4: Tools

Tools are pure functions with no LLM involvement. They take structured input (from the LLM's tool call), execute an API call or database operation, and return structured output. A tool that fetches a stock quote calls the Finnhub API. A tool that searches for companies calls the Apollo API. A tool that sends an email calls SES.

Tools are stateless, deterministic, and simple. They have retry logic for HTTP calls (exponential backoff, 3 attempts max) but no intelligence. The intelligence lives in the skill definition that teaches the LLM when and how to use the tool.

Why This Separation Matters

The four-tier hierarchy maps directly to how you want to customize and extend the system:

  • New workflow? Write a subagent that coordinates existing skills in a new way.
  • New capability? Write a skill definition and wire it to the right tools.
  • New data source? Write a tool. Skills that need it can access it through their tool configuration.
  • New channel? Write a channel gateway. The orchestrator does not care where the message came from.

Simple requests collapse naturally. "What's AAPL trading at?" passes through the orchestrator to a single subagent that invokes one skill making one tool call. Complex requests fan out. "Research this company's technical signals, insider activity, and market position" dispatches a research subagent that invokes three skills in parallel across multiple turns, each skill making multiple tool calls, all synthesized into a single response.

Event-Driven Execution with SQS

The coordination layer is SQS queues. Every boundary between tiers is a queue:

Channel Gateway → [queue:orchestrator] → Orchestrator Lambda
Orchestrator    → [queue:subagent]     → Subagent Lambda
Subagent        → [queue:skillagent]   → Skill Agent Lambda
Skill Agent     → [queue:tool:{name}]  → Tool Lambda
Tool Lambda     → completion event     → back to Skill Agent
Skill Agent     → completion event     → back to Subagent
Subagent        → completion event     → back to Orchestrator
Orchestrator    → [queue:outbound]     → Channel Gateway

Each arrow is an SQS message. Each Lambda function processes a message, does its work, and terminates. No Lambda sits idle waiting for a child to complete.

This is the critical insight: the parent does not wait for children to finish. The orchestrator dispatches subagent invocations to SQS, updates its state in the database, and exits. When all subagents complete, a completion event re-invokes the orchestrator to synthesize results. The same pattern repeats at every tier — subagents dispatch skills and exit, skills dispatch tools and exit.

The Countdown Latch

The mechanism for knowing when all children have completed is an atomic countdown:

-- When the orchestrator dispatches 3 subagents:
INSERT INTO agent_runs (run_id, pending_count, status)
VALUES ($1, 3, 'awaiting_children');
 
-- When each subagent completes:
UPDATE agent_runs
SET pending_count = pending_count - 1, updated_at = now()
WHERE run_id = $1
RETURNING pending_count, callback_queue;

The last subagent to complete sees pending_count = 0 and sends a completion event to the orchestrator's callback queue. The same pattern repeats at every tier — a subagent dispatches N skills and uses the atomic decrement to detect when all skills have completed, and a skill does the same with its tool invocations.

This avoids polling, long-running connections, and idle compute. The parent Lambda is not running while children execute. It is re-invoked only when there is work to do.
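
The control flow around the decrement looks like this. Here an in-memory counter stands in for the agent_runs row — the mutex-protected map plays the role that the atomic UPDATE ... RETURNING plays in Postgres — so this is a sketch of the semantics, not the production storage.

```go
package main

import (
	"fmt"
	"sync"
)

// Latch is an in-memory stand-in for the agent_runs pending_count column.
type Latch struct {
	mu      sync.Mutex
	pending map[string]int
}

func NewLatch() *Latch { return &Latch{pending: map[string]int{}} }

// Dispatch records that runID is waiting on n children.
func (l *Latch) Dispatch(runID string, n int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.pending[runID] = n
}

// ChildDone atomically decrements the counter. It returns true only
// for the last child — the one that sends the completion event.
func (l *Latch) ChildDone(runID string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.pending[runID]--
	return l.pending[runID] == 0
}

func main() {
	l := NewLatch()
	l.Dispatch("run-1", 3)
	for i := 0; i < 3; i++ {
		if l.ChildDone("run-1") {
			fmt.Println("last child: send completion event to callback queue")
		}
	}
}
```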

Message Envelopes

Each tier receives messages with a type discriminator that determines the processing path:

type OrchestratorMessage struct {
    Type       string           // "inbound" or "completion"
    Inbound    *InboundMessage  // new user message
    Completion *CompletionEvent // all subagents finished
}
 
type SubagentMessage struct {
    Type       string              // "new_task" or "completion"
    Invocation *SubagentInvocation // task from orchestrator
    Completion *CompletionEvent    // all skills finished (this turn)
}
 
type SkillAgentMessage struct {
    Type       string           // "new_task" or "completion"
    Invocation *SkillInvocation // task from subagent
    Completion *CompletionEvent // all tools finished
}

Every tier handles two types of events: a new task (plan and dispatch children) or a completion event (all children finished — synthesize results and either continue or return to parent). A subagent, on receiving a completion event, may decide to dispatch another round of skills rather than returning — this is what enables multi-turn reasoning within a subagent.

This dual-purpose design means each tier needs exactly one Lambda function and one SQS queue, regardless of how many subagents, skills, or tools exist.
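
Handling both event types is a switch on the discriminator. A sketch for the orchestrator, with the real plan/dispatch and synthesis logic reduced to placeholder return values:

```go
package main

import (
	"errors"
	"fmt"
)

type InboundMessage struct{ Text string }
type CompletionEvent struct{ RunID string }

type OrchestratorMessage struct {
	Type       string // "inbound" or "completion"
	Inbound    *InboundMessage
	Completion *CompletionEvent
}

// Handle routes a message to the right processing path: plan and
// dispatch subagents for a new user message, or synthesize results
// when all subagents have completed.
func Handle(msg OrchestratorMessage) (string, error) {
	switch msg.Type {
	case "inbound":
		return "plan+dispatch: " + msg.Inbound.Text, nil
	case "completion":
		return "synthesize: " + msg.Completion.RunID, nil
	default:
		return "", errors.New("unknown message type: " + msg.Type)
	}
}

func main() {
	out, _ := Handle(OrchestratorMessage{
		Type:    "inbound",
		Inbound: &InboundMessage{Text: "research ACME"},
	})
	fmt.Println(out)
}
```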

The LLM Gateway

Every LLM call in the system goes through a gateway pipeline that enforces limits, handles failures, and records usage. The pipeline stages run in order:

Guard (pre-check). Before every LLM call, check: has the run exceeded its deadline? Has the tenant exceeded its monthly budget? Has the run exceeded its maximum turn count? If any limit is hit, the guard does not kill the execution — it strips the tools from the request and forces the LLM to synthesize with whatever results it has. Wrap up gracefully, do not crash mid-flow.
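
The strip-the-tools behavior can be sketched as below; the struct fields and limit checks are illustrative stand-ins for the real run state.

```go
package main

import "fmt"

type LLMRequest struct {
	Tools    []string
	Messages []string
}

type RunState struct {
	DeadlineExceeded bool
	BudgetExceeded   bool
	Turn, MaxTurns   int
}

// Guard runs before every LLM call. When any limit is hit it does not
// abort the run: it removes the tools so the model has no choice but
// to synthesize a final answer from the results it already has.
func Guard(req *LLMRequest, run RunState) (forcedWrapUp bool) {
	if run.DeadlineExceeded || run.BudgetExceeded || run.Turn >= run.MaxTurns {
		req.Tools = nil
		return true
	}
	return false
}

func main() {
	req := LLMRequest{Tools: []string{"web-search"}}
	forced := Guard(&req, RunState{Turn: 10, MaxTurns: 10})
	fmt.Println(forced, len(req.Tools)) // limit hit: tools stripped
}
```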

Config Loader. Resolve which LLM provider and model to use. Each agent has a configured chain — a primary provider and one or more fallbacks. Load API keys from Secrets Manager.

Compactor. Check whether the conversation history exceeds the model's context window. If it does, summarize the oldest turns with a separate LLM call and rebuild the messages with the summary as a prefix. This keeps long conversations functional without losing critical context.
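
The decision of what to compact is simple bookkeeping. A sketch that splits history into the oldest turns to summarize and the recent turns to keep, using a crude characters-per-token estimate (a real implementation would use the provider's tokenizer, and the budget value is an assumption):

```go
package main

import "fmt"

// estimateTokens is a crude stand-in for a real tokenizer (~4 chars/token).
func estimateTokens(msgs []string) int {
	n := 0
	for _, m := range msgs {
		n += len(m)/4 + 1
	}
	return n
}

// SplitForCompaction returns the oldest turns to summarize and the
// most recent turns to keep verbatim, such that the kept turns fit
// within the token budget.
func SplitForCompaction(msgs []string, budget int) (summarize, keep []string) {
	if estimateTokens(msgs) <= budget {
		return nil, msgs // nothing to compact
	}
	// Walk backwards, keeping the most recent turns that still fit.
	total := 0
	i := len(msgs)
	for i > 0 {
		t := estimateTokens(msgs[i-1 : i])
		if total+t > budget {
			break
		}
		total += t
		i--
	}
	return msgs[:i], msgs[i:]
}

func main() {
	old, recent := SplitForCompaction([]string{"turn one", "turn two", "turn three"}, 6)
	fmt.Println(len(old), len(recent))
}
```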

Router. Send the request to the primary provider. On rate limiting (429) or server error (5xx), fall back to the next provider in the chain. On client errors (4xx other than 429), fail immediately — the request is malformed and retrying will not help. Once the first token streams back from a provider, commit to it — no mid-stream fallback.
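
The fallback logic reduces to classifying errors and walking the chain. A sketch, with providers modeled as functions and the streaming commit rule (never falling back after the first token) omitted for brevity; the type and function names are illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// ProviderError carries the HTTP status from a provider call.
type ProviderError struct{ Status int }

func (e *ProviderError) Error() string {
	return fmt.Sprintf("provider returned status %d", e.Status)
}

// retryable: rate limits (429) and server errors (5xx) fall through
// to the next provider; other client errors fail immediately because
// retrying a malformed request will not help.
func retryable(err error) bool {
	var pe *ProviderError
	if errors.As(err, &pe) {
		return pe.Status == 429 || pe.Status >= 500
	}
	return false
}

// Route tries each provider in the chain until one succeeds or a
// non-retryable error occurs.
func Route(chain []func() (string, error)) (string, error) {
	var err error
	for _, call := range chain {
		var out string
		if out, err = call(); err == nil {
			return out, nil
		}
		if !retryable(err) {
			return "", err
		}
	}
	return "", fmt.Errorf("all providers failed: %w", err)
}

func main() {
	out, err := Route([]func() (string, error){
		func() (string, error) { return "", &ProviderError{Status: 429} }, // primary rate-limited
		func() (string, error) { return "response from fallback", nil },
	})
	fmt.Println(out, err)
}
```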

Meter. Record token usage and cost. Write to both Redis (for fast budget checks on subsequent calls) and Postgres (as the append-only source of truth for billing and audit).

Tracer. Emit a trace event recording the provider, model, token counts, latency, and any fallback events. Every LLM call in the system is traceable.

The guard is the single enforcement point. No other layer checks limits. This prevents the scattered enforcement problem where budget checks happen in three different places with three different interpretations of the rules.

Scaling Properties

This architecture has several properties that make it scale well on AWS:

Zero idle compute. Every Lambda runs only when there is work. Between user messages, the system consumes no compute resources. This is fundamentally different from a server-based agent system that maintains WebSocket connections, polls queues, or runs background loops.

Independent scaling per tier. Tool Lambdas scale independently from skill Lambdas, which scale independently from subagent and orchestrator Lambdas. A burst of tool invocations does not affect the orchestrator's ability to accept new messages.

Parallel fan-out. When an orchestrator dispatches three subagents, those subagents execute concurrently on separate Lambda instances. When a subagent dispatches skills, and those skills dispatch tools, the parallelism cascades through every tier. SQS delivers messages to available Lambda instances automatically.

Cost proportional to usage. No provisioned capacity is required (though provisioned concurrency can reduce cold starts for latency-sensitive workloads). A system that processes 100 messages per day costs almost nothing. A system that processes 100,000 messages per day costs more, but scales without architectural changes.

Tenant isolation at the compute level. Each invocation runs in its own Lambda execution context. There is no shared state between tenant requests beyond the database. Secrets are fetched fresh from Secrets Manager on every invocation — never cached across warm starts.

Multi-Tenant Isolation

If you are building a platform that serves multiple tenants, isolation is the hardest problem and the one you cannot retrofit. The patterns that matter:

Every database query includes tenant_id. Not sometimes, not when convenient — every query. Even if the primary key uniquely identifies a record, the WHERE clause includes tenant_id. This is defense in depth against bugs in tenant resolution.

Conversation IDs are deterministically derived. conversation_id = "{tenant_id}:{agent_id}:{user_id}". Including tenant_id in the conversation ID means that even a bug in tenant routing cannot cause cross-tenant conversation leakage.
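
As a function, the derivation is one line; the point is that the tenant prefix is the safety property:

```go
package main

import "fmt"

// ConversationID derives the deterministic conversation key. Because
// tenant_id is part of the key, a bug in tenant routing cannot address
// another tenant's conversation history.
func ConversationID(tenantID, agentID, userID string) string {
	return fmt.Sprintf("%s:%s:%s", tenantID, agentID, userID)
}

func main() {
	fmt.Println(ConversationID("tenant-1", "agent-7", "user-42"))
}
```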

Secrets are never cached. Lambda warm starts reuse the execution environment. If you cache a tenant's API key, the next invocation on the same Lambda instance — potentially for a different tenant — has access to it. Fetch secrets fresh from Secrets Manager on every invocation. The latency cost is real (50-100ms) but the alternative is a security incident.

No package-level mutable state. All package-level variables are stateless (configuration, constants). Tenant context lives in request scope and is garbage-collected when the invocation completes.

Authorization before enqueue. When a message arrives from a channel, the channel gateway validates the webhook signature, resolves the sender's identity, verifies tenant membership, checks tenant status, and enforces rate limits — all before the message reaches the orchestrator queue. Unauthorized messages never touch the agent layer.

Conversation Locking

When a user sends multiple messages quickly, you can end up with concurrent Lambda invocations processing the same conversation. This creates race conditions — interleaved conversation history, duplicate skill invocations, conflicting state updates.

The solution is a conversation-level lock with a TTL:

SET {tenant}:{conversation}:lock {run_id} NX EX 120

The NX flag ensures only one invocation acquires the lock. The EX 120 sets a 120-second TTL so the lock expires even if the Lambda crashes. When the second message arrives and fails to acquire the lock, it can be requeued with a delay or processed after the lock releases.

This is a simple pattern but it prevents a class of bugs that are difficult to diagnose in production — duplicated responses, missing context, and inconsistent state.

Tracing Everything

In a distributed system with multiple Lambda invocations per user request, you need a correlation mechanism. Every flow gets a trace_id that propagates through every message envelope, every database write, and every LLM call.

Trace events are append-only — never updated or deleted. Each event records the tier (orchestrator, subagent, skill, tool), the event type (invoked, completed, failed), duration, and a payload with relevant details. A single user query might generate 15-30 trace events across 5-15 Lambda invocations.

This creates a complete execution timeline: which skills were invoked, which tools were called, how long each took, which LLM provider was used, how many tokens were consumed, and whether any fallbacks occurred. When something goes wrong at 3 AM, the trace tells you exactly what happened without reproducing the issue.

The Implementation Ingredients

If you want to build this yourself, here is what you need:

Infrastructure:

  • AWS Lambda (container images or zip deployments)
  • SQS queues (one per tier, plus per-tool queues if you want isolated scaling)
  • API Gateway (HTTP API for channel webhooks)
  • RDS Postgres (with pgvector if you need semantic search)
  • ElastiCache Redis (conversation locks, budget counters, rate limiting)
  • Secrets Manager (per-tenant API keys and LLM credentials)
  • ECR (if using container image deployments)

Application patterns:

  • Message envelope with type discriminator for dual-purpose Lambdas
  • Atomic countdown latch in Postgres for fan-out/fan-in coordination
  • Redis-based conversation locks with TTL for concurrency control
  • LLM gateway pipeline (guard → config → compactor → router → meter → tracer)
  • Provider fallback chain with streaming commit semantics
  • Append-only trace events with correlation IDs

Database schema (minimum viable):

  • agents — agent configuration, system prompts, limits
  • agent_runs — execution state, pending_count, callback_queue, parent_run_id, run_type (orchestrator/subagent/skill)
  • conversations — turn-by-turn history per conversation
  • subagent_definitions — subagent prompts and skill associations
  • skill_definitions — skill prompts and tool associations
  • tool_registry — tool metadata and input schemas
  • tool_results — results from tool invocations
  • trace_events — append-only execution log
  • llm_usage — token and cost tracking per call

What you can skip initially:

  • Multi-tenant isolation (if you are building for a single team)
  • Budget enforcement (if cost is not a concern yet)
  • Provider fallback chains (start with a single LLM provider)
  • Context compaction (if your conversations are short)
  • Multi-turn subagent reasoning (start with single-turn subagents that dispatch skills once and return)

The four-tier hierarchy, the SQS coordination, and the countdown latch are the core patterns. Everything else is important for production but can be added incrementally.

What This Architecture Gives You

The result is an agent system that scales from zero to thousands of concurrent executions without provisioned infrastructure, isolates tenant data by construction, traces every operation for debugging and compliance, and costs nothing when idle.

It is not the simplest architecture. A monolithic server with a WebSocket connection is easier to build and debug initially. But monolithic servers do not scale to zero, do not isolate tenants, and do not survive partial failures gracefully. If you are building an agent platform that needs to run in production — not a prototype, not a demo — the serverless event-driven architecture pays for itself quickly.

The patterns described here are not theoretical. We have run this architecture in production, hit the edge cases, and built the constraints that prevent them. If you are designing a similar system and want to avoid repeating the mistakes we have already made, that is exactly the kind of engagement we do.

ai · aws · serverless · architecture · agents