Every semantic search, every RAG pipeline, every constraint similarity lookup starts with the same operation: turning text into a vector. When we built Xpand, embedding generation was on the critical path for every review, every knowledge search, and every plan enrichment. Calling OpenAI's embedding API for every operation meant latency we could not control, costs that scaled linearly with usage, and search queries leaving our infrastructure.
So we built our own. A Rust-based embedding service running on AWS Lambda that generates vectors locally, returns results in under 20ms on warm invocations, and costs effectively nothing at moderate scale. This post covers the architecture, deployment, and the reasoning behind each decision — with enough detail to build your own.
Why Not Just Call an API?
The case against external embedding APIs comes down to three things:
Latency. A round trip to OpenAI's embedding endpoint takes 100-300ms depending on batch size and network conditions. That is fine for a batch job. It is not fine when a developer is waiting for a review agent to finish, or when a planning agent needs to search the knowledge base mid-session. Our self-hosted service returns embeddings in 10-20ms on warm Lambda invocations. That difference compounds across dozens of semantic searches per session.
Cost. OpenAI charges $0.10 per million tokens for text-embedding-3-small. That sounds cheap until you are running hundreds of thousands of embedding operations per month across constraint searches, pattern matching, lesson retrieval, and plan enrichment. A Lambda function running on ARM64 costs fractions of a cent per invocation. At moderate scale, the cost difference is 40-400x.
Privacy. Every embedding request to an external API sends the text being embedded. For a governance platform, that means your constraints, your architectural rules, your security requirements, your code snippets — all leaving your infrastructure. Self-hosted embeddings mean your knowledge base never leaves your VPC.
The Architecture
The service is a single Lambda function running a Rust binary that loads a BERT-based embedding model and exposes an OpenAI-compatible API. Here is the stack:
- Language: Rust with Tokio async runtime
- ML inference: Hugging Face Candle (a Rust-native ML framework)
- Model: BAAI/bge-small-en-v1.5 — a 127MB sentence transformer that produces 384-dimensional embeddings
- Runtime: AWS Lambda with a custom runtime on Amazon Linux 2023
- Architecture: ARM64 (Graviton2) for cost and performance
- Deployment: Docker container image pushed to ECR
- Invocation: Direct Lambda invoke from internal services (no API Gateway)
The model weights are baked into the Docker image at /opt/model/. No S3 fetch on cold start, no EFS mount, no model download at runtime. The binary and the model ship as a single container image. Internal services invoke the Lambda function directly using the AWS SDK — no API Gateway in the path. This eliminates the Gateway overhead and keeps the call entirely within the VPC.
Why Rust and Candle
Python is the default for ML inference, but it is a poor fit for Lambda. Python cold starts are slow, dependency packaging is painful (especially for ML libraries with C extensions), and the runtime overhead is significant for a service that needs to respond in milliseconds.
Rust with Candle eliminates all of these problems. Candle is a Rust-native ML framework from Hugging Face that can load and run transformer models without Python, without PyTorch, and without ONNX conversion. The compiled binary is a single statically-linked executable. There is no interpreter startup, no module loading, and no JIT compilation.
The trade-off is cold starts. The binary itself initializes in milliseconds, but loading the model from /opt — which in Lambda is backed by virtual disk, not fast local storage — is slow. More on that below. Warm invocations take 10-20ms for single text embeddings.
The Model: BGE-Small-EN-v1.5
Model selection matters more than most teams realize. The obvious choice is whatever OpenAI offers, but for self-hosted inference the trade-offs are different — and for code-related workloads, the right small model can outperform a general-purpose large one.
We use BAAI/bge-small-en-v1.5, a 12-layer BERT model that produces 384-dimensional embeddings. The reasons:
Relevance to code and technical content. This was the primary driver. BGE-small was trained on a diverse corpus that includes technical documentation, code-adjacent text, and structured content. For our use case — semantic search over architectural constraints, security rules, design patterns, and code review annotations — it produces significantly better similarity rankings than general-purpose models tuned for conversational or web content. The model understands the semantic distance between "validate user input before database write" and "SQL injection prevention" in a way that models optimized for natural language search do not.
Size. The model is 127MB in safetensors format. It fits in a Lambda container image and loads into memory with room for inference buffers. Larger models like bge-large (1.3GB) would require more memory, slower cold starts, and higher cost for marginal quality improvement on our specific domain.
Inference speed. 12 layers at 384 hidden dimensions means a forward pass completes in 10-15ms on a Graviton2 CPU. No GPU required. This is the key enabler for running inference on Lambda rather than a dedicated GPU instance.
Implementation Details
Model Loading
The model loads once per Lambda container instance and stays in memory for the container's lifetime. A lazy singleton pattern ensures thread safety:
```rust
use std::sync::Arc;
use once_cell::sync::Lazy;

static MODEL: Lazy<Arc<EmbeddingModel>> =
    Lazy::new(|| Arc::new(EmbeddingModel::load().unwrap()));
```

The EmbeddingModel::load() function reads the safetensors file and tokenizer from /opt/model/ inside the container, builds the BERT model graph using Candle, and returns a struct ready for inference. On subsequent invocations, the model is already in memory and the load is a no-op.
The cold start is where this architecture pays its tax. The model weights are loaded via memory-mapped file access (mmap). In Lambda, /opt is backed by virtual disk — not fast local NVMe. When the model loads for the first time, every page of the 127MB safetensors file must be faulted in from that virtual disk — at ~4KB per page, roughly 32,000 page faults. Kernel readahead amortizes much of the per-fault cost, but that first full pass over the weights still dominates initialization. Combined with container startup and the first inference pass, total cold start time runs 1-2 seconds.
This is the honest trade-off. Warm invocations are blazing fast (10-20ms). Cold starts are not. For our workload — bursts of semantic searches during agent sessions with quiet periods between — provisioned concurrency keeps a warm instance available and cold starts are rare. For latency-critical workloads with unpredictable traffic patterns, this is worth understanding before you commit to the architecture.
Inference Pipeline
For each text, the pipeline is:
- Tokenize (~1-5ms): Text to token IDs using the WordPiece tokenizer
- Tensorize (~1ms): Token IDs to Candle tensors with attention masks
- Forward pass (~10-15ms): BERT inference through 12 transformer layers
- Mean pooling (~1ms): Attention-weighted average of token embeddings into a single 384-dimensional vector
- L2 normalization (~1ms): Scale to unit length so dot product equals cosine similarity
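Steps 4 and 5 can be sketched in plain Rust, without Candle tensors, to make the math concrete. Here `token_embeddings` stands in for the (hypothetical) output of the BERT forward pass — one vector per token — and the names are illustrative, not the service's actual API:

```rust
/// Attention-mask-weighted mean of token embeddings, then L2-normalized.
/// Padding positions (mask == 0) are excluded from the average.
fn pool_and_normalize(token_embeddings: &[Vec<f32>], attention_mask: &[u32]) -> Vec<f32> {
    let dim = token_embeddings[0].len();
    let mut pooled = vec![0.0f32; dim];
    let mut count = 0.0f32;
    for (emb, &m) in token_embeddings.iter().zip(attention_mask) {
        if m == 1 {
            for (p, &v) in pooled.iter_mut().zip(emb) {
                *p += v;
            }
            count += 1.0;
        }
    }
    for p in pooled.iter_mut() {
        *p /= count;
    }
    // L2-normalize so that dot product equals cosine similarity.
    let norm = pooled.iter().map(|v| v * v).sum::<f32>().sqrt();
    for p in pooled.iter_mut() {
        *p /= norm;
    }
    pooled
}

fn main() {
    let embs = vec![vec![1.0, 0.0], vec![0.0, 1.0], vec![9.0, 9.0]];
    let mask = vec![1, 1, 0]; // third token is padding, excluded
    let v = pool_and_normalize(&embs, &mask);
    println!("{:?}", v); // mean [0.5, 0.5] normalized, ≈ [0.7071, 0.7071]
}
```

Because the output is unit length, downstream similarity lookups reduce to a dot product — no per-query norm computation.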
Batch Processing
The handler accepts up to 100 texts in a single request. All texts in a batch are padded to the length of the longest text, stacked into a single tensor, and processed in one forward pass through the model. This is significantly faster than sequential processing — a batch of 10 texts takes approximately 50ms rather than 150ms.
```json
{
  "input": ["constraint text one", "constraint text two", "..."],
  "model": "bge-small-en-v1.5",
  "encoding_format": "float"
}
```

The batch size limit of 100 is a practical ceiling for the Lambda memory allocation. For most use cases — searching a knowledge base, embedding a set of constraints — batch sizes of 10-32 are typical and optimal.
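The pad-to-longest step described above can be sketched in plain Rust. The helper below is illustrative (the real service builds Candle tensors directly); it assumes pad token ID 0, which is BERT's [PAD]:

```rust
/// Pad each token-ID sequence to the batch's longest length and build
/// matching attention masks, ready to stack into a single tensor.
fn pad_batch(batch: &[Vec<u32>]) -> (Vec<Vec<u32>>, Vec<Vec<u32>>) {
    let max_len = batch.iter().map(|t| t.len()).max().unwrap_or(0);
    let mut ids = Vec::with_capacity(batch.len());
    let mut masks = Vec::with_capacity(batch.len());
    for tokens in batch {
        let mut row = tokens.clone();
        let mut mask = vec![1u32; tokens.len()];
        row.resize(max_len, 0); // pad token ID 0
        mask.resize(max_len, 0); // padded positions masked out of pooling
        ids.push(row);
        masks.push(mask);
    }
    (ids, masks)
}

fn main() {
    let (ids, masks) = pad_batch(&[vec![101, 2023, 102], vec![101, 102]]);
    println!("{:?}", ids);   // [[101, 2023, 102], [101, 102, 0]]
    println!("{:?}", masks); // [[1, 1, 1], [1, 1, 0]]
}
```

The attention masks are what keep padding from polluting the mean-pooled embedding of shorter texts in the batch.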
OpenAI-Compatible API
The response format matches OpenAI's embedding API exactly. This means any client code that calls OpenAI for embeddings can switch to the self-hosted service by changing the base URL:
```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.0123, -0.0456, ...],
      "index": 0
    }
  ],
  "model": "bge-small-en-v1.5",
  "usage": {
    "prompt_tokens": 12,
    "total_tokens": 12
  }
}
```

Any service that already constructs OpenAI-format embedding requests can switch to calling this Lambda with minimal changes — swap the HTTP call for a Lambda invoke and the request/response payloads stay identical.
Deployment
Docker Multi-Stage Build
The Dockerfile uses a four-stage build with cargo-chef for dependency caching:
Stage 1 — Planner. Analyzes the Cargo workspace and produces a recipe.json that captures the dependency graph without any source code.
Stage 2 — Cacher. Builds only the dependencies from the recipe. This layer is cached by Docker and only rebuilt when Cargo.lock changes. Since dependencies change infrequently, this saves minutes on most builds.
Stage 3 — Builder. Copies in source code and compiles the Lambda binary with release optimizations: LTO (link-time optimization), single codegen unit, binary stripping, and size optimization (opt-level = "z").
Stage 4 — Runtime. Starts from the provided.al2023 Lambda base image, copies in the compiled binary and model files. The final image contains only what is needed to run.
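A condensed sketch of the four stages, assuming hedged placeholders throughout — image tags, the `embedding-service` binary name, and the `model/` path are illustrative, and the real build pins its own versions:

```dockerfile
# Stage 1 — Planner: capture the dependency graph, no source needed after this
FROM rust:1.75 AS planner
WORKDIR /app
RUN cargo install cargo-chef
COPY . .
RUN cargo chef prepare --recipe-path recipe.json

# Stage 2 — Cacher: build dependencies only; cached until Cargo.lock changes
FROM rust:1.75 AS cacher
WORKDIR /app
RUN cargo install cargo-chef
COPY --from=planner /app/recipe.json recipe.json
RUN cargo chef cook --release --recipe-path recipe.json

# Stage 3 — Builder: compile the Lambda binary on top of cached deps
FROM rust:1.75 AS builder
WORKDIR /app
COPY . .
COPY --from=cacher /app/target target
RUN cargo build --release

# Stage 4 — Runtime: binary plus model weights, nothing else
FROM public.ecr.aws/lambda/provided:al2023
COPY --from=builder /app/target/release/embedding-service /var/runtime/bootstrap
COPY model/ /opt/model/
```

The custom-runtime base image executes /var/runtime/bootstrap, so the compiled binary is copied in under that name; the model weights land at /opt/model/ where the loader expects them.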
Lambda Configuration
```bash
LAMBDA_MEMORY="2048"   # MB — model loading + inference buffers
LAMBDA_TIMEOUT="60"    # seconds — generous for cold starts
ARCHITECTURES="arm64"  # Graviton2 — 20% cheaper than x86_64
```

The 2GB memory allocation is not just about the model size — Lambda allocates CPU proportional to memory. More memory means faster model loading and inference. The 60-second timeout accommodates worst-case cold starts with large batch sizes.
ARM64 is a straightforward win for this workload. Graviton2 instances are 20% cheaper per GB-second than x86_64 and deliver comparable or better performance for compute-bound inference. Since the cross-compilation is handled entirely in Docker, there is no additional development complexity.
Infrastructure Setup
The deployment consists of bash scripts that set up the AWS resources:
- ECR repository for the container image
- IAM role with Lambda execution and VPC access policies
- Lambda function with container image deployment
- VPC configuration so the function can reach Redis for metrics
There is no API Gateway. Internal services invoke the Lambda directly using the AWS SDK's lambda:InvokeFunction API. This eliminates Gateway latency overhead, avoids Gateway per-request costs, and keeps the entire call path within the VPC. The Lambda's resource policy restricts invocation to specific IAM roles — no public endpoint exists.
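For illustration, a direct invoke from the CLI looks like this (function name is a placeholder; in production the same call is made through the AWS SDK rather than the CLI):

```bash
aws lambda invoke \
  --function-name embedding-service \
  --cli-binary-format raw-in-base64-out \
  --payload '{"input": ["validate user input before database write"], "model": "bge-small-en-v1.5"}' \
  response.json
```

The response written to response.json is the OpenAI-format embedding payload shown earlier.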
No CloudFormation, no Terraform, no CDK. For a single-function service, bash scripts with aws cli calls are simpler to understand, debug, and modify. The entire infrastructure setup is under 200 lines of shell script.
Observability
The service logs structured JSON to CloudWatch and records metrics to Redis:
- Cold start detection: Compares the process start timestamp against the invocation timestamp. If they are within 100ms of each other, the invocation is a cold start.
- Request latency: Measured from handler entry to response, recorded as sorted sets in Redis for percentile analysis.
- Error tracking: Error types counted per day for alerting.
- Recent events: The last 1000 request events stored as a JSON list for debugging.
CloudWatch log retention is set to 7 days. For a stateless embedding service, longer retention is rarely useful — the interesting data is in the metrics, not the individual log lines.
The Cost Math
For a team generating 100,000 embedding operations per month:
OpenAI text-embedding-3-small: At an average of 50 tokens per text, that is 5 million tokens per month. At $0.10 per million tokens, the cost is $0.50/month. Cheap — but that is a low-volume scenario.
At 10 million operations per month (the scale where semantic search, constraint matching, and plan enrichment add up across a team), the OpenAI cost is $50/month. Still manageable, but now add latency: 100-300ms per request, thousands of requests per day, each one a network round trip to an external service.
Self-hosted Lambda: At 2GB memory and ~100ms average duration on warm invocations, each invocation costs approximately $0.0000033. Ten million invocations: ~$33/month in Lambda compute. Plus ECR storage ($0.10/GB/month for the container image). No API Gateway cost because internal services invoke the Lambda directly. Total: under $35/month.
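The per-invocation figure can be sanity-checked from AWS's published ARM64 Lambda rates — assumed here to be roughly $0.0000133334 per GB-second of compute plus $0.20 per million requests, which varies by region and over time:

```rust
fn main() {
    // Assumed ARM64 Lambda pricing (e.g. us-east-1); check current rates.
    let gb_seconds = 2.0 * 0.1; // 2 GB memory × ~100 ms warm duration
    let compute_cost = gb_seconds * 0.0000133334;
    let request_cost = 0.0000002; // $0.20 per 1M requests
    let per_invocation = compute_cost + request_cost;
    let monthly = per_invocation * 10_000_000.0; // 10M invocations/month
    println!("per invocation: ${:.7}", per_invocation); // ≈ $0.0000029
    println!("monthly: ${:.2}", monthly); // ≈ $29 of Lambda compute
}
```

That lands in the same ballpark as the figures above; the exact number moves with region, memory allocation, and average warm duration.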
The cost advantage is real but not the primary driver. The primary drivers are latency and privacy. The cost savings are a bonus.
When This Is Not Worth It
Self-hosted embeddings are not the right choice for every team.
If you need multilingual embeddings, the compact English-only models like BGE-small are not sufficient. Multilingual models are larger, require more memory, and may justify a GPU-backed inference endpoint instead of Lambda.
If your volume is very low (under 10,000 operations per month), the engineering investment in deploying and maintaining a Lambda function exceeds the cost savings. Just call OpenAI.
If you need the absolute highest embedding quality, larger models like text-embedding-3-large (3072 dimensions) will outperform BGE-small on broad benchmarks. For domain-specific technical content, the gap is smaller than the benchmarks suggest, but it exists.
If you do not have AWS infrastructure, the deployment scripts assume ECR and Lambda. Adapting to GCP Cloud Run or Azure Container Apps is straightforward but requires rewriting the deployment layer.
The Ingredients
If you want to build this yourself, here is what you need:
- Rust toolchain with the `aarch64-unknown-linux-gnu` target for ARM64 cross-compilation
- Hugging Face Candle crates: `candle-core`, `candle-nn`, `candle-transformers` for model loading and inference
- The `tokenizers` crate from Hugging Face for WordPiece tokenization
- The `lambda_http` crate for the Lambda runtime integration
- A sentence transformer model in safetensors format — download from Hugging Face and include in your Docker image
- Docker with multi-stage builds and `cargo-chef` for dependency caching
- AWS CLI for deploying to ECR and Lambda
The total development effort from zero to a deployed, OpenAI-compatible embedding endpoint is about two to three days if you are familiar with Rust and AWS. The ongoing maintenance is near zero — the service is stateless, the model does not change frequently, and Lambda handles scaling automatically.
We have been running this in production for all of Xpand's semantic capabilities — constraint search, pattern matching, lesson retrieval, and plan knowledge enrichment. It handles the load without incident, costs almost nothing, and keeps every search query inside our infrastructure.