Every RAG demo looks impressive. You load some documents, generate embeddings, wire up a retrieval step, and suddenly your LLM is answering questions about your data. The gap between that demo and a production system that your business can depend on is where most of the engineering work lives.
After building RAG systems across several domains, we have found that the patterns that matter most are rarely the ones that get attention in tutorials.
Chunking Is the Whole Game
The single biggest determinant of retrieval quality is how you chunk your source documents. Get this wrong and no amount of embedding model tuning or retrieval algorithm sophistication will save you.
The naive approach — split on a fixed token count — produces chunks that break mid-sentence, separate context from the content it describes, and mix unrelated topics. The retrieval step then returns fragments that are technically similar to the query but lack the context needed for a useful answer.
Effective chunking is domain-specific. For technical documentation, chunk boundaries should align with section headers and logical topic breaks. For conversational data, chunks should preserve the full exchange context. For code, chunks need to include function signatures, docstrings, and enough surrounding code to understand the purpose.
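For the documentation case, a minimal sketch of header-aligned chunking might look like this (the regex and the `max_chars` limit are illustrative choices, not a prescription):

```python
import re

def chunk_markdown_by_headers(text, max_chars=2000):
    """Split markdown into chunks aligned with section headers.

    Each chunk starts at a header so the section title travels with its
    content; oversized sections are split on paragraph breaks, repeating
    the header so no fragment loses its context.
    """
    # Split just before every markdown header line (#, ##, ...).
    sections = re.split(r"\n(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Oversized section: split on blank lines, keep header as prefix.
        header, _, body = section.partition("\n")
        current = header
        for para in body.split("\n\n"):
            if len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = header + "\n" + para  # repeat header for context
            else:
                current += "\n\n" + para
        chunks.append(current)
    return chunks
```

The same shape applies to the other domains: swap the header regex for turn boundaries (conversations) or function definitions (code).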
The investment in a custom chunking strategy pays off more than any other single optimization. We typically spend 30-40% of the pipeline engineering effort on getting chunking right.
The Embedding Pipeline Is an ETL Problem
Once you have a chunking strategy, you need a pipeline that keeps your vector store current. This is a classic ETL problem with some specific wrinkles:
Incremental updates matter. Re-embedding your entire corpus on every change is expensive and slow. You need change detection — knowing which source documents changed, which chunks are affected, and updating only those embeddings. This requires maintaining a mapping between source documents and their chunks.
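The change-detection piece can be as simple as a content-hash manifest. A sketch, assuming markdown sources on disk and a JSON manifest file (both file layout and naming are illustrative):

```python
import hashlib
import json
from pathlib import Path

def detect_changes(source_dir, manifest_path):
    """Compare current source files against a stored hash manifest.

    Returns (changed, deleted): files whose chunks need re-embedding,
    and files whose chunks should be removed from the vector store.
    """
    manifest_file = Path(manifest_path)
    old = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    new = {}
    for path in sorted(Path(source_dir).rglob("*.md")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        new[str(path)] = digest
    changed = [p for p, h in new.items() if old.get(p) != h]
    deleted = [p for p in old if p not in new]
    manifest_file.write_text(json.dumps(new, indent=2))
    return changed, deleted
```

The document-to-chunk mapping then lets you translate `changed` and `deleted` files into the specific vector IDs to upsert or remove.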
Embedding model versioning is critical. When you change your embedding model (and you will — better models are released constantly), you need to re-embed everything. This means your pipeline needs to support full re-indexing while the current index serves traffic. Blue-green deployments for your vector store.
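The blue-green pattern reduces to an alias swap: readers always query a stable logical name, a full re-embed builds a fresh physical index, and the alias flips atomically once the new index is ready. A toy in-memory sketch of the mechanism (real vector databases expose this as aliases or collection swaps; the class and method names here are ours):

```python
class AliasedVectorStore:
    """Blue-green indexing via an alias: readers always hit the alias,
    a full re-embed builds a fresh index, then the alias flips."""

    def __init__(self):
        self.indexes = {}   # physical index name -> chunk data
        self.alias = None   # logical name readers query

    def build_index(self, name, chunks):
        # Stand-in for embedding every chunk and upserting the vectors.
        self.indexes[name] = chunks

    def promote(self, name):
        # Atomic cutover: readers see the new index on their next query.
        old, self.alias = self.alias, name
        if old and old != name:
            del self.indexes[old]  # retire the previous index

    def query(self):
        return self.indexes[self.alias]
```

While `docs-v2` is being built with the new embedding model, `docs-v1` keeps serving traffic; `promote("docs-v2")` is the only moment of change.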
Staleness is a real problem. If your knowledge base changes frequently, your retrieval system is only as good as your last indexing run. For some use cases, near-real-time indexing is necessary. For others, a nightly batch is fine. The requirement should drive the architecture, not the other way around.
Retrieval Quality Requires Measurement
You cannot improve what you do not measure. Yet most RAG implementations have no systematic evaluation of retrieval quality.
A practical evaluation framework needs three things:
A test set of queries with known relevant documents. This is the ground truth that tells you whether your retrieval is finding the right chunks. Building this test set is manual work, but it is essential. Start with 50-100 representative queries and expand over time.
Retrieval metrics. At minimum, track recall at K (what fraction of relevant chunks appear in the top K results) and precision at K (what fraction of returned chunks are actually relevant). These numbers tell you whether your chunking and embedding strategy are working.
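Both metrics are a few lines of set arithmetic over chunk IDs. A sketch of the evaluation loop, assuming a test set of (query, relevant-IDs) pairs and a `retrieve` function that returns ranked chunk IDs:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant chunk IDs that appear in the top-k results."""
    hits = set(retrieved[:k]) & set(relevant)
    return len(hits) / len(relevant) if relevant else 0.0

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are actually relevant."""
    hits = set(retrieved[:k]) & set(relevant)
    return len(hits) / k

def evaluate(test_set, retrieve, k=5):
    """Average both metrics over (query, relevant_ids) pairs."""
    recalls, precisions = [], []
    for query, relevant in test_set:
        retrieved = retrieve(query)
        recalls.append(recall_at_k(retrieved, relevant, k))
        precisions.append(precision_at_k(retrieved, relevant, k))
    n = len(test_set)
    return sum(recalls) / n, sum(precisions) / n
```

Run this after every chunking or embedding change; a regression here is cheaper to catch than a regression in production answers.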
End-to-end answer quality. Retrieval metrics tell you about the retrieval step. You also need to evaluate whether the LLM produces good answers given the retrieved context. This can be partially automated with LLM-as-judge evaluation, but periodic human review is necessary to calibrate.
Hybrid Search Is Usually Worth It
Pure vector search has a well-known weakness: it struggles with exact matches, specific identifiers, and precise terminology. If a user asks about "error code 4012," semantic similarity might return chunks about error handling in general rather than the specific error code.
Hybrid search — combining vector similarity with keyword search (typically BM25) — addresses this directly. The keyword component handles exact matches and specific terms. The vector component handles semantic similarity and paraphrased queries. A reciprocal rank fusion or weighted combination merges the results.
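Reciprocal rank fusion is simple enough to sketch in full: each document scores the sum of 1/(k + rank) across the ranked lists it appears in, with k = 60 as the conventional smoothing constant:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked result lists by summing 1/(k + rank) per document.

    `rankings` is a list of ranked lists of doc IDs, e.g. one from BM25
    and one from vector search. Documents appearing high in either list,
    or moderately high in both, float to the top of the fused ranking.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, it sidesteps the problem of calibrating BM25 scores against cosine similarities, which is the main pitfall of weighted combinations.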
In practice, hybrid search improves retrieval quality for almost every use case we have worked on. The implementation cost is modest — most vector databases now support hybrid search natively or through simple configuration.
Context Window Management
Retrieved chunks need to fit within the LLM's context window alongside the system prompt, conversation history, and any other instructions. This creates a practical constraint on how many chunks you can include.
The temptation is to stuff as many chunks as possible into the context. More context should mean better answers, right? In practice, the opposite is often true. Including marginally relevant chunks dilutes the signal. The LLM may latch onto irrelevant details from a low-quality chunk and produce a worse answer than it would with fewer, more relevant chunks.
We typically find that 3-5 highly relevant chunks outperform 10-15 chunks of mixed relevance. The key is retrieval precision — returning the right chunks, not just similar chunks.
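One way to operationalize precision over volume is a selection step with a relevance floor, a chunk cap, and a token budget. A sketch with illustrative thresholds (the score cutoff and the whitespace token estimate are stand-ins, not tuned values):

```python
def select_context(scored_chunks, min_score=0.75, max_chunks=5,
                   token_budget=3000):
    """Pick few, highly relevant chunks rather than stuffing the window.

    `scored_chunks` is a list of (score, text) pairs sorted descending
    by score. Selection stops at the first chunk below the relevance
    floor, or when the chunk cap or token budget is reached.
    """
    selected, used = [], 0
    for score, text in scored_chunks:
        if score < min_score or len(selected) >= max_chunks:
            break
        tokens = len(text.split())  # crude token estimate for the sketch
        if used + tokens > token_budget:
            break
        selected.append(text)
        used += tokens
    return selected
```

The relevance floor is the important part: it lets the context shrink to two chunks, or zero, when nothing better is available, rather than padding out to a fixed count.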
The Production Checklist
Before deploying a RAG system to production, verify these fundamentals:
- Chunking strategy is domain-appropriate and tested against real queries
- Embedding pipeline supports incremental updates and model versioning
- Retrieval quality is measured with a test set and tracked over time
- Hybrid search is evaluated and implemented if beneficial
- Context window usage is optimized for precision over volume
- Failure modes are handled — what happens when retrieval returns nothing relevant?
- Monitoring is in place for retrieval latency, embedding pipeline health, and answer quality drift
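The empty-retrieval failure mode from the checklist deserves explicit handling: if nothing clears the relevance bar, the system should say so rather than let the LLM improvise over noise. A minimal sketch (the threshold, message, and the `retrieve`/`generate` signatures are illustrative):

```python
NO_CONTEXT_REPLY = (
    "I couldn't find anything in the knowledge base about that. "
    "Try rephrasing your question."
)

def answer(query, retrieve, generate, min_score=0.7):
    """Answer a query, handling the empty-retrieval case explicitly.

    `retrieve` returns (score, chunk) pairs; `generate` takes the query
    plus assembled context. If no chunk clears the relevance threshold,
    return a fixed fallback instead of generating from weak context.
    """
    results = [(s, c) for s, c in retrieve(query) if s >= min_score]
    if not results:
        return NO_CONTEXT_REPLY
    context = "\n\n".join(chunk for _, chunk in results)
    return generate(query, context)
```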
The gap between a RAG demo and a production system is engineering discipline. The fundamental concepts are straightforward. The execution details are where reliability lives.