What is RAG?
Updated 2026-04-18
RAG is short for Retrieval-Augmented Generation. At its simplest: before you ask the language model a question, you retrieve relevant context from your own data and paste it into the prompt. The model then answers using that context. It's three moves (embed, retrieve, generate) dressed up with pipelines, metadata, and increasingly sophisticated orchestration on top.
The three-step loop
- Embed. Convert your documents into numeric vectors using an embedding model. Store the vectors in a searchable index.
- Retrieve. At query time, embed the user's question and find the closest document vectors by similarity.
- Generate. Concatenate the retrieved chunks with the user's question and send the combined prompt to an LLM.
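The three steps above can be sketched in a few lines of plain Python. `embed` and `llm_complete` are hypothetical stand-ins for whatever embedding model and LLM client you use, and a real system would use a vector index rather than this linear scan over all documents.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def retrieve(query_vec, doc_vecs, k=3):
    """Indices of the k document vectors closest to the query vector."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

def answer(question, docs, doc_vecs, embed, llm_complete, k=3):
    # 1. Embed the question with the same model used for the documents.
    q_vec = embed(question)
    # 2. Retrieve the k closest chunks by similarity.
    context = "\n\n".join(docs[i] for i in retrieve(q_vec, doc_vecs, k))
    # 3. Generate: concatenate context and question into one prompt.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm_complete(prompt)
```

Everything that follows in this section is refinement of one of those three functions.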
That's vanilla RAG. Everything in this section is what you do once you realize vanilla RAG produces a good demo and a mediocre product.
What RAG gives you
- Grounded answers. The model responds with your data, not its pretraining.
- Citations. Because the retrieval step returns specific chunks, you can link answers to source documents.
- Freshness. Update the index, update the answers. No retraining required.
- Access control. Filter retrieved documents by user, tenant, or permission before generation.
- Cost control. You only pay for the tokens you actually pass into the context window.
- Auditable behavior. Logs of what was retrieved let you debug why a model said what it said.
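To make the access-control point concrete: a post-retrieval permission filter can be a one-liner. The `allowed_groups` metadata field here is a hypothetical schema, and most vector stores also support filtering at query time, which is preferable at scale because it never loads forbidden chunks at all.

```python
def filter_by_permission(chunks, user_groups):
    """Drop retrieved chunks the user may not see, *before* generation.

    Assumes each chunk dict carries an `allowed_groups` set in its
    metadata (a hypothetical schema; adapt to your store's filters).
    """
    return [c for c in chunks if c["allowed_groups"] & user_groups]
```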
What RAG doesn't give you
- Reasoning you don't already have. RAG can't teach a model to solve a problem it didn't learn in pretraining. It only makes facts available.
- Style or tone changes. Those come from fine-tuning or prompting, not retrieval.
- Magical "search your docs" quality. Retrieval quality is capped by chunking, embedding, and reranking choices, and vanilla setups fall well short of that ceiling.
- Guaranteed factuality. The model can still hallucinate over retrieved context if the context is noisy, contradictory, or incomplete.
The surface is small. The depth is enormous.
A naive RAG system fits on one whiteboard. But each of those three steps has its own sub-discipline:
- Before embedding, you have to parse documents, which for PDFs alone is a category of software.
- Chunking strategies affect retrieval quality as much as model choice.
- Vector indexes have failure modes that don't show up until 10M+ documents.
- Retrieval itself is a stack of techniques (dense, sparse, hybrid, rerank, query rewriting) where each layer compounds on the last.
- Evaluation is its own rabbit hole: "is my RAG better?" is genuinely hard to answer.
- Production concerns (latency, cost, observability, security) transform the system once it leaves the notebook.
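To make the "stack of techniques" point concrete, here is one common way to combine dense and sparse result lists: reciprocal rank fusion (RRF). This is a sketch, not a full hybrid retriever; the constant 60 is the conventional smoothing value from the RRF literature.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs (e.g. dense + sparse results)
    into one ranking. Each list contributes 1 / (k + rank) per document,
    so documents that rank well in multiple lists rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears near the top of both the dense and the sparse list beats one that tops only a single list, which is exactly the behavior you want from hybrid retrieval.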
The reason I wrote 56 pages on this topic is that every one of those layers matters, and most teams under-invest in all of them.
The mental model
Think of RAG as a pipeline where information flows from documents to answers. The quality of the final answer is bounded by the worst stage of that pipeline. You can have world-class embeddings and a terrible chunking strategy, and you'll get poor answers. You can have perfect retrieval and a weak generator, and you'll get poor answers. The job of a RAG engineer is to keep all stages strong enough that the overall product is good.
Call it the weakest-link law. It's the single most useful frame for debugging a bad RAG system.
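A back-of-the-envelope version of that frame: if each stage succeeds independently with some probability, end-to-end quality is roughly their product, which can never exceed the weakest stage. The numbers below are illustrative, not measured.

```python
# Hypothetical per-stage success rates for a RAG pipeline.
stage_recall = {
    "parsing": 0.98,     # document parsed without losing content
    "chunking": 0.90,    # answer survives intact in some chunk
    "retrieval": 0.75,   # that chunk is actually retrieved
    "generation": 0.95,  # model answers faithfully from context
}

# Assuming independent failures, end-to-end success is the product,
# and it is always capped by the weakest stage.
end_to_end = 1.0
for rate in stage_recall.values():
    end_to_end *= rate

print(f"end-to-end ≈ {end_to_end:.3f}, weakest stage = {min(stage_recall.values())}")
```

Here the 0.75 retrieval stage drags the whole pipeline below 0.63, which is why improving retrieval usually pays off before swapping in a stronger generator.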
Next: Why RAG over fine-tuning.