What is RAG?

RAG is short for Retrieval-Augmented Generation. At its simplest: before you ask the language model a question, you retrieve relevant context from your own data and paste it into the prompt. The model then answers using that context. Underneath the pipelines, metadata, and increasingly sophisticated orchestration sit three moves: embed, retrieve, generate.

The three-step loop

  1. Embed. Convert your documents into numeric vectors using an embedding model. Store the vectors in a searchable index.
  2. Retrieve. At query time, embed the user's question and find the closest document vectors by similarity.
  3. Generate. Concatenate the retrieved chunks with the user's question and send the combined prompt to an LLM.
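
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production recipe: the `embed` function here is a toy bag-of-words stand-in for a real embedding model, and the final prompt would be sent to whatever LLM you use.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" standing in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Embed: convert documents to vectors and store them in an index.
docs = [
    "The billing API rate limit is 100 requests per minute.",
    "Our office dress code is casual on Fridays.",
]
index = [(doc, embed(doc)) for doc in docs]

# 2. Retrieve: embed the question, find the closest document.
question = "What is the API rate limit?"
qvec = embed(question)
top_doc, _ = max(index, key=lambda pair: cosine(qvec, pair[1]))

# 3. Generate: concatenate retrieved context with the question.
prompt = f"Context:\n{top_doc}\n\nQuestion: {question}\nAnswer using only the context."
# ...send `prompt` to an LLM of your choice.
```

Swap in a real embedding model and a vector database and you have vanilla RAG; the shape of the loop stays the same.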

That's vanilla RAG. Everything in this section is what you do once you realize vanilla RAG produces a good demo and a mediocre product.

What RAG gives you

What RAG doesn't give you

The surface is small. The depth is enormous.

A naive RAG system fits on one whiteboard. But each of those three steps is a sub-discipline of its own: chunking strategy and embedding-model choice, index design and reranking, prompt assembly and grounded generation.

The reason I wrote 56 pages on this topic is that every one of those layers matters, and most teams under-invest in all of them.

The mental model

Think of RAG as a pipeline where information flows from documents to answers. The quality of the final answer is bounded by the worst stage of that pipeline. You can have world-class embeddings and a terrible chunking strategy, and you'll get poor answers. You can have perfect retrieval and a weak generator, and you'll get poor answers. The job of a RAG engineer is to keep all stages strong enough that the overall product is good.

Call it the minimum-weak-link law. It's the single most useful frame for debugging a bad RAG system.
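
The debugging frame can be made concrete. Suppose you score each stage separately in offline evals (the stage names and scores below are hypothetical); the minimum-weak-link law says overall quality is capped by the lowest score, so that stage is where to spend effort first.

```python
# Hypothetical per-stage quality scores (0-1), e.g. from offline evals.
stages = {"chunking": 0.90, "embedding": 0.95, "retrieval": 0.60, "generation": 0.92}

# Minimum-weak-link law: overall quality is bounded by the worst stage.
bottleneck = min(stages, key=stages.get)
ceiling = stages[bottleneck]
print(f"Fix {bottleneck} first (quality ceiling: {ceiling:.2f})")
```

Improving any other stage while retrieval sits at 0.60 moves the ceiling not at all, which is why per-stage evaluation matters more than end-to-end scores when a system underperforms.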

Next: Why RAG over fine-tuning.