What is RAG?
Updated 2026-04-18
RAG is short for Retrieval-Augmented Generation. At its simplest: before you ask the language model a question, you retrieve relevant context from your own data and paste it into the prompt. The model then answers using that context. It's three moves (embed, retrieve, generate) dressed up with pipelines, metadata, and increasingly sophisticated orchestration on top.
The three-step loop
- Embed. Convert your documents into numeric vectors using an embedding model. Store the vectors in a searchable index.
- Retrieve. At query time, embed the user's question and find the closest document vectors by similarity.
- Generate. Concatenate the retrieved chunks with the user's question and send the combined prompt to an LLM.
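The three steps above can be sketched in a few lines of plain Python. `embed` and `llm_complete` are hypothetical stand-ins for whatever embedding model and LLM client you use, and a real system would use a vector index rather than this linear scan over all documents.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def retrieve(query_vec, doc_vecs, k=3):
    """Indices of the k document vectors closest to the query vector."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

def answer(question, docs, doc_vecs, embed, llm_complete, k=3):
    # 1. Embed the question with the same model used for the documents.
    q_vec = embed(question)
    # 2. Retrieve the k closest chunks by similarity.
    context = "\n\n".join(docs[i] for i in retrieve(q_vec, doc_vecs, k))
    # 3. Generate: concatenate context and question into one prompt.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm_complete(prompt)
```

Everything that follows in this section is refinement of one of those three functions.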
That's vanilla RAG. Everything in this section is what you do once you realize vanilla RAG produces a good demo and a mediocre product.
What RAG gives you
- Grounded answers. The model responds with your data, not its pretraining.
- Citations. Because the retrieval step returns specific chunks, you can link answers to source documents.
- Freshness. Update the index, update the answers. No retraining required.
- Access control. Filter retrieved documents by user, tenant, or permission before generation.
- Cost control. You only pay for the tokens you actually pass into the context window.
- Auditable behavior. Logs of what was retrieved let you debug why a model said what it said.
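To make the access-control point concrete: a post-retrieval permission filter can be a one-liner. The `allowed_groups` metadata field here is a hypothetical schema, and most vector stores also support filtering at query time, which is preferable at scale because it never loads forbidden chunks at all.

```python
def filter_by_permission(chunks, user_groups):
    """Drop retrieved chunks the user may not see, *before* generation.

    Assumes each chunk dict carries an `allowed_groups` set in its
    metadata (a hypothetical schema; adapt to your store's filters).
    """
    return [c for c in chunks if c["allowed_groups"] & user_groups]
```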
What RAG doesn't give you
- Reasoning you don't already have. RAG can't teach a model to solve a problem it didn't learn in pretraining. It only makes facts available.
- Style or tone changes. Those come from fine-tuning or prompting, not retrieval.
- Magical "search your docs" quality. Retrieval quality is capped by chunking, embedding, and reranking choices, and vanilla setups fall well short of that ceiling.
- Guaranteed factuality. The model can still hallucinate over retrieved context if the context is noisy, contradictory, or incomplete.
The surface is small. The depth is enormous.
A naive RAG system fits on one whiteboard. But each of those three steps has its own sub-discipline:
- Before embedding, you have to parse documents, which for PDFs alone is a category of software.
- Chunking strategies affect retrieval quality as much as model choice.
- Vector indexes have failure modes that don't show up until 10M+ documents.
- Retrieval itself is a stack of techniques (dense, sparse, hybrid, rerank, query rewriting) where each layer compounds on the last.
- Evaluation is its own rabbit hole: "is my RAG better?" is genuinely hard to answer.
- Production concerns (latency, cost, observability, security) transform the system once it leaves the notebook.
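To make the "stack of techniques" point concrete, here is one common way to combine dense and sparse result lists: reciprocal rank fusion (RRF). This is a sketch, not a full hybrid retriever; the constant 60 is the conventional smoothing value from the RRF literature.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs (e.g. dense + sparse results)
    into one ranking. Each list contributes 1 / (k + rank) per document,
    so documents that rank well in multiple lists rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears near the top of both the dense and the sparse list beats one that tops only a single list, which is exactly the behavior you want from hybrid retrieval.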
The reason I wrote 56 pages on this topic is that every one of those layers matters, and most teams under-invest in all of them.
The mental model
Think of RAG as a pipeline where information flows from documents to answers. The quality of the final answer is bounded by the worst stage of that pipeline. You can have world-class embeddings and a terrible chunking strategy, and you'll get poor answers. You can have perfect retrieval and a weak generator, and you'll get poor answers. The job of a RAG engineer is to keep all stages strong enough that the overall product is good.
Call it the weakest-link law. It's the single most useful frame for debugging a bad RAG system.
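A back-of-the-envelope version of that frame: if each stage succeeds independently with some probability, end-to-end quality is roughly their product, which can never exceed the weakest stage. The numbers below are illustrative, not measured.

```python
# Hypothetical per-stage success rates for a RAG pipeline.
stage_recall = {
    "parsing": 0.98,     # document parsed without losing content
    "chunking": 0.90,    # answer survives intact in some chunk
    "retrieval": 0.75,   # that chunk is actually retrieved
    "generation": 0.95,  # model answers faithfully from context
}

# Assuming independent failures, end-to-end success is the product,
# and it is always capped by the weakest stage.
end_to_end = 1.0
for rate in stage_recall.values():
    end_to_end *= rate

print(f"end-to-end ≈ {end_to_end:.3f}, weakest stage = {min(stage_recall.values())}")
```

Here the 0.75 retrieval stage drags the whole pipeline below 0.63, which is why improving retrieval usually pays off before swapping in a stronger generator.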
Next: Why RAG over fine-tuning.