RAG is short for Retrieval-Augmented Generation. At its simplest: before you ask the language model a question, you retrieve relevant context from your own data and paste it into the prompt. The model then answers using that context. It's three moves (embed, retrieve, generate) dressed up with pipelines, metadata, and increasingly sophisticated orchestration on top.
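The three moves can be sketched in a few lines. This is a toy, not a real system: the `embed` function here is a bag-of-words counter standing in for a learned embedding model, the document list stands in for a vector database, and the final prompt would be sent to an actual language model.

```python
# Toy sketch of vanilla RAG: embed, retrieve, generate.
import math
import re
from collections import Counter

def embed(text):
    # Stand-in for a real embedding model: a word-count vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    # Rank documents by similarity to the query embedding, keep top k.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(query, docs):
    # "Generate" step: paste retrieved context into the prompt,
    # then hand this string to the language model.
    context = "\n".join(retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "The billing API rate limit is 100 requests per minute.",
    "Support tickets are triaged within one business day.",
    "Embeddings are stored in a vector index for retrieval.",
]
prompt = build_prompt("What is the billing API rate limit?", docs)
```

Everything discussed later (chunking, reranking, metadata filtering) is a refinement of one of these three functions.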
That's vanilla RAG. Everything in this section is what you do once you realize vanilla RAG produces a good demo and a mediocre product.
A naive RAG system fits on one whiteboard. But each of those three steps (embedding, retrieval, generation) has its own sub-discipline.
The reason I wrote 56 pages on this topic is that every one of those layers matters, and most teams under-invest in all of them.
Think of RAG as a pipeline where information flows from documents to answers. The quality of the final answer is bounded by the worst stage of that pipeline. You can have world-class embeddings and a terrible chunking strategy, and you'll get poor answers. You can have perfect retrieval and a weak generator, and you'll get poor answers. The job of a RAG engineer is to keep all stages strong enough that the overall product is good.
Call it the minimum-weak-link law. It's the single most useful frame for debugging a bad RAG system.
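The weak-link frame is easy to make concrete: score each stage from 0 to 1 and the ceiling on answer quality tracks the minimum stage score, not the average. The scores below are illustrative, not measurements from any real system.

```python
# Hypothetical per-stage quality scores for a RAG pipeline (0..1).
stages = {
    "chunking": 0.4,    # the neglected stage
    "embedding": 0.9,
    "retrieval": 0.85,
    "generation": 0.9,
}

average = sum(stages.values()) / len(stages)  # looks healthy
ceiling = min(stages.values())                # the real bound on quality
worst = min(stages, key=stages.get)           # where to spend debugging time

print(f"average={average:.2f} ceiling={ceiling} fix_first={worst}")
```

The gap between the average and the minimum is why a system can look fine on aggregate metrics while users keep hitting bad answers: debugging should start at `worst`, not at the stage with the fanciest tooling.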