A RAG system isn't one thing. It's a pipeline with ten distinct stages, each of which has its own design space. Understanding the full map is how you move from "we're using RAG" to "we're making specific, defensible choices at each layer." Here's the map I use.
Sources: where documents come from. Confluence, Google Drive, Notion, S3, SharePoint, databases, APIs, websites, Slack. Every source has its own auth model, rate limits, and permission system. Ingestion connectors are often the biggest single engineering cost.
Parsing: turning raw files into text. PDFs, HTML, Word docs, spreadsheets, scanned images, video transcripts. The quality of downstream retrieval is capped by how well parsing preserves structure (headings, tables, lists).
Cleaning: stripping boilerplate, fixing encoding, deduplicating, normalizing whitespace, handling multiple languages. Small choices here affect retrieval accuracy more than most teams realize.
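Two of those small choices, whitespace normalization and exact-duplicate removal, can be sketched in a few lines. This is a minimal illustration, not a production cleaner; the function names are my own:

```python
import hashlib
import re


def clean(text: str) -> str:
    """Collapse runs of spaces/tabs and drop blank lines."""
    lines = [re.sub(r"[ \t]+", " ", ln).strip() for ln in text.splitlines()]
    return "\n".join(ln for ln in lines if ln)


def dedupe(docs: list[str]) -> list[str]:
    """Drop exact duplicates, comparing the *cleaned* text.

    Hashing after cleaning means two copies that differ only in
    whitespace still count as the same document.
    """
    seen, out = set(), []
    for doc in docs:
        digest = hashlib.sha256(clean(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(doc)
    return out
```

Note the ordering choice: dedupe on cleaned text, but keep the original text in the output, so downstream stages decide their own normalization.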
Chunking: splitting documents into retrievable units. The chunk size, overlap, and boundary rules determine what retrieval can possibly return. See the chunking section.
Metadata: attaching structured data to each chunk (source URL, document type, author, timestamp, tags, permissions). This is what lets you filter, sort, and scope retrieval later.
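One way to picture the payoff is permission scoping: filter by metadata *before* any similarity search, so a user can never retrieve a chunk they aren't allowed to see. The record shape and field names below are illustrative assumptions:

```python
from dataclasses import dataclass, field


@dataclass
class ChunkRecord:
    """One retrievable unit plus the metadata used to scope retrieval."""
    text: str
    source_url: str
    doc_type: str
    author: str
    updated_at: str  # ISO 8601 timestamp
    tags: list[str] = field(default_factory=list)
    allowed_groups: list[str] = field(default_factory=list)


def visible_to(records: list[ChunkRecord], group: str) -> list[ChunkRecord]:
    """Permission filter applied before any similarity search."""
    return [r for r in records if group in r.allowed_groups]
```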
Embedding: converting each chunk into a vector using an embedding model. The model choice sets a ceiling on retrieval quality, and its cost compounds across every reindex.
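The compounding is easy to underestimate, so it's worth doing the arithmetic up front. The numbers below are hypothetical (corpus size, token counts, and the $0.10-per-million-tokens price are placeholders, not any vendor's actual rate):

```python
def reindex_cost_usd(num_chunks: int,
                     avg_tokens_per_chunk: int,
                     price_per_million_tokens: float) -> float:
    """Embedding spend for one full reindex of the corpus."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical: 5M chunks at ~300 tokens each, $0.10 per million tokens.
# One reindex is ~$150; a weekly reindex cadence makes that ~$7,800/year,
# and switching embedding models forces a full reindex on top of that.
one_reindex = reindex_cost_usd(5_000_000, 300, 0.10)
```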
Indexing: storing vectors in a searchable index. Pinecone, Weaviate, Qdrant, Chroma, pgvector, Milvus, FAISS. Each has different performance characteristics at scale.
Retrieval: at query time, embed the question, search the index, possibly combine with keyword search, possibly rerank. This is the layer where most production RAG systems win or lose.
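The "possibly combine" step is the core idea of hybrid retrieval: blend a dense similarity score with a lexical one. The sketch below uses cosine similarity over toy vectors and a crude term-overlap score as a stand-in for BM25; the `alpha` blend weight and all names are illustrative assumptions, not a specific library's API:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_score(query: str, text: str) -> float:
    """Fraction of query terms present in the chunk (crude BM25 stand-in)."""
    terms = set(query.lower().split())
    hits = sum(1 for t in terms if t in text.lower())
    return hits / len(terms) if terms else 0.0


def hybrid_search(query: str, query_vec: list[float],
                  chunks: list[tuple[str, list[float]]],
                  top_k: int = 3, alpha: float = 0.7) -> list[str]:
    """Blend dense and keyword scores, return the top_k chunk texts."""
    scored = sorted(
        ((alpha * cosine(query_vec, vec)
          + (1 - alpha) * keyword_score(query, text), text)
         for text, vec in chunks),
        key=lambda s: s[0], reverse=True,
    )
    return [text for _, text in scored[:top_k]]
```

In production the dense side is an ANN index and the lexical side is BM25 with its own index, fused with something like reciprocal rank fusion rather than a fixed `alpha`; the structure of the decision is the same.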
Generation: passing retrieved context plus the user's question to an LLM. Prompt design, model choice, structured output schemas, citation formatting.
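Prompt design and citation formatting meet in how the context is laid out. One common pattern, sketched here with assumed field names, is to number each chunk so the model can cite by bracketed index:

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Number each chunk so the model can cite sources as [1], [2], ..."""
    context = "\n\n".join(
        f"[{i}] ({c['source_url']})\n{c['text']}"
        for i, c in enumerate(chunks, 1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The numbering does double duty: the model's `[2]` can be mapped back to a source URL in the UI, and answers citing numbers that were never retrieved are a cheap hallucination signal for the observability layer.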
Observability: logging queries, retrieved chunks, and final answers; collecting user feedback; building evaluation datasets. Without this layer the system silently degrades.
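The minimum viable version is one structured record per query; the field names below are illustrative, and the `sink` parameter is a stand-in for wherever your logs actually go:

```python
import json
import time


def log_interaction(query, retrieved, answer, feedback=None, sink=print):
    """Emit one structured record per query.

    These records are the raw material for evaluation datasets:
    queries with thumbs-down feedback become regression cases.
    """
    record = {
        "ts": time.time(),
        "query": query,
        "retrieved_chunk_ids": [c["id"] for c in retrieved],
        "retrieval_scores": [c["score"] for c in retrieved],
        "answer": answer,
        "user_feedback": feedback,  # e.g. thumbs up/down, filled in later
    }
    sink(json.dumps(record))
    return record
```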
These ten stages split into two workflows:

- The indexing pipeline (sources through indexing): runs offline or on a schedule, and scales with corpus size.
- The query pipeline (retrieval through observability): runs per request, and scales with traffic.
The split matters because they have different scaling profiles, different failure modes, and usually different engineering ownership. Teams that conflate them end up with systems where a re-ingestion job blocks live queries, or where a query optimization breaks the indexer.
For every new RAG project, I write down a one-pager covering:
Sources: which systems feed the index? Auth model for each?
Parsing: which file types, which library, what's the fallback?
Chunking: size, overlap, boundary rules, structure-aware or not?
Metadata: what fields per chunk, used for filtering how?
Embedding: model, dimensions, cost per million chunks, reindex cadence?
Index: which DB, algorithm (HNSW/IVF/PQ), sharding, replication?
Retrieval: dense-only, hybrid, rerank? Top-k? Score cutoff?
Generation: which model, prompt structure, citation format?
Eval: what metrics, what golden set, continuous or episodic?
Observability: logging, alerting, feedback loops, human review?
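One way to keep the one-pager honest is to encode it as a checklist and flag the unanswered questions; the field names here simply mirror the ten items above, and the example answers are illustrative:

```python
ONE_PAGER_QUESTIONS = [
    "sources", "parsing", "chunking", "metadata", "embedding",
    "index", "retrieval", "generation", "eval", "observability",
]


def unanswered(design: dict) -> list[str]:
    """Return the stages the team has not yet made a decision about."""
    return [q for q in ONE_PAGER_QUESTIONS if not design.get(q)]


# A team that has only thought about two of the ten layers:
design = {
    "retrieval": "hybrid, top_k=8, rerank",
    "generation": "frontier model, numbered citations",
}
```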
Ten questions. Most teams have answers for 3-4. The rest becomes "we'll figure that out later", and "later" is when the system breaks in production.
A surprising number of production RAG systems take the default path: fixed-size chunks, a single dense index, top-k results stuffed into a prompt, no evaluation. It's not terrible, but it's the system that will plateau at "okay" and never get to "good." Everything in this section is the work of getting past that plateau.
Next: When not to use RAG.