The RAG architecture map

A RAG system isn't one thing. It's a pipeline of ten distinct stages, each with its own design space. Understanding the full map is how you move from "we're using RAG" to "we're making specific, defensible choices at each layer." Here's the map I use.

The ten stages

1. Source integration

Where documents come from. Confluence, Google Drive, Notion, S3, SharePoint, databases, APIs, websites, Slack. Every source has its own auth model, rate limits, and permission system. Ingestion connectors are often the biggest single engineering cost.

2. Parsing + extraction

Turning raw files into text. PDFs, HTML, Word docs, spreadsheets, scanned images, video transcripts. The quality of downstream retrieval is capped by how well parsing preserves structure (headings, tables, lists).
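
As a concrete illustration of structure preservation, here's a minimal sketch (stdlib only, with HTML standing in for the general problem) that keeps each paragraph attached to its nearest heading instead of flattening the document into one string:

```python
from html.parser import HTMLParser

class StructuredExtractor(HTMLParser):
    """Collect (nearest_heading, paragraph_text) pairs instead of flat text."""

    def __init__(self):
        super().__init__()
        self.blocks = []            # (heading, paragraph) pairs
        self.current_heading = ""
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3", "p"):
            self._buf = []          # start collecting a fresh block

    def handle_endtag(self, tag):
        text = "".join(self._buf).strip()
        if tag in ("h1", "h2", "h3"):
            self.current_heading = text
        elif tag == "p" and text:
            self.blocks.append((self.current_heading, text))

    def handle_data(self, data):
        self._buf.append(data)

extractor = StructuredExtractor()
extractor.feed("<h2>Refunds</h2><p>Refunds take 5 days.</p><p>Contact support.</p>")
print(extractor.blocks)
# [('Refunds', 'Refunds take 5 days.'), ('Refunds', 'Contact support.')]
```

A retriever that sees "Refunds take 5 days." tagged with its "Refunds" heading can answer questions a flat text dump can't.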

3. Cleaning + normalization

Stripping boilerplate, fixing encoding, deduplicating, normalizing whitespace, handling multiple languages. Small choices here affect retrieval accuracy more than most teams realize.
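
A minimal sketch of what those "small choices" look like in practice, using only the standard library: Unicode normalization plus whitespace collapse, with hash-based exact deduplication on top:

```python
import hashlib
import unicodedata

def normalize(text: str) -> str:
    # NFC normalization catches visually-identical strings stored with
    # different byte sequences (e.g. composed vs decomposed accents).
    text = unicodedata.normalize("NFC", text)
    # Collapse whitespace runs; "a  b" and "a b" embed differently otherwise.
    return " ".join(text.split())

def dedup(texts):
    """Drop exact duplicates after normalization, preserving order."""
    seen, out = set(), []
    for t in texts:
        key = hashlib.sha256(normalize(t).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(normalize(t))
    return out

print(dedup(["Hello   world", "Hello world", "Bye"]))  # ['Hello world', 'Bye']
```

Without the normalization step, the first two strings would both survive and later compete for the same retrieval slots.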

4. Chunking

Splitting documents into retrievable units. The chunk size, overlap, and boundary rules determine what retrieval can possibly return. See the chunking section.
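
As an illustration, a sliding-window chunker with word-level size and overlap (a deliberate simplification: production chunkers usually count tokens and respect structural boundaries like headings and paragraphs):

```python
def chunk(text: str, size: int = 200, overlap: int = 40):
    """Split text into windows of `size` words, with `overlap` words shared
    between consecutive chunks so facts near a boundary appear in both."""
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(500))
cs = chunk(doc, size=200, overlap=40)
print(len(cs))  # 3 chunks: words 0-199, 160-359, 320-499
```

The overlap is the interesting parameter: it trades index size for robustness against a relevant sentence being split across a boundary.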

5. Metadata enrichment

Attaching structured data to each chunk: source URL, document type, author, timestamp, tags, permissions. This is what lets you filter, sort, and scope retrieval later.
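
A sketch of the kind of record this implies. The field names here are illustrative, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    source_url: str
    doc_type: str            # e.g. "wiki", "ticket", "policy"
    author: str
    updated_at: str          # ISO-8601; enables recency filtering
    allowed_groups: list = field(default_factory=list)  # permission scoping

def visible_to(chunks, group):
    """Scope retrieval to chunks the requesting user's group may see."""
    return [c for c in chunks if group in c.allowed_groups]

docs = [
    Chunk("Refunds take 5 days.", "wiki/refunds", "policy",
          "maya", "2024-05-01", ["support", "finance"]),
    Chunk("Q3 revenue summary.", "drive/q3-report", "report",
          "ravi", "2024-07-09", ["finance"]),
]
print(len(visible_to(docs, "support")))  # 1
```

Permission metadata in particular has to be attached at ingestion time; bolting it on after the index exists is far more painful.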

6. Embedding

Converting each chunk into a vector using an embedding model. The model choice sets a ceiling on retrieval quality, and costs compound across every reindex.
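
The compounding is easy to quantify with back-of-envelope arithmetic. All volumes and prices below are made-up illustrations, not any vendor's rates:

```python
def reindex_cost_usd(num_chunks, avg_tokens_per_chunk,
                     price_per_mtok, reindexes_per_year):
    """Annual embedding spend for full reindexes at a given cadence."""
    tokens = num_chunks * avg_tokens_per_chunk
    return tokens / 1_000_000 * price_per_mtok * reindexes_per_year

# 2M chunks x 400 tokens, $0.10 per 1M tokens, monthly full reindex:
print(reindex_cost_usd(2_000_000, 400, 0.10, 12))  # ~960.0 USD/year
```

The same arithmetic is worth running before switching embedding models, since a model swap forces a full reindex of everything.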

7. Index storage

Storing vectors in a searchable index. Pinecone, Weaviate, Qdrant, Chroma, pgvector, Milvus, FAISS. Each has different performance characteristics at scale.
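
Before reaching for any of these, it helps to know what they replace: exact nearest-neighbor search. A brute-force baseline in plain Python:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=3):
    """index: list of (chunk_id, vector) pairs. Exact search is fine at
    small scale; ANN indexes (HNSW, IVF) earn their keep well beyond it."""
    scored = [(cid, cosine(query_vec, v)) for cid, v in index]
    return sorted(scored, key=lambda s: -s[1])[:k]

index = [("a", [1.0, 0.0]), ("b", [0.7, 0.7]), ("c", [0.0, 1.0])]
print(top_k([1.0, 0.1], index, k=2))  # "a" scores highest, then "b"
```

A dedicated vector database buys you approximate versions of exactly this operation, plus filtering, sharding, and replication around it.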

8. Retrieval

At query time: embed the question, search the index, possibly combine with keyword search, possibly rerank. This is the layer where most production RAG systems win or lose.
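
One common way to combine dense and keyword results is Reciprocal Rank Fusion, which merges ranked lists without needing their scores to be on comparable scales. A minimal sketch:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score each id by summed 1/(k + rank + 1)
    across all input rankings, then sort by that fused score."""
    scores = {}
    for ranking in rankings:
        for rank, cid in enumerate(ranking):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense   = ["c3", "c1", "c7"]   # vector-similarity order
keyword = ["c1", "c9", "c3"]   # BM25-style order
print(rrf([dense, keyword]))   # c1 and c3 rank highest: both lists agree
```

The constant `k` damps the influence of top ranks; 60 is a conventional default, not a tuned value.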

9. Generation

Passing retrieved context plus the user's question to an LLM. Prompt design, model choice, structured output schemas, citation formatting.
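
A sketch of prompt assembly with numbered citations; the instruction wording and chunk schema here are illustrative, not a recommended template:

```python
def build_prompt(question, chunks):
    """Number each retrieved chunk so the model can cite [1], [2], ...
    and answers stay auditable back to their sources."""
    context = "\n\n".join(
        f"[{i}] ({c['source']})\n{c['text']}" for i, c in enumerate(chunks, 1)
    )
    return (
        "Answer using only the context below. Cite sources as [n]. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

chunks = [{"source": "wiki/refunds", "text": "Refunds take 5 days."}]
prompt = build_prompt("How long do refunds take?", chunks)
print(prompt)
```

Keeping this assembly in one function, rather than scattered string concatenation, is what makes prompt changes testable later.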

10. Observability + feedback

Logging queries, retrieved chunks, and final answers. Collecting user feedback. Building evaluation datasets. Without this layer the system silently degrades.
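
A sketch of the minimum viable version: one structured record per query, with illustrative field names. These records later become the raw material for eval datasets and regression tests:

```python
import json
import time
import uuid

def log_query(query, retrieved_ids, scores, answer, feedback=None):
    """Serialize one query's full trace: what was asked, what the index
    returned (with scores), what the model answered, what the user said."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "query": query,
        "retrieved": [{"chunk_id": c, "score": s}
                      for c, s in zip(retrieved_ids, scores)],
        "answer": answer,
        "feedback": feedback,   # thumbs up/down, edit, escalation, ...
    }
    return json.dumps(record)   # append to a log file or event stream

line = log_query("refund window?", ["c1", "c2"], [0.82, 0.71], "5 days")
```

The key property is that retrieval scores are logged alongside the answer: when an answer is wrong, you can tell whether retrieval or generation failed.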

The data plane vs the query plane

These ten stages split into two workflows:

Data plane (stages 1-7): the ingestion path that turns raw sources into a searchable index. It runs in batch or on change events, and scales with corpus size and churn.
Query plane (stages 8-9): the request-time path from user question to answer. It runs per query and scales with traffic. Observability (stage 10) straddles both.

The split matters because they have different scaling profiles, different failure modes, and usually different engineering ownership. Teams that conflate them end up with systems where a re-ingestion job blocks live queries, or where a query optimization breaks the indexer.

The decisions I force teams to make explicit

For every new RAG project, I write down a one-pager covering:

Sources: which systems feed the index? Auth model for each?
Parsing: which file types, which library, what's the fallback?
Chunking: size, overlap, boundary rules, structure-aware or not?
Metadata: what fields per chunk, used for filtering how?
Embedding: model, dimensions, cost per million chunks, reindex cadence?
Index: which DB, algorithm (HNSW/IVF/PQ), sharding, replication?
Retrieval: dense-only, hybrid, rerank? Top-k? Score cutoff?
Generation: which model, prompt structure, citation format?
Eval: what metrics, what golden set, continuous or episodic?
Observability: logging, alerting, feedback loops, human review?

Ten questions. Most teams have answers for 3-4. The rest becomes "we'll figure that out later", and "later" is when the system breaks in production.
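
One way to force the answers into existence is to make the one-pager machine-readable, e.g. a version-controlled config so the decisions live next to the code instead of in someone's head. Every value and name below is a placeholder:

```python
# Hypothetical one-pager as config; each key maps to one of the ten questions.
RAG_CONFIG = {
    "sources":    {"confluence": {"auth": "oauth"}, "s3": {"auth": "iam"}},
    "parsing":    {"pdf": "example-parser", "fallback": "plain-text"},
    "chunking":   {"size_tokens": 400, "overlap_tokens": 80, "structure_aware": True},
    "metadata":   {"fields": ["source_url", "doc_type", "updated_at", "acl"]},
    "embedding":  {"model": "example-embed-v1", "dims": 768, "reindex": "monthly"},
    "index":      {"db": "pgvector", "algorithm": "hnsw", "replicas": 2},
    "retrieval":  {"mode": "hybrid", "top_k": 8, "score_cutoff": 0.3, "rerank": True},
    "generation": {"model": "example-llm", "citations": "bracketed"},
    "eval":       {"golden_set": "evals/golden.jsonl", "cadence": "per-release"},
    "observability": {"log_sink": "event-stream", "human_review": "weekly-sample"},
}
print(len(RAG_CONFIG))  # 10: one entry per decision
```

An unanswered question then shows up as a missing key rather than a missing conversation.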

The common anti-architecture

A surprising number of production RAG systems look like this: default parser, fixed-size chunks, whichever embedding model the tutorial used, dense-only top-k stuffed into a single prompt, no metadata filtering, no evals, no logging.

It's the default path, and it's not terrible, but it's also the system that will plateau at "okay" and never get to "good." Everything in this section is the work of getting past that plateau.

Next: When not to use RAG.