The ingestion pipeline
Updated 2026-04-18
Ingestion is the unsexy 80% of any production RAG system. Demos skip it. Real systems live and die by it. If your ingestion is broken, no amount of retrieval sophistication saves you: you're doing advanced reasoning on garbage.
The canonical pipeline
- Connect to source (API, filesystem, storage bucket, database)
- Detect changes (full, delta, or event-driven)
- Fetch raw document with auth and rate limiting
- Parse into structured text plus metadata
- Clean + normalize (boilerplate, encoding, whitespace)
- Enrich (extract entities, tags, summaries)
- Chunk
- Embed
- Upsert to index with metadata
- Log + monitor
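Strung together, the steps look like this for a single document. This is a minimal sketch, not a real library: the parser, cleaner, and chunker are stubs, and `index` and `embed` stand in for whatever store and embedding model you use.

```python
def parse(raw):
    """Stub parser: real systems dispatch on file type."""
    return raw.decode("utf-8"), {"content_type": "text/plain"}

def clean(text):
    """Normalize whitespace; real cleaning also strips boilerplate."""
    return " ".join(text.split())

def chunk_text(text, size=100):
    """Naive fixed-size chunker, for illustration only."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest_one(doc_id, raw, index, embed):
    """One end-to-end pass of the pipeline for a single document.
    `index` is any dict-like store; `embed` is any text -> vector
    callable. Both are assumptions for this sketch."""
    text, meta = parse(raw)
    text = clean(text)
    for pos, piece in enumerate(chunk_text(text)):
        index[(doc_id, pos)] = {"vector": embed(piece),
                                "text": piece,
                                "position": pos,
                                **meta}
    return len(index)
```

The point of the sketch is the shape, not the details: each stage is a pure function over the previous stage's output, which is what makes the pipeline testable and retryable step by step.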
Full vs delta vs event-driven
Full reindex
Simplest. Rebuild the entire index from scratch on a schedule. Safe, easy to reason about. Expensive and slow for large corpora. Okay up to ~100K documents or whenever the corpus fits in a nightly rebuild window.
Delta sync
Only re-ingest documents that changed since the last sync. Requires reliable change detection: last-modified timestamps, content hashes, or a change log from the source system. Most common pattern at medium scale.
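Hash-based delta sync fits in a few lines. A minimal sketch, assuming `documents` maps IDs to raw bytes and `seen_hashes` is the persisted hash state from the previous run:

```python
import hashlib

def delta_sync(documents, seen_hashes):
    """Return IDs of documents whose content hash changed since the
    last sync. `documents` maps doc_id -> raw bytes; `seen_hashes`
    is a persisted dict of doc_id -> hash (both assumed interfaces)."""
    changed = []
    for doc_id, raw in documents.items():
        digest = hashlib.sha256(raw).hexdigest()
        if seen_hashes.get(doc_id) != digest:
            changed.append(doc_id)
            seen_hashes[doc_id] = digest  # persist for the next run
    return changed
```

In production the hash state lives in a database, and you also need to handle deletions (IDs in `seen_hashes` that no longer appear in the source).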
Event-driven
Source system emits events (webhook, message queue, CDC stream) that trigger per-document updates. Near-real-time. Most complex to operate. Required when freshness matters in minutes rather than hours.
Pick the least complex pattern that meets your freshness requirement. Most teams over-engineer this.
Idempotency
Every ingestion step must be safely retryable. Networks fail, parsers crash, embedding APIs rate-limit. The pipeline needs to handle partial failures without producing duplicate chunks or corrupted indexes.
Implementation patterns:
- Deterministic chunk IDs (hash of document ID + chunk position)
- Upsert by ID, not insert
- Store document-level metadata separately from chunk embeddings so you can reprocess chunks without re-fetching sources
- Use a durable job queue (not in-memory tasks) for anything that runs longer than seconds
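The first two patterns together are what make retries safe. A sketch, using a plain dict to stand in for the vector index:

```python
import hashlib

def chunk_id(doc_id, position):
    """Deterministic chunk ID: hash of document ID + chunk position,
    so a retried job produces the same IDs as the first attempt."""
    return hashlib.sha256(f"{doc_id}:{position}".encode()).hexdigest()

def upsert_chunks(index, doc_id, chunks):
    """Upsert by ID, not insert: re-running after a partial failure
    overwrites existing chunks instead of duplicating them.
    `index` is any dict-like store standing in for a real index."""
    for position, text in enumerate(chunks):
        index[chunk_id(doc_id, position)] = {
            "doc_id": doc_id,
            "position": position,
            "text": text,
        }
```

Run `upsert_chunks` twice with the same inputs and the index is unchanged, which is exactly the property you want when a job dies halfway through and gets retried.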
The change detection problem
Most RAG failures in production come from silent data drift: the source system changed, the pipeline didn't notice, and the index is stale.
What to monitor:
- Document count per source (sudden drops = upstream broken)
- Time since last successful sync per source
- Hash collisions or version mismatches
- Failed parse rate per document type
- Embedding failures
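The first two checks are cheap to automate. A sketch, assuming a per-source metrics dict with `count`, `previous_count`, and `last_sync` fields (this shape is an assumption, not a standard schema):

```python
from datetime import datetime, timedelta, timezone

def check_source_health(stats, now=None,
                        max_age=timedelta(hours=24), max_drop=0.1):
    """Flag sources that look stale or broken. `stats` maps source
    name -> {"count", "previous_count", "last_sync"}."""
    now = now or datetime.now(timezone.utc)
    alerts = []
    for source, s in stats.items():
        if now - s["last_sync"] > max_age:
            alerts.append(f"{source}: no successful sync for {now - s['last_sync']}")
        if s["previous_count"] and s["count"] < s["previous_count"] * (1 - max_drop):
            alerts.append(f"{source}: document count dropped "
                          f"{s['previous_count']} -> {s['count']}")
    return alerts
```

Tune `max_age` to your freshness SLA and `max_drop` to the normal churn of each source; a 10% overnight drop is routine for some corpora and a five-alarm fire for others.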
The permissions propagation problem
Documents in source systems have access controls. If you don't propagate those controls into your index, you'll leak data across tenants. Every chunk needs to carry its permission metadata from source to retrieval.
Two main patterns:
- Static permissions at ingest: record who can see each chunk at ingestion time. Filter at query time. Fast but requires reindexing when permissions change.
- Dynamic permissions at query: store a reference to the source document, check permissions against the source system at query time. Slower but always accurate.
Static with incremental permission sync is the common production compromise.
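The static pattern's query-time side is a set intersection. A sketch, where the chunk shape (`allowed_groups` recorded at ingest) is an illustrative assumption:

```python
def filter_by_permissions(chunks, user_groups):
    """Static-permissions pattern: each chunk carries the groups that
    may see it, recorded at ingestion time. Filter retrieval hits
    before they ever reach the prompt."""
    allowed = set(user_groups)
    return [c for c in chunks if allowed & set(c["allowed_groups"])]
```

The critical detail is where the filter runs: inside the retrieval layer, before chunks reach the LLM, never as a post-hoc check on generated output.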
The "where did this come from" problem
Every chunk in your index must be traceable back to its source document. At minimum:
- Source system identifier
- Document ID in that system
- Document URL (for citations and for the user)
- Ingestion timestamp
- Chunk position within the document
- Embedding model version (so you know when to reindex)
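That minimum set fits in one small record. Field names here are illustrative, not a standard schema:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ChunkMetadata:
    """Minimum provenance every chunk should carry."""
    source_system: str    # e.g. "confluence", "s3"
    document_id: str      # document ID in the source system
    document_url: str     # for citations and for the user
    ingested_at: str      # ISO 8601 ingestion timestamp
    chunk_position: int   # position within the document
    embedding_model: str  # so you know when to reindex
```

Storing this alongside every embedding is cheap; retrofitting it onto an index that never had it usually means a full reindex.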
Without this metadata, your RAG system can't produce real citations, and it can't recover from bad chunks: you can't find and fix what you can't trace.
Tools I actually use
- Unstructured.io, a polymorphic document parser that handles most file types reasonably well
- LlamaIndex for high-level pipeline orchestration
- Temporal or Airflow for production workflow scheduling
- Celery or SQS for job queues at smaller scale
- Custom Python for everything that actually ships, because every real corpus has weird edge cases
The build-vs-buy trap
Commercial "RAG-as-a-service" ingestion tools (Vectara, Superlinked, etc.) save time on the happy path and cost you when your data has edge cases, which it will. For a serious system, expect to own the ingestion pipeline end-to-end. The question isn't whether to build; it's whether to build now or after you've outgrown a vendor.
What to do with this
- Draw your own ingestion pipeline end-to-end. Mark which steps you own vs which you outsource.
- Pick the simplest pattern (full / delta / event) that meets your freshness SLA.
- Read parsing PDFs for the step that trips up most projects.