The ingestion pipeline
Updated 2026-04-18
Ingestion is the unsexy 80% of any production RAG system. Demos skip it. Real systems live and die by it. If your ingestion is broken, no amount of retrieval sophistication saves you: you're doing advanced reasoning on garbage.
The canonical pipeline
- Connect to source (API, filesystem, storage bucket, database)
- Detect changes (full, delta, or event-driven)
- Fetch raw document with auth and rate limiting
- Parse into structured text plus metadata
- Clean + normalize (boilerplate, encoding, whitespace)
- Enrich (extract entities, tags, summaries)
- Chunk
- Embed
- Upsert to index with metadata
- Log + monitor
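Strung together, the steps look like this for a single document. This is a minimal sketch, not a real library: the parser, cleaner, and chunker are stubs, and `index` and `embed` stand in for whatever store and embedding model you use.

```python
def parse(raw):
    """Stub parser: real systems dispatch on file type."""
    return raw.decode("utf-8"), {"content_type": "text/plain"}

def clean(text):
    """Normalize whitespace; real cleaning also strips boilerplate."""
    return " ".join(text.split())

def chunk_text(text, size=100):
    """Naive fixed-size chunker, for illustration only."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest_one(doc_id, raw, index, embed):
    """One end-to-end pass of the pipeline for a single document.
    `index` is any dict-like store; `embed` is any text -> vector
    callable. Both are assumptions for this sketch."""
    text, meta = parse(raw)
    text = clean(text)
    for pos, piece in enumerate(chunk_text(text)):
        index[(doc_id, pos)] = {"vector": embed(piece),
                                "text": piece,
                                "position": pos,
                                **meta}
    return len(index)
```

The point of the sketch is the shape, not the details: each stage is a pure function over the previous stage's output, which is what makes the pipeline testable and retryable step by step.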
Full vs delta vs event-driven
Full reindex
Simplest. Rebuild the entire index from scratch on a schedule. Safe, easy to reason about. Expensive and slow for large corpora. Okay up to ~100K documents or whenever the corpus fits in a nightly rebuild window.
Delta sync
Only re-ingest documents that changed since the last sync. Requires reliable change detection: last-modified timestamps, content hashes, or a change log from the source system. Most common pattern at medium scale.
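Hash-based delta sync fits in a few lines. A minimal sketch, assuming `documents` maps IDs to raw bytes and `seen_hashes` is the persisted hash state from the previous run:

```python
import hashlib

def delta_sync(documents, seen_hashes):
    """Return IDs of documents whose content hash changed since the
    last sync. `documents` maps doc_id -> raw bytes; `seen_hashes`
    is a persisted dict of doc_id -> hash (both assumed interfaces)."""
    changed = []
    for doc_id, raw in documents.items():
        digest = hashlib.sha256(raw).hexdigest()
        if seen_hashes.get(doc_id) != digest:
            changed.append(doc_id)
            seen_hashes[doc_id] = digest  # persist for the next run
    return changed
```

In production the hash state lives in a database, and you also need to handle deletions (IDs in `seen_hashes` that no longer appear in the source).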
Event-driven
Source system emits events (webhook, message queue, CDC stream) that trigger per-document updates. Near-real-time. Most complex to operate. Required when freshness matters in minutes rather than hours.
Pick the least complex pattern that meets your freshness requirement. Most teams over-engineer this.
Idempotency
Every ingestion step must be safely retryable. Networks fail, parsers crash, embedding APIs rate-limit. The pipeline needs to handle partial failures without producing duplicate chunks or corrupted indexes.
Implementation patterns:
- Deterministic chunk IDs (hash of document ID + chunk position)
- Upsert by ID, not insert
- Store document-level metadata separately from chunk embeddings so you can reprocess chunks without re-fetching sources
- Use a durable job queue (not in-memory tasks) for anything that runs longer than seconds
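The first two patterns together are what make retries safe. A sketch, using a plain dict to stand in for the vector index:

```python
import hashlib

def chunk_id(doc_id, position):
    """Deterministic chunk ID: hash of document ID + chunk position,
    so a retried job produces the same IDs as the first attempt."""
    return hashlib.sha256(f"{doc_id}:{position}".encode()).hexdigest()

def upsert_chunks(index, doc_id, chunks):
    """Upsert by ID, not insert: re-running after a partial failure
    overwrites existing chunks instead of duplicating them.
    `index` is any dict-like store standing in for a real index."""
    for position, text in enumerate(chunks):
        index[chunk_id(doc_id, position)] = {
            "doc_id": doc_id,
            "position": position,
            "text": text,
        }
```

Run `upsert_chunks` twice with the same inputs and the index is unchanged, which is exactly the property you want when a job dies halfway through and gets retried.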
The change detection problem
Most RAG failures in production come from silent data drift: the source system changed, the pipeline didn't notice, and the index is stale.
What to monitor:
- Document count per source (sudden drops = upstream broken)
- Time since last successful sync per source
- Hash collisions or version mismatches
- Failed parse rate per document type
- Embedding failures
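The first two checks are cheap to automate. A sketch, assuming a per-source metrics dict with `count`, `previous_count`, and `last_sync` fields (this shape is an assumption, not a standard schema):

```python
from datetime import datetime, timedelta, timezone

def check_source_health(stats, now=None,
                        max_age=timedelta(hours=24), max_drop=0.1):
    """Flag sources that look stale or broken. `stats` maps source
    name -> {"count", "previous_count", "last_sync"}."""
    now = now or datetime.now(timezone.utc)
    alerts = []
    for source, s in stats.items():
        if now - s["last_sync"] > max_age:
            alerts.append(f"{source}: no successful sync for {now - s['last_sync']}")
        if s["previous_count"] and s["count"] < s["previous_count"] * (1 - max_drop):
            alerts.append(f"{source}: document count dropped "
                          f"{s['previous_count']} -> {s['count']}")
    return alerts
```

Tune `max_age` to your freshness SLA and `max_drop` to the normal churn of each source; a 10% overnight drop is routine for some corpora and a five-alarm fire for others.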
The permissions propagation problem
Documents in source systems have access controls. If you don't propagate those controls into your index, you'll leak data across tenants. Every chunk needs to carry its permission metadata from source to retrieval.
Two main patterns:
- Static permissions at ingest: record who can see each chunk at ingestion time. Filter at query time. Fast but requires reindexing when permissions change.
- Dynamic permissions at query: store a reference to the source document, check permissions against the source system at query time. Slower but always accurate.
Static with incremental permission sync is the common production compromise.
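The static pattern's query-time side is a set intersection. A sketch, where the chunk shape (`allowed_groups` recorded at ingest) is an illustrative assumption:

```python
def filter_by_permissions(chunks, user_groups):
    """Static-permissions pattern: each chunk carries the groups that
    may see it, recorded at ingestion time. Filter retrieval hits
    before they ever reach the prompt."""
    allowed = set(user_groups)
    return [c for c in chunks if allowed & set(c["allowed_groups"])]
```

The critical detail is where the filter runs: inside the retrieval layer, before chunks reach the LLM, never as a post-hoc check on generated output.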
The "where did this come from" problem
Every chunk in your index must be traceable back to its source document. At minimum:
- Source system identifier
- Document ID in that system
- Document URL (for citations and for the user)
- Ingestion timestamp
- Chunk position within the document
- Embedding model version (so you know when to reindex)
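That minimum set fits in one small record. Field names here are illustrative, not a standard schema:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ChunkMetadata:
    """Minimum provenance every chunk should carry."""
    source_system: str    # e.g. "confluence", "s3"
    document_id: str      # document ID in the source system
    document_url: str     # for citations and for the user
    ingested_at: str      # ISO 8601 ingestion timestamp
    chunk_position: int   # position within the document
    embedding_model: str  # so you know when to reindex
```

Storing this alongside every embedding is cheap; retrofitting it onto an index that never had it usually means a full reindex.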
Without this metadata, your RAG system can't produce real citations, and it can't recover from bad chunks: you can't find and fix what you can't trace.
Tools I actually use
- Unstructured.io, a polymorphic document parser that handles most file types reasonably well
- LlamaIndex for high-level pipeline orchestration
- Temporal or Airflow for production workflow scheduling
- Celery or SQS for job queues at smaller scale
- Custom Python for everything that actually ships, because every real corpus has weird edge cases
The build-vs-buy trap
Commercial "RAG-as-a-service" ingestion tools (Vectara, Superlinked, etc.) save time on the happy path and cost you when your data has edge cases, which it will. For a serious system, expect to own the ingestion pipeline end-to-end. The question isn't whether to build; it's whether to build now or after you've outgrown a vendor.
What to do with this
- Draw your own ingestion pipeline end-to-end. Mark which steps you own vs which you outsource.
- Pick the simplest pattern (full / delta / event) that meets your freshness SLA.
- Read parsing PDFs for the step that trips up most projects.