The ingestion pipeline

Ingestion is the unsexy 80% of any production RAG system. Demos skip it. Real systems live and die by it. If your ingestion is broken, no amount of retrieval sophistication saves you: you're doing advanced reasoning on garbage.

The canonical pipeline

  1. Connect to source (API, filesystem, storage bucket, database)
  2. Detect changes (full, delta, or event-driven)
  3. Fetch raw document with auth and rate limiting
  4. Parse into structured text plus metadata
  5. Clean + normalize (boilerplate, encoding, whitespace)
  6. Enrich (extract entities, tags, summaries)
  7. Chunk
  8. Embed
  9. Upsert to index with metadata
  10. Log + monitor
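As a rough sketch, steps 4 through 9 can compose into a single per-document function. Everything here is illustrative: the 200-character chunker is arbitrary, the dict stands in for a real vector store, and the hash stands in for a real embedding call.

```python
import hashlib

def ingest(doc_id: str, raw: bytes, index: dict) -> list[str]:
    """Sketch of parse -> clean -> chunk -> embed -> upsert for one document.
    `index` stands in for a real vector store; embedding is stubbed."""
    text = raw.decode("utf-8", errors="replace")          # parse
    text = " ".join(text.split())                         # clean/normalize whitespace
    chunks = [text[i:i + 200] for i in range(0, len(text), 200)]  # naive chunking
    chunk_ids = []
    for n, chunk in enumerate(chunks):
        chunk_id = f"{doc_id}:{n}"                        # deterministic chunk ID
        embedding = hashlib.sha256(chunk.encode()).hexdigest()  # stand-in for an embedding API call
        index[chunk_id] = {"text": chunk, "embedding": embedding,
                           "source": doc_id}              # upsert with metadata
        chunk_ids.append(chunk_id)
    return chunk_ids
```

The point of the sketch is the shape: one document in, a list of chunk IDs out, every chunk carrying its source metadata.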

Full vs delta vs event-driven

Full reindex

Simplest. Rebuild the entire index from scratch on a schedule. Safe, easy to reason about. Expensive and slow for large corpora. Workable up to ~100K documents, or for as long as the corpus fits in a nightly rebuild window.

Delta sync

Only re-ingest documents that changed since the last sync. Requires reliable change detection, last-modified timestamps, content hashes, or a change log from the source system. Most common pattern at medium scale.
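When the source system offers no reliable last-modified timestamps, content hashing is the fallback change signal. A minimal sketch, assuming you persist a `doc_id -> hash` map between runs:

```python
import hashlib

def delta_sync(source_docs: dict[str, bytes], seen_hashes: dict[str, str]) -> list[str]:
    """Return IDs of documents that are new or changed since the last sync,
    using content hashes as the change signal. Mutates `seen_hashes`."""
    changed = []
    for doc_id, raw in source_docs.items():
        digest = hashlib.sha256(raw).hexdigest()
        if seen_hashes.get(doc_id) != digest:   # new document or changed content
            changed.append(doc_id)
            seen_hashes[doc_id] = digest
    return changed
```

Note this still requires fetching every document to hash it; it saves parsing, chunking, and embedding, not bandwidth.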

Event-driven

Source system emits events (webhook, message queue, CDC stream) that trigger per-document updates. Near-real-time. Most complex to operate. Required when freshness matters in minutes rather than hours.

Pick the least complex pattern that meets your freshness requirement. Most teams over-engineer this.

Idempotency

Every ingestion step must be safely retryable. Networks fail, parsers crash, embedding APIs rate-limit. The pipeline needs to handle partial failures without producing duplicate chunks or corrupted indexes.

Implementation patterns:

  - Deterministic chunk IDs derived from the document ID and chunk position, so a retry overwrites instead of duplicating
  - Upsert semantics rather than blind inserts
  - Per-document atomicity: replace all of a document's chunks together, never half of them
  - Checkpointing, so a crashed run resumes where it left off instead of restarting
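A minimal sketch of an idempotent per-document upsert, assuming a generic key-value index standing in for a vector store:

```python
import hashlib

def upsert_document(doc_id: str, chunks: list[str], index: dict) -> None:
    """Idempotent per-document upsert: deterministic chunk IDs keyed on
    position and content, plus removal of stale chunks, so a retry never
    leaves duplicates or leftovers behind."""
    new_ids = set()
    for n, chunk in enumerate(chunks):
        digest = hashlib.sha256(chunk.encode()).hexdigest()[:12]
        chunk_id = f"{doc_id}:{n}:{digest}"   # same input -> same ID on retry
        index[chunk_id] = {"doc_id": doc_id, "text": chunk}
        new_ids.add(chunk_id)
    # Drop chunks left over from earlier versions of this document.
    for stale in [k for k, v in index.items()
                  if v["doc_id"] == doc_id and k not in new_ids]:
        del index[stale]
```

Run it twice with the same input and the index is unchanged; run it after the document shrinks and the orphaned chunks disappear.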

The change detection problem

Most RAG failures in production come from silent data drift: the source system changed, the pipeline didn't notice, and the index is stale.

What to monitor:

  - Document counts, source vs. index (a gap means missed ingests; a surplus means missed deletes)
  - Time since the last successful sync, per source
  - Parse and embedding failure rates
  - Age of the newest document in the index vs. the source
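One concrete drift check is a periodic reconciliation of document IDs between source and index. A sketch:

```python
def drift_report(source_ids: set[str], index_ids: set[str]) -> dict:
    """Compare document IDs visible in the source against those in the index.
    Nonzero missing/orphaned counts are the drift signal to alert on."""
    missing = source_ids - index_ids     # in source, never ingested: index is stale
    orphaned = index_ids - source_ids    # in index, deleted at source: serving dead data
    return {"missing": sorted(missing), "orphaned": sorted(orphaned),
            "in_sync": not missing and not orphaned}
```

Cheap to run nightly even when a full reindex isn't, which is exactly when silent drift bites.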

The permissions propagation problem

Documents in source systems have access controls. If you don't propagate those controls into your index, you'll leak data across tenants. Every chunk needs to carry its permission metadata from source to retrieval.

Two main patterns:

  - Static: copy ACLs into chunk metadata at index time and filter at query time. Fast, but drifts as source permissions change.
  - Dynamic: check the user's live permissions against the source system at query time. Always correct, but slow and couples retrieval to source availability.

Static with incremental permission sync is the common production compromise.
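In the static pattern, the query-time side reduces to filtering retrieval results on the ACL metadata each chunk carries. A sketch, assuming group-based ACLs stored per chunk:

```python
def allowed_chunks(results: list[dict], user_groups: set[str]) -> list[dict]:
    """Static-ACL pattern: each chunk carries the groups allowed to read its
    source document; retrieval results are filtered before reaching the LLM."""
    return [c for c in results if user_groups & set(c["allowed_groups"])]
```

In a real vector store this belongs in the metadata filter of the query itself, not post-hoc in application code, so unauthorized chunks never leave the index.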

The "where did this come from" problem

Every chunk in your index must be traceable back to its source document. At minimum:
  - Source system and document ID
  - Document URI, for citations
  - Document version or content hash
  - Chunk position within the document
  - Ingestion timestamp
Without this metadata, your RAG system can't produce real citations, and it can't recover from bad chunks: you can't find and fix what you can't trace.
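A minimal provenance record might look like the following; the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ChunkProvenance:
    """Minimum traceability metadata attached to every chunk."""
    source_system: str    # e.g. "confluence", "s3"
    doc_id: str           # stable ID in the source system
    doc_uri: str          # link back to the source, for citations
    content_hash: str     # which version of the document the chunk came from
    chunk_index: int      # position within the document
    ingested_at: str      # ISO timestamp of the pipeline run

# Illustrative values only.
meta = ChunkProvenance("wiki", "D-42", "https://wiki.internal/D-42",
                       "9f2c1ab804e1", 3, "2024-06-01T02:00:00Z")
```

Storing this alongside the embedding costs a few bytes per chunk and buys you citations, targeted re-ingestion, and debuggability.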

Tools I actually use

The build-vs-buy trap

Commercial "RAG-as-a-service" ingestion tools (Vectara, Superlinked, etc.) save time on the happy path and cost you when your data has edge cases, which it will. For a serious system, expect to own the ingestion pipeline end-to-end. The question isn't whether to build, it's whether to build now or after you've outgrown a vendor.

Next: Parsing PDFs.