Ingestion is the unsexy 80% of any production RAG system. Demos skip it. Real systems live and die by it. If your ingestion is broken, no amount of retrieval sophistication saves you; you're doing advanced reasoning on garbage.
Full rebuild is the simplest pattern: rebuild the entire index from scratch on a schedule. Safe and easy to reason about, but expensive and slow for large corpora. Workable up to roughly 100K documents, or as long as the corpus fits in a nightly rebuild window.
Incremental sync: only re-ingest documents that changed since the last sync. This requires reliable change detection, such as last-modified timestamps, content hashes, or a change log from the source system. It's the most common pattern at medium scale.
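Content-hash change detection can be sketched in a few lines. This is a minimal illustration, not a library API: `content_hash` and `changed_docs` are hypothetical names, and the in-memory dicts stand in for the source system and the sync-state store.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def changed_docs(source_docs: dict[str, str],
                 last_hashes: dict[str, str]) -> list[str]:
    """Return IDs of documents that are new or changed since the last
    sync, plus IDs that were deleted from the source."""
    changed = [
        doc_id for doc_id, text in source_docs.items()
        if last_hashes.get(doc_id) != content_hash(text)
    ]
    deleted = [doc_id for doc_id in last_hashes if doc_id not in source_docs]
    return changed + deleted
```

Hashes catch edits that timestamp comparison misses (e.g., a backfill that rewrites files with old mtimes), at the cost of reading every document's content on each sync.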
Event-driven: the source system emits events (a webhook, a message queue, a CDC stream) that trigger per-document updates. Near-real-time, but the most complex to operate. Required when freshness matters in minutes rather than hours.
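The consumer side of the event-driven pattern is a small dispatcher. A minimal sketch, assuming a hypothetical event shape (`{"type": "upsert"|"delete", "doc_id": ...}`) and a pipeline object with `reingest`/`remove` methods; real source systems each define their own payloads.

```python
def handle_event(event: dict, pipeline) -> None:
    """Dispatch one change event from the source system to a
    per-document pipeline action. Unknown event types fail loudly
    rather than being silently dropped."""
    if event["type"] == "upsert":
        pipeline.reingest(event["doc_id"])   # parse, chunk, embed, index
    elif event["type"] == "delete":
        pipeline.remove(event["doc_id"])     # purge the doc's chunks
    else:
        raise ValueError(f"unknown event type: {event['type']}")
```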
Pick the least complex pattern that meets your freshness requirement. Most teams over-engineer this.
Every ingestion step must be safely retryable. Networks fail, parsers crash, embedding APIs rate-limit. The pipeline needs to handle partial failures without producing duplicate chunks or corrupted indexes.
Implementation patterns: deterministic chunk IDs derived from content hashes, upsert rather than blind insert semantics, and atomic swaps so a failed run never leaves a half-updated index.
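Deterministic IDs plus upserts are what make retries safe. A minimal sketch with a plain dict standing in for the vector store; `chunk_id` and `upsert_chunks` are illustrative names, not any particular vendor's API.

```python
import hashlib

def chunk_id(doc_id: str, position: int, text: str) -> str:
    """Deterministic ID: re-running ingestion on the same content
    yields the same ID, so a retry overwrites instead of duplicating."""
    key = f"{doc_id}:{position}:{text}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()

def upsert_chunks(index: dict, doc_id: str, chunks: list[str]) -> None:
    """Write the new chunk set for a document, then drop stale chunk
    IDs left over from earlier versions of the same document."""
    new_ids = set()
    for pos, text in enumerate(chunks):
        cid = chunk_id(doc_id, pos, text)
        index[cid] = {"doc_id": doc_id, "position": pos, "text": text}
        new_ids.add(cid)
    stale = [cid for cid, meta in index.items()
             if meta["doc_id"] == doc_id and cid not in new_ids]
    for cid in stale:
        del index[cid]
```

Running the same ingestion twice is a no-op, which is exactly the property a retry loop needs.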
Most RAG failures in production come from silent data drift: the source system changed, the pipeline didn't notice, and the index is stale.
What to monitor: document counts in source versus index, last-sync timestamps, parse and embedding failure rates, and the age of the oldest pending change.
Documents in source systems have access controls. If you don't propagate those controls into your index, you'll leak data across tenants. Every chunk needs to carry its permission metadata from source to retrieval.
Two main patterns: store permissions statically as filterable metadata on each chunk at ingestion time, or check the source system's ACLs dynamically at query time.
Static with incremental permission sync is the common production compromise.
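In the static pattern, retrieval becomes a metadata filter. A minimal sketch: the `allowed_groups` field and the group-membership model are assumptions for illustration; real systems may key ACLs on users, roles, or tenants.

```python
def allowed_chunks(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    """Static-pattern retrieval filter: each chunk carries the ACL
    copied from its source document at ingestion time, and retrieval
    only returns chunks the user's groups are allowed to see."""
    return [c for c in chunks if user_groups & set(c["allowed_groups"])]
```

In practice this filter should run inside the vector store (most support metadata filters at query time), not as a post-filter, so that restricted chunks never leave the store and never displace results the user is allowed to see.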
Every chunk in your index must be traceable back to its source document. At minimum: a source document ID, a source URI or path, the chunk's position within the document, a content hash, and an ingestion timestamp.
Without this metadata, your RAG system can't produce real citations, and it can't recover from bad chunks; you can't find and fix what you can't trace.
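The minimum provenance fields fit in a small record type. A sketch, not a schema prescription; the class and field names are hypothetical, and real stores would persist this as chunk metadata alongside the embedding.

```python
from dataclasses import dataclass, field, asdict
import hashlib
import time

@dataclass
class ChunkRecord:
    """Minimum provenance a chunk needs to stay traceable: which
    document it came from, where in that document, a content
    fingerprint, and when it was ingested."""
    doc_id: str
    source_uri: str
    position: int
    text: str
    content_hash: str = ""
    ingested_at: float = field(default_factory=time.time)

    def __post_init__(self):
        # Derive the fingerprint from the text if not supplied.
        if not self.content_hash:
            self.content_hash = hashlib.sha256(
                self.text.encode("utf-8")).hexdigest()
```

With this in place, a citation is just a lookup from chunk to `source_uri`, and a bad chunk can be traced to its document and re-ingested.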
Commercial "RAG-as-a-service" ingestion tools (Vectara, Superlinked, etc.) save time on the happy path and cost you when your data has edge cases, which it will. For a serious system, expect to own the ingestion pipeline end-to-end. The question isn't whether to build, it's whether to build now or after you've outgrown a vendor.
Next: Parsing PDFs.