Ingestion is 80% of the work in a real RAG system. Here's the pipeline pattern I use and the stages that matter most.
Metadata is the quiet multiplier in RAG. Good metadata enables filtering, boosting, access control, and citations. Most teams ship without it and regret it.
OCR is necessary, imperfect, and cost-sensitive. Here's how to handle scanned PDFs, images, and handwritten content in a RAG pipeline.
Parsing HTML is deceptively hard. Boilerplate, nav menus, ads, and dynamic content all pollute the useful text. Here's how to get clean content reliably.
PDFs are where RAG projects go to die. Here's what works, what doesn't, and the specific libraries I reach for in each situation.
Tables and figures carry a disproportionate share of information in technical documents. Naive RAG drops most of it. Here's how to handle them.