Documents.

PDFs, HTML, code, structured data. Ingestion patterns that handle the mess.

The ingestion pipeline

Ingestion is roughly 80% of the work in a real RAG system. Here's the pipeline pattern I use and the stages that matter most.

Metadata extraction

Metadata is the quiet multiplier in RAG. Good metadata enables filtering, boosting, access control, and citations. Most teams ship without it and regret it.
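To make the filtering, access-control, and citation uses concrete, here is a minimal sketch. The `Chunk` fields and the helper names (`visible_chunks`, `cite`) are hypothetical and chosen for illustration; they are not taken from any particular library. Boosting is left out to keep it short.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A retrievable text chunk plus illustrative metadata (hypothetical schema)."""
    text: str
    source: str                                        # file or URL, used for citations
    page: int                                          # page number, used for citations
    allowed_groups: set = field(default_factory=set)   # access control
    doc_type: str = "unknown"                          # filtering


def visible_chunks(chunks, user_groups, doc_type=None):
    """Keep chunks the user may see, optionally restricted to one doc_type."""
    kept = []
    for chunk in chunks:
        # Access control: the user must share at least one group with the chunk.
        if not (chunk.allowed_groups & user_groups):
            continue
        # Filtering: skip chunks of the wrong document type, if one was requested.
        if doc_type is not None and chunk.doc_type != doc_type:
            continue
        kept.append(chunk)
    return kept


def cite(chunk):
    """Format a human-readable citation from the chunk's metadata."""
    return f"{chunk.source}, p. {chunk.page}"
```

In a real system the same checks would usually be pushed down into the vector store's filter query rather than applied in Python after retrieval; the point here is only that none of this is possible unless the metadata was captured at ingestion time.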

OCR for scanned documents

OCR is necessary, imperfect, and cost-sensitive. Here's how to handle scanned PDFs, images, and handwritten content in a RAG pipeline.

Parsing HTML + web pages

Parsing HTML is deceptively hard. Boilerplate, nav menus, ads, and dynamic content all pollute the useful text. Here's how to get clean content reliably.
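As a rough illustration of the boilerplate problem, the sketch below uses only Python's standard-library `html.parser` to drop text inside tags that usually hold navigation, scripts, or page chrome. The `SKIP_TAGS` set and the `extract_text` helper are assumptions made for this example, not a recommended production setup. It does nothing about ads placed in ordinary `div` elements or about content rendered by JavaScript.

```python
from html.parser import HTMLParser

# Tags whose contents are usually boilerplate (an assumed, incomplete list).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}


class TextExtractor(HTMLParser):
    """Collect text that is not nested inside any tag listed in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # how many skipped tags we are currently inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped regions.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the visible, non-boilerplate text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Joining with a single space loses paragraph boundaries, which matters for chunking later; a fuller version would emit a line break on block-level tags such as `p` and `h1`.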

Parsing PDFs

PDFs are where RAG projects go to die. Here's what works, what doesn't, and the specific libraries I reach for in each situation.

Tables and figures

Tables and figures carry a disproportionate share of information in technical documents. Naive RAG drops most of it. Here's how to handle them.