PDFs are where the "we'll just parse the docs" plan meets reality. PDF is not a document format; it's a visual layout format that happens to contain text. Every serious RAG project eventually has a PDF problem. Here's how I think about it.
Generated from Word, LaTeX, or pandoc, so the text is extractable with basic tools and usually has a coherent reading order. It may still have headers, footers, and page numbers that pollute the text.
Text is there but reading order is scrambled. Multi-column layouts, text boxes, sidebars, footnotes. Basic extraction yields jumbled output. This is the most common production case.
No text layer. Just pixels. Requires OCR. See OCR.
Good for clean, simple PDFs. Fast and free. Fails on multi-column layouts, tables, and scanned content. My baseline for quick prototyping.
More robust. Handles complex layouts better and has stronger table detection. Licensing is tricky for commercial use (AGPL; closed-source products need a paid license).
Opinionated parser that preserves structure: headings, lists, tables. Returns elements with types, which is valuable for structure-aware chunking. Slower than raw text extraction but usually worth the tradeoff.
Open-source, strong on complex document layouts. Produces structured output with headings, tables, and lists preserved. Under active development and my current default for complex PDFs.
Highest quality on difficult documents. Handles layouts, tables, figures, and scanned content. Expensive (dollars per document for complex PDFs). Right tool when accuracy matters more than cost.
Mathpix for math-heavy PDFs. LlamaParse for general high-quality parsing. The Azure and AWS offerings are decent general-purpose options. All charge per page.
PDFs don't store reading order, they store visual positions. A two-column PDF parsed naively produces "column 1 line 1, column 2 line 1, column 1 line 2, column 2 line 2..." which is nonsense. Always check that your parser reorders text correctly on multi-column layouts.
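The fix is to reorder by geometry before joining text. A minimal sketch, assuming a parser that exposes word boxes as `(x0, top, text)` tuples (pdfplumber's `extract_words()` returns similar data as dicts) and assuming columns split cleanly at the page midpoint, which real layouts often don't:

```python
def reading_order(words, page_width):
    """Reorder word boxes column-by-column instead of row-by-row.

    words: iterable of (x0, top, text) tuples; the midpoint column
    split is a simplification for illustration.
    """
    mid = page_width / 2
    left = [w for w in words if w[0] < mid]
    right = [w for w in words if w[0] >= mid]
    ordered = []
    for column in (left, right):
        # Within a column, sort top-to-bottom, then left-to-right.
        ordered.extend(sorted(column, key=lambda w: (w[1], w[0])))
    return " ".join(w[2] for w in ordered)

# Naive extraction would interleave the columns; this reads each fully.
words = [
    (50, 100, "Column"), (320, 100, "Column"),
    (50, 120, "one."),   (320, 120, "two."),
]
print(reading_order(words, page_width=600))  # -> Column one. Column two.
```

Production parsers use smarter column detection (clustering x-coordinates), but the spot-check is the same: feed it a two-column page and read the output aloud.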
Page numbers, running headers, and footer boilerplate get concatenated into your content unless you strip them. Most retrieval failures traced to "why does it think the answer is on page 47" come from headers polluting chunks.
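A cheap way to strip these: a line that appears verbatim on most pages is boilerplate, and a line that is nothing but a number is a page number. A sketch, where `pages` is a list of per-page text from whatever parser you use and the 0.6 repetition threshold is an assumption to tune per corpus:

```python
import re
from collections import Counter

def strip_repeated_lines(pages, threshold=0.6):
    """Remove running headers/footers (lines repeated on most pages)
    and bare page-number lines."""
    counts = Counter()
    for page in pages:
        # Count each distinct line at most once per page.
        counts.update({line.strip() for line in page.splitlines()})
    cutoff = threshold * len(pages)
    boilerplate = {line for line, n in counts.items() if line and n >= cutoff}
    page_number = re.compile(r"^\s*(page\s+)?\d+\s*$", re.IGNORECASE)
    cleaned = []
    for page in pages:
        kept = [l for l in page.splitlines()
                if l.strip() not in boilerplate and not page_number.match(l)]
        cleaned.append("\n".join(kept))
    return cleaned

pages = ["ACME Report\nbody one\n1",
         "ACME Report\nbody two\n2",
         "ACME Report\nbody three\n3"]
print(strip_repeated_lines(pages))  # -> ['body one', 'body two', 'body three']
```

This misses headers that embed the page number ("ACME Report - 47"), which need a fuzzier match, but it catches the common case before anything reaches the embedder.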
A table parsed as raw text is a string of numbers with no structure. Either use a parser that preserves tables as structured data (then serialize them into readable form), or detect tables and send them to a dedicated extractor. See tables and figures.
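Once a parser gives you structured rows, the serialization step is simple. A sketch that renders rows of cells as a markdown table, which most LLMs read reliably (assuming the parser returns a list-of-lists with a header row, as structure-aware parsers typically do):

```python
def table_to_markdown(rows):
    """Serialize extracted table rows (header first) into markdown."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    for row in body:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)

rows = [["Year", "Revenue"], ["2023", "$1.2M"], ["2024", "$1.8M"]]
print(table_to_markdown(rows))
```

The point is that "2023 | $1.2M" survives chunking as a labeled fact, where the raw-text version "2023 2024 $1.2M $1.8M" does not.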
Text extraction drops them entirely. For technical documents, figures often contain 30% of the information, so you need a deliberate strategy for handling them.
PDF text often has hyphens at line breaks that become mid-word hyphens after extraction ("im-\nportant" becomes "im-portant"). Always post-process to rejoin these.
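A sketch of that post-processing step, run on text that still contains the line breaks. The lowercase-after-hyphen check is a heuristic assumption: it rejoins split words while leaving genuine hyphenated names alone:

```python
import re

def dehyphenate(text):
    """Rejoin words hyphenated across line breaks ("im-\\nportant").

    Only merges when the fragment after the break starts lowercase,
    a heuristic to avoid mangling hyphenated proper nouns.
    """
    return re.sub(
        r"(\w+)-\n(\w+)",
        lambda m: m.group(1) + m.group(2)
        if m.group(2)[0].islower() else m.group(0),
        text,
    )

print(dehyphenate("an im-\nportant point"))  # -> an important point
```

If your extractor has already collapsed newlines, the artifact becomes indistinguishable from a legitimate compound and you need a dictionary check instead, which is a good reason to dehyphenate early.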
"fi" and "fl" ligatures, Greek letters in math, non-breaking spaces, all of these break tokenization and retrieval if not normalized.
For any significant PDF corpus, take 10 random documents and spot-check the parsed output against the original. Look specifically for the failure modes above: scrambled reading order, header/footer pollution, mangled tables, dropped figures, broken hyphenation, and garbled characters.
If fewer than 8 of 10 pass, your parser is wrong for this corpus. Don't embed bad text: garbage in compounds through the rest of the pipeline.
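The manual check can be fronted with cheap automated heuristics that flag obviously suspect parses. A sketch where the specific patterns are my assumptions, not an exhaustive list, and the 8-of-10 threshold mirrors the rule of thumb above:

```python
import re

def looks_suspect(text):
    """Flag parses with obvious extraction damage."""
    return bool(
        "\ufffd" in text                     # replacement chars: bad decoding
        or "\ufb01" in text or "\ufb02" in text  # un-normalized fi/fl ligatures
        or re.search(r"\w-\n\w", text)       # unjoined line-break hyphens
    )

def corpus_passes(sample_texts, min_pass=8):
    """Apply the 8-of-10 rule of thumb to a sample of parsed documents."""
    passed = sum(1 for t in sample_texts if not looks_suspect(t))
    return passed >= min_pass

sample = ["clean parsed text"] * 9 + ["bad \ufffd text"]
print(corpus_passes(sample))  # -> True
```

Passing these heuristics doesn't mean the parse is good (scrambled reading order looks fine to a regex), so they complement the eyeball check rather than replace it.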
For any business-critical RAG system, I recommend budgeting for LLM-based parsing or a commercial service on difficult documents. Over a full corpus, paying $0.05-0.20 per page for quality parsing is often cheaper than debugging retrieval failures caused by bad text extraction.
Next: Parsing HTML + web pages.