PDFs are where the "we'll just parse the docs" plan meets reality. PDF is not a document format; it's a visual layout format that happens to contain text. Every serious RAG project eventually has a PDF problem. Here's how I think about it.
Generated from Word, LaTeX, or pandoc, so the text is extractable with basic tools and usually has a coherent reading order. It may still have headers, footers, and page numbers that pollute the text.
Text is there but reading order is scrambled. Multi-column layouts, text boxes, sidebars, footnotes. Basic extraction yields jumbled output. This is the most common production case.
No text layer. Just pixels. Requires OCR. See OCR.
Good for clean, simple PDFs. Fast and free. Fails on multi-column layouts, tables, and scanned content. My baseline for quick prototyping.
More robust. Handles complex layouts better and has stronger table detection. Licensing is tricky for commercial use (AGPL; closed-source products need a paid license).
Opinionated parser that preserves structure: headings, lists, tables. Returns elements with types, which is valuable for structure-aware chunking. Slower than raw text extraction but usually worth the tradeoff.
Open-source, strong on complex document layouts. Produces structured output with headings, tables, and lists preserved. Under active development and my current default for complex PDFs.
Highest quality on difficult documents. Handles layouts, tables, figures, and scanned content. Expensive (dollars per document for complex PDFs). Right tool when accuracy matters more than cost.
Mathpix for math-heavy PDFs. LlamaParse for general high-quality parsing. The Azure and AWS offerings are decent general-purpose options. All charge per page.
PDFs don't store reading order, they store visual positions. A two-column PDF parsed naively produces "column 1 line 1, column 2 line 1, column 1 line 2, column 2 line 2..." which is nonsense. Always check that your parser reorders text correctly on multi-column layouts.
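The fix is to reorder by geometry before joining text. A minimal sketch, assuming a parser that exposes word boxes as `(x0, top, text)` tuples (pdfplumber's `extract_words()` returns similar data as dicts) and assuming columns split cleanly at the page midpoint, which real layouts often don't:

```python
def reading_order(words, page_width):
    """Reorder word boxes column-by-column instead of row-by-row.

    words: iterable of (x0, top, text) tuples; the midpoint column
    split is a simplification for illustration.
    """
    mid = page_width / 2
    left = [w for w in words if w[0] < mid]
    right = [w for w in words if w[0] >= mid]
    ordered = []
    for column in (left, right):
        # Within a column, sort top-to-bottom, then left-to-right.
        ordered.extend(sorted(column, key=lambda w: (w[1], w[0])))
    return " ".join(w[2] for w in ordered)

# Naive extraction would interleave the columns; this reads each fully.
words = [
    (50, 100, "Column"), (320, 100, "Column"),
    (50, 120, "one."),   (320, 120, "two."),
]
print(reading_order(words, page_width=600))  # -> Column one. Column two.
```

Production parsers use smarter column detection (clustering x-coordinates), but the spot-check is the same: feed it a two-column page and read the output aloud.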
Page numbers, running headers, and footer boilerplate get concatenated into your content unless you strip them. Most retrieval failures traced to "why does it think the answer is on page 47" come from headers polluting chunks.
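A cheap way to strip these: a line that appears verbatim on most pages is boilerplate, and a line that is nothing but a number is a page number. A sketch, where `pages` is a list of per-page text from whatever parser you use and the 0.6 repetition threshold is an assumption to tune per corpus:

```python
import re
from collections import Counter

def strip_repeated_lines(pages, threshold=0.6):
    """Remove running headers/footers (lines repeated on most pages)
    and bare page-number lines."""
    counts = Counter()
    for page in pages:
        # Count each distinct line at most once per page.
        counts.update({line.strip() for line in page.splitlines()})
    cutoff = threshold * len(pages)
    boilerplate = {line for line, n in counts.items() if line and n >= cutoff}
    page_number = re.compile(r"^\s*(page\s+)?\d+\s*$", re.IGNORECASE)
    cleaned = []
    for page in pages:
        kept = [l for l in page.splitlines()
                if l.strip() not in boilerplate and not page_number.match(l)]
        cleaned.append("\n".join(kept))
    return cleaned

pages = ["ACME Report\nbody one\n1",
         "ACME Report\nbody two\n2",
         "ACME Report\nbody three\n3"]
print(strip_repeated_lines(pages))  # -> ['body one', 'body two', 'body three']
```

This misses headers that embed the page number ("ACME Report - 47"), which need a fuzzier match, but it catches the common case before anything reaches the embedder.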
A table parsed as raw text is a string of numbers with no structure. Either use a parser that preserves tables as structured data (then serialize them into readable form), or detect tables and send them to a dedicated extractor. See tables and figures.
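Once a parser gives you structured rows, the serialization step is simple. A sketch that renders rows of cells as a markdown table, which most LLMs read reliably (assuming the parser returns a list-of-lists with a header row, as structure-aware parsers typically do):

```python
def table_to_markdown(rows):
    """Serialize extracted table rows (header first) into markdown."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    for row in body:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)

rows = [["Year", "Revenue"], ["2023", "$1.2M"], ["2024", "$1.8M"]]
print(table_to_markdown(rows))
```

The point is that "2023 | $1.2M" survives chunking as a labeled fact, where the raw-text version "2023 2024 $1.2M $1.8M" does not.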
Text extraction drops them entirely. For technical documents, figures often contain 30% of the information, so you need a deliberate strategy for handling them.
PDF text often has hyphens at line breaks that become mid-word hyphens after extraction ("im-\nportant" becomes "im-portant"). Always post-process to rejoin these.
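A sketch of that post-processing step, run on text that still contains the line breaks. The lowercase-after-hyphen check is a heuristic assumption: it rejoins split words while leaving genuine hyphenated names alone:

```python
import re

def dehyphenate(text):
    """Rejoin words hyphenated across line breaks ("im-\\nportant").

    Only merges when the fragment after the break starts lowercase,
    a heuristic to avoid mangling hyphenated proper nouns.
    """
    return re.sub(
        r"(\w+)-\n(\w+)",
        lambda m: m.group(1) + m.group(2)
        if m.group(2)[0].islower() else m.group(0),
        text,
    )

print(dehyphenate("an im-\nportant point"))  # -> an important point
```

If your extractor has already collapsed newlines, the artifact becomes indistinguishable from a legitimate compound and you need a dictionary check instead, which is a good reason to dehyphenate early.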
"fi" and "fl" ligatures, Greek letters in math, non-breaking spaces, all of these break tokenization and retrieval if not normalized.
For any significant PDF corpus, take 10 random documents and spot-check the parsed output against the original. Look specifically for the failure modes above: scrambled reading order, header/footer pollution, mangled tables, dropped figures, broken hyphenation, and garbled characters.
If fewer than 8 of 10 pass, your parser is wrong for this corpus. Don't embed bad text: garbage in compounds through the rest of the pipeline.
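The manual check can be fronted with cheap automated heuristics that flag obviously suspect parses. A sketch where the specific patterns are my assumptions, not an exhaustive list, and the 8-of-10 threshold mirrors the rule of thumb above:

```python
import re

def looks_suspect(text):
    """Flag parses with obvious extraction damage."""
    return bool(
        "\ufffd" in text                     # replacement chars: bad decoding
        or "\ufb01" in text or "\ufb02" in text  # un-normalized fi/fl ligatures
        or re.search(r"\w-\n\w", text)       # unjoined line-break hyphens
    )

def corpus_passes(sample_texts, min_pass=8):
    """Apply the 8-of-10 rule of thumb to a sample of parsed documents."""
    passed = sum(1 for t in sample_texts if not looks_suspect(t))
    return passed >= min_pass

sample = ["clean parsed text"] * 9 + ["bad \ufffd text"]
print(corpus_passes(sample))  # -> True
```

Passing these heuristics doesn't mean the parse is good (scrambled reading order looks fine to a regex), so they complement the eyeball check rather than replace it.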
For any business-critical RAG system, I recommend budgeting for LLM-based parsing or a commercial service on difficult documents. Over a full corpus, paying $0.05-0.20 per page for quality parsing is often cheaper than debugging retrieval failures caused by bad text extraction.
Next: Parsing HTML + web pages.