OCR for scanned documents

Scanned documents are a fact of life in enterprise RAG: legal contracts, old manuals, medical records, government filings. Optical Character Recognition (OCR) is the bridge from pixels to text. It's imperfect, it's slow, and the quality differences between tools are enormous. Here's how I handle it.

Detecting what needs OCR

First move: detect whether a document has a usable text layer before running OCR. Running OCR on a clean digital PDF is expensive, slow, and often produces worse text than the embedded text layer.

Heuristics: extract whatever text layer exists and count characters per page. A page with almost no extractable text (say, under ~50 characters) but a large embedded image is almost certainly a scan. Decide per page, not per file; mixed documents (digital text plus scanned exhibits) are common.
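As a sketch, the text-layer check reduces to a pure function over per-page character counts. PyMuPDF's `page.get_text()` is one way to get the counts; the 50-character threshold is an assumption to tune against your corpus, not a standard:

```python
def pages_needing_ocr(char_counts, min_chars=50):
    """Return indices of pages whose extractable text layer is too thin
    to trust, i.e. pages that should be routed to OCR."""
    return [i for i, n in enumerate(char_counts) if n < min_chars]

# Getting the counts with PyMuPDF would look roughly like:
#   import fitz
#   char_counts = [len(page.get_text()) for page in fitz.open("doc.pdf")]
```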

OCR tools

Tesseract

The open-source baseline. Decades of development. Good on clean printed text, weak on complex layouts, handwriting, and poor scans. Free. Run in Docker to avoid install pain.

AWS Textract

Solid general-purpose OCR with structured output (tables, forms, key-value pairs). Charges per page. Production-ready. My default for English-language enterprise documents.

Azure Document Intelligence

Comparable to Textract. Slightly better on forms and pre-built document types (invoices, receipts, IDs). Charges per page.

Google Cloud Vision / Document AI

Strong on handwritten text and non-English languages. Per-page pricing.

GPT-4V / Claude / Gemini (LLM vision)

Highest quality on difficult documents: poor scans, unusual layouts, handwritten annotations, mixed-language content. Slowest and most expensive. Handles layout-aware extraction better than any dedicated OCR tool. My go-to for high-value, low-volume use cases.

EasyOCR, PaddleOCR

Open-source alternatives to Tesseract with better multilingual support. Useful when cost matters and Tesseract isn't cutting it.

Mistral OCR / Marker / Nougat

Newer specialized models for academic papers, scientific documents, and math. Marker and Nougat are particularly strong on technical content.

The quality tiers

  1. Tier 1 (freebie): Tesseract on clean English print. 95%+ accuracy. Good for most scanned business documents.
  2. Tier 2 (commercial): Textract / Azure DI. 98%+ on clean, handles layouts well.
  3. Tier 3 (LLM vision): Claude / GPT-4V / Gemini. 99%+ on most content. Handles tables, diagrams, multi-column layouts with context-aware correction.
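The tiers suggest a router that starts cheap and escalates only when confidence is low. A minimal sketch, assuming each engine is wrapped to return text plus a mean confidence in [0, 1] (the wrapper shape and the 0.9 threshold are illustrative, not from any particular library):

```python
from typing import Callable, List, Tuple

# Each engine: image bytes in, (text, mean confidence in [0, 1]) out.
OcrEngine = Callable[[bytes], Tuple[str, float]]

def tiered_ocr(image: bytes, tiers: List[OcrEngine], min_conf: float = 0.9) -> str:
    """Try engines cheapest-first; return the first result whose
    confidence clears the bar, else the best result seen."""
    best_text, best_conf = "", -1.0
    for engine in tiers:
        text, conf = engine(image)
        if conf >= min_conf:
            return text
        if conf > best_conf:
            best_text, best_conf = text, conf
    return best_text
```

In practice the first tier is Tesseract, the second a commercial API, the last an LLM vision call; most pages never leave tier one.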

The specific gotchas

Low-quality scans

Pre-process: deskew, denoise, binarize, increase DPI (300+ for OCR). OpenCV or PIL can do all of this in ten lines. Worth it before any OCR pass.

Mixed languages

Some OCR tools auto-detect language, others need it specified. Test your pipeline against multilingual samples early.

Tables in scans

Most general OCR returns tables as space-separated text streams that destroy structure. Use a table-aware OCR (Textract, Azure DI) or pass the table region to a vision model for structured extraction.

Handwriting

Tesseract: poor. Textract: okay. Azure DI: good. Vision LLMs: best. If your corpus has significant handwriting, pay for the good tool.

Rotated pages

Scans often include pages rotated 90° or 180°. Most OCR tools handle this, but not all. Test.

Multi-column layouts after OCR

OCR output may have correct characters but wrong reading order. Same problem as born-digital PDFs. Most commercial tools handle this; Tesseract often doesn't.

Cost engineering

OCR is the most expensive step in most ingestion pipelines. Strategies: skip pages that already have a usable text layer; cache OCR output keyed by file hash so re-ingestion is free; route cheap-first and escalate to commercial or LLM-vision tiers only when confidence is low; and downsample, OCR-ing only the documents retrieval will actually touch.
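Whatever the mix, the arithmetic is worth doing up front. A back-of-envelope cost model; the tier split and per-page prices below are illustrative, not real quotes:

```python
def pipeline_cost(pages: int, tier_share: list, tier_price: list) -> float:
    """Expected OCR cost given the fraction of pages routed to each tier."""
    assert abs(sum(tier_share) - 1.0) < 1e-9, "shares must sum to 1"
    return pages * sum(s * p for s, p in zip(tier_share, tier_price))

# 100k pages: 70% skipped (usable text layer, $0), 25% commercial OCR,
# 5% escalated to LLM vision.
cost = pipeline_cost(100_000, [0.70, 0.25, 0.05], [0.0, 0.0015, 0.02])
```

Running the numbers like this is usually what justifies building the text-layer detection and tiered routing in the first place.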

OCR is not the end

After OCR, run a cleanup pass: rejoin words hyphenated across line breaks, collapse stray whitespace, strip repeated page headers and footers, and fix common character confusions (l/1, O/0, rn/m).
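A first-pass cleanup in pure Python might look like this; the substitutions are examples of the idea, not an exhaustive list:

```python
import re

def clean_ocr_text(text: str) -> str:
    """Light post-OCR cleanup: de-hyphenate, normalize whitespace."""
    # Rejoin words split across line breaks: "informa-\ntion" -> "information"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces/tabs, but keep newlines as paragraph hints.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop lines that are just a page number.
    text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.MULTILINE)
    # Collapse 3+ consecutive newlines down to two.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```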

Garbage-in still compounds: the quality of the OCR'd text is an upper bound on retrieval quality.

Next: Metadata extraction.