OCR for scanned documents

Scanned documents are a fact of life in enterprise RAG: legal contracts, old manuals, medical records, government filings. Optical Character Recognition (OCR) is the bridge from pixels to text. It's imperfect, it's slow, and the quality differences between tools are enormous. Here's how I handle it.

Detecting what needs OCR

First move: detect whether a document has a usable text layer before running OCR. Running OCR on a clean digital PDF is expensive, slow, and often produces worse text than the embedded text layer.

Heuristics: extract whatever text layer exists and count characters per page. A page with almost no extractable text (say, under ~50 characters) but a large embedded image is almost certainly a scan. Decide per page, not per file; mixed documents (digital text plus scanned exhibits) are common.
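As a sketch, the text-layer check reduces to a pure function over per-page character counts. PyMuPDF's `page.get_text()` is one way to get the counts; the 50-character threshold is an assumption to tune against your corpus, not a standard:

```python
def pages_needing_ocr(char_counts, min_chars=50):
    """Return indices of pages whose extractable text layer is too thin
    to trust, i.e. pages that should be routed to OCR."""
    return [i for i, n in enumerate(char_counts) if n < min_chars]

# Getting the counts with PyMuPDF would look roughly like:
#   import fitz
#   char_counts = [len(page.get_text()) for page in fitz.open("doc.pdf")]
```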

OCR tools

Tesseract

The open-source baseline. Decades of development. Good on clean printed text, weak on complex layouts, handwriting, and poor scans. Free. Run in Docker to avoid install pain.

AWS Textract

Solid general-purpose OCR with structured output (tables, forms, key-value pairs). Charges per page. Production-ready. My default for English-language enterprise documents.

Azure Document Intelligence

Comparable to Textract. Slightly better on forms and pre-built document types (invoices, receipts, IDs). Charges per page.

Google Cloud Vision / Document AI

Strong on handwritten text and non-English languages. Per-page pricing.

GPT-4V / Claude / Gemini (LLM vision)

Highest quality on difficult documents: poor scans, unusual layouts, handwritten annotations, mixed-language content. Slowest and most expensive. Handles layout-aware extraction better than any dedicated OCR tool. My go-to for high-value, low-volume use cases.

EasyOCR, PaddleOCR

Open-source alternatives to Tesseract with better multilingual support. Useful when cost matters and Tesseract isn't cutting it.

Mistral OCR / Marker / Nougat

Newer specialized models for academic papers, scientific documents, and math. Marker and Nougat are particularly strong on technical content.

The quality tiers

  1. Tier 1 (freebie): Tesseract on clean English print. 95%+ accuracy. Good for most scanned business documents.
  2. Tier 2 (commercial): Textract / Azure DI. 98%+ on clean, handles layouts well.
  3. Tier 3 (LLM vision): Claude / GPT-4V / Gemini. 99%+ on most content. Handles tables, diagrams, multi-column layouts with context-aware correction.
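The tiers suggest a router that starts cheap and escalates only when confidence is low. A minimal sketch, assuming each engine is wrapped to return text plus a mean confidence in [0, 1] (the wrapper shape and the 0.9 threshold are illustrative, not from any particular library):

```python
from typing import Callable, List, Tuple

# Each engine: image bytes in, (text, mean confidence in [0, 1]) out.
OcrEngine = Callable[[bytes], Tuple[str, float]]

def tiered_ocr(image: bytes, tiers: List[OcrEngine], min_conf: float = 0.9) -> str:
    """Try engines cheapest-first; return the first result whose
    confidence clears the bar, else the best result seen."""
    best_text, best_conf = "", -1.0
    for engine in tiers:
        text, conf = engine(image)
        if conf >= min_conf:
            return text
        if conf > best_conf:
            best_text, best_conf = text, conf
    return best_text
```

In practice the first tier is Tesseract, the second a commercial API, the last an LLM vision call; most pages never leave tier one.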

The specific gotchas

Low-quality scans

Pre-process: deskew, denoise, binarize, increase DPI (300+ for OCR). OpenCV or PIL can do all of this in ten lines. Worth it before any OCR pass.

Mixed languages

Some OCR tools auto-detect language, others need it specified. Test your pipeline against multilingual samples early.

Tables in scans

Most general OCR returns tables as space-separated text streams that destroy structure. Use a table-aware OCR (Textract, Azure DI) or pass the table region to a vision model for structured extraction.

Handwriting

Tesseract: poor. Textract: okay. Azure DI: good. Vision LLMs: best. If your corpus has significant handwriting, pay for the good tool.

Rotated pages

Scans often include pages rotated 90° or 180°. Most OCR tools handle this, but not all. Test.

Multi-column layouts after OCR

OCR output may have correct characters but wrong reading order. Same problem as born-digital PDFs. Most commercial tools handle this; Tesseract often doesn't.

Cost engineering

OCR is the most expensive step in most ingestion pipelines. Strategies: skip pages that already have a usable text layer; cache OCR output keyed by file hash so re-ingestion is free; route cheap-first and escalate to commercial or LLM-vision tiers only when confidence is low; and downsample, OCR-ing only the documents retrieval will actually touch.
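Whatever the mix, the arithmetic is worth doing up front. A back-of-envelope cost model; the tier split and per-page prices below are illustrative, not real quotes:

```python
def pipeline_cost(pages: int, tier_share: list, tier_price: list) -> float:
    """Expected OCR cost given the fraction of pages routed to each tier."""
    assert abs(sum(tier_share) - 1.0) < 1e-9, "shares must sum to 1"
    return pages * sum(s * p for s, p in zip(tier_share, tier_price))

# 100k pages: 70% skipped (usable text layer, $0), 25% commercial OCR,
# 5% escalated to LLM vision.
cost = pipeline_cost(100_000, [0.70, 0.25, 0.05], [0.0, 0.0015, 0.02])
```

Running the numbers like this is usually what justifies building the text-layer detection and tiered routing in the first place.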

OCR is not the end

After OCR, run a cleanup pass: rejoin words hyphenated across line breaks, collapse stray whitespace, strip repeated page headers and footers, and fix common character confusions (l/1, O/0, rn/m).
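A first-pass cleanup in pure Python might look like this; the substitutions are examples of the idea, not an exhaustive list:

```python
import re

def clean_ocr_text(text: str) -> str:
    """Light post-OCR cleanup: de-hyphenate, normalize whitespace."""
    # Rejoin words split across line breaks: "informa-\ntion" -> "information"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces/tabs, but keep newlines as paragraph hints.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop lines that are just a page number.
    text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.MULTILINE)
    # Collapse 3+ consecutive newlines down to two.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```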

Garbage-in still compounds: the quality of the OCR'd text is an upper bound on retrieval quality.

Next: Metadata extraction.