Scanned documents are a fact of life in enterprise RAG: legal contracts, old manuals, medical records, government filings. Optical Character Recognition (OCR) is the bridge from pixels to text. It's imperfect, it's slow, and the quality differences between tools are enormous. Here's how I handle it.
First move: detect whether a document has a usable text layer before running OCR. Running OCR on a clean digital PDF is expensive, slow, and often produces worse text than the embedded text layer.
Heuristics:
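One way such a check could look, as a minimal sketch. It assumes you already have the embedded text for each page as a string from whatever PDF extractor you use. The function name `needs_ocr` and both thresholds are my own illustrative choices, not tuned values.

```python
def needs_ocr(page_texts, min_chars_per_page=50, min_clean_ratio=0.85):
    """Decide whether a PDF's embedded text layer is too thin or too garbled to trust.

    page_texts: one string per page from your PDF text extractor.
    Both thresholds are illustrative assumptions; tune them on your corpus.
    """
    if not page_texts:
        return True

    total_chars = sum(len(t.strip()) for t in page_texts)
    # Scanned pages usually yield no text at all, or a few stray characters.
    if total_chars / len(page_texts) < min_chars_per_page:
        return True

    joined = "".join(page_texts)
    # A broken text layer (bad font encoding, junk from an earlier OCR pass)
    # shows up as a high share of replacement and control characters.
    clean = sum(
        1 for ch in joined
        if ch.isalnum() or ch.isspace() or ch in ".,;:!?'\"()-%/$&"
    )
    return clean / len(joined) < min_clean_ratio
```

A document that fails this check goes to OCR; everything else keeps its embedded text and skips the expensive step.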
Tesseract is the open-source baseline. Decades of development. Good on clean printed text, weak on complex layouts, handwriting, and poor scans. Free. Run it in Docker to avoid install pain.
Textract is solid general-purpose OCR with structured output (tables, forms, key-value pairs). Charges per page. Production-ready. My default for English-language enterprise documents.
Comparable to Textract. Slightly better on forms and pre-built document types (invoices, receipts, IDs). Charges per page.
Strong on handwritten text and non-English languages. Per-page pricing.
Vision LLMs give the highest quality on difficult documents: poor scans, unusual layouts, handwritten annotations, mixed-language content. They are also the slowest and most expensive option. They handle layout-aware extraction better than any dedicated OCR tool. My go-to for high-value, low-volume use cases.
Open-source alternatives to Tesseract with better multilingual support. Useful when cost matters and Tesseract isn't cutting it.
Newer specialized models for academic papers, scientific documents, and math. Marker and Nougat are particularly strong on technical content.
Pre-process: deskew, denoise, binarize, and upsample low-resolution scans to an effective 300 DPI or more. OpenCV or PIL can do all of this in about ten lines. Worth it before any OCR pass.
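To show what the binarize step actually does, here is a dependency-free sketch of Otsu thresholding. It assumes the page is already a grayscale image held as rows of 0-255 integers. In a real pipeline I would use the library version instead (OpenCV exposes it through `cv2.threshold` with the `THRESH_OTSU` flag); the pure-Python form is only for illustration.

```python
def otsu_threshold(gray):
    """Return the Otsu threshold for a grayscale image given as rows of 0-255 ints."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        # Pixels with value <= t form the "background" class for this split.
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Otsu picks the split that maximizes between-class variance.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray):
    """Map every pixel to pure black (0) or pure white (255)."""
    t = otsu_threshold(gray)
    return [[255 if v > t else 0 for v in row] for row in gray]
```

On a typical scan (dark ink on light paper) this turns ink into 0 and paper into 255, which is the clean two-level input most OCR engines do best on.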
Some OCR tools auto-detect language, others need it specified. Test your pipeline against multilingual samples early.
Most general OCR returns tables as space-separated text streams that destroy structure. Use a table-aware OCR (Textract, Azure DI) or pass the table region to a vision model for structured extraction.
For handwriting: Tesseract is poor, Textract is okay, Azure DI is good, vision LLMs are best. If your corpus has significant handwriting, pay for the good tool.
Scans often include pages rotated 90° or 180°. Most OCR tools handle this, but not all. Test.
OCR output may have correct characters but wrong reading order. Same problem as born-digital PDFs. Most commercial tools handle this; Tesseract often doesn't.
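When a tool returns word boxes in the wrong order, a simple re-sort covers the easy case. This is a rough sketch for single-column pages only. The `(text, x, y, height)` tuple layout is an assumption of mine; adapt it to whatever your OCR tool emits. Multi-column layouts need real layout analysis, not this.

```python
def reading_order(words, line_tolerance=0.5):
    """Order OCR words top-to-bottom, then left-to-right.

    words: list of (text, x, y, height) tuples, where (x, y) is the top-left
    corner of the word box. This layout is hypothetical, not any tool's format.
    Returns one string per reconstructed text line.
    """
    lines = []
    for word in sorted(words, key=lambda w: w[2]):
        _, _, y, h = word
        # Join the current line if this word's top edge sits within a fraction
        # of its height of the line's first word; otherwise start a new line.
        if lines and abs(y - lines[-1][0][2]) <= line_tolerance * h:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, sort by x to get left-to-right order.
    return [" ".join(w[0] for w in sorted(line, key=lambda w: w[1])) for line in lines]
```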
OCR is the most expensive step in most ingestion pipelines. Strategies:
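Two strategies I find easy to combine are caching by content hash and escalating from a cheap engine to an expensive one only when confidence is low. The sketch below assumes each engine is wrapped in a callable that returns `(text, confidence)`. Those wrappers, the `ocr_with_cache` name, and the 0.9 cutoff are all hypothetical.

```python
import hashlib


def ocr_with_cache(page_bytes, engines, cache, min_confidence=0.9):
    """Try OCR engines from cheapest to most expensive, caching by content hash.

    engines: list of (name, fn) ordered by cost. Each fn takes page bytes and
    returns (text, confidence in [0, 1]). These are hypothetical wrappers
    around whatever OCR tools you use.
    cache: any dict-like store keyed by the SHA-256 of the page bytes.
    """
    if not engines:
        raise ValueError("at least one OCR engine is required")

    key = hashlib.sha256(page_bytes).hexdigest()
    if key in cache:
        return cache[key]  # identical page seen before, no OCR cost

    result = None
    for name, run in engines:
        text, confidence = run(page_bytes)
        result = {"engine": name, "text": text, "confidence": confidence}
        if confidence >= min_confidence:
            break  # good enough, skip the pricier engines

    cache[key] = result
    return result
```

If no engine clears the cutoff, the result from the last (most expensive) engine is kept, so the low confidence is still visible downstream.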
After OCR, run a cleanup pass:
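A minimal sketch of what such a pass might contain, using only the standard library. The specific steps here (ligature folding, de-hyphenation, whitespace collapsing) are my illustrative picks, and `clean_ocr_text` is a hypothetical name.

```python
import re
import unicodedata


def clean_ocr_text(text):
    """Normalize a few common OCR artifacts. The steps are illustrative assumptions."""
    # NFKC folds ligatures such as U+FB01 ("fi" as one glyph) into plain letters.
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words hyphenated across a line break. Caveat: this also merges
    # genuine compounds that happen to break at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs, but keep newlines.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that hug a newline.
    text = re.sub(r" *\n *", "\n", text)
    # Collapse three or more newlines into a single paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

The order matters: spaces around newlines are removed before the newline collapse, so blank lines containing stray spaces still merge into one paragraph break.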
Garbage-in still compounds. OCR'd text quality bounds retrieval quality.
Next: Metadata extraction.