Tables and figures are where most RAG systems silently lose information. Naive text extraction either strips them entirely or flattens them into unreadable streams of cell values. For technical, financial, scientific, or compliance documents, that can mean 30-50% of the actual information never reaches the index.
A table is a two-dimensional data structure; prose is one-dimensional. When a parser flattens a table into prose, it loses: the binding between each cell and its column header, the grouping of cells into rows, and the alignment that makes cross-row comparison possible.
Convert each row into a sentence: "In 2023, Revenue was $4.2M, COGS was $1.8M, Gross Margin was 57%." Preserves semantics in a format embeddings handle well. Loses compactness but gains retrievability.
Works for: small to medium tables where the row-level facts are the important thing.
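A minimal sketch of the serialization, assuming the parser already yields headers and rows as lists of strings; the `rows_to_sentences` helper and the "In {first column}, ..." template (which treats the first column as a time period) are illustrative, not a fixed convention:

```python
# Sketch of row-to-sentence serialization. Assumes the parser yields
# headers and rows as lists; the first column is treated as the row key.

def rows_to_sentences(headers, rows):
    """Serialize each table row into one retrievable sentence."""
    sentences = []
    for row in rows:
        pairs = [f"{h} was {v}" for h, v in zip(headers[1:], row[1:])]
        sentences.append(f"In {row[0]}, " + ", ".join(pairs) + ".")
    return sentences

print(rows_to_sentences(
    ["Year", "Revenue", "COGS", "Gross Margin"],
    [["2023", "$4.2M", "$1.8M", "57%"]],
))
# → ['In 2023, Revenue was $4.2M, COGS was $1.8M, Gross Margin was 57%.']
```

Each sentence then embeds like any other prose, so the row-level facts survive standard dense retrieval.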
Keep the table as a Markdown table. Preserves structure. Embeddings still handle it reasonably well for retrieval, and the LLM can reason over the original structure during generation.
Works for: tables with clear headers and moderate size (<50 rows).
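When keeping the table intact, the chunk is just the Markdown rendering. A sketch assuming the same headers-plus-rows input as above:

```python
# Render a parsed table back into Markdown, one chunk per table.
# The input format (headers + rows of strings) is an assumption.

def to_markdown_table(headers, rows):
    """Emit a GitHub-flavored Markdown table as a single string."""
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)
```

The resulting string goes into the index as one chunk, so the whole table is retrieved (and reasoned over) together.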
Each row becomes its own chunk, with the table's column headers prepended or added as metadata. This gives fine-grained retrieval, but you lose row-to-row comparisons within a single retrieved chunk.
Works for: large reference tables (product catalogs, drug databases, parts lists).
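A sketch of row-level chunking; the chunk dict shape and the `table_id`/`row` metadata fields are illustrative, not any particular vector store's schema:

```python
# One chunk per row, with headers repeated so each chunk is
# self-describing. Metadata fields are illustrative.

def table_to_row_chunks(table_id, headers, rows):
    chunks = []
    for i, row in enumerate(rows):
        text = "; ".join(f"{h}: {v}" for h, v in zip(headers, row))
        chunks.append({
            "text": text,
            "metadata": {"table_id": table_id, "row": i, "headers": headers},
        })
    return chunks
```

Keeping `table_id` and `row` in metadata lets you re-fetch sibling rows at generation time when a comparison question does come up.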
Store the table in a database. Have the LLM emit a structured query when it needs table data. Not strictly RAG, but a common hybrid for systems that need to reason over tabular data.
Works for: large financial statements, time series, anything with a lot of numeric comparison.
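One way to sketch the hybrid, using an in-memory SQLite database; the `income` table is an example, and in a real system the SQL string would be emitted by the LLM rather than hard-coded:

```python
import sqlite3

def load_table(conn, name, headers, rows):
    """Load a parsed table into SQLite so it can be queried, not embedded."""
    cols = ", ".join(f'"{h}"' for h in headers)
    conn.execute(f'CREATE TABLE "{name}" ({cols})')
    placeholders = ", ".join("?" for _ in headers)
    conn.executemany(f'INSERT INTO "{name}" VALUES ({placeholders})', rows)

conn = sqlite3.connect(":memory:")
load_table(conn, "income", ["year", "revenue"], [(2022, 3.1), (2023, 4.2)])

# At query time, the LLM emits SQL like this instead of the application:
sql = "SELECT revenue FROM income WHERE year = 2023"
print(conn.execute(sql).fetchone()[0])  # → 4.2
```

The trade-off: you gain exact numeric filtering and aggregation, but you now have to validate model-generated SQL before executing it.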
To choose among these strategies, look at the questions users actually ask.
Extract and embed just the figure caption. Fast, cheap, catches most user questions that reference figures by number or topic.
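A caption extractor can be as simple as a regex over the page text; the pattern below assumes captions follow conventions like "Figure 3: ..." or "Fig. 3. ..." on their own line, which is an assumption about your corpus:

```python
import re

# Assumed caption convention: "Figure 3: ..." or "Fig. 3. ..." on its own line.
CAPTION_RE = re.compile(r"^(Fig(?:ure)?\.?\s*\d+[.:]\s*.+)$", re.MULTILINE)

def extract_captions(page_text):
    """Pull figure captions out of page text; embed these strings
    instead of (or alongside) the figure image itself."""
    return CAPTION_RE.findall(page_text)
```

Each extracted caption is embedded as its own chunk, tagged with the figure it belongs to.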
Pass the figure image through a vision model to generate a text description. Embed the description. Expensive at scale but captures visual content. Use for technical diagrams, charts, and anything where the image carries real information.
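A sketch of the request side, shaped like the OpenAI chat-completions vision payload; the model name and prompt text are placeholders, and the same idea adapts to whatever vision model you actually use:

```python
import base64

def describe_figure_request(image_bytes, model="gpt-4o"):
    """Build a chat-completion request asking a vision model to describe
    a figure for indexing. Payload shape follows the OpenAI chat API;
    model name and prompt are placeholders."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this figure for search indexing: "
                         "what it shows, axes, labels, and key takeaways."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
```

The returned description is then embedded like any text chunk, with a pointer back to the original image for display at answer time.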
When a figure is retrieved, also pull the paragraphs immediately before and after it. Figures rarely stand alone; the surrounding text usually explains what the figure shows.
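Assuming chunks are stored in reading order with their positions, pulling the neighbors is a window lookup; the one-chunk window here is an arbitrary choice:

```python
def figure_with_context(chunks, figure_idx, window=1):
    """Given document chunks in reading order, return a figure chunk
    merged with its neighboring paragraphs for the generation prompt."""
    lo = max(0, figure_idx - window)
    hi = min(len(chunks), figure_idx + window + 1)
    return "\n\n".join(chunks[lo:hi])
```

This requires storing each chunk's position in the original document, which is cheap to capture at parse time and painful to reconstruct later.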
Use a multimodal embedding model (CLIP, Jina Multimodal, Voyage-Multimodal) to embed both images and text into a shared vector space. Retrieval can return images when queries match visual content. Emerging pattern. Works well for product catalogs, image-heavy documentation, medical imaging.
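Once a multimodal model has embedded both text queries and images into the same space, retrieval is ordinary nearest-neighbor search. A toy sketch of that step; the vectors here are two-dimensional stand-ins, not real CLIP output:

```python
import numpy as np

def top_match(query_vec, item_vecs, item_ids):
    """Cosine-similarity nearest neighbor in a shared embedding space.
    In practice the vectors would come from a multimodal model's
    text and image encoders (e.g. CLIP)."""
    q = query_vec / np.linalg.norm(query_vec)
    m = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    return item_ids[int(np.argmax(m @ q))]
```

The point is that once everything lives in one space, the retriever does not care whether an item started life as text or pixels.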
For a document with mixed content, my typical pipeline classifies each parsed element (prose, table, figure) and routes it to the appropriate strategy above.
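A condensed sketch of such a routing pipeline; the element schema (`type`, `headers`, `rows`, `caption`, `context`) is an assumption about what your parser emits, and each branch stands in for one of the fuller strategies described earlier:

```python
# Illustrative mixed-content router. The element dict schema is an
# assumption about the parser's output, not any specific library.

def build_chunks(elements):
    """Walk parsed elements in reading order, applying a per-type
    strategy; returns text chunks ready to embed."""
    chunks = []
    for el in elements:
        if el["type"] == "table":
            # One Markdown chunk per table.
            header = "| " + " | ".join(el["headers"]) + " |"
            sep = "| " + " | ".join("---" for _ in el["headers"]) + " |"
            body = ["| " + " | ".join(r) + " |" for r in el["rows"]]
            chunks.append("\n".join([header, sep] + body))
        elif el["type"] == "figure":
            # Caption plus surrounding context.
            chunks.append(el.get("caption", "") + "\n" + el.get("context", ""))
        else:
            # Ordinary prose falls through to normal chunking.
            chunks.append(el["text"])
    return chunks
```

In a real pipeline each branch would call out to the fuller strategy (row chunking, vision description, SQL loading) chosen per document type.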
This is significantly more work than "dump text into a vector DB" but it's the difference between a RAG system that handles technical docs and one that handles "some docs."
Next: OCR for scanned documents.