Tables and figures are where most RAG systems silently lose information. Naive text extraction either strips them entirely or flattens them into unreadable streams of cell values. For technical, financial, scientific, or compliance documents, that can mean 30-50% of the actual information never reaches the index.
A table is a two-dimensional data structure; prose is one-dimensional. When a parser flattens a table into prose, it loses: the binding between each cell and its column header, the grouping of cells into rows, and the alignment that makes cross-row comparison possible.
Convert each row into a sentence: "In 2023, Revenue was $4.2M, COGS was $1.8M, Gross Margin was 57%." Preserves semantics in a format embeddings handle well. Loses compactness but gains retrievability.
Works for: small to medium tables where the row-level facts are the important thing.
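A minimal sketch of the serialization, assuming the parser already yields headers and rows as lists of strings; the `rows_to_sentences` helper and the "In {first column}, ..." template (which treats the first column as a time period) are illustrative, not a fixed convention:

```python
# Sketch of row-to-sentence serialization. Assumes the parser yields
# headers and rows as lists; the first column is treated as the row key.

def rows_to_sentences(headers, rows):
    """Serialize each table row into one retrievable sentence."""
    sentences = []
    for row in rows:
        pairs = [f"{h} was {v}" for h, v in zip(headers[1:], row[1:])]
        sentences.append(f"In {row[0]}, " + ", ".join(pairs) + ".")
    return sentences

print(rows_to_sentences(
    ["Year", "Revenue", "COGS", "Gross Margin"],
    [["2023", "$4.2M", "$1.8M", "57%"]],
))
# → ['In 2023, Revenue was $4.2M, COGS was $1.8M, Gross Margin was 57%.']
```

Each sentence then embeds like any other prose, so the row-level facts survive standard dense retrieval.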
Keep the table as a Markdown table. Preserves structure. Embeddings still handle it reasonably well for retrieval, and the LLM can reason over the original structure during generation.
Works for: tables with clear headers and moderate size (<50 rows).
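When keeping the table intact, the chunk is just the Markdown rendering. A sketch assuming the same headers-plus-rows input as above:

```python
# Render a parsed table back into Markdown, one chunk per table.
# The input format (headers + rows of strings) is an assumption.

def to_markdown_table(headers, rows):
    """Emit a GitHub-flavored Markdown table as a single string."""
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)
```

The resulting string goes into the index as one chunk, so the whole table is retrieved (and reasoned over) together.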
Each row becomes its own chunk, with the table's column headers prepended or added as metadata. This gives fine-grained retrieval, but you lose row-to-row comparisons within a single retrieved chunk.
Works for: large reference tables (product catalogs, drug databases, parts lists).
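A sketch of row-level chunking; the chunk dict shape and the `table_id`/`row` metadata fields are illustrative, not any particular vector store's schema:

```python
# One chunk per row, with headers repeated so each chunk is
# self-describing. Metadata fields are illustrative.

def table_to_row_chunks(table_id, headers, rows):
    chunks = []
    for i, row in enumerate(rows):
        text = "; ".join(f"{h}: {v}" for h, v in zip(headers, row))
        chunks.append({
            "text": text,
            "metadata": {"table_id": table_id, "row": i, "headers": headers},
        })
    return chunks
```

Keeping `table_id` and `row` in metadata lets you re-fetch sibling rows at generation time when a comparison question does come up.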
Store the table in a database. Have the LLM emit a structured query when it needs table data. Not strictly RAG, but a common hybrid for systems that need to reason over tabular data.
Works for: large financial statements, time series, anything with a lot of numeric comparison.
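One way to sketch the hybrid, using an in-memory SQLite database; the `income` table is an example, and in a real system the SQL string would be emitted by the LLM rather than hard-coded:

```python
import sqlite3

def load_table(conn, name, headers, rows):
    """Load a parsed table into SQLite so it can be queried, not embedded."""
    cols = ", ".join(f'"{h}"' for h in headers)
    conn.execute(f'CREATE TABLE "{name}" ({cols})')
    placeholders = ", ".join("?" for _ in headers)
    conn.executemany(f'INSERT INTO "{name}" VALUES ({placeholders})', rows)

conn = sqlite3.connect(":memory:")
load_table(conn, "income", ["year", "revenue"], [(2022, 3.1), (2023, 4.2)])

# At query time, the LLM emits SQL like this instead of the application:
sql = "SELECT revenue FROM income WHERE year = 2023"
print(conn.execute(sql).fetchone()[0])  # → 4.2
```

The trade-off: you gain exact numeric filtering and aggregation, but you now have to validate model-generated SQL before executing it.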
To choose among these strategies, look at the questions users actually ask.
Extract and embed just the figure caption. Fast, cheap, catches most user questions that reference figures by number or topic.
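A caption extractor can be as simple as a regex over the page text; the pattern below assumes captions follow conventions like "Figure 3: ..." or "Fig. 3. ..." on their own line, which is an assumption about your corpus:

```python
import re

# Assumed caption convention: "Figure 3: ..." or "Fig. 3. ..." on its own line.
CAPTION_RE = re.compile(r"^(Fig(?:ure)?\.?\s*\d+[.:]\s*.+)$", re.MULTILINE)

def extract_captions(page_text):
    """Pull figure captions out of page text; embed these strings
    instead of (or alongside) the figure image itself."""
    return CAPTION_RE.findall(page_text)
```

Each extracted caption is embedded as its own chunk, tagged with the figure it belongs to.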
Pass the figure image through a vision model to generate a text description. Embed the description. Expensive at scale but captures visual content. Use for technical diagrams, charts, and anything where the image carries real information.
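A sketch of the request side, shaped like the OpenAI chat-completions vision payload; the model name and prompt text are placeholders, and the same idea adapts to whatever vision model you actually use:

```python
import base64

def describe_figure_request(image_bytes, model="gpt-4o"):
    """Build a chat-completion request asking a vision model to describe
    a figure for indexing. Payload shape follows the OpenAI chat API;
    model name and prompt are placeholders."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this figure for search indexing: "
                         "what it shows, axes, labels, and key takeaways."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }
```

The returned description is then embedded like any text chunk, with a pointer back to the original image for display at answer time.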
When a figure is retrieved, also pull the paragraphs immediately before and after it. Figures rarely stand alone; the surrounding text usually explains what the figure shows.
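Assuming chunks are stored in reading order with their positions, pulling the neighbors is a window lookup; the one-chunk window here is an arbitrary choice:

```python
def figure_with_context(chunks, figure_idx, window=1):
    """Given document chunks in reading order, return a figure chunk
    merged with its neighboring paragraphs for the generation prompt."""
    lo = max(0, figure_idx - window)
    hi = min(len(chunks), figure_idx + window + 1)
    return "\n\n".join(chunks[lo:hi])
```

This requires storing each chunk's position in the original document, which is cheap to capture at parse time and painful to reconstruct later.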
Use a multimodal embedding model (CLIP, Jina Multimodal, Voyage-Multimodal) to embed both images and text into a shared vector space. Retrieval can return images when queries match visual content. Emerging pattern. Works well for product catalogs, image-heavy documentation, medical imaging.
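Once a multimodal model has embedded both text queries and images into the same space, retrieval is ordinary nearest-neighbor search. A toy sketch of that step; the vectors here are two-dimensional stand-ins, not real CLIP output:

```python
import numpy as np

def top_match(query_vec, item_vecs, item_ids):
    """Cosine-similarity nearest neighbor in a shared embedding space.
    In practice the vectors would come from a multimodal model's
    text and image encoders (e.g. CLIP)."""
    q = query_vec / np.linalg.norm(query_vec)
    m = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    return item_ids[int(np.argmax(m @ q))]
```

The point is that once everything lives in one space, the retriever does not care whether an item started life as text or pixels.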
For a document with mixed content, my typical pipeline classifies each parsed element (prose, table, figure) and routes it to the appropriate strategy above.
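A condensed sketch of such a routing pipeline; the element schema (`type`, `headers`, `rows`, `caption`, `context`) is an assumption about what your parser emits, and each branch stands in for one of the fuller strategies described earlier:

```python
# Illustrative mixed-content router. The element dict schema is an
# assumption about the parser's output, not any specific library.

def build_chunks(elements):
    """Walk parsed elements in reading order, applying a per-type
    strategy; returns text chunks ready to embed."""
    chunks = []
    for el in elements:
        if el["type"] == "table":
            # One Markdown chunk per table.
            header = "| " + " | ".join(el["headers"]) + " |"
            sep = "| " + " | ".join("---" for _ in el["headers"]) + " |"
            body = ["| " + " | ".join(r) + " |" for r in el["rows"]]
            chunks.append("\n".join([header, sep] + body))
        elif el["type"] == "figure":
            # Caption plus surrounding context.
            chunks.append(el.get("caption", "") + "\n" + el.get("context", ""))
        else:
            # Ordinary prose falls through to normal chunking.
            chunks.append(el["text"])
    return chunks
```

In a real pipeline each branch would call out to the fuller strategy (row chunking, vision description, SQL loading) chosen per document type.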
This is significantly more work than "dump text into a vector DB" but it's the difference between a RAG system that handles technical docs and one that handles "some docs."
Next: OCR for scanned documents.