Tables and figures

Tables and figures are where most RAG systems silently lose information. Naive text extraction strips them entirely or flattens them into unreadable streams of cell values. For technical, financial, scientific, or compliance documents, this means 30-50% of the actual information never reaches the index.

Why tables are hard

A table is a two-dimensional data structure. Prose is one-dimensional. When a parser flattens a table into prose, it loses:

  - the association between each cell and its column header
  - row boundaries, so values from adjacent rows run together
  - merged cells, units declared once in a header row, and footnotes
  - the alignment that makes row-to-row and column-to-column comparison possible

Four strategies for tables

1. Serialize as readable prose

Convert each row into a sentence: "In 2023, Revenue was $4.2M, COGS was $1.8M, Gross Margin was 57%." Preserves semantics in a format embeddings handle well. Loses compactness but gains retrievability.

Works for: small to medium tables where the row-level facts are the important thing.
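A minimal sketch of this serialization, assuming the parser gives you headers and rows as lists of strings (the "In {subject}" connective is domain-specific; adjust it for non-temporal subject columns):

```python
def rows_to_sentences(headers, rows, subject_col=0):
    """Serialize each table row as a readable sentence.

    The subject column anchors the sentence; the remaining columns
    become "header was value" clauses.
    """
    sentences = []
    for row in rows:
        subject = row[subject_col]
        clauses = [
            f"{h} was {v}"
            for i, (h, v) in enumerate(zip(headers, row))
            if i != subject_col
        ]
        sentences.append(f"In {subject}, " + ", ".join(clauses) + ".")
    return sentences

headers = ["Year", "Revenue", "COGS", "Gross Margin"]
rows = [["2023", "$4.2M", "$1.8M", "57%"]]
print(rows_to_sentences(headers, rows)[0])
# prints: In 2023, Revenue was $4.2M, COGS was $1.8M, Gross Margin was 57%.
```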

2. Serialize as Markdown

Keep the table as Markdown syntax. Preserves structure. Embeddings still handle it reasonably well for retrieval, and the LLM can reason over the original structure during generation.

Works for: tables with clear headers and moderate size (<50 rows).
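If the parser hands you headers and rows but not Markdown, re-rendering is a few lines (a sketch; real tables may need cell escaping for `|` characters):

```python
def table_to_markdown(headers, rows):
    """Render a parsed table as a Markdown table string."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    lines += ["| " + " | ".join(str(c) for c in row) + " |" for row in rows]
    return "\n".join(lines)
```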

3. Chunk per row

Each row becomes its own chunk, with the table's column headers prepended or added as metadata. Fine-grained retrieval but you lose the row-to-row comparisons.

Works for: large reference tables (product catalogs, drug databases, parts lists).
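A sketch of per-row chunking, repeating the headers inside each chunk's text so every row stays interpretable on its own, with table and row identifiers carried as metadata:

```python
def table_to_row_chunks(headers, rows, table_id):
    """One chunk per row; headers are inlined into the chunk text,
    and table/row ids go into metadata for later reassembly."""
    chunks = []
    for i, row in enumerate(rows):
        text = "; ".join(f"{h}: {v}" for h, v in zip(headers, row))
        chunks.append({
            "text": text,
            "metadata": {"table_id": table_id, "row": i, "headers": headers},
        })
    return chunks
```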

4. Keep the table as structured data, retrieve by key

Store the table in a database. Have the LLM emit a structured query when it needs table data. Not strictly RAG, but a common hybrid for systems that need to reason over tabular data.

Works for: large financial statements, time series, anything with a lot of numeric comparison.
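A minimal sketch of the database half of this hybrid, using SQLite. In the real system the SQL would be emitted by the LLM in response to a user question; here it is hard-coded to show the shape of the round trip:

```python
import sqlite3

def load_table(conn, name, headers, rows):
    """Create a table from parsed headers/rows. All columns TEXT for
    simplicity; a production version would infer numeric types."""
    cols = ", ".join(f'"{h}" TEXT' for h in headers)
    conn.execute(f'CREATE TABLE "{name}" ({cols})')
    placeholders = ", ".join("?" for _ in headers)
    conn.executemany(f'INSERT INTO "{name}" VALUES ({placeholders})', rows)

conn = sqlite3.connect(":memory:")
load_table(conn, "financials", ["year", "revenue"], [("2022", "3.1"), ("2023", "4.2")])

# This query string is a stand-in for LLM-generated SQL.
result = conn.execute("SELECT revenue FROM financials WHERE year = '2023'").fetchone()
print(result[0])  # prints: 4.2
```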

Picking a strategy

Look at what questions users actually ask:

  - Single-fact lookups ("what was 2023 revenue?") favor prose serialization or per-row chunks.
  - Within-table comparisons ("which quarter grew fastest?") favor Markdown serialization, so the LLM sees the whole structure at generation time.
  - Aggregations and filters over many rows favor keeping the table as structured data and querying it.

Table extraction tools

Getting the table out of the source file is its own problem. For digitally-born PDFs, tools like Camelot, Tabula, and pdfplumber work well; commercial services such as AWS Textract and Azure Document Intelligence handle messier layouts and scans. Whichever you use, spot-check the output against the source, because extraction errors in tables are silent.

Figures and images

The captions-only baseline

Extract and embed just the figure caption. Fast, cheap, catches most user questions that reference figures by number or topic.
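Caption extraction from parsed text can start as a regex pass. The pattern below is a heuristic for common caption forms ("Figure 3: …", "Fig. 2 - …") and will miss unusual formats:

```python
import re

# Matches "Figure 3: Throughput" or "Fig. 2 - Architecture".
CAPTION_RE = re.compile(r"^(Fig(?:ure)?\.?\s*\d+)\s*[:.\-–]\s*(.+)$", re.IGNORECASE)

def extract_captions(lines):
    """Return (figure_label, caption_text) pairs found in a page's lines."""
    out = []
    for line in lines:
        m = CAPTION_RE.match(line.strip())
        if m:
            out.append((m.group(1), m.group(2)))
    return out
```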

The description-generated approach

Pass the figure image through a vision model to generate a text description. Embed the description. Expensive at scale but captures visual content. Use for technical diagrams, charts, and anything where the image carries real information.

The retrieve-adjacent-text approach

When a figure is retrieved, also pull the paragraphs immediately before and after it. Figures rarely stand alone; the surrounding text usually explains what the figure shows.
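A sketch of the context expansion, assuming the document has been parsed into an ordered list of typed elements (dicts with a `"type"` key):

```python
def expand_figure_context(elements, figure_index, window=1):
    """Return the figure element plus up to `window` text elements
    immediately before and after it in document order."""
    picked = [elements[figure_index]]
    # Walk backwards for preceding text elements.
    i, taken = figure_index - 1, 0
    while i >= 0 and taken < window:
        if elements[i]["type"] == "text":
            picked.insert(0, elements[i])
            taken += 1
        i -= 1
    # Walk forwards for following text elements.
    i, taken = figure_index + 1, 0
    while i < len(elements) and taken < window:
        if elements[i]["type"] == "text":
            picked.append(elements[i])
            taken += 1
        i += 1
    return picked
```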

The multimodal retrieval approach

Use a multimodal embedding model (CLIP, Jina Multimodal, Voyage-Multimodal) to embed both images and text into a shared vector space. Retrieval can return images when queries match visual content. Emerging pattern. Works well for product catalogs, image-heavy documentation, medical imaging.

Putting it together

For a document with mixed content, my typical pipeline:

  1. Parse into typed elements (text, heading, list, table, figure)
  2. Chunk text elements with structure-aware chunking
  3. For each table: generate both a Markdown serialization and a prose description. Embed both.
  4. For each figure: extract caption. If the figure matters for the use case, generate a vision-model description. Embed caption and description.
  5. Keep a document-level link between elements so retrieved chunks can pull their table/figure context.
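The dispatch at the heart of that pipeline can be sketched as follows. `embed` and `describe` are stand-ins for your embedding and vision models (hypothetical names, not real APIs), and the element dicts are assumed to carry the fields the earlier steps produced:

```python
def index_document(elements, embed, describe=None):
    """Route each parsed element to its strategy and collect records
    ready for the vector store. `element` metadata links chunks back
    to their position in the document."""
    records = []
    for i, el in enumerate(elements):
        kind = el["type"]
        if kind in ("text", "heading", "list"):
            texts = [(el["text"], kind)]
        elif kind == "table":
            # Both serializations of the same table get embedded.
            texts = [(el["markdown"], "table-md"), (el["prose"], "table-prose")]
        elif kind == "figure":
            texts = [(el["caption"], "figure-caption")]
            if describe is not None:  # optional vision-model description
                texts.append((describe(el), "figure-desc"))
        else:
            continue
        for text, label in texts:
            records.append({
                "text": text,
                "vector": embed(text),
                "metadata": {"element": i, "kind": label},
            })
    return records
```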

This is significantly more work than "dump text into a vector DB" but it's the difference between a RAG system that handles technical docs and one that handles "some docs."

Next: OCR for scanned documents.