Structure-aware chunking

Structure-aware chunking uses the document's own organization to set chunk boundaries: headings, sections, lists, and tables. For any corpus with real structure (technical docs, wikis, legal documents, Markdown, HTML), it is usually the highest-performing chunking strategy. It is also the most underused.

The core idea

Authors have already decided where ideas begin and end; that is what headings and paragraphs are for. Respect their decisions.

Chunk boundaries come from the document's structural elements: headings, sections, lists, and tables.

The algorithm

  1. Parse document into typed elements with hierarchy (heading levels, sections, lists, etc.)
  2. Walk the element tree, grouping adjacent elements under the same heading into candidate chunks
  3. If a candidate chunk exceeds target size, split it by sub-element (or by recursive chunking within)
  4. If a candidate chunk is too small, merge it with the next one (but only within the same section)
  5. Never break across top-level sections or merge across major heading transitions
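The five steps above can be sketched in a few dozen lines. This is a minimal illustration, not a reference implementation: the `Element` and `Chunk` types, the `chunk_elements` function, and the character-based `max_chars` / `min_chars` thresholds are all assumptions made for the example (a real pipeline would more likely count tokens and take its elements from a parser).

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    kind: str       # "heading", "paragraph", "list", "table", "code", ...
    text: str
    level: int = 0  # heading level (1 = top-level); 0 for non-headings


@dataclass
class Chunk:
    heading_path: tuple
    parts: list = field(default_factory=list)

    @property
    def size(self):
        return sum(len(p) for p in self.parts)


def shared_prefix(a, b):
    """Longest common leading run of two heading paths."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return tuple(out)


def chunk_elements(elements, max_chars=1200, min_chars=200):
    # Steps 1-2: walk the (already parsed) elements and group content
    # under the heading path that is active when it appears.
    groups, path = [], []
    for el in elements:
        if el.kind == "heading":
            path = path[: el.level - 1] + [el.text]
            groups.append(Chunk(tuple(path)))
        else:
            if not groups:
                groups.append(Chunk(()))  # content before the first heading
            groups[-1].parts.append(el.text)

    # Step 3: split oversized groups at element boundaries, never inside
    # an element. A single element longer than max_chars stays whole here;
    # recursive splitting within it is left out of this sketch.
    sized = []
    for g in groups:
        if not g.parts:
            continue  # a heading with no direct content is never a chunk
        current = Chunk(g.heading_path)
        for part in g.parts:
            if current.parts and current.size + len(part) > max_chars:
                sized.append(current)
                current = Chunk(g.heading_path)
            current.parts.append(part)
        sized.append(current)

    # Steps 4-5: merge an undersized chunk with the next one, but only
    # when both sit under the same top-level heading.
    merged = []
    for c in sized:
        prev = merged[-1] if merged else None
        same_top = prev is not None and prev.heading_path[:1] == c.heading_path[:1]
        if same_top and prev.size < min_chars and prev.size + c.size <= max_chars:
            prev.parts.extend(c.parts)
            prev.heading_path = shared_prefix(prev.heading_path, c.heading_path)
        else:
            merged.append(c)
    return merged
```

One design choice worth noting: when two chunks from different subsections are merged, the sketch keeps only the heading path they share, so the merged chunk never claims a subsection that covers just part of its content.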

The heading context trick

Every chunk carries its heading path as metadata or prepended text. For a chunk under "Setup > Authentication > OAuth > Configuration", the chunk text might be prepended with:

# Setup > Authentication > OAuth > Configuration

[actual chunk content...]

The LLM and embedding model both benefit from knowing where a chunk sits in the document hierarchy. Without this, a chunk about "rate limits" could be about API rate limits, OAuth token rate limits, or webhook rate limits, all of which might exist in the same document.
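As a small illustration of the trick, the helper below builds both forms at once: the prepended text and the metadata. The function name `build_chunk_record` and the record layout are assumptions made for this sketch, not a fixed schema.

```python
def build_chunk_record(heading_path, content, separator=" > "):
    """Return a chunk whose text starts with its heading path.

    heading_path is a list of heading titles from the document root down
    to the section that contains the content.
    """
    breadcrumb = separator.join(heading_path)
    # Content that sits above the first heading has no path to prepend.
    text = f"# {breadcrumb}\n\n{content}" if breadcrumb else content
    return {
        "text": text,  # what gets embedded and shown to the LLM
        "metadata": {
            "heading_path": list(heading_path),  # usable for filtering
            "depth": len(heading_path),
        },
    }


record = build_chunk_record(
    ["Setup", "Authentication", "OAuth", "Configuration"],
    "Set the redirect URI before requesting a token.",
)
print(record["text"])
```

Keeping the path in metadata as well as in the text lets retrieval filter by section without re-parsing the chunk.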

Element-type-specific rules

Headings

Never a chunk by itself. Attach to the following content.

Tables

Never split. Serialize as a single chunk with surrounding context. See tables and figures.

Code blocks

Never split inside a code block. A code block often becomes a chunk of its own, paired with a separate chunk that holds the surrounding context.

Lists

Keep lists together when possible. If a list is huge (e.g., a reference list), chunk by list sections or by logical groups.

Images/figures

Attach to nearest text chunk. Store figure metadata in chunk metadata. See tables and figures.
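One way to encode these rules is a pre-pass that runs before any size-based packing: it attaches headings to the content that follows, tags each element with a split policy, and hangs figures on a neighboring text chunk. The sketch below assumes elements arrive as hypothetical `(kind, text)` pairs; the `SPLIT_POLICY` table and the function name are illustrative only.

```python
# How a later size-based splitter may treat each element type.
SPLIT_POLICY = {
    "table": "never",   # serialize whole
    "code": "never",    # never split inside a code block
    "list": "avoid",    # keep together unless the list is huge
}


def apply_element_rules(elements):
    """elements: list of (kind, text) pairs in document order."""
    chunks = []
    pending_headings = []  # headings waiting for content to attach to
    pending_figures = []   # figures seen before any text chunk exists
    for kind, text in elements:
        if kind == "heading":
            pending_headings.append(text)  # never emitted on its own
            continue
        if kind == "figure":
            # Attach to the preceding chunk, or to the next one if the
            # figure appears before any text.
            if chunks:
                chunks[-1]["metadata"]["figures"].append(text)
            else:
                pending_figures.append(text)
            continue
        body = "\n\n".join(pending_headings + [text])
        chunks.append({
            "text": body,
            "kind": kind,
            "split": SPLIT_POLICY.get(kind, "allowed"),
            "metadata": {"figures": pending_figures},
        })
        pending_headings = []
        pending_figures = []
    # Headings left over at the end have no content and are dropped.
    return chunks
```

In this sketch "nearest text chunk" is simplified to "the preceding chunk"; a layout-aware parser could choose by caption or position instead.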

The Markdown case

Markdown is the easiest case. Headings are unambiguous, paragraphs are clear, code blocks are explicit. A good Markdown chunker:

  1. Parses into an AST
  2. Walks headings, creating sections
  3. Produces one chunk per section, with heading hierarchy as metadata
  4. Splits large sections on subheadings or natural paragraph boundaries
  5. Preserves code blocks, tables, and lists as atomic units
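To make the steps concrete, here is a deliberately simplified sketch that uses only the standard library. It swaps the real AST for a line scanner, so it assumes ATX-style headings (`#` prefixes) and fenced code blocks, and it covers steps 1 to 3 plus the code-block part of step 5; size-based splitting is left out. The name `chunk_markdown` is an assumption, and a production chunker would more likely sit on top of a proper Markdown parser.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCE = re.compile(r"^\s{0,3}(`{3,}|~{3,})")  # opening or closing code fence


def chunk_markdown(text):
    """Return one section per heading, each with its heading path."""
    sections = []
    path = []     # current heading hierarchy, e.g. ["Setup", "OAuth"]
    body = []     # lines collected for the current section
    fence = None  # fence character while inside a code block

    def flush():
        content = "\n".join(body).strip()
        if content:  # a heading with no content is never a chunk
            sections.append({"heading_path": list(path), "text": content})
        body.clear()

    for line in text.splitlines():
        f = FENCE.match(line)
        if f:
            marker = f.group(1)[0]
            if fence is None:
                fence = marker    # entering a code block
            elif marker == fence:
                fence = None      # leaving it
            body.append(line)
            continue
        # Inside a fence, '#' lines are code comments, not headings.
        h = HEADING.match(line) if fence is None else None
        if h:
            flush()
            level = len(h.group(1))
            del path[level - 1:]  # pop deeper or equal levels
            path.append(h.group(2))
        else:
            body.append(line)
    flush()
    return sections
```

The fence tracking is what keeps a shell comment such as `# install deps` inside a code block from being mistaken for a top-level heading.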

The HTML case

HTML is Markdown's messy cousin. Parse it into a DOM, identify the content containers, and extract hierarchical sections from the heading tags. Then convert the result to Markdown and chunk it as Markdown.
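The conversion step might look like the sketch below, which uses Python's standard `html.parser` module. It is a toy converter under stated assumptions: it handles only headings, paragraphs, and list items, and the set of tags treated as non-content (`SKIP`) is a guess that will need tuning per site. The class and function names are made up for the example.

```python
from html.parser import HTMLParser


class HtmlToMarkdown(HTMLParser):
    # Containers assumed to hold navigation or scripts, not content.
    SKIP = {"script", "style", "nav", "footer", "aside"}
    # Markdown prefix emitted for each supported block tag.
    PREFIX = {
        "h1": "# ", "h2": "## ", "h3": "### ",
        "h4": "#### ", "h5": "##### ", "h6": "###### ",
        "p": "", "li": "- ",
    }

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.PREFIX:
            self._emit("")  # flush stray text before a new block

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.PREFIX:
            self._emit(self.PREFIX[tag])

    def handle_data(self, data):
        if not self._skip_depth:
            self._buf.append(data)

    def _emit(self, prefix):
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        self._buf = []
        if text:
            self.blocks.append(prefix + text)


def html_to_markdown(html):
    parser = HtmlToMarkdown()
    parser.feed(html)
    parser.close()
    parser._emit("")  # flush any trailing text
    return "\n\n".join(parser.blocks)
```

The output is plain Markdown text, so whatever Markdown chunker the pipeline already uses can take over from there.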

The PDF case

PDFs don't have guaranteed structure, but good parsers (Docling, Unstructured, LlamaParse) return typed elements with heading levels inferred from font size, style, and position. The quality of structure-aware chunking depends heavily on parser quality; see parsing PDFs.
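To give a feel for what "inferred from font size" means, here is a toy heuristic. It is not how any of the parsers named above work internally; it only assumes that some extraction step has produced hypothetical `(text, font_size)` pairs in reading order, and it ignores style and position entirely.

```python
from collections import Counter


def infer_heading_levels(lines):
    """lines: list of (text, font_size) pairs in reading order.

    Returns (text, level) pairs, where level 0 means body text and
    level 1 is the largest heading size found.
    """
    if not lines:
        return []
    # Treat the font size that covers the most characters as body text.
    weight = Counter()
    for text, size in lines:
        weight[size] += len(text)
    body_size = weight.most_common(1)[0][0]
    # Every distinct size above body size becomes one heading level.
    heading_sizes = sorted({s for _, s in lines if s > body_size}, reverse=True)
    level_of = {size: i + 1 for i, size in enumerate(heading_sizes)}
    return [(text, level_of.get(size, 0)) for text, size in lines]
```

A heuristic this simple breaks on documents that mark headings with bold weight or numbering rather than size, which is one reason parser quality matters so much here.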

Why it outperforms fixed-size

Implementation

When it's overkill

For everything else (documentation, knowledge bases, reports, books, technical manuals), structure-aware chunking is worth the extra engineering.

Next: Chunking code.